### A Pluto.jl notebook ###
# v0.11.12

using Markdown
using InteractiveUtils

# ╔═╡ ae3cbd80-ed34-11ea-28d2-e9efc3815959
begin
    using DataStructures
    using Gadfly
    using WordTokenizers
    using StatsBase
    using Test
    Gadfly.push_theme(Theme(bar_spacing=1mm))
    Gadfly.set_default_plot_size(7inch, 4inch)
end

# ╔═╡ d4015470-d60c-11ea-3a2c-09e7cff8af7c
md"""
# Language Modeling

All of you have seen a language model at work, and by knowing a language, you have already developed your own language model. You have probably seen an LM at work in predictive text:

- a search engine predicts what you will type next
- your phone predicts the next word for you
- Gmail can auto-complete your entire sentence

Language models also help filter the output of systems for tasks like

- speech recognition

  You speak a phrase into your phone, which has to convert it into text. How does it know whether you said "recognize speech" or "wreck a nice beach"? (Say them really fast; they sound quite similar.)

- machine translation

  You are translating the Chinese sentence "我在开车" into English. Your translation system gives you several choices:

  - I at open car
  - me at open car
  - I at drive
  - me at drive
  - I am driving
  - me am driving

  A language model can tell you which translation sounds the most natural.
"""

# ╔═╡ 9a1fd080-ed34-11ea-2605-3da00683a939
md"""
We're going to implement a simple n-gram language model. First, let's load some packages and set a few things up.
"""

# ╔═╡ d63dfb50-ed34-11ea-3ec7-39e83423e26f
md"""
We're going to use *Pride and Prejudice* as our text. Download it from [this link](http://www.gutenberg.org/files/1342/1342-0.txt) and save it as `p&p.txt` in the same folder as this notebook. Then we can load it as follows:
"""

# ╔═╡ f337bca0-ed34-11ea-1aca-f5ee00b34c4b
function load_text(path)
    text = read(path, String)
    # remove some whitespace characters
    text = replace(text, "\ufeff" => "")
    text = replace(text, "\n" => " ")
    text = replace(text, "\r" => " ")
    return text
end

# ╔═╡ ff1749a0-ed34-11ea-21b9-cfdb9759c91b
text = load_text("p&p.txt");  # TODO: point this at a different file to model another text

# ╔═╡ 3eebdad0-ed37-11ea-1910-67d2006fb5ee
md"""
## Data Analysis

In this section, we're going to look at our data. It's always good to see what you're working with.
"""

# ╔═╡ 43857210-ed35-11ea-2bd4-35948fdea98c
md"""
Let's first take a look at the *tokens*. We will use the `WordTokenizers.tokenize()` function to split the text into tokens.
"""

# ╔═╡ 4b9691f0-ed35-11ea-3d57-a9eeb68e6ac5
tokens = tokenize(text)

# ╔═╡ a8a6ebf0-ed36-11ea-163a-e3701653ce97
md"**Question**: So what exactly are tokens?"

# ╔═╡ 9fd92840-ed35-11ea-24fc-71a1319b75d4
md"""
**Question**: Why don't we just split on spaces?
"""

# ╔═╡ ae6d9c10-ed35-11ea-2488-6387dd9542da
split_on_spaces = split(text, r"\s+")

# ╔═╡ ed1271c0-ed35-11ea-192d-d9971d77b7c5
md"""
Let's examine our data a bit.
"""

# ╔═╡ 5818ab10-ed36-11ea-3cea-219e51e2b13a
md"**Question**: How many tokens do we have?"

# ╔═╡ 6af96400-ed35-11ea-3f79-fda8fde36254
length(tokens)

# ╔═╡ 85037f70-ed35-11ea-007e-ad65fa9d6bc6
vocabulary = unique(tokens)

# ╔═╡ 701586d0-ed35-11ea-3a54-392347e3c913
length(vocabulary)

# ╔═╡ 9cccd2a0-ed35-11ea-2260-276ae7c091cd
md"""
This is the *type-token distinction*: there are $(length(vocabulary)) types and $(length(tokens)) tokens.
"""

# ╔═╡ 6e814052-ed37-11ea-31e8-597447705835
md"""
**Question**: What are the most common tokens?
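
To answer this, we need a count for every distinct token, i.e. a *counter*: a map from each item to the number of times it occurs. As a hypothetical toy example (not one of the notebook's cells), a counter built from a tiny list would look like this:

    counter(["the", "cat", "the"])   # "the" => 2, "cat" => 1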
We can use `DataStructures.counter()` to count them:
"""

# ╔═╡ b130c8d0-ed37-11ea-1529-b9bfe465a95c
token_counts = counter(tokens)

# ╔═╡ 7938c80e-ed37-11ea-0054-137e29962edd
begin
    counts = sort(collect(token_counts), by=x -> -x[2])[1:20]
    plot(
        y=reverse([c[1] for c in counts]),
        x=reverse([c[2] for c in counts]),
        Geom.bar(orientation=:horizontal),
        Guide.xlabel("Count"),
        Guide.ylabel("Token"),
        Guide.xticks(ticks=[0:1000:10000;])
    )
end

# ╔═╡ 3cbd1e7e-ed38-11ea-0f0f-75de6ad7cfd9
md"""
This is not as important, but for curiosity's sake, let's examine sentences.
"""

# ╔═╡ 89968a30-ed37-11ea-0be5-47d2e8cf1277
sentences = split_sentences(text)

# ╔═╡ 6cc62d60-ed38-11ea-32cf-6f9a2b4fcdfb
length(sentences)

# ╔═╡ 8e17fe30-ed38-11ea-2570-a5793d2895ab
tokenized_sentences = tokenize.(sentences)

# ╔═╡ 98a6c83e-ed38-11ea-3ae7-d5cc19cf6b5b
md"""
**Question**: Whoa, what does that `.` do?

This is called vectorization, or *broadcasting*: it applies the function to every element of the array. In Python, you might write it as a list comprehension:

    tokenized_sentences = [tokenize(s) for s in sentences]

That comprehension also works in Julia, but I find the vectorized version easier to read.
"""

# ╔═╡ 4c731270-ed39-11ea-3104-b12c14a55746
md"**Question**: What do the sentences look like?"

# ╔═╡ 6e7d36c2-ed39-11ea-243e-c17047719b01
sentence_lengths = counter(length.(tokenized_sentences))

# ╔═╡ 6a732030-ed39-11ea-3019-899c1a2cf33e
plot(
    x=1:100,
    y=[sentence_lengths[x] for x in 1:100],
    Geom.line,
    Guide.xlabel("Sentence Length (tokens)"),
    Guide.ylabel("Count"),
    Guide.xticks(ticks=[0:10:100;])
)

# ╔═╡ d8b3eb0e-ed39-11ea-39da-3f6867968a34
md"""
## Language Modeling

Now let's actually implement our language model. First, let's write an n-gram function.
"""

# ╔═╡ 976a1720-d60e-11ea-397a-9d1b790513a4
function ngrams(seq, n)
    # TODO: return all contiguous length-n slices of `seq`
    return []
end

# ╔═╡ 27c01fd0-ed3a-11ea-3c2e-99f76f83fa9b
ngrams("language", 3)

# ╔═╡ 11b0bc30-ed3b-11ea-11e0-75106f28fce0
md"Make sure the following test passes:"

# ╔═╡ 9b70360e-d60e-11ea-2330-5766808f5a96
@test ngrams("language", 3) == ["lan", "ang", "ngu", "gua", "uag", "age"]

# ╔═╡ e263dbf0-ed3b-11ea-04ba-817ee56e1257
md"""
Our model is very simple: store the counts of all the n-grams in the text.
"""

# ╔═╡ 81550420-ed3a-11ea-015a-01dc98639e67
model = counter(ngrams(tokens, 3))  # TODO: try other values of n

# ╔═╡ 1c790cd0-ed3b-11ea-2fba-d1b9d59d4e91
md"Now let's see our language model in action by generating some sentences. We will give it a *prefix* to start off with. It will pick the most likely word that follows the prefix and add it to the sentence. Then the last `n-1` words become the prefix, and we repeat."

# ╔═╡ 5d066e0e-ed3a-11ea-328d-891de8a095b8
prefix = ["This", "is"]

# ╔═╡ 44742590-ed3a-11ea-1f2c-b58357633b70
function generate(model, prefix)
    result = []
    for word in prefix
        push!(result, word)
    end
    # how many times an n-gram occurs in the model (0 if unseen)
    get_count(gram) = get(model, gram, 0)
    for i in 1:100
        # pick the word that forms the most frequent n-gram with the current prefix
        best_i = argmax([get_count([prefix; word]) for word in vocabulary])
        best_next = vocabulary[best_i]
        push!(result, best_next)
        prefix = [prefix[2:end]; best_next]
    end
    return result
end

# ╔═╡ 7d3ce3d0-ed3a-11ea-2dd0-7dfb25d7c64b
join(generate(model, prefix), ' ')

# ╔═╡ ccfe10b0-ed3a-11ea-09d7-43deb2afabbe
md"""
**Question**: Given a specific input, `generate()` will always return the same output (i.e. it is *deterministic*), and it doesn't generate realistic sentences. How can we easily improve this?
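
One easy improvement, sketched in the next cell, is to *sample* the next word with probability proportional to its count instead of always taking the argmax. `StatsBase` provides weighted sampling; as a hypothetical toy example (not one of the notebook's cells), the following draws `"b"` about twice as often as `"a"`:

    sample(["a", "b"], Weights([1, 2]))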
""" # ╔═╡ 5821d600-ed3a-11ea-08e5-a324ade7fe35 function generate_random(model, prefix) result = [] for word in prefix push!(result, word) end get_count(word) = get(model, [prefix; word], 0) for i in 1:100 weights = Weights([get_count.(word) for word in vocabulary]) best_next = sample(vocabulary, weights) push!(result, best_next) prefix = [prefix[2:end]; best_next] end return result end # ╔═╡ 9954a990-ed3a-11ea-05a0-c10594775ccc join(generate_random(model, prefix), ' ') # ╔═╡ 98b71490-ed3b-11ea-12f1-5bd2adb81c7c md""" ## Homework Look for #TODO comments in the code. 1. Try this out on some other text! Something by Shakespeare might be fun. 2. Try varying the $n$ in ngrams. How do your results differ with bigrams or 4-grams? """ # ╔═╡ Cell order: # ╟─d4015470-d60c-11ea-3a2c-09e7cff8af7c # ╟─9a1fd080-ed34-11ea-2605-3da00683a939 # ╠═ae3cbd80-ed34-11ea-28d2-e9efc3815959 # ╟─d63dfb50-ed34-11ea-3ec7-39e83423e26f # ╠═f337bca0-ed34-11ea-1aca-f5ee00b34c4b # ╠═ff1749a0-ed34-11ea-21b9-cfdb9759c91b # ╟─3eebdad0-ed37-11ea-1910-67d2006fb5ee # ╟─43857210-ed35-11ea-2bd4-35948fdea98c # ╠═4b9691f0-ed35-11ea-3d57-a9eeb68e6ac5 # ╟─a8a6ebf0-ed36-11ea-163a-e3701653ce97 # ╟─9fd92840-ed35-11ea-24fc-71a1319b75d4 # ╠═ae6d9c10-ed35-11ea-2488-6387dd9542da # ╟─ed1271c0-ed35-11ea-192d-d9971d77b7c5 # ╟─5818ab10-ed36-11ea-3cea-219e51e2b13a # ╠═6af96400-ed35-11ea-3f79-fda8fde36254 # ╠═85037f70-ed35-11ea-007e-ad65fa9d6bc6 # ╠═701586d0-ed35-11ea-3a54-392347e3c913 # ╟─9cccd2a0-ed35-11ea-2260-276ae7c091cd # ╟─6e814052-ed37-11ea-31e8-597447705835 # ╠═b130c8d0-ed37-11ea-1529-b9bfe465a95c # ╠═7938c80e-ed37-11ea-0054-137e29962edd # ╟─3cbd1e7e-ed38-11ea-0f0f-75de6ad7cfd9 # ╠═89968a30-ed37-11ea-0be5-47d2e8cf1277 # ╠═6cc62d60-ed38-11ea-32cf-6f9a2b4fcdfb # ╠═8e17fe30-ed38-11ea-2570-a5793d2895ab # ╟─98a6c83e-ed38-11ea-3ae7-d5cc19cf6b5b # ╟─4c731270-ed39-11ea-3104-b12c14a55746 # ╠═6e7d36c2-ed39-11ea-243e-c17047719b01 # ╠═6a732030-ed39-11ea-3019-899c1a2cf33e # ╟─d8b3eb0e-ed39-11ea-39da-3f6867968a34 # ╠═976a1720-d60e-11ea-397a-9d1b790513a4 # ╠═27c01fd0-ed3a-11ea-3c2e-99f76f83fa9b # ╟─11b0bc30-ed3b-11ea-11e0-75106f28fce0 # ╠═9b70360e-d60e-11ea-2330-5766808f5a96 # ╟─e263dbf0-ed3b-11ea-04ba-817ee56e1257 # ╠═81550420-ed3a-11ea-015a-01dc98639e67 # ╟─1c790cd0-ed3b-11ea-2fba-d1b9d59d4e91 # ╠═5d066e0e-ed3a-11ea-328d-891de8a095b8 # ╠═44742590-ed3a-11ea-1f2c-b58357633b70 # ╠═7d3ce3d0-ed3a-11ea-2dd0-7dfb25d7c64b # ╟─ccfe10b0-ed3a-11ea-09d7-43deb2afabbe # ╠═5821d600-ed3a-11ea-08e5-a324ade7fe35 # ╠═9954a990-ed3a-11ea-05a0-c10594775ccc # ╟─98b71490-ed3b-11ea-12f1-5bd2adb81c7c