### A Pluto.jl notebook ###
# v0.11.12

using Markdown
using InteractiveUtils

# ╔═╡ ae3cbd80-ed34-11ea-28d2-e9efc3815959
begin
    using DataStructures
    using Gadfly
    using WordTokenizers
    using StatsBase
    using Test
    Gadfly.push_theme(Theme(bar_spacing=1mm))
    Gadfly.set_default_plot_size(7inch, 4inch)
end

# ╔═╡ d4015470-d60c-11ea-3a2c-09e7cff8af7c
md"""
# Language Modeling

All of you have seen a language model at work, and by knowing a language, you have already developed your own language model. You have probably seen an LM at work in predictive text:

- a search engine predicts what you will type next
- your phone predicts the next word for you
- Gmail can auto-complete your entire sentence

Language models also help filter the output of systems for tasks like

- speech recognition

  You speak a phrase into your phone, which has to convert it into text. How does it know whether you said "recognize speech" or "wreck a nice beach"? (Say them really fast; they sound quite similar.)

- machine translation

  You are translating the Chinese sentence "我在开车" into English. Your translation system gives you several choices:

  - I at open car
  - me at open car
  - I at drive
  - me at drive
  - I am driving
  - me am driving

  A language model can tell you which translation sounds the most natural.
"""

# ╔═╡ 9a1fd080-ed34-11ea-2605-3da00683a939
md"""
We're going to implement a simple n-gram language model. First, let's load some packages and set a few things up.
"""

# ╔═╡ d63dfb50-ed34-11ea-3ec7-39e83423e26f
md"""
We're going to use *Pride and Prejudice* as our text. Download it from [this link](http://www.gutenberg.org/files/1342/1342-0.txt) and save it as `p&p.txt` in the same folder as this notebook. Then we can load it as follows:
"""

# ╔═╡ f337bca0-ed34-11ea-1aca-f5ee00b34c4b
function load_text(path)
    text = read(path, String)
    # remove some whitespace characters
    text = replace(text, "\ufeff" => "")
    text = replace(text, "\n" => " ")
    text = replace(text, "\r" => " ")
    return text
end

# ╔═╡ ff1749a0-ed34-11ea-21b9-cfdb9759c91b
text = load_text("p&p.txt");  # TODO: point this at a different file to model another text

# ╔═╡ 3eebdad0-ed37-11ea-1910-67d2006fb5ee
md"""
## Data Analysis

In this section, we're going to look at our data. It's always good to see what you're working with.
"""

# ╔═╡ 43857210-ed35-11ea-2bd4-35948fdea98c
md"""
Let's first take a look at the *tokens*. We will use the `WordTokenizers.tokenize()` function to split the text into tokens.
"""

# ╔═╡ 4b9691f0-ed35-11ea-3d57-a9eeb68e6ac5
tokens = tokenize(text)

# ╔═╡ a8a6ebf0-ed36-11ea-163a-e3701653ce97
md"**Question**: So what exactly are tokens?"

# ╔═╡ 9fd92840-ed35-11ea-24fc-71a1319b75d4
md"""
**Question**: Why don't we just split on spaces?
"""

# ╔═╡ ae6d9c10-ed35-11ea-2488-6387dd9542da
split_on_spaces = split(text, r"\s+")

# ╔═╡ ed1271c0-ed35-11ea-192d-d9971d77b7c5
md"""
Let's examine our data a bit.
"""

# ╔═╡ 5818ab10-ed36-11ea-3cea-219e51e2b13a
md"**Question**: How many tokens do we have?"

# ╔═╡ 6af96400-ed35-11ea-3f79-fda8fde36254
length(tokens)

# ╔═╡ 85037f70-ed35-11ea-007e-ad65fa9d6bc6
vocabulary = unique(tokens)

# ╔═╡ 701586d0-ed35-11ea-3a54-392347e3c913
length(vocabulary)

# ╔═╡ 9cccd2a0-ed35-11ea-2260-276ae7c091cd
md"""
This is the *type-token distinction*: there are $(length(vocabulary)) types and $(length(tokens)) tokens.
"""

# ╔═╡ 6e814052-ed37-11ea-31e8-597447705835
md"""
**Question**: What are the most common tokens?
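
To answer this, we need a count for every distinct token, i.e. a *counter*: a map from each item to the number of times it occurs. As a hypothetical toy example (not one of the notebook's cells), a counter built from a tiny list would look like this:

    counter(["the", "cat", "the"])   # "the" => 2, "cat" => 1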
We can use `DataStructures.counter()` to count them:
"""

# ╔═╡ b130c8d0-ed37-11ea-1529-b9bfe465a95c
token_counts = counter(tokens)

# ╔═╡ 7938c80e-ed37-11ea-0054-137e29962edd
begin
    counts = sort(collect(token_counts), by=x -> -x[2])[1:20]
    plot(
        y=reverse([c[1] for c in counts]),
        x=reverse([c[2] for c in counts]),
        Geom.bar(orientation=:horizontal),
        Guide.xlabel("Count"),
        Guide.ylabel("Token"),
        Guide.xticks(ticks=[0:1000:10000;])
    )
end

# ╔═╡ 3cbd1e7e-ed38-11ea-0f0f-75de6ad7cfd9
md"""
This is not as important, but for curiosity's sake, let's examine sentences.
"""

# ╔═╡ 89968a30-ed37-11ea-0be5-47d2e8cf1277
sentences = split_sentences(text)

# ╔═╡ 6cc62d60-ed38-11ea-32cf-6f9a2b4fcdfb
length(sentences)

# ╔═╡ 8e17fe30-ed38-11ea-2570-a5793d2895ab
tokenized_sentences = tokenize.(sentences)

# ╔═╡ 98a6c83e-ed38-11ea-3ae7-d5cc19cf6b5b
md"""
**Question**: Whoa, what does that `.` do?

This is called vectorization, or *broadcasting*: it applies the function to every element of the array. In Python, you might write it as a list comprehension:

    tokenized_sentences = [tokenize(s) for s in sentences]

That comprehension also works in Julia, but I find the vectorized version easier to read.
"""

# ╔═╡ 4c731270-ed39-11ea-3104-b12c14a55746
md"**Question**: What do the sentences look like?"

# ╔═╡ 6e7d36c2-ed39-11ea-243e-c17047719b01
sentence_lengths = counter(length.(tokenized_sentences))

# ╔═╡ 6a732030-ed39-11ea-3019-899c1a2cf33e
plot(
    x=1:100,
    y=[sentence_lengths[x] for x in 1:100],
    Geom.line,
    Guide.xlabel("Sentence Length (tokens)"),
    Guide.ylabel("Count"),
    Guide.xticks(ticks=[0:10:100;])
)

# ╔═╡ d8b3eb0e-ed39-11ea-39da-3f6867968a34
md"""
## Language Modeling

Now let's actually implement our language model. First, let's write an n-gram function.
"""

# ╔═╡ 976a1720-d60e-11ea-397a-9d1b790513a4
function ngrams(seq, n)
    # TODO: return all contiguous length-n slices of `seq`
    return []
end

# ╔═╡ 27c01fd0-ed3a-11ea-3c2e-99f76f83fa9b
ngrams("language", 3)

# ╔═╡ 11b0bc30-ed3b-11ea-11e0-75106f28fce0
md"Make sure the following test passes:"

# ╔═╡ 9b70360e-d60e-11ea-2330-5766808f5a96
@test ngrams("language", 3) == ["lan", "ang", "ngu", "gua", "uag", "age"]

# ╔═╡ e263dbf0-ed3b-11ea-04ba-817ee56e1257
md"""
Our model is very simple: store the counts of all the n-grams in the text.
"""

# ╔═╡ 81550420-ed3a-11ea-015a-01dc98639e67
model = counter(ngrams(tokens, 3))  # TODO: try other values of n

# ╔═╡ 1c790cd0-ed3b-11ea-2fba-d1b9d59d4e91
md"Now let's see our language model in action by generating some sentences. We will give it a *prefix* to start off with. It will pick the most likely word that follows the prefix and add it to the sentence. Then the last `n-1` words become the prefix, and we repeat."

# ╔═╡ 5d066e0e-ed3a-11ea-328d-891de8a095b8
prefix = ["This", "is"]

# ╔═╡ 44742590-ed3a-11ea-1f2c-b58357633b70
function generate(model, prefix)
    result = []
    for word in prefix
        push!(result, word)
    end
    # how many times an n-gram occurs in the model (0 if unseen)
    get_count(gram) = get(model, gram, 0)
    for i in 1:100
        # pick the word that forms the most frequent n-gram with the current prefix
        best_i = argmax([get_count([prefix; word]) for word in vocabulary])
        best_next = vocabulary[best_i]
        push!(result, best_next)
        prefix = [prefix[2:end]; best_next]
    end
    return result
end

# ╔═╡ 7d3ce3d0-ed3a-11ea-2dd0-7dfb25d7c64b
join(generate(model, prefix), ' ')

# ╔═╡ ccfe10b0-ed3a-11ea-09d7-43deb2afabbe
md"""
**Question**: Given a specific input, `generate()` will always return the same output (i.e. it is *deterministic*), and it doesn't generate realistic sentences. How can we easily improve this?
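
One easy improvement, sketched in the next cell, is to *sample* the next word with probability proportional to its count instead of always taking the argmax. `StatsBase` provides weighted sampling; as a hypothetical toy example (not one of the notebook's cells), the following draws `"b"` about twice as often as `"a"`:

    sample(["a", "b"], Weights([1, 2]))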
""" # ╔═╡ 5821d600-ed3a-11ea-08e5-a324ade7fe35 function generate_random(model, prefix) result = [] for word in prefix push!(result, word) end get_count(word) = get(model, [prefix; word], 0) for i in 1:100 weights = Weights([get_count.(word) for word in vocabulary]) best_next = sample(vocabulary, weights) push!(result, best_next) prefix = [prefix[2:end]; best_next] end return result end # ╔═╡ 9954a990-ed3a-11ea-05a0-c10594775ccc join(generate_random(model, prefix), ' ') # ╔═╡ 98b71490-ed3b-11ea-12f1-5bd2adb81c7c md""" ## Homework Look for #TODO comments in the code. 1. Try this out on some other text! Something by Shakespeare might be fun. 2. Try varying the $n$ in ngrams. How do your results differ with bigrams or 4-grams? """ # ╔═╡ Cell order: # ╟─d4015470-d60c-11ea-3a2c-09e7cff8af7c # ╟─9a1fd080-ed34-11ea-2605-3da00683a939 # ╠═ae3cbd80-ed34-11ea-28d2-e9efc3815959 # ╟─d63dfb50-ed34-11ea-3ec7-39e83423e26f # ╠═f337bca0-ed34-11ea-1aca-f5ee00b34c4b # ╠═ff1749a0-ed34-11ea-21b9-cfdb9759c91b # ╟─3eebdad0-ed37-11ea-1910-67d2006fb5ee # ╟─43857210-ed35-11ea-2bd4-35948fdea98c # ╠═4b9691f0-ed35-11ea-3d57-a9eeb68e6ac5 # ╟─a8a6ebf0-ed36-11ea-163a-e3701653ce97 # ╟─9fd92840-ed35-11ea-24fc-71a1319b75d4 # ╠═ae6d9c10-ed35-11ea-2488-6387dd9542da # ╟─ed1271c0-ed35-11ea-192d-d9971d77b7c5 # ╟─5818ab10-ed36-11ea-3cea-219e51e2b13a # ╠═6af96400-ed35-11ea-3f79-fda8fde36254 # ╠═85037f70-ed35-11ea-007e-ad65fa9d6bc6 # ╠═701586d0-ed35-11ea-3a54-392347e3c913 # ╟─9cccd2a0-ed35-11ea-2260-276ae7c091cd # ╟─6e814052-ed37-11ea-31e8-597447705835 # ╠═b130c8d0-ed37-11ea-1529-b9bfe465a95c # ╠═7938c80e-ed37-11ea-0054-137e29962edd # ╟─3cbd1e7e-ed38-11ea-0f0f-75de6ad7cfd9 # ╠═89968a30-ed37-11ea-0be5-47d2e8cf1277 # ╠═6cc62d60-ed38-11ea-32cf-6f9a2b4fcdfb # ╠═8e17fe30-ed38-11ea-2570-a5793d2895ab # ╟─98a6c83e-ed38-11ea-3ae7-d5cc19cf6b5b # ╟─4c731270-ed39-11ea-3104-b12c14a55746 # ╠═6e7d36c2-ed39-11ea-243e-c17047719b01 # ╠═6a732030-ed39-11ea-3019-899c1a2cf33e # ╟─d8b3eb0e-ed39-11ea-39da-3f6867968a34 # ╠═976a1720-d60e-11ea-397a-9d1b790513a4 # ╠═27c01fd0-ed3a-11ea-3c2e-99f76f83fa9b # ╟─11b0bc30-ed3b-11ea-11e0-75106f28fce0 # ╠═9b70360e-d60e-11ea-2330-5766808f5a96 # ╟─e263dbf0-ed3b-11ea-04ba-817ee56e1257 # ╠═81550420-ed3a-11ea-015a-01dc98639e67 # ╟─1c790cd0-ed3b-11ea-2fba-d1b9d59d4e91 # ╠═5d066e0e-ed3a-11ea-328d-891de8a095b8 # ╠═44742590-ed3a-11ea-1f2c-b58357633b70 # ╠═7d3ce3d0-ed3a-11ea-2dd0-7dfb25d7c64b # ╟─ccfe10b0-ed3a-11ea-09d7-43deb2afabbe # ╠═5821d600-ed3a-11ea-08e5-a324ade7fe35 # ╠═9954a990-ed3a-11ea-05a0-c10594775ccc # ╟─98b71490-ed3b-11ea-12f1-5bd2adb81c7c