### A Pluto.jl notebook ### # v0.11.14 using Markdown using InteractiveUtils # ╔═╡ 7fb11560-02b5-11eb-2480-db65ae8a49da begin using DataStructures using IterTools: groupby using Random Random.seed!(12345) using WordTokenizers using Gadfly end; # ╔═╡ 76a898c0-035b-11eb-2900-d3c30ce99296 using Formatting # ╔═╡ a7feb440-040f-11eb-1b41-ffdb8905b2e1 using ScikitLearn # ╔═╡ cecd89e0-02cd-11eb-36d8-b94851740de5 md"""# Language Identification A common task for multilingual NLP systems is to identify the language of some text. We're going to build models to do this!. First, download the Tatoeba data from [this link](https://downloads.tatoeba.org/exports/sentences.tar.bz2) (warning, it's ~450 MB) """ # ╔═╡ cd699630-0353-11eb-3de7-ebf3ce85ec68 md"## Data" # ╔═╡ 2f7c8d40-02b5-11eb-2a13-410e7a1753d1 function read_tatoeba() data = [] for line in eachline("C:/Users/winston/Desktop/sentences.csv") id, lang, sentence = split(line, '\t') push!(data, (sentence=sentence, lang=lang)) end return data end # ╔═╡ 6d5e3a00-02b5-11eb-3de3-8702a621cec5 raw_data = read_tatoeba() # ╔═╡ 69631750-0354-11eb-2898-1b8e2a004f50 function examine_data(data) counts = counter(String) for (sent, lang) in data counts[lang] += 1 end sorted_counts = sort(collect(counts), by=x -> -x[2])[1:100] Gadfly.set_default_plot_size(12inch, 4inch) plot( #yintercept=[100], Geom.hline(color="red"), x=[c[1] for c in sorted_counts], y=[c[2] for c in sorted_counts], Geom.bar, Scale.y_log10, Guide.xlabel("Language"), Guide.ylabel("Count"), Theme(bar_spacing=1mm)) end # ╔═╡ 75de9632-0354-11eb-2185-bdaf2590a14f examine_data(raw_data) # ╔═╡ 8755dea0-0354-11eb-3681-613f55f92f49 md""" Note the log scale on the y-axis. We see that the data is quite unbalanced (i.e. there are an unequal number of languages). First, let's choose part of the data to use. Then we split this into a *training set*, *dev set*, and *test set*. We will learn from the training set, and evaluate on our dev set to test out our model. Maybe will change some things. Finally, we will run our model ONLY ONCE on the test set. """ # ╔═╡ a6b4c3a2-0355-11eb-2161-c388fa1af4a8 lang_counts = counter([x.lang for x in raw_data]) # ╔═╡ b7dc6570-0355-11eb-15d7-fbbe392de043 # TODO: choose what data to work with data = filter(x -> lang_counts[x.lang] >= 100000, raw_data) # data = filter(x -> x.lang in ["cmn", "yue"], raw_data) # data = filter(x -> x.lang in ["rus", "bel", "ukr"], raw_data) # ╔═╡ c04f9960-02cf-11eb-3d38-2b94aea30f1b LANGUAGES = [x.lang for x in data] |> unique |> sort # ╔═╡ 4bca9a60-0417-11eb-27c4-b3f3217fa5f2 length(LANGUAGES) # ╔═╡ ad531080-02cf-11eb-0c97-7bb6e2d3e052 # TODO: change this, mainly to make training faster NUM_EXAMPLES_PER_LANGUAGE = 2000 # ╔═╡ f0480fd0-02b6-11eb-29e0-4fe89726b40f # this function will split the data so that the same proportion of languages appear in train, dev, and test function stratify_split(data) grouped = groupby(x -> x.lang, sort(data, by=x -> x.lang)) train = [] dev = [] test = [] for group in grouped group = shuffle(group)[1:NUM_EXAMPLES_PER_LANGUAGE] split1 = round(Int, length(group) * 0.8) split2 = round(Int, length(group) * 0.9) for i in 1:split1 push!(train, group[i]) end for i in (split1 + 1):split2 push!(dev, group[i]) end for i in (split2 + 1):length(group) push!(test, group[i]) end end return (train, dev, test) end # ╔═╡ 7cd45e30-0358-11eb-39fe-cb68ca8c3668 train_data, dev_data, test_data = stratify_split(data) # ╔═╡ b96ff660-0358-11eb-1a2a-27faace6df40 md""" ## Model We want to figure out the language a sentence is in. We will do this by keeping track of which words occur in what language. We will follow a simple procedure: for each (sentence, language) pair in our training data, accumulate a count of each word-language occurrence. """ # ╔═╡ 34200b20-0359-11eb-0bbf-0b8bc52695dd md""" The idea is that the words in the sentence will help us figure out the language of the entire sentence. Let's look at the word "hat": """ # ╔═╡ 15d913e0-035a-11eb-0f7e-817fb183b52a md"""These are the number of sentences containing the word 'hat'. We can normalize these counts to turn them into probabilities. """ # ╔═╡ 211159d0-0359-11eb-3740-a9d18b6e7269 function normalize!(counts) for word in keys(counts) total = sum(values(counts[word])) for lang in keys(counts[word]) counts[word][lang] /= total end end end # ╔═╡ f411cf80-035a-11eb-2e68-6f70e4a8d239 md""" Here, we see that if we pick a language to generate the word "hat", German would probably be the most likely language. But this is only if a sentence has one word. How do we deal with an entire sentence? """ # ╔═╡ dd085472-035b-11eb-3f40-61dfb68630ca md""" For each language, we multiply the probabilities that each word belongs to that language. If a language doesn't have the word, it gets a probability of zero, which zeros out the entire multiplication. This model is called a Naive Bayes Classifier. The idea is this: you don't know which language this sentence is from. So you pick a language, and then calculate the probability of picking each word in the sentence from that language. It's "naive" because the probability of picking one word does not affect the probability of picking another word. Let's code this up: """ # ╔═╡ a36325e0-02cc-11eb-17df-997de2b16667 function softmax(seq) total = sum(exp.(seq)) return exp.(seq) / total end # ╔═╡ 7dc21750-040d-11eb-2e9b-99d5990bd195 md""" Notice that we call `preprocess()` on the sentence, which currently splits the sentence by spaces. Try out other preprocessing strategies! """ # ╔═╡ 1926abf0-02d0-11eb-2613-435a4262fe4c # TODO: try various feature extraction strategies function extract_features(sentence) return split(sentence, ' ') # return tokenize(sentence) # return ngrams(tokenize(sentence), 2) # word n-grams # return string.(collect(sentence)) # each character as a string # return ngrams(sentence, 3) # character n-grams # return [something else that you come up with] end # ╔═╡ b6676600-02c3-11eb-0dcb-ebf6ecde8104 function train_model(train_data) counts = Dict{String, Accumulator{String, Float64}}() for (sent, lang) in train_data for word in extract_features(sent) if lang ∉ keys(counts) counts[lang] = counter(String) end counts[lang][word] += 1 end end return counts end # ╔═╡ 42e6d990-0359-11eb-1991-8f74d360cd6d counts = train_model(train_data) # ╔═╡ 6dd74810-0359-11eb-06f4-d936e5210d82 [(lang, counts[lang]["hat"]) for lang in keys(counts)] # ╔═╡ 9c627410-035a-11eb-0860-61b00b564971 normalize!(counts) # ╔═╡ da388812-035a-11eb-3755-037b146724ed [(lang, counts[lang]["hat"]) for lang in keys(counts)] # ╔═╡ 24bb1600-035b-11eb-3277-55262c559e3c filter(x -> x[2] != 0, [(lang, counts[lang]["I"]) for lang in keys(counts)]) # ╔═╡ 2e7cf320-035b-11eb-12d9-35b818c26eac filter(x -> x[2] != 0, [(lang, counts[lang]["like"]) for lang in keys(counts)]) # ╔═╡ 38619a80-035b-11eb-0133-e7ce05d0bc17 filter(x -> x[2] != 0, [(lang, counts[lang]["hat"]) for lang in keys(counts)]) # ╔═╡ 40744100-035b-11eb-02df-575934e3d0ab map(LANGUAGES) do lang I = counts[lang]["I"] like = counts[lang]["like"] hat = counts[lang]["hat"] join([ lang, fmt(".6f", I), " x ", fmt(".6f", like), " x ", fmt(".6f", hat), " = ", I * like * hat ], " ") end # ╔═╡ faa03fa0-035d-11eb-1dc0-03a001b17009 function predict(model, sentence) words = extract_features(sentence) scores = map(LANGUAGES) do lang score = 0 for word in words if lang ∈ keys(model) && word in keys(model[lang]) score += log(model[lang][word]) else score += -Inf # 0 in log probability end end score end scores = softmax(scores) lang = LANGUAGES[argmax(scores)] return lang end # ╔═╡ 14322472-02c7-11eb-37ea-ef9891f9fc45 function ngrams(seq, n::Int) indices = collect(eachindex(seq)) result = [] for i in 1:length(indices) - n + 1 push!(result, seq[indices[i]:indices[i + n - 1]]) end return result end # ╔═╡ 79372fe0-037c-11eb-3b62-8798303a9e7c md""" Now that we have trained our model, let's see how well we do on the dev data. """ # ╔═╡ eeee676e-035f-11eb-2250-5705fa43d301 predictions = [predict(counts, x.sentence) for x in dev_data] # ╔═╡ 85471360-02c6-11eb-3da0-75935b91fa2e function accuracy(predictions, gold) return sum(predictions .== gold) / length(gold) end # ╔═╡ 92ae51d0-02c6-11eb-21e0-3fc1508d7b45 accuracy(predictions, [x.lang for x in dev_data]) # ╔═╡ 82cc309e-037c-11eb-00c8-d7f6237fac85 md""" Why do we get so low accuracy? If we look at `predictions`, we see that it basically predicts Berber for everything. Maybe we can get a better idea by looking at the probabilities given by the model. """ # ╔═╡ 9fe4fc62-0360-11eb-2628-f7bafcfd3bb0 function predict_score(model, sentence) words = extract_features(sentence) scores = map(LANGUAGES) do lang score = 0 for word in words if lang ∈ keys(model) && word ∈ keys(model[lang]) score += log(model[lang][word]) else score += -Inf # TODO end end score end return scores end # ╔═╡ 170d8750-037d-11eb-3db4-6df8a6909fd9 zip(LANGUAGES, predict_score(counts, dev_data[1].sentence)) |> collect # ╔═╡ 57047f7e-037d-11eb-03fb-65affc9a6b21 md""" They're all -Inf! So we must be seeing words we have never seen before, which zeros out the probabilities. Instead of zero, let's give a tiny probability for unseen words. """ # ╔═╡ 0353e110-035e-11eb-1e7c-cf69dc0ce92d sort(collect(zip(LANGUAGES, predict_score(counts, "hat"))), by=x->-x[2]) # ╔═╡ 316d9310-040e-11eb-3f09-79f0e8a9f578 md""" Let's do some error analysis to see where the model goes wrong. """ # ╔═╡ afc1db30-040f-11eb-18a9-a105db6885be @sk_import metrics: confusion_matrix # ╔═╡ f3f9633e-040f-11eb-0bb1-677b1390d679 cm = confusion_matrix(predictions, [x.lang for x in dev_data]) # ╔═╡ 04200fd0-0410-11eb-31b6-2d9b4c7636ad begin Gadfly.set_default_plot_size(7inch, 6inch) spy(cm, Scale.x_discrete(labels = i -> LANGUAGES[i]), Scale.y_discrete(labels = i -> LANGUAGES[i]), Guide.xlabel("Gold"), Guide.ylabel("Predictions") ) end # ╔═╡ 9e438350-040e-11eb-24e6-4bf767948168 function wrong_predictions(model, test_data) wrong = [] for (sent, lang) in test_data pred = predict(model, sent) if pred != lang push!(wrong, (sentence=sent, lang=lang, pred=pred)) end end return wrong end # ╔═╡ 1d58ee00-040f-11eb-0bcd-27cc38ee8c48 wrong = wrong_predictions(counts, dev_data) # ╔═╡ 45ceb830-0412-11eb-39cb-01b84b47c43e filter(x -> x.pred == "ber", wrong) # ╔═╡ 6d3baab0-0361-11eb-01d8-11769f614e2b md"Finally, when you have done all the tweaking you want, run your model on the test data. Uncomment the following line, and only run it once!" # ╔═╡ 7e367020-0361-11eb-1e04-4d99d73b578a #test_predictions = [predict(counts, x.sentence) for x in test_data] # ╔═╡ 8b3c54b0-0361-11eb-303b-0b8fad1814f4 accuracy(test_predictions, [x.lang for x in test_data]) # ╔═╡ 05b6e780-02d3-11eb-2bba-5b6f5bdab32d md""" ## Homework Choose a set of languages to work with. Then try 1. varying the amount of data available 2. changing the features the model sees and then evaluate how well your model does. Search for the **TODO** comments for where to edit the code. What features work best? """ # ╔═╡ Cell order: # ╟─7fb11560-02b5-11eb-2480-db65ae8a49da # ╟─cecd89e0-02cd-11eb-36d8-b94851740de5 # ╟─cd699630-0353-11eb-3de7-ebf3ce85ec68 # ╠═2f7c8d40-02b5-11eb-2a13-410e7a1753d1 # ╠═6d5e3a00-02b5-11eb-3de3-8702a621cec5 # ╠═69631750-0354-11eb-2898-1b8e2a004f50 # ╠═75de9632-0354-11eb-2185-bdaf2590a14f # ╟─8755dea0-0354-11eb-3681-613f55f92f49 # ╠═a6b4c3a2-0355-11eb-2161-c388fa1af4a8 # ╠═b7dc6570-0355-11eb-15d7-fbbe392de043 # ╠═c04f9960-02cf-11eb-3d38-2b94aea30f1b # ╠═4bca9a60-0417-11eb-27c4-b3f3217fa5f2 # ╠═ad531080-02cf-11eb-0c97-7bb6e2d3e052 # ╠═f0480fd0-02b6-11eb-29e0-4fe89726b40f # ╠═7cd45e30-0358-11eb-39fe-cb68ca8c3668 # ╟─b96ff660-0358-11eb-1a2a-27faace6df40 # ╠═b6676600-02c3-11eb-0dcb-ebf6ecde8104 # ╠═42e6d990-0359-11eb-1991-8f74d360cd6d # ╠═34200b20-0359-11eb-0bbf-0b8bc52695dd # ╠═6dd74810-0359-11eb-06f4-d936e5210d82 # ╟─15d913e0-035a-11eb-0f7e-817fb183b52a # ╠═211159d0-0359-11eb-3740-a9d18b6e7269 # ╠═9c627410-035a-11eb-0860-61b00b564971 # ╠═da388812-035a-11eb-3755-037b146724ed # ╠═f411cf80-035a-11eb-2e68-6f70e4a8d239 # ╠═24bb1600-035b-11eb-3277-55262c559e3c # ╠═2e7cf320-035b-11eb-12d9-35b818c26eac # ╠═38619a80-035b-11eb-0133-e7ce05d0bc17 # ╠═76a898c0-035b-11eb-2900-d3c30ce99296 # ╠═40744100-035b-11eb-02df-575934e3d0ab # ╟─dd085472-035b-11eb-3f40-61dfb68630ca # ╠═faa03fa0-035d-11eb-1dc0-03a001b17009 # ╠═a36325e0-02cc-11eb-17df-997de2b16667 # ╟─7dc21750-040d-11eb-2e9b-99d5990bd195 # ╠═1926abf0-02d0-11eb-2613-435a4262fe4c # ╠═14322472-02c7-11eb-37ea-ef9891f9fc45 # ╟─79372fe0-037c-11eb-3b62-8798303a9e7c # ╠═eeee676e-035f-11eb-2250-5705fa43d301 # ╠═85471360-02c6-11eb-3da0-75935b91fa2e # ╠═92ae51d0-02c6-11eb-21e0-3fc1508d7b45 # ╟─82cc309e-037c-11eb-00c8-d7f6237fac85 # ╠═9fe4fc62-0360-11eb-2628-f7bafcfd3bb0 # ╠═170d8750-037d-11eb-3db4-6df8a6909fd9 # ╟─57047f7e-037d-11eb-03fb-65affc9a6b21 # ╠═0353e110-035e-11eb-1e7c-cf69dc0ce92d # ╟─316d9310-040e-11eb-3f09-79f0e8a9f578 # ╠═a7feb440-040f-11eb-1b41-ffdb8905b2e1 # ╠═afc1db30-040f-11eb-18a9-a105db6885be # ╠═f3f9633e-040f-11eb-0bb1-677b1390d679 # ╠═04200fd0-0410-11eb-31b6-2d9b4c7636ad # ╠═9e438350-040e-11eb-24e6-4bf767948168 # ╠═1d58ee00-040f-11eb-0bcd-27cc38ee8c48 # ╠═45ceb830-0412-11eb-39cb-01b84b47c43e # ╟─6d3baab0-0361-11eb-01d8-11769f614e2b # ╠═7e367020-0361-11eb-1e04-4d99d73b578a # ╠═8b3c54b0-0361-11eb-303b-0b8fad1814f4 # ╠═05b6e780-02d3-11eb-2bba-5b6f5bdab32d