### A Pluto.jl notebook ###
# v0.11.14

using Markdown
using InteractiveUtils

# ╔═╡ 7fb11560-02b5-11eb-2480-db65ae8a49da
begin
	using DataStructures
	using IterTools: groupby
	using Random
	Random.seed!(12345)
	using WordTokenizers
	using Gadfly
end;

# ╔═╡ 76a898c0-035b-11eb-2900-d3c30ce99296
using Formatting

# ╔═╡ a7feb440-040f-11eb-1b41-ffdb8905b2e1
using ScikitLearn

# ╔═╡ cecd89e0-02cd-11eb-36d8-b94851740de5
md"""# Language Identification

A common task for multilingual NLP systems is to identify the language of some text. We're going to build models to do this!. First, download the Tatoeba data from [this link](https://downloads.tatoeba.org/exports/sentences.tar.bz2) (warning, it's ~450 MB)
"""

# ╔═╡ cd699630-0353-11eb-3de7-ebf3ce85ec68
md"## Data"

# ╔═╡ 2f7c8d40-02b5-11eb-2a13-410e7a1753d1
function read_tatoeba()
	data = []
	for line in eachline("C:/Users/winston/Desktop/sentences.csv")
		id, lang, sentence = split(line, '\t')
		push!(data, (sentence=sentence, lang=lang))
	end
	return data
end

# ╔═╡ 6d5e3a00-02b5-11eb-3de3-8702a621cec5
raw_data = read_tatoeba()

# ╔═╡ 69631750-0354-11eb-2898-1b8e2a004f50
function examine_data(data)
	counts = counter(String)
    for (sent, lang) in data
        counts[lang] += 1
    end
    
    sorted_counts = sort(collect(counts), by=x -> -x[2])[1:100]
    
    Gadfly.set_default_plot_size(12inch, 4inch)
    plot(
        #yintercept=[100], Geom.hline(color="red"),
        x=[c[1] for c in sorted_counts],
		y=[c[2] for c in sorted_counts],
		Geom.bar,
		Scale.y_log10,
        Guide.xlabel("Language"),
		Guide.ylabel("Count"),
		Theme(bar_spacing=1mm))
end

# ╔═╡ 75de9632-0354-11eb-2185-bdaf2590a14f
examine_data(raw_data)

# ╔═╡ 8755dea0-0354-11eb-3681-613f55f92f49
md"""
Note the log scale on the y-axis. We see that the data is quite unbalanced (i.e. there are an unequal number of languages).

First, let's choose part of the data to use. Then we split this into a *training set*, *dev set*, and *test set*. We will learn from the training set, and evaluate on our dev set to test out our model. Maybe will change some things. Finally, we will run our model ONLY ONCE on the test set.
"""

# ╔═╡ a6b4c3a2-0355-11eb-2161-c388fa1af4a8
lang_counts = counter([x.lang for x in raw_data])

# ╔═╡ b7dc6570-0355-11eb-15d7-fbbe392de043
# TODO: choose what data to work with

data = filter(x -> lang_counts[x.lang] >= 100000, raw_data)
#  data = filter(x -> x.lang in ["cmn", "yue"], raw_data)
# data = filter(x -> x.lang in ["rus", "bel", "ukr"], raw_data)

# ╔═╡ c04f9960-02cf-11eb-3d38-2b94aea30f1b
LANGUAGES = [x.lang for x in data] |> unique |> sort

# ╔═╡ 4bca9a60-0417-11eb-27c4-b3f3217fa5f2
length(LANGUAGES)

# ╔═╡ ad531080-02cf-11eb-0c97-7bb6e2d3e052
# TODO: change this, mainly to make training faster
NUM_EXAMPLES_PER_LANGUAGE = 2000

# ╔═╡ f0480fd0-02b6-11eb-29e0-4fe89726b40f
# this function will split the data so that the same proportion of languages appear in train, dev, and test
function stratify_split(data)
    grouped = groupby(x -> x.lang, sort(data, by=x -> x.lang))
    
    train = []
    dev = []
    test = []

    for group in grouped
        group = shuffle(group)[1:NUM_EXAMPLES_PER_LANGUAGE]

        split1 = round(Int, length(group) * 0.8)
        split2 = round(Int, length(group) * 0.9)

        for i in 1:split1
            push!(train, group[i])
        end
        for i in (split1 + 1):split2
            push!(dev, group[i])
        end
        for i in (split2 + 1):length(group)
            push!(test, group[i])
        end
    end
    return (train, dev, test)
end

# ╔═╡ 7cd45e30-0358-11eb-39fe-cb68ca8c3668
train_data, dev_data, test_data = stratify_split(data)

# ╔═╡ b96ff660-0358-11eb-1a2a-27faace6df40
md"""
## Model

We want to figure out the language a sentence is in. We will do this by keeping track of which words occur in what language. We will follow a simple procedure: for each (sentence, language) pair in our training data, accumulate a count of each word-language occurrence.
"""

# ╔═╡ 34200b20-0359-11eb-0bbf-0b8bc52695dd
md"""
The idea is that the words in the sentence will help us figure out the language of the entire sentence. Let's look at the word "hat":
"""

# ╔═╡ 15d913e0-035a-11eb-0f7e-817fb183b52a
md"""These are the number of sentences containing the word 'hat'.

We can normalize these counts to turn them into probabilities.
"""

# ╔═╡ 211159d0-0359-11eb-3740-a9d18b6e7269
function normalize!(counts)	
	for word in keys(counts)
		total = sum(values(counts[word]))
		for lang in keys(counts[word])
			counts[word][lang] /= total
		end
	end
end

# ╔═╡ f411cf80-035a-11eb-2e68-6f70e4a8d239
md"""
Here, we see that if we pick a language to generate the word "hat", German would probably be the most likely language. But this is only if a sentence has one word. How do we deal with an entire sentence?
"""

# ╔═╡ dd085472-035b-11eb-3f40-61dfb68630ca
md"""
For each language, we multiply the probabilities that each word belongs to that language. If a language doesn't have the word, it gets a probability of zero, which zeros out the entire multiplication.

This model is called a Naive Bayes Classifier. The idea is this: you don't know which language this sentence is from. So you pick a language, and then calculate the probability of picking each word in the sentence from that language. It's "naive" because the probability of picking one word does not affect the probability of picking another word.

Let's code this up:
"""

# ╔═╡ a36325e0-02cc-11eb-17df-997de2b16667
function softmax(seq)
	total = sum(exp.(seq))
	return exp.(seq) / total
end

# ╔═╡ 7dc21750-040d-11eb-2e9b-99d5990bd195
md"""
Notice that we call `preprocess()` on the sentence, which currently splits the sentence by spaces. Try out other preprocessing strategies!
"""

# ╔═╡ 1926abf0-02d0-11eb-2613-435a4262fe4c
# TODO: try various feature extraction strategies

function extract_features(sentence)
	return split(sentence, ' ')
#  	return tokenize(sentence)
# 	return ngrams(tokenize(sentence), 2)  # word n-grams
# 	return string.(collect(sentence))  # each character as a string
# 	return ngrams(sentence, 3)  # character n-grams
#	return [something else that you come up with]
end

# ╔═╡ b6676600-02c3-11eb-0dcb-ebf6ecde8104
function train_model(train_data)
	counts = Dict{String, Accumulator{String, Float64}}()
	for (sent, lang) in train_data
		for word in extract_features(sent)
			if lang ∉ keys(counts)
				counts[lang] = counter(String)
			end
			counts[lang][word] += 1
		end
	end
	return counts
end

# ╔═╡ 42e6d990-0359-11eb-1991-8f74d360cd6d
counts = train_model(train_data)

# ╔═╡ 6dd74810-0359-11eb-06f4-d936e5210d82
[(lang, counts[lang]["hat"]) for lang in keys(counts)]

# ╔═╡ 9c627410-035a-11eb-0860-61b00b564971
normalize!(counts)

# ╔═╡ da388812-035a-11eb-3755-037b146724ed
[(lang, counts[lang]["hat"]) for lang in keys(counts)]

# ╔═╡ 24bb1600-035b-11eb-3277-55262c559e3c
filter(x -> x[2] != 0, [(lang, counts[lang]["I"]) for lang in keys(counts)])

# ╔═╡ 2e7cf320-035b-11eb-12d9-35b818c26eac
filter(x -> x[2] != 0, [(lang, counts[lang]["like"]) for lang in keys(counts)])

# ╔═╡ 38619a80-035b-11eb-0133-e7ce05d0bc17
filter(x -> x[2] != 0, [(lang, counts[lang]["hat"]) for lang in keys(counts)])

# ╔═╡ 40744100-035b-11eb-02df-575934e3d0ab
map(LANGUAGES) do lang
	I = counts[lang]["I"]
	like = counts[lang]["like"]
	hat = counts[lang]["hat"]
	join([
        lang,
        fmt(".6f", I),
        " x ",
        fmt(".6f", like),
        " x ",
        fmt(".6f", hat),
        " = ",
        I * like * hat
    ], "  ")
end

# ╔═╡ faa03fa0-035d-11eb-1dc0-03a001b17009
function predict(model, sentence)
    words = extract_features(sentence)
    
    scores = map(LANGUAGES) do lang
        score = 0
        for word in words
            if lang ∈ keys(model) && word in keys(model[lang])
                score += log(model[lang][word])
			else
				score += -Inf  # 0 in log probability
            end
        end
        score
    end
	
	scores = softmax(scores)
    lang = LANGUAGES[argmax(scores)]
    return lang
end

# ╔═╡ 14322472-02c7-11eb-37ea-ef9891f9fc45
function ngrams(seq, n::Int)
	indices = collect(eachindex(seq))
	result = []
	for i in 1:length(indices) - n + 1
		push!(result, seq[indices[i]:indices[i + n - 1]])
	end
	return result
end

# ╔═╡ 79372fe0-037c-11eb-3b62-8798303a9e7c
md"""
Now that we have trained our model, let's see how well we do on the dev data.
"""

# ╔═╡ eeee676e-035f-11eb-2250-5705fa43d301
predictions = [predict(counts, x.sentence) for x in dev_data]

# ╔═╡ 85471360-02c6-11eb-3da0-75935b91fa2e
function accuracy(predictions, gold)
	return sum(predictions .== gold) / length(gold)
end

# ╔═╡ 92ae51d0-02c6-11eb-21e0-3fc1508d7b45
accuracy(predictions, [x.lang for x in dev_data])

# ╔═╡ 82cc309e-037c-11eb-00c8-d7f6237fac85
md"""
Why do we get so low accuracy? If we look at `predictions`, we see that it basically predicts Berber for everything.

Maybe we can get a better idea by looking at the probabilities given by the model.
"""

# ╔═╡ 9fe4fc62-0360-11eb-2628-f7bafcfd3bb0
function predict_score(model, sentence)
    words = extract_features(sentence)
    
    scores = map(LANGUAGES) do lang
        score = 0
        for word in words
            if lang ∈ keys(model) && word ∈ keys(model[lang])
                score += log(model[lang][word])
			else
				score += -Inf  # TODO
            end
        end
        score
    end
	
    return scores
end

# ╔═╡ 170d8750-037d-11eb-3db4-6df8a6909fd9
zip(LANGUAGES, predict_score(counts, dev_data[1].sentence)) |> collect

# ╔═╡ 57047f7e-037d-11eb-03fb-65affc9a6b21
md"""
They're all -Inf! So we must be seeing words we have never seen before, which zeros out the probabilities. Instead of zero, let's give a tiny probability for unseen words.
"""

# ╔═╡ 0353e110-035e-11eb-1e7c-cf69dc0ce92d
sort(collect(zip(LANGUAGES, predict_score(counts, "hat"))), by=x->-x[2])

# ╔═╡ 316d9310-040e-11eb-3f09-79f0e8a9f578
md"""
Let's do some error analysis to see where the model goes wrong.
"""

# ╔═╡ afc1db30-040f-11eb-18a9-a105db6885be
@sk_import metrics: confusion_matrix

# ╔═╡ f3f9633e-040f-11eb-0bb1-677b1390d679
cm = confusion_matrix(predictions, [x.lang for x in dev_data])

# ╔═╡ 04200fd0-0410-11eb-31b6-2d9b4c7636ad
begin
	Gadfly.set_default_plot_size(7inch, 6inch)
	spy(cm, 
		Scale.x_discrete(labels = i -> LANGUAGES[i]),
		Scale.y_discrete(labels = i -> LANGUAGES[i]),
		Guide.xlabel("Gold"),
		Guide.ylabel("Predictions")
	)
end

# ╔═╡ 9e438350-040e-11eb-24e6-4bf767948168
function wrong_predictions(model, test_data)
    wrong = []
    for (sent, lang) in test_data
        pred = predict(model, sent)
        if pred != lang
			push!(wrong, (sentence=sent, lang=lang, pred=pred))
        end
    end
	return wrong
end

# ╔═╡ 1d58ee00-040f-11eb-0bcd-27cc38ee8c48
wrong = wrong_predictions(counts, dev_data)

# ╔═╡ 45ceb830-0412-11eb-39cb-01b84b47c43e
filter(x -> x.pred == "ber", wrong)

# ╔═╡ 6d3baab0-0361-11eb-01d8-11769f614e2b
md"Finally, when you have done all the tweaking you want, run your model on the test data. Uncomment the following line, and only run it once!"

# ╔═╡ 7e367020-0361-11eb-1e04-4d99d73b578a
#test_predictions = [predict(counts, x.sentence) for x in test_data]

# ╔═╡ 8b3c54b0-0361-11eb-303b-0b8fad1814f4
accuracy(test_predictions, [x.lang for x in test_data])

# ╔═╡ 05b6e780-02d3-11eb-2bba-5b6f5bdab32d
md"""
## Homework

Choose a set of languages to work with. Then try

1. varying the amount of data available
2. changing the features the model sees

and then evaluate how well your model does. Search for the **TODO** comments for where to edit the code. What features work best?
"""

# ╔═╡ Cell order:
# ╟─7fb11560-02b5-11eb-2480-db65ae8a49da
# ╟─cecd89e0-02cd-11eb-36d8-b94851740de5
# ╟─cd699630-0353-11eb-3de7-ebf3ce85ec68
# ╠═2f7c8d40-02b5-11eb-2a13-410e7a1753d1
# ╠═6d5e3a00-02b5-11eb-3de3-8702a621cec5
# ╠═69631750-0354-11eb-2898-1b8e2a004f50
# ╠═75de9632-0354-11eb-2185-bdaf2590a14f
# ╟─8755dea0-0354-11eb-3681-613f55f92f49
# ╠═a6b4c3a2-0355-11eb-2161-c388fa1af4a8
# ╠═b7dc6570-0355-11eb-15d7-fbbe392de043
# ╠═c04f9960-02cf-11eb-3d38-2b94aea30f1b
# ╠═4bca9a60-0417-11eb-27c4-b3f3217fa5f2
# ╠═ad531080-02cf-11eb-0c97-7bb6e2d3e052
# ╠═f0480fd0-02b6-11eb-29e0-4fe89726b40f
# ╠═7cd45e30-0358-11eb-39fe-cb68ca8c3668
# ╟─b96ff660-0358-11eb-1a2a-27faace6df40
# ╠═b6676600-02c3-11eb-0dcb-ebf6ecde8104
# ╠═42e6d990-0359-11eb-1991-8f74d360cd6d
# ╠═34200b20-0359-11eb-0bbf-0b8bc52695dd
# ╠═6dd74810-0359-11eb-06f4-d936e5210d82
# ╟─15d913e0-035a-11eb-0f7e-817fb183b52a
# ╠═211159d0-0359-11eb-3740-a9d18b6e7269
# ╠═9c627410-035a-11eb-0860-61b00b564971
# ╠═da388812-035a-11eb-3755-037b146724ed
# ╠═f411cf80-035a-11eb-2e68-6f70e4a8d239
# ╠═24bb1600-035b-11eb-3277-55262c559e3c
# ╠═2e7cf320-035b-11eb-12d9-35b818c26eac
# ╠═38619a80-035b-11eb-0133-e7ce05d0bc17
# ╠═76a898c0-035b-11eb-2900-d3c30ce99296
# ╠═40744100-035b-11eb-02df-575934e3d0ab
# ╟─dd085472-035b-11eb-3f40-61dfb68630ca
# ╠═faa03fa0-035d-11eb-1dc0-03a001b17009
# ╠═a36325e0-02cc-11eb-17df-997de2b16667
# ╟─7dc21750-040d-11eb-2e9b-99d5990bd195
# ╠═1926abf0-02d0-11eb-2613-435a4262fe4c
# ╠═14322472-02c7-11eb-37ea-ef9891f9fc45
# ╟─79372fe0-037c-11eb-3b62-8798303a9e7c
# ╠═eeee676e-035f-11eb-2250-5705fa43d301
# ╠═85471360-02c6-11eb-3da0-75935b91fa2e
# ╠═92ae51d0-02c6-11eb-21e0-3fc1508d7b45
# ╟─82cc309e-037c-11eb-00c8-d7f6237fac85
# ╠═9fe4fc62-0360-11eb-2628-f7bafcfd3bb0
# ╠═170d8750-037d-11eb-3db4-6df8a6909fd9
# ╟─57047f7e-037d-11eb-03fb-65affc9a6b21
# ╠═0353e110-035e-11eb-1e7c-cf69dc0ce92d
# ╟─316d9310-040e-11eb-3f09-79f0e8a9f578
# ╠═a7feb440-040f-11eb-1b41-ffdb8905b2e1
# ╠═afc1db30-040f-11eb-18a9-a105db6885be
# ╠═f3f9633e-040f-11eb-0bb1-677b1390d679
# ╠═04200fd0-0410-11eb-31b6-2d9b4c7636ad
# ╠═9e438350-040e-11eb-24e6-4bf767948168
# ╠═1d58ee00-040f-11eb-0bcd-27cc38ee8c48
# ╠═45ceb830-0412-11eb-39cb-01b84b47c43e
# ╟─6d3baab0-0361-11eb-01d8-11769f614e2b
# ╠═7e367020-0361-11eb-1e04-4d99d73b578a
# ╠═8b3c54b0-0361-11eb-303b-0b8fad1814f4
# ╠═05b6e780-02d3-11eb-2bba-5b6f5bdab32d