### A Pluto.jl notebook ###
# v0.12.3

using Markdown
using InteractiveUtils

# ╔═╡ 807a76c2-0982-11eb-20da-6d7f4ebc7fc8
begin
	using Clustering
	using StatsPlots
	using StringDistances
end

# ╔═╡ a506b000-0ccd-11eb-06ac-3d168a9911d7
md"""
# Language Phylogeny

In this notebook, we will use cognate lists to reconstruct a language family tree. The cognate lists are the Swadesh lists from Wiktionary, available [here](https://en.wiktionary.org/wiki/Appendix:Swadesh_lists).
"""

# ╔═╡ 68457400-0983-11eb-1ade-35c608111565
# Read one Swadesh list: each line is `id <tab> English gloss <tab> word(s)`.
# When a line lists several comma-separated variants, keep only the first.
function load_data(fn)
	data = String[]
	for line in eachline(fn)
		id, eng, word = split(line, '\t')
		word = split(word, ',')[1]
		push!(data, strip(word))
	end
	return data
end

# ╔═╡ 5f9f98c0-0984-11eb-270a-d9067069701d
# ISO 639-3 codes: Catalan, French, Italian, Portuguese, Romanian, Spanish
langs = ["cat", "fra", "ita", "por", "ron", "spa"]

# ╔═╡ 4f01caf0-0985-11eb-0e2c-b9f99969179e
# Adjust this path to wherever your Swadesh list files are stored.
data = map(langs) do lang
	load_data("C:/users/winston/Dropbox/heart-mnlp/code/phylogeny/$lang.txt")
end

# ╔═╡ 3dcf4712-0ccf-11eb-3770-c5eb179f473b
md"""
As discussed in class, we generate a distance matrix between languages by taking the mean string distance over cognate pairs. Experiment with which distance to use! The code below uses Levenshtein distance; StringDistances.jl lists the other available metrics [here](https://github.com/matthieugomez/StringDistances.jl).
"""

# ╔═╡ 1beab48e-098c-11eb-2923-9d1eba877152
# Pairwise language-distance matrix, filled in below.
matrix = zeros(length(data), length(data))

# ╔═╡ d6a7c740-099c-11eb-3dd7-67041353b85f
# Levenshtein distance: the number of single-character edits.
Levenshtein()("pizza", "pie")

# ╔═╡ 1b8616a0-099d-11eb-33a7-ebd9e4f46ba3
# Jaro distance: normalized to [0, 1].
Jaro()("pizza", "pie")

# ╔═╡ b8b44e80-0ccf-11eb-28c9-89f7608906d0
cognate_sets = 1:207  # all 207 Swadesh concepts
# cognate_sets = [41, 90, 86, 97, 87, 56, 24, 39, 88, 61, 12, 49, 93, 16]  # Dolgopolsky list

# ╔═╡ 5088dbf0-098c-11eb-357d-f1adff3a82e0
# The matrix is symmetric with a zero diagonal, so compute each pair of
# languages once and fill both entries.
for lang1 in 1:length(data)
	for lang2 in lang1+1:length(data)
		dist = 0
		for w in cognate_sets
			dist += Levenshtein()(data[lang1][w], data[lang2][w])  # TODO: distance metric
		end
		dist = dist / length(cognate_sets)
		matrix[lang1, lang2] = dist
		matrix[lang2, lang1] = dist
	end
end

# ╔═╡ c0d50730-098c-11eb-2269-779bc90b5ffb
matrix

# ╔═╡ ac3f3840-0ccf-11eb-290f-e145ac393f03
md"""
Then we perform hierarchical agglomerative clustering. Experiment with different linkage methods (the options are `:single`, `:average`, `:complete`, and `:ward`).
"""

# ╔═╡ c486013e-098c-11eb-15c5-012ad1eeae7a
hc = hclust(matrix, linkage=:average)  # TODO: try other linkage methods

# ╔═╡ d7c7e8d0-0992-11eb-39bb-9f673f00475e
plot(hc)  # dendrogram; leaf labels are indices into `langs`

# ╔═╡ 5fa37a70-09b2-11eb-1002-3f3a6aa20ad4
langs  # look up which language each leaf index refers to

# ╔═╡ 358128c0-0cd0-11eb-2204-e38ba77272fb
md"""
## Homework

Experiment with
- different string distance metrics
- different clustering linkage methods
- other languages (you will have to download your own lists)

How does the resulting dendrogram change? (A commented sketch at the bottom of this file shows one possible starting point.)
"""

# ╔═╡ Cell order:
# ╠═a506b000-0ccd-11eb-06ac-3d168a9911d7
# ╠═807a76c2-0982-11eb-20da-6d7f4ebc7fc8
# ╠═68457400-0983-11eb-1ade-35c608111565
# ╠═5f9f98c0-0984-11eb-270a-d9067069701d
# ╠═4f01caf0-0985-11eb-0e2c-b9f99969179e
# ╟─3dcf4712-0ccf-11eb-3770-c5eb179f473b
# ╠═1beab48e-098c-11eb-2923-9d1eba877152
# ╠═d6a7c740-099c-11eb-3dd7-67041353b85f
# ╠═1b8616a0-099d-11eb-33a7-ebd9e4f46ba3
# ╠═b8b44e80-0ccf-11eb-28c9-89f7608906d0
# ╠═5088dbf0-098c-11eb-357d-f1adff3a82e0
# ╠═c0d50730-098c-11eb-2269-779bc90b5ffb
# ╟─ac3f3840-0ccf-11eb-290f-e145ac393f03
# ╠═c486013e-098c-11eb-15c5-012ad1eeae7a
# ╠═d7c7e8d0-0992-11eb-39bb-9f673f00475e
# ╠═5fa37a70-09b2-11eb-1002-3f3a6aa20ad4
# ╟─358128c0-0cd0-11eb-2204-e38ba77272fb
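
# Sketch for the homework: a minimal, commented-out example of swapping in a
# different string metric and linkage. It is kept as comments so it does not
# clash with the single-definition rule of the reactive cells above; edit the
# existing distance-matrix and hclust cells rather than pasting this in as a
# new cell. `DamerauLevenshtein` and `:ward` are just one possible choice among
# the metrics in StringDistances.jl and the linkages in Clustering.jl.
#
#     metric = DamerauLevenshtein()          # or Jaro(), RatcliffObershelp(), ...
#     for lang1 in 1:length(data), lang2 in lang1+1:length(data)
#         d = sum(metric(data[lang1][w], data[lang2][w]) for w in cognate_sets) /
#             length(cognate_sets)
#         matrix[lang1, lang2] = matrix[lang2, lang1] = d
#     end
#     hc = hclust(matrix, linkage=:ward)     # vs. :single, :average, :complete
#     plot(hc)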