### A Pluto.jl notebook ###
# v0.12.3

using Markdown
using InteractiveUtils

# ╔═╡ 807a76c2-0982-11eb-20da-6d7f4ebc7fc8
begin
	using Clustering
	using StatsPlots
	using StringDistances
end

# ╔═╡ a506b000-0ccd-11eb-06ac-3d168a9911d7
md"""
# Language Phylogeny

In this notebook, we will use cognate lists to reconstruct a language family tree. The cognate lists are the Swadesh lists from Wiktionary, available [here](https://en.wiktionary.org/wiki/Appendix:Swadesh_lists).
"""

# ╔═╡ 68457400-0983-11eb-1ade-35c608111565
# Read one Swadesh list: each line is `id <tab> English gloss <tab> word(s)`.
# When a line lists several comma-separated variants, keep only the first.
function load_data(fn)
	data = String[]
	for line in eachline(fn)
		id, eng, word = split(line, '\t')
		word = split(word, ',')[1]
		push!(data, strip(word))
	end
	return data
end

# ╔═╡ 5f9f98c0-0984-11eb-270a-d9067069701d
# ISO 639-3 codes: Catalan, French, Italian, Portuguese, Romanian, Spanish
langs = ["cat", "fra", "ita", "por", "ron", "spa"]

# ╔═╡ 4f01caf0-0985-11eb-0e2c-b9f99969179e
# Adjust this path to wherever your Swadesh list files are stored.
data = map(langs) do lang
	load_data("C:/users/winston/Dropbox/heart-mnlp/code/phylogeny/$lang.txt")
end

# ╔═╡ 3dcf4712-0ccf-11eb-3770-c5eb179f473b
md"""
As discussed in class, we generate a distance matrix between languages by taking the mean string distance over cognate pairs. Experiment with which distance to use! The code below uses Levenshtein distance; StringDistances.jl lists the other available metrics [here](https://github.com/matthieugomez/StringDistances.jl).
"""

# ╔═╡ 1beab48e-098c-11eb-2923-9d1eba877152
# Pairwise language-distance matrix, filled in below.
matrix = zeros(length(data), length(data))

# ╔═╡ d6a7c740-099c-11eb-3dd7-67041353b85f
# Levenshtein distance: the number of single-character edits.
Levenshtein()("pizza", "pie")

# ╔═╡ 1b8616a0-099d-11eb-33a7-ebd9e4f46ba3
# Jaro distance: normalized to [0, 1].
Jaro()("pizza", "pie")

# ╔═╡ b8b44e80-0ccf-11eb-28c9-89f7608906d0
cognate_sets = 1:207  # all 207 Swadesh concepts
# cognate_sets = [41, 90, 86, 97, 87, 56, 24, 39, 88, 61, 12, 49, 93, 16]  # Dolgopolsky list

# ╔═╡ 5088dbf0-098c-11eb-357d-f1adff3a82e0
# The matrix is symmetric with a zero diagonal, so compute each pair of
# languages once and fill both entries.
for lang1 in 1:length(data)
	for lang2 in lang1+1:length(data)
		dist = 0
		for w in cognate_sets
			dist += Levenshtein()(data[lang1][w], data[lang2][w])  # TODO: distance metric
		end
		dist = dist / length(cognate_sets)
		matrix[lang1, lang2] = dist
		matrix[lang2, lang1] = dist
	end
end

# ╔═╡ c0d50730-098c-11eb-2269-779bc90b5ffb
matrix

# ╔═╡ ac3f3840-0ccf-11eb-290f-e145ac393f03
md"""
Then we perform hierarchical agglomerative clustering. Experiment with different linkage methods (the options are `:single`, `:average`, `:complete`, and `:ward`).
"""

# ╔═╡ c486013e-098c-11eb-15c5-012ad1eeae7a
hc = hclust(matrix, linkage=:average)  # TODO: try other linkage methods

# ╔═╡ d7c7e8d0-0992-11eb-39bb-9f673f00475e
plot(hc)  # dendrogram; leaf labels are indices into `langs`

# ╔═╡ 5fa37a70-09b2-11eb-1002-3f3a6aa20ad4
langs  # look up which language each leaf index refers to

# ╔═╡ 358128c0-0cd0-11eb-2204-e38ba77272fb
md"""
## Homework

Experiment with
- different string distance metrics
- different clustering linkage methods
- other languages (you will have to download your own lists)

How does the resulting dendrogram change? (A commented sketch at the bottom of this file shows one possible starting point.)
"""

# ╔═╡ Cell order:
# ╠═a506b000-0ccd-11eb-06ac-3d168a9911d7
# ╠═807a76c2-0982-11eb-20da-6d7f4ebc7fc8
# ╠═68457400-0983-11eb-1ade-35c608111565
# ╠═5f9f98c0-0984-11eb-270a-d9067069701d
# ╠═4f01caf0-0985-11eb-0e2c-b9f99969179e
# ╟─3dcf4712-0ccf-11eb-3770-c5eb179f473b
# ╠═1beab48e-098c-11eb-2923-9d1eba877152
# ╠═d6a7c740-099c-11eb-3dd7-67041353b85f
# ╠═1b8616a0-099d-11eb-33a7-ebd9e4f46ba3
# ╠═b8b44e80-0ccf-11eb-28c9-89f7608906d0
# ╠═5088dbf0-098c-11eb-357d-f1adff3a82e0
# ╠═c0d50730-098c-11eb-2269-779bc90b5ffb
# ╟─ac3f3840-0ccf-11eb-290f-e145ac393f03
# ╠═c486013e-098c-11eb-15c5-012ad1eeae7a
# ╠═d7c7e8d0-0992-11eb-39bb-9f673f00475e
# ╠═5fa37a70-09b2-11eb-1002-3f3a6aa20ad4
# ╟─358128c0-0cd0-11eb-2204-e38ba77272fb
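
# Sketch for the homework: a minimal, commented-out example of swapping in a
# different string metric and linkage. It is kept as comments so it does not
# clash with the single-definition rule of the reactive cells above; edit the
# existing distance-matrix and hclust cells rather than pasting this in as a
# new cell. `DamerauLevenshtein` and `:ward` are just one possible choice among
# the metrics in StringDistances.jl and the linkages in Clustering.jl.
#
#     metric = DamerauLevenshtein()          # or Jaro(), RatcliffObershelp(), ...
#     for lang1 in 1:length(data), lang2 in lang1+1:length(data)
#         d = sum(metric(data[lang1][w], data[lang2][w]) for w in cognate_sets) /
#             length(cognate_sets)
#         matrix[lang1, lang2] = matrix[lang2, lang1] = d
#     end
#     hc = hclust(matrix, linkage=:ward)     # vs. :single, :average, :complete
#     plot(hc)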