### A Pluto.jl notebook ###
# v0.11.12

using Markdown
using InteractiveUtils

# ╔═╡ 3741af70-f3a0-11ea-04c2-a9abd479b93d
begin
	using DataStructures
	using WordTokenizers
end

# ╔═╡ 153bfb20-f3a4-11ea-0159-13a3814927c0
md"# Conditional Probability for N-gram Language Models"

# ╔═╡ 506f4d92-f3a0-11ea-0bdb-37bf5bd00c48
md"""
We already talked about probability as counting.

$P(A) = \frac{count(A)}{count(total)}$
"""

# ╔═╡ f1932250-f3a0-11ea-3ea3-e736d285cda0
md"""
Let's roll a fair die. What is the probability of rolling a 2?

$P(X = 2) = \frac{count(X = 2)}{\# outcomes} = \frac{1}{6}$
"""

# ╔═╡ 4a349b60-f3a0-11ea-1331-d75f5b83fc40
outcomes = [1, 2, 3, 4, 5, 6]

# ╔═╡ a883ac5e-f3a0-11ea-3bb4-d9f20c9f0a80
function istwo(x)
	return x == 2
end

# ╔═╡ 7085df40-f3a0-11ea-385a-294a348388ff
count(istwo, outcomes)

# ╔═╡ e9758280-f3a4-11ea-2d6f-d90175811293
length(outcomes)

# ╔═╡ ea12fb00-f3a4-11ea-0078-05a7bf8480da
count(istwo, outcomes) / length(outcomes)

# ╔═╡ 62af6ed0-f3a1-11ea-1e4c-39da90679f4b
md"""
What is the probability of rolling an even number?

$P(X \text{ is even}) = \frac{count(X \text{ is even})}{\# outcomes} = \frac{3}{6}$
"""

# ╔═╡ b8533840-f3a0-11ea-38b0-130a29c1d807
function iseven(x)
	return x % 2 == 0
end

# ╔═╡ 07b6b7f0-f3a5-11ea-1e34-499443ec423c
count(iseven, outcomes)

# ╔═╡ 0c4d5710-f3a5-11ea-3565-2b3f7841d559
length(outcomes)

# ╔═╡ bfdcd8a0-f3a0-11ea-0c90-43c4a958b396
count(iseven, outcomes) / length(outcomes)

# ╔═╡ c525fee0-f3a0-11ea-2f9e-9b3eb45dcc63
md"""
Here's the formula for **conditional probability**.

$P(A | B) = \frac{P(A \text{ and } B)}{P(B)}$
"""

# ╔═╡ 428a5e30-f3a1-11ea-1581-e78bd34ef24d
md"""
What is the probability of rolling a 2, **given** that you rolled an even number?

A = the event that you roll a 2

B = the event that you roll an even number

$P(X = 2 | X \text{ is even}) = \frac{P(X = 2 \text{ and } X \text{ is even})}{P(X \text{ is even})} = \frac{ 1/6 }{ 3/6 } = \frac{1}{3}$
"""

# ╔═╡ 2a047c50-f3a2-11ea-3e83-1f6ad32323ed
even_outcomes = filter(iseven, outcomes)

# ╔═╡ 716a0b70-f3a5-11ea-0b02-0bc70b5b4648
count(istwo, even_outcomes)

# ╔═╡ b87af242-f3a5-11ea-3672-fb1336cd450a
count(x -> iseven(x) && istwo(x), outcomes)

# ╔═╡ 75271370-f3a5-11ea-3409-fbc71682cf4a
length(even_outcomes)

# ╔═╡ 31655f00-f3a2-11ea-23cc-71563faadaf2
count(x -> iseven(x) && istwo(x), outcomes) / length(even_outcomes)

# ╔═╡ 6cfbbc30-f3a2-11ea-18b3-4d40008745ac
md"Now let's apply this to text."

# ╔═╡ 755779a0-f3a2-11ea-3690-b1209db9e473
begin
	text = read("p&p.txt", String)
	tokens = tokenize(text)
	bigrams = [tokens[i:i+1] for i in 1:length(tokens) - 1]
end

# ╔═╡ bc57e5b0-f3a2-11ea-2a01-b709c8b26346
md"""
If I pick a random word, what is the probability that it is "am"?

$P(word = am) = \frac{count(am)}{\# tokens}$
"""

# ╔═╡ 01ba0b10-f3a3-11ea-28cc-11cfe9fde927
is_am(x) = x == "am"

# ╔═╡ ee5d60d0-f3a2-11ea-1463-15f1b62fb331
count(is_am, tokens)

# ╔═╡ f86b1400-f3a2-11ea-0a3b-377fbdb78bc4
length(tokens)

# ╔═╡ 0a344d00-f3a3-11ea-2710-71c365b5f01e
count(is_am, tokens) / length(tokens)

# ╔═╡ 10c8b7f0-f3a3-11ea-3d61-ed0258a4f676
md"""
If I pick two consecutive words (bigrams), what is the probability that I randomly pick "I am"?

$P(bigram = I\ am) = \frac{count(I\ am)}{\# bigrams}$
"""

# ╔═╡ 3dd41730-f3a3-11ea-1908-a74af8e67289
is_Iam(x) = x == ["I", "am"]

# ╔═╡ 4964c0e0-f3a3-11ea-38b0-91678c27f99e
count(is_Iam, bigrams)

# ╔═╡ 4f47dd30-f3a3-11ea-322d-27468c467f96
length(bigrams)

# ╔═╡ 549bd8e0-f3a3-11ea-3819-e1c9414f4217
count(is_Iam, bigrams) / length(bigrams)

# ╔═╡ 5ac3ed70-f3a3-11ea-3a1d-e557d1ad87a2
md"""
If I pick two consecutive words (bigrams), what is the probability that I randomly pick "am" as the second word, **given** that I already picked "I" as the first word?

$P(w_2 = am | w_1 = I) = \frac{count(w_1 = I \text{ and } w_2 = am) }{ count(w_1 = I) }$
"""

# ╔═╡ 971c8932-f3a3-11ea-0c9a-717680a2925a
count(is_Iam, bigrams)

# ╔═╡ a5992f40-f3a3-11ea-2781-ed96e6765be6
first_is_I(bigram) = bigram[1] == "I"

# ╔═╡ b0af78d0-f3a3-11ea-0b72-35d8f05f0015
count(first_is_I, bigrams)

# ╔═╡ b5893b70-f3a3-11ea-0846-8dc20f0afeca
count(is_Iam, bigrams) / count(first_is_I, bigrams)

# ╔═╡ d23a33f0-f3a3-11ea-172f-a10f1d60406c
I_bigrams = filter(first_is_I, bigrams)

# ╔═╡ f4099930-f3a3-11ea-3372-e55fe689ed5b
sort(collect(counter(I_bigrams)), by=x -> -x[2])

# ╔═╡ Cell order:
# ╟─153bfb20-f3a4-11ea-0159-13a3814927c0
# ╠═3741af70-f3a0-11ea-04c2-a9abd479b93d
# ╟─506f4d92-f3a0-11ea-0bdb-37bf5bd00c48
# ╟─f1932250-f3a0-11ea-3ea3-e736d285cda0
# ╠═4a349b60-f3a0-11ea-1331-d75f5b83fc40
# ╠═a883ac5e-f3a0-11ea-3bb4-d9f20c9f0a80
# ╠═7085df40-f3a0-11ea-385a-294a348388ff
# ╠═e9758280-f3a4-11ea-2d6f-d90175811293
# ╠═ea12fb00-f3a4-11ea-0078-05a7bf8480da
# ╟─62af6ed0-f3a1-11ea-1e4c-39da90679f4b
# ╠═b8533840-f3a0-11ea-38b0-130a29c1d807
# ╠═07b6b7f0-f3a5-11ea-1e34-499443ec423c
# ╠═0c4d5710-f3a5-11ea-3565-2b3f7841d559
# ╠═bfdcd8a0-f3a0-11ea-0c90-43c4a958b396
# ╟─c525fee0-f3a0-11ea-2f9e-9b3eb45dcc63
# ╟─428a5e30-f3a1-11ea-1581-e78bd34ef24d
# ╠═2a047c50-f3a2-11ea-3e83-1f6ad32323ed
# ╠═716a0b70-f3a5-11ea-0b02-0bc70b5b4648
# ╠═b87af242-f3a5-11ea-3672-fb1336cd450a
# ╠═75271370-f3a5-11ea-3409-fbc71682cf4a
# ╠═31655f00-f3a2-11ea-23cc-71563faadaf2
# ╟─6cfbbc30-f3a2-11ea-18b3-4d40008745ac
# ╠═755779a0-f3a2-11ea-3690-b1209db9e473
# ╟─bc57e5b0-f3a2-11ea-2a01-b709c8b26346
# ╠═01ba0b10-f3a3-11ea-28cc-11cfe9fde927
# ╠═ee5d60d0-f3a2-11ea-1463-15f1b62fb331
# ╠═f86b1400-f3a2-11ea-0a3b-377fbdb78bc4
# ╠═0a344d00-f3a3-11ea-2710-71c365b5f01e
# ╟─10c8b7f0-f3a3-11ea-3d61-ed0258a4f676
# ╠═3dd41730-f3a3-11ea-1908-a74af8e67289
# ╠═4964c0e0-f3a3-11ea-38b0-91678c27f99e
# ╠═4f47dd30-f3a3-11ea-322d-27468c467f96
# ╠═549bd8e0-f3a3-11ea-3819-e1c9414f4217
# ╟─5ac3ed70-f3a3-11ea-3a1d-e557d1ad87a2
# ╠═971c8932-f3a3-11ea-0c9a-717680a2925a
# ╠═a5992f40-f3a3-11ea-2781-ed96e6765be6
# ╠═b0af78d0-f3a3-11ea-0b72-35d8f05f0015
# ╠═b5893b70-f3a3-11ea-0846-8dc20f0afeca
# ╠═d23a33f0-f3a3-11ea-172f-a10f1d60406c
# ╠═f4099930-f3a3-11ea-3372-e55fe689ed5b