Homework:

Nice reference here: http://www.math.utah.edu/docs/info/gawk_11.html

(with awk)

Words in common:

Easiest:
    cat wsj.types genesis.types | sort | uniq -c | awk '$1>=2 {print $2}' > wsj.genesis.types.inboth

Another easy option:
    cat genesis.types wsj.types wsj.types | sort | uniq -c | awk '$1==3 {print $2}' | head

OR, awk only options:
    awk 'NR==FNR{a[$0]=$0;next} ($0 in a)' wsj.types genesis.types | head
(The test is ($0 in a) rather than a[$0] so that a type such as 0, which awk treats as false, is still printed.)

OR:
    awk 'FILENAME=="genesis.types" {genF[$1]=1}
         FILENAME=="wsj.types" {wsjF[$1]=1}
         END {for (f in genF) if (wsjF[f]>0) print f}' genesis.types wsj.types

WSJ words only:

Easiest, using common words file:
    cat wsj.genesis.types.inboth wsj.types | sort | uniq -c | awk '$1==1 {print $2}' | head

Another easy option:
    cat genesis.types genesis.types wsj.types | sort | uniq -c | awk '$1==1 {print $2}' | head

OR, awk only options:
    awk 'NR==FNR{a[$0]=$0;next}!($0 in a)' genesis.types wsj.types

OR:
    awk 'FILENAME=="genesis.types" {genF[$1]=1}
         FILENAME=="wsj.types" {wsjF[$1]=1}
         END {for (f in wsjF) if (genF[f]!=1) print f}' genesis.types wsj.types

Genesis words only:

Easiest, using common words file:
    cat wsj.genesis.types.inboth genesis.types | sort | uniq -c | awk '$1==1 {print $2}' | head

Another easy option:
    cat genesis.types wsj.types wsj.types | sort | uniq -c | awk '$1==1 {print $2}' | head

OR, awk only options:
    awk 'NR==FNR{a[$0]=$0;next}!($0 in a)' wsj.types genesis.types

OR:
    awk 'FILENAME=="genesis.types" {genF[$1]=1}
         FILENAME=="wsj.types" {wsjF[$1]=1}
         END {for (f in genF) if (wsjF[f]!=1) print f}' genesis.types wsj.types

(with comm)

WSJ words only:
    comm -13 genesis.words.uniq wsj.frag.words.uniq | wc -l

Genesis words only:
    comm -23 genesis.words.uniq wsj.frag.words.uniq | wc -l

Words in common:
    comm -12 genesis.words.uniq wsj.frag.words.uniq | wc -l

-----------------

Hashtables:
    cat genesis.words | awk '{freq[$0]++}; END {for (w in freq) print freq[w], w}' | sort -nr | less

Calculate MI:
    cat genesis.hist genesis.bigrams.hist |
      awk 'NF==2 { f[$2]=$1 }
           NF==3 { print log((38516*$1)/(f[$2]*f[$3])), $2, $3 }' |
      sort -nr | less
(38516 is the constant N in these formulas, presumably the number of tokens in genesis.words. awk's log() is the natural logarithm; divide by log(2) for values in bits.)

Calculate MI only for bigrams with freq>5:
    cat genesis.hist genesis.bigrams.hist |
      awk 'NF==2 { f[$2]=$1 }
           NF==3 && $1>5 { print log((38516*$1)/(f[$2]*f[$3])), $2, $3 }' |
      sort -nr | less

Calculate t:
    cat genesis.hist genesis.bigrams.hist |
      awk 'NF==2 { f[$2]=$1 }
           NF==3 { print ($1 - ((1/38516)*f[$2]*f[$3])) / sqrt($1), $2, $3 }' |
      sort -nr | less
(On the bigram lines $2 and $3 are the words themselves, so the expected count has to use their unigram frequencies f[$2] and f[$3].)

Join program in awk:
    awk 'FILENAME=="genesis.hist" {genF[$2]=$1}
         FILENAME=="wsj.frag.hist" {wsjF[$2]=$1}
         END {for (f in genF) if (wsjF[f]>0) print f, genF[f], wsjF[f]}' genesis.hist wsj.frag.hist

Now with join command:
    sort -k2 wsj.frag.hist > tempwsj
    sort -k2 genesis.hist > tempgen
    join -1 2 -2 2 tempwsj tempgen

To get lower case histogram files:
    tr 'A-Z' 'a-z' < genesis.words | sort | uniq -c | head

KWIC (keyword in context)

Make a file with content, call it anything (mine is 'newinput'):
    awk '{for (i=1; i myoutputfile)

3. Now you have a text file to manipulate in the terminal using awk, etc. Make a concordance on the same file and word with awk.
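
One possible way to approach the concordance exercise is sketched below. It is only an illustration, not the intended solution: the function name kwic, the default context width of 3 words, and the bracket formatting are all assumptions made here. The keyword is matched exactly (case and attached punctuation matter), and context does not cross line boundaries.

```shell
# kwic WORD FILE [WIDTH]  (hypothetical helper, not part of the original notes)
# Print each occurrence of WORD in FILE with up to WIDTH words of context
# on either side, taken from the same input line. WIDTH defaults to 3.
kwic() {
    awk -v kw="$1" -v width="${3:-3}" '
    {
        for (i = 1; i <= NF; i++) {
            if ($i != kw) continue          # exact match on the whole field
            left = ""; right = ""
            # collect up to width words before the keyword
            for (j = i - width; j < i; j++)
                if (j >= 1) left = left $j " "
            # collect up to width words after the keyword
            for (j = i + 1; j <= i + width && j <= NF; j++)
                right = right " " $j
            printf "%s[%s]%s\n", left, $i, right
        }
    }' "$2"
}
```

With the file from above this could be called as: kwic God newinput 4. Piping the result through sort groups lines that share the same left context.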