Before:
I put two corpora on hops:
~brill/WSJ.TAGGED and ~brill/WSJ.PARSED
The first is 1 million words of wall street journal text tagged with
parts of speech.
The second is the same text with parts of speech, and parsed.
These are pretty big, so use them from my directory instead of
copying them if at all possible.
This text is also tokenized (punctuation removed from words, etc), so
for those of you who need statistics from raw text, you might want to
strip the tags off of the WSJ.TAGGED text and use that, instead of
building a tokenizer to deal with other text. With some on-line
novels, your tokenizer would also have to deal with line-break
hyphenation, which will be a mini-pain.
-Eric
Effter:
I poot tvu curpura oon hups:
~breell/VSJ.TEGGED und ~breell/VSJ.PERSED
Zee furst is 1 meelleeun vurds ooff vell street juoornel text tegged veet
perts ooff speech.
Zee secund is zee seme-a text veet perts ooff speech, und persed.
Zeese-a ere-a pretty beeg, su use-a zeem frum my durectury insteed ooff
cupyeeng zeem iff et ell pusseeble-a.
Thees text is elsu tukeneezed (poonctooeshun remufed frum vurds, itc), su
fur thuse-a ooff yuoo vhu need steteesteecs frum rev text, yuoo meeght vunt tu
streep zee tegs ooffff ooff zee VSJ.TEGGED text und use-a thet, insteed ooff
booeeldeeng a tukeneezer tu deel veet oozeer text. Veet sume-a oon-leene-a
nufels, yuoor tukeneezer vuoold elsu hefe-a tu deel veet leene-a-breek
hypheneshun, vheech veell be-a a meenee-peeen.
-Ireec
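
The message above suggests stripping the tags off WSJ.TAGGED to get tokenized raw text. A minimal sketch of that step in Python follows. It assumes each token is written as word/TAG and that tokens are separated by whitespace; the message does not state the file's actual layout, so the separator may need adjusting. The names strip_tags and stripped_lines are hypothetical helpers introduced for illustration.

```python
import os


def strip_tags(tagged_line, sep="/"):
    """Return the words of one tagged line with the tags removed.

    Assumes tokens look like word<sep>TAG (an assumption, not confirmed
    by the original message).
    """
    words = []
    for token in tagged_line.split():
        # rpartition splits on the LAST separator, so a word that itself
        # contains the separator (e.g. "1/2/CD") keeps its internal slash.
        word, found, _tag = token.rpartition(sep)
        # Tokens with no separator, or with nothing before it, pass through.
        words.append(word if found and word else token)
    return " ".join(words)


def stripped_lines(path, sep="/"):
    """Yield tag-free lines, reading the corpus in place instead of copying it."""
    # errors="replace" is a guess at how to handle an unknown file encoding.
    with open(os.path.expanduser(path), errors="replace") as corpus:
        for line in corpus:
            yield strip_tags(line, sep)
```

For example, strip_tags("The/DT cat/NN sat/VBD ./.") gives "The cat sat .", and iterating over stripped_lines("~brill/WSJ.TAGGED") would read the corpus from its original directory one line at a time.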