SUBJECT: Re: rasp subcat lexicon

There are &NUM verbs in &NAME, so it would be interesting to know how well these map to the frequency ranking from &NAME -- maybe we should use Comlex as our list of verbs to build entries for?

Yes, that would also enable me to use a consistent method for lexicon building / filtering. We may want to omit some of the low-frequency Comlex verbs, though...

I made some calculations (the P.S. below sketches the sort of counting involved). There are now &NUM new files in /usr/groups/dict/subcat/rasp_lexicon:

&NUM ) in-comlex, which contains &NAME frequencies for the &NUM Comlex verbs:
  &NUM verbs occur more than &NUM times
  &NUM verbs occur less than &NUM times

&NUM ) not-comlex, which contains frequencies for the &NUM,&NUM non-Comlex verbs in &NAME:
  &NUM verbs occur more than &NUM times (most are &NAME spellings or noise from the tagger / parser)
  &NUM,&NUM verbs occur less than &NUM times

There are another &NUM which occur between &NUM times, and &NUM which occur &NUM times -- how do we know that &NUM is the correct cut-off? Maybe we should be thinking more in terms of &NUM examples per verb, so that we can allow for parse failures etc.?

Yes, I meant &NUM occurrences after parsing (i.e. after failures). The frequencies above were calculated from parsed data, so to end up with &NUM, we'd need some &NUM examples per verb as input to parsing. That's really the minimum: the accuracy of both subcat acquisition and clustering drops sharply if we use fewer than &NUM parsed sentences (not sure about WSD / selectional preferences, but I'd imagine it's the same). Generally, of course, the more data the better. In my recent experiments, I got the best subcat and clustering results with my biggest dataset -- &NUM examples per verb.

For the *really* frequent verbs, should we use all the data we have in &NAME, or do we set some upper bound?

Perhaps this would be a sensible way to proceed:

- Use some correlation of the &NAME ranking and Comlex to choose 5-6k verbs that we should do (i.e. add non-Comlex verbs that occur more than &NUM times in &NAME).
- Get all the relevant sentences from &NAME, (re)parse them, and extract patternsets.
- Get more sentences for the verbs underrepresented in &NAME (&NUM examples?) by creating queries to &NAME designed to find pdf/ps documents containing them, then feed those sentences through pdftotext / pstoascii, some special preprocessing, and then RASP (I think I can set up the pipeline to parse and extract patternsets -- rough sketch in the P.P.S. below). It would be good to check manually with &NAME, for a small sample of underrepresented candidates, that we'll find enough relevant sentences this way...
- Proceed as before.

Any thoughts?

This sounds good. I can do the manual checking for &NUM frequency ranges of (samples of) underrepresented verbs, e.g. by next week.

&NAME
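
P.S. For the cut-off bookkeeping, something like the following would do. It's an untested sketch: the filename, the whitespace-separated "verb count" format, and the CUTOFF value are all stand-ins for the real files and threshold under rasp_lexicon.

    #!/usr/bin/env python3
    # Untested sketch: split a verb-frequency list on a cut-off.
    # "in-comlex" and CUTOFF are placeholders -- adjust to the real data.
    CUTOFF = 100  # stand-in for the &NUM threshold discussed above

    above, below = [], []
    with open("in-comlex") as f:
        for line in f:
            verb, count = line.split()
            (above if int(count) > CUTOFF else below).append(verb)

    print(len(above), "verbs occur more than", CUTOFF, "times")
    print(len(below), "verbs occur", CUTOFF, "times or fewer")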
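
P.P.S. A rough, untested sketch of the retrieval-and-parse step: pdftotext is the standard converter (ps documents would go through pstoascii analogously), but the "rasp.sh" driver name, the preprocessing, and the patternset hook are placeholders for whatever we actually run locally.

    #!/usr/bin/env python3
    # Untested pipeline sketch: pdf -> text -> preprocess -> RASP.
    # "rasp.sh" is a placeholder for the local RASP driver script;
    # patternset extraction would hang off the parsed output.
    import subprocess
    import sys
    from pathlib import Path

    def to_text(doc):
        # pdftotext writes to stdout when "-" is given as the output file
        return subprocess.run(["pdftotext", str(doc), "-"],
                              capture_output=True, text=True, check=True).stdout

    def preprocess(text):
        # stand-in for the "special preprocessing": rejoin hyphenated
        # line breaks and collapse stray whitespace
        return " ".join(text.replace("-\n", "").split())

    def rasp_parse(sentences):
        # placeholder invocation -- adjust to the local RASP installation
        return subprocess.run(["rasp.sh"], input=sentences,
                              capture_output=True, text=True, check=True).stdout

    for doc in map(Path, sys.argv[1:]):
        parsed = rasp_parse(preprocess(to_text(doc)))
        print(doc, len(parsed.splitlines()), "parsed lines")
        # ...extract patternsets from `parsed` here...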