SUBJECT: Very rare terms elimination

Dear all,

I have seen in several papers that very rare terms (e.g. those occurring in fewer than &NUM to &NUM documents) are often removed from the document representation for text categorization.

I need to do the same, because I am now working with a large number of documents on a binary classification problem, and indexing the documents for learning is quite slow. According to &NAME's law, if you delete the terms occurring in fewer than &CHAR documents, you are left with far fewer terms and indexing becomes much faster.

After indexing, I apply term selection with Information Gain. I plan to keep about &NUM to &NUM of the original terms. If I first delete the terms occurring in fewer than &CHAR documents, I will probably not lose any top-scoring term, but I would like a theoretical justification so that I can do it safely.

So, is there any theoretical result that justifies deleting terms occurring in fewer than &CHAR documents, and that guarantees no loss of information for a rather balanced and large binary classification dataset?

Thank you,

&NAME &NAME &NAME &NAME &NAME &NAME &NAME &NAME &NAME &NAME &NAME &NAME &NAME &NAME &NUM - &NAME &NAME &NAME - &NAME ( &NUM ) &NUM &EMAIL
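
P.S. To make the procedure concrete, here is a minimal Python sketch of the two steps described in this message: pruning by document frequency, then scoring the surviving terms with Information Gain. It is only an illustration, not my actual indexing code. The function names, the toy documents and the threshold min_df=2 are all made up for the example, and Information Gain is computed here on term presence/absence in each document.

```python
import math
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): count a term once per document
    return df


def prune_rare_terms(docs, min_df):
    """Drop every term that occurs in fewer than min_df documents."""
    df = document_frequency(docs)
    return [[t for t in tokens if df[t] >= min_df] for tokens in docs]


def entropy(pos, neg):
    """Entropy in bits of a set with pos positive and neg negative items."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:  # empty classes contribute nothing (0 * log 0 := 0)
            p = count / total
            result -= p * math.log2(p)
    return result


def information_gain(term, docs, labels):
    """IG of the presence/absence of term with respect to binary labels."""
    with_pos = with_neg = without_pos = without_neg = 0
    for tokens, label in zip(docs, labels):
        if term in tokens:
            if label:
                with_pos += 1
            else:
                with_neg += 1
        elif label:
            without_pos += 1
        else:
            without_neg += 1
    n = len(docs)
    n_with = with_pos + with_neg
    n_without = n - n_with
    h_class = entropy(with_pos + without_pos, with_neg + without_neg)
    h_cond = (n_with / n) * entropy(with_pos, with_neg) \
        + (n_without / n) * entropy(without_pos, without_neg)
    return h_class - h_cond


def rank_by_information_gain(docs, labels):
    """Return (term, IG) pairs, best first; ties are broken alphabetically."""
    vocabulary = sorted({t for tokens in docs for t in tokens})
    scores = {t: information_gain(t, docs, labels) for t in vocabulary}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy corpus (hypothetical): label 1 = positive class, 0 = negative class.
docs = [
    ["cheap", "pills", "offer"],
    ["cheap", "offer", "now"],
    ["meeting", "agenda", "offer"],
    ["meeting", "notes", "agenda"],
]
labels = [1, 1, 0, 0]

MIN_DF = 2  # arbitrary toy threshold, not the value discussed above
pruned = prune_rare_terms(docs, MIN_DF)
ranking = rank_by_information_gain(pruned, labels)
```

Running rank_by_information_gain on both docs and pruned and comparing the top of the two rankings is an empirical way to see whether the pruning removed any high-scoring term on a given collection. It does not replace the theoretical guarantee I am asking about.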