SUBJECT: Re: Multilingual text categorisation

What is the problem, precisely? If it is to categorise the language of documents, then I can see that this is a good approach. However, if it is to apply a set of subject categories to multilingual documents, then obtaining suitable training data will be a more difficult problem. You would need a sample of documents about each topic in each language.

&NAME &NAME &NAME, Open University, &NAME &NAME, &NAME &NAME, &NAME &NAME: &NUM (&NUM) &NUM &NUM &WEBSITE

Both language modeling papers I listed mentioned a simple technique for obtaining training data using a standard search engine such as &NAME, i.e. you can simply specify the language of the documents to be searched for.

&NAME &NAME &NAME &NAME (&EMAIL):

By 'Multilingual &NAME &NAME' I mean the problem of automatically assigning a category to a document, whatever its language. The suggested approach, building separate language models for the different languages, assumes a sufficient quantity of text in each language (and for each category) to be able to learn a model. In real life this is not always the case.

&NAME,

At &NUM:&NUM &NUM/&NUM/&NUM, you wrote:

I suppose if you meant applying text categorization methods to a multi-lingual set of documents, then you could just build separate language models for the different languages, using the same approach as mentioned in the above papers.

&NAME &NAME
School of &NAME
University of &NAME, &NAME

&NAME &NAME &NAME &NAME (&EMAIL):

&NAME &NAME

I would be most grateful if anyone had any references on &NAME &NAME &NAME.

Thank you for your help.

School of Informatics, University of &NAME, &NAME
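
For readers following the thread, here is a minimal sketch of the "separate language models" idea discussed above. It is not taken from the papers mentioned; it assumes character trigram models with add-one smoothing, and the class and function names (NgramModel, train, classify) and the toy sentences are made up for illustration. One model is trained per label, and a new text gets the label whose model assigns it the highest log-probability.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of a lower-cased, space-padded text."""
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramModel:
    """Character n-gram counts for one label (hypothetical helper class)."""

    def __init__(self, texts, n=3):
        self.n = n
        self.counts = Counter(g for t in texts for g in char_ngrams(t, n))
        self.total = sum(self.counts.values())

    def log_prob(self, text, vocab_size):
        # Add-one smoothing so unseen n-grams do not give log(0).
        return sum(
            math.log((self.counts[g] + 1) / (self.total + vocab_size))
            for g in char_ngrams(text, self.n)
        )


def train(labelled_texts, n=3):
    """Build one model per label and a shared vocabulary size."""
    models = {label: NgramModel(texts, n) for label, texts in labelled_texts.items()}
    vocab = set()
    for model in models.values():
        vocab.update(model.counts)
    # The extra 1 reserves some probability mass for unseen n-grams.
    return models, len(vocab) + 1


def classify(text, models, vocab_size):
    """Pick the label whose model gives the text the highest log-probability."""
    return max(models, key=lambda label: models[label].log_prob(text, vocab_size))


# Toy training data, invented for this example only.
toy = {
    "en": ["the cat sat on the mat", "this is a short english sentence"],
    "fr": ["le chat est sur le tapis", "ceci est une courte phrase en français"],
}
models, vocab_size = train(toy)
print(classify("the dog is on the mat", models, vocab_size))
```

Here the labels are languages, but nothing in the sketch depends on that: the keys of the training dictionary could equally be (language, category) pairs. That is where the concern raised in the thread shows up, since every such pair then needs its own sample of training text.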