SUBJECT: Re: Multilingual text categorisation

I use the CLEF corpora that I have managed to obtain for the problem of multilingual text categorisation. I do not know whether this choice of corpus will affect the results. Note that it is very difficult to obtain corpora for this task.

&NAME

----- Original Message -----
From: &NAME &NAME (&EMAIL)

&NAME,

What corpus are you using for your experiments? I think that the second solution will work well if you are using a general corpus. However, one of the problems I have observed when using &NAME in other domains (such as medicine or biology) is that the translation of domain-specific terms is very poor. In that case you will need a domain-specific multilingual dictionary that can be included in the &NAME system.

&NAME &NAME

On Thursday &NUM April &NUM, &NUM:&NUM am, &NAME &NAME wrote:

Dear &NAME,

Some days ago I sent a message asking whether you know of references on the categorization of multilingual texts. I thank all the people who answered my question. I asked this question because I have been interested in this subject for some months. According to the answers received, I can say that there is little work on this subject compared with the work in &NAME (cross-language information retrieval).

The problem that I am treating is as follows. Suppose that we have corpora in several languages. The subjects of these corpora are comparable. Each text in these corpora is labelled with &NUM or more classes (subjects). They are newspaper stories in several languages covering different subjects such as 'Conflict in &NAME', 'Conflict of Interests in &NAME', '&NAME &NAME &NAME', 'Destruction of Ukrainian nuclear weapons', '&NAME American car industry', ... The objective is to apply text categorization techniques to these corpora. Our hope is to be able to identify the subject of each new text whatever its language.

At the beginning I had &NUM possible solutions. Suppose that &NAME &NUM, ..., &CHAR, where &CHAR is the number of languages, and &NAME &NUM, ..., &NAME, where C_l_i is the number of classes for the language &NAME:

&NUM. The first solution is to learn a model for each language; thus if I have &NUM languages, I must learn &NUM models (&NUM for each). Then, in the classification phase, for each new text one identifies its language and applies the model of that language to it. This is the simplest and most direct solution, but it is limited because it supposes that a sufficient quantity of texts is available for each class and each language, and it requires &CHAR separate trainings.

&NUM. The second solution is to learn on only &NUM language &NAME (as we usually do); then, for each new text, one identifies its language, translates it with a machine translator into the language &NAME, and applies the model learned for the language &NAME to the translated text.

(A sketch of both pipelines appears after this message.)

I made experiments with the &NUM approaches and, according to the results obtained, the second solution performs well. Therefore the introduction of machine translation into the categorization process helps to recognize the subject of a text. All comments, help or ideas are welcome.

&NAME,

Dr. &NAME &NAME &NAME
School of Informatics
Department of Library and Information Studies
&NUM &NAME &NAME
&NAME, NY &NUM
&NAME: (&NUM) &NUM ext. &NUM
&NAME: (&NUM) &NUM
&NAME &NAME &NAME Laboratory
&NAME &NUM university
&NAME - &NAME
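
The following is a minimal sketch of the two solutions described in the message above, not the original experimental setup. It assumes a scikit-learn bag-of-words classifier, one label per text for simplicity, and two hypothetical helpers, detect_language() and translate(), standing in for whatever language identifier and machine-translation system are actually used.

    # Sketch of the two multilingual categorization pipelines (assumptions noted above).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline


    def detect_language(text):
        """Placeholder: stand-in for a real language-identification tool."""
        raise NotImplementedError("plug in a language identifier here")


    def translate(text, src, tgt):
        """Placeholder: stand-in for a real machine-translation system."""
        raise NotImplementedError("plug in a machine-translation system here")


    def train_classifier(texts, labels):
        """Train one bag-of-words classifier on labelled texts."""
        model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        model.fit(texts, labels)
        return model


    # --- Solution 1: one model per language (L models, L trainings) ---
    def train_per_language(corpora):
        """corpora: dict mapping language code -> (texts, labels)."""
        return {lang: train_classifier(texts, labels)
                for lang, (texts, labels) in corpora.items()}


    def classify_per_language(models, text):
        lang = detect_language(text)            # identify the language of the new text
        return models[lang].predict([text])[0]  # apply that language's model


    # --- Solution 2: one pivot-language model plus machine translation ---
    def train_pivot(pivot_texts, pivot_labels):
        return train_classifier(pivot_texts, pivot_labels)


    def classify_pivot(model, text, pivot_lang="en"):
        lang = detect_language(text)
        if lang != pivot_lang:
            text = translate(text, src=lang, tgt=pivot_lang)  # translate into the pivot language
        return model.predict([text])[0]          # classify the translated text

The design trade-off matches the message: solution 1 needs enough labelled texts per class in every language and L trainings, while solution 2 needs labelled data only in the pivot language but inherits any errors made by the translation step, which is why a general corpus (or a domain-specific dictionary) matters for solution 2.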