Comparing Corpora

A workshop to be held in conjunction with
The 38th Annual Meeting of the Association for Computational Linguistics
7th October 2000
Hong Kong Convention and Exhibition Centre
Hong Kong University of Science and Technology


Registration details: http://www.cs.ust.hk/acl2000/.


PROGRAMME

14.00 Opening Remarks (Co-chair)  
14.15 Paul Rayson and Roger Garside, Lancaster University, UK: Comparing Corpora Using Frequency Profiling
14.40 George Tambouratzis, Stella Markantonatou, Nikolaos Hairetakis, Marina Vassiliou, Dimitrios Tambouratzis and George Carayannis, ILSP, Athens, Greece: Discriminating the registers and styles in the Modern Greek Language
15.05 Patrick Ruch and Arnaud Gaudinat, Geneva University Hospital and University of Geneva, Switzerland: Comparing Corpora and Lexical Ambiguity
15.30 Coffee break
15.45 Chikashi Nobata, Nigel Collier and Jun'ichi Tsujii, Kansai Advanced Research Center and University of Tokyo, Japan: Comparison between Tagged Corpora for the Named Entity Task
16.10 Douglas Roland, Daniel Jurafsky, Lise Menn, Susanne Gahl, Elizabeth Elder and Chris Riddoch, Colorado and Harvard Universities, USA: Verb Subcategorization Frequency Differences between Business-News and Balanced Corpora: the role of verb sense
16.35 Discussion: The role and importance of comparing corpora: the way forward
17.00 Close  

Abstracts

Paul Rayson and Roger Garside, Lancaster University, UK
Comparing Corpora Using Frequency Profiling

This paper describes a method of comparing corpora that uses frequency profiling. The method can be used to discover key words that differentiate one corpus from another. With annotated corpora, it can also be applied to discover key grammatical or word-sense categories. It offers a quick way in to finding the differences between the corpora and is shown to have applications in the study of social differentiation in the use of English vocabulary, the profiling of learner English, and document analysis in the software engineering process.


George Tambouratzis et al., ILSP, Athens, Greece
Discriminating the registers and styles in the Modern Greek Language

This paper reports on the discrimination of registers and styles in written Modern Greek. Our research has focused on Modern Greek political speech as recorded in the Greek Parliament Proceedings and investigates (i) the relationship between this register and other registers such as fiction and academic prose, and (ii) the variation of styles within this register. The application of clustering techniques indicates that the political speech texts form a cluster distinct from other registers. The use of discriminant analysis techniques indicates that the styles of individual speakers within the political speech register may be discriminated with a high degree of accuracy.


Patrick Ruch and Arnaud Gaudinat, Geneva University Hospital and University of Geneva, Switzerland
Comparing Corpora and Lexical Ambiguity

In this paper we compare two types of corpus, focusing on the lexical ambiguity of each. The first corpus consists mainly of newspaper articles and literature excerpts, while the second belongs to the medical domain. To conduct the study, we use two different disambiguation tools. We first verify the performance of each system in its respective application domain. We then use these systems to assess and compare both the general ambiguity rate and the particularities of each domain. Quantitative results show that medical documents are lexically less ambiguous than unrestricted documents. Our conclusions show the importance of the application area in the design of NLP tools.


Chikashi Nobata, Nigel Collier and Jun'ichi Tsujii, Kansai Advanced Research Center and University of Tokyo, Japan
Comparison between Tagged Corpora for the Named Entity Task

We present two measures for comparing corpora, based on information-theoretic statistics such as gain ratio as well as simple term-class frequency counts. We tested the predictions made by these measures about corpus difficulty in two domains (news and molecular biology), using the results of two widely used paradigms for the NE task, decision trees and HMMs, and found that gain ratio was the more reliable predictor.


Douglas Roland, Daniel Jurafsky, Lise Menn, Susanne Gahl, Elizabeth Elder and Chris Riddoch, Colorado and Harvard Universities, USA
Verb Subcategorization Frequency Differences between Business-News and Balanced Corpora: the role of verb sense

We explore the differences in verb subcategorization frequencies across several corpora in an effort to obtain stable cross-corpus subcategorization probabilities for use in norming psychological experiments. For the 64 single-sense verbs we looked at, subcategorization preferences were remarkably stable between British and American corpora, and between balanced corpora and financial news corpora. Where verbs did show differences, these were generally found between the balanced corpora and the financial news data. We show that all or nearly all of these shifts in subcategorization are realised via (often subtle) word sense differences. This is an interesting observation in itself, and also suggests that stable cross-corpus subcategorization frequencies may be found when verb sense is adequately controlled.



Adam Kilgarriff
Last modified: Fri Sep 22 20:45:19 BST 2000