SUBJECT: Re : [ &NAME ] Corpus Sanitation
Another comment on anonymisation . The problem is even worse if one wishes to make available ( for whatever purpose ) the audio or video tapes from which transcriptions have been prepared . I believe that considering this more challenging case also clarifies issues for text-only corpora . I 'll assume video , which is the extreme point . There are &NUM problems with video . No amount of signal manipulation , however small , preserves the full scientific usefulness of the data . On the other hand , no reasonable amount of ' anonymisation ' , however large , really ensures anonymity . The first point is obvious if one contemplates trying to do psycholinguistic experiments with the data . It would , for example , seriously compromise a comprehension study if proper names were bleeped out , or even replaced by others . No psychology reviewer would ever accept that such data is as naturalistic as untreated data . The second point is almost as obvious , because humans are adept at inferring personal identity from all kinds of things , including voice quality , ear shape , gait and so on . Therefore , whatever &NUM does , short of blanking everything out , it is difficult to credibly claim that the risk of unintended identification has been avoided . &NAME &NAME 's &NAME documentation makes the case that such identification is highly undesirable . The only way I can see to handle this is to deal with the problem at the outset by making completely clear to participants what will happen to their data , and obtaining informed consent . If this is not done , the data is effectively lost to responsible researchers , and cannot be used except at risk of infringing participants ' rights . If , as in the case &NAME has , promises have been made to participants , those promises must be honoured . It may or may not be possible to recover data that is both useful and distributable under these circumstances .
The same difficulties are also present for audio ( the &NAME audio has never been distributed , even though it exists ) . The risk of identification is just too great and the consequences of that too severe to be acceptable . Although the visual cues to identity are absent , speaking style persists . One may feel that the risks are less , but they still exist . And , here 's the rub , the same arguments apply , albeit more weakly , to text-only corpora . While voice quality is now absent , substantial cues to personal identity may persist in lexis and other idiosyncrasies , not to mention that people are extremely adept at reconstructing material from context . Once again , the risks are arguably less than in the other media , but they still exist . So , notwithstanding valiant efforts to anonymise in such a way that the scientific usefulness of the data is preserved , the original decision to promise anonymity comes back to haunt us . I lean to the view that there is no difference of principle between the different media . Is even &NAME 's rigorous approach to anonymization enough in practice ? Perhaps , but that depends on a very iffy judgement call . The lesson seems to be that great care is needed in collecting informed consent for corpus work . None of this addresses the additional point made in &NAME 's post about collateral damage to people and organisations not involved in the recording . I could imagine a prosecution against both participants and corpus distributors for defamation or slander . That would be bad . Perhaps corpus collectors need to indemnify participants against this , or perhaps it suffices to warn people that they are ( in effect ) speaking in a public place . Or perhaps we have a duty of care to ensure that our participants do not put themselves at risk ( doubly likely since many corpora include contributions by children ) . 
And that leaves aside the much more likely cases where nasty stuff in the corpus evokes resentment and unhappiness , but not enough to lead to prosecutions .
&NAME Dr. &NAME &NAME , Assistant Professor of Computational Linguistics Department of Linguistics , &NUM &NAME Avenue , &NAME OH &NUM &NAME : &NUM &NUM &NUM &NAME : &NUM &NUM &NUM Web : &WEBSITE
On &NAME &NAME 's posting , to me there is an important difference between ' bad language ' and individuals ' names , or information that could lead to identification of individuals . Like &NAME &NAME , I do n't believe there is any real reason to censor the ' bad language ' ; it is important linguistic data , and we are all grown-ups . But I do think that before such a resource is made public , strenuous efforts should be made to eliminate any possibility of users identifying either the individuals who produced the material , or any individuals or individual institutions written about . Actually , under various national Data Protection laws I suspect it might be illegal not to do this , even if the material is simply held at &NUM institution and not circulated . But it ought to be done anyway , for reasons that I discuss at some length in the ' ethics ' section of the documentation file accompanying my &NAME Corpus ( available via the Web , from my home page &WEBSITE CHRISTINE ) . I discuss there what seems to me to have been inadequate practice in this respect in the spoken section of the British National Corpus . There are places where really damaging things are said in a quite casual way in conversation about people , or organizations , who / which might easily be identified by people who know them ( and could probably be identified by strangers with only minimal detective work ) . The recorded speakers had no motive to worry about this , but I believe corpus &NAME have a responsibility not to let such casual gossip about identifiable people be turned into permanent public records .
&NAME &NAME Prof. &NAME &NAME MA &NAME &NAME Professor of Natural Language &NAME School of Cognitive & Computing Sciences University of &NAME &NAME , &NAME &NAME &NAME , &NAME e-mail &EMAIL ( no attachments please ) &NAME . &NUM &NUM &NUM fax &NUM &NUM &NUM web &WEBSITE