SUBJECT: Re: &NAME your email / reduce your spam! &NAME reply.

&CHAR On Mon, &NUM May &NUM, &NAME &NAME wrote:

I'm not sure if such a corpus would be of more use to spammers to overcome bogofilter (see feature article in &NAME Scientist about a month ago - hide some genuine text in a spam email to tip the probabilities all wrong), but anyway.

This is already standard practice, though they simply type banal headers and don't bother to model real text statistically. The point of a statistical approach is that it can always renew / adapt its model of spam, but it needs a good idea of what a genuine email looks like too, esp. if the &NUM get closer over time.

It depends very much on the subject. For example, I could send you the contents of my 'supervision' folder, which is full of emails related to supervision organisation. But that might bias things a bit (I mean, you might conclude that genuine email is more likely to contain words like 'supervision' than it actually is).

Bias from any &NUM source is not likely to be a problem as we plan to gather a large corpus -- ideally 250k or so emails. &NAME will get in touch about what you might donate -- thanks.

I'm not &NUM convinced that it's not copyright violation for me to use other people's emails without asking them. I trust the anonymiser removes ALL numbers, &NAME and &NAME &NAME?

Try it and see for yourself -- copyright is in verbatim text and has to be asserted at the time of dissemination. Nevertheless, you are right that there is an issue of privacy, so you should not submit anything that you think your friends would object to once anonymised.

It might make things better if people can anonymise their mail BEFORE sending it to you (rather than trust that you're not accidentally reading the original version). If you

The system relies on having preferably large coherent bodies of emails to do it well.
The site is to give you a taster of what your email should look like when anonymised. We'll only be looking at the anonymised email (very selectively) to check for errors, so at that stage it should be pretty harmless as we won't know whose it is, etc.

It might be quicker just to look at lots of &NAME threads or something like that (I think, at the moment, any thread that is 'big' is not likely to be spam, although it's theoretically possible that spammers will reply to real threads).

This is certainly &NUM source (if manually filtered) but probably not a very good model of the full range of genuine email.

best,
&NAME