SUBJECT: Re: emails

At the moment I am taking non-spam emails on a personal-use-only basis. If I get to the point where distributing the corpus for wider use is a viable option, I will need to go back to the people I got the emails from and get specific permission.

My current (quite underdeveloped) thoughts on anonymising are as follows:

- replace person names in body text and subject field with some form of 'Proper Name' tag
- remove email addresses from body text and subject field (perhaps replaced with an 'Email Add' tag containing the domain, if non-sensitive)
- remove postal addresses from body text (and subject field) (again, may be replaced with a 'Postal Address' tag)
- probably ignore From and To field information (although could retain non-sensitive domain data)

It is likely that any classification-based spam filter would be used in conjunction with a black / white list technique, so From and To information would be incorporated into that rather than the text classification side.

&NAME, what is your policy / procedure for anonymising non-spam emails?

&NAME
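
As a rough illustration of the email-address step in the list above, here is a minimal Python sketch. It is only one possible reading of the idea, not a worked-out procedure: the function name, the `<EmailAdd ...>` tag format, the `sensitive_domains` parameter and the simplified address pattern are all assumptions made for the example.

```python
import re

# Simplified address pattern (an assumption for this sketch); real
# addresses can take more forms than this expression recognises.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)")


def anonymise_email_addresses(text, sensitive_domains=frozenset()):
    """Replace each email address in text with an 'Email Add' style tag.

    The domain is kept inside the tag unless it appears in
    sensitive_domains (a hypothetical, caller-supplied set).
    """
    def to_tag(match):
        domain = match.group(1).lower()
        if domain in sensitive_domains:
            return "<EmailAdd>"  # drop the domain entirely
        return f"<EmailAdd domain={domain}>"

    return EMAIL_RE.sub(to_tag, text)
```

The same function could be applied to both the body text and the subject field. Person names and postal addresses are harder to catch with a pattern like this and would probably need a different approach.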