CS 600.443 Spring, 2005
Assignment #2

This assignment is to be done in groups of three or four students.

Privacy compromises have become commonplace in the news lately. First, a security breach at ChoicePoint compromised the personal data of over 140,000 people. Then, Seisint lost over 30,000 people's Social Security numbers. As more and more information is collected about people, identity theft becomes a serious problem.

For this project, you are going to demonstrate how a privacy-compromising database, such as those of ChoicePoint and Seisint, can be built. The goal is to develop an appreciation for how insidious these technologies can be. Then, you will develop techniques for weakening such databases, and the data that can be learned about individuals from them, in order to protect privacy.

*** WARNING ***
It is important to remember that you must work within legal and ethical bounds. While you are learning how privacy compromises can work, you must protect any information that you find, obtain only data that is publicly available, and use only legitimate means. There are two reasons why you should protect this information: it poses a potential privacy risk to the people it describes, and it may contain errors.
***************

Part I: Data source identification

Being as creative as you can, identify sources of public information about people. This project will focus on residents of the city of Baltimore, Maryland. Identify public databases, web sites to be mined, public records, and physical-world information that you can capture on the computer. You will be amazed at the information you can find online. For example, Prince George's County puts all its property records online. Baltimore City has its own treasure trove of data online.

Part II: Data format standardization

Come up with a standard format for storing the data that you are going to collect.
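
As a starting point, here is a minimal sketch of one possible standard format, assuming Python and its built-in sqlite3 module. The table names (person, attribute), the column names, and the helpers open_store and add_record are illustrative assumptions, not part of the assignment; each group should design its own format.

```python
import sqlite3

# Hypothetical schema: every table and column name here is an assumption
# made for illustration only.
SCHEMA = """
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    full_name TEXT NOT NULL
);
CREATE TABLE attribute (
    person_id INTEGER NOT NULL REFERENCES person(person_id),
    field     TEXT NOT NULL,   -- e.g. 'street' or 'employer'
    value     TEXT NOT NULL,
    source    TEXT NOT NULL    -- where this value was obtained
);
"""


def open_store(path=":memory:"):
    """Create a database that holds records in the standard format."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_record(conn, full_name, fields, source):
    """Store one person and any number of field/value pairs from one source."""
    cur = conn.execute("INSERT INTO person (full_name) VALUES (?)", (full_name,))
    person_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO attribute (person_id, field, value, source) "
        "VALUES (?, ?, ?, ?)",
        [(person_id, key, value, source) for key, value in fields.items()],
    )
    conn.commit()
    return person_id
```

Keeping attributes as field/value rows, each tagged with its source, means a new kind of data does not require a schema change, and the origin of every value stays on record. Calling add_record(conn, "Jane Example", {"street": "Example St"}, "sample-source") with made-up data stores one person and one attribute row.
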
Write translators that convert the various formats in which you find data into this standard format, so that large databases can be easily converted. Ideally, you will use a relational database that allows you to link people to each other based on what they have in common. Examples are relatives, employer/employee relations, romantic involvement, and any other relationship that can exist between two people. The purpose of this step is to ensure that when new databases are obtained, the data can be easily added to the existing collection through automated translation.

Part III: Data collection

Collect as much information as you can about residents of Baltimore City. As you collect the data, convert it into the standard format using your translators, and fill in the relations among people already in the database. Make every effort to resolve multiple entries for the same person. If you have a "Bob Smith" and a "Robert Smith", figure out, using other data, the likelihood that they are the same person. Based on that probability, either combine them into one entry with two names, or make two entries with a link between them that records the probability that they are the same person. Compute this probability using a home-grown algorithm that you develop.

For this part of the project only, you may trade data with other groups. When doing so, be sure to document in the final project that you turn in which data you collected yourself and which you received in a trade from another group in the class. The goal is to make the projects more interesting by increasing everyone's data size. Of course, you will receive the data from other groups in their own format, which is why the data format standardization is so important.

Also, be very careful when automatically farming web sites for public data. Some sites will blacklist IP addresses from which they get overwhelmed with web requests.
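
One way to avoid overwhelming a site is to space out requests to each host and to honor the site's robots.txt rules. The following is a minimal sketch under those assumptions, using only the Python standard library. The class HostThrottle, the function allowed, and the five-second default delay are hypothetical choices for illustration, and no network access is performed here.

```python
import time
from urllib import robotparser
from urllib.parse import urlparse


class HostThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_delay=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay  # seconds; an assumed default
        self._clock = clock
        self._sleep = sleep
        self._last = {}  # host -> time of the most recent request

    def wait(self, url):
        """Block until enough time has passed since the last request to this host."""
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last.get(host)
        if last is not None:
            remaining = self.min_delay - (now - last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last[host] = now


def allowed(robots_lines, user_agent, url):
    """Check a URL against robots.txt lines that were already downloaded."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)
```

A crawler would call allowed(...) before requesting a page and throttle.wait(url) immediately before each request. The clock and sleep parameters exist only so the delay logic can be exercised without real waiting.
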
You may want to put some cleverness into your crawlers, so that the sites do not turn off your access.

Part IV: Data mining

Develop an interface for querying the huge database that you've grown. Try to answer questions such as:

"Show me all the relatives of a given person."
"Show me all people living on a given street who make more than $x."
Etc.

You come up with the important questions, and show how your system answers them.

Part V: Privacy protection

Develop techniques for reducing the value of what can be collected about people. One of the best ways to do this is to poison public databases and fill them with nonsensical or incorrect data about yourself and others. For example, in the early days of file sharing, one of the tactics of the music industry was to put junk files out there with names of popular songs. If the data collected about people is wrong or untrustworthy, it will be used less often. Don't actually implement any of this, but write a report about it. You will be better able to do this once you've gone through the exercise of collecting and using information about people.

Deliverables:

1. Progress report
   Due: April 14, 2005
   a. Identify team members.
   b. One-page (max) write-up describing the data sources you are using and showing your standard record format.
   c. One-page (max) write-up describing progress so far and how the team members are dividing up the implementation and writing work among themselves.

2. Final project
   Due: May 5, 2005
   Page sizes below represent maximum sizes for the write-ups.
   1. Two-page write-up describing the team's system, the size of the database, the number of people in the database, and how duplicates are handled, and outlining the programs written by the group.
   2. One-page write-up describing what the group's successes were and where the group ran into insurmountable challenges.
   3. Two-page write-up explaining how the system works and showing examples of queries and how the system handles them.
   4. Three- to four-page write-up about privacy protection techniques.
   5. Turn in printouts and a CD of all of the code you wrote and any supporting libraries that the TAs are not likely to have.
   6. In-class demo (about 10 minutes), 5/5 and 5/6
      Describe your project, what sources you used, what standard format you used, etc. Demonstrate how data is added to the system. Demonstrate several queries. Be prepared to have your system answer real-time queries from the professor.

Grade sheet:

Progress report: _________ (5 points)
Quality and originality of data sources: __________ (10 points)
Quality of system in terms of
  - data standardization __________ (5 points)
  - handling of duplicates __________ (5 points)
  - ease of translation from multiple sources _________ (10 points)
Overall quality of system and software ________ (20 points)
Quality of write-up about privacy protection techniques _______ (15 points)
Overall quality of written reports ________ (20 points)
Class presentation and ability to handle unexpected queries _______ (10 points)
Total __________
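
Note on duplicate handling (Part III): the "Bob Smith" versus "Robert Smith" decision could be approached along the following lines. This is only a rough sketch, assuming records are Python dictionaries with a "name" key plus optional corroborating fields. The nickname table, the 0.6 name weight, the 0.9 merge threshold, and all function names are hypothetical, and the result is a heuristic score in [0, 1] rather than a calibrated probability. Groups are expected to develop their own algorithm.

```python
from difflib import SequenceMatcher

# Tiny, hypothetical nickname table; a real one would be much larger.
NICKNAMES = {"bob": "robert", "bill": "william", "liz": "elizabeth"}


def canonical_name(name):
    """Lowercase the name and map a leading nickname to its formal form."""
    parts = name.lower().split()
    if parts:
        parts[0] = NICKNAMES.get(parts[0], parts[0])
    return " ".join(parts)


def name_similarity(a, b):
    """String similarity in [0, 1] between two canonicalized names."""
    return SequenceMatcher(None, canonical_name(a), canonical_name(b)).ratio()


def match_score(rec_a, rec_b, name_weight=0.6):
    """Heuristic score in [0, 1] that two records describe the same person."""
    name_part = name_weight * name_similarity(rec_a["name"], rec_b["name"])
    shared = [k for k in rec_a if k != "name" and k in rec_b]
    if not shared:
        # No corroborating fields: the name alone cannot give a high score.
        return name_part
    agreement = sum(rec_a[k] == rec_b[k] for k in shared) / len(shared)
    return name_part + (1 - name_weight) * agreement


def resolve(rec_a, rec_b, merge_threshold=0.9):
    """Decide whether to merge two entries or keep both with a scored link."""
    score = match_score(rec_a, rec_b)
    return ("merge" if score >= merge_threshold else "link", score)
```

With made-up records, "Bob Smith" and "Robert Smith" at the same street score high enough to merge, while the same two names at different streets stay separate entries joined by a scored link.
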