600.107 Introduction to Programming in Java
Project 2 - Due 11:30pm on Monday, 11/13 - 60 points
(Recommended completion date is Saturday 11/11.)

Overview

Note: corrections to the original posting are noted in red below.

This is your second graded project. It will count for 10% of your course grade and must be treated like an individual take-home exam. You absolutely must not discuss it with or get help from anyone other than the course teaching staff (Joanne + TAs). Even then, we will help you minimally since the goal is for you to show us what you have learned in the prior class exercises and homework assignments by applying the concepts to this somewhat complex problem. One final ethics reminder: doing a web search and using or adapting code you find is forbidden! (We've already done one and know what to look for in your solutions...)

You'll need to apply the following programming skills and concepts to solving this problem: solution design, method writing & calling, working with text files, using one and two-dimensional arrays, generating random data, proper documentation and other style guidelines. Download our posted P2start.zip file to get the file(s) necessary to start this assignment. This includes plain text data files (training.txt and validate.txt).

Reflection: In addition to the code itself, we want some insights as to how you have approached the problem, your design and thought processes in developing your solution, and particular problems you were or were not able to solve. Therefore, you must submit a short (2-5 min) video recording of yourself explaining your approach and code to us. Please refer to your code as you do this (on screen or on paper), but we also need to see your face periodically. If you do not submit a reflection you will receive a 0 on the entire project. The video submission is due the same time as the actual code. (A secondary purpose of the reflection is to make sure you are accountable for the code you submit.)

You can ask a friend to help with the recording, as long as that person is not in this course. Please keep this reflection in mind as you develop your code and make some notes as you proceed so that you remember what to say in the video. The file type should be standard enough for us to view with any common video player.

Deliverables: Your main program solution should be a Java program called P2a.java. You must submit your program code (actual project solution) and reflection video together in a zip file called P2-jhedLID.zip, substituting your actual JHED login as the second part of the file name, to P2 on Blackboard. If your video file is too large to upload, you can instead submit a web link to an otherwise private video that we can access online. Absolutely no publicly accessible videos (on YouTube for example) are allowed - these will be reported as an ethics violation and result in a 0 on the entire project.

Submission: Remember that you can submit multiple times on Blackboard; just make sure that your final submission includes all parts (code and reflection)! Also remember that your Java code must compile (no syntax errors) or you will receive a 0 grade (seriously). You can check your submission by downloading from MyGrades on Blackboard after submitting and making sure all was included in the zip file, and everything still compiles as submitted. Late submissions will not be accepted, so plan to submit by 11pm even though the deadline is 11:30.


The Problem

For this project you will write a program to implement a Perceptron, which is an early simple form of machine learning algorithm. Thanks to Haroon Ghori for providing the problem concept and machine learning background for this project! Perceptrons are used to classify data as belonging to one of two possible categories. In brief, our perceptron will be trained on patient data and then used to predict survival a certain number of months after a heart attack. Each time the program is run the user will specify a desired cutoff for the survival. After training with this cutoff, you will be able to predict for new patients whether they will live past the cutoff or not. We will provide you with the training data (file "training.txt" in the provided zip). Part of your job will also be to randomly generate similar data to use in making predictions. DISCLAIMER: although the training data comes from real patients, it is rather old and too small to be of true predictive value, medically speaking.

The overall operation of your program will be as follows. More details are provided below.

Patient data: The provided data sets describe patients who have had a heart attack, one patient per line. Each line contains seven numbers with these meanings, most related to heart data from an echocardiogram. (Since this is not a biology course, we won't bother to explain any further.)

  1. How many months the patient lived after the heart attack. (Cause of death is not necessarily related to the heart condition which is one of several reasons why our results will not be medically valid.)
  2. The age at which the patient suffered their heart attack.
  3. Whether the patient had pericardial effusion (1=yes, 0=no).
  4. Fractional shortening.
  5. E-point septal separation.
  6. Left ventricular end-diastolic dimension.
  7. Wall motion index.

Perceptron approach: A perceptron is a binary linear classifier - that is, it can be used to split data into two categories based on a linear function (y = m x + b). If the input data has M dimensions, the linear function is a hyperplane defined by y = W dot X where W and X are one-dimensional arrays of length M. If we know the values of the elements in W (the weights), we can predict the class of X as follows:

X belongs in class 1 if W dot X >= 0
X belongs in class -1 if W dot X < 0
That is, the class equals the sign of W dot X. For the purposes of our assignment, class 1 is the patients who survived to or past the cutoff number of months and class -1 is those who died before the cutoff.

For two M length arrays A and B, C = A dot B is the sum of the element-wise product of the arrays:

A = [a1, a2, a3, ... , aM]
B = [b1, b2, b3, ... , bM]
C = a1*b1 + a2*b2 + a3*b3 + ... + aM*bM
C = A + B is the pairwise sum of the arrays:
C = [a1+b1, a2+b2, a3+b3, ..., aM+bM]
and for some constant r, C = r*A is the scalar product of r and array A:
C = [r*a1, r*a2, r*a3, ..., r*aM]
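
As an illustration only, the three array operations and the sign rule above could be sketched in Java as follows. The class name VectorOps and the method names are assumptions made for this sketch, not names the assignment requires, and the code assumes both arrays have the same length.

```java
// Illustrative helpers; the class and method names are assumptions for this sketch.
class VectorOps {

    /** Dot product: the sum of the element-wise products of two equal-length arrays. */
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Pairwise sum: returns a new array whose i-th element is a[i] + b[i]. */
    static double[] add(double[] a, double[] b) {
        double[] c = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            c[i] = a[i] + b[i];
        }
        return c;
    }

    /** Scalar product: returns a new array whose i-th element is r * a[i]. */
    static double[] scale(double r, double[] a) {
        double[] c = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            c[i] = r * a[i];
        }
        return c;
    }

    /** Class of x under weights w: 1 if w dot x >= 0, otherwise -1. */
    static int classify(double[] w, double[] x) {
        return dot(w, x) >= 0 ? 1 : -1;
    }
}
```

For example, with A = [1, 2, 3] and B = [4, 5, 6], the dot product is 1*4 + 2*5 + 3*6 = 32.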

For our application, M is six because everything except the number of months in the patient data is a feature that we must use within our predictions. The problem is determining the correct set of six weights that will correctly classify new patients based on the training data. The weights determine the relative importance of each of the various features. The perceptron algorithm itself is used to train the model by determining the values in our weights array W.

Training data file processing: As you read the patient data, you'll need to store it in a two dimensional array (referred to as X below). The number of rows should be the same as the number of patients. You can just look at the number of lines in the file to determine this value and save it as a constant in your program. We will refer to this number as N in the algorithm below. The 2D array should have six columns, each one corresponding to the patient data features (not including the survival months), in order from age through wall motion index. You should also save the survival classification for each patient in a separate one-dimensional array (referred to as array Y below), of length N. These values should be 1 if the number of months the patient survived (the first value in the input file) is at least as large as the desired cutoff value, and -1 otherwise. Write the patient data to a new file called "training#.txt", where # is the survival months cutoff. Start each line for a patient with the survival classification (1 or -1), followed by the original 6 feature values (age through wall). The only difference between this file and the original "training.txt" file should be the first value on each line.
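
One possible shape for the reading step is sketched below. It is a minimal sketch, not the required design: the class name TrainingData, the method name load, and the choice to pass in a Scanner are all assumptions. It also assumes the seven values on each line are separated by whitespace and that the arrays were already created with N rows.

```java
import java.util.Scanner;

// Illustrative only; TrainingData and load are hypothetical names for this sketch.
class TrainingData {
    static final int M = 6;  // number of features per patient (age through wall motion index)

    /**
     * Reads one patient per row from 'in'. The first value on each line (months survived)
     * is turned into a class label in y; the remaining six values are stored in x.
     * Assumes x has N rows and M columns, and y has length N.
     */
    static void load(Scanner in, double cutoff, double[][] x, int[] y) {
        for (int k = 0; k < x.length; k++) {
            double months = in.nextDouble();
            y[k] = months >= cutoff ? 1 : -1;   // 1 = survived to or past the cutoff
            for (int j = 0; j < M; j++) {
                x[k][j] = in.nextDouble();
            }
        }
    }
}
```

In a full program the Scanner would wrap the training file rather than a string. Number parsing with Scanner depends on its locale, so calling useLocale(Locale.US) on it is a reasonable precaution if the data uses a period as the decimal separator.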

Perceptron weights training: Here we must use the training data in array X to train a one-dimensional array (vector) W of weights. The length of W should be the same as the number of features, M, and its values should be initialized to 0. Remember that X is a two-dimensional array of patient feature data, with N rows and M columns. The algorithm is as follows (note change to last line of algorithm posted 10/30):

      for every row Xk in X:
          compute survival (-1 or 1) based on the sign of W dot Xk
          if survival is not the same as the expected value Yk:
              set W to W + Yk * Xk
    
The resulting array W is our final set of weights that will be used in the next two parts of the project. As the training proceeds, write every iteration of the weights array to output file "weights.txt", with 4 digits of precision for each value, one iteration per line. This should include the array initialized to all 0s as well as the final set of weights. Refer above for general formulas to compute the dot product of two arrays, their sum, and the scalar product in order to fully implement this algorithm.
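
The training pass described above might look roughly like the following sketch. Several things here are assumptions rather than requirements: the names PerceptronTrainer, train, and formatWeights; the reading of "every iteration" as one output line per row processed (plus the initial all-zero line); and the reading of "4 digits of precision" as four decimal places.

```java
import java.io.PrintWriter;
import java.util.Locale;

// Illustrative only; names and the logging interpretation are assumptions for this sketch.
class PerceptronTrainer {

    /** Formats the weights on one line, space separated, with four decimal places each. */
    static String formatWeights(double[] w) {
        StringBuilder sb = new StringBuilder();
        for (int j = 0; j < w.length; j++) {
            if (j > 0) {
                sb.append(' ');
            }
            sb.append(String.format(Locale.US, "%.4f", w[j]));
        }
        return sb.toString();
    }

    /** Makes one pass over the rows of x, updating w whenever the prediction disagrees with y. */
    static double[] train(double[][] x, int[] y, PrintWriter log) {
        int m = x[0].length;
        double[] w = new double[m];          // Java initializes every weight to 0.0
        log.println(formatWeights(w));       // the initial all-zero weights
        for (int k = 0; k < x.length; k++) {
            double dot = 0.0;
            for (int j = 0; j < m; j++) {
                dot += w[j] * x[k][j];
            }
            int survival = dot >= 0 ? 1 : -1;
            if (survival != y[k]) {
                for (int j = 0; j < m; j++) {
                    w[j] += y[k] * x[k][j];  // W = W + Yk * Xk
                }
            }
            log.println(formatWeights(w));   // weights after processing row k
        }
        return w;
    }
}
```

As a small hand check with two features: for rows [1, 1] and [2, -1] with expected classes -1 and 1, the weights go from [0, 0] to [-1, -1] after the first row and to [1, -2] after the second.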

Perceptron validation processing: Now you'll use the weights in array W to predict whether each patient in the provided "validate.txt" file will survive to the cutoff number of months or not. This file is in the same exact format as training.txt. Similar to that processing, you'll need to store the known month data separately from the other features. For each patient data array P, compute the survival (-1 or 1) based on the sign of W dot P. Write the survival prediction (not the actual known survival) and the patient features (age through wall) to output file "validate#.txt" where # is the month cutoff. Then compare the predicted result for each patient to the known months of survival in order to determine how many predictions were correct. After processing the entire validation file, output the number of correct, number of incorrect, and percentage of correct predictions to the screen, nicely labelled and formatted.
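
The comparison of predictions against known outcomes could be tallied along these lines. This is a sketch under assumptions: Validation, countCorrect, and percentCorrect are hypothetical names, and the validation features and months are assumed to be already loaded into a 2D array and a parallel 1D array.

```java
// Illustrative only; class and method names are assumptions for this sketch.
class Validation {

    /** Counts the patients whose predicted class (sign of w dot p) matches the actual class. */
    static int countCorrect(double[] w, double[][] p, double[] months, double cutoff) {
        int correct = 0;
        for (int k = 0; k < p.length; k++) {
            double dot = 0.0;
            for (int j = 0; j < w.length; j++) {
                dot += w[j] * p[k][j];
            }
            int predicted = dot >= 0 ? 1 : -1;
            int actual = months[k] >= cutoff ? 1 : -1;
            if (predicted == actual) {
                correct++;
            }
        }
        return correct;
    }

    /** Percentage of correct predictions; 100.0 forces floating-point division. */
    static double percentCorrect(int correct, int total) {
        return 100.0 * correct / total;
    }
}
```

The number of incorrect predictions is then the total minus the number correct. Multiplying by 100.0 before dividing avoids the integer division that would otherwise truncate the percentage.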

Random patient data generation and processing: For this final step, you will be generating 20 sets of random patient data (age through wall). As you generate each one, apply the weights vector to it (compute the expected survival as -1 or 1 using the dot product), just as in the validation step. Write the survival value and generated patient data to output file predict#.txt, where # is the same month cutoff as usual. Here are expected ranges of data values for each feature, inclusive at both ends. All values should be equally likely in a range unless otherwise specified:
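
Whatever range applies to a given feature, drawing equally likely values from it could be sketched as below. The names RandomFeatures, uniformInt, and uniformReal are hypothetical, and lo and hi are placeholder parameters; no actual feature ranges are assumed here.

```java
import java.util.Random;

// Illustrative only; names are assumptions, and lo/hi stand in for a feature's range.
class RandomFeatures {

    /** Integer drawn uniformly from lo..hi, inclusive at both ends. */
    static int uniformInt(Random rng, int lo, int hi) {
        // nextInt(n) returns 0..n-1, so n = hi - lo + 1 makes hi reachable.
        return lo + rng.nextInt(hi - lo + 1);
    }

    /**
     * Real value drawn uniformly between lo and hi. nextDouble() returns values in
     * [0.0, 1.0), so hi itself is effectively never produced.
     */
    static double uniformReal(Random rng, double lo, double hi) {
        return lo + rng.nextDouble() * (hi - lo);
    }
}
```

A yes/no feature such as pericardial effusion would fit the integer helper with lo = 0 and hi = 1. Whether the real-valued helper is close enough to "inclusive at both ends" for continuous features is a judgment call this sketch does not settle.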

Implementation Details & Hints: There are several key features of this program solution that are required and/or will make your development job much easier. Here is a list of requirements and implementation hints:


Grading

The 60 total points for this assignment will be broken down into expected functionality (how well the code works), how well the code is written, including style and documentation, and the video reflection. You are not required to submit pseudocode or extra tests for any parts of this project, but you are strongly encouraged to write them as part of the solution process.

Deadline & Time Management: We strongly recommend completing the coding for this project by Saturday 11/11. This will give you an extra day or two to iron out problems and make your reflection video (also due 11/13 - no extra hours this time). Minimally you should start all aspects of the project by Monday 11/6.


General assignment requirements, style and submission details: