Best Viewed with
Internet Explorer 6
To Run SMAC.jar, must
have JRE 1.4.2 or higher installed: get it here (click on
the Download J2SE JRE link)
SMAC
Stock
Market Analysis by Clustering
Joshua Kirshtein & Anand Sampathkumaran
(Towards the project requirement for Database Systems CS415 Fall 2004)
We perform an analysis of the fluctuation in stock prices over the past twenty years. We run an unsupervised learning algorithm using an incremental clustering technique. This method would structure the stock market universe to a set of pockets of stocks that have exhibited similar price fluctuation behaviour. The clustering should give an investor some idea of which groups of stocks are high risk/gain or low risk/gain. As an example, if an investor feels strongly about a certain stock, our application would list some similar stocks based on the parameters that the user chooses for clustering.
We have collected twenty years worth of daily price information for about 1500 stocks that are traded in the three major US stock markets – NASDAQ, NYSE and AMEX. The information consists of the date, open price, close price, day’s high and day’s low. This data yields a total of about two million records in the database. This data was collected from yahoo along with news stories as part of a stock graphing application using a web robot written in Java. This data was loaded to a MySQL database through the Java application.
The Database server used is MySQL version - 4.1.6, installed on a laptop whose home ip address is 68.34.123.140.
The Database consists mainly of the three tables
Org – All companies that are being traded
Stock – Historical price information table
StockNews – News Stories URLs for various stocks stored with date and time
Additionally a working table of statistics called the StockStat has been created and is used by the clustering algorithm. These statistics are aggregate measures of the twenty-year price data for each stock. The ticker symbol serves as the foreign key between all these tables. Clustered Indexes on the primary key and frequently accessed columns like date have been added to improve access times.
Org:
|
Field |
Type |
|
|
|
|
Ticker |
varchar(15) |
|
Name
|
varchar(255) |
|
Market |
varchar(100) |
Stock:
|
Field |
Type
|
|
|
|
|
Ticker |
varchar(15) |
|
Open |
double
|
|
Close |
double
|
|
High |
double
|
|
Low |
double
|
|
Volume |
double
|
|
AdjustedClose |
double
|
|
Date |
date
|
StockStat:
|
Field |
Type |
|
|
|
|
Ticker |
varchar(15) |
|
AvgHiLo |
double(30,15) |
|
WtDAvgHiLo |
double(30,15) |
|
WtMAvgHiLo |
double(30,15) |
|
WtYAvgHiLo |
double(30,15) |
|
VarHiLo |
double(30,15) |
|
WtDVarHiLo |
double(30,15) |
|
WtMVarHiLo |
double(30,15) |
|
WtYVarHiLo |
double(30,15) |
|
AvgClOp |
double(30,15) |
|
WtDAvgClOp |
double(30,15) |
|
WtMAvgClOp |
double(30,15) |
|
WtYAvgClOp |
double(30,15) |
|
VarClOp |
double(30,15) |
|
WtDVarClOp |
double(30,15) |
|
WtMVarClOp |
double(30,15) |
|
WtYVarClOp |
double(30,15) |
|
VarVolume |
double(30,15) |
|
WtDvarVolume |
double(30,15) |
|
WtMVarVolume |
double(30,15) |
|
CenterDeviation |
double(30,15) |
|
WtYVarVolume |
double(30,15) |
|
Amex |
int(11) |
|
Nasdaq |
int(11) |
|
Nyse |
int(11) |
|
WtDCenterDeviation |
double(30,15) |
|
WtMCenterDeviation |
double(30,15) |
|
WtYCenterDeviation |
double(30,15) |
|
CenterAvgDeviation |
double(30,15) |
|
WtDAvgCenterDeviation |
double(30,15) |
|
WtMAvgCenterDeviation |
double(30,15) |
|
WtYAvgCenterDeviation |
double(30,15) |
StockNews:
|
Field |
Type |
|
|
|
|
Ticker |
varchar(15) |
|
Date
|
date |
|
Time
|
time |
|
Source |
varchar(255) |
|
URL
|
text |
Retrieve a list of all the stock names in the database:
SELECT ticker, name
FROM org
WHERE ticker in (SELECT distinct ticker FROM stock)
Retrieve all stocks and their company names that have statistics calculated for them in the database
SELECT s.ticker, o.name
FROM org o INNER JOIN stockstat s on s.ticker = o.ticker
Retrieve all the information for the stock ‘ticker’ to compute statistics:
SELECT ticker, open, close, high, low, volume, cast(date as char),
date_format(date,\'%c\') month,
date_format(date,\'%Y\') year,
FROM stock
WHERE ticker = ‘ticker’
ORDER BY date ASC
Retrieve all the statistics information for each stock:
SELECT ticker, AvgHiLo, WtDAvgHiLo, WtMAvgHiLo, WtYAvgHiLo
AvgClOp, WtDAvgClOp, WtMAvgClOp, WtYAvgClOp,
CenterAvgDeviation, WtDAvgCenterDeviation, WtMAvgCenterDeviation, WtYAvgCenterDeviation,
VarHiLo, WtDVarHiLo, WtMVarHiLo, WtYVarHiLo,
VarClOp, WtDVarClOp, WtMVarClOp, WtYVarClOp,
CenterDeviation, WtDCenterDeviation, WtMCenterDeviation, WtYCenterDeviation,
VarVolume, WtDvarVolume, WtMVarVolume, WtYVarVolume,
Amex, Nasdaq, Nyse
FROM stockstat
Update the stockStat table after statistics are calculated to set the market value for the stock. The query below sets the values for the stocks in the AMEX market. Similar queries are run for the NASDAQ and NYSE markets (correspondingly setting their market values to “1” and the others’ to “0”):
UPDATE StockStat S
SET Amex = 1, Nasdaq = 0, Nyse = 0 where Ticker IN
(SELECT DISTINCT O.Ticker
FROM Org O
WHERE S.Ticker = O.Ticker AND O.Market LIKE 'AM%'
The Statistics
In this unsupervised method, we rely upon the raw price data to make inferences. However, the stock information is spread over a period of twenty years in two million records for each of the one thousand five hundred stocks. Our aim is to gain insights on a per stock basis and we resort to calculating aggregate statistics (for the period of time) for each of these stocks and use them as vectors in the clustering algorithm. Each vector identifies a stock with these statistics as individual components, whose values are normalized to a range between zero and one.
The following are the kinds of statistical measures that we compute, a detailed list of the actual 31 measures is provided in a tabular form. The statistics are computed for the difference in prices (Open & Close, High & Low) that model daily price fluctuations more closely than raw prices.
Average – Simple average over the number of observations.
Weighted Average – The oldest observation gets the lowest weight of 1 and weights progressively increment over days or months or years. The motivation being recent data and, hence, recent price variance has more significance than older ones.
Variance – The classic variance measure for these components
Weighted variance – Classic variance weighted by the daily, monthly or yearly weight.
CenterAvgDeviation = ((Open+close)/(High+Low))*0.5 – This gives a sense of variance of price over the day. Ideally if the open and close and high and low were symmetric, the variance in price for the days could be assumed as linear and the average would have been exactly half of the sum of any of those two prices.
WeightedCenterAvgDeviation
The combination of the above measures result in the following table, which serves as the vector input for the algorithm.
|
Column |
Desc |
|
AvgHiLo |
Average(high-low) |
|
WtDAvgHiLo |
WeightedDayavg(high-low) |
|
WtMAvgHiLo |
WeightedMonthAvg(high-low) |
|
WtYAvgHiLo |
WeightedYearAvg(high-low) |
|
VarHiLo |
Var(high-low) |
|
WtDVarHiLo |
WeightedDayVar(high-low) |
|
WtMVarHiLo |
WeightedMonthVar(high-low) |
|
WtYVarHiLo |
WeightedYearVar(high-low) |
|
AvgClOp |
Avg(close-open) |
|
WtDAvgClOp |
WeightedDayAvg(close-open) |
|
WtMAvgClOp |
WeightedMonthAvg(close-open) |
|
WtYAvgClOp |
WeightedYearAvg(close-open) |
|
VarClOp |
Var(close-open) |
|
WtDVarClOp |
WeightedDayVar(close-open) |
|
WtMVarClOp |
WeightedMonthVar(close-open) |
|
WtYVarClOp |
WeightedYearVar(close-open) |
|
VarVolume |
Var(volume) |
|
WtDvarVolume |
WeightedDayVar(volume) |
|
WtMVarVolume |
WeightedMonthVar(volume) |
|
CenterDeviation |
CenterDeviation |
|
WtYVarVolume |
WeightedYearVar(volume) |
|
Amex |
Amex – 1 if
stock belongs to this market otherwise 0 |
|
Nasdaq |
NasDaq– 1
if stock belongs to this market otherwise 0 |
|
Nyse |
NYSE– 1
if stock belongs to this market otherwise 0 |
|
WtDCenterDeviation |
WeightedDayCenterDeviation |
|
WtMCenterDeviation |
WeightedMonthCenterDeviation |
|
WtYCenterDeviation |
WeightedYearCenterDeviation |
|
CenterAvgDeviation |
CenterAvgDeviation |
|
WtDAvgCenterDeviation |
WeightedDayAvgCenterDeviation |
|
WtMAvgCenterDeviation |
WeightedMonthAvgCenterDeviation |
|
WtYAvgCenterDeviation |
WeightedtYearAvgCenterDeviation |
Notation:
v – Vector representing each stock made up of the statistics as components
V – Set of all vectors
c – Cluster, A collection of vectors with a special vector called the centroid , whose components are the average values of the rest of the vectors in that cluster.
S – Super Cluster, Set of all clusters.
D – Maximum Intra Cluster Distance, The maximum allowed distance between a vector and the centroid of the cluster to which its being assigned
Algorithm
Incremental Clustering (Vector List V, maximum intra cluster distance D)
1. For every vector v in V
a. For every cluster c in the Super Cluster S, compute the Euclidean distance between the vector v and the centroid
b. If the minimum distance is less than or equal to D, assign the vector to that least distant cluster, otherwise create a new cluster and assign the vector to it and also make it the cluster’s centroid
2. Calculate the new centroids for all the clusters
3. Return the Super Cluster S
K-Means Clustering (Vector List V, Super Cluster S, Maximum Iteration M)
1. k = No of clusters in the Super Cluster
2. Repeat until maximum iterations M reached
a) For every cluster c in S
i. For every vector v in the cluster, compute its distance from every other cluster and assign to the least distant
b) Recompute centroids for each cluster
c) If no vectors have changed membership, exit.
See the SMAC JavaDocs for a full list of Java code architecture.
We use MySQL 4.1.6 as out database server and a Java application to compute statistics and perform the clustering. The Application uses JDBC to communicate with the database and uses multi threaded small batch inserts and updates when computing statistics for high performance. The Java Application consists of the following classes and methods:
(Code hierarchy listed here for convenience; See the SMAC JavaDocs for complete comments and details.)
Class
db
DataManager
DBConnect
DBUtils
dbStat
AttributeVector
Cluster
ClusterDriver
ClusterUtils
NormalizeMinMax
StatDriver
SuperCluster
VectorField
jqubeObjects
Field
Record
jqubePopups
ChangeDBPreferences
ClusterDocument
ClusterGUI
ErrorMessage
SelectClusterParameters
WaitMessage
robotGUI
MainRobotGUI
stockDataRetrieval
DownloadManager
DownloadManagerAMEX
DownloadManagerNASDAQ
DownloadManagerNYSE
NewsDownloadEngine
NewsDownlaoder
NewsDownloaderThread
StockFileManipulator
WebConnectionEngine
WebPageParser
TableModels
TemplateVectorModel
To use the application, simply launch the GUI by double clicking on the SMAC.jar file in Windows or running “java –jar SMAC.jar” from a UNIX environment.
Once the application launches, the main GUI panel is displayed:

To begin using SMAC, click the button “Connect/Initialize” to display a dialog box where the database connection information can be entered:
The dialog box can be exited by either closing the window with the “X” button or by clicking the “cancel” button. The exit will be asked to be confirmed with a confirm box:

The settings entered can be tested before exiting the configuration panel by clicking on the “test” button. The application will attempt to connect to the database with the setting specified and one of two information boxes will be displayed depending on if the application was able to create a valid connection:


Clicking on the “ok” button will automatically close the configuration panel and will attempt to connect to the database and fetch all the statistics vectors present in the StockStat table.
If any errors occur during the attempt to fetch the statistics vectors, this information dialog box is displayed:

If any functionality is attempted to be used before the system is initialized, a variety of error messages are displayed:



When the vectors are downloaded, a wait box is displayed that prints messages notifying the user of the progress:

After the results from the database have been downloaded, the average distance between 100 randomly selected vectors are computed and displayed in the text box: “Samp Avg Inter-Vector Distance.” This value can be recomputed (by selecting another random 100 vectors and computing the average distance from each vector to all the others) by clicking on the button “ReCalc Sample d”. The text box “(multiplier) X” is where the user can enter a numeric value to be multiplied by the value in “Samp Avg Inter-Vector Distance” and the result is displayed in the text box “Max Intra-Cluster Disntace D.” This last distance is the threshold distance used for which all vectors within a cluster are guaranteed to be less than this distance from the centroid vector of that cluster. A sample screen shot after vector loading:

The user can enter the maximum number of iterations to be used in the clustering algorithm in the “Max Iterations” text box. The clustering algorithm will exit either after that number of iterations has been completed or until there exists no vector which switches clusters during the algorithm.
The cluster algorithm is initiated by clicking on the “Cluster!” button. A wait message box is displayed that notifies the user of the progress of the algorithm:

Once the algorithm completes, a summary window is displayed. This window can be minimized and saved for later or the user can save the document by clicking on the “save” button. The document can bee discarded by clicking on the “discard document” button or by closing the window. A neat feature is the “find” text box. Any text typed into this box will be highlighted in the document (case insensitive). Clicking the “find next” button can search for the next instance of the word. The search feature also loops wraps to the top of the document if no new instances are found below the last instance.

If the user wishes to adjust the parameters used when clustering the vectors, the user can click on the “Cluster Parameters” to bring up a window that allows the user to select which parameters are to be used when clustering. A sample window looks like this before and after selecting cluster parameters:


The selected parameters can be saved by clicking “ok” and can be aborted by selecting “cancel” or by closing the window.
The last option the user has is to update the database statistics. To do this, the user clicks on the “Cluster Parameters” button and is prompted to ensure that this is something the user really wishes to do. The operation downloads all the stock price data for every stock and then computes the statistics listed above. These statistics are then loaded into the database via a batch update. As the updates are processed, a wait message box displays messages that notify the user as to the progress of the update:

Click here to see a sample cluster output file
Only some sample results are listed here because there are many possible combinations of settings that can be used to get clustering results.
After clustering, the following stocks were all clustered together:
DITC
INFA
MCDT
SAPE
COLT
NTIQ
The data in the database for these stocks were retrieved and displayed. The image displays the JQuBE application (developed by Anand Sampathkumaran, Saroja Gundala, and Josh Kirshtein) with the “high” prices listed for the stocks above (the date range is from Dec 2000 to Dec 2003). It is interesting to note the large scale performance of the stocks follow the same general trend over the date period listed (even though the plot only displays 3 years of data and the clustering algorithm uses statistics computed over 20 years).

The graph below
displays the same graph as above except Microsoft’s stock has been added. Microsoft was not clustered with the other
stocks, therefore we would expect it to perform differently, and indeed this
holds true. Microsoft’s stock is the dark green plot that is very steady and
significantly higher than the other stocks for much of the time line displayed.

This application can help a user gain interesting insights about stocks in general and price behaviour in particular. The database design for the price data was fairly simple with few relations but the challenge was computing the aggregate statistics (31 for each stock) and designing the clustering algorithm using OOAD. The UI with its rich features (interactive tables, search) was well worth the effort as it made clustering and analysis easier. The application is very powerful in the fact that it lets the user model a variety of scenarios by trying a different combination of the statistic and simultaneously analyzes these results. It also allows the user to re-compute the statistics stored in the database if the price information has been augmented. The conclusions about stocks are left to the users to model and find out.
The next steps would be to integrate SMAC with JQuBE (Stock graphing application) completely. Currently, SMAC only uses the database connectivity classes that were originally developed for JQuBE. These classes were even extended to accommodate the functionality of SMAC. Another nice add-on would be a cluster visualization tool, which would give the user some idea on inter-cluster relationships.
Saroja Gundala – JQuBE Co-Developer
http://finance.yahoo.com - Stock Price information
http://www.google.com - Information source throughout development.
http://www.eclipse.org/ - The Java IDE used for the project
http://www.mysql.com - Free database server
ă A. Sampathkumaran, J. Kirshtein 2004