stockDataRetrieval
Class WebPageParser

java.lang.Object
  extended by stockDataRetrieval.WebPageParser

public class WebPageParser
extends java.lang.Object

Contains methods for parsing a web page for specific information.

Each stock market has its own page format; the functions adapt to that format and retrieve the required information.


Constructor Summary
WebPageParser()
           
 
Method Summary
private static java.lang.String cleanUpURL(java.lang.String url)
          Removes certain filler and session ID number from the url pointing to the next older stories page and formats the url such that it is ready to be used to fetch the older stories
static java.util.regex.Matcher createMatcher(java.lang.String expressionToMatch, java.lang.String dataToSearch)
          Wraps the standard calls that compile a pattern and try to match it against a piece of text.
static java.util.ArrayList extractNewsStories(java.lang.String pageSource, java.lang.String ticker)
          Identifies all news stories and creates objects for each news story
static java.lang.String extractNextPage(java.lang.String pageSource)
          Looks in the page for the link to the next available page containing older stories
static java.util.ArrayList getTickerSymbolAndCompany(java.lang.String pageSource, java.lang.String embeddedTickerCode)
          Parses the web page that contains stock ticker and company name information and returns an ArrayList whose entries are themselves ArrayLists, each holding one ticker symbol and company name pair
static boolean isLastNewsPage(java.lang.String pageSource)
          Parses the News Page for specific information that signals all available news stories pages have been traversed.
private static java.lang.String[] parseStoryProperties(java.lang.String completeString)
          Return the date, time, and source of the given news story
private static java.lang.String prepareForParsing(java.lang.String pageSource)
          Removes new lines and large white space gaps that make regular expression matching troublesome and extracts the section of the page that contains the news stories
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

WebPageParser

public WebPageParser()
Method Detail

createMatcher

public static java.util.regex.Matcher createMatcher(java.lang.String expressionToMatch,
                                                    java.lang.String dataToSearch)
Wraps the standard calls that compile a pattern and try to match it against a piece of text.

The caller passes in the expression to compile into a regular expression and the text to search; the function returns the Matcher object that has been applied to that text.

Parameters:
expressionToMatch - regular expression to try to match to the text dataToSearch
dataToSearch - string to try to find instances of the expressionToMatch in
Returns:
Matcher object containing all information about the results of applying the regular expression to the text
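
The description above corresponds closely to the usual java.util.regex idiom. The following is a minimal sketch of what such a helper might look like, not the actual implementation; the class name MatcherSketch and the sample pattern and text are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherSketch {
    // Compiles the expression and applies it to the text. The caller then
    // drives the returned Matcher with find(), group(), and so on.
    static Matcher createMatcher(String expressionToMatch, String dataToSearch) {
        Pattern pattern = Pattern.compile(expressionToMatch);
        return pattern.matcher(dataToSearch);
    }

    public static void main(String[] args) {
        // Hypothetical input: runs of two to five capital letters.
        Matcher m = createMatcher("[A-Z]{2,5}", "Shares of IBM and MSFT rose.");
        while (m.find()) {
            System.out.println(m.group());
        }
    }
}
```

Returning the Matcher rather than a boolean leaves it to the caller to decide between find(), matches(), and group extraction.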

getTickerSymbolAndCompany

public static java.util.ArrayList getTickerSymbolAndCompany(java.lang.String pageSource,
                                                            java.lang.String embeddedTickerCode)
Parses the web page that contains stock ticker and company name information and returns an ArrayList whose entries are themselves ArrayLists, each holding one ticker symbol and company name pair

Parameters:
pageSource - source of the page to parse for ticker and company information
embeddedTickerCode - the pattern of HTML containing the ticker information
Returns:
an ArrayList in which each entry holds a ticker symbol (index [0]) and company name (index [1])
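
A sketch of how a list of ticker and company pairs could be built, under the assumption that embeddedTickerCode is a regular expression with two capture groups (symbol first, company name second). The documentation does not say this, so treat the pattern, the sample HTML, and the class name TickerSketch as hypothetical. Generics are used here for clarity; the documented method returns a raw ArrayList.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TickerSketch {
    // Assumption: group 1 of the pattern captures the ticker symbol and
    // group 2 captures the company name.
    static List<List<String>> getTickerSymbolAndCompany(String pageSource,
                                                        String embeddedTickerCode) {
        List<List<String>> pairs = new ArrayList<>();
        Matcher m = Pattern.compile(embeddedTickerCode).matcher(pageSource);
        while (m.find()) {
            // Index 0 holds the symbol, index 1 the company name.
            pairs.add(Arrays.asList(m.group(1).trim(), m.group(2).trim()));
        }
        return pairs;
    }
}
```

With a pattern such as `<td>([A-Z]+)</td><td>([^<]+)</td>`, each matching table row would contribute one pair to the result.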

isLastNewsPage

public static boolean isLastNewsPage(java.lang.String pageSource)
Parses the News Page for specific information that signals all available news stories pages have been traversed.

Parameters:
pageSource - the source for the page of interest
Returns:
true if all news pages have been accessed, false otherwise

extractNextPage

public static java.lang.String extractNextPage(java.lang.String pageSource)
Looks in the page for the link to the next available page containing older stories

Parameters:
pageSource - page source code containing the link to the older news stories
Returns:
the link to the older stories page, or null if the link was not found for some reason
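
A minimal sketch of the lookup described above, returning null when no link is found. The link text "Older Stories" and the class name NextPageSketch are assumptions; the real page markup is not documented here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NextPageSketch {
    // Hypothetical markup: <a href="...">Older Stories</a>
    private static final Pattern NEXT_LINK = Pattern.compile(
            "<a\\s+href=\"([^\"]+)\"[^>]*>\\s*Older Stories\\s*</a>",
            Pattern.CASE_INSENSITIVE);

    // Returns the href of the first "Older Stories" link, or null if the
    // page contains no such link.
    static String extractNextPage(String pageSource) {
        Matcher m = NEXT_LINK.matcher(pageSource);
        return m.find() ? m.group(1) : null;
    }
}
```

Because null is a valid result, callers should check the return value before trying to fetch the next page.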

cleanUpURL

private static java.lang.String cleanUpURL(java.lang.String url)
Removes certain filler and session ID number from the url pointing to the next older stories page and formats the url such that it is ready to be used to fetch the older stories

Parameters:
url - "dirty" url with extraneous information
Returns:
valid url capable of directing the downloader / browser to the next older stories page
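
The exact filler and session ID format are not specified, so the following is only an illustration of the kind of cleanup described. It assumes a path-style ";jsessionid=..." segment and HTML-escaped ampersands; both assumptions, and the class name UrlCleanupSketch, are hypothetical.

```java
class UrlCleanupSketch {
    static String cleanUpURL(String url) {
        // Undo HTML entity escaping left over from the page source (assumed filler).
        String cleaned = url.replace("&amp;", "&");
        // Drop a path-style session ID up to the next '?', '&' or '#' (assumed format).
        cleaned = cleaned.replaceAll(";jsessionid=[^?&#]*", "");
        return cleaned.trim();
    }
}
```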

prepareForParsing

private static java.lang.String prepareForParsing(java.lang.String pageSource)
Removes new lines and large white space gaps that make regular expression matching troublesome and extracts the section of the page that contains the news stories

Parameters:
pageSource - the page source to be chopped up
Returns:
the area of the page that contains the news stories, or empty string if the match failed
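
The two steps described above (flatten whitespace, then cut out the news section) might be sketched as follows. The comment markers that delimit the news section are invented for this example, as is the class name PrepareSketch; the real page would need its own delimiters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrepareSketch {
    // Hypothetical markers around the news section.
    private static final Pattern NEWS_SECTION =
            Pattern.compile("<!-- news start -->(.*?)<!-- news end -->");

    static String prepareForParsing(String pageSource) {
        // Collapse every run of whitespace, including line breaks, into one
        // space so that '.' in later patterns can span the whole section.
        String flattened = pageSource.replaceAll("\\s+", " ");
        Matcher m = NEWS_SECTION.matcher(flattened);
        // Empty string signals that the section was not found.
        return m.find() ? m.group(1).trim() : "";
    }
}
```

Flattening first avoids the need for Pattern.DOTALL in the patterns applied afterwards.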

parseStoryProperties

private static java.lang.String[] parseStoryProperties(java.lang.String completeString)
Return the date, time, and source of the given news story

Parameters:
completeString - the string containing the date, time and source with filler text
Returns:
array containing the extracted date, time and source information
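
A sketch of pulling the three fields out of a string with filler text. The input layout ("Posted: 12/05/2005 9:30 AM - Reuters") is a made-up example, and returning null on a failed match is an assumption, since the documentation does not state the failure behaviour. The class name StoryPropertiesSketch is likewise hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StoryPropertiesSketch {
    // Assumed layout: <filler> MM/DD/YYYY H:MM AM|PM - <source>
    private static final Pattern PROPS = Pattern.compile(
            "(\\d{1,2}/\\d{1,2}/\\d{4})\\s+(\\d{1,2}:\\d{2}\\s?[AP]M)\\s+-\\s+(.+)");

    // Returns {date, time, source}, or null if the string does not fit the
    // assumed layout (the null return is this sketch's choice).
    static String[] parseStoryProperties(String completeString) {
        Matcher m = PROPS.matcher(completeString);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3).trim() };
    }
}
```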

extractNewsStories

public static java.util.ArrayList extractNewsStories(java.lang.String pageSource,
                                                     java.lang.String ticker)
Identifies all news stories and creates objects for each news story

Parameters:
pageSource - the source code containing the news stories
ticker - the stock ticker symbol the news stories belong to
Returns:
list containing all the NewsObjects created from the stories in the page, null if the page failed to fetch
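
Taken together, the public methods suggest a paging loop: extract the stories from a page, stop if it is the last page, otherwise follow the link to the older stories. The sketch below illustrates that flow with simplified stand-ins for the three methods. The page markers, the use of plain strings instead of NewsObjects, the in-memory map that replaces a real downloader, and the class name PagingSketch are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PagingSketch {
    // Stand-in: assumes the last page carries the text "No older stories".
    static boolean isLastNewsPage(String pageSource) {
        return pageSource.contains("No older stories");
    }

    // Stand-in: assumes a link of the form href="...">Older
    static String extractNextPage(String pageSource) {
        Matcher m = Pattern.compile("href=\"([^\"]+)\">Older").matcher(pageSource);
        return m.find() ? m.group(1) : null;
    }

    // Stand-in: each <li> is one story; strings replace NewsObjects here.
    static List<String> extractNewsStories(String pageSource, String ticker) {
        List<String> stories = new ArrayList<>();
        Matcher m = Pattern.compile("<li>(.*?)</li>").matcher(pageSource);
        while (m.find()) {
            stories.add(ticker + ": " + m.group(1));
        }
        return stories;
    }

    // Walks pages from newest to oldest. 'site' maps URL to page source and
    // takes the place of a downloader.
    static List<String> collectAll(Map<String, String> site, String startUrl, String ticker) {
        List<String> all = new ArrayList<>();
        String url = startUrl;
        while (url != null) {
            String page = site.get(url);
            if (page == null) {
                break; // treated as a failed fetch
            }
            all.addAll(extractNewsStories(page, ticker));
            if (isLastNewsPage(page)) {
                break;
            }
            url = extractNextPage(page); // null ends the loop
        }
        return all;
    }
}
```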