Information Retrieval and Web Agents (600.466)

Introduction to the PERL Programming Language
(with thanks to Eric Brill)


INTRODUCTION

PERL is a very nice programming language for language processing. Once you learn a small bag of tricks, you should be able to develop programs very rapidly. Below is a simple PERL program:

=======================================================
#!/usr/local/bin/perl
# this is a comment
print "hello, world\n";
=======================================================

The first line specifies where to find perl. On the JHU CS undergraduate and research networks, /usr/local/bin/perl is correct. If your system has perl elsewhere, you should modify this line (type 'which perl' for a path). Comments begin with the symbol #. The third line prints "hello, world". The last character, \n, prints a newline.

To run this program, you would write it to some file, say foo.prl. Then you would make foo.prl executable by executing:

    chmod +x foo.prl

To run the program, you would just type:

    foo.prl            (or ./foo.prl, depending on your path)

You can also type:

    perl foo.prl       to execute the program
    perl -c foo.prl    to check the syntax of the program
                       (much like compiling in C/Java)
    perl -w foo.prl    to turn warnings on
                       (-w can be added to either of the above)


SCALAR VARIABLES

Here is a program that adds two numbers together and prints the sum.

=======================================================
#!/usr/local/bin/perl
$first_num = 10;
$sec_num = 5;
$third_num = $first_num + $sec_num;
print "The sum of ", $first_num, " and ", $sec_num, " is ", $third_num,"\n";
=======================================================

In this program $first_num, $sec_num and $third_num are variables. You do not have to declare variables in perl. Also, perl variables are weakly typed. The first character of a variable indicates its type. $ means that the variable is a scalar. A scalar variable can be a string, an integer or a real number. You do not have to specify which of these it is.
Perl figures it out based on context. Before a scalar has been assigned a value, it is undefined, which acts as 0 in a numeric context and as the empty string in a string context.

Here is another version of the "hello world" program. (We will drop the #!/usr/local/bin/perl line from now on. You still need to include it in your own programs.)

=======================================================
$the_string = "hello world";
print $the_string,"\n";
=======================================================

A period is used for string concatenation. So yet another version:

=======================================================
$hello = "hello";
$world = "world";
$hello_world = $hello . " " . $world . "\n";
print $hello_world;
=======================================================


ARRAYS

Another variable type is the array. Arrays do not have to be predeclared, nor does their size have to be specified. Arrays hold scalar values. A variable beginning with the character @ is an array.

Now things get a bit confusing. @x is the array x. $x is the scalar variable x. These two variables are not in any way related. To reference the first item of the array @x, we use $x[0]. This is because the element stored at index 0 is a scalar, and so the first character indicates that this is a scalar value. For an array @x, the special variable $#x holds the highest index used in the array. So, $x[$#x] will be the last element in array @x.

Here are two versions of a program for assigning values to an array.

====================================================
$x[0] = "dog";
$x[1] = 34;
====================================================

====================================================
# $#x is initially -1, as @x does not exist.
$x[++$#x] = "dog";
$x[++$#x] = 34;
====================================================

In perl, strings and arrays are closely related. In fact, there are two functions for converting from one to the other. split takes a string and splits it into an array. split takes two arguments.
The first argument is a regular expression specifying what to split on; the second argument is the string to split. The code:

=======================================================
$avar = "abcAAdefAAg";
@x = split(/AA/,$avar);
=======================================================

will result in $x[0] holding "abc", $x[1] holding "def", and $x[2] holding "g". split does not alter the contents of $avar.

join takes two arguments. The first is the character sequence that will be placed between elements of the array, and the second specifies the array to be joined. The reverse operation of the split above is:

=======================================================
$x[0] = "abc";
$x[1] = "def";
$x[2] = "g";
$avar = join("AA",@x);
=======================================================


SCOPE AND THE "my" CONSTRUCT

In some of the PERL 5 code that you might see in this class, such as the basic web robot that we give out to you to build upon, we have turned on perl's strict mode ("use strict"). Among other things, strict mode (its "vars" part) requires every variable to be declared before it is used. By default in PERL, all variables are globally scoped. If you define a variable, it will then be accessible anywhere else in that PERL program. This can be fun but also very dangerous. By declaring a program to use strict, we make PERL enforce the declaration of variables, which is where the "my" construct comes in.

    my $foo = "bar";

is the declaration of a variable called $foo that is now locally (lexically) scoped, thanks to the my keyword: it is visible only inside the block in which it is declared. This makes the variable act much like a local variable in C, C++ and Java, where you have to pass the variable to a function in order to use it there. A function that copies its arguments into its own my variables, as the subroutine examples later in this document do, works on its own copy of each value.

This also means that if we define the following:

    my $x = 1;
    my $foo = "bar";
    if ($x == 1) {
      my $foo = "foo";
      print $foo . "\n";
    }
    print $foo . "\n";
The output will be:

    foo
    bar

It is also important to note that the my construct is tightly binding, such that:

    my $foo, $bar;

is the same as typing:

    my $foo; $bar;

So if you want all your variables to be lexically scoped, either put them each on their own line, or type:

    my $foo, my $bar;

or, more commonly, wrap the list in parentheses:

    my ($foo, $bar);

After you've declared a variable and given it lexical scope with the my construct, you can refer to that variable as you usually would. Much like in C, C++ & Java, where you would first declare:

    int x = 0;

and can then refer to that variable:

    x++;

in PERL you can declare:

    my $foo = 0;

and then refer to it:

    $foo++;

This will help keep you out of trouble, especially if you like to use short, undescriptive variable names.

=======================================================

CONTROL STRUCTURES

Perl's control structures are similar to those in C and Java, except that every block must be enclosed in braces, even if it contains only a single statement.

==========================================
$x = 0;
while($x < 5) {
  print $x,"\n";
  ++$x;
}
==========================================
for ($count=0;$count<5;++$count) {
  print $count,"\n";
}
==========================================
if ($x == 2) {
  print "yes\n";
}
else {
  print "no\n";
}
==========================================
if ($x == 1) {
  print "ONE\n";
}
elsif ($x == 2){
  print "TWO\n";
}
else {
  print "OTHER\n";
}
==========================================

For comparing numbers, perl uses the same symbols as C and Java. For comparing strings, eq is true if two strings are equal, and ne is true if two strings are not equal. So,

=======================================================
$x = "yes";
$y = "no";
if ($x eq $y) {
  print "1\n";
}
else {
  print "2\n";
}
=======================================================

will print 2. And

=======================================================
$x = "yes";
$y = "no";
if ($x ne $y) {
  print "1\n";
}
else {
  print "2\n";
}
=======================================================

will print 1.
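
One pitfall with the two families of comparison operators is worth seeing in code: using == on strings compares them as numbers, and a string that does not look like a number counts as 0. The sketch below is an illustration only and not part of the course code; the helper name compare_both_ways is made up for this example, and it assumes a perl with lexical warnings (5.6 or later).

```perl
use strict;
use warnings;

# Hypothetical helper: compares two values once with eq (as strings)
# and once with == (as numbers), returning 1 or 0 for each test.
sub compare_both_ways {
    my ($left, $right) = @_;
    my $as_strings = ($left eq $right) ? 1 : 0;
    my $as_numbers;
    {
        no warnings 'numeric';    # "yes" is not numeric; silence that warning here
        $as_numbers = ($left == $right) ? 1 : 0;
    }
    return ($as_strings, $as_numbers);
}

my ($s, $n) = compare_both_ways("yes", "no");
print "yes vs no:    eq gives $s, == gives $n\n";

($s, $n) = compare_both_ways("10", "10.0");
print "10 vs 10.0:   eq gives $s, == gives $n\n";
```

In both calls eq reports a difference while == reports equality: "yes" and "no" both count as 0 numerically, and 10 equals 10.0 as a number even though the two strings differ. Use eq and ne whenever the values are meant to be text.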

ASSOCIATIVE ARRAYS (HASH TABLES)

The final variable type is the associative array. An associative array is a structure of key, value pairs. A variable beginning with the character % is an associative array. Here are some examples:

    $x{"dog"}          returns the value (a scalar) associated with the key "dog".
    $x{"dog"} = "cat"; sets the value associated with "dog" to be "cat".

Note that associative arrays use curly brackets ($x{2}), while arrays use square brackets ($x[2]). In other words, $x{2} returns the value associated with the key 2 in the associative array %x, whereas $x[2] returns the third element (because indexing starts at 0) of the array @x.

Before a key is inserted into an associative array, the value associated with that key is undefined, which acts as 0 in a numeric context and as the empty string in a string context.

Here's a program to count the number of even numbers in [0..10]:

=======================================================
for ($count=0;$count<=10;++$count) {
  if ($count % 2 == 0)        # if count mod 2 is 0, then even
  {
    $thearray{"EVEN"}++;      # this is shorthand for
                              # $thearray{"EVEN"} = $thearray{"EVEN"} + 1;
  }
}
print "There were ",$thearray{"EVEN"}, " even numbers\n";
=======================================================

If we want to print out all of the keys and values in an associative array, we do the following:

=======================================================
while (($key,$val) = each %thearray) {
  print $key," ",$val,"\n";
}
=======================================================

Note that we use % in front of thearray here because we are talking about the entire associative array, while we use $ for $thearray{"EVEN"} because we are talking about a specific value in the associative array.

If your program takes arguments, you can refer to those arguments much as you do in C. Unlike in C, $ARGV[0] refers to the first argument, not to the name of the program. $ARGV[1] refers to the second argument, and so on. Here's a simple program that takes two files as arguments and tells which file contains more lines.
=======================================================
open(FILE1,$ARGV[0]);
open(FILE2,$ARGV[1]);
while(<FILE1>) {
  $num_lines_1++;
}
close(FILE1);
while(<FILE2>) {
  $num_lines_2++;
}
close(FILE2);
if ($num_lines_1 > $num_lines_2) {
  print "The first file contained more lines\n";
}
else {
  print "The second file contained more lines\n";
}
=======================================================

The line while(<FILE1>) reads lines from FILE1 until it reaches the end of file. When a line is read, it is stored in the special variable $_. This is a program to print all lines from a file.

=======================================================
open(FILE1,$ARGV[0]);
while(<FILE1>){
  print $_;
}
close(FILE1);
=======================================================

To read from stdin, we do not need to call open:

=======================================================
while(<STDIN>) {
  print $_;
}
=======================================================


ANOTHER NOTE ABOUT FILES & STREAMS

Sometimes you will want to make a stream non-buffered. This is important if you are writing a log file, or watching the output of a program and want to see things when they happen and not when the buffer gets filled and flushed. To change an output stream to non-buffered, use the special variable $|

This command will set standard out to non-buffering mode:

===================
$| = 1;
===================

To set non-buffering on another output stream, you first have to select it, and then set the variable. To set non-buffering on standard error, try:

===================
select (STDERR);
$| = 1;
select (STDOUT);
===================


STACKS & QUEUES

Perl makes it very easy to set up a stack or queue. Because arrays grow and shrink dynamically, you can use an array to implement your stack or queue object. Perl comes with the following four functions:

    push, pop, shift, unshift

push & pop deal with the end of the array (the n end) and shift and unshift deal with the front of the array (the 0 end). So to make a stack, just use push & pop.
To make a queue, use push and shift, or unshift and pop, depending on which way you want to move things.

=======================================================
my @stack = ();
$x = 1;
$y = 2;
$z = 3;
push @stack, $x;
push @stack, $y;
push @stack, $z;
print pop(@stack) . "\n";
print pop(@stack) . "\n";
print pop(@stack) . "\n";
=======================================================

will output:

    3
    2
    1

If instead of the pop calls, we used shift (which removes elements from the front of the array):

=======================================================
print shift(@stack) . "\n";
print shift(@stack) . "\n";
print shift(@stack) . "\n";
=======================================================

the output will be:

    1
    2
    3


REGULAR EXPRESSIONS

    \s       matches a whitespace character (space, tab or newline)
    ^        matches the start of a string
    $        matches the end of a string
    a        matches the letter a
    a+       matches 1 or more a's
    a*       matches 0 or more a's
    (ab)+    matches 1 or more ab's
    [^abc]   matches a character that is not a or b or c
    [a-z]    matches any lower case letter
    {m,n}    specifies the number of times we want to see something
             (between m and n times, where n >= m)
    {0,n}    between 0 and n times, n > 0
             (older versions of perl do not accept the shorthand {,n})
    {n,}     n or more times
    {n}      exactly n times
    .        matches any character (except a newline)

To test whether a string in $x contains the string "abc", we can use:

    if ($x =~ /abc/) { . . . }

To test whether a string begins with "abc":

    if ($x =~ /^abc/) { . . . }

To test whether a string begins with a capital letter:

    if ($x =~ /^[A-Z]/) { . . . }

To test whether a string does not begin with a lower case letter:

    if ($x =~ /^[^a-z]/) { . . . }

In the above example, the first ^ matches the beginning of the string, while the ^ within the square brackets means "not".

In addition to using regular expressions for testing strings, we can also use them to change strings. To do this, we use a command of the form:

    s/FROM/TO/options

where FROM is the matching regular expression and TO is what to change this to.
options can either be blank, meaning to only do this to the first match of FROM in the string, or it can be g, meaning do it globally.

To change all a's to b's in the string in variable $x:

    $x =~ s/a/b/g;

To change the first a to b:

    $x =~ s/a/b/;

To change all strings of consecutive a's into one a (note the +, since a* would also match the empty string between every pair of characters):

    $x =~ s/a+/a/g;

To remove all strings of consecutive a's:

    $x =~ s/a+//g;

To remove blanks from the start of a string:

    $x =~ s/^\s+//;

Finally, we can use the related tr operator to translate character sets. tr looks like a regular expression operator, but it works on plain lists of characters. This works nicely for simple shift cyphers and for changing upper to lower case or the other way around. To take all upper case letters and change them to lower case:

    $x =~ tr/A-Z/a-z/;

This has the same effect as the lc function, which returns the lower case version of its argument:

    $x = lc($x);

There is also uc() for making things upper case.


SAMPLE PROGRAMS

A loop that repeatedly takes a line of input with two numbers separated by a space and prints their sum (it runs until the input ends):

======================================================
while(<STDIN>) {
  $_ =~ s/^\s+//;              # removes spaces at start of line,
                               # since we will split on space
  @nums = split(/\s+/,$_);     # we can now easily access the two
                               # numbers
  $answer = $nums[0] + $nums[1];
  print "THE ANSWER IS: ",$answer,"\n";
}
======================================================

A messier way to do this:

=====================================================
while(<STDIN>) {
  $_ =~ s/^\s+//;              # removes spaces at start of line
  $num1 = $_;
  $num2 = $_;                  # makes fresh copies of the input line
  $num1 =~ s/\s+[0-9]+\s*$//;  # strips the second number, leaving the first
  $num2 =~ s/^[0-9]+\s+//;     # strips the first number, leaving the second
  $answer = $num1 + $num2;
  print "THE ANSWER IS: ",$answer,"\n";
}
=====================================================

Given a text, return a list of words and word counts:

=====================================================
while(<STDIN>) {
  $_ =~ s/^\s+//;      # Good idea to always do this.
                       # If the line starts with blanks, then the first
                       # element of the array after splitting would be null
  @words_in_line = split(/\s+/,$_);
                       # splits the line into an array of words
  for ($count=0;$count<=$#words_in_line;++$count) {
    $word_count{$words_in_line[$count]}++;
  }
}
while(($key,$val) = each %word_count) {
  print "$key $val\n";
}
=====================================================

Given a text, return a list of word pairs and their counts:

=====================================================
while(<STDIN>) {
  $_ =~ s/^\s+//;      # Good idea to always do this. If the line
                       # starts with blanks, then the first element
                       # of the array after splitting would be null
  @words_in_line = split(/\s+/,$_);
                       # splits the line into an array of words
  for ($count=0;$count<=$#words_in_line-1;++$count) {
    $word_count{$words_in_line[$count] . " " .
                $words_in_line[$count+1]}++;
  }
}
while(($key,$val) = each %word_count) {
  print "$key $val\n";
}
=====================================================

A program to calculate the frequency of three-letter endings for words in a text:

====================================================
while(<STDIN>) {
  $_ =~ s/^\s+//;
  @words = split(/\s+/,$_);
  for ($count=0;$count<=$#words;++$count) {
    @chars = split(//,$words[$count]);
                       # we split on nothing, which gives
                       # an array of characters.
    if ($#chars > 1) { # make sure there are at least three chars
      $ending{$chars[$#chars-2] . " " . $chars[$#chars-1] . " " .
              $chars[$#chars]}++;
    }
  }
}
while (($key,$val) = each %ending) {
  print "$key $val\n";
}
====================================================

A program that takes two files and outputs all lines in the first file where the same line occurs in the same position in the second file:

=====================================================
open(FILE1,$ARGV[0]);
open(FILE2,$ARGV[1]);
while(<FILE1>) {
  $line_from_2 = <FILE2>;
  if ($_ eq $line_from_2) {
    print $_;
  }
}
close(FILE1);
close(FILE2);
======================================================

Given text labelled with parts of speech, such as

    The/det boy/noun ate/verb . . .

strip off the part of speech tags:

======================================================
while(<STDIN>) {
  $_ =~ s/^\s+//;
  @words = split(/\s+/,$_);
  for ($count=0;$count<=$#words;++$count) {
    $word = $words[$count];    # but word has tag on it
    $word =~ s/\/.*$//;
         # this says: given a substring that starts with
         # a slash and then contains any character sequence
         # until the end of the string, convert it
         # to the null string. Note that we have to
         # backslash the / character in the regular expression.
    print $word," ";
  }
  print "\n";
}
======================================================

Given the same input, this program strips off the words and returns the part of speech tags:

======================================================
while(<STDIN>) {
  $_ =~ s/^\s+//;
  @words = split(/\s+/,$_);
  for ($count=0;$count<=$#words;++$count) {
    $word = $words[$count];    # but word has tag on it
    $word =~ s/^.*\///;        # removes everything up to and
                               # including the last slash
    print $word," ";
  }
  print "\n";
}
======================================================

Return the length of the longest string in a text:

=====================================================
$maxlength = -1;       # so that empty input gives a length of 0
while(<STDIN>) {
  $_ =~ s/^\s+//;
  @words = split(/\s+/,$_);
  for ($count=0;$count<=$#words;++$count) {
    @chars = split(//,$words[$count]);
    if ($#chars > $maxlength) {
      $maxlength = $#chars;
    }
  }
}
$maxlength++;          # must add one, since the array index starts with 0
print $maxlength,"\n";
===================================================

Print out a random number from 1 to 10:

===================================================
srand;                 # sets the random number generator seed
$num = rand(10);       # a real number from 0 up to (but not including) 10
$num = int($num) + 1;  # an integer from 1 to 10
print "$num\n";
===================================================


SUB ROUTINES (FUNCTIONS)

So you want to make your program look like it was written in C and not BASIC. Lucky for you, PERL has subroutines, or functions. Let's make a subroutine:

===================================================
sub f {
  print "Hello world.\n";
}
===================================================

Now from anywhere else we can call it:

===================================================
&f();
===================================================

As in C, Pascal, Java, etc., functions can take arguments and return a value. Since a PERL scalar can hold any kind of value, you can return anything from an integer to a reference (PERL's version of a pointer) to a hash of hashes.
===================================================
$x = 1;
$y = 3;
print &addtwo ($x, $y);

sub addtwo {
  my $x = shift;
  my $y = shift;
  return ($x + $y);
}
===================================================

This code defines a function that adds two numbers, and runs it. The result should be 4. The number of arguments is not specified anywhere with the function, and there is no need to declare the function or a function prototype anywhere in the program. All arguments that are passed to the function arrive in the default argument array (@_) inside the function. Each shift call pulls off the next argument (in the order they were passed) so that we can store it in a local variable.

Because this is dynamic, we can make functions that take a variable number of arguments:

===================================================
$x = 1;
$y = 3;
$z = 4;
print &addtwoorthree ($x, $y);
print &addtwoorthree ($x, $y, $z);

sub addtwoorthree {
  my $x = shift;
  my $y = shift;
  my $z = shift;
  if (defined ($z)) {
    return ($x + $y + $z);
  }
  return ($x + $y);
}
===================================================

The two calls to this function would return 4 and 8 respectively.


===================================================
A FEW USEFUL THINGS FOR INFORMATION RETRIEVAL
===================================================

Hashes of Hashes

In a few cases in IR, we like to make hashes of hashes (especially for the vector query systems). In Perl, this is a simple task. Again, you do not need to declare the hash in advance; you can just start using it. Let us take the example of a search engine. We need to have a hash of document names or identifiers, each document has to have an associated set of tokens (words) that we found in the document, and each token has to have a weight. So, our hash looks something like:

    $doc_vector{$document_id}{$token} = $weight;

An assignment like the one above is all it takes to fill up our hash.
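
To make the structure above concrete, here is a minimal sketch that fills %doc_vector from two tiny documents, using the raw count of each token as its weight. The sample documents, the %documents hash and the choice of counts as weights are assumptions made for this illustration and are not taken from the course code.

```perl
use strict;
use warnings;

# Hypothetical sample data: document id => document text.
my %documents = (
    doc1 => "the cat sat on the mat",
    doc2 => "the dog ate",
);

# Build the hash of hashes: document id => token => weight.
my %doc_vector;
foreach my $document_id (keys %documents) {
    foreach my $token (split(/\s+/, $documents{$document_id})) {
        # The inner hash springs into existence the first time it is used.
        $doc_vector{$document_id}{$token}++;
    }
}

# Print every (document, token, weight) triple in sorted order.
foreach my $document_id (sort keys %doc_vector) {
    foreach my $token (sort keys %{$doc_vector{$document_id}}) {
        print "$document_id $token $doc_vector{$document_id}{$token}\n";
    }
}
```

Sorting the keys is only there to make the output order predictable; a plain keys call returns them in no particular order.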

Now, let us say that we want to go through and add a weighting to all tokens in all documents, based upon the document number: the first document visited has the weight of each token multiplied by 1, the second by 2, etc. (Note that keys returns the document ids in no particular order; use sort keys if the order matters.) To traverse the hash of hashes, we can use a nested foreach loop:

    $count = 1;
    foreach $document_id (keys %doc_vector) {
      # at this point, the $document_id variable
      # will be set to a new document_id at each
      # pass of the foreach loop ... we now
      # want to traverse the keys for that
      # document's tokens...
      foreach $token (keys %{$doc_vector{$document_id}}) {
        # $token will now enumerate over all tokens
        # for that document_id
        $doc_vector{$document_id}{$token} *= $count;
      }
      $count++;
    }

The %{..} syntax above takes the entry of %doc_vector for that document_id, which is a reference to a hash, and dereferences it so that the keys function, which returns all the keys in a hash, can be run on it.

Another example would be an array variable in which each array entry contains a hash. For example, we could be making a document vector in which document IDs are integers from 0 to n. We could then say:

    $sum = 0;
    for ($x = 0; $x < scalar (@doc_vector); $x++) {
      foreach $token (keys %{$doc_vector[$x]}) {
        $sum += $doc_vector[$x]{$token};
      }
    }

The scalar (@doc_vector) construct returns the number of elements in the array (in this case we could have substituted n+1, but if we are growing the array and don't always know its exact size it is easiest to use the scalar construct). The above code will sum up the values of each token for each document and put the total in the $sum variable.

===================================================

While Loops & Regular Expressions

In some cases in Information Retrieval, we'll want to process a text one item at a time. For example, we might want to take the source code of a web page ($page) and pull out each link in it, one at a time. The following loop does a substitution, finding a link that matches our pattern.
We use the ()'s around different key areas in the pattern to fill our default variables ($1, $2, $3, etc.). At each iteration, the pattern is matched, $1 is set to the entire link, $2 is set to the URL and $3 to the text of the link, and then the entire link from <a to </a> is replaced with nothing, so that the next iteration of the loop will match the next link. Once the pattern can't be matched, the while condition fails and the while loop ends.

    while ($page =~ s/(<a href="([^"]*)">([^>]*)<\/a>)//i) {
      $url = $2;
      $text = $3;
      # now we can do whatever we want with the
      # url and text variables
    }

The pattern that we are trying to match in this case is a link, in the format of:

    <a href="URL">My WebSite</a>
             ---  ----------
             $2   $3
    ----------------------------
    ... and the whole link becomes $1.

===================================================

If in doubt, look at the manual pages for perl. In addition to hard-copy books, there are a number of on-line perl manuals. Look at the class home page for pointers to perl pages.

To get info on a perl module, you can also use the "perldoc" command, which works just like man. For example:

    perldoc LWP::UserAgent

will give you the man page for the UserAgent class in the LWP library.
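
One more sketch, returning to the link-extraction loop in the section on while loops and regular expressions: if you would rather not destroy $page while walking through it, a match with the g option in a while condition steps through the matches one at a time and leaves the string untouched. The function name extract_links and the sample page below are made up for this illustration, and the pattern assumes links written exactly as <a href="...">text</a>, with a double-quoted href as the only attribute.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [url, text] pairs, one per link.
sub extract_links {
    my ($html) = @_;
    my @links;
    # With /g, each pass of the loop resumes where the previous match ended.
    while ($html =~ /<a href="([^"]*)">([^<]*)<\/a>/gi) {
        push @links, [$1, $2];
    }
    return @links;
}

# Hypothetical sample page; any HTML string would do.
my $page = '<p><a href="http://www.example.com/">Example</a> and '
         . '<A HREF="http://www.example.org/docs">Docs</A></p>';

foreach my $link (extract_links($page)) {
    my ($url, $text) = @$link;
    print "$url => $text\n";
}
```

Real web pages are much messier than this pattern allows for (extra attributes, single quotes, tags nested inside the link text), so treat it as a starting point rather than a complete link parser.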