Awards
AT&T Fellow, 2001
Association for Computational Linguistics Vice President-elect, 2010

Publications
See here for a list of publications.
Many of my papers can be found in the ACL Anthology. See
here for a ranking of authors in the Anthology by h-index.
According to Google Scholar, my top 3 cited papers are:
1. Word association norms, mutual information, and lexicography, 1990 (with Patrick Hanks)
2. A stochastic parts program and noun phrase parser for unrestricted text, 1988
3. A program for aligning sentences in bilingual corpora, 1993 (with Bill Gale)
Some recent papers with citations (according to Google Scholar)
Education
Received Ph.D., MS, and SB degrees in 1983, 1980, and 1978, all in Computer Science from MIT.

Undergraduate
thesis, supervised by Richard Greenblatt, on chess programming. Thanks to considerable help and
encouragement from Hans Berliner, it was published in IJCAI-79 as Co-ordinate Squares: A Solution to Many
Chess Pawn Endgames.

Master's thesis, supervised by Peter Szolovits, argued that a finite state machine (FSM) was sufficient for processing natural language, and that one did not need the much more complex mechanisms (ATNs or Turing Machines) that were in vogue at the time. The thesis was originally intended to
help Szolovits' Expert System MEDG (Medical Decision Making Group) generate
better explanations. Became convinced
that the MEDG explanations were difficult to understand because they were
full of center-embedding, a kind of recursion that requires unbounded memory to process (a classic example: “The rat the cat the dog chased bit died”). This argument was based on Chomsky's 1959 proof that center-embedding characterizes the difference between pushdown automata and finite state machines (FSMs). The thesis ultimately evolved into a demonstration that language could be processed on an FSM by implementing a
parser based on Mitch Marcus' thesis and Kaplan-Bresnan's Lexical-Functional
Grammar, and showing that the stack did not need more than about three cells
as long as one used certain compiler optimizations such as tail recursion to
conserve memory.

Doctoral thesis, supervised by Jon Allen, argued that well-known parsing methods could be used to help take advantage of allophonic variation in speech. The phoneme /t/, for example, is
typically realized with a heavily aspirated burst at the beginning of a
syllable as in Tom, but not at the
end of a syllable as in atlas. At
the time, these sorts of variations were seen as a kind of noise that a
speech recognition machine would have to overcome. I preferred to view the speech system as consisting of two
types of clues, those that varied systematically with the suprasegmental
(syllable structure and stress) context such as aspiration and those that
tended to be relatively invariant to contextual influences such as place,
manner and voicing. Both types
of clues are useful. Variant
clues can be used in a parser to discover the underlying context, and
therefore should not be considered noise. An abbreviated version of the thesis was published in Cognition in 1987; the complete thesis
was published by Kluwer the following year.

Work Experience

1983-1990
Joined
AT&T Bell Laboratories research in 1983 to work with Mitch Marcus and
Mark Liberman in Osamu Fujimura's Linguistic Research and Artificial
Intelligence department. Worked
on a number of problems in speech and language, especially part-of-speech tagging, pronunciation of surnames for speech synthesis, and text analysis. Published a statistical trigram part-of-speech tagger in 1988. The approach has since become standard, but at the time it
was believed that much more complex mechanisms such as ATNs or Turing
Machines were required. (There was also a strong bias at the time against
empirical/statistical approaches to language.) My
interest in text analysis grew out of a collaboration with the lexicographer Patrick Hanks. I ran into a number of lexicographers
while working on letter-to-sound rules for speech synthesis. At the time, letter-to-sound rules
were being used to guess when the letter “t” should be pronounced with the
phoneme /t/ and when it shouldn't.
Word accuracy rates were unacceptably low. After acquiring the rights to a dictionary, word accuracy
quickly climbed from about 80% to 99.9% or better. The dictionary-based approach has since become standard,
but at the time there was considerable controversy over the “larger” memory
requirement (0.25 megabytes).
Ever since, I have been acquiring data just about as fast as I have
been able to acquire disks.
These days, we buy disks by the terabyte. After
this successful interaction with lexicography, Patrick Hanks visited Bell
Laboratories for a month in early 1989.
He showed me how lexicographers were beginning to use large corpora to
find collocations, words that are often found together like doctor/nurse, or save...from. We found ways to automate some of the
drudgery by using well-understood statistical measures like mutual
information.
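The calculation is simple enough to sketch in a few lines. This is a toy version, not the code behind the 1990 paper; the corpus and window size below are made up:

```python
import math
from collections import Counter

def pmi_table(sentences, window=5):
    """Pointwise mutual information, PMI(x, y) = log2( P(x, y) / (P(x) P(y)) ),
    estimated from co-occurrence within a fixed-size window."""
    word_counts = Counter()
    pair_counts = Counter()
    total_words = 0
    total_pairs = 0
    for tokens in sentences:
        word_counts.update(tokens)
        total_words += len(tokens)
        for i, x in enumerate(tokens):
            for y in tokens[i + 1:i + 1 + window]:
                pair_counts[tuple(sorted((x, y)))] += 1
                total_pairs += 1
    pmi = {}
    for (x, y), n_xy in pair_counts.items():
        p_xy = n_xy / total_pairs
        p_x = word_counts[x] / total_words
        p_y = word_counts[y] / total_words
        pmi[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return pmi

# Toy corpus: "doctor" and "nurse" co-occur more often than chance predicts.
corpus = [
    "the doctor asked the nurse for the chart".split(),
    "a nurse paged the doctor on call".split(),
    "the lawyer read the chart in court".split(),
]
scores = pmi_table(corpus)
print(sorted(scores.items(), key=lambda kv: -kv[1])[:5])
```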
Began working with William Gale, a statistician at Bell Laboratories, on estimating the probability of rare events in large corpora, comparing the Good-Turing method with a held-out estimator. Over the next five years, ending with Gale's retirement, we published a series of papers together on spelling correction, word-sense disambiguation, aligning parallel corpora, and alternatives to the Poisson for modeling the distribution of words in text.
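As a rough illustration of the Good-Turing idea (a bare-bones sketch, not the Gale-Church estimator, which also smooths the frequency-of-frequency counts):

```python
from collections import Counter

def good_turing_adjusted_counts(tokens):
    """Simple (unsmoothed) Good-Turing: r* = (r + 1) * N_{r+1} / N_r,
    where N_r is the number of distinct words seen exactly r times."""
    counts = Counter(tokens)
    freq_of_freq = Counter(counts.values())          # N_r
    adjusted = {}
    for word, r in counts.items():
        n_r = freq_of_freq[r]
        n_r1 = freq_of_freq.get(r + 1, 0)
        adjusted[word] = (r + 1) * n_r1 / n_r if n_r1 else r
    # Probability mass reserved for unseen words: N_1 / N
    total = sum(counts.values())
    p_unseen = freq_of_freq.get(1, 0) / total
    return adjusted, p_unseen

tokens = "to be or not to be that is the question".split()
adjusted, p_unseen = good_turing_adjusted_counts(tokens)
print(adjusted, p_unseen)
```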
1990-1991
Visited Ed Hovy at USC/ISI for an academic year, with support from ARPA for work on machine
translation and parallel corpora (texts such as the Canadian Parliamentary
debates that are published in two languages). Spent the time reading in many areas, especially in
machine translation (MT) and information retrieval (IR). Started attending meetings in both
areas. Became very active in
corpus lexicography, publishing a number of papers with Hanks, Gale, Yarowsky and others. Also used the time to help members of
the community get started in corpus-based research by serving on data
collection committees such as the ACL/DCI (the Association for Computational Linguistics' Data Collection Initiative), writing review articles (for Computational
Linguistics, Machine Translation, and CACM) and tutorials (Coling-90, 1992
European Summer School, ACL-95).
Taught a graduate seminar course at USC with Ed Hovy in the spring
term. Started reviewing large
numbers of papers for conferences, journals and funding agencies, perhaps as
many as a hundred papers in some years.

1991-1994
Returned to Bell Laboratories in fall 1991 and switched into a software engineering
organization, headed by Peter Weinberger. Continued my interest in large
corpora, by showing how a large multi-million-line program could be thought of as a large corpus, and how corpus-based techniques could be used in the
discovery process, the process of reading a large program for the first
time. Wrote a program with Jon
Helfman called dotplot that helps
find self-similarity (copied regions) in text (large corpora, source code,
object code) using methods borrowed from biology for matching long sequences
of DNA.
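The idea behind dotplot is easy to sketch (a toy version, not Helfman's implementation): put a dot at position (i, j) whenever the n-gram starting at i matches the n-gram starting at j; copied regions show up as long diagonal runs of dots.

```python
def dotplot(text, n=4):
    """Return the set of (i, j) positions whose n-grams match."""
    grams = {}
    for i in range(len(text) - n + 1):
        grams.setdefault(text[i:i + n], []).append(i)
    dots = set()
    for positions in grams.values():
        for i in positions:
            for j in positions:
                dots.add((i, j))
    return dots

def ascii_render(text, n=4):
    dots = dotplot(text, n)
    size = len(text) - n + 1
    for i in range(size):
        print("".join("*" if (i, j) in dots else "." for j in range(size)))

# The copied phrase produces an off-diagonal line.
ascii_render("the cat sat; the cat sat.")
```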
Developed the Termworks platform with AT&T Language Line, a commercial translation service. The platform was designed to help
translators with technical terminology and translation reuse. Translators don't need help with easy
vocabulary and easy grammar (the focus of most machine translation research)
but they do need help with technical terms like dialog box. The
solutions to these terminology puzzles can be surprising, even to a native
speaker of the language.
Microsoft uses finestra (window) in their Italian manuals, but boîte (box) in their French manuals. The
tools help translators conform to house standards by extracting examples of
the term and its translations from a previous release of the manual that has
already been published in both languages. A statistical alignment program is used to figure out
which parts of the source text correspond to which parts of the
translation. The tool and its
implications for machine translation research were presented at an invited
talk at EACL-93. Pierre Isabelle and I edited a special issue of the journal Machine Translation
(1997 12:1/2) on tool-based alternatives to machine translation. Worked
on a digital library project based on a collection of 500,000 pages of
AT&T memos supplied by the Murray Hill Library in 400 dots-per-inch
bitmap format (20 gigabytes).
Demonstrated how off-the-shelf optical character recognition (OCR) and
information retrieval (IR) programs could be used to implement a
“fax-a-query” information retrieval service where both the input and output
could be faxes or bitmaps or any format that could be converted to a fax or
bitmap. Argued at Coling-94 that fax had advantages over proposed ASCII-based
standards such as SGML (availability, ease-of-use, longevity). Began work on bitmap-level matching,
avoiding the need for OCR. Found
that bitmap-level matching could also be used to improve up-sampling and
down-sampling of documents, which is important if the scanner, the fax
transmission line and the printer use different sampling rates.

1995-1996
Taught
a seminar course on corpus-based methods at Penn with Mitch Marcus. Served on thesis committees for
Pascale Fung and Vasileios Hatzivassiloglou, both at Columbia. In Pascale's case, I was in a certain
sense her de facto advisor. Her
interest in aligning parallel and non-parallel corpora, which ultimately became
her thesis, started with a summer job in my lab at AT&T. Switched into a new organization
headed by Eric Sumner, focusing on network services (issues more closely
aligned with the core telephone business than speech and language). Worked on some systems problems such
as how to replace a billing factory with a few PCs. There is a tendency to believe that if you have a serious
“big data” problem (e.g., billing, datamining), then you need a serious
(=expensive) computer center full of mainframes, tape robots and
bureaucracy. Attempted to show
that a gang of PCs could replace a mainframe, but discovered, at least in one
application (Billing Edge), that a single PC could replace a gang of
mainframes.

1996-
Promoted to head a new department in research. AT&T has vast quantities of data; some of it is textual, but most of it is
not. The department uses
visualization and datamining, methods that have become popular in
corpus-based natural language processing, to find patterns of interest. Imagine that we have a large steady
stream of usage records indicating which customers are buying which services
and when. This kind of data feed
is usually developed for billing purposes, but many companies are discovering
a broad range of other applications such as fraud detection, inventory
control, provisioning scarce resources, forecasting demand, etc. The problems are challenging from a
research point of view because of the massive scale. The telephone network, for example,
has about 100 million customers (nodes), who purchase about 100 million calls
(edges) per day. This data
stream is much larger than the very large corpora that I had just a few years
ago. We used to think that a
billion words of text was a lot of data, but we are currently receiving a
couple of billion records a week.

1998-1999
Developed
with Glenn Fowler and Don Caldwell a program called pzip that is like gzip, but uses an automatic training procedure to induce an
optimal schema for compressing tabular data such as database records and
spreadsheets. Sometimes it is
better to compress a table in row major order and sometimes it is better to
compress a table in column major order.
Pzip uses a schema to enumerate the values in the table and then passes the results to gzip. Another program, pin (partition inducer), uses dynamic programming to find the optimal partition (or schema) by applying gzip to all possible partitions of a sample of the table. Pzip typically saves a factor of two in both space and time over gzip, which has been very much appreciated by several large data warehousing applications. A paper was published in SODA-2000. Won a “Circle of Excellence” award in 1998 from the AT&T CIO (Chief Information Officer) for this work.
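The row-major vs. column-major intuition is easy to demonstrate. The sketch below illustrates the idea, not pzip or pin themselves, and the toy table is made up:

```python
import gzip
import random

random.seed(0)
# A toy "billing" table: each record is (area_code, plan, minutes).
records = [(random.choice(["212", "973", "908"]),
            random.choice(["basic", "premium"]),
            str(random.randint(0, 500)))
           for _ in range(10000)]

row_major = "\n".join(",".join(r) for r in records).encode()
col_major = "\n".join(",".join(col) for col in zip(*records)).encode()

print("row-major   :", len(gzip.compress(row_major)), "bytes")
print("column-major:", len(gzip.compress(col_major)), "bytes")
# pzip's partition inducer (pin) goes further: it searches over groupings of
# columns (a schema) on a sample of the table, picking whichever partition
# compresses best.
```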
Established excellent relationships with AT&T marketing, especially the folks behind
the new Lucky Dog flanker brand. My team maintains a web site that
tracks sales in “real time” (faster than previously possible). This web site is changing the culture
in marketing. Now that they know
they can get quick feedback, they are much more willing to experiment. Try a little bit of this and a little
bit of that, and do more of what works and less of what doesn't. The web site has moved SWIFT from a
research prototype into a highly visible business reality. SWIFT was a very nice demo that
visualized traffic on the AT&T network on a map projected on a powerwall,
a huge wall-sized terminal. The
point of the demo is that we ought to be able to react to a change in the
marketplace in less than a day.
To make the idea a business reality, we simplified the interface so
that marketing managers could drive it, and replaced the fancy powerwall
hardware with a standard web browser.
In addition, we worked closely with marketing managers to understand
what matters to them, a task that reminded me of my days in Peter Szolovits'
expert systems group at MIT.

1999-2003
Built
upon our success with Lucky Dog to establish relationships with other
marketing groups, especially the launch of AT&T local service in New York
and Texas. Used datamining
techniques to discover considerable room for improvement in operations. Won “salesman of the month” in August
2000 for finding large numbers of sales that could be “reflowed.” This discovery played a major role in
the team meeting its sales commitments. Led a
team on virtual integration that used methods such as web crawling and screen
scraping to produce a view that made a bunch of disparate databases appear to
be more unified than they were. Despite
my new interests in datamining, visualization, management and business, I
remain active in computational linguistics, my first love. With so many responsibilities, it has
been more difficult to publish, but nevertheless, together with Yamamoto, we
showed how to compute term frequencies and document frequencies, two
statistics used in Information Retrieval, for all substrings in a large
corpus. It is well known how to
compute these statistics for short substrings, e.g., unigrams, bigrams and
trigrams, but we showed how to compute many of these kinds of statistics for
all substrings including very long ones, e.g., million-grams. One might have thought that this would be impractical since there are n² substrings, but it turns out that the n² substrings can be grouped into n classes using a data structure called suffix arrays, and many of the statistics of interest can be computed over the n classes rather than the n² substrings.
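To give a flavor of the trick (a simplified sketch, not the algorithm from the paper with Yamamoto): once the suffixes of the corpus are sorted, the term frequency of any substring, even a very long one, is just the width of the interval of suffixes that begin with it; document frequencies and the grouping into n classes build on the same structure.

```python
import bisect

def build_suffix_array(text):
    # O(n^2 log n) toy construction; production code would use a faster algorithm.
    return sorted(range(len(text)), key=lambda i: text[i:])

def term_frequency(text, sa, substring):
    """Count occurrences of substring by binary search over the sorted suffixes."""
    suffixes = [text[i:] for i in sa]          # materialized only for clarity
    lo = bisect.bisect_left(suffixes, substring)
    hi = bisect.bisect_right(suffixes, substring + "\uffff")
    return hi - lo

text = "to be or not to be"
sa = build_suffix_array(text)
print(term_frequency(text, sa, "to be"))   # 2
print(term_frequency(text, sa, "be"))      # 2
print(term_frequency(text, sa, "q"))       # 0
```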
Since we were able to compute these statistics over a larger set of words and phrases than
previously possible, we found an interesting new heuristic for distinguishing
good keywords from general vocabulary, which seems promising for both
Japanese and English. Both
mutual information and a new measure called residual IDF (inverse document
frequency) compare an observation to chance, but they use different notions
of chance. Mutual information
looks for deviations from compositionality. It grew out of work with lexicographers (Hanks) and so it
isn't surprising that it is better for finding general vocabulary, the kind
of words and phrases that one would expect to find in a dictionary. On the other hand, residual IDF looks
for words and phrases that are not randomly distributed across
documents. It grew out of the
information retrieval literature and so it isn't surprising that it is better
for finding keywords, the words and phrases that distinguish one document
from another.
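In code, the comparison looks roughly like this (my paraphrase of residual IDF, with made-up counts): keywords are burstier than a Poisson model predicts, while general vocabulary is not.

```python
import math

def residual_idf(df, tf, num_docs):
    """Observed IDF minus the IDF predicted by a Poisson model.
    df: number of documents containing the term
    tf: total occurrences of the term in the collection
    num_docs: number of documents in the collection"""
    observed_idf = -math.log2(df / num_docs)
    # Under a Poisson with rate tf/num_docs, a document misses the term
    # with probability exp(-tf/num_docs).
    predicted_idf = -math.log2(1.0 - math.exp(-tf / num_docs))
    return observed_idf - predicted_idf

# A bursty keyword: 100 occurrences packed into 10 of 1000 documents.
print(residual_idf(df=10, tf=100, num_docs=1000))   # well above zero
# A general-vocabulary word: 100 occurrences spread over 95 documents.
print(residual_idf(df=95, tf=100, num_docs=1000))   # close to zero
```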
2003-2009
Joined David Heckerman's Machine Learning Group in Microsoft Research and later
joined Eric Brill’s Text Mining, Search and Navigation research group (TMSN), currently
headed by Chris Burges. Web logs
are similar to logs of telephone calls (call detail records). Usage logs are quite a bit
larger than the documents themselves.
As big as the web is, there are more readers than writers, and
therefore usage logs (such as click logs) tend to be even larger than the
documents (web crawls). We used
to think that the AP news was a large corpus, but that is just a million
words a week. Telephone
companies and web search engines see much larger traffic volumes (e.g., a few
hundred million records per day). Big
data problems become more manageable if we can make them smaller. We used a method borrowed from the
Unix Spell program for the context-sensitive speller in Office-2007. For example, if you use here instead of hear in a context such as, “you should here what he said,” then Office-2007 will underline the typo with blue squiggles. The method uses a standard trigram language model, but heroic compression methods were required to compress the language model down to just a few megabytes. Doug McIlroy faced a similar problem when he wrote the Unix Speller. He used Golomb coding to squeeze a 32,000-word dictionary into the address space of a PDP-11 (64k bytes). That was just 2 bytes per word. We used a similar method to compress the trigram language model to just a few bytes per trigram.
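Golomb codes fit because the gaps between sorted (hashed) entries are roughly geometrically distributed. A minimal sketch of the power-of-two special case (Rice coding), not the Unix spell or Office-2007 code:

```python
def rice_encode(gap, k):
    """Rice code (Golomb code with m = 2**k): quotient in unary, then
    the remainder in k binary bits.  Gaps between sorted entries are small
    and roughly geometric, so the codes are short."""
    q, r = gap >> k, gap & ((1 << k) - 1)
    return "1" * q + "0" + format(r, "0{}b".format(k))

# Encode the gaps between sorted hash values, as a spell checker might.
sorted_hashes = [3, 17, 18, 40, 41, 65]
gaps = [sorted_hashes[0]] + [b - a for a, b in zip(sorted_hashes, sorted_hashes[1:])]
k = 4                              # in practice tuned to the mean gap
encoded = "".join(rice_encode(g, k) for g in gaps)
print(gaps, "->", encoded, "({} bits total)".format(len(encoded)))
```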
Ping Li and I have written quite a few papers on dimension reduction. Methods such
as random projections and sketches make it possible to find strong
associations from small samples.
Ping Li won a best student paper at KDD-2006 for his work on random
projections. Sketches are
probably even better. We
should not have to look at the entire corpus (e.g., the Web) to know if two
(or more) words are associated or not. Sketches were originally introduced to
remove duplicate Web pages. We generalized sketches to estimate contingency tables and associations, using a maximum likelihood estimator to find the most likely contingency table given a sample and the margins (also known as document frequencies in information retrieval).
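A minhash-style toy gives the flavor (it is not the maximum likelihood estimator from these papers): keep the k smallest hashed document ids for each word, estimate the Jaccard overlap from the sketches, and recover the co-occurrence count from the margins.

```python
import hashlib
import random

def h(doc_id):
    # A stable hash; any good hash of document ids will do.
    return int(hashlib.md5(str(doc_id).encode()).hexdigest(), 16)

def sketch(postings, k=200):
    """Keep only the k smallest hashed document ids (precomputed offline in practice)."""
    return set(sorted(h(d) for d in postings)[:k])

def estimate_cooccurrence(post_x, post_y, k=200):
    sx, sy = sketch(post_x, k), sketch(post_y, k)
    union_sample = sorted(sx | sy)[:k]                 # k smallest hashes of the union
    both = sum(1 for v in union_sample if v in sx and v in sy)
    j = both / k                                       # Jaccard estimate
    df_x, df_y = len(set(post_x)), len(set(post_y))    # margins (document frequencies)
    return j * (df_x + df_y) / (1.0 + j)               # |X & Y| = J(|X|+|Y|)/(1+J)

random.seed(1)
docs_x = set(random.sample(range(100000), 5000))
docs_y = set(random.sample(range(100000), 5000)) | set(list(docs_x)[:1000])
print("true co-occurrence     :", len(docs_x & docs_y))
print("estimated co-occurrence:", round(estimate_cooccurrence(docs_x, docs_y)))
```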
Many techniques that work well on text corpora also work well on web logs. With Qiaozhu Mei in WSDM-2009,
we argued that entropy is a useful tool for estimating how hard search is, as well as the potential for personalization. We estimated the entropy of queries, urls and IP addresses, taken one at a time, two at a time, and three at a time. We found that it takes 20-something
bits to guess the next query, the next url or the next user. But search is easier. If we give you the query, then it
takes about 3 bits to guess the url that the user will click on. Personalized search is even
easier. If the search engine is
given both the query and the IP address, then the entropy is cut in
half. There is a huge
opportunity for personalization to improve the search experience. It is much easier to answer the
question if we know something about the audience. We borrowed a number of techniques from language
modeling in speech and language such as Katz backoff. With lots of backoff, personalization
starts to look like market segments in database marketing.
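The entropy calculation itself is only a few lines. A toy illustration with made-up click records (nothing like the real query logs): H(url) is how many bits it takes to guess the click, and H(url | query) is how much easier the guess becomes once the query is known.

```python
import math
from collections import Counter

def entropy(events):
    counts = Counter(events)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def conditional_entropy(pairs):
    """H(Y | X) = H(X, Y) - H(X) for a list of (x, y) pairs."""
    return entropy(pairs) - entropy([x for x, _ in pairs])

# Made-up click log: (query, clicked url) pairs.
log = [("weather", "weather.com"), ("weather", "noaa.gov"),
       ("weather", "weather.com"), ("acl 2024", "aclanthology.org"),
       ("jhu", "jhu.edu"), ("jhu", "cs.jhu.edu"), ("jhu", "jhu.edu")]

urls = [url for _, url in log]
print("H(url)         =", round(entropy(urls), 2), "bits")
print("H(url | query) =", round(conditional_entropy(log), 2), "bits")
```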
Bo Thiesson and I developed “The Wild Thing!” Users are encouraged to enter queries with wild
cards. For example “C Rice” or
“2#7423” is treated as a regular expression. The system returns the k-best (most popular) matches, which include “Condoleezza Rice.”
The method is particularly attractive for queries that are hard to
spell, or for scenarios where it is hard to type (like mobile phones). On a phone, “2#7423” maps to “C Rice”
since “C” is on the 2 key, and “R” is on the 7 key and so on. To make the matching more
efficient, we introduced K-best suffix arrays, a variant on suffix arrays
designed to find the K-best matches, as opposed to all matches. “Best” can be defined by any total order on the strings, such as popularity (frequency in the logs).
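A toy version of the matching, with a made-up query log: expand each keypad digit to its letters, treat “#” as a wildcard, and return the most popular matches. (The real system uses k-best suffix arrays precisely so that it does not have to scan the whole log the way this sketch does.)

```python
import re
from collections import Counter

# Standard phone keypad mapping; '#' is treated as "match anything".
KEYPAD = {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
          "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz", "0": " "}

def pattern_to_regex(pattern):
    out = []
    for ch in pattern.lower():
        if ch == "#":
            out.append(".*")
        elif ch in KEYPAD:
            out.append("[" + KEYPAD[ch] + "]")
        else:
            out.append(re.escape(ch))
    return re.compile("^" + "".join(out) + "$")

def k_best(pattern, query_log, k=3):
    """Return the k most popular logged queries matching the pattern."""
    regex = pattern_to_regex(pattern)
    counts = Counter(query_log)
    hits = [(q, c) for q, c in counts.items() if regex.match(q)]
    return sorted(hits, key=lambda qc: -qc[1])[:k]

log = ["condoleezza rice", "c rice", "condoleezza rice", "chris rice",
       "cheap flights", "condoleezza rice"]
print(k_best("2#7423", log))   # '2' -> c, '#' -> anything, '7423' -> rice
```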
With Albert Greenberg and James Hamilton, we published a paper in Hotnets-2008
comparing the mega-model and the micro-model for data centers. If we want to put a million servers
in the cloud, should we deploy them in a single mega-data center, or
distribute them across a large number of micro-data centers? Given how much the industry is
currently investing in the mega model, it would do well to consider
the micro alternative, especially for embarrassingly distributed applications
such as web serving and email.

2009-
Chief Scientist of the Human Language Technology Center of Excellence at The Johns Hopkins University.

Community Service
Served on the editorial boards of Computational
Linguistics (CL), Machine
Translation (MT), Quantitative
Linguistics (QL), International
Journal of Corpus Linguistics (IJCL). Served on program committees for numerous conferences
including SIGIR and ACL, both more than once. Helped to create SIGDAT, a special
interest group focusing on corpus data, and organized, chaired, or co-chaired
many of its annual workshops on very large corpora, which have evolved into
the EMNLP conferences with 200+ participants per meeting, about equally
distributed over Europe, Asia and America. Co-chaired ACL-99, the major meeting in computational
linguistics in 1999. Co-chaired
the Industrial Track of KDD-2000,
the major meeting in datamining in 2000.