December 2024
In class this week, I told a little story from 25 years ago about text classification. Back in 1999, the Educational Testing Service (ETS) ran a little contest: “Try to fool our automatic essay grader.”
My NLP students were interested to see the winning essays, so I dug them out of old email and have included them below. Feel free to skip over everything else and jump to them.
I won the contest by the one weird trick of reading their paper, which revealed that they were just doing linear regression on some interpretable features. This struck me as a bad model. Linear regression says “more is better” or “less is better” (for each feature). But sometimes you want to be in the Goldilocks zone and avoid extreme values in either direction. Since there was no limit on essay length, it was easy for me to drive some of their simple feature functions to extremely positive values without affecting the others. I knew this would drive their prediction to be either extremely positive or extremely negative, depending on the coefficients of their model.
Page (1967) was the original paper on automated essay grading. It’s a fascinating snapshot of applied NLP at the time. It, too, used regression on various features, but the important thing is that Page even thought of the task at a time when computer processing of text was still very difficult. Jill Burstein at ETS told me: “Ellis Page was pretty much the visionary.” She met him later, when he was quite old. Project Essay Grade began at the University of Connecticut in 1964. In the paper above, they talk about their work on deeper features (“a phrase analyzer”). They needed keypunch operators to transfer students’ essays to punched cards, so just grading the essay by hand might have been faster!
Fast-forwarding to the present day, there’s been a workshop series since 2003 on NLP for Building Educational Applications (link is again to the ACL Anthology). You may find it interesting to browse the titles. In 2024 (the 19th workshop in this series), there were 38 papers accepted (from 88 submissions), 7 of them with the word “essay” in the title. Of course, many of them used LLMs. ETS is one of the sponsors of the workshop, and Jill is one of its longtime organizers.
As far as I know, ETS still uses their e-rater system to help with essay scoring on real exams, and to provide feedback to students on practice essays. The last I heard, on real exams they average e-rater’s score with a human’s score, unless those two scores differ by more than 1 point, in which case they replace e-rater with a second human. They provide some information about the current version. Based on that page, it seems possible that they are still using linear regression on hand-crafted features.
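A minimal sketch of the score-combination rule just described. The function name `final_score` and its arguments are hypothetical, and averaging the two human scores in the fallback case is an assumption: the description above says only that e-rater's score is replaced by a second human's.

```python
from typing import Callable


def final_score(human: float, erater: float,
                second_human: Callable[[], float]) -> float:
    """Combine one human score with e-rater's score (hypothetical sketch)."""
    if abs(human - erater) > 1:
        # Large disagreement: drop e-rater's score and bring in a second
        # human reader. Averaging the two humans is an assumption.
        return (human + second_human()) / 2
    # Otherwise average the human score with e-rater's score.
    return (human + erater) / 2


# The second reader is only consulted when the scores disagree by more than 1.
print(final_score(4, 5, second_human=lambda: 3))  # uses e-rater
print(final_score(1, 6, second_human=lambda: 2))  # ignores e-rater
```

Under this rule, a 6-versus-1 disagreement like the one reported for the 1999 contest below would be settled by human readers alone.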
Powers et al. (2001) described the 1999 contest. They wrote:
The clear “winner” in our contest was an issue essay submitted by a professor of computational linguistics.
(That was me, though actually I was still a grad student at the time of the contest.)
His principal strategy was simply to write several paragraphs and to repeat them (37 times, in fact!).
(Actually only one paragraph, repeated.)
This strategy did indeed fool e-rater, resulting in a maximum discrepancy of 5 points. E-rater assigned the essay the highest possible score (6), while both study readers awarded it the lowest possible score (1).
It worked!
Their paper reports preliminary work on additional features “for detecting essays of this nature.” In particular, they built a repetition detector.
But that missed my point about why linear regression was bad. Roughly speaking, concatenating any 37 paragraphs that were slightly overrated would have resulted in an essay that was highly overrated. Yes, I did reuse the same paragraph 37 times, but only out of laziness. It was just a cheap way to imitate someone who would ramble on for 37 paragraphs in the same slightly overrated manner (which might avoid triggering their repetition detector).
They wrote:
While it was clearly possible to write essays that e-rater scored too high, it appeared more difficult to write essays that tricked e-rater into giving lower-than-deserved scores.
I mostly disagree. Any regression model in practice will overrate some text and underrate other text. Concatenating many paragraphs that were slightly underrated would have resulted in an essay that was highly underrated. (Or at least low-rated – maybe excessive length would have lowered the human score too? Human raters actually tend to reward length, but perhaps only up to a point.)
So I could have tried the same repetition strategy. But I didn’t submit such an essay because I was having too much fun trying other attacks:
Only two issue essays received e-rater scores that were two points lower than reader scores. No other discrepancies in this direction were greater than one point. Both of the “winners” in this category made extensive use of metaphor and literary allusion, and avoided the use of e-rater clue words, instead using, as one writer characterized them, “more subtle transitions.”
I think one of those winners was my sultan essay, which had some fun attacking the fallacy in the prompt through an ornate analogy to another fallacy. (My McGwire essay also tried to avoid standard discourse connectives in favor of “more subtle transitions.”)
Their regression model was \[\text{score}(x) = \vec{a} \cdot \vec{f}(x) + b\] where \(x\) was the essay, \(\vec{f}(x)\) was its features, and \(\vec{a}, b\) were parameters that were trained separately for each essay prompt.
All of their features were measuring the prevalence of particular words, phrases, or syntactic constructions. Some of their features were counts, and some were ratios. Thus, if I concatenated \(n\) copies of a paragraph \(x\), I’d get a score of \[\begin{align*} \text{score}(nx) &= \vec{a} \cdot \vec{f}(nx) + b \\ &= \vec{a}_{\text{counts}} \cdot \vec{f}_\text{counts}(nx) + \vec{a}_{\text{ratios}} \cdot \vec{f}_{\text{ratios}}(nx) + b \\ &= n \cdot \vec{a}_{\text{counts}} \cdot \vec{f}_\text{counts}(x) + \vec{a}_{\text{ratios}} \cdot \vec{f}_{\text{ratios}}(x) + b \end{align*}\] which is an affine function of \(n\).
I just had to write a paragraph whose count features tended to have positive coefficients, so that \(\vec{a}_{\text{counts}} \cdot \vec{f}_\text{counts}(x) > 0\). Then increasing \(n\) would make \(\text{score}(nx)\) unboundedly positive. So I just tried to write a mediocre argument that had some superficially good features.
If I’d gotten it wrong and \(\vec{a}_{\text{counts}} \cdot \vec{f}_\text{counts}(x) < 0\), then increasing \(n\) would have made \(\text{score}(nx)\) unboundedly negative instead.
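A small numeric sketch of this argument. Every number below is made up for illustration: these are not e-rater's features or coefficients, and `score_of_copies` is a hypothetical helper. The only thing it takes from the derivation above is that count features scale with \(n\) while ratio features do not.

```python
def dot(a, f):
    """Dot product of two equal-length lists."""
    return sum(ai * fi for ai, fi in zip(a, f))


def score_of_copies(n, f_counts, f_ratios, a_counts, a_ratios, b):
    """Raw score of n concatenated copies of one paragraph.

    Count features are multiplied by n; ratio features are unchanged.
    """
    counts_n = [n * c for c in f_counts]
    return dot(a_counts, counts_n) + dot(a_ratios, f_ratios) + b


# Hypothetical features of a single paragraph and hypothetical coefficients.
F_COUNTS = [3, 2]         # two count features
F_RATIOS = [0.25]         # one ratio feature, unaffected by repetition
A_COUNTS = [0.5, -0.25]   # so a_counts . f_counts(x) = 1.0 > 0
A_RATIOS = [2.0]
B = 1.0

for n in (1, 2, 37):
    print(n, score_of_copies(n, F_COUNTS, F_RATIOS, A_COUNTS, A_RATIOS, B))
```

With these numbers each extra copy adds exactly 1.0 to the raw score, so 37 copies give 38.5 where one copy gives 2.5. Negating `A_COUNTS` makes the raw score fall by the same amount per copy. A reported score would presumably be capped at the ends of the scoring scale.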
Martha Palmer forwarded this email to grad students at UPenn in July 1999:
Dear Colleague:
We are writing to ask you (and a few other people) for help with research we are conducting on behalf of the Graduate Record Examinations (GRE) Board. Our research concerns the automated (or computer) scoring of open-ended essay responses. In particular, we are studying the feasibility of applying automated scoring methods to the kinds of essays that will be written for the new GRE Writing Assessment. As you probably know, the “imminence” of automated essay scoring was envisioned more than 30 years ago, but it is just now becoming a realistic possibility.
Our “challenge” to you is to try to “beat” the current version of our automated scoring system, or e-rater, as we call it. The “game,” should you choose to play, is to write essays that will be problematical for e-rater. By “problematical” we mean essays that would get scores from e-rater that are either too high or too low, relative to the scores assigned by human readers. We hope to learn both (a) how robust e-rater is to attempts to foil it, and (b) whether it undervalues certain kinds or styles of writing. Ideally, our findings will enable us to distinguish between failures that are easily correctable and ones that are more serious because they stem from fundamental differences in the ways that humans and computers understand language.
Although we can’t compensate you for your time, we hope we can entice you by providing feedback on how successful your challenges were, as well as a summary of the challenges made by other participants. (We’ll also award $250 to a writer whose essay is greatly undervalued by e-rater, and $250 to another whose essay is greatly overrated.) If you can participate, please let us know as soon as you can (dpowers@ets.org), and we’ll send further details. We’d also appreciate your letting us know if you’re not interested. Also, we’d welcome nominations of others who might be interested. If you should have any questions, please do not hesitate to contact me. Thanks.
Sincerely,
Don Powers
Principal Research Scientist
Educational Testing Service
My reply that evening:
Don:
Sure, this would be fun. Please send me further details!
I do hope you’re planning to give participants the same advantage that the coaching services will have:
You should tell us something about how the e-rater works, since the coaching services will eventually find out. (At a minimum, point everyone to your tech reports or published papers.)
Rather than making this a one-shot contest, you should allow multiple submissions with feedback, since the coaching services will send a lot of people into the exam over a period of months or years in order to reverse-engineer the system. That is, I should be allowed to feed the e-rater a bunch of attempts that are only slightly different from each other, so that I can analyze the resulting scores to see what features are weighted highly.
Glad you’re doing this study.
Don graciously replied:
Jason: Thanks for your interest. You offer some good suggestions, and we’ll try to follow up on them. I think we have already anticipated a couple of your comments, as you’ll see (I hope) in the details that we will be sending shortly.
On behalf of Don Powers, I am sending you (as attachments to this message) the details that Don promised to send regarding the study designed to “challenge” e-rater, the automatic essay scoring system being researched at Educational Testing Service. We hope everything is clear, but if you have any questions or problems, please feel free to reply to me or to Don directly ([email]). Thanks for your help.
We suggest you read the attachments in this order:
#1 - Don’s letter explaining “challenge”
#2 - Description of how e-rater works
#3 - Argument scoring guide
#4 - Benchmark Argument essays
#5 - Issue scoring guide
#6 - Benchmark Issue essays
#7 - Two essay topics to write on
Here were my submissions (essays + explanations).
Issue: Conformity/Success
Discuss the extent to which you agree or disagree with the opinion stated below. Support your views with reasons and/or examples from your own experience, observations, or reading. “No one can possibly achieve any real and lasting success or ‘get rich’ in business by conforming to conventional practices or ways of thinking.”
As Malthus and Darwin observed, if everyone competes in the same market, then profit margins were zero, because that was the competitive free market. In other words, it was first articulated by Adam Smith that niche specialization is one means of competition that should be observed despite itself. To get rich in business, as a matter of fact, they claim that they must also compete by specializing; therefore, Marx said, in a famous quotation that my professor or I made up, you must not be conventional. Suppose that he meant Netscape Internet Web software, for example, as well as political revolutions or other unconventional human endeavors such as, e.g., artistic and scientific progress. Should unpopular nonconformists thus win, therefore? On the other hand, original innovation is problematic too.
Secondly, as Malthus and Darwin observed, …
Thirdly, as Malthus and Darwin observed, …
Secondly, as Malthus and Darwin observed, …
Thirdly, as Malthus and Darwin observed, …
[repeated 37 times]
My explanation: The paragraph that is repeated 37 times to form this response probably deserves a 2. It contains the core of a decent argument, possibly the beginning of a second argument, and some relevant references, but the discussion is disjointed and haphazard, with the references and transition words badly misdeployed.
E-rater might give too high a rating to the single paragraph because of the high incidence of relevant vocabulary and transition words. However, I’ve repeated this poor, short, dashed-off response several times to improve the score (as a test-taker could easily do). As far as I can tell, e-rater’s score should be linear on the number of copies: it is a linear combination of various features that all seem to be linear on the number of copies. (Some features (ratios, cosine measures) remain constant, and the others (counts) are computable by a weighted FSA and hence linear.) Therefore, if I understand correctly, sufficiently many copies should yield an extreme score of 0 or 6, depending on the slope of the linear score function. My guess is that in this example, the slope is positive and a high score will result.
[same prompt as above]
Different visions of the well-spent life call for different strengths. For many, to succeed is to be promoted every five years and eventually to join the country club - a path that rewards their hard work, ability, and conformity to expectations, and along which they follow thousands of other conformists. Even “real and lasting” success may be attained in this way: not for his original ideas will posterity remember ballplayer Mark McGwire. (If you object that McGwire added nothing “real” to the world, his parents certainly did by raising such a nice and well-adjusted son – surely a real and lasting and effortful success against the odds, but one that can be achieved by adhering to a set of (parenting) conventions suited to the child.) It is true that a certain kind of competitive success - capturing market share or mindshare from others, on a level playing field - requires one to pull away from the pack with new approaches begged, bought, or stolen. But there remains a core of conventionality even among those who (like me) envision “real and lasting sucess” in engineering the triumph of a new idea over old ones, since what we seek is only the next win in some age-old game for some agreed-on prize. The true nonconformists drop out of those games altogether - and no one thinks of that as succesful.
My explanation: This brief, dense, seven-sentence response deftly makes several points and presumably deserves a 6. The formal structure unfolds simply enough that it gets away without very few transition words – which I suspect is good for the reader, bad for e-rater.
Specifically, almost every sentence lays out a new argument that builds on the previous argument, alluding back to the previous argument’s content rather than using a transition word. This serial development is interrupted only by two instances of counterargument (signaled by “If you object”; “It is true … But”) and one instance of parenthesis.
Moreover, the response is stronger because it recognizes a number of subtleties in passing, without interrupting the flow to couch them as separate arguments. (See e.g. the “competitive success” sentence, which alludes to the case of unfair competition, distinguishes non-conformity from personal originality, and unifies various pursuits as competition over “market share or mindshare” while distinguishing them from other kinds of competition such as baseball.)
I should mention that this is essentially a natural and favored response on my part, i.e., something like this might show up on a real exam, although I personally would not have been able to achieve this degree of compression in 15 minutes. I wrote it freely with the intention of tweaking it somehow to lower the score, but concluded that e-rater might have problems anyway and so tweaked it very little.
Argument: Roller Skating
Hospital statistics regarding people who go to the emergency room after roller-skating accidents indicate the need for more protective equipment. Within this group of people, 75 percent of those who had accidents in streets or parking lots were not wearing any protective clothing (helmets, knee pads, etc.) or any light-reflecting material (clip-on lights, glow-in-the-dark wrist pads, etc.). Clearly, these statistics indicate that by investing in high-quality protective gear and reflective equipment, roller skaters will greatly reduce their risk of being severely injured in an accident. Discuss how well reasoned you find this argument.
While this argument may seem that there are not enough misleading details to evaluate, nonetheless, its force is apparent. Nonetheless, one might also certainly have concern concerning the reason of skaters with various forms of accidents. First, skaters already wear many sports equipment, which they can afford and so should be cautious to wear protective gear. Second, correlation in the statistic do not imply causality. Correlation in the statistic do not imply causality; correlation in the statistic do not imply the conclusion of causality. The evidence does not support the stated conclusion, for the evidence does not support the stated conclusions. Causes and effects are different issues. For example, cause and effect is different issue.
Third, here is a new argument that I claim that is good. To be sure, the emergency room that skaters utilize indicates the following relations that one can certainly make sound public policy: harm, lack of careful caution, taxpayer insurance money, safety, head injury, bone fracture, doctor, nurse, driver’s license, inadequate protective equipment. Even if the skater were a perfect safest citizen, there have been several types of skaters and various injuries that one must possibly consider. In particular, suppose that the streets and parking lots are not the only places where accidents may occur. Furthermore, most importantly, skaters do not wear enough protective gear. Therefore, no one is sure that they will live long enough to go to the hospital, let alone a high-quality one.
Second, many other issues that arise must arguably also be considered, because this is a complicated question, i.e., it is a GRE question. What if skaters did not have to wear any clothing at all? As a consequence, many motorists would drive slower because they were looking. In conclusion, everyone would be safe, including the motorists. In fact, if the cars went slower than the skaters, then we would put them into the bike lanes and the skaters could have more room.
Finally, in order to decide the merits that the case hinges upon, we must carefully consider costs and benefits. In particular, it is important to consider the cost and benefit. One cost is that all men are mortal. Since Socrates is a man, it follows that Socrates is mortal. Surprisingly, I claim, ironically, that this is a benefit, for if Socrates were still alive, he would undoubtedly disapprove about this irrelevant and fallacious paragraph, which I use to some extent in all my regrettably flawed argumentation, just as the parents of those skaters would disapprove of them suffering injuries. Fortunately, it is impossible for him to question me at this time.
My explanation: No argument, just a rambling pastiche of canned text (often repeated for emphasis), possibly relevant content words, and high-falutin function words. Despite the many flaws in diction, I don’t think a grammar checker would find much to complain about, if you are using one at all.
[same prompt as above]
My distaste for this sophistry is exceeded by my suspicion that much public policy (on weightier matters) is no better founded. Let me adopt a time-honored technique and venture to expose the logical problems by means of analogy.
Picture a sultan in your favourite exotic locale. Clad in the best high-quality protective glow-in-the-dark raiment, he lounges on peacock skins, nibbling fermented fruits fed on the waters of faraway sacred springs, waving his sceptre bibulously as he explains to his cowed populace why they are better off for supporting his opulent lifestyle. “As we are always eager to remark,” he cackles, “five-and-seventy percent of those subjects who reach octogenery are dutiful taxpayers. You will deduce that tax payment greatly lengthens one’s life. Alright, live long and fork over!”
The swindle in the sultan’s self-serving, fishy-sounding argument should be obvious. He presumably neglected to mention something: 75% of ALL adult subjects are dutiful taxpayers - whence taxes have no effect on one’s well-being, unless one happens to be the sultan! Returning to the original argument, maybe 75% of ALL skaters, not just “75% of those who had accidents in streets or parking lots,” eschew protectve clothing. Such ineffectual protection would help only the modern-day sultans of the sporting goods industry.
Perhaps additional data against this scenario would have salvaged the conclusion? No. The sultan, challenged, might respond with such data: say, only 1% of subjects who died at age ten (not 75%) had ever been taxpayers. Alas, he sounds fishy again. It seems unlikely their nonpayment LED TO their early death as he wants to imply; actually, quite the reverse! Along just those lines, say people who stay out of the ER do tend to wear protective gere after all; it isn’t obvious this group’s gear LEADS TO any safety. (Maybe contrariwise - the money saved on the ER helps them afford their high-quality reflective equipment. Maybe a temperamental timidity keeps them safe AND makes them insist on useless clothing. Maybe they aren’t safe at all, just suggestible, steered away from the ER by their HMOs and right into the sports stores by advertising.)
At the end of the day, the argumnt rests on “hospital statistics,” a phrase that dignifies a single number - 75 - with a plural it does not deserve. There are no significance tests of its reliability, or observed or experimentally manipulated numbers for comparison, or anything that informs the modifiers presented (“greatly,” “high-quality”).
OUT OF TIME, SORRY
My explanation: This response probably deserves a 6 or at least a 5. However, it has a number of features that may give it a lower score:
- few classical transition words, complement clauses, etc.
- some quoting from the problem, as may appear in some of the more desperate low-scoring training responses
- many modal auxiliaries, as well as “maybe” and “actually”
- a few typos that probably match against misspellings in a few of the low-scoring training responses (note that misspelled words have high IDF weight). (Essay 1 also has two examples of this.)
- comparatively low cosine measure with other responses of any score. (The largest coordinates are probably for unusual words drawn from the analogy, like “sultan.”) Thus, the cosine measure has fewer words (either positive or negative) to influence its decision. While this does not introduce a bias toward matching the low-scoring responses, it may reduce the reliability of matching with the high-scoring responses. That is, I imagine ArgContent’s output will be a bit more random than for an essay that falls squarely into the training class; and note that if ArgContent were fully random, its average score would be 3.
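To make the cosine-measure point concrete, here is a toy sketch of an IDF-weighted cosine comparison. It is not ETS's ArgContent module, whose details are not given here. The three "training responses", the two test essays, and the choice to weight unseen words like words seen in a single training response are all assumptions made for illustration.

```python
import math
from collections import Counter


def idf_weights(docs):
    """IDF weight log(N / df) for every word seen in the tokenized docs."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    return {word: math.log(n / count) for word, count in df.items()}


def tfidf(doc, idf, default_idf):
    """Sparse tf-idf vector. Unseen words get default_idf (an assumption)."""
    return {w: c * idf.get(w, default_idf) for w, c in Counter(doc).items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy stand-ins for scored training responses (entirely made up).
TRAIN = [
    "skaters need helmets and pads".split(),
    "helmets protect skaters from injury".split(),
    "the statistics do not show causality".split(),
]
IDF = idf_weights(TRAIN)
RARE = math.log(len(TRAIN))  # treat unseen words like words seen in one doc

reference = tfidf(TRAIN[0], IDF, RARE)
plain = tfidf("helmets and pads protect skaters".split(), IDF, RARE)
ornate = tfidf("the sultan taxes skaters sultan sultan".split(), IDF, RARE)

print(cosine(plain, reference), cosine(ornate, reference))
print(max(ornate, key=ornate.get))  # the word with the largest coordinate
```

In this toy setup the largest coordinate of the ornate essay is "sultan", a word the training responses never use, so its cosine with the reference response is much lower than the plain essay's.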
ETS tried to send me the prize money, but requested my SSN for tax reasons. Email seemed insecure, and I never got around to faxing it to them.
I hereby forgive the debt. :) Surely I’ll come out ahead anyway when everyone who reads this page offers to buy me a drink ice cream.