Emerging Technology

Are computers better qualified than humans to grade student essay exams?

By Steven Johnson | Sunday, June 01, 2003

Illustration by Leo Espinoza

Pearson Education Measurement, which scores more than 40 million student achievement tests each year, announced in February that it would begin using computers to grade student essays. After decades of scanning number-two pencil dots, the machines have advanced to prose. The key to that advance is an ingenious process called latent semantic analysis, one of several techniques that researchers and corporations are exploring to cajole machines into understanding the meaning of strings of words instead of just manipulating them.

The idea of a computer doing more sophisticated evaluations than tallying up multiple-choice answers has alarmed parents and teachers. If computers still can't figure out that those penis enlargement e-mails in their inboxes are spam, how can they possibly assess the merits of a book report on The Sun Also Rises? As it turns out, the process of training a machine to grade essays is similar to the process of training human graders.

Traditionally, human graders are shown samples of good, mediocre, and poor essays and instructed to base their grades on those models. The computerized grader, dubbed Intelligent Essay Assessor, plots those sample essays as points in a kind of conceptual space, based on patterns of word use in the document. Student essays that are close to the good models get an A, while those that are mapped near the mediocre ones get a C.
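To make the idea concrete, here is a minimal Python sketch of grading by nearest model essay. It is an illustration only, not Intelligent Essay Assessor's actual method: the two-dimensional coordinates, the `GRADED_SAMPLES` list, the `grade` function, and the "F" label for poor essays are all assumptions invented for this example.

```python
import math

# Hypothetical coordinates for already-graded sample essays. A real system
# would derive far more dimensions from patterns of word use.
GRADED_SAMPLES = [
    ((0.9, 0.8), "A"),  # good models
    ((0.8, 0.9), "A"),
    ((0.5, 0.4), "C"),  # mediocre models
    ((0.4, 0.5), "C"),
    ((0.1, 0.2), "F"),  # poor model (the "F" label is an assumption)
]


def grade(essay_point, samples=GRADED_SAMPLES):
    """Return the grade of the sample essay closest to essay_point."""
    _, nearest_grade = min(
        samples, key=lambda sample: math.dist(essay_point, sample[0])
    )
    return nearest_grade


print(grade((0.85, 0.85)))
print(grade((0.45, 0.45)))
```

An essay plotted beside the good models inherits their grade; one plotted beside the mediocre models inherits theirs.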

How does the software pull this off? First, imagine that you're looking for relationships in a set of encyclopedia entries. You begin by feeding the computer the combined text of all the entries. Then the software creates a list of all the major words, discarding pronouns, prepositions, articles, and so on. Let's say that at the end of that process, the software determines that there are 10,000 unique words in the compilation. The computer then sets aside an imagined space with 10,000 dimensions—one for each word. Each encyclopedia entry occupies a specific point in that space, depending on the specific words that make up the entry. Documents that are close to each other in the space are close to each other in meaning, because they share a lot of the same concepts. Documents at opposite ends of the space will be unrelated to one another. Making subtle associations between different documents is simply a matter of plotting one document on the grid and locating its near neighbors.
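The word-space idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not the commercial software: the three encyclopedia entries, the short `STOPWORDS` list, and the helper names are invented for the example, and cosine similarity is one common way (assumed here) to measure how close two points are.

```python
import math
from collections import Counter

# A tiny, hypothetical list of minor words to discard.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "is", "are", "it", "to"}


def tokenize(text):
    """Lowercase the text, strip punctuation, and drop minor words."""
    words = (w.strip(".,;:!?\"'").lower() for w in text.split())
    return [w for w in words if w and w not in STOPWORDS]


def to_vector(text, vocabulary):
    """Place a document in word space: one count per vocabulary word."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


def cosine(u, v):
    """Similarity of two points: 1.0 is the same direction, 0.0 is unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Made-up entries standing in for an encyclopedia.
entries = {
    "volcano": "A volcano is a vent in the crust where lava and ash erupt.",
    "geyser": "A geyser is a vent in the crust where hot water and steam erupt.",
    "sonnet": "A sonnet is a poem of fourteen lines in a fixed rhyme scheme.",
}

# One dimension per unique major word across all entries.
vocabulary = sorted({w for text in entries.values() for w in tokenize(text)})
vectors = {name: to_vector(text, vocabulary) for name, text in entries.items()}

print(cosine(vectors["volcano"], vectors["geyser"]))
print(cosine(vectors["volcano"], vectors["sonnet"]))
```

The volcano and geyser entries share several major words and land near each other; the sonnet entry shares none with the volcano entry and lands far away.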

The multidimensional grid identifies semantic similarities between documents, even if the documents themselves don't contain the same words. This gets around the classic annoyance of traditional keyword-based search engines: You ask for information about dogs, and the engine ignores all pages that talk about canines. Latent semantic analysis software is smart enough to recognize that dogs and canines are closely related terms, and if you're searching for one, you're probably interested in the other.

The grid highlights those connections because it collapses the total number of dimensions down to a more manageable number: 300 instead of 10,000. Each word then has a fractional relationship to each dimension: Cats might have a seven-tenths connection to one dimension and a one-tenth connection to another. If dogs and canines are both nine-tenths correlated with a specific dimension, then the software assumes a semantic relationship between the words.
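The collapse of dimensions can be illustrated with a small Python sketch. The assumptions are worth stating plainly: the seven words and four toy documents are invented, the space is reduced to 2 dimensions rather than 300, and the reduction uses simple power iteration on the word co-occurrence matrix as a stand-in for the singular value decomposition that a numerical library would normally supply.

```python
import math

WORDS = ["dogs", "canines", "leash", "bark", "stocks", "bonds", "market"]
# Rows are words, columns are four toy documents. Note that "dogs" and
# "canines" never appear in the same document.
COUNTS = [
    [1, 0, 0, 0],  # dogs
    [0, 1, 0, 0],  # canines
    [1, 1, 0, 0],  # leash
    [1, 1, 0, 0],  # bark
    [0, 0, 1, 1],  # stocks
    [0, 0, 1, 0],  # bonds
    [0, 0, 1, 1],  # market
]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    norms = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norms if norms else 0.0


def top_components(matrix, k, iterations=500):
    """Find the k strongest directions of a symmetric matrix by power
    iteration, removing each direction once it has been found."""
    size = len(matrix)
    work = [row[:] for row in matrix]
    components = []
    for _ in range(k):
        v = [1.0 + 0.01 * i for i in range(size)]  # arbitrary starting guess
        for _step in range(iterations):
            w = [dot(row, v) for row in work]
            length = math.sqrt(dot(w, w))
            v = [x / length for x in w]
        strength = dot(v, [dot(row, v) for row in work])
        components.append((strength, v))
        work = [
            [work[r][c] - strength * v[r] * v[c] for c in range(size)]
            for r in range(size)
        ]
    return components


def reduce_words(counts, k):
    """Give each word k coordinates instead of one per document."""
    cooccurrence = [[dot(a, b) for b in counts] for a in counts]
    components = top_components(cooccurrence, k)
    return [
        [math.sqrt(strength) * vector[i] for strength, vector in components]
        for i in range(len(counts))
    ]


raw = dict(zip(WORDS, COUNTS))
reduced = dict(zip(WORDS, reduce_words(COUNTS, k=2)))

print(cosine(raw["dogs"], raw["canines"]))
print(cosine(reduced["dogs"], reduced["canines"]))
```

In the raw counts the two words look unrelated, because they never share a document. After the reduction they sit at essentially the same point, because both keep company with leash and bark.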

So far, so good, but you may be wondering whether a student gets credit merely for using the right words, and none for being clever. Programmers are quick to acknowledge that the software isn't good at measuring creativity or other classic measures of writing quality. The software is quite sensitive to prose sophistication and relevance, however: If you're asked to write an essay on the Great Depression, and you end up talking about baseball, you'll fare poorly. If your sentences are repetitive and your vocabulary is weak, you won't get a good score. But the software has a harder time detecting other obvious problems: From the software's point of view, there is no real difference between the sentence "World War II came after the Great Depression" and the sentence "The Great Depression came after World War II." Latent semantic analysis can give a good appraisal of whether an essay is on-topic and the language is erudite, but human graders are still much better at determining whether the argument makes any sense.
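The blind spot for word order is easy to demonstrate. In this short Python sketch (the `bag_of_words` helper is a name invented for the example), each sentence is reduced to a tally of its words, which is roughly all that a word-count approach sees.

```python
from collections import Counter


def bag_of_words(sentence):
    """Tally the words in a sentence, ignoring their order."""
    return Counter(w.strip(".,").lower() for w in sentence.split())


first = "World War II came after the Great Depression"
second = "The Great Depression came after World War II"

# The tallies are identical even though the sentences make opposite claims.
same_bag = bag_of_words(first) == bag_of_words(second)
print(same_bag)
```

Because the two tallies match exactly, any score built only on word counts has to treat the true sentence and the false one alike.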

A new software application called Summary Street lets teachers submit a specific course reading and then analyzes student reports on the reading to gauge how well they have summarized the original document. The software alerts the students if there are crucial topics they have overlooked.

"We distinguish between high-stakes and medium-stakes tests," says Jeff Nock, a vice president at K-A-T, the company that makes Intelligent Essay Assessor. "High stakes is: This test determines if you get to go to college. Medium is: I'm preparing to take a high-stakes test." Pearson Education Measurement has licensed the software to help grade its preparatory exams, but high-stakes essays are still graded by humans.

Nonetheless, Nock imagines a future for computerized grading in crucial testing environments: "Right now, essays on standardized tests are assessed by two separate human graders—if there's a disagreement about an essay, it gets handed off to a third person. We think latent semantic analysis could, down the line, replace one of those initial two graders with a machine. The machine brings a lot to the table. It costs a lot economically to train those human graders. And the latent semantic analysis approach brings more consistency to the process. The machine doesn't have bad days." Nock also envisions that teachers and students will use the software as a writing coach, analyzing early drafts of school essays and suggesting improvements, a step up the evolutionary chain from spell check and grammar check.

If we could all afford to have private tutors reading our first drafts, we would no doubt be better off, but a computerized writing coach might be better than no coach at all. And recent experiments suggest that text analysis can occasionally reveal meaning that human analysis has a hard time detecting.

Human reading follows a temporal sequence: You start at the beginning of a sentence and read on until the end. Software isn't smart enough to understand sentences, but it can analyze changing patterns in word choice. Researcher Jon Kleinberg of Cornell University tapped into this skill when he created a tool that analyzes "word burstiness." It is similar to latent semantic analysis in that it detects textual patterns, but it is designed to look specifically at semantic changes chronologically. The software sees a document archive as a narrative—at each point in the story, certain words will suddenly become popular as other words lose favor. Borrowing language from the study of computer-network traffic, Kleinberg calls these words "bursty." For months or years they lie dormant, then suddenly burst into the common vocabulary.

Kleinberg tested his software by analyzing an archive of papers published on high-energy physics, a field about which he professes to know absolutely nothing. The software scans the documents and reports back with a chronologically arranged list of words that show a sudden spike in usage. "The computer is effectively saying, 'I don't know what these words mean either, but there was a lot of interest in them in the late 1970s,'" Kleinberg says. "It gives you hooks into an unknown body of literature." If nothing else, the next time you meet a high-energy physicist at a cocktail party, and he starts talking about his research into superstrings, you'll be able to impress him by saying, "String theory? That's so 1992!"
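A rough sense of how a spike in usage can be spotted comes from the Python sketch below. It is deliberately much simpler than Kleinberg's tool and should not be read as his algorithm: it merely flags a word when its share of one period's text is well above its share of the whole archive. The `bursty_words` function, its `ratio` and `min_count` thresholds, and the four snippets of text are all invented for illustration.

```python
from collections import Counter


def bursty_words(periods, ratio=2.5, min_count=2):
    """Return (period, word) pairs where a word's rate of use in that period
    is at least `ratio` times its rate across the whole archive.

    `periods` is a chronologically ordered list of (label, text) pairs.
    """
    counted = [(label, Counter(text.lower().split())) for label, text in periods]
    totals = Counter()
    for _, counts in counted:
        totals.update(counts)
    archive_size = sum(totals.values())

    bursts = []
    for label, counts in counted:
        period_size = sum(counts.values())
        for word, n in sorted(counts.items()):
            if n < min_count:
                continue  # ignore words seen only once in the period
            local_rate = n / period_size
            overall_rate = totals[word] / archive_size
            if local_rate >= ratio * overall_rate:
                bursts.append((label, word))
    return bursts


# Invented toy text, not real speeches.
archive = [
    ("1850s", "union trade union treaty trade"),
    ("1860s", "slavery union emancipation slavery trade emancipation"),
    ("1920s", "trade union treaty trade union"),
    ("1930s", "depression banks union recovery depression banks trade"),
]

for label, word in bursty_words(archive):
    print(label, word)
```

Words that appear steadily in every period, such as union and trade in this toy archive, never clear the threshold; words concentrated in one period do, and they come back in chronological order.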

But because the software "reads" text in such an unusual way, the tool also lets us see new attributes in documents that we already know something about. Kleinberg's most intriguing application is an analysis of the State of the Union addresses since 1790. Reading through the list of bursty words from past addresses is like browsing the pages of a history book designed for students with attention deficit disorder. Mostly, it is a parade of obvious word bursts: During the early 1860s, slaves, slavery, and emancipation jump onto the national stage; during the 1930s, depression, recovery, and banks.

Just when you think the software is demonstrating its flair for the obvious, however, you get to the 1980s. Suddenly, the bursty words shift from historical events to more homespun effects: I've, there's, we're. An observer can literally see Ronald Reagan reinvent the American political vernacular in those contractions, transforming the State of the Union from a formal address into a fireside chat, up close and personal. There's no trace of "fourscore and seven years" or "ask not" in this language, just a more television-friendly intimacy.

Is this news? We knew that Reagan brought a more popular style to the presidency, but we didn't necessarily know the syntactic tools he used. As listeners, we intuitively grasp that there's a world of difference between we shall and we'll—one stiff, the other folksy—but we don't recognize what linguistic mechanism made the shift happen. Seen through the lens of Kleinberg's software, the mechanism pops out immediately, like a red flag waving among the dull grays of presidential oratory. The computer still doesn't know what Reagan is saying, but it helps us see something about those speeches we might have missed. As Kleinberg says, it gives us a hook.





Check out the Web site of K-A-T (Knowledge Analysis Technologies), the makers of the Intelligent Essay Assessor: www.k-a-t.com. In addition to product descriptions, the site has a few demos that you may want to try. Some of the demos provide sample college- and high-school-level essays that you can run through a sample evaluation. You can also create your own essay to see how your work stacks up. Another demo prompts you to write a middle-school-level composition, which is then evaluated—a potentially humbling experience: www.k-a-t.com/HRW12Demo/HRW12.html.

A Cornell news release describes Jon Kleinberg's work on search techniques and lists the 150 "bursty" words in State of the Union addresses: www.news.cornell.edu/releases/Feb03/AAAS.Kleinberg.bursty.ws.html.

Kleinberg's home page includes links to papers and descriptions of his current research: www.cs.cornell.edu/home/kleinber.

Scan a list of the burstiest words in the last few days' Web logs and find out what the hot topics are in the blogging community: www.daypop.com.

Find out what Steven Johnson is up to at his Web site, where you'll also find links to some of his recent articles, including pieces for Discover: www.stevenberlinjohnson.com.