I've written about robo-graders here.
Here's a four-question Q&A: I ask the questions, and the answers come from a must-read post titled “Computer Grading Will Destroy our Schools.”
1. Why is there a push to have students' multiple-choice tests and writing scored by a computer?
The reason for the push is both grim and obvious: money. Now that our schools are going to have more standardized tests, there are going to be more student paragraphs that need grading. Grading written work is laborious and time-consuming and, from a school board’s point of view, expensive. What computers offer is the ability to do this task faster and – once they are up and running – more cheaply. To be fair to the school boards, assuming the computer programs work, doing this task more efficiently could yield some benefits. It would lift a significant burden from the harried, overworked and underappreciated group of teachers and grad students we currently pay to grade standardized tests. But I suspect that, for most people, the thought of a computer “reading” essays is reflexively anxiety-provoking. It brings out the inner Luddite. Are we really supposed to believe that a machine can do just as good a job as a human being at a task like reading?
2. Does computer grading eliminate teacher bias?
The way supervised machine learning basically works is this: The computer treats the student’s essay as though it were just some random assemblage of words. Indeed, the jargon term for the main analytical technique here is actually “Bag of Words” (the resonance of which is simultaneously kind of insulting and weirdly reassuring). The program then measures and counts some things about the words that the programmer thinks are likely to be correlated with good writing. For example, how long is the average word? How many words are in the average sentence? How accurate are the quotations from the source text, if any? Did the writer remember to put a punctuation mark at the end of every sentence? How long is the essay?
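The kind of counting described above can be sketched in a few lines of Python. This is only an illustration, not any scoring company's actual code: the function name `essay_features` and the particular features it returns are assumptions chosen to mirror the questions in the paragraph (word length, sentence length, end punctuation, essay length). Notice that nothing in it looks at what the essay says.

```python
import re


def essay_features(text: str) -> dict:
    """Measure surface features of an essay, ignoring word order and meaning.

    Hypothetical helper for illustration; real scoring engines use many
    more features than these four.
    """
    # Split into sentences wherever whitespace follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Treat runs of letters and apostrophes as words.
    words = re.findall(r"[A-Za-z']+", text)

    n_words = len(words)
    n_sentences = len(sentences)
    # Count sentences that actually end with a punctuation mark.
    punctuated = sum(s[-1] in ".!?" for s in sentences)

    return {
        "essay_length_words": n_words,
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "avg_sentence_length": n_words / n_sentences if n_sentences else 0.0,
        "share_sentences_punctuated": punctuated / n_sentences if n_sentences else 0.0,
    }


if __name__ == "__main__":
    print(essay_features("The cat sat. The dog ran"))
```

The last sentence in the sample has no period, so only half the sentences count as punctuated. That is the whole of what the program "knows" about the text.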
The programmer then “trains” the computer by telling it the grades assigned by human graders to a “training set” of essays. The computer compares – mathematically – the various things it has measured to the grades assigned to the essays in the training set. “Johnny’s essay had an average word length of 5 letters and an average sentence length of 20 words. The human told me that Johnny gets a B+. Now I know that much about essays that get a B+.” From there, given an average word length, sentence length, word frequency and so on, the computer is able to calculate the probability that a given student essay would receive a particular grade. When it encounters a new essay, it can take the things it knows how to measure, and – based on what it learned from the grades assigned to the training essays – simply assign the most probable grade.
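To make the "training" step concrete, here is a toy sketch in Python. Every number in the training set is invented for illustration, and the names `train` and `predict` are hypothetical. It also swaps in a simpler rule than the probability calculation described above: it averages the measurements for each grade and gives a new essay the grade whose average it sits closest to (a nearest-centroid rule). The principle is the same: the only thing the machine learns is which measurements tended to go with which human-assigned grades.

```python
from collections import defaultdict
from math import dist

# Invented training set: (avg word length, avg sentence length) -> human grade.
TRAINING_SET = [
    ((5.0, 20.0), "B+"),
    ((5.2, 22.0), "B+"),
    ((4.0, 10.0), "C"),
    ((4.2, 12.0), "C"),
    ((6.0, 28.0), "A"),
    ((5.8, 30.0), "A"),
]


def train(examples):
    """Return the average feature vector (centroid) for each grade."""
    grouped = defaultdict(list)
    for features, grade in examples:
        grouped[grade].append(features)
    return {
        grade: tuple(sum(column) / len(rows) for column in zip(*rows))
        for grade, rows in grouped.items()
    }


def predict(centroids, features):
    """Assign the grade whose centroid is nearest to the new essay's features."""
    return min(centroids, key=lambda grade: dist(centroids[grade], features))


centroids = train(TRAINING_SET)
# A new essay with 5.1-letter words and 21-word sentences lands on "B+".
print(predict(centroids, (5.1, 21.0)))  # prints B+
```

Because the grades in `TRAINING_SET` come from human graders, any habit those graders have is baked into the centroids and handed on to every new essay.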
It turns out that this works surprisingly well. Shockingly well. What we might think of as totally surface-level or accidental features – like having more words per sentence – are actually correlated very strongly with earning better grades. Statistical analyses, at least, tell us that machine-learning techniques perform just as well as human beings – that is, their grades for new essays are the same or similar a huge majority of the time. Plus there are some ways in which the computers might actually be better. The people who presently do essay grading for tests like the AP English exam work long hours and have to grade essay after essay. Unlike computers, they get tired and irritable and bored. Also unlike computers, they come pre-equipped with a whole bunch of biases. These little nuisances and irritations, we might hope, will wash away if we instead let a computer sort unfeelingly through the bag of words.
But, alas, it isn’t so. Since the process by which the computer “learns” is anchored to grades assigned by human beings – the training set teaches the computer what kinds of grades we tend to give to what kinds of essays – the tiresome, unsexy little things that make us imperfect are built right into the system. For instance, if the graders who grade the training set tend to strongly penalize nonstandard uses of English – including nonstandard uses more common among racial minorities – so too will the machine. The computer will operationalize, and then perpetually reinstate, the botherations and biases we feed it. The best strategy, thus, will be to use mechanized, look-alike writing, which will be tautologically defined as good writing because it is associated with receipt of a good grade.
3. Can students "game" these computer graders?
One obvious problem is that if you know what the machine is measuring, it is easy to trick it. You can feed in an “essay” that is actually a bag of words (or very nearly so), and if those words are SAT-vocab-builders arranged in long sentences with punctuation marks at the end, the computer will give you a good grade. The standard automated-essay-scoring-industry response to this criticism is that anyone smart enough to figure out how to trick the algorithm probably deserves a good grade anyway. But this reply smacks of disingenuousness, since it’s obvious that the grade doesn’t reflect what it’s “supposed to” – namely, the ability to write a reasonably high-quality essay on some more or less arbitrary topic.
Another slightly less obvious problem is that, since the computer is just measuring and counting, it can’t actually give you meaningful feedback or criticism. It has no idea what big-picture themes you were exploring, what your tone was, or even what you actually said. It just tries to approximate the score you should get – that is, it tries to put you into a little box. While this kind of box-sorting is fine for literal grading, it doesn’t really help with teaching you to be a better writer.
There are other problems, too. A former professor at MIT named Les Perelman has pointed out that the way the automated-essay-grading companies are analyzing their software’s performance is unfairly biased toward the machine. Perelman’s paper, although eye-glazingly dense in data analysis, notes that while a human grader’s reliability is checked by comparing his or her grades to someone else’s, the machine’s reliability is checked against a resolved grade, which reflects the judgments of multiple human readers. But the standard statistical measure of agreement, called Cohen’s Kappa, is – as Perelman puts it – “meant to compare the scores of two autonomous readers, not a reader score and an artificially resolved score.”
4. What's the worst that can happen with computer graders?
Our culture will stop engaging with students on those very aspects of the humanities that make them worth studying in the first place. We are going to end up with a system that dispenses rewards in a way that is indifferent to – and divorced from – the most alluring parts of the humanities, those creative capacities that they let us engage. If our instruction in the humanities necessitates ignoring these abilities, then it is my opinion that there no longer is much point to teaching the humanities at all, and we should end the charade. In other words, if this kind of mechanized, standardized-test-friendly drivel is all we can offer our children as “the humanities,” then who cares about the humanities?
Once the use of automated essay grading becomes common knowledge, the implicit message will be hard to miss. For any self-aware, warm-blooded American teenager, the conclusion will be all but inescapable: Nobody cares what you have to say. It could be brilliant and moving; it could be word-salad or utter balderdash; it really doesn’t matter. Content, feeling, creativity, thematic depth – none of it matters. Today’s students will recognize this; they will react to it; and it will inform who they grow up to be. Indeed, I confess that if I were a teenager, my response would be the same as theirs – the selfsame response that we tend to associate with (and dismiss as just) teen angst. What is the point, after all, in being rewarded by a system that doesn’t care who you are? If no one is going to read the essays, we might as well rip them up.