Making grading in university courses more reliable

Submitted by Eliza.Compton on Wed, 11/08/2021 - 08:43
Inconsistent or inaccurate grading can have serious real-world consequences for students. Paige Tsai and Danny Oppenheimer offer tips on how to recognise and fix the problem

If you’d like to grade exams for a major testing corporation, you have a lot of work ahead of you. Prospective graders for the Educational Testing Service, for example, undergo system and content training and must pass at least one content certification test. They then grade several practice tests that have already been scored by established graders.

If, on the other hand, you’d like to grade exams for an undergraduate class, the standards are much lower. Often grading is left to graduate students or high-achieving undergraduates with little more than basic grading guidelines. In fact, aside from a subset of education researchers and psychologists, most faculty lack expertise in psychometrics, effective rubric design or assessment best practice. The reality is that there is very little formal training for faculty – let alone for teaching assistants – on how to create effective assessment instruments. As a result, while we would like university grades to be reliable and valid indicators of student achievement, in practice, grades often contain a considerable amount of noise, especially in more subjective fields.

Wisdom of crowds v experts

While the issue of noise in grades could be addressed through rigorous training and calibration methods, a more resource-efficient approach has been identified by massive open online course (Mooc) providers looking for a scalable way to assess thousands (or tens of thousands) of students. Some Mooc providers ask students to grade one another’s work, then rely on the wisdom of crowds to assign a grade. According to wisdom-of-crowds research, the collective judgement of multiple uninformed individuals can be as accurate as, or even more accurate than, that of a single expert. Some people guess too high, others guess too low, and the noise cancels out, leaving only signal.

But how well does it work for grading?

To investigate this question, we asked graduate students to grade essays written for a Mooc in their field of study and compared their grades with those from a wisdom-of-crowds grading strategy (averaging the grades assigned by at least four of the student’s peers). The results were both promising and troubling: we were encouraged to find that the grades awarded by the crowds were not appreciably worse than those awarded by the experts. However, as we dug into the data, we discovered that the scores were so similar because the experts’ scores were, in many cases, as inconsistent as the crowd’s. In fact, pairs of experts agreed on essay scores in only 20 per cent of cases. For nearly 30 per cent of the essays, the scores differed by three or more points on a nine-point scale – the difference between receiving a B+ on an exam and failing! In addition, in several instances the same expert read an essay twice, or three times, and awarded it different scores. Perhaps most troubling was that the factors we thought should most strongly predict grades (such as the accuracy of the content in the essays) had very little influence on final scores, leaving us unsure of what the experts were basing their scores on.
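For readers who want to run a similar check on their own course, the sketch below shows one way the two quantities described above could be computed: a crowd grade (the average of at least four peer scores) and the rate at which pairs of graders agree or disagree badly. It is an illustration only, not the analysis used in the study. The function names, the sample scores and the choice to count disagreement per pair of graders (rather than per essay) are all assumptions made for the example.

```python
from itertools import combinations
from statistics import mean


def crowd_grade(peer_scores, min_raters=4):
    """Average peer scores for one essay (hypothetical helper).

    The minimum of four raters mirrors the strategy described in the
    article; everything else here is an assumption.
    """
    if len(peer_scores) < min_raters:
        raise ValueError(f"need at least {min_raters} peer scores")
    return mean(peer_scores)


def pairwise_agreement(scores_by_essay, large_gap=3):
    """Return (exact agreement rate, large disagreement rate).

    scores_by_essay maps an essay id to the scores given by different
    graders. Rates are computed over every pair of graders who scored
    the same essay; large_gap=3 matches the "three or more points"
    threshold mentioned in the text.
    """
    pairs = exact = large = 0
    for scores in scores_by_essay.values():
        for a, b in combinations(scores, 2):
            pairs += 1
            exact += a == b
            large += abs(a - b) >= large_gap
    if pairs == 0:
        raise ValueError("no essay has two or more scores")
    return exact / pairs, large / pairs


# Made-up scores on a nine-point scale, for illustration only.
expert_scores = {
    "essay_1": [5, 5],
    "essay_2": [7, 4],
    "essay_3": [6, 5],
    "essay_4": [3, 8],
    "essay_5": [6, 7],
}

exact_rate, large_rate = pairwise_agreement(expert_scores)
print(f"exact agreement: {exact_rate:.0%}, gap of 3+ points: {large_rate:.0%}")
print("crowd grade:", crowd_grade([6, 7, 5, 8]))
```

Exact agreement is a blunt measure. A fuller analysis would likely use an established inter-rater reliability statistic, but even this simple count can show whether a course has a consistency problem.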

If the grades awarded to students in university classes differ dramatically depending on the grader, are they valid indicators of student achievement? As a feedback tool, grades are only useful insofar as they are accurate. More concerning, however, is how an erroneous low grade could lead a student to abandon a course and/or hurt their chances of getting a job or being accepted into graduate school. 

Resources for improving assessment practices

So how do we fix the problem? First, while best practices in assessment and psychometrics do exist, few faculty are aware of them or knowledgeable enough to implement them. We should publicise resources that help faculty adopt better rubrics and assessment practices. Indeed, there are dozens of websites dedicated to improving assessment, and most university teaching centres have specialists who are available to consult with faculty. Often the more proximate issue is making faculty aware that they need these resources in the first place.

Second, it’s important to recognise that our judgement can be affected by seemingly irrelevant factors such as the weather, our hunger levels or even the time of day. We can improve consistency and accuracy by engaging in calibration exercises: having multiple graders read and score the same small sample of exams each day prior to grading. This can ensure that the graders are using the same standards and are aligned with one another. It can also help identify individual graders who are deviating dramatically from their peers and/or baseline standards (and so may need additional training).
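A calibration exercise of this kind is easy to tally by hand, but a short script can make the comparison routine. The sketch below is one possible approach, not a prescribed method: it treats the median score on each sample exam as the consensus and flags any grader whose average distance from that consensus exceeds a tolerance. The function name, the one-point tolerance, the use of the median and the sample scores are all assumptions chosen for illustration.

```python
from statistics import mean, median


def flag_outlier_graders(calibration_scores, tolerance=1.0):
    """Compare each grader with the group consensus (hypothetical helper).

    calibration_scores maps a grader's name to their scores on the same
    sample exams, listed in the same order for every grader.
    Returns (mean absolute deviation per grader, sorted list of flagged graders).
    """
    lengths = {len(scores) for scores in calibration_scores.values()}
    if len(lengths) != 1:
        raise ValueError("every grader must score the same set of exams")

    # Consensus for each exam: the median of all graders' scores on it.
    consensus = [median(column) for column in zip(*calibration_scores.values())]

    deviation = {
        grader: mean([abs(s - c) for s, c in zip(scores, consensus)])
        for grader, scores in calibration_scores.items()
    }
    flagged = sorted(g for g, d in deviation.items() if d > tolerance)
    return deviation, flagged


# Made-up calibration round: four graders, four sample exams.
sample_round = {
    "grader_a": [6, 4, 7, 5],
    "grader_b": [6, 5, 7, 5],
    "grader_c": [5, 4, 7, 6],
    "grader_d": [9, 7, 9, 8],
}

deviation, flagged = flag_outlier_graders(sample_round)
for grader, d in deviation.items():
    print(f"{grader}: average distance from consensus = {d:.2f}")
print("may need a conversation about standards:", flagged)
```

In this invented round, the fourth grader scores every exam two to three points above the rest, which is the kind of pattern a calibration session is meant to surface before real grading begins.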

Finally, while we find that wisdom of crowds among novices does not fully eliminate the problem of noise in grading, a large body of literature has demonstrated that averaging the scores of two (or more) independent evaluations produces more accurate results than relying on a single evaluation. Indeed, even if there is only one grader available, having that grader give multiple estimates and averaging them (using a process called dialectical bootstrapping) can lead to improvements.
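A simple textbook model helps show why averaging works. It is a simplification, not a result from our study. Suppose each score $X_i$ equals the essay's true quality $T$ plus an error $\varepsilon_i$ with mean zero and variance $\sigma^2$, and that any two graders' errors share a correlation $\rho$. The symbols $T$, $\varepsilon_i$, $\sigma$ and $\rho$ are introduced here purely for illustration.

```latex
X_i = T + \varepsilon_i, \qquad
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad
\operatorname{Var}(\bar{X}) = \frac{\sigma^2}{n}\,\bigl(1 + (n-1)\rho\bigr)

% Independent errors (\rho = 0), two graders (n = 2):
\operatorname{Var}(\bar{X}) = \frac{\sigma^2}{2}
\quad\Longrightarrow\quad
\operatorname{sd}(\bar{X}) = \frac{\sigma}{\sqrt{2}} \approx 0.71\,\sigma
```

Under these assumptions, two fully independent graders cut the typical error by about 29 per cent. When the errors are positively correlated, as they are likely to be when the same grader scores an essay twice, the reduction is smaller, but it remains a reduction as long as $\rho < 1$.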

Regardless of how it is done, universities need to attend more to the noise problem in grading. This will help us ensure that students are getting the accurate feedback they need to learn and grow in the classroom. In addition, given that grades are such strong determinants of socio-economic outcomes, reducing the noise in grades can help us reduce the likelihood that we are further contributing to injustice in society. 

Paige Tsai is a PhD student in technology and operations management at Harvard Business School. She is interested in the judgements and decisions by people in organisations. Danny Oppenheimer is a professor jointly appointed in psychology and decision sciences at Carnegie Mellon University. He researches judgement, decision-making, metacognition, learning and causal reasoning, and applies his findings to domains such as charitable giving, consumer behaviour and how to trick students into buying him ice cream.
