The recent culture of leaving more and more of the assessment process in the hands of teachers raises the important question of reliability. Much research into teacher assessment, even by strong proponents of its advantages, reveals that it is inherently unreliable. We might have guessed this from our experience of human beings and the reliability of their subjective judgements! Judging is difficult enough for quantitative measures, such as the weight of a pig at a fair, but far more so for the qualitative criteria worded into rubrics. These are what we are currently working with in the English primary school system. We teachers are required to assess formatively and feed back; to sum up and evaluate; to report in an unbiased way; and, all the while, to be held accountable for progress which we ourselves are expected to measure. Imagine, if you will, the aforementioned pig. Judge its weight yourself now, and then again after you have fed it for a month, bearing in mind that you will be held accountable for the progress it has made. How reliable will either of those judgements be?
So, in an attempt to improve the reliability of teacher assessments (so that they can serve high-stakes, accountability purposes), we introduce the idea of moderation. This usually takes the form of a colleague or external moderator assisting in the judgement, based on the ‘evidence’ produced by the teacher. Now, whilst I can see the value to the teacher of the moderation process, if it involves discussion of criteria and evidence with colleagues and supposed ‘experts’ (who, exactly?), I am sceptical that simply introducing more people into the discussion will lead to greater reliability. The problem is that the external yardstick is still missing. Even if the teacher and everyone involved in the moderation process agree on the level, objective or whatever measurement is required of us, we are still making subjective judgements. Are collective subjective judgements any better than individual ones? Sometimes they may be, if they genuinely have the effect of moderating extremes. However, we also need to consider the impact of cultural drift: a group effect that reinforces shared bias, and one which does have an impact on assessment. I am convinced that I witnessed this over the years in the assessment of writing, where the bar for attaining each level seemed continually to be raised by teachers afraid of being accused of inflating results – a real shame for the pupils who were being judged unfairly. In these instances, the moderation process doesn’t improve reliability; it merely gives a false sense of reliability, one that is then resistant to criticism or appeal. This is where we all stand around staring at the pig and agree that he looks a bit thinner than he should. Without a weighing device, we really do not know.
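The statistical point behind this can be sketched in a few lines of Python (all the numbers are hypothetical, chosen only to illustrate the argument): averaging many independent guesses does cancel out random individual error, but a bias shared by the whole group – the "cultural drift" above – survives the averaging entirely, however many moderators you add.

```python
import random
import statistics

random.seed(42)

TRUE_WEIGHT = 100.0   # the pig's actual weight (hypothetical units)
SHARED_BIAS = -8.0    # cultural drift: everyone sees the pig as a bit thin
NOISE_SD = 10.0       # each judge's independent, random error
N_JUDGES = 30

# Each judge's guess = truth + shared bias + independent noise.
guesses = [TRUE_WEIGHT + SHARED_BIAS + random.gauss(0, NOISE_SD)
           for _ in range(N_JUDGES)]

individual_errors = [abs(g - TRUE_WEIGHT) for g in guesses]
group_error = abs(statistics.mean(guesses) - TRUE_WEIGHT)

print(f"mean individual error:      {statistics.mean(individual_errors):.1f}")
print(f"error of the group average: {group_error:.1f}")
# The independent noise largely cancels in the average, so the group's
# error converges towards |SHARED_BIAS| = 8.0 -- and no amount of extra
# judges shrinks it further, because they all share the same drift.
```

Averaging only moderates the extremes; it cannot supply the missing yardstick.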
I had a look back at this post – moderation being in the wind at the moment. I was interested in articles such as this one, and I wonder what it will take for us to stop such pointless, meaningless practices in education. Do we not know better? Do some people still believe these things work? Isn’t it obvious that teacher assessment for high-stakes purposes is completely counter-productive, and that moderation can in no way be considered a strategy for achieving greater reliability?
I’d like to extend the ubiquitous pig metaphor now. In the case of primary writing moderation in 2016, it’s not even a case of staring at one pig and guessing its weight. We have a farmer with a whole field of pigs: he has been told to guess all their weights, but he’d better not have more than 30% underweight! To make sure he doesn’t cheat, another farmer comes along, equally clueless, and tells him whether his guesses match his own. The farmer next door doesn’t have to go through this pointless ritual. Strangely, that farmer’s pigs are all just a little fatter.