Got the T-shirt (a moderate tale)

Given that teacher assessment is a nonsense which lacks reliability, and that moderation cannot really reduce this, nor ensure that gradings are comparable, our moderation experience was about as good as it could be! It went like this:

Each of us two Y6 teachers submitted all our assessments and nominated three children in each category (more ridiculous, inconsistent and confusable codes, here). One of these from each category was selected, plus another two from each category at random: nine children from each class. We were told who these nine were a day in advance. Had we wanted to titivate, we could have, but with our 'system' it really wasn't necessary.

The 'system' basically consisted of taking the interim statements and assigning each one a number. Marking since April has involved annotating each piece of work with these numbers to indicate which criteria it evidences. It was far less onerous than it sounds and was surprisingly effective in terms of formative assessment. I shall probably use something similar in the future, even if not required to present evidence.
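
For anyone wondering what that numbering amounts to in practice, here is a minimal sketch in Python. The statement wordings, numbers and pieces of work are all invented for illustration; the real interim framework is longer and worded differently. The idea is simply that each statement gets a number, each piece of work is annotated with the numbers it evidences, and a quick tally shows where evidence is still missing.

```python
# Purely illustrative: invented statement wordings and numbering,
# not the actual interim framework statements.
statements = {
    1: "uses paragraphs to organise ideas",
    2: "uses a range of cohesive devices",
    3: "uses passive and modal verbs mostly appropriately",
    4: "spells most words from the Y5/6 list correctly",
}

# The numbers annotated on each piece of work since April (made-up pieces).
annotations = {
    "diary entry (April)": [1, 2],
    "science write-up (May)": [1, 3],
    "persuasive letter (June)": [1, 2, 4],
}

# Tally how often each statement has been evidenced so far.
coverage = {number: 0 for number in statements}
for piece, numbers in annotations.items():
    for number in numbers:
        coverage[number] += 1

for number, wording in statements.items():
    gap = "" if coverage[number] else "  <-- no evidence yet"
    print(f"{number}. {wording}: {coverage[number]} piece(s){gap}")
```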

The moderator arrived this morning and gave us time to settle our classes whilst she generally perused our books. I had been skeptical. I had posted on Twitter that, though a moderator would have authority, I doubted they'd have more expertise, and I was concerned about arguing points of grammar and assessment. I was wrong. We could hardly have asked for a better moderator. She knew her stuff. She was a Y6 teacher. We had a common understanding of the grammar and the statements. She had made it her business to sample moderation events as widely as possible and had therefore had the opportunity to see many examples of written work from a wide range of schools. She appreciated our system and the fact that all our written work since April had been done in one book.

Discussion and examination of the evidence by and large led to agreed assessments. One child was raised from working towards; one, whom I had tentatively (and only recently) put forward as 'greater depth', was agreed to have not quite made it. The other 16 went through as previously assessed, along with all the others in the year group. Overall, my colleague and I were deemed to know what we were doing! We ought to, but a) the county moderation experience unsettled us and fed my ever-ready cynicism about the whole business, and b) I know that it's easy to be lulled into a false belief that what we've agreed is actually the 'truth' about where these pupils are. All we can really say is that the three of us roughly agreed. The limited nature of the current criteria makes this an easier task than the old levels (we still referred to the old levels!), but the error in the system makes it unusable for accountability or for future tracking. I'm most interested to see what the results of the writing assessment are this year – particularly in moderated vs non-moderated schools. Whatever they are, they won't be a reliable assessment but, unfortunately, they will still be used (for good or ill) by senior leaders and other agencies to make judgements about teaching.

Nevertheless, I’m quite relieved the experience was a positive one and gratified and somewhat surprised to have spent the day with someone with sense and expertise. How was it for you?

Trialling moderation

A quick one today to cover the ‘trialling moderation’ session this afternoon.

We had to bring all the documents and some samples of pupils’ writing, as expected.

Moderators introduced themselves. They seemed to be mainly Y6 teachers who were also subject leaders for English. Some had moderated before, but obviously not for the new standards.

The 'feel' from the introduction to the session was that this wasn't as big a problem as we had all been making it out to be: we were definitely to use the interim statements, and 'meeting' was indeed equivalent to a 4b.

At my table, we expressed our distrust of this idea and our fear that very few of our pupils would meet the expected standard. Work from the first pupil was shared and the criteria ticked off. We looked at about three pieces of work. It came out as 'meeting', even though I felt it was comparable to the exemplar 'Alex'. The second pupil, from the next school, was 'nearly exceeding'. I wasn't convinced. There were lots of extended pieces in beautiful handwriting, but the sentence structures were rather unsophisticated. There was arguably a lack of variety in the range and position of clauses and transitional phrases. There was no evidence of writing for any other curriculum area, such as science.

I put forward the work from a pupil I had previously thought to be 'meeting' but had then begun to doubt. I wanted clarification. Formerly, I would have put this pupil at a 4a/5c with the need to improve consistency of punctuation. Our books were the only ones on our table (and others) that had evidence of writing across the curriculum; we moved a few years ago to putting all work in a 'theme book' (it has its pros and cons!).

Unfortunately the session was ultimately pretty frustrating, as we didn't get to agree on the attainment of my pupil; I was told that there needed to be evidence of the teaching process that had underpinned the writing in the books. That is to say, there should be the grammar exercises in which we had taught such things as 'fronted adverbials' etc., and then the written pieces in which that learning was evidenced. I challenged that and asked why we couldn't just look at the writing, as we had done for the first pupil. By then the session was pretty much over and, in spite of the moderator's attempt to finish the moderation for me, we didn't. The last part of the session was given over to the session leader coming over and asking if we felt OK about everything, and my replying that no, I didn't: I still didn't know which of the multiplicity of messages to listen to, and I hadn't had my pupil's work moderated. I had seen other pieces of work, but I didn't trust the judgements that had been made.

The response was 'what mixed messages?', plus the suggestion that it might take time for me to 'get my head around it', just as I must have had to do for the previous system. She seemed quite happy that the interim statements were broadly equivalent to a 4b, and suggested that the government certainly wouldn't want to see data showing a drop in attainment. I suggested that if people were honest, that could be the only outcome.

My colleague didn't fare much better. She deliberately brought samples from a pupil who writes very little, but whose writing, when it comes, is accurate, stylish and mature. He had a range of pieces, but most of them were short. The moderator dismissed his work as insufficient evidence, but did inform my colleague that she would expect to see the whole range of text types, including poetry, because how else would we show 'figurative language and metaphor'?

I'm none the wiser but slightly more demoralised than before. One of my favourite writers from last year has almost given up writing altogether because he knows his dyslexia will prevent him from 'meeting'. Judging the writing of pupils as effectively a pass or fail is heart-breaking. I know how much effort goes into their writing. I can see writers who have such a strong grasp of audience and style missing the mark by just a few of the criteria. This is like being faced with a wall – if you can't get over it, stop bothering.

We are likely to be doing a lot of writing over the next few weeks.

The nonsense of ‘teacher assessment’ – an analogy

As we approach the start of the new school year, some of us will be continuing to try to make a silk purse out of the sow's ear of the new assessment requirements, 'formally' introduced last year. Whatever system individual schools decide to use to approach this farce, teachers will be expected to make judgements based on 'teacher assessment'. Almost everywhere, this will be accepted without question, so I'm going to try to outline in simple terms just why I think it does not make sense.

I'm using a high-stakes analogy in which human judgement of performance needs to be seen to be as reliable as possible – the 'execution' score in competitive gymnastics, which works as follows:

  • 6 independent, highly skilled judges
  • 1 individual is judged on 1 performance at a time (and within a limited time)
  • Each performance has a small number of clearly defined criteria
  • There is no conferring (or moderating!)
  • The maximum score is 10 and points are dropped for errors

These are pretty good conditions for a high degree of reliability and yet the judges still arrive at different scores. Because of that, the top and bottom scores are dropped and the remaining 4 are averaged. Even so, the resulting scores are often ‘disputed’, although queries and official objections are not allowed. The judges are not the coaches and will not be held to account for the performance of the gymnasts.
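
Just to make the arithmetic concrete, here is a minimal sketch of that scoring rule (assuming six judges and the drop-top-and-bottom averaging described above; the scores themselves are invented):

```python
def execution_score(judge_scores):
    """Drop the single highest and lowest of six scores and average
    the remaining four, as described above."""
    assert len(judge_scores) == 6
    trimmed = sorted(judge_scores)[1:-1]  # discard top and bottom
    return sum(trimmed) / len(trimmed)

# Six independent judges, small disagreements, maximum score of 10.
print(execution_score([8.9, 9.0, 9.1, 9.0, 8.8, 9.2]))  # 9.0
```

Even under those conditions the judges disagree; the trimming and averaging exist precisely to absorb that disagreement.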

Now let’s compare that with teacher assessment in an English primary school:

  • 1 class teacher, who in most cases is not an expert in the subject, the curriculum or assessment
  • 32 individuals are judged on multiple performances in multiple subjects throughout the year
  • There are hundreds of criteria (somewhere in the region of 130 for the core subjects in year 5) – see the rough arithmetic after this list
  • Reliability is expected to be improved by moderation and discussion (conferring!)
  • There is no way to eliminate outlying judgements
  • There is no transparent way to score or translate observations of performance into grades
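
For a sense of scale, here is the rough arithmetic referred to in the list above (a back-of-envelope sketch only, using the ~130 criteria figure and a class of 32; the exact counts vary by year group and subject):

```python
# Back-of-envelope only: ~130 criteria per pupil (the rough year 5
# core-subject figure above) multiplied by a class of 32 pupils.
criteria_per_pupil = 130
pupils = 32
teacher_judgements = criteria_per_pupil * pupils
gymnastics_judgements = 6 * 1  # six judges scoring one performance

print(teacher_judgements)     # 4160 separate judgements for one teacher
print(gymnastics_judgements)  # 6 expert scores for one gymnast
```

Over four thousand judgements for one non-specialist teacher in a year, against six expert scores for a single, tightly defined performance: that is the gap in conditions that moderation is supposed to bridge.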

In most schools, there will be some kind of tracking system whereby teachers will be asked to make termly entries along the lines of 'developing, meeting, exceeding' and degrees thereof, culminating in a final decision which will indicate pupil attainment (readiness to move to the next stage) and teacher effectiveness for that year. In many cases, in spite of union objections, these judgements will form part of appraisal, promotion and performance-related pay. Is there any way, under those circumstances, that teacher assessment can be reliable enough to be used for the high-stakes purposes expected in English primary schools?

Primary science assessment – not even close, yet.

Assessment of primary science is something of a bugbear of mine. While I consider the so-called 'formative assessment' (it should never have been called 'assessment') to be no more or less of a challenge than in the other core subjects, summative assessment of science is different. There is a multitude of research papers and writings on just how difficult it is to assess properly for any kind of measurement, particularly for tracking progress and for accountability purposes. In the UK, the decline of science since the demise of the KS2 SATs test has passed into legend. Check out OFSTED's Maintaining Curiosity for an official account of just how dire the situation is. It's now been six years since that event, however, and the protagonists in the world of science education and assessment have pretty much failed to come up with anything manageable and reliable. I'm not surprised; I think the job is almost impossible. However, I am surprised that they continue to try to fool themselves into thinking that it isn't. Examples of advice from the most authoritative of sources are here and here, and I'm very appreciative of their efforts, but I look at these and my heart sinks. I can't imagine these ideas being put into effective practice in real primary schools.

When I was pushing to try to influence the protagonists, before they finished their projects and put their suggestions out to teachers, I compiled a list of questions which I felt needed to be addressed when thinking about assessment in primary science. I see very little to give me hope that they have been. My main concern is the persistent belief in the 'magic' of teacher assessment and moderation serving a high-stakes purpose.

Formative/summative
  • Should we really be dissolving the formative/summative divide?
    • I have seen much confusion amongst teachers as to the purposes of assessment and they often conflate summative and formative, unwittingly, to the detriment of both.
  • Isn’t there more clarity needed on just how assessment can be made to serve different purposes?
    • Isn’t there a fair amount of debate about this in the literature?
  • How do we avoid serving neither very well?
  • How do we use formative information for summative purposes when this is often information gained in the early stages of learning and therefore not fair to pupils who may have progressed since its capture?
  • If summative assessments are to be used for high stakes purposes, how do we ensure that summarised, formative information really quantifies attainment and progress?
  • How can we avoid teachers always assessing instead of teaching?
Teacher assessment
  • Can we really resolve the issue of unreliability of teacher assessment when used in high-stakes settings?
  • Is it fair to expect teachers to carry out teacher assessments when they are directly impacted by the outcome of those assessments?
  • How do we make teacher assessment fair to all the children in the country if it is not standardised? How do we avoid a 'pot-luck' effect for our pupils?
  • Have we really addressed the difficulty of assessing science as a multi-faceted subject?
  • How can we streamline this process?
  • How can we make sure it doesn’t feel to teachers as though they would be assessing science all the time?
Moderation and reliability
  • Are researchers assuming that moderation is a simple and effective ‘catch all’ to achieve reliability?
  • Do researchers know that this often feels like something that is done ‘to’ teachers and not part of a collaborative process?
    • This is a fraught process in many schools. It takes up an enormous amount of time and can be very emotional if judgements are being made and if there are disagreements. Moderation helps to moderate extremes, but can also lead groups in the wrong direction.
  • Will schools be able to give over the time required to adequately moderate science?
  • Is there really a good evidence base for the effectiveness of moderation on reliability?
  • Do we need to clarify the exact process of moderation?
  • Is ‘reliable’ something that is actually achievable by any assessment system? Should we not be talking about maximising rather than achieving reliability?

Assessing 21st Century skills with 19th Century technology

In England we've just had a revision of what used to be the ICT (Information and communications technology) curriculum for primary schools into what is now going to be 'computing', focussing on what are imagined to be necessary skills for modern life, such as programming and 'digital citizenship'. Interestingly, there is little mention of the use of the computer in any other subject of the new curriculum.

Having just completed the excellent FutureLearn courses 'Teaching computing 1 and 2', I initially thought there was little new here. The programming part is straight out of the latter half of the 20th Century. That's not really surprising, though, given the paradigms of that time. What I find difficult to understand, however, is the anachronistic attitude towards the assessment of this subject. Like everything else in the English primary system at the moment, it's 'teacher assessment' – that old fall-back, that catch-all which miraculously covers everything. This is to be done by observation, of course: observing 32 pupils all carrying out a multiplicity of activities at a computer.

So why not use the computer?

This is my rant, straight from the CAS (Computing at School) site:

Are we failing to exploit the potential for e-assessment to capture rich data in terms of pupil behaviour as well as give instant feedback to pupils? Why are we trying to construct elaborate rubrics with a multitude of descriptors which require hours of teacher observation? What is the likelihood that teachers are going to manage to dedicate any time to assessment of computing, given that the core subjects will have something like a total of 4600 items for a class of 32? I feel that on the one hand we’re supposedly promoting 21st Century skills (though I’ve yet to see that, in fact) whilst using 19th Century systems of assessment.

The 'disastrous' attempts to use e-assessment can largely be put down to inappropriate use of the technology. I'm not advocating a ham-fisted approach which replicates online what pupils could do on paper. I'm amazed that computing practitioners are even considering that a paper-based approach is somehow more 'academic'. Computers have enormous potential for tracking and analysing what is done on the computer (and let's face it, most computing is done on the computer!). We haven't even begun to tap into the formative potential of such things as 'serious games' and immersive software, and yet these are everywhere in the commercial world and in high-risk occupations (medicine, aerospace, motor racing). Pupils' behaviour is being assessed every day by the games they play online. It really is an issue for policy makers, educationalists and software companies, but I imagined that here, in the computing curriculum domain, there would at least be that kind of thinking.
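
To make that concrete, here is a purely hypothetical sketch of the sort of 'rich data' capture I mean. The event names, fields and summary are invented for illustration and belong to no real product or API; the point is only that the computer can record and summarise pupil behaviour as a side effect of the task itself.

```python
import json
import time

def log_event(log, pupil, action, detail):
    """Record one timestamped action a pupil takes at the computer.
    Event names and fields are invented for illustration only."""
    log.append({"pupil": pupil, "time": time.time(),
                "action": action, "detail": detail})

log = []
log_event(log, "pupil_07", "run_program", {"result": "error", "line": 4})
log_event(log, "pupil_07", "edit_block", {"block": "repeat_loop"})
log_event(log, "pupil_07", "run_program", {"result": "success"})

# A crude formative summary, available instantly and with no lesson time
# spent on teacher observation.
runs = [event for event in log if event["action"] == "run_program"]
print(f"{len(runs)} runs; succeeded on attempt {len(runs)}")
print(json.dumps(log, indent=2))
```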

There are interesting developments in the US and in Australia, as well as some promising work in further and higher education. However, we really don't seem to be aware of what is possible in English primary education, which is overly influenced by a belief that 'teacher assessment' covers everything. It doesn't, it can't, and the current model is severely flawed.

I'm growing increasingly frustrated with the education community in England and its blinkered inability to see the benefits of digital technology in teaching and assessing, even when the very subject is about the use of digital technology.

Moderation still doesn’t tell us the weight of the pig.

The recent culture of leaving more and more of the process of assessment in the hands of teachers raises the important question of reliability. Much research into teacher assessment, even by strong proponents of its advantages, reveals that it is inherently unreliable. We might have guessed this from our experience of human beings and the reliability of their subjective judgements! This is difficult even for quantitative measures, such as the weight of a pig at a fair, but much more so for such qualitative aspects as those in the wording of rubrics. These are what we are currently working with in the English primary school system. We teachers are required to be assessing formatively and feeding back; summing up and evaluating; reporting in an unbiased way; and all along being held accountable for progress, which we ourselves are expected to be measuring. Imagine, if you will, the aforementioned pig. Judge its weight yourself now, and then again when you have fed it for a month, but bear in mind that you will be accountable for the progress it has made. How reliable will either of these judgements be?

So, in an attempt to improve the reliability of teacher assessments (in order for them to serve high-stakes, accountability purposes) we introduce the idea of moderation. This usually takes the form of a colleague or external moderator assisting in the judgement, based on the 'evidence' produced by the teacher. Now, whilst I can see the value to the teacher of the moderation process, if it involves discussion of criteria and evidence with colleagues and supposed 'experts' (who, exactly?), I'm skeptical that simply introducing more people into the discussion will lead to greater reliability. The problem is that the external yardstick is still missing. Even if the teacher and all those involved in the moderation process agree on the level, objective or whatever measurement is required of us, we are still making subjective judgements. Are collective, subjective judgements any better than individual ones? Sometimes they may be, if they genuinely have the effect of moderating extremes. However, we also need to consider the impact of cultural drift: a group effect that reinforces bias, and one which does have an impact on assessment. I am convinced that I witnessed this over the years in the assessment of writing, where the bar for attaining each level seemed to be continually raised by teachers afraid that they would be accused of inflating results – a real shame for the pupils who were being judged unfairly. In these instances, the moderation process doesn't improve reliability; all it does is give a false sense of it, which is then resistant to criticism or appeal. This is where we all stand around staring at the pig and agree that he looks a bit thinner than he should. Without the use of a weighing device, we really do not know.
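
The 'cultural drift' point is easy to demonstrate with a toy simulation (a sketch only, with made-up numbers: a pig of known weight, random individual guessing error, and a shared bias that every judge brings to the meeting):

```python
import random
from statistics import mean

random.seed(1)
TRUE_WEIGHT = 100.0   # the pig, in kg (made-up figure)
SHARED_BIAS = -8.0    # 'cultural drift': everyone guesses low together
NOISE = 10.0          # each judge's individual error (standard deviation)

def guess():
    return TRUE_WEIGHT + SHARED_BIAS + random.gauss(0, NOISE)

single_judgement = guess()                              # one teacher alone
moderated_consensus = mean(guess() for _ in range(5))   # a meeting of five

print(f"single judgement:    {single_judgement:.1f} kg")
print(f"moderated consensus: {moderated_consensus:.1f} kg")
# Pooling the guesses smooths out the random noise, but the shared bias
# of -8 kg survives the moderation meeting intact.
```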

June 2016

I had a look back at this post – moderation being in the wind at the moment. I was interested in articles such as this one, and I wonder what it will take to stop such pointless, meaningless practices in education. Do we not know? Do some people still believe these things work? Isn't it a bit obvious that teacher assessment for high-stakes purposes is completely counter-productive and that moderation can in no way be considered a strategy for achieving greater reliability?

I'd like to extend the ubiquitous pig metaphor now. In the case of primary writing moderation in 2016, it's not even a case of staring at a single pig and guessing its weight. We have a farmer with a whole field of pigs: he has been told to guess all their weights, but he'd better not have more than 30% underweight! To make sure he doesn't cheat, another farmer comes along, equally clueless, and tells him whether his guesses match his own. The farmer next door doesn't have to go through this pointless ritual. Strangely, that farmer's pigs are all just a little fatter.