Matt Yglesias, who apparently sees cheating as the full extent of the problems with testing, says, "I hope that people who are skeptical about the possibility of measuring school performance will spell out in greater detail what they think the implications of this view are."
First, though, let's talk about the implications of doing testing badly, and how difficult it is for the people responsible for it to do it well.
The implications of doing it badly should be obvious: you boost the pay of teachers and principals who really aren't any better than anyone else, you fire those who aren't any worse, and you drive out of the profession the teachers who have enough intellectual honesty and awareness to see what's going on and simply can't stand it.
Then the question becomes, is it reasonable to expect school administrators to come up with testing programs that are only used to measure what they can actually measure with some reliability?
I'd say it can happen, but it's not a reasonable expectation.
In the words of seminal statistician R.A. Fisher, "To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of." And to reach conclusions based on testing is to conduct a statistical experiment.
It seems simple enough: we'll reward teachers if their students' scores go up, and penalize them if the scores go down. What could go wrong?
What couldn't?
For one thing, do we know how much a class's scores tend to bounce around in a testing regime where carrots and sticks aren't remotely on the horizon? In other words, what's the variance of these scores? Because if you don't know that, you don't know whether a better number from one year to the next, or a difference in scores between one teacher and another, reflects better teaching or is well within the realm of statistical chance.
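To make this concrete, here's a minimal sketch (in Python, with made-up numbers, not any district's data) of how much a class average bounces around from year to year when the teacher, and the true quality of the teaching, never change at all:

```python
# Simulate one teacher of constant quality drawing a fresh class of
# students each year from the same population, and watch the class
# average move for no reason other than sampling variation.
import numpy as np

rng = np.random.default_rng(0)

TRUE_MEAN = 500    # hypothetical true average for this teacher's students
STUDENT_SD = 100   # hypothetical spread of individual student scores
CLASS_SIZE = 25    # hypothetical number of tested students per year
YEARS = 10

class_means = [rng.normal(TRUE_MEAN, STUDENT_SD, CLASS_SIZE).mean()
               for _ in range(YEARS)]

print("class averages by year:", np.round(class_means, 1))
print("spread of those averages (SD):", round(np.std(class_means, ddof=1), 1))
# Theory says the class average has SD = STUDENT_SD / sqrt(CLASS_SIZE) = 20
# here, so year-to-year swings of 20 to 40 points are routine even though
# nothing about the teaching ever changed.
```

With numbers like these, a teacher whose class average jumps or drops 30 points may have done nothing different at all.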
Take something non-controversial: I just flipped a coin 30 times, getting heads 18 times. Then I flipped it another 30 times, getting 13 heads. Did the coin's performance get worse, or better, depending on whether we view heads or tails as the desired outcome? Of course not! The coin is the same coin, but our measurement of its performance sure looked different: 18 heads vs. 13, in 30 tries each.
But is the difference statistically significant? No, it isn't; it's an artifact of chance. And if you reward or punish teachers for chance variations in their classes' test scores, that's going to be counterproductive. So the first thing you want to do is make sure you've got enough prior data to compute variances. The second is to keep your administrators from making carrot-and-stick decisions on the basis of differences that aren't statistically meaningful.
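If you want to check that claim about the coin, it's a one-screen computation: a standard two-proportion z-test on the flips above, using nothing outside Python's standard library.

```python
# Is 18 heads in 30 flips significantly different from 13 heads in 30?
from math import sqrt, erfc

heads1, heads2, n = 18, 13, 30
p1, p2 = heads1 / n, heads2 / n
pooled = (heads1 + heads2) / (2 * n)        # overall fraction of heads
se = sqrt(pooled * (1 - pooled) * (2 / n))  # std. error of the difference
z = (p1 - p2) / se                          # about 1.29
p_value = erfc(abs(z) / sqrt(2))            # two-sided p, about 0.20
print(f"z = {z:.2f}, p = {p_value:.2f}")
```

A p-value around 0.20 is nowhere near the conventional 0.05 threshold: a difference this large or larger turns up by pure chance about one time in five.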
In how many school systems do you think they even thought about that first criterion before implementing high-stakes testing? And even if they did, how likely do you think it would be, in general, for the stat wonks (if the school system kept any in house, instead of just contracting out the occasional statistical issue) to keep the administrators from making decisions based on what look like big differences in test scores but are not statistically significant?
Good luck with that.
After your school system jumps through that hurdle, it's got a bunch more. You'd like to compare one teacher's performance one year with the same teacher's performance the next year. You'd also like to be able to compare different teachers in the same year. You can compare test scores straight up, but how do you know you're really measuring differences in teaching quality, rather than something extraneous, like the affluence of the students' families, the quality of their teachers in earlier grades, and so forth? I think that given enough time and resources, I could set up an evaluation system that would at least somewhat control for a lot (but probably not all) of this stuff: you could compute mean scores for students from different neighborhoods, you could do year-to-year comparisons on a student-by-student basis (though you'd run up against the fact that a decent number of students, each year, would have been in some other school district the year before), and so forth.
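For the statistically inclined, here's a rough sketch of one piece of that: a crude value-added regression that predicts this year's score from last year's score plus a neighborhood indicator, and reads the teacher effect off what's left over. The column names and numbers are invented for illustration; a real model would be far more elaborate.

```python
# A toy "value-added" regression on hypothetical student-level records.
# Linking each student's current and prior-year scores is exactly where
# transfers from other districts fall out of the data.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "score":        [520, 480, 610, 455, 570, 500, 530, 460],
    "prior_score":  [500, 470, 590, 450, 540, 495, 510, 455],
    "neighborhood": ["A", "A", "B", "B", "A", "B", "A", "B"],
    "teacher":      ["Smith", "Smith", "Smith", "Smith",
                     "Jones", "Jones", "Jones", "Jones"],
})

# The C(teacher) coefficient estimates the teacher effect after
# controlling for prior scores and neighborhood; its standard error
# tells you whether the difference is more than chance.
model = smf.ols("score ~ prior_score + C(neighborhood) + C(teacher)",
                data=df).fit()
print(model.params)
print(model.bse)
```

Even this toy version needs linked longitudinal records, clean neighborhood data, and someone on staff who knows what a standard error is.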
But this would take a serious commitment to statistical wonkitude on the part of the school administrators. And while some of them would have it, most wouldn't, or would want to but not have the resources, or would face resistance from other political actors.
In short, it's hard to see that this is something we'd do particularly well as a society, even under the best of circumstances.
And even if we designed the testing program as a good statistical experiment, and eliminated any traces of cheating from the system, you still have the problem that Stephen emphasized: teachers gaming the system by spending far more time than we'd wish preparing kids for the specific material covered by the tests and teaching test-taking skills. In other words, wasting a good chunk of your kid's school year so that the teachers can get good evaluations.
Maybe testing can be made to yield some useful benefits, but I really don't see how it will be possible anytime soon.