Fixing Standardized Tests

I’m thinking through an article on standardized tests right now. Let me put my half-developed thoughts on the blog and get feedback:

Twelve years after the passage of No Child Left Behind (NCLB), state standardized tests have become a spring ritual. Kids in classrooms across the country sit down to days of multiple-choice math and reading tests. For the most part, teachers, parents, and administrators hate these tests. They say the tests distract kids from real learning, evaluate teachers unfairly, and do not actually measure learning.

Most of the criticism of these tests stems from the fact that the results are used incorrectly. The tests are not designed to tell you anything about one kid or one teacher. They tell you something about large numbers of kids. It’s statistics, so the larger the N, the more useful the finding.

A test score doesn’t say much about one kid’s knowledge, because it is one test on one day for one kid. A kid could have a bad night’s sleep and mess up the test. The best way to evaluate a kid is with an entire year’s worth of work, assessed by multiple methods. Nor does the score say much about one particular class, because N is only about 20. Not a big number. Some years you get a sharp class; other years you don’t.
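The point about N can be made concrete with the standard error of a mean score, σ/√N: the typical gap between a group’s average and the true average shrinks with the square root of the group size. Here is a minimal Python sketch, assuming (purely for illustration) a score standard deviation of 15 points and independent scores; the class, school, and district sizes are also invented.

```python
import math

def standard_error(sd: float, n: int) -> float:
    """Standard error of a mean of n independent scores with standard deviation sd."""
    return sd / math.sqrt(n)

# Hypothetical assumption: individual scores vary with a standard deviation of 15 points.
SCORE_SD = 15.0

# Illustrative group sizes (assumed): one class, one school, one district.
for label, n in (("class", 20), ("school", 500), ("district", 20_000)):
    # About 95% of group averages land within 1.96 standard errors of the true mean.
    margin = 1.96 * standard_error(SCORE_SD, n)
    print(f"{label:>8} (N={n:>6}): average is uncertain by about ±{margin:.2f} points")
```

With these made-up numbers, a class of 20 has a margin of roughly ±6.6 points, while a district of 20,000 is pinned down to about ±0.2 points.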

These tests should not be used to evaluate teachers, individual kids, or even the effectiveness of new curriculum.

What the numbers do show rather nicely is what a whole school district or a large subgroup of kids knows (not what they learned) compared to another large subgroup of kids. For example, these tests show clearly that kids from Scarsdale are at a higher reading level in kindergarten than kids from New York City.

Why is this important? Educators tell me that it is irrelevant information. But it’s not. First, I think it’s really important to be reminded of the disparities in education over and over; it makes us think harder about how to fix the problem. Second, it is a useful political tool. Because we know that kids from certain backgrounds are significantly behind their peers at age 5, we’re starting to think more about investing in pre-K education.

So, how do we fix the incorrect use of standardized tests? We have to remove the penalty component of NCLB.

The other big problem with standardized tests is that they make some kids feel really bad. Special ed kids, kids from disadvantaged backgrounds, and kids with bad test-taking skills feel like shit during testing week. Did you ever take a test that you knew you were going to fail? Sucks. I wonder if we could remedy this problem with a little technology.

Let’s put away the No. 2 pencils and the bubble sheets and set up the tests on a computer. Let’s sit our hypothetical 5th grader in front of a computer with math problems that start at the 1st grade level. If the kid answers the super simple questions correctly five times in a row, the program switches to a series of 2nd grade questions. If the kid answers the 2nd grade questions correctly, he gets bumped up to 3rd grade questions. He would keep moving quickly up the levels of challenge until he started to get a few questions wrong. Then the computer would linger at that grade level for a while to get a better assessment of his knowledge. A slower 5th grader might be found proficient at 3rd grade math, while a more advanced 5th grader might test at a 6th grade level.
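The climbing procedure described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not how any real test works: the function name `adaptive_grade_level`, the 10 extra “lingering” questions, the 80% pass mark, and the 12th-grade ceiling are all hypothetical choices, and `answers_correctly` stands in for a student answering one question at a given grade level.

```python
from typing import Callable

def adaptive_grade_level(
    answers_correctly: Callable[[int], bool],  # hypothetical: True if the student gets a question at this grade right
    start_grade: int = 1,
    max_grade: int = 12,          # assumed ceiling
    streak_to_advance: int = 5,   # five in a row bumps the student up a grade
    linger_questions: int = 10,   # assumed number of extra questions after a miss
    pass_fraction: float = 0.8,   # assumed share of lingering questions needed to pass
) -> int:
    """Return the highest grade judged proficient (start_grade - 1 if even start_grade is not)."""
    grade = start_grade
    while True:
        # Fast climb: keep asking until five straight correct answers or the first miss.
        streak = 0
        while streak < streak_to_advance and answers_correctly(grade):
            streak += 1
        if streak < streak_to_advance:
            # A miss: linger at this grade to separate a slip from a real gap.
            correct = sum(answers_correctly(grade) for _ in range(linger_questions))
            if correct / linger_questions < pass_fraction:
                return grade - 1  # proficient only up to the previous grade
        # Either five in a row or a passed lingering round: this grade is mastered.
        if grade == max_grade:
            return max_grade
        grade += 1
```

For a 5th grader who reliably handles 3rd grade work but nothing harder, `adaptive_grade_level(lambda g: g <= 3)` returns 3. The lingering round keeps a single careless slip from ending the climb. Operational computerized adaptive tests typically pick questions using item response theory rather than a fixed grade ladder like this one.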

This method for assessment would not only give a better picture of a child’s ability, but the testing process wouldn’t make them feel stupid.

What do you think?

14 thoughts on “Fixing Standardized Tests”

  1. Just addressing the computer testing model you mention in your last paragraph: it’s already done here in TX. The kids have two sets of standardized tests. The state-mandated ones (TAKS until last year, STAAR this year) are paper tests. Then there is “MAP” testing, which works exactly as you described, with the questions getting tougher as you answer more correctly at each level. The results of MAP testing are used to funnel kids into advanced programs in Reading, Math, and Science. A quick Google tells me there are lots of school districts across the country using MAP testing:


  2. I think your proposal has promise. Let me extend it. I believe what you describe is known as adaptive testing. If a child has demonstrated a firm grasp of reading at a 4th grade level the year before, start the testing at the next grade level the next year. Don’t waste everyone’s time making all students prove they still understand multiplication.

    Putting it on the computer gets all those test booklets out of the schools, which removes the (adults’) temptation to cheat.

    Go one step further. Choose the students to be tested with a random computer algorithm. Set up the testing for an extra week of school, and pay students’ families for their time. Even with the payment, I’d bet it would be much more efficient to test a small subset of the school population at the end of the school year than to stop classroom instruction completely for everyone in the school for hours on end. Such a system would avoid the problem of schools encouraging certain students to miss test days, and the ongoing problem (in our district) of parents choosing to go on vacation during the testing period. The public is paying these teachers to teach. Using them as test proctors is a waste.

    Oh! Proctors. Hire and train proctors to guard against cheating, or forbid teachers from supervising their own school’s students.

    The SAT should administer its tests on computers, under more secure conditions, with professional proctors. I suppose that’s a different discussion.


  3. Yes, this is adaptive testing. The GRE has worked this way for a decade, I believe, and I have no idea why the SAT hasn’t switched over. Independent proctors are a necessity, and I like the random sampling idea too.


  4. A lot of companies are getting into the adaptive testing/learning game. Knewton is one that comes to mind. There’s big bucks to be made in it and big bucks to be lost by the current companies. I love data, though, and I do hope that better data can help us do something about education. The problem is, humans still have to interpret and act on the data appropriately.


  5. Adaptive testing (we also call these procedures staircases) is sometimes used to estimate thresholds in psychophysics (e.g., the threshold volume at which you can hear a sound, or the contrast at which you can see a pattern). One problem with using it to test more complex material (like reading or math) is that it’s a non-trivial exercise to determine the difficulty (or what people are calling “grade level”) of each question. So you could, for example, answer an “easy” question wrong at the beginning of the test and get shunted into a weaker version of it. One can add random questions to let people pop out of the rut, but if you’re not certain which questions are easier (and how “easy” interacts with a particular individual), it’s hard to develop adaptive paradigms.

    We use MAP testing in our neck of the woods, and one of the big problems is that the testing completely consumes a school’s tech resources while it runs. So I like the suggestion of testing a random sample of students, and paying their families for the testing, as a way of evaluating the school. But I’m not sure affluent families would agree to participate for the compensation that would likely be offered, which would also skew the results.


  6. We use the MAP test in WI. Kids take it three times a year on the computer, and the questions get progressively harder. You mention some of the benefits, but there are also downsides. If you have kids who typically test in the 99th percentile, they may be sitting in front of the computer for a very long time. My kids usually have to take the test over two days, because not enough time is allotted for them to get through all the questions they can answer in one sitting. They miss valuable classroom instruction time because they are still taking the damn test.

    If you have kids for whom English is a second or third language, this test is much harder. Everything is online: you can’t go back and re-read previous questions, and you can’t highlight or underline a phrase that might be important. Our ELL kids seem to be at a big disadvantage on the online tests. Finally, the testing overloads the schools’ internet connections, which often crash. Didn’t Seattle just boycott MAP?


  7. Yes, many of the high school teachers in Seattle Public Schools decided to boycott MAP because of the time and resources it consumed (and because they didn’t support using the test for student/teacher evaluation). I think there’s more support for the test at the elementary level (though there is opposition there, as well).


  8. Well, “feeling bad” isn’t really the problem with testing we should focus on fixing. If ‘we’ decide that testing is valuable, then what needs to be fixed is how gender-biased and Eurocentric these tests are. Also, when you consider computer models, how do you make up for the digital achievement gap between low-income and non-low-income students?


  9. Lots of good comments… Hmmm. Let me take a stab at answering them all…

    1. Cranberry’s suggestion to hold testing after the school year, test a randomized sample of students, and use proctors rather than teachers was very interesting. I thought about it a lot, but I think there would be too many implementation problems.

    Here in NJ, we have a very strong teachers union. They would never allow proctors into the school. They wouldn’t allow school to go beyond the 180 days. It would be very expensive to keep the school buildings open and hire the proctors. There’s no money for new programs. Also, by using a randomized sample, you would be diluting your N quite a bit.

    I would prefer to test all the kids every other year.

    2. Ah, “adaptive testing” is the jargon. OK, good to know. Ian took a bunch of tests this winter to prepare for his move to a new school district. The tests didn’t work on him because we had switched his ADHD meds, and the new meds were causing him to chew on his shirt and have other weird tics. The tester was a moron, too. She found that he couldn’t add or subtract. (He’s getting an A in his math class right now and is multiplying fractions and all that stuff.) So no standardized test given on one day for one hour will be very useful for understanding the capacity of one kid. These tests are useful for telling us about a large group of kids, so Ian’s bad testing morning would be balanced out by another kid who is a testing machine.

    The only good thing about the adaptive test is that it doesn’t make a kid feel bad for getting answers wrong.


  10. Fionnula – The most (and only) important part of these tests is that they make the gaps in knowledge obvious. So, if kids flunk the tests because they don’t know how to use a computer or they don’t use the same words as middle class kids, then we need to know that.


  11. Are some states using the MAP test to comply with the NCLB requirements? I don’t think they are, but I’m not sure.


  12. The CAT GRE was very unnerving. I found that half of doing well on the test was enduring the psychological pressure of trying to figure out whether the questions were getting harder or not.

    This would make the test not really adaptive, but one thing you could do is offer the same range of questions to all students and allow them to stop at any time. Say every 3rd grader gets 5 questions at each level from 1st through 8th grade. They can answer the questions in any order, and can choose to stop the test at any time and leave the testing facility. This would let kids who are advanced in some areas but behind in others show their knowledge accurately.


  13. B.I., my kids do really well on standardized tests. Given the choice, though, in third grade they would have chosen the five easiest questions, pausing for the shortest time possible in the testing facility before heading outside to *play*!

    “Also, by using a randomized sample, you would be diluting your N quite a bit.”

    I think that’s a feature, not a bug. We are asking our schools to do too many things. Why have we accepted without question that schools exist to produce data for researchers? They don’t. The first and most important mission for schools is to educate children.

    A random sample would decrease the temptation to turn the classroom into State Test Prep. I gather this is happening, from reading newspapers and online teacher blogs. This is A Bad Thing. It drastically reduces the time in the classroom devoted to larger questions. Before our kids left the public schools, I noticed the gradual creep of test prep. Homework was drawn from old state tests, and practice tests. I can’t imagine how restrictive it will be when the same big publishing company produces the tests, and the curricular materials.

    The Common Core hasn’t hit yet, but mandating that 50% of instructional text be nonfiction? The Common Core authors can claim as often as they want that they intend history, science, and other subjects to count toward that total, but judging from the newspapers, magazines, and teacher blogs, the English classroom will bear the brunt of the changes. Cynically, I suspect that’s because other subjects may not require much reading at all at present.

    If every child is tested, it makes sense for administrators to demand that teachers Teach to the Test. If every 10th child were tested, administrators couldn’t predict which children would be tested. Nor could they predict which areas of knowledge are most likely to end up on the exam, because it would be administered by computer and they couldn’t study the test at their leisure. It would then be more rational to provide a wide and deep education to every child.

    It would also end the practice of using state standardized test results to make decisions about individual children, who might have had a bad day on the particular day, or gotten lucky when randomly filling in blanks.


  14. cranberry,

    Isn’t the danger of kids purposely not trying because they’d rather be doing something else true for any standardized test? If you told the kids that they had to stay until they’d answered all the questions they could, and once they’d done that they can leave, I would imagine you wouldn’t get much more of that than you would on any ordinary test, where I’m sure there are some kids who just fill in all Cs so they can spend more time sleeping/daydreaming.
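The random-sampling proposal from comments 2 and 13 (test a randomly chosen tenth of the students rather than everyone) can be sketched in Python. Everything here is hypothetical: the helper names, the 600-student school, and its simulated scores are invented for illustration. The sketch picks students with a seeded random generator and estimates the school’s average from the sample alone, with a standard error that shows the cost of the smaller N raised in comment 9.

```python
import math
import random
import statistics
from typing import Optional

def sample_students(student_ids: list[str], fraction: float = 0.1, seed: Optional[int] = None) -> list[str]:
    """Choose a random subset of students to test; nobody can predict who will be picked."""
    rng = random.Random(seed)
    k = max(2, round(len(student_ids) * fraction))  # at least 2 so a standard error exists
    return rng.sample(student_ids, k)

def estimate_school_mean(scores_by_id: dict[str, float], sampled_ids: list[str]) -> tuple[float, float]:
    """Estimate the school-wide mean score and its standard error from the sampled students only."""
    sample_scores = [scores_by_id[s] for s in sampled_ids]
    mean = statistics.mean(sample_scores)
    # Ignores the finite-population correction, which would shrink the error slightly.
    se = statistics.stdev(sample_scores) / math.sqrt(len(sample_scores))
    return mean, se

# Hypothetical school: 600 students with simulated scores (mean 200, SD 15).
sim = random.Random(1)
scores = {f"student{i}": sim.gauss(200, 15) for i in range(600)}

chosen = sample_students(list(scores), fraction=0.1, seed=42)
mean, se = estimate_school_mean(scores, chosen)
print(f"Tested {len(chosen)} of {len(scores)} students: average ≈ {mean:.1f} ± {1.96 * se:.1f}")
```

The seed is fixed only to make the example reproducible; a real selection would leave `seed` unset so that no one could anticipate which students will be tested.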

