Most of the tests we work on at Measurement Incorporated are of the high-stakes variety. That is, scores on many of those tests have immediate and far-reaching consequences for students, schools, districts, and even states. Scores on others have consequences for candidates for licensure and certification and for their professions. Whether the test is for a statewide assessment or for licensure and certification, someone has to decide what level of performance is good enough. That is where cut scores come in. Cut scores are set through a process commonly known as standard setting.
Standard setting involves gathering a panel of content experts (such as teachers for academic tests or current practitioners for licensure and certification tests) who apply their judgment regarding the requirements of the task and the difficulty of the items. The process follows one of several well-researched methods for setting cut scores. In the end, though, the process boils down to the opinions of a group of experts. Transforming those opinions into cut scores is the art and science of standard setting.
The science of standard setting lies in several well-researched methods. These include the Angoff, Beuk, body of work, bookmark, contrasting groups, Ebel, and Hofstee methods, a host of others, and variations on all of them. In some methods, experts review the items and offer their opinions as to the likelihood that a barely qualified examinee would answer each one correctly. In others, the experts review actual work of examinees and sort it into categories (e.g., basic, proficient, advanced).
In all instances, someone takes these opinions, summarizes them statistically, and turns them into cut scores. For example, in the modified Angoff procedure, experts review item 1 and give their opinions as to the likelihood that a barely qualified examinee will answer it correctly. If there are ten experts, they are likely to offer ten opinions. Generally, those opinions fall within a fairly narrow band. These ten experts might offer estimates ranging from 40 to 60 percent for item 1. They then do the same thing for item 2 and so on through the entire test. Someone (usually a psychometrician) then averages the estimates for item 1, item 2, and so on. If each item is worth one point, and the average estimate for item 1 is 55 percent, then the expected score of a barely qualified examinee on that item is 0.55 points; i.e., the average of the ten experts' estimates. By adding these item averages across all items, the psychometrician obtains the expected total score of a barely qualified examinee. Thus, if the test has 100 questions, each worth one point, and the average estimate over all items and experts is 63 percent, then the expected score of the barely qualified examinee is 63 percent, or 63 points, since this is a 100-point test.
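The arithmetic described above can be sketched in a few lines of Python. This is only an illustration, not the procedure any particular psychometrician uses: the function name angoff_cut_score, the three-expert, four-item panel, and every rating in it are hypothetical, and the sketch assumes each item is worth the same number of points.

```python
from statistics import mean


def angoff_cut_score(ratings, points_per_item=1.0):
    """Return (cut_score, item_means) for one round of Angoff-style ratings.

    ratings[e][i] is expert e's estimate (0.0 to 1.0) of the probability
    that a barely qualified examinee answers item i correctly.
    Assumption: every item is worth points_per_item points.
    """
    if not ratings or not ratings[0]:
        raise ValueError("need at least one expert and one item")
    n_items = len(ratings[0])
    if any(len(expert) != n_items for expert in ratings):
        raise ValueError("every expert must rate every item")

    # Average the experts' estimates item by item.
    item_means = [mean(expert[i] for expert in ratings) for i in range(n_items)]

    # The expected score of the barely qualified examinee is the sum of
    # the item averages, scaled by the point value of each item.
    cut_score = sum(m * points_per_item for m in item_means)
    return cut_score, item_means


# Hypothetical panel: 3 experts rating a 4-item test.
example_ratings = [
    [0.40, 0.70, 0.60, 0.80],
    [0.55, 0.65, 0.70, 0.75],
    [0.70, 0.60, 0.50, 0.85],
]
cut, item_means = angoff_cut_score(example_ratings)
print("item averages:", [round(m, 2) for m in item_means])
print("expected score of a barely qualified examinee:", round(cut, 2))
```

In this made-up panel the estimates for the first item average 55 percent, so that item contributes 0.55 points to the recommended cut score, just as in the example in the text.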
The cut score is the score that divides the qualified from the unqualified. In the “qualified” group, there will be someone who just barely made the cut. That is our barely qualified examinee whom we expect to earn a score of 63 on this 100-item test.
But standard setting doesn’t stop there. Once the experts have rendered their opinions and produced this cut score, the psychometrician asks them to look at their own judgments and the judgments of the other nine experts and discuss their similarities and differences. During this discussion, differences in the interpretation of competence or mastery usually emerge, as do differences in viewpoints about the difficulty of the test. Then they do it all over again. Only this time, as the experts look at the items, they are aware of the opinions of other experts and have an opportunity to change their minds. Once more, they work through all the items, rendering their expert opinions as to the likelihood that a barely qualified examinee will answer each one correctly.
And standard setting doesn’t stop there either. The psychometrician calculates the cut score (or cut scores) just as in the first case and presents the results to the experts. Once again, the experts are able to compare their judgments and discuss similarities and differences. At this point, the psychometrician will frequently conduct a reality check. He or she will present the scores examinees earned on the test. The experts can then see how many examinees would pass and fail if they put their latest cut score into effect. This reality check prompts another round of discussion and offering of expert opinions, after which the psychometrician calculates a final cut score, just as before.
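The reality check amounts to counting how many examinees fall on each side of a proposed cut score. A minimal sketch follows, with some assumptions: the function name impact_of_cut_score and the ten scores are hypothetical, and the sketch assumes that an examinee whose score equals the cut score passes.

```python
def impact_of_cut_score(scores, cut_score):
    """Return (n_pass, n_fail, pass_rate) for a proposed cut score.

    Assumption: a score equal to the cut score counts as passing.
    """
    if not scores:
        raise ValueError("need at least one examinee score")
    n_pass = sum(1 for s in scores if s >= cut_score)
    n_fail = len(scores) - n_pass
    return n_pass, n_fail, n_pass / len(scores)


# Hypothetical scores earned by ten examinees on a 100-point test.
example_scores = [48, 55, 61, 63, 63, 67, 72, 80, 85, 91]

# Show the panel what several candidate cut scores would mean in practice.
for candidate_cut in (60, 63, 66):
    n_pass, n_fail, rate = impact_of_cut_score(example_scores, candidate_cut)
    print(f"cut {candidate_cut}: {n_pass} pass, {n_fail} fail ({rate:.0%} passing)")
```

Putting several candidate cut scores side by side, as in the loop, is one way to show a panel how sensitive the pass rate is to small changes in the cut.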
The other methods have different details, but the end result is essentially the same: the cut score reflects the judgments of several experts who have had multiple opportunities to examine the test items or examinee work, discuss their opinions with one another, and then render final judgments. Someone statistically summarizes these judgments into a cut score or set of cut scores.
Then there is the art of standard setting. After all, this is a group process, involving people with varied opinions. Keeping the experts on task and willing to listen to one another for two or three days requires diplomacy and tact. Moreover, the people who recommend the cut scores are usually not the same as the people who adopt them. State boards of education or licensure and certification boards have their own opinions as to what constitutes competence, and even they have their own internal differences of opinion. They have to be assured of the validity of the recommendations coming from a standard-setting group before adopting them or accepting them with minimal change. A bit of sales ability is often required here.
The cut scores on high-stakes tests that sort and sift students or professional candidates do not just appear or spring from any one person’s mind. They are the product of considerable effort on the part of many people working together over time, sharing their opinions, influencing and being influenced by others, and ultimately adopting a standard that represents their best thinking. The psychometrician, using one of the many available standard-setting methods, applies both art and science to help them arrive at that standard.
Measurement Incorporated has been a leader in the art and science of standard setting for over 20 years. In addition to helping clients in both the K-12 and professional assessment sectors set standards for high-stakes tests, we have provided leadership in the field. Dr. Michael Bunch, Senior Vice President of MI, is the co-author, with Dr. Gregory Cizek, of Standard Setting, published by Sage in 2007. That text has quickly become the go-to source for professionals and students around the world.
For more information, read the Standard Setting book review.