Daniel Koretz, Harvard University
In her note in this issue, President Tong calls for a stronger tie between research and practice. Her call is both critically important and overdue.
I suggest that we need to broaden this, however, from “practice” to “practical impact” more generally, because educational measurement can have important and sometimes profound consequences even when it does not directly affect educational practice. What matters in the end is not whether our work is innovative or elegant per se, but whether it provides a meaningful improvement: better information or other positive effects, either within the educational system or elsewhere in society. Technical innovation should be especially valued when it is in service of these goals. To use Donald Stokes’ (1997) famous framework, our work belongs primarily in Pasteur’s Quadrant (use-inspired basic research) and Edison’s Quadrant (pure applied research).
I also suggest that we need not only to alter practice to better reflect research, but also to redirect research to better focus on important measurement issues that have major practical implications. I’ll focus on the latter here.
Bringing about this refocusing will require several steps, some more difficult than others. And for reasons I will note at the end, making sizeable progress will require action by many in our field, not just decisions by individuals about their own work.
The most straightforward step in this reorientation will be focusing more on evaluating and describing the effects both of innovations and of the choices between extant methods in real contexts. We need to know whether innovations will matter in practice, and when the effects are complex—as they often are—we need to evaluate the tradeoffs. All of us can point to important and praiseworthy work of this sort, but it isn’t now a large enough share of our collective work, and it is often not what the field most prizes.
The second step needed to accomplish this reorientation is more difficult: giving more weight to practical importance in selecting topics for research. I’m certain that we differ in our rankings of the measurement problems that have the greatest impact, but I’ll focus here on two that have particularly large implications: score inflation and the arbitrariness of performance standards.
The logic of score inflation is straightforward and has been discussed in the professional literature for at least 70 years, since E. F. Lindquist discussed it in his introduction to the second edition of Educational Measurement (Lindquist, 1951). Empirical studies of the phenomenon are also not new. Along with Bob Linn, Steve Dunbar, and Lorrie Shepard, I conducted the first empirical test of score inflation three decades ago. That study, a cluster randomized design, found inflation of half an academic year in the mathematics scores of third-graders in a system that imposed consequences that were modest by current standards (Koretz, Linn, Dunbar, & Shepard, 1991). Subsequent research has shown that the problem is both common and often very large (for example, Fuller, Gesicki, Kang, & Wright, 2006; Ho, 2007; Jacob, 2005; Klein, Hamilton, McCaffrey, & Stecher, 2000; Koretz & Barron, 1998). Any number of prominent scholars in our field, including Bob Linn, Lorrie Shepard, George Madaus, Ed Haertel, Laura Hamilton, and Andrew Ho, and a number of prominent scholars in other fields, for example, Brian Jacob and Derek Neal in economics and Jen Jennings in sociology, have written about the problem of score inflation. The general problem of which score inflation is a specific instance—that is, the corruption of numerical measures used for accountability—is so well documented in so many fields that it is commonly known as “Campbell’s Law.” The mainstream media have reported on score inflation (e.g., Medina, 2010), and even some education policymakers have acknowledged it publicly (e.g., Steiner, 2009; Tisch, 2009). Concurrently, numerous studies have investigated the types of test preparation that have the potential for score inflation, such as narrowing of instruction and coaching (see Koretz & Hamilton, 2006; Stecher, 2002).
Yet for all that, our field’s response to the problem has been spotty and insufficient, and all too often, inflation is simply ignored.
The implications of inflation for research are far-ranging and include matters of test design, linking, validation, and reporting. For example, we know that tests include many consistencies over time in content, presentation, and task demands that are not substantively necessary, and we have some evidence that these consistencies can positively affect performance (Holcombe, Jennings, & Koretz, 2013; Koretz et al., 2016; Morley, Bridgeman, & Lawless, 2004). But as yet we have learned very little about which of these predictable patterns afford the most opportunity for inflation, information that should be a consideration not only in overall test design, but also in designing and using task shells. We have seen only a few efforts to substantially rethink test design to lessen opportunities for score inflation (Barlevy & Neal, 2012; Hanushek, 2009; Koretz & Béguin, 2010), and these have sparked little additional research and have not affected practice. The NEAT (nonequivalent groups with anchor test) linking used in most large-scale assessment programs fails when score gains are inflated; indeed, this linking is the mechanism by which the scale becomes biased (Koretz, 2015). Yet this fact hasn’t spurred attempts to devise more robust alternatives. Traditional approaches to validation, which use cross-sectional data and entail no monitoring of trends, cannot evaluate potential inflation (Koretz & Hamilton, 2006), but the field rarely supplements them with methods that can. The field has done little to develop methods of validating score gains under high-stakes conditions that are more widely applicable and more informative than using separate audit tests.
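To make the audit-test logic concrete, here is a minimal sketch of the comparison it relies on: the gain on the high-stakes test is set against the gain on a lower-stakes audit test given to comparable students, with both gains expressed in baseline standard deviation units. The function names and all scores below are hypothetical, and the simple difference in standardized gains is an illustrative assumption, not a method taken from any of the studies cited here.

```python
from statistics import mean, pstdev

def standardized_gain(base, later):
    """Mean gain between two cohorts, in baseline standard deviation units."""
    return (mean(later) - mean(base)) / pstdev(base)

def inflation_estimate(hs_base, hs_later, audit_base, audit_later):
    """Gain on the high-stakes test minus gain on the audit test.

    A large positive value suggests that high-stakes gains do not
    generalize to the audit test. This is an illustrative index only.
    """
    return (standardized_gain(hs_base, hs_later)
            - standardized_gain(audit_base, audit_later))

# Hypothetical scores for two cohorts on each test (not real data).
hs_base, hs_later = [10, 12, 14, 16, 18], [13, 15, 17, 19, 21]
audit_base, audit_later = [20, 24, 28, 32, 36], [21, 25, 29, 33, 37]

estimate = inflation_estimate(hs_base, hs_later, audit_base, audit_later)
print(f"high-stakes gain exceeds audit gain by {estimate:.2f} SD")
```

A real analysis would also have to address sampling error, differences in what the two tests measure, and differences in student motivation, none of which this sketch attempts.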
The arbitrariness of performance standards is also old news. More than 30 years ago, Dick Jaeger, then one of the preeminent scholars of standard-setting, published a review article in which he documented a striking lack of consistency of cut scores across methods. One of Jaeger’s measures was the ratio of the proportions of students who would fail with pairs of methods. He wrote:
The smallest ratio was 1.00,….but the largest was 29.75….The median of these ratios was 2.74…[and] the average was 5.90….There is little consistency in the results of applying different standard-setting methods under seemingly identical conditions, and there was even less consistency in the comparability of methods across settings (Jaeger, 1989, p. 500).
This conclusion was echoed five years later by the participants in the Joint Conference on Standard Setting for Large-Scale Assessments (Linn, 2000). Linn also reported a study of three methods applied to data from new assessments in Kentucky in which the largest multiple of the passing rate across methods was roughly a factor of 10.
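The ratio Jaeger describes is simple arithmetic, and a short sketch shows how quickly it grows. The example below assumes the ratio is the larger failure proportion divided by the smaller one for each pair of methods (consistent with a minimum of 1.00). The method names and failure proportions are hypothetical and are not Jaeger's data.

```python
from itertools import combinations
from statistics import median

def failure_ratios(fail_rates):
    """Larger-to-smaller ratio of failure proportions for each pair of methods."""
    ratios = {}
    for (m1, p1), (m2, p2) in combinations(fail_rates.items(), 2):
        high, low = max(p1, p2), min(p1, p2)
        if low <= 0:
            raise ValueError("failure proportions must be positive")
        ratios[(m1, m2)] = high / low
    return ratios

# Hypothetical proportions of students failing under three methods.
fail_rates = {"method_A": 0.05, "method_B": 0.20, "method_C": 0.40}

ratios = failure_ratios(fail_rates)
for pair, ratio in ratios.items():
    print(pair, round(ratio, 2))
print("median ratio:", round(median(ratios.values()), 2))
```

Even these modest-looking differences in failure rates yield ratios between 2 and 8, which gives a sense of what a median of 2.74 or a maximum of 29.75 implies for the students classified.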
However, even these disconcerting findings understate the lack of consistency. We learned years ago that the results of standard setting can be inconsistent within methods as well as between them, varying with factors that are irrelevant to the substantive decision being made. For example, the placement of standards can vary appreciably depending on item format and item difficulty (Shepard, 1993), and in the case of the now dominant bookmark method, depending on the arbitrarily set response probability (National Research Council, 2005).
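The dependence on response probability can be illustrated with a simplified sketch. It assumes a two-parameter logistic item response model and assumes that the cut score is taken to be the ability at which the bookmarked item is answered correctly with the chosen response probability; operational bookmark procedures differ in their details. The function name, item parameters, and response probabilities are hypothetical.

```python
import math

def rp_location(b, rp, a=1.0):
    """Ability at which an item is answered correctly with probability rp.

    Assumes a 2PL logistic model, P(theta) = 1 / (1 + exp(-a * (theta - b))),
    solved for theta at P(theta) = rp.
    """
    if not 0.0 < rp < 1.0:
        raise ValueError("rp must be strictly between 0 and 1")
    return b + math.log(rp / (1.0 - rp)) / a

# The same hypothetical bookmarked item (difficulty 0.0, discrimination 1.0)
# evaluated under three response-probability conventions.
b_bookmarked = 0.0
for rp in (0.50, 0.67, 0.80):
    print(f"RP = {rp:.2f}: cut score on the ability scale = "
          f"{rp_location(b_bookmarked, rp):.3f}")
```

Under these assumptions, the panelists' judgment (the bookmarked item) is held fixed and only the response-probability convention changes, yet the implied cut score moves along the ability scale.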
Linn (2000) argued that we need to confront the implications of this variation for both error and validity. With respect to the former, he wrote that “we would at least acknowledge that there is a high degree of uncertainty associated with any performance standard” (Linn, 2000, p. 8).
With respect to validity, both Popham (1978) and Hambleton (1998) defended performance standards by saying that they are arbitrary but not capricious. If standards-based reporting were simply a coarse ordinal scale, this might be a reasonable defense; other than norm-referenced scales, many numerical reporting scales are of course arbitrary. However, given that performance standards are tied to substantively meaningful descriptors, arbitrariness necessarily undermines valid inference. As Linn wrote:
The variability in the percentage of students who are labeled proficient or above due to the context in which the standards are set, the choice of judges, and the choice of method to set the standards is, in each instance, so large that the term proficient becomes meaningless (Linn, 2000, p. 13).
This seems hard to dispute. Imagine, for example, a principal telling a parent: “You should be concerned that your child is not proficient in mathematics, even though this label is arbitrary.”
One can point to important work done in response to this problem. For example, Ed Haertel (2002; Haertel & Lorié, 2004) endeavored to develop an approach that would provide a more defensible linking of cut scores to substantive descriptions of performance standards. Andrew Ho (2007; Ho & Reardon, 2012) developed methods for avoiding the distortions created by evaluating gaps and trends using arbitrary cut scores. One can point to other examples as well.
Nonetheless, it is fair to say that the field has done far too little to address these problems. Indeed, these problems, like score inflation, are all too often ignored entirely.
To be fair, it may not be surprising that few people attack problems like this in their own research. They are not amenable to tidy, elegant solutions. Conducting relevant research often poses severe practical problems, e.g., difficulties in obtaining needed data. (I can say from bitter personal experience that many state and local superintendents are simply unwilling to allow access to data to investigate problems of score inflation.) And addressing problems of this scope often takes a long time, which is a concern particularly of young faculty who are anxious about maintaining a high publication rate.
However, not all measurement issues with potentially substantial practical impact are as daunting as these two. There is plenty of additional work to be done that is both more modest in scope and more tractable.
It would be a serious mistake to attribute the insufficient focus on practical effects solely to the choices of individual researchers. As I became uncomfortably aware in advising my students, members of our community, particularly younger members, face strong incentives not to choose this path. To encourage a substantial increase in work of clear practical importance, these incentives must be changed. The field as a whole must show that it values this work more than it currently does. Editors of journals must give more weight to practical importance in deciding which papers to accept. Faculty and senior staff responsible for evaluating more junior researchers must give credit for doing work of this sort, and they have to make allowances for the fact that some very important work cannot be done quickly. And NCME needs to reward work of this sort. Making real progress will require the efforts of our community as a whole.
Barlevy, G., & Neal, D. (2012). Pay for percentile. American Economic Review, 102(5), 1805-1831.
Fuller, B., Gesicki, K., Kang, E., & Wright, J. (2006). Is the No Child Left Behind Act working? The reliability of how states track achievement (Working Paper 06-1). Policy Analysis for California Education, PACE.
Haertel, E. H. (2002). Standard setting as a participatory process: Implications for validation of standards-based accountability programs. Educational Measurement: Issues and Practice, 21(1), 16-22.
Haertel, E. H., & Lorié (2004). Validating standards-based test score interpretations. Measurement, 2(2), 61-103.
Hambleton, R. K. (1998). Setting performance standards on achievement tests: Meeting the requirements of Title I. In L. N. Hansche (Ed.), Handbook for the development of performance standards: Meeting the requirements of Title I (pp. 87-115). Washington, D.C.: U.S. Department of Education and The Council of Chief State School Officers.
Hanushek, E. (2009). Building on No Child Left Behind. Science, 326, 802-803.
Ho, A. D. (2007). Discrepancies between score trends from NAEP and state tests: A scale-invariant perspective. Educational Measurement: Issues and Practice, 26(4), 11-20.
Ho, A. D., & Reardon, S. F. (2012). Estimating achievement gaps from test scores reported in ordinal “proficiency” categories. Journal of Educational and Behavioral Statistics, 37, 489-517.
Holcombe, R., Jennings, J., & Koretz, D. (2013). The roots of score inflation: An examination of opportunities in two states’ tests. In G. Sunderman (Ed.), Charting reform, achieving equity in a diverse nation, 163-189. Greenwich, CT: Information Age Publishing. http://dash.harvard.edu/handle/1/10880587
Jacob, B. A. (2005). Accountability, incentives and behavior: The impact of high-stakes testing in the Chicago public schools. Journal of Public Economics, 89(5-6), 761-796. doi: 10.1016/j.jpubeco.2004.08.004
Jaeger, R. M. (1989). Certification of student competence. In R. L. Linn (Ed.), Educational measurement (third edition, pp. 485-514). New York: American Council on Education & Macmillan Publishing Company.
Klein, S. P., Hamilton, L. S., McCaffrey, D. F., & Stecher, B. M. (2000). What do test scores in Texas tell us? Santa Monica, CA: RAND (Issue Paper IP-202). Last accessed from http://www.rand.org/publications/IP/IP202/ on June 4, 2013.
Koretz, D. (2015). Adapting the practice of measurement to the demands of test-based accountability: Response to commentaries. Measurement: Interdisciplinary Research and Perspectives, 13(3), 1-6.
Koretz, D., & Barron, S. I. (1998). The validity of gains on the Kentucky Instructional Results Information System (KIRIS). MR-1014-EDU. Santa Monica, CA: RAND.
Koretz, D., & Béguin, A. (2010). Self-monitoring assessments for educational accountability systems. Measurement: Interdisciplinary Research and Perspectives, 8(2-3: special issue), 92-109. https://dash.harvard.edu/handle/1/4889562.
Koretz, D., & Hamilton, L. S. (2006). Testing for accountability in K-12. In R. L. Brennan (Ed.), Educational measurement (4th ed.), 531-578. Westport, CT: American Council on Education/Praeger.
Koretz, D., Jennings, J. L., Ng, H. L., Yu, C., Braslow, D., & Langi, M. (2016). Auditing scores for inflation using self-monitoring assessment: Findings from three pilot studies. Educational Assessment, 21(4), 231-247. http://dx.doi.org/10.1080/10627197.2016.1236674; https://dash.harvard.edu/handle/1/28269315.
Koretz, D., Linn, R. L., Dunbar, S. B., & Shepard, L. A. (1991). The Effects of High-Stakes Testing: Preliminary Evidence About Generalization Across Tests. In R.L. Linn (chair), The Effects of High Stakes Testing, symposium presented at the annual meetings of the American Educational Research Association and the National Council on Measurement in Education, Chicago, April. http://dash.harvard.edu/handle/1/10880553.
Lindquist, E. F. (1951). Preliminary considerations in objective test construction. In E. F. Lindquist (Ed.), Educational measurement (2nd ed., pp. 119–158). Washington: American Council on Education.
Linn, R. L. (2000). Performance standards: Utility for different uses of assessments. Education Policy Analysis Archives, 11(1).
Medina, J. (2010, October 10). On New York school tests, warning signs ignored. The New York Times, A1. Retrieved June 14, 2020 from https://www.nytimes.com/2010/10/11/education/11scores.html.
Morley, M. E., Bridgeman, B., & Lawless, R. R. (2004). Transfer between variants of quantitative items. Princeton, N.J.: Educational Testing Service (GRE Board Research Report No. 00-06R).
National Research Council. (2005). Measuring literacy: Performance levels for adults. Committee on Performance Levels for Adult Literacy (R. M. Hauser, C. F. Edley, Jr., J. A. Koenig, & S. W. Elliott, Eds.). Washington, D.C.: The National Academies Press.
Popham, W. J. (1978). Criterion-referenced measurement. Englewood Cliffs, NJ: Prentice-Hall.
Shepard, L. (1993). Setting performance standards for student achievement. A report of the National Academy of Education Panel on the Evaluation of the NAEP Trial State Assessment: An evaluation of the 1992 Achievement Levels. Stanford, CA: Stanford University, The National Academy of Education.
Stecher, B. (2002). Consequences of Large-Scale High-Stakes Testing on School and Classroom Practice. In L. Hamilton, B. M. Stecher, & S. Klein (Eds.), Making Sense of Test-Based Accountability in Education (pp. 79-100). Santa Monica, CA: RAND Corporation.
Steiner, D. (2009). Commissioner Steiner’s Statement on New York NAEP Performance in Mathematics. Albany, N.Y.: The State Education Department, Office of Communications (October 14).
Stokes, D. E. (1997). Pasteur’s Quadrant: Basic Science and Technological Innovation. Washington, D.C.: Brookings Institution Press.
Tisch, M. (2009). What the NAEP results mean for New York. Latham, N.Y.: New York School Boards Association (November 9). Last retrieved on June 14, 2020 from https://www.nyssba.org/index.php?src=news&refno=1110&category=On%20Board%20Online%20November%209%202009.