Marianne Perie, Measurement in Practice, LLC
Hearing that our new NCME president, Ye Tong, wants to emphasize “Bridging Research and Practice” during her tenure was gratifying, as that has been the focus of my research for the past two decades. In the 1990s, I struggled to find sessions at the annual meeting that matched my interests, as so many focused on the more nuanced aspects of item response theory or differential item functioning. I first found my niche in standard setting, which, to me, nicely blended psychometrics and test content with practical considerations. After the No Child Left Behind Act (NCLB) was passed in 2001, the field of statewide summative testing took off, and our primary focus was keeping up with demand. But, with the introduction of Race to the Top (RTT) in 2012, we became more innovative and learned a great deal from research that was immediately applied to practice.
For example, since 2012, some of the questions I have examined with colleagues and graduate students that bridge research and practice include:
- What is the best mechanism for scoring technology-enhanced items? The goal of novel item types introduced by the RTT assessment consortia was to tap constructs that could not be assessed through traditional multiple-choice questions. However, when scored simply right/wrong, these items often lost their value. Partial credit, either through a rule or a rubric, provided more information about student knowledge that contributed positively to students' ability estimates.
- To what degree are scores comparable across students who take computer-based assessments on different devices? When computer testing became widespread, districts were testing their students on whatever devices they had available: desktops, laptops, and tablets, all with varying screen sizes and resolutions, keyboard lengths and configurations, and mouse types. Research showed that, on average, there was little difference in student performance by device, with a few exceptions, assuming familiarity with the device. For instance, younger students performed better with larger screens and on items that did not require scrolling.
- What is the simplest stage-level adaptive design we can implement to give students a test that includes items they can answer easily and items that challenge them while simultaneously increasing measurement precision with fewer items? Educators and policymakers were drawn to the idea of a computer adaptive test providing a personalized experience, but states that left the consortia often could not afford an item bank of the size required for a fully adaptive test. Thus, some turned to stage adaptive testing. Although multiple stages seemed to have more face validity and “felt” more adaptive, simulation studies showed that a simple 1-2 design produced sufficient precision for most students.
- What is the tipping point when test interruptions affect student scores? With the proliferation of online testing came the onset of new testing challenges. Computers would freeze during testing, necessary graphics did not appear, or student writing was lost. This problem opened a new avenue of research examining the extent to which test interruptions impacted student performance. Ultimately, while studies showed that student scores were fairly robust and impervious to computer freezes (as long as the student saw the item as intended and the system received the response as intended), state policymakers often decided to throw out scores due to public pressure.
Throughout all of this change, a focus for test developers and psychometricians has been matching the design of the assessment with the stated use of the results. At the beginning of NCLB, the statewide assessments needed to have measurement precision around the Proficient cut score, as that was the one point used in the new accountability decisions. However, the Federal government later added flexibility to make decisions based on student growth. Changing the focus of the use of assessment results from determining proficiency to measuring year-long growth required a fast change in test design, from one that maximized precision around one point on the scale to one that had sufficient measurement power to capture growth all along the scale. As the tests got longer to accommodate this change, the demand for “useful” results with finer-grained subscores also grew.
Then, around 2015, the field hit a tipping point, and policymakers started legislating shorter tests. Now, our field is being asked to build balanced assessment systems that place useful information in the hands of teachers in a timely manner while still being aligned to state standards and comparable across schools. More recently, attention has been paid to personalized assessment, performance-based assessment, science simulations, and determining a way to include interim test results in a summative score. As past is prologue, we must learn from previous research done when many of these approaches were tried while simultaneously integrating these new tests into an online platform.
So, where are we headed? Now, more than ever, is the time to focus on strengthening the bridge between research and practice. The field is asking us to become much more innovative, allowing for personalization, authentic assessment, and the measurement of growth while staying criterion referenced. Some of the tasks we will be addressing over the next few years include:
- Developing a measurement model that accounts for through-course assessment. For years, policymakers have faced pressure from schools to just use the district interim assessment for summative purposes to decrease the amount of testing. Although using a test for a purpose other than what it is designed for is never recommended, this cry has led to efforts to split summative testing up at various points in the school year. We need new measurement models, or more evidence around current ones, to account for knowledge at a point in time, growth over time, and assumptions about prior knowledge.
- Improving automated scoring to allow for greater inclusion of performance tasks. Increasingly, there is a call to make tasks more authentic and to test students’ ability to reason, or to score their approach to a problem. These tasks can be designed by content experts, but scoring them quickly, with sufficient accuracy and reliability, while allowing for more nuanced interpretations of the score, requires additional research on our part.
- Customizing assessment for individual interests while maintaining comparability. Our field has worked hard to ensure our tests are free from bias or content that may either penalize or trigger groups of students. However, recently there have been rumblings that we have so whitewashed our tests as to make them too boring for students to engage with and demonstrate their true ability. Early research on AP tests showed that students do not choose well when given an option to pick the prompt to respond to, but more research needs to be done on allowing students to pick a topic on which to read a text or conduct an experiment. We could also integrate books read during the school year into our tests, including history and science books, even when different schools or districts choose different books. But first, NCME members need to do the research and set parameters around the test design to ensure sufficient comparability.
In 2020, with the pandemic and Black Lives Matter protests, we are entering a new era. This year feels like another tipping point for education. Building more culturally sensitive assessments and increasing flexibility in when and where testing is done, while maintaining sufficient reliability and validity, is important now and will require all of us to work together. I look forward to helping Ye Tong achieve her vision of integrating research with practice to build better assessments for our schools.