A Call for Additional Research to Support Online Assessments

By Megan Welsh posted 07-31-2020 11:10

Jennifer Beimers, Pearson

In my more than a decade of working with Dr. Tong in operational psychometrics, one message she has consistently emphasized is the importance of research to support operational work. Whether on a smaller scale, such as sharing research during a team meeting, or more broadly, by submitting journal articles or proposals for conferences like NCME, she continuously encourages publishing and presenting our work. One aspect of assessment that continues to drive the need to bridge research and practice is online testing. That is not to imply that the existing research is lacking; rather, it needs to be expanded upon, particularly given how quickly technology and assessments continue to evolve. Furthermore, the latest version of the Standards for Educational and Psychological Testing was published in 2014, prior to the more recent advancements in online testing.

In general, assessments have transitioned from paper-and-pencil to computer-administered to innovative online assessments in a relatively short timeframe. For decades, assessments were paper-based, tending to consist of a large proportion of multiple-choice items and perhaps a few constructed-response items. Then, with the increased use of technology in education, assessments started to move online. The assessments didn't necessarily look all that different (i.e., same item types), but there were benefits to be gained from an online administration. For example, multiple-choice items could be scored instantaneously, automated scoring could be used for constructed-response items, and computer-adaptive testing could be supported. In more recent years, assessments have evolved to utilize online testing to an even greater degree, including the use of innovative item types and the incorporation of accessibility features (e.g., text-to-speech and speech-to-text). Benefits of online testing span the entire assessment development process, from item creation to post-administration scoring and reporting, and new solutions and innovations are constantly being developed as technology continues to evolve. Research needs to keep pace with these advancements to help guide operational psychometrics. In the following paragraphs I expand on a few specific areas: technology-enhanced items, timing data, and mode comparability.

Technology can allow for deeper and more authentic measurement of intended constructs through innovative item types and dynamic stimuli. In terms of item types, a student may be asked to drag and drop answers into the correct location or click on a particular area of a graphic. For dynamic stimuli, a student may be asked to conduct an experiment by working their way through the various steps. Whereas a multiple-choice item consists of a stem and a fixed set of distractors, the configurations for technology-enhanced items are essentially unlimited. For example, there is no standard number of draggers, drop zones, or hot spots. And with a wide range of item configurations and potential naming conventions, the resulting data for technology-enhanced items are significantly more complex than a string of As, Bs, Cs, and Ds. Research is needed to refine guidance around item development, the analysis of resulting data, and the feedback loop between item developers, content experts, educators, and psychometricians. How can items be constructed such that each score point is discretely defined? How should responses be analyzed and made meaningful to educators and content experts who may be reviewing the items? What statistical analyses are appropriate, particularly for items that have multiple parts or dependent scoring? How can accurate scoring be verified? How can items that effectively utilize technology to assess students be distinguished from items that could have just as well been developed as traditional item types? These are just a few of the questions that arise with the implementation of technology-enhanced items.
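As a toy illustration of why technology-enhanced item data are more structured than a string of letters, consider a hypothetical drag-and-drop item scored with partial credit. The item, zone names, labels, and rubric below are all invented for illustration; operational items and scoring rules would be far more varied:

```python
# Hypothetical drag-and-drop item: students drag labels into drop zones.
# A response is a mapping from drop-zone ID to the placed label, i.e.,
# structured data rather than a single selected option.
ANSWER_KEY = {"zone_1": "mitochondrion", "zone_2": "nucleus", "zone_3": "ribosome"}

def score_drag_and_drop(response, key=ANSWER_KEY):
    """Award one point per correctly placed label (illustrative partial-credit rule)."""
    return sum(1 for zone, label in response.items() if key.get(zone) == label)

full = {"zone_1": "mitochondrion", "zone_2": "nucleus", "zone_3": "ribosome"}
swapped = {"zone_1": "nucleus", "zone_2": "mitochondrion", "zone_3": "ribosome"}
print(score_drag_and_drop(full))     # 3
print(score_drag_and_drop(swapped))  # 1
```

Even this simple sketch surfaces the questions raised above: the key, the credit rule, and the response format must all be specified and verified item by item, rather than inherited from a single multiple-choice convention.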

Another potential benefit of online testing is the ability to capture the amount of time a student spends on each item or cumulatively on an assessment. However, there are a variety of ways in which timing data could be analyzed and used to help inform the assessment development process. Should there be guidelines around how long students should spend on an item for every point the item is worth? Should timing data be a piece of information used during data review? What timing information should be included in technical documentation for an assessment? Can timing data be used to effectively detect a security breach? Further exploration is needed around the effective use of timing data.
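One simple use of item-level timing data is flagging implausibly rapid responses, which may signal disengagement or, in aggregate, a security concern. The sketch below is a minimal illustration only; the threshold, item IDs, and times are fabricated, and a fixed cutoff is just one of many possible flagging rules:

```python
# Hypothetical item-level response times (in seconds) for one student.
# Responses faster than some per-item floor may indicate rapid guessing.
RAPID_THRESHOLD = 5.0  # illustrative cutoff, not an established guideline

def flag_rapid_items(times_by_item, threshold=RAPID_THRESHOLD):
    """Return the IDs of items answered faster than the threshold."""
    return [item for item, secs in times_by_item.items() if secs < threshold]

times = {"item_01": 42.7, "item_02": 3.1, "item_03": 18.9, "item_04": 1.4}
print(flag_rapid_items(times))  # ['item_02', 'item_04']
```

In practice, the open questions above apply directly: whether the cutoff should scale with an item's point value, and whether such flags belong in data review or technical documentation, are research questions rather than settled practice.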

Mode comparability is an additional area of research that comes with online testing. Even for assessments that are primarily online, there is often a need for a paper form, whether because students require a paper form for accommodation purposes or because they lack access to online devices or reliable internet. Regardless of the reason, attention should be given to how to handle an assessment administered via two different modes. Should score adjustments be made to account for mode differences? If so, what methods are most appropriate for calculating the adjustment? Are mode differences stable across devices and demographic subgroups? Do mode differences tend to be consistent for particular grades or cohorts of students? Are mode effects dependent on the type or amount of technology used in instruction? How does having two modes impact the field test plan? Complicating any analysis is the possibility that the students who take the paper form may not be representative of the students who take the online form in terms of ability and demographics.
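As a minimal sketch of one descriptive comparability check, a standardized mean difference between mode groups can be computed. The scores below are fabricated for illustration, and, as noted above, a naive comparison like this is confounded whenever the paper-form group differs from the online group in ability or demographics; it is a starting point, not a mode-effect estimate:

```python
import statistics

def standardized_mode_difference(online_scores, paper_scores):
    """Cohen's-d-style effect size: (mean online - mean paper) / pooled SD."""
    m_on, m_pa = statistics.mean(online_scores), statistics.mean(paper_scores)
    v_on, v_pa = statistics.variance(online_scores), statistics.variance(paper_scores)
    n_on, n_pa = len(online_scores), len(paper_scores)
    pooled_sd = (((n_on - 1) * v_on + (n_pa - 1) * v_pa) / (n_on + n_pa - 2)) ** 0.5
    return (m_on - m_pa) / pooled_sd

# Fabricated scale scores for two small mode groups.
online = [510, 495, 530, 488, 502, 517]
paper = [498, 480, 505, 492, 470, 489]
print(round(standardized_mode_difference(online, paper), 2))  # 1.29
```

Methods that attempt to account for group non-equivalence (e.g., matching or covariate adjustment) are exactly where the research questions above come into play.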

This article has touched on just a few of the many areas of research needed to support online assessments. Online testing provides numerous opportunities for advancements across the assessment development cycle, including innovative item types, accessibility functionality, automated scoring, computer-adaptive testing, and rapid reporting; the list could go on. But with ever-changing technology and continued advancements in assessments comes the need for a robust body of research to support appropriate uses and gain a deeper understanding of best practices. As Dr. Tong leads NCME this year, I am looking forward to seeing progress made in bridging research and practice and to our field continuing to advance in ways that ultimately support students and learning.