Assessment Validity to Support Research Validity

By Megan Welsh posted 06-28-2019 19:59


Molly Faulkner Bond, WestEd

As a former student of our new NCME President, Stephen G. Sireci, I was encouraged to think and care deeply about the importance of validity evidence to support test score interpretation and use. Whether using the framework of Messick (1990) or the Joint Standards (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014), Steve ensured we all understood the importance of collecting evidence to support the use and interpretation of test scores, and taught us how to do this work well (Sireci, 2013, 2016). 


In graduate school, most of my conversations about validity focused on situations wherein test scores are at the apex of decision-making. These included, for example, college admissions testing, or credentialing in professional fields like law and medicine. In these contexts, the importance of validating score uses is immediately clear; the test scores and their associated inferences have direct and consequential impacts on the lives and livelihoods of those who were assessed.


As a graduate student, I thought less about the importance of valid test use in settings where the assessment is closer to the foundation – a means to an end, rather than the end in itself. That perspective changed after a few years as a grant program officer at the National Center for Education Research (NCER), part of the Institute of Education Sciences (IES). Based on my time at NCER, I came to appreciate the important role that good measurement can play in research and development – even when the measure is not the main point of the investigation.


In my role as an NCER program officer, I provided technical assistance to prospective applicants and current grantees for research grants focused on K-12 English learners (ELs). My topic was one of about a dozen in NCER’s primary research competition, the Education Research Grants Program (CFDA 84.305A); other topics included things like Post-secondary and Adult Education (PSAE) and Effective Teachers and Effective Teaching (ETET). As program officers, my colleagues and I were responsible for, among other things, reading proposal drafts for prospective grantees to offer feedback, and checking in regularly with active grantees who were in various stages of completing multi-year research projects. However, we did not have any say in which applications were ultimately funded.

Although NCER does explicitly invite research to develop new measures or validate existing ones, relatively few of NCER’s applicants and grantees are psychometricians. Rather, most grantees are either content specialists seeking to develop new educational interventions, or evaluators seeking to evaluate the efficacy of extant interventions or policies through causal research (for example, out of 203 unique math projects funded between 2002 and 2013, only 33 produced an assessment). For either type of project, investigators must collect evidence of the intervention or policy’s efficacy or promise, which requires collecting appropriate outcome data. NCER encourages grantees to select at least one proximal outcome (i.e., one that is tightly aligned to the intervention in terms of both time and focus), and at least one distal outcome (i.e., one that is not specific to the intervention and/or may be administered after some delay once the intervention has been implemented).


As this context makes clear, the availability of valid, reliable outcome measures is critical to the success of these projects. Obviously, evidence of an intervention’s promise or efficacy is harder to find with a measure that is unreliable, poorly matched to the intervention’s focal construct, or both. Thus, not surprisingly, investigators spent a lot of time thinking about (and talking to their program officers about) which measures to use in their studies.

Unfortunately, good measures were often hard to find. In practice, most researchers would opt for a design in which they used researcher-designed proximal measures and a standardized distal measure, often in the form of a year-end large-scale assessment. Each of these had drawbacks: the former may have relatively less reliability or validity evidence to support its use, while the latter often did not directly align to the target of the intervention. In my own portfolio, I often found that neither of these options really amounted to the kind of measure the researcher might prefer in terms of the scale, scope, and alignment of the focal construct.

Based on my time at NCER, I am convinced that the measurement community could make a significant contribution to the types of work NCER typically funds. More, better measures are needed to support high-quality research, and who better to help craft those measures than a professional community of psychometricians? Here are just a few examples of specific assessment needs that I encountered in my time as a program officer: 

  1. Through-course assessments of content or language for English learners. In the EL portfolio, a common challenge was finding valid measures of students’ language proficiency, content knowledge, or both, that could be used throughout the school year. I know of only one language proficiency assessment designed for this purpose (WIDA’s MODEL), and although there are some off-the-shelf content measures, these often have not been normed on or validated for ELs. Spanish translations are starting to become more common, but these do not work for all ELs, and may be less relevant for students who have not had opportunities to develop academic knowledge and skills in their home language. Taken together, these gaps left many EL researchers struggling to collect evidence for the promise of new interventions and practices designed to develop ELs’ academic language, content knowledge, or both.


  2. Measures of basic skills for adult learners. Recent PIAAC results reveal that more than half of American adults are below proficient on measures of literacy and numeracy, and roughly one in six are considered struggling readers. As a field, we have developed scores of instruments and interventions to teach these skills and help learners who are struggling in these areas. But the vast majority of these have been developed for children, and that difference matters. In addition to the obvious question of age-appropriateness, there are also more nuanced questions about the constructs themselves. For example, can adults who are struggling readers be identified using the same indicators that work for children? Can assessments that have been developed for younger learners simply be aged up to work for adults? More research and validity evidence are needed.


  3. Measures of teaching practice and efficacy. A team of researchers based at Purdue University is systematically using Generalizability Theory to evaluate the reliability and validity of currently used teaching measures such as the Mathematical Quality of Instruction and the CLASS-3. So far, their work suggests that there is room for improvement in the realm of measuring teacher practice. Are we measuring the right things badly, or the wrong things altogether? For now, the answer is unclear, but that is also an opportunity for more work.


  4. Measures of social, emotional, and behavioral competencies. Despite being a large and well-developed field in terms of intervention research, social, emotional, and behavioral learning can be challenging to measure. As in many other corners of education, a disconnect persists between practitioners and researchers; the focal constructs in this field can also be challenging to operationalize for assessment. In my time at NCER, I saw this field really come together and grapple with how to advance measurement practices to continue their work; opportunities for high-quality assessment here abound.

Having noted these examples of research fields that would be enhanced by the development of new assessments, I will close by noting that NCER also invites the development of new measures, generally – regardless of whether they have anything to do with research. In other words, whether we are developing embedded tools to support ongoing intervention research; new, stand-alone tools to provide measures of some skills and constructs badly needed by the field; or validating extant instruments for new uses or populations, I think our community has tremendous potential to make a contribution to the education research community.

I understand that measures like this may be less satisfying to build in some ways, as they may lack the economic and professional prestige associated with direct-to-consumer products like the SAT or the ITBS. There is, however, a healthy funding landscape to support this kind of work (in the form of the NCER competition, among other things), and a great need for good measurement work therein. In light of opportunities like this, I am hopeful that the measurement community may start to connect more intentionally with the NCER research community. And on this front, I think our new president, Dr. Sireci, is an ideal person to help build this bridge – in addition to validity, one of Steve’s other great strengths is building community and collaboration.