Other Events

NCME 2020 Virtual Sessions


JUNE 2020

As you are aware, NCME has determined that an in-person conference is not feasible for this coming September. While the loss of the conference certainly stings, we also realize how fortunate we are to be part of NCME each year and that the world is dealing with much more important matters right now. We are doing our best to keep everyone engaged and learning so that we can all keep measurement meaningful in these challenging times. Rather than cramming an extensive number of sessions into a 3- or 4-day period, NCME has determined that, in addition to the virtual conference in September, we will also hold sessions throughout the summer. These sessions will be versions of coordinated sessions that had been accepted for the 2020 conference.
At this time, we have created a schedule of sessions that we will hold in the month of June and are beginning to work on our schedule for the remaining sessions. For now, please put a hold on your calendar for the days and times below. As we move forward, further information on how you can participate in the sessions will follow, as will additional information for the sessions later in the summer. Thank you to the entire NCME community for your support and patience during these unprecedented times. We look forward to seeing the faces of many old friends and colleagues in the upcoming virtual sessions.
Psychometrics is Dead – Long Live Psychometrics: Measurement Still Matters
Watch Here:


Tuesday June 16th 12:00 – 1:30 ET
Organized by André De Champlain, Medical Council of Canada

  » Alina von Davier, ACTNext
  » Richard Luecht, University of North Carolina, Greensboro
  » Hollis Lai, University of Alberta
  » Han van der Maas, University of Amsterdam

Discussant: Andrew Maul, University of California, Santa Barbara

Measurement science, like many other disciplines, has been, and continues to be, severely impacted by the ushering in of the information age. Traditional stalwarts, such as episodically administered fixed-form MCQ exams, are being challenged in an era when crystallized intelligence-based skills may no longer completely mirror the breadth of abilities expected in order to meet the challenges and opportunities that define current times. Skills such as critical thinking, problem solving, creativity, adaptability, and information literacy are being heralded as critical areas for development and, by extension, assessment. The aim of this coordinated session is to outline four areas of research that highlight how measurement science is evolving to continue to be relevant in this fast-changing landscape.
Psychometricians Without Borders: Expanding the Reach of Excellence in Measurement
Watch Here:
Tuesday June 23rd 12:00 – 1:30 ET

Organized by Michelle Boyer, Center for Assessment and NCME Mission Fund Chair

  » Brian C. Leventhal, James Madison University
  » Ren Liu, University of California, Merced
  » Maria Vasquez-Colina, Florida Atlantic University
  » Darius Taylor, University of Massachusetts, Amherst

Discussant: Brian Gong, Center for Assessment

The NCME Mission Fund was established to provide a means for donors to express their tangible support for NCME’s mission to advance the science and practice of measurement in education, and to provide individuals and organizations with financial support for projects, research, and travel that address this mission directly. Your generous donations provided funding for the Mission Fund’s first round of special initiatives designed to promote a broader understanding of high quality assessment practices and appropriate test use among diverse groups of assessment stakeholders. These initiatives make important contributions to our field, expanding boundaries in how we 1) recruit students to graduate programs in measurement, 2) communicate effectively with the public about high quality measurement, and 3) think about and incorporate principles of diversity and inclusion in our approaches to testing. The results of these initiatives will be shared and discussed.
Indicators of Educational Equity: Tracking Disparity, Advancing Equity
Watch Here:

Tuesday June 30th 3:00 – 4:30 ET
Organized by Judith Koenig, National Academies of Sciences, Engineering, and Medicine

  » Christopher J. Edley, University of California Berkeley
  » Laura Hamilton, RAND
  » Sean F. Reardon, Stanford University
  » Natalie Nielsen, National Academies of Sciences, Engineering, and Medicine

Discussant: Andrew Ho, Harvard Graduate School of Education

In 2017, the National Academies of Sciences, Engineering, and Medicine appointed a committee to identify key indicators for measuring the extent of equity in the nation’s K-12 education system. The committee proposed 16 indicators classified into two categories: measures of disparities in students’ academic achievement, engagement, and educational attainment; and measures of equitable access to critical educational resources and opportunities. The primary objectives for this session are to disseminate the findings from this study and stimulate interest in doing the work needed to develop and implement a system of equity indicators.


JULY 2020

Recent Advances in Research on Response Times

Watch Here:

Wednesday July 15, 11:00 – 12:30 ET

Organized by Sandip Sinharay, Educational Testing Service

Presentation #1: Detecting Item Pre-Knowledge using Response Accuracy and Response Times
Sandip Sinharay, Educational Testing Service & Matthew Johnson, Educational Testing Service

Presentation #2: A New Multi-Method Approach of using Response Time to Detect Low Motivation
Ying Cheng, University of Notre Dame

Presentation #3: A Response Time Process Model for Not-Reached and Omitted Items in Standardized Testing
Jing Lu, Northeast Normal University & Chun Wang, University of Washington

Presentation #4: Semi-Parametric Factor Analysis for Response Times
Yang Liu, University of Maryland & Marian Strazzeri, University of Maryland

Discussant: Wim van der Linden, University of Twente

Given the increasing popularity of computerized testing, the question of how to utilize response times has become urgent. This session provides a glimpse of the recent research using response times and suggests several new approaches involving response times. The approaches are demonstrated using data sets from high-stakes tests. The first two presentations focus on the use of response times to detect test fraud, detect low motivation, and improve test-form assembly. The last two presentations suggest new models involving response times. The session includes discussion from an expert in analysis of response times.

The Struggle is Real: Tackling Alignment Challenges in a Changing Assessment World
Watch Here:

Monday, July 20th 1:00 - 2:30 PM ET

Organizer: Susan Davis-Becker, ACS Ventures
Chair: Thanos Patelis

  » Wayne J. Camara, ACT
  » Ellen Forte, edCount, LLC
  » Scott Marion, National Center for the Improvement of Educational Assessment

Alignment has served as a principal criterion in the validity evaluation of standards-based assessments. However, changes in the educational landscape including innovative approaches to the design of an educational system, more complex systems of standards, different types of measurement strategies, and alternative approaches to assessment design have strained the credibility of traditional alignment approaches. Unlike other areas of assessment design and evaluation, alignment methodology has had relatively less attention and scrutiny in professional settings and scholarly publications. In this session, panelists who work with educational assessment systems designing, conducting, and evaluating alignment studies will present and discuss such challenges, including:

  1. Defining the purpose of the alignment evaluation,
  2. Identifying the components of the system to be aligned (e.g., claims, standards, PLDs/ALDs, test form(s), score scales),
  3. Determining the scope of the alignment study,
  4. Determining the classes of evidence used for evaluating alignment (e.g., content, knowledge/skills, cognitive complexity, judgmental consistency), and
  5. Interpreting the alignment results (e.g., weight attributed to alignment vs. other types of evidence, and whether the evaluation/interpretation should vary based on the claims).

Learning Metrics: Using Log Data to Evaluate EdTech Products
Watch Here:

Tuesday July 28th 1:00 - 2:30 PM ET

Organizers: Kelli Hill, Khan Academy & Rajendra Chattergoon, Khan Academy

Presentation #1: Learning Effectiveness and User Behavior: The Case of Duolingo
Xiangying Jiang, Duolingo & Joseph Rollinson, Duolingo

Presentation #2: Measuring Classroom Learning in the Context of Research-Practice Partnerships
Rajendra Chattergoon, Khan Academy, Kodi Weatherholtz, Khan Academy, & Kelli M. Hill, Khan Academy

Presentation #3: Measuring Learning Behaviors Using Data from a Common Learning Management System
Andrew E. Krumm, Digital Promise

In the past decade, technology-based assessment in education has evolved considerably. Major research syntheses of how people learn (NRC, 2000, 2018) recognize that digital technologies have the potential to support learners in meeting a wide range of goals in different contexts. Yet students who use these products generate an immense amount of data that can be difficult to interpret and use. Sophisticated measurement models are being developed to precisely describe what students know and can do in a digital environment. However, a more pressing need is to use log data to evaluate whether students are learning.

This session consists of three perspectives on how EdTech companies use metrics related to learning to evaluate the effectiveness of their products. The first presentation describes how Duolingo uses log data to evaluate their language teaching approach and identify user behaviors that are associated with better learning outcomes. The second presentation discusses how Khan Academy combines log data with standardized test scores through district partnerships to evaluate its product. The third presentation describes how Digital Promise partners with teachers and school leaders to develop measures of learning from event data. These presentations will be followed by comments from an expert in learning technologies.


AUGUST 2020

Teaching and Learning “Educational Measurement”: Defining the Discipline? 
Watch Here:

Tuesday, August 4, 2020 1:00 – 2:30 ET
Organizer: Derek Briggs, University of Colorado-Boulder 
Chair: Suzanne Lane, University of Pittsburgh 

  » Dan Bolt, UW-Madison School of Education
  » Derek Briggs, University of Colorado Boulder
  » Andrew Ho, Harvard Graduate School of Education
  » Won-Chan Lee, University of Iowa
  » Jennifer Randall, University of Massachusetts Amherst

What is educational measurement? Is it a subfield of psychometrics or a unique discipline? What is needed to be an expert in educational measurement? The purpose of this interactive panel session is to build upon a recent curriculum review of 135 graduate programs in educational measurement. The fundamental question posed in that review was “What do our curricula and training programs suggest it means to be an educational measurement expert?” In this session, we plan to ask a different question: what should it mean to be an educational measurement expert in the future? To discuss and debate this and other questions, we have assembled a panel of five mid-career professors who are in leadership positions at five universities with prominent graduate programs known for training students who go on to join the NCME community. Our hope is that this could become the start of a larger initiative and conversation among NCME members and the broader field.
Using Process Data for Advancing the Practice and Science of Educational Measurement

Watch Here:

Tuesday, August 11, 2020 12:00 - 1:30 PM ET

Organizer: Kadriye Ercikan, Educational Testing Service 
Chair: Joan Herman, CRESST/UCLA 


Implications of Considering Response Process Data for Psychometrics
Robert J. Mislevy, ETS & Roy Levy, Arizona State University

Use of Response Process Data to Inform Group Comparisons and Fairness Research  
Kadriye Ercikan, ETS/UBC; Hongwen Guo, Educational Testing Service; Qiwei He, Educational Testing Service 

How do Proficient and Less-Proficient Students Differ in their Composition Processes?  
Randy E Bennett, ETS; Mo Zhang, ETS; Paul Deane, ETS; Peter van Rijn, ETS 

James Pellegrino, University of Illinois, Chicago 
Joan Herman, CRESST/UCLA 

The move from paper-based to digitally based assessments is creating new data sources that allow us to think differently about the foundational aspects of measurement, including sources of evidence for reliability, validity, fairness, and generalizability. Process data, also referred to as “observable data” or “action data,” reflect individuals’ behaviors while completing an assessment task. They are logs of individuals’ actions, such as keystrokes, time spent on tasks, and eye movements. These data reflect student engagement with assessment and can provide important insights about students’ response processes that may not be captured in their final solutions to the assessment tasks. The four presentations focus on use of process data (1) in measurement modeling and psychometric models; (2) for enhancing group comparisons and fairness research; (3) to examine differences in the processes proficient and less proficient students use to write essays; and (4) to assess computational thinking captured during gameplay.

What is the Value Proposition for Principled Assessment Design?
Watch Here:

Monday, August 17, 2020 11:00 AM - 12:30 PM ET
Organizer: Paul Nichols, NWEA 


Business Model for Principled Assessment Design  
Paul Nichols, NWEA 

The Only Job to be Done is Helping Teachers Teach and Students Learn  
Kristen Huff, Curriculum Associates 

No Need to Tear Down When You Can Build Up  
Steve Ferrara, Measured Progress 

PAD or not to Pad: Let the Market Decide  
Catherine Needham, NWEA; Christina Schneider, NWEA 

Jeremy Heneger, Nebraska Department of Education 
Rhonda True, Nebraska Department of Education 

Principled assessment design (PAD) approaches, such as evidence-centered design, have been available to the field of educational assessment for almost two decades. The use of PAD appears to offer many benefits, including improved validity evidence, more efficient assessment development, and support for innovative assessment approaches. Yet PAD does not dominate operational assessment design and development in educational assessment. Pieces are implemented here and there, but no operational program has implemented all the elements of PAD. In this session, the adoption of PAD is hypothesized to be driven by the value created for customers. The lack of PAD adoption suggests that customers perceive little value in using PAD. The presenters will explore the value proposition for PAD, where a value proposition is the value or need that PAD is fulfilling for customers. The first presenter will describe the components of a PAD business model, including the users and buyers for PAD, the problems that PAD might help them address, and the channels for communicating with these customers. The following three presenters will propose who they view as the customer, describe the key drivers of value, identify blockers to use, and suggest a means to improve PAD adoption.

Alignment Frameworks for Complex Assessments: Score Interpretations Matter 
Watch Here:

Monday, August 17, 2020 1:00 – 2:30 ET
Organizer: M. Christina Schneider, NWEA 


Evaluating Alignment Between Complex Expectations and Assessments Meant to Measure Them
Ellen Forte, edCount, LLC

Examining Alignment of Test Score Interpretations on a Computer Adaptive Assessment
M. Christina Schneider, NWEA; Mary Veazey, NWEA

Examining Alignment of Test Score Interpretations Using Multiple Alignment Frameworks and Multiple Measures
Karla Egan, EdMetric

Embedded Standard Setting: Standard Setting as a Resolution of the Alignment Hypothesis
Daniel Lewis, Creative Measurement Solutions; Robert Cook, ACT

Paul Nichols, NWEA

The design and validation of an assessment system, intended for both formative and summative purposes, requires careful development processes especially when such assessments are intended to support interpretations regarding how student learning grows more sophisticated over time. Under a principled approach to test design, the intended test score interpretation is defined, the evidence needed to draw a conclusion about where a student is in their learning based on that interpretation is defined, and items are developed according to those evidence pieces.  The assessment of complex constructs such as student learning of NGSS and college and career standards may mean that traditional alignment and validity evidence is no longer optimal evidence that a test is aligned to state standards and its purpose. This session will focus on emerging frameworks for alignment and validity evidence explicitly designed to ensure that the assessment development process and evidence collection is cohesively centered in score interpretation. Experts in achievement level descriptors, alignment, principled assessment design, and standard setting will share emerging methodologies that fuse separate and previously distinct activities of test development, so these activities are embedded together into a cohesive whole in which score interpretations centered in student learning are the central focus.

College Admissions: Lessons Learned from Across the Globe
Watch Here:

Tuesday, August 18, 2020 12:00 - 1:30 PM ET

Organizer:  Maria Elena Olivera


An Overview of Higher Education Admissions Processes 
Rochelle Michel, ERB LEARN 

Access, Equity & Admissions Processes in South African Higher Education  
Naziema Jappie, University of Cape Town 

Perspectives on Admissions Practices: The Case of Chilean Universities  
Monica Silva, Pontificia Universidad Católica de Chile 

Character-Based Admissions Criteria: Validity and Diversity  
Rob Meijer, University of Groningen 

In this session, an international group of experts on higher education admissions practices share their insight on opportunities and challenges related to the processes and criteria used in postsecondary admissions decision-making to promote access, equity, and fairness for candidates from diverse backgrounds. The session brings outside voices into the educational measurement community to engage in meaningful discussions about current and future-looking uses of assessments utilized to inform admissions practices, and discusses opportunities and challenges to developing culturally-responsive assessments that are sensitive to the ways of knowing and learning of diverse populations. The presenters discuss challenges in improving diversity, access, and equity in admissions processes used across the globe. The session uses a panel format to bring together voices from educational and professional communities and invites them to discuss various perspectives regarding access issues and challenges to diversifying the admitted student pool.  The panel members will discuss their perceptions of what are the most critical measurement-based issues facing higher education admissions in their own country, why it is important to consider that perspective as part of fairness and access to admissions decision-making practices, and possible strategies to address the issue based on the lessons learned from their own country-level perspective.

How to Achieve (or Partially Achieve) Comparability of Scores from Large-Scale Assessments 
Watch Here:

Wednesday, August 19, 2020 12:00 - 1:30 PM ET

Organizer: Amy Berman, National Academy of Education 
Chairs: Edward Haertel, Stanford University & James Pellegrino, University of Illinois, Chicago 


Comparability of Individual Students’ Scores on the “Same Test”
Charles DePascale, Center for Assessment; Brian Gong, Center for Assessment

Comparability of Aggregated Group Scores on the “Same Test”
Scott Marion, Center for Assessment; Leslie Keng, Center for Assessment 

Comparability Within a Single Assessment System
Mark Wilson, University of California, Berkeley; Richard Wolfe, Ontario Institute for Studies in Education of the University of Toronto 

Comparability Across Different Assessment Systems
Marianne Perie, Measurement in Practice, LLC 

Comparability When Assessing English Learner Students
Molly Faulkner-Bond, WestEd, and James Soland, University of Virginia/Northwest Evaluation Association (NWEA)

Comparability When Assessing Individuals with Disabilities
Stephen Sireci and Maura O’Riordan, University of Massachusetts, Amherst

Comparability in Multilingual and Multicultural Assessment Contexts
Kadriye Ercikan, Educational Testing Service/University of British Columbia, and Han-Hui Por, Educational Testing Service

Interpreting Test Score Comparisons  
Randy E Bennett, Educational Testing Service

How much and what types of flexibility in assessment content and procedures can be allowed, while still maintaining comparability of scores obtained from large-scale assessments that operate across jurisdictions and student populations? This is the question the National Academy of Education (NAEd) set out to answer in its Study on Comparability of Scores from Large-Scale Assessments. This session presents the major findings from eight papers that explore a host of comparability issues, ranging from (a) the comparability of individual students’ scores or aggregated scores, to (b) scores obtained within single or multiple assessment systems, to (c) specific issues about scores obtained for English language learners and students with disabilities. In each interpretive context, the authors discuss comparability issues as well as possible approaches to addressing the information needs and policy concerns of various stakeholders including state-level educational assessment and accountability decisionmakers/leaders/coordinators, consortia members, technical advisors, vendors, and the educational measurement community.

Development and Empirical Recovery of a Learning Progression that Incorporates Student Voice 
Thursday, August 20, 2020 12:00 - 1:30 PM ET

Organizer: Edith Aurora Graf, Educational Testing Service 


Steps in the Design and Validation of the Assessment: An Overview  
Edith Aurora Graf, Educational Testing Service; Maisha Moses, Young People's Project; Cheryl Eames, Southern Illinois University Edwardsville; Peter van Rijn, ETS 

Eliciting Student Feedback on the Assessment Through Focus Groups  
Maisha Moses, Young People's Project 

Response Analysis using the Finite-to-Finite Strand of the Learning Progression for the Function Concept  
Cheryl Eames, Southern Illinois University Edwardsville 

Psychometric results for two strands of a learning progression for the concept of function  
Peter van Rijn, ETS; Edith Aurora Graf, Educational Testing Service 

Frank E. Davis, Frank E. Davis Consulting 

In keeping with the conference theme, Making Measurement Matter, we will discuss research work on building and validating a learning progression-based assessment for the concept of function, a keystone in students’ understanding of higher mathematics. This effort also speaks to the goals of “bringing outside voices into the educational measurement community,” and fairness as equal priorities. The research includes students and schools served by research collaborators seeking to improve mathematics education for students characterized as underserved in mathematics. Recently, we conducted a computer-delivered pilot of tasks in which data from 1102 students were collected. The first two speakers will focus on the theory and design behind the assessment. The first will discuss the overall design of the project, and summarize work conducted to date. The second will discuss how student feedback on the tasks was elicited during the focus groups, with the intent of making revisions that would enhance meaningfulness and clarity. The last two speakers will discuss outcomes from the pilot, discussing student responses and what they suggest about the validity of the LP, and psychometric results concerning the empirical recovery of the levels of the LP, an essential step in its validation. 

Advancing Multidimensional Science Assessment Design for Large-scale and Classroom Use 
Watch Here:

Friday, August 21, 2020 1:00 - 2:30 PM ET

Organizer: Erin Buchanan, edCount, LLC 


Ensuring Rigor and Strengthening Score Meaning in State and Local Assessment Systems  
Ellen Forte, edCount, LLC 

A Principled-Design Approach for Creating Multi-Dimensional Large-Scale Science Assessments  
Daisy Rutstein, SRI International 

A Principled-Design Approach for Creating Multi-Dimensional Classroom Science Assessments  
Charlene Turner, edCount, LLC 

State Implementation of SCILLSS Resources: A User's Perspective  
Rhonda True, Nebraska Department of Education 

Elizabeth Summers, edCount, LLC 

Presenters will share the goals, progress, and national significance of the Strengthening Claims-based Interpretations and Uses of Large-scale Science Assessment Scores (SCILLSS) project funded through the US Department of Education’s Enhanced Assessment Grants (EAG) program. SCILLSS brings together a consortium of three states, four organizations, and a panel of experts to strengthen the knowledge base among state and local educators for using principled-design approaches to design quality science assessments that generate meaningful and useful scores, and to establish a means for connecting statewide assessment results with classroom assessments and student work samples in a complementary system. Presenters will share how SCILLSS partners are applying current research, theory, and best practice to establish replicable and scalable principled-design tools that state and local educators can use to clarify and strengthen the connection between statewide assessments, local assessments, and classroom instruction, enabling all stakeholders to derive maximum meaning and utility from assessment scores.
The Changing Landscape of Statewide Assessment: Shifts towards Systems of Assessments 
Watch Here:

Monday, August 24, 2020 12:00 - 1:30 PM ET

Organizer: Nathan Dadey, Center for Assessment


On the Shift Towards Balanced Assessment Systems: Past, Present and Future
Brian Gong, Center for Assessment

Developing a Validity Research Agenda for Louisiana’s Innovative Assessment Demonstration Authority Pilot
Nathan Dadey, Center for Assessment; Michelle Boyer, Center for Assessment

On the Opportunities Provided by Through-Year Assessment Models, Including a Solution Configured for Districts in Georgia
Abby Javurek, NWEA; Paul Nichols, NWEA

Carla Evans, Center for Assessment 

The landscape of statewide, large-scale educational assessment is shifting away from “stand-alone” summative assessments and towards integrated sets of assessments designed to support various interpretations and uses. For example, several states have provided interim assessments as a part of their statewide assessment program, either individually or as members of a consortium. This coordinated session will explore how the theory of systems of assessments is being applied in multiple contexts and provide insight into challenges and opportunities inherent in developing and implementing integrated sets of assessments in real world settings.  An overview will be provided on developments in theory and practice of balanced systems of assessments (e.g., Pellegrino, Chudowsky & Glaser, 2001), emphasizing implications for current practice.  Other presentations focus on ongoing initiatives taking place under the Innovative Assessment Demonstration Authority waivers granted to Louisiana and Georgia.  These states aim to replace single statewide summative assessments with multiple assessments that work together to produce a single summative score. This type of assessment model has been referred to as through-course (e.g., Wise, 2011) and might be seen as interim (Dadey & Gong, 2017), but at its core the model is organized around the same principles as balanced systems of assessments. 

Predictive Standard Setting: Improving the Method, Debating the Madness 
Watch Here:

Tuesday, August 25, 2020 4:00 - 5:30 PM ET

Organizer: Andrew Ho, Harvard University 
Chair: Walter (Denny) Way, College Board 

  » Jennifer Beimers, Pearson 
  » Wayne J. Camara, Law School Admission Council 
  » Laurie Laughlin Davis, Curriculum Associates 
  » Laura Hamilton, RAND 
  » Deanna Morgan, College Board 
  » Yi Xe Thng, Singapore Ministry of Education 

Test scores measure, and test scores predict. Predictions can anchor statements about current performance in terms of future outcomes, including test scores, grades, and graduation, through a process called "predictive standard setting." Presenters in this symposium will debate how and whether predictions should inform standard setting, whether standards should make predictions, and how predictions should count as validity evidence. Contexts include the SAT and ACT college readiness benchmarks, state accountability tests in grades 3-8, interim assessments, and NAEP. These issues are salient as educational policies, policymakers, and practitioners value these predictions in "career and college readiness" frameworks. Presenters will discuss and debate advances in three stages of predictive standard setting: 1) generating accurate predictive statements using statistical methods; 2) managing predictive data in the standard setting process; and 3) communicating results using benchmark and achievement-level descriptors. Some presenters believe strongly that predictive statements build valid consensus among standard setting panelists and help users understand the meaning and relevance of scores. Other presenters believe strongly that predictive statements build false consensus and subjugate the subject-matter relevance of scores in favor of ambiguous future outcomes. Presenters will give short presentations and then engage in moderated discussion with each other and the audience.

Social Interaction Time
Tuesday, August 25, 2020 5:30 - 6:00 PM ET

Organizer: Andrew Ho, Harvard University

Computational Psychometrics as a Validity Framework for Process Data
Watch Here:

Wednesday, August 26, 2020 1:00 - 2:30 PM ET

Organizer: Alina von Davier, Duolingo
Chair: Ada Woo, TreeCrest Assessment Consulting

  » Yuchi Huang, ACT
  » Alina von Davier & Burr Settles, Duolingo
  » John Whitmer, Chi2 Labs

Bruno D. Zumbo, University of British Columbia

In 2015, von Davier coined the term “computational psychometrics” (CP) to describe the fusion of psychometric theories and data-driven algorithms for improving the inferences made from technology-supported learning and assessment systems (LAS). Meanwhile, “computational” [insert discipline] has become a common occurrence. In CP, the process data collected from virtual environments should be intentional: we should design and provide ample opportunities for people to display the skills we want to measure. CP uses expert-developed theory as a map for measurement efforts that use process data. CP is also interested in knowledge discovery from process data, whether little or big. In this symposium, several examples of applications of computational models to process data from learning systems and from the assessment of 21st-century skills are presented. Psychometric theories and data-driven algorithms are fused to make accurate and valid inferences in complex, virtual learning and assessment environments.

CATs, BATs, and RATs—The Value of CAT for Educational Assessment

Watch Here:

Thursday, August 27, 2020 12:00 - 1:30 PM ET

Organizer: Laurie Laughlin Davis, Curriculum Associates
Chair: Michael Edwards, Arizona State University

  » Michelle Barrett, Edmentum
  » Richard Luecht, University of North Carolina, Greensboro
  » Laurie Laughlin Davis, Curriculum Associates
  » Michael Edwards, Arizona State University

David Thissen, University of North Carolina, Chapel Hill

Computerized Adaptive Testing (CAT) turns 50 years old in 2020, which may be a shock to many in educational assessment who are still struggling to implement CAT in a way that fully realizes its promised advantages in testing efficiency. Licensure and certification assessments have been leveraging CAT successfully for years. While there have been several recent examples of CAT implementations in K-12 summative assessment (such as the Smarter-Balanced Assessment Consortium and Virginia’s Standards of Learning assessment), CAT has been relatively slow to catch on in K-12 educational assessment. This is due, in part, to technology limitations and differences between delivering tests to test centers and delivering tests to students in classrooms. However, technology is not the only consideration influencing the effective use of CAT in K-12 assessment. Frequently, constraints are placed on K-12 assessment programs in terms of educational policies, content standards coverage, and comparability that limit the degree to which CAT can deliver assessment efficiently and effectively. This results in assessment programs that are sometimes referred to as “BAT”s (Barely Adaptive Tests) and “RAT”s (Rarely Adaptive Tests). This panel will discuss the challenges associated with CAT in K-12 assessment and forecast its future utility.

Modeling Measurement Invariance and Response Biases in International Large-Scale Assessments

Watch Here:

Monday, August 31, 2020 10:00 - 11:30 AM ET

Organizers: Lale Khorramdel, National Board of Medical Examiners,
Artur Pokropek, Educational Research Institute (IBE), Warsaw, Poland, &
Janine Buchholz, Leibniz Institute for Research and Information in Education (DIPF)
Chair: Lale Khorramdel, National Board of Medical Examiners


A comparison of Multigroup-CFA and IRT-based item fit for measurement invariance testing
Janine Buchholz, Leibniz Institute for Research and Information in Education (DIPF); Johannes Hartig, DIPF | Leibniz Institute for Research and Information in Education, Frankfurt, Germany

Comparing three-level GLMMs and multiple-group IRT models to detect group DIF
Carmen Köhler, Leibniz Institute for Research and Information in Education (DIPF); Lale Khorramdel, National Board of Medical Examiners; Johannes Hartig, DIPF | Leibniz Institute for Research and Information in Education, Frankfurt, Germany

Comparability and Dimensionality of Response Time in PISA
Emily Kerzabi, Technical University of Munich; Hyo Jeong Shin, ETS; Seang-Hwane Joo, Educational Testing Service; Frederic Robin, Educational Testing Service; Kentaro Yamamoto, Educational Testing Service

Validation of Extreme Response Style versus Rapid Guessing in Large-Scale Surveys
Ulf Kroehne, DIPF | Leibniz Institute for Research and Information in Education, Germany; Lale Khorramdel, National Board of Medical Examiners; Frank Goldhammer, DIPF | Leibniz Institute for Research and Information in Education, Centre for International Student Assessment (ZIB), Germany; Matthias von Davier, National Board of Medical Examiners

Examining the Relation between Measurement Invariance and Response Styles in Cross-Country Surveys
Artur Pokropek, Educational Research Institute (IBE), Warsaw, Poland; Lale Khorramdel, National Board of Medical Examiners 

Leslie Rutkowski, Indiana University, Bloomington

The main goal of international large-scale assessments (ILSAs) – such as PISA, PIAAC, TIMSS, PIRLS – is to provide unbiased and comparable test scores and data which enable valid and meaningful inferences about a variety of educational systems and societies. In contrast to national surveys, ILSAs can provide a frame of reference to extend our understanding of national educational systems and cross-country variability. To enable fair group comparisons (within and across countries) and valid interpretations of statistical results in low-stakes assessments such as ILSAs, two validity aspects need to be accounted for. First, the data need to be tested and corrected for response biases such as response styles (RS) in non-cognitive scales. Second, the comparability of the data and test scores across different countries and languages needs to be established. This is achieved by testing and modelling measurement invariance (MI) assumptions. The proposed coordinated session provides an overview of state of the art and new psychometric approaches to test MI assumptions, handle the problem of response biases, and investigate the relations and interactions between both. The goal is to provide researchers, practitioners and policy makers with comparable and meaningful data for secondary analyses and to enable fair comparisons of groups and countries.

Artificial Intelligence in Educational Measurement: Trends, Mindsets, and Practices

Watch Here:

Monday, August 31, 2020 12:00 - 1:30 PM ET

Organizers: Andre Rupp, Independent Consultant, Mindful Measurement &
Carol M. Forsyth, Educational Testing Service


AI in STEM Assessment: Trends, Mindsets, and Practices
Janice Gobert, Apprendis; Mike Sao Pedro, Apprendis 

AI in Education: New Data Sources and Modeling Opportunities
Piotr Mitros, ETS; Steven Tang, eMetric

Fairness, Accountability, and Transparency in Machine Learning
Collin Lynch, North Carolina State University

Bias and Fairness for Automated Feedback Generation
Neil Heffernan, Worcester Polytechnic Institute; Anthony Botelho, Worcester Polytechnic Institute

Building and Picking a Model for Learning and Assessment
Michael Yudelson, ACTNext by ACT

Alina von Davier, ACTNext
Andre A Rupp, Educational Testing Service (ETS)
Carol M Forsyth, Educational Testing Service

In this session, various experts from areas connected to artificial intelligence (AI) in assessment will provide thoughtful perspectives on how key issues in educational measurement are conceptually framed, empirically investigated, and critically communicated to key stakeholder groups. After a general overview of the current trends in AI research as it pertains to educational assessment, different presenters will critically discuss how the three core areas of (1) reliability / statistical modeling, (2) validity / construct representation, and (3) equity / fairness are tackled in a changing field of educational assessment. A related goal of the session is to have presenters and participants suggest key lines of work to which current members of NCME can productively contribute to shape best practices in this new world of assessment. In addition, the session will be used as an opportunity to discuss means of cross-community outreach and engagement that can help build further bridges between the current NCME membership and members from neighboring scientific and practitioner communities working with AI technologies in assessment. This session is connected to a newly proposed SIG “Artificial Intelligence in Assessment”.

Assessing Indigenous Students: Co-Creating a Culturally Relevant & Sustaining Assessment System

Watch Here:

Monday, August 31, 2020 3:00 - 4:00 PM ET

Organizers: Cristina Anguiano-Carrasco, ACT & Leanne R. Ketterlin-Geller, Southern Methodist University
Chair: Cristina Anguiano-Carrasco, ACT

  » Sherry Saevil, Halton Catholic District School Board
  » Pohai Kukea Shultz, University of Hawaii
  » Kerry Englert, Seneca Consulting

The NCME Diversity Issues in Testing Committee is pleased to offer an invited panel session at the NCME 2020 conference in San Francisco focused on assessment issues affecting indigenous students.  In the invited panel session that was held at the 2019 NCME conference in Toronto, discussion focused on how to make assessments more equitable for students of color.  This session extends that discussion with a focus on the equitable assessment of indigenous students, specifically, by naming and addressing the unique challenges these students face within the context of traditional systems of assessment. How can we help indigenous students succeed in an educational system that has failed them?

September 2020

Challenges with Automatic Item Generation Implementation: Recent Research, Strategies, and Lessons Learned

Watch Here:

Tuesday, September 1st from 2 PM to 3:30 PM ET
Organizer: Pamela Kaliski, ABIM


Operationalizing AIG: Talent, Process, and System Implications
Donna Matovinovic

Items and Item Models: AIG Traditional and Modern
Mary Ann Simpson, Emily Bo, Wei He, Abby Javurek, Sarah Miller, Sylvia Scheuring, & Naimish Shah, NWEA

Exploring the Utility of Semantic Similarity Indices for Automatic Item Generation
Pamela Kaliski, Jerome Clauser, & Matthew Burke, ABIM

The Effect of Using AIG Items to Pre-equate Form
Drew Dallas & Joshua Goodman, NCCPA


Discussant: Hollis Lai, University of Alberta

This session is important for organizations considering AIG implementation, as well as organizations currently implementing AIG, as these papers address issues that an organization is likely to encounter with AIG implementation. Over the past decade, the educational measurement field has seen great enthusiasm for automatic item generation (AIG). Producing more items in a shorter time frame without increasing costs is a priority in many organizations, which is where AIG can be helpful. AIG protects test security by allowing exposed items to be used less frequently, in turn protecting the validity of assessment scores. However, committing to AIG implementation initially requires a significant investment. In order for AIG to deliver on its potential for improving assessment development processes, more studies are needed to demonstrate its effectiveness. Examples of successful AIG implementation in the literature, and discussions of lessons learned while facing common challenges during AIG implementation, would greatly benefit the educational measurement community. The focus of this coordinated session is recent research and strategies to inform successful AIG implementation. These papers from four different organizations share the common theme of addressing challenges that are encountered when implementing AIG and are valuable for attendees from organizations interested in AIG implementation.

Research for Practical Issues and Solutions in Computerized Multistage Testing (Book by C&H)

Watch Here:

Thursday, September 3rd from 2 PM to 3:30 PM ET

Organizer: Duanli Yan, Educational Testing Service


Purposeful Design for Useful Tests: Considerations of Choice in Multistage-Adaptive Testing
April Zenisky & Stephen Sireci, University of Massachusetts-Amherst

MST Strategic Assembly and Implementation
Richard Luecht, University of North Carolina-Greensboro & Xiao Luo, Measured Progress

MST Design and Analysis Considerations for Item Calibration
Paul A. Jewsbury, Educational Testing Service & Peter van Rijn, ETS Global

Predicting and Evaluating the Performance of Multistage Tests
Tim Davey, Educational Testing Service


Since 2014, operational applications of computerized multistage testing (MST) have been increasing rapidly, given MST's practicability. Meanwhile, researchers have continued to explore approaches to address practical issues and develop software to support these operational applications. With the increasing number of operational MST applications, many testing institutions and researchers have also gained experience in dealing with operational challenges and solving practical problems in the context of their large-scale assessments.

This symposium presents the most recent research for various practical issues and considerations, methodological approaches, and solutions for implementing MST for operational applications. It is based on an upcoming volume Research for Practical Issues and Solutions in Computerized Multistage Testing (2020) by Chapman and Hall, the sequel volume to the popular volume Computerized Multistage Testing: Theory and Applications (2014) by Chapman and Hall.


Integrating Timing Considerations to Improve Testing Practices

Watch Here:

Friday, September 4th from 11:30 AM to 1 PM ET

Organizer: Arden Ohls, NBME
Chair: Richard Feinberg, NBME


The Evolving Conceptualization and Evaluation of Test Speededness: A Historical Perspective
Daniel Jurich & Melissa Margolis, NBME

The Impact of Time Limits and Timing Information on Validity
Michael Kane, Educational Testing Service

Relationship between Testing Time and Testing Outcomes
Brent Bridgeman, Educational Testing Service

Response Times in Cognitive Tests: Interpretation and Importance
Paul De Boeck, The Ohio State University & Frank Rijmen, American Institutes for Research

A Cessation of Measurement: Identifying Test Taker Disengagement Using Response Time
Steven L. Wise & Megan R Kuhfeld, NWEA


Discussant: Richard Feinberg, NBME

For nearly a century, practitioners have theorized about and investigated examination time limits and the effect they have on test takers. This session’s first presentation traces the historical foundations that have influenced current practice related to examination timing. The historical background covers two main areas of research: 1) conceptualizations of speededness; and 2) methods to detect speededness. This presentation will proceed chronologically and will be structured at a high level by separating the relevant history into meaningful eras, such as pre- and post-implementation of computer-based testing (CBT). The development of timing theories and methods within the overarching framework of the Standards for Educational and Psychological Testing will also be discussed, as will the extent to which research and practice have influenced shifts in the Standards. The talk will close by highlighting remaining gaps in the literature and connecting these gaps with the work included in additional chapters presented during this symposium.