This event is being organized by the NCME Artificial Intelligence in Measurement and Education (AIME) SIGIMIE.
As LLMs pull ahead on fact retrieval and pattern matching, the more telling tests are the questions humans still handle well: abductive puzzles, conceptual leaps, and cross-contextual riddles. Benchmarks also age quickly: today's "hard" question can become tomorrow's trivia. This talk introduces two tools for tracking that moving target: CAIMIRA, which applies item response theory to characterize human and AI proficiencies at scale, and AdvScore, a human-anchored metric that flags when an adversarial dataset stops being challenging. We will highlight surprising gaps, for example GPT-4's strength on lookup-style questions versus human intuition on abductive reasoning, and show how to design next-generation QA challenges that stretch both minds and machines. Finally, we will offer a roadmap for pairing humans and AI agents so that their complementary strengths deliver robust, real-world question-answering systems.
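For context, item response theory models the probability that respondent $i$ answers question $j$ correctly from latent traits. In the common two-parameter logistic form, $P(y_{ij}=1) = \sigma\bigl(a_j(\theta_i - b_j)\bigr)$, where $\theta_i$ is the respondent's proficiency (human or AI), $b_j$ is the question's difficulty, and $a_j$ its discrimination. This is only an illustrative, unidimensional sketch; CAIMIRA's multidimensional formulation differs in its details.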
Presenters: