Generating Automatic Feedback for Open-Ended Questions with Fine-Tuned LLMs: A Comparison of GPT and Llama

This event is organized by the NCME Artificial Intelligence in Measurement and Education (AIME) SIG.

Generating effective feedback for open-ended responses in high-stakes exams is a complex and resource-intensive task, as test-takers can approach the same item in diverse ways. Large Language Models (LLMs), with their ability to follow instructions and understand context, offer the potential to scale up feedback delivery. Prior research has demonstrated promising results with proprietary GPT models and carefully engineered prompts, but opportunities for improvement remain. Fine-tuning may enhance an LLM’s ability to generate feedback that aligns with test-takers’ needs while reducing dependence on manual prompt engineering. Additionally, open-source LLMs, such as Llama, could provide performance comparable to proprietary models while reducing costs and mitigating privacy concerns.

To explore these possibilities, we conducted two studies addressing the following questions: (I) Can LLMs be fine-tuned to generate high-quality feedback for short open-ended responses? (II) Can the open-source Llama model deliver feedback of comparable quality to proprietary GPT models? (III) How does the performance of base models compare to that of fine-tuned models? Using responses from a high-stakes situational judgment test and a small set of hand-crafted training examples, we fine-tuned GPT and Llama models to generate feedback aligned with established principles of effective feedback. Model outputs were evaluated using a structured rubric, and feedback quality was compared using automated similarity metrics. Additionally, in the first study, test experts and simulated test-takers assessed the feedback.
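The announcement does not include implementation details, but for readers unfamiliar with the workflow, the sketch below illustrates two pieces it mentions: formatting a hand-crafted training example in the chat-style JSONL format commonly used for fine-tuning, and comparing generated feedback against reference feedback with an embedding-based similarity metric. It assumes the sentence-transformers library; the example texts, model name, and output path are hypothetical placeholders, not materials from the studies.

    """Illustrative sketch (not from the presentation): one hand-crafted
    fine-tuning example in chat-style JSONL, plus an embedding-based
    similarity score between generated and reference feedback. Texts,
    model names, and the output path are hypothetical placeholders."""

    import json

    from sentence_transformers import SentenceTransformer, util

    # A single training example in the messages format commonly used for
    # fine-tuning chat models: system instructions, the test-taker's
    # response, and expert-written target feedback.
    example = {
        "messages": [
            {"role": "system",
             "content": "Give brief, constructive feedback on responses to "
                        "a situational judgment test item."},
            {"role": "user",
             "content": "Response: I would report the issue to my supervisor "
                        "immediately without speaking to my colleague first."},
            {"role": "assistant",
             "content": "You rightly prioritize escalation; consider first "
                        "clarifying the situation with your colleague and "
                        "suggesting a concrete next step."},
        ]
    }

    with open("train.jsonl", "w") as f:  # hypothetical output path
        f.write(json.dumps(example) + "\n")

    # One possible automated similarity metric: cosine similarity between
    # sentence embeddings of generated and reference feedback.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    generated = ("Good call to escalate, but talk with your colleague first "
                 "and propose a next step.")
    reference = example["messages"][2]["content"]
    embeddings = encoder.encode([generated, reference], convert_to_tensor=True)
    print(f"Cosine similarity: {util.cos_sim(embeddings[0], embeddings[1]).item():.3f}")

In practice, such automated similarity scores would complement, rather than replace, the rubric-based and expert evaluations described in the abstract.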

Our findings reveal notable differences in the performance of Llama and GPT models and highlight the impact of fine-tuning on feedback quality. These results contribute to the growing body of research on AI-driven feedback generation and inform best practices for integrating LLMs into high-stakes testing.

Presenters:

  • Okan Bulut and Elisabetta Mazzullo, University of Alberta

When: Mar 19, 2025, 4:00 PM to 5:00 PM (ET)

Location: Online

Url: https://us02web.zoom.us/j/88510261077?pwd=258jRNzrgAlsr09cGjAVBPBOfTnM68.1
Meeting ID: 885 1026 1077 / Passcode: 164209