Measuring AI Accuracy on Standardized Tests: A Comparative Study of ChatGPT, Copilot, and Gemini
Publication Date: May 8, 2026
Abstract:
This study evaluates the performance of three widely used artificial intelligence systems (ChatGPT, Microsoft Copilot, and Google Gemini) on standardized test questions in Math, Reading, and English drawn from SAT and ACT examinations. A total of 90 questions (30 per subject) were selected from multiple test forms across different years to reduce potential bias and ensure broad content coverage. All questions, including those with visual components, were presented to each AI system in a standardized format, and responses were scored for accuracy. A chi-square test for homogeneity was conducted to assess differences in performance among the models. Results indicate that all three AI systems performed strongly on the language-based tasks, Reading and English. In contrast, performance in Math was notably lower across all models, with common errors involving advanced mathematical concepts and misinterpretation of visual and graphical information. Despite observable differences in error patterns, statistical analysis revealed no significant differences in overall performance among the three systems. These findings suggest that current AI models are highly proficient at processing and interpreting textual information but remain less reliable in mathematical reasoning and multimodal tasks. The study highlights both the capabilities and limitations of AI in standardized testing contexts and underscores the importance of prompt design and continued model development.
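
To illustrate the analysis, the following is a minimal Python sketch of a chi-square test for homogeneity of the kind described above, assuming each model's responses are tallied as correct or incorrect across the 90 questions. The counts shown are hypothetical placeholders for illustration only, not data from the study.

    from scipy.stats import chi2_contingency

    # Rows: AI systems; columns: [correct, incorrect] out of 90 questions.
    # These counts are hypothetical placeholders, not the study's results.
    observed = [
        [70, 20],  # ChatGPT (hypothetical)
        [68, 22],  # Microsoft Copilot (hypothetical)
        [72, 18],  # Google Gemini (hypothetical)
    ]

    # Test whether the proportion of correct answers is homogeneous
    # across the three systems.
    chi2, p_value, dof, expected = chi2_contingency(observed)
    print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3f}")

A p-value above the chosen significance level (e.g., 0.05) would be consistent with the study's finding of no significant difference in overall performance among the three systems.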
