Artificial intelligence tools are increasingly used in laboratories to support data analysis, literature review, and experimental planning. As these systems become more integrated into scientific workflows, researchers are also working to better understand their limitations. A new benchmark called Humanity’s Last Exam aims to evaluate how well advanced artificial intelligence systems perform when confronted with highly specialized human knowledge.
Developed by a global consortium of nearly 1,000 researchers, the assessment introduces a rigorous framework for measuring machine intelligence beyond widely used benchmarks such as Massive Multitask Language Understanding (MMLU). The project, described in a recent paper in Nature, was led in part by Tung Nguyen of Texas A&M University and focuses on identifying areas where artificial intelligence systems still struggle to match expert-level reasoning.
As large language models increasingly achieve near-perfect scores on traditional benchmarks, researchers say new evaluation tools are needed to distinguish pattern recognition from deeper domain expertise.
The architecture of Humanity’s Last Exam
Humanity’s Last Exam (HLE) contains 2,500 questions spanning diverse academic disciplines, including mathematics, the natural sciences, the humanities, ancient languages, and specialized technical subfields. Unlike previous AI benchmarks that emphasize general knowledge, HLE focuses on the specialized “long tail” of human expertise.
To maintain the assessment's difficulty, researchers designed the benchmark to prevent models from simply memorizing answers. Much of the question bank remains hidden, keeping the test effective as AI systems continue to evolve.
The methodology for building the exam followed several strict criteria:
- Expert authorship: Questions were written and reviewed by subject-matter experts worldwide
- Verifiable answers: Each problem was required to have a single, unambiguous solution
- AI pre-testing: Questions that existing models could answer correctly during development were removed (a filtering step illustrated in the sketch after this list)
- Anti-retrieval design: Prompts were structured so they could not be easily solved through internet searches
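The published description does not include the consortium's filtering code, but the pre-testing step can be pictured as a simple rejection loop: any candidate question that a current model answers correctly is dropped from the pool. The sketch below is a minimal, hypothetical illustration of that idea; the `query_model` helper, the model list, and the exact-match check are assumptions made for illustration, not the actual HLE pipeline.

```python
# Hypothetical sketch of an "AI pre-testing" filter: drop any candidate
# question that a current model already answers correctly. The helper,
# model names, and exact-match grading are illustrative assumptions,
# not the HLE team's actual implementation.

def query_model(model_name: str, prompt: str) -> str:
    """Placeholder for a call to a frontier model's API; returns its answer text."""
    raise NotImplementedError("Connect this to the model APIs being tested against.")


def is_too_easy(question: str, reference_answer: str, models: list[str]) -> bool:
    """Reject a question if any current model reproduces the reference answer."""
    for model in models:
        predicted = query_model(model, question).strip().lower()
        if predicted == reference_answer.strip().lower():
            return True
    return False


def filter_question_bank(candidates: list[dict], models: list[str]) -> list[dict]:
    """Keep only candidate questions that no tested model can already solve."""
    return [
        q for q in candidates
        if not is_too_easy(q["question"], q["answer"], models)
    ]
```

In practice, grading free-form answers requires more than exact string matching, which is one reason human expert review remained part of the question-vetting process.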
The exam spans a wide range of topics. Tasks include translating Palmyrene inscriptions, analyzing Biblical Hebrew pronunciation patterns, identifying microanatomical structures in birds, and solving advanced mathematical problems.
Nguyen contributed many questions in mathematics and computer science to ensure strong technical rigor within those domains.
Early performance of AI systems
Initial testing revealed a significant gap between current AI capabilities and the level of expertise required to answer many questions in Humanity’s Last Exam.
Reported performance included:
- GPT-4o: 2.7 percent accuracy
- Claude 3.5 Sonnet: 4.1 percent accuracy
- Gemini Pro and Claude Opus models: Approximately 40 percent to 50 percent accuracy on portions of the benchmark
The results suggest that while large language models can perform well on conversational tasks or general knowledge questions, their performance declines sharply on questions that demand deep, specialized expertise, a shortfall that can produce operational "workslop" if outputs are not properly validated.
“When AI systems start performing extremely well on human benchmarks, it’s tempting to think they are approaching human-level understanding,” Nguyen said. “But Humanity’s Last Exam reminds us that intelligence is not just about pattern recognition—it involves depth, context, and specialized knowledge.”
Why AI benchmarks matter for research labs
For laboratory leaders exploring artificial intelligence tools, benchmarks such as Humanity’s Last Exam provide critical insight into where those technologies may still fall short.
AI systems are increasingly used in research environments for tasks such as data interpretation, literature synthesis, and experimental design. Understanding their limitations can help laboratory managers determine when AI can support scientific work and when human expertise remains essential.
“Without accurate assessment tools, policymakers, developers, and users risk misinterpreting what AI systems can actually do,” Nguyen said. “Benchmarks provide the foundation for measuring progress and identifying risks.”
Human expertise remains central
The development of Humanity’s Last Exam itself illustrates the role of human expertise in evaluating artificial intelligence. The project brought together specialists from numerous disciplines, including history, physics, linguistics, and medicine.
Researchers say identifying the limits of AI systems can help guide the development of safer technologies that enhance rather than erode scientific capability.
“This isn’t a race against AI,” Nguyen said. “It’s a way to understand where these systems are strong and where they struggle. That understanding helps us build more reliable tools and reminds us why specialized human expertise still matters.”
This article was created with the assistance of Generative AI and has undergone editorial review before publishing.