Artificial intelligence tools are increasingly used in laboratories to support data analysis, literature review, and experimental planning. As these systems become more integrated into scientific workflows, researchers are also working to better understand their limitations. A new benchmark called Humanity’s Last Exam aims to evaluate how well advanced artificial intelligence systems perform when confronted with highly specialized human knowledge.
Developed by a global consortium of nearly 1,000 researchers, the assessment introduces a rigorous framework for measuring machine intelligence beyond widely used benchmarks such as Massive Multitask Language Understanding (MMLU). The project, described in a recent paper in Nature, was led in part by Tung Nguyen of Texas A&M University and focuses on identifying areas where artificial intelligence systems still struggle to match expert-level reasoning.
As large language models increasingly achieve near-perfect scores on traditional benchmarks, researchers say new evaluation tools are needed to distinguish pattern recognition from deeper domain expertise.
The architecture of Humanity’s Last Exam
Humanity’s Last Exam (HLE) contains 2,500 questions spanning diverse academic disciplines, including mathematics, the natural sciences, the humanities, ancient languages, and specialized technical subfields. Unlike previous AI benchmarks that emphasize general knowledge, HLE focuses on the specialized “long tail” of human expertise.
To maintain the assessment's difficulty, researchers designed the benchmark to prevent models from simply memorizing answers. Much of the question bank remains hidden, keeping the test effective as AI systems continue to evolve.
The methodology for building the exam followed several strict criteria:
- Expert authorship: Questions were written and reviewed by subject-matter experts worldwide
- Verifiable answers: Each problem was required to have a single, unambiguous solution
- AI pre-testing: Questions that existing models could answer correctly during development were removed (a filtering step illustrated in the sketch after this list)
- Anti-retrieval design: Prompts were structured so they could not be easily solved through internet searches
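The published description does not include the consortium's filtering code, but the pre-testing step can be pictured as a simple rejection loop: any candidate question that a current model answers correctly is dropped from the pool. The sketch below is a minimal, hypothetical illustration of that idea; the `query_model` helper, the model list, and the exact-match check are assumptions made for illustration, not the actual HLE pipeline.

```python
# Hypothetical sketch of an "AI pre-testing" filter: drop any candidate
# question that a current model already answers correctly. The helper,
# model names, and exact-match grading are illustrative assumptions,
# not the HLE team's actual implementation.

def query_model(model_name: str, prompt: str) -> str:
    """Placeholder for a call to a frontier model's API; returns its answer text."""
    raise NotImplementedError("Connect this to the model APIs being tested against.")


def is_too_easy(question: str, reference_answer: str, models: list[str]) -> bool:
    """Reject a question if any current model reproduces the reference answer."""
    for model in models:
        predicted = query_model(model, question).strip().lower()
        if predicted == reference_answer.strip().lower():
            return True
    return False


def filter_question_bank(candidates: list[dict], models: list[str]) -> list[dict]:
    """Keep only candidate questions that no tested model can already solve."""
    return [
        q for q in candidates
        if not is_too_easy(q["question"], q["answer"], models)
    ]
```

In practice, grading free-form answers requires more than exact string matching, which is one reason human expert review remained part of the question-vetting process.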
The exam spans a wide range of topics. Tasks include translating Palmyrene inscriptions, analyzing Biblical Hebrew pronunciation patterns, identifying microanatomical structures in birds, and solving advanced mathematical problems.
Nguyen contributed many questions in mathematics and computer science to ensure strong technical rigor within those domains.
Early performance of AI systems
Initial testing revealed a significant gap between current AI capabilities and the level of expertise required to answer many questions in Humanity’s Last Exam.
Reported performance included:
- GPT-4o: 2.7 percent accuracy
- Claude 3.5 Sonnet: 4.1 percent accuracy
- Gemini Pro and Claude Opus models: Approximately 40 percent to 50 percent accuracy on portions of the benchmark
The results suggest that while large language models can perform well on conversational tasks or general knowledge questions, their performance declines sharply on questions that demand deep, specialized expertise, a shortfall that can produce operational "workslop" if outputs are not properly validated.
“When AI systems start performing extremely well on human benchmarks, it’s tempting to think they are approaching human-level understanding,” Nguyen said. “But Humanity’s Last Exam reminds us that intelligence is not just about pattern recognition—it involves depth, context, and specialized knowledge.”
Why AI benchmarks matter for research labs
For laboratory leaders exploring artificial intelligence tools, benchmarks such as Humanity’s Last Exam provide critical insight into where those technologies may still fall short.
AI systems are increasingly used in research environments for tasks such as data interpretation, literature synthesis, and experimental design. Understanding their limitations can help laboratory managers determine when AI can support scientific work and when human expertise remains essential.
“Without accurate assessment tools, policymakers, developers, and users risk misinterpreting what AI systems can actually do,” Nguyen said. “Benchmarks provide the foundation for measuring progress and identifying risks.”
Human expertise remains central
The development of Humanity’s Last Exam itself illustrates the role of human expertise in evaluating artificial intelligence. The project brought together specialists from numerous disciplines, including history, physics, linguistics, and medicine.
Researchers say identifying the limits of AI systems can help guide the development of safer technologies that enhance rather than erode scientific capability.
“This isn’t a race against AI,” Nguyen said. “It’s a way to understand where these systems are strong and where they struggle. That understanding helps us build more reliable tools and reminds us why specialized human expertise still matters.”
This article was created with the assistance of Generative AI and has undergone editorial review before publishing.