New AI Safety Benchmarking Reveals Major Risks in Experimental Decision-Making

New benchmarking results show leading AI models struggle with hazard identification and risk prioritization

Written by Michelle Gaulin | 2 min read

New AI safety benchmarking research published in Nature Machine Intelligence raises concerns about the ability of current artificial intelligence models to support experimental decision-making in safety-critical environments. The study finds that many leading large language models and vision-language models struggle to identify hazards, assess risks, and predict unsafe outcomes when evaluated against realistic experimental scenarios.

At the center of the study is AI safety benchmarking, a structured approach for evaluating whether AI systems can perform reliably when safety considerations are involved. As AI tools become more deeply embedded in scientific workflows, these findings highlight a growing disconnect between AI adoption and AI readiness for real-world experimental use.

AI safety benchmarking framework evaluates experimental risk

To assess AI safety performance, researchers developed LabSafety Bench, a benchmarking framework designed to test how well AI models recognize and reason about hazards across experimental domains. The framework evaluates performance in biology, chemistry, physics, and general laboratory settings using both structured and open-ended tasks.

LabSafety Bench includes:

  • 765 multiple-choice questions focused on laboratory safety knowledge
  • 404 realistic experimental scenarios
  • More than 3,000 open-ended tasks addressing hazard identification, risk perception, and consequence prediction

The research team evaluated 19 AI models, including proprietary large language models, open-weight language models, and vision-language models that process images alongside text.
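The article does not describe the benchmark's scoring tooling, but the core of a multiple-choice evaluation of this kind reduces to a simple loop: pose each question, extract the model's chosen option, and compare it against the answer key. The Python sketch below illustrates the idea; the question schema, prompt wording, and ask_model stub are assumptions for illustration, not LabSafety Bench's actual code.

```python
# Hypothetical sketch of a multiple-choice lab-safety evaluation loop.
# The question format, prompt wording, and ask_model() stub are
# illustrative assumptions -- not LabSafety Bench's actual tooling.

from dataclasses import dataclass

@dataclass
class SafetyQuestion:
    prompt: str              # the safety question posed to the model
    choices: dict[str, str]  # option letter -> option text
    answer: str              # correct option letter

def ask_model(question_text: str) -> str:
    """Stand-in for a real LLM call; replace with your provider's API."""
    return "A"  # placeholder response

def evaluate(questions: list[SafetyQuestion]) -> float:
    """Return the fraction of questions the model answers correctly."""
    correct = 0
    for q in questions:
        options = "\n".join(f"{k}. {v}" for k, v in q.choices.items())
        reply = ask_model(f"{q.prompt}\n{options}\nAnswer with a single letter.")
        # Naive answer extraction: take the first character of the reply.
        if reply.strip()[:1].upper() == q.answer:
            correct += 1
    return correct / len(questions)

if __name__ == "__main__":
    demo = [SafetyQuestion(
        prompt="What is the first action after a chemical splash to the eyes?",
        choices={"A": "Flush at an eyewash station for 15 minutes",
                 "B": "Apply a neutralizing solution",
                 "C": "Bandage the eyes and wait for assistance"},
        answer="A",
    )]
    print(f"Accuracy: {evaluate(demo):.0%}")
```

Open-ended tasks such as hazard identification require more elaborate grading, typically rubric-based human or model judging rather than a string match like the one above.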

Benchmarking results reveal limits in experimental decision-making

Across the models tested, AI safety benchmarking revealed consistent performance gaps in tasks that most closely resemble real experimental decision-making. While some proprietary models performed well on structured multiple-choice questions, none exceeded 70 percent accuracy in hazard identification tasks.

Performance limitations were most pronounced in:

  • Chemistry-based experimental scenarios
  • Cryogenic material handling
  • Equipment operation and physical hazard recognition
  • Electrical and radiation safety contexts

Several models scored below 50 percent accuracy in categories related to improper equipment use, indicating unreliable reasoning in scenarios where experimental judgment is critical.

AI hallucinations amplify safety concerns

The study also highlights AI hallucinations as a significant risk factor in experimental decision-making. Hallucinations occur when models generate confident but incorrect responses, which can be particularly dangerous when AI is used to inform experimental setup, procedural steps, or emergency response actions.

Researchers identified multiple failure modes relevant to experimental environments:

  • Poor prioritization of safety risks
  • Inconsistent reasoning across similar scenarios
  • Confident presentation of incorrect guidance
  • Overfitting to narrow or incomplete training data

These failure modes undermine trust in AI-generated recommendations and increase the likelihood of unsafe experimental outcomes if AI tools are used without verification.
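One of these failure modes, inconsistent reasoning across similar scenarios, is straightforward to probe for in-house: ask the same safety question several ways and check whether the answers agree. The sketch below illustrates the idea; the paraphrase set and ask_model stub are assumptions, not the study's methodology.

```python
# Hypothetical consistency probe: pose paraphrases of one safety question
# and flag the model if its answers disagree. The paraphrases and the
# ask_model() stub are illustrative assumptions, not the study's method.

def ask_model(prompt: str) -> str:
    """Stand-in for a real LLM call; replace with your provider's API."""
    return "butyl rubber gloves"  # placeholder response

PARAPHRASES = [
    "What glove material should be worn when handling acetone?",
    "Which gloves are appropriate for working with acetone?",
    "Name the correct glove type for acetone handling.",
]

def is_consistent(prompts: list[str]) -> bool:
    """True if the model gives one normalized answer across all paraphrases."""
    answers = {ask_model(p).strip().lower() for p in prompts}
    return len(answers) == 1

if __name__ == "__main__":
    status = "consistent" if is_consistent(PARAPHRASES) else "INCONSISTENT"
    print(f"Model is {status} across {len(PARAPHRASES)} paraphrases.")
```

A real probe would also need to judge whether the agreed-upon answer is actually correct; agreement alone only rules out the inconsistency failure mode.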

Operational implications for lab managers

For lab managers, AI safety benchmarking results reinforce the need to treat AI as an assistive tool rather than a decision authority. Even advanced models cannot be assumed to reliably understand laboratory hazards or safety protocols.

Operational considerations include:

  • Maintaining human review of AI-assisted experimental planning
  • Limiting AI use in safety-critical decision-making
  • Training staff to recognize AI limitations and failure modes
  • Establishing governance policies for AI use in laboratory operations

The findings also show that newer or larger AI models do not automatically demonstrate better safety performance, emphasizing the importance of validation over assumptions.

Benchmarking as a foundation for safer AI integration

According to the study, AI safety benchmarking provides a practical foundation for improving AI reliability in experimental environments. Tools like LabSafety Bench help identify where models fail, informing future development and governance strategies.

The study concludes that until AI systems demonstrate consistently reliable safety reasoning, human oversight must remain central to experimental decision-making. AI can enhance efficiency and analysis, but responsibility for safety cannot be delegated to current AI systems.

This article was created with the assistance of Generative AI and has undergone editorial review before publishing.

About the Author

Michelle Gaulin is an associate editor for Lab Manager. She holds a bachelor of journalism degree from Toronto Metropolitan University in Toronto, Ontario, Canada, and has two decades of experience in editorial writing, content creation, and brand storytelling. In her role, she contributes to the production of the magazine’s print and online content, collaborates with industry experts, and works closely with freelance writers to deliver high-quality, engaging material.

Her professional background spans multiple industries, including automotive, travel, finance, publishing, and technology. She specializes in simplifying complex topics and crafting compelling narratives that connect with both B2B and B2C audiences.

In her spare time, Michelle enjoys outdoor activities and cherishes time with her daughter. She can be reached at mgaulin@labmanager.com.
