Sup AI Sets New Benchmark Record with 52.15% on Humanity's Last Exam
Surpasses every individual frontier model on the world's hardest open-source AI reasoning test
Important Disclosure: This is an independent evaluation conducted by Sup AI and is not officially endorsed, validated, or recognized by the Center for AI Safety, Scale AI, or the HLE benchmark creators. Sup AI is not affiliated with CAIS or Scale AI.
PALO ALTO, Calif., Dec. 10, 2025 /PRNewswire/ -- Sup AI announced today that its multi-model orchestration system has achieved 52.15% accuracy on Humanity's Last Exam (HLE), the most challenging publicly available benchmark for advanced AI reasoning. This performance establishes Sup AI as the new state-of-the-art (SOTA), outperforming all individual frontier models including Google's Gemini 3 Pro Preview, OpenAI's GPT-5 Pro and GPT-5.1, Anthropic's Claude Opus 4.5, and xAI's Grok-4.
HLE is designed to resist saturation as AI capabilities improve, blending advanced mathematics, scientific reasoning, and logic into 2,500 expert-crafted questions. Crossing 50% accuracy on this benchmark marks a significant milestone in the progression of general AI reasoning capabilities.
Note: All models evaluated, including Sup AI, use enhanced evaluation settings such as custom instructions, web search, and low-confidence retries. These settings raise every model's score relative to published benchmarks, but relative rankings remain stable, and Sup AI maintains a clear lead.
Sup AI's Results at a Glance
Metric                                 Value
Accuracy                               52.15%
Questions Evaluated                    1,369
Lead Over Next Best Model              +7.49 points
ECE (stated-confidence calibration)    35.22%
Sup AI's score is statistically significant at p < 0.001, with a 95% confidence interval of ±2.65 percentage points.
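For readers who want to sanity-check these figures, the sketch below shows how the two headline statistics could be recomputed from per-question results. This is not Sup AI's published code; the record format and function names are assumptions, and both estimators are the standard ones.

    # Minimal sketch (not Sup AI's published code) of the headline statistics,
    # assuming per-question records with a boolean "correct" flag and a stated
    # confidence in [0, 100].
    import math

    def accuracy_ci(results, z=1.96):
        """Accuracy with a normal-approximation 95% confidence interval."""
        n = len(results)
        p = sum(r["correct"] for r in results) / n
        half_width = z * math.sqrt(p * (1 - p) / n)
        return p, half_width

    def expected_calibration_error(results, n_bins=10):
        """ECE: |stated confidence - accuracy| averaged over equal-width
        confidence bins, weighted by the number of questions per bin."""
        bins = [[] for _ in range(n_bins)]
        for r in results:
            conf = r["confidence"] / 100.0             # stated confidence in [0, 1]
            idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into top bin
            bins[idx].append((conf, r["correct"]))
        n = len(results)
        ece = 0.0
        for b in bins:
            if b:
                avg_conf = sum(c for c, _ in b) / len(b)
                avg_acc = sum(a for _, a in b) / len(b)
                ece += (len(b) / n) * abs(avg_conf - avg_acc)
        return ece

With n = 1,369 and p = 0.5215, accuracy_ci gives a half-width of roughly 2.65 percentage points, matching the interval stated above.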
A Clear Lead Over Frontier Models
Sup AI's ensemble system consistently outperformed every major standalone model:
Model                     Accuracy
Sup AI                    52.15%
Gemini 3 Pro Preview      44.66%
GPT-5 Pro                 39.43%
GPT-5.1                   38.18%
Claude Opus 4.5           29.56%
DeepSeek v3.2 Thinking    24.08%
This margin underscores a core principle: an orchestrated ensemble, when engineered properly, can outperform its strongest component models by a wide margin.
Why Sup AI Wins: Ensemble Intelligence
Sup AI dynamically routes each question to a set of frontier models most suited to the problem, analyzes probability distributions across their outputs, and synthesizes an answer weighted by confidence, specialization, and inter-model agreement. If confidence is insufficient or models disagree meaningfully, Sup AI automatically retries.
Sup AI also enables multimodal handling even for models that lack native support, pre-processing images or PDFs when required.
The result is not a simple vote. Instead, it's a structured, confidence-weighted synthesis that consistently outperforms every individual model.
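Sup AI has not published the orchestrator's internals, but the description above maps onto a recognizable pattern. The sketch below shows a minimal confidence-weighted vote with a disagreement-triggered retry; every name in it (Member, query_model, AGREEMENT_THRESHOLD, MAX_RETRIES) is hypothetical, and the production system reportedly goes further, synthesizing a single answer rather than merely selecting one.

    # Illustrative sketch only -- Sup AI has not published its orchestration
    # internals. All names here (Member, query_model, AGREEMENT_THRESHOLD,
    # MAX_RETRIES) are hypothetical.
    from collections import defaultdict
    from dataclasses import dataclass

    @dataclass
    class Member:
        name: str             # e.g. "gemini-3-pro", "gpt-5.1" (illustrative)
        domain_weight: float  # higher when the model is strong in this domain

    AGREEMENT_THRESHOLD = 0.6  # assumed: minimum weighted share for the top answer
    MAX_RETRIES = 2            # assumed: low-confidence retry budget

    def synthesize(question, members, query_model, retries=0):
        """Weight each model's answer by its stated confidence and a per-model
        specialization weight; retry when no answer clearly dominates."""
        votes = defaultdict(float)
        for m in members:
            answer, confidence = query_model(m.name, question)  # hypothetical API
            votes[answer] += confidence * m.domain_weight
        total = sum(votes.values())
        best = max(votes, key=votes.get)
        if votes[best] / total < AGREEMENT_THRESHOLD and retries < MAX_RETRIES:
            return synthesize(question, members, query_model, retries + 1)
        return best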
About Humanity's Last Exam
Humanity's Last Exam (HLE) is a high-difficulty benchmark developed by the Center for AI Safety and Scale AI to evaluate deep reasoning, mathematical problem-solving, scientific understanding, and multi-step logic. With 2,500 public questions and no "trivial" shortcuts, HLE remains one of the few benchmarks where frontier models do not cluster near human-expert-level performance.
Sup AI evaluated 1,369 randomly selected questions, using standard Sup AI chat settings identical to what any user would access through the platform.
Leadership Perspective
"Crossing 50% on HLE isn't about luck. It's about architecture," said Ken Mueller, CEO of Sup AI. "No single model dominates every domain, but an orchestrated system that understands when to trust, when to weight, and when to retry can. Sup AI shows that careful ensemble engineering can push beyond the ceiling of any standalone model."
Evaluation Methodology
-- Each question (text and image) was submitted to the Sup AI API using the platform's normal system prompt.
-- Responses were structured into explanation, answer, and a self-reported confidence score.
-- GPT-5.1 served as the automated judge, evaluating answer correctness via strict extraction, semantic equivalence, and numerical-tolerance matching.
-- Accuracy and calibration metrics were computed using established statistical estimators.
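As one concrete illustration of that matching logic, the sketch below shows how a numerical-tolerance check might look. REL_TOL is an invented parameter rather than a published setting, and the sketch does not reproduce the GPT-5.1 judge's strict-extraction or semantic-equivalence steps.

    # Hypothetical sketch of the numerical-tolerance leg of answer matching.
    # REL_TOL is an assumed relative tolerance, not a published setting.
    REL_TOL = 1e-2

    def numbers_match(predicted: str, gold: str, rel_tol: float = REL_TOL) -> bool:
        """Compare numerically when both sides parse as floats; otherwise fall
        back to a strict, case-insensitive string comparison."""
        try:
            p, g = float(predicted), float(gold)
        except ValueError:
            return predicted.strip().lower() == gold.strip().lower()
        if g == 0.0:
            return abs(p) <= rel_tol  # absolute check when the gold answer is zero
        return abs(p - g) / abs(g) <= rel_tol

In practice a judge pipeline would only fall back to a check like this once both the extracted prediction and the gold answer are plain numbers.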
Sup AI's complete evaluation code, predictions, judged outputs, metrics, and per-question results are publicly available for full reproducibility.
Significance
Sup AI's performance demonstrates:
1. It is possible to meaningfully surpass top individual models with an ensemble system.
2. Specialization matters -- different models dominate different domains; orchestration captures these strengths.
3. Benchmarks like HLE remain valuable, showing substantial headroom even as AI surpasses 50% accuracy.
4. These capabilities are available today through Sup AI's production API and chat interface.
Availability
The full evaluation, code, and results can be accessed at:
GitHub Repository: https://github.com/supaihq/hle
Sup AI Platform: https://sup.ai
HLE Benchmark: https://lastexam.ai
Sup AI invites researchers, engineers, and enterprise teams to independently reproduce the results and explore the platform's orchestration capabilities.
Citation
@misc{supai-hle-2025,
  title={Sup AI Achieves 52.15% on Humanity's Last Exam},
  author={Sup AI},
  year={2025},
  url={https://github.com/supaihq/hle}
}
For press inquiries or partnership discussions, please contact: support@sup.ai
View original content to download multimedia: https://www.prnewswire.com/news-releases/sup-ai-sets-new-benchmark-record-with-52-15-on-humanitys-last-exam-302637675.html
SOURCE Sup AI