

Sup AI Sets New Benchmark Record with 52.15% on Humanity's Last Exam
Thursday, January 22, 2026

Surpasses every individual frontier model on the world's hardest open-source AI reasoning test

Important Disclosure: This is an independent evaluation conducted by Sup AI and is not officially endorsed, validated, or recognized by the Center for AI Safety, Scale AI, or the HLE benchmark creators. Sup AI is not affiliated with CAIS or Scale AI.

PALO ALTO, Calif., Dec. 10, 2025 /PRNewswire/ -- Sup AI announced today that its multi-model orchestration system has achieved 52.15% accuracy on Humanity's Last Exam (HLE), the most challenging publicly available benchmark for advanced AI reasoning. This performance establishes Sup AI as the new state-of-the-art (SOTA), outperforming all individual frontier models including Google's Gemini 3 Pro Preview, OpenAI's GPT-5 Pro and GPT-5.1, Anthropic's Claude Opus 4.5, and xAI's Grok-4.

HLE is designed to resist saturation as AI capabilities improve, blending advanced mathematics, scientific reasoning, and logic into 2,500 expert-crafted questions. Crossing 50% accuracy on this benchmark marks a significant milestone in the progression of general AI reasoning capabilities.

Note: All models evaluated, including Sup AI, use enhanced evaluation settings such as custom instructions, web search, and low-confidence retries. These settings raise every model's score relative to published benchmarks, but relative rankings remain stable, and Sup AI maintains a clear lead.

Sup AI's Results at a Glance


              
            
     Metric                                Value
     Accuracy                              52.15%
     Questions Evaluated                   1,369
     Lead Over Next Best Model             +7.49 points
     ECE (stated-confidence calibration)   35.22%

Sup AI's score is statistically significant at p < 0.001, with a 95% confidence interval of ±2.65 percentage points.
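The reported half-width is consistent with a standard normal-approximation (Wald) interval for a binomial proportion. A minimal sketch, assuming that construction (the release does not state which interval was used):

```python
import math

n = 1369     # questions evaluated
p = 0.5215   # observed accuracy

# 95% normal-approximation (Wald) half-width for a binomial proportion
half_width = 1.96 * math.sqrt(p * (1 - p) / n)
print(f"{half_width * 100:.2f} percentage points")  # prints 2.65 percentage points
```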

A Clear Lead Over Frontier Models

Sup AI's ensemble system consistently outperformed every major standalone model:


       
              
     Model                      Accuracy
     Sup AI                     52.15%
     Gemini 3 Pro Preview       44.66%
     GPT-5 Pro                  39.43%
     GPT-5.1                    38.18%
     Claude Opus 4.5            29.56%
     DeepSeek v3.2 Thinking     24.08%

This margin underscores a core principle: an orchestrated ensemble, when engineered properly, can outperform its strongest component models by a wide margin.

Why Sup AI Wins: Ensemble Intelligence

Sup AI dynamically routes each question to a set of frontier models most suited to the problem, analyzes probability distributions across their outputs, and synthesizes an answer weighted by confidence, specialization, and inter-model agreement. If confidence is insufficient or models disagree meaningfully, Sup AI automatically retries.
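In outline, the routing step might resemble the following keyword-based sketch. The domains, keywords, and model names are illustrative placeholders only; Sup AI has not published its routing policy:

```python
# Illustrative specialist pools; model names are placeholders, not real models.
SPECIALISTS = {
    "math": ["model-a", "model-b"],
    "science": ["model-b", "model-c"],
    "general": ["model-a", "model-c"],
}

def classify(question: str) -> str:
    """Crude keyword classifier standing in for a learned router."""
    q = question.lower()
    if any(k in q for k in ("prove", "integral", "theorem")):
        return "math"
    if any(k in q for k in ("molecule", "quantum", "enzyme")):
        return "science"
    return "general"

def route(question: str) -> list[str]:
    """Return the pool of models best suited to the question."""
    return SPECIALISTS[classify(question)]
```

A production router would presumably use a learned classifier rather than keywords, but the shape of the decision, mapping a question to a pool of specialist models, is the same.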

Sup AI also enables multimodal handling even for models that lack native support, pre-processing images or PDFs when required.

The result is not a simple vote. Instead, it's a structured, confidence-weighted synthesis that consistently outperforms every individual model.
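A confidence-weighted synthesis of this kind can be sketched as follows. The scoring rule, threshold, and retry trigger are assumptions for illustration; the actual aggregation logic is not public:

```python
from collections import defaultdict

def synthesize(responses, threshold=0.6):
    """Confidence-weighted synthesis over model responses.

    responses: list of (answer, confidence, weight) tuples, where
    `weight` encodes per-model specialization for the question.
    Returns (answer, agreement_score), or None to signal a retry
    when agreement is insufficient. Illustrative sketch only.
    """
    scores = defaultdict(float)
    for answer, confidence, weight in responses:
        scores[answer] += confidence * weight
    total = sum(scores.values())
    best, best_score = max(scores.items(), key=lambda kv: kv[1])
    if total == 0 or best_score / total < threshold:
        return None  # models disagree meaningfully: trigger a retry
    return best, best_score / total

result = synthesize([("42", 0.9, 1.0), ("42", 0.7, 0.8), ("41", 0.5, 1.0)])
```

Unlike a majority vote, a low-confidence majority here can lose to a high-confidence, highly weighted minority, and near-ties fall below the threshold and trigger a retry instead of a guess.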

About Humanity's Last Exam

Humanity's Last Exam (HLE) is a high-difficulty benchmark developed by the Center for AI Safety and Scale AI to evaluate deep reasoning, mathematical problem-solving, scientific understanding, and multi-step logic. With 2,500 public questions and no "trivial" shortcuts, HLE remains one of the few benchmarks where frontier models still score far below human-expert-level performance.

Sup AI evaluated 1,369 randomly selected questions, using standard Sup AI chat settings identical to what any user would access through the platform.

Leadership Perspective

"Crossing 50% on HLE isn't about luck. It's about architecture," said Ken Mueller, CEO of Sup AI. "No single model dominates every domain, but an orchestrated system that understands when to trust, when to weight, and when to retry can. Sup AI shows that careful ensemble engineering can push beyond the ceiling of any standalone model."

Evaluation Methodology

    --  Each question (text and image) was submitted to the Sup AI API using the
        platform's normal system prompt.
    --  Responses were structured into explanation, answer, and a self-reported
        confidence score.
    --  GPT-5.1 served as the automated judge, evaluating answer correctness via
        strict extraction, semantic equivalence, and numerical-tolerance
        matching.
    --  Accuracy and calibration metrics were computed using established
        statistical estimators.
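The ECE figure in the results table is conventionally computed by binning stated confidences and comparing each bin's average confidence to its accuracy. A minimal sketch of the standard estimator, assuming ten equal-width bins (the release does not state its binning scheme):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE over stated confidences in [0, 1].

    confidences: per-question self-reported confidence scores.
    correct: per-question 0/1 correctness from the judge.
    """
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Half-open bins (lo, hi], with 0.0 assigned to the first bin.
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        # Weight each bin's confidence/accuracy gap by its share of questions.
        ece += len(idx) / n * abs(avg_conf - accuracy)
    return ece
```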

Sup AI's complete evaluation code, predictions, judged outputs, metrics, and per-question results are publicly available for full reproducibility.
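The judge's strict-extraction and numerical-tolerance tiers could look like the following sketch. The tolerance value is an assumption, and the third tier (semantic equivalence) is delegated to the LLM judge and not shown:

```python
import math

def answers_match(predicted: str, gold: str, rel_tol: float = 1e-2) -> bool:
    """Tiered answer matching: exact string first, then numeric tolerance.

    Semantic equivalence, the third tier described in the methodology,
    would be handled by an LLM judge and is omitted here.
    """
    if predicted.strip().lower() == gold.strip().lower():
        return True  # strict extraction match
    try:
        return math.isclose(float(predicted), float(gold), rel_tol=rel_tol)
    except ValueError:
        return False  # non-numeric and not an exact match
```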

Significance

Sup AI's performance demonstrates:

    1. It is possible to meaningfully surpass top individual models with an
       ensemble system.
    2. Specialization matters -- different models dominate different domains;
       orchestration captures these strengths.
    3. Benchmarks like HLE remain valuable, showing substantial headroom even as
       AI surpasses 50% accuracy.
    4. These capabilities are available today through Sup AI's production API
       and chat interface.

Availability

The full evaluation, code, and results can be accessed at:



     GitHub Repository: https://github.com/supaihq/hle
     Sup AI Platform:   https://sup.ai
     HLE Benchmark:     https://lastexam.ai

Sup AI invites researchers, engineers, and enterprise teams to independently reproduce the results and explore the platform's orchestration capabilities.

Citation
@misc{supai-hle-2025,
  title={Sup AI Achieves 52.15% on Humanity's Last Exam},
  author={Sup AI},
  year={2025},
  url={https://github.com/supaihq/hle}
}

For press inquiries or partnership discussions, please contact: support@sup.ai

View original content to download multimedia: https://www.prnewswire.com/news-releases/sup-ai-sets-new-benchmark-record-with-52-15-on-humanitys-last-exam-302637675.html

SOURCE Sup AI



