Debates over AI benchmarks — and the way they’re reported by AI labs — are spilling out into public view.
This week, an OpenAI employee accused Elon Musk’s AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of xAI’s co-founders, Igor Babushkin, insisted that the company was in the right.
The truth lies somewhere in between.
In a post on xAI’s blog, the company published a graph showing Grok 3’s performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME’s validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model’s math ability.
xAI’s graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI’s best-performing available model, o3-mini-high, on AIME 2025. But OpenAI employees on X were quick to point out that xAI’s graph didn’t include o3-mini-high’s AIME 2025 score at “cons@64.”
What is cons@64, you might ask? It’s short for “consensus@64,” and it basically gives a model 64 tries to answer each problem in a benchmark, taking the answers generated most frequently as the final answers. As you can imagine, cons@64 tends to boost models’ benchmark scores quite a bit, so omitting it from a graph can make it appear as though one model surpasses another when in reality that isn’t the case.
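To make the distinction concrete, here is a minimal sketch of how a cons@64 score relates to a single-attempt (“@1”) score; `sample_fn` and `problems` are hypothetical stand-ins for a model call and a benchmark, not any lab’s actual evaluation code:

```python
from collections import Counter

def consensus_answer(attempts):
    """Return the most frequent answer among the sampled attempts
    (the 'consensus' answer)."""
    return Counter(attempts).most_common(1)[0][0]

def score(problems, sample_fn, k=64):
    """Compute @1 and cons@k accuracy over a benchmark.

    problems:  list of (question, correct_answer) pairs
    sample_fn: hypothetical stand-in; returns one sampled model answer
    """
    at_1_hits = cons_hits = 0
    for question, correct in problems:
        attempts = [sample_fn(question) for _ in range(k)]
        at_1_hits += attempts[0] == correct              # @1: first try only
        cons_hits += consensus_answer(attempts) == correct  # cons@k: majority vote
    n = len(problems)
    return at_1_hits / n, cons_hits / n
```

Because the majority vote draws on 64 samples rather than one, it tends to lift scores, which is why a cons@64 number should only be compared against another cons@64 number.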
Grok 3 Reasoning Beta’s and Grok 3 mini Reasoning’s scores for AIME 2025 at “@1” (meaning the first score the models got on the benchmark) fall below o3-mini-high’s score. Grok 3 Reasoning Beta also trails ever so slightly behind OpenAI’s o1 model set to “medium” computing. Yet xAI is advertising Grok 3 as the “world’s smartest AI.”
Babushkin argued on X that OpenAI has published similarly misleading benchmark charts in the past, albeit charts comparing the performance of its own models. A more neutral party in the debate put together a more “accurate” graph showing nearly every model’s performance at cons@64:
Hilarious how some people see my plot as attack on OpenAI and others as attack on Grok while in reality it’s DeepSeek propaganda

(I actually believe Grok looks good there, and openAI’s TTC chicanery behind o3-mini-*high*-pass@”””1″”” deserves more scrutiny.) https://t.co/dJqlJpcJh8 pic.twitter.com/3WH8FOUfic

— Teortaxes (DeepSeek Twitter superfan 2023 – ∞) (@teortaxesTex) February 20, 2025
But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took for each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models’ limitations, as well as their strengths.