Musk’s xAI may have fudged the AI benchmark results for Grok 3.


An OpenAI employee has accused Elon Musk’s xAI of publishing misleading benchmark results for Grok 3. xAI posted a graph on its blog showing Grok 3’s performance on AIME 2025, a compilation of challenging competition math problems. The graph showed Grok 3 outperforming OpenAI’s model, but the OpenAI employee noted that a key performance metric was missing from it.

The graph omitted o3-mini-high’s AIME 2025 score at “cons@64,” a metric that allows a model multiple attempts at each problem. AIME’s validity as an AI benchmark is debated, but it is commonly used to evaluate models’ mathematical ability.

The term “cons@64” stands for “consensus@64”: the model is given 64 attempts at each benchmark problem, and its most common answer is taken as the final response. Because this can boost scores significantly, omitting the metric from a comparison graph can create a misleading impression of one model’s superiority.
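The consensus mechanism described above is simple majority voting over repeated samples. A minimal sketch in Python, assuming a hypothetical `answer_fn` that returns the model’s answer on a single sampled attempt (the actual evaluation harnesses used by xAI and OpenAI are not public):

```python
import random
from collections import Counter

def consensus_at_k(answer_fn, problem, k=64):
    """Query the model k times and return the most common answer.

    answer_fn is a hypothetical callable standing in for one sampled
    model attempt; real benchmark harnesses may break ties differently.
    """
    answers = [answer_fn(problem) for _ in range(k)]
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return most_common_answer

# Toy stand-in "model": answers 42 about 60% of the time, otherwise
# returns a random wrong answer. Majority voting recovers 42 reliably
# even though a single attempt is wrong 40% of the time.
random.seed(0)
def noisy_model(problem):
    return 42 if random.random() < 0.6 else random.randint(0, 99)

print(consensus_at_k(noisy_model, "dummy AIME problem", k=64))
```

This illustrates why cons@64 scores run well above single-attempt (first-response) scores, and why comparing one model’s consensus score against another’s single-attempt score is apples to oranges.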

In the benchmark results, Grok 3 Reasoning Beta and Grok 3 mini Reasoning both scored below o3-mini-high on the single-attempt metric. Grok 3 Reasoning Beta also slightly trailed OpenAI’s o1 model running at medium computing effort. Despite this, xAI markets Grok 3 as the “world’s smartest AI.”

Igor Babushkin of xAI defended the company by pointing out that OpenAI has itself published misleading benchmark charts that compared only its own models, arguing this justified xAI’s omission of certain data points from Grok 3’s performance graph against OpenAI’s models.

