Meta accused of Llama 4 bait-n-switch to juice LMArena rank

April 8, 2025

Meta submitted a specially crafted, non-public variant of its Llama 4 AI model to an online benchmark, a move that may have unfairly boosted its leaderboard position over rivals.

The LLM was uploaded to LMArena, a popular site that pits models against each other. It’s admittedly more a popularity contest than a benchmark: two submitted models compete head-to-head, each answers the same input prompt, and users vote on the best output. Thousands of these votes are collected and used to draw up a leaderboard of crowd-sourced LLM performance.
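LMArena’s exact scoring methodology isn’t spelled out here, but as a rough, hypothetical sketch of how thousands of pairwise votes can be turned into a ranking, a classic Elo-style update looks something like this (model names and the K-factor are illustrative, not LMArena’s actual parameters):

```python
# Rough sketch: turning pairwise votes into Elo-style ratings.
# LMArena's real methodology may differ; names and K-factor are illustrative.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Shift both ratings toward the observed vote outcome."""
    ra, rb = ratings[winner], ratings[loser]
    ea = expected_score(ra, rb)
    ratings[winner] = ra + k * (1 - ea)
    ratings[loser] = rb - k * (1 - ea)

# Every model starts at the same baseline; thousands of votes separate them.
ratings = {"model-a": 1000.0, "model-b": 1000.0}
for winner, loser in [("model-a", "model-b"), ("model-a", "model-b"), ("model-b", "model-a")]:
    update(ratings, winner, loser)
print(ratings)
```

A model that consistently charms voters climbs the table, which is exactly why a submission tuned for human preference can outrank a stock release.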

According to LMArena on Monday, Meta provided a version of Llama 4 that is not publicly available and was seemingly specifically designed to charm those human voters, potentially giving it an edge in the rankings over publicly available competitors. And we wonder where artificially intelligent systems get their Machiavellian streak from.

“Early analysis shows style and model response tone was an important factor — demonstrated in style control ranking — and we are conducting a deeper analysis to understand more,” the chatbot ranking platform said Monday evening.

Meta should have made it clearer that Llama-4-Maverick-03-26-Experimental was a customized model to optimize for human preference

“Meta should have made it clearer that Llama-4-Maverick-03-26-Experimental was a customized model to optimize for human preference,” the group added.

Dropped on the world in a rather unusual Saturday launch, Meta’s now publicly available Llama 4 model codenamed Maverick was heralded for its LMArena performance. An “experimental” build of the model sat at number two in the chatbot leaderboard, just behind Google’s Gemini-2.5-Pro-Exp-03-25 release.

To back up its claims that the version of the model submitted for testing was a special custom job, LMArena published a full breakdown. “To ensure full transparency, we’re releasing 2,000-plus head-to-head battle results for public review. This includes user prompts, model responses, and user preferences,” the team said.

From the results published by LMArena to Hugging Face, the “experimental” version of Llama 4 Maverick, the one that went head to head against rivals in the arena, appeared to produce verbose results often peppered with emojis. The public version, the one you’d deploy in applications, produced far more concise responses that were generally devoid of emojis.

It’s important that Meta submit the publicly available versions of its models to the contest, so that when people come to pick and use LLMs in applications, they get the neural network they were expecting and that others had rated. In this case, it appears the “experimental” version entered into the contest differed from the official release.

Llama-4-Maverick-03-26-Experimental is a chat optimized version we experimented with that also performs well on LMArena

The Facebook giant did not deny any of this.

“We experiment with all types of custom variants,” a Meta spokesperson told El Reg.

“Llama-4-Maverick-03-26-Experimental is a chat optimized version we experimented with that also performs well on LMArena. We have now released our open source version and will see how developers customize Llama 4 for their own use cases. We’re excited to see what they will build and look forward to their ongoing feedback.”

Meta for its part wasn’t hiding the fact this was an experimental build. In its launch blog post, the Instagram parent wrote that “Llama 4 Maverick offers a best-in-class performance to cost ratio with an experimental chat version scoring ELO of 1417 on LMArena.”

However, many assumed the experimental model was a beta-style preview, substantially similar to the version released to model hubs like Hugging Face on Saturday.

Suspicions were raised after excited netizens began getting their hands on the official model only to be met with lackluster results. The disconnect between Meta’s benchmark claims and public perception was big enough that Meta GenAI head Ahmad Al-Dahle weighed in on Monday, pointing to inconsistent performance across inference platforms that, he said, still needed time to be properly tuned.

“We’re already hearing lots of great results people are getting with these models. That said, we’re also hearing some reports of mixed quality across different services. Since we dropped the models as soon as they were ready, we expect it’ll take several days for all the public implementations to get dialed in,” Al-Dahle said.

These kinds of issues are known to crop up with new model releases, particularly those that employ novel architectures or implementations. In our testing of Alibaba’s QwQ, we found that misconfiguring the model hyperparameters could result in excessively long responses.
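The article doesn’t say which settings were at fault in either case. Purely as an illustration, these are the kinds of sampling knobs an inference provider has to get right when hosting a fresh release; the model ID and values below are placeholders, not settings recommended by Meta or Alibaba:

```python
# Illustrative only: sampling hyperparameters an inference stack tunes when
# hosting a newly released model. Model ID and values are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/some-new-model"  # placeholder, not a real release
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tok("Explain mixture-of-experts in one paragraph.", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.7,          # set too high, output can ramble
    top_p=0.9,
    repetition_penalty=1.1,   # set too low, the model can loop
    max_new_tokens=512,       # an unbounded cap invites runaway responses
)
print(tok.decode(out[0], skip_special_tokens=True))
```

Until hosting providers converge on sensible defaults for a new architecture, the same model can look brilliant on one service and mediocre on another.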

We’ve also heard claims that we trained on test sets – that’s simply not true and we would never do that

Al-Dahle also denied allegations Meta had cheated by training Llama 4 on LLM benchmark test sets. “We’ve also heard claims that we trained on test sets – that’s simply not true and we would never do that. Our best understanding is that the variable quality people are seeing is due to needing to stabilize implementations.”

The denial followed online speculation that Meta’s leadership had suggested blending AI benchmark test sets into the training process to produce a more presentable result.

In response to this incident, LMArena says it has updated its “leaderboard policies to reinforce our commitment to fair, reproducible evaluations so this confusion doesn’t occur in the future. Meta’s interpretation of our policy did not match what we expect from model providers.”

It also plans to upload the public release of Llama 4 Maverick from Hugging Face to the leaderboard arena. ®