One of Meta’s new flagship AI models, Maverick, released on Saturday, currently ranks second on LM Arena, a benchmark in which human raters compare model outputs and choose the responses they prefer. However, the version of Maverick evaluated on LM Arena appears to differ from the version available to developers.
Several AI researchers noted on X that Meta’s announcement indicated the Maverick on LM Arena is an “experimental chat version.” Additionally, a chart on the official Llama website reveals that the LM Arena tests used “Llama 4 Maverick optimized for conversationality.”
As previously discussed, LM Arena has not been the most reliable indicator of an AI model’s performance for various reasons. However, AI companies typically do not customize their models to achieve better scores on LM Arena, or at least do not publicly acknowledge doing so.
Tailoring a model to a benchmark, then withholding that version and releasing a “vanilla” variant, makes it harder for developers to predict how the model will actually perform in real-world use. It is also misleading. Ideally, benchmarks, despite their inadequacies, should offer insight into a single model’s strengths and weaknesses across a range of tasks.
Researchers on X have noted significant differences in the behavior of the publicly available Maverick compared to the version on LM Arena, with the latter using more emojis and providing excessively lengthy responses.
We have reached out to Meta and Chatbot Arena, the organization that manages LM Arena, for comments.
SOURCE: TECH CRUNCH