GPT-4o mini's ranking collapses as the Large Model Arena updates its rules: Altman's score-boosting tricks no longer work.
Mengchen from Aofei Temple
Quantum Bit | Public Account: QbitAI
With the Large Model Arena's rule update, GPT-4o mini's ranking immediately plummeted out of the top 10.
The new rankings control for features such as response length and style, so that scores better reflect a model's actual ability to solve the problem.
Tricks that please voters and boost rankings, such as prettier formatting and more subheadings, are now useless.
Under the new rules, Altman's GPT-4o mini and Musk's Grok-2 series saw significant drops in rankings, and Google's small Gemini-1.5-flash model also declined somewhat.
Meanwhile, the scores of the Claude series and the large Llama-3.1-405b model all rose.
When only Hard Prompt tasks are considered, the advantage of large models on the style-controlled leaderboard is even more obvious.
Previously, GPT-4o mini had topped the list, tied for first place with the full version of GPT-4o, which clearly contradicted users' real-world experience.
The LMSYS Large Model Arena, an evaluation standard once recommended by Karpathy, had seen its reputation fall to the point of "only reflecting user preferences rather than model capabilities."
LMSYS took the criticism to heart. It first released data from 1,000 battles involving GPT-4o mini and analyzed the factors that influence voting, including the model's refusal rate, the length of generated content, and formatting.
Moreover, before GPT-4o mini's release, Altman had hinted that it was optimized for human preferences.
Now, LMSYS has gone a step further, introducing new algorithms to control for these factors, and this is only the first step of the plan.
How to control the influence of style?
Suppose model A is good at generating code and factual, unbiased answers, but its output is very concise.
Model B is weaker on substance (such as correctness), but its output is long, detailed, and beautifully formatted.
So which one is better?
There is no single answer, so LMSYS attempts to mathematically separate how much of a model's score is contributed by content versus style.
In addition, recent studies have shown that humans may have a preference for AI answers that are beautifully formatted and more detailed.
The approach adds style features, such as response length, number of markdown subheadings, number of lists, and amount of bold text, as independent variables in a Bradley-Terry regression.
This is a common technique in statistics and has recently been used for large model evaluation by AlpacaEval LC and others.
Including a confounding variable (such as response length) in the regression attributes score increases driven by that variable to the variable itself, rather than to the model's ability.
The relevant code has been made public on Google Colab.
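The idea can be sketched with synthetic data: make one model's responses systematically longer, let simulated voters reward length, and compare a plain Bradley-Terry fit with one that includes the length difference as a covariate. This is a minimal illustration using scikit-learn; the data, model names, and numbers are all invented, not LMSYS's actual implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy illustration of a Bradley-Terry regression with a style covariate.
# Everything below is synthetic, not LMSYS's real code or data.
rng = np.random.default_rng(0)
n_models, n_battles = 4, 5000
true_strength = np.array([1.0, 0.5, 0.0, -0.5])  # latent model abilities
verbosity = np.array([0.0, 1.0, 0.0, 0.0])       # model 1 writes longer answers
len_bias = 0.8                                   # voters reward longer answers

# Each battle pits model a against model b; design row is +1 for a, -1 for b.
a = rng.integers(0, n_models, n_battles)
b = (a + rng.integers(1, n_models, n_battles)) % n_models
X_model = np.zeros((n_battles, n_models))
X_model[np.arange(n_battles), a] = 1.0
X_model[np.arange(n_battles), b] = -1.0

# Style feature: normalized length difference between the two responses.
len_diff = verbosity[a] - verbosity[b] + rng.normal(0, 0.3, n_battles)

# Simulate votes: P(a wins) follows a logistic model of strength + style.
logit = X_model @ true_strength + len_bias * len_diff
y = (rng.random(n_battles) < 1 / (1 + np.exp(-logit))).astype(int)

# Plain Bradley-Terry (style ignored) vs. style-controlled regression.
bt_plain = LogisticRegression(fit_intercept=False).fit(X_model, y)
X_ctrl = np.column_stack([X_model, len_diff])
bt_ctrl = LogisticRegression(fit_intercept=False).fit(X_ctrl, y)

print("plain strengths:     ", bt_plain.coef_[0].round(2))
print("controlled strengths:", bt_ctrl.coef_[0][:n_models].round(2))
print("length coefficient:  ", round(float(bt_ctrl.coef_[0][-1]), 2))
```

In the plain fit, the verbose model's strength is inflated by the voters' length bias; once the length difference enters the regression as a covariate, that inflation is absorbed into the length coefficient and the strength estimates move back toward the true values.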
The team also ran ablation experiments that control only for length and only for format; the scores of GPT-4o mini and Google's Gemini series turn out to be more affected by format.
However, the approach has limitations: unobserved confounders may remain, and length can correlate positively with genuine response quality (for example, chain-of-thought reasoning), which the regression does not account for.
Many netizens said that the adjusted difficult task list is more consistent with their subjective impressions.
Others feel that it is precisely this back-and-forth game between the leaderboard and the model companies competing on it that pushes the whole field forward.
Do you still use the Large Model Arena results to pick models? Or do you have a better evaluation method? Share it in the comments.
Reference links:
[1] https://x.com/lmsysorg/status/1829216988021043645
[2] https://lmsys.org/blog/2024-08-28-style-control/
[3] https://arxiv.org/abs/2402.10669
- End -