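Benchmark Methodology

How we test and evaluate AI models

Test structure

• 40 tasks across 5 categories
• 10 models, from budget to premium
• Runs every Monday at 02:00 UTC
• Responses are scored by an LLM judge (Claude 3.5 Sonnet)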
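The weekly schedule above can be sketched as a small helper that computes the next run time. This is only an illustration: the scheduler the benchmark actually uses is not described here, and the function name `next_run` is hypothetical. In standard cron syntax the same schedule would be written `0 2 * * 1` on a host whose clock is set to UTC.

```python
from datetime import datetime, timedelta, timezone


def next_run(now: datetime) -> datetime:
    """Return the next Monday 02:00 UTC strictly after `now`.

    `now` must be timezone-aware; it is converted to UTC first.
    """
    now = now.astimezone(timezone.utc)
    # Start from this week's Monday at 02:00 UTC (weekday() == 0 is Monday).
    candidate = (now - timedelta(days=now.weekday())).replace(
        hour=2, minute=0, second=0, microsecond=0
    )
    # If that moment has already passed (or is exactly now), use next week's.
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


if __name__ == "__main__":
    print(next_run(datetime.now(timezone.utc)).isoformat())
```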

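Scoring system

Claude 3.5 Sonnet scores each task on a 0-100 scale:

90-100: Excellent (flawless)
70-89: Good (minor issues)
50-69: Acceptable (noticeable flaws)
0-49: Poor (failed)

Final score = the mean of all task scores.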
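The bands and the averaging rule above can be expressed in a few lines. This is a minimal sketch, not the benchmark's actual code: the names `band` and `final_score` are hypothetical, and treating a fractional score such as 89.5 as belonging to the lower band is an assumption, since the bands are stated only for whole numbers.

```python
from statistics import mean

# Lower bound of each band, highest first (labels taken from the scale above).
BANDS = [(90, "Excellent"), (70, "Good"), (50, "Acceptable"), (0, "Poor")]


def band(score: float) -> str:
    """Map a 0-100 score to its band label."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    # Assumption: a fractional score falls into the band of the bound below it.
    return next(label for lower, label in BANDS if score >= lower)


def final_score(task_scores: list[float]) -> float:
    """Final score is the arithmetic mean of all task scores."""
    if not task_scores:
        raise ValueError("no task scores to average")
    return mean(task_scores)


if __name__ == "__main__":
    scores = [90, 80, 70]  # illustrative values, not benchmark results
    print(final_score(scores), band(final_score(scores)))
```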

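LLM judge

We use Claude 3.5 Sonnet as an impartial judge that evaluates each response against criteria specific to its category. This lets us automatically assess quality, adherence to the task, and completeness of the response.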

Method: LLM-as-judge with category-specific criteria
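To make the LLM-as-judge pattern concrete, here is a minimal sketch of how a judge prompt could be assembled and its reply parsed. Everything in it is an assumption for illustration: the criteria text in `CRITERIA`, the `SCORE:` reply format, and the `ask_judge` callable (a stand-in for whatever client sends the prompt to the judge model) are hypothetical and do not reflect the benchmark's actual rubric or prompts.

```python
import re
from typing import Callable

# Hypothetical criteria, for illustration only; the real rubric is not shown here.
CRITERIA = {
    "coding": "correctness, runs as written, follows the task constraints",
    "writing": "clarity, tone, adherence to length and audience constraints",
    "translation": "accuracy, terminology, fluency in the target language",
}


def build_judge_prompt(category: str, task: str, response: str) -> str:
    """Assemble a judge prompt that embeds the category's criteria."""
    return (
        f"You are grading a {category} task.\n"
        f"Criteria: {CRITERIA[category]}\n\n"
        f"Task:\n{task}\n\n"
        f"Response:\n{response}\n\n"
        "Reply with a line of the form 'SCORE: <integer 0-100>'."
    )


def parse_score(judge_reply: str) -> int:
    """Extract and validate the integer score from the judge's reply."""
    match = re.search(r"SCORE:\s*(\d{1,3})\b", judge_reply)
    if match is None:
        raise ValueError("judge reply contains no SCORE line")
    score = int(match.group(1))
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    return score


def judge(category: str, task: str, response: str,
          ask_judge: Callable[[str], str]) -> int:
    """Score one response; `ask_judge` sends a prompt and returns the reply text."""
    return parse_score(ask_judge(build_judge_prompt(category, task, response)))
```

Passing the model call in as a plain callable keeps the sketch independent of any particular SDK and makes it easy to test with a stub, for example `judge("coding", task, reply, lambda prompt: "SCORE: 93")`.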

Real examples from our tests

Coding score: 93.0/100

FizzBuzz

Task input:

Write a Python function that prints numbers from 1 to 15. For multiples of 3, print 'Fizz' instead. For multiples of 5, print 'Buzz'. For multiples of both, print 'FizzBuzz'. Just write the function and call it.

Model response example:

Here's a Python function for FizzBuzz:

```python
def fizzbuzz():
    for i in range(1, 16):
        if i % 3 == 0 and i % 5 == 0:
            print('FizzBuzz')
        elif i % 3 == 0:
            print('Fizz')
        elif i % 5 == 0:
            print('Buzz')
        else:
            print(i)

fizzbuzz()
```

Writing score: 95.0/100

Ad Headlines (5)

Task input:

Write 5 different ad headlines for a new AI-powered writing assistant app. Target audience: content creators, bloggers, marketers. Key benefits: saves time, improves quality, works in multiple languages. Each headline should be under 60 characters.

Model response example:

Here are 5 impactful ad headlines under 60 characters, targeting content creators, bloggers, and marketers, emphasizing speed, quality, and multilingual support:

1. **Craft Content Faster: AI-Powered. Time Saved.** *(48 chars - Focuses on speed & efficiency)*
2. **Level Up Your Writing: AI Editor Inside.** *(49 chars - Focuses on quality improvement)*
3. **Write Smarter, Not Harder. Your AI Assistant.** *(52 chars - Highlights efficiency & intelligence)*
4. **Create in 30+...
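A hard constraint such as "under 60 characters" can also be verified mechanically, alongside the judge's score. The sketch below is an assumption about how such a check might look, not a description of the benchmark pipeline; the helper names are hypothetical. It measures the headline text directly rather than relying on the character counts the model reports, and it only picks up headlines whose bold markers are closed, so a truncated item is skipped.

```python
import re


def extract_headlines(reply: str) -> list[str]:
    """Return the bold text of numbered items such as '1. **Headline**'."""
    return re.findall(r"\d+\.\s*\*\*(.+?)\*\*", reply)


def too_long(headlines: list[str], limit: int = 60) -> list[str]:
    """Return the headlines that are not strictly under `limit` characters."""
    return [h for h in headlines if len(h) >= limit]


if __name__ == "__main__":
    reply = (
        "1. **Craft Content Faster: AI-Powered. Time Saved.**\n"
        "2. **Level Up Your Writing: AI Editor Inside.**\n"
        "3. **Write Smarter, Not Harder. Your AI Assistant.**\n"
    )
    headlines = extract_headlines(reply)
    for h in headlines:
        print(len(h), h)
    print("violations:", too_long(headlines))
```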

Translation score: 96.0/100

Technical RU→EN

Task input:

Translate this technical text from Russian to English. Maintain technical accuracy: 'Для подключения к API используйте Bearer-токен в заголовке Authorization. Максимальное количество запросов — 100 в минуту. При превышении лимита сервер вернёт код 429 с указанием времени до сброса лимита в заголовк...

Model response example:

Here's the English translation: 'To connect to the API, use a Bearer token in the Authorization header. The maximum request rate is 100 requests per minute. If the limit is exceeded, the server will return code 429 with the time until the limit resets specified in the Retry-After header.'

Models tested

Premium ($5-15/1M)

• Claude 3.5 Sonnet
• GPT-4o
• Gemini 1.5 Pro

Mid-range ($1-5/1M)

• Claude 3.5 Haiku
• DeepSeek R1
• Qwen3 Max

Budget ($0.4-1/1M)

• GPT-4o Mini
• Gemini 2.0 Flash
• Llama 3.3 70B
• Qwen3 235B

Categories

💻 Coding (8 tasks)

✍️ Writing (8 tasks)

🌍 Translation (8 tasks)

📊 Analysis (8 tasks)

🧠 Reasoning (6 tasks)
