Benchmark Methodology
How we test and evaluate AI models
Test Structure
- 40 tasks across 5 categories
- 10 models, from budget to premium
- Runs every Monday at 02:00 UTC
- Evaluated by an LLM judge (Claude 3.5 Sonnet)
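The weekly schedule (every Monday at 02:00 UTC) can be sketched as a small helper that computes the next run time. This is a minimal illustration, not the benchmark's actual scheduler; the function name `next_run` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def next_run(now: datetime) -> datetime:
    """Return the next Monday 02:00 UTC strictly after `now`.

    Hypothetical helper; `now` must be timezone-aware.
    """
    now = now.astimezone(timezone.utc)
    # Monday is weekday 0: step back to this week's Monday, then pin 02:00.
    candidate = (now - timedelta(days=now.weekday())).replace(
        hour=2, minute=0, second=0, microsecond=0
    )
    # If this week's slot has already passed, move to next week's.
    if candidate <= now:
        candidate += timedelta(weeks=1)
    return candidate
```

For example, calling `next_run(datetime.now(timezone.utc))` gives the start of the upcoming weekly run.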
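Scoring System
Each task is scored by Claude 3.5 Sonnet on a 0-100 scale:
- 90-100: Excellent (flawless)
- 70-89: Good (minor issues)
- 50-69: Acceptable (flaws)
- 0-49: Poor (failure)
Final score = the mean across all tasks.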
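The bands and the final-score rule above can be sketched in a few lines of Python. This is an illustration under assumptions: the names `band` and `final_score` are hypothetical, and a fractional score such as 89.5 is assumed to fall into the lower band, since the rubric only lists integer boundaries.

```python
from statistics import mean

# (lower bound, label) pairs taken from the rubric, highest band first.
BANDS = [(90, "Excellent"), (70, "Good"), (50, "Acceptable"), (0, "Poor")]


def band(score: float) -> str:
    """Map a 0-100 task score to its rubric label."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    for lower, label in BANDS:
        if score >= lower:
            return label
    raise AssertionError("unreachable: the lowest band starts at 0")


def final_score(task_scores: list[float]) -> float:
    """Final score is the arithmetic mean of all task scores."""
    return mean(task_scores)
```

With illustrative (not measured) scores of 90, 80, and 70, `final_score` returns 80, which `band` labels "Good".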
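LLM Judge
We use Claude 3.5 Sonnet as an impartial judge that evaluates each response against criteria specific to its category. This allows quality, task adherence, and response completeness to be assessed automatically.
Method: LLM-as-judge with category-specific criteria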
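One way an LLM-as-judge step with category-specific criteria might be wired up is sketched below. Everything here is an assumption for illustration: the criteria strings, the prompt wording, the `SCORE:` reply format, and the `ask_judge` callable (a stand-in for whatever client sends the prompt to the judge model) are hypothetical and are not the benchmark's actual implementation.

```python
import re
from typing import Callable

# Hypothetical criteria; the real per-category criteria are not given here.
CRITERIA = {
    "coding": "correctness, readability, adherence to the task",
    "writing": "clarity, tone, adherence to stated constraints",
    "translation": "accuracy, terminology, fluency",
}


def build_judge_prompt(category: str, task: str, response: str) -> str:
    """Assemble a grading prompt using the criteria for `category`."""
    return (
        f"You are grading a {category} task.\n"
        f"Criteria: {CRITERIA[category]}\n\n"
        f"Task:\n{task}\n\n"
        f"Response:\n{response}\n\n"
        "Reply with a line of the form 'SCORE: <0-100>'."
    )


def parse_score(reply: str) -> float:
    """Extract the 0-100 score from the judge's reply."""
    match = re.search(r"SCORE:\s*(\d+(?:\.\d+)?)", reply)
    if match is None:
        raise ValueError("no score found in judge reply")
    score = float(match.group(1))
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    return score


def judge(category: str, task: str, response: str,
          ask_judge: Callable[[str], str]) -> float:
    """Grade one response; `ask_judge` sends a prompt and returns the reply."""
    return parse_score(ask_judge(build_judge_prompt(category, task, response)))
```

Passing the model call in as `ask_judge` keeps the sketch independent of any particular SDK and lets the parsing logic be tested with a stub.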
Real examples from our tests
coding score: 93.0/100
FizzBuzz
Task input:
Write a Python function that prints numbers from 1 to 15. For multiples of 3, print 'Fizz' instead. For multiples of 5, print 'Buzz'. For multiples of both, print 'FizzBuzz'. Just write the function and call it.
Sample model response:
Here's a Python function for FizzBuzz:
```python
def fizzbuzz():
    for i in range(1, 16):
        if i % 3 == 0 and i % 5 == 0:
            print('FizzBuzz')
        elif i % 3 == 0:
            print('Fizz')
        elif i % 5 == 0:
            print('Buzz')
        else:
            print(i)

fizzbuzz()
```
writing score: 95.0/100
Ad Headlines (5)
Task input:
Write 5 different ad headlines for a new AI-powered writing assistant app. Target audience: content creators, bloggers, marketers. Key benefits: saves time, improves quality, works in multiple languages. Each headline should be under 60 characters.
Sample model response:
Here are 5 impactful ad headlines under 60 characters, targeting content creators, bloggers, and marketers, emphasizing speed, quality, and multilingual support:
1. **Craft Content Faster: AI-Powered. Time Saved.**
*(48 chars - Focuses on speed & efficiency)*
2. **Level Up Your Writing: AI Editor Inside.**
*(49 chars - Focuses on quality improvement)*
3. **Write Smarter, Not Harder. Your AI Assistant.**
*(52 chars - Highlights efficiency & intelligence)*
4. **Create in 30+...
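The task's hard constraint (each headline under 60 characters) is mechanically checkable, so an evaluation could recompute lengths instead of relying on the counts a response reports about itself. A minimal sketch, assuming the headlines have already been extracted from the response; the helper names are hypothetical, and only the three headlines shown in full above are used.

```python
MAX_CHARS = 60  # limit stated in the task input


def headline_lengths(headlines: list[str]) -> dict[str, int]:
    """Recompute the character count of each headline."""
    return {h: len(h) for h in headlines}


def all_under_limit(headlines: list[str], limit: int = MAX_CHARS) -> bool:
    """True if every headline is strictly shorter than `limit` characters."""
    return all(len(h) < limit for h in headlines)


# The three headlines that appear in full in the sample response.
headlines = [
    "Craft Content Faster: AI-Powered. Time Saved.",
    "Level Up Your Writing: AI Editor Inside.",
    "Write Smarter, Not Harder. Your AI Assistant.",
]

print(headline_lengths(headlines))
print(all_under_limit(headlines))
```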
translation score: 96.0/100
Technical RU→EN
Task input:
Translate this technical text from Russian to English. Maintain technical accuracy:
'Для подключения к API используйте Bearer-токен в заголовке Authorization. Максимальное количество запросов — 100 в минуту. При превышении лимита сервер вернёт код 429 с указанием времени до сброса лимита в заголовк...
Sample model response:
Here's the English translation:
'To connect to the API, use a Bearer token in the Authorization header. The maximum request rate is 100 requests per minute. If the limit is exceeded, the server will return code 429 with the time until the limit resets specified in the Retry-After header.'
Models Tested
Premium ($5-15/1M)
- Claude 3.5 Sonnet
- GPT-4o
- Gemini 1.5 Pro
Mid-range ($1-5/1M)
- Claude 3.5 Haiku
- DeepSeek R1
- Qwen3 Max
Budget ($0.4-1/1M)
- GPT-4o Mini
- Gemini 2.0 Flash
- Llama 3.3 70B
- Qwen3 235B
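The tier list above can be captured as plain data, which makes it easy to look up a model's tier or confirm the total of 10 models. A minimal sketch: the tier names, price ranges, and model names come from the list, while the structure, the helper names, and the reading of "/1M" as "per 1M tokens" are assumptions.

```python
# Tiers, price ranges (USD), and models copied from the list above.
# Interpreting "/1M" as "per 1M tokens" is an assumption.
TIERS = {
    "Premium": {
        "price_usd_per_1m": (5.0, 15.0),
        "models": ["Claude 3.5 Sonnet", "GPT-4o", "Gemini 1.5 Pro"],
    },
    "Mid-range": {
        "price_usd_per_1m": (1.0, 5.0),
        "models": ["Claude 3.5 Haiku", "DeepSeek R1", "Qwen3 Max"],
    },
    "Budget": {
        "price_usd_per_1m": (0.4, 1.0),
        "models": ["GPT-4o Mini", "Gemini 2.0 Flash", "Llama 3.3 70B", "Qwen3 235B"],
    },
}


def tier_of(model: str) -> str:
    """Return the tier name that lists `model`."""
    for tier, info in TIERS.items():
        if model in info["models"]:
            return tier
    raise KeyError(model)


def model_count() -> int:
    """Total number of models across all tiers."""
    return sum(len(info["models"]) for info in TIERS.values())
```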