Benchmark Methodology

How we test and evaluate AI models

Test Structure

  • 40 tasks across 5 categories
  • 10 models from budget to premium
  • Weekly runs every Monday at 02:00 UTC
  • LLM-judge evaluation by Claude 3.5 Sonnet
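
In cron notation, the weekly schedule above would be written `0 2 * * 1` (minute 0, hour 2, any day of month, any month, Monday), evaluated in UTC. As a minimal sketch, assuming nothing about the scheduler actually used, the next run time can be computed with the standard library alone; `next_run` is a hypothetical helper name, not part of the benchmark's code:

```python
from datetime import datetime, timedelta, timezone


def next_run(now: datetime) -> datetime:
    """Return the next Monday 02:00 UTC strictly after `now`.

    `now` is expected to be timezone-aware.
    """
    now = now.astimezone(timezone.utc)
    # 02:00 UTC on the Monday of the current week (weekday() is 0 for Monday).
    monday = now - timedelta(days=now.weekday())
    candidate = monday.replace(hour=2, minute=0, second=0, microsecond=0)
    # If this week's slot has already passed, move to next week's.
    if candidate <= now:
        candidate += timedelta(weeks=1)
    return candidate


if __name__ == "__main__":
    print(next_run(datetime.now(timezone.utc)).isoformat())
```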

Scoring System

Claude 3.5 Sonnet, acting as the LLM judge, scores each task response on a 0-100 scale:

  • 90-100: Excellent (perfect or near-perfect)
  • 70-89: Good (minor issues)
  • 50-69: Acceptable (noticeable flaws)
  • 0-49: Poor (major issues or failure)

The final score is the arithmetic mean of the scores of all completed tasks, computed both per category and overall.
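
The banding and averaging described above can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's actual code: the function names are hypothetical, an uncompleted task is represented as `None`, fractional scores fall into the band whose lower bound they reach, and the overall score is taken as the mean over all completed tasks rather than the mean of the category means.

```python
from collections import defaultdict
from statistics import mean

# Lower bound of each band on the 0-100 scale, highest first.
BANDS = [(90, "Excellent"), (70, "Good"), (50, "Acceptable"), (0, "Poor")]


def band(score: float) -> str:
    """Map a 0-100 score to its band label."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    for lower, label in BANDS:
        if score >= lower:
            return label
    raise AssertionError("unreachable: the lowest band starts at 0")


def aggregate(results):
    """Average scores per category and overall.

    `results` is an iterable of (category, score) pairs, where score is
    None for a task that did not complete (assumed representation).
    """
    by_category = defaultdict(list)
    for category, score in results:
        if score is not None:  # skip tasks that did not complete
            by_category[category].append(score)
    per_category = {c: mean(s) for c, s in by_category.items()}
    all_scores = [s for scores in by_category.values() for s in scores]
    overall = mean(all_scores) if all_scores else None
    return per_category, overall


if __name__ == "__main__":
    runs = [("coding", 90), ("coding", 80), ("writing", 100), ("writing", None)]
    print(aggregate(runs))
    print(band(93.0))
```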

LLM Judge

We use Claude 3.5 Sonnet as an impartial judge to evaluate responses based on category-specific criteria. This enables automated assessment of quality, task adherence, and response completeness.

Approach: LLM-as-judge with category-specific criteria
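
The judging step can be sketched as below. Everything specific in this example is hypothetical: the criteria strings, the prompt wording, the `Score: <n>` reply format, and the `call_judge` callable that stands in for the request to the judge model. The actual prompts and integration are not described in this document.

```python
import re
from typing import Callable

# Hypothetical criteria; the real category-specific criteria are not listed here.
CRITERIA = {
    "coding": "correctness, task adherence, completeness",
    "writing": "quality of writing, task adherence, completeness",
    "translation": "accuracy, task adherence, completeness",
}


def build_prompt(category: str, task: str, response: str) -> str:
    """Assemble a judge prompt for one task response (illustrative wording)."""
    return (
        f"You are grading a {category} task.\n"
        f"Criteria: {CRITERIA[category]}\n\n"
        f"Task:\n{task}\n\n"
        f"Response:\n{response}\n\n"
        "Reply with a line of the form 'Score: <0-100>'."
    )


def parse_score(reply: str) -> float:
    """Extract a 0-100 score from the judge's reply (assumed format)."""
    match = re.search(r"Score:\s*(\d{1,3}(?:\.\d+)?)", reply)
    if match is None:
        raise ValueError("no score found in judge reply")
    score = float(match.group(1))
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    return score


def judge(category: str, task: str, response: str,
          call_judge: Callable[[str], str]) -> float:
    """Score one response; `call_judge` sends a prompt and returns the reply text."""
    return parse_score(call_judge(build_prompt(category, task, response)))


if __name__ == "__main__":
    # A stub stands in for the judge model so the sketch runs offline.
    print(judge("coding", "Write FizzBuzz.", "def fizzbuzz(): ...",
                lambda prompt: "Score: 93.0"))
```

Passing the model call in as a function keeps prompt construction and score parsing testable without network access.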

Real Examples from Our Tests

Category: coding | Score: 93.0/100

FizzBuzz

Task Input:
Write a Python function that prints numbers from 1 to 15. For multiples of 3, print 'Fizz' instead. For multiples of 5, print 'Buzz'. For multiples of both, print 'FizzBuzz'. Just write the function and call it.
Sample Model Response:
Here's a Python function for FizzBuzz:

```python
def fizzbuzz():
    for i in range(1, 16):
        if i % 3 == 0 and i % 5 == 0:
            print('FizzBuzz')
        elif i % 3 == 0:
            print('Fizz')
        elif i % 5 == 0:
            print('Buzz')
        else:
            print(i)

fizzbuzz()
```

Category: writing | Score: 95.0/100

Ad Headlines (5)

Task Input:
Write 5 different ad headlines for a new AI-powered writing assistant app. Target audience: content creators, bloggers, marketers. Key benefits: saves time, improves quality, works in multiple languages. Each headline should be under 60 characters.
Sample Model Response:
Here are 5 impactful ad headlines under 60 characters, targeting content creators, bloggers, and marketers, emphasizing speed, quality, and multilingual support:

1. **Craft Content Faster: AI-Powered. Time Saved.** *(48 chars - Focuses on speed & efficiency)*
2. **Level Up Your Writing: AI Editor Inside.** *(49 chars - Focuses on quality improvement)*
3. **Write Smarter, Not Harder. Your AI Assistant.** *(52 chars - Highlights efficiency & intelligence)*
4. **Create in 30+...
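
The 60-character limit in this task is a constraint that can also be checked mechanically. The sketch below is not described as part of the benchmark pipeline; it is a hypothetical pre-check, with `strip_markdown` and `check_headlines` as illustrative names, applied to the three complete headlines from the sample response:

```python
MAX_CHARS = 60  # the task asks for headlines "under 60 characters"


def strip_markdown(text: str) -> str:
    """Drop bold markers so that only the visible characters are counted."""
    return text.replace("**", "").strip()


def check_headlines(headlines, limit=MAX_CHARS):
    """Return (length, within_limit, text) for each headline."""
    cleaned = [strip_markdown(h) for h in headlines]
    return [(len(h), len(h) < limit, h) for h in cleaned]


headlines = [
    "**Craft Content Faster: AI-Powered. Time Saved.**",
    "**Level Up Your Writing: AI Editor Inside.**",
    "**Write Smarter, Not Harder. Your AI Assistant.**",
]

if __name__ == "__main__":
    for length, ok, text in check_headlines(headlines):
        print(f"{length:3d} {'ok' if ok else 'TOO LONG'}  {text}")
```

A deterministic check of this kind could complement the LLM judge, since character counts that a model reports about its own output are not guaranteed to match the actual string lengths.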

Category: translation | Score: 96.0/100

Technical RU→EN

Task Input:
Translate this technical text from Russian to English. Maintain technical accuracy: 'Для подключения к API используйте Bearer-токен в заголовке Authorization. Максимальное количество запросов — 100 в минуту. При превышении лимита сервер вернёт код 429 с указанием времени до сброса лимита в заголовк...
Sample Model Response:
Here's the English translation: 'To connect to the API, use a Bearer token in the Authorization header. The maximum request rate is 100 requests per minute. If the limit is exceeded, the server will return code 429 with the time until the limit resets specified in the Retry-After header.'

Models Tested

Premium ($5-15 per 1M tokens)

  • Claude 3.5 Sonnet
  • GPT-4o
  • Gemini 1.5 Pro

Mid-tier ($1-5 per 1M tokens)

  • Claude 3.5 Haiku
  • DeepSeek R1
  • Qwen3 Max

Budget ($0.4-1 per 1M tokens)

  • GPT-4o Mini
  • Gemini 2.0 Flash
  • Llama 3.3 70B
  • Qwen3 235B
