LLM Leaderboard & Comparison
Compare top AI models by quality, speed, price, and benchmarks. Find the best LLM for your use case with real-time rankings.
Compare Models
Discover the top-performing LLM models by evaluating and comparing their key metrics in depth.
LLM Leaderboard
Largest Context
Llama 4 Scout
10M Context Window
Most Expensive
GPT-5.4 pro
$0.21/1K Tokens
Least Expensive
DeepSeek V3
$0/1K Tokens
Best GPQA
GPT-5.4 pro
94.4% GPQA
Best SWE-Bench
Claude Opus 4.6
80.8% Verified
Fastest TPS
Cerebras Llama 4 Scout
2600 Tokens/s
Model Performance and Benchmark Scores
A comparative analysis of model capabilities across different benchmarks.
| Model | GPQA | MMMU | HLE | SWEBench | BrowseComp |
|---|---|---|---|---|---|
| Kimi K2.5 | 75.1% | 73.2% | 28.4% | N/A | N/A |
| GPT-5 nano | 71.2% | 75.6% | 8.7% | 54.7% | 80.4% |
| GPT-5 mini | 82.3% | 81.6% | 16.7% | 71% | 89.4% |
| GPT-5 | 85.7% | 84.2% | 24.8% | 72.8% | 90% |
| GPT-5.4 | 92.8% | 81.2% | 39.8% | 57.7% | 82.7% |
| GPT-5.4 pro | 94.4% | N/A | 42.7% | N/A | 89.3% |
| Claude Sonnet 4.5 | 83.4% | 77.8% | 17.7% | 77.2% | 67.23% |
| Claude Haiku 4.5 | N/A | N/A | N/A | N/A | 54.7% |
| Claude Opus 4.1 | 81% | 77.1% | N/A | 74.5% | N/A |
| Claude Sonnet 4.6 | 89.9% | 74.5% | 33.2% | 79.6% | 74.72% |
| Claude Opus 4.6 | 91.3% | 73.9% | 40% | 80.8% | 83.73% |
| Grok-4 | 87.5% | N/A | 25.4% | N/A | N/A |
| Grok-4-0709 | 87.5% | N/A | 25.4% | N/A | N/A |
| Grok-3-mini | 66.2% | 69.4% | N/A | N/A | N/A |
| Grok-4-fast-reasoning | 85.7% | N/A | 20% | N/A | N/A |
| Grok-4-fast-non-reasoning | 85.7% | N/A | 20% | N/A | N/A |
| Gemini 3.1 Pro Preview | 91.9% | 81% | 37.5% | 76.2% | N/A |
| Gemini 3 Flash Preview | 90.4% | 81.2% | 33.7% | 78% | N/A |
| o3 | 83.3% | 82.9% | 20.2% | 69.1% | 88.3% |
| o4-mini | 81.4% | 81.6% | 14.7% | 68.1% | 80% |
| GPT-4.1 | 66.3% | 74.8% | 5.4% | 54.6% | 85.9% |
| GPT-4.1 mini | 65% | 72.7% | 3.7% | 23.6% | 89% |
| GPT-4.1 nano | 50.3% | 55.4% | N/A | N/A | 89.4% |
| Grok 3 | 75.4% | 73.2% | N/A | N/A | N/A |
| o3-mini | 77% | N/A | N/A | 48.9% | N/A |
| o1 | 75.7% | 77.3% | N/A | 48.9% | 9.9% |
| GPT-4o (omni) | 53.6% | 69.1% | N/A | 30.7% | 0.6% |
| GPT-4o mini | N/A | 59.4% | N/A | N/A | N/A |
| DeepSeek V3 | 68.4% | N/A | N/A | N/A | N/A |
| DeepSeek-R1 | 81% | N/A | N/A | N/A | N/A |
We show benchmark data only when it has been sourced for a model. Pricing, context, and model availability are synced from the shared catalog used by the pricing calculator.
Email your comparison
Get the full LLM comparison data delivered to your inbox.
Frequently Asked Questions
What is the fastest LLM model in terms of tokens per second?
The fastest model in terms of tokens per second is currently Cerebras Llama 4 Scout, which generates roughly 2,600 tokens per second. Higher TPS (tokens per second) indicates faster text generation and processing.
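TPS is just throughput: tokens generated divided by wall-clock generation time. A minimal sketch of the calculation (the figures in the comment use the leaderboard's headline number, rounded for illustration):

```python
def tokens_per_second(num_tokens: int, elapsed_seconds: float) -> float:
    """Throughput: generated tokens divided by wall-clock generation time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return num_tokens / elapsed_seconds

# e.g. 13,000 tokens generated in 5 seconds -> 2,600 tokens/s
rate = tokens_per_second(13_000, 5.0)
```

Note that vendor TPS figures usually exclude time-to-first-token, so end-to-end response time can be noticeably longer than the raw throughput suggests.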
What is MMLU?
MMLU stands for Massive Multitask Language Understanding. It is a benchmark designed to evaluate a model’s general knowledge and understanding across a wide range of topics. This includes subjects like history, science, literature, and law. The MMLU score reflects how well a model handles diverse questions and information, providing a measure of its broad language comprehension and problem-solving skills.
What is HumanEval?
HumanEval is a benchmark used to assess a model’s coding and programming abilities. It features a set of programming problems aimed at evaluating how effectively a model can write functional code, solve algorithms, and debug issues. This benchmark measures a model’s proficiency in understanding and generating correct, efficient code across various programming languages.
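HumanEval results are usually reported as pass@k: the probability that at least one of k generated samples passes the problem's unit tests. The standard unbiased estimator (given n samples per problem, c of which are correct) can be sketched as:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn from n generations of which c are correct, passes."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws without a success
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))
```

For example, pass@1 with 5 correct out of 10 samples reduces to the plain accuracy 5/10 = 0.5.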
What is GSM8K?
GSM8K (Grade-School Mathematics 8K) is a benchmark focused on evaluating a model’s mathematical capabilities. It includes a variety of grade-school level math problems, such as arithmetic and algebra. The GSM8K score indicates how well a model can perform calculations, understand mathematical concepts, and solve problems accurately.
Which LLM models are best for multitask reasoning?
The top models for multitask reasoning based on MMLU scores are:
- GPT-4o: 88.7%
- Llama 3.1: 88.6%
- Claude-3.5 Sonnet: 88.3%
These models excel in handling a broad range of tasks and domains, demonstrating strong general knowledge and problem-solving skills.
Can I compare multiple LLM models at once?
Yes, you can use our comparison tool to evaluate and compare multiple LLM models simultaneously. This tool allows you to assess models based on various criteria such as performance, cost, and speed, helping you choose the best option for your needs.
Which LLM model is the most cost-effective?
According to the leaderboard above, DeepSeek V3 is currently the least expensive model at $0 per 1,000 tokens. Among paid models, GPT-4o mini is a popular low-cost option at $0.0007 per 1,000 tokens, making it suitable for projects with limited budgets that need a low-cost option without compromising too much on performance.
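Per-token pricing makes cost easy to estimate: tokens divided by 1,000, times the per-1K rate, usually with separate input and output rates. A minimal sketch (the prices in the example are illustrative; check your provider's pricing page):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Estimated cost of one API call, given per-1K-token prices.
    Most providers bill input and output tokens at different rates."""
    return (input_tokens / 1000) * input_price_per_1k \
         + (output_tokens / 1000) * output_price_per_1k

# 2,000 input + 500 output tokens at a flat $0.0007/1K -> $0.00175
cost = request_cost(2000, 500, 0.0007, 0.0007)
```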
What is the most expensive LLM model available?
According to the leaderboard above, GPT-5.4 pro is currently the most expensive model at $0.21 per 1,000 tokens. The higher cost reflects its advanced capabilities, which contribute to its leading benchmark performance.
What are the best and most advanced open-source LLM models available?
The most advanced open-weight LLM model currently available is Llama 3.1 405b. This model offers high performance and can be freely accessed and modified. However, it may not always match the performance of commercial models like GPT-4 or GPT-4 Turbo in every benchmark.
What is inference speed?
Inference speed is the time it takes an LLM model to generate a response or execute a query after receiving input. It is measured in seconds and impacts how quickly a model can provide results.
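Measuring inference speed yourself is straightforward: time the call from prompt submission to completed response. A minimal sketch, where `generate` is a placeholder for any callable you want to benchmark (an API client, a local model, etc.):

```python
import time

def timed_generate(generate, prompt: str):
    """Measure end-to-end latency of a text-generation call.
    `generate` is any callable taking a prompt and returning text."""
    start = time.perf_counter()
    output = generate(prompt)
    latency = time.perf_counter() - start
    return output, latency

# Stand-in "model" so the sketch is runnable without an API key.
out, secs = timed_generate(lambda p: p.upper(), "hello")
```

For streaming APIs you would typically also record time-to-first-token separately, since perceived responsiveness depends on it as much as on total latency.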
Can LLM models be fine-tuned for specific tasks?
Yes, many LLM models may be fine-tuned for specific tasks or sectors. Fine-tuning improves the model's performance on specialised tasks by training it on domain-specific data, increasing its relevance and accuracy for particular applications.
What are the limitations of current LLM models?
Current LLM models have several limitations:
- Context length: Even models with large context windows may struggle with very long texts.
- Understanding nuances: Some models may misinterpret complex or ambiguous language.
- Cost and resources: Advanced models can be expensive to use, and their computational requirements can be high.
Ready to Turn Insights Into an AI Agent?
Use these free tools to plan faster, then launch a YourGPT AI agent trained on your content, policies, and workflows.