Compare Models

Compare: Discover the top-performing LLM by evaluating and comparing key metrics in depth.

LLM Leaderboard

Highly Preferred

GPT-4 Turbo (0409)

User's Choice

Largest Context

Gemini 1.5 Pro

2.00M Context Window

Most Expensive

GPT-4

$0.18/1K Tokens

Least Expensive

GPT-4o mini

$0.0007/1K Tokens

Fastest TPS

Claude 3 Sonnet

170.4 Tokens/s

Least Latency

GPT-3.5 Turbo

0.34s

Model Performance and Benchmark Scores

Performance and Benchmark Score: A Comparative Analysis of Model Capabilities Across Different Benchmarks.

| Model | MMLU | HumanEval | GSM8K | HellaSwag | GPQA | MMMU | BBHard | MATH |
|---|---|---|---|---|---|---|---|---|
| Llama 2 70b | 68.9% | 29.9% | 56.8% | 85.3% | N/A | N/A | 51.2% | 35.2% |
| Llama 3.1 405b | 88.6% | 89% | 96.8% | 87% | 51.1% | 64.5% | 81.3% | 73.8% |
| GPT-3.5 Turbo | 69.8% | 68% | N/A | N/A | 30.8% | N/A | N/A | 43.1% |
| GPT-4 Turbo | 86.5% | 90.2% | 91% | 94.2% | 48% | 63.1% | 87.6% | 72.2% |
| GPT-4 | 86.4% | 67% | 92% | 95.3% | 35.7% | 56.8% | 83.1% | 52.9% |
| GPT-4o | 88.7% | 90.2% | 89.8% | 94.2% | 53.6% | 69.1% | 91.3% | 76.6% |
| GPT-4o mini | 82% | 87.2% | N/A | N/A | 40.2% | 59.4% | N/A | 70.2% |
| Claude Instant | 73.4% | N/A | 80.9% | N/A | N/A | N/A | N/A | N/A |
| Claude 2.1 | N/A | 71.2% | N/A | N/A | N/A | N/A | N/A | N/A |
| Claude 3 Haiku | 75.2% | 75.9% | 88.9% | 85.9% | 33.3% | 50.2% | 73.7% | 38.9% |
| Claude 3 Sonnet | 79% | 73% | 92.3% | 89% | 40.4% | 53.1% | 82.9% | 43.1% |
| Claude 3.5 Sonnet | 88.7% | 92% | 96.4% | 89% | 59.4% | 68.3% | 93.1% | 71.1% |
| Claude 3 Opus | 86.8% | 84.9% | 95% | 95.4% | 50.4% | 59.4% | 86.8% | 60.1% |
| Gemini 1.5 Pro | 81.9% | 71.9% | 91.7% | 92.5% | 46.2% | 62.2% | 84% | 58.5% |
| Gemini 1.5 Flash | 78.9% | 67.5% | 68.8% | 81.3% | 39.5% | 56.1% | 89.2% | 67.7% |

We use benchmarks to measure different attributes such as general knowledge (MMLU), common-sense reasoning (HellaSwag), coding proficiency (HumanEval), and mathematical ability (GSM8K, MATH). By analyzing these aspects, we gain insight into the relative strengths and weaknesses of various models.
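As a rough illustration, scores from the table above can be compared programmatically. This is a minimal sketch, not a rigorous methodology: the figures are hand-copied from the table, only a subset of models and benchmarks is included, and N/A entries are simply skipped when averaging.

```python
# Benchmark scores (%) hand-copied from the table above; None marks N/A.
SCORES = {
    "GPT-4o":            {"MMLU": 88.7, "HumanEval": 90.2, "GSM8K": 89.8, "HellaSwag": 94.2, "MATH": 76.6},
    "Claude 3.5 Sonnet": {"MMLU": 88.7, "HumanEval": 92.0, "GSM8K": 96.4, "HellaSwag": 89.0, "MATH": 71.1},
    "Llama 3.1 405b":    {"MMLU": 88.6, "HumanEval": 89.0, "GSM8K": 96.8, "HellaSwag": 87.0, "MATH": 73.8},
    "GPT-3.5 Turbo":     {"MMLU": 69.8, "HumanEval": 68.0, "GSM8K": None, "HellaSwag": None, "MATH": 43.1},
}

def average_score(model: str) -> float:
    """Mean over the benchmarks that have a reported score."""
    values = [v for v in SCORES[model].values() if v is not None]
    return sum(values) / len(values)

# Rank models by mean score across whatever benchmarks they report.
ranking = sorted(SCORES, key=average_score, reverse=True)
print(ranking[0])  # GPT-4o
```

A plain average like this weights every benchmark equally; in practice you would weight the benchmarks that matter for your use case (e.g. HumanEval for coding assistants).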

Frequently Asked Questions

What is the fastest LLM model in terms of tokens per second?

The fastest model in terms of tokens per second is Claude 3 Sonnet, which processes 170.4 tokens per second. Higher TPS (tokens per second) indicates faster text generation and processing capabilities.
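A throughput figure like this translates directly into generation time. A quick back-of-the-envelope sketch, using the 170.4 tokens/s figure from the leaderboard above (real throughput varies with load, prompt length, and streaming overhead):

```python
def generation_time(num_tokens: int, tokens_per_second: float) -> float:
    """Seconds needed to generate num_tokens at a steady throughput."""
    return num_tokens / tokens_per_second

# At 170.4 tokens/s, a 1,000-token answer takes roughly 5.9 seconds.
print(round(generation_time(1000, 170.4), 1))  # 5.9
```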

What is MMLU?

MMLU stands for Massive Multitask Language Understanding. It is a benchmark designed to evaluate a model’s general knowledge and understanding across a wide range of topics. This includes subjects like history, science, literature, and law. The MMLU score reflects how well a model handles diverse questions and information, providing a measure of its broad language comprehension and problem-solving skills.

What is HumanEval?

HumanEval is a benchmark used to assess a model’s coding and programming abilities. It features a set of programming problems aimed at evaluating how effectively a model can write functional code, solve algorithms, and debug issues. This benchmark measures a model’s proficiency in understanding and generating correct, efficient code across various programming languages.

What is GSM8K?

GSM8K (Grade-School Mathematics 8K) is a benchmark focused on evaluating a model’s mathematical capabilities. It includes a variety of grade-school level math problems, such as arithmetic and algebra. The GSM8K score indicates how well a model can perform calculations, understand mathematical concepts, and solve problems accurately.

Which models are best at multitask reasoning?

The top models for multitask reasoning based on MMLU scores are:

  • GPT-4o: 88.7%
  • Claude 3.5 Sonnet: 88.7%
  • Llama 3.1 405b: 88.6%

These models excel in handling a broad range of tasks and domains, demonstrating strong general knowledge and problem-solving skills.

Can I compare multiple LLM models at once?

Yes, you can use our comparison tool to evaluate and compare multiple LLM models simultaneously. This tool allows you to assess models based on various criteria such as performance, cost, and speed, helping you choose the best option for your needs.

Which LLM model is the most cost-effective?

The GPT-4o mini is the most cost-effective model, priced at $0.0007 per 1,000 tokens. This model is ideal for projects with limited budgets that require a low-cost option without compromising too much on performance.
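For budgeting, the per-1K-token prices listed on this page can be turned into a simple cost estimate. A minimal sketch (prices as quoted above, blended for simplicity; real provider pricing usually differs for input vs. output tokens):

```python
# Prices in USD per 1,000 tokens, as listed on this page.
PRICE_PER_1K = {"GPT-4o mini": 0.0007, "GPT-4": 0.18}

def monthly_cost(model: str, tokens_per_month: int) -> float:
    """Estimated monthly spend in USD, rounded to cents."""
    return round(tokens_per_month / 1000 * PRICE_PER_1K[model], 2)

# 10M tokens/month: $7 on GPT-4o mini vs. $1,800 on GPT-4.
print(monthly_cost("GPT-4o mini", 10_000_000))  # 7.0
print(monthly_cost("GPT-4", 10_000_000))        # 1800.0
```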

What is the most expensive LLM model available?

GPT-4 is the most expensive model, costing $0.18 per 1,000 tokens. The higher price reflects its advanced capabilities and extensive training, which contribute to its strong performance.

What is the most advanced open-source LLM model available?

The most advanced open-weight LLM model currently available is Llama 3.1 405b. This model offers high performance and can be freely accessed and modified. However, it may not always match the performance of commercial models like GPT-4 or GPT-4 Turbo in every benchmark.

What is inference speed?

Inference speed is how quickly an LLM model generates a response after receiving input. It is usually described by two metrics: latency (seconds until the response, or the first token, arrives) and throughput (tokens generated per second).
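In practice, latency is easy to measure by timing the call yourself. A minimal sketch, where `call_model` is a hypothetical stand-in for whatever client SDK you actually use:

```python
import time

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM client call."""
    time.sleep(0.05)  # simulate network + inference delay
    return "response"

start = time.perf_counter()
reply = call_model("Hello")
latency = time.perf_counter() - start
print(f"latency: {latency:.2f}s")
```

For streaming APIs, time-to-first-token is often the more meaningful number, since users start reading before the full response has finished generating.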

Can LLM models be fine-tuned for specific tasks?

Yes, many LLM models may be fine-tuned for specific tasks or sectors. Fine-tuning improves the model's performance on specialised tasks by training it on domain-specific data, increasing its relevance and accuracy for particular applications.

What are the limitations of current LLM models?

Current LLM models have several limitations:

  • Context length: Even models with large context windows may struggle with very long texts.
  • Understanding nuances: Some models may misinterpret complex or ambiguous language.
  • Cost and resources: Advanced models can be expensive to use, and their computational requirements can be high.