Compare Models

Compare: Discover the top-performing LLM by evaluating and comparing key metrics in depth.

LLM Leaderboard

Highly Preferred

GPT-4 Turbo (0409)

User's Choice

Largest Context

Gemini 1.5 Pro

2.00M Context Window

Most Expensive

GPT-4

$0.18/1K Tokens

Least Expensive

GPT-4o mini

$0.0007/1K Tokens

Fastest TPS

Claude 3 Sonnet

170.4 Tokens/s

Least Latency

GPT-3.5 Turbo

0.34s

Model Performance and Benchmark Scores

Performance and Benchmark Score: A Comparative Analysis of Model Capabilities Across Different Benchmarks.

| Model | MMLU | HumanEval | GSM8K | HellaSwag | GPQA | MMMU | BBHard | MATH |
|---|---|---|---|---|---|---|---|---|
| Llama 2 70b | 68.9% | 29.9% | 56.8% | 85.3% | N/A | N/A | 51.2% | 35.2% |
| Llama 3.1 405b | 88.6% | 89% | 96.8% | 87% | 51.1% | 64.5% | 81.3% | 73.8% |
| GPT-3.5 Turbo | 69.8% | 68% | N/A | N/A | 30.8% | N/A | N/A | 43.1% |
| GPT-4 Turbo | 86.5% | 90.2% | 91% | 94.2% | 48% | 63.1% | 87.6% | 72.2% |
| GPT-4 | 86.4% | 67% | 92% | 95.3% | 35.7% | 56.8% | 83.1% | 52.9% |
| GPT-4o | 88.7% | 90.2% | 89.8% | 94.2% | 53.6% | 69.1% | 91.3% | 76.6% |
| GPT-4o mini | 82% | 87.2% | N/A | N/A | 40.2% | 59.4% | N/A | 70.2% |
| Claude Instant | 73.4% | N/A | 80.9% | N/A | N/A | N/A | N/A | N/A |
| Claude 2.1 | N/A | 71.2% | N/A | N/A | N/A | N/A | N/A | N/A |
| Claude 3 Haiku | 75.2% | 75.9% | 88.9% | 85.9% | 33.3% | 50.2% | 73.7% | 38.9% |
| Claude 3 Sonnet | 79% | 73% | 92.3% | 89% | 40.4% | 53.1% | 82.9% | 43.1% |
| Claude 3.5 Sonnet | 88.7% | 92% | 96.4% | 89% | 59.4% | 68.3% | 93.1% | 71.1% |
| Claude 3 Opus | 86.8% | 84.9% | 95% | 95.4% | 50.4% | 59.4% | 86.8% | 60.1% |
| Gemini 1.5 Pro | 81.9% | 71.9% | 91.7% | 92.5% | 46.2% | 62.2% | 84% | 58.5% |
| Gemini 1.5 Flash | 78.9% | 67.5% | 68.8% | 81.3% | 39.5% | 56.1% | 89.2% | 67.7% |

We use benchmarks to measure different attributes such as general knowledge (MMLU), common-sense reasoning (HellaSwag), coding proficiency (HumanEval), and mathematical ability (GSM8K, MATH). By analyzing these aspects, we gain insight into the relative strengths and weaknesses of various models.
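As a rough illustration, scores from the table above can be compared programmatically. This is a minimal sketch, not a rigorous methodology: the figures are hand-copied from the table, only a subset of models and benchmarks is included, and N/A entries are simply skipped when averaging.

```python
# Benchmark scores (%) hand-copied from the table above; None marks N/A.
SCORES = {
    "GPT-4o":            {"MMLU": 88.7, "HumanEval": 90.2, "GSM8K": 89.8, "HellaSwag": 94.2, "MATH": 76.6},
    "Claude 3.5 Sonnet": {"MMLU": 88.7, "HumanEval": 92.0, "GSM8K": 96.4, "HellaSwag": 89.0, "MATH": 71.1},
    "Llama 3.1 405b":    {"MMLU": 88.6, "HumanEval": 89.0, "GSM8K": 96.8, "HellaSwag": 87.0, "MATH": 73.8},
    "GPT-3.5 Turbo":     {"MMLU": 69.8, "HumanEval": 68.0, "GSM8K": None, "HellaSwag": None, "MATH": 43.1},
}

def average_score(model: str) -> float:
    """Mean over the benchmarks that have a reported score."""
    values = [v for v in SCORES[model].values() if v is not None]
    return sum(values) / len(values)

# Rank models by mean score across whatever benchmarks they report.
ranking = sorted(SCORES, key=average_score, reverse=True)
print(ranking[0])  # GPT-4o
```

A plain average like this weights every benchmark equally; in practice you would weight the benchmarks that matter for your use case (e.g. HumanEval for coding assistants).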

Frequently Asked Questions

What is the fastest LLM model in terms of tokens per second?

The fastest model in terms of tokens per second is Claude 3 Sonnet, which processes 170.4 tokens per second. Higher TPS (tokens per second) indicates faster text generation and processing capabilities.
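A throughput figure like this translates directly into generation time. A quick back-of-the-envelope sketch, using the 170.4 tokens/s figure from the leaderboard above (real throughput varies with load, prompt length, and streaming overhead):

```python
def generation_time(num_tokens: int, tokens_per_second: float) -> float:
    """Seconds needed to generate num_tokens at a steady throughput."""
    return num_tokens / tokens_per_second

# At 170.4 tokens/s, a 1,000-token answer takes roughly 5.9 seconds.
print(round(generation_time(1000, 170.4), 1))  # 5.9
```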

What is MMLU?

MMLU stands for Massive Multitask Language Understanding. It is a benchmark designed to evaluate a model’s general knowledge and understanding across a wide range of topics. This includes subjects like history, science, literature, and law. The MMLU score reflects how well a model handles diverse questions and information, providing a measure of its broad language comprehension and problem-solving skills.

What is HumanEval?

HumanEval is a benchmark used to assess a model’s coding and programming abilities. It features a set of programming problems aimed at evaluating how effectively a model can write functional code, solve algorithms, and debug issues. This benchmark measures a model’s proficiency in understanding and generating correct, efficient code across various programming languages.

What is GSM8K?

GSM8K (Grade-School Mathematics 8K) is a benchmark focused on evaluating a model’s mathematical capabilities. It includes a variety of grade-school level math problems, such as arithmetic and algebra. The GSM8K score indicates how well a model can perform calculations, understand mathematical concepts, and solve problems accurately.

Which models are best at multitask reasoning?

The top models for multitask reasoning based on MMLU scores are:

  • GPT-4o: 88.7%
  • Claude 3.5 Sonnet: 88.7%
  • Llama 3.1 405b: 88.6%

These models excel in handling a broad range of tasks and domains, demonstrating strong general knowledge and problem-solving skills.

Can I compare multiple LLM models at once?

Yes, you can use our comparison tool to evaluate and compare multiple LLM models simultaneously. This tool allows you to assess models based on various criteria such as performance, cost, and speed, helping you choose the best option for your needs.

Which LLM model is the most cost-effective?

The GPT-4o mini is the most cost-effective model, priced at $0.0007 per 1,000 tokens. This model is ideal for projects with limited budgets that require a low-cost option without compromising too much on performance.
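For budgeting, the per-1K-token prices listed on this page can be turned into a simple cost estimate. A minimal sketch (prices as quoted above, blended for simplicity; real provider pricing usually differs for input vs. output tokens):

```python
# Prices in USD per 1,000 tokens, as listed on this page.
PRICE_PER_1K = {"GPT-4o mini": 0.0007, "GPT-4": 0.18}

def monthly_cost(model: str, tokens_per_month: int) -> float:
    """Estimated monthly spend in USD, rounded to cents."""
    return round(tokens_per_month / 1000 * PRICE_PER_1K[model], 2)

# 10M tokens/month: $7 on GPT-4o mini vs. $1,800 on GPT-4.
print(monthly_cost("GPT-4o mini", 10_000_000))  # 7.0
print(monthly_cost("GPT-4", 10_000_000))        # 1800.0
```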

What is the most expensive LLM model available?

GPT-4 is the most expensive model, costing $0.18 per 1,000 tokens. The higher price reflects its advanced capabilities and extensive training, which contribute to its strong performance.

What is the most advanced open-source LLM model available?

The most advanced open-weight LLM model currently available is Llama 3.1 405b. This model offers high performance and can be freely accessed and modified. However, it may not always match the performance of commercial models like GPT-4 or GPT-4 Turbo in every benchmark.

What is inference speed?

Inference speed is how quickly an LLM model generates a response after receiving input. It is usually described by two metrics: latency (seconds until the response, or the first token, arrives) and throughput (tokens generated per second).
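In practice, latency is easy to measure by timing the call yourself. A minimal sketch, where `call_model` is a hypothetical stand-in for whatever client SDK you actually use:

```python
import time

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM client call."""
    time.sleep(0.05)  # simulate network + inference delay
    return "response"

start = time.perf_counter()
reply = call_model("Hello")
latency = time.perf_counter() - start
print(f"latency: {latency:.2f}s")
```

For streaming APIs, time-to-first-token is often the more meaningful number, since users start reading before the full response has finished generating.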

Can LLM models be fine-tuned for specific tasks?

Yes, many LLM models may be fine-tuned for specific tasks or sectors. Fine-tuning improves the model's performance on specialised tasks by training it on domain-specific data, increasing its relevance and accuracy for particular applications.

What are the limitations of current LLM models?

Current LLM models have several limitations:

  • Context length: Even models with large context windows may struggle with very long texts.
  • Understanding nuances: Some models may misinterpret complex or ambiguous language.
  • Cost and resources: Advanced models can be expensive to use, and their computational requirements can be high.