3.1 Overview
The table below lists the most popular LLMs.
The graph below shows a timeline of leading LLMs and their parameter counts.
Although the number of parameters is indicative of a model's capabilities, it does not fully explain its performance. Model architecture, training methodology, and the quality and quantity of training data can allow a smaller model to match the capabilities of a larger one.
Other factors that make a model desirable are its speed and the cost of running it. Certain models may also be particularly strong in specific focus areas such as coding, reasoning, and knowledge.
3.2 Benchmarking
LLM benchmarks are a set of standardized tests designed to evaluate the performance of LLMs on various skills, such as reasoning and comprehension, and utilize specific scorers or metrics to quantitatively measure these abilities.
These are some of the common benchmark standards and datasets:
- GPQA Diamond (Science): Evaluates the ability to answer challenging graduate-level questions in biology, chemistry, and physics. The questions are "Google-proof" and require deep specialized knowledge in the respective fields; human experts achieve around 65% accuracy on this benchmark. Tests a model's ability to understand and apply complex domain-specific scientific concepts.
- MMLU (Multi-task): 57 tasks designed to evaluate the general knowledge and problem-solving capabilities of LLMs across diverse subjects spanning the humanities, STEM, social sciences, and more. Evaluates models in zero-shot and few-shot settings; models are scored on their accuracy in answering multiple-choice questions (see the sketch after this list).
- MMLU-Pro (Multi-task): An enhanced version of MMLU with more challenging and complex questions. More reasoning-intensive questions reduce the chance of guessing, and the benchmark is less sensitive to prompt variation.
- MMMLU (Multi-task): A multilingual version of MMLU.
- MMMU (Multimodal): 11,500 questions covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering.
- SWE-bench (Agentic coding): Evaluates whether LLMs can resolve real-world GitHub issues. Measures agentic reasoning.
- Terminal-bench (Agentic coding): A collection of tasks that evaluate mastery of the terminal.
- TAU-bench (Agentic tool usage): Measures an agent's ability to interact with (simulated) human users and programmatic APIs while following domain-specific policies in a consistent manner.
- HellaSwag (Reasoning): A language-completion task that evaluates commonsense reasoning.
- Humanity's Last Exam (Reasoning): A multi-modal benchmark with 2,500 challenging questions across over a hundred subjects, evaluating reasoning and knowledge.
- AIME (Mathematics): A competition-level high school mathematics benchmark.
- MATH 500 (Mathematics): Evaluates mathematical problem solving from high school to competition level, including algebra, geometry, probability, and calculus.
- BFCL (Tool usage): Measures how well LLMs use tools via function calling.
- Aider polyglot (Coding): Measures capabilities for writing and editing code.
- LiveCodeBench (Coding): Measures code generation capabilities on continuously updated programming problems.
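To make the scoring concrete, here is a minimal sketch of how a multiple-choice benchmark such as MMLU is typically scored: each question is posed to the model together with its answer options, the reply is reduced to a single letter, and accuracy is the fraction of gold labels matched. The `Question` structure and the `ask_model` stub are hypothetical placeholders standing in for a real dataset loader and LLM call, not the official evaluation harness.

```python
from dataclasses import dataclass

@dataclass
class Question:
    prompt: str          # the question text
    choices: list[str]   # e.g. ["A) ...", "B) ...", "C) ...", "D) ..."]
    answer: str          # gold label, e.g. "B"

def ask_model(question: Question) -> str:
    """Placeholder for an LLM call.

    A real harness would build a zero-shot or few-shot prompt from the
    question and its choices, call the model, and parse the response
    down to a single letter (A/B/C/D).
    """
    return "A"  # stub so the sketch runs end to end

def accuracy(questions: list[Question]) -> float:
    """Benchmark score: fraction of questions whose predicted letter matches the gold label."""
    correct = sum(1 for q in questions if ask_model(q) == q.answer)
    return correct / len(questions)

if __name__ == "__main__":
    sample = [
        Question("Which planet is known as the Red Planet?",
                 ["A) Venus", "B) Mars", "C) Jupiter", "D) Mercury"], "B"),
        Question("What is the chemical symbol for gold?",
                 ["A) Au", "B) Ag", "C) Gd", "D) Go"], "A"),
    ]
    print(f"Accuracy: {accuracy(sample):.2%}")
```

In practice, published results usually come from shared evaluation harnesses (for example, the open-source lm-evaluation-harness) so that prompt formats and answer parsing stay consistent across models.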
Here are some leaderboards that rank LLMs.
3.3 Image models
Here is a list of the top image generation models:
- GPT Image 1 | OpenAI, Closed
- Flux | Black Forest Labs, Open source
- Imagen 3 | Google DeepMind, Closed
- Grok 2 | xAI, Closed
- GPT-4o | OpenAI, Closed
- DALL-E 3 | OpenAI, Closed
- Stable Diffusion | Stability AI, Open source
- Firefly | Adobe, Closed
- Midjourney | Midjourney, Closed
- Amazon Nova Canvas | Amazon, Closed
3.4 Video models
- Sora | OpenAI, Closed
- Veo 2 | Google DeepMind, Closed
- Amazon Nova Reel | Amazon, Closed
3.5 Music models
- Lyria | Google DeepMind, Closed