3  Models

Guide to various LLM models, features and benchmarks

3.1 Overview

The table below lists the most common/popular LLMs.

Tab 3.1: List of recent LLMs. Params is the known or estimated number of parameters in billions, Size is the approximate model size in GB, Input is the supported input modalities, and Context is the maximum context window size in tokens.

The graph below shows a timeline of top LLMs and their number of parameters.

Fig 3.1: Timeline of recent LLMs showing the number of parameters on the Y axis. The size of the points denotes the context window size in tokens.

The number of parameters, although indicative of a model’s capabilities, does not fully explain its performance. The model architecture, the training methodology, and the quality and quantity of training data can allow a smaller model to catch up to the capabilities of a larger one.

Other factors that make a model desirable are the speed and cost of running it. Some models are also particularly strong in specific focus areas such as coding, reasoning, or knowledge.

3.2 Benchmarking

LLM benchmarks are standardized tests designed to evaluate the performance of LLMs on various skills, such as reasoning and comprehension. They use specific scorers or metrics to quantitatively measure these abilities.
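
As a concrete illustration of how scoring works, many of these benchmarks reduce to an accuracy metric over multiple-choice questions: the model’s predicted answer letter is compared against a gold label. The sketch below is a minimal, hypothetical scorer; the record format and the ask_model callable are placeholders rather than the API of any particular evaluation harness.

    # Minimal sketch of a multiple-choice benchmark scorer.
    # The record format and the ask_model() callable are hypothetical
    # placeholders, not the API of any real evaluation harness.
    from typing import Callable

    def score_multiple_choice(records: list[dict], ask_model: Callable[[str], str]) -> float:
        """Return model accuracy over multiple-choice records such as
        {"question": "...", "choices": ["...", "..."], "answer": "B"}."""
        correct = 0
        for rec in records:
            letters = "ABCD"[: len(rec["choices"])]
            prompt = (
                rec["question"]
                + "\n"
                + "\n".join(f"{letter}. {choice}" for letter, choice in zip(letters, rec["choices"]))
                + "\nAnswer with a single letter."
            )
            prediction = ask_model(prompt).strip().upper()[:1]  # keep only the leading letter
            if prediction == rec["answer"]:
                correct += 1
        return correct / len(records)

Real harnesses add prompt templates (zero-shot or few-shot), answer extraction, and aggregation across subjects, but the core metric is the same accuracy computation.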

These are some of the benchmark standards/datasets.

  • GPQA Diamond (Science)
    Evaluates the ability to answer challenging graduate-level questions in biology, chemistry, and physics. These questions are Google-proof and require deep specialized knowledge of the respective fields; human experts achieve around 65% accuracy on this benchmark. Tests a model’s ability to understand and apply complex, domain-specific scientific concepts.
  • MMLU (Multi-task)
    57 tasks designed to evaluate general knowledge and problem-solving capabilities of LLMs across diverse subjects spanning topics like humanities, STEM, social sciences, and more. Evaluates models in zero-shot and few-shot settings. Models are scored based on their accuracy in answering multiple-choice questions.
  • MMLU-Pro (Multi-task)
    An enhanced version of MMLU with more challenging and complex questions. The questions are more reasoning intensive, which reduces the chance of guessing correctly, and the benchmark is less sensitive to prompt variation.
  • MMMLU (Multi-task)
    A multilingual version of MMLU
  • MMMU (Multimodal)
    11,500 questions covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering
  • SWE-bench (Agentic coding)
    Evaluates whether LLMs can resolve real-world GitHub issues. Measures agentic reasoning
  • Terminal-bench (Agentic coding)
    A collection of tasks to evaluate terminal mastery
  • TAU-bench (Agentic tool usage)
    Measures an agent’s ability to interact with (simulated) human users and programmatic APIs while following domain-specific policies in a consistent manner.
  • HellaSwag (Reasoning)
    Language completion task that evaluates commonsense reasoning
  • Humanity’s Last Exam (Reasoning)
    A multi-modal benchmark with 2500 challenging questions across over a hundred subjects to evaluate reasoning and knowledge
  • AIME (Mathematics)
    Competition-level high school mathematics benchmark based on the American Invitational Mathematics Examination
  • Math 500 (Mathematics)
    Evaluates mathematical problem solving from high school to competition-level problems. Includes algebra, geometry, probability, and calculus.
  • BFCL (Tool use)
    Berkeley Function-Calling Leaderboard. Measures how accurately LLMs call external functions and tools; see the sketch after this list
  • Aider polyglot (Coding)
    Measures capabilities for writing and editing code
  • LiveCodeBench (Coding)
    Evaluates code generation on recently published programming problems to limit training-data contamination
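
To make the tool-use benchmarks above (BFCL, TAU-bench) more concrete, the sketch below shows the kind of check they perform: a tool is advertised to the model as a JSON schema, the model emits a structured call, and the evaluation compares that call against the expected one. The get_weather tool, its parameters, and the model output are hypothetical examples, not taken from either benchmark.

    # Hypothetical illustration of the function-calling pattern that tool-use
    # benchmarks evaluate. The get_weather tool, its parameters, and the model
    # output below are invented for this example.
    import json

    # A tool is advertised to the model as a JSON schema.
    tool_schema = {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    }

    # The model is expected to respond with a structured call; the evaluation
    # checks that the function name and arguments match the expected call.
    model_output = '{"name": "get_weather", "arguments": {"city": "Oslo", "unit": "celsius"}}'
    expected = {"name": "get_weather", "arguments": {"city": "Oslo", "unit": "celsius"}}

    call = json.loads(model_output)
    print("tool call correct:", call == expected)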

Here are some leaderboards that rank LLMs.

3.3 Image models

Here is a list of the top image generation models

3.4 Video models

3.5 Music models

  • Lyria | Google DeepMind, Closed