Supported Models

Complete list of model architectures supported by LitGPT with parameter sizes and variants.

Overview

LitGPT supports 20+ model families with 100+ model variants, ranging from 14M (Pythia) to 405B (Llama 3.1) parameters.

List all models:

litgpt download list

List pretrain-capable models:

litgpt pretrain list

Model Families

Llama Family

Llama 3, 3.1, 3.2, 3.3:

  • Sizes: 1B, 3B, 8B, 70B, 405B
  • Use Cases: General-purpose, long-context (128K from Llama 3.1 onward)
  • Best For: Production applications, research, instruction following

Code Llama:

  • Sizes: 7B, 13B, 34B, 70B
  • Use Cases: Code generation, completion, infilling
  • Best For: Programming assistants, code analysis

Function Calling Llama 2:

  • Sizes: 7B
  • Use Cases: Tool use, API integration
  • Best For: Agents, function execution

Llama 2:

  • Sizes: 7B, 13B, 70B
  • Use Cases: General-purpose (predecessor to Llama 3)
  • Best For: Established baselines, research comparisons

Llama 3.1 Nemotron:

  • Sizes: 70B
  • Use Cases: NVIDIA-optimized variant
  • Best For: Enterprise deployments

TinyLlama:

  • Sizes: 1.1B
  • Use Cases: Edge devices, resource-constrained environments
  • Best For: Fast inference, mobile deployment

OpenLLaMA:

  • Sizes: 3B, 7B, 13B
  • Use Cases: Open-source Llama reproduction
  • Best For: Research, education

Vicuna:

  • Sizes: 7B, 13B, 33B
  • Use Cases: Chatbot, instruction following
  • Best For: Conversational AI

R1 Distill Llama:

  • Sizes: 8B, 70B
  • Use Cases: Distilled reasoning models
  • Best For: Efficient reasoning tasks

MicroLlama:

  • Sizes: 300M
  • Use Cases: Extremely small Llama variant
  • Best For: Prototyping, testing

Platypus:

  • Sizes: 7B, 13B, 70B
  • Use Cases: STEM-focused fine-tune
  • Best For: Science, math, technical domains

Mistral Family

Mistral:

  • Sizes: 7B, 123B
  • Use Cases: Efficient open models, long-context
  • Best For: Cost-effective deployments

Mathstral:

  • Sizes: 7B
  • Use Cases: Math reasoning
  • Best For: Mathematical problem solving

Mixtral MoE:

  • Sizes: 8×7B (47B total, 13B active), 8×22B (141B total, 39B active)
  • Use Cases: Sparse mixture of experts
  • Best For: High capacity with lower compute
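
The total and active counts above imply that only part of the weights take part in each token's forward pass. A minimal sketch of that arithmetic, using the figures listed here; the function name `active_fraction` is illustrative and not part of LitGPT:

```python
def active_fraction(total_b: float, active_b: float) -> float:
    """Share of parameters used per token in a sparse MoE model."""
    if total_b <= 0 or not 0 < active_b <= total_b:
        raise ValueError("expected 0 < active <= total")
    return active_b / total_b


# (total, active) in billions of parameters, taken from the list above.
MIXTRAL = {"8x7B": (47, 13), "8x22B": (141, 39)}

for name, (total, active) in MIXTRAL.items():
    share = active_fraction(total, active)
    print(f"Mixtral {name}: {share:.0%} of weights active per token")
```

Both variants come out at roughly 28%, so per-token compute is closer to that of a 13B or 39B dense model, while memory still has to hold all 47B or 141B weights.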

Falcon Family

Falcon:

  • Sizes: 7B, 40B, 180B
  • Use Cases: Open-source models from TII
  • Best For: Multilingual applications

Falcon 3:

  • Sizes: 1B, 3B, 7B, 10B
  • Use Cases: Newer Falcon generation
  • Best For: Efficient multilingual models

Phi Family (Microsoft)

Phi 1.5 & 2:

  • Sizes: 1.3B, 2.7B
  • Use Cases: Small language models with strong performance
  • Best For: Edge deployment, low-resource environments

Phi 3 & 3.5:

  • Sizes: 3.8B
  • Use Cases: Improved small models
  • Best For: Mobile, browser-based applications

Phi 4:

  • Sizes: 14B
  • Use Cases: Medium-size high-performance model
  • Best For: Balance of size and capability

Phi 4 Mini Instruct:

  • Sizes: 3.8B
  • Use Cases: Instruction-tuned variant
  • Best For: Chat, task completion

Gemma Family (Google)

Gemma:

  • Sizes: 2B, 7B
  • Use Cases: Google's open models
  • Best For: Research, education

Gemma 2:

  • Sizes: 2B, 9B, 27B
  • Use Cases: Second generation improvements
  • Best For: Enhanced performance

Gemma 3:

  • Sizes: 1B, 4B, 12B, 27B
  • Use Cases: Latest Gemma generation
  • Best For: State-of-the-art open models

CodeGemma:

  • Sizes: 7B
  • Use Cases: Code-specialized Gemma
  • Best For: Code generation, analysis

Qwen Family (Alibaba)

Qwen2.5:

  • Sizes: 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B
  • Use Cases: General-purpose multilingual models
  • Best For: Chinese/English applications

Qwen2.5 Coder:

  • Sizes: 0.5B, 1.5B, 3B, 7B, 14B, 32B
  • Use Cases: Code-specialized variants
  • Best For: Programming in multiple languages

Qwen2.5 Math:

  • Sizes: 1.5B, 7B, 72B
  • Use Cases: Mathematical reasoning
  • Best For: Math problems, STEM education

QwQ & QwQ-Preview:

  • Sizes: 32B
  • Use Cases: Reasoning-focused models (QwQ stands for "Qwen with Questions")
  • Best For: Reasoning tasks

Pythia Family (EleutherAI)

Pythia:

  • Sizes: 14M, 31M, 70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, 12B
  • Use Cases: Research, interpretability
  • Best For: Scientific studies, ablations

StableLM Family (Stability AI)

StableLM:

  • Sizes: 3B, 7B
  • Use Cases: Open models from Stability AI
  • Best For: Research, commercial use

StableLM Zephyr:

  • Sizes: 3B
  • Use Cases: Instruction-tuned variant
  • Best For: Chat applications

StableCode:

  • Sizes: 3B
  • Use Cases: Code generation
  • Best For: Programming tasks

FreeWilly2 (Stable Beluga 2):

  • Sizes: 70B
  • Use Cases: Large Stability AI model
  • Best For: High-capability tasks

Other Models

Danube2:

  • Sizes: 1.8B
  • Use Cases: Efficient small model
  • Best For: Resource-constrained environments

Dolly:

  • Sizes: 3B, 7B, 12B
  • Use Cases: Databricks' instruction-following model
  • Best For: Enterprise applications

LongChat:

  • Sizes: 7B, 13B
  • Use Cases: Extended context windows
  • Best For: Long-document understanding

Nous-Hermes:

  • Sizes: 7B, 13B, 70B
  • Use Cases: Instruction-following fine-tune
  • Best For: Task completion, reasoning

OLMo:

  • Sizes: 1B, 7B
  • Use Cases: Allen AI's fully open model
  • Best For: Research transparency

RedPajama-INCITE:

  • Sizes: 3B, 7B
  • Use Cases: Open reproduction project
  • Best For: Research, education

Salamandra:

  • Sizes: 2B, 7B
  • Use Cases: Multilingual European model
  • Best For: European language support

SmolLM2:

  • Sizes: 135M, 360M, 1.7B
  • Use Cases: Ultra-small models
  • Best For: Edge devices, testing

Download Examples

Download a specific model:

litgpt download meta-llama/Llama-3.2-1B
litgpt download microsoft/phi-2
litgpt download google/gemma-2-9b

Download with a Hugging Face token (for gated models):

export HF_TOKEN=hf_...
litgpt download meta-llama/Llama-3.1-405B
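
When downloads are scripted, it can help to fail early if the token is missing. The sketch below wraps the same `litgpt download` command shown above; the helper names `build_download_command` and `download`, and the `gated` and `dry_run` parameters, are hypothetical and not part of LitGPT:

```python
import os
import subprocess


def build_download_command(repo_id: str) -> list[str]:
    """Return the argv for `litgpt download <repo_id>`."""
    if "/" not in repo_id:
        raise ValueError(f"expected '<org>/<model>', got {repo_id!r}")
    return ["litgpt", "download", repo_id]


def download(repo_id: str, gated: bool = False, dry_run: bool = True) -> list[str]:
    """Check for HF_TOKEN before attempting a gated download."""
    if gated and not os.environ.get("HF_TOKEN"):
        raise RuntimeError("HF_TOKEN is not set; gated models need a token")
    cmd = build_download_command(repo_id)
    if not dry_run:
        # Runs the real CLI; requires litgpt to be installed.
        subprocess.run(cmd, check=True)
    return cmd


# Dry run: only shows the command that would be executed.
print(download("microsoft/phi-2"))
```

With `dry_run=True` (the default in this sketch) nothing is downloaded, which makes the helper safe to try before committing to a multi-gigabyte transfer.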

Model Selection Guide

By Use Case

General Chat/Instruction Following:

  • Small: Phi-2 (2.7B), TinyLlama (1.1B)
  • Medium: Llama-3.1-8B, Mistral-7B
  • Large: Llama-3.1-70B, Mixtral-8x22B

Code Generation:

  • Small: Qwen2.5-Coder-3B
  • Medium: CodeLlama-13B, CodeGemma-7B
  • Large: CodeLlama-70B, Qwen2.5-Coder-32B

Math/Reasoning:

  • Small: Qwen2.5-Math-1.5B
  • Medium: Mathstral-7B, Qwen2.5-Math-7B
  • Large: QwQ-32B, Qwen2.5-Math-72B

Multilingual:

  • Small: Qwen2.5-1.5B, Salamandra-2B
  • Medium: Qwen2.5-7B, Falcon-7B
  • Large: Qwen2.5-72B

Research/Education:

  • Pythia family (14M-12B for ablations)
  • OLMo (fully open)
  • TinyLlama (fast iteration)

By Hardware

Consumer GPU (8-16GB VRAM):

  • Phi-2 (2.7B)
  • TinyLlama (1.1B)
  • Gemma-2B
  • SmolLM2 family

Single A100 (40-80GB):

  • Llama-3.1-8B
  • Mistral-7B
  • CodeLlama-13B
  • Gemma-2-9B

Multi-GPU (200GB+ total):

  • Llama-3.1-70B (TP=4)
  • Mixtral-8x22B (TP=2 only with quantization; bf16 weights alone are about 282 GB)
  • Falcon-40B

Large Cluster:

  • Llama-3.1-405B (FSDP)
  • Falcon-180B
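
The hardware tiers above follow from a simple rule of thumb: weight memory is roughly parameter count times bytes per parameter. A minimal sketch of that estimate, assuming the dtype sizes in `BYTES_PER_PARAM` and even sharding across GPUs; the function names are illustrative, not LitGPT APIs:

```python
# Bytes needed to store one parameter at each precision (assumed values).
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(params_billion: float, dtype: str = "bf16") -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[dtype]


def per_gpu_gb(params_billion: float, num_gpus: int, dtype: str = "bf16") -> float:
    """Weights per GPU when sharded evenly across num_gpus."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return weight_memory_gb(params_billion, dtype) / num_gpus


# (model, size in billions, GPU count) for a few entries from the lists above.
examples = [("Phi-2", 2.7, 1), ("CodeLlama-13B", 13, 1), ("Llama-3.1-70B", 70, 4)]

for name, size_b, gpus in examples:
    total = weight_memory_gb(size_b)
    shard = per_gpu_gb(size_b, gpus)
    print(f"{name}: {total:.1f} GB total, {shard:.1f} GB per GPU on {gpus} GPU(s)")
```

Treat the result as a lower bound: activations and the KV cache add to it at inference time, and training also needs gradients and optimizer state.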

Model Capabilities

Context Lengths

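| Model | Context Window |
|-------|----------------|
| Llama 3.1 | 128K |
| Llama 3.2/3.3 | 128K |
| Mistral-123B | 128K |
| Mixtral | 32K |
| Gemma 2 | 8K |
| Phi-3 | 4K or 128K, depending on variant |
| Qwen2.5 | 32K (128K for 7B and larger) |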
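
A context window limits the prompt and the generated tokens together. The sketch below checks a request against a few of the windows listed above; it assumes "K" means 1024 tokens, and both `CONTEXT_WINDOW` and `fits_in_context` are hypothetical names, not LitGPT APIs:

```python
# Context windows in tokens for three of the models above.
# Reading "K" as 1024 is an assumption.
CONTEXT_WINDOW = {
    "Llama 3.1": 128 * 1024,
    "Mixtral": 32 * 1024,
    "Gemma 2": 8 * 1024,
}


def fits_in_context(model: str, prompt_tokens: int, max_new_tokens: int) -> bool:
    """True if the prompt plus generated tokens stay within the window."""
    return prompt_tokens + max_new_tokens <= CONTEXT_WINDOW[model]


print(fits_in_context("Gemma 2", 7000, 512))  # 7512 <= 8192 -> True
print(fits_in_context("Gemma 2", 8000, 512))  # 8512 > 8192 -> False
```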

Training Data

  • Llama 3: 15T tokens (multilingual)
  • Mistral: Web data, code
  • Qwen: Multilingual (Chinese/English focus)
  • Pythia: The Pile (controlled training)

References