The Swallow Project conducts independent evaluation experiments on major large language models (LLMs) in parallel with its development of a high-performance LLM specialized in Japanese. By comparing LLMs developed both in Japan and worldwide, we can gauge where the Swallow Project currently stands. We evaluate under fair conditions while accounting for the unique specifications of each LLM, such as tokenization and system prompts. By analyzing the results in relation to how each LLM was developed, we aim to uncover the "recipe" for building a high-performance LLM. This website visualizes the evaluation results of LLMs tested within the Swallow Project as bar charts, radar charts, and scatter plots. We hope it serves not only as a guide for selecting high-performance LLMs but also as a reference for developing LLMs with strong Japanese language capabilities.

The content of this leaderboard (including data and graphs) is provided under a Creative Commons Attribution 4.0 (CC-BY 4.0) License, the evaluation software (swallow-evaluation) is distributed under the MIT License, and the source code of this website is also provided under the MIT License.

Change log

  • 2025-08-18
    • Swallow LLM Leaderboard v2.
    • We have revamped evaluation benchmarks and methods for post-trained models in order to properly measure the capabilities of new large language models, such as reasoning models. We adopted six Japanese benchmarks (JEMHopQA, MMLU-ProX, GPQA, MATH-100, JHumanEval, M-IFEval-Ja) and six English benchmarks (HellaSwag, MMLU-Pro, GPQA, MATH-500, AIME 2024-2025, LiveCodeBench), and changed the evaluation method to zero-shot reasoning (previously few-shot reasoning). In addition, we have released the developed evaluation framework as swallow-evaluation-instruct.
    • We added evaluation results of ABEJA-QwQ32b-Reasoning-Japanese-v1.0, DeepSeek-R1-Distill series, ELYZA-Thinking-1.0-Qwen-32B, GPT-5 (gpt-5-2025-08-07), gpt-oss-20b, gpt-oss-120b, Llama-3.1-Nemotron series, Llama 4 Scout Instruct, MedGemma 27B IT, o3 (o3-2025-04-16), o3-mini (o3-mini-2025-01-31), Phi-4-reasoning-plus, Qwen3 series.
    • We have revised the structure to consist of three types of pages: overall results (bar chart of average scores), task-specific results (radar chart), and scatter plots. Each page visualizes the evaluation results of either pretrained models (without post-training) or post-trained models.
    • We implemented a feature on the right side of the model list (table) displayed on each page that allows users to bulk-select models by scale or category.
    • We implemented a feature in the bar chart on the overall results page that allows users to toggle the sorting order of models by clicking on a model name.
    • We added functionality to display the number of active parameters for Mixture of Experts (MoE) models.
    • We updated the scatter plot so that the plotted points are color-coded by model family (OpenAI, Llama, Gemma, Qwen, and others).
    • The old version was moved to https://swallow-llm.github.io/leaderboard-v1/.
  • 2025-06-27
    • Added a note regarding the in-domain evaluation of llm-jp-3.1-*-instruct4.
  • 2025-06-25
    • Added evaluation results of Llama 3.1 Swallow 8B v0.5.
    • Added evaluation results of Llama 4 Scout.
    • Added evaluation results of llm-jp-3-7.2b.
    • Added evaluation results of llm-jp-3-1.8b-instruct3, llm-jp-3-3.7b-instruct3, llm-jp-3-7.2b-instruct3, llm-jp-3-13b-instruct3.
    • Added evaluation results of llm-jp-3.1-1.8b-instruct4, llm-jp-3.1-13b-instruct4.
    • Added evaluation results of Qwen2.5-32B.
    • Added evaluation results of Qwen3-1.7B-Base, Qwen3-4B-Base, Qwen3-8B-Base, Qwen3-14B-Base, Qwen3-30B-A3B-Base.
  • 2025-05-21
    • Added evaluation results of Sarashina2.2 0.5B, 1B, 3B.
  • 2025-05-19
    • Added evaluation results of Gemma-2-Llama Swallow 2B, 9B, 27B.
  • 2025-04-14
    • Added evaluation results of Gemma 3 4B, 12B, 27B.
    • Added evaluation results (Japanese Understanding & Generation and Japanese MT-Bench) of GPT-4 (gpt-4-0613).
    • Added evaluation results (Japanese MT-Bench) of GPT-4.5 (gpt-4.5-preview-2025-02-27) and o1 (o1-2024-12-17). We also considered evaluating these models on the Japanese understanding and generation tasks; however, due to a limitation in the OpenAI API specifications (the inability to generate 10 responses for a single prompt under the same conditions as other models), their scores for the Japanese understanding and generation tasks are left blank.
  • 2025-03-10
    • Relaunched as the Swallow LLM Leaderboard.
  • 2024-07-01

Evaluation tasks

Post-trained (Japanese)

This benchmark evaluates post-trained LLMs, including reasoning models, on Japanese benchmark datasets. The evaluation scores range from 0 (lowest) to 1 (highest).

Multihop reasoning
JEMHopQA

Japanese explainable multi-hop question answering

Metric: Character F1 (lenient)
College-level exam
MMLU-ProX (Japanese)

Proficient-level multi-discipline language understanding and reasoning

Metric: Accuracy
Science
GPQA (Japanese)

Graduate-level Google-proof question answering

Metric: Accuracy
Mathematics
MATH-100 (Japanese)

Competition-level mathematics

Metric: Accuracy
Coding
JHumanEval

Japanese translation of HumanEval (code generation benchmark); see the Pass@1 estimation sketch after this task list

Metric: Pass@1 (n=10)
Instruction following
M-IFEval-Ja

Controllability of instruction following

Metric: Accuracy

Evaluation results of this task are excluded from the average score calculation.
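
JHumanEval above (and the coding benchmarks in the English suite) report Pass@1 estimated from n = 10 samples per problem. The sketch below shows the standard unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021); we assume the leaderboard's Pass@1 (n=10) follows this scheme, but the exact implementation in swallow-evaluation-instruct may differ. The function name and the toy correct counts are illustrative only.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k for a single problem.

    n: number of generated samples for the problem
    c: number of samples that pass all unit tests
    k: the k in pass@k (k = 1 for the leaderboard's Pass@1)
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Toy example with hypothetical per-problem correct counts out of 10 samples.
correct_counts = [3, 0, 10, 7]
per_problem = [pass_at_k(n=10, c=c, k=1) for c in correct_counts]
print(sum(per_problem) / len(per_problem))  # for k = 1 this equals the mean of c/n
```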

Post-trained (English)

This benchmark evaluates post-trained LLMs, including reasoning models, on English benchmark datasets. The evaluation scores range from 0 (lowest) to 1 (highest).

Natural language inference
HellaSwag

Four-choice questions to predict the next event

Metric: Accuracy
College-level exam
MMLU-Pro (English)

Proficient-level multi-discipline language understanding and reasoning

Metric: Accuracy
Science
GPQA (English)

Graduate-level Google-proof question answering

Metric: Accuracy
Mathematics
MATH-500 (English)

Competition-level mathematics; see the answer-matching sketch after this task list

Metric: Accuracy
Mathematics
AIME 2024-2025

American Invitational Mathematics Examination, a qualifying exam for the United States of America Mathematical Olympiad (USAMO)

Metric: Accuracy
Coding
LiveCodeBench

Contest problems from competitive programming platforms (LeetCode, AtCoder, and Codeforces)

Metric: Pass@1 (n=10)
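
For the mathematics benchmarks (MATH-100, MATH-500, and AIME 2024-2025), accuracy is determined by comparing the model's final answer with the reference. The sketch below illustrates one common approach under zero-shot reasoning: extract the last \boxed{...} expression from the model's output and compare it with the gold answer after light normalization. This is a simplified, hypothetical illustration; the matching logic in swallow-evaluation-instruct is likely more robust (for example, handling nested braces or symbolic equivalence).

```python
import re

def extract_boxed(text: str) -> str | None:
    """Return the contents of the last \\boxed{...} in the model output, if any."""
    # Note: this simple pattern does not handle nested braces.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def normalize(answer: str) -> str:
    """Very light normalization; real evaluators do much more (fractions, units, ...)."""
    return answer.replace(" ", "").rstrip(".")

def is_correct(model_output: str, gold: str) -> bool:
    prediction = extract_boxed(model_output)
    return prediction is not None and normalize(prediction) == normalize(gold)

# Hypothetical model output.
output = "... therefore the final answer is \\boxed{42}."
print(is_correct(output, "42"))  # True
```
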
Japanese MT-Bench

The Japanese version of MT-Bench (Nejumi LLM Leaderboard edition) evaluates multi-turn dialogue capabilities. The test questions are based on v4, and the reference answers are derived from v2 with corrections to incorrect responses. The evaluation scores range from 0 (lowest) to 1 (highest). A sketch of the reference-guided grading procedure appears after the category list below.

Coding

Implementing algorithms in Python or C++, and creating websites using HTML.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
Extraction

Extracting named entities (such as author names and numerical values) and sentiment (e.g., positive or negative) from text.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
Humanities

Creating essays and strategies on topics related to law, economics, history, philosophy, and education.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
Math

Generating solutions for problems and word problems in algebra, geometry, probability, and number theory.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
Reasoning

Generating answers to questions by leveraging common knowledge and reasoning skills.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
Roleplay

Writing creative texts by assuming the persona of famous individuals or fictional characters and imagining hypothetical scenarios.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
STEM

Generating answers and explanations on topics related to physics, chemistry, biology, geography, architecture, and machine learning.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
Writing

Writing blog articles, email drafts, and fictional narratives.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
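
The sketch below illustrates how reference-guided grading might look in code: the judge model (GPT-4o) receives the question, a reference answer, and the evaluated model's answer, returns a 1-10 rating, and the leaderboard value is presumably that rating rescaled to the 0-1 range. The prompt wording, the MT-Bench-style "[[rating]]" convention, and the division by 10 are assumptions made for illustration; the actual judge prompts follow the FastChat / Nejumi LLM Leaderboard implementations.

```python
import re
from openai import OpenAI  # official openai Python package

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical judge prompt; the real MT-Bench judge prompts are longer and per-category.
JUDGE_PROMPT = (
    "You are an impartial judge. Rate the assistant's answer from 1 to 10,\n"
    "using the reference answer as guidance. Reply with 'Rating: [[N]]'.\n\n"
    "[Question]\n{question}\n\n[Reference answer]\n{reference}\n\n"
    "[Assistant's answer]\n{answer}\n"
)

def judge_score(question: str, reference: str, answer: str) -> float:
    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, answer=answer)}],
        temperature=0,
    )
    text = response.choices[0].message.content
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", text)
    rating = float(match.group(1)) if match else 1.0  # fall back to the minimum rating
    return rating / 10.0  # rescale 1-10 to the leaderboard's 0-1 range (assumed)
```
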
English MT-Bench

English MT-Bench evaluates multi-turn dialogue capabilities. The evaluation scores range from 0 (lowest) to 1 (highest).

Coding

Implementing algorithms in Python or C++, and creating websites using HTML.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
Extraction

Extracting named entities (such as author names and numerical values) and sentiment (e.g., positive or negative) from text.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
Humanities

Creating essays and strategies on topics related to law, economics, history, philosophy, and education.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
Math

Generating solutions for problems and word problems in algebra, geometry, probability, and number theory.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
Reasoning

Generating answers to questions by leveraging common knowledge and reasoning skills.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
Roleplay

Writing creative texts by assuming the persona of famous individuals or fictional characters and imagining hypothetical scenarios.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
STEM

Generating answers and explanations on topics related to physics, chemistry, biology, geography, architecture, and machine learning.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
Writing

Writing blog articles, email drafts, and fictional narratives.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
Pre-trained (Japanese)

This benchmark evaluates pre-trained LLMs (without post-training) on Japanese benchmark datasets. The evaluation scores range from 0 (lowest) to 1 (highest).

Commonsense
JCommonsenseQA

Five-choice questions created with a knowledge base

Metric: Accuracy
Multi-hop Q&A
JEMHopQA

Open-ended Q&A assessing factual knowledge and reasoning ability

Metric: Character F1 (see the character F1 sketch after this task list)
Classical Q&A
NIILC

Open-ended Q&A that can be answered by an encyclopedia

Metric: Character F1
Reference: Sekine (2003)
Reading comprehension
JSQuAD

Open-ended Q&A on Wikipedia articles

Metric: Character F1
Summarization
XL-Sum

Generating a highlight (summary) of a BBC news article

Metric: ROUGE-2
Mathematics
MGSM

Japanese translation of math word problems (GSM8K)

Metric: Accuracy (exact match)
English-Japanese translation
WMT20 (en-ja)

Translation of news articles (English to Japanese)

Metric: BLEU
Japanese-English translation
WMT20 (ja-en)

Translation of news articles (Japanese to English)

Metric: BLEU
Multi-task natural language understanding
JMMLU

Japanese translation of the four-choice exam benchmark MMLU (53 subjects)

Metric: Accuracy
Reference: Yin et al. (2024)
Code generation
JHumanEval

Japanese translation of HumanEval (code generation benchmark)

Metric: pass@1
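
Several Japanese QA tasks above (JEMHopQA, NIILC, JSQuAD) are scored with character-level F1 between the generated answer and the reference. The sketch below shows one common bag-of-characters formulation; the exact implementation in the LLM-jp evaluation script, and the "lenient" variant used for post-trained models, may differ in normalization and matching details.

```python
from collections import Counter

def char_f1(prediction: str, reference: str) -> float:
    """Bag-of-characters F1 between a predicted and a reference answer."""
    if not prediction or not reference:
        return 0.0
    pred_chars, ref_chars = Counter(prediction), Counter(reference)
    overlap = sum((pred_chars & ref_chars).values())  # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

print(char_f1("徳川家康", "徳川家康"))  # 1.0 (exact match)
print(char_f1("家康", "徳川家康"))      # precision 1.0, recall 0.5 -> 0.667
```
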
Pre-trained (English)

This benchmark evaluates pre-trained LLMs (without post-training) on English benchmark datasets. The evaluation scores range from 0 (lowest) to 1 (highest).

Q&A based on facts and common sense
OpenBookQA

Four-choice questions based on scientific knowledge and common sense

Metric: Accuracy
Q&A based on knowledge
TriviaQA

Open-ended Q&A based on trivia questions

Metric: Accuracy (exact match)
Commonsense inference
HellaSwag

Four-choice questions to predict the next event

Metric: Accuracy
Reading comprehension
SQuAD2

Open-ended Q&A grounded in an evidence document

Metric: Accuracy (exact match)
Commonsense inference
XWINO

Two-choice questions to predict the antecedent of a pronoun

Metric: Accuracy
Multi-task natural language understanding
MMLU

Four-choice exam questions benchmark MMLU (53 subjects)

Metric: Accuracy
Mathematics
GSM8K

Math word problems

Metric: Accuracy (exact match)
Mathematics
MATH

Problems from high-school math competitions

Metric: Accuracy (exact match)
Collection of hard-to-solve tasks for LLMs
BIG-Bench-Hard (BBH)

23 challenging tasks from the BIG-Bench dataset (Srivastava et al., 2023)

Metric: Accuracy (exact match)
Code generation
HumanEval

Code generation ability measured by unit tests

Metric: pass@1

Evaluation tools

LLM-jp evaluation script (1.3.0)
Automatic evaluation tool for Japanese LLMs
JP Language Model Evaluation Harness (commit #9b42d41)
An evaluation framework for Japanese LLMs
Language Model Evaluation Harness (0.4.2)
An evaluation framework for LLMs (see the usage sketch after this list)
Code Generation LM Evaluation Harness (commit #0261c52)
An evaluation framework for code generation (HumanEval)
FastChat (commit #e86e70d0)
An automatic evaluation framework by an LLM (MT-Bench)
swallow-evaluation
The evaluation framework used in the Swallow Project (encompassing all of the tools above)
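
As a concrete example of how the pre-trained model benchmarks can be run, the sketch below uses the Python API of the Language Model Evaluation Harness (v0.4.x, lm_eval.simple_evaluate). The model name, task choice, and few-shot setting are placeholders; swallow-evaluation wraps these tools with its own task configurations, so treat this as an illustration of the underlying harness rather than the project's exact invocation.

```python
import lm_eval

# Evaluate a Hugging Face model on one of the English benchmarks used by the leaderboard.
# "hf" selects the Hugging Face backend; the model name below is a placeholder.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.1-8B",
    tasks=["hellaswag"],
    num_fewshot=10,  # the actual few-shot counts are defined in swallow-evaluation
    batch_size=8,
)
print(results["results"]["hellaswag"])  # per-task metrics such as accuracy
```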

Evaluated models

Model name # Parameters [B] Release date Post-training Reasoning mode Missing scores
ABEJA-QwQ32b-Reasoning-Japanese-v1.0 33 2025-04-25 Yes on
CyberAgentLM3-22B-chat 22 2024-07-09 Yes N/A
DeepSeek-R1-Distill-Llama-8B 8.0 2025-01-20 Yes on
DeepSeek-R1-Distill-Llama-70B 70 2025-01-20 Yes N/A
DeepSeek-R1-Distill-Qwen-7B 7.6 2025-01-20 Yes on
DeepSeek-R1-Distill-Qwen-14B 15 2025-01-20 Yes on
DeepSeek-R1-Distill-Qwen-32B 33 2025-01-20 Yes on
DeepSeek-R1-Distill-Qwen-14B-Japanese 15 2025-01-27 Yes on
DeepSeek-R1-Distill-Qwen-32B-Japanese 33 2025-01-27 Yes on
ELYZA-Thinking-1.0-Qwen-32B 33 2025-05-01 Yes on
Falcon3-1B-Base 1.7 2024-12-19 No
Falcon3-3B-Base 3.2 2024-12-19 No
Falcon3-7B-Base 7.5 2024-12-19 No
Falcon3-10B-Base 10 2024-12-19 No
Gemma 2 2B 2.6 2024-06-27 No
Gemma 2 2B IT 2.6 2024-06-27 Yes N/A
Gemma 2 9B 9.2 2024-06-27 No
Gemma 2 9B IT 9.2 2024-06-27 Yes N/A
Gemma 2 27B 27 2024-06-27 No
Gemma 2 27B IT 27 2024-06-27 Yes N/A
Gemma-2-Llama Swallow 2B 2.6 2025-05-19 No
Gemma-2-Llama Swallow 2B IT 2.6 2025-05-19 Yes N/A
Gemma-2-Llama Swallow 9B 9.2 2025-05-19 No
Gemma-2-Llama Swallow 9B IT 9.2 2025-05-19 Yes N/A
Gemma-2-Llama Swallow 27B 27 2025-05-19 No
Gemma-2-Llama Swallow 27B IT 27 2025-05-19 Yes N/A
Gemma 3 1B 1.0 2025-03-12 No
Gemma 3 1B IT 1.0 2025-03-12 Yes N/A
Gemma 3 4B 4.3 2025-03-12 No
Gemma 3 4B IT 4.3 2025-03-12 Yes N/A
Gemma 3 12B 12 2025-03-12 No
Gemma 3 12B IT 12 2025-03-12 Yes N/A
Gemma 3 27B 27 2025-03-12 No
Gemma 3 27B IT 27 2025-03-12 Yes N/A
GPT-4.1 (gpt-4.1-2025-04-14) 0 2025-04-14 Yes N/A
GPT-4o (gpt-4o-2024-08-06) 0 2024-08-06 Yes N/A
GPT-5 (gpt-5-2025-08-07) 0 2025-08-07 Yes on (middle)
gpt-oss-20b 22 (3.6) 2025-08-05 Yes on (middle)
gpt-oss-120b 120 (5.1) 2025-08-05 Yes on (middle)
Llama 3.1 8B 8.0 2024-07-23 No
Llama 3.1 8B Instruct 8.0 2024-07-23 Yes N/A
Llama 3.1 70B 70 2024-07-23 No
Llama-3.1-Nemotron-Nano-8B-v1 8.0 2025-03-18 Yes on
Llama 3.1 Swallow 8B Instruct v0.3 8.0 2024-12-23 Yes N/A
Llama 3.1 Swallow 8B v0.5 8.0 2025-06-25 No
Llama 3.1 Swallow 8B Instruct v0.5 8.0 2025-06-25 Yes N/A
Llama 3.2 1B 1.2 2024-09-25 No
Llama 3.2 3B 3.2 2024-09-25 No
Llama 3.3 70B Instruct 70 2024-12-06 Yes N/A
Llama-3.3-Nemotron-Super-49B-v1 50 2025-03-18 Yes N/A
Llama 3.3 Swallow 70B v0.4 70 2025-03-14 No
Llama 3.3 Swallow 70B Instruct v0.4 70 2025-03-10 Yes N/A
Llama 4 Scout 109 (17) 2025-04-04 No
Llama 4 Scout Instruct 109 (17) 2025-04-04 Yes N/A
llm-jp-3-1.8b 1.8 2024-09-25 No
llm-jp-3-3.7b 3.7 2024-09-25 No
llm-jp-3-7.2b 7.3 2025-02-05 No
llm-jp-3-13b 13 2024-09-25 No
llm-jp-3.1-1.8b-instruct4 1.8 2025-05-30 Yes N/A
llm-jp-3.1-13b-instruct4 14 2025-05-30 Yes N/A
MedGemma 27B IT 27 2025-07-09 Yes N/A
o3 (o3-2025-04-16) 0 2025-04-16 Yes on (middle)
o3-mini (o3-mini-2025-01-31) 0 2025-01-31 Yes on (middle)
Phi-4 15 2024-12-13 Yes N/A
Phi-4-reasoning-plus 15 2025-04-30 Yes on
PLaMo 2 1B 1.3 2025-02-21 No
PLaMo 2 8B 9.1 2025-02-21 No
Qwen2.5-1.5B 1.5 2024-09-19 No
Qwen2.5-3B 3.1 2024-09-19 No
Qwen2.5-7B 7.6 2024-09-19 No
Qwen2.5-7B-Instruct 7.6 2024-09-19 Yes N/A
Qwen2.5-14B 14 2024-09-19 No
Qwen2.5-14B-Instruct 15 2024-09-19 Yes N/A
Qwen2.5-32B 33 2024-09-19 No
Qwen2.5-32B-Instruct 33 2024-09-19 Yes N/A
Qwen2.5-72B 72 2024-09-19 No
Qwen3-0.6B 0.5 2025-04-29 Yes on
Qwen3-0.6B-Base 0.6 2025-04-29 No
Qwen3-1.7B 1.5 2025-04-29 Yes on
Qwen3-1.7B-Base 1.7 2025-04-29 No
Qwen3-4B 3.1 2025-04-29 Yes on
Qwen3-4B-Base 4.0 2025-04-29 No
Qwen3-8B-Base 8.2 2025-04-29 No
Qwen3-8B 8.2 2025-04-29 Yes on
Qwen3-14B-Base 15 2025-04-29 No
Qwen3-14B 15 2025-04-29 Yes on
Qwen3-32B 33 2025-04-29 Yes on
Qwen3-30B-A3B-Base 31 (3.3) 2025-04-29 No
Qwen3-235B-A22B-Instruct-2507 235 (22) 2025-07-23 Yes N/A
Qwen3-235B-A22B-Thinking-2507 235 (22) 2025-07-23 Yes on
Sarashina2-7B 7.3 2024-06-14 No
Sarashina2-13B 13 2024-06-14 No
Sarashina2-70B 70 2024-06-14 No
Sarashina2.2 0.5B 0.8 2025-03-07 No
Sarashina2.2 1B 1.4 2025-03-07 No
Sarashina2.2 3B 3.4 2025-03-07 No
Sarashina2.2 3B Instruct v0.1 3.4 2025-03-07 Yes N/A
TinySwallow-1.5B 1.5 2025-01-30 No

Acknowledgements

  • Tabler Admin Template licensed under MIT License
  • ApexCharts licensed under MIT License
  • Swallow icon by Game Icons.net, licensed under the CC Attribution License, via SVG Repo
  • The research and development of the large language model Swallow has been supported by the AIST Project "Research and Development on Generative AI Foundation Models in the Physical Domain"