The Swallow Project conducts independent evaluation experiments on major large language models (LLMs) in parallel with its development of a high-performance LLM specialized in Japanese. By comparing LLMs developed both in Japan and worldwide, we can gauge where the Swallow Project currently stands. We evaluate under fair conditions while accounting for the unique specifications of each LLM, such as tokenization and system prompts. By analyzing the results in relation to how each LLM was developed, we aim to uncover the "recipe" for building a high-performance LLM. This website visualizes the evaluation results of LLMs tested within the Swallow Project as bar charts, radar charts, and scatter plots. We hope it serves not only as a guide for selecting high-performance LLMs but also as a reference for developing LLMs with strong Japanese language capabilities.

The content of this leaderboard (including data and graphs) is provided under a Creative Commons Attribution 4.0 (CC-BY 4.0) License, the evaluation software (swallow-evaluation) is distributed under the MIT License, and the source code of this website is also provided under the MIT License.

Change log

  • 2025-08-18
    • Swallow LLM Leaderboard v2.
    • We have revamped evaluation benchmarks and methods for post-trained models in order to properly measure the capabilities of new large language models, such as reasoning models. We adopted six Japanese benchmarks (JEMHopQA, MMLU-ProX, GPQA, MATH-100, JHumanEval, M-IFEval-Ja) and six English benchmarks (HellaSwag, MMLU-Pro, GPQA, MATH-500, AIME 2024-2025, LiveCodeBench), and changed the evaluation method to zero-shot reasoning (previously few-shot reasoning). In addition, we have released the developed evaluation framework as swallow-evaluation-instruct.
    • We added evaluation results of ABEJA-QwQ32b-Reasoning-Japanese-v1.0, DeepSeek-R1-Distill series, ELYZA-Thinking-1.0-Qwen-32B, GPT-5 (gpt-5-2025-08-07), gpt-oss-20b, gpt-oss-120b, Llama-3.1-Nemotron series, Llama 4 Scout Instruct, MedGemma 27B IT, o3 (o3-2025-04-16), o3-mini (o3-mini-2025-01-31), Phi-4-reasoning-plus, Qwen3 series.
    • We have revised the structure to consist of three types of pages: overall results (bar chart of average scores), task-specific results (radar chart), and scatter plots. Each page visualizes the evaluation results of either pretrained models (without post-training) or post-trained models.
    • We implemented a feature on the right side of the model list (table) displayed on each page that allows users to bulk-select models by scale or category.
    • We implemented a feature in the bar chart on the overall results page that allows users to toggle the sorting order of models by clicking on a model name.
    • We added functionality to display the number of active parameters for Mixture of Experts (MoE) models.
    • We updated the scatter plot so that the plotted points are color-coded by model family (OpenAI, Llama, Gemma, Qwen, and others).
    • The old version was moved to https://swallow-llm.github.io/leaderboard-v1/.
  • 2025-06-27
    • Added a note regarding the in-domain evaluation of llm-jp-3.1-*-instruct4.
  • 2025-06-25
    • Added evaluation results of Llama 3.1 Swallow 8B v0.5.
    • Added evaluation results of Llama 4 Scout.
    • Added evaluation results of llm-jp-3-7.2b.
    • Added evaluation results of llm-jp-3-1.8b-instruct3, llm-jp-3-3.7b-instruct3, llm-jp-3-7.2b-instruct3, llm-jp-3-13b-instruct3.
    • Added evaluation results of llm-jp-3.1-1.8b-instruct4, llm-jp-3.1-13b-instruct4.
    • Added evaluation results of Qwen2.5-32B.
    • Added evaluation results of Qwen3-1.7B-Base, Qwen3-4B-Base, Qwen3-8B-Base, Qwen3-14B-Base, Qwen3-30B-A3B-Base.
  • 2025-05-21
    • Added evaluation results of Sarashina2.2 0.5B, 1B, 3B.
  • 2025-05-19
    • Added evaluation results of Gemma-2-Llama Swallow 2B, 9B, 27B.
  • 2025-04-14
    • Added evaluation results of Gemma 3 4B, 12B, 27B.
    • Added evaluation results (Japanese Understanding & Generation and Japanese MT-Bench) of GPT-4 (gpt-4-0613).
    • Added evaluation results (Japanese MT-Bench) of GPT-4.5 (gpt-4.5-preview-2025-02-27) and o1 (o1-2024-12-17). We also considered evaluating these models on the Japanese understanding and generation tasks; however, due to a limitation in the OpenAI API specifications (the inability to generate 10 responses for a single prompt under the same conditions as other models), their scores for the Japanese understanding and generation tasks are left blank.
  • 2025-03-10
    • Relaunched as the Swallow LLM Leaderboard.
  • 2024-07-01

Evaluation tasks

Post-trained (Japanese)

This benchmark evaluates post-trained LLMs, including reasoning models, on Japanese benchmark datasets. The evaluation scores range from 0 (lowest) to 1 (highest).

Multihop reasoning
JEMHopQA

Japanese explainable multi-hop question answering

Metric: Character F1 (lenient)
College-level exam
MMLU-ProX (Japanese)

Proficient-level multi-discipline language understanding and reasoning

Metric: Accuracy
Science
GPQA (Japanese)

Graduate-level Google-proof question answering

Metric: Accuracy
Mathematics
MATH-100 (Japanese)

Competition-level mathematics

Metric: Accuracy
Coding
JHumanEval

Japanese translation of HumanEval (code generation benchmark); see the Pass@1 estimation sketch after this task list

Metric: Pass@1 (n=10)
Instruction following
M-IFEval-Ja

Controllability of instruction following

Metric: Accuracy

Evaluation results of this task are excluded from the average score calculation.
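
JHumanEval above (and the coding benchmarks in the English suite) report Pass@1 estimated from n = 10 samples per problem. The sketch below shows the standard unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021); we assume the leaderboard's Pass@1 (n=10) follows this scheme, but the exact implementation in swallow-evaluation-instruct may differ. The function name and the toy correct counts are illustrative only.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k for a single problem.

    n: number of generated samples for the problem
    c: number of samples that pass all unit tests
    k: the k in pass@k (k = 1 for the leaderboard's Pass@1)
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Toy example with hypothetical per-problem correct counts out of 10 samples.
correct_counts = [3, 0, 10, 7]
per_problem = [pass_at_k(n=10, c=c, k=1) for c in correct_counts]
print(sum(per_problem) / len(per_problem))  # for k = 1 this equals the mean of c/n
```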

Post-trained (English)

This benchmark evaluates post-trained LLMs, including reasoning models, on English benchmark datasets. The evaluation scores range from 0 (lowest) to 1 (highest).

Natural language inference
HellaSwag

Four-choice questions to predict the next event

Metric: Accuracy
College-level exam
MMLU-Pro (English)

Proficient-level multi-discipline language understanding and reasoning

Metric: Accuracy
Science
GPQA (English)

Graduate-level Google-proof question answering

Metric: Accuracy
Mathematics
MATH-500 (English)

Competition-level mathematics; see the answer-matching sketch after this task list

Metric: Accuracy
Mathematics
AIME 2024-2025

American Invitational Mathematics Examination, a qualifying exam for the United States of America Mathematical Olympiad (USAMO)

Metric: Accuracy
Coding
LiveCodeBench

Contest problems from competitive programming platforms (LeetCode, AtCoder, and Codeforces)

Metric: Pass@1 (n=10)
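
For the mathematics benchmarks (MATH-100, MATH-500, and AIME 2024-2025), accuracy is determined by comparing the model's final answer with the reference. The sketch below illustrates one common approach under zero-shot reasoning: extract the last \boxed{...} expression from the model's output and compare it with the gold answer after light normalization. This is a simplified, hypothetical illustration; the matching logic in swallow-evaluation-instruct is likely more robust (for example, handling nested braces or symbolic equivalence).

```python
import re

def extract_boxed(text: str) -> str | None:
    """Return the contents of the last \\boxed{...} in the model output, if any."""
    # Note: this simple pattern does not handle nested braces.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def normalize(answer: str) -> str:
    """Very light normalization; real evaluators do much more (fractions, units, ...)."""
    return answer.replace(" ", "").rstrip(".")

def is_correct(model_output: str, gold: str) -> bool:
    prediction = extract_boxed(model_output)
    return prediction is not None and normalize(prediction) == normalize(gold)

# Hypothetical model output.
output = "... therefore the final answer is \\boxed{42}."
print(is_correct(output, "42"))  # True
```
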
Japanese MT-Bench

The Japanese version of MT-Bench (Nejumi LLM Leaderboard edition) evaluates multi-turn dialogue capabilities. The test questions are based on v4, and the reference answers are derived from v2 with corrections to incorrect responses. The evaluation scores range from 0 (lowest) to 1 (highest). A sketch of the reference-guided grading procedure appears after the category list below.

Coding

Implementing algorithms in Python or C++, and creating websites using HTML.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
Extraction

Extracting named entities (such as author names and numerical values) and sentiment (e.g., positive or negative) from text.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
Humanities

Creating essays and strategies on topics related to law, economics, history, philosophy, and education.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
Math

Generating solutions for problems and word problems in algebra, geometry, probability, and number theory.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
Reasoning

Generating answers to questions by leveraging common knowledge and reasoning skills.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
Roleplay

Writing creative texts by assuming the persona of famous individuals or fictional characters and imagining hypothetical scenarios.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
STEM

Generating answers and explanations on topics related to physics, chemistry, biology, geography, architecture, and machine learning.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
Writing

Writing blog articles, email drafts, and fictional narratives.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
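
The sketch below illustrates how reference-guided grading might look in code: the judge model (GPT-4o) receives the question, a reference answer, and the evaluated model's answer, returns a 1-10 rating, and the leaderboard value is presumably that rating rescaled to the 0-1 range. The prompt wording, the MT-Bench-style "[[rating]]" convention, and the division by 10 are assumptions made for illustration; the actual judge prompts follow the FastChat / Nejumi LLM Leaderboard implementations.

```python
import re
from openai import OpenAI  # official openai Python package

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical judge prompt; the real MT-Bench judge prompts are longer and per-category.
JUDGE_PROMPT = (
    "You are an impartial judge. Rate the assistant's answer from 1 to 10,\n"
    "using the reference answer as guidance. Reply with 'Rating: [[N]]'.\n\n"
    "[Question]\n{question}\n\n[Reference answer]\n{reference}\n\n"
    "[Assistant's answer]\n{answer}\n"
)

def judge_score(question: str, reference: str, answer: str) -> float:
    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, answer=answer)}],
        temperature=0,
    )
    text = response.choices[0].message.content
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", text)
    rating = float(match.group(1)) if match else 1.0  # fall back to the minimum rating
    return rating / 10.0  # rescale 1-10 to the leaderboard's 0-1 range (assumed)
```
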
English MT-Bench

English MT-Bench evaluates multi-turn dialogue capabilities. The evaluation scores range from 0 (lowest) to 1 (highest).

Coding

Implementing algorithms in Python or C++, and creating websites using HTML.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
Extraction

Extracting named entities (such as author names and numerical values) and sentiment (e.g., positive or negative) from text.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
Humanities

Creating essays and strategies on topics related to law, economics, history, philosophy, and education.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
Math

Generating solutions for problems and word problems in algebra, geometry, probability, and number theory.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
Reasoning

Generating answers to questions by leveraging common knowledge and reasoning skills.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
Roleplay

Writing creative texts by assuming the persona of famous individuals or fictional characters and imagining hypothetical scenarios.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
STEM

Generating answers and explanations on topics related to physics, chemistry, biology, geography, architecture, and machine learning.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
Writing

Writing blog articles, email drafts, and fictional narratives.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
Pre-trained (Japanese)

This benchmark evaluates pre-trained LLMs (without post-training) on Japanese benchmark datasets. The evaluation scores range from 0 (lowest) to 1 (highest).

Commonsense
JCommonsenseQA

Five-choice questions created with a knowledge base

Metric: Accuracy
Multi-hop Q&A
JEMHopQA

Open-ended Q&A assessing factual knowledge and reasoning ability

Metric: Character F1 (see the character F1 sketch after this task list)
Classical Q&A
NIILC

Open-ended Q&A that can be answered by an encyclopedia

Metric: Character F1
Reference: Sekine (2003)
Reading comprehension
JSQuAD

Open-ended Q&A on Wikipedia articles

Metric: Character F1
Summarization
XL-Sum

Generating a highlight (summary) of a BBC news article

Metric: ROUGE-2
Mathematics
MGSM

Japanese translation of math word problems (GSM8K)

Metric: Accuracy (exact match)
English-Japanese translation
WMT20 (en-ja)

Translation of news articles (English to Japanese)

Metric: BLEU
Japanese-English translation
WMT20 (ja-en)

Translation of news articles (Japanese to English)

Metric: BLEU
Multi-task natural language understanding
JMMLU

Japanese translation of the four-choice exam benchmark MMLU (53 subjects)

Metric: Accuracy
Reference: Yin et al. (2024)
Code generation
JHumanEval

Japanese translation of HumanEval (code generation benchmark)

Metric: pass@1
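
Several Japanese QA tasks above (JEMHopQA, NIILC, JSQuAD) are scored with character-level F1 between the generated answer and the reference. The sketch below shows one common bag-of-characters formulation; the exact implementation in the LLM-jp evaluation script, and the "lenient" variant used for post-trained models, may differ in normalization and matching details.

```python
from collections import Counter

def char_f1(prediction: str, reference: str) -> float:
    """Bag-of-characters F1 between a predicted and a reference answer."""
    if not prediction or not reference:
        return 0.0
    pred_chars, ref_chars = Counter(prediction), Counter(reference)
    overlap = sum((pred_chars & ref_chars).values())  # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

print(char_f1("徳川家康", "徳川家康"))  # 1.0 (exact match)
print(char_f1("家康", "徳川家康"))      # precision 1.0, recall 0.5 -> 0.667
```
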
Pre-trained (English)

This benchmark evaluates pre-trained LLMs (without post-training) on English benchmark datasets. The evaluation scores range from 0 (lowest) to 1 (highest).

Q&A based on facts and common sense
OpenBookQA

Four-choice questions based on scientific knowledge and common sense

Metric: Accuracy
Q&A based on knowledge
TriviaQA

Open-ended Q&A based on trivia questions

Metric: Accuracy (exact match)
Commonsense inference
HellaSwag

Four-choice questions to predict the next event

Metric: Accuracy
Reading comprehension
SQuAD2

Open-ended Q&A grounded in an evidence document

Metric: Accuracy (exact match)
Commonsense inference
XWINO

Two-choice questions to predict the antecedent of a pronoun

Metric: Accuracy
Multi-task natural language understanding
MMLU

Four-choice exam questions benchmark MMLU (53 subjects)

Metric: Accuracy
Mathematics
GSM8K

Math word problems

Metric: Accuracy (exact match)
Mathematics
MATH

Problems from high-school math competitions

Metric: Accuracy (exact match)
Collection of hard-to-solve tasks for LLMs
BIG-Bench-Hard (BBH)

23 challenging tasks from the BIG-Bench dataset (Srivastava et al., 2023)

Metric: Accuracy (exact match)
Code generation
HumanEval

Code generation ability measured by unit tests

Metric: pass@1

Evaluation tools

LLM-jp evaluation script (1.3.0)
Automatic evaluation tool for Japanese LLMs
JP Language Model Evaluation Harness (commit #9b42d41)
An evaluation framework for Japanese LLMs
Language Model Evaluation Harness (0.4.2)
An evaluation framework for LLMs (see the usage sketch after this list)
Code Generation LM Evaluation Harness (commit #0261c52)
An evaluation framework for code generation (HumanEval)
FastChat (commit #e86e70d0)
An automatic evaluation framework by an LLM (MT-Bench)
swallow-evaluation
The evaluation framework used in the Swallow Project (encompassing all of the tools above)
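
As a concrete example of how the pre-trained model benchmarks can be run, the sketch below uses the Python API of the Language Model Evaluation Harness (v0.4.x, lm_eval.simple_evaluate). The model name, task choice, and few-shot setting are placeholders; swallow-evaluation wraps these tools with its own task configurations, so treat this as an illustration of the underlying harness rather than the project's exact invocation.

```python
import lm_eval

# Evaluate a Hugging Face model on one of the English benchmarks used by the leaderboard.
# "hf" selects the Hugging Face backend; the model name below is a placeholder.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.1-8B",
    tasks=["hellaswag"],
    num_fewshot=10,  # the actual few-shot counts are defined in swallow-evaluation
    batch_size=8,
)
print(results["results"]["hellaswag"])  # per-task metrics such as accuracy
```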

Evaluated models

Model name # Parameters [B] Release date Post-training Reasoning mode Missing scores
ABEJA-QwQ32b-Reasoning-Japanese-v1.0 33 2025-04-25 Yes on
CyberAgentLM3-22B-chat 22 2024-07-09 Yes N/A
DeepSeek-R1-Distill-Llama-8B 8.0 2025-01-20 Yes on
DeepSeek-R1-Distill-Llama-70B 70 2025-01-20 Yes N/A
DeepSeek-R1-Distill-Qwen-7B 7.6 2025-01-20 Yes on
DeepSeek-R1-Distill-Qwen-14B 15 2025-01-20 Yes on
DeepSeek-R1-Distill-Qwen-32B 33 2025-01-20 Yes on
DeepSeek-R1-Distill-Qwen-14B-Japanese 15 2025-01-27 Yes on
DeepSeek-R1-Distill-Qwen-32B-Japanese 33 2025-01-27 Yes on
ELYZA-Thinking-1.0-Qwen-32B 33 2025-05-01 Yes on
Falcon3-1B-Base 1.7 2024-12-19 No
Falcon3-3B-Base 3.2 2024-12-19 No
Falcon3-7B-Base 7.5 2024-12-19 No
Falcon3-10B-Base 10 2024-12-19 No
Gemma 2 2B 2.6 2024-06-27 No
Gemma 2 2B IT 2.6 2024-06-27 Yes N/A
Gemma 2 9B 9.2 2024-06-27 No
Gemma 2 9B IT 9.2 2024-06-27 Yes N/A
Gemma 2 27B 27 2024-06-27 No
Gemma 2 27B IT 27 2024-06-27 Yes N/A
Gemma-2-Llama Swallow 2B 2.6 2025-05-19 No
Gemma-2-Llama Swallow 2B IT 2.6 2025-05-19 Yes N/A
Gemma-2-Llama Swallow 9B 9.2 2025-05-19 No
Gemma-2-Llama Swallow 9B IT 9.2 2025-05-19 Yes N/A
Gemma-2-Llama Swallow 27B 27 2025-05-19 No
Gemma-2-Llama Swallow 27B IT 27 2025-05-19 Yes N/A
Gemma 3 1B 1.0 2025-03-12 No
Gemma 3 1B IT 1.0 2025-03-12 Yes N/A
Gemma 3 4B 4.3 2025-03-12 No
Gemma 3 4B IT 4.3 2025-03-12 Yes N/A
Gemma 3 12B 12 2025-03-12 No
Gemma 3 12B IT 12 2025-03-12 Yes N/A
Gemma 3 27B 27 2025-03-12 No
Gemma 3 27B IT 27 2025-03-12 Yes N/A
GPT-4.1 (gpt-4.1-2025-04-14) 0 2025-04-14 Yes N/A
GPT-4o (gpt-4o-2024-08-06) 0 2024-08-06 Yes N/A
GPT-5 (gpt-5-2025-08-07) 0 2025-08-07 Yes on (middle)
gpt-oss-20b 22 (3.6) 2025-08-05 Yes on (middle)
gpt-oss-120b 120 (5.1) 2025-08-05 Yes on (middle)
Llama 3.1 8B 8.0 2024-07-23 No
Llama 3.1 8B Instruct 8.0 2024-07-23 Yes N/A
Llama 3.1 70B 70 2024-07-23 No
Llama-3.1-Nemotron-Nano-8B-v1 8.0 2025-03-18 Yes on
Llama 3.1 Swallow 8B Instruct v0.3 8.0 2024-12-23 Yes N/A
Llama 3.1 Swallow 8B v0.5 8.0 2025-06-25 No
Llama 3.1 Swallow 8B Instruct v0.5 8.0 2025-06-25 Yes N/A
Llama 3.2 1B 1.2 2024-09-25 No
Llama 3.2 3B 3.2 2024-09-25 No
Llama 3.3 70B Instruct 70 2024-12-06 Yes N/A
Llama-3.3-Nemotron-Super-49B-v1 50 2025-03-18 Yes N/A
Llama 3.3 Swallow 70B v0.4 70 2025-03-14 No
Llama 3.3 Swallow 70B Instruct v0.4 70 2025-03-10 Yes N/A
Llama 4 Scout 109 (17) 2025-04-04 No
Llama 4 Scout Instruct 109 (17) 2025-04-04 Yes N/A
llm-jp-3-1.8b 1.8 2024-09-25 No
llm-jp-3-3.7b 3.7 2024-09-25 No
llm-jp-3-7.2b 7.3 2025-02-05 No
llm-jp-3-13b 13 2024-09-25 No
llm-jp-3.1-1.8b-instruct4 1.8 2025-05-30 Yes N/A
llm-jp-3.1-13b-instruct4 14 2025-05-30 Yes N/A
MedGemma 27B IT 27 2025-07-09 Yes N/A
o3 (o3-2025-04-16) 0 2025-04-16 Yes on (middle)
o3-mini (o3-mini-2025-01-31) 0 2025-01-31 Yes on (middle)
Phi-4 15 2024-12-13 Yes N/A
Phi-4-reasoning-plus 15 2025-04-30 Yes on
PLaMo 2 1B 1.3 2025-02-21 No
PLaMo 2 8B 9.1 2025-02-21 No
Qwen2.5-1.5B 1.5 2024-09-19 No
Qwen2.5-3B 3.1 2024-09-19 No
Qwen2.5-7B 7.6 2024-09-19 No
Qwen2.5-7B-Instruct 7.6 2024-09-19 Yes N/A
Qwen2.5-14B 14 2024-09-19 No
Qwen2.5-14B-Instruct 15 2024-09-19 Yes N/A
Qwen2.5-32B 33 2024-09-19 No
Qwen2.5-32B-Instruct 33 2024-09-19 Yes N/A
Qwen2.5-72B 72 2024-09-19 No
Qwen3-0.6B 0.5 2025-04-29 Yes on
Qwen3-0.6B-Base 0.6 2025-04-29 No
Qwen3-1.7B 1.5 2025-04-29 Yes on
Qwen3-1.7B-Base 1.7 2025-04-29 No
Qwen3-4B 3.1 2025-04-29 Yes on
Qwen3-4B-Base 4.0 2025-04-29 No
Qwen3-8B-Base 8.2 2025-04-29 No
Qwen3-8B 8.2 2025-04-29 Yes on
Qwen3-14B-Base 15 2025-04-29 No
Qwen3-14B 15 2025-04-29 Yes on
Qwen3-32B 33 2025-04-29 Yes on
Qwen3-30B-A3B-Base 31 (3.3) 2025-04-29 No
Qwen3-235B-A22B-Instruct-2507 235 (22) 2025-07-23 Yes N/A
Qwen3-235B-A22B-Thinking-2507 235 (22) 2025-07-23 Yes on
Sarashina2-7B 7.3 2024-06-14 No
Sarashina2-13B 13 2024-06-14 No
Sarashina2-70B 70 2024-06-14 No
Sarashina2.2 0.5B 0.8 2025-03-07 No
Sarashina2.2 1B 1.4 2025-03-07 No
Sarashina2.2 3B 3.4 2025-03-07 No
Sarashina2.2 3B Instruct v0.1 3.4 2025-03-07 Yes N/A
TinySwallow-1.5B 1.5 2025-01-30 No

Acknowledgements

  • Tabler Admin Template licensed under MIT License
  • ApexCharts licensed under MIT License
  • Swallow icon by Game Icons.net, licensed under the CC Attribution License, via SVG Repo
  • The research and development of the large language model Swallow has been supported by the AIST Project "Research and Development on Generative AI Foundation Models in the Physical Domain"