The Swallow Project is conducting independent evaluation experiments on major large language models (LLMs) in parallel with the development of a high-performance LLM specialized in Japanese. By comparing LLMs developed not only in Japan but also worldwide, we can better understand where the Swallow Project's models currently stand. We conduct evaluations under fair conditions while considering the unique specifications of each LLM, such as tokenization and system prompts. By analyzing these evaluations in relation to the development methods of each LLM, we aim to explore the "recipe" for creating a high-performance LLM. This website visualizes the evaluation results of LLMs tested within the Swallow Project in bar charts, radar charts, and scatter plots. We hope this website serves not only as a guide for selecting high-performance LLMs but also as a reference for developing LLMs with strong Japanese language capabilities.
The content of this leaderboard (including data and graphs) is provided under a Creative Commons Attribution 4.0 (CC-BY 4.0) License, the evaluation software (swallow-evaluation) is distributed under the MIT License, and the source code of this website is also provided under the MIT License.
This benchmark evaluates post-trained LLMs, including reasoning models, on Japanese benchmark datasets. The evaluation scores range from 0 (lowest) to 1 (highest).
Japanese explainable multi-hop question answering
Proficient-level multi-discipline language understanding and reasoning
Graduate-level Google-proof question answering
Competition-level mathematics
Japanese translation of HumanEval (code generation benchmark)
Controllability of instruction following
Evaluation results for this task are excluded from the average score calculation, as sketched below.
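As an illustration of how a per-model average might be computed under this exclusion rule, the following Python sketch averages task scores in the 0 to 1 range while skipping excluded tasks. The task labels and the `EXCLUDED_TASKS` set are illustrative placeholders, not the identifiers actually used by swallow-evaluation.

```python
# Minimal sketch of the averaging rule: all task scores lie in [0, 1], and
# tasks marked as excluded do not contribute to the average. Task names and
# EXCLUDED_TASKS are illustrative placeholders, not swallow-evaluation labels.

EXCLUDED_TASKS = {"instruction_following"}  # assumed label for the excluded task

def average_score(task_scores: dict[str, float]) -> float:
    """Mean of the 0-1 task scores, skipping excluded tasks."""
    included = [
        score for task, score in task_scores.items()
        if task not in EXCLUDED_TASKS
    ]
    return sum(included) / len(included)

# Example: hypothetical scores for one model on the Japanese benchmark suite.
scores = {
    "multi_hop_qa": 0.62,
    "mmlu_prox_ja": 0.55,
    "gpqa_ja": 0.31,
    "math_competition": 0.48,
    "humaneval_ja": 0.70,
    "instruction_following": 0.80,  # reported, but excluded from the average
}
print(f"{average_score(scores):.3f}")  # averages the five included tasks
```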
This benchmark evaluates post-trained LLMs, including reasoning models, on English benchmark datasets. The evaluation scores range from 0 (lowest) to 1 (highest).
Four-choice questions to predict the next event
Proficient-level multi-discipline language understanding and reasoning
Graduate-level Google-proof question answering
Competition-level mathematics
Qualifying exam for the United States of America Mathematical Olympiad (USAMO)
Programming contest problems from competition platforms (LeetCode, AtCoder, and Codeforces)
The Japanese version of MT-Bench (Nejumi LLM Leaderboard edition) evaluates multi-turn dialogue capabilities. The test questions are based on v4, and the reference answers are derived from v2 with corrections to incorrect responses. The evaluation scores range from 0 (lowest) to 1 (highest).
Implementing algorithms in Python or C++, and creating websites using HTML.
Extracting named entities (such as author names and numerical values) and sentiment (e.g., positive or negative) from text.
Creating essays and strategies on topics related to law, economics, history, philosophy, and education.
Generating solutions for problems and word problems in algebra, geometry, probability, and number theory.
Generating answers to questions by leveraging common knowledge and reasoning skills.
Writing creative texts by assuming the persona of famous individuals or fictional characters and imagining hypothetical scenarios.
Generating answers and explanations on topics related to physics, chemistry, biology, geography, architecture, and machine learning.
Writing blog articles, email drafts, and fictional narratives.
English MT-Bench evaluates multi-turn dialogue capabilities. The evaluation scores range from 0 (lowest) to 1 (highest); a sketch of the score normalization follows the category list below.
Implementing algorithms in Python or C++, and creating websites using HTML.
Extracting named entities (such as author names and numerical values) and sentiment (e.g., positive or negative) from text.
Creating essays and strategies on topics related to law, economics, history, philosophy, and education.
Generating solutions for problems and word problems in algebra, geometry, probability, and number theory.
Generating answers to questions by leveraging common knowledge and reasoning skills.
Writing creative texts by assuming the persona of famous individuals or fictional characters and imagining hypothetical scenarios.
Generating answers and explanations on topics related to physics, chemistry, biology, geography, architecture, and machine learning.
Writing blog articles, email drafts, and fictional narratives.
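For reference, MT-Bench responses are typically rated by an LLM judge on a 1 to 10 scale, so mapping those ratings onto the 0 to 1 range shown here involves a normalization step. The sketch below assumes the simplest convention, dividing each judge rating by 10 and macro-averaging per category; the leaderboard's actual aggregation may differ in detail, and the category labels are abbreviations of the categories listed above.

```python
# Hedged sketch: aggregating MT-Bench judge ratings (1 to 10 per question and
# turn) into per-category and overall scores on a 0 to 1 scale.
# Assumes the normalization is a plain division by 10; the leaderboard's
# actual aggregation may differ in detail.
from collections import defaultdict
from statistics import mean

def mt_bench_scores(ratings: list[tuple[str, float]]) -> dict[str, float]:
    """ratings: (category, judge rating on the 1-10 scale) pairs."""
    by_category: dict[str, list[float]] = defaultdict(list)
    for category, rating in ratings:
        by_category[category].append(rating / 10.0)  # assumed 0-1 normalization
    scores = {cat: mean(vals) for cat, vals in by_category.items()}
    scores["overall"] = mean(scores[cat] for cat in by_category)  # macro average
    return scores

# Example with hypothetical ratings for two of the eight categories.
example = [("coding", 7.0), ("coding", 6.5), ("writing", 8.0), ("writing", 9.0)]
print(mt_bench_scores(example))
```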
This benchmark evaluates pre-trained LLMs (without post-training) on Japanese benchmark datasets. The evaluation scores range from 0 (lowest) to 1 (highest).
Five-choice questions created with a knowledge base
Open-ended Q&A to assess knowledge and reasoning ability
Open-ended Q&A that can be answered with an encyclopedia
Open-ended Q&A about Wikipedia articles
Task to generate a highlight from a BBC news article
Japanese translation of math word problems (GSM8K)
Translation of news articles (English to Japanese)
Translation of news articles (Japanese to English)
Japanese translation of the four-choice exam question benchmark MMLU (53 subjects)
Japanese translation of HumanEval (code generation benchmark)
This benchmark evaluates pre-trained LLMs (without post-training) on English benchmark datasets. The evaluation scores range from 0 (lowest) to 1 (highest).
Four-choice questions based on scientific knowledge and common sense
Open-ended Q&A based on trivia
Four-choice questions to predict the next event
Open-ended Q&A created from evidence documents
Two-choice questions to predict the antecedent of a pronoun
Four-choice exam question benchmark MMLU (53 subjects)
High school math competitions
23 challenging tasks from the BIG-Bench dataset (Srivastava et al., 2023)
Code generation ability measured by unit tests
Model name | # Parameters [B] (active in parentheses for MoE) | Release date | Post-training | Reasoning mode | Missing scores |
---|---|---|---|---|---|
ABEJA-QwQ32b-Reasoning-Japanese-v1.0 | 33 | 2025-04-25 | Yes | on | |
CyberAgentLM3-22B-chat | 22 | 2024-07-09 | Yes | N/A | |
DeepSeek-R1-Distill-Llama-8B | 8.0 | 2025-01-20 | Yes | on | |
DeepSeek-R1-Distill-Llama-70B | 70 | 2025-01-20 | Yes | N/A | |
DeepSeek-R1-Distill-Qwen-7B | 7.6 | 2025-01-20 | Yes | on | |
DeepSeek-R1-Distill-Qwen-14B | 15 | 2025-01-20 | Yes | on | |
DeepSeek-R1-Distill-Qwen-32B | 33 | 2025-01-20 | Yes | on | |
DeepSeek-R1-Distill-Qwen-14B-Japanese | 15 | 2025-01-27 | Yes | on | |
DeepSeek-R1-Distill-Qwen-32B-Japanese | 33 | 2025-01-27 | Yes | on | |
ELYZA-Thinking-1.0-Qwen-32B | 33 | 2025-05-01 | Yes | on | |
Falcon3-1B-Base | 1.7 | 2024-12-19 | No | ||
Falcon3-3B-Base | 3.2 | 2024-12-19 | No | ||
Falcon3-7B-Base | 7.5 | 2024-12-19 | No | ||
Falcon3-10B-Base | 10 | 2024-12-19 | No | ||
Gemma 2 2B | 2.6 | 2024-06-27 | No | ||
Gemma 2 2B IT | 2.6 | 2024-06-27 | Yes | N/A | |
Gemma 2 9B | 9.2 | 2024-06-27 | No | ||
Gemma 2 9B IT | 9.2 | 2024-06-27 | Yes | N/A | |
Gemma 2 27B | 27 | 2024-06-27 | No | ||
Gemma 2 27B IT | 27 | 2024-06-27 | Yes | N/A | |
Gemma-2-Llama Swallow 2B | 2.6 | 2025-05-19 | No | ||
Gemma-2-Llama Swallow 2B IT | 2.6 | 2025-05-19 | Yes | N/A | |
Gemma-2-Llama Swallow 9B | 9.2 | 2025-05-19 | No | ||
Gemma-2-Llama Swallow 9B IT | 9.2 | 2025-05-19 | Yes | N/A | |
Gemma-2-Llama Swallow 27B | 27 | 2025-05-19 | No | ||
Gemma-2-Llama Swallow 27B IT | 27 | 2025-05-19 | Yes | N/A | |
Gemma 3 1B | 1.0 | 2025-03-12 | No | ||
Gemma 3 1B IT | 1.0 | 2025-03-12 | Yes | N/A | |
Gemma 3 4B | 4.3 | 2025-03-12 | No | ||
Gemma 3 4B IT | 4.3 | 2025-03-12 | Yes | N/A | |
Gemma 3 12B | 12 | 2025-03-12 | No | ||
Gemma 3 12B IT | 12 | 2025-03-12 | Yes | N/A | |
Gemma 3 27B | 27 | 2025-03-12 | No | ||
Gemma 3 27B IT | 27 | 2025-03-12 | Yes | N/A | |
GPT-4.1 (gpt-4.1-2025-04-14) | N/A | 2025-04-14 | Yes | N/A | |
GPT-4o (gpt-4o-2024-08-06) | N/A | 2024-08-06 | Yes | N/A | |
GPT-5 (gpt-5-2025-08-07) | N/A | 2025-08-07 | Yes | on (medium) | |
gpt-oss-20b | 22 (3.6) | 2025-08-05 | Yes | on (medium) | |
gpt-oss-120b | 120 (5.1) | 2025-08-05 | Yes | on (medium) | |
Llama 3.1 8B | 8.0 | 2024-07-23 | No | ||
Llama 3.1 8B Instruct | 8.0 | 2024-07-23 | Yes | N/A | |
Llama 3.1 70B | 70 | 2024-07-23 | No | ||
Llama-3.1-Nemotron-Nano-8B-v1 | 8.0 | 2025-03-18 | Yes | on | |
Llama 3.1 Swallow 8B Instruct v0.3 | 8.0 | 2024-12-23 | Yes | N/A | |
Llama 3.1 Swallow 8B v0.5 | 8.0 | 2025-06-25 | No | ||
Llama 3.1 Swallow 8B Instruct v0.5 | 8.0 | 2025-06-25 | Yes | N/A | |
Llama 3.2 1B | 1.2 | 2024-09-25 | No | ||
Llama 3.2 3B | 3.2 | 2024-09-25 | No | ||
Llama 3.3 70B Instruct | 70 | 2024-12-06 | Yes | N/A | |
Llama-3.3-Nemotron-Super-49B-v1 | 50 | 2025-03-18 | Yes | N/A | |
Llama 3.3 Swallow 70B v0.4 | 70 | 2025-03-14 | No | ||
Llama 3.3 Swallow 70B Instruct v0.4 | 70 | 2025-03-10 | Yes | N/A | |
Llama 4 Scout | 109 (17) | 2025-04-04 | No | ||
Llama 4 Scout Instruct | 109 (17) | 2025-04-04 | Yes | N/A | |
llm-jp-3-1.8b | 1.8 | 2024-09-25 | No | ||
llm-jp-3-3.7b | 3.7 | 2024-09-25 | No | ||
llm-jp-3-7.2b | 7.3 | 2025-02-05 | No | ||
llm-jp-3-13b | 13 | 2024-09-25 | No | ||
llm-jp-3.1-1.8b-instruct4 | 1.8 | 2025-05-30 | Yes | N/A | |
llm-jp-3.1-13b-instruct4 | 14 | 2025-05-30 | Yes | N/A | |
MedGemma 27B IT | 27 | 2025-07-09 | Yes | N/A | |
o3 (o3-2025-04-16) | N/A | 2025-04-16 | Yes | on (medium) | |
o3-mini (o3-mini-2025-01-31) | N/A | 2025-01-31 | Yes | on (medium) | |
Phi-4 | 15 | 2024-12-13 | Yes | N/A | |
Phi-4-reasoning-plus | 15 | 2025-04-30 | Yes | on | |
PLaMo 2 1B | 1.3 | 2025-02-21 | No | ||
PLaMo 2 8B | 9.1 | 2025-02-21 | No | ||
Qwen2.5-1.5B | 1.5 | 2024-09-19 | No | ||
Qwen2.5-3B | 3.1 | 2024-09-19 | No | ||
Qwen2.5-7B | 7.6 | 2024-09-19 | No | ||
Qwen2.5-7B-Instruct | 7.6 | 2024-09-19 | Yes | N/A | |
Qwen2.5-14B | 14 | 2024-09-19 | No | ||
Qwen2.5-14B-Instruct | 15 | 2024-09-19 | Yes | N/A | |
Qwen2.5-32B | 33 | 2024-09-19 | No | ||
Qwen2.5-32B-Instruct | 33 | 2024-09-19 | Yes | N/A | |
Qwen2.5-72B | 72 | 2024-09-19 | No | ||
Qwen3-0.6B | 0.5 | 2025-04-29 | Yes | on | |
Qwen3-0.6B-Base | 0.6 | 2025-04-29 | No | ||
Qwen3-1.7B | 1.5 | 2025-04-29 | Yes | on | |
Qwen3-1.7B-Base | 1.7 | 2025-04-29 | No | ||
Qwen3-4B | 3.1 | 2025-04-29 | Yes | on | |
Qwen3-4B-Base | 4.0 | 2025-04-29 | No | ||
Qwen3-8B-Base | 8.2 | 2025-04-29 | No | ||
Qwen3-8B | 8.2 | 2025-04-29 | Yes | on | |
Qwen3-14B-Base | 15 | 2025-04-29 | No | ||
Qwen3-14B | 15 | 2025-04-29 | Yes | on | |
Qwen3-32B | 33 | 2025-04-29 | Yes | on | |
Qwen3-30B-A3B-Base | 31 (3.3) | 2025-04-29 | No | ||
Qwen3-235B-A22B-Instruct-2507 | 235 (22) | 2025-07-23 | Yes | N/A | |
Qwen3-235B-A22B-Thinking-2507 | 235 (22) | 2025-07-23 | Yes | on | |
Sarashina2-7B | 7.3 | 2024-06-14 | No | ||
Sarashina2-13B | 13 | 2024-06-14 | No | ||
Sarashina2-70B | 70 | 2024-06-14 | No | ||
Sarashina2.2 0.5B | 0.8 | 2025-03-07 | No | ||
Sarashina2.2 1B | 1.4 | 2025-03-07 | No | ||
Sarashina2.2 3B | 3.4 | 2025-03-07 | No | ||
Sarashina2.2 3B Instruct v0.1 | 3.4 | 2025-03-07 | Yes | N/A | |
TinySwallow-1.5B | 1.5 | 2025-01-30 | No |
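For readers who want to slice the model list programmatically, the sketch below shows one way to parse a CSV export of the table above, including the "total (active)" parameter notation used for mixture-of-experts models. The file name `models.csv` and its column names ("Model name", "Parameters", "Post-training", "Reasoning mode") are assumptions for illustration; the leaderboard does not necessarily ship such a file.

```python
# Hypothetical sketch: filtering a CSV export of the model table above.
# "models.csv" and its column names are assumptions, not a file distributed
# with the leaderboard.
import csv
import re

def parse_params(cell: str) -> tuple[float, float] | None:
    """Parse "total" or "total (active)" parameter counts in billions.

    Returns (total, active); dense models have total == active. Returns None
    for undisclosed counts (e.g. proprietary API models listed as "N/A").
    """
    m = re.match(r"\s*([\d.]+)\s*(?:\(\s*([\d.]+)\s*\))?", cell)
    if m is None:
        return None
    total = float(m.group(1))
    active = float(m.group(2)) if m.group(2) else total
    return total, active

with open("models.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        params = parse_params(row["Parameters"])
        if params is None:
            continue  # skip models with undisclosed parameter counts
        total, active = params
        # Example filter: post-trained models activating at most 10B parameters.
        if row["Post-training"] == "Yes" and active <= 10:
            print(row["Model name"], total, active, row["Reasoning mode"])
```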