The Swallow Project conducts independent evaluation experiments on major large language models (LLMs) in parallel with its development of high-performance LLMs specialized in Japanese. By comparing LLMs developed both in Japan and worldwide, we can gauge where the Swallow Project currently stands. We conduct evaluations under fair conditions while accounting for the unique specifications of each LLM, such as tokenization and system prompts. By analyzing the evaluation results in relation to how each LLM was developed, we aim to uncover the "recipe" for building a high-performance LLM. This website visualizes the evaluation results of the LLMs tested within the Swallow Project as bar charts, radar charts, and scatter plots. We hope it serves not only as a guide for selecting high-performance LLMs but also as a reference for developing LLMs with strong Japanese language capabilities.
The content of this leaderboard (including data and graphs) is provided under a Creative Commons Attribution 4.0 (CC-BY 4.0) License, the evaluation software (swallow-evaluation) is distributed under the MIT License, and the source code of this website is also provided under the MIT License.
We evaluate LLMs on question answering and reading comprehension to assess language understanding and common knowledge, summarization and translation to measure language generation, and code generation and mathematics to test logical reasoning abilities. The evaluation scores range from 0 (lowest) to 1 (highest).
Five-choice questions created with a knowledge base
Open-ended Q&A to assess the amount of knowledge and reasoning ability
Open-ended Q&A answerable using an encyclopedia
Open-ended Q&A on Wikipedia articles
Task to generate a highlight from a BBC news article
Japanese translation of math word problems (GSM8K)
Translation of news articles (English to Japanese)
Translation of news articles (Japanese to English)
Japanese translation of the four-choice exam benchmark MMLU (53 subjects)
Japanese translation of HumanEval (code generation benchmark)
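As a concrete illustration of how these 0-to-1 task scores can be combined into a single Japanese-average figure, here is a minimal Python sketch. The task keys and the unweighted arithmetic mean are assumptions made for illustration only (the swallow-evaluation pipeline may group or weight tasks differently), and the numbers are dummy values, not leaderboard results.

```python
# Minimal sketch (not the leaderboard's actual code): aggregate per-task
# Japanese scores, each already normalized to the 0-1 range, into one average.
# Task keys and the unweighted mean are illustrative assumptions; the values
# below are dummy numbers, not real evaluation results.
japanese_scores = {
    "multiple_choice_qa": 0.82,      # five-choice questions from a knowledge base
    "open_qa_reasoning": 0.55,       # open-ended Q&A on knowledge and reasoning
    "encyclopedic_qa": 0.60,         # open-ended Q&A answerable with an encyclopedia
    "reading_comprehension": 0.88,   # open-ended Q&A on Wikipedia articles
    "summarization": 0.21,           # highlight generation for BBC news articles
    "math_word_problems": 0.47,      # Japanese translation of GSM8K
    "translation_en_ja": 0.28,
    "translation_ja_en": 0.24,
    "exam_questions": 0.58,          # Japanese translation of MMLU
    "code_generation": 0.35,         # Japanese translation of HumanEval
}

def average_score(scores: dict[str, float]) -> float:
    """Unweighted arithmetic mean of 0-1 task scores."""
    return sum(scores.values()) / len(scores)

print(f"Japanese average: {average_score(japanese_scores):.3f}")
```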
We evaluate LLMs on question answering, reading comprehension, and exam questions to assess language understanding and common knowledge, and on code generation and mathematics to test logical reasoning abilities. The evaluation scores range from 0 (lowest) to 1 (highest).
Four-choice questions based on scientific knowledge and common sense
Open-ended Q&A based on trivia
Four-choice questions to predict the next event
Open-ended Q&A grounded in an evidence document
Two-choice questions to predict the antecedent of a pronoun
Four-choice exam benchmark MMLU (53 subjects)
Math word problems
High school math competitions
23 challenging tasks from the BIG-Bench dataset (Srivastava et al., 2023)
Code generation ability measured by unit tests
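Some models in the table further below lack certain English task scores (see the "Missing scores" column), and the English average ("En Avg") is then reported as missing too. The Python sketch below shows one plausible way to implement that convention; the function name and the rule of leaving the aggregate undefined whenever any task score is absent are assumptions, not the leaderboard's documented behavior.

```python
from typing import Optional

# Sketch of one plausible handling of missing scores: if any English task
# score is unavailable (None), the aggregate "En Avg" is treated as missing
# as well. This mirrors the table below, where "En Avg" appears alongside the
# missing task scores, but the exact rule is an assumption.
def english_average(scores: dict[str, Optional[float]]) -> Optional[float]:
    """Mean of 0-1 task scores, or None if any score is missing."""
    if any(value is None for value in scores.values()):
        return None
    return sum(scores.values()) / len(scores)

# Dummy example: HumanEval was not evaluated, so no average is reported.
example = {"OpenBookQA": 0.40, "TriviaQA": 0.71, "HumanEval": None}
print(english_average(example))  # -> None
```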
We use the Japanese version of MT-Bench (Nejumi LLM Leaderboard edition) to evaluate dialogue capabilities. The test questions are based on v4, and the reference answers are derived from v2, with incorrect answers corrected. The evaluation scores range from 0 (lowest) to 1 (highest).
Implementing algorithms in Python or C++, and creating websites using HTML.
Extracting named entities (such as author names and numerical values) and sentiment (e.g., positive or negative) from text.
Creating essays and strategies on topics related to law, economics, history, philosophy, and education.
Generating solutions for problems and word problems in algebra, geometry, probability, and number theory.
Generating answers to questions by leveraging common knowledge and reasoning skills.
Writing creative texts by assuming the persona of famous individuals or fictional characters and imagining hypothetical scenarios.
Generating answers and explanations on topics related to physics, chemistry, biology, geography, architecture, and machine learning.
Writing blog articles, email drafts, and fictional narratives.
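MT-Bench responses are conventionally rated by an LLM judge on a 1-10 scale, while this leaderboard reports scores between 0 and 1, so a divide-by-10 normalization is the natural mapping. The sketch below assumes that convention and an unweighted mean over the eight categories above; the exact normalization and aggregation are not spelled out in this section, so treat both as assumptions, and the numbers as dummy values.

```python
# Sketch: normalize per-category MT-Bench judge scores (assumed 1-10 scale,
# the common MT-Bench convention) to the 0-1 range and average them.
# Category names follow the standard MT-Bench categories listed above;
# the scores are dummy values, not leaderboard results.
mt_bench_judge_scores = {
    "coding": 6.2,
    "extraction": 7.1,
    "humanities": 8.4,
    "math": 5.0,
    "reasoning": 5.8,
    "roleplay": 7.9,
    "stem": 7.6,
    "writing": 8.1,
}

normalized = {category: score / 10.0 for category, score in mt_bench_judge_scores.items()}
overall = sum(normalized.values()) / len(normalized)
print(f"MT-Bench average on the 0-1 scale: {overall:.3f}")
```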
Model name | # Parameters [B] | Release date | Type | Missing scores |
---|---|---|---|---|
Aya Expanse 8B | 8.0 | 2024-10-24 | chat | |
Aya Expanse 32B | 32 | 2024-10-24 | chat | |
CyberAgentLM3-22B-chat | 22 | 2024-07-09 | chat | |
Falcon3-1B-Base | 1.7 | 2024-12-19 | base | |
Falcon3-1B-Instruct | 1.7 | 2024-12-19 | chat | |
Falcon3-3B-Base | 3.2 | 2024-12-19 | base | |
Falcon3-3B-Instruct | 3.2 | 2024-12-19 | chat | |
Falcon3-7B-Base | 7.5 | 2024-12-19 | base | |
Falcon3-7B-Instruct | 7.5 | 2024-12-19 | chat | |
Falcon3-10B-Base | 10 | 2024-12-19 | base | |
Falcon3-10B-Instruct | 10 | 2024-12-19 | chat | |
Gemma 2 2B | 2.6 | 2024-06-27 | base | |
Gemma 2 2B IT | 2.6 | 2024-06-27 | chat | |
Gemma 2 9B | 9.2 | 2024-06-27 | base | |
Gemma 2 9B IT | 9.2 | 2024-06-27 | chat | |
Gemma 2 27B | 27 | 2024-06-27 | base | |
Gemma 2 27B IT | 27 | 2024-06-27 | chat | |
Gemma 2 Baku 2B | 2.6 | 2024-10-03 | base | |
Gemma 2 Baku 2B IT | 2.6 | 2024-10-03 | chat | |
Gemma 2 JPN | 2.6 | 2024-06-27 | chat | |
GPT-3.5 (gpt-3.5-turbo-0125) | N/A | 2024-01-25 | chat | En Avg OpenBookQA TriviaQA HellaSwag SQuAD2 XWINO MMLU GSM8K MATH BBH HumanEval |
GPT-4-turbo (gpt-4-turbo-2024-04-09) | N/A | 2024-04-09 | chat | En Avg OpenBookQA TriviaQA HellaSwag SQuAD2 XWINO MMLU GSM8K MATH BBH HumanEval |
GPT-4o (gpt-4o-2024-05-13) | N/A | 2024-05-13 | chat | En Avg OpenBookQA TriviaQA HellaSwag SQuAD2 XWINO MMLU GSM8K MATH BBH HumanEval |
GPT-4o (gpt-4o-2024-08-06) | N/A | 2024-08-06 | chat | En Avg OpenBookQA TriviaQA HellaSwag SQuAD2 XWINO MMLU GSM8K MATH BBH HumanEval |
GPT-4o-mini (gpt-4o-mini-2024-07-18) | N/A | 2024-07-18 | chat | En Avg OpenBookQA TriviaQA HellaSwag SQuAD2 XWINO MMLU GSM8K MATH BBH HumanEval |
Llama 3 8B | 8.0 | 2024-04-18 | base | |
Llama 3 8B Instruct | 8.0 | 2024-04-18 | chat | |
Llama 3 70B | 70 | 2024-04-18 | base | |
Llama 3 70B Instruct | 70 | 2024-04-18 | chat | |
Llama-3-ELYZA-JP-8B | 8.0 | 2024-06-26 | chat | |
Llama 3 heron brain 8B v0.3 | 8.0 | 2024-07-01 | chat | |
Llama 3 heron brain 70B v0.3 | 70 | 2024-07-01 | chat | |
Llama 3 Swallow 8B | 8.0 | 2024-07-01 | base | |
Llama 3 Swallow 8B Instruct | 8.0 | 2024-07-01 | chat | |
Llama 3 Swallow 70B | 70 | 2024-07-01 | base | |
Llama 3 Swallow 70B Instruct | 70 | 2024-07-01 | chat | |
Llama 3 Youko 8B | 8.0 | 2024-05-07 | base | |
Llama 3 Youko 8B Instruct | 8.0 | 2024-05-07 | chat | |
Llama 3 Youko 70B | 70 | 2024-07-25 | base | |
Llama 3 Youko 70B Instruct | 70 | 2024-07-25 | chat | |
Llama 3.1 8B | 8.0 | 2024-07-23 | base | |
Llama 3.1 8B Instruct | 8.0 | 2024-07-23 | chat | |
Llama 3.1 70B | 70 | 2024-07-23 | base | |
Llama 3.1 70B Instruct | 70 | 2024-07-23 | chat | |
Llama-3.1-70B-Japanese-Instruct-2407 | 70 | 2024-07-23 | chat | |
Llama 3.1 Swallow 8B v0.1 | 8.0 | 2024-10-08 | base | |
Llama 3.1 Swallow 8B Instruct v0.1 | 8.0 | 2024-10-08 | chat | |
Llama 3.1 Swallow 70B v0.1 | 70 | 2024-10-08 | base | |
Llama 3.1 Swallow 70B Instruct v0.1 | 70 | 2024-10-08 | chat | |
Llama 3.1 Swallow 8B v0.2 | 8.0 | 2024-11-11 | base | |
Llama 3.1 Swallow 8B Instruct v0.2 | 8.0 | 2024-11-11 | chat | |
Llama 3.1 Swallow 8B Instruct v0.3 | 8.0 | 2024-12-23 | chat | |
Llama 3.1 Swallow 70B Instruct v0.3 | 70 | 2024-12-30 | chat | |
Llama 3.2 1B | 1.2 | 2024-09-25 | base | |
Llama 3.2 1B Instruct | 1.2 | 2024-09-25 | chat | |
Llama 3.2 3B | 3.2 | 2024-09-25 | base | |
Llama 3.2 3B Instruct | 3.2 | 2024-09-25 | chat | |
Llama 3.3 70B Instruct | 70 | 2024-12-06 | chat | |
Llama 3.3 Swallow 70B v0.4 | 70 | 2025-03-14 | base | |
Llama 3.3 Swallow 70B Instruct v0.4 | 70 | 2025-03-10 | chat | |
llm-jp-3-1.8b | 1.8 | 2024-09-25 | base | |
llm-jp-3-1.8b-instruct | 1.8 | 2024-09-25 | chat | |
llm-jp-3-3.7b | 3.7 | 2024-09-25 | base | |
llm-jp-3-3.7b-instruct | 3.7 | 2024-09-25 | chat | |
llm-jp-3-13b | 13 | 2024-09-25 | base | |
llm-jp-3-13b-instruct | 13 | 2024-09-25 | chat | |
Mistral-Nemo-Base-2407 (12B) | 12 | 2024-07-18 | base | |
Mistral-Nemo-Instruct-2407 (12B) | 12 | 2024-07-18 | chat | |
Mistral-NeMo-Minitron 8B | 8.4 | 2024-08-21 | base | |
Mistral-NeMo-Minitron 8B Instruct | 8.4 | 2024-08-21 | chat | |
Mistral-7B-v0.3 | 7.2 | 2024-05-22 | base | |
Mistral-7B-Instruct-v0.3 | 7.2 | 2024-05-22 | chat | |
Mixtral-8x22B-v0.1 | 141 | 2024-04-17 | base | |
Mixtral-8x22B-Instruct-v0.1 | 141 | 2024-04-17 | chat | |
Phi-3-Mini-128K-Instruct | 3.8 | 2024-04-23 | chat | |
Phi-4 | 14 | 2024-12-13 | chat | |
PLaMo 2 1B | 1.3 | 2025-02-21 | base | |
PLaMo 2 8B | 9.1 | 2025-02-21 | base | |
Qwen2-7B | 7.6 | 2024-06-07 | base | |
Qwen2-7B-Instruct | 7.6 | 2024-06-07 | chat | |
Qwen2-72B | 72 | 2024-06-07 | base | |
Qwen2-72B-Instruct | 72 | 2024-06-07 | chat | |
Qwen2.5-0.5B | 0.5 | 2024-09-19 | base | |
Qwen2.5-0.5B-Instruct | 0.5 | 2024-09-19 | chat | |
Qwen2.5-1.5B | 1.5 | 2024-09-19 | base | |
Qwen2.5-1.5B-Instruct | 1.5 | 2024-09-19 | chat | |
Qwen2.5-3B | 3.1 | 2024-09-19 | base | |
Qwen2.5-3B-Instruct | 3.1 | 2024-09-19 | chat | |
Qwen2.5-7B | 7.6 | 2024-09-19 | base | |
Qwen2.5-7B-Instruct | 7.6 | 2024-09-19 | chat | |
Qwen2.5-14B-Instruct | 14 | 2024-09-25 | chat | |
Qwen2.5-32B-Instruct | 32 | 2024-09-25 | chat | |
Qwen2.5-72B | 72 | 2024-09-19 | base | |
Qwen2.5-72B-Instruct | 72 | 2024-09-19 | chat | |
Sarashina2-7B | 7.3 | 2024-06-14 | base | |
Sarashina2-13B | 13 | 2024-06-14 | base | |
Sarashina2-70B | 70 | 2024-06-14 | base | |
Stockmark-100b | 100 | 2024-05-16 | base | |
Swallow 7B | 6.7 | 2023-12-19 | base | |
Swallow 13B | 13 | 2023-12-19 | base | |
Swallow 70B | 70 | 2023-12-19 | base | |
Swallow-MS 7B v0.1 | 7.2 | 2024-03-11 | base | |
Swallow-MS-7b-instruct-v0.1 | 7.2 | 2024-03-11 | chat | |
Swallow-MX 8x7B v0.1 | 47 | 2024-03-11 | base | |
Swallow-7b-instruct-v0.1 | 6.7 | 2023-12-19 | chat | |
Swallow-70b-instruct-v0.1 | 70 | 2023-12-19 | chat | |
Tanuki-8B-dpo-v1.0 | 7.5 | 2024-08-30 | chat | |
Tanuki-8x8B-dpo-v1.0 | 47 | 2024-08-30 | chat | |
TinySwallow-1.5B | 1.5 | 2025-01-30 | base | |
TinySwallow-1.5B-Instruct | 1.5 | 2025-01-30 | chat | |
Yi-1.5 6B | 6.1 | 2024-05-13 | base | |
Yi-1.5 9B | 8.8 | 2024-05-13 | base | |
Yi-1.5 34B | 34 | 2024-05-13 | base |