In parallel with developing high-performance LLMs specialized in Japanese, the Swallow Project conducts independent evaluation experiments on major large language models (LLMs). Comparing LLMs developed both in Japan and around the world lets us gauge where the Swallow Project currently stands. We conduct evaluations under fair conditions while accounting for the unique specifications of each LLM, such as its tokenization and system prompt. By analyzing the evaluation results in relation to how each LLM was developed, we aim to uncover the "recipe" for building a high-performance LLM. This website visualizes the evaluation results of the LLMs tested in the Swallow Project as bar charts, radar charts, and scatter plots. We hope it serves not only as a guide for selecting a high-performance LLM but also as a reference for developing LLMs with strong Japanese language capabilities.

The content of this leaderboard (including data and graphs) is provided under a Creative Commons Attribution 4.0 (CC-BY 4.0) License, the evaluation software (swallow-evaluation) is distributed under the MIT License, and the source code of this website is also provided under the MIT License.

Change log

  • 2025-03-10: Relaunched as the Swallow LLM Leaderboard.
  • 2024-07-01: The predecessor project, Japanese LLM Evaluation, was publicly released.

Evaluation tasks

Japanese understanding & generation

We evaluate LLMs on question answering and reading comprehension to assess language understanding and common knowledge, summarization and translation to measure language generation, and code generation and mathematics to test logical reasoning abilities. The evaluation scores range from 0 (lowest) to 1 (highest).
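As a concrete illustration, one common way to score the multiple-choice tasks below (used, for example, by the Language Model Evaluation Harness) is to build a few-shot prompt and pick the candidate answer to which the model assigns the highest log-likelihood; generation tasks instead compare the generated string against a reference. The sketch below shows the likelihood-ranking variant for a Hugging Face causal LM. The prompt template and the per-task choice of scoring method are assumptions here, not the exact swallow-evaluation code.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Illustrative few-shot multiple-choice scoring; not the exact
    # swallow-evaluation implementation.
    name = "tokyotech-llm/Llama-3.1-Swallow-8B-v0.2"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    def choice_logprob(prompt: str, choice: str) -> float:
        """Sum of log-probabilities the model assigns to `choice` after `prompt`.
        Assumes tokenizing `prompt` yields a prefix of tokenizing `prompt + choice`."""
        prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
        ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits
        logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
        # Positions prompt_len-1 .. L-2 predict the tokens of `choice`.
        return sum(logprobs[t, ids[0, t + 1]].item()
                   for t in range(prompt_len - 1, ids.shape[1] - 1))

    def predict(shots: list[str], question: str, choices: list[str]) -> str:
        # The "答え: " template is a hypothetical example of a few-shot prompt.
        prompt = "\n\n".join(shots + [question]) + "\n答え: "
        return max(choices, key=lambda c: choice_logprob(prompt, c))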

Q&A regarding commonsense and inference
JCommonsenseQA (JComQA)

Five-choice questions created with a knowledge base

Metric: Accuracy
Setting: 4-shot
Multi-hop Q&A
JEMHopQA

Open-ended Q&A to assess the amount of knowledge and reasoning ability

Metric: Character F1
Setting: 4-shot
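Character F1, used for JEMHopQA and for NIILC and JSQuAD below, gives partial credit for character-level overlap between the prediction and the gold answer. A minimal sketch (real implementations may additionally normalize whitespace and punctuation):

    from collections import Counter

    def char_f1(prediction: str, gold: str) -> float:
        """F1 over the multiset of characters shared by prediction and gold."""
        overlap = sum((Counter(prediction) & Counter(gold)).values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(prediction)
        recall = overlap / len(gold)
        return 2 * precision * recall / (precision + recall)

    print(char_f1("徳川家康", "徳川家康公"))  # ≈ 0.889: partial credit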
Classical Q&A
NIILC

Open-ended Q&A whose answers can be found in an encyclopedia

Metric: Character F1
Setting: 4-shot
Reference: Sekine (2003)
Reading comprehension
JSQuAD

Open-ended Q&A on Wikipedia articles

Metric: Character F1
Setting: 4-shot
Summarization
XL-Sum

Generating a highlight from a BBC news article

Metric: ROUGE-2
Setting: 1-shot
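ROUGE-2 is the F1 score of bigram overlap between the generated and reference summaries. Because Japanese is written without spaces, the text must be tokenized first; the tokenizer choice is an assumption here (a morphological analyzer such as MeCab is typical), so the sketch below takes pre-tokenized input:

    from collections import Counter

    def rouge2(candidate: list[str], reference: list[str]) -> float:
        """F1 of bigram overlap between two token sequences."""
        def bigrams(tokens: list[str]) -> Counter:
            return Counter(zip(tokens, tokens[1:]))
        cand, ref = bigrams(candidate), bigrams(reference)
        overlap = sum((cand & ref).values())
        if overlap == 0:
            return 0.0
        precision = overlap / sum(cand.values())
        recall = overlap / sum(ref.values())
        return 2 * precision * recall / (precision + recall)

    print(rouge2(["首相", "が", "訪米", "した"],
                 ["首相", "が", "訪米", "を", "発表"]))  # ≈ 0.571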
Mathematics
MGSM

Japanese translation of math word problems (GSM8K)

Metric: Accuracy (exact match)
Setting: 4-shot
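Exact-match accuracy for math tasks such as MGSM (and GSM8K below) is typically computed by extracting the final number from the model's output and comparing it with the gold answer. The extraction rule below is an illustrative assumption, not the exact swallow-evaluation logic:

    import re

    def extract_final_number(text: str) -> str | None:
        """Return the last number in the text, with thousands separators removed."""
        numbers = re.findall(r"-?\d[\d,]*\.?\d*", text)
        return numbers[-1].replace(",", "") if numbers else None

    def exact_match(model_output: str, gold_answer: str) -> bool:
        return extract_final_number(model_output) == gold_answer

    print(exact_match("48 / 2 = 24。答えは 24 個です。", "24"))  # True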
English-Japanese translation
WMT20 (en-ja)

Translation of news articles (English to Japanese)

Metric: BLEU
Setting: 4-shot
Japanese-English translation
WMT20 (ja-en)

Translation of news articles (Japanese to English)

Metric: BLEU
Setting: 4-shot
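Both translation tasks are scored with corpus-level BLEU. The sketch below uses the sacrebleu package; for Japanese output (en-ja), BLEU needs a Japanese-aware tokenizer such as sacrebleu's "ja-mecab", while English output (ja-en) can use the default "13a" tokenizer. Whether swallow-evaluation uses exactly this configuration is an assumption.

    import sacrebleu

    hyps = ["猫がマットの上に座っている。"]   # system outputs
    refs = [["猫がマットの上に座った。"]]      # one reference stream

    # tokenize="ja-mecab" segments Japanese text before n-gram matching;
    # use the default tokenizer for the ja-en (English output) direction.
    bleu = sacrebleu.corpus_bleu(hyps, refs, tokenize="ja-mecab")
    print(bleu.score / 100)  # sacrebleu reports 0-100; this site uses 0-1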
Multi-task natural language understanding
JMMLU

Japanese translation of the four-choice exam benchmark MMLU (53 subjects)

Metric: Accuracy
Setting: 5-shot
Reference: Yin et al. (2024)
Code generation
JHumanEval

Japanese translation of HumanEval (a code generation benchmark)

Metric: pass@1
Setting: 0-shot, 10 trials
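pass@1 over 10 trials follows the unbiased pass@k estimator of Chen et al. (2021): with n sampled completions of which c pass the unit tests, pass@k = 1 - C(n-c, k)/C(n, k), which for k = 1 reduces to c/n. A direct implementation:

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Probability that at least one of k draws from n samples passes,
        given that c of the n samples pass the unit tests."""
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    print(pass_at_k(n=10, c=3, k=1))  # 0.3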
English understanding & generation

We evaluate LLMs on question answering, reading comprehension, and exam questions to assess language understanding and common knowledge, summarization to measure language generation, and code generation and mathematics to test logical reasoning abilities. The evaluation scores range from 0 (lowest) to 1 (highest).

Q&A based on facts and common sense
OpenBookQA

Four-choice questions based on scientific knowledge and common sense

Metric: Accuracy
Setting: 4-shot
Q&A based on knowledge
TriviaQA

Open-ended Q&A based on trivia

Metric: Accuracy (exact match)
Setting: 4-shot
Commonsense inference
HellaSwag

Four-choice questions to predict the next event

Metric: Accuracy
Setting: 4-shot
Reading comprehension
SQuAD2

Open-ended Q&A grounded in an evidence document (including unanswerable questions)

Metric: Accuracy (exact match)
Setting: 4-shot
Commonsense inference
XWINO

Two-choice questions to predict the antecedent of a pronoun

Metric: Accuracy
Setting: 4-shot
Multitask natural language understanding
MMLU

Four-choice exam benchmark covering 57 subjects

Metric: Accuracy
Setting: 5-shot
Mathematics
GSM8K

Math word problems

Metric: Accuracy (exact match)
Setting: 4-shot
Mathematics
MATH

Problems from high school math competitions

Metric: Accuracy (exact match)
Setting: 4-shot
Collection of tasks that are hard for LLMs
BIG-Bench-Hard (BBH)

23 challenging tasks from the BIG-Bench dataset (Srivastava et al., 2023)

Metric: Accuracy (exact match)
Setting: 3-shot, CoT
Code generation
HumanEval

Code generation ability measured by unit tests

Metric: pass@1
Setting: 0-shot, 10 trials
Japanese MT-Bench

We used the Japanese version of MT-Bench (the Nejumi LLM Leaderboard edition) to evaluate dialogue capabilities. The test questions are based on v4, and the reference answers are derived from v2, with erroneous answers corrected. The evaluation scores range from 0 (lowest) to 1 (highest).
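In the original MT-Bench protocol, the judge model assigns each answer an integer rating from 1 to 10; since this leaderboard reports scores in [0, 1], a plausible reading is that per-category averages are rescaled by 1/10. The helper below is hypothetical (its name and the rescaling are assumptions) and only illustrates that normalization:

    from statistics import mean

    def category_score(judge_ratings: list[int]) -> float:
        """Average the judge's 1-10 ratings for a category, rescaled to [0, 1].
        Hypothetical helper; the exact aggregation is an assumption."""
        return mean(judge_ratings) / 10

    print(category_score([8, 7, 9, 6]))  # 0.75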

Coding

Implementing algorithms in Python or C++, and creating websites using HTML.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
Extraction

Extracting named entities (such as author names and numerical values) and sentiment (e.g., positive or negative) from text.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
Humanities

Creating essays and strategies on topics related to law, economics, history, philosophy, and education.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
Math

Generating solutions for problems and word problems in algebra, geometry, probability, and number theory.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
Reasoning

Generating answers to questions by leveraging common knowledge and reasoning skills.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
Roleplay

Writing creative texts by assuming the persona of famous individuals or fictional characters and imagining hypothetical scenarios.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
STEM

Generating answers and explanations on topics related to physics, chemistry, biology, geography, architecture, and machine learning.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)
Writing

Writing blog articles, email drafts, and fictional narratives.

Metric: Reference-guided grading by GPT-4o (gpt-4o-2024-08-06)

Evaluation tools

LLM-jp evaluation script (1.3.0)
Automatic evaluation tool for Japanese LLMs
JP Language Model Evaluation Harness (commit #9b42d41)
An evaluation framework for Japanese LLMs
Language Model Evaluation Harness (0.4.2)
An evaluation framework for LLMs
Code Generation LM Evaluation Harness (commit #0261c52)
An evaluation framework for code generation (HumanEval)
FastChat (commit #e86e70d0)
An automatic evaluation framework by an LLM (MT-Bench)
swallow-evaluation
The evaluation framework used in the Swallow Project (wrapping all of the tools above)
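As an illustration of how such tools are driven, the snippet below calls the Language Model Evaluation Harness (0.4.x) through its Python API; swallow-evaluation wraps these tools with its own task configurations, so the task name and arguments here are illustrative only.

    import lm_eval

    # Evaluate a model on GSM8K with 4-shot prompting (illustrative arguments).
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=tokyotech-llm/Llama-3.1-Swallow-8B-v0.2",
        tasks=["gsm8k"],
        num_fewshot=4,
        batch_size=8,
    )
    print(results["results"]["gsm8k"])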

Evaluated models

Model name # Parameters [B] Release date Type Missing scores
Aya Expanse 8B 8.0 2024-10-24 chat
Aya Expanse 32B 32 2024-10-24 chat
CyberAgentLM3-22B-chat 22 2024-07-09 chat
Falcon3-1B-Base 1.7 2024-12-19 base
Falcon3-1B-Instruct 1.7 2024-12-19 chat
Falcon3-3B-Base 3.2 2024-12-19 base
Falcon3-3B-Instruct 3.2 2024-12-19 chat
Falcon3-7B-Base 7.5 2024-12-19 base
Falcon3-7B-Instruct 7.5 2024-12-19 chat
Falcon3-10B-Base 10 2024-12-19 base
Falcon3-10B-Instruct 10 2024-12-19 chat
Gemma 2 2B 2.6 2024-06-27 base
Gemma 2 2B IT 2.6 2024-06-27 chat
Gemma 2 9B 9.2 2024-06-27 base
Gemma 2 9B IT 9.2 2024-06-27 chat
Gemma 2 27B 27 2024-06-27 base
Gemma 2 27B IT 27 2024-06-27 chat
Gemma 2 Baku 2B 2.6 2024-10-03 base
Gemma 2 Baku 2B IT 2.6 2024-10-03 chat
Gemma 2 JPN 2.6 2024-10-03 chat
GPT-3.5 (gpt-3.5-turbo-0125) N/A 2024-01-25 chat En Avg OpenBookQA TriviaQA HellaSwag SQuAD2 XWINO MMLU GSM8K MATH BBH HumanEval
GPT-4-turbo (gpt-4-turbo-2024-04-09) N/A 2024-04-09 chat En Avg OpenBookQA TriviaQA HellaSwag SQuAD2 XWINO MMLU GSM8K MATH BBH HumanEval
GPT-4o (gpt-4o-2024-05-13) N/A 2024-05-13 chat En Avg OpenBookQA TriviaQA HellaSwag SQuAD2 XWINO MMLU GSM8K MATH BBH HumanEval
GPT-4o (gpt-4o-2024-08-06) N/A 2024-08-06 chat En Avg OpenBookQA TriviaQA HellaSwag SQuAD2 XWINO MMLU GSM8K MATH BBH HumanEval
GPT-4o-mini (gpt-4o-mini-2024-07-18) N/A 2024-07-18 chat En Avg OpenBookQA TriviaQA HellaSwag SQuAD2 XWINO MMLU GSM8K MATH BBH HumanEval
Llama 3 8B 8.0 2024-04-18 base
Llama 3 8B Instruct 8.0 2024-04-18 chat
Llama 3 70B 70 2024-04-18 base
Llama 3 70B Instruct 70 2024-04-18 chat
Llama-3-ELYZA-JP-8B 8.0 2024-06-26 chat
Llama 3 heron brain 8B v0.3 8.0 2024-07-01 chat
Llama 3 heron brain 70B v0.3 70 2024-07-01 chat
Llama 3 Swallow 8B 8.0 2024-07-01 base
Llama 3 Swallow 8B Instruct 8.0 2024-07-01 chat
Llama 3 Swallow 70B 70 2024-07-01 base
Llama 3 Swallow 70B Instruct 70 2024-07-01 chat
Llama 3 Youko 8B 8.0 2024-05-07 base
Llama 3 Youko 8B Instruct 8.0 2024-05-07 chat
Llama 3 Youko 70B 70 2024-07-25 base
Llama 3 Youko 70B Instruct 70 2024-07-25 chat
Llama 3.1 8B 8.0 2024-07-23 base
Llama 3.1 8B Instruct 8.0 2024-07-23 chat
Llama 3.1 70B 70 2024-07-23 base
Llama 3.1 70B Instruct 70 2024-07-23 chat
Llama-3.1-70B-Japanese-Instruct-2407 70 2024-07-23 chat
Llama 3.1 Swallow 8B v0.1 8.0 2024-10-08 base
Llama 3.1 Swallow 8B Instruct v0.1 8.0 2024-10-08 chat
Llama 3.1 Swallow 70B v0.1 70 2024-10-08 base
Llama 3.1 Swallow 70B Instruct v0.1 70 2024-10-08 chat
Llama 3.1 Swallow 8B v0.2 8.0 2024-11-11 base
Llama 3.1 Swallow 8B Instruct v0.2 8.0 2024-11-11 chat
Llama 3.1 Swallow 8B Instruct v0.3 8.0 2024-12-23 chat
Llama 3.1 Swallow 70B Instruct v0.3 70 2024-12-30 chat
Llama 3.2 1B 1.2 2024-09-25 base
Llama 3.2 1B Instruct 1.2 2024-09-25 chat
Llama 3.2 3B 3.2 2024-09-25 base
Llama 3.2 3B Instruct 3.2 2024-09-25 chat
Llama 3.3 70B Instruct 70 2024-12-06 chat
Llama 3.3 Swallow 70B v0.4 70 2025-03-14 base
Llama 3.3 Swallow 70B Instruct v0.4 70 2025-03-10 chat
llm-jp-3-1.8b 1.8 2024-09-25 base
llm-jp-3-1.8b-instruct 1.8 2024-09-25 chat
llm-jp-3-3.7b 3.7 2024-09-25 base
llm-jp-3-3.7b-instruct 3.7 2024-09-25 chat
llm-jp-3-13b 13 2024-09-25 base
llm-jp-3-13b-instruct 13 2024-09-25 chat
Mistral-Nemo-Base-2407 (12B) 12 2024-07-18 base
Mistral-Nemo-Instruct-2407 (12B) 12 2024-07-18 chat
Mistral-NeMo-Minitron 8B 8.4 2024-08-21 base
Mistral-NeMo-Minitron 8B Instruct 8.4 2024-08-21 chat
Mistral-7B-v0.3 7.2 2024-05-22 base
Mistral-7B-Instruct-v0.3 7.2 2024-05-22 chat
Mixtral-8x22B-v0.1 141 2024-04-17 base
Mixtral-8x22B-Instruct-v0.1 141 2024-04-17 chat
Phi-3-Mini-128K-Instruct 3.8 2024-04-23 chat
Phi-4 14 2024-12-13 chat
PLaMo 2 1B 1.3 2025-02-21 base
PLaMo 2 8B 9.1 2025-02-21 base
Qwen2-7B 7.6 2024-06-07 base
Qwen2-7B-Instruct 7.6 2024-06-07 chat
Qwen2-72B 72 2024-06-07 base
Qwen2-72B-Instruct 72 2024-06-07 chat
Qwen2.5-0.5B 0.5 2024-09-19 base
Qwen2.5-0.5B-Instruct 0.5 2024-09-19 chat
Qwen2.5-1.5B 1.5 2024-09-19 base
Qwen2.5-1.5B-Instruct 1.5 2024-09-19 chat
Qwen2.5-3B 3.1 2024-09-19 base
Qwen2.5-3B-Instruct 3.1 2024-09-19 chat
Qwen2.5-7B 7.6 2024-09-19 base
Qwen2.5-7B-Instruct 7.6 2024-09-19 chat
Qwen2.5-14B-Instruct 14 2024-09-25 chat
Qwen2.5-32B-Instruct 32 2024-09-25 chat
Qwen2.5-72B 72 2024-09-19 base
Qwen2.5-72B-Instruct 72 2024-09-19 chat
Sarashina2-7B 7.3 2024-06-14 base
Sarashina2-13B 13 2024-06-14 base
Sarashina2-70B 70 2024-06-14 base
Stockmark-100b 100 2024-05-16 base
Swallow 7B 6.7 2023-12-19 base
Swallow 13B 13 2023-12-19 base
Swallow 70B 70 2023-12-19 base
Swallow-MS 7B v0.1 7.2 2024-03-11 base
Swallow-MS-7b-instruct-v0.1 7.2 2024-03-11 chat
Swallow-MX 8x7B v0.1 47 2024-03-11 base
Swallow-7b-instruct-v0.1 6.7 2023-12-19 chat
Swallow-70b-instruct-v0.1 70 2023-12-19 chat
Tanuki-8B-dpo-v1.0 7.5 2024-08-30 chat
Tanuki-8x8B-dpo-v1.0 47 2024-08-30 chat
TinySwallow-1.5B 1.5 2025-01-30 base
TinySwallow-1.5B-Instruct 1.5 2025-01-30 chat
Yi-1.5 6B 6.1 2024-05-13 base
Yi-1.5 9B 8.8 2024-05-13 base
Yi-1.5 34B 34 2024-05-13 base

Acknowledgements

  • Tabler Admin Template licensed under MIT License
  • ApexCharts licensed under MIT License
  • Swallow icon by Game Icons.net, licensed under the CC Attribution License, via SVG Repo
  • The research and development of the large language model Swallow has been supported by the AIST Project "Research and Development on Generative AI Foundation Models in the Physical Domain".