About evaluation
In parallel with developing its own LLMs, the Swallow Project independently conducts evaluation experiments on publicly available LLMs as a reference for developing high-performance large language models (LLMs). Comparing against LLMs developed not only in Japan but also around the world shows us the current level of the Swallow project. By evaluating each LLM under fair conditions, while accounting for its particular specifications (tokenization, system prompts, etc.), and contrasting the results with how each LLM was developed, we can examine the “recipe” for building a high-performance LLM. These experiments also make the challenges of LLM evaluation tangible: task scores vary not only with differences in LLM capability but also with seemingly trivial evaluation details such as the prompt format.

On this site, you can view the results of the LLM evaluations conducted within the Swallow project as bar graphs, radar charts, and scatter plots. We hope the site serves both as information for selecting the right LLM for your application and as a reference for developing LLMs that are strong in Japanese.
Evaluation tasks
In the 2024 Swallow project, we evaluate LLMs with 10 datasets for Japanese understanding and generation tasks, MT-Bench for the Japanese multi-turn dialogue task, and 9 datasets for English understanding and generation tasks. For all tasks, evaluation scores range from 0 (lowest) to 1 (highest).
Japanese understanding and generation tasks
JCom (JCommonsenseQA)
Q&A regarding commonsense and inference
Five-choice questions created with a knowledge base
- Metric: Accuracy
- Setting: 4-shot
- Reference: (Kurihara et al., 2022)
JEMHopQA
Multi-hop Q&A
Open-ended Q&A to assess the amount of knowledge and reasoning ability
- Metric: Character F1
- Setting: 4-shot
- Reference: (Ishii et al., 2024)
NIILC
Classical Q&A
Open-ended Q&A whose answers can be found in an encyclopedia
- Metric: Character F1
- Setting: 4-shot
- Reference: (Sekine, 2003)
JSQuAD
Reading comprehension
Open-ended Q&A on Wikipedia articles
- Metric: Character F1
- Setting: 4-shot
- Reference: (Kurihara et al., 2022)
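JEMHopQA, NIILC, and JSQuAD are scored with character-level F1 between the generated answer and the gold answer. The function below is a minimal sketch of such a metric, not the exact implementation in the evaluation harness, which may additionally normalize the text and take the maximum over multiple gold answers.

```python
from collections import Counter

def char_f1(prediction: str, reference: str) -> float:
    """Bag-of-characters F1 between a predicted answer and a gold answer (illustrative sketch)."""
    pred_chars = Counter(prediction)
    gold_chars = Counter(reference)
    overlap = sum((pred_chars & gold_chars).values())  # shared characters, counted with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_chars.values())
    recall = overlap / sum(gold_chars.values())
    return 2 * precision * recall / (precision + recall)

print(char_f1("徳川家康", "徳川家康公"))  # ≈ 0.889: the prediction misses one gold character
```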
XL-Sum
Summarization
Generating a highlight of a BBC news article
- Metric: ROUGE-2
- Setting: 1-shot
- Reference: (Hasan et al., 2021)
MGSM
Mathematics
Japanese translation of math word problems (GSM8K)
- Metric: Accuracy (exact match)
- Setting: 4-shot
- Reference: (Shi et al., 2023)
WMT20 (en-ja)
English-Japanese translation
Translation of news articles
- Metric: BLEU
- Setting: 4-shot
- Reference: (Barrault et al., 2020)
WMT20 (ja-en)
Japanese-English translation
Translation of news articles
- Metric: BLEU
- Setting: 4-shot
- Reference: (Barrault et al., 2020)
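The two WMT20 translation tasks are scored with corpus-level BLEU. As an illustration only, BLEU can be computed with the sacrebleu package roughly as follows; the scorer and tokenization actually used by the evaluation harness may differ (Japanese output in particular needs a suitable tokenizer), and since scores on this site are reported in the 0-1 range, raw BLEU (0-100) is presumably divided by 100.

```python
# Illustrative only: corpus-level BLEU with the sacrebleu package.
import sacrebleu

hypotheses = ["The minister visited the plant on Monday ."]
references = [["The minister visited the factory on Monday ."]]  # one reference per hypothesis

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)        # BLEU on a 0-100 scale
print(bleu.score / 100)  # rescaled to the 0-1 range used on this site
```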
JMMLU
Multi-task natural language understanding
Japanese translation of MMLU, a benchmark of four-choice exam questions (53 subjects)
- Metric: Accuracy
- Setting: 5-shot
- Reference: (Yin et al., 2024)
JHumanEval
Code generation
Japanese translation of HumanEval (a code generation benchmark)
- Metric: pass@1
- Setting: 0-shot, 10 trials
- Reference: (Sato et al., 2024)
Japanese multi-turn dialogue tasks (Japanese MT-Bench)
We used the Nejumi Leaderboard Neo version of Japanese MT-Bench, a Japanese adaptation of MT-Bench, which is a benchmark for multi-turn dialogue capability. We evaluate instruction-tuned models only. The benchmark automatically rates each response on a 10-point scale using GPT-4 (gpt-4-1106-preview) as the judge (a minimal sketch of this judging step appears at the end of this subsection). The evaluation categories are as follows.
Coding
Extraction
Humanities
Math
Reasoning
Roleplay
STEM
Writing
Note that our Japanese MT-Bench scores are lower than those of other leaderboards. We believe this is because many leaderboards use GPT-4 (gpt-4-0613) as the judge, whereas we use GPT-4 (gpt-4-1106-preview). Our investigation found that, although the absolute scores differ considerably from those of other leaderboards, the relative rankings among the models remain largely unchanged. We therefore kept the same GPT-4 version rather than redoing the many evaluations we had already completed.
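Concretely, LLM-as-a-judge scoring sends each question and model response to the judge model together with a rating instruction, then parses the returned score. The snippet below is only a minimal sketch of this idea, assuming the OpenAI Python SDK; the actual judge prompts, reference answers, and score parsing are defined in FastChat, and the prompt text here is hypothetical.

```python
# Minimal sketch of LLM-as-a-judge scoring (not the FastChat implementation;
# the judge prompt below is hypothetical). Requires OPENAI_API_KEY to be set.
from openai import OpenAI

client = OpenAI()

def judge_response(question: str, answer: str) -> str:
    prompt = (
        "You are an impartial judge. Rate the assistant's response to the user's "
        "question on a scale of 1 to 10 and briefly explain your rating.\n\n"
        f"[Question]\n{question}\n\n[Assistant's response]\n{answer}"
    )
    completion = client.chat.completions.create(
        model="gpt-4-1106-preview",  # the judge version used in our evaluation
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return completion.choices[0].message.content  # rating and brief explanation as text

print(judge_response("富士山の標高は?", "富士山の標高は3,776メートルです。"))
```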
English understanding and generation tasks
OpenBookQA
Q&A based on facts and common sense
Four-choice questions based on scientific knowledge and common sense
- Metric: Accuracy
- Setting: 4-shot
- Reference: (Mihaylov et al., 2018)
TriviaQA
Q&A based on knowledge
Open-ended Q&A based on trivia questions
- Metric: Accuracy (exact match)
- Setting: 4-shot
- Reference: (Joshi et al., 2017)
HellaSwag
Commonsense inference
Four-choice questions to predict the next event
- Metric: Accuracy
- Setting: 4-shot
- Reference: (Zellers et al., 2019)
SQuAD2
Reading comprehension
Open-ended Q&A based on evidence documents
- Metric: Accuracy (exact match)
- Setting: 4-shot
- Reference: (Rajpurkar et al., 2018)
XWINO
Commonsense inference
Two-choice questions to predict the antecedent of a pronoun
- Metric: Accuracy
- Setting: 4-shot
- Reference: (Tikhonov and Ryabinin, 2021)
MMLU
Multitask natural language understanding
Benchmark of four-choice exam questions covering 57 subjects
- Metric: Accuracy
- Setting: 5-shot
- Reference: (Hendrycks et al., 2021)
GSM8K
Mathematics
Math word problems
- Metric: Accuracy (exact match)
- Setting: 4-shot
- Reference: (Cobbe et al., 2021)
BBH (BIG-Bench-Hard)
A collection of tasks that are hard for LLMs
23 challenging tasks selected from the BIG-Bench dataset (Srivastava et al., 2023)
- Metric: Accuracy (exact match)
- Setting: 3-shot, CoT
- Reference: (Suzgun et al., 2023)
HumanEval
Code generation
Code generation ability measured by unit tests
- Metric: pass@1
- Setting: 0-shot, 10 trials
- Reference: (Chen et al., 2021)
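Both HumanEval and JHumanEval report pass@1 estimated from 10 sampled solutions per problem. A standard way to compute this is the unbiased pass@k estimator of Chen et al. (2021); with n = 10 samples and k = 1 it reduces to the fraction of samples that pass the unit tests, averaged over problems. A minimal sketch, assuming the evaluation harness follows this estimator:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator of Chen et al. (2021): n samples, c of which pass the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# With 10 trials per problem and k = 1, the estimator reduces to c / n.
print(pass_at_k(n=10, c=3, k=1))  # ≈ 0.3
```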
Evaluation tools
We used the following software packages for evaluation.
JP Language Model Evaluation Harness (commit #9b42d41)
An evaluation framework for Japanese LLMs
https://github.com/Stability-AI/lm-evaluation-harness/
Code Generation LM Evaluation Harness (commit #0261c52)
An evaluation framework for code generation (HumanEval)
https://github.com/bigcode-project/bigcode-evaluation-harness
FastChat (commit #e86e70d0)
An automatic evaluation framework that uses an LLM as a judge (MT-Bench)
https://github.com/lm-sys/FastChat
swallow-evaluation
The evaluation framework used in the Swallow Project (encompassing all of the tools above)
https://github.com/swallow-llm/swallow-evaluation
Evaluated models
We list the LLMs in alphabetical order. Some LLMs have scores only for Japanese MT-Bench and no scores for the language understanding and generation tasks; the Missing scores column indicates which task groups are missing.
Name | Size (B params) | Type | Distribution (repository or API model) | Missing scores |
---|---|---|---|---|
Aya Expanse 32B | 32.0 | instruct | CohereForAI/aya-expanse-32b | |
Aya Expanse 8B | 8.0 | instruct | CohereForAI/aya-expanse-8b | |
C4AI Command-R v0.1 | 35 | base | CohereForAI/c4ai-command-r-v01 | |
CyberAgentLM2-7B-chat | 7.0 | instruct | cyberagent/calm2-7b-chat | |
CyberAgentLM2-7B | 7.0 | base | cyberagent/calm2-7b | |
CyberAgentLM3-22B-chat | 22 | instruct | cyberagent/calm3-22b-chat | |
ELYZA-japanese-Llama-2-13b | 13 | base | elyza/ELYZA-japanese-Llama-2-13b | |
Fugaku-LLM 13B | 13 | base | Fugaku-LLM/Fugaku-LLM-13B | |
GPT-3.5 (gpt-3.5-turbo-0125) | N/A | instruct | gpt-3.5-turbo-0125 | Japanese tasks, English tasks |
GPT-4o (gpt-4o-2024-05-13) | N/A | instruct | gpt-4o-2024-05-13 | Japanese tasks, English tasks |
GRIN-MoE | 42 | instruct | microsoft/GRIN-MoE | |
Gemma 2 27B IT | 27 | instruct | google/gemma-2-27b-it | |
Gemma 2 2B IT | 2.6 | instruct | google/gemma-2-2b-it | |
Gemma 2 9B IT | 9.2 | instruct | google/gemma-2-9b-it | |
Gemma 2 27B | 27 | base | google/gemma-2-27b | |
Gemma 2 2B | 2.6 | base | google/gemma-2-2b | |
Gemma 2 9B | 9.2 | base | google/gemma-2-9b | |
Gemma 2 Baku 2B IT | 2.6 | instruct | rinna/gemma-2-baku-2b-it | |
Gemma 2 Baku 2B | 2.6 | base | rinna/gemma-2-baku-2b | |
Gemma 2 JPN | 2.6 | instruct | google/gemma-2-2b-jpn-it | |
Japanese Stable LM Base Gamma 7B | 7.2 | base | stabilityai/japanese-stablelm-base-gamma-7b | |
Japanese Stable LM Beta 70B | 70 | base | stabilityai/japanese-stablelm-base-beta-70b | |
Japanese Stable LM Beta 7B | 6.7 | base | stabilityai/japanese-stablelm-base-beta-7b | |
KARAKURI LM 70B Chat v0.1 | 70 | instruct | karakuri-ai/karakuri-lm-70b-chat-v0.1 | |
KARAKURI LM 8x7B Instruct v0.1 | 47 | instruct | karakuri-ai/karakuri-lm-8x7b-instruct-v0.1 | |
KARAKURI LM 70B v0.1 | 70 | base | karakuri-ai/karakuri-lm-70b-v0.1 | |
LLM-jp-13B v2.0 | 13 | base | llm-jp/llm-jp-13b-v2.0 | |
Llama 2 13B | 13 | base | meta-llama/Llama-2-13b-hf | |
Llama 2 70B | 70 | base | meta-llama/Llama-2-70b-hf | |
Llama 2 7B | 6.7 | base | meta-llama/Llama-2-7b-hf | |
Llama 3 70B Instruct | 70 | instruct | meta-llama/Meta-Llama-3-70B-Instruct | |
Llama 3 8B Instruct | 8.0 | instruct | meta-llama/Meta-Llama-3-8B-Instruct | |
Llama 3 70B | 70 | base | meta-llama/Meta-Llama-3-70B | |
Llama 3 8B | 8.0 | base | meta-llama/Meta-Llama-3-8B | |
Llama 3 Swallow 70B Instruct | 70 | instruct | tokyotech-llm/Llama-3-Swallow-70B-Instruct-v0.1 | |
Llama 3 Swallow 8B Instruct | 8.0 | instruct | tokyotech-llm/Llama-3-Swallow-8B-Instruct-v0.1 | |
Llama 3 Swallow 70B | 70 | base | tokyotech-llm/Llama-3-Swallow-70B-v0.1 | |
Llama 3 Swallow 8B | 8.0 | base | tokyotech-llm/Llama-3-Swallow-8B-v0.1 | |
Llama 3 Youko 70B Instruct | 70 | instruct | rinna/llama-3-youko-70b-instruct | |
Llama 3 Youko 8B Instruct | 8.0 | instruct | rinna/llama-3-youko-8b-instruct | |
Llama 3 Youko 70B | 70 | base | rinna/llama-3-youko-70b | |
Llama 3 Youko 8B | 8.0 | base | rinna/llama-3-youko-8b | |
Llama 3 heron brain 70B v0.3 | 70 | instruct | turing-motors/Llama-3-heron-brain-70B-v0.3 | |
Llama 3 heron brain 8B v0.3 | 8.0 | instruct | turing-motors/Llama-3-heron-brain-8B-v0.3 | |
Llama 3.1 405B Instruct | 405 | instruct | deepinfra/meta-llama/Meta-Llama-3.1-405B-Instruct | Japanese tasks, English tasks |
Llama 3.1 70B Instruct | 70 | instruct | meta-llama/Meta-Llama-3.1-70B-Instruct | |
Llama 3.1 8B Instruct | 8.0 | instruct | meta-llama/Meta-Llama-3.1-8B-Instruct | |
Llama 3.1 70B | 70 | base | meta-llama/Meta-Llama-3.1-70B | |
Llama 3.1 8B | 8.0 | base | meta-llama/Meta-Llama-3.1-8B | |
Llama 3.1 Swallow 70B Instruct v0.1 | 70 | instruct | tokyotech-llm/Llama-3.1-Swallow-70B-Instruct-v0.1 | |
Llama 3.1 Swallow 8B Instruct v0.1 | 8.0 | instruct | tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1 | |
Llama 3.1 Swallow 8B Instruct v0.2 | 8.0 | instruct | tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.2 | |
Llama 3.1 Swallow 70B v0.1 | 70 | base | tokyotech-llm/Llama-3.1-Swallow-70B-v0.1 | |
Llama 3.1 Swallow 8B v0.1 | 8.0 | base | tokyotech-llm/Llama-3.1-Swallow-8B-v0.1 | |
Llama 3.1 Swallow 8B v0.2 | 8.0 | base | tokyotech-llm/Llama-3.1-Swallow-8B-v0.2 | |
Llama 3.2 1B Instruct | 1.2 | instruct | meta-llama/Llama-3.2-1B-Instruct | |
Llama 3.2 3B Instruct | 3.2 | instruct | meta-llama/Llama-3.2-3B-Instruct | |
Llama 3.2 1B | 1.2 | base | meta-llama/Llama-3.2-1B | |
Llama 3.2 3B | 3.2 | base | meta-llama/Llama-3.2-3B | |
Llama-3-ELYZA-JP-8B | 8.0 | instruct | elyza/Llama-3-ELYZA-JP-8B | |
Llama-3.1-70B-Japanese-Instruct-2407 | 70 | instruct | cyberagent/Llama-3.1-70B-Japanese-Instruct-2407 | |
Mistral-7B-Instruct-v0.3 | 7.2 | instruct | mistralai/Mistral-7B-Instruct-v0.3 | |
Mistral-7B-v0.1 | 7.2 | base | mistralai/Mistral-7B-v0.1 | |
Mistral-7B-v0.2 | 7.2 | base | mistral-community/Mistral-7B-v0.2 | |
Mistral-7B-v0.3 | 7.2 | base | mistralai/Mistral-7B-v0.3 | |
Mistral-NeMo-Instruct-2407 (12B) | 12 | instruct | mistralai/Mistral-Nemo-Instruct-2407 | |
Mistral-NeMo-Minitron 8B Instruct | 8.4 | instruct | nvidia/Mistral-NeMo-Minitron-8B-Instruct | |
Mistral-NeMo-Minitron 8B | 8.4 | base | nvidia/Mistral-NeMo-Minitron-8B-Base | |
Mistral-Nemo-Base-2407 (12B) | 12 | base | mistralai/Mistral-Nemo-Base-2407 | |
Mixtral-8x7B-Instruct-v0.1 | 47 | instruct | mistralai/Mixtral-8x7B-Instruct-v0.1 | |
Mixtral-8x22B-Instruct-v0.1 | 141 | instruct | mistralai/Mixtral-8x22B-Instruct-v0.1 | |
Mixtral-8x7B-v0.1 | 47 | base | mistralai/Mixtral-8x7B-v0.1 | |
Mixtral-8x22B-v0.1 | 141 | base | mistralai/Mixtral-8x22B-v0.1 | |
Phi-3-Mini-128K-Instruct | 3.8 | instruct | microsoft/Phi-3-mini-128k-instruct | |
Phi-3.5-MoE Instruct | 42 | instruct | microsoft/Phi-3.5-MoE-instruct | |
Qwen1.5-7B | 7.7 | base | Qwen/Qwen1.5-7B | |
Qwen2-72B-Instruct | 72 | instruct | Qwen/Qwen2-72B-Instruct | |
Qwen2-7B-Instruct | 7.6 | instruct | Qwen/Qwen2-7B-Instruct | |
Qwen2-72B | 72 | base | Qwen/Qwen2-72B | |
Qwen2-7B | 7.6 | base | Qwen/Qwen2-7B | |
Qwen2.5-72B-Instruct | 72 | instruct | Qwen/Qwen2.5-72B-Instruct | |
Qwen2.5-3B-Instruct | 3.1 | instruct | Qwen/Qwen2.5-3B-Instruct | |
Qwen2.5-7B-Instruct | 7.6 | instruct | Qwen/Qwen2.5-7B-Instruct | |
Qwen2.5-0.5B-Instruct | 0.5 | instruct | Qwen/Qwen2.5-0.5B-Instruct | |
Qwen2.5-0.5B | 0.5 | base | Qwen/Qwen2.5-0.5B | |
Qwen2.5-72B | 72 | base | Qwen/Qwen2.5-72B | |
Qwen2.5-1.5B-Instruct | 1.5 | instruct | Qwen/Qwen2.5-1.5B-Instruct | |
Qwen2.5-1.5B | 1.5 | base | Qwen/Qwen2.5-1.5B | |
Qwen2.5-3B | 3.1 | base | Qwen/Qwen2.5-3B | |
Qwen2.5-7B | 7.6 | base | Qwen/Qwen2.5-7B | |
RakutenAI-7B-chat | 7.2 | instruct | Rakuten/RakutenAI-7B-chat | |
RakutenAI-7B | 7.2 | base | Rakuten/RakutenAI-7B | |
Sarashina2-13B | 13 | base | sbintuitions/sarashina2-13b | |
Sarashina2-70B | 70 | base | sbintuitions/sarashina2-70b | |
Sarashina2-7B | 7.3 | base | sbintuitions/sarashina2-7b | |
Stockmark-100b | 100 | base | stockmark/stockmark-100b | |
Swallow 13B | 13 | base | tokyotech-llm/Swallow-13b-hf | |
Swallow 70B | 70 | base | tokyotech-llm/Swallow-70b-hf | |
Swallow 7B | 6.7 | base | tokyotech-llm/Swallow-7b-hf | |
Swallow-70b-instruct-v0.1 | 70 | instruct | tokyotech-llm/Swallow-70b-instruct-v0.1 | |
Swallow-7b-instruct-v0.1 | 6.7 | instruct | tokyotech-llm/Swallow-7b-instruct-v0.1 | |
Swallow-MS 7B v0.1 | 7.2 | base | tokyotech-llm/Swallow-MS-7b-v0.1 | |
Swallow-MS-7b-instruct-v0.1 | 7.2 | instruct | tokyotech-llm/Swallow-MS-7b-instruct-v0.1 | |
Swallow-MX 8x7B v0.1 | 47 | base | tokyotech-llm/Swallow-MX-8x7b-NVE-v0.1 | |
Tanuki-8x8B-dpo-v1.0 | 47 | instruct | weblab-GENIAC/Tanuki-8x8B-dpo-v1.0 | |
Tanuki-8B-dpo-v1.0 | 7.5 | instruct | weblab-GENIAC/Tanuki-8B-dpo-v1.0 | |
Yi-1.5 34B | 34 | base | 01-ai/Yi-1.5-34B | |
Yi-1.5 6B | 6.1 | base | 01-ai/Yi-1.5-6B | |
Yi-1.5 9B | 8.8 | base | 01-ai/Yi-1.5-9B | |
Youri 7B | 6.7 | base | rinna/youri-7b | |
llm-jp-3-13b-instruct | 13 | instruct | llm-jp/llm-jp-3-13b-instruct | |
llm-jp-3-13b | 13 | base | llm-jp/llm-jp-3-13b | |
llm-jp-3-1.8b-instruct | 1.8 | instruct | llm-jp/llm-jp-3-1.8b-instruct | |
llm-jp-3-1.8b | 1.8 | base | llm-jp/llm-jp-3-1.8b | |
llm-jp-3-3.7b-instruct | 3.7 | instruct | llm-jp/llm-jp-3-3.7b-instruct | |
llm-jp-3-3.7b | 3.7 | base | llm-jp/llm-jp-3-3.7b |
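The Distribution column gives the Hugging Face repository (or API model name) used for each LLM. As a hypothetical illustration only, an open-weight model from the table, here tokyotech-llm/Llama-3.1-Swallow-8B-v0.1, could be loaded with the transformers library roughly as follows; this is not part of our evaluation pipeline, which uses the harnesses listed above.

```python
# Hypothetical illustration: loading one open-weight model from the table with
# the Hugging Face transformers library (not our evaluation pipeline).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "tokyotech-llm/Llama-3.1-Swallow-8B-v0.1"  # value from the Distribution column
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # requires the accelerate package
)

inputs = tokenizer("大規模言語モデルとは、", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```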
Acknowledgements
This website uses the following software packages.