About evaluation

In parallel with developing its own LLMs, the Swallow Project independently runs evaluation experiments on publicly available LLMs to serve as a reference for building high-performance large language models (LLMs). Comparing against LLMs developed both in Japan and around the world shows us the “current level” of the Swallow Project. By evaluating each LLM under fair conditions, while accounting for its specific requirements (tokenization, system prompts, etc.), and relating the results to how each model was developed, we can study the “recipe” for building a high-performance LLM. We have also come to appreciate the difficulties of LLM evaluation, having seen that high or low task scores arise not only from differences in model performance but also from seemingly minor evaluation details (e.g., the prompt format).

On this site, you can browse the results of the LLM evaluations conducted within the Swallow Project as bar graphs, radar charts, and scatter plots. We hope the site is useful not only for selecting the right LLM for your application, but also as reference information for developing LLMs that are strong in Japanese.

Evaluation tasks

In the 2024 Swallow Project, we evaluate LLMs on 10 datasets for Japanese understanding and generation tasks, Japanese MT-Bench for the Japanese multi-turn dialogue task, and 9 datasets for English understanding and generation tasks. For every task, evaluation scores range from 0 (lowest) to 1 (highest).
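As a concrete illustration of this 0-to-1 convention, the sketch below computes an accuracy score for a multiple-choice task such as JCommonsenseQA. The predictions and gold answers are made up for illustration and do not come from our actual evaluation code.

```python
# Minimal sketch: accuracy for a multiple-choice task is the fraction of
# questions answered correctly, so it already lies between 0 and 1.
# The predictions and gold answers below are invented for illustration.
predictions = ["A", "C", "B", "D", "A"]
gold        = ["A", "C", "C", "D", "A"]

accuracy = sum(p == g for p, g in zip(predictions, gold)) / len(gold)
print(accuracy)  # 0.8 -- within the 0 (lowest) to 1 (highest) range
```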

Japanese understanding and generation tasks

JCom (JCommonsenseQA)
Q&A regarding commonsense and inference

Five-choice questions created with a knowledge base

JEMHopQA
Multi-hop Q&A

Open-ended Q&A to assess the amount of knowledge and reasoning ability

NIILC
Classical Q&A

Open-ended Q&A that can be answered by an encyclopedia

JSQuAD
Reading comprehension

Open-ended Q&A on Wikipedia articles

XL-Sum
Summarization

Task of generating a highlight from a BBC news article

MGSM
Mathematics

Japanese translation of math word problems (GSM8K)

WMT20 (en-ja)
English-Japanese translation

Translation of news articles

WMT20 (ja-en)
Japanese-English translation

Translation of news articles

JMMLU
Multi-task natural language understanding

Japanese translation of the four-choice exam question benchmark MMLU (53 subjects)

JHumanEval
Code generation

Japanese translation of HumanEval (a code generation benchmark)

Japanese multi-turn dialogue tasks (Japanese MT-Bench)

We used the Nejumi Leaderboard Neo version of Japanese MT-Bench, a Japanese adaptation of MT-Bench, a benchmark for multi-turn dialogue capability. We evaluate instruction-tuned models only. This benchmark automatically rates responses on a 10-point scale using GPT-4 (gpt-4-1106-preview). The evaluation categories are as follows.

Coding
Extraction
Humanities
Math
Reasoning
Roleplay
STEM
Writing

Note that our Japanese MT-Bench scores are lower than those reported on other leaderboards. We believe this difference arises because many leaderboards use GPT-4 (gpt-4-0613) to score responses, whereas we use GPT-4 (gpt-4-1106-preview). Our investigation showed that, although the absolute scores differ substantially from those of other leaderboards, the relative rankings among models remain largely unchanged. We therefore kept the GPT-4 version unchanged and continued the evaluation (we had already completed many of the evaluations).
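For readers unfamiliar with LLM-as-a-judge scoring, the following minimal sketch shows how a single response might be rated with GPT-4 (gpt-4-1106-preview) via the OpenAI Python client. This is not the FastChat implementation we actually use; the prompt wording, the rating-extraction pattern, and the rescaling to the 0-1 range are illustrative assumptions.

```python
# Minimal sketch of LLM-as-a-judge scoring in the style of MT-Bench.
# NOT the FastChat implementation: prompt text, parsing, and 0-1 rescaling
# are illustrative assumptions.
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "Please rate the following assistant response on a scale of 1 to 10.\n"
    "Reply with the rating in the form: Rating: [[N]]\n\n"
    "Question: {question}\n\nResponse: {answer}"
)

def judge(question: str, answer: str) -> float:
    """Ask the judge model for a 1-10 rating and rescale it to 0-1."""
    completion = client.chat.completions.create(
        model="gpt-4-1106-preview",  # the judge version used in our evaluation
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
        temperature=0.0,
    )
    text = completion.choices[0].message.content
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", text)
    rating = float(match.group(1)) if match else 0.0
    return rating / 10.0  # illustrative rescaling to the 0-1 range
```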

English understanding and generation tasks

OpenBookQA
Q&A based on facts and common sense

Four-choice questions based on scientific knowledge and common sense

TriviaQA
Q&A based on knowledge

Open-ended Q&A based on trivia

HellaSwag
Commonsense inference

Four-choice questions to predict the next event

SQuAD2
Reading comprehension

Open-ended Q&A grounded in an evidence document

XWINO
Commonsense inference

Two-choice questions to predict the antecedent of a pronoun

MMLU
Multi-task natural language understanding

Four-choice exam question benchmark (53 subjects)

GSM8K
Mathematics

Math word problems

BBH (BIG-Bench-Hard)
Collection of hard-to-solve tasks for LLMs

23 challenging tasks drawn from the BIG-Bench dataset (Srivastava et al., 2023)

HumanEval
Code generation

Code generation ability measured with unit tests
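HumanEval-style benchmarks (including JHumanEval above) score code generation by running each generated program against unit tests; a common metric is pass@k (Chen et al., 2021). The sketch below shows the standard unbiased pass@k estimator; the sample counts are made up for illustration and are not taken from our results.

```python
# Minimal sketch of unit-test-based scoring for HumanEval-style benchmarks.
# The unbiased pass@k estimator follows Chen et al. (2021); the sample
# counts below are invented for illustration.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k given n generated samples of which c pass all unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 20 completions were sampled for a problem and 3 passed the tests.
print(pass_at_k(n=20, c=3, k=1))   # expected pass rate when drawing 1 completion
print(pass_at_k(n=20, c=3, k=10))  # expected pass rate when drawing 10 completions
```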

Evaluation tools

We used the following software packages for evaluation.

LLM-jp evaluation script (1.3.0)

Automatic evaluation tool for Japanese LLMs

(Han et al., 2024)
JP Language Model Evaluation Harness (commit #9b42d41)

An evaluation framework for Japanese LLMs

https://github.com/Stability-AI/lm-evaluation-harness/
Language Model Evaluation Harness (0.4.2)

An evaluation framework for LLMs (a usage sketch follows this list)

(Biderman et al., 2024)
Code Generation LM Evaluation Harness (commit #0261c52)

An evaluation framework for code generation (HumanEval)

https://github.com/bigcode-project/bigcode-evaluation-harness
FastChat (commit #e86e70d0)

An automatic evaluation framework by an LLM (MT-Bench)

https://github.com/lm-sys/FastChat
swallow-evaluation

The evaluation framework used in the Swallow Project (wrapping all the tools listed above)

https://github.com/swallow-llm/swallow-evaluation
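
As an example of how one of these tools is invoked, the following sketch runs a single English task with Language Model Evaluation Harness 0.4.x through its Python API. The model name, few-shot count, and batch size are illustrative and are not the Swallow Project's actual configuration.

```python
# Minimal sketch (not the Swallow Project's actual configuration) of running
# one English task with Language Model Evaluation Harness 0.4.x.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                          # Hugging Face backend
    model_args="pretrained=meta-llama/Meta-Llama-3-8B",  # illustrative model
    tasks=["gsm8k"],                                     # illustrative task
    num_fewshot=4,
    batch_size=8,
)
print(results["results"]["gsm8k"])
```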

Evaluated models

We list the LLMs in alphabetical order. Size is the approximate number of parameters in billions; Distribution is the Hugging Face model ID or API model name used for evaluation.

Name Size Type Distribution Missing scores
C4AI Command-R v0.1 35 base CohereForAI/c4ai-command-r-v01
CyberAgentLM2-7B-chat 7 instruct cyberagent/calm2-7b-chat
CyberAgentLM2-7B 7 base cyberagent/calm2-7b
CyberAgentLM3-22B-chat 22 instruct cyberagent/calm3-22b-chat Japanese MT-bench tasks
ELYZA-japanese-Llama-2-13b 13 base elyza/ELYZA-japanese-Llama-2-13b
Fugaku-LLM 13B 13 base Fugaku-LLM/Fugaku-LLM-13B
GPT-3.5 (gpt-3.5-turbo-0125) N/A instruct gpt-3.5-turbo-0125 Japanese tasks, English tasks
GPT-4o (gpt-4o-2024-05-13) N/A instruct gpt-4o-2024-05-13 Japanese tasks, English tasks
Gemma 2 9B IT 9 instruct google/gemma-2-9b-it
Gemma 2 27B IT 27 instruct google/gemma-2-27b-it Japanese MT-bench tasks
Gemma 2 9B 9 base google/gemma-2-9b
Gemma 2 27B 27 base google/gemma-2-27b
Japanese Stable LM Base Gamma 7B 7 base stabilityai/japanese-stablelm-base-gamma-7b
Japanese Stable LM Beta 7B 7 base stabilityai/japanese-stablelm-base-beta-7b
Japanese Stable LM Beta 70B 70 base stabilityai/japanese-stablelm-base-beta-70b
KARAKURI LM 70B Chat v0.1 70 instruct karakuri-ai/karakuri-lm-70b-chat-v0.1
KARAKURI LM 70B v0.1 70 base karakuri-ai/karakuri-lm-70b-v0.1
LLM-jp-13B v2.0 13 base llm-jp/llm-jp-13b-v2.0
Llama 2 7B 7 base meta-llama/Llama-2-7b-hf
Llama 2 13B 13 base meta-llama/Llama-2-13b-hf
Llama 2 70B 70 base meta-llama/Llama-2-70b-hf
Llama 3 8B Instruct 8 instruct meta-llama/Meta-Llama-3-8B-Instruct
Llama 3 70B Instruct 70 instruct meta-llama/Meta-Llama-3-70B-Instruct
Llama 3 8B 8 base meta-llama/Meta-Llama-3-8B
Llama 3 70B 70 base meta-llama/Meta-Llama-3-70B
Llama 3 Swallow 8B Instruct 8 instruct tokyotech-llm/Llama-3-Swallow-8B-Instruct-v0.1
Llama 3 Swallow 70B Instruct 70 instruct tokyotech-llm/Llama-3-Swallow-70B-Instruct-v0.1
Llama 3 Swallow 8B 8 base tokyotech-llm/Llama-3-Swallow-8B-v0.1
Llama 3 Swallow 70B 70 base tokyotech-llm/Llama-3-Swallow-70B-v0.1
Llama 3 Youko 8B 8 base rinna/llama-3-youko-8b
Llama-3-ELYZA-JP-8B 8 instruct elyza/Llama-3-ELYZA-JP-8B
Mistral-7B-v0.1 7 base mistralai/Mistral-7B-v0.1
Mistral-7B-v0.2 7 base mistral-community/Mistral-7B-v0.2
Mistral-Nemo-Base-2407 12 base mistralai/Mistral-Nemo-Base-2407 Japanese tasks, English tasks
Mistral-Nemo-Instruct-2407 12 instruct mistralai/Mistral-Nemo-Instruct-2407 Japanese tasks, Japanese MT-bench tasks, English tasks
Mixtral-8x7B-Instruct-v0.1 47 instruct mistralai/Mixtral-8x7B-Instruct-v0.1
Mixtral-8x22B-Instruct-v0.1 141 instruct mistralai/Mixtral-8x22B-Instruct-v0.1 Japanese MT-bench tasks
Mixtral-8x7B-v0.1 47 base mistralai/Mixtral-8x7B-v0.1
Qwen1.5-7B 7 base Qwen/Qwen1.5-7B
Qwen2-7B-Instruct 7 instruct Qwen/Qwen2-7B-Instruct
Qwen2-72B-Instruct 72 instruct Qwen/Qwen2-72B-Instruct
Qwen2-7B 7 base Qwen/Qwen2-7B
Qwen2-72B 72 base Qwen/Qwen2-72B
RakutenAI-7B-chat 7 instruct Rakuten/RakutenAI-7B-chat
RakutenAI-7B 7 base Rakuten/RakutenAI-7B
Sarashina2-7B 7 base sbintuitions/sarashina2-7b
Sarashina2-13B 13 base sbintuitions/sarashina2-13b
Swallow 7B 7 base tokyotech-llm/Swallow-7b-hf
Swallow 13B 13 base tokyotech-llm/Swallow-13b-hf
Swallow 70B 70 base tokyotech-llm/Swallow-70b-hf
Swallow-7b-instruct-v0.1 7 instruct tokyotech-llm/Swallow-7b-instruct-v0.1
Swallow-70b-instruct-v0.1 70 instruct tokyotech-llm/Swallow-70b-instruct-v0.1
Swallow-MS v0.1 7 base tokyotech-llm/Swallow-MS-7b-v0.1
Swallow-MS-7b-instruct-v0.1 7 instruct tokyotech-llm/Swallow-MS-7b-instruct-v0.1
Swallow-MX 8x7B v0.1 47 base tokyotech-llm/Swallow-MX-8x7b-NVE-v0.1
Yi-1.5 6B 6 base 01-ai/Yi-1.5-6B
Yi-1.5 9B 9 base 01-ai/Yi-1.5-9B
Yi-1.5 34B 34 base 01-ai/Yi-1.5-34B
Youri 7B 7 base rinna/youri-7b

Acknowledgements

This website uses the following software packages.