About evaluation

The Swallow Project conducts evaluation experiments on publicly available LLMs in parallel with its own LLM development, so that the results can serve as a reference for building high-performance large language models (LLMs). By comparing against LLMs developed both in Japan and around the world, we can gauge the “current level” of the Swallow project. By evaluating each LLM under fair conditions, while accounting for its unique specifications (tokenization, system prompts, etc.), and contrasting the results with how each LLM was developed, we can examine the “recipe” for developing a high-performance LLM. We have also come to appreciate the challenges of LLM evaluation, having observed that high or low task scores arise not only from differences in LLM performance but also from seemingly trivial evaluation details (e.g., the prompt format). On this site, you can view the results of the LLM evaluations conducted within the Swallow project as bar graphs, radar charts, and scatter plots. We hope this site will be useful not only for selecting the right LLM for your application, but also as reference information for developing LLMs that are strong in Japanese.

Evaluation tasks

In the 2024 Swallow project, we are conducting LLM evaluation experiments using 10 datasets for Japanese understanding and generation tasks, MT-Bench for the Japanese multi-turn dialogue task, and 9 datasets for English understanding and generation tasks. For all tasks, evaluation scores range from 0 (lowest) to 1 (highest).

Japanese understanding and generation tasks

JCom (JCommonsenseQA)
Q&A regarding commonsense and inference

Five-choice questions created with a knowledge base

JEMHopQA
Multi-hop Q&A

Open-ended Q&A to assess the amount of knowledge and reasoning ability

NIILC
Classical Q&A

Open-ended Q&A that can be answered by an encyclopedia

JSQuAD
Reading comprehension

Open-ended Q&A on Wikipedia articles

XL-Sum
Summarization

Task to generate a highlight from a BBC news article

MGSM
Mathematics

Japanese translation of math word problems (GSM8K)

WMT20 (en-ja)
English-Japanese translation

Translation of news articles

WMT20 (ja-en)
Japanese-English translation

Translation of news articles

JMMLU
Multi-task natural language understanding

Japanese translation of the four-choice exam question benchmark MMLU (53 subjects)

JHumanEval
Code generation

Japanese translation of HumanEval (a code generation benchmark)

Japanese multi-turn dialogue tasks (Japanese MT-Bench)

We used the Nejumi Leaderboard Neo version of Japanese MT-Bench, a Japanese adaptation of MT-Bench, a benchmark for multi-turn dialogue capability. We evaluate instruction-tuned models only. This benchmark automatically rates responses on a 10-point scale using GPT-4 (gpt-4-1106-preview). The evaluation categories are as follows.

Coding
Extraction
Humanities
Math
Reasoning
Roleplay
STEM
Writing

Note that our Japanese MT-Bench results are lower than those on other leaderboards. We think this difference in scores arises because many leaderboards use GPT-4 (gpt-4-0613) to rate responses, while we use GPT-4 (gpt-4-1106-preview). Our investigation revealed that, although there are sizable differences between our evaluation scores and those of other leaderboards, the relative rankings among the models remain largely unchanged. We therefore continued the evaluation without changing the GPT-4 version (since we had already completed many of the evaluations).
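Since the other tasks on this site are reported on a 0-to-1 scale (see above), the 10-point judge scores need to be normalized before they can be displayed alongside them. The snippet below is a minimal sketch of such an aggregation, assuming per-category judge scores have already been parsed from the FastChat judgment output; the score values and the simple per-category averaging are illustrative and not necessarily our exact pipeline.

```python
from statistics import mean

# Hypothetical per-category judge scores (1-10) for one model, e.g. parsed
# from FastChat's GPT-4 judgment files; the numbers below are placeholders.
judge_scores = {
    "coding":     [4, 6, 5, 7],
    "extraction": [7, 8, 6, 9],
    "humanities": [9, 8, 9, 10],
    "math":       [3, 5, 4, 6],
    "reasoning":  [5, 6, 4, 7],
    "roleplay":   [8, 7, 9, 8],
    "stem":       [8, 9, 7, 8],
    "writing":    [9, 8, 8, 9],
}

# Average within each category, then divide by 10 so the result is
# comparable to the 0-1 scores used for the other tasks on this site.
category_scores = {cat: mean(scores) / 10.0 for cat, scores in judge_scores.items()}
overall = mean(category_scores.values())

for cat, score in category_scores.items():
    print(f"{cat:10s} {score:.3f}")
print(f"overall    {overall:.3f}")
```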

English understanding and generation tasks

OpenBookQA
Q&A based on facts and common sense

Four-choice questions based on scientific knowledge and common sense

TriviaQA
Q&A based on knowledge

Open-ended Q&A based on trivia

HellaSwag
Commonsense inference

Four-choice questions to predict the next event

SQuAD2
Reading comprehension

Open-ended Q&A over an evidence document

XWINO
Commonsense inference

Two-choice questions to predict the antecedent of a pronoun

MMLU
Multitask natural language understanding

Four-choice exam question benchmark MMLU (53 subjects)

GSM8K
Mathematics

Math word problems

BBH (BIG-Bench-Hard)
Collection of tasks that are hard for LLMs to solve

23 challenging tasks from the BIG-Bench dataset (Srivastava et al., 2023)

HumanEval
Code generation

Code generation ability measured with unit tests (see the pass@k sketch below)
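HumanEval (and JHumanEval above) scores a model by generating candidate programs and running the benchmark's unit tests. The snippet below is a minimal sketch of the standard unbiased pass@k estimator from Chen et al. (2021); the sample count n, the budget k, and the sampling settings used on this leaderboard are not specified here, so the numbers are purely illustrative.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator for a single problem (Chen et al., 2021).

    n: number of generated samples, c: samples passing all unit tests,
    k: evaluation budget.
    """
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 10 samples per problem; 3 pass for the first problem, 0 for the second.
per_problem = [pass_at_k(10, 3, 1), pass_at_k(10, 0, 1)]
print(np.mean(per_problem))  # the benchmark score is the mean over problems
```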

Evaluation tools

We used the following software packages for evaluation; a brief usage sketch of the Language Model Evaluation Harness follows the list.

LLM-jp evaluation script (1.3.0)

Automatic evaluation tool for Japanese LLMs

(Han et al., 2024)
JP Language Model Evaluation Harness (commit #9b42d41)

An evaluation framework for Japanese LLMs

https://github.com/Stability-AI/lm-evaluation-harness/
Language Model Evaluation Harness (0.4.2)

An evaluation framework for LLMs

(Biderman et al., 2024)
Code Generation LM Evaluation Harness (commit #0261c52)

An evaluation framework for code generation (HumanEval)

https://github.com/bigcode-project/bigcode-evaluation-harness
FastChat (commit #e86e70d0)

An automatic evaluation framework using an LLM as the judge (MT-Bench)

https://github.com/lm-sys/FastChat
swallow-evaluation

The evaluation framework used in the Swallow Project (encompassing all of the above tools)

https://github.com/swallow-llm/swallow-evaluation
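As an illustration of how one of these tools is typically driven, the sketch below calls the Language Model Evaluation Harness (0.4.x) through its Python API. The model, task, and few-shot settings are examples only and do not necessarily match the configuration actually used by swallow-evaluation.

```python
# Minimal sketch: evaluating a Hugging Face model on GSM8K with the
# Language Model Evaluation Harness (lm-eval 0.4.x Python API).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Meta-Llama-3-8B,dtype=bfloat16",
    tasks=["gsm8k"],   # one of the English tasks listed above
    num_fewshot=4,     # illustrative shot count; may differ from our setup
    batch_size="auto",
)
print(results["results"]["gsm8k"])
```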

Evaluated models

We list the LLMs in alphabetical order. Some LLMs have no scores for the language understanding and generation tasks and only scores for Japanese MT-Bench; these are noted in the “Missing scores” column.

Name Size (B params) Type Distribution Missing scores
Aya Expanse 32B 32.0 instruct CohereForAI/aya-expanse-32b
Aya Expanse 8B 8.0 instruct CohereForAI/aya-expanse-8b
C4AI Command-R v0.1 35 base CohereForAI/c4ai-command-r-v01
CyberAgentLM2-7B-chat 7.0 instruct cyberagent/calm2-7b-chat
CyberAgentLM2-7B 7.0 base cyberagent/calm2-7b
CyberAgentLM3-22B-chat 22 instruct cyberagent/calm3-22b-chat
ELYZA-japanese-Llama-2-13b 13 base elyza/ELYZA-japanese-Llama-2-13b
Fugaku-LLM 13B 13 base Fugaku-LLM/Fugaku-LLM-13B
GPT-3.5 (gpt-3.5-turbo-0125) NaN instruct gpt-3.5-turbo-0125 Japanese tasks, English tasks
GPT-4o (gpt-4o-2024-05-13) NaN instruct gpt-4o-2024-05-13 Japanese tasks, English tasks
GRIN-MoE 42 instruct microsoft/GRIN-MoE
Gemma 2 27B IT 27 instruct google/gemma-2-27b-it
Gemma 2 2B IT 2.6 instruct google/gemma-2-2b-it
Gemma 2 9B IT 9.2 instruct google/gemma-2-9b-it
Gemma 2 27B 27 base google/gemma-2-27b
Gemma 2 2B 2.6 base google/gemma-2-2b
Gemma 2 9B 9.2 base google/gemma-2-9b
Gemma 2 Baku 2B IT 2.6 instruct rinna/gemma-2-baku-2b-it
Gemma 2 Baku 2B 2.6 base rinna/gemma-2-baku-2b
Gemma 2 JPN 2.6 instruct google/gemma-2-2b-jpn-it
Japanese Stable LM Base Gamma 7B 7.2 base stabilityai/japanese-stablelm-base-gamma-7b
Japanese Stable LM Beta 70B 70 base stabilityai/japanese-stablelm-base-beta-70b
Japanese Stable LM Beta 7B 6.7 base stabilityai/japanese-stablelm-base-beta-7b
KARAKURI LM 70B Chat v0.1 70 instruct karakuri-ai/karakuri-lm-70b-chat-v0.1
KARAKURI LM 8x7B Instruct v0.1 47 instruct karakuri-ai/karakuri-lm-8x7b-instruct-v0.1
KARAKURI LM 70B v0.1 70 base karakuri-ai/karakuri-lm-70b-v0.1
LLM-jp-13B v2.0 13 base llm-jp/llm-jp-13b-v2.0
Llama 2 13B 13 base meta-llama/Llama-2-13b-hf
Llama 2 70B 70 base meta-llama/Llama-2-70b-hf
Llama 2 7B 6.7 base meta-llama/Llama-2-7b-hf
Llama 3 70B Instruct 70 instruct meta-llama/Meta-Llama-3-70B-Instruct
Llama 3 8B Instruct 8.0 instruct meta-llama/Meta-Llama-3-8B-Instruct
Llama 3 70B 70 base meta-llama/Meta-Llama-3-70B
Llama 3 8B 8.0 base meta-llama/Meta-Llama-3-8B
Llama 3 Swallow 70B Instruct 70 instruct tokyotech-llm/Llama-3-Swallow-70B-Instruct-v0.1
Llama 3 Swallow 8B Instruct 8.0 instruct tokyotech-llm/Llama-3-Swallow-8B-Instruct-v0.1
Llama 3 Swallow 70B 70 base tokyotech-llm/Llama-3-Swallow-70B-v0.1
Llama 3 Swallow 8B 8.0 base tokyotech-llm/Llama-3-Swallow-8B-v0.1
Llama 3 Youko 70B Instruct 70 instruct rinna/llama-3-youko-70b-instruct
Llama 3 Youko 8B Instruct 8.0 instruct rinna/llama-3-youko-8b-instruct
Llama 3 Youko 70B 70 base rinna/llama-3-youko-70b
Llama 3 Youko 8B 8.0 base rinna/llama-3-youko-8b
Llama 3 heron brain 70B v0.3 70 instruct turing-motors/Llama-3-heron-brain-70B-v0.3
Llama 3 heron brain 8B v0.3 8.0 instruct turing-motors/Llama-3-heron-brain-8B-v0.3
Llama 3.1 405B Instruct 405 instruct deepinfra/meta-llama/Meta-Llama-3.1-405B-Instruct Japanese tasks, English tasks
Llama 3.1 70B Instruct 70 instruct meta-llama/Meta-Llama-3.1-70B-Instruct
Llama 3.1 8B Instruct 8.0 instruct meta-llama/Meta-Llama-3.1-8B-Instruct
Llama 3.1 70B 70 base meta-llama/Meta-Llama-3.1-70B
Llama 3.1 8B 8.0 base meta-llama/Meta-Llama-3.1-8B
Llama 3.1 Swallow 70B Instruct v0.1 70 instruct tokyotech-llm/Llama-3.1-Swallow-70B-Instruct-v0.1
Llama 3.1 Swallow 8B Instruct v0.1 8.0 instruct tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1
Llama 3.1 Swallow 8B Instruct v0.2 8.0 instruct tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.2
Llama 3.1 Swallow 70B v0.1 70 base tokyotech-llm/Llama-3.1-Swallow-70B-v0.1
Llama 3.1 Swallow 8B v0.1 8.0 base tokyotech-llm/Llama-3.1-Swallow-8B-v0.1
Llama 3.1 Swallow 8B v0.2 8.0 base tokyotech-llm/Llama-3.1-Swallow-8B-v0.2
Llama 3.2 1B Instruct 1.2 instruct meta-llama/Llama-3.2-1B-Instruct
Llama 3.2 3B Instruct 3.2 instruct meta-llama/Llama-3.2-3B-Instruct
Llama 3.2 1B 1.2 base meta-llama/Llama-3.2-1B
Llama 3.2 3B 3.2 base meta-llama/Llama-3.2-3B
Llama-3-ELYZA-JP-8B 8.0 instruct elyza/Llama-3-ELYZA-JP-8B
Llama-3.1-70B-Japanese-Instruct-2407 70 instruct cyberagent/Llama-3.1-70B-Japanese-Instruct-2407
Mistral-7B-Instruct-v0.3 7.2 instruct mistralai/Mistral-7B-Instruct-v0.3
Mistral-7B-v0.1 7.2 base mistralai/Mistral-7B-v0.1
Mistral-7B-v0.2 7.2 base mistral-community/Mistral-7B-v0.2
Mistral-7B-v0.3 7.2 base mistralai/Mistral-7B-v0.3
Mistral-NeMo-Instruct-2407 (12B) 12 instruct mistralai/Mistral-Nemo-Instruct-2407
Mistral-NeMo-Minitron 8B Instruct 8.4 instruct nvidia/Mistral-NeMo-Minitron-8B-Instruct
Mistral-NeMo-Minitron 8B 8.4 base nvidia/Mistral-NeMo-Minitron-8B-Base
Mistral-Nemo-Base-2407 (12B) 12 base mistralai/Mistral-Nemo-Base-2407
Mixtral-8x7B-Instruct-v0.1 47 instruct mistralai/Mixtral-8x7B-Instruct-v0.1
Mixtral-8x22B-Instruct-v0.1 141 instruct mistralai/Mixtral-8x22B-Instruct-v0.1
Mixtral-8x7B-v0.1 47 base mistralai/Mixtral-8x7B-v0.1
Mixtral-8x22B-v0.1 141 base mistralai/Mixtral-8x22B-v0.1
Phi-3-Mini-128K-Instruct 3.8 instruct microsoft/Phi-3-mini-128k-instruct
Phi-3.5-MoE Instruct 42 instruct microsoft/Phi-3.5-MoE-instruct
Qwen1.5-7B 7.7 base Qwen/Qwen1.5-7B
Qwen2-72B-Instruct 72 instruct Qwen/Qwen2-72B-Instruct
Qwen2-7B-Instruct 7.6 instruct Qwen/Qwen2-7B-Instruct
Qwen2-72B 72 base Qwen/Qwen2-72B
Qwen2-7B 7.6 base Qwen/Qwen2-7B
Qwen2.5-72B-Instruct 72 instruct Qwen/Qwen2.5-72B-Instruct
Qwen2.5-3B-Instruct 3.1 instruct Qwen/Qwen2.5-3B-Instruct
Qwen2.5-7B-Instruct 7.6 instruct Qwen/Qwen2.5-7B-Instruct
Qwen2.5-0.5B-Instruct 0.5 instruct Qwen/Qwen2.5-0.5B-Instruct
Qwen2.5-0.5B 0.5 base Qwen/Qwen2.5-0.5B
Qwen2.5-72B 72 base Qwen/Qwen2.5-72B
Qwen2.5-1.5B-Instruct 1.5 instruct Qwen/Qwen2.5-1.5B-Instruct
Qwen2.5-1.5B 1.5 base Qwen/Qwen2.5-1.5B
Qwen2.5-3B 3.1 base Qwen/Qwen2.5-3B
Qwen2.5-7B 7.6 base Qwen/Qwen2.5-7B
RakutenAI-7B-chat 7.2 instruct Rakuten/RakutenAI-7B-chat
RakutenAI-7B 7.2 base Rakuten/RakutenAI-7B
Sarashina2-13B 13 base sbintuitions/sarashina2-13b
Sarashina2-70B 70 base sbintuitions/sarashina2-70b
Sarashina2-7B 7.3 base sbintuitions/sarashina2-7b
Stockmark-100b 100 base stockmark/stockmark-100b
Swallow 13B 13 base tokyotech-llm/Swallow-13b-hf
Swallow 70B 70 base tokyotech-llm/Swallow-70b-hf
Swallow 7B 6.7 base tokyotech-llm/Swallow-7b-hf
Swallow-70b-instruct-v0.1 70 instruct tokyotech-llm/Swallow-70b-instruct-v0.1
Swallow-7b-instruct-v0.1 6.7 instruct tokyotech-llm/Swallow-7b-instruct-v0.1
Swallow-MS 7B v0.1 7.2 base tokyotech-llm/Swallow-MS-7b-v0.1
Swallow-MS-7b-instruct-v0.1 7.2 instruct tokyotech-llm/Swallow-MS-7b-instruct-v0.1
Swallow-MX 8x7B v0.1 47 base tokyotech-llm/Swallow-MX-8x7b-NVE-v0.1
Tanuki-8x8B-dpo-v1.0 47 instruct weblab-GENIAC/Tanuki-8x8B-dpo-v1.0
Tanuki-8B-dpo-v1.0 7.5 instruct weblab-GENIAC/Tanuki-8B-dpo-v1.0
Yi-1.5 34B 34 base 01-ai/Yi-1.5-34B
Yi-1.5 6B 6.1 base 01-ai/Yi-1.5-6B
Yi-1.5 9B 8.8 base 01-ai/Yi-1.5-9B
Youri 7B 6.7 base rinna/youri-7b
llm-jp-3-13b-instruct 13 instruct llm-jp/llm-jp-3-13b-instruct
llm-jp-3-13b 13 base llm-jp/llm-jp-3-13b
llm-jp-3-1.8b-instruct 1.8 instruct llm-jp/llm-jp-3-1.8b-instruct
llm-jp-3-1.8b 1.8 base llm-jp/llm-jp-3-1.8b
llm-jp-3-3.7b-instruct 3.7 instruct llm-jp/llm-jp-3-3.7b-instruct
llm-jp-3-3.7b 3.7 base llm-jp/llm-jp-3-3.7b

Acknowledgements

This website uses the following software packages.