About evaluation
In parallel with developing its own LLMs, the Swallow Project independently conducts evaluation experiments on publicly available LLMs as a reference for developing high-performance large language models (LLMs). Comparing against LLMs developed not only in Japan but also around the world shows us the current level of the Swallow project. By evaluating each LLM under fair conditions, while accounting for its particular specifications (tokenization, system prompts, etc.), and contrasting the results with how each LLM was developed, we can examine the “recipe” for building a high-performance LLM. These experiments also make the challenges of LLM evaluation tangible: task scores vary not only with differences in LLM capability but also with seemingly trivial evaluation details such as the prompt format.

On this site, you can view the results of the LLM evaluations conducted within the Swallow project as bar graphs, radar charts, and scatter plots. We hope the site serves both as information for selecting the right LLM for your application and as a reference for developing LLMs that are strong in Japanese.
Evaluation tasks
In the 2024 Swallow project, we evaluate LLMs with 10 datasets for Japanese understanding and generation tasks, MT-Bench for the Japanese multi-turn dialogue task, and 9 datasets for English understanding and generation tasks. For all tasks, evaluation scores range from 0 (lowest) to 1 (highest).
Japanese understanding and generation tasks
JCom (JCommonsenseQA)
Q&A regarding commonsense and inference
Five-choice questions created with a knowledge base
- Metric: Accuracy
- Setting: 4-shot
- Reference: (Kurihara et al., 2022)
JEMHopQA
Multi-hop Q&A
Open-ended Q&A to assess the amount of knowledge and reasoning ability
- Metric: Character F1
- Setting: 4-shot
- Reference: (Ishii et al., 2024)
NIILC
Classical Q&A
Open-ended Q&A whose answers can be found in an encyclopedia
- Metric: Character F1
- Setting: 4-shot
- Reference: (Sekine, 2003)
JSQuAD
Reading comprehension
Open-ended Q&A on Wikipedia articles
- Metric: Character F1
- Setting: 4-shot
- Reference: (Kurihara et al., 2022)
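JEMHopQA, NIILC, and JSQuAD are scored with character-level F1 between the generated answer and the gold answer. The function below is a minimal sketch of such a metric, not the exact implementation in the evaluation harness, which may additionally normalize the text and take the maximum over multiple gold answers.

```python
from collections import Counter

def char_f1(prediction: str, reference: str) -> float:
    """Bag-of-characters F1 between a predicted answer and a gold answer (illustrative sketch)."""
    pred_chars = Counter(prediction)
    gold_chars = Counter(reference)
    overlap = sum((pred_chars & gold_chars).values())  # shared characters, counted with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_chars.values())
    recall = overlap / sum(gold_chars.values())
    return 2 * precision * recall / (precision + recall)

print(char_f1("徳川家康", "徳川家康公"))  # ≈ 0.889: the prediction misses one gold character
```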
XL-Sum
Summarization
Generating a highlight of a BBC news article
- Metric: ROUGE-2
- Setting: 1-shot
- Reference: (Hasan et al., 2021)
MGSM
Mathematics
Japanese translation of math word problems (GSM8K)
- Metric: Accuracy (exact match)
- Setting: 4-shot
- Reference: (Shi et al., 2023)
WMT20 (en-ja)
English-Japanese translation
Translation of news articles
- Metric: BLEU
- Setting: 4-shot
- Reference: (Barrault et al., 2020)
WMT20 (ja-en)
Japanese-English translation
Translation of news articles
- Metric: BLEU
- Setting: 4-shot
- Reference: (Barrault et al., 2020)
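The two WMT20 translation tasks are scored with corpus-level BLEU. As an illustration only, BLEU can be computed with the sacrebleu package roughly as follows; the scorer and tokenization actually used by the evaluation harness may differ (Japanese output in particular needs a suitable tokenizer), and since scores on this site are reported in the 0-1 range, raw BLEU (0-100) is presumably divided by 100.

```python
# Illustrative only: corpus-level BLEU with the sacrebleu package.
import sacrebleu

hypotheses = ["The minister visited the plant on Monday ."]
references = [["The minister visited the factory on Monday ."]]  # one reference per hypothesis

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)        # BLEU on a 0-100 scale
print(bleu.score / 100)  # rescaled to the 0-1 range used on this site
```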
JMMLU
Multi-task natural language understanding
Japanese translation of MMLU, a benchmark of four-choice exam questions (53 subjects)
- Metric: Accuracy
- Setting: 5-shot
- Reference: (Yin et al., 2024)
JHumanEval
Code generation
Japanese translation of HumanEval (a code generation benchmark)
- Metric: pass@1
- Setting: 0-shot, 10 trials
- Reference: (Sato et al., 2024)
Japanese multi-turn dialogue tasks (Japanese MT-Bench)
We used the Nejumi Leaderboard Neo version of Japanese MT-Bench, a Japanese adaptation of MT-Bench, which is a benchmark for multi-turn dialogue capability. We evaluate instruction-tuned models only. The benchmark automatically rates each response on a 10-point scale using GPT-4 (gpt-4-1106-preview) as the judge (a minimal sketch of this judging step appears at the end of this subsection). The evaluation categories are as follows.
Coding
Extraction
Humanities
Math
Reasoning
Roleplay
STEM
Writing
Note that our Japanese MT-Bench scores are lower than those of other leaderboards. We believe this is because many leaderboards use GPT-4 (gpt-4-0613) as the judge, whereas we use GPT-4 (gpt-4-1106-preview). Our investigation found that, although the absolute scores differ considerably from those of other leaderboards, the relative rankings among the models remain largely unchanged. We therefore kept the same GPT-4 version rather than redoing the many evaluations we had already completed.
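Concretely, LLM-as-a-judge scoring sends each question and model response to the judge model together with a rating instruction, then parses the returned score. The snippet below is only a minimal sketch of this idea, assuming the OpenAI Python SDK; the actual judge prompts, reference answers, and score parsing are defined in FastChat, and the prompt text here is hypothetical.

```python
# Minimal sketch of LLM-as-a-judge scoring (not the FastChat implementation;
# the judge prompt below is hypothetical). Requires OPENAI_API_KEY to be set.
from openai import OpenAI

client = OpenAI()

def judge_response(question: str, answer: str) -> str:
    prompt = (
        "You are an impartial judge. Rate the assistant's response to the user's "
        "question on a scale of 1 to 10 and briefly explain your rating.\n\n"
        f"[Question]\n{question}\n\n[Assistant's response]\n{answer}"
    )
    completion = client.chat.completions.create(
        model="gpt-4-1106-preview",  # the judge version used in our evaluation
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return completion.choices[0].message.content  # rating and brief explanation as text

print(judge_response("富士山の標高は?", "富士山の標高は3,776メートルです。"))
```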
English understanding and generation tasks
OpenBookQA
Q&A based on facts and common sense
Four-choice questions based on scientific knowledge and common sense
- Metric: Accuracy
- Setting: 4-shot
- Reference: (Mihaylov et al., 2018)
TriviaQA
Q&A based on knowledge
Open-ended Q&A based on trivia questions
- Metric: Accuracy (exact match)
- Setting: 4-shot
- Reference: (Joshi et al., 2017)
HellaSwag
Commonsense inference
Four-choice questions to predict the next event
- Metric: Accuracy
- Setting: 4-shot
- Reference: (Zellers et al., 2019)
SQuAD2
Reading comprehension
Open-ended Q&A based on evidence documents
- Metric: Accuracy (exact match)
- Setting: 4-shot
- Reference: (Rajpurkar et al., 2018)
XWINO
Commonsense inference
Two-choice questions to predict the antecedent of a pronoun
- Metric: Accuracy
- Setting: 4-shot
- Reference: (Tikhonov and Ryabinin, 2021)
MMLU
Multitask natural language understanding
Benchmark of four-choice exam questions covering 57 subjects
- Metric: Accuracy
- Setting: 5-shot
- Reference: (Hendrycks et al., 2021)
GSM8K
Mathematics
Math word problems
- Metric: Accuracy (exact match)
- Setting: 4-shot
- Reference: (Cobbe et al., 2021)
BBH (BIG-Bench-Hard)
A collection of tasks that are hard for LLMs
23 challenging tasks selected from the BIG-Bench dataset (Srivastava et al., 2023)
- Metric: Accuracy (exact match)
- Setting: 3-shot, CoT
- Reference: (Suzgun et al., 2023)
HumanEval
Code generation
Code generation ability measured by unit tests
- Metric: pass@1
- Setting: 0-shot, 10 trials
- Reference: (Chen et al., 2021)
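Both HumanEval and JHumanEval report pass@1 estimated from 10 sampled solutions per problem. A standard way to compute this is the unbiased pass@k estimator of Chen et al. (2021); with n = 10 samples and k = 1 it reduces to the fraction of samples that pass the unit tests, averaged over problems. A minimal sketch, assuming the evaluation harness follows this estimator:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator of Chen et al. (2021): n samples, c of which pass the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# With 10 trials per problem and k = 1, the estimator reduces to c / n.
print(pass_at_k(n=10, c=3, k=1))  # ≈ 0.3
```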
Evaluation tools
We used the following software packages for evaluation.
JP Language Model Evaluation Harness (commit #9b42d41)
An evaluation framework for Japanese LLMs
https://github.com/Stability-AI/lm-evaluation-harness/
Code Generation LM Evaluation Harness (commit #0261c52)
An evaluation framework for code generation (HumanEval)
https://github.com/bigcode-project/bigcode-evaluation-harness
FastChat (commit #e86e70d0)
An automatic evaluation framework that uses an LLM as a judge (MT-Bench)
https://github.com/lm-sys/FastChat
swallow-evaluation
The evaluation framework used in the Swallow Project (encompassing all of the tools above)
https://github.com/swallow-llm/swallow-evaluation
Evaluated models
We list the LLMs in alphabetical order. Some LLMs have scores only for Japanese MT-Bench and no scores for the language understanding and generation tasks; the Missing scores column indicates which task groups are missing.
Name | Size (B params) | Type | Distribution (repository or API model) | Missing scores |
---|---|---|---|---|
Aya Expanse 32B | 32.0 | instruct | CohereForAI/aya-expanse-32b | |
Aya Expanse 8B | 8.0 | instruct | CohereForAI/aya-expanse-8b | |
C4AI Command-R v0.1 | 35 | base | CohereForAI/c4ai-command-r-v01 | |
CyberAgentLM2-7B-chat | 7.0 | instruct | cyberagent/calm2-7b-chat | |
CyberAgentLM2-7B | 7.0 | base | cyberagent/calm2-7b | |
CyberAgentLM3-22B-chat | 22 | instruct | cyberagent/calm3-22b-chat | |
ELYZA-japanese-Llama-2-13b | 13 | base | elyza/ELYZA-japanese-Llama-2-13b | |
Fugaku-LLM 13B | 13 | base | Fugaku-LLM/Fugaku-LLM-13B | |
GPT-3.5 (gpt-3.5-turbo-0125) | N/A | instruct | gpt-3.5-turbo-0125 | Japanese tasks, English tasks |
GPT-4o (gpt-4o-2024-05-13) | N/A | instruct | gpt-4o-2024-05-13 | Japanese tasks, English tasks |
GRIN-MoE | 42 | instruct | microsoft/GRIN-MoE | |
Gemma 2 27B IT | 27 | instruct | google/gemma-2-27b-it | |
Gemma 2 2B IT | 2.6 | instruct | google/gemma-2-2b-it | |
Gemma 2 9B IT | 9.2 | instruct | google/gemma-2-9b-it | |
Gemma 2 27B | 27 | base | google/gemma-2-27b | |
Gemma 2 2B | 2.6 | base | google/gemma-2-2b | |
Gemma 2 9B | 9.2 | base | google/gemma-2-9b | |
Gemma 2 Baku 2B IT | 2.6 | instruct | rinna/gemma-2-baku-2b-it | |
Gemma 2 Baku 2B | 2.6 | base | rinna/gemma-2-baku-2b | |
Gemma 2 JPN | 2.6 | instruct | google/gemma-2-2b-jpn-it | |
Japanese Stable LM Base Gamma 7B | 7.2 | base | stabilityai/japanese-stablelm-base-gamma-7b | |
Japanese Stable LM Beta 70B | 70 | base | stabilityai/japanese-stablelm-base-beta-70b | |
Japanese Stable LM Beta 7B | 6.7 | base | stabilityai/japanese-stablelm-base-beta-7b | |
KARAKURI LM 70B Chat v0.1 | 70 | instruct | karakuri-ai/karakuri-lm-70b-chat-v0.1 | |
KARAKURI LM 8x7B Instruct v0.1 | 47 | instruct | karakuri-ai/karakuri-lm-8x7b-instruct-v0.1 | |
KARAKURI LM 70B v0.1 | 70 | base | karakuri-ai/karakuri-lm-70b-v0.1 | |
LLM-jp-13B v2.0 | 13 | base | llm-jp/llm-jp-13b-v2.0 | |
Llama 2 13B | 13 | base | meta-llama/Llama-2-13b-hf | |
Llama 2 70B | 70 | base | meta-llama/Llama-2-70b-hf | |
Llama 2 7B | 6.7 | base | meta-llama/Llama-2-7b-hf | |
Llama 3 70B Instruct | 70 | instruct | meta-llama/Meta-Llama-3-70B-Instruct | |
Llama 3 8B Instruct | 8.0 | instruct | meta-llama/Meta-Llama-3-8B-Instruct | |
Llama 3 70B | 70 | base | meta-llama/Meta-Llama-3-70B | |
Llama 3 8B | 8.0 | base | meta-llama/Meta-Llama-3-8B | |
Llama 3 Swallow 70B Instruct | 70 | instruct | tokyotech-llm/Llama-3-Swallow-70B-Instruct-v0.1 | |
Llama 3 Swallow 8B Instruct | 8.0 | instruct | tokyotech-llm/Llama-3-Swallow-8B-Instruct-v0.1 | |
Llama 3 Swallow 70B | 70 | base | tokyotech-llm/Llama-3-Swallow-70B-v0.1 | |
Llama 3 Swallow 8B | 8.0 | base | tokyotech-llm/Llama-3-Swallow-8B-v0.1 | |
Llama 3 Youko 70B Instruct | 70 | instruct | rinna/llama-3-youko-70b-instruct | |
Llama 3 Youko 8B Instruct | 8.0 | instruct | rinna/llama-3-youko-8b-instruct | |
Llama 3 Youko 70B | 70 | base | rinna/llama-3-youko-70b | |
Llama 3 Youko 8B | 8.0 | base | rinna/llama-3-youko-8b | |
Llama 3 heron brain 70B v0.3 | 70 | instruct | turing-motors/Llama-3-heron-brain-70B-v0.3 | |
Llama 3 heron brain 8B v0.3 | 8.0 | instruct | turing-motors/Llama-3-heron-brain-8B-v0.3 | |
Llama 3.1 405B Instruct | 405 | instruct | deepinfra/meta-llama/Meta-Llama-3.1-405B-Instruct | Japanese tasks, English tasks |
Llama 3.1 70B Instruct | 70 | instruct | meta-llama/Meta-Llama-3.1-70B-Instruct | |
Llama 3.1 8B Instruct | 8.0 | instruct | meta-llama/Meta-Llama-3.1-8B-Instruct | |
Llama 3.1 70B | 70 | base | meta-llama/Meta-Llama-3.1-70B | |
Llama 3.1 8B | 8.0 | base | meta-llama/Meta-Llama-3.1-8B | |
Llama 3.1 Swallow 70B Instruct v0.1 | 70 | instruct | tokyotech-llm/Llama-3.1-Swallow-70B-Instruct-v0.1 | |
Llama 3.1 Swallow 8B Instruct v0.1 | 8.0 | instruct | tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1 | |
Llama 3.1 Swallow 8B Instruct v0.2 | 8.0 | instruct | tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.2 | |
Llama 3.1 Swallow 70B v0.1 | 70 | base | tokyotech-llm/Llama-3.1-Swallow-70B-v0.1 | |
Llama 3.1 Swallow 8B v0.1 | 8.0 | base | tokyotech-llm/Llama-3.1-Swallow-8B-v0.1 | |
Llama 3.1 Swallow 8B v0.2 | 8.0 | base | tokyotech-llm/Llama-3.1-Swallow-8B-v0.2 | |
Llama 3.2 1B Instruct | 1.2 | instruct | meta-llama/Llama-3.2-1B-Instruct | |
Llama 3.2 3B Instruct | 3.2 | instruct | meta-llama/Llama-3.2-3B-Instruct | |
Llama 3.2 1B | 1.2 | base | meta-llama/Llama-3.2-1B | |
Llama 3.2 3B | 3.2 | base | meta-llama/Llama-3.2-3B | |
Llama-3-ELYZA-JP-8B | 8.0 | instruct | elyza/Llama-3-ELYZA-JP-8B | |
Llama-3.1-70B-Japanese-Instruct-2407 | 70 | instruct | cyberagent/Llama-3.1-70B-Japanese-Instruct-2407 | |
Mistral-7B-Instruct-v0.3 | 7.2 | instruct | mistralai/Mistral-7B-Instruct-v0.3 | |
Mistral-7B-v0.1 | 7.2 | base | mistralai/Mistral-7B-v0.1 | |
Mistral-7B-v0.2 | 7.2 | base | mistral-community/Mistral-7B-v0.2 | |
Mistral-7B-v0.3 | 7.2 | base | mistralai/Mistral-7B-v0.3 | |
Mistral-NeMo-Instruct-2407 (12B) | 12 | instruct | mistralai/Mistral-Nemo-Instruct-2407 | |
Mistral-NeMo-Minitron 8B Instruct | 8.4 | instruct | nvidia/Mistral-NeMo-Minitron-8B-Instruct | |
Mistral-NeMo-Minitron 8B | 8.4 | base | nvidia/Mistral-NeMo-Minitron-8B-Base | |
Mistral-Nemo-Base-2407 (12B) | 12 | base | mistralai/Mistral-Nemo-Base-2407 | |
Mixtral-8x7B-Instruct-v0.1 | 47 | instruct | mistralai/Mixtral-8x7B-Instruct-v0.1 | |
Mixtral-8x22B-Instruct-v0.1 | 141 | instruct | mistralai/Mixtral-8x22B-Instruct-v0.1 | |
Mixtral-8x7B-v0.1 | 47 | base | mistralai/Mixtral-8x7B-v0.1 | |
Mixtral-8x22B-v0.1 | 141 | base | mistralai/Mixtral-8x22B-v0.1 | |
Phi-3-Mini-128K-Instruct | 3.8 | instruct | microsoft/Phi-3-mini-128k-instruct | |
Phi-3.5-MoE Instruct | 42 | instruct | microsoft/Phi-3.5-MoE-instruct | |
Qwen1.5-7B | 7.7 | base | Qwen/Qwen1.5-7B | |
Qwen2-72B-Instruct | 72 | instruct | Qwen/Qwen2-72B-Instruct | |
Qwen2-7B-Instruct | 7.6 | instruct | Qwen/Qwen2-7B-Instruct | |
Qwen2-72B | 72 | base | Qwen/Qwen2-72B | |
Qwen2-7B | 7.6 | base | Qwen/Qwen2-7B | |
Qwen2.5-72B-Instruct | 72 | instruct | Qwen/Qwen2.5-72B-Instruct | |
Qwen2.5-3B-Instruct | 3.1 | instruct | Qwen/Qwen2.5-3B-Instruct | |
Qwen2.5-7B-Instruct | 7.6 | instruct | Qwen/Qwen2.5-7B-Instruct | |
Qwen2.5-0.5B-Instruct | 0.5 | instruct | Qwen/Qwen2.5-0.5B-Instruct | |
Qwen2.5-0.5B | 0.5 | base | Qwen/Qwen2.5-0.5B | |
Qwen2.5-72B | 72 | base | Qwen/Qwen2.5-72B | |
Qwen2.5-1.5B-Instruct | 1.5 | instruct | Qwen/Qwen2.5-1.5B-Instruct | |
Qwen2.5-1.5B | 1.5 | base | Qwen/Qwen2.5-1.5B | |
Qwen2.5-3B | 3.1 | base | Qwen/Qwen2.5-3B | |
Qwen2.5-7B | 7.6 | base | Qwen/Qwen2.5-7B | |
RakutenAI-7B-chat | 7.2 | instruct | Rakuten/RakutenAI-7B-chat | |
RakutenAI-7B | 7.2 | base | Rakuten/RakutenAI-7B | |
Sarashina2-13B | 13 | base | sbintuitions/sarashina2-13b | |
Sarashina2-70B | 70 | base | sbintuitions/sarashina2-70b | |
Sarashina2-7B | 7.3 | base | sbintuitions/sarashina2-7b | |
Stockmark-100b | 100 | base | stockmark/stockmark-100b | |
Swallow 13B | 13 | base | tokyotech-llm/Swallow-13b-hf | |
Swallow 70B | 70 | base | tokyotech-llm/Swallow-70b-hf | |
Swallow 7B | 6.7 | base | tokyotech-llm/Swallow-7b-hf | |
Swallow-70b-instruct-v0.1 | 70 | instruct | tokyotech-llm/Swallow-70b-instruct-v0.1 | |
Swallow-7b-instruct-v0.1 | 6.7 | instruct | tokyotech-llm/Swallow-7b-instruct-v0.1 | |
Swallow-MS 7B v0.1 | 7.2 | base | tokyotech-llm/Swallow-MS-7b-v0.1 | |
Swallow-MS-7b-instruct-v0.1 | 7.2 | instruct | tokyotech-llm/Swallow-MS-7b-instruct-v0.1 | |
Swallow-MX 8x7B v0.1 | 47 | base | tokyotech-llm/Swallow-MX-8x7b-NVE-v0.1 | |
Tanuki-8x8B-dpo-v1.0 | 47 | instruct | weblab-GENIAC/Tanuki-8x8B-dpo-v1.0 | |
Tanuki-8B-dpo-v1.0 | 7.5 | instruct | weblab-GENIAC/Tanuki-8B-dpo-v1.0 | |
Yi-1.5 34B | 34 | base | 01-ai/Yi-1.5-34B | |
Yi-1.5 6B | 6.1 | base | 01-ai/Yi-1.5-6B | |
Yi-1.5 9B | 8.8 | base | 01-ai/Yi-1.5-9B | |
Youri 7B | 6.7 | base | rinna/youri-7b | |
llm-jp-3-13b-instruct | 13 | instruct | llm-jp/llm-jp-3-13b-instruct | |
llm-jp-3-13b | 13 | base | llm-jp/llm-jp-3-13b | |
llm-jp-3-1.8b-instruct | 1.8 | instruct | llm-jp/llm-jp-3-1.8b-instruct | |
llm-jp-3-1.8b | 1.8 | base | llm-jp/llm-jp-3-1.8b | |
llm-jp-3-3.7b-instruct | 3.7 | instruct | llm-jp/llm-jp-3-3.7b-instruct | |
llm-jp-3-3.7b | 3.7 | base | llm-jp/llm-jp-3-3.7b |
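The Distribution column gives the Hugging Face repository (or API model name) used for each LLM. As a hypothetical illustration only, an open-weight model from the table, here tokyotech-llm/Llama-3.1-Swallow-8B-v0.1, could be loaded with the transformers library roughly as follows; this is not part of our evaluation pipeline, which uses the harnesses listed above.

```python
# Hypothetical illustration: loading one open-weight model from the table with
# the Hugging Face transformers library (not our evaluation pipeline).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "tokyotech-llm/Llama-3.1-Swallow-8B-v0.1"  # value from the Distribution column
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # requires the accelerate package
)

inputs = tokenizer("大規模言語モデルとは、", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```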
Acknowledgements
This website uses the following software packages.