The Swallow Project conducts independent evaluation experiments on major large language models (LLMs) in parallel with its development of high-performance LLMs specialized in Japanese. By comparing LLMs developed both in Japan and worldwide, we can gauge where the Swallow Project currently stands. We conduct evaluations under fair conditions while accounting for the unique specifications of each LLM, such as tokenization and system prompts. By analyzing the evaluation results in relation to how each LLM was developed, we aim to uncover the "recipe" for building a high-performance LLM. This website visualizes the evaluation results of the LLMs tested within the Swallow Project as bar charts, radar charts, and scatter plots. We hope it serves not only as a guide for selecting high-performance LLMs but also as a reference for developing LLMs with strong Japanese language capabilities.
The content of this leaderboard (including data and graphs) is provided under a Creative Commons Attribution 4.0 (CC-BY 4.0) License, the evaluation software (swallow-evaluation) is distributed under the MIT License, and the source code of this website is also provided under the MIT License.
We evaluate LLMs on question answering and reading comprehension to assess language understanding and common knowledge, summarization and translation to measure language generation, and code generation and mathematics to test logical reasoning abilities. The evaluation scores range from 0 (lowest) to 1 (highest).
Five-choice questions created with a knowledge base
Open-ended Q&A to assess the amount of knowledge and reasoning ability
Open-ended Q&A answerable using an encyclopedia
Open-ended Q&A on Wikipedia articles
Task to generate a highlight from a BBC news article
Japanese translation of math word problems (GSM8K)
Translation of news articles (English to Japanese)
Translation of news articles (Japanese to English)
Japanese translation of the four-choice exam benchmark MMLU (53 subjects)
Japanese translation of HumanEval (code generation benchmark)
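As a concrete illustration of how these 0-to-1 task scores can be combined into a single Japanese-average figure, here is a minimal Python sketch. The task keys and the unweighted arithmetic mean are assumptions made for illustration only (the swallow-evaluation pipeline may group or weight tasks differently), and the numbers are dummy values, not leaderboard results.

```python
# Minimal sketch (not the leaderboard's actual code): aggregate per-task
# Japanese scores, each already normalized to the 0-1 range, into one average.
# Task keys and the unweighted mean are illustrative assumptions; the values
# below are dummy numbers, not real evaluation results.
japanese_scores = {
    "multiple_choice_qa": 0.82,      # five-choice questions from a knowledge base
    "open_qa_reasoning": 0.55,       # open-ended Q&A on knowledge and reasoning
    "encyclopedic_qa": 0.60,         # open-ended Q&A answerable with an encyclopedia
    "reading_comprehension": 0.88,   # open-ended Q&A on Wikipedia articles
    "summarization": 0.21,           # highlight generation for BBC news articles
    "math_word_problems": 0.47,      # Japanese translation of GSM8K
    "translation_en_ja": 0.28,
    "translation_ja_en": 0.24,
    "exam_questions": 0.58,          # Japanese translation of MMLU
    "code_generation": 0.35,         # Japanese translation of HumanEval
}

def average_score(scores: dict[str, float]) -> float:
    """Unweighted arithmetic mean of 0-1 task scores."""
    return sum(scores.values()) / len(scores)

print(f"Japanese average: {average_score(japanese_scores):.3f}")
```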
We evaluate LLMs on question answering, reading comprehension, and exam questions to assess language understanding and common knowledge, and on code generation and mathematics to test logical reasoning abilities. The evaluation scores range from 0 (lowest) to 1 (highest).
Four-choice questions based on scientific knowledge and common sense
Open-ended Q&A based on trivia
Four-choice questions to predict the next event
Open-ended Q&A grounded in an evidence document
Two-choice questions to predict the antecedent of a pronoun
Four-choice exam benchmark MMLU (53 subjects)
Math word problems
High school math competitions
23 challenging tasks from the BIG-Bench dataset (Srivastava et al., 2023)
Code generation ability measured by unit tests
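Some models in the table further below lack certain English task scores (see the "Missing scores" column), and the English average ("En Avg") is then reported as missing too. The Python sketch below shows one plausible way to implement that convention; the function name and the rule of leaving the aggregate undefined whenever any task score is absent are assumptions, not the leaderboard's documented behavior.

```python
from typing import Optional

# Sketch of one plausible handling of missing scores: if any English task
# score is unavailable (None), the aggregate "En Avg" is treated as missing
# as well. This mirrors the table below, where "En Avg" appears alongside the
# missing task scores, but the exact rule is an assumption.
def english_average(scores: dict[str, Optional[float]]) -> Optional[float]:
    """Mean of 0-1 task scores, or None if any score is missing."""
    if any(value is None for value in scores.values()):
        return None
    return sum(scores.values()) / len(scores)

# Dummy example: HumanEval was not evaluated, so no average is reported.
example = {"OpenBookQA": 0.40, "TriviaQA": 0.71, "HumanEval": None}
print(english_average(example))  # -> None
```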
We use the Japanese version of MT-Bench (Nejumi LLM Leaderboard edition) to evaluate dialogue capabilities. The test questions are based on v4, and the reference answers are derived from v2, with incorrect answers corrected. The evaluation scores range from 0 (lowest) to 1 (highest).
Implementing algorithms in Python or C++, and creating websites using HTML.
Extracting named entities (such as author names and numerical values) and sentiment (e.g., positive or negative) from text.
Creating essays and strategies on topics related to law, economics, history, philosophy, and education.
Generating solutions for problems and word problems in algebra, geometry, probability, and number theory.
Generating answers to questions by leveraging common knowledge and reasoning skills.
Writing creative texts by assuming the persona of famous individuals or fictional characters and imagining hypothetical scenarios.
Generating answers and explanations on topics related to physics, chemistry, biology, geography, architecture, and machine learning.
Writing blog articles, email drafts, and fictional narratives.
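MT-Bench responses are conventionally rated by an LLM judge on a 1-10 scale, while this leaderboard reports scores between 0 and 1, so a divide-by-10 normalization is the natural mapping. The sketch below assumes that convention and an unweighted mean over the eight categories above; the exact normalization and aggregation are not spelled out in this section, so treat both as assumptions, and the numbers as dummy values.

```python
# Sketch: normalize per-category MT-Bench judge scores (assumed 1-10 scale,
# the common MT-Bench convention) to the 0-1 range and average them.
# Category names follow the standard MT-Bench categories listed above;
# the scores are dummy values, not leaderboard results.
mt_bench_judge_scores = {
    "coding": 6.2,
    "extraction": 7.1,
    "humanities": 8.4,
    "math": 5.0,
    "reasoning": 5.8,
    "roleplay": 7.9,
    "stem": 7.6,
    "writing": 8.1,
}

normalized = {category: score / 10.0 for category, score in mt_bench_judge_scores.items()}
overall = sum(normalized.values()) / len(normalized)
print(f"MT-Bench average on the 0-1 scale: {overall:.3f}")
```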
Model name | # Parameters [B] | Release date | Type | Missing scores |
---|---|---|---|---|
Aya Expanse 8B | 8.0 | 2024-10-24 | chat | |
Aya Expanse 32B | 32 | 2024-10-24 | chat | |
CyberAgentLM3-22B-chat | 22 | 2024-07-09 | chat | |
Falcon3-1B-Base | 1.7 | 2024-12-19 | base | |
Falcon3-1B-Instruct | 1.7 | 2024-12-19 | chat | |
Falcon3-3B-Base | 3.2 | 2024-12-19 | base | |
Falcon3-3B-Instruct | 3.2 | 2024-12-19 | chat | |
Falcon3-7B-Base | 7.5 | 2024-12-19 | base | |
Falcon3-7B-Instruct | 7.5 | 2024-12-19 | chat | |
Falcon3-10B-Base | 10 | 2024-12-19 | base | |
Falcon3-10B-Instruct | 10 | 2024-12-19 | chat | |
Gemma 2 2B | 2.6 | 2024-06-27 | base | |
Gemma 2 2B IT | 2.6 | 2024-06-27 | chat | |
Gemma 2 9B | 9.2 | 2024-06-27 | base | |
Gemma 2 9B IT | 9.2 | 2024-06-27 | chat | |
Gemma 2 27B | 27 | 2024-06-27 | base | |
Gemma 2 27B IT | 27 | 2024-06-27 | chat | |
Gemma 2 Baku 2B | 2.6 | 2024-10-03 | base | |
Gemma 2 Baku 2B IT | 2.6 | 2024-10-03 | chat | |
Gemma 2 JPN | 2.6 | 2024-06-27 | chat | |
GPT-3.5 (gpt-3.5-turbo-0125) | N/A | 2024-01-25 | chat | En Avg OpenBookQA TriviaQA HellaSwag SQuAD2 XWINO MMLU GSM8K MATH BBH HumanEval |
GPT-4-turbo (gpt-4-turbo-2024-04-09) | N/A | 2024-04-09 | chat | En Avg OpenBookQA TriviaQA HellaSwag SQuAD2 XWINO MMLU GSM8K MATH BBH HumanEval |
GPT-4o (gpt-4o-2024-05-13) | N/A | 2024-05-13 | chat | En Avg OpenBookQA TriviaQA HellaSwag SQuAD2 XWINO MMLU GSM8K MATH BBH HumanEval |
GPT-4o (gpt-4o-2024-08-06) | N/A | 2024-08-06 | chat | En Avg OpenBookQA TriviaQA HellaSwag SQuAD2 XWINO MMLU GSM8K MATH BBH HumanEval |
GPT-4o-mini (gpt-4o-mini-2024-07-18) | N/A | 2024-07-18 | chat | En Avg OpenBookQA TriviaQA HellaSwag SQuAD2 XWINO MMLU GSM8K MATH BBH HumanEval |
Llama 3 8B | 8.0 | 2024-04-18 | base | |
Llama 3 8B Instruct | 8.0 | 2024-04-18 | chat | |
Llama 3 70B | 70 | 2024-04-18 | base | |
Llama 3 70B Instruct | 70 | 2024-04-18 | chat | |
Llama-3-ELYZA-JP-8B | 8.0 | 2024-06-26 | chat | |
Llama 3 heron brain 8B v0.3 | 8.0 | 2024-07-01 | chat | |
Llama 3 heron brain 70B v0.3 | 70 | 2024-07-01 | chat | |
Llama 3 Swallow 8B | 8.0 | 2024-07-01 | base | |
Llama 3 Swallow 8B Instruct | 8.0 | 2024-07-01 | chat | |
Llama 3 Swallow 70B | 70 | 2024-07-01 | base | |
Llama 3 Swallow 70B Instruct | 70 | 2024-07-01 | chat | |
Llama 3 Youko 8B | 8.0 | 2024-05-07 | base | |
Llama 3 Youko 8B Instruct | 8.0 | 2024-05-07 | chat | |
Llama 3 Youko 70B | 70 | 2024-07-25 | base | |
Llama 3 Youko 70B Instruct | 70 | 2024-07-25 | chat | |
Llama 3.1 8B | 8.0 | 2024-07-23 | base | |
Llama 3.1 8B Instruct | 8.0 | 2024-07-23 | chat | |
Llama 3.1 70B | 70 | 2024-07-23 | base | |
Llama 3.1 70B Instruct | 70 | 2024-07-23 | chat | |
Llama-3.1-70B-Japanese-Instruct-2407 | 70 | 2024-07-23 | chat | |
Llama 3.1 Swallow 8B v0.1 | 8.0 | 2024-10-08 | base | |
Llama 3.1 Swallow 8B Instruct v0.1 | 8.0 | 2024-10-08 | chat | |
Llama 3.1 Swallow 70B v0.1 | 70 | 2024-10-08 | base | |
Llama 3.1 Swallow 70B Instruct v0.1 | 70 | 2024-10-08 | chat | |
Llama 3.1 Swallow 8B v0.2 | 8.0 | 2024-11-11 | base | |
Llama 3.1 Swallow 8B Instruct v0.2 | 8.0 | 2024-11-11 | chat | |
Llama 3.1 Swallow 8B Instruct v0.3 | 8.0 | 2024-12-23 | chat | |
Llama 3.1 Swallow 70B Instruct v0.3 | 70 | 2024-12-30 | chat | |
Llama 3.2 1B | 1.2 | 2024-09-25 | base | |
Llama 3.2 1B Instruct | 1.2 | 2024-09-25 | chat | |
Llama 3.2 3B | 3.2 | 2024-09-25 | base | |
Llama 3.2 3B Instruct | 3.2 | 2024-09-25 | chat | |
Llama 3.3 70B Instruct | 70 | 2024-12-06 | chat | |
Llama 3.3 Swallow 70B v0.4 | 70 | 2025-03-14 | base | |
Llama 3.3 Swallow 70B Instruct v0.4 | 70 | 2025-03-10 | chat | |
llm-jp-3-1.8b | 1.8 | 2024-09-25 | base | |
llm-jp-3-1.8b-instruct | 1.8 | 2024-09-25 | chat | |
llm-jp-3-3.7b | 3.7 | 2024-09-25 | base | |
llm-jp-3-3.7b-instruct | 3.7 | 2024-09-25 | chat | |
llm-jp-3-13b | 13 | 2024-09-25 | base | |
llm-jp-3-13b-instruct | 13 | 2024-09-25 | chat | |
Mistral-Nemo-Base-2407 (12B) | 12 | 2024-07-18 | base | |
Mistral-Nemo-Instruct-2407 (12B) | 12 | 2024-07-18 | chat | |
Mistral-NeMo-Minitron 8B | 8.4 | 2024-08-21 | base | |
Mistral-NeMo-Minitron 8B Instruct | 8.4 | 2024-08-21 | chat | |
Mistral-7B-v0.3 | 7.2 | 2024-05-22 | base | |
Mistral-7B-Instruct-v0.3 | 7.2 | 2024-05-22 | chat | |
Mixtral-8x22B-v0.1 | 141 | 2024-04-17 | base | |
Mixtral-8x22B-Instruct-v0.1 | 141 | 2024-04-17 | chat | |
Phi-3-Mini-128K-Instruct | 3.8 | 2024-04-23 | chat | |
Phi-4 | 14 | 2024-12-13 | chat | |
PLaMo 2 1B | 1.3 | 2025-02-21 | base | |
PLaMo 2 8B | 9.1 | 2025-02-21 | base | |
Qwen2-7B | 7.6 | 2024-06-07 | base | |
Qwen2-7B-Instruct | 7.6 | 2024-06-07 | chat | |
Qwen2-72B | 72 | 2024-06-07 | base | |
Qwen2-72B-Instruct | 72 | 2024-06-07 | chat | |
Qwen2.5-0.5B | 0.5 | 2024-09-19 | base | |
Qwen2.5-0.5B-Instruct | 0.5 | 2024-09-19 | chat | |
Qwen2.5-1.5B | 1.5 | 2024-09-19 | base | |
Qwen2.5-1.5B-Instruct | 1.5 | 2024-09-19 | chat | |
Qwen2.5-3B | 3.1 | 2024-09-19 | base | |
Qwen2.5-3B-Instruct | 3.1 | 2024-09-19 | chat | |
Qwen2.5-7B | 7.6 | 2024-09-19 | base | |
Qwen2.5-7B-Instruct | 7.6 | 2024-09-19 | chat | |
Qwen2.5-14B-Instruct | 14 | 2024-09-25 | chat | |
Qwen2.5-32B-Instruct | 32 | 2024-09-25 | chat | |
Qwen2.5-72B | 72 | 2024-09-19 | base | |
Qwen2.5-72B-Instruct | 72 | 2024-09-19 | chat | |
Sarashina2-7B | 7.3 | 2024-06-14 | base | |
Sarashina2-13B | 13 | 2024-06-14 | base | |
Sarashina2-70B | 70 | 2024-06-14 | base | |
Stockmark-100b | 100 | 2024-05-16 | base | |
Swallow 7B | 6.7 | 2023-12-19 | base | |
Swallow 13B | 13 | 2023-12-19 | base | |
Swallow 70B | 70 | 2023-12-19 | base | |
Swallow-MS 7B v0.1 | 7.2 | 2024-03-11 | base | |
Swallow-MS-7b-instruct-v0.1 | 7.2 | 2024-03-11 | chat | |
Swallow-MX 8x7B v0.1 | 47 | 2024-03-11 | base | |
Swallow-7b-instruct-v0.1 | 6.7 | 2023-12-19 | chat | |
Swallow-70b-instruct-v0.1 | 70 | 2023-12-19 | chat | |
Tanuki-8B-dpo-v1.0 | 7.5 | 2024-08-30 | chat | |
Tanuki-8x8B-dpo-v1.0 | 47 | 2024-08-30 | chat | |
TinySwallow-1.5B | 1.5 | 2025-01-30 | base | |
TinySwallow-1.5B-Instruct | 1.5 | 2025-01-30 | chat | |
Yi-1.5 6B | 6.1 | 2024-05-13 | base | |
Yi-1.5 9B | 8.8 | 2024-05-13 | base | |
Yi-1.5 34B | 34 | 2024-05-13 | base |