Swallow LLM Leaderboard v2

The Swallow LLM Leaderboard v2 is a leaderboard for Japanese large language models, featuring high-difficulty benchmarks. The experimental results published here are obtained using swallow-evaluation-instruct, an evaluation framework developed for post-trained large language models.

It is well known that the responses of large language models (LLMs) depend heavily on prompts and generation settings. Therefore, to measure a model’s capabilities fairly and consistently, it is important to standardize experimental conditions.

The Swallow team developed the swallow-evaluation framework and has been evaluating various LLMs under unified conditions as much as possible. To allow comparison between pretrained and post-trained models, we adopted the following experimental conditions (hereafter referred to as the conventional method) as the baseline:

  • Do not use chat templates that convert prompts into dialogue format (except for MT-Bench, where chat templates are applied).
  • Adopt few-shot inference (since models may otherwise fail to understand task instructions).
  • Require output of the answer only (to clearly define the evaluation scope).
  • For multiple-choice questions, select options based on likelihood (because some models may output beyond the provided options).
  • Greedy decoding with temperature set to 0 (since probabilistic decoding leads to unstable results).

This approach has been confirmed sufficient for evaluating performance on knowledge-dependent tasks such as question answering based on encyclopedic knowledge or commonsense, as well as traditional natural language processing tasks such as machine translation and automatic summarization.
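
To make the conventional method concrete, the following is a minimal sketch of likelihood-based option selection with Hugging Face Transformers; the model name, prompt, and options are placeholders, and this is not the swallow-evaluation implementation itself.

```python
# Minimal sketch of likelihood-based multiple-choice scoring under the
# conventional method (illustrative, not the actual swallow-evaluation code):
# score each candidate option by the log-probability the model assigns to it
# given the prompt, then pick the highest-scoring option. No sampling is involved.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"   # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
model.eval()

prompt = "Question: ...\nAnswer:"        # few-shot examples would precede this in practice
options = [" A", " B", " C", " D"]       # candidate continuations

def option_logprob(prompt: str, option: str) -> float:
    """Sum of log-probabilities of the option tokens, conditioned on the prompt."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1].float(), dim=-1)   # predicts tokens 1..N-1
    token_lp = log_probs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # Option tokens start at position prompt_len; this assumes the prompt tokenization
    # is a prefix of the prompt+option tokenization, which holds for typical tokenizers.
    return token_lp[:, prompt_len - 1:].sum().item()

scores = {opt: option_logprob(prompt, opt) for opt in options}
prediction = max(scores, key=scores.get)
```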

However, reasoning models (so-called “deep thinking models”) such as OpenAI’s o1 and DeepSeek-R1 have emerged in recent years, and it has become clear that the conventional method cannot measure their performance properly. For example, when a distilled model of DeepSeek-R1 (DeepSeek-R1-Distill-Llama-8B) was evaluated with the conventional method, its performance was underestimated by as much as 30 points on reasoning-heavy mathematics and science benchmarks such as MATH and GPQA. These models only demonstrate their true performance when chat templates, zero-shot inference, and free-form generation that includes the reasoning process are used. In addition, for models provided only via API, such as the OpenAI GPT series, likelihood-based evaluation is impossible in the first place. Furthermore, with the rapid progress of reasoning models, conventional benchmarks are approaching score saturation, making higher-difficulty benchmarks increasingly necessary.

Against this backdrop, we newly developed swallow-evaluation-instruct. This framework is designed to properly evaluate post-trained reasoning-focused models and incorporates the following conditions:

  • Prompts with chat templates applied
  • Zero-shot inference as the default
  • Chain-of-thought prompts for tasks where reasoning is effective
  • Free-form generation including reasoning processes, rather than short answers
  • Support for probabilistic decoding
  • Evaluation based only on the final answer after removing reasoning traces
  • Ability to control the use and depth of reasoning mode
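
To make these conditions concrete, here is a minimal sketch of the flow they imply, assuming a DeepSeek-R1-style reasoning trace closed by </think> and a final answer wrapped in \boxed{...}; the model, prompt, and extraction patterns are illustrative, not the actual swallow-evaluation-instruct implementation.

```python
# Minimal sketch of the evaluation flow implied by the conditions above
# (illustrative only, not the swallow-evaluation-instruct implementation).
# Assumes a DeepSeek-R1-style reasoning trace closed by "</think>" and a final
# answer wrapped in \boxed{...}, as in MATH-style benchmarks.
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"   # example reasoning model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# 1) Zero-shot prompt with the model's built-in chat template applied
messages = [{"role": "user",
             "content": "Solve 2x + 3 = 11. Put the final answer in \\boxed{}."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# 2) Free-form generation that includes the reasoning process (probabilistic decoding)
output_ids = model.generate(input_ids, max_new_tokens=2048, do_sample=True, temperature=0.6)
text = tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=False)  # keep the </think> marker

# 3) Drop the reasoning trace, then extract only the final answer for scoring
answer_part = text.split("</think>")[-1]
match = re.search(r"\\boxed\{([^{}]*)\}", answer_part)
final_answer = match.group(1) if match else answer_part.strip()
```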

Evaluation benchmarks were selected based on the following criteria: their ability to highlight the challenges of Japanese LLMs, established international adoption, transparent construction process and validation methods, quality assurance by experts, and higher difficulty compared to conventional benchmarks. As of August 2025, six Japanese tasks (JEMHopQA, MMLU-ProX, GPQA, MATH-100, JHumanEval, M-IFEval-Ja) and six English tasks (HellaSwag, MMLU-Pro, GPQA, MATH-500, AIME 2024-2025, LiveCodeBench) have been adopted. Evaluation results are published in the Swallow LLM Leaderboard v2.

We believe that releasing Swallow LLM Leaderboard v2 and swallow-evaluation-instruct to the research and development community will contribute in the following ways:

  • Ensuring transparency: Publishing evaluation methods and providing an environment where anyone can verify results under the same criteria
  • Guaranteeing reproducibility: Enabling proper comparison of research and development outcomes under standardized conditions
  • Contributing to the community: As research on post-training methodologies and inference-time scaling laws for reasoning models becomes mainstream, providing a common foundation to support the development of new insights and techniques

Our goal is not simply numerical comparison, but to build a shared foundation that supports transparency and progress in Japanese LLM research. We hope that Swallow LLM Leaderboard v2 and swallow-evaluation-instruct will bring us closer to that goal.

Overall scores

Average scores of benchmark results for post-trained models (click on any model name to change the sorting order)

The results of evaluating post-trained models using swallow-evaluation-instruct are shown here. For each model, from left to right, the average score of five Japanese tasks (excluding M-IFEval-Ja), the average score of six English tasks, the average score of Japanese MT-Bench, and the average score of English MT-Bench are displayed. All scores range from 0 (lowest, worst) to 1 (highest, best). By default, models are sorted by the average score of the five Japanese tasks, but clicking on a model name allows reordering.
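
For reference, the table’s averages and default ordering can be reproduced in a few lines; the file name and column labels below are placeholders rather than the leaderboard’s actual data schema.

```python
# Illustrative sketch of computing the leaderboard's averages and default ordering;
# the file name and column labels are placeholders, not the actual data schema.
import pandas as pd

ja_tasks = ["JEMHopQA", "MMLU-ProX", "GPQA (ja)", "MATH-100", "JHumanEval"]                   # 5 Japanese tasks
en_tasks = ["HellaSwag", "MMLU-Pro", "GPQA", "MATH-500", "AIME 2024-2025", "LiveCodeBench"]   # 6 English tasks

scores = pd.read_csv("scores.csv", index_col="model")    # one row per model, scores in [0, 1]
scores["ja_avg"] = scores[ja_tasks].mean(axis=1)
scores["en_avg"] = scores[en_tasks].mean(axis=1)

# Default view: sorted by the five-task Japanese average, best model first
leaderboard = scores.sort_values("ja_avg", ascending=False)
```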

For the five Japanese tasks, GPT-5 recorded the highest average score (0.891). GPT-5 also achieved the highest score (0.875) on the six English tasks, confirming the outstanding performance of OpenAI’s latest model. Among open models, Qwen3-235B-A22B-Thinking-2507 achieved the highest average score (0.823) on the five Japanese tasks. While there remains a gap with top models such as GPT-5 and o3, this result shows that the gap between open and closed models is steadily narrowing. Notably, this model is also released under a permissive Apache 2.0 license.

Recently, OpenAI released gpt-oss-120b under the Apache 2.0 license, attracting attention. It ranked 6th overall and 3rd among open models, following Qwen3-235B-A22B-Instruct-2507. With roughly half the total parameter count of the top Qwen3 models, it can be considered a cost-effective model. Furthermore, it ranks just below GPT-4.1 and just above o3-mini, indicating that gpt-oss-120b is among the cutting-edge models. Note that when the reasoning effort was set to “high,” generation sometimes stopped prematurely, so we used “medium” for this evaluation.
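
As an aside on the reasoning-effort setting, the following is a hedged sketch of how such a level is typically requested through an OpenAI-compatible Chat Completions endpoint; the model identifier is a placeholder, and whether the reasoning_effort parameter is honored depends on the model and serving stack.

```python
# Hedged sketch: requesting a reasoning-effort level through an OpenAI-compatible
# Chat Completions endpoint. Whether "reasoning_effort" is honored depends on the
# model and the serving stack, so treat this as illustrative rather than definitive.
from openai import OpenAI

client = OpenAI()   # or OpenAI(base_url=...) when the open model is served locally
response = client.chat.completions.create(
    model="gpt-oss-120b",            # placeholder model identifier
    reasoning_effort="medium",       # "high" sometimes stopped generation prematurely in our runs
    messages=[{"role": "user", "content": "..."}],
)
print(response.choices[0].message.content)
```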

Meanwhile, the models developed by the Swallow team struggled to achieve high average scores because they are not equipped with deep reasoning capabilities. Moreover, the Swallow series has focused on Japanese language ability and Japan-related knowledge, whereas the current benchmark suite for post-trained models includes only two knowledge-oriented tasks, JEMHopQA (Japanese) and HellaSwag (English), which also lowered the average scores. Nevertheless, evaluations conducted with the newly developed swallow-evaluation-instruct reflect the latest trends, including the strength of reasoning models and differences in post-training recipes across models, and we believe this will serve as an important foundation for future model development.

Scores on each task

Evaluation results of Japanese tasks for post-trained models

Evaluation results of English tasks for post-trained models

We visualized the scores on the six Japanese tasks and the six English tasks in radar charts for four models: Qwen3-235B-A22B-Thinking-2507, GPT-5, gpt-oss-120b, and gpt-oss-20b. On the six Japanese tasks, GPT-5 stands out, with each task score approaching 1.0, highlighting the need for more challenging benchmarks. Open models such as Qwen3-235B-A22B-Thinking-2507 and gpt-oss-120b also demonstrated results consistent with their scale, and the absence of significant weaknesses on specific tasks is noteworthy. However, on JEMHopQA, which measures Japanese knowledge, the score of Llama 3.3 Swallow 70B Instruct v0.4 (0.658, not shown here) surpasses that of gpt-oss-120b (0.635), indicating that there is still room for improvement in knowledge related to Japan and the Japanese language.

Moreover, the positions of the data points for each model are nearly identical between Japanese benchmarks and their English counterparts (MATH-100 vs. MATH-500, GPQA (Japanese) vs. GPQA (English), MMLU-ProX vs. MMLU-Pro). This shows that at least top-level models can solve mathematics and science problems to the same degree regardless of whether they are presented in Japanese or English.

Evaluation results of Japanese MT-Bench

Evaluation results of English MT-Bench

Next, we present the evaluation results of Japanese and English MT-Bench, which measure the ability to provide useful responses in dialogue. All models achieved high scores, with Qwen3-235B-A22B-Thinking-2507 and gpt-oss-20b in particular outperforming GPT-5. This suggests that it is becoming increasingly difficult to fully assess the performance of state-of-the-art LLMs using MT-Bench alone. On the other hand, even in dialogue tasks, performance differences among models can be observed in M-IFEval-Ja, which evaluates controllability in following instructions during Japanese dialogue (e.g., “respond using only hiragana”), as shown in the earlier chart.
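
To illustrate the kind of rule-based verification behind instructions such as “respond using only hiragana,” here is a hypothetical checker; it is not M-IFEval-Ja’s actual implementation.

```python
# Hypothetical example of a rule-based check for the instruction
# "respond using only hiragana" (not M-IFEval-Ja's actual implementation).
import re

# Hiragana block plus the long-vowel mark, whitespace, and common punctuation
ALLOWED = re.compile(r"[\u3041-\u309Fー\s、。！？]*")

def is_hiragana_only(response: str) -> bool:
    """True if every character in the response is allowed by the rule above."""
    return ALLOWED.fullmatch(response) is not None

print(is_hiragana_only("こんにちは、せかい。"))   # True
print(is_hiragana_only("こんにちは、世界。"))     # False: contains kanji
```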

Evaluation Framework: swallow-evaluation-instruct

In developing swallow-evaluation-instruct, we compared existing evaluation frameworks (LM Eval Harness, llm-jp-eval, lighteval) and chose to base our work on lighteval. The main reasons are as follows:

  • Ease of answer extraction: Provides general-purpose regular expressions for extracting strings corresponding to “answers,” such as formulas or option symbols, from model outputs, reducing implementation overhead.
  • Suitability for code generation tasks: Includes functionality for running unit tests to judge correctness, making it easy to add code generation tasks like HumanEval without adding external dependencies (a minimal illustration follows this list).
  • High extensibility: Prompt configuration, model output generation, answer extraction, and correctness checking are modularized, making it easy to add new benchmarks or modify existing ones.
  • Reproducibility of major benchmarks: Successfully reproduced DeepSeek-R1 paper scores on implemented math, science, and code generation benchmarks such as MATH-500, AIME, GPQA, and LiveCodeBench. Official scores of major model series such as Gemma3 and Qwen3 were also replicated.
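
As an illustration of the unit-test-based correctness check mentioned above, a simplified, HumanEval-style sketch is shown below: the generated completion and the task’s test code are executed in a separate process, and a clean exit counts as a pass. Real harnesses add sandboxing and stricter resource limits; this is not lighteval’s actual implementation.

```python
# Simplified, HumanEval-style functional-correctness check (illustrative only):
# execute the generated function together with the benchmark's unit tests in a
# separate process and count the sample as correct if the tests pass.
import subprocess
import sys
import tempfile

def passes_unit_tests(completion: str, test_code: str, timeout: float = 10.0) -> bool:
    """Run the generated code plus its unit tests; True if the process exits cleanly."""
    program = completion + "\n\n" + test_code + "\ncheck(candidate)\n"
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], timeout=timeout, capture_output=True)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

# Toy task (HumanEval test code conventionally defines check(candidate)):
completion = "def add(a, b):\n    return a + b\ncandidate = add\n"
test_code = "def check(candidate):\n    assert candidate(2, 3) == 5\n"
print(passes_unit_tests(completion, test_code))   # True if the tests pass
```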

In designing swallow-evaluation-instruct, we set the following as mandatory requirements:

  • Automatic application of built-in chat templates in post-trained models
  • Specification of generation conditions such as temperature via runtime arguments
  • Flexible answer extraction for multiple-choice, formulas, free-form answers, and code snippets
  • Functionality to separate reasoning process and final answers in reasoning model outputs

Additionally, we identified the following desirable features:

  • Implementation of major benchmarks (e.g., GPQA)
  • Ease of adding new benchmarks
  • Functionality to verify correctness of formulas and code snippets
  • Multiple trials when using probabilistic decoding
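
On the last point, a common way to stabilize scores under probabilistic decoding is to sample several generations per problem and report the mean correctness (pass@1 averaged over n samples). The sketch below uses hypothetical generate and is_correct helpers standing in for the model call and the benchmark’s grader.

```python
# Minimal sketch of multi-trial evaluation under probabilistic decoding:
# sample several generations per problem and average their correctness.
# "generate" and "is_correct" are hypothetical helpers standing in for the
# model call and the benchmark's answer check.
import statistics

def mean_accuracy(problems, generate, is_correct, n_samples: int = 4) -> float:
    """Average correctness over n_samples generations for each problem."""
    per_problem = []
    for problem in problems:
        trials = [is_correct(generate(problem), problem) for _ in range(n_samples)]
        per_problem.append(sum(trials) / n_samples)
    return statistics.mean(per_problem)

# Toy usage with stand-in helpers:
problems = [{"question": "1 + 1", "answer": "2"}]
generate = lambda p: "2"                          # stand-in for a sampled model output
is_correct = lambda out, p: out == p["answer"]
print(mean_accuracy(problems, generate, is_correct))   # 1.0
```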

From these perspectives, we concluded that lighteval is currently the most suitable framework for evaluating post-trained models.

swallow-evaluation-instruct is a new evaluation framework designed to appropriately measure the performance of post-trained LLMs, including reasoning-focused models, under standardized conditions. This framework is released under the MIT License, with documentation prepared for use by researchers and developers of LLMs.

We hope this framework will support transparency, reproducibility, and extensibility in Japanese LLM research and development, and contribute to the advancement of more sophisticated models.

The research and development of the large language model Swallow was carried out with support from the AIST policy budget project “Research and Development on Foundation Models of Generative AI for the Physical Domain,” the MEXT-funded project “Formation of Research and Development Centers for Ensuring Transparency and Reliability of Generative AI Models,” and other assistance. We also utilized ABCI 3.0 provided by AIST and AIST Solutions under the “ABCI 3.0 Accelerated Use for Development” program. In addition, this research was conducted using the TSUBAME 4.0 supercomputer at the Institute of Science Tokyo.
