March 11, 2024
Swallow (Mistral)
These are large language models (Swallow-MS 7B and Swallow-MX 8x7B) with enhanced Japanese capabilities based on Mistral 7B and Mixtral 8x7B. The model parameters (weights) are released under the permissive Apache 2.0 license, allowing their use in both research and commercial applications.

Legacy (higher-performance successor models have since been developed and released)
Models
Changelog
- 2024-04-26: Released Swallow-MS-7b-instruct-v0.1.
- 2024-03-11: Released Swallow-MS-7b-v0.1 and Swallow-MX-8x7b-NVE-v0.1.
Overview
The large language models Swallow-MS 7B and Swallow-MX 8x7B were developed by research teams from the Okazaki and Yokota Laboratories at the School of Computing, Tokyo Institute of Technology, together with the National Institute of Advanced Industrial Science and Technology (AIST). To enhance the Japanese capabilities of the large language models Mistral 7B and Mixtral 8x7B, which exhibit strong performance in English language understanding and dialogue, the research team conducted continued pre-training using large-scale Japanese language data. In performance evaluations conducted by the team, Swallow-MS 7B achieved the highest performance among open 7B large language models on benchmarks related to Japanese knowledge, reasoning, and language generation (as of March 2024, in comparisons among base models). In addition, Swallow-MX 8x7B adopts a Mixture of Experts (MoE) architecture and is the first open model with enhanced Japanese capabilities based on this architecture. The released models can be downloaded from Hugging Face.
The licenses for Swallow-MS 7B and Swallow-MX 8x7B inherit the Apache 2.0 licenses of Mistral 7B and Mixtral 8x7B. As long as users comply with this license, the models may be used for both research and commercial purposes.
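For reference, the released weights can be loaded directly with the Hugging Face transformers library. The snippet below is a minimal sketch, assuming the published model ID tokyotech-llm/Swallow-MS-7b-v0.1 and a GPU with enough memory for a 7B model in bfloat16; the prompt and generation settings are illustrative, and the same pattern applies to the other released variants.

```python
# Minimal sketch: load Swallow-MS 7B with Hugging Face transformers and generate text.
# The model ID below is the published repository; sampling settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tokyotech-llm/Swallow-MS-7b-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # halves memory relative to float32
    device_map="auto",           # place weights on the available GPU(s)
)

prompt = "東京工業大学の主なキャンパスは、"  # "The main campuses of Tokyo Institute of Technology are ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```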
Performance
Swallow-MS 7B
Let us compare the Japanese performance of models obtained through continued pre-training of Mistral 7B and Llama 2, as well as other 7B models. Swallow-MS 7B steadily improves the Japanese capabilities of Mistral 7B through continued pre-training. On the Japanese benchmarks used for evaluation, Swallow-MS 7B shows higher average performance than other open 7B models available as of March 2024. In particular, it achieves substantial improvements on knowledge-oriented tasks such as JCommonsenseQA (JComQA) and NIILC, a trend similar to the performance gains observed for Swallow over Llama 2. In addition, Swallow-MS 7B extends Mistral 7B with Japanese vocabulary, enabling more few-shot examples to be packed into prompts, reducing garbled text generation, and accelerating generation speed.
| Model | Ja Avg | JComQA | JEMHopQA | NIILC | JSQuAD | XL-Sum | Ja-En | En-Ja | MGSM |
|---|---|---|---|---|---|---|---|---|---|
| CyberAgentLM2-7B (base) | 0.3098 | 0.2198 | 0.5047 | 0.5066 | 0.7799 | 0.0233 | 0.1499 | 0.2345 | 0.0600 |
| Llama 2 7B (base) | 0.3201 | 0.3852 | 0.4240 | 0.3410 | 0.7917 | 0.1905 | 0.1738 | 0.1783 | 0.0760 |
| Japanese Stable LM Beta 7B (base) | 0.3366 | 0.3610 | 0.4478 | 0.4432 | 0.8318 | 0.2195 | 0.1226 | 0.1946 | 0.0720 |
| Japanese Stable LM Beta 7B (base, VE) | 0.2937 | 0.2172 | 0.4482 | 0.4309 | 0.8202 | 0.0757 | 0.1453 | 0.1601 | 0.0520 |
| ELYZA-japanese-Llama-2-7b (base) | 0.3467 | 0.5791 | 0.4703 | 0.4019 | 0.8226 | 0.1312 | 0.1289 | 0.1795 | 0.0600 |
| ELYZA-japanese-Llama-2-7b-fast (base, VE) | 0.3312 | 0.5308 | 0.4330 | 0.3898 | 0.8131 | 0.1289 | 0.1143 | 0.1678 | 0.0720 |
| Youri 7B (base) | 0.3767 | 0.4620 | 0.4776 | 0.4999 | 0.8506 | 0.1957 | 0.1971 | 0.2671 | 0.0640 |
| Swallow 7B (base, VE) | 0.3940 | 0.4808 | 0.5078 | 0.5968 | 0.8573 | 0.1830 | 0.1511 | 0.2510 | 0.1240 |
| Swallow 7B-plus (base, VE) | 0.4090 | 0.5478 | 0.5493 | 0.6030 | 0.8544 | 0.1806 | 0.1441 | 0.2568 | 0.1360 |
| Qwen-7B | 0.3742 | 0.7712 | 0.4234 | 0.2376 | 0.8594 | 0.1371 | 0.1801 | 0.1689 | 0.2160 |
| Nekomata 7B | 0.4185 | 0.7417 | 0.4928 | 0.5022 | 0.8707 | 0.1676 | 0.1815 | 0.2673 | 0.1240 |
| Mistral-7B-v0.1 (7B, base) | 0.3717 | 0.7301 | 0.4245 | 0.2722 | 0.8563 | 0.2006 | 0.1733 | 0.1405 | 0.1760 |
| Japanese Stable LM Base Gamma 7B (base) | 0.4301 | 0.7364 | 0.4643 | 0.5568 | 0.8910 | 0.2293 | 0.1561 | 0.2390 | 0.1680 |
| Swallow-MS 7B (base, VE) | 0.4524 | 0.8570 | 0.4915 | 0.5519 | 0.8802 | 0.1988 | 0.1667 | 0.2494 | 0.2240 |
Next, let us examine the performance in English.
As is commonly observed in continued pre-training with cross-lingual transfer, performance on English tasks declines in the post–continued pre-training model (Swallow-MS 7B) compared to the original model (Mistral 7B).
In line with the improvements in Japanese performance, the decline is particularly noticeable on knowledge-oriented tasks such as TriviaQA.
We attempted to mitigate the degradation in English performance while improving Japanese performance by mixing Japanese and English data at a 9:1 ratio and adjusting the learning rate. However, possibly due to the small model size, it was not possible to completely prevent the performance drop.
Nevertheless, the average English performance of Swallow-MS 7B slightly surpasses that of Llama 2 7B and remains higher than that of Swallow 7B, and we therefore expect it to be used as a model that is strong in both Japanese and English.
| Model | En Avg | OpenBookQA | XWINO | TriviaQA | SQuAD 2.0 | HellaSwag | GSM8k |
|---|---|---|---|---|---|---|---|
| CyberAgentLM2-7B (base) | 0.4026 | 0.2860 | 0.8581 | 0.3496 | 0.3510 | 0.5003 | 0.0705 |
| Llama-2-7b (base) | 0.4895 | 0.3580 | 0.9049 | 0.6265 | 0.3207 | 0.5860 | 0.1410 |
| Japanese Stable LM Beta 7B (base) | 0.4736 | 0.3620 | 0.8994 | 0.5903 | 0.2992 | 0.5707 | 0.1198 |
| Japanese Stable LM Beta 7B (base, VE) | 0.4545 | 0.3520 | 0.8942 | 0.5549 | 0.3079 | 0.5644 | 0.0538 |
| ELYZA-japanese-Llama-2-7b (base) | 0.4703 | 0.3400 | 0.8989 | 0.5875 | 0.2721 | 0.5595 | 0.1638 |
| ELYZA-japanese-Llama-2-7b-fast (base, VE) | 0.4608 | 0.3280 | 0.8989 | 0.5817 | 0.2605 | 0.5530 | 0.1425 |
| Youri 7B (base) | 0.4566 | 0.3400 | 0.8938 | 0.5257 | 0.3297 | 0.5540 | 0.0963 |
| Swallow 7B (base, VE) | 0.4399 | 0.3180 | 0.8817 | 0.4836 | 0.3125 | 0.5308 | 0.1130 |
| Swallow 7B-plus (base, VE) | 0.4370 | 0.3280 | 0.8929 | 0.4558 | 0.3134 | 0.5259 | 0.1061 |
| Qwen-7B | 0.5412 | 0.3640 | 0.8933 | 0.5695 | 0.3799 | 0.5787 | 0.4617 |
| Nekomata 7B | 0.4380 | 0.3340 | 0.8766 | 0.4371 | 0.2933 | 0.5340 | 0.1531 |
| Mistral-7B-v0.1 (7B, base) | 0.5577 | 0.3660 | 0.9157 | 0.7050 | 0.3799 | 0.6264 | 0.3533 |
| Japanese Stable LM Base Gamma 7B (base) | 0.4860 | 0.3240 | 0.8976 | 0.5745 | 0.3546 | 0.5739 | 0.1911 |
| Swallow-MS 7B (base, VE) | 0.5042 | 0.3440 | 0.9037 | 0.5976 | 0.3364 | 0.5810 | 0.2623 |
We present here radar charts visualizing selected portions of these evaluation results.
Swallow-MX 8x7B
Next, we examine Swallow-MX 8x7B.
Because this model is a Mixture of Experts (MoE) model that combines eight 7B-class models, we compare it with 70B-class models, which have a similar total number of parameters (because Mixtral shares attention and layer-normalization parameters across experts, the total comes to roughly 47B parameters).
Evaluation on Japanese benchmarks shows that continued pre-training steadily improves the Japanese capabilities of Mixtral 8x7B in Swallow-MX 8x7B.
Significant performance gains are again observed on knowledge-oriented tasks such as JCommonsenseQA (JComQA) and NIILC.
Although it does not surpass Swallow 70B, which has a larger total number of parameters, it demonstrates performance comparable to 70B-class models, highlighting the strong potential of MoE architectures.
| Model | Ja Avg | JComQA | JEMHopQA | NIILC | JSQuAD | XL-Sum | Ja-En | En-Ja | MGSM |
|---|---|---|---|---|---|---|---|---|---|
| KARAKURI LM 70B (base) | 0.4669 | 0.8579 | 0.5125 | 0.5713 | 0.9100 | 0.1464 | 0.2113 | 0.2540 | 0.2720 |
| Llama-2-70b (base) | 0.4830 | 0.8686 | 0.4656 | 0.5256 | 0.9080 | 0.2361 | 0.2398 | 0.2643 | 0.3560 |
| Japanese Stable LM Beta 70B (base) | 0.5138 | 0.9115 | 0.4925 | 0.6042 | 0.9192 | 0.2573 | 0.2335 | 0.2765 | 0.4160 |
| Swallow 70B (base, VE) | 0.5528 | 0.9348 | 0.6290 | 0.6960 | 0.9176 | 0.2266 | 0.2298 | 0.3043 | 0.4840 |
| Qwen-14B | 0.4431 | 0.8829 | 0.4243 | 0.3220 | 0.8980 | 0.1851 | 0.2224 | 0.2223 | 0.3880 |
| Qwen-72B | 0.5244 | 0.9294 | 0.5566 | 0.4518 | 0.9159 | 0.2179 | 0.2356 | 0.2561 | 0.6320 |
| Mixtral 8x7B v0.1 (instruct) | 0.4486 | 0.8400 | 0.5033 | 0.3107 | 0.8808 | 0.2002 | 0.2063 | 0.1956 | 0.4520 |
| Swallow-MX 8x7B | 0.5208 | 0.9258 | 0.5843 | 0.5687 | 0.9148 | 0.2589 | 0.2074 | 0.2705 | 0.4360 |
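Before turning to English, here is a rough sanity check of the roughly 47B total parameters mentioned above. The back-of-the-envelope calculation below uses Mixtral 8x7B's publicly documented configuration (hidden size 4096, 32 layers, FFN size 14336, 32 query heads, 8 key/value heads, 8 experts with 2 active per token, 32k vocabulary) and ignores small terms such as layer norms and router weights, so the figures are approximate.

```python
# Back-of-the-envelope parameter count for a Mixtral-8x7B-style MoE transformer.
# Uses the publicly documented configuration; small terms (layer norms, router
# weights) are ignored, so the totals are approximate.
hidden = 4096          # model (embedding) dimension
layers = 32            # number of transformer blocks
ffn = 14336            # feed-forward (expert) inner dimension
n_heads, n_kv_heads = 32, 8
head_dim = hidden // n_heads            # 128
vocab = 32000
n_experts, active_experts = 8, 2

# Attention: Q and O projections are hidden x hidden; K and V use grouped-query
# attention with 8 key/value heads.
attn = 2 * hidden * hidden + 2 * hidden * (n_kv_heads * head_dim)

# Each expert is a SwiGLU MLP with three hidden x ffn projections.
expert = 3 * hidden * ffn

# Attention is shared across experts; only the expert MLPs are replicated.
per_layer_total = attn + n_experts * expert
per_layer_active = attn + active_experts * expert

embeddings = 2 * vocab * hidden         # input embedding + untied output head

total = layers * per_layer_total + embeddings
active = layers * per_layer_active + embeddings

print(f"total parameters : {total / 1e9:.1f}B")   # ~46.7B -> the '47B' figure above
print(f"active per token : {active / 1e9:.1f}B")  # ~12.9B used for any given token
```

The roughly 46.7B total is the "47B" cited above, while only about 12.9B parameters are active for any given token, which is why an 8x7B MoE is much cheaper at inference time than a dense 70B model of comparable quality.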
Finally, we examine the English performance of Swallow-MX 8x7B.
Unlike in the case of Swallow-MS 7B, Swallow-MX 8x7B shows little degradation compared to the original model, Mixtral 8x7B Instruct.
A similar mitigation of English performance degradation has also been observed in Swallow 70B, suggesting that increasing the number of model parameters may help reduce performance loss, regardless of whether the architecture is MoE or not.
However, for historical reasons related to debugging of the training framework, the continued pre-training of Swallow-MX 8x7B used a Japanese-to-English data ratio of 72:28 rather than the 9:1 used for Swallow-MS 7B. We therefore plan to investigate further whether differences in the Japanese–English mixing ratio also contributed to this outcome.
| Model | En Avg | OpenBookQA | XWINO | TriviaQA | SQuAD 2.0 | HellaSwag | GSM8k |
|---|---|---|---|---|---|---|---|
| Llama-2-70b (base) | 0.6268 | 0.4280 | 0.9290 | 0.8239 | 0.3770 | 0.6742 | 0.5284 |
| Japanese Stable LM Beta 70B (base) | 0.6288 | 0.4200 | 0.9299 | 0.8203 | 0.3867 | 0.6729 | 0.5428 |
| Swallow 70B (base, VE) | 0.6042 | 0.4220 | 0.9204 | 0.7756 | 0.3745 | 0.6458 | 0.4867 |
| Qwen-14B | 0.5945 | 0.3720 | 0.9067 | 0.6543 | 0.4167 | 0.6473 | 0.5701 |
| Qwen-72B | 0.6369 | 0.4040 | 0.9200 | 0.7501 | 0.3401 | 0.6647 | 0.7422 |
| Mixtral 8x7B v0.1 (instruct) | 0.6335 | 0.4160 | 0.9226 | 0.7740 | 0.3714 | 0.6823 | 0.6346 |
| Swallow-MX 8x7B | 0.6129 | 0.3740 | 0.9170 | 0.7847 | 0.3801 | 0.6520 | 0.5694 |
Note that benchmark scores cannot be compared across different datasets. For example, in a model’s Japanese evaluation results, even if the score for mathematics is higher than that for machine translation, this does not imply that the model is better at mathematics than at translation (it would be like comparing the results of entirely different exams with different difficulty levels and grading criteria). For the same reason, even if the average score on English tasks is higher than the average score on Japanese tasks for a given model, one cannot conclude that the model is stronger in English. Because evaluation scales and difficulty levels differ across benchmark datasets, it is inappropriate to discuss task strengths and weaknesses based solely on the shape of this radar chart.
Evaluation Benchmarks
For Japanese evaluation benchmarks, we used llm-jp-eval (v1.0.0) and the JP Language Model Evaluation Harness (commit #9b42d41). The breakdown is as follows:
- Multiple-choice question answering (JCommonsenseQA [Kurihara+, 2022])
- Free-form question answering (JEMHopQA [Ishii+, 2023])
- Free-form question answering (NIILC [Sekine, 2003])
- Machine reading comprehension (JSQuAD [Kurihara+, 2022])
- Automatic summarization (XL-Sum [Hasan+, 2021])
- Machine translation (WMT2020 ja–en [Barrault+, 2020])
- Machine translation (WMT2020 en–ja [Barrault+, 2020])
- Mathematics (MGSM [Shi+, 2023])
Note that natural language inference (NLI), which is commonly used as an evaluation benchmark for large language models, was excluded in this study. Language models tend to exhibit biased label predictions in NLI tasks, and when this bias happens to coincide with the correct labels, the resulting scores become artificially high. Consequently, the evaluation results—especially for 7B models—were unstable.
For English evaluation benchmarks, we used the Language Model Evaluation Harness (v0.3.0). The breakdown is as follows:
- Multiple-choice question answering (OpenBookQA [Mihaylov+, 2018])
- Free-form question answering (TriviaQA [Joshi+, 2017])
- Machine reading comprehension (SQuAD 2.0 [Rajpurkar+, 2018])
- Commonsense reasoning (XWINO [Tikhonov & Ryabinin, 2021])
- Natural language inference (HellaSwag [Zellers+, 2019])
- Mathematics (GSM8K [Cobbe+, 2021])
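As an illustration only, a subset of the English scores could be reproduced along the following lines. This is a minimal sketch assuming the harness exposes evaluator.simple_evaluate (as recent releases do); the exact task names, few-shot settings, and prompts actually used by the team may differ, and the Japanese evaluation additionally relies on llm-jp-eval and the JP harness fork.

```python
# Illustrative sketch: scoring a model with EleutherAI's lm-evaluation-harness.
# Task names and few-shot settings here are examples only; they do not
# necessarily match the settings used for the tables above.
import json
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",                                         # Hugging Face causal-LM backend
    model_args="pretrained=tokyotech-llm/Swallow-MS-7b-v0.1",  # model under evaluation
    tasks=["openbookqa", "triviaqa", "hellaswag", "gsm8k"],    # subset of the English benchmarks
    num_fewshot=4,                                             # illustrative; per-task settings vary
    batch_size=8,
    device="cuda:0",
)
print(json.dumps(results["results"], indent=2, default=str))
```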
Method
Swallow-MS 7B and Swallow-MX 8x7B were constructed by applying continued pre-training to Mistral 7B and Mixtral 8x7B Instruct, respectively.
To develop Japanese large language models strong in arithmetic reasoning and code generation, source code corpora were mixed with text corpora during training.
Specifically, Swallow-MS 7B was trained on AlgebraicStack [Azerbayev+, 2024], a corpus of mathematics-related source code, while Swallow-MX 8x7B was trained on both AlgebraicStack and The Vault [Nguyen+, 2023], a corpus pairing natural language with source code.
The effects of incorporating source code corpora will be further investigated through comparative experiments in future work.
The text corpora followed the same configuration as Swallow, using a Japanese-to-English mixture ratio of 9:1 (except for Swallow-MX 8x7B, which used 72:28), and consisted of the Swallow corpus, Japanese Wikipedia, and for English, RefinedWeb and the arXiv subset of The Stack.
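To make the mixture ratio concrete, the following sketch shows one generic way to sample training sequences so that roughly nine Japanese tokens are seen for every English token. The corpus names and sizes are placeholders and this is not the actual Swallow data pipeline; it only illustrates how a 9:1 target share translates into per-corpus sampling weights.

```python
# Generic sketch of sampling training sequences at a fixed language ratio
# (here 9:1 Japanese:English). Corpus names and token counts are placeholders;
# this is not the actual Swallow training pipeline.
import random

corpora = {
    "ja_web":        90_000_000_000,   # hypothetical Japanese web corpus size (tokens)
    "ja_wikipedia":   1_000_000_000,   # hypothetical Japanese Wikipedia size
    "en_refinedweb": 10_000_000_000,   # hypothetical English web corpus size
    "en_arxiv":       1_000_000_000,   # hypothetical English arXiv subset size
}
target_share = {"ja": 0.9, "en": 0.1}  # desired Japanese:English token ratio

def language_of(name: str) -> str:
    return "ja" if name.startswith("ja_") else "en"

# Split each language's target share across its corpora in proportion to size.
lang_totals: dict[str, int] = {}
for name, size in corpora.items():
    lang = language_of(name)
    lang_totals[lang] = lang_totals.get(lang, 0) + size

weights = {
    name: target_share[language_of(name)] * size / lang_totals[language_of(name)]
    for name, size in corpora.items()
}

# Each draw corresponds to one training sequence (equal sequence lengths assumed).
names, probs = zip(*weights.items())
sample = random.choices(names, weights=probs, k=100_000)
ja_fraction = sum(language_of(n) == "ja" for n in sample) / len(sample)
print(f"sampled Japanese fraction ≈ {ja_fraction:.2f}")  # ≈ 0.90 by construction
```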
In this release, Japanese vocabulary expansion was applied only to Swallow-MS 7B and not to Swallow-MX 8x7B.
With the vocabulary expansion, the number of Hiragana characters included in the vocabulary increased from 58 to 83, Katakana from 76 to 87, and Kanji from 1,456 to 3,208.
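A quick way to see the effect of the vocabulary expansion is to tokenize the same Japanese text with the original Mistral tokenizer and the Swallow-MS tokenizer and compare token counts; fewer tokens per character means more few-shot examples fit in a prompt and fewer decoding steps per generated sentence. The sketch below assumes the Hugging Face repositories mistralai/Mistral-7B-v0.1 and tokyotech-llm/Swallow-MS-7b-v0.1, and the sample sentence is arbitrary.

```python
# Compare how many tokens the original and vocabulary-expanded tokenizers need
# for the same Japanese text. Fewer tokens -> more context per prompt and
# fewer generation steps. Model IDs are the published Hugging Face repositories.
from transformers import AutoTokenizer

text = "気候変動対策として、再生可能エネルギーの導入が世界各国で進められています。"

for name in ["mistralai/Mistral-7B-v0.1", "tokyotech-llm/Swallow-MS-7b-v0.1"]:
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok(text, add_special_tokens=False)["input_ids"]
    print(f"{name}: {len(ids)} tokens for {len(text)} characters")
```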
The continued pre-training of Mistral and Mixtral was conducted using training software developed in-house.
Acknowledgements
The research and development of Swallow-MS and Swallow-MX were supported by several initiatives, including the “Large-Scale Language Model Development Support Program” of the AI Bridging Cloud Infrastructure (ABCI), which is built and operated by AIST; the project “Development of AI Application Technologies to Support Decision-Making in Design Risk Assessment Based on Expert Perspectives” under the NEDO program “Development of Core Integrated Technologies for Next-Generation Artificial Intelligence and Robots” (JPNP18002); and other supporting programs. Part of these results was also achieved through the “Large-Scale Foundation Model Development Support Program” of ABCI. This program was jointly proposed in September 2023 by the LLM-jp study group—organized by the National Institute of Informatics (NII), AIST, and Tokyo Institute of Technology, and involving research teams from institutions such as NII, Tohoku University, the University of Tokyo, and Waseda University—and was subsequently selected. It provided an opportunity to exclusively use a portion of ABCI’s high-performance computational resources (referred to as A-nodes) for up to 60 days. In addition, evaluation experiments of the trained large language models utilized datasets and insights developed within the LLM-jp study group.