Swallow (on Llama 2) is a series of large language models (LLMs) developed by a research team from the Okazaki and Yokota laboratories at Tokyo Institute of Technology and the National Institute of Advanced Industrial Science and Technology (AIST). To extend the Japanese capability of the Llama 2 7B, 13B, and 70B models, which perform strongly in English, we added Japanese characters and subwords to the models' vocabulary and conducted continual pre-training on a newly developed Japanese corpus. In an evaluation carried out in December 2023, Swallow LLMs achieved the highest performance on a Japanese benchmark among open LLMs. We released nine models in total (initially six): 7B, 13B, and 70B base models and two versions of instruction-tuned models ("initial" and v0.1). The models are available for download on Hugging Face.
- Swallow 7B base: https://huggingface.co/tokyotech-llm/Swallow-7b-hf
- Swallow 7B instruct (initial): https://huggingface.co/tokyotech-llm/Swallow-7b-instruct-hf
- Swallow 7B instruct v0.1: https://huggingface.co/tokyotech-llm/Swallow-7b-instruct-v0.1
- Swallow 13B base: https://huggingface.co/tokyotech-llm/Swallow-13b-hf
- Swallow 13B instruct (initial): https://huggingface.co/tokyotech-llm/Swallow-13b-instruct-hf
- Swallow 13B instruct v0.1: https://huggingface.co/tokyotech-llm/Swallow-13b-instruct-v0.1
- Swallow 70B base: https://huggingface.co/tokyotech-llm/Swallow-70b-hf
- Swallow 70B instruct (initial): https://huggingface.co/tokyotech-llm/Swallow-70b-instruct-hf
- Swallow 70B instruct v0.1: https://huggingface.co/tokyotech-llm/Swallow-70b-instruct-v0.1
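As a quick check that a downloaded model works, below is a minimal sketch of loading the Swallow 7B base model with Hugging Face Transformers and generating a short continuation. The generation parameters and the Japanese prompt are illustrative assumptions, not recommended settings.

```python
# Minimal sketch: load Swallow 7B (base) from Hugging Face and generate a continuation.
# Assumes `transformers`, `torch`, and `accelerate` are installed and enough memory is available.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tokyotech-llm/Swallow-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # reduce memory; use float32 on CPU if bf16 is unsupported
    device_map="auto",           # place weights on available devices
)

prompt = "東京工業大学の主なキャンパスは、"  # "The main campuses of Tokyo Institute of Technology are ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=128,  # illustrative values, not tuned recommendations
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```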
Swallow inherits the terms of the Llama 2 Community License. Swallow LLMs can be used for research and commercial purposes as long as users comply with this license.
We compared the Japanese performance of Swallow with other models obtained by continual pre-training of Llama 2 (for the 13B class, we compared against Japanese LLMs pre-trained from scratch, since no continually pre-trained 13B model had been released). The results show that continual pre-training steadily improved the Japanese proficiency of Swallow LLMs and that they outperform the models released by other organizations. The difference in evaluation scores with and without vocabulary expansion was small, while the benefit of vocabulary expansion (faster generation) is large. We therefore released the models with vocabulary expansion.
English performance was slightly lower than that of the original models, even though Swallow uses a mixture of Japanese and English texts for continual pre-training. We used a Japanese:English ratio of 9:1, a setting also adopted by other models. We plan to run additional experiments to clarify whether this mixture caused the decrease in English performance.
Note that benchmark scores cannot be compared across different datasets. For example, in the Japanese evaluation results of the 70B model, the math score is higher than the machine translation score, but this does not imply that Swallow is better at math than at translation (it would be like comparing exams with completely different difficulty and scoring criteria). For the same reason, one cannot conclude that Swallow is better at English just because its average score on the English tasks is higher than its average on the Japanese tasks. It is inappropriate to discuss task strengths and weaknesses based solely on the shape of a radar chart, since different benchmark datasets have different score scales and difficulty levels.
Finally, we present the results of a comparative experiment conducted by the research team on Japanese evaluation benchmarks. This experiment measured the performance of open models whose parameter weights are publicly available.
Model | Ja Avg | JComQA | JEMHopQA | NIILC | JSQuAD | XL-Sum | Ja-En | En-Ja | MGSM |
---|---|---|---|---|---|---|---|---|---|
Japanese Stable LM Beta 7B (base, vocab expansion) | 0.2937 | 0.2172 | 0.4482 | 0.4309 | 0.8202 | 0.0757 | 0.1453 | 0.1601 | 0.0520 |
CyberAgentLM2 (7B) | 0.3098 | 0.2198 | 0.5047 | 0.5066 | 0.7799 | 0.0233 | 0.1499 | 0.2345 | 0.0600 |
ELYZA-japanese-Llama-2-7b-fast | 0.3312 | 0.5308 | 0.4330 | 0.3898 | 0.8131 | 0.1289 | 0.1143 | 0.1678 | 0.0720 |
Japanese Stable LM Beta 7B (base, no vocab expansion) | 0.3366 | 0.3610 | 0.4478 | 0.4432 | 0.8318 | 0.2195 | 0.1226 | 0.1946 | 0.0720 |
ELYZA-japanese-Llama-2-7b (no vocab expansion) | 0.3467 | 0.5791 | 0.4703 | 0.4019 | 0.8226 | 0.1312 | 0.1289 | 0.1795 | 0.0600 |
Mistral v0.1 (7B, base) | 0.3717 | 0.7301 | 0.4245 | 0.2722 | 0.8563 | 0.2006 | 0.1733 | 0.1405 | 0.1760 |
Japanese Stable LM Gamma (7B, base, no vocab expansion) | 0.4301 | 0.7364 | 0.4643 | 0.5568 | 0.8910 | 0.2293 | 0.1561 | 0.2390 | 0.1680 |
Llama 2 7B (base) | 0.3201 | 0.3852 | 0.4240 | 0.3410 | 0.7917 | 0.1905 | 0.1737 | 0.1783 | 0.0760 |
Swallow 7B (base) | 0.3940 | 0.4808 | 0.5078 | 0.5968 | 0.8573 | 0.1830 | 0.1511 | 0.2510 | 0.1240 |
Stockmark-13b | 0.2495 | 0.2216 | 0.0917 | 0.5021 | 0.7303 | 0.0773 | 0.0811 | 0.2040 | 0.0880 |
LLM-jp-13B | 0.2889 | 0.2261 | 0.4790 | 0.3857 | 0.7744 | 0.1082 | 0.1185 | 0.1955 | 0.0240 |
PLaMo-13B | 0.2923 | 0.2270 | 0.5189 | 0.4137 | 0.7621 | 0.1025 | 0.1196 | 0.1582 | 0.0360 |
Llama 2 13B (base) | 0.3963 | 0.6997 | 0.4415 | 0.4170 | 0.8533 | 0.2139 | 0.1982 | 0.2146 | 0.1320 |
Swallow 13B (base) | 0.4625 | 0.7837 | 0.5063 | 0.6398 | 0.9005 | 0.2168 | 0.1771 | 0.2720 | 0.2040 |
Japanese Stable LM Beta 70B (base) | 0.5138 | 0.9115 | 0.4925 | 0.6042 | 0.9192 | 0.2573 | 0.2335 | 0.2765 | 0.4160 |
Llama 2 70B (base) | 0.4830 | 0.8686 | 0.4656 | 0.5256 | 0.9080 | 0.2361 | 0.2398 | 0.2643 | 0.3560 |
Swallow 70B (base) | 0.5528 | 0.9348 | 0.6290 | 0.6960 | 0.9176 | 0.2266 | 0.2298 | 0.3043 | 0.4840 |
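For reference, the Ja Avg column appears to be the unweighted mean of the eight task scores. The snippet below reproduces the Swallow 7B average from its row in the table; it is a sanity check on the table, not part of the evaluation code.

```python
# Reproduce the "Ja Avg" value for Swallow 7B (base) as the unweighted mean of its eight task scores.
scores = [0.4808, 0.5078, 0.5968, 0.8573, 0.1830, 0.1511, 0.2510, 0.1240]
ja_avg = sum(scores) / len(scores)
print(f"{ja_avg:.4f}")  # 0.3940, matching the value reported in the table
```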
Compared with other 7B models, Swallow 7B outperformed the models trained from scratch and the other Llama 2-based models. The only model that outperformed Swallow 7B was Japanese Stable LM Gamma 7B (continual pre-training from Mistral 7B v0.1), which can be attributed to the higher performance of its base model, Mistral 7B v0.1. In the future, we would like to work on continual pre-training based on Mistral 7B v0.1.
Compared with other 13B and 70B models, Swallow achieves the best performance (as of December 2023). The gap over models trained from scratch is particularly large at 13B. While these results suggest that continual pre-training from high-performing large language models is efficient and effective, we would also like to explore methods and findings for improving performance when training large language models from scratch.
Evaluation benchmark
We used llm-jp-eval (v1.0.0) and the JP Language Model Evaluation Harness (commit #9b42d41) as Japanese evaluation benchmarks. The tasks in the benchmarks are:
- JCommonsenseQA [Kurihara et al., 2022]
- JEMHopQA [Ishii et al., 2024]
- NIILC [Sekine, 2003]
- JSQuAD [Kurihara et al., 2022]
- XL-Sum [Hasan et al., 2021]
- WMT2020 ja-en [Barrault et al., 2020]
- WMT2020 en-ja [Barrault et al., 2020]
- MGSM [Shi et al., 2023]
Note that the natural language inference (NLI) task, which is often used as an evaluation benchmark for Japanese large language models, did not yield stable evaluation results (especially at 7B and 13B), because language models tend to guess the labels and score higher whenever those guesses happen to match the correct answers. For this reason, we excluded NLI from the evaluation benchmarks in this study.
We used the Language Model Evaluation Harness (v0.3.0) as an English evaluation benchmark. The tasks in the benchmark are:
- OpenBookQA [Mihaylov et al., 2018]
- TriviaQA [Joshi et al., 2017]
- SQuAD 2.0 [Rajpurkar et al., 2018]
- XWINO [Tikhonov & Ryabinin, 2021]
- HellaSwag [Zellers et al., 2019]
- GSM8K [Cobbe et al., 2021]
Continual pre-training for enhancing Japanese capability of Llama 2
The Meta Llama 2 series consists of open, high-performance large language models that are popular worldwide. Llama 2 can handle Japanese, since its training data contains multiple languages, including Japanese. However, English accounts for about 90% of Llama 2's pre-training data, while Japanese accounts for only about 0.10% of the total. Therefore, despite its high performance in English, Llama 2 is weak at reading and writing Japanese.
The research team therefore conducted continual pre-training of Llama 2's 7B, 13B, and 70B models on a 9:1 mixture of a large Japanese web corpus and an English corpus, aiming to improve Japanese proficiency while retaining the capabilities of the original models. As a result, all of the 7B, 13B, and 70B models performed better than their base models on the Japanese benchmark data we employed. They also outperformed Japanese large language models of the same size that were pre-trained only on Japanese corpora, demonstrating the effectiveness of continual pre-training.
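The 9:1 mixing itself can be implemented in many ways. Below is a hypothetical sketch (not the team's actual pipeline) that interleaves documents from a Japanese and an English corpus at roughly that ratio; `ja_docs` and `en_docs` are stand-ins for iterators over documents.

```python
# Hypothetical sketch of building a roughly 9:1 Japanese:English training mixture.
# Real pipelines typically mix by token count and shuffle at a much larger scale.
import random

def mix_corpora(ja_docs, en_docs, ja_weight=0.9, seed=42):
    rng = random.Random(seed)
    ja_iter, en_iter = iter(ja_docs), iter(en_docs)
    while True:
        source = ja_iter if rng.random() < ja_weight else en_iter
        try:
            yield next(source)
        except StopIteration:
            return  # stop when either corpus is exhausted (a simplification)

# Toy usage: the resulting mixture is about 90% Japanese documents.
ja = [f"ja_doc_{i}" for i in range(1000)]
en = [f"en_doc_{i}" for i in range(1000)]
mixture = list(mix_corpora(ja, en))
print(sum(d.startswith("ja") for d in mixture) / len(mixture))  # ≈ 0.9
```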
Vocabulary expansion for improving efficiency of training and inference
Llama 2 splits text into tokens using byte-pair encoding (BPE). However, because Llama 2 is trained as a multilingual model with an emphasis on English, important Japanese words and characters are missing from its vocabulary, and text is sometimes split into unnatural units. For example, the 7-character Japanese text 「吾輩は猫である」 ("I am a cat") is split into 13 tokens that are hard for humans to interpret: the kanji 吾, 輩, and 猫 are not in the vocabulary, so byte fallback represents each of them as three UTF-8 byte tokens such as `<0xE5><0x90><0xBE>`, while the remaining hiragana are tokenized character by character.
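This byte-fallback behavior can be observed directly with the original Llama 2 tokenizer. The sketch below assumes access to the gated `meta-llama/Llama-2-7b-hf` repository on Hugging Face; exact token counts may vary slightly by tokenizer version.

```python
# Show how the Llama 2 tokenizer falls back to UTF-8 bytes for kanji that are not in its vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # gated repo: license must be accepted

text = "吾輩は猫である"  # "I am a cat", 7 characters
tokens = tokenizer.tokenize(text)
print(len(tokens))  # around 13 tokens for 7 characters
print(tokens)       # kanji such as 吾 appear as byte tokens like '<0xE5>', '<0x90>', '<0xBE>'
```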
A language model that lacks a Japanese vocabulary is less efficient in training and generation because, in addition to handling Japanese in unnatural units, it needs more tokens to represent the same text. Since the computational budget required to train a large language model is proportional to the number of tokens, representing text with fewer tokens means that, under a fixed budget, more information can be packed into training. Likewise, since the time a large language model needs to generate text is proportional to the number of tokens, the same output can be produced faster if it can be expressed with fewer tokens. In addition, there is an upper limit on the number of tokens a model can handle at once for input and output; if the input is represented with fewer tokens, more task instructions and worked examples (few-shot examples) fit into the context, which can also improve performance on downstream tasks. By adding 16,000 Japanese tokens to Llama 2's tokenizer, the research team reduced the token length of Japanese text by 56.2%.
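The efficiency gain from vocabulary expansion can be checked in the same way by comparing token counts between the Llama 2 and Swallow tokenizers on a Japanese text of your choice. This is a rough sketch; the exact reduction depends on the text, and the sample sentence here is only an illustration.

```python
# Rough sketch: compare Japanese token counts of the original Llama 2 tokenizer and the
# expanded Swallow tokenizer. Assumes access to both repositories on Hugging Face.
from transformers import AutoTokenizer

llama2 = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
swallow = AutoTokenizer.from_pretrained("tokyotech-llm/Swallow-7b-hf")

text = "大規模言語モデルの学習には膨大な計算予算と大量の日本語テキストが必要です。"
n_llama2 = len(llama2.tokenize(text))
n_swallow = len(swallow.tokenize(text))
print(n_llama2, n_swallow, f"ratio={n_swallow / n_llama2:.2f}")  # Swallow should need noticeably fewer tokens
```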
Large Japanese Web Corpus
Training large language models requires a vast amount of linguistic data, and text extracted from web pages is key to building them. Traditionally, the Japanese portions of existing datasets such as CC-100, mC4, and OSCAR have been used to train open Japanese large language models. However, these datasets contain noise from converting web page HTML into text, and they lack the latest information and knowledge. In addition, since they were constructed as multilingual datasets, they include no special effort to improve data quality specifically for Japanese.
Therefore, the research team independently extracted and refined Japanese text from the archives distributed by Common Crawl (about 63.4 billion pages across 21 snapshots collected from 2020 to 2023) and constructed a Japanese web corpus consisting of approximately 312.1 billion characters (approximately 173 million pages). This scale surpasses CC-100 (approx. 25.8 billion characters), mC4 (approx. 239.7 billion characters), and OSCAR 23.01 (approx. 74 billion characters), making it the largest commercially usable training corpus for Japanese language models.
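As an illustration only (the team's actual extraction and refinement pipeline is described in the corpus paper listed in the references), a first-pass Japanese filter over extracted page text might look like the following character-ratio heuristic; the thresholds and the sample strings are hypothetical.

```python
# Hypothetical first-pass filter: keep pages whose text is predominantly Japanese.
# This illustrates the kind of heuristic used in web-corpus construction,
# not the research team's actual pipeline.
import re

JA_CHARS = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]")  # hiragana, katakana, common kanji

def looks_japanese(text: str, min_ratio: float = 0.3, min_length: int = 400) -> bool:
    if len(text) < min_length:      # drop very short pages
        return False
    ja_ratio = len(JA_CHARS.findall(text)) / len(text)
    return ja_ratio >= min_ratio    # require a minimum share of Japanese characters

print(looks_japanese("これは日本語のウェブページから抽出した本文テキストの例です。" * 20))  # True
print(looks_japanese("This page is entirely in English." * 20))                          # False
```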
References
- Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Hiroki Iida, Masanari Ohi, Kakeru Hattori, Hirai Shota, Sakae Mizuki, Rio Yokota, Naoaki Okazaki. 2024. Continual Pre-Training for Cross-Lingual LLM Adaptation: Enhancing Japanese Language Capabilities. arXiv:2404.17790.
- Naoaki Okazaki, Kakeru Hattori, Hirai Shota, Hiroki Iida, Masanari Ohi, Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Rio Yokota, Sakae Mizuki. 2024. Building a Large Japanese Web Corpus for Large Language Models. arXiv:2404.17733.
Research and development of Swallow was supported by the project JPNP18002 commissioned by the New Energy and Industrial Technology Development Organization (NEDO). The continual pre-training experiments were supported by the "Support Program for Building Large Language Models" of the AI Bridging Cloud Infrastructure (ABCI), developed and operated by the National Institute of Advanced Industrial Science and Technology (AIST). We used datasets and findings released by the Japanese LLM Study Group (LLM-jp) in the evaluation experiments. We also received suggestions on tokenization from Tatsuya Hiraoka of Fujitsu Ltd.