The bar graph visualizes the average scores on the Japanese tasks, Japanese MT-Bench, and English tasks for the LLMs selected in the table below. The buttons in the upper left corner switch the graph between horizontal and vertical orientation (vertical is recommended for smartphones), and the button in the upper right corner changes the order of the LLMs. The 🔗 icon in the upper left corner of the site copies a permalink for the selected models. Note that some models have not been evaluated on every task, so comparing models by their average scores or sort order can be misleading. For example, GPT-3.5 and GPT-4 are presumed to perform well on the Japanese and English tasks, but because they were not evaluated on them, their average scores for these tasks are treated as 0 and they are sorted to the end.
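The effect of treating an unevaluated category as 0 can be illustrated with a minimal Python sketch. This is not the site's actual implementation: the function `category_average`, the model names, and the scores are all hypothetical, and averaging only the evaluated tasks when a category is partially evaluated is an assumption made here for illustration.

```python
from typing import Optional


def category_average(scores: list[Optional[float]]) -> float:
    """Average the evaluated task scores in one category.

    None marks a task that was not evaluated. A category with no
    evaluated task is treated as 0.0, as described in the text above.
    Averaging only the evaluated tasks in a partially evaluated
    category is an assumption of this sketch.
    """
    evaluated = [s for s in scores if s is not None]
    if not evaluated:
        return 0.0
    return sum(evaluated) / len(evaluated)


# Hypothetical models with made-up scores, for illustration only.
models: dict[str, list[Optional[float]]] = {
    "model-a": [0.5, 0.75],
    "model-b": [None, None],  # not evaluated on this category
    "model-c": [0.75, 1.0],
}

# Sorting by average in descending order pushes the unevaluated
# model to the end, regardless of how well it would actually perform.
ranked = sorted(models, key=lambda name: category_average(models[name]), reverse=True)
print(ranked)
```

In this sketch `model-b` ends up last only because it has no scores, which is why the sort order alone should not be read as a statement about model quality.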

Models

Model: Name, SortKey, Type, Size (B)
Average: Ja, Ja (MTB), En
Japanese: JCom, JEMHopQA, NIILC, JSQuAD, XL-Sum, MGSM, WMT20 (en-ja), WMT20 (ja-en), JMMLU, JHumanEval
Japanese MT-Bench: Coding, Extraction, Humanities, Math, Reasoning, Roleplay, Stem, Writing
English: OpenBookQA, TriviaQA, HellaSwag, SQuAD2, XWINO, MMLU, GSM8K, BBH, HumanEval