Scores for all tasks in the Japanese, Japanese MT-Bench, and English benchmarks for the LLMs selected in the table below are visualized in radar charts. In addition, average scores are visualized in a bar chart. You can copy a permalink to the selected models via the 🔗 icon in the upper-left corner of the site. Note that it may be inappropriate to judge the superiority of some models by their average scores or sort order, since some tasks were not evaluated for every model. For example, GPT-3.5 and GPT-4 are presumed to perform well on the Japanese and English tasks, but because they were not evaluated on them, their average scores for those tasks are treated as 0 and they sort to the end of the list.
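The averaging caveat above can be made concrete with a minimal sketch. The function and scores below are hypothetical illustrations of the described behavior (an unevaluated task counted as 0 drags the average down and pushes the model down the sort order), not the leaderboard's actual code.

```python
def average_score(scores: dict, tasks: list) -> float:
    """Average over all tasks, treating an unevaluated task (None) as 0."""
    return sum(scores.get(task) or 0.0 for task in tasks) / len(tasks)

tasks = ["Ja", "Ja (MTB)", "En"]

models = {
    # Hypothetical scores for illustration only.
    "model-a": {"Ja": 0.62, "Ja (MTB)": 0.55, "En": 0.58},
    "gpt-4":   {"Ja": 0.80, "Ja (MTB)": 0.85, "En": None},  # not evaluated -> counted as 0
}

ranked = sorted(models, key=lambda m: average_score(models[m], tasks), reverse=True)
for name in ranked:
    print(name, round(average_score(models[name], tasks), 3))
# model-a 0.583
# gpt-4 0.55  <- the missing "En" score lowers the average and the rank,
#                even though gpt-4 scores higher on every evaluated task
```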

Models

The table's columns are grouped as follows:

Model: Name, SortKey, Type, Size (B)
Average: Ja, Ja (MTB), En
Japanese: JCom, JEMHopQA, NIILC, JSQuAD, XL-Sum, MGSM, WMT20 (en-ja), WMT20 (ja-en), JMMLU, JHumanEval
Japanese MT-Bench: Coding, Extraction, Humanities, Math, Reasoning, Roleplay, Stem, Writing
English: OpenBookQA, TriviaQA, HellaSwag, SQuAD2, XWINO, MMLU, GSM8K, BBH, HumanEval