Japanese LLM Evaluation

The bar graph shows the average scores on the Japanese tasks, Japanese MT-Bench, and English tasks for the LLMs selected in the table below. Use the buttons in the upper left corner to switch the graph between horizontal and vertical orientation (vertical is recommended for smartphones), and the button in the upper right corner to change the order of the LLMs. The 🔗 icon in the upper left corner of the site copies a permalink corresponding to the selected models. Note that average scores and sort order are not a sound basis for ranking models that have not been evaluated on every task. For example, GPT-3.5 and GPT-4 are presumed to perform well on the Japanese and English tasks, but because they were not evaluated on them, their average scores for these tasks are treated as 0 and they are sorted to the end.
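The effect of treating an unevaluated category as 0 can be illustrated with a minimal sketch. The model names, scores, and helper functions below are hypothetical and are not taken from the site's implementation; the sketch only assumes that a missing category contributes an average of 0 and that models are sorted by that average in descending order.

```python
from typing import Optional


def category_average(scores: Optional[list[float]]) -> float:
    """Return the mean of the scores, or 0.0 if the category was not evaluated."""
    if not scores:
        return 0.0
    return sum(scores) / len(scores)


def sort_models(models: dict[str, Optional[list[float]]]) -> list[tuple[str, float]]:
    """Sort models by category average, highest first."""
    averages = {name: category_average(scores) for name, scores in models.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores for a single category (for example, the Japanese tasks).
models = {
    "unevaluated-model": None,   # no evaluation was run for this category
    "model-a": [0.5, 0.75],
    "model-b": [0.25, 0.5],
}

ranking = sort_models(models)
for name, average in ranking:
    print(f"{name}: {average:.3f}")
```

In this sketch the unevaluated model lands at the end of the ranking regardless of how well it might actually perform, which is why its position should not be read as a statement about its quality.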

Models

Model					Average			Japanese										Japanese MT-Bench								English
	Name	SortKey	Type	Size (B)	Ja	Ja (MTB)	En	JCom	JEMHopQA	NIILC	JSQuAD	XL-Sum	MGSM	WMT20 (en-ja)	WMT20 (ja-en)	JMMLU	JHumanEval	Coding	Extraction	Humanities	Math	Reasoning	Roleplay	Stem	Writing	OpenBookQA	TriviaQA	HellaSwag	SQuAD2	XWINO	MMLU	GSM8K	BBH	HumanEval

Usage and Notes

Models