Japanese LLM Evaluation

Radar charts show the scores of the LLMs selected in the table below on every task in the Japanese, Japanese MT-Bench, and English benchmarks. In addition, a bar chart shows their average scores. You can copy the permalink for the selected model from the 🔗 icon in the upper left corner of the site. Note that, because some tasks have not been evaluated for some models, it may be inappropriate to judge one model superior to another based on average scores or sort order. For example, GPT-3.5 and GPT-4 are presumed to perform well on the Japanese and English tasks, but because they were not evaluated on these tasks, their average scores for them are treated as 0 and they are placed at the end of the sort order.
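The effect of unevaluated tasks on averages and sort order can be illustrated with a minimal sketch. It is not the site's actual implementation: the function name `category_average`, the model names, and the score values are all hypothetical, and the handling of partially evaluated categories (averaging only the evaluated tasks) is an assumption.

```python
from statistics import mean
from typing import Optional


def category_average(scores: list[Optional[float]]) -> float:
    """Average the evaluated scores in one category.

    Assumption: a category in which no task was evaluated is
    reported as 0.0, mirroring the behaviour described above.
    """
    evaluated = [s for s in scores if s is not None]
    return mean(evaluated) if evaluated else 0.0


# Hypothetical scores; None marks a task that was not evaluated.
models = {
    "model-a": [0.5, 0.7],
    "model-b": [None, None],  # not evaluated on this category
}

# Sorting by average in descending order pushes unevaluated models to the end,
# regardless of how well they would actually perform.
ranked = sorted(models, key=lambda name: category_average(models[name]), reverse=True)
print(ranked)
```

Here `model-b` ends up last only because it has no scores, which is why averages and sort order alone should not be used to rank such models.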

Models

Model					Average			Japanese										Japanese MT-Bench								English
	Name	SortKey	Type	Size (B)	Ja	Ja (MTB)	En	JCom	JEMHopQA	NIILC	JSQuAD	XL-Sum	MGSM	WMT20 (en-ja)	WMT20 (ja-en)	JMMLU	JHumanEval	Coding	Extraction	Humanities	Math	Reasoning	Roleplay	Stem	Writing	OpenBookQA	TriviaQA	HellaSwag	SQuAD2	XWINO	MMLU	GSM8K	BBH	HumanEval

Usage and Notes

Models