Model scores are visualized as scatter plots in which dot size corresponds to model size. Any Japanese, Japanese MT-Bench, or English task can be assigned to the horizontal and vertical axes, and the page provides two scatter plots so that multiple task pairs can be compared at once. Select the models you wish to visualize from the table below; a permalink to the current selection can be copied via the 🔗 icon in the upper-left corner of the site.
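As a sketch of the encoding these plots use, the snippet below draws one such scatter plot with matplotlib. The DataFrame, its column names (`name`, `size_b`, `JCom`, `MGSM`), and the size-scaling factor are illustrative assumptions, not the site's actual implementation.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical per-model scores; the real leaderboard table has many more columns.
scores = pd.DataFrame({
    "name": ["model-a", "model-b", "model-c"],
    "size_b": [7, 13, 70],          # model size in billions of parameters
    "JCom": [0.62, 0.71, 0.88],     # task chosen for the horizontal axis
    "MGSM": [0.18, 0.35, 0.57],     # task chosen for the vertical axis
})

fig, ax = plt.subplots()
# Dot area scales with model size, mirroring the leaderboard's size encoding.
ax.scatter(scores["JCom"], scores["MGSM"], s=scores["size_b"] * 10, alpha=0.6)
for _, row in scores.iterrows():
    ax.annotate(row["name"], (row["JCom"], row["MGSM"]))
ax.set_xlabel("JCom")
ax.set_ylabel("MGSM")
plt.show()
```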
Models
The table has one row per model, with columns grouped as follows:

- Model: Name, SortKey, Type, Size (B)
- Average: Ja, Ja (MTB), En
- Japanese: JCom, JEMHopQA, NIILC, JSQuAD, XL-Sum, MGSM, WMT20 (en-ja), WMT20 (ja-en), JMMLU, JHumanEval
- Japanese MT-Bench: Coding, Extraction, Humanities, Math, Reasoning, Roleplay, Stem, Writing
- English: OpenBookQA, TriviaQA, HellaSwag, SQuAD2, XWINO, MMLU, GSM8K, BBH, HumanEval
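For illustration, the sketch below derives the three Average columns from the per-task scores. It assumes, and this is only an assumption about the leaderboard's methodology, that Ja, Ja (MTB), and En are unweighted means of their respective column groups; `add_averages` is a hypothetical helper.

```python
import pandas as pd

# Column groups as listed in the table above.
JA_TASKS = ["JCom", "JEMHopQA", "NIILC", "JSQuAD", "XL-Sum", "MGSM",
            "WMT20 (en-ja)", "WMT20 (ja-en)", "JMMLU", "JHumanEval"]
MTB_TASKS = ["Coding", "Extraction", "Humanities", "Math",
             "Reasoning", "Roleplay", "Stem", "Writing"]
EN_TASKS = ["OpenBookQA", "TriviaQA", "HellaSwag", "SQuAD2",
            "XWINO", "MMLU", "GSM8K", "BBH", "HumanEval"]

def add_averages(df: pd.DataFrame) -> pd.DataFrame:
    """Append Ja / Ja (MTB) / En columns as unweighted means of each group.

    Assumes one row per model and one column per task score in `df`.
    """
    df = df.copy()
    df["Ja"] = df[JA_TASKS].mean(axis=1)
    df["Ja (MTB)"] = df[MTB_TASKS].mean(axis=1)
    df["En"] = df[EN_TASKS].mean(axis=1)
    return df
```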