Model scores are visualized as scatter plots in which dot size corresponds to model size. Any Japanese, Japanese MT-Bench, or English task can be assigned to the horizontal and vertical axes, and the page provides two scatter plots so that multiple task pairs can be compared at once. Select the models you wish to visualize from the table below; a permalink to the current selection can be copied via the 🔗 icon in the upper-left corner of the site.
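As a sketch of the encoding these plots use, the snippet below draws one such scatter plot with matplotlib. The DataFrame, its column names (`name`, `size_b`, `JCom`, `MGSM`), and the size-scaling factor are illustrative assumptions, not the site's actual implementation.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical per-model scores; the real leaderboard table has many more columns.
scores = pd.DataFrame({
    "name": ["model-a", "model-b", "model-c"],
    "size_b": [7, 13, 70],          # model size in billions of parameters
    "JCom": [0.62, 0.71, 0.88],     # task chosen for the horizontal axis
    "MGSM": [0.18, 0.35, 0.57],     # task chosen for the vertical axis
})

fig, ax = plt.subplots()
# Dot area scales with model size, mirroring the leaderboard's size encoding.
ax.scatter(scores["JCom"], scores["MGSM"], s=scores["size_b"] * 10, alpha=0.6)
for _, row in scores.iterrows():
    ax.annotate(row["name"], (row["JCom"], row["MGSM"]))
ax.set_xlabel("JCom")
ax.set_ylabel("MGSM")
plt.show()
```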
Models
The table has one row per model, with columns grouped as follows:

- Model: Name, SortKey, Type, Size (B)
- Average: Ja, Ja (MTB), En
- Japanese: JCom, JEMHopQA, NIILC, JSQuAD, XL-Sum, MGSM, WMT20 (en-ja), WMT20 (ja-en), JMMLU, JHumanEval
- Japanese MT-Bench: Coding, Extraction, Humanities, Math, Reasoning, Roleplay, Stem, Writing
- English: OpenBookQA, TriviaQA, HellaSwag, SQuAD2, XWINO, MMLU, GSM8K, BBH, HumanEval
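For illustration, the sketch below derives the three Average columns from the per-task scores. It assumes, and this is only an assumption about the leaderboard's methodology, that Ja, Ja (MTB), and En are unweighted means of their respective column groups; `add_averages` is a hypothetical helper.

```python
import pandas as pd

# Column groups as listed in the table above.
JA_TASKS = ["JCom", "JEMHopQA", "NIILC", "JSQuAD", "XL-Sum", "MGSM",
            "WMT20 (en-ja)", "WMT20 (ja-en)", "JMMLU", "JHumanEval"]
MTB_TASKS = ["Coding", "Extraction", "Humanities", "Math",
             "Reasoning", "Roleplay", "Stem", "Writing"]
EN_TASKS = ["OpenBookQA", "TriviaQA", "HellaSwag", "SQuAD2",
            "XWINO", "MMLU", "GSM8K", "BBH", "HumanEval"]

def add_averages(df: pd.DataFrame) -> pd.DataFrame:
    """Append Ja / Ja (MTB) / En columns as unweighted means of each group.

    Assumes one row per model and one column per task score in `df`.
    """
    df = df.copy()
    df["Ja"] = df[JA_TASKS].mean(axis=1)
    df["Ja (MTB)"] = df[MTB_TASKS].mean(axis=1)
    df["En"] = df[EN_TASKS].mean(axis=1)
    return df
```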