| Model name | Ja avg | JComQA | JEMHQA | NIILC | JSQuAD | XL-Sum | MGSM | En-Ja | Ja-En | JMMLU | JHumanEval |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Aya Expanse 32B | 0.512 | 0.965 | 0.554 | 0.586 | 0.812 | 0.295 | 0.716 | 0.287 | 0.245 | 0.655 | 0.001 |
| CyberAgentLM3-22B-chat | 0.471 | 0.934 | 0.510 | 0.648 | 0.911 | 0.104 | 0.576 | 0.275 | 0.215 | 0.541 | 0.001 |
| Gemma 2 27B IT | 0.567 | 0.956 | 0.541 | 0.576 | 0.883 | 0.166 | 0.704 | 0.290 | 0.249 | 0.670 | 0.638 |
| GPT-3.5 (gpt-3.5-turbo-0125) | 0.515 | 0.922 | 0.456 | 0.447 | 0.893 | 0.215 | 0.572 | 0.287 | 0.243 | 0.499 | 0.616 |
| GPT-4-turbo (gpt-4-turbo-2024-04-09) | 0.626 | 0.971 | 0.690 | 0.615 | 0.878 | 0.201 | 0.848 | 0.295 | 0.239 | 0.753 | 0.773 |
| GPT-4o (gpt-4o-2024-05-13) | 0.649 | 0.979 | 0.737 | 0.722 | 0.892 | 0.140 | 0.860 | 0.314 | 0.237 | 0.794 | 0.813 |
| GPT-4o (gpt-4o-2024-08-06) | 0.646 | 0.982 | 0.731 | 0.709 | 0.889 | 0.170 | 0.864 | 0.314 | 0.254 | 0.797 | 0.752 |
| GPT-4o-mini (gpt-4o-mini-2024-07-18) | 0.581 | 0.961 | 0.464 | 0.591 | 0.902 | 0.160 | 0.832 | 0.299 | 0.241 | 0.679 | 0.675 |
| Llama 3 70B Instruct | 0.578 | 0.940 | 0.615 | 0.557 | 0.913 | 0.191 | 0.716 | 0.270 | 0.234 | 0.680 | 0.662 |
| Llama 3 heron brain 70B v0.3 | 0.615 | 0.965 | 0.652 | 0.679 | 0.922 | 0.261 | 0.772 | 0.309 | 0.258 | 0.707 | 0.623 |
| Llama 3 Swallow 70B Instruct | 0.571 | 0.963 | 0.627 | 0.598 | 0.921 | 0.139 | 0.672 | 0.272 | 0.255 | 0.657 | 0.608 |
| Llama 3 Youko 70B Instruct | 0.582 | 0.952 | 0.625 | 0.584 | 0.921 | 0.198 | 0.720 | 0.263 | 0.226 | 0.718 | 0.610 |
| Llama 3.1 70B Instruct | 0.595 | 0.950 | 0.635 | 0.579 | 0.921 | 0.178 | 0.732 | 0.279 | 0.247 | 0.733 | 0.696 |
| Llama-3.1-70B-Japanese-Instruct-2407 | 0.597 | 0.956 | 0.647 | 0.660 | 0.919 | 0.156 | 0.748 | 0.290 | 0.241 | 0.723 | 0.627 |
| Llama 3.1 Swallow 70B Instruct v0.1 | 0.588 | 0.962 | 0.621 | 0.660 | 0.924 | 0.192 | 0.776 | 0.312 | 0.259 | 0.711 | 0.468 |
| Llama 3.1 Swallow 70B Instruct v0.3 | 0.598 | 0.964 | 0.632 | 0.654 | 0.910 | 0.196 | 0.772 | 0.305 | 0.257 | 0.690 | 0.596 |
| Llama 3.3 70B Instruct | 0.601 | 0.941 | 0.640 | 0.570 | 0.893 | 0.179 | 0.784 | 0.278 | 0.243 | 0.735 | 0.744 |
| Llama 3.3 Swallow 70B Instruct v0.4 | 0.613 | 0.981 | 0.618 | 0.662 | 0.907 | 0.162 | 0.812 | 0.319 | 0.261 | 0.707 | 0.700 |
| Qwen2-72B-Instruct | 0.598 | 0.963 | 0.628 | 0.557 | 0.920 | 0.166 | 0.780 | 0.260 | 0.232 | 0.771 | 0.701 |
| Qwen2.5-32B-Instruct | 0.571 | 0.959 | 0.567 | 0.497 | 0.903 | 0.169 | 0.780 | 0.228 | 0.195 | 0.757 | 0.651 |
| Qwen2.5-72B-Instruct | 0.574 | 0.970 | 0.569 | 0.582 | 0.738 | 0.170 | 0.840 | 0.227 | 0.218 | 0.789 | 0.634 |
| Swallow-70b-instruct-v0.1 | 0.492 | 0.923 | 0.566 | 0.565 | 0.903 | 0.186 | 0.420 | 0.263 | 0.232 | 0.571 | 0.293 |
| Tanuki-8x8B-dpo-v1.0 | 0.454 | 0.708 | 0.551 | 0.612 | 0.867 | 0.142 | 0.456 | 0.269 | 0.208 | 0.439 | 0.284 |
| Model name | En avg | OpenBookQA | TriviaQA | HellaSwag | SQuAD2 | XWINO | MMLU | GSM8K | MATH | BBH | HumanEval |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Aya Expanse 32B | 0.614 | 0.420 | 0.757 | 0.668 | 0.679 | 0.912 | 0.744 | 0.858 | 0.344 | 0.757 | 0.005 |
| CyberAgentLM3-22B-chat | 0.527 | 0.372 | 0.619 | 0.598 | 0.603 | 0.905 | 0.603 | 0.698 | 0.274 | 0.599 | 0.000 |
| Gemma 2 27B IT | 0.703 | 0.458 | 0.766 | 0.655 | 0.669 | 0.909 | 0.762 | 0.851 | 0.466 | 0.790 | 0.707 |
| GPT-3.5 (gpt-3.5-turbo-0125) | | | | | | | | | | | |
| GPT-4-turbo (gpt-4-turbo-2024-04-09) | | | | | | | | | | | |
| GPT-4o (gpt-4o-2024-05-13) | | | | | | | | | | | |
| GPT-4o (gpt-4o-2024-08-06) | | | | | | | | | | | |
| GPT-4o-mini (gpt-4o-mini-2024-07-18) | | | | | | | | | | | |
| Llama 3 70B Instruct | 0.729 | 0.438 | 0.800 | 0.655 | 0.696 | 0.914 | 0.800 | 0.909 | 0.474 | 0.833 | 0.774 |
| Llama 3 heron brain 70B v0.3 | 0.715 | 0.446 | 0.811 | 0.668 | 0.706 | 0.919 | 0.790 | 0.877 | 0.508 | 0.759 | 0.668 |
| Llama 3 Swallow 70B Instruct | 0.716 | 0.446 | 0.818 | 0.676 | 0.681 | 0.923 | 0.789 | 0.868 | 0.460 | 0.816 | 0.680 |
| Llama 3 Youko 70B Instruct | 0.709 | 0.454 | 0.797 | 0.686 | 0.659 | 0.915 | 0.805 | 0.892 | 0.434 | 0.780 | 0.662 |
| Llama 3.1 70B Instruct | 0.738 | 0.426 | 0.821 | 0.662 | 0.660 | 0.917 | 0.822 | 0.876 | 0.560 | 0.842 | 0.794 |
| Llama-3.1-70B-Japanese-Instruct-2407 | 0.725 | 0.422 | 0.810 | 0.647 | 0.663 | 0.917 | 0.807 | 0.889 | 0.528 | 0.823 | 0.746 |
| Llama 3.1 Swallow 70B Instruct v0.1 | 0.710 | 0.446 | 0.815 | 0.683 | 0.681 | 0.917 | 0.787 | 0.884 | 0.474 | 0.848 | 0.568 |
| Llama 3.1 Swallow 70B Instruct v0.3 | 0.710 | 0.454 | 0.825 | 0.692 | 0.647 | 0.919 | 0.777 | 0.872 | 0.458 | 0.816 | 0.643 |
| Llama 3.3 70B Instruct | 0.762 | 0.426 | 0.817 | 0.667 | 0.684 | 0.917 | 0.824 | 0.890 | 0.706 | 0.853 | 0.834 |
| Llama 3.3 Swallow 70B Instruct v0.4 | 0.736 | 0.448 | 0.817 | 0.686 | 0.654 | 0.912 | 0.803 | 0.907 | 0.566 | 0.812 | 0.750 |
| Qwen2-72B-Instruct | 0.669 | 0.444 | 0.759 | 0.685 | 0.685 | 0.911 | 0.840 | 0.848 | 0.634 | 0.193 | 0.688 |
| Qwen2.5-32B-Instruct | 0.588 | 0.424 | 0.534 | 0.671 | 0.536 | 0.893 | 0.834 | 0.581 | 0.802 | 0.017 | 0.589 |
| Qwen2.5-72B-Instruct | 0.691 | 0.454 | 0.676 | 0.706 | 0.677 | 0.889 | 0.848 | 0.904 | 0.770 | 0.375 | 0.614 |
| Swallow-70b-instruct-v0.1 | 0.556 | 0.446 | 0.742 | 0.656 | 0.571 | 0.917 | 0.668 | 0.509 | 0.108 | 0.664 | 0.281 |
| Tanuki-8x8B-dpo-v1.0 | 0.464 | 0.348 | 0.481 | 0.555 | 0.521 | 0.850 | 0.493 | 0.544 | 0.236 | 0.419 | 0.193 |
| Model name | JMT avg | Coding | Extraction | Humanities | Math | Reasoning | Roleplay | STEM | Writing |
|---|---|---|---|---|---|---|---|---|---|
| Aya Expanse 32B | 0.713 | 0.548 | 0.720 | 0.846 | 0.657 | 0.602 | 0.824 | 0.712 | 0.794 |
| CyberAgentLM3-22B-chat | 0.691 | 0.519 | 0.744 | 0.859 | 0.605 | 0.548 | 0.784 | 0.700 | 0.772 |
| Gemma 2 27B IT | 0.768 | 0.727 | 0.809 | 0.874 | 0.719 | 0.639 | 0.810 | 0.740 | 0.826 |
| GPT-3.5 (gpt-3.5-turbo-0125) | 0.691 | 0.693 | 0.789 | 0.773 | 0.665 | 0.462 | 0.728 | 0.644 | 0.775 |
| GPT-4-turbo (gpt-4-turbo-2024-04-09) | 0.837 | 0.842 | 0.891 | 0.863 | 0.865 | 0.673 | 0.861 | 0.844 | 0.854 |
| GPT-4o (gpt-4o-2024-05-13) | 0.848 | 0.859 | 0.930 | 0.882 | 0.917 | 0.631 | 0.858 | 0.858 | 0.851 |
| GPT-4o (gpt-4o-2024-08-06) | 0.848 | 0.855 | 0.926 | 0.880 | 0.872 | 0.706 | 0.862 | 0.838 | 0.849 |
| GPT-4o-mini (gpt-4o-mini-2024-07-18) | 0.824 | 0.825 | 0.865 | 0.857 | 0.843 | 0.665 | 0.846 | 0.855 | 0.840 |
| Llama 3 70B Instruct | 0.640 | 0.588 | 0.884 | 0.715 | 0.637 | 0.487 | 0.594 | 0.598 | 0.619 |
| Llama 3 heron brain 70B v0.3 | 0.683 | 0.510 | 0.870 | 0.776 | 0.680 | 0.513 | 0.727 | 0.692 | 0.693 |
| Llama 3 Swallow 70B Instruct | 0.618 | 0.633 | 0.823 | 0.601 | 0.521 | 0.482 | 0.622 | 0.635 | 0.630 |
| Llama 3 Youko 70B Instruct | 0.750 | 0.607 | 0.894 | 0.834 | 0.609 | 0.673 | 0.790 | 0.764 | 0.829 |
| Llama 3.1 70B Instruct | 0.706 | 0.691 | 0.848 | 0.730 | 0.669 | 0.618 | 0.699 | 0.699 | 0.694 |
| Llama-3.1-70B-Japanese-Instruct-2407 | 0.751 | 0.683 | 0.827 | 0.824 | 0.749 | 0.643 | 0.818 | 0.715 | 0.751 |
| Llama 3.1 Swallow 70B Instruct v0.1 | 0.691 | 0.654 | 0.792 | 0.768 | 0.704 | 0.573 | 0.682 | 0.653 | 0.704 |
| Llama 3.1 Swallow 70B Instruct v0.3 | 0.769 | 0.678 | 0.820 | 0.867 | 0.776 | 0.570 | 0.816 | 0.769 | 0.852 |
| Llama 3.3 70B Instruct | 0.737 | 0.707 | 0.865 | 0.757 | 0.720 | 0.635 | 0.773 | 0.706 | 0.733 |
| Llama 3.3 Swallow 70B Instruct v0.4 | 0.772 | 0.705 | 0.820 | 0.870 | 0.730 | 0.623 | 0.811 | 0.781 | 0.832 |
| Qwen2-72B-Instruct | 0.756 | 0.632 | 0.800 | 0.842 | 0.688 | 0.616 | 0.824 | 0.797 | 0.846 |
| Qwen2.5-32B-Instruct | 0.809 | 0.724 | 0.885 | 0.816 | 0.918 | 0.726 | 0.834 | 0.763 | 0.808 |
| Qwen2.5-72B-Instruct | 0.835 | 0.795 | 0.860 | 0.865 | 0.857 | 0.784 | 0.863 | 0.804 | 0.854 |
| Swallow-70b-instruct-v0.1 | 0.509 | 0.381 | 0.604 | 0.568 | 0.464 | 0.402 | 0.583 | 0.557 | 0.510 |
| Tanuki-8x8B-dpo-v1.0 | 0.546 | 0.513 | 0.489 | 0.624 | 0.557 | 0.445 | 0.604 | 0.547 | 0.594 |
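
The "Ja avg", "En avg", and "JMT avg" columns are consistent with an unweighted arithmetic mean of the per-task scores in each table. A minimal sketch (not the evaluation harness itself) that reproduces one reported average from the Japanese table:

```python
# Sketch only: reproduce the "Ja avg" value for the Aya Expanse 32B row
# by averaging its ten per-task scores from the Japanese tasks table.
ja_scores = [0.965, 0.554, 0.586, 0.812, 0.295,
             0.716, 0.287, 0.245, 0.655, 0.001]

ja_avg = sum(ja_scores) / len(ja_scores)
print(round(ja_avg, 3))  # 0.512, matching the reported "Ja avg"
```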