Model  Japanese avg.  English avg.  Japanese MT-Bench avg.  English MT-Bench avg.
ABEJA-QwQ32b-Reasoning-Japanese-v1.0 0.526 0.694 0.727 0.785
DeepSeek-R1-Distill-Qwen-32B-Japanese 0.449 0.653 0.642 0.752
ELYZA-Shortcut-1.0-Qwen-32B 0.443 0.520 0.675 0.755
ELYZA-Thinking-1.0-Qwen-32B 0.444 0.563 0.664 0.722
Gemma 3 1B IT 0.147 0.195 0.352 0.516
Gemma 3 4B IT 0.292 0.386 0.572 0.679
Gemma 3 12B IT 0.405 0.493 0.659 0.760
Gemma 3 27B IT 0.437 0.528 0.691 0.771
Gemma 4 31B IT 0.675 0.874 0.815 0.839
Gemma 4 26B-A4B IT 0.646 0.841 0.806 0.832
Gemma 4 E2B IT 0.431 0.538 0.726 0.776
Gemma 4 E4B IT 0.484 0.636 0.758 0.801
GPT-4.1 (gpt-4.1-2025-04-14) 0.571 0.645 0.771 0.830
GPT-4o (gpt-4o-2024-08-06) 0.508 0.525 0.714 0.792
GPT-5 (gpt-5-2025-08-07) 0.685 0.854 0.842 0.862
GPT-5 mini (gpt-5-mini-2025-08-07) 0.631 0.810 0.830 0.869
GPT-5.4 Thinking (gpt-5.4-2026-03-05) 0.717 0.864 0.844 0.859
gpt-oss-20b 0.521 0.728 0.716 0.782
gpt-oss-120b 0.566 0.768 0.757 0.795
GPT-OSS-Swallow-20B-RL-v0.1 0.568 0.752 0.726 0.769
GPT-OSS-Swallow-120B-RL-v0.1 0.600 0.782 0.772 0.803
GPT-OSS-Swallow-20B-SFT-v0.1 0.521 0.705 0.731 0.751
GPT-OSS-Swallow-120B-SFT-v0.1 0.568 0.749 0.758 0.783
Llama 3.1 8B Instruct 0.279 0.377 0.446 0.600
Llama-3.1-Nemotron-Nano-8B-v1 0.317 0.543 0.271 0.569
Llama 3.1 Swallow 8B Instruct v0.5 0.341 0.302 0.565 0.617
Llama 3.3 70B Instruct 0.440 0.528 0.585 0.718
Llama 3.3 Swallow 70B Instruct v0.4 0.420 0.441 0.634 0.714
Llama 4 Scout Instruct 0.482 0.563 0.629 0.751
llm-jp-3.1-1.8b-instruct4 0.199 0.175 0.475 0.436
llm-jp-3.1-13b-instruct4 0.258 0.232 0.549 0.542
llm-jp-4-32b-a3b-thinking 0.471 0.578 0.726 0.759
llm-jp-4-8b-thinking 0.448 0.532 0.706 0.743
NVIDIA-Nemotron-3-Super-120B-A12B 0.611 0.825 0.736 0.806
NVIDIA-Nemotron-Nano-9B-v2-Japanese 0.503 0.640 0.689 0.759
o3 (o3-2025-04-16) 0.660 0.831 0.818 0.849
o3-mini (o3-mini-2025-01-31) 0.579 0.752 0.782 0.822
Olmo 3 7B Think 0.374 0.606 0.364 0.505
Olmo 3 32B Think 0.447 0.679 0.475 0.562
Qwen3-0.6B 0.216 0.327 0.336 0.468
Qwen3-1.7B 0.364 0.492 0.478 0.629
Qwen3-4B 0.448 0.639 0.628 0.737
Qwen3-8B 0.491 0.662 0.689 0.767
Qwen3-14B 0.524 0.696 0.712 0.784
Qwen3-32B 0.533 0.701 0.722 0.792
Qwen3-30B-A3B 0.522 0.707 0.722 0.791
Qwen3-235B-A22B-Instruct-2507 0.578 0.718 0.779 0.820
Qwen3-235B-A22B-Thinking-2507 0.626 0.813 0.757 0.802
Qwen3-Next-80B-A3B-Instruct 0.551 0.737 0.756 0.808
Qwen3-Next-80B-A3B-Thinking 0.598 0.809 0.759 0.807
Qwen3-Swallow-30B-A3B-CPT-v0.2 0.498 0.651 0.602 0.644
Qwen3-Swallow-30B-A3B-RL-v0.2 0.546 0.711 0.741 0.777
Qwen3-Swallow-30B-A3B-SFT-v0.2 0.499 0.668 0.714 0.759
Qwen3-Swallow-8B-CPT-v0.2 0.435 0.546 0.568 0.591
Qwen3-Swallow-32B-CPT-v0.2 0.527 0.697 0.641 0.659
Qwen3-Swallow-8B-RL-v0.2 0.510 0.668 0.710 0.736
Qwen3-Swallow-32B-RL-v0.2 0.555 0.763 0.753 0.780
Qwen3-Swallow-8B-SFT-v0.2 0.449 0.590 0.687 0.715
Qwen3-Swallow-32B-SFT-v0.2 0.524 0.703 0.738 0.771
Qwen3.5-0.8B 0.128 0.246 0.248 0.367
Qwen3.5-2B 0.172 0.405 0.391 0.537
Qwen3.5-4B 0.456 0.688 0.618 0.678
Qwen3.5-9B 0.534 0.727 0.707 0.753
Qwen3.5-27B 0.621 0.831 0.784 0.820
Qwen3.5-35B-A3B 0.604 0.810 0.760 0.806
Qwen3.5-122B-A10B 0.640 0.831 0.778 0.829
QwQ Bakeneko 32B 0.470 0.646 0.683 0.767
Sarashina2.2 3B Instruct v0.1 0.260 0.301 0.562 0.561
Model  Japanese avg.  JamC-QA  En→Ja translation  Ja→En translation  M-IFEval (Ja)  MMLU-ProX (Ja)  GPQA (Ja)  PolyMath HT (Ja)  JHumanEval
ABEJA-QwQ32b-Reasoning-Japanese-v1.0 0.526 0.625 0.238 0.210 0.619 0.737 0.571 0.320 0.890
DeepSeek-R1-Distill-Qwen-32B-Japanese 0.449 0.447 0.205 0.191 0.549 0.659 0.527 0.196 0.818
ELYZA-Shortcut-1.0-Qwen-32B 0.443 0.524 0.245 0.228 0.633 0.631 0.415 0.100 0.770
ELYZA-Thinking-1.0-Qwen-32B 0.444 0.503 0.237 0.216 0.558 0.644 0.455 0.136 0.805
Gemma 3 1B IT 0.147 0.249 0.004 0.083 0.323 0.148 0.248 0.012 0.112
Gemma 3 4B IT 0.292 0.285 0.186 0.189 0.473 0.335 0.246 0.020 0.604
Gemma 3 12B IT 0.405 0.401 0.225 0.229 0.619 0.527 0.373 0.100 0.763
Gemma 3 27B IT 0.437 0.488 0.250 0.238 0.597 0.609 0.417 0.104 0.796
Gemma 4 31B IT 0.675 0.686 0.285 0.260 0.898 0.841 0.795 0.684 0.952
Gemma 4 26B-A4B IT 0.646 0.647 0.280 0.261 0.867 0.818 0.739 0.600 0.959
Gemma 4 E2B IT 0.431 0.361 0.256 0.227 0.646 0.580 0.386 0.168 0.826
Gemma 4 E4B IT 0.484 0.424 0.261 0.228 0.637 0.670 0.511 0.260 0.882
GPT-4.1 (gpt-4.1-2025-04-14) 0.571 0.790 0.278 0.260 0.810 0.772 0.603 0.148 0.911
GPT-4o (gpt-4o-2024-08-06) 0.508 0.747 0.282 0.265 0.704 0.685 0.453 0.084 0.844
GPT-5 (gpt-5-2025-08-07) 0.685 0.858 0.272 0.236 0.907 0.849 0.786 0.624 0.946
GPT-5 mini (gpt-5-mini-2025-08-07) 0.631 0.701 0.267 0.239 0.827 0.805 0.750 0.512 0.944
GPT-5.4 Thinking (gpt-5.4-2026-03-05) 0.717 0.891 0.272 0.249 0.881 0.852 0.853 0.776 0.963
gpt-oss-20b 0.521 0.405 0.234 0.205 0.726 0.705 0.587 0.372 0.932
gpt-oss-120b 0.566 0.518 0.262 0.208 0.735 0.754 0.658 0.464 0.927
GPT-OSS-Swallow-20B-RL-v0.1 0.568 0.562 0.248 0.210 0.673 0.746 0.650 0.536 0.924
GPT-OSS-Swallow-120B-RL-v0.1 0.600 0.616 0.263 0.218 0.717 0.776 0.703 0.576 0.934
GPT-OSS-Swallow-20B-SFT-v0.1 0.521 0.521 0.246 0.216 0.633 0.693 0.585 0.364 0.909
GPT-OSS-Swallow-120B-SFT-v0.1 0.568 0.594 0.264 0.213 0.748 0.746 0.623 0.428 0.927
Llama 3.1 8B Instruct 0.279 0.310 0.187 0.194 0.381 0.306 0.261 0.016 0.580
Llama-3.1-Nemotron-Nano-8B-v1 0.317 0.267 0.051 0.069 0.235 0.493 0.330 0.288 0.804
Llama 3.1 Swallow 8B Instruct v0.5 0.341 0.496 0.249 0.222 0.496 0.369 0.295 0.020 0.584
Llama 3.3 70B Instruct 0.440 0.484 0.245 0.246 0.650 0.607 0.453 0.080 0.752
Llama 3.3 Swallow 70B Instruct v0.4 0.420 0.562 0.275 0.251 0.593 0.533 0.355 0.060 0.727
Llama 4 Scout Instruct 0.482 0.579 0.237 0.230 0.611 0.687 0.540 0.148 0.820
llm-jp-3.1-1.8b-instruct4 0.199 0.348 0.000 0.159 0.288 0.195 0.239 0.000 0.365
llm-jp-3.1-13b-instruct4 0.258 0.509 0.007 0.161 0.372 0.296 0.230 0.028 0.463
llm-jp-4-32b-a3b-thinking 0.471 0.541 0.235 0.190 0.668 0.656 0.462 0.148 0.871
llm-jp-4-8b-thinking 0.448 0.514 0.227 0.189 0.677 0.623 0.431 0.080 0.840
NVIDIA-Nemotron-3-Super-120B-A12B 0.611 0.593 0.244 0.222 0.823 0.794 0.739 0.544 0.930
NVIDIA-Nemotron-Nano-9B-v2-Japanese 0.503 0.437 0.207 0.196 0.637 0.711 0.554 0.396 0.888
o3 (o3-2025-04-16) 0.660 0.851 0.276 0.216 0.850 0.835 0.766 0.548 0.937
o3-mini (o3-mini-2025-01-31) 0.579 0.507 0.243 0.221 0.841 0.760 0.685 0.444 0.928
Olmo 3 7B Think 0.374 0.286 0.125 0.141 0.314 0.491 0.371 0.368 0.896
Olmo 3 32B Think 0.447 0.333 0.172 0.185 0.345 0.663 0.504 0.452 0.921
Qwen3-0.6B 0.216 0.239 0.001 0.000 0.425 0.320 0.248 0.048 0.446
Qwen3-1.7B 0.364 0.292 0.125 0.151 0.491 0.522 0.326 0.244 0.757
Qwen3-4B 0.448 0.356 0.149 0.189 0.509 0.665 0.475 0.364 0.879
Qwen3-8B 0.491 0.401 0.202 0.201 0.611 0.711 0.498 0.412 0.891
Qwen3-14B 0.524 0.444 0.228 0.219 0.611 0.740 0.587 0.440 0.921
Qwen3-32B 0.533 0.472 0.218 0.216 0.637 0.759 0.609 0.440 0.915
Qwen3-30B-A3B 0.522 0.455 0.177 0.215 0.664 0.741 0.592 0.416 0.915
Qwen3-235B-A22B-Instruct-2507 0.578 0.636 0.258 0.230 0.730 0.799 0.701 0.368 0.900
Qwen3-235B-A22B-Thinking-2507 0.626 0.641 0.255 0.229 0.810 0.819 0.728 0.576 0.948
Qwen3-Next-80B-A3B-Instruct 0.551 0.599 0.240 0.228 0.681 0.770 0.614 0.372 0.905
Qwen3-Next-80B-A3B-Thinking 0.598 0.614 0.240 0.195 0.819 0.797 0.705 0.540 0.873
Qwen3-Swallow-30B-A3B-CPT-v0.2 0.498 0.518 0.240 0.222 0.558 0.715 0.542 0.292 0.895
Qwen3-Swallow-30B-A3B-RL-v0.2 0.546 0.520 0.239 0.209 0.602 0.750 0.612 0.524 0.917
Qwen3-Swallow-30B-A3B-SFT-v0.2 0.499 0.495 0.240 0.207 0.580 0.707 0.556 0.320 0.890
Qwen3-Swallow-8B-CPT-v0.2 0.435 0.440 0.230 0.206 0.482 0.636 0.453 0.208 0.823
Qwen3-Swallow-32B-CPT-v0.2 0.527 0.521 0.244 0.228 0.619 0.738 0.589 0.364 0.909
Qwen3-Swallow-8B-RL-v0.2 0.510 0.469 0.222 0.201 0.566 0.708 0.565 0.456 0.895
Qwen3-Swallow-32B-RL-v0.2 0.555 0.518 0.247 0.215 0.659 0.761 0.621 0.492 0.929
Qwen3-Swallow-8B-SFT-v0.2 0.449 0.443 0.228 0.199 0.549 0.656 0.438 0.252 0.829
Qwen3-Swallow-32B-SFT-v0.2 0.524 0.499 0.247 0.210 0.650 0.727 0.609 0.340 0.911
Qwen3.5-0.8B 0.128 0.245 0.000 0.000 0.261 0.236 0.261 0.008 0.013
Qwen3.5-2B 0.172 0.283 0.000 0.002 0.332 0.391 0.324 0.032 0.012
Qwen3.5-4B 0.456 0.395 0.080 0.147 0.456 0.750 0.688 0.404 0.727
Qwen3.5-9B 0.534 0.489 0.174 0.209 0.518 0.784 0.725 0.460 0.909
Qwen3.5-27B 0.621 0.591 0.252 0.241 0.735 0.835 0.795 0.592 0.930
Qwen3.5-35B-A3B 0.604 0.611 0.250 0.239 0.659 0.829 0.775 0.548 0.921
Qwen3.5-122B-A10B 0.640 0.671 0.263 0.247 0.774 0.841 0.801 0.588 0.937
QwQ Bakeneko 32B 0.470 0.492 0.232 0.207 0.593 0.684 0.455 0.220 0.874
Sarashina2.2 3B Instruct v0.1 0.260 0.498 0.002 0.160 0.288 0.335 0.301 0.036 0.464
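The table does not state how the "Japanese avg." column is computed; it appears to be the unweighted mean of the eight Japanese benchmark scores. A minimal Python check against one row (scores copied from the table above; the mean-of-benchmarks relationship is an assumption):

```python
# Consistency check (assumption): "Japanese avg." = unweighted mean of the
# eight Japanese benchmark scores. Row: ABEJA-QwQ32b-Reasoning-Japanese-v1.0.
scores = {
    "JamC-QA": 0.625,
    "En-Ja translation": 0.238,
    "Ja-En translation": 0.210,
    "M-IFEval (Ja)": 0.619,
    "MMLU-ProX (Ja)": 0.737,
    "GPQA (Ja)": 0.571,
    "PolyMath HT (Ja)": 0.320,
    "JHumanEval": 0.890,
}
mean = sum(scores.values()) / len(scores)
print(round(mean, 3))  # 0.526, matching the reported Japanese avg. for this row
```

The same relationship holds for the other average columns (English avg. over seven benchmarks, MT-Bench avg. over eight categories), to within rounding of the published three-decimal scores.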
Model  English avg.  HellaSwag  IFBench  MMLU-Pro (En)  GPQA (En)  MATH-500 (En)  AIME 24-25  LCB
ABEJA-QwQ32b-Reasoning-Japanese-v1.0 0.694 0.906 0.340 0.798 0.615 0.956 0.713 0.529
DeepSeek-R1-Distill-Qwen-32B-Japanese 0.653 0.870 0.267 0.778 0.619 0.954 0.608 0.473
ELYZA-Shortcut-1.0-Qwen-32B 0.520 0.897 0.346 0.684 0.460 0.830 0.150 0.276
ELYZA-Thinking-1.0-Qwen-32B 0.563 0.889 0.288 0.718 0.563 0.886 0.271 0.323
Gemma 3 1B IT 0.195 0.357 0.134 0.171 0.237 0.438 0.000 0.027
Gemma 3 4B IT 0.386 0.620 0.247 0.440 0.354 0.748 0.117 0.177
Gemma 3 12B IT 0.493 0.816 0.294 0.617 0.389 0.862 0.217 0.260
Gemma 3 27B IT 0.528 0.861 0.265 0.681 0.475 0.880 0.233 0.301
Gemma 4 31B IT 0.874 0.941 0.765 0.869 0.856 0.990 0.888 0.810
Gemma 4 26B-A4B IT 0.841 0.896 0.727 0.847 0.788 0.986 0.888 0.755
Gemma 4 E2B IT 0.538 0.606 0.343 0.619 0.449 0.910 0.413 0.429
Gemma 4 E4B IT 0.636 0.790 0.410 0.714 0.562 0.938 0.500 0.541
GPT-4.1 (gpt-4.1-2025-04-14) 0.645 0.940 0.416 0.813 0.667 0.906 0.400 0.371
GPT-4o (gpt-4o-2024-08-06) 0.525 0.930 0.311 0.749 0.556 0.792 0.083 0.253
GPT-5 (gpt-5-2025-08-07) 0.854 0.959 0.721 0.865 0.840 0.990 0.929 0.676
GPT-5 mini (gpt-5-mini-2025-08-07) 0.810 0.934 0.703 0.822 0.701 0.970 0.863 0.678
GPT-5.4 Thinking (gpt-5.4-2026-03-05) 0.864 0.958 0.651 0.875 0.888 0.992 0.958 0.723
gpt-oss-20b 0.728 0.847 0.558 0.748 0.654 0.958 0.729 0.599
gpt-oss-120b 0.768 0.873 0.610 0.789 0.706 0.968 0.779 0.647
GPT-OSS-Swallow-20B-RL-v0.1 0.752 0.831 0.517 0.738 0.676 0.984 0.854 0.665
GPT-OSS-Swallow-120B-RL-v0.1 0.782 0.850 0.605 0.731 0.705 0.990 0.879 0.711
GPT-OSS-Swallow-20B-SFT-v0.1 0.705 0.822 0.488 0.736 0.648 0.954 0.700 0.589
GPT-OSS-Swallow-120B-SFT-v0.1 0.749 0.881 0.564 0.773 0.681 0.964 0.750 0.634
Llama 3.1 8B Instruct 0.377 0.769 0.291 0.489 0.374 0.526 0.033 0.158
Llama-3.1-Nemotron-Nano-8B-v1 0.543 0.521 0.235 0.572 0.486 0.946 0.571 0.467
Llama 3.1 Swallow 8B Instruct v0.5 0.302 0.648 0.206 0.399 0.318 0.452 0.000 0.091
Llama 3.3 70B Instruct 0.528 0.911 0.445 0.717 0.480 0.746 0.117 0.279
Llama 3.3 Swallow 70B Instruct v0.4 0.441 0.884 0.259 0.570 0.409 0.642 0.083 0.239
Llama 4 Scout Instruct 0.563 0.891 0.372 0.744 0.606 0.834 0.183 0.311
llm-jp-3.1-1.8b-instruct4 0.175 0.450 0.137 0.163 0.278 0.146 0.000 0.051
llm-jp-3.1-13b-instruct4 0.232 0.717 0.131 0.252 0.227 0.188 0.000 0.112
llm-jp-4-32b-a3b-thinking 0.578 0.833 0.509 0.683 0.503 0.860 0.367 0.293
llm-jp-4-8b-thinking 0.532 0.835 0.439 0.657 0.431 0.820 0.283 0.259
NVIDIA-Nemotron-3-Super-120B-A12B 0.825 0.894 0.660 0.836 0.818 0.984 0.892 0.691
NVIDIA-Nemotron-Nano-9B-v2-Japanese 0.640 0.788 0.430 0.722 0.558 0.926 0.554 0.501
o3 (o3-2025-04-16) 0.831 0.956 0.712 0.857 0.819 0.978 0.842 0.649
o3-mini (o3-mini-2025-01-31) 0.752 0.869 0.677 0.792 0.734 0.958 0.733 0.503
Olmo 3 7B Think 0.606 0.713 0.253 0.633 0.506 0.964 0.708 0.466
Olmo 3 32B Think 0.679 0.849 0.256 0.760 0.601 0.968 0.750 0.569
Qwen3-0.6B 0.327 0.433 0.186 0.366 0.256 0.740 0.163 0.148
Qwen3-1.7B 0.492 0.631 0.238 0.570 0.378 0.900 0.433 0.297
Qwen3-4B 0.639 0.794 0.270 0.707 0.568 0.960 0.692 0.482
Qwen3-8B 0.662 0.851 0.282 0.721 0.583 0.976 0.721 0.502
Qwen3-14B 0.696 0.891 0.323 0.776 0.610 0.970 0.746 0.557
Qwen3-32B 0.701 0.903 0.294 0.786 0.638 0.976 0.738 0.573
Qwen3-30B-A3B 0.707 0.885 0.334 0.782 0.645 0.966 0.767 0.573
Qwen3-235B-A22B-Instruct-2507 0.718 0.940 0.430 0.824 0.586 0.982 0.767 0.494
Qwen3-235B-A22B-Thinking-2507 0.813 0.927 0.494 0.847 0.794 0.986 0.913 0.731
Qwen3-Next-80B-A3B-Instruct 0.737 0.929 0.384 0.824 0.753 0.980 0.733 0.559
Qwen3-Next-80B-A3B-Thinking 0.809 0.923 0.590 0.827 0.763 0.980 0.900 0.683
Qwen3-Swallow-30B-A3B-CPT-v0.2 0.651 0.844 0.390 0.740 0.580 0.940 0.583 0.481
Qwen3-Swallow-30B-A3B-RL-v0.2 0.711 0.828 0.433 0.692 0.615 0.976 0.817 0.618
Qwen3-Swallow-30B-A3B-SFT-v0.2 0.668 0.831 0.404 0.718 0.605 0.944 0.667 0.505
Qwen3-Swallow-8B-CPT-v0.2 0.546 0.790 0.308 0.624 0.496 0.860 0.400 0.342
Qwen3-Swallow-32B-CPT-v0.2 0.697 0.900 0.436 0.761 0.621 0.960 0.667 0.532
Qwen3-Swallow-8B-RL-v0.2 0.668 0.793 0.404 0.698 0.583 0.944 0.733 0.521
Qwen3-Swallow-32B-RL-v0.2 0.763 0.888 0.529 0.778 0.701 0.982 0.825 0.638
Qwen3-Swallow-8B-SFT-v0.2 0.590 0.797 0.395 0.671 0.509 0.898 0.463 0.395
Qwen3-Swallow-32B-SFT-v0.2 0.703 0.879 0.477 0.757 0.638 0.960 0.675 0.537
Qwen3.5-0.8B 0.246 0.409 0.166 0.388 0.288 0.424 0.017 0.033
Qwen3.5-2B 0.405 0.579 0.221 0.589 0.495 0.812 0.008 0.132
Qwen3.5-4B 0.688 0.859 0.404 0.788 0.765 0.962 0.671 0.365
Qwen3.5-9B 0.727 0.909 0.483 0.826 0.827 0.878 0.692 0.473
Qwen3.5-27B 0.831 0.957 0.645 0.868 0.866 0.956 0.892 0.635
Qwen3.5-35B-A3B 0.810 0.940 0.599 0.857 0.840 0.984 0.867 0.584
Qwen3.5-122B-A10B 0.831 0.951 0.645 0.870 0.862 0.974 0.879 0.638
QwQ Bakeneko 32B 0.646 0.902 0.273 0.772 0.610 0.942 0.542 0.483
Sarashina2.2 3B Instruct v0.1 0.301 0.613 0.174 0.329 0.293 0.570 0.017 0.112
Model  Japanese MT-Bench avg.  Coding  Extraction  Humanities  Math  Reasoning  Roleplay  STEM  Writing
ABEJA-QwQ32b-Reasoning-Japanese-v1.0 0.727 0.766 0.697 0.637 0.949 0.772 0.686 0.657 0.651
DeepSeek-R1-Distill-Qwen-32B-Japanese 0.642 0.628 0.700 0.541 0.888 0.702 0.573 0.559 0.546
ELYZA-Shortcut-1.0-Qwen-32B 0.675 0.736 0.680 0.622 0.836 0.658 0.638 0.589 0.642
ELYZA-Thinking-1.0-Qwen-32B 0.664 0.646 0.663 0.570 0.943 0.693 0.636 0.578 0.581
Gemma 3 1B IT 0.352 0.258 0.381 0.373 0.284 0.323 0.405 0.364 0.425
Gemma 3 4B IT 0.572 0.587 0.549 0.541 0.787 0.418 0.569 0.534 0.590
Gemma 3 12B IT 0.659 0.704 0.647 0.633 0.840 0.538 0.676 0.608 0.625
Gemma 3 27B IT 0.691 0.663 0.764 0.663 0.815 0.635 0.705 0.646 0.639
Gemma 4 31B IT 0.815 0.872 0.757 0.782 0.988 0.844 0.750 0.779 0.748
Gemma 4 26B-A4B IT 0.806 0.840 0.772 0.765 0.989 0.828 0.753 0.759 0.745
Gemma 4 E2B IT 0.726 0.722 0.706 0.694 0.949 0.735 0.673 0.638 0.692
Gemma 4 E4B IT 0.758 0.799 0.729 0.704 0.972 0.761 0.719 0.680 0.697
GPT-4.1 (gpt-4.1-2025-04-14) 0.771 0.813 0.757 0.699 0.942 0.764 0.748 0.719 0.725
GPT-4o (gpt-4o-2024-08-06) 0.714 0.737 0.776 0.652 0.914 0.645 0.671 0.658 0.660
GPT-5 (gpt-5-2025-08-07) 0.842 0.853 0.758 0.858 0.977 0.851 0.828 0.831 0.781
GPT-5 mini (gpt-5-mini-2025-08-07) 0.830 0.879 0.770 0.803 0.984 0.774 0.812 0.838 0.777
GPT-5.4 Thinking (gpt-5.4-2026-03-05) 0.844 0.872 0.787 0.815 0.997 0.827 0.816 0.832 0.804
gpt-oss-20b 0.716 0.792 0.749 0.609 0.988 0.693 0.639 0.639 0.620
gpt-oss-120b 0.757 0.791 0.748 0.678 0.976 0.773 0.704 0.706 0.680
GPT-OSS-Swallow-20B-RL-v0.1 0.726 0.786 0.724 0.624 0.935 0.793 0.662 0.669 0.613
GPT-OSS-Swallow-120B-RL-v0.1 0.772 0.795 0.763 0.687 0.949 0.832 0.716 0.729 0.707
GPT-OSS-Swallow-20B-SFT-v0.1 0.731 0.809 0.691 0.624 0.976 0.777 0.680 0.649 0.641
GPT-OSS-Swallow-120B-SFT-v0.1 0.758 0.809 0.742 0.667 0.987 0.782 0.704 0.682 0.687
Llama 3.1 8B Instruct 0.446 0.460 0.581 0.381 0.563 0.410 0.391 0.405 0.374
Llama-3.1-Nemotron-Nano-8B-v1 0.271 0.312 0.317 0.199 0.474 0.264 0.196 0.221 0.189
Llama 3.1 Swallow 8B Instruct v0.5 0.565 0.504 0.667 0.621 0.430 0.530 0.601 0.558 0.605
Llama 3.3 70B Instruct 0.585 0.630 0.698 0.550 0.692 0.549 0.535 0.516 0.510
Llama 3.3 Swallow 70B Instruct v0.4 0.634 0.578 0.631 0.629 0.703 0.666 0.656 0.599 0.613
Llama 4 Scout Instruct 0.629 0.623 0.721 0.572 0.839 0.561 0.611 0.566 0.538
llm-jp-3.1-1.8b-instruct4 0.475 0.451 0.449 0.548 0.540 0.374 0.512 0.462 0.461
llm-jp-3.1-13b-instruct4 0.549 0.542 0.537 0.605 0.635 0.427 0.560 0.518 0.568
llm-jp-4-32b-a3b-thinking 0.726 0.710 0.750 0.686 0.967 0.732 0.685 0.615 0.663
llm-jp-4-8b-thinking 0.706 0.649 0.734 0.644 0.904 0.732 0.690 0.623 0.670
NVIDIA-Nemotron-3-Super-120B-A12B 0.736 0.774 0.710 0.690 0.957 0.729 0.690 0.687 0.650
NVIDIA-Nemotron-Nano-9B-v2-Japanese 0.689 0.699 0.700 0.588 0.978 0.738 0.591 0.634 0.583
o3 (o3-2025-04-16) 0.818 0.836 0.764 0.775 0.991 0.831 0.796 0.805 0.747
o3-mini (o3-mini-2025-01-31) 0.782 0.828 0.774 0.703 0.997 0.794 0.728 0.737 0.694
Olmo 3 7B Think 0.364 0.345 0.401 0.288 0.609 0.345 0.315 0.342 0.270
Olmo 3 32B Think 0.475 0.379 0.529 0.363 0.679 0.589 0.446 0.424 0.389
Qwen3-0.6B 0.336 0.324 0.383 0.266 0.630 0.261 0.270 0.273 0.279
Qwen3-1.7B 0.478 0.496 0.502 0.369 0.790 0.470 0.397 0.419 0.382
Qwen3-4B 0.628 0.660 0.734 0.500 0.892 0.652 0.534 0.544 0.508
Qwen3-8B 0.689 0.730 0.697 0.565 0.960 0.749 0.605 0.613 0.590
Qwen3-14B 0.712 0.755 0.710 0.599 0.976 0.783 0.628 0.647 0.598
Qwen3-32B 0.722 0.737 0.765 0.618 0.974 0.737 0.642 0.674 0.631
Qwen3-30B-A3B 0.722 0.776 0.727 0.627 0.964 0.775 0.659 0.652 0.600
Qwen3-235B-A22B-Instruct-2507 0.779 0.826 0.739 0.707 0.979 0.795 0.716 0.756 0.717
Qwen3-235B-A22B-Thinking-2507 0.757 0.769 0.745 0.682 0.974 0.811 0.664 0.728 0.685
Qwen3-Next-80B-A3B-Instruct 0.756 0.798 0.725 0.668 0.988 0.815 0.662 0.724 0.668
Qwen3-Next-80B-A3B-Thinking 0.759 0.790 0.710 0.705 0.990 0.775 0.705 0.730 0.670
Qwen3-Swallow-30B-A3B-CPT-v0.2 0.602 0.578 0.589 0.575 0.784 0.589 0.617 0.564 0.520
Qwen3-Swallow-30B-A3B-RL-v0.2 0.741 0.781 0.721 0.651 0.978 0.794 0.680 0.690 0.635
Qwen3-Swallow-30B-A3B-SFT-v0.2 0.714 0.771 0.678 0.632 0.946 0.719 0.708 0.648 0.608
Qwen3-Swallow-8B-CPT-v0.2 0.568 0.514 0.494 0.532 0.802 0.550 0.609 0.544 0.495
Qwen3-Swallow-32B-CPT-v0.2 0.641 0.625 0.650 0.550 0.865 0.664 0.633 0.595 0.543
Qwen3-Swallow-8B-RL-v0.2 0.710 0.723 0.715 0.614 0.984 0.726 0.646 0.644 0.628
Qwen3-Swallow-32B-RL-v0.2 0.753 0.831 0.768 0.634 0.976 0.765 0.710 0.676 0.665
Qwen3-Swallow-8B-SFT-v0.2 0.687 0.644 0.709 0.589 0.974 0.669 0.680 0.621 0.610
Qwen3-Swallow-32B-SFT-v0.2 0.738 0.787 0.725 0.641 0.975 0.761 0.693 0.665 0.658
Qwen3.5-0.8B 0.248 0.176 0.278 0.225 0.370 0.298 0.208 0.238 0.189
Qwen3.5-2B 0.391 0.287 0.484 0.337 0.616 0.378 0.295 0.392 0.341
Qwen3.5-4B 0.618 0.594 0.624 0.523 0.956 0.718 0.436 0.656 0.435
Qwen3.5-9B 0.707 0.676 0.660 0.660 0.975 0.843 0.538 0.711 0.594
Qwen3.5-27B 0.784 0.777 0.748 0.752 0.982 0.847 0.702 0.765 0.702
Qwen3.5-35B-A3B 0.760 0.798 0.731 0.690 0.993 0.814 0.646 0.760 0.644
Qwen3.5-122B-A10B 0.778 0.782 0.721 0.755 0.992 0.804 0.731 0.771 0.669
QwQ Bakeneko 32B 0.683 0.657 0.690 0.600 0.917 0.723 0.647 0.618 0.608
Sarashina2.2 3B Instruct v0.1 0.562 0.495 0.493 0.603 0.782 0.484 0.590 0.509 0.537
Model  English MT-Bench avg.  Coding  Extraction  Humanities  Math  Reasoning  Roleplay  STEM  Writing
ABEJA-QwQ32b-Reasoning-Japanese-v1.0 0.785 0.788 0.786 0.736 0.980 0.792 0.754 0.720 0.720
DeepSeek-R1-Distill-Qwen-32B-Japanese 0.752 0.714 0.781 0.701 0.970 0.801 0.696 0.645 0.709
ELYZA-Shortcut-1.0-Qwen-32B 0.755 0.759 0.827 0.699 0.912 0.827 0.697 0.632 0.684
ELYZA-Thinking-1.0-Qwen-32B 0.722 0.684 0.778 0.657 0.939 0.782 0.646 0.615 0.678
Gemma 3 1B IT 0.516 0.423 0.441 0.553 0.696 0.319 0.570 0.502 0.626
Gemma 3 4B IT 0.679 0.605 0.652 0.700 0.883 0.565 0.721 0.608 0.696
Gemma 3 12B IT 0.760 0.716 0.752 0.772 0.892 0.756 0.771 0.691 0.733
Gemma 3 27B IT 0.771 0.636 0.839 0.774 0.937 0.730 0.780 0.726 0.745
Gemma 4 31B IT 0.839 0.798 0.830 0.824 0.986 0.878 0.813 0.806 0.775
Gemma 4 26B-A4B IT 0.832 0.809 0.826 0.815 0.992 0.850 0.815 0.779 0.774
Gemma 4 E2B IT 0.776 0.764 0.799 0.722 0.956 0.793 0.752 0.696 0.727
Gemma 4 E4B IT 0.801 0.793 0.821 0.750 0.957 0.847 0.772 0.715 0.754
GPT-4.1 (gpt-4.1-2025-04-14) 0.830 0.823 0.832 0.785 0.985 0.883 0.800 0.762 0.772
GPT-4o (gpt-4o-2024-08-06) 0.792 0.826 0.828 0.729 0.952 0.911 0.721 0.670 0.700
GPT-5 (gpt-5-2025-08-07) 0.862 0.870 0.790 0.853 0.973 0.875 0.867 0.864 0.802
GPT-5 mini (gpt-5-mini-2025-08-07) 0.869 0.861 0.845 0.834 0.993 0.910 0.853 0.849 0.805
GPT-5.4 Thinking (gpt-5.4-2026-03-05) 0.859 0.838 0.852 0.812 0.986 0.877 0.868 0.833 0.809
gpt-oss-20b 0.782 0.785 0.834 0.687 0.947 0.809 0.738 0.717 0.735
gpt-oss-120b 0.795 0.746 0.798 0.726 0.984 0.826 0.777 0.720 0.784
GPT-OSS-Swallow-20B-RL-v0.1 0.769 0.802 0.796 0.651 0.973 0.798 0.731 0.692 0.705
GPT-OSS-Swallow-120B-RL-v0.1 0.803 0.805 0.827 0.692 0.993 0.822 0.766 0.754 0.763
GPT-OSS-Swallow-20B-SFT-v0.1 0.751 0.753 0.780 0.619 0.946 0.841 0.724 0.622 0.723
GPT-OSS-Swallow-120B-SFT-v0.1 0.783 0.779 0.814 0.660 0.991 0.797 0.769 0.698 0.756
Llama 3.1 8B Instruct 0.600 0.595 0.651 0.611 0.719 0.430 0.608 0.523 0.665
Llama-3.1-Nemotron-Nano-8B-v1 0.569 0.567 0.563 0.455 0.862 0.539 0.530 0.495 0.538
Llama 3.1 Swallow 8B Instruct v0.5 0.617 0.504 0.692 0.691 0.714 0.483 0.661 0.546 0.646
Llama 3.3 70B Instruct 0.718 0.668 0.796 0.673 0.875 0.752 0.677 0.622 0.684
Llama 3.3 Swallow 70B Instruct v0.4 0.714 0.684 0.765 0.685 0.881 0.735 0.671 0.620 0.674
Llama 4 Scout Instruct 0.751 0.731 0.792 0.678 0.930 0.862 0.673 0.630 0.709
llm-jp-3.1-1.8b-instruct4 0.436 0.465 0.413 0.462 0.487 0.388 0.470 0.374 0.431
llm-jp-3.1-13b-instruct4 0.542 0.489 0.576 0.589 0.569 0.480 0.547 0.532 0.550
llm-jp-4-32b-a3b-thinking 0.759 0.707 0.805 0.722 0.925 0.782 0.724 0.665 0.742
llm-jp-4-8b-thinking 0.743 0.611 0.818 0.707 0.950 0.790 0.721 0.611 0.737
NVIDIA-Nemotron-3-Super-120B-A12B 0.806 0.779 0.791 0.769 0.973 0.841 0.783 0.745 0.764
NVIDIA-Nemotron-Nano-9B-v2-Japanese 0.759 0.726 0.748 0.677 0.987 0.835 0.685 0.673 0.737
o3 (o3-2025-04-16) 0.849 0.806 0.808 0.829 0.996 0.903 0.828 0.825 0.797
o3-mini (o3-mini-2025-01-31) 0.822 0.842 0.824 0.733 0.991 0.886 0.765 0.790 0.743
Olmo 3 7B Think 0.505 0.389 0.566 0.378 0.790 0.579 0.457 0.474 0.405
Olmo 3 32B Think 0.562 0.465 0.586 0.435 0.815 0.688 0.566 0.504 0.437
Qwen3-0.6B 0.468 0.355 0.503 0.394 0.761 0.404 0.383 0.444 0.499
Qwen3-1.7B 0.629 0.569 0.661 0.547 0.924 0.623 0.535 0.575 0.599
Qwen3-4B 0.737 0.703 0.764 0.604 0.983 0.841 0.670 0.676 0.657
Qwen3-8B 0.767 0.735 0.798 0.693 0.982 0.829 0.714 0.692 0.695
Qwen3-14B 0.784 0.764 0.804 0.724 0.986 0.817 0.746 0.709 0.724
Qwen3-32B 0.792 0.791 0.801 0.740 0.997 0.807 0.755 0.714 0.730
Qwen3-30B-A3B 0.791 0.771 0.800 0.722 0.986 0.865 0.741 0.718 0.727
Qwen3-235B-A22B-Instruct-2507 0.820 0.765 0.737 0.818 0.987 0.890 0.790 0.790 0.781
Qwen3-235B-A22B-Thinking-2507 0.802 0.771 0.797 0.770 0.994 0.855 0.716 0.747 0.766
Qwen3-Next-80B-A3B-Instruct 0.808 0.755 0.789 0.772 0.991 0.870 0.755 0.758 0.773
Qwen3-Next-80B-A3B-Thinking 0.807 0.790 0.770 0.775 1.000 0.895 0.735 0.755 0.735
Qwen3-Swallow-30B-A3B-CPT-v0.2 0.644 0.572 0.649 0.583 0.798 0.703 0.688 0.563 0.600
Qwen3-Swallow-30B-A3B-RL-v0.2 0.777 0.767 0.786 0.676 0.985 0.854 0.749 0.697 0.702
Qwen3-Swallow-30B-A3B-SFT-v0.2 0.759 0.733 0.752 0.694 0.974 0.825 0.733 0.677 0.686
Qwen3-Swallow-8B-CPT-v0.2 0.591 0.509 0.576 0.535 0.773 0.607 0.645 0.541 0.542
Qwen3-Swallow-32B-CPT-v0.2 0.659 0.588 0.713 0.577 0.794 0.704 0.675 0.594 0.625
Qwen3-Swallow-8B-RL-v0.2 0.736 0.664 0.801 0.644 0.969 0.775 0.708 0.653 0.672
Qwen3-Swallow-32B-RL-v0.2 0.780 0.770 0.793 0.677 0.978 0.810 0.748 0.728 0.736
Qwen3-Swallow-8B-SFT-v0.2 0.715 0.611 0.782 0.649 0.948 0.765 0.678 0.632 0.655
Qwen3-Swallow-32B-SFT-v0.2 0.771 0.710 0.830 0.703 0.959 0.802 0.732 0.688 0.743
Qwen3.5-0.8B 0.367 0.276 0.368 0.311 0.529 0.329 0.316 0.391 0.418
Qwen3.5-2B 0.537 0.432 0.465 0.473 0.840 0.511 0.506 0.514 0.556
Qwen3.5-4B 0.678 0.632 0.667 0.600 0.959 0.760 0.509 0.706 0.589
Qwen3.5-9B 0.753 0.693 0.739 0.733 0.964 0.839 0.612 0.737 0.705
Qwen3.5-27B 0.820 0.773 0.793 0.779 0.986 0.883 0.795 0.791 0.760
Qwen3.5-35B-A3B 0.806 0.774 0.734 0.782 0.986 0.884 0.751 0.796 0.740
Qwen3.5-122B-A10B 0.829 0.778 0.794 0.796 0.989 0.893 0.807 0.808 0.765
QwQ Bakeneko 32B 0.767 0.758 0.773 0.686 0.993 0.865 0.692 0.652 0.714
Sarashina2.2 3B Instruct v0.1 0.561 0.460 0.491 0.626 0.709 0.489 0.593 0.554 0.562
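As with the Japanese scores, the "English MT-Bench avg." column appears to be the unweighted mean of the eight per-category scores. A quick check against one row (values copied from the table above; the averaging rule is an assumption, not stated in the table):

```python
# Consistency check (assumption): "English MT-Bench avg." = unweighted mean
# of the eight category scores. Row: GPT-5 (gpt-5-2025-08-07).
categories = {
    "Coding": 0.870,
    "Extraction": 0.790,
    "Humanities": 0.853,
    "Math": 0.973,
    "Reasoning": 0.875,
    "Roleplay": 0.867,
    "STEM": 0.864,
    "Writing": 0.802,
}
mean = sum(categories.values()) / len(categories)
print(round(mean, 3))  # 0.862, matching the reported English MT-Bench avg.
```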