| Model name | Post (ja) avg | Post (en) avg | MT-Bench (ja) avg | MT-Bench (en) avg |
|---|---|---|---|---|
| ABEJA-QwQ32b-Reasoning-Japanese-v1.0 | 0.730 | 0.739 | 0.843 | 0.866 |
| CyberAgentLM3-22B-chat | 0.397 | 0.280 | 0.697 | 0.621 |
| DeepSeek-R1-Distill-Llama-8B | 0.451 | 0.549 | 0.526 | 0.704 |
| DeepSeek-R1-Distill-Llama-70B | 0.683 | 0.730 | 0.707 | 0.842 |
| DeepSeek-R1-Distill-Qwen-7B | 0.514 | 0.546 | 0.411 | 0.649 |
| DeepSeek-R1-Distill-Qwen-14B | 0.638 | 0.672 | 0.700 | 0.775 |
| DeepSeek-R1-Distill-Qwen-32B | 0.692 | 0.701 | 0.753 | 0.822 |
| DeepSeek-R1-Distill-Qwen-14B-Japanese | 0.575 | 0.629 | 0.771 | 0.835 |
| DeepSeek-R1-Distill-Qwen-32B-Japanese | 0.649 | 0.697 | 0.808 | 0.857 |
| ELYZA-Thinking-1.0-Qwen-32B | 0.526 | 0.571 | 0.694 | 0.748 |
| Gemma 2 2B IT | 0.269 | 0.256 | 0.555 | 0.718 |
| Gemma 2 9B IT | 0.447 | 0.392 | 0.743 | 0.761 |
| Gemma 2 27B IT | 0.506 | 0.441 | 0.762 | 0.800 |
| Gemma-2-Llama Swallow 2B IT | 0.245 | 0.184 | 0.583 | 0.584 |
| Gemma-2-Llama Swallow 9B IT | 0.402 | 0.323 | 0.729 | 0.734 |
| Gemma-2-Llama Swallow 27B IT | 0.517 | 0.393 | 0.759 | 0.771 |
| Gemma 3 1B IT | 0.170 | 0.201 | 0.434 | 0.578 |
| Gemma 3 4B IT | 0.448 | 0.405 | 0.735 | 0.793 |
| Gemma 3 12B IT | 0.597 | 0.524 | 0.811 | 0.860 |
| Gemma 3 27B IT | 0.658 | 0.571 | 0.830 | 0.880 |
| GPT-4.1 (gpt-4.1-2025-04-14) | 0.808 | 0.685 | 0.892 | 0.908 |
| GPT-4o (gpt-4o-2024-08-06) | 0.710 | 0.560 | 0.865 | 0.922 |
| GPT-5 (gpt-5-2025-08-07) | 0.891 | 0.875 | 0.882 | 0.888 |
| gpt-oss-20b | 0.727 | 0.737 | 0.869 | 0.889 |
| gpt-oss-120b | 0.790 | 0.794 | 0.907 | 0.918 |
| Llama 3.1 8B Instruct | 0.403 | 0.387 | 0.592 | 0.737 |
| Llama-3.1-Nemotron-Nano-8B-v1 | 0.550 | 0.588 | 0.363 | 0.701 |
| Llama 3.1 Swallow 8B Instruct v0.3 | 0.389 | 0.291 | 0.709 | 0.691 |
| Llama 3.1 Swallow 8B Instruct v0.5 | 0.451 | 0.315 | 0.726 | 0.753 |
| Llama 3.3 70B Instruct | 0.603 | 0.545 | 0.735 | 0.863 |
| Llama-3.3-Nemotron-Super-49B-v1 | 0.716 | 0.711 | 0.806 | 0.881 |
| Llama 3.3 Swallow 70B Instruct v0.4 | 0.594 | 0.470 | 0.791 | 0.816 |
| Llama 4 Scout Instruct | 0.663 | 0.594 | 0.789 | 0.857 |
| llm-jp-3.1-1.8b-instruct4 | 0.271 | 0.178 | 0.657 | 0.548 |
| llm-jp-3.1-13b-instruct4 | 0.384 | 0.244 | 0.733 | 0.682 |
| MedGemma 27B IT | 0.463 | 0.495 | 0.778 | 0.830 |
| o3 (o3-2025-04-16) | 0.870 | 0.846 | 0.903 | 0.917 |
| o3-mini (o3-mini-2025-01-31) | 0.785 | 0.767 | 0.880 | 0.901 |
| Phi-4 | 0.646 | 0.547 | 0.822 | 0.881 |
| Phi-4-reasoning-plus | 0.437 | 0.469 | 0.374 | 0.426 |
| Qwen2.5-7B-Instruct | 0.502 | 0.454 | 0.688 | 0.797 |
| Qwen2.5-14B-Instruct | 0.596 | 0.514 | 0.799 | 0.865 |
| Qwen2.5-32B-Instruct | 0.642 | 0.543 | 0.819 | 0.869 |
| Qwen3-0.6B | 0.353 | 0.335 | 0.431 | 0.595 |
| Qwen3-1.7B | 0.533 | 0.531 | 0.662 | 0.779 |
| Qwen3-4B | 0.646 | 0.672 | 0.797 | 0.839 |
| Qwen3-8B | 0.691 | 0.715 | 0.845 | 0.851 |
| Qwen3-14B | 0.750 | 0.763 | 0.874 | 0.882 |
| Qwen3-32B | 0.756 | 0.768 | 0.875 | 0.892 |
| Qwen3-235B-A22B-Instruct-2507 | 0.821 | 0.771 | 0.915 | 0.911 |
| Qwen3-235B-A22B-Thinking-2507 | 0.823 | 0.856 | 0.904 | 0.922 |
| Sarashina2.2 3B Instruct v0.1 | 0.400 | 0.318 | 0.721 | 0.708 |

| Model name | Post (ja) avg | JEMHopQA | MMLU-ProX (ja) | GPQA (ja) | MATH-100 (ja) | JHumanEval | M-IFEval-Ja |
|---|---|---|---|---|---|---|---|
| ABEJA-QwQ32b-Reasoning-Japanese-v1.0 | 0.730 | 0.644 | 0.712 | 0.527 | 0.899 | 0.866 | 0.619 |
| CyberAgentLM3-22B-chat | 0.397 | 0.612 | 0.310 | 0.266 | 0.354 | 0.443 | 0.429 |
| DeepSeek-R1-Distill-Llama-8B | 0.451 | 0.348 | 0.319 | 0.310 | 0.556 | 0.721 | 0.319 |
| DeepSeek-R1-Distill-Llama-70B | 0.683 | 0.567 | 0.642 | 0.538 | 0.859 | 0.812 | 0.558 |
| DeepSeek-R1-Distill-Qwen-7B | 0.514 | 0.279 | 0.438 | 0.400 | 0.778 | 0.674 | 0.341 |
| DeepSeek-R1-Distill-Qwen-14B | 0.638 | 0.508 | 0.591 | 0.496 | 0.737 | 0.859 | 0.496 |
| DeepSeek-R1-Distill-Qwen-32B | 0.692 | 0.572 | 0.660 | 0.536 | 0.838 | 0.855 | 0.509 |
| DeepSeek-R1-Distill-Qwen-14B-Japanese | 0.575 | 0.545 | 0.525 | 0.400 | 0.788 | 0.620 | 0.513 |
| DeepSeek-R1-Distill-Qwen-32B-Japanese | 0.649 | 0.654 | 0.606 | 0.464 | 0.838 | 0.680 | 0.544 |
| ELYZA-Thinking-1.0-Qwen-32B | 0.526 | 0.601 | 0.623 | 0.455 | 0.788 | 0.162 | 0.566 |
| Gemma 2 2B IT | 0.269 | 0.321 | 0.214 | 0.248 | 0.202 | 0.359 | 0.416 |
| Gemma 2 9B IT | 0.447 | 0.506 | 0.423 | 0.277 | 0.444 | 0.583 | 0.558 |
| Gemma 2 27B IT | 0.506 | 0.561 | 0.462 | 0.304 | 0.505 | 0.700 | 0.588 |
| Gemma-2-Llama Swallow 2B IT | 0.245 | 0.274 | 0.190 | 0.259 | 0.263 | 0.241 | 0.363 |
| Gemma-2-Llama Swallow 9B IT | 0.402 | 0.465 | 0.372 | 0.283 | 0.374 | 0.518 | 0.540 |
| Gemma-2-Llama Swallow 27B IT | 0.517 | 0.681 | 0.452 | 0.333 | 0.465 | 0.656 | 0.540 |
| Gemma 3 1B IT | 0.170 | 0.168 | 0.148 | 0.248 | 0.172 | 0.112 | 0.323 |
| Gemma 3 4B IT | 0.448 | 0.450 | 0.335 | 0.246 | 0.606 | 0.604 | 0.473 |
| Gemma 3 12B IT | 0.597 | 0.525 | 0.527 | 0.373 | 0.798 | 0.763 | 0.619 |
| Gemma 3 27B IT | 0.658 | 0.607 | 0.609 | 0.417 | 0.859 | 0.796 | 0.597 |
| GPT-4.1 (gpt-4.1-2025-04-14) | 0.808 | 0.856 | 0.772 | 0.603 | 0.899 | 0.911 | 0.810 |
| GPT-4o (gpt-4o-2024-08-06) | 0.710 | 0.813 | 0.685 | 0.453 | 0.758 | 0.844 | 0.704 |
| GPT-5 (gpt-5-2025-08-07) | 0.891 | 0.900 | 0.849 | 0.786 | 0.980 | 0.943 | 0.907 |
| gpt-oss-20b | 0.727 | 0.506 | 0.702 | 0.571 | 0.929 | 0.927 | 0.549 |
| gpt-oss-120b | 0.790 | 0.635 | 0.756 | 0.663 | 0.970 | 0.925 | 0.735 |
| Llama 3.1 8B Instruct | 0.403 | 0.482 | 0.306 | 0.261 | 0.384 | 0.580 | 0.381 |
| Llama-3.1-Nemotron-Nano-8B-v1 | 0.550 | 0.202 | 0.489 | 0.339 | 0.919 | 0.802 | 0.186 |
| Llama 3.1 Swallow 8B Instruct v0.3 | 0.389 | 0.549 | 0.306 | 0.239 | 0.364 | 0.488 | 0.491 |
| Llama 3.1 Swallow 8B Instruct v0.5 | 0.451 | 0.602 | 0.369 | 0.295 | 0.404 | 0.584 | 0.496 |
| Llama 3.3 70B Instruct | 0.603 | 0.557 | 0.607 | 0.453 | 0.646 | 0.752 | 0.650 |
| Llama-3.3-Nemotron-Super-49B-v1 | 0.716 | 0.541 | 0.687 | 0.531 | 0.919 | 0.900 | 0.558 |
| Llama 3.3 Swallow 70B Instruct v0.4 | 0.594 | 0.658 | 0.533 | 0.355 | 0.697 | 0.727 | 0.593 |
| Llama 4 Scout Instruct | 0.663 | 0.512 | 0.687 | 0.540 | 0.758 | 0.820 | 0.611 |
| llm-jp-3.1-1.8b-instruct4 | 0.271 | 0.342 | 0.195 | 0.239 | 0.212 | 0.365 | 0.288 |
| llm-jp-3.1-13b-instruct4 | 0.384 | 0.698 | 0.296 | 0.230 | 0.232 | 0.463 | 0.372 |
| MedGemma 27B IT | 0.463 | 0.537 | 0.606 | 0.350 | 0.818 | 0.001 | 0.624 |
| o3 (o3-2025-04-16) | 0.870 | 0.852 | 0.835 | 0.766 | 0.970 | 0.929 | 0.850 |
| o3-mini (o3-mini-2025-01-31) | 0.785 | 0.607 | 0.760 | 0.685 | 0.939 | 0.934 | 0.841 |
| Phi-4 | 0.646 | 0.589 | 0.638 | 0.435 | 0.798 | 0.770 | 0.438 |
| Phi-4-reasoning-plus | 0.437 | 0.015 | 0.118 | 0.563 | 0.737 | 0.751 | 0.221 |
| Qwen2.5-7B-Instruct | 0.502 | 0.372 | 0.452 | 0.315 | 0.636 | 0.737 | 0.504 |
| Qwen2.5-14B-Instruct | 0.596 | 0.553 | 0.556 | 0.348 | 0.768 | 0.754 | 0.606 |
| Qwen2.5-32B-Instruct | 0.642 | 0.604 | 0.623 | 0.411 | 0.768 | 0.803 | 0.673 |
| Qwen3-0.6B | 0.353 | 0.221 | 0.295 | 0.237 | 0.606 | 0.408 | 0.438 |
| Qwen3-1.7B | 0.533 | 0.230 | 0.514 | 0.315 | 0.859 | 0.747 | 0.460 |
| Qwen3-4B | 0.646 | 0.389 | 0.643 | 0.440 | 0.919 | 0.838 | 0.562 |
| Qwen3-8B | 0.691 | 0.468 | 0.696 | 0.491 | 0.929 | 0.869 | 0.575 |
| Qwen3-14B | 0.750 | 0.609 | 0.737 | 0.556 | 0.939 | 0.910 | 0.624 |
| Qwen3-32B | 0.756 | 0.588 | 0.746 | 0.571 | 0.949 | 0.923 | 0.681 |
| Qwen3-235B-A22B-Instruct-2507 | 0.821 | 0.735 | 0.799 | 0.701 | 0.970 | 0.900 | 0.730 |
| Qwen3-235B-A22B-Thinking-2507 | 0.823 | 0.651 | 0.819 | 0.739 | 0.970 | 0.938 | 0.783 |
| Sarashina2.2 3B Instruct v0.1 | 0.400 | 0.434 | 0.335 | 0.301 | 0.465 | 0.464 | 0.288 |

| Model name | Post (en) avg | HellaSwag | MMLU-Pro (en) | GPQA (en) | MATH-500 (en) | AIME 24-25 | LiveCodeBench |
|---|---|---|---|---|---|---|---|
| ABEJA-QwQ32b-Reasoning-Japanese-v1.0 | 0.739 | 0.906 | 0.780 | 0.606 | 0.964 | 0.617 | 0.563 |
| CyberAgentLM3-22B-chat | 0.280 | 0.770 | 0.260 | 0.288 | 0.298 | 0.017 | 0.045 |
| DeepSeek-R1-Distill-Llama-8B | 0.549 | 0.688 | 0.549 | 0.460 | 0.866 | 0.367 | 0.364 |
| DeepSeek-R1-Distill-Llama-70B | 0.730 | 0.891 | 0.776 | 0.626 | 0.936 | 0.617 | 0.534 |
| DeepSeek-R1-Distill-Qwen-7B | 0.546 | 0.564 | 0.547 | 0.495 | 0.902 | 0.417 | 0.351 |
| DeepSeek-R1-Distill-Qwen-14B | 0.672 | 0.841 | 0.707 | 0.525 | 0.908 | 0.567 | 0.486 |
| DeepSeek-R1-Distill-Qwen-32B | 0.701 | 0.885 | 0.737 | 0.571 | 0.926 | 0.567 | 0.523 |
| DeepSeek-R1-Distill-Qwen-14B-Japanese | 0.629 | 0.823 | 0.679 | 0.470 | 0.916 | 0.433 | 0.451 |
| DeepSeek-R1-Distill-Qwen-32B-Japanese | 0.697 | 0.872 | 0.737 | 0.576 | 0.940 | 0.550 | 0.508 |
| ELYZA-Thinking-1.0-Qwen-32B | 0.571 | 0.888 | 0.708 | 0.576 | 0.860 | 0.300 | 0.093 |
| Gemma 2 2B IT | 0.256 | 0.596 | 0.287 | 0.359 | 0.262 | 0.000 | 0.034 |
| Gemma 2 9B IT | 0.392 | 0.829 | 0.503 | 0.369 | 0.488 | 0.017 | 0.146 |
| Gemma 2 27B IT | 0.441 | 0.846 | 0.572 | 0.404 | 0.560 | 0.033 | 0.230 |
| Gemma-2-Llama Swallow 2B IT | 0.184 | 0.495 | 0.169 | 0.268 | 0.138 | 0.000 | 0.036 |
| Gemma-2-Llama Swallow 9B IT | 0.323 | 0.801 | 0.296 | 0.283 | 0.438 | 0.017 | 0.106 |
| Gemma-2-Llama Swallow 27B IT | 0.393 | 0.786 | 0.436 | 0.343 | 0.544 | 0.033 | 0.218 |
| Gemma 3 1B IT | 0.201 | 0.357 | 0.171 | 0.237 | 0.438 | 0.000 | 0.002 |
| Gemma 3 4B IT | 0.405 | 0.620 | 0.440 | 0.354 | 0.748 | 0.117 | 0.151 |
| Gemma 3 12B IT | 0.524 | 0.816 | 0.617 | 0.389 | 0.862 | 0.217 | 0.247 |
| Gemma 3 27B IT | 0.571 | 0.861 | 0.681 | 0.475 | 0.880 | 0.233 | 0.298 |
| GPT-4.1 (gpt-4.1-2025-04-14) | 0.685 | 0.940 | 0.813 | 0.667 | 0.906 | 0.400 | 0.387 |
| GPT-4o (gpt-4o-2024-08-06) | 0.560 | 0.930 | 0.749 | 0.556 | 0.792 | 0.083 | 0.250 |
| GPT-5 (gpt-5-2025-08-07) | 0.875 | 0.959 | 0.865 | 0.828 | 0.990 | 0.933 | 0.677 |
| gpt-oss-20b | 0.737 | 0.847 | 0.741 | 0.636 | 0.944 | 0.617 | 0.635 |
| gpt-oss-120b | 0.794 | 0.878 | 0.790 | 0.727 | 0.966 | 0.733 | 0.670 |
| Llama 3.1 8B Instruct | 0.387 | 0.769 | 0.489 | 0.374 | 0.526 | 0.033 | 0.131 |
| Llama-3.1-Nemotron-Nano-8B-v1 | 0.588 | 0.518 | 0.566 | 0.470 | 0.948 | 0.550 | 0.478 |
| Llama 3.1 Swallow 8B Instruct v0.3 | 0.291 | 0.725 | 0.287 | 0.293 | 0.338 | 0.000 | 0.102 |
| Llama 3.1 Swallow 8B Instruct v0.5 | 0.315 | 0.648 | 0.399 | 0.318 | 0.452 | 0.000 | 0.072 |
| Llama 3.3 70B Instruct | 0.545 | 0.911 | 0.717 | 0.480 | 0.746 | 0.117 | 0.303 |
| Llama-3.3-Nemotron-Super-49B-v1 | 0.711 | 0.885 | 0.783 | 0.667 | 0.960 | 0.567 | 0.408 |
| Llama 3.3 Swallow 70B Instruct v0.4 | 0.470 | 0.884 | 0.570 | 0.409 | 0.642 | 0.083 | 0.232 |
| Llama 4 Scout Instruct | 0.594 | 0.891 | 0.744 | 0.606 | 0.834 | 0.183 | 0.309 |
| llm-jp-3.1-1.8b-instruct4 | 0.178 | 0.450 | 0.163 | 0.278 | 0.146 | 0.000 | 0.030 |
| llm-jp-3.1-13b-instruct4 | 0.244 | 0.717 | 0.252 | 0.227 | 0.188 | 0.000 | 0.082 |
| MedGemma 27B IT | 0.495 | 0.859 | 0.654 | 0.434 | 0.824 | 0.200 | 0.001 |
| o3 (o3-2025-04-16) | 0.846 | 0.956 | 0.857 | 0.818 | 0.978 | 0.817 | 0.649 |
| o3-mini (o3-mini-2025-01-31) | 0.767 | 0.869 | 0.792 | 0.747 | 0.958 | 0.733 | 0.503 |
| Phi-4 | 0.547 | 0.859 | 0.630 | 0.551 | 0.800 | 0.217 | 0.227 |
| Phi-4-reasoning-plus | 0.469 | 0.260 | 0.113 | 0.611 | 0.770 | 0.583 | 0.478 |
| Qwen2.5-7B-Instruct | 0.454 | 0.820 | 0.554 | 0.348 | 0.742 | 0.100 | 0.158 |
| Qwen2.5-14B-Instruct | 0.514 | 0.886 | 0.652 | 0.404 | 0.794 | 0.133 | 0.215 |
| Qwen2.5-32B-Instruct | 0.543 | 0.908 | 0.640 | 0.480 | 0.812 | 0.150 | 0.270 |
| Qwen3-0.6B | 0.335 | 0.425 | 0.338 | 0.283 | 0.694 | 0.133 | 0.135 |
| Qwen3-1.7B | 0.531 | 0.626 | 0.560 | 0.394 | 0.904 | 0.383 | 0.315 |
| Qwen3-4B | 0.672 | 0.790 | 0.690 | 0.515 | 0.938 | 0.600 | 0.499 |
| Qwen3-8B | 0.715 | 0.851 | 0.713 | 0.561 | 0.942 | 0.700 | 0.525 |
| Qwen3-14B | 0.763 | 0.890 | 0.770 | 0.611 | 0.972 | 0.750 | 0.587 |
| Qwen3-32B | 0.768 | 0.901 | 0.779 | 0.646 | 0.964 | 0.717 | 0.602 |
| Qwen3-235B-A22B-Instruct-2507 | 0.771 | 0.940 | 0.824 | 0.586 | 0.982 | 0.767 | 0.529 |
| Qwen3-235B-A22B-Thinking-2507 | 0.856 | 0.931 | 0.845 | 0.803 | 0.980 | 0.883 | 0.692 |
| Sarashina2.2 3B Instruct v0.1 | 0.318 | 0.613 | 0.329 | 0.293 | 0.570 | 0.017 | 0.086 |

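The per-benchmark columns make the averages easy to sanity-check. Below is a minimal Python sketch (illustrative, not part of the original evaluation harness) that assumes the Post (en) average is the unweighted mean of the six English benchmark scores; the row values are copied from the table above, and the tolerance only absorbs the three-decimal rounding of the displayed numbers.

```python
# Sanity check (illustrative, not from the original harness): a Post (en)
# average is assumed to be the unweighted mean of the six benchmark scores.
from statistics import mean

# Row copied from the Post (en) table: DeepSeek-R1-Distill-Llama-70B.
scores = {
    "HellaSwag": 0.891,
    "MMLU-Pro (en)": 0.776,
    "GPQA (en)": 0.626,
    "MATH-500 (en)": 0.936,
    "AIME 24-25": 0.617,
    "LiveCodeBench": 0.534,
}
reported_avg = 0.730

computed = mean(scores.values())
# The tolerance only covers the three-decimal rounding of the table.
assert abs(computed - reported_avg) < 5e-4, (computed, reported_avg)
print(f"computed {computed:.3f} vs reported {reported_avg:.3f}")
```

Spot checks of other Post (en) rows agree with this arithmetic. The Post (ja) averages, by contrast, do not reduce to a plain mean of the six Japanese columns shown, so they presumably fold in tasks or weighting not listed in this table.
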
| Model name | MT-Bench (ja) avg | Coding | Extraction | Humanities | Math | Reasoning | Roleplay | STEM | Writing |
|---|---|---|---|---|---|---|---|---|---|
| ABEJA-QwQ32b-Reasoning-Japanese-v1.0 | 0.843 | 0.868 | 0.893 | 0.885 | 0.889 | 0.694 | 0.848 | 0.850 | 0.821 |
| CyberAgentLM3-22B-chat | 0.697 | 0.500 | 0.733 | 0.859 | 0.591 | 0.611 | 0.791 | 0.721 | 0.769 |
| DeepSeek-R1-Distill-Llama-8B | 0.526 | 0.376 | 0.625 | 0.681 | 0.595 | 0.496 | 0.483 | 0.510 | 0.442 |
| DeepSeek-R1-Distill-Llama-70B | 0.707 | 0.551 | 0.778 | 0.838 | 0.780 | 0.525 | 0.768 | 0.733 | 0.681 |
| DeepSeek-R1-Distill-Qwen-7B | 0.411 | 0.371 | 0.572 | 0.347 | 0.804 | 0.346 | 0.275 | 0.341 | 0.228 |
| DeepSeek-R1-Distill-Qwen-14B | 0.700 | 0.632 | 0.803 | 0.739 | 0.857 | 0.563 | 0.720 | 0.631 | 0.658 |
| DeepSeek-R1-Distill-Qwen-32B | 0.753 | 0.669 | 0.874 | 0.764 | 0.867 | 0.606 | 0.790 | 0.738 | 0.716 |
| DeepSeek-R1-Distill-Qwen-14B-Japanese | 0.771 | 0.557 | 0.777 | 0.880 | 0.871 | 0.664 | 0.801 | 0.859 | 0.758 |
| DeepSeek-R1-Distill-Qwen-32B-Japanese | 0.808 | 0.639 | 0.813 | 0.917 | 0.924 | 0.652 | 0.842 | 0.872 | 0.802 |
| ELYZA-Thinking-1.0-Qwen-32B | 0.694 | 0.687 | 0.824 | 0.688 | 0.927 | 0.641 | 0.583 | 0.656 | 0.542 |
| Gemma 2 2B IT | 0.555 | 0.460 | 0.585 | 0.673 | 0.448 | 0.422 | 0.641 | 0.571 | 0.639 |
| Gemma 2 9B IT | 0.743 | 0.635 | 0.816 | 0.865 | 0.686 | 0.649 | 0.784 | 0.734 | 0.773 |
| Gemma 2 27B IT | 0.762 | 0.760 | 0.825 | 0.874 | 0.697 | 0.578 | 0.818 | 0.745 | 0.796 |
| Gemma-2-Llama Swallow 2B IT | 0.583 | 0.408 | 0.551 | 0.774 | 0.420 | 0.418 | 0.725 | 0.655 | 0.709 |
| Gemma-2-Llama Swallow 9B IT | 0.729 | 0.579 | 0.787 | 0.880 | 0.661 | 0.616 | 0.788 | 0.735 | 0.783 |
| Gemma-2-Llama Swallow 27B IT | 0.759 | 0.627 | 0.846 | 0.868 | 0.767 | 0.548 | 0.796 | 0.785 | 0.833 |
| Gemma 3 1B IT | 0.434 | 0.396 | 0.484 | 0.519 | 0.343 | 0.337 | 0.519 | 0.434 | 0.436 |
| Gemma 3 4B IT | 0.735 | 0.727 | 0.650 | 0.814 | 0.826 | 0.482 | 0.787 | 0.796 | 0.802 |
| Gemma 3 12B IT | 0.811 | 0.784 | 0.807 | 0.880 | 0.858 | 0.582 | 0.856 | 0.878 | 0.844 |
| Gemma 3 27B IT | 0.830 | 0.747 | 0.942 | 0.878 | 0.808 | 0.733 | 0.849 | 0.853 | 0.831 |
| GPT-4.1 (gpt-4.1-2025-04-14) | 0.892 | 0.917 | 0.911 | 0.885 | 0.980 | 0.819 | 0.879 | 0.887 | 0.858 |
| GPT-4o (gpt-4o-2024-08-06) | 0.865 | 0.896 | 0.929 | 0.874 | 0.895 | 0.755 | 0.869 | 0.847 | 0.855 |
| GPT-5 (gpt-5-2025-08-07) | 0.882 | 0.893 | 0.883 | 0.928 | 0.882 | 0.758 | 0.896 | 0.933 | 0.885 |
| gpt-oss-20b | 0.869 | 0.914 | 0.917 | 0.853 | 0.994 | 0.772 | 0.772 | 0.909 | 0.824 |
| gpt-oss-120b | 0.907 | 0.898 | 0.924 | 0.915 | 0.999 | 0.862 | 0.855 | 0.948 | 0.852 |
| Llama 3.1 8B Instruct | 0.592 | 0.528 | 0.848 | 0.585 | 0.600 | 0.465 | 0.569 | 0.562 | 0.577 |
| Llama-3.1-Nemotron-Nano-8B-v1 | 0.363 | 0.374 | 0.503 | 0.311 | 0.564 | 0.270 | 0.289 | 0.301 | 0.293 |
| Llama 3.1 Swallow 8B Instruct v0.3 | 0.709 | 0.570 | 0.783 | 0.869 | 0.631 | 0.506 | 0.782 | 0.716 | 0.813 |
| Llama 3.1 Swallow 8B Instruct v0.5 | 0.726 | 0.590 | 0.843 | 0.884 | 0.470 | 0.618 | 0.780 | 0.799 | 0.822 |
| Llama 3.3 70B Instruct | 0.735 | 0.672 | 0.878 | 0.751 | 0.742 | 0.638 | 0.762 | 0.735 | 0.700 |
| Llama-3.3-Nemotron-Super-49B-v1 | 0.806 | 0.731 | 0.898 | 0.821 | 0.801 | 0.755 | 0.804 | 0.809 | 0.828 |
| Llama 3.3 Swallow 70B Instruct v0.4 | 0.791 | 0.696 | 0.856 | 0.881 | 0.807 | 0.664 | 0.827 | 0.772 | 0.822 |
| Llama 4 Scout Instruct | 0.789 | 0.763 | 0.923 | 0.816 | 0.879 | 0.615 | 0.787 | 0.752 | 0.778 |
| llm-jp-3.1-1.8b-instruct4 | 0.657 | 0.574 | 0.601 | 0.809 | 0.672 | 0.446 | 0.767 | 0.697 | 0.693 |
| llm-jp-3.1-13b-instruct4 | 0.733 | 0.587 | 0.700 | 0.870 | 0.731 | 0.559 | 0.831 | 0.775 | 0.807 |
| MedGemma 27B IT | 0.778 | 0.799 | 0.926 | 0.805 | 0.883 | 0.646 | 0.718 | 0.758 | 0.686 |
| o3 (o3-2025-04-16) | 0.903 | 0.935 | 0.898 | 0.888 | 0.995 | 0.809 | 0.889 | 0.941 | 0.867 |
| o3-mini (o3-mini-2025-01-31) | 0.880 | 0.868 | 0.937 | 0.860 | 0.952 | 0.802 | 0.863 | 0.893 | 0.868 |
| Phi-4 | 0.822 | 0.752 | 0.933 | 0.862 | 0.890 | 0.629 | 0.830 | 0.845 | 0.835 |
| Phi-4-reasoning-plus | 0.374 | 0.205 | 0.376 | 0.206 | 0.379 | 0.283 | 0.643 | 0.162 | 0.741 |
| Qwen2.5-7B-Instruct | 0.688 | 0.638 | 0.711 | 0.782 | 0.685 | 0.494 | 0.736 | 0.730 | 0.729 |
| Qwen2.5-14B-Instruct | 0.799 | 0.773 | 0.882 | 0.850 | 0.796 | 0.646 | 0.829 | 0.795 | 0.822 |
| Qwen2.5-32B-Instruct | 0.819 | 0.776 | 0.913 | 0.845 | 0.863 | 0.706 | 0.839 | 0.802 | 0.811 |
| Qwen3-0.6B | 0.431 | 0.332 | 0.423 | 0.460 | 0.626 | 0.346 | 0.418 | 0.445 | 0.402 |
| Qwen3-1.7B | 0.662 | 0.574 | 0.591 | 0.715 | 0.841 | 0.567 | 0.631 | 0.765 | 0.613 |
| Qwen3-4B | 0.797 | 0.696 | 0.818 | 0.855 | 0.947 | 0.729 | 0.747 | 0.826 | 0.760 |
| Qwen3-8B | 0.845 | 0.757 | 0.834 | 0.890 | 0.996 | 0.823 | 0.829 | 0.822 | 0.806 |
| Qwen3-14B | 0.874 | 0.850 | 0.839 | 0.903 | 0.994 | 0.824 | 0.839 | 0.919 | 0.827 |
| Qwen3-32B | 0.875 | 0.794 | 0.871 | 0.871 | 0.997 | 0.836 | 0.881 | 0.917 | 0.830 |
| Qwen3-235B-A22B-Instruct-2507 | 0.915 | 0.943 | 0.938 | 0.907 | 0.987 | 0.826 | 0.893 | 0.933 | 0.891 |
| Qwen3-235B-A22B-Thinking-2507 | 0.904 | 0.896 | 0.878 | 0.933 | 0.985 | 0.851 | 0.876 | 0.955 | 0.861 |
| Sarashina2.2 3B Instruct v0.1 | 0.721 | 0.579 | 0.680 | 0.862 | 0.828 | 0.467 | 0.832 | 0.766 | 0.752 |

| Model name | MT-Bench (en) avg | Coding | Extraction | Humanities | Math | Reasoning | Roleplay | STEM | Writing |
|---|---|---|---|---|---|---|---|---|---|
| ABEJA-QwQ32b-Reasoning-Japanese-v1.0 | 0.866 | 0.808 | 0.878 | 0.899 | 0.951 | 0.757 | 0.872 | 0.882 | 0.881 |
| CyberAgentLM3-22B-chat | 0.621 | 0.467 | 0.695 | 0.828 | 0.479 | 0.429 | 0.678 | 0.647 | 0.747 |
| DeepSeek-R1-Distill-Llama-8B | 0.704 | 0.398 | 0.745 | 0.825 | 0.827 | 0.562 | 0.768 | 0.731 | 0.775 |
| DeepSeek-R1-Distill-Llama-70B | 0.842 | 0.787 | 0.931 | 0.862 | 0.919 | 0.723 | 0.850 | 0.806 | 0.854 |
| DeepSeek-R1-Distill-Qwen-7B | 0.649 | 0.481 | 0.656 | 0.708 | 0.762 | 0.520 | 0.686 | 0.700 | 0.677 |
| DeepSeek-R1-Distill-Qwen-14B | 0.775 | 0.512 | 0.851 | 0.815 | 0.886 | 0.745 | 0.841 | 0.750 | 0.803 |
| DeepSeek-R1-Distill-Qwen-32B | 0.822 | 0.619 | 0.901 | 0.869 | 0.918 | 0.793 | 0.861 | 0.768 | 0.850 |
| DeepSeek-R1-Distill-Qwen-14B-Japanese | 0.835 | 0.724 | 0.884 | 0.870 | 0.907 | 0.771 | 0.867 | 0.817 | 0.838 |
| DeepSeek-R1-Distill-Qwen-32B-Japanese | 0.857 | 0.730 | 0.893 | 0.894 | 0.964 | 0.770 | 0.871 | 0.872 | 0.861 |
| ELYZA-Thinking-1.0-Qwen-32B | 0.748 | 0.770 | 0.913 | 0.754 | 0.912 | 0.775 | 0.617 | 0.639 | 0.606 |
| Gemma 2 2B IT | 0.718 | 0.543 | 0.687 | 0.868 | 0.659 | 0.609 | 0.780 | 0.780 | 0.816 |
| Gemma 2 9B IT | 0.761 | 0.624 | 0.799 | 0.893 | 0.682 | 0.610 | 0.832 | 0.808 | 0.841 |
| Gemma 2 27B IT | 0.800 | 0.701 | 0.855 | 0.891 | 0.724 | 0.702 | 0.843 | 0.827 | 0.858 |
| Gemma-2-Llama Swallow 2B IT | 0.584 | 0.461 | 0.534 | 0.758 | 0.376 | 0.452 | 0.728 | 0.646 | 0.715 |
| Gemma-2-Llama Swallow 9B IT | 0.734 | 0.523 | 0.831 | 0.886 | 0.720 | 0.518 | 0.789 | 0.786 | 0.819 |
| Gemma-2-Llama Swallow 27B IT | 0.771 | 0.626 | 0.789 | 0.883 | 0.751 | 0.622 | 0.824 | 0.821 | 0.853 |
| Gemma 3 1B IT | 0.578 | 0.503 | 0.453 | 0.719 | 0.709 | 0.366 | 0.634 | 0.686 | 0.553 |
| Gemma 3 4B IT | 0.793 | 0.704 | 0.762 | 0.901 | 0.855 | 0.553 | 0.869 | 0.845 | 0.857 |
| Gemma 3 12B IT | 0.860 | 0.741 | 0.862 | 0.917 | 0.932 | 0.758 | 0.891 | 0.899 | 0.879 |
| Gemma 3 27B IT | 0.880 | 0.767 | 0.919 | 0.920 | 0.924 | 0.797 | 0.908 | 0.917 | 0.888 |
| GPT-4.1 (gpt-4.1-2025-04-14) | 0.908 | 0.898 | 0.936 | 0.903 | 0.942 | 0.863 | 0.901 | 0.925 | 0.898 |
| GPT-4o (gpt-4o-2024-08-06) | 0.922 | 0.943 | 0.927 | 0.896 | 0.993 | 0.976 | 0.874 | 0.905 | 0.865 |
| GPT-5 (gpt-5-2025-08-07) | 0.888 | 0.876 | 0.912 | 0.923 | 0.843 | 0.811 | 0.906 | 0.927 | 0.904 |
| gpt-oss-20b | 0.889 | 0.913 | 0.881 | 0.935 | 0.913 | 0.779 | 0.880 | 0.939 | 0.869 |
| gpt-oss-120b | 0.918 | 0.947 | 0.892 | 0.915 | 0.989 | 0.871 | 0.886 | 0.960 | 0.886 |
| Llama 3.1 8B Instruct | 0.737 | 0.556 | 0.816 | 0.871 | 0.697 | 0.522 | 0.821 | 0.765 | 0.850 |
| Llama-3.1-Nemotron-Nano-8B-v1 | 0.701 | 0.658 | 0.654 | 0.696 | 0.906 | 0.526 | 0.712 | 0.738 | 0.720 |
| Llama 3.1 Swallow 8B Instruct v0.3 | 0.691 | 0.528 | 0.714 | 0.886 | 0.562 | 0.458 | 0.773 | 0.768 | 0.838 |
| Llama 3.1 Swallow 8B Instruct v0.5 | 0.753 | 0.576 | 0.801 | 0.900 | 0.769 | 0.499 | 0.848 | 0.796 | 0.833 |
| Llama 3.3 70B Instruct | 0.863 | 0.795 | 0.935 | 0.891 | 0.895 | 0.861 | 0.858 | 0.822 | 0.847 |
| Llama-3.3-Nemotron-Super-49B-v1 | 0.881 | 0.782 | 0.915 | 0.910 | 0.963 | 0.800 | 0.878 | 0.908 | 0.893 |
| Llama 3.3 Swallow 70B Instruct v0.4 | 0.816 | 0.672 | 0.902 | 0.888 | 0.839 | 0.706 | 0.828 | 0.838 | 0.855 |
| Llama 4 Scout Instruct | 0.857 | 0.722 | 0.911 | 0.860 | 0.920 | 0.904 | 0.836 | 0.840 | 0.862 |
| llm-jp-3.1-1.8b-instruct4 | 0.548 | 0.454 | 0.482 | 0.662 | 0.521 | 0.364 | 0.665 | 0.563 | 0.673 |
| llm-jp-3.1-13b-instruct4 | 0.682 | 0.562 | 0.681 | 0.844 | 0.625 | 0.512 | 0.736 | 0.715 | 0.779 |
| MedGemma 27B IT | 0.830 | 0.722 | 0.914 | 0.884 | 0.970 | 0.735 | 0.819 | 0.858 | 0.737 |
| o3 (o3-2025-04-16) | 0.917 | 0.929 | 0.931 | 0.945 | 0.964 | 0.836 | 0.900 | 0.938 | 0.892 |
| o3-mini (o3-mini-2025-01-31) | 0.901 | 0.876 | 0.913 | 0.891 | 0.969 | 0.865 | 0.895 | 0.914 | 0.882 |
| Phi-4 | 0.881 | 0.771 | 0.904 | 0.876 | 0.928 | 0.933 | 0.889 | 0.879 | 0.865 |
| Phi-4-reasoning-plus | 0.426 | 0.281 | 0.384 | 0.116 | 0.437 | 0.322 | 0.769 | 0.299 | 0.800 |
| Qwen2.5-7B-Instruct | 0.797 | 0.656 | 0.769 | 0.893 | 0.843 | 0.662 | 0.832 | 0.886 | 0.833 |
| Qwen2.5-14B-Instruct | 0.865 | 0.752 | 0.873 | 0.899 | 0.932 | 0.861 | 0.870 | 0.894 | 0.839 |
| Qwen2.5-32B-Instruct | 0.869 | 0.806 | 0.862 | 0.895 | 0.954 | 0.817 | 0.876 | 0.890 | 0.851 |
| Qwen3-0.6B | 0.595 | 0.376 | 0.678 | 0.673 | 0.803 | 0.408 | 0.551 | 0.633 | 0.637 |
| Qwen3-1.7B | 0.779 | 0.642 | 0.754 | 0.830 | 0.968 | 0.686 | 0.764 | 0.785 | 0.805 |
| Qwen3-4B | 0.839 | 0.737 | 0.831 | 0.884 | 0.947 | 0.735 | 0.870 | 0.861 | 0.845 |
| Qwen3-8B | 0.851 | 0.804 | 0.831 | 0.892 | 0.980 | 0.713 | 0.858 | 0.876 | 0.854 |
| Qwen3-14B | 0.882 | 0.843 | 0.849 | 0.904 | 0.971 | 0.805 | 0.878 | 0.919 | 0.890 |
| Qwen3-32B | 0.892 | 0.860 | 0.910 | 0.905 | 0.979 | 0.796 | 0.899 | 0.919 | 0.869 |
| Qwen3-235B-A22B-Instruct-2507 | 0.911 | 0.888 | 0.859 | 0.925 | 0.990 | 0.873 | 0.911 | 0.940 | 0.905 |
| Qwen3-235B-A22B-Thinking-2507 | 0.922 | 0.877 | 0.899 | 0.945 | 0.998 | 0.886 | 0.908 | 0.952 | 0.908 |
| Sarashina2.2 3B Instruct v0.1 | 0.708 | 0.499 | 0.642 | 0.863 | 0.747 | 0.552 | 0.783 | 0.827 | 0.750 |

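The MT-Bench averages check out the same way: spot checks suggest each is the unweighted mean of its eight category scores. A compact version of the check, with the CyberAgentLM3-22B-chat row copied from the MT-Bench (ja) table (a sketch assuming a plain mean, not the original evaluation code):

```python
# Same check for MT-Bench (illustrative): the average is assumed to be the
# unweighted mean of the eight category scores.
from statistics import mean

# Row copied from the MT-Bench (ja) table: CyberAgentLM3-22B-chat.
categories = {
    "Coding": 0.500, "Extraction": 0.733, "Humanities": 0.859,
    "Math": 0.591, "Reasoning": 0.611, "Roleplay": 0.791,
    "STEM": 0.721, "Writing": 0.769,
}
reported_avg = 0.697

computed = mean(categories.values())  # 0.696875, i.e. 0.697 at 3 decimals
assert abs(computed - reported_avg) < 5e-4, (computed, reported_avg)
```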