| Category | Benchmark | Phi-3.5 Mini-Ins | Mistral-Nemo-12B-Ins-2407 | Llama-3.1-8B-Ins | Gemma-2-9B-Ins | Gemini 1.5 Flash |
|---|---|---|---|---|---|---|
| Popular aggregated benchmark | Arena Hard | 37 | 39.4 | 25.7 | 42 | 55.2 |
| | BigBench Hard CoT (0-shot) | 69 | 60.2 | 63.4 | 63.5 | 66.7 |
| | MMLU (5-shot) | 69 | 67.2 | 68.1 | 71.3 | 78.7 |
| | MMLU-Pro (0-shot, CoT) | 47.4 | 40.7 | 44 | 50.1 | 57.2 |
| Reasoning | ARC Challenge (10-shot) | 84.6 | 84.8 | 83.1 | 89.8 | 92.8 |
| | TruthfulQA (MC2) (10-shot) | 64 | 68.1 | 69.2 | 76.6 | 76.6 |
| | WinoGrande (5-shot) | 68.5 | 70.4 | 64.7 | 74 | 74.7 |
| Multilingual | Multilingual MMLU (5-shot) | 55.4 | 58.9 | 56.2 | 63.8 | 77.2 |
| Math | GSM8K (8-shot, CoT) | 86.2 | 84.2 | 82.4 | 84.9 | 82.4 |
| | MATH (0-shot, CoT) | 48.5 | 31.2 | 47.6 | 50.9 | 38 |
| Long context | Qasper | 41.9 | 30.7 | 37.2 | 13.9 | 43.5 |
| | SQuALITY | 24.3 | 25.8 | 26.2 | 0 | 23.5 |
| Code generation | HumanEval (0-shot) | 62.8 | 63.4 | 66.5 | 61 | 74.4 |
| | MBPP (3-shot) | 69.6 | 68.1 | 69.4 | 69.3 | 77.5 |
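To compare the models at a glance, one can average each column of the table. This is only an illustrative sketch (an unweighted mean across heterogeneous benchmarks, which the model card itself does not define); the dictionary below simply transcribes the per-model scores in the table's row order.

```python
# Hypothetical sketch: unweighted mean score per model, using the 14 rows
# of the benchmark table above (values transcribed in row order).
scores = {
    "Phi-3.5 Mini-Ins":          [37, 69, 69, 47.4, 84.6, 64, 68.5, 55.4, 86.2, 48.5, 41.9, 24.3, 62.8, 69.6],
    "Mistral-Nemo-12B-Ins-2407": [39.4, 60.2, 67.2, 40.7, 84.8, 68.1, 70.4, 58.9, 84.2, 31.2, 30.7, 25.8, 63.4, 68.1],
    "Llama-3.1-8B-Ins":          [25.7, 63.4, 68.1, 44, 83.1, 69.2, 64.7, 56.2, 82.4, 47.6, 37.2, 26.2, 66.5, 69.4],
    "Gemma-2-9B-Ins":            [42, 63.5, 71.3, 50.1, 89.8, 76.6, 74, 63.8, 84.9, 50.9, 13.9, 0, 61, 69.3],
    "Gemini 1.5 Flash":          [55.2, 66.7, 78.7, 57.2, 92.8, 76.6, 74.7, 77.2, 82.4, 38, 43.5, 23.5, 74.4, 77.5],
}

# Round to one decimal, matching the precision used in the table.
averages = {model: round(sum(vals) / len(vals), 1) for model, vals in scores.items()}

for model, avg in sorted(averages.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {avg}")
```

Note that an unweighted mean treats every benchmark equally regardless of difficulty or score scale, so it is a rough summary rather than a principled ranking.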