**More Evaluation Results**
Model | HellaSwag | PIQA | WinoGrande | RACE-Middle | RACE-High | TriviaQA | NaturalQuestions | MMLU | MMLU (LM) | ARC-Easy | ARC-Challenge | GSM8K | HumanEval | MBPP | DROP (EM) | DROP (F1) | OpenBookQA | Pile-test | Pile-test (BPB) | BBH | AGIEval | CLUEWSC | CHID | CEval | CMMLU |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LLaMA2 7B | 75.6 | 78.0 | 69.6 | 60.7 | 45.8 | 63.8 | 25.5 | 45.8 | 44.5 | 69.1 | 49.0 | 15.5 | 14.6 | 21.8 | 34.0 | 39.8 | 57.4 | 1.739 | 0.764 | 38.5 | 22.8 | 64.0 | 37.9 | 33.9 | 32.6 |
Qwen 7B v2 | 73.7 | 77.5 | 67.3 | 57.0 | 43.1 | 59.6 | 32.3 | 57.3 | 40.9 | 56.5 | 41.5 | 52.1 | 32.3 | 37.2 | 43.4 | 51.7 | 48.6 | 2.025 | 0.756 | 47.3 | 29.3 | 76.5 | 86.6 | 62.3 | 62.6 |
Baichuan2 7B | 67.9 | 73.6 | 60.2 | 59.8 | 45.1 | 59.1 | 21.3 | 53.4 | 35.5 | 44.6 | 36.8 | 23.4 | 22.0 | 26.0 | 31.6 | 37.1 | 34.8 | 1.842 | 0.781 | 41.6 | 42.7 | 69.6 | 80.4 | 54.2 | 56.2 |
DeepSeek 7B Base | 75.4 | 79.2 | 70.5 | 63.2 | 46.5 | 59.7 | 22.2 | 48.2 | 42.9 | 67.9 | 48.1 | 17.4 | 26.2 | 39.0 | 34.9 | 41.0 | 55.8 | 1.871 | 0.746 | 39.5 | 26.4 | 73.1 | 89.3 | 45.0 | 47.2 |
DeepSeek 7B Chat | 68.5 | 77.6 | 66.9 | 65.2 | 50.8 | 57.9 | 32.5 | 49.4 | 42.3 | 71.0 | 49.4 | 62.6 | 48.2 | 35.2 | 37.5 | 49.1 | 54.8 | / | / | 42.3 | 19.3 | 71.9 | 64.9 | 47.0 | 49.7 |
LLaMA2 70B | 84.0 | 82.0 | 80.4 | 70.1 | 54.3 | 79.5 | 36.1 | 69.0 | 53.5 | 76.5 | 59.5 | 58.4 | 28.7 | 45.6 | 63.6 | 69.2 | 60.4 | 1.526 | 0.671 | 62.9 | 37.2 | 76.5 | 55.5 | 51.4 | 53.1 |
DeepSeek 67B Base | 84.0 | 83.6 | 79.8 | 69.9 | 50.7 | 78.9 | 36.6 | 71.3 | 54.1 | 76.9 | 59.0 | 63.4 | 42.7 | 57.4 | 61.0 | 67.9 | 60.2 | 1.660 | 0.662 | 68.7 | 41.3 | 81.0 | 92.1 | 66.1 | 70.8 |
DeepSeek 67B Chat | 75.7 | 82.6 | 76.0 | 70.9 | 56.0 | 81.5 | 47.0 | 71.1 | 55.0 | 81.6 | 64.1 | 84.1 | 73.8 | 61.4 | 59.4 | 71.9 | 63.2 | / | / | 71.7 | 46.4 | 60.0 | 72.6 | 65.2 | 67.8 |
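The two Pile-test columns report language-modeling quality on the Pile test set; BPB (bits per byte) normalizes cross-entropy by UTF-8 byte count rather than token count, so models with different tokenizers remain comparable. A minimal Python sketch of that conversion, assuming the first Pile-test column is mean per-token cross-entropy in nats (an interpretation on our part, not stated in the table):

```python
import math

def bits_per_byte(nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean per-token cross-entropy (nats) to bits per byte (BPB).

    Normalizing by byte count instead of token count removes the effect
    of tokenizer choice, which is why BPB is used for cross-model
    comparisons on Pile-test.
    """
    total_nats = nats_per_token * n_tokens
    return total_nats / (n_bytes * math.log(2))

# Illustrative numbers only: 1.9 nats/token over text that tokenizes to
# 1,000,000 tokens spanning 3,600,000 UTF-8 bytes.
print(bits_per_byte(1.9, 1_000_000, 3_600_000))  # ~0.76 BPB
```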
**Math evaluation results of DeepSeek LLM 67B Chat**
Inference | GSM8K | MATH | MGSM-zh | CMATH | Gaokao-MathCloze | Gaokao-MathQA |
---|---|---|---|---|---|---|
CoT | 84.1% | 32.6% | 74.0% | 80.3% | 16.9% | 20.2% |
Tool-Integrated Reasoning | 86.7% | 51.1% | 76.4% | 85.4% | 21.2% | 28.2% |
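Accuracies like these hinge on extracting a final answer from free-form chain-of-thought output and exact-matching it against the reference. The exact harness is not reproduced here; the sketch below uses the common last-number heuristic for GSM8K-style answers (the function names are ours):

```python
import re

def extract_final_number(text: str) -> str | None:
    """Pull the last number out of a completion (a common heuristic
    for GSM8K-style chain-of-thought outputs)."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return nums[-1] if nums else None

def gsm8k_accuracy(completions: list[str], references: list[str]) -> float:
    """Exact-match accuracy over extracted final answers."""
    hits = sum(
        extract_final_number(c) == extract_final_number(r)
        for c, r in zip(completions, references)
    )
    return hits / len(references)

# Hypothetical usage:
# acc = gsm8k_accuracy(model_outputs, [ex["answer"] for ex in gsm8k_test])
```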
**Never Seen Before Exam**

Following Grok-1, we evaluate on the Hungarian National High-School Exam, a fresh problem set chosen to avoid test-set contamination.
Model | DeepSeek LLM 67B Chat | Qwen-14B-Chat | ChatGLM3-6B | Baichuan2-Chat-13B | Yi-Chat-34B | GPT-3.5-Turbo | Grok-1 | Claude 2 | GPT-4 |
---|---|---|---|---|---|---|---|---|---|
Hungarian National High-School Exam | 58 | 36.5 | 32 | 19.5 | 39 | 41 | 59 | 55 | 68 |
**Instruction Following Evaluation**

Model | Qwen-14B-Chat | ChatGLM3-6B | Baichuan2-Chat-13B | Yi-Chat-34B | PaLM2 Small | DeepSeek LLM 67B Chat | GPT-4 |
---|---|---|---|---|---|---|---|
Prompt-level Instruction Following | 48.9 | 35.0 | 51.0 | 51.2 | 46.9 | 59.1 | 79.3 |
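"Prompt-level" follows the convention of Google's instruction-following benchmark: a prompt counts as correct only when every verifiable instruction attached to it is satisfied, which is stricter than per-instruction accuracy. A minimal sketch of that aggregation (illustrative only; the official scorer also distinguishes strict vs. loose matching):

```python
def prompt_level_accuracy(results: list[list[bool]]) -> float:
    """Fraction of prompts for which *every* verifiable instruction
    was followed (prompt-level accuracy)."""
    return sum(all(per_instruction) for per_instruction in results) / len(results)

# Hypothetical: three prompts carrying 2, 3, and 1 verifiable instructions.
print(prompt_level_accuracy([[True, True], [True, False, True], [True]]))  # ~0.667
```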
**LeetCode Weekly Contest**

Model | Qwen-14B-Chat | ChatGLM3-6B | Baichuan2-Chat-13B | Yi-Chat-34B | GPT-3.5-Turbo | Phind-CodeLlama-34B-v2 | DeepSeek LLM 67B Chat | DeepSeek Coder 33B | GPT-4 |
---|---|---|---|---|---|---|---|---|---|
LeetCode Weekly Contest | 11.1 | 2.38 | 1.58 | 7.9 | 20.6 | 12.6 | 17.5 | 31.7 | 48.4 |
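The contest numbers are the percentage of problems solved, i.e. essentially pass@1 under whatever sampling each report used. For reference, the standard unbiased pass@k estimator from Chen et al. (2021) can be computed as below (a sketch; the table does not specify the number of samples n or k):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that
    at least one of k samples, drawn from n generations of which c pass
    all tests, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Hypothetical: 20 samples per problem, 5 pass the judge.
print(pass_at_k(n=20, c=5, k=1))  # 0.25
```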