diff --git a/README.md b/README.md
index c1db533..00060e6 100644
--- a/README.md
+++ b/README.md
@@ -126,7 +126,7 @@ In line with Grok-1, we have evaluated the model's mathematical capabilities usi
 
 result
 
-**Remark:** Some results are obtained by DeepSeek LLM authors, while others are done by Grok-1 authors. We found some models count the score of the last question (Llemma 34b and Mammoth) while some (MetaMath-7B) are not in the original evaluation. In our evaluation, we count the last question score. Evaluation details are [here](https://github.com/deepseek-ai/DeepSeek-LLM/tree/HEAD/evaluation/hungarian_national_hs_solutions).
+**Remark:** We have rectified an error from our initial evaluation. In this revised version, we have omitted the lowest scores for questions 16, 17, 18, as well as for the aforementioned image. Evaluation details are [here](https://github.com/deepseek-ai/DeepSeek-LLM/tree/HEAD/evaluation/hungarian_national_hs_solutions).
 
 ---
 
diff --git a/evaluation/more_results.md b/evaluation/more_results.md
index 5f13733..b81314b 100644
--- a/evaluation/more_results.md
+++ b/evaluation/more_results.md
@@ -22,7 +22,7 @@
 
 | Model                                | DeepSeek LLM 67B Chat | Qwen-14B-Chat | ChatGLM3-6B | Baichuan2-Chat-13B | Yi-Chat-34B | GPT-3.5-Turbo | Grok-1 | Claude 2 | GPT-4 |
 |:------------------------------------:|:---------------------:|:-------------:|:-----------:|:------------------:|:-----------:|:-------------:|:------:|:--------:|:-----:|
-| Hungarian National High-School Exam | **65**                | 38.5          | 37          | 20.5               | 44          | 41            | 59     | 55       | 68    |
+| Hungarian National High-School Exam | **58**                | 36.5          | 32          | 19.5               | 39          | 41            | 59     | 55       | 68    |
 
 | Model | Qwen-14B-Chat | ChatGLM3-6B | Baichuan2-Chat-13B | Yi-Chat-34B | PaLM2 Small | DeepSeek LLM 67B Chat | GPT-4 |
diff --git a/images/mathexam.png b/images/mathexam.png
index 3b079dc..d642833 100644
Binary files a/images/mathexam.png and b/images/mathexam.png differ