ZhihongShao
2024-02-06 10:27:40 +08:00
commit 21cc5c6701
59 changed files with 17325 additions and 0 deletions

evaluation/README.md Normal file

@@ -0,0 +1,41 @@
## 1. Introduction
We provide a test script for both zero-shot and few-shot evaluation on the mathematical reasoning benchmarks used in our paper.
## 2. Setup
First, configure the `prefix` in `environment.yml`, and then run the following command:
```
conda env create -f environment.yml
```
## 3. Evaluation
For chain-of-thought evaluation of DeepSeekMath-Instruct and DeepSeekMath-RL, our script (see `markup_question()` in `run_subset_parallel.py`) processes each question as follows:
* English questions: `{question}\nPlease reason step by step, and put your final answer within \\boxed{}.`
* Chinese questions: `{question}\n请通过逐步推理来解答问题并把最终答案放置于\\boxed{}中。`
For tool-integrated reasoning, we process each question as follows:
* English questions: `{question}\nPlease integrate natural language reasoning with programs to solve the problem above, and put your final answer within \\boxed{}.`
* Chinese questions: `{question}\n请结合自然语言和Python程序语言来解答问题并把最终答案放置于\\boxed{}中。`
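
The four templates above can be sketched as a small helper. This is an illustrative sketch, not the repository's `markup_question()`: the function name `build_prompt`, its `language` and `mode` parameters, and the `TEMPLATES` table are hypothetical. It also assumes that the `\\boxed{}` shown above is an escaped string literal, so the final prompt contains a single backslash.

```python
# Hypothetical helper: names and parameters are assumptions for illustration,
# not the actual code in run_subset_parallel.py.

# Instruction suffixes copied from the templates above. In Python source,
# "\\boxed{}" produces the text \boxed{} (assumed to be the intended prompt).
TEMPLATES = {
    ("en", "cot"): "Please reason step by step, and put your final answer within \\boxed{}.",
    ("zh", "cot"): "请通过逐步推理来解答问题并把最终答案放置于\\boxed{}中。",
    ("en", "tool"): (
        "Please integrate natural language reasoning with programs to solve "
        "the problem above, and put your final answer within \\boxed{}."
    ),
    ("zh", "tool"): "请结合自然语言和Python程序语言来解答问题并把最终答案放置于\\boxed{}中。",
}


def build_prompt(question: str, language: str = "en", mode: str = "cot") -> str:
    """Append the instruction for the given language and reasoning mode."""
    try:
        suffix = TEMPLATES[(language, mode)]
    except KeyError:
        raise ValueError(f"unsupported language/mode: {language!r}/{mode!r}") from None
    # Plain concatenation avoids str.format, which would misread the literal "{}".
    return question + "\n" + suffix


if __name__ == "__main__":
    print(build_prompt("What is 1 + 1?"))
```

Keeping the suffixes in one table makes it easy to check each prompt against the text above, and concatenation sidesteps any brace escaping.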
If you wish to use a different model or dataset, modify the configs in `submit_eval_jobs.py` and `configs/*test_configs.json`.
The following example tests DeepSeekMath-Base 7B using 8 GPUs:
```
python submit_eval_jobs.py --n-gpus 8
```
Wait for all processes to finish, and then run the following command to aggregate their results:
```
python summarize_results.py [--eval-atp]
```
The optional `--eval-atp` flag invokes `unsafe_score_minif2f_isabelle.py` to evaluate the informal-to-formal proving results. Make sure you have set up the [PISA](https://github.com/wellecks/lm-evaluation-harness/blob/minif2f-isabelle/docs/isabelle_setup.md) server before using this option.
A summary of all evaluation results will be saved as `evaluation_results.json`.
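
Once the summary exists, it can be inspected from Python. The sketch below assumes only that `evaluation_results.json` is valid JSON located in the current working directory. Its internal structure is not described here, so the code prints whatever it finds instead of relying on specific keys. The helper name `load_results` is hypothetical.

```python
import json
from pathlib import Path


def load_results(path="evaluation_results.json"):
    """Read the aggregated summary; no particular schema is assumed."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    results = load_results()
    # ensure_ascii=False keeps any non-ASCII text readable in the output.
    print(json.dumps(results, indent=2, ensure_ascii=False))
```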
## 4. Model Outputs
We provide all model outputs in `outputs.zip`.