## 1. Introduction
We provide a test script for both zero-shot and few-shot evaluation on the mathematical reasoning benchmarks used in our paper.
## 2. Setup
First configure the `prefix` in `environment.yml`, and then run the following command:
```
conda env create -f environment.yml
```
## 3. Evaluation
For chain-of-thought evaluation of DeepSeekMath-Instruct and DeepSeekMath-RL, our script (see `def markup_question()` in `run_subset_parallel.py`) processes each question as follows:
* English questions: `{question}\nPlease reason step by step, and put your final answer within \\boxed{}.`
* Chinese questions: `{question}\n请通过逐步推理来解答问题,并把最终答案放置于\\boxed{}中。`

For tool-integrated reasoning, we process each question as follows:
* English questions: `{question}\nPlease integrate natural language reasoning with programs to solve the problem above, and put your final answer within \\boxed{}.`
* Chinese questions: `{question}\n请结合自然语言和Python程序语言来解答问题,并把最终答案放置于\\boxed{}中。`
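
For reference, the sketch below shows how such a markup step could look. It is only an illustration built from the templates listed above; the function signature and the `language`/`use_tool` arguments are assumptions, not the actual code of `markup_question()` in `run_subset_parallel.py`.

```
# Illustrative sketch of the question markup described above (assumed signature).
# The real implementation is `markup_question()` in `run_subset_parallel.py`.
COT_EN = "{question}\nPlease reason step by step, and put your final answer within \\boxed{{}}."
COT_ZH = "{question}\n请通过逐步推理来解答问题,并把最终答案放置于\\boxed{{}}中。"
TOOL_EN = ("{question}\nPlease integrate natural language reasoning with programs "
           "to solve the problem above, and put your final answer within \\boxed{{}}.")
TOOL_ZH = "{question}\n请结合自然语言和Python程序语言来解答问题,并把最终答案放置于\\boxed{{}}中。"

def markup_question(question, language="en", use_tool=False):
    """Wrap a raw question with the chain-of-thought or tool-integrated instruction."""
    if use_tool:
        template = TOOL_ZH if language == "zh" else TOOL_EN
    else:
        template = COT_ZH if language == "zh" else COT_EN
    return template.format(question=question)
```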
We provide an example of testing the DeepSeekMath-Base 7B using 8 GPUs.

If you wish to use a different model or dataset, you can modify the configs in `submit_eval_jobs.py` and `configs/*test_configs.json`.
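
Before editing, it can help to take a quick look at the existing configs. The snippet below only assumes the `configs/*test_configs.json` paths mentioned above and makes no assumption about their internal schema.

```
import glob
import json

# Print the top-level structure of each test config mentioned above.
# Only the glob pattern from this README is assumed, not the config schema.
for path in sorted(glob.glob("configs/*test_configs.json")):
    with open(path) as f:
        cfg = json.load(f)
    summary = list(cfg.keys()) if isinstance(cfg, dict) else f"{len(cfg)} entries"
    print(path, "->", summary)
```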
```
python submit_eval_jobs.py --n-gpus 8
```

Wait for all processes to finish, then run the following command to aggregate the results:
```
python summarize_results.py [--eval-atp]
```

The `--eval-atp` option invokes `unsafe_score_minif2f_isabelle.py` to evaluate the informal-to-formal proving results. Please make sure you have set up the [PISA](https://github.com/wellecks/lm-evaluation-harness/blob/minif2f-isabelle/docs/isabelle_setup.md) server before using this option.

A summary of all evaluation results will be saved as `evaluation_results.json`.
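
For a quick look at the aggregated numbers, you can load that file directly; only the `evaluation_results.json` filename above is assumed, not its internal layout.

```
import json

# Pretty-print the summary written by summarize_results.py.
with open("evaluation_results.json") as f:
    results = json.load(f)
print(json.dumps(results, indent=2, ensure_ascii=False))
```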
## 4. Model Outputs
We provide all model outputs in `outputs.zip`.
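
To browse the outputs without unpacking the whole archive, something like the following works; only the `outputs.zip` filename above is assumed, not the archive's internal layout.

```
import zipfile

# List the files bundled in outputs.zip; call zf.extractall("outputs") to unpack them.
with zipfile.ZipFile("outputs.zip") as zf:
    for name in zf.namelist()[:20]:  # show the first few entries
        print(name)
```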