1. Introduction
We provide a test script for both zero-shot and few-shot evaluation on the mathematical reasoning benchmarks used in our paper.
2. Setup
First, set the `prefix` field in `environment.yml` to the path where the conda environment should be created, and then run:

```bash
conda env create -f environment.yml
```
3. Evaluation
For chain-of-thought evaluation of DeepSeekMath-Instruct and DeepSeekMath-RL, our script (see `def markup_question()` in `run_subset_parallel.py`) processes each question as follows:

- English questions: `{question}\nPlease reason step by step, and put your final answer within \\boxed{}.`
- Chinese questions: `{question}\n请通过逐步推理来解答问题,并把最终答案放置于\\boxed{}中。`

For tool-integrated reasoning, we process each question as follows:

- English questions: `{question}\nPlease integrate natural language reasoning with programs to solve the problem above, and put your final answer within \\boxed{}.`
- Chinese questions: `{question}\n请结合自然语言和Python程序语言来解答问题,并把最终答案放置于\\boxed{}中。`
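For reference, the snippet below is a minimal sketch of how these templates can be applied. The template strings are copied from this README, but the language check, the `tool_integrated` flag, and the function signature are illustrative assumptions, not the exact implementation of `markup_question()` in `run_subset_parallel.py`.

```python
# Illustrative sketch of the question markup described above. The templates are
# taken verbatim from this README; the CJK-based language check and the
# tool_integrated flag are assumptions, not the repository's exact logic.

COT_EN = "{question}\nPlease reason step by step, and put your final answer within \\boxed{{}}."
COT_ZH = "{question}\n请通过逐步推理来解答问题,并把最终答案放置于\\boxed{{}}中。"
TOOL_EN = ("{question}\nPlease integrate natural language reasoning with programs to solve "
           "the problem above, and put your final answer within \\boxed{{}}.")
TOOL_ZH = "{question}\n请结合自然语言和Python程序语言来解答问题,并把最终答案放置于\\boxed{{}}中。"


def contains_chinese(text: str) -> bool:
    """Heuristic: treat a question containing any CJK character as Chinese (assumption)."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def markup_question(question: str, tool_integrated: bool = False) -> str:
    """Wrap a raw question with the matching chain-of-thought or tool-integrated prompt."""
    if contains_chinese(question):
        template = TOOL_ZH if tool_integrated else COT_ZH
    else:
        template = TOOL_EN if tool_integrated else COT_EN
    return template.format(question=question)
```

For example, `markup_question("What is 2 + 2?")` returns the English chain-of-thought prompt, while passing `tool_integrated=True` selects the program-integrated template.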
We provide an example of testing DeepSeekMath-Base 7B using 8 GPUs. If you wish to use a different model or dataset, modify the configs in `submit_eval_jobs.py` and `configs/*test_configs.json`, then run:

```bash
python submit_eval_jobs.py --n-gpus 8
```
Wait for all processes to finish, and then run the following command to aggregate the results from all processes:

```bash
python summarize_results.py [--eval-atp]
```

The `--eval-atp` option invokes `unsafe_score_minif2f_isabelle.py` to evaluate the informal-to-formal proving results. Please make sure you have set up the PISA server before using this option.

A summary of all evaluation results will be saved to `evaluation_results.json`.
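Since the prompts above ask the model to put its final answer within `\\boxed{}`, aggregation has to pull that answer back out of each completion. The helper below is only an illustrative sketch of such extraction (handling nested braces); it is not the exact matching or normalization logic used by `summarize_results.py`.

```python
from typing import Optional


def extract_boxed_answer(completion: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in a completion, or None if absent.

    Illustrative sketch only; the repository's own scoring code may normalize
    answers differently before comparison.
    """
    marker = "\\boxed{"
    start = completion.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in completion[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # unbalanced braces


# Example: extract_boxed_answer("... so the answer is \\boxed{\\frac{1}{2}}.") == "\\frac{1}{2}"
```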
4. Model Outputs
We provide all model outputs in `outputs.zip`.