1. Introduction
We provide test scripts for both zero-shot and few-shot evaluation on the mathematical reasoning benchmarks used in our paper.
2. Setup
First, configure the prefix in environment.yml, and then run the following command:
conda env create -f environment.yml
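For reference, the prefix field at the bottom of environment.yml points to the location where the conda environment will be created. The path below is only illustrative and should be adapted to your machine:

prefix: /path/to/anaconda3/envs/&lt;env-name&gt;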
3. Evaluation
For chain-of-thought evaluation of DeepSeekMath-Instruct and DeepSeekMath-RL, our script (see def markup_question() in run_subset_parallel.py) processes each question as follows:
- English questions:
{question}\nPlease reason step by step, and put your final answer within \\boxed{}.
- Chinese questions:
{question}\n请通过逐步推理来解答问题,并把最终答案放置于\\boxed{}中。
For tool-integrated reasoning, we process each question as follows:
- English questions:
{question}\nPlease integrate natural language reasoning with programs to solve the problem above, and put your final answer within \\boxed{}.
- Chinese questions:
{question}\n请结合自然语言和Python程序语言来解答问题,并把最终答案放置于\\boxed{}中。
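As a rough illustration of the wrapping described above, the sketch below instantiates the four templates in plain Python. The actual logic lives in def markup_question() inside run_subset_parallel.py and may differ in signature and details; the helper wrap_question and its language/tool_integrated arguments are only assumed for this example.

# Illustrative only: the repository's own wrapping is markup_question() in
# run_subset_parallel.py; this sketch just instantiates the templates above.
COT_EN = "{question}\nPlease reason step by step, and put your final answer within \\boxed{{}}."
COT_ZH = "{question}\n请通过逐步推理来解答问题,并把最终答案放置于\\boxed{{}}中。"
TOOL_EN = ("{question}\nPlease integrate natural language reasoning with programs "
           "to solve the problem above, and put your final answer within \\boxed{{}}.")
TOOL_ZH = "{question}\n请结合自然语言和Python程序语言来解答问题,并把最终答案放置于\\boxed{{}}中。"

def wrap_question(question: str, language: str = "en", tool_integrated: bool = False) -> str:
    """Return the instructed prompt for one question (hypothetical helper)."""
    if tool_integrated:
        template = TOOL_ZH if language == "zh" else TOOL_EN
    else:
        template = COT_ZH if language == "zh" else COT_EN
    return template.format(question=question)

print(wrap_question("What is 2 + 2?"))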
We provide an example of testing DeepSeekMath-Base 7B using 8 GPUs.
If you wish to use a different model or dataset, you can modify the configs in submit_eval_jobs.py and configs/*test_configs.json.
python submit_eval_jobs.py --n-gpus 8
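If you edit the configs, you can quickly sanity-check which test config files will be picked up. The snippet below is only a convenience and uses the glob pattern from the path above; submit_eval_jobs.py has its own loading logic:

# List the test config files matching the pattern mentioned above
# (illustrative; submit_eval_jobs.py loads and interprets them itself).
import glob
import json

for path in sorted(glob.glob("configs/*test_configs.json")):
    with open(path) as f:
        cfg = json.load(f)
    size = len(cfg) if isinstance(cfg, (list, dict)) else 1
    print(f"{path}: {size} top-level entries")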
Wait for all processes to finish, then run the following command to aggregate the results:
python summarize_results.py [--eval-atp]
where the --eval-atp option invokes unsafe_score_minif2f_isabelle.py to evaluate the informal-to-formal proving results. Please make sure you have set up the PISA server before using this option.
A summary of all evaluation results will be saved as evaluation_results.json.
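If you want to inspect the aggregated numbers programmatically, something like the following works; the exact schema of evaluation_results.json is not assumed here:

# Load the aggregated summary written by summarize_results.py and print it.
import json

with open("evaluation_results.json") as f:
    results = json.load(f)

if isinstance(results, dict):
    for key, value in results.items():
        print(key, "->", value)
else:
    print(results)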
4. Model Outputs
We provide all model outputs in outputs.zip.
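For example, the archive can be listed or extracted with Python's standard zipfile module:

# Peek at the released outputs without fully extracting the archive.
import zipfile

with zipfile.ZipFile("outputs.zip") as zf:
    for name in zf.namelist()[:20]:  # show the first few entries
        print(name)
    # zf.extractall("outputs")       # uncomment to extract everything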