Mirror of https://github.com/deepseek-ai/DeepSeek-Coder (synced 2025-01-23 19:07:17 +00:00)
## 1. Introduction

We provide a test script to evaluate how well the **deepseek-coder** models solve mathematical problems with the help of an external tool (a Python interpreter). The evaluation uses the [PAL](https://arxiv.org/pdf/2211.10435.pdf) (Program-Aided Language Models) method on seven datasets: **GSM8k, MATH, GSM-Hard, SVAMP, TabMWP, ASDiv, and MAWPS**.
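
To make the PAL idea concrete: the model writes a short Python program instead of a final number, and the interpreter computes the answer. The sketch below is only an illustration and is not the code in `run.py`; the helper `run_generated_program`, the `solution()` entry point, and the sample program are assumptions made for this example.

```python
def run_generated_program(program: str, entry_point: str = "solution"):
    """Execute model-generated Python source and return entry_point()'s value.

    Hypothetical helper for illustration only. exec() runs arbitrary code,
    so a real harness would isolate it (separate process, timeout).
    """
    namespace = {}
    exec(program, namespace)
    return namespace[entry_point]()


# A program of the kind a PAL-style prompt asks the model to produce
# (hand-written here, not actual model output).
generated = '''
def solution():
    # Olivia has $23 and buys five bagels at $3 each. How much is left?
    money_initial = 23
    bagels = 5
    bagel_cost = 3
    return money_initial - bagels * bagel_cost
'''

answer = run_generated_program(generated)
print(answer)
```

The returned value is then compared against the dataset's reference answer, so arithmetic mistakes are left to the interpreter rather than to the model.
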

## 2. Setup

```
pip install sympy==1.12 pebble timeout-decorator transformers
```

## 3. Evaluation

The example below evaluates the **deepseek-coder-1.3b-base** model on the **gsm8k** dataset using **8** GPUs. To use a different model or dataset, change `MODEL_NAME_OR_PATH` and `DATA` accordingly.

```bash
MODEL_NAME_OR_PATH=deepseek-ai/deepseek-coder-1.3b-base
DATA=gsm8k # 'math' 'gsm8k' 'gsm-hard' 'svamp' 'tabmwp' 'asdiv' 'mawps'
MODEL_DIR_NAME=${MODEL_NAME_OR_PATH##*/}  # strip everything up to the last '/'
GPU_NUM=8

# Launch one background inference process per GPU
for rank in $(seq 0 $((GPU_NUM - 1))); do
    CUDA_VISIBLE_DEVICES=$rank nohup python run.py \
        --data_name ${DATA} \
        --model_name_or_path ${MODEL_NAME_OR_PATH} \
        --batch_size 16 \
        --do_inference \
        --rank $rank \
        --world_size $GPU_NUM 2>&1 &
done

# Wait for all processes to finish
wait
echo "All processes completed."

# Run the evaluation step and save its output
python run.py --do_eval --data_name ${DATA} --model_name_or_path ${MODEL_NAME_OR_PATH} --world_size $GPU_NUM | tee outputs/${MODEL_DIR_NAME}/${DATA}/result.out
```
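
The `--rank` and `--world_size` flags suggest that each process handles its own share of the dataset. How `run.py` actually splits the data is not shown here, so the following is a minimal sketch of one common scheme (strided sharding); the function `shard_for_rank` and the toy example list are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shard_for_rank(examples: Sequence[T], rank: int, world_size: int) -> List[T]:
    """Return the examples assigned to `rank` out of `world_size` workers.

    Assumed scheme: worker r takes items r, r + world_size, r + 2 * world_size, ...
    """
    if not 0 <= rank < world_size:
        raise ValueError(f"rank must be in [0, {world_size}), got {rank}")
    return list(examples[rank::world_size])


# Toy dataset of 20 example ids split across 8 workers.
examples = list(range(20))
shards = [shard_for_rank(examples, rank, 8) for rank in range(8)]
print(shards[0])   # ids handled by rank 0
print(shards[7])   # ids handled by rank 7
```

With a scheme like this, every example is processed by exactly one rank, and the final `--do_eval` step only needs `--world_size` to know how many per-rank outputs to collect.
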

## 4. Experimental Results

We report results on mathematical reasoning tasks solved by generating Python programs. All open-source models are evaluated with this repository and the same prompt. The maximum input length is **2048** tokens, the maximum output length is **512** tokens, and decoding uses the **greedy search strategy**.
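
As a rough illustration of these decoding settings, the sketch below implements greedy search with an input cap and an output cap over a toy scoring function. It is not the repository's generation code: `greedy_decode`, `toy_scores`, and the choice to keep the most recent tokens when truncating are all assumptions for this example.

```python
from typing import Callable, List, Sequence

MAX_INPUT_LEN = 2048   # maximum number of prompt tokens kept
MAX_OUTPUT_LEN = 512   # maximum number of generated tokens


def greedy_decode(
    next_token_scores: Callable[[List[int]], Sequence[float]],
    prompt_ids: List[int],
    eos_id: int,
    max_input_len: int = MAX_INPUT_LEN,
    max_new_tokens: int = MAX_OUTPUT_LEN,
) -> List[int]:
    """Greedy search: at every step pick the single highest-scoring token."""
    # Assumption: over-long prompts are truncated from the left.
    start = max(0, len(prompt_ids) - max_input_len)
    context = list(prompt_ids[start:])
    generated: List[int] = []
    for _ in range(max_new_tokens):
        scores = next_token_scores(context + generated)
        token = max(range(len(scores)), key=scores.__getitem__)
        if token == eos_id:
            break
        generated.append(token)
    return generated


def toy_scores(ids: List[int]) -> List[float]:
    """Stand-in for a model: over a 5-token vocabulary, favour (last + 1) % 5."""
    favoured = (ids[-1] + 1) % 5
    return [1.0 if t == favoured else 0.0 for t in range(5)]


print(greedy_decode(toy_scores, [0], eos_id=4))
```

Because greedy search involves no sampling, repeated runs with the same model and prompt should give the same output, which keeps the comparison between models reproducible.
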

| Model | Size | GSM8k | MATH | GSM-Hard | SVAMP | TabMWP | ASDiv | MAWPS | Avg |
| -------------- | ---- | ----- | ----- | -------- | ----- | ------ | ----- | ----- | ----- |
| CodeShell | 7B | 17.0% | 9.1% | 18.2% | 45.6% | 29.6% | 46.6% | 56.8% | 31.8% |
| CodeGeex-2 | 7B | 23.6% | 9.6% | 22.4% | 48.0% | 47.2% | 46.9% | 66.0% | 37.7% |
| StarCoder-Base | 16B | 27.3% | 11.5% | 24.2% | 44.0% | 45.6% | 54.9% | 73.4% | 40.1% |
| CodeLLama-Base | 7B | 36.4% | 12.3% | 29.7% | 57.6% | 58.4% | 59.6% | 82.6% | 48.0% |
| CodeLLama-Base | 13B | 44.2% | 15.5% | 42.4% | 65.6% | 61.6% | 65.3% | 85.3% | 54.3% |
| CodeLLama-Base | 34B | 58.2% | 22.1% | **55.2%** | 77.2% | 69.6% | 70.0% | 92.8% | 63.6% |
| | | | | | | | | | |
| DeepSeek-Coder-Base | 1.3B | 15.8% | 16.3% | 14.5% | 38.4% | 28.8% | 51.3% | 66.0% | 33.0% |
| DeepSeek-Coder-MQA-Base | 5.7B | 44.8% | 25.4% | 40.6% | 56.8% | 62.4% | 66.8% | 84.2% | 54.4% |
| DeepSeek-Coder-Base | 6.7B | 46.1% | 25.6% | 40.0% | 67.2% | 71.2% | 69.0% | 89.2% | 58.3% |
| DeepSeek-Coder-Base | 33B | **58.2%** | **35.3%** | 54.5% | **78.4%** | **76.8%** | **78.2%** | **94.0%** | **67.9%** |
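
The Avg column appears to be the unweighted mean of the seven per-dataset accuracies. The table does not state this explicitly, so treat it as an assumption; the short check below recomputes it for two rows using the rounded values from the table (the helper `average_accuracy` is introduced only for this example).

```python
from statistics import mean

# Per-dataset accuracies (%) copied from the table, in column order:
# GSM8k, MATH, GSM-Hard, SVAMP, TabMWP, ASDiv, MAWPS
scores = {
    "DeepSeek-Coder-Base 1.3B": [15.8, 16.3, 14.5, 38.4, 28.8, 51.3, 66.0],
    "DeepSeek-Coder-Base 33B": [58.2, 35.3, 54.5, 78.4, 76.8, 78.2, 94.0],
}


def average_accuracy(per_dataset: list) -> float:
    """Unweighted mean over datasets, rounded to one decimal like the table."""
    return round(mean(per_dataset), 1)


for model, values in scores.items():
    print(model, average_accuracy(values))
```

Both rows reproduce the reported averages (33.0% and 67.9%). Since the inputs are already rounded to one decimal, a recomputed average for other rows can differ from the table by 0.1.
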