DeepSeek-Coder/Evaluation/MBPP

1. Introduction

We provide a test script to evaluate the deepseek-coder model on the [MBPP](https://huggingface.co/datasets/mbpp) code generation benchmark in a 3-shot setting.
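To make the 3-shot setting concrete, here is a minimal sketch of how a few-shot prompt could be assembled from MBPP-style records. The field names `text`, `code`, and `test_list` follow the MBPP dataset on Hugging Face; the template wording, the `[BEGIN]`/`[DONE]` markers, and the helper names `format_example` and `build_few_shot_prompt` are assumptions for illustration and may differ from what eval_pal.py actually does.

```python
def format_example(problem, include_solution=True):
    """Render one MBPP-style problem as a prompt segment.

    The template below is an illustrative assumption, not the exact
    template used by the evaluation script.
    """
    tests = "\n".join(problem["test_list"])
    segment = (
        "You are an expert Python programmer, and here is your task: "
        f"{problem['text']} Your code should pass these tests:\n\n{tests}\n"
    )
    if include_solution:
        # Solved demonstration: include the reference code between markers.
        segment += f"[BEGIN]\n{problem['code']}\n[DONE]\n"
    else:
        # Target problem: leave the solution open for the model to complete.
        segment += "[BEGIN]\n"
    return segment


def build_few_shot_prompt(shots, target):
    """Concatenate exactly three solved examples followed by the target problem."""
    if len(shots) != 3:
        raise ValueError("a 3-shot prompt needs exactly three solved examples")
    parts = [format_example(shot) for shot in shots]
    parts.append(format_example(target, include_solution=False))
    return "\n".join(parts)
```

The model's completion would then be cut at the first `[DONE]` marker (again, an assumed convention) before being run against the target problem's tests.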

2. Setup

pip install accelerate
pip install attrdict
pip install transformers
pip install torch

3. Evaluation

We've created a sample script, eval.sh, that demonstrates how to test the deepseek-coder-1.3b-base model on the MBPP dataset using 8 GPUs.

MODEL_NAME_OR_PATH="deepseek-ai/deepseek-coder-1.3b-base"
DATASET_ROOT="data/"
LANGUAGE="python"
python -m accelerate.commands.launch --config_file test_config.yaml eval_pal.py --logdir "${MODEL_NAME_OR_PATH}" --dataroot "${DATASET_ROOT}"

4. Experimental Results

We report results for several models below. The maximum input length is 4096 tokens, the maximum output length is 500 tokens, and decoding is greedy.
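The decoding settings above can be sketched as follows. This is an illustration under stated assumptions, not the code in eval_pal.py: the keyword names `max_new_tokens`, `do_sample`, and `num_beams` are standard arguments of the Hugging Face transformers `generate` method, while the helper `truncate_prompt_ids` and the choice to truncate from the left (keeping the most recent tokens) are hypothetical.

```python
MAX_INPUT_TOKENS = 4096  # maximum input length stated above
MAX_NEW_TOKENS = 500     # maximum output length stated above


def truncate_prompt_ids(input_ids, max_len=MAX_INPUT_TOKENS):
    """Keep at most the last `max_len` token ids of a prompt.

    Left truncation is an assumption; the evaluation script may handle
    over-long prompts differently.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return input_ids[-max_len:]


# Greedy decoding: no sampling and a single beam, so the highest-probability
# token is chosen at every step.
generation_kwargs = {
    "max_new_tokens": MAX_NEW_TOKENS,
    "do_sample": False,
    "num_beams": 1,
}
```

With a loaded model and tokenized prompt, these settings would be passed as `model.generate(input_ids, **generation_kwargs)`.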

(1) Multilingual Base Models

| Model               | Size | Pass@1 |
|---------------------|------|--------|
| CodeShell           | 7B   | 38.6%  |
| CodeGeeX2           | 6B   | 36.2%  |
| StarCoder           | 16B  | 42.8%  |
| CodeLlama-Base      | 7B   | 38.6%  |
| CodeLlama-Base      | 13B  | 47.0%  |
| CodeLlama-Base      | 34B  | 55.0%  |
| DeepSeek-Coder-Base | 1.3B | 46.8%  |
| DeepSeek-Coder-Base | 5.7B | 57.2%  |
| DeepSeek-Coder-Base | 6.7B | 60.6%  |
| DeepSeek-Coder-Base | 33B  | 66.0%  |

(2) Instruction-Tuned Models

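| Model                   | Size | Pass@1 |
|-------------------------|------|--------|
| GPT-3.5-Turbo           | -    | 70.8%  |
| GPT-4                   | -    | 80.0%  |
| DeepSeek-Coder-Instruct | 1.3B | 49.4%  |
| DeepSeek-Coder-Instruct | 5.7B | 62.4%  |
| DeepSeek-Coder-Instruct | 6.7B | 65.4%  |
| DeepSeek-Coder-Instruct | 33B  | 70.0%  |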
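
For readers unfamiliar with the metric: with greedy decoding there is a single completion per problem, so Pass@1 reduces to the fraction of problems whose completion passes all of its tests. A minimal sketch follows; the function name `pass_at_1` and the `results` mapping are hypothetical and not taken from the evaluation script.

```python
def pass_at_1(results):
    """Compute Pass@1 from one pass/fail outcome per problem.

    `results` maps a task id to True if the single generated solution
    passed every test for that task, and False otherwise.
    """
    if not results:
        raise ValueError("results must contain at least one problem")
    solved = sum(1 for passed in results.values() if passed)
    return solved / len(results)


# Illustrative numbers only: 234 solved problems out of 500.
example = {task_id: task_id < 234 for task_id in range(500)}
print(f"Pass@1 = {pass_at_1(example):.1%}")
```

The example data is synthetic; it only shows how a score such as 46.8% would arise from 234 solved problems in a 500-problem evaluation set.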