DeepSeek-Coder/Evaluation/MBPP (README.md)

1. Introduction

We provide a test script to evaluate the performance of the deepseek-coder models on the MBPP code generation benchmark in a 3-shot setting.
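
For reference, a 3-shot prompt prepends three solved MBPP examples (task description, test cases, and reference solution) to the target task and asks the model to complete the solution. The exact template is defined by the files under data/; the Python sketch below is only illustrative, and the delimiter strings it uses are assumptions rather than the repository's exact format.

# Illustrative 3-shot prompt construction for MBPP. The actual template used by
# this repository lives in the data/ files; treat the wording and the
# [BEGIN]/[DONE] delimiters below as assumptions, not the exact format.
def build_3shot_prompt(shots, task, tests):
    """shots: list of (task_text, tests_text, solution_code) triples; task/tests: target problem."""
    parts = []
    for shot_task, shot_tests, shot_code in shots:
        parts.append(
            f"You are an expert Python programmer, and here is your task: {shot_task} "
            f"Your code should pass these tests:\n\n{shot_tests}\n[BEGIN]\n{shot_code}\n[DONE]\n"
        )
    # The target task ends with an open [BEGIN] so the model writes the solution.
    parts.append(
        f"You are an expert Python programmer, and here is your task: {task} "
        f"Your code should pass these tests:\n\n{tests}\n[BEGIN]\n"
    )
    return "".join(parts)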

2. Setup

pip install accelerate
pip install attrdict
pip install transformers
pip install torch
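
After installing, a quick import check such as the following confirms the environment is usable (note that PyTorch is installed via the torch package on PyPI):

# Sanity-check the installed packages; the names match the pip commands above.
import torch
import transformers
import accelerate

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)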

3. Evaluation

We provide a sample script, eval.sh, that demonstrates how to evaluate the deepseek-coder-1.3b-base model on the MBPP dataset using 8 GPUs.

MODEL_NAME_OR_PATH="deepseek-ai/deepseek-coder-1.3b-base"
DATASET_ROOT="data/"
LANGUAGE="python"
python -m accelerate.commands.launch --config_file test_config.yaml eval_pal.py --logdir ${MODEL_NAME_OR_PATH} --language ${LANGUAGE} --dataroot ${DATASET_ROOT}
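
The harness generates one completion per problem and scores it against the benchmark's assert-based test cases; eval_pal.py implements the actual evaluation. The sketch below only illustrates the idea behind such a pass@1 check; the function name and structure are assumptions, not the repository's API, and a real harness additionally sandboxes and times out the executed code.

# Illustrative only: how an MBPP-style pass@1 check works in principle.
def passes_tests(generated_code, test_cases):
    """Run the model's completion, then each assert-style test, in one namespace."""
    namespace = {}
    try:
        exec(generated_code, namespace)   # define the candidate function
        for test in test_cases:           # e.g. "assert add(2, 3) == 5"
            exec(test, namespace)
    except Exception:
        return False
    return True

# With greedy decoding, pass@1 is simply the fraction of problems whose
# single completion passes all of its test cases.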

4. Experimental Results

We report experimental results here for several models. We set the maximum input length to 4096 tokens and the maximum output length to 500 tokens, and use greedy decoding.
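
For a single prompt, these settings correspond to a transformers call along the following lines (a minimal single-GPU sketch; the model name matches eval.sh, and the prompt placeholder stands for a 3-shot MBPP prompt):

# Minimal greedy-decoding sketch matching the settings above
# (input truncated to 4096 tokens, up to 500 new tokens, no sampling).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-coder-1.3b-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

prompt = "..."  # a 3-shot MBPP prompt (see the Introduction)
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=4096).to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=500,
    do_sample=False,                       # greedy search
    pad_token_id=tokenizer.eos_token_id,
)
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(completion)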

(1) Multilingual Base Models

Model Size Pass@1
CodeShell 7B 38.6%
CodeGeeX2 6B 36.2%
StarCoder 16B 42.8%
CodeLlama-Base 7B 38.6%
CodeLlama-Base 13B 47.0%
CodeLlama-Base 34B 55.0%
DeepSeek-Coder-Base 1.3B 46.8%
DeepSeek-Coder-Base 5.7B 57.2%
DeepSeek-Coder-Base 6.7B 60.6%
DeepSeek-Coder-Base 33B 66.0%

(2) Instruction-Tuned Models

Model Size Pass@1
GPT-3.5-Turbo - 70.8%
GPT-4 - 80.0%
DeepSeek-Coder-Instruct 1.3B 49.4%
DeepSeek-Coder-Instruct 6.7B 65.4%
DeepSeek-Coder-Instruct 33B 70.0%