DeepSeek-Coder/Evaluation/HumanEval

1. Introduction

We provide a test script to evaluate the performance of DeepSeek-Coder models on code generation benchmarks. We use the widely adopted HumanEval-Python and HumanEval-Multilingual benchmarks.

2. Setup

pip install accelerate
pip install attrdict
pip install transformers
pip install torch
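
After installing the dependencies, you can optionally run a quick sanity check before launching the evaluation. The snippet below is not part of this repository; it only confirms that the packages import and that PyTorch can see a GPU:

import torch
import transformers
import accelerate

# Confirm the evaluation dependencies import and that a CUDA device is visible.
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)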

3. Evaluation

We've created a sample script, eval.sh, that demonstrates how to test the DeepSeek-Coder-1.3b-Base model on the HumanEval dataset using 8 GPUs. If your use case involves a different model or dataset, simply adjust the script to fit your needs.

Additionally, the execution paths may differ across programming languages. Please make sure to update the appropriate paths in the human_eval/execution.py file accordingly.

MODEL_NAME_OR_PATH="deepseek-ai/deepseek-coder-1.3b-base"
DATASET_ROOT="data/"
LANGUAGE="python"
python -m accelerate.commands.launch --config_file test_config.yaml eval_pal.py --logdir ${MODEL_NAME_OR_PATH} --language ${LANGUAGE} --dataroot ${DATASET_ROOT} 
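
For reference, the core of the base-model evaluation is plain left-to-right code completion. The Python sketch below is only an approximation of that step (it is not the code in eval_pal.py, which also handles batching and the multi-GPU launch via accelerate); it uses the settings reported in Section 4, i.e. greedy decoding with at most 500 new tokens:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Approximate, single-prompt sketch of base-model code completion.
model_name = "deepseek-ai/deepseek-coder-1.3b-base"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda()

# A HumanEval-style prompt: a function signature plus its docstring.
prompt = 'def add(a, b):\n    """Return the sum of a and b."""\n'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500, do_sample=False)
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(prompt + completion)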

To evaluate the instruction-based model, please follow the script below:

LANG="python"
OUTPUT_DIR="output"
MODEL="deepseek-coder-33b-instruct"

CUDA_VISIBLE_DEVICES=0,1 python eval_instruct.py \
    --model "deepseek-ai/$MODEL" \
    --output_path "$OUTPUT_DIR/${LANG}.$MODEL.jsonl" \
    --language $LANG \
    --temp_dir $OUTPUT_DIR
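
For reference, the instruct model is prompted through its chat template rather than with a raw code prefix. The Python sketch below is a simplified approximation of that step (it is not the code in eval_instruct.py, which handles the full benchmark loop); it relies only on the standard transformers chat-template API:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Simplified sketch of prompting the instruct model with its chat template.
model_name = "deepseek-ai/deepseek-coder-33b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

messages = [{"role": "user", "content": "Write a Python function that returns the n-th Fibonacci number."}]
# add_generation_prompt=True appends the assistant prefix so the model starts answering.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=500, do_sample=False, eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True))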

4. Experimental Results

We report experimental results for eight mainstream programming languages: Python, C++, Java, PHP, TypeScript, C#, Bash, and JavaScript. For all open-source models, we use this repository to obtain their performance on the HumanEval dataset. We set the maximum input length to 4096 tokens and the maximum output length to 500 tokens, and employ greedy decoding.

(1) Multilingual Base Models

Model Size Python C++ Java PHP TS C# Bash JS Avg
code-cushman-001 12B 33.5% 31.9% 30.6% 28.9% 31.3% 22.1% 11.7% - -
CodeShell 7B 35.4% 32.9% 34.2% 31.7% 30.2% 38.0% 7.0% 33.5% 30.4%
CodeGeeX2 6B 36.0% 29.2% 25.9% 23.6% 20.8% 29.7% 6.3% 24.8% 24.5%
StarCoderBase 16B 31.7% 31.1% 28.5% 25.4% 34.0% 34.8% 8.9% 29.8% 28.0%
CodeLLama 7B 31.7% 29.8% 34.2% 23.6% 36.5% 36.7% 12.0% 29.2% 29.2%
CodeLLama 13B 36.0% 37.9% 38.0% 34.2% 45.2% 43.0% 16.5% 32.3% 35.4%
CodeLLama 34B 48.2% 44.7% 44.9% 41.0% 42.1% 48.7% 15.8% 42.2% 41.0%
DeepSeek-Coder-Base 1.3B 34.8% 31.1% 32.3% 24.2% 28.9% 36.7% 10.1% 28.6% 28.3%
DeepSeek-Coder-Base 5.7B 48.7% 45.3% 41.1% 39.7% 44.7% 41.1% 27.8% 42.2% 41.3%
DeepSeek-Coder-Base 6.7B 49.4% 50.3% 43.0% 38.5% 49.7% 50.0% 28.5% 48.4% 44.7%
DeepSeek-Coder-Base 33B 56.1% 58.4% 51.9% 44.1% 52.8% 51.3% 32.3% 55.3% 50.3%

(2) Instruction-Tuned Models

Model Size Python C++ Java PHP TS C# Bash JS Avg
GPT-3.5-Turbo - 76.2% 63.4% 69.2% 60.9% 69.1% 70.8% 42.4% 67.1% 64.9%
GPT-4 - 84.1% 76.4% 81.6% 77.2% 77.4% 79.1% 58.2% 78.0% 76.5%
DeepSeek-Coder-Instruct 1.3B 65.2% 45.3% 51.9% 45.3% 59.7% 55.1% 12.7% 52.2% 48.4%
DeepSeek-Coder-Instruct 6.7B 78.9% 63.4% 68.4% 68.9% 67.2% 72.8% 36.7% 72.7% 66.1%
DeepSeek-Coder-Instruct 33B 79.3% 68.9% 73.4% 72.7% 67.9% 74.1% 43.0% 73.9% 69.2%