## 1. Introduction

We provide a test script to evaluate the performance of the deepseek-coder model on the MBPP code generation benchmark in a 3-shot setting.
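The few-shot prompt construction lives in mbpp.py; as a rough, hypothetical sketch (the field names and the [BEGIN]/[DONE] markers below follow a common MBPP prompting convention, not necessarily this repo's exact template), a 3-shot prompt concatenates three solved examples before the target task:

```python
# Hypothetical sketch of 3-shot MBPP prompting; see mbpp.py for the
# actual template used by this evaluation.
def build_prompt(shots, task):
    """shots: three solved examples; task: the problem to complete.
    Each item is a dict with 'text' (task description), 'tests'
    (assert statements as one string), and 'code' (reference solution)."""
    parts = []
    for ex in shots[:3]:  # 3-shot: three fully worked demonstrations
        parts.append(
            f"You are an expert Python programmer, and here is your task: "
            f"{ex['text']} Your code should pass these tests:\n\n"
            f"{ex['tests']}\n[BEGIN]\n{ex['code']}\n[DONE]\n"
        )
    # The target task stops at [BEGIN]; the model generates the solution
    # and evaluation truncates at the generated [DONE].
    parts.append(
        f"You are an expert Python programmer, and here is your task: "
        f"{task['text']} Your code should pass these tests:\n\n"
        f"{task['tests']}\n[BEGIN]\n"
    )
    return "".join(parts)
```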
## 2. Setup

```bash
pip install accelerate
pip install attrdict
pip install transformers
pip install torch
```
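Optionally, a quick check confirms the packages import cleanly and that all GPUs are visible:

```python
# Optional environment check: verify imports and GPU visibility.
import torch
import transformers
import accelerate

print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())  # the sample eval.sh assumes 8
```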
## 3. Evaluation

We provide a sample script, eval.sh, that demonstrates how to evaluate the deepseek-coder-1.3b-base model on the MBPP dataset using 8 GPUs. Note that the `LANGUAGE` variable is passed through to the evaluation script:

```bash
MODEL_NAME_OR_PATH="deepseek-ai/deepseek-coder-1.3b-base"
DATASET_ROOT="data/"
LANGUAGE="python"
python -m accelerate.commands.launch --config_file test_config.yaml eval_pal.py --logdir ${MODEL_NAME_OR_PATH} --language ${LANGUAGE} --dataroot ${DATASET_ROOT}
```
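Here test_config.yaml is the Accelerate launcher configuration bundled with this directory. For orientation only, a minimal multi-GPU Accelerate config for 8 GPUs typically looks like the sketch below; the repo's own file is authoritative.

```yaml
# Illustrative 8-GPU Accelerate config; defer to the bundled test_config.yaml.
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 1
num_processes: 8        # one process per GPU
mixed_precision: 'no'
```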
## 4. Experimental Results

We report results for several models below. The maximum input length is 4096 tokens, the maximum output length is 500 tokens, and decoding uses greedy search.
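For reference, these settings map onto standard transformers generation arguments; a minimal single-prompt sketch (the model choice and placeholder prompt are illustrative):

```python
# Minimal sketch of the reported decoding settings: inputs truncated to
# 4096 tokens, at most 500 new tokens, greedy search (no sampling).
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "deepseek-ai/deepseek-coder-1.3b-base"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")

prompt = "def fibonacci(n):"  # placeholder; the benchmark builds 3-shot prompts
inputs = tokenizer(prompt, return_tensors="pt",
                   truncation=True, max_length=4096).to(model.device)
outputs = model.generate(**inputs,
                         do_sample=False,     # greedy search
                         max_new_tokens=500)  # maximum output length
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True)
print(completion)
```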
#### (1) Multilingual Base Models

| Model | Size | Pass@1 |
|---|---|---|
| CodeShell | 7B | 38.6% |
| CodeGeeX2 | 6B | 36.2% |
| StarCoder | 16B | 42.8% |
| CodeLlama-Base | 7B | 38.6% |
| CodeLlama-Base | 13B | 47.0% |
| CodeLlama-Base | 34B | 55.0% |
| DeepSeek-Coder-Base | 1.3B | 46.8% |
| DeepSeek-Coder-Base | 5.7B | 57.2% |
| DeepSeek-Coder-Base | 6.7B | 60.6% |
| DeepSeek-Coder-Base | 33B | 66.0% |
#### (2) Instruction-Tuned Models

| Model | Size | Pass@1 |
|---|---|---|
| GPT-3.5-Turbo | - | 70.8% |
| GPT-4 | - | 80.0% |
| DeepSeek-Coder-Instruct | 1.3B | 49.4% |
| DeepSeek-Coder-Instruct | 6.7B | 65.4% |
| DeepSeek-Coder-Instruct | 33B | 70.0% |
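Because decoding is greedy, exactly one completion is generated per problem, so Pass@1 reduces to the fraction of problems whose single completion passes all of the benchmark's unit tests:

```python
# Pass@1 under greedy decoding: one sample per problem, so the metric is
# simply (problems solved) / (total problems).
def pass_at_1(passed_flags):
    """passed_flags: one bool per problem, True if every test passed."""
    return sum(passed_flags) / len(passed_flags)

print(f"{pass_at_1([True, False, True, True]):.1%}")  # 75.0%
```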