mirror of
https://github.com/deepseek-ai/DeepSeek-Coder
synced 2025-01-22 18:45:34 +00:00
## 1. Introduction

We provide a test script for evaluating **deepseek-coder** models on code generation benchmarks. We evaluate on two widely used benchmarks: **[HumanEval-Python](https://huggingface.co/datasets/openai_humaneval)** and **[HumanEval-Multilingual](https://huggingface.co/datasets/nuprl/MultiPL-E)**.
## 2. Setup

```
pip install accelerate
pip install attrdict
pip install transformers
pip install torch
```
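
Before launching an evaluation, it can help to confirm that the packages above are importable. The sketch below is only a convenience check and is not part of the repository; the helper name `missing_packages` is hypothetical, and it assumes PyTorch is imported under the module name `torch`.

```python
import importlib.util

# Import names of the dependencies listed above (PyTorch is imported as "torch").
REQUIRED = ["accelerate", "attrdict", "transformers", "torch"]


def missing_packages(names):
    """Return the subset of `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Running the file prints either the list of missing packages or a confirmation that everything was found.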
## 3. Evaluation

The sample script **eval.sh** shows how to evaluate the **DeepSeek-Coder-1.3b-Base** model on the HumanEval dataset using **8** GPUs. To evaluate a different model or dataset, adjust the script accordingly.

The execution path can differ between programming languages, so make sure to update the corresponding paths in **humaneval/execution.py**.
```bash
MODEL_NAME_OR_PATH="deepseek-ai/deepseek-coder-1.3b-base"
DATASET_ROOT="data/"
LANGUAGE="python"
python -m accelerate.commands.launch --config_file test_config.yaml eval_pal.py --logdir "${MODEL_NAME_OR_PATH}" --language "${LANGUAGE}" --dataroot "${DATASET_ROOT}"
```
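
To evaluate several languages in turn, the same command can be generated once per language. The following is a minimal sketch rather than part of the repository: `build_command` is a hypothetical helper, it reuses only the flags shown above, and the language identifiers other than `python` are placeholders that should be checked against what `eval_pal.py` actually accepts.

```python
import shlex


def build_command(model, language, dataroot="data/", config="test_config.yaml"):
    """Assemble the accelerate launch command shown above as an argument list."""
    return [
        "python", "-m", "accelerate.commands.launch",
        "--config_file", config,
        "eval_pal.py",
        "--logdir", model,
        "--language", language,
        "--dataroot", dataroot,
    ]


if __name__ == "__main__":
    model = "deepseek-ai/deepseek-coder-1.3b-base"
    # "python" comes from the script above; the other identifiers are assumptions.
    for language in ["python", "java", "php"]:
        # Print a copy-pasteable command; pass the list to subprocess.run to execute it.
        print(shlex.join(build_command(model, language)))
```

Printing the commands first, instead of executing them directly, makes it easy to review each invocation before spending GPU time on it.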
## 4. Experimental Results

We report results for 8 mainstream programming languages: **Python**, **C++**, **Java**, **PHP**, **TypeScript**, **C#**, **Bash**, and **JavaScript**. For all open-source models, we use this repository to measure performance on the HumanEval dataset. We set the maximum input length to **4096** and the maximum output length to **500**, and use **greedy search** for decoding.
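
The decoding settings above map naturally onto generation arguments. The sketch below is an assumption about how they could be expressed, not the repository's code: it treats the output limit of 500 as `max_new_tokens`, expresses greedy search as `do_sample=False` with a single beam (the convention used by Hugging Face `generate`), and truncates over-long prompts from the left, which is a hypothetical policy.

```python
MAX_INPUT_TOKENS = 4096   # maximum input length reported above
MAX_OUTPUT_TOKENS = 500   # maximum output length reported above

# Keyword arguments in the style of Hugging Face `model.generate(...)`.
# Greedy search means no sampling and a single beam.
GENERATION_KWARGS = {
    "max_new_tokens": MAX_OUTPUT_TOKENS,
    "do_sample": False,
    "num_beams": 1,
}


def truncate_prompt(token_ids, max_len=MAX_INPUT_TOKENS):
    """Keep at most `max_len` tokens, dropping the oldest ones (assumed policy)."""
    return list(token_ids[-max_len:])


if __name__ == "__main__":
    prompt_ids = list(range(5000))   # stand-in for tokenizer output
    kept = truncate_prompt(prompt_ids)
    print(len(kept), kept[0])        # 4096 tokens remain, starting at id 904
```

Because greedy search is deterministic for a fixed model and prompt, a single generation per problem is enough to compute the reported score under these settings.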
#### (1) Multilingual Base Models

| Model               | Size | Python | C++   | Java  | PHP   | TS    | C#    | Bash  | JS    | Avg   |
|---------------------|------|--------|-------|-------|-------|-------|-------|-------|-------|-------|
| code-cushman-001    | 12B  | 33.5%  | 31.9% | 30.6% | 28.9% | 31.3% | 22.1% | 11.7% | -     | -     |
| CodeShell           | 7B   | 35.4%  | 32.9% | 34.2% | 31.7% | 30.2% | 38.0% | 7.0%  | 33.5% | 30.4% |
| CodeGeeX2           | 6B   | 36.0%  | 29.2% | 25.9% | 23.6% | 20.8% | 29.7% | 6.3%  | 24.8% | 24.5% |
| StarCoderBase       | 16B  | 31.7%  | 31.1% | 28.5% | 25.4% | 34.0% | 34.8% | 8.9%  | 29.8% | 28.0% |
| CodeLLama           | 7B   | 31.7%  | 29.8% | 34.2% | 23.6% | 36.5% | 36.7% | 12.0% | 29.2% | 29.2% |
| CodeLLama           | 13B  | 36.0%  | 37.9% | 38.0% | 34.2% | 45.2% | 43.0% | 16.5% | 32.3% | 35.4% |
| CodeLLama           | 34B  | 48.2%  | 44.7% | 44.9% | 41.0% | 42.1% | 48.7% | 15.8% | 42.2% | 41.0% |
|                     |      |        |       |       |       |       |       |       |       |       |
| DeepSeek-Coder-Base | 1.3B | 34.8%  | 31.1% | 32.3% | 24.2% | 28.9% | 36.7% | 10.1% | 28.6% | 28.3% |
| DeepSeek-Coder-Base | 5.7B | 48.7%  | 45.3% | 41.1% | 39.7% | 44.7% | 41.1% | 27.8% | 42.2% | 41.3% |
| DeepSeek-Coder-Base | 6.7B | 49.4%  | 50.3% | 43.0% | 38.5% | 49.7% | 50.0% | 28.5% | 48.4% | 44.7% |
| DeepSeek-Coder-Base | 33B  | **56.1%** | **58.4%** | **51.9%** | **44.1%** | **52.8%** | **51.3%** | **32.3%** | **55.3%** | **50.3%** |
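
The Avg column is consistent with an unweighted mean of the eight per-language scores, rounded to one decimal place. The short check below reproduces it for two rows of the table above; the helper `average_score` is purely illustrative.

```python
def average_score(scores):
    """Unweighted mean of per-language pass rates (in percent), to one decimal."""
    return round(sum(scores) / len(scores), 1)


# Per-language scores copied from the table: Python, C++, Java, PHP, TS, C#, Bash, JS.
deepseek_coder_base_6_7b = [49.4, 50.3, 43.0, 38.5, 49.7, 50.0, 28.5, 48.4]
codellama_13b = [36.0, 37.9, 38.0, 34.2, 45.2, 43.0, 16.5, 32.3]

if __name__ == "__main__":
    print(average_score(deepseek_coder_base_6_7b))  # 44.7, matching the Avg column
    print(average_score(codellama_13b))             # 35.4, matching the Avg column
```

Rows with a missing score, such as code-cushman-001, have no average in the table, so they are left out of this check.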
#### (2) Instruction-Tuned Models

| Model                   | Size | Python | C++   | Java  | PHP   | TS    | C#    | Bash  | JS    | Avg   |
|-------------------------|------|--------|-------|-------|-------|-------|-------|-------|-------|-------|
| GPT-3.5-Turbo           | -    | 76.2%  | 63.4% | 69.2% | 60.9% | 69.1% | 70.8% | 42.4% | 67.1% | 64.9% |
| GPT-4                   | -    | **84.1%** | **76.4%** | **81.6%** | **77.2%** | **77.4%** | **79.1%** | **58.2%** | **78.0%** | **76.5%** |
|                         |      |        |       |       |       |       |       |       |       |       |
| DeepSeek-Coder-Instruct | 1.3B | 65.2%  | 45.3% | 51.9% | 45.3% | 59.7% | 55.1% | 12.7% | 52.2% | 48.4% |
| DeepSeek-Coder-Instruct | 6.7B | 78.9%  | 63.4% | 68.4% | 68.9% | 67.2% | 72.8% | 36.7% | 72.7% | 66.1% |
| DeepSeek-Coder-Instruct | 33B  | **79.3%** | **68.9%** | **73.4%** | **72.7%** | **67.9%** | **74.1%** | **43.0%** | **73.9%** | **69.2%** |