## 1. Introduction
We provide a test script to evaluate the performance of the DeepSeek-Coder model on code completion benchmarks. We select the widely used benchmark DS-1000.
## 2. Evaluation
We directly use the evaluation scripts provided by the DS-1000 repository to measure model performance. Refer to DS-1000 for more details about the evaluation protocol.
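For orientation, here is a minimal sketch of how a completion could be generated before being scored by the DS-1000 scripts. It assumes the Hugging Face `transformers` library and the `deepseek-ai/deepseek-coder-6.7b-base` checkpoint; the prompt placeholder stands in for a real DS-1000 problem prompt taken from the DS-1000 repository's data files:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; in practice, substitute any model from the table below.
model_id = "deepseek-ai/deepseek-coder-6.7b-base"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Placeholder: one DS-1000 problem prompt (completion mode),
# loaded from the DS-1000 repository's problem files.
prompt = "..."

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_length=2048,   # maximum sequence length
    do_sample=False,   # greedy search
)
# Keep only the newly generated tokens as the completion to be scored.
completion = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
```

The completion is then written out in the format expected by the DS-1000 evaluation scripts, which execute it against the benchmark's test cases.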
## 3. Experimental Results
We report experimental results for the completion mode of DS-1000. We set the maximum sequence length to 2048 and use greedy search for decoding. To ensure a fair comparison, we apply identical hyper-parameters to all open-source models under evaluation.
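As a minimal sketch, the shared decoding settings could be pinned in a single `transformers` `GenerationConfig`; the config object is our illustration, while the values come from the description above:

```python
from transformers import GenerationConfig

# Identical decoding hyper-parameters applied to every open-source
# model under evaluation.
greedy_config = GenerationConfig(
    max_length=2048,   # maximum sequence length
    do_sample=False,   # greedy search, no sampling
    num_beams=1,       # plain greedy decoding, no beam search
)
```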
Model | Size | Matplotlib | NumPy | Pandas | PyTorch | SciPy | Scikit-Learn | TensorFlow | Avg |
---|---|---|---|---|---|---|---|---|---|
Codex-001 | - | 41.8% | 26.6% | 9.4% | 9.7% | 15.0% | 18.5% | 17.2% | 20.2% |
Codex-002 | - | 57.0% | 43.1% | 26.5% | 41.8% | 31.8% | 44.8% | 39.3% | 39.2% |
CodeShell | 7B | 34.1% | 21.8% | 10.7% | 11.8% | 17.0% | 20.0% | 15.6% | 18.8% |
CodeGeeX2 | 6B | 38.7% | 26.8% | 14.4% | 11.8% | 19.8% | 27.0% | 17.8% | 22.9% |
StarCoder | 16B | 47.7% | 31.4% | 12.7% | 25.0% | 22.6% | 35.7% | 22.2% | 27.2% |
CodeLlama-Base | 7B | 41.9% | 24.6% | 14.8% | 16.2% | 18.9% | 17.4% | 17.8% | 22.1% |
CodeLlama-Base | 13B | 46.5% | 28.6% | 18.2% | 19.1% | 18.9% | 27.8% | 33.3% | 26.8% |
CodeLlama-Base | 34B | 50.3% | 42.7% | 23.0% | 25.0% | 28.3% | 33.9% | 40.0% | 34.3% |
DeepSeek-Coder-Base | 1.3B | 32.3% | 21.4% | 9.3% | 8.8% | 8.5% | 16.5% | 8.9% | 16.2% |
DeepSeek-Coder-Base | 5.7B | 51.1% | 31.8% | 19.9% | 14.7% | 17.0% | 29.6% | 15.6% | 27.7% |
DeepSeek-Coder-Base | 6.7B | 48.4% | 35.5% | 20.6% | 19.1% | 22.6% | 38.3% | 24.4% | 30.5% |
DeepSeek-Coder-Base | 33B | 56.1% | 49.6% | 25.8% | 36.8% | 36.8% | 40.0% | 46.7% | 40.2% |