
## 1. Introduction

We provide a test script to evaluate the performance of the deepseek-coder model on code completion benchmarks. Here we select the widely-used DS-1000 benchmark.

## 2. Evaluation

We directly use the evaluation scripts provided by the DS-1000 repository to evaluate the models. Refer to the DS-1000 repository for more details on the evaluation procedure.
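
As a rough illustration of that flow, the sketch below follows the `DS1000Dataset` interface shown in the DS-1000 repository's README. The data path and the `generate_code` helper are hypothetical stand-ins, not part of the official scripts:

```python
from ds1000 import DS1000Dataset  # interface from the DS-1000 repository


def generate_code(prompt: str) -> str:
    """Hypothetical stand-in for model inference (see the generation sketch in Section 3)."""
    raise NotImplementedError


ds_data = DS1000Dataset("ds1000_data")  # assumed path to the unpacked DS-1000 data

problem = ds_data["Pandas"][0]                  # first problem in the Pandas split
completion = generate_code(problem["prompt"])   # model completes the official prompt

# Execute the problem's tests to score functional correctness.
passed = problem.test(completion)
print("passed" if passed else "failed")
```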

## 3. Experimental Results

We report experimental results for the completion mode of DS-1000. We set the maximum length to 2048 tokens and employ greedy search as the decoding strategy. To ensure a fair comparison, we apply identical hyper-parameters to all open-source models under evaluation.
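
For reference, the snippet below is a minimal sketch of this generation setup using Hugging Face Transformers (greedy decoding, 2048-token cap). The model ID names one of the evaluated checkpoints; the prompt placeholder stands in for a DS-1000 completion-mode prompt:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; swap in the model under evaluation.
model_id = "deepseek-ai/deepseek-coder-6.7b-base"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

prompt = "..."  # a DS-1000 completion-mode prompt goes here

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=False,   # greedy search
    max_length=2048,   # maximum length used in our runs
)

# Keep only the generated continuation, not the echoed prompt.
completion = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(completion)
```

Greedy search corresponds to `do_sample=False`, which makes the reported numbers deterministic for a given checkpoint.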

| Model | Size | Matplotlib | NumPy | Pandas | PyTorch | SciPy | Scikit-Learn | TensorFlow | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Codex-001 | - | 41.8% | 26.6% | 9.4% | 9.7% | 15.0% | 18.5% | 17.2% | 20.2% |
| Codex-002 | - | 57.0% | 43.1% | 26.5% | 41.8% | 31.8% | 44.8% | 39.3% | 39.2% |
| CodeShell | 7B | 34.1% | 21.8% | 10.7% | 11.8% | 17.0% | 20.0% | 15.6% | 18.8% |
| CodeGeeX2 | 6B | 38.7% | 26.8% | 14.4% | 11.8% | 19.8% | 27.0% | 17.8% | 22.9% |
| StarCoder | 16B | 47.7% | 31.4% | 12.7% | 25.0% | 22.6% | 35.7% | 22.2% | 27.2% |
| CodeLlama-Base | 7B | 41.9% | 24.6% | 14.8% | 16.2% | 18.9% | 17.4% | 17.8% | 22.1% |
| CodeLlama-Base | 13B | 46.5% | 28.6% | 18.2% | 19.1% | 18.9% | 27.8% | 33.3% | 26.8% |
| CodeLlama-Base | 34B | 50.3% | 42.7% | 23.0% | 25.0% | 28.3% | 33.9% | 40.0% | 34.3% |
| DeepSeek-Coder-Base | 1.3B | 32.3% | 21.4% | 9.3% | 8.8% | 8.5% | 16.5% | 8.9% | 16.2% |
| DeepSeek-Coder-Base | 5.7B | 51.1% | 31.8% | 19.9% | 14.7% | 17.0% | 29.6% | 15.6% | 27.7% |
| DeepSeek-Coder-Base | 6.7B | 48.4% | 35.5% | 20.6% | 19.1% | 22.6% | 38.3% | 24.4% | 30.5% |
| DeepSeek-Coder-Base | 33B | 56.1% | 49.6% | 25.8% | 36.8% | 36.8% | 40.0% | 46.7% | 40.2% |