
## 1. Introduction

We provide a test script to evaluate the performance of the deepseek-coder model on code completion benchmarks. Here we select the widely-used DS-1000 benchmark.

## 2. Evaluation

We directly use the evaluation scripts provided by the DS-1000 repository to evaluate the models. Refer to the DS-1000 repository for more details on the evaluation procedure.
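
As a rough illustration of that flow, the sketch below follows the `DS1000Dataset` interface shown in the DS-1000 repository's README. The data path and the `generate_code` helper are hypothetical stand-ins, not part of the official scripts:

```python
from ds1000 import DS1000Dataset  # interface from the DS-1000 repository


def generate_code(prompt: str) -> str:
    """Hypothetical stand-in for model inference (see the generation sketch in Section 3)."""
    raise NotImplementedError


ds_data = DS1000Dataset("ds1000_data")  # assumed path to the unpacked DS-1000 data

problem = ds_data["Pandas"][0]                  # first problem in the Pandas split
completion = generate_code(problem["prompt"])   # model completes the official prompt

# Execute the problem's tests to score functional correctness.
passed = problem.test(completion)
print("passed" if passed else "failed")
```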

## 3. Experimental Results

We report experimental results for the completion mode of DS-1000. We set the maximum length to 2048 tokens and employ greedy search as the decoding strategy. To ensure a fair comparison, we apply identical hyper-parameters to all open-source models under evaluation.
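
For reference, the snippet below is a minimal sketch of this generation setup using Hugging Face Transformers (greedy decoding, 2048-token cap). The model ID names one of the evaluated checkpoints; the prompt placeholder stands in for a DS-1000 completion-mode prompt:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; swap in the model under evaluation.
model_id = "deepseek-ai/deepseek-coder-6.7b-base"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

prompt = "..."  # a DS-1000 completion-mode prompt goes here

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=False,   # greedy search
    max_length=2048,   # maximum length used in our runs
)

# Keep only the generated continuation, not the echoed prompt.
completion = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(completion)
```

Greedy search corresponds to `do_sample=False`, which makes the reported numbers deterministic for a given checkpoint.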

| Model | Size | Matplotlib | NumPy | Pandas | PyTorch | SciPy | Scikit-Learn | TensorFlow | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Codex-001 | - | 41.8% | 26.6% | 9.4% | 9.7% | 15.0% | 18.5% | 17.2% | 20.2% |
| Codex-002 | - | 57.0% | 43.1% | 26.5% | 41.8% | 31.8% | 44.8% | 39.3% | 39.2% |
| CodeShell | 7B | 34.1% | 21.8% | 10.7% | 11.8% | 17.0% | 20.0% | 15.6% | 18.8% |
| CodeGeeX2 | 6B | 38.7% | 26.8% | 14.4% | 11.8% | 19.8% | 27.0% | 17.8% | 22.9% |
| StarCoder | 16B | 47.7% | 31.4% | 12.7% | 25.0% | 22.6% | 35.7% | 22.2% | 27.2% |
| CodeLlama-Base | 7B | 41.9% | 24.6% | 14.8% | 16.2% | 18.9% | 17.4% | 17.8% | 22.1% |
| CodeLlama-Base | 13B | 46.5% | 28.6% | 18.2% | 19.1% | 18.9% | 27.8% | 33.3% | 26.8% |
| CodeLlama-Base | 34B | 50.3% | 42.7% | 23.0% | 25.0% | 28.3% | 33.9% | 40.0% | 34.3% |
| DeepSeek-Coder-Base | 1.3B | 32.3% | 21.4% | 9.3% | 8.8% | 8.5% | 16.5% | 8.9% | 16.2% |
| DeepSeek-Coder-Base | 5.7B | 51.1% | 31.8% | 19.9% | 14.7% | 17.0% | 29.6% | 15.6% | 27.7% |
| DeepSeek-Coder-Base | 6.7B | 48.4% | 35.5% | 20.6% | 19.1% | 22.6% | 38.3% | 24.4% | 30.5% |
| DeepSeek-Coder-Base | 33B | 56.1% | 49.6% | 25.8% | 36.8% | 36.8% | 40.0% | 46.7% | 40.2% |