DeepSeek-Coder/Evaluation/DS-1000/README.md

## 1. Introduction

We provide a test script to evaluate the performance of the **deepseek-coder** model on code completion benchmarks. We select the widely-used benchmarks: [**DS-1000**](https://github.com/xlang-ai/DS-1000).

## 2. Evaluation

We directly use the scripts provided by the DS-1000 repository to evaluate the performance of the models. You can refer to [**DS-1000**](https://github.com/xlang-ai/DS-1000) to find more details about the evaluation.


## 3. Experimental Results

We report experimental results here for the completion mode of DS-1000. We set the maximum length to **2048**, and employ the **greedy search strategy**.  To ensure a fair comparison, we apply identical hyper-parameters across all open-source models under evaluation.

| Model                  | Size | Matplotlib | Numpy | Pandas | Pytorch | Scipy | Scikit-Learn | Tensorflow | Avg   |
|------------------------|------|------------|-------|--------|---------|-------|-------------|------------|-------|
| Codex-001              | -    | 41.8%      | 26.6% | 9.4%   | 9.7%    | 15.0% | 18.5%        | 17.2%      | 20.2% |
| Codex-002              | -    | **57.0%**      | 43.1% | **26.5%**  | **41.8%**   | 31.8% | **44.8%**        | 39.3%      | 39.2% |
| CodeShell              | 7B   | 34.1%      | 21.8% | 10.7%  | 11.8%   | 17.0% | 20.0%        | 15.6%      | 18.8% |
| CodeGeeX2              | 6B   | 38.7%      | 26.8% | 14.4%  | 11.8%   | 19.8% | 27.0%        | 17.8%      | 22.9% |
| StarCoder         | 16B  | 47.7%      | 31.4% | 12.7%  | 25%   | 22.6% | 35.7%        | 22.2%      | 27.2% |
| CodeLLama-Base         | 7B   | 41.9%      | 24.6% | 14.8%  | 16.2%   | 18.9% | 17.4%        | 17.8%      | 22.1% |
| CodeLLama-Base         | 13B  | 46.5%      | 28.6% | 18.2%  | 19.1%   | 18.9% | 27.8%        | 33.3%      | 26.8% |
| CodeLLama-Base         | 34B  | 50.3%      | 42.7% | 23.0%  | 25.0%   | 28.3% | 33.9%        | 40.0%      | 34.3% |
| | | | |  |  |  |  |  |  | |
| DeepSeek-Coder-Base    | 1.3B   | 32.3%      | 21.4% | 9.3%   | 8.8%    | 8.5%  | 16.5%        | 8.9%       | 16.2% |
| DeepSeek-Coder-Base    | 5.7B   | 51.1%      | 31.8% | 19.9%  | 14.7%   | 17.0% | 29.6%        | 15.6%      | 27.7% |
| DeepSeek-Coder-Base    | 6.7B   | 48.4%      | 35.5% | 20.6%  | 19.1%   | 22.6% | 38.3%        | 24.4%      | 30.5% |
| DeepSeek-Coder-Base    | 33B  | 56.1%      | **49.6%** | 25.8%  | 36.8%   | **36.8%** | 40.0%        | **46.7%**      | **40.2%** |
init project 2023-11-02 14:07:09 +00:00			`## 1. Introduction`

			`We provide a test script to evaluate the performance of the deepseek-coder model on code completion benchmarks. We select the widely-used benchmarks: [DS-1000](https://github.com/xlang-ai/DS-1000).`

			`## 2. Evaluation`

			`We directly use the scripts provided by the DS-1000 repository to evaluate the performance of the models. You can refer to [DS-1000](https://github.com/xlang-ai/DS-1000) to find more details about the evaluation.`


			`## 3. Experimental Results`

			`We report experimental results here for the completion mode of DS-1000. We set the maximum length to 2048, and employ the greedy search strategy. To ensure a fair comparison, we apply identical hyper-parameters across all open-source models under evaluation.`

			`\| Model \| Size \| Matplotlib \| Numpy \| Pandas \| Pytorch \| Scipy \| Scikit-Learn \| Tensorflow \| Avg \|`
			`\|------------------------\|------\|------------\|-------\|--------\|---------\|-------\|-------------\|------------\|-------\|`
			`\| Codex-001 \| - \| 41.8% \| 26.6% \| 9.4% \| 9.7% \| 15.0% \| 18.5% \| 17.2% \| 20.2% \|`
			`\| Codex-002 \| - \| 57.0% \| 43.1% \| 26.5% \| 41.8% \| 31.8% \| 44.8% \| 39.3% \| 39.2% \|`
			`\| CodeShell \| 7B \| 34.1% \| 21.8% \| 10.7% \| 11.8% \| 17.0% \| 20.0% \| 15.6% \| 18.8% \|`
			`\| CodeGeeX2 \| 6B \| 38.7% \| 26.8% \| 14.4% \| 11.8% \| 19.8% \| 27.0% \| 17.8% \| 22.9% \|`
			`\| StarCoder \| 16B \| 47.7% \| 31.4% \| 12.7% \| 25% \| 22.6% \| 35.7% \| 22.2% \| 27.2% \|`
			`\| CodeLLama-Base \| 7B \| 41.9% \| 24.6% \| 14.8% \| 16.2% \| 18.9% \| 17.4% \| 17.8% \| 22.1% \|`
			`\| CodeLLama-Base \| 13B \| 46.5% \| 28.6% \| 18.2% \| 19.1% \| 18.9% \| 27.8% \| 33.3% \| 26.8% \|`
			`\| CodeLLama-Base \| 34B \| 50.3% \| 42.7% \| 23.0% \| 25.0% \| 28.3% \| 33.9% \| 40.0% \| 34.3% \|`
			`\| \| \| \| \| \| \| \| \| \| \| \|`
			`\| DeepSeek-Coder-Base \| 1.3B \| 32.3% \| 21.4% \| 9.3% \| 8.8% \| 8.5% \| 16.5% \| 8.9% \| 16.2% \|`
			`\| DeepSeek-Coder-Base \| 5.7B \| 51.1% \| 31.8% \| 19.9% \| 14.7% \| 17.0% \| 29.6% \| 15.6% \| 27.7% \|`
			`\| DeepSeek-Coder-Base \| 6.7B \| 48.4% \| 35.5% \| 20.6% \| 19.1% \| 22.6% \| 38.3% \| 24.4% \| 30.5% \|`
			`\| DeepSeek-Coder-Base \| 33B \| 56.1% \| 49.6% \| 25.8% \| 36.8% \| 36.8% \| 40.0% \| 46.7% \| 40.2% \|`