DeepSeek-Coder/Evaluation/LeetCode/readme.md at main

DeepSeek/DeepSeek-Coder

mirror of https://github.com/deepseek-ai/DeepSeek-Coder synced 2025-06-26 18:25:53 +00:00

Dejian Yang d1898bcabc update model name

2024-01-26 14:12:44 +08:00

1. Introduction

We constructed the LeetCode Contest benchmark to further validate the model's capability on real-world programming problems. LeetCode presents competition-level problems that test the model's problem understanding and code generation skills. To prevent either the problems or their solutions from appearing in our pre-training data, we collected only the latest problems from LeetCode Contests: 180 problems in total, from July 2023 to January 2024. For each problem, we collected 100 test cases. The data format is the same as HumanEval. For more details, please refer to leetcode_contest_data.
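Because the data follows the HumanEval format, each problem can be read as one JSON object per line. The sketch below is only an illustration under stated assumptions: the field names (`task_id`, `prompt`, `entry_point`, `test`) are those of HumanEval, the sample record is made up, and the actual files in leetcode_contest_data may carry different or additional fields.

```python
import io
import json


def load_problems(stream):
    """Read one JSON object per non-empty line (JSONL)."""
    problems = []
    for line in stream:
        line = line.strip()
        if line:
            problems.append(json.loads(line))
    return problems


# Made-up record using HumanEval field names (assumed, not taken from the real data).
sample = io.StringIO(
    json.dumps(
        {
            "task_id": "example/0",
            "prompt": "def add(a, b):\n",
            "entry_point": "add",
            "test": "def check(candidate):\n    assert candidate(1, 2) == 3\n",
        }
    )
    + "\n"
)

problems = load_problems(sample)
print(len(problems), problems[0]["task_id"])
```

To inspect the real benchmark, the same function could be given an open file handle instead of the in-memory sample.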

2. Evaluation

Follow these two steps to evaluate a model's performance on our LeetCode Contest benchmark:

Run vllm_inference.py to get generation results.

```bash
cd Evaluation/LeetCode

# Set the model or path here
MODEL="deepseek-ai/deepseek-coder-7b-instruct"

python vllm_inference.py --model_name_or_path $MODEL --saved_path output/20240121-Jul.deepseek-coder-7b-instruct.jsonl
```

To evaluate the model with chain-of-thought (CoT) prompting, add --cot to the command:

```bash
python vllm_inference.py --model_name_or_path $MODEL --saved_path output/20240121-Jul.deepseek-coder-7b-instruct.jsonl --cot
```

Run evaluate_leetcode.py to get evaluation results.

```bash
python evaluate_leetcode.py --generation_path output/20240121-Jul.deepseek-coder-7b-instruct.jsonl --result_path output/20240121-Jul.deepseek-coder-7b-instruct.result.jsonl
```
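Once a result file exists, a quick summary can be computed from it. The following is a minimal sketch, not the behavior of evaluate_leetcode.py: it assumes the result file is JSONL and that each record carries a boolean `passed` field, as HumanEval-style execution results do. Both the field name and the demo records are hypothetical.

```python
import json


def pass_rate(lines):
    """Fraction of records whose (assumed) "passed" field is true."""
    records = [json.loads(line) for line in lines if line.strip()]
    if not records:
        return 0.0
    return sum(1 for r in records if r.get("passed")) / len(records)


# Hypothetical result records; the real file may use other field names.
demo_lines = [
    '{"task_id": "example/0", "passed": true}',
    '{"task_id": "example/1", "passed": false}',
    '{"task_id": "example/2", "passed": true}',
    '{"task_id": "example/3", "passed": false}',
]

rate = pass_rate(demo_lines)
print(f"pass rate: {rate:.1%}")
```

For a real run, the lines of the file passed as --result_path could be fed to `pass_rate` in place of `demo_lines`, provided the assumed field is present.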

3. Experimental Results

We report experimental results here:

| Model | Size | Easy (45) | Medium (91) | Hard (44) | Overall (180) |
|---|---|---|---|---|---|
| WizardCoder-V1.0 | 15B | 17.8% | 1.1% | 0.0% | 5.0% |
| CodeLlama-Instruct | 34B | 24.4% | 4.4% | 4.5% | 9.4% |
| Phind-CodeLlama-V2 | 34B | 26.7% | 8.8% | 9.1% | 13.3% |
| GPT-3.5-Turbo | - | 46.7% | 15.4% | 15.9% | 23.3% |
| GPT-3.5-Turbo + CoT | - | 42.2% | 15.4% | 20.5% | 23.3% |
| GPT-4-Turbo | - | 73.3% | 31.9% | 25.0% | 40.6% |
| GPT-4-Turbo + CoT | - | 71.1% | 35.2% | 25.0% | 41.8% |
| DeepSeek-Coder-Instruct | 1.3B | 22.2% | 1.1% | 4.5% | 7.2% |
| DeepSeek-Coder-Instruct + CoT | 1.3B | 22.2% | 2.2% | 2.3% | 7.2% |
| DeepSeek-Coder-Instruct | 6.7B | 44.4% | 12.1% | 9.1% | 19.4% |
| DeepSeek-Coder-Instruct + CoT | 6.7B | 44.4% | 17.6% | 4.5% | 21.1% |
| DeepSeek-Coder-Instruct | 33B | 57.8% | 22.0% | 9.1% | 27.8% |
| DeepSeek-Coder-Instruct + CoT | 33B | 53.3% | 25.3% | 11.4% | 28.9% |
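The Overall column is consistent with pooling the solved problems across the three difficulty levels, weighted by the problem counts in the header. As a worked check, the sketch below takes the DeepSeek-Coder-Instruct 33B row (without CoT) and recovers solved counts by rounding. Those counts are inferred, not reported in the table, so treat them as an assumption.

```python
# Problem counts per difficulty, from the table header.
counts = {"Easy": 45, "Medium": 91, "Hard": 44}

# Reported percentages for DeepSeek-Coder-Instruct 33B (no CoT).
rates = {"Easy": 57.8, "Medium": 22.0, "Hard": 9.1}

# Inferred number of solved problems per level (assumption: nearest integer).
solved = {level: round(rates[level] / 100 * counts[level]) for level in counts}

total = sum(counts.values())
overall = 100 * sum(solved.values()) / total

print(solved, f"{overall:.1f}%")
```

Under this assumption the row corresponds to 26 + 20 + 4 = 50 solved problems out of 180, which matches the reported 27.8%.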
