mirror of
https://github.com/deepseek-ai/DeepSeek-MoE
synced 2025-01-22 10:35:57 +00:00
update paper link
parent
75fe19cfe9
commit
e3e4f59b82
BIN
DeepSeekMoE.pdf
Binary file not shown.
565
README.md
@ -1,280 +1,285 @@
|
|||||||
<!-- markdownlint-disable first-line-h1 -->
|
<!-- markdownlint-disable first-line-h1 -->
|
||||||
<!-- markdownlint-disable html -->
|
<!-- markdownlint-disable html -->
|
||||||
<!-- markdownlint-disable no-duplicate-header -->
|
<!-- markdownlint-disable no-duplicate-header -->
|
||||||
|
|
||||||
<div align="center">
|
<div align="center">
|
||||||
<img src="images/logo.svg" width="60%" alt="DeepSeek LLM" />
|
<img src="images/logo.svg" width="60%" alt="DeepSeek LLM" />
|
||||||
</div>
|
</div>
|
||||||
<hr>
|
<hr>
|
||||||
<div align="center">
|
<div align="center">
|
||||||
|
|
||||||
<a href="https://www.deepseek.com/" target="_blank">
|
<a href="https://www.deepseek.com/" target="_blank">
|
||||||
<img alt="Homepage" src="images/badge.svg" />
|
<img alt="Homepage" src="images/badge.svg" />
|
||||||
</a>
|
</a>
|
||||||
<a href="https://chat.deepseek.com/" target="_blank">
|
<a href="https://chat.deepseek.com/" target="_blank">
|
||||||
<img alt="Chat" src="https://img.shields.io/badge/🤖%20Chat-DeepSeek%20LLM-536af5?color=536af5&logoColor=white" />
|
<img alt="Chat" src="https://img.shields.io/badge/🤖%20Chat-DeepSeek%20LLM-536af5?color=536af5&logoColor=white" />
|
||||||
</a>
|
</a>
|
||||||
<a href="https://huggingface.co/deepseek-ai" target="_blank">
|
<a href="https://huggingface.co/deepseek-ai" target="_blank">
|
||||||
<img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-DeepSeek%20AI-ffc107?color=ffc107&logoColor=white" />
|
<img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-DeepSeek%20AI-ffc107?color=ffc107&logoColor=white" />
|
||||||
</a>
|
</a>
|
||||||
|
|
||||||
</div>
|
</div>
|
||||||
|
|
||||||
<div align="center">
|
<div align="center">
|
||||||
|
|
||||||
<a href="https://discord.gg/Tc7c45Zzu5" target="_blank">
|
<a href="https://discord.gg/Tc7c45Zzu5" target="_blank">
|
||||||
<img alt="Discord" src="https://img.shields.io/badge/Discord-DeepSeek%20AI-7289da?logo=discord&logoColor=white&color=7289da" />
|
<img alt="Discord" src="https://img.shields.io/badge/Discord-DeepSeek%20AI-7289da?logo=discord&logoColor=white&color=7289da" />
|
||||||
</a>
|
</a>
|
||||||
<a href="images/qr.jpeg" target="_blank">
|
<a href="images/qr.jpeg" target="_blank">
|
||||||
<img alt="Wechat" src="https://img.shields.io/badge/WeChat-DeepSeek%20AI-brightgreen?logo=wechat&logoColor=white" />
|
<img alt="Wechat" src="https://img.shields.io/badge/WeChat-DeepSeek%20AI-brightgreen?logo=wechat&logoColor=white" />
|
||||||
</a>
|
</a>
|
||||||
<a href="https://twitter.com/deepseek_ai" target="_blank">
|
<a href="https://twitter.com/deepseek_ai" target="_blank">
|
||||||
<img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-deepseek_ai-white?logo=x&logoColor=white" />
|
<img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-deepseek_ai-white?logo=x&logoColor=white" />
|
||||||
</a>
|
</a>
|
||||||
|
|
||||||
</div>
|
</div>
|
||||||
|
|
||||||
<div align="center">
|
<div align="center">
|
||||||
|
|
||||||
<a href="LICENSE-CODE">
|
<a href="LICENSE-CODE">
|
||||||
<img alt="Code License" src="https://img.shields.io/badge/Code_License-MIT-f5de53?&color=f5de53">
|
<img alt="Code License" src="https://img.shields.io/badge/Code_License-MIT-f5de53?&color=f5de53">
|
||||||
</a>
|
</a>
|
||||||
<a href="LICENSE-MODEL">
|
<a href="LICENSE-MODEL">
|
||||||
<img alt="Model License" src="https://img.shields.io/badge/Model_License-Model_Agreement-f5de53?&color=f5de53">
|
<img alt="Model License" src="https://img.shields.io/badge/Model_License-Model_Agreement-f5de53?&color=f5de53">
|
||||||
</a>
|
</a>
|
||||||
</div>
|
</div>
|
||||||
|
|
||||||
|
|
||||||
<p align="center">
|
<p align="center">
|
||||||
<a href="#3-model-downloads">Model Download</a> |
|
<a href="#3-model-downloads">Model Download</a> |
|
||||||
<a href="#2-evaluation-results">Evaluation Results</a> |
|
<a href="#2-evaluation-results">Evaluation Results</a> |
|
||||||
<a href="#4-quick-start">Quick Start</a> |
|
<a href="#4-quick-start">Quick Start</a> |
|
||||||
<a href="#5-license">License</a> |
|
<a href="#5-license">License</a> |
|
||||||
<a href="#6-citation">Citation</a>
|
<a href="#6-citation">Citation</a>
|
||||||
</p>
|
</p>
|
||||||
|
|
||||||
<p align="center">
|
<p align="center">
|
||||||
<a href="https://github.com/deepseek-ai/DeepSeek-MoE/blob/main/DeepSeekMoE.pdf"><b>Paper Preview</b>👁️</a>
|
<a href="https://arxiv.org/pdf/2401.06066.pdf"><b>Paper Link</b>👁️</a>
|
||||||
</p>
|
</p>
|
||||||
|
|
||||||
|
|
||||||
## 1. Introduction
|
## 1. Introduction
|
||||||
|
|
||||||
DeepSeekMoE 16B is a Mixture-of-Experts (MoE) language model with 16.4B parameters.
|
DeepSeekMoE 16B is a Mixture-of-Experts (MoE) language model with 16.4B parameters.
|
||||||
It employs an innovative MoE architecture, which involves two principal strategies: fine-grained expert segmentation and shared experts isolation.
|
It employs an innovative MoE architecture, which involves two principal strategies: fine-grained expert segmentation and shared experts isolation.
|
||||||
It is trained from scratch on 2T tokens, and exhibits comparable performance with DeepSeek 7B and LLaMA2 7B, with only about 40% of computations.
|
It is trained from scratch on 2T tokens, and exhibits comparable performance with DeepSeek 7B and LLaMA2 7B, with only about 40% of computations.
|
||||||
For research purposes, we release the model checkpoints of DeepSeekMoE 16B Base and DeepSeekMoE 16B Chat to the public, which can be deployed on a single GPU with 40GB of memory without the need for quantization.
|
For research purposes, we release the model checkpoints of DeepSeekMoE 16B Base and DeepSeekMoE 16B Chat to the public, which can be deployed on a single GPU with 40GB of memory without the need for quantization.
|
||||||
The model code file can be found [here](https://huggingface.co/deepseek-ai/deepseek-moe-16b-base/blob/main/modeling_deepseek.py).
|
The model code file can be found [here](https://huggingface.co/deepseek-ai/deepseek-moe-16b-base/blob/main/modeling_deepseek.py).
|
||||||
|
|
||||||
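As a rough sanity check on the single-GPU claim above, the weight footprint can be estimated from the parameter count alone. This is a back-of-the-envelope sketch, not a measurement: it assumes decimal gigabytes and counts only the weights, so activations and the KV cache come on top. The helper name `weight_memory_gb` is hypothetical.

```python
# Back-of-the-envelope weight memory for a 16.4B-parameter model.
PARAMS = 16.4e9  # parameter count stated in the introduction

# Bytes per parameter for a few common storage dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(PARAMS, dtype):.1f} GB")
```

In bfloat16 the weights come to about 32.8 GB, which is consistent with fitting on a 40GB GPU without quantization, while float32 weights alone would not fit.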
## 2. Evaluation Results
|
## 2. Evaluation Results
|
||||||
|
|
||||||
### DeepSeekMoE 16B Base
|
### DeepSeekMoE 16B Base
|
||||||
|
|
||||||
We evaluate DeepSeekMoE 16B on various benchmarks and compare it with a series of models, as shown in the following.
|
We evaluate DeepSeekMoE 16B on various benchmarks and compare it with a series of models, as shown in the following.
|
||||||
|
|
||||||
- Comparison with open source models on the Open LLM Leaderboard. DeepSeekMoE 16B consistently outperforms models with a similar number of activated parameters by a large margin, and achieves comparable performance with LLaMA2 7B, which has approximately 2.5 times the activated parameters.
|
- Comparison with open source models on the Open LLM Leaderboard. DeepSeekMoE 16B consistently outperforms models with a similar number of activated parameters by a large margin, and achieves comparable performance with LLaMA2 7B, which has approximately 2.5 times the activated parameters.
|
||||||
|
|
||||||
<p align="center">
|
<p align="center">
|
||||||
<img src="images/evaluation_deepseekmoe16b_base_openllm.jpg" alt="table" width="50%">
|
<img src="images/evaluation_deepseekmoe16b_base_openllm.jpg" alt="table" width="50%">
|
||||||
</p>
|
</p>
|
||||||
|
|
||||||
- Comparison with DeepSeek 7B on our internal benchmarks. DeepSeek 7B is a dense model trained on the same corpus as DeepSeekMoE 16B. With only 40.5% of computations, DeepSeekMoE 16B achieves comparable performance with DeepSeek 7B.
|
- Comparison with DeepSeek 7B on our internal benchmarks. DeepSeek 7B is a dense model trained on the same corpus as DeepSeekMoE 16B. With only 40.5% of computations, DeepSeekMoE 16B achieves comparable performance with DeepSeek 7B.
|
||||||
|
|
||||||
<p align="center">
|
<p align="center">
|
||||||
<img src="images/evaluation_deepseekmoe16b_base_1.jpg" alt="table" width="50%">
|
<img src="images/evaluation_deepseekmoe16b_base_1.jpg" alt="table" width="50%">
|
||||||
</p>
|
</p>
|
||||||
|
|
||||||
- Comparison with LLaMA2 7B on our internal benchmarks. With only 39.6% of computations, DeepSeekMoE 16B outperforms LLaMA2 7B on the majority of benchmarks.
|
- Comparison with LLaMA2 7B on our internal benchmarks. With only 39.6% of computations, DeepSeekMoE 16B outperforms LLaMA2 7B on the majority of benchmarks.
|
||||||
|
|
||||||
<p align="center">
|
<p align="center">
|
||||||
<img src="images/evaluation_deepseekmoe16b_base_2.jpg" alt="table" width="50%">
|
<img src="images/evaluation_deepseekmoe16b_base_2.jpg" alt="table" width="50%">
|
||||||
</p>
|
</p>
|
||||||
|
|
||||||
### DeepSeekMoE 16B Chat
|
### DeepSeekMoE 16B Chat
|
||||||
|
|
||||||
We also evaluate DeepSeekMoE 16B Chat on various benchmarks and compare it with DeepSeek 7B Chat and LLaMA2 7B SFT. All of the compared models follow the same fine-tuning setting and data for fair comparison.
|
We also evaluate DeepSeekMoE 16B Chat on various benchmarks and compare it with DeepSeek 7B Chat and LLaMA2 7B SFT. All of the compared models follow the same fine-tuning setting and data for fair comparison.
|
||||||
The evaluation results are shown in the following. With only about 40% of computations, DeepSeekMoE 16B Chat achieves comparable or better performance than DeepSeek 7B Chat and LLaMA2 7B SFT.
|
The evaluation results are shown in the following. With only about 40% of computations, DeepSeekMoE 16B Chat achieves comparable or better performance than DeepSeek 7B Chat and LLaMA2 7B SFT.
|
||||||
|
|
||||||
<p align="center">
|
<p align="center">
|
||||||
<img src="images/evaluation_deepseekmoe16b_chat.jpg" alt="table" width="60%">
|
<img src="images/evaluation_deepseekmoe16b_chat.jpg" alt="table" width="60%">
|
||||||
</p>
|
</p>
|
||||||
|
|
||||||
## 3. Model Downloads
|
## 3. Model Downloads
|
||||||
|
|
||||||
We release DeepSeekMoE 16B, including both base and chat models, to the public to support a broader and more diverse range of research within both academic and commercial communities. Please **note** that the use of this model is subject to the terms outlined in the [License section](#5-license). Commercial usage is permitted under these terms.
|
We release DeepSeekMoE 16B, including both base and chat models, to the public to support a broader and more diverse range of research within both academic and commercial communities. Please **note** that the use of this model is subject to the terms outlined in the [License section](#5-license). Commercial usage is permitted under these terms.
|
||||||
|
|
||||||
### Huggingface
|
### Huggingface
|
||||||
|
|
||||||
| Model | Sequence Length | Download |
|
| Model | Sequence Length | Download |
|
||||||
|:---------------------:|:---------------:|:-----------------------------------------------------------------------:|
|
|:---------------------:|:---------------:|:-----------------------------------------------------------------------:|
|
||||||
| DeepSeekMoE 16B Base | 4096 | 🤗 [HuggingFace](https://huggingface.co/deepseek-ai/deepseek-moe-16b-base) |
|
| DeepSeekMoE 16B Base | 4096 | 🤗 [HuggingFace](https://huggingface.co/deepseek-ai/deepseek-moe-16b-base) |
|
||||||
| DeepSeekMoE 16B Chat | 4096 | 🤗 [HuggingFace](https://huggingface.co/deepseek-ai/deepseek-moe-16b-chat) |
|
| DeepSeekMoE 16B Chat | 4096 | 🤗 [HuggingFace](https://huggingface.co/deepseek-ai/deepseek-moe-16b-chat) |
|
||||||
|
|
||||||
## 4. Quick Start
|
## 4. Quick Start
|
||||||
### Installation
|
### Installation
|
||||||
|
|
||||||
In a `Python >= 3.8` environment, install the necessary dependencies by running the following command:
|
In a `Python >= 3.8` environment, install the necessary dependencies by running the following command:
|
||||||
|
|
||||||
```shell
|
```shell
|
||||||
pip install -r requirements.txt
|
pip install -r requirements.txt
|
||||||
```
|
```
|
||||||
|
|
||||||
### Inference with Huggingface's Transformers
|
### Inference with Huggingface's Transformers
|
||||||
|
|
||||||
You can directly employ [Huggingface's Transformers](https://github.com/huggingface/transformers) for model inference.
|
You can directly employ [Huggingface's Transformers](https://github.com/huggingface/transformers) for model inference.
|
||||||
|
|
||||||
**Text Completion**
|
**Text Completion**
|
||||||
|
|
||||||
```python
|
```python
|
||||||
import torch
|
import torch
|
||||||
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
|
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
|
||||||
|
|
||||||
model_name = "deepseek-ai/deepseek-moe-16b-base"
|
model_name = "deepseek-ai/deepseek-moe-16b-base"
|
||||||
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
||||||
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
|
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
|
||||||
model.generation_config = GenerationConfig.from_pretrained(model_name)
|
model.generation_config = GenerationConfig.from_pretrained(model_name)
|
||||||
model.generation_config.pad_token_id = model.generation_config.eos_token_id
|
model.generation_config.pad_token_id = model.generation_config.eos_token_id
|
||||||
|
|
||||||
text = "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is"
|
text = "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is"
|
||||||
inputs = tokenizer(text, return_tensors="pt")
|
inputs = tokenizer(text, return_tensors="pt")
|
||||||
outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)
|
outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)
|
||||||
|
|
||||||
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
|
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
|
||||||
print(result)
|
print(result)
|
||||||
```
|
```
|
||||||
|
|
||||||
**Chat Completion**
|
**Chat Completion**
|
||||||
|
|
||||||
```python
|
```python
|
||||||
import torch
|
import torch
|
||||||
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
|
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
|
||||||
|
|
||||||
model_name = "deepseek-ai/deepseek-moe-16b-chat"
|
model_name = "deepseek-ai/deepseek-moe-16b-chat"
|
||||||
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
||||||
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
|
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
|
||||||
model.generation_config = GenerationConfig.from_pretrained(model_name)
|
model.generation_config = GenerationConfig.from_pretrained(model_name)
|
||||||
model.generation_config.pad_token_id = model.generation_config.eos_token_id
|
model.generation_config.pad_token_id = model.generation_config.eos_token_id
|
||||||
|
|
||||||
messages = [
|
messages = [
|
||||||
{"role": "user", "content": "Who are you?"}
|
{"role": "user", "content": "Who are you?"}
|
||||||
]
|
]
|
||||||
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
|
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
|
||||||
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=100)
|
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=100)
|
||||||
|
|
||||||
result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
|
result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
|
||||||
print(result)
|
print(result)
|
||||||
```
|
```
|
||||||
|
|
||||||
Instead of using the provided `apply_chat_template` function, you can also interact with our model by following the sample template. Note that `messages` should be replaced by your input.
|
Instead of using the provided `apply_chat_template` function, you can also interact with our model by following the sample template. Note that `messages` should be replaced by your input.
|
||||||
|
|
||||||
```
|
```
|
||||||
User: {messages[0]['content']}
|
User: {messages[0]['content']}
|
||||||
|
|
||||||
Assistant: {messages[1]['content']}<|end▁of▁sentence|>User: {messages[2]['content']}
|
Assistant: {messages[1]['content']}<|end▁of▁sentence|>User: {messages[2]['content']}
|
||||||
|
|
||||||
Assistant:
|
Assistant:
|
||||||
```
|
```
|
||||||
|
|
||||||
**Note:** By default (`add_special_tokens=True`), our tokenizer automatically adds a `bos_token` (`<|begin▁of▁sentence|>`) before the input text. Additionally, since the system prompt is not compatible with this version of our models, we DO NOT RECOMMEND including the system prompt in your input.
|
**Note:** By default (`add_special_tokens=True`), our tokenizer automatically adds a `bos_token` (`<|begin▁of▁sentence|>`) before the input text. Additionally, since the system prompt is not compatible with this version of our models, we DO NOT RECOMMEND including the system prompt in your input.
|
||||||
|
|
||||||
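The sample template above can be rendered with a few lines of plain Python. This is a minimal sketch, assuming strictly alternating `user` and `assistant` turns and the exact whitespace shown in the template; `build_prompt` is a hypothetical helper, and `apply_chat_template` remains the authoritative implementation. The `bos_token` is deliberately left out because, per the note above, the tokenizer adds it by default.

```python
# End-of-sentence marker as it appears in the sample template.
EOS = "<|end▁of▁sentence|>"


def build_prompt(messages):
    """Render user/assistant turns following the sample template.

    Each user turn ends with an open "Assistant:" cue; each assistant
    turn fills that cue and is closed with the end-of-sentence token.
    """
    parts = []
    for msg in messages:
        if msg["role"] == "user":
            parts.append(f"User: {msg['content']}\n\nAssistant:")
        elif msg["role"] == "assistant":
            parts.append(f" {msg['content']}{EOS}")
        else:
            # System prompts are not recommended for this model version.
            raise ValueError(f"unsupported role: {msg['role']}")
    return "".join(parts)


messages = [
    {"role": "user", "content": "Who are you?"},
    {"role": "assistant", "content": "I am an assistant."},
    {"role": "user", "content": "What can you do?"},
]
print(build_prompt(messages))
```

The resulting string can be passed to the tokenizer like any other text input; comparing it against the output of `apply_chat_template` with `tokenize=False` is a reasonable way to confirm the whitespace matches.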
### How to Fine-tune DeepSeekMoE
|
### How to Fine-tune DeepSeekMoE
|
||||||
|
|
||||||
We provide the script `finetune/finetune.py` for users to fine-tune our models on downstream tasks.
|
We provide the script `finetune/finetune.py` for users to fine-tune our models on downstream tasks.
|
||||||
|
|
||||||
The script supports training with [DeepSpeed](https://github.com/microsoft/DeepSpeed). You need to install the required packages with:
|
The script supports training with [DeepSpeed](https://github.com/microsoft/DeepSpeed). You need to install the required packages with:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
pip install -r requirements.txt
|
pip install -r requirements.txt
|
||||||
```
|
```
|
||||||
|
|
||||||
Please follow [Sample Dataset Format](https://huggingface.co/datasets/garage-bAInd/Open-Platypus) to prepare your training data.
|
Please follow [Sample Dataset Format](https://huggingface.co/datasets/garage-bAInd/Open-Platypus) to prepare your training data.
|
||||||
Each item has two required fields `instruction` and `output`.
|
Each item has two required fields `instruction` and `output`.
|
||||||
|
|
||||||
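Before launching a run, it can help to check that every training item carries the two required fields. The sketch below assumes the data has already been loaded into a list of dictionaries and that both fields should be strings; the on-disk format and the helper name `find_invalid_items` are assumptions, not part of the repository.

```python
# Fields every training item must provide, per the description above.
REQUIRED_FIELDS = ("instruction", "output")


def find_invalid_items(items):
    """Return (index, bad_fields) pairs for items that are missing a
    required field or hold a non-string value in it."""
    problems = []
    for index, item in enumerate(items):
        bad = [
            field
            for field in REQUIRED_FIELDS
            if not isinstance(item.get(field), str)
        ]
        if bad:
            problems.append((index, bad))
    return problems


sample = [
    {"instruction": "Add 2 and 3.", "output": "5"},
    {"instruction": "Name a prime number."},  # no "output" field
]
print(find_invalid_items(sample))
```

An empty result means every item has both fields; extra fields are ignored by this check.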
After data preparation, you can use the sample shell script to finetune the DeepSeekMoE model.
|
After data preparation, you can use the sample shell script to finetune the DeepSeekMoE model.
|
||||||
Remember to specify `DATA_PATH`, `OUTPUT_PATH`.
|
Remember to specify `DATA_PATH`, `OUTPUT_PATH`.
|
||||||
Please also choose appropriate hyper-parameters (e.g., `learning_rate`, `per_device_train_batch_size`) according to your scenario.
|
Please also choose appropriate hyper-parameters (e.g., `learning_rate`, `per_device_train_batch_size`) according to your scenario.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
DATA_PATH="<your_data_path>"
|
DATA_PATH="<your_data_path>"
|
||||||
OUTPUT_PATH="<your_output_path>"
|
OUTPUT_PATH="<your_output_path>"
|
||||||
MODEL_PATH="<your_model_path>"
|
MODEL_PATH="<your_model_path>"
|
||||||
|
|
||||||
cd finetune
|
cd finetune
|
||||||
deepspeed finetune.py \
|
deepspeed finetune.py \
|
||||||
--model_name_or_path $MODEL_PATH \
|
--model_name_or_path $MODEL_PATH \
|
||||||
--data_path $DATA_PATH \
|
--data_path $DATA_PATH \
|
||||||
--output_dir $OUTPUT_PATH \
|
--output_dir $OUTPUT_PATH \
|
||||||
--num_train_epochs 3 \
|
--num_train_epochs 3 \
|
||||||
--model_max_length 1024 \
|
--model_max_length 1024 \
|
||||||
--per_device_train_batch_size 16 \
|
--per_device_train_batch_size 16 \
|
||||||
--per_device_eval_batch_size 1 \
|
--per_device_eval_batch_size 1 \
|
||||||
--gradient_accumulation_steps 4 \
|
--gradient_accumulation_steps 4 \
|
||||||
--evaluation_strategy "no" \
|
--evaluation_strategy "no" \
|
||||||
--save_strategy "steps" \
|
--save_strategy "steps" \
|
||||||
--save_steps 100 \
|
--save_steps 100 \
|
||||||
--save_total_limit 100 \
|
--save_total_limit 100 \
|
||||||
--learning_rate 2e-5 \
|
--learning_rate 2e-5 \
|
||||||
--warmup_steps 10 \
|
--warmup_steps 10 \
|
||||||
--logging_steps 1 \
|
--logging_steps 1 \
|
||||||
--lr_scheduler_type "cosine" \
|
--lr_scheduler_type "cosine" \
|
||||||
--gradient_checkpointing True \
|
--gradient_checkpointing True \
|
||||||
--report_to "tensorboard" \
|
--report_to "tensorboard" \
|
||||||
--deepspeed configs/ds_config_zero3.json \
|
--deepspeed configs/ds_config_zero3.json \
|
||||||
--bf16 True \
|
--bf16 True \
|
||||||
--use_lora False
|
--use_lora False
|
||||||
```
|
```
|
||||||
|
|
||||||
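When adjusting `per_device_train_batch_size` and `gradient_accumulation_steps`, keep in mind that the number of sequences contributing to each optimizer step is their product times the number of GPUs. A small sketch of that arithmetic, using the values from the sample command and a hypothetical 8-GPU node (the GPU count is an assumption, not something the script prescribes):

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_devices: int) -> int:
    """Sequences consumed per optimizer step under data parallelism."""
    return per_device_batch * grad_accum_steps * num_devices


# Values from the sample command: 16 per device, 4 accumulation steps.
# The device count of 8 is a hypothetical example.
print(effective_batch_size(16, 4, 8))  # 512
```

If GPU memory forces a smaller per-device batch, raising the accumulation steps by the same factor keeps this product unchanged.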
You can also fine-tune the model with 4-bit or 8-bit QLoRA:
|
You can also fine-tune the model with 4-bit or 8-bit QLoRA:
|
||||||
```bash
|
```bash
|
||||||
DATA_PATH="<your_data_path>"
|
DATA_PATH="<your_data_path>"
|
||||||
OUTPUT_PATH="<your_output_path>"
|
OUTPUT_PATH="<your_output_path>"
|
||||||
MODEL_PATH="<your_model_path>"
|
MODEL_PATH="<your_model_path>"
|
||||||
|
|
||||||
cd finetune
|
cd finetune
|
||||||
deepspeed finetune.py \
|
deepspeed finetune.py \
|
||||||
--model_name_or_path $MODEL_PATH \
|
--model_name_or_path $MODEL_PATH \
|
||||||
--data_path $DATA_PATH \
|
--data_path $DATA_PATH \
|
||||||
--output_dir $OUTPUT_PATH \
|
--output_dir $OUTPUT_PATH \
|
||||||
--num_train_epochs 3 \
|
--num_train_epochs 3 \
|
||||||
--model_max_length 1024 \
|
--model_max_length 1024 \
|
||||||
--per_device_train_batch_size 16 \
|
--per_device_train_batch_size 16 \
|
||||||
--per_device_eval_batch_size 1 \
|
--per_device_eval_batch_size 1 \
|
||||||
--gradient_accumulation_steps 4 \
|
--gradient_accumulation_steps 4 \
|
||||||
--evaluation_strategy "no" \
|
--evaluation_strategy "no" \
|
||||||
--save_strategy "steps" \
|
--save_strategy "steps" \
|
||||||
--save_steps 100 \
|
--save_steps 100 \
|
||||||
--save_total_limit 100 \
|
--save_total_limit 100 \
|
||||||
--learning_rate 2e-5 \
|
--learning_rate 2e-5 \
|
||||||
--warmup_steps 10 \
|
--warmup_steps 10 \
|
||||||
--logging_steps 1 \
|
--logging_steps 1 \
|
||||||
--lr_scheduler_type "cosine" \
|
--lr_scheduler_type "cosine" \
|
||||||
--gradient_checkpointing True \
|
--gradient_checkpointing True \
|
||||||
--report_to "tensorboard" \
|
--report_to "tensorboard" \
|
||||||
--deepspeed configs/ds_config_zero2_no_offload.json \
|
--deepspeed configs/ds_config_zero2_no_offload.json \
|
||||||
--bf16 True \
|
--bf16 True \
|
||||||
--use_lora True \
|
--use_lora True \
|
||||||
--bits 4 \
|
--bits 4 \
|
||||||
--max_grad_norm 0.3 \
|
--max_grad_norm 0.3 \
|
||||||
--double_quant \
|
--double_quant \
|
||||||
--lora_r 64 \
|
--lora_r 64 \
|
||||||
--lora_alpha 16 \
|
--lora_alpha 16 \
|
||||||
--quant_type nf4
|
--quant_type nf4
|
||||||
```
|
```
|
||||||
|
|
||||||
## 5. License
|
## 5. License
|
||||||
This code repository is licensed under the MIT License. The use of DeepSeekMoE models is subject to the Model License. DeepSeekMoE supports commercial use.
|
This code repository is licensed under the MIT License. The use of DeepSeekMoE models is subject to the Model License. DeepSeekMoE supports commercial use.
|
||||||
|
|
||||||
See the [LICENSE-CODE](LICENSE-CODE) and [LICENSE-MODEL](LICENSE-MODEL) for more details.
|
See the [LICENSE-CODE](LICENSE-CODE) and [LICENSE-MODEL](LICENSE-MODEL) for more details.
|
||||||
|
|
||||||
## 6. Citation
|
## 6. Citation
|
||||||
|
|
||||||
```
|
```
|
||||||
@article{deepseekmoe,
|
@article{dai2024deepseekmoe,
|
||||||
[coming soon]
|
author={Damai Dai and Chengqi Deng and Chenggang Zhao and R. X. Xu and Huazuo Gao and Deli Chen and Jiashi Li and Wangding Zeng and Xingkai Yu and Y. Wu and Zhenda Xie and Y. K. Li and Panpan Huang and Fuli Luo and Chong Ruan and Zhifang Sui and Wenfeng Liang},
|
||||||
}
|
title={DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models},
|
||||||
```
|
journal = {CoRR},
|
||||||
|
volume = {abs/2401.06066},
|
||||||
|
year = {2024},
|
||||||
## 7. Contact
|
url = {https://arxiv.org/abs/2401.06066},
|
||||||
|
}
|
||||||
If you have any questions, please raise an issue or contact us at [service@deepseek.com](mailto:service@deepseek.com).
|
```
|
||||||
|
|
||||||
|
|
||||||
|
## 7. Contact
|
||||||
|
|
||||||
|
If you have any questions, please raise an issue or contact us at [service@deepseek.com](mailto:service@deepseek.com).
|
||||||
|