# Expert-Specialized Fine-Tuning


Official repository for the paper [Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models](https://arxiv.org/abs/2407.01906) by 
[Zihan Wang](https://zihanwang314.github.io), [Deli Chen](https://victorchen96.github.io/chendeli.io/), [Damai Dai](https://scholar.google.com.hk/citations?user=8b-ysf0NWVoC&hl=zh-CN), [Runxin Xu](https://runxinxu.github.io/aboutme/), 
[Zhuoshu Li](http://www.idi.zju.edu.cn/member/3053.html) and
Y. Wu. 

**ESFT** aims to efficiently customize Large Language Models (LLMs) with a Mixture-of-Experts (MoE) architecture by tuning only the task-relevant experts, improving efficiency and performance while using less compute and storage.
 

## 📰 News

📅 **2024.9.20:** Glad to announce that ESFT has been accepted to the **EMNLP 2024 Main Conference**!

📅 **2024.8.11:** We have released the **ESFT training code**! ✨ You can now try it with your own models and datasets!


## 🚀 Quick Start 
### Installation and Setup
```bash
git clone https://github.com/deepseek-ai/ESFT.git
cd ESFT
```

### Install required dependencies
```bash
pip install transformers torch safetensors accelerate
```

### Download necessary adapters
```bash
bash scripts/download_adapters.sh
```


## 🔧Key Scripts
1. **eval_multigpu.py**
This script evaluates the performance of the model on various datasets. See **scripts/eval.sh** for detailed configs and explanations.

**Usage:**
```bash
python eval_multigpu.py \
    --eval_dataset=translation \
    --base_model_path=deepseek-ai/ESFT-vanilla-lite \
    --adapter_dir=all_models/adapters/token/translation \
    --output_path=results/completions/token/translation.jsonl \
    --openai_api_key=YOUR_OPENAI_API_KEY
```


2. **get_expert_scores.py**
This script calculates the scores for each expert based on the evaluation datasets.
**Usage:**
```bash
python scripts/expert/get_expert_scores.py \
    --eval_dataset=translation \
    --base_model_path=deepseek-ai/ESFT-vanilla-lite \
    --output_dir=results/expert_scores/translation \
    --n_sample_tokens=131072 \
    --world_size=4 \
    --gpus_per_rank=2
```
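The README does not define how the scores are computed. As a rough illustration only, the sketch below assumes a "token"-style score equal to the share of routing slots that each expert receives over the sampled tokens. The function name `token_selection_ratio` and the input layout are hypothetical and are not taken from `get_expert_scores.py`.

```python
from collections import Counter


def token_selection_ratio(routed_experts, n_experts):
    """Share of routing slots assigned to each expert.

    routed_experts: one list per sampled token, holding the expert ids
    the router picked for that token (hypothetical input layout).
    """
    counts = Counter()
    total_slots = 0
    for experts in routed_experts:
        counts.update(experts)
        total_slots += len(experts)
    if total_slots == 0:
        return [0.0] * n_experts
    return [counts[e] / total_slots for e in range(n_experts)]


# Toy example: 4 tokens, top-2 routing, 4 experts in one MoE layer.
routes = [[0, 1], [0, 2], [0, 1], [1, 3]]
scores = token_selection_ratio(routes, n_experts=4)
print(scores)  # [0.375, 0.375, 0.125, 0.125]
```

Under this assumption the scores of one layer sum to 1, which makes them directly comparable with a cumulative threshold such as `top_p`.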

3. **generate_expert_config.py**
This script generates the configuration that specifies which experts of an MoE model are trained, keeping only the task-relevant experts selected from the expert scores.
**Usage:**
```bash
python scripts/expert/generate_expert_config.py \
    --eval_datasets=intent,summary,law,translation \
    --expert_scores_dir=results/expert_scores \
    --output_dir=results/expert_configs \
    --score_function=token \
    --top_p=0.2 # the scoring function and top_p are hyperparameters
```
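To make the role of `top_p` concrete, here is a minimal sketch of one plausible reading: within each layer, keep the smallest set of highest-scoring experts whose normalized cumulative score reaches `top_p`. The function `select_experts_top_p` is hypothetical and is not the code in `generate_expert_config.py`; treat the selection rule as an assumption.

```python
def select_experts_top_p(scores, top_p):
    """Pick the highest-scoring experts until their cumulative share reaches top_p.

    scores: per-expert relevance scores for a single MoE layer.
    Returns the selected expert indices in ascending order.
    """
    total = sum(scores)
    if total <= 0:
        return []
    # Rank expert indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected = []
    cumulative = 0.0
    for idx in ranked:
        selected.append(idx)
        cumulative += scores[idx] / total
        if cumulative >= top_p:
            break
    return sorted(selected)


# Toy scores for one layer with 5 experts.
layer_scores = [0.125, 0.375, 0.125, 0.25, 0.125]
print(select_experts_top_p(layer_scores, top_p=0.2))  # [1]
print(select_experts_top_p(layer_scores, top_p=0.5))  # [1, 3]
```

A smaller `top_p` selects fewer experts per layer, so fewer parameters are trained and stored; a larger value trades that saving for more capacity.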

4. **train.py** and **train_ep.py**
These scripts train the model with the expert configuration generated by the previous script. `train_ep.py` uses expert parallelism and is optimized for multi-GPU training.
**Usage:**
```bash
python train.py \
    --base_model_path=deepseek-ai/ESFT-vanilla-lite \
    --expert_config=results/expert_configs/intent.json \
    --train_dataset=intent \
    --train_config=configs/base.yaml \
    --output_dir=results/checkpoints/intent
    
torchrun --nproc-per-node=8 train_ep.py \
    --base_model_path=deepseek-ai/ESFT-vanilla-lite \
    --expert_config=results/expert_configs/translation.json \
    --train_dataset=translation \
    --train_config=configs/base.yaml \
    --output_dir=results/checkpoints/translation

```
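As a sketch of how an expert configuration could restrict training to the selected experts, the snippet below decides per parameter name whether it should stay trainable. Both the config layout (`{"experts": {layer: [expert ids]}}`) and the parameter naming pattern (`layers.<i>.mlp.experts.<j>.`) are assumptions made for illustration; check the generated JSON and the model's `named_parameters()` before relying on either.

```python
import re

# Assumed naming pattern for routed-expert parameters (hypothetical).
EXPERT_PARAM = re.compile(r"layers\.(\d+)\.mlp\.experts\.(\d+)\.")


def is_trainable(param_name, expert_config):
    """Return True only for parameters of experts listed in the config."""
    match = EXPERT_PARAM.search(param_name)
    if match is None:
        # Non-expert modules stay frozen in this sketch.
        return False
    layer, expert = match.group(1), int(match.group(2))
    return expert in expert_config["experts"].get(layer, [])


# Hypothetical config: layer index (as a string) -> selected expert ids.
config = {"experts": {"1": [3, 17], "2": [5]}}
names = [
    "model.layers.1.mlp.experts.3.gate_proj.weight",
    "model.layers.1.mlp.experts.4.gate_proj.weight",
    "model.layers.2.mlp.experts.5.down_proj.weight",
    "model.layers.2.self_attn.q_proj.weight",
]
trainable = [n for n in names if is_trainable(n, config)]
print(trainable)
```

With a real model, the same predicate could be used to set `requires_grad` on each entry of `model.named_parameters()`, so that only the selected experts receive gradient updates.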

## Contact and Support
For bug reports, feature requests, and general inquiries, please open an issue on our GitHub Issues page. Make sure to include as much detail as possible to help us address your issue quickly.

## 🌟Todo list
- ☑️  📝 Update models, evaluation scripts, and expert selection scripts
- ☑️ 🔧 Update training scripts
- 🔲 🚀 More...


## 📚Citation
If you find our code or paper useful, please cite:
```bibtex
@article{wang2024letexpertsticklast,
      title={Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models}, 
      author={Zihan Wang and Deli Chen and Damai Dai and Runxin Xu and Zhuoshu Li and Y. Wu},
      year={2024},
      eprint={2407.01906},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.01906}, 
}
```