diff --git a/README.md b/README.md
index 89d3858..36f84f1 100644
--- a/README.md
+++ b/README.md
@@ -193,6 +193,8 @@ Each item has two required fields `instruction` and `output`.
 
 After data preparation, you can use the sample shell script to finetune the DeepSeekMoE model. Remember to specify `DATA_PATH`, `OUTPUT_PATH`. And please choose appropriate hyper-parameters(e.g., `learning_rate`, `per_device_train_batch_size`) according to your scenario.
+We use flash_attention2 by default. For the list of devices supported by flash_attention, see [here](https://github.com/Dao-AILab/flash-attention).
+For this configuration, `zero_stage` needs to be set to 3, and we run it on eight A100 40GB GPUs.
 
 ```bash
 DATA_PATH=""
@@ -224,7 +226,7 @@ deepspeed finetune.py \
     --use_lora False
 ```
 
-You can also finetune the model with 4/8-bits qlora, feel free to try it.
+You can also finetune the model with 4/8-bit QLoRA; feel free to try it. This configuration can run on a single A100 80GB GPU, and you can adjust it according to your resources.
 ```bash
 DATA_PATH=""
 OUTPUT_PATH=""
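The patch requires `zero_stage` to be set to 3 for the flash_attention2 setup on eight A100 40GB GPUs. Below is a minimal sketch of what a DeepSpeed ZeRO stage-3 config and launch could look like; the file name `ds_config_zero3.json`, the `--num_gpus` launcher flag, and passing the config through the standard Hugging Face `--deepspeed` argument are assumptions not taken from the patch, and the remaining script flags follow the existing `finetune.py` example in the README.

```bash
# Sketch of a DeepSpeed ZeRO stage-3 config, matching the note that
# zero_stage must be 3 for the flash_attention2 / 8x A100 40GB setup.
# The file name is arbitrary; "auto" values are resolved by the
# Hugging Face Trainer integration.
cat > ds_config_zero3.json <<'EOF'
{
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "train_batch_size": "auto"
}
EOF

# Hypothetical launch on eight GPUs; --deepspeed assumes finetune.py
# parses standard Hugging Face TrainingArguments. Other flags (DATA_PATH,
# OUTPUT_PATH, learning_rate, ...) follow the sample script in the README.
deepspeed --num_gpus 8 finetune.py \
    --deepspeed ds_config_zero3.json \
    --use_lora False
```

ZeRO stage 3 shards parameters, gradients, and optimizer states across the eight GPUs, which is what lets the full model fit on 40GB cards; the QLoRA path in the patch avoids that requirement by quantizing the base weights, so it fits on a single 80GB GPU.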