DeepSeek-VL2/README.md

<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable html -->
<!-- markdownlint-disable no-duplicate-header -->

<div align="center">
  <img src="images/logo.svg" width="60%" alt="DeepSeek LLM" />
</div>
<hr>
<div align="center">

  <a href="https://www.deepseek.com/" target="_blank">
    <img alt="Homepage" src="images/badge.svg" />
  </a>
  <a href="" target="_blank">
    <img alt="Chat" src="https://img.shields.io/badge/🤖%20Chat-DeepSeek%20VL-536af5?color=536af5&logoColor=white" />
  </a>
  <a href="https://huggingface.co/deepseek-ai" target="_blank">
    <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-DeepSeek%20AI-ffc107?color=ffc107&logoColor=white" />
  </a>

</div>


<div align="center">

  <a href="https://discord.gg/Tc7c45Zzu5" target="_blank">
    <img alt="Discord" src="https://img.shields.io/badge/Discord-DeepSeek%20AI-7289da?logo=discord&logoColor=white&color=7289da" />
  </a>
  <a href="images/qr.jpeg" target="_blank">
    <img alt="Wechat" src="https://img.shields.io/badge/WeChat-DeepSeek%20AI-brightgreen?logo=wechat&logoColor=white" />
  </a>
  <a href="https://twitter.com/deepseek_ai" target="_blank">
    <img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-deepseek_ai-white?logo=x&logoColor=white" />
  </a>

</div>

<div align="center">

  <a href="LICENSE-CODE">
    <img alt="Code License" src="https://img.shields.io/badge/Code_License-MIT-f5de53?&color=f5de53">
  </a>
  <a href="LICENSE-MODEL">
    <img alt="Model License" src="https://img.shields.io/badge/Model_License-Model_Agreement-f5de53?&color=f5de53">
  </a>
</div>


<p align="center">
  <a href="https://github.com/deepseek-ai/DeepSeek-VL2/tree/main?tab=readme-ov-file#3-model-download"><b>📥 Model Download</b></a> |
  <a href="https://github.com/deepseek-ai/DeepSeek-VL2/tree/main?tab=readme-ov-file#4-quick-start"><b>⚡ Quick Start</b></a> |
  <a href="https://github.com/deepseek-ai/DeepSeek-VL2/tree/main?tab=readme-ov-file#5-license"><b>📜 License</b></a> |
  <a href="https://github.com/deepseek-ai/DeepSeek-VL2/tree/main?tab=readme-ov-file#6-citation"><b>📖 Citation</b></a> <br>
  <a href="./DeepSeek_VL2_paper.pdf"><b>📄 Paper Link</b></a> |
  <a href="https://arxiv.org/abs/2412.10302"><b>📄 Arxiv Paper Link</b></a> |
  <a href=""><b>👁️ Demo</b></a>
</p>

## 1. Introduction

Introducing DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL. DeepSeek-VL2 demonstrates superior capabilities across various tasks, including but not limited to visual question answering, optical character recognition,  document/table/chart understanding, and visual grounding. Our model series is composed of three variants: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small and DeepSeek-VL2, with 1.0B, 2.8B and 4.5B activated parameters respectively.
DeepSeek-VL2 achieves competitive or state-of-the-art performance with similar or fewer activated parameters compared to existing open-source dense and MoE-based models.


[DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding]()

Zhiyu Wu*, Xiaokang Chen*, Zizheng Pan*, Xingchao Liu*, Wen Liu**, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, Chong Ruan*** (* Equal Contribution, ** Project Lead, *** Corresponding author)

![](./images/vl2_teaser.jpeg)

## 2. Release
✅ <b>2024-12-13</b>: DeepSeek-VL2 family released, including <code>DeepSeek-VL2-tiny</code>, <code>DeepSeek-VL2-small</code>, <code>DeepSeek-VL2</code>.

## 3. Model Download

We release the DeepSeek-VL2 family, including <code>DeepSeek-VL2-tiny</code>, <code>DeepSeek-VL2-small</code>, <code>DeepSeek-VL2</code>.
To support a broader and more diverse range of research within both academic and commercial communities.
Please note that the use of this model is subject to the terms outlined in [License section](#5-license).

### Huggingface

| Model        | Sequence Length | Download                                                                    |
|--------------|-----------------|-----------------------------------------------------------------------------|
| DeepSeek-VL2-tiny | 4096            | [🤗 Hugging Face](https://huggingface.co/deepseek-ai/deepseek-vl2-tiny) |
| DeepSeek-VL2-small | 4096            | [🤗 Hugging Face](https://huggingface.co/deepseek-ai/deepseek-vl2-small) |
| DeepSeek-VL2 | 4096            | [🤗 Hugging Face](https://huggingface.co/deepseek-ai/deepseek-vl2)   |


## 4. Quick Start

### Installation

On the basis of `Python >= 3.8` environment, install the necessary dependencies by running the following command:

```shell
pip install -e .
```

### Simple Inference Example

```python
import torch
from transformers import AutoModelForCausalLM

from deepseek_vl.models import DeepseekVLV2Processor, DeepseekVLV2ForCausalLM
from deepseek_vl.utils.io import load_pil_images


# specify the path to the model
model_path = "deepseek-ai/deepseek-vl2-small"
vl_chat_processor: DeepseekVLV2Processor = DeepseekVLV2Processor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt: DeepseekVLV2ForCausalLM = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

## single image conversation example
conversation = [
    {
        "role": "<|User|>",
        "content": "<image>\n<|ref|>The giraffe at the back.<|/ref|>.",
        "images": ["./images/visual_grounding.jpeg"],
    },
    {"role": "<|Assistant|>", "content": ""},
]

## multiple images (or in-context learning) conversation example
# conversation = [
#     {
#         "role": "User",
#         "content": "<image_placeholder>A dog wearing nothing in the foreground, "
#                    "<image_placeholder>a dog wearing a santa hat, "
#                    "<image_placeholder>a dog wearing a wizard outfit, and "
#                    "<image_placeholder>what's the dog wearing?",
#         "images": [
#             "images/dog_a.png",
#             "images/dog_b.png",
#             "images/dog_c.png",
#             "images/dog_d.png",
#         ],
#     },
#     {"role": "Assistant", "content": ""}
# ]

# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True,
    system_prompt=""
).to(vl_gpt.device)

# run image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the model to get the response
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)
```

### Gradio Demo (TODO)


### Demo
This figure present some examples of DeepSeek-VL2.
![](./images/github_demo.png)

## 5. License

This code repository is licensed under [MIT License](./LICENSE-CODE). The use of DeepSeek-VL2 models is subject to [DeepSeek Model License](./LICENSE-MODEL). DeepSeek-VL2 series supports commercial use.

## 6. Citation

```
@misc{wu2024deepseekvl2mixtureofexpertsvisionlanguagemodels,
      title={DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding}, 
      author={Zhiyu Wu and Xiaokang Chen and Zizheng Pan and Xingchao Liu and Wen Liu and Damai Dai and Huazuo Gao and Yiyang Ma and Chengyue Wu and Bingxuan Wang and Zhenda Xie and Yu Wu and Kai Hu and Jiawei Wang and Yaofeng Sun and Yukun Li and Yishi Piao and Kang Guan and Aixin Liu and Xin Xie and Yuxiang You and Kai Dong and Xingkai Yu and Haowei Zhang and Liang Zhao and Yisong Wang and Chong Ruan},
      year={2024},
      eprint={2412.10302},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.10302}, 
}
```

## 7. Contact

If you have any questions, please raise an issue or contact us at [service@deepseek.com](mailto:service@deepseek.com).
Initial commit 2024-12-13 12:38:59 +00:00			`<!-- markdownlint-disable first-line-h1 -->`
			`<!-- markdownlint-disable html -->`
			`<!-- markdownlint-disable no-duplicate-header -->`

			`<div align="center">`
			`<img src="images/logo.svg" width="60%" alt="DeepSeek LLM" />`
			`</div>`
			`<hr>`
			`<div align="center">`

			`<a href="https://www.deepseek.com/" target="_blank">`
			`<img alt="Homepage" src="images/badge.svg" />`
			`</a>`
			`<a href="" target="_blank">`
			`<img alt="Chat" src="https://img.shields.io/badge/🤖%20Chat-DeepSeek%20VL-536af5?color=536af5&logoColor=white" />`
			`</a>`
			`<a href="https://huggingface.co/deepseek-ai" target="_blank">`
			`<img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-DeepSeek%20AI-ffc107?color=ffc107&logoColor=white" />`
			`</a>`

			`</div>`


			`<div align="center">`

			`<a href="https://discord.gg/Tc7c45Zzu5" target="_blank">`
			`<img alt="Discord" src="https://img.shields.io/badge/Discord-DeepSeek%20AI-7289da?logo=discord&logoColor=white&color=7289da" />`
			`</a>`
			`<a href="images/qr.jpeg" target="_blank">`
			`<img alt="Wechat" src="https://img.shields.io/badge/WeChat-DeepSeek%20AI-brightgreen?logo=wechat&logoColor=white" />`
			`</a>`
			`<a href="https://twitter.com/deepseek_ai" target="_blank">`
			`<img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-deepseek_ai-white?logo=x&logoColor=white" />`
			`</a>`

			`</div>`

			`<div align="center">`

			`<a href="LICENSE-CODE">`
			`<img alt="Code License" src="https://img.shields.io/badge/Code_License-MIT-f5de53?&color=f5de53">`
			`</a>`
			`<a href="LICENSE-MODEL">`
			`<img alt="Model License" src="https://img.shields.io/badge/Model_License-Model_Agreement-f5de53?&color=f5de53">`
			`</a>`
			`</div>`


			`<p align="center">`
			`<a href="https://github.com/deepseek-ai/DeepSeek-VL2/tree/main?tab=readme-ov-file#3-model-download"><b>📥 Model Download</b></a> \|`
			`<a href="https://github.com/deepseek-ai/DeepSeek-VL2/tree/main?tab=readme-ov-file#4-quick-start"><b>⚡ Quick Start</b></a> \|`
			`<a href="https://github.com/deepseek-ai/DeepSeek-VL2/tree/main?tab=readme-ov-file#5-license"><b>📜 License</b></a> \|`
			`<a href="https://github.com/deepseek-ai/DeepSeek-VL2/tree/main?tab=readme-ov-file#6-citation"><b>📖 Citation</b></a> <br>`
			`<a href="./DeepSeek_VL2_paper.pdf"><b>📄 Paper Link</b></a> \|`
update arxiv link and bibtex 2024-12-16 02:39:34 +00:00			`<a href="https://arxiv.org/abs/2412.10302"><b>📄 Arxiv Paper Link</b></a> \|`
Initial commit 2024-12-13 12:38:59 +00:00			`<a href=""><b>👁️ Demo</b></a>`
			`</p>`

			`## 1. Introduction`

			Introducing DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL. DeepSeek-VL2 demonstrates superior capabilities across various tasks, including but not limited to visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. Our model series is composed of three variants: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small and DeepSeek-VL2, with 1.0B, 2.8B and 4.5B activated parameters respectively.
			`DeepSeek-VL2 achieves competitive or state-of-the-art performance with similar or fewer activated parameters compared to existing open-source dense and MoE-based models.`


			`[DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding]()`

			`Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, Chong Ruan* (* Equal Contribution, Project Lead, * Corresponding author)`

			`![](./images/vl2_teaser.jpeg)`

			`## 2. Release`
			`✅ <b>2024-12-13</b>: DeepSeek-VL2 family released, including <code>DeepSeek-VL2-tiny</code>, <code>DeepSeek-VL2-small</code>, <code>DeepSeek-VL2</code>.`

			`## 3. Model Download`

			`We release the DeepSeek-VL2 family, including <code>DeepSeek-VL2-tiny</code>, <code>DeepSeek-VL2-small</code>, <code>DeepSeek-VL2</code>.`
			`To support a broader and more diverse range of research within both academic and commercial communities.`
			`Please note that the use of this model is subject to the terms outlined in [License section](#5-license).`

			`### Huggingface`

			`\| Model \| Sequence Length \| Download \|`
			`\|--------------\|-----------------\|-----------------------------------------------------------------------------\|`
			`\| DeepSeek-VL2-tiny \| 4096 \| [🤗 Hugging Face](https://huggingface.co/deepseek-ai/deepseek-vl2-tiny) \|`
			`\| DeepSeek-VL2-small \| 4096 \| [🤗 Hugging Face](https://huggingface.co/deepseek-ai/deepseek-vl2-small) \|`
			`\| DeepSeek-VL2 \| 4096 \| [🤗 Hugging Face](https://huggingface.co/deepseek-ai/deepseek-vl2) \|`


			`## 4. Quick Start`

			`### Installation`

			On the basis of `Python >= 3.8` environment, install the necessary dependencies by running the following command:

			```shell
			`pip install -e .`
			```

			`### Simple Inference Example`

			```python
			`import torch`
			`from transformers import AutoModelForCausalLM`

			`from deepseek_vl.models import DeepseekVLV2Processor, DeepseekVLV2ForCausalLM`
			`from deepseek_vl.utils.io import load_pil_images`


			`# specify the path to the model`
			`model_path = "deepseek-ai/deepseek-vl2-small"`
			`vl_chat_processor: DeepseekVLV2Processor = DeepseekVLV2Processor.from_pretrained(model_path)`
			`tokenizer = vl_chat_processor.tokenizer`

			`vl_gpt: DeepseekVLV2ForCausalLM = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)`
			`vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()`

			`## single image conversation example`
			`conversation = [`
			`{`
			`"role": "<\|User\|>",`
			`"content": "<image>\n<\|ref\|>The giraffe at the back.<\|/ref\|>.",`
			`"images": ["./images/visual_grounding.jpeg"],`
			`},`
			`{"role": "<\|Assistant\|>", "content": ""},`
			`]`

			`## multiple images (or in-context learning) conversation example`
			`# conversation = [`
			`# {`
			`# "role": "User",`
			`# "content": "<image_placeholder>A dog wearing nothing in the foreground, "`
			`# "<image_placeholder>a dog wearing a santa hat, "`
			`# "<image_placeholder>a dog wearing a wizard outfit, and "`
			`# "<image_placeholder>what's the dog wearing?",`
			`# "images": [`
			`# "images/dog_a.png",`
			`# "images/dog_b.png",`
			`# "images/dog_c.png",`
			`# "images/dog_d.png",`
			`# ],`
			`# },`
			`# {"role": "Assistant", "content": ""}`
			`# ]`

			`# load images and prepare for inputs`
			`pil_images = load_pil_images(conversation)`
			`prepare_inputs = vl_chat_processor(`
			`conversations=conversation,`
			`images=pil_images,`
			`force_batchify=True,`
			`system_prompt=""`
			`).to(vl_gpt.device)`

			`# run image encoder to get the image embeddings`
			`inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)`

			`# run the model to get the response`
			`outputs = vl_gpt.language_model.generate(`
			`inputs_embeds=inputs_embeds,`
			`attention_mask=prepare_inputs.attention_mask,`
			`pad_token_id=tokenizer.eos_token_id,`
			`bos_token_id=tokenizer.bos_token_id,`
			`eos_token_id=tokenizer.eos_token_id,`
			`max_new_tokens=512,`
			`do_sample=False,`
			`use_cache=True`
			`)`

			`answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)`
			`print(f"{prepare_inputs['sft_format'][0]}", answer)`
			```

			`### Gradio Demo (TODO)`


			`### Demo`
			`This figure present some examples of DeepSeek-VL2.`
			`![](./images/github_demo.png)`

			`## 5. License`

			`This code repository is licensed under [MIT License](./LICENSE-CODE). The use of DeepSeek-VL2 models is subject to [DeepSeek Model License](./LICENSE-MODEL). DeepSeek-VL2 series supports commercial use.`

			`## 6. Citation`

			```
update arxiv link and bibtex 2024-12-16 02:39:34 +00:00			`@misc{wu2024deepseekvl2mixtureofexpertsvisionlanguagemodels,`
			`title={DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding},`
			`author={Zhiyu Wu and Xiaokang Chen and Zizheng Pan and Xingchao Liu and Wen Liu and Damai Dai and Huazuo Gao and Yiyang Ma and Chengyue Wu and Bingxuan Wang and Zhenda Xie and Yu Wu and Kai Hu and Jiawei Wang and Yaofeng Sun and Yukun Li and Yishi Piao and Kang Guan and Aixin Liu and Xin Xie and Yuxiang You and Kai Dong and Xingkai Yu and Haowei Zhang and Liang Zhao and Yisong Wang and Chong Ruan},`
Initial commit 2024-12-13 12:38:59 +00:00			`year={2024},`
update arxiv link and bibtex 2024-12-16 02:39:34 +00:00			`eprint={2412.10302},`
			`archivePrefix={arXiv},`
			`primaryClass={cs.CV},`
			`url={https://arxiv.org/abs/2412.10302},`
Initial commit 2024-12-13 12:38:59 +00:00			`}`
			```

			`## 7. Contact`

			`If you have any questions, please raise an issue or contact us at [service@deepseek.com](mailto:service@deepseek.com).`