mirror of
https://github.com/deepseek-ai/Janus
synced 2024-12-28 14:52:12 +00:00
<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable html -->
<!-- markdownlint-disable no-duplicate-header -->

<div align="center">
  <img src="images/logo.svg" width="60%" alt="DeepSeek LLM" />
</div>
<hr>
<div align="center">

  <a href="https://www.deepseek.com/" target="_blank">
    <img alt="Homepage" src="images/badge.svg" />
  </a>
  <a href="https://huggingface.co/deepseek-ai" target="_blank">
    <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-DeepSeek%20AI-ffc107?color=ffc107&logoColor=white" />
  </a>

</div>
<div align="center">

  <!-- <a href="https://discord.gg/Tc7c45Zzu5" target="_blank">
    <img alt="Discord" src="https://img.shields.io/badge/Discord-DeepSeek%20AI-7289da?logo=discord&logoColor=white&color=7289da" />
  </a> -->
  <!-- <a href="images/qr.jpeg" target="_blank">
    <img alt="Wechat" src="https://img.shields.io/badge/WeChat-DeepSeek%20AI-brightgreen?logo=wechat&logoColor=white" />
  </a> -->
  <!-- <a href="https://twitter.com/deepseek_ai" target="_blank">
    <img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-deepseek_ai-white?logo=x&logoColor=white" />
  </a> -->

</div>
<div align="center">

  <a href="LICENSE-CODE">
    <img alt="Code License" src="https://img.shields.io/badge/Code_License-MIT-f5de53?&color=f5de53">
  </a>
  <a href="LICENSE-MODEL">
    <img alt="Model License" src="https://img.shields.io/badge/Model_License-Model_Agreement-f5de53?&color=f5de53">
  </a>
</div>
<p align="center">
  <a href="#2-model-download"><b>📥 Model Download</b></a> |
  <a href="#3-quick-start"><b>⚡ Quick Start</b></a> |
  <a href="#4-license"><b>📜 License</b></a> |
  <a href="#5-citation"><b>📖 Citation</b></a> <br>
  <a href="https://arxiv.org/abs/2410.13848"><b>📄 Paper Link</b></a>
</p>

## 1. Introduction
Janus is a novel autoregressive framework that unifies multimodal understanding and generation. It addresses the limitations of previous approaches by decoupling visual encoding into separate pathways, while still using a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder’s roles in understanding and generation, but also enhances the framework’s flexibility. Janus surpasses previous unified models and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus make it a strong candidate for next-generation unified multimodal models.

<div align="center">
<img alt="image" src="images/teaser.png" style="width:90%;">
</div>

## 2. Model Download
We release Janus to the public to support a broader and more diverse range of research within both academic and commercial communities.
Please note that the use of this model is subject to the terms outlined in the [License section](#4-license). Commercial usage is
permitted under these terms.
### Huggingface

| Model      | Sequence Length | Download                                                         |
|------------|-----------------|------------------------------------------------------------------|
| Janus-1.3B | 4096            | [🤗 Hugging Face](https://huggingface.co/deepseek-ai/Janus-1.3B) |
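
The checkpoint listed in the table can also be fetched ahead of time instead of on first use. The sketch below is one possible way to do that, assuming the separately installed `huggingface_hub` package and its `snapshot_download` function; the helper name `download_janus` is illustrative and not part of this repository.

```python
from typing import Optional

# Repository ID taken from the download table above.
REPO_ID = "deepseek-ai/Janus-1.3B"


def download_janus(local_dir: Optional[str] = None) -> str:
    """Download the Janus-1.3B snapshot and return its local path.

    Hypothetical helper; requires `pip install huggingface_hub`.
    With local_dir=None the files go to the default Hugging Face cache.
    """
    # Imported lazily so that defining the helper does not require the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
```

Calling `download_janus()` returns a directory path that can be passed as `model_path` in the Quick Start examples in place of the repository ID.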
## 3. Quick Start

### Installation

In a `Python >= 3.8` environment, install the necessary dependencies by running the following command from the repository root:

```shell
pip install -e .
```
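Before running the examples, it can help to confirm that the interpreter and packages are in place. The following is a minimal sketch using only the standard library; the names `python_is_supported` and `check_environment` are illustrative, and the module list simply mirrors the imports used by the inference examples in this README.

```python
import importlib.util
import sys
from typing import Dict, Tuple

MIN_PYTHON: Tuple[int, int] = (3, 8)
# Top-level modules imported by the inference examples in this README.
REQUIRED_MODULES = ("torch", "transformers", "janus")


def python_is_supported(version: Tuple[int, ...] = tuple(sys.version_info[:2])) -> bool:
    """Return True if the given (major, minor) version meets the minimum."""
    return version >= MIN_PYTHON


def check_environment() -> Dict[str, bool]:
    """Map each required module name to whether it can be located."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in REQUIRED_MODULES
    }


print("Python OK:", python_is_supported())
for name, found in check_environment().items():
    print(f"{name}: {'found' if found else 'missing'}")
```

A `missing` entry for `janus` usually means `pip install -e .` was not run in the active environment.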
### Simple Inference Example

#### Multimodal Understanding
```python
import torch
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
from janus.utils.io import load_pil_images

# specify the path to the model
model_path = "deepseek-ai/Janus-1.3B"
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True
)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

# the user turn carries the image; the empty Assistant turn is filled by the model
conversation = [
    {
        "role": "User",
        "content": "<image_placeholder>\nConvert the formula into latex code.",
        "images": ["images/equation.png"],
    },
    {"role": "Assistant", "content": ""},
]

# load the images and prepare the model inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation, images=pil_images, force_batchify=True
).to(vl_gpt.device)

# run the image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the model to get the response (greedy decoding)
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)
```
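The conversation structure used above can be wrapped in a small helper. This is a sketch only: `build_conversation` is a hypothetical name, and emitting one `<image_placeholder>` per image for multi-image prompts is an assumption that generalizes the single-image example rather than documented behavior.

```python
from typing import Dict, List, Sequence

IMAGE_PLACEHOLDER = "<image_placeholder>"


def build_conversation(question: str, image_paths: Sequence[str]) -> List[Dict[str, object]]:
    """Build a User/Assistant turn pair in the format shown above.

    Assumption: each image gets its own placeholder line before the question.
    """
    placeholders = "".join(f"{IMAGE_PLACEHOLDER}\n" for _ in image_paths)
    return [
        {
            "role": "User",
            "content": placeholders + question,
            "images": list(image_paths),
        },
        # Left empty so the model generates the reply.
        {"role": "Assistant", "content": ""},
    ]


conversation = build_conversation(
    "Convert the formula into latex code.", ["images/equation.png"]
)
```

With a single image path, the result is identical to the hand-written `conversation` list in the example above.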
#### Text-to-Image Generation
```python
import os
import PIL.Image
import torch
import numpy as np
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor


# specify the path to the model
model_path = "deepseek-ai/Janus-1.3B"
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True
)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

conversation = [
    {
        "role": "User",
        "content": "A stunning princess from kabul in red, white traditional clothing, blue eyes, brown hair",
    },
    {"role": "Assistant", "content": ""},
]

sft_format = vl_chat_processor.apply_sft_template_for_multi_turn_prompts(
    conversations=conversation,
    sft_format=vl_chat_processor.sft_format,
    system_prompt="",
)
prompt = sft_format + vl_chat_processor.image_start_tag


@torch.inference_mode()
def generate(
    mmgpt: MultiModalityCausalLM,
    vl_chat_processor: VLChatProcessor,
    prompt: str,
    temperature: float = 1,
    parallel_size: int = 16,
    cfg_weight: float = 5,
    image_token_num_per_image: int = 576,
    img_size: int = 384,
    patch_size: int = 16,
):
    input_ids = vl_chat_processor.tokenizer.encode(prompt)
    input_ids = torch.LongTensor(input_ids)

    # two rows per image: even rows keep the prompt (conditional); odd rows have
    # everything between the first and last token replaced by padding (unconditional)
    tokens = torch.zeros((parallel_size*2, len(input_ids)), dtype=torch.int).cuda()
    for i in range(parallel_size*2):
        tokens[i, :] = input_ids
        if i % 2 != 0:
            tokens[i, 1:-1] = vl_chat_processor.pad_id

    inputs_embeds = mmgpt.language_model.get_input_embeddings()(tokens)

    generated_tokens = torch.zeros((parallel_size, image_token_num_per_image), dtype=torch.int).cuda()

    for i in range(image_token_num_per_image):
        # reuse the key/value cache from the previous step (none on the first step)
        outputs = mmgpt.language_model.model(
            inputs_embeds=inputs_embeds,
            use_cache=True,
            past_key_values=outputs.past_key_values if i != 0 else None,
        )
        hidden_states = outputs.last_hidden_state

        logits = mmgpt.gen_head(hidden_states[:, -1, :])
        logit_cond = logits[0::2, :]
        logit_uncond = logits[1::2, :]

        # classifier-free guidance, then sample one image token per image
        logits = logit_uncond + cfg_weight * (logit_cond-logit_uncond)
        probs = torch.softmax(logits / temperature, dim=-1)

        next_token = torch.multinomial(probs, num_samples=1)
        generated_tokens[:, i] = next_token.squeeze(dim=-1)

        # feed each sampled token to both its conditional and unconditional row
        next_token = torch.cat([next_token.unsqueeze(dim=1), next_token.unsqueeze(dim=1)], dim=1).view(-1)
        img_embeds = mmgpt.prepare_gen_img_embeds(next_token)
        inputs_embeds = img_embeds.unsqueeze(dim=1)

    # decode the token grid back to pixels
    dec = mmgpt.gen_vision_model.decode_code(
        generated_tokens.to(dtype=torch.int),
        shape=[parallel_size, 8, img_size//patch_size, img_size//patch_size],
    )
    dec = dec.to(torch.float32).cpu().numpy().transpose(0, 2, 3, 1)

    # map values from [-1, 1] to [0, 255]
    dec = np.clip((dec + 1) / 2 * 255, 0, 255)

    visual_img = np.zeros((parallel_size, img_size, img_size, 3), dtype=np.uint8)
    visual_img[:, :, :] = dec

    os.makedirs('generated_samples', exist_ok=True)
    for i in range(parallel_size):
        save_path = os.path.join('generated_samples', "img_{}.jpg".format(i))
        PIL.Image.fromarray(visual_img[i]).save(save_path)


generate(
    vl_gpt,
    vl_chat_processor,
    prompt,
)
```
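Two details of `generate` are easy to check by hand: the default `image_token_num_per_image` equals the number of patches in the image grid, and the guided logits are a linear extrapolation from the unconditional logits toward the conditional ones. The sketch below reproduces both in plain Python with made-up logit values; `guided_logits` and `softmax` are illustrative helpers, not functions from the `janus` package.

```python
import math
from typing import List

# Grid arithmetic using the default arguments of generate().
img_size, patch_size = 384, 16
grid = img_size // patch_size        # patches per side
tokens_per_image = grid * grid       # one image token per patch


def guided_logits(cond: List[float], uncond: List[float], cfg_weight: float) -> List[float]:
    """Classifier-free guidance: uncond + w * (cond - uncond), element-wise."""
    return [u + cfg_weight * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits over a three-token vocabulary.
cond = [2.0, 1.0, 0.0]
uncond = [1.0, 1.0, 1.0]

logits = guided_logits(cond, uncond, cfg_weight=5.0)
probs = softmax(logits)

print(grid, tokens_per_image)
print(logits)
```

With `cfg_weight=1` the guided logits reduce to the conditional logits; larger weights push the distribution further toward tokens that the prompt makes more likely.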
## 4. License

This code repository is licensed under [the MIT License](https://github.com/deepseek-ai/DeepSeek-LLM/blob/HEAD/LICENSE-CODE). The use of Janus models is subject to the [DeepSeek Model License](https://github.com/deepseek-ai/DeepSeek-LLM/blob/HEAD/LICENSE-MODEL).
## 5. Citation

```
@misc{wu2024janus,
      title={Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation},
      author={Chengyue Wu and Xiaokang Chen and Zhiyu Wu and Yiyang Ma and Xingchao Liu and Zizheng Pan and Wen Liu and Zhenda Xie and Xingkai Yu and Chong Ruan and Ping Luo},
      year={2024},
      eprint={2410.13848},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.13848},
}
```
## 6. Contact

If you have any questions, please raise an issue or contact us at [service@deepseek.com](mailto:service@deepseek.com).