diff --git a/README.md b/README.md
index 9350451..96f7b7d 100644
--- a/README.md
+++ b/README.md
@@ -209,70 +209,6 @@ print(f"{prepare_inputs['sft_format'][0]}", answer)
 python cli_chat.py --model_path deepseek-ai/deepseek-vl-7b-chat
 ```
 
-Avoiding the use of the provided function `apply_chat_template`, you can also interact with our model following the sample template. Note that `messages` should be replaced by your input.
-
-```
-User: {messages[0]['content']}
-
-Assistant: {messages[1]['content']}<|end▁of▁sentence|>User: {messages[2]['content']}
-
-Assistant:
-```
-
-**Note:** By default (`add_special_tokens=True`), our tokenizer automatically adds a `bos_token` (`<|begin▁of▁sentence|>`) before the input text. Additionally, since the system prompt is not compatible with this version of our models, we DO NOT RECOMMEND including the system prompt in your input.
-
-### Inference with vLLM
-
-You can also employ [vLLM](https://github.com/vllm-project/vllm) for high-throughput inference.
-
-**Text Completion**
-
-```python
-from vllm import LLM, SamplingParams
-
-tp_size = 4 # Tensor Parallelism
-sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=100)
-model_name = "deepseek-ai/deepseek-llm-67b-base"
-llm = LLM(model=model_name, trust_remote_code=True, gpu_memory_utilization=0.9, tensor_parallel_size=tp_size)
-
-prompts = [
-    "If everyone in a country loves one another,",
-    "The research should also focus on the technologies",
-    "To determine if the label is correct, we need to"
-]
-outputs = llm.generate(prompts, sampling_params)
-
-generated_text = [output.outputs[0].text for output in outputs]
-print(generated_text)
-```
-
-**Chat Completion**
-
-```python
-from transformers import AutoTokenizer
-from vllm import LLM, SamplingParams
-
-tp_size = 4 # Tensor Parallelism
-sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=100)
-model_name = "deepseek-ai/deepseek-llm-67b-chat"
-tokenizer = AutoTokenizer.from_pretrained(model_name)
-llm = LLM(model=model_name, trust_remote_code=True, gpu_memory_utilization=0.9, tensor_parallel_size=tp_size)
-
-messages_list = [
-    [{"role": "user", "content": "Who are you?"}],
-    [{"role": "user", "content": "What can you do?"}],
-    [{"role": "user", "content": "Explain Transformer briefly."}],
-]
-# Avoid adding bos_token repeatedly
-prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]
-
-sampling_params.stop = [tokenizer.eos_token]
-outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)
-
-generated_text = [output.outputs[0].text for output in outputs]
-print(generated_text)
-```
-
 ## 6. FAQ
 
 ### Could You Provide the tokenizer.model File for Model Quantization?
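For reference, the section removed above documented a manual prompt format that bypasses `apply_chat_template`. The sketch below is a minimal illustration of that template, not part of the diff itself: the template string, the `eos` marker, and the `add_special_tokens=True` note are taken from the removed text, while the model name, example messages, and the decode-and-print check are illustrative assumptions.

```python
from transformers import AutoTokenizer

# Minimal sketch of the manual prompt template described in the removed section.
# The model name and example messages below are illustrative placeholders.
model_name = "deepseek-ai/deepseek-llm-67b-chat"
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [
    {"role": "user", "content": "Who are you?"},
    {"role": "assistant", "content": "I am an AI assistant."},
    {"role": "user", "content": "Explain Transformer briefly."},
]

# Assemble the prompt by hand, following the removed sample template:
#   User: ...\n\nAssistant: ...<eos>User: ...\n\nAssistant:
eos = tokenizer.eos_token  # expected to be <|end▁of▁sentence|> for this tokenizer
prompt = (
    f"User: {messages[0]['content']}\n\n"
    f"Assistant: {messages[1]['content']}{eos}"
    f"User: {messages[2]['content']}\n\n"
    "Assistant:"
)

# With add_special_tokens=True (the default), the tokenizer prepends the
# bos_token (<|begin▁of▁sentence|>) once, so it is not written into the string.
input_ids = tokenizer(prompt, add_special_tokens=True).input_ids
print(tokenizer.decode(input_ids))
```

Decoding the ids back to text is just a quick sanity check that the bos_token appears exactly once at the start of the encoded prompt.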