# Mini-Omni

<p align="center"><strong style="font-size: 18px;">
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
</strong>
</p>

<p align="center">
🤗 <a href="https://huggingface.co/gpt-omni/mini-omni">Hugging Face</a>   | 📖 <a href="https://github.com/gpt-omni/mini-omni">Github</a> 
|     📑 <a href="https://arxiv.org/abs/2408.16725">Technical report</a> |
🤗 <a href="https://huggingface.co/datasets/gpt-omni/VoiceAssistant-400K">Datasets</a>
</p>

Mini-Omni is an open-source multimodal large language model that can **hear, talk while thinking**. It features real-time end-to-end speech input and **streaming audio output** for conversation.

<p align="center">
    <img src="data/figures/frameworkv3.jpg" width="100%"/>
</p>


## Updates

- **2024.09:** **VoiceAssistant-400K** is uploaded to [Hugging Face](https://huggingface.co/datasets/gpt-omni/VoiceAssistant-400K).

## Features

✅ **Real-time speech-to-speech** conversational capabilities. No extra ASR or TTS models required.

✅ **Talking while thinking**, with the ability to generate text and audio at the same time.

✅ **Streaming audio output** capabilities.

✅ **Batch inference** in "Audio-to-Text" and "Audio-to-Audio" modes to further boost performance.

## Demo

NOTE: unmute the video first.

https://github.com/user-attachments/assets/03bdde05-9514-4748-b527-003bea57f118


## Install

Create a new conda environment and install the required packages:

```sh
conda create -n omni python=3.10
conda activate omni

git clone https://github.com/gpt-omni/mini-omni.git
cd mini-omni
pip install -r requirements.txt
```

## Quick start

**Interactive demo**

- start server

NOTE: start the server before running the streamlit or gradio demo, and set `API_URL` to the server address.

```sh
sudo apt-get install ffmpeg
conda activate omni
cd mini-omni
python3 server.py --ip '0.0.0.0' --port 60808
```
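Before launching either demo, it can help to confirm that the server is accepting connections. The following is a minimal sketch, not part of the repository: the helper name `server_is_up` is hypothetical, and it assumes only that the server listens on the TCP port given in `API_URL` (60808 in the command above). It checks reachability only and does not send a request to the `/chat` endpoint.

```python
import os
import socket
from urllib.parse import urlparse


def server_is_up(api_url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the host and port in api_url succeeds."""
    parsed = urlparse(api_url)
    host = parsed.hostname or "127.0.0.1"
    # 0.0.0.0 is a bind address, so connect through loopback instead.
    if host == "0.0.0.0":
        host = "127.0.0.1"
    port = parsed.port or 80
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Falls back to the address used in this README if API_URL is unset.
    url = os.environ.get("API_URL", "http://127.0.0.1:60808/chat")
    print(f"{url} reachable: {server_is_up(url)}")
```

If the check reports that the address is not reachable, verify that `server.py` is still running and that the port in `API_URL` matches the `--port` argument.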


- run streamlit demo

NOTE: you need to run streamlit **locally** with PyAudio installed. If you see `ModuleNotFoundError: No module named 'utils.vad'`, run `export PYTHONPATH=./` first.

```sh
pip install PyAudio==0.2.14
API_URL=http://0.0.0.0:60808/chat streamlit run webui/omni_streamlit.py
```

- run gradio demo
```sh
API_URL=http://0.0.0.0:60808/chat python3 webui/omni_gradio.py
```

example:

NOTE: unmute the video first. Gradio does not seem to play the audio stream instantly, so the latency feels a bit longer.

https://github.com/user-attachments/assets/29187680-4c42-47ff-b352-f0ea333496d9


**Local test**

```sh
conda activate omni
cd mini-omni
# test run the preset audio samples and questions
python inference.py
```

## FAQ

**1. Does the model support other languages?**

No, the model is trained only on English. However, because we use whisper as the audio encoder, the model can understand other languages supported by whisper (such as Chinese), but the output is only in English.

**2. What is `post_adapter` in the code? Does the open-source version support tts-adapter?**

The `post_adapter` in model.py is the `tts-adapter`, but the open-source version does not support `tts-adapter`.

**3. Error: `ModuleNotFoundError: No module named 'utils.xxxx'`**

Run `export PYTHONPATH=./` from the repository root first. There is no need to run `pip install utils`; if you have already installed it, try `pip uninstall utils`.
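To see which `utils` Python would actually pick up, a small diagnostic can help. This is a sketch under assumptions, not code from the repository: the helper name `locate_module` is hypothetical, and the script is assumed to be run from the repository root. It uses only the standard library's `importlib.util.find_spec`, which looks up a top-level module without importing it.

```python
import importlib.util
import sys
from pathlib import Path
from typing import Optional


def locate_module(name: str) -> Optional[str]:
    """Return where a top-level module would be loaded from, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    if spec.origin:
        return spec.origin
    # Namespace packages have no origin, so report their first search location.
    locations = list(spec.submodule_search_locations or [])
    return locations[0] if locations else None


if __name__ == "__main__":
    # In-process counterpart of `export PYTHONPATH=./`: search the current directory first.
    sys.path.insert(0, str(Path.cwd()))
    print("utils resolves to:", locate_module("utils"))
```

If the printed path points into `site-packages` rather than the repository, an unrelated `utils` package installed with pip is likely shadowing the local one.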

**4. Error: cannot run streamlit in a local browser with a remote streamlit server** (issue: https://github.com/gpt-omni/mini-omni/issues/37)

You need to start streamlit **locally** with PyAudio installed.


## Acknowledgements 

- [Qwen2](https://github.com/QwenLM/Qwen2/) as the LLM backbone.
- [litGPT](https://github.com/Lightning-AI/litgpt/) for training and inference.
- [whisper](https://github.com/openai/whisper/)  for audio encoding.
- [snac](https://github.com/hubertsiuzdak/snac/)  for audio decoding.
- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) for generating synthetic speech.
- [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) and [MOSS](https://github.com/OpenMOSS/MOSS/tree/main) for alignment.

## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=gpt-omni/mini-omni&type=Date)](https://star-history.com/#gpt-omni/mini-omni&Date)