Merge pull request #203 from silentoplayz/patch-2

Update openedai-speech-integration.md
This commit is contained in:
Justin Hayes 2024-09-02 21:07:00 -04:00 committed by GitHub
commit 127f2cacda

@@ -3,162 +3,198 @@ sidebar_position: 12
title: "TTS - OpenedAI-Speech using Docker"
---
**Integrating `openedai-speech` into Open WebUI using Docker**
==============================================================
**What is `openedai-speech`?**
-----------------------------
:::info
[openedai-speech](https://github.com/matatonic/openedai-speech) is an OpenAI audio/speech API compatible text-to-speech server.
It serves the `/v1/audio/speech` endpoint and provides a free, private text-to-speech experience with custom voice cloning capabilities. This service is in no way affiliated with OpenAI and does not require an OpenAI API key.
:::
**Requirements**
-----------------
* Docker installed on your system
* Open WebUI running in a Docker container
* Basic understanding of Docker and Docker Compose
**Option 1: Using Docker Compose**
----------------------------------
**Step 1: Create a new folder for the `openedai-speech` service**
-----------------------------------------------------------------
Create a new folder, for example, `openedai-speech-service`, to store the `docker-compose.yml` and `speech.env` files.
**Step 2: Clone the `openedai-speech` repository from GitHub**
--------------------------------------------------------------
In the `openedai-speech-service` folder, run the following command to clone the `openedai-speech` repository:
```bash
git clone https://github.com/matatonic/openedai-speech.git
```
This will download the `openedai-speech` repository to your local machine, which includes the Docker Compose files (`docker-compose.yml`, `docker-compose.min.yml`, and `docker-compose.rocm.yml`) and other necessary files.
**Step 3: Rename the `sample.env` file to `speech.env` (Customize if needed)**
------------------------------------------------------------------------------
In the `openedai-speech` repository folder, rename the provided `sample.env` file to `speech.env` (or create a new file named `speech.env`) with the following contents:
```bash
TTS_HOME=voices
HF_HOME=voices
#PRELOAD_MODEL=xtts
#PRELOAD_MODEL=xtts_v2.0.2
#PRELOAD_MODEL=parler-tts/parler_tts_mini_v0.1
#EXTRA_ARGS=--log-level DEBUG --unload-timer 300
#USE_ROCM=1
```
**Step 4: Choose a Docker Compose file**
----------------------------------------
You can use any of the following Docker Compose files:
* [docker-compose.yml](https://github.com/matatonic/openedai-speech/blob/main/docker-compose.yml): This file uses the `ghcr.io/matatonic/openedai-speech` image and builds from [Dockerfile](https://github.com/matatonic/openedai-speech/blob/main/Dockerfile).
* [docker-compose.min.yml](https://github.com/matatonic/openedai-speech/blob/main/docker-compose.min.yml): This file uses the `ghcr.io/matatonic/openedai-speech-min` image and builds from [Dockerfile.min](https://github.com/matatonic/openedai-speech/blob/main/Dockerfile.min).
This image is a minimal version that only includes Piper support and does not require a GPU.
* [docker-compose.rocm.yml](https://github.com/matatonic/openedai-speech/blob/main/docker-compose.rocm.yml): This file uses the `ghcr.io/matatonic/openedai-speech-rocm` image and builds from [Dockerfile](https://github.com/matatonic/openedai-speech/blob/main/Dockerfile) with ROCm support.
**Step 5: Build the Chosen Docker Image**
------------------------------------------
Before running the Docker Compose file, you need to build the Docker image:
* **Nvidia GPU (CUDA support)**:
```bash
docker build -t ghcr.io/matatonic/openedai-speech .
```
* **AMD GPU (ROCm support)**:
```bash
docker build -f Dockerfile --build-arg USE_ROCM=1 -t ghcr.io/matatonic/openedai-speech-rocm .
```
* **CPU only, No GPU (Piper only)**:
```bash
docker build -f Dockerfile.min -t ghcr.io/matatonic/openedai-speech-min .
```
**Step 6: Run the correct `docker compose up -d` command**
-----------------------------------------------------------
* **Nvidia GPU (CUDA support)**: Run the following command to start the `openedai-speech` service in detached mode:
```bash
docker compose up -d
```
This will start the `openedai-speech` service in the background.
* **AMD GPU (ROCm support)**: Run the following command to start the `openedai-speech` service in detached mode:
```bash
docker compose -f docker-compose.rocm.yml up -d
```
* **ARM64 (Apple M-series, Raspberry Pi)**: XTTS only has CPU support here and will be very slow. You can use the Nvidia image for XTTS with CPU (slow), or use the Piper only image (recommended):
```bash
docker compose -f docker-compose.min.yml up -d
```
* **CPU only, No GPU (Piper only)**: For a minimal docker image with only Piper support (<1GB vs. 8GB):
```bash
docker compose -f docker-compose.min.yml up -d
```
This will start the `openedai-speech` service in detached mode.
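Whichever variant you chose, you can quickly verify that the service came up by checking the container status and logs (this assumes the container is named `openedai-speech`, as in the repository's compose files; adjust the name if yours differs):
```bash
# Confirm the container is running
docker ps --filter "name=openedai-speech"

# Follow the logs; model downloads happen on first start or preload
docker logs -f openedai-speech
```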
**Option 2: Using Docker Run Commands**
---------------------------------------
You can also use the following Docker run commands to start the `openedai-speech` service in detached mode:
* **Nvidia GPU (CUDA)**: Run the following command to build and start the `openedai-speech` service:
```bash
docker build -t ghcr.io/matatonic/openedai-speech .
docker run -d --gpus=all -p 8000:8000 -v voices:/app/voices -v config:/app/config --name openedai-speech ghcr.io/matatonic/openedai-speech
```
* **ROCm (AMD GPU)**: Run the following command to build and start the `openedai-speech` service:
> To enable ROCm support, uncomment the `#USE_ROCM=1` line in the `speech.env` file.
```bash
docker build -f Dockerfile --build-arg USE_ROCM=1 -t ghcr.io/matatonic/openedai-speech-rocm .
docker run -d --privileged --init --name openedai-speech -p 8000:8000 -v voices:/app/voices -v config:/app/config ghcr.io/matatonic/openedai-speech-rocm
```
* **CPU only, No GPU (Piper only)**: Run the following command to build and start the `openedai-speech` service:
```bash
docker build -f Dockerfile.min -t ghcr.io/matatonic/openedai-speech-min .
docker run -d -p 8000:8000 -v voices:/app/voices -v config:/app/config --name openedai-speech ghcr.io/matatonic/openedai-speech-min
```
**Configuring Open WebUI**
-------------------------
:::tip
For more information on configuring Open WebUI to use `openedai-speech`, including setting environment variables, see the [Open WebUI documentation](https://docs.openwebui.com/getting-started/env-configuration/#text-to-speech).
:::
**Step 7: Configuring Open WebUI to use `openedai-speech` for TTS**
--------------------------------------------------------------------
Open the Open WebUI settings and navigate to the TTS Settings under **Admin Panel > Settings > Audio**. Add the following configuration:
* **API Base URL**: `http://host.docker.internal:8000/v1`
* **API Key**: `sk-111111111` (Note that this is a dummy API key, as `openedai-speech` doesn't require an API key. You can use whatever you'd like for this field, as long as it is filled.)
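If you prefer to pre-configure these values rather than set them in the Admin Panel, the same settings can be supplied to the Open WebUI container as environment variables. This is a sketch based on the audio variables listed in the Open WebUI environment-configuration documentation (linked in the tip above); double-check the variable names against your Open WebUI version:
```bash
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e AUDIO_TTS_ENGINE=openai \
  -e AUDIO_TTS_OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
  -e AUDIO_TTS_OPENAI_API_KEY=sk-111111111 \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```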
**Step 8: Choose a voice**
--------------------------
Under `TTS Voice` within the same audio settings menu in the admin panel, you can set the `TTS Model` to use from the following choices that `openedai-speech` supports. The voices of these models are optimized for the English language.
* `tts-1` or `tts-1-hd`: `alloy`, `echo`, `echo-alt`, `fable`, `onyx`, `nova`, and `shimmer` (`tts-1-hd` is configurable; uses OpenAI samples by default)
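Before (or after) wiring things up in Open WebUI, you can sanity-check a voice directly against the server, since `openedai-speech` implements the OpenAI-style `/v1/audio/speech` endpoint. A minimal sketch, assuming the service is reachable on `localhost:8000`:
```bash
# Request speech for a short phrase using the "alloy" voice and save it as an MP3
curl -s http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "tts-1", "voice": "alloy", "input": "Hello from openedai-speech!"}' \
  -o hello.mp3
```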
**Step 9 (optional): Adding new voices**
-----------------------------------------
The voice wave files are stored in the `voices` volume and the configuration files are in the `config` volume. Default voices are defined in `voice_to_speaker.default.yaml`.
In order to add an additional voice, you need to:
1. Add an appropriate wave file/voice (*.wav) into the `voices` volume, for example `example.wav`.
2. Reference the newly added wave file in the `voice_to_speaker.yaml` configuration file, under the appropriate model (either `tts-1` or `tts-1-hd`), e.g.:
```yaml
example:
  model: xtts
  speaker: voices/example.wav
```
To use this new voice, simply use the string of the voice name (in this case `example`) in the Audio configuration settings for your user (or set this voice as the system default).
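As a concrete sketch of the two steps above, assuming the service is running as a container named `openedai-speech` with the `voices` volume mounted at `/app/voices`, you could place the sample file and reload the service like this:
```bash
# Copy the new voice sample into the voices volume via the running container
docker cp example.wav openedai-speech:/app/voices/example.wav

# Restart the service so the updated voice_to_speaker.yaml is picked up
docker restart openedai-speech
```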
**Step 10: Press `Save` to apply the changes and start enjoying naturally sounding voices**
--------------------------------------------------------------------------------------------
Press the `Save` button to apply the changes to your Open WebUI settings. Refresh the page for the changes to fully take effect, and enjoy using the `openedai-speech` integration within Open WebUI to read text responses aloud in a natural-sounding voice.
**Model Details:**
------------------
`openedai-speech` supports multiple text-to-speech models, each with its own strengths and requirements. The following models are available:
* **Piper TTS** (very fast, runs on CPU): Use your own [Piper voices](https://rhasspy.github.io/piper-samples/) via the `voice_to_speaker.yaml` configuration file. This model is great for applications that require low latency and high performance. Piper TTS also supports [multilingual](https://github.com/matatonic/openedai-speech#multilingual) voices.
* **Coqui AI/TTS XTTS v2** (fast, but requires around 4GB GPU VRAM & Nvidia GPU with CUDA): This model uses Coqui AI's XTTS v2 voice cloning technology to generate high-quality voices. While it requires a more powerful GPU, it provides excellent performance and high-quality audio. Coqui also supports [multilingual](https://github.com/matatonic/openedai-speech#multilingual) voices.
* **Beta Parler-TTS Support** (experimental, slower): This model uses the Parler-TTS framework to generate voices. While it's currently in beta, it allows you to describe very basic features of the speaker voice. The exact voice will be slightly different with each generation, but should be similar to the speaker description provided. Two example `parler-tts` voices are included in the `voice_to_speaker.default.yaml` file. For inspiration on how to describe voices, see [Text Description to Speech](https://www.text-description-to-speech.com/).
**Troubleshooting**
-------------------
If you encounter any problems integrating `openedai-speech` with Open WebUI, follow these troubleshooting steps:
* **Verify `openedai-speech` service**: Ensure that the `openedai-speech` service is running and the port you specified in the docker-compose.yml file is exposed.
* **Check access to host.docker.internal**: Verify that the hostname `host.docker.internal` is resolvable from within the Open WebUI container. This is necessary because `openedai-speech` is exposed via `localhost` on your PC, but `open-webui` cannot normally access it from inside its container. On Linux you may need to map the hostname to the host gateway explicitly, as shown in the sketch after this list.
* **Review API key configuration**: The API key can be set to any dummy value, since `openedai-speech` doesn't require or validate an API key; the field just needs to be filled in.
* **Check voice configuration**: Verify that the voice you are trying to use for TTS exists in your `voice_to_speaker.yaml` file and that the referenced voice files (e.g., `example.wav`) are present in the correct directory.
* **Verify voice model paths**: If you're experiencing issues with voice model loading, double-check that the paths in your `voice_to_speaker.yaml` file match the actual locations of your voice models.
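On Linux in particular, `host.docker.internal` is not defined by default; a common fix is to map it to the host gateway when starting the Open WebUI container. A sketch using Docker's built-in `host-gateway` alias:
```bash
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```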
**Additional Troubleshooting Tips**
------------------------------------
* Check the openedai-speech logs for errors or warnings that might indicate where the issue lies.
* Verify that the `docker-compose.yml` file is correctly configured for your environment.
* If you're still experiencing issues, try restarting the `openedai-speech` service or the entire Docker environment.
* If the problem persists, consult the `openedai-speech` GitHub repository or seek help on a relevant community forum.
**FAQ**
-------
**How can I control the emotional range of the generated audio?**
There is no direct mechanism to control the emotional output of the generated audio. Certain factors such as capitalization or grammar may affect the output audio, but internal testing has yielded mixed results.
**Where are the voice files stored? What about the configuration file?**
The voice wave files are stored in the `voices` volume. The configuration files, which define the available voices and their properties, are stored in the `config` volume. Specifically, the default voices are defined in `voice_to_speaker.default.yaml`.
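To see what is actually stored, you can list both mount points from the running container (a sketch, assuming the container name and mount paths used earlier in this guide):
```bash
# Configuration files, including voice_to_speaker.yaml
docker exec openedai-speech ls /app/config

# Voice samples and downloaded models (TTS_HOME/HF_HOME point here)
docker exec openedai-speech ls /app/voices
```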
**Additional Resources**
------------------------
For more information on configuring Open WebUI to use `openedai-speech`, including setting environment variables, see the [Open WebUI documentation](https://docs.openwebui.com/getting-started/env-configuration/#text-to-speech).
For more information about `openedai-speech`, please visit the [GitHub repository](https://github.com/matatonic/openedai-speech).
**How to add more voices to openedai-speech:**
[Custom-Voices-HowTo](https://github.com/matatonic/openedai-speech?tab=readme-ov-file#custom-voices-howto)
:::note
You can change the port number in the `docker-compose.yml` file to any open and usable port, but be sure to update the **API Base URL** in Open WebUI Admin Audio settings accordingly.
:::
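For example, using the Piper-only `docker run` variant from Option 2, exposing the service on host port 8123 instead of 8000 (an arbitrary choice for illustration) would look like this; the same idea applies to the `ports:` entry in the compose files:
```bash
# Map host port 8123 to the container's internal port 8000
docker run -d -p 8123:8000 -v voices:/app/voices -v config:/app/config \
  --name openedai-speech ghcr.io/matatonic/openedai-speech-min

# The API Base URL in Open WebUI then becomes:
#   http://host.docker.internal:8123/v1
```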