xtts: +AMD gpu ROCm, +Apple MPS

This commit is contained in:
matatonic 2024-06-24 20:35:07 -04:00
parent e9bbc56523
commit 72c7b799b9
12 changed files with 89 additions and 26 deletions

View File

@ -10,7 +10,11 @@ RUN apt-get clean && rm -rf /var/lib/apt/lists/*
WORKDIR /app
RUN mkdir -p voices config
COPY requirements.txt /app/
ARG USE_ROCM
ENV USE_ROCM=${USE_ROCM}
COPY requirements*.txt /app/
RUN if [ "${USE_ROCM}" = "1" ]; then mv /app/requirements-rocm.txt /app/requirements.txt; fi
RUN --mount=type=cache,target=/root/.cache/pip pip install -r requirements.txt
COPY speech.py openedai.py say.py *.sh *.default.yaml README.md LICENSE /app/

View File

@ -30,10 +30,12 @@ If you find a better voice match for `tts-1` or `tts-1-hd`, please let me know s
Version 0.13.0, 2024-06-22
* Added [Custom fine-tuned XTTS model support](#custom-fine-tuned-model-support)
* Initial prebuilt arm64 image support (Apple M1/2/3, Raspberry Pi), thanks @JakeStevenson, @hchasens
* Initial prebuilt arm64 image support with MPS (Apple M-series, Raspberry Pi), thanks @JakeStevenson, @hchasens
* Initial AMD GPU (ROCm 5.7) support; set `USE_ROCM=1` when building the docker image, or use requirements-rocm.txt
* Parler-tts support removed
* Move the *.default.yaml to the root folder
* Added 'audio_reader.py' for streaming text input and reading long texts
* Run the docker as a service by default (`restart: unless-stopped`)
* Added `audio_reader.py` for streaming text input and reading long texts
Version 0.12.3, 2024-06-17
@ -84,10 +86,12 @@ Version: 0.7.3, 2024-03-20
## Installation instructions
1) Copy the `sample.env` to `speech.env` (customize if needed)
1. Copy the `sample.env` to `speech.env` (customize if needed)
```bash
cp sample.env speech.env
```
#### AMD GPU (ROCm support)
> If you have an AMD GPU and want to use ROCm, set `USE_ROCM=1` in the `speech.env` before building the docker image. You will need to `docker compose build` before running the container in the next step.
2. Option: Docker (**recommended**) (prebuilt images are available)
@ -95,9 +99,8 @@ Run the server:
```shell
docker compose up
```
For a minimal docker image with only piper support (<1GB vs. 8GB), use `docker compose -f docker-compose.min.yml up`
> For a minimal docker image with only piper support (<1GB vs. 8GB), use `docker compose -f docker-compose.min.yml up`
To install the docker image as a service, edit the `docker-compose.yml` and uncomment `restart: unless-stopped`, then start the service with: `docker compose up -d`
2. Option: Manual installation:
@ -107,7 +110,7 @@ sudo apt install curl ffmpeg
# Create & activate a new virtual environment (optional but recommended)
python -m venv .venv
source .venv/bin/activate
# Install the Python requirements
# Install the Python requirements - use requirements-rocm.txt for AMD GPU (ROCm support)
pip install -r requirements.txt
# run the server
bash startup.sh
@ -156,7 +159,7 @@ curl http://localhost:8000/v1/audio/speech -H "Content-Type: application/json" -
Or just like this:
```shell
curl http://localhost:8000/v1/audio/speech -H "Content-Type: application/json" -d '{
curl -s http://localhost:8000/v1/audio/speech -H "Content-Type: application/json" -d '{
"input": "The quick brown fox jumped over the lazy dog."}' > speech.mp3
```
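The same request can be sketched in Python with only the standard library. `build_speech_request` and `fetch_speech` are illustrative names, not part of the project; the server is assumed to apply its own defaults for omitted fields, as in the minimal curl example above.

```python
import json
import urllib.request

def build_speech_request(text, model="tts-1", voice="alloy"):
    """Build the JSON payload for /v1/audio/speech (fields per the curl examples)."""
    return {"model": model, "voice": voice, "input": text}

def fetch_speech(text, url="http://localhost:8000/v1/audio/speech"):
    """POST the payload and return the raw audio bytes (mp3 by default)."""
    payload = json.dumps(build_speech_request(text)).encode("utf-8")
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# usage (requires a running server):
#   audio = fetch_speech("The quick brown fox jumped over the lazy dog.")
#   open("speech.mp3", "wb").write(audio)
```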
@ -184,8 +187,10 @@ with client.audio.speech.with_streaming_response.create(
Also see the `say.py` sample application for an example of how to use the openai-python API.
```shell
python say.py -t "The quick brown fox jumped over the lazy dog." -p # play the audio, requires 'pip install playsound'
python say.py -t "The quick brown fox jumped over the lazy dog." -m tts-1-hd -v onyx -f flac -o fox.flac # save to a file.
# play the audio, requires 'pip install playsound'
python say.py -t "The quick brown fox jumped over the lazy dog." -p
# save to a file in flac format
python say.py -t "The quick brown fox jumped over the lazy dog." -m tts-1-hd -v onyx -f flac -o fox.flac
```
```
@ -212,6 +217,28 @@ options:
```
You can also try the included `audio_reader.py` for listening to longer text and streamed input.
```
usage: audio_reader.py [-h] [-m MODEL] [-v VOICE] [-s SPEED]
Text to speech player
options:
-h, --help show this help message and exit
-m MODEL, --model MODEL
The OpenAI model (default: tts-1)
-v VOICE, --voice VOICE
The voice to use (default: alloy)
-s SPEED, --speed SPEED
How fast to read the audio (default: 1.0)
```
Example usage:
```bash
$ python audio_reader.py -s 2 < LICENSE
```
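A reader like `audio_reader.py` has to break long input into pieces it can fetch and play incrementally. The chunker below is purely illustrative (not the script's actual implementation): it packs whole sentences into bounded-size chunks.

```python
import re

def chunk_text(text, max_len=500):
    """Greedily pack whole sentences into chunks of at most max_len characters,
    so each chunk can be synthesized and played while the next is fetched."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_len:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```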
## Custom Voices Howto
### Piper
@ -260,13 +287,13 @@ For example:
...
tts-1-hd:
me:
model: xtts_v2.0.2 # you can specify different xtts versions
model: xtts
speaker: voices/me.wav # this could be you
```
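A config entry like the one above maps a (model, voice) pair to an xtts model and a speaker wav. A minimal sketch of how that lookup could work at request time; `resolve_voice` is a hypothetical helper, not the server's actual code, and the field names simply mirror the YAML shown.

```python
def resolve_voice(config, model, voice):
    """Return the (xtts model, speaker wav) pair for a configured voice."""
    entry = config.get(model, {}).get(voice)
    if entry is None:
        raise KeyError(f"no voice {voice!r} configured for {model!r}")
    return entry["model"], entry["speaker"]

# the YAML example above, expressed as the equivalent Python dict:
config = {
    "tts-1-hd": {
        "me": {"model": "xtts", "speaker": "voices/me.wav"},
    }
}
```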
## Multilingual
Multilingual support was added in version 0.11.0 and is available only with the XTTS v2 model.
Multilingual cloning support was added in version 0.11.0 and is available only with the XTTS v2 model. To use multilingual voices with piper, simply download a language-specific voice.
Coqui XTTSv2 has support for 16 languages: English (`en`), Spanish (`es`), French (`fr`), German (`de`), Italian (`it`), Portuguese (`pt`), Polish (`pl`), Turkish (`tr`), Russian (`ru`), Dutch (`nl`), Czech (`cs`), Arabic (`ar`), Chinese (`zh-cn`), Japanese (`ja`), Hungarian (`hu`) and Korean (`ko`).
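For quick validation, the 16 language codes above can be captured as a set; `is_supported_language` is an illustrative helper, not part of the project.

```python
# The 16 XTTSv2 language codes listed above.
XTTS_LANGUAGES = {
    "en", "es", "fr", "de", "it", "pt", "pl", "tr",
    "ru", "nl", "cs", "ar", "zh-cn", "ja", "hu", "ko",
}

def is_supported_language(code):
    """Check a requested language code against the XTTSv2 list (case-insensitive)."""
    return code.lower() in XTTS_LANGUAGES
```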

View File

@ -101,9 +101,9 @@ if __name__ == "__main__":
description='Text to speech player',
formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('-m', '--model', action='store', default="tts-1")
parser.add_argument('-v', '--voice', action='store', default="alloy")
parser.add_argument('-s', '--speed', action='store', default=1.0)
parser.add_argument('-m', '--model', action='store', default="tts-1", help="The OpenAI model")
parser.add_argument('-v', '--voice', action='store', default="alloy", help="The voice to use")
parser.add_argument('-s', '--speed', action='store', default=1.0, help="How fast to read the audio")
args = parser.parse_args()

View File

@ -10,4 +10,4 @@ services:
- ./voices:/app/voices
- ./config:/app/config
# To install as a service
#restart: unless-stopped
restart: unless-stopped

View File

@ -10,9 +10,7 @@ services:
- ./voices:/app/voices
- ./config:/app/config
# To install as a service
#restart: unless-stopped
# Set nvidia runtime if it's not the default
#runtime: nvidia
restart: unless-stopped
deploy:
resources:
reservations:

requirements-rocm.txt (new file, 14 lines)
View File

@ -0,0 +1,14 @@
fastapi
uvicorn
loguru
# piper-tts
piper-tts==1.2.0
# xtts
TTS
# XXX: spacy 3.8+ has some issue for now
spacy==3.7.4
# torch==2.2.2 Fixes: https://github.com/matatonic/openedai-speech/issues/9
# Re: https://github.com/pytorch/pytorch/issues/121834
--extra-index-url https://download.pytorch.org/whl/rocm5.7
torch==2.2.2; sys_platform == "linux"
torchaudio==2.2.2; sys_platform == "linux"

View File

@ -5,8 +5,15 @@ loguru
piper-tts==1.2.0
# xtts
TTS
# Fixes: https://github.com/matatonic/openedai-speech/issues/9
# Re: https://github.com/pytorch/pytorch/issues/121834
torch==2.2.2
# XXX: spacy 3.8+ has some issue for now
spacy==3.7.4
# torch==2.2.2 Fixes: https://github.com/matatonic/openedai-speech/issues/9
# Re: https://github.com/pytorch/pytorch/issues/121834
torch==2.2.2
torchaudio==2.2.2
# on macOS the standard PyPI wheels already include MPS (Apple Silicon) support
# ROCm (Linux only) - use requirements-rocm.txt

View File

@ -2,3 +2,5 @@ TTS_HOME=voices
HF_HOME=voices
#PRELOAD_MODEL=xtts
#PRELOAD_MODEL=xtts_v2.0.2
#EXTRA_ARGS=--log-level DEBUG
#USE_ROCM=1

View File

@ -205,12 +205,21 @@ async def generate_speech(request: GenerateSpeechRequest):
return StreamingResponse(content=ffmpeg_proc.stdout, media_type=media_type)
def auto_torch_device():
try:
import torch
return 'cuda' if torch.cuda.is_available() else 'mps' if (torch.backends.mps.is_available() and torch.backends.mps.is_built()) else 'cpu'
except ImportError:
return 'none'
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description='OpenedAI Speech API Server',
formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('--xtts_device', action='store', default="cuda", help="Set the device for the xtts model. The special value of 'none' will use piper for all models.")
parser.add_argument('--xtts_device', action='store', default=auto_torch_device(), help="Set the device for the xtts model. The special value of 'none' will use piper for all models.")
parser.add_argument('--preload', action='store', default=None, help="Preload a model (Ex. 'xtts' or 'xtts_v2.0.2'). By default it's loaded on first use.")
parser.add_argument('-P', '--port', action='store', default=8000, type=int, help="Server tcp port")
parser.add_argument('-H', '--host', action='store', default='0.0.0.0', help="Host to listen on, Ex. 0.0.0.0")
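The fallback order that `auto_torch_device()` implements (CUDA first, then MPS, then CPU, with `'none'` when torch is missing) can be restated without importing torch; `pick_device` below is purely illustrative.

```python
def pick_device(torch_installed, cuda_ok=False, mps_ok=False):
    """Same precedence as auto_torch_device(): cuda > mps > cpu, 'none' without torch."""
    if not torch_installed:
        return 'none'
    if cuda_ok:
        return 'cuda'
    if mps_ok:
        return 'mps'
    return 'cpu'
```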

View File

@ -5,4 +5,4 @@ set /p < speech.env
call download_voices_tts-1.bat
call download_voices_tts-1-hd.bat %PRELOAD_MODEL%
python speech.py %PRELOAD_MODEL:+--preload %PRELOAD_MODEL% %OPENEDAI_LOG_LEVEL:+--log-level %OPENEDAI_LOG_LEVEL%
rem batch has no bash-style ${VAR:+...} expansion; build the optional flag explicitly
if defined PRELOAD_MODEL (set "PRELOAD_ARG=--preload %PRELOAD_MODEL%") else (set "PRELOAD_ARG=")
python speech.py %PRELOAD_ARG% %EXTRA_ARGS%

View File

@ -4,4 +4,4 @@
bash download_voices_tts-1.sh
python speech.py --xtts_device none ${OPENEDAI_LOG_LEVEL:+--log-level $OPENEDAI_LOG_LEVEL}
python speech.py --xtts_device none $EXTRA_ARGS "$@"

View File

@ -2,7 +2,9 @@
[ -f speech.env ] && . speech.env
echo "First startup may download 2GB of speech models. Please wait."
bash download_voices_tts-1.sh
bash download_voices_tts-1-hd.sh $PRELOAD_MODEL
python speech.py ${PRELOAD_MODEL:+--preload $PRELOAD_MODEL} ${OPENEDAI_LOG_LEVEL:+--log-level $OPENEDAI_LOG_LEVEL} $@
python speech.py ${PRELOAD_MODEL:+--preload $PRELOAD_MODEL} $EXTRA_ARGS "$@"
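The `${VAR:+...}` expansions above append a flag only when the variable is set. The same conditional-argument pattern can be sketched in Python (`build_args` is a hypothetical helper, not project code):

```python
import shlex

def build_args(env):
    """Mirror startup.sh: add --preload only when PRELOAD_MODEL is set,
    then append any EXTRA_ARGS, split shell-style."""
    args = ["speech.py"]
    if env.get("PRELOAD_MODEL"):
        args += ["--preload", env["PRELOAD_MODEL"]]
    args += shlex.split(env.get("EXTRA_ARGS", ""))
    return args
```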