mirror of https://github.com/matatonic/openedai-speech
synced 2025-06-26 18:16:32 +00:00

xtts: +AMD gpu ROCm, +Apple MPS

This commit is contained in:
parent e9bbc56523
commit 72c7b799b9
Dockerfile
@@ -10,7 +10,11 @@ RUN apt-get clean && rm -rf /var/lib/apt/lists/*
 WORKDIR /app
 RUN mkdir -p voices config
 
-COPY requirements.txt /app/
+ARG USE_ROCM
+ENV USE_ROCM=${USE_ROCM}
+
+COPY requirements*.txt /app/
+RUN if [ ${USE_ROCM} = "1" ]; then mv /app/requirements-rocm.txt /app/requirements.txt; fi
 RUN --mount=type=cache,target=/root/.cache/pip pip install -r requirements.txt
 
 COPY speech.py openedai.py say.py *.sh *.default.yaml README.md LICENSE /app/
README.md | 49
@@ -30,10 +30,12 @@ If you find a better voice match for `tts-1` or `tts-1-hd`, please let me know s
 Version 0.13.0, 2024-06-22
 
 * Added [Custom fine-tuned XTTS model support](#custom-fine-tuned-model-support)
-* Initial prebuilt arm64 image support (Apple M1/2/3, Raspberry Pi), thanks @JakeStevenson, @hchasens
+* Initial prebuilt arm64 image support with MPS (Apple M-series, Raspberry Pi), thanks @JakeStevenson, @hchasens
+* Initial AMD GPU (rocm 5.7) support, set USE_ROCM=1 when building docker or use requirements-rocm.txt
 * Parler-tts support removed
 * Move the *.default.yaml to the root folder
-* Added 'audio_reader.py' for streaming text input and reading long texts
 * Run the docker as a service by default (`restart: unless-stopped`)
+* Added `audio_reader.py` for streaming text input and reading long texts
 
 Version 0.12.3, 2024-06-17
@@ -84,10 +86,12 @@ Version: 0.7.3, 2024-03-20
 
 ## Installation instructions
 
-1) Copy the `sample.env` to `speech.env` (customize if needed)
+1. Copy the `sample.env` to `speech.env` (customize if needed)
 ```bash
 cp sample.env speech.env
 ```
+#### AMD GPU (ROCm support)
+> If you have an AMD GPU and want to use ROCm, set `USE_ROCM=1` in the `speech.env` before building the docker image. You will need to `docker compose build` before running the container in the next step.
 
 2. Option: Docker (**recommended**) (prebuilt images are available)
@@ -95,9 +99,8 @@ Run the server:
 ```shell
 docker compose up
 ```
-For a minimal docker image with only piper support (<1GB vs. 8GB), use `docker compose -f docker-compose.min.yml up`
+> For a minimal docker image with only piper support (<1GB vs. 8GB) use `docker compose -f docker-compose.min.yml up`
 
 To install the docker image as a service, edit the `docker-compose.yml` and uncomment `restart: unless-stopped`, then start the service with: `docker compose up -d`
 
 2. Option: Manual installation:
@@ -107,7 +110,7 @@ sudo apt install curl ffmpeg
 # Create & activate a new virtual environment (optional but recommended)
 python -m venv .venv
 source .venv/bin/activate
-# Install the Python requirements
+# Install the Python requirements - use requirements-rocm.txt for AMD GPU (ROCm support)
 pip install -r requirements.txt
 # run the server
 bash startup.sh
@@ -156,7 +159,7 @@ curl http://localhost:8000/v1/audio/speech -H "Content-Type: application/json" -
 Or just like this:
 
 ```shell
-curl http://localhost:8000/v1/audio/speech -H "Content-Type: application/json" -d '{
+curl -s http://localhost:8000/v1/audio/speech -H "Content-Type: application/json" -d '{
   "input": "The quick brown fox jumped over the lazy dog."}' > speech.mp3
 ```
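The same request can be issued from Python's standard library; a minimal sketch (no extra dependencies; the URL and JSON body mirror the curl call above, and the output filename is just an illustration):

```python
import json
import urllib.request

API_URL = "http://localhost:8000/v1/audio/speech"

def speech_request(text: str) -> urllib.request.Request:
    # Same JSON body as the curl example: only "input" is sent,
    # leaving model and voice to the server's defaults.
    body = json.dumps({"input": text}).encode("utf-8")
    return urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": "application/json"})

if __name__ == "__main__":
    req = speech_request("The quick brown fox jumped over the lazy dog.")
    with urllib.request.urlopen(req) as resp, open("speech.mp3", "wb") as out:
        out.write(resp.read())  # mp3 bytes, like `> speech.mp3` in the curl example
```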
@@ -184,8 +187,10 @@ with client.audio.speech.with_streaming_response.create(
 Also see the `say.py` sample application for an example of how to use the openai-python API.
 
 ```shell
-python say.py -t "The quick brown fox jumped over the lazy dog." -p # play the audio, requires 'pip install playsound'
-python say.py -t "The quick brown fox jumped over the lazy dog." -m tts-1-hd -v onyx -f flac -o fox.flac # save to a file.
+# play the audio, requires 'pip install playsound'
+python say.py -t "The quick brown fox jumped over the lazy dog." -p
+# save to a file in flac format
+python say.py -t "The quick brown fox jumped over the lazy dog." -m tts-1-hd -v onyx -f flac -o fox.flac
 ```
 
 ```
@@ -212,6 +217,28 @@ options:
 
 ```
 
+You can also try the included `audio_reader.py` for listening to longer text and streamed input.
+
+```
+usage: audio_reader.py [-h] [-m MODEL] [-v VOICE] [-s SPEED]
+
+Text to speech player
+
+options:
+  -h, --help            show this help message and exit
+  -m MODEL, --model MODEL
+                        The OpenAI model (default: tts-1)
+  -v VOICE, --voice VOICE
+                        The voice to use (default: alloy)
+  -s SPEED, --speed SPEED
+                        How fast to read the audio (default: 1.0)
+
+```
+Example usage:
+```bash
+$ python audio_reader.py -s 2 < LICENSE
+```
+
 ## Custom Voices Howto
 
 ### Piper
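Reading a whole file such as `LICENSE` aloud means splitting the input into pieces before sending them to the TTS endpoint. A rough sketch of that idea (this `chunk_sentences` helper is illustrative only, not `audio_reader.py`'s actual implementation):

```python
import re

def chunk_sentences(text: str, max_len: int = 300) -> list[str]:
    # Split on sentence boundaries, then pack sentences into chunks
    # no longer than max_len so each TTS request stays small.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_len:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```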
@@ -260,13 +287,13 @@ For example:
 ...
 tts-1-hd:
   me:
-    model: xtts_v2.0.2 # you can specify different xtts versions
+    model: xtts
     speaker: voices/me.wav # this could be you
 ```
 
 ## Multilingual
 
-Multilingual support was added in version 0.11.0 and is available only with the XTTS v2 model.
+Multilingual cloning support was added in version 0.11.0 and is available only with the XTTS v2 model. To use multilingual voices with piper simply download a language specific voice.
 
 Coqui XTTSv2 has support for 16 languages: English (`en`), Spanish (`es`), French (`fr`), German (`de`), Italian (`it`), Portuguese (`pt`), Polish (`pl`), Turkish (`tr`), Russian (`ru`), Dutch (`nl`), Czech (`cs`), Arabic (`ar`), Chinese (`zh-cn`), Japanese (`ja`), Hungarian (`hu`) and Korean (`ko`).
audio_reader.py
@@ -101,9 +101,9 @@ if __name__ == "__main__":
         description='Text to speech player',
         formatter_class=argparse.ArgumentDefaultsHelpFormatter)
 
-    parser.add_argument('-m', '--model', action='store', default="tts-1")
-    parser.add_argument('-v', '--voice', action='store', default="alloy")
-    parser.add_argument('-s', '--speed', action='store', default=1.0)
+    parser.add_argument('-m', '--model', action='store', default="tts-1", help="The OpenAI model")
+    parser.add_argument('-v', '--voice', action='store', default="alloy", help="The voice to use")
+    parser.add_argument('-s', '--speed', action='store', default=1.0, help="How fast to read the audio")
 
     args = parser.parse_args()
docker-compose.min.yml
@@ -10,4 +10,4 @@ services:
       - ./voices:/app/voices
       - ./config:/app/config
-    # To install as a service
-    #restart: unless-stopped
+    restart: unless-stopped
docker-compose.yml
@@ -10,9 +10,7 @@ services:
       - ./voices:/app/voices
       - ./config:/app/config
-    # To install as a service
-    #restart: unless-stopped
-    # Set nvidia runtime if it's not the default
-    #runtime: nvidia
+    restart: unless-stopped
     deploy:
       resources:
         reservations:
requirements-rocm.txt | 14 (new file)
@@ -0,0 +1,14 @@
+fastapi
+uvicorn
+loguru
+# piper-tts
+piper-tts==1.2.0
+# xtts
+TTS
+# XXX, 3.8+ has some issue for now
+spacy==3.7.4
+
+# torch==2.2.2 Fixes: https://github.com/matatonic/openedai-speech/issues/9
+# Re: https://github.com/pytorch/pytorch/issues/121834
+torch==2.2.2; --index-url https://download.pytorch.org/whl/rocm5.7; sys_platform == "linux"
+torchaudio==2.2.2; --index-url https://download.pytorch.org/whl/rocm5.7; sys_platform == "linux"
requirements.txt
@@ -5,8 +5,15 @@ loguru
 piper-tts==1.2.0
 # xtts
 TTS
-# Fixes: https://github.com/matatonic/openedai-speech/issues/9
-# Re: https://github.com/pytorch/pytorch/issues/121834
-torch==2.2.2
 # XXX, 3.8+ has some issue for now
 spacy==3.7.4
+
+# torch==2.2.2 Fixes: https://github.com/matatonic/openedai-speech/issues/9
+# Re: https://github.com/pytorch/pytorch/issues/121834
+torch==2.2.2; sys_platform != "darwin"
+torchaudio; sys_platform != "darwin"
+# for MPS accelerated torch on Mac
+torch==2.2.2; --index-url https://download.pytorch.org/whl/cpu; sys_platform == "darwin"
+torchaudio==2.2.2; --index-url https://download.pytorch.org/whl/cpu; sys_platform == "darwin"
+
+# ROCM (Linux only) - use requirements.amd.txt
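The `sys_platform` conditions in the lines above are PEP 508 environment markers: pip evaluates them against the installing interpreter, so the `darwin` entries only install on macOS. A toy evaluator covering just the two marker forms used here (illustrative only; real marker resolution is pip's job):

```python
import sys

def marker_applies(marker: str) -> bool:
    # Handles only markers shaped like 'sys_platform == "darwin"'
    # or 'sys_platform != "darwin"'.
    name, op, value = marker.split()
    if name != "sys_platform" or op not in ("==", "!="):
        raise ValueError("unsupported marker: " + marker)
    value = value.strip('"')
    return (sys.platform == value) if op == "==" else (sys.platform != value)
```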
sample.env
@@ -2,3 +2,5 @@ TTS_HOME=voices
 HF_HOME=voices
 #PRELOAD_MODEL=xtts
+#PRELOAD_MODEL=xtts_v2.0.2
+#EXTRA_ARGS=--log-level DEBUG
+#USE_ROCM=1
speech.py | 11
@@ -205,12 +205,21 @@ async def generate_speech(request: GenerateSpeechRequest):
     return StreamingResponse(content=ffmpeg_proc.stdout, media_type=media_type)
 
 
+def auto_torch_device():
+    try:
+        import torch
+        return 'cuda' if torch.cuda.is_available() else 'mps' if ( torch.backends.mps.is_available() and torch.backends.mps.is_built() ) else 'cpu'
+
+    except:
+        return 'none'
+
+
 if __name__ == "__main__":
     parser = argparse.ArgumentParser(
         description='OpenedAI Speech API Server',
         formatter_class=argparse.ArgumentDefaultsHelpFormatter)
 
-    parser.add_argument('--xtts_device', action='store', default="cuda", help="Set the device for the xtts model. The special value of 'none' will use piper for all models.")
+    parser.add_argument('--xtts_device', action='store', default=auto_torch_device(), help="Set the device for the xtts model. The special value of 'none' will use piper for all models.")
     parser.add_argument('--preload', action='store', default=None, help="Preload a model (Ex. 'xtts' or 'xtts_v2.0.2'). By default it's loaded on first use.")
     parser.add_argument('-P', '--port', action='store', default=8000, type=int, help="Server tcp port")
     parser.add_argument('-H', '--host', action='store', default='0.0.0.0', help="Host to listen on, Ex. 0.0.0.0")
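The device-autodetection fallback chain in `auto_torch_device()` above can be exercised standalone; a commented sketch (this version catches `ImportError` narrowly rather than using a bare `except`):

```python
def auto_torch_device() -> str:
    # Prefer CUDA, then Apple MPS, then plain CPU; report 'none' when torch
    # is not installed so the server can fall back to piper for all models.
    try:
        import torch
    except ImportError:
        return 'none'
    if torch.cuda.is_available():
        return 'cuda'
    if torch.backends.mps.is_available() and torch.backends.mps.is_built():
        return 'mps'
    return 'cpu'

print(auto_torch_device())
```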
startup.bat
@@ -5,4 +5,4 @@ set /p < speech.env
 call download_voices_tts-1.bat
 call download_voices_tts-1-hd.bat %PRELOAD_MODEL%
 
-python speech.py %PRELOAD_MODEL:+--preload %PRELOAD_MODEL% %OPENEDAI_LOG_LEVEL:+--log-level %OPENEDAI_LOG_LEVEL%
+python speech.py %PRELOAD_MODEL:+--preload %PRELOAD_MODEL% %EXTRA_ARGS%
startup.min.sh
@@ -4,4 +4,4 @@
 
 bash download_voices_tts-1.sh
 
-python speech.py --xtts_device none ${OPENEDAI_LOG_LEVEL:+--log-level $OPENEDAI_LOG_LEVEL}
+python speech.py --xtts_device none $EXTRA_ARGS $@
startup.sh
@@ -2,7 +2,9 @@
 
 [ -f speech.env ] && . speech.env
 
+echo "First startup may download 2GB of speech models. Please wait."
+
 bash download_voices_tts-1.sh
 bash download_voices_tts-1-hd.sh $PRELOAD_MODEL
 
-python speech.py ${PRELOAD_MODEL:+--preload $PRELOAD_MODEL} ${OPENEDAI_LOG_LEVEL:+--log-level $OPENEDAI_LOG_LEVEL} $@
+python speech.py ${PRELOAD_MODEL:+--preload $PRELOAD_MODEL} $EXTRA_ARGS $@
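The `${PRELOAD_MODEL:+--preload $PRELOAD_MODEL}` idiom in the startup scripts expands to the flag pair only when the variable is set and non-empty, so an unset `PRELOAD_MODEL` adds nothing to the command line. A quick way to observe the expansion (run through `sh` from Python purely for illustration):

```python
import os
import subprocess
from typing import Optional

CMD = 'echo speech.py ${PRELOAD_MODEL:+--preload $PRELOAD_MODEL}'

def expanded(preload: Optional[str]) -> str:
    # Echo the expanded command line through sh, with PRELOAD_MODEL
    # exported when preload is given, or removed when it is None.
    env = dict(os.environ)
    env.pop("PRELOAD_MODEL", None)
    if preload is not None:
        env["PRELOAD_MODEL"] = preload
    return subprocess.run(["sh", "-c", CMD], env=env,
                          capture_output=True, text=True).stdout.strip()

print(expanded("xtts"))  # speech.py --preload xtts
print(expanded(None))    # speech.py
```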