xtts: +AMD gpu ROCm, +Apple MPS

This commit is contained in:
matatonic 2024-06-24 20:35:07 -04:00
parent e9bbc56523
commit 72c7b799b9
12 changed files with 89 additions and 26 deletions

View File

@ -10,7 +10,11 @@ RUN apt-get clean && rm -rf /var/lib/apt/lists/*
WORKDIR /app
RUN mkdir -p voices config
COPY requirements.txt /app/
ARG USE_ROCM
ENV USE_ROCM=${USE_ROCM}
COPY requirements*.txt /app/
RUN if [ "${USE_ROCM}" = "1" ]; then mv /app/requirements-rocm.txt /app/requirements.txt; fi
RUN --mount=type=cache,target=/root/.cache/pip pip install -r requirements.txt
COPY speech.py openedai.py say.py *.sh *.default.yaml README.md LICENSE /app/

View File

@ -30,10 +30,12 @@ If you find a better voice match for `tts-1` or `tts-1-hd`, please let me know s
Version 0.13.0, 2024-06-22
* Added [Custom fine-tuned XTTS model support](#custom-fine-tuned-model-support)
* Initial prebuilt arm64 image support (Apple M1/2/3, Raspberry Pi), thanks @JakeStevenson, @hchasens
* Initial prebuilt arm64 image support with MPS (Apple M-series, Raspberry Pi), thanks @JakeStevenson, @hchasens
* Initial AMD GPU (ROCm 5.7) support; set `USE_ROCM=1` when building the docker image, or use requirements-rocm.txt
* Parler-tts support removed
* Move the *.default.yaml to the root folder
* Added 'audio_reader.py' for streaming text input and reading long texts
* Run the docker as a service by default (`restart: unless-stopped`)
* Added `audio_reader.py` for streaming text input and reading long texts
Version 0.12.3, 2024-06-17
@ -84,10 +86,12 @@ Version: 0.7.3, 2024-03-20
## Installation instructions
1) Copy the `sample.env` to `speech.env` (customize if needed)
1. Copy the `sample.env` to `speech.env` (customize if needed)
```bash
cp sample.env speech.env
```
#### AMD GPU (ROCm support)
> If you have an AMD GPU and want to use ROCm, set `USE_ROCM=1` in the `speech.env` before building the docker image. You will need to `docker compose build` before running the container in the next step.
2. Option: Docker (**recommended**) (prebuilt images are available)
@ -95,9 +99,8 @@ Run the server:
```shell
docker compose up
```
For a minimal docker image with only piper support (<1GB vs. 8GB), use `docker compose -f docker-compose.min.yml up`
> For a minimal docker image with only piper support (<1GB vs. 8GB), use `docker compose -f docker-compose.min.yml up`
To install the docker image as a service, edit the `docker-compose.yml` and uncomment `restart: unless-stopped`, then start the service with: `docker compose up -d`
2. Option: Manual installation:
@ -107,7 +110,7 @@ sudo apt install curl ffmpeg
# Create & activate a new virtual environment (optional but recommended)
python -m venv .venv
source .venv/bin/activate
# Install the Python requirements
# Install the Python requirements - use requirements-rocm.txt for AMD GPU (ROCm support)
pip install -r requirements.txt
# run the server
bash startup.sh
@ -156,7 +159,7 @@ curl http://localhost:8000/v1/audio/speech -H "Content-Type: application/json" -
Or just like this:
```shell
curl http://localhost:8000/v1/audio/speech -H "Content-Type: application/json" -d '{
curl -s http://localhost:8000/v1/audio/speech -H "Content-Type: application/json" -d '{
"input": "The quick brown fox jumped over the lazy dog."}' > speech.mp3
```
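The same request can be sketched in Python with only the standard library. `build_speech_request` and `fetch_speech` are illustrative names, not part of the project; the server is assumed to apply its own defaults for omitted fields, as in the minimal curl example above.

```python
import json
import urllib.request

def build_speech_request(text, model="tts-1", voice="alloy"):
    """Build the JSON payload for /v1/audio/speech (fields per the curl examples)."""
    return {"model": model, "voice": voice, "input": text}

def fetch_speech(text, url="http://localhost:8000/v1/audio/speech"):
    """POST the payload and return the raw audio bytes (mp3 by default)."""
    payload = json.dumps(build_speech_request(text)).encode("utf-8")
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# usage (requires a running server):
#   audio = fetch_speech("The quick brown fox jumped over the lazy dog.")
#   open("speech.mp3", "wb").write(audio)
```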
@ -184,8 +187,10 @@ with client.audio.speech.with_streaming_response.create(
Also see the `say.py` sample application for an example of how to use the openai-python API.
```shell
python say.py -t "The quick brown fox jumped over the lazy dog." -p # play the audio, requires 'pip install playsound'
python say.py -t "The quick brown fox jumped over the lazy dog." -m tts-1-hd -v onyx -f flac -o fox.flac # save to a file.
# play the audio, requires 'pip install playsound'
python say.py -t "The quick brown fox jumped over the lazy dog." -p
# save to a file in flac format
python say.py -t "The quick brown fox jumped over the lazy dog." -m tts-1-hd -v onyx -f flac -o fox.flac
```
```
@ -212,6 +217,28 @@ options:
```
You can also try the included `audio_reader.py` for listening to longer text and streamed input.
```
usage: audio_reader.py [-h] [-m MODEL] [-v VOICE] [-s SPEED]
Text to speech player
options:
-h, --help show this help message and exit
-m MODEL, --model MODEL
The OpenAI model (default: tts-1)
-v VOICE, --voice VOICE
The voice to use (default: alloy)
-s SPEED, --speed SPEED
How fast to read the audio (default: 1.0)
```
Example usage:
```bash
$ python audio_reader.py -s 2 < LICENSE
```
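A reader like `audio_reader.py` has to break long input into pieces it can fetch and play incrementally. The chunker below is purely illustrative (not the script's actual implementation): it packs whole sentences into bounded-size chunks.

```python
import re

def chunk_text(text, max_len=500):
    """Greedily pack whole sentences into chunks of at most max_len characters,
    so each chunk can be synthesized and played while the next is fetched."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_len:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```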
## Custom Voices Howto
### Piper
@ -260,13 +287,13 @@ For example:
...
tts-1-hd:
me:
model: xtts_v2.0.2 # you can specify different xtts versions
model: xtts
speaker: voices/me.wav # this could be you
```
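A config entry like the one above maps a (model, voice) pair to an xtts model and a speaker wav. A minimal sketch of how that lookup could work at request time; `resolve_voice` is a hypothetical helper, not the server's actual code, and the field names simply mirror the YAML shown.

```python
def resolve_voice(config, model, voice):
    """Return the (xtts model, speaker wav) pair for a configured voice."""
    entry = config.get(model, {}).get(voice)
    if entry is None:
        raise KeyError(f"no voice {voice!r} configured for {model!r}")
    return entry["model"], entry["speaker"]

# the YAML example above, expressed as the equivalent Python dict:
config = {
    "tts-1-hd": {
        "me": {"model": "xtts", "speaker": "voices/me.wav"},
    }
}
```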
## Multilingual
Multilingual support was added in version 0.11.0 and is available only with the XTTS v2 model.
Multilingual cloning support was added in version 0.11.0 and is available only with the XTTS v2 model. To use multilingual voices with piper, simply download a language-specific voice.
Coqui XTTSv2 has support for 16 languages: English (`en`), Spanish (`es`), French (`fr`), German (`de`), Italian (`it`), Portuguese (`pt`), Polish (`pl`), Turkish (`tr`), Russian (`ru`), Dutch (`nl`), Czech (`cs`), Arabic (`ar`), Chinese (`zh-cn`), Japanese (`ja`), Hungarian (`hu`) and Korean (`ko`).
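For quick validation, the 16 language codes above can be captured as a set; `is_supported_language` is an illustrative helper, not part of the project.

```python
# The 16 XTTSv2 language codes listed above.
XTTS_LANGUAGES = {
    "en", "es", "fr", "de", "it", "pt", "pl", "tr",
    "ru", "nl", "cs", "ar", "zh-cn", "ja", "hu", "ko",
}

def is_supported_language(code):
    """Check a requested language code against the XTTSv2 list (case-insensitive)."""
    return code.lower() in XTTS_LANGUAGES
```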

View File

@ -101,9 +101,9 @@ if __name__ == "__main__":
description='Text to speech player',
formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('-m', '--model', action='store', default="tts-1")
parser.add_argument('-v', '--voice', action='store', default="alloy")
parser.add_argument('-s', '--speed', action='store', default=1.0)
parser.add_argument('-m', '--model', action='store', default="tts-1", help="The OpenAI model")
parser.add_argument('-v', '--voice', action='store', default="alloy", help="The voice to use")
parser.add_argument('-s', '--speed', action='store', default=1.0, help="How fast to read the audio")
args = parser.parse_args()

View File

@ -10,4 +10,4 @@ services:
- ./voices:/app/voices
- ./config:/app/config
# To install as a service
#restart: unless-stopped
restart: unless-stopped

View File

@ -10,9 +10,7 @@ services:
- ./voices:/app/voices
- ./config:/app/config
# To install as a service
#restart: unless-stopped
# Set nvidia runtime if it's not the default
#runtime: nvidia
restart: unless-stopped
deploy:
resources:
reservations:

requirements-rocm.txt (new file, 14 lines)
View File

@ -0,0 +1,14 @@
fastapi
uvicorn
loguru
# piper-tts
piper-tts==1.2.0
# xtts
TTS
# XXX: spacy 3.8+ has some issue for now
spacy==3.7.4
# torch==2.2.2 Fixes: https://github.com/matatonic/openedai-speech/issues/9
# Re: https://github.com/pytorch/pytorch/issues/121834
--extra-index-url https://download.pytorch.org/whl/rocm5.7
torch==2.2.2; sys_platform == "linux"
torchaudio==2.2.2; sys_platform == "linux"

View File

@ -5,8 +5,15 @@ loguru
piper-tts==1.2.0
# xtts
TTS
# Fixes: https://github.com/matatonic/openedai-speech/issues/9
# Re: https://github.com/pytorch/pytorch/issues/121834
torch==2.2.2
# XXX: spacy 3.8+ has some issue for now
spacy==3.7.4
# torch==2.2.2 Fixes: https://github.com/matatonic/openedai-speech/issues/9
# Re: https://github.com/pytorch/pytorch/issues/121834
torch==2.2.2
torchaudio==2.2.2
# on macOS the standard PyPI wheels already include MPS (Apple Silicon) support
# ROCm (Linux only) - use requirements-rocm.txt

View File

@ -2,3 +2,5 @@ TTS_HOME=voices
HF_HOME=voices
#PRELOAD_MODEL=xtts
#PRELOAD_MODEL=xtts_v2.0.2
#EXTRA_ARGS=--log-level DEBUG
#USE_ROCM=1

View File

@ -205,12 +205,21 @@ async def generate_speech(request: GenerateSpeechRequest):
return StreamingResponse(content=ffmpeg_proc.stdout, media_type=media_type)
def auto_torch_device():
try:
import torch
return 'cuda' if torch.cuda.is_available() else 'mps' if (torch.backends.mps.is_available() and torch.backends.mps.is_built()) else 'cpu'
except ImportError:
return 'none'
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description='OpenedAI Speech API Server',
formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('--xtts_device', action='store', default="cuda", help="Set the device for the xtts model. The special value of 'none' will use piper for all models.")
parser.add_argument('--xtts_device', action='store', default=auto_torch_device(), help="Set the device for the xtts model. The special value of 'none' will use piper for all models.")
parser.add_argument('--preload', action='store', default=None, help="Preload a model (Ex. 'xtts' or 'xtts_v2.0.2'). By default it's loaded on first use.")
parser.add_argument('-P', '--port', action='store', default=8000, type=int, help="Server tcp port")
parser.add_argument('-H', '--host', action='store', default='0.0.0.0', help="Host to listen on, Ex. 0.0.0.0")
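The fallback order that `auto_torch_device()` implements (CUDA first, then MPS, then CPU, with `'none'` when torch is missing) can be restated without importing torch; `pick_device` below is purely illustrative.

```python
def pick_device(torch_installed, cuda_ok=False, mps_ok=False):
    """Same precedence as auto_torch_device(): cuda > mps > cpu, 'none' without torch."""
    if not torch_installed:
        return 'none'
    if cuda_ok:
        return 'cuda'
    if mps_ok:
        return 'mps'
    return 'cpu'
```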

View File

@ -5,4 +5,4 @@ set /p < speech.env
call download_voices_tts-1.bat
call download_voices_tts-1-hd.bat %PRELOAD_MODEL%
python speech.py %PRELOAD_MODEL:+--preload %PRELOAD_MODEL% %OPENEDAI_LOG_LEVEL:+--log-level %OPENEDAI_LOG_LEVEL%
rem batch has no bash-style ${VAR:+...} expansion; build the optional flag explicitly
if defined PRELOAD_MODEL (set "PRELOAD_ARG=--preload %PRELOAD_MODEL%") else (set "PRELOAD_ARG=")
python speech.py %PRELOAD_ARG% %EXTRA_ARGS%

View File

@ -4,4 +4,4 @@
bash download_voices_tts-1.sh
python speech.py --xtts_device none ${OPENEDAI_LOG_LEVEL:+--log-level $OPENEDAI_LOG_LEVEL}
python speech.py --xtts_device none $EXTRA_ARGS "$@"

View File

@ -2,7 +2,9 @@
[ -f speech.env ] && . speech.env
echo "First startup may download 2GB of speech models. Please wait."
bash download_voices_tts-1.sh
bash download_voices_tts-1-hd.sh $PRELOAD_MODEL
python speech.py ${PRELOAD_MODEL:+--preload $PRELOAD_MODEL} ${OPENEDAI_LOG_LEVEL:+--log-level $OPENEDAI_LOG_LEVEL} $@
python speech.py ${PRELOAD_MODEL:+--preload $PRELOAD_MODEL} $EXTRA_ARGS "$@"
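The `${VAR:+...}` expansions above append a flag only when the variable is set. The same conditional-argument pattern can be sketched in Python (`build_args` is a hypothetical helper, not project code):

```python
import shlex

def build_args(env):
    """Mirror startup.sh: add --preload only when PRELOAD_MODEL is set,
    then append any EXTRA_ARGS, split shell-style."""
    args = ["speech.py"]
    if env.get("PRELOAD_MODEL"):
        args += ["--preload", env["PRELOAD_MODEL"]]
    args += shlex.split(env.get("EXTRA_ARGS", ""))
    return args
```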