0.10.0 + docs update

2025-06-26 18:16:32 +00:00 · 2024-04-27 11:37:52 -04:00
parent 6864cf03b1
commit c4d9d4e7a7
2 changed files with 47 additions and 35 deletions
--- a/README.md
+++ b/README.md
@@ -1,5 +1,4 @@
-OpenedAI Speech
---------------
+# OpenedAI Speech

 An OpenAI API compatible text to speech server.

@@ -24,11 +23,13 @@ Details:

 If you find a better voice match for `tts-1` or `tts-1-hd`, please let me know so I can update the defaults.

+## Recent Changes

 Version: 0.10.0, 2024-04-26

-* Better upgrades: Reorganize config files under config, voice models under voices
-* * **If you customized your `voice_to_speaker.yaml` or `pre_process_map.yaml` you need to move them to the `config/` folder.**
+* Prebuilt & tested docker images, smaller docker images (8GB or 860MB)
+* Better upgrades: reorganize config files under `config/`, voice models under `voices/`
+* **Compatibility!** If you customized your `voice_to_speaker.yaml` or `pre_process_map.yaml` you need to move them to the `config/` folder.
 * default listen host to 0.0.0.0

 Version: 0.9.0, 2024-04-23
@@ -36,29 +37,17 @@ Version: 0.9.0, 2024-04-23
 * Fix bug with yaml and loading UTF-8
 * New sample text-to-speech application `say.py`
 * Smaller docker base image
-* Add beta [parler-tts](https://huggingface.co/parler-tts/parler_tts_mini_v0.1) support (you can describe very basic features of the speaker voice), See: (https://www.text-description-to-speech.com/) for some examples of how to describe voices. Voices can be defined in the `voice_to_speaker.yaml`.
-* 2 example [parler-tts](https://huggingface.co/parler-tts/parler_tts_mini_v0.1) voices are included in the `voice_to_speaker.yaml` file.
-* parler-tts is experimental software and is kind of slow. The exact voice will be slightly different each generation but should be similar to the basic description.
+* Add beta [parler-tts](https://huggingface.co/parler-tts/parler_tts_mini_v0.1) support (you can describe very basic features of the speaker voice). See https://www.text-description-to-speech.com/ for some examples of how to describe voices. Voices can be defined in `voice_to_speaker.default.yaml`, which includes two example [parler-tts](https://huggingface.co/parler-tts/parler_tts_mini_v0.1) voices. `parler-tts` is experimental software and is kind of slow. The exact voice will be slightly different each generation but should be similar to the basic description.

-Version: 0.8.0, 2024-03-23
-
-* Cleanup, docs update.
+...

 Version: 0.7.3, 2024-03-20

 * Allow different xtts versions per voice in `voice_to_speaker.yaml`, ex. xtts_v2.0.2
 * Quality: Fix xtts sample rate (24000 vs. 22050 for piper) and pops
-* use CUDA 12.2-base in Dockerfile
-
-API Documentation
-----------------
-
-* [OpenAI Text to speech guide](https://platform.openai.com/docs/guides/text-to-speech)
-* [OpenAI API Reference](https://platform.openai.com/docs/api-reference/audio/createSpeech)


-Installation instructions
-------------------------
+## Installation instructions

 1) Download the models & voices
 ```shell
@@ -68,24 +57,42 @@ bash download_voices_tts-1.sh
 bash download_voices_tts-1-hd.sh
 ```

-2a) Docker (**recommended**): You can run the server via docker like so:
+If you have different models which you want to use, both of the download scripts accept arguments for which models to download.
+
+Example:
+```shell
+# Download en_US-ryan-high too
+bash download_voices_tts-1.sh en_US-libritts_r-medium en_GB-northern_english_male-medium en_US-ryan-high
+# Download xtts (latest) and xtts_v2.0.2
+bash download_voices_tts-1-hd.sh xtts xtts_v2.0.2
+```
+
+
+2a) Option 1: Docker (**recommended**) (prebuilt images are available)
+
+You can run the server via docker like so:
 ```shell
 cp sample.env speech.env # edit to suit your environment as needed, you can preload a model on startup
 docker compose up
 ```
-If you want a minimal docker image with piper support only (~1GB vs. ~10GB, see: Dockerfile.min). You can edit the `docker-compose.yml` to easily change this.
+If you want a minimal docker image with piper support only (<1GB vs. 8GB, see: Dockerfile.min), you can edit the `docker-compose.yml` to easily change this.
+To install the docker image as a service, edit the `docker-compose.yml` and uncomment `restart: unless-stopped`, then start the service with: `docker compose up -d`.

-2b) Manual instructions:
+
+2b) Option 2: Manual instructions:
 ```shell
-# Install the Python requirements
-pip install -r requirements.txt
 # install ffmpeg and curl
 sudo apt install ffmpeg curl
+# Create & activate a new virtual environment
+python -m venv .venv
+source .venv/bin/activate
+# Install the Python requirements
+pip install -r requirements.txt
+# run the server
 python speech.py
 ```

-Usage
-----
+## Usage

 ```
 usage: speech.py [-h] [--piper_cuda] [--xtts_device XTTS_DEVICE] [--preload PRELOAD] [-P PORT] [-H HOST]
@@ -103,8 +110,13 @@ options:

 ```

-Sample API Usage
----------------
+## API Documentation
+
+* [OpenAI Text to speech guide](https://platform.openai.com/docs/guides/text-to-speech)
+* [OpenAI API Reference](https://platform.openai.com/docs/api-reference/audio/createSpeech)
+
+
+### Sample API Usage

 You can use it like this:

@@ -148,9 +160,9 @@ with client.audio.speech.with_streaming_response.create(

 Also see the `say.py` sample application for an example of how to use the openai-python API.

-```
-$ python say.py -t "The quick brown fox jumped over the lazy dog." -p # play the audio, requires 'pip install playsound'
-$ python say.py -t "The quick brown fox jumped over the lazy dog." -m tts-1-hd -v onyx -f flac -o fox.flac # save to a file.
+```shell
+python say.py -t "The quick brown fox jumped over the lazy dog." -p # play the audio, requires 'pip install playsound'
+python say.py -t "The quick brown fox jumped over the lazy dog." -m tts-1-hd -v onyx -f flac -o fox.flac # save to a file.
 ```

 ```
@@ -176,8 +188,7 @@ options:
  -p, --playsound       Play the audio (default: False)
 ```

-Custom Voices Howto
-------------------
+## Custom Voices Howto

 Custom voices should be mono 22050 Hz sample rate WAV files with low noise (no background music, etc.) and not contain any partial words. Sample voices for xtts should be at least 6 seconds long, but they can be longer. However, longer samples do not always produce better results.

--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -1,7 +1,7 @@
 services:
  server:
    build:
-      dockerfile: Dockerfile # for tts-1-hd support via xtts_v2, ~4GB VRAM required, ~10GB
+      dockerfile: Dockerfile # for tts-1-hd support via xtts_v2, ~4GB VRAM required, ~8GB
      #dockerfile: Dockerfile.min # piper for all models, no gpu/nvidia required, ~1GB
    image: ghcr.io/matatonic/openedai-speech
    #image: ghcr.io/matatonic/openedai-speech-min
@@ -11,7 +11,8 @@ services:
    volumes:
      - ./voices:/app/voices
      - ./config:/app/config
-    #restart: unless-stopped # install as a service
+    # install as a service, run with docker compose up -d
+    #restart: unless-stopped
    # Below can be removed if not using GPU
    runtime: nvidia
    deploy:
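
Note on the 0.10.0 compatibility change: the changelog says customized `voice_to_speaker.yaml` and `pre_process_map.yaml` files need to move to the `config/` folder. Below is a minimal sketch of that move, not part of this commit. The function name `move_custom_configs` is hypothetical, and it assumes the customized copies sit in the repository root and that an existing file in `config/` should never be overwritten.

```python
import shutil
from pathlib import Path

# File names are taken from the 0.10.0 changelog entry. Everything else here
# (function name, old location = repository root) is an assumption.
CUSTOM_FILES = ("voice_to_speaker.yaml", "pre_process_map.yaml")


def move_custom_configs(root: Path, config_dir_name: str = "config") -> list[Path]:
    """Move customized config files from `root` into `root/config`.

    Files that are missing in `root`, or that already exist in the target
    folder, are skipped so nothing is overwritten.
    """
    config_dir = root / config_dir_name
    config_dir.mkdir(exist_ok=True)
    moved = []
    for name in CUSTOM_FILES:
        src = root / name
        dst = config_dir / name
        if src.is_file() and not dst.exists():
            shutil.move(str(src), str(dst))
            moved.append(dst)
    return moved


if __name__ == "__main__":
    # Run from the checkout directory; prints one line per file moved.
    for path in move_custom_configs(Path.cwd()):
        print(f"moved {path}")
```

Running it a second time moves nothing, since the files are already in `config/`.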