# Example Huggingface on ClearML Serving

Technically, the underlying NVIDIA Triton inference engine can handle almost any type of model, including PyTorch models, which is how many Huggingface models are shipped out of the box. But in order to get better serving speeds, check out the transformer-deploy repository, its docs, and the excellent accompanying blogpost on converting Huggingface models first into ONNX and then into TensorRT-optimized binaries.
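
The transformer-deploy tooling automates the full conversion and validation; just to give an idea of the first step, a bare-bones ONNX export of a Huggingface classifier with plain `torch.onnx.export` could look roughly like the sketch below (the model name, opset, and axis names are illustrative, not what the example actually uses):

```python
# Rough sketch of the Huggingface -> ONNX step only; the ONNX -> TensorRT step
# (and proper validation) is what the transformer-deploy repo automates for you.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-uncased"  # illustrative; use your own fine-tuned classifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()
model.config.return_dict = False  # export plain tuples instead of ModelOutput objects

dummy = tokenizer("ClearML serving example", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"], dummy["token_type_ids"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    output_names=["output"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "token_type_ids": {0: "batch", 1: "sequence"},
        "output": {0: "batch"},
    },
    opset_version=13,
)
```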
## Model vs Tokenizer

Most Huggingface NLP models ship with a tokenizer as well, and we don't want to leave it to the end user to embed their own inputs. The blogpost above uses an ensemble endpoint in Triton that first runs some Python code containing the tokenizer and then sends the result to a second endpoint, which contains the actual model.

This is a good approach, but the tokenizer is CPU-based and not independently scalable from the GPU-based transformer model. With ClearML Serving, we can instead move the tokenization step into the preprocessing script that we provide to the ClearML Serving inference container, which makes this step completely autoscalable.
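
As an illustration, a tokenization-only preprocess script could look roughly like the sketch below. The tokenizer name and the exact return convention here are assumptions; the working version ships as `preprocess.py` in this folder.

```python
# Rough sketch of a ClearML Serving preprocess script that runs the Huggingface
# tokenizer on CPU before the request is forwarded to the Triton model endpoint.
# The tokenizer name is illustrative; see preprocess.py in this folder for the
# version actually used by this example.
from typing import Any

import numpy as np
from transformers import AutoTokenizer


class Preprocess(object):
    def __init__(self):
        # Loaded once when the inference container starts
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    def preprocess(self, body: dict, state: dict, collect_custom_statistics_fn=None) -> Any:
        # Turn the raw {"text": ...} payload into the int32 tensors declared
        # on the Triton endpoint (input_ids, token_type_ids, attention_mask)
        tokens = self.tokenizer(body["text"], return_tensors="np")
        return {name: tokens[name].astype(np.int32)
                for name in ("input_ids", "token_type_ids", "attention_mask")}

    def postprocess(self, data: Any, state: dict, collect_custom_statistics_fn=None) -> dict:
        # Triton returns the raw logits; pass them back to the caller as a plain list
        return {"output": np.asarray(data).tolist()}
```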
## Getting the right TensorRT <> Triton versions

Chances are very high that the transformer-deploy image uses a different Triton version than the one ClearML Serving uses, which will cause issues later on; Triton is very strict about its version requirements. Please check the Triton version we are using in `clearml_serving/engines/triton/Dockerfile` and compare it to the main Dockerfile from the transformer-deploy repo. Check this page for more information about which TensorRT version is shipped in which Triton container.

If they don't match up, either rebuild the ClearML Triton image locally with the right Triton version and make sure it is picked up by compose, or build the transformer-deploy image locally with the correct version and use it to run the model conversion. Your model has to be optimized using the exact same TensorRT version or it will not serve!
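
If in doubt, one quick way to see which TensorRT version a conversion container ships with is to query the Python bindings inside it. This assumes the `tensorrt` Python package is installed in that container (it is in the transformer-deploy image):

```python
# Run inside the container used for model conversion and compare the result to
# the TensorRT version of the Triton image ClearML Serving will run
# (e.g. 8.4.1 for Triton 22.07).
import tensorrt

print(tensorrt.__version__)
```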
## Setting up for the example

At the time of this writing, compiling a Huggingface model from the transformer-deploy main branch means it is compiled using TensorRT version 8.4.1, which corresponds to Triton version 22.07. To get ClearML running on 22.07, all we need to do is change the base image name in the `docker-compose-triton-gpu.yml` file to the correct version:
```diff
...
  clearml-serving-triton:
-   image: allegroai/clearml-serving-triton:latest
+   image: allegroai/clearml-serving-triton:1.2.0-22.07
    container_name: clearml-serving-triton
    restart: unless-stopped
    # optimize performance
    security_opt:
      - seccomp:unconfined
...
```
Or you can build your own correct version by adapting the Dockerfile in `clearml_serving/engines/triton/Dockerfile`, building it, and making sure the Triton compose YAML uses it instead.
## Setting up the serving service

### Get the repository (with the example)

Clone the serving repository if you haven't already:

```bash
git clone https://github.com/allegroai/clearml-serving.git
cd clearml-serving
```
### Launch the serving task to ClearML

Install `clearml-serving` either via pip or from the repository, then create the serving service:

```bash
clearml-serving create --name "huggingface serving example"
```

Write down the service ID; this is the same service ID that goes into your env file as well.
### Setting up the docker-compose serving stack

Set up the `docker/example.env` file with your ClearML credentials, then add an extra line to install 3rd party packages. In this case, we also want to install the `transformers` package because we're going to run the tokenizer in the inference container:
```bash
CLEARML_WEB_HOST="https://app.clear.ml"
CLEARML_API_HOST="https://api.clear.ml"
CLEARML_FILES_HOST="https://files.clear.ml"
CLEARML_API_ACCESS_KEY="<>"
CLEARML_API_SECRET_KEY="<>"
CLEARML_SERVING_TASK_ID="<>"
# Add this to install necessary packages
CLEARML_EXTRA_PYTHON_PACKAGES=transformers
# Change this depending on your machine and performance needs
CLEARML_USE_GUNICORN=1
CLEARML_SERVING_NUM_PROCESS=8
```
Huggingface models require Triton engine support, so please use `docker-compose-triton.yml` / `docker-compose-triton-gpu.yml`, or, if running on Kubernetes, the matching helm chart to set things up. Check the repository's main readme documentation if you need help.
To run with the correct version of Triton for this example, do:

```bash
docker compose --env-file docker/example.env -f docker/docker-compose-triton-gpu.yml -f examples/huggingface/docker-compose-override.yml up --force-recreate
```
This should get you a running ClearML stack with Triton that reports to a ClearML task in a project called `DevOps`.
## Getting the sample model

If you didn't use the transformer-deploy repository on your own model, you can run this single command to get a TensorRT binary of an example classification model. Please make sure you have properly installed docker and nvidia-container-toolkit, so it can be run on GPU. The command will download a `model.bin` file to the local directory for you to serve:

```bash
curl https://clearml-public.s3.amazonaws.com/models/model_onnx.bin -o model.bin
```
## Setup

- Upload the TensorRT model (write down the model ID):

```bash
clearml-serving --id <your_service_ID> model upload --name "Transformer ONNX" --project "Huggingface Serving" --path model.bin
```
- Create a model endpoint:

```bash
# Without dynamic batching
clearml-serving --id <your_service_ID> model add --engine triton --endpoint "transformer_model" --model-id <your_model_ID> --preprocess examples/huggingface/preprocess.py --input-size "[-1, -1]" "[-1, -1]" "[-1, -1]" --input-type int32 int32 int32 --input-name "input_ids" "token_type_ids" "attention_mask" --output-size "[-1, 2]" --output-type float32 --output-name "output" --aux-config platform=\"tensorrt_plan\" default_model_filename=\"model.bin\"

# With dynamic batching
clearml-serving --id <your_service_ID> model add --engine triton --endpoint "transformer_model" --model-id <your_model_ID> --preprocess examples/huggingface/preprocess.py --input-size "[-1]" "[-1]" "[-1]" --input-type int32 int32 int32 --input-name "input_ids" "token_type_ids" "attention_mask" --output-size "[2]" --output-type float32 --output-name "output" --aux-config platform=\"onnxruntime_onnx\" default_model_filename=\"model.bin\" dynamic_batching.preferred_batch_size="[1,2,4,8,16,32,64]" dynamic_batching.max_queue_delay_microseconds=5000000 max_batch_size=64
```

Note the backslashes for string values, e.g. `platform=\"tensorrt_plan\" default_model_filename=\"model.bin\"`!
INFO: the model input and output parameters are usually found in a `config.pbtxt` file next to the model itself.
- Make sure you have the `clearml-serving` `docker-compose-triton.yml` (or `docker-compose-triton-gpu.yml`) stack running; it might take a minute or two to sync with the new endpoint.
- Test the new endpoint (note that the first call will trigger the model pulling, so it might take longer; from here on, it's all in memory); see the Running Inference section below.
Notice: you can also change the serving service while it is already running! This includes adding/removing endpoints, adding canary model routing, etc. By default, new endpoints/models will be automatically updated after 1 minute.
## Running Inference

After waiting a little bit for the stack to detect your new endpoint and load it, you can use curl to send a request:

```bash
curl -X POST "http://127.0.0.1:8080/serve/transformer_model" -H "accept: application/json" -H "Content-Type: application/json" -d '{"text": "This is a ClearML example to show how Triton binaries are deployed."}'
```
Or use the notebook in this example folder to run it with Python `requests`.
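
For reference, a minimal `requests` call might look like this (same payload as the curl example above):

```python
# Equivalent of the curl call above, using the Python requests library
import requests

response = requests.post(
    "http://127.0.0.1:8080/serve/transformer_model",
    json={"text": "This is a ClearML example to show how Triton binaries are deployed."},
)
print(response.json())
```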
The inference request is first sent to the ClearML inference service, which runs the raw request through the `preprocess.py` file: it takes out the `text` value, runs it through the tokenizer, and then sends the result to Triton. Triton runs the model and sends the output back to the same `preprocess.py` file, this time into the postprocessing function, whose result is returned to the user.
## Benchmarking

To run a load test on your endpoint and check its performance, use the following command:

```bash
ab -l -n 8000 -c 128 -H "accept: application/json" -H "Content-Type: application/json" -T "application/json" -p examples/huggingface/example_payload.json "http://127.0.0.1:8080/serve/transformer_model"
```
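
If Apache Bench is not available, a rough Python alternative along the same lines (request count and concurrency mirror the `ab` call above; the payload is the same illustrative text as in the curl example) could be:

```python
# Simple concurrent load-test sketch using requests + a thread pool.
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://127.0.0.1:8080/serve/transformer_model"
payload = {"text": "This is a ClearML example to show how Triton binaries are deployed."}


def call(_):
    # Return the per-request latency in seconds
    return requests.post(URL, json=payload).elapsed.total_seconds()


with ThreadPoolExecutor(max_workers=128) as pool:
    latencies = list(pool.map(call, range(8000)))

print(f"mean latency: {sum(latencies) / len(latencies):.3f}s")
```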