From 952d8acc6b8690c24e3180091ad08c7ffe129c24 Mon Sep 17 00:00:00 2001
From: pollfly <75068813+pollfly@users.noreply.github.com>
Date: Sun, 1 Dec 2024 11:43:18 +0200
Subject: [PATCH] Add llama.cpp model deployment app (#976)

---
 .../applications/apps_llama_deployment.md | 77 +++++++++++++++++++
 docs/webapp/applications/apps_overview.md |  1 +
 sidebars.js                               |  3 +-
 static/icons/ico-llama-active.svg         |  7 ++
 static/icons/ico-llama-idle.svg           |  7 ++
 static/icons/ico-llama-loading.svg        | 24 ++++++
 static/icons/ico-llama-stopped.svg        |  7 ++
 7 files changed, 125 insertions(+), 1 deletion(-)
 create mode 100644 docs/webapp/applications/apps_llama_deployment.md
 create mode 100644 static/icons/ico-llama-active.svg
 create mode 100644 static/icons/ico-llama-idle.svg
 create mode 100644 static/icons/ico-llama-loading.svg
 create mode 100644 static/icons/ico-llama-stopped.svg

diff --git a/docs/webapp/applications/apps_llama_deployment.md b/docs/webapp/applications/apps_llama_deployment.md
new file mode 100644
index 00000000..3e5ad55c
--- /dev/null
+++ b/docs/webapp/applications/apps_llama_deployment.md
@@ -0,0 +1,77 @@
+---
+title: Llama.cpp Model Deployment
+---
+
+:::important Enterprise Feature
+The llama.cpp Model Deployment App is available under the ClearML Enterprise plan.
+:::
+
+The llama.cpp Model Deployment app enables users to quickly deploy LLM models in GGUF format using
+[`llama.cpp`](https://github.com/ggerganov/llama.cpp). The app serves your model on a machine of your choice. Once an
+app instance is running, it serves your model through a secure, publicly accessible network endpoint. The app monitors
+endpoint activity and shuts down if the model remains inactive for a specified maximum idle time.
+
+:::important AI Application Gateway
+The llama.cpp Model Deployment app makes use of the ClearML Traffic Router, which implements a secure, authenticated
+network endpoint for the model.
+
+If the ClearML AI Application Gateway is not available, the model endpoint might not be accessible.
+:::
+
+After starting a llama.cpp Model Deployment instance, you can view the following information in its dashboard:
+* Status indicator
+  * Active server - App instance is running and is actively in use
+  * Loading server - App instance is setting up
+  * Idle server - App instance is idle
+  * Stopped server - App instance is stopped
+* Idle time - Time elapsed since last activity
+* App - The publicly accessible URL of the model endpoint. Active model endpoints are also available in the
+  [Model Endpoints](../webapp_model_endpoints.md) table, which allows you to view and compare endpoint details and
+  monitor status over time
+* API base - The base URL for the model endpoint
+* API key - The authentication key for the model endpoint
+* Test Command - An example command line to test the deployed model (see the request sketch below this list)
+* Requests - Number of requests over time
+* Latency - Request response time (ms) over time
+* Endpoint resource monitoring metrics over time
+  * CPU usage
+  * Network throughput
+  * Disk performance
+  * Memory performance
+  * GPU utilization
+  * GPU memory usage
+  * GPU temperature
+* Console log - The console log shows the app instance's console output: setup progress, status changes, error messages, etc.
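+
+The following is a minimal sketch of testing the endpoint from Python. It assumes the endpoint follows the
+OpenAI-compatible `/v1/chat/completions` schema exposed by the `llama.cpp` server; the API base, API key, and model
+name are placeholders to be replaced with the values shown in the app instance dashboard:
+
+```python
+import requests
+
+# Placeholder values - copy the real ones from the app instance dashboard
+API_BASE = "https://<your-endpoint-url>/v1"  # "API base" field
+API_KEY = "<your-api-key>"                   # "API key" field
+
+# Send a single chat completion request to the deployed GGUF model
+response = requests.post(
+    f"{API_BASE}/chat/completions",
+    headers={"Authorization": f"Bearer {API_KEY}"},
+    json={
+        "model": "default",  # single-model servers typically ignore this field
+        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
+        "max_tokens": 64,
+    },
+    timeout=60,
+)
+response.raise_for_status()
+print(response.json()["choices"][0]["message"]["content"])
+```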
+
+## Llama.cpp Model Deployment Instance Configuration
+
+When configuring a new llama.cpp Model Deployment instance, you can fill in the required parameters or reuse the
+configuration of a previously launched instance.
+
+Launch an app instance with the configuration of a previously launched instance using one of the following options:
+* Cloning a previously launched app instance will open the instance launch form with the original instance's
+  configuration prefilled.
+* Importing an app configuration file. You can export the configuration of a previously launched instance as a JSON
+  file when viewing its configuration.
+
+The prefilled configuration form can be edited before launching the new app instance.
+
+To configure a new app instance, click `Launch New` to open the app's configuration form.
+
+## Configuration Options
+* Import Configuration - Import an app instance configuration file. This will fill the configuration form with the
+  values from the file, which can be modified before launching the app instance
+* Project name - ClearML Project where your llama.cpp Model Deployment app instance will be stored
+* Task name - Name of the [ClearML Task](../../fundamentals/task.md) for your llama.cpp Model Deployment app instance
+* Queue - The [ClearML Queue](../../fundamentals/agents_and_queues.md#agent-and-queue-workflow) to which the
+  llama.cpp Model Deployment app instance task will be enqueued (make sure an agent is assigned to that queue)
+* Model - A ClearML Model ID or a Hugging Face model. The model must be in GGUF format. If you are using a
+  Hugging Face model, make sure to pass the path to the GGUF file, for example: `provider/repo/path/to/model.gguf`.
+  A sketch for looking up a ClearML Model ID follows this list
+* General
+  * Hugging Face Token - Token for accessing Hugging Face models that require authentication
+  * Number of GPU Layers - Number of layers to store in VRAM. `9999` indicates that all layers should be loaded in
+    VRAM. Lower values offload part of the model to CPU RAM
+* Advanced Options
+  * Idle Time Limit (Hours) - Maximum idle time after which the app instance will shut down
+  * Last Action Report Interval (Seconds) - The frequency at which the application's last activity is reported.
+    Used to prevent the app instance from being considered idle when machine metrics are low but the application is
+    still running
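+
+If the model is already registered in ClearML, you can look up its ID for the Model field with the ClearML Python
+SDK. This is a minimal sketch using `Model.query_models()`; the project and model names are hypothetical placeholders:
+
+```python
+from clearml import Model
+
+# Query registered models by name - project and model names are placeholders
+models = Model.query_models(
+    project_name="LLM Models",    # hypothetical project name
+    model_name="my-gguf-model",   # hypothetical model name
+)
+
+for model in models:
+    # Paste the printed ID into the app instance's Model field
+    print(model.id, model.name)
+```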
diff --git a/docs/webapp/applications/apps_overview.md b/docs/webapp/applications/apps_overview.md
index a4b1e601..2507a436 100644
--- a/docs/webapp/applications/apps_overview.md
+++ b/docs/webapp/applications/apps_overview.md
@@ -38,6 +38,7 @@ Applications for deploying user interfaces for models:
 Applications for deploying machine learning models as scalable, secure services:
 * [**Embedding Model Deployment**](apps_embed_model_deployment.md) - Deploy embedding models as networking services over a secure endpoint (available under ClearML Enterprise Plan)
 * [**Model Deployment**](apps_model_deployment.md) - Deploy LLM models as networking services over a secure endpoint (available under ClearML Enterprise Plan)
+* [**llama.cpp**](apps_llama_deployment.md) - Deploy LLM models in GGUF format using [`llama.cpp`](https://github.com/ggerganov/llama.cpp) as networking services over a secure endpoint (available under ClearML Enterprise Plan)
 
 :::info Autoscalers
 Autoscaling ([AWS Autoscaler](apps_aws_autoscaler.md) and [GCP Autoscaler](apps_gcp_autoscaler.md))
diff --git a/sidebars.js b/sidebars.js
index f9526ce2..112e487c 100644
--- a/sidebars.js
+++ b/sidebars.js
@@ -234,7 +234,8 @@ module.exports = {
         {
           "Deploy": [
             'webapp/applications/apps_embed_model_deployment',
-            'webapp/applications/apps_model_deployment'
+            'webapp/applications/apps_model_deployment',
+            'webapp/applications/apps_llama_deployment'
           ]
         },
       ]
diff --git a/static/icons/ico-llama-active.svg b/static/icons/ico-llama-active.svg
new file mode 100644
index 00000000..2c5dbf84
--- /dev/null
+++ b/static/icons/ico-llama-active.svg
@@ -0,0 +1,7 @@
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/static/icons/ico-llama-idle.svg b/static/icons/ico-llama-idle.svg
new file mode 100644
index 00000000..d3775780
--- /dev/null
+++ b/static/icons/ico-llama-idle.svg
@@ -0,0 +1,7 @@
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/static/icons/ico-llama-loading.svg b/static/icons/ico-llama-loading.svg
new file mode 100644
index 00000000..a1e35eac
--- /dev/null
+++ b/static/icons/ico-llama-loading.svg
@@ -0,0 +1,24 @@
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/static/icons/ico-llama-stopped.svg b/static/icons/ico-llama-stopped.svg
new file mode 100644
index 00000000..5399a9bf
--- /dev/null
+++ b/static/icons/ico-llama-stopped.svg
@@ -0,0 +1,7 @@
+
+
+
+
+
+
+