From 952d8acc6b8690c24e3180091ad08c7ffe129c24 Mon Sep 17 00:00:00 2001
From: pollfly <75068813+pollfly@users.noreply.github.com>
Date: Sun, 1 Dec 2024 11:43:18 +0200
Subject: [PATCH] Add llama.cpp model deployment app (#976)
---
.../applications/apps_llama_deployment.md | 77 +++++++++++++++++++
docs/webapp/applications/apps_overview.md | 1 +
sidebars.js | 3 +-
static/icons/ico-llama-active.svg | 7 ++
static/icons/ico-llama-idle.svg | 7 ++
static/icons/ico-llama-loading.svg | 24 ++++++
static/icons/ico-llama-stopped.svg | 7 ++
7 files changed, 125 insertions(+), 1 deletion(-)
create mode 100644 docs/webapp/applications/apps_llama_deployment.md
create mode 100644 static/icons/ico-llama-active.svg
create mode 100644 static/icons/ico-llama-idle.svg
create mode 100644 static/icons/ico-llama-loading.svg
create mode 100644 static/icons/ico-llama-stopped.svg
diff --git a/docs/webapp/applications/apps_llama_deployment.md b/docs/webapp/applications/apps_llama_deployment.md
new file mode 100644
index 00000000..3e5ad55c
--- /dev/null
+++ b/docs/webapp/applications/apps_llama_deployment.md
@@ -0,0 +1,77 @@
+---
+title: Llama.cpp Model Deployment
+---
+
+:::important Enterprise Feature
+The llama.cpp Model Deployment App is available under the ClearML Enterprise plan.
+:::
+
+The llama.cpp Model Deployment app enables users to quickly deploy LLM models in GGUF format using [`llama.cpp`](https://github.com/ggerganov/llama.cpp).
+The app runs your model on a machine of your choice. Once an app instance is running, it serves your model through a
+secure, publicly accessible network endpoint. The app monitors endpoint activity and shuts down if the model remains
+inactive for a specified maximum idle time.
+
+:::important AI Application Gateway
+The llama.cpp Model Deployment app makes use of the ClearML Traffic Router, which implements a secure, authenticated
+network endpoint for the model.
+
+If the ClearML AI Application Gateway is not available, the model endpoint might not be accessible.
+:::
+
+After starting a llama.cpp Model Deployment instance, you can view the following information in its dashboard:
+* Status indicator
+  * Active - App instance is running and is actively in use
+  * Loading - App instance is setting up
+  * Idle - App instance is idle
+  * Stopped - App instance is stopped
+* Idle time - Time elapsed since last activity
+* App - The publicly accessible URL of the model endpoint. Active model endpoints are also available in the
+ [Model Endpoints](../webapp_model_endpoints.md) table, which allows you to view and compare endpoint details and
+ monitor status over time
+* API base - The base URL for the model endpoint
+* API key - The authentication key for the model endpoint
+* Test Command - An example command line for testing the deployed model (see the usage sketch after this list)
+* Requests - Number of requests over time
+* Latency - Request response time (ms) over time
+* Endpoint resource monitoring metrics over time
+ * CPU usage
+ * Network throughput
+ * Disk performance
+ * Memory performance
+ * GPU utilization
+ * GPU memory usage
+ * GPU temperature
+* Console log - The console log shows the app instance's console output: setup progress, status changes, error messages, etc.
+
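+The API base, API key, and Test Command shown in the dashboard are everything needed to query the deployed model
+programmatically. The snippet below is a minimal sketch of such a request, assuming the endpoint exposes an
+OpenAI-compatible `/chat/completions` route (which llama.cpp's built-in server provides); the URL, key, and model name
+are placeholders to replace with the values from your app instance dashboard:
+
+```python
+import requests
+
+# Placeholder values - copy the real ones from the app instance dashboard
+API_BASE = "https://<your-endpoint>/v1"   # "API base" field
+API_KEY = "<your-api-key>"                # "API key" field
+
+# llama.cpp's server exposes an OpenAI-compatible chat completions route,
+# so a plain HTTP POST is enough to test the deployed model
+response = requests.post(
+    f"{API_BASE}/chat/completions",
+    headers={"Authorization": f"Bearer {API_KEY}"},
+    json={
+        "model": "default",  # llama.cpp typically ignores this field
+        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
+        "max_tokens": 64,
+    },
+    timeout=60,
+)
+response.raise_for_status()
+print(response.json()["choices"][0]["message"]["content"])
+```
+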
+## Llama.cpp Model Deployment Instance Configuration
+
+When configuring a new llama.cpp Model Deployment instance, you can fill in the required parameters or reuse the
+configuration of a previously launched instance.
+
+Launch an app instance with the configuration of a previously launched instance using one of the following options:
+* Cloning a previously launched app instance will open the instance launch form with the original instance's configuration prefilled.
+* Importing an app configuration file. You can export the configuration of a previously launched instance as a JSON file when viewing its configuration.
+
+The prefilled configuration form can be edited before launching the new app instance.
+
+To configure a new app instance, click `Launch New`
+to open the app's configuration form.
+
+## Configuration Options
+* Import Configuration - Import an app instance configuration file. This will fill the configuration form with the
+values from the file, which can be modified before launching the app instance
+* Project name - ClearML Project where your llama.cpp Model Deployment app instance will be stored
+* Task name - Name of [ClearML Task](../../fundamentals/task.md) for your llama.cpp Model Deployment app instance
+* Queue - The [ClearML Queue](../../fundamentals/agents_and_queues.md#agent-and-queue-workflow) to which the
+ llama.cpp Model Deployment app instance task will be enqueued (make sure an agent is assigned to it)
+* Model - A ClearML Model ID or a Hugging Face model. The model must be in GGUF format. If you are using a
+  Hugging Face model, make sure to pass the path to the GGUF file, for example: `provider/repo/path/to/model.gguf`
+  (see the validation sketch after this list)
+* General
+  * Hugging Face Token - Token for accessing Hugging Face models that require authentication
+  * Number of GPU Layers - Number of model layers to store in VRAM. `9999` indicates that all layers should be loaded
+    in VRAM. Layers that are not stored in VRAM are offloaded to CPU RAM
+* Advanced Options
+  * Idle Time Limit (Hours) - Maximum idle time after which the app instance will shut down
+  * Last Action Report Interval (Seconds) - The frequency at which the application reports its last activity.
+    Used to prevent the app instance from entering an idle state when machine metrics are low but the application is
+    actually still in use
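+
+For example, the following sketch uses the `huggingface_hub` package to check that the GGUF file referenced in the
+Model field exists and is accessible with the provided token before launching the instance (the repository and file
+names below are hypothetical placeholders):
+
+```python
+from huggingface_hub import hf_hub_download
+
+# Hypothetical repository and file name - replace with the GGUF you plan to deploy
+repo_id = "provider/repo"
+filename = "path/to/model.gguf"
+
+# Downloads (or reuses a cached copy of) the GGUF file; raises an error if the
+# repository, file path, or token is wrong - a quick sanity check of the
+# "Model" and "Hugging Face Token" fields before launching the app instance
+local_path = hf_hub_download(repo_id=repo_id, filename=filename, token="<hugging-face-token>")
+print(local_path)
+```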
diff --git a/docs/webapp/applications/apps_overview.md b/docs/webapp/applications/apps_overview.md
index a4b1e601..2507a436 100644
--- a/docs/webapp/applications/apps_overview.md
+++ b/docs/webapp/applications/apps_overview.md
@@ -38,6 +38,7 @@ Applications for deploying user interfaces for models:
Applications for deploying machine learning models as scalable, secure services:
* [**Embedding Model Deployment**](apps_embed_model_deployment.md) - Deploy embedding models as networking services over a secure endpoint (available under ClearML Enterprise Plan)
* [**Model Deployment**](apps_model_deployment.md) - Deploy LLM models as networking services over a secure endpoint (available under ClearML Enterprise Plan)
+* [**llama.cpp**](apps_llama_deployment.md) - Deploy LLM models in GGUF format using [`llama.cpp`](https://github.com/ggerganov/llama.cpp) as networking services over a secure endpoint (available under ClearML Enterprise Plan)
:::info Autoscalers
Autoscaling ([AWS Autoscaler](apps_aws_autoscaler.md) and [GCP Autoscaler](apps_gcp_autoscaler.md))
diff --git a/sidebars.js b/sidebars.js
index f9526ce2..112e487c 100644
--- a/sidebars.js
+++ b/sidebars.js
@@ -234,7 +234,8 @@ module.exports = {
{
"Deploy": [
'webapp/applications/apps_embed_model_deployment',
- 'webapp/applications/apps_model_deployment'
+ 'webapp/applications/apps_model_deployment',
+ 'webapp/applications/apps_llama_deployment'
]
},
]
diff --git a/static/icons/ico-llama-active.svg b/static/icons/ico-llama-active.svg
new file mode 100644
index 00000000..2c5dbf84
--- /dev/null
+++ b/static/icons/ico-llama-active.svg
@@ -0,0 +1,7 @@
+
\ No newline at end of file
diff --git a/static/icons/ico-llama-idle.svg b/static/icons/ico-llama-idle.svg
new file mode 100644
index 00000000..d3775780
--- /dev/null
+++ b/static/icons/ico-llama-idle.svg
@@ -0,0 +1,7 @@
+
\ No newline at end of file
diff --git a/static/icons/ico-llama-loading.svg b/static/icons/ico-llama-loading.svg
new file mode 100644
index 00000000..a1e35eac
--- /dev/null
+++ b/static/icons/ico-llama-loading.svg
@@ -0,0 +1,24 @@
+
\ No newline at end of file
diff --git a/static/icons/ico-llama-stopped.svg b/static/icons/ico-llama-stopped.svg
new file mode 100644
index 00000000..5399a9bf
--- /dev/null
+++ b/static/icons/ico-llama-stopped.svg
@@ -0,0 +1,7 @@
+