diff --git a/docs/img/apps_embedding_model_deployment.png b/docs/img/apps_embedding_model_deployment.png
new file mode 100644
index 00000000..e0f8fec0
Binary files /dev/null and b/docs/img/apps_embedding_model_deployment.png differ
diff --git a/docs/img/apps_embedding_model_deployment_form.png b/docs/img/apps_embedding_model_deployment_form.png
new file mode 100644
index 00000000..54574778
Binary files /dev/null and b/docs/img/apps_embedding_model_deployment_form.png differ
diff --git a/docs/img/apps_model_deployment.png b/docs/img/apps_model_deployment.png
new file mode 100644
index 00000000..bf964baa
Binary files /dev/null and b/docs/img/apps_model_deployment.png differ
diff --git a/docs/img/apps_model_deployment_form.png b/docs/img/apps_model_deployment_form.png
new file mode 100644
index 00000000..f1594f9f
Binary files /dev/null and b/docs/img/apps_model_deployment_form.png differ
diff --git a/docs/webapp/applications/apps_embed_model_deployment.md b/docs/webapp/applications/apps_embed_model_deployment.md
new file mode 100644
index 00000000..ff584576
--- /dev/null
+++ b/docs/webapp/applications/apps_embed_model_deployment.md
@@ -0,0 +1,89 @@
+---
+title: Embedding Model Deployment
+---
+
+:::important Enterprise Feature
+The Embedding Model Deployment App is available under the ClearML Enterprise plan.
+:::
+
+The Embedding Model Deployment app enables users to quickly deploy embedding models as network services over a secure
+endpoint. The application supports various model configurations and customizations, addressing a range of embedding use
+cases. The app serves your model on a machine of your choice: once an app instance is running, it exposes your
+embedding model through a secure, publicly accessible network endpoint. The app monitors endpoint activity and shuts
+down if the model remains inactive for a specified maximum idle time.
+
+:::info Task Traffic Router
+The Embedding Model Deployment app relies on the ClearML Traffic Router, which implements a secure, authenticated
+network channel to the model.
+
+After starting an Embedding Model Deployment instance, you can view the following information in its dashboard:
+* Status indicator
+  * <img src="/docs/latest/icons/ico-embedding-model-active.svg" alt="Active" className="icon size-md space-sm" /> - App instance is running and is actively in use
+  * <img src="/docs/latest/icons/ico-embedding-model-loading.svg" alt="Loading" className="icon size-md space-sm" /> - App instance is setting up
+  * <img src="/docs/latest/icons/ico-embedding-model-idle.svg" alt="Idle" className="icon size-md space-sm" /> - App instance is idle
+  * <img src="/docs/latest/icons/ico-embedding-model-stopped.svg" alt="Stopped" className="icon size-md space-sm" /> - App instance is stopped
+* Idle time - Time elapsed since last activity
+* Endpoint - The publicly accessible URL of the model endpoint
+* API base - The base URL for the model endpoint
+* API key - The authentication key for the model endpoint
+* Test Command - An example command line for testing the deployed model (see the usage sketch below)
+* Requests - Number of requests over time
+* Latency - Request response time (ms) over time
+* Endpoint resource monitoring metrics over time
+ * CPU usage
+ * Network throughput
+ * Disk performance
+ * Memory performance
+ * GPU utilization
+ * GPU memory usage
+ * GPU temperature
+* Console log - The console log shows the app instance's console output: setup progress, status changes, error messages, etc.
+
+![Embedding Model Deployment dashboard](../../img/apps_embedding_model_deployment.png)
+
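+In addition to the dashboard's **Test Command**, below is a minimal Python sketch of querying the deployed endpoint,
+assuming it exposes an OpenAI-compatible `/v1/embeddings` route. The URL, key, and model name are illustrative
+placeholders; substitute the **API base** and **API key** values shown in the instance dashboard:
+
+```python
+import requests
+
+# Placeholders: copy the real values from the app instance dashboard
+API_BASE = "https://app.example.com/serve/my-embeddings"
+API_KEY = "my-secret-key"
+
+# Request an embedding for a single input string
+response = requests.post(
+    f"{API_BASE}/v1/embeddings",
+    headers={"Authorization": f"Bearer {API_KEY}"},
+    json={"model": "my-embedding-model", "input": "ClearML serves embedding models"},
+)
+response.raise_for_status()
+
+# OpenAI-compatible responses carry the vector under data[0].embedding
+vector = response.json()["data"][0]["embedding"]
+print(f"embedding dimension: {len(vector)}")
+```
+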
+## Embedding Model Deployment Instance Configuration
+
+When configuring a new Embedding Model Deployment instance, you can fill in the required parameters or reuse the
+configuration of a previously launched instance.
+
+Launch an app instance with the configuration of a previously launched instance using one of the following options:
+* Clone a previously launched app instance. This opens the instance launch form, prefilled with the original
+instance's configuration.
+* Import an app configuration file. You can export the configuration of a previously launched instance as a JSON file
+when viewing its configuration.
+
+The prefilled configuration form can be edited before launching the new app instance.
+
+To configure a new app instance, click `Launch New` to open the app's configuration form.
+
+![Embedding Model Deployment form](../../img/apps_embedding_model_deployment_form.png)
+
+### Configuration Options
+* Import Configuration - Import an app instance configuration file. This will fill the configuration form with the
+values from the file, which can be modified before launching the app instance
+* Project name - ClearML Project where your Embedding Model Deployment app instance will be stored
+* Task name - Name of ClearML Task for your Embedding Model Deployment app instance
+* Queue - The [ClearML Queue](../../fundamentals/agents_and_queues.md#what-is-a-queue) to which the Embedding Model
+Deployment app instance task will be enqueued (make sure an agent is assigned to it)
+* Model Configuration
+ * Model - A ClearML Model ID or a Hugging Face model name (e.g. `openai-community/gpt2`)
+ * Revision - The specific Hugging Face version of the model you want to use. You can use a specific commit ID or a
+ branch like `refs/pr/2`
+ * Tokenization Workers - Number of tokenizer workers used for payload tokenization, validation, and truncation.
+ Defaults to the number of CPU cores on the machine
+ * Dtype - The data type enforced on the model
+  * Pooling - Model pooling method (see the sketch following this list). If not set, the pooling configuration is
+  parsed from the model's `1_Pooling/config.json` file. If set, it overrides the model's pooling configuration.
+  Possible values:
+    * `cls` - Use the CLS token
+    * `mean` - Apply mean pooling
+    * `splade` - Apply SPLADE (Sparse Lexical and Expansion) pooling. This option is only available for `ForMaskedLM`
+    Transformer models
+ * \+ Add item - Add another model endpoint. Each model will be accessible through the same base URL, with the model
+ name appended to the URL.
+* Hugging Face Token - Token for accessing Hugging Face models that require authentication
+* Idle Time Limit (Hours) - Maximum idle time after which the app instance will shut down
+* Export Configuration - Export the app instance configuration as a JSON file, which you can later import to create a
+new instance with the same configuration
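+
+To make the pooling options concrete, below is an illustrative NumPy sketch of `cls` and `mean` pooling over
+token-level hidden states. This is not the app's implementation, and SPLADE is omitted for brevity; it only shows
+what each strategy computes:
+
+```python
+import numpy as np
+
+def pool(hidden_states: np.ndarray, attention_mask: np.ndarray, method: str) -> np.ndarray:
+    """Reduce (seq_len, hidden_dim) token states to a single (hidden_dim,) embedding."""
+    if method == "cls":
+        # Use the hidden state of the first ([CLS]) token as the embedding
+        return hidden_states[0]
+    if method == "mean":
+        # Average the hidden states of non-padding tokens only
+        mask = attention_mask[:, None]
+        return (hidden_states * mask).sum(axis=0) / mask.sum()
+    raise ValueError(f"unsupported pooling method: {method}")
+
+# Toy example: 4 tokens with hidden size 8, where the last token is padding
+states = np.random.rand(4, 8).astype(np.float32)
+mask = np.array([1, 1, 1, 0], dtype=np.float32)
+print(pool(states, mask, "mean").shape)  # (8,)
+```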
+
+
\ No newline at end of file
diff --git a/docs/webapp/applications/apps_model_deployment.md b/docs/webapp/applications/apps_model_deployment.md
new file mode 100644
index 00000000..d684030a
--- /dev/null
+++ b/docs/webapp/applications/apps_model_deployment.md
@@ -0,0 +1,143 @@
+---
+title: Model Deployment
+---
+
+:::important Enterprise Feature
+The Model Deployment App is available under the ClearML Enterprise plan.
+:::
+
+The Model Deployment application enables users to quickly deploy LLMs as network services over a secure endpoint. The
+application supports various model configurations and customizations to optimize performance and resource usage. The
+app serves your model on a machine of your choice: once an app instance is running, it exposes your model through a
+secure, publicly accessible network endpoint. The app monitors endpoint activity and shuts down if the model remains
+inactive for a specified maximum idle time.
+
+:::info Task Traffic Router
+The Model Deployment app relies on the ClearML Traffic Router, which implements a secure, authenticated network channel
+to the model.
+:::
+
+Once you start a Model Deployment instance, you can view the following information in its dashboard:
+* Status indicator
+  * <img src="/docs/latest/icons/ico-model-active.svg" alt="Active" className="icon size-md space-sm" /> - App instance is running and is actively in use
+  * <img src="/docs/latest/icons/ico-model-loading.svg" alt="Loading" className="icon size-md space-sm" /> - App instance is setting up
+  * <img src="/docs/latest/icons/ico-model-idle.svg" alt="Idle" className="icon size-md space-sm" /> - App instance is idle
+  * <img src="/docs/latest/icons/ico-model-stopped.svg" alt="Stopped" className="icon size-md space-sm" /> - App instance is stopped
+* Idle time - Time elapsed since last activity
+* Endpoint - The publicly accessible URL of the model endpoint
+* API base - The base URL for the model endpoint
+* API key - The authentication key for the model endpoint
+* Test Command - An example command line for testing the deployed model (see the usage sketch below)
+* Requests - Number of requests over time
+* Latency - Request response time (ms) over time
+* Endpoint resource monitoring metrics over time
+  * CPU usage
+ * Network throughput
+ * Disk performance
+ * Memory performance
+ * GPU utilization
+ * GPU memory usage
+ * GPU temperature
+* Console log - The console log shows the app instance's console output: setup progress, status changes, error messages,
+etc.
+
+![Model Deployment dashboard](../../img/apps_model_deployment.png)
+
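+In addition to the dashboard's **Test Command**, below is a minimal Python sketch of calling the deployed model,
+assuming the endpoint exposes an OpenAI-compatible chat completions API. The URL, key, and model name are illustrative
+placeholders; substitute the **API base** and **API key** values shown in the instance dashboard:
+
+```python
+from openai import OpenAI
+
+# Placeholders: copy the real values from the app instance dashboard
+client = OpenAI(
+    base_url="https://app.example.com/serve/my-llm/v1",
+    api_key="my-secret-key",
+)
+
+completion = client.chat.completions.create(
+    model="my-deployed-model",
+    messages=[{"role": "user", "content": "Say hello in one sentence."}],
+    max_tokens=64,
+)
+print(completion.choices[0].message.content)
+```
+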
+## Model Deployment Instance Configuration
+
+When configuring a new Model Deployment instance, you can fill in the required parameters or reuse the
+configuration of a previously launched instance.
+
+Launch an app instance with the configuration of a previously launched instance using one of the following options:
+* Clone a previously launched app instance. This opens the instance launch form, prefilled with the original
+instance's configuration.
+* Import an app configuration file. You can export the configuration of a previously launched instance as a JSON file
+when viewing its configuration.
+
+The prefilled configuration form can be edited before launching the new app instance.
+
+To configure a new app instance, click `Launch New` to open the app's configuration form.
+
+![Model Deployment form](../../img/apps_model_deployment_form.png)
+
+### Configuration Options
+* Import Configuration - Import an app instance configuration file. This will fill the instance launch form with the
+values from the file, which can be modified before launching the app instance
+* Project name - ClearML Project where your Model Deployment app instance will be stored
+* Task name - Name of ClearML Task for your Model Deployment app instance
+* Queue - The [ClearML Queue](../../fundamentals/agents_and_queues.md#what-is-a-queue) to which the Model Deployment app
+instance task will be enqueued (make sure an agent is assigned to that queue)
+* Model - A ClearML Model ID or a Hugging Face model name (e.g. `openai-community/gpt2`)
+* Model Configuration
+ * Trust Remote Code - Select to set Hugging Face [`trust_remote_code`](https://huggingface.co/docs/text-generation-inference/main/en/reference/launcher#trustremotecode)
+ to `true`.
+ * Revision - The specific Hugging Face version of the model (i.e. weights) you want to use. You
+ can use a specific commit ID or a branch like `refs/pr/2`.
+  * Code Revision - The specific revision to use for the model code on the Hugging Face Hub. It can be a branch name,
+  a tag name, or a commit ID. If unspecified, the default version is used.
+  * Max Model Length - Model context length. If unspecified, it is automatically derived from the model
+  * Tokenizer - A ClearML Model ID or a Hugging Face tokenizer name
+  * Tokenizer Revision - The specific Hugging Face version of the tokenizer to use. It can be a branch name, a tag
+  name, or a commit ID. If unspecified, the default version is used.
+  * Tokenizer Mode - Select the tokenizer mode:
+    * `auto` - Use the fast tokenizer if available
+    * `slow` - Use the slow tokenizer
+* LoRA Configuration
+ * Enable LoRA - If checked, enable handling of [LoRA adapters](https://huggingface.co/docs/diffusers/en/training/lora#lora).
+ * LoRA Modules - LoRA module configurations in the format `name=path`. Multiple modules can be specified.
+ * Max LoRAs - Max number of LoRAs in a single batch.
+ * Max LoRA Rank
+ * LoRA Extra Vocabulary Size - Maximum size of extra vocabulary that can be present in a LoRA adapter (added to the base model vocabulary).
+ * LoRA Dtype - Select the data type for LoRA. Select one of the following:
+ * `auto` - If selected, will default to base model data type.
+ * `float16`
+ * `bfloat16`
+ * `float32`
+  * Max CPU LoRAs - Maximum number of LoRAs to store in CPU memory. Must be greater than or equal to the
+  `Max Number of Sequences` field in the General section below. Defaults to `Max Number of Sequences`.
+* General (many of these options mirror vLLM engine arguments; see the sketch following this list)
+ * Disable Log Stats - Disable logging statistics
+  * Enforce Eager - Always use eager-mode PyTorch. If unselected, a hybrid of eager mode and CUDA graphs is used for
+  maximal performance and flexibility.
+  * Disable Custom All Reduce - See [vLLM ParallelConfig](https://github.com/vllm-project/vllm/blob/main/vllm/config.py#L701)
+ * Disable Logging Requests
+ * Fixed API Access Key - Key to use for authenticating API access. Set a fixed API key if you've set up the server to
+ be accessible without authentication. Setting an API key ensures that only authorized users can access the endpoint.
+  * Hugging Face Token - Token for accessing Hugging Face models that require authentication
+ * Load Format - Select the model weights format to load:
+    * `auto` - Load the weights in the safetensors format, falling back to the PyTorch bin format if safetensors is not available.
+    * `pt` - Load the weights in the PyTorch bin format.
+    * `safetensors` - Load the weights in the safetensors format.
+    * `npcache` - Load the weights in PyTorch format and store a NumPy cache to speed up loading.
+    * `dummy` - Initialize the weights with random values. Mainly used for profiling.
+ * Dtype - Select the data type for model weights and activations:
+    * `auto` - If selected, uses FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models.
+ * `half`
+ * `float16`
+ * `bfloat16`
+ * `float`
+ * `float32`
+  * KV Cache Type - Select the data type for KV cache storage:
+    * `auto` - If selected, uses the model data type. Note that FP8 is not supported when the CUDA version is lower than 11.8.
+ * `fp8_e5m2`
+ * Pipeline Parallel Size - Number of pipeline stages
+ * Tensor Parallel Size - Number of tensor parallel replicas
+  * Max Parallel Loading Workers - Load the model sequentially in multiple batches, to avoid RAM OOM when using tensor
+  parallelism with large models
+ * Token Block Size
+ * Random Seed
+ * Swap Space - CPU swap space size (GiB) per GPU
+ * GPU Memory Utilization - The fraction of GPU memory to be used for the model executor, which can range from 0 to 1
+ * Max Number of Batched Tokens - Maximum number of batched tokens per iteration
+ * Max Number of Sequences - Maximum number of sequences per iteration
+ * Max Number of Paddings - Maximum number of paddings in a batch
+  * Quantization - Method used to quantize the weights. If unspecified, the `quantization_config` attribute in the
+  model config file is checked first. If that is also unspecified, the model weights are assumed to be unquantized,
+  and `dtype` is used to determine the data type of the weights.
+  * Max Context Length to Capture - Maximum context length covered by CUDA graphs. Sequences with a context length
+  larger than this fall back to eager mode.
+  * Max Log Length - Maximum number of prompt characters or prompt IDs printed in the log. Default: unlimited
+* Idle Time Limit (Hours) - Maximum idle time after which the app instance will shut down
+* Export Configuration - Export the app instance configuration as a JSON file, which you can later import to create a
+new instance with the same configuration
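+
+Many of the General options above mirror vLLM engine arguments (note the vLLM `ParallelConfig` reference for Disable
+Custom All Reduce). For intuition, below is a rough, hypothetical offline equivalent of one possible configuration
+using vLLM's Python API. It is a sketch of what the fields control, with illustrative values, not the app's actual
+launch code:
+
+```python
+from vllm import LLM, SamplingParams
+
+# Hypothetical values showing which engine argument each form field maps to
+llm = LLM(
+    model="openai-community/gpt2",   # Model
+    trust_remote_code=False,         # Trust Remote Code
+    tokenizer_mode="auto",           # Tokenizer Mode
+    dtype="auto",                    # Dtype
+    tensor_parallel_size=1,          # Tensor Parallel Size
+    gpu_memory_utilization=0.9,      # GPU Memory Utilization
+    swap_space=4,                    # Swap Space (GiB per GPU)
+    enforce_eager=False,             # Enforce Eager
+    seed=0,                          # Random Seed
+)
+
+outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
+print(outputs[0].outputs[0].text)
+```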
+
+
\ No newline at end of file
diff --git a/docs/webapp/applications/apps_overview.md b/docs/webapp/applications/apps_overview.md
index 4c0436f0..bf09dbb5 100644
--- a/docs/webapp/applications/apps_overview.md
+++ b/docs/webapp/applications/apps_overview.md
@@ -13,15 +13,21 @@ Use ClearML's GUI Applications to manage ML workloads and automatically run your
Configure and launch app instances, then track their execution from the app dashboard.
ClearML provides the following applications:
-* [**Hyperparameter Optimization**](apps_hpo.md) - Find the parameter values that yield the best performing models
-* **Nvidia Clara** - Train models using Nvidia's Clara framework
-* [**Project Dashboard**](apps_dashboard.md) - High-level project monitoring with Slack alerts
-* [**Task Scheduler**](apps_task_scheduler.md) - Schedule tasks for one-shot and/or periodic execution at specified times (available under ClearML Enterprise Plan)
-* [**Trigger Manager**](apps_trigger_manager.md) - Define tasks to be run when predefined events occur (available under ClearML Enterprise Plan)
-* [**Jupyter Lab**](apps_jupyter_lab.md) - Launch a Jupyter Lab session on a remote machine (available under ClearML Enterprise Plan)
-* [**VS Code**](apps_vscode.md) - Launch a VS Code session on a remote machine (available under ClearML Enterprise Plan)
-* [**Gradio Launcher**](apps_gradio.md) - Create visual web interfaces for your models with Gradio (available under ClearML Enterprise Plan)
-* [**Streamlit Launcher**](apps_streamlit.md) - Create visual web interfaces for your models with Streamlit (available under ClearML Enterprise Plan)
+* General:
+ * [**Hyperparameter Optimization**](apps_hpo.md) - Find the parameter values that yield the best performing models
+ * **Nvidia Clara** - Train models using Nvidia's Clara framework
+ * [**Project Dashboard**](apps_dashboard.md) - High-level project monitoring with Slack alerts
+ * [**Task Scheduler**](apps_task_scheduler.md) - Schedule tasks for one-shot and/or periodic execution at specified times (available under ClearML Enterprise Plan)
+ * [**Trigger Manager**](apps_trigger_manager.md) - Define tasks to be run when predefined events occur (available under ClearML Enterprise Plan)
+* AI Dev:
+ * [**Jupyter Lab**](apps_jupyter_lab.md) - Launch a Jupyter Lab session on a remote machine (available under ClearML Enterprise Plan)
+ * [**VS Code**](apps_vscode.md) - Launch a VS Code session on a remote machine (available under ClearML Enterprise Plan)
+* UI Dev:
+ * [**Gradio Launcher**](apps_gradio.md) - Create visual web interfaces for your models with Gradio (available under ClearML Enterprise Plan)
+ * [**Streamlit Launcher**](apps_streamlit.md) - Create visual web interfaces for your models with Streamlit (available under ClearML Enterprise Plan)
+* Deploy:
+  * [**Embedding Model Deployment**](apps_embed_model_deployment.md) - Deploy embedding models as network services over a secure endpoint (available under ClearML Enterprise Plan)
+  * [**Model Deployment**](apps_model_deployment.md) - Deploy LLMs as network services over a secure endpoint (available under ClearML Enterprise Plan)
:::info Autoscalers
Autoscaling ([AWS Autoscaler](apps_aws_autoscaler.md) and [GCP Autoscaler](apps_gcp_autoscaler.md))
diff --git a/sidebars.js b/sidebars.js
index 39b79d8c..f62c0d4b 100644
--- a/sidebars.js
+++ b/sidebars.js
@@ -147,14 +147,32 @@ module.exports = {
{
'ClearML Applications': [
'webapp/applications/apps_overview',
- 'webapp/applications/apps_hpo',
- 'webapp/applications/apps_dashboard',
- 'webapp/applications/apps_task_scheduler',
- 'webapp/applications/apps_trigger_manager',
- 'webapp/applications/apps_jupyter_lab',
- 'webapp/applications/apps_vscode',
- 'webapp/applications/apps_gradio',
- 'webapp/applications/apps_streamlit'
+ {
+ "General": [
+ 'webapp/applications/apps_hpo',
+ 'webapp/applications/apps_dashboard',
+ 'webapp/applications/apps_task_scheduler',
+ 'webapp/applications/apps_trigger_manager',
+ ]
+ },
+ {
+ "AI Dev": [
+ 'webapp/applications/apps_jupyter_lab',
+ 'webapp/applications/apps_vscode',
+ ]
+ },
+ {
+ "UI Dev": [
+ 'webapp/applications/apps_gradio',
+ 'webapp/applications/apps_streamlit'
+ ]
+ },
+ {
+ "Deploy": [
+ 'webapp/applications/apps_embed_model_deployment',
+ 'webapp/applications/apps_model_deployment'
+ ]
+ },
]
},
diff --git a/static/icons/ico-embedding-model-active.svg b/static/icons/ico-embedding-model-active.svg
new file mode 100644
index 00000000..9fe12caa
--- /dev/null
+++ b/static/icons/ico-embedding-model-active.svg
@@ -0,0 +1,7 @@
+
\ No newline at end of file
diff --git a/static/icons/ico-embedding-model-idle.svg b/static/icons/ico-embedding-model-idle.svg
new file mode 100644
index 00000000..e8f305ad
--- /dev/null
+++ b/static/icons/ico-embedding-model-idle.svg
@@ -0,0 +1,7 @@
+
\ No newline at end of file
diff --git a/static/icons/ico-embedding-model-loading.svg b/static/icons/ico-embedding-model-loading.svg
new file mode 100644
index 00000000..7bd28a6b
--- /dev/null
+++ b/static/icons/ico-embedding-model-loading.svg
@@ -0,0 +1,24 @@
+
\ No newline at end of file
diff --git a/static/icons/ico-embedding-model-stopped.svg b/static/icons/ico-embedding-model-stopped.svg
new file mode 100644
index 00000000..885630c1
--- /dev/null
+++ b/static/icons/ico-embedding-model-stopped.svg
@@ -0,0 +1,7 @@
+
diff --git a/static/icons/ico-model-active.svg b/static/icons/ico-model-active.svg
new file mode 100644
index 00000000..a14f3407
--- /dev/null
+++ b/static/icons/ico-model-active.svg
@@ -0,0 +1,7 @@
+
\ No newline at end of file
diff --git a/static/icons/ico-model-idle.svg b/static/icons/ico-model-idle.svg
new file mode 100644
index 00000000..124ea782
--- /dev/null
+++ b/static/icons/ico-model-idle.svg
@@ -0,0 +1,7 @@
+
\ No newline at end of file
diff --git a/static/icons/ico-model-loading.svg b/static/icons/ico-model-loading.svg
new file mode 100644
index 00000000..d7954e9d
--- /dev/null
+++ b/static/icons/ico-model-loading.svg
@@ -0,0 +1,24 @@
+
\ No newline at end of file
diff --git a/static/icons/ico-model-stopped.svg b/static/icons/ico-model-stopped.svg
new file mode 100644
index 00000000..670d0ea1
--- /dev/null
+++ b/static/icons/ico-model-stopped.svg
@@ -0,0 +1,7 @@
+