diff --git a/docs/img/apps_embedding_model_deployment.png b/docs/img/apps_embedding_model_deployment.png new file mode 100644 index 00000000..e0f8fec0 Binary files /dev/null and b/docs/img/apps_embedding_model_deployment.png differ diff --git a/docs/img/apps_embedding_model_deployment_form.png b/docs/img/apps_embedding_model_deployment_form.png new file mode 100644 index 00000000..54574778 Binary files /dev/null and b/docs/img/apps_embedding_model_deployment_form.png differ diff --git a/docs/img/apps_model_deployment.png b/docs/img/apps_model_deployment.png new file mode 100644 index 00000000..bf964baa Binary files /dev/null and b/docs/img/apps_model_deployment.png differ diff --git a/docs/img/apps_model_deployment_form.png b/docs/img/apps_model_deployment_form.png new file mode 100644 index 00000000..f1594f9f Binary files /dev/null and b/docs/img/apps_model_deployment_form.png differ diff --git a/docs/webapp/applications/apps_embed_model_deployment.md b/docs/webapp/applications/apps_embed_model_deployment.md new file mode 100644 index 00000000..ff584576 --- /dev/null +++ b/docs/webapp/applications/apps_embed_model_deployment.md @@ -0,0 +1,89 @@ +--- +title: Embedding Model Deployment +--- + +:::important Enterprise Feature +The Embedding Model Deployment App is available under the ClearML Enterprise plan. +::: + +The Embedding Model Deployment app enables users to quickly deploy embedding models as networking services over a secure +endpoint. This application supports various model configurations and customizations, addressing a range of embedding use +cases. The Embedding Model Deployment application serves your model on a machine of your choice. Once an app instance is +running, it serves your embedding model through a secure, publicly accessible network endpoint. The app monitors +endpoint activity and shuts down if the model remains inactive for a specified maximum idle time. 
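Once an instance is up, the deployed endpoint behaves like any authenticated HTTP service. The sketch below shows what a minimal client could look like; the endpoint URL, API key, `/embed` route, and `{"inputs": [...]}` payload shape are all illustrative assumptions, and the instance dashboard's Test Command shows the real values for your deployment:

```python
import json
import urllib.request

# Hypothetical values: copy the real ones from the app instance dashboard
# (the Endpoint / API base / API key fields).
API_BASE = "https://app.example.clearml.io/serve/my-embedding-endpoint"
API_KEY = "my-secret-api-key"

def build_embed_request(texts):
    """Build the URL, headers, and JSON body for an embedding call."""
    headers = {
        # The key authenticates the request against the secure endpoint
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"inputs": texts}).encode("utf-8")
    return f"{API_BASE}/embed", headers, body

def embed(texts):
    """POST the texts and return the parsed JSON response (embedding vectors)."""
    url, headers, body = build_embed_request(texts)
    req = urllib.request.Request(url, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())
```

Because the app shuts the endpoint down after the configured idle time, the first request after a long pause may fail and need a retry while the instance spins back up.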
+ +:::info Task Traffic Router +The Embedding Model Deployment app relies on the ClearML Traffic Router which implements a secure, authenticated network +channel to the model +::: + +After starting an Embedding Model Deployment instance, you can view the following information in its dashboard: +* Status indicator + * Active instance - App instance is running and is actively in use + * Loading instance - App instance is setting up + * Idle instance - App instance is idle + * Stopped instance - App instance is stopped +* Idle time - Time elapsed since last activity +* Endpoint - The publicly accessible URL of the model endpoint +* API base - The base URL for the model endpoint +* API key - The authentication key for the model endpoint +* Test Command - An example command line to test the deployed model +* Requests - Number of requests over time +* Latency - Request response time (ms) over time +* Endpoint resource monitoring metrics over time + * CPU usage + * Network throughput + * Disk performance + * Memory performance + * GPU utilization + * GPU memory usage + * GPU temperature +* Console log - The console log shows the app instance's console output: setup progress, status changes, error messages, etc. + +![Embedding Model Deployment app](../../img/apps_embedding_model_deployment.png) + +## Embedding Model Deployment Instance Configuration + +When configuring a new Embedding Model Deployment instance, you can fill in the required parameters or reuse the +configuration of a previously launched instance. + +Launch an app instance with the configuration of a previously launched instance using one of the following options: +* Cloning a previously launched app instance will open the instance launch form with the original instance's +configuration prefilled. +* Importing an app configuration file. You can export the configuration of a previously launched instance as a JSON file +when viewing its configuration. 
+ +The prefilled configuration form can be edited before launching the new app instance. + +To configure a new app instance, click `Launch New` +to open the app's configuration form. + +### Configuration Options +* Import Configuration - Import an app instance configuration file. This will fill the configuration form with the +values from the file, which can be modified before launching the app instance +* Project name - ClearML Project where your Embedding Model Deployment app instance will be stored +* Task name - Name of ClearML Task for your Embedding Model Deployment app instance +* Queue - The [ClearML Queue](../../fundamentals/agents_and_queues.md#what-is-a-queue) to which the Embedding Model +Deployment app instance task will be enqueued (make sure an agent is assigned to it) +* Model Configuration + * Model - A ClearML Model ID or a Hugging Face model name (e.g. `openai-community/gpt2`) + * Revision - The specific Hugging Face version of the model you want to use. You can use a specific commit ID or a + branch like `refs/pr/2` + * Tokenization Workers - Number of tokenizer workers used for payload tokenization, validation, and truncation. + Defaults to the number of CPU cores on the machine + * Dtype - The data type enforced on the model + * Pooling - Model pooling method. If `pooling` is not set, the pooling configuration will be parsed from the model + `1_Pooling/config.json` configuration. If `pooling` is set, it will override the model pooling configuration. Possible + values: + * `cls`: Use CLS token + * `mean`: Apply Mean pooling + * `splade`: Apply SPLADE (Sparse Lexical and Expansion) pooling. This option is only available for `ForMaskedLM` + Transformer models + * \+ Add item - Add another model endpoint. Each model will be accessible through the same base URL, with the model + name appended to the URL.
+* Hugging Face Token - Token for accessing Hugging Face models that require authentication +* Idle Time Limit (Hours) - Maximum idle time after which the app instance will shut down +* Export Configuration - Export the app instance configuration as a JSON file, which you can later import to create a +new instance with the same configuration + +![Embedding Model Deployment form](../../img/apps_embedding_model_deployment_form.png) \ No newline at end of file diff --git a/docs/webapp/applications/apps_model_deployment.md b/docs/webapp/applications/apps_model_deployment.md new file mode 100644 index 00000000..d684030a --- /dev/null +++ b/docs/webapp/applications/apps_model_deployment.md @@ -0,0 +1,143 @@ +--- +title: Model Deployment +--- + +:::important Enterprise Feature +The Model Deployment App is available under the ClearML Enterprise plan. +::: + +The Model Deployment application enables users to quickly deploy LLMs as networking services over a secure +endpoint. This application supports various model configurations and customizations to optimize performance and resource +usage. The Model Deployment application serves your model on a machine of your choice. Once an app instance is running, +it serves your model through a secure, publicly accessible network endpoint. The app monitors endpoint activity and +shuts down if the model remains inactive for a specified maximum idle time.
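If the served endpoint follows the common OpenAI-style completions convention (an assumption; the dashboard's Test Command is the authoritative example for your instance), a client sketch could look like the following. The API base, key, model name, and `/completions` route are all hypothetical placeholders:

```python
import json
import urllib.request

# Hypothetical values: take the real ones from the instance dashboard
# (API base / API key) and its Test Command.
API_BASE = "https://app.example.clearml.io/serve/my-llm-endpoint/v1"
API_KEY = "my-secret-api-key"
MODEL = "openai-community/gpt2"  # whichever model the instance was launched with

def build_completion_request(prompt, max_tokens=64):
    """Build the URL, headers, and JSON body for an OpenAI-style completions call."""
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
    }).encode("utf-8")
    return f"{API_BASE}/completions", headers, body

def complete(prompt, **kwargs):
    """POST the prompt and return the generated text of the first choice."""
    url, headers, body = build_completion_request(prompt, **kwargs)
    req = urllib.request.Request(url, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())["choices"][0]["text"]
```

As with the embedding app, an instance that idled out will need time to spin back up before the first request succeeds.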
+ +:::info Task Traffic Router +The Model Deployment app relies on the ClearML Traffic Router which implements a secure, authenticated network channel +to the model. +::: + +Once you start a Model Deployment instance, you can view the following information in its dashboard: +* Status indicator + * Active instance - App instance is running and is actively in use + * Loading instance - App instance is setting up + * Idle instance - App instance is idle + * Stopped instance - App instance is stopped +* Idle time - Time elapsed since last activity +* Endpoint - The publicly accessible URL of the model endpoint +* API base - The base URL for the model endpoint +* API key - The authentication key for the model endpoint +* Test Command - An example command line to test the deployed model +* Requests - Number of requests over time +* Latency - Request response time (ms) over time +* Endpoint resource monitoring metrics over time + * CPU usage + * Network throughput + * Disk performance + * Memory performance + * GPU utilization + * GPU memory usage + * GPU temperature +* Console log - The console log shows the app instance's console output: setup progress, status changes, error messages, +etc. + +![Model Deployment App](../../img/apps_model_deployment.png) + +## Model Deployment Instance Configuration + +When configuring a new Model Deployment instance, you can fill in the required parameters or reuse the +configuration of a previously launched instance. + +Launch an app instance with the configuration of a previously launched instance using one of the following options: +* Cloning a previously launched app instance will open the instance launch form with the original instance's +configuration prefilled. +* Importing an app configuration file. You can export the configuration of a previously launched instance as a JSON file +when viewing its configuration. + +The prefilled configuration form can be edited before launching the new app instance.
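For reference, an exported configuration is a plain JSON file mirroring the form fields described below. The exact schema and field names are not documented here, so the following is only an illustrative sketch with hypothetical keys:

```json
{
  "project": "LLM Serving",
  "task_name": "my-llm-endpoint",
  "queue": "gpu-queue",
  "model": "openai-community/gpt2",
  "general": {
    "gpu_memory_utilization": 0.9,
    "max_num_seqs": 256
  },
  "idle_time_limit_hours": 2
}
```

Exporting a working instance's configuration and diffing it against a failing one is a quick way to spot misconfigured fields.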
+ +To configure a new app instance, click `Launch New` +to open the app's configuration form. + +### Configuration Options +* Import Configuration - Import an app instance configuration file. This will fill the instance launch form with the +values from the file, which can be modified before launching the app instance +* Project name - ClearML Project where your Model Deployment app instance will be stored +* Task name - Name of ClearML Task for your Model Deployment app instance +* Queue - The [ClearML Queue](../../fundamentals/agents_and_queues.md#what-is-a-queue) to which the Model Deployment app +instance task will be enqueued (make sure an agent is assigned to that queue) +* Model - A ClearML Model ID or a Hugging Face model name (e.g. `openai-community/gpt2`) +* Model Configuration + * Trust Remote Code - Select to set Hugging Face [`trust_remote_code`](https://huggingface.co/docs/text-generation-inference/main/en/reference/launcher#trustremotecode) + to `true`. + * Revision - The specific Hugging Face version of the model (i.e. weights) you want to use. You + can use a specific commit ID or a branch like `refs/pr/2`. + * Code Revision - The specific revision to use for the model code on Hugging Face Hub. It can be a branch name, a tag + name, or a commit ID. If unspecified, will use the default version. + * Max Model Length - Model context length. If unspecified, will be automatically derived from the model + * Tokenizer - A ClearML Model ID or a Hugging Face tokenizer + * Tokenizer Revision - The specific tokenizer Hugging Face version to use. It can be a branch name, a tag name, or a + commit ID. If unspecified, will use the default version. + * Tokenizer Mode - Select the tokenizer mode: + * `auto` - Uses the fast tokenizer if available + * `slow` - Uses the slow tokenizer. +* LoRA Configuration + * Enable LoRA - If checked, enable handling of [LoRA adapters](https://huggingface.co/docs/diffusers/en/training/lora#lora). + * LoRA Modules - LoRA module configurations in the format `name=path`.
Multiple modules can be specified. + * Max LoRAs - Max number of LoRAs in a single batch. + * Max LoRA Rank + * LoRA Extra Vocabulary Size - Maximum size of extra vocabulary that can be present in a LoRA adapter (added to the base model vocabulary). + * LoRA Dtype - Select the data type for LoRA: + * `auto` - If selected, will default to base model data type. + * `float16` + * `bfloat16` + * `float32` + * Max CPU LoRAs - Maximum number of LoRAs to store in CPU memory. Must be greater than or equal to the + `Max Number of Sequences` field in the General section below. Defaults to `Max Number of Sequences`. +* General + * Disable Log Stats - Disable logging statistics + * Enforce Eager - Always use eager-mode PyTorch. If False, a hybrid of eager mode and CUDA graph will be used for + maximal performance and flexibility. + * Disable Custom All Reduce - See [vLLM ParallelConfig](https://github.com/vllm-project/vllm/blob/main/vllm/config.py#L701) + * Disable Logging Requests + * Fixed API Access Key - Key to use for authenticating API access. Set a fixed API key if you've set up the server to + be accessible without authentication. Setting an API key ensures that only authorized users can access the endpoint. + * Hugging Face Token - Token for accessing Hugging Face models that require authentication + * Load Format - Select the model weights format to load: + * `auto` - Load the weights in the safetensors format and fall back to the pytorch bin format if safetensors format is not available. + * `pt` - Load the weights in the pytorch bin format. + * `safetensors` - Load the weights in the safetensors format. + * `npcache` - Load the weights in pytorch format and store a numpy cache to speed up the loading. + * `dummy` - Initialize the weights with random values. Mainly used for profiling.
+ * Dtype - Select the data type for model weights and activations: + * `auto` - If selected, will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models. + * `half` + * `float16` + * `bfloat16` + * `float` + * `float32` + * KV Cache Type - Select the data type for KV cache storage: + * `auto` - If selected, will use the model data type. Note that FP8 is not supported when the CUDA version is lower than 11.8. + * `fp8_e5m2` + * Pipeline Parallel Size - Number of pipeline stages + * Tensor Parallel Size - Number of tensor parallel replicas + * Max Parallel Loading Workers - Load model sequentially in multiple batches, to avoid RAM OOM when using tensor + parallel and large models + * Token Block Size + * Random Seed + * Swap Space - CPU swap space size (GiB) per GPU + * GPU Memory Utilization - The fraction of GPU memory to be used for the model executor, which can range from 0 to 1 + * Max Number of Batched Tokens - Maximum number of batched tokens per iteration + * Max Number of Sequences - Maximum number of sequences per iteration + * Max Number of Paddings - Maximum number of paddings in a batch + * Quantization - Method used to quantize the weights. If None, we first check the `quantization_config` attribute in + the model config file. If that is None, we assume the model weights are not quantized and use `dtype` to determine the + data type of the weights. + * Max Context Length to Capture - Maximum context length covered by CUDA graphs. When a sequence has context length + larger than this, we fall back to eager mode. + * Max Log Length - Max number of prompt characters or prompt ID numbers being printed in log.
Default: unlimited +* Idle Time Limit (Hours) - Maximum idle time after which the app instance will shut down +* Export Configuration - Export the app instance configuration as a JSON file, which you can later import to create a +new instance with the same configuration + +![Model Deployment app form](../../img/apps_model_deployment_form.png) \ No newline at end of file diff --git a/docs/webapp/applications/apps_overview.md b/docs/webapp/applications/apps_overview.md index 4c0436f0..bf09dbb5 100644 --- a/docs/webapp/applications/apps_overview.md +++ b/docs/webapp/applications/apps_overview.md @@ -13,15 +13,21 @@ Use ClearML's GUI Applications to manage ML workloads and automatically run your Configure and launch app instances, then track their execution from the app dashboard. ClearML provides the following applications: -* [**Hyperparameter Optimization**](apps_hpo.md) - Find the parameter values that yield the best performing models -* **Nvidia Clara** - Train models using Nvidia's Clara framework -* [**Project Dashboard**](apps_dashboard.md) - High-level project monitoring with Slack alerts -* [**Task Scheduler**](apps_task_scheduler.md) - Schedule tasks for one-shot and/or periodic execution at specified times (available under ClearML Enterprise Plan) -* [**Trigger Manager**](apps_trigger_manager.md) - Define tasks to be run when predefined events occur (available under ClearML Enterprise Plan) -* [**Jupyter Lab**](apps_jupyter_lab.md) - Launch a Jupyter Lab session on a remote machine (available under ClearML Enterprise Plan) -* [**VS Code**](apps_vscode.md) - Launch a VS Code session on a remote machine (available under ClearML Enterprise Plan) -* [**Gradio Launcher**](apps_gradio.md) - Create visual web interfaces for your models with Gradio (available under ClearML Enterprise Plan) -* [**Streamlit Launcher**](apps_streamlit.md) - Create visual web interfaces for your models with Streamlit (available under ClearML Enterprise Plan) +* General: + * 
[**Hyperparameter Optimization**](apps_hpo.md) - Find the parameter values that yield the best performing models + * **Nvidia Clara** - Train models using Nvidia's Clara framework + * [**Project Dashboard**](apps_dashboard.md) - High-level project monitoring with Slack alerts + * [**Task Scheduler**](apps_task_scheduler.md) - Schedule tasks for one-shot and/or periodic execution at specified times (available under ClearML Enterprise Plan) + * [**Trigger Manager**](apps_trigger_manager.md) - Define tasks to be run when predefined events occur (available under ClearML Enterprise Plan) +* AI Dev: + * [**Jupyter Lab**](apps_jupyter_lab.md) - Launch a Jupyter Lab session on a remote machine (available under ClearML Enterprise Plan) + * [**VS Code**](apps_vscode.md) - Launch a VS Code session on a remote machine (available under ClearML Enterprise Plan) +* UI Dev: + * [**Gradio Launcher**](apps_gradio.md) - Create visual web interfaces for your models with Gradio (available under ClearML Enterprise Plan) + * [**Streamlit Launcher**](apps_streamlit.md) - Create visual web interfaces for your models with Streamlit (available under ClearML Enterprise Plan) +* Deploy: + * [**Embedding Model Deployment**](apps_embed_model_deployment.md) - Deploy embedding models as networking services over a secure endpoint (available under ClearML Enterprise Plan) + * [**Model Deployment**](apps_model_deployment.md) - Deploy LLMs as networking services over a secure endpoint (available under ClearML Enterprise Plan) :::info Autoscalers Autoscaling ([AWS Autoscaler](apps_aws_autoscaler.md) and [GCP Autoscaler](apps_gcp_autoscaler.md)) diff --git a/sidebars.js b/sidebars.js index 39b79d8c..f62c0d4b 100644 --- a/sidebars.js +++ b/sidebars.js @@ -147,14 +147,32 @@ module.exports = { { 'ClearML Applications': [ 'webapp/applications/apps_overview', - 'webapp/applications/apps_hpo', - 'webapp/applications/apps_dashboard', - 'webapp/applications/apps_task_scheduler', -
'webapp/applications/apps_trigger_manager', - 'webapp/applications/apps_jupyter_lab', - 'webapp/applications/apps_vscode', - 'webapp/applications/apps_gradio', - 'webapp/applications/apps_streamlit' + { + "General": [ + 'webapp/applications/apps_hpo', + 'webapp/applications/apps_dashboard', + 'webapp/applications/apps_task_scheduler', + 'webapp/applications/apps_trigger_manager', + ] + }, + { + "AI Dev": [ + 'webapp/applications/apps_jupyter_lab', + 'webapp/applications/apps_vscode', + ] + }, + { + "UI Dev": [ + 'webapp/applications/apps_gradio', + 'webapp/applications/apps_streamlit' + ] + }, + { + "Deploy": [ + 'webapp/applications/apps_embed_model_deployment', + 'webapp/applications/apps_model_deployment' + ] + }, ] }, diff --git a/static/icons/ico-embedding-model-active.svg b/static/icons/ico-embedding-model-active.svg new file mode 100644 index 00000000..9fe12caa --- /dev/null +++ b/static/icons/ico-embedding-model-active.svg @@ -0,0 +1,7 @@ + + + + + + + \ No newline at end of file diff --git a/static/icons/ico-embedding-model-idle.svg b/static/icons/ico-embedding-model-idle.svg new file mode 100644 index 00000000..e8f305ad --- /dev/null +++ b/static/icons/ico-embedding-model-idle.svg @@ -0,0 +1,7 @@ + + + + + + + \ No newline at end of file diff --git a/static/icons/ico-embedding-model-loading.svg b/static/icons/ico-embedding-model-loading.svg new file mode 100644 index 00000000..7bd28a6b --- /dev/null +++ b/static/icons/ico-embedding-model-loading.svg @@ -0,0 +1,24 @@ + + + + + + + \ No newline at end of file diff --git a/static/icons/ico-embedding-model-stopped.svg b/static/icons/ico-embedding-model-stopped.svg new file mode 100644 index 00000000..885630c1 --- /dev/null +++ b/static/icons/ico-embedding-model-stopped.svg @@ -0,0 +1,7 @@ + + + + + + + diff --git a/static/icons/ico-model-active.svg b/static/icons/ico-model-active.svg new file mode 100644 index 00000000..a14f3407 --- /dev/null +++ b/static/icons/ico-model-active.svg @@ -0,0 +1,7 @@ + + + + + + 
+ \ No newline at end of file diff --git a/static/icons/ico-model-idle.svg b/static/icons/ico-model-idle.svg new file mode 100644 index 00000000..124ea782 --- /dev/null +++ b/static/icons/ico-model-idle.svg @@ -0,0 +1,7 @@ + + + + + + + \ No newline at end of file diff --git a/static/icons/ico-model-loading.svg b/static/icons/ico-model-loading.svg new file mode 100644 index 00000000..d7954e9d --- /dev/null +++ b/static/icons/ico-model-loading.svg @@ -0,0 +1,24 @@ + + + + + + + \ No newline at end of file diff --git a/static/icons/ico-model-stopped.svg b/static/icons/ico-model-stopped.svg new file mode 100644 index 00000000..670d0ea1 --- /dev/null +++ b/static/icons/ico-model-stopped.svg @@ -0,0 +1,7 @@ + + + + + + +