From 8e1c790532ce159e87dcc10bbfc092226dcbeb68 Mon Sep 17 00:00:00 2001
From: revital
Date: Tue, 10 Jun 2025 09:27:24 +0300
Subject: [PATCH] Add ClearML Enterprise Server on k8s overview and ToC

---
 .../enterprise_deploy/k8s_overview.md | 58 +++++++--------
 sidebars.js                           | 71 ++++++++++++++++++-
 2 files changed, 97 insertions(+), 32 deletions(-)

diff --git a/docs/deploying_clearml/enterprise_deploy/k8s_overview.md b/docs/deploying_clearml/enterprise_deploy/k8s_overview.md
index 515c9180..0e2e5c39 100644
--- a/docs/deploying_clearml/enterprise_deploy/k8s_overview.md
+++ b/docs/deploying_clearml/enterprise_deploy/k8s_overview.md
@@ -1,22 +1,23 @@
 ---
-title: Complete ClearML Enterprise K8s Installation and Configuration
+title: ClearML Enterprise K8s Installation and Configuration
 ---

-This guides walks you through installing and configuring ClearML Enterprise on Kubernetes, from basic installation
-to advanced configuration options.
+This guides walks you through the complete installation and configuration of ClearML Enterprise Server on Kubernetes
+from initial setup to advanced configuration options. Follow the steps in the order presented for a smooth setup process.

 ## Prerequisites

-Before installing ClearML Enterprise, verify that the following components are in place:
+Before you begin, ensure the following requirements are met:

-- Kubernetes Cluster: A vanilla Kubernetes cluster is recommended for optimal GPU support.
+- Kubernetes Cluster: A standard Kubernetes (vanilla) cluster is recommended, especially for optimal GPU support.
 - CLI Tools: `kubectl` and `helm` must be installed and configured.
-- Ingress Controller: Required to expose services via HTTP/S (e.g., `nginx-ingress`). If you need external access, also
+- Ingress Controller: Required to expose services via HTTP/S (e.g., `nginx-ingress`). For external access,
   configure a LoadBalancer (e.g., `MetalLB`).
-- Server and workers communicating on HTTP/S (ports 80 and 443). Additionally, the TCP session feature requires a range
-  of ports for TCP traffic based on your configuration (see [AI App Gateway installation](appgw_install_k8s.md)).
+- Network Ports:
+  - HTTP/S communication (ports 80 and 443) must be available between the server and agents.
+  - TCP session support (for interactive apps) requires an additional port range (see the [AI App Gateway installation](appgw_install_k8s.md)).
 - DNS Configuration: A domain with subdomain support is required, ideally with trusted TLS certificates. All entries must
   be resolvable by the Ingress controller. Example subdomains:
   - Server:
@@ -50,54 +51,50 @@ See the [ClearML Server on Kubernetes installation guide](k8s.md).

 ### ClearML Applications

-ClearML Applications are like plugins that allow you to manage ML workloads and automatically run recurring workflows
+ClearML Applications are plugin services that automate ML workloads
 without any coding. Applications are installed on top of the ClearML Server and are provided by the ClearML team.

 See the [Application Installation guide](extra_configs/apps.md).

 ### ClearML Enterprise Agent

-The ClearML Enterprise Agent Enables scheduling and execution of distributed workloads (Tasks) on your Kubernetes cluster.
+The ClearML Enterprise Agent enables scheduling and execution of distributed workloads (Tasks) on your Kubernetes cluster.

 See the [ClearML Agent installation guide](agent_k8s.md).

 ### AI Application Gateway

-The ClearML AI Application Gateway provides secure and authenticated routing of HTTPS connections from a
-user's browser running the ClearML WebApp to pods running interactive ClearML applications.
-
-Some ClearML applications (e.g., JupyterLab, Streamlit) may require users to access running ClearML tasks in a secure
-and authenticated manner, based on ClearML user permissions. To provide access to these tasks running inside pods, an AI
-App Gateway service must run on the same network as the agents and pods running the tasks.
+The AI App Gateway enables secure, authenticated access to interactive ClearML applications (e.g., JupyterLab, Streamlit)
+based on ClearML user permissions. It routes HTTPS traffic from users to running pods on the cluster.

 See the [AI Application Gateway installation guide](appgw_install_k8s.md).

-## Additional Configurations
+## Additional Configuration Options

-### Setup GPUs
+### GPU Operator

-#### GPU Operator
+Deploy the NVIDIA GPU Operator in order to use Nvidia GPUs in ClearML.

-
-$$$$$$$$$$$$$$$$$$$$In order to use Nvidia GPUs in ClearML.
-
-See the [guide for deploying the NVIDIA GPU Operator alongside ClearML Enterprise](extra_configs/gpu_operator.md)
+See the [GPU Operator Basic Deployment guide](extra_configs/gpu_operator.md)

 ### Fractional GPU Support

-Enable allocating a fraction of the available GPU cores and memory for better utilization of shared GPU nodes.
+To optimize GPU utilization:

-$$$$$$TODO link
+* ClearML Dynamic MIG Orchestrator (CDMO): Manage GPU fractions using NVIDIA MIGs. See the [CDMO guide](fractional_gpus/cdmo.md)
+* ClearML Fractional GPU Injector (CFGI): Use fractional (non-MIG) GPU slices for efficient resource sharing. See the [CFGI guide](fractional_gpus/cfgi.md)
+* Mixed Deployments: Deploy both CDMO and CFGI in clusters with diverse GPU types. Use the NVIDIA GPU Operator to handle
+  mixed hardware setups. See the [CDMO and CFGI guide](fractional_gpus/cdmo_cfgi_same_cluster.md).

 ### Multi-Tenant Setup

-Enable isolated tenants within the same ClearML Server, each with separate configuration, users, and project namespaces.
+Run multiple isolated tenants on a single ClearML Server instance, each with its own configuration and user namespaces.

-See the [multi-tenant service installation guide](multi_tenant_k8s.md).
+See the [Multi-Tenant Service guide](multi_tenant_k8s.md).

 ### SSO (Identity Provider) Setup

-Configure Single sign-on identity providers on ClearML Enterprise.
+Integrate identity providers to enable SSO login for ClearML Enterprise users.

 See the [SSO Setup guide](extra_configs/sso_login.md).

@@ -109,12 +106,11 @@ See the [Custom Event guide](extra_configs/custom_events.md).

 ### ClearML Presign Service

-The ClearML Presign Service is a secure component for generating and redirecting pre-signed storage URLs for
-authenticated users.
+The ClearML Presign Service securely generates pre-signed storage URLs for authenticated users.

 See the [ClearML Presign Service guide](extra_configs/presign_service.md).

-### Install with a non-root user
+## Install with a Non-Root User

 In some Helm charts, you will find a values file called `values-enterprise-non-root-privileged.yaml` to be used for
 a non-root installation.
diff --git a/sidebars.js b/sidebars.js
index 2b37dd5a..3c897d55 100644
--- a/sidebars.js
+++ b/sidebars.js
@@ -664,7 +664,7 @@ module.exports = {
                 {'ClearML Application Gateway': [
                     'deploying_clearml/enterprise_deploy/appgw_install_compose',
                     'deploying_clearml/enterprise_deploy/appgw_install_compose_hosted',
-                    'deploying_clearml/enterprise_deploy/appgw_install_k8s',
+                    'deploying_clearml/enterprise_deploy/appgw_install_k8s',
                     ]
                 },
                 'deploying_clearml/enterprise_deploy/custom_billing',
@@ -719,5 +719,74 @@ module.exports = {
                 },
             ],
         },
+    ],
+    enterpriseDeploy: [
+        {
+            type: 'category',
+            collapsible: true,
+            label: 'ClearML Enterprise K8s Installation and Configuration',
+            link: {type: 'doc', id: 'deploying_clearml/enterprise_deploy/k8s_overview'},
+            items: [
+                'deploying_clearml/enterprise_deploy/agent_k8s',
+                'deploying_clearml/enterprise_deploy/extra_configs/apps',
+                {
+                    type: 'category',
+                    collapsible: true,
+                    label: 'Extra Configuration',
+                    items: [
+                        {
+                            type: 'doc',
+                            label: 'GPU Operator Basic Deployment',
+                            id: 'deploying_clearml/enterprise_deploy/extra_configs/gpu_operator'
+                        }, {
+                            type: 'doc',
+                            id: 'deploying_clearml/enterprise_deploy/extra_configs/custom_events'
+                        },
+                        {
+                            type: 'doc',
+                            id: 'deploying_clearml/enterprise_deploy/extra_configs/presign_service'
+                        },
+                        {
+                            type: 'doc',
+                            id: 'deploying_clearml/enterprise_deploy/extra_configs/dynamic_edit_task_pod_template'
+                        },
+                        {
+                            type: 'doc',
+                            id: 'deploying_clearml/enterprise_deploy/extra_configs/multi_node_training'
+                        },
+
+                        {
+                            type: 'doc',
+                            label: 'K8s Deployment with Self-Signed Certificates',
+                            id: 'deploying_clearml/enterprise_deploy/extra_configs/self_signed_certificates'
+                        },
+                        {
+                            type: 'doc',
+                            id: 'deploying_clearml/enterprise_deploy/extra_configs/sso_login'
+                        },
+                    ],
+                },
+                {
+                    type: 'category',
+                    collapsible: true,
+                    label: 'Fractional GPUs',
+                    items: [
+                        {
+                            type: 'doc',
+                            label: 'ClearML Dynamic MIG Operator (CDMO)',
+                            id: 'deploying_clearml/enterprise_deploy/fractional_gpus/cdmo'
+                        },
+                        {
+                            type: 'doc',
+                            id: 'deploying_clearml/enterprise_deploy/fractional_gpus/cfgi'
+                        },
+                        {
+                            type: 'doc',
+                            id: 'deploying_clearml/enterprise_deploy/fractional_gpus/cdmo_cfgi_same_cluster'
+                        },
+                    ],
+                },
+            ]
+        }
     ]
 };