diff --git a/docs/deploying_clearml/enterprise_deploy/agent_k8s.md b/docs/deploying_clearml/enterprise_deploy/agent_k8s.md
new file mode 100644
index 00000000..e2c2c4dc
--- /dev/null
+++ b/docs/deploying_clearml/enterprise_deploy/agent_k8s.md
@@ -0,0 +1,169 @@
---
title: ClearML Agent on Kubernetes
---

The ClearML Agent enables scheduling and executing distributed experiments on a Kubernetes cluster.

## Prerequisites

- A running [ClearML Enterprise Server](k8s.md)
- API credentials (`` and ``) generated via
  the ClearML UI (**Settings > Workspace > API Credentials > Create new credentials**). For more information, see [ClearML API Credentials](../../webapp/settings/webapp_settings_profile.md#clearml-api-credentials).

  :::note
  Make sure these credentials belong to an admin user or a service user with admin privileges.
  :::

- The worker environment must be able to access the ClearML Server over the same network.
- A Helm token to access the `clearml-enterprise` Helm chart repo

## Installation

### Add the Helm Repo Locally

Add the ClearML Helm repository:
```bash
helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username --password
```

Update the repository locally:
```bash
helm repo update
```

### Create a Values Override File

Create a `clearml-agent-values.override.yaml` file with the following content:

:::note
Replace the `` and `` with the API credentials you generated earlier.
Set the `ServerUrlReference` fields to match your ClearML Server URLs.
:::

```yaml
imageCredentials:
  password: ""
clearml:
  agentk8sglueKey: ""
  agentk8sglueSecret: ""
agentk8sglue:
  apiServerUrlReference: ""
  fileServerUrlReference: ""
  webServerUrlReference: ""
  createQueues: true
  queues:
    exampleQueue:
      templateOverrides: {}
      queueSettings: {}
```

### Install the Chart

Install the ClearML Enterprise Agent Helm chart:

```bash
helm upgrade -i -n clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml
```

## Additional Configuration Options

To view the available configuration options for the Helm chart, run one of the following commands:

```bash
helm show readme clearml-enterprise/clearml-enterprise-agent
# or
helm show values clearml-enterprise/clearml-enterprise-agent
```

### Reporting GPU Availability to Orchestration Dashboard

To show GPU availability in the [Orchestration Dashboard](../../webapp/webapp_orchestration_dash.md), explicitly set the number of GPUs:

```yaml
agentk8sglue:
  # -- Maximum number of GPUs the Agent reports as available to the Dashboard. This example reports 2 GPUs.
  dashboardReportMaxGpu: 2
```

### Queues

The ClearML Agent monitors [ClearML queues](../../fundamentals/agents_and_queues.md) and pulls tasks that are
scheduled for execution.

A single agent can monitor multiple queues. By default, all queues share a base pod template (`agentk8sglue.basePodTemplate`),
which is used to launch a task on Kubernetes after it has been pulled from its queue.

Each queue can override the base pod template with its own settings via `templateOverrides`.
This way, queue definitions can be tailored to different use cases.

The following are a few examples of agent queue templates:

#### Example: GPU Queues

To support GPU queues, you must deploy the NVIDIA GPU Operator on your Kubernetes cluster. For more information, see [GPU Operator](extra_configs/gpu_operator.md).

```yaml
agentk8sglue:
  createQueues: true
  queues:
    1xGPU:
      templateOverrides:
        resources:
          limits:
            nvidia.com/gpu: 1
    2xGPU:
      templateOverrides:
        resources:
          limits:
            nvidia.com/gpu: 2
```

#### Example: Custom Pod Template per Queue

This example demonstrates how to override the base pod template definitions on a per-queue basis.
In this example:

- The `red` queue inherits both the `team=red` label and the 1Gi memory limit from the `basePodTemplate` section.
- The `blue` queue overrides the label by setting it to `team=blue`, and inherits the 1Gi memory limit from the `basePodTemplate` section.
- The `green` queue overrides the label by setting it to `team=green`, and overrides the memory limit by setting it to 2Gi.
  It also sets an annotation and an environment variable.

```yaml
agentk8sglue:
  # Defines the common template
  basePodTemplate:
    labels:
      team: red
    resources:
      limits:
        memory: 1Gi
  createQueues: true
  queues:
    red:
      # Does not override anything
      templateOverrides: {}
    blue:
      # Overrides the labels
      templateOverrides:
        labels:
          team: blue
    green:
      # Overrides the labels and resources, and sets new fields
      templateOverrides:
        labels:
          team: green
        annotations:
          example: "example value"
        resources:
          limits:
            memory: 2Gi
        env:
          - name: MY_ENV
            value: "my_value"
```

## Next Steps

Once the agent is up and running, proceed with deploying the [ClearML Enterprise App Gateway](appgw_install_k8s.md).
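The per-queue example above can be reasoned about as a recursive merge of `templateOverrides` onto `basePodTemplate`. The sketch below is only a mental model of that layering (the chart's actual merge logic may differ in details):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively layer override keys on top of a base mapping, without mutating either."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Base template and overrides mirroring the red/blue/green queues above
base = {"labels": {"team": "red"}, "resources": {"limits": {"memory": "1Gi"}}}
blue = {"labels": {"team": "blue"}}
green = {"labels": {"team": "green"}, "resources": {"limits": {"memory": "2Gi"}}}

print(deep_merge(base, blue))   # blue keeps the 1Gi limit; only the label changes
print(deep_merge(base, green))  # green replaces both the label and the memory limit
```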
diff --git a/docs/deploying_clearml/enterprise_deploy/extra_configs/dynamic_edit_task_pod_template.md b/docs/deploying_clearml/enterprise_deploy/extra_configs/dynamic_edit_task_pod_template.md
new file mode 100644
index 00000000..911ff411
--- /dev/null
+++ b/docs/deploying_clearml/enterprise_deploy/extra_configs/dynamic_edit_task_pod_template.md
@@ -0,0 +1,271 @@
---
title: Dynamically Edit Task Pod Template
---

ClearML Agent allows you to inject custom Python code that dynamically modifies the Kubernetes Pod template before it is applied.

## Agent Configuration

The `CLEARML_K8S_GLUE_TEMPLATE_MODULE` environment variable defines the Python module, and the function inside that
module, to be invoked by the agent before applying a task pod template.

The agent runs this code in its own context, passes arguments (including the actual template) to the function, and uses
the returned template to create the final task pod in Kubernetes.

Arguments passed to the function include:

* `queue` (string) - ID of the queue from which the task was pulled.
* `queue_name` (string) - Name of the queue from which the task was pulled.
* `template` (Python dictionary) - Base pod template created from the agent's configuration and any queue-specific overrides.
* `task_data` (object) - Task data object (as returned by the `tasks.get_by_id` API call). For example, use `task_data.project` to get the task's project ID.
* `providers_info` (dictionary) - Provider info containing optional information collected for the user running this task
  when the user logged into the system (requires additional server configuration).
* `task_config` (`clearml_agent.backend_config.Config` object) - Task configuration containing the configuration vaults applicable
  to the user running this task, as well as other configuration. Use `task_config.get("...")` to get specific configuration values.
* `worker` - The agent's Python object, in case custom calls are required.
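Conceptually, the agent resolves the `module:function` string from `CLEARML_K8S_GLUE_TEMPLATE_MODULE`, calls the function with keyword arguments, and reads back the returned `template`. The following self-contained sketch imitates that handshake with an in-process stand-in module (the module body and all argument values here are hypothetical illustrations, not the agent's actual internals):

```python
import importlib
import sys
import types

# Stand-in for the mounted custom_code.py module
custom_code = types.ModuleType("custom_code")

def update_template(queue, queue_name, template, task_data, providers_info, task_config, worker, *args, **kwargs):
    # Illustrative edit: tag the pod with the queue it was pulled from
    template.setdefault("metadata", {})["labels"] = {"queue": queue_name}
    return {"template": template}

custom_code.update_template = update_template
sys.modules["custom_code"] = custom_code

# Resolve "module:function" the way CLEARML_K8S_GLUE_TEMPLATE_MODULE describes it
module_name, func_name = "custom_code:update_template".split(":")
func = getattr(importlib.import_module(module_name), func_name)

# Keyword arguments only, per the note below
result = func(queue="queue-id", queue_name="default", template={"spec": {}},
              task_data=None, providers_info={}, task_config=None, worker=None)
print(result["template"])
```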
### Usage

Update `clearml-agent-values.override.yaml` to include:

```yaml
agentk8sglue:
  extraEnvs:
    - name: CLEARML_K8S_GLUE_TEMPLATE_MODULE
      value: "custom_code:update_template"
  fileMounts:
    - name: "custom_code.py"
      folderPath: "/root"
      fileContent: |-
        import json
        from pprint import pformat

        def update_template(queue, task_data, providers_info, template, task_config, worker, queue_name, *args, **kwargs):
            print(pformat(template))

            my_var_name = "foo"
            my_var_value = "bar"

            try:
                template["spec"]["containers"][0]["env"][0]["name"] = str(my_var_name)
                template["spec"]["containers"][0]["env"][0]["value"] = str(my_var_value)
            except KeyError as ex:
                raise Exception("Failed modifying template: {}".format(ex))

            return {"template": template}
```

:::note notes
* Always include `*args, **kwargs` at the end of the function's argument list, and only use keyword arguments.
  This is needed to maintain backward compatibility.

* Custom code modules can be included as a file in the pod's container, with the environment variable pointing to the
  file and its entry point.

* When a custom code module is defined, by default the agent starts watching pods in all namespaces
  across the cluster. If you do not intend to grant a `ClusterRole` permission, make sure to set the
  `CLEARML_K8S_GLUE_MONITOR_ALL_NAMESPACES` env to `"0"` to prevent the agent from trying to list pods in all namespaces.
  Only set it to `"1"` if the code needs to make namespace-related changes.

  ```yaml
  agentk8sglue:
    extraEnvs:
      - name: CLEARML_K8S_GLUE_MONITOR_ALL_NAMESPACES
        value: "0"
  ```
:::

To customize the bash startup scripts instead of the pod spec, use:

```yaml
agentk8sglue:
  # -- Custom Bash script for the Agent pod run by the Glue Agent
  customBashScript: ""
  # -- Custom Bash script for the Task Pods run by the Glue Agent
  containerCustomBashScript: ""
```

## Examples

### Example: Edit Template Based on ENV Var

```yaml
agentk8sglue:
  extraEnvs:
    - name: CLEARML_K8S_GLUE_TEMPLATE_MODULE
      value: "custom_code:update_template"
  fileMounts:
    - name: "custom_code.py"
      folderPath: "/root"
      fileContent: |-
        import json
        from pprint import pformat

        def update_template(queue, task_data, providers_info, template, task_config, worker, queue_name, *args, **kwargs):
            print(pformat(template))

            my_var = "some_var"

            try:
                template["spec"]["initContainers"][0]["command"][-1] = \
                    template["spec"]["initContainers"][0]["command"][-1].replace("MY_VAR", str(my_var))
                template["spec"]["containers"][0]["volumeMounts"][0]["subPath"] = str(my_var)
            except KeyError as ex:
                raise Exception("Failed modifying template with MY_VAR: {}".format(ex))

            return {"template": template}
  basePodTemplate:
    initContainers:
      - name: myInitContainer
        image: docker/ubuntu:18.04
        command:
          - /bin/bash
          - -c
          - >
            echo MY_VAR;
        volumeMounts:
          - name: myTemplatedMount
            mountPath: MY_VAR
    volumes:
      - name: myTemplatedMount
        emptyDir: {}
```

### Example: Inject NFS Mount Path

```yaml
agentk8sglue:
  extraEnvs:
    - name: CLEARML_K8S_GLUE_TEMPLATE_MODULE
      value: "custom_code:update_template"
  fileMounts:
    - name: "custom_code.py"
      folderPath: "/root"
      fileContent: |-
        import json
        from pprint import pformat

        def update_template(queue, task_data, providers_info, template, task_config, worker, queue_name, *args, **kwargs):
            nfs = task_config.get("nfs")
            # ad_role = providers_info.get("ad-role")
            if nfs:  # and ad_role == "some-value"
                print(pformat(template))

                try:
                    template["spec"]["containers"][0]["volumeMounts"].append(
                        {"name": "custom-mount", "mountPath": nfs.get("mountPath")}
                    )
                    # Note: volumes live on the pod spec, not on the container
                    template["spec"]["volumes"].append(
                        {"name": "custom-mount", "nfs": {"server": nfs.get("server.ip"), "path": nfs.get("server.path")}}
                    )
                except KeyError as ex:
                    raise Exception("Failed modifying template: {}".format(ex))

            return {"template": template}
```

### Example: Bind PVC Resource to Task Pod

In this example, a PVC is created and attached to every pod created from a dedicated queue, and is deleted afterwards.

Key points:

- The `CLEARML_K8S_GLUE_POD_PRE_APPLY_CMD` and `CLEARML_K8S_GLUE_POD_POST_DELETE_CMD` env vars let you define custom bash
  code hooks to be executed around the main apply command by the agent, such as creating and deleting a PVC object.

- The `CLEARML_K8S_GLUE_TEMPLATE_MODULE` env var and a file mount let you define custom Python code in a specific context,
  useful for dynamically updating the main pod template before the agent applies it.

:::note notes
* This example uses a queue named `pvc-test`; make sure to replace all occurrences of it.

* `CLEARML_K8S_GLUE_POD_PRE_APPLY_CMD` can reference the templated vars `{queue_name}`, `{pod_name}`, and `{namespace}`, which
  get replaced with the actual values by the agent at execution time.
:::

```yaml
agentk8sglue:
  # Bind a pre-defined custom 'custom-agent-role' Role with the ability to handle 'persistentvolumeclaims'
  additionalRoleBindings:
    - custom-agent-role
  extraEnvs:
    # Needed unless the agent has permissions to list pods in all namespaces
    - name: CLEARML_K8S_GLUE_MONITOR_ALL_NAMESPACES
      value: "0"
    # Executed before applying the Task Pod. Replace the $PVC_NAME placeholder in the manifest template with the Pod name and apply it, only in a specific queue.
+ - name: CLEARML_K8S_GLUE_POD_PRE_APPLY_CMD + value: "[[ {queue_name} == 'pvc-test' ]] && sed 's/\\$PVC_NAME/{pod_name}/g' /mnt/yaml-manifests/pvc.yaml | kubectl apply -n {namespace} -f - || echo 'Skipping PRE_APPLY PVC creation...'" + # Executed after deleting the Task Pod. Delete the PVC. + - name: CLEARML_K8S_GLUE_POD_POST_DELETE_CMD + value: "kubectl delete pvc {pod_name} -n {namespace} || echo 'Skipping POST_DELETE PVC deletion...'" + # Define a custom code module for updating the Pod template + - name: CLEARML_K8S_GLUE_TEMPLATE_MODULE + value: "custom_code:update_template" + fileMounts: + # Mount a PVC manifest file with a templated $PVC_NAME name + - name: "pvc.yaml" + folderPath: "/mnt/yaml-manifests" + fileContent: | + apiVersion: v1 + kind: PersistentVolumeClaim + metadata: + name: $PVC_NAME + spec: + resources: + requests: + storage: 5Gi + volumeMode: Filesystem + storageClassName: standard + accessModes: + - ReadWriteOnce + # Custom code module for updating the Pod template + - name: "custom_code.py" + folderPath: "/root" + fileContent: |- + import json + from pprint import pformat + def update_template(queue, task_data, providers_info, template, task_config, worker, queue_name, *args, **kwargs): + if queue_name == "pvc-test": + # Set PVC_NAME as the name of the Pod + PVC_NAME = f"clearml-id-{task_data.id}" + try: + # Replace the claimName placeholder with a dynamic value + template["spec"]["volumes"][0]["persistentVolumeClaim"]["claimName"] = str(PVC_NAME) + except KeyError as ex: + raise Exception("Failed modifying template with PVC_NAME: {}".format(ex)) + # Return the edited template + return {"template": template} + createQueues: true + queues: + # Define a queue with an override `volumes` and `volumeMounts` section for binding a PVC + pvc-test: + templateOverrides: + volumes: + - name: task-pvc + persistentVolumeClaim: + # PVC_NAME placeholder. This will get replaced in the custom code module. 
              claimName: PVC_NAME
        volumeMounts:
          - mountPath: "/tmp/task/"
            name: task-pvc
```

### Example: Required Role

The following is an example of a `custom-agent-role` Role with permissions to handle `persistentvolumeclaims`:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: custom-agent-role
rules:
- apiGroups:
  - ""
  resources:
  - persistentvolumeclaims
  verbs:
  - get
  - list
  - watch
  - create
  - patch
  - delete
```
\ No newline at end of file
diff --git a/docs/deploying_clearml/enterprise_deploy/extra_configs/gpu_operator.md b/docs/deploying_clearml/enterprise_deploy/extra_configs/gpu_operator.md
new file mode 100644
index 00000000..97b59e75
--- /dev/null
+++ b/docs/deploying_clearml/enterprise_deploy/extra_configs/gpu_operator.md
@@ -0,0 +1,61 @@
---
title: Basic Deployment - Suggested GPU Operator Values
---

This guide provides recommended configuration values for deploying the NVIDIA GPU Operator alongside ClearML Enterprise.

## Add the Helm Repo Locally

Add the NVIDIA GPU Operator Helm repository:

```bash
helm repo add nvidia https://nvidia.github.io/gpu-operator
```

Update the repository locally:
```bash
helm repo update
```

## Installation

To prevent unprivileged containers from bypassing the Kubernetes Device Plugin API, configure the GPU Operator
using the following `gpu-operator.override.yaml` file:

```yaml
toolkit:
  env:
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
      value: "false"
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
      value: "true"
devicePlugin:
  env:
    - name: PASS_DEVICE_SPECS
      value: "true"
    - name: FAIL_ON_INIT_ERROR
      value: "true"
    - name: DEVICE_LIST_STRATEGY # Use volume-mounts
      value: volume-mounts
    - name: DEVICE_ID_STRATEGY
      value: uuid
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
```

Install the `gpu-operator`:

```bash
helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
```

## Fractional GPU Support

To enable fractional GPU allocation or manage mixed GPU configurations, refer to the following guides:
* [ClearML Dynamic MIG Operator](../fractional_gpus/cdmo.md) (CDMO) - Dynamically configures MIG GPUs on supported devices.
* [ClearML Enterprise Fractional GPU Injector](../fractional_gpus/cfgi.md) (CFGI) - Enables fractional (non-MIG) GPU
  allocation for better hardware utilization and workload distribution in Kubernetes.
* [CDMO and CFGI on the same Cluster](../fractional_gpus/cdmo_cfgi_same_cluster.md) - In clusters with multiple nodes and
  varying GPU types, the `gpu-operator` can be used to manage different device configurations and fractioning modes.
\ No newline at end of file
diff --git a/docs/deploying_clearml/enterprise_deploy/extra_configs/multi_node_training.md b/docs/deploying_clearml/enterprise_deploy/extra_configs/multi_node_training.md
new file mode 100644
index 00000000..026a39d6
--- /dev/null
+++ b/docs/deploying_clearml/enterprise_deploy/extra_configs/multi_node_training.md
@@ -0,0 +1,27 @@
---
title: Multi-Node Training
---

The ClearML Enterprise Agent supports horizontal multi-node training, allowing a single Task to run across multiple pods
on different nodes.

Below is a configuration example using `clearml-agent-values.override.yaml`:

```yaml
agentk8sglue:
  # Cluster access is required to run multi-node Tasks
  serviceAccountClusterAccess: true
  multiNode:
    enabled: true
  createQueues: true
  queues:
    multi-node-example:
      queueSettings:
        # Defines the distribution of Task GPUs across multiple nodes. The format [x, y, ...] specifies 'x' GPUs
        # on one node and 'y' GPUs on another. Multiple Pods are spawned accordingly, based on the lowest common
        # denominator of the defined distribution.
        multiNode: [ 4, 2 ]
      templateOverrides:
        resources:
          limits:
            # Note: use the lowest common denominator of the GPU distribution defined in `queueSettings.multiNode`.
            nvidia.com/gpu: 2
```
\ No newline at end of file
diff --git a/docs/deploying_clearml/enterprise_deploy/extra_configs/presign_service.md b/docs/deploying_clearml/enterprise_deploy/extra_configs/presign_service.md
new file mode 100644
index 00000000..80a57359
--- /dev/null
+++ b/docs/deploying_clearml/enterprise_deploy/extra_configs/presign_service.md
@@ -0,0 +1,72 @@
---
title: ClearML Presign Service
---

The ClearML Presign Service is a secure service that generates and redirects pre-signed storage URLs for authenticated
users, enabling direct access to cloud-hosted data (e.g., S3) without exposing credentials.

## Prerequisites

- A ClearML Enterprise Server up and running.
+- API credentials (`` and ``) generated via + the ClearML UI (**Settings > Workspace > API Credentials > Create new credentials**). For more information, see [ClearML API Credentials](../../../webapp/settings/webapp_settings_profile.md#clearml-api-credentials). + + :::note + Make sure these credentials belong to an admin user or a service user with admin privileges. + ::: + +- The worker environment must be able to access the ClearML Server over the same network. +- Token to access `clearml-enterprise` Helm chart repo + +## Installation + +### Add the Helm Repo Locally + +Add the ClearML Helm repository: +```bash +helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username --password +``` + +Update the repository locally: +```bash +helm repo update +``` + +### Prepare Configuration + +Create a `presign-service.override.yaml` file (make sure to replace the placeholders): + +```yaml +imageCredentials: + password: "" +clearml: + apiServerUrlReference: "" + apiKey: "" + apiSecret: "" +ingress: + enabled: true + hostName: "" +``` + +### Deploy the Helm Chart + +Install the `clearml-presign-service` Helm chart in the same namespace as the ClearML Enterprise server: + +```bash +helm install -n clearml clearml-presign-service clearml-enterprise/clearml-presign-service -f presign-service.override.yaml +``` + +### Update ClearML Enterprise Server Configuration + +Enable the ClearML Server to use the Presign Service by editing your `clearml-values.override.yaml` file. +Add the following to the `apiserver.extraEnvs` section (make sure to replace ``). + +```yaml +apiserver: + extraEnvs: + - name: CLEARML__SERVICES__SYSTEM__COMPANY__DEFAULT__SERVICES + value: "[{\"type\":\"presign\",\"url\":\"https://\",\"use_fallback\":\"false\",\"match_sets\":[{\"rules\":[{\"field\":\"\",\"obj_type\":\"\",\"regex\":\"^s3://\"}]}]}]" +``` + +Apply the changes with a Helm upgrade. 
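The `value` string above is a JSON document with escaped quotes, which is hard to review by eye. A quick way to sanity-check edits before pasting them back is to decode it (plain standard-library JSON handling; nothing here is ClearML-specific):

```python
import json

# The presign service entry from the env var, after shell/YAML unescaping
value = (
    '[{"type":"presign","url":"https://","use_fallback":"false",'
    '"match_sets":[{"rules":[{"field":"","obj_type":"","regex":"^s3://"}]}]}]'
)
services = json.loads(value)
print(json.dumps(services, indent=2))  # pretty-printed structure of the presign service entry
```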
diff --git a/docs/deploying_clearml/enterprise_deploy/extra_configs/self_signed_certificates.md b/docs/deploying_clearml/enterprise_deploy/extra_configs/self_signed_certificates.md
new file mode 100644
index 00000000..741f5636
--- /dev/null
+++ b/docs/deploying_clearml/enterprise_deploy/extra_configs/self_signed_certificates.md
@@ -0,0 +1,201 @@
---
title: ClearML Tenant with Self-Signed Certificates
---

This guide covers how to configure the [AI Application Gateway](#ai-application-gateway) and [ClearML Agent](#clearml-agent)
to use self-signed or custom SSL certificates.

## AI Application Gateway

To configure certificates for the Application Gateway, update your `clearml-app-gateway-values.override.yaml` file:

```yaml
# -- Custom certificates
customCertificates:
  # -- Override system crt certificate bundle. Mutual exclusive with extraCerts.
  overrideCaCertificatesCrt:
  # -- Extra certs usable in case of needs of adding more certificates to the standard bundle, Requires root permissions to run update-ca-certificates. Mutual exclusive with overrideCaCertificatesCrt.
  extraCerts:
    - alias: certificateName
      pem: |
        -----BEGIN CERTIFICATE-----
        ###
        -----END CERTIFICATE-----
```

You have two configuration options:

- [**Replace**](#replace-entire-ca-certificatescrt-file) the entire `ca-certificates.crt` file
- [**Append**](#append-extra-certificates-to-the-existing-ca-certificatescrt) extra certificates to the existing `ca-certificates.crt`

### Replace Entire `ca-certificates.crt` File

To replace the whole CA bundle, provide a concatenated list of all trusted CA certificates in `pem` format, as
they are stored in a standard `ca-certificates.crt`:

```yaml
# -- Custom certificates
customCertificates:
  # -- Override system crt certificate bundle. Mutual exclusive with extraCerts.
+ overrideCaCertificatesCrt: | + -----BEGIN CERTIFICATE----- + ### CERT 1 + -----END CERTIFICATE----- + -----BEGIN CERTIFICATE----- + ### CERT 2 + -----END CERTIFICATE----- + -----BEGIN CERTIFICATE----- + ### CERT 3 + -----END CERTIFICATE----- + ... +``` + +### Append Extra Certificates to the Existing `ca-certificates.crt` + +You can add certificates to the existing CA bundle. Each certificate must have a unique `alias`. + +```yaml +# -- Custom certificates +customCertificates: + # -- Extra certs usable in case of needs of adding more certificates to the standard bundle, Requires root permissions to run update-ca-certificates. Mutual exclusive with overrideCaCertificatesCrt. + extraCerts: + - alias: certificate-name-1 + pem: | + -----BEGIN CERTIFICATE----- + ### + -----END CERTIFICATE----- + - alias: certificate-name-2 + pem: | + -----BEGIN CERTIFICATE----- + ### + -----END CERTIFICATE----- +``` + +### Apply Changes + +To apply the changes, run the update command: + +```bash +helm upgrade -i -n clearml-enterprise/clearml-enterprise-app-gateway --version -f clearml-app-gateway-values.override.yaml +``` + +## ClearML Agent + +For the ClearML Agent, configure certificates in the `clearml-agent-values.override.yaml` file: + +```yaml +# -- Custom certificates +customCertificates: + # -- Override system crt certificate bundle. Mutual exclusive with extraCerts. + overrideCaCertificatesCrt: + # -- Extra certs usable in case of needs of adding more certificates to the standard bundle, Requires root permissions to run update-ca-certificates. Mutual exclusive with overrideCaCertificatesCrt. 
+ extraCerts: + - alias: certificateName + pem: | + -----BEGIN CERTIFICATE----- + ### + -----END CERTIFICATE----- +``` + +You have two configuration options: + +- [**Replace**](#replace-entire-ca-certificatescrt-file-1) the entire `ca-certificates.crt` file +- [**Append**](#append-extra-certificates-to-the-existing-ca-certificatescrt-1) extra certificates to the existing `ca-certificates.crt` + + +### Replace Entire `ca-certificates.crt` File + +To replace the whole ca-bundle, provide a concatenated list of all trusted CA certificates in `pem` format as +they are stored in a standard `ca-certificates.crt`. + + +```yaml +# -- Custom certificates +customCertificates: + # -- Override system crt certificate bundle. Mutual exclusive with extraCerts. + overrideCaCertificatesCrt: | + -----BEGIN CERTIFICATE----- + ### CERT 1 + -----END CERTIFICATE----- + -----BEGIN CERTIFICATE----- + ### CERT 2 + -----END CERTIFICATE----- + -----BEGIN CERTIFICATE----- + ### CERT 3 + -----END CERTIFICATE----- + ... +``` + +### Append Extra Certificates to the Existing `ca-certificates.crt` + +You can add certificates to the existing CA bundle. Each certificate must have a unique `alias`. + +```yaml +# -- Custom certificates +customCertificates: + # -- Extra certs usable in case of needs of adding more certificates to the standard bundle, Requires root permissions to run update-ca-certificates. Mutual exclusive with overrideCaCertificatesCrt. 
  extraCerts:
    - alias: certificate-name-1
      pem: |
        -----BEGIN CERTIFICATE-----
        ###
        -----END CERTIFICATE-----
    - alias: certificate-name-2
      pem: |
        -----BEGIN CERTIFICATE-----
        ###
        -----END CERTIFICATE-----
```

### Add Certificates to Task Pods

If your workloads need access to these certificates (e.g., for HTTPS requests), configure the agent to inject them into task pods:

```yaml
agentk8sglue:
  basePodTemplate:
    initContainers:
      - command:
          - /bin/sh
          - -c
          - update-ca-certificates
        image: allegroai/clearml-enterprise-agent-k8s-base:
        imagePullPolicy: IfNotPresent
        name: init-task
        volumeMounts:
          - name: etc-ssl-certs
            mountPath: "/etc/ssl/certs"
          - name: clearml-extra-ca-certs
            mountPath: "/usr/local/share/ca-certificates"
    env:
      - name: REQUESTS_CA_BUNDLE
        value: /etc/ssl/certs/ca-certificates.crt
    volumeMounts:
      - name: etc-ssl-certs
        mountPath: "/etc/ssl/certs"
    volumes:
      - name: etc-ssl-certs
        emptyDir: {}
      - name: clearml-extra-ca-certs
        projected:
          defaultMode: 420
          sources:
            # List here the ConfigMaps created by the agent chart; the cardinality depends on the number of certs provided.
            - configMap:
                name: clearml-agent-clearml-enterprise-agent-custom-ca-0
            - configMap:
                name: clearml-agent-clearml-enterprise-agent-custom-ca-1
```

The `clearml-extra-ca-certs` volume must include all `ConfigMap` resources generated by the agent chart for the custom certificates.
These `ConfigMaps` are automatically created by the Helm chart based on the number of certificates provided.
Their names are usually prefixed with the Helm release name, so adjust them accordingly if you used a custom release name.

### Apply Changes

Apply the changes by running the update command:

```bash
helm upgrade -i -n clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml
```
\ No newline at end of file
diff --git a/docs/deploying_clearml/enterprise_deploy/extra_configs/sso_login.md b/docs/deploying_clearml/enterprise_deploy/extra_configs/sso_login.md
new file mode 100644
index 00000000..6e9179e5
--- /dev/null
+++ b/docs/deploying_clearml/enterprise_deploy/extra_configs/sso_login.md
@@ -0,0 +1,73 @@
---
title: SSO (Identity Provider) Setup
---

ClearML Enterprise Server supports various Single Sign-On (SSO) identity providers.
SSO configuration is managed via environment variables in your `clearml-values.override.yaml` file and is applied to the
`apiserver` component.

The following are configuration examples for commonly used providers. Supported systems include:
* Auth0
* Keycloak
* Okta
* Azure AD
* Google
* AWS Cognito

## Auth0

```yaml
apiserver:
  extraEnvs:
    - name: CLEARML__secure__login__sso__oauth_client__auth0__client_id
      value: ""
    - name: CLEARML__secure__login__sso__oauth_client__auth0__client_secret
      value: ""
    - name: CLEARML__services__login__sso__oauth_client__auth0__base_url
      value: ""
    - name: CLEARML__services__login__sso__oauth_client__auth0__authorize_url
      value: ""
    - name: CLEARML__services__login__sso__oauth_client__auth0__access_token_url
      value: ""
    - name: CLEARML__services__login__sso__oauth_client__auth0__audience
      value: ""
```

## Keycloak

```yaml
apiserver:
  extraEnvs:
    - name: CLEARML__secure__login__sso__oauth_client__keycloak__client_id
      value: ""
    - name: CLEARML__secure__login__sso__oauth_client__keycloak__client_secret
      value: ""
    - name: CLEARML__services__login__sso__oauth_client__keycloak__base_url
      value: "/realms//"
    - name: CLEARML__services__login__sso__oauth_client__keycloak__authorize_url
      value: "/realms//protocol/openid-connect/auth"
    - name: CLEARML__services__login__sso__oauth_client__keycloak__access_token_url
      value: "/realms//protocol/openid-connect/token"
    - name: CLEARML__services__login__sso__oauth_client__keycloak__idp_logout
      value: "true"
```

## Group Membership Mapping in Keycloak

To map Keycloak groups into the ClearML user's SSO token:

1. Go to the **Client Scopes** tab.
1. Click the `-dedicated` scope.
1. Click **Add Mapper > By Configuration > Group Membership**.
1. Configure the mapper:
   * Set the **Name** to "groups".
   * Set the **Token Claim Name** to "groups".
   * Uncheck **Full group path**.
   * Save the mapper.

To verify:

1. Go to the **Client Details > Client scope** tab.
1. Go to the **Evaluate** sub-tab and select a user with any group memberships.
1. Go to **Generated ID Token** and then to **Generated User Info**.
1. Verify that the groups claim appears in the displayed user data in both cases.

diff --git a/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cdmo.md b/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cdmo.md
new file mode 100644
index 00000000..138792fe
--- /dev/null
+++ b/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cdmo.md
@@ -0,0 +1,132 @@
---
title: ClearML Dynamic MIG Operator (CDMO)
---

The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG (Multi-Instance GPU) configurations.
+ +## Installation + +### Requirements + +* Add and update the Nvidia Helm repo: + + ```bash + helm repo add nvidia https://nvidia.github.io/gpu-operator + helm repo update + ``` + +* Create a `gpu-operator.override.yaml` file with the following content: + + ```yaml + migManager: + enabled: false + mig: + strategy: mixed + toolkit: + env: + - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED + value: "false" + - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS + value: "true" + devicePlugin: + env: + - name: PASS_DEVICE_SPECS + value: "true" + - name: FAIL_ON_INIT_ERROR + value: "true" + - name: DEVICE_LIST_STRATEGY # Use volume-mounts + value: volume-mounts + - name: DEVICE_ID_STRATEGY + value: uuid + - name: NVIDIA_VISIBLE_DEVICES + value: all + - name: NVIDIA_DRIVER_CAPABILITIES + value: all + ``` + +* Install the NVIDIA `gpu-operator` using Helm with the previous configuration: + + ```bash + helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml + ``` + +### Installing CDMO + +1. Create a `cdmo-values.override.yaml` file with the following content: + + ```yaml + imageCredentials: + password: "" + ``` + +1. Install the CDMO Helm Chart using the previous override file: + + ```bash + helm install -n cdmo cdmo clearml-enterprise/clearml-dynamic-mig-operator --create-namespace -f cdmo-values.override.yaml + ``` + +1. Enable the NVIDIA MIG support on your cluster by running the following command on all nodes with a MIG-supported GPU + (run it for each GPU `` on the host): + + ```bash + nvidia-smi -mig 1 + ``` + + :::note notes + * A node reboot may be required if the command output indicates so. + + * For convenience, this command can be run from within the `nvidia-device-plugin-daemonset` pod running on the related node. + ::: + +1. 
Label each MIG-enabled GPU node `<NODE_NAME>` from the previous step: + + ```bash + kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning=mig" + ``` + +## Disabling MIG + +To disable MIG mode and restore standard full-GPU access: + +1. Ensure no running workflows are using GPUs on the target node(s). + +2. Remove the CDMO label from the target node(s) to disable the dynamic MIG reconfiguration: + + ```bash + kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning-" + ``` + +3. Open a shell in the `device-plugin-daemonset` pod instance running on the target node(s) and run the following commands: + + ```bash + # Destroy all compute instances + nvidia-smi mig -dci + + # Destroy all GPU instances + nvidia-smi mig -dgi + + # Disable MIG mode + nvidia-smi -mig 0 + ``` + +4. Edit the `gpu-operator.override.yaml` file to restore full-GPU access, and upgrade the `gpu-operator`: + + ```yaml + toolkit: + env: + - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED + value: "false" + - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS + value: "true" + devicePlugin: + env: + - name: PASS_DEVICE_SPECS + value: "true" + - name: FAIL_ON_INIT_ERROR + value: "true" + - name: DEVICE_LIST_STRATEGY # Use volume-mounts + value: volume-mounts + - name: DEVICE_ID_STRATEGY + value: uuid + - name: NVIDIA_VISIBLE_DEVICES + value: all + - name: NVIDIA_DRIVER_CAPABILITIES + value: all + ``` \ No newline at end of file diff --git a/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cdmo_cfgi_same_cluster.md b/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cdmo_cfgi_same_cluster.md new file mode 100644 index 00000000..db61dd7f --- /dev/null +++ b/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cdmo_cfgi_same_cluster.md @@ -0,0 +1,123 @@ +--- +title: Install CDMO and CFGI on the Same Cluster +--- + +You can install both CDMO (ClearML Dynamic MIG Operator) and CFGI (ClearML Fractional GPU Injector) on a shared Kubernetes cluster.
+In clusters with multiple nodes and varying GPU types, the `gpu-operator` can be used to manage different device configurations +and fractioning modes. + +## Configuring the NVIDIA GPU Operator + +The NVIDIA `gpu-operator` supports defining multiple configurations for the Device Plugin. + +The following example YAML defines two configurations: "mig" and "ts" (time-slicing). + +```yaml +migManager: + enabled: false +mig: + strategy: mixed +toolkit: + env: + - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED + value: "false" + - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS + value: "true" +devicePlugin: + enabled: true + env: + - name: PASS_DEVICE_SPECS + value: "true" + - name: FAIL_ON_INIT_ERROR + value: "true" + - name: DEVICE_LIST_STRATEGY + value: volume-mounts + - name: DEVICE_ID_STRATEGY + value: uuid + - name: NVIDIA_VISIBLE_DEVICES + value: all + - name: NVIDIA_DRIVER_CAPABILITIES + value: all + config: + name: device-plugin-config + create: true + default: "all-disabled" + data: + all-disabled: |- + version: v1 + flags: + migStrategy: none + ts: |- + version: v1 + flags: + migStrategy: none + sharing: + timeSlicing: + renameByDefault: false + failRequestsGreaterThanOne: false + # Edit the following configuration as needed, adding one entry per GPU index installed on the host.
+ resources: + - name: nvidia.com/gpu + rename: nvidia.com/gpu-0 + devices: + - "0" + replicas: 8 + mig: |- + version: v1 + flags: + migStrategy: mixed +``` + +## Applying Configuration to Nodes + +Label each Kubernetes node to activate a specific GPU mode: + +|Mode| Label command| +|----|-----| +| `mig` | `kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=mig` | +| `ts` (time slicing) | `kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=ts` | +| Standard full-GPU access | `kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=all-disabled` | +
+After a node is labeled, the NVIDIA `device-plugin` will automatically reload the new configuration. + +## Installing CDMO and CFGI + +After configuring the NVIDIA `gpu-operator` and labeling nodes, proceed with the standard installations of [CDMO](cdmo.md) +and [CFGI](cfgi.md). + +## Disabling Configurations + +### Time Slicing + +To disable time-slicing and restore full GPU access, update the node label using the `--overwrite` flag: + +```bash +kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=all-disabled --overwrite +``` + +### MIG + +To disable MIG mode: + +1. Ensure no running workflows are requesting GPUs on the target node(s). +2. Remove the CDMO label from the target node(s): + + ```bash + kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning-" + ``` + +3. Open a shell in the `device-plugin-daemonset` pod instance running on the target node(s) and run the following commands: + + ```bash + # Destroy all compute instances + nvidia-smi mig -dci + + # Destroy all GPU instances + nvidia-smi mig -dgi + + # Disable MIG mode + nvidia-smi -mig 0 + ``` + +4. 
Label the node to use standard (non-MIG) GPU mode: + + ```bash + kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=all-disabled --overwrite + ``` \ No newline at end of file diff --git a/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cfgi.md b/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cfgi.md new file mode 100644 index 00000000..749c02d6 --- /dev/null +++ b/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cfgi.md @@ -0,0 +1,305 @@ +--- +title: ClearML Fractional GPU Injector (CFGI) +--- + +The **ClearML Enterprise Fractional GPU Injector** (CFGI) allows AI workloads to utilize fractional (non-MIG) GPU slices +on Kubernetes clusters, maximizing hardware efficiency and performance. + +## Installation + +### Add the Local ClearML Helm Repository + +```bash +helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <HELM_REPO_USERNAME> --password <HELM_REPO_TOKEN> +helm repo update +``` + +### Requirements + +* Install the NVIDIA `gpu-operator` using Helm +* Set the number of GPU slices to 8 +* Add and update the NVIDIA Helm repo: + + ```bash + helm repo add nvidia https://nvidia.github.io/gpu-operator + helm repo update + ``` + +* Credentials for the ClearML Enterprise DockerHub repository + +### GPU Operator Configuration + +#### For CFGI Version >= 1.3.0 + +1. Create a Docker Registry secret named `clearml-dockerhub-access` in the `gpu-operator` namespace. Make sure to replace `<CLEARML_DOCKERHUB_TOKEN>` with your token. + + ```bash + kubectl create secret -n gpu-operator docker-registry clearml-dockerhub-access \ + --docker-server=docker.io \ + --docker-username=allegroaienterprise \ + --docker-password="<CLEARML_DOCKERHUB_TOKEN>" \ + --docker-email="" + ``` + +1. 
Create a `gpu-operator.override.yaml` file as follows: + * Set `devicePlugin.repository` to `docker.io/clearml` + * Configure `devicePlugin.config.data.renamed-resources.sharing.timeSlicing.resources` for each GPU index on the host + * Use the `nvidia.com/gpu-<INDEX>` format for the `rename` field, and set `replicas` to `8`. + +```yaml +gfd: + imagePullSecrets: + - "clearml-dockerhub-access" +toolkit: + env: + - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED + value: "false" + - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS + value: "true" +devicePlugin: + repository: docker.io/clearml + image: k8s-device-plugin + version: v0.17.1-gpu-card-selection + imagePullPolicy: Always + imagePullSecrets: + - "clearml-dockerhub-access" + env: + - name: PASS_DEVICE_SPECS + value: "true" + - name: FAIL_ON_INIT_ERROR + value: "true" + - name: DEVICE_LIST_STRATEGY # Use volume-mounts + value: volume-mounts + - name: DEVICE_ID_STRATEGY + value: uuid + - name: NVIDIA_VISIBLE_DEVICES + value: all + - name: NVIDIA_DRIVER_CAPABILITIES + value: all + config: + name: device-plugin-config + create: true + default: "renamed-resources" + data: + renamed-resources: |- + version: v1 + flags: + migStrategy: none + sharing: + timeSlicing: + renameByDefault: false + failRequestsGreaterThanOne: false + # Edit the following configuration as needed, adding one entry per GPU index installed on the host.
+ resources: + - name: nvidia.com/gpu + rename: nvidia.com/gpu-0 + devices: + - "0" + replicas: 8 + - name: nvidia.com/gpu + rename: nvidia.com/gpu-1 + devices: + - "1" + replicas: 8 +``` + +#### For CFGI Version < 1.3.0 (Legacy) + +Create a `gpu-operator.override.yaml` file: + +```yaml +toolkit: + env: + - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED + value: "false" + - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS + value: "true" +devicePlugin: + env: + - name: PASS_DEVICE_SPECS + value: "true" + - name: FAIL_ON_INIT_ERROR + value: "true" + - name: DEVICE_LIST_STRATEGY # Use volume-mounts + value: volume-mounts + - name: DEVICE_ID_STRATEGY + value: uuid + - name: NVIDIA_VISIBLE_DEVICES + value: all + - name: NVIDIA_DRIVER_CAPABILITIES + value: all + config: + name: device-plugin-config + create: true + default: "any" + data: + any: |- + version: v1 + flags: + migStrategy: none + sharing: + timeSlicing: + renameByDefault: false + failRequestsGreaterThanOne: false + resources: + - name: nvidia.com/gpu + replicas: 8 +``` + +### Install GPU Operator and CFGI + +1. Install the NVIDIA `gpu-operator` using the previously created `gpu-operator.override.yaml` file: + + ```bash + helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml + ``` + +1. Create a `cfgi-values.override.yaml` file with the following content (replace `<CLEARML_DOCKERHUB_TOKEN>` with your token): + + ```yaml + imageCredentials: + password: "<CLEARML_DOCKERHUB_TOKEN>" + ``` + +1. 
Install the CFGI Helm chart using the previous override file: + + ```bash + helm install -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector --create-namespace -f cfgi-values.override.yaml + ``` + +## Usage + +To use fractional GPUs, label your pod with: + +```yaml +labels: + clearml-injector/fraction: "<GPU_FRACTION>" +``` + +Valid values for `<GPU_FRACTION>` include: + +* Fractions: + * "0.0625" (1/16th) + * "0.125" (1/8th) + * "0.250" + * "0.375" + * "0.500" + * "0.625" + * "0.750" + * "0.875" +* Whole-GPU values such as `1.000`, `2`, `2.0`, etc. + +### ClearML Agent Configuration + +To run ClearML jobs with fractional GPU allocation, configure your queues accordingly in your `clearml-agent-values.override.yaml` file. + +Each queue should include a `templateOverrides` entry that sets the `clearml-injector/fraction` label, which determines the +fraction of a GPU to allocate (e.g., "0.500" for half a GPU). + +This label is used by CFGI to assign the correct portion of GPU resources to the pod running the task. + +#### CFGI Version >= 1.3.0 + +Starting from version 1.3.0, there is no need to specify the `resources` field.
You only need to set the labels: + + +``` yaml +agentk8sglue: + createQueues: true + queues: + gpu-fraction-1_000: + templateOverrides: + labels: + clearml-injector/fraction: "1.000" + gpu-fraction-0_500: + templateOverrides: + labels: + clearml-injector/fraction: "0.500" + gpu-fraction-0_250: + templateOverrides: + labels: + clearml-injector/fraction: "0.250" + gpu-fraction-0_125: + templateOverrides: + labels: + clearml-injector/fraction: "0.125" +``` + +#### CFGI Version < 1.3.0 + +For versions older than 1.3.0, the GPU limits must be defined: + +```yaml +agentk8sglue: + createQueues: true + queues: + gpu-fraction-1_000: + templateOverrides: + resources: + limits: + nvidia.com/gpu: 8 + gpu-fraction-0_500: + templateOverrides: + labels: + clearml-injector/fraction: "0.500" + resources: + limits: + nvidia.com/gpu: 4 + gpu-fraction-0_250: + templateOverrides: + labels: + clearml-injector/fraction: "0.250" + resources: + limits: + nvidia.com/gpu: 2 + gpu-fraction-0_125: + templateOverrides: + labels: + clearml-injector/fraction: "0.125" + resources: + limits: + nvidia.com/gpu: 1 +``` + +## Upgrading CFGI Chart + +To upgrade to the latest chart version: + +```bash +helm repo update +helm upgrade -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector +``` + +To apply new values to an existing installation: + +```bash +helm upgrade -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector -f cfgi-values.override.yaml +``` + +## Disabling Fractions + +To revert to standard GPU scheduling (without time slicing), remove the `devicePlugin.config` section from the `gpu-operator.override.yaml` +file and upgrade the `gpu-operator`: + +```yaml +toolkit: + env: + - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED + value: "false" + - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS + value: "true" +devicePlugin: + env: + - name: PASS_DEVICE_SPECS + value: "true" + - name: FAIL_ON_INIT_ERROR + value: "true" + - name: DEVICE_LIST_STRATEGY # Use 
volume-mounts + value: volume-mounts + - name: DEVICE_ID_STRATEGY + value: uuid + - name: NVIDIA_VISIBLE_DEVICES + value: all + - name: NVIDIA_DRIVER_CAPABILITIES + value: all +``` \ No newline at end of file
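
### Example: Applying the Fraction Label to a Pod

As an illustration of the labeling described in the Usage section above, the following is a minimal sketch of a standalone pod requesting half a GPU. The pod name and image are hypothetical, not part of the chart; ClearML-managed tasks receive the same label automatically via the queue `templateOverrides` shown earlier.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-fraction-test            # hypothetical pod name
  labels:
    # CFGI reads this label and injects the corresponding GPU slice
    clearml-injector/fraction: "0.500"
spec:
  restartPolicy: Never
  containers:
    - name: smoke-test
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
```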