diff --git a/docs/deploying_clearml/enterprise_deploy/agent_k8s.md b/docs/deploying_clearml/enterprise_deploy/agent_k8s.md index e2c2c4dc..02a38de7 100644 --- a/docs/deploying_clearml/enterprise_deploy/agent_k8s.md +++ b/docs/deploying_clearml/enterprise_deploy/agent_k8s.md @@ -11,7 +11,7 @@ The ClearML Agent enables scheduling and executing distributed experiments on a the ClearML UI (**Settings > Workspace > API Credentials > Create new credentials**). For more information, see [ClearML API Credentials](../../webapp/settings/webapp_settings_profile.md#clearml-api-credentials). :::note - Make sure these credentials belong to an admin user or a service user with admin privileges. + Make sure these credentials belong to an admin user or a service account with admin privileges. ::: - The worker environment must be able to access the ClearML Server over the same network. @@ -26,7 +26,7 @@ Add the ClearML Helm repository: helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username --password ``` -Update the repository locally: +Update the local repository: ```bash helm repo update ``` diff --git a/docs/deploying_clearml/enterprise_deploy/extra_configs/dynamic_edit_task_pod_template.md b/docs/deploying_clearml/enterprise_deploy/extra_configs/dynamic_edit_task_pod_template.md index c7e651e5..a78196a0 100644 --- a/docs/deploying_clearml/enterprise_deploy/extra_configs/dynamic_edit_task_pod_template.md +++ b/docs/deploying_clearml/enterprise_deploy/extra_configs/dynamic_edit_task_pod_template.md @@ -18,8 +18,8 @@ Arguments passed to the function include: * `queue` (string) - ID of the queue from which the task was pulled. * `queue_name` (string) - Name of the queue from which the task was pulled. * `template` (Python dictionary) - Base Pod template created from the agent's configuration and any queue-specific overrides. 
-* `task_data` (object) - Task data object (as returned by the `tasks.get_by_id` API call). For example, use `task_data.project` to get the task's project ID. -* `providers_info` (dictionary) - Provider info containing optional information collected for the user running this task +* `task_data` (object) - [Task object](../../../references/sdk/task.md) (as returned by the `tasks.get_by_id` API call). For example, use `task_data.project` to get the task's project ID. +* `providers_info` (dictionary) - [Identity provider](sso_login.md) info containing optional information collected for the user running this task when the user logged into the system (requires additional server configuration). * `task_config` (`clearml_agent.backend_config.Config` object) - Task configuration containing configuration vaults applicable for the user running this task, and other configuration. Use `task_config.get("...")` to get specific configuration values. @@ -248,11 +248,8 @@ agentk8sglue: - mountPath: "/tmp/task/" name: task-pvc ``` -::: -### Example: Required Role - -The following is an example of `custom-agent-role` Role with permissions to handle `persistentvolumeclaims`: +* The following is an example of `custom-agent-role` Role with permissions to handle `persistentvolumeclaims`: ```yaml apiVersion: rbac.authorization.k8s.io/v1 @@ -271,4 +268,6 @@ rules: - create - patch - delete -``` \ No newline at end of file +``` + +::: \ No newline at end of file diff --git a/docs/deploying_clearml/enterprise_deploy/extra_configs/gpu_operator.md b/docs/deploying_clearml/enterprise_deploy/extra_configs/gpu_operator.md index 97b59e75..042418bf 100644 --- a/docs/deploying_clearml/enterprise_deploy/extra_configs/gpu_operator.md +++ b/docs/deploying_clearml/enterprise_deploy/extra_configs/gpu_operator.md @@ -12,7 +12,7 @@ Add the NVIDIA GPU Operator Helm repository: helm repo add nvidia https://nvidia.github.io/gpu-operator ``` -Update the repository locally: +Update the local repository: ```bash 
helm repo update ``` diff --git a/docs/deploying_clearml/enterprise_deploy/extra_configs/multi_node_training.md b/docs/deploying_clearml/enterprise_deploy/extra_configs/multi_node_training.md index 026a39d6..90ca201b 100644 --- a/docs/deploying_clearml/enterprise_deploy/extra_configs/multi_node_training.md +++ b/docs/deploying_clearml/enterprise_deploy/extra_configs/multi_node_training.md @@ -2,10 +2,28 @@ title: Multi-Node Training --- -The ClearML Enterprise Agent supports horizontal multi-node training, allowing a single Task to run across multiple pods +The ClearML Enterprise Agent supports horizontal multi-node training, allowing a single ClearML Task to run across multiple pods on different nodes. -Below is a configuration example using `clearml-agent-values.override.yaml`: +This is useful for distributed training where the training job needs to span multiple GPUs and potentially +multiple nodes. + +To enable multi-node scheduling, set both `agentk8sglue.serviceAccountClusterAccess` and `agentk8sglue.multiNode` to `true`. + +Multi-node behavior is controlled using the `multiNode` key in a queue configuration. This setting tells the +agent how to divide a Task's GPU requirements across multiple pods, with each pod running a part of the training job. + +Below is a configuration example using `clearml-agent-values.override.yaml` to enable multi-node training. 
+ +In this example: +* The `multiNode: [4, 2]` setting splits the Task into two workloads: + * One workload will need 4 GPUs + * The other workload will need 2 GPUs +* The GPU limit per pod is set to `nvidia.com/gpu: 2`, meaning each pod will be limited to 2 GPUs + +With this setup: +* The first workload (which needs 4 GPUs) will be scheduled as 2 pods, each with 2 GPUs +* The second workload (which needs 2 GPUs) will be scheduled as 1 pod with 2 GPUs ```yaml agentk8sglue: @@ -17,7 +35,7 @@ agentk8sglue: queues: multi-node-example: queueSettings: - # Defines the distribution of GPUs Tasks across multiple nodes. The format [x, y, ...] specifies the distribution of Tasks as 'x' GPUs on a node and 'y' GPUs on another node. Multiple Pods will be spawned respectively based on the lowest-common-denominator defined. + # Defines GPU needs per worker (e.g., 4 GPUs and 2 GPUs). Multiple Pods will be spawned respectively based on the lowest-common-denominator defined. multiNode: [ 4, 2 ] templateOverrides: resources: diff --git a/docs/deploying_clearml/enterprise_deploy/extra_configs/presign_service.md b/docs/deploying_clearml/enterprise_deploy/extra_configs/presign_service.md index 80a57359..4de148f7 100644 --- a/docs/deploying_clearml/enterprise_deploy/extra_configs/presign_service.md +++ b/docs/deploying_clearml/enterprise_deploy/extra_configs/presign_service.md @@ -1,9 +1,18 @@ --- -title: ClearML Presign Service +title: ClearML S3 Presign Service --- The ClearML Presign Service is a secure service that generates and redirects pre-signed storage URLs for authenticated -users, enabling direct access to cloud-hosted data (e.g., S3) without exposing credentials. +users, enabling direct access to S3 data without exposing credentials. + +When configured, the ClearML WebApp automatically redirects requests for matching storage URLs (like `s3://...`) to the +Presign Service. The service: + +* Authenticates the user with ClearML.
+* Generates a temporary, secure (pre-signed) S3 URL. +* Redirects the user's browser to the URL for direct access. + +This setup ensures secure access to S3-hosted data. ## Prerequisites @@ -12,7 +21,7 @@ users, enabling direct access to cloud-hosted data (e.g., S3) without exposing c the ClearML UI (**Settings > Workspace > API Credentials > Create new credentials**). For more information, see [ClearML API Credentials](../../../webapp/settings/webapp_settings_profile.md#clearml-api-credentials). :::note - Make sure these credentials belong to an admin user or a service user with admin privileges. + Make sure these credentials belong to an admin user or a service account with admin privileges. ::: - The worker environment must be able to access the ClearML Server over the same network. @@ -27,7 +36,7 @@ Add the ClearML Helm repository: helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username --password ``` -Update the repository locally: +Update the local repository: ```bash helm repo update ``` diff --git a/docs/deploying_clearml/enterprise_deploy/extra_configs/self_signed_certificates.md b/docs/deploying_clearml/enterprise_deploy/extra_configs/self_signed_certificates.md index e529e20e..7a031e21 100644 --- a/docs/deploying_clearml/enterprise_deploy/extra_configs/self_signed_certificates.md +++ b/docs/deploying_clearml/enterprise_deploy/extra_configs/self_signed_certificates.md @@ -1,13 +1,15 @@ --- -title: ClearML Tenant with Self Signed Certificates +title: Kubernetes Deployment with Self-Signed Certificates --- -This guide covers how to configure the [AI Application Gateway](#ai-application-gateway) and [ClearML Agent](#clearml-agent) +This guide covers how to configure the [AI Application Gateway](../appgw.md) and [ClearML Agent](../agent_k8s.md) to use self-signed or custom SSL certificates. 
-## AI Application Gateway +## Certificate Configuration -To configure certificates for the Application Gateway, update your `clearml-app-gateway-values.override.yaml` file: +To configure certificates, update the applicable overrides file: +* For AI Application Gateway: `clearml-app-gateway-values.override.yaml` file +* For ClearML Agent: `clearml-agent-values.override.yaml` file ```yaml # -- Custom certificates @@ -72,83 +74,7 @@ customCertificates: -----END CERTIFICATE----- ``` -### Apply Changes - -To apply the changes, run the update command: - -```bash -helm upgrade -i -n clearml-enterprise/clearml-enterprise-app-gateway --version -f clearml-app-gateway-values.override.yaml -``` - -## ClearML Agent - -For the ClearML Agent, configure certificates in the `clearml-agent-values.override.yaml` file: - -```yaml -# -- Custom certificates -customCertificates: - # -- Override system crt certificate bundle. Mutual exclusive with extraCerts. - overrideCaCertificatesCrt: - # -- Extra certs usable in case of needs of adding more certificates to the standard bundle, Requires root permissions to run update-ca-certificates. Mutual exclusive with overrideCaCertificatesCrt. - extraCerts: - - alias: certificateName - pem: | - -----BEGIN CERTIFICATE----- - ### - -----END CERTIFICATE----- -``` - -You have two configuration options: - -- [**Replace**](#replace-entire-ca-certificatescrt-file-1) the entire `ca-certificates.crt` file -- [**Append**](#append-extra-certificates-to-the-existing-ca-certificatescrt-1) extra certificates to the existing `ca-certificates.crt` - - -### Replace Entire ca-certificates.crt File - -To replace the whole ca-bundle, provide a concatenated list of all trusted CA certificates in `pem` format as -they are stored in a standard `ca-certificates.crt`. - - -```yaml -# -- Custom certificates -customCertificates: - # -- Override system crt certificate bundle. Mutual exclusive with extraCerts. 
- overrideCaCertificatesCrt: | - -----BEGIN CERTIFICATE----- - ### CERT 1 - -----END CERTIFICATE----- - -----BEGIN CERTIFICATE----- - ### CERT 2 - -----END CERTIFICATE----- - -----BEGIN CERTIFICATE----- - ### CERT 3 - -----END CERTIFICATE----- - ... -``` - -### Append Extra Certificates to the Existing ca-certificates.crt - -You can add certificates to the existing CA bundle. Each certificate must have a unique `alias`. - -```yaml -# -- Custom certificates -customCertificates: - # -- Extra certs usable in case of needs of adding more certificates to the standard bundle, Requires root permissions to run update-ca-certificates. Mutual exclusive with overrideCaCertificatesCrt. - extraCerts: - - alias: certificate-name-1 - pem: | - -----BEGIN CERTIFICATE----- - ### - -----END CERTIFICATE----- - - alias: certificate-name-2 - pem: | - -----BEGIN CERTIFICATE----- - ### - -----END CERTIFICATE----- -``` - -### Add Certificates to Task Pods +### ClearML Agent: Add Certificates to Task Pods If your workloads need access to these certificates (e.g., for HTTPS requests), configure the agent to inject them into pods: @@ -194,8 +120,15 @@ Their names are usually prefixed with the Helm release name, so adjust according ### Apply Changes -Apply the changes by running the update command: +To apply the changes, run the update command: +* For AI Application Gateway: -``` bash -helm upgrade -i -n clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml -``` \ No newline at end of file + ```bash + helm upgrade -i -n clearml-enterprise/clearml-enterprise-app-gateway --version -f clearml-app-gateway-values.override.yaml + ``` + +* For ClearML Agent: + + ```bash + helm upgrade -i -n clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml + ``` \ No newline at end of file diff --git a/docs/deploying_clearml/enterprise_deploy/extra_configs/sso_login.md 
b/docs/deploying_clearml/enterprise_deploy/extra_configs/sso_login.md index 6e9179e5..4eae0c6a 100644 --- a/docs/deploying_clearml/enterprise_deploy/extra_configs/sso_login.md +++ b/docs/deploying_clearml/enterprise_deploy/extra_configs/sso_login.md @@ -6,13 +6,8 @@ ClearML Enterprise Server supports various Single Sign-On (SSO) identity provide SSO configuration is managed via environment variables in your `clearml-values.override.yaml` file and is applied to the `apiserver` component. -The following are configuration examples for commonly used providers. Other supported systems include: -* Auth0 -* Keycloak -* Okta -* Azure AD -* Google -* AWS Cognito +The following are configuration examples for commonly used identity providers. See the [full list of supported identity providers](../../../webapp/settings/webapp_settings_id_providers.md). + ## Auth0 @@ -52,7 +47,7 @@ apiserver: value: "true" ``` -## Group Membership Mapping in Keycloak +### Group Membership Mapping in Keycloak To map Keycloak groups into the ClearML user's SSO token: diff --git a/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cdmo.md b/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cdmo.md index 0cbf1d32..0e04bc81 100644 --- a/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cdmo.md +++ b/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cdmo.md @@ -1,8 +1,14 @@ --- -title: ClearML Dynamic MIG Operator (CDMO) +title: Managing GPU Fractions with ClearML Dynamic MIG Operator (CDMO) --- -The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG (Multi-Instance GPU) configurations. +This guide covers using GPU fractions in Kubernetes clusters with NVIDIA MIGs and +ClearML's Dynamic MIG Operator (CDMO). CDMO enables dynamic MIG (Multi-Instance GPU) configurations.
+ +This guide covers: +* Installing CDMO +* Enabling MIG mode on your cluster +* Managing GPU partitioning dynamically ## Installation @@ -78,7 +84,7 @@ The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG (Multi-Instance GPU * For convenience, this command can be run from within the `nvidia-device-plugin-daemonset` pod running on the related node. ::: -1. Label all MIG-enabled GPU node `` from the previous step: +1. Label all MIG-enabled GPU nodes `` from the previous step: ```bash kubectl label nodes "cdmo.clear.ml/gpu-partitioning=mig" @@ -106,7 +112,7 @@ To disable MIG mode and restore standard full-GPU access: nvidia-smi -mig 0 ``` -4. Edit the `gpu-operator.override.yaml` file to restore full-GPU access, and upgrade the `gpu-operator`: +4. Edit the `gpu-operator.override.yaml` file to restore full-GPU access: ```yaml toolkit: @@ -129,4 +135,10 @@ To disable MIG mode and restore standard full-GPU access: value: all - name: NVIDIA_DRIVER_CAPABILITIES value: all - ``` \ No newline at end of file + ``` + +5. Upgrade the `gpu-operator`: + + ```bash + helm upgrade -n gpu-operator gpu-operator nvidia/gpu-operator -f gpu-operator.override.yaml + ``` \ No newline at end of file diff --git a/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cfgi.md b/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cfgi.md index 67e63b7f..cc28aef6 100644 --- a/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cfgi.md +++ b/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cfgi.md @@ -16,7 +16,7 @@ helm repo update ### Requirements -* Install the NVIDIA `gpu-operator` using Helm +* Install the NVIDIA `gpu-operator` using Helm. For instructions, see [Basic Deployment](../extra_configs/gpu_operator.md). 
* Set the number of GPU slices to 8 * Add and update the Nvidia Helm repo: @@ -191,7 +191,7 @@ Valid values for `""` include: ### ClearML Agent Configuration -To run ClearML jobs with fractional GPU allocation, configure your queues in accordingly in your `clearml-agent-values.override.yaml` file. +To run ClearML jobs with fractional GPU allocation, configure your queues in your `clearml-agent-values.override.yaml` file. Each queue should include a `templateOverride` that sets the `clearml-injector/fraction` label, which determines the fraction of a GPU to allocate (e.g., "0.500" for half a GPU). diff --git a/docs/deploying_clearml/enterprise_deploy/k8s.md b/docs/deploying_clearml/enterprise_deploy/k8s.md index 7ed10bd0..0992ebc3 100644 --- a/docs/deploying_clearml/enterprise_deploy/k8s.md +++ b/docs/deploying_clearml/enterprise_deploy/k8s.md @@ -50,7 +50,7 @@ Add the ClearML Helm repository: helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username --password ``` -Update the repository locally: +Update the local repository: ``` bash helm repo update ``` diff --git a/docs/deploying_clearml/enterprise_deploy/multi_tenant_k8s.md b/docs/deploying_clearml/enterprise_deploy/multi_tenant_k8s.md index b6d2e19e..604d169c 100644 --- a/docs/deploying_clearml/enterprise_deploy/multi_tenant_k8s.md +++ b/docs/deploying_clearml/enterprise_deploy/multi_tenant_k8s.md @@ -709,7 +709,7 @@ The following features can be assigned to groups via the `features` configuratio | `reports` | Enables access to Reports. | No | | `resource_dashboard` | Enables access to the compute resource dashboard feature. | No | | `sso_management` | Enables the SSO (Single Sign-On) configuration wizard. | No | -| `service_users` | Enables support for creating and managing service users (API keys). | No | +| `service_users` | Enables support for creating and managing service accounts (API keys). 
| No | | `resource_policy` | Enables the resource policy feature. | May default to a trial feature if not explicitly enabled. | | `model_serving` | Enables access to the model serving endpoints feature. | No | | `show_dashboard` | Makes the "Dashboard" menu item visible in the UI sidebar. | No | diff --git a/docs/release_notes/clearml_server/enterprise/ver_3_25.md b/docs/release_notes/clearml_server/enterprise/ver_3_25.md index 17a61860..47edcfcc 100644 --- a/docs/release_notes/clearml_server/enterprise/ver_3_25.md +++ b/docs/release_notes/clearml_server/enterprise/ver_3_25.md @@ -29,7 +29,7 @@ title: Version 3.25 * Display per-GPU metrics in "CPU and GPU Usage" and "Video Memory" graphs when multiple GPUs are available * Add "GPU Count" column to the Resource Groups table in the Orchestration Dashboard * Add global search bar to all UI pages -* Enable setting service users as admins +* Enable setting service accounts as admins * Add filter to UI Model Endpoints table * Add UI scalar viewing configuration on a per-project basis ([ClearML GitHub issue #1377](https://github.com/clearml/clearml/issues/1377)) * Add clicking project name in breadcrumbs of full-screen task opens the task in detail’s view ([ClearML GitHub issue #1376](https://github.com/clearml/clearml/issues/1376)) @@ -42,7 +42,7 @@ title: Version 3.25 * Fix EMA smoothing in UI scalars is incorrect in first data point ([ClearML Web GitHub issue #101](https://github.com/clearml/clearml-web/issues/101)) * Improve UI scalar smoothing algorithms (ClearML Web GitHub issues [#101](https://github.com/clearml/clearml-web/issues/101), [#102](https://github.com/clearml/clearml-web/issues/102), [#103](https://github.com/clearml/clearml-web/issues/103)) * Fix UI Users & Groups table's "Groups" column data remains condensed after column is expanded -* Fix setting service users as admins causes apiserver to crash +* Fix setting service accounts as admins causes apiserver to crash * Fix UI "New Dataview" modal's 
version selection sometimes does not display draft versions * Fix GCS and Azure credential input popups not displaying in UI task debug samples * Fix UI pipeline "Preview" tab sometimes displays "Failed to get plot charts" error