revital 2025-05-20 13:33:57 +03:00
parent cc47ae1465
commit f0e9cbe027
12 changed files with 88 additions and 162 deletions

View File

@@ -11,7 +11,7 @@ The ClearML Agent enables scheduling and executing distributed experiments on a
 the ClearML UI (**Settings > Workspace > API Credentials > Create new credentials**). For more information, see [ClearML API Credentials](../../webapp/settings/webapp_settings_profile.md#clearml-api-credentials).
 :::note
-Make sure these credentials belong to an admin user or a service user with admin privileges.
+Make sure these credentials belong to an admin user or a service account with admin privileges.
 :::
 - The worker environment must be able to access the ClearML Server over the same network.
@@ -26,7 +26,7 @@ Add the ClearML Helm repository:
 helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <HELM_REPO_TOKEN> --password <HELM_REPO_TOKEN>
 ```
-Update the repository locally:
+Update the local repository:
 ```bash
 helm repo update
 ```

View File

@@ -18,8 +18,8 @@ Arguments passed to the function include:
 * `queue` (string) - ID of the queue from which the task was pulled.
 * `queue_name` (string) - Name of the queue from which the task was pulled.
 * `template` (Python dictionary) - Base Pod template created from the agent's configuration and any queue-specific overrides.
-* `task_data` (object) - Task data object (as returned by the `tasks.get_by_id` API call). For example, use `task_data.project` to get the task's project ID.
-* `providers_info` (dictionary) - Provider info containing optional information collected for the user running this task
+* `task_data` (object) - [Task object](../../../references/sdk/task.md) (as returned by the `tasks.get_by_id` API call). For example, use `task_data.project` to get the task's project ID.
+* `providers_info` (dictionary) - [Identity provider](sso_login.md) info containing optional information collected for the user running this task
 when the user logged into the system (requires additional server configuration).
 * `task_config` (`clearml_agent.backend_config.Config` object) - Task configuration containing configuration vaults applicable
 for the user running this task, and other configuration. Use `task_config.get("...")` to get specific configuration values.
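As an illustration of how these arguments might be used together, here is a hypothetical hook (the function name, label keys, and vault key are invented for this sketch; only the argument names and their types come from the list above):

```python
def resolve_template(queue, queue_name, template, task_data, providers_info, task_config):
    """Hypothetical example: annotate the base Pod template before it is applied."""
    # Tag the Pod with the task's project ID (task_data fields as returned by tasks.get_by_id)
    labels = template.setdefault("metadata", {}).setdefault("labels", {})
    labels["clearml-project"] = str(task_data.project)
    # Read an optional value from the user's configuration vaults (key name is illustrative)
    region = task_config.get("custom.region", None)
    if region:
        labels["region"] = region
    return template
```

The template dictionary is mutated in place and returned, mirroring how a base Pod template plus queue-specific overrides would flow through such a hook.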
@@ -248,11 +248,8 @@ agentk8sglue:
         - mountPath: "/tmp/task/"
           name: task-pvc
 ```
-:::
-### Example: Required Role
-The following is an example of `custom-agent-role` Role with permissions to handle `persistentvolumeclaims`:
+* The following is an example of `custom-agent-role` Role with permissions to handle `persistentvolumeclaims`:
 ```yaml
 apiVersion: rbac.authorization.k8s.io/v1
@@ -272,3 +269,5 @@ rules:
   - patch
   - delete
 ```
+:::

View File

@@ -12,7 +12,7 @@ Add the NVIDIA GPU Operator Helm repository:
 helm repo add nvidia https://nvidia.github.io/gpu-operator
 ```
-Update the repository locally:
+Update the local repository:
 ```bash
 helm repo update
 ```

View File

@@ -2,10 +2,28 @@
 title: Multi-Node Training
 ---
-The ClearML Enterprise Agent supports horizontal multi-node training, allowing a single Task to run across multiple pods
+The ClearML Enterprise Agent supports horizontal multi-node training, allowing a single ClearML Task to run across multiple pods
 on different nodes.
-Below is a configuration example using `clearml-agent-values.override.yaml`:
+This is useful for distributed training where the training job needs to span multiple GPUs and potentially
+multiple nodes.
+To enable multi-node scheduling, set both `agentk8sglue.serviceAccountClusterAccess` and `agentk8sglue.multiNode` to `true`.
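In `clearml-agent-values.override.yaml`, those two flags would look like this (a minimal sketch; all other values omitted):

```yaml
agentk8sglue:
  # Both settings are required for multi-node scheduling
  serviceAccountClusterAccess: true
  multiNode: true
```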
+Multi-node behavior is controlled using the `multiNode` key in a queue configuration. This setting tells the
+agent how to divide a Task's GPU requirements across multiple pods, with each pod running a part of the training job.
+Below is a configuration example using `clearml-agent-values.override.yaml` to enable multi-node training.
+In this example:
+* The `multiNode: [4, 2]` setting splits the Task into two workloads:
+  * One workload will need 4 GPUs
+  * The other workload will need 2 GPUs
+* The GPU limit per pod is set to `nvidia.com/gpu: 2`, meaning each pod will be limited to 2 GPUs
+With this setup:
+* The first workload (which needs 4 GPUs) will be scheduled as 2 pods, each with 2 GPUs
+* The second workload (which needs 2 GPUs) will be scheduled as 1 pod with 2 GPUs
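The pod math described above can be sketched as follows (illustrative only, not ClearML code):

```python
import math

def pods_per_workload(multi_node, gpu_limit_per_pod):
    # Each workload needing g GPUs is split into ceil(g / limit) pods,
    # each capped at the per-pod GPU limit.
    return [math.ceil(g / gpu_limit_per_pod) for g in multi_node]

print(pods_per_workload([4, 2], 2))  # [2, 1] -> 2 pods + 1 pod, each with 2 GPUs
```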
 ```yaml
 agentk8sglue:
@@ -17,7 +35,7 @@ agentk8sglue:
   queues:
     multi-node-example:
       queueSettings:
-        # Defines the distribution of GPUs Tasks across multiple nodes. The format [x, y, ...] specifies the distribution of Tasks as 'x' GPUs on a node and 'y' GPUs on another node. Multiple Pods will be spawned respectively based on the lowest-common-denominator defined.
+        # Defines GPU needs per worker (e.g., 4 GPUs and 2 GPUs). Multiple Pods will be spawned respectively based on the lowest-common-denominator defined.
         multiNode: [ 4, 2 ]
       templateOverrides:
         resources:

View File

@@ -1,9 +1,18 @@
 ---
-title: ClearML Presign Service
+title: ClearML S3 Presign Service
 ---
 The ClearML Presign Service is a secure service that generates and redirects pre-signed storage URLs for authenticated
-users, enabling direct access to cloud-hosted data (e.g., S3) without exposing credentials.
+users, enabling direct access to S3 data without exposing credentials.
+When configured, the ClearML WebApp automatically redirects requests for matching storage URLs (like `s3://...`) to the
+Presign Service. The service:
+* Verifies the user's ClearML authentication.
+* Generates a temporary, secure (pre-signed) S3 URL.
+* Redirects the user's browser to the URL for direct access.
+This setup ensures secure access to S3-hosted data.
 ## Prerequisites
@@ -12,7 +21,7 @@ users, enabling direct access to cloud-hosted data (e.g., S3) without exposing c
 the ClearML UI (**Settings > Workspace > API Credentials > Create new credentials**). For more information, see [ClearML API Credentials](../../../webapp/settings/webapp_settings_profile.md#clearml-api-credentials).
 :::note
-Make sure these credentials belong to an admin user or a service user with admin privileges.
+Make sure these credentials belong to an admin user or a service account with admin privileges.
 :::
 - The worker environment must be able to access the ClearML Server over the same network.
@@ -27,7 +36,7 @@ Add the ClearML Helm repository:
 helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <HELM_REPO_TOKEN> --password <HELM_REPO_TOKEN>
 ```
-Update the repository locally:
+Update the local repository:
 ```bash
 helm repo update
 ```

View File

@@ -1,13 +1,15 @@
 ---
-title: ClearML Tenant with Self Signed Certificates
+title: Self-Signed Certificates for ClearML Agent and AI App Gateway
 ---
-This guide covers how to configure the [AI Application Gateway](#ai-application-gateway) and [ClearML Agent](#clearml-agent)
+This guide covers how to configure the [AI Application Gateway](../appgw.md) and [ClearML Agent](../agent_k8s.md)
 to use self-signed or custom SSL certificates.
-## AI Application Gateway
-To configure certificates for the Application Gateway, update your `clearml-app-gateway-values.override.yaml` file:
+## Certificate Configuration
+To configure certificates, update the following files:
+* For AI Application Gateway: `clearml-app-gateway-values.override.yaml`
+* For ClearML Agent: `clearml-agent-values.override.yaml`
 ```yaml
 # -- Custom certificates
@@ -72,83 +74,7 @@ customCertificates:
     -----END CERTIFICATE-----
 ```
-### Apply Changes
-To apply the changes, run the update command:
-```bash
-helm upgrade -i <RELEASE_NAME> -n <WORKLOAD_NAMESPACE> clearml-enterprise/clearml-enterprise-app-gateway --version <CHART_VERSION> -f clearml-app-gateway-values.override.yaml
-```
-## ClearML Agent
-For the ClearML Agent, configure certificates in the `clearml-agent-values.override.yaml` file:
-```yaml
-# -- Custom certificates
-customCertificates:
-  # -- Override system crt certificate bundle. Mutual exclusive with extraCerts.
-  overrideCaCertificatesCrt:
-  # -- Extra certs usable in case of needs of adding more certificates to the standard bundle, Requires root permissions to run update-ca-certificates. Mutual exclusive with overrideCaCertificatesCrt.
-  extraCerts:
-    - alias: certificateName
-      pem: |
-        -----BEGIN CERTIFICATE-----
-        ###
-        -----END CERTIFICATE-----
-```
-You have two configuration options:
-- [**Replace**](#replace-entire-ca-certificatescrt-file-1) the entire `ca-certificates.crt` file
-- [**Append**](#append-extra-certificates-to-the-existing-ca-certificatescrt-1) extra certificates to the existing `ca-certificates.crt`
-### Replace Entire ca-certificates.crt File
-To replace the whole ca-bundle, provide a concatenated list of all trusted CA certificates in `pem` format as
-they are stored in a standard `ca-certificates.crt`.
-```yaml
-# -- Custom certificates
-customCertificates:
-  # -- Override system crt certificate bundle. Mutual exclusive with extraCerts.
-  overrideCaCertificatesCrt: |
-    -----BEGIN CERTIFICATE-----
-    ### CERT 1
-    -----END CERTIFICATE-----
-    -----BEGIN CERTIFICATE-----
-    ### CERT 2
-    -----END CERTIFICATE-----
-    -----BEGIN CERTIFICATE-----
-    ### CERT 3
-    -----END CERTIFICATE-----
-  ...
-```
-### Append Extra Certificates to the Existing ca-certificates.crt
-You can add certificates to the existing CA bundle. Each certificate must have a unique `alias`.
-```yaml
-# -- Custom certificates
-customCertificates:
-  # -- Extra certs usable in case of needs of adding more certificates to the standard bundle, Requires root permissions to run update-ca-certificates. Mutual exclusive with overrideCaCertificatesCrt.
-  extraCerts:
-    - alias: certificate-name-1
-      pem: |
-        -----BEGIN CERTIFICATE-----
-        ###
-        -----END CERTIFICATE-----
-    - alias: certificate-name-2
-      pem: |
-        -----BEGIN CERTIFICATE-----
-        ###
-        -----END CERTIFICATE-----
-```
-### Add Certificates to Task Pods
+### ClearML Agent: Add Certificates to Task Pods
 If your workloads need access to these certificates (e.g., for HTTPS requests), configure the agent to inject them into pods:
@@ -194,7 +120,14 @@ Their names are usually prefixed with the Helm release name, so adjust according
 ### Apply Changes
-Apply the changes by running the update command:
+To apply the changes, run the update command:
+* For AI Application Gateway:
+```bash
+helm upgrade -i <RELEASE_NAME> -n <WORKLOAD_NAMESPACE> clearml-enterprise/clearml-enterprise-app-gateway --version <CHART_VERSION> -f clearml-app-gateway-values.override.yaml
+```
+* For ClearML Agent:
 ```bash
 helm upgrade -i -n <WORKER_NAMESPACE> clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml

View File

@@ -6,13 +6,8 @@ ClearML Enterprise Server supports various Single Sign-On (SSO) identity provide
 SSO configuration is managed via environment variables in your `clearml-values.override.yaml` file and is applied to the
 `apiserver` component.
-The following are configuration examples for commonly used providers. Other supported systems include:
-* Auth0
-* Keycloak
-* Okta
-* Azure AD
-* Google
-* AWS Cognito
+The following are configuration examples for commonly used identity providers. See [full list of supported identity providers](../../../webapp/settings/webapp_settings_id_providers.md).
 ## Auth0
@@ -52,7 +47,7 @@ apiserver:
         value: "true"
 ```
-## Group Membership Mapping in Keycloak
+### Group Membership Mapping in Keycloak
 To map Keycloak groups into the ClearML user's SSO token:

View File

@@ -1,54 +1,20 @@
 ---
-title: ClearML Dynamic MIG Operator (CDMO)
+title: Managing GPU Fragments with ClearML Dynamic MIG Operator (CDMO)
 ---
-The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG (Multi-Instance GPU) configurations.
-This guide covers:
-* Installing CDMO
-* Enabling MIG mode on your cluster
-* Managing GPU partitioning dynamically
+This guide covers using GPU fragments in Kubernetes clusters using NVIDIA MIGs and
+ClearML's Dynamic MIG Operator (CDMO). CDMO enables dynamic MIG (Multi-Instance GPU) configurations.
 ## Installation
 ### Requirements
-* Add and update the Nvidia Helm repo:
+* Install the NVIDIA `gpu-operator` using Helm. For instructions, see [Basic Deployment](../extra_configs/gpu_operator.md).
-  ```bash
-  helm repo add nvidia https://nvidia.github.io/gpu-operator
-  helm repo update
-  ```
-* Create a `gpu-operator.override.yaml` file with the following content:
-  ```yaml
-  migManager:
-    enabled: false
-  mig:
-    strategy: mixed
-  toolkit:
-    env:
-      - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
-        value: "false"
-      - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
-        value: "true"
-  devicePlugin:
-    env:
-      - name: PASS_DEVICE_SPECS
-        value: "true"
-      - name: FAIL_ON_INIT_ERROR
-        value: "true"
-      - name: DEVICE_LIST_STRATEGY # Use volume-mounts
-        value: volume-mounts
-      - name: DEVICE_ID_STRATEGY
-        value: uuid
-      - name: NVIDIA_VISIBLE_DEVICES
-        value: all
-      - name: NVIDIA_DRIVER_CAPABILITIES
-        value: all
-  ```
-* Install the NVIDIA `gpu-operator` using Helm with the previous configuration:
-  ```bash
-  helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
-  ```
 ### Installing CDMO
@@ -78,7 +44,7 @@ The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG (Multi-Instance GPU
 * For convenience, this command can be run from within the `nvidia-device-plugin-daemonset` pod running on the related node.
 :::
-1. Label all MIG-enabled GPU node `<NODE_NAME>` from the previous step:
+1. Label all MIG-enabled GPU nodes `<NODE_NAME>` from the previous step:
 ```bash
 kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning=mig"
@@ -106,7 +72,7 @@ To disable MIG mode and restore standard full-GPU access:
 nvidia-smi -mig 0
 ```
-4. Edit the `gpu-operator.override.yaml` file to restore full-GPU access, and upgrade the `gpu-operator`:
+4. Edit the `gpu-operator.override.yaml` file to restore full-GPU access:
 ```yaml
 toolkit:
@@ -130,3 +96,9 @@ To disable MIG mode and restore standard full-GPU access:
       - name: NVIDIA_DRIVER_CAPABILITIES
         value: all
 ```
+5. Upgrade the `gpu-operator`:
+```bash
+helm upgrade -n gpu-operator gpu-operator nvidia/gpu-operator -f gpu-operator.override.yaml
+```

View File

@@ -16,7 +16,7 @@ helm repo update
 ### Requirements
-* Install the NVIDIA `gpu-operator` using Helm
+* Install the NVIDIA `gpu-operator` using Helm. For instructions, see [Basic Deployment](../extra_configs/gpu_operator.md).
 * Set the number of GPU slices to 8
 * Add and update the Nvidia Helm repo:
@@ -191,7 +191,7 @@ Valid values for `"<GPU_FRACTION_VALUE>"` include:
 ### ClearML Agent Configuration
-To run ClearML jobs with fractional GPU allocation, configure your queues in accordingly in your `clearml-agent-values.override.yaml` file.
+To run ClearML jobs with fractional GPU allocation, configure your queues in your `clearml-agent-values.override.yaml` file.
 Each queue should include a `templateOverride` that sets the `clearml-injector/fraction` label, which determines the
 fraction of a GPU to allocate (e.g., "0.500" for half a GPU).
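A hedged sketch of such a queue in `clearml-agent-values.override.yaml` (the queue name is invented and the exact key layout is inferred from the surrounding examples; verify against your chart version):

```yaml
agentk8sglue:
  queues:
    half-gpu:
      templateOverrides:
        labels:
          # Allocate half a GPU to each task pulled from this queue
          clearml-injector/fraction: "0.500"
```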

View File

@@ -50,7 +50,7 @@ Add the ClearML Helm repository:
 helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <HELM_REPO_TOKEN> --password <HELM_REPO_TOKEN>
 ```
-Update the repository locally:
+Update the local repository:
 ``` bash
 helm repo update
 ```

View File

@@ -709,7 +709,7 @@ The following features can be assigned to groups via the `features` configuratio
 | `reports` | Enables access to Reports. | <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
 | `resource_dashboard` | Enables access to the compute resource dashboard feature. | <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
 | `sso_management` | Enables the SSO (Single Sign-On) configuration wizard. | <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
-| `service_users` | Enables support for creating and managing service users (API keys). | <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
+| `service_users` | Enables support for creating and managing service accounts (API keys). | <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
 | `resource_policy` | Enables the resource policy feature. | May default to a trial feature if not explicitly enabled. |
 | `model_serving` | Enables access to the model serving endpoints feature. | <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
 | `show_dashboard` | Makes the "Dashboard" menu item visible in the UI sidebar. | <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |

View File

@@ -29,7 +29,7 @@ title: Version 3.25
 * Display per-GPU metrics in "CPU and GPU Usage" and "Video Memory" graphs when multiple GPUs are available
 * Add "GPU Count" column to the Resource Groups table in the Orchestration Dashboard
 * Add global search bar to all UI pages
-* Enable setting service users as admins
+* Enable setting service accounts as admins
 * Add filter to UI Model Endpoints table
 * Add UI scalar viewing configuration on a per-project basis ([ClearML GitHub issue #1377](https://github.com/clearml/clearml/issues/1377))
 * Add clicking project name in breadcrumbs of full-screen task opens the task in details view ([ClearML GitHub issue #1376](https://github.com/clearml/clearml/issues/1376))
@@ -42,7 +42,7 @@ title: Version 3.25
 * Fix EMA smoothing in UI scalars is incorrect in first data point ([ClearML Web GitHub issue #101](https://github.com/clearml/clearml-web/issues/101))
 * Improve UI scalar smoothing algorithms (ClearML Web GitHub issues [#101](https://github.com/clearml/clearml-web/issues/101), [#102](https://github.com/clearml/clearml-web/issues/102), [#103](https://github.com/clearml/clearml-web/issues/103))
 * Fix UI Users & Groups table's "Groups" column data remains condensed after column is expanded
-* Fix setting service users as admins causes apiserver to crash
+* Fix setting service accounts as admins causes apiserver to crash
 * Fix UI "New Dataview" modal's version selection sometimes does not display draft versions
 * Fix GCS and Azure credential input popups not displaying in UI task debug samples
 * Fix UI pipeline "Preview" tab sometimes displays "Failed to get plot charts" error