Edit Enterprise Server pages

Noam Wasersprung 2025-05-22 13:48:22 +03:00 committed by GitHub
commit ac733ba3f0
12 changed files with 87 additions and 121 deletions

View File

@@ -11,7 +11,7 @@ The ClearML Agent enables scheduling and executing distributed experiments on a
the ClearML UI (**Settings > Workspace > API Credentials > Create new credentials**). For more information, see [ClearML API Credentials](../../webapp/settings/webapp_settings_profile.md#clearml-api-credentials).
:::note
-Make sure these credentials belong to an admin user or a service user with admin privileges.
+Make sure these credentials belong to an admin user or a service account with admin privileges.
:::
- The worker environment must be able to access the ClearML Server over the same network.
@@ -26,7 +26,7 @@ Add the ClearML Helm repository:
helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <HELM_REPO_TOKEN> --password <HELM_REPO_TOKEN>
```
-Update the repository locally:
+Update the local repository:
```bash
helm repo update
```
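Once the repo is updated, the agent chart can be installed with its overrides file. A minimal sketch, mirroring the agent upgrade command that appears later in this commit:

```bash
helm upgrade -i -n <WORKER_NAMESPACE> clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml
```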

View File

@@ -18,8 +18,8 @@ Arguments passed to the function include:
* `queue` (string) - ID of the queue from which the task was pulled.
* `queue_name` (string) - Name of the queue from which the task was pulled.
* `template` (Python dictionary) - Base Pod template created from the agent's configuration and any queue-specific overrides.
-* `task_data` (object) - Task data object (as returned by the `tasks.get_by_id` API call). For example, use `task_data.project` to get the task's project ID.
-* `providers_info` (dictionary) - Provider info containing optional information collected for the user running this task
+* `task_data` (object) - [Task object](../../../references/sdk/task.md) (as returned by the `tasks.get_by_id` API call). For example, use `task_data.project` to get the task's project ID.
+* `providers_info` (dictionary) - [Identity provider](sso_login.md) info containing optional information collected for the user running this task
when the user logged into the system (requires additional server configuration).
* `task_config` (`clearml_agent.backend_config.Config` object) - Task configuration containing configuration vaults applicable
for the user running this task, and other configuration. Use `task_config.get("...")` to get specific configuration values.
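For concreteness, a minimal sketch of such a customization function. The function name, registration mechanism, and the specific keys touched are illustrative assumptions; only the argument names and the `task_data.project` / `task_config.get("...")` usage come from the list above:

```python
# Hypothetical template-customization callback. Argument names follow the
# documentation above; everything else (name, keys, queue name) is illustrative.
def customize_template(queue, queue_name, template, task_data, providers_info, task_config):
    # Route tasks from a specific queue onto GPU nodes (illustrative node label).
    if queue_name == "gpu-queue":
        template.setdefault("spec", {}).setdefault("nodeSelector", {})["gpu"] = "true"

    # Tag the Pod with the task's project ID (task_data.project, as documented above).
    template.setdefault("metadata", {}).setdefault("labels", {})["project-id"] = task_data.project

    # Read a user-scoped value from the task configuration vaults, if present
    # (assumes Config.get accepts a default, like a dict).
    custom_image = task_config.get("sdk.container.image", None)
    if custom_image and template.get("spec", {}).get("containers"):
        template["spec"]["containers"][0]["image"] = custom_image

    return template
```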
@@ -248,11 +248,8 @@ agentk8sglue:
    - mountPath: "/tmp/task/"
      name: task-pvc
```
-:::
-### Example: Required Role
-The following is an example of `custom-agent-role` Role with permissions to handle `persistentvolumeclaims`:
+* The following is an example of `custom-agent-role` Role with permissions to handle `persistentvolumeclaims`:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
@@ -271,4 +268,6 @@ rules:
    - create
    - patch
    - delete
```
+:::
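A Role grants nothing until it is bound to the agent's identity. A minimal RoleBinding sketch, assuming the agent runs under a service account named `clearml-agent-sa` (both the binding name and the service account name are illustrative, not chart defaults):

```yaml
# Illustrative RoleBinding for the custom-agent-role above; adjust the
# service account name and namespace to match your agent deployment.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: custom-agent-role-binding
  namespace: <WORKER_NAMESPACE>
subjects:
  - kind: ServiceAccount
    name: clearml-agent-sa
    namespace: <WORKER_NAMESPACE>
roleRef:
  kind: Role
  name: custom-agent-role
  apiGroup: rbac.authorization.k8s.io
```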

View File

@@ -12,7 +12,7 @@ Add the NVIDIA GPU Operator Helm repository:
helm repo add nvidia https://nvidia.github.io/gpu-operator
```
-Update the repository locally:
+Update the local repository:
```bash
helm repo update
```
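The operator itself is then typically installed from that repo. A minimal sketch; the namespace and override file name follow the `gpu-operator` upgrade command shown later in this commit:

```bash
helm install -n gpu-operator --create-namespace gpu-operator nvidia/gpu-operator -f gpu-operator.override.yaml
```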

View File

@@ -2,10 +2,28 @@
title: Multi-Node Training
---
-The ClearML Enterprise Agent supports horizontal multi-node training, allowing a single Task to run across multiple pods
+The ClearML Enterprise Agent supports horizontal multi-node training, allowing a single ClearML Task to run across multiple pods
on different nodes.
-Below is a configuration example using `clearml-agent-values.override.yaml`:
+This is useful for distributed training where the training job needs to span multiple GPUs and potentially
+multiple nodes.
+To enable multi-node scheduling, set both `agentk8sglue.serviceAccountClusterAccess` and `agentk8sglue.multiNode` to `true`.
+Multi-node behavior is controlled using the `multiNode` key in a queue configuration. This setting tells the
+agent how to divide a Task's GPU requirements across multiple pods, with each pod running a part of the training job.
+Below is a configuration example using `clearml-agent-values.override.yaml` to enable multi-node training.
+In this example:
+* The `multiNode: [4, 2]` setting splits the Task into two workloads:
+  * One workload will need 4 GPUs
+  * The other workload will need 2 GPUs
+* The GPU limit per pod is set to `nvidia.com/gpu: 2`, meaning each pod will be limited to 2 GPUs
+With this setup:
+* The first workload (which needs 4 GPUs) will be scheduled as 2 pods, each with 2 GPUs
+* The second workload (which needs 2 GPUs) will be scheduled as 1 pod with 2 GPUs
```yaml
agentk8sglue:
@ -17,7 +35,7 @@ agentk8sglue:
queues:
multi-node-example:
queueSettings:
# Defines the distribution of GPUs Tasks across multiple nodes. The format [x, y, ...] specifies the distribution of Tasks as 'x' GPUs on a node and 'y' GPUs on another node. Multiple Pods will be spawned respectively based on the lowest-common-denominator defined.
# Defines GPU needs per worker (e.g., 4 GPUs and 2 GPUs). Multiple Pods will be spawned respectively based on the lowest-common-denominator defined.
multiNode: [ 4, 2 ]
templateOverrides:
resources:
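Assembled from the fragments above, the relevant parts of the override file look roughly like this. A sketch: only the keys shown in this commit are confirmed; the `limits` nesting is the standard Kubernetes resources layout:

```yaml
# Sketch of the full multi-node queue configuration described above:
# a 4+2 GPU split across workloads, capped at 2 GPUs per pod.
agentk8sglue:
  serviceAccountClusterAccess: true
  multiNode: true
  queues:
    multi-node-example:
      queueSettings:
        # One worker needs 4 GPUs, the other needs 2
        multiNode: [ 4, 2 ]
      templateOverrides:
        resources:
          limits:
            # Each pod is limited to 2 GPUs
            nvidia.com/gpu: 2
```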

View File

@@ -1,9 +1,18 @@
---
-title: ClearML Presign Service
+title: ClearML S3 Presign Service
---
The ClearML Presign Service is a secure service that generates and redirects pre-signed storage URLs for authenticated
-users, enabling direct access to cloud-hosted data (e.g., S3) without exposing credentials.
+users, enabling direct access to S3 data without exposing credentials.
+When configured, the ClearML WebApp automatically redirects requests for matching storage URLs (like `s3://...`) to the
+Presign Service. The service:
+* Authenticates the user with ClearML.
+* Generates a temporary, secure (pre-signed) S3 URL.
+* Redirects the user's browser to the URL for direct access.
+This setup ensures secure access to S3-hosted data.
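For intuition, pre-signing is the same mechanism the AWS CLI exposes directly; the service performs the equivalent of the following on the user's behalf (illustrative only, not part of the deployment, and the bucket/key are made up):

```bash
# Generate a temporary URL for an S3 object, valid for one hour
aws s3 presign s3://my-bucket/artifacts/model.bin --expires-in 3600
```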
## Prerequisites
@@ -12,7 +21,7 @@ users, enabling direct access to cloud-hosted data (e.g., S3) without exposing c
the ClearML UI (**Settings > Workspace > API Credentials > Create new credentials**). For more information, see [ClearML API Credentials](../../../webapp/settings/webapp_settings_profile.md#clearml-api-credentials).
:::note
-Make sure these credentials belong to an admin user or a service user with admin privileges.
+Make sure these credentials belong to an admin user or a service account with admin privileges.
:::
- The worker environment must be able to access the ClearML Server over the same network.
@@ -27,7 +36,7 @@ Add the ClearML Helm repository:
helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <HELM_REPO_TOKEN> --password <HELM_REPO_TOKEN>
```
-Update the repository locally:
+Update the local repository:
```bash
helm repo update
```
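To confirm the repository was added and to find the exact chart name for the Presign Service (the chart name is not shown in this commit, so listing the repo beats guessing):

```bash
helm search repo clearml-enterprise
```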

View File

@@ -1,13 +1,15 @@
---
-title: ClearML Tenant with Self Signed Certificates
+title: Kubernetes Deployment with Self-Signed Certificates
---
-This guide covers how to configure the [AI Application Gateway](#ai-application-gateway) and [ClearML Agent](#clearml-agent)
+This guide covers how to configure the [AI Application Gateway](../appgw.md) and [ClearML Agent](../agent_k8s.md)
to use self-signed or custom SSL certificates.
-## AI Application Gateway
+## Certificate Configuration
-To configure certificates for the Application Gateway, update your `clearml-app-gateway-values.override.yaml` file:
+To configure certificates, update the applicable overrides file:
+* For AI Application Gateway: `clearml-app-gateway-values.override.yaml` file
+* For ClearML Agent: `clearml-agent-values.override.yaml` file
```yaml
# -- Custom certificates
@@ -72,83 +74,7 @@ customCertificates:
-----END CERTIFICATE-----
```
-### Apply Changes
-To apply the changes, run the update command:
-```bash
-helm upgrade -i <RELEASE_NAME> -n <WORKLOAD_NAMESPACE> clearml-enterprise/clearml-enterprise-app-gateway --version <CHART_VERSION> -f clearml-app-gateway-values.override.yaml
-```
-## ClearML Agent
-For the ClearML Agent, configure certificates in the `clearml-agent-values.override.yaml` file:
-```yaml
-# -- Custom certificates
-customCertificates:
-  # -- Override system crt certificate bundle. Mutual exclusive with extraCerts.
-  overrideCaCertificatesCrt:
-  # -- Extra certs usable in case of needs of adding more certificates to the standard bundle, Requires root permissions to run update-ca-certificates. Mutual exclusive with overrideCaCertificatesCrt.
-  extraCerts:
-    - alias: certificateName
-      pem: |
-        -----BEGIN CERTIFICATE-----
-        ###
-        -----END CERTIFICATE-----
-```
-You have two configuration options:
-- [**Replace**](#replace-entire-ca-certificatescrt-file-1) the entire `ca-certificates.crt` file
-- [**Append**](#append-extra-certificates-to-the-existing-ca-certificatescrt-1) extra certificates to the existing `ca-certificates.crt`
-### Replace Entire ca-certificates.crt File
-To replace the whole ca-bundle, provide a concatenated list of all trusted CA certificates in `pem` format as
-they are stored in a standard `ca-certificates.crt`.
-```yaml
-# -- Custom certificates
-customCertificates:
-  # -- Override system crt certificate bundle. Mutual exclusive with extraCerts.
-  overrideCaCertificatesCrt: |
-    -----BEGIN CERTIFICATE-----
-    ### CERT 1
-    -----END CERTIFICATE-----
-    -----BEGIN CERTIFICATE-----
-    ### CERT 2
-    -----END CERTIFICATE-----
-    -----BEGIN CERTIFICATE-----
-    ### CERT 3
-    -----END CERTIFICATE-----
-    ...
-```
-### Append Extra Certificates to the Existing ca-certificates.crt
-You can add certificates to the existing CA bundle. Each certificate must have a unique `alias`.
-```yaml
-# -- Custom certificates
-customCertificates:
-  # -- Extra certs usable in case of needs of adding more certificates to the standard bundle, Requires root permissions to run update-ca-certificates. Mutual exclusive with overrideCaCertificatesCrt.
-  extraCerts:
-    - alias: certificate-name-1
-      pem: |
-        -----BEGIN CERTIFICATE-----
-        ###
-        -----END CERTIFICATE-----
-    - alias: certificate-name-2
-      pem: |
-        -----BEGIN CERTIFICATE-----
-        ###
-        -----END CERTIFICATE-----
-```
-### Add Certificates to Task Pods
+### ClearML Agent: Add Certificates to Task Pods
If your workloads need access to these certificates (e.g., for HTTPS requests), configure the agent to inject them into pods:
@@ -194,8 +120,15 @@ Their names are usually prefixed with the Helm release name, so adjust according
### Apply Changes
-Apply the changes by running the update command:
+To apply the changes, run the update command:
+* For AI Application Gateway:
-``` bash
-helm upgrade -i -n <WORKER_NAMESPACE> clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml
-```
+```bash
+helm upgrade -i <RELEASE_NAME> -n <WORKLOAD_NAMESPACE> clearml-enterprise/clearml-enterprise-app-gateway --version <CHART_VERSION> -f clearml-app-gateway-values.override.yaml
+```
+* For ClearML Agent:
+```bash
+helm upgrade -i -n <WORKER_NAMESPACE> clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml
+```
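After upgrading, one way to sanity-check that the bundle is trusted from inside a running pod is with openssl (pod name, namespace, and target host are illustrative; the bundle path assumes the standard Debian `update-ca-certificates` layout referenced above):

```bash
kubectl exec -n <WORKER_NAMESPACE> <POD_NAME> -- \
  sh -c 'openssl s_client -connect <CLEARML_SERVER_HOST>:443 -CAfile /etc/ssl/certs/ca-certificates.crt </dev/null'
```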

View File

@@ -6,13 +6,8 @@ ClearML Enterprise Server supports various Single Sign-On (SSO) identity provide
SSO configuration is managed via environment variables in your `clearml-values.override.yaml` file and is applied to the
`apiserver` component.
-The following are configuration examples for commonly used providers. Other supported systems include:
-* Auth0
-* Keycloak
-* Okta
-* Azure AD
-* Google
-* AWS Cognito
+The following are configuration examples for commonly used identity providers. See [full list of supported identity providers](../../../webapp/settings/webapp_settings_id_providers.md).
## Auth0
@@ -52,7 +47,7 @@ apiserver:
        value: "true"
```
-## Group Membership Mapping in Keycloak
+### Group Membership Mapping in Keycloak
To map Keycloak groups into the ClearML user's SSO token:

View File

@@ -1,8 +1,14 @@
---
-title: ClearML Dynamic MIG Operator (CDMO)
+title: Managing GPU Fractions with ClearML Dynamic MIG Operator (CDMO)
---
-The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG (Multi-Instance GPU) configurations.
+This guide covers using GPU fractions in Kubernetes clusters using NVIDIA MIGs and
+ClearML's Dynamic MIG Operator (CDMO). CDMO enables dynamic MIG (Multi-Instance GPU) configurations.
+This guide covers:
+* Installing CDMO
+* Enabling MIG mode on your cluster
+* Managing GPU partitioning dynamically
## Installation
@@ -78,7 +84,7 @@ The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG (Multi-Instance GPU
* For convenience, this command can be run from within the `nvidia-device-plugin-daemonset` pod running on the related node.
:::
-1. Label all MIG-enabled GPU node `<NODE_NAME>` from the previous step:
+1. Label all MIG-enabled GPU nodes `<NODE_NAME>` from the previous step:
```bash
kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning=mig"
@@ -106,7 +112,7 @@ To disable MIG mode and restore standard full-GPU access:
nvidia-smi -mig 0
```
-4. Edit the `gpu-operator.override.yaml` file to restore full-GPU access, and upgrade the `gpu-operator`:
+4. Edit the `gpu-operator.override.yaml` file to restore full-GPU access:
```yaml
toolkit:
@@ -129,4 +135,10 @@ To disable MIG mode and restore standard full-GPU access:
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
-```
+```
+5. Upgrade the `gpu-operator`:
+```bash
+helm upgrade -n gpu-operator gpu-operator nvidia/gpu-operator -f gpu-operator.override.yaml
+```
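To verify the node is back in full-GPU mode (or, earlier, that MIG was enabled), `nvidia-smi` can report the current and pending MIG state. A suggested check, run on the node itself:

```bash
# "Enabled"/"Disabled" per GPU; pending shows the state after the next reset
nvidia-smi --query-gpu=index,mig.mode.current,mig.mode.pending --format=csv
```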

View File

@@ -16,7 +16,7 @@ helm repo update
### Requirements
-* Install the NVIDIA `gpu-operator` using Helm
+* Install the NVIDIA `gpu-operator` using Helm. For instructions, see [Basic Deployment](../extra_configs/gpu_operator.md).
* Set the number of GPU slices to 8
* Add and update the Nvidia Helm repo:
@@ -191,7 +191,7 @@ Valid values for `"<GPU_FRACTION_VALUE>"` include:
### ClearML Agent Configuration
-To run ClearML jobs with fractional GPU allocation, configure your queues in accordingly in your `clearml-agent-values.override.yaml` file.
+To run ClearML jobs with fractional GPU allocation, configure your queues in your `clearml-agent-values.override.yaml` file.
Each queue should include a `templateOverride` that sets the `clearml-injector/fraction` label, which determines the
fraction of a GPU to allocate (e.g., "0.500" for half a GPU).
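As a sketch, such a queue definition might look like this. Only the `clearml-injector/fraction` label and its value format are confirmed by the text above; the queue name and the `templateOverrides.labels` nesting are assumptions modeled on the multi-node example earlier in this commit:

```yaml
agentk8sglue:
  queues:
    half-gpu-queue:
      templateOverrides:
        labels:
          # Tasks from this queue get half a GPU each
          clearml-injector/fraction: "0.500"
```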

View File

@@ -50,7 +50,7 @@ Add the ClearML Helm repository:
helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <HELM_REPO_TOKEN> --password <HELM_REPO_TOKEN>
```
-Update the repository locally:
+Update the local repository:
``` bash
helm repo update
```

View File

@@ -709,7 +709,7 @@ The following features can be assigned to groups via the `features` configuratio
| `reports` | Enables access to Reports. | <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
| `resource_dashboard` | Enables access to the compute resource dashboard feature. | <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
| `sso_management` | Enables the SSO (Single Sign-On) configuration wizard. | <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
-| `service_users` | Enables support for creating and managing service users (API keys). | <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
+| `service_users` | Enables support for creating and managing service accounts (API keys). | <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
| `resource_policy` | Enables the resource policy feature. | May default to a trial feature if not explicitly enabled. |
| `model_serving` | Enables access to the model serving endpoints feature. | <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
| `show_dashboard` | Makes the "Dashboard" menu item visible in the UI sidebar. | <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |

View File

@@ -29,7 +29,7 @@ title: Version 3.25
* Display per-GPU metrics in "CPU and GPU Usage" and "Video Memory" graphs when multiple GPUs are available
* Add "GPU Count" column to the Resource Groups table in the Orchestration Dashboard
* Add global search bar to all UI pages
-* Enable setting service users as admins
+* Enable setting service accounts as admins
* Add filter to UI Model Endpoints table
* Add UI scalar viewing configuration on a per-project basis ([ClearML GitHub issue #1377](https://github.com/clearml/clearml/issues/1377))
* Add clicking project name in breadcrumbs of full-screen task opens the task in details view ([ClearML GitHub issue #1376](https://github.com/clearml/clearml/issues/1376))
@@ -42,7 +42,7 @@ title: Version 3.25
* Fix EMA smoothing in UI scalars is incorrect in first data point ([ClearML Web GitHub issue #101](https://github.com/clearml/clearml-web/issues/101))
* Improve UI scalar smoothing algorithms (ClearML Web GitHub issues [#101](https://github.com/clearml/clearml-web/issues/101), [#102](https://github.com/clearml/clearml-web/issues/102), [#103](https://github.com/clearml/clearml-web/issues/103))
* Fix UI Users & Groups table's "Groups" column data remains condensed after column is expanded
-* Fix setting service users as admins causes apiserver to crash
+* Fix setting service accounts as admins causes apiserver to crash
* Fix UI "New Dataview" modal's version selection sometimes does not display draft versions
* Fix GCS and Azure credential input popups not displaying in UI task debug samples
* Fix UI pipeline "Preview" tab sometimes displays "Failed to get plot charts" error