mirror of
https://github.com/clearml/clearml-docs
synced 2025-05-23 21:44:45 +00:00
Edit Enterprise Server pages
This commit is contained in:
commit
ac733ba3f0
@ -11,7 +11,7 @@ The ClearML Agent enables scheduling and executing distributed experiments on a
|
||||
the ClearML UI (**Settings > Workspace > API Credentials > Create new credentials**). For more information, see [ClearML API Credentials](../../webapp/settings/webapp_settings_profile.md#clearml-api-credentials).
|
||||
|
||||
:::note
|
||||
Make sure these credentials belong to an admin user or a service user with admin privileges.
|
||||
Make sure these credentials belong to an admin user or a service account with admin privileges.
|
||||
:::
|
||||
|
||||
- The worker environment must be able to access the ClearML Server over the same network.
|
||||
@ -26,7 +26,7 @@ Add the ClearML Helm repository:
|
||||
helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <HELM_REPO_TOKEN> --password <HELM_REPO_TOKEN>
|
||||
```
|
||||
|
||||
Update the repository locally:
|
||||
Update the local repository:
|
||||
```bash
|
||||
helm repo update
|
||||
```
|
||||
|
@ -18,8 +18,8 @@ Arguments passed to the function include:
|
||||
* `queue` (string) - ID of the queue from which the task was pulled.
|
||||
* `queue_name` (string) - Name of the queue from which the task was pulled.
|
||||
* `template` (Python dictionary) - Base Pod template created from the agent's configuration and any queue-specific overrides.
|
||||
* `task_data` (object) - Task data object (as returned by the `tasks.get_by_id` API call). For example, use `task_data.project` to get the task's project ID.
|
||||
* `providers_info` (dictionary) - Provider info containing optional information collected for the user running this task
|
||||
* `task_data` (object) - [Task object](../../../references/sdk/task.md) (as returned by the `tasks.get_by_id` API call). For example, use `task_data.project` to get the task's project ID.
|
||||
* `providers_info` (dictionary) - [Identity provider](sso_login.md) info containing optional information collected for the user running this task
|
||||
when the user logged into the system (requires additional server configuration).
|
||||
* `task_config` (`clearml_agent.backend_config.Config` object) - Task configuration containing configuration vaults applicable
|
||||
for the user running this task, and other configuration. Use `task_config.get("...")` to get specific configuration values.
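The arguments listed above can be pictured with a minimal sketch. The function name `edit_template`, the label keys, and the stand-in objects are all hypothetical and chosen for illustration only. The text above does not say whether the agent expects a returned dictionary or an in-place change, so the sketch returns a modified copy.

```python
from copy import deepcopy
from types import SimpleNamespace


def edit_template(queue, queue_name, template, task_data,
                  providers_info, task_config, **kwargs):
    """Hypothetical hook: label the Pod with the task's project and queue."""
    pod = deepcopy(template)  # leave the base template untouched
    labels = pod.setdefault("metadata", {}).setdefault("labels", {})
    # task_data.project holds the task's project ID (see the argument list above)
    labels["example/project-id"] = str(task_data.project)
    labels["example/queue-name"] = queue_name
    return pod


# Stand-ins for the objects the agent would normally pass in (assumed shapes)
base_template = {"spec": {"containers": []}}
fake_task = SimpleNamespace(project="abc123")

result = edit_template(
    queue="queue-id",
    queue_name="default",
    template=base_template,
    task_data=fake_task,
    providers_info={},
    task_config=None,
)
print(result["metadata"]["labels"])
```

Working on a copy keeps the base template reusable across tasks, which may or may not matter depending on how the agent calls the function.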
|
||||
@ -248,11 +248,8 @@ agentk8sglue:
|
||||
- mountPath: "/tmp/task/"
|
||||
name: task-pvc
|
||||
```
|
||||
:::
|
||||
|
||||
### Example: Required Role
|
||||
|
||||
The following is an example of `custom-agent-role` Role with permissions to handle `persistentvolumeclaims`:
|
||||
* The following is an example of `custom-agent-role` Role with permissions to handle `persistentvolumeclaims`:
|
||||
|
||||
```yaml
|
||||
apiVersion: rbac.authorization.k8s.io/v1
|
||||
@ -271,4 +268,6 @@ rules:
|
||||
- create
|
||||
- patch
|
||||
- delete
|
||||
```
|
||||
```
|
||||
|
||||
:::
|
@ -12,7 +12,7 @@ Add the NVIDIA GPU Operator Helm repository:
|
||||
helm repo add nvidia https://nvidia.github.io/gpu-operator
|
||||
```
|
||||
|
||||
Update the repository locally:
|
||||
Update the local repository:
|
||||
```bash
|
||||
helm repo update
|
||||
```
|
||||
|
@ -2,10 +2,28 @@
|
||||
title: Multi-Node Training
|
||||
---
|
||||
|
||||
The ClearML Enterprise Agent supports horizontal multi-node training, allowing a single Task to run across multiple pods
|
||||
The ClearML Enterprise Agent supports horizontal multi-node training, allowing a single ClearML Task to run across multiple pods
|
||||
on different nodes.
|
||||
|
||||
Below is a configuration example using `clearml-agent-values.override.yaml`:
|
||||
This is useful for distributed training where the training job needs to span multiple GPUs and potentially
|
||||
multiple nodes.
|
||||
|
||||
To enable multi-node scheduling, set both `agentk8sglue.serviceAccountClusterAccess` and `agentk8sglue.multiNode` to `true`.
|
||||
|
||||
Multi-node behavior is controlled using the `multiNode` key in a queue configuration. This setting tells the
|
||||
agent how to divide a Task's GPU requirements across multiple pods, with each pod running a part of the training job.
|
||||
|
||||
Below is a configuration example using `clearml-agent-values.override.yaml` to enable multi-node training.
|
||||
|
||||
In this example:
|
||||
* The `multiNode: [4, 2]` setting splits the Task into two workloads:
|
||||
* One workload will need 4 GPUs
|
||||
* The other workload will need 2 GPUs
|
||||
* The GPU limit per pod is set to `nvidia.com/gpu: 2`, meaning each pod will be limited to 2 GPUs
|
||||
|
||||
With this setup:
|
||||
* The first workload (which needs 4 GPUs) will be scheduled as 2 pods, each with 2 GPUs
|
||||
* The second workload (which needs 2 GPUs) will be scheduled as 1 pod with 2 GPUs
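The arithmetic behind this setup can be sketched in a few lines of Python. This is an illustration only, not the agent's implementation: the helper name `pods_per_workload` is hypothetical, and it assumes each workload's GPU count divides evenly by the per-pod GPU limit, as in the example above.

```python
def pods_per_workload(multi_node, gpus_per_pod):
    """Return how many pods each workload needs at a fixed per-pod GPU limit."""
    if gpus_per_pod <= 0:
        raise ValueError("gpus_per_pod must be positive")
    plan = []
    for gpus in multi_node:
        if gpus % gpus_per_pod != 0:
            # Uneven splits are outside what the example above describes
            raise ValueError(f"{gpus} GPUs do not divide evenly into pods of {gpus_per_pod}")
        plan.append(gpus // gpus_per_pod)
    return plan


# multiNode: [4, 2] with a limit of nvidia.com/gpu: 2 per pod
print(pods_per_workload([4, 2], 2))  # [2, 1]
```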
|
||||
|
||||
```yaml
|
||||
agentk8sglue:
|
||||
@ -17,7 +35,7 @@ agentk8sglue:
|
||||
queues:
|
||||
multi-node-example:
|
||||
queueSettings:
|
||||
# Defines the distribution of GPUs Tasks across multiple nodes. The format [x, y, ...] specifies the distribution of Tasks as 'x' GPUs on a node and 'y' GPUs on another node. Multiple Pods will be spawned respectively based on the lowest-common-denominator defined.
|
||||
# Defines GPU needs per worker (e.g., 4 GPUs and 2 GPUs). Multiple Pods will be spawned respectively based on the lowest-common-denominator defined.
|
||||
multiNode: [ 4, 2 ]
|
||||
templateOverrides:
|
||||
resources:
|
||||
|
@ -1,9 +1,18 @@
|
||||
---
|
||||
title: ClearML Presign Service
|
||||
title: ClearML S3 Presign Service
|
||||
---
|
||||
|
||||
The ClearML Presign Service is a secure service that generates and redirects pre-signed storage URLs for authenticated
|
||||
users, enabling direct access to cloud-hosted data (e.g., S3) without exposing credentials.
|
||||
users, enabling direct access to S3 data without exposing credentials.
|
||||
|
||||
When configured, the ClearML WebApp automatically redirects requests for matching storage URLs (like `s3://...`) to the
|
||||
Presign Service. The service:
|
||||
|
||||
* Authenticates the user with ClearML.
|
||||
* Generates a temporary, secure (pre-signed) S3 URL.
|
||||
* Redirects the user's browser to the URL for direct access.
|
||||
|
||||
This setup ensures secure access to S3-hosted data.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
@ -12,7 +21,7 @@ users, enabling direct access to cloud-hosted data (e.g., S3) without exposing c
|
||||
the ClearML UI (**Settings > Workspace > API Credentials > Create new credentials**). For more information, see [ClearML API Credentials](../../../webapp/settings/webapp_settings_profile.md#clearml-api-credentials).
|
||||
|
||||
:::note
|
||||
Make sure these credentials belong to an admin user or a service user with admin privileges.
|
||||
Make sure these credentials belong to an admin user or a service account with admin privileges.
|
||||
:::
|
||||
|
||||
- The worker environment must be able to access the ClearML Server over the same network.
|
||||
@ -27,7 +36,7 @@ Add the ClearML Helm repository:
|
||||
helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <HELM_REPO_TOKEN> --password <HELM_REPO_TOKEN>
|
||||
```
|
||||
|
||||
Update the repository locally:
|
||||
Update the local repository:
|
||||
```bash
|
||||
helm repo update
|
||||
```
|
||||
|
@ -1,13 +1,15 @@
|
||||
---
|
||||
title: ClearML Tenant with Self Signed Certificates
|
||||
title: Kubernetes Deployment with Self-Signed Certificates
|
||||
---
|
||||
|
||||
This guide covers how to configure the [AI Application Gateway](#ai-application-gateway) and [ClearML Agent](#clearml-agent)
|
||||
This guide covers how to configure the [AI Application Gateway](../appgw.md) and [ClearML Agent](../agent_k8s.md)
|
||||
to use self-signed or custom SSL certificates.
|
||||
|
||||
## AI Application Gateway
|
||||
## Certificate Configuration
|
||||
|
||||
To configure certificates for the Application Gateway, update your `clearml-app-gateway-values.override.yaml` file:
|
||||
To configure certificates, update the applicable overrides file:
|
||||
* For AI Application Gateway: `clearml-app-gateway-values.override.yaml` file
|
||||
* For ClearML Agent: `clearml-agent-values.override.yaml` file
|
||||
|
||||
```yaml
|
||||
# -- Custom certificates
|
||||
@ -72,83 +74,7 @@ customCertificates:
|
||||
-----END CERTIFICATE-----
|
||||
```
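Because `overrideCaCertificatesCrt` and `extraCerts` are mutually exclusive, and each extra certificate needs its own `alias`, a small pre-flight check on the values can catch mistakes before running Helm. The sketch below is a hypothetical helper, not part of the chart. It assumes the `customCertificates` section has already been loaded into a Python dictionary.

```python
def check_custom_certificates(custom):
    """Return a list of problems found in a customCertificates mapping."""
    problems = []
    override = custom.get("overrideCaCertificatesCrt")
    extra = custom.get("extraCerts") or []

    # The two options cannot be combined
    if override and extra:
        problems.append("overrideCaCertificatesCrt and extraCerts are mutually exclusive")

    # Every extra certificate needs a distinct alias
    aliases = [cert.get("alias") for cert in extra if cert.get("alias") is not None]
    duplicates = sorted({a for a in aliases if aliases.count(a) > 1})
    if duplicates:
        problems.append("duplicate aliases: " + ", ".join(duplicates))
    return problems


values = {
    "overrideCaCertificatesCrt": None,
    "extraCerts": [
        {"alias": "certificate-name-1", "pem": "..."},
        {"alias": "certificate-name-1", "pem": "..."},
    ],
}
print(check_custom_certificates(values))
```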
|
||||
|
||||
### Apply Changes
|
||||
|
||||
To apply the changes, run the update command:
|
||||
|
||||
```bash
|
||||
helm upgrade -i <RELEASE_NAME> -n <WORKLOAD_NAMESPACE> clearml-enterprise/clearml-enterprise-app-gateway --version <CHART_VERSION> -f clearml-app-gateway-values.override.yaml
|
||||
```
|
||||
|
||||
## ClearML Agent
|
||||
|
||||
For the ClearML Agent, configure certificates in the `clearml-agent-values.override.yaml` file:
|
||||
|
||||
```yaml
|
||||
# -- Custom certificates
|
||||
customCertificates:
|
||||
# -- Override system crt certificate bundle. Mutual exclusive with extraCerts.
|
||||
overrideCaCertificatesCrt:
|
||||
# -- Extra certs usable in case of needs of adding more certificates to the standard bundle, Requires root permissions to run update-ca-certificates. Mutual exclusive with overrideCaCertificatesCrt.
|
||||
extraCerts:
|
||||
- alias: certificateName
|
||||
pem: |
|
||||
-----BEGIN CERTIFICATE-----
|
||||
###
|
||||
-----END CERTIFICATE-----
|
||||
```
|
||||
|
||||
You have two configuration options:
|
||||
|
||||
- [**Replace**](#replace-entire-ca-certificatescrt-file-1) the entire `ca-certificates.crt` file
|
||||
- [**Append**](#append-extra-certificates-to-the-existing-ca-certificatescrt-1) extra certificates to the existing `ca-certificates.crt`
|
||||
|
||||
|
||||
### Replace Entire ca-certificates.crt File
|
||||
|
||||
To replace the whole ca-bundle, provide a concatenated list of all trusted CA certificates in `pem` format as
|
||||
they are stored in a standard `ca-certificates.crt`.
|
||||
|
||||
|
||||
```yaml
|
||||
# -- Custom certificates
|
||||
customCertificates:
|
||||
# -- Override system crt certificate bundle. Mutual exclusive with extraCerts.
|
||||
overrideCaCertificatesCrt: |
|
||||
-----BEGIN CERTIFICATE-----
|
||||
### CERT 1
|
||||
-----END CERTIFICATE-----
|
||||
-----BEGIN CERTIFICATE-----
|
||||
### CERT 2
|
||||
-----END CERTIFICATE-----
|
||||
-----BEGIN CERTIFICATE-----
|
||||
### CERT 3
|
||||
-----END CERTIFICATE-----
|
||||
...
|
||||
```
|
||||
|
||||
### Append Extra Certificates to the Existing ca-certificates.crt
|
||||
|
||||
You can add certificates to the existing CA bundle. Each certificate must have a unique `alias`.
|
||||
|
||||
```yaml
|
||||
# -- Custom certificates
|
||||
customCertificates:
|
||||
# -- Extra certs usable in case of needs of adding more certificates to the standard bundle, Requires root permissions to run update-ca-certificates. Mutual exclusive with overrideCaCertificatesCrt.
|
||||
extraCerts:
|
||||
- alias: certificate-name-1
|
||||
pem: |
|
||||
-----BEGIN CERTIFICATE-----
|
||||
###
|
||||
-----END CERTIFICATE-----
|
||||
- alias: certificate-name-2
|
||||
pem: |
|
||||
-----BEGIN CERTIFICATE-----
|
||||
###
|
||||
-----END CERTIFICATE-----
|
||||
```
|
||||
|
||||
### Add Certificates to Task Pods
|
||||
### ClearML Agent: Add Certificates to Task Pods
|
||||
|
||||
If your workloads need access to these certificates (e.g., for HTTPS requests), configure the agent to inject them into pods:
|
||||
|
||||
@ -194,8 +120,15 @@ Their names are usually prefixed with the Helm release name, so adjust according
|
||||
|
||||
### Apply Changes
|
||||
|
||||
Apply the changes by running the update command:
|
||||
To apply the changes, run the update command:
|
||||
* For AI Application Gateway:
|
||||
|
||||
``` bash
|
||||
helm upgrade -i -n <WORKER_NAMESPACE> clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml
|
||||
```
|
||||
```bash
|
||||
helm upgrade -i <RELEASE_NAME> -n <WORKLOAD_NAMESPACE> clearml-enterprise/clearml-enterprise-app-gateway --version <CHART_VERSION> -f clearml-app-gateway-values.override.yaml
|
||||
```
|
||||
|
||||
* For ClearML Agent:
|
||||
|
||||
```bash
|
||||
helm upgrade -i -n <WORKER_NAMESPACE> clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml
|
||||
```
|
@ -6,13 +6,8 @@ ClearML Enterprise Server supports various Single Sign-On (SSO) identity provide
|
||||
SSO configuration is managed via environment variables in your `clearml-values.override.yaml` file and is applied to the
|
||||
`apiserver` component.
|
||||
|
||||
The following are configuration examples for commonly used providers. Other supported systems include:
|
||||
* Auth0
|
||||
* Keycloak
|
||||
* Okta
|
||||
* Azure AD
|
||||
* Google
|
||||
* AWS Cognito
|
||||
The following are configuration examples for commonly used identity providers. See the [full list of supported identity providers](../../../webapp/settings/webapp_settings_id_providers.md).
|
||||
|
||||
|
||||
## Auth0
|
||||
|
||||
@ -52,7 +47,7 @@ apiserver:
|
||||
value: "true"
|
||||
```
|
||||
|
||||
## Group Membership Mapping in Keycloak
|
||||
### Group Membership Mapping in Keycloak
|
||||
|
||||
To map Keycloak groups into the ClearML user's SSO token:
|
||||
|
||||
|
@ -1,8 +1,14 @@
|
||||
---
|
||||
title: ClearML Dynamic MIG Operator (CDMO)
|
||||
title: Managing GPU Fractions with ClearML Dynamic MIG Operator (CDMO)
|
||||
---
|
||||
|
||||
The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG (Multi-Instance GPU) configurations.
|
||||
This guide covers how to use GPU fractions in Kubernetes clusters with NVIDIA MIGs and
|
||||
ClearML's Dynamic MIG Operator (CDMO). CDMO enables dynamic MIG (Multi-Instance GPU) configurations.
|
||||
|
||||
This guide covers:
|
||||
* Installing CDMO
|
||||
* Enabling MIG mode on your cluster
|
||||
* Managing GPU partitioning dynamically
|
||||
|
||||
## Installation
|
||||
|
||||
@ -78,7 +84,7 @@ The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG (Multi-Instance GPU
|
||||
* For convenience, this command can be run from within the `nvidia-device-plugin-daemonset` pod running on the related node.
|
||||
:::
|
||||
|
||||
1. Label all MIG-enabled GPU node `<NODE_NAME>` from the previous step:
|
||||
1. Label all MIG-enabled GPU nodes `<NODE_NAME>` from the previous step:
|
||||
|
||||
```bash
|
||||
kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning=mig"
|
||||
@ -106,7 +112,7 @@ To disable MIG mode and restore standard full-GPU access:
|
||||
nvidia-smi -mig 0
|
||||
```
|
||||
|
||||
4. Edit the `gpu-operator.override.yaml` file to restore full-GPU access, and upgrade the `gpu-operator`:
|
||||
4. Edit the `gpu-operator.override.yaml` file to restore full-GPU access:
|
||||
|
||||
```yaml
|
||||
toolkit:
|
||||
@ -129,4 +135,10 @@ To disable MIG mode and restore standard full-GPU access:
|
||||
value: all
|
||||
- name: NVIDIA_DRIVER_CAPABILITIES
|
||||
value: all
|
||||
```
|
||||
```
|
||||
|
||||
5. Upgrade the `gpu-operator`:
|
||||
|
||||
```bash
|
||||
helm upgrade -n gpu-operator gpu-operator nvidia/gpu-operator -f gpu-operator.override.yaml
|
||||
```
|
@ -16,7 +16,7 @@ helm repo update
|
||||
|
||||
### Requirements
|
||||
|
||||
* Install the NVIDIA `gpu-operator` using Helm
|
||||
* Install the NVIDIA `gpu-operator` using Helm. For instructions, see [Basic Deployment](../extra_configs/gpu_operator.md).
|
||||
* Set the number of GPU slices to 8
|
||||
* Add and update the Nvidia Helm repo:
|
||||
|
||||
@ -191,7 +191,7 @@ Valid values for `"<GPU_FRACTION_VALUE>"` include:
|
||||
|
||||
### ClearML Agent Configuration
|
||||
|
||||
To run ClearML jobs with fractional GPU allocation, configure your queues in accordingly in your `clearml-agent-values.override.yaml` file.
|
||||
To run ClearML jobs with fractional GPU allocation, configure your queues in your `clearml-agent-values.override.yaml` file.
|
||||
|
||||
Each queue should include a `templateOverride` that sets the `clearml-injector/fraction` label, which determines the
|
||||
fraction of a GPU to allocate (e.g., "0.500" for half a GPU).
|
||||
|
@ -50,7 +50,7 @@ Add the ClearML Helm repository:
|
||||
helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <HELM_REPO_TOKEN> --password <HELM_REPO_TOKEN>
|
||||
```
|
||||
|
||||
Update the repository locally:
|
||||
Update the local repository:
|
||||
``` bash
|
||||
helm repo update
|
||||
```
|
||||
|
@ -709,7 +709,7 @@ The following features can be assigned to groups via the `features` configuratio
|
||||
| `reports` | Enables access to Reports. | <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|
||||
| `resource_dashboard` | Enables access to the compute resource dashboard feature. | <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|
||||
| `sso_management` | Enables the SSO (Single Sign-On) configuration wizard. | <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|
||||
| `service_users` | Enables support for creating and managing service users (API keys). | <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|
||||
| `service_users` | Enables support for creating and managing service accounts (API keys). | <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|
||||
| `resource_policy` | Enables the resource policy feature. | May default to a trial feature if not explicitly enabled. |
|
||||
| `model_serving` | Enables access to the model serving endpoints feature. | <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|
||||
| `show_dashboard` | Makes the "Dashboard" menu item visible in the UI sidebar. | <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|
||||
|
@ -29,7 +29,7 @@ title: Version 3.25
|
||||
* Display per-GPU metrics in "CPU and GPU Usage" and "Video Memory" graphs when multiple GPUs are available
|
||||
* Add "GPU Count" column to the Resource Groups table in the Orchestration Dashboard
|
||||
* Add global search bar to all UI pages
|
||||
* Enable setting service users as admins
|
||||
* Enable setting service accounts as admins
|
||||
* Add filter to UI Model Endpoints table
|
||||
* Add UI scalar viewing configuration on a per-project basis ([ClearML GitHub issue #1377](https://github.com/clearml/clearml/issues/1377))
|
||||
* Add clicking project name in breadcrumbs of full-screen task opens the task in details view ([ClearML GitHub issue #1376](https://github.com/clearml/clearml/issues/1376))
|
||||
@ -42,7 +42,7 @@ title: Version 3.25
|
||||
* Fix EMA smoothing in UI scalars is incorrect in first data point ([ClearML Web GitHub issue #101](https://github.com/clearml/clearml-web/issues/101))
|
||||
* Improve UI scalar smoothing algorithms (ClearML Web GitHub issues [#101](https://github.com/clearml/clearml-web/issues/101), [#102](https://github.com/clearml/clearml-web/issues/102), [#103](https://github.com/clearml/clearml-web/issues/103))
|
||||
* Fix UI Users & Groups table's "Groups" column data remains condensed after column is expanded
|
||||
* Fix setting service users as admins causes apiserver to crash
|
||||
* Fix setting service accounts as admins causes apiserver to crash
|
||||
* Fix UI "New Dataview" modal's version selection sometimes does not display draft versions
|
||||
* Fix GCS and Azure credential input popups not displaying in UI task debug samples
|
||||
* Fix UI pipeline "Preview" tab sometimes displays "Failed to get plot charts" error
|
||||
|