From f0e9cbe027282ae1a76d2670be2bc1e2fa332254 Mon Sep 17 00:00:00 2001 From: revital Date: Tue, 20 May 2025 13:33:57 +0300 Subject: [PATCH 01/10] Edits --- .../enterprise_deploy/agent_k8s.md | 4 +- .../dynamic_edit_task_pod_template.md | 13 +-- .../extra_configs/gpu_operator.md | 2 +- .../extra_configs/multi_node_training.md | 24 +++- .../extra_configs/presign_service.md | 17 ++- .../extra_configs/self_signed_certificates.md | 103 +++--------------- .../extra_configs/sso_login.md | 11 +- .../enterprise_deploy/fractional_gpus/cdmo.md | 64 +++-------- .../enterprise_deploy/fractional_gpus/cfgi.md | 4 +- .../enterprise_deploy/k8s.md | 2 +- .../enterprise_deploy/multi_tenant_k8s.md | 2 +- .../clearml_server/enterprise/ver_3_25.md | 4 +- 12 files changed, 88 insertions(+), 162 deletions(-) diff --git a/docs/deploying_clearml/enterprise_deploy/agent_k8s.md b/docs/deploying_clearml/enterprise_deploy/agent_k8s.md index e2c2c4dc..02a38de7 100644 --- a/docs/deploying_clearml/enterprise_deploy/agent_k8s.md +++ b/docs/deploying_clearml/enterprise_deploy/agent_k8s.md @@ -11,7 +11,7 @@ The ClearML Agent enables scheduling and executing distributed experiments on a the ClearML UI (**Settings > Workspace > API Credentials > Create new credentials**). For more information, see [ClearML API Credentials](../../webapp/settings/webapp_settings_profile.md#clearml-api-credentials). :::note - Make sure these credentials belong to an admin user or a service user with admin privileges. + Make sure these credentials belong to an admin user or a service account with admin privileges. ::: - The worker environment must be able to access the ClearML Server over the same network. 
@@ -26,7 +26,7 @@ Add the ClearML Helm repository: helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username --password ``` -Update the repository locally: +Update the local repository: ```bash helm repo update ``` diff --git a/docs/deploying_clearml/enterprise_deploy/extra_configs/dynamic_edit_task_pod_template.md b/docs/deploying_clearml/enterprise_deploy/extra_configs/dynamic_edit_task_pod_template.md index c7e651e5..a78196a0 100644 --- a/docs/deploying_clearml/enterprise_deploy/extra_configs/dynamic_edit_task_pod_template.md +++ b/docs/deploying_clearml/enterprise_deploy/extra_configs/dynamic_edit_task_pod_template.md @@ -18,8 +18,8 @@ Arguments passed to the function include: * `queue` (string) - ID of the queue from which the task was pulled. * `queue_name` (string) - Name of the queue from which the task was pulled. * `template` (Python dictionary) - Base Pod template created from the agent's configuration and any queue-specific overrides. -* `task_data` (object) - Task data object (as returned by the `tasks.get_by_id` API call). For example, use `task_data.project` to get the task's project ID. -* `providers_info` (dictionary) - Provider info containing optional information collected for the user running this task +* `task_data` (object) - [Task object](../../../references/sdk/task.md) (as returned by the `tasks.get_by_id` API call). For example, use `task_data.project` to get the task's project ID. +* `providers_info` (dictionary) - [Identity provider](sso_login.md) info containing optional information collected for the user running this task when the user logged into the system (requires additional server configuration). * `task_config` (`clearml_agent.backend_config.Config` object) - Task configuration containing configuration vaults applicable for the user running this task, and other configuration. Use `task_config.get("...")` to get specific configuration values. 
@@ -248,11 +248,8 @@ agentk8sglue: - mountPath: "/tmp/task/" name: task-pvc ``` -::: -### Example: Required Role - -The following is an example of `custom-agent-role` Role with permissions to handle `persistentvolumeclaims`: +* The following is an example of `custom-agent-role` Role with permissions to handle `persistentvolumeclaims`: ```yaml apiVersion: rbac.authorization.k8s.io/v1 @@ -271,4 +268,6 @@ rules: - create - patch - delete -``` \ No newline at end of file +``` + +::: \ No newline at end of file diff --git a/docs/deploying_clearml/enterprise_deploy/extra_configs/gpu_operator.md b/docs/deploying_clearml/enterprise_deploy/extra_configs/gpu_operator.md index 97b59e75..042418bf 100644 --- a/docs/deploying_clearml/enterprise_deploy/extra_configs/gpu_operator.md +++ b/docs/deploying_clearml/enterprise_deploy/extra_configs/gpu_operator.md @@ -12,7 +12,7 @@ Add the NVIDIA GPU Operator Helm repository: helm repo add nvidia https://nvidia.github.io/gpu-operator ``` -Update the repository locally: +Update the local repository: ```bash helm repo update ``` diff --git a/docs/deploying_clearml/enterprise_deploy/extra_configs/multi_node_training.md b/docs/deploying_clearml/enterprise_deploy/extra_configs/multi_node_training.md index 026a39d6..90ca201b 100644 --- a/docs/deploying_clearml/enterprise_deploy/extra_configs/multi_node_training.md +++ b/docs/deploying_clearml/enterprise_deploy/extra_configs/multi_node_training.md @@ -2,10 +2,28 @@ title: Multi-Node Training --- -The ClearML Enterprise Agent supports horizontal multi-node training, allowing a single Task to run across multiple pods +The ClearML Enterprise Agent supports horizontal multi-node training, allowing a single ClearML Task to run across multiple pods on different nodes. -Below is a configuration example using `clearml-agent-values.override.yaml`: +This is useful for distributed training where the training job needs to span multiple GPUs and potentially +multiple nodes. 
+ +To enable multi-node scheduling, set both `agentk8sglue.serviceAccountClusterAccess` and `agentk8sglue.multiNode` to `true`. + +Multi-node behavior is controlled using the `multiNode` key in a queue configuration. This setting tells the +agent how to divide a Task's GPU requirements across multiple pods, with each pod running a part of the training job. + +Below is a configuration example using `clearml-agent-values.override.yaml` to enable multi-node training. + +In this example: +* The `multiNode: [4, 2]` setting splits the Task into two workloads: + * One workload will need 4 GPUs + * The other workload will need 2 GPUs +* The GPU limit per pod is set to `nvidia.com/gpu: 2`, meaning each pod will be limited to 2 GPUs + +With this setup: +* The first workload (which needs 4 GPUs) will be scheduled as 2 pods, each with 2 GPUs +* The second workload (which needs 2 GPUs) will be scheduled as 1 pod with 2 GPUs ```yaml agentk8sglue: @@ -17,7 +35,7 @@ agentk8sglue: queues: multi-node-example: queueSettings: - # Defines the distribution of GPUs Tasks across multiple nodes. The format [x, y, ...] specifies the distribution of Tasks as 'x' GPUs on a node and 'y' GPUs on another node. Multiple Pods will be spawned respectively based on the lowest-common-denominator defined. + # Defines GPU needs per worker (e.g., 4 GPUs and 2 GPUs). Multiple Pods will be spawned respectively based on the lowest-common-denominator defined. 
multiNode: [ 4, 2 ] templateOverrides: resources: diff --git a/docs/deploying_clearml/enterprise_deploy/extra_configs/presign_service.md b/docs/deploying_clearml/enterprise_deploy/extra_configs/presign_service.md index 80a57359..9207d4e6 100644 --- a/docs/deploying_clearml/enterprise_deploy/extra_configs/presign_service.md +++ b/docs/deploying_clearml/enterprise_deploy/extra_configs/presign_service.md @@ -1,9 +1,18 @@ --- -title: ClearML Presign Service +title: ClearML S3 Presign Service --- The ClearML Presign Service is a secure service that generates and redirects pre-signed storage URLs for authenticated -users, enabling direct access to cloud-hosted data (e.g., S3) without exposing credentials. +users, enabling direct access to S3 data without exposing credentials. + +When configured, the ClearML WebApp automatically redirects requests for matching storage URLs (like `s3://...`) to the +Presign Service. The service: + +* Verifies the user's ClearML authentication. +* Generates a temporary, secure (pre-signed) S3 URL. +* Redirects the user's browser to the URL for direct access. + +This setup ensures secure access to S3-hosted data. ## Prerequisites @@ -12,7 +21,7 @@ users, enabling direct access to cloud-hosted data (e.g., S3) without exposing c the ClearML UI (**Settings > Workspace > API Credentials > Create new credentials**). For more information, see [ClearML API Credentials](../../../webapp/settings/webapp_settings_profile.md#clearml-api-credentials). :::note - Make sure these credentials belong to an admin user or a service user with admin privileges. + Make sure these credentials belong to an admin user or a service account with admin privileges. ::: - The worker environment must be able to access the ClearML Server over the same network. 
@@ -27,7 +36,7 @@ Add the ClearML Helm repository: helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username --password ``` -Update the repository locally: +Update the local repository: ```bash helm repo update ``` diff --git a/docs/deploying_clearml/enterprise_deploy/extra_configs/self_signed_certificates.md b/docs/deploying_clearml/enterprise_deploy/extra_configs/self_signed_certificates.md index e529e20e..f477da7a 100644 --- a/docs/deploying_clearml/enterprise_deploy/extra_configs/self_signed_certificates.md +++ b/docs/deploying_clearml/enterprise_deploy/extra_configs/self_signed_certificates.md @@ -1,13 +1,15 @@ --- -title: ClearML Tenant with Self Signed Certificates +title: Self-Signed Certificates for ClearML Agent and AI App Gateway --- -This guide covers how to configure the [AI Application Gateway](#ai-application-gateway) and [ClearML Agent](#clearml-agent) +This guide covers how to configure the [AI Application Gateway](../appgw.md) and [ClearML Agent](../agent_k8s.md) to use self-signed or custom SSL certificates. 
-## AI Application Gateway +## Certificate Configuration -To configure certificates for the Application Gateway, update your `clearml-app-gateway-values.override.yaml` file: +To configure certificates, update the following files: +* For AI Application Gateway: `clearml-app-gateway-values.override.yaml` file +* For ClearML Agent: `clearml-agent-values.override.yaml` ```yaml # -- Custom certificates @@ -72,83 +74,7 @@ customCertificates: -----END CERTIFICATE----- ``` -### Apply Changes - -To apply the changes, run the update command: - -```bash -helm upgrade -i -n clearml-enterprise/clearml-enterprise-app-gateway --version -f clearml-app-gateway-values.override.yaml -``` - -## ClearML Agent - -For the ClearML Agent, configure certificates in the `clearml-agent-values.override.yaml` file: - -```yaml -# -- Custom certificates -customCertificates: - # -- Override system crt certificate bundle. Mutual exclusive with extraCerts. - overrideCaCertificatesCrt: - # -- Extra certs usable in case of needs of adding more certificates to the standard bundle, Requires root permissions to run update-ca-certificates. Mutual exclusive with overrideCaCertificatesCrt. - extraCerts: - - alias: certificateName - pem: | - -----BEGIN CERTIFICATE----- - ### - -----END CERTIFICATE----- -``` - -You have two configuration options: - -- [**Replace**](#replace-entire-ca-certificatescrt-file-1) the entire `ca-certificates.crt` file -- [**Append**](#append-extra-certificates-to-the-existing-ca-certificatescrt-1) extra certificates to the existing `ca-certificates.crt` - - -### Replace Entire ca-certificates.crt File - -To replace the whole ca-bundle, provide a concatenated list of all trusted CA certificates in `pem` format as -they are stored in a standard `ca-certificates.crt`. - - -```yaml -# -- Custom certificates -customCertificates: - # -- Override system crt certificate bundle. Mutual exclusive with extraCerts. 
- overrideCaCertificatesCrt: | - -----BEGIN CERTIFICATE----- - ### CERT 1 - -----END CERTIFICATE----- - -----BEGIN CERTIFICATE----- - ### CERT 2 - -----END CERTIFICATE----- - -----BEGIN CERTIFICATE----- - ### CERT 3 - -----END CERTIFICATE----- - ... -``` - -### Append Extra Certificates to the Existing ca-certificates.crt - -You can add certificates to the existing CA bundle. Each certificate must have a unique `alias`. - -```yaml -# -- Custom certificates -customCertificates: - # -- Extra certs usable in case of needs of adding more certificates to the standard bundle, Requires root permissions to run update-ca-certificates. Mutual exclusive with overrideCaCertificatesCrt. - extraCerts: - - alias: certificate-name-1 - pem: | - -----BEGIN CERTIFICATE----- - ### - -----END CERTIFICATE----- - - alias: certificate-name-2 - pem: | - -----BEGIN CERTIFICATE----- - ### - -----END CERTIFICATE----- -``` - -### Add Certificates to Task Pods +### ClearML Agent: Add Certificates to Task Pods If your workloads need access to these certificates (e.g., for HTTPS requests), configure the agent to inject them into pods: @@ -194,8 +120,15 @@ Their names are usually prefixed with the Helm release name, so adjust according ### Apply Changes -Apply the changes by running the update command: +To apply the changes, run the update command: +* For AI Application Gateway: -``` bash -helm upgrade -i -n clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml -``` \ No newline at end of file + ```bash + helm upgrade -i -n clearml-enterprise/clearml-enterprise-app-gateway --version -f clearml-app-gateway-values.override.yaml + ``` + +* For ClearML Agent: + + ```bash + helm upgrade -i -n clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml + ``` \ No newline at end of file diff --git a/docs/deploying_clearml/enterprise_deploy/extra_configs/sso_login.md 
b/docs/deploying_clearml/enterprise_deploy/extra_configs/sso_login.md index 6e9179e5..4eae0c6a 100644 --- a/docs/deploying_clearml/enterprise_deploy/extra_configs/sso_login.md +++ b/docs/deploying_clearml/enterprise_deploy/extra_configs/sso_login.md @@ -6,13 +6,8 @@ ClearML Enterprise Server supports various Single Sign-On (SSO) identity provide SSO configuration is managed via environment variables in your `clearml-values.override.yaml` file and is applied to the `apiserver` component. -The following are configuration examples for commonly used providers. Other supported systems include: -* Auth0 -* Keycloak -* Okta -* Azure AD -* Google -* AWS Cognito +The following are configuration examples for commonly used identity providers. See [full list of supported identity providers](../../../webapp/settings/webapp_settings_id_providers.md). + ## Auth0 @@ -52,7 +47,7 @@ apiserver: value: "true" ``` -## Group Membership Mapping in Keycloak +### Group Membership Mapping in Keycloak To map Keycloak groups into the ClearML user's SSO token: diff --git a/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cdmo.md b/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cdmo.md index 0cbf1d32..6e9fcc4a 100644 --- a/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cdmo.md +++ b/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cdmo.md @@ -1,54 +1,20 @@ --- -title: ClearML Dynamic MIG Operator (CDMO) +title: Managing GPU Fragments with ClearML Dynamic MIG Operator (CDMO) --- -The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG (Multi-Instance GPU) configurations. +This guide covers using GPU fragments in Kubernetes clusters using NVIDIA MIGs and +ClearML's Dynamic MIG Operator (CDMO). CDMO enables dynamic MIG (Multi-Instance GPU) configurations. 
+ +This guide covers: +* Installing CDMO +* Enabling MIG mode on your cluster +* Managing GPU partitioning dynamically ## Installation ### Requirements -* Add and update the Nvidia Helm repo: - - ```bash - helm repo add nvidia https://nvidia.github.io/gpu-operator - helm repo update - ``` - -* Create a `gpu-operator.override.yaml` file with the following content: - - ```yaml - migManager: - enabled: false - mig: - strategy: mixed - toolkit: - env: - - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED - value: "false" - - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS - value: "true" - devicePlugin: - env: - - name: PASS_DEVICE_SPECS - value: "true" - - name: FAIL_ON_INIT_ERROR - value: "true" - - name: DEVICE_LIST_STRATEGY # Use volume-mounts - value: volume-mounts - - name: DEVICE_ID_STRATEGY - value: uuid - - name: NVIDIA_VISIBLE_DEVICES - value: all - - name: NVIDIA_DRIVER_CAPABILITIES - value: all - ``` - -* Install the NVIDIA `gpu-operator` using Helm with the previous configuration: - - ```bash - helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml - ``` +* Install the NVIDIA `gpu-operator` using Helm. For instructions, see [Basic Deployment](../extra_configs/gpu_operator.md). ### Installing CDMO @@ -78,7 +44,7 @@ The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG (Multi-Instance GPU * For convenience, this command can be run from within the `nvidia-device-plugin-daemonset` pod running on the related node. ::: -1. Label all MIG-enabled GPU node `` from the previous step: +1. Label all MIG-enabled GPU nodes `` from the previous step: ```bash kubectl label nodes "cdmo.clear.ml/gpu-partitioning=mig" @@ -106,7 +72,7 @@ To disable MIG mode and restore standard full-GPU access: nvidia-smi -mig 0 ``` -4. Edit the `gpu-operator.override.yaml` file to restore full-GPU access, and upgrade the `gpu-operator`: +4. 
Edit the `gpu-operator.override.yaml` file to restore full-GPU access: ```yaml toolkit: @@ -129,4 +95,10 @@ To disable MIG mode and restore standard full-GPU access: value: all - name: NVIDIA_DRIVER_CAPABILITIES value: all - ``` \ No newline at end of file + ``` + +5. Upgrade the `gpu-operator`: + + ```bash + helm upgrade -n gpu-operator gpu-operator nvidia/gpu-operator -f gpu-operator.override.yaml + ``` \ No newline at end of file diff --git a/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cfgi.md b/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cfgi.md index 13ba2e85..de258499 100644 --- a/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cfgi.md +++ b/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cfgi.md @@ -16,7 +16,7 @@ helm repo update ### Requirements -* Install the NVIDIA `gpu-operator` using Helm +* Install the NVIDIA `gpu-operator` using Helm. For instructions, see [Basic Deployment](../extra_configs/gpu_operator.md). * Set the number of GPU slices to 8 * Add and update the Nvidia Helm repo: @@ -191,7 +191,7 @@ Valid values for `""` include: ### ClearML Agent Configuration -To run ClearML jobs with fractional GPU allocation, configure your queues in accordingly in your `clearml-agent-values.override.yaml` file. +To run ClearML jobs with fractional GPU allocation, configure your queues in your `clearml-agent-values.override.yaml` file. Each queue should include a `templateOverride` that sets the `clearml-injector/fraction` label, which determines the fraction of a GPU to allocate (e.g., "0.500" for half a GPU). 
diff --git a/docs/deploying_clearml/enterprise_deploy/k8s.md b/docs/deploying_clearml/enterprise_deploy/k8s.md index 7ed10bd0..0992ebc3 100644 --- a/docs/deploying_clearml/enterprise_deploy/k8s.md +++ b/docs/deploying_clearml/enterprise_deploy/k8s.md @@ -50,7 +50,7 @@ Add the ClearML Helm repository: helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username --password ``` -Update the repository locally: +Update the local repository: ``` bash helm repo update ``` diff --git a/docs/deploying_clearml/enterprise_deploy/multi_tenant_k8s.md b/docs/deploying_clearml/enterprise_deploy/multi_tenant_k8s.md index b6d2e19e..604d169c 100644 --- a/docs/deploying_clearml/enterprise_deploy/multi_tenant_k8s.md +++ b/docs/deploying_clearml/enterprise_deploy/multi_tenant_k8s.md @@ -709,7 +709,7 @@ The following features can be assigned to groups via the `features` configuratio | `reports` | Enables access to Reports. | No | | `resource_dashboard` | Enables access to the compute resource dashboard feature. | No | | `sso_management` | Enables the SSO (Single Sign-On) configuration wizard. | No | -| `service_users` | Enables support for creating and managing service users (API keys). | No | +| `service_users` | Enables support for creating and managing service accounts (API keys). | No | | `resource_policy` | Enables the resource policy feature. | May default to a trial feature if not explicitly enabled. | | `model_serving` | Enables access to the model serving endpoints feature. | No | | `show_dashboard` | Makes the "Dashboard" menu item visible in the UI sidebar. 
| No | diff --git a/docs/release_notes/clearml_server/enterprise/ver_3_25.md b/docs/release_notes/clearml_server/enterprise/ver_3_25.md index 17a61860..47edcfcc 100644 --- a/docs/release_notes/clearml_server/enterprise/ver_3_25.md +++ b/docs/release_notes/clearml_server/enterprise/ver_3_25.md @@ -29,7 +29,7 @@ title: Version 3.25 * Display per-GPU metrics in "CPU and GPU Usage" and "Video Memory" graphs when multiple GPUs are available * Add "GPU Count" column to the Resource Groups table in the Orchestration Dashboard * Add global search bar to all UI pages -* Enable setting service users as admins +* Enable setting service accounts as admins * Add filter to UI Model Endpoints table * Add UI scalar viewing configuration on a per-project basis ([ClearML GitHub issue #1377](https://github.com/clearml/clearml/issues/1377)) * Add clicking project name in breadcrumbs of full-screen task opens the task in detail’s view ([ClearML GitHub issue #1376](https://github.com/clearml/clearml/issues/1376)) @@ -42,7 +42,7 @@ title: Version 3.25 * Fix EMA smoothing in UI scalars is incorrect in first data point ([ClearML Web GitHub issue #101](https://github.com/clearml/clearml-web/issues/101)) * Improve UI scalar smoothing algorithms (ClearML Web GitHub issues [#101](https://github.com/clearml/clearml-web/issues/101), [#102](https://github.com/clearml/clearml-web/issues/102), [#103](https://github.com/clearml/clearml-web/issues/103)) * Fix UI Users & Groups table's "Groups" column data remains condensed after column is expanded -* Fix setting service users as admins causes apiserver to crash +* Fix setting service accounts as admins causes apiserver to crash * Fix UI "New Dataview" modal's version selection sometimes does not display draft versions * Fix GCS and Azure credential input popups not displaying in UI task debug samples * Fix UI pipeline "Preview" tab sometimes displays "Failed to get plot charts" error From ca17d1563a711826d83ba2d30cbdfab7b26ba4de Mon Sep 17 00:00:00 
2001 From: revital Date: Wed, 21 May 2025 06:59:58 +0300 Subject: [PATCH 02/10] Edits --- .../enterprise_deploy/extra_configs/presign_service.md | 2 +- .../extra_configs/self_signed_certificates.md | 6 +++--- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/deploying_clearml/enterprise_deploy/extra_configs/presign_service.md b/docs/deploying_clearml/enterprise_deploy/extra_configs/presign_service.md index 9207d4e6..4de148f7 100644 --- a/docs/deploying_clearml/enterprise_deploy/extra_configs/presign_service.md +++ b/docs/deploying_clearml/enterprise_deploy/extra_configs/presign_service.md @@ -8,7 +8,7 @@ users, enabling direct access to S3 data without exposing credentials. When configured, the ClearML WebApp automatically redirects requests for matching storage URLs (like `s3://...`) to the Presign Service. The service: -* Verifies the user's ClearML authentication. +* Authenticates the use with ClearML. * Generates a temporary, secure (pre-signed) S3 URL. * Redirects the user's browser to the URL for direct access. diff --git a/docs/deploying_clearml/enterprise_deploy/extra_configs/self_signed_certificates.md b/docs/deploying_clearml/enterprise_deploy/extra_configs/self_signed_certificates.md index f477da7a..7a031e21 100644 --- a/docs/deploying_clearml/enterprise_deploy/extra_configs/self_signed_certificates.md +++ b/docs/deploying_clearml/enterprise_deploy/extra_configs/self_signed_certificates.md @@ -1,5 +1,5 @@ --- -title: Self-Signed Certificates for ClearML Agent and AI App Gateway +title: Kubernetes Deployment with Self-Signed Certificates --- This guide covers how to configure the [AI Application Gateway](../appgw.md) and [ClearML Agent](../agent_k8s.md) @@ -7,9 +7,9 @@ to use self-signed or custom SSL certificates. 
## Certificate Configuration -To configure certificates, update the following files: +To configure certificates, update the applicable overrides file: * For AI Application Gateway: `clearml-app-gateway-values.override.yaml` file -* For ClearML Agent: `clearml-agent-values.override.yaml` +* For ClearML Agent: `clearml-agent-values.override.yaml` file ```yaml # -- Custom certificates From 26fd03a81dfff72fdb8fdb159b5c25703d64edf8 Mon Sep 17 00:00:00 2001 From: fbrintazzoli Date: Wed, 21 May 2025 10:07:44 +0200 Subject: [PATCH 03/10] Fixed: cdmo --- .../enterprise_deploy/fractional_gpus/cdmo.md | 45 +++++++++++++++++-- 1 file changed, 42 insertions(+), 3 deletions(-) diff --git a/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cdmo.md b/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cdmo.md index 6e9fcc4a..c1652ab4 100644 --- a/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cdmo.md +++ b/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cdmo.md @@ -1,8 +1,8 @@ --- -title: Managing GPU Fragments with ClearML Dynamic MIG Operator (CDMO) +title: Managing GPU Fractions with ClearML Dynamic MIG Operator (CDMO) --- -This guide covers using GPU fragments in Kubernetes clusters using NVIDIA MIGs and +This guide covers using GPU fractions in Kubernetes clusters using NVIDIA MIGs and ClearML's Dynamic MIG Operator (CDMO). CDMO enables dynamic MIG (Multi-Instance GPU) configurations. This guide covers: @@ -14,7 +14,46 @@ This guide covers: ### Requirements -* Install the NVIDIA `gpu-operator` using Helm. For instructions, see [Basic Deployment](../extra_configs/gpu_operator.md). 
+* Add and update the Nvidia Helm repo: + + ```bash + helm repo add nvidia https://nvidia.github.io/gpu-operator + helm repo update + ``` + +* Create a `gpu-operator.override.yaml` file with the following content: + + ```yaml + migManager: + enabled: false + mig: + strategy: mixed + toolkit: + env: + - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED + value: "false" + - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS + value: "true" + devicePlugin: + env: + - name: PASS_DEVICE_SPECS + value: "true" + - name: FAIL_ON_INIT_ERROR + value: "true" + - name: DEVICE_LIST_STRATEGY # Use volume-mounts + value: volume-mounts + - name: DEVICE_ID_STRATEGY + value: uuid + - name: NVIDIA_VISIBLE_DEVICES + value: all + - name: NVIDIA_DRIVER_CAPABILITIES + value: all + ``` +* Install the NVIDIA `gpu-operator` using Helm with the previous configuration: + + ```bash + helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml + ``` ### Installing CDMO From 41e455f46c4f49d832214efe495ffd5498e559fa Mon Sep 17 00:00:00 2001 From: fbrintazzoli Date: Wed, 21 May 2025 10:08:22 +0200 Subject: [PATCH 04/10] Added: newline --- docs/deploying_clearml/enterprise_deploy/fractional_gpus/cdmo.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cdmo.md b/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cdmo.md index c1652ab4..0e04bc81 100644 --- a/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cdmo.md +++ b/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cdmo.md @@ -49,6 +49,7 @@ This guide covers: - name: NVIDIA_DRIVER_CAPABILITIES value: all ``` + * Install the NVIDIA `gpu-operator` using Helm with the previous configuration: ```bash From ffa57dcd4b5016ce1e613662d2ff3634d3797e6d Mon Sep 17 00:00:00 2001 From: fbrintazzoli Date: Thu, 22 May 2025 12:19:25 +0200 Subject: [PATCH 05/10] Update CFGI devicePlugin version reference --- 
.../deploying_clearml/enterprise_deploy/fractional_gpus/cfgi.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cfgi.md b/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cfgi.md index 13ba2e85..67e63b7f 100644 --- a/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cfgi.md +++ b/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cfgi.md @@ -59,7 +59,7 @@ helm repo update devicePlugin: repository: docker.io/clearml image: k8s-device-plugin - version: v0.17.1-gpu-card-selection + version: "v0.17.2-gpu-card-selection" imagePullPolicy: Always imagePullSecrets: - "clearml-dockerhub-access" From 57dec2e5574ccb74262cb2c5f76cb6aebe7bd1a7 Mon Sep 17 00:00:00 2001 From: Noam Wasersprung <51905810+ainoam@users.noreply.github.com> Date: Thu, 22 May 2025 13:49:31 +0300 Subject: [PATCH 06/10] Fix typo --- .../enterprise_deploy/extra_configs/presign_service.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/deploying_clearml/enterprise_deploy/extra_configs/presign_service.md b/docs/deploying_clearml/enterprise_deploy/extra_configs/presign_service.md index 4de148f7..13037abf 100644 --- a/docs/deploying_clearml/enterprise_deploy/extra_configs/presign_service.md +++ b/docs/deploying_clearml/enterprise_deploy/extra_configs/presign_service.md @@ -8,7 +8,7 @@ users, enabling direct access to S3 data without exposing credentials. When configured, the ClearML WebApp automatically redirects requests for matching storage URLs (like `s3://...`) to the Presign Service. The service: -* Authenticates the use with ClearML. +* Authenticates the user with ClearML. * Generates a temporary, secure (pre-signed) S3 URL. * Redirects the user's browser to the URL for direct access. 
From 34d6ccb771e3dff8ae36b7a6a5659187e9982860 Mon Sep 17 00:00:00 2001 From: revital Date: Thu, 22 May 2025 14:09:46 +0300 Subject: [PATCH 07/10] Add ClearML 2.0.0 release notes --- docs/release_notes/sdk/open_source/ver_2_0.md | 33 +++++++++++++++++++ sidebars.js | 4 +-- 2 files changed, 35 insertions(+), 2 deletions(-) create mode 100644 docs/release_notes/sdk/open_source/ver_2_0.md diff --git a/docs/release_notes/sdk/open_source/ver_2_0.md b/docs/release_notes/sdk/open_source/ver_2_0.md new file mode 100644 index 00000000..f81dafd1 --- /dev/null +++ b/docs/release_notes/sdk/open_source/ver_2_0.md @@ -0,0 +1,33 @@ +--- +title: Version 2.0 +--- + +### ClearML 2.0.0 + +**New Features** +* Clean up exception handling in `cleanup_service.py` ([ClearML GitHub PR #1386](https://github.com/clearml/clearml/pull/1386)) +* Add support for `clearml-task` command line options `--force-no-requirements`, `--skip-repo-detection`, and `--skip-python-env-install` +* Allow calling the same pipeline step multiple times with inputs that originate from tasks/controller +* Add `Task.upload_artifact()` argument `sort_keys` to allow disabling sorting YAML/JSON keys when uploading artifacts +* Add Python annotations to all methods +* Update `pyjwt` constraint version + +**Bug Fixes** +* Fix local file uploads without scheme ([ClearML GitHub PR #1313](https://github.com/clearml/clearml/pull/1313)) +* Fix argument order mismatch in `PipelineController` ([ClearML GitHub PR #1406](https://github.com/clearml/clearml/pull/1406)) +* Fix `_logger` property might be `None` in Session ([ClearML GitHub PR #1412](https://github.com/clearml/clearml/pull/1412)) +* Fix unhandled `None` value in project IDs when listing all datasets ([ClearML GitHub PR #1413](https://github.com/clearml/clearml/pull/1413)) +* Fix typo in config exception string ([ClearML GitHub PR #1418](https://github.com/clearml/clearml/pull/1418)) +* Fix experiments are created twice during HPO ([ClearML GitHub issue 
#644](https://github.com/clearml/clearml/issues/644)) +* Fix `clearml-task-run` HPO breaks up ([ClearML GitHub issue #1151](https://github.com/clearml/clearml/issues/1151)) +* Fix oversized event reports cause subsequent events to be lost ([ClearML GitHub issue #1316](https://github.com/clearml/clearml/issues/1316)) +* Fix downloading datasets with multiple parents might not work ([ClearML GitHub issue #1398](https://github.com/clearml/clearml/issues/1398)) +* Fix GPU reporting fails to detect GPU when the `NVIDIA_VISIBLE_DEVICES` env var contains a directory reference +* Fix verify configuration option for S3 storage (boto3) is not used when testing buckets +* Fix `PipelineDecorator.component()` ignores `*args` and crashes with `**kwargs` +* Fix Pipelines run via `clearml-task` do not appear in the UI +* Fix task log URL print for API v2.31 should show `"/tasks/{}/output/log"` +* Fix tqdm upload/download reporting, remove warning +* Fix pipeline from CLI with no args fails +* Fix pillow constraint for `Python<=3.7` +* Fix requests constraint for `Python<3.8` \ No newline at end of file diff --git a/sidebars.js b/sidebars.js index d9eaffe2..2b37dd5a 100644 --- a/sidebars.js +++ b/sidebars.js @@ -328,10 +328,10 @@ module.exports = { { 'Open Source': [ - 'release_notes/sdk/open_source/ver_1_18', + 'release_notes/sdk/open_source/ver_2_0', { 'Older Versions': [ - 'release_notes/sdk/open_source/ver_1_17', + 'release_notes/sdk/open_source/ver_1_18', 'release_notes/sdk/open_source/ver_1_17', 'release_notes/sdk/open_source/ver_1_16', 'release_notes/sdk/open_source/ver_1_15', 'release_notes/sdk/open_source/ver_1_14', 'release_notes/sdk/open_source/ver_1_13', 'release_notes/sdk/open_source/ver_1_12', 'release_notes/sdk/open_source/ver_1_11', From eab0561b70e5d23fd927fcc6c04ddb50d0514787 Mon Sep 17 00:00:00 2001 From: revital Date: Thu, 22 May 2025 14:46:14 +0300 Subject: [PATCH 08/10] Add clearml-task command line options --- docs/apps/clearml_task.md | 3 +++ 1 file changed, 
3 insertions(+) diff --git a/docs/apps/clearml_task.md b/docs/apps/clearml_task.md index 00fb0379..dfaeed42 100644 --- a/docs/apps/clearml_task.md +++ b/docs/apps/clearml_task.md @@ -65,6 +65,7 @@ errors in identifying the correct default branch. | `--docker_bash_setup_script` | Add a bash script to be executed inside the container before setting up the task's environment | No | | `--docker_args` | Add Docker arguments. Pass a single string in the following format: `--docker_args ""`. For example: `--docker_args "-v some_dir_1:other_dir_1 -v some_dir_2:other_dir_2"` | No | | `--folder` | Execute the code from a local folder. Notice, it assumes a git repository already exists. Current state of the repo (commit ID and uncommitted changes) is logged and replicated on the remote machine | No | +| `--force-no-requirements` | If specified, skips all package and requirements installation, and neither packages nor a requirements file need to be provided. | No | | `--import-offline-session`| Specify the path to the offline session you want to import.| No | | `--name` | Set a target name for the new task | Yes | | `--output-uri` | Set the task `output_uri`, upload destination for task models and artifacts | No | @@ -73,7 +74,9 @@ errors in identifying the correct default branch. | `--queue` | Select a task's execution queue. If not provided, a task is created but not launched | No | | `--repo` | URL of remote repository. Example: `--repo https://github.com/clearml/clearml.git` | No | | `--requirements` | Specify `requirements.txt` file to install when setting the session. By default, the` requirements.txt` from the repository will be used | No | +| `--skip-python-env-install` | If specified, agent will use the existing Python environment without installing packages. Only applies when running in Docker mode or on Kubernetes. | Yes | | `--script` | Entry point script for the remote execution. When used with `--repo`, input the script's relative path inside the repository. 
For example: `--script source/train.py`. When used with `--folder`, it supports a direct path to a file inside the local repository itself, for example: `--script ~/project/source/train.py` | Yes | +| `--skip-repo-detection` | If specified, skip repository detection when a repository is not specified. No repository will be set in remote execution | Yes | | `--skip-task-init` | If set, `Task.init()` call is not added to the entry point, and is assumed to be called within the script | No | | `--tags` | Add tags to the newly created task. For example: `--tags "base" "job"` | No | | `--task-type` | Set the task type. Optional values: training, testing, inference, data_processing, application, monitor, controller, optimizer, service, qc, custom | No | From 0dc5c1f857d4a09b04f1766b16f7f42cfa1bf99e Mon Sep 17 00:00:00 2001 From: revital Date: Thu, 22 May 2025 14:50:49 +0300 Subject: [PATCH 09/10] Edits --- docs/apps/clearml_task.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/apps/clearml_task.md b/docs/apps/clearml_task.md index dfaeed42..77ebff26 100644 --- a/docs/apps/clearml_task.md +++ b/docs/apps/clearml_task.md @@ -74,9 +74,9 @@ errors in identifying the correct default branch. | `--queue` | Select a task's execution queue. If not provided, a task is created but not launched | No | | `--repo` | URL of remote repository. Example: `--repo https://github.com/clearml/clearml.git` | No | | `--requirements` | Specify `requirements.txt` file to install when setting the session. By default, the` requirements.txt` from the repository will be used | No | -| `--skip-python-env-install` | If specified, agent will use the existing Python environment without installing packages. Only applies when running in Docker mode or on Kubernetes. | Yes | +| `--skip-python-env-install` | If specified, agent will use the existing Python environment without installing packages. Only applies when running in Docker mode or on Kubernetes. 
| No | | `--script` | Entry point script for the remote execution. When used with `--repo`, input the script's relative path inside the repository. For example: `--script source/train.py`. When used with `--folder`, it supports a direct path to a file inside the local repository itself, for example: `--script ~/project/source/train.py` | Yes | -| `--skip-repo-detection` | If specified, skip repository detection when a repository is not specified. No repository will be set in remote execution | Yes | +| `--skip-repo-detection` | If specified, skip repository detection when a repository is not specified. No repository will be set in remote execution | No | | `--skip-task-init` | If set, `Task.init()` call is not added to the entry point, and is assumed to be called within the script | No | | `--tags` | Add tags to the newly created task. For example: `--tags "base" "job"` | No | | `--task-type` | Set the task type. Optional values: training, testing, inference, data_processing, application, monitor, controller, optimizer, service, qc, custom | No | From 9a75bd71776d15f87f5efee06eee2760b5430122 Mon Sep 17 00:00:00 2001 From: revital Date: Sun, 25 May 2025 07:57:32 +0300 Subject: [PATCH 10/10] Small edits --- .../deploying_clearml/enterprise_deploy/extra_configs/apps.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/deploying_clearml/enterprise_deploy/extra_configs/apps.md b/docs/deploying_clearml/enterprise_deploy/extra_configs/apps.md index 06185a83..d48f8dd8 100644 --- a/docs/deploying_clearml/enterprise_deploy/extra_configs/apps.md +++ b/docs/deploying_clearml/enterprise_deploy/extra_configs/apps.md @@ -11,14 +11,14 @@ Applications are installed on top of the ClearML Server and are provided by the ## Requirements -- Python 3 installed on your local machine to run the provided installation scripts) +- Python 3 installed on your local machine to run the provided installation scripts - A ClearML Enterprise Server is up and running with 
`clearmlApplications.enabled` set to `"true"` in the server's `overrides.yaml` file. - Applications package provided by ClearML, including the following scripts: - `convert_image_registry.py` - `upload_apps.py` - API credentials (`` and ``) generated via the ClearML UI (**Settings > Workspace > API Credentials > Create new credentials**). Make sure these credentials - belong to an admin user or a service user with admin privilegesFor more information, see [ClearML API Credentials](../../webapp/settings/webapp_settings_profile.md#clearml-api-credentials). + belong to an admin user or a service user with admin privileges. For more information, see [ClearML API Credentials](../../../webapp/settings/webapp_settings_profile.md#clearml-api-credentials). ## Installation