mirror of https://github.com/clearml/clearml-docs
Add Enterprise Server deployment use cases
commit 7779945d3e (parent b21275c262)
@@ -1,26 +1,20 @@
🟡 Ready, but missing hyperlinks (see TODOs)
TODO:
- Link: GPU Operator
- Link: Additional configurations
- Link: Now proceed with AI App Gateway

---
title: ClearML Agent on Kubernetes
---

The ClearML Agent enables scheduling and executing distributed experiments on a Kubernetes cluster.

## Prerequisites

- The ClearML Enterprise server is up and running.
- Generate a set of `<ACCESS_KEY>` and `<SECRET_KEY>` credentials in the ClearML Server. The easiest way is via
  the ClearML UI (**Settings > Workspace > App Credentials > Create new credentials**).

:::note
Make sure these credentials belong to an admin user or a service user with admin privileges.
:::

- The worker environment must be able to access the ClearML Server over the same network.

## Installation

@@ -36,13 +30,13 @@ Update the repository locally:
```bash
helm repo update
```

### Create a Values Override File

Create a `clearml-agent-values.override.yaml` file with the following content:

:::note
Replace the `<ACCESS_KEY>` and `<SECRET_KEY>` placeholders with the admin credentials
you created earlier. Set `<api|file|web>ServerUrlReference` to the relevant URLs of your ClearML
control-plane installation.
:::
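For orientation, a minimal override might take roughly the following shape. The key names below (`agentk8sglueKey`, `agentk8sglueSecret`, and the `*ServerUrlReference` entries) are assumptions based on common chart conventions; confirm them against the chart's documented values (see the `helm show` commands further down).

```yaml
# Illustrative sketch only; verify key names with `helm show values`
clearml:
  agentk8sglueKey: "<ACCESS_KEY>"
  agentk8sglueSecret: "<SECRET_KEY>"
  apiServerUrlReference: "https://api.<BASE_DOMAIN>"
  fileServerUrlReference: "https://files.<BASE_DOMAIN>"
  webServerUrlReference: "https://app.<BASE_DOMAIN>"
```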
@@ -73,19 +67,17 @@ helm upgrade -i -n <WORKER_NAMESPACE> clearml-agent clearml-enterprise/clearml-e
## Additional Configuration Options

:::note
To view all configurable options for the Helm chart, run the following command:

```bash
helm show readme clearml-enterprise/clearml-enterprise-agent
# or
helm show values clearml-enterprise/clearml-enterprise-agent
```
:::

### Report GPUs to the Dashboard

To show GPU availability in the dashboard, explicitly set the number of GPUs:

```yaml
agentk8sglue:
  # ... (remainder of the example is elided by the diff context)
```

@@ -95,23 +87,23 @@ agentk8sglue:
### Queues

The ClearML Agent on Kubernetes monitors ClearML queues and pulls tasks that are scheduled for execution.

A single agent can monitor multiple queues. By default, the queues share a base pod template (`agentk8sglue.basePodTemplate`)
used when submitting a task to Kubernetes after it has been extracted from the queue.

Each queue can be configured with a dedicated Pod template spec override (`templateOverrides`). This way queue definitions
can be tailored to different use cases.

The following are a few examples of agent queue templates:

#### Example: GPU Queues

GPU queue support requires deploying the NVIDIA GPU Operator on your Kubernetes cluster.
For more information, see [GPU Operator](extra_configs/gpu_operator.md).

```yaml
agentk8sglue:
@@ -129,13 +121,14 @@ agentk8sglue:
        nvidia.com/gpu: 2
```
#### Example: Overriding Pod Templates per Queue

In this example:

- The `red` queue inherits both the label `team=red` and the 1Gi memory limit from the `basePodTemplate` section.
- The `blue` queue overrides the label by setting it to `team=blue`, and inherits the 1Gi memory limit from the `basePodTemplate` section.
- The `green` queue overrides the label by setting it to `team=green`, and overrides the memory limit by setting it to 2Gi.
  It also sets an annotation and an environment variable.

```yaml
agentk8sglue:
  # ... (the full example is elided by the diff context; see the sketch below)
```

@@ -173,9 +166,5 @@ agentk8sglue:
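Since the diff elides the full example, the following is an illustrative sketch of the shape such a configuration might take, based on the description above. The exact schema (`createQueues`, `queues.<name>.templateOverrides`, and so on) should be checked against `helm show values`; the annotation and environment variable names here are hypothetical.

```yaml
agentk8sglue:
  createQueues: true
  basePodTemplate:
    labels:
      team: red
    resources:
      limits:
        memory: 1Gi
  queues:
    red:
      # Inherits team=red and the 1Gi memory limit from basePodTemplate
      templateOverrides: {}
    blue:
      templateOverrides:
        labels:
          team: blue          # overrides the label; memory limit still 1Gi
    green:
      templateOverrides:
        labels:
          team: green
        annotations:
          example.org/owner: "green-team"   # hypothetical annotation
        env:
          - name: EXAMPLE_VAR               # hypothetical variable
            value: "1"
        resources:
          limits:
            memory: 2Gi
```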
## Next Steps

Once the agent is up and running, proceed with deploying the [ClearML Enterprise App Gateway](appgw_install_k8s.md).
@@ -1,40 +1,38 @@
🟢 Ready
TODO in future:
- Add NFS Example - https://allegroai.atlassian.net/wiki/x/AoCUiQ?atlOrigin=eyJpIjoiMjNiNTcxYTJiMzUxNDVhMThiODlhMTcwYzE1YWE3ZTUiLCJwIjoiYyJ9

---
title: Dynamically Edit Task Pod Template
---

The ClearML Enterprise Agent supports defining custom Python code to modify a task's Pod template before it is applied
to Kubernetes.
This enables dynamic customization of Task Pod manifests in the context of a ClearML Enterprise Agent, which is useful
for injecting values or changing configurations based on runtime context.

## Agent Configuration

The `CLEARML_K8S_GLUE_TEMPLATE_MODULE` environment variable defines the Python module and function inside that
module that the ClearML Enterprise Agent should invoke before applying a Task Pod template.
The Agent will run this code in its own context, pass arguments (including the actual template) to the function, and use
the returned template to create the final Task Pod in Kubernetes.

Arguments passed to the function include:
* `queue` (string) - ID of the queue from which the task was pulled.
* `queue_name` (string) - Name of the queue from which the task was pulled.
* `template` (Python dictionary) - Base Pod template created from the agent's configuration and any queue-specific overrides.
* `task_data` (object) - Task data object (as returned by the `tasks.get_by_id` API call). For example, use `task_data.project` to get the task's project ID.
* `providers_info` (dictionary) - Provider info containing optional information collected for the user running this task
  when the user logged into the system (requires additional server configuration).
* `task_config` (`clearml_agent.backend_config.Config` object) - Task configuration containing configuration vaults applicable
  for the user running this task, and other configuration. Use `task_config.get("...")` to get specific configuration values.
* `worker` - The agent Python object, in case custom calls might be required.
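To make the calling convention concrete, here is a minimal sketch of such a module. The module and function names are hypothetical (they are whatever `CLEARML_K8S_GLUE_TEMPLATE_MODULE` points to); the argument names and the return format follow the list above and the `return {"template": template}` convention shown below.

```python
# my_template_module.py - hypothetical module referenced by CLEARML_K8S_GLUE_TEMPLATE_MODULE

def update_template(queue=None, queue_name=None, template=None, task_data=None,
                    providers_info=None, task_config=None, worker=None,
                    *args, **kwargs):
    """Sketch: label every Task Pod with the queue it was pulled from."""
    labels = template.setdefault("metadata", {}).setdefault("labels", {})
    labels["clearml-queue-name"] = queue_name
    # The Agent uses the returned template to create the final Task Pod
    return {"template": template}
```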
### Usage

Update `clearml-agent-values.override.yaml` to include:

```yaml
agentk8sglue:
  extraEnvs:
    - name: CLEARML_K8S_GLUE_TEMPLATE_MODULE
@@ -61,24 +59,28 @@ agentk8sglue:
      return {"template": template}
```

:::note notes
* Make sure to include `*args, **kwargs` at the end of the function's argument list and to only use keyword arguments.
  This is needed to maintain backward compatibility.
* Custom code modules can be included as a file in the pod's container, and the environment variable can be used to
  point to the file and entry point.
* When defining a custom code module, by default the Agent will start watching pods in all namespaces
  across the cluster. If you do not intend to grant a `ClusterRole` permission, make sure to set the
  `CLEARML_K8S_GLUE_MONITOR_ALL_NAMESPACES` env to `"0"` to prevent the Agent from trying to list pods in all namespaces.
  Instead, set it to `"1"` if namespace-related changes are needed in the code.

  ```yaml
  agentk8sglue:
    extraEnvs:
      - name: CLEARML_K8S_GLUE_MONITOR_ALL_NAMESPACES
        value: "0"
  ```
:::
To customize the bash startup scripts instead of the pod spec, use:

```yaml
agentk8sglue:
  # -- Custom Bash script for the Agent pod run by Glue Agent
  customBashScript: ""
@@ -86,11 +88,11 @@ agentk8sglue:
  containerCustomBashScript: ""
```
## Examples

### Example: Edit Template Based on ENV Var

```yaml
agentk8sglue:
  extraEnvs:
    - name: CLEARML_K8S_GLUE_TEMPLATE_MODULE
@@ -131,9 +133,9 @@ agentk8sglue:
        emptyDir: {}
```
### Example: Inject NFS Mount Path

```yaml
agentk8sglue:
  extraEnvs:
    - name: CLEARML_K8S_GLUE_TEMPLATE_MODULE
@@ -163,23 +165,25 @@ agentk8sglue:
      return {"template": template}
```
### Example: Bind PVC Resource to Task Pod

In this example, a PVC is created and attached to every Pod created from a dedicated queue, then deleted afterwards.

Key points:

- `CLEARML_K8S_GLUE_POD_PRE_APPLY_CMD` and `CLEARML_K8S_GLUE_POD_POST_DELETE_CMD` env vars let you define custom bash
  code hooks to be executed around the main apply command by the Agent, such as creating and deleting a PVC object.
- `CLEARML_K8S_GLUE_TEMPLATE_MODULE` env var and a file mount let you define custom Python code in a specific context,
  useful to dynamically update the main Pod template before the Agent applies it.

:::note notes
* This example uses a queue named `pvc-test`; make sure to replace all occurrences of it.
* `CLEARML_K8S_GLUE_POD_PRE_APPLY_CMD` can reference templated vars such as `{queue_name}`, `{pod_name}`, and `{namespace}`,
  which will be replaced with the actual values by the Agent at execution time.
:::

```yaml
agentk8sglue:
  # Bind a pre-defined custom 'custom-agent-role' Role with the ability to handle 'persistentvolumeclaims'
  additionalRoleBindings:
@@ -246,9 +250,11 @@ agentk8sglue:
      name: task-pvc
```
### Example: Required Role

The following is an example of a `custom-agent-role` Role with permissions to handle `persistentvolumeclaims`:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  # ... (remainder of the example is elided by the diff context)
```
@@ -1,31 +1,28 @@
🟡 Ready, missing link
TODO:
- Link: fractional GPUs

---
title: Basic Deployment - Suggested GPU Operator Values
---

## Add the Helm Repo Locally

Add the NVIDIA GPU Operator Helm repository:
```bash
helm repo add nvidia https://nvidia.github.io/gpu-operator
```

Update the repository locally:
```bash
helm repo update
```
## Installation

To prevent unprivileged containers from bypassing the Kubernetes Device Plugin API, configure the GPU operator with the
following override values.

Create a `gpu-operator.override.yaml` file with the following content:

```yaml
toolkit:
  env:
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
@@ -48,14 +45,17 @@ devicePlugin:
      value: all
```

Install the `gpu-operator`:

```bash
helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
```
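Optionally, you can sanity-check the deployment by watching the operator pods come up in the namespace used above (standard `kubectl` usage, not part of the original instructions):

```bash
# All pods should reach Running/Completed within a few minutes
kubectl get pods -n gpu-operator
```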
## Fractional GPU Support

For support with fractional GPUs, refer to the dedicated guides:
* [ClearML Dynamic MIG Operator](../fractional_gpus/cdmo.md) (CDMO) – Dynamically configures MIG GPUs on supported devices.
* [ClearML Enterprise Fractional GPU Injector](../fractional_gpus/cfgi.md) (CFGI) – Enables fractional (non-MIG) GPU
  allocation for better hardware utilization and workload distribution in Kubernetes.
* [CDMO and CFGI on the same Cluster](../fractional_gpus/cdmo_cfgi_same_cluster.md) – In clusters with multiple nodes and
  varying GPU types, the `gpu-operator` can be used to manage different device configurations and fractioning modes.
@@ -1,10 +1,12 @@
🟢 Ready

---
title: Multi-Node Training
---

The ClearML Enterprise Agent supports horizontal multi-node training: running a single Task across multiple Pods on different nodes.

Below is a configuration example using `clearml-agent-values.override.yaml`:

```yaml
agentk8sglue:
  # Cluster access is required to run multi-node Tasks
  serviceAccountClusterAccess: true
  # ... (remainder of the example is elided by the diff context)
```
@@ -1,211 +0,0 @@
🟡 Ready, but missing hyperlinks (see TODOs)
TODO:
Control Plane:
- Link: basic k8s installation
- Link: SSO login
- Additional envs for control-plane multi-tenancy

Workers:
- Link: basic Agent installation
- Link: basic AI App Gateway installation

---

# Multi-Tenancy

## Control Plane

For installing the ClearML control-plane, follow this guide (TODO link to the basic_k8s_installation page).

Update the Server's `clearml-values.override.yaml` with the following values:

```yaml
apiserver:
  extraEnvs:
    - name: CLEARML__services__organization__features__user_management_advanced
      value: "true"
    - name: CLEARML__services__auth__ui_features_per_role__user__show_datasets
      value: "false"
    - name: CLEARML__services__auth__ui_features_per_role__user__show_orchestration
      value: "false"
    - name: CLEARML__services__workers__resource_usages__supervisor_company
      value: "d1bd92a3b039400cbafc60a7a5b1e52b" # Default company
    - name: CLEARML__secure__credentials__supervisor__role
      value: "system"
    - name: CLEARML__secure__credentials__supervisor__allow_login
      value: "true"
    - name: CLEARML__secure__credentials__supervisor__user_key
      value: "<SUPERVISOR_USER_KEY>"
    - name: CLEARML__secure__credentials__supervisor__user_secret
      value: "<SUPERVISOR_USER_SECRET>"
    - name: CLEARML__secure__credentials__supervisor__sec_groups
      value: "[\"users\", \"admins\", \"queue_admins\"]"
    - name: CLEARML__secure__credentials__supervisor__email
      value: "\"<SUPERVISOR_USER_EMAIL>\""
    - name: CLEARML__apiserver__company__unique_names
      value: "true"
```
The credentials specified in `<SUPERVISOR_USER_KEY>` and `<SUPERVISOR_USER_SECRET>` can be used to log in as the supervisor user from the ClearML Web UI, accessible at `app.<BASE_DOMAIN>`.

Note that the `<SUPERVISOR_USER_EMAIL>` value must be explicitly quoted. To do so, put `\"` around the quoted value. Example: `"\"email@example.com\""`.

You will want to configure SSO as well. For this, follow the "SSO (Identity Provider) Setup" guide (TODO link to the sso-login page).
### Create a Tenant

The following section addresses the steps required to create a new tenant in the ClearML Control-plane server using a series of API calls.

Note that placeholders (`<PLACEHOLDER>`) in the following configuration should be substituted with a valid domain based on your installation values.

#### Create a new Tenant in the ClearML Control-plane

*Define variables to use in the next steps:*

```bash
APISERVER_URL="https://api.<BASE_DOMAIN>"
APISERVER_KEY=<APISERVER_KEY>
APISERVER_SECRET=<APISERVER_SECRET>
```

**Note**: The apiserver key and secret should be the same as those used for installing the ClearML Enterprise server Chart.

*Create a Tenant (company):*

```bash
curl $APISERVER_URL/system.create_company \
  -H "Content-Type: application/json" \
  -u $APISERVER_KEY:$APISERVER_SECRET \
  -d '{"name":"<TENANT_NAME>"}'
```

The result returns the new Company ID (`<COMPANY_ID>`).

If needed, list existing tenants (companies) using:

```bash
curl -u $APISERVER_KEY:$APISERVER_SECRET $APISERVER_URL/system.get_companies
```
*Create an Admin User for the new tenant:*

```bash
curl $APISERVER_URL/auth.create_user \
  -H "Content-Type: application/json" \
  -u $APISERVER_KEY:$APISERVER_SECRET \
  -d '{"name":"<ADMIN_USER_NAME>","company":"<COMPANY_ID>","email":"<ADMIN_USER_EMAIL>","role":"admin","internal":"true"}'
```

The result returns the new User ID (`<USER_ID>`).

*Create Credentials for the new Admin User:*

```bash
curl $APISERVER_URL/auth.create_credentials \
  -H "Content-Type: application/json" \
  -H "X-Clearml-Impersonate-As: <USER_ID>" \
  -u $APISERVER_KEY:$APISERVER_SECRET
```

The result returns a set of key and secret credentials associated with the new Admin User.

**Note**: You can use this set of credentials to set up an Agent or App Gateway for the newly created Tenant.
#### Create IDP/SSO sign-in rules

To map new users signing into the system to existing tenants, you can use one or more of the following methods to route new users (based on their email address) to an existing tenant.

*Route an email to a tenant based on the email's domain:*

This instructs the server to assign any new user whose email domain matches the domain provided below to this specific tenant.

Note that providing the same domain name for multiple tenants will result in unstable behavior and should be avoided.

```bash
curl $APISERVER_URL/login.set_domains \
  -H "Content-Type: application/json" \
  -H "X-Clearml-Act-As: <USER_ID>" \
  -u $APISERVER_KEY:$APISERVER_SECRET \
  -d '{"domains":["<USERS_EMAIL_DOMAIN>"]}'
```

`<USERS_EMAIL_DOMAIN>` is the email domain set up for users to access through SSO.

*Route specific email(s) to a tenant:*

This instructs the server to assign any new user whose email is found in this list to this specific tenant. You can use the `is_admin` property to choose whether these users will be set as admins in this tenant upon login.

Note that you can create more than one list per tenant (using multiple API calls), for example one list for admin users and another for non-admin users.

Note that including the same email address in more than a single tenant's list will result in unstable behavior and should be avoided.

```bash
curl $APISERVER_URL/login.add_whitelist_entries \
  -H "Content-Type: application/json" \
  -H "X-Clearml-Act-As: <USER_ID>" \
  -u $APISERVER_KEY:$APISERVER_SECRET \
  -d '{"emails":["<email1>", "<email2>", ...],"is_admin":false}'
```

To remove existing email(s) from these lists, use the following API call. Note that this will not affect a user who has already logged in using one of these email addresses:

```bash
curl $APISERVER_URL/login.remove_whitelist_entries \
  -H "Content-Type: application/json" \
  -H "X-Clearml-Act-As: <USER_ID>" \
  -u $APISERVER_KEY:$APISERVER_SECRET \
  -d '{"emails":["<email1>", "<email2>", ...]}'
```

*Get the current login routing settings:*

To get the current IDP/SSO login rule settings for this tenant:

```bash
curl $APISERVER_URL/login.get_settings \
  -H "X-Clearml-Act-As: <USER_ID>" \
  -u $APISERVER_KEY:$APISERVER_SECRET
```
### Limit Features for all Users in a Group

The server's `clearml-values.override.yaml` can control some tenant configurations, limiting the features available to specific users or groups in the system.

For example, with the following configuration, all users in the "users" group will only have the `applications` feature enabled:

```yaml
apiserver:
  extraEnvs:
    - name: CLEARML__services__auth__default_groups__users__features
      value: "[\"applications\"]"
```

Available features:

- `applications` - Viewing and running applications
- `data_management` - Working with hyper-datasets and dataviews
- `experiments` - Viewing the experiment table and launching experiments
- `queues` - Viewing the queues screen
- `queue_management` - Creating and deleting queues
- `pipelines` - Viewing/managing pipelines in the system
- `reports` - Viewing and managing reports in the system
- `show_dashboard` - Show the dashboard screen
- `show_projects` - Show the projects menu option
- `resource_dashboard` - Display the resource dashboard in the orchestration page
## Workers

Refer to the following pages for installing and configuring the ClearML Enterprise Agent (TODO link to agent_k8s_installation) and App Gateway (TODO link to app-gateway).

**Note**: Make sure to set up the Agent and App Gateway using a Tenant admin user's credentials, created with the Tenant creation APIs described above.

### Tenants Separation

In multi-tenant setups, you can separate the tenants' workers into different namespaces.

Create a Kubernetes Namespace for each tenant and install a dedicated ClearML Agent and AI Application Gateway in each Namespace.

A tenant's Agent and Gateway need to be configured with credentials created on the ClearML server by a user of the same tenant.

Additional network separation can be achieved via Kubernetes Network Policies.
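For illustration, a minimal deny-by-default policy of this kind might look like the following sketch; the namespace name is hypothetical, and real policies should be adapted to your cluster's CNI and traffic requirements.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-cross-tenant
  namespace: tenant-a        # hypothetical tenant namespace
spec:
  podSelector: {}            # apply to all pods in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}    # allow traffic only from pods in the same namespace
```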
@@ -1,33 +1,42 @@
---
title: ClearML Presign Service
---

The ClearML Presign Service is a secure service that generates and redirects pre-signed storage URLs for authenticated
users, enabling direct access to cloud-hosted data (e.g., S3) without exposing credentials.

## Prerequisites

- The ClearML Enterprise server is up and running.
- Generate `<ACCESS_KEY>` and `<SECRET_KEY>` credentials in the ClearML Server. The easiest way is via the ClearML UI
  (**Settings > Workspace > App Credentials > Create new credentials**).

:::note
Make sure these credentials belong to an admin user or a service user with admin privileges.
:::

- The worker environment must be able to access the ClearML Server over the same network.
## Installation

### Add the Helm Repo Locally

Add the ClearML Helm repository:
```bash
helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <HELM_REPO_TOKEN> --password <HELM_REPO_TOKEN>
```

Update the repository locally:
```bash
helm repo update
```
### Prepare Configuration

Create a `presign-service.override.yaml` file (make sure to replace the placeholders):

```yaml
imageCredentials:
  password: "<CLEARML_DOCKERHUB_TOKEN>"
clearml:
@@ -39,21 +48,25 @@ ingress:
  hostName: "<PRESIGN_SERVICE_URL>"
```
### Deploy the Helm Chart

Install the `clearml-presign-service` helm chart in the same namespace as the ClearML Enterprise server:

```bash
helm install -n clearml clearml-presign-service clearml-enterprise/clearml-presign-service -f presign-service.override.yaml
```

### Update ClearML Enterprise Server Configuration

Enable the ClearML Server to use the Presign Service by editing your `clearml-values.override.yaml` file.
Add the following to the `apiserver.extraEnvs` section (make sure to replace `<PRESIGN_SERVICE_URL>`):

```yaml
apiserver:
  extraEnvs:
    - name: CLEARML__SERVICES__SYSTEM__COMPANY__DEFAULT__SERVICES
      value: "[{\"type\":\"presign\",\"url\":\"https://<PRESIGN_SERVICE_URL>\",\"use_fallback\":\"false\",\"match_sets\":[{\"rules\":[{\"field\":\"\",\"obj_type\":\"\",\"regex\":\"^s3://\"}]}]}]"
```

Apply the changes with a Helm upgrade.
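For example, assuming the server was installed as a Helm release named `clearml` from the `clearml-enterprise/clearml-enterprise` chart in the `clearml` namespace (adjust these to your installation), the upgrade might look like:

```bash
helm upgrade -n clearml clearml clearml-enterprise/clearml-enterprise -f clearml-values.override.yaml
```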
@@ -1,4 +1,6 @@
---
title: ClearML Tenant with Self Signed Certificates
---

This guide covers the configuration to support SSL custom certificates for the following components:

@@ -7,9 +9,9 @@ This guide covers the configuration to support SSL Custom certificates for the f

## AI Application Gateway

To configure certificates for the Application Gateway, update your `clearml-app-gateway-values.override.yaml` file:

```yaml
# -- Custom certificates
customCertificates:
  # -- Override system crt certificate bundle. Mutual exclusive with extraCerts.
@@ -23,18 +25,18 @@ customCertificates:
  -----END CERTIFICATE-----
```
In this section, there are two options:

- [**Replace**](#replace-entire-ca-certificatescrt-file) the entire `ca-certificates.crt` file
- [**Append**](#append-extra-certificates-to-the-existing-ca-certificatescrt) extra certificates to the existing `ca-certificates.crt`

### Replace Entire `ca-certificates.crt` File

To replace the whole ca-bundle, you can attach a concatenation of all of your valid CA certificates in `pem` format, as
they are stored in a standard `ca-certificates.crt`.

```yaml
# -- Custom certificates
customCertificates:
  # -- Override system crt certificate bundle. Mutual exclusive with extraCerts.
@@ -51,13 +53,11 @@ customCertificates:
  ...
```
### Append Extra Certificates to the Existing `ca-certificates.crt`

You can add certificates to the existing CA bundle. Ensure each certificate has a unique `alias`.

```yaml
# -- Custom certificates
customCertificates:
  # -- Extra certs usable in case of needs of adding more certificates to the standard bundle, Requires root permissions to run update-ca-certificates. Mutual exclusive with overrideCaCertificatesCrt.
@@ -74,19 +74,19 @@ customCertificates:
  -----END CERTIFICATE-----
```
### Apply Changes

To apply the changes, run the update command:

```bash
helm upgrade -i <RELEASE_NAME> -n <WORKLOAD_NAMESPACE> clearml-enterprise/clearml-enterprise-app-gateway --version <CHART_VERSION> -f clearml-app-gateway-values.override.yaml
```

## ClearML Enterprise Agent

For the Agent, configure certificates in the `clearml-agent-values.override.yaml` file:
```yaml
# -- Custom certificates
customCertificates:
  # -- Override system crt certificate bundle. Mutual exclusive with extraCerts.
  # ... (remainder elided by the diff context)
```

@@ -102,16 +102,16 @@ customCertificates:

In this section, there are two options:

- [**Replace**](#replace-entire-ca-certificatescrt-file-1) the entire `ca-certificates.crt` file
- [**Append**](#append-extra-certificates-to-the-existing-ca-certificatescrt-1) extra certificates to the existing `ca-certificates.crt`
### Replace Entire `ca-certificates.crt` File

If you need to replace the whole ca-bundle, you can attach a concatenation of all of your valid CA certificates in `pem` format, as
they are stored in a standard `ca-certificates.crt`.

```yaml
# -- Custom certificates
customCertificates:
  # -- Override system crt certificate bundle. Mutual exclusive with extraCerts.
@@ -128,13 +128,11 @@ customCertificates:
  ...
```
### Append Extra Certificates to the Existing `ca-certificates.crt`

You can add certificates to the existing CA bundle. Ensure each certificate has a unique `alias`.

```yaml
# -- Custom certificates
customCertificates:
  # -- Extra certs usable in case of needs of adding more certificates to the standard bundle, Requires root permissions to run update-ca-certificates. Mutual exclusive with overrideCaCertificatesCrt.
@@ -151,9 +149,11 @@ customCertificates:
  -----END CERTIFICATE-----
```
### Add Certificates to Task Pods

If your workloads need access to these certificates (e.g., for HTTPS requests), configure the agent to inject them into Pods:

```yaml
agentk8sglue:
  basePodTemplate:
    initContainers:
@@ -189,13 +189,13 @@ agentk8sglue:
      name: clearml-agent-clearml-enterprise-agent-custom-ca-1
```
The `clearml-extra-ca-certs` volume must include all `ConfigMap` resources generated by the agent for the custom certificates.
These `ConfigMaps` are automatically created by the Helm chart based on the number of certificates provided.
Their names are usually prefixed with the Helm release name, so adjust accordingly if you used a custom release name.

### Apply Changes

Apply the changes by running the update command:

```bash
helm upgrade -i -n <WORKER_NAMESPACE> clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml
```
@@ -1,12 +1,21 @@
---
title: SSO (Identity Provider) Setup
---

ClearML Enterprise Server supports various Single Sign-On (SSO) identity providers.
SSO configuration is managed via environment variables in your `clearml-values.override.yaml` file and applied to the `apiserver` component.

The following are configuration examples for commonly used providers. Other supported systems include:
* Auth0
* Keycloak
* Okta
* Azure AD
* Google
* AWS Cognito
## Auth0

```yaml
apiserver:
  extraEnvs:
    - name: CLEARML__secure__login__sso__oauth_client__auth0__client_id
      # ... (remainder elided by the diff context)
```

@@ -25,7 +34,7 @@ apiserver:

## Keycloak

```yaml
apiserver:
  extraEnvs:
    - name: CLEARML__secure__login__sso__oauth_client__keycloak__client_id
@@ -42,19 +51,22 @@ apiserver:
      value: "true"
```
## Group Membership Mapping in Keycloak

To map Keycloak groups into the ClearML user's SSO token:

1. Go to the **Client Scopes** tab.
1. Click on the first row `<clearml client>-dedicated`.
1. Click **Add Mapper > By configuration > Group membership**.
1. In the dialog:
   * Set **Name** to "groups".
   * Set **Token Claim Name** to "groups".
   * Uncheck the **Full group path** option.
   * Save the mapper.

To verify:

1. Return to the **Client Details > Client scope** tab.
1. Go to the **Evaluate** sub-tab and select a user who has any group memberships.
1. Go to **Generated ID token** and then to **Generated User Info**.
1. Inspect that in both cases you can see the group's claim in the displayed user data.
@@ -1,103 +1,106 @@
🟢 Ready

---
title: ClearML Dynamic MIG Operator (CDMO)
---

The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG GPU configurations.

## Installation

### Requirements

* Install the official NVIDIA `gpu-operator` using Helm with one of the following configurations.

* Add and update the Nvidia Helm repo:

  ```bash
  helm repo add nvidia https://nvidia.github.io/gpu-operator
  helm repo update
  ```
* Create a `gpu-operator.override.yaml` file with the following content:

  ```yaml
  migManager:
    enabled: false
  mig:
    strategy: mixed
  toolkit:
    env:
      - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
        value: "false"
      - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
        value: "true"
  devicePlugin:
    env:
      - name: PASS_DEVICE_SPECS
        value: "true"
      - name: FAIL_ON_INIT_ERROR
        value: "true"
      - name: DEVICE_LIST_STRATEGY # Use volume-mounts
        value: volume-mounts
      - name: DEVICE_ID_STRATEGY
        value: uuid
      - name: NVIDIA_VISIBLE_DEVICES
        value: all
      - name: NVIDIA_DRIVER_CAPABILITIES
        value: all
  ```

* Install the official NVIDIA `gpu-operator` using Helm with the previous configuration:

  ```bash
  helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
  ```
### Installing CDMO

* Create a `cdmo-values.override.yaml` file with the following content:

  ```yaml
  imageCredentials:
    password: "<CLEARML_DOCKERHUB_TOKEN>"
  ```

* Install the CDMO Helm Chart using the previous override file:

  ```bash
  helm install -n cdmo cdmo clearml-enterprise/clearml-dynamic-mig-operator --create-namespace -f cdmo-values.override.yaml
  ```

* Enable the NVIDIA MIG support on your cluster by running the following command on all nodes with a MIG-supported GPU
  (run it for each GPU `<GPU_ID>` on the host):

  ```bash
  nvidia-smi -mig 1
  ```

  :::note notes
  * A node reboot may be required if the command output indicates so.
  * For convenience, this command can be issued from inside the `nvidia-device-plugin-daemonset` pod running on the related node.
  :::
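To confirm that MIG mode is active on a node, you can optionally query the current mode; this is a standard `nvidia-smi` query, added here as an illustrative check:

```bash
nvidia-smi --query-gpu=index,mig.mode.current --format=csv
```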
* Any MIG-enabled GPU node `<NODE_NAME>` from the last point must be labeled accordingly as follows:

  ```bash
  kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning=mig"
  ```
## Disabling MIGs

To disable MIG, follow these steps:

1. Ensure no running workflows are using GPUs on the target node(s).

2. Remove the CDMO label from the target node(s) to disable the dynamic MIG reconfiguration:

   ```bash
   kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning-"
   ```

3. Execute a shell into the `device-plugin-daemonset` pod instance running on the target node(s) and execute the following commands:

   ```bash
   nvidia-smi mig -dci

   nvidia-smi mig -dgi
@@ -105,27 +108,27 @@ For disabling MIG, follow these steps in order:
   nvidia-smi -mig 0
   ```
4. Edit the `gpu-operator.override.yaml` file to have a standard configuration for full GPUs, and upgrade the `gpu-operator`:

   ```yaml
   toolkit:
     env:
       - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
         value: "false"
       - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
         value: "true"
   devicePlugin:
     env:
       - name: PASS_DEVICE_SPECS
         value: "true"
       - name: FAIL_ON_INIT_ERROR
         value: "true"
       - name: DEVICE_LIST_STRATEGY # Use volume-mounts
         value: volume-mounts
       - name: DEVICE_ID_STRATEGY
         value: uuid
       - name: NVIDIA_VISIBLE_DEVICES
         value: all
       - name: NVIDIA_DRIVER_CAPABILITIES
         value: all
   ```
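The upgrade itself can then be applied with Helm, reusing the release name and namespace from the install command above:

```bash
helm upgrade -n gpu-operator gpu-operator nvidia/gpu-operator -f gpu-operator.override.yaml
```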
@@ -2,14 +2,14 @@
title: Install CDMO and CFGI on the same Cluster
---

In clusters with multiple nodes and varying GPU types, the `gpu-operator` can be used to manage different device configurations
and fractioning modes.

## Configuring the NVIDIA GPU Operator

The NVIDIA `gpu-operator` supports defining multiple configurations for the Device Plugin.

The following example YAML defines two configurations: "mig" and "ts" (time-slicing).

```yaml
migManager:
@@ -67,52 +67,50 @@ devicePlugin:
  migStrategy: mixed
```
## Applying Configuration to Nodes

To activate a configuration, label the Kubernetes node accordingly. After a node is labeled,
the NVIDIA `device-plugin` will automatically reload the new configuration.

Example usage:

* Apply the `mig` (MIG mode) config:

  ```bash
  kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=mig
  ```

* Apply the `ts` (time slicing) config:

  ```bash
  kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=ts
  ```

* Apply the `all-disabled` (standard full GPU access) config:

  ```bash
  kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=all-disabled
  ```
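To check which configuration is currently applied to each node, you can list the label as a column; this is standard `kubectl` usage, shown here as an optional check:

```bash
kubectl get nodes -L nvidia.com/device-plugin.config
```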
## Installing CDMO and CFGI

After configuring the NVIDIA `gpu-operator` and labeling nodes, proceed with the standard [CDMO](cdmo.md) and [CFGI](cfgi.md)
installations.
## Disabling Configurations

### Time Slicing

To switch between time-slicing and full GPU access, update the node label using the `--overwrite` flag, for example:
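Assuming the `ts` and `all-disabled` configs defined above:

```bash
# Revert a node to standard full GPU access
kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=all-disabled --overwrite

# Switch it back to time slicing
kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=ts --overwrite
```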
### MIG

To disable MIG mode:

1. Ensure there are no more running workflows requesting any form of GPU on the node(s) before re-configuring it.
2. Remove the CDMO label from the target node(s) to disable the dynamic MIG reconfiguration:

   ```bash
   kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning-"
   ```

3. Execute a shell in the `device-plugin-daemonset` Pod instance running on the target node(s) and execute the following commands:

   ```bash
   nvidia-smi mig -dci
@@ -122,7 +120,7 @@ To disable MIG, follow these steps:
   nvidia-smi -mig 0
   ```
4. Relabel the target node to disable MIG:

   ```bash
   kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=all-disabled --overwrite
   ```
@@ -167,36 +167,39 @@ helm install -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector --c

## Usage

To use fractional GPUs, label your pod with:

```yaml
labels:
  clearml-injector/fraction: "<GPU_FRACTION_VALUE>"
```

Valid values for `"<GPU_FRACTION_VALUE>"` include:
* Fractions:
  * "0.0625" (1/16th)
  * "0.125" (1/8th)
  * "0.250"
  * "0.375"
  * "0.500"
  * "0.625"
  * "0.750"
  * "0.875"
* Integer representations of whole GPUs, such as `1.000`, `2`, `2.0`, etc.
### ClearML Enterprise Agent Configuration

To run ClearML jobs that request specific GPU fractions, configure the queues in your `clearml-agent-values.override.yaml` file.

Each queue should include a `templateOverride` that sets the `clearml-injector/fraction` label, which determines the
fraction of a GPU to allocate (e.g., "0.500" for half a GPU).

This label is used by the CFGI to assign the correct portion of GPU resources to the pod running the task.

#### CFGI Version >= 1.3.0

Starting from version 1.3.0, there is no need to specify the resources field. You only need to set the labels:

```yaml
agentk8sglue:
  # ... (remainder of the example is elided by the diff context)
```
@@ -222,6 +225,8 @@ agentk8sglue:

#### CFGI Version < 1.3.0

For versions older than 1.3.0, the GPU limits must be defined:

```yaml
agentk8sglue:
  createQueues: true
  # ... (remainder of the example is elided by the diff context)
```
@@ -256,24 +261,22 @@ agentk8sglue:

## Upgrading Chart

To upgrade to the latest version of this chart:

```bash
helm repo update
helm upgrade -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector
```

To apply changes to values on an existing installation:

```bash
helm upgrade -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector -f cfgi-values.override.yaml
```

## Disabling Fractions

To revert to standard GPU scheduling (without time slicing), remove the `devicePlugin.config` section from the `gpu-operator.override.yaml`
file and upgrade the `gpu-operator`:

```yaml
# ... (the override example is elided by the diff context)
```