🟡 Ready, but missing hyperlinks (see TODOs)

TODO:
- Link: GPU Operator
- Link: Additional configurations
- Link: Now proceed with AI App Gateway

---
title: ClearML Agent on Kubernetes
---

The ClearML Agent allows scheduling distributed experiments on a Kubernetes cluster.

## Prerequisites

- The ClearML Enterprise server is up and running.
- Create a set of `<ACCESS_KEY>` and `<SECRET_KEY>` credentials in the ClearML Server. The easiest way to do so is from
  the ClearML UI (**Settings > Workspace > App Credentials > Create new credentials**).

:::note
Make sure that the generated keys belong to an admin user or a service user with admin privileges.
:::

- The worker environment must be able to communicate with the ClearML Server over the same network (see the connectivity check below).
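
For example, you can sanity-check connectivity from the worker environment with a plain HTTP call to the API server (a minimal check, assuming your control-plane API is reachable at `api.<BASE_DOMAIN>`):

```bash
# Expect a JSON response from the ClearML API server
curl https://api.<BASE_DOMAIN>/debug.ping
```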

## Installation

### Add the Helm Repo Locally

Add the ClearML Helm repository:

```bash
helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <HELM_REPO_TOKEN> --password <HELM_REPO_TOKEN>
```

Update the repository locally:

```bash
helm repo update
```

### Prepare Values

Create a `clearml-agent-values.override.yaml` file with the following content:

:::note
In the following configuration, replace the `<ACCESS_KEY>` and `<SECRET_KEY>` placeholders with the admin credentials
you have generated on the ClearML Server. The values for `<api|file|web>ServerUrlReference` should point to your ClearML
control-plane installation.
:::

```yaml
imageCredentials:
  password: "<CLEARML_DOCKERHUB_TOKEN>"
clearml:
  agentk8sglueKey: "<ACCESS_KEY>"
  agentk8sglueSecret: "<SECRET_KEY>"
agentk8sglue:
  apiServerUrlReference: "<CLEARML_API_SERVER_REFERENCE>"
  fileServerUrlReference: "<CLEARML_FILE_SERVER_REFERENCE>"
  webServerUrlReference: "<CLEARML_WEB_SERVER_REFERENCE>"
  createQueues: true
  queues:
    exampleQueue:
      templateOverrides: {}
      queueSettings: {}
```

### Install the Chart

Install the ClearML Enterprise Agent Helm chart using the previous values override file:

```bash
helm upgrade -i -n <WORKER_NAMESPACE> clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml
```

## Additional Configuration Options

:::note
You can view the full set of available and documented values of the chart by running the following commands:

```bash
helm show readme clearml-enterprise/clearml-enterprise-agent
# or
helm show values clearml-enterprise/clearml-enterprise-agent
```
:::

### Report GPUs in the Dashboard

The Agent should explicitly report the total number of GPUs available on the cluster for it to appear in the dashboard reporting:

```yaml
agentk8sglue:
  # -- Agent reporting to Dashboard max GPU available. This will report 2 GPUs.
  dashboardReportMaxGpu: 2
```

### Queues

The ClearML Agent in Kubernetes monitors ClearML queues and pulls tasks that are scheduled for execution.

A single agent can monitor multiple queues, all sharing a base Pod template (`agentk8sglue.basePodTemplate`) that is
used when submitting a task to Kubernetes after it has been extracted from the queue.

Each queue can be configured with a dedicated Pod template spec override (`templateOverrides`). This way, queue definitions
can be mixed and matched to serve multiple use cases.

The following are a few examples of agent queue templates.

#### GPU Queues

:::note
Configuring and using GPU queues with the ClearML Enterprise Agent requires deploying the NVIDIA GPU Operator
on your Kubernetes cluster.
For more information, refer to the [GPU Operator](https://TODO) page.
:::

```yaml
agentk8sglue:
  createQueues: true
  queues:
    1xGPU:
      templateOverrides:
        resources:
          limits:
            nvidia.com/gpu: 1
    2xGPU:
      templateOverrides:
        resources:
          limits:
            nvidia.com/gpu: 2
```

#### Override a Pod Template by Queue

In the following example:

- The `red` queue will inherit both the label `team=red` and the 1Gi memory limit from the `basePodTemplate` section.
- The `blue` queue will set the label `team=blue`, but will inherit the 1Gi memory limit from the `basePodTemplate` section.
- The `green` queue will set both the label `team=green` and a 2Gi memory limit. It will also set an annotation and an environment variable.

```yaml
agentk8sglue:
  # Defines common template
  basePodTemplate:
    labels:
      team: red
    resources:
      limits:
        memory: 1Gi
  createQueues: true
  queues:
    red:
      # Does not override
      templateOverrides: {}
    blue:
      # Overrides labels
      templateOverrides:
        labels:
          team: blue
    green:
      # Overrides labels and resources, plus sets new fields
      templateOverrides:
        labels:
          team: green
        annotations:
          example: "example value"
        resources:
          limits:
            memory: 2Gi
        env:
          - name: MY_ENV
            value: "my_value"
```

## Next Steps

Once the ClearML Enterprise Agent is up and running, proceed with deploying the ClearML Enterprise App Gateway.

TODO link to the AI App Gateway page in documentation

🟢 Ready

TODO in future:
- Add NFS Example - https://allegroai.atlassian.net/wiki/x/AoCUiQ?atlOrigin=eyJpIjoiMjNiNTcxYTJiMzUxNDVhMThiODlhMTcwYzE1YWE3ZTUiLCJwIjoiYyJ9

---

# Dynamically Edit Task Pod Template

The ClearML Enterprise Agent supports defining custom Python code that interacts with a Task's Pod template before it is applied to Kubernetes.

This makes it possible to dynamically edit a Task Pod manifest in the context of a ClearML Enterprise Agent, which can be useful in a variety of scenarios, such as customizing fields based on variables.

# Agent Configuration

The `CLEARML_K8S_GLUE_TEMPLATE_MODULE` environment variable indicates a Python module, and a function inside that module, for the ClearML Enterprise Agent to run before applying a Task Pod template. The Agent runs this code from its own context, passes a number of arguments (including the actual template) to the function, and uses the returned template to create the final Task Pod in Kubernetes.

Arguments passed to the function include:

- `queue` - ID of the queue (string) from which the task was pulled.
- `queue_name` - Name of the queue (string) from which the task was pulled.
- `template` - Base template (Python dictionary) created from the agent's values, with any specific overrides for the queue from which the task was pulled.
- `task_data` - Task data structure (object) containing the task's information (as returned by the `tasks.get_by_id` API call). For example, use `task_data.project` to get the task's project ID.
- `providers_info` - Providers info (dictionary) containing optional information collected for the user running this task when the user logged into the system (requires additional server configuration).
- `task_config` - Task configuration (`clearml_agent.backend_config.Config` object) containing the configuration used to run this task. This includes any overrides added in Vaults applicable to the user running this task. Use `task_config.get("...")` to get specific configuration values.
- `worker` - The agent Python object, in case custom calls are required.

## Usage

Edit the `clearml-agent-values.override.yaml` file adding the following:

```yaml
agentk8sglue:
  extraEnvs:
    - name: CLEARML_K8S_GLUE_TEMPLATE_MODULE
      value: "custom_code:update_template"
  fileMounts:
    - name: "custom_code.py"
      folderPath: "/root"
      fileContent: |-
        import json
        from pprint import pformat

        def update_template(queue, task_data, providers_info, template, task_config, worker, queue_name, *args, **kwargs):
            print(pformat(template))

            my_var_name = "foo"
            my_var_value = "bar"

            try:
                template["spec"]["containers"][0]["env"][0]["name"] = str(my_var_name)
                template["spec"]["containers"][0]["env"][0]["value"] = str(my_var_value)
            except KeyError as ex:
                raise Exception("Failed modifying template: {}".format(ex))

            return {"template": template}
```

## Notes

**Note**: Make sure to include `*args, **kwargs` at the end of the function's argument list and to only use keyword arguments. This is needed to maintain backward compatibility and to make sure that named arguments added, or argument order changes, in new agent versions won't affect your implementation.

**Note**: A custom code module can be included as a file in the Pod's container, and the environment variable can then simply point to the file and entry point.

**Note**: When a custom code module is defined, by default the ClearML Enterprise Agent will start watching Pods in all namespaces across the cluster. If you do not intend to grant a ClusterRole permission, make sure to set the `CLEARML_K8S_GLUE_MONITOR_ALL_NAMESPACES` env var to `"0"` to prevent the ClearML Enterprise Agent from trying to list Pods in all namespaces. Set it to `"1"` instead if namespace-related changes are needed in the code.

```yaml
agentk8sglue:
  extraEnvs:
    - name: CLEARML_K8S_GLUE_MONITOR_ALL_NAMESPACES
      value: "0"
```

**Note**: If you want to modify the Bash script used to start the Task Pod or the Agent, use the following values instead:

```yaml
agentk8sglue:
  # -- Custom Bash script for the Agent pod ran by Glue Agent
  customBashScript: ""
  # -- Custom Bash script for the Task Pods ran by Glue Agent
  containerCustomBashScript: ""
```

# Examples

## Example – Edit Template based on ENV var

```yaml
agentk8sglue:
  extraEnvs:
    - name: CLEARML_K8S_GLUE_TEMPLATE_MODULE
      value: "custom_code:update_template"
  fileMounts:
    - name: "custom_code.py"
      folderPath: "/root"
      fileContent: |-
        import json
        from pprint import pformat

        def update_template(queue, task_data, providers_info, template, task_config, worker, queue_name, *args, **kwargs):
            print(pformat(template))

            my_var = "some_var"

            try:
                template["spec"]["initContainers"][0]["command"][-1] = \
                    template["spec"]["initContainers"][0]["command"][-1].replace("MY_VAR", str(my_var))
                template["spec"]["containers"][0]["volumeMounts"][0]["subPath"] = str(my_var)
            except KeyError as ex:
                raise Exception("Failed modifying template with MY_VAR: {}".format(ex))

            return {"template": template}
  basePodTemplate:
    initContainers:
      - name: myInitContainer
        image: docker/ubuntu:18.04
        command:
          - /bin/bash
          - -c
          - >
            echo MY_VAR;
    volumeMounts:
      - name: myTemplatedMount
        mountPath: MY_VAR
    volumes:
      - name: myTemplatedMount
        emptyDir: {}
```

## Example – NFS Mount Path

```yaml
agentk8sglue:
  extraEnvs:
    - name: CLEARML_K8S_GLUE_TEMPLATE_MODULE
      value: "custom_code:update_template"
  fileMounts:
    - name: "custom_code.py"
      folderPath: "/root"
      fileContent: |-
        import json
        from pprint import pformat

        def update_template(queue, task_data, providers_info, template, task_config, worker, queue_name, *args, **kwargs):
            nfs = task_config.get("nfs")
            # ad_role = providers_info.get("ad-role")
            if nfs:  # and ad_role == "some-value"
                print(pformat(template))

                try:
                    template["spec"]["containers"][0]["volumeMounts"].append(
                        {"name": "custom-mount", "mountPath": nfs.get("mountPath")}
                    )
                    # Volumes belong to the Pod spec (not to the container)
                    template["spec"]["volumes"].append(
                        {"name": "custom-mount", "nfs": {"server": nfs.get("server.ip"), "path": nfs.get("server.path")}}
                    )
                except KeyError as ex:
                    raise Exception("Failed modifying template: {}".format(ex))

            return {"template": template}
```

# Bind Additional Resources to Task Pod (PVC Example)

In this example, a dedicated PVC is dynamically created and attached to every Pod created from a dedicated queue, and then deleted after the Pod is deleted.

The following code block is commented to explain the context.

The key points are:

- The `CLEARML_K8S_GLUE_POD_PRE_APPLY_CMD` and `CLEARML_K8S_GLUE_POD_POST_DELETE_CMD` env vars let you define custom Bash hooks that the Agent executes around the main apply command, such as creating and deleting a PVC object.
- The `CLEARML_K8S_GLUE_TEMPLATE_MODULE` env var and a file mount let you define custom Python code in a specific context, useful for dynamically updating the main Pod template before the Agent applies it.

**Note**: This example uses a queue named `pvc-test`; make sure to replace all occurrences of it.

**Note**: `CLEARML_K8S_GLUE_POD_PRE_APPLY_CMD` can reference templated variables such as `{queue_name}`, `{pod_name}`, and `{namespace}`, which the Agent replaces with the actual values at execution time.

```yaml
agentk8sglue:
  # Bind a pre-defined custom 'custom-agent-role' Role with the ability to handle 'persistentvolumeclaims'
  additionalRoleBindings:
    - custom-agent-role
  extraEnvs:
    # Need this or permissions to list all namespaces
    - name: CLEARML_K8S_GLUE_MONITOR_ALL_NAMESPACES
      value: "0"
    # Executed before applying the Task Pod. Replace the $PVC_NAME placeholder in the manifest template with the Pod name and apply it, only in a specific queue.
    - name: CLEARML_K8S_GLUE_POD_PRE_APPLY_CMD
      value: "[[ {queue_name} == 'pvc-test' ]] && sed 's/\\$PVC_NAME/{pod_name}/g' /mnt/yaml-manifests/pvc.yaml | kubectl apply -n {namespace} -f - || echo 'Skipping PRE_APPLY PVC creation...'"
    # Executed after deleting the Task Pod. Delete the PVC.
    - name: CLEARML_K8S_GLUE_POD_POST_DELETE_CMD
      value: "kubectl delete pvc {pod_name} -n {namespace} || echo 'Skipping POST_DELETE PVC deletion...'"
    # Define a custom code module for updating the Pod template
    - name: CLEARML_K8S_GLUE_TEMPLATE_MODULE
      value: "custom_code:update_template"
  fileMounts:
    # Mount a PVC manifest file with a templated $PVC_NAME name
    - name: "pvc.yaml"
      folderPath: "/mnt/yaml-manifests"
      fileContent: |
        apiVersion: v1
        kind: PersistentVolumeClaim
        metadata:
          name: $PVC_NAME
        spec:
          resources:
            requests:
              storage: 5Gi
          volumeMode: Filesystem
          storageClassName: standard
          accessModes:
            - ReadWriteOnce
    # Custom code module for updating the Pod template
    - name: "custom_code.py"
      folderPath: "/root"
      fileContent: |-
        import json
        from pprint import pformat

        def update_template(queue, task_data, providers_info, template, task_config, worker, queue_name, *args, **kwargs):
            if queue_name == "pvc-test":
                # Set PVC_NAME as the name of the Pod
                PVC_NAME = f"clearml-id-{task_data.id}"
                try:
                    # Replace the claimName placeholder with a dynamic value
                    template["spec"]["volumes"][0]["persistentVolumeClaim"]["claimName"] = str(PVC_NAME)
                except KeyError as ex:
                    raise Exception("Failed modifying template with PVC_NAME: {}".format(ex))
            # Return the edited template
            return {"template": template}
  createQueues: true
  queues:
    # Define a queue with an override `volumes` and `volumeMounts` section for binding a PVC
    pvc-test:
      templateOverrides:
        volumes:
          - name: task-pvc
            persistentVolumeClaim:
              # PVC_NAME placeholder. This will get replaced in the custom code module.
              claimName: PVC_NAME
        volumeMounts:
          - mountPath: "/tmp/task/"
            name: task-pvc
```

Example of a `custom-agent-role` Role with permissions to handle `persistentvolumeclaims`:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: custom-agent-role
rules:
  - apiGroups:
      - ""
    resources:
      - persistentvolumeclaims
    verbs:
      - get
      - list
      - watch
      - create
      - patch
      - delete
```

🟡 Ready, missing link

---
TODO:
- Link: fractional GPUs

---

# Basic Deployment - Suggested GPU Operator Values

## Add the Helm Repo Locally

Add the NVIDIA Helm repository:

```bash
helm repo add nvidia https://nvidia.github.io/gpu-operator
```

Update the repository locally:

```bash
helm repo update
```

## Installation

As mentioned by NVIDIA, this configuration is needed to prevent unprivileged containers from bypassing the Kubernetes Device Plugin API.

Create a `gpu-operator.override.yaml` file with the following content:

```yaml
toolkit:
  env:
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
      value: "false"
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
      value: "true"
devicePlugin:
  env:
    - name: PASS_DEVICE_SPECS
      value: "true"
    - name: FAIL_ON_INIT_ERROR
      value: "true"
    - name: DEVICE_LIST_STRATEGY # Use volume-mounts
      value: volume-mounts
    - name: DEVICE_ID_STRATEGY
      value: uuid
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
```

Install the gpu-operator:

```bash
helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
```

# Fractioning

For fractional GPU support, refer to the dedicated guides.

TODO link to the fractional_gpus directory page in documentation

🟢 Ready

---

# Multi-Node Training

The ClearML Enterprise Agent supports horizontal multi-node training executions. The following is a configuration example (in `clearml-agent-values.override.yaml`):

```yaml
agentk8sglue:
  # Cluster access is required to run multi-node Tasks
  serviceAccountClusterAccess: true
  multiNode:
    enabled: true
  createQueues: true
  queues:
    multi-node-example:
      queueSettings:
        # Defines the distribution of GPU Tasks across multiple nodes. The format [x, y, ...] specifies the distribution of Tasks as 'x' GPUs on one node and 'y' GPUs on another node. Multiple Pods will be spawned respectively, based on the lowest common denominator defined.
        multiNode: [ 4, 2 ]
      templateOverrides:
        resources:
          limits:
            # Note: you will need to use the lowest common denominator of the GPU distribution defined in `queueSettings.multiNode`.
            nvidia.com/gpu: 2
```

🟡 Ready, but missing hyperlinks (see TODOs)

---
TODO:
Control Plane:
- Link: basic k8s installation
- Link: SSO login
- Additional envs for control-plane multi-tenancy

Workers:
- Link: basic Agent installation
- Link: basic AI App Gateway installation

---

# Multi-Tenancy

## Control Plane

For installing the ClearML control-plane, follow this guide (TODO link to the basic_k8s_installation page).

Update the Server's `clearml-values.override.yaml` with the following values:

```yaml
apiserver:
  extraEnvs:
    - name: CLEARML__services__organization__features__user_management_advanced
      value: "true"
    - name: CLEARML__services__auth__ui_features_per_role__user__show_datasets
      value: "false"
    - name: CLEARML__services__auth__ui_features_per_role__user__show_orchestration
      value: "false"
    - name: CLEARML__services__workers__resource_usages__supervisor_company
      value: "d1bd92a3b039400cbafc60a7a5b1e52b" # Default company
    - name: CLEARML__secure__credentials__supervisor__role
      value: "system"
    - name: CLEARML__secure__credentials__supervisor__allow_login
      value: "true"
    - name: CLEARML__secure__credentials__supervisor__user_key
      value: "<SUPERVISOR_USER_KEY>"
    - name: CLEARML__secure__credentials__supervisor__user_secret
      value: "<SUPERVISOR_USER_SECRET>"
    - name: CLEARML__secure__credentials__supervisor__sec_groups
      value: "[\"users\", \"admins\", \"queue_admins\"]"
    - name: CLEARML__secure__credentials__supervisor__email
      value: "\"<SUPERVISOR_USER_EMAIL>\""
    - name: CLEARML__apiserver__company__unique_names
      value: "true"
```

The credentials specified in `<SUPERVISOR_USER_KEY>` and `<SUPERVISOR_USER_SECRET>` can be used to log in as the supervisor user from the ClearML Web UI, accessible at the URL `app.<BASE_DOMAIN>`.

Note that the `<SUPERVISOR_USER_EMAIL>` value must be explicitly quoted. To do so, put `\"` around the quoted value, for example `"\"email@example.com\""`.

You will likely also want to configure SSO. For this, follow the "SSO (Identity Provider) Setup" guide (TODO link to the sso-login page).

### Create a Tenant

This section describes the steps required to create a new tenant on the ClearML control-plane server using a series of API calls.

Note that placeholders (`<PLACEHOLDER>`) in the following configuration should be substituted with valid values based on your installation.

#### Create a new Tenant in the ClearML Control-plane

*Define variables to use in the next steps:*

```bash
APISERVER_URL="https://api.<BASE_DOMAIN>"
APISERVER_KEY=<APISERVER_KEY>
APISERVER_SECRET=<APISERVER_SECRET>
```

**Note**: The apiserver key and secret should be the same as those used for installing the ClearML Enterprise server chart.

*Create a Tenant (company):*

```bash
curl $APISERVER_URL/system.create_company \
  -H "Content-Type: application/json" \
  -u $APISERVER_KEY:$APISERVER_SECRET \
  -d '{"name":"<TENANT_NAME>"}'
```

The result returns the new Company ID (`<COMPANY_ID>`).

If needed, list existing tenants (companies) using:

```bash
curl -u $APISERVER_KEY:$APISERVER_SECRET $APISERVER_URL/system.get_companies
```

*Create an Admin User for the new tenant:*

```bash
curl $APISERVER_URL/auth.create_user \
  -H "Content-Type: application/json" \
  -u $APISERVER_KEY:$APISERVER_SECRET \
  -d '{"name":"<ADMIN_USER_NAME>","company":"<COMPANY_ID>","email":"<ADMIN_USER_EMAIL>","role":"admin","internal":"true"}'
```

The result returns the new User ID (`<USER_ID>`).

*Create Credentials for the new Admin User:*

```bash
curl $APISERVER_URL/auth.create_credentials \
  -H "Content-Type: application/json" \
  -H "X-Clearml-Impersonate-As: <USER_ID>" \
  -u $APISERVER_KEY:$APISERVER_SECRET
```

The result returns a set of key and secret credentials associated with the new Admin User.

**Note**: You can use this set of credentials to set up an Agent or App Gateway for the newly created Tenant.

#### Create IDP/SSO sign-in rules

To map new users signing into the system to existing tenants, use one or more of the following methods to route new users (based on their email address) to an existing tenant.

*Route emails to a tenant based on the email domain:*

This instructs the server to assign any new user whose email domain matches the domain provided below to this specific tenant.

Note that providing the same domain name for multiple tenants will result in unstable behavior and should be avoided.

```bash
curl $APISERVER_URL/login.set_domains \
  -H "Content-Type: application/json" \
  -H "X-Clearml-Act-As: <USER_ID>" \
  -u $APISERVER_KEY:$APISERVER_SECRET \
  -d '{"domains":["<USERS_EMAIL_DOMAIN>"]}'
```

`<USERS_EMAIL_DOMAIN>` is the email domain set up for users to access through SSO.

*Route specific email(s) to a tenant:*

This instructs the server to assign any new user whose email is found in this list to this specific tenant. You can use the `is_admin` property to choose whether these users will be set as admins in this tenant upon login.

Note that you can create more than one list per tenant (using multiple API calls), for example one list for admin users and another for non-admin users.

Note that including the same email address in more than a single tenant's list will result in unstable behavior and should be avoided.

```bash
curl $APISERVER_URL/login.add_whitelist_entries \
  -H "Content-Type: application/json" \
  -H "X-Clearml-Act-As: <USER_ID>" \
  -u $APISERVER_KEY:$APISERVER_SECRET \
  -d '{"emails":["<email1>", "<email2>", ...],"is_admin":false}'
```

To remove existing email(s) from these lists, use the following API call. Note that this will not affect a user who has already logged in using one of these email addresses:

```bash
curl $APISERVER_URL/login.remove_whitelist_entries \
  -H "Content-Type: application/json" \
  -H "X-Clearml-Act-As: <USER_ID>" \
  -u $APISERVER_KEY:$APISERVER_SECRET \
  -d '{"emails":["<email1>", "<email2>", ...]}'
```

*Get the current login routing settings:*

To get the current IDP/SSO login rule settings for this tenant:

```bash
curl $APISERVER_URL/login.get_settings \
  -H "X-Clearml-Act-As: <USER_ID>" \
  -u $APISERVER_KEY:$APISERVER_SECRET
```

### Limit Features for all Users in a Group

The server's `clearml-values.override.yaml` can control some tenant configurations, limiting the features available to specific users or groups in the system.

Example: with the following configuration, all users in the "users" group will only have the `applications` feature enabled.

```yaml
apiserver:
  extraEnvs:
    - name: CLEARML__services__auth__default_groups__users__features
      value: "[\"applications\"]"
```

Available features:

- `applications` - Viewing and running applications
- `data_management` - Working with hyper-datasets and dataviews
- `experiments` - Viewing the experiment table and launching experiments
- `queues` - Viewing the queues screen
- `queue_management` - Creating and deleting queues
- `pipelines` - Viewing/managing pipelines in the system
- `reports` - Viewing and managing reports in the system
- `show_dashboard` - Show the dashboard screen
- `show_projects` - Show the projects menu option
- `resource_dashboard` - Display the resource dashboard in the orchestration page

## Workers

Refer to the following pages for installing and configuring the ClearML Enterprise Agent (TODO link to agent_k8s_installation) and App Gateway (TODO link to app-gateway).

**Note**: Make sure to set up the Agent and App Gateway using a tenant admin user's credentials, created with the tenant creation APIs described above.

### Tenants Separation

In multi-tenant setups, you can separate the tenants' workers in different namespaces.

Create a Kubernetes Namespace for each tenant and install a dedicated ClearML Agent and AI Application Gateway in each Namespace.

A tenant's Agent and Gateway need to be configured with credentials created on the ClearML server by a user of the same tenant.

Additional network separation can be achieved via Kubernetes Network Policies, as shown in the sketch below.
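
For example, a minimal deny-by-default ingress policy per tenant namespace might look like the following. This is only a sketch: `tenant-a` is a hypothetical namespace name, the policy must be adapted to the traffic your workloads actually need, and it requires a CNI plugin that enforces NetworkPolicies.

```bash
# Create the tenant namespace (if it does not exist yet)
kubectl create namespace tenant-a

# Allow ingress only from Pods within the same namespace
kubectl apply -n tenant-a -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-to-namespace
spec:
  podSelector: {}          # applies to all Pods in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}  # only Pods from this namespace may connect
EOF
```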

# ClearML Presign Service

The ClearML Presign Service is a secure component for generating and redirecting pre-signed storage URLs for authenticated users.

# Prerequisites

- The ClearML Enterprise server is up and running.
- Create a set of `<ACCESS_KEY>` and `<SECRET_KEY>` credentials in the ClearML Server. The easiest way to do so is from the ClearML UI (Settings > Workspace > App Credentials > Create new credentials).
  Note: Make sure that the generated keys belong to an admin user or a service user with admin privileges.
- The worker environment should be able to communicate with the ClearML Server over the same network.

# Installation

## Add the Helm Repo Locally

Add the ClearML Helm repository:

```bash
helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <HELM_REPO_TOKEN> --password <HELM_REPO_TOKEN>
```

Update the repository locally:

```bash
helm repo update
```

## Prepare Values

Create a `presign-service.override.yaml` override file, replacing the placeholders with your values:

```yaml
imageCredentials:
  password: "<CLEARML_DOCKERHUB_TOKEN>"
clearml:
  apiServerUrlReference: "<CLEARML_API_SERVER_URL>"
  apiKey: "<ACCESS_KEY>"
  apiSecret: "<SECRET_KEY>"
ingress:
  enabled: true
  hostName: "<PRESIGN_SERVICE_URL>"
```

## Install

Install the `clearml-presign-service` Helm chart in the same namespace as the ClearML Enterprise server:

```bash
helm install -n clearml clearml-presign-service clearml-enterprise/clearml-presign-service -f presign-service.override.yaml
```

## Configure ClearML Enterprise Server

After installing, edit the ClearML Enterprise `clearml-values.override.yaml` file, adding the following extra environment variable to the apiserver component (make sure to replace the `<PRESIGN_SERVICE_URL>` placeholder), and then perform a Helm upgrade of the server (see the example command below).

```yaml
apiserver:
  extraEnvs:
    - name: CLEARML__SERVICES__SYSTEM__COMPANY__DEFAULT__SERVICES
      value: "[{\"type\":\"presign\",\"url\":\"https://<PRESIGN_SERVICE_URL>\",\"use_fallback\":\"false\",\"match_sets\":[{\"rules\":[{\"field\":\"\",\"obj_type\":\"\",\"regex\":\"^s3://\"}]}]}]"
```
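
For example, the upgrade might look like the following. This is only a sketch: it assumes the server release is named `clearml`, lives in the `clearml` namespace, and was installed from the `clearml-enterprise/clearml-enterprise` chart; adjust the release, namespace, and chart names to match your installation.

```bash
helm upgrade -n clearml clearml clearml-enterprise/clearml-enterprise -f clearml-values.override.yaml
```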

# ClearML Tenant with Self-Signed Certificates

This guide covers the configuration required to support custom SSL certificates for the following components:

- ClearML Enterprise AI Application Gateway
- ClearML Enterprise Agent

## AI Application Gateway

Add the following section to the `clearml-app-gateway-values.override.yaml` file:

```yaml
# -- Custom certificates
customCertificates:
  # -- Override system crt certificate bundle. Mutually exclusive with extraCerts.
  overrideCaCertificatesCrt:
  # -- Extra certs, usable when more certificates need to be added to the standard bundle. Requires root permissions to run update-ca-certificates. Mutually exclusive with overrideCaCertificatesCrt.
  extraCerts:
    - alias: certificateName
      pem: |
        -----BEGIN CERTIFICATE-----
        ###
        -----END CERTIFICATE-----
```

This section offers two options:

- Replace the whole `ca-certificates.crt` file
- Add extra certificates to the existing `ca-certificates.crt`

These are described in detail below.

### Replace the whole `ca-certificates.crt` file

If you need to replace the whole CA bundle, attach a concatenation of all your valid CA certificates in PEM format, as they are stored in a standard `ca-certificates.crt`:

```yaml
# -- Custom certificates
customCertificates:
  # -- Override system crt certificate bundle. Mutually exclusive with extraCerts.
  overrideCaCertificatesCrt: |
    -----BEGIN CERTIFICATE-----
    ### CERT 1
    -----END CERTIFICATE-----
    -----BEGIN CERTIFICATE-----
    ### CERT 2
    -----END CERTIFICATE-----
    -----BEGIN CERTIFICATE-----
    ### CERT 3
    -----END CERTIFICATE-----
    ...
```

### Add extra certificates to the existing `ca-certificates.crt`

To add extra certificates to the standard CA bundle available in the container, define each certificate as an entry in the list.

Make sure each certificate has a unique alias.

```yaml
# -- Custom certificates
customCertificates:
  # -- Extra certs, usable when more certificates need to be added to the standard bundle. Requires root permissions to run update-ca-certificates. Mutually exclusive with overrideCaCertificatesCrt.
  extraCerts:
    - alias: certificate-name-1
      pem: |
        -----BEGIN CERTIFICATE-----
        ###
        -----END CERTIFICATE-----
    - alias: certificate-name-2
      pem: |
        -----BEGIN CERTIFICATE-----
        ###
        -----END CERTIFICATE-----
```

### Apply changes

After editing the values, apply the changes by running the upgrade command:

```bash
helm upgrade -i <RELEASE_NAME> -n <WORKLOAD_NAMESPACE> clearml-enterprise/clearml-enterprise-app-gateway --version <CHART_VERSION> -f clearml-app-gateway-values.override.yaml
```

## ClearML Enterprise Agent

Add the following section to the `clearml-agent-values.override.yaml` file:

```yaml
# -- Custom certificates
customCertificates:
  # -- Override system crt certificate bundle. Mutually exclusive with extraCerts.
  overrideCaCertificatesCrt:
  # -- Extra certs, usable when more certificates need to be added to the standard bundle. Requires root permissions to run update-ca-certificates. Mutually exclusive with overrideCaCertificatesCrt.
  extraCerts:
    - alias: certificateName
      pem: |
        -----BEGIN CERTIFICATE-----
        ###
        -----END CERTIFICATE-----
```

This section offers two options:

- Replace the whole `ca-certificates.crt` file
- Add extra certificates to the existing `ca-certificates.crt`

These are described in detail below.

### Replace the whole `ca-certificates.crt` file

If you need to replace the whole CA bundle, attach a concatenation of all your valid CA certificates in PEM format, as they are stored in a standard `ca-certificates.crt`:

```yaml
# -- Custom certificates
customCertificates:
  # -- Override system crt certificate bundle. Mutually exclusive with extraCerts.
  overrideCaCertificatesCrt: |
    -----BEGIN CERTIFICATE-----
    ### CERT 1
    -----END CERTIFICATE-----
    -----BEGIN CERTIFICATE-----
    ### CERT 2
    -----END CERTIFICATE-----
    -----BEGIN CERTIFICATE-----
    ### CERT 3
    -----END CERTIFICATE-----
    ...
```

### Add extra certificates to the existing `ca-certificates.crt`

To add extra certificates to the standard CA bundle available in the container, define each certificate as an entry in the list.

Make sure each certificate has a unique alias.

```yaml
# -- Custom certificates
customCertificates:
  # -- Extra certs, usable when more certificates need to be added to the standard bundle. Requires root permissions to run update-ca-certificates. Mutually exclusive with overrideCaCertificatesCrt.
  extraCerts:
    - alias: certificate-name-1
      pem: |
        -----BEGIN CERTIFICATE-----
        ###
        -----END CERTIFICATE-----
    - alias: certificate-name-2
      pem: |
        -----BEGIN CERTIFICATE-----
        ###
        -----END CERTIFICATE-----
```

### Add certificates to Tasks

```yaml
agentk8sglue:
  basePodTemplate:
    initContainers:
      - command:
          - /bin/sh
          - -c
          - update-ca-certificates
        image: allegroai/clearml-enterprise-agent-k8s-base:<AGENT-VERSION-AVAILABLE-ON-REPO>
        imagePullPolicy: IfNotPresent
        name: init-task
        volumeMounts:
          - name: etc-ssl-certs
            mountPath: "/etc/ssl/certs"
          - name: clearml-extra-ca-certs
            mountPath: "/usr/local/share/ca-certificates"
    env:
      - name: REQUESTS_CA_BUNDLE
        value: /etc/ssl/certs/ca-certificates.crt
    volumeMounts:
      - name: etc-ssl-certs
        mountPath: "/etc/ssl/certs"
    volumes:
      - name: etc-ssl-certs
        emptyDir: {}
      - name: clearml-extra-ca-certs
        projected:
          defaultMode: 420
          sources:
            # List here the ConfigMaps created by the Agent chart; the number of entries depends on the number of certificates provided.
            - configMap:
                name: clearml-agent-clearml-enterprise-agent-custom-ca-0
            - configMap:
                name: clearml-agent-clearml-enterprise-agent-custom-ca-1
```

Note the `clearml-extra-ca-certs` volume: make sure to list every ConfigMap created by the Agent chart.

The ConfigMap names can differ based on the release name used during the installation.

### Apply changes

After editing the values, apply the changes by running the upgrade command:

```bash
helm upgrade -i -n <WORKER_NAMESPACE> clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml
```

# SSO (Identity Provider) Setup

The ClearML Enterprise Server supports various SSO options. The configuration values can be set in `clearml-values.override.yaml`.

The following are a few examples. Other supported providers include Auth0, Keycloak, Okta, Azure, Google, and Cognito.

## Auth0

```yaml
apiserver:
  extraEnvs:
    - name: CLEARML__secure__login__sso__oauth_client__auth0__client_id
      value: "<AUTH0_CLIENT_ID>"
    - name: CLEARML__secure__login__sso__oauth_client__auth0__client_secret
      value: "<AUTH0_CLIENT_SECRET>"
    - name: CLEARML__services__login__sso__oauth_client__auth0__base_url
      value: "<AUTH0_BASE_URL>"
    - name: CLEARML__services__login__sso__oauth_client__auth0__authorize_url
      value: "<AUTH0_AUTHORIZE_URL>"
    - name: CLEARML__services__login__sso__oauth_client__auth0__access_token_url
      value: "<AUTH0_ACCESS_TOKEN_URL>"
    - name: CLEARML__services__login__sso__oauth_client__auth0__audience
      value: "<AUTH0_AUDIENCE>"
```

## Keycloak

```yaml
apiserver:
  extraEnvs:
    - name: CLEARML__secure__login__sso__oauth_client__keycloak__client_id
      value: "<KC_CLIENT_ID>"
    - name: CLEARML__secure__login__sso__oauth_client__keycloak__client_secret
      value: "<KC_SECRET_ID>"
    - name: CLEARML__services__login__sso__oauth_client__keycloak__base_url
      value: "<KC_URL>/realms/<REALM_NAME>/"
    - name: CLEARML__services__login__sso__oauth_client__keycloak__authorize_url
      value: "<KC_URL>/realms/<REALM_NAME>/protocol/openid-connect/auth"
    - name: CLEARML__services__login__sso__oauth_client__keycloak__access_token_url
      value: "<KC_URL>/realms/<REALM_NAME>/protocol/openid-connect/token"
    - name: CLEARML__services__login__sso__oauth_client__keycloak__idp_logout
      value: "true"
```

### Note if using Groups Mapping

When configuring the OpenID client for ClearML:

- Navigate to the Client Scopes tab.
- Click on the first row, `<clearml client>-dedicated`.
- Click "Add Mapper" → "By configuration" and then select the "Group membership" option.
- In the opened dialog, enter the name "groups" and the Token claim name "groups".
- Uncheck the "Full group path" option and save the mapper.

To validate the configuration:

- Return to the Client Details → Client scope tab.
- Go to the Evaluate sub-tab and select a user who has any group memberships.
- On the right side, navigate to the Generated ID token and then to Generated User Info.
- Verify that in both cases the `groups` claim appears in the displayed user data.

🟢 Ready

---

# ClearML Dynamic MIG Operator (CDMO)

The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG GPU configurations.

# Installation

## Requirements

Install the official NVIDIA `gpu-operator` using Helm with the following configuration.

Add and update the NVIDIA Helm repo:

```bash
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
```

Create a `gpu-operator.override.yaml` file with the following content:

```yaml
migManager:
  enabled: false
mig:
  strategy: mixed
toolkit:
  env:
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
      value: "false"
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
      value: "true"
devicePlugin:
  env:
    - name: PASS_DEVICE_SPECS
      value: "true"
    - name: FAIL_ON_INIT_ERROR
      value: "true"
    - name: DEVICE_LIST_STRATEGY # Use volume-mounts
      value: volume-mounts
    - name: DEVICE_ID_STRATEGY
      value: uuid
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
```

Install the official NVIDIA `gpu-operator` using Helm with the previous configuration:

```bash
helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
```

## Install

Create a `cdmo-values.override.yaml` file with the following content:

```yaml
imageCredentials:
  password: "<CLEARML_DOCKERHUB_TOKEN>"
```

Install the CDMO Helm chart using the previous override file:

```bash
helm install -n cdmo cdmo clearml-enterprise/clearml-dynamic-mig-operator --create-namespace -f cdmo-values.override.yaml
```

Enable NVIDIA MIG support on your cluster by running the following command on all nodes with a MIG-supported GPU (run it for each GPU `<GPU_ID>` on the host):

```bash
nvidia-smi -i <GPU_ID> -mig 1
```

**NOTE**: The node might need to be rebooted if the previous command reports that a reboot is required.

**NOTE**: For convenience, this command can be issued from inside the `nvidia-device-plugin-daemonset` Pod running on the related node.

Every MIG-enabled GPU node `<NODE_NAME>` from the previous step must then be labeled as follows:

```bash
kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning=mig"
```

# Unconfigure MIGs

To disable MIG, follow these steps in order:

1. Ensure there are no running workflows requesting any form of GPU on the node(s) before reconfiguring them.

2. Remove the CDMO label from the target node(s) to disable the dynamic MIG reconfiguration:

   ```bash
   kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning-"
   ```

3. Open a shell into the `device-plugin-daemonset` Pod instance running on the target node(s) and execute the following commands in order:

   ```bash
   nvidia-smi mig -dci

   nvidia-smi mig -dgi

   nvidia-smi -mig 0
   ```

4. Edit the `gpu-operator.override.yaml` file to have a standard configuration for full GPUs and upgrade the `gpu-operator`:

   ```yaml
   toolkit:
     env:
       - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
         value: "false"
       - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
         value: "true"
   devicePlugin:
     env:
       - name: PASS_DEVICE_SPECS
         value: "true"
       - name: FAIL_ON_INIT_ERROR
         value: "true"
       - name: DEVICE_LIST_STRATEGY # Use volume-mounts
         value: volume-mounts
       - name: DEVICE_ID_STRATEGY
         value: uuid
       - name: NVIDIA_VISIBLE_DEVICES
         value: all
       - name: NVIDIA_DRIVER_CAPABILITIES
         value: all
   ```
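
   To apply the change, upgrade the `gpu-operator` release with the updated override file (using the same release name and namespace as during installation):

   ```bash
   helm upgrade -n gpu-operator gpu-operator nvidia/gpu-operator -f gpu-operator.override.yaml
   ```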

---
title: Install CDMO and CFGI on the same Cluster
---

In clusters with multiple nodes that have different GPU types, the `gpu-operator` can be used to manage different devices and
fractioning modes.

## Configuring the NVIDIA GPU Operator

The NVIDIA `gpu-operator` allows defining multiple configurations for the Device Plugin.

The following YAML values define two usable configs, "mig" and "ts" (time-slicing), plus an "all-disabled" default for vanilla full GPUs:

```yaml
migManager:
  enabled: false
mig:
  strategy: mixed
toolkit:
  env:
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
      value: "false"
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
      value: "true"
devicePlugin:
  enabled: true
  env:
    - name: PASS_DEVICE_SPECS
      value: "true"
    - name: FAIL_ON_INIT_ERROR
      value: "true"
    - name: DEVICE_LIST_STRATEGY
      value: volume-mounts
    - name: DEVICE_ID_STRATEGY
      value: uuid
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
  config:
    name: device-plugin-config
    create: true
    default: "all-disabled"
    data:
      all-disabled: |-
        version: v1
        flags:
          migStrategy: none
      ts: |-
        version: v1
        flags:
          migStrategy: none
        sharing:
          timeSlicing:
            renameByDefault: false
            failRequestsGreaterThanOne: false
            # Edit the following configuration as needed, adding as many GPU indices as there are cards installed on the host.
            resources:
              - name: nvidia.com/gpu
                rename: nvidia.com/gpu-0
                devices:
                  - "0"
                replicas: 8
      mig: |-
        version: v1
        flags:
          migStrategy: mixed
```

## Usage

The previously defined configurations need to be applied to Kubernetes nodes using labels. After a label is added to a node,
the NVIDIA `device-plugin` will automatically reconfigure it.

The following is an example using the previous configuration.

* Apply the MIG `mig` config:

  ```bash
  kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=mig
  ```

* Apply the time slicing `ts` config:

  ```bash
  kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=ts
  ```

* Apply the vanilla full GPUs `all-disabled` config:

  ```bash
  kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=all-disabled
  ```

## Install CDMO and CFGI

After configuring the NVIDIA `gpu-operator` and labeling nodes, proceed with the standard [CDMO](cdmo.md) and [CFGI](cfgi.md)
installation.

## Disabling

### Time Slicing

To toggle between time slicing and vanilla full GPUs, simply toggle the label value between `ts` and `all-disabled` using
the `--overwrite` flag in kubectl.

### MIG

To disable MIG, follow these steps:

1. Ensure there are no running workflows requesting any form of GPU on the node(s) before reconfiguring them.
2. Remove the CDMO label from the target node(s) to disable the dynamic MIG reconfiguration:

   ```bash
   kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning-"
   ```

3. Open a shell into the `device-plugin-daemonset` Pod instance running on the target node(s) and execute the following commands:

   ```bash
   nvidia-smi mig -dci

   nvidia-smi mig -dgi

   nvidia-smi -mig 0
   ```

4. Label the target node to disable MIG:

   ```bash
   kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=all-disabled --overwrite
   ```

---
title: ClearML Fractional GPU Injector (CFGI)
---

The **ClearML Enterprise Fractional GPU Injector** (CFGI) allows AI workloads to run on Kubernetes using non-MIG GPU
fractions, optimizing both hardware utilization and performance.

## Installation

### Add the Local ClearML Helm Repository

```bash
helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <GITHUB_TOKEN> --password <GITHUB_TOKEN>
helm repo update
```

### Requirements

* Install the official NVIDIA `gpu-operator` using Helm with one of the following configurations.
* The number of slices must be 8.
* Add and update the NVIDIA Helm repo:

  ```bash
  helm repo add nvidia https://nvidia.github.io/gpu-operator
  helm repo update
  ```

#### GPU Operator Configuration

##### For CFGI Version >= 1.3.0

1. Create a docker-registry secret named `clearml-dockerhub-access` in the `gpu-operator` Namespace, making sure to replace `<CLEARML_DOCKERHUB_TOKEN>`:

   ```bash
   kubectl create secret -n gpu-operator docker-registry clearml-dockerhub-access \
     --docker-server=docker.io \
     --docker-username=allegroaienterprise \
     --docker-password="<CLEARML_DOCKERHUB_TOKEN>" \
     --docker-email=""
   ```

2. Create a `gpu-operator.override.yaml` file as follows:
   * Set `devicePlugin.repository` to `docker.io/clearml`.
   * Configure `devicePlugin.config.data.renamed-resources.sharing.timeSlicing.resources` for each GPU index on the host.
   * Use the `nvidia.com/gpu-<INDEX>` format for the `rename` field, and set `replicas` to `8`.

   ```yaml
   gfd:
     imagePullSecrets:
       - "clearml-dockerhub-access"
   toolkit:
     env:
       - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
         value: "false"
       - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
         value: "true"
   devicePlugin:
     repository: docker.io/clearml
     image: k8s-device-plugin
     version: v0.17.1-gpu-card-selection
     imagePullPolicy: Always
     imagePullSecrets:
       - "clearml-dockerhub-access"
     env:
       - name: PASS_DEVICE_SPECS
         value: "true"
       - name: FAIL_ON_INIT_ERROR
         value: "true"
       - name: DEVICE_LIST_STRATEGY # Use volume-mounts
         value: volume-mounts
       - name: DEVICE_ID_STRATEGY
         value: uuid
       - name: NVIDIA_VISIBLE_DEVICES
         value: all
       - name: NVIDIA_DRIVER_CAPABILITIES
         value: all
     config:
       name: device-plugin-config
       create: true
       default: "renamed-resources"
       data:
         renamed-resources: |-
           version: v1
           flags:
             migStrategy: none
           sharing:
             timeSlicing:
               renameByDefault: false
               failRequestsGreaterThanOne: false
               # Edit the following configuration as needed, adding as many GPU indices as there are cards installed on the host.
               resources:
                 - name: nvidia.com/gpu
                   rename: nvidia.com/gpu-0
                   devices:
                     - "0"
                   replicas: 8
                 - name: nvidia.com/gpu
                   rename: nvidia.com/gpu-1
                   devices:
                     - "1"
                   replicas: 8
   ```

##### For CFGI version < 1.3.0 (Legacy GPU Operator)

Create a `gpu-operator.override.yaml` file:

```yaml
toolkit:
  env:
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
      value: "false"
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
      value: "true"
devicePlugin:
  env:
    - name: PASS_DEVICE_SPECS
      value: "true"
    - name: FAIL_ON_INIT_ERROR
      value: "true"
    - name: DEVICE_LIST_STRATEGY # Use volume-mounts
      value: volume-mounts
    - name: DEVICE_ID_STRATEGY
      value: uuid
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
  config:
    name: device-plugin-config
    create: true
    default: "any"
    data:
      any: |-
        version: v1
        flags:
          migStrategy: none
        sharing:
          timeSlicing:
            renameByDefault: false
            failRequestsGreaterThanOne: false
            resources:
              - name: nvidia.com/gpu
                replicas: 8
```

### Install

Install the NVIDIA `gpu-operator` using the previously created `gpu-operator.override.yaml` override file:

```bash
helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
```

Create a `cfgi-values.override.yaml` file with the following content:

```yaml
imageCredentials:
  password: "<CLEARML_DOCKERHUB_TOKEN>"
```

Install the CFGI Helm chart using the previous override file:

```bash
helm install -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector --create-namespace -f cfgi-values.override.yaml
```

## Usage

The Fractional GPU Injector injects its CUDA files into Pods that carry the following label:

```yaml
labels:
  clearml-injector/fraction: "<GPU_FRACTION_VALUE>"
```

where `<GPU_FRACTION_VALUE>` must be equal to one of the following values:

* "0.0625" (1/16th)
* "0.125" (1/8th)
* "0.250"
* "0.375"
* "0.500"
* "0.625"
* "0.750"
* "0.875"

Integer representations of whole GPUs, such as `1.000`, `2`, or `2.0`, are also valid values for `<GPU_FRACTION_VALUE>`.

### ClearML Enterprise Agent Configuration

To specify resource allocation when using the CFGI, set the following values configuration in `clearml-agent-values.override.yaml`.

The label `clearml-injector/fraction: "<GPU_FRACTION_VALUE>"` is required in order to specify the fraction of the GPU to be
assigned to the Pod started for a task pulled from the respective queue.

#### CFGI Version >= 1.3.0

Starting from version 1.3.0, there is no need to specify the resources field.

```yaml
agentk8sglue:
  createQueues: true
  queues:
    gpu-fraction-1_000:
      templateOverrides:
        labels:
          clearml-injector/fraction: "1.000"
    gpu-fraction-0_500:
      templateOverrides:
        labels:
          clearml-injector/fraction: "0.500"
    gpu-fraction-0_250:
      templateOverrides:
        labels:
          clearml-injector/fraction: "0.250"
    gpu-fraction-0_125:
      templateOverrides:
        labels:
          clearml-injector/fraction: "0.125"
```

#### CFGI Version < 1.3.0

```yaml
agentk8sglue:
  createQueues: true
  queues:
    gpu-fraction-1_000:
      templateOverrides:
        resources:
          limits:
            nvidia.com/gpu: 8
    gpu-fraction-0_500:
      templateOverrides:
        labels:
          clearml-injector/fraction: "0.500"
        resources:
          limits:
            nvidia.com/gpu: 4
    gpu-fraction-0_250:
      templateOverrides:
        labels:
          clearml-injector/fraction: "0.250"
        resources:
          limits:
            nvidia.com/gpu: 2
    gpu-fraction-0_125:
      templateOverrides:
        labels:
          clearml-injector/fraction: "0.125"
        resources:
          limits:
            nvidia.com/gpu: 1
```

## Upgrading Chart

### Upgrades / Values Upgrades

To update to the latest version of this chart:

```bash
helm repo update
helm upgrade -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector
```

To change values on an existing installation:

```bash
helm upgrade -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector -f cfgi-values.override.yaml
```

## Disable Fractions

To toggle between timeSlicing and vanilla full GPUs, remove the `devicePlugin.config` section from the `gpu-operator.override.yaml`
file and upgrade the `gpu-operator`:

```yaml
toolkit:
  env:
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
      value: "false"
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
      value: "true"
devicePlugin:
  env:
    - name: PASS_DEVICE_SPECS
      value: "true"
    - name: FAIL_ON_INIT_ERROR
      value: "true"
    - name: DEVICE_LIST_STRATEGY # Use volume-mounts
      value: volume-mounts
    - name: DEVICE_ID_STRATEGY
      value: uuid
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
```
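
Then upgrade the `gpu-operator` release with the updated override file (using the same release name and namespace as during installation):

```bash
helm upgrade -n gpu-operator gpu-operator nvidia/gpu-operator -f gpu-operator.override.yaml
```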