Add Enterprise Server deployment use cases

This commit is contained in:
revital 2025-05-14 14:57:35 +03:00
parent b21275c262
commit 7779945d3e
11 changed files with 346 additions and 531 deletions

View File

@ -1,26 +1,20 @@
🟡 Ready, but missing hyperlinks (see TODOs)
TODO:
- Link: GPU Operator
- Link: Additional configurations
- Link: Now proceed with AI App Gateway
---
title: ClearML Agent on Kubernetes
---
The ClearML Agent allows scheduling distributed experiments on a Kubernetes cluster.
The ClearML Agent enables scheduling and executing distributed experiments on a Kubernetes cluster.
## Prerequisites
- The ClearML Enterprise server is up and running.
- Create a set of `<ACCESS_KEY>` and `<SECRET_KEY>` credentials in the ClearML Server. The easiest way to do so is from
- Generate a set of `<ACCESS_KEY>` and `<SECRET_KEY>` credentials in the ClearML Server. The easiest way is via
the ClearML UI (**Settings > Workspace > App Credentials > Create new credentials**).
:::note
Make sure that the generated keys belong to an admin user or a service user with admin privileges.
Make sure these credentials belong to an admin user or a service user with admin privileges.
:::
- The worker environment should be able to communicate to the ClearML Server over the same network.
- The worker environment must be able to access the ClearML Server over the same network.
## Installation
@ -36,13 +30,13 @@ Update the repository locally:
helm repo update
```
### Prepare Values
### Create a Values Override File
Create a `clearml-agent-values.override.yaml` file with the following content:
:::note
In the following configuration, replace the `<ACCESS_KEY>` and `<SECRET_KEY>` placeholders with the admin credentials
you have generated on the ClearML Server. The values for `<api|file|web>ServerUrlReference` should point to your ClearML
Replace the `<ACCESS_KEY>` and `<SECRET_KEY>` with the admin credentials
you created earlier. Set `<api|file|web>ServerUrlReference` to the relevant URLs of your ClearML
control-plane installation.
:::
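A minimal override file of this kind might look like the following sketch. The key names shown are assumptions based on the agent chart's default values and may differ in your chart version; run `helm show values clearml-enterprise/clearml-enterprise-agent` for the authoritative list.
```yaml
clearml:
  # Admin credentials generated in the ClearML UI (see Prerequisites)
  agentk8sglueKey: "<ACCESS_KEY>"
  agentk8sglueSecret: "<SECRET_KEY>"

agentk8sglue:
  # URLs pointing to your ClearML control-plane installation
  apiServerUrlReference: "https://api.<BASE_DOMAIN>"
  fileServerUrlReference: "https://files.<BASE_DOMAIN>"
  webServerUrlReference: "https://app.<BASE_DOMAIN>"
```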
@ -73,19 +67,17 @@ helm upgrade -i -n <WORKER_NAMESPACE> clearml-agent clearml-enterprise/clearml-e
## Additional Configuration Options
:::note
You can view the full set of available and documented values of the chart by running the following command:
To view all configurable options for the Helm chart, run the following command:
```bash
helm show readme clearml-enterprise/clearml-enterprise-agent
# or
helm show values clearml-enterprise/clearml-enterprise-agent
```
:::
### Report GPUs in the Dashboard
### Report GPUs to the Dashboard
The Agent should explicitly report the total number of GPUs available on the cluster for it to appear in the dashboard reporting:
To show GPU availability in the dashboard, explicitly set the number of GPUs:
```yaml
agentk8sglue:
@ -95,23 +87,23 @@ agentk8sglue:
### Queues
The ClearML Agent in Kubernetes monitors ClearML queues and pulls tasks that are scheduled for execution.
The ClearML Agent on Kubernetes monitors ClearML queues and pulls tasks that are scheduled for execution.
A single agent can monitor multiple queues, each queue sharing a Pod template (`agentk8sglue.basePodTemplate`) to be
A single agent can monitor multiple queues. By default, the queues share a base pod template (`agentk8sglue.basePodTemplate`)
used when submitting a task to Kubernetes after it has been extracted from the queue.
Each queue can be configured with dedicated Pod template spec override (`templateOverrides`). This way queue definitions
can be mixed and matched to serve multiple use-cases.
can be tailored to different use cases.
The Following are a few examples of agent queue templates.
The following are a few examples of agent queue templates:
#### GPU Queues
#### Example: GPU Queues
GPU queue support requires deploying the NVIDIA GPU Operator on your Kubernetes cluster.
For more information, see [GPU Operator](extra_configs/gpu_operator.md).
:::note
The GPU queues configuration and usage from the ClearML Enterprise Agent requires deploying the Nvidia GPU Operator
on your Kubernetes cluster.
For more information, refer to the [GPU Operator](https://TODO) page.
:::
``` yaml
agentk8sglue:
@ -129,13 +121,14 @@ agentk8sglue:
nvidia.com/gpu: 2
```
#### Override a Pod Template by Queue
#### Example: Overriding Pod Templates per Queue
In the following example:
In this example:
- The `red` queue will inherit both the label `team=red` and the 1Gi memory limit from the `basePodTemplate` section.
- The `blue` queue will set the label `team=blue`, but will inherit the 1Gi memory from the `basePodTemplate` section.
- The `green` queue will set both the label `team=green` and a 2Gi memory limit. It will also set an annotation and an environment variable.
- The `red` queue inherits both the label `team=red` and the 1Gi memory limit from the `basePodTemplate` section.
- The `blue` queue overrides the label by setting it to `team=blue`, and inherits the 1Gi memory from the `basePodTemplate` section.
- The `green` queue overrides the label by setting it to `team=green`, and overrides the memory limit by setting it to 2Gi.
It also sets an annotation and an environment variable.
```yaml
agentk8sglue:
@ -173,9 +166,5 @@ agentk8sglue:
## Next Steps
Once the ClearML Enterprise Agent is up and running, proceed with deploying the ClearML Enterprise App Gateway.
Once the agent is up and running, proceed with deploying the [ClearML Enterprise App Gateway](appgw_install_k8s.md).
TODO link to the AI App Gateway page in documentation

View File

@ -1,40 +1,38 @@
🟢 Ready
TODO in future:
- Add NFS Example - https://allegroai.atlassian.net/wiki/x/AoCUiQ?atlOrigin=eyJpIjoiMjNiNTcxYTJiMzUxNDVhMThiODlhMTcwYzE1YWE3ZTUiLCJwIjoiYyJ9
---
title: Dynamically Edit Task Pod Template
---
# Dynamically Edit Task Pod Template
The ClearML Enterprise Agent supports defining custom Python code to modify a task's Pod template before it is applied
to Kubernetes.
The ClearML Enterprise Agent supports defining custom Python code for interacting with a Task's Pod template before it gets applied to Kubernetes.
This enables dynamic customization of Task Pod manifests in the context of a ClearML Enterprise Agent, which is useful
for injecting values or changing configurations based on runtime context.
This allows to dynamically edit a Task Pod manifest in the context of a ClearML Enterprise Agent and can be useful in a variety of scenarios such as customizing fields based on variables.
## Agent Configuration
# Agent Configuration
The `CLEARML_K8S_GLUE_TEMPLATE_MODULE` environment variable defines the Python module and function inside that
module that the ClearML Enterprise Agent should invoke before applying a Task Pod template.
The `CLEARML_K8S_GLUE_TEMPLATE_MODULE` environment variable is used to indicate a Python module and function inside that module for the ClearML Enterprise Agent to run before applying a Task Pod template. The Agent will run this code from its own context, pass some arguments (including the actual template) to the function and use the returned template to create the final Task Pod in Kubernetes.
The Agent will run this code in its own context, pass arguments (including the actual template) to the function, and use
the returned template to create the final Task Pod in Kubernetes.
Arguments passed to the function include:
`queue` - ID of the queue (string) from which the task was pulled.
* `queue` (string) - ID of the queue from which the task was pulled.
* `queue_name` (string) - Name of the queue from which the task was pulled.
* `template` (Python dictionary) - Base Pod template created from the agent's configuration and any queue-specific overrides.
* `task_data` (object) - Task data object (as returned by the `tasks.get_by_id` API call). For example, use `task_data.project` to get the task's project ID.
* `providers_info` (dictionary) - Provider info containing optional information collected for the user running this task
when the user logged into the system (requires additional server configuration).
* `task_config` (`clearml_agent.backend_config.Config` object) - Task configuration containing configuration vaults applicable
for the user running this task, and other configuration. Use `task_config.get("...")` to get specific configuration values.
* `worker` - The agent Python object in case custom calls might be required.
`queue_name` - Name of the queue (string) from which the task was pulled.
### Usage
`template` - Base template (python dictionary) created from the agent’s values, with any specific overrides for the queue from which the task was pulled.
Update `clearml-agent-values.override.yaml` to include:
`task_data` - Task data structure (object) containing the task’s information (as returned by the tasks.get_by_id API call). For example, use task_data.project to get the task’s project ID.
`providers_info` - Providers info (dictionary) containing optional information collected for the user running this task when the user logged into the system (requires additional server configuration).
`task_config` - Task configuration (clearml_agent.backend_config.Config object) containing the configuration used to run this task. This includes any overrides added in Vaults applicable for the user running this task. Use task_config.get("...") to get specific configuration values.
`worker` - the agent Python object, in case custom calls might be required.
## Usage
Edit the `clearml-agent-values.override.yaml` file adding the following:
``` yaml
```yaml
agentk8sglue:
extraEnvs:
- name: CLEARML_K8S_GLUE_TEMPLATE_MODULE
@ -61,24 +59,28 @@ agentk8sglue:
return {"template": template}
```
## Notes
:::note notes
* Make sure to include `*args, **kwargs` at the end of the function's argument list and to only use keyword arguments.
This is needed to maintain backward compatibility.
**Note**: Make sure to include `*args, **kwargs` at the end of the function’s argument list and to only use keyword arguments. This is needed to maintain backward compatibility and make sure any added named arguments or changes in the arguments order in new agent versions won’t affect your implementation.
* Custom code modules can be included as a file in the pod's container, and the environment variable can be used to
point to the file and entry point.
**Note**: Custom code modules can be included as a file in the Pod's container and the environment variable can be used to simply point to the file and entry point.
* When defining a custom code module, by default the Agent will start watching pods in all namespaces
across the cluster. If you do not intend to give a `ClusterRole` permission, make sure to set the
`CLEARML_K8S_GLUE_MONITOR_ALL_NAMESPACES` env to `"0"` to prevent the Agent to try listing pods in all namespaces.
Instead, set it to `"1"` if namespace-related changes are needed in the code.
**Note**: When defining a custom code module, by default the ClearML Enterprise Agent will start watching Pods in all namespaces across the Cluster. If you do not intend to give a ClusterRole permission, make sure to set the `CLEARML_K8S_GLUE_MONITOR_ALL_NAMESPACES` env to `"0"` to prevent the ClearML Enterprise Agent from trying to list Pods in all namespaces. Instead, set it to `"1"` if namespace-related changes are needed in the code.
```yaml
agentk8sglue:
extraEnvs:
- name: CLEARML_K8S_GLUE_MONITOR_ALL_NAMESPACES
value: "0"
```
``` yaml
agentk8sglue:
extraEnvs:
- name: CLEARML_K8S_GLUE_MONITOR_ALL_NAMESPACES
value: "0"
```
To customize the bash startup scripts instead of the pod spec, use:
**Note**: If you want instead to modify the Bash script used to start the Task Pod or the Agent, see here instead:
``` yaml
```yaml
agentk8sglue:
# -- Custom Bash script for the Agent pod ran by Glue Agent
customBashScript: ""
@ -86,11 +88,11 @@ agentk8sglue:
containerCustomBashScript: ""
```
# Examples
## Examples
## Example – Edit Template based on ENV var
### Example: Edit Template Based on ENV Var
``` yaml
```yaml
agentk8sglue:
extraEnvs:
- name: CLEARML_K8S_GLUE_TEMPLATE_MODULE
@ -131,9 +133,9 @@ agentk8sglue:
emptyDir: {}
```
## Example – NFS Mount Path
### Example: Inject NFS Mount Path
``` yaml
```yaml
agentk8sglue:
extraEnvs:
- name: CLEARML_K8S_GLUE_TEMPLATE_MODULE
@ -163,23 +165,25 @@ agentk8sglue:
return {"template": template}
```
# Bind Additional Resources to Task Pod (PVC Example)
### Example: Bind PVC Resource to Task Pod
In this example, a dedicated PVC is dynamically created and attached to every Pod created from a dedicated queue, then deleted after the Pod deletion.
In this example, a PVC is created and attached to every Pod created from a dedicated queue, then deleted afterwards.
The following code block is commented to explain the context.
Key points:
The key points are:
- `CLEARML_K8S_GLUE_POD_PRE_APPLY_CMD` and `CLEARML_K8S_GLUE_POD_POST_DELETE_CMD` env vars let you define custom bash
code hooks to be executed around the main apply command by the Agent, such as creating and deleting a PVC object.
- `CLEARML_K8S_GLUE_POD_PRE_APPLY_CMD` and `CLEARML_K8S_GLUE_POD_POST_DELETE_CMD` env vars let you define custom bash code hooks to be executed around the main apply command by the Agent, such as creating and deleting a PVC object.
- `CLEARML_K8S_GLUE_TEMPLATE_MODULE` env var and a file mount let you define custom Python code in a specific context,
useful to dynamically update the main Pod template before the Agent applies it.
- `CLEARML_K8S_GLUE_TEMPLATE_MODULE` env var and a file mount let you define custom Python code in a specific context, useful to dynamically update the main Pod template before the Agent applies it.
:::note notes
* This example uses a queue named `pvc-test`, make sure to replace all occurrences of it.
**Note**: This example uses a queue named `pvc-test`, make sure to replace all occurrences of it.
* `CLEARML_K8S_GLUE_POD_PRE_APPLY_CMD` can reference templated vars as `{queue_name}, {pod_name}, {namespace}` that will
get replaced with the actual value by the Agent at execution time.
**Note**: `CLEARML_K8S_GLUE_POD_PRE_APPLY_CMD` can reference templated vars as `{queue_name}, {pod_name}, {namespace}` that will get replaced with the actual value by the Agent at execution time.
``` yaml
```yaml
agentk8sglue:
# Bind a pre-defined custom 'custom-agent-role' Role with the ability to handle 'persistentvolumeclaims'
additionalRoleBindings:
@ -246,9 +250,11 @@ agentk8sglue:
name: task-pvc
```
Example of `custom-agent-role` Role with permissions to handle `persistentvolumeclaims`:
### Example: Required Role
``` yaml
The following is an example of `custom-agent-role` Role with permissions to handle `persistentvolumeclaims`:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:

View File

@ -1,31 +1,28 @@
🟡 Ready, missing link
---
TODO:
- Link: fractional GPUs
title: Basic Deployment - Suggested GPU Operator Values
---
# Basic Deployment - Suggested GPU Operator Values
## Add the Helm Repo Locally
Add the ClearML Helm repository:
``` bash
Add the NVIDIA GPU Operator Helm repository:
```bash
helm repo add nvidia https://nvidia.github.io/gpu-operator
```
Update the repository locally:
``` bash
```bash
helm repo update
```
## Installation
As mentioned by NVIDIA, this configuration is needed to prevent unprivileged containers from bypassing the Kubernetes Device Plugin API.
To prevent unprivileged containers from bypassing the Kubernetes Device Plugin API, configure the GPU operator with the
following override values.
Create a `gpu-operator.override.yaml` file with the following content:
``` yaml
```yaml
toolkit:
env:
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
@ -48,14 +45,17 @@ devicePlugin:
value: all
```
Install the gpu-operator:
Install the `gpu-operator`:
``` bash
helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
```
# Fractioning
## Fractional GPU Support
For fractional GPU support, refer to the dedicated guides.
TODO link to the fractional_gpus directory page in documentation
For support with fractional GPUs, refer to the dedicated guides:
* [ClearML Dynamic MIG Operator](../fractional_gpus/cdmo.md) (CDMO) - Dynamically configures MIG GPUs on supported devices.
* [ClearML Enterprise Fractional GPU Injector](../fractional_gpus/cfgi.md) (CFGI) - Enables fractional (non-MIG) GPU
allocation for better hardware utilization and workload distribution in Kubernetes.
* [CDMO and CFGI on the same Cluster](../fractional_gpus/cdmo_cfgi_same_cluster.md) - In clusters with multiple nodes and
varying GPU types, the `gpu-operator` can be used to manage different device configurations and fractioning modes.

View File

@ -1,10 +1,12 @@
🟢 Ready
---
# Multi-Node Training
title: Multi-Node Training
---
The ClearML Enterprise Agent supports horizontal multi-node training executions. Here is a configuration example (in `clearml-agent-values.override.yaml`):
The ClearML Enterprise Agent supports horizontal multi-node training--running a single Task across multiple Pods on different nodes.
``` yaml
Below is a configuration example using `clearml-agent-values.override.yaml`:
```yaml
agentk8sglue:
# Cluster access is required to run multi-node Tasks
serviceAccountClusterAccess: true

View File

@ -1,211 +0,0 @@
🟡 Ready, but missing hyperlinks (see TODOs)
---
TODO:
Control Plane:
- Link: basic k8s installation
- Link: SSO login
- Additional envs for control-plane multi-tenancy
Workers:
- Link: basic Agent installation
- Link: basic AI App Gateway installation
---
# Multi-Tenancy
## Control Plane
For installing the ClearML control-plane, follow this guide (TODO link to the basic_k8s_installation page).
Update the Server’s `clearml-values.override.yaml` with the following values:
``` yaml
apiserver:
extraEnvs:
- name: CLEARML__services__organization__features__user_management_advanced
value: "true"
- name: CLEARML__services__auth__ui_features_per_role__user__show_datasets
value: "false"
- name: CLEARML__services__auth__ui_features_per_role__user__show_orchestration
value: "false"
- name: CLEARML__services__workers__resource_usages__supervisor_company
value: "d1bd92a3b039400cbafc60a7a5b1e52b" # Default company
- name: CLEARML__secure__credentials__supervisor__role
value: "system"
- name: CLEARML__secure__credentials__supervisor__allow_login
value: "true"
- name: CLEARML__secure__credentials__supervisor__user_key
value: "<SUPERVISOR_USER_KEY>"
- name: CLEARML__secure__credentials__supervisor__user_secret
value: "<SUPERVISOR_USER_SECRET>"
- name: CLEARML__secure__credentials__supervisor__sec_groups
value: "[\"users\", \"admins\", \"queue_admins\"]"
- name: CLEARML__secure__credentials__supervisor__email
value: "\"<SUPERVISOR_USER_EMAIL>\""
- name: CLEARML__apiserver__company__unique_names
value: "true"
```
The credentials specified in `<SUPERVISOR_USER_KEY>` and `<SUPERVISOR_USER_SECRET>` can be used to log in as the supervisor user from the ClearML Web UI accessible using the URL `app.<BASE_DOMAIN>`.
Note that the `<SUPERVISOR_USER_EMAIL>` value must be explicitly quoted. To do so, put `\"` around the quoted value. Example `"\"email@example.com\""`.
You will want to configure SSO as well. For this, follow the “SSO (Identity Provider) Setup” (TODO link to the sso-login page).
### Create a Tenant
The following section will address the steps required to create a new tenant in the ClearML Control-plane server using a series of API calls.
Note that placeholders (`<PLACEHOLDER>`) in the following configuration should be substituted with a valid domain based on your installation values.
#### Create a new Tenant in the ClearML Control-plane
*Define variables to use in the next steps:*
``` bash
APISERVER_URL="https://api.<BASE_DOMAIN>"
APISERVER_KEY=<APISERVER_KEY>
APISERVER_SECRET=<APISERVER_SECRET>
```
**Note**: The apiserver key and secret should be the same as those used for installing the ClearML Enterprise server Chart.
*Create a Tenant (company):*
``` bash
curl $APISERVER_URL/system.create_company \
-H "Content-Type: application/json" \
-u $APISERVER_KEY:$APISERVER_SECRET \
-d '{"name":"<TENANT_NAME>"}'
```
The result returns the new Company ID (`<COMPANY_ID>`)
If needed, list existing tenants (companies) using:
``` bash
curl -u $APISERVER_KEY:$APISERVER_SECRET $APISERVER_URL/system.get_companies
```
*Create an Admin User for the new tenant:*
``` bash
curl $APISERVER_URL/auth.create_user \
-H "Content-Type: application/json" \
-u $APISERVER_KEY:$APISERVER_SECRET \
-d '{"name":"<ADMIN_USER_NAME>","company":"<COMPANY_ID>","email":"<ADMIN_USER_EMAIL>","role":"admin","internal":"true"}'
```
The result returns the new User ID (`<USER_ID>`)
*Create Credentials for the new Admin User:*
``` bash
curl $APISERVER_URL/auth.create_credentials \
-H "Content-Type: application/json" \
-H "X-Clearml-Impersonate-As: <USER_ID>" \
-u $APISERVER_KEY:$APISERVER_SECRET
```
The result returns a set of key and secret credentials associated with the new Admin User.
**Note**: You can use this set of credentials to set up an Agent or App Gateway for the newly created Tenant.
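For example, in that tenant's `clearml-agent-values.override.yaml` the credentials would be plugged in as follows (a sketch; the key names are assumptions based on the agent chart's defaults):
```yaml
clearml:
  # Credentials created above for the tenant's admin user
  agentk8sglueKey: "<TENANT_ADMIN_ACCESS_KEY>"
  agentk8sglueSecret: "<TENANT_ADMIN_SECRET_KEY>"
```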
#### Create IDP/SSO sign-in rules
To map new users signing into the system to existing tenants, you can use one or more of the following route methods to route new users (based on their email address) to an existing tenant.
*Route an email to a tenant based on the email’s domain:*
This will instruct the server to assign any new user whose email domain matches the domain provided below to this specific tenant.
Note that providing the same domain name for multiple tenants will result in unstable behavior and should be avoided.
``` bash
curl $APISERVER_URL/login.set_domains \
-H "Content-Type: application/json" \
-H "X-Clearml-Act-As: <USER_ID>" \
-u $APISERVER_KEY:$APISERVER_SECRET \
-d '{"domains":["<USERS_EMAIL_DOMAIN>"]}'
```
`<USERS_EMAIL_DOMAIN>` is the email domain set up for users to access through SSO.
*Route specific email(s) to a tenant:*
This will instruct the server to assign any new user whose email is found in this list to this specific tenant. You can use the is_admin property to choose whether these users will be set as admins in this tenant upon login.
Note that you can create more than one list per tenant (using multiple API calls) to create one list for admin users and another for non-admin users.
Note that including the same email address in more than a single tenant’s list will result in unstable behavior and should be avoided.
``` bash
curl $APISERVER_URL/login.add_whitelist_entries \
-H "Content-Type: application/json" \
-H "X-Clearml-Act-As: <USER_ID>" \
-u $APISERVER_KEY:$APISERVER_SECRET \
-d '{"emails":["<email1>", "<email2>", ...],"is_admin":false}'
```
To remove existing email(s) from these lists, use the following API call. Note that this will not affect a user who has already logged in using one of these email addresses:
``` bash
curl $APISERVER_URL/login.remove_whitelist_entries \
-H "Content-Type: application/json" \
-H "X-Clearml-Act-As: <USER_ID>" \
-u $APISERVER_KEY:$APISERVER_SECRET \
-d '{"emails":["<email1>", "<email2>", ...]}'
```
*Get the current login routing settings:*
To get the current IDP/SSO login rule settings for this tenant:
``` bash
curl $APISERVER_URL/login.get_settings \
-H "X-Clearml-Act-As: <USER_ID>" \
-u $APISERVER_KEY:$APISERVER_SECRET
```
### Limit Features for all Users in a Group
The server’s `clearml-values.override.yaml` can control some tenants’ configurations, limiting the features available to some users or groups in the system.
Example: with the following configuration, all users in the “users” group will only have the `applications` feature enabled.
``` yaml
apiserver:
extraEnvs:
- name: CLEARML__services__auth__default_groups__users__features
value: "[\"applications\"]"
```
Available Features:
- `applications` - Viewing and running applications
- `data_management` - Working with hyper-datasets and dataviews
- `experiments` - Viewing the experiment table and launching experiments
- `queues` - Viewing the queues screen
- `queue_management` - Creating and deleting queues
- `pipelines` - Viewing/managing pipelines in the system
- `reports` - Viewing and managing reports in the system
- `show_dashboard` - Show the dashboard screen
- `show_projects` - Show the projects menu option
- `resource_dashboard` - Display the resource dashboard in the orchestration page
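For example, to limit users in the "users" group to applications, reports, and experiments only, list the features in the same JSON-array format shown above:
```yaml
apiserver:
  extraEnvs:
    - name: CLEARML__services__auth__default_groups__users__features
      value: "[\"applications\", \"reports\", \"experiments\"]"
```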
## Workers
Refer to the following pages for installing and configuring the ClearML Enterprise Agent (TODO link to agent_k8s_installation) and App Gateway (TODO link to app-gateway).
**Note**: Make sure to setup Agent and App Gateway using a Tenant's admin user credentials created with the Tenant creation APIs described above.
### Tenants Separation
In multi-tenant setups, you can separate the tenants’ workers in different namespaces.
Create a Kubernetes Namespace for each tenant and install a dedicated ClearML Agent and AI Application Gateway in each Namespace.
A tenant’s Agent and Gateway need to be configured with credentials created on the ClearML server by a user of the same tenant.
Additional network separation can be achieved via Kubernetes Network Policies.
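For instance, a minimal `NetworkPolicy` along these lines (a sketch; the namespace name is illustrative) restricts ingress in a tenant's namespace to traffic originating from pods in that same namespace:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tenant-namespace-isolation
  namespace: tenant-a       # illustrative tenant namespace
spec:
  podSelector: {}           # applies to all pods in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}   # allow ingress only from pods in the same namespace
```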

View File

@ -1,33 +1,42 @@
# ClearML Presign Service
---
title: ClearML Presign Service
---
The ClearML Presign Service is a secure component for generating and redirecting pre-signed storage URLs for authenticated users.
The ClearML Presign Service is a secure service that generates and redirects pre-signed storage URLs for authenticated
users, enabling direct access to cloud-hosted data (e.g., S3) without exposing credentials.
# Prerequisites
## Prerequisites
- The ClearML Enterprise server is up and running.
- Create a set of `<ACCESS_KEY>` and `<SECRET_KEY>` credentials in the ClearML Server. The easiest way to do so is from the ClearML UI (Settings → Workspace → App Credentials → Create new credentials).
Note: Make sure that the generated keys belong to an admin user or a service user with admin privileges.
- The worker environment should be able to communicate to the ClearML Server over the same network.
- Generate `<ACCESS_KEY>` and `<SECRET_KEY>` credentials in the ClearML Server. The easiest way is via the ClearML UI
(**Settings > Workspace > App Credentials > Create new credentials**).
# Installation
:::note
Make sure these credentials belong to an admin user or a service user with admin privileges.
:::
- The worker environment must be able to access the ClearML Server over the same network.
## Add the Helm Repo Locally
## Installation
### Add the Helm Repo Locally
Add the ClearML Helm repository:
``` bash
```bash
helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <HELM_REPO_TOKEN> --password <HELM_REPO_TOKEN>
```
Update the repository locally:
``` bash
```bash
helm repo update
```
## Prepare Values
### Prepare Configuration
Create a `presign-service.override.yaml` override file, replacing placeholders.
Create a `presign-service.override.yaml` file (make sure to replace the placeholders):
``` yaml
```yaml
imageCredentials:
password: "<CLEARML_DOCKERHUB_TOKEN>"
clearml:
@ -39,21 +48,25 @@ ingress:
hostName: "<PRESIGN_SERVICE_URL>"
```
## Install
### Deploy the Helm Chart
Install the clearml-presign-service helm chart in the same namespace as the ClearML Enterprise server:
Install the `clearml-presign-service` helm chart in the same namespace as the ClearML Enterprise server:
``` bash
```bash
helm install -n clearml clearml-presign-service clearml-enterprise/clearml-presign-service -f presign-service.override.yaml
```
## Configure ClearML Enterprise Server
### Update ClearML Enterprise Server Configuration
After installing, edit the ClearML Enterprise `clearml-values.override.yaml` file adding an extra environment variable to the apiserver component as follows, making sure to replace the `<PRESIGN_SERVICE_URL>` placeholder, then perform a helm upgrade.
Enable the ClearML Server to use the Presign Service by editing your `clearml-values.override.yaml` file.
Add the following to the `apiserver.extraEnvs` section (make sure to replace `<PRESIGN_SERVICE_URL>`).
``` yaml
```yaml
apiserver:
extraEnvs:
- name: CLEARML__SERVICES__SYSTEM__COMPANY__DEFAULT__SERVICES
value: "[{\"type\":\"presign\",\"url\":\"https://<PRESIGN_SERVICE_URL>\",\"use_fallback\":\"false\",\"match_sets\":[{\"rules\":[{\"field\":\"\",\"obj_type\":\"\",\"regex\":\"^s3://\"}]}]}]"
```
```
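For readability, the single-line JSON string in the value above corresponds to the following structure (shown here decoded; the environment variable itself must remain a one-line JSON string):
```yaml
- type: presign
  url: "https://<PRESIGN_SERVICE_URL>"
  use_fallback: "false"
  match_sets:
    - rules:
        - field: ""
          obj_type: ""
          regex: "^s3://"   # applies to s3:// URLs
```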
Apply the changes with a Helm upgrade.

View File

@ -1,4 +1,6 @@
# ClearML Tenant with Self Signed Certificates
---
title: ClearML Tenant with Self Signed Certificates
---
This guide covers the configuration to support SSL Custom certificates for the following components:
@ -7,9 +9,9 @@ This guide covers the configuration to support SSL Custom certificates for the f
## AI Application Gateway
Add the following section in the `clearml-app-gateway-values.override.yaml` file:
To configure certificates for the Application Gateway, update your `clearml-app-gateway-values.override.yaml` file:
``` yaml
```yaml
# -- Custom certificates
customCertificates:
# -- Override system crt certificate bundle. Mutual exclusive with extraCerts.
@ -23,18 +25,18 @@ customCertificates:
-----END CERTIFICATE-----
```
In the section, there are two options:
In this section, there are two options:
- Replace the whole `ca-certificates.crt` file
- Add extra certificates to the existing `ca-certificates.crt`
- [**Replace**](#replace-the-whole-ca-certificatescrt-file) the entire `ca-certificates.crt` file
- [**Append**](#append-extra-certificates-to-the-existing-ca-certificatescrt) extra certificates to the existing `ca-certificates.crt`
Let’s see them in detail.
### Replace the whole `ca-certificates.crt` file
### Replace Entire `ca-certificates.crt` File
If you need to replace the whole ca-bundle you can attach a concatenation of all your valid CA in a `pem` format like they are stored in a standard `ca-certificates.crt`.
To replace the whole ca-bundle, you can attach a concatenation of all your valid CA in a `pem` format as
they are stored in a standard `ca-certificates.crt`.
``` yaml
```yaml
# -- Custom certificates
customCertificates:
# -- Override system crt certificate bundle. Mutual exclusive with extraCerts.
@ -51,13 +53,11 @@ customCertificates:
...
```
### Add extra certificates to the existing `ca-certificates.crt`
### Append Extra Certificates to the Existing `ca-certificates.crt`
To add extra certificates to the standard provided CA available in the container you can define any specific certificate in the list.
You can add certificates to the existing CA bundle. Ensure each certificate has a unique `alias`.
Ensure to provide different aliases.
``` yaml
```yaml
# -- Custom certificates
customCertificates:
# -- Extra certs usable in case of needs of adding more certificates to the standard bundle, Requires root permissions to run update-ca-certificates. Mutual exclusive with overrideCaCertificatesCrt.
@ -74,19 +74,19 @@ customCertificates:
-----END CERTIFICATE-----
```
### Apply changes
### Apply Changes
After applying the changes ensure to run the update command:
To apply the changes, run the update command:
``` bash
```bash
helm upgrade -i <RELEASE_NAME> -n <WORKLOAD_NAMESPACE> clearml-enterprise/clearml-enterprise-app-gateway --version <CHART_VERSION> -f clearml-app-gateway-values.override.yaml
```
## ClearML Enterprise Agent
Add the following section in the `clearml-agent-values.override.yaml` file:
For the Agent, configure certificates in the `clearml-agent-values.override.yaml` file:
``` yaml
```yaml
# -- Custom certificates
customCertificates:
# -- Override system crt certificate bundle. Mutual exclusive with extraCerts.
@ -102,16 +102,16 @@ customCertificates:
In this section, there are two options:
- Replace the whole `ca-certificates.crt` file
- Add extra certificates to the existing `ca-certificates.crt`
- [**Replace**](#replace-the-whole-ca-certificatescrt-file) the entire `ca-certificates.crt` file
- [**Append**](#append-extra-certificates-to-the-existing-ca-certificatescrt) extra certificates to the existing `ca-certificates.crt`
Let’s see them in detail.
### Replace the whole `ca-certificates.crt` file
### Replace Entire `ca-certificates.crt` File
If you need to replace the whole ca-bundle you can attach a concatenation of all your valid CA in a `pem` format like they are stored in a standard `ca-certificates.crt`.
If you need to replace the whole ca-bundle you can attach a concatenation of all your valid CA in a `pem` format like
they are stored in a standard `ca-certificates.crt`.
``` yaml
```yaml
# -- Custom certificates
customCertificates:
# -- Override system crt certificate bundle. Mutual exclusive with extraCerts.
@ -128,13 +128,11 @@ customCertificates:
...
```
### Add extra certificates to the existing `ca-certificates.crt`
### Append Extra Certificates to the Existing `ca-certificates.crt`
To add extra certificates to the standard provided CA available in the container you can define any specific certificate in the list.
You can add certificates to the existing CA bundle. Ensure each certificate has a unique `alias`.
Ensure to provide different aliases.
``` yaml
```yaml
# -- Custom certificates
customCertificates:
# -- Extra certs usable in case of needs of adding more certificates to the standard bundle, Requires root permissions to run update-ca-certificates. Mutual exclusive with overrideCaCertificatesCrt.
@ -151,9 +149,11 @@ customCertificates:
-----END CERTIFICATE-----
```
### Add certificates to Tasks
### Add Certificates to Task Pods
``` yaml
If your workloads need access to these certificates (e.g., for HTTPS requests), configure the agent to inject them into Pods:
```yaml
agentk8sglue:
basePodTemplate:
initContainers:
@ -189,13 +189,13 @@ agentk8sglue:
name: clearml-agent-clearml-enterprise-agent-custom-ca-1
```
Please note the `clearml-extra-ca-certs` volume, ensure to add each configMap created by the agent.
The `clearml-extra-ca-certs` volume must include all `ConfigMap` resources generated by the agent for the custom certificates.
These `ConfigMaps` are automatically created by the Helm chart based on the number of certificates provided.
Their names are usually prefixed with the Helm release name, so adjust accordingly if you used a custom release name.
The name can differ based on the release name used during the installation.
### Apply Changes
### Apply changes
After applying the changes ensure to run the update command:
To apply the changes, run the update command:
``` bash
helm upgrade -i -n <WORKER_NAMESPACE> clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml

View File

@ -1,12 +1,21 @@
# SSO (Identity Provider) Setup
---
title: SSO (Identity Provider) Setup
---
ClearML Enterprise Server supports various SSO options, values configurations can be set in `clearml-values.override.yaml`.
ClearML Enterprise Server supports various Single Sign-On (SSO) identity providers.
SSO configuration is managed via environment variables in your `clearml-values.override.yaml` file and applied to the `apiserver` component.
The following are a few examples. Some other supported providers are Auth0, Keycloak, Okta, Azure, Google, Cognito.
The following are configuration examples for commonly used providers. Other supported systems include:
* Auth0
* Keycloak
* Okta
* Azure AD
* Google
* AWS Cognito
## Auth0
``` yaml
```yaml
apiserver:
extraEnvs:
- name: CLEARML__secure__login__sso__oauth_client__auth0__client_id
@ -25,7 +34,7 @@ apiserver:
## Keycloak
``` yaml
```yaml
apiserver:
extraEnvs:
- name: CLEARML__secure__login__sso__oauth_client__keycloak__client_id
@ -42,19 +51,22 @@ apiserver:
value: "true"
```
### Note if using Groups Mapping
## Group Membership Mapping in Keycloak
When configuring the OpenID client for ClearML:
To map Keycloak groups into the ClearML user's SSO token:
- Navigate to the Client Scopes tab.
- Click on the first row <clearml client>-dedicated.
- Click "Add Mapper" → "By configuration" and then select the "Group membership" option.
- In the opened dialog, enter the name "groups" and the Token claim name "groups".
- Uncheck the "Full group path" option and save the mapper.
1. Go to the **Client Scopes** tab.
1. Click on the first row `<clearml client>-dedicated`.
1. Click **Add Mapper > By configuration > Group membership**
1. In the dialog:
   * Set **Name** to "groups".
   * Set **Token Claim Name** to "groups".
   * Uncheck **Full group path**.
   * Save the mapper.
To validate yourself:
To verify:
- Return to the Client Details → Client scope tab.
- Go to the Evaluate sub-tab and select a user who has any group memberships.
- On the right side, navigate to the Generated ID token and then to Generated User Info.
- Inspect that in both cases, you can see the group's claim in the displayed user data.
1. Return to **Client Details > Client scope** tab.
1. Go to the Evaluate sub-tab and select a user who has any group memberships.
1. Go to **Generated ID token** and then to **Generated User Info**.
1. Check that in both cases you can see the groups claim in the displayed user data.

View File

@ -1,103 +1,106 @@
🟢 Ready
---
# ClearML Dynamic MIG Operator (CDMO)
title: ClearML Dynamic MIG Operator (CDMO)
---
Enables dynamic MIG GPU configurations.
The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG GPU configurations.
# Installation
## Installation
## Requirements
### Requirements
Install the official NVIDIA `gpu-operator` using Helm with one of the following configurations.
* Install the official NVIDIA `gpu-operator` using Helm with one of the following configurations.
Add and update the Nvidia Helm repo:
* Add and update the Nvidia Helm repo:
``` bash
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
```
```bash
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
```
Create a `gpu-operator.override.yaml` file with the following content:
* Create a `gpu-operator.override.yaml` file with the following content:
``` yaml
migManager:
enabled: false
mig:
strategy: mixed
toolkit:
env:
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
value: "false"
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
value: "true"
devicePlugin:
env:
- name: PASS_DEVICE_SPECS
value: "true"
- name: FAIL_ON_INIT_ERROR
value: "true"
- name: DEVICE_LIST_STRATEGY # Use volume-mounts
value: volume-mounts
- name: DEVICE_ID_STRATEGY
value: uuid
- name: NVIDIA_VISIBLE_DEVICES
value: all
- name: NVIDIA_DRIVER_CAPABILITIES
value: all
```
```yaml
migManager:
enabled: false
mig:
strategy: mixed
toolkit:
env:
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
value: "false"
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
value: "true"
devicePlugin:
env:
- name: PASS_DEVICE_SPECS
value: "true"
- name: FAIL_ON_INIT_ERROR
value: "true"
- name: DEVICE_LIST_STRATEGY # Use volume-mounts
value: volume-mounts
- name: DEVICE_ID_STRATEGY
value: uuid
- name: NVIDIA_VISIBLE_DEVICES
value: all
- name: NVIDIA_DRIVER_CAPABILITIES
value: all
```
Install the official NVIDIA `gpu-operator` using Helm with the previous configuration:
* Install the official NVIDIA `gpu-operator` using Helm with the previous configuration:
``` bash
helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
```
```bash
helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
```
## Install
### Installing CDMO
Create a `cdmo-values.override.yaml` file with the following content:
* Create a `cdmo-values.override.yaml` file with the following content:
``` yaml
imageCredentials:
password: "<CLEARML_DOCKERHUB_TOKEN>"
```
```yaml
imageCredentials:
password: "<CLEARML_DOCKERHUB_TOKEN>"
```
Install the CDMO operator Helm Chart using the previous override file:
* Install the CDMO Helm Chart using the previous override file:
``` bash
helm install -n cdmo cdmo clearml-enterprise/clearml-dynamic-mig-operator --create-namespace -f cdmo-values.override.yaml
```
```bash
helm install -n cdmo cdmo clearml-enterprise/clearml-dynamic-mig-operator --create-namespace -f cdmo-values.override.yaml
```
Enable the NVIDIA MIG support on your cluster by running the following command on all Nodes with a MIG-supported GPU (run it for each GPU `<GPU_ID>` you have on the Host):
* Enable the NVIDIA MIG support on your cluster by running the following command on all nodes with a MIG-supported GPU
(run it for each GPU `<GPU_ID>` on the host):
``` bash
nvidia-smi -mig 1
```
```bash
nvidia-smi -mig 1
```
**NOTE**: The node might need to be rebooted if reported by the result of the previous command.
:::note notes
* A node reboot may be required if indicated by the output of the previous command.
**NOTE**: For convenience, this command can be issued from inside the nvidia-device-plugin-daemonset Pod running on the related node.
* For convenience, this command can be issued from inside the `nvidia-device-plugin-daemonset` pod running on the related node.
:::
Any MIG-enabled GPU node `<NODE_NAME>` from the last point must be labeled accordingly as follows:
* Any MIG-enabled GPU node `<NODE_NAME>` from the last point must be labeled accordingly as follows:
``` bash
kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning=mig"
```
```bash
kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning=mig"
```
# Unconfigure MIGs
## Disabling MIGs
For disabling MIG, follow these steps in order:
To disable MIG, follow these steps:
1. Ensure there are no more running workflows requesting any form of GPU on the Node(s) before re-configuring it.
1. Ensure no running workflows are using GPUs on the target node(s).
2. Remove the CDMO label from the target Node(s) to disable the dynamic MIG reconfiguration.
2. Remove the CDMO label from the target node(s) to disable the dynamic MIG reconfiguration.
``` bash
```bash
kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning-"
```
3. Execute a shell into the `device-plugin-daemonset` Pod instance running on the target Node(s) and execute the following commands in order:
3. Execute a shell into the `device-plugin-daemonset` pod instance running on the target node(s) and execute the following commands:
``` bash
```bash
nvidia-smi mig -dci
nvidia-smi mig -dgi
@ -105,27 +108,27 @@ For disabling MIG, follow these steps in order:
nvidia-smi -mig 0
```
4. Edit the `gpu-operator.override.yaml` file to have a standard configuration for full GPUs and upgrade the `gpu-operator`:
4. Edit the `gpu-operator.override.yaml` file to have a standard configuration for full GPUs, and upgrade the `gpu-operator`:
``` yaml
```yaml
toolkit:
env:
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
value: "false"
value: "false"
- name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
value: "true"
value: "true"
devicePlugin:
env:
- name: PASS_DEVICE_SPECS
value: "true"
value: "true"
- name: FAIL_ON_INIT_ERROR
value: "true"
value: "true"
- name: DEVICE_LIST_STRATEGY # Use volume-mounts
value: volume-mounts
value: volume-mounts
- name: DEVICE_ID_STRATEGY
value: uuid
value: uuid
- name: NVIDIA_VISIBLE_DEVICES
value: all
value: all
- name: NVIDIA_DRIVER_CAPABILITIES
value: all
value: all
```

View File

@ -2,14 +2,14 @@
title: Install CDMO and CFGI on the same Cluster
---
In clusters with multiple nodes with different GPU types, the `gpu-operator` can be used to manage different devices and
fractioning modes.
In clusters with multiple nodes and varying GPU types, the `gpu-operator` can be used to manage different device configurations
and fractioning modes.
## Configuring the NVIDIA GPU Operator
The NVIDIA `gpu-operator` allows defining multiple configurations for the Device Plugin.
The NVIDIA `gpu-operator` supports defining multiple configurations for the Device Plugin.
The following YAML values define two usable configs as "mig" and "ts" (time-slicing).
The following example YAML defines two configurations: "mig" and "ts" (time-slicing).
``` yaml
migManager:
@ -67,52 +67,50 @@ devicePlugin:
migStrategy: mixed
```
## Usage
## Applying Configuration to Nodes
Previously defined configurations need to be applied to Kubernetes nodes using labels. After a label is added to a node,
the NVIDIA `device-plugin` will automatically reconfigure it.
To activate a configuration, label the Kubernetes node accordingly. After a node is labeled,
the NVIDIA `device-plugin` will automatically reload the new configuration.
The following is an example using the previous configuration.
Example usage:
* Apply the `mig` (MIG mode) config:
``` bash
kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=mig
```
* Apply the MIG `mig` config:
``` bash
kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=mig
```
* Apply the `ts` (time slicing) config:
``` bash
kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=ts
```
* Apply the time slicing `ts` config:
``` bash
kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=ts
```
* Apply the `all-disabled` (standard full GPU access) config:
``` bash
kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=all-disabled
```
* Apply the vanilla full GPUs `all-disabled` config:
``` bash
kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=all-disabled
```
## Installing CDMO and CFGI
## Install CDMO and CFGI
After configuring the NVIDIA `gpu-operator` and labeling nodes, proceed with the standard installations of [CDMO](cdmo.md)
and [CFGI](cfgi.md).
After configuring the NVIDIA `gpu-operator` and labeling nodes, proceed with the standard [CDMO](cdmo.md) and [CFGI](cfgi.md)
installation.
## Disabling
## Disabling Configurations
### Time Slicing
To toggle between time slicing and vanilla full GPUs, simply toggle the label value between `ts` and `all-disabled` using
the `--overwrite` flag in kubectl.
To switch between time slicing and full GPU access, toggle the node label value between `ts` and `all-disabled` using the `--overwrite` flag with the labeling commands shown above.
### MIG
To disable MIG, follow these steps:
To disable MIG mode:
1. Ensure there are no more running workflows requesting any form of GPU on the Node(s) before re-configuring it.
2. Remove the CDMO label from the target Node(s) to disable the dynamic MIG reconfiguration.
1. Ensure there are no more running workflows requesting any form of GPU on the node(s) before re-configuring it.
2. Remove the CDMO label from the target node(s) to disable the dynamic MIG reconfiguration.
``` bash
kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning-"
```
3. Execute a shell into the `device-plugin-daemonset` Pod instance running on the target Node(s) and execute the following commands:
3. Execute a shell in the `device-plugin-daemonset` Pod instance running on the target node(s) and execute the following commands:
``` bash
nvidia-smi mig -dci
@ -122,7 +120,7 @@ To disable MIG, follow these steps:
nvidia-smi -mig 0
```
4. Label the target node to disable MIG.
4. Relabel the target node to disable MIG:
``` bash
kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=all-disabled --overwrite

View File

@ -167,36 +167,39 @@ helm install -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector --c
## Usage
Fractional GPU Injector will inject CUDA files on pods that have the following label:
To use fractional GPUs, label your pod with:
```yaml
labels:
clearml-injector/fraction: "<GPU_FRACTION_VALUE>"
```
where `"<GPU_FRACTION_VALUE>"` must be equal one of following values:
Valid values for `"<GPU_FRACTION_VALUE>"` include:
* "0.0625" (1/16th)
* "0.125" (1/8th)
* "0.250"
* "0.375"
* "0.500"
* "0.625"
* "0.750"
* "0.875"
Valid values for `"<GPU_FRACTION_VALUE>"` include integer representation of GPUs such as `1.000` or `2` or `2.0` etc.
* Fractions:
* "0.0625" (1/16th)
* "0.125" (1/8th)
* "0.250"
* "0.375"
* "0.500"
* "0.625"
* "0.750"
* "0.875"
* Integer representation of GPUs such as `1.000`, `2`, `2.0`, etc.
### ClearML Enterprise Agent Configuration
In order to specify resource allocation when using the CFGI, the following values configuration should be set in `clearml-agent-values.override.yaml`.
To run ClearML jobs that request specific GPU fractions, configure the queues in your `clearml-agent-values.override.yaml` file.
The label `clearml-injector/fraction: "<GPU_FRACTION_VALUE>"` is required in order to specify a fraction of the GPU to be
assigned to the pod started for a task pulled from the respective queue.
Each queue should include a `templateOverride` that sets the `clearml-injector/fraction` label, which determines the
fraction of a GPU to allocate (e.g., "0.500" for half a GPU).
This label is used by the CFGI to assign the correct portion of GPU resources to the pod running the task.
#### CFGI Version >= 1.3.0
Starting from version 1.3.0, there is no need to specify the resources field.
Starting from version 1.3.0, there is no need to specify the resources field. You only need to set the labels:
``` yaml
agentk8sglue:
@ -222,6 +225,8 @@ agentk8sglue:
#### CFGI Version < 1.3.0
For versions older than 1.3.0, the GPU limits must be defined:
```yaml
agentk8sglue:
createQueues: true
@ -256,24 +261,22 @@ agentk8sglue:
## Upgrading Chart
### Upgrades / Values Upgrades
To update to the latest version of this chart:
To upgrade to the latest version of this chart:
```bash
helm repo update
helm upgrade -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector
```
To change values on an existing installation:
To apply changes to values on an existing installation:
```bash
helm upgrade -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector -f cfgi-values.override.yaml
```
## Disable Fractions
## Disabling Fractions
To toggle between timeSlicing and vanilla full GPUs, remove the `devicePlugin.config` section from the `gpu-operator.override.yaml`
To revert to standard GPU scheduling (without time slicing), remove the `devicePlugin.config` section from the `gpu-operator.override.yaml`
file and upgrade the `gpu-operator`:
```yaml