* Launch-and-Forget service containers
* [Cloud autoscaling](https://clear.ml/docs/latest/docs/guides/services/aws_autoscaler)
* [Customizable cleanup](https://clear.ml/docs/latest/docs/guides/services/cleanup_service)
* Advanced [pipeline building and execution](https://clear.ml/docs/latest/docs/guides/frameworks/pytorch/notebooks/table/tabular_training_pipeline)
It is a zero-configuration, fire-and-forget execution agent, providing a full ML/DL cluster solution.
   or [free tier hosting](https://app.clear.ml)
2. `pip install clearml-agent` ([install](#installing-the-clearml-agent) the ClearML Agent on any GPU machine:
   on-premises / cloud / ...)
3. Create a [job](https://clear.ml/docs/latest/docs/apps/clearml_task) or
   add [ClearML](https://github.com/allegroai/clearml) to your code with just 2 lines of code (see the sketch after this list)
4. Change the [parameters](#using-the-clearml-agent) in the UI & schedule for [execution](#using-the-clearml-agent) (or
   automate with an [AutoML pipeline](#automl-and-orchestration-pipelines-))
5. :chart_with_downwards_trend: :chart_with_upwards_trend: :eyes: :beer:
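For orientation, here is a minimal command-line sketch of steps 2 and 3. The queue name, project/task names, repository
URL, and script path are placeholders, not part of the official instructions:

``` bash
# Step 2: install the agent on the worker machine, configure credentials,
# and start pulling jobs from the "default" queue
pip install clearml-agent
clearml-agent init
clearml-agent daemon --queue default

# Step 3 (one option): create a job from an existing repository using the
# clearml-task CLI that ships with the clearml package
clearml-task --project examples --name remote-test \
  --repo https://github.com/user/repo.git --script train.py --queue default
```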
**Two K8s integration flavours**
- Spin ClearML-Agent as a long-lasting service pod:
  - Use [clearml-agent](https://hub.docker.com/r/allegroai/clearml-agent) docker image
  - Map docker socket into the pod (soon replaced by [podman](https://github.com/containers/podman))
  - Allow the clearml-agent to manage sibling dockers
  - Benefits: full use of the ClearML scheduling, no need to worry about wrong container images / lost pods etc.
  - Downside: sibling containers
- Kubernetes Glue, map ClearML jobs directly to K8s jobs:
  - Run the [clearml-k8s glue](https://github.com/allegroai/clearml-agent/blob/master/examples/k8s_glue_example.py) on
    a K8s CPU node (a launch sketch follows this list)
  - The clearml-k8s glue pulls jobs from the ClearML job execution queue and prepares a K8s job (based on the provided
    yaml template)
  - Inside the pod itself the clearml-agent will install the job (experiment) environment and spin and monitor the
    experiment's process
  - Benefits: Kubernetes full view of all running jobs in the system
  - Downside: No real scheduling (k8s scheduler), no docker image verification (post-mortem only)
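For illustration, launching the glue could look like the sketch below; the queue name is a placeholder, and the
script's exact arguments should be verified against its `--help`:

``` bash
# Get the example glue script from the clearml-agent repository
git clone https://github.com/allegroai/clearml-agent.git
cd clearml-agent/examples

# Run the glue on a CPU node; it maps jobs from the ClearML queue to K8s jobs
python k8s_glue_example.py --queue k8s_glue_queue
```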
### Using the ClearML Agent
A previously run experiment can be put into 'Draft' state by either of two methods:
* Using the **'Reset'** action from the experiment right-click context menu in the ClearML UI - This will clear any
  results and artifacts the previous run had created.
* Using the **'Clone'** action from the experiment right-click context menu in the ClearML UI - This will create a new
  'Draft' experiment with the same configuration as the original experiment.
An experiment is scheduled for execution using the **'Enqueue'** action from the experiment right-click context menu in
the ClearML UI and selecting the execution queue.
See [creating an experiment and enqueuing it for execution](#from-scratch).
Once an experiment is enqueued, it will be picked up and executed by a ClearML Agent monitoring this queue.
The ClearML UI Workers & Queues page provides ongoing execution information.
#### Configuring the ClearML Agent

A basic configuration wizard is available via:

``` bash
clearml-agent init
```
Note: The ClearML Agent uses a cache folder to cache pip packages, apt packages and cloned repositories. The default
ClearML Agent cache folder is `~/.clearml`.
See full details in your configuration file at `~/clearml.conf`.
Note: The **ClearML Agent** extends the **ClearML** configuration file `~/clearml.conf`.
They are designed to share the same configuration file; see the example [here](docs/clearml.conf).
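As a quick sanity check of the above (only the two paths already mentioned are assumed):

``` bash
# The shared configuration file used by both ClearML and the ClearML Agent
cat ~/clearml.conf

# The agent's default cache folder (pip/apt packages, cloned repositories)
ls ~/.clearml
```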
#### Running the ClearML Agent
For debug and experimentation, start the ClearML agent in `foreground` mode, where all the output is printed to screen:
``` bash
clearml-agent daemon --queue default --foreground
```
For actual service mode, all the stdout will be stored automatically into a temporary file (no need to pipe).
Notice: with the `--detached` flag, the *clearml-agent* will run in the background
``` bash
clearml-agent daemon --detached --queue default
```
GPU allocation is controlled via the standard OS environment variable `NVIDIA_VISIBLE_DEVICES`, or the `--gpus` flag
(or disabled with `--cpu-only`).
If no flag is set, and the `NVIDIA_VISIBLE_DEVICES` variable doesn't exist, all GPUs will be allocated for
the `clearml-agent`. <br>
If the `--cpu-only` flag is set, or `NVIDIA_VISIBLE_DEVICES="none"`, no GPU will be allocated for
the `clearml-agent`.
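Putting the above together, a short sketch (the queue name is just an example):

``` bash
# Restrict the agent to GPU 0 via the environment variable...
NVIDIA_VISIBLE_DEVICES=0 clearml-agent daemon --queue default

# ...or equivalently via the --gpus flag
clearml-agent daemon --gpus 0 --queue default

# Spin a CPU-only agent (no GPU allocated)
clearml-agent daemon --cpu-only --queue default
```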
Example: spin two agents, one per GPU on the same machine:
Notice: with the `--detached` flag, the *clearml-agent* will run in the background
``` bash
clearml-agent daemon --detached --gpus 0 --queue default
clearml-agent daemon --detached --gpus 1 --queue default
```
Example: spin two agents, pulling from dedicated `dual_gpu` queue, two GPUs per agent:
``` bash
clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu
clearml-agent daemon --detached --gpus 2,3 --queue dual_gpu
```

#### Starting the ClearML Agent in docker mode

For debug and experimentation, start the ClearML agent in `foreground` mode, where all the output is printed to screen:

``` bash
clearml-agent daemon --queue default --docker --foreground
```
For actual service mode, all the stdout will be stored automatically into a file (no need to pipe).
Notice: with the `--detached` flag, the *clearml-agent* will run in the background
``` bash
clearml-agent daemon --detached --queue default --docker
```
Example: spin two agents, one per GPU on the same machine, with the default `nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04`
docker:
``` bash
clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
clearml-agent daemon --detached --gpus 1 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
```
Example: spin two agents, pulling from dedicated `dual_gpu` queue, two GPUs per agent, with the default
`nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04` docker:
``` bash
clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
clearml-agent daemon --detached --gpus 2,3 --queue dual_gpu --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
```
Priority Queues are also supported; example use case:
High priority queue: `important_jobs`, low priority queue: `default`
``` bash
clearml-agent daemon --queue important_jobs default
```
The **ClearML Agent** will first try to pull jobs from the `important_jobs` queue, and only if it is empty, the agent
will try to pull from the `default` queue.
Adding queues, managing job order within a queue, and moving jobs between queues are available using the Web UI; see
the example on our [free server](https://app.clear.ml/workers-and-queues/queues).
##### Stopping the ClearML Agent
To stop a **ClearML Agent** running in the background, run the same command line used to start the agent, appending
`--stop` to it.
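For example, a sketch mirroring the start commands shown earlier (adjust it to whatever command line you actually
used; the GPU and docker arguments here are illustrative):

``` bash
# Append --stop to the same command line that started the agent
clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04 --stop
```

### Creating an experiment and enqueuing it for execution <a name="from-scratch"></a>

As your code is running, **ClearML** creates an experiment logging all the necessary execution information: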
- Git repository link and commit ID (or an entire Jupyter notebook)
- Git diff (we're not saying you never commit and push, but still...)
- Python packages used by your code (including specific versions used)
- Hyperparameters
- Input artifacts
You now have a 'template' of your experiment with everything required for automated execution.
* In the ClearML UI, right-click on the experiment and select 'clone'. A copy of your experiment will be created.
* You now have a new draft experiment cloned from your original experiment; feel free to edit it:
  - Change the hyperparameters
  - Switch to the latest code base of the repository
  - Update package versions
  - Select a specific docker image to run in (see docker execution mode section)
  - Or simply change nothing to run the same experiment again...
* Schedule the newly created experiment for execution: right-click the experiment and select 'enqueue'
### ClearML-Agent Services Mode <a name="services"></a>
ClearML-Agent Services is a special mode of ClearML-Agent that provides the ability to launch long-lasting jobs that
previously had to be executed on local / dedicated machines. It allows a single agent to launch multiple dockers (Tasks)
for different use cases:
* Auto-scaler service (spinning instances when the need arises and the budget allows)
* Controllers (implementing pipelines and more sophisticated DevOps logic)
* Optimizer (such as Hyperparameter Optimization or sweeping)
* Application (such as interactive Bokeh apps for increased data transparency)
ClearML-Agent Services mode will spin **any** task enqueued into the specified queue. Every task launched by
ClearML-Agent Services will be registered as a new node in the system, providing tracking and transparency capabilities.
Currently, clearml-agent in services-mode supports CPU-only configuration. ClearML-Agent services mode can be launched
alongside GPU agents.
``` bash
clearml-agent daemon --services-mode --detached --queue services --create-queue --docker --cpu-only
```

### AutoML and Orchestration Pipelines <a name="automl-pipes"></a>

The **ClearML Agent** can also be used to implement AutoML orchestration and Experiment Pipelines in conjunction with
the ClearML package.
Sample AutoML & Orchestration examples can be found in the
ClearML [example/automation](https://github.com/allegroai/clearml/tree/master/examples/automation) folder.
AutoML examples:
- [Toy Keras training experiment](https://github.com/allegroai/clearml/blob/master/examples/optimization/hyper-parameter-optimization/base_template_keras_simple.py)
  - In order to create an experiment-template in the system, this code must be executed once manually
- [Random Search over the above Keras experiment-template](https://github.com/allegroai/clearml/blob/master/examples/automation/manual_random_param_search_example.py)
  - This example will create multiple copies of the Keras experiment-template, with different hyperparameter
    combinations (a run sketch follows this list)
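A sketch of running the two AutoML examples locally, assuming the `clearml` repository has been cloned and the
`clearml` package is installed and configured:

``` bash
git clone https://github.com/allegroai/clearml.git
cd clearml

# Run once, manually, to register the Keras experiment-template in the system
python examples/optimization/hyper-parameter-optimization/base_template_keras_simple.py

# Launch the random search over the registered template
python examples/automation/manual_random_param_search_example.py
```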
Experiment Pipeline examples:
- [First step experiment](https://github.com/allegroai/clearml/blob/master/examples/automation/task_piping_example.py)
- This example will "process data", and once done, will launch a copy of the 'second step' experiment-template