Support force_system_packages argument in k8s glue class

Version bump to v1.7.0
Add agent.resource_monitoring.disk_use_path configuration option to allow monitoring a different volume than the one containing the home folder
2025-06-26 18:16:15 +00:00 · 2023-12-26 10:12:32 +02:00 · 2023-12-20 18:08:38 +02:00 · 2023-12-20 17:49:33 +02:00 · 2023-12-20 17:49:04 +02:00 · 2023-12-20 17:48:18 +02:00
55 changed files with 3277 additions and 963 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -11,3 +11,6 @@ build/
 dist/
 *.egg-info

+# VSCode
+.vscode
+
--- a/README.md
+++ b/README.md
@@ -24,8 +24,7 @@ ML-Ops scheduler & orchestration solution supporting Linux, macOS and Windows**
 * Launch-and-Forget service containers
 * [Cloud autoscaling](https://clear.ml/docs/latest/docs/guides/services/aws_autoscaler)
 * [Customizable cleanup](https://clear.ml/docs/latest/docs/guides/services/cleanup_service)
-*
-Advanced [pipeline building and execution](https://clear.ml/docs/latest/docs/guides/frameworks/pytorch/notebooks/table/tabular_training_pipeline)
+* Advanced [pipeline building and execution](https://clear.ml/docs/latest/docs/guides/frameworks/pytorch/notebooks/table/tabular_training_pipeline)

 It is a zero configuration fire-and-forget execution agent, providing a full ML/DL cluster solution.

@@ -35,8 +34,8 @@ It is a zero configuration fire-and-forget execution agent, providing a full ML/
   or [free tier hosting](https://app.clear.ml)
 2. `pip install clearml-agent` ([install](#installing-the-clearml-agent) the ClearML Agent on any GPU machine:
   on-premises / cloud / ...)
-3. Create a [job](https://github.com/allegroai/clearml/docs/clearml-task.md) or
-   Add [ClearML](https://github.com/allegroai/clearml) to your code with just 2 lines
+3. Create a [job](https://clear.ml/docs/latest/docs/apps/clearml_task) or
+   add [ClearML](https://github.com/allegroai/clearml) to your code with just 2 lines
 4. Change the [parameters](#using-the-clearml-agent) in the UI & schedule for [execution](#using-the-clearml-agent) (or
   automate with an [AutoML pipeline](#automl-and-orchestration-pipelines-))
 5. :chart_with_downwards_trend: :chart_with_upwards_trend: :eyes:  :beer:
@@ -81,21 +80,21 @@ Find Dockerfiles in the [docker](./docker) dir and a helm Chart in https://githu

 **Two K8s integration flavours**

- Spin ClearML-Agent as a long-lasting service pod
-    - use [clearml-agent](https://hub.docker.com/r/allegroai/clearml-agent) docker image
+- Spin ClearML-Agent as a long-lasting service pod:
+    - Use [clearml-agent](https://hub.docker.com/r/allegroai/clearml-agent) docker image
    - map docker socket into the pod (soon replaced by [podman](https://github.com/containers/podman))
-    - allow the clearml-agent to manage sibling dockers
-    - benefits: full use of the ClearML scheduling, no need to worry about wrong container images / lost pods etc.
-    - downside: Sibling containers
- Kubernetes Glue, map ClearML jobs directly to K8s jobs
+    - Allow the clearml-agent to manage sibling dockers
+    - Benefits: full use of the ClearML scheduling, no need to worry about wrong container images / lost pods etc.
+    - Downside: sibling containers
+- Kubernetes Glue, map ClearML jobs directly to K8s jobs:
    - Run the [clearml-k8s glue](https://github.com/allegroai/clearml-agent/blob/master/examples/k8s_glue_example.py) on
      a K8s cpu node
    - The clearml-k8s glue pulls jobs from the ClearML job execution queue and prepares a K8s job (based on provided
      yaml template)
    - Inside the pod itself the clearml-agent will install the job (experiment) environment and spin and monitor the
      experiment's process
-    - benefits: Kubernetes full view of all running jobs in the system
-    - downside: No real scheduling (k8s scheduler), no docker image verification (post-mortem only)
+    - Benefits: Kubernetes full view of all running jobs in the system
+    - Downside: No real scheduling (k8s scheduler), no docker image verification (post-mortem only)

 ### Using the ClearML Agent

@@ -110,15 +109,15 @@ A previously run experiment can be put into 'Draft' state by either of two metho

 * Using the **'Reset'** action from the experiment right-click context menu in the ClearML UI - This will clear any
  results and artifacts the previous run had created.
-* Using the **'Clone'** action from the experiment right-click context menu in the ClearML UI - This will create a new '
-  Draft' experiment with the same configuration as the original experiment.
+* Using the **'Clone'** action from the experiment right-click context menu in the ClearML UI - This will create a new 
+  'Draft' experiment with the same configuration as the original experiment.

 An experiment is scheduled for execution using the **'Enqueue'** action from the experiment right-click context menu in
 the ClearML UI and selecting the execution queue.

 See [creating an experiment and enqueuing it for execution](#from-scratch).

-Once an experiment is enqueued, it will be picked up and executed by a ClearML agent monitoring this queue.
+Once an experiment is enqueued, it will be picked up and executed by a ClearML Agent monitoring this queue.

 The ClearML UI Workers & Queues page provides ongoing execution information:

@@ -170,22 +169,22 @@ clearml-agent init
 ```

 Note: The ClearML Agent uses a cache folder to cache pip packages, apt packages and cloned repositories. The default
-ClearML Agent cache folder is `~/.clearml`
+ClearML Agent cache folder is `~/.clearml`.

-See full details in your configuration file at `~/clearml.conf`
+See full details in your configuration file at `~/clearml.conf`.

-Note: The **ClearML agent** extends the **ClearML** configuration file `~/clearml.conf`
+Note: The **ClearML Agent** extends the **ClearML** configuration file `~/clearml.conf`.
 They are designed to share the same configuration file, see example [here](docs/clearml.conf)

 #### Running the ClearML Agent

-For debug and experimentation, start the ClearML agent in `foreground` mode, where all the output is printed to screen
+For debug and experimentation, start the ClearML Agent in `foreground` mode, where all the output is printed to screen:

 ```bash
 clearml-agent daemon --queue default --foreground
 ```

-For actual service mode, all the stdout will be stored automatically into a temporary file (no need to pipe)
+For actual service mode, all the stdout will be stored automatically into a temporary file (no need to pipe).
 Notice: with `--detached` flag, the *clearml-agent* will be running in the background

 ```bash
@@ -195,20 +194,21 @@ clearml-agent daemon --detached --queue default
 GPU allocation is controlled via the standard OS environment `NVIDIA_VISIBLE_DEVICES` or `--gpus` flag (or disabled
 with `--cpu-only`).

-If no flag is set, and `NVIDIA_VISIBLE_DEVICES` variable doesn't exist, all GPU's will be allocated for
-the `clearml-agent` <br>
+If no flag is set, and `NVIDIA_VISIBLE_DEVICES` variable doesn't exist, all GPUs will be allocated for
+the `clearml-agent`. <br>
 If `--cpu-only` flag is set, or `NVIDIA_VISIBLE_DEVICES="none"`, no gpu will be allocated for
-the `clearml-agent`
+the `clearml-agent`.

-Example: spin two agents, one per gpu on the same machine:
-Notice: with `--detached` flag, the *clearml-agent* will be running in the background
+Example: spin two agents, one per GPU on the same machine:
+
+Notice: with `--detached` flag, the *clearml-agent* will run in the background

 ```bash
 clearml-agent daemon --detached --gpus 0 --queue default
 clearml-agent daemon --detached --gpus 1 --queue default
 ```

-Example: spin two agents, pulling from dedicated `dual_gpu` queue, two gpu's per agent
+Example: spin two agents, pulling from dedicated `dual_gpu` queue, two GPUs per agent:

 ```bash
 clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu
@@ -223,43 +223,43 @@ For debug and experimentation, start the ClearML agent in `foreground` mode, whe
 clearml-agent daemon --queue default --docker --foreground
 ```

-For actual service mode, all the stdout will be stored automatically into a file (no need to pipe)
-Notice: with `--detached` flag, the *clearml-agent* will be running in the background
+For actual service mode, all the stdout will be stored automatically into a file (no need to pipe).
+Notice: with `--detached` flag, the *clearml-agent* will run in the background

 ```bash
 clearml-agent daemon --detached --queue default --docker
 ```

-Example: spin two agents, one per gpu on the same machine, with default nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
+Example: spin two agents, one per GPU on the same machine, with default `nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04`
 docker:

 ```bash
-clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
-clearml-agent daemon --detached --gpus 1 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
+clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04
+clearml-agent daemon --detached --gpus 1 --queue default --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04
 ```

-Example: spin two agents, pulling from dedicated `dual_gpu` queue, two gpu's per agent, with default nvidia/cuda:
-10.1-cudnn7-runtime-ubuntu18.04 docker:
+Example: spin two agents, pulling from dedicated `dual_gpu` queue, two GPUs per agent, with default 
+`nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04` docker:

 ```bash
-clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
-clearml-agent daemon --detached --gpus 2,3 --queue dual_gpu --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
+clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04
+clearml-agent daemon --detached --gpus 2,3 --queue dual_gpu --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04
 ```

 ##### Starting the ClearML Agent - Priority Queues

 Priority Queues are also supported, example use case:

-High priority queue: `important_jobs`  Low priority queue: `default`
+High priority queue: `important_jobs`, low priority queue: `default`

 ```bash
 clearml-agent daemon --queue important_jobs default
 ```

-The **ClearML Agent** will first try to pull jobs from the `important_jobs` queue, only then it will fetch a job from
-the `default` queue.
+The **ClearML Agent** will first try to pull jobs from the `important_jobs` queue, and only if it is empty will the agent
+try to pull from the `default` queue.

-Adding queues, managing job order within a queue and moving jobs between queues, is available using the Web UI, see
+Adding queues, managing job order within a queue, and moving jobs between queues can all be done using the Web UI; see
 example on our [free server](https://app.clear.ml/workers-and-queues/queues)

 ##### Stopping the ClearML Agent
@@ -268,7 +268,7 @@ To stop a **ClearML Agent** running in the background, run the same command line
 appended. For example, to stop the first of the above shown same machine, single gpu agents:

 ```bash
-clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04 --stop
+clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04 --stop
 ```

 ### How do I create an experiment on the ClearML Server? <a name="from-scratch"></a>
@@ -279,32 +279,33 @@ clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:10
    - Git repository link and commit ID (or an entire jupyter notebook)
    - Git diff (we’re not saying you never commit and push, but still...)
    - Python packages used by your code (including specific versions used)
-    - Hyper-Parameters
-    - Input Artifacts
+    - Hyperparameters
+    - Input artifacts

  You now have a 'template' of your experiment with everything required for automated execution

-* In the ClearML UI, Right-click on the experiment and select 'clone'. A copy of your experiment will be created.
+* In the ClearML UI, right-click on the experiment and select 'clone'. A copy of your experiment will be created.
 * You now have a new draft experiment cloned from your original experiment, feel free to edit it
-    - Change the Hyper-Parameters
+    - Change the hyperparameters
    - Switch to the latest code base of the repository
    - Update package versions
    - Select a specific docker image to run in (see docker execution mode section)
    - Or simply change nothing to run the same experiment again...
-* Schedule the newly created experiment for execution: Right-click the experiment and select 'enqueue'
+* Schedule the newly created experiment for execution: right-click the experiment and select 'enqueue'

 ### ClearML-Agent Services Mode <a name="services"></a>

 ClearML-Agent Services is a special mode of ClearML-Agent that provides the ability to launch long-lasting jobs that
 previously had to be executed on local / dedicated machines. It allows a single agent to launch multiple dockers (Tasks)
-for different use cases. To name a few use cases, auto-scaler service (spinning instances when the need arises and the
-budget allows), Controllers (Implementing pipelines and more sophisticated DevOps logic), Optimizer (such as
-Hyper-parameter Optimization or sweeping), and Application (such as interactive Bokeh apps for increased data
-transparency)
+for different use cases: 
+* Auto-scaler service (spinning instances when the need arises and the budget allows)
+* Controllers (implementing pipelines and more sophisticated DevOps logic)
+* Optimizer (such as Hyperparameter Optimization or sweeping)
+* Application (such as interactive Bokeh apps for increased data transparency)

 ClearML-Agent Services mode will spin **any** task enqueued into the specified queue. Every task launched by
 ClearML-Agent Services will be registered as a new node in the system, providing tracking and transparency capabilities.
-Currently clearml-agent in services-mode supports cpu only configuration. ClearML-agent services mode can be launched
+Currently, clearml-agent in services-mode supports CPU-only configuration. ClearML-Agent services mode can be launched
 alongside GPU agents.

 ```bash
@@ -321,15 +322,15 @@ ClearML package.
 Sample AutoML & Orchestration examples can be found in the
 ClearML [example/automation](https://github.com/allegroai/clearml/tree/master/examples/automation) folder.

-AutoML examples
+AutoML examples:

 - [Toy Keras training experiment](https://github.com/allegroai/clearml/blob/master/examples/optimization/hyper-parameter-optimization/base_template_keras_simple.py)
    - In order to create an experiment-template in the system, this code must be executed once manually
 - [Random Search over the above Keras experiment-template](https://github.com/allegroai/clearml/blob/master/examples/automation/manual_random_param_search_example.py)
-    - This example will create multiple copies of the Keras experiment-template, with different hyper-parameter
+    - This example will create multiple copies of the Keras experiment-template, with different hyperparameter
      combinations

-Experiment Pipeline examples
+Experiment Pipeline examples:

 - [First step experiment](https://github.com/allegroai/clearml/blob/master/examples/automation/task_piping_example.py)
    - This example will "process data", and once done, will launch a copy of the 'second step' experiment-template
--- a/clearml_agent/backend_api/config/default/agent.conf
+++ b/clearml_agent/backend_api/config/default/agent.conf
@@ -45,8 +45,8 @@
    # it solves passing user/token to git submodules.
    # this is a safer way to ensure multiple users using the same repository will
    # not accidentally leak credentials
-    # Only supported on Linux systems, it will be the default in future releases
-    # enable_git_ask_pass: false
+    # Note: this is only supported on Linux systems
+    # enable_git_ask_pass: true

    # in docker mode, if container's entrypoint automatically activated a virtual environment
    # use the activated virtual environment and install everything there
@@ -80,6 +80,17 @@
        # additional artifact repositories to use when installing python packages
        # extra_index_url: ["https://allegroai.jfrog.io/clearml/api/pypi/public/simple"]

+        # control the pytorch wheel resolving algorithm, options are: "pip", "direct", "none"
+        # Override with environment variable CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE
+        # "pip" (default): would automatically detect the cuda version, and supply pip with the correct
+        # extra-index-url, based on pytorch.org tables
+        # "direct": would resolve a direct link to the pytorch wheel by parsing the pytorch.org pip repository
+        # and matching the automatically detected cuda version with the required pytorch wheel.
+        # if the exact cuda version is not found for the required pytorch wheel, it will try
+        # a lower cuda version until a match is found
+        # "none": No resolver used, install pytorch like any other package
+        # pytorch_resolve: "pip"
+
        # additional conda channels to use when installing with conda package manager
        conda_channels: ["pytorch", "conda-forge", "defaults", ]

@@ -88,19 +99,23 @@
        # force_repo_requirements_txt: false

        # set the priority packages to be installed before the rest of the required packages
+        # Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
        # priority_packages: ["cython", "numpy", "setuptools", ]

        # set the optional priority packages to be installed before the rest of the required packages,
        # In case a package installation fails, the package will be ignored,
        # and the virtual environment process will continue
+        # Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
        priority_optional_packages: ["pygobject", ]

        # set the post packages to be installed after all the rest of the required packages
+        # Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
        # post_packages: ["horovod", ]

        # set the optional post packages to be installed after all the rest of the required packages,
        # In case a package installation fails, the package will be ignored,
        # and the virtual environment process will continue
+        # Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
        # post_optional_packages: []

        # set to True to support torch nightly build installation,
@@ -162,6 +177,13 @@
    # these are local for this agent and will not be updated in the experiment's docker_cmd section
    # extra_docker_arguments: ["--ipc=host", ]

+    # Allow the extra docker arg to override task level docker arg (if the same argument is passed on both),
+    # if set to False, a task docker arg will override the docker extra arg
+    # docker_args_extra_precedes_task: true
+
+    # allows the following task docker args to be overridden by the extra_docker_arguments
+    # protected_docker_extra_args: ["privileged", "security-opt", "network", "ipc"]
+
    # optional shell script to run in docker when started before the experiment is started
    # extra_docker_shell_script: ["apt-get install -y bindfs", ]

@@ -192,7 +214,7 @@

    default_docker: {
        # default docker image to use when running in docker mode
-        image: "nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04"
+        image: "nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04"

        # optional arguments to pass to docker image
        # arguments: ["--ipc=host", ]
@@ -259,10 +281,15 @@

    # Name docker containers created by the daemon using the following string format (supported from Docker 0.6.5)
    # Allowed variables are task_id, worker_id and rand_string (random lower-case letters string, up to 32 characters)
+    # Custom variables may be specified using the docker_container_name_format_fields option.
    # Note: resulting name must start with an alphanumeric character and
    #       continue with alphanumeric characters, underscores (_), dots (.) and/or dashes (-)
    # docker_container_name_format: "clearml-id-{task_id}-{rand_string:.8}"

+    # Specify custom variables for the docker_container_name_format option using a mapping of variable name
+    # to a (nested) task field (using "." as a task field separator, digits specify array index)
+    # docker_container_name_format_fields: { foo: "bar.moo" }
+
    # Apply top-level environment section from configuration into os.environ
    apply_environment: true
    # Top-level environment section is in the form of:
@@ -283,6 +310,8 @@
    #  target_format: format used to encode contents before writing into the target file. Supported values are json,
    #                 yaml, yml and bytes (in which case the file will be written in binary mode). Default is text mode.
    #  overwrite: overwrite the target file in case it exists. Default is true.
+    #  mode: file-system mode to be applied to the file after its creation. The mode string will be parsed into an
+    #        integer (e.g. "0o777" for -rwxrwxrwx)
    #
    # Example:
    #   files {
@@ -348,7 +377,7 @@
    # Notice: Matching is done via regular expression, for example "^searchme$" will match exactly "searchme$" string
    #
    #     "default_docker": {
-    #         "image": "nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04",
+    #         "image": "nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04",
    #         # optional arguments to pass to docker image
    #         # arguments: ["--ipc=host", ]
    #         "match_rules": [
@@ -369,13 +398,6 @@
    #                 }
    #             },
    #             {
-    #                 "image": "better_container:tag",
-    #                 "arguments": "",
-    #                 "match": {
-    #                     "container": "replace_me_please"
-    #                 }
-    #             },
-    #             {
    #                 "image": "another_container:tag",
    #                 "arguments": "",
    #                 "match": {
--- a/clearml_agent/backend_api/config/default/sdk.conf
+++ b/clearml_agent/backend_api/config/default/sdk.conf
@@ -3,7 +3,7 @@

    storage {
        cache {
-            # Defaults to system temp folder / cache
+            # Defaults to <system_temp_folder>/clearml_cache
            default_base_dir: "~/.clearml/cache"
            size {
                # max_used_bytes = -1
--- a/clearml_agent/backend_api/session/defs.py
+++ b/clearml_agent/backend_api/session/defs.py
@@ -1,5 +1,5 @@
-from ...backend_config.converters import safe_text_to_bool
-from ...backend_config.environment import EnvEntry
+from clearml_agent.helper.environment import EnvEntry
+from clearml_agent.helper.environment.converters import safe_text_to_bool


 ENV_HOST = EnvEntry("CLEARML_API_HOST", "TRAINS_API_HOST")
@@ -20,6 +20,7 @@ ENV_PROPAGATE_EXITCODE = EnvEntry("CLEARML_AGENT_PROPAGATE_EXITCODE", type=bool,
 ENV_INITIAL_CONNECT_RETRY_OVERRIDE = EnvEntry(
    'CLEARML_AGENT_INITIAL_CONNECT_RETRY_OVERRIDE', default=True, converter=safe_text_to_bool
 )
+ENV_FORCE_MAX_API_VERSION = EnvEntry("CLEARML_AGENT_FORCE_MAX_API_VERSION", type=str)

 """
 Experimental option to set the request method for all API requests and auth login.
--- a/clearml_agent/backend_api/session/session.py
+++ b/clearml_agent/backend_api/session/session.py
@@ -16,11 +16,11 @@ from requests.auth import HTTPBasicAuth
 from six.moves.urllib.parse import urlparse, urlunparse

 from clearml_agent.external.pyhocon import ConfigTree, ConfigFactory
-
 from .callresult import CallResult
 from .defs import (
    ENV_VERBOSE, ENV_HOST, ENV_ACCESS_KEY, ENV_SECRET_KEY, ENV_WEB_HOST, ENV_FILES_HOST, ENV_AUTH_TOKEN,
-    ENV_NO_DEFAULT_SERVER, ENV_DISABLE_VAULT_SUPPORT, ENV_INITIAL_CONNECT_RETRY_OVERRIDE, ENV_API_DEFAULT_REQ_METHOD, )
+    ENV_NO_DEFAULT_SERVER, ENV_DISABLE_VAULT_SUPPORT, ENV_INITIAL_CONNECT_RETRY_OVERRIDE, ENV_API_DEFAULT_REQ_METHOD,
+    ENV_FORCE_MAX_API_VERSION)
 from .request import Request, BatchRequest
 from .token_manager import TokenManager
 from ..config import load
@@ -28,7 +28,6 @@ from ..utils import get_http_session_with_retry, urllib_log_warning_setup
 from ...backend_config.environment import backward_compatibility_support
 from ...version import __version__

-
 sys_random = SystemRandom()


@@ -64,6 +63,7 @@ class Session(TokenManager):
    default_files = "https://demofiles.demo.clear.ml"
    default_key = "EGRTCO8JMSIGI6S39GTP43NFWXDQOW"
    default_secret = "x!XTov_G-#vspE*Y(h$Anm&DIc5Ou-F)jsl$PdOyj5wG1&E!Z8"
+    force_max_api_version = ENV_FORCE_MAX_API_VERSION.get()

    # TODO: add requests.codes.gateway_timeout once we support async commits
    _retry_codes = [
@@ -199,6 +199,12 @@ class Session(TokenManager):
        # notice: this is across the board warning omission
        urllib_log_warning_setup(total_retries=http_retries_config.get('total', 0), display_warning_after=3)

+        if self.force_max_api_version and self.check_min_api_version(self.force_max_api_version):
+            print("Using forced API version {}".format(self.force_max_api_version))
+            Session.max_api_version = Session.api_version = str(self.force_max_api_version)
+
+        self.pre_vault_config = None
+
    def _setup_session(self, http_retries_config, initial_session=False, default_initial_connect_override=None):
        # type: (dict, bool, Optional[bool]) -> (dict, requests.Session)
        http_retries_config = http_retries_config or self.config.get(
@@ -250,7 +256,11 @@ class Session(TokenManager):
        def parse(vault):
            # noinspection PyBroadException
            try:
-                d = vault.get('data', None)
+                print("Loaded {} vault: {}".format(
+                    vault.get("scope", ""),
+                    (vault.get("description", None) or "")[:50] or vault.get("id", ""))
+                )
+                d = vault.get("data", None)
                if d:
                    r = ConfigFactory.parse_string(d)
                    if isinstance(r, (ConfigTree, dict)):
@@ -266,6 +276,7 @@ class Session(TokenManager):
                vaults = res.json().get("data", {}).get("vaults", [])
                data = list(filter(None, map(parse, vaults)))
                if data:
+                    self.pre_vault_config = self.config.copy()
                    self.config.set_overrides(*data)
                    return True
            elif res.status_code != 404:
--- a/clearml_agent/backend_api/utils.py
+++ b/clearml_agent/backend_api/utils.py
@@ -86,7 +86,10 @@ def get_http_session_with_retry(
    session = requests.Session()

    if backoff_max is not None:
-        Retry.BACKOFF_MAX = backoff_max
+        if "BACKOFF_MAX" in vars(Retry):
+            Retry.BACKOFF_MAX = backoff_max
+        else:
+            Retry.DEFAULT_BACKOFF_MAX = backoff_max

    retry = Retry(
        total=total, connect=connect, read=read, redirect=redirect, status=status,
--- a/clearml_agent/backend_config/config.py
+++ b/clearml_agent/backend_config/config.py
@@ -297,6 +297,9 @@ class Config(object):
    def put(self, key, value):
        self._config.put(key, value)

+    def pop(self, key, default=None):
+        return self._config.pop(key, default=default)
+
    def to_dict(self):
        return self._config.as_plain_ordered_dict()

--- a/clearml_agent/backend_config/converters.py
+++ b/clearml_agent/backend_config/converters.py
@@ -1,69 +1,8 @@
-import base64
-from distutils.util import strtobool
-from typing import Union, Optional, Any, TypeVar, Callable, Tuple
-
-import six
-
-try:
-    from typing import Text
-except ImportError:
-    # windows conda-less hack
-    Text = Any
-
-
-ConverterType = TypeVar("ConverterType", bound=Callable[[Any], Any])
-
-
-def text_to_int(value, default=0):
-    # type: (Any, int) -> int
-    try:
-        return int(value)
-    except (ValueError, TypeError):
-        return default
-
-
-def base64_to_text(value):
-    # type: (Any) -> Text
-    return base64.b64decode(value).decode("utf-8")
-
-
-def text_to_bool(value):
-    # type: (Text) -> bool
-    return bool(strtobool(value))
-
-
-def safe_text_to_bool(value):
-    # type: (Text) -> bool
-    try:
-        return text_to_bool(value)
-    except ValueError:
-        return bool(value)
-
-
-def any_to_bool(value):
-    # type: (Optional[Union[int, float, Text]]) -> bool
-    if isinstance(value, six.text_type):
-        return text_to_bool(value)
-    return bool(value)
-
-
-def or_(*converters, **kwargs):
-    # type: (ConverterType, Tuple[Exception, ...]) -> ConverterType
-    """
-    Wrapper that implements an "optional converter" pattern. Allows specifying a converter
-    for which a set of exceptions is ignored (and the original value is returned)
-    :param converters: A converter callable
-    :param exceptions: A tuple of exception types to ignore
-    """
-    # noinspection PyUnresolvedReferences
-    exceptions = kwargs.get("exceptions", (ValueError, TypeError))
-
-    def wrapper(value):
-        for converter in converters:
-            try:
-                return converter(value)
-            except exceptions:
-                pass
-        return value
-
-    return wrapper
+from clearml_agent.helper.environment.converters import (
+    base64_to_text,
+    text_to_bool,
+    text_to_int,
+    safe_text_to_bool,
+    any_to_bool,
+    or_,
+)
--- a/clearml_agent/backend_config/entry.py
+++ b/clearml_agent/backend_config/entry.py
@@ -1,111 +1,6 @@
-import abc
-from typing import Optional, Any, Tuple, Callable, Dict
+from clearml_agent.helper.environment import Entry, NotSet

-import six
-
-from .converters import any_to_bool
-
-try:
-    from typing import Text
-except ImportError:
-    # windows conda-less hack
-    Text = Any
-
-
-NotSet = object()
-
-Converter = Callable[[Any], Any]
-
-
-@six.add_metaclass(abc.ABCMeta)
-class Entry(object):
-    """
-    Configuration entry definition
-    """
-
-    @classmethod
-    def default_conversions(cls):
-        # type: () -> Dict[Any, Converter]
-        return {
-            bool: any_to_bool,
-            six.text_type: lambda s: six.text_type(s).strip(),
-        }
-
-    def __init__(self, key, *more_keys, **kwargs):
-        # type: (Text, Text, Any) -> None
-        """
-        :param key: Entry's key (at least one).
-        :param more_keys: More alternate keys for this entry.
-        :param type: Value type. If provided, will be used choosing a default conversion or
-        (if none exists) for casting the environment value.
-        :param converter: Value converter. If provided, will be used to convert the environment value.
-        :param default: Default value. If provided, will be used as the default value on calls to get() and get_pair()
-        in case no value is found for any key and no specific default value was provided in the call.
-        Default value is None.
-        :param help: Help text describing this entry
-        """
-        self.keys = (key,) + more_keys
-        self.type = kwargs.pop("type", six.text_type)
-        self.converter = kwargs.pop("converter", None)
-        self.default = kwargs.pop("default", None)
-        self.help = kwargs.pop("help", None)
-
-    def __str__(self):
-        return str(self.key)
-
-    @property
-    def key(self):
-        return self.keys[0]
-
-    def convert(self, value, converter=None):
-        # type: (Any, Converter) -> Optional[Any]
-        converter = converter or self.converter
-        if not converter:
-            converter = self.default_conversions().get(self.type, self.type)
-        return converter(value)
-
-    def get_pair(self, default=NotSet, converter=None, value_cb=None):
-        # type: (Any, Converter, Callable[[str, Any], None]) -> Optional[Tuple[Text, Any]]
-        for key in self.keys:
-            value = self._get(key)
-            if value is NotSet:
-                continue
-            try:
-                value = self.convert(value, converter)
-            except Exception as ex:
-                self.error("invalid value {key}={value}: {ex}".format(**locals()))
-                break
-            # noinspection PyBroadException
-            try:
-                if value_cb:
-                    value_cb(key, value)
-            except Exception:
-                pass
-            return key, value
-
-        result = self.default if default is NotSet else default
-        return self.key, result
-
-    def get(self, default=NotSet, converter=None, value_cb=None):
-        # type: (Any, Converter, Callable[[str, Any], None]) -> Optional[Any]
-        return self.get_pair(default=default, converter=converter, value_cb=value_cb)[1]
-
-    def set(self, value):
-        # type: (Any, Any) -> (Text, Any)
-        # key, _ = self.get_pair(default=None, converter=None)
-        for k in self.keys:
-            self._set(k, str(value))
-
-    def _set(self, key, value):
-        # type: (Text, Text) -> None
-        pass
-
-    @abc.abstractmethod
-    def _get(self, key):
-        # type: (Text) -> Any
-        pass
-
-    @abc.abstractmethod
-    def error(self, message):
-        # type: (Text) -> None
-        pass
+__all__ = [
+    "Entry",
+    "NotSet"
+]
--- a/clearml_agent/backend_config/environment.py
+++ b/clearml_agent/backend_config/environment.py
@@ -1,32 +1,6 @@
-from os import getenv, environ
+from os import environ

-from .converters import text_to_bool
-from .entry import Entry, NotSet
-
-
-class EnvEntry(Entry):
-    @classmethod
-    def default_conversions(cls):
-        conversions = super(EnvEntry, cls).default_conversions().copy()
-        conversions[bool] = text_to_bool
-        return conversions
-
-    def pop(self):
-        for k in self.keys:
-            environ.pop(k, None)
-
-    def _get(self, key):
-        value = getenv(key, "").strip()
-        return value or NotSet
-
-    def _set(self, key, value):
-        environ[key] = value
-
-    def __str__(self):
-        return "env:{}".format(super(EnvEntry, self).__str__())
-
-    def error(self, message):
-        print("Environment configuration: {}".format(message))
+from clearml_agent.helper.environment import EnvEntry


 def backward_compatibility_support():
@@ -34,6 +8,7 @@ def backward_compatibility_support():
    if ENVIRONMENT_BACKWARD_COMPATIBLE.get():
        # Add TRAINS_ prefix on every CLEARML_ os environment we support
        for k, v in ENVIRONMENT_CONFIG.items():
+            # noinspection PyBroadException
            try:
                trains_vars = [var for var in v.vars if var.startswith('CLEARML_')]
                if not trains_vars:
@@ -44,6 +19,7 @@ def backward_compatibility_support():
            except:
                continue
        for k, v in ENVIRONMENT_SDK_PARAMS.items():
+            # noinspection PyBroadException
            try:
                trains_vars = [var for var in v if var.startswith('CLEARML_')]
                if not trains_vars:
@@ -62,3 +38,9 @@ def backward_compatibility_support():
        backwards_k = k.replace('CLEARML_', 'TRAINS_', 1)
        if backwards_k not in keys:
            environ[backwards_k] = environ[k]
+
+
+__all__ = [
+    "EnvEntry",
+    "backward_compatibility_support"
+]
--- a/clearml_agent/backend_config/utils.py
+++ b/clearml_agent/backend_config/utils.py
@@ -31,7 +31,8 @@ def apply_environment(config):
    keys = list(filter(None, env_vars.keys()))

    for key in keys:
-        os.environ[str(key)] = str(env_vars[key] or "")
+        value = env_vars[key]
+        os.environ[str(key)] = str(value if value is not None else "")

    return keys

@@ -52,6 +53,7 @@ def apply_files(config):
        target_fmt = data.get("target_format", "string")
        overwrite = bool(data.get("overwrite", True))
        contents = data.get("contents")
+        mode = data.get("mode")

        target = Path(expanduser(expandvars(path)))

@@ -110,3 +112,14 @@ def apply_files(config):
        except Exception as ex:
            print("Skipped [{}]: failed saving file {} ({})".format(key, target, ex))
            continue
+
+        try:
+            if mode:
+                if isinstance(mode, int):
+                    mode = int(str(mode), 8)
+                else:
+                    mode = int(mode, 8)
+                target.chmod(mode)
+        except Exception as ex:
+            print("Skipped [{}]: failed setting mode {} for {} ({})".format(key, mode, target, ex))
+            continue
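
Reviewer note: the new `mode` handling reads the configured value digit-for-digit as octal, so an integer `755` and a string `"0644"` both end up as permission bits. A minimal sketch of that conversion, where the helper name `parse_file_mode` is hypothetical and not part of the change:

```python
def parse_file_mode(mode):
    """Interpret a configured file mode as an octal permission value.

    Hypothetical helper for illustration only; it mirrors the conversion
    in the hunk above, where 755 (int) and "755" (str) both mean 0o755.
    """
    # str() makes the int and str cases identical; base 8 reads the digits as octal
    return int(str(mode), 8)


if __name__ == "__main__":
    print(oct(parse_file_mode(755)))
    print(oct(parse_file_mode("0644")))
```

A value that is not valid octal (for example `"rwx"` or `789`) raises `ValueError`, which the `except Exception` in the hunk would report as a skipped entry.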
--- a/clearml_agent/commands/config.py
+++ b/clearml_agent/commands/config.py
@@ -44,7 +44,7 @@ def main():

    if conf_file.exists() and conf_file.is_file() and conf_file.stat().st_size > 0:
        print('Configuration file already exists: {}'.format(str(conf_file)))
-        print('Leaving setup, feel free to edit the configuration file.')
+        print("Leaving setup. If you've previously initialized the ClearML SDK on this machine, manually add an 'agent' section to this file.")
        return

    print(description, end='')
--- a/clearml_agent/commands/events.py
+++ b/clearml_agent/commands/events.py
@@ -2,6 +2,7 @@ from __future__ import print_function

 import json
 import time
+from typing import List, Tuple

 from clearml_agent.commands.base import ServiceCommandSection
 from clearml_agent.helper.base import return_list
@@ -57,6 +58,42 @@ class Events(ServiceCommandSection):
        # print('Sending events done: %d / %d events sent' % (sent_events, len(list_events)))
        return sent_events

+    def send_log_events_with_timestamps(
+        self, worker_id, task_id, lines_with_ts: List[Tuple[int, str]], level="DEBUG", session=None
+    ):
+        log_events = []
+
+        # break log lines into event packets
+        for ts, line in return_list(lines_with_ts):
+            # HACK ignore terminal reset ANSI code
+            if line == '\x1b[0m':
+                continue
+            while line:
+                if len(line) <= self.max_event_size:
+                    msg = line
+                    line = None
+                else:
+                    msg = line[:self.max_event_size]
+                    line = line[self.max_event_size:]
+
+                log_events.append(
+                    {
+                        "type": "log",
+                        "level": level,
+                        "task": task_id,
+                        "worker": worker_id,
+                        "msg": msg,
+                        "timestamp": ts,
+                    }
+                )
+
+                if line and ts is not None:
+                    # advance timestamp in case we break a line to more than one part
+                    ts += 1
+
+        # now send the events
+        return self.send_events(list_events=log_events, session=session)
+
    def send_log_events(self, worker_id, task_id, lines, level='DEBUG', session=None):
        log_events = []
        base_timestamp = int(time.time() * 1000)
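
Reviewer note: the chunking in `send_log_events_with_timestamps` can be read in isolation as follows. This is a sketch under assumptions: the generator name `split_log_lines` is hypothetical, and timestamps are assumed to be integers (or `None`), which is what the `ts += 1` in the hunk implies.

```python
def split_log_lines(lines_with_ts, max_event_size):
    """Yield (timestamp, message) pairs, splitting long lines into chunks.

    Hypothetical stand-alone version of the loop above, for illustration.
    """
    for ts, line in lines_with_ts:
        while line:
            # take at most max_event_size characters, keep the remainder
            msg, line = line[:max_event_size], line[max_event_size:]
            yield ts, msg
            if line and ts is not None:
                # bump the timestamp so chunks of one line keep their order
                ts += 1


if __name__ == "__main__":
    for event in split_log_lines([(1000, "abcdefg"), (None, "xy")], 3):
        print(event)
```

Empty lines produce no events, matching the `while line:` guard in the original loop.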
--- a/clearml_agent/commands/resolver.py
+++ b/clearml_agent/commands/resolver.py
@@ -109,15 +109,15 @@ def resolve_default_container(session, task_id, container_config):
                    match.get('script.binary', None), entry))
                continue

-        if match.get('container', None):
-            # noinspection PyBroadException
-            try:
-                if not re.search(match.get('container', None), requested_container.get('image', '')):
-                    continue
-            except Exception:
-                print('Failed parsing regular expression \"{}\" in rule: {}'.format(
-                    match.get('container', None), entry))
-                continue
+        # if match.get('image', None):
+        #     # noinspection PyBroadException
+        #     try:
+        #         if not re.search(match.get('image', None), requested_container.get('image', '')):
+        #             continue
+        #     except Exception:
+        #         print('Failed parsing regular expression \"{}\" in rule: {}'.format(
+        #             match.get('image', None), entry))
+        #         continue

        matched = True
        for req_section in ['script.requirements.pip', 'script.requirements.conda']:
@@ -156,8 +156,8 @@ def resolve_default_container(session, task_id, container_config):
            break

        if matched:
-            if not container_config.get('container'):
-                container_config['container'] = entry.get('image', None)
+            if not container_config.get('image'):
+                container_config['image'] = entry.get('image', None)
            if not container_config.get('arguments'):
                container_config['arguments'] = entry.get('arguments', None)
                container_config['arguments'] = shlex.split(str(container_config.get('arguments') or '').strip())
--- a/clearml_agent/commands/worker.py
+++ b/clearml_agent/commands/worker.py
@@ -12,6 +12,7 @@ import shlex
 import shutil
 import signal
 import string
+import socket
 import subprocess
 import sys
 import traceback
@@ -24,7 +25,7 @@ from functools import partial
 from os.path import basename
 from tempfile import mkdtemp, NamedTemporaryFile
 from time import sleep, time
-from typing import Text, Optional, Any, Tuple, List
+from typing import Text, Optional, Any, Tuple, List, Dict, Mapping, Union

 import attr
 import six
@@ -40,6 +41,7 @@ from clearml_agent.backend_api.session import CallResult, Request
 from clearml_agent.backend_api.session.defs import (
    ENV_ENABLE_ENV_CONFIG_SECTION, ENV_ENABLE_FILES_CONFIG_SECTION,
    ENV_VENV_CONFIGURED, ENV_PROPAGATE_EXITCODE, )
+from clearml_agent.backend_config import Config
 from clearml_agent.backend_config.defs import UptimeConf
 from clearml_agent.backend_config.utils import apply_environment, apply_files
 from clearml_agent.backend_config.converters import text_to_int
@@ -71,6 +73,12 @@ from clearml_agent.definitions import (
    ENV_DOCKER_ARGS_FILTERS,
    ENV_FORCE_SYSTEM_SITE_PACKAGES,
    ENV_SERVICES_DOCKER_RESTART,
+    ENV_CONFIG_BC_IN_STANDALONE,
+    ENV_FORCE_DOCKER_AGENT_REPO,
+    ENV_EXTRA_DOCKER_LABELS,
+    ENV_AGENT_FORCE_CODE_DIR,
+    ENV_AGENT_FORCE_EXEC_SCRIPT,
+    ENV_TEMP_STDOUT_FILE_DIR,
 )
 from clearml_agent.definitions import WORKING_REPOSITORY_DIR, PIP_EXTRA_INDICES
 from clearml_agent.errors import (
@@ -316,6 +324,37 @@ def get_next_task(session, queue, get_task_info=False):
    return data


+def get_task_fields(session, task_id, fields: list, log=None) -> dict:
+    """
+    Returns a dict mapping each requested (dot-separated) field path to its value on the Task, or {} on failure
+    """
+    result = session.send_request(
+        service='tasks',
+        action='get_all',
+        json={'id': [task_id], 'only_fields': list(fields), 'search_hidden': True},
+        method=Request.def_method,
+        async_enable=False,
+    )
+    # noinspection PyBroadException
+    try:
+        results = {}
+        result = result.json()['data']['tasks'][0]
+        for field in fields:
+            cur = result
+            for part in field.split("."):
+                if part.isdigit():
+                    cur = cur[int(part)] if isinstance(cur, list) else cur[part]
+                else:
+                    cur = cur.get(part, {})
+            results[field] = cur
+        return results
+    except Exception as ex:
+        if log:
+            log.error("Failed obtaining values for task fields {}: {}".format(fields, ex))
+        pass
+    return {}
+
+
 def get_task_container(session, task_id):
    """
    Returns dict with Task docker container setup {container: '', arguments: '', setup_shell_script: ''}
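
Reviewer note: `get_task_fields` resolves each requested field by walking a dot-separated path through the task JSON. A minimal sketch of that traversal, assuming numeric path parts are meant to index into lists; the helper name `get_by_path` and the sample data are hypothetical:

```python
def get_by_path(data, path):
    """Resolve a dot-separated path such as "a.b.0.c" in nested dicts/lists.

    Hypothetical helper for illustration; missing dict keys resolve to {}.
    """
    cur = data
    for part in path.split("."):
        if isinstance(cur, list) and part.isdigit():
            # numeric part on a list: treat it as an index
            cur = cur[int(part)]
        elif isinstance(cur, dict):
            # missing keys fall back to an empty dict so the walk can continue
            cur = cur.get(part, {})
        else:
            return {}
    return cur


if __name__ == "__main__":
    task = {"project": {"name": "demo"}, "tags": [{"label": "gpu"}]}
    print(get_by_path(task, "project.name"))
    print(get_by_path(task, "tags.0.label"))
```

An out-of-range list index still raises `IndexError` in this sketch; the function in the hunk wraps the whole walk in a broad `except` and returns `{}` in that case.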
@@ -333,20 +372,25 @@ def get_task_container(session, task_id):
            container = result.json()['data']['tasks'][0]['container'] if result.ok else {}
            if container.get('arguments'):
                container['arguments'] = shlex.split(str(container.get('arguments')).strip())
+            if container.get('image'):
+                container['image'] = container.get('image').strip()
        except (ValueError, TypeError):
            container = {}
    else:
        response = get_task(session, task_id, only_fields=["execution.docker_cmd"])
-        task_docker_cmd_parts = shlex.split(str(response.execution.docker_cmd or '').strip())
-        try:
-            container = dict(
-                container=task_docker_cmd_parts[0],
-                arguments=task_docker_cmd_parts[1:] if len(task_docker_cmd_parts[0]) > 1 else ''
-            )
-        except (ValueError, TypeError):
-            container = {}
+        container = {}
+        if response.execution:
+            task_docker_cmd_parts = shlex.split(str(response.execution.docker_cmd or '').strip())
+            if task_docker_cmd_parts:
+                try:
+                    container = dict(
+                        image=task_docker_cmd_parts[0],
+                        arguments=task_docker_cmd_parts[1:] if len(task_docker_cmd_parts) > 1 else ''
+                    )
+                except (ValueError, TypeError):
+                    pass

-    if (not container or not container.get('container')) and session.check_min_api_version("2.13"):
+    if (not container or not container.get('image')) and session.check_min_api_version("2.13"):
        container = resolve_default_container(session=session, task_id=task_id, container_config=container)

    return container
@@ -461,19 +505,7 @@ class TaskStopSignal(object):
            return True

        # check if abort callback is turned on
-        cb_completed = None
-        # TODO: add retries on network error with timeout
-        try:
-            task_info = self.session.get(
-                service="tasks", action="get_all", version="2.13", id=[self.task_id],
-                only_fields=["status", "status_message", "runtime._abort_callback_timeout",
-                             "runtime._abort_poll_freq", "runtime._abort_callback_completed"])
-            abort_timeout = task_info['tasks'][0]['runtime'].get('_abort_callback_timeout', 0)
-            poll_timeout = task_info['tasks'][0]['runtime'].get('_abort_poll_freq', 0)
-            cb_completed = task_info['tasks'][0]['runtime'].get('_abort_callback_completed', None)
-        except:  # noqa
-            abort_timeout = None
-            poll_timeout = None
+        abort_timeout, poll_timeout, cb_completed = self._get_abort_callback_stat()

        if not abort_timeout:
            # no callback set we can leave
@@ -496,8 +528,39 @@ class TaskStopSignal(object):
        self._active_callback_timeout = timeout
        return bool(cb_completed)

-    def was_abort_function_called(self):
-        return bool(self._active_callback_timestamp)
+    def _get_abort_callback_stat(self):
+        # TODO: add retries on network error with timeout
+        try:
+            task_info = self.session.get(
+                service="tasks", action="get_all", version="2.13", id=[self.task_id],
+                only_fields=["status", "status_message", "runtime._abort_callback_timeout",
+                             "runtime._abort_poll_freq", "runtime._abort_callback_completed"])
+            abort_timeout = task_info['tasks'][0]['runtime'].get('_abort_callback_timeout', 0)
+            poll_timeout = task_info['tasks'][0]['runtime'].get('_abort_poll_freq', 0)
+            cb_completed = task_info['tasks'][0]['runtime'].get('_abort_callback_completed', None)
+        except:  # noqa
+            abort_timeout = None
+            poll_timeout = None
+            cb_completed = None
+
+        return abort_timeout, poll_timeout, cb_completed
+
+    def was_abort_function_called(self, process_error_code=None):
+        if not self._support_callback:
+            return False
+
+        if self._active_callback_timestamp:
+            return True
+
+        # if the process error code is SIGKILL (exit code 137) -
+        # check the runtime info of the Task - it might have killed itself because it was aborted
+        if process_error_code in (-9, 137):
+            # check if abort callback is turned on
+            _, _, cb_completed = self._get_abort_callback_stat()
+            if cb_completed:
+                return True
+
+        return False

    def _test(self):
        # type: () -> TaskStopReason
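
Reviewer note: the `(-9, 137)` check in `was_abort_function_called` covers the two common ways a SIGKILL shows up as an exit status. A short sketch of that test, where the helper name `killed_by_sigkill` is hypothetical and the signal number is hard-coded as an assumption (9 on Linux and macOS):

```python
SIGKILL_NUMBER = 9  # assumption: POSIX numbering, hard-coded so the sketch also runs on Windows


def killed_by_sigkill(exit_code):
    """Return True if an exit status looks like the process was SIGKILLed.

    Hypothetical helper for illustration only.
    """
    # subprocess.Popen reports -N when the child died from signal N;
    # a shell (or docker) reports 128 + N for the same event
    return exit_code in (-SIGKILL_NUMBER, 128 + SIGKILL_NUMBER)


if __name__ == "__main__":
    for code in (0, 1, -9, 137, None):
        print(code, killed_by_sigkill(code))
```

In the change itself a matching exit code is only a hint: the agent still queries the task's runtime properties and treats the run as aborted only when the abort callback is reported as completed.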
@@ -597,6 +660,8 @@ class Worker(ServiceCommandSection):
    _docker_fixed_user_cache = '/clearml_agent_cache'
    _temp_cleanup_list = []

+    hostname_task_runtime_prop = "_exec_agent_hostname"
+
    @property
    def service(self):
        """ Worker command service endpoint """
@@ -622,9 +687,13 @@ class Worker(ServiceCommandSection):
        self.log = self._session.get_logger(__name__)
        self.register_signal_handler()
        self._worker_registered = False
+
+        self._apply_extra_configuration()
+
        self.is_conda = is_conda(self._session.config)  # type: bool
        # Add extra index url - system wide
        extra_url = None
+        # noinspection PyBroadException
        try:
            if self._session.config.get("agent.package_manager.extra_index_url", None):
                extra_url = self._session.config.get("agent.package_manager.extra_index_url", [])
@@ -812,6 +881,31 @@ class Worker(ServiceCommandSection):
        # "Running task '{}'".format(task_id)
        print(self._task_logging_start_message.format(task_id))
        task_session = task_session or self._session
+
+        # noinspection PyBroadException
+        try:
+            result = task_session.send_request(
+                service='tasks',
+                action='get_all',
+                version='2.15',
+                method=Request.def_method,
+                json={'id': [task_id], 'only_fields': ["runtime"], 'search_hidden': True}
+            )
+
+            runtime = result.json().get("data", {}).get("tasks", [])[0].get("runtime") or {}
+            runtime[self.hostname_task_runtime_prop] = socket.gethostname()
+
+            res = task_session.send_request(
+                service='tasks', action='edit', method=Request.def_method,
+                json={
+                    "task": task_id, "force": True, "runtime": runtime
+                },
+            )
+            if not res.ok:
+                raise Exception("failed setting runtime property")
+        except Exception as ex:
+            print("Warning: failed obtaining/setting hostname for task '{}': {}".format(task_id, ex))
+
        # set task status to in_progress so we know it was popped from the queue
        # noinspection PyBroadException
        try:
@@ -821,7 +915,7 @@ class Worker(ServiceCommandSection):
            return
        # setup console log
        temp_stdout_name = safe_mkstemp(
-            suffix=".txt", prefix=".clearml_agent_out.", name_only=True
+            suffix=".txt", prefix=".clearml_agent_out.", name_only=True, dir=(ENV_TEMP_STDOUT_FILE_DIR.get() or None)
        )
        # temp_stderr_name = safe_mkstemp(suffix=".txt", prefix=".clearml_agent_err.", name_only=True)
        temp_stderr_name = None
@@ -887,11 +981,21 @@ class Worker(ServiceCommandSection):

            name_format = self._session.config.get('agent.docker_container_name_format', None)
            if name_format:
+                custom_fields = {}
+                name_format_fields = self._session.config.get('agent.docker_container_name_format_fields', None)
+                if name_format_fields:
+                    field_values = get_task_fields(task_session, task_id, name_format_fields.values(), log=self.log)
+                    custom_fields = {
+                        k: field_values.get(v)
+                        for k, v in name_format_fields.items()
+                    }
+
                try:
                    name = name_format.format(
                        task_id=re.sub(r'[^a-zA-Z0-9._-]', '-', task_id),
                        worker_id=re.sub(r'[^a-zA-Z0-9._-]', '-', worker_id),
-                        rand_string="".join(sys_random.choice(string.ascii_lowercase) for _ in range(32))
+                        rand_string="".join(sys_random.choice(string.ascii_lowercase) for _ in range(32)),
+                        **custom_fields,
                    )
                except Exception as ex:
                    print("Warning: failed generating docker container name: {}".format(ex))
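
Reviewer note: with `agent.docker_container_name_format_fields`, extra placeholders in the name format are filled from task fields. A sketch of the formatting step only, assuming the field values have already been fetched; the function name `build_container_name`, the format string, and the `project` placeholder are hypothetical examples:

```python
import re


def build_container_name(name_format, task_id, worker_id, custom_fields):
    """Format a container name from built-in and custom placeholders.

    Hypothetical helper for illustration; custom values are passed through
    unchanged, as in the hunk above.
    """
    def sanitize(value):
        # replace anything outside the allowed character set with "-"
        return re.sub(r"[^a-zA-Z0-9._-]", "-", value)

    return name_format.format(
        task_id=sanitize(task_id),
        worker_id=sanitize(worker_id),
        **custom_fields
    )


if __name__ == "__main__":
    print(build_container_name(
        "clearml-{task_id}-{project}", "abc 123", "host:0", {"project": "demo"}
    ))
```

A custom field named `task_id` or `worker_id` would collide with the built-in keyword arguments and raise `TypeError`; in the change this is caught by the surrounding `except` and reported as a warning.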
@@ -1032,6 +1136,7 @@ class Worker(ServiceCommandSection):
            if not (result.ok() and result.response):
                return
            new_session = copy(session)
+            new_session.config = deepcopy(session.config)
            new_session.api_client = None
            new_session.set_auth_token(result.response.token)
            return new_session
@@ -1459,8 +1564,6 @@ class Worker(ServiceCommandSection):
        return self._resolve_queue_names(queues=queues, create_if_missing=create_if_missing)

    def daemon(self, queues, log_level, foreground=False, docker=False, detached=False, order_fairness=False, **kwargs):
-        self._apply_extra_configuration()
-
        # check that we have docker command if we need it
        if docker not in (False, None) and not check_if_command_exists("docker"):
            raise ValueError("Running in Docker mode, 'docker' command was not found")
@@ -1577,6 +1680,7 @@ class Worker(ServiceCommandSection):
                open_kwargs={
                    "buffering": self._session.config.get("agent.log_files_buffering", 1)
                },
+                dir=(ENV_TEMP_STDOUT_FILE_DIR.get() or None)
            )
            print(
                "Running CLEARML-AGENT daemon in background mode, writing stdout/stderr to {}".format(
@@ -1631,7 +1735,11 @@ class Worker(ServiceCommandSection):
                    if self._session.config.get("agent.crash_on_exception", False):
                        raise e

-                    crash_file, name = safe_mkstemp(prefix=".clearml_agent-crash", suffix=".log")
+                    crash_file, name = safe_mkstemp(
+                        prefix=".clearml_agent-crash",
+                        suffix=".log",
+                        dir=(ENV_TEMP_STDOUT_FILE_DIR.get() or None)
+                    )
                    try:
                        with crash_file:
                            crash_file.write(tb)
@@ -1916,7 +2024,7 @@ class Worker(ServiceCommandSection):
            stderr_line_count += report_lines(printed_lines, "stderr")

        # make sure that if the abort function was called, the task is marked as aborted
-        if stop_signal and stop_signal.was_abort_function_called():
+        if stop_signal and stop_signal.was_abort_function_called(status):
            stop_reason = TaskStopReason.stopped

        return status, stop_reason
@@ -2001,19 +2109,26 @@ class Worker(ServiceCommandSection):

    def _apply_extra_configuration(self):
        # store a few things we updated in runtime (TODO: we should list theme somewhere)
-        agent_config = self._session.config["agent"].copy()
+        vault_loaded = False
+        session = self._session
+        agent_config = session.config["agent"].copy()
        agent_config_keys = ["cuda_version", "cudnn_version", "default_python", "worker_id", "worker_name", "debug"]
        try:
-            self._session.load_vaults()
+            vault_loaded = session.load_vaults()
        except Exception as ex:
            print("Error: failed applying extra configuration: {}".format(ex))

-        # merge back
-        for restore_key in agent_config_keys:
-            if restore_key in agent_config:
-                self._session.config["agent"][restore_key] = agent_config[restore_key]
+        config = session.config
+
+        # merge back
+        if vault_loaded:
+            for restore_key in agent_config_keys:
+                if restore_key in agent_config and agent_config[restore_key] != config["agent"].get(restore_key, None):
+                    print("Ignoring vault value for '{}' (agent config takes precedence), using '{}'".format(
+                        restore_key, agent_config[restore_key]
+                    ))
+                    config["agent"][restore_key] = agent_config[restore_key]

-        config = self._session.config
        default = config.get("agent.apply_environment", False)
        if ENV_ENABLE_ENV_CONFIG_SECTION.get(default=default):
            try:
@@ -2139,7 +2254,11 @@ class Worker(ServiceCommandSection):
    def _build_docker(self, docker, target, task_id, entry_point=None, force_docker=False):

        self.temp_config_path = safe_mkstemp(
-            suffix=".cfg", prefix=".clearml_agent.", text=True, name_only=True
+            suffix=".cfg",
+            prefix=".clearml_agent.",
+            text=True,
+            name_only=True,
+            dir=(ENV_TEMP_STDOUT_FILE_DIR.get() or None)
        )
        if not target:
            target = "task_id_{}".format(task_id)
@@ -2295,8 +2414,10 @@ class Worker(ServiceCommandSection):
                print("Cloning task id={}".format(task_id))
                current_task = self._session.api_client.tasks.get_by_id(
                    self._session.send_api(
-                        tasks_api.CloneRequest(task=current_task.id,
-                                               new_task_name='Clone of {}'.format(current_task.name))
+                        tasks_api.CloneRequest(
+                            task=current_task.id,
+                            new_task_name="Clone of {}".format(current_task.name)
+                        )
                    ).id
                )
                print("Task cloned, new task id={}".format(current_task.id))
@@ -2304,11 +2425,23 @@ class Worker(ServiceCommandSection):
                raise CommandFailedError("Cloning failed")
        else:
            # make sure this task is not stuck in an execution queue, it shouldn't have been, but just in case.
+            # noinspection PyBroadException
            try:
-                res = self._session.api_client.tasks.dequeue(task=current_task.id)
-                if require_queue and res.meta.result_code != 200:
-                    raise ValueError("Execution required enqueued task, "
-                                     "but task id={} is not queued.".format(current_task.id))
+                res = self._session.send_request(
+                    service="tasks", action="dequeue", method=Request.def_method,
+                    json={"task": current_task.id, "new_status": "in_progress"},
+                )
+                if require_queue and (not res.ok or res.json().get("data", {}).get("updated", 0) < 1):
+                    raise ValueError(
+                        "Execution required enqueued task, but task id={} is not queued.".format(current_task.id)
+                    )
+                # Set task status to started to prevent any external monitoring from killing it
+                self._session.api_client.tasks.started(
+                    task=current_task.id,
+                    status_reason="starting execution soon",
+                    status_message="",
+                    force=True,
+                )
            except Exception:
                if require_queue:
                    raise
@@ -2319,14 +2452,18 @@ class Worker(ServiceCommandSection):
        # We expect the same behaviour in case full_monitoring was set, and in case docker mode is used
        if full_monitoring or docker is not False:
            if full_monitoring:
-                if not (ENV_WORKER_ID.get() or '').strip():
-                    self._session.config["agent"]["worker_id"] = ''
+                if not (ENV_WORKER_ID.get() or "").strip():
+                    self._session.config["agent"]["worker_id"] = ""
                # make sure we support multiple instances if we need to
                self._singleton()
                self.temp_config_path = self.temp_config_path or safe_mkstemp(
-                    suffix=".cfg", prefix=".clearml_agent.", text=True, name_only=True
+                    suffix=".cfg",
+                    prefix=".clearml_agent.",
+                    text=True,
+                    name_only=True,
+                    dir=(ENV_TEMP_STDOUT_FILE_DIR.get() or None)
                )
-                self.dump_config(self.temp_config_path)
+                self.dump_config(filename=self.temp_config_path, config=self._session.pre_vault_config)
                self._session._config_file = self.temp_config_path

            worker_params = WorkerParams(
@@ -2347,8 +2484,6 @@ class Worker(ServiceCommandSection):
                    Singleton.close_pid_file()
            return status if ENV_PROPAGATE_EXITCODE.get() else 0

-        self._apply_extra_configuration()
-
        self._session.print_configuration()

        # now mark the task as started
@@ -2365,6 +2500,12 @@ class Worker(ServiceCommandSection):

        execution = self.get_execution_info(current_task)

+        if ENV_AGENT_FORCE_EXEC_SCRIPT.get():
+            entry_point_parts = str(ENV_AGENT_FORCE_EXEC_SCRIPT.get()).split(":", 1)
+            execution.entry_point = entry_point_parts[-1]
+            execution.working_dir = entry_point_parts[0] if len(entry_point_parts) > 1 else "."
+            print("WARNING: Using forced script entry [{}:{}]".format(execution.working_dir, execution.entry_point))
+
        python_ver = self._get_task_python_version(current_task)

        freeze = None
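
Reviewer note: `ENV_AGENT_FORCE_EXEC_SCRIPT` is parsed as an optional `working_dir:entry_point` pair. A minimal sketch of that split, where the helper name `parse_forced_entry` and the sample values are hypothetical:

```python
def parse_forced_entry(value):
    """Split "working_dir:entry_point" into its two parts.

    Hypothetical helper mirroring the parsing in the hunk above; without a
    colon the working directory defaults to ".".
    """
    parts = str(value).split(":", 1)
    entry_point = parts[-1]
    working_dir = parts[0] if len(parts) > 1 else "."
    return working_dir, entry_point


if __name__ == "__main__":
    print(parse_forced_entry("src:train.py"))
    print(parse_forced_entry("train.py"))
```

Because the split happens at the first colon, a value that contains a drive-letter colon would be divided at that colon, so a relative or POSIX-style path is the safer input for this sketch.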
@@ -2442,8 +2583,9 @@ class Worker(ServiceCommandSection):
                code_folder = self._session.config.get("agent.venvs_dir")
                code_folder = Path(os.path.expanduser(os.path.expandvars(code_folder)))
                # let's make sure it is clear from previous runs
-                rm_tree(normalize_path(code_folder, WORKING_REPOSITORY_DIR))
-                rm_tree(normalize_path(code_folder, WORKING_STANDALONE_DIR))
+                if not standalone_mode:
+                    rm_tree(normalize_path(code_folder, WORKING_REPOSITORY_DIR))
+                    rm_tree(normalize_path(code_folder, WORKING_STANDALONE_DIR))
                if not code_folder.exists():
                    code_folder.mkdir(parents=True, exist_ok=True)
                alternative_code_folder = code_folder.as_posix()
@@ -2461,10 +2603,14 @@ class Worker(ServiceCommandSection):

                    print("\n")

-            # either use the venvs base folder for code or the cwd
-            directory, vcs, repo_info = self.get_repo_info(
-                execution, current_task, str(venv_folder or alternative_code_folder)
-            )
+            # if we force code directory - by definition we do not clone or apply any changes
+            if ENV_AGENT_FORCE_CODE_DIR.get():
+                directory, vcs, repo_info = ENV_AGENT_FORCE_CODE_DIR.get(), None, None
+            else:
+                # either use the venvs base folder for code or the cwd
+                directory, vcs, repo_info = self.get_repo_info(
+                    execution, current_task, str(alternative_code_folder or venv_folder)
+                )

            print("\n")

@@ -2634,7 +2780,10 @@ class Worker(ServiceCommandSection):
            else:
                # store stdout/stderr into file, and send to backend
                temp_stdout_fname = log_file or safe_mkstemp(
-                    suffix=".txt", prefix=".clearml_agent_out.", name_only=True
+                    suffix=".txt",
+                    prefix=".clearml_agent_out.",
+                    name_only=True,
+                    dir=(ENV_TEMP_STDOUT_FILE_DIR.get() or None)
                )
                print("Storing stdout and stderr log into [%s]" % temp_stdout_fname)
                exit_code, _ = self._log_command_output(
@@ -2917,7 +3066,7 @@ class Worker(ServiceCommandSection):
        # Todo: add support for poetry caching
        if not self.poetry.enabled:
            # add to cache
-            if add_venv_folder_cache:
+            if add_venv_folder_cache and not self._standalone_mode:
                print('Adding venv into cache: {}'.format(add_venv_folder_cache))
                self.package_api.add_cached_venv(
                    requirements=[freeze, previous_reqs],
@@ -3515,6 +3664,11 @@ class Worker(ServiceCommandSection):
        requirements_manager.translator.enabled = False
        print(requirements_manager.replace(contents))

+    def remove_non_backwards_compatible_entries(self, config: Config):
+        if not self._standalone_mode or not ENV_CONFIG_BC_IN_STANDALONE.get() or self._session.feature_set == "basic":
+            return
+        config.pop("agent.package_manager.pip_version")  # removed due to a breaking change in v1.5.1
+
    def get_docker_config_cmd(self, docker_args, clean_api_credentials=False):
        docker_image = str(ENV_DOCKER_IMAGE.get() or
                           self._session.config.get("agent.default_docker.image", "nvidia/cuda")) \
@@ -3537,6 +3691,7 @@ class Worker(ServiceCommandSection):
            DockerArgsSanitizer.sanitize_docker_command(self._session, self._docker_arguments) or ''))

        temp_config = deepcopy(self._session.config)
+        self.remove_non_backwards_compatible_entries(temp_config)
        mounted_cache_dir = temp_config.get(
            "agent.docker_internal_mounts.sdk_cache", self._docker_fixed_user_cache)
        mounted_pip_dl_dir = temp_config.get(
@@ -3587,34 +3742,35 @@ class Worker(ServiceCommandSection):

    def _get_docker_config_cmd(self, temp_config, clean_api_credentials=False, **kwargs):
        self.debug("Setting up docker config command")
-        host_cache = Path(os.path.expandvars(
-            self._session.config["sdk.storage.cache.default_base_dir"])).expanduser().as_posix()
+
+        def load_path(field, default=None):
+            value = self._session.config.get(field, default)
+            return Path(os.path.expandvars(value)).expanduser().as_posix() if value else None
+
+        host_cache = load_path("sdk.storage.cache.default_base_dir")
        self.debug("host_cache: {}".format(host_cache))
-        host_pip_dl = Path(os.path.expandvars(
-            self._session.config["agent.pip_download_cache.path"])).expanduser().as_posix()
+
+        host_pip_dl = load_path("agent.pip_download_cache.path")
        self.debug("host_pip_dl: {}".format(host_pip_dl))
-        host_vcs_cache = Path(os.path.expandvars(
-            self._session.config["agent.vcs_cache.path"])).expanduser().as_posix()
+
+        host_vcs_cache = load_path("agent.vcs_cache.path")
        self.debug("host_vcs_cache: {}".format(host_vcs_cache))
-        host_venvs_cache = Path(os.path.expandvars(
-            self._session.config["agent.venvs_cache.path"])).expanduser().as_posix() \
-            if self._session.config.get("agent.venvs_cache.path", None) else None
+
+        host_venvs_cache = load_path("agent.venvs_cache.path")
        self.debug("host_venvs_cache: {}".format(host_venvs_cache))
+
        host_ssh_cache = self._host_ssh_cache
        self.debug("host_ssh_cache: {}".format(host_ssh_cache))

-        host_apt_cache = Path(os.path.expandvars(self._session.config.get(
-            "agent.docker_apt_cache", '~/.clearml/apt-cache'))).expanduser().as_posix()
+        host_apt_cache = load_path("agent.docker_apt_cache", default="~/.clearml/apt-cache")
        self.debug("host_apt_cache: {}".format(host_apt_cache))
-        host_pip_cache = Path(os.path.expandvars(self._session.config.get(
-            "agent.docker_pip_cache", '~/.clearml/pip-cache'))).expanduser().as_posix()
+
+        host_pip_cache = load_path("agent.docker_pip_cache", default="~/.clearml/pip-cache")
        self.debug("host_pip_cache: {}".format(host_pip_cache))

-        if self.poetry.enabled:
-            host_poetry_cache = Path(os.path.expandvars(self._session.config.get(
-                "agent.docker_poetry_cache", '~/.clearml/poetry-cache'))).expanduser().as_posix()
-        else:
-            host_poetry_cache = None
+        host_poetry_cache = (
+            load_path("agent.docker_poetry_cache", "~/.clearml/poetry-cache") if self.poetry.enabled else None
+        )
        self.debug("host_poetry_cache: {}".format(host_poetry_cache))

        # make sure all folders are valid
@@ -3674,7 +3830,11 @@ class Worker(ServiceCommandSection):
        install_opencv_libs = self._session.config.get("agent.docker_install_opencv_libs", True)

        self.temp_config_path = self.temp_config_path or safe_mkstemp(
-            suffix=".cfg", prefix=".clearml_agent.", text=True, name_only=True
+            suffix=".cfg",
+            prefix=".clearml_agent.",
+            text=True,
+            name_only=True,
+            dir=(ENV_TEMP_STDOUT_FILE_DIR.get() or None)
        )

        mounted_cache_dir = temp_config.get("sdk.storage.cache.default_base_dir")
@@ -3766,6 +3926,60 @@ class Worker(ServiceCommandSection):
                        pass
        return results

+    @staticmethod
+    def _resolve_docker_env_args(docker_args):
+        # type: (List[str]) -> List[str]
+        """
+        Resolve -e / --env docker environment args matching $VAR or ${VAR} from the host environment
+
+        :argument docker_args: List of docker argument strings (flags and values)
+        """
+        non_list_args = (
+            "rm", "read-only", "sig-proxy", "tty", "privileged", "publish-all", "interactive", "init", "help", "detach"
+        )
+        non_list_args_single = (
+            "t", "P", "i", "d",
+        )
+
+        # no docker args given, nothing to resolve
+        if not docker_args:
+            return docker_args
+
+        args = docker_args[:]
+        skip_arg = False
+        for i, cmd in enumerate(docker_args):
+            if skip_arg and not cmd.startswith("-"):
+                continue
+
+            skip_arg = False
+
+            if cmd.startswith("--"):
+                # skip long boolean flags that take no value
+                if cmd[2:] in non_list_args:
+                    continue
+            elif cmd.startswith("-"):
+                # skip single-character boolean flags that take no value
+                if cmd[1:] in non_list_args_single:
+                    continue
+
+            # if we are here we have a command to bypass and the list after it
+            if cmd in ('-e', '--env'):
+                skip_arg = True
+                for j in range(i+1, len(args)):
+                    if args[j].startswith("-"):
+                        break
+
+                    parts = args[j].split("=", 1)
+                    if len(parts) != 2:
+                        continue
+
+                    args[j] = "{}={}".format(parts[0], os.path.expandvars(parts[1]))
+
+            elif cmd.startswith("-"):
+                skip_arg = True
+
+        return args
+
    def _get_docker_cmd(
            self,
            worker_id, parent_worker_id,
@@ -3829,17 +4043,31 @@ class Worker(ServiceCommandSection):
            docker_arguments = list(docker_arguments) \
                if isinstance(docker_arguments, (list, tuple)) else [docker_arguments]
            docker_arguments = self._filter_docker_args(docker_arguments)
-            base_cmd += [a for a in docker_arguments if a]
+            if self._session.config.get("agent.docker_allow_host_environ", None):
+                docker_arguments = self._resolve_docker_env_args(docker_arguments)

        if extra_docker_arguments:
+            # we always resolve environments in the `extra_docker_arguments` because the admin set them (not users)
+            extra_docker_arguments = self._resolve_docker_env_args(extra_docker_arguments)
            extra_docker_arguments = [extra_docker_arguments] \
                if isinstance(extra_docker_arguments, six.string_types) else extra_docker_arguments
-            base_cmd += [str(a) for a in extra_docker_arguments if a]
+
+        # decide on order of docker args when merging overlapping arguments
+        # from extra_docker_args and the Task's docker_args
+        base_cmd += DockerArgsSanitizer.merge_docker_args(
+            config=self._session.config,
+            task_docker_arguments=docker_arguments,
+            extra_docker_arguments=extra_docker_arguments
+        )

        # set docker labels
        base_cmd += ['-l', self._worker_label.format(worker_id)]
        base_cmd += ['-l', self._parent_worker_label.format(parent_worker_id)]

+        extra_labels = ENV_EXTRA_DOCKER_LABELS.get()
+        for label in (extra_labels or []):
+            base_cmd += ['-l', label]
+
        self.debug("Command: {}".format(base_cmd), context="docker")

        # check if running inside a kubernetes
@@ -3903,7 +4131,7 @@ class Worker(ServiceCommandSection):

        base_cmd += ['-e', 'CLEARML_WORKER_ID='+worker_id, ]
        # update the docker image, so the system knows where it runs
-        base_cmd += ['-e', 'CLEARML_DOCKER_IMAGE={} {}'.format(docker_image, ' '.join(docker_arguments or [])).strip()]
+        base_cmd += ['-e', 'CLEARML_DOCKER_IMAGE={}'.format(docker_image)]

        if env_task_id:
            base_cmd += ['-e', 'CLEARML_TASK_ID={}'.format(env_task_id), ]
@@ -3919,9 +4147,13 @@ class Worker(ServiceCommandSection):
        if skip_pip_venv_install:
            base_cmd += ['-e', '{}={}'.format(ENV_AGENT_SKIP_PIP_VENV_INSTALL.vars[0], skip_pip_venv_install)]

+        if self._services_mode:
+            base_cmd += ['-e', 'CLEARML_AGENT_SERVICE_TASK=1']
+
        # if we are running a RC version, install the same version in the docker
        # because the default latest, will be a release version (not RC)
        specify_version = ''
+        # noinspection PyBroadException
        try:
            from clearml_agent.version import __version__
            _version_parts = __version__.split('.')
@@ -3930,13 +4162,15 @@ class Worker(ServiceCommandSection):
        except:
            pass

+        force_agent_repo = ENV_FORCE_DOCKER_AGENT_REPO.get()
+
        if os.environ.get('FORCE_LOCAL_CLEARML_AGENT_WHEEL'):
            local_wheel = os.path.expanduser(os.environ.get('FORCE_LOCAL_CLEARML_AGENT_WHEEL'))
            docker_wheel = '/tmp/{}'.format(basename(local_wheel))
            base_cmd += ['-v', local_wheel + ':' + docker_wheel]
            clearml_agent_wheel = '\"{}\"'.format(docker_wheel)
-        elif os.environ.get('FORCE_CLEARML_AGENT_REPO'):
-            clearml_agent_wheel = os.environ.get('FORCE_CLEARML_AGENT_REPO')
+        elif force_agent_repo:
+            clearml_agent_wheel = force_agent_repo
        else:
            # clearml-agent{specify_version}
            clearml_agent_wheel = 'clearml-agent{specify_version}'.format(specify_version=specify_version)
--- a/clearml_agent/definitions.py
+++ b/clearml_agent/definitions.py
@@ -152,11 +152,14 @@ WORKING_STANDALONE_DIR = "code"
 DEFAULT_VCS_CACHE = normalize_path(CONFIG_DIR, "vcs-cache")
 PIP_EXTRA_INDICES = []
 DEFAULT_PIP_DOWNLOAD_CACHE = normalize_path(CONFIG_DIR, "pip-download-cache")
+ENV_PIP_EXTRA_INSTALL_FLAGS = EnvironmentConfig("CLEARML_EXTRA_PIP_INSTALL_FLAGS", type=list)
 ENV_DOCKER_IMAGE = EnvironmentConfig("CLEARML_DOCKER_IMAGE", "TRAINS_DOCKER_IMAGE")
 ENV_WORKER_ID = EnvironmentConfig("CLEARML_WORKER_ID", "TRAINS_WORKER_ID")
 ENV_WORKER_TAGS = EnvironmentConfig("CLEARML_WORKER_TAGS")
 ENV_AGENT_SKIP_PIP_VENV_INSTALL = EnvironmentConfig("CLEARML_AGENT_SKIP_PIP_VENV_INSTALL")
 ENV_AGENT_SKIP_PYTHON_ENV_INSTALL = EnvironmentConfig("CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL", type=bool)
+ENV_AGENT_FORCE_CODE_DIR = EnvironmentConfig("CLEARML_AGENT_FORCE_CODE_DIR")
+ENV_AGENT_FORCE_EXEC_SCRIPT = EnvironmentConfig("CLEARML_AGENT_FORCE_EXEC_SCRIPT")
 ENV_DOCKER_SKIP_GPUS_FLAG = EnvironmentConfig("CLEARML_DOCKER_SKIP_GPUS_FLAG", "TRAINS_DOCKER_SKIP_GPUS_FLAG")
 ENV_AGENT_GIT_USER = EnvironmentConfig("CLEARML_AGENT_GIT_USER", "TRAINS_AGENT_GIT_USER")
 ENV_AGENT_GIT_PASS = EnvironmentConfig("CLEARML_AGENT_GIT_PASS", "TRAINS_AGENT_GIT_PASS")
@@ -173,10 +176,15 @@ ENV_DOCKER_HOST_MOUNT = EnvironmentConfig(
 )
 ENV_VENV_CACHE_PATH = EnvironmentConfig("CLEARML_AGENT_VENV_CACHE_PATH")
 ENV_EXTRA_DOCKER_ARGS = EnvironmentConfig("CLEARML_AGENT_EXTRA_DOCKER_ARGS", type=list)
+ENV_EXTRA_DOCKER_LABELS = EnvironmentConfig("CLEARML_AGENT_EXTRA_DOCKER_LABELS", type=list)
 ENV_DEBUG_INFO = EnvironmentConfig("CLEARML_AGENT_DEBUG_INFO")
 ENV_CHILD_AGENTS_COUNT_CMD = EnvironmentConfig("CLEARML_AGENT_CHILD_AGENTS_COUNT_CMD")
 ENV_DOCKER_ARGS_FILTERS = EnvironmentConfig("CLEARML_AGENT_DOCKER_ARGS_FILTERS")
 ENV_DOCKER_ARGS_HIDE_ENV = EnvironmentConfig("CLEARML_AGENT_DOCKER_ARGS_HIDE_ENV")
+ENV_CONFIG_BC_IN_STANDALONE = EnvironmentConfig("CLEARML_AGENT_STANDALONE_CONFIG_BC", type=bool)
+""" Maintain backwards compatible configuration when launching in standalone mode """
+
+ENV_FORCE_DOCKER_AGENT_REPO = EnvironmentConfig("FORCE_CLEARML_AGENT_REPO", "CLEARML_AGENT_DOCKER_AGENT_REPO")

 ENV_SERVICES_DOCKER_RESTART = EnvironmentConfig("CLEARML_AGENT_SERVICES_DOCKER_RESTART")
 """
@@ -232,6 +240,12 @@ ENV_CUSTOM_BUILD_SCRIPT = EnvironmentConfig("CLEARML_AGENT_CUSTOM_BUILD_SCRIPT")
    standard flow.
 """

+ENV_PACKAGE_PYTORCH_RESOLVE = EnvironmentConfig("CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE")
+
+ENV_TEMP_STDOUT_FILE_DIR = EnvironmentConfig("CLEARML_AGENT_TEMP_STDOUT_FILE_DIR")
+
+ENV_GIT_CLONE_VERBOSE = EnvironmentConfig("CLEARML_AGENT_GIT_CLONE_VERBOSE", type=bool)
+

 class FileBuffering(IntEnum):
    """
--- a/clearml_agent/glue/daemon.py
+++ b/clearml_agent/glue/daemon.py
@@ -0,0 +1,15 @@
+from threading import Thread
+from clearml_agent.session import Session
+
+
+class K8sDaemon(Thread):
+
+    def __init__(self, agent):
+        super(K8sDaemon, self).__init__(target=self.target)
+        self.daemon = True
+        self._agent = agent
+        self.log = agent.log
+        self._session: Session = agent._session
+
+    def target(self):
+        pass
--- a/clearml_agent/glue/definitions.py
+++ b/clearml_agent/glue/definitions.py
@@ -1,7 +1,11 @@
-from clearml_agent.definitions import EnvironmentConfig
+from clearml_agent.helper.environment import EnvEntry

-ENV_START_AGENT_SCRIPT_PATH = EnvironmentConfig('CLEARML_K8S_GLUE_START_AGENT_SCRIPT_PATH')
+ENV_START_AGENT_SCRIPT_PATH = EnvEntry("CLEARML_K8S_GLUE_START_AGENT_SCRIPT_PATH", default="~/__start_agent__.sh")
 """
 Script path to use when creating the bash script to run the agent inside the scheduled pod's docker container. 
 Script will be appended to the specified file.
 """
+
+ENV_DEFAULT_EXECUTION_AGENT_ARGS = EnvEntry("K8S_GLUE_DEF_EXEC_AGENT_ARGS", default="--full-monitoring --require-queue")
+ENV_POD_AGENT_INSTALL_ARGS = EnvEntry("K8S_GLUE_POD_AGENT_INSTALL_ARGS", default="", lstrip=False)
+ENV_POD_MONITOR_LOG_BATCH_SIZE = EnvEntry("K8S_GLUE_POD_MONITOR_LOG_BATCH_SIZE", default=5, converter=int)
--- a/clearml_agent/glue/errors.py
+++ b/clearml_agent/glue/errors.py
@@ -0,0 +1,12 @@
+
+class GetPodsError(Exception):
+    pass
+
+
+class GetJobsError(Exception):
+    pass
+
+
+class GetPodCountError(Exception):
+    pass
+
--- a/clearml_agent/glue/k8s.py
+++ b/clearml_agent/glue/k8s.py
@@ -9,17 +9,15 @@ import os
 import re
 import subprocess
 import tempfile
-from collections import defaultdict
+from collections import defaultdict, namedtuple
 from copy import deepcopy
 from pathlib import Path
 from pprint import pformat
-from threading import Thread
 from time import sleep, time
 from typing import Text, List, Callable, Any, Collection, Optional, Union, Iterable, Dict, Tuple, Set

 import yaml

-from clearml_agent.backend_api.session import Request
 from clearml_agent.commands.events import Events
 from clearml_agent.commands.worker import Worker, get_task_container, set_task_container, get_next_task
 from clearml_agent.definitions import (
@@ -28,29 +26,32 @@ from clearml_agent.definitions import (
    ENV_AGENT_GIT_PASS,
    ENV_FORCE_SYSTEM_SITE_PACKAGES,
 )
-from clearml_agent.errors import APIError
-from clearml_agent.glue.definitions import ENV_START_AGENT_SCRIPT_PATH
+from clearml_agent.errors import APIError, UsageError
+from clearml_agent.glue.errors import GetPodCountError
+from clearml_agent.glue.utilities import get_path, get_bash_output
+from clearml_agent.glue.pending_pods_daemon import PendingPodsDaemon
 from clearml_agent.helper.base import safe_remove_file
 from clearml_agent.helper.dicts import merge_dicts
 from clearml_agent.helper.process import get_bash_output, stringify_bash_output
 from clearml_agent.helper.resource_monitor import ResourceMonitor
 from clearml_agent.interface.base import ObjectID
+from clearml_agent.backend_api.session import Request
+from clearml_agent.glue.definitions import (
+    ENV_START_AGENT_SCRIPT_PATH,
+    ENV_DEFAULT_EXECUTION_AGENT_ARGS,
+    ENV_POD_AGENT_INSTALL_ARGS,
+)


 class K8sIntegration(Worker):
+    SUPPORTED_KIND = ("pod", "job")
    K8S_PENDING_QUEUE = "k8s_scheduler"
-
    K8S_DEFAULT_NAMESPACE = "clearml"
    AGENT_LABEL = "CLEARML=agent"
+    QUEUE_LABEL = "clearml-agent-queue"

    KUBECTL_APPLY_CMD = "kubectl apply --namespace={namespace} -f"

-    KUBECTL_CLEANUP_DELETE_CMD = "kubectl delete pods " \
-                                 "-l={agent_label} " \
-                                 "--field-selector=status.phase!=Pending,status.phase!=Running " \
-                                 "--namespace={namespace} " \
-                                 "--output name"
-
    BASH_INSTALL_SSH_CMD = [
        "apt-get update",
        "apt-get install -y openssh-server",
@@ -66,9 +67,6 @@ class K8sIntegration(Worker):
        'echo "ldconfig" >> /etc/profile',
        "/usr/sbin/sshd -p {port}"]

-    DEFAULT_EXECUTION_AGENT_ARGS = os.getenv("K8S_GLUE_DEF_EXEC_AGENT_ARGS", "--full-monitoring --require-queue")
-    POD_AGENT_INSTALL_ARGS = os.getenv("K8S_GLUE_POD_AGENT_INSTALL_ARGS", "")
-
    CONTAINER_BASH_SCRIPT = [
        "export DEBIAN_FRONTEND='noninteractive'",
        "echo 'Binary::apt::APT::Keep-Downloaded-Packages \"true\";' > /etc/apt/apt.conf.d/docker-clean",
@@ -81,7 +79,7 @@ class K8sIntegration(Worker):
        "[ ! -z $LOCAL_PYTHON ] || apt-get install -y python3-pip",
        "[ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3",
        "{extra_bash_init_cmd}",
-        "$LOCAL_PYTHON -m pip install clearml-agent{agent_install_args}",
+        "[ ! -z $CLEARML_AGENT_NO_UPDATE ] || $LOCAL_PYTHON -m pip install clearml-agent{agent_install_args}",
        "{extra_docker_bash_script}",
        "$LOCAL_PYTHON -m clearml_agent execute {default_execution_agent_args} --id {task_id}"
    ]
@@ -108,6 +106,7 @@ class K8sIntegration(Worker):
            max_pods_limit=None,
            pod_name_prefix=None,
            limit_pod_label=None,
+            force_system_packages=None,
            **kwargs
    ):
        """
@@ -135,12 +134,17 @@ class K8sIntegration(Worker):
        :param int max_pods_limit: Maximum number of pods that K8S glue can run at the same time
        """
        super(K8sIntegration, self).__init__()
+        self.kind = os.environ.get("CLEARML_K8S_GLUE_KIND", "pod").strip().lower()
+        if self.kind not in self.SUPPORTED_KIND:
+            raise UsageError(f"Kind '{self.kind}' not supported (expected {','.join(self.SUPPORTED_KIND)})")
+        self.using_jobs = self.kind == "job"
        self.pod_name_prefix = pod_name_prefix or self.DEFAULT_POD_NAME_PREFIX
        self.limit_pod_label = limit_pod_label or self.DEFAULT_LIMIT_POD_LABEL
        self.k8s_pending_queue_name = k8s_pending_queue_name or self.K8S_PENDING_QUEUE
        self.k8s_pending_queue_id = None
        self.container_bash_script = container_bash_script or self.CONTAINER_BASH_SCRIPT
-        force_system_packages = ENV_FORCE_SYSTEM_SITE_PACKAGES.get()
+        if force_system_packages is None:
+            force_system_packages = ENV_FORCE_SYSTEM_SITE_PACKAGES.get()
        self._force_system_site_packages = force_system_packages if force_system_packages is not None else True
        if self._force_system_site_packages:
            # Use system packages, because by we will be running inside a docker
@@ -179,11 +183,18 @@ class K8sIntegration(Worker):

        self._agent_label = None

-        self._monitor_hanging_pods()
+        self._pending_pods_daemon = self._create_daemon_instance(
+            cls_=PendingPodsDaemon,
+            polling_interval=self._polling_interval
+        )
+        self._pending_pods_daemon.start()

        self._min_cleanup_interval_per_ns_sec = 1.0
        self._last_pod_cleanup_per_ns = defaultdict(lambda: 0.)

+    def _create_daemon_instance(self, cls_, **kwargs):
+        return cls_(agent=self, **kwargs)
+
    def _load_overrides_yaml(self, overrides_yaml):
        if not overrides_yaml:
            return
@@ -209,26 +220,33 @@ class K8sIntegration(Worker):
            self.log.warning('Removing containers section: {}'.format(overrides['spec'].pop('containers')))
        self.overrides_json_string = json.dumps(overrides)

-    def _monitor_hanging_pods(self):
-        _check_pod_thread = Thread(target=self._monitor_hanging_pods_daemon)
-        _check_pod_thread.daemon = True
-        _check_pod_thread.start()
-
    @staticmethod
    def _load_template_file(path):
        with open(os.path.expandvars(os.path.expanduser(str(path))), 'rt') as f:
            return yaml.load(f, Loader=getattr(yaml, 'FullLoader', None))

-    def _get_kubectl_options(self, command, extra_labels=None, filters=None, output="json", labels=None):
-        # type: (str, Iterable[str], Iterable[str], str, Iterable[str]) -> Dict
-        if not labels:
+    @staticmethod
+    def _get_path(d, *path, default=None):
+        try:
+            return functools.reduce(
+                lambda a, b: a[b], path, d
+            )
+        except (IndexError, KeyError):
+            return default
+
+    def _get_kubectl_options(self, command, extra_labels=None, filters=None, output="json", labels=None, ns=None):
+        # type: (str, Iterable[str], Iterable[str], str, Iterable[str], str) -> Dict
+        if labels is False:
+            labels = []
+        elif not labels:
            labels = [self._get_agent_label()]
        labels = list(labels) + (list(extra_labels) if extra_labels else [])
        d = {
-            "-l": ",".join(labels),
-            "-n": str(self.namespace),
+            "-n": ns or str(self.namespace),
            "-o": output,
        }
+        if labels:
+            d["-l"] = ",".join(labels)
        if filters:
            d["--field-selector"] = ",".join(filters)
        return d
@@ -239,132 +257,6 @@ class K8sIntegration(Worker):
            command=command, opts=" ".join(x for item in opts.items() for x in item)
        )

-    def _monitor_hanging_pods_daemon(self):
-        last_tasks_msgs = {}  # last msg updated for every task
-
-        while True:
-            kubectl_cmd = self.get_kubectl_command("get pods", filters=["status.phase=Pending"])
-            self.log.debug("Detecting hanging pods: {}".format(kubectl_cmd))
-            output = stringify_bash_output(get_bash_output(kubectl_cmd))
-            try:
-                output_config = json.loads(output)
-            except Exception as ex:
-                self.log.warning('K8S Glue pods monitor: Failed parsing kubectl output:\n{}\nEx: {}'.format(output, ex))
-                sleep(self._polling_interval)
-                continue
-            pods = output_config.get('items', [])
-            task_id_to_details = dict()
-            for pod in pods:
-                pod_name = pod.get('metadata', {}).get('name', None)
-                if not pod_name:
-                    continue
-
-                task_id = pod_name.rpartition('-')[-1]
-                if not task_id:
-                    continue
-
-                namespace = pod.get('metadata', {}).get('namespace', None)
-                if not namespace:
-                    continue
-
-                task_id_to_details[task_id] = (pod_name, namespace)
-
-                msg = None
-
-                waiting = self._get_path(pod, 'status', 'containerStatuses', 0, 'state', 'waiting')
-                if not waiting:
-                    condition = self._get_path(pod, 'status', 'conditions', 0)
-                    if condition:
-                        reason = condition.get('reason')
-                        if reason == 'Unschedulable':
-                            message = condition.get('message')
-                            msg = reason + (" ({})".format(message) if message else "")
-                else:
-                    reason = waiting.get("reason", None)
-                    message = waiting.get("message", None)
-
-                    msg = reason + (" ({})".format(message) if message else "")
-
-                    if reason == 'ImagePullBackOff':
-                        delete_pod_cmd = 'kubectl delete pods {} -n {}'.format(pod_name, namespace)
-                        self.log.debug(" - deleting pod due to ImagePullBackOff: {}".format(delete_pod_cmd))
-                        get_bash_output(delete_pod_cmd)
-                        try:
-                            self.log.debug(" - Detecting hanging pods: {}".format(kubectl_cmd))
-                            self._session.api_client.tasks.failed(
-                                task=task_id,
-                                status_reason="K8S glue error: {}".format(msg),
-                                status_message="Changed by K8S glue",
-                                force=True
-                            )
-                        except Exception as ex:
-                            self.log.warning(
-                                'K8S Glue pods monitor: Failed deleting task "{}"\nEX: {}'.format(task_id, ex)
-                            )
-
-                        # clean up any msg for this task
-                        last_tasks_msgs.pop(task_id, None)
-                        continue
-                if msg and last_tasks_msgs.get(task_id, None) != msg:
-                    try:
-                        result = self._session.send_request(
-                            service='tasks',
-                            action='update',
-                            json={"task": task_id, "status_message": "K8S glue status: {}".format(msg)},
-                            method=Request.def_method,
-                            async_enable=False,
-                        )
-                        if not result.ok:
-                            result_msg = self._get_path(result.json(), 'meta', 'result_msg')
-                            raise Exception(result_msg or result.text)
-
-                        # update last msg for this task
-                        last_tasks_msgs[task_id] = msg
-                    except Exception as ex:
-                        self.log.warning(
-                            'K8S Glue pods monitor: Failed setting status message for task "{}"\nMSG: {}\nEX: {}'.format(
-                                task_id, msg, ex
-                            )
-                        )
-
-            if task_id_to_details:
-                try:
-                    result = self._session.get(
-                        service='tasks',
-                        action='get_all',
-                        json={"id": list(task_id_to_details), "status": ["stopped"], "only_fields": ["id"]},
-                        method=Request.def_method,
-                        async_enable=False,
-                    )
-                    aborted_task_ids = list(filter(None, (task.get("id") for task in result["tasks"])))
-
-                    for task_id in aborted_task_ids:
-                        pod_name, namespace = task_id_to_details.get(task_id)
-                        if not pod_name:
-                            self.log.error("Failed locating aborted task {} in pending pods list".format(task_id))
-                            continue
-                        self.log.info(
-                            "K8S Glue pods monitor: task {} was aborted by its pod {} is still pending, "
-                            "deleting pod".format(task_id, pod_name)
-                        )
-
-                        kubectl_cmd = "kubectl delete pod {pod_name} --output name {namespace}".format(
-                            namespace=f"--namespace={namespace}" if namespace else "", pod_name=pod_name,
-                        ).strip()
-                        self.log.debug("Deleting aborted task pending pod: {}".format(kubectl_cmd))
-                        output = stringify_bash_output(get_bash_output(kubectl_cmd))
-                        if not output:
-                            self.log.warning("K8S Glue pods monitor: failed deleting pod {}".format(pod_name))
-                except Exception as ex:
-                    self.log.warning(
-                        'K8S Glue pods monitor: failed checking aborted tasks for hanging pods: {}'.format(ex)
-                    )
-
-            # clean up any last message for a task that wasn't seen as a pod
-            last_tasks_msgs = {k: v for k, v in last_tasks_msgs.items() if k in task_id_to_details}
-
-            sleep(self._polling_interval)
-
    def _set_task_user_properties(self, task_id: str, task_session=None, **properties: str):
        session = task_session or self._session
        if self._edit_hyperparams_support is not True:
@@ -408,34 +300,50 @@ class K8sIntegration(Worker):

        return self._agent_label

-    def _get_used_pods(self):
-        # type: () -> Tuple[int, Set[str]]
-        # noinspection PyBroadException
+    RunningPod = namedtuple("RunningPod", "name queue namespace")
+
+    def _get_running_pods(self):
        try:
            kubectl_cmd = self.get_kubectl_command(
                "get pods",
-                output="jsonpath=\"{range .items[*]}{.metadata.name}{' '}{.metadata.namespace}{'\\n'}{end}\""
+                output="jsonpath=\"{{range .items[*]}}{{.metadata.name}}{{' '}}{{.metadata.namespace}}{{' '}}"
+                       "{{.metadata.labels.{}}}{{'\\n'}}{{end}}\"".format(self.QUEUE_LABEL)
            )
            self.log.debug("Getting used pods: {}".format(kubectl_cmd))
            output = stringify_bash_output(get_bash_output(kubectl_cmd, raise_error=True))

            if not output:
                # No such pod exist so we can use the pod_number we found
-                return 0, set([])
+                return []

            try:
-                items = output.splitlines()
-                current_pod_count = len(items)
-                namespaces = {item.rpartition(" ")[-1] for item in items}
-                self.log.debug(" - found {} pods in namespaces {}".format(current_pod_count, ", ".join(namespaces)))
-            except (KeyError, ValueError, TypeError, AttributeError) as ex:
-                print("Failed parsing used pods command response for cleanup: {}".format(ex))
-                return -1, set([])
+                return [
+                    self.RunningPod(
+                        name=parts[0],
+                        namespace=parts[1],
+                        queue=parts[2]
+                    )
+                    for parts in (line.split(" ") for line in output.splitlines())
+                ]
+            except Exception as ex:
+                raise Exception("Failed parsing used pods command response for cleanup: {}".format(ex))
+        except Exception as ex:
+            raise Exception('Failed obtaining used pods information: {}'.format(ex))

+    def _get_used_pods(self):
+        # type: () -> Tuple[int, Set[str]]
+        # noinspection PyBroadException
+        try:
+            items = self._get_running_pods()
+            if not items:
+                return 0, set([])
+            current_pod_count = len(items)
+            namespaces = {item.namespace for item in items}
+            self.log.debug(" - found {} pods in namespaces {}".format(current_pod_count, ", ".join(namespaces)))
            return current_pod_count, namespaces
        except Exception as ex:
-            print('Failed obtaining used pods information: {}'.format(ex))
-            return -2, set([])
+            self.log.debug("Failed getting used pods: {}".format(ex))
+            return -1, set([])

    def _is_same_tenant(self, task_session):
        if not task_session or task_session is self._session:
@@ -448,6 +356,73 @@ class K8sIntegration(Worker):
        except Exception as ex:
            print("ERROR: Failed getting tenant for task session: {}".format(ex))

+    def get_jobs_info(self, info_path: str, condition: str = None, namespace=None, debug_msg: str = None)\
+            -> Dict[str, str]:
+        cond = "==".join((x.strip("=") for x in condition.partition("=")[::2]))
+        output = f"jsonpath='{{range .items[?(@.{cond})]}}{{@.{info_path}}}{{\" \"}}{{@.metadata.namespace}}{{\"\\n\"}}{{end}}'"
+        kubectl_cmd = self.get_kubectl_command("get job", output=output, ns=namespace)
+        if debug_msg:
+            self.log.debug(debug_msg.format(cmd=kubectl_cmd))
+        output = stringify_bash_output(get_bash_output(kubectl_cmd))
+        output = output.strip("'")  # for Windows debugging :(
+        try:
+            data_items = dict(l.strip().partition(" ")[::2] for l in output.splitlines())
+            return data_items
+        except Exception as ex:
+            self.log.warning('Failed parsing kubectl output:\n{}\nEx: {}'.format(output, ex))
+
+    def get_pods_for_jobs(self, job_condition: str = None, pod_filters: List[str] = None, debug_msg: str = None):
+        controller_uids = self.get_jobs_info(
+            "spec.selector.matchLabels.controller-uid", condition=job_condition, debug_msg=debug_msg
+        )
+        if not controller_uids:
+            # No pods were found for these jobs
+            return []
+        pods = self.get_pods(filters=pod_filters, debug_msg=debug_msg)
+        return [
+            pod for pod in pods
+            if get_path(pod, "metadata", "labels", "controller-uid") in controller_uids
+        ]
+
+    def get_pods(self, filters: List[str] = None, debug_msg: str = None):
+        kubectl_cmd = self.get_kubectl_command(
+            "get pods",
+            filters=filters,
+            labels=False if self.using_jobs else None,
+        )
+        if debug_msg:
+            self.log.debug(debug_msg.format(cmd=kubectl_cmd))
+        output = stringify_bash_output(get_bash_output(kubectl_cmd))
+        try:
+            output_config = json.loads(output)
+        except Exception as ex:
+            self.log.warning('Failed parsing kubectl output:\n{}\nEx: {}'.format(output, ex))
+            return
+        return output_config.get('items', [])
+
+    def _get_pod_count(self, extra_labels: List[str] = None, msg: str = None):
+        kubectl_cmd_new = self.get_kubectl_command(
+            f"get {self.kind}s",
+            extra_labels=extra_labels
+        )
+        self.log.debug("{}{}".format((msg + ": ") if msg else "", kubectl_cmd_new))
+        process = subprocess.Popen(kubectl_cmd_new.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
+        output, error = process.communicate()
+        output = stringify_bash_output(output)
+        error = stringify_bash_output(error)
+
+        try:
+            return len(json.loads(output).get("items", []))
+        except (ValueError, TypeError) as ex:
+            self.log.warning(
+                "K8S Glue pods monitor: Failed parsing kubectl output:\n{}\nEx: {}".format(output, ex)
+            )
+            raise GetPodCountError()
+
+    def resource_applied(self, resource_name: str, namespace: str, task_id: str, session):
+        """ Called when a resource (pod/job) was applied """
+        pass
+
    def run_one_task(self, queue: Text, task_id: Text, worker_args=None, task_session=None, **_):
        print('Pulling task {} launching on kubernetes cluster'.format(task_id))
        session = task_session or self._session
@@ -457,7 +432,13 @@ class K8sIntegration(Worker):
        if self._is_same_tenant(task_session):
            try:
                print('Pushing task {} into temporary pending queue'.format(task_id))
-                _ = session.api_client.tasks.stop(task_id, force=True)
+                _ = session.api_client.tasks.stop(task_id, force=True, status_reason="moving to k8s pending queue")
+
+                # Just make sure to clean up in case the task is stuck in the queue (known issue)
+                self._session.api_client.queues.remove_task(
+                    task=task_id,
+                    queue=self.k8s_pending_queue_id,
+                )

                res = self._session.api_client.tasks.enqueue(
                    task_id,
@@ -498,14 +479,14 @@ class K8sIntegration(Worker):

        hocon_config_encoded = config_content.encode("ascii")

-        create_clearml_conf = ["echo '{}' | base64 --decode >> ~/clearml.conf".format(
+        clearml_conf_create_script = ["echo '{}' | base64 --decode >> ~/clearml.conf".format(
            base64.b64encode(
                hocon_config_encoded
            ).decode('ascii')
        )]

        if task_session:
-            create_clearml_conf.append(
+            clearml_conf_create_script.append(
                "export CLEARML_AUTH_TOKEN=$(echo '{}' | base64 --decode)".format(
                    base64.b64encode(task_session.token.encode("ascii")).decode('ascii')
                )
@@ -526,23 +507,15 @@ class K8sIntegration(Worker):
        while self.ports_mode or self.max_pods_limit:
            pod_number = self.base_pod_num + pod_count

-            kubectl_cmd_new = self.get_kubectl_command(
-                "get pods",
-                extra_labels=[self.limit_pod_label.format(pod_number=pod_number)] if self.ports_mode else None
-            )
-            self.log.debug("Looking for a free pod/port: {}".format(kubectl_cmd_new))
-            process = subprocess.Popen(kubectl_cmd_new.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
-            output, error = process.communicate()
-            output = stringify_bash_output(output)
-            error = stringify_bash_output(error)
-
            try:
-                items_count = len(json.loads(output).get("items", []))
-            except (ValueError, TypeError) as ex:
+                items_count = self._get_pod_count(
+                    extra_labels=[self.limit_pod_label.format(pod_number=pod_number)] if self.ports_mode else None,
+                    msg="Looking for a free pod/port"
+                )
+            except GetPodCountError:
                self.log.warning(
-                    "K8S Glue pods monitor: Failed parsing kubectl output:\n{}\ntask '{}' "
-                    "will be enqueued back to queue '{}'\nEx: {}".format(
-                        output, task_id, queue, ex
+                    "K8S Glue pods monitor: task '{}' will be enqueued back to queue '{}'".format(
+                        task_id, queue
                    )
                )
                session.api_client.tasks.stop(task_id, force=True)
@@ -566,8 +539,6 @@ class K8sIntegration(Worker):

            if current_pod_count >= max_count:
                # All pods are taken, exit
-                self.log.debug(
-                    "kubectl last result: {}\n{}".format(error, output))
                self.log.warning(
                    "All k8s services are in use, task '{}' "
                    "will be enqueued back to queue '{}'".format(
@@ -608,25 +579,34 @@ class K8sIntegration(Worker):
        except (KeyError, TypeError, AttributeError):
            namespace = self.namespace

-        if template:
-            output, error = self._kubectl_apply(
-                template=template,
-                pod_number=pod_number,
-                create_clearml_conf=create_clearml_conf,
-                labels=labels,
-                docker_image=container['image'],
-                docker_args=container['arguments'],
-                docker_bash=container.get('setup_shell_script'),
-                task_id=task_id,
-                queue=queue,
-                namespace=namespace,
-            )
+        if not template:
+            print("ERROR: no template for task {}, skipping".format(task_id))
+            return
+
+        output, error, pod_name = self._kubectl_apply(
+            template=template,
+            pod_number=pod_number,
+            clearml_conf_create_script=clearml_conf_create_script,
+            labels=labels,
+            docker_image=container['image'],
+            docker_args=container.get('arguments'),
+            docker_bash=container.get('setup_shell_script'),
+            task_id=task_id,
+            queue=queue,
+            namespace=namespace,
+        )

        print('kubectl output:\n{}\n{}'.format(error, output))
        if error:
            send_log = "Running kubectl encountered an error: {}".format(error)
            self.log.error(send_log)
            self.send_logs(task_id, send_log.splitlines())
+            return
+
+        if pod_name:
+            self.resource_applied(
+                resource_name=pod_name, namespace=namespace, task_id=task_id, session=session
+            )

        user_props = {"k8s-queue": str(queue_name)}
        if self.ports_mode:
@@ -657,8 +637,8 @@ class K8sIntegration(Worker):
    def _get_pod_labels(self, queue, queue_name):
        return [
            self._get_agent_label(),
-            "clearml-agent-queue={}".format(self._safe_k8s_label_value(queue)),
-            "clearml-agent-queue-name={}".format(self._safe_k8s_label_value(queue_name))
+            "{}={}".format(self.QUEUE_LABEL, self._safe_k8s_label_value(queue)),
+            "{}-name={}".format(self.QUEUE_LABEL, self._safe_k8s_label_value(queue_name))
        ]

    def _get_docker_args(self, docker_args, flags, target=None, convert=None):
@@ -687,32 +667,13 @@ class K8sIntegration(Worker):
            return {target: results} if results else {}
        return results

-    def _kubectl_apply(
-        self,
-        create_clearml_conf,
-        docker_image,
-        docker_args,
-        docker_bash,
-        labels,
-        queue,
-        task_id,
-        namespace,
-        template=None,
-        pod_number=None
-    ):
-        template.setdefault('apiVersion', 'v1')
-        template['kind'] = 'Pod'
-        template.setdefault('metadata', {})
-        name = self.pod_name_prefix + str(task_id)
-        template['metadata']['name'] = name
-        template.setdefault('spec', {})
-        template['spec'].setdefault('containers', [])
-        template['spec'].setdefault('restartPolicy', 'Never')
-        if labels:
-            labels_dict = dict(pair.split('=', 1) for pair in labels)
-            template['metadata'].setdefault('labels', {})
-            template['metadata']['labels'].update(labels_dict)
+    def get_task_worker_id(self, template, task_id, pod_name, namespace, queue):
+        return f"{self.worker_id}:{task_id}"

+    def _create_template_container(
+        self, pod_name: str, task_id: str, docker_image: str, docker_args: List[str],
+        docker_bash: str, clearml_conf_create_script: List[str], task_worker_id: str
+    ) -> dict:
        container = self._get_docker_args(
            docker_args,
            target="env",
@@ -720,6 +681,16 @@ class K8sIntegration(Worker):
            convert=lambda env: {'name': env.partition("=")[0], 'value': env.partition("=")[2]},
        )

+        # Set worker ID
+        env_vars = container.get('env', [])
+        found_worker_id = False
+        for entry in env_vars:
+            if entry.get('name') == 'CLEARML_WORKER_ID':
+                entry['value'] = task_worker_id
+                found_worker_id = True
+        if not found_worker_id:
+            container['env'] = env_vars + [{'name': 'CLEARML_WORKER_ID', 'value': task_worker_id}]
+
        container_bash_script = [self.container_bash_script] if isinstance(self.container_bash_script, str) \
            else self.container_bash_script

@@ -732,11 +703,11 @@ class K8sIntegration(Worker):
            [line.format(extra_bash_init_cmd=self.extra_bash_init_script or '',
                         task_id=task_id,
                         extra_docker_bash_script=extra_docker_bash_script,
-                         default_execution_agent_args=self.DEFAULT_EXECUTION_AGENT_ARGS,
-                         agent_install_args=self.POD_AGENT_INSTALL_ARGS)
+                         default_execution_agent_args=ENV_DEFAULT_EXECUTION_AGENT_ARGS.get(),
+                         agent_install_args=ENV_POD_AGENT_INSTALL_ARGS.get())
             for line in container_bash_script])

-        extra_bash_commands = list(create_clearml_conf or [])
+        extra_bash_commands = list(clearml_conf_create_script or [])

        start_agent_script_path = ENV_START_AGENT_SCRIPT_PATH.get() or "~/__start_agent__.sh"

@@ -750,20 +721,82 @@ class K8sIntegration(Worker):
        )

        # Notice: we always leave with exit code 0, so pods are never restarted
-        container = self._merge_containers(
+        return self._merge_containers(
            container,
-            dict(name=name, image=docker_image,
+            dict(name=pod_name, image=docker_image,
                 command=['/bin/bash'],
                 args=['-c', '{} ; exit 0'.format(' ; '.join(extra_bash_commands))])
        )

-        if template['spec']['containers']:
-            template['spec']['containers'][0] = self._merge_containers(template['spec']['containers'][0], container)
+    def _kubectl_apply(
+        self,
+        clearml_conf_create_script: List[str],
+        docker_image,
+        docker_args,
+        docker_bash,
+        labels,
+        queue,
+        task_id,
+        namespace,
+        template,
+        pod_number=None
+    ):
+        if "apiVersion" not in template:
+            template["apiVersion"] = "batch/v1" if self.using_jobs else "v1"
+        if "kind" in template:
+            if template["kind"].lower() != self.kind:
+                return (
+                    "",
+                    f"Template kind {template['kind']} does not match kind {self.kind.capitalize()} set for agent",
+                    None
+                )
        else:
-            template['spec']['containers'].append(container)
+            template["kind"] = self.kind.capitalize()
+
+        metadata = template.setdefault('metadata', {})
+        name = self.pod_name_prefix + str(task_id)
+        metadata['name'] = name
+
+        def place_labels(metadata_dict):
+            labels_dict = dict(pair.split('=', 1) for pair in labels)
+            metadata_dict.setdefault('labels', {}).update(labels_dict)
+
+        if labels:
+            # Place labels on base resource (job or single pod)
+            place_labels(metadata)
+
+        spec = template.setdefault('spec', {})
+        if self.using_jobs:
+            spec.setdefault('backoffLimit', 0)
+            spec_template = spec.setdefault('template', {})
+            if labels:
+                # Place same labels for any pod spawned by the job
+                place_labels(spec_template.setdefault('metadata', {}))
+
+            spec = spec_template.setdefault('spec', {})
+
+        containers = spec.setdefault('containers', [])
+        spec.setdefault('restartPolicy', 'Never')
+
+        task_worker_id = self.get_task_worker_id(template, task_id, name, namespace, queue)
+
+        container = self._create_template_container(
+            pod_name=name,
+            task_id=task_id,
+            docker_image=docker_image,
+            docker_args=docker_args,
+            docker_bash=docker_bash,
+            clearml_conf_create_script=clearml_conf_create_script,
+            task_worker_id=task_worker_id
+        )
+
+        if containers:
+            containers[0] = self._merge_containers(containers[0], container)
+        else:
+            containers.append(container)

        if self._docker_force_pull:
-            for c in template['spec']['containers']:
+            for c in containers:
                c.setdefault('imagePullPolicy', 'Always')

        fp, yaml_file = tempfile.mkstemp(prefix='clearml_k8stmpl_', suffix='.yml')
@@ -789,11 +822,88 @@ class K8sIntegration(Worker):
            process = subprocess.Popen(kubectl_cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            output, error = process.communicate()
        except Exception as ex:
-            return None, str(ex)
+            return None, str(ex), None
        finally:
            safe_remove_file(yaml_file)

-        return stringify_bash_output(output), stringify_bash_output(error)
+        return stringify_bash_output(output), stringify_bash_output(error), name
+
+    def _process_bash_lines_response(self, bash_cmd: str, raise_error=True):
+        res = get_bash_output(bash_cmd, raise_error=raise_error)
+        lines = [
+            line for line in
+            (r.strip().rpartition("/")[-1] for r in res.splitlines())
+            if line.startswith(self.pod_name_prefix)
+        ]
+        return lines
+
+    def _delete_pods(self, selectors: List[str], namespace: str, msg: str = None) -> List[str]:
+        kubectl_cmd = \
+            "kubectl delete pod -l={agent_label} " \
+            "--namespace={namespace} --field-selector={selector} --output name".format(
+                selector=",".join(selectors),
+                agent_label=self._get_agent_label(),
+                namespace=namespace,
+            )
+        self.log.debug("Deleting old/failed pods{} for ns {}: {}".format(
+            msg or "", namespace, kubectl_cmd
+        ))
+        lines = self._process_bash_lines_response(kubectl_cmd)
+        self.log.debug(" - deleted pods %s", ", ".join(lines))
+        return lines
+
+    def _delete_jobs_by_names(self, names_to_ns: Dict[str, str], msg: str = None) -> List[str]:
+        if not names_to_ns:
+            return []
+        ns_to_names = defaultdict(list)
+        for name, ns in names_to_ns.items():
+            ns_to_names[ns].append(name)
+
+        results = []
+        for ns, names in ns_to_names.items():
+            kubectl_cmd = "kubectl delete job --namespace={ns} --output=name {names}".format(
+                ns=ns, names=" ".join(names)
+            )
+            self.log.debug("Deleting jobs {}: {}".format(
+                msg or "", kubectl_cmd
+            ))
+            lines = self._process_bash_lines_response(kubectl_cmd)
+            if not lines:
+                continue
+            self.log.debug(" - deleted jobs %s", ", ".join(lines))
+            results.extend(lines)
+        return results
+
+    def _delete_completed_or_failed_pods(self, namespace, msg: str = None):
+        if not self.using_jobs:
+            return self._delete_pods(
+                selectors=["status.phase!=Pending", "status.phase!=Running"], namespace=namespace, msg=msg
+            )
+
+        job_names_to_delete = {}
+
+        # locate failed pods for jobs
+        failed_pods = self.get_pods_for_jobs(
+            job_condition="status.active=1",
+            pod_filters=["status.phase!=Pending", "status.phase!=Running", "status.phase!=Terminating"],
+            debug_msg="Deleting failed pods: {cmd}"
+        )
+        if failed_pods:
+            job_names_to_delete = {
+                get_path(pod, "metadata", "labels", "job-name"): get_path(pod, "metadata", "namespace")
+                for pod in failed_pods
+                if get_path(pod, "metadata", "labels", "job-name")
+            }
+            self.log.debug(f" - found jobs with failed pods: {' '.join(job_names_to_delete)}")
+
+        completed_job_names = self.get_jobs_info(
+            "metadata.name", condition="status.succeeded=1", namespace=namespace, debug_msg=msg
+        )
+        if completed_job_names:
+            self.log.debug(f" - found completed jobs: {' '.join(completed_job_names)}")
+            job_names_to_delete.update(completed_job_names)
+
+        return self._delete_jobs_by_names(names_to_ns=job_names_to_delete, msg=msg)

    def _cleanup_old_pods(self, namespaces, extra_msg=None):
        # type: (Iterable[str], Optional[str]) -> Dict[str, List[str]]
@@ -803,23 +913,12 @@ class K8sIntegration(Worker):
            if time() - self._last_pod_cleanup_per_ns[namespace] < self._min_cleanup_interval_per_ns_sec:
                # Do not try to cleanup the same namespace too quickly
                continue
-            kubectl_cmd = self.KUBECTL_CLEANUP_DELETE_CMD.format(
-                namespace=namespace, agent_label=self._get_agent_label()
-            )
-            self.log.debug("Deleting old/failed pods{} for ns {}: {}".format(
-                extra_msg or "", namespace, kubectl_cmd
-            ))
+
            try:
-                res = get_bash_output(kubectl_cmd, raise_error=True)
-                lines = [
-                    line for line in
-                    (r.strip().rpartition("/")[-1] for r in res.splitlines())
-                    if line.startswith(self.pod_name_prefix)
-                ]
-                self.log.debug(" - deleted pod(s) %s", ", ".join(lines))
-                deleted_pods[namespace].extend(lines)
+                res = self._delete_completed_or_failed_pods(namespace, extra_msg)
+                deleted_pods[namespace].extend(res)
            except Exception as ex:
-                self.log.error("Failed deleting old/failed pods for ns %s: %s", namespace, str(ex))
+                self.log.error("Failed deleting completed/failed pods for ns %s: %s", namespace, str(ex))
            finally:
                self._last_pod_cleanup_per_ns[namespace] = time()

@@ -840,7 +939,7 @@ class K8sIntegration(Worker):
                )
                tasks_to_abort = result["tasks"]
        except Exception as ex:
-            self.log.warning('Failed getting running tasks for deleted pods: {}'.format(ex))
+            self.log.warning('Failed getting running tasks for deleted {}(s): {}'.format(self.kind, ex))

        for task in tasks_to_abort:
            task_id = task.get("id")
@@ -853,15 +952,27 @@ class K8sIntegration(Worker):
                    self._session.get(
                        service='tasks',
                        action='dequeue',
-                        json={"task": task_id, "force": True, "status_reason": "Pod deleted (not pending or running)",
-                              "status_message": "Pod deleted by agent {}".format(self.worker_id or "unknown")},
+                        json={
+                            "task": task_id,
+                            "force": True,
+                            "status_reason": "Pod deleted (not pending or running)",
+                            "status_message": "{} deleted by agent {}".format(
+                                self.kind.capitalize(), self.worker_id or "unknown"
+                            )
+                        },
                        method=Request.def_method,
                    )
                self._session.get(
                    service='tasks',
                    action='failed',
-                    json={"task": task_id, "force": True, "status_reason": "Pod deleted (not pending or running)",
-                          "status_message": "Pod deleted by agent {}".format(self.worker_id or "unknown")},
+                    json={
+                        "task": task_id,
+                        "force": True,
+                        "status_reason": "Pod deleted (not pending or running)",
+                        "status_message": "{} deleted by agent {}".format(
+                            self.kind.capitalize(), self.worker_id or "unknown"
+                        )
+                    },
                    method=Request.def_method,
                )
            except Exception as ex:
@@ -902,10 +1013,10 @@ class K8sIntegration(Worker):
            # check if have pod limit, then check if we hit it.
            if self.max_pods_limit:
                if current_pods >= self.max_pods_limit:
-                    print("Maximum pod limit reached {}/{}, sleeping for {:.1f} seconds".format(
-                        current_pods, self.max_pods_limit, self._polling_interval))
+                    print("Maximum {} limit reached {}/{}, sleeping for {:.1f} seconds".format(
+                        self.kind, current_pods, self.max_pods_limit, self._polling_interval))
                    # delete old completed / failed pods
-                    self._cleanup_old_pods(namespaces, " due to pod limit")
+                    self._cleanup_old_pods(namespaces, f" due to {self.kind} limit")
                    # go to sleep
                    sleep(self._polling_interval)
                    continue
@@ -913,7 +1024,7 @@ class K8sIntegration(Worker):
            # iterate over queues (priority style, queues[0] is highest)
            for queue in queues:
                # delete old completed / failed pods
-                self._cleanup_old_pods(namespaces)
+                self._cleanup_old_pods(namespaces, extra_msg="Cleanup cycle {cmd}")

                # get next task in queue
                try:
--- a/clearml_agent/glue/pending_pods_daemon.py
+++ b/clearml_agent/glue/pending_pods_daemon.py
@@ -0,0 +1,236 @@
+from time import sleep
+from typing import Dict, Tuple, Optional, List
+
+from clearml_agent.backend_api.session import Request
+from clearml_agent.glue.utilities import get_bash_output
+
+from clearml_agent.helper.process import stringify_bash_output
+
+from .daemon import K8sDaemon
+from .utilities import get_path
+from .errors import GetPodsError
+
+
+class PendingPodsDaemon(K8sDaemon):
+    def __init__(self, polling_interval: float, agent):
+        super(PendingPodsDaemon, self).__init__(agent=agent)
+        self._polling_interval = polling_interval
+        self._last_tasks_msgs = {}  # last msg updated for every task
+
+    def get_pods(self, pod_name=None):
+        filters = ["status.phase=Pending"]
+        if pod_name:
+            filters.append(f"metadata.name={pod_name}")
+
+        if self._agent.using_jobs:
+            return self._agent.get_pods_for_jobs(
+                job_condition="status.active=1", pod_filters=filters, debug_msg="Detecting pending pods: {cmd}"
+            )
+        return self._agent.get_pods(filters=filters, debug_msg="Detecting pending pods: {cmd}")
+
+    def _get_pod_name(self, pod: dict):
+        return get_path(pod, "metadata", "name")
+
+    def _get_k8s_resource_name(self, pod: dict):
+        if self._agent.using_jobs:
+            return get_path(pod, "metadata", "labels", "job-name")
+        return get_path(pod, "metadata", "name")
+
+    def _get_task_id(self, pod: dict):
+        return self._get_k8s_resource_name(pod).rpartition('-')[-1]
+
+    @staticmethod
+    def _get_k8s_resource_namespace(pod: dict):
+        return pod.get('metadata', {}).get('namespace', None)
+
+    def target(self):
+        """
+            Handle pending objects (pods or jobs, depending on the agent mode).
+            - Delete any pending objects that are not expected to recover
+            - Delete any pending objects whose associated task was aborted
+        """
+        while True:
+            # noinspection PyBroadException
+            try:
+                # Get pods (standalone pods if we're in pods mode, or pods associated with jobs if we're in jobs mode)
+                pods = self.get_pods()
+                if pods is None:
+                    raise GetPodsError()
+
+                task_id_to_pod = dict()
+
+                for pod in pods:
+                    pod_name = self._get_pod_name(pod)
+                    if not pod_name:
+                        continue
+
+                    task_id = self._get_task_id(pod)
+                    if not task_id:
+                        continue
+
+                    namespace = self._get_k8s_resource_namespace(pod)
+                    if not namespace:
+                        continue
+
+                    task_id_to_pod[task_id] = pod
+
+                    msg = None
+                    tags = []
+
+                    waiting = get_path(pod, 'status', 'containerStatuses', 0, 'state', 'waiting')
+                    if not waiting:
+                        condition = get_path(pod, 'status', 'conditions', 0)
+                        if condition:
+                            reason = condition.get('reason')
+                            if reason == 'Unschedulable':
+                                message = condition.get('message')
+                                msg = reason + (" ({})".format(message) if message else "")
+                    else:
+                        reason = waiting.get("reason", None)
+                        message = waiting.get("message", None)
+
+                        msg = reason + (" ({})".format(message) if message else "")
+
+                        if reason == 'ImagePullBackOff':
+                            self.delete_k8s_resource(k8s_resource=pod, msg=reason)
+                            try:
+                                self._session.api_client.tasks.failed(
+                                    task=task_id,
+                                    status_reason="K8S glue error: {}".format(msg),
+                                    status_message="Changed by K8S glue",
+                                    force=True
+                                )
+                                self._agent.send_logs(
+                                    task_id, ["K8S Error: {}".format(msg)],
+                                    session=self._session
+                                )
+                            except Exception as ex:
+                                self.log.warning(
+                                    'K8S Glue pending monitor: Failed marking task "{}" as failed\nEX: {}'.format(task_id, ex)
+                                )
+
+                            # clean up any msg for this task
+                            self._last_tasks_msgs.pop(task_id, None)
+                            continue
+
+                    self._update_pending_task_msg(task_id, msg, tags)
+
+                if task_id_to_pod:
+                    self._process_tasks_for_pending_pods(task_id_to_pod)
+
+                # clean up any last message for a task that wasn't seen as a pod
+                self._last_tasks_msgs = {k: v for k, v in self._last_tasks_msgs.items() if k in task_id_to_pod}
+            except GetPodsError:
+                pass
+            except Exception:
+                self.log.exception("Hanging pods daemon loop")
+
+            sleep(self._polling_interval)
+
+    def delete_k8s_resource(self, k8s_resource: dict, msg: str = None):
+        delete_cmd = "kubectl delete {kind} {name} -n {namespace} --output name".format(
+            kind=self._agent.kind,
+            name=self._get_k8s_resource_name(k8s_resource),
+            namespace=self._get_k8s_resource_namespace(k8s_resource)
+        ).strip()
+        self.log.debug(" - deleting {} {}: {}".format(self._agent.kind, (" " + msg) if msg else "", delete_cmd))
+        return get_bash_output(delete_cmd).strip()
+
+    def _process_tasks_for_pending_pods(self, task_id_to_details: Dict[str, dict]):
+        self._handle_aborted_tasks(task_id_to_details)
+
+    def _handle_aborted_tasks(self, pending_tasks_details: Dict[str, dict]):
+        try:
+            result = self._session.get(
+                service='tasks',
+                action='get_all',
+                json={
+                    "id": list(pending_tasks_details),
+                    "status": ["stopped"],
+                    "only_fields": ["id"]
+                }
+            )
+            aborted_task_ids = list(filter(None, (task.get("id") for task in result["tasks"])))
+
+            for task_id in aborted_task_ids:
+                pod = pending_tasks_details.get(task_id)
+                if not pod:
+                    self.log.error("Failed locating aborted task {} in pending pods list".format(task_id))
+                    continue
+
+                pod_name = self._get_pod_name(pod)
+                if not self.get_pods(pod_name=pod_name):
+                    self.log.debug("K8S Glue pending monitor: pod {} is no longer pending, skipping".format(pod_name))
+                    continue
+
+                resource_name = self._get_k8s_resource_name(pod)
+                self.log.info(
+                    "K8S Glue pending monitor: task {} was aborted but the k8s resource {} is still pending, "
+                    "deleting pod".format(task_id, resource_name)
+                )
+
+                result = self._session.get(
+                    service='tasks',
+                    action='get_all',
+                    json={"id": [task_id], "status": ["stopped"], "only_fields": ["id"]},
+                )
+                if not result["tasks"]:
+                    self.log.debug("K8S Glue pending monitor: task {} is no longer aborted, skipping".format(task_id))
+                    continue
+
+                output = self.delete_k8s_resource(k8s_resource=pod, msg="Pending resource of an aborted task")
+                if not output:
+                    self.log.warning("K8S Glue pending monitor: failed deleting resource {}".format(resource_name))
+        except Exception as ex:
+            self.log.warning(
+                'K8S Glue pending monitor: failed checking aborted tasks for pending resources: {}'.format(ex)
+            )
+
+    def _update_pending_task_msg(self, task_id: str, msg: str, tags: List[str] = None):
+        if not msg or self._last_tasks_msgs.get(task_id, None) == (msg, tags):
+            return
+        try:
+            # Make sure the task is queued
+            result = self._session.send_request(
+                service='tasks',
+                action='get_all',
+                json={"id": task_id, "only_fields": ["status"]},
+                method=Request.def_method,
+                async_enable=False,
+            )
+            if result.ok:
+                status = get_path(result.json(), 'data', 'tasks', 0, 'status')
+                # if task is in progress, change its status to enqueued
+                if status == "in_progress":
+                    result = self._session.send_request(
+                        service='tasks', action='enqueue',
+                        json={
+                            "task": task_id, "force": True, "queue": self._agent.k8s_pending_queue_id
+                        },
+                        method=Request.def_method,
+                        async_enable=False,
+                    )
+                    if not result.ok:
+                        result_msg = get_path(result.json(), 'meta', 'result_msg')
+                        self.log.debug(
+                            "K8S Glue pods monitor: failed forcing task status change"
+                            " for pending task {}: {}".format(task_id, result_msg)
+                        )
+
+            # Update task status message
+            payload = {"task": task_id, "status_message": "K8S glue status: {}".format(msg)}
+            if tags:
+                payload["tags"] = tags
+            result = self._session.send_request('tasks', 'update', json=payload, method=Request.def_method)
+            if not result.ok:
+                result_msg = get_path(result.json(), 'meta', 'result_msg')
+                raise Exception(result_msg or result.text)
+
+            # update last msg for this task
+            self._last_tasks_msgs[task_id] = (msg, tags)
+        except Exception as ex:
+            self.log.warning(
+                'K8S Glue pods monitor: Failed setting status message for task "{}"\nMSG: {}\nEX: {}'.format(
+                    task_id, msg, ex
+                )
+            )
--- a/clearml_agent/glue/utilities.py
+++ b/clearml_agent/glue/utilities.py
@@ -0,0 +1,18 @@
+import functools
+
+from subprocess import DEVNULL
+
+from clearml_agent.helper.process import get_bash_output as _get_bash_output
+
+
+def get_path(d, *path, default=None):
+    try:
+        return functools.reduce(
+            lambda a, b: a[b], path, d
+        )
+    except (IndexError, KeyError):
+        return default
+
+
+def get_bash_output(cmd, stderr=DEVNULL, raise_error=False):
+    return _get_bash_output(cmd, stderr=stderr, raise_error=raise_error)
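Reviewer note: a minimal sketch of how `get_path` behaves on the nested response shape that `_update_pending_task_msg` reads. The helper body is copied from the hunk above; the sample payloads are made up for illustration and are not real server responses.

```python
import functools


def get_path(d, *path, default=None):
    # Walk the nested structure one key/index at a time
    try:
        return functools.reduce(lambda a, b: a[b], path, d)
    except (IndexError, KeyError):
        return default


# Hypothetical payloads, shaped like the 'tasks.get_all' lookups in this diff
response = {"data": {"tasks": [{"status": "in_progress"}]}}
empty_response = {"data": {"tasks": []}}

status = get_path(response, "data", "tasks", 0, "status")
missing = get_path(empty_response, "data", "tasks", 0, "status", default="unknown")
```

Note that only `IndexError` and `KeyError` are swallowed: an intermediate value of `None` (for example `{"data": None}`) raises `TypeError`, which propagates to the caller.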
--- a/clearml_agent/helper/base.py
+++ b/clearml_agent/helper/base.py
@@ -20,20 +20,22 @@ from typing import Text, Dict, Any, Optional, AnyStr, IO, Union

 import attr
 import furl
+import six
 import yaml
 from attr import fields_dict
 from pathlib2 import Path
-
-import six
 from six.moves import reduce
-from clearml_agent.external import pyhocon
+
 from clearml_agent.errors import CommandFailedError
+from clearml_agent.external import pyhocon
 from clearml_agent.helper.dicts import filter_keys

 pretty_lines = False

 log = logging.getLogger(__name__)

+use_powershell = os.getenv("CLEARML_AGENT_USE_POWERSHELL", None)
+

 def which(cmd, path=None):
    result = find_executable(cmd, path)
@@ -52,7 +54,7 @@ def select_for_platform(linux, windows):


 def bash_c():
-    return 'bash -c' if not is_windows_platform() else 'cmd /c'
+    return 'bash -c' if not is_windows_platform() else ('powershell -Command' if use_powershell else 'cmd /c')


 def return_list(arg):
--- a/clearml_agent/helper/docker_args.py
+++ b/clearml_agent/helper/docker_args.py
@@ -17,6 +17,30 @@ if TYPE_CHECKING:
    from clearml_agent.session import Session


+def sanitize_urls(s: str) -> Tuple[str, bool]:
+    """
+    Replaces passwords in URLs with asterisks.
+    Returns the sanitized string and a boolean indicating whether sanitation was performed.
+    """
+    regex = re.compile("^([^:]*:)[^@]+(.*)$")
+    tokens = re.split(r"\s", s)
+    changed = False
+    for k in range(len(tokens)):
+        if "@" in tokens[k]:
+            res = urlparse(tokens[k])
+            if regex.match(res.netloc):
+                changed = True
+                tokens[k] = urlunparse((
+                    res.scheme,
+                    regex.sub("\\1********\\2", res.netloc),
+                    res.path,
+                    res.params,
+                    res.query,
+                    res.fragment
+                ))
+    return " ".join(tokens) if changed else s, changed
+
+
 class DockerArgsSanitizer:
    @classmethod
    def sanitize_docker_command(cls, session, docker_command):
@@ -62,11 +86,11 @@ class DockerArgsSanitizer:
                    elif key in keys:
                        val = "********"
                    elif parse_embedded_urls:
-                        val = cls._sanitize_urls(val)[0]
+                        val = sanitize_urls(val)[0]
                    result[i + 1] = "{}={}".format(key, val)
                    skip_next = True
                elif parse_embedded_urls and not item.startswith("-"):
-                    item, changed = cls._sanitize_urls(item)
+                    item, changed = sanitize_urls(item)
                    if changed:
                        result[i] = item
            except (KeyError, TypeError):
@@ -75,22 +99,71 @@ class DockerArgsSanitizer:
        return result

    @staticmethod
-    def _sanitize_urls(s: str) -> Tuple[str, bool]:
-        """ Replaces passwords in URLs with asterisks """
-        regex = re.compile("^([^:]*:)[^@]+(.*)$")
-        tokens = re.split(r"\s", s)
-        changed = False
-        for k in range(len(tokens)):
-            if "@" in tokens[k]:
-                res = urlparse(tokens[k])
-                if regex.match(res.netloc):
-                    changed = True
-                    tokens[k] = urlunparse((
-                        res.scheme,
-                        regex.sub("\\1********\\2", res.netloc),
-                        res.path,
-                        res.params,
-                        res.query,
-                        res.fragment
-                    ))
-        return " ".join(tokens) if changed else s, changed
+    def get_list_of_switches(docker_args: List[str]) -> List[str]:
+        args = []
+        for token in docker_args:
+            if token.strip().startswith("-"):
+                args += [token.strip().split("=")[0].lstrip("-")]
+
+        return args
+
+    @staticmethod
+    def filter_switches(docker_args: List[str], exclude_switches: List[str]) -> List[str]:
+        # shortcut if we are sure we have no matches
+        if (not exclude_switches or
+                not any("-{}".format(s) in " ".join(docker_args) for s in exclude_switches)):
+            return docker_args
+
+        args = []
+        in_switch_args = True
+        for token in docker_args:
+            if token.strip().startswith("-"):
+                if "=" in token:
+                    switch = token.strip().split("=")[0]
+                    in_switch_args = False
+                else:
+                    switch = token
+                    in_switch_args = True
+
+                if switch.lstrip("-") in exclude_switches:
+                    # if in excluded, skip the switch and following arguments
+                    in_switch_args = False
+                else:
+                    args += [token]
+
+            elif in_switch_args:
+                args += [token]
+            else:
+                # these are the switch arguments we need to skip
+                pass
+
+        return args
+
+    @staticmethod
+    def merge_docker_args(config, task_docker_arguments: List[str], extra_docker_arguments: List[str]) -> List[str]:
+        base_cmd = []
+        # by default only resolving --privileged, --security-opt, --network, --ipc
+        override_switches = config.get(
+            "agent.protected_docker_extra_args",
+            ["privileged", "security-opt", "network", "ipc"]
+        )
+
+        if config.get("agent.docker_args_extra_precedes_task", True):
+            switches = []
+            if extra_docker_arguments:
+                switches = DockerArgsSanitizer.get_list_of_switches(extra_docker_arguments)
+                switches = list(set(switches) & set(override_switches))
+                base_cmd += [str(a) for a in extra_docker_arguments if a]
+            if task_docker_arguments:
+                docker_arguments = DockerArgsSanitizer.filter_switches(task_docker_arguments, switches)
+                base_cmd += [a for a in docker_arguments if a]
+        else:
+            switches = []
+            if task_docker_arguments:
+                switches = DockerArgsSanitizer.get_list_of_switches(task_docker_arguments)
+                switches = list(set(switches) & set(override_switches))
+                base_cmd += [a for a in task_docker_arguments if a]
+            if extra_docker_arguments:
+                extra_docker_arguments = DockerArgsSanitizer.filter_switches(extra_docker_arguments, switches)
+                base_cmd += [a for a in extra_docker_arguments if a]
+        return base_cmd
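Reviewer note: the precedence rules in `merge_docker_args` can be sketched as follows. This is a condensed restatement of the three static methods above, written as plain functions so it runs standalone; it is not the code from the diff. A plain `dict` stands in for the agent configuration object (an assumption, since only `.get()` is used), and the argument lists are invented examples.

```python
from typing import List


def get_list_of_switches(docker_args: List[str]) -> List[str]:
    # "--ipc=host" -> "ipc", "-v" -> "v"; non-switch tokens are ignored
    return [t.strip().split("=")[0].lstrip("-") for t in docker_args if t.strip().startswith("-")]


def filter_switches(docker_args: List[str], exclude_switches: List[str]) -> List[str]:
    # shortcut if we are sure there is nothing to remove
    if not exclude_switches or not any("-{}".format(s) in " ".join(docker_args) for s in exclude_switches):
        return docker_args
    args = []
    in_switch_args = True
    for token in docker_args:
        if token.strip().startswith("-"):
            if "=" in token:
                switch = token.strip().split("=")[0]
                in_switch_args = False
            else:
                switch = token
                in_switch_args = True
            if switch.lstrip("-") in exclude_switches:
                in_switch_args = False  # drop the switch and the values that follow it
            else:
                args.append(token)
        elif in_switch_args:
            args.append(token)
    return args


def merge_docker_args(config, task_args: List[str], extra_args: List[str]) -> List[str]:
    protected = config.get(
        "agent.protected_docker_extra_args", ["privileged", "security-opt", "network", "ipc"])
    # whichever list comes first "owns" the protected switches it sets
    if config.get("agent.docker_args_extra_precedes_task", True):
        first, second = extra_args, task_args
    else:
        first, second = task_args, extra_args
    owned = list(set(get_list_of_switches(first or [])) & set(protected))
    return [str(a) for a in (first or []) if a] + [a for a in filter_switches(second or [], owned) if a]


task_args = ["--network", "bridge", "-v", "/data:/data", "--privileged"]
extra_args = ["--network", "host"]

extra_first = merge_docker_args({}, task_args, extra_args)
task_first = merge_docker_args({"agent.docker_args_extra_precedes_task": False}, task_args, extra_args)
```

With the default setting the agent-level `--network host` wins and the task's `--network bridge` is removed, while unprotected switches such as `-v` pass through. With `agent.docker_args_extra_precedes_task` set to false, the task's arguments are kept and the conflicting extra argument is dropped instead.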
--- a/clearml_agent/helper/environment/__init__.py
+++ b/clearml_agent/helper/environment/__init__.py
@@ -0,0 +1,8 @@
+from .entry import Entry, NotSet
+from .environment import EnvEntry
+
+__all__ = [
+    'Entry',
+    'NotSet',
+    'EnvEntry',
+]
--- a/clearml_agent/helper/environment/converters.py
+++ b/clearml_agent/helper/environment/converters.py
@@ -0,0 +1,70 @@
+import base64
+from distutils.util import strtobool
+from typing import Union, Optional, Any, TypeVar, Callable, Tuple
+
+import six
+
+try:
+    from typing import Text
+except ImportError:
+    # windows conda-less hack
+    Text = Any
+
+
+ConverterType = TypeVar("ConverterType", bound=Callable[[Any], Any])
+
+
+def base64_to_text(value):
+    # type: (Any) -> Text
+    return base64.b64decode(value).decode("utf-8")
+
+
+def text_to_int(value, default=0):
+    # type: (Any, int) -> int
+    try:
+        return int(value)
+    except (ValueError, TypeError):
+        return default
+
+
+def text_to_bool(value):
+    # type: (Text) -> bool
+    return bool(strtobool(value))
+
+
+def safe_text_to_bool(value):
+    # type: (Text) -> bool
+    try:
+        return text_to_bool(value)
+    except ValueError:
+        return bool(value)
+
+
+def any_to_bool(value):
+    # type: (Optional[Union[int, float, Text]]) -> bool
+    if isinstance(value, six.text_type):
+        return text_to_bool(value)
+    return bool(value)
+
+
+# noinspection PyIncorrectDocstring
+def or_(*converters, **kwargs):
+    # type: (ConverterType, Tuple[Exception, ...]) -> ConverterType
+    """
+    Wrapper that implements an "optional converter" pattern. Allows specifying one or more converters,
+    tried in order, for which a set of exceptions is ignored (if all fail, the original value is returned)
+    :param converters: One or more converter callables
+    :param exceptions: A tuple of exception types to ignore (defaults to ValueError and TypeError)
+    """
+    # noinspection PyUnresolvedReferences
+    exceptions = kwargs.get("exceptions", (ValueError, TypeError))
+
+    def wrapper(value):
+        for converter in converters:
+            try:
+                return converter(value)
+            except exceptions:
+                pass
+        return value
+
+    return wrapper
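Reviewer note: a small sketch of the `or_` fallback chain. The wrapper body is copied from the hunk above; the `to_number` name and the sample inputs are illustrative only.

```python
def or_(*converters, **kwargs):
    # Exceptions listed here are swallowed so the next converter can be tried
    exceptions = kwargs.get("exceptions", (ValueError, TypeError))

    def wrapper(value):
        for converter in converters:
            try:
                return converter(value)
            except exceptions:
                pass
        # every converter failed: hand back the original value untouched
        return value

    return wrapper


# Hypothetical converter: try int first, then float, else keep the raw text
to_number = or_(int, float)

as_int = to_number("42")
as_float = to_number("1.5")
unchanged = to_number("abc")
```

Because the original value is returned when every converter fails, callers cannot tell a failed conversion from a successful one by the return value alone and need to check the resulting type if that matters.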
--- a/clearml_agent/helper/environment/entry.py
+++ b/clearml_agent/helper/environment/entry.py
@@ -0,0 +1,134 @@
+import abc
+from typing import Optional, Any, Tuple, Callable, Dict
+
+import six
+
+from .converters import any_to_bool
+
+try:
+    from typing import Text
+except ImportError:
+    # windows conda-less hack
+    Text = Any
+
+
+NotSet = object()
+
+Converter = Callable[[Any], Any]
+
+
+@six.add_metaclass(abc.ABCMeta)
+class Entry(object):
+    """
+    Configuration entry definition
+    """
+
+    def default_conversions(self):
+        # type: () -> Dict[Any, Converter]
+
+        if self.lstrip and self.rstrip:
+
+            def str_convert(s):
+                return six.text_type(s).strip()
+
+        elif self.lstrip:
+
+            def str_convert(s):
+                return six.text_type(s).lstrip()
+
+        elif self.rstrip:
+
+            def str_convert(s):
+                return six.text_type(s).rstrip()
+
+        else:
+
+            def str_convert(s):
+                return six.text_type(s)
+
+        return {
+            bool: lambda x: any_to_bool(x.strip()),
+            six.text_type: str_convert,
+        }
+
+    def __init__(self, key, *more_keys, **kwargs):
+        # type: (Text, Text, Any) -> None
+        """
+        :rtype: object
+        :param key: Entry's key (at least one).
+        :param more_keys: More alternate keys for this entry.
+        :param type: Value type. If provided, will be used for choosing a default conversion or
+        (if none exists) for casting the environment value.
+        :param converter: Value converter. If provided, will be used to convert the environment value.
+        :param default: Default value. If provided, will be used as the default value on calls to get() and get_pair()
+        in case no value is found for any key and no specific default value was provided in the call.
+        Default value is None.
+        :param help: Help text describing this entry
+        """
+        self.keys = (key,) + more_keys
+        self.type = kwargs.pop("type", six.text_type)
+        self.converter = kwargs.pop("converter", None)
+        self.default = kwargs.pop("default", None)
+        self.help = kwargs.pop("help", None)
+        self.lstrip = kwargs.pop("lstrip", True)
+        self.rstrip = kwargs.pop("rstrip", True)
+
+    def __str__(self):
+        return str(self.key)
+
+    @property
+    def key(self):
+        return self.keys[0]
+
+    def convert(self, value, converter=None):
+        # type: (Any, Converter) -> Optional[Any]
+        converter = converter or self.converter
+        if not converter:
+            converter = self.default_conversions().get(self.type, self.type)
+        return converter(value)
+
+    def get_pair(self, default=NotSet, converter=None, value_cb=None):
+        # type: (Any, Converter, Callable[[str, Any], None]) -> Optional[Tuple[Text, Any]]
+        for key in self.keys:
+            value = self._get(key)
+            if value is NotSet:
+                continue
+            try:
+                value = self.convert(value, converter)
+            except Exception as ex:
+                self.error("invalid value {key}={value}: {ex}".format(**locals()))
+                break
+            # noinspection PyBroadException
+            try:
+                if value_cb:
+                    value_cb(key, value)
+            except Exception:
+                pass
+            return key, value
+
+        result = self.default if default is NotSet else default
+        return self.key, result
+
+    def get(self, default=NotSet, converter=None, value_cb=None):
+        # type: (Any, Converter, Callable[[str, Any], None]) -> Optional[Any]
+        return self.get_pair(default=default, converter=converter, value_cb=value_cb)[1]
+
+    def set(self, value):
+        # type: (Any) -> None
+        # key, _ = self.get_pair(default=None, converter=None)
+        for k in self.keys:
+            self._set(k, str(value))
+
+    def _set(self, key, value):
+        # type: (Text, Text) -> None
+        pass
+
+    @abc.abstractmethod
+    def _get(self, key):
+        # type: (Text) -> Any
+        pass
+
+    @abc.abstractmethod
+    def error(self, message):
+        # type: (Text) -> None
+        pass
--- a/clearml_agent/helper/environment/environment.py
+++ b/clearml_agent/helper/environment/environment.py
@@ -0,0 +1,28 @@
+from os import getenv, environ
+
+from .converters import text_to_bool
+from .entry import Entry, NotSet
+
+
+class EnvEntry(Entry):
+    def default_conversions(self):
+        conversions = super(EnvEntry, self).default_conversions().copy()
+        conversions[bool] = lambda x: text_to_bool(x.strip())
+        return conversions
+
+    def pop(self):
+        for k in self.keys:
+            environ.pop(k, None)
+
+    def _get(self, key):
+        value = getenv(key, "")
+        return value or NotSet
+
+    def _set(self, key, value):
+        environ[key] = value
+
+    def __str__(self):
+        return "env:{}".format(super(EnvEntry, self).__str__())
+
+    def error(self, message):
+        print("Environment configuration: {}".format(message))
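Reviewer note: the lookup order implemented by `Entry.get_pair` together with `EnvEntry._get` can be sketched as below. `MiniEnvEntry` is a hypothetical, trimmed-down stand-in written for this note (no converters, no callbacks, no error reporting), and the `DEMO_*` variable names are invented.

```python
import os

NotSet = object()


class MiniEnvEntry:
    """Hypothetical stand-in for EnvEntry, for illustration only."""

    def __init__(self, key, *more_keys, type=str, default=None):
        self.keys = (key,) + more_keys
        self.type = type
        self.default = default

    def _get(self, key):
        # an empty value is treated the same as a missing variable
        return os.getenv(key, "") or NotSet

    def get(self, default=NotSet):
        for key in self.keys:
            value = self._get(key)
            if value is NotSet:
                continue  # fall through to the next alternate key
            try:
                return self.type(value.strip())
            except Exception:
                break  # invalid value: stop looking and use the default
        return self.default if default is NotSet else default


os.environ["DEMO_NEW_NAME"] = ""      # set but empty, so it is skipped
os.environ["DEMO_OLD_NAME"] = " 8 "   # the alternate key supplies the value
os.environ["DEMO_BROKEN"] = "eight"   # cannot be converted to int

workers = MiniEnvEntry("DEMO_NEW_NAME", "DEMO_OLD_NAME", type=int, default=1).get()
fallback = MiniEnvEntry("DEMO_BROKEN", type=int, default=1).get()
per_call = MiniEnvEntry("DEMO_UNSET_NAME", default="a").get(default="b")
```

Two details carried over from the diff are worth noting: an empty environment variable counts as not set, and a default passed to `get()` takes priority over the entry's own default.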
--- a/clearml_agent/helper/gpu/gpustat.py
+++ b/clearml_agent/helper/gpu/gpustat.py
@@ -15,10 +15,8 @@ from __future__ import print_function
 from __future__ import unicode_literals

 import json
-import os.path
 import platform
 import sys
-import time
 from datetime import datetime
 from typing import Optional

@@ -164,13 +162,14 @@ class GPUStatCollection(object):
    _device_count = None
    _gpu_device_info = {}

-    def __init__(self, gpu_list, driver_version=None):
+    def __init__(self, gpu_list, driver_version=None, driver_cuda_version=None):
        self.gpus = gpu_list

        # attach additional system information
        self.hostname = platform.node()
        self.query_time = datetime.now()
        self.driver_version = driver_version
+        self.driver_cuda_version = driver_cuda_version

    @staticmethod
    def clean_processes():
@@ -181,10 +180,11 @@ class GPUStatCollection(object):
    @staticmethod
    def new_query(shutdown=False, per_process_stats=False, get_driver_info=False):
        """Query the information of all the GPUs on local machine"""
-
+        initialized = False
        if not GPUStatCollection._initialized:
            N.nvmlInit()
            GPUStatCollection._initialized = True
+            initialized = True

        def _decode(b):
            if isinstance(b, bytes):
@@ -200,10 +200,10 @@ class GPUStatCollection(object):
                if nv_process.pid not in GPUStatCollection.global_processes:
                    GPUStatCollection.global_processes[nv_process.pid] = \
                        psutil.Process(pid=nv_process.pid)
-                ps_process = GPUStatCollection.global_processes[nv_process.pid]
                process['pid'] = nv_process.pid
                # noinspection PyBroadException
                try:
+                    # ps_process = GPUStatCollection.global_processes[nv_process.pid]
                    # we do not actually use these, so no point in collecting them
                    # process['username'] = ps_process.username()
                    # # cmdline returns full path;
@@ -286,11 +286,11 @@ class GPUStatCollection(object):
                for nv_process in nv_comp_processes + nv_graphics_processes:
                    try:
                        process = get_process_info(nv_process)
-                        processes.append(process)
                    except psutil.NoSuchProcess:
                        # TODO: add some reminder for NVML broken context
                        # e.g. nvidia-smi reset  or  reboot the system
-                        pass
+                        process = None
+                    processes.append(process)

                # we do not actually use these, so no point in collecting them
                # # TODO: Do not block if full process info is not requested
@@ -314,7 +314,7 @@ class GPUStatCollection(object):
                # Convert bytes into MBytes
                'memory.used': memory.used // MB if memory else None,
                'memory.total': memory.total // MB if memory else None,
-                'processes': processes,
+                'processes': None if (processes and all(p is None for p in processes)) else processes
            }
            if per_process_stats:
                GPUStatCollection.clean_processes()
@@ -337,15 +337,32 @@ class GPUStatCollection(object):
                driver_version = _decode(N.nvmlSystemGetDriverVersion())
            except N.NVMLError:
                driver_version = None  # N/A
+
+            # noinspection PyBroadException
+            try:
+                cuda_driver_version = str(N.nvmlSystemGetCudaDriverVersion())
+            except BaseException:
+                # noinspection PyBroadException
+                try:
+                    cuda_driver_version = str(N.nvmlSystemGetCudaDriverVersion_v2())
+                except BaseException:
+                    cuda_driver_version = None
+            if cuda_driver_version:
+                try:
+                    cuda_driver_version = '{}.{}'.format(
+                        int(cuda_driver_version)//1000, (int(cuda_driver_version) % 1000)//10)
+                except (ValueError, TypeError):
+                    pass
        else:
            driver_version = None
+            cuda_driver_version = None

        # no need to shutdown:
-        if shutdown:
+        if shutdown and initialized:
            N.nvmlShutdown()
            GPUStatCollection._initialized = False

-        return GPUStatCollection(gpu_list, driver_version=driver_version)
+        return GPUStatCollection(gpu_list, driver_version=driver_version, driver_cuda_version=cuda_driver_version)

    def __len__(self):
        return len(self.gpus)
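Reviewer note: the CUDA driver version formatting added to `new_query` can be checked in isolation. NVML reports the version as a single integer of the form `major * 1000 + minor * 10`; the function name below is made up for this sketch, and the sample values are illustrative.

```python
def format_cuda_driver_version(raw):
    # Mirrors the arithmetic in the hunk above: 12020 -> "12.2"
    try:
        value = int(raw)
    except (ValueError, TypeError):
        return raw  # leave unparsable input untouched, as the diff does
    return "{}.{}".format(value // 1000, (value % 1000) // 10)


v12 = format_cuda_driver_version("12020")
v11 = format_cuda_driver_version(11080)
odd = format_cuda_driver_version("n/a")
```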
--- a/clearml_agent/helper/gpu/pynvml.py
+++ b/clearml_agent/helper/gpu/pynvml.py
--- a/clearml_agent/helper/package/base.py
+++ b/clearml_agent/helper/package/base.py
@@ -50,7 +50,7 @@ class PackageManager(object):
        pass

    @abc.abstractmethod
-    def freeze(self):
+    def freeze(self, freeze_full_environment=False):
        pass

    @abc.abstractmethod
@@ -141,8 +141,9 @@ class PackageManager(object):
    @classmethod
    def out_of_scope_install_package(cls, package_name, *args):
        if PackageManager._selected_manager is not None:
+            # noinspection PyBroadException
            try:
-                result = PackageManager._selected_manager._install(package_name, *args)
+                result = PackageManager._selected_manager.install_packages(package_name, *args)
                if result not in (0, None, True):
                    return False
            except Exception:
@@ -150,10 +151,11 @@ class PackageManager(object):
        return True

    @classmethod
-    def out_of_scope_freeze(cls):
+    def out_of_scope_freeze(cls, freeze_full_environment=False):
        if PackageManager._selected_manager is not None:
+            # noinspection PyBroadException
            try:
-                return PackageManager._selected_manager.freeze()
+                return PackageManager._selected_manager.freeze(freeze_full_environment)
            except Exception:
                pass
        return []
--- a/clearml_agent/helper/package/external_req.py
+++ b/clearml_agent/helper/package/external_req.py
@@ -92,21 +92,14 @@ class ExternalRequirements(SimpleSubstitution):
                vcs_url = req_line[4:]
                # reverse replace
                vcs_url = vcs_url[::-1].replace(fragment[::-1], '', 1)[::-1]
-                # remove ssh:// or git:// prefix for git detection and credentials
-                scheme = ''
-                full_vcs_url = vcs_url
-                if vcs_url and (vcs_url.startswith('ssh://') or vcs_url.startswith('git://')):
-                    scheme = 'ssh://'  # notice git:// is actually ssh://
-                    vcs_url = vcs_url[6:]
+                # notice git:// is actually ssh://
+                if vcs_url and vcs_url.startswith('git://'):
+                    vcs_url = vcs_url.replace('git://', 'ssh://', 1)

                from ..repo import Git
-                vcs = Git(session=session, url=full_vcs_url, location=None, revision=None)
+                vcs = Git(session=session, url=vcs_url, location=None, revision=None)
                vcs._set_ssh_url()
-                new_req_line = 'git+{}{}{}'.format(
-                    '' if scheme and '://' in vcs.url else scheme,
-                    vcs_url if session.config.get('agent.force_git_ssh_protocol', None) else vcs.url_with_auth,
-                    fragment
-                )
+                new_req_line = 'git+{}{}'.format(vcs.url_with_auth, fragment)
                if new_req_line != req_line:
                    furl_line = furl(new_req_line)
                    print('Replacing original pip vcs \'{}\' with \'{}\''.format(
--- a/clearml_agent/helper/package/pip_api/system.py
+++ b/clearml_agent/helper/package/pip_api/system.py
@@ -4,7 +4,7 @@ from itertools import chain
 from pathlib import Path
 from typing import Text, Optional

-from clearml_agent.definitions import PIP_EXTRA_INDICES, PROGRAM_NAME
+from clearml_agent.definitions import PIP_EXTRA_INDICES, PROGRAM_NAME, ENV_PIP_EXTRA_INSTALL_FLAGS
 from clearml_agent.helper.package.base import PackageManager
 from clearml_agent.helper.process import Argv, DEVNULL
 from clearml_agent.session import Session
@@ -12,8 +12,6 @@ from clearml_agent.session import Session

 class SystemPip(PackageManager):

-    indices_args = None
-
    def __init__(self, interpreter=None, session=None):
        # type: (Optional[Text], Optional[Session]) -> ()
        """
@@ -52,7 +50,7 @@ class SystemPip(PackageManager):
                package,
                '--dest', cache_dir,
                '--no-deps',
-            ) + self.install_flags()
+            ) + self.download_flags()
        )

    def load_requirements(self, requirements):
@@ -65,13 +63,14 @@ class SystemPip(PackageManager):
    def uninstall(self, package):
        self.run_with_env(('uninstall', '-y', package))

-    def freeze(self):
+    def freeze(self, freeze_full_environment=False):
        """
        pip freeze to all install packages except the running program
        :return: Dict contains pip as key and pip's packages to install
        :rtype: Dict[str: List[str]]
        """
-        packages = self.run_with_env(('freeze',), output=True).splitlines()
+        packages = self.run_with_env(
+            ('freeze',) if not freeze_full_environment else ('freeze', '--all'), output=True).splitlines()
        packages_without_program = [package for package in packages if PROGRAM_NAME not in package]
        return {'pip': packages_without_program}

@@ -87,14 +86,30 @@ class SystemPip(PackageManager):
        # make sure we are not running it with our own PYTHONPATH
        env = dict(**os.environ)
        env.pop('PYTHONPATH', None)
+
+        # Debug print
+        if self.session.debug_mode:
+            print(command)
+
        return (command.get_output if output else command.check_call)(stdin=DEVNULL, env=env, **kwargs)

    def _make_command(self, command):
        return Argv(self.bin, '-m', 'pip', '--disable-pip-version-check', *command)

    def install_flags(self):
-        if self.indices_args is None:
-            self.indices_args = tuple(
-                chain.from_iterable(('--extra-index-url', x) for x in PIP_EXTRA_INDICES)
-            )
-        return self.indices_args
+        indices_args = tuple(
+            chain.from_iterable(('--extra-index-url', x) for x in PIP_EXTRA_INDICES)
+        )
+
+        extra_pip_flags = \
+            ENV_PIP_EXTRA_INSTALL_FLAGS.get() or \
+            self.session.config.get("agent.package_manager.extra_pip_install_flags", None)
+
+        return (indices_args + tuple(extra_pip_flags)) if extra_pip_flags else indices_args
+
+    def download_flags(self):
+        indices_args = tuple(
+            chain.from_iterable(('--extra-index-url', x) for x in PIP_EXTRA_INDICES)
+        )
+
+        return indices_args
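Reviewer note: a standalone sketch of how `install_flags` now assembles its result. The function name, the index URL and the sample flag are placeholders chosen for this note; in the diff the inputs come from `PIP_EXTRA_INDICES`, `ENV_PIP_EXTRA_INSTALL_FLAGS` and `agent.package_manager.extra_pip_install_flags`.

```python
from itertools import chain


def build_install_flags(extra_indices, extra_pip_flags=None):
    # each index URL becomes its own "--extra-index-url <url>" pair
    indices_args = tuple(
        chain.from_iterable(("--extra-index-url", x) for x in extra_indices)
    )
    # extra flags are appended only when something was configured
    return (indices_args + tuple(extra_pip_flags)) if extra_pip_flags else indices_args


with_extra = build_install_flags(["https://example.com/simple"], ["--no-cache-dir"])
indices_only = build_install_flags(["https://example.com/simple"])
nothing = build_install_flags([])
```

The extra flags should be a sequence of strings: `tuple()` applied to a single plain string would split it into characters. `download_flags` intentionally returns only the index arguments, so install-only flags are not passed to `pip download`.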
--- a/clearml_agent/helper/package/poetry_api.py
+++ b/clearml_agent/helper/package/poetry_api.py
@@ -69,7 +69,7 @@ class PoetryConfig:
                path = path.replace(':'+sys.base_prefix, ':'+sys.real_prefix, 1)
                kwargs['env']['PATH'] = path

-        if self.session and self.session.config:
+        if self.session and self.session.config and args and args[0] == "install":
            extra_args = self.session.config.get("agent.package_manager.poetry_install_extra_args", None)
            if extra_args:
                args = args + tuple(extra_args)
@@ -147,7 +147,7 @@ class PoetryAPI(object):
            any((self.path / indicator).exists() for indicator in self.INDICATOR_FILES)
        )

-    def freeze(self):
+    def freeze(self, freeze_full_environment=False):
        lines = self.config.run("show", cwd=str(self.path)).splitlines()
        lines = [[p for p in line.split(' ') if p] for line in lines]
        return {"pip": [parts[0]+'=='+parts[1]+' # '+' '.join(parts[2:]) for parts in lines]}
--- a/clearml_agent/helper/package/priority_req.py
+++ b/clearml_agent/helper/package/priority_req.py
@@ -7,7 +7,7 @@ from .requirements import SimpleSubstitution

 class PriorityPackageRequirement(SimpleSubstitution):

-    name = ("cython", "numpy", "setuptools", )
+    name = ("cython", "numpy", "setuptools", "pip", )
    optional_package_names = tuple()

    def __init__(self, *args, **kwargs):
@@ -50,31 +50,39 @@ class PriorityPackageRequirement(SimpleSubstitution):
        """
        # if we replaced setuptools, it means someone requested it, and since freeze will not contain it,
        # we need to add it manually
-        if not self._replaced_packages or "setuptools" not in self._replaced_packages:
+        if not self._replaced_packages:
            return list_of_requirements

-        try:
-            for k, lines in list_of_requirements.items():
-                # k is either pip/conda
-                if k not in ('pip', 'conda'):
-                    continue
-                for i, line in enumerate(lines):
-                    if not line or line.lstrip().startswith('#'):
-                        continue
-                    parts = [p for p in re.split(r'\s|=|\.|<|>|~|!|@|#', line) if p]
-                    if not parts:
-                        continue
-                    # if we found setuptools, do nothing
-                    if parts[0] == "setuptools":
-                        return list_of_requirements
+        if "pip" in self._replaced_packages:
+            full_freeze = PackageManager.out_of_scope_freeze(freeze_full_environment=True)
+            # now let's look for pip
+            pips = [line for line in full_freeze.get("pip", []) if line.split("==")[0] == "pip"]
+            if pips and "pip" in list_of_requirements:
+                list_of_requirements["pip"] = [pips[0]] + list_of_requirements["pip"]

-            # if we are here it means we have not found setuptools
-            # we should add it:
-            if "pip" in list_of_requirements:
-                list_of_requirements["pip"] = [self._replaced_packages["setuptools"]] + list_of_requirements["pip"]
+        if "setuptools" in self._replaced_packages:
+            try:
+                for k, lines in list_of_requirements.items():
+                    # k is either pip/conda
+                    if k not in ('pip', 'conda'):
+                        continue
+                    for i, line in enumerate(lines):
+                        if not line or line.lstrip().startswith('#'):
+                            continue
+                        parts = [p for p in re.split(r'\s|=|\.|<|>|~|!|@|#', line) if p]
+                        if not parts:
+                            continue
+                        # if we found setuptools, do nothing
+                        if parts[0] == "setuptools":
+                            return list_of_requirements

-        except Exception as ex:  # noqa
-            return list_of_requirements
+                # if we are here it means we have not found setuptools
+                # we should add it:
+                if "pip" in list_of_requirements:
+                    list_of_requirements["pip"] = [self._replaced_packages["setuptools"]] + list_of_requirements["pip"]
+
+            except Exception as ex:  # noqa
+                return list_of_requirements

        return list_of_requirements

--- a/clearml_agent/helper/package/pytorch.py
+++ b/clearml_agent/helper/package/pytorch.py
@@ -16,6 +16,7 @@ import six
 from .requirements import (
    SimpleSubstitution, FatalSpecsResolutionError, SimpleVersion, MarkerRequirement,
    compare_version_rules, )
+from ...definitions import ENV_PACKAGE_PYTORCH_RESOLVE
 from ...external.requirements_parser.requirement import Requirement

 OS_TO_WHEEL_NAME = {"linux": "linux_x86_64", "windows": "win_amd64"}
@@ -174,6 +175,7 @@ class PytorchRequirement(SimpleSubstitution):
    extra_index_url_template = 'https://download.pytorch.org/whl/cu{}/'
    nightly_extra_index_url_template = 'https://download.pytorch.org/whl/nightly/cu{}/'
    torch_index_url_lookup = {}
+    resolver_types = ("pip", "direct", "none")

    def __init__(self, *args, **kwargs):
        os_name = kwargs.pop("os_override", None)
@@ -208,6 +210,13 @@ class PytorchRequirement(SimpleSubstitution):
        if self.config.get("agent.package_manager.torch_url_template", None):
            PytorchWheel.url_template = \
                self.config.get("agent.package_manager.torch_url_template", None)
+        self.resolve_algorithm = str(
+            ENV_PACKAGE_PYTORCH_RESOLVE.get() or
+            self.config.get("agent.package_manager.pytorch_resolve", "pip")).lower()
+        if self.resolve_algorithm not in self.resolver_types:
+            print("WARNING: agent.package_manager.pytorch_resolve=={} not in {} reverting to '{}'".format(
+                self.resolve_algorithm, self.resolver_types, self.resolver_types[0]))
+            self.resolve_algorithm = self.resolver_types[0]

    def _init_python_ver_cuda_ver(self):
        if self.cuda is None:
@@ -261,6 +270,10 @@ class PytorchRequirement(SimpleSubstitution):
            )

    def match(self, req):
+        if self.resolve_algorithm == "none":
+            # skipping resolver
+            return False
+
        return req.name in self.packages

    @staticmethod
@@ -310,6 +323,12 @@ class PytorchRequirement(SimpleSubstitution):
            # yes this is for linux python 2.7 support, this is the only python 2.7 we support...
            if py_ver and py_ver[0] == '2' and len(parts) > 3 and not parts[3].endswith('u'):
                continue
+
+            # check if this is an actual match
+            if not req.compare_version(v) or \
+                    (last_v and SimpleVersion.compare_versions(last_v, '>', v, ignore_sub_versions=False)):
+                continue
+
            # update the closest matched version (from above)
            if not closest_v:
                closest_v = v
@@ -318,10 +337,6 @@ class PytorchRequirement(SimpleSubstitution):
                    SimpleVersion.compare_versions(
                        version_a=v, op='>=', version_b=req.specs[0][1], num_parts=3):
                closest_v = v
-            # check if this an actual match
-            if not req.compare_version(v) or \
-                    (last_v and SimpleVersion.compare_versions(last_v, '>', v, ignore_sub_versions=False)):
-                continue

            url = '/'.join(torch_url.split('/')[:-1] + l.split('/'))
            last_v = v
@@ -345,8 +360,10 @@ class PytorchRequirement(SimpleSubstitution):
                from pip._internal.commands.show import search_packages_info
                installed_torch = list(search_packages_info([req.name]))
                # notice the comparison order, the first part will make sure we have a valid installed package
-                installed_torch_version = (getattr(installed_torch[0], 'version', None) or installed_torch[0]['version']) \
-                    if installed_torch else None
+                installed_torch_version = \
+                    (getattr(installed_torch[0], 'version', None) or
+                     installed_torch[0]['version']) if installed_torch else None
+
                if installed_torch and installed_torch_version and \
                        req.compare_version(installed_torch_version):
                    print('PyTorch: requested "{}" version {}, using pre-installed version {}'.format(
@@ -354,6 +371,7 @@ class PytorchRequirement(SimpleSubstitution):
                    # package already installed, do nothing
                    req.specs = [('==', str(installed_torch_version))]
                    return '{} {} {}'.format(req.name, req.specs[0][0], req.specs[0][1]), True
+
        except Exception:
            pass

@@ -475,6 +493,26 @@ class PytorchRequirement(SimpleSubstitution):
        return self.match_version(req, base).replace(" ", "\n")

    def replace(self, req):
+        # we first try to resolve things ourselves because pytorch pip is not always picking the correct
+        # versions from their pip repository
+
+        resolve_algorithm = self.resolve_algorithm
+        if resolve_algorithm == "none":
+            # skipping resolver
+            return None
+        elif resolve_algorithm == "direct":
+            # noinspection PyBroadException
+            try:
+                new_req = self._replace(req)
+                if new_req:
+                    self._original_req.append((req, new_req))
+                return new_req
+            except Exception:
+                print("Warning: Failed resolving using `pytorch_resolve=direct` reverting to `pytorch_resolve=pip`")
+        elif resolve_algorithm not in self.resolver_types:
+            print("Warning: `agent.package_manager.pytorch_resolve={}` "
+                  "unrecognized, default to `pip`".format(resolve_algorithm))
+
        # check if package is already installed with system packages
        self.validate_python_version()

@@ -508,6 +546,8 @@ class PytorchRequirement(SimpleSubstitution):
                    # return the original line
                    line = req.line

+                print("PyTorch: Adding index `{}` and installing `{}`".format(extra_index_url[0], line))
+
                return line

        except Exception:  # noqa
@@ -566,6 +606,19 @@ class PytorchRequirement(SimpleSubstitution):
        :param list_of_requirements: {'pip': ['a==1.0', ]}
        :return: {'pip': ['a==1.0', ]}
        """
+        def build_specific_version_req(a_line, a_name, a_new_req):
+            try:
+                r = Requirement.parse(a_line)
+                wheel_parts = r.uri.split("/")[-1].split('-')
+                version = str(wheel_parts[1].split('%')[0].split('+')[0])
+                new_r = Requirement.parse("{} == {} # {}".format(a_name, version, str(a_new_req)))
+                if new_r.line:
+                    # great it worked!
+                    return new_r.line
+            except:  # noqa
+                pass
+            return None
+
        if not self._original_req:
            return list_of_requirements
        try:
@@ -589,9 +642,18 @@ class PytorchRequirement(SimpleSubstitution):
                                    if req.local_file:
                                        lines[i] = '{}'.format(str(new_req))
                                    else:
-                                        lines[i] = '{} # {}'.format(str(req), str(new_req))
+                                        # try to rebuild requirements with specific version:
+                                        new_line = build_specific_version_req(line, req.req.name, new_req)
+                                        if new_line:
+                                            lines[i] = new_line
+                                        else:
+                                            lines[i] = '{} # {}'.format(str(req), str(new_req))
                            else:
-                                lines[i] = '{} # {}'.format(line, str(new_req))
+                                new_line = build_specific_version_req(line, req.req.name, new_req)
+                                if new_line:
+                                    lines[i] = new_line
+                                else:
+                                    lines[i] = '{} # {}'.format(line, str(new_req))
                            break
        except:
            pass
@@ -640,7 +702,7 @@ class PytorchRequirement(SimpleSubstitution):
            # noinspection PyBroadException
            try:
                if requests.get(torch_url, timeout=10).ok:
-                    print('Torch CUDA {} index page found'.format(c))
+                    print('Torch CUDA {} index page found, adding `{}`'.format(c, torch_url))
                    cls.torch_index_url_lookup[c] = torch_url
                    return cls.torch_index_url_lookup[c], c
            except Exception:
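Review note: the `pytorch.py` changes pick the resolver from the environment first, then from `agent.package_manager.pytorch_resolve`, and fall back to the first entry of `resolver_types` when the value is unrecognized. The sketch below restates that precedence in isolation. It assumes a plain `dict` stands in for the agent configuration object and reads the variable named in `docs/clearml.conf` directly from the environment; `pick_pytorch_resolver` is a hypothetical name.

```python
import os

RESOLVER_TYPES = ("pip", "direct", "none")
ENV_NAME = "CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE"
CONFIG_KEY = "agent.package_manager.pytorch_resolve"


def pick_pytorch_resolver(config, environ=None):
    """Return the resolver name: environment wins over config, default is 'pip'.

    `config` is assumed to be a flat dict keyed by the dotted option name.
    """
    environ = os.environ if environ is None else environ
    value = str(environ.get(ENV_NAME) or config.get(CONFIG_KEY, "pip")).lower()
    if value not in RESOLVER_TYPES:
        # unknown value: warn and revert to the first (default) resolver
        print("WARNING: {}=={} not in {} reverting to '{}'".format(
            CONFIG_KEY, value, RESOLVER_TYPES, RESOLVER_TYPES[0]))
        value = RESOLVER_TYPES[0]
    return value


if __name__ == "__main__":
    print(pick_pytorch_resolver({CONFIG_KEY: "direct"}, {}))
```

With `"none"` selected, `match()` returns `False` and torch is installed like any other package; with `"direct"`, a failed resolution falls through to the `pip` path.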
--- a/clearml_agent/helper/package/requirements.py
+++ b/clearml_agent/helper/package/requirements.py
@@ -240,6 +240,23 @@ class SimpleVersion:
        if not version_b:
            return True

+        # remove trailing "*" in both
+        if "*" in version_a:
+            ignore_sub_versions = True
+            while version_a.endswith(".*"):
+                version_a = version_a[:-2]
+            if version_a == "*":
+                version_a = ""
+            num_parts = min(len(version_a.split('.')), len(version_b.split('.')), )
+
+        if "*" in version_b:
+            ignore_sub_versions = True
+            while version_b.endswith(".*"):
+                version_b = version_b[:-2]
+            if version_b == "*":
+                version_b = ""
+            num_parts = min(len(version_a.split('.')), len(version_b.split('.')), )
+
        if not num_parts:
            num_parts = max(len(version_a.split('.')), len(version_b.split('.')), )

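Review note: the `SimpleVersion` hunk above strips trailing `.*` segments and then compares only as many parts as the shorter version has. The sketch below isolates that idea. Both helper names are hypothetical, and treating a bare `*` as "matches anything" is an assumption of this sketch rather than something the hunk states.

```python
def strip_wildcard(version):
    """Return (version without trailing '.*' parts, whether a '*' was present)."""
    if "*" not in version:
        return version, False
    while version.endswith(".*"):
        version = version[:-2]
    if version == "*":
        version = ""
    return version, True


def prefix_match(version_a, version_b):
    """Equality check that honours trailing wildcards on either side."""
    a, a_wild = strip_wildcard(version_a)
    b, b_wild = strip_wildcard(version_b)
    if not a or not b:
        # assumption: an empty side (originally "*") matches any version
        return True
    if not (a_wild or b_wild):
        return a == b
    # compare only the parts both versions actually specify
    num_parts = min(len(a.split(".")), len(b.split(".")))
    return a.split(".")[:num_parts] == b.split(".")[:num_parts]


if __name__ == "__main__":
    print(prefix_match("1.13.1", "1.13.*"))
    print(prefix_match("1.12.0", "1.13.*"))
```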
--- a/clearml_agent/helper/repo.py
+++ b/clearml_agent/helper/repo.py
@@ -19,7 +19,7 @@ from pathlib2 import Path

 import six

-from clearml_agent.definitions import ENV_AGENT_GIT_USER, ENV_AGENT_GIT_PASS, ENV_AGENT_GIT_HOST
+from clearml_agent.definitions import ENV_AGENT_GIT_USER, ENV_AGENT_GIT_PASS, ENV_AGENT_GIT_HOST, ENV_GIT_CLONE_VERBOSE
 from clearml_agent.helper.console import ensure_text, ensure_binary
 from clearml_agent.errors import CommandFailedError
 from clearml_agent.helper.base import (
@@ -197,8 +197,9 @@ class VCS(object):
            self.log.info("successfully applied uncommitted changes")
        return True

-    # Command-line flags for clone command
-    clone_flags = ()
+    def clone_flags(self):
+        """Command-line flags for clone command"""
+        return tuple()

    @abc.abstractmethod
    def executable_not_found_error_help(self):
@@ -320,6 +321,7 @@ class VCS(object):
                        self.url, new_url))
                    self.url = new_url
                return
+
            # rewrite ssh URLs only if either ssh port or ssh user are forced in config
            if parsed_url.scheme == "ssh" and (
                self.session.config.get('agent.force_git_ssh_port', None) or
@@ -334,6 +336,9 @@ class VCS(object):
                    print("Using SSH credentials - ssh url '{}' with ssh url '{}'".format(
                        self.url, new_url))
                    self.url = new_url
+                return
+            elif parsed_url.scheme == "ssh":
+                return

        if not self.session.config.agent.translate_ssh:
            return
@@ -341,11 +346,18 @@ class VCS(object):
        # if we have git_user / git_pass replace ssh credentials with https authentication
        if (ENV_AGENT_GIT_USER.get() or self.session.config.get('agent.git_user', None)) and \
                (ENV_AGENT_GIT_PASS.get() or self.session.config.get('agent.git_pass', None)):
+
            # only apply to a specific domain (if requested)
            config_domain = \
-                ENV_AGENT_GIT_HOST.get() or self.session.config.get("git_host", None)
-            if config_domain and config_domain != furl(self.url).host:
-                return
+                ENV_AGENT_GIT_HOST.get() or self.session.config.get("agent.git_host", None)
+            if config_domain:
+                if config_domain != furl(self.url).host:
+                    # bail out here if we have a git_host configured and it's different than the URL host
+                    # however, we should make sure this is not an ssh@ URL that furl failed to parse
+                    ssh_git_url_match = self.SSH_URL_GIT_SYNTAX.match(self.url)
+                    if not ssh_git_url_match or config_domain != ssh_git_url_match.groupdict().get("host"):
+                        # do not replace to ssh url
+                        return

            new_url = self.replace_ssh_url(self.url)
            if new_url != self.url:
@@ -362,7 +374,7 @@ class VCS(object):
        self._set_ssh_url()
        # if we are on linux no need for the full auth url because we use GIT_ASKPASS
        url = self.url_without_auth if self._use_ask_pass else self.url_with_auth
-        clone_command = ("clone", url, self.location) + self.clone_flags
+        clone_command = ("clone", url, self.location) + self.clone_flags()
        # clone all branches regardless of when we want to later checkout
        # if branch:
        #    clone_command += ("-b", branch)
@@ -539,7 +551,6 @@ class VCS(object):
 class Git(VCS):
    executable_name = "git"
    main_branch = ("master", "main")
-    clone_flags = ("--quiet", "--recursive")
    checkout_flags = ("--force",)
    COMMAND_ENV = {
        # do not prompt for password
@@ -551,7 +562,7 @@ class Git(VCS):
    def __init__(self, *args, **kwargs):
        super(Git, self).__init__(*args, **kwargs)

-        self._use_ask_pass = False if not self.session.config.get('agent.enable_git_ask_pass', None) \
+        self._use_ask_pass = False if not self.session.config.get('agent.enable_git_ask_pass', True) \
            else sys.platform == "linux"

        try:
@@ -565,6 +576,12 @@ class Git(VCS):
            "origin/{}".format(b) for b in ([branch] if isinstance(branch, str) else branch)
        ]

+    def clone_flags(self):
+        return (
+            "--recursive",
+            "--verbose" if ENV_GIT_CLONE_VERBOSE.get() else "--quiet"
+        )
+
    def executable_not_found_error_help(self):
        return 'Cannot find "{}" executable. {}'.format(
            self.executable_name,
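Review note: turning `clone_flags` from a class attribute into a method lets the flags be computed when the clone command is built, so verbosity can follow the environment at call time. A standalone sketch is below. The variable name `CLEARML_AGENT_GIT_CLONE_VERBOSE` and the truthiness parsing are assumptions, since the definition of `ENV_GIT_CLONE_VERBOSE` is not shown in this diff.

```python
import os


def git_clone_flags(environ=None):
    """Return git clone flags, choosing --verbose or --quiet from the environment."""
    environ = os.environ if environ is None else environ
    # assumed variable name and parsing; the agent reads this via ENV_GIT_CLONE_VERBOSE
    raw = environ.get("CLEARML_AGENT_GIT_CLONE_VERBOSE", "").strip().lower()
    is_verbose = raw not in ("", "0", "false", "no")
    return ("--recursive", "--verbose" if is_verbose else "--quiet")


def build_clone_command(url, location, environ=None):
    """Mirror of the clone command assembly: ("clone", url, location) + flags."""
    return ("clone", url, location) + git_clone_flags(environ)


if __name__ == "__main__":
    print(build_clone_command("https://example.com/repo.git", "/tmp/repo", {}))
```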
--- a/clearml_agent/helper/resource_monitor.py
+++ b/clearml_agent/helper/resource_monitor.py
@@ -79,6 +79,7 @@ class ResourceMonitor(object):
        self._gpustat_fail = 0
        self._gpustat = gpustat
        self._active_gpus = None
+        self._disk_use_path = str(session.config.get("agent.resource_monitoring.disk_use_path", None) or Path.home())
        if not worker_tags and ENV_WORKER_TAGS.get():
            worker_tags = shlex.split(ENV_WORKER_TAGS.get())
        self._worker_tags = worker_tags
@@ -139,42 +140,45 @@ class ResourceMonitor(object):
    def _daemon(self):
        seconds_since_started = 0
        reported = 0
-        while True:
-            last_report = time()
-            current_report_frequency = (
-                self._report_frequency if reported != 0 else self._first_report_sec
-            )
-            while (time() - last_report) < current_report_frequency:
-                # wait for self._sample_frequency seconds, if event set quit
-                if self._exit_event.wait(1 / self._sample_frequency):
-                    return
-                # noinspection PyBroadException
-                try:
-                    self._update_readouts()
-                except Exception as ex:
-                    log.warning("failed getting machine stats: %s", report_error(ex))
-                    self._failure()
+        try:
+            while True:
+                last_report = time()
+                current_report_frequency = (
+                    self._report_frequency if reported != 0 else self._first_report_sec
+                )
+                while (time() - last_report) < current_report_frequency:
+                    # wait for self._sample_frequency seconds, if event set quit
+                    if self._exit_event.wait(1 / self._sample_frequency):
+                        return
+                    # noinspection PyBroadException
+                    try:
+                        self._update_readouts()
+                    except Exception as ex:
+                        log.warning("failed getting machine stats: %s", report_error(ex))
+                        self._failure()

-            seconds_since_started += int(round(time() - last_report))
-            # check if we do not report any metric (so it means the last iteration will not be changed)
+                seconds_since_started += int(round(time() - last_report))
+                # check if we do not report any metric (so it means the last iteration will not be changed)

-            # if we do not have last_iteration, we just use seconds as iteration
+                # if we do not have last_iteration, we just use seconds as iteration

-            # start reporting only when we figured out, if this is seconds based, or iterations based
-            average_readouts = self._get_average_readouts()
-            stats = {
-                # 3 points after the dot
-                key: round(value, 3) if isinstance(value, float) else [round(v, 3) for v in value]
-                for key, value in average_readouts.items()
-            }
+                # start reporting only when we figured out, if this is seconds based, or iterations based
+                average_readouts = self._get_average_readouts()
+                stats = {
+                    # 3 points after the dot
+                    key: round(value, 3) if isinstance(value, float) else [round(v, 3) for v in value]
+                    for key, value in average_readouts.items()
+                }

-            # send actual report
-            if self.send_report(stats):
-                # clear readouts if this is update was sent
-                self._clear_readouts()
+                # send actual report
+                if self.send_report(stats):
+                    # clear readouts if this update was sent
+                    self._clear_readouts()

-            # count reported iterations
-            reported += 1
+                # count reported iterations
+                reported += 1
+        except Exception as ex:
+            log.exception("Error reporting monitoring info: %s", str(ex))

    def _update_readouts(self):
        readouts = self._machine_stats()
@@ -239,7 +243,7 @@ class ResourceMonitor(object):
        virtual_memory = psutil.virtual_memory()
        stats["memory_used"] = BytesSizes.megabytes(virtual_memory.used)
        stats["memory_free"] = BytesSizes.megabytes(virtual_memory.available)
-        disk_use_percentage = psutil.disk_usage(Text(Path.home())).percent
+        disk_use_percentage = psutil.disk_usage(self._disk_use_path).percent
        stats["disk_free_percent"] = 100 - disk_use_percentage
        sensor_stat = (
            psutil.sensors_temperatures()
@@ -263,8 +267,10 @@ class ResourceMonitor(object):
                gpu_stat = self._gpustat.new_query()
                for i, g in enumerate(gpu_stat.gpus):
                    # only monitor the active gpu's, if none were selected, monitor everything
-                    if self._active_gpus and str(i) not in self._active_gpus:
-                        continue
+                    if self._active_gpus:
+                        uuid = getattr(g, "uuid", None)
+                        if str(i) not in self._active_gpus and (not uuid or uuid not in self._active_gpus):
+                            continue
                    stats["gpu_temperature_{:d}".format(i)] = g["temperature.gpu"]
                    stats["gpu_utilization_{:d}".format(i)] = g["utilization.gpu"]
                    stats["gpu_mem_usage_{:d}".format(i)] = (
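Review note: `agent.resource_monitoring.disk_use_path` replaces the hard-coded home folder when reading disk usage. The sketch below shows the same fallback logic using only the standard library; `shutil.disk_usage` stands in for `psutil.disk_usage` here, so the percentage may differ slightly from what the agent reports. `disk_free_percent` is a hypothetical name.

```python
import shutil
from pathlib import Path


def disk_free_percent(disk_use_path=None):
    """Free-space percentage of the volume holding disk_use_path (default: home)."""
    path = str(disk_use_path or Path.home())
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # some virtual filesystems report zero size; avoid dividing by zero
        return 0.0
    used_percent = 100.0 * usage.used / usage.total
    return 100.0 - used_percent


if __name__ == "__main__":
    # monitor the current directory's volume instead of the home folder
    print(round(disk_free_percent("."), 3))
```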
--- a/clearml_agent/session.py
+++ b/clearml_agent/session.py
@@ -19,7 +19,7 @@ from clearml_agent.definitions import ENVIRONMENT_CONFIG, ENV_TASK_EXECUTE_AS_US
 from clearml_agent.errors import APIError
 from clearml_agent.helper.base import HOCONEncoder
 from clearml_agent.helper.process import Argv
-from clearml_agent.helper.docker_args import DockerArgsSanitizer
+from clearml_agent.helper.docker_args import DockerArgsSanitizer, sanitize_urls
 from .version import __version__

 POETRY = "poetry"
@@ -245,33 +245,43 @@ class Session(_Session):
            remove_secret_keys=("secret", "pass", "token", "account_key", "contents"),
            skip_value_keys=("environment", ),
            docker_args_sanitize_keys=("extra_docker_arguments", ),
+            sanitize_urls_keys=("extra_index_url", ),
    ):
        # remove all the secrets from the print
-        def recursive_remove_secrets(dictionary, secret_keys=(), empty_keys=()):
+        def recursive_remove_secrets(dictionary):
            for k in list(dictionary):
-                for s in secret_keys:
+                for s in remove_secret_keys:
                    if s in k:
                        dictionary.pop(k)
                        break
-                for s in empty_keys:
+                for s in skip_value_keys:
                    if s == k:
                        dictionary[k] = {key: '****' for key in dictionary[k]} \
                            if isinstance(dictionary[k], dict) else '****'
                        break
+                for s in sanitize_urls_keys:
+                    if s == k:
+                        value = dictionary.get(k, None)
+                        if isinstance(value, str):
+                            dictionary[k] = sanitize_urls(value)[0]
+                        elif isinstance(value, (list, tuple)):
+                            dictionary[k] = [sanitize_urls(v)[0] for v in value]
+                        elif isinstance(value, dict):
+                            dictionary[k] = {k_: sanitize_urls(v)[0] for k_, v in value.items()}
                if isinstance(dictionary.get(k, None), dict):
-                    recursive_remove_secrets(dictionary[k], secret_keys=secret_keys, empty_keys=empty_keys)
+                    recursive_remove_secrets(dictionary[k])
                elif isinstance(dictionary.get(k, None), (list, tuple)):
                    if k in (docker_args_sanitize_keys or []):
                        dictionary[k] = DockerArgsSanitizer.sanitize_docker_command(self, dictionary[k])
                    for item in dictionary[k]:
                        if isinstance(item, dict):
-                            recursive_remove_secrets(item, secret_keys=secret_keys, empty_keys=empty_keys)
+                            recursive_remove_secrets(item)

        config = deepcopy(self.config.to_dict())
        # remove the env variable, it's not important
        config.pop('env', None)
-        if remove_secret_keys or skip_value_keys or docker_args_sanitize_keys:
-            recursive_remove_secrets(config, secret_keys=remove_secret_keys, empty_keys=skip_value_keys)
+        if remove_secret_keys or skip_value_keys or docker_args_sanitize_keys or sanitize_urls_keys:
+            recursive_remove_secrets(config)
        # remove logging.loggers.urllib3.level from the print
        try:
            config['logging']['loggers']['urllib3'].pop('level', None)
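Review note: the `session.py` hunk routes `extra_index_url` values through `sanitize_urls` before the configuration is printed, handling strings, lists and dicts. The body of `sanitize_urls` is not part of this diff, so the sketch below uses a hypothetical `mask_url_credentials` built on `urllib.parse` to show the intent: hide an embedded password while keeping the rest of the URL readable. Unlike the real helper, it returns a plain string rather than a tuple, and IPv6 hosts are not handled.

```python
from urllib.parse import urlsplit, urlunsplit


def mask_url_credentials(url, mask="xxxxxx"):
    """Replace the password part of a URL with `mask`; other values pass through."""
    if not isinstance(url, str):
        return url
    parts = urlsplit(url)
    if parts.password is None:
        return url
    host = parts.hostname or ""
    if parts.port:
        host += ":{}".format(parts.port)
    netloc = "{}:{}@{}".format(parts.username or "", mask, host)
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


def sanitize_value(value):
    """Apply the mask to a string, a list/tuple of strings, or a dict of strings."""
    if isinstance(value, str):
        return mask_url_credentials(value)
    if isinstance(value, (list, tuple)):
        return [mask_url_credentials(v) for v in value]
    if isinstance(value, dict):
        return {k: mask_url_credentials(v) for k, v in value.items()}
    return value


if __name__ == "__main__":
    print(sanitize_value(["https://user:secret@pypi.example.com/simple"]))
```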
--- a/clearml_agent/version.py
+++ b/clearml_agent/version.py
@@ -1 +1 @@
-__version__ = '1.5.2'
+__version__ = '1.7.0'
--- a/docker/k8s-glue/build-resources/clearml.conf
+++ b/docker/k8s-glue/build-resources/clearml.conf
@@ -58,7 +58,7 @@ agent {
        type: pip,

        # specify pip version to use (examples "<20.2", "==19.3.1", "", empty string will install the latest version)
-        pip_version: "<21",
+        pip_version: ["<20.2 ; python_version < '3.10'", "<22.3 ; python_version >= '3.10'"],

        # virtual environment inheres packages from system
        system_site_packages: false,
@@ -171,7 +171,7 @@ agent {

    default_docker: {
        # default docker image to use when running in docker mode
-        image: "nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04"
+        image: "nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04"

        # optional arguments to pass to docker image
        # arguments: ["--ipc=host", ]
@@ -224,7 +224,7 @@ sdk {

    storage {
        cache {
-            # Defaults to system temp folder / cache
+            # Defaults to <system_temp_folder>/clearml_cache
            default_base_dir: "~/.clearml/cache"
            size {
                # max_used_bytes = -1
--- a/docker/k8s-glue/build-resources/entrypoint.sh
+++ b/docker/k8s-glue/build-resources/entrypoint.sh
@@ -33,4 +33,9 @@ echo "api.files_server: ${CLEARML_FILES_HOST}" >> ~/clearml.conf

 ./provider_entrypoint.sh

-python3 k8s_glue_example.py --queue ${QUEUE} --max-pods ${MAX_PODS} ${EXTRA_ARGS}
+if [[ -z "${K8S_GLUE_MAX_PODS}" ]]
+then
+  python3 k8s_glue_example.py --queue ${QUEUE} ${EXTRA_ARGS}
+else
+  python3 k8s_glue_example.py --queue ${QUEUE} --max-pods ${MAX_PODS} ${EXTRA_ARGS}
+fi
--- a/docker/k8s-glue/build-resources/k8s_glue_example.py
+++ b/docker/k8s-glue/build-resources/k8s_glue_example.py
@@ -65,6 +65,19 @@ def parse_args():
        help="Limit the maximum number of pods that this service can run at the same time."
             "Should not be used with ports-mode"
    )
+    parser.add_argument(
+        "--use-owner-token", action="store_true", default=False,
+        help="Generate and use task owner token for the execution of each task"
+    )
+    parser.add_argument(
+        "--standalone-mode", action="store_true", default=False,
+        help="Do not use any network connections, assume everything is pre-installed"
+    )
+    parser.add_argument(
+        "--child-report-tags", type=str, nargs="+", default=None,
+        help="List of tags to send with the status reports from a worker that runs a task"
+    )
+
    return parser.parse_args()


@@ -85,9 +98,14 @@ def main():
        user_props_cb=user_props_cb, overrides_yaml=args.overrides_yaml, clearml_conf_file=args.pod_clearml_conf,
        template_yaml=args.template_yaml, extra_bash_init_script=K8sIntegration.get_ssh_server_bash(
            ssh_port_number=args.ssh_server_port) if args.ssh_server_port else None,
-        namespace=args.namespace, max_pods_limit=args.max_pods or None,
+        namespace=args.namespace, max_pods_limit=args.max_pods or None
+    )
+    k8s.k8s_daemon(
+        args.queue,
+        use_owner_token=args.use_owner_token,
+        standalone_mode=args.standalone_mode,
+        child_report_tags=args.child_report_tags
    )
-    k8s.k8s_daemon(args.queue)


 if __name__ == "__main__":
--- a/docker/k8s-glue/build-resources/setup.sh
+++ b/docker/k8s-glue/build-resources/setup.sh
@@ -2,13 +2,17 @@

 chmod +x /root/entrypoint.sh

-apt-get update -y
-apt-get dist-upgrade -y
-apt-get install -y curl unzip less locales
+apt-get update -qqy
+apt-get dist-upgrade -qqy
+apt-get install -qqy curl unzip less locales

 locale-gen en_US.UTF-8

-apt-get install -y curl python3-pip git
+apt-get update -qqy
+apt-get install -qqy curl gcc python3-dev python3-pip apt-transport-https lsb-release openssh-client git gnupg
+rm -rf /var/lib/apt/lists/*
+apt clean
+
 python3 -m pip install -U pip
-python3 -m pip install clearml-agent
-python3 -m pip install -U "cryptography>=2.9"
+python3 -m pip install --no-cache-dir clearml-agent
+python3 -m pip install -U --no-cache-dir "cryptography>=2.9"
--- a/docker/k8s-glue/glue-build/Dockerfile.alpine
+++ b/docker/k8s-glue/glue-build/Dockerfile.alpine
@@ -1,4 +1,4 @@
-ARG TAG=3.7.12-alpine3.15
+ARG TAG=3.7.17-alpine3.18

 FROM python:${TAG} as build

@@ -20,7 +20,7 @@ FROM python:${TAG} as target

 WORKDIR /app

-ARG KUBECTL_VERSION=1.22.4
+ARG KUBECTL_VERSION=1.24.0

 # Not sure about these ENV vars
 # ENV LC_ALL=en_US.UTF-8
--- a/docker/k8s-glue/glue-build/Dockerfile.bullseye
+++ b/docker/k8s-glue/glue-build/Dockerfile.bullseye
@@ -1,4 +1,4 @@
-ARG TAG=3.7.12-slim-bullseye
+ARG TAG=3.7.17-slim-bullseye

 FROM python:${TAG} as target

--- a/docker/k8s-glue/glue-build/k8s_glue_example.py
+++ b/docker/k8s-glue/glue-build/k8s_glue_example.py
@@ -65,6 +65,10 @@ def parse_args():
        help="Limit the maximum number of pods that this service can run at the same time."
             "Should not be used with ports-mode"
    )
+    parser.add_argument(
+        "--use-owner-token", action="store_true", default=False,
+        help="Generate and use task owner token for the execution of each task"
+    )
    return parser.parse_args()


@@ -87,7 +91,7 @@ def main():
            ssh_port_number=args.ssh_server_port) if args.ssh_server_port else None,
        namespace=args.namespace, max_pods_limit=args.max_pods or None,
    )
-    k8s.k8s_daemon(args.queue)
+    k8s.k8s_daemon(args.queue, use_owner_token=args.use_owner_token)


 if __name__ == "__main__":
--- a/docs/clearml.conf
+++ b/docs/clearml.conf
@@ -58,8 +58,8 @@ agent {
    # it solves passing user/token to git submodules.
    # this is a safer way to ensure multiple users using the same repository will
    # not accidentally leak credentials
-    # Only supported on Linux systems, it will be the default in future releases
-    # enable_git_ask_pass: false
+    # Note: this is only supported on Linux systems
+    # enable_git_ask_pass: true

    # in docker mode, if container's entrypoint automatically activated a virtual environment
    # use the activated virtual environment and install everything there
@@ -93,25 +93,43 @@ agent {
        # extra_index_url: ["https://allegroai.jfrog.io/clearml/api/pypi/public/simple"]
        extra_index_url: []

+        # additional flags to use when calling pip install, example: ["--use-deprecated=legacy-resolver", ]
+        # extra_pip_install_flags: []
+
+        # control the pytorch wheel resolving algorithm, options are: "pip", "direct", "none"
+        # Override with environment variable CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE
+        # "pip" (default): would automatically detect the cuda version, and supply pip with the correct
+        # extra-index-url, based on pytorch.org tables
+        # "direct": would resolve a direct link to the pytorch wheel by parsing the pytorch.org pip repository
+        # and matching the automatically detected cuda version with the required pytorch wheel.
+        # if the exact cuda version is not found for the required pytorch wheel, it will try
+        # a lower cuda version until a match is found
+        # "none": No resolver used, install pytorch like any other package
+        # pytorch_resolve: "pip"
+
        # additional conda channels to use when installing with conda package manager
        conda_channels: ["pytorch", "conda-forge", "defaults", ]
        # conda_full_env_update: false
        # conda_env_as_base_docker: false

        # set the priority packages to be installed before the rest of the required packages
+        # Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
        # priority_packages: ["cython", "numpy", "setuptools", ]

        # set the optional priority packages to be installed before the rest of the required packages,
        # In case a package installation fails, the package will be ignored,
        # and the virtual environment process will continue
+        # Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
        # priority_optional_packages: ["pygobject", ]

        # set the post packages to be installed after all the rest of the required packages
+        # Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
        # post_packages: ["horovod", ]

        # set the optional post packages to be installed after all the rest of the required packages,
        # In case a package installation fails, the package will be ignored,
        # and the virtual environment process will continue
+        # Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
        # post_optional_packages: []

        # set to True to support torch nightly build installation,
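The `"direct"` mode comment above describes a fallback: if no wheel exists for the detected CUDA version, a lower CUDA version is tried until one matches. A rough sketch of that selection rule is shown below. The function name `pick_cuda_build` and the list of available builds are hypothetical, and the agent's real resolver (which parses the pytorch.org repository) is not part of this diff.

```python
def pick_cuda_build(detected, available):
    """Return the highest available CUDA version that is <= detected, or None."""
    candidates = [version for version in available if version <= detected]
    return max(candidates) if candidates else None


# Hypothetical data: CUDA builds published for some torch wheel, as (major, minor)
available_builds = [(11, 7), (11, 8), (12, 1)]

exact = pick_cuda_build((11, 8), available_builds)      # exact match exists
fallback = pick_cuda_build((12, 0), available_builds)   # falls back to a lower build
missing = pick_cuda_build((10, 2), available_builds)    # nothing low enough

print(exact, fallback, missing)
```

Comparing versions as integer tuples avoids the string-ordering trap where `"11.10"` would sort before `"11.8"`.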
@@ -168,8 +186,16 @@ agent {

    # optional arguments to pass to docker image
    # these are local for this agent and will not be updated in the experiment's docker_cmd section
+    # You can also pass host environments into the container with ["-e", "HOST_NAME=$HOST_NAME"]
    # extra_docker_arguments: ["--ipc=host", "-v", "/mnt/host/data:/mnt/data"]

+    # Allow the extra docker arg to override task level docker arg (if the same argument is passed on both),
+    # if set to False, a task docker arg will override the docker extra arg
+    # docker_args_extra_precedes_task: true
+
+    # prevent a task's docker args from being used if already specified in extra_docker_arguments
+    # protected_docker_extra_args: ["privileged", "security-opt", "network", "ipc"]
+
    # optional shell script to run in docker when started before the experiment is started
    # extra_docker_shell_script: ["apt-get install -y bindfs", ]

@@ -178,13 +204,19 @@ agent {
    # change to false to skip installation and decrease docker spin up time
    # docker_install_opencv_libs: true

+    # Allow passing host environments into docker container with Task's docker container args
+    # Example "-e HOST_NAME=$HOST_NAME"
+    # NOTICE: this might introduce a security risk, allowing access to keys/secrets on the host machine
+    # Use with care!
+    # docker_allow_host_environ: false
+
    # set to true in order to force "docker pull" before running an experiment using a docker image.
    # This makes sure the docker image is updated.
    docker_force_pull: false

    default_docker: {
        # default docker image to use when running in docker mode
-        image: "nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04"
+        image: "nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04"

        # optional arguments to pass to docker image
        # arguments: ["--ipc=host"]
@@ -194,7 +226,7 @@ agent {
        # enterprise version only
        # match_rules: [
        #     {
-        #         image: "nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04"
+        #         image: "nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04"
        #         arguments: "-e define=value"
        #         match: {
        #             script{
@@ -215,7 +247,7 @@ agent {
        #         }
        #     },
        #     {
-        #         image: "nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04"
+        #         image: "nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04"
        #         arguments: "-e define=value"
        #         match: {
        #             # must match all requirements (not partial)
@@ -228,8 +260,6 @@ agent {
        #                 # no repository matching required
        #                 repository: ""
        #             }
-        #             # no container image matching required (allow to replace one requested container with another)
-        #             container: ""
        #             # no repository matching required
        #             project: ""
        #         }
@@ -283,7 +313,7 @@ sdk {

    storage {
        cache {
-            # Defaults to system temp folder / cache
+            # Defaults to <system_temp_folder>/clearml_cache
            default_base_dir: "~/.clearml/cache"
        }

@@ -469,7 +499,8 @@ sdk {
 #  target_format: format used to encode contents before writing into the target file. Supported values are json,
 #                 yaml, yml and bytes (in which case the file will be written in binary mode). Default is text mode.
 #  overwrite: overwrite the target file in case it exists. Default is true.
-#
+#  mode: file-system mode to be applied to the file after its creation. The mode string will be parsed into an
+#        integer (e.g. "0o777" for -rwxrwxrwx)
 # Example:
 #   files {
 #     myfile1 {
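The new `mode` comment says the mode string is parsed into an integer, with `"0o777"` corresponding to `-rwxrwxrwx`. How the agent does the parsing is not shown in this diff; the sketch below is one way to get the same result with the Python standard library, and it confirms that the example string and the permission listing agree.

```python
import stat

mode_str = "0o777"

# Base 0 tells int() to honour the "0o" octal prefix
mode = int(mode_str, 0)

# S_IFREG marks a regular file, so filemode() renders the leading "-"
listing = stat.filemode(stat.S_IFREG | mode)

print(mode, listing)
```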
--- a/docs/screenshots.gif
+++ b/docs/screenshots.gif
--- a/docs/trains.conf
+++ b/docs/trains.conf
@@ -146,7 +146,7 @@ sdk {

    storage {
        cache {
-            # Defaults to system temp folder / cache
+            # Defaults to <system_temp_folder>/clearml_cache
            default_base_dir: "~/.clearml/cache"
        }

--- a/examples/dynamic_cloud_cluster.ipynb
+++ b/examples/dynamic_cloud_cluster.ipynb
@@ -5,27 +5,30 @@
   "metadata": {},
   "source": [
    "# Auto-Magically Spin AWS EC2 Instances On Demand \n",
-    "# and Create a Dynamic Cluster Running *Trains-Agent*\n",
+    "# and Create a Dynamic Cluster Running *ClearML-Agent*\n",
    "\n",
-    "### Define your budget and execute the notebook, that's it\n",
-    "### You now have a fully managed cluster on AWS  🎉 🎊 "
+    "## Define your budget and execute the notebook, that's it\n",
+    "## You now have a fully managed cluster on AWS  🎉 🎊"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "**trains-agent**'s main goal is to quickly pull a job from an execution queue, setup the environment (as defined in the experiment, including git cloning, python packages etc.) then execute the experiment and monitor it.\n",
+    "**clearml-agent**'s main goal is to quickly pull a job from an execution queue, set up the environment (as defined in the experiment, including git cloning, python packages etc.), then execute the experiment and monitor it.\n",
    "\n",
    "This notebook defines a cloud budget (currently only AWS is supported, but feel free to expand with PRs), and spins an instance the minute a job is waiting for execution. It will also spin down idle machines, saving you some $$$ :)\n",
    "\n",
-    "Configuration steps\n",
+    "> **Note:**\n",
+    "> This is just an example of how you can use ClearML Agent to implement custom autoscaling. For a more structured autoscaler script, see [here](https://github.com/allegroai/clearml/blob/master/clearml/automation/auto_scaler.py).\n",
+    "\n",
+    "Configuration steps:\n",
    "- Define maximum budget to be used (instance type / number of instances).\n",
-    "- Create new execution *queues* in the **trains-server**.\n",
-    "- Define mapping between the created the *queues* and an instance budget.\n",
+    "- Create new execution *queues* in the **clearml-server**.\n",
+    "- Define mapping between the created *queues* and an instance budget.\n",
    "\n",
    "**TL;DR - This notebook:**\n",
-    "- Will spin instances if there are jobs in the execution *queues*, until it will hit the budget limit. \n",
+    "- Will spin instances if there are jobs in the execution *queues* until it hits the budget limit.\n",
    "- If machines are idle, it will spin them down.\n",
    "\n",
    "The controller implementation itself is stateless, meaning you can always re-execute the notebook, if for some reason it stopped.\n",
@@ -39,7 +42,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "##### Install & import required packages"
+    "### Install & import required packages"
   ]
  },
  {
@@ -48,7 +51,7 @@
   "metadata": {},
   "outputs": [],
   "source": [
-    "!pip install trains-agent\n",
+    "!pip install clearml-agent\n",
    "!pip install boto3"
   ]
  },
@@ -56,7 +59,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "##### Define AWS instance types and configuration (Instance Type, EBS, AMI etc.)"
+    "### Define AWS instance types and configuration (Instance Type, EBS, AMI etc.)"
   ]
  },
  {
@@ -92,17 +95,17 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "##### Define machine budget per execution queue\n",
+    "### Define machine budget per execution queue\n",
    "\n",
-    "Now that we defined our budget, we need to connect it with the **Trains** cluster.\n",
+    "Now that we defined our budget, we need to connect it with the **ClearML** cluster.\n",
    "\n",
    "We map each queue to a resource type (instance type).\n",
    "\n",
-    "Create two queues in the WebUI:\n",
-    "- Browse to http://your_trains_server_ip:8080/workers-and-queues/queues\n",
+    "Create two queues in the Web UI:\n",
+    "- Browse to http://your_clearml_server_ip:8080/workers-and-queues/queues\n",
    "- Then click on the \"New Queue\" button and name your queues \"aws_normal\" and \"aws_high\" respectively\n",
    "\n",
-    "The QUEUES dictionary hold the mapping between the queue name and the type/number of instances to spin connected to the specific queue.\n",
+    "The QUEUES dictionary holds the mapping between the queue name and the type/number of instances to spin connected to the specific queue.\n",
    "```\n",
    "QUEUES = {\n",
    "    'queue_name': [(\"instance-type-as-defined-in-RESOURCE_CONFIGURATIONS\", max_number_of_instances), ]\n",
@@ -116,7 +119,7 @@
   "metadata": {},
   "outputs": [],
   "source": [
-    "# Trains-Agent Queues - Machines budget per Queue\n",
+    "# ClearML Agent Queues - Machines budget per Queue\n",
    "# Per queue: list of (machine type as defined in RESOURCE_CONFIGURATIONS,\n",
    "# max instances for the specific queue). Order machines from most preferred to least.\n",
    "QUEUES = {\n",
@@ -129,7 +132,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "##### Credentials for your AWS account, as well as for your **Trains-Server**"
+    "### Credentials for your AWS account, as well as for your **ClearML Server**"
   ]
  },
  {
@@ -143,24 +146,25 @@
    "CLOUD_CREDENTIALS_SECRET = \"\"\n",
    "CLOUD_CREDENTIALS_REGION = \"us-east-1\"\n",
    "\n",
-    "# TRAINS configuration\n",
-    "TRAINS_SERVER_WEB_SERVER = \"http://localhost:8080\"\n",
-    "TRAINS_SERVER_API_SERVER = \"http://localhost:8008\"\n",
-    "TRAINS_SERVER_FILES_SERVER = \"http://localhost:8081\"\n",
-    "# TRAINS credentials\n",
-    "TRAINS_ACCESS_KEY = \"\"\n",
-    "TRAINS_SECRET_KEY = \"\"\n",
-    "# Git User/Pass to be used by trains-agent,\n",
+    "# CLEARML configuration\n",
+    "CLEARML_WEB_SERVER = \"http://localhost:8080\"\n",
+    "CLEARML_API_SERVER = \"http://localhost:8008\"\n",
+    "CLEARML_FILES_SERVER = \"http://localhost:8081\"\n",
+    "# CLEARML credentials\n",
+    "CLEARML_API_ACCESS_KEY = \"\"\n",
+    "CLEARML_API_SECRET_KEY = \"\"\n",
+    "# Git User/Pass to be used by clearml-agent,\n",
    "# leave empty if image already contains git ssh-key\n",
-    "TRAINS_GIT_USER = \"\"\n",
-    "TRAINS_GIT_PASS = \"\"\n",
+    "CLEARML_AGENT_GIT_USER = \"\"\n",
+    "CLEARML_AGENT_GIT_PASS = \"\"\n",
    "\n",
-    "# Additional fields for trains.conf file created on the remote instance\n",
-    "# for example: 'agent.default_docker.image: \"nvidia/cuda:10.0-cudnn7-runtime\"'\n",
-    "EXTRA_TRAINS_CONF = \"\"\"\n",
+    "# Additional fields for clearml.conf file created on the remote instance\n",
+    "# for example: 'agent.default_docker.image: \"nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04\"'\n",
+    "\n",
+    "EXTRA_CLEARML_CONF = \"\"\"\n",
    "\"\"\"\n",
    "\n",
-    "# Bash script to run on instances before running trains-agent\n",
+    "# Bash script to run on instances before running clearml-agent\n",
    "# Example: \"\"\"\n",
    "# echo \"This is the first line\"\n",
    "# echo \"This is the second line\"\n",
@@ -168,9 +172,9 @@
    "EXTRA_BASH_SCRIPT = \"\"\"\n",
    "\"\"\"\n",
    "\n",
-    "# Default docker for trains-agent when running in docker mode (requires docker v19.03 and above). \n",
-    "# Leave empty to run trains-agent in non-docker mode.\n",
-    "DEFAULT_DOCKER_IMAGE = \"nvidia/cuda\""
+    "# Default docker for clearml-agent when running in docker mode (requires docker v19.03 and above).\n",
+    "# Leave empty to run clearml-agent in non-docker mode.\n",
+    "CLEARML_AGENT_DOCKER_IMAGE = \"nvidia/cuda\""
   ]
  },
  {
@@ -192,7 +196,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "##### Import Packages and Budget Definition Sanity Check"
+    "### Import Packages and Budget Definition Sanity Check"
   ]
  },
  {
@@ -209,7 +213,7 @@
    "from time import sleep, time\n",
    "\n",
    "import boto3\n",
-    "from trains_agent.backend_api.session.client import APIClient"
+    "from clearml_agent.backend_api.session.client import APIClient"
   ]
  },
  {
@@ -227,36 +231,36 @@
    "        \"A resource name can only appear in a single queue definition.\"\n",
    "    )\n",
    "\n",
-    "# Encode EXTRA_TRAINS_CONF for later bash script usage\n",
-    "EXTRA_TRAINS_CONF_ENCODED = \"\\\\\\\"\".join(EXTRA_TRAINS_CONF.split(\"\\\"\"))"
+    "# Encode EXTRA_CLEARML_CONF for later bash script usage\n",
+    "EXTRA_CLEARML_CONF_ENCODED = \"\\\\\\\"\".join(EXTRA_CLEARML_CONF.split(\"\\\"\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "##### Cloud specific implementation of spin up/down - currently supports AWS only"
+    "### Cloud specific implementation of spin up/down - currently supports AWS only"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cloud-specific implementation (currently, only AWS EC2 is supported)\n",
    "def spin_up_worker(resource, worker_id_prefix, queue_name):\n",
    "    \"\"\"\n",
-    "    Creates a new worker for trains.\n",
+    "    Creates a new worker for clearml.\n",
    "    First, create an instance in the cloud and install some required packages.\n",
-    "    Then, define trains-agent environment variables and run \n",
-    "    trains-agent for the specified queue.\n",
+    "    Then, define clearml-agent environment variables and run\n",
+    "    clearml-agent for the specified queue.\n",
    "    NOTE: - Will wait until instance is running\n",
    "          - This implementation assumes the instance image already has docker installed\n",
    "\n",
    "    :param str resource: resource name, as defined in BUDGET and QUEUES.\n",
    "    :param str worker_id_prefix: worker name prefix\n",
-    "    :param str queue_name: trains queue to listen to\n",
+    "    :param str queue_name: clearml queue to listen to\n",
    "    \"\"\"\n",
    "    resource_conf = RESOURCE_CONFIGURATIONS[resource]\n",
    "    # Add worker type and AWS instance type to the worker name.\n",
@@ -267,8 +271,8 @@
    "    )\n",
    "\n",
    "    # user_data script will automatically run when the instance is started. \n",
-    "    # It will install the required packages for trains-agent configure it using \n",
-    "    # environment variables and run trains-agent on the required queue\n",
+    "    # It will install the required packages for clearml-agent, configure it using\n",
+    "    # environment variables and run clearml-agent on the required queue\n",
    "    user_data = \"\"\"#!/bin/bash\n",
    "    sudo apt-get update\n",
    "    sudo apt-get install -y python3-dev\n",
@@ -278,36 +282,36 @@
    "    sudo apt-get install -y build-essential\n",
    "    python3 -m pip install -U pip\n",
    "    python3 -m pip install virtualenv\n",
-    "    python3 -m virtualenv trains_agent_venv\n",
-    "    source trains_agent_venv/bin/activate\n",
-    "    python -m pip install trains-agent\n",
-    "    echo 'agent.git_user=\\\"{git_user}\\\"' >> /root/trains.conf\n",
-    "    echo 'agent.git_pass=\\\"{git_pass}\\\"' >> /root/trains.conf\n",
-    "    echo \"{trains_conf}\" >> /root/trains.conf\n",
-    "    export TRAINS_API_HOST={api_server}\n",
-    "    export TRAINS_WEB_HOST={web_server}\n",
-    "    export TRAINS_FILES_HOST={files_server}\n",
+    "    python3 -m virtualenv clearml_agent_venv\n",
+    "    source clearml_agent_venv/bin/activate\n",
+    "    python -m pip install clearml-agent\n",
+    "    echo 'agent.git_user=\\\"{git_user}\\\"' >> /root/clearml.conf\n",
+    "    echo 'agent.git_pass=\\\"{git_pass}\\\"' >> /root/clearml.conf\n",
+    "    echo \"{clearml_conf}\" >> /root/clearml.conf\n",
+    "    export CLEARML_API_HOST={api_server}\n",
+    "    export CLEARML_WEB_HOST={web_server}\n",
+    "    export CLEARML_FILES_HOST={files_server}\n",
    "    export DYNAMIC_INSTANCE_ID=`curl http://169.254.169.254/latest/meta-data/instance-id`\n",
-    "    export TRAINS_WORKER_ID={worker_id}:$DYNAMIC_INSTANCE_ID\n",
-    "    export TRAINS_API_ACCESS_KEY='{access_key}'\n",
-    "    export TRAINS_API_SECRET_KEY='{secret_key}'\n",
+    "    export CLEARML_WORKER_ID={worker_id}:$DYNAMIC_INSTANCE_ID\n",
+    "    export CLEARML_API_ACCESS_KEY='{access_key}'\n",
+    "    export CLEARML_API_SECRET_KEY='{secret_key}'\n",
    "    {bash_script}\n",
    "    source ~/.bashrc\n",
-    "    python -m trains_agent --config-file '/root/trains.conf' daemon --queue '{queue}' {docker}\n",
+    "    python -m clearml_agent --config-file '/root/clearml.conf' daemon --queue '{queue}' {docker}\n",
    "    shutdown\n",
    "    \"\"\".format(\n",
-    "        api_server=TRAINS_SERVER_API_SERVER,\n",
-    "        web_server=TRAINS_SERVER_WEB_SERVER,\n",
-    "        files_server=TRAINS_SERVER_FILES_SERVER,\n",
+    "        api_server=CLEARML_API_SERVER,\n",
+    "        web_server=CLEARML_WEB_SERVER,\n",
+    "        files_server=CLEARML_FILES_SERVER,\n",
    "        worker_id=worker_id,\n",
-    "        access_key=TRAINS_ACCESS_KEY,\n",
-    "        secret_key=TRAINS_SECRET_KEY,\n",
+    "        access_key=CLEARML_API_ACCESS_KEY,\n",
+    "        secret_key=CLEARML_API_SECRET_KEY,\n",
    "        queue=queue_name,\n",
-    "        git_user=TRAINS_GIT_USER,\n",
-    "        git_pass=TRAINS_GIT_PASS,\n",
-    "        trains_conf=EXTRA_TRAINS_CONF_ENCODED,\n",
+    "        git_user=CLEARML_AGENT_GIT_USER,\n",
+    "        git_pass=CLEARML_AGENT_GIT_PASS,\n",
+    "        clearml_conf=EXTRA_CLEARML_CONF_ENCODED,\n",
    "        bash_script=EXTRA_BASH_SCRIPT,\n",
-    "        docker=\"--docker '{}'\".format(DEFAULT_DOCKER_IMAGE) if DEFAULT_DOCKER_IMAGE else \"\"\n",
+    "        docker=\"--docker '{}'\".format(CLEARML_AGENT_DOCKER_IMAGE) if CLEARML_AGENT_DOCKER_IMAGE else \"\"\n",
    "    )\n",
    "\n",
    "    ec2 = boto3.client(\n",
@@ -405,7 +409,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "###### Controller Implementation and Logic"
+    "#### Controller Implementation and Logic"
   ]
  },
  {
@@ -430,18 +434,18 @@
    "\n",
    "    # Internal definitions\n",
    "    workers_prefix = \"dynamic_aws\"\n",
-    "    # Worker's id in trains would be composed from:\n",
+    "    # Worker's id in clearml would be composed from:\n",
     "    # prefix, name, instance_type and cloud_id separated by ':'\n",
    "    workers_pattern = re.compile(\n",
    "        r\"^(?P<prefix>[^:]+):(?P<name>[^:]+):(?P<instance_type>[^:]+):(?P<cloud_id>[^:]+)\"\n",
    "    )\n",
    "\n",
-    "    # Set up the environment variables for trains\n",
-    "    os.environ[\"TRAINS_API_HOST\"] = TRAINS_SERVER_API_SERVER\n",
-    "    os.environ[\"TRAINS_WEB_HOST\"] = TRAINS_SERVER_WEB_SERVER\n",
-    "    os.environ[\"TRAINS_FILES_HOST\"] = TRAINS_SERVER_FILES_SERVER\n",
-    "    os.environ[\"TRAINS_API_ACCESS_KEY\"] = TRAINS_ACCESS_KEY\n",
-    "    os.environ[\"TRAINS_API_SECRET_KEY\"] = TRAINS_SECRET_KEY\n",
+    "    # Set up the environment variables for clearml\n",
+    "    os.environ[\"CLEARML_API_HOST\"] = CLEARML_API_SERVER\n",
+    "    os.environ[\"CLEARML_WEB_HOST\"] = CLEARML_WEB_SERVER\n",
+    "    os.environ[\"CLEARML_FILES_HOST\"] = CLEARML_FILES_SERVER\n",
+    "    os.environ[\"CLEARML_API_ACCESS_KEY\"] = CLEARML_API_ACCESS_KEY\n",
+    "    os.environ[\"CLEARML_API_SECRET_KEY\"] = CLEARML_API_SECRET_KEY\n",
    "    api_client = APIClient()\n",
    "\n",
    "    # Verify the requested queues exist and create those that doesn't exist\n",
@@ -520,7 +524,7 @@
    "            # skip resource types that might be needed\n",
    "            if resources in required_idle_resources:\n",
    "                continue\n",
-    "            # Remove from both aws and trains all instances that are \n",
+    "            # Remove from both aws and clearml all instances that are\n",
    "            # idle for longer than MAX_IDLE_TIME_MIN\n",
    "            if time() - timestamp > MAX_IDLE_TIME_MIN * 60.0:\n",
    "                cloud_id = workers_pattern.match(worker.id)[\"cloud_id\"]\n",
@@ -535,7 +539,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "##### Execute Forever* (the controller is stateless, so you can always re-execute the notebook)"
+    "### Execute Forever* (the controller is stateless, so you can always re-execute the notebook)"
   ]
  },
  {
@@ -584,4 +588,4 @@
 },
 "nbformat": 4,
 "nbformat_minor": 2
-}
+}
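The notebook's controller recovers the cloud instance ID from each worker ID using `workers_pattern`, whose fields are separated by `:`. The sketch below runs that same regular expression against a sample ID. The sample value is made up for illustration; real IDs are built from `workers_prefix`, the resource name, the instance type and the EC2 instance ID.

```python
import re

# Same pattern as in the notebook's controller cell
workers_pattern = re.compile(
    r"^(?P<prefix>[^:]+):(?P<name>[^:]+):(?P<instance_type>[^:]+):(?P<cloud_id>[^:]+)"
)

# Hypothetical worker ID; the resource name and instance ID are invented
worker_id = "dynamic_aws:aws4gpu:g4dn.4xlarge:i-0123456789abcdef0"

match = workers_pattern.match(worker_id)
fields = match.groupdict()

print(fields)
```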
--- a/requirements.txt
+++ b/requirements.txt
@@ -8,7 +8,7 @@ pyparsing>=2.0.3,<3.1.0
 python-dateutil>=2.4.2,<2.9.0
 pyjwt>=2.4.0,<2.7.0
 PyYAML>=3.12,<6.1
-requests>=2.20.0,<2.29.0
+requests>=2.20.0,<=2.31.0
 six>=1.13.0,<1.17.0
 typing>=3.6.4,<3.8.0 ; python_version < '3.5'
 urllib3>=1.21.1,<1.27.0
--- a/setup.py
+++ b/setup.py
@@ -62,6 +62,7 @@ setup(
        'Programming Language :: Python :: 3.8',
        'Programming Language :: Python :: 3.9',
        'Programming Language :: Python :: 3.10',
+        'Programming Language :: Python :: 3.11',
        'License :: OSI Approved :: Apache Software License',
    ],
Author	SHA1	Message	Date
allegroai	c9fc092f4e	Support force_system_packages argument in k8s glue class	2023-12-26 10:12:32 +02:00
allegroai	432ee395e1	Version bump to v1.7.0	2023-12-20 18:08:38 +02:00
allegroai	98fc4f0fb9	Add `agent.resource_monitoring.disk_use_path` configuration option to allow monitoring a different volume than the one containing the home folder	2023-12-20 17:49:33 +02:00
allegroai	111e774c21	Add extra_index_url sanitization in configuration printout	2023-12-20 17:49:04 +02:00
allegroai	3dd8d783e1	Fix `agent.git_host` setting will cause git@domain URLs to not be replaced by SSH URLs since furl cannot parse them to obtain host	2023-12-20 17:48:18 +02:00
allegroai	7c3e420df4	Add git clone verbosity using `CLEARML_AGENT_GIT_CLONE_VERBOSE` env var	2023-12-20 17:47:52 +02:00
allegroai	55b065a114	Update GPU stats and pynvml support	2023-12-20 17:47:19 +02:00
allegroai	faa97b6cc2	Set worker ID in k8s glue mode	2023-12-20 17:45:34 +02:00
allegroai	f5861b1e4a	Change default `agent.enable_git_ask_pass` to True	2023-12-20 17:44:41 +02:00
allegroai	030cbb69f1	Fix check if process return code is SIGKILL (-9 or 137) and abort callback was called, do not mark as failed but as aborted	2023-12-20 17:43:02 +02:00
allegroai	564f769ff7	Add `agent.docker_args_extra_precedes_task`, `agent.protected_docker_extra_args` to prevent the same switch to be used by both `extra_docker_args` and the a Task's docker args	2023-12-20 17:42:36 +02:00
pollfly	2c7f091e57	Update example (#177 ) * Edit README * Edit README * small edits * update example * update example * update example	2023-12-09 12:52:44 +02:00
allegroai	dd5d24b0ca	Add CLEARML_AGENT_TEMP_STDOUT_FILE_DIR to allow specifying temp dir used for storing agent log files and temporary log files (daemon and execution)	2023-11-14 11:45:13 +02:00
allegroai	996bb797c3	Add env var in case we're running a service task	2023-11-14 11:44:36 +02:00
allegroai	9ad49a0d21	Fix KeyError if container does not contain the arguments field	2023-11-01 15:11:07 +02:00
allegroai	ba4fee7b19	Fix agent.package_manager.poetry_install_extra_args are used in all Poetry commands and not just in install (#173 )	2023-11-01 15:10:40 +02:00
allegroai	0131db8b7d	Add support for resource_applied() callback in k8s glue Add support for sending log events with k8s-provided timestamps Refactor env vars infrastructure	2023-11-01 15:10:08 +02:00
allegroai	d2384a9a95	Add example and support for prebuilt containers including services-mode support with overrides CLEARML_AGENT_FORCE_CODE_DIR CLEARML_AGENT_FORCE_EXEC_SCRIPT	2023-11-01 15:05:57 +02:00
allegroai	5b86c230c1	Fix an environment variable that should be set with a numerical value of 0 (i.e. end up as "0" or "0.0") is set to an empty string	2023-11-01 15:04:59 +02:00
allegroai	21e4be966f	Fix recursion issue when deep-copying a session	2023-11-01 15:04:24 +02:00
allegroai	9c6cb421b3	When cleaning up pending pods, verify task is still aborted and pod is still pending before deleting the pod	2023-11-01 15:04:01 +02:00
allegroai	52405c343d	Fix k8s glue configuration might be contaminated when changed during apply	2023-11-01 15:03:37 +02:00
allegroai	46f0c991c8	Add status reason when aborting before moving to k8s_scheduler queue	2023-11-01 15:02:24 +02:00
allegroai	0254279ed5	Version bump to v1.6.1	2023-09-06 15:41:29 +03:00
allegroai	0e1750f90e	Fix requests library lower constraint breaks backwards compatibility	2023-09-06 15:40:48 +03:00
allegroai	58e0dc42ec	Version bump to v1.6.0	2023-09-05 15:05:11 +03:00
allegroai	d16825029d	Add new pytorch no resolver mode and CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE to change resolver on a Task basis, now supports "pip", "direct", "none"	2023-09-02 17:45:10 +03:00
allegroai	fb639afcb9	Fix PyTorch extra index pip resolver	2023-09-02 17:43:41 +03:00
allegroai	eefb94d1bc	Add Python 3.11 support	2023-09-02 17:42:27 +03:00
Alex Burlacu	f1e9266075	Adjust docker image versions in a couple more places	2023-08-24 19:03:24 +03:00
Alex Burlacu	e1e3c84a8d	Update docker versions	2023-08-24 19:01:26 +03:00
Alex Burlacu	ed1356976b	Move extra configurations to Worker init to make sure all available configurations can be overridden	2023-08-24 19:00:36 +03:00
Alex Burlacu	2b815354e0	Improve file mode comment	2023-08-24 18:53:00 +03:00
Alex Burlacu	edae380a9e	Version bump	2023-08-24 18:51:47 +03:00
Alex Burlacu	946e9d9ce9	Fix invalid reference	2023-08-24 18:51:27 +03:00
jday1	a56343ffc7	Upgrade requests library (#162 ) * Upgrade requests * Modified package requirement * Modified package requirement	2023-08-01 10:41:22 +03:00
allegroai	159a6e9a5a	Fix runtime property overriding existing properties	2023-07-20 10:41:15 +03:00
pollfly	6b7ee12dc1	Edit README (#156 )	2023-07-19 16:51:14 +03:00
allegroai	3838247716	Update k8s glue docker build resources	2023-07-19 16:47:50 +03:00
pollfly	6e7d35a42a	Improve configuration files (#160 )	2023-07-11 10:32:01 +03:00
allegroai	4c056a17b9	Add support for k8s jobs execution Strip docker container obtained from task in k8s apply	2023-07-04 14:45:00 +03:00
allegroai	21d98afca5	Add support for extra docker arguments referencing machines environment variables using the agent.docker_allow_host_environ configuration option to allow users to also be able to use $ENV in the task's docker arguments	2023-07-04 14:42:28 +03:00
allegroai	6a1bf11549	Fix Task docker arguments passed twice	2023-07-04 14:41:07 +03:00
allegroai	7115a9b9a7	Add CLEARML_EXTRA_PIP_INSTALL_FLAGS / agent.package_manager.extra_pip_install_flags to control additional pip install flags Fix pip version marking in "installed packages" is now preserved for and reinstalled	2023-07-04 14:39:40 +03:00
allegroai	450df2f8d3	Support skipping agent pip upgrade in container bash script using the CLEARML_AGENT_NO_UPDATE env var	2023-07-04 14:38:50 +03:00
allegroai	ccf752c4e4	Add support for setting mode on files applied by the agent	2023-07-04 14:37:58 +03:00
allegroai	3ed63e2154	Fix docker container backwards compatibility for API <2.13 Fix default docker match rules resolver (used incorrect field "container" instead of "image") Remove "container" (image) match rule option from default docker image resolver	2023-07-04 14:37:18 +03:00
allegroai	a535f93cd6	Add support for CLEARML_AGENT_FORCE_MAX_API_VERSION for testing	2023-07-04 14:35:54 +03:00
allegroai	b380ec54c6	Improve config file comments	2023-07-04 14:34:43 +03:00
allegroai	a1274299ce	Add support for CLEARML_AGENT_EXTRA_DOCKER_LABELS env var	2023-07-03 11:08:59 +03:00
allegroai	c77224af68	Add support for task field injection into container docker name	2023-07-03 11:07:12 +03:00
allegroai	95dadca45c	Refactor k8s glue running/used pods getter	2023-05-21 22:56:12 +03:00
allegroai	685918fd9b	Version bump to v1.5.3rc3	2023-05-21 22:54:38 +03:00
allegroai	bc85ddf78d	Fix pytorch direct resolve replacing wheel link with directly installed version	2023-05-21 22:53:51 +03:00
allegroai	5b5fb0b8a6	Add `agent.package_manager.pytorch_resolve` configuration setting with `pip` or `direct` values. `pip` sets extra index based on cuda and lets pip resolve, `direct` is the previous parsing algorithm that does the matching and downloading (default `pip`)	2023-05-21 22:53:11 +03:00
allegroai	fec0ce1756	Better message for agent init when an existing clearml.conf is found	2023-05-21 22:51:11 +03:00
allegroai	1e09b88b7a	Add alias `CLEARML_AGENT_DOCKER_AGENT_REPO` env var for the `FORCE_CLEARML_AGENT_REPO` env var	2023-05-21 22:50:01 +03:00
allegroai	b6ca0fa6a5	Print error on resource monitor failure	2023-05-11 16:18:11 +03:00
allegroai	307ec9213e	Fix git+ssh:// links inside installed packages not being converted properly to HTTPS authenticated and vice versa	2023-05-11 16:16:51 +03:00
allegroai	a78a25d966	Support new `Retry.DEFAULT_BACKOFF_MAX` in a backwards-compatible way	2023-05-11 16:16:18 +03:00
allegroai	ebb6231f5a	Add CLEARML_AGENT_STANDALONE_CONFIG_BC to support backwards compatibility in standalone mode	2023-05-11 16:15:06 +03:00
pollfly	e1d65cb280	Update clearml-agent gif (#137 )	2023-04-10 10:58:10 +03:00
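Commit 564f769ff7 in the list above adds `agent.protected_docker_extra_args`, which keeps a task from reusing a switch that is already set in `extra_docker_arguments`. The sketch below illustrates that filtering idea only. The helpers `switch_name` and `drop_protected` are hypothetical, the agent's actual implementation is not shown here, and the sketch assumes task switches use the `--switch` or `--switch=value` form (a separate value, as in `--network host`, is not handled).

```python
def switch_name(arg):
    """'--ipc=host' -> 'ipc'; returns None for values that are not switches."""
    if not arg.startswith("-"):
        return None
    return arg.lstrip("-").split("=", 1)[0]


def drop_protected(task_args, extra_args, protected):
    """Remove task switches that are protected and already set in extra_args."""
    extra_switches = {switch_name(arg) for arg in extra_args}
    blocked = extra_switches.intersection(protected)
    return [arg for arg in task_args if switch_name(arg) not in blocked]


protected = ["privileged", "security-opt", "network", "ipc"]
extra_args = ["--ipc=host", "-v", "/mnt/host/data:/mnt/data"]
task_args = ["--ipc=private", "--shm-size=8g"]

# "--ipc" is protected and already set by the agent, so the task's copy is dropped
filtered = drop_protected(task_args, extra_args, protected)
print(filtered)
```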