Version bump to v1.6.0

Add new pytorch no resolver mode and CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE to change resolver on a Task basis, now supports "pip", "direct", "none"
Fix PyTorch extra index pip resolver
2025-06-26 18:16:15 +00:00 · 2023-09-05 15:05:11 +03:00 · 2023-09-02 17:45:10 +03:00 · 2023-09-02 17:43:41 +03:00 · 2023-09-02 17:42:27 +03:00 · 2023-08-24 19:03:24 +03:00
42 changed files with 1237 additions and 500 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -11,3 +11,6 @@ build/
 dist/
 *.egg-info

+# VSCode
+.vscode
+
--- a/README.md
+++ b/README.md
@@ -24,8 +24,7 @@ ML-Ops scheduler & orchestration solution supporting Linux, macOS and Windows**
 * Launch-and-Forget service containers
 * [Cloud autoscaling](https://clear.ml/docs/latest/docs/guides/services/aws_autoscaler)
 * [Customizable cleanup](https://clear.ml/docs/latest/docs/guides/services/cleanup_service)
-*
-Advanced [pipeline building and execution](https://clear.ml/docs/latest/docs/guides/frameworks/pytorch/notebooks/table/tabular_training_pipeline)
+* Advanced [pipeline building and execution](https://clear.ml/docs/latest/docs/guides/frameworks/pytorch/notebooks/table/tabular_training_pipeline)

 It is a zero configuration fire-and-forget execution agent, providing a full ML/DL cluster solution.

@@ -35,8 +34,8 @@ It is a zero configuration fire-and-forget execution agent, providing a full ML/
   or [free tier hosting](https://app.clear.ml)
 2. `pip install clearml-agent` ([install](#installing-the-clearml-agent) the ClearML Agent on any GPU machine:
   on-premises / cloud / ...)
-3. Create a [job](https://github.com/allegroai/clearml/docs/clearml-task.md) or
-   Add [ClearML](https://github.com/allegroai/clearml) to your code with just 2 lines
+3. Create a [job](https://clear.ml/docs/latest/docs/apps/clearml_task) or
+   add [ClearML](https://github.com/allegroai/clearml) to your code with just 2 lines of code
 4. Change the [parameters](#using-the-clearml-agent) in the UI & schedule for [execution](#using-the-clearml-agent) (or
   automate with an [AutoML pipeline](#automl-and-orchestration-pipelines-))
 5. :chart_with_downwards_trend: :chart_with_upwards_trend: :eyes:  :beer:
@@ -81,21 +80,21 @@ Find Dockerfiles in the [docker](./docker) dir and a helm Chart in https://githu

 **Two K8s integration flavours**

- Spin ClearML-Agent as a long-lasting service pod
-    - use [clearml-agent](https://hub.docker.com/r/allegroai/clearml-agent) docker image
+- Spin ClearML-Agent as a long-lasting service pod:
+    - Use [clearml-agent](https://hub.docker.com/r/allegroai/clearml-agent) docker image
    - map docker socket into the pod (soon replaced by [podman](https://github.com/containers/podman))
-    - allow the clearml-agent to manage sibling dockers
-    - benefits: full use of the ClearML scheduling, no need to worry about wrong container images / lost pods etc.
-    - downside: Sibling containers
- Kubernetes Glue, map ClearML jobs directly to K8s jobs
+    - Allow the clearml-agent to manage sibling dockers
+    - Benefits: full use of the ClearML scheduling, no need to worry about wrong container images / lost pods etc.
+    - Downside: sibling containers
+- Kubernetes Glue, map ClearML jobs directly to K8s jobs:
    - Run the [clearml-k8s glue](https://github.com/allegroai/clearml-agent/blob/master/examples/k8s_glue_example.py) on
      a K8s cpu node
    - The clearml-k8s glue pulls jobs from the ClearML job execution queue and prepares a K8s job (based on provided
      yaml template)
    - Inside the pod itself the clearml-agent will install the job (experiment) environment and spin and monitor the
      experiment's process
-    - benefits: Kubernetes full view of all running jobs in the system
-    - downside: No real scheduling (k8s scheduler), no docker image verification (post-mortem only)
+    - Benefits: Kubernetes full view of all running jobs in the system
+    - Downside: No real scheduling (k8s scheduler), no docker image verification (post-mortem only)

 ### Using the ClearML Agent

@@ -110,15 +109,15 @@ A previously run experiment can be put into 'Draft' state by either of two metho

 * Using the **'Reset'** action from the experiment right-click context menu in the ClearML UI - This will clear any
  results and artifacts the previous run had created.
-* Using the **'Clone'** action from the experiment right-click context menu in the ClearML UI - This will create a new '
-  Draft' experiment with the same configuration as the original experiment.
+* Using the **'Clone'** action from the experiment right-click context menu in the ClearML UI - This will create a new 
+  'Draft' experiment with the same configuration as the original experiment.

 An experiment is scheduled for execution using the **'Enqueue'** action from the experiment right-click context menu in
 the ClearML UI and selecting the execution queue.

 See [creating an experiment and enqueuing it for execution](#from-scratch).

-Once an experiment is enqueued, it will be picked up and executed by a ClearML agent monitoring this queue.
+Once an experiment is enqueued, it will be picked up and executed by a ClearML Agent monitoring this queue.

 The ClearML UI Workers & Queues page provides ongoing execution information:

@@ -170,22 +169,22 @@ clearml-agent init
 ```

 Note: The ClearML Agent uses a cache folder to cache pip packages, apt packages and cloned repositories. The default
-ClearML Agent cache folder is `~/.clearml`
+ClearML Agent cache folder is `~/.clearml`.

-See full details in your configuration file at `~/clearml.conf`
+See full details in your configuration file at `~/clearml.conf`.

-Note: The **ClearML agent** extends the **ClearML** configuration file `~/clearml.conf`
+Note: The **ClearML Agent** extends the **ClearML** configuration file `~/clearml.conf`.
 They are designed to share the same configuration file, see example [here](docs/clearml.conf)

 #### Running the ClearML Agent

-For debug and experimentation, start the ClearML agent in `foreground` mode, where all the output is printed to screen
+For debug and experimentation, start the ClearML agent in `foreground` mode, where all the output is printed to screen:

 ```bash
 clearml-agent daemon --queue default --foreground
 ```

-For actual service mode, all the stdout will be stored automatically into a temporary file (no need to pipe)
+For actual service mode, all the stdout will be stored automatically into a temporary file (no need to pipe).
 Notice: with `--detached` flag, the *clearml-agent* will be running in the background

 ```bash
@@ -195,20 +194,21 @@ clearml-agent daemon --detached --queue default
 GPU allocation is controlled via the standard OS environment `NVIDIA_VISIBLE_DEVICES` or `--gpus` flag (or disabled
 with `--cpu-only`).

-If no flag is set, and `NVIDIA_VISIBLE_DEVICES` variable doesn't exist, all GPU's will be allocated for
-the `clearml-agent` <br>
+If no flag is set, and `NVIDIA_VISIBLE_DEVICES` variable doesn't exist, all GPUs will be allocated for
+the `clearml-agent`. <br>
 If `--cpu-only` flag is set, or `NVIDIA_VISIBLE_DEVICES="none"`, no gpu will be allocated for
-the `clearml-agent`
+the `clearml-agent`.

-Example: spin two agents, one per gpu on the same machine:
-Notice: with `--detached` flag, the *clearml-agent* will be running in the background
+Example: spin two agents, one per GPU on the same machine:
+
+Notice: with `--detached` flag, the *clearml-agent* will run in the background

 ```bash
 clearml-agent daemon --detached --gpus 0 --queue default
 clearml-agent daemon --detached --gpus 1 --queue default
 ```

-Example: spin two agents, pulling from dedicated `dual_gpu` queue, two gpu's per agent
+Example: spin two agents, pulling from dedicated `dual_gpu` queue, two GPUs per agent

 ```bash
 clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu
@@ -223,43 +223,43 @@ For debug and experimentation, start the ClearML agent in `foreground` mode, whe
 clearml-agent daemon --queue default --docker --foreground
 ```

-For actual service mode, all the stdout will be stored automatically into a file (no need to pipe)
-Notice: with `--detached` flag, the *clearml-agent* will be running in the background
+For actual service mode, all the stdout will be stored automatically into a file (no need to pipe).
+Notice: with `--detached` flag, the *clearml-agent* will run in the background

 ```bash
 clearml-agent daemon --detached --queue default --docker
 ```

-Example: spin two agents, one per gpu on the same machine, with default nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
+Example: spin two agents, one per gpu on the same machine, with default `nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04`
 docker:

 ```bash
-clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
-clearml-agent daemon --detached --gpus 1 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
+clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04
+clearml-agent daemon --detached --gpus 1 --queue default --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04
 ```

-Example: spin two agents, pulling from dedicated `dual_gpu` queue, two gpu's per agent, with default nvidia/cuda:
-10.1-cudnn7-runtime-ubuntu18.04 docker:
+Example: spin two agents, pulling from dedicated `dual_gpu` queue, two GPUs per agent, with default 
+`nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04` docker:

 ```bash
-clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
-clearml-agent daemon --detached --gpus 2,3 --queue dual_gpu --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
+clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04
+clearml-agent daemon --detached --gpus 2,3 --queue dual_gpu --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04
 ```

 ##### Starting the ClearML Agent - Priority Queues

 Priority Queues are also supported, example use case:

-High priority queue: `important_jobs`  Low priority queue: `default`
+High priority queue: `important_jobs`, low priority queue: `default`

 ```bash
 clearml-agent daemon --queue important_jobs default
 ```

-The **ClearML Agent** will first try to pull jobs from the `important_jobs` queue, only then it will fetch a job from
-the `default` queue.
+The **ClearML Agent** will first try to pull jobs from the `important_jobs` queue, and only if it is empty, the agent 
+will try to pull from the `default` queue.

-Adding queues, managing job order within a queue and moving jobs between queues, is available using the Web UI, see
+Adding queues, managing job order within a queue, and moving jobs between queues, is available using the Web UI, see
 example on our [free server](https://app.clear.ml/workers-and-queues/queues)

 ##### Stopping the ClearML Agent
@@ -268,7 +268,7 @@ To stop a **ClearML Agent** running in the background, run the same command line
 appended. For example, to stop the first of the above shown same machine, single gpu agents:

 ```bash
-clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04 --stop
+clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04 --stop
 ```

 ### How do I create an experiment on the ClearML Server? <a name="from-scratch"></a>
@@ -279,32 +279,33 @@ clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:10
    - Git repository link and commit ID (or an entire jupyter notebook)
    - Git diff (we’re not saying you never commit and push, but still...)
    - Python packages used by your code (including specific versions used)
-    - Hyper-Parameters
-    - Input Artifacts
+    - Hyperparameters
+    - Input artifacts

  You now have a 'template' of your experiment with everything required for automated execution

-* In the ClearML UI, Right-click on the experiment and select 'clone'. A copy of your experiment will be created.
+* In the ClearML UI, right-click on the experiment and select 'clone'. A copy of your experiment will be created.
 * You now have a new draft experiment cloned from your original experiment, feel free to edit it
-    - Change the Hyper-Parameters
+    - Change the hyperparameters
    - Switch to the latest code base of the repository
    - Update package versions
    - Select a specific docker image to run in (see docker execution mode section)
    - Or simply change nothing to run the same experiment again...
-* Schedule the newly created experiment for execution: Right-click the experiment and select 'enqueue'
+* Schedule the newly created experiment for execution: right-click the experiment and select 'enqueue'

 ### ClearML-Agent Services Mode <a name="services"></a>

 ClearML-Agent Services is a special mode of ClearML-Agent that provides the ability to launch long-lasting jobs that
 previously had to be executed on local / dedicated machines. It allows a single agent to launch multiple dockers (Tasks)
-for different use cases. To name a few use cases, auto-scaler service (spinning instances when the need arises and the
-budget allows), Controllers (Implementing pipelines and more sophisticated DevOps logic), Optimizer (such as
-Hyper-parameter Optimization or sweeping), and Application (such as interactive Bokeh apps for increased data
-transparency)
+for different use cases: 
+* Auto-scaler service (spinning instances when the need arises and the budget allows)
+* Controllers (Implementing pipelines and more sophisticated DevOps logic)
+* Optimizer (such as Hyperparameter Optimization or sweeping)
+* Application (such as interactive Bokeh apps for increased data transparency)

 ClearML-Agent Services mode will spin **any** task enqueued into the specified queue. Every task launched by
 ClearML-Agent Services will be registered as a new node in the system, providing tracking and transparency capabilities.
-Currently clearml-agent in services-mode supports cpu only configuration. ClearML-agent services mode can be launched
+Currently, clearml-agent in services-mode supports CPU only configuration. ClearML-Agent services mode can be launched
 alongside GPU agents.

 ```bash
@@ -321,15 +322,15 @@ ClearML package.
 Sample AutoML & Orchestration examples can be found in the
 ClearML [example/automation](https://github.com/allegroai/clearml/tree/master/examples/automation) folder.

-AutoML examples
+AutoML examples:

 - [Toy Keras training experiment](https://github.com/allegroai/clearml/blob/master/examples/optimization/hyper-parameter-optimization/base_template_keras_simple.py)
    - In order to create an experiment-template in the system, this code must be executed once manually
 - [Random Search over the above Keras experiment-template](https://github.com/allegroai/clearml/blob/master/examples/automation/manual_random_param_search_example.py)
-    - This example will create multiple copies of the Keras experiment-template, with different hyper-parameter
+    - This example will create multiple copies of the Keras experiment-template, with different hyperparameter
      combinations

-Experiment Pipeline examples
+Experiment Pipeline examples:

 - [First step experiment](https://github.com/allegroai/clearml/blob/master/examples/automation/task_piping_example.py)
    - This example will "process data", and once done, will launch a copy of the 'second step' experiment-template
--- a/clearml_agent/backend_api/config/default/agent.conf
+++ b/clearml_agent/backend_api/config/default/agent.conf
@@ -80,6 +80,17 @@
        # additional artifact repositories to use when installing python packages
        # extra_index_url: ["https://allegroai.jfrog.io/clearml/api/pypi/public/simple"]

+        # control the pytorch wheel resolving algorithm, options are: "pip", "direct", "none"
+        # Override with environment variable CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE
+        # "pip" (default): would automatically detect the cuda version, and supply pip with the correct
+        # extra-index-url, based on pytorch.org tables
+        # "direct": would resolve a direct link to the pytorch wheel by parsing the pytorch.org pip repository
+        # and matching the automatically detected cuda version with the required pytorch wheel.
+        # if the exact cuda version is not found for the required pytorch wheel, it will try
+        # a lower cuda version until a match is found
+        # "none": No resolver used, install pytorch like any other package
+        # pytorch_resolve: "pip"
+
        # additional conda channels to use when installing with conda package manager
        conda_channels: ["pytorch", "conda-forge", "defaults", ]

@@ -88,19 +99,23 @@
        # force_repo_requirements_txt: false

        # set the priority packages to be installed before the rest of the required packages
+        # Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
        # priority_packages: ["cython", "numpy", "setuptools", ]

        # set the optional priority packages to be installed before the rest of the required packages,
        # In case a package installation fails, the package will be ignored,
        # and the virtual environment process will continue
+        # Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
        priority_optional_packages: ["pygobject", ]

        # set the post packages to be installed after all the rest of the required packages
+        # Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
        # post_packages: ["horovod", ]

        # set the optional post packages to be installed after all the rest of the required packages,
        # In case a package installation fails, the package will be ignored,
        # and the virtual environment process will continue
+        # Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
        # post_optional_packages: []

        # set to True to support torch nightly build installation,
@@ -192,7 +207,7 @@

    default_docker: {
        # default docker image to use when running in docker mode
-        image: "nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04"
+        image: "nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04"

        # optional arguments to pass to docker image
        # arguments: ["--ipc=host", ]
@@ -259,10 +274,15 @@

    # Name docker containers created by the daemon using the following string format (supported from Docker 0.6.5)
    # Allowed variables are task_id, worker_id and rand_string (random lower-case letters string, up to 32 characters)
+    # Custom variables may be specified using the docker_container_name_format_fields option.
    # Note: resulting name must start with an alphanumeric character and
    #       continue with alphanumeric characters, underscores (_), dots (.) and/or dashes (-)
    # docker_container_name_format: "clearml-id-{task_id}-{rand_string:.8}"

+    # Specify custom variables for the docker_container_name_format option using a mapping of variable name
+    # to a (nested) task field (using "." as a task field separator, digits specify array index)
+    # docker_container_name_format_fields: { foo: "bar.moo" }
+
    # Apply top-level environment section from configuration into os.environ
    apply_environment: true
    # Top-level environment section is in the form of:
@@ -283,6 +303,8 @@
    #  target_format: format used to encode contents before writing into the target file. Supported values are json,
    #                 yaml, yml and bytes (in which case the file will be written in binary mode). Default is text mode.
    #  overwrite: overwrite the target file in case it exists. Default is true.
+    #  mode: file-system mode to be applied to the file after its creation. The mode string will be parsed into an
+    #        integer (e.g. "0o777" for -rwxrwxrwx)
    #
    # Example:
    #   files {
@@ -348,7 +370,7 @@
    # Notice: Matching is done via regular expression, for example "^searchme$" will match exactly "searchme$" string
    #
    #     "default_docker": {
-    #         "image": "nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04",
+    #         "image": "nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04",
    #         # optional arguments to pass to docker image
    #         # arguments: ["--ipc=host", ]
    #         "match_rules": [
@@ -369,13 +391,6 @@
    #                 }
    #             },
    #             {
-    #                 "image": "better_container:tag",
-    #                 "arguments": "",
-    #                 "match": {
-    #                     "container": "replace_me_please"
-    #                 }
-    #             },
-    #             {
    #                 "image": "another_container:tag",
    #                 "arguments": "",
    #                 "match": {
--- a/clearml_agent/backend_api/config/default/sdk.conf
+++ b/clearml_agent/backend_api/config/default/sdk.conf
@@ -3,7 +3,7 @@

    storage {
        cache {
-            # Defaults to system temp folder / cache
+            # Defaults to <system_temp_folder>/clearml_cache
            default_base_dir: "~/.clearml/cache"
            size {
                # max_used_bytes = -1
--- a/clearml_agent/backend_api/session/defs.py
+++ b/clearml_agent/backend_api/session/defs.py
@@ -20,6 +20,7 @@ ENV_PROPAGATE_EXITCODE = EnvEntry("CLEARML_AGENT_PROPAGATE_EXITCODE", type=bool,
 ENV_INITIAL_CONNECT_RETRY_OVERRIDE = EnvEntry(
    'CLEARML_AGENT_INITIAL_CONNECT_RETRY_OVERRIDE', default=True, converter=safe_text_to_bool
 )
+ENV_FORCE_MAX_API_VERSION = EnvEntry("CLEARML_AGENT_FORCE_MAX_API_VERSION", type=str)

 """
 Experimental option to set the request method for all API requests and auth login.
--- a/clearml_agent/backend_api/session/session.py
+++ b/clearml_agent/backend_api/session/session.py
@@ -16,11 +16,11 @@ from requests.auth import HTTPBasicAuth
 from six.moves.urllib.parse import urlparse, urlunparse

 from clearml_agent.external.pyhocon import ConfigTree, ConfigFactory
-
 from .callresult import CallResult
 from .defs import (
    ENV_VERBOSE, ENV_HOST, ENV_ACCESS_KEY, ENV_SECRET_KEY, ENV_WEB_HOST, ENV_FILES_HOST, ENV_AUTH_TOKEN,
-    ENV_NO_DEFAULT_SERVER, ENV_DISABLE_VAULT_SUPPORT, ENV_INITIAL_CONNECT_RETRY_OVERRIDE, ENV_API_DEFAULT_REQ_METHOD, )
+    ENV_NO_DEFAULT_SERVER, ENV_DISABLE_VAULT_SUPPORT, ENV_INITIAL_CONNECT_RETRY_OVERRIDE, ENV_API_DEFAULT_REQ_METHOD,
+    ENV_FORCE_MAX_API_VERSION)
 from .request import Request, BatchRequest
 from .token_manager import TokenManager
 from ..config import load
@@ -28,7 +28,6 @@ from ..utils import get_http_session_with_retry, urllib_log_warning_setup
 from ...backend_config.environment import backward_compatibility_support
 from ...version import __version__

-
 sys_random = SystemRandom()


@@ -64,6 +63,7 @@ class Session(TokenManager):
    default_files = "https://demofiles.demo.clear.ml"
    default_key = "EGRTCO8JMSIGI6S39GTP43NFWXDQOW"
    default_secret = "x!XTov_G-#vspE*Y(h$Anm&DIc5Ou-F)jsl$PdOyj5wG1&E!Z8"
+    force_max_api_version = ENV_FORCE_MAX_API_VERSION.get()

    # TODO: add requests.codes.gateway_timeout once we support async commits
    _retry_codes = [
@@ -199,6 +199,12 @@ class Session(TokenManager):
        # notice: this is across the board warning omission
        urllib_log_warning_setup(total_retries=http_retries_config.get('total', 0), display_warning_after=3)

+        if self.force_max_api_version and self.check_min_api_version(self.force_max_api_version):
+            print("Using forced API version {}".format(self.force_max_api_version))
+            Session.max_api_version = Session.api_version = str(self.force_max_api_version)
+
+        self.pre_vault_config = None
+
    def _setup_session(self, http_retries_config, initial_session=False, default_initial_connect_override=None):
        # type: (dict, bool, Optional[bool]) -> (dict, requests.Session)
        http_retries_config = http_retries_config or self.config.get(
@@ -250,7 +256,11 @@ class Session(TokenManager):
        def parse(vault):
            # noinspection PyBroadException
            try:
-                d = vault.get('data', None)
+                print("Loaded {} vault: {}".format(
+                    vault.get("scope", ""),
+                    (vault.get("description", None) or "")[:50] or vault.get("id", ""))
+                )
+                d = vault.get("data", None)
                if d:
                    r = ConfigFactory.parse_string(d)
                    if isinstance(r, (ConfigTree, dict)):
@@ -266,6 +276,7 @@ class Session(TokenManager):
                vaults = res.json().get("data", {}).get("vaults", [])
                data = list(filter(None, map(parse, vaults)))
                if data:
+                    self.pre_vault_config = self.config.copy()
                    self.config.set_overrides(*data)
                    return True
            elif res.status_code != 404:
--- a/clearml_agent/backend_api/utils.py
+++ b/clearml_agent/backend_api/utils.py
@@ -86,7 +86,10 @@ def get_http_session_with_retry(
    session = requests.Session()

    if backoff_max is not None:
-        Retry.BACKOFF_MAX = backoff_max
+        if "BACKOFF_MAX" in vars(Retry):
+            Retry.BACKOFF_MAX = backoff_max
+        else:
+            Retry.DEFAULT_BACKOFF_MAX = backoff_max

    retry = Retry(
        total=total, connect=connect, read=read, redirect=redirect, status=status,
--- a/clearml_agent/backend_config/config.py
+++ b/clearml_agent/backend_config/config.py
@@ -297,6 +297,9 @@ class Config(object):
    def put(self, key, value):
        self._config.put(key, value)

+    def pop(self, key, default=None):
+        return self._config.pop(key, default=default)
+
    def to_dict(self):
        return self._config.as_plain_ordered_dict()

--- a/clearml_agent/backend_config/utils.py
+++ b/clearml_agent/backend_config/utils.py
@@ -52,6 +52,7 @@ def apply_files(config):
        target_fmt = data.get("target_format", "string")
        overwrite = bool(data.get("overwrite", True))
        contents = data.get("contents")
+        mode = data.get("mode")

        target = Path(expanduser(expandvars(path)))

@@ -110,3 +111,14 @@ def apply_files(config):
        except Exception as ex:
            print("Skipped [{}]: failed saving file {} ({})".format(key, target, ex))
            continue
+
+        try:
+            if mode:
+                if isinstance(mode, int):
+                    mode = int(str(mode), 8)
+                else:
+                    mode = int(mode, 8)
+                target.chmod(mode)
+        except Exception as ex:
+            print("Skipped [{}]: failed setting mode {} for {} ({})".format(key, mode, target, ex))
+            continue
--- a/clearml_agent/commands/config.py
+++ b/clearml_agent/commands/config.py
@@ -44,7 +44,7 @@ def main():

    if conf_file.exists() and conf_file.is_file() and conf_file.stat().st_size > 0:
        print('Configuration file already exists: {}'.format(str(conf_file)))
-        print('Leaving setup, feel free to edit the configuration file.')
+        print('Leaving setup. If you\'ve previously initialized the ClearML SDK on this machine, manually add an \'agent\' section to this file.')
        return

    print(description, end='')
--- a/clearml_agent/commands/resolver.py
+++ b/clearml_agent/commands/resolver.py
@@ -109,15 +109,15 @@ def resolve_default_container(session, task_id, container_config):
                    match.get('script.binary', None), entry))
                continue

-        if match.get('container', None):
-            # noinspection PyBroadException
-            try:
-                if not re.search(match.get('container', None), requested_container.get('image', '')):
-                    continue
-            except Exception:
-                print('Failed parsing regular expression \"{}\" in rule: {}'.format(
-                    match.get('container', None), entry))
-                continue
+        # if match.get('image', None):
+        #     # noinspection PyBroadException
+        #     try:
+        #         if not re.search(match.get('image', None), requested_container.get('image', '')):
+        #             continue
+        #     except Exception:
+        #         print('Failed parsing regular expression \"{}\" in rule: {}'.format(
+        #             match.get('image', None), entry))
+        #         continue

        matched = True
        for req_section in ['script.requirements.pip', 'script.requirements.conda']:
@@ -156,8 +156,8 @@ def resolve_default_container(session, task_id, container_config):
            break

        if matched:
-            if not container_config.get('container'):
-                container_config['container'] = entry.get('image', None)
+            if not container_config.get('image'):
+                container_config['image'] = entry.get('image', None)
            if not container_config.get('arguments'):
                container_config['arguments'] = entry.get('arguments', None)
                container_config['arguments'] = shlex.split(str(container_config.get('arguments') or '').strip())
--- a/clearml_agent/commands/worker.py
+++ b/clearml_agent/commands/worker.py
@@ -12,6 +12,7 @@ import shlex
 import shutil
 import signal
 import string
+import socket
 import subprocess
 import sys
 import traceback
@@ -24,7 +25,7 @@ from functools import partial
 from os.path import basename
 from tempfile import mkdtemp, NamedTemporaryFile
 from time import sleep, time
-from typing import Text, Optional, Any, Tuple, List
+from typing import Text, Optional, Any, Tuple, List, Dict, Mapping, Union

 import attr
 import six
@@ -40,6 +41,7 @@ from clearml_agent.backend_api.session import CallResult, Request
 from clearml_agent.backend_api.session.defs import (
    ENV_ENABLE_ENV_CONFIG_SECTION, ENV_ENABLE_FILES_CONFIG_SECTION,
    ENV_VENV_CONFIGURED, ENV_PROPAGATE_EXITCODE, )
+from clearml_agent.backend_config import Config
 from clearml_agent.backend_config.defs import UptimeConf
 from clearml_agent.backend_config.utils import apply_environment, apply_files
 from clearml_agent.backend_config.converters import text_to_int
@@ -71,6 +73,9 @@ from clearml_agent.definitions import (
    ENV_DOCKER_ARGS_FILTERS,
    ENV_FORCE_SYSTEM_SITE_PACKAGES,
    ENV_SERVICES_DOCKER_RESTART,
+    ENV_CONFIG_BC_IN_STANDALONE,
+    ENV_FORCE_DOCKER_AGENT_REPO,
+    ENV_EXTRA_DOCKER_LABELS,
 )
 from clearml_agent.definitions import WORKING_REPOSITORY_DIR, PIP_EXTRA_INDICES
 from clearml_agent.errors import (
@@ -316,6 +321,37 @@ def get_next_task(session, queue, get_task_info=False):
    return data


+def get_task_fields(session, task_id, fields: list, log=None) -> dict:
+    """
+    Returns dict with Task docker container setup {container: '', arguments: '', setup_shell_script: ''}
+    """
+    result = session.send_request(
+        service='tasks',
+        action='get_all',
+        json={'id': [task_id], 'only_fields': list(fields), 'search_hidden': True},
+        method=Request.def_method,
+        async_enable=False,
+    )
+    # noinspection PyBroadException
+    try:
+        results = {}
+        result = result.json()['data']['tasks'][0]
+        for field in fields:
+            cur = result
+            for part in field.split("."):
+                if part.isdigit():
+                    cur = cur[part]
+                else:
+                    cur = cur.get(part, {})
+            results[field] = cur
+        return results
+    except Exception as ex:
+        if log:
+            log.error("Failed obtaining values for task fields {}: {}", fields, ex)
+        pass
+    return {}
+
+
 def get_task_container(session, task_id):
    """
    Returns dict with Task docker container setup {container: '', arguments: '', setup_shell_script: ''}
@@ -333,20 +369,25 @@ def get_task_container(session, task_id):
            container = result.json()['data']['tasks'][0]['container'] if result.ok else {}
            if container.get('arguments'):
                container['arguments'] = shlex.split(str(container.get('arguments')).strip())
+            if container.get('image'):
+                container['image'] = container.get('image').strip()
        except (ValueError, TypeError):
            container = {}
    else:
        response = get_task(session, task_id, only_fields=["execution.docker_cmd"])
-        task_docker_cmd_parts = shlex.split(str(response.execution.docker_cmd or '').strip())
-        try:
-            container = dict(
-                container=task_docker_cmd_parts[0],
-                arguments=task_docker_cmd_parts[1:] if len(task_docker_cmd_parts[0]) > 1 else ''
-            )
-        except (ValueError, TypeError):
-            container = {}
+        container = {}
+        if response.execution:
+            task_docker_cmd_parts = shlex.split(str(response.execution.docker_cmd or '').strip())
+            if task_docker_cmd_parts:
+                try:
+                    container = dict(
+                        image=task_docker_cmd_parts[0],
+                        arguments=task_docker_cmd_parts[1:] if len(task_docker_cmd_parts[0]) > 1 else ''
+                    )
+                except (ValueError, TypeError):
+                    pass

-    if (not container or not container.get('container')) and session.check_min_api_version("2.13"):
+    if (not container or not container.get('image')) and session.check_min_api_version("2.13"):
        container = resolve_default_container(session=session, task_id=task_id, container_config=container)

    return container
@@ -597,6 +638,8 @@ class Worker(ServiceCommandSection):
    _docker_fixed_user_cache = '/clearml_agent_cache'
    _temp_cleanup_list = []

+    hostname_task_runtime_prop = "_exec_agent_hostname"
+
    @property
    def service(self):
        """ Worker command service endpoint """
@@ -622,9 +665,13 @@ class Worker(ServiceCommandSection):
        self.log = self._session.get_logger(__name__)
        self.register_signal_handler()
        self._worker_registered = False
+
+        self._apply_extra_configuration()
+
        self.is_conda = is_conda(self._session.config)  # type: bool
        # Add extra index url - system wide
        extra_url = None
+        # noinspection PyBroadException
        try:
            if self._session.config.get("agent.package_manager.extra_index_url", None):
                extra_url = self._session.config.get("agent.package_manager.extra_index_url", [])
@@ -812,6 +859,31 @@ class Worker(ServiceCommandSection):
        # "Running task '{}'".format(task_id)
        print(self._task_logging_start_message.format(task_id))
        task_session = task_session or self._session
+
+        # noinspection PyBroadException
+        try:
+            result = task_session.send_request(
+                service='tasks',
+                action='get_all',
+                version='2.15',
+                method=Request.def_method,
+                json={'id': [task_id], 'only_fields': ["runtime"], 'search_hidden': True}
+            )
+
+            runtime = result.json().get("data", {}).get("tasks", [])[0].get("runtime") or {}
+            runtime[self.hostname_task_runtime_prop] = socket.gethostname()
+
+            res = task_session.send_request(
+                service='tasks', action='edit', method=Request.def_method,
+                json={
+                    "task": task_id, "force": True, "runtime": runtime
+                },
+            )
+            if not res.ok:
+                raise Exception("failed setting runtime property")
+        except Exception as ex:
+            print("Warning: failed obtaining/setting hostname for task '{}': {}".format(task_id, ex))
+
        # set task status to in_progress so we know it was popped from the queue
        # noinspection PyBroadException
        try:
@@ -887,11 +959,21 @@ class Worker(ServiceCommandSection):

            name_format = self._session.config.get('agent.docker_container_name_format', None)
            if name_format:
+                custom_fields = {}
+                name_format_fields = self._session.config.get('agent.docker_container_name_format_fields', None)
+                if name_format_fields:
+                    field_values = get_task_fields(task_session, task_id, name_format_fields.values(), log=self.log)
+                    custom_fields = {
+                        k: field_values.get(v)
+                        for k, v in name_format_fields.items()
+                    }
+
                try:
                    name = name_format.format(
                        task_id=re.sub(r'[^a-zA-Z0-9._-]', '-', task_id),
                        worker_id=re.sub(r'[^a-zA-Z0-9._-]', '-', worker_id),
-                        rand_string="".join(sys_random.choice(string.ascii_lowercase) for _ in range(32))
+                        rand_string="".join(sys_random.choice(string.ascii_lowercase) for _ in range(32)),
+                        **custom_fields,
                    )
                except Exception as ex:
                    print("Warning: failed generating docker container name: {}".format(ex))
@@ -1459,8 +1541,6 @@ class Worker(ServiceCommandSection):
        return self._resolve_queue_names(queues=queues, create_if_missing=create_if_missing)

    def daemon(self, queues, log_level, foreground=False, docker=False, detached=False, order_fairness=False, **kwargs):
-        self._apply_extra_configuration()
-
        # check that we have docker command if we need it
        if docker not in (False, None) and not check_if_command_exists("docker"):
            raise ValueError("Running in Docker mode, 'docker' command was not found")
@@ -2001,19 +2081,26 @@ class Worker(ServiceCommandSection):

    def _apply_extra_configuration(self):
        # store a few things we updated in runtime (TODO: we should list theme somewhere)
-        agent_config = self._session.config["agent"].copy()
+        vault_loaded = False
+        session = self._session
+        agent_config = session.config["agent"].copy()
        agent_config_keys = ["cuda_version", "cudnn_version", "default_python", "worker_id", "worker_name", "debug"]
        try:
-            self._session.load_vaults()
+            vault_loaded = session.load_vaults()
        except Exception as ex:
            print("Error: failed applying extra configuration: {}".format(ex))

-        # merge back
-        for restore_key in agent_config_keys:
-            if restore_key in agent_config:
-                self._session.config["agent"][restore_key] = agent_config[restore_key]
+        config = session.config
+
+        # merge back
+        if vault_loaded:
+            for restore_key in agent_config_keys:
+                if restore_key in agent_config and agent_config[restore_key] != config["agent"].get(restore_key, None):
+                    print("Ignoring vault value for '{}' (agent config takes precedence), using '{}'".format(
+                        restore_key, agent_config[restore_key]
+                    ))
+                    config["agent"][restore_key] = agent_config[restore_key]

-        config = self._session.config
        default = config.get("agent.apply_environment", False)
        if ENV_ENABLE_ENV_CONFIG_SECTION.get(default=default):
            try:
@@ -2295,8 +2382,10 @@ class Worker(ServiceCommandSection):
                print("Cloning task id={}".format(task_id))
                current_task = self._session.api_client.tasks.get_by_id(
                    self._session.send_api(
-                        tasks_api.CloneRequest(task=current_task.id,
-                                               new_task_name='Clone of {}'.format(current_task.name))
+                        tasks_api.CloneRequest(
+                            task=current_task.id,
+                            new_task_name="Clone of {}".format(current_task.name)
+                        )
                    ).id
                )
                print("Task cloned, new task id={}".format(current_task.id))
@@ -2304,11 +2393,23 @@ class Worker(ServiceCommandSection):
                raise CommandFailedError("Cloning failed")
        else:
            # make sure this task is not stuck in an execution queue, it shouldn't have been, but just in case.
+            # noinspection PyBroadException
            try:
-                res = self._session.api_client.tasks.dequeue(task=current_task.id)
-                if require_queue and res.meta.result_code != 200:
-                    raise ValueError("Execution required enqueued task, "
-                                     "but task id={} is not queued.".format(current_task.id))
+                res = self._session.send_request(
+                    service="tasks", action="dequeue", method=Request.def_method,
+                    json={"task": current_task.id, "new_status": "in_progress"},
+                )
+                if require_queue and (not res.ok or res.json().get("data", {}).get("updated", 0) < 1):
+                    raise ValueError(
+                        "Execution required enqueued task, but task id={} is not queued.".format(current_task.id)
+                    )
+                # Set task status to started to prevent any external monitoring from killing it
+                self._session.api_client.tasks.started(
+                    task=current_task.id,
+                    status_reason="starting execution soon",
+                    status_message="",
+                    force=True,
+                )
            except Exception:
                if require_queue:
                    raise
@@ -2319,14 +2420,14 @@ class Worker(ServiceCommandSection):
        # We expect the same behaviour in case full_monitoring was set, and in case docker mode is used
        if full_monitoring or docker is not False:
            if full_monitoring:
-                if not (ENV_WORKER_ID.get() or '').strip():
-                    self._session.config["agent"]["worker_id"] = ''
+                if not (ENV_WORKER_ID.get() or "").strip():
+                    self._session.config["agent"]["worker_id"] = ""
                # make sure we support multiple instances if we need to
                self._singleton()
                self.temp_config_path = self.temp_config_path or safe_mkstemp(
                    suffix=".cfg", prefix=".clearml_agent.", text=True, name_only=True
                )
-                self.dump_config(self.temp_config_path)
+                self.dump_config(filename=self.temp_config_path, config=self._session.pre_vault_config)
                self._session._config_file = self.temp_config_path

            worker_params = WorkerParams(
@@ -2347,8 +2448,6 @@ class Worker(ServiceCommandSection):
                    Singleton.close_pid_file()
            return status if ENV_PROPAGATE_EXITCODE.get() else 0

-        self._apply_extra_configuration()
-
        self._session.print_configuration()

        # now mark the task as started
@@ -3515,6 +3614,11 @@ class Worker(ServiceCommandSection):
        requirements_manager.translator.enabled = False
        print(requirements_manager.replace(contents))

+    def remove_non_backwards_compatible_entries(self, config: Config):
+        if not self._standalone_mode or not ENV_CONFIG_BC_IN_STANDALONE.get() or self._session.feature_set == "basic":
+            return
+        config.pop("agent.package_manager.pip_version")  # removed due to a breaking change in v1.5.1
+
    def get_docker_config_cmd(self, docker_args, clean_api_credentials=False):
        docker_image = str(ENV_DOCKER_IMAGE.get() or
                           self._session.config.get("agent.default_docker.image", "nvidia/cuda")) \
@@ -3537,6 +3641,7 @@ class Worker(ServiceCommandSection):
            DockerArgsSanitizer.sanitize_docker_command(self._session, self._docker_arguments) or ''))

        temp_config = deepcopy(self._session.config)
+        self.remove_non_backwards_compatible_entries(temp_config)
        mounted_cache_dir = temp_config.get(
            "agent.docker_internal_mounts.sdk_cache", self._docker_fixed_user_cache)
        mounted_pip_dl_dir = temp_config.get(
@@ -3587,34 +3692,35 @@ class Worker(ServiceCommandSection):

    def _get_docker_config_cmd(self, temp_config, clean_api_credentials=False, **kwargs):
        self.debug("Setting up docker config command")
-        host_cache = Path(os.path.expandvars(
-            self._session.config["sdk.storage.cache.default_base_dir"])).expanduser().as_posix()
+
+        def load_path(field, default=None):
+            value = self._session.config.get(field, default)
+            return Path(os.path.expandvars(value)).expanduser().as_posix() if value else None
+
+        host_cache = load_path("sdk.storage.cache.default_base_dir")
        self.debug("host_cache: {}".format(host_cache))
-        host_pip_dl = Path(os.path.expandvars(
-            self._session.config["agent.pip_download_cache.path"])).expanduser().as_posix()
+
+        host_pip_dl = load_path("agent.pip_download_cache.path")
        self.debug("host_pip_dl: {}".format(host_pip_dl))
-        host_vcs_cache = Path(os.path.expandvars(
-            self._session.config["agent.vcs_cache.path"])).expanduser().as_posix()
+
+        host_vcs_cache = load_path("agent.vcs_cache.path")
        self.debug("host_vcs_cache: {}".format(host_vcs_cache))
-        host_venvs_cache = Path(os.path.expandvars(
-            self._session.config["agent.venvs_cache.path"])).expanduser().as_posix() \
-            if self._session.config.get("agent.venvs_cache.path", None) else None
+
+        host_venvs_cache = load_path("agent.venvs_cache.path")
        self.debug("host_venvs_cache: {}".format(host_venvs_cache))
+
        host_ssh_cache = self._host_ssh_cache
        self.debug("host_ssh_cache: {}".format(host_ssh_cache))

-        host_apt_cache = Path(os.path.expandvars(self._session.config.get(
-            "agent.docker_apt_cache", '~/.clearml/apt-cache'))).expanduser().as_posix()
+        host_apt_cache = load_path("agent.docker_apt_cache", default="~/.clearml/apt-cache")
        self.debug("host_apt_cache: {}".format(host_apt_cache))
-        host_pip_cache = Path(os.path.expandvars(self._session.config.get(
-            "agent.docker_pip_cache", '~/.clearml/pip-cache'))).expanduser().as_posix()
+
+        host_pip_cache = load_path("agent.docker_pip_cache", default="~/.clearml/pip-cache")
        self.debug("host_pip_cache: {}".format(host_pip_cache))

-        if self.poetry.enabled:
-            host_poetry_cache = Path(os.path.expandvars(self._session.config.get(
-                "agent.docker_poetry_cache", '~/.clearml/poetry-cache'))).expanduser().as_posix()
-        else:
-            host_poetry_cache = None
+        host_poetry_cache = (
+            load_path("agent.docker_poetry_cache", "~/.clearml/poetry-cache") if self.poetry.enabled else None
+        )
        self.debug("host_poetry_cache: {}".format(host_poetry_cache))

        # make sure all folders are valid
@@ -3766,6 +3872,60 @@ class Worker(ServiceCommandSection):
                        pass
        return results

+    @staticmethod
+    def _resolve_docker_env_args(docker_args):
+        # type: (List[str]) -> List[str]
+        """
+        Resolve -e / --env docker environment args matching $VAR or ${VAR} from the host environment
+
+        :argument docker_args: List of docker argument strings (flags and values)
+        """
+        non_list_args = (
+            "rm", "read-only", "sig-proxy", "tty", "privileged", "publish-all", "interactive", "init", "help", "detach"
+        )
+        non_list_args_single = (
+            "t", "P", "i", "d",
+        )
+
+        # if no filtering, do nothing
+        if not docker_args:
+            return docker_args
+
+        args = docker_args[:]
+        skip_arg = False
+        for i, cmd in enumerate(docker_args):
+            if skip_arg and not cmd.startswith("-"):
+                continue
+
+            skip_arg = False
+
+            if cmd.startswith("--"):
+                # jump over single command
+                if cmd[2:] in non_list_args:
+                    continue
+            elif cmd.startswith("-"):
+                # jump over single character non args
+                if cmd[1:] in non_list_args_single:
+                    continue
+
+            # if we are here we have a command to bypass and the list after it
+            if cmd in ('-e', '--env'):
+                skip_arg = True
+                for j in range(i+1, len(args)):
+                    if args[j].startswith("-"):
+                        break
+
+                    parts = args[j].split("=", 1)
+                    if len(parts) != 2:
+                        continue
+
+                    args[j] = "{}={}".format(parts[0], os.path.expandvars(parts[1]))
+
+            elif cmd.startswith("-"):
+                skip_arg = True
+
+        return args
+
    def _get_docker_cmd(
            self,
            worker_id, parent_worker_id,
@@ -3829,9 +3989,14 @@ class Worker(ServiceCommandSection):
            docker_arguments = list(docker_arguments) \
                if isinstance(docker_arguments, (list, tuple)) else [docker_arguments]
            docker_arguments = self._filter_docker_args(docker_arguments)
+            if self._session.config.get("agent.docker_allow_host_environ", None):
+                docker_arguments = self._resolve_docker_env_args(docker_arguments)
            base_cmd += [a for a in docker_arguments if a]

        if extra_docker_arguments:
+            # we always resolve environments in the `extra_docker_arguments` becuase the admin set them (not users)
+            extra_docker_arguments = self._resolve_docker_env_args(extra_docker_arguments)
+
            extra_docker_arguments = [extra_docker_arguments] \
                if isinstance(extra_docker_arguments, six.string_types) else extra_docker_arguments
            base_cmd += [str(a) for a in extra_docker_arguments if a]
@@ -3840,6 +4005,10 @@ class Worker(ServiceCommandSection):
        base_cmd += ['-l', self._worker_label.format(worker_id)]
        base_cmd += ['-l', self._parent_worker_label.format(parent_worker_id)]

+        extra_labels = ENV_EXTRA_DOCKER_LABELS.get()
+        for label in (extra_labels or []):
+            base_cmd += ['-l', label]
+
        self.debug("Command: {}".format(base_cmd), context="docker")

        # check if running inside a kubernetes
@@ -3903,7 +4072,7 @@ class Worker(ServiceCommandSection):

        base_cmd += ['-e', 'CLEARML_WORKER_ID='+worker_id, ]
        # update the docker image, so the system knows where it runs
-        base_cmd += ['-e', 'CLEARML_DOCKER_IMAGE={} {}'.format(docker_image, ' '.join(docker_arguments or [])).strip()]
+        base_cmd += ['-e', 'CLEARML_DOCKER_IMAGE={}'.format(docker_image)]

        if env_task_id:
            base_cmd += ['-e', 'CLEARML_TASK_ID={}'.format(env_task_id), ]
@@ -3922,6 +4091,7 @@ class Worker(ServiceCommandSection):
        # if we are running a RC version, install the same version in the docker
        # because the default latest, will be a release version (not RC)
        specify_version = ''
+        # noinspection PyBroadException
        try:
            from clearml_agent.version import __version__
            _version_parts = __version__.split('.')
@@ -3930,13 +4100,15 @@ class Worker(ServiceCommandSection):
        except:
            pass

+        force_agent_repo = ENV_FORCE_DOCKER_AGENT_REPO.get()
+
        if os.environ.get('FORCE_LOCAL_CLEARML_AGENT_WHEEL'):
            local_wheel = os.path.expanduser(os.environ.get('FORCE_LOCAL_CLEARML_AGENT_WHEEL'))
            docker_wheel = '/tmp/{}'.format(basename(local_wheel))
            base_cmd += ['-v', local_wheel + ':' + docker_wheel]
            clearml_agent_wheel = '\"{}\"'.format(docker_wheel)
-        elif os.environ.get('FORCE_CLEARML_AGENT_REPO'):
-            clearml_agent_wheel = os.environ.get('FORCE_CLEARML_AGENT_REPO')
+        elif force_agent_repo:
+            clearml_agent_wheel = force_agent_repo
        else:
            # clearml-agent{specify_version}
            clearml_agent_wheel = 'clearml-agent{specify_version}'.format(specify_version=specify_version)
--- a/clearml_agent/definitions.py
+++ b/clearml_agent/definitions.py
@@ -152,6 +152,7 @@ WORKING_STANDALONE_DIR = "code"
 DEFAULT_VCS_CACHE = normalize_path(CONFIG_DIR, "vcs-cache")
 PIP_EXTRA_INDICES = []
 DEFAULT_PIP_DOWNLOAD_CACHE = normalize_path(CONFIG_DIR, "pip-download-cache")
+ENV_PIP_EXTRA_INSTALL_FLAGS = EnvironmentConfig("CLEARML_EXTRA_PIP_INSTALL_FLAGS", type=list)
 ENV_DOCKER_IMAGE = EnvironmentConfig("CLEARML_DOCKER_IMAGE", "TRAINS_DOCKER_IMAGE")
 ENV_WORKER_ID = EnvironmentConfig("CLEARML_WORKER_ID", "TRAINS_WORKER_ID")
 ENV_WORKER_TAGS = EnvironmentConfig("CLEARML_WORKER_TAGS")
@@ -173,10 +174,15 @@ ENV_DOCKER_HOST_MOUNT = EnvironmentConfig(
 )
 ENV_VENV_CACHE_PATH = EnvironmentConfig("CLEARML_AGENT_VENV_CACHE_PATH")
 ENV_EXTRA_DOCKER_ARGS = EnvironmentConfig("CLEARML_AGENT_EXTRA_DOCKER_ARGS", type=list)
+ENV_EXTRA_DOCKER_LABELS = EnvironmentConfig("CLEARML_AGENT_EXTRA_DOCKER_LABELS", type=list)
 ENV_DEBUG_INFO = EnvironmentConfig("CLEARML_AGENT_DEBUG_INFO")
 ENV_CHILD_AGENTS_COUNT_CMD = EnvironmentConfig("CLEARML_AGENT_CHILD_AGENTS_COUNT_CMD")
 ENV_DOCKER_ARGS_FILTERS = EnvironmentConfig("CLEARML_AGENT_DOCKER_ARGS_FILTERS")
 ENV_DOCKER_ARGS_HIDE_ENV = EnvironmentConfig("CLEARML_AGENT_DOCKER_ARGS_HIDE_ENV")
+ENV_CONFIG_BC_IN_STANDALONE = EnvironmentConfig("CLEARML_AGENT_STANDALONE_CONFIG_BC", type=bool)
+""" Maintain backwards compatible configuration when launching in standalone mode """
+
+ENV_FORCE_DOCKER_AGENT_REPO = EnvironmentConfig("FORCE_CLEARML_AGENT_REPO", "CLEARML_AGENT_DOCKER_AGENT_REPO")

 ENV_SERVICES_DOCKER_RESTART = EnvironmentConfig("CLEARML_AGENT_SERVICES_DOCKER_RESTART")
 """
@@ -232,6 +238,8 @@ ENV_CUSTOM_BUILD_SCRIPT = EnvironmentConfig("CLEARML_AGENT_CUSTOM_BUILD_SCRIPT")
    standard flow.
 """

+ENV_PACKAGE_PYTORCH_RESOLVE = EnvironmentConfig("CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE")
+

 class FileBuffering(IntEnum):
    """
--- a/clearml_agent/glue/daemon.py
+++ b/clearml_agent/glue/daemon.py
@@ -0,0 +1,15 @@
+from threading import Thread
+from clearml_agent.session import Session
+
+
+class K8sDaemon(Thread):
+
+    def __init__(self, agent):
+        super(K8sDaemon, self).__init__(target=self.target)
+        self.daemon = True
+        self._agent = agent
+        self.log = agent.log
+        self._session: Session = agent._session
+
+    def target(self):
+        pass
--- a/clearml_agent/glue/errors.py
+++ b/clearml_agent/glue/errors.py
@@ -0,0 +1,12 @@
+
+class GetPodsError(Exception):
+    pass
+
+
+class GetJobsError(Exception):
+    pass
+
+
+class GetPodCountError(Exception):
+    pass
+
--- a/clearml_agent/glue/k8s.py
+++ b/clearml_agent/glue/k8s.py
@@ -9,11 +9,10 @@ import os
 import re
 import subprocess
 import tempfile
-from collections import defaultdict
+from collections import defaultdict, namedtuple
 from copy import deepcopy
 from pathlib import Path
 from pprint import pformat
-from threading import Thread
 from time import sleep, time
 from typing import Text, List, Callable, Any, Collection, Optional, Union, Iterable, Dict, Tuple, Set

@@ -28,8 +27,11 @@ from clearml_agent.definitions import (
    ENV_AGENT_GIT_PASS,
    ENV_FORCE_SYSTEM_SITE_PACKAGES,
 )
-from clearml_agent.errors import APIError
+from clearml_agent.errors import APIError, UsageError
 from clearml_agent.glue.definitions import ENV_START_AGENT_SCRIPT_PATH
+from clearml_agent.glue.errors import GetPodCountError
+from clearml_agent.glue.utilities import get_path, get_bash_output
+from clearml_agent.glue.pending_pods_daemon import PendingPodsDaemon
 from clearml_agent.helper.base import safe_remove_file
 from clearml_agent.helper.dicts import merge_dicts
 from clearml_agent.helper.process import get_bash_output, stringify_bash_output
@@ -38,19 +40,15 @@ from clearml_agent.interface.base import ObjectID


 class K8sIntegration(Worker):
+    SUPPORTED_KIND = ("pod", "job")
    K8S_PENDING_QUEUE = "k8s_scheduler"

    K8S_DEFAULT_NAMESPACE = "clearml"
    AGENT_LABEL = "CLEARML=agent"
+    QUEUE_LABEL = "clearml-agent-queue"

    KUBECTL_APPLY_CMD = "kubectl apply --namespace={namespace} -f"

-    KUBECTL_CLEANUP_DELETE_CMD = "kubectl delete pods " \
-                                 "-l={agent_label} " \
-                                 "--field-selector=status.phase!=Pending,status.phase!=Running " \
-                                 "--namespace={namespace} " \
-                                 "--output name"
-
    BASH_INSTALL_SSH_CMD = [
        "apt-get update",
        "apt-get install -y openssh-server",
@@ -81,7 +79,7 @@ class K8sIntegration(Worker):
        "[ ! -z $LOCAL_PYTHON ] || apt-get install -y python3-pip",
        "[ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3",
        "{extra_bash_init_cmd}",
-        "$LOCAL_PYTHON -m pip install clearml-agent{agent_install_args}",
+        "[ ! -z $CLEARML_AGENT_NO_UPDATE ] || $LOCAL_PYTHON -m pip install clearml-agent{agent_install_args}",
        "{extra_docker_bash_script}",
        "$LOCAL_PYTHON -m clearml_agent execute {default_execution_agent_args} --id {task_id}"
    ]
@@ -135,6 +133,10 @@ class K8sIntegration(Worker):
        :param int max_pods_limit: Maximum number of pods that K8S glue can run at the same time
        """
        super(K8sIntegration, self).__init__()
+        self.kind = os.environ.get("CLEARML_K8S_GLUE_KIND", "pod").strip().lower()
+        if self.kind not in self.SUPPORTED_KIND:
+            raise UsageError(f"Kind '{self.kind}' not supported (expected {','.join(self.SUPPORTED_KIND)})")
+        self.using_jobs = self.kind == "job"
        self.pod_name_prefix = pod_name_prefix or self.DEFAULT_POD_NAME_PREFIX
        self.limit_pod_label = limit_pod_label or self.DEFAULT_LIMIT_POD_LABEL
        self.k8s_pending_queue_name = k8s_pending_queue_name or self.K8S_PENDING_QUEUE
@@ -179,11 +181,18 @@ class K8sIntegration(Worker):

        self._agent_label = None

-        self._monitor_hanging_pods()
+        self._pending_pods_daemon = self._create_pending_pods_daemon(
+            cls_=PendingPodsDaemon,
+            polling_interval=self._polling_interval
+        )
+        self._pending_pods_daemon.start()

        self._min_cleanup_interval_per_ns_sec = 1.0
        self._last_pod_cleanup_per_ns = defaultdict(lambda: 0.)

+    def _create_pending_pods_daemon(self, cls_, **kwargs):
+        return cls_(agent=self, **kwargs)
+
    def _load_overrides_yaml(self, overrides_yaml):
        if not overrides_yaml:
            return
@@ -209,26 +218,33 @@ class K8sIntegration(Worker):
            self.log.warning('Removing containers section: {}'.format(overrides['spec'].pop('containers')))
        self.overrides_json_string = json.dumps(overrides)

-    def _monitor_hanging_pods(self):
-        _check_pod_thread = Thread(target=self._monitor_hanging_pods_daemon)
-        _check_pod_thread.daemon = True
-        _check_pod_thread.start()
-
    @staticmethod
    def _load_template_file(path):
        with open(os.path.expandvars(os.path.expanduser(str(path))), 'rt') as f:
            return yaml.load(f, Loader=getattr(yaml, 'FullLoader', None))

-    def _get_kubectl_options(self, command, extra_labels=None, filters=None, output="json", labels=None):
-        # type: (str, Iterable[str], Iterable[str], str, Iterable[str]) -> Dict
-        if not labels:
+    @staticmethod
+    def _get_path(d, *path, default=None):
+        try:
+            return functools.reduce(
+                lambda a, b: a[b], path, d
+            )
+        except (IndexError, KeyError):
+            return default
+
+    def _get_kubectl_options(self, command, extra_labels=None, filters=None, output="json", labels=None, ns=None):
+        # type: (str, Iterable[str], Iterable[str], str, Iterable[str], str) -> Dict
+        if labels is False:
+            labels = []
+        elif not labels:
            labels = [self._get_agent_label()]
        labels = list(labels) + (list(extra_labels) if extra_labels else [])
        d = {
-            "-l": ",".join(labels),
-            "-n": str(self.namespace),
+            "-n": ns or str(self.namespace),
            "-o": output,
        }
+        if labels:
+            d["-l"] = ",".join(labels)
        if filters:
            d["--field-selector"] = ",".join(filters)
        return d
@@ -239,132 +255,6 @@ class K8sIntegration(Worker):
            command=command, opts=" ".join(x for item in opts.items() for x in item)
        )

-    def _monitor_hanging_pods_daemon(self):
-        last_tasks_msgs = {}  # last msg updated for every task
-
-        while True:
-            kubectl_cmd = self.get_kubectl_command("get pods", filters=["status.phase=Pending"])
-            self.log.debug("Detecting hanging pods: {}".format(kubectl_cmd))
-            output = stringify_bash_output(get_bash_output(kubectl_cmd))
-            try:
-                output_config = json.loads(output)
-            except Exception as ex:
-                self.log.warning('K8S Glue pods monitor: Failed parsing kubectl output:\n{}\nEx: {}'.format(output, ex))
-                sleep(self._polling_interval)
-                continue
-            pods = output_config.get('items', [])
-            task_id_to_details = dict()
-            for pod in pods:
-                pod_name = pod.get('metadata', {}).get('name', None)
-                if not pod_name:
-                    continue
-
-                task_id = pod_name.rpartition('-')[-1]
-                if not task_id:
-                    continue
-
-                namespace = pod.get('metadata', {}).get('namespace', None)
-                if not namespace:
-                    continue
-
-                task_id_to_details[task_id] = (pod_name, namespace)
-
-                msg = None
-
-                waiting = self._get_path(pod, 'status', 'containerStatuses', 0, 'state', 'waiting')
-                if not waiting:
-                    condition = self._get_path(pod, 'status', 'conditions', 0)
-                    if condition:
-                        reason = condition.get('reason')
-                        if reason == 'Unschedulable':
-                            message = condition.get('message')
-                            msg = reason + (" ({})".format(message) if message else "")
-                else:
-                    reason = waiting.get("reason", None)
-                    message = waiting.get("message", None)
-
-                    msg = reason + (" ({})".format(message) if message else "")
-
-                    if reason == 'ImagePullBackOff':
-                        delete_pod_cmd = 'kubectl delete pods {} -n {}'.format(pod_name, namespace)
-                        self.log.debug(" - deleting pod due to ImagePullBackOff: {}".format(delete_pod_cmd))
-                        get_bash_output(delete_pod_cmd)
-                        try:
-                            self.log.debug(" - Detecting hanging pods: {}".format(kubectl_cmd))
-                            self._session.api_client.tasks.failed(
-                                task=task_id,
-                                status_reason="K8S glue error: {}".format(msg),
-                                status_message="Changed by K8S glue",
-                                force=True
-                            )
-                        except Exception as ex:
-                            self.log.warning(
-                                'K8S Glue pods monitor: Failed deleting task "{}"\nEX: {}'.format(task_id, ex)
-                            )
-
-                        # clean up any msg for this task
-                        last_tasks_msgs.pop(task_id, None)
-                        continue
-                if msg and last_tasks_msgs.get(task_id, None) != msg:
-                    try:
-                        result = self._session.send_request(
-                            service='tasks',
-                            action='update',
-                            json={"task": task_id, "status_message": "K8S glue status: {}".format(msg)},
-                            method=Request.def_method,
-                            async_enable=False,
-                        )
-                        if not result.ok:
-                            result_msg = self._get_path(result.json(), 'meta', 'result_msg')
-                            raise Exception(result_msg or result.text)
-
-                        # update last msg for this task
-                        last_tasks_msgs[task_id] = msg
-                    except Exception as ex:
-                        self.log.warning(
-                            'K8S Glue pods monitor: Failed setting status message for task "{}"\nMSG: {}\nEX: {}'.format(
-                                task_id, msg, ex
-                            )
-                        )
-
-            if task_id_to_details:
-                try:
-                    result = self._session.get(
-                        service='tasks',
-                        action='get_all',
-                        json={"id": list(task_id_to_details), "status": ["stopped"], "only_fields": ["id"]},
-                        method=Request.def_method,
-                        async_enable=False,
-                    )
-                    aborted_task_ids = list(filter(None, (task.get("id") for task in result["tasks"])))
-
-                    for task_id in aborted_task_ids:
-                        pod_name, namespace = task_id_to_details.get(task_id)
-                        if not pod_name:
-                            self.log.error("Failed locating aborted task {} in pending pods list".format(task_id))
-                            continue
-                        self.log.info(
-                            "K8S Glue pods monitor: task {} was aborted by its pod {} is still pending, "
-                            "deleting pod".format(task_id, pod_name)
-                        )
-
-                        kubectl_cmd = "kubectl delete pod {pod_name} --output name {namespace}".format(
-                            namespace=f"--namespace={namespace}" if namespace else "", pod_name=pod_name,
-                        ).strip()
-                        self.log.debug("Deleting aborted task pending pod: {}".format(kubectl_cmd))
-                        output = stringify_bash_output(get_bash_output(kubectl_cmd))
-                        if not output:
-                            self.log.warning("K8S Glue pods monitor: failed deleting pod {}".format(pod_name))
-                except Exception as ex:
-                    self.log.warning(
-                        'K8S Glue pods monitor: failed checking aborted tasks for hanging pods: {}'.format(ex)
-                    )
-
-            # clean up any last message for a task that wasn't seen as a pod
-            last_tasks_msgs = {k: v for k, v in last_tasks_msgs.items() if k in task_id_to_details}
-
-            sleep(self._polling_interval)
-
    def _set_task_user_properties(self, task_id: str, task_session=None, **properties: str):
        session = task_session or self._session
        if self._edit_hyperparams_support is not True:
@@ -408,34 +298,50 @@ class K8sIntegration(Worker):

        return self._agent_label

-    def _get_used_pods(self):
-        # type: () -> Tuple[int, Set[str]]
-        # noinspection PyBroadException
+    RunningPod = namedtuple("RunningPod", "name queue namespace")
+
+    def _get_running_pods(self):
        try:
            kubectl_cmd = self.get_kubectl_command(
                "get pods",
-                output="jsonpath=\"{range .items[*]}{.metadata.name}{' '}{.metadata.namespace}{'\\n'}{end}\""
+                output="jsonpath=\"{{range .items[*]}}{{.metadata.name}}{{' '}}{{.metadata.namespace}}{{' '}}"
+                       "{{.metadata.labels.{}}}{{'\\n'}}{{end}}\"".format(self.QUEUE_LABEL)
            )
            self.log.debug("Getting used pods: {}".format(kubectl_cmd))
            output = stringify_bash_output(get_bash_output(kubectl_cmd, raise_error=True))

            if not output:
                # No such pod exist so we can use the pod_number we found
-                return 0, set([])
+                return []

            try:
-                items = output.splitlines()
-                current_pod_count = len(items)
-                namespaces = {item.rpartition(" ")[-1] for item in items}
-                self.log.debug(" - found {} pods in namespaces {}".format(current_pod_count, ", ".join(namespaces)))
-            except (KeyError, ValueError, TypeError, AttributeError) as ex:
-                print("Failed parsing used pods command response for cleanup: {}".format(ex))
-                return -1, set([])
+                return [
+                    self.RunningPod(
+                        name=parts[0],
+                        namespace=parts[1],
+                        queue=parts[2]
+                    )
+                    for parts in (line.split(" ") for line in output.splitlines())
+                ]
+            except Exception as ex:
+                raise Exception("Failed parsing used pods command response for cleanup: {}".format(ex))
+        except Exception as ex:
+            raise Exception('Failed obtaining used pods information: {}'.format(ex))

+    def _get_used_pods(self):
+        # type: () -> Tuple[int, Set[str]]
+        # noinspection PyBroadException
+        try:
+            items = self._get_running_pods()
+            if not items:
+                return 0, set([])
+            current_pod_count = len(items)
+            namespaces = {item.namespace for item in items}
+            self.log.debug(" - found {} pods in namespaces {}".format(current_pod_count, ", ".join(namespaces)))
            return current_pod_count, namespaces
        except Exception as ex:
-            print('Failed obtaining used pods information: {}'.format(ex))
-            return -2, set([])
+            self.log.debug("Failed getting used pods: {}", ex)
+            return -1, set([])

    def _is_same_tenant(self, task_session):
        if not task_session or task_session is self._session:
@@ -448,6 +354,69 @@ class K8sIntegration(Worker):
        except Exception as ex:
            print("ERROR: Failed getting tenant for task session: {}".format(ex))

+    def get_jobs_info(self, info_path: str, condition: str = None, namespace=None, debug_msg: str = None)\
+            -> Dict[str, str]:
+        cond = "==".join((x.strip("=") for x in condition.partition("=")[::2]))
+        output = f"jsonpath='{{range .items[?(@.{cond})]}}{{@.{info_path}}}{{\" \"}}{{@.metadata.namespace}}{{\"\\n\"}}{{end}}'"
+        kubectl_cmd = self.get_kubectl_command("get job", output=output, ns=namespace)
+        if debug_msg:
+            self.log.debug(debug_msg.format(cmd=kubectl_cmd))
+        output = stringify_bash_output(get_bash_output(kubectl_cmd))
+        output = output.strip("'")  # for Windows debugging :(
+        try:
+            data_items = dict(l.strip().partition(" ")[::2] for l in output.splitlines())
+            return data_items
+        except Exception as ex:
+            self.log.warning('Failed parsing kubectl output:\n{}\nEx: {}'.format(output, ex))
+
+    def get_pods_for_jobs(self, job_condition: str = None, pod_filters: List[str] = None, debug_msg: str = None):
+        controller_uids = self.get_jobs_info(
+            "spec.selector.matchLabels.controller-uid", condition=job_condition, debug_msg=debug_msg
+        )
+        if not controller_uids:
+            # No pods were found for these jobs
+            return []
+        pods = self.get_pods(filters=pod_filters, debug_msg=debug_msg)
+        return [
+            pod for pod in pods
+            if get_path(pod, "metadata", "labels", "controller-uid") in controller_uids
+        ]
+
+    def get_pods(self, filters: List[str] = None, debug_msg: str = None):
+        kubectl_cmd = self.get_kubectl_command(
+            "get pods",
+            filters=filters,
+            labels=False if self.using_jobs else None,
+        )
+        if debug_msg:
+            self.log.debug(debug_msg.format(cmd=kubectl_cmd))
+        output = stringify_bash_output(get_bash_output(kubectl_cmd))
+        try:
+            output_config = json.loads(output)
+        except Exception as ex:
+            self.log.warning('Failed parsing kubectl output:\n{}\nEx: {}'.format(output, ex))
+            return
+        return output_config.get('items', [])
+
+    def _get_pod_count(self, extra_labels: List[str] = None, msg: str = None):
+            kubectl_cmd_new = self.get_kubectl_command(
+                f"get {self.kind}s",
+                extra_labels= extra_labels
+            )
+            self.log.debug("{}{}".format((msg + ": ") if msg else "", kubectl_cmd_new))
+            process = subprocess.Popen(kubectl_cmd_new.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
+            output, error = process.communicate()
+            output = stringify_bash_output(output)
+            error = stringify_bash_output(error)
+
+            try:
+                return len(json.loads(output).get("items", []))
+            except (ValueError, TypeError) as ex:
+                self.log.warning(
+                    "K8S Glue pods monitor: Failed parsing kubectl output:\n{}\nEx: {}".format(output, ex)
+                )
+                raise GetPodCountError()
+
    def run_one_task(self, queue: Text, task_id: Text, worker_args=None, task_session=None, **_):
        print('Pulling task {} launching on kubernetes cluster'.format(task_id))
        session = task_session or self._session
@@ -459,6 +428,12 @@ class K8sIntegration(Worker):
                print('Pushing task {} into temporary pending queue'.format(task_id))
                _ = session.api_client.tasks.stop(task_id, force=True)

+                # Just make sure to clean up in case the task is stuck in the queue (known issue)
+                self._session.api_client.queues.remove_task(
+                    task=task_id,
+                    queue=self.k8s_pending_queue_id,
+                )
+
                res = self._session.api_client.tasks.enqueue(
                    task_id,
                    queue=self.k8s_pending_queue_id,
@@ -498,14 +473,14 @@ class K8sIntegration(Worker):

        hocon_config_encoded = config_content.encode("ascii")

-        create_clearml_conf = ["echo '{}' | base64 --decode >> ~/clearml.conf".format(
+        clearml_conf_create_script = ["echo '{}' | base64 --decode >> ~/clearml.conf".format(
            base64.b64encode(
                hocon_config_encoded
            ).decode('ascii')
        )]

        if task_session:
-            create_clearml_conf.append(
+            clearml_conf_create_script.append(
                "export CLEARML_AUTH_TOKEN=$(echo '{}' | base64 --decode)".format(
                    base64.b64encode(task_session.token.encode("ascii")).decode('ascii')
                )
@@ -526,23 +501,15 @@ class K8sIntegration(Worker):
        while self.ports_mode or self.max_pods_limit:
            pod_number = self.base_pod_num + pod_count

-            kubectl_cmd_new = self.get_kubectl_command(
-                "get pods",
-                extra_labels=[self.limit_pod_label.format(pod_number=pod_number)] if self.ports_mode else None
-            )
-            self.log.debug("Looking for a free pod/port: {}".format(kubectl_cmd_new))
-            process = subprocess.Popen(kubectl_cmd_new.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
-            output, error = process.communicate()
-            output = stringify_bash_output(output)
-            error = stringify_bash_output(error)
-
            try:
-                items_count = len(json.loads(output).get("items", []))
-            except (ValueError, TypeError) as ex:
+                items_count = self._get_pod_count(
+                    extra_labels=[self.limit_pod_label.format(pod_number=pod_number)] if self.ports_mode else None,
+                    msg="Looking for a free pod/port"
+                )
+            except GetPodCountError:
                self.log.warning(
-                    "K8S Glue pods monitor: Failed parsing kubectl output:\n{}\ntask '{}' "
-                    "will be enqueued back to queue '{}'\nEx: {}".format(
-                        output, task_id, queue, ex
+                    "K8S Glue pods monitor: task '{}' will be enqueued back to queue '{}'".format(
+                        task_id, queue
                    )
                )
                session.api_client.tasks.stop(task_id, force=True)
@@ -566,8 +533,6 @@ class K8sIntegration(Worker):

            if current_pod_count >= max_count:
                # All pods are taken, exit
-                self.log.debug(
-                    "kubectl last result: {}\n{}".format(error, output))
                self.log.warning(
                    "All k8s services are in use, task '{}' "
                    "will be enqueued back to queue '{}'".format(
@@ -612,7 +577,7 @@ class K8sIntegration(Worker):
            output, error = self._kubectl_apply(
                template=template,
                pod_number=pod_number,
-                create_clearml_conf=create_clearml_conf,
+                clearml_conf_create_script=clearml_conf_create_script,
                labels=labels,
                docker_image=container['image'],
                docker_args=container['arguments'],
@@ -622,11 +587,11 @@ class K8sIntegration(Worker):
                namespace=namespace,
            )

-        print('kubectl output:\n{}\n{}'.format(error, output))
-        if error:
-            send_log = "Running kubectl encountered an error: {}".format(error)
-            self.log.error(send_log)
-            self.send_logs(task_id, send_log.splitlines())
+            print('kubectl output:\n{}\n{}'.format(error, output))
+            if error:
+                send_log = "Running kubectl encountered an error: {}".format(error)
+                self.log.error(send_log)
+                self.send_logs(task_id, send_log.splitlines())

        user_props = {"k8s-queue": str(queue_name)}
        if self.ports_mode:
@@ -657,8 +622,8 @@ class K8sIntegration(Worker):
    def _get_pod_labels(self, queue, queue_name):
        return [
            self._get_agent_label(),
-            "clearml-agent-queue={}".format(self._safe_k8s_label_value(queue)),
-            "clearml-agent-queue-name={}".format(self._safe_k8s_label_value(queue_name))
+            "{}={}".format(self.QUEUE_LABEL, self._safe_k8s_label_value(queue)),
+            "{}-name={}".format(self.QUEUE_LABEL, self._safe_k8s_label_value(queue_name))
        ]

    def _get_docker_args(self, docker_args, flags, target=None, convert=None):
@@ -687,32 +652,10 @@ class K8sIntegration(Worker):
            return {target: results} if results else {}
        return results

-    def _kubectl_apply(
-        self,
-        create_clearml_conf,
-        docker_image,
-        docker_args,
-        docker_bash,
-        labels,
-        queue,
-        task_id,
-        namespace,
-        template=None,
-        pod_number=None
-    ):
-        template.setdefault('apiVersion', 'v1')
-        template['kind'] = 'Pod'
-        template.setdefault('metadata', {})
-        name = self.pod_name_prefix + str(task_id)
-        template['metadata']['name'] = name
-        template.setdefault('spec', {})
-        template['spec'].setdefault('containers', [])
-        template['spec'].setdefault('restartPolicy', 'Never')
-        if labels:
-            labels_dict = dict(pair.split('=', 1) for pair in labels)
-            template['metadata'].setdefault('labels', {})
-            template['metadata']['labels'].update(labels_dict)
-
+    def _create_template_container(
+        self, pod_name: str, task_id: str, docker_image: str, docker_args: List[str],
+        docker_bash: str, clearml_conf_create_script: List[str]
+    ) -> dict:
        container = self._get_docker_args(
            docker_args,
            target="env",
@@ -736,7 +679,7 @@ class K8sIntegration(Worker):
                         agent_install_args=self.POD_AGENT_INSTALL_ARGS)
             for line in container_bash_script])

-        extra_bash_commands = list(create_clearml_conf or [])
+        extra_bash_commands = list(clearml_conf_create_script or [])

        start_agent_script_path = ENV_START_AGENT_SCRIPT_PATH.get() or "~/__start_agent__.sh"

@@ -750,20 +693,77 @@ class K8sIntegration(Worker):
        )

        # Notice: we always leave with exit code 0, so pods are never restarted
-        container = self._merge_containers(
+        return self._merge_containers(
            container,
-            dict(name=name, image=docker_image,
+            dict(name=pod_name, image=docker_image,
                 command=['/bin/bash'],
                 args=['-c', '{} ; exit 0'.format(' ; '.join(extra_bash_commands))])
        )

-        if template['spec']['containers']:
-            template['spec']['containers'][0] = self._merge_containers(template['spec']['containers'][0], container)
+    def _kubectl_apply(
+        self,
+        clearml_conf_create_script: List[str],
+        docker_image,
+        docker_args,
+        docker_bash,
+        labels,
+        queue,
+        task_id,
+        namespace,
+        template=None,
+        pod_number=None
+    ):
+        if "apiVersion" not in template:
+            template["apiVersion"] = "batch/v1" if self.using_jobs else "v1"
+        if "kind" in template:
+            if template["kind"].lower() != self.kind:
+                return (
+                    "", f"Template kind {template['kind']} does not maych kind {self.kind.capitalize()} set for agent"
+                )
        else:
-            template['spec']['containers'].append(container)
+            template["kind"] = self.kind.capitalize()
+
+        metadata = template.setdefault('metadata', {})
+        name = self.pod_name_prefix + str(task_id)
+        metadata['name'] = name
+
+        def place_labels(metadata_dict):
+            labels_dict = dict(pair.split('=', 1) for pair in labels)
+            metadata_dict.setdefault('labels', {}).update(labels_dict)
+
+        if labels:
+            # Place labels on base resource (job or single pod)
+            place_labels(metadata)
+
+        spec = template.setdefault('spec', {})
+        if self.using_jobs:
+            spec.setdefault('backoffLimit', 0)
+            spec_template = spec.setdefault('template', {})
+            if labels:
+                # Place same labels fro any pod spawned by the job
+                place_labels(spec_template.setdefault('metadata', {}))
+
+            spec = spec_template.setdefault('spec', {})
+
+        containers = spec.setdefault('containers', [])
+        spec.setdefault('restartPolicy', 'Never')
+
+        container = self._create_template_container(
+            pod_name=name,
+            task_id=task_id,
+            docker_image=docker_image,
+            docker_args=docker_args,
+            docker_bash=docker_bash,
+            clearml_conf_create_script=clearml_conf_create_script
+        )
+
+        if containers:
+            containers[0] = self._merge_containers(containers[0], container)
+        else:
+            containers.append(container)

        if self._docker_force_pull:
-            for c in template['spec']['containers']:
+            for c in containers:
                c.setdefault('imagePullPolicy', 'Always')

        fp, yaml_file = tempfile.mkstemp(prefix='clearml_k8stmpl_', suffix='.yml')
@@ -795,6 +795,83 @@ class K8sIntegration(Worker):

        return stringify_bash_output(output), stringify_bash_output(error)

+    def _process_bash_lines_response(self, bash_cmd: str, raise_error=True):
+        res = get_bash_output(bash_cmd, raise_error=raise_error)
+        lines = [
+            line for line in
+            (r.strip().rpartition("/")[-1] for r in res.splitlines())
+            if line.startswith(self.pod_name_prefix)
+        ]
+        return lines
+
+    def _delete_pods(self, selectors: List[str], namespace: str, msg: str = None) -> List[str]:
+        kubectl_cmd = \
+            "kubectl delete pod -l={agent_label} " \
+            "--namespace={namespace} --field-selector={selector} --output name".format(
+                selector=",".join(selectors),
+                agent_label=self._get_agent_label(),
+                namespace=namespace,
+            )
+        self.log.debug("Deleting old/failed pods{} for ns {}: {}".format(
+            msg or "", namespace, kubectl_cmd
+        ))
+        lines = self._process_bash_lines_response(kubectl_cmd)
+        self.log.debug(" - deleted pods %s", ", ".join(lines))
+        return lines
+
+    def _delete_jobs_by_names(self, names_to_ns: Dict[str, str], msg: str = None) -> List[str]:
+        if not names_to_ns:
+            return []
+        ns_to_names = defaultdict(list)
+        for name, ns in names_to_ns.items():
+            ns_to_names[ns].append(name)
+
+        results = []
+        for ns, names in ns_to_names.items():
+            kubectl_cmd = "kubectl delete job --namespace={ns} --output=name {names}".format(
+                ns=ns, names=" ".join(names)
+            )
+            self.log.debug("Deleting jobs {}: {}".format(
+                msg or "", kubectl_cmd
+            ))
+            lines = self._process_bash_lines_response(kubectl_cmd)
+            if not lines:
+                continue
+            self.log.debug(" - deleted jobs %s", ", ".join(lines))
+            results.extend(lines)
+        return results
+
+    def _delete_completed_or_failed_pods(self, namespace, msg: str = None):
+        if not self.using_jobs:
+            return self._delete_pods(
+                selectors=["status.phase!=Pending", "status.phase!=Running"], namespace=namespace, msg=msg
+            )
+
+        job_names_to_delete = {}
+
+        # locate failed pods for jobs
+        failed_pods = self.get_pods_for_jobs(
+            job_condition="status.active=1",
+            pod_filters=["status.phase!=Pending", "status.phase!=Running", "status.phase!=Terminating"],
+            debug_msg="Deleting failed pods: {cmd}"
+        )
+        if failed_pods:
+            job_names_to_delete = {
+                get_path(pod, "metadata", "labels", "job-name"): get_path(pod, "metadata", "namespace")
+                for pod in failed_pods
+                if get_path(pod, "metadata", "labels", "job-name")
+            }
+            self.log.debug(f" - found jobs with failed pods: {' '.join(job_names_to_delete)}")
+
+        completed_job_names = self.get_jobs_info(
+            "metadata.name", condition="status.succeeded=1", namespace=namespace, debug_msg=msg
+        )
+        if completed_job_names:
+            self.log.debug(f" - found completed jobs: {' '.join(completed_job_names)}")
+            job_names_to_delete.update(completed_job_names)
+
+        return self._delete_jobs_by_names(names_to_ns=job_names_to_delete, msg=msg)
+
    def _cleanup_old_pods(self, namespaces, extra_msg=None):
        # type: (Iterable[str], Optional[str]) -> Dict[str, List[str]]
        self.log.debug("Cleaning up pods")
@@ -803,23 +880,12 @@ class K8sIntegration(Worker):
            if time() - self._last_pod_cleanup_per_ns[namespace] < self._min_cleanup_interval_per_ns_sec:
                # Do not try to cleanup the same namespace too quickly
                continue
-            kubectl_cmd = self.KUBECTL_CLEANUP_DELETE_CMD.format(
-                namespace=namespace, agent_label=self._get_agent_label()
-            )
-            self.log.debug("Deleting old/failed pods{} for ns {}: {}".format(
-                extra_msg or "", namespace, kubectl_cmd
-            ))
+
            try:
-                res = get_bash_output(kubectl_cmd, raise_error=True)
-                lines = [
-                    line for line in
-                    (r.strip().rpartition("/")[-1] for r in res.splitlines())
-                    if line.startswith(self.pod_name_prefix)
-                ]
-                self.log.debug(" - deleted pod(s) %s", ", ".join(lines))
-                deleted_pods[namespace].extend(lines)
+                res = self._delete_completed_or_failed_pods(namespace, extra_msg)
+                deleted_pods[namespace].extend(res)
            except Exception as ex:
-                self.log.error("Failed deleting old/failed pods for ns %s: %s", namespace, str(ex))
+                self.log.error("Failed deleting completed/failed pods for ns %s: %s", namespace, str(ex))
            finally:
                self._last_pod_cleanup_per_ns[namespace] = time()

@@ -840,7 +906,7 @@ class K8sIntegration(Worker):
                )
                tasks_to_abort = result["tasks"]
        except Exception as ex:
-            self.log.warning('Failed getting running tasks for deleted pods: {}'.format(ex))
+            self.log.warning('Failed getting running tasks for deleted {}(s): {}'.format(self.kind, ex))

        for task in tasks_to_abort:
            task_id = task.get("id")
@@ -853,15 +919,27 @@ class K8sIntegration(Worker):
                    self._session.get(
                        service='tasks',
                        action='dequeue',
-                        json={"task": task_id, "force": True, "status_reason": "Pod deleted (not pending or running)",
-                              "status_message": "Pod deleted by agent {}".format(self.worker_id or "unknown")},
+                        json={
+                            "task": task_id,
+                            "force": True,
+                            "status_reason": "Pod deleted (not pending or running)",
+                            "status_message": "{} deleted by agent {}".format(
+                                self.kind.capitalize(), self.worker_id or "unknown"
+                            )
+                        },
                        method=Request.def_method,
                    )
                self._session.get(
                    service='tasks',
                    action='failed',
-                    json={"task": task_id, "force": True, "status_reason": "Pod deleted (not pending or running)",
-                          "status_message": "Pod deleted by agent {}".format(self.worker_id or "unknown")},
+                    json={
+                        "task": task_id,
+                        "force": True,
+                        "status_reason": "Pod deleted (not pending or running)",
+                        "status_message": "{} deleted by agent {}".format(
+                            self.kind.capitalize(), self.worker_id or "unknown"
+                        )
+                    },
                    method=Request.def_method,
                )
            except Exception as ex:
@@ -902,10 +980,10 @@ class K8sIntegration(Worker):
            # check if have pod limit, then check if we hit it.
            if self.max_pods_limit:
                if current_pods >= self.max_pods_limit:
-                    print("Maximum pod limit reached {}/{}, sleeping for {:.1f} seconds".format(
-                        current_pods, self.max_pods_limit, self._polling_interval))
+                    print("Maximum {} limit reached {}/{}, sleeping for {:.1f} seconds".format(
+                        self.kind, current_pods, self.max_pods_limit, self._polling_interval))
                    # delete old completed / failed pods
-                    self._cleanup_old_pods(namespaces, " due to pod limit")
+                    self._cleanup_old_pods(namespaces, f" due to {self.kind} limit")
                    # go to sleep
                    sleep(self._polling_interval)
                    continue
@@ -913,7 +991,7 @@ class K8sIntegration(Worker):
            # iterate over queues (priority style, queues[0] is highest)
            for queue in queues:
                # delete old completed / failed pods
-                self._cleanup_old_pods(namespaces)
+                self._cleanup_old_pods(namespaces, extra_msg="Cleanup cycle {cmd}")

                # get next task in queue
                try:
--- a/clearml_agent/glue/pending_pods_daemon.py
+++ b/clearml_agent/glue/pending_pods_daemon.py
@@ -0,0 +1,223 @@
+from time import sleep
+from typing import Dict, Tuple, Optional, List
+
+from clearml_agent.backend_api.session import Request
+from clearml_agent.glue.utilities import get_bash_output
+
+from clearml_agent.helper.process import stringify_bash_output
+
+from .daemon import K8sDaemon
+from .utilities import get_path
+from .errors import GetPodsError
+
+
+class PendingPodsDaemon(K8sDaemon):
+    def __init__(self, polling_interval: float, agent):
+        super(PendingPodsDaemon, self).__init__(agent=agent)
+        self._polling_interval = polling_interval
+        self._last_tasks_msgs = {}  # last msg updated for every task
+
+    def get_pods(self):
+        if self._agent.using_jobs:
+            return self._agent.get_pods_for_jobs(
+                job_condition="status.active=1",
+                pod_filters=["status.phase=Pending"],
+                debug_msg="Detecting pending pods: {cmd}"
+            )
+        return self._agent.get_pods(
+            filters=["status.phase=Pending"],
+            debug_msg="Detecting pending pods: {cmd}"
+        )
+
+    def _get_pod_name(self, pod: dict):
+        return get_path(pod, "metadata", "name")
+
+    def _get_k8s_resource_name(self, pod: dict):
+        if self._agent.using_jobs:
+            return get_path(pod, "metadata", "labels", "job-name")
+        return get_path(pod, "metadata", "name")
+
+    def _get_task_id(self, pod: dict):
+        return self._get_k8s_resource_name(pod).rpartition('-')[-1]
+
+    @staticmethod
+    def _get_k8s_resource_namespace(pod: dict):
+        return pod.get('metadata', {}).get('namespace', None)
+
+    def target(self):
+        """
+            Handle pending objects (pods or jobs, depending on the agent mode).
+            - Delete any pending objects that are not expected to recover
+            - Delete any pending objects for whom the associated task was aborted
+        """
+        while True:
+            # noinspection PyBroadException
+            try:
+                # Get pods (standalone pods if we're in pods mode, or pods associated to jobs if we're in jobs mode)
+                pods = self.get_pods()
+                if pods is None:
+                    raise GetPodsError()
+
+                task_id_to_pod = dict()
+
+                for pod in pods:
+                    pod_name = self._get_pod_name(pod)
+                    if not pod_name:
+                        continue
+
+                    task_id = self._get_task_id(pod)
+                    if not task_id:
+                        continue
+
+                    namespace = self._get_k8s_resource_namespace(pod)
+                    if not namespace:
+                        continue
+
+                    task_id_to_pod[task_id] = pod
+
+                    msg = None
+                    tags = []
+
+                    waiting = get_path(pod, 'status', 'containerStatuses', 0, 'state', 'waiting')
+                    if not waiting:
+                        condition = get_path(pod, 'status', 'conditions', 0)
+                        if condition:
+                            reason = condition.get('reason')
+                            if reason == 'Unschedulable':
+                                message = condition.get('message')
+                                msg = reason + (" ({})".format(message) if message else "")
+                    else:
+                        reason = waiting.get("reason", None)
+                        message = waiting.get("message", None)
+
+                        msg = reason + (" ({})".format(message) if message else "")
+
+                        if reason == 'ImagePullBackOff':
+                            self.delete_k8s_resource(k8s_resource=pod, msg=reason)
+                            try:
+                                self._session.api_client.tasks.failed(
+                                    task=task_id,
+                                    status_reason="K8S glue error: {}".format(msg),
+                                    status_message="Changed by K8S glue",
+                                    force=True
+                                )
+                                self._agent.send_logs(
+                                    task_id, ["K8S Error: {}".format(msg)],
+                                    session=self._session
+                                )
+                            except Exception as ex:
+                                self.log.warning(
+                                    'K8S Glue pending monitor: Failed deleting task "{}"\nEX: {}'.format(task_id, ex)
+                                )
+
+                            # clean up any msg for this task
+                            self._last_tasks_msgs.pop(task_id, None)
+                            continue
+
+                    self._update_pending_task_msg(task_id, msg, tags)
+
+                if task_id_to_pod:
+                    self._process_tasks_for_pending_pods(task_id_to_pod)
+
+                # clean up any last message for a task that wasn't seen as a pod
+                self._last_tasks_msgs = {k: v for k, v in self._last_tasks_msgs.items() if k in task_id_to_pod}
+            except GetPodsError:
+                pass
+            except Exception:
+                self.log.exception("Hanging pods daemon loop")
+
+            sleep(self._polling_interval)
+
+    def delete_k8s_resource(self, k8s_resource: dict, msg: str = None):
+        delete_cmd = "kubectl delete {kind} {name} -n {namespace} --output name".format(
+            kind=self._agent.kind,
+            name=self._get_k8s_resource_name(k8s_resource),
+            namespace=self._get_k8s_resource_namespace(k8s_resource)
+        ).strip()
+        self.log.debug(" - deleting {} {}: {}".format(self._agent.kind, (" " + msg) if msg else "", delete_cmd))
+        return get_bash_output(delete_cmd).strip()
+
+    def _process_tasks_for_pending_pods(self, task_id_to_details: Dict[str, dict]):
+        self._handle_aborted_tasks(task_id_to_details)
+
+    def _handle_aborted_tasks(self, pending_tasks_details: Dict[str, dict]):
+        try:
+            result = self._session.get(
+                service='tasks',
+                action='get_all',
+                json={
+                    "id": list(pending_tasks_details),
+                    "status": ["stopped"],
+                    "only_fields": ["id"]
+                },
+                method=Request.def_method,
+                async_enable=False,
+            )
+            aborted_task_ids = list(filter(None, (task.get("id") for task in result["tasks"])))
+
+            for task_id in aborted_task_ids:
+                pod = pending_tasks_details.get(task_id)
+                if not pod:
+                    self.log.error("Failed locating aborted task {} in pending pods list".format(task_id))
+                    continue
+                resource_name = self._get_k8s_resource_name(pod)
+                self.log.info(
+                    "K8S Glue pending monitor: task {} was aborted but the k8s resource {} is still pending, "
+                    "deleting pod".format(task_id, resource_name)
+                )
+                output = self.delete_k8s_resource(k8s_resource=pod, msg="Pending resource of an aborted task")
+                if not output:
+                    self.log.warning("K8S Glue pending monitor: failed deleting resource {}".format(resource_name))
+        except Exception as ex:
+            self.log.warning(
+                'K8S Glue pending monitor: failed checking aborted tasks for pending resources: {}'.format(ex)
+            )
+
+    def _update_pending_task_msg(self, task_id: str, msg: str, tags: List[str] = None):
+        if not msg or self._last_tasks_msgs.get(task_id, None) == (msg, tags):
+            return
+        try:
+            # Make sure the task is queued
+            result = self._session.send_request(
+                service='tasks',
+                action='get_all',
+                json={"id": task_id, "only_fields": ["status"]},
+                method=Request.def_method,
+                async_enable=False,
+            )
+            if result.ok:
+                status = get_path(result.json(), 'data', 'tasks', 0, 'status')
+                # if task is in progress, change its status to enqueued
+                if status == "in_progress":
+                    result = self._session.send_request(
+                        service='tasks', action='enqueue',
+                        json={
+                            "task": task_id, "force": True, "queue": self._agent.k8s_pending_queue_id
+                        },
+                        method=Request.def_method,
+                        async_enable=False,
+                    )
+                    if not result.ok:
+                        result_msg = get_path(result.json(), 'meta', 'result_msg')
+                        self.log.debug(
+                            "K8S Glue pods monitor: failed forcing task status change"
+                            " for pending task {}: {}".format(task_id, result_msg)
+                        )
+
+            # Update task status message
+            payload = {"task": task_id, "status_message": "K8S glue status: {}".format(msg)}
+            if tags:
+                payload["tags"] = tags
+            result = self._session.send_request('tasks', 'update', json=payload, method=Request.def_method)
+            if not result.ok:
+                result_msg = get_path(result.json(), 'meta', 'result_msg')
+                raise Exception(result_msg or result.text)
+
+            # update last msg for this task
+            self._last_tasks_msgs[task_id] = msg
+        except Exception as ex:
+            self.log.warning(
+                'K8S Glue pods monitor: Failed setting status message for task "{}"\nMSG: {}\nEX: {}'.format(
+                    task_id, msg, ex
+                )
+            )
--- a/clearml_agent/glue/utilities.py
+++ b/clearml_agent/glue/utilities.py
@@ -0,0 +1,18 @@
+import functools
+
+from subprocess import DEVNULL
+
+from clearml_agent.helper.process import get_bash_output as _get_bash_output
+
+
+def get_path(d, *path, default=None):
+    try:
+        return functools.reduce(
+            lambda a, b: a[b], path, d
+        )
+    except (IndexError, KeyError):
+        return default
+
+
+def get_bash_output(cmd, stderr=DEVNULL, raise_error=False):
+    return _get_bash_output(cmd, stderr=stderr, raise_error=raise_error)
--- a/clearml_agent/helper/base.py
+++ b/clearml_agent/helper/base.py
@@ -20,20 +20,22 @@ from typing import Text, Dict, Any, Optional, AnyStr, IO, Union

 import attr
 import furl
+import six
 import yaml
 from attr import fields_dict
 from pathlib2 import Path
-
-import six
 from six.moves import reduce
-from clearml_agent.external import pyhocon
+
 from clearml_agent.errors import CommandFailedError
+from clearml_agent.external import pyhocon
 from clearml_agent.helper.dicts import filter_keys

 pretty_lines = False

 log = logging.getLogger(__name__)

+use_powershell = os.getenv("CLEARML_AGENT_USE_POWERSHELL", None)
+

 def which(cmd, path=None):
    result = find_executable(cmd, path)
@@ -52,7 +54,7 @@ def select_for_platform(linux, windows):


 def bash_c():
-    return 'bash -c' if not is_windows_platform() else 'cmd /c'
+    return 'bash -c' if not is_windows_platform() else ('powershell -Command' if use_powershell else 'cmd /c')


 def return_list(arg):
--- a/clearml_agent/helper/package/base.py
+++ b/clearml_agent/helper/package/base.py
@@ -50,7 +50,7 @@ class PackageManager(object):
        pass

    @abc.abstractmethod
-    def freeze(self):
+    def freeze(self, freeze_full_environment=False):
        pass

    @abc.abstractmethod
@@ -141,8 +141,9 @@ class PackageManager(object):
    @classmethod
    def out_of_scope_install_package(cls, package_name, *args):
        if PackageManager._selected_manager is not None:
+            # noinspection PyBroadException
            try:
-                result = PackageManager._selected_manager._install(package_name, *args)
+                result = PackageManager._selected_manager.install_packages(package_name, *args)
                if result not in (0, None, True):
                    return False
            except Exception:
@@ -150,10 +151,11 @@ class PackageManager(object):
        return True

    @classmethod
-    def out_of_scope_freeze(cls):
+    def out_of_scope_freeze(cls, freeze_full_environment=False):
        if PackageManager._selected_manager is not None:
+            # noinspection PyBroadException
            try:
-                return PackageManager._selected_manager.freeze()
+                return PackageManager._selected_manager.freeze(freeze_full_environment)
            except Exception:
                pass
        return []
--- a/clearml_agent/helper/package/external_req.py
+++ b/clearml_agent/helper/package/external_req.py
@@ -92,21 +92,14 @@ class ExternalRequirements(SimpleSubstitution):
                vcs_url = req_line[4:]
                # reverse replace
                vcs_url = vcs_url[::-1].replace(fragment[::-1], '', 1)[::-1]
-                # remove ssh:// or git:// prefix for git detection and credentials
-                scheme = ''
-                full_vcs_url = vcs_url
-                if vcs_url and (vcs_url.startswith('ssh://') or vcs_url.startswith('git://')):
-                    scheme = 'ssh://'  # notice git:// is actually ssh://
-                    vcs_url = vcs_url[6:]
+                # notice git:// is actually ssh://
+                if vcs_url and vcs_url.startswith('git://'):
+                    vcs_url = vcs_url.replace('git://', 'ssh://', 1)

                from ..repo import Git
-                vcs = Git(session=session, url=full_vcs_url, location=None, revision=None)
+                vcs = Git(session=session, url=vcs_url, location=None, revision=None)
                vcs._set_ssh_url()
-                new_req_line = 'git+{}{}{}'.format(
-                    '' if scheme and '://' in vcs.url else scheme,
-                    vcs_url if session.config.get('agent.force_git_ssh_protocol', None) else vcs.url_with_auth,
-                    fragment
-                )
+                new_req_line = 'git+{}{}'.format(vcs.url_with_auth, fragment)
                if new_req_line != req_line:
                    furl_line = furl(new_req_line)
                    print('Replacing original pip vcs \'{}\' with \'{}\''.format(
--- a/clearml_agent/helper/package/pip_api/system.py
+++ b/clearml_agent/helper/package/pip_api/system.py
@@ -4,7 +4,7 @@ from itertools import chain
 from pathlib import Path
 from typing import Text, Optional

-from clearml_agent.definitions import PIP_EXTRA_INDICES, PROGRAM_NAME
+from clearml_agent.definitions import PIP_EXTRA_INDICES, PROGRAM_NAME, ENV_PIP_EXTRA_INSTALL_FLAGS
 from clearml_agent.helper.package.base import PackageManager
 from clearml_agent.helper.process import Argv, DEVNULL
 from clearml_agent.session import Session
@@ -12,8 +12,6 @@ from clearml_agent.session import Session

 class SystemPip(PackageManager):

-    indices_args = None
-
    def __init__(self, interpreter=None, session=None):
        # type: (Optional[Text], Optional[Session]) -> ()
        """
@@ -52,7 +50,7 @@ class SystemPip(PackageManager):
                package,
                '--dest', cache_dir,
                '--no-deps',
-            ) + self.install_flags()
+            ) + self.download_flags()
        )

    def load_requirements(self, requirements):
@@ -65,13 +63,14 @@ class SystemPip(PackageManager):
    def uninstall(self, package):
        self.run_with_env(('uninstall', '-y', package))

-    def freeze(self):
+    def freeze(self, freeze_full_environment=False):
        """
        pip freeze to all install packages except the running program
        :return: Dict contains pip as key and pip's packages to install
        :rtype: Dict[str: List[str]]
        """
-        packages = self.run_with_env(('freeze',), output=True).splitlines()
+        packages = self.run_with_env(
+            ('freeze',) if not freeze_full_environment else ('freeze', '--all'), output=True).splitlines()
        packages_without_program = [package for package in packages if PROGRAM_NAME not in package]
        return {'pip': packages_without_program}

@@ -87,14 +86,30 @@ class SystemPip(PackageManager):
        # make sure we are not running it with our own PYTHONPATH
        env = dict(**os.environ)
        env.pop('PYTHONPATH', None)
+
+        # Debug print
+        if self.session.debug_mode:
+            print(command)
+
        return (command.get_output if output else command.check_call)(stdin=DEVNULL, env=env, **kwargs)

    def _make_command(self, command):
        return Argv(self.bin, '-m', 'pip', '--disable-pip-version-check', *command)

    def install_flags(self):
-        if self.indices_args is None:
-            self.indices_args = tuple(
-                chain.from_iterable(('--extra-index-url', x) for x in PIP_EXTRA_INDICES)
-            )
-        return self.indices_args
+        indices_args = tuple(
+            chain.from_iterable(('--extra-index-url', x) for x in PIP_EXTRA_INDICES)
+        )
+
+        extra_pip_flags = \
+            ENV_PIP_EXTRA_INSTALL_FLAGS.get() or \
+            self.session.config.get("agent.package_manager.extra_pip_install_flags", None)
+
+        return (indices_args + tuple(extra_pip_flags)) if extra_pip_flags else indices_args
+
+    def download_flags(self):
+        indices_args = tuple(
+            chain.from_iterable(('--extra-index-url', x) for x in PIP_EXTRA_INDICES)
+        )
+
+        return indices_args
--- a/clearml_agent/helper/package/poetry_api.py
+++ b/clearml_agent/helper/package/poetry_api.py
@@ -147,7 +147,7 @@ class PoetryAPI(object):
            any((self.path / indicator).exists() for indicator in self.INDICATOR_FILES)
        )

-    def freeze(self):
+    def freeze(self, freeze_full_environment=False):
        lines = self.config.run("show", cwd=str(self.path)).splitlines()
        lines = [[p for p in line.split(' ') if p] for line in lines]
        return {"pip": [parts[0]+'=='+parts[1]+' # '+' '.join(parts[2:]) for parts in lines]}
--- a/clearml_agent/helper/package/priority_req.py
+++ b/clearml_agent/helper/package/priority_req.py
@@ -7,7 +7,7 @@ from .requirements import SimpleSubstitution

 class PriorityPackageRequirement(SimpleSubstitution):

-    name = ("cython", "numpy", "setuptools", )
+    name = ("cython", "numpy", "setuptools", "pip", )
    optional_package_names = tuple()

    def __init__(self, *args, **kwargs):
@@ -50,31 +50,39 @@ class PriorityPackageRequirement(SimpleSubstitution):
        """
        # if we replaced setuptools, it means someone requested it, and since freeze will not contain it,
        # we need to add it manually
-        if not self._replaced_packages or "setuptools" not in self._replaced_packages:
+        if not self._replaced_packages:
            return list_of_requirements

-        try:
-            for k, lines in list_of_requirements.items():
-                # k is either pip/conda
-                if k not in ('pip', 'conda'):
-                    continue
-                for i, line in enumerate(lines):
-                    if not line or line.lstrip().startswith('#'):
-                        continue
-                    parts = [p for p in re.split(r'\s|=|\.|<|>|~|!|@|#', line) if p]
-                    if not parts:
-                        continue
-                    # if we found setuptools, do nothing
-                    if parts[0] == "setuptools":
-                        return list_of_requirements
+        if "pip" in self._replaced_packages:
+            full_freeze = PackageManager.out_of_scope_freeze(freeze_full_environment=True)
+            # now let's look for pip
+            pips = [line for line in full_freeze.get("pip", []) if line.split("==")[0] == "pip"]
+            if pips and "pip" in list_of_requirements:
+                list_of_requirements["pip"] = [pips[0]] + list_of_requirements["pip"]

-            # if we are here it means we have not found setuptools
-            # we should add it:
-            if "pip" in list_of_requirements:
-                list_of_requirements["pip"] = [self._replaced_packages["setuptools"]] + list_of_requirements["pip"]
+        if "setuptools" in self._replaced_packages:
+            try:
+                for k, lines in list_of_requirements.items():
+                    # k is either pip/conda
+                    if k not in ('pip', 'conda'):
+                        continue
+                    for i, line in enumerate(lines):
+                        if not line or line.lstrip().startswith('#'):
+                            continue
+                        parts = [p for p in re.split(r'\s|=|\.|<|>|~|!|@|#', line) if p]
+                        if not parts:
+                            continue
+                        # if we found setuptools, do nothing
+                        if parts[0] == "setuptools":
+                            return list_of_requirements

-        except Exception as ex:  # noqa
-            return list_of_requirements
+                # if we are here it means we have not found setuptools
+                # we should add it:
+                if "pip" in list_of_requirements:
+                    list_of_requirements["pip"] = [self._replaced_packages["setuptools"]] + list_of_requirements["pip"]
+
+            except Exception as ex:  # noqa
+                return list_of_requirements

        return list_of_requirements

--- a/clearml_agent/helper/package/pytorch.py
+++ b/clearml_agent/helper/package/pytorch.py
@@ -16,6 +16,7 @@ import six
 from .requirements import (
    SimpleSubstitution, FatalSpecsResolutionError, SimpleVersion, MarkerRequirement,
    compare_version_rules, )
+from ...definitions import ENV_PACKAGE_PYTORCH_RESOLVE
 from ...external.requirements_parser.requirement import Requirement

 OS_TO_WHEEL_NAME = {"linux": "linux_x86_64", "windows": "win_amd64"}
@@ -174,6 +175,7 @@ class PytorchRequirement(SimpleSubstitution):
    extra_index_url_template = 'https://download.pytorch.org/whl/cu{}/'
    nightly_extra_index_url_template = 'https://download.pytorch.org/whl/nightly/cu{}/'
    torch_index_url_lookup = {}
+    resolver_types = ("pip", "direct", "none")

    def __init__(self, *args, **kwargs):
        os_name = kwargs.pop("os_override", None)
@@ -208,6 +210,13 @@ class PytorchRequirement(SimpleSubstitution):
        if self.config.get("agent.package_manager.torch_url_template", None):
            PytorchWheel.url_template = \
                self.config.get("agent.package_manager.torch_url_template", None)
+        self.resolve_algorithm = str(
+            ENV_PACKAGE_PYTORCH_RESOLVE.get() or
+            self.config.get("agent.package_manager.pytorch_resolve", "pip")).lower()
+        if self.resolve_algorithm not in self.resolver_types:
+            print("WARNING: agent.package_manager.pytorch_resolve=={} not in {} reverting to '{}'".format(
+                self.resolve_algorithm, self.resolver_types, self.resolver_types[0]))
+            self.resolve_algorithm = self.resolver_types[0]

    def _init_python_ver_cuda_ver(self):
        if self.cuda is None:
@@ -261,6 +270,10 @@ class PytorchRequirement(SimpleSubstitution):
            )

    def match(self, req):
+        if self.resolve_algorithm == "none":
+            # skipping resolver
+            return False
+
        return req.name in self.packages

    @staticmethod
@@ -310,6 +323,12 @@ class PytorchRequirement(SimpleSubstitution):
            # yes this is for linux python 2.7 support, this is the only python 2.7 we support...
            if py_ver and py_ver[0] == '2' and len(parts) > 3 and not parts[3].endswith('u'):
                continue
+
+            # check if this an actual match
+            if not req.compare_version(v) or \
+                    (last_v and SimpleVersion.compare_versions(last_v, '>', v, ignore_sub_versions=False)):
+                continue
+
            # update the closest matched version (from above)
            if not closest_v:
                closest_v = v
@@ -318,10 +337,6 @@ class PytorchRequirement(SimpleSubstitution):
                    SimpleVersion.compare_versions(
                        version_a=v, op='>=', version_b=req.specs[0][1], num_parts=3):
                closest_v = v
-            # check if this an actual match
-            if not req.compare_version(v) or \
-                    (last_v and SimpleVersion.compare_versions(last_v, '>', v, ignore_sub_versions=False)):
-                continue

            url = '/'.join(torch_url.split('/')[:-1] + l.split('/'))
            last_v = v
@@ -345,8 +360,10 @@ class PytorchRequirement(SimpleSubstitution):
                from pip._internal.commands.show import search_packages_info
                installed_torch = list(search_packages_info([req.name]))
                # notice the comparison order, the first part will make sure we have a valid installed package
-                installed_torch_version = (getattr(installed_torch[0], 'version', None) or installed_torch[0]['version']) \
-                    if installed_torch else None
+                installed_torch_version = \
+                    (getattr(installed_torch[0], 'version', None) or
+                     installed_torch[0]['version']) if installed_torch else None
+
                if installed_torch and installed_torch_version and \
                        req.compare_version(installed_torch_version):
                    print('PyTorch: requested "{}" version {}, using pre-installed version {}'.format(
@@ -354,6 +371,7 @@ class PytorchRequirement(SimpleSubstitution):
                    # package already installed, do nothing
                    req.specs = [('==', str(installed_torch_version))]
                    return '{} {} {}'.format(req.name, req.specs[0][0], req.specs[0][1]), True
+
        except Exception:
            pass

@@ -475,6 +493,26 @@ class PytorchRequirement(SimpleSubstitution):
        return self.match_version(req, base).replace(" ", "\n")

    def replace(self, req):
+        # we first try to resolve things ourselves because pytorch pip is not always picking the correct
+        # versions from their pip repository
+
+        resolve_algorithm = self.resolve_algorithm
+        if resolve_algorithm == "none":
+            # skipping resolver
+            return None
+        elif resolve_algorithm == "direct":
+            # noinspection PyBroadException
+            try:
+                new_req = self._replace(req)
+                if new_req:
+                    self._original_req.append((req, new_req))
+                return new_req
+            except Exception:
+                print("Warning: Failed resolving using `pytorch_resolve=direct` reverting to `pytorch_resolve=pip`")
+        elif resolve_algorithm not in self.resolver_types:
+            print("Warning: `agent.package_manager.pytorch_resolve={}` "
+                  "unrecognized, default to `pip`".format(resolve_algorithm))
+
        # check if package is already installed with system packages
        self.validate_python_version()

@@ -508,6 +546,8 @@ class PytorchRequirement(SimpleSubstitution):
                    # return the original line
                    line = req.line

+                print("PyTorch: Adding index `{}` and installing `{}`".format(extra_index_url[0], line))
+
                return line

        except Exception:  # noqa
@@ -566,6 +606,19 @@ class PytorchRequirement(SimpleSubstitution):
        :param list_of_requirements: {'pip': ['a==1.0', ]}
        :return: {'pip': ['a==1.0', ]}
        """
+        def build_specific_version_req(a_line, a_name, a_new_req):
+            try:
+                r = Requirement.parse(a_line)
+                wheel_parts = r.uri.split("/")[-1].split('-')
+                version = str(wheel_parts[1].split('%')[0].split('+')[0])
+                new_r = Requirement.parse("{} == {} # {}".format(a_name, version, str(a_new_req)))
+                if new_r.line:
+                    # great it worked!
+                    return new_r.line
+            except:  # noqa
+                pass
+            return None
+
        if not self._original_req:
            return list_of_requirements
        try:
@@ -589,9 +642,18 @@ class PytorchRequirement(SimpleSubstitution):
                                    if req.local_file:
                                        lines[i] = '{}'.format(str(new_req))
                                    else:
-                                        lines[i] = '{} # {}'.format(str(req), str(new_req))
+                                        # try to rebuild requirements with specific version:
+                                        new_line = build_specific_version_req(line, req.req.name, new_req)
+                                        if new_line:
+                                            lines[i] = new_line
+                                        else:
+                                            lines[i] = '{} # {}'.format(str(req), str(new_req))
                            else:
-                                lines[i] = '{} # {}'.format(line, str(new_req))
+                                new_line = build_specific_version_req(line, req.req.name, new_req)
+                                if new_line:
+                                    lines[i] = new_line
+                                else:
+                                    lines[i] = '{} # {}'.format(line, str(new_req))
                            break
        except:
            pass
@@ -640,7 +702,7 @@ class PytorchRequirement(SimpleSubstitution):
            # noinspection PyBroadException
            try:
                if requests.get(torch_url, timeout=10).ok:
-                    print('Torch CUDA {} index page found'.format(c))
+                    print('Torch CUDA {} index page found, adding `{}`'.format(c, torch_url))
                    cls.torch_index_url_lookup[c] = torch_url
                    return cls.torch_index_url_lookup[c], c
            except Exception:
--- a/clearml_agent/helper/package/requirements.py
+++ b/clearml_agent/helper/package/requirements.py
@@ -240,6 +240,23 @@ class SimpleVersion:
        if not version_b:
            return True

+        # remove trailing "*" in both
+        if "*" in version_a:
+            ignore_sub_versions = True
+            while version_a.endswith(".*"):
+                version_a = version_a[:-2]
+            if version_a == "*":
+                version_a = ""
+            num_parts = min(len(version_a.split('.')), len(version_b.split('.')), )
+
+        if "*" in version_b:
+            ignore_sub_versions = True
+            while version_b.endswith(".*"):
+                version_b = version_b[:-2]
+            if version_b == "*":
+                version_b = ""
+            num_parts = min(len(version_a.split('.')), len(version_b.split('.')), )
+
        if not num_parts:
            num_parts = max(len(version_a.split('.')), len(version_b.split('.')), )

--- a/clearml_agent/helper/repo.py
+++ b/clearml_agent/helper/repo.py
@@ -320,6 +320,7 @@ class VCS(object):
                        self.url, new_url))
                    self.url = new_url
                return
+
            # rewrite ssh URLs only if either ssh port or ssh user are forced in config
            if parsed_url.scheme == "ssh" and (
                self.session.config.get('agent.force_git_ssh_port', None) or
@@ -334,6 +335,9 @@ class VCS(object):
                    print("Using SSH credentials - ssh url '{}' with ssh url '{}'".format(
                        self.url, new_url))
                    self.url = new_url
+                return
+            elif parsed_url.scheme == "ssh":
+                return

        if not self.session.config.agent.translate_ssh:
            return
@@ -343,7 +347,7 @@ class VCS(object):
                (ENV_AGENT_GIT_PASS.get() or self.session.config.get('agent.git_pass', None)):
            # only apply to a specific domain (if requested)
            config_domain = \
-                ENV_AGENT_GIT_HOST.get() or self.session.config.get("git_host", None)
+                ENV_AGENT_GIT_HOST.get() or self.session.config.get("agent.git_host", None)
            if config_domain and config_domain != furl(self.url).host:
                return

--- a/clearml_agent/helper/resource_monitor.py
+++ b/clearml_agent/helper/resource_monitor.py
@@ -139,42 +139,45 @@ class ResourceMonitor(object):
    def _daemon(self):
        seconds_since_started = 0
        reported = 0
-        while True:
-            last_report = time()
-            current_report_frequency = (
-                self._report_frequency if reported != 0 else self._first_report_sec
-            )
-            while (time() - last_report) < current_report_frequency:
-                # wait for self._sample_frequency seconds, if event set quit
-                if self._exit_event.wait(1 / self._sample_frequency):
-                    return
-                # noinspection PyBroadException
-                try:
-                    self._update_readouts()
-                except Exception as ex:
-                    log.warning("failed getting machine stats: %s", report_error(ex))
-                    self._failure()
+        try:
+            while True:
+                last_report = time()
+                current_report_frequency = (
+                    self._report_frequency if reported != 0 else self._first_report_sec
+                )
+                while (time() - last_report) < current_report_frequency:
+                    # wait for self._sample_frequency seconds, if event set quit
+                    if self._exit_event.wait(1 / self._sample_frequency):
+                        return
+                    # noinspection PyBroadException
+                    try:
+                        self._update_readouts()
+                    except Exception as ex:
+                        log.warning("failed getting machine stats: %s", report_error(ex))
+                        self._failure()

-            seconds_since_started += int(round(time() - last_report))
-            # check if we do not report any metric (so it means the last iteration will not be changed)
+                seconds_since_started += int(round(time() - last_report))
+                # check if we do not report any metric (so it means the last iteration will not be changed)

-            # if we do not have last_iteration, we just use seconds as iteration
+                # if we do not have last_iteration, we just use seconds as iteration

-            # start reporting only when we figured out, if this is seconds based, or iterations based
-            average_readouts = self._get_average_readouts()
-            stats = {
-                # 3 points after the dot
-                key: round(value, 3) if isinstance(value, float) else [round(v, 3) for v in value]
-                for key, value in average_readouts.items()
-            }
+                # start reporting only when we figured out, if this is seconds based, or iterations based
+                average_readouts = self._get_average_readouts()
+                stats = {
+                    # 3 points after the dot
+                    key: round(value, 3) if isinstance(value, float) else [round(v, 3) for v in value]
+                    for key, value in average_readouts.items()
+                }

-            # send actual report
-            if self.send_report(stats):
-                # clear readouts if this is update was sent
-                self._clear_readouts()
+                # send actual report
+                if self.send_report(stats):
+                    # clear readouts if this is update was sent
+                    self._clear_readouts()

-            # count reported iterations
-            reported += 1
+                # count reported iterations
+                reported += 1
+        except Exception as ex:
+            log.exception("Error reporting monitoring info: %s", str(ex))

    def _update_readouts(self):
        readouts = self._machine_stats()
--- a/clearml_agent/version.py
+++ b/clearml_agent/version.py
@@ -1 +1 @@
-__version__ = '1.5.2'
+__version__ = '1.6.0'
--- a/docker/k8s-glue/build-resources/clearml.conf
+++ b/docker/k8s-glue/build-resources/clearml.conf
@@ -58,7 +58,7 @@ agent {
        type: pip,

        # specify pip version to use (examples "<20.2", "==19.3.1", "", empty string will install the latest version)
-        pip_version: "<21",
+        pip_version: ["<20.2 ; python_version < '3.10'", "<22.3 ; python_version >= '3.10'"],

        # virtual environment inheres packages from system
        system_site_packages: false,
@@ -171,7 +171,7 @@ agent {

    default_docker: {
        # default docker image to use when running in docker mode
-        image: "nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04"
+        image: "nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04"

        # optional arguments to pass to docker image
        # arguments: ["--ipc=host", ]
@@ -224,7 +224,7 @@ sdk {

    storage {
        cache {
-            # Defaults to system temp folder / cache
+            # Defaults to <system_temp_folder>/clearml_cache
            default_base_dir: "~/.clearml/cache"
            size {
                # max_used_bytes = -1
--- a/docker/k8s-glue/build-resources/entrypoint.sh
+++ b/docker/k8s-glue/build-resources/entrypoint.sh
@@ -33,4 +33,9 @@ echo "api.files_server: ${CLEARML_FILES_HOST}" >> ~/clearml.conf

 ./provider_entrypoint.sh

-python3 k8s_glue_example.py --queue ${QUEUE} --max-pods ${MAX_PODS} ${EXTRA_ARGS}
+if [[ -z "${K8S_GLUE_MAX_PODS}" ]]
+then
+  python3 k8s_glue_example.py --queue ${QUEUE} ${EXTRA_ARGS}
+else
+  python3 k8s_glue_example.py --queue ${QUEUE} --max-pods ${MAX_PODS} ${EXTRA_ARGS}
+fi
--- a/docker/k8s-glue/build-resources/k8s_glue_example.py
+++ b/docker/k8s-glue/build-resources/k8s_glue_example.py
@@ -65,6 +65,19 @@ def parse_args():
        help="Limit the maximum number of pods that this service can run at the same time."
             "Should not be used with ports-mode"
    )
+    parser.add_argument(
+        "--use-owner-token", action="store_true", default=False,
+        help="Generate and use task owner token for the execution of each task"
+    )
+    parser.add_argument(
+        "--standalone-mode", action="store_true", default=False,
+        help="Do not use any network connects, assume everything is pre-installed"
+    )
+    parser.add_argument(
+        "--child-report-tags", type=str, nargs="+", default=None,
+        help="List of tags to send with the status reports from a worker that runs a task"
+    )
+
    return parser.parse_args()


@@ -85,9 +98,14 @@ def main():
        user_props_cb=user_props_cb, overrides_yaml=args.overrides_yaml, clearml_conf_file=args.pod_clearml_conf,
        template_yaml=args.template_yaml, extra_bash_init_script=K8sIntegration.get_ssh_server_bash(
            ssh_port_number=args.ssh_server_port) if args.ssh_server_port else None,
-        namespace=args.namespace, max_pods_limit=args.max_pods or None,
+        namespace=args.namespace, max_pods_limit=args.max_pods or None
+    )
+    k8s.k8s_daemon(
+        args.queue,
+        use_owner_token=args.use_owner_token,
+        standalone_mode=args.standalone_mode,
+        child_report_tags=args.child_report_tags
    )
-    k8s.k8s_daemon(args.queue)


 if __name__ == "__main__":
--- a/docker/k8s-glue/build-resources/setup.sh
+++ b/docker/k8s-glue/build-resources/setup.sh
@@ -2,13 +2,17 @@

 chmod +x /root/entrypoint.sh

-apt-get update -y
-apt-get dist-upgrade -y
-apt-get install -y curl unzip less locales
+apt-get update -qqy
+apt-get dist-upgrade -qqy
+apt-get install -qqy curl unzip less locales

 locale-gen en_US.UTF-8

-apt-get install -y curl python3-pip git
+apt-get update -qqy
+apt-get install -qqy curl gcc python3-dev python3-pip apt-transport-https lsb-release openssh-client git gnupg
+rm -rf /var/lib/apt/lists/*
+apt clean
+
 python3 -m pip install -U pip
-python3 -m pip install clearml-agent
-python3 -m pip install -U "cryptography>=2.9"
+python3 -m pip install --no-cache-dir clearml-agent
+python3 -m pip install -U --no-cache-dir "cryptography>=2.9"
--- a/docker/k8s-glue/glue-build/Dockerfile.alpine
+++ b/docker/k8s-glue/glue-build/Dockerfile.alpine
@@ -1,4 +1,4 @@
-ARG TAG=3.7.12-alpine3.15
+ARG TAG=3.7.17-alpine3.18

 FROM python:${TAG} as build

@@ -20,7 +20,7 @@ FROM python:${TAG} as target

 WORKDIR /app

-ARG KUBECTL_VERSION=1.22.4
+ARG KUBECTL_VERSION=1.24.0

 # Not sure about these ENV vars
 # ENV LC_ALL=en_US.UTF-8
--- a/docker/k8s-glue/glue-build/Dockerfile.bullseye
+++ b/docker/k8s-glue/glue-build/Dockerfile.bullseye
@@ -1,4 +1,4 @@
-ARG TAG=3.7.12-slim-bullseye
+ARG TAG=3.7.17-slim-bullseye

 FROM python:${TAG} as target

--- a/docker/k8s-glue/glue-build/k8s_glue_example.py
+++ b/docker/k8s-glue/glue-build/k8s_glue_example.py
@@ -65,6 +65,10 @@ def parse_args():
        help="Limit the maximum number of pods that this service can run at the same time."
             "Should not be used with ports-mode"
    )
+    parser.add_argument(
+        "--use-owner-token", action="store_true", default=False,
+        help="Generate and use task owner token for the execution of each task"
+    )
    return parser.parse_args()


@@ -87,7 +91,7 @@ def main():
            ssh_port_number=args.ssh_server_port) if args.ssh_server_port else None,
        namespace=args.namespace, max_pods_limit=args.max_pods or None,
    )
-    k8s.k8s_daemon(args.queue)
+    k8s.k8s_daemon(args.queue, use_owner_token=args.use_owner_token)


 if __name__ == "__main__":
--- a/docs/clearml.conf
+++ b/docs/clearml.conf
@@ -93,25 +93,43 @@ agent {
        # extra_index_url: ["https://allegroai.jfrog.io/clearml/api/pypi/public/simple"]
        extra_index_url: []

+        # additional flags to use when calling pip install, example: ["--use-deprecated=legacy-resolver", ]
+        # extra_pip_install_flags: []
+
+        # control the pytorch wheel resolving algorithm, options are: "pip", "direct", "none"
+        # Override with environment variable CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE
+        # "pip" (default): would automatically detect the cuda version, and supply pip with the correct
+        # extra-index-url, based on pytorch.org tables
+        # "direct": would resolve a direct link to the pytorch wheel by parsing the pytorch.org pip repository
+        # and matching the automatically detected cuda version with the required pytorch wheel.
+        # if the exact cuda version is not found for the required pytorch wheel, it will try
+        # a lower cuda version until a match is found
+        # "none": No resolver used, install pytorch like any other package
+        # pytorch_resolve: "pip"
+
        # additional conda channels to use when installing with conda package manager
        conda_channels: ["pytorch", "conda-forge", "defaults", ]
        # conda_full_env_update: false
        # conda_env_as_base_docker: false

        # set the priority packages to be installed before the rest of the required packages
+        # Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
        # priority_packages: ["cython", "numpy", "setuptools", ]

        # set the optional priority packages to be installed before the rest of the required packages,
        # In case a package installation fails, the package will be ignored,
        # and the virtual environment process will continue
+        # Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
        # priority_optional_packages: ["pygobject", ]

        # set the post packages to be installed after all the rest of the required packages
+        # Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
        # post_packages: ["horovod", ]

        # set the optional post packages to be installed after all the rest of the required packages,
        # In case a package installation fails, the package will be ignored,
        # and the virtual environment process will continue
+        # Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
        # post_optional_packages: []

        # set to True to support torch nightly build installation,
@@ -168,6 +186,7 @@ agent {

    # optional arguments to pass to docker image
    # these are local for this agent and will not be updated in the experiment's docker_cmd section
+    # You can also pass host environments into the container with ["-e", "HOST_NAME=$HOST_NAME"]
    # extra_docker_arguments: ["--ipc=host", "-v", "/mnt/host/data:/mnt/data"]

    # optional shell script to run in docker when started before the experiment is started
@@ -178,13 +197,19 @@ agent {
    # change to false to skip installation and decrease docker spin up time
    # docker_install_opencv_libs: true

+    # Allow passing host environments into docker container with Task's docker container args
+    # Example "-e HOST_NAME=$HOST_NAME"
+    # NOTICE this might introduce security risk allowing access to keys/secret on the host machine1
+    # Use with care!
+    # docker_allow_host_environ: false
+
    # set to true in order to force "docker pull" before running an experiment using a docker image.
    # This makes sure the docker image is updated.
    docker_force_pull: false

    default_docker: {
        # default docker image to use when running in docker mode
-        image: "nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04"
+        image: "nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04"

        # optional arguments to pass to docker image
        # arguments: ["--ipc=host"]
@@ -194,7 +219,7 @@ agent {
        # enterprise version only
        # match_rules: [
        #     {
-        #         image: "nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04"
+        #         image: "nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04"
        #         arguments: "-e define=value"
        #         match: {
        #             script{
@@ -215,7 +240,7 @@ agent {
        #         }
        #     },
        #     {
-        #         image: "nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04"
+        #         image: "nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04"
        #         arguments: "-e define=value"
        #         match: {
        #             # must match all requirements (not partial)
@@ -228,8 +253,6 @@ agent {
        #                 # no repository matching required
        #                 repository: ""
        #             }
-        #             # no container image matching required (allow to replace one requested container with another)
-        #             container: ""
        #             # no repository matching required
        #             project: ""
        #         }
@@ -283,7 +306,7 @@ sdk {

    storage {
        cache {
-            # Defaults to system temp folder / cache
+            # Defaults to <system_temp_folder>/clearml_cache
            default_base_dir: "~/.clearml/cache"
        }

@@ -469,7 +492,8 @@ sdk {
 #  target_format: format used to encode contents before writing into the target file. Supported values are json,
 #                 yaml, yml and bytes (in which case the file will be written in binary mode). Default is text mode.
 #  overwrite: overwrite the target file in case it exists. Default is true.
-#
+#  mode: file-system mode to be applied to the file after its creation. The mode string will be parsed into an
+#        integer (e.g. "0o777" for -rwxrwxrwx)
 # Example:
 #   files {
 #     myfile1 {
--- a/docs/screenshots.gif
+++ b/docs/screenshots.gif
--- a/docs/trains.conf
+++ b/docs/trains.conf
@@ -146,7 +146,7 @@ sdk {

    storage {
        cache {
-            # Defaults to system temp folder / cache
+            # Defaults to <system_temp_folder>/clearml_cache
            default_base_dir: "~/.clearml/cache"
        }

--- a/examples/dynamic_cloud_cluster.ipynb
+++ b/examples/dynamic_cloud_cluster.ipynb
@@ -156,7 +156,7 @@
    "TRAINS_GIT_PASS = \"\"\n",
    "\n",
    "# Additional fields for trains.conf file created on the remote instance\n",
-    "# for example: 'agent.default_docker.image: \"nvidia/cuda:10.0-cudnn7-runtime\"'\n",
+    "# for example: 'agent.default_docker.image: \"nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04\"'\n",
    "EXTRA_TRAINS_CONF = \"\"\"\n",
    "\"\"\"\n",
    "\n",
@@ -584,4 +584,4 @@
 },
 "nbformat": 4,
 "nbformat_minor": 2
-}
+}
--- a/requirements.txt
+++ b/requirements.txt
@@ -8,7 +8,7 @@ pyparsing>=2.0.3,<3.1.0
 python-dateutil>=2.4.2,<2.9.0
 pyjwt>=2.4.0,<2.7.0
 PyYAML>=3.12,<6.1
-requests>=2.20.0,<2.29.0
+requests>=2.29.0,<=2.31.0
 six>=1.13.0,<1.17.0
 typing>=3.6.4,<3.8.0 ; python_version < '3.5'
 urllib3>=1.21.1,<1.27.0
--- a/setup.py
+++ b/setup.py
@@ -62,6 +62,7 @@ setup(
        'Programming Language :: Python :: 3.8',
        'Programming Language :: Python :: 3.9',
        'Programming Language :: Python :: 3.10',
+        'Programming Language :: Python :: 3.11',
        'License :: OSI Approved :: Apache Software License',
    ],
Author	SHA1	Message	Date
allegroai	58e0dc42ec	Version bump to v1.6.0	2023-09-05 15:05:11 +03:00
allegroai	d16825029d	Add new pytorch no resolver mode and CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE to change resolver on a Task basis, now supports "pip", "direct", "none"	2023-09-02 17:45:10 +03:00
allegroai	fb639afcb9	Fix PyTorch extra index pip resolver	2023-09-02 17:43:41 +03:00
allegroai	eefb94d1bc	Add Python 3.11 support	2023-09-02 17:42:27 +03:00
Alex Burlacu	f1e9266075	Adjust docker image versions in a couple more places	2023-08-24 19:03:24 +03:00
Alex Burlacu	e1e3c84a8d	Update docker versions	2023-08-24 19:01:26 +03:00
Alex Burlacu	ed1356976b	Move extra configurations to Worker init to make sure all available configurations can be overridden	2023-08-24 19:00:36 +03:00
Alex Burlacu	2b815354e0	Improve file mode comment	2023-08-24 18:53:00 +03:00
Alex Burlacu	edae380a9e	Version bump	2023-08-24 18:51:47 +03:00
Alex Burlacu	946e9d9ce9	Fix invalid reference	2023-08-24 18:51:27 +03:00
jday1	a56343ffc7	Upgrade requests library (#162 ) * Upgrade requests * Modified package requirement * Modified package requirement	2023-08-01 10:41:22 +03:00
allegroai	159a6e9a5a	Fix runtime property overriding existing properties	2023-07-20 10:41:15 +03:00
pollfly	6b7ee12dc1	Edit README (#156 )	2023-07-19 16:51:14 +03:00
allegroai	3838247716	Update k8s glue docker build resources	2023-07-19 16:47:50 +03:00
pollfly	6e7d35a42a	Improve configuration files (#160 )	2023-07-11 10:32:01 +03:00
allegroai	4c056a17b9	Add support for k8s jobs execution Strip docker container obtained from task in k8s apply	2023-07-04 14:45:00 +03:00
allegroai	21d98afca5	Add support for extra docker arguments referencing machines environment variables using the agent.docker_allow_host_environ configuration option to allow users to also be able to use $ENV in the task's docker arguments	2023-07-04 14:42:28 +03:00
allegroai	6a1bf11549	Fix Task docker arguments passed twice	2023-07-04 14:41:07 +03:00
allegroai	7115a9b9a7	Add CLEARML_EXTRA_PIP_INSTALL_FLAGS / agent.package_manager.extra_pip_install_flags to control additional pip install flags Fix pip version marking in "installed packages" is now preserved for and reinstalled	2023-07-04 14:39:40 +03:00
allegroai	450df2f8d3	Support skipping agent pip upgrade in container bash script using the CLEARML_AGENT_NO_UPDATE env var	2023-07-04 14:38:50 +03:00
allegroai	ccf752c4e4	Add support for setting mode on files applied by the agent	2023-07-04 14:37:58 +03:00
allegroai	3ed63e2154	Fix docker container backwards compatibility for API <2.13 Fix default docker match rules resolver (used incorrect field "container" instead of "image") Remove "container" (image) match rule option from default docker image resolver	2023-07-04 14:37:18 +03:00
allegroai	a535f93cd6	Add support for CLEARML_AGENT_FORCE_MAX_API_VERSION for testing	2023-07-04 14:35:54 +03:00
allegroai	b380ec54c6	Improve config file comments	2023-07-04 14:34:43 +03:00
allegroai	a1274299ce	Add support for CLEARML_AGENT_EXTRA_DOCKER_LABELS env var	2023-07-03 11:08:59 +03:00
allegroai	c77224af68	Add support for task field injection into container docker name	2023-07-03 11:07:12 +03:00
allegroai	95dadca45c	Refactor k8s glue running/used pods getter	2023-05-21 22:56:12 +03:00
allegroai	685918fd9b	Version bump to v1.5.3rc3	2023-05-21 22:54:38 +03:00
allegroai	bc85ddf78d	Fix pytorch direct resolve replacing wheel link with directly installed version	2023-05-21 22:53:51 +03:00
allegroai	5b5fb0b8a6	Add `agent.package_manager.pytorch_resolve` configuration setting with `pip` or `direct` values. `pip` sets extra index based on cuda and lets pip resolve, `direct` is the previous parsing algorithm that does the matching and downloading (default `pip`)	2023-05-21 22:53:11 +03:00
allegroai	fec0ce1756	Better message for agent init when an existing clearml.conf is found	2023-05-21 22:51:11 +03:00
allegroai	1e09b88b7a	Add alias `CLEARML_AGENT_DOCKER_AGENT_REPO` env var for the `FORCE_CLEARML_AGENT_REPO` env var	2023-05-21 22:50:01 +03:00
allegroai	b6ca0fa6a5	Print error on resource monitor failure	2023-05-11 16:18:11 +03:00
allegroai	307ec9213e	Fix git+ssh:// links inside installed packages not being converted properly to HTTPS authenticated and vice versa	2023-05-11 16:16:51 +03:00
allegroai	a78a25d966	Support new `Retry.DEFAULT_BACKOFF_MAX` in a backwards-compatible way	2023-05-11 16:16:18 +03:00
allegroai	ebb6231f5a	Add CLEARML_AGENT_STANDALONE_CONFIG_BC to support backwards compatibility in standalone mode	2023-05-11 16:15:06 +03:00
pollfly	e1d65cb280	Update clearml-agent gif (#137 )	2023-04-10 10:58:10 +03:00