Compare commits

...

108 Commits

Author SHA1 Message Date
clearml
3273f76b46 Version bump to v1.9.2 2024-10-28 18:33:04 +02:00
clearml
9af0f9fe41 Fix reload method not found in the config object 2024-10-28 18:12:22 +02:00
clearml
205cd47cb9 Fix task session creation to use req_token_expiration_sec instead of the default value 2024-10-28 18:11:42 +02:00
clearml
0ff428bb96 Fix report index not advancing in resource monitoring, which prevented more than one GPU from being reported 2024-10-28 18:11:00 +02:00
Matteo Destro
bf8d9c96e9 Handle OSError when checking for is_file (#215) 2024-10-13 10:08:03 +03:00
allegroai
a88487ff25 Add support for pip legacy resolver for versions specified in the agent.package_manager.pip_legacy_resolver configuration option
Add skip existing packages
2024-09-22 22:36:06 +03:00
Jake Henning
785e22dc87 Version bump to v1.9.1 2024-09-02 01:04:49 +03:00
Jake Henning
6a2b778d53 Add default pip version support for Python 3.12 2024-09-02 01:03:52 +03:00
allegroai
b2c3702830 Version bump to v1.9.0 2024-08-28 23:18:26 +03:00
allegroai
6302d43990 Add support for skipping container apt installs using CLEARML_AGENT_SKIP_CONTAINER_APT env var in k8s
Add runtime callback support for setting runtime properties per task in k8s
Fix remove task from pending queue and set to failed when kubectl apply fails
2024-08-27 23:01:27 +03:00
allegroai
760bbca74e Fix failed Task in services mode logged "User aborted" instead of failed, add Task reason string 2024-08-27 22:56:37 +03:00
allegroai
e63fd31420 Fix string format 2024-08-27 22:55:49 +03:00
allegroai
2ff9985db7 Add user ID to the vault loading print 2024-08-27 22:55:32 +03:00
allegroai
b8c762401b Fix use same state transition if supported by the server (instead of stopping the task before re-enqueue) 2024-08-27 22:54:45 +03:00
allegroai
99e1e54f94 Add support for tasks containing only bash script or python module command 2024-08-27 22:53:14 +03:00
allegroai
a4d3b5bad6 Fix only set Task started status on node rank 0 2024-08-27 22:52:31 +03:00
allegroai
b21665ed6e Fix do not cache venv cache if venv/python skip env var was set 2024-08-27 22:52:01 +03:00
Surya Teja
f877aa96e2 Update Docker base image to Ubuntu 22.04 and Kubectl to 1.29.3 (#201) 2024-07-29 18:41:50 +03:00
pollfly
f99344d194 Add queue priority info to CLI help (#211)
* add queue priority comment

* Add --order-fairness info

---------

Co-authored-by: Jake Henning <59198928+jkhenning@users.noreply.github.com>
2024-07-29 18:40:38 +03:00
allegroai
d9f2a1999a Fix Only send pip freeze update on RANK 0, only update task status on exit on RANK 0 2024-07-29 17:40:24 +03:00
Valentin Schabschneider
79d0abe707 Add NO_DOCKER flag to clearml-agent-services entrypoint (#206) 2024-07-26 19:09:54 +03:00
allegroai
6213ef4c02 Add /bin/bash -c "command" support. Task binary should be set to /bin/bash and entry_point should be set to -c command 2024-07-24 18:00:13 +03:00
allegroai
aef6aa9fc8 Fix a rare race condition where popping an aborted Task from a queue did not set it to started before the watchdog killed it (does not happen in k8s/slurm) 2024-07-24 17:59:46 +03:00
allegroai
0bb267115b Add venvs_cache.path mount override for non-root containers (use: agent.docker_internal_mounts.venvs_cache) 2024-07-24 17:59:18 +03:00
allegroai
f89a92556f Fix check logger is not None 2024-07-24 17:55:02 +03:00
allegroai
8ba4d75e80 Add CLEARML_TASK_ID and auth token to pod env vars in original entrypoint flow 2024-07-24 17:47:48 +03:00
allegroai
edc333ba5f Add K8S_GLUE_POD_USE_IMAGE_ENTRYPOINT to allow running images without overriding the entrypoint (useful for agents using prebuilt images in k8s) 2024-07-24 17:46:27 +03:00
allegroai
2f0553b873 Fix CLEARML_MULTI_NODE_SINGLE_TASK should be read once not every reported line 2024-07-24 17:45:02 +03:00
allegroai
b2a4bf08ac Fix pass --docker only (i.e. no default container image) for --dynamic-gpus feature 2024-07-24 17:44:35 +03:00
allegroai
f18c6b809f Fix slurm multi-node rank detection 2024-07-24 17:44:05 +03:00
allegroai
cd5b4d2186 Add "-m module args" in script entry now supports standalone script, standalone script is converted to "untitled.py" by default or if specified in working_dir such as <dir>:<target_file> for example ".:standalone.py" 2024-07-24 17:43:21 +03:00
allegroai
5f1bab6711 Add default docker match_rules for enterprise users,
NOTICE: matching_rules are ignored if `--docker container` is passed in command line
2024-07-24 17:42:55 +03:00
allegroai
ab9b9db0c9 Add CLEARML_MULTI_NODE_SINGLE_TASK (values -1, 0, 1, 2) for easier multi-node single Task workloads 2024-07-24 17:42:25 +03:00
allegroai
93df021108 Add support for .ipynb script entry files (install nbconvert at runtime, convert to python and execute the python script), including CLEARML_AGENT_FORCE_TASK_INIT patching of ipynb files (post python conversion) 2024-07-24 17:41:59 +03:00
allegroai
700ae85de0 Fix file mode should be optional in configuration files section 2024-07-24 17:41:06 +03:00
allegroai
f367c5a571 Fix git fetch did not update new tags #209 2024-07-24 17:39:53 +03:00
allegroai
ebc5944b44 Fix setting tasks that someone just marked as aborted to started - only force Task to started after dequeuing it, otherwise leave it as is 2024-07-24 17:39:26 +03:00
allegroai
8f41002845 Add task.script.binary /bin/bash support
Fix -m module $env to support parsing the $env before launching
2024-07-24 17:37:26 +03:00
allegroai
7e8670d57f Find the correct python version when using a pre-installed python environment 2024-07-21 14:10:38 +03:00
allegroai
77de343863 Use "venv" module if virtualenv is not supported 2024-07-19 13:18:07 +03:00
allegroai
6b31883e45 Fix queue resolution when no queue is passed 2024-05-15 18:30:24 +03:00
allegroai
e48b4756fa Add Python 3.12 support 2024-05-15 18:25:29 +03:00
allegroai
47147e3237 Fix cached repositories were not passing user/token when pulling, agent.vcs_cache.clone_on_pull_fail now defaults to false 2024-04-19 23:50:17 +03:00
allegroai
41fc4ec646 Fix disabling vcs cache should not add vcs mount point to container 2024-04-19 23:48:50 +03:00
allegroai
441e5a73b2 Fix conda env should not be cached when installing into the base conda env or into an existing conda env 2024-04-19 23:48:10 +03:00
allegroai
27ed6821c4 Add mirrorD config files to gitignore 2024-04-19 23:47:34 +03:00
allegroai
10c6629982 Support skipping re-enqueue on suspected preempted k8s pods 2024-04-19 23:46:57 +03:00
allegroai
6fb48a4c6e Revert version to v1.8.1 2024-04-19 23:44:31 +03:00
allegroai
105ade31f1 Version bump to v1.8.2 2024-04-14 18:18:10 +03:00
allegroai
502e266b6b Fix polling interval missing when not using daemon mode 2024-04-14 18:17:57 +03:00
allegroai
cd9a3b9f4e Version bump to v1.8.1 2024-04-12 20:30:11 +03:00
allegroai
4179ac5234 Fix git pulling on cached invalid git entry. On error, re-clone the entire repo again (disable using "agent.vcs_cache.clone_on_pull_fail: false") 2024-04-12 20:29:36 +03:00
Liron Ilouz
98cc0d86ba Add option to set daemon polling interval (#197)
* add option to set worker polling interval

* polling interval minimum value

---------

Co-authored-by: Liron <liron@tapwithus.com>
2024-04-03 14:33:52 +03:00
allegroai
293cbc0ac6 Version bump to v1.8.0 2024-04-02 16:38:22 +03:00
allegroai
4387ed73b6 Fix None handling when no limits exist 2024-04-02 16:36:09 +03:00
allegroai
43443ccf08 Pass task_id when resolving k8s template 2024-04-01 11:37:01 +03:00
allegroai
3d43240c8f Improve conda package manager support
Add agent.package_manager.use_conda_base_env (CLEARML_USE_CONDA_BASE_ENV) allowing the base conda environment to be used (instead of installing a new one)
Fix conda support for python packages with markers and multiple specifications
Added "nvidia" conda channel and support for cuda-toolkit >= 12
2024-04-01 11:36:26 +03:00
allegroai
fc58ba947b Update requirements 2024-04-01 11:35:07 +03:00
allegroai
22672d2444 Improve GPU monitoring 2024-03-17 19:13:57 +02:00
allegroai
6a4fcda1bf Improve resource monitor 2024-03-17 19:06:57 +02:00
allegroai
a4ebf8293d Fix role support 2024-03-17 19:00:59 +02:00
allegroai
10fb157d58 Fix queue handling for backwards compatibility 2024-03-17 19:00:18 +02:00
allegroai
56058beec2 Update deprecated references 2024-03-17 18:59:48 +02:00
allegroai
9f207d5155 Fix dynamic GPU sometimes misses the initial print - if we found the closing print it should be good enough to signal everything is okay 2024-03-17 18:59:04 +02:00
allegroai
8a2bea3c14 Fix comment lines (#) are not ignored in docker startup bash script 2024-03-17 18:58:14 +02:00
allegroai
f1f9278928 Fix torch resolver settings applied to PytorchRequirement instance are not used 2024-03-17 18:56:47 +02:00
nfzd
2de1c926bf Use correct Python version in Poetry init (#179)
* Use correct Python version in Poetry init

* Use interpreter override if configured

* Don't use agent.python_binary if it is empty

---------

Co-authored-by: Michael Mueller <michael.mueller@wsa.com>
2024-03-11 23:36:10 +02:00
allegroai
e1104e60bb Update README 2024-03-11 16:58:38 +02:00
ae-ae
8b2970350c Fix FileNotFoundException crash in find_python_executable_for_version… (#192)
* Fix FileNotFoundException crash in find_python_executable_for_version (#164)

* Add a Windows check for error 9009 when searching for Python

---------

Co-authored-by: ae-ae <12037964+ae-ae@users.noreply.github.com>
2024-03-06 09:17:31 +02:00
FeU-aKlos
a2758250b2 Fix queue handling in K8sIntegration and k8s_glue_example.py (#183)
* Fix queue handling in K8sIntegration and k8s_glue_example.py

* Update Dockerfile and k8s_glue_example.py

* Add executable permission to provider_entrypoint.sh

* ADJUST docker

* Update clearml-agent version

* ADDJUST stuff

* ADJUST queue string handling

* DELETE pip install from own repo
2024-02-29 14:20:54 +02:00
allegroai
01e8ffd854 Improve venv cache handling:
- Add FileLock readonly mode, default is write mode (i.e. exclusive lock, preserving behavior)
- The venv cache now uses a readonly lock when copying folders from the venv cache into the target folder, enabling multiple-read, single-write operation
- Do not lock the cache folder if we do not need to delete old entries
2024-02-29 14:19:24 +02:00
allegroai
74edf6aa36 Fix IOError on file lock when using shared folder 2024-02-29 14:16:25 +02:00
allegroai
09c5ef99af Fix Python 3.12 support by removing distutil imports 2024-02-29 14:12:21 +02:00
allegroai
17ae28a62f Add agent.venvs_cache.lock_timeout to control the venv cache folder lock timeout (in seconds, default 30) 2024-02-29 14:06:06 +02:00
allegroai
059a9385e9 Fix delete temp console pipe log files after Task execution is completed. This is important for long lasting services agents, avoiding collecting temp files on host machine 2024-02-29 14:03:30 +02:00
allegroai
9a321a410f Add CLEARML_AGENT_FORCE_TASK_INIT to allow runtime patching of script even if no repo is specified and the code is running a preinstalled docker 2024-02-29 14:02:27 +02:00
allegroai
919013d4fe Add CLEARML_AGENT_FORCE_POETRY to allow forcing poetry even when using pip requirements manager 2024-02-29 13:59:26 +02:00
allegroai
05530b712b Fix sanitization did not cover all keys 2024-02-29 13:56:14 +02:00
allegroai
8d15fd8798 Fix pippip is returned as a pip version if no value exists in agent.package_manager.pip_version 2024-02-29 13:55:41 +02:00
allegroai
b34329934b Add queue ID report before pulling task 2024-02-29 13:52:17 +02:00
allegroai
85049d8705 Move configuration sanitization settings to the default config file 2024-02-29 13:51:40 +02:00
allegroai
6fbd70786e Add protection for truncate() call 2024-02-29 13:51:09 +02:00
allegroai
05a65548da Fix agent.enable_git_ask_pass does not show in configuration dump 2024-02-29 13:50:52 +02:00
allegroai
6657003d65 Fix using controller-uid will not always return required pods 2024-02-29 13:49:30 +02:00
allegroai
95dde6ca0c Update README 2024-01-25 11:27:56 +02:00
allegroai
c9fc092f4e Support force_system_packages argument in k8s glue class 2023-12-26 10:12:32 +02:00
allegroai
432ee395e1 Version bump to v1.7.0 2023-12-20 18:08:38 +02:00
allegroai
98fc4f0fb9 Add agent.resource_monitoring.disk_use_path configuration option to allow monitoring a different volume than the one containing the home folder 2023-12-20 17:49:33 +02:00
allegroai
111e774c21 Add extra_index_url sanitization in configuration printout 2023-12-20 17:49:04 +02:00
allegroai
3dd8d783e1 Fix agent.git_host setting will cause git@domain URLs to not be replaced by SSH URLs since furl cannot parse them to obtain host 2023-12-20 17:48:18 +02:00
allegroai
7c3e420df4 Add git clone verbosity using CLEARML_AGENT_GIT_CLONE_VERBOSE env var 2023-12-20 17:47:52 +02:00
allegroai
55b065a114 Update GPU stats and pynvml support 2023-12-20 17:47:19 +02:00
allegroai
faa97b6cc2 Set worker ID in k8s glue mode 2023-12-20 17:45:34 +02:00
allegroai
f5861b1e4a Change default agent.enable_git_ask_pass to True 2023-12-20 17:44:41 +02:00
allegroai
030cbb69f1 Fix check if process return code is SIGKILL (-9 or 137) and abort callback was called, do not mark as failed but as aborted 2023-12-20 17:43:02 +02:00
allegroai
564f769ff7 Add agent.docker_args_extra_precedes_task, agent.protected_docker_extra_args
to prevent the same switch from being used by both `extra_docker_args` and a Task's docker args
2023-12-20 17:42:36 +02:00
pollfly
2c7f091e57 Update example (#177)
* Edit README

* Edit README

* small edits

* update example

* update example

* update example
2023-12-09 12:52:44 +02:00
allegroai
dd5d24b0ca Add CLEARML_AGENT_TEMP_STDOUT_FILE_DIR to allow specifying temp dir used for storing agent log files and temporary log files (daemon and execution) 2023-11-14 11:45:13 +02:00
allegroai
996bb797c3 Add env var in case we're running a service task 2023-11-14 11:44:36 +02:00
allegroai
9ad49a0d21 Fix KeyError if container does not contain the arguments field 2023-11-01 15:11:07 +02:00
allegroai
ba4fee7b19 Fix agent.package_manager.poetry_install_extra_args are used in all Poetry commands and not just in install (#173) 2023-11-01 15:10:40 +02:00
allegroai
0131db8b7d Add support for resource_applied() callback in k8s glue
Add support for sending log events with k8s-provided timestamps
Refactor env vars infrastructure
2023-11-01 15:10:08 +02:00
allegroai
d2384a9a95 Add example and support for prebuilt containers including services-mode support with overrides CLEARML_AGENT_FORCE_CODE_DIR CLEARML_AGENT_FORCE_EXEC_SCRIPT 2023-11-01 15:05:57 +02:00
allegroai
5b86c230c1 Fix an environment variable that should be set to a numerical value of 0 (i.e. end up as "0" or "0.0") being set to an empty string 2023-11-01 15:04:59 +02:00
allegroai
21e4be966f Fix recursion issue when deep-copying a session 2023-11-01 15:04:24 +02:00
allegroai
9c6cb421b3 When cleaning up pending pods, verify task is still aborted and pod is still pending before deleting the pod 2023-11-01 15:04:01 +02:00
allegroai
52405c343d Fix k8s glue configuration might be contaminated when changed during apply 2023-11-01 15:03:37 +02:00
allegroai
46f0c991c8 Add status reason when aborting before moving to k8s_scheduler queue 2023-11-01 15:02:24 +02:00
60 changed files with 4803 additions and 1913 deletions

2
.gitignore vendored
View File

@@ -14,3 +14,5 @@ dist/
# VSCode
.vscode
# MirrorD
.mirrord

View File

@@ -2,14 +2,17 @@
<img src="https://github.com/allegroai/clearml-agent/blob/master/docs/clearml_agent_logo.png?raw=true" width="250px">
**ClearML Agent - ML-Ops made easy
ML-Ops scheduler & orchestration solution supporting Linux, macOS and Windows**
**ClearML Agent - MLOps/LLMOps made easy
MLOps/LLMOps scheduler & orchestration solution supporting Linux, macOS and Windows**
[![GitHub license](https://img.shields.io/github/license/allegroai/clearml-agent.svg)](https://img.shields.io/github/license/allegroai/clearml-agent.svg)
[![PyPI pyversions](https://img.shields.io/pypi/pyversions/clearml-agent.svg)](https://img.shields.io/pypi/pyversions/clearml-agent.svg)
[![PyPI version shields.io](https://img.shields.io/pypi/v/clearml-agent.svg)](https://img.shields.io/pypi/v/clearml-agent.svg)
[![PyPI Downloads](https://pepy.tech/badge/clearml-agent/month)](https://pypi.org/project/clearml-agent/)
[![Artifact Hub](https://img.shields.io/endpoint?url=https://artifacthub.io/badge/repository/allegroai)](https://artifacthub.io/packages/search?repo=allegroai)
`🌟 ClearML is open-source - Leave a star to support the project! 🌟`
</div>
---
@@ -65,36 +68,39 @@ or [Free tier Hosting](https://app.clear.ml)
### Kubernetes Integration (Optional)
We think Kubernetes is awesome, but it should be a choice. We designed `clearml-agent` so you can run bare-metal or
inside a pod with any mix that fits your environment.
We think Kubernetes is awesome, but it is not a must to get started with remote execution agents and cluster management.
We designed `clearml-agent` so you can run both bare-metal and on top of Kubernetes, in any combination that fits your environment.
Find Dockerfiles in the [docker](./docker) dir and a helm Chart in https://github.com/allegroai/clearml-helm-charts
You can find the Dockerfiles in the [docker folder](./docker) and the helm Chart in https://github.com/allegroai/clearml-helm-charts
#### Benefits of integrating existing K8s with ClearML-Agent
#### Benefits of integrating existing Kubernetes cluster with ClearML
- ClearML-Agent adds the missing scheduling capabilities to K8s
- Allowing for more flexible automation from code
- A programmatic interface for easier learning curve (and debugging)
- Seamless integration with ML/DL experiment manager
- ClearML-Agent adds the missing scheduling capabilities to your Kubernetes cluster
- Users do not need to have direct Kubernetes access!
- Easy learning curve with UI and CLI requiring no DevOps knowledge from end users
- Unlike other solutions, ClearML-Agents work in tandem with other customers of your Kubernetes cluster
- Allows for more flexible automation from code, building pipelines and visibility
- A programmatic interface for easy CI/CD workflows, enabling GitOps to trigger jobs inside your cluster
- Seamless integration with the ClearML ML/DL/GenAI experiment manager
- Web UI for customization, scheduling & prioritization of jobs
- **Enterprise Features**: RBAC, vault, multi-tenancy, scheduler, quota management, fractional GPU support
**Two K8s integration flavours**
**Run the agent in Kubernetes Glue mode and map ClearML jobs directly to K8s jobs:**
- Use the [ClearML Agent Helm Chart](https://github.com/allegroai/clearml-helm-charts/tree/main/charts/clearml-agent) to spin an agent pod acting as a controller
- Or run the [clearml-k8s glue](https://github.com/allegroai/clearml-agent/blob/master/examples/k8s_glue_example.py) on
a Kubernetes cpu node
- The clearml-k8s glue pulls jobs from the ClearML job execution queue and prepares a Kubernetes job (based on provided
yaml template)
- Inside each pod the clearml-agent will install the job (experiment) environment and spin and monitor the
experiment's process, fully visible in the clearml UI
- Benefits: Kubernetes full view of all running jobs in the system
- **Enterprise Features**
- Full scheduler features added on Top of Kubernetes, with quota/over-quota management, priorities and order.
- Fractional GPU support, allowing multiple isolated containers sharing the same GPU with memory/compute limit per container
- Spin ClearML-Agent as a long-lasting service pod:
- Use [clearml-agent](https://hub.docker.com/r/allegroai/clearml-agent) docker image
- map docker socket into the pod (soon replaced by [podman](https://github.com/containers/podman))
- Allow the clearml-agent to manage sibling dockers
- Benefits: full use of the ClearML scheduling, no need to worry about wrong container images / lost pods etc.
- Downside: sibling containers
- Kubernetes Glue, map ClearML jobs directly to K8s jobs:
- Run the [clearml-k8s glue](https://github.com/allegroai/clearml-agent/blob/master/examples/k8s_glue_example.py) on
a K8s cpu node
- The clearml-k8s glue pulls jobs from the ClearML job execution queue and prepares a K8s job (based on provided
yaml template)
- Inside the pod itself the clearml-agent will install the job (experiment) environment and spin and monitor the
experiment's process
- Benefits: Kubernetes full view of all running jobs in the system
- Downside: No real scheduling (k8s scheduler), no docker image verification (post-mortem only)
### SLURM (Optional)
Yes! Slurm integration is available, check the [documentation](https://clear.ml/docs/latest/docs/clearml_agent/#slurm) for further details
### Using the ClearML Agent

View File

@@ -45,8 +45,8 @@
# it solves passing user/token to git submodules.
# this is a safer way to ensure multiple users using the same repository will
# not accidentally leak credentials
# Only supported on Linux systems, it will be the default in future releases
# enable_git_ask_pass: false
# Note: this is only supported on Linux systems
# enable_git_ask_pass: true
# in docker mode, if container's entrypoint automatically activated a virtual environment
# use the activated virtual environment and install everything there
@@ -66,7 +66,7 @@
type: pip,
# specify pip version to use (examples "<20.2", "==19.3.1", "", empty string will install the latest version)
pip_version: ["<20.2 ; python_version < '3.10'", "<22.3 ; python_version >= '3.10'"],
pip_version: ["<20.2 ; python_version < '3.10'", "<22.3 ; python_version >= '3.10' and python_version <= '3.11'", ">=23,<24.3 ; python_version >= '3.12'"]
# specify poetry version to use (examples "<2", "==1.1.1", "", empty string will install the latest version)
# poetry_version: "<2",
# poetry_install_extra_args: ["-v"]
@@ -80,6 +80,14 @@
# additional artifact repositories to use when installing python packages
# extra_index_url: ["https://allegroai.jfrog.io/clearml/api/pypi/public/simple"]
# turn on the "--use-deprecated=legacy-resolver" flag for pip, to avoid package dependency version mismatch
# if any version restriction is matched we add the "--use-deprecated=legacy-resolver" flag
# example: pip_legacy_resolver = [">=20.3,<24.3", ">99"]
# if pip==20.2 or pip==29.0 is installed we do nothing,
# if pip==21.1 or pip==101.1 is installed the flag is added
# disable the feature by passing an empty list
pip_legacy_resolver = [">=20.3,<24.3"]
# control the pytorch wheel resolving algorithm, options are: "pip", "direct", "none"
# Override with environment variable CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE
# "pip" (default): would automatically detect the cuda version, and supply pip with the correct
@@ -92,7 +100,7 @@
# pytorch_resolve: "pip"
# additional conda channels to use when installing with conda package manager
conda_channels: ["pytorch", "conda-forge", "defaults", ]
conda_channels: ["pytorch", "conda-forge", "nvidia", "defaults", ]
# If set to true, Task's "installed packages" are ignored,
# and the repository's "requirements.txt" is used instead
@@ -177,6 +185,13 @@
# these are local for this agent and will not be updated in the experiment's docker_cmd section
# extra_docker_arguments: ["--ipc=host", ]
# Allow the extra docker arg to override task level docker arg (if the same argument is passed on both),
# if set to False, a task docker arg will override the docker extra arg
# docker_args_extra_precedes_task: true
# allows the following task docker args to be overridden by the extra_docker_arguments
# protected_docker_extra_args: ["privileged", "security-opt", "network", "ipc"]
# optional shell script to run in docker when started before the experiment is started
# extra_docker_shell_script: ["apt-get install -y bindfs", ]
@@ -211,6 +226,76 @@
# optional arguments to pass to docker image
# arguments: ["--ipc=host", ]
# Choose the default docker based on the Task properties,
# Notice: Enterprise feature, ignored otherwise
# Examples: 'script.requirements', 'script.binary', 'script.repository', 'script.branch', 'project'
# Notice: Matching is done via regular expression, for example "^searchme$" will match exactly "searchme" string
"match_rules": [
{
"image": "python:3.6-bullseye",
"arguments": "--ipc=host",
"match": {
"script": {
"binary": "python3.6$",
},
}
},
{
"image": "python:3.7-bullseye",
"arguments": "--ipc=host",
"match": {
"script": {
"binary": "python3.7$",
},
}
},
{
"image": "python:3.8-bullseye",
"arguments": "--ipc=host",
"match": {
"script": {
"binary": "python3.8$",
},
}
},
{
"image": "python:3.9-bullseye",
"arguments": "--ipc=host",
"match": {
"script": {
"binary": "python3.9$",
},
}
},
{
"image": "python:3.10-bullseye",
"arguments": "--ipc=host",
"match": {
"script": {
"binary": "python3.10$",
},
}
},
{
"image": "python:3.11-bullseye",
"arguments": "--ipc=host",
"match": {
"script": {
"binary": "python3.11$",
},
}
},
{
"image": "python:3.12-bullseye",
"arguments": "--ipc=host",
"match": {
"script": {
"binary": "python3.12$",
},
}
},
]
}
# set the OS environments based on the Task's Environment section before launching the Task process.
@@ -242,6 +327,20 @@
# cuda_version: 10.1
# cudnn_version: 7.6
# Sanitize configuration printout using these settings
sanitize_config_printout {
# Hide values of configuration keys matching these regexps
hide_secrets: ["^sanitize_config_printout$", "secret", "pass", "token", "account_key", "contents"]
# As above, only show field's value keys if value is a dictionary
hide_secrets_recursive: ["^environment$"]
# Do not hide for keys matching these regexps
dont_hide_secrets: ["^enable_git_ask_pass$"]
# Hide secrets in docker commands, according to the 'agent.hide_docker_command_env_vars' settings
docker_commands: ["^extra_docker_arguments$"]
# Hide password in URLs found in keys matching these regexps (handles single URLs, lists and dictionaries)
urls: ["^extra_index_url$"]
}
# Hide docker environment variables containing secrets when printing out the docker command by replacing their
# values with "********". Turning this feature on will hide the following environment variables values:
# CLEARML_API_SECRET_KEY, CLEARML_AGENT_GIT_PASS, AWS_SECRET_ACCESS_KEY, AZURE_STORAGE_KEY
@@ -268,6 +367,7 @@
pip_cache: "/root/.cache/pip"
poetry_cache: "/root/.cache/pypoetry"
vcs_cache: "/root/.clearml/vcs-cache"
venvs_cache: "/root/.clearml/venvs-cache"
venv_build: "~/.clearml/venvs-builds"
pip_download: "/root/.clearml/pip-download-cache"
}
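A minimal sketch (not part of the configuration file above) of the rule matching described in the pip_legacy_resolver comments, assuming the packaging library is available; the legacy-resolver flag is only added when the installed pip version falls inside one of the configured ranges:
# Illustrative only -- mirrors the examples in the config comment above.
from packaging.specifiers import SpecifierSet
from packaging.version import Version
def needs_legacy_resolver(pip_version, rules=(">=20.3,<24.3", ">99")):
    # The flag is added only if the installed pip version matches one of the rules.
    return any(Version(pip_version) in SpecifierSet(rule) for rule in rules)
print(needs_legacy_resolver("20.2"))   # False -> do nothing
print(needs_legacy_resolver("21.1"))   # True  -> add --use-deprecated=legacy-resolver
print(needs_legacy_resolver("101.1"))  # True  -> matched by ">99"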

View File

@@ -140,7 +140,7 @@
vcs_repo_detect_async: true
# Store uncommitted git/hg source code diff in experiment manifest when training in development mode
# This stores "git diff" or "hg diff" into the experiment's "script.requirements.diff" section
# This stores "git diff" or into the experiment's "script.requirements.diff" section
store_uncommitted_code_diff: true
# Support stopping an experiment in case it was externally stopped, status was changed or task was reset

View File

@@ -1,5 +1,5 @@
from ...backend_config.converters import safe_text_to_bool
from ...backend_config.environment import EnvEntry
from clearml_agent.helper.environment import EnvEntry
from clearml_agent.helper.environment.converters import safe_text_to_bool
ENV_HOST = EnvEntry("CLEARML_API_HOST", "TRAINS_API_HOST")
@@ -11,6 +11,7 @@ ENV_AUTH_TOKEN = EnvEntry("CLEARML_AUTH_TOKEN")
ENV_VERBOSE = EnvEntry("CLEARML_API_VERBOSE", "TRAINS_API_VERBOSE", type=bool, default=False)
ENV_HOST_VERIFY_CERT = EnvEntry("CLEARML_API_HOST_VERIFY_CERT", "TRAINS_API_HOST_VERIFY_CERT", type=bool, default=True)
ENV_CONDA_ENV_PACKAGE = EnvEntry("CLEARML_CONDA_ENV_PACKAGE", "TRAINS_CONDA_ENV_PACKAGE")
ENV_USE_CONDA_BASE_ENV = EnvEntry("CLEARML_USE_CONDA_BASE_ENV", type=bool)
ENV_NO_DEFAULT_SERVER = EnvEntry("CLEARML_NO_DEFAULT_SERVER", "TRAINS_NO_DEFAULT_SERVER", type=bool, default=True)
ENV_DISABLE_VAULT_SUPPORT = EnvEntry('CLEARML_AGENT_DISABLE_VAULT_SUPPORT', type=bool)
ENV_ENABLE_ENV_CONFIG_SECTION = EnvEntry('CLEARML_AGENT_ENABLE_ENV_CONFIG_SECTION', type=bool)
@@ -21,6 +22,9 @@ ENV_INITIAL_CONNECT_RETRY_OVERRIDE = EnvEntry(
'CLEARML_AGENT_INITIAL_CONNECT_RETRY_OVERRIDE', default=True, converter=safe_text_to_bool
)
ENV_FORCE_MAX_API_VERSION = EnvEntry("CLEARML_AGENT_FORCE_MAX_API_VERSION", type=str)
# values are 0/None (task per node), 1/2 (multi-node reporting, colored console), -1 (only report rank 0 node)
ENV_MULTI_NODE_SINGLE_TASK = EnvEntry("CLEARML_MULTI_NODE_SINGLE_TASK", type=int, default=None)
"""
Experimental option to set the request method for all API requests and auth login.

View File

@@ -64,6 +64,8 @@ class Session(TokenManager):
default_key = "EGRTCO8JMSIGI6S39GTP43NFWXDQOW"
default_secret = "x!XTov_G-#vspE*Y(h$Anm&DIc5Ou-F)jsl$PdOyj5wG1&E!Z8"
force_max_api_version = ENV_FORCE_MAX_API_VERSION.get()
server_version = "1.0.0"
user_id = None
# TODO: add requests.codes.gateway_timeout once we support async commits
_retry_codes = [
@@ -191,6 +193,8 @@ class Session(TokenManager):
Session.api_version = str(api_version)
Session.feature_set = str(token_dict.get('feature_set', self.feature_set) or "basic")
Session.server_version = token_dict.get('server_version', self.server_version)
Session.user_id = (token_dict.get("identity") or {}).get("user") or ""
except (jwt.DecodeError, ValueError):
pass
@@ -256,8 +260,9 @@ class Session(TokenManager):
def parse(vault):
# noinspection PyBroadException
try:
print("Loaded {} vault: {}".format(
print("Loaded {} vault{}: {}".format(
vault.get("scope", ""),
"" if not self.user_id else " for user {}".format(self.user_id),
(vault.get("description", None) or "")[:50] or vault.get("id", ""))
)
d = vault.get("data", None)
@@ -341,11 +346,12 @@ class Session(TokenManager):
if self._propagate_exceptions_on_send:
raise
sleep_time = sys_random.uniform(*self._request_exception_retry_timeout)
self._logger.error(
"{} exception sending {} {}: {} (retrying in {:.1f}sec)".format(
type(ex).__name__, method.upper(), url, str(ex), sleep_time
if self._logger:
self._logger.error(
"{} exception sending {} {}: {} (retrying in {:.1f}sec)".format(
type(ex).__name__, method.upper(), url, str(ex), sleep_time
)
)
)
time.sleep(sleep_time)
continue
@@ -364,11 +370,12 @@ class Session(TokenManager):
res.status_code == requests.codes.service_unavailable
and self.config.get("api.http.wait_on_maintenance_forever", True)
):
self._logger.warning(
"Service unavailable: {} is undergoing maintenance, retrying...".format(
host
if self._logger:
self._logger.warning(
"Service unavailable: {} is undergoing maintenance, retrying...".format(
host
)
)
)
continue
break
self._session_requests += 1
@@ -649,11 +656,14 @@ class Session(TokenManager):
"""
Return True if Session.api_version is greater or equal >= to min_api_version
"""
def version_tuple(v):
v = tuple(map(int, (v.split("."))))
return v + (0,) * max(0, 3 - len(v))
return version_tuple(cls.api_version) >= version_tuple(str(min_api_version))
@classmethod
def check_min_server_version(cls, min_server_version):
"""
Return True if Session.server_version is greater or equal >= to min_server_version
"""
return version_tuple(cls.server_version) >= version_tuple(str(min_server_version))
def _do_refresh_token(self, current_token, exp=None):
""" TokenManager abstract method implementation.
Here we ignore the old token and simply obtain a new token.
@@ -731,3 +741,8 @@ class Session(TokenManager):
def propagate_exceptions_on_send(self, value):
# type: (bool) -> None
self._propagate_exceptions_on_send = value
def version_tuple(v):
v = tuple(map(int, (v.split("."))))
return v + (0,) * max(0, 3 - len(v))
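For reference, a quick worked example of the version_tuple padding used by the server/API version checks above (taken directly from the diff; nothing here is new API):
def version_tuple(v):
    v = tuple(map(int, (v.split("."))))
    return v + (0,) * max(0, 3 - len(v))
print(version_tuple("3.22"))                              # (3, 22, 0)
print(version_tuple("3.22") >= version_tuple("3.22.3"))   # False -> server too old
print(version_tuple("3.23") >= version_tuple("3.22.3"))   # True  -> requirement met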

View File

@@ -1,69 +1,8 @@
import base64
from distutils.util import strtobool
from typing import Union, Optional, Any, TypeVar, Callable, Tuple
import six
try:
from typing import Text
except ImportError:
# windows conda-less hack
Text = Any
ConverterType = TypeVar("ConverterType", bound=Callable[[Any], Any])
def text_to_int(value, default=0):
# type: (Any, int) -> int
try:
return int(value)
except (ValueError, TypeError):
return default
def base64_to_text(value):
# type: (Any) -> Text
return base64.b64decode(value).decode("utf-8")
def text_to_bool(value):
# type: (Text) -> bool
return bool(strtobool(value))
def safe_text_to_bool(value):
# type: (Text) -> bool
try:
return text_to_bool(value)
except ValueError:
return bool(value)
def any_to_bool(value):
# type: (Optional[Union[int, float, Text]]) -> bool
if isinstance(value, six.text_type):
return text_to_bool(value)
return bool(value)
def or_(*converters, **kwargs):
# type: (ConverterType, Tuple[Exception, ...]) -> ConverterType
"""
Wrapper that implements an "optional converter" pattern. Allows specifying a converter
for which a set of exceptions is ignored (and the original value is returned)
:param converters: A converter callable
:param exceptions: A tuple of exception types to ignore
"""
# noinspection PyUnresolvedReferences
exceptions = kwargs.get("exceptions", (ValueError, TypeError))
def wrapper(value):
for converter in converters:
try:
return converter(value)
except exceptions:
pass
return value
return wrapper
from clearml_agent.helper.environment.converters import (
base64_to_text,
text_to_bool,
text_to_int,
safe_text_to_bool,
any_to_bool,
or_,
)
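A brief usage sketch of the re-exported converters, based on the behavior of the removed implementations above (module path as shown in the new import):
from clearml_agent.helper.environment.converters import safe_text_to_bool, text_to_int
print(safe_text_to_bool("yes"))     # True  -- strtobool-style parsing
print(safe_text_to_bool("maybe"))   # True  -- unparsable, falls back to bool("maybe")
print(text_to_int("not-a-number")) # 0     -- default returned on ValueError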

View File

@@ -1,111 +1,6 @@
import abc
from typing import Optional, Any, Tuple, Callable, Dict
from clearml_agent.helper.environment import Entry, NotSet
import six
from .converters import any_to_bool
try:
from typing import Text
except ImportError:
# windows conda-less hack
Text = Any
NotSet = object()
Converter = Callable[[Any], Any]
@six.add_metaclass(abc.ABCMeta)
class Entry(object):
"""
Configuration entry definition
"""
@classmethod
def default_conversions(cls):
# type: () -> Dict[Any, Converter]
return {
bool: any_to_bool,
six.text_type: lambda s: six.text_type(s).strip(),
}
def __init__(self, key, *more_keys, **kwargs):
# type: (Text, Text, Any) -> None
"""
:param key: Entry's key (at least one).
:param more_keys: More alternate keys for this entry.
:param type: Value type. If provided, will be used choosing a default conversion or
(if none exists) for casting the environment value.
:param converter: Value converter. If provided, will be used to convert the environment value.
:param default: Default value. If provided, will be used as the default value on calls to get() and get_pair()
in case no value is found for any key and no specific default value was provided in the call.
Default value is None.
:param help: Help text describing this entry
"""
self.keys = (key,) + more_keys
self.type = kwargs.pop("type", six.text_type)
self.converter = kwargs.pop("converter", None)
self.default = kwargs.pop("default", None)
self.help = kwargs.pop("help", None)
def __str__(self):
return str(self.key)
@property
def key(self):
return self.keys[0]
def convert(self, value, converter=None):
# type: (Any, Converter) -> Optional[Any]
converter = converter or self.converter
if not converter:
converter = self.default_conversions().get(self.type, self.type)
return converter(value)
def get_pair(self, default=NotSet, converter=None, value_cb=None):
# type: (Any, Converter, Callable[[str, Any], None]) -> Optional[Tuple[Text, Any]]
for key in self.keys:
value = self._get(key)
if value is NotSet:
continue
try:
value = self.convert(value, converter)
except Exception as ex:
self.error("invalid value {key}={value}: {ex}".format(**locals()))
break
# noinspection PyBroadException
try:
if value_cb:
value_cb(key, value)
except Exception:
pass
return key, value
result = self.default if default is NotSet else default
return self.key, result
def get(self, default=NotSet, converter=None, value_cb=None):
# type: (Any, Converter, Callable[[str, Any], None]) -> Optional[Any]
return self.get_pair(default=default, converter=converter, value_cb=value_cb)[1]
def set(self, value):
# type: (Any, Any) -> (Text, Any)
# key, _ = self.get_pair(default=None, converter=None)
for k in self.keys:
self._set(k, str(value))
def _set(self, key, value):
# type: (Text, Text) -> None
pass
@abc.abstractmethod
def _get(self, key):
# type: (Text) -> Any
pass
@abc.abstractmethod
def error(self, message):
# type: (Text) -> None
pass
__all__ = [
"Entry",
"NotSet"
]

View File

@@ -1,32 +1,6 @@
from os import getenv, environ
from os import environ
from .converters import text_to_bool
from .entry import Entry, NotSet
class EnvEntry(Entry):
@classmethod
def default_conversions(cls):
conversions = super(EnvEntry, cls).default_conversions().copy()
conversions[bool] = text_to_bool
return conversions
def pop(self):
for k in self.keys:
environ.pop(k, None)
def _get(self, key):
value = getenv(key, "").strip()
return value or NotSet
def _set(self, key, value):
environ[key] = value
def __str__(self):
return "env:{}".format(super(EnvEntry, self).__str__())
def error(self, message):
print("Environment configuration: {}".format(message))
from clearml_agent.helper.environment import EnvEntry
def backward_compatibility_support():
@@ -34,6 +8,7 @@ def backward_compatibility_support():
if ENVIRONMENT_BACKWARD_COMPATIBLE.get():
# Add TRAINS_ prefix on every CLEARML_ os environment we support
for k, v in ENVIRONMENT_CONFIG.items():
# noinspection PyBroadException
try:
trains_vars = [var for var in v.vars if var.startswith('CLEARML_')]
if not trains_vars:
@@ -44,6 +19,7 @@ def backward_compatibility_support():
except:
continue
for k, v in ENVIRONMENT_SDK_PARAMS.items():
# noinspection PyBroadException
try:
trains_vars = [var for var in v if var.startswith('CLEARML_')]
if not trains_vars:
@@ -62,3 +38,9 @@ def backward_compatibility_support():
backwards_k = k.replace('CLEARML_', 'TRAINS_', 1)
if backwards_k not in keys:
environ[backwards_k] = environ[k]
__all__ = [
"EnvEntry",
"backward_compatibility_support"
]

View File

@@ -31,7 +31,8 @@ def apply_environment(config):
keys = list(filter(None, env_vars.keys()))
for key in keys:
os.environ[str(key)] = str(env_vars[key] or "")
value = env_vars[key]
os.environ[str(key)] = str(value if value is not None else "")
return keys
@@ -52,7 +53,7 @@ def apply_files(config):
target_fmt = data.get("target_format", "string")
overwrite = bool(data.get("overwrite", True))
contents = data.get("contents")
mode = data.get("mode")
mode = data.get("mode", None)
target = Path(expanduser(expandvars(path)))
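The apply_environment change above matters for numeric zero values (see the "numerical value of 0" fix in the commit list); a quick before/after illustration:
value = 0
print(str(value or ""))                         # ""  -- old expression dropped numeric zero
print(str(value if value is not None else ""))  # "0" -- new expression keeps it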

View File

@@ -2,6 +2,7 @@ from __future__ import print_function
import json
import time
from typing import List, Tuple
from clearml_agent.commands.base import ServiceCommandSection
from clearml_agent.helper.base import return_list
@@ -57,6 +58,42 @@ class Events(ServiceCommandSection):
# print('Sending events done: %d / %d events sent' % (sent_events, len(list_events)))
return sent_events
def send_log_events_with_timestamps(
self, worker_id, task_id, lines_with_ts: List[Tuple[str, str]], level="DEBUG", session=None
):
log_events = []
# break log lines into event packets
for ts, line in return_list(lines_with_ts):
# HACK ignore terminal reset ANSI code
if line == '\x1b[0m':
continue
while line:
if len(line) <= self.max_event_size:
msg = line
line = None
else:
msg = line[:self.max_event_size]
line = line[self.max_event_size:]
log_events.append(
{
"type": "log",
"level": level,
"task": task_id,
"worker": worker_id,
"msg": msg,
"timestamp": ts,
}
)
if line and ts is not None:
# advance timestamp in case we break a line to more than one part
ts += 1
# now send the events
return self.send_events(list_events=log_events, session=session)
def send_log_events(self, worker_id, task_id, lines, level='DEBUG', session=None):
log_events = []
base_timestamp = int(time.time() * 1000)
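A hypothetical caller-side sketch of the new send_log_events_with_timestamps method above; `events` is assumed to be an existing Events instance and the worker/task identifiers are placeholders. Each element pairs a millisecond timestamp with a log line, and lines longer than max_event_size are split into several events with the timestamp advanced per extra chunk:
# Illustrative only -- identifiers below are placeholders.
lines_with_ts = [
    (1709200000000, "pod scheduled"),
    (1709200000050, "x" * 100000),  # long line -> broken into max_event_size chunks
]
sent = events.send_log_events_with_timestamps(
    worker_id="k8s-agent:0",
    task_id="<task-id>",
    lines_with_ts=lines_with_ts,
    level="DEBUG",
)
print("events sent:", sent)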

View File

@@ -1,14 +1,16 @@
import json
import re
import shlex
from copy import copy
from clearml_agent.backend_api.session import Request
from clearml_agent.helper.docker_args import DockerArgsSanitizer
from clearml_agent.helper.package.requirements import (
RequirementsManager, MarkerRequirement,
compare_version_rules, )
def resolve_default_container(session, task_id, container_config):
def resolve_default_container(session, task_id, container_config, ignore_match_rules=False):
container_lookup = session.config.get('agent.default_docker.match_rules', None)
if not session.check_min_api_version("2.13") or not container_lookup:
return container_config
@@ -17,6 +19,12 @@ def resolve_default_container(session, task_id, container_config):
try:
session.verify_feature_set('advanced')
except ValueError:
# ignoring matching rules only supported in higher tiers
return container_config
if ignore_match_rules:
print("INFO: default docker command line override, ignoring default docker container match rules")
# ignoring matching rules only supported in higher tiers
return container_config
result = session.send_request(
@@ -159,9 +167,10 @@ def resolve_default_container(session, task_id, container_config):
if not container_config.get('image'):
container_config['image'] = entry.get('image', None)
if not container_config.get('arguments'):
container_config['arguments'] = entry.get('arguments', None)
container_config['arguments'] = shlex.split(str(container_config.get('arguments') or '').strip())
print('Matching default container with rule:\n{}'.format(json.dumps(entry)))
container_config['arguments'] = entry.get('arguments', None) or ''
if isinstance(container_config.get('arguments'), str):
container_config['arguments'] = shlex.split(str(container_config.get('arguments') or '').strip())
print('INFO: Matching default container with rule:\n{}'.format(json.dumps(entry)))
return container_config
return container_config
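A small illustration of the arguments normalization above: when a matched rule provides docker arguments as a single string, shlex.split converts it to the list form used downstream:
import shlex
print(shlex.split("--ipc=host --privileged".strip()))  # ['--ipc=host', '--privileged']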

File diff suppressed because it is too large

View File

@@ -1,6 +1,5 @@
import shlex
from datetime import timedelta
from distutils.util import strtobool
from enum import IntEnum
from os import getenv, environ
from typing import Text, Optional, Union, Tuple, Any
@@ -9,6 +8,7 @@ import six
from pathlib2 import Path
from clearml_agent.helper.base import normalize_path
from clearml_agent.helper.environment.converters import strtobool
PROGRAM_NAME = "clearml-agent"
FROM_FILE_PREFIX_CHARS = "@"
@@ -158,11 +158,16 @@ ENV_WORKER_ID = EnvironmentConfig("CLEARML_WORKER_ID", "TRAINS_WORKER_ID")
ENV_WORKER_TAGS = EnvironmentConfig("CLEARML_WORKER_TAGS")
ENV_AGENT_SKIP_PIP_VENV_INSTALL = EnvironmentConfig("CLEARML_AGENT_SKIP_PIP_VENV_INSTALL")
ENV_AGENT_SKIP_PYTHON_ENV_INSTALL = EnvironmentConfig("CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL", type=bool)
ENV_AGENT_FORCE_CODE_DIR = EnvironmentConfig("CLEARML_AGENT_FORCE_CODE_DIR")
ENV_AGENT_FORCE_EXEC_SCRIPT = EnvironmentConfig("CLEARML_AGENT_FORCE_EXEC_SCRIPT")
ENV_AGENT_FORCE_POETRY = EnvironmentConfig("CLEARML_AGENT_FORCE_POETRY", type=bool)
ENV_AGENT_FORCE_TASK_INIT = EnvironmentConfig("CLEARML_AGENT_FORCE_TASK_INIT", type=bool)
ENV_DOCKER_SKIP_GPUS_FLAG = EnvironmentConfig("CLEARML_DOCKER_SKIP_GPUS_FLAG", "TRAINS_DOCKER_SKIP_GPUS_FLAG")
ENV_AGENT_GIT_USER = EnvironmentConfig("CLEARML_AGENT_GIT_USER", "TRAINS_AGENT_GIT_USER")
ENV_AGENT_GIT_PASS = EnvironmentConfig("CLEARML_AGENT_GIT_PASS", "TRAINS_AGENT_GIT_PASS")
ENV_AGENT_GIT_HOST = EnvironmentConfig("CLEARML_AGENT_GIT_HOST", "TRAINS_AGENT_GIT_HOST")
ENV_AGENT_DISABLE_SSH_MOUNT = EnvironmentConfig("CLEARML_AGENT_DISABLE_SSH_MOUNT", type=bool)
ENV_AGENT_DEBUG_GET_NEXT_TASK = EnvironmentConfig("CLEARML_AGENT_DEBUG_GET_NEXT_TASK", type=bool)
ENV_SSH_AUTH_SOCK = EnvironmentConfig("SSH_AUTH_SOCK")
ENV_TASK_EXECUTE_AS_USER = EnvironmentConfig("CLEARML_AGENT_EXEC_USER", "TRAINS_AGENT_EXEC_USER")
ENV_TASK_EXTRA_PYTHON_PATH = EnvironmentConfig("CLEARML_AGENT_EXTRA_PYTHON_PATH", "TRAINS_AGENT_EXTRA_PYTHON_PATH")
@@ -240,6 +245,12 @@ ENV_CUSTOM_BUILD_SCRIPT = EnvironmentConfig("CLEARML_AGENT_CUSTOM_BUILD_SCRIPT")
ENV_PACKAGE_PYTORCH_RESOLVE = EnvironmentConfig("CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE")
ENV_TEMP_STDOUT_FILE_DIR = EnvironmentConfig("CLEARML_AGENT_TEMP_STDOUT_FILE_DIR")
ENV_GIT_CLONE_VERBOSE = EnvironmentConfig("CLEARML_AGENT_GIT_CLONE_VERBOSE", type=bool)
ENV_GPU_FRACTIONS = EnvironmentConfig("CLEARML_AGENT_GPU_FRACTIONS")
class FileBuffering(IntEnum):
"""

View File

@@ -39,7 +39,7 @@ LOCAL_REGEX = re.compile(
class Requirement(object):
"""
Represents a single requirementfrom clearml_agent.external.requirements_parser.requirement import Requirement
Represents a single requirement from clearml_agent.external.requirements_parser.requirement import Requirement
Typically instances of this class are created with ``Requirement.parse``.
For local file requirements, there's no verification that the file
@@ -214,6 +214,7 @@ class Requirement(object):
def parse(cls, line):
"""
Parses a Requirement from a line of a requirement file.
This is the main entry point for parsing a single requirements line (not parse_line!)
:param line: a line of a requirement file
:returns: a Requirement instance for the given line
@@ -226,7 +227,7 @@ class Requirement(object):
return cls.parse_editable(
re.sub(r'^(-e|--editable=?)\s*', '', line))
elif '@' in line and ('#' not in line or line.index('#') > line.index('@')):
# Allegro bug fix: support 'name @ git+' entries
# ClearML bug fix: support 'name @ git+' entries
name, uri = line.split('@', 1)
name = name.strip()
uri = uri.strip()

View File

@@ -1,7 +1,20 @@
from clearml_agent.definitions import EnvironmentConfig
from clearml_agent.helper.environment import EnvEntry
ENV_START_AGENT_SCRIPT_PATH = EnvironmentConfig('CLEARML_K8S_GLUE_START_AGENT_SCRIPT_PATH')
ENV_START_AGENT_SCRIPT_PATH = EnvEntry("CLEARML_K8S_GLUE_START_AGENT_SCRIPT_PATH", default="~/__start_agent__.sh")
"""
Script path to use when creating the bash script to run the agent inside the scheduled pod's docker container.
Script will be appended to the specified file.
"""
ENV_DEFAULT_EXECUTION_AGENT_ARGS = EnvEntry("K8S_GLUE_DEF_EXEC_AGENT_ARGS", default="--full-monitoring --require-queue")
ENV_POD_AGENT_INSTALL_ARGS = EnvEntry("K8S_GLUE_POD_AGENT_INSTALL_ARGS", default="", lstrip=False)
ENV_POD_MONITOR_LOG_BATCH_SIZE = EnvEntry("K8S_GLUE_POD_MONITOR_LOG_BATCH_SIZE", default=5, converter=int)
ENV_POD_MONITOR_DISABLE_ENQUEUE_ON_PREEMPTION = EnvEntry(
"K8S_GLUE_POD_MONITOR_DISABLE_ENQUEUE_ON_PREEMPTION", default=False, converter=bool
)
ENV_POD_USE_IMAGE_ENTRYPOINT = EnvEntry("K8S_GLUE_POD_USE_IMAGE_ENTRYPOINT", default=False, converter=bool)
"""
Do not inject a cmd and args to the container's image when building the k8s template (depend on the built-in image
entrypoint)
"""

View File

@@ -18,7 +18,6 @@ from typing import Text, List, Callable, Any, Collection, Optional, Union, Itera
import yaml
from clearml_agent.backend_api.session import Request
from clearml_agent.commands.events import Events
from clearml_agent.commands.worker import Worker, get_task_container, set_task_container, get_next_task
from clearml_agent.definitions import (
@@ -26,9 +25,9 @@ from clearml_agent.definitions import (
ENV_AGENT_GIT_USER,
ENV_AGENT_GIT_PASS,
ENV_FORCE_SYSTEM_SITE_PACKAGES,
ENV_AGENT_DEBUG_GET_NEXT_TASK,
)
from clearml_agent.errors import APIError, UsageError
from clearml_agent.glue.definitions import ENV_START_AGENT_SCRIPT_PATH
from clearml_agent.glue.errors import GetPodCountError
from clearml_agent.glue.utilities import get_path, get_bash_output
from clearml_agent.glue.pending_pods_daemon import PendingPodsDaemon
@@ -37,12 +36,18 @@ from clearml_agent.helper.dicts import merge_dicts
from clearml_agent.helper.process import get_bash_output, stringify_bash_output
from clearml_agent.helper.resource_monitor import ResourceMonitor
from clearml_agent.interface.base import ObjectID
from clearml_agent.backend_api.session import Request
from clearml_agent.glue.definitions import (
ENV_START_AGENT_SCRIPT_PATH,
ENV_DEFAULT_EXECUTION_AGENT_ARGS,
ENV_POD_AGENT_INSTALL_ARGS,
ENV_POD_USE_IMAGE_ENTRYPOINT,
)
class K8sIntegration(Worker):
SUPPORTED_KIND = ("pod", "job")
K8S_PENDING_QUEUE = "k8s_scheduler"
K8S_DEFAULT_NAMESPACE = "clearml"
AGENT_LABEL = "CLEARML=agent"
QUEUE_LABEL = "clearml-agent-queue"
@@ -64,19 +69,23 @@ class K8sIntegration(Worker):
'echo "ldconfig" >> /etc/profile',
"/usr/sbin/sshd -p {port}"]
DEFAULT_EXECUTION_AGENT_ARGS = os.getenv("K8S_GLUE_DEF_EXEC_AGENT_ARGS", "--full-monitoring --require-queue")
POD_AGENT_INSTALL_ARGS = os.getenv("K8S_GLUE_POD_AGENT_INSTALL_ARGS", "")
CONTAINER_BASH_SCRIPT = [
_CONTAINER_APT_SCRIPT_SECTION = [
"export DEBIAN_FRONTEND='noninteractive'",
"echo 'Binary::apt::APT::Keep-Downloaded-Packages \"true\";' > /etc/apt/apt.conf.d/docker-clean",
"chown -R root /root/.cache/pip",
"apt-get update",
"apt-get install -y git libsm6 libxext6 libxrender-dev libglib2.0-0",
]
CONTAINER_BASH_SCRIPT = [
*(
'[ ! -z "$CLEARML_AGENT_SKIP_CONTAINER_APT" ] || {}'.format(line)
for line in _CONTAINER_APT_SCRIPT_SECTION
),
"declare LOCAL_PYTHON",
"[ ! -z $LOCAL_PYTHON ] || for i in {{15..5}}; do which python3.$i && python3.$i -m pip --version && "
"export LOCAL_PYTHON=$(which python3.$i) && break ; done",
"[ ! -z $LOCAL_PYTHON ] || apt-get install -y python3-pip",
'[ ! -z "$CLEARML_AGENT_SKIP_CONTAINER_APT" ] || [ ! -z "$LOCAL_PYTHON" ] || apt-get install -y python3-pip',
"[ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3",
"{extra_bash_init_cmd}",
"[ ! -z $CLEARML_AGENT_NO_UPDATE ] || $LOCAL_PYTHON -m pip install clearml-agent{agent_install_args}",
@@ -98,6 +107,7 @@ class K8sIntegration(Worker):
num_of_services=20,
base_pod_num=1,
user_props_cb=None,
runtime_cb=None,
overrides_yaml=None,
template_yaml=None,
clearml_conf_file=None,
@@ -106,6 +116,7 @@ class K8sIntegration(Worker):
max_pods_limit=None,
pod_name_prefix=None,
limit_pod_label=None,
force_system_packages=None,
**kwargs
):
"""
@@ -124,6 +135,7 @@ class K8sIntegration(Worker):
:param callable user_props_cb: An Optional callable allowing additional user properties to be specified
when scheduling a task to run in a pod. Callable can receive an optional pod number and should return
a dictionary of user properties (name and value). Signature is [[Optional[int]], Dict[str,str]]
:param callable runtime_cb: An Optional callable allowing additional task runtime to be specified (see user_props_cb)
:param str overrides_yaml: YAML file containing the overrides for the pod (optional)
:param str template_yaml: YAML file containing the template for the pod (optional).
If provided the pod is scheduled with kubectl apply and overrides are ignored, otherwise with kubectl run.
@@ -142,7 +154,8 @@ class K8sIntegration(Worker):
self.k8s_pending_queue_name = k8s_pending_queue_name or self.K8S_PENDING_QUEUE
self.k8s_pending_queue_id = None
self.container_bash_script = container_bash_script or self.CONTAINER_BASH_SCRIPT
force_system_packages = ENV_FORCE_SYSTEM_SITE_PACKAGES.get()
if force_system_packages is None:
force_system_packages = ENV_FORCE_SYSTEM_SITE_PACKAGES.get()
self._force_system_site_packages = force_system_packages if force_system_packages is not None else True
if self._force_system_site_packages:
# Use system packages, because we will be running inside a docker
@@ -157,6 +170,7 @@ class K8sIntegration(Worker):
self.base_pod_num = base_pod_num
self._edit_hyperparams_support = None
self._user_props_cb = user_props_cb
self._runtime_cb = runtime_cb
self.conf_file_content = None
self.overrides_json_string = None
self.template_dict = None
@@ -181,7 +195,7 @@ class K8sIntegration(Worker):
self._agent_label = None
self._pending_pods_daemon = self._create_pending_pods_daemon(
self._pending_pods_daemon = self._create_daemon_instance(
cls_=PendingPodsDaemon,
polling_interval=self._polling_interval
)
@@ -190,7 +204,15 @@ class K8sIntegration(Worker):
self._min_cleanup_interval_per_ns_sec = 1.0
self._last_pod_cleanup_per_ns = defaultdict(lambda: 0.)
def _create_pending_pods_daemon(self, cls_, **kwargs):
self._server_supports_same_state_transition = (
self._session.feature_set != "basic" and self._session.check_min_server_version("3.22.3")
)
@property
def agent_label(self):
return self._get_agent_label()
def _create_daemon_instance(self, cls_, **kwargs):
return cls_(agent=self, **kwargs)
def _load_overrides_yaml(self, overrides_yaml):
@@ -370,8 +392,9 @@ class K8sIntegration(Worker):
self.log.warning('Failed parsing kubectl output:\n{}\nEx: {}'.format(output, ex))
def get_pods_for_jobs(self, job_condition: str = None, pod_filters: List[str] = None, debug_msg: str = None):
# Use metadata.uid so job related pods can be found by filtering the following list with this param
controller_uids = self.get_jobs_info(
"spec.selector.matchLabels.controller-uid", condition=job_condition, debug_msg=debug_msg
"metadata.uid", condition=job_condition, debug_msg=debug_msg
)
if not controller_uids:
# No pods were found for these jobs
@@ -417,6 +440,13 @@ class K8sIntegration(Worker):
)
raise GetPodCountError()
def resource_applied(self, resource_name: str, namespace: str, task_id: str, session):
""" Called when a resource (pod/job) was applied """
pass
def ports_mode_supported_for_task(self, task_id: str, task_data):
return self.ports_mode
def run_one_task(self, queue: Text, task_id: Text, worker_args=None, task_session=None, **_):
print('Pulling task {} launching on kubernetes cluster'.format(task_id))
session = task_session or self._session
@@ -426,7 +456,9 @@ class K8sIntegration(Worker):
if self._is_same_tenant(task_session):
try:
print('Pushing task {} into temporary pending queue'.format(task_id))
_ = session.api_client.tasks.stop(task_id, force=True)
if not self._server_supports_same_state_transition:
_ = session.api_client.tasks.stop(task_id, force=True, status_reason="moving to k8s pending queue")
# Just make sure to clean up in case the task is stuck in the queue (known issue)
self._session.api_client.queues.remove_task(
@@ -486,8 +518,10 @@ class K8sIntegration(Worker):
)
)
if self.ports_mode:
ports_mode = False
if self.ports_mode_supported_for_task(task_id, task_data):
print("Kubernetes looking for available pod to use")
ports_mode = True
# noinspection PyBroadException
try:
@@ -498,12 +532,12 @@ class K8sIntegration(Worker):
# Search for a free pod number
pod_count = 0
pod_number = self.base_pod_num
while self.ports_mode or self.max_pods_limit:
while ports_mode or self.max_pods_limit:
pod_number = self.base_pod_num + pod_count
try:
items_count = self._get_pod_count(
extra_labels=[self.limit_pod_label.format(pod_number=pod_number)] if self.ports_mode else None,
extra_labels=[self.limit_pod_label.format(pod_number=pod_number)] if ports_mode else None,
msg="Looking for a free pod/port"
)
except GetPodCountError:
@@ -553,17 +587,17 @@ class K8sIntegration(Worker):
break
pod_count += 1
labels = self._get_pod_labels(queue, queue_name)
if self.ports_mode:
labels = self._get_pod_labels(queue, queue_name, task_data)
if ports_mode:
labels.append(self.limit_pod_label.format(pod_number=pod_number))
if self.ports_mode:
if ports_mode:
print("Kubernetes scheduling task id={} on pod={} (pod_count={})".format(task_id, pod_number, pod_count))
else:
print("Kubernetes scheduling task id={}".format(task_id))
try:
template = self._resolve_template(task_session, task_data, queue)
template = self._resolve_template(task_session, task_data, queue, task_id)
except Exception as ex:
print("ERROR: Failed resolving template (skipping): {}".format(ex))
return
@@ -573,45 +607,79 @@ class K8sIntegration(Worker):
except (KeyError, TypeError, AttributeError):
namespace = self.namespace
if template:
output, error = self._kubectl_apply(
template=template,
pod_number=pod_number,
clearml_conf_create_script=clearml_conf_create_script,
labels=labels,
docker_image=container['image'],
docker_args=container['arguments'],
docker_bash=container.get('setup_shell_script'),
task_id=task_id,
queue=queue,
namespace=namespace,
if not template:
print("ERROR: no template for task {}, skipping".format(task_id))
return
output, error, pod_name = self._kubectl_apply(
template=template,
pod_number=pod_number,
clearml_conf_create_script=clearml_conf_create_script,
labels=labels,
docker_image=container['image'],
docker_args=container.get('arguments'),
docker_bash=container.get('setup_shell_script'),
task_id=task_id,
queue=queue,
namespace=namespace,
task_token=task_session.token.encode("ascii") if task_session else None,
)
print('kubectl output:\n{}\n{}'.format(error, output))
if error:
send_log = "Running kubectl encountered an error: {}".format(error)
self.log.error(send_log)
self.send_logs(task_id, send_log.splitlines())
# Make sure to remove the task from our k8s pending queue
self._session.api_client.queues.remove_task(
task=task_id,
queue=self.k8s_pending_queue_id,
)
# Set task as failed
session.api_client.tasks.failed(task_id, force=True)
return
if pod_name:
self.resource_applied(
resource_name=pod_name, namespace=namespace, task_id=task_id, session=session
)
print('kubectl output:\n{}\n{}'.format(error, output))
if error:
send_log = "Running kubectl encountered an error: {}".format(error)
self.log.error(send_log)
self.send_logs(task_id, send_log.splitlines())
self.set_task_info(
task_id=task_id, task_session=task_session, queue_name=queue_name, ports_mode=ports_mode,
pod_number=pod_number, pod_count=pod_count, task_data=task_data
)
def set_task_info(
self, task_id: str, task_session, task_data, queue_name: str, ports_mode: bool, pod_number, pod_count
):
user_props = {"k8s-queue": str(queue_name)}
if self.ports_mode:
user_props.update(
{
"k8s-pod-number": pod_number,
"k8s-pod-label": labels[0],
"k8s-internal-pod-count": pod_count,
"k8s-agent": self._get_agent_label(),
}
)
runtime = {}
if ports_mode:
agent_label = self._get_agent_label()
user_props.update({
"k8s-pod-number": pod_number,
"k8s-pod-label": agent_label, # backwards-compatibility / legacy
"k8s-internal-pod-count": pod_count,
"k8s-agent": agent_label,
})
if self._user_props_cb:
# noinspection PyBroadException
try:
custom_props = self._user_props_cb(pod_number) if self.ports_mode else self._user_props_cb()
custom_props = self._user_props_cb(pod_number) if ports_mode else self._user_props_cb()
user_props.update(custom_props)
except Exception:
pass
if self._runtime_cb:
# noinspection PyBroadException
try:
custom_runtime = self._runtime_cb(pod_number) if ports_mode else self._runtime_cb()
runtime.update(custom_runtime)
except Exception:
pass
if user_props:
self._set_task_user_properties(
task_id=task_id,
@@ -619,7 +687,38 @@ class K8sIntegration(Worker):
**user_props
)
def _get_pod_labels(self, queue, queue_name):
if runtime:
task_runtime = self._get_task_runtime(task_id) or {}
task_runtime.update(runtime)
try:
res = task_session.send_request(
service='tasks', action='edit', method=Request.def_method,
json={
"task": task_id, "force": True, "runtime": task_runtime
},
)
if not res.ok:
raise Exception("failed setting runtime property")
except Exception as ex:
print("WARNING: failed setting custom runtime properties for task '{}': {}".format(task_id, ex))
def _get_task_runtime(self, task_id) -> Optional[dict]:
try:
res = self._session.send_request(
service='tasks', action='get_by_id', method=Request.def_method,
json={"task": task_id, "only_fields": ["runtime"]},
)
if not res.ok:
raise ValueError(f"request returned {res.status_code}")
data = res.json().get("data")
if not data or "task" not in data:
raise ValueError("empty data in result")
return data["task"].get("runtime", {})
except Exception as ex:
print(f"ERROR: Failed getting runtime properties for task {task_id}: {ex}")
def _get_pod_labels(self, queue, queue_name, task_data):
return [
self._get_agent_label(),
"{}={}".format(self.QUEUE_LABEL, self._safe_k8s_label_value(queue)),
@@ -652,9 +751,12 @@ class K8sIntegration(Worker):
return {target: results} if results else {}
return results
def get_task_worker_id(self, template, task_id, pod_name, namespace, queue):
return f"{self.worker_id}:{task_id}"
def _create_template_container(
self, pod_name: str, task_id: str, docker_image: str, docker_args: List[str],
docker_bash: str, clearml_conf_create_script: List[str]
docker_bash: str, clearml_conf_create_script: List[str], task_worker_id: str, task_token: str = None
) -> dict:
container = self._get_docker_args(
docker_args,
@@ -663,6 +765,32 @@ class K8sIntegration(Worker):
convert=lambda env: {'name': env.partition("=")[0], 'value': env.partition("=")[2]},
)
def add_or_update_env_var(name, value):
env_vars = container.get('env', [])
for entry in env_vars:
if entry.get('name') == name:
entry['value'] = value
break
else:
container['env'] = env_vars + [{'name': name, 'value': value}]
# Set worker ID
add_or_update_env_var('CLEARML_WORKER_ID', task_worker_id)
if ENV_POD_USE_IMAGE_ENTRYPOINT.get():
# Don't add a cmd and args, just the image
# Add the task ID and token since we need it (it's usually in the init script passed to us)
add_or_update_env_var('CLEARML_TASK_ID', task_id)
if task_token:
# TODO: find a way to base64 encode the token
add_or_update_env_var('CLEARML_AUTH_TOKEN', task_token)
return self._merge_containers(
container, dict(name=pod_name, image=docker_image)
)
# Create bash script for the container
container_bash_script = [self.container_bash_script] if isinstance(self.container_bash_script, str) \
else self.container_bash_script
@@ -675,8 +803,8 @@ class K8sIntegration(Worker):
[line.format(extra_bash_init_cmd=self.extra_bash_init_script or '',
task_id=task_id,
extra_docker_bash_script=extra_docker_bash_script,
default_execution_agent_args=self.DEFAULT_EXECUTION_AGENT_ARGS,
agent_install_args=self.POD_AGENT_INSTALL_ARGS)
default_execution_agent_args=ENV_DEFAULT_EXECUTION_AGENT_ARGS.get(),
agent_install_args=ENV_POD_AGENT_INSTALL_ARGS.get())
for line in container_bash_script])
extra_bash_commands = list(clearml_conf_create_script or [])
@@ -710,15 +838,18 @@ class K8sIntegration(Worker):
queue,
task_id,
namespace,
template=None,
pod_number=None
template,
pod_number=None,
task_token=None,
):
if "apiVersion" not in template:
template["apiVersion"] = "batch/v1" if self.using_jobs else "v1"
if "kind" in template:
if template["kind"].lower() != self.kind:
return (
"", f"Template kind {template['kind']} does not maych kind {self.kind.capitalize()} set for agent"
"",
f"Template kind {template['kind']} does not maych kind {self.kind.capitalize()} set for agent",
None
)
else:
template["kind"] = self.kind.capitalize()
@@ -740,7 +871,7 @@ class K8sIntegration(Worker):
spec.setdefault('backoffLimit', 0)
spec_template = spec.setdefault('template', {})
if labels:
# Place same labels fro any pod spawned by the job
# Place same labels for any pod spawned by the job
place_labels(spec_template.setdefault('metadata', {}))
spec = spec_template.setdefault('spec', {})
@@ -748,13 +879,17 @@ class K8sIntegration(Worker):
containers = spec.setdefault('containers', [])
spec.setdefault('restartPolicy', 'Never')
task_worker_id = self.get_task_worker_id(template, task_id, name, namespace, queue)
container = self._create_template_container(
pod_name=name,
task_id=task_id,
docker_image=docker_image,
docker_args=docker_args,
docker_bash=docker_bash,
clearml_conf_create_script=clearml_conf_create_script
clearml_conf_create_script=clearml_conf_create_script,
task_worker_id=task_worker_id,
task_token=task_token,
)
if containers:
@@ -789,11 +924,11 @@ class K8sIntegration(Worker):
process = subprocess.Popen(kubectl_cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
output, error = process.communicate()
except Exception as ex:
return None, str(ex)
return None, str(ex), None
finally:
safe_remove_file(yaml_file)
return stringify_bash_output(output), stringify_bash_output(error)
return stringify_bash_output(output), stringify_bash_output(error), name
def _process_bash_lines_response(self, bash_cmd: str, raise_error=True):
res = get_bash_output(bash_cmd, raise_error=raise_error)
@@ -901,7 +1036,7 @@ class K8sIntegration(Worker):
result = self._session.get(
service='tasks',
action='get_all',
json={"id": task_ids, "status": ["in_progress", "queued"], "only_fields": ["id", "status"]},
json={"id": task_ids, "status": ["in_progress", "queued"], "only_fields": ["id", "status", "status_reason"]},
method=Request.def_method,
)
tasks_to_abort = result["tasks"]
@@ -911,9 +1046,13 @@ class K8sIntegration(Worker):
for task in tasks_to_abort:
task_id = task.get("id")
status = task.get("status")
status_reason = (task.get("status_reason") or "").lower()
if not task_id or not status:
self.log.warning('Failed getting task information: id={}, status={}'.format(task_id, status))
continue
if status == "queued" and "pushed back by policy manager" in status_reason:
# Task was pushed back to policy queue by policy manager, don't touch it
continue
try:
if status == "queued":
self._session.get(
@@ -947,6 +1086,9 @@ class K8sIntegration(Worker):
return deleted_pods
def check_if_suspended(self) -> bool:
pass
def run_tasks_loop(self, queues: List[Text], worker_params, **kwargs):
"""
:summary: Pull and run tasks from queues.
@@ -958,6 +1100,8 @@ class K8sIntegration(Worker):
:param worker_params: Worker command line arguments
:type worker_params: ``clearml_agent.helper.process.WorkerParams``
"""
# print("debug> running tasks loop")
events_service = self.get_service(Events)
# make sure we have a k8s pending queue
@@ -989,12 +1133,19 @@ class K8sIntegration(Worker):
continue
# iterate over queues (priority style, queues[0] is highest)
# print("debug> iterating over queues")
for queue in queues:
# delete old completed / failed pods
self._cleanup_old_pods(namespaces, extra_msg="Cleanup cycle {cmd}")
if self.check_if_suspended():
print("Agent is suspended, sleeping for {:.1f} seconds".format(self._polling_interval))
sleep(self._polling_interval)
break
# get next task in queue
try:
# print(f"debug> getting tasks for queue {queue}")
response = self._get_next_task(queue=queue, get_task_info=self._impersonate_as_task_owner)
except Exception as e:
print("Warning: Could not access task queue [{}], error: {}".format(queue, e))
@@ -1008,6 +1159,8 @@ class K8sIntegration(Worker):
print("No tasks in queue {}".format(queue))
continue
print('Received task {} from queue {}'.format(task_id, queue))
task_session = None
if self._impersonate_as_task_owner:
try:
@@ -1059,8 +1212,9 @@ class K8sIntegration(Worker):
:param list(str) queue: queue name to pull from
"""
queues = queue if isinstance(queue, (list, tuple)) else ([queue] if queue else None)
return self.daemon(
queues=[ObjectID(name=queue)] if queue else None,
queues=[ObjectID(name=q) for q in queues] if queues else None,
log_level=logging.INFO, foreground=True, docker=False, **kwargs,
)
@@ -1069,7 +1223,7 @@ class K8sIntegration(Worker):
self._session, queue=queue, get_task_info=get_task_info
)
def _resolve_template(self, task_session, task_data, queue):
def _resolve_template(self, task_session, task_data, queue, task_id):
if self.template_dict:
return deepcopy(self.template_dict)


@@ -9,6 +9,7 @@ from clearml_agent.helper.process import stringify_bash_output
from .daemon import K8sDaemon
from .utilities import get_path
from .errors import GetPodsError
from .definitions import ENV_POD_MONITOR_DISABLE_ENQUEUE_ON_PREEMPTION
class PendingPodsDaemon(K8sDaemon):
@@ -17,17 +18,16 @@ class PendingPodsDaemon(K8sDaemon):
self._polling_interval = polling_interval
self._last_tasks_msgs = {} # last msg updated for every task
def get_pods(self):
def get_pods(self, pod_name=None, debug_msg="Detecting pending pods: {cmd}"):
filters = ["status.phase=Pending"]
if pod_name:
filters.append(f"metadata.name={pod_name}")
if self._agent.using_jobs:
return self._agent.get_pods_for_jobs(
job_condition="status.active=1",
pod_filters=["status.phase=Pending"],
debug_msg="Detecting pending pods: {cmd}"
job_condition="status.active=1", pod_filters=filters, debug_msg=debug_msg
)
return self._agent.get_pods(
filters=["status.phase=Pending"],
debug_msg="Detecting pending pods: {cmd}"
)
return self._agent.get_pods(filters=filters, debug_msg=debug_msg)
def _get_pod_name(self, pod: dict):
return get_path(pod, "metadata", "name")
@@ -73,6 +73,11 @@ class PendingPodsDaemon(K8sDaemon):
if not namespace:
continue
updated_pod = self.get_pods(pod_name=pod_name, debug_msg="Refreshing pod information: {cmd}")
if not updated_pod:
continue
pod = updated_pod[0]
task_id_to_pod[task_id] = pod
msg = None
@@ -149,9 +154,7 @@ class PendingPodsDaemon(K8sDaemon):
"id": list(pending_tasks_details),
"status": ["stopped"],
"only_fields": ["id"]
},
method=Request.def_method,
async_enable=False,
}
)
aborted_task_ids = list(filter(None, (task.get("id") for task in result["tasks"])))
@@ -160,11 +163,27 @@ class PendingPodsDaemon(K8sDaemon):
if not pod:
self.log.error("Failed locating aborted task {} in pending pods list".format(task_id))
continue
pod_name = self._get_pod_name(pod)
if not self.get_pods(pod_name=pod_name):
self.log.debug("K8S Glue pending monitor: pod {} is no longer pending, skipping".format(pod_name))
continue
resource_name = self._get_k8s_resource_name(pod)
self.log.info(
"K8S Glue pending monitor: task {} was aborted but the k8s resource {} is still pending, "
"deleting pod".format(task_id, resource_name)
)
result = self._session.get(
service='tasks',
action='get_all',
json={"id": [task_id], "status": ["stopped"], "only_fields": ["id"]},
)
if not result["tasks"]:
self.log.debug("K8S Glue pending monitor: task {} is no longer aborted, skipping".format(task_id))
continue
output = self.delete_k8s_resource(k8s_resource=pod, msg="Pending resource of an aborted task")
if not output:
self.log.warning("K8S Glue pending monitor: failed deleting resource {}".format(resource_name))
@@ -177,32 +196,39 @@ class PendingPodsDaemon(K8sDaemon):
if not msg or self._last_tasks_msgs.get(task_id, None) == (msg, tags):
return
try:
# Make sure the task is queued
result = self._session.send_request(
service='tasks',
action='get_all',
json={"id": task_id, "only_fields": ["status"]},
method=Request.def_method,
async_enable=False,
)
if result.ok:
status = get_path(result.json(), 'data', 'tasks', 0, 'status')
# if task is in progress, change its status to enqueued
if status == "in_progress":
result = self._session.send_request(
service='tasks', action='enqueue',
json={
"task": task_id, "force": True, "queue": self._agent.k8s_pending_queue_id
},
method=Request.def_method,
async_enable=False,
)
if not result.ok:
result_msg = get_path(result.json(), 'meta', 'result_msg')
self.log.debug(
"K8S Glue pods monitor: failed forcing task status change"
" for pending task {}: {}".format(task_id, result_msg)
if ENV_POD_MONITOR_DISABLE_ENQUEUE_ON_PREEMPTION.get():
# This disables the option to enqueue the task which is supposed to sync the ClearML task status
# in case the pod was preempted. In some cases this does not happen due to preemption but due to
# cluster communication lag issues that cause us not to discover the pod is no longer pending and
# enqueue the task when it's actually already running, thus essentially killing the task
pass
else:
# Make sure the task is queued
result = self._session.send_request(
service='tasks',
action='get_all',
json={"id": task_id, "only_fields": ["status"]},
method=Request.def_method,
async_enable=False,
)
if result.ok:
status = get_path(result.json(), 'data', 'tasks', 0, 'status')
# if task is in progress, change its status to enqueued
if status == "in_progress":
result = self._session.send_request(
service='tasks', action='enqueue',
json={
"task": task_id, "force": True, "queue": self._agent.k8s_pending_queue_id
},
method=Request.def_method,
async_enable=False,
)
if not result.ok:
result_msg = get_path(result.json(), 'meta', 'result_msg')
self.log.debug(
"K8S Glue pods monitor: failed forcing task status change"
" for pending task {}: {}".format(task_id, result_msg)
)
# Update task status message
payload = {"task": task_id, "status_message": "K8S glue status: {}".format(msg)}


@@ -14,7 +14,6 @@ import sys
import tempfile
from abc import ABCMeta
from collections import OrderedDict
from distutils.spawn import find_executable
from functools import total_ordering
from typing import Text, Dict, Any, Optional, AnyStr, IO, Union
@@ -38,6 +37,7 @@ use_powershell = os.getenv("CLEARML_AGENT_USE_POWERSHELL", None)
def which(cmd, path=None):
from clearml_agent.helper.process import find_executable
result = find_executable(cmd, path)
if not result:
raise ValueError('command "{}" not found'.format(cmd))
@@ -420,6 +420,7 @@ def mkstemp(
open_kwargs=None, # type: Optional[Dict[Text, Any]]
text=True, # type: bool
name_only=False, # type: bool
mode=None, # type: str
*args,
**kwargs):
# type: (...) -> Union[(IO[AnyStr], Text), Text]
@@ -429,12 +430,14 @@ def mkstemp(
:param open_kwargs: keyword arguments for ``io.open``
:param text: open in text mode
:param name_only: close the file and return its name
:param mode: open file mode
:param args: tempfile.mkstemp args
:param kwargs: tempfile.mkstemp kwargs
"""
fd, name = tempfile.mkstemp(text=text, *args, **kwargs)
mode = 'w+'
if not text:
if not mode:
mode = 'w+'
if not text and 'b' not in mode:
mode += 'b'
if name_only:
os.close(fd)
@@ -540,6 +543,36 @@ def convert_cuda_version_to_int_10_base_str(cuda_version):
return str(int(float(cuda_version)*10))
def get_python_version(python_executable, log=None):
from clearml_agent.helper.process import Argv
try:
output = Argv(python_executable, "--version").get_output(
stderr=subprocess.STDOUT
)
except subprocess.CalledProcessError as ex:
# Windows returns error code 9009 and suggests installing Python from the Windows Store
if is_windows_platform() and ex.returncode == 9009:
if log:
log.debug("version not found: {}".format(ex))
else:
if log:
log.warning("error getting %s version: %s", python_executable, ex)
return None
except FileNotFoundError as ex:
if log:
log.debug("version not found: {}".format(ex))
return None
match = re.search(r"Python ({}(?:\.\d+)*)".format(r"\d+"), output)
if match:
if log:
log.debug("Found: {}".format(python_executable))
# only return major.minor version
return ".".join(str(match.group(1)).split(".")[:2])
return None
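For illustration, a short usage sketch of the helper above (the import path is inferred from how other files in this diff import from clearml_agent.helper.base, so treat it as an assumption):
from clearml_agent.helper.base import get_python_version  # inferred path
ver = get_python_version("python3")
print(ver)  # e.g. "3.10" (major.minor only), or None if the interpreter is missing or its version cannot be parsed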
class NonStrictAttrs(object):
@classmethod


@@ -17,6 +17,30 @@ if TYPE_CHECKING:
from clearml_agent.session import Session
def sanitize_urls(s: str) -> Tuple[str, bool]:
"""
Replaces passwords in URLs with asterisks.
Returns the sanitized string and a boolean indicating whether sanitation was performed.
"""
regex = re.compile("^([^:]*:)[^@]+(.*)$")
tokens = re.split(r"\s", s)
changed = False
for k in range(len(tokens)):
if "@" in tokens[k]:
res = urlparse(tokens[k])
if regex.match(res.netloc):
changed = True
tokens[k] = urlunparse((
res.scheme,
regex.sub("\\1********\\2", res.netloc),
res.path,
res.params,
res.query,
res.fragment
))
return " ".join(tokens) if changed else s, changed
class DockerArgsSanitizer:
@classmethod
def sanitize_docker_command(cls, session, docker_command):
@@ -62,11 +86,11 @@ class DockerArgsSanitizer:
elif key in keys:
val = "********"
elif parse_embedded_urls:
val = cls._sanitize_urls(val)[0]
val = sanitize_urls(val)[0]
result[i + 1] = "{}={}".format(key, val)
skip_next = True
elif parse_embedded_urls and not item.startswith("-"):
item, changed = cls._sanitize_urls(item)
item, changed = sanitize_urls(item)
if changed:
result[i] = item
except (KeyError, TypeError):
@@ -75,22 +99,71 @@ class DockerArgsSanitizer:
return result
@staticmethod
def _sanitize_urls(s: str) -> Tuple[str, bool]:
""" Replaces passwords in URLs with asterisks """
regex = re.compile("^([^:]*:)[^@]+(.*)$")
tokens = re.split(r"\s", s)
changed = False
for k in range(len(tokens)):
if "@" in tokens[k]:
res = urlparse(tokens[k])
if regex.match(res.netloc):
changed = True
tokens[k] = urlunparse((
res.scheme,
regex.sub("\\1********\\2", res.netloc),
res.path,
res.params,
res.query,
res.fragment
))
return " ".join(tokens) if changed else s, changed
def get_list_of_switches(docker_args: List[str]) -> List[str]:
args = []
for token in docker_args:
if token.strip().startswith("-"):
args += [token.strip().split("=")[0].lstrip("-")]
return args
@staticmethod
def filter_switches(docker_args: List[str], exclude_switches: List[str]) -> List[str]:
# shortcut if we are sure we have no matches
if (not exclude_switches or
not any("-{}".format(s) in " ".join(docker_args) for s in exclude_switches)):
return docker_args
args = []
in_switch_args = True
for token in docker_args:
if token.strip().startswith("-"):
if "=" in token:
switch = token.strip().split("=")[0]
in_switch_args = False
else:
switch = token
in_switch_args = True
if switch.lstrip("-") in exclude_switches:
# if in excluded, skip the switch and following arguments
in_switch_args = False
else:
args += [token]
elif in_switch_args:
args += [token]
else:
# this is the switch arguments we need to skip
pass
return args
@staticmethod
def merge_docker_args(config, task_docker_arguments: List[str], extra_docker_arguments: List[str]) -> List[str]:
base_cmd = []
# currently only resolving --network, --ipc, --privileged
override_switches = config.get(
"agent.protected_docker_extra_args",
["privileged", "security-opt", "network", "ipc"]
)
if config.get("agent.docker_args_extra_precedes_task", True):
switches = []
if extra_docker_arguments:
switches = DockerArgsSanitizer.get_list_of_switches(extra_docker_arguments)
switches = list(set(switches) & set(override_switches))
base_cmd += [str(a) for a in extra_docker_arguments if a]
if task_docker_arguments:
docker_arguments = DockerArgsSanitizer.filter_switches(task_docker_arguments, switches)
base_cmd += [a for a in docker_arguments if a]
else:
switches = []
if task_docker_arguments:
switches = DockerArgsSanitizer.get_list_of_switches(task_docker_arguments)
switches = list(set(switches) & set(override_switches))
base_cmd += [a for a in task_docker_arguments if a]
if extra_docker_arguments:
extra_docker_arguments = DockerArgsSanitizer.filter_switches(extra_docker_arguments, switches)
base_cmd += [a for a in extra_docker_arguments if a]
return base_cmd
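A small sketch of how the protected-switch filtering above behaves (assuming DockerArgsSanitizer is importable from the same docker-args helper module; the path is an assumption):
from clearml_agent.helper.docker_args import DockerArgsSanitizer  # assumed path
task_args = ["--network", "host", "-e", "FOO=1"]
# with "network" listed as a protected switch, the task-level --network override (and its value) is dropped
print(DockerArgsSanitizer.filter_switches(task_args, exclude_switches=["network"]))
# -> ['-e', 'FOO=1']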


@@ -0,0 +1,8 @@
from .entry import Entry, NotSet
from .environment import EnvEntry
__all__ = [
'Entry',
'NotSet',
'EnvEntry',
]


@@ -0,0 +1,86 @@
import base64
from typing import Union, Optional, Any, TypeVar, Callable, Tuple
import six
try:
from typing import Text
except ImportError:
# windows conda-less hack
Text = Any
ConverterType = TypeVar("ConverterType", bound=Callable[[Any], Any])
def base64_to_text(value):
# type: (Any) -> Text
return base64.b64decode(value).decode("utf-8")
def text_to_int(value, default=0):
# type: (Any, int) -> int
try:
return int(value)
except (ValueError, TypeError):
return default
def text_to_bool(value):
# type: (Text) -> bool
return bool(strtobool(value))
def safe_text_to_bool(value):
# type: (Text) -> bool
try:
return text_to_bool(value)
except ValueError:
return bool(value)
def any_to_bool(value):
# type: (Optional[Union[int, float, Text]]) -> bool
if isinstance(value, six.text_type):
return text_to_bool(value)
return bool(value)
# noinspection PyIncorrectDocstring
def or_(*converters, **kwargs):
# type: (ConverterType, Tuple[Exception, ...]) -> ConverterType
"""
Wrapper that implements an "optional converter" pattern. Allows specifying a converter
for which a set of exceptions is ignored (and the original value is returned)
:param converters: A converter callable
:param exceptions: A tuple of exception types to ignore
"""
# noinspection PyUnresolvedReferences
exceptions = kwargs.get("exceptions", (ValueError, TypeError))
def wrapper(value):
for converter in converters:
try:
return converter(value)
except exceptions:
pass
return value
return wrapper
def strtobool(val):
"""Convert a string representation of truth to true (1) or false (0).
True values are 'y', 'yes', 't', 'true', 'on', and '1'; false values
are 'n', 'no', 'f', 'false', 'off', and '0'. Raises ValueError if
'val' is anything else.
"""
val = val.lower()
if val in ('y', 'yes', 't', 'true', 'on', '1'):
return 1
elif val in ('n', 'no', 'f', 'false', 'off', '0'):
return 0
else:
raise ValueError("invalid truth value %r" % (val,))


@@ -0,0 +1,134 @@
import abc
from typing import Optional, Any, Tuple, Callable, Dict
import six
from .converters import any_to_bool
try:
from typing import Text
except ImportError:
# windows conda-less hack
Text = Any
NotSet = object()
Converter = Callable[[Any], Any]
@six.add_metaclass(abc.ABCMeta)
class Entry(object):
"""
Configuration entry definition
"""
def default_conversions(self):
# type: () -> Dict[Any, Converter]
if self.lstrip and self.rstrip:
def str_convert(s):
return six.text_type(s).strip()
elif self.lstrip:
def str_convert(s):
return six.text_type(s).lstrip()
elif self.rstrip:
def str_convert(s):
return six.text_type(s).rstrip()
else:
def str_convert(s):
return six.text_type(s)
return {
bool: lambda x: any_to_bool(x.strip()),
six.text_type: str_convert,
}
def __init__(self, key, *more_keys, **kwargs):
# type: (Text, Text, Any) -> None
"""
:rtype: object
:param key: Entry's key (at least one).
:param more_keys: More alternate keys for this entry.
:param type: Value type. If provided, will be used for choosing a default conversion or
(if none exists) for casting the environment value.
:param converter: Value converter. If provided, will be used to convert the environment value.
:param default: Default value. If provided, will be used as the default value on calls to get() and get_pair()
in case no value is found for any key and no specific default value was provided in the call.
Default value is None.
:param help: Help text describing this entry
"""
self.keys = (key,) + more_keys
self.type = kwargs.pop("type", six.text_type)
self.converter = kwargs.pop("converter", None)
self.default = kwargs.pop("default", None)
self.help = kwargs.pop("help", None)
self.lstrip = kwargs.pop("lstrip", True)
self.rstrip = kwargs.pop("rstrip", True)
def __str__(self):
return str(self.key)
@property
def key(self):
return self.keys[0]
def convert(self, value, converter=None):
# type: (Any, Converter) -> Optional[Any]
converter = converter or self.converter
if not converter:
converter = self.default_conversions().get(self.type, self.type)
return converter(value)
def get_pair(self, default=NotSet, converter=None, value_cb=None):
# type: (Any, Converter, Callable[[str, Any], None]) -> Optional[Tuple[Text, Any]]
for key in self.keys:
value = self._get(key)
if value is NotSet:
continue
try:
value = self.convert(value, converter)
except Exception as ex:
self.error("invalid value {key}={value}: {ex}".format(**locals()))
break
# noinspection PyBroadException
try:
if value_cb:
value_cb(key, value)
except Exception:
pass
return key, value
result = self.default if default is NotSet else default
return self.key, result
def get(self, default=NotSet, converter=None, value_cb=None):
# type: (Any, Converter, Callable[[str, Any], None]) -> Optional[Any]
return self.get_pair(default=default, converter=converter, value_cb=value_cb)[1]
def set(self, value):
# type: (Any, Any) -> (Text, Any)
# key, _ = self.get_pair(default=None, converter=None)
for k in self.keys:
self._set(k, str(value))
def _set(self, key, value):
# type: (Text, Text) -> None
pass
@abc.abstractmethod
def _get(self, key):
# type: (Text) -> Any
pass
@abc.abstractmethod
def error(self, message):
# type: (Text) -> None
pass


@@ -0,0 +1,28 @@
from os import getenv, environ
from .converters import text_to_bool
from .entry import Entry, NotSet
class EnvEntry(Entry):
def default_conversions(self):
conversions = super(EnvEntry, self).default_conversions().copy()
conversions[bool] = lambda x: text_to_bool(x.strip())
return conversions
def pop(self):
for k in self.keys:
environ.pop(k, None)
def _get(self, key):
value = getenv(key, "")
return value or NotSet
def _set(self, key, value):
environ[key] = value
def __str__(self):
return "env:{}".format(super(EnvEntry, self).__str__())
def error(self, message):
print("Environment configuration: {}".format(message))


@@ -15,10 +15,8 @@ from __future__ import print_function
from __future__ import unicode_literals
import json
import os.path
import platform
import sys
import time
from datetime import datetime
from typing import Optional
@@ -59,6 +57,21 @@ class GPUStat(object):
"""
return self.entry['uuid']
@property
def mig_index(self):
"""
Returns the index of the MIG partition (as in nvidia-smi).
"""
return self.entry.get("mig_index")
@property
def mig_uuid(self):
"""
Returns the uuid of the MIG partition returned by nvidia-smi when running in MIG mode,
e.g. MIG-12345678-abcd-abcd-uuid-123456abcdef
"""
return self.entry.get("mig_uuid")
@property
def name(self):
"""
@@ -163,14 +176,16 @@ class GPUStatCollection(object):
_initialized = False
_device_count = None
_gpu_device_info = {}
_mig_device_info = {}
def __init__(self, gpu_list, driver_version=None):
def __init__(self, gpu_list, driver_version=None, driver_cuda_version=None):
self.gpus = gpu_list
# attach additional system information
self.hostname = platform.node()
self.query_time = datetime.now()
self.driver_version = driver_version
self.driver_cuda_version = driver_cuda_version
@staticmethod
def clean_processes():
@@ -181,17 +196,18 @@ class GPUStatCollection(object):
@staticmethod
def new_query(shutdown=False, per_process_stats=False, get_driver_info=False):
"""Query the information of all the GPUs on local machine"""
initialized = False
if not GPUStatCollection._initialized:
N.nvmlInit()
GPUStatCollection._initialized = True
initialized = True
def _decode(b):
if isinstance(b, bytes):
return b.decode() # for python3, to unicode
return b
def get_gpu_info(index, handle):
def get_gpu_info(index, handle, is_mig=False):
"""Get one GPU information specified by nvml handle"""
def get_process_info(nv_process):
@@ -200,10 +216,10 @@ class GPUStatCollection(object):
if nv_process.pid not in GPUStatCollection.global_processes:
GPUStatCollection.global_processes[nv_process.pid] = \
psutil.Process(pid=nv_process.pid)
ps_process = GPUStatCollection.global_processes[nv_process.pid]
process['pid'] = nv_process.pid
# noinspection PyBroadException
try:
# ps_process = GPUStatCollection.global_processes[nv_process.pid]
# we do not actually use these, so no point in collecting them
# process['username'] = ps_process.username()
# # cmdline returns full path;
@@ -227,12 +243,14 @@ class GPUStatCollection(object):
pass
return process
if not GPUStatCollection._gpu_device_info.get(index):
device_info = GPUStatCollection._mig_device_info if is_mig else GPUStatCollection._gpu_device_info
if not device_info.get(index):
name = _decode(N.nvmlDeviceGetName(handle))
uuid = _decode(N.nvmlDeviceGetUUID(handle))
GPUStatCollection._gpu_device_info[index] = (name, uuid)
device_info[index] = (name, uuid)
name, uuid = GPUStatCollection._gpu_device_info[index]
name, uuid = device_info[index]
try:
temperature = N.nvmlDeviceGetTemperature(
@@ -286,11 +304,11 @@ class GPUStatCollection(object):
for nv_process in nv_comp_processes + nv_graphics_processes:
try:
process = get_process_info(nv_process)
processes.append(process)
except psutil.NoSuchProcess:
# TODO: add some reminder for NVML broken context
# e.g. nvidia-smi reset or reboot the system
pass
process = None
processes.append(process)
# we do not actually use these, so no point in collecting them
# # TODO: Do not block if full process info is not requested
@@ -314,7 +332,7 @@ class GPUStatCollection(object):
# Convert bytes into MBytes
'memory.used': memory.used // MB if memory else None,
'memory.total': memory.total // MB if memory else None,
'processes': processes,
'processes': None if (processes and all(p is None for p in processes)) else processes
}
if per_process_stats:
GPUStatCollection.clean_processes()
@@ -328,8 +346,36 @@ class GPUStatCollection(object):
for index in range(GPUStatCollection._device_count):
handle = N.nvmlDeviceGetHandleByIndex(index)
gpu_info = get_gpu_info(index, handle)
gpu_stat = GPUStat(gpu_info)
gpu_list.append(gpu_stat)
mig_cnt = 0
# noinspection PyBroadException
try:
mig_cnt = N.nvmlDeviceGetMaxMigDeviceCount(handle)
except Exception:
pass
if mig_cnt <= 0:
gpu_list.append(GPUStat(gpu_info))
continue
got_mig_info = False
for mig_index in range(mig_cnt):
try:
mig_handle = N.nvmlDeviceGetMigDeviceHandleByIndex(handle, mig_index)
mig_info = get_gpu_info(mig_index, mig_handle, is_mig=True)
mig_info["mig_name"] = mig_info["name"]
mig_info["name"] = gpu_info["name"]
mig_info["mig_index"] = mig_info["index"]
mig_info["mig_uuid"] = mig_info["uuid"]
mig_info["index"] = gpu_info["index"]
mig_info["uuid"] = gpu_info["uuid"]
mig_info["temperature.gpu"] = gpu_info["temperature.gpu"]
mig_info["fan.speed"] = gpu_info["fan.speed"]
gpu_list.append(GPUStat(mig_info))
got_mig_info = True
except Exception as e:
pass
if not got_mig_info:
gpu_list.append(GPUStat(gpu_info))
# 2. additional info (driver version, etc).
if get_driver_info:
@@ -337,15 +383,32 @@ class GPUStatCollection(object):
driver_version = _decode(N.nvmlSystemGetDriverVersion())
except N.NVMLError:
driver_version = None # N/A
# noinspection PyBroadException
try:
cuda_driver_version = str(N.nvmlSystemGetCudaDriverVersion())
except BaseException:
# noinspection PyBroadException
try:
cuda_driver_version = str(N.nvmlSystemGetCudaDriverVersion_v2())
except BaseException:
cuda_driver_version = None
if cuda_driver_version:
try:
cuda_driver_version = '{}.{}'.format(
int(cuda_driver_version)//1000, (int(cuda_driver_version) % 1000)//10)
except (ValueError, TypeError):
pass
else:
driver_version = None
cuda_driver_version = None
# no need to shutdown:
if shutdown:
if shutdown and initialized:
N.nvmlShutdown()
GPUStatCollection._initialized = False
return GPUStatCollection(gpu_list, driver_version=driver_version)
return GPUStatCollection(gpu_list, driver_version=driver_version, driver_cuda_version=cuda_driver_version)
def __len__(self):
return len(self.gpus)

File diff suppressed because it is too large


@@ -13,16 +13,17 @@ from .locks import FileLock
class FolderCache(object):
_lock_filename = '.clearml.lock'
_lock_timeout_seconds = 30
_def_lock_timeout_seconds = 30
_temp_entry_prefix = '_temp.'
def __init__(self, cache_folder, max_cache_entries=5, min_free_space_gb=None):
def __init__(self, cache_folder, max_cache_entries=5, min_free_space_gb=None, lock_timeout_seconds=None):
self._cache_folder = Path(os.path.expandvars(cache_folder)).expanduser().absolute()
self._cache_folder.mkdir(parents=True, exist_ok=True)
self._max_cache_entries = max_cache_entries
self._last_copied_entry_folder = None
self._min_free_space_gb = min_free_space_gb if min_free_space_gb and min_free_space_gb > 0 else None
self._lock = FileLock((self._cache_folder / self._lock_filename).as_posix())
self._lock_timeout_seconds = float(lock_timeout_seconds or self._def_lock_timeout_seconds)
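For context, a construction sketch showing the new configurable lock timeout (values are illustrative; FolderCache is the class defined above and its module path is not shown in this diff):
# from <module defining FolderCache above> import FolderCache   # path not shown in this diff
cache = FolderCache(
    "~/.clearml/venvs-cache",      # expanded internally via expandvars/expanduser
    max_cache_entries=10,
    min_free_space_gb=2.0,
    lock_timeout_seconds=300,      # previously fixed at 30 seconds
)
print(cache.get_cache_folder())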
def get_cache_folder(self):
# type: () -> Path
@@ -46,9 +47,11 @@ class FolderCache(object):
# lock so we make sure no one deletes it before we copy it
# noinspection PyBroadException
try:
self._lock.acquire(timeout=self._lock_timeout_seconds)
self._lock.acquire(timeout=self._lock_timeout_seconds, readonly=True)
except BaseException as ex:
warning('Could not lock cache folder {}: {}'.format(self._cache_folder, ex))
import traceback
warning('DEBUG: Exception {}: {}'.format(ex, traceback.format_exc()))
return None
src = None
@@ -115,6 +118,8 @@ class FolderCache(object):
self._lock.acquire(timeout=self._lock_timeout_seconds)
except BaseException as ex:
warning('Could not lock cache folder {}: {}'.format(self._cache_folder, ex))
import traceback
warning('DEBUG: Exception {}: {}'.format(ex, traceback.format_exc()))
# failed locking do nothing
return True
keys = sorted(list(set(keys) | set(cached_keys)))
@@ -194,16 +199,23 @@ class FolderCache(object):
if cache_folder.is_dir() and not cache_folder.name.startswith(self._temp_entry_prefix)]
folder_entries = sorted(folder_entries, key=lambda x: x[1], reverse=True)
number_of_entries_to_keep = self._max_cache_entries - 1 \
if max_cache_entries is None else max(0, int(max_cache_entries))
# if nothing to do, leave
if not folder_entries[number_of_entries_to_keep:]:
return
# lock so we make sure no one deletes it before we copy it
# noinspection PyBroadException
try:
self._lock.acquire(timeout=self._lock_timeout_seconds)
except BaseException as ex:
warning('Could not lock cache folder {}: {}'.format(self._cache_folder, ex))
import traceback
warning('DEBUG: Exception {}: {}'.format(ex, traceback.format_exc()))
return
number_of_entries_to_keep = self._max_cache_entries - 1 \
if max_cache_entries is None else max(0, int(max_cache_entries))
for folder, ts in folder_entries[number_of_entries_to_keep:]:
try:
shutil.rmtree(folder.as_posix(), ignore_errors=True)


@@ -32,7 +32,7 @@ def open_atomic(filename, binary=True):
... os.remove(filename)
>>> with open_atomic(filename) as fh:
... written = fh.write(b'test')
... written = fh.write(b"test")
>>> assert os.path.exists(filename)
>>> os.remove(filename)
@@ -67,7 +67,7 @@ class FileLock(object):
def __init__(
self, filename, mode='a', timeout=DEFAULT_TIMEOUT,
check_interval=DEFAULT_CHECK_INTERVAL, fail_when_locked=False,
flags=LOCK_METHOD, **file_open_kwargs):
**file_open_kwargs):
"""Lock manager with build-in timeout
filename -- filename
@@ -101,11 +101,12 @@ class FileLock(object):
self.timeout = timeout
self.check_interval = check_interval
self.fail_when_locked = fail_when_locked
self.flags = flags
self.flags_read = constants.LOCK_SH | constants.LOCK_NB
self.flags_write = constants.LOCK_EX | constants.LOCK_NB
self.file_open_kwargs = file_open_kwargs
def acquire(
self, timeout=None, check_interval=None, fail_when_locked=None):
self, timeout=None, check_interval=None, fail_when_locked=None, readonly=False):
"""Acquire the locked filehandle"""
if timeout is None:
timeout = self.timeout
@@ -123,12 +124,13 @@ class FileLock(object):
if fh:
return fh
# Get a new filehandler
fh = self._get_fh()
_fh = None
try:
# Get a new filehandler
_fh = self._get_fh()
# Try to lock
fh = self._get_lock(fh)
except exceptions.LockException as exception:
fh = self._get_lock(_fh, readonly=readonly)
except (exceptions.LockException, IOError) as exception:
# Try till the timeout has passed
timeoutend = current_time() + timeout
while timeoutend > current_time():
@@ -144,16 +146,18 @@ class FileLock(object):
raise exceptions.AlreadyLocked(exception)
else: # pragma: no cover
if not _fh:
_fh = self._get_fh()
# We've got the lock
fh = self._get_lock(fh)
fh = self._get_lock(_fh, readonly=readonly)
break
except exceptions.LockException:
except (exceptions.LockException, IOError):
pass
else:
# We got a timeout... reraising
raise exceptions.LockException(exception)
raise exceptions.LockTimeout(exception)
# Prepare the filehandle (truncate if needed)
fh = self._prepare_fh(fh)
@@ -176,16 +180,37 @@ class FileLock(object):
pass
self.fh = None
def delete_lock_file(self):
# type: () -> bool
"""
Remove the local file used for locking (fail if file is locked)
:return: True if successful
"""
if self.fh:
return False
# noinspection PyBroadException
try:
os.unlink(path=self.filename)
except BaseException:
return False
return True
def _get_fh(self):
"""Get a new filehandle"""
# Create the parent directory if it doesn't exist
path, name = os.path.split(self.filename)
if path and not os.path.isdir(path): # pragma: no cover
os.makedirs(path, exist_ok=True)
return open(self.filename, self.mode, **self.file_open_kwargs)
def _get_lock(self, fh):
def _get_lock(self, fh, readonly=False):
"""
Try to lock the given filehandle
returns LockException if it fails"""
lock(fh, self.flags)
lock(fh, self.flags_read if readonly else self.flags_write)
return fh
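A usage sketch for the new shared (read-only) locking mode; FileLock is the class above, and release() is assumed to be its existing counterpart to acquire() (not shown in this diff):
lock = FileLock("/tmp/clearml-cache/.clearml.lock", timeout=30)
fh = lock.acquire(readonly=True)   # takes a shared LOCK_SH | LOCK_NB lock instead of an exclusive one
try:
    pass  # read the cached folder while other readers may hold the same shared lock
finally:
    lock.release()                 # assumed existing release() method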
def _prepare_fh(self, fh):

View File

@@ -20,6 +20,9 @@ class exceptions:
class FileToLarge(BaseLockException):
pass
class LockTimeout(BaseLockException):
pass
class constants:
# The actual tests will execute the code anyhow so the following code can
@@ -185,6 +188,10 @@ elif os.name == 'posix': # pragma: no cover
# The exception code varies on different systems so we'll catch
# every IO error
raise exceptions.LockException(exc_value, fh=file_)
except BaseException as ex:
# DEBUG
print("Uncaught [{}] Exception [{}] in portalock: {}".format(locking_exceptions, type(ex), ex))
raise
def unlock(file_):
fcntl.flock(file_.fileno(), constants.LOCK_UN)


@@ -28,9 +28,13 @@ class PackageManager(object):
_config_cache_folder = 'agent.venvs_cache.path'
_config_cache_max_entries = 'agent.venvs_cache.max_entries'
_config_cache_free_space_threshold = 'agent.venvs_cache.free_space_threshold_gb'
_config_cache_lock_timeout = 'agent.venvs_cache.lock_timeout'
_config_pip_legacy_resolver = 'agent.package_manager.pip_legacy_resolver'
def __init__(self):
self._cache_manager = None
self._existing_packages = []
self._base_install_flags = []
@abc.abstractproperty
def bin(self):
@@ -78,6 +82,23 @@ class PackageManager(object):
# type: (Iterable[Text]) -> None
pass
def add_extra_install_flags(self, extra_flags): # type: (List[str]) -> None
if extra_flags:
extra_flags = [
e for e in extra_flags if e not in list(self._base_install_flags)
]
self._base_install_flags = list(self._base_install_flags) + list(extra_flags)
def remove_extra_install_flags(self, extra_flags): # type: (List[str]) -> bool
if extra_flags:
_base_install_flags = [
e for e in self._base_install_flags if e not in list(extra_flags)
]
if self._base_install_flags != _base_install_flags:
self._base_install_flags = _base_install_flags
return True
return False
def upgrade_pip(self):
result = self._install(
*select_for_platform(
@@ -86,19 +107,58 @@ class PackageManager(object):
),
"--upgrade"
)
packages = self.run_with_env(('list',), output=True).splitlines()
# p.split is ('pip', 'x.y.z')
pip = [p.split() for p in packages if len(p.split()) == 2 and p.split()[0] == 'pip']
if pip:
# noinspection PyBroadException
packages = (self.freeze(freeze_full_environment=True) or dict()).get("pip")
if packages:
from clearml_agent.helper.package.requirements import RequirementsManager
from .requirements import MarkerRequirement, SimpleVersion
# store existing packages so that we can check if we can skip preinstalled packages
# we will only check "@ file" "@ vcs" for exact match
self._existing_packages = RequirementsManager.parse_requirements_section_to_marker_requirements(
packages, skip_local_file_validation=True)
try:
from .requirements import MarkerRequirement
pip = pip[0][1].split('.')
MarkerRequirement.pip_new_version = bool(int(pip[0]) >= 20)
except Exception:
pass
pip_pkg = next(p for p in self._existing_packages if p.name == "pip")
except StopIteration:
pip_pkg = None
# check if we need to list the pip version as well
if pip_pkg:
MarkerRequirement.pip_new_version = SimpleVersion.compare_versions(pip_pkg.version, ">=", "20")
# add --use-deprecated=legacy-resolver to pip install to avoid mismatched packages issues
self._add_legacy_resolver_flag(pip_pkg.version)
return result
def _add_legacy_resolver_flag(self, pip_pkg_version):
if not self.session.config.get(self._config_pip_legacy_resolver, None):
return
from .requirements import SimpleVersion
match_versions = self.session.config.get(self._config_pip_legacy_resolver)
matched = False
for rule in match_versions:
matched = False
# make sure we match all the parts of the rule
for a_version in rule.split(","):
o, v = SimpleVersion.split_op_version(a_version.strip())
matched = SimpleVersion.compare_versions(pip_pkg_version, o, v)
if not matched:
break
# if the rule is fully matched we have a match
if matched:
break
legacy_resolver_flags = ["--use-deprecated=legacy-resolver"]
if matched:
print("INFO: Using legacy resolver for PIP to avoid inconsistency with package versions!")
self.add_extra_install_flags(legacy_resolver_flags)
elif self.remove_extra_install_flags(legacy_resolver_flags):
print("INFO: removing pip legacy resolver!")
def get_python_command(self, extra=()):
# type: (...) -> Executable
return Argv(self.bin, *extra)
@@ -148,6 +208,18 @@ class PackageManager(object):
return False
except Exception:
return False
try:
from .requirements import Requirement, MarkerRequirement
req = MarkerRequirement(Requirement.parse(package_name))
# if pip was part of the requirements, make sure we update the flags
# add --use-deprecated=legacy-resolver to pip install to avoid mismatched packages issues
if req.name == "pip" and req.version:
PackageManager._selected_manager._add_legacy_resolver_flag(req.version)
except Exception as e:
print("WARNING: Error while parsing pip version legacy [{}]".format(e))
return True
@classmethod
@@ -182,7 +254,7 @@ class PackageManager(object):
def get_pip_versions(cls, pip="pip", wrap=''):
return [
(wrap + pip + version + wrap)
for version in cls._pip_version or [pip]
for version in cls._pip_version or [""]
]
def get_cached_venv(self, requirements, docker_cmd, python_version, cuda_version, destination_folder):
@@ -218,6 +290,8 @@ class PackageManager(object):
if not self._get_cache_manager():
return
print('Adding venv into cache: {}'.format(source_folder))
try:
keys = self._generate_reqs_hash_keys(requirements, docker_cmd, python_version, cuda_version)
return self._get_cache_manager().add_entry(
@@ -302,7 +376,9 @@ class PackageManager(object):
max_entries = int(self.session.config.get(self._config_cache_max_entries, 10))
free_space_threshold = float(self.session.config.get(self._config_cache_free_space_threshold, 0))
self._cache_manager = FolderCache(
cache_folder, max_cache_entries=max_entries, min_free_space_gb=free_space_threshold)
cache_folder, max_cache_entries=max_entries,
min_free_space_gb=free_space_threshold,
lock_timeout_seconds=self.session.config.get(self._config_cache_lock_timeout, None))
except Exception as ex:
print("WARNING: Failed accessing venvs cache at {}: {}".format(cache_folder, ex))
print("WARNING: Skipping venv cache - folder not accessible!")


@@ -5,7 +5,6 @@ import re
import os
import subprocess
from collections import OrderedDict
from distutils.spawn import find_executable
from functools import partial
from itertools import chain
from typing import Text, Iterable, Union, Dict, Set, Sequence, Any
@@ -22,13 +21,13 @@ from clearml_agent.errors import CommandFailedError
from clearml_agent.helper.base import (
rm_tree, NonStrictAttrs, select_for_platform, is_windows_platform, ExecutionInfo,
convert_cuda_version_to_float_single_digit_str, convert_cuda_version_to_int_10_base_str, )
from clearml_agent.helper.process import Argv, Executable, DEVNULL, CommandSequence, PathLike
from clearml_agent.helper.process import Argv, Executable, DEVNULL, CommandSequence, PathLike, find_executable
from clearml_agent.helper.package.requirements import SimpleVersion
from clearml_agent.session import Session
from .base import PackageManager
from .pip_api.venv import VirtualenvPip
from .requirements import RequirementsManager, MarkerRequirement
from ...backend_api.session.defs import ENV_CONDA_ENV_PACKAGE
from ...backend_api.session.defs import ENV_CONDA_ENV_PACKAGE, ENV_USE_CONDA_BASE_ENV
package_normalize = partial(re.compile(r"""\[version=['"](.*)['"]\]""").sub, r"\1")
@@ -79,6 +78,11 @@ class CondaAPI(PackageManager):
self.path = path
self.env_read_only = False
self.extra_channels = self.session.config.get('agent.package_manager.conda_channels', [])
# install into base conda environment (should only be used if running in docker mode)
self.use_conda_base_env = ENV_USE_CONDA_BASE_ENV.get(
default=self.session.config.get('agent.package_manager.use_conda_base_env', None)
)
# notice this will not install any additional packages into the selected environment
self.conda_env_as_base_docker = \
self.session.config.get('agent.package_manager.conda_env_as_base_docker', None) or \
bool(ENV_CONDA_ENV_PACKAGE.get())
@@ -129,16 +133,38 @@ class CondaAPI(PackageManager):
def bin(self):
return self.pip.bin
def _parse_package_marker_match_python_ver(self, line=None, marker_req=None):
if line:
marker_req = MarkerRequirement(Requirement.parse(line))
try:
mock_req = MarkerRequirement(Requirement.parse(marker_req.marker.replace("'", "").replace("\"", "")))
except Exception as ex:
print("WARNING: failed parsing, assuming package is okay {}".format(ex))
return marker_req
if not mock_req.compare_version(requested_version=self.python):
print("SKIPPING package `{}` not required python version {}".format(marker_req.tostr(), self.python))
return None
return marker_req
# noinspection SpellCheckingInspection
def upgrade_pip(self):
# do not change pip version if a pre-built environment is used
if self.env_read_only:
print('Conda environment in read-only mode, skipping pip upgrade.')
return ''
pip_versions = []
for req_pip_line in self.pip.get_pip_versions():
req = self._parse_package_marker_match_python_ver(line=req_pip_line)
if req:
pip_versions.append(req.tostr(markers=False))
return self._install(
*select_for_platform(
windows=self.pip.get_pip_versions(),
linux=self.pip.get_pip_versions()
windows=pip_versions,
linux=pip_versions
)
)
@@ -173,6 +199,14 @@ class CondaAPI(PackageManager):
else:
raise ValueError("Could not restore Conda environment, cannot find {}".format(
self.conda_pre_build_env_path))
elif self.use_conda_base_env:
try:
base_path = Path(self.conda).parent.parent.as_posix()
print("Using base conda environment at {}".format(base_path))
self._init_existing_environment(base_path, is_readonly=False)
return self
except Exception as ex:
print("WARNING: Failed using base conda environment, reverting to new environment: {}".format(ex))
command = Argv(
self.conda,
@@ -200,10 +234,25 @@ class CondaAPI(PackageManager):
return self
def _init_existing_environment(self, conda_pre_build_env_path):
def _init_existing_environment(self, conda_pre_build_env_path, is_readonly=True):
print("Using pre-existing Conda environment from {}".format(conda_pre_build_env_path))
self.path = Path(conda_pre_build_env_path)
self.source = ("conda", "activate", self.path.as_posix())
conda_env = self._get_conda_sh()
self.source = CommandSequence(('source', conda_env.as_posix()), self.source)
conda_packages_json = json.loads(
self._run_command((self.conda, "list", "--json", "-p", self.path), raw=True))
try:
for package in conda_packages_json:
if package.get("name") == "python" and package.get("version"):
self.python = ".".join(package.get("version").split(".")[:2])
print("Existing conda environment, found python version {}".format(self.python))
break
except Exception as ex:
print("WARNING: failed detecting existing conda python version: {}".format(ex))
self.pip = CondaPip(
session=self.session,
source=self.source,
@@ -211,9 +260,9 @@ class CondaAPI(PackageManager):
requirements_manager=self.requirements_manager,
path=self.path,
)
conda_env = self._get_conda_sh()
self.source = self.pip.source = CommandSequence(('source', conda_env.as_posix()), self.source)
self.env_read_only = True
self.pip.source = self.source
self.env_read_only = is_readonly
def remove(self):
"""
@@ -223,7 +272,7 @@ class CondaAPI(PackageManager):
Conda seems to load "vcruntime140.dll" from all its environment on startup.
This means environment have to be deleted using 'conda env remove'.
If necessary, conda can be fooled into deleting a partially-deleted environment by creating an empty file
in '<ENV>\conda-meta\history' (value found in 'conda.gateways.disk.test.PREFIX_MAGIC_FILE').
in '<ENV>\\conda-meta\\history' (value found in 'conda.gateways.disk.test.PREFIX_MAGIC_FILE').
Otherwise, it complains that said directory is not a conda environment.
See: https://github.com/conda/conda/issues/7682
@@ -499,7 +548,7 @@ class CondaAPI(PackageManager):
if '.' not in m.specs[0][1]:
continue
if m.name.lower() == 'cudatoolkit':
if m.name.lower() in ('cudatoolkit', 'cuda-toolkit'):
# skip cuda if we are running on CPU
if not cuda_version:
continue
@@ -526,10 +575,22 @@ class CondaAPI(PackageManager):
has_torch = True
m.req.name = 'tensorflow-gpu' if cuda_version > 0 else 'tensorflow'
# push the clearml packages into the pip_requirements
if "clearml" in m.req.name and "clearml" not in self.extra_channels:
if self.session.debug_mode:
print("info: moving `{}` packages to `pip` section".format(m.req))
pip_requirements.append(m)
continue
reqs.append(m)
if not has_cudatoolkit and cuda_version:
m = MarkerRequirement(Requirement.parse("cudatoolkit == {}".format(cuda_version_full)))
# the nvidia channel uses `cuda-toolkit` and has newer versions of cuda,
# older cuda versions can be picked from conda-forge (<12)
if "nvidia" in self.extra_channels:
m = MarkerRequirement(Requirement.parse("cuda-toolkit == {}".format(cuda_version_full)))
else:
m = MarkerRequirement(Requirement.parse("cudatoolkit == {}".format(cuda_version_full)))
has_cudatoolkit = True
reqs.append(m)
@@ -589,21 +650,30 @@ class CondaAPI(PackageManager):
if r.name and not r.name.startswith('_') and not requirements.get('conda', None):
r.name = r.name.replace('_', '-')
if has_cudatoolkit and r.specs and len(r.specs[0]) > 1 and r.name == 'cudatoolkit':
if has_cudatoolkit and r.specs and len(r.specs[0]) > 1 and r.name in ('cudatoolkit', 'cuda-toolkit'):
# select specific cuda version if it came from the requirements
r.specs = [(r.specs[0][0].replace('==', '='), r.specs[0][1].split('.post')[0])]
elif r.specs and r.specs[0] and len(r.specs[0]) > 1:
# remove .post from version numbers it fails with ~= version, and change == to ~=
r.specs = [(r.specs[0][0].replace('==', '~='), r.specs[0][1].split('.post')[0])]
r.specs = [(s[0].replace('==', '~='), s[1].split('.post')[0]) for s in r.specs]
while reqs:
# notice: we give conda more freedom in version selection, to help it choose the best combination
def clean_ver(ar):
if not ar.specs:
return ar.tostr()
ar.specs = [(ar.specs[0][0], ar.specs[0][1] + '.0' if '.' not in ar.specs[0][1] else ar.specs[0][1])]
return ar.tostr()
conda_env['dependencies'] = [clean_ver(r) for r in reqs]
markers = None
if ar.marker:
# check if we really need it based on python version
ar = self._parse_package_marker_match_python_ver(marker_req=ar)
if not ar:
# empty lines should be skipped
return ""
# if we do make sure we note that we ignored markers
print("WARNING: ignoring marker in `{}`".format(ar.tostr()))
markers = False
if ar.specs:
ar.specs = [(s[0], s[1] + '.0' if '.' not in s[1] else s[1]) for s in ar.specs]
return ar.tostr(markers=markers)
conda_env['dependencies'] = [clean_ver(r) for r in reqs if clean_ver(r)]
with self.temp_file("conda_env", yaml.dump(conda_env), suffix=".yml") as name:
print('Conda: Trying to install requirements:\n{}'.format(conda_env['dependencies']))
if self.session.debug_mode:
@@ -730,6 +800,25 @@ class CondaAPI(PackageManager):
return conda_env
return base_conda_env
def add_cached_venv(self, *args, **kwargs):
"""
Copy the local venv folder into the venv cache (keys are based on the requirements+python+docker).
"""
# do not cache if this is a base conda environment
if self.conda_env_as_base_docker or self.use_conda_base_env:
return
return super().add_cached_venv(*args, **kwargs)
def get_cached_venv(self, *args, **kwargs):
"""
Copy a cached copy of the venv (based on the requirements) into destination_folder.
Return None if failed or cached entry does not exist
"""
# do not cache if this is a base conda environment
if self.conda_env_as_base_docker or self.use_conda_base_env:
return
return super().get_cached_venv(*args, **kwargs)
# enable hashing with cmp=False because pdb fails on un-hashable exceptions
exception = attrs(str=True, cmp=False)


@@ -97,7 +97,7 @@ class SystemPip(PackageManager):
return Argv(self.bin, '-m', 'pip', '--disable-pip-version-check', *command)
def install_flags(self):
indices_args = tuple(
base_args = tuple(self._base_install_flags or []) + tuple(
chain.from_iterable(('--extra-index-url', x) for x in PIP_EXTRA_INDICES)
)
@@ -105,7 +105,7 @@ class SystemPip(PackageManager):
ENV_PIP_EXTRA_INSTALL_FLAGS.get() or \
self.session.config.get("agent.package_manager.extra_pip_install_flags", None)
return (indices_args + tuple(extra_pip_flags)) if extra_pip_flags else indices_args
return (base_args + tuple(extra_pip_flags)) if extra_pip_flags else base_args
def download_flags(self):
indices_args = tuple(


@@ -37,7 +37,9 @@ class VirtualenvPip(SystemPip, PackageManager):
def load_requirements(self, requirements):
if isinstance(requirements, dict) and requirements.get("pip"):
requirements["pip"] = self.requirements_manager.replace(requirements["pip"])
requirements["pip"] = self.requirements_manager.replace(
requirements["pip"], existing_packages=self._existing_packages
)
super(VirtualenvPip, self).load_requirements(requirements)
self.requirements_manager.post_install(self.session, package_manager=self)
@@ -64,9 +66,18 @@ class VirtualenvPip(SystemPip, PackageManager):
Only valid if instantiated with path.
Use self.python as self.bin does not exist.
"""
self.session.command(
self.python, "-m", "virtualenv", self.path, *self.create_flags()
).check_call()
# noinspection PyBroadException
try:
self.session.command(
self.python, "-m", "virtualenv", self.path, *self.create_flags()
).check_call()
except Exception as ex:
# let's try with std library instead
print("WARNING: virtualenv call failed: {}\n INFO: Creating virtual environment with venv".format(ex))
self.session.command(
self.python, "-m", "venv", self.path, *self.create_flags()
).check_call()
return self
def remove(self):


@@ -6,6 +6,7 @@ import sys
import os
from pathlib2 import Path
from clearml_agent.definitions import ENV_AGENT_FORCE_POETRY
from clearml_agent.helper.process import Argv, DEVNULL, check_if_command_exists
from clearml_agent.session import Session, POETRY
@@ -39,11 +40,11 @@ def prop_guard(prop, log_prop=None):
class PoetryConfig:
def __init__(self, session, interpreter=None):
# type: (Session, str) -> ()
def __init__(self, session):
# type: (Session, str) -> None
self.session = session
self._log = session.get_logger(__name__)
self._python = interpreter or sys.executable
self._python = sys.executable # default, overwritten from session config in initialize()
self._initialized = False
@property
@@ -52,7 +53,7 @@ class PoetryConfig:
@property
def enabled(self):
return self.session.config["agent.package_manager.type"] == POETRY
return ENV_AGENT_FORCE_POETRY.get() or self.session.config["agent.package_manager.type"] == POETRY
_guard_enabled = prop_guard(enabled, log)
@@ -69,7 +70,7 @@ class PoetryConfig:
path = path.replace(':'+sys.base_prefix, ':'+sys.real_prefix, 1)
kwargs['env']['PATH'] = path
if self.session and self.session.config:
if self.session and self.session.config and args and args[0] == "install":
extra_args = self.session.config.get("agent.package_manager.poetry_install_extra_args", None)
if extra_args:
args = args + tuple(extra_args)
@@ -87,32 +88,53 @@ class PoetryConfig:
@_guard_enabled
def initialize(self, cwd=None):
if not self._initialized:
# use correct python version -- detected in Worker.install_virtualenv() and written to
# session
if self.session.config.get("agent.python_binary", None):
self._python = self.session.config.get("agent.python_binary")
if self.session.config.get("agent.package_manager.poetry_version", None) is not None:
version = str(self.session.config.get("agent.package_manager.poetry_version"))
print('Upgrading Poetry package {}'.format(version))
# first upgrade pip if we need to
try:
from clearml_agent.helper.package.pip_api.venv import VirtualenvPip
pip = VirtualenvPip(
session=self.session, python=self._python,
requirements_manager=None, path=None, interpreter=self._python)
pip.upgrade_pip()
except Exception as ex:
self.log.warning("failed upgrading pip: {}".format(ex))
# get poetry version
version = version.replace(' ', '')
if ('=' in version) or ('~' in version) or ('<' in version) or ('>' in version):
version = version
elif version:
version = "==" + version
# (we are not running it yet)
argv = Argv(self._python, "-m", "pip", "install", "poetry{}".format(version),
"--upgrade", "--disable-pip-version-check")
# this is just for beauty and checks, we already set the version in the Argv
if not version:
version = "latest"
else:
# mark to install poetry if not already installed (we are not running it yet)
argv = Argv(self._python, "-m", "pip", "install", "poetry", "--disable-pip-version-check")
version = ""
# first upgrade pip if we need to
try:
from clearml_agent.helper.package.pip_api.venv import VirtualenvPip
pip = VirtualenvPip(
session=self.session, python=self._python,
requirements_manager=None, path=None, interpreter=self._python)
pip.upgrade_pip()
except Exception as ex:
self.log.warning("failed upgrading pip: {}".format(ex))
# if no specific version was requested and the poetry command is already available, skip the installation
if not version and check_if_command_exists("poetry"):
print("Notice: Poetry was found, no specific version required, skipping poetry installation")
else:
print('Installing / Upgrading Poetry package to {}'.format(version))
# now install poetry
try:
version = version.replace(' ', '')
if ('=' in version) or ('~' in version) or ('<' in version) or ('>' in version):
version = version
elif version:
version = "==" + version
argv = Argv(self._python, "-m", "pip", "install", "poetry{}".format(version),
"--upgrade", "--disable-pip-version-check")
print(argv.get_output())
except Exception as ex:
self.log.warning("failed upgrading poetry: {}".format(ex))
self.log.warning("failed installing poetry: {}".format(ex))
# now setup poetry
self._initialized = True
try:
self._config("--local", "virtualenvs.in-project", "true", cwd=cwd)
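The version handling above turns the agent.package_manager.poetry_version setting into a pip requirement spec before installing poetry. A standalone sketch of that normalization (illustrative only, mirroring the branch logic in the diff):

def poetry_pip_spec(version):
    # keep explicit specifiers as-is, otherwise pin with "=="; empty means "latest"
    version = (version or "").replace(" ", "")
    if any(c in version for c in "=~<>"):
        return "poetry{}".format(version)
    if version:
        return "poetry=={}".format(version)
    return "poetry"

assert poetry_pip_spec("1.8.3") == "poetry==1.8.3"
assert poetry_pip_spec(">=1.5,<2") == "poetry>=1.5,<2"
assert poetry_pip_spec("") == "poetry"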

View File

@@ -53,12 +53,18 @@ class PriorityPackageRequirement(SimpleSubstitution):
if not self._replaced_packages:
return list_of_requirements
# we assume that both pip & setuptools are not in list_of_requirements, and we need to add them
if "pip" in self._replaced_packages:
full_freeze = PackageManager.out_of_scope_freeze(freeze_full_environment=True)
# now let's look for pip
pips = [line for line in full_freeze.get("pip", []) if line.split("==")[0] == "pip"]
if pips and "pip" in list_of_requirements:
list_of_requirements["pip"] = [pips[0]] + list_of_requirements["pip"]
if not full_freeze:
if "pip" in list_of_requirements:
list_of_requirements["pip"] = [self._replaced_packages["pip"]] + list_of_requirements["pip"]
else:
# now let's look for pip
pips = [line for line in full_freeze.get("pip", []) if str(line.split("==")[0]).strip() == "pip"]
if pips and "pip" in list_of_requirements:
list_of_requirements["pip"] = [pips[0]] + list_of_requirements["pip"]
if "setuptools" in self._replaced_packages:
try:
@@ -87,6 +93,20 @@ class PriorityPackageRequirement(SimpleSubstitution):
return list_of_requirements
class CachedPackageRequirement(PriorityPackageRequirement):
name = ("setuptools", "pip", )
optional_package_names = tuple()
def replace(self, req):
"""
Put the requirement in the list for later conversion
:raises: ValueError if version is pre-release
"""
self._replaced_packages[req.name] = req.line
return Text(req)
class PackageCollectorRequirement(SimpleSubstitution):
"""
This RequirementSubstitution class will allow you to have multiple instances of the same

View File

@@ -670,8 +670,7 @@ class PytorchRequirement(SimpleSubstitution):
return MarkerRequirement(Requirement.parse(self._fix_setuptools))
return None
@classmethod
def get_torch_index_url(cls, cuda_version, nightly=False):
def get_torch_index_url(self, cuda_version, nightly=False):
# noinspection PyBroadException
try:
cuda = int(cuda_version)
@@ -681,39 +680,39 @@ class PytorchRequirement(SimpleSubstitution):
if nightly:
for c in range(cuda, max(-1, cuda-15), -1):
# then try the nightly builds, it might be there...
torch_url = cls.nightly_extra_index_url_template.format(c)
torch_url = self.nightly_extra_index_url_template.format(c)
# noinspection PyBroadException
try:
if requests.get(torch_url, timeout=10).ok:
print('Torch nightly CUDA {} index page found'.format(c))
cls.torch_index_url_lookup[c] = torch_url
return cls.torch_index_url_lookup[c], c
self.torch_index_url_lookup[c] = torch_url
return self.torch_index_url_lookup[c], c
except Exception:
pass
return
# first check if key is valid
if cuda in cls.torch_index_url_lookup:
return cls.torch_index_url_lookup[cuda], cuda
if cuda in self.torch_index_url_lookup:
return self.torch_index_url_lookup[cuda], cuda
# then try a new cuda version page
for c in range(cuda, max(-1, cuda-15), -1):
torch_url = cls.extra_index_url_template.format(c)
torch_url = self.extra_index_url_template.format(c)
# noinspection PyBroadException
try:
if requests.get(torch_url, timeout=10).ok:
print('Torch CUDA {} index page found, adding `{}`'.format(c, torch_url))
cls.torch_index_url_lookup[c] = torch_url
return cls.torch_index_url_lookup[c], c
self.torch_index_url_lookup[c] = torch_url
return self.torch_index_url_lookup[c], c
except Exception:
pass
keys = sorted(cls.torch_index_url_lookup.keys(), reverse=True)
keys = sorted(self.torch_index_url_lookup.keys(), reverse=True)
for k in keys:
if k <= cuda:
return cls.torch_index_url_lookup[k], k
return self.torch_index_url_lookup[k], k
# return default - zero
return cls.torch_index_url_lookup[0], 0
return self.torch_index_url_lookup[0], 0
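The lookup above probes CUDA-specific PyTorch index pages, walking the CUDA version downwards until a page responds. A rough standalone sketch of that probing loop (the URL template here is an assumption for illustration; the agent's actual templates are defined on the class):

import requests

def find_torch_index(cuda_version, template="https://download.pytorch.org/whl/cu{}/torch_stable.html"):
    cuda = int(cuda_version)  # e.g. 118 for CUDA 11.8
    for c in range(cuda, max(-1, cuda - 15), -1):
        url = template.format(c)
        try:
            if requests.get(url, timeout=10).ok:
                return url, c
        except requests.RequestException:
            pass
    return None, None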
MAP = {
"windows": {

View File

@@ -19,7 +19,7 @@ import logging
from clearml_agent.definitions import PIP_EXTRA_INDICES
from clearml_agent.helper.base import (
warning, is_conda, which, join_lines, is_windows_platform,
convert_cuda_version_to_int_10_base_str, )
convert_cuda_version_to_int_10_base_str, dump_yaml, )
from clearml_agent.helper.process import Argv, PathLike
from clearml_agent.helper.gpu.gpustat import get_driver_cuda_version
from clearml_agent.session import Session, normalize_cuda_version
@@ -94,6 +94,12 @@ class MarkerRequirement(object):
def __repr__(self):
return '{self.__class__.__name__}[{self}]'.format(self=self)
def __eq__(self, other):
return isinstance(other, MarkerRequirement) and str(self) == str(other)
def __hash__(self):
return str(self).__hash__()
def format_specs(self, num_parts=None, max_num_parts=None):
max_num_parts = max_num_parts or num_parts
if max_num_parts is None or not self.specs:
@@ -116,6 +122,10 @@ class MarkerRequirement(object):
def specs(self): # type: () -> List[Tuple[Text, Text]]
return self.req.specs
@property
def version(self): # type: () -> Text
return self.specs[0][1] if self.specs else ""
@specs.setter
def specs(self, value): # type: (List[Tuple[Text, Text]]) -> None
self.req.specs = value
@@ -143,6 +153,8 @@ class MarkerRequirement(object):
If the requested version is 1.2 the self.spec should be 1.2*
etc.
usage: it returns the value of the following comparison: requested_version "op" self.version
:param str requested_version:
:param str op: '==', '>', '>=', '<=', '<', '~='
:param int num_parts: number of parts to compare
@@ -152,7 +164,7 @@ class MarkerRequirement(object):
if not self.specs:
return True
version = self.specs[0][1]
version = self.version
op = (op or self.specs[0][0]).strip()
return SimpleVersion.compare_versions(
@@ -170,11 +182,21 @@ class MarkerRequirement(object):
self.req.local_file = False
return True
def validate_local_file_ref(self):
def is_local_package_ref(self):
# if local file does not exist, remove the reference to it
if self.vcs or self.editable or self.path or not self.local_file or not self.name or \
not self.uri or not self.uri.startswith("file://"):
return False
return True
def is_vcs_ref(self):
return bool(self.vcs)
def validate_local_file_ref(self):
# if local file does not exist, remove the reference to it
if not self.is_local_package_ref():
return
local_path = Path(self.uri[len("file://"):])
if not local_path.exists():
local_path = Path(unquote(self.uri)[len("file://"):])
@@ -221,6 +243,19 @@ class SimpleVersion:
_local_version_separators = re.compile(r"[\._-]")
_regex = re.compile(r"^\s*" + VERSION_PATTERN + r"\s*$", re.VERBOSE | re.IGNORECASE)
@classmethod
def split_op_version(cls, line):
"""
Split a string in the form of ">=1.2.3" into a (op, version), i.e. (">=", "1.2.3")
Notice: if calling with only a version string (e.g. "1.2.3") the default operator is "=="
which means you get ("==", "1.2.3")
:param line: string examples: "<=0.1.2"
:return: tuple of (op, version) example ("<=", "0.1.2")
"""
match = r"\s*([>=<~!]*)\s*(\S*)\s*"
groups = re.match(match, line).groups()
return groups[0] or "==", groups[1]
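For reference, the split above behaves as follows (a standalone restatement of the same regex, with the expected outputs):

import re

def split_op_version(line):
    # ">=1.2.3" -> (">=", "1.2.3"); a bare version defaults to the "==" operator
    op, version = re.match(r"\s*([>=<~!]*)\s*(\S*)\s*", line).groups()
    return op or "==", version

assert split_op_version("<=0.1.2") == ("<=", "0.1.2")
assert split_op_version("1.2.3") == ("==", "1.2.3")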
@classmethod
def compare_versions(cls, version_a, op, version_b, ignore_sub_versions=True, num_parts=3):
"""
@@ -624,14 +659,54 @@ class RequirementsManager(object):
return handler.replace(req)
return None
def replace(self, requirements): # type: (Text) -> Text
def replace(
self,
requirements, # type: Text
existing_packages=None, # type: List[MarkerRequirement]
pkg_skip_existing_local=True, # type: bool
pkg_skip_existing_vcs=True, # type: bool
pkg_skip_existing=True, # type: bool
): # type: (...) -> Text
parsed_requirements = self.parse_requirements_section_to_marker_requirements(
requirements=requirements, cwd=self._cwd)
requirements=requirements, cwd=self._cwd, skip_local_file_validation=True)
if parsed_requirements and existing_packages:
skipped_packages = None
if pkg_skip_existing:
skipped_packages = set(parsed_requirements) & set(existing_packages)
elif pkg_skip_existing_local or pkg_skip_existing_vcs:
existing_packages = [
p for p in existing_packages if (
(pkg_skip_existing_local and p.is_local_package_ref()) or
(pkg_skip_existing_vcs and p.is_vcs_ref())
)
]
skipped_packages = set(parsed_requirements) & set(existing_packages)
if skipped_packages:
# maintain order
num_skipped_packages = len(parsed_requirements)
parsed_requirements = [p for p in parsed_requirements if p not in skipped_packages]
num_skipped_packages -= len(parsed_requirements)
print("Skipping {} pre-installed packages:\n{}Remaining {} additional packages to install".format(
num_skipped_packages,
dump_yaml(sorted([str(p) for p in skipped_packages])),
len(parsed_requirements)
))
# nothing to install!
if not parsed_requirements:
return ""
# sanity check
if not parsed_requirements:
# return the original requirements just in case
return requirements
# remove local file references that do not exist
for p in parsed_requirements:
p.validate_local_file_ref()
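The skip logic above relies on the new __eq__/__hash__ of MarkerRequirement (equality by string form), so a plain set intersection can drop requirements that are already present in the environment. A toy illustration of the idea (the Req class here is a stand-in, not the real MarkerRequirement):

class Req:
    def __init__(self, line):
        self.line = line.strip()
    def __str__(self):
        return self.line
    def __eq__(self, other):
        return isinstance(other, Req) and str(self) == str(other)
    def __hash__(self):
        return hash(str(self))

requested = [Req("numpy==1.26.4"), Req("pandas==2.2.2"), Req("mypkg @ file:///tmp/mypkg")]
existing = [Req("numpy==1.26.4")]
skipped = set(requested) & set(existing)
remaining = [r for r in requested if r not in skipped]
print([str(r) for r in skipped], [str(r) for r in remaining])
# ['numpy==1.26.4'] ['pandas==2.2.2', 'mypkg @ file:///tmp/mypkg']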
def replace_one(i, req):
# type: (int, MarkerRequirement) -> Optional[Text]
try:
@@ -805,7 +880,7 @@ class RequirementsManager(object):
normalize_cuda_version(cudnn_version or 0))
@staticmethod
def parse_requirements_section_to_marker_requirements(requirements, cwd=None):
def parse_requirements_section_to_marker_requirements(requirements, cwd=None, skip_local_file_validation=False):
def safe_parse(req_str):
# noinspection PyBroadException
try:
@@ -815,7 +890,8 @@ class RequirementsManager(object):
def create_req(x):
r = MarkerRequirement(x)
r.validate_local_file_ref()
if not skip_local_file_validation:
r.validate_local_file_ref()
return r
if not requirements:

View File

@@ -8,7 +8,6 @@ import subprocess
import sys
from contextlib import contextmanager
from copy import copy
from distutils.spawn import find_executable
from itertools import chain, repeat, islice
from os.path import devnull
from time import sleep
@@ -492,3 +491,40 @@ def double_quote(s):
# use single quotes, and put single quotes into double quotes
# the string $"b is then quoted as "$"""b"
return '"' + s.replace('"', '"\'\"\'"') + '"'
def find_executable(executable, path=None):
"""Tries to find 'executable' in the directories listed in 'path'.
A string listing directories separated by 'os.pathsep'; defaults to
os.environ['PATH']. Returns the complete filename or None if not found.
"""
_, ext = os.path.splitext(executable)
if (sys.platform == 'win32') and (ext != '.exe'):
executable = executable + '.exe'
if os.path.isfile(executable):
return executable
if path is None:
path = os.environ.get('PATH', None)
if path is None:
try:
path = os.confstr("CS_PATH")
except (AttributeError, ValueError):
# os.confstr() or CS_PATH is not available
path = os.defpath
# bpo-35755: Don't use os.defpath if the PATH environment variable is
# set to an empty string
# PATH='' doesn't match, whereas PATH=':' looks in the current directory
if not path:
return None
paths = path.split(os.pathsep)
for p in paths:
f = os.path.join(p, executable)
if os.path.isfile(f):
# the file exists, we have a shot at spawn working
return f
return None
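Usage matches the distutils.spawn.find_executable helper it replaces (distutils was removed from the standard library in Python 3.12), for example:

git_path = find_executable("git")
print(git_path)  # e.g. "/usr/bin/git", or None when git is not on PATH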

View File

@@ -6,7 +6,6 @@ import stat
import subprocess
import sys
import tempfile
from distutils.spawn import find_executable
from hashlib import md5
from os import environ
from random import random
@@ -19,7 +18,7 @@ from pathlib2 import Path
import six
from clearml_agent.definitions import ENV_AGENT_GIT_USER, ENV_AGENT_GIT_PASS, ENV_AGENT_GIT_HOST
from clearml_agent.definitions import ENV_AGENT_GIT_USER, ENV_AGENT_GIT_PASS, ENV_AGENT_GIT_HOST, ENV_GIT_CLONE_VERBOSE
from clearml_agent.helper.console import ensure_text, ensure_binary
from clearml_agent.errors import CommandFailedError
from clearml_agent.helper.base import (
@@ -30,7 +29,7 @@ from clearml_agent.helper.base import (
create_file_if_not_exists, safe_remove_file,
)
from clearml_agent.helper.os.locks import FileLock
from clearml_agent.helper.process import DEVNULL, Argv, PathLike, COMMAND_SUCCESS
from clearml_agent.helper.process import DEVNULL, Argv, PathLike, COMMAND_SUCCESS, find_executable
from clearml_agent.session import Session
@@ -197,8 +196,9 @@ class VCS(object):
self.log.info("successfully applied uncommitted changes")
return True
# Command-line flags for clone command
clone_flags = ()
def clone_flags(self):
"""Command-line flags for clone command"""
return tuple()
@abc.abstractmethod
def executable_not_found_error_help(self):
@@ -322,6 +322,8 @@ class VCS(object):
return
# rewrite ssh URLs only if either ssh port or ssh user are forced in config
# TODO: fix, when url is in the form of `git@domain.com:user/project.git` we will fail to get scheme
# need to add ssh:// and replace first ":" with / , unless port is specified
if parsed_url.scheme == "ssh" and (
self.session.config.get('agent.force_git_ssh_port', None) or
self.session.config.get('agent.force_git_ssh_user', None)
@@ -345,11 +347,18 @@ class VCS(object):
# if we have git_user / git_pass replace ssh credentials with https authentication
if (ENV_AGENT_GIT_USER.get() or self.session.config.get('agent.git_user', None)) and \
(ENV_AGENT_GIT_PASS.get() or self.session.config.get('agent.git_pass', None)):
# only apply to a specific domain (if requested)
config_domain = \
ENV_AGENT_GIT_HOST.get() or self.session.config.get("agent.git_host", None)
if config_domain and config_domain != furl(self.url).host:
return
if config_domain:
if config_domain != furl(self.url).host:
# bail out here if we have a git_host configured and it's different than the URL host
# however, we should make sure this is not an ssh@ URL that furl failed to parse
ssh_git_url_match = self.SSH_URL_GIT_SYNTAX.match(self.url)
if not ssh_git_url_match or config_domain != ssh_git_url_match.groupdict().get("host"):
# do not replace to ssh url
return
new_url = self.replace_ssh_url(self.url)
if new_url != self.url:
@@ -366,7 +375,7 @@ class VCS(object):
self._set_ssh_url()
# if we are on linux no need for the full auth url because we use GIT_ASKPASS
url = self.url_without_auth if self._use_ask_pass else self.url_with_auth
clone_command = ("clone", url, self.location) + self.clone_flags
clone_command = ("clone", url, self.location) + self.clone_flags()
# clone all branches regardless of when we want to later checkout
# if branch:
# clone_command += ("-b", branch)
@@ -543,7 +552,6 @@ class VCS(object):
class Git(VCS):
executable_name = "git"
main_branch = ("master", "main")
clone_flags = ("--quiet", "--recursive")
checkout_flags = ("--force",)
COMMAND_ENV = {
# do not prompt for password
@@ -555,7 +563,7 @@ class Git(VCS):
def __init__(self, *args, **kwargs):
super(Git, self).__init__(*args, **kwargs)
self._use_ask_pass = False if not self.session.config.get('agent.enable_git_ask_pass', None) \
self._use_ask_pass = False if not self.session.config.get('agent.enable_git_ask_pass', True) \
else sys.platform == "linux"
try:
@@ -569,6 +577,12 @@ class Git(VCS):
"origin/{}".format(b) for b in ([branch] if isinstance(branch, str) else branch)
]
def clone_flags(self):
return (
"--recursive",
"--verbose" if ENV_GIT_CLONE_VERBOSE.get() else "--quiet"
)
def executable_not_found_error_help(self):
return 'Cannot find "{}" executable. {}'.format(
self.executable_name,
@@ -583,7 +597,8 @@ class Git(VCS):
)
def pull(self):
self.call("fetch", "--all", "--recurse-submodules", cwd=self.location)
self._set_ssh_url()
self.call("fetch", "--all", "--tags", "--recurse-submodules", cwd=self.location)
def _git_pass_auth_wrapper(self, func, *args, **kwargs):
try:
@@ -765,7 +780,22 @@ def clone_repository_cached(session, execution, destination):
# We clone the entire repository, not a specific branch
vcs.clone() # branch=execution.branch)
vcs.pull()
print("pulling git")
try:
vcs.pull()
except Exception as ex:
print("git pull failed: {}".format(ex))
if (
session.config.get("agent.vcs_cache.enabled", False) and
session.config.get("agent.vcs_cache.clone_on_pull_fail", False)
):
print("pulling git failed, re-cloning: {}".format(no_password_url))
rm_tree(cached_repo_path)
vcs.clone()
else:
raise ex
print("pulling git completed")
rm_tree(destination)
shutil.copytree(Text(cached_repo_path), Text(clone_folder),
symlinks=select_for_platform(linux=True, windows=False),
@@ -796,8 +826,8 @@ def fix_package_import_diff_patch(entry_script_file):
lines = f.readlines()
except Exception:
return
# make sre we are the first import (i.e. we patched the source code)
if not lines or not lines[0].strip().startswith('from clearml ') or 'Task.init' not in lines[1]:
# make sure we are the first import (i.e. we patched the source code)
if len(lines or []) < 2 or not lines[0].strip().startswith('from clearml ') or 'Task.init' not in lines[1]:
return
original_lines = lines
@@ -854,3 +884,90 @@ def fix_package_import_diff_patch(entry_script_file):
f.writelines(new_lines)
except Exception:
return
def _locate_future_import(lines):
# type: (list[str]) -> int
"""
:param lines: string lines of a python file
:return: line index of the last __future_ import. return -1 if no __future__ was found
"""
# skip over the first two lines, they are ours
# then skip over empty or comment lines
lines = [(i, line.split('#', 1)[0].rstrip()) for i, line in enumerate(lines)
if line.strip('\r\n\t ') and not line.strip().startswith('#')]
# remove triple quotes ' """ '
nested_c = -1
skip_lines = []
for i, line_pair in enumerate(lines):
for _ in line_pair[1].split('"""')[1:]:
if nested_c >= 0:
skip_lines.extend(list(range(nested_c, i + 1)))
nested_c = -1
else:
nested_c = i
# now select all the lines that are not inside triple-quoted blocks
lines = [pair for i, pair in enumerate(lines) if i not in skip_lines]
from_future = re.compile(r"^from[\s]*__future__[\s]*")
import_future = re.compile(r"^import[\s]*__future__[\s]*")
# test if we have __future__ import
found_index = -1
for a_i, (_, a_line) in enumerate(lines):
if found_index >= a_i:
continue
if from_future.match(a_line) or import_future.match(a_line):
found_index = a_i
# check the last import block
i, line = lines[found_index]
# whether the line ends with a \\ continuation character or continues inside parentheses
parenthesized_lines = '(' in line and ')' not in line
while line.endswith('\\') or parenthesized_lines:
found_index += 1
i, line = lines[found_index]
if ')' in line:
break
else:
break
return found_index if found_index < 0 else lines[found_index][0]
def patch_add_task_init_call(local_filename):
if not local_filename or not Path(local_filename).is_file() or not str(local_filename).lower().endswith(".py"):
return
idx_a = 0
# find the right entry for the patch if we have a local file (basically after __future__
try:
with open(local_filename, 'rt') as f:
lines = f.readlines()
except Exception as ex:
print("Failed patching entry point file {}: {}".format(local_filename, ex))
return
future_found = _locate_future_import(lines)
if future_found >= 0:
idx_a = future_found + 1
# check if we have not already patched it, no need to add another one
if len(lines or []) >= idx_a+2 and lines[idx_a].strip().startswith('from clearml ') and 'Task.init' in lines[idx_a+1]:
print("File {} already patched with Task.init()".format(local_filename))
return
patch = [
"from clearml import Task\n",
"(__name__ != \"__main__\") or Task.init()\n",
]
lines = lines[:idx_a] + patch + lines[idx_a:]
# noinspection PyBroadException
try:
with open(local_filename, 'wt') as f:
f.writelines(lines)
except Exception as ex:
print("Failed patching entry point file {}: {}".format(local_filename, ex))
return
print("Force clearml Task.init patch adding to entry point script: {}".format(local_filename))

View File

@@ -1,19 +1,20 @@
from __future__ import unicode_literals, division
import logging
import os
import re
import shlex
from collections import deque
from itertools import starmap
from threading import Thread, Event
from time import time
from typing import Text, Sequence
from typing import Sequence, List, Union, Dict, Optional
import attr
import psutil
from pathlib2 import Path
from clearml_agent.definitions import ENV_WORKER_TAGS, ENV_GPU_FRACTIONS
from clearml_agent.session import Session
from clearml_agent.definitions import ENV_WORKER_TAGS
try:
from .gpu import gpustat
@@ -54,6 +55,14 @@ class ResourceMonitor(object):
if value is not None
}
@attr.s
class ClusterReport:
cluster_key = attr.ib(type=str)
max_gpus = attr.ib(type=int, default=None)
max_workers = attr.ib(type=int, default=None)
max_cpus = attr.ib(type=int, default=None)
resource_groups = attr.ib(type=Sequence[str], factory=list)
def __init__(
self,
session, # type: Session
@@ -61,7 +70,7 @@ class ResourceMonitor(object):
sample_frequency_per_sec=2.0,
report_frequency_sec=30.0,
first_report_sec=None,
worker_tags=None,
worker_tags=None
):
self.session = session
self.queue = deque(maxlen=1)
@@ -79,6 +88,13 @@ class ResourceMonitor(object):
self._gpustat_fail = 0
self._gpustat = gpustat
self._active_gpus = None
self._default_gpu_utilization = session.config.get("agent.resource_monitoring.default_gpu_utilization", 100)
# allow default_gpu_utilization as null in the config, in which case we don't log anything
if self._default_gpu_utilization is not None:
self._default_gpu_utilization = int(self._default_gpu_utilization)
self._gpu_utilization_warning_sent = False
self._disk_use_path = str(session.config.get("agent.resource_monitoring.disk_use_path", None) or Path.home())
self._fractions_handler = GpuFractionsHandler() if session.feature_set != "basic" else None
if not worker_tags and ENV_WORKER_TAGS.get():
worker_tags = shlex.split(ENV_WORKER_TAGS.get())
self._worker_tags = worker_tags
@@ -91,6 +107,7 @@ class ResourceMonitor(object):
else:
# None means no filtering, report all gpus
self._active_gpus = None
# noinspection PyBroadException
try:
active_gpus = Session.get_nvidia_visible_env()
# None means no filtering, report all gpus
@@ -98,6 +115,10 @@ class ResourceMonitor(object):
self._active_gpus = [g.strip() for g in str(active_gpus).split(',')]
except Exception:
pass
self._cluster_report_interval_sec = int(session.config.get(
"agent.resource_monitoring.cluster_report_interval_sec", 60
))
self._cluster_report = None
def set_report(self, report):
# type: (ResourceMonitor.StatusReport) -> ()
@@ -129,6 +150,7 @@ class ResourceMonitor(object):
)
log.debug("sending report: %s", report)
# noinspection PyBroadException
try:
self.session.get(service="workers", action="status_report", **report)
except Exception:
@@ -136,7 +158,76 @@ class ResourceMonitor(object):
return False
return True
def send_cluster_report(self) -> bool:
if not self.session.feature_set == "basic":
return False
# noinspection PyBroadException
try:
properties = {
"max_cpus": self._cluster_report.max_cpus,
"max_gpus": self._cluster_report.max_gpus,
"max_workers": self._cluster_report.max_workers,
}
payload = {
"key": self._cluster_report.cluster_key,
"timestamp": int(time() * 1000),
"timeout": int(self._cluster_report_interval_sec * 2),
# "resource_groups": self._cluster_report.resource_groups, # yet to be supported
"properties": {k: v for k, v in properties.items() if v is not None},
}
self.session.post(service="workers", action="cluster_report", **payload)
except Exception as ex:
log.warning("Failed sending cluster report: %s", ex)
return False
return True
def setup_cluster_report(self, available_gpus, gpu_queues, worker_id=None, cluster_key=None, resource_groups=None):
# type: (List[int], Dict[str, int], Optional[str], Optional[str], Optional[List[str]]) -> ()
"""
Set up a cluster report for the enterprise server dashboard feature.
If a worker_id is provided, cluster_key and resource_groups are inferred from it.
"""
if self.session.feature_set == "basic":
return
if not worker_id and not cluster_key:
print("Error: cannot set up dashboard reporting - worker_id or cluster key are required")
return
# noinspection PyBroadException
try:
if not cluster_key:
worker_id_parts = worker_id.split(":")
if len(worker_id_parts) < 3:
cluster_key = self.session.config.get("agent.resource_dashboard.default_cluster_name", "onprem")
resource_group = ":".join((cluster_key, worker_id_parts[0]))
print(
'WARNING: your worker ID "{}" is not suitable for proper resource dashboard reporting, please '
'set up agent.worker_name to be at least two colon-separated parts (i.e. "<category>:<name>"). '
'Using "{}" as the resource dashboard category and "{}" as the resource group.'.format(
worker_id, cluster_key, resource_group
)
)
else:
cluster_key = worker_id_parts[0]
resource_group = ":".join((worker_id_parts[:2]))
resource_groups = [resource_group]
self._cluster_report = ResourceMonitor.ClusterReport(
cluster_key=cluster_key,
max_gpus=len(available_gpus),
max_workers=len(available_gpus) // min(x for x, _ in gpu_queues.values()),
resource_groups=resource_groups
)
self.send_cluster_report()
except Exception as ex:
print("Error: failed setting cluster report: {}".format(ex))
def _daemon(self):
last_cluster_report = 0
seconds_since_started = 0
reported = 0
try:
@@ -153,7 +244,7 @@ class ResourceMonitor(object):
try:
self._update_readouts()
except Exception as ex:
log.warning("failed getting machine stats: %s", report_error(ex))
log.error("failed getting machine stats: %s", report_error(ex))
self._failure()
seconds_since_started += int(round(time() - last_report))
@@ -176,6 +267,15 @@ class ResourceMonitor(object):
# count reported iterations
reported += 1
if (
self._cluster_report and
self._cluster_report_interval_sec
and time() - last_cluster_report > self._cluster_report_interval_sec
):
if self.send_cluster_report():
last_cluster_report = time()
except Exception as ex:
log.exception("Error reporting monitoring info: %s", str(ex))
@@ -242,7 +342,7 @@ class ResourceMonitor(object):
virtual_memory = psutil.virtual_memory()
stats["memory_used"] = BytesSizes.megabytes(virtual_memory.used)
stats["memory_free"] = BytesSizes.megabytes(virtual_memory.available)
disk_use_percentage = psutil.disk_usage(Text(Path.home())).percent
disk_use_percentage = psutil.disk_usage(self._disk_use_path).percent
stats["disk_free_percent"] = 100 - disk_use_percentage
sensor_stat = (
psutil.sensors_temperatures()
@@ -264,23 +364,48 @@ class ResourceMonitor(object):
if self._active_gpus is not False and self._gpustat:
try:
gpu_stat = self._gpustat.new_query()
report_index = 0
for i, g in enumerate(gpu_stat.gpus):
# only monitor the active gpu's, if none were selected, monitor everything
if self._active_gpus and str(i) not in self._active_gpus:
continue
stats["gpu_temperature_{:d}".format(i)] = g["temperature.gpu"]
stats["gpu_utilization_{:d}".format(i)] = g["utilization.gpu"]
stats["gpu_mem_usage_{:d}".format(i)] = (
if self._active_gpus:
uuid = getattr(g, "uuid", None)
mig_uuid = getattr(g, "mig_uuid", None)
if (
str(g.index) not in self._active_gpus
and (not uuid or uuid not in self._active_gpus)
and (not mig_uuid or mig_uuid not in self._active_gpus)
):
continue
stats["gpu_temperature_{}".format(report_index)] = g["temperature.gpu"]
if g["utilization.gpu"] is not None:
stats["gpu_utilization_{}".format(report_index)] = g["utilization.gpu"]
elif self._default_gpu_utilization is not None:
stats["gpu_utilization_{}".format(report_index)] = self._default_gpu_utilization
if getattr(g, "mig_index", None) is None and not self._gpu_utilization_warning_sent:
# this shouldn't happen for non-MIGs, warn the user about it
log.error("Failed fetching GPU utilization")
self._gpu_utilization_warning_sent = True
stats["gpu_mem_usage_{}".format(report_index)] = (
100.0 * g["memory.used"] / g["memory.total"]
)
# already in MBs
stats["gpu_mem_free_{:d}".format(i)] = (
stats["gpu_mem_free_{}".format(report_index)] = (
g["memory.total"] - g["memory.used"]
)
stats["gpu_mem_used_%d" % i] = g["memory.used"]
stats["gpu_mem_used_{}".format(report_index)] = g["memory.used"] or 0
if self._fractions_handler:
fractions = self._fractions_handler.fractions
stats["gpu_fraction_{}".format(report_index)] = \
(fractions[i] if i < len(fractions) else fractions[-1]) if fractions else 1.0
report_index += 1
except Exception as ex:
# something happened and we can't use gpu stats,
log.warning("failed getting machine stats: %s", report_error(ex))
log.error("failed getting machine stats: %s", report_error(ex))
self._failure()
return stats
@@ -293,19 +418,142 @@ class ResourceMonitor(object):
)
self._gpustat = None
BACKEND_STAT_MAP = {"cpu_usage_*": "cpu_usage",
"cpu_temperature_*": "cpu_temperature",
"disk_free_percent": "disk_free_home",
"io_read_mbs": "disk_read",
"io_write_mbs": "disk_write",
"network_tx_mbs": "network_tx",
"network_rx_mbs": "network_rx",
"memory_free": "memory_free",
"memory_used": "memory_used",
"gpu_temperature_*": "gpu_temperature",
"gpu_mem_used_*": "gpu_memory_used",
"gpu_mem_free_*": "gpu_memory_free",
"gpu_utilization_*": "gpu_usage"}
BACKEND_STAT_MAP = {
"cpu_usage_*": "cpu_usage",
"cpu_temperature_*": "cpu_temperature",
"disk_free_percent": "disk_free_home",
"io_read_mbs": "disk_read",
"io_write_mbs": "disk_write",
"network_tx_mbs": "network_tx",
"network_rx_mbs": "network_rx",
"memory_free": "memory_free",
"memory_used": "memory_used",
"gpu_temperature_*": "gpu_temperature",
"gpu_mem_used_*": "gpu_memory_used",
"gpu_mem_free_*": "gpu_memory_free",
"gpu_utilization_*": "gpu_usage",
"gpu_fraction_*": "gpu_fraction"
}
class GpuFractionsHandler:
_number_re = re.compile(r"^clear\.ml/fraction(-\d+)?$")
_mig_re = re.compile(r"^nvidia\.com/mig-(?P<compute>[0-9]+)g\.(?P<memory>[0-9]+)gb$")
_frac_gpu_injector_re = re.compile(r"^clearml-injector/fraction$")
_gpu_name_to_memory_gb = {
"A30": 24,
"NVIDIA A30": 24,
"A100-SXM4-40GB": 40,
"NVIDIA-A100-40GB-PCIe": 40,
"NVIDIA A100-40GB-PCIe": 40,
"NVIDIA-A100-SXM4-40GB": 40,
"NVIDIA A100-SXM4-40GB": 40,
"NVIDIA-A100-SXM4-80GB": 79,
"NVIDIA A100-SXM4-80GB": 79,
"NVIDIA-A100-80GB-PCIe": 79,
"NVIDIA A100-80GB-PCIe": 79,
}
def __init__(self):
self._total_memory_gb = [
self._gpu_name_to_memory_gb.get(name, 0)
for name in (self._get_gpu_names() or [])
]
self._fractions = self._get_fractions()
@property
def fractions(self) -> List[float]:
return self._fractions
def _get_fractions(self) -> List[float]:
if not self._total_memory_gb:
# Can't compute
return [1.0]
fractions = (ENV_GPU_FRACTIONS.get() or "").strip()
if not fractions:
# No fractions
return [1.0]
decoded_fractions = self.decode_fractions(fractions)
if isinstance(decoded_fractions, list):
return decoded_fractions
totals = []
for i, (fraction, count) in enumerate(decoded_fractions.items()):
m = self._mig_re.match(fraction)
if not m:
continue
try:
total_gb = self._total_memory_gb[i] if i < len(self._total_memory_gb) else self._total_memory_gb[-1]
if not total_gb:
continue
totals.append((int(m.group("memory")) * count) / total_gb)
except ValueError:
pass
if not totals:
log.warning("Fractions count is empty for {}".format(fractions))
return [1.0]
return totals
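As a worked example of the MIG branch above (values assumed for illustration): two "nvidia.com/mig-2g.10gb" slices on a 40 GB A100 contribute half a GPU.

# worked example (assumed values): two 10 GB MIG slices on a 40 GB A100
total_gb = 40
fraction = (10 * 2) / total_gb  # 0.5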
@classmethod
def extract_custom_limits(cls, limits: dict):
for k, v in list((limits or {}).items()):
if cls._number_re.match(k):
limits.pop(k, None)
@classmethod
def get_simple_fractions_total(cls, limits: dict) -> float:
try:
if any(cls._number_re.match(x) for x in limits):
return sum(float(v) for k, v in limits.items() if cls._number_re.match(k))
except Exception as ex:
log.error("Failed summing up fractions from {}: {}".format(limits, ex))
return 0
@classmethod
def encode_fractions(cls, limits: dict, annotations: dict) -> str:
if limits:
if any(cls._number_re.match(x) for x in (limits or {})):
return ",".join(str(v) for k, v in sorted(limits.items()) if cls._number_re.match(k))
return ",".join(("{}:{}".format(k, v) for k, v in (limits or {}).items() if cls._mig_re.match(k)))
elif annotations:
if any(cls._frac_gpu_injector_re.match(x) for x in (annotations or {})):
return ",".join(str(v) for k, v in sorted(annotations.items()) if cls._frac_gpu_injector_re.match(k))
@staticmethod
def decode_fractions(fractions: str) -> Union[List[float], Dict[str, int]]:
try:
items = [f.strip() for f in fractions.strip().split(",")]
tuples = [(k.strip(), v.strip()) for k, v in (f.partition(":")[::2] for f in items)]
if all(not v for _, v in tuples):
# comma-separated float fractions
return [float(k) for k, _ in tuples]
# comma-separated slice:count items
return {
k.strip(): int(v.strip())
for k, v in tuples
}
except Exception as ex:
log.error("Failed decoding GPU fractions '{}': {}".format(fractions, ex))
return {}
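The two input formats accepted above, shown with example values (assuming the class is importable as defined here):

GpuFractionsHandler.decode_fractions("0.25,0.25,0.5")
# -> [0.25, 0.25, 0.5]
GpuFractionsHandler.decode_fractions("nvidia.com/mig-1g.5gb:2,nvidia.com/mig-2g.10gb:1")
# -> {"nvidia.com/mig-1g.5gb": 2, "nvidia.com/mig-2g.10gb": 1}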
@staticmethod
def _get_gpu_names():
# noinspection PyBroadException
try:
gpus = gpustat.new_query().gpus
names = [g["name"] for g in gpus]
print("GPU names: {}".format(names))
return names
except Exception as ex:
log.error("Failed getting GPU names: {}".format(ex))
def report_error(ex):

View File

@@ -44,6 +44,11 @@ WORKER_ARGS = {
}
DAEMON_ARGS = dict({
'--polling-interval': {
'help': 'Polling interval in seconds. Minimum is 5 (default 5)',
'type': int,
'default': 5,
},
'--foreground': {
'help': 'Pipe full log to stdout/stderr, should not be used if running in background',
'action': 'store_true',
@@ -62,7 +67,10 @@ DAEMON_ARGS = dict({
'group': 'Docker support',
},
'--queue': {
'help': 'Queue ID(s)/Name(s) to pull tasks from (\'default\' queue)',
'help': 'Queue ID(s)/Name(s) to pull tasks from (\'default\' queue).'
' Note that the queue list order determines priority, with the first listed queue having the'
' highest priority. To change this behavior, use --order-fairness to pull from each queue in a'
' round-robin order',
'nargs': '+',
'default': tuple(),
'dest': 'queues',
@@ -107,8 +115,11 @@ DAEMON_ARGS = dict({
'--dynamic-gpus': {
'help': 'Allow to dynamically allocate gpus based on queue properties, '
'configure with \'--queue <queue_name>=<num_gpus>\'.'
' Example: \'--dynamic-gpus --gpus 0-3 --queue dual_gpus=2 single_gpu=1\''
' Example Opportunistic: \'--dynamic-gpus --gpus 0-3 --queue dual_gpus=2 max_quad_gpus=1-4 \'',
' Example: \'--dynamic-gpus --gpus 0-3 --queue dual_gpus=2 single_gpu=1\'.'
' Example Opportunistic: \'--dynamic-gpus --gpus 0-3 --queue dual_gpus=2 max_quad_gpus=1-4\'.'
' Note that the queue list order determines priority, with the first listed queue having the'
' highest priority. To change this behavior, use --order-fairness to pull from each queue in a'
' round-robin order',
'action': 'store_true',
},
'--uptime': {

View File

@@ -4,6 +4,7 @@ import json
import logging
import os
import platform
import re
import sys
from copy import deepcopy
from typing import Any, Callable
@@ -19,7 +20,7 @@ from clearml_agent.definitions import ENVIRONMENT_CONFIG, ENV_TASK_EXECUTE_AS_US
from clearml_agent.errors import APIError
from clearml_agent.helper.base import HOCONEncoder
from clearml_agent.helper.process import Argv
from clearml_agent.helper.docker_args import DockerArgsSanitizer
from clearml_agent.helper.docker_args import DockerArgsSanitizer, sanitize_urls
from .version import __version__
POETRY = "poetry"
@@ -240,38 +241,49 @@ class Session(_Session):
except:
pass
def print_configuration(
self,
remove_secret_keys=("secret", "pass", "token", "account_key", "contents"),
skip_value_keys=("environment", ),
docker_args_sanitize_keys=("extra_docker_arguments", ),
):
def print_configuration(self):
def load_config(key, default):
return [re.compile(x) for x in self.config.get(f"agent.sanitize_config_printout.{key}", default=default)]
dont_hide_secret_keys = load_config("dont_hide_secrets", ("^enable_git_ask_pass$",))
hide_secret_keys = load_config("hide_secrets", ("secret", "pass", "token", "account_key", "contents"))
hide_secret_section_keys = load_config("hide_secrets_recursive", ("^environment$",))
docker_cmd_keys = load_config("docker_commands", ("^extra_docker_arguments$",))
urls_keys = load_config("urls", ("^extra_index_url$",))
# remove all the secrets from the print
def recursive_remove_secrets(dictionary, secret_keys=(), empty_keys=()):
def recursive_remove_secrets(dictionary):
for k in list(dictionary):
for s in secret_keys:
if s in k:
dictionary.pop(k)
break
for s in empty_keys:
if s == k:
if not any(r.search(k) for r in dont_hide_secret_keys):
if any(r.search(k) for r in hide_secret_keys):
dictionary[k] = '****'
continue
if any(r.search(k) for r in hide_secret_section_keys):
dictionary[k] = {key: '****' for key in dictionary[k]} \
if isinstance(dictionary[k], dict) else '****'
break
continue
if any(r.search(k) for r in urls_keys):
value = dictionary.get(k, None)
if isinstance(value, str):
dictionary[k] = sanitize_urls(value)[0]
elif isinstance(value, (list, tuple)):
dictionary[k] = [sanitize_urls(v)[0] for v in value]
elif isinstance(value, dict):
dictionary[k] = {k_: sanitize_urls(v)[0] for k_, v in value.items()}
if isinstance(dictionary.get(k, None), dict):
recursive_remove_secrets(dictionary[k], secret_keys=secret_keys, empty_keys=empty_keys)
recursive_remove_secrets(dictionary[k])
elif isinstance(dictionary.get(k, None), (list, tuple)):
if k in (docker_args_sanitize_keys or []):
if any(r.match(k) for r in docker_cmd_keys):
dictionary[k] = DockerArgsSanitizer.sanitize_docker_command(self, dictionary[k])
for item in dictionary[k]:
if isinstance(item, dict):
recursive_remove_secrets(item, secret_keys=secret_keys, empty_keys=empty_keys)
recursive_remove_secrets(item)
config = deepcopy(self.config.to_dict())
# remove the env variable, it's not important
config.pop('env', None)
if remove_secret_keys or skip_value_keys or docker_args_sanitize_keys:
recursive_remove_secrets(config, secret_keys=remove_secret_keys, empty_keys=skip_value_keys)
if hide_secret_keys or hide_secret_section_keys or docker_cmd_keys or urls_keys:
recursive_remove_secrets(config)
# remove logging.loggers.urllib3.level from the print
try:
config['logging']['loggers']['urllib3'].pop('level', None)
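A small sketch of how the regex lists loaded above behave (the config keys and values in this example are made up; the default patterns are the ones shown in the code, and the real code additionally runs sanitize_urls() on keys matching the "urls" patterns):

import re

hide_secret_keys = [re.compile(p) for p in ("secret", "pass", "token", "account_key", "contents")]
cfg = {"git_user": "me", "git_pass": "hunter2", "extra_index_url": ["https://user:pw@pypi.internal/simple"]}
for k in list(cfg):
    if any(r.search(k) for r in hide_secret_keys):
        cfg[k] = "****"
print(cfg)  # {'git_user': 'me', 'git_pass': '****', 'extra_index_url': ['https://user:pw@pypi.internal/simple']}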

View File

@@ -1 +1 @@
__version__ = '1.6.1'
__version__ = '1.9.2'

View File

@@ -0,0 +1,37 @@
#!/bin/bash
# Check if image name and Dockerfile path are provided
if [ -z "$1" ] || [ -z "$2" ]; then
echo "Usage: $0 <image_name> <dockerfile_path> <build_context>"
exit 1
fi
# Build the Docker image
image_name=$1
dockerfile_path=$2
build_context=$3
if [ $build_context == "glue-build-aws" ] || [ $build_context == "glue-build-gcp" ]; then
if [ ! -f $build_context/clearml.conf ]; then
cp build-resources/clearml.conf $build_context
fi
if [ ! -f $build_context/entrypoint.sh ]; then
cp build-resources/entrypoint.sh $build_context
chmod +x $build_context/entrypoint.sh
fi
if [ ! -f $build_context/setup.sh ]; then
cp build-resources/setup.sh $build_context
chmod +x $build_context/setup.sh
fi
fi
cp ../../examples/k8s_glue_example.py $build_context
docker build -f $dockerfile_path -t $image_name $build_context
# cleanup
if [ $build_context == "glue-build-aws" ] || [ $build_context == "glue-build-gcp" ]; then
rm $build_context/clearml.conf
rm $build_context/entrypoint.sh
rm $build_context/setup.sh
fi
rm $build_context/k8s_glue_example.py

View File

@@ -361,7 +361,7 @@ sdk {
vcs_repo_detect_async: true
# Store uncommitted git/hg source code diff in experiment manifest when training in development mode
# This stores "git diff" or "hg diff" into the experiment's "script.requirements.diff" section
# This stores "git diff" or into the experiment's "script.requirements.diff" section
store_uncommitted_code_diff: true
# Support stopping an experiment in case it was externally stopped, status was changed or task was reset

View File

@@ -1,112 +0,0 @@
"""
This example assumes you have preconfigured services with selectors in the form of
"ai.allegro.agent.serial=pod-<number>" and a targetPort of 10022.
The K8sIntegration component will label each pod accordingly.
"""
from argparse import ArgumentParser
from clearml_agent.glue.k8s import K8sIntegration
def parse_args():
parser = ArgumentParser()
group = parser.add_mutually_exclusive_group()
parser.add_argument(
"--queue", type=str, help="Queue to pull tasks from"
)
group.add_argument(
"--ports-mode", action='store_true', default=False,
help="Ports-Mode will add a label to the pod which can be used as service, in order to expose ports"
"Should not be used with max-pods"
)
parser.add_argument(
"--num-of-services", type=int, default=20,
help="Specify the number of k8s services to be used. Use only with ports-mode."
)
parser.add_argument(
"--base-port", type=int,
help="Used in conjunction with ports-mode, specifies the base port exposed by the services. "
"For pod #X, the port will be <base-port>+X. Note that pod number is calculated based on base-pod-num"
"e.g. if base-port=20000 and base-pod-num=3, the port for the first pod will be 20003"
)
parser.add_argument(
"--base-pod-num", type=int, default=1,
help="Used in conjunction with ports-mode and base-port, specifies the base pod number to be used by the "
"service (default: %(default)s)"
)
parser.add_argument(
"--gateway-address", type=str, default=None,
help="Used in conjunction with ports-mode, specify the external address of the k8s ingress / ELB"
)
parser.add_argument(
"--pod-clearml-conf", type=str,
help="Configuration file to be used by the pod itself (if not provided, current configuration is used)"
)
parser.add_argument(
"--overrides-yaml", type=str,
help="YAML file containing pod overrides to be used when launching a new pod"
)
parser.add_argument(
"--template-yaml", type=str,
help="YAML file containing pod template. If provided pod will be scheduled with kubectl apply "
"and overrides are ignored, otherwise it will be scheduled with kubectl run"
)
parser.add_argument(
"--ssh-server-port", type=int, default=0,
help="If non-zero, every pod will also start an SSH server on the selected port (default: zero, not active)"
)
parser.add_argument(
"--namespace", type=str,
help="Specify the namespace in which pods will be created (default: %(default)s)", default="clearml"
)
group.add_argument(
"--max-pods", type=int,
help="Limit the maximum number of pods that this service can run at the same time."
"Should not be used with ports-mode"
)
parser.add_argument(
"--use-owner-token", action="store_true", default=False,
help="Generate and use task owner token for the execution of each task"
)
parser.add_argument(
"--standalone-mode", action="store_true", default=False,
help="Do not use any network connects, assume everything is pre-installed"
)
parser.add_argument(
"--child-report-tags", type=str, nargs="+", default=None,
help="List of tags to send with the status reports from a worker that runs a task"
)
return parser.parse_args()
def main():
args = parse_args()
user_props_cb = None
if args.ports_mode and args.base_port:
def k8s_user_props_cb(pod_number=0):
user_prop = {"k8s-pod-port": args.base_port + pod_number}
if args.gateway_address:
user_prop["k8s-gateway-address"] = args.gateway_address
return user_prop
user_props_cb = k8s_user_props_cb
k8s = K8sIntegration(
ports_mode=args.ports_mode, num_of_services=args.num_of_services, base_pod_num=args.base_pod_num,
user_props_cb=user_props_cb, overrides_yaml=args.overrides_yaml, clearml_conf_file=args.pod_clearml_conf,
template_yaml=args.template_yaml, extra_bash_init_script=K8sIntegration.get_ssh_server_bash(
ssh_port_number=args.ssh_server_port) if args.ssh_server_port else None,
namespace=args.namespace, max_pods_limit=args.max_pods or None
)
k8s.k8s_daemon(
args.queue,
use_owner_token=args.use_owner_token,
standalone_mode=args.standalone_mode,
child_report_tags=args.child_report_tags
)
if __name__ == "__main__":
main()

View File

@@ -1,4 +1,4 @@
FROM ubuntu:18.04
FROM ubuntu:22.04
USER root
WORKDIR /root
@@ -8,15 +8,16 @@ ENV LANG=en_US.UTF-8
ENV LANGUAGE=en_US.UTF-8
ENV PYTHONIOENCODING=UTF-8
COPY ../build-resources/setup.sh /root/setup.sh
COPY ./setup.sh /root/setup.sh
RUN /root/setup.sh
COPY ./setup_aws.sh /root/setup_aws.sh
RUN /root/setup_aws.sh
RUN chmod +x /root/setup_aws.sh && /root/setup_aws.sh
COPY ../build-resources/entrypoint.sh /root/entrypoint.sh
COPY ./entrypoint.sh /root/entrypoint.sh
COPY ./provider_entrypoint.sh /root/provider_entrypoint.sh
COPY ./build-resources/k8s_glue_example.py /root/k8s_glue_example.py
RUN chmod +x /root/provider_entrypoint.sh
COPY ./k8s_glue_example.py /root/k8s_glue_example.py
COPY ./clearml.conf /root/clearml.conf
ENTRYPOINT ["/root/entrypoint.sh"]

View File

@@ -4,7 +4,8 @@ curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip
unzip awscliv2.zip
./aws/install
curl -o kubectl https://amazon-eks.s3-us-west-2.amazonaws.com/1.21.2/2021-07-05/bin/linux/amd64/kubectl
curl -o kubectl https://s3.us-west-2.amazonaws.com/amazon-eks/1.29.3/2024-04-19/bin/linux/amd64/kubectl
#curl -o kubectl https://amazon-eks.s3-us-west-2.amazonaws.com/1.21.2/2021-07-05/bin/linux/amd64/kubectl
#curl -o kubectl https://amazon-eks.s3.us-west-2.amazonaws.com/1.19.6/2021-01-05/bin/linux/amd64/kubectl
chmod +x ./kubectl && mkdir -p $HOME/bin && cp ./kubectl $HOME/bin/kubectl && export PATH=$PATH:$HOME/bin

View File

@@ -1,4 +1,4 @@
FROM ubuntu:18.04
FROM ubuntu:22.04
USER root
WORKDIR /root
@@ -8,15 +8,15 @@ ENV LANG=en_US.UTF-8
ENV LANGUAGE=en_US.UTF-8
ENV PYTHONIOENCODING=UTF-8
COPY ../build-resources/setup.sh /root/setup.sh
COPY ./setup.sh /root/setup.sh
RUN /root/setup.sh
COPY ./setup_gcp.sh /root/setup_gcp.sh
RUN /root/setup_gcp.sh
RUN chmod +x /root/setup_gcp.sh && /root/setup_gcp.sh
COPY ../build-resources/entrypoint.sh /root/entrypoint.sh
COPY ./entrypoint.sh /root/entrypoint.sh
COPY ./provider_entrypoint.sh /root/provider_entrypoint.sh
COPY ./build-resources/k8s_glue_example.py /root/k8s_glue_example.py
COPY ./k8s_glue_example.py /root/k8s_glue_example.py
COPY ./clearml.conf /root/clearml.conf
ENTRYPOINT ["/root/entrypoint.sh"]

View File

@@ -1,6 +1,6 @@
#!/bin/bash
curl -LO https://dl.k8s.io/release/v1.21.0/bin/linux/amd64/kubectl
curl -LO https://dl.k8s.io/release/v1.29.3/bin/linux/amd64/kubectl
install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl

View File

@@ -20,7 +20,7 @@ FROM python:${TAG} as target
WORKDIR /app
ARG KUBECTL_VERSION=1.24.0
ARG KUBECTL_VERSION=1.29.3
# Not sure about these ENV vars
# ENV LC_ALL=en_US.UTF-8

View File

@@ -2,7 +2,7 @@ ARG TAG=3.7.17-slim-bullseye
FROM python:${TAG} as target
ARG KUBECTL_VERSION=1.22.4
ARG KUBECTL_VERSION=1.29.3
WORKDIR /app

View File

@@ -1,98 +0,0 @@
"""
This example assumes you have preconfigured services with selectors in the form of
"ai.allegro.agent.serial=pod-<number>" and a targetPort of 10022.
The K8sIntegration component will label each pod accordingly.
"""
from argparse import ArgumentParser
from clearml_agent.glue.k8s import K8sIntegration
def parse_args():
parser = ArgumentParser()
group = parser.add_mutually_exclusive_group()
parser.add_argument(
"--queue", type=str, help="Queue to pull tasks from"
)
group.add_argument(
"--ports-mode", action='store_true', default=False,
help="Ports-Mode will add a label to the pod which can be used as service, in order to expose ports"
"Should not be used with max-pods"
)
parser.add_argument(
"--num-of-services", type=int, default=20,
help="Specify the number of k8s services to be used. Use only with ports-mode."
)
parser.add_argument(
"--base-port", type=int,
help="Used in conjunction with ports-mode, specifies the base port exposed by the services. "
"For pod #X, the port will be <base-port>+X. Note that pod number is calculated based on base-pod-num"
"e.g. if base-port=20000 and base-pod-num=3, the port for the first pod will be 20003"
)
parser.add_argument(
"--base-pod-num", type=int, default=1,
help="Used in conjunction with ports-mode and base-port, specifies the base pod number to be used by the "
"service (default: %(default)s)"
)
parser.add_argument(
"--gateway-address", type=str, default=None,
help="Used in conjunction with ports-mode, specify the external address of the k8s ingress / ELB"
)
parser.add_argument(
"--pod-clearml-conf", type=str,
help="Configuration file to be used by the pod itself (if not provided, current configuration is used)"
)
parser.add_argument(
"--overrides-yaml", type=str,
help="YAML file containing pod overrides to be used when launching a new pod"
)
parser.add_argument(
"--template-yaml", type=str,
help="YAML file containing pod template. If provided pod will be scheduled with kubectl apply "
"and overrides are ignored, otherwise it will be scheduled with kubectl run"
)
parser.add_argument(
"--ssh-server-port", type=int, default=0,
help="If non-zero, every pod will also start an SSH server on the selected port (default: zero, not active)"
)
parser.add_argument(
"--namespace", type=str,
help="Specify the namespace in which pods will be created (default: %(default)s)", default="clearml"
)
group.add_argument(
"--max-pods", type=int,
help="Limit the maximum number of pods that this service can run at the same time."
"Should not be used with ports-mode"
)
parser.add_argument(
"--use-owner-token", action="store_true", default=False,
help="Generate and use task owner token for the execution of each task"
)
return parser.parse_args()
def main():
args = parse_args()
user_props_cb = None
if args.ports_mode and args.base_port:
def k8s_user_props_cb(pod_number=0):
user_prop = {"k8s-pod-port": args.base_port + pod_number}
if args.gateway_address:
user_prop["k8s-gateway-address"] = args.gateway_address
return user_prop
user_props_cb = k8s_user_props_cb
k8s = K8sIntegration(
ports_mode=args.ports_mode, num_of_services=args.num_of_services, base_pod_num=args.base_pod_num,
user_props_cb=user_props_cb, overrides_yaml=args.overrides_yaml, clearml_conf_file=args.pod_clearml_conf,
template_yaml=args.template_yaml, extra_bash_init_script=K8sIntegration.get_ssh_server_bash(
ssh_port_number=args.ssh_server_port) if args.ssh_server_port else None,
namespace=args.namespace, max_pods_limit=args.max_pods or None,
)
k8s.k8s_daemon(args.queue, use_owner_token=args.use_owner_token)
if __name__ == "__main__":
main()

View File

@@ -1,4 +1,4 @@
FROM ubuntu:18.04
FROM ubuntu:22.04
USER root
WORKDIR /root

View File

@@ -33,4 +33,10 @@ if [ -z "$CLEARML_AGENT_NO_UPDATE" ]; then
fi
fi
clearml-agent daemon $DAEMON_OPTIONS --queue $QUEUES --docker "${CLEARML_AGENT_DEFAULT_BASE_DOCKER:-$TRAINS_AGENT_DEFAULT_BASE_DOCKER}" --cpu-only ${CLEARML_AGENT_EXTRA_ARGS:-$TRAINS_AGENT_EXTRA_ARGS}
DOCKER_ARGS="--docker \"${CLEARML_AGENT_DEFAULT_BASE_DOCKER:-$TRAINS_AGENT_DEFAULT_BASE_DOCKER}\""
if [ -n "$CLEARML_AGENT_NO_DOCKER" ]; then
DOCKER_ARGS=""
fi
clearml-agent daemon $DAEMON_OPTIONS --queue $QUEUES $DOCKER_ARGS --cpu-only ${CLEARML_AGENT_EXTRA_ARGS:-$TRAINS_AGENT_EXTRA_ARGS}

View File

@@ -58,8 +58,8 @@ agent {
# it solves passing user/token to git submodules.
# this is a safer way to ensure multiple users using the same repository will
# not accidentally leak credentials
# Only supported on Linux systems, it will be the default in future releases
# enable_git_ask_pass: false
# Note: this is only supported on Linux systems
# enable_git_ask_pass: true
# in docker mode, if container's entrypoint automatically activated a virtual environment
# use the activated virtual environment and install everything there
@@ -108,10 +108,17 @@ agent {
# pytorch_resolve: "pip"
# additional conda channels to use when installing with conda package manager
conda_channels: ["pytorch", "conda-forge", "defaults", ]
conda_channels: ["pytorch", "conda-forge", "nvidia", "defaults", ]
# conda_full_env_update: false
# notice this will not install any additional packages into the selected environment, should be used in
# conjunction with CLEARML_CONDA_ENV_PACKAGE which points to an existing conda environment directory
# conda_env_as_base_docker: false
# install into base conda environment
# (should only be used if running in docker mode, because it will change the conda base environment)
# use_conda_base_env: false
# set the priority packages to be installed before the rest of the required packages
# Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
# priority_packages: ["cython", "numpy", "setuptools", ]
@@ -154,6 +161,9 @@ agent {
vcs_cache: {
enabled: true,
path: ~/.clearml/vcs-cache
# if git pull failed, always revert to re-cloning the repo, it protects against old user name changes
# clone_on_pull_fail: false
},
# DEPRECATED: please use `venvs_cache` and set `venvs_cache.path`
@@ -189,6 +199,13 @@ agent {
# You can also pass host environments into the container with ["-e", "HOST_NAME=$HOST_NAME"]
# extra_docker_arguments: ["--ipc=host", "-v", "/mnt/host/data:/mnt/data"]
# Allow the extra docker arg to override task level docker arg (if the same argument is passed on both),
# if set to False, a task docker arg will override the docker extra arg
# docker_args_extra_precedes_task: true
# prevent a task docker args to be used if already specified in the extra_docker_arguments
# protected_docker_extra_args: ["privileged", "security-opt", "network", "ipc"]
# optional shell script to run in docker when started before the experiment is started
# extra_docker_shell_script: ["apt-get install -y bindfs", ]
@@ -287,9 +304,11 @@ agent {
# sdk_cache: "/clearml_agent_cache"
# apt_cache: "/var/cache/apt/archives"
# ssh_folder: "/root/.ssh"
# ssh_ro_folder: "/.ssh"
# pip_cache: "/root/.cache/pip"
# poetry_cache: "/root/.cache/pypoetry"
# vcs_cache: "/root/.clearml/vcs-cache"
# venvs_cache: "/root/.clearml/venvs-cache"
# venv_build: "~/.clearml/venvs-builds"
# pip_download: "/root/.clearml/pip-download-cache"
# }
@@ -444,7 +463,7 @@ sdk {
vcs_repo_detect_async: True
# Store uncommitted git/hg source code diff in experiment manifest when training in development mode
# This stores "git diff" or "hg diff" into the experiment's "script.requirements.diff" section
# This stores "git diff" or into the experiment's "script.requirements.diff" section
store_uncommitted_code_diff_on_train: True
# Support stopping an experiment in case it was externally stopped, status was changed or task was reset

View File

@@ -5,27 +5,30 @@
"metadata": {},
"source": [
"# Auto-Magically Spin AWS EC2 Instances On Demand \n",
"# and Create a Dynamic Cluster Running *Trains-Agent*\n",
"# and Create a Dynamic Cluster Running *ClearML-Agent*\n",
"\n",
"### Define your budget and execute the notebook, that's it\n",
"### You now have a fully managed cluster on AWS 🎉 🎊 "
"## Define your budget and execute the notebook, that's it\n",
"## You now have a fully managed cluster on AWS 🎉 🎊"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**trains-agent**'s main goal is to quickly pull a job from an execution queue, setup the environment (as defined in the experiment, including git cloning, python packages etc.) then execute the experiment and monitor it.\n",
"**clearml-agent**'s main goal is to quickly pull a job from an execution queue, set up the environment (as defined in the experiment, including git cloning, python packages etc.), then execute the experiment and monitor it.\n",
"\n",
"This notebook defines a cloud budget (currently only AWS is supported, but feel free to expand with PRs), and spins an instance the minute a job is waiting for execution. It will also spin down idle machines, saving you some $$$ :)\n",
"\n",
"Configuration steps\n",
"> **Note:**\n",
"> This is just an example of how you can use ClearML Agent to implement custom autoscaling. For a more structured autoscaler script, see [here](https://github.com/allegroai/clearml/blob/master/clearml/automation/auto_scaler.py).\n",
"\n",
"Configuration steps:\n",
"- Define maximum budget to be used (instance type / number of instances).\n",
"- Create new execution *queues* in the **trains-server**.\n",
"- Define mapping between the created the *queues* and an instance budget.\n",
"- Create new execution *queues* in the **clearml-server**.\n",
"- Define mapping between the created *queues* and an instance budget.\n",
"\n",
"**TL;DR - This notebook:**\n",
"- Will spin instances if there are jobs in the execution *queues*, until it will hit the budget limit. \n",
"- Will spin instances if there are jobs in the execution *queues* until it will hit the budget limit.\n",
"- If machines are idle, it will spin them down.\n",
"\n",
"The controller implementation itself is stateless, meaning you can always re-execute the notebook, if for some reason it stopped.\n",
@@ -39,7 +42,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Install & import required packages"
"### Install & import required packages"
]
},
{
@@ -48,7 +51,7 @@
"metadata": {},
"outputs": [],
"source": [
"!pip install trains-agent\n",
"!pip install clearml-agent\n",
"!pip install boto3"
]
},
@@ -56,7 +59,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Define AWS instance types and configuration (Instance Type, EBS, AMI etc.)"
"### Define AWS instance types and configuration (Instance Type, EBS, AMI etc.)"
]
},
{
@@ -92,17 +95,17 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Define machine budget per execution queue\n",
"### Define machine budget per execution queue\n",
"\n",
"Now that we defined our budget, we need to connect it with the **Trains** cluster.\n",
"Now that we defined our budget, we need to connect it with the **ClearML** cluster.\n",
"\n",
"We map each queue to a resource type (instance type).\n",
"\n",
"Create two queues in the WebUI:\n",
"- Browse to http://your_trains_server_ip:8080/workers-and-queues/queues\n",
"Create two queues in the Web UI:\n",
"- Browse to http://your_clearml_server_ip:8080/workers-and-queues/queues\n",
"- Then click on the \"New Queue\" button and name your queues \"aws_normal\" and \"aws_high\" respectively\n",
"\n",
"The QUEUES dictionary hold the mapping between the queue name and the type/number of instances to spin connected to the specific queue.\n",
"The QUEUES dictionary holds the mapping between the queue name and the type/number of instances to spin connected to the specific queue.\n",
"```\n",
"QUEUES = {\n",
" 'queue_name': [(\"instance-type-as-defined-in-RESOURCE_CONFIGURATIONS\", max_number_of_instances), ]\n",
@@ -116,7 +119,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Trains-Agent Queues - Machines budget per Queue\n",
"# ClearML Agent Queues - Machines budget per Queue\n",
"# Per queue: list of (machine type as defined in RESOURCE_CONFIGURATIONS,\n",
"# max instances for the specific queue). Order machines from most preferred to least.\n",
"QUEUES = {\n",
@@ -129,7 +132,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Credentials for your AWS account, as well as for your **Trains-Server**"
"### Credentials for your AWS account, as well as for your **ClearML Server**"
]
},
{
@@ -143,24 +146,25 @@
"CLOUD_CREDENTIALS_SECRET = \"\"\n",
"CLOUD_CREDENTIALS_REGION = \"us-east-1\"\n",
"\n",
"# TRAINS configuration\n",
"TRAINS_SERVER_WEB_SERVER = \"http://localhost:8080\"\n",
"TRAINS_SERVER_API_SERVER = \"http://localhost:8008\"\n",
"TRAINS_SERVER_FILES_SERVER = \"http://localhost:8081\"\n",
"# TRAINS credentials\n",
"TRAINS_ACCESS_KEY = \"\"\n",
"TRAINS_SECRET_KEY = \"\"\n",
"# Git User/Pass to be used by trains-agent,\n",
"# CLEARML configuration\n",
"CLEARML_WEB_SERVER = \"http://localhost:8080\"\n",
"CLEARML_API_SERVER = \"http://localhost:8008\"\n",
"CLEARML_FILES_SERVER = \"http://localhost:8081\"\n",
"# CLEARML credentials\n",
"CLEARML_API_ACCESS_KEY = \"\"\n",
"CLEARML_API_SECRET_KEY = \"\"\n",
"# Git User/Pass to be used by clearml-agent,\n",
"# leave empty if image already contains git ssh-key\n",
"TRAINS_GIT_USER = \"\"\n",
"TRAINS_GIT_PASS = \"\"\n",
"CLEARML_AGENT_GIT_USER = \"\"\n",
"CLEARML_AGENT_GIT_PASS = \"\"\n",
"\n",
"# Additional fields for trains.conf file created on the remote instance\n",
"# Additional fields for clearml.conf file created on the remote instance\n",
"# for example: 'agent.default_docker.image: \"nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04\"'\n",
"EXTRA_TRAINS_CONF = \"\"\"\n",
"\n",
"EXTRA_CLEARML_CONF = \"\"\"\n",
"\"\"\"\n",
"\n",
"# Bash script to run on instances before running trains-agent\n",
"# Bash script to run on instances before running clearml-agent\n",
"# Example: \"\"\"\n",
"# echo \"This is the first line\"\n",
"# echo \"This is the second line\"\n",
@@ -168,9 +172,9 @@
"EXTRA_BASH_SCRIPT = \"\"\"\n",
"\"\"\"\n",
"\n",
"# Default docker for trains-agent when running in docker mode (requires docker v19.03 and above). \n",
"# Leave empty to run trains-agent in non-docker mode.\n",
"DEFAULT_DOCKER_IMAGE = \"nvidia/cuda\""
"# Default docker for clearml-agent when running in docker mode (requires docker v19.03 and above).\n",
"# Leave empty to run clearml-agent in non-docker mode.\n",
"CLEARML_AGENT_DOCKER_IMAGE = \"nvidia/cuda\""
]
},
{
@@ -192,7 +196,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Import Packages and Budget Definition Sanity Check"
"### Import Packages and Budget Definition Sanity Check"
]
},
{
@@ -209,7 +213,7 @@
"from time import sleep, time\n",
"\n",
"import boto3\n",
"from trains_agent.backend_api.session.client import APIClient"
"from clearml_agent.backend_api.session.client import APIClient"
]
},
{
@@ -227,36 +231,36 @@
" \"A resource name can only appear in a single queue definition.\"\n",
" )\n",
"\n",
"# Encode EXTRA_TRAINS_CONF for later bash script usage\n",
"EXTRA_TRAINS_CONF_ENCODED = \"\\\\\\\"\".join(EXTRA_TRAINS_CONF.split(\"\\\"\"))"
"# Encode EXTRA_CLEARML_CONF for later bash script usage\n",
"EXTRA_CLEARML_CONF_ENCODED = \"\\\\\\\"\".join(EXTRA_CLEARML_CONF.split(\"\\\"\"))"
]
},
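As a quick aside, the encoding above simply escapes every double quote in EXTRA_CLEARML_CONF (each `"` becomes `\"`) so the configuration text can be embedded safely inside the bash user_data script further down. A minimal sketch of the same transformation, using the example value mentioned earlier:

```
EXTRA_CLEARML_CONF = 'agent.default_docker.image: "nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04"'
# Same split/join trick as in the cell above: replace " with \" for bash embedding
EXTRA_CLEARML_CONF_ENCODED = '\\"'.join(EXTRA_CLEARML_CONF.split('"'))
print(EXTRA_CLEARML_CONF_ENCODED)
# agent.default_docker.image: \"nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04\"
```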
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Cloud specific implementation of spin up/down - currently supports AWS only"
"### Cloud specific implementation of spin up/down - currently supports AWS only"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Cloud-specific implementation (currently, only AWS EC2 is supported)\n",
"def spin_up_worker(resource, worker_id_prefix, queue_name):\n",
" \"\"\"\n",
" Creates a new worker for trains.\n",
" Creates a new worker for clearml.\n",
" First, create an instance in the cloud and install some required packages.\n",
" Then, define trains-agent environment variables and run \n",
" trains-agent for the specified queue.\n",
" Then, define clearml-agent environment variables and run\n",
" clearml-agent for the specified queue.\n",
" NOTE: - Will wait until instance is running\n",
" - This implementation assumes the instance image already has docker installed\n",
"\n",
" :param str resource: resource name, as defined in BUDGET and QUEUES.\n",
" :param str worker_id_prefix: worker name prefix\n",
" :param str queue_name: trains queue to listen to\n",
" :param str queue_name: clearml queue to listen to\n",
" \"\"\"\n",
" resource_conf = RESOURCE_CONFIGURATIONS[resource]\n",
" # Add worker type and AWS instance type to the worker name.\n",
@@ -267,8 +271,8 @@
" )\n",
"\n",
" # user_data script will automatically run when the instance is started. \n",
" # It will install the required packages for trains-agent configure it using \n",
" # environment variables and run trains-agent on the required queue\n",
" # It will install the required packages for clearml-agent configure it using\n",
" # environment variables and run clearml-agent on the required queue\n",
" user_data = \"\"\"#!/bin/bash\n",
" sudo apt-get update\n",
" sudo apt-get install -y python3-dev\n",
@@ -278,36 +282,36 @@
" sudo apt-get install -y build-essential\n",
" python3 -m pip install -U pip\n",
" python3 -m pip install virtualenv\n",
" python3 -m virtualenv trains_agent_venv\n",
" source trains_agent_venv/bin/activate\n",
" python -m pip install trains-agent\n",
" echo 'agent.git_user=\\\"{git_user}\\\"' >> /root/trains.conf\n",
" echo 'agent.git_pass=\\\"{git_pass}\\\"' >> /root/trains.conf\n",
" echo \"{trains_conf}\" >> /root/trains.conf\n",
" export TRAINS_API_HOST={api_server}\n",
" export TRAINS_WEB_HOST={web_server}\n",
" export TRAINS_FILES_HOST={files_server}\n",
" python3 -m virtualenv clearml_agent_venv\n",
" source clearml_agent_venv/bin/activate\n",
" python -m pip install clearml-agent\n",
" echo 'agent.git_user=\\\"{git_user}\\\"' >> /root/clearml.conf\n",
" echo 'agent.git_pass=\\\"{git_pass}\\\"' >> /root/clearml.conf\n",
" echo \"{clearml_conf}\" >> /root/clearml.conf\n",
" export CLEARML_API_HOST={api_server}\n",
" export CLEARML_WEB_HOST={web_server}\n",
" export CLEARML_FILES_HOST={files_server}\n",
" export DYNAMIC_INSTANCE_ID=`curl http://169.254.169.254/latest/meta-data/instance-id`\n",
" export TRAINS_WORKER_ID={worker_id}:$DYNAMIC_INSTANCE_ID\n",
" export TRAINS_API_ACCESS_KEY='{access_key}'\n",
" export TRAINS_API_SECRET_KEY='{secret_key}'\n",
" export CLEARML_WORKER_ID={worker_id}:$DYNAMIC_INSTANCE_ID\n",
" export CLEARML_API_ACCESS_KEY='{access_key}'\n",
" export CLEARML_API_SECRET_KEY='{secret_key}'\n",
" {bash_script}\n",
" source ~/.bashrc\n",
" python -m trains_agent --config-file '/root/trains.conf' daemon --queue '{queue}' {docker}\n",
" python -m clearml_agent --config-file '/root/clearml.conf' daemon --queue '{queue}' {docker}\n",
" shutdown\n",
" \"\"\".format(\n",
" api_server=TRAINS_SERVER_API_SERVER,\n",
" web_server=TRAINS_SERVER_WEB_SERVER,\n",
" files_server=TRAINS_SERVER_FILES_SERVER,\n",
" api_server=CLEARML_API_SERVER,\n",
" web_server=CLEARML_WEB_SERVER,\n",
" files_server=CLEARML_FILES_SERVER,\n",
" worker_id=worker_id,\n",
" access_key=TRAINS_ACCESS_KEY,\n",
" secret_key=TRAINS_SECRET_KEY,\n",
" access_key=CLEARML_API_ACCESS_KEY,\n",
" secret_key=CLEARML_API_SECRET_KEY,\n",
" queue=queue_name,\n",
" git_user=TRAINS_GIT_USER,\n",
" git_pass=TRAINS_GIT_PASS,\n",
" trains_conf=EXTRA_TRAINS_CONF_ENCODED,\n",
" git_user=CLEARML_AGENT_GIT_USER,\n",
" git_pass=CLEARML_AGENT_GIT_PASS,\n",
" clearml_conf=EXTRA_CLEARML_CONF_ENCODED,\n",
" bash_script=EXTRA_BASH_SCRIPT,\n",
" docker=\"--docker '{}'\".format(DEFAULT_DOCKER_IMAGE) if DEFAULT_DOCKER_IMAGE else \"\"\n",
" docker=\"--docker '{}'\".format(CLEARML_AGENT_DOCKER_IMAGE) if CLEARML_AGENT_DOCKER_IMAGE else \"\"\n",
" )\n",
"\n",
" ec2 = boto3.client(\n",
@@ -405,7 +409,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"###### Controller Implementation and Logic"
"#### Controller Implementation and Logic"
]
},
{
@@ -430,18 +434,18 @@
"\n",
" # Internal definitions\n",
" workers_prefix = \"dynamic_aws\"\n",
" # Worker's id in trains would be composed from:\n",
" # Worker's id in clearml would be composed from:\n",
" # prefix, name, instance_type and cloud_id separated by ';'\n",
" workers_pattern = re.compile(\n",
" r\"^(?P<prefix>[^:]+):(?P<name>[^:]+):(?P<instance_type>[^:]+):(?P<cloud_id>[^:]+)\"\n",
" )\n",
"\n",
" # Set up the environment variables for trains\n",
" os.environ[\"TRAINS_API_HOST\"] = TRAINS_SERVER_API_SERVER\n",
" os.environ[\"TRAINS_WEB_HOST\"] = TRAINS_SERVER_WEB_SERVER\n",
" os.environ[\"TRAINS_FILES_HOST\"] = TRAINS_SERVER_FILES_SERVER\n",
" os.environ[\"TRAINS_API_ACCESS_KEY\"] = TRAINS_ACCESS_KEY\n",
" os.environ[\"TRAINS_API_SECRET_KEY\"] = TRAINS_SECRET_KEY\n",
" # Set up the environment variables for clearml\n",
" os.environ[\"CLEARML_API_HOST\"] = CLEARML_API_SERVER\n",
" os.environ[\"CLEARML_WEB_HOST\"] = CLEARML_WEB_SERVER\n",
" os.environ[\"CLEARML_FILES_HOST\"] = CLEARML_FILES_SERVER\n",
" os.environ[\"CLEARML_API_ACCESS_KEY\"] = CLEARM_API_ACCESS_KEY\n",
" os.environ[\"CLEARML_API_SECRET_KEY\"] = CLEARML_API_SECRET_KEY\n",
" api_client = APIClient()\n",
"\n",
" # Verify the requested queues exist and create those that doesn't exist\n",
@@ -520,7 +524,7 @@
" # skip resource types that might be needed\n",
" if resources in required_idle_resources:\n",
" continue\n",
" # Remove from both aws and trains all instances that are \n",
" # Remove from both aws and clearml all instances that are\n",
" # idle for longer than MAX_IDLE_TIME_MIN\n",
" if time() - timestamp > MAX_IDLE_TIME_MIN * 60.0:\n",
" cloud_id = workers_pattern.match(worker.id)[\"cloud_id\"]\n",
@@ -535,7 +539,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Execute Forever* (the controller is stateless, so you can always re-execute the notebook)"
"### Execute Forever* (the controller is stateless, so you can always re-execute the notebook)"
]
},
{

View File

@@ -13,61 +13,86 @@ def parse_args():
group = parser.add_mutually_exclusive_group()
parser.add_argument(
"--queue", type=str, help="Queue to pull tasks from"
"--queue",
type=str,
help="Queues to pull tasks from. If multiple queues, use comma separated list, e.g. 'queue1,queue2'",
)
group.add_argument(
"--ports-mode", action='store_true', default=False,
"--ports-mode",
action="store_true",
default=False,
help="Ports-Mode will add a label to the pod which can be used as service, in order to expose ports"
"Should not be used with max-pods"
)
parser.add_argument(
"--num-of-services", type=int, default=20,
help="Specify the number of k8s services to be used. Use only with ports-mode."
"--num-of-services",
type=int,
default=20,
help="Specify the number of k8s services to be used. Use only with ports-mode.",
)
parser.add_argument(
"--base-port", type=int,
"--base-port",
type=int,
help="Used in conjunction with ports-mode, specifies the base port exposed by the services. "
"For pod #X, the port will be <base-port>+X. Note that pod number is calculated based on base-pod-num"
"e.g. if base-port=20000 and base-pod-num=3, the port for the first pod will be 20003"
)
parser.add_argument(
"--base-pod-num", type=int, default=1,
"--base-pod-num",
type=int,
default=1,
help="Used in conjunction with ports-mode and base-port, specifies the base pod number to be used by the "
"service (default: %(default)s)"
)
parser.add_argument(
"--gateway-address", type=str, default=None,
help="Used in conjunction with ports-mode, specify the external address of the k8s ingress / ELB"
"--gateway-address",
type=str,
default=None,
help="Used in conjunction with ports-mode, specify the external address of the k8s ingress / ELB",
)
parser.add_argument(
"--pod-clearml-conf", type=str,
help="Configuration file to be used by the pod itself (if not provided, current configuration is used)"
"--pod-clearml-conf",
type=str,
help="Configuration file to be used by the pod itself (if not provided, current configuration is used)",
)
parser.add_argument(
"--overrides-yaml", type=str,
help="YAML file containing pod overrides to be used when launching a new pod"
"--overrides-yaml", type=str, help="YAML file containing pod overrides to be used when launching a new pod"
)
parser.add_argument(
"--template-yaml", type=str,
"--template-yaml",
type=str,
help="YAML file containing pod template. If provided pod will be scheduled with kubectl apply "
"and overrides are ignored, otherwise it will be scheduled with kubectl run"
)
parser.add_argument(
"--ssh-server-port", type=int, default=0,
help="If non-zero, every pod will also start an SSH server on the selected port (default: zero, not active)"
"--ssh-server-port",
type=int,
default=0,
help="If non-zero, every pod will also start an SSH server on the selected port (default: zero, not active)",
)
parser.add_argument(
"--namespace", type=str,
help="Specify the namespace in which pods will be created (default: %(default)s)", default="clearml"
"--namespace",
type=str,
help="Specify the namespace in which pods will be created (default: %(default)s)",
default="clearml",
)
group.add_argument(
"--max-pods", type=int,
"--max-pods",
type=int,
help="Limit the maximum number of pods that this service can run at the same time."
"Should not be used with ports-mode"
)
parser.add_argument(
"--use-owner-token", action="store_true", default=False,
help="Generate and use task owner token for the execution of each task"
"--use-owner-token",
action="store_true",
default=False,
help="Generate and use task owner token for the execution of each task",
)
parser.add_argument(
"--create-queue",
action="store_true",
default=False,
help="Create the queue if it does not exist (default: %(default)s)",
)
return parser.parse_args()
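A small sketch of the port assignment described in the --base-port/--base-pod-num help text above (the helper function name is illustrative and not part of this script):

```
def pod_port(base_port: int, base_pod_num: int, pod_index: int) -> int:
    # Pod numbers start at base_pod_num; each pod exposes base_port + its pod number
    return base_port + base_pod_num + pod_index

# Matches the example in the help text: base-port=20000, base-pod-num=3 -> first pod gets 20003
assert pod_port(20000, 3, 0) == 20003
```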
@@ -77,21 +102,32 @@ def main():
user_props_cb = None
if args.ports_mode and args.base_port:
def k8s_user_props_cb(pod_number=0):
user_prop = {"k8s-pod-port": args.base_port + pod_number}
if args.gateway_address:
user_prop["k8s-gateway-address"] = args.gateway_address
return user_prop
user_props_cb = k8s_user_props_cb
k8s = K8sIntegration(
ports_mode=args.ports_mode, num_of_services=args.num_of_services, base_pod_num=args.base_pod_num,
user_props_cb=user_props_cb, overrides_yaml=args.overrides_yaml, clearml_conf_file=args.pod_clearml_conf,
template_yaml=args.template_yaml, extra_bash_init_script=K8sIntegration.get_ssh_server_bash(
ssh_port_number=args.ssh_server_port) if args.ssh_server_port else None,
namespace=args.namespace, max_pods_limit=args.max_pods or None,
ports_mode=args.ports_mode,
num_of_services=args.num_of_services,
base_pod_num=args.base_pod_num,
user_props_cb=user_props_cb,
overrides_yaml=args.overrides_yaml,
clearml_conf_file=args.pod_clearml_conf,
template_yaml=args.template_yaml,
extra_bash_init_script=K8sIntegration.get_ssh_server_bash(ssh_port_number=args.ssh_server_port)
if args.ssh_server_port
else None,
namespace=args.namespace,
max_pods_limit=args.max_pods or None,
)
k8s.k8s_daemon(args.queue, use_owner_token=args.use_owner_token)
queue = [q.strip() for q in args.queue.split(",") if q.strip()] if args.queue else None
k8s.k8s_daemon(queue, use_owner_token=args.use_owner_token, create_queue=args.create_queue)
if __name__ == "__main__":
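For reference, a minimal illustration of how the comma-separated --queue value is turned into a list in main() above (the queue names here are hypothetical):

```
# Hypothetical value passed as --queue
raw_queue_arg = "gpu_queue, cpu_queue, ,services"
queue = [q.strip() for q in raw_queue_arg.split(",") if q.strip()] if raw_queue_arg else None
print(queue)  # ['gpu_queue', 'cpu_queue', 'services']
```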

View File

@@ -1,15 +1,15 @@
attrs>=18.0,<23.0.0
attrs>=18.0,<24.0.0
enum34>=0.9,<1.2.0 ; python_version < '3.6'
furl>=2.0.0,<2.2.0
jsonschema>=2.6.0,<5.0.0
pathlib2>=2.3.0,<2.4.0
psutil>=3.4.2,<5.10.0
pyparsing>=2.0.3,<3.1.0
pyparsing>=2.0.3,<3.2.0
python-dateutil>=2.4.2,<2.9.0
pyjwt>=2.4.0,<2.7.0
pyjwt>=2.4.0,<2.9.0
PyYAML>=3.12,<6.1
requests>=2.20.0,<=2.31.0
six>=1.13.0,<1.17.0
typing>=3.6.4,<3.8.0 ; python_version < '3.5'
urllib3>=1.21.1,<1.27.0
urllib3>=1.21.1,<2
virtualenv>=16,<21

View File

@@ -63,6 +63,7 @@ setup(
'Programming Language :: Python :: 3.9',
'Programming Language :: Python :: 3.10',
'Programming Language :: Python :: 3.11',
'Programming Language :: Python :: 3.12',
'License :: OSI Approved :: Apache Software License',
],