Fix auto mount SSH_AUTH_SOCK into docker (issue #45)

Update agent gif (#69)
Fix services mode killing child processes when running in services mode + venv
2025-06-26 18:16:15 +00:00 · 2021-07-11 09:44:49 +03:00 · 2021-07-08 09:20:45 +03:00 · 2021-06-30 23:58:25 +03:00 · 2021-06-29 11:04:52 +03:00 · 2021-06-29 07:59:02 +03:00
28 changed files with 908 additions and 411 deletions
--- a/README.md
+++ b/README.md
@@ -5,7 +5,7 @@
 **ClearML Agent - ML-Ops made easy  
 ML-Ops scheduler & orchestration solution supporting Linux, macOS and Windows**

-[![GitHub license](https://img.shields.io/github/license/allegroai/trains-agent.svg)](https://img.shields.io/github/license/allegroai/trains-agent.svg)
+[![GitHub license](https://img.shields.io/github/license/allegroai/clearml-agent.svg)](https://img.shields.io/github/license/allegroai/clearml-agent.svg)
 [![PyPI pyversions](https://img.shields.io/pypi/pyversions/clearml-agent.svg)](https://img.shields.io/pypi/pyversions/clearml-agent.svg)
 [![PyPI version shields.io](https://img.shields.io/pypi/v/clearml-agent.svg)](https://img.shields.io/pypi/v/clearml-agent.svg)

@@ -21,23 +21,23 @@ ML-Ops scheduler & orchestration solution supporting Linux, macOS and Windows**
 * Implement optimized resource utilization policies
 * Deploy execution environments with either virtualenv or fully docker containerized with zero effort
 * Launch-and-Forget service containers
-* [Cloud autoscaling](https://allegro.ai/clearml/docs/docs/examples/services/aws_autoscaler/aws_autoscaler.html)
-* [Customizable cleanup](https://allegro.ai/clearml/docs/docs/examples/services/cleanup/cleanup_service.html)
-* Advanced [pipeline building and execution](https://allegro.ai/clearml/docs/docs/examples/frameworks/pytorch/notebooks/table/tabular_training_pipeline.html)
+* [Cloud autoscaling](https://clear.ml/docs/latest/docs/guides/services/aws_autoscaler)
+* [Customizable cleanup](https://clear.ml/docs/latest/docs/guides/services/cleanup_service)
+* Advanced [pipeline building and execution](https://clear.ml/docs/latest/docs/guides/frameworks/pytorch/notebooks/table/tabular_training_pipeline)

 It is a zero configuration fire-and-forget execution agent, providing a full ML/DL cluster solution.

 **Full Automation in 5 steps**
-1. ClearML Server [self-hosted](https://github.com/allegroai/trains-server) or [free tier hosting](https://app.community.clear.ml)
+1. ClearML Server [self-hosted](https://github.com/allegroai/clearml-server) or [free tier hosting](https://app.community.clear.ml)
 2. `pip install clearml-agent` ([install](#installing-the-clearml-agent) the ClearML Agent on any GPU machine: on-premises / cloud / ...)
-3. Create a [job](https://github.com/allegroai/clearml/docs/clearml-task.md) or Add [ClearML](https://github.com/allegroai/trains) to your code with just 2 lines
+3. Create a [job](https://github.com/allegroai/clearml/docs/clearml-task.md) or Add [ClearML](https://github.com/allegroai/clearml) to your code with just 2 lines
 4. Change the [parameters](#using-the-clearml-agent) in the UI & schedule for [execution](#using-the-clearml-agent) (or automate with an [AutoML pipeline](#automl-and-orchestration-pipelines-))
 5. :chart_with_downwards_trend: :chart_with_upwards_trend: :eyes:  :beer:

 "All the Deep/Machine-Learning DevOps your research needs, and then some... Because ain't nobody got time for that"

-**Try ClearML now** [Self Hosted](https://github.com/allegroai/trains-server) or [Free tier Hosting](https://app.community.clear.ml)
-<a href="https://app.community.clear.ml"><img src="https://raw.githubusercontent.com/allegroai/trains-agent/9f1e86c1ca45c984ee13edc9353c7b10c55d7257/docs/screenshots.gif" width="100%"></a>
+**Try ClearML now** [Self Hosted](https://github.com/allegroai/clearml-server) or [Free tier Hosting](https://app.community.clear.ml)
+<a href="https://app.community.clear.ml"><img src="https://github.com/allegroai/clearml-agent/blob/master/docs/screenshots.gif?raw=true" width="100%"></a>

 ### Simple, Flexible Experiment Orchestration
 **The ClearML Agent was built to address the DL/ML R&D DevOps needs:**
@@ -68,13 +68,13 @@ We designed `clearml-agent` so you can run bare-metal or inside a pod with any m

 **Two K8s integration flavours** 
 - Spin ClearML-Agent as a long-lasting service pod
-    - use [clearml-agent](https://hub.docker.com/r/allegroai/trains-agent) docker image
+    - use [clearml-agent](https://hub.docker.com/r/allegroai/clearml-agent) docker image
    - map docker socket into the pod (soon replaced by [podman](https://github.com/containers/podman))
    - allow the clearml-agent to manage sibling dockers
    - benefits: full use of the ClearML scheduling, no need to worry about wrong container images / lost pods etc.
    - downside: Sibling containers
 - Kubernetes Glue, map ClearML jobs directly to K8s jobs
-    - Run the [clearml-k8s glue](https://github.com/allegroai/trains-agent/blob/master/examples/k8s_glue_example.py) on a K8s cpu node
+    - Run the [clearml-k8s glue](https://github.com/allegroai/clearml-agent/blob/master/examples/k8s_glue_example.py) on a K8s cpu node
    - The clearml-k8s glue pulls jobs from the ClearML job execution queue and prepares a K8s job (based on provided yaml template)
    - Inside the pod itself the clearml-agent will install the job (experiment) environment and spin and monitor the experiment's process
    - benefits: Kubernetes full view of all running jobs in the system
@@ -122,7 +122,7 @@ The ClearML Agent executes experiments using the following process:

 #### System Design & Flow

-<img src="https://allegro.ai/clearml/docs/_images/ClearML_Architecture.png" width="100%" alt="clearml-architecture">
+<img src="https://github.com/allegroai/clearml-agent/blob/master/docs/clearml_architecture.png" width="100%" alt="clearml-architecture">


 #### Installing the ClearML Agent
@@ -196,16 +196,16 @@ Notice: with `--detached` flag, the *clearml-agent* will be running in the backg
 clearml-agent daemon --detached --queue default --docker
 ```

-Example: spin two agents, one per gpu on the same machine, with default nvidia/cuda docker:
+Example: spin two agents, one per gpu on the same machine, with default nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04 docker:
 ```bash
-clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda
-clearml-agent daemon --detached --gpus 1 --queue default --docker nvidia/cuda
+clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
+clearml-agent daemon --detached --gpus 1 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
 ```

-Example: spin two agents, pulling from dedicated `dual_gpu` queue, two gpu's per agent, with default nvidia/cuda docker:
+Example: spin two agents, pulling from dedicated `dual_gpu` queue, two GPUs per agent, with default nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04 docker:
 ```bash
-clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu --docker nvidia/cuda
-clearml-agent daemon --detached --gpus 2,3 --queue dual_gpu --docker nvidia/cuda
+clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
+clearml-agent daemon --detached --gpus 2,3 --queue dual_gpu --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
 ```

 ##### Starting the ClearML Agent - Priority Queues
@@ -225,11 +225,11 @@ Adding queues, managing job order within a queue and moving jobs between queues,
 To stop a **ClearML Agent** running in the background, run the same command line used to start the agent with `--stop` appended.
 For example, to stop the first of the above shown same machine, single gpu agents:
 ```bash
-clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda --stop
+clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04 --stop
 ```

 ### How do I create an experiment on the ClearML Server? <a name="from-scratch"></a>
-* Integrate [ClearML](https://github.com/allegroai/trains) with your code
+* Integrate [ClearML](https://github.com/allegroai/clearml) with your code
 * Execute the code on your machine (Manually / PyCharm / Jupyter Notebook)
 * As your code is running, **ClearML** creates an experiment logging all the necessary execution information:
  - Git repository link and commit ID (or an entire jupyter notebook)
@@ -273,18 +273,18 @@ clearml-agent daemon --services-mode --detached --queue services --create-queue
 ### AutoML and Orchestration Pipelines <a name="automl-pipes"></a>
 The ClearML Agent can also be used to implement AutoML orchestration and Experiment Pipelines in conjunction with the ClearML package.

-Sample AutoML & Orchestration examples can be found in the ClearML [example/automation](https://github.com/allegroai/trains/tree/master/examples/automation) folder.
+Sample AutoML & Orchestration examples can be found in the ClearML [example/automation](https://github.com/allegroai/clearml/tree/master/examples/automation) folder.

 AutoML examples
-  - [Toy Keras training experiment](https://github.com/allegroai/trains/blob/master/examples/optimization/hyper-parameter-optimization/base_template_keras_simple.py)
+  - [Toy Keras training experiment](https://github.com/allegroai/clearml/blob/master/examples/optimization/hyper-parameter-optimization/base_template_keras_simple.py)
    - In order to create an experiment-template in the system, this code must be executed once manually
-  - [Random Search over the above Keras experiment-template](https://github.com/allegroai/trains/blob/master/examples/automation/manual_random_param_search_example.py)
+  - [Random Search over the above Keras experiment-template](https://github.com/allegroai/clearml/blob/master/examples/automation/manual_random_param_search_example.py)
    - This example will create multiple copies of the Keras experiment-template, with different hyper-parameter combinations

 Experiment Pipeline examples
-  - [First step experiment](https://github.com/allegroai/trains/blob/master/examples/automation/task_piping_example.py)
+  - [First step experiment](https://github.com/allegroai/clearml/blob/master/examples/automation/task_piping_example.py)
    - This example will "process data", and once done, will launch a copy of the 'second step' experiment-template
-  - [Second step experiment](https://github.com/allegroai/trains/blob/master/examples/automation/toy_base_task.py)
+  - [Second step experiment](https://github.com/allegroai/clearml/blob/master/examples/automation/toy_base_task.py)
    - In order to create an experiment-template in the system, this code must be executed once manually

 ### License
--- a/clearml_agent/backend_api/config/default/agent.conf
+++ b/clearml_agent/backend_api/config/default/agent.conf
@@ -26,6 +26,9 @@
    # Example values: "/usr/bin/python3" or "/usr/local/bin/python3.6"
    # The default is the python executing the clearml_agent
    python_binary: ""
+    # ignore any requested python version (Default: False; if a Task requested a specific python version
+    # and the system has multiple python versions installed, the agent will use the requested version)
+    # ignore_requested_python_version: true

    # select python package manager:
    # currently supported pip and conda
@@ -47,7 +50,7 @@
        # extra_index_url: ["https://allegroai.jfrog.io/clearmlai/api/pypi/public/simple"]

        # additional conda channels to use when installing with conda package manager
-        conda_channels: ["defaults", "conda-forge", "pytorch", ]
+        conda_channels: ["pytorch", "conda-forge", "defaults", ]

        # If set to true, Task's "installed packages" are ignored,
        # and the repository's "requirements.txt" is used instead
@@ -121,6 +124,11 @@
    # optional shell script to run in docker when started before the experiment is started
    # extra_docker_shell_script: ["apt-get install -y bindfs", ]

+    # Install the required packages for opencv libraries (libsm6 libxext6 libxrender-dev libglib2.0-0),
+    # for backwards compatibility reasons, true as default,
+    # change to false to skip installation and decrease docker spin up time
+    # docker_install_opencv_libs: true
+
    # optional uptime configuration, make sure to use only one of 'uptime/downtime' and not both.
    # If uptime is specified, agent will actively poll (and execute) tasks in the time-spans defined here.
    # Outside of the specified time-spans, the agent will be idle.
@@ -177,4 +185,16 @@
    # should be detected automatically. Override with os environment CUDA_VERSION / CUDNN_VERSION
    # cuda_version: 10.1
    # cudnn_version: 7.6
+
+    # Hide docker environment variables containing secrets when printing out the docker command by replacing their
+    # values with "********". Turning this feature on will hide the following environment variables values:
+    #   CLEARML_API_SECRET_KEY, CLEARML_AGENT_GIT_PASS, AWS_SECRET_ACCESS_KEY, AZURE_STORAGE_KEY
+    # To include more environment variables, add their keys to the "extra_keys" list. E.g. to make sure the value of
+    # your custom environment variable named MY_SPECIAL_PASSWORD will not show in the logs when included in the
+    # docker command, set:
+    #   extra_keys: ["MY_SPECIAL_PASSWORD"]
+    hide_docker_command_env_vars {
+        enabled: true
+        extra_keys: []
+    }
 }
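Reviewer note: the `hide_docker_command_env_vars` comment above describes replacing secret values with "********" when the docker command is printed. A minimal sketch of that masking, assuming the command is a list of arguments and that secrets are passed as `-e KEY=VALUE` or `--env KEY=VALUE`; `mask_docker_env_secrets` is a hypothetical helper name, not the agent's actual function:

```python
# Keys listed in the agent.conf comment; extra_keys mirrors the "extra_keys" setting.
DEFAULT_SECRET_KEYS = (
    "CLEARML_API_SECRET_KEY",
    "CLEARML_AGENT_GIT_PASS",
    "AWS_SECRET_ACCESS_KEY",
    "AZURE_STORAGE_KEY",
)


def mask_docker_env_secrets(docker_cmd, extra_keys=()):
    """Return a copy of docker_cmd with the values of secret env vars replaced by ********.

    Hypothetical helper: only the two-argument "-e KEY=VALUE" / "--env KEY=VALUE" form is handled.
    """
    secret_keys = set(DEFAULT_SECRET_KEYS) | set(extra_keys)
    masked = list(docker_cmd)  # never modify the command that will actually be executed
    for i in range(1, len(masked)):
        if masked[i - 1] not in ("-e", "--env"):
            continue
        key, sep, _value = masked[i].partition("=")
        if sep and key in secret_keys:
            masked[i] = "{}=********".format(key)
    return masked
```

Only the printed copy is masked; the original list is left untouched so the container still receives the real values.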
--- a/clearml_agent/backend_api/session/session.py
+++ b/clearml_agent/backend_api/session/session.py
@@ -111,7 +111,8 @@ class Session(TokenManager):
        self._logger = logger

        self.__access_key = api_key or ENV_ACCESS_KEY.get(
-            default=(self.config.get("api.credentials.access_key", None) or self.default_key)
+            default=(self.config.get("api.credentials.access_key", None) or self.default_key),
+            value_cb=lambda key, value: print("Using environment access key {}={}".format(key, value))
        )
        if not self.access_key:
            raise ValueError(
@@ -119,7 +120,8 @@ class Session(TokenManager):
            )

        self.__secret_key = secret_key or ENV_SECRET_KEY.get(
-            default=(self.config.get("api.credentials.secret_key", None) or self.default_secret)
+            default=(self.config.get("api.credentials.secret_key", None) or self.default_secret),
+            value_cb=lambda key, value: print("Using environment secret key {}=********".format(key))
        )
        if not self.secret_key:
            raise ValueError(
@@ -155,7 +157,7 @@ class Session(TokenManager):

        # update api version from server response
        try:
-            token_dict = jwt.decode(self.token, verify=False)
+            token_dict = TokenManager.get_decoded_token(self.token, verify=False)
            api_version = token_dict.get('api_version')
            if not api_version:
                api_version = '2.2' if token_dict.get('env', '') == 'prod' else Session.api_version
--- a/clearml_agent/backend_api/session/token_manager.py
+++ b/clearml_agent/backend_api/session/token_manager.py
@@ -3,6 +3,7 @@ from abc import ABCMeta, abstractmethod
 from time import time

 import jwt
+from jwt.algorithms import get_default_algorithms
 import six


@@ -66,10 +67,18 @@ class TokenManager(object):
                pass
        return 0

+    @classmethod
+    def get_decoded_token(cls, token, verify=False):
+        """ Decode the token and return its claims without verifying the signature """
+        return jwt.decode(
+            token, verify=verify,
+            options=dict(verify_signature=False),
+            algorithms=get_default_algorithms())
+
    @classmethod
    def _get_token_exp(cls, token):
        """ Get token expiration time. If not present, assume forever """
-        return jwt.decode(token, verify=False).get('exp', sys.maxsize)
+        return cls.get_decoded_token(token).get('exp', sys.maxsize)

    def _set_token(self, token):
        if token:
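Reviewer note: `get_decoded_token` reads the token claims without verifying the signature. For readers unfamiliar with what that involves, here is a standard-library-only sketch of the same idea. It does not use PyJWT and is not the agent's implementation; `decode_unverified_payload` and `token_expiration` are hypothetical names used for illustration:

```python
import base64
import json
import sys


def decode_unverified_payload(token):
    """Return the claims dict of a JWT without checking its signature."""
    try:
        _header, payload, _signature = token.split(".")
    except ValueError:
        raise ValueError("expected a token with three dot-separated parts")
    # JWT segments are unpadded base64url; restore the padding before decoding
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload.encode("ascii")).decode("utf-8"))


def token_expiration(token):
    """Same lookup as _get_token_exp above: a missing `exp` claim means "never expires"."""
    return decode_unverified_payload(token).get("exp", sys.maxsize)
```

Claims read this way are untrusted input; they are suitable for hints such as `exp` or `api_version`, not for authorization decisions.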
--- a/clearml_agent/backend_api/utils.py
+++ b/clearml_agent/backend_api/utils.py
@@ -6,16 +6,9 @@ import requests
 from requests.adapters import HTTPAdapter
 from urllib3.util import Retry
 from urllib3 import PoolManager
-import six

 from .session.defs import ENV_HOST_VERIFY_CERT

-if six.PY3:
-    from functools import lru_cache
-elif six.PY2:
-    # python 2 support
-    from backports.functools_lru_cache import lru_cache
-

 __disable_certificate_verification_warning = 0

--- a/clearml_agent/backend_config/entry.py
+++ b/clearml_agent/backend_config/entry.py
@@ -64,8 +64,8 @@ class Entry(object):
            converter = self.default_conversions().get(self.type, self.type)
        return converter(value)

-    def get_pair(self, default=NotSet, converter=None):
-        # type: (Any, Converter) -> Optional[Tuple[Text, Any]]
+    def get_pair(self, default=NotSet, converter=None, value_cb=None):
+        # type: (Any, Converter, Callable[[str, Any], None]) -> Optional[Tuple[Text, Any]]
        for key in self.keys:
            value = self._get(key)
            if value is NotSet:
@@ -75,13 +75,20 @@ class Entry(object):
            except Exception as ex:
                self.error("invalid value {key}={value}: {ex}".format(**locals()))
                break
+            # noinspection PyBroadException
+            try:
+                if value_cb:
+                    value_cb(key, value)
+            except Exception:
+                pass
            return key, value
+
        result = self.default if default is NotSet else default
        return self.key, result

-    def get(self, default=NotSet, converter=None):
-        # type: (Any, Converter) -> Optional[Any]
-        return self.get_pair(default=default, converter=converter)[1]
+    def get(self, default=NotSet, converter=None, value_cb=None):
+        # type: (Any, Converter, Callable[[str, Any], None]) -> Optional[Any]
+        return self.get_pair(default=default, converter=converter, value_cb=value_cb)[1]

    def set(self, value):
        # type: (Any, Any) -> (Text, Any)
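Reviewer note: the new `value_cb` argument is called only when a value is actually found under one of the entry's keys, and any exception it raises is swallowed. A simplified sketch of that behaviour, assuming a plain environment lookup; `get_env_pair` is a hypothetical stand-in for `Entry.get_pair`, and type conversion is omitted:

```python
import os


def get_env_pair(keys, default=None, value_cb=None, environ=None):
    """Return (key, value) for the first key that is set, else (keys[0], default)."""
    environ = os.environ if environ is None else environ
    for key in keys:
        if key not in environ:
            continue
        value = environ[key]
        # The callback is for reporting only; a failure inside it must not break the lookup
        try:
            if value_cb:
                value_cb(key, value)
        except Exception:
            pass
        return key, value
    # Falling back to the default does not trigger the callback
    return keys[0], default
```

This is why `session.py` can print "Using environment access key ..." only when the credentials really come from the environment rather than from the configuration file.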
--- a/clearml_agent/commands/worker.py
+++ b/clearml_agent/commands/worker.py
--- a/clearml_agent/definitions.py
+++ b/clearml_agent/definitions.py
@@ -62,6 +62,10 @@ class EnvironmentConfig(object):
        return None


+ENV_AGENT_SECRET_KEY = EnvironmentConfig("CLEARML_API_SECRET_KEY", "TRAINS_API_SECRET_KEY")
+ENV_AWS_SECRET_KEY = EnvironmentConfig("AWS_SECRET_ACCESS_KEY")
+ENV_AZURE_ACCOUNT_KEY = EnvironmentConfig("AZURE_STORAGE_KEY")
+
 ENVIRONMENT_CONFIG = {
    "api.api_server": EnvironmentConfig("CLEARML_API_HOST", "TRAINS_API_HOST", ),
    "api.files_server": EnvironmentConfig("CLEARML_FILES_HOST", "TRAINS_FILES_HOST", ),
@@ -69,9 +73,7 @@ ENVIRONMENT_CONFIG = {
    "api.credentials.access_key": EnvironmentConfig(
        "CLEARML_API_ACCESS_KEY", "TRAINS_API_ACCESS_KEY",
    ),
-    "api.credentials.secret_key": EnvironmentConfig(
-        "CLEARML_API_SECRET_KEY", "TRAINS_API_SECRET_KEY",
-    ),
+    "api.credentials.secret_key": ENV_AGENT_SECRET_KEY,
    "agent.worker_name": EnvironmentConfig("CLEARML_WORKER_NAME", "TRAINS_WORKER_NAME", ),
    "agent.worker_id": EnvironmentConfig("CLEARML_WORKER_ID", "TRAINS_WORKER_ID", ),
    "agent.cuda_version": EnvironmentConfig(
@@ -84,10 +86,10 @@ ENVIRONMENT_CONFIG = {
        names=("CLEARML_CPU_ONLY", "TRAINS_CPU_ONLY", "CPU_ONLY"), type=bool
    ),
    "sdk.aws.s3.key": EnvironmentConfig("AWS_ACCESS_KEY_ID"),
-    "sdk.aws.s3.secret": EnvironmentConfig("AWS_SECRET_ACCESS_KEY"),
+    "sdk.aws.s3.secret": ENV_AWS_SECRET_KEY,
    "sdk.aws.s3.region": EnvironmentConfig("AWS_DEFAULT_REGION"),
    "sdk.azure.storage.containers.0": {'account_name': EnvironmentConfig("AZURE_STORAGE_ACCOUNT"),
-                                       'account_key': EnvironmentConfig("AZURE_STORAGE_KEY")},
+                                       'account_key': ENV_AZURE_ACCOUNT_KEY},
    "sdk.google.storage.credentials_json": EnvironmentConfig("GOOGLE_APPLICATION_CREDENTIALS"),
 }

@@ -132,6 +134,8 @@ ENV_DOCKER_SKIP_GPUS_FLAG = EnvironmentConfig('CLEARML_DOCKER_SKIP_GPUS_FLAG', '
 ENV_AGENT_GIT_USER = EnvironmentConfig('CLEARML_AGENT_GIT_USER', 'TRAINS_AGENT_GIT_USER')
 ENV_AGENT_GIT_PASS = EnvironmentConfig('CLEARML_AGENT_GIT_PASS', 'TRAINS_AGENT_GIT_PASS')
 ENV_AGENT_GIT_HOST = EnvironmentConfig('CLEARML_AGENT_GIT_HOST', 'TRAINS_AGENT_GIT_HOST')
+ENV_AGENT_DISABLE_SSH_MOUNT = EnvironmentConfig('CLEARML_AGENT_DISABLE_SSH_MOUNT', type=bool)
+ENV_SSH_AUTH_SOCK = EnvironmentConfig('SSH_AUTH_SOCK')
 ENV_TASK_EXECUTE_AS_USER = EnvironmentConfig('CLEARML_AGENT_EXEC_USER', 'TRAINS_AGENT_EXEC_USER')
 ENV_TASK_EXTRA_PYTHON_PATH = EnvironmentConfig('CLEARML_AGENT_EXTRA_PYTHON_PATH', 'TRAINS_AGENT_EXTRA_PYTHON_PATH')
 ENV_DOCKER_HOST_MOUNT = EnvironmentConfig('CLEARML_AGENT_K8S_HOST_MOUNT', 'CLEARML_AGENT_DOCKER_HOST_MOUNT',
--- a/clearml_agent/external/requirements_parser/parser.py
+++ b/clearml_agent/external/requirements_parser/parser.py
@@ -4,13 +4,14 @@ import warnings
 from .requirement import Requirement


-def parse(reqstr):
+def parse(reqstr, cwd=None):
    """
    Parse a requirements file into a list of Requirements

    See: pip/req.py:parse_requirements()

    :param reqstr: a string or file like object containing requirements
+    :param cwd: Optional current working dir for -r file.txt loading
    :returns: a *generator* of Requirement objects
    """
    filename = getattr(reqstr, 'name', None)
@@ -32,8 +33,8 @@ def parse(reqstr):
            continue
        elif line.startswith('-r') or line.startswith('--requirement'):
            _, new_filename = line.split()
-            new_file_path = os.path.join(os.path.dirname(filename or '.'),
-                                         new_filename)
+            new_file_path = os.path.join(
+                os.path.dirname(filename or '.') if filename or not cwd else cwd, new_filename)
            with open(new_file_path) as f:
                for requirement in parse(f):
                    yield requirement
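Reviewer note: the changed `os.path.join` expression picks the base directory for a nested `-r file.txt` line. The file's own directory is used when the input has a name; `cwd` is used only when there is no filename and a `cwd` was given. A small sketch isolating that expression; `nested_requirements_path` is a hypothetical helper name and the paths in the usage note are made up:

```python
import os


def nested_requirements_path(new_filename, filename=None, cwd=None):
    """Resolve the path of a nested requirements file, using the same rule as the hunk above."""
    # A named input file wins; cwd is only a fallback for unnamed input (e.g. a plain string)
    base_dir = os.path.dirname(filename or '.') if filename or not cwd else cwd
    return os.path.join(base_dir, new_filename)
```

For example, `nested_requirements_path("base.txt", cwd="/proj")` resolves against `/proj`, while passing `filename="/a/req.txt"` as well resolves against `/a`.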
--- a/clearml_agent/external/requirements_parser/requirement.py
+++ b/clearml_agent/external/requirements_parser/requirement.py
@@ -20,6 +20,15 @@ VCS_REGEX = re.compile(
    r'(#(?P<fragment>\S+))?'
 )

+VCS_EXT_REGEX = re.compile(
+    r'^(?P<scheme>{0})(@)'.format(r'|'.join(
+        [scheme.replace('+', r'\+') for scheme in ['git+git']])) +
+    r'((?P<login>[^/@]+)@)?'
+    r'(?P<path>[^#@]+)'
+    r'(@(?P<revision>[^#]+))?'
+    r'(#(?P<fragment>\S+))?'
+)
+
 # This matches just about everything
 LOCAL_REGEX = re.compile(
    r'^((?P<scheme>file)://)?'
@@ -100,7 +109,7 @@ class Requirement(object):

        req = cls('-e {0}'.format(line))
        req.editable = True
-        vcs_match = VCS_REGEX.match(line)
+        vcs_match = VCS_REGEX.match(line) or VCS_EXT_REGEX.match(line)
        local_match = LOCAL_REGEX.match(line)

        if vcs_match is not None:
@@ -147,7 +156,7 @@ class Requirement(object):

        req = cls(line)

-        vcs_match = VCS_REGEX.match(line)
+        vcs_match = VCS_REGEX.match(line) or VCS_EXT_REGEX.match(line)
        uri_match = URI_REGEX.match(line)
        local_match = LOCAL_REGEX.match(line)

@@ -226,7 +235,7 @@ class Requirement(object):
                # check if the name is valid & parsed
                Req.parse(name)
                # if we are here, name is a valid package name, check if the vcs part is valid
-                if VCS_REGEX.match(uri):
+                if VCS_REGEX.match(uri) or VCS_EXT_REGEX.match(uri):
                    req = cls.parse_line(uri)
                    req.name = name
                    return req
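Reviewer note: `VCS_EXT_REGEX` accepts the scp-like `git+git@host:path` form that `VCS_REGEX` apparently does not. The snippet below copies the pattern from the hunk above and shows how it splits one requirement line; the repository URL is a made-up example:

```python
import re

# Pattern copied verbatim from the diff above
VCS_EXT_REGEX = re.compile(
    r'^(?P<scheme>{0})(@)'.format(r'|'.join(
        [scheme.replace('+', r'\+') for scheme in ['git+git']])) +
    r'((?P<login>[^/@]+)@)?'
    r'(?P<path>[^#@]+)'
    r'(@(?P<revision>[^#]+))?'
    r'(#(?P<fragment>\S+))?'
)

# Hypothetical requirement line in the scp-like form
line = "git+git@github.com:user/repo.git@main#egg=repo"
match = VCS_EXT_REGEX.match(line)
parts = match.groupdict() if match else {}

# scheme is "git+git", path is "github.com:user/repo.git",
# revision is "main" and fragment is "egg=repo"; login stays None here
print(parts)
```

Lines using a URL scheme such as `git+https://` are not matched by this pattern and remain the responsibility of `VCS_REGEX`.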
--- a/clearml_agent/glue/k8s.py
+++ b/clearml_agent/glue/k8s.py
@@ -12,12 +12,12 @@ from copy import deepcopy
 from pathlib import Path
 from threading import Thread
 from time import sleep
-from typing import Text, List
+from typing import Text, List, Callable, Any, Collection, Optional, Union

 import yaml

 from clearml_agent.commands.events import Events
-from clearml_agent.commands.worker import Worker
+from clearml_agent.commands.worker import Worker, get_task_container
 from clearml_agent.definitions import ENV_DOCKER_IMAGE
 from clearml_agent.errors import APIError
 from clearml_agent.helper.base import safe_remove_file
@@ -31,20 +31,23 @@ class K8sIntegration(Worker):
    K8S_PENDING_QUEUE = "k8s_scheduler"

    K8S_DEFAULT_NAMESPACE = "clearml"
+    AGENT_LABEL = "CLEARML=agent"
+    LIMIT_POD_LABEL = "ai.allegro.agent.serial=pod-{pod_number}"

-    KUBECTL_APPLY_CMD = "kubectl apply -f"
+    KUBECTL_APPLY_CMD = "kubectl apply --namespace={namespace} -f"

-    KUBECTL_RUN_CMD = "kubectl run clearml-{queue_name}-id-{task_id} " \
-                      "--image {docker_image} " \
+    KUBECTL_RUN_CMD = "kubectl run clearml-id-{task_id} " \
+                      "--image {docker_image} {docker_args} " \
                      "--restart=Never " \
                      "--namespace={namespace}"

    KUBECTL_DELETE_CMD = "kubectl delete pods " \
-                         "--selector=TRAINS=agent " \
+                         "--selector={selector} " \
                         "--field-selector=status.phase!=Pending,status.phase!=Running " \
                         "--namespace={namespace}"

    BASH_INSTALL_SSH_CMD = [
+        "apt-get update",
        "apt-get install -y openssh-server",
        "mkdir -p /var/run/sshd",
        "echo 'root:training' | chpasswd",
@@ -71,12 +74,10 @@ class K8sIntegration(Worker):
        "[ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3",
        "$LOCAL_PYTHON -m pip install clearml-agent",
        "{extra_bash_init_cmd}",
+        "{extra_docker_bash_script}",
        "$LOCAL_PYTHON -m clearml_agent execute --full-monitoring --require-queue --id {task_id}"
    ]

-    AGENT_LABEL = "TRAINS=agent"
-    LIMIT_POD_LABEL = "ai.allegro.agent.serial=pod-{pod_number}"
-
    _edit_hyperparams_version = "2.9"

    def __init__(
@@ -94,6 +95,7 @@ class K8sIntegration(Worker):
            clearml_conf_file=None,
            extra_bash_init_script=None,
            namespace=None,
+            max_pods_limit=None,
            **kwargs
    ):
        """
@@ -102,7 +104,7 @@ class K8sIntegration(Worker):
        :param str k8s_pending_queue_name: queue name to use when task is pending in the k8s scheduler
        :param str|callable kubectl_cmd: kubectl command line str, supports formatting (default: KUBECTL_RUN_CMD)
            example: "task={task_id} image={docker_image} queue_id={queue_id}"
-            or a callable function: kubectl_cmd(task_id, docker_image, queue_id, task_data)
+            or a callable function: kubectl_cmd(task_id, docker_image, docker_args, queue_id, task_data)
        :param str container_bash_script: container bash script to be executed in k8s (default: CONTAINER_BASH_SCRIPT)
            Notice this string will use format() call, if you have curly brackets they should be doubled { -> {{
            Format arguments passed: {task_id} and {extra_bash_init_cmd}
@@ -121,6 +123,7 @@ class K8sIntegration(Worker):
        :param str clearml_conf_file: clearml.conf file to be used by the pod itself (optional)
        :param str extra_bash_init_script: Additional bash script to run before starting the Task inside the container
        :param str namespace: K8S namespace to be used when creating the new pods (default: clearml)
+        :param int max_pods_limit: Maximum number of pods that K8S glue can run at the same time
        """
        super(K8sIntegration, self).__init__()
        self.k8s_pending_queue_name = k8s_pending_queue_name or self.K8S_PENDING_QUEUE
@@ -146,6 +149,7 @@ class K8sIntegration(Worker):
        self.namespace = namespace or self.K8S_DEFAULT_NAMESPACE
        self.pod_limits = []
        self.pod_requests = []
+        self.max_pods_limit = max_pods_limit if not self.ports_mode else None
        if overrides_yaml:
            with open(os.path.expandvars(os.path.expanduser(str(overrides_yaml))), 'rt') as f:
                overrides = yaml.load(f, Loader=getattr(yaml, 'FullLoader', None))
@@ -271,16 +275,12 @@ class K8sIntegration(Worker):
                task_id, self.k8s_pending_queue_name, e))
            return

-        if task_data.execution.docker_cmd:
-            docker_cmd = task_data.execution.docker_cmd
-        else:
-            docker_cmd = str(ENV_DOCKER_IMAGE.get() or
-                             self._session.config.get("agent.default_docker.image", "nvidia/cuda"))
-
-        # take the first part, this is the docker image name (not arguments)
-        docker_parts = docker_cmd.split()
-        docker_image = docker_parts[0]
-        docker_args = docker_parts[1:] if len(docker_parts) > 1 else []
+        container = get_task_container(self._session, task_id)
+        if not container.get('image'):
+            container['image'] = str(
+                ENV_DOCKER_IMAGE.get() or self._session.config.get("agent.default_docker.image", "nvidia/cuda")
+            )
+            container['arguments'] = self._session.config.get("agent.default_docker.arguments", None)

        # get the clearml.conf encoded file
        # noinspection PyProtectedMember
@@ -303,20 +303,22 @@ class K8sIntegration(Worker):
        except Exception:
            queue_name = 'k8s'

-        # conform queue name to k8s standards
-        safe_queue_name = queue_name.lower().strip()
-        safe_queue_name = re.sub(r'\W+', '', safe_queue_name).replace('_', '').replace('-', '')
-
        # Search for a free pod number
        pod_count = 0
        pod_number = self.base_pod_num
-        while self.ports_mode:
+        while self.ports_mode or self.max_pods_limit:
            pod_number = self.base_pod_num + pod_count
-            kubectl_cmd_new = "kubectl get pods -l {pod_label},{agent_label} -n {namespace}".format(
-                pod_label=self.LIMIT_POD_LABEL.format(pod_number=pod_number),
-                agent_label=self.AGENT_LABEL,
-                namespace=self.namespace,
-            )
+            if self.ports_mode:
+                kubectl_cmd_new = "kubectl get pods -l {pod_label},{agent_label} -n {namespace}".format(
+                    pod_label=self.LIMIT_POD_LABEL.format(pod_number=pod_number),
+                    agent_label=self.AGENT_LABEL,
+                    namespace=self.namespace,
+                )
+            else:
+                kubectl_cmd_new = "kubectl get pods -l {agent_label} -n {namespace} -o json".format(
+                    agent_label=self.AGENT_LABEL,
+                    namespace=self.namespace,
+                )
            process = subprocess.Popen(kubectl_cmd_new.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            output, error = process.communicate()
            output = '' if not output else output if isinstance(output, str) else output.decode('utf-8')
@@ -325,38 +327,67 @@ class K8sIntegration(Worker):
            if not output:
                # No such pod exists, so we can use the pod_number we found
                break
-            if pod_count >= self.num_of_services - 1:
-                # All pod numbers are taken, exit
+
+            if self.max_pods_limit:
+                try:
+                    current_pod_count = len(json.loads(output).get("items", []))
+                except (ValueError, TypeError) as ex:
+                    self.log.warning(
+                        "K8S Glue pods monitor: Failed parsing kubectl output:\n{}\ntask '{}' "
+                        "will be enqueued back to queue '{}'\nEx: {}".format(
+                            output, task_id, queue, ex
+                        )
+                    )
+                    self._session.api_client.tasks.reset(task_id)
+                    self._session.api_client.tasks.enqueue(task_id, queue=queue, status_reason='kubectl parsing error')
+                    return
+                max_count = self.max_pods_limit
+            else:
+                current_pod_count = pod_count
+                max_count = self.num_of_services - 1
+
+            if current_pod_count >= max_count:
+                # All pods are taken, exit
+                self.log.debug(
+                    "kubectl last result: {}\n{}".format(error, output))
                self.log.warning(
-                    "kubectl last result: {}\n{}\nAll k8s services are in use, task '{}' "
+                    "All k8s services are in use, task '{}' "
                    "will be enqueued back to queue '{}'".format(
-                        error, output, task_id, queue
+                        task_id, queue
                    )
                )
                self._session.api_client.tasks.reset(task_id)
                self._session.api_client.tasks.enqueue(
                    task_id, queue=queue, status_reason='k8s max pod limit (no free k8s service)')
                return
+            elif self.max_pods_limit:
+                # max pods limit hasn't been reached yet, so we can create the pod
+                break
            pod_count += 1

        labels = ([self.LIMIT_POD_LABEL.format(pod_number=pod_number)] if self.ports_mode else []) + [self.AGENT_LABEL]
+        labels.append("clearml-agent-queue={}".format(self._safe_k8s_label_value(queue)))
+        labels.append("clearml-agent-queue-name={}".format(self._safe_k8s_label_value(queue_name)))

        if self.ports_mode:
            print("Kubernetes scheduling task id={} on pod={} (pod_count={})".format(task_id, pod_number, pod_count))
        else:
            print("Kubernetes scheduling task id={}".format(task_id))

+        kubectl_kwargs = dict(
+            create_clearml_conf=create_clearml_conf,
+            labels=labels,
+            docker_image=container['image'],
+            docker_args=container['arguments'],
+            docker_bash=container.get('setup_shell_script'),
+            task_id=task_id,
+            queue=queue
+        )
+
        if self.template_dict:
-            output, error = self._kubectl_apply(
-                create_clearml_conf=create_clearml_conf,
-                labels=labels, docker_image=docker_image, docker_args=docker_args,
-                task_id=task_id, queue=queue, queue_name=safe_queue_name)
+            output, error = self._kubectl_apply(**kubectl_kwargs)
        else:
-            output, error = self._kubectl_run(
-                create_clearml_conf=create_clearml_conf,
-                labels=labels, docker_image=docker_cmd,
-                task_data=task_data,
-                task_id=task_id, queue=queue, queue_name=safe_queue_name)
+            output, error = self._kubectl_run(task_data=task_data, **kubectl_kwargs)

        error = '' if not error else (error if isinstance(error, str) else error.decode('utf-8'))
        output = '' if not output else (output if isinstance(output, str) else output.decode('utf-8'))
@@ -390,40 +421,66 @@ class K8sIntegration(Worker):
                **user_props
            )

-    def _parse_docker_args(self, docker_args):
-        # type: (list) -> dict
-        kube_args = {'env': []}
-        while docker_args:
-            cmd = docker_args.pop().strip()
-            if cmd in ('-e', '--env',):
-                env = docker_args.pop().strip()
-                key, value = env.split('=', 1)
-                kube_args[key] += {key: value}
+    def _get_docker_args(self, docker_args, flags, target=None, convert=None):
+        # type: (List[str], Collection[str], Optional[str], Callable[[str], Any]) -> Union[dict, List[str]]
+        """
+        Get docker args matching specific flags.
+
+        :argument docker_args: List of docker argument strings (flags and values)
+        :argument flags: List of flags/names to intercept (e.g. "--env" etc.)
+        :argument target: Controls return format. If provided, returns a dict with a target field containing a list
+         of result strings, otherwise returns a list of result strings
+        :argument convert: Optional conversion function for each result string
+        """
+        args = docker_args[:] if docker_args else []
+        results = []
+        while args:
+            cmd = args.pop(0).strip()
+            if cmd in flags:
+                env = args.pop(0).strip()
+                if convert:
+                    env = convert(env)
+                results.append(env)
            else:
                self.log.warning('skipping docker argument {} (only -e --env supported)'.format(cmd))
-        return kube_args
+        if target:
+            return {target: results} if results else {}
+        return results

-    def _kubectl_apply(self, create_clearml_conf, docker_image, docker_args, labels, queue, task_id, queue_name):
+    def _kubectl_apply(self, create_clearml_conf, docker_image, docker_args, docker_bash, labels, queue, task_id):
        template = deepcopy(self.template_dict)
        template.setdefault('apiVersion', 'v1')
        template['kind'] = 'Pod'
        template.setdefault('metadata', {})
-        name = 'clearml-{queue}-id-{task_id}'.format(queue=queue_name, task_id=task_id)
+        name = 'clearml-id-{task_id}'.format(task_id=task_id)
        template['metadata']['name'] = name
        template.setdefault('spec', {})
        template['spec'].setdefault('containers', [])
+        template['spec'].setdefault('restartPolicy', 'Never')
        if labels:
            labels_dict = dict(pair.split('=', 1) for pair in labels)
            template['metadata'].setdefault('labels', {})
            template['metadata']['labels'].update(labels_dict)
-        container = self._parse_docker_args(docker_args)
+
+        container = self._get_docker_args(
+            docker_args,
+            target="env",
+            flags={"-e", "--env"},
+            convert=lambda env: {'name': env.partition("=")[0], 'value': env.partition("=")[2]},
+        )

        container_bash_script = [self.container_bash_script] if isinstance(self.container_bash_script, str) \
            else self.container_bash_script

+        extra_docker_bash_script = '\n'.join(self._session.config.get("agent.extra_docker_shell_script", None) or [])
+        if docker_bash:
+            extra_docker_bash_script += '\n' + str(docker_bash) + '\n'
+
        script_encoded = '\n'.join(
            ['#!/bin/bash', ] +
-            [line.format(extra_bash_init_cmd=self.extra_bash_init_script or '', task_id=task_id)
+            [line.format(extra_bash_init_cmd=self.extra_bash_init_script or '',
+                         task_id=task_id,
+                         extra_docker_bash_script=extra_docker_bash_script)
             for line in container_bash_script])

        create_init_script = \
@@ -433,18 +490,23 @@ class K8sIntegration(Worker):
                    script_encoded.encode('ascii')
                ).decode('ascii'))

-        container = merge_dicts(
+        # Notice: we always leave with exit code 0, so pods are never restarted
+        container = self._merge_containers(
            container,
            dict(name=name, image=docker_image,
                 command=['/bin/bash'],
-                 args=['-c', '{} ; {}'.format(create_clearml_conf, create_init_script)])
+                 args=['-c', '{} ; {} ; exit 0'.format(create_clearml_conf, create_init_script)])
        )

        if template['spec']['containers']:
-            template['spec']['containers'][0] = merge_dicts(template['spec']['containers'][0], container)
+            template['spec']['containers'][0] = self._merge_containers(template['spec']['containers'][0], container)
        else:
            template['spec']['containers'].append(container)

+        if self._docker_force_pull:
+            for c in template['spec']['containers']:
+                c.setdefault('imagePullPolicy', 'Always')
+
        fp, yaml_file = tempfile.mkstemp(prefix='clearml_k8stmpl_', suffix='.yml')
        os.close(fp)
        with open(yaml_file, 'wt') as f:
@@ -472,14 +534,18 @@ class K8sIntegration(Worker):

        return output, error

-    def _kubectl_run(self, create_clearml_conf, docker_image, labels, queue, task_data, task_id, queue_name):
+    def _kubectl_run(
+        self, create_clearml_conf, docker_image, docker_args, docker_bash, labels, queue, task_data, task_id
+    ):
        if callable(self.kubectl_cmd):
-            kubectl_cmd = self.kubectl_cmd(task_id, docker_image, queue, task_data, queue_name)
+            kubectl_cmd = self.kubectl_cmd(task_id, docker_image, docker_args, queue, task_data)
        else:
            kubectl_cmd = self.kubectl_cmd.format(
-                queue_name=queue_name,
                task_id=task_id,
                docker_image=docker_image,
+                docker_args=" ".join(self._get_docker_args(
+                    docker_args, flags={"-e", "--env"}, convert=lambda env: '--env={}'.format(env))
+                ),
                queue_id=queue,
                namespace=self.namespace,
            )
@@ -495,6 +561,9 @@ class K8sIntegration(Worker):
        if self.pod_requests:
            kubectl_cmd += ['--requests', ",".join(self.pod_requests)]

+        if self._docker_force_pull and not any(x.startswith("--image-pull-policy=") for x in kubectl_cmd):
+            kubectl_cmd += ["--image-pull-policy=Always"]
+
        container_bash_script = [self.container_bash_script] if isinstance(self.container_bash_script, str) \
            else self.container_bash_script
        container_bash_script = ' ; '.join(container_bash_script)
@@ -506,7 +575,10 @@ class K8sIntegration(Worker):
            "/bin/sh",
            "-c",
            "{} ; {}".format(create_clearml_conf, container_bash_script.format(
-                extra_bash_init_cmd=self.extra_bash_init_script, task_id=task_id)),
+                extra_bash_init_cmd=self.extra_bash_init_script or "",
+                extra_docker_bash_script=docker_bash or "",
+                task_id=task_id
+            )),
        ]
        process = subprocess.Popen(kubectl_cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        output, error = process.communicate()
@@ -539,7 +611,7 @@ class K8sIntegration(Worker):
            # iterate over queues (priority style, queues[0] is highest)
            for queue in queues:
                # delete old completed / failed pods
-                get_bash_output(self.KUBECTL_DELETE_CMD.format(namespace=self.namespace))
+                get_bash_output(self.KUBECTL_DELETE_CMD.format(namespace=self.namespace, selector=self.AGENT_LABEL))

                # get next task in queue
                try:
@@ -591,3 +663,27 @@ class K8sIntegration(Worker):
    @classmethod
    def get_ssh_server_bash(cls, ssh_port_number):
        return ' ; '.join(line.format(port=ssh_port_number) for line in cls.BASH_INSTALL_SSH_CMD)
+
+    @staticmethod
+    def _merge_containers(c1, c2):
+        def merge_env(k, d1, d2, not_set):
+            if k != "env":
+                return not_set
+            # Merge environment lists, second list overrides first
+            return list({
+                item['name']: item for envs in (d1, d2) for item in envs
+            }.values())
+
+        return merge_dicts(
+            c1, c2, custom_merge_func=merge_env
+        )
+
+    @staticmethod
+    def _safe_k8s_label_value(value):
+        """ Conform string to k8s standards for a label value """
+        value = value.lower().strip()
+        value = re.sub(r'^[^A-Za-z0-9]+', '', value)  # strip leading non-alphanumeric chars
+        value = re.sub(r'[^A-Za-z0-9]+$', '', value)  # strip trailing non-alphanumeric chars
+        value = re.sub(r'\W+', '-', value)  # allow only word chars (this also replaces ".", which k8s supports, but nvm)
+        value = re.sub(r'-+', '-', value)  # don't leave messy "--" after replacing previous chars
+        return value[:63]
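
The label sanitizing added above can be tried in isolation. The following is a minimal sketch that copies the same steps into a standalone function; the name `safe_k8s_label_value` and the sample queue name are illustrative assumptions, not part of the patch.

```python
import re


def safe_k8s_label_value(value):
    """Standalone copy of the sanitizing steps, for experimentation only."""
    value = value.lower().strip()
    value = re.sub(r'^[^A-Za-z0-9]+', '', value)  # strip leading non-alphanumeric chars
    value = re.sub(r'[^A-Za-z0-9]+$', '', value)  # strip trailing non-alphanumeric chars
    value = re.sub(r'\W+', '-', value)            # replace runs of non-word chars with "-"
    value = re.sub(r'-+', '-', value)             # collapse repeated "-"
    return value[:63]                             # keep at most 63 characters


# hypothetical queue name, used only to illustrate the transformation
print(safe_k8s_label_value("  My Queue / GPU #1 "))  # my-queue-gpu-1
```

Note that the truncation is applied last, so a very long input can in principle end with "-" after being cut at 63 characters.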
--- a/clearml_agent/helper/dicts.py
+++ b/clearml_agent/helper/dicts.py
@@ -1,17 +1,23 @@
-from typing import Callable, Dict, Any
+from typing import Callable, Dict, Any, Optional
+
+_not_set = object()


 def filter_keys(filter_, dct):  # type: (Callable[[Any], bool], Dict) -> Dict
    return {key: value for key, value in dct.items() if filter_(key)}


-def merge_dicts(dict1, dict2):
+def merge_dicts(dict1, dict2, custom_merge_func=None):
+    # type: (Any, Any, Optional[Callable[[str, Any, Any, Any], Any]]) -> Any
    """ Recursively merges dict2 into dict1 """
    if not isinstance(dict1, dict) or not isinstance(dict2, dict):
        return dict2
    for k in dict2:
        if k in dict1:
-            dict1[k] = merge_dicts(dict1[k], dict2[k])
+            res = _not_set
+            if custom_merge_func:
+                res = custom_merge_func(k, dict1[k], dict2[k], _not_set)
+            dict1[k] = merge_dicts(dict1[k], dict2[k], custom_merge_func) if res is _not_set else res
        else:
            dict1[k] = dict2[k]
    return dict1
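
To illustrate how a custom merge function and the `_not_set` sentinel interact, here is a minimal standalone sketch. It initializes `res` to the sentinel so that the plain recursive merge still applies when no custom function is passed; the `template` and `task` dictionaries are made-up sample data, and `merge_env` follows the env-list merge used by the k8s glue.

```python
_not_set = object()  # sentinel meaning "custom function did not handle this key"


def merge_dicts(dict1, dict2, custom_merge_func=None):
    """Recursively merge dict2 into dict1 (sketch, sentinel used as the default)."""
    if not isinstance(dict1, dict) or not isinstance(dict2, dict):
        return dict2
    for k in dict2:
        if k in dict1:
            res = _not_set
            if custom_merge_func:
                res = custom_merge_func(k, dict1[k], dict2[k], _not_set)
            # fall back to the recursive merge unless the custom function returned a value
            dict1[k] = merge_dicts(dict1[k], dict2[k], custom_merge_func) if res is _not_set else res
        else:
            dict1[k] = dict2[k]
    return dict1


def merge_env(k, d1, d2, not_set):
    """Merge "env" lists by variable name; entries from the second list win."""
    if k != "env":
        return not_set
    return list({item["name"]: item for envs in (d1, d2) for item in envs}.values())


# hypothetical container specs, for illustration only
template = {"image": "base", "env": [{"name": "A", "value": "1"}, {"name": "B", "value": "2"}]}
task = {"image": "custom", "env": [{"name": "B", "value": "3"}]}
merged = merge_dicts(template, task, custom_merge_func=merge_env)
print(merged)
```

With this input, `image` is taken from the second dictionary and the `env` list keeps `A` while `B` is overridden.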
--- a/clearml_agent/helper/gpu/gpustat.py
+++ b/clearml_agent/helper/gpu/gpustat.py
@@ -421,4 +421,8 @@ def get_driver_cuda_version():
    except BaseException:
        return None

+    # for some reason we get CUDA version 11020 instead of 11200, so this is the fix
+    if cuda_version and len(cuda_version) >= 4 and cuda_version[2] == '0' and cuda_version[3] != '0':
+        return cuda_version[:2]+cuda_version[3]
+
    return cuda_version[:3] if cuda_version else None
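
The string handling above can be checked with a small standalone sketch. The helper name `normalize_cuda_version` is hypothetical, and the input is assumed to be a string of digits as in the hunk. For context, CUDA encodes versions as 1000 * major + 10 * minor, so 11020 corresponds to CUDA 11.2.

```python
def normalize_cuda_version(cuda_version):
    """Hypothetical helper mirroring the string handling in the hunk above."""
    if not cuda_version:
        return None
    # e.g. "11020": third digit is "0" and fourth is not, so take "11" + "2"
    if len(cuda_version) >= 4 and cuda_version[2] == '0' and cuda_version[3] != '0':
        return cuda_version[:2] + cuda_version[3]
    # otherwise keep the first three digits, e.g. "11000" -> "110"
    return cuda_version[:3]


for raw in ("11020", "11000", "10010", ""):
    print(repr(raw), "->", normalize_cuda_version(raw))
```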
--- a/clearml_agent/helper/os/folder_cache.py
+++ b/clearml_agent/helper/os/folder_cache.py
@@ -152,7 +152,14 @@ class FolderCache(object):
        for f in source_folder.glob('*'):
            if f.name in exclude_sub_folders:
                continue
-            shutil.copytree(src=f.as_posix(), dst=(temp_folder / f.name).as_posix(), symlinks=True)
+            if f.is_dir():
+                shutil.copytree(
+                    src=f.as_posix(), dst=(temp_folder / f.name).as_posix(),
+                    symlinks=True, ignore_dangling_symlinks=True)
+            else:
+                shutil.copy(
+                    src=f.as_posix(), dst=(temp_folder / f.name).as_posix(),
+                    follow_symlinks=False)

        # rename the target folder
        target_cache_folder = self._cache_folder / '.'.join(keys)
--- a/clearml_agent/helper/package/conda_api.py
+++ b/clearml_agent/helper/package/conda_api.py
@@ -573,6 +573,8 @@ class CondaAPI(PackageManager):
            conda_env['dependencies'] = [clean_ver(r) for r in reqs]
            with self.temp_file("conda_env", yaml.dump(conda_env), suffix=".yml") as name:
                print('Conda: Trying to install requirements:\n{}'.format(conda_env['dependencies']))
+                if self.session.debug_mode:
+                    print('{}:\n{}'.format(name, yaml.dump(conda_env)))
                result = self._run_command(
                    ("env", "update", "-p", self.path, "--file", name)
                )
@@ -603,6 +605,8 @@ class CondaAPI(PackageManager):
                pip_req_str = [r.tostr() for r in pip_requirements if r.name not in ('pip', 'virtualenv', )]
                print('Conda: Installing requirements: step 2 - using pip:\n{}'.format(pip_req_str))
                PackageManager._selected_manager = self.pip
+                if self.session.debug_mode:
+                    print('pip requirements.txt:\n{}'.format('\n'.join(pip_req_str)))
                self.pip.load_requirements({'pip': '\n'.join(pip_req_str)})
            except Exception as e:
                print(e)
@@ -646,12 +650,16 @@ class CondaAPI(PackageManager):
            ansi_escape = re.compile(r'(?:\x1B[@-_]|[\x80-\x9F])[0-?]*[ -/]*[@-~]')
            return ansi_escape.sub('', line)

+        # make sure we are not running it with our own PYTHONPATH
+        env = dict(**os.environ)
+        env.pop('PYTHONPATH', None)
+
        command = Argv(*command)  # type: Executable
        if not raw:
            command = (self.conda,) + command + ("--quiet", "--json")
        try:
            print('Executing Conda: {}'.format(command.serialize()))
-            result = command.get_output(stdin=DEVNULL, **kwargs)
+            result = command.get_output(stdin=DEVNULL, env=env, **kwargs)
            if self.session.debug_mode:
                print(result)
        except Exception as e:
--- a/clearml_agent/helper/package/external_req.py
+++ b/clearml_agent/helper/package/external_req.py
@@ -2,6 +2,8 @@ import re
 from collections import OrderedDict
 from typing import Text

+from pathlib2 import Path
+
 from .base import PackageManager
 from .requirements import SimpleSubstitution
 from ..base import safe_furl as furl
@@ -10,22 +12,26 @@ from ..base import safe_furl as furl
 class ExternalRequirements(SimpleSubstitution):

    name = "external_link"
+    cwd = None

    def __init__(self, *args, **kwargs):
        super(ExternalRequirements, self).__init__(*args, **kwargs)
        self.post_install_req = []
        self.post_install_req_lookup = OrderedDict()
+        self.post_install_local_req_lookup = OrderedDict()

    def match(self, req):
        # match local folder building:
-        # noinspection PyBroadException
-        try:
-            if not req.name and req.req and not req.req.editable and not req.req.vcs and \
-                    req.req.line and req.req.line.strip().split('#')[0] and \
-                    not req.req.line.strip().split('#')[0].lower().endswith('.whl'):
-                return True
-        except Exception:
-            pass
+        if self.is_local_folder_package(req):
+            # noinspection PyBroadException
+            try:
+                folder_path = req.req.line.strip().split('#')[0].strip()
+                if self.cwd and not Path(folder_path).is_absolute():
+                    folder_path = (Path(self.cwd) / Path(folder_path)).absolute().as_posix()
+                self.post_install_local_req_lookup['file://{}'.format(folder_path)] = req.req.line
+            except Exception:
+                pass
+            return True

        # match both editable or code or unparsed
        if not (not req.name or req.req and (req.req.editable or req.req.vcs)):
@@ -113,8 +119,32 @@ class ExternalRequirements(SimpleSubstitution):
                                       if r not in self.post_install_req_lookup]
            list_of_requirements[k] += [self.post_install_req_lookup.get(r, '')
                                        for r in self.post_install_req_lookup.keys() if r in original_requirements]
+
+            if self.post_install_local_req_lookup:
+                original_requirements = list_of_requirements[k]
+                list_of_requirements[k] = [
+                    r for r in original_requirements
+                    if len(r.split('@', 1)) != 2 or r.split('@', 1)[1].strip() not in self.post_install_local_req_lookup]
+
+                list_of_requirements[k] += [
+                    self.post_install_local_req_lookup.get(r.split('@', 1)[1].strip(), '')
+                    for r in original_requirements
+                    if len(r.split('@', 1)) == 2 and r.split('@', 1)[1].strip() in self.post_install_local_req_lookup]
+
        return list_of_requirements

+    @classmethod
+    def is_local_folder_package(cls, req):
+        # noinspection PyBroadException
+        try:
+            if not req.name and req.req and not req.req.editable and not req.req.vcs and \
+                    req.req.line and req.req.line.strip().split('#')[0] and \
+                    not req.req.line.strip().split('#')[0].lower().endswith('.whl'):
+                return True
+        except Exception:
+            pass
+        return False
+

 class OnlyExternalRequirements(ExternalRequirements):
    def __init__(self, *args, **kwargs):
--- a/clearml_agent/helper/package/pip_api/system.py
+++ b/clearml_agent/helper/package/pip_api/system.py
@@ -1,3 +1,4 @@
+import os
 import sys
 from itertools import chain
 from typing import Text, Optional
@@ -82,7 +83,10 @@ class SystemPip(PackageManager):
        :param kwargs: kwargs for get_output/check_output command
        """
        command = self._make_command(command)
-        return (command.get_output if output else command.check_call)(stdin=DEVNULL, **kwargs)
+        # make sure we are not running it with our own PYTHONPATH
+        env = dict(**os.environ)
+        env.pop('PYTHONPATH', None)
+        return (command.get_output if output else command.check_call)(stdin=DEVNULL, env=env, **kwargs)

    def _make_command(self, command):
        return Argv(self.bin, '-m', 'pip', '--disable-pip-version-check', *command)
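
The effect of dropping `PYTHONPATH` before spawning a child process can be seen with a short standalone sketch. It uses `subprocess` directly instead of the agent's `Argv` wrapper; the helper name `run_without_pythonpath` and the `/tmp/agent-only` value are assumptions made for the illustration.

```python
import os
import subprocess
import sys


def run_without_pythonpath(argv):
    """Run a command with a copy of the environment that has no PYTHONPATH."""
    env = dict(os.environ)       # copy, so the parent environment is left untouched
    env.pop("PYTHONPATH", None)  # the child should not inherit our module search path
    return subprocess.check_output(argv, env=env).decode("utf-8")


# pretend the agent itself was started with a custom PYTHONPATH
os.environ["PYTHONPATH"] = "/tmp/agent-only"
out = run_without_pythonpath(
    [sys.executable, "-c", "import os; print(os.environ.get('PYTHONPATH'))"]
)
print(out.strip())  # None
```

The parent process keeps its own `PYTHONPATH`; only the copy handed to the child is modified.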
--- a/clearml_agent/helper/package/requirements.py
+++ b/clearml_agent/helper/package/requirements.py
@@ -448,10 +448,14 @@ class RequirementsManager(object):
        self.translator = RequirementsTranslator(session, interpreter=base_interpreter,
                                                 cache_dir=pip_cache_dir.as_posix())
        self._base_interpreter = base_interpreter
+        self._cwd = None

    def register(self, cls):  # type: (Type[RequirementSubstitution]) -> None
        self.handlers.append(cls(self._session))

+    def set_cwd(self, cwd):
+        self._cwd = str(cwd) if cwd else None
+
    def _replace_one(self, req):  # type: (MarkerRequirement) -> Optional[Text]
        match = re.search(r';\s*(.*)', Text(req))
        if match:
@@ -466,7 +470,7 @@ class RequirementsManager(object):
    def replace(self, requirements):  # type: (Text) -> Text
        def safe_parse(req_str):
            try:
-                return next(parse(req_str))
+                return next(parse(req_str, cwd=self._cwd))
            except Exception as ex:
                return Requirement(req_str)

--- a/clearml_agent/helper/process.py
+++ b/clearml_agent/helper/process.py
@@ -42,20 +42,31 @@ def get_bash_output(cmd, strip=False, stderr=subprocess.STDOUT, stdin=False):
    return output if not strip or not output else output.strip()


-def terminate_process(pid, timeout=10.):
+def terminate_process(pid, timeout=10., ignore_zombie=True, include_children=False):
    # noinspection PyBroadException
    try:
        proc = psutil.Process(pid)
+        children = proc.children(recursive=True) if include_children else []
        proc.terminate()
        cnt = 0
-        while proc.is_running() and cnt < timeout:
+        while proc.is_running() and (ignore_zombie or proc.status() != 'zombie') and cnt < timeout:
            sleep(1.)
            cnt += 1
        proc.terminate()
+
+        # terminate children
+        for c in children:
+            c.terminate()
+
        cnt = 0
-        while proc.is_running() and cnt < timeout:
+        while proc.is_running() and (ignore_zombie or proc.status() != 'zombie') and cnt < timeout:
            sleep(1.)
            cnt += 1
+
+        # kill children
+        for c in children:
+            c.kill()
+
        proc.kill()
    except Exception:
        pass
@@ -66,9 +77,8 @@ def terminate_process(pid, timeout=10.):
        return True


-def kill_all_child_processes(pid=None):
+def kill_all_child_processes(pid=None, include_parent=True):
    # get current process if pid not provided
-    include_parent = True
    if not pid:
        pid = os.getpid()
        include_parent = False
@@ -84,6 +94,23 @@ def kill_all_child_processes(pid=None):
        parent.kill()


+def terminate_all_child_processes(pid=None, timeout=10., include_parent=True):
+    # get current process if pid not provided
+    if not pid:
+        pid = os.getpid()
+        include_parent = False
+    try:
+        parent = psutil.Process(pid)
+    except psutil.Error:
+        # could not find parent process id
+        return
+    for child in parent.children(recursive=False):
+        print('Terminating child process {}'.format(child.pid))
+        terminate_process(child.pid, timeout=timeout, ignore_zombie=False, include_children=True)
+    if include_parent:
+        terminate_process(parent.pid, timeout=timeout, ignore_zombie=False)
+
+
 def get_docker_id(docker_cmd_contains):
    try:
        containers_running = get_bash_output(cmd='docker ps --no-trunc --format \"{{.ID}}: {{.Command}}\"')
@@ -103,9 +130,10 @@ def shutdown_docker_process(docker_cmd_contains=None, docker_id=None):
            docker_id = get_docker_id(docker_cmd_contains=docker_cmd_contains)
        if docker_id:
            # we found our docker, stop it
-            get_bash_output(cmd='docker stop -t 1 {}'.format(docker_id))
+            return get_bash_output(cmd='docker stop -t 1 {}'.format(docker_id))
    except Exception:
        pass
+    return None


 def commit_docker(container_name, docker_cmd_contains=None, docker_id=None, apply_change=None):
@@ -193,6 +221,7 @@ class Argv(Executable):
        """
        self.argv = argv
        self._log = kwargs.pop("log", None)
+        self._display_argv = kwargs.pop("display_argv", argv)
        if not self._log:
            self._log = logging.getLogger(__name__)
            self._log.propagate = False
@@ -217,10 +246,10 @@ class Argv(Executable):
        return self.argv

    def __repr__(self):
-        return "<Argv{}>".format(self.argv)
+        return "<Argv{}>".format(self._display_argv)

    def __str__(self):
-        return "Executing: {}".format(self.argv)
+        return "Executing: {}".format(self._display_argv)

    def __iter__(self):
        if is_windows_platform():
--- a/clearml_agent/helper/repo.py
+++ b/clearml_agent/helper/repo.py
@@ -591,7 +591,7 @@ def clone_repository_cached(session, execution, destination):
    # mock lock
    repo_lock = Lock()
    repo_lock_timeout_sec = 300
-    repo_url = execution.repository  # type: str
+    repo_url = execution.repository or ''  # type: str
    parsed_url = furl(repo_url)
    no_password_url = parsed_url.copy().remove(password=True).url

--- a/clearml_agent/interface/worker.py
+++ b/clearml_agent/interface/worker.py
@@ -50,7 +50,7 @@ DAEMON_ARGS = dict({
    },
    '--docker': {
        'help': 'Run execution task inside a docker (v19.03 and above). Optional args <image> <arguments> or '
-                'specify default docker image in agent.default_docker.image / agent.default_docker.arguments'
+                'specify default docker image in agent.default_docker.image / agent.default_docker.arguments '
                'use --gpus/--cpu-only (or set NVIDIA_VISIBLE_DEVICES) to limit gpu visibility for docker',
        'nargs': '*',
        'default': False,
@@ -99,7 +99,8 @@ DAEMON_ARGS = dict({
    '--dynamic-gpus': {
        'help': 'Allow to dynamically allocate gpus based on queue properties, '
                'configure with \'--queues <queue_name>=<num_gpus>\'.'
-                ' Example: \'--dynamic-gpus --queue dual_gpus=2 single_gpu=1\'',
+                ' Example: \'--dynamic-gpus --gpus 0-3 --queue dual_gpus=2 single_gpu=1\''
+                ' Example Opportunistic: \'--dynamic-gpus --gpus 0-3 --queue dual_gpus=2 max_quad_gpus=1-4 \'',
        'action': 'store_true',
    },
    '--uptime': {
@@ -110,7 +111,7 @@ DAEMON_ARGS = dict({
        'default': None,
    },
    '--downtime': {
-        'help': 'Specify uptime for clearml-agent in "<hours> <days>" format. for example, use "09-13 TUE" to set '
+        'help': 'Specify downtime for clearml-agent in "<hours> <days>" format. for example, use "09-13 TUE" to set '
                'Tuesday\'s downtime to 09-13'
                'Note: Make sure to have only on of uptime/downtime configuration and not both.',
        'nargs': '*',
--- a/clearml_agent/version.py
+++ b/clearml_agent/version.py
@@ -1 +1 @@
-__version__ = '0.17.2'
+__version__ = '1.0.1rc1'
--- a/docs/clearml.conf
+++ b/docs/clearml.conf
@@ -42,6 +42,9 @@ agent {
    # Example values: "/usr/bin/python3" or "/usr/local/bin/python3.6"
    # The default is the python executing the clearml_agent
    python_binary: ""
+    # ignore any requested python version (Default: False, if a Task was using a
+    # specific python version and the system supports multiple python versions, the agent will use the requested python version)
+    # ignore_requested_python_version: true

    # select python package manager:
    # currently supported pip and conda
@@ -63,7 +66,7 @@ agent {
        extra_index_url: []

        # additional conda channels to use when installing with conda package manager
-        conda_channels: ["pytorch", "conda-forge", ]
+        conda_channels: ["pytorch", "conda-forge", "defaults", ]
        # conda_full_env_update: false
        # conda_env_as_base_docker: false

@@ -107,11 +110,12 @@ agent {
        path: ~/.clearml/vcs-cache
    },

+    # DEPRECATED: please use `venvs_cache` and set `venvs_cache.path`
    # use venv-update in order to accelerate python virtual environment building
    # Still in beta, turned off by default
-    venv_update: {
-        enabled: false,
-    },
+    # venv_update: {
+    #     enabled: false,
+    # },

    # cached folder for specific python package download (mostly pytorch versions)
    pip_download_cache {
@@ -135,6 +139,11 @@ agent {
    # optional shell script to run in docker when started before the experiment is started
    # extra_docker_shell_script: ["apt-get install -y bindfs", ]

+    # Install the required packages for opencv libraries (libsm6 libxext6 libxrender-dev libglib2.0-0),
+    # for backwards compatibility reasons, true as default,
+    # change to false to skip installation and decrease docker spin up time
+    # docker_install_opencv_libs: true
+
    # set to true in order to force "docker pull" before running an experiment using a docker image.
    # This makes sure the docker image is updated.
    docker_force_pull: false
--- a/docs/clearml_architecture.png
+++ b/docs/clearml_architecture.png
--- a/docs/screenshots.gif
+++ b/docs/screenshots.gif
--- a/examples/k8s_glue_example.py
+++ b/examples/k8s_glue_example.py
@@ -10,12 +10,15 @@ from clearml_agent.glue.k8s import K8sIntegration

 def parse_args():
    parser = ArgumentParser()
+    group = parser.add_mutually_exclusive_group()
+
    parser.add_argument(
        "--queue", type=str, help="Queue to pull tasks from"
    )
-    parser.add_argument(
+    group.add_argument(
        "--ports-mode", action='store_true', default=False,
        help="Ports-Mode will add a label to the pod which can be used as service, in order to expose ports"
+             ". Should not be used with max-pods"
    )
    parser.add_argument(
        "--num-of-services", type=int, default=20,
@@ -57,6 +60,11 @@ def parse_args():
        "--namespace", type=str,
        help="Specify the namespace in which pods will be created (default: %(default)s)", default="clearml"
    )
+    group.add_argument(
+        "--max-pods", type=int,
+        help="Limit the maximum number of pods that this service can run at the same time."
+             " Should not be used with ports-mode"
+    )
    return parser.parse_args()


@@ -77,7 +85,7 @@ def main():
        user_props_cb=user_props_cb, overrides_yaml=args.overrides_yaml, clearml_conf_file=args.pod_clearml_conf,
        template_yaml=args.template_yaml, extra_bash_init_script=K8sIntegration.get_ssh_server_bash(
            ssh_port_number=args.ssh_server_port) if args.ssh_server_port else None,
-        namespace=args.namespace,
+        namespace=args.namespace, max_pods_limit=args.max_pods or None,
    )
    k8s.k8s_daemon(args.queue)

--- a/requirements.txt
+++ b/requirements.txt
@@ -8,10 +8,10 @@ psutil>=3.4.2,<5.9.0
 pyhocon>=0.3.38,<0.4.0
 pyparsing>=2.0.3,<2.5.0
 python-dateutil>=2.4.2,<2.9.0
-pyjwt>=1.6.4,<1.8.0
-PyYAML>=3.12,<5.4.0
+pyjwt>=1.6.4,<2.1.0
+PyYAML>=3.12,<5.5.0
 requests>=2.20.0,<2.26.0
 six>=1.11.0,<1.16.0
 typing>=3.6.4,<3.8.0
 urllib3>=1.21.1,<1.27.0
-virtualenv>=16,<20
+virtualenv>=16,<21
--- a/setup.py
+++ b/setup.py
@@ -60,6 +60,7 @@ setup(
        'Programming Language :: Python :: 3.6',
        'Programming Language :: Python :: 3.7',
        'Programming Language :: Python :: 3.8',
+        'Programming Language :: Python :: 3.9',
        'License :: OSI Approved :: Apache Software License',
    ],
Author	SHA1	Message	Date
allegroai	4c9410c5fe	Fix auto mount SSH_AUTH_SOCK into docker (issue #45)	2021-07-11 09:44:49 +03:00
pollfly	351f0657c3	Update agent gif (#69)	2021-07-08 09:20:45 +03:00
allegroai	382604e923	Fix services mode killing child processes when running in services mode + venv	2021-06-30 23:58:25 +03:00
Jake Henning	b48f25a7f9	Merge pull request #68 from pollfly/master Fix documentation links	2021-06-29 11:04:52 +03:00
Revital	b76e4fc02b	Merge remote-tracking branch 'origin/master'	2021-06-29 07:59:02 +03:00
Revital	27cf7dd67f	add clearml_architecture picture	2021-06-29 07:58:29 +03:00
pollfly	05ec45352c	Merge branch 'allegroai:master' into master	2021-06-29 07:37:10 +03:00
allegroai	0e7546f248	Fix docker force pull in k8s glue _kubectl_apply()	2021-06-27 09:42:14 +03:00
allegroai	e3c8bd5666	Add support for agent.docker_force_pull configuration setting in k8s glue	2021-06-25 17:36:08 +03:00
allegroai	3ae1741343	Fix k8s glue task container arguments not supported in kubectl_run command Fix k8s glue not passing required extra_docker_bash_script to string format	2021-06-25 17:35:01 +03:00
allegroai	53c106c3af	Fix k8s glue task container handling fails parsing docker image Fix k8s glue uses task container image arguments when no image is specified	2021-06-25 17:34:28 +03:00
allegroai	44fc7dffe6	Fix key/secret usage printout	2021-06-24 19:37:59 +03:00
allegroai	aaa6b32f9f	Fix support for "-r requirements.txt" inside "installed packages"	2021-06-24 19:26:35 +03:00
allegroai	821a0c4a2b	Fix parsing VCS links starting with "git+git@" (notice "git+git://" was already supported)	2021-06-24 19:25:41 +03:00
Revital	6373237960	switch allegro.ai link to clear.ml links	2021-06-22 13:59:37 +03:00
pollfly	1caf7b104f	Merge branch 'allegroai:master' into master	2021-06-22 13:47:48 +03:00
allegroai	176b4a4cde	Fix --services-mode when the execute agent fails when starting to run with error code 0	2021-06-16 18:32:29 +03:00
allegroai	29bf993be7	Add printout when using key/secret from env vars	2021-06-02 21:15:48 +03:00
allegroai	eda597dea5	Version bump	2021-06-02 13:17:57 +03:00
allegroai	8c56777125	Add CLEARML_AGENT_DISABLE_SSH_MOUNT allowing disabling the auto .ssh mount into the docker	2021-06-02 13:16:58 +03:00
allegroai	7e90ebd5db	Fix _dynamic_gpu_get_available worker timeout increase to 10 minutes	2021-06-02 13:16:17 +03:00
allegroai	3a07bfe1d7	Version bump	2021-05-31 23:19:46 +03:00
allegroai	0694b9e8af	Fix PyYAML supported versions	2021-05-26 18:33:35 +03:00
allegroai	742cbf5767	Add docker environment arguments log masking support (issue #67)	2021-05-25 19:31:45 +03:00
allegroai	e93384b99b	Fix --stop with dynamic gpus	2021-05-20 10:58:46 +03:00
allegroai	3c4e976093	Add agent.ignore_requested_python_version to config file	2021-05-19 15:20:44 +03:00
allegroai	1e795beec8	Fix support for spaces in docker arguments (issue #358)	2021-05-19 15:20:03 +03:00
allegroai	4f7407084d	Fix standalone script with pre-exiting conda venv	2021-05-12 15:46:25 +03:00
allegroai	ae3d034531	Protect against None in execution.repository	2021-05-12 15:45:31 +03:00
allegroai	a2db1f5ab5	Remove queue name from pod name in k8s glue, add queue name and ID to pod labels (issue #64)	2021-05-05 12:03:35 +03:00
allegroai	cec6420c8f	Version bump to v1.0.0	2021-05-03 18:33:53 +03:00
allegroai	4f18bb7ea0	Add k8s glue default restartPolicy=Never to template to prevent pods from restarting	2021-04-28 13:20:13 +03:00
allegroai	3ec2a3a92e	Add k8s pod limit to k8s glue example	2021-04-28 13:19:34 +03:00
allegroai	823b67a3ce	Deprecate venv_update (replaced by the more robust venvs_cache)	2021-04-28 13:17:37 +03:00
Revital	24dc59e31f	add space to help message	2021-04-27 13:50:44 +03:00
allegroai	08ff5e6db7	Add number of pods limit to k8s glue	2021-04-25 10:47:49 +03:00
allegroai	e60a6f9d14	Fix --stop support for dynamic gpus	2021-04-25 10:46:43 +03:00
Revital	161656d9e4	add space to help message	2021-04-22 14:14:38 +03:00
Allegro AI	8569c02b33	Merge pull request #58 from pollfly/master fix --downtime help	2021-04-21 15:27:47 +03:00
Revital	35e714d8d9	fix --downtime help	2021-04-21 09:13:47 +03:00
allegroai	6f8d5710d6	Fix dynamic gpus priority queue	2021-04-20 18:11:59 +03:00
allegroai	a671692832	Fix --services-mode with instance limit	2021-04-20 18:11:36 +03:00
allegroai	5c8675e43a	Add support for dynamic gpus opportunistic scheduling (with min/max gpus per queue)	2021-04-20 18:11:16 +03:00
allegroai	60a58f6fad	Fix poetry support (issue #57)	2021-04-14 11:22:07 +03:00
allegroai	948fc4c6ce	Add python 3.9 to the support table	2021-04-12 23:01:40 +03:00
allegroai	5be5f3209d	Fix documentation links	2021-04-12 23:01:22 +03:00
allegroai	537b67e0cd	Fix agent can return non-zero error code and pods will end up restarting forever (issue #56)	2021-04-12 23:00:59 +03:00
allegroai	82c5e55fe4	Fix usage of not_set in k8s template merge	2021-04-07 21:30:13 +03:00
allegroai	5f0d51d485	Add documentation for agent.docker_install_opencv_libs	2021-04-07 18:48:30 +03:00
allegroai	945dd816ad	Fix no docker arguments	2021-04-07 18:47:13 +03:00
allegroai	45009e6cc2	Add support for updating back docker on new API v2.13	2021-04-07 18:46:58 +03:00
allegroai	8eace6d57b	Bump virtualenv dependency version	2021-04-07 18:46:35 +03:00
allegroai	3774fa6abd	Add support for new container base setup script feature	2021-04-07 18:46:14 +03:00
allegroai	e71e6865d2	Add agent.docker_install_opencv_libs (default: True) to enable auto opencv libs install for faster docker spin-up	2021-04-07 18:45:44 +03:00
allegroai	0e8f1528b1	Remove redundant py2 code	2021-04-07 18:44:59 +03:00
allegroai	c331babf51	Add stopping message on Task process termination Fix --stop on dynamic gpus venv mode	2021-04-07 18:44:33 +03:00
allegroai	c59d268995	Fix venv cache crash on bad symbolic links	2021-04-07 18:44:11 +03:00
allegroai	9e9fcb0ba9	Add dynamic mode terminate dockers on sig_term	2021-04-07 18:43:44 +03:00
allegroai	f33e0b2f78	Verify docker command exists when running in docker mode	2021-04-07 18:42:27 +03:00
allegroai	0e4b99351f	Add --stop support for dynamic gpus Fix --stop mark tasks as aborted (not failed as before)	2021-04-07 18:42:10 +03:00
allegroai	81edd2860f	Fix --dynamic-gpus should keep original queue priority order	2021-03-31 23:55:12 +03:00
allegroai	14ac584577	Support k8s glue container env vars merging	2021-03-31 23:53:58 +03:00
allegroai	9ce6baf074	Fix broken k8s glue docker args parsing Fix empty env prevents override when merging template	2021-03-26 12:26:15 +03:00
allegroai	92a1e07b33	Fix local path replace back when using cache	2021-03-26 12:16:05 +03:00
allegroai	cb6bdece39	Fix cuda version from driver does not return minor version	2021-03-18 10:07:59 +02:00
allegroai	2ea38364bb	Change the default conda channel order, so it pulls the correct pytorch	2021-03-18 10:07:58 +02:00
allegroai	cf6fdc0d81	Add support for PyJWT v2	2021-03-18 10:07:58 +02:00
allegroai	91eec99563	Add conda debug prints (--debug)	2021-03-18 10:07:58 +02:00
allegroai	f8cbaa9a06	documentation	2021-03-18 03:05:26 +02:00