Version bump to v0.16.0

Fix GPU Windows monitoring support (Trains Issue #177)
Sync generated conf file with latest Trains
2025-06-26 18:16:15 +00:00 · 2020-08-10 17:28:00 +03:00 · 2020-08-10 08:07:51 +03:00 · 2020-08-08 14:44:45 +03:00 · 2020-08-08 14:43:25 +03:00 · 2020-07-30 14:30:23 +03:00
46 changed files with 2388 additions and 192 deletions
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-# TRAINS Agent
+# Allegro Trains Agent
 ## Deep Learning DevOps For Everyone - Now supporting all platforms (Linux, macOS, and Windows)

 "All the Deep-Learning DevOps your research needs, and then some... Because ain't nobody got time for that"
@@ -10,27 +10,27 @@

 ### Help improve Trains by filling our 2-min [user survey](https://allegro.ai/lp/trains-user-survey/)

-**TRAINS Agent is an AI experiment cluster solution.**
+**Trains Agent is an AI experiment cluster solution.**

 It is a zero configuration fire-and-forget execution agent, which combined with trains-server provides a full AI cluster solution.

 **Full AutoML in 5 steps** 
-1. Install the [TRAINS server](https://github.com/allegroai/trains-agent) (or use our [open server](https://demoapp.trains.allegro.ai))
-2. `pip install trains-agent` ([install](#installing-the-trains-agent) the TRAINS agent on any GPU machine: on-premises / cloud / ...)
-3. Add [TRAINS](https://github.com/allegroai/trains) to your code with just 2 lines & run it once (on your machine / laptop)
+1. Install the [Trains Server](https://github.com/allegroai/trains-agent) (or use our [open server](https://demoapp.trains.allegro.ai))
+2. `pip install trains-agent` ([install](#installing-the-trains-agent) the Trains Agent on any GPU machine: on-premises / cloud / ...)
+3. Add [Trains](https://github.com/allegroai/trains) to your code with just 2 lines & run it once (on your machine / laptop)
 4. Change the [parameters](#using-the-trains-agent) in the UI & schedule for [execution](#using-the-trains-agent) (or automate with an [AutoML pipeline](#automl-and-orchestration-pipelines-))
 5. :chart_with_downwards_trend: :chart_with_upwards_trend: :eyes:  :beer:


-**Using the TRAINS agent, you can now set up a dynamic cluster with \*epsilon DevOps**
+**Using the Trains Agent, you can now set up a dynamic cluster with \*epsilon DevOps**

 *epsilon - Because we are scientists :triangular_ruler: and nothing is really zero work

-(Experience TRAINS live at [https://demoapp.trains.allegro.ai](https://demoapp.trains.allegro.ai))
+(Experience Trains live at [https://demoapp.trains.allegro.ai](https://demoapp.trains.allegro.ai))
 <a href="https://demoapp.trains.allegro.ai"><img src="https://raw.githubusercontent.com/allegroai/trains-agent/9f1e86c1ca45c984ee13edc9353c7b10c55d7257/docs/screenshots.gif" width="100%"></a>

 ## Simple, Flexible Experiment Orchestration
-**The TRAINS Agent was built to address the DL/ML R&D DevOps needs:**
+**The Trains Agent was built to address the DL/ML R&D DevOps needs:**

 * Easily add & remove machines from the cluster
 * Reuse machines without the need for any dedicated containers or images
@@ -51,30 +51,30 @@ If you are considering K8S for your research, also consider that you will soon b
 In our experience, handling and building the environments, having to package every experiment in a docker, managing those hundreds (or more) containers and building pipelines on top of it all, is very complicated (also, it’s usually out of scope for the research team, and overwhelming even for the DevOps team).

 We feel there has to be a better way, that can be just as powerful for R&D and at the same time allow integration with K8S **when the need arises**.  
-(If you already have a K8S cluster for AI, detailed instructions on how to integrate TRAINS into your K8S cluster are *coming soon*.)
+(If you already have a K8S cluster for AI, detailed instructions on how to integrate Trains into your K8S cluster are [here](https://github.com/allegroai/trains-server-k8s/tree/master/trains-server-chart), with an included [helm chart](https://github.com/allegroai/trains-server-helm).)


-## Using the TRAINS Agent
+## Using the Trains Agent
 **Full scale HPC with a click of a button**

-TRAINS Agent is a job scheduler that listens on job queue(s), pulls jobs, sets the job environments, executes the job and monitors its progress.
+The Trains Agent is a job scheduler that listens on job queue(s), pulls jobs, sets the job environments, executes the job and monitors its progress.

-Any 'Draft' experiment can be scheduled for execution by a TRAINS agent.
+Any 'Draft' experiment can be scheduled for execution by a Trains agent.

 A previously run experiment can be put into 'Draft' state by either of two methods:
 * Using the **'Reset'** action from the experiment right-click context menu in the
-  TRAINS UI - This will clear any results and artifacts the previous run had created.
+  Trains UI - This will clear any results and artifacts the previous run had created.
 * Using the **'Clone'** action from the experiment right-click context menu in the
-  TRAINS UI - This will create a new 'Draft' experiment with the same configuration as the original experiment.
+  Trains UI - This will create a new 'Draft' experiment with the same configuration as the original experiment.

 An experiment is scheduled for execution using the **'Enqueue'** action from the experiment
- right-click context menu in the TRAINS UI and selecting the execution queue.
+ right-click context menu in the Trains UI and selecting the execution queue.

 See [creating an experiment and enqueuing it for execution](#from-scratch).

-Once an experiment is enqueued, it will be picked up and executed by a TRAINS agent monitoring this queue.
+Once an experiment is enqueued, it will be picked up and executed by a Trains agent monitoring this queue.

-The TRAINS UI Workers & Queues page provides ongoing execution information:
+The Trains UI Workers & Queues page provides ongoing execution information:
  - Workers Tab: Monitor your cluster
    - Review available resources
    - Monitor machines statistics (CPU / GPU / Disk / Network)
@@ -83,16 +83,16 @@ The TRAINS UI Workers & Queues page provides ongoing execution information:
    - Cancel or abort job execution
    - Move jobs between execution queues

-### What The TRAINS Agent Actually Does
-The TRAINS agent executes experiments using the following process:
+### What The Trains Agent Actually Does
+The Trains Agent executes experiments using the following process:
  - Create a new virtual environment (or launch the selected docker image)
  - Clone the code into the virtual-environment (or inside the docker)
  - Install python packages based on the package requirements listed for the experiment
-    - Special note for PyTorch: The TRAINS agent will automatically select the
+    - Special note for PyTorch: The Trains Agent will automatically select the
      torch packages based on the CUDA_VERSION environment variable of the machine
  - Execute the code, while monitoring the process
-  - Log all stdout/stderr in the TRAINS UI, including the cloning and installation process, for easy debugging
-  - Monitor the execution and allow you to manually abort the job using the TRAINS UI (or, in the unfortunate case of a code crash, catch the error and signal the experiment has failed)
+  - Log all stdout/stderr in the Trains UI, including the cloning and installation process, for easy debugging
+  - Monitor the execution and allow you to manually abort the job using the Trains UI (or, in the unfortunate case of a code crash, catch the error and signal the experiment has failed)

 ### System Design & Flow
 ```text
@@ -100,24 +100,24 @@ The TRAINS agent executes experiments using the following process:
                                                                              |  GPU  Machine   |
 Development Machine                                                           |                 |
 +------------------------+                                                    | +-------------+ |
-|    Data Scientist's    |                            +--------------+        | |TRAINS Agent | |
+|    Data Scientist's    |                            +--------------+        | |Trains Agent | |
 |      DL/ML Code        |                            |    WEB UI    |        | |             | |
 |                        |                            |              |        | | +---------+ | |
 |                        |                            |              |        | | |  DL/ML  | | |
 |                        |                            +--------------+        | | |  Code   | | |
 |                        |       User Clones Exp #1  / . . . . . . . /        | | |         | | |
 | +-------------------+  |           into Exp #2    / . . . . . . . /         | | +---------+ | |
-| |      TRAINS       |  |         +---------------/-_____________-/          | |             | |
+| |      Trains       |  |         +---------------/-_____________-/          | |             | |
 | +---------+---------+  |         |                                          | |      ^      | |
 +-----------|------------+         |                                          | +------|------+ |
            |                      |                                          +--------|--------+
 Auto-Magically                    |                                                   |
- Creates Exp #1                    |                                      The TRAINS Agent
+ Creates Exp #1                    |                                      The Trains Agent
             \          User Change Hyper-Parameters                      Pulls Exp #2, setup the
             |                     |                                      environment & clone code.
             |                     |                                      Start execution with the
 +------------|------------+        |            +--------------------+    new set of Hyper-Parameters.
-|  +---------v---------+  |        |            |   TRAINS-SERVER    |                 |
+|  +---------v---------+  |        |            |   Trains Server    |                 |
 |  | Experiment #1     |  |        |            |                    |                 |
 |  +-------------------+  |        |            |  Execution Queue   |                 |
 |            ||           |        |            |                    |                 |
@@ -128,17 +128,17 @@ Development Machine                                                           |
 |                         |          ------------->---------------+  |                 |
 |                         |  User Send Exp #2   | |Execute Exp #2 +--------------------+
 |                         |  For Execution      | +---------------+  |
-|     TRAINS-SERVER       |                     |                    |
+|     Trains Server       |                     |                    |
 +-------------------------+                     +--------------------+
 ```

-### Installing the TRAINS Agent
+### Installing the Trains Agent

 ```bash
 pip install trains-agent
 ```

-### TRAINS Agent Usage Examples
+### Trains Agent Usage Examples

 Full Interface and capabilities are available with
 ```bash
@@ -146,22 +146,22 @@ trains-agent --help
 trains-agent daemon --help
 ```

-### Configuring the TRAINS Agent
+### Configuring the Trains Agent

 ```bash
 trains-agent init
 ```

-Note: The TRAINS agent uses a cache folder to cache pip packages, apt packages and cloned repositories. The default TRAINS Agent cache folder is `~/.trains`
+Note: The Trains Agent uses a cache folder to cache pip packages, apt packages and cloned repositories. The default Trains Agent cache folder is `~/.trains`

 See full details in your configuration file at `~/trains.conf`

-Note: The **TRAINS agent** extends the **TRAINS** configuration file `~/trains.conf`
+Note: The **Trains agent** extends the **Trains** configuration file `~/trains.conf`
 They are designed to share the same configuration file, see example [here](docs/trains.conf)

-### Running the TRAINS Agent
+### Running the Trains Agent

-For debug and experimentation, start the TRAINS agent in `foreground` mode, where all the output is printed to screen
+For debug and experimentation, start the Trains agent in `foreground` mode, where all the output is printed to screen
 ```bash
 trains-agent daemon --queue default --foreground
 ```
@@ -190,9 +190,9 @@ trains-agent daemon --detached --gpus 0,1 --queue dual_gpu
 trains-agent daemon --detached --gpus 2,3 --queue dual_gpu
 ```

-#### Starting the TRAINS Agent in docker mode
+#### Starting the Trains Agent in docker mode

-For debug and experimentation, start the TRAINS agent in `foreground` mode, where all the output is printed to screen
+For debug and experimentation, start the Trains agent in `foreground` mode, where all the output is printed to screen
 ```bash
 trains-agent daemon --queue default --docker --foreground
 ```
@@ -215,7 +215,7 @@ trains-agent daemon --detached --gpus 0,1 --queue dual_gpu --docker nvidia/cuda
 trains-agent daemon --detached --gpus 2,3 --queue dual_gpu --docker nvidia/cuda
 ```

-#### Starting the TRAINS Agent - Priority Queues
+#### Starting the Trains Agent - Priority Queues

 Priority Queues are also supported, example use case:

@@ -223,14 +223,14 @@ High priority queue: `important_jobs`  Low priority queue: `default`
 ```bash
 trains-agent daemon --queue important_jobs default
 ```
-The **TRAINS agent** will first try to pull jobs from the `important_jobs` queue, only then it will fetch a job from the `default` queue.
+The **Trains Agent** will first try to pull jobs from the `important_jobs` queue, and only then will it fetch a job from the `default` queue.

 Adding queues, managing job order within a queue and moving jobs between queues, is available using the Web UI, see example on our [open server](https://demoapp.trains.allegro.ai/workers-and-queues/queues)

-# How do I create an experiment on the TRAINS server? <a name="from-scratch"></a>
-* Integrate [TRAINS](https://github.com/allegroai/trains) with your code
+## How do I create an experiment on the Trains Server? <a name="from-scratch"></a>
+* Integrate [Trains](https://github.com/allegroai/trains) with your code
 * Execute the code on your machine (Manually / PyCharm / Jupyter Notebook)
-* As your code is running, **TRAINS** creates an experiment logging all the necessary execution information:
+* As your code is running, **Trains** creates an experiment logging all the necessary execution information:
  - Git repository link and commit ID (or an entire jupyter notebook)
  - Git diff (we’re not saying you never commit and push, but still...)
  - Python packages used by your code (including specific versions used)
@@ -239,7 +239,7 @@ Adding queues, managing job order within a queue and moving jobs between queues,

  You now have a 'template' of your experiment with everything required for automated execution

-* In the TRAINS UI, Right click on the experiment and select 'clone'. A copy of your experiment will be created.
+* In the Trains UI, Right click on the experiment and select 'clone'. A copy of your experiment will be created.
 * You now have a new draft experiment cloned from your original experiment, feel free to edit it
  - Change the Hyper-Parameters
  - Switch to the latest code base of the repository
@@ -248,10 +248,31 @@ Adding queues, managing job order within a queue and moving jobs between queues,
  - Or simply change nothing to run the same experiment again...
 * Schedule the newly created experiment for execution: Right-click the experiment and select 'enqueue'

-# AutoML and Orchestration Pipelines <a name="automl-pipes"></a>
-The TRAINS Agent can also be used to implement AutoML orchestration and Experiment Pipelines in conjunction with the TRAINS package.
+## Trains-Agent Services Mode <a name="services"></a> 

-Sample AutoML & Orchestration examples can be found in the TRAINS [example/automl](https://github.com/allegroai/trains/tree/master/examples/automl) folder.
+Trains-Agent Services is a special mode of Trains-Agent that provides the ability to launch long-lasting jobs 
+that previously had to be executed on local / dedicated machines. It allows a single agent to 
+launch multiple dockers (Tasks) for different use cases. Example use cases include an auto-scaler service (spinning up instances
+when the need arises and the budget allows), Controllers (implementing pipelines and more sophisticated DevOps logic),
+Optimizers (such as Hyper-parameter Optimization or sweeping), and Applications (such as interactive Bokeh apps for
+increased data transparency).
+
+Trains-Agent Services mode will spin **any** task enqueued into the specified queue. 
+Every task launched by Trains-Agent Services will be registered as a new node in the system, 
+providing tracking and transparency capabilities. 
+Currently, Trains-Agent in services mode supports CPU-only configuration. Trains-Agent services mode can be launched alongside GPU agents.
+
+```bash
+trains-agent daemon --services-mode --detached --queue services --create-queue --docker ubuntu:18.04 --cpu-only
+```
+
+**Note**: It is the user's responsibility to make sure the proper tasks are pushed into the specified queue. 
+
+
+## AutoML and Orchestration Pipelines <a name="automl-pipes"></a>
+The Trains Agent can also be used to implement AutoML orchestration and Experiment Pipelines in conjunction with the Trains package.
+
+Sample AutoML & Orchestration examples can be found in the Trains [example/automl](https://github.com/allegroai/trains/tree/master/examples/automl) folder.

 AutoML examples
  - [Toy Keras training experiment](https://github.com/allegroai/trains/blob/master/examples/automl/automl_base_template_keras_simple.py)
@@ -265,6 +286,6 @@ Experiment Pipeline examples
  - [Second step experiment](https://github.com/allegroai/trains/blob/master/examples/automl/toy_base_task.py)
    - In order to create an experiment-template in the system, this code must be executed once manually

-# License
+## License

 Apache License, Version 2.0 (see the [LICENSE](https://www.apache.org/licenses/LICENSE-2.0.html) for more information)
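
The README's priority-queue behaviour (pull from `important_jobs` first, fall back to `default`) can be pictured with a small sketch. This is only an illustration of the described ordering, not the agent's actual implementation; `pull_next_job` and the in-memory queues are hypothetical names introduced here.

```python
from collections import deque


def pull_next_job(queues, priority_order):
    """Return (queue_name, job) from the first non-empty queue, or None.

    Hypothetical helper: queues are scanned in priority order, so a lower
    priority queue is only consulted when every earlier queue is empty.
    """
    for name in priority_order:
        if queues[name]:
            return name, queues[name].popleft()
    return None


# Illustrative queue contents (assumed, not taken from the commit)
queues = {
    "important_jobs": deque(["exp-7"]),
    "default": deque(["exp-1", "exp-2"]),
}
order = ["important_jobs", "default"]

print(pull_next_job(queues, order))
print(pull_next_job(queues, order))
```

The order of names passed in mirrors the order of queue names given to `--queue` on the command line.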
--- a/docker/agent/Dockerfile
+++ b/docker/agent/Dockerfile
@@ -0,0 +1,18 @@
+# syntax = docker/dockerfile
+FROM nvidia/cuda
+
+WORKDIR /usr/agent
+
+COPY . /usr/agent
+
+RUN apt-get update
+RUN apt-get dist-upgrade -y
+RUN apt-get install -y curl python3-pip git
+RUN curl -sSL https://get.docker.com/ | sh
+RUN python3 -m pip install -U pip
+RUN python3 -m pip install trains-agent
+RUN python3 -m pip install -U "cryptography>=2.9"
+
+ENV TRAINS_DOCKER_SKIP_GPUS_FLAG=1
+
+ENTRYPOINT ["/usr/agent/entrypoint.sh"]
--- a/docker/agent/entrypoint.sh
+++ b/docker/agent/entrypoint.sh
@@ -0,0 +1,19 @@
+#!/bin/sh
+
+LOWER_PIP_UPDATE_VERSION="$(echo "$PIP_UPDATE_VERSION" | tr '[:upper:]' '[:lower:]')"
+LOWER_TRAINS_AGENT_UPDATE_VERSION="$(echo "$TRAINS_AGENT_UPDATE_VERSION" | tr '[:upper:]' '[:lower:]')"
+
+if [ "$LOWER_PIP_UPDATE_VERSION" = "yes" ] || [ "$LOWER_PIP_UPDATE_VERSION" = "true" ] ; then
+    python3 -m pip install -U pip
+elif [ ! -z "$LOWER_PIP_UPDATE_VERSION" ] ; then
+    python3 -m pip install pip$LOWER_PIP_UPDATE_VERSION ;
+fi
+
+echo "TRAINS_AGENT_UPDATE_VERSION = $LOWER_TRAINS_AGENT_UPDATE_VERSION"
+if [ "$LOWER_TRAINS_AGENT_UPDATE_VERSION" = "yes" ] || [ "$LOWER_TRAINS_AGENT_UPDATE_VERSION" = "true" ] ; then
+    python3 -m pip install trains-agent -U
+elif [ ! -z "$LOWER_TRAINS_AGENT_UPDATE_VERSION" ] ; then
+    python3 -m pip install trains-agent$LOWER_TRAINS_AGENT_UPDATE_VERSION ;
+fi
+
+python3 -m trains_agent daemon --docker "$TRAINS_AGENT_DEFAULT_BASE_DOCKER" --force-current-version $TRAINS_AGENT_EXTRA_ARGS
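
The branching in this entrypoint treats `PIP_UPDATE_VERSION` and `TRAINS_AGENT_UPDATE_VERSION` the same way: `yes`/`true` (case-insensitive) means "upgrade to latest", any other non-empty value is appended to the package name as a version specifier, and an empty value means "do nothing". A minimal Python sketch of that decision follows; `pip_install_args` is a hypothetical name used only for illustration.

```python
def pip_install_args(package, update_version):
    """Mirror the entrypoint's branching for one package.

    Returns the argument list that would follow `python3 -m pip`,
    or None when no install should run.
    """
    # The shell script lowercases the value with `tr` before comparing
    value = (update_version or "").lower()
    if value in ("yes", "true"):
        return ["install", "-U", package]
    if value:
        # Any other non-empty value is used as a specifier, e.g. "==0.16.0"
        return ["install", package + value]
    return None


print(pip_install_args("trains-agent", "TRUE"))
print(pip_install_args("trains-agent", "==0.16.0"))
print(pip_install_args("pip", ""))
```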
--- a/docker/services/Dockerfile
+++ b/docker/services/Dockerfile
@@ -0,0 +1,25 @@
+# syntax = docker/dockerfile
+FROM ubuntu:18.04
+
+WORKDIR /usr/agent
+
+COPY . /usr/agent
+
+ENV LC_ALL=en_US.UTF-8
+ENV LANG=en_US.UTF-8
+ENV LANGUAGE=en_US.UTF-8
+ENV PYTHONIOENCODING=UTF-8
+
+RUN apt-get update
+RUN apt-get dist-upgrade -y
+RUN apt-get install -y locales
+
+RUN locale-gen en_US.UTF-8
+
+RUN apt-get install -y curl python3-pip git
+RUN curl -sSL https://get.docker.com/ | sh
+RUN python3 -m pip install -U pip
+RUN python3 -m pip install trains-agent
+RUN python3 -m pip install -U "cryptography>=2.9"
+
+ENTRYPOINT ["/usr/agent/entrypoint.sh"]
--- a/docker/services/entrypoint.sh
+++ b/docker/services/entrypoint.sh
@@ -0,0 +1,14 @@
+#!/bin/sh
+
+if [ -z "$TRAINS_FILES_HOST" ]; then
+    TRAINS_HOST_IP=${TRAINS_HOST_IP:-$(curl -s https://ifconfig.me/ip)}
+fi
+
+TRAINS_FILES_HOST=${TRAINS_FILES_HOST:-"http://$TRAINS_HOST_IP:8081"}
+TRAINS_WEB_HOST=${TRAINS_WEB_HOST:-"http://$TRAINS_HOST_IP:8080"}
+TRAINS_API_HOST=${TRAINS_API_HOST:-"http://$TRAINS_HOST_IP:8008"}
+
+echo $TRAINS_FILES_HOST $TRAINS_WEB_HOST $TRAINS_API_HOST 1>&2
+
+python3 -m pip install -q -U "trains-agent${TRAINS_AGENT_UPDATE_VERSION}"
+trains-agent daemon --services-mode --queue services --create-queue --docker "$TRAINS_AGENT_DEFAULT_BASE_DOCKER" --cpu-only $TRAINS_AGENT_EXTRA_ARGS
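
The host defaulting in this entrypoint can be restated as a small Python sketch. It assumes the shell semantics of `${VAR:-default}` (the default applies when the variable is unset or empty); `resolve_hosts` and `lookup_public_ip` are hypothetical names, the latter standing in for the `curl` call.

```python
def resolve_hosts(env, lookup_public_ip):
    """Return the three host URLs the entrypoint would end up with."""
    host_ip = env.get("TRAINS_HOST_IP", "")
    # The public IP is only looked up when TRAINS_FILES_HOST is empty
    if not env.get("TRAINS_FILES_HOST"):
        host_ip = host_ip or lookup_public_ip()
    return {
        "TRAINS_FILES_HOST": env.get("TRAINS_FILES_HOST") or "http://{}:8081".format(host_ip),
        "TRAINS_WEB_HOST": env.get("TRAINS_WEB_HOST") or "http://{}:8080".format(host_ip),
        "TRAINS_API_HOST": env.get("TRAINS_API_HOST") or "http://{}:8008".format(host_ip),
    }


# Illustrative address only (documentation range), not a real server
hosts = resolve_hosts({}, lambda: "203.0.113.5")
for name in sorted(hosts):
    print(name, hosts[name])
```

One consequence worth noting under these assumptions: if `TRAINS_FILES_HOST` is set but `TRAINS_HOST_IP` and the other two hosts are not, the web and API URLs are built from an empty address.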
--- a/docs/trains.conf
+++ b/docs/trains.conf
@@ -13,11 +13,13 @@ api {
 }

 agent {
-    # Set GIT user/pass credentials
-    # leave blank for GIT SSH credentials
+    # Set GIT user/pass credentials (if user/pass are set, GIT protocol will be set to https)
+    # leave blank for GIT SSH credentials (set force_git_ssh_protocol=true to force SSH protocol)
    git_user=""
    git_pass=""

+    # Force GIT protocol to use SSH regardless of the git url (Assumes GIT user/pass are blank)
+    force_git_ssh_protocol: false

    # unique name of this worker, if None, created based on hostname:process_id
    # Overridden with os environment: TRAINS_WORKER_NAME
@@ -104,11 +106,16 @@ agent {

    default_docker: {
        # default docker image to use when running in docker mode
-        image: "nvidia/cuda"
+        image: "nvidia/cuda:10.1-runtime-ubuntu18.04"

        # optional arguments to pass to docker image
        # arguments: ["--ipc=host"]
    }
+
+    # CUDA versions used for Conda setup & solving PyTorch wheel packages
+    # It should be detected automatically. Override with OS environment variables CUDA_VERSION / CUDNN_VERSION
+    # cuda_version: 10.1
+    # cudnn_version: 7.6
 }

 sdk {
--- a/requirements.txt
+++ b/requirements.txt
@@ -3,7 +3,6 @@ enum34>=0.9 ; python_version < '3.6'
 furl>=2.0.0
 future>=0.16.0
 humanfriendly>=2.1
-jsonmodels>=2.2
 jsonschema>=2.6.0
 pathlib2>=2.3.0
 psutil>=3.4.2
@@ -14,7 +13,6 @@ pyjwt>=1.6.4
 PyYAML>=3.12
 requests-file>=1.4.2
 requests>=2.20.0
-requirements_parser>=0.2.0
 six>=1.11.0
 tqdm>=4.19.5
 typing>=3.6.4
--- a/setup.py
+++ b/setup.py
@@ -4,28 +4,31 @@ TRAINS-AGENT DevOps for machine/deep learning
 https://github.com/allegroai/trains-agent
 """

+import os.path
 # Always prefer setuptools over distutils
 from setuptools import setup, find_packages
-from six import exec_
-from pathlib2 import Path

+def read_text(filepath):
+    with open(filepath, "r") as f:
+        return f.read()

-here = Path(__file__).resolve().parent
-
+here = os.path.dirname(__file__)
 # Get the long description from the README file
-long_description = (here / 'README.md').read_text()
+long_description = read_text(os.path.join(here, 'README.md'))


-def read_version_string():
-    result = {}
-    exec_((here / 'trains_agent/version.py').read_text(), result)
-    return result['__version__']
+def read_version_string(version_file):
+    for line in read_text(version_file).splitlines():
+        if line.startswith('__version__'):
+            delim = '"' if '"' in line else "'"
+            return line.split(delim)[1]
+    else:
+        raise RuntimeError("Unable to find version string.")


-version = read_version_string()
-
-requirements = (here / 'requirements.txt').read_text().splitlines()
+version = read_version_string("trains_agent/version.py")

+requirements = read_text(os.path.join(here, 'requirements.txt')).splitlines()

 setup(
    name='trains_agent',
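
The new `read_version_string` in this hunk parses the version file as text instead of `exec`-ing it, which lets `setup.py` drop its `six` and `pathlib2` imports. The same parsing can be tried on an in-memory string; `parse_version` and the sample input below are illustrative and not part of the commit.

```python
def parse_version(text):
    """Return the quoted value on the first line that starts with __version__."""
    for line in text.splitlines():
        if line.startswith('__version__'):
            # Accept either quoting style, as the commit's helper does
            delim = '"' if '"' in line else "'"
            return line.split(delim)[1]
    raise RuntimeError("Unable to find version string.")


sample = "# illustrative version file\n__version__ = '0.16.0'\n"
print(parse_version(sample))
```

This approach only recognizes a `__version__` assignment that starts at the beginning of a line and uses a quoted literal.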
--- a/tests/package/ssh_conversion.py
+++ b/tests/package/ssh_conversion.py
@@ -30,6 +30,6 @@ from trains_agent.helper.repo import VCS
    ),
 )
 def test(url, expected):
-    result = VCS.resolve_ssh_url(url)
+    result = VCS.replace_ssh_url(url)
    expected = expected or url
    assert result == expected
--- a/trains_agent/backend_api/config/default/agent.conf
+++ b/trains_agent/backend_api/config/default/agent.conf
@@ -9,10 +9,14 @@
    # worker_name: "trains-agent-machine1"
    worker_name: ""

-    # Set GIT user/pass credentials for cloning code, leave blank for GIT SSH credentials.
+    # Set GIT user/pass credentials (if user/pass are set, GIT protocol will be set to https)
+    # leave blank for GIT SSH credentials (set force_git_ssh_protocol=true to force SSH protocol)
    # git_user: ""
    # git_pass: ""

+    # Force GIT protocol to use SSH regardless of the git url (Assumes GIT user/pass are blank)
+    force_git_ssh_protocol: false
+
    # Set the python version to use when creating the virtual environment and launching the experiment
    # Example values: "/usr/bin/python3" or "/usr/local/bin/python3.6"
    # The default is the python executing the trains_agent
@@ -88,12 +92,23 @@

    default_docker: {
        # default docker image to use when running in docker mode
-        image: "nvidia/cuda"
+        image: "nvidia/cuda:10.1-runtime-ubuntu18.04"

        # optional arguments to pass to docker image
        # arguments: ["--ipc=host", ]
    }

+    # set the initial bash script to execute at the startup of any docker.
+    # all lines will be executed regardless of their exit code.
+    # {python_single_digit} is translated to 'python3' or 'python2' according to requested python version
+    # docker_init_bash_script = [
+    #     "echo 'Binary::apt::APT::Keep-Downloaded-Packages \"true\";' > /etc/apt/apt.conf.d/docker-clean",
+    #     "chown -R root /root/.cache/pip",
+    #     "apt-get update",
+    #     "apt-get install -y git libsm6 libxext6 libxrender-dev libglib2.0-0",
+    #     "(which {python_single_digit} && {python_single_digit} -m pip --version) || apt-get install -y {python_single_digit}-pip",
+    # ]
+
    # cuda versions used for solving pytorch wheel packages
    # should be detected automatically. Override with os environment CUDA_VERSION / CUDNN_VERSION
    # cuda_version: 10.1
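
The `{python_single_digit}` placeholder in `docker_init_bash_script` is described as being translated to `python3` or `python2` according to the requested Python version. A rough sketch of such a substitution is shown below; `render_init_script` is a hypothetical helper, and the agent's real substitution code may differ.

```python
def render_init_script(lines, python_version):
    """Replace {python_single_digit} with 'python<major>' in each script line."""
    major = str(python_version).split(".")[0]
    token = "python" + major
    # Plain string replacement, so other braces in a line are left untouched
    return [line.replace("{python_single_digit}", token) for line in lines]


script = [
    "apt-get update",
    "(which {python_single_digit} && {python_single_digit} -m pip --version) "
    "|| apt-get install -y {python_single_digit}-pip",
]
for line in render_init_script(script, "3.6"):
    print(line)
```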
--- a/trains_agent/backend_api/config/default/sdk.conf
+++ b/trains_agent/backend_api/config/default/sdk.conf
@@ -37,6 +37,9 @@
            quality: 87
            subsampling: 0
        }
+
+        # Support plot-per-graph fully matching Tensorboard behavior (i.e. if this is set to true, each series should have its own graph)
+        tensorboard_single_series_per_graph: false
    }

    network {
@@ -117,11 +120,11 @@

    log {
        # debugging feature: set this to true to make null log propagate messages to root logger (so they appear in stdout)
-        null_log_propagate: False
+        null_log_propagate: false
        task_log_buffer_capacity: 66

        # disable urllib info and lower levels
-        disable_urllib3_info: True
+        disable_urllib3_info: true
    }

    development {
@@ -131,14 +134,30 @@
        task_reuse_time_window_in_hours: 72.0

        # Run VCS repository detection asynchronously
-        vcs_repo_detect_async: True
+        vcs_repo_detect_async: true

        # Store uncommitted git/hg source code diff in experiment manifest when training in development mode
        # This stores "git diff" or "hg diff" into the experiment's "script.requirements.diff" section
-        store_uncommitted_code_diff_on_train: True
+        store_uncommitted_code_diff: true

        # Support stopping an experiment in case it was externally stopped, status was changed or task was reset
-        support_stopping: True
+        support_stopping: true
+
+        # Default Task output_uri. if output_uri is not provided to Task.init, default_output_uri will be used instead.
+        default_output_uri: ""
+
+        # Default auto generated requirements optimize for smaller requirements
+        # If True, analyze the entire repository regardless of the entry point.
+        # If False, first analyze the entry point script; if it does not reference other local files,
+        # do not analyze the entire repository.
+        force_analyze_entire_repo: false
+
+        # If set to true, *trains* update message will not be printed to the console
+        # this value can be overwritten with os environment variable TRAINS_SUPPRESS_UPDATE_MESSAGE=1
+        suppress_update_message: false
+
+        # If this flag is true (default is false), instead of analyzing the code with Pigar, analyze with `pip freeze`
+        detect_with_pip_freeze: false

        # Development mode worker
        worker {
@@ -149,7 +168,11 @@
            ping_period_sec: 30

            # Log all stdout & stderr
-            log_stdout: True
+            log_stdout: true
+
+            # compatibility feature, report memory usage for the entire machine
+            # default (false), report only on the running process and its sub-processes
+            report_global_mem_used: false
        }
    }
-}
+}
--- a/trains_agent/backend_api/session/jsonmodels/__init__.py
+++ b/trains_agent/backend_api/session/jsonmodels/__init__.py
@@ -0,0 +1,9 @@
+# coding: utf-8
+
+__author__ = 'Szczepan Cieślik'
+__email__ = 'szczepan.cieslik@gmail.com'
+__version__ = '2.4'
+
+from . import models
+from . import fields
+from . import errors
--- a/trains_agent/backend_api/session/jsonmodels/builders.py
+++ b/trains_agent/backend_api/session/jsonmodels/builders.py
@@ -0,0 +1,230 @@
+"""Builders to generate in memory representation of model and fields tree."""
+
+from __future__ import absolute_import
+
+from collections import defaultdict
+
+import six
+
+from . import errors
+from .fields import NotSet
+
+
+class Builder(object):
+
+    def __init__(self, parent=None, nullable=False, default=NotSet):
+        self.parent = parent
+        self.types_builders = {}
+        self.types_count = defaultdict(int)
+        self.definitions = set()
+        self.nullable = nullable
+        self.default = default
+
+    @property
+    def has_default(self):
+        return self.default is not NotSet
+
+    def register_type(self, type, builder):
+        if self.parent:
+            return self.parent.register_type(type, builder)
+
+        self.types_count[type] += 1
+        if type not in self.types_builders:
+            self.types_builders[type] = builder
+
+    def get_builder(self, type):
+        if self.parent:
+            return self.parent.get_builder(type)
+
+        return self.types_builders[type]
+
+    def count_type(self, type):
+        if self.parent:
+            return self.parent.count_type(type)
+
+        return self.types_count[type]
+
+    @staticmethod
+    def maybe_build(value):
+        return value.build() if isinstance(value, Builder) else value
+
+    def add_definition(self, builder):
+        if self.parent:
+            return self.parent.add_definition(builder)
+
+        self.definitions.add(builder)
+
+
+class ObjectBuilder(Builder):
+
+    def __init__(self, model_type, *args, **kwargs):
+        super(ObjectBuilder, self).__init__(*args, **kwargs)
+        self.properties = {}
+        self.required = []
+        self.type = model_type
+
+        self.register_type(self.type, self)
+
+    def add_field(self, name, field, schema):
+        _apply_validators_modifications(schema, field)
+        self.properties[name] = schema
+        if field.required:
+            self.required.append(name)
+
+    def build(self):
+        builder = self.get_builder(self.type)
+        if self.is_definition and not self.is_root:
+            self.add_definition(builder)
+            [self.maybe_build(value) for _, value in self.properties.items()]
+            return '#/definitions/{name}'.format(name=self.type_name)
+        else:
+            return builder.build_definition(nullable=self.nullable)
+
+    @property
+    def type_name(self):
+        module_name = '{module}.{name}'.format(
+            module=self.type.__module__,
+            name=self.type.__name__,
+        )
+        return module_name.replace('.', '_').lower()
+
+    def build_definition(self, add_defintitions=True, nullable=False):
+        properties = dict(
+            (name, self.maybe_build(value))
+            for name, value
+            in self.properties.items()
+        )
+        schema = {
+            'type': 'object',
+            'additionalProperties': False,
+            'properties': properties,
+        }
+        if self.required:
+            schema['required'] = self.required
+        if self.definitions and add_defintitions:
+            schema['definitions'] = dict(
+                (builder.type_name, builder.build_definition(False, False))
+                for builder in self.definitions
+            )
+        return schema
+
+    @property
+    def is_definition(self):
+        if self.count_type(self.type) > 1:
+            return True
+        elif self.parent:
+            return self.parent.is_definition
+        else:
+            return False
+
+    @property
+    def is_root(self):
+        return not bool(self.parent)
+
+
+def _apply_validators_modifications(field_schema, field):
+    for validator in field.validators:
+        try:
+            validator.modify_schema(field_schema)
+        except AttributeError:
+            pass
+
+
+class PrimitiveBuilder(Builder):
+
+    def __init__(self, type, *args, **kwargs):
+        super(PrimitiveBuilder, self).__init__(*args, **kwargs)
+        self.type = type
+
+    def build(self):
+        schema = {}
+        if issubclass(self.type, six.string_types):
+            obj_type = 'string'
+        elif issubclass(self.type, bool):
+            obj_type = 'boolean'
+        elif issubclass(self.type, int):
+            obj_type = 'number'
+        elif issubclass(self.type, float):
+            obj_type = 'number'
+        else:
+            raise errors.FieldNotSupported(
+                "Can't specify value schema!", self.type
+            )
+
+        if self.nullable:
+            obj_type = [obj_type, 'null']
+        schema['type'] = obj_type
+
+        if self.has_default:
+            schema["default"] = self.default
+
+        return schema
+
+
+class ListBuilder(Builder):
+
+    def __init__(self, *args, **kwargs):
+        super(ListBuilder, self).__init__(*args, **kwargs)
+        self.schemas = []
+
+    def add_type_schema(self, schema):
+        self.schemas.append(schema)
+
+    def build(self):
+        schema = {'type': 'array'}
+        if self.nullable:
+            self.add_type_schema({'type': 'null'})
+
+        if self.has_default:
+            schema["default"] = [self.to_struct(i) for i in self.default]
+
+        schemas = [self.maybe_build(s) for s in self.schemas]
+        if len(schemas) == 1:
+            items = schemas[0]
+        else:
+            items = {'oneOf': schemas}
+
+        schema['items'] = items
+        return schema
+
+    @property
+    def is_definition(self):
+        return self.parent.is_definition
+
+    @staticmethod
+    def to_struct(item):
+        from .models import Base
+        if isinstance(item, Base):
+            return item.to_struct()
+        return item
+
+
+class EmbeddedBuilder(Builder):
+
+    def __init__(self, *args, **kwargs):
+        super(EmbeddedBuilder, self).__init__(*args, **kwargs)
+        self.schemas = []
+
+    def add_type_schema(self, schema):
+        self.schemas.append(schema)
+
+    def build(self):
+        if self.nullable:
+            self.add_type_schema({'type': 'null'})
+
+        schemas = [self.maybe_build(schema) for schema in self.schemas]
+        if len(schemas) == 1:
+            schema = schemas[0]
+        else:
+            schema = {'oneOf': schemas}
+
+        if self.has_default:
+            # The default value of EmbeddedField is expected to be an instance
+            # of a subclass of models.Base, thus have `to_struct`
+            schema["default"] = self.default.to_struct()
+
+        return schema
+
+    @property
+    def is_definition(self):
+        return self.parent.is_definition
--- a/trains_agent/backend_api/session/jsonmodels/collections.py
+++ b/trains_agent/backend_api/session/jsonmodels/collections.py
@@ -0,0 +1,21 @@
+
+
+class ModelCollection(list):
+
+    """`ModelCollection` is a list that validates stored values.
+
+    Each value is validated against the field passed to `__init__` whenever
+    it is appended or assigned by index.
+
+    """
+
+    def __init__(self, field):
+        self.field = field
+
+    def append(self, value):
+        self.field.validate_single_value(value)
+        super(ModelCollection, self).append(value)
+
+    def __setitem__(self, key, value):
+        self.field.validate_single_value(value)
+        super(ModelCollection, self).__setitem__(key, value)
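A minimal, self-contained sketch of how `ModelCollection` behaves. It assumes a hypothetical stand-in field, `IntOnlyField`, which is not part of this patch and only provides the `validate_single_value` hook that the collection calls. The class from the hunk above is repeated so the snippet runs on its own. Only `append` and `__setitem__` are overridden in the patch, so inherited mutators such as `extend` are not validated.

```python
class IntOnlyField(object):
    """Hypothetical stand-in for a list field restricted to ints."""

    def validate_single_value(self, value):
        if not isinstance(value, int):
            raise ValueError(
                'expected int, got {}'.format(type(value).__name__))


class ModelCollection(list):
    """Copy of the class added in collections.py, repeated for the sketch."""

    def __init__(self, field):
        self.field = field

    def append(self, value):
        self.field.validate_single_value(value)
        super(ModelCollection, self).append(value)

    def __setitem__(self, key, value):
        self.field.validate_single_value(value)
        super(ModelCollection, self).__setitem__(key, value)


def try_append(collection, value):
    """Return True if the value was accepted, False if it was rejected."""
    try:
        collection.append(value)
        return True
    except ValueError:
        return False


items = ModelCollection(IntOnlyField())
accepted = try_append(items, 1)
rejected = not try_append(items, 'two')
items.extend(['unchecked'])  # inherited from list, so no validation runs here
```

If stricter behaviour is wanted, the same pattern could be applied to `extend` and `insert`; whether that is desirable for this package is a design decision the patch does not address.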
--- a/trains_agent/backend_api/session/jsonmodels/errors.py
+++ b/trains_agent/backend_api/session/jsonmodels/errors.py
@@ -0,0 +1,15 @@
+
+
+class ValidationError(RuntimeError):
+
+    pass
+
+
+class FieldNotFound(RuntimeError):
+
+    pass
+
+
+class FieldNotSupported(ValueError):
+
+    pass
--- a/trains_agent/backend_api/session/jsonmodels/fields.py
+++ b/trains_agent/backend_api/session/jsonmodels/fields.py
@@ -0,0 +1,488 @@
+import datetime
+import re
+from weakref import WeakKeyDictionary
+
+import six
+from dateutil.parser import parse
+
+from .errors import ValidationError
+from .collections import ModelCollection
+
+
+# unique marker for "no default value specified". None is not good enough since
+# it is a completely valid default value.
+NotSet = object()
+
+
+class BaseField(object):
+
+    """Base class for all fields."""
+
+    types = None
+
+    def __init__(
+            self,
+            required=False,
+            nullable=False,
+            help_text=None,
+            validators=None,
+            default=NotSet,
+            name=None):
+        self.memory = WeakKeyDictionary()
+        self.required = required
+        self.help_text = help_text
+        self.nullable = nullable
+        self._assign_validators(validators)
+        self.name = name
+        self._validate_name()
+        if default is not NotSet:
+            self.validate(default)
+        self._default = default
+
+    @property
+    def has_default(self):
+        return self._default is not NotSet
+
+    def _assign_validators(self, validators):
+        if validators and not isinstance(validators, list):
+            validators = [validators]
+        self.validators = validators or []
+
+    def __set__(self, instance, value):
+        self._finish_initialization(type(instance))
+        value = self.parse_value(value)
+        self.validate(value)
+        self.memory[instance._cache_key] = value
+
+    def __get__(self, instance, owner=None):
+        if instance is None:
+            self._finish_initialization(owner)
+            return self
+
+        self._finish_initialization(type(instance))
+
+        self._check_value(instance)
+        return self.memory[instance._cache_key]
+
+    def _finish_initialization(self, owner):
+        pass
+
+    def _check_value(self, obj):
+        if obj._cache_key not in self.memory:
+            self.__set__(obj, self.get_default_value())
+
+    def validate_for_object(self, obj):
+        value = self.__get__(obj)
+        self.validate(value)
+
+    def validate(self, value):
+        self._check_types()
+        self._validate_against_types(value)
+        self._check_against_required(value)
+        self._validate_with_custom_validators(value)
+
+    def _check_against_required(self, value):
+        if value is None and self.required:
+            raise ValidationError('Field is required!')
+
+    def _validate_against_types(self, value):
+        if value is not None and not isinstance(value, self.types):
+            raise ValidationError(
+                'Value is wrong, expected type "{types}"'.format(
+                    types=', '.join([t.__name__ for t in self.types])
+                ),
+                value,
+            )
+
+    def _check_types(self):
+        if self.types is None:
+            raise ValidationError(
+                'Field "{type}" is not usable, try '
+                'different field type.'.format(type=type(self).__name__))
+
+    def to_struct(self, value):
+        """Cast value to Python structure."""
+        return value
+
+    def parse_value(self, value):
+        """Parse value from primitive to desired format.
+
+        Each field can parse the value into the form it needs (such as a
+        string or an int).
+
+        """
+        return value
+
+    def _validate_with_custom_validators(self, value):
+        if value is None and self.nullable:
+            return
+
+        for validator in self.validators:
+            try:
+                validator.validate(value)
+            except AttributeError:
+                validator(value)
+
+    def get_default_value(self):
+        """Get default value for field.
+
+        Each field can specify its default.
+
+        """
+        return self._default if self.has_default else None
+
+    def _validate_name(self):
+        if self.name is None:
+            return
+        if not re.match(r'^[A-Za-z_](([\w\-]*)?\w+)?$', self.name):
+            raise ValueError('Wrong name', self.name)
+
+    def structue_name(self, default):
+        return self.name if self.name is not None else default
+
+
+class StringField(BaseField):
+
+    """String field."""
+
+    types = six.string_types
+
+
+class IntField(BaseField):
+
+    """Integer field."""
+
+    types = (int,)
+
+    def parse_value(self, value):
+        """Cast value to `int`, e.g. from string or long"""
+        parsed = super(IntField, self).parse_value(value)
+        if parsed is None:
+            return parsed
+        return int(parsed)
+
+
+class FloatField(BaseField):
+
+    """Float field."""
+
+    types = (float, int)
+
+
+class BoolField(BaseField):
+
+    """Bool field."""
+
+    types = (bool,)
+
+    def parse_value(self, value):
+        """Cast value to `bool`."""
+        parsed = super(BoolField, self).parse_value(value)
+        return bool(parsed) if parsed is not None else None
+
+
+class ListField(BaseField):
+
+    """List field."""
+
+    types = (list,)
+
+    def __init__(self, items_types=None, *args, **kwargs):
+        """Init.
+
+        `ListField` is **never required**. To control the number of items,
+        use validators.
+
+        """
+        self._assign_types(items_types)
+        super(ListField, self).__init__(*args, **kwargs)
+        self.required = False
+
+    def get_default_value(self):
+        default = super(ListField, self).get_default_value()
+        if default is None:
+            return ModelCollection(self)
+        return default
+
+    def _assign_types(self, items_types):
+        if items_types:
+            try:
+                self.items_types = tuple(items_types)
+            except TypeError:
+                self.items_types = (items_types,)
+        else:
+            self.items_types = tuple()
+
+        types = []
+        for type_ in self.items_types:
+            if isinstance(type_, six.string_types):
+                types.append(_LazyType(type_))
+            else:
+                types.append(type_)
+        self.items_types = tuple(types)
+
+    def validate(self, value):
+        super(ListField, self).validate(value)
+
+        if len(self.items_types) == 0:
+            return
+
+        for item in value:
+            self.validate_single_value(item)
+
+    def validate_single_value(self, item):
+        if len(self.items_types) == 0:
+            return
+
+        if not isinstance(item, self.items_types):
+            raise ValidationError(
+                'All items must be instances '
+                'of "{types}", and not "{type}".'.format(
+                    types=', '.join([t.__name__ for t in self.items_types]),
+                    type=type(item).__name__,
+                ))
+
+    def parse_value(self, values):
+        """Cast value to proper collection."""
+        result = self.get_default_value()
+
+        if not values:
+            return result
+
+        if not isinstance(values, list):
+            return values
+
+        return [self._cast_value(value) for value in values]
+
+    def _cast_value(self, value):
+        if isinstance(value, self.items_types):
+            return value
+        else:
+            if len(self.items_types) != 1:
+                tpl = 'Cannot decide which type to choose from "{types}".'
+                raise ValidationError(
+                    tpl.format(
+                        types=', '.join([t.__name__ for t in self.items_types])
+                    )
+                )
+            return self.items_types[0](**value)
+
+    def _finish_initialization(self, owner):
+        super(ListField, self)._finish_initialization(owner)
+
+        types = []
+        for type_ in self.items_types:
+            if isinstance(type_, _LazyType):
+                types.append(type_.evaluate(owner))
+            else:
+                types.append(type_)
+        self.items_types = tuple(types)
+
+    def _elem_to_struct(self, value):
+        try:
+            return value.to_struct()
+        except AttributeError:
+            return value
+
+    def to_struct(self, values):
+        return [self._elem_to_struct(v) for v in values]
+
+
+class EmbeddedField(BaseField):
+
+    """Field for embedded models."""
+
+    def __init__(self, model_types, *args, **kwargs):
+        self._assign_model_types(model_types)
+        super(EmbeddedField, self).__init__(*args, **kwargs)
+
+    def _assign_model_types(self, model_types):
+        if not isinstance(model_types, (list, tuple)):
+            model_types = (model_types,)
+
+        types = []
+        for type_ in model_types:
+            if isinstance(type_, six.string_types):
+                types.append(_LazyType(type_))
+            else:
+                types.append(type_)
+        self.types = tuple(types)
+
+    def _finish_initialization(self, owner):
+        super(EmbeddedField, self)._finish_initialization(owner)
+
+        types = []
+        for type_ in self.types:
+            if isinstance(type_, _LazyType):
+                types.append(type_.evaluate(owner))
+            else:
+                types.append(type_)
+        self.types = tuple(types)
+
+    def validate(self, value):
+        super(EmbeddedField, self).validate(value)
+        try:
+            value.validate()
+        except AttributeError:
+            pass
+
+    def parse_value(self, value):
+        """Parse value to proper model type."""
+        if not isinstance(value, dict):
+            return value
+
+        embed_type = self._get_embed_type()
+        return embed_type(**value)
+
+    def _get_embed_type(self):
+        if len(self.types) != 1:
+            raise ValidationError(
+                'Cannot decide which type to choose from "{types}".'.format(
+                    types=', '.join([t.__name__ for t in self.types])
+                )
+            )
+        return self.types[0]
+
+    def to_struct(self, value):
+        return value.to_struct()
+
+
+class _LazyType(object):
+
+    def __init__(self, path):
+        self.path = path
+
+    def evaluate(self, base_cls):
+        module, type_name = _evaluate_path(self.path, base_cls)
+        return _import(module, type_name)
+
+
+def _evaluate_path(relative_path, base_cls):
+    base_module = base_cls.__module__
+
+    modules = _get_modules(relative_path, base_module)
+
+    type_name = modules.pop()
+    module = '.'.join(modules)
+    if not module:
+        module = base_module
+    return module, type_name
+
+
+def _get_modules(relative_path, base_module):
+    canonical_path = relative_path.lstrip('.')
+    canonical_modules = canonical_path.split('.')
+
+    if not relative_path.startswith('.'):
+        return canonical_modules
+
+    parents_amount = len(relative_path) - len(canonical_path)
+    parent_modules = base_module.split('.')
+    parents_amount = max(0, parents_amount - 1)
+    if parents_amount > len(parent_modules):
+        raise ValueError("Can't evaluate path '{}'".format(relative_path))
+    return parent_modules[:parents_amount * -1] + canonical_modules
+
+
+def _import(module_name, type_name):
+    module = __import__(module_name, fromlist=[type_name])
+    try:
+        return getattr(module, type_name)
+    except AttributeError:
+        raise ValueError(
+            "Can't find type '{}.{}'.".format(module_name, type_name))
+
+
+class TimeField(StringField):
+
+    """Time field."""
+
+    types = (datetime.time,)
+
+    def __init__(self, str_format=None, *args, **kwargs):
+        """Init.
+
+        :param str str_format: Format to cast time to (if `None` - casting to
+            ISO 8601 format).
+
+        """
+        self.str_format = str_format
+        super(TimeField, self).__init__(*args, **kwargs)
+
+    def to_struct(self, value):
+        """Cast `time` object to string."""
+        if self.str_format:
+            return value.strftime(self.str_format)
+        return value.isoformat()
+
+    def parse_value(self, value):
+        """Parse string into instance of `time`."""
+        if value is None:
+            return value
+        if isinstance(value, datetime.time):
+            return value
+        return parse(value).timetz()
+
+
+class DateField(StringField):
+
+    """Date field."""
+
+    types = (datetime.date,)
+    default_format = '%Y-%m-%d'
+
+    def __init__(self, str_format=None, *args, **kwargs):
+        """Init.
+
+        :param str str_format: Format to cast date to (if `None` - casting to
+            %Y-%m-%d format).
+
+        """
+        self.str_format = str_format
+        super(DateField, self).__init__(*args, **kwargs)
+
+    def to_struct(self, value):
+        """Cast `date` object to string."""
+        if self.str_format:
+            return value.strftime(self.str_format)
+        return value.strftime(self.default_format)
+
+    def parse_value(self, value):
+        """Parse string into instance of `date`."""
+        if value is None:
+            return value
+        if isinstance(value, datetime.date):
+            return value
+        return parse(value).date()
+
+
+class DateTimeField(StringField):
+
+    """Datetime field."""
+
+    types = (datetime.datetime,)
+
+    def __init__(self, str_format=None, *args, **kwargs):
+        """Init.
+
+        :param str str_format: Format to cast datetime to (if `None` - casting
+            to ISO 8601 format).
+
+        """
+        self.str_format = str_format
+        super(DateTimeField, self).__init__(*args, **kwargs)
+
+    def to_struct(self, value):
+        """Cast `datetime` object to string."""
+        if self.str_format:
+            return value.strftime(self.str_format)
+        return value.isoformat()
+
+    def parse_value(self, value):
+        """Parse string into instance of `datetime`."""
+        if isinstance(value, datetime.datetime):
+            return value
+        if value:
+            return parse(value)
+        else:
+            return None
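The storage scheme used by `BaseField` can be sketched in isolation: a descriptor keeps per-instance values in a `WeakKeyDictionary` keyed by the instance's `_cache_key`, and a `NotSet` sentinel distinguishes "no default given" from an explicit `None` default. The names `SketchField`, `SketchModel` and `_Key` below are hypothetical and heavily trimmed (no parsing or validation); this is an illustration of the idea, not the package's implementation.

```python
from weakref import WeakKeyDictionary

NotSet = object()  # sentinel: "no default" is different from default=None


class SketchField(object):
    """Hypothetical, trimmed-down field that stores values per cache key."""

    def __init__(self, default=NotSet):
        self.memory = WeakKeyDictionary()
        self._default = default

    @property
    def has_default(self):
        return self._default is not NotSet

    def __set__(self, instance, value):
        self.memory[instance._cache_key] = value

    def __get__(self, instance, owner=None):
        if instance is None:
            # Accessed on the class: hand back the field object itself.
            return self
        if instance._cache_key not in self.memory:
            # First read: fall back to the default, or None if there is none.
            self.__set__(
                instance, self._default if self.has_default else None)
        return self.memory[instance._cache_key]


class _Key(object):
    """Weak-referenceable token, playing the role of `_CacheKey`."""


class SketchModel(object):
    name = SketchField()
    retries = SketchField(default=3)
    note = SketchField(default=None)

    def __init__(self):
        self._cache_key = _Key()


first, second = SketchModel(), SketchModel()
first.name = 'alpha'
```

Because the dictionary holds its keys weakly, the stored value can be dropped once the model instance (and with it the key object) is garbage collected, so the field objects on the class do not keep instances alive.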
--- a/trains_agent/backend_api/session/jsonmodels/models.py
+++ b/trains_agent/backend_api/session/jsonmodels/models.py
@@ -0,0 +1,154 @@
+import six
+
+from . import parsers, errors
+from .fields import BaseField
+from .errors import ValidationError
+
+
+class JsonmodelMeta(type):
+
+    def __new__(cls, name, bases, attributes):
+        cls.validate_fields(attributes)
+        return super(JsonmodelMeta, cls).__new__(cls, name, bases, attributes)
+
+    @staticmethod
+    def validate_fields(attributes):
+        fields = {
+            key: value for key, value in attributes.items()
+            if isinstance(value, BaseField)
+        }
+        taken_names = set()
+        for name, field in fields.items():
+            structue_name = field.structue_name(name)
+            if structue_name in taken_names:
+                raise ValueError('Name taken', structue_name, name)
+            taken_names.add(structue_name)
+
+
+class Base(six.with_metaclass(JsonmodelMeta, object)):
+
+    """Base class for all models."""
+
+    def __init__(self, **kwargs):
+        self._cache_key = _CacheKey()
+        self.populate(**kwargs)
+
+    def populate(self, **values):
+        """Populate values to fields. Skip non-existing."""
+        values = values.copy()
+        fields = list(self.iterate_with_name())
+        for _, structure_name, field in fields:
+            if structure_name in values:
+                field.__set__(self, values.pop(structure_name))
+        for name, _, field in fields:
+            if name in values:
+                field.__set__(self, values.pop(name))
+
+    def get_field(self, field_name):
+        """Get field associated with given attribute."""
+        for attr_name, field in self:
+            if field_name == attr_name:
+                return field
+
+        raise errors.FieldNotFound('Field not found', field_name)
+
+    def __iter__(self):
+        """Iterate through fields and values."""
+        for name, field in self.iterate_over_fields():
+            yield name, field
+
+    def validate(self):
+        """Explicitly validate all the fields."""
+        for name, field in self:
+            try:
+                field.validate_for_object(self)
+            except ValidationError as error:
+                raise ValidationError(
+                    "Error for field '{name}'.".format(name=name),
+                    error,
+                )
+
+    @classmethod
+    def iterate_over_fields(cls):
+        """Iterate through fields as `(attribute_name, field_instance)`."""
+        for attr in dir(cls):
+            clsattr = getattr(cls, attr)
+            if isinstance(clsattr, BaseField):
+                yield attr, clsattr
+
+    @classmethod
+    def iterate_with_name(cls):
+        """Iterate over fields, but also give `structure_name`.
+
+        Format is `(attribute_name, structure_name, field_instance)`.
+        The structure name is the key under which the value appears in the
+        serialized structure and in the schema, and nowhere else.
+        """
+        for attr_name, field in cls.iterate_over_fields():
+            structure_name = field.structue_name(attr_name)
+            yield attr_name, structure_name, field
+
+    def to_struct(self):
+        """Cast model to Python structure."""
+        return parsers.to_struct(self)
+
+    @classmethod
+    def to_json_schema(cls):
+        """Generate JSON schema for model."""
+        return parsers.to_json_schema(cls)
+
+    def __repr__(self):
+        attrs = {}
+        for name, _ in self:
+            try:
+                attr = getattr(self, name)
+                if attr is not None:
+                    attrs[name] = repr(attr)
+            except ValidationError:
+                pass
+
+        return '{class_name}({fields})'.format(
+            class_name=self.__class__.__name__,
+            fields=', '.join(
+                '{0[0]}={0[1]}'.format(x) for x in sorted(attrs.items())
+            ),
+        )
+
+    def __str__(self):
+        return '{name} object'.format(name=self.__class__.__name__)
+
+    def __setattr__(self, name, value):
+        try:
+            return super(Base, self).__setattr__(name, value)
+        except ValidationError as error:
+            raise ValidationError(
+                "Error for field '{name}'.".format(name=name),
+                error
+            )
+
+    def __eq__(self, other):
+        if type(other) is not type(self):
+            return False
+
+        for name, _ in self.iterate_over_fields():
+            try:
+                our = getattr(self, name)
+            except errors.ValidationError:
+                our = None
+
+            try:
+                their = getattr(other, name)
+            except errors.ValidationError:
+                their = None
+
+            if our != their:
+                return False
+
+        return True
+
+    def __ne__(self, other):
+        return not (self == other)
+
+
+class _CacheKey(object):
+    """Object to identify model in memory."""
--- a/trains_agent/backend_api/session/jsonmodels/parsers.py
+++ b/trains_agent/backend_api/session/jsonmodels/parsers.py
@@ -0,0 +1,106 @@
+"""Parsers to change model structure into different ones."""
+import inspect
+
+from . import fields, builders, errors
+
+
+def to_struct(model):
+    """Cast instance of model to python structure.
+
+    :param model: Model to be cast.
+    :rtype: ``dict``
+
+    """
+    model.validate()
+
+    resp = {}
+    for _, name, field in model.iterate_with_name():
+        value = field.__get__(model)
+        if value is None:
+            continue
+
+        value = field.to_struct(value)
+        resp[name] = value
+    return resp
+
+
+def to_json_schema(cls):
+    """Generate JSON schema for given class.
+
+    :param cls: Class to be cast.
+    :rtype: ``dict``
+
+    """
+    builder = build_json_schema(cls)
+    return builder.build()
+
+
+def build_json_schema(value, parent_builder=None):
+    from .models import Base
+
+    cls = value if inspect.isclass(value) else value.__class__
+    if issubclass(cls, Base):
+        return build_json_schema_object(cls, parent_builder)
+    else:
+        return build_json_schema_primitive(cls, parent_builder)
+
+
+def build_json_schema_object(cls, parent_builder=None):
+    builder = builders.ObjectBuilder(cls, parent_builder)
+    if builder.count_type(builder.type) > 1:
+        return builder
+    for _, name, field in cls.iterate_with_name():
+        if isinstance(field, fields.EmbeddedField):
+            builder.add_field(name, field, _parse_embedded(field, builder))
+        elif isinstance(field, fields.ListField):
+            builder.add_field(name, field, _parse_list(field, builder))
+        else:
+            builder.add_field(
+                name, field, _create_primitive_field_schema(field))
+    return builder
+
+
+def _parse_list(field, parent_builder):
+    builder = builders.ListBuilder(
+        parent_builder, field.nullable, default=field._default)
+    for type in field.items_types:
+        builder.add_type_schema(build_json_schema(type, builder))
+    return builder
+
+
+def _parse_embedded(field, parent_builder):
+    builder = builders.EmbeddedBuilder(
+        parent_builder, field.nullable, default=field._default)
+    for type in field.types:
+        builder.add_type_schema(build_json_schema(type, builder))
+    return builder
+
+
+def build_json_schema_primitive(cls, parent_builder):
+    builder = builders.PrimitiveBuilder(cls, parent_builder)
+    return builder
+
+
+def _create_primitive_field_schema(field):
+    if isinstance(field, fields.StringField):
+        obj_type = 'string'
+    elif isinstance(field, fields.IntField):
+        obj_type = 'number'
+    elif isinstance(field, fields.FloatField):
+        obj_type = 'number'
+    elif isinstance(field, fields.BoolField):
+        obj_type = 'boolean'
+    else:
+        raise errors.FieldNotSupported(
+            'Field {field} is not supported!'.format(
+                field=type(field).__name__))
+
+    if field.nullable:
+        obj_type = [obj_type, 'null']
+
+    schema = {'type': obj_type}
+
+    if field.has_default:
+        schema["default"] = field._default
+
+    return schema
--- a/trains_agent/backend_api/session/jsonmodels/utilities.py
+++ b/trains_agent/backend_api/session/jsonmodels/utilities.py
@@ -0,0 +1,156 @@
+from __future__ import absolute_import
+
+import six
+import re
+from collections import namedtuple
+
+SCALAR_TYPES = tuple(list(six.string_types) + [int, float, bool])
+
+ECMA_TO_PYTHON_FLAGS = {
+    'i': re.I,
+    'm': re.M,
+}
+
+PYTHON_TO_ECMA_FLAGS = dict(
+    (value, key) for key, value in ECMA_TO_PYTHON_FLAGS.items()
+)
+
+PythonRegex = namedtuple('PythonRegex', ['regex', 'flags'])
+
+
+def _normalize_string_type(value):
+    if isinstance(value, six.string_types):
+        return six.text_type(value)
+    else:
+        return value
+
+
+def _compare_dicts(one, two):
+    if len(one) != len(two):
+        return False
+
+    for key in one:
+        if key not in two:
+            return False
+
+        if not compare_schemas(one[key], two[key]):
+            return False
+    return True
+
+
+def _compare_lists(one, two):
+    if len(one) != len(two):
+        return False
+
+    they_match = False
+    for first_item in one:
+        for second_item in two:
+            if they_match:
+                continue
+            they_match = compare_schemas(first_item, second_item)
+    return they_match
+
+
+def _assert_same_types(one, two):
+    if not isinstance(one, type(two)) or not isinstance(two, type(one)):
+        raise RuntimeError('Types mismatch! "{type1}" and "{type2}".'.format(
+            type1=type(one).__name__, type2=type(two).__name__))
+
+
+def compare_schemas(one, two):
+    """Compare two structures that represent JSON schemas.
+
+    Plain equality cannot be used here, because lists in JSON schema DO NOT
+    keep order (and Python lists do), so this must be taken into account
+    during comparison.
+
+    Note this won't check all configurations, only the first one that seems
+    to match, which can lead to wrong results.
+
+    :param one: First schema to compare.
+    :param two: Second schema to compare.
+    :rtype: `bool`
+
+    """
+    one = _normalize_string_type(one)
+    two = _normalize_string_type(two)
+
+    _assert_same_types(one, two)
+
+    if isinstance(one, list):
+        return _compare_lists(one, two)
+    elif isinstance(one, dict):
+        return _compare_dicts(one, two)
+    elif isinstance(one, SCALAR_TYPES):
+        return one == two
+    elif one is None:
+        return one is two
+    else:
+        raise RuntimeError('Not allowed type "{type}"'.format(
+            type=type(one).__name__))
+
+
+def is_ecma_regex(regex):
+    """Check if given regex is of type ECMA 262 or not.
+
+    :rtype: bool
+
+    """
+    parts = regex.split('/')
+
+    if len(parts) == 1:
+        return False
+
+    if len(parts) < 3:
+        raise ValueError('Given regex isn\'t ECMA regex nor Python regex.')
+    parts.pop()
+    parts.append('')
+
+    raw_regex = '/'.join(parts)
+    if raw_regex.startswith('/') and raw_regex.endswith('/'):
+        return True
+    return False
+
+
+def convert_ecma_regex_to_python(value):
+    """Convert ECMA 262 regex to Python tuple with regex and flags.
+
+    If given value is already Python regex it will be returned unchanged.
+
+    :param string value: ECMA regex.
+    :return: 2-tuple with `regex` and `flags`
+    :rtype: namedtuple
+
+    """
+    if not is_ecma_regex(value):
+        return PythonRegex(value, [])
+
+    parts = value.split('/')
+    flags = parts.pop()
+
+    try:
+        result_flags = [ECMA_TO_PYTHON_FLAGS[f] for f in flags]
+    except KeyError:
+        raise ValueError('Wrong flags "{}".'.format(flags))
+
+    return PythonRegex('/'.join(parts[1:]), result_flags)
+
+
+def convert_python_regex_to_ecma(value, flags=()):
+    """Convert Python regex to ECMA 262 regex.
+
+    If given value is already ECMA regex it will be returned unchanged.
+
+    :param string value: Python regex.
+    :param list flags: List of flags (allowed flags: `re.I`, `re.M`)
+    :return: ECMA 262 regex
+    :rtype: str
+
+    """
+    if is_ecma_regex(value):
+        return value
+
+    result_flags = [PYTHON_TO_ECMA_FLAGS[f] for f in flags]
+    result_flags = ''.join(result_flags)
+
+    return '/{value}/{flags}'.format(value=value, flags=result_flags)
--- a/trains_agent/backend_api/session/jsonmodels/validators.py
+++ b/trains_agent/backend_api/session/jsonmodels/validators.py
@@ -0,0 +1,202 @@
+"""Predefined validators."""
+import re
+
+from six.moves import reduce
+
+from .errors import ValidationError
+from . import utilities
+
+
+class Min(object):
+
+    """Validator for minimum value."""
+
+    def __init__(self, minimum_value, exclusive=False):
+        """Init.
+
+        :param minimum_value: Minimum value for validator.
+        :param bool exclusive: If `True`, then validated value must be strictly
+            greater than given threshold.
+
+        """
+        self.minimum_value = minimum_value
+        self.exclusive = exclusive
+
+    def validate(self, value):
+        """Validate value."""
+        if self.exclusive:
+            if value <= self.minimum_value:
+                tpl = "'{value}' is lower or equal than minimum ('{min}')."
+                raise ValidationError(
+                    tpl.format(value=value, min=self.minimum_value))
+        else:
+            if value < self.minimum_value:
+                raise ValidationError(
+                    "'{value}' is lower than minimum ('{min}').".format(
+                        value=value, min=self.minimum_value))
+
+    def modify_schema(self, field_schema):
+        """Modify field schema."""
+        field_schema['minimum'] = self.minimum_value
+        if self.exclusive:
+            field_schema['exclusiveMinimum'] = True
+
+
+class Max(object):
+
+    """Validator for maximum value."""
+
+    def __init__(self, maximum_value, exclusive=False):
+        """Init.
+
+        :param maximum_value: Maximum value for validator.
+        :param bool exclusive: If `True`, then validated value must be strictly
+            lower than given threshold.
+
+        """
+        self.maximum_value = maximum_value
+        self.exclusive = exclusive
+
+    def validate(self, value):
+        """Validate value."""
+        if self.exclusive:
+            if value >= self.maximum_value:
+                tpl = "'{val}' is bigger or equal than maximum ('{max}')."
+                raise ValidationError(
+                    tpl.format(val=value, max=self.maximum_value))
+        else:
+            if value > self.maximum_value:
+                raise ValidationError(
+                    "'{value}' is bigger than maximum ('{max}').".format(
+                        value=value, max=self.maximum_value))
+
+    def modify_schema(self, field_schema):
+        """Modify field schema."""
+        field_schema['maximum'] = self.maximum_value
+        if self.exclusive:
+            field_schema['exclusiveMaximum'] = True
+
+
+class Regex(object):
+
+    """Validator for regular expressions."""
+
+    FLAGS = {
+        'ignorecase': re.I,
+        'multiline': re.M,
+    }
+
+    def __init__(self, pattern, **flags):
+        """Init.
+
+        Note, that if given pattern is ECMA regex, given flags will be
+        **completely ignored** and taken from given regex.
+
+
+        :param string pattern: Pattern of regex.
+        :param bool flags: Flags used for the regex matching.
+            Allowed flag names are in the `FLAGS` attribute. The flag value
+            does not matter as long as it evaluates to True.
+            Flags with False values will be ignored.
+            Invalid flags will be ignored.
+
+        """
+        if utilities.is_ecma_regex(pattern):
+            result = utilities.convert_ecma_regex_to_python(pattern)
+            self.pattern, self.flags = result
+        else:
+            self.pattern = pattern
+            self.flags = [self.FLAGS[key] for key, value in flags.items()
+                          if key in self.FLAGS and value]
+
+    def validate(self, value):
+        """Validate value."""
+        flags = self._calculate_flags()
+
+        try:
+            result = re.search(self.pattern, value, flags)
+        except TypeError as te:
+            raise ValidationError(*te.args)
+
+        if not result:
+            raise ValidationError(
+                'Value "{value}" did not match pattern "{pattern}".'.format(
+                    value=value, pattern=self.pattern
+                ))
+
+    def _calculate_flags(self):
+        return reduce(lambda x, y: x | y, self.flags, 0)
+
+    def modify_schema(self, field_schema):
+        """Modify field schema."""
+        field_schema['pattern'] = utilities.convert_python_regex_to_ecma(
+            self.pattern, self.flags)
+
+
+class Length(object):
+
+    """Validator for length."""
+
+    def __init__(self, minimum_value=None, maximum_value=None):
+        """Init.
+
+        Note that if neither `minimum_value` nor `maximum_value` is
+        specified, `ValueError` will be raised.
+
+        :param int minimum_value: Minimum value (optional).
+        :param int maximum_value: Maximum value (optional).
+
+        """
+        if minimum_value is None and maximum_value is None:
+            raise ValueError(
+                "Either 'minimum_value' or 'maximum_value' must be specified.")
+
+        self.minimum_value = minimum_value
+        self.maximum_value = maximum_value
+
+    def validate(self, value):
+        """Validate value."""
+        len_ = len(value)
+
+        if self.minimum_value is not None and len_ < self.minimum_value:
+            tpl = "Value '{val}' length is lower than allowed minimum '{min}'."
+            raise ValidationError(tpl.format(
+                val=value, min=self.minimum_value
+            ))
+
+        if self.maximum_value is not None and len_ > self.maximum_value:
+            raise ValidationError(
+                "Value '{val}' length is bigger than "
+                "allowed maximum '{max}'.".format(
+                    val=value,
+                    max=self.maximum_value,
+                ))
+
+    def modify_schema(self, field_schema):
+        """Modify field schema."""
+        if self.minimum_value is not None:
+            field_schema['minLength'] = self.minimum_value
+
+        if self.maximum_value is not None:
+            field_schema['maxLength'] = self.maximum_value
+
+
+class Enum(object):
+
+    """Validator for enums."""
+
+    def __init__(self, *choices):
+        """Init.
+
+        :param [] choices: Valid choices for the field.
+        """
+
+        self.choices = list(choices)
+
+    def validate(self, value):
+        if value not in self.choices:
+            tpl = "Value '{val}' is not a valid choice."
+            raise ValidationError(tpl.format(val=value))
+
+    def modify_schema(self, field_schema):
+        field_schema['enum'] = self.choices
--- a/trains_agent/backend_api/session/response.py
+++ b/trains_agent/backend_api/session/response.py
@@ -1,10 +1,8 @@
 import requests

 import six
-import jsonmodels.models
-import jsonmodels.fields
-import jsonmodels.errors

+from . import jsonmodels
 from .apimodel import ApiModel
 from .datamodel import NonStrictDataModelMixin

--- a/trains_agent/backend_api/session/session.py
+++ b/trains_agent/backend_api/session/session.py
@@ -40,6 +40,7 @@ class Session(TokenManager):
    _session_requests = 0
    _session_initial_timeout = (3.0, 10.)
    _session_timeout = (10.0, 30.)
+    _session_initial_connect_retry = 4
    _write_session_data_size = 15000
    _write_session_timeout = (30.0, 30.)

@@ -85,6 +86,7 @@ class Session(TokenManager):
        initialize_logging=True,
        client=None,
        config=None,
+        http_retries_config=None,
        **kwargs
    ):
        # add backward compatibility support for old environment variables
@@ -95,7 +97,7 @@ class Session(TokenManager):
        else:
            self.config = load()
            if initialize_logging:
-                self.config.initialize_logging()
+                self.config.initialize_logging(debug=kwargs.get('debug', False))

        token_expiration_threshold_sec = self.config.get(
            "auth.token_expiration_threshold_sec", 60
@@ -129,11 +131,10 @@ class Session(TokenManager):
            raise ValueError("host is required in init or config")

        self.__host = host.strip("/")
-        http_retries_config = self.config.get(
+        http_retries_config = http_retries_config or self.config.get(
            "api.http.retries", ConfigTree()
        ).as_plain_ordered_dict()
        http_retries_config["status_forcelist"] = self._retry_codes
-        self.__http_session = get_http_session_with_retry(**http_retries_config)

        self.__worker = worker or gethostname()

@@ -143,7 +144,14 @@ class Session(TokenManager):

        self.client = client or "api-{}".format(__version__)

+        # limit the connect retries for the initial connection, so we fail fast if the server is unreachable
+        http_no_retries_config = dict(**http_retries_config)
+        http_no_retries_config['connect'] = self._session_initial_connect_retry
+        self.__http_session = get_http_session_with_retry(**http_no_retries_config)
+        # try to connect with the server
        self.refresh_token()
+        # create the default session with many retries
+        self.__http_session = get_http_session_with_retry(**http_retries_config)

        # update api version from server response
        try:
@@ -545,6 +553,9 @@ class Session(TokenManager):
            else:
                raise LoginError("Response data mismatch: No 'token' in 'data' value from res, receive : {}, "
                                 "exception: {}".format(res, ex))
+        except requests.ConnectionError as ex:
+            raise ValueError('Connection Error: it seems *api_server* is misconfigured. '
+                             'Is this the TRAINS API server {} ?'.format('/'.join(ex.request.url.split('/')[:3])))
        except Exception as ex:
            raise LoginError('Unrecognized Authentication Error: {} {}'.format(type(ex), ex))

--- a/trains_agent/backend_config/config.py
+++ b/trains_agent/backend_config/config.py
@@ -190,7 +190,7 @@ class Config(object):
    def reload(self):
        self.replace(self._reload())

-    def initialize_logging(self):
+    def initialize_logging(self, debug=False):
        logging_config = self._config.get("logging", None)
        if not logging_config:
            return False
@@ -217,6 +217,8 @@ class Config(object):
        )
        for logger in loggers:
            handlers = logger.get("handlers", None)
+            if debug:
+                logger['level'] = 'DEBUG'
            if not handlers:
                continue
            logger["handlers"] = [h for h in handlers if h not in deleted]
--- a/trains_agent/commands/config.py
+++ b/trains_agent/commands/config.py
@@ -142,6 +142,7 @@ def main():
        with open(str(conf_file), 'wt') as f:
            header = '# TRAINS-AGENT configuration file\n' \
                     'api {\n' \
+                     '    # Notice: \'host\' is the api server (default port 8008), not the web server.\n' \
                     '    api_server: %s\n' \
                     '    web_server: %s\n' \
                     '    files_server: %s\n' \
@@ -233,7 +234,8 @@ def verify_credentials(api_host, credentials):
    try:
        print('Verifying credentials ...')
        if api_host:
-            Session(api_key=credentials['access_key'], secret_key=credentials['secret_key'], host=api_host)
+            Session(api_key=credentials['access_key'], secret_key=credentials['secret_key'], host=api_host,
+                    http_retries_config={"total": 2})
            print('Credentials verified!')
            return True
        else:
@@ -275,7 +277,7 @@ def read_manual_credentials():

 def input_url(host_type, host=None):
    while True:
-        print('{} configured to: [{}] '.format(host_type, host), end='')
+        print('{} configured to: {}'.format(host_type, '[{}] '.format(host) if host else ''), end='')
        parse_input = input()
        if host and (not parse_input or parse_input.lower() == 'yes' or parse_input.lower() == 'y'):
            break
@@ -289,11 +291,12 @@ def input_url(host_type, host=None):
 def input_host_port(host_type, parsed_host):
    print('Enter port for {} host '.format(host_type), end='')
    replace_port = input().lower()
-    return parsed_host.scheme + "://" + parsed_host.netloc + (':{}'.format(replace_port) if replace_port else '') + \
-           parsed_host.path
+    return parsed_host.scheme + "://" + parsed_host.netloc + (
+        ':{}'.format(replace_port) if replace_port else '') + parsed_host.path


 def verify_url(parse_input):
+    # noinspection PyBroadException
    try:
        if not parse_input.startswith('http://') and not parse_input.startswith('https://'):
            # if we have a specific port, use http prefix, otherwise assume https
--- a/trains_agent/commands/worker.py
+++ b/trains_agent/commands/worker.py
@@ -17,7 +17,7 @@ from datetime import datetime
 from distutils.spawn import find_executable
 from functools import partial
 from itertools import chain
-from tempfile import mkdtemp, gettempdir
+from tempfile import mkdtemp, NamedTemporaryFile
 from time import sleep, time
 from typing import Text, Optional, Any, Tuple

@@ -66,7 +66,7 @@ from trains_agent.helper.base import (
    get_python_path,
    is_linux_platform,
    rm_file,
-    add_python_path)
+    add_python_path, safe_remove_tree, )
 from trains_agent.helper.console import ensure_text, print_text, decode_binary_lines
 from trains_agent.helper.os.daemonize import daemonize_process
 from trains_agent.helper.package.base import PackageManager
@@ -89,7 +89,7 @@ from trains_agent.helper.process import (
    get_bash_output,
    shutdown_docker_process,
    get_docker_id,
-    commit_docker
+    commit_docker, terminate_process,
 )
 from trains_agent.helper.package.cython_req import CythonRequirement
 from trains_agent.helper.repo import clone_repository_cached, RepoInfo, VCS
@@ -104,6 +104,7 @@ log = logging.getLogger(__name__)
 DOCKER_ROOT_CONF_FILE = "/root/trains.conf"
 DOCKER_DEFAULT_CONF_FILE = "/root/default_trains.conf"

+
 @attr.s
 class LiteralScriptManager(object):
    """
@@ -212,6 +213,7 @@ class TaskStopSignal(object):
        statuses.stopped,
        statuses.failed,
        statuses.published,
+        statuses.queued,
    ]
    default = TaskStopReason.no_stop
    stopping_message = "stopping"
@@ -317,6 +319,10 @@ class Worker(ServiceCommandSection):
    # last message before passing control to the actual task
    _task_logging_pass_control_message = "Running task id [{}]:"

+    _run_as_user_home = '/trains_agent_home'
+    _docker_fixed_user_cache = '/trains_agent_cache'
+    _temp_cleanup_list = []
+
    @property
    def service(self):
        """ Worker command service endpoint """
@@ -329,6 +335,8 @@ class Worker(ServiceCommandSection):
    @staticmethod
    def register_signal_handler():
        def handler(*_):
+            for f in Worker._temp_cleanup_list + [Singleton.get_pid_file()]:
+                safe_remove_tree(f)
            raise Sigterm()

        signal.signal(signal.SIGTERM, handler)
@@ -385,6 +393,7 @@ class Worker(ServiceCommandSection):
        self._standalone_mode = None
        self._services_mode = None
        self._force_current_version = None
+        self._redirected_stdout_file_no = None

    @classmethod
    def _verify_command_states(cls, kwargs):
@@ -396,7 +405,7 @@ class Worker(ServiceCommandSection):
        """
        if kwargs.get('services_mode'):
            kwargs['cpu_only'] = True
-            kwargs['docker'] = kwargs.get('docker', [])
+            kwargs['docker'] = kwargs.get('docker') or []
            kwargs['gpus'] = None

        return kwargs
@@ -430,7 +439,7 @@ class Worker(ServiceCommandSection):
            pass

    def run_one_task(self, queue, task_id, worker_args, docker=None):
-        # type: (Text, Text, WorkerParams) -> ()
+        # type: (Text, Text, WorkerParams, Optional[Text]) -> ()
        """
        Run one task pulled from queue.
        :param queue: ID of queue that task was pulled from
@@ -566,12 +575,12 @@ class Worker(ServiceCommandSection):
            else:
                self.handle_task_termination(task_id, status, stop_signal_status)
                # remove temp files after we sent everything to the backend
-                safe_remove_file(temp_stdout_name)
-                safe_remove_file(temp_stderr_name)
                if self.docker_image_func:
                    shutdown_docker_process(docker_cmd_contains='--id {}\'\"'.format(task_id))
+                safe_remove_file(temp_stdout_name)
+                safe_remove_file(temp_stderr_name)

-    def run_tasks_loop(self, queues, worker_params):
+    def run_tasks_loop(self, queues, worker_params, priority_order=True):
        """
        :summary: Pull and run tasks from queues.
        :description: 1. Go through ``queues`` by order.
@@ -581,6 +590,9 @@ class Worker(ServiceCommandSection):
        :type queues: list of ``Text``
        :param worker_params: Worker command line arguments
        :type worker_params: ``trains_agent.helper.process.WorkerParams``
+        :param priority_order: If True, pull tasks in priority order, always starting from the first queue.
+            If False, pull from each queue once, in a round-robin manner
+        :type priority_order: bool
        """

        if not self._daemon_foreground:
@@ -611,6 +623,16 @@ class Worker(ServiceCommandSection):
                            print("No tasks in queue {}".format(queue))
                        continue

+                    # clear output log if we start a new Task
+                    if not worker_params.debug and self._redirected_stdout_file_no is not None and \
+                            self._redirected_stdout_file_no > 2:
+                        # noinspection PyBroadException
+                        try:
+                            os.lseek(self._redirected_stdout_file_no, 0, 0)
+                            os.ftruncate(self._redirected_stdout_file_no, 0)
+                        except:
+                            pass
+
                    self.send_logs(
                        task_id=task_id,
                        lines=["task {} pulled from {} by worker {}\n".format(task_id, queue, self.worker_id)],
@@ -619,7 +641,11 @@ class Worker(ServiceCommandSection):
                    self.report_monitor(ResourceMonitor.StatusReport(queues=queues, queue=queue, task=task_id))
                    self.run_one_task(queue, task_id, worker_params)
                    self.report_monitor(ResourceMonitor.StatusReport(queues=self.queues))
-                    break
+
+                    # in priority order, always restart pulling from the first queue;
+                    # in round-robin order, continue to the next queue
+                    if priority_order:
+                        break
            else:
                # sleep and retry polling
                if self._daemon_foreground or worker_params.debug:
@@ -671,13 +697,22 @@ class Worker(ServiceCommandSection):

        self._session.print_configuration()

-    def daemon(self, queues, log_level, foreground=False, docker=False, detached=False, **kwargs):
+    def daemon(self, queues, log_level, foreground=False, docker=False, detached=False, order_fairness=False, **kwargs):
        # if we do not need to create queues, make sure they are valid
        # match previous behaviour when we validated queue names before everything else
        queues = self._resolve_queue_names(queues, create_if_missing=kwargs.get('create_queue', False))

        self._standalone_mode = kwargs.get('standalone_mode', False)
        self._services_mode = kwargs.get('services_mode', False)
+        # must have docker in services_mode
+        if self._services_mode:
+            kwargs = self._verify_command_states(kwargs)
+            docker = docker or kwargs.get('docker')
+
+        # We are not running a daemon, we are stopping one:
+        # find the pid, send a termination signal and leave
+        if kwargs.get('stop', False):
+            return 1 if not self._kill_daemon() else 0

        # make sure we only have a single instance,
        # also make sure we set worker_id properly and cache folders
@@ -704,9 +739,8 @@ class Worker(ServiceCommandSection):
        self._register(queues)

        # create temp config file with current configuration
-        self.temp_config_path = safe_mkstemp(
-            suffix=".cfg", prefix=".trains_agent.", text=True, name_only=True
-        )
+        self.temp_config_path = NamedTemporaryFile(
+            suffix=".cfg", prefix=".trains_agent.", mode='w+t').name

        # print docker image
        if docker is not False and docker is not None:
@@ -714,6 +748,22 @@ class Worker(ServiceCommandSection):
            self.set_docker_variables(docker)
        else:
            self.dump_config()
+            # only in non-docker mode do we have to make sure CUDA is set up
+
+            # make sure we have CUDA set if we have --gpus
+            if kwargs.get('gpus') and self._session.config.get('agent.cuda_version', None) in (None, 0, '0'):
+                message = 'Running with GPUs but no CUDA version was detected!\n' \
+                          '\tSet OS environment CUDA_VERSION & CUDNN_VERSION to the correct version\n' \
+                          '\tExample: export CUDA_VERSION=10.1 or (Windows: set CUDA_VERSION=10.1)'
+                if is_conda(self._session.config):
+                    self._unregister(queues)
+                    safe_remove_file(self.temp_config_path)
+                    raise ValueError(message)
+                else:
+                    warning(message+'\n')
+
+        if self._services_mode:
+            print('Trains-Agent running in services mode')

        self._daemon_foreground = foreground
        if not foreground:
@@ -730,6 +780,9 @@ class Worker(ServiceCommandSection):
                )
            )

+            if not self._session.debug_mode:
+                self._temp_cleanup_list.append(name)
+
            if not detached:
                # redirect std out/err to new file
                sys.stdout = sys.stderr = out_file
@@ -737,6 +790,7 @@ class Worker(ServiceCommandSection):
                # in detached mode
                # fully detach stdin.stdout/stderr and leave main process, running in the background
                daemonize_process(out_file.fileno())
+                self._redirected_stdout_file_no = out_file.fileno()
                # make sure we update the singleton lock file to the new pid
                Singleton.update_pid_file()
                # reprint headers to std file (we are now inside the daemon process)
@@ -756,6 +810,7 @@ class Worker(ServiceCommandSection):
                            debug=self._session.debug_mode,
                            trace=self._session.trace,
                        ),
+                        priority_order=not order_fairness,
                    )
                except Exception:
                    tb = six.text_type(traceback.format_exc())
@@ -1151,7 +1206,7 @@ class Worker(ServiceCommandSection):
                        clone=("--clone" if entry_point == "clone_task" else ""),
                     )
        else:
-            change = 'ENTRYPOINT bash'
+            change = 'ENTRYPOINT []'

        print('Committing docker container to: {}'.format(target))
        print(commit_docker(container_name=target, docker_id=docker_id, apply_change=change))
@@ -1713,7 +1768,7 @@ class Worker(ServiceCommandSection):
                if self._session.debug_mode and temp_file:
                    rm_file(temp_file.name)
            # call post installation callback
-            requirements_manager.post_install()
+            requirements_manager.post_install(self._session)
            # mark as successful installation
            repo_requirements_installed = True

@@ -1978,7 +2033,7 @@ class Worker(ServiceCommandSection):
        print("Running in Docker {} mode (v19.03 and above) - using default docker image: {} running {}\n".format(
            '*standalone*' if self._standalone_mode else '', docker_image, python_version))
        temp_config = deepcopy(self._session.config)
-        mounted_cache_dir = '/root/.trains/cache'
+        mounted_cache_dir = self._docker_fixed_user_cache  # '/root/.trains/cache'
        mounted_pip_dl_dir = '/root/.trains/pip-download-cache'
        mounted_vcs_cache = '/root/.trains/vcs-cache'
        mounted_venv_dir = '/root/.trains/venvs-builds'
@@ -2002,6 +2057,7 @@ class Worker(ServiceCommandSection):
        host_pip_cache = Path(os.path.expandvars(self._session.config.get(
            "agent.docker_pip_cache", '~/.trains/pip-cache'))).expanduser().as_posix()
        host_ssh_cache = mkdtemp(prefix='trains_agent.ssh.')
+        self._temp_cleanup_list.append(host_ssh_cache)

        # make sure all folders are valid
        Path(host_apt_cache).mkdir(parents=True, exist_ok=True)
@@ -2023,12 +2079,10 @@ class Worker(ServiceCommandSection):
            pass

        # check if the .git credentials exist:
-        host_git_credentials = Path('~/.git-credentials').expanduser()
        try:
-            if host_git_credentials.is_file():
-                host_git_credentials = host_git_credentials.as_posix()
-            else:
-                host_git_credentials = None
+            host_git_credentials = [
+                f.as_posix() for f in [Path('~/.git-credentials').expanduser(), Path('~/.gitconfig').expanduser()]
+                if f.is_file()]
        except Exception:
            host_git_credentials = None

@@ -2043,6 +2097,8 @@ class Worker(ServiceCommandSection):
                cmds = [cmds]
            extra_shell_script_str = " ; ".join(map(str, cmds)) + " ; "

+        bash_script = self._session.config.get("agent.docker_init_bash_script", None)
+
        self.temp_config_path = self.temp_config_path or safe_mkstemp(
            suffix=".cfg", prefix=".trains_agent.", text=True, name_only=True
        )
@@ -2059,7 +2115,10 @@ class Worker(ServiceCommandSection):
                          host_cache=host_cache, mounted_cache=mounted_cache_dir,
                          host_pip_dl=host_pip_dl, mounted_pip_dl=mounted_pip_dl_dir,
                          host_vcs_cache=host_vcs_cache, mounted_vcs_cache=mounted_vcs_cache,
-                          standalone_mode=self._standalone_mode, force_current_version=self._force_current_version)
+                          standalone_mode=self._standalone_mode,
+                          force_current_version=self._force_current_version,
+                          bash_script=bash_script,
+                          )
        return temp_config, partial(docker_cmd_functor, docker_cmd, temp_config)

    @staticmethod
@@ -2072,7 +2131,7 @@ class Worker(ServiceCommandSection):
                        host_pip_dl, mounted_pip_dl,
                        host_vcs_cache, mounted_vcs_cache,
                        standalone_mode=False, extra_docker_arguments=None, extra_shell_script=None,
-                        force_current_version=None, host_git_credentials=None):
+                        force_current_version=None, host_git_credentials=None, bash_script=None):
        docker = 'docker'

        base_cmd = [docker, 'run', '-t']
@@ -2163,21 +2222,42 @@ class Worker(ServiceCommandSection):
        except:
            pass

+        if os.environ.get('FORCE_LOCAL_TRAINS_AGENT_WHEEL'):
+            local_wheel = os.path.expanduser(os.environ.get('FORCE_LOCAL_TRAINS_AGENT_WHEEL'))
+            docker_wheel = str(Path('/tmp') / Path(local_wheel).name)
+            base_cmd += ['-v', local_wheel + ':' + docker_wheel]
+            trains_agent_wheel = '\"{}\"'.format(docker_wheel)
+        else:
+            # trains-agent{specify_version}
+            trains_agent_wheel = 'trains-agent{specify_version}'.format(specify_version=specify_version)
+
        if not standalone_mode:
-            update_scheme += \
-                "echo 'Binary::apt::APT::Keep-Downloaded-Packages \"true\";' > /etc/apt/apt.conf.d/docker-clean ; " \
-                "chown -R root /root/.cache/pip ; " \
-                "apt-get update ; " \
-                "apt-get install -y git libsm6 libxext6 libxrender-dev libglib2.0-0 {python_single_digit}-pip ; " \
-                "{python} -m pip install -U \"pip{pip_version}\" ; " \
-                "{python} -m pip install -U trains-agent{specify_version} ; ".format(
-                    python_single_digit=python_version.split('.')[0],
-                    python=python_version, pip_version=PackageManager.get_pip_version(),
-                    specify_version=specify_version)
+            if not bash_script:
+                bash_script = [
+                    "echo 'Binary::apt::APT::Keep-Downloaded-Packages \"true\";' > /etc/apt/apt.conf.d/docker-clean",
+                    "chown -R root /root/.cache/pip",
+                    "apt-get update",
+                    "apt-get install -y git libsm6 libxext6 libxrender-dev libglib2.0-0",
+                    "(which {python_single_digit} && {python_single_digit} -m pip --version) || " +
+                    "apt-get install -y {python_single_digit}-pip",
+                ]
+
+            docker_bash_script = " ; ".join(bash_script) if not isinstance(bash_script, str) else bash_script
+
+            update_scheme += (
+                    docker_bash_script + " ; " +
+                    "{python} -m pip install -U \"pip{pip_version}\" ; " +
+                    "{python} -m pip install -U {trains_agent_wheel} ; ").format(
+                python_single_digit=python_version.split('.')[0],
+                python=python_version, pip_version=PackageManager.get_pip_version(),
+                trains_agent_wheel=trains_agent_wheel)
+
+        if host_git_credentials:
+            for git_credentials in host_git_credentials:
+                base_cmd += ['-v', '{}:/root/{}'.format(git_credentials, Path(git_credentials).name)]

        base_cmd += (
            ['-v', conf_file+':'+DOCKER_ROOT_CONF_FILE] +
-            (['-v', host_git_credentials+':/root/.git-credentials'] if host_git_credentials else []) +
            (['-v', host_ssh_cache+':/root/.ssh'] if host_ssh_cache else []) +
            (['-v', host_apt_cache+':/var/cache/apt/archives'] if host_apt_cache else []) +
            (['-v', host_pip_cache+':/root/.cache/pip'] if host_pip_cache else []) +
@@ -2208,21 +2288,24 @@ class Worker(ServiceCommandSection):

            def set_uid(self, user_uid, user_gid):
                from pwd import getpwnam
-                self.uid = getpwnam(user_uid).pw_uid
-                self.gid = getpwnam(user_gid).pw_gid
+                try:
+                    self.uid = getpwnam(user_uid).pw_uid
+                    self.gid = getpwnam(user_gid).pw_gid
+                except Exception:
+                    raise ValueError("Could not find requested user uid={} gid={}".format(user_uid, user_gid))

            def _change_uid(self):
                os.setgid(self.gid)
                os.setuid(self.uid)

        # create a home folder for our user
-        trains_agent_home = 'trains_agent_home{}'.format('.'+str(Singleton.get_slot()) if Singleton.get_slot() else '')
+        trains_agent_home = self._run_as_user_home + '{}'.format('.'+str(Singleton.get_slot()) if Singleton.get_slot() else '')
        try:
-            home_folder = '/trains_agent_home'
+            home_folder = self._run_as_user_home
            rm_tree(home_folder)
            Path(home_folder).mkdir(parents=True, exist_ok=True)
        except:
-            home_folder = '/home/trains_agent_home'
+            home_folder = '/home' + self._run_as_user_home
            rm_tree(home_folder)
            Path(home_folder).mkdir(parents=True, exist_ok=True)

@@ -2241,6 +2324,10 @@ class Worker(ServiceCommandSection):
        # make sure we will be able to access the cache folder (we assume we have the ability change mod)
        if sdk_cache_folder:
            sdk_cache_folder = Path(os.path.expandvars(sdk_cache_folder)).expanduser().absolute()
+            try:
+                sdk_cache_folder.chmod(0o0777)
+            except:
+                pass
            for f in sdk_cache_folder.rglob('*'):
                try:
                    f.chmod(0o0777)
@@ -2268,8 +2355,41 @@ class Worker(ServiceCommandSection):

        return command, script_dir

+    def _kill_daemon(self):
+        worker_id, worker_name = self._generate_worker_id_name()
+        # Iterate over all running processes
+        for pid, uid, slot, file in sorted(Singleton.get_running_pids(), key=lambda x: x[1] or ''):
+            # either we have a match for the worker_id or we just pick the first one
+            if pid >= 0 and (
+                    (worker_id and uid == worker_id) or
+                    (not worker_id and uid.startswith('{}:'.format(worker_name)))):
+                # this is us, kill it
+                print('Terminating trains-agent worker_id={} pid={}'.format(uid, pid))
+                if not terminate_process(pid, timeout=10):
+                    error('Could not terminate process pid={}'.format(pid))
+                return True
+        print('Could not find a running trains-agent instance with worker_name={} worker_id={}'.format(
+            worker_name, worker_id))
+        return False
+
    def _singleton(self):
        # ensure singleton
+        worker_id, worker_name = self._generate_worker_id_name()
+
+        # if we are running in services mode, we allow double register since
+        # docker-compose will kill instances before they cleanup
+        self.worker_id, worker_slot = Singleton.register_instance(
+            unique_worker_id=worker_id, worker_name=worker_name, api_client=self._session.api_client,
+            allow_double=bool(ENV_DOCKER_HOST_MOUNT.get())  # and bool(self._services_mode),
+        )
+
+        if self.worker_id is None:
+            error('Instance with the same WORKER_ID [{}] is already running'.format(worker_id))
+            exit(1)
+        # update folders based on free slot
+        self._session.create_cache_folders(slot_index=worker_slot)
+
+    def _generate_worker_id_name(self):
        worker_id = self._session.config["agent.worker_id"]
        worker_name = self._session.config["agent.worker_name"]
        if not worker_id and os.environ.get('NVIDIA_VISIBLE_DEVICES') is not None:
@@ -2280,18 +2400,7 @@ class Worker(ServiceCommandSection):
                pass
            else:
                worker_name = '{}:cpu'.format(worker_name)
-
-        # if we are running in services mode, we allow double register since
-        # docker-compose will kill instances before they cleanup
-        self.worker_id, worker_slot = Singleton.register_instance(
-            unique_worker_id=worker_id, worker_name=worker_name, api_client=self._session.api_client,
-            allow_double=bool(self._services_mode) and bool(ENV_DOCKER_HOST_MOUNT.get()))
-
-        if self.worker_id is None:
-            error('Instance with the same WORKER_ID [{}] is already running'.format(worker_id))
-            exit(1)
-        # update folders based on free slot
-        self._session.create_cache_folders(slot_index=worker_slot)
+        return worker_id, worker_name

    def _resolve_queue_names(self, queues, create_if_missing=False):
        if not queues:
--- a/trains_agent/external/__init__.py
+++ b/trains_agent/external/__init__.py
--- a/trains_agent/external/requirements_parser/__init__.py
+++ b/trains_agent/external/requirements_parser/__init__.py
@@ -0,0 +1,22 @@
+from .parser import parse   # noqa
+
+_MAJOR = 0
+_MINOR = 2
+_PATCH = 0
+
+
+def version_tuple():
+    '''
+    Returns a 3-tuple of ints that represent the version
+    '''
+    return (_MAJOR, _MINOR, _PATCH)
+
+
+def version():
+    '''
+    Returns a string representation of the version
+    '''
+    return '%d.%d.%d' % (version_tuple())
+
+
+__version__ = version()
--- a/trains_agent/external/requirements_parser/fragment.py
+++ b/trains_agent/external/requirements_parser/fragment.py
@@ -0,0 +1,44 @@
+import re
+
+# Copied from pip
+# https://github.com/pypa/pip/blob/281eb61b09d87765d7c2b92f6982b3fe76ccb0af/pip/index.py#L947
+HASH_ALGORITHMS = set(['sha1', 'sha224', 'sha384', 'sha256', 'sha512', 'md5'])
+
+extras_require_search = re.compile(
+    r'(?P<name>.+)\[(?P<extras>[^\]]+)\]').search
+
+
+def parse_fragment(fragment_string):
+    """Takes a fragment string and returns a dict of the components"""
+    fragment_string = fragment_string.lstrip('#')
+
+    try:
+        return dict(
+            key_value_string.split('=')
+            for key_value_string in fragment_string.split('&')
+        )
+    except ValueError:
+        raise ValueError(
+            'Invalid fragment string {fragment_string}'.format(
+                fragment_string=fragment_string
+            )
+        )
+
+
+def get_hash_info(d):
+    """Returns the first matching hashlib name and value from a dict"""
+    for key in d.keys():
+        if key.lower() in HASH_ALGORITHMS:
+            return key, d[key]
+
+    return None, None
+
+
+def parse_extras_require(egg):
+    if egg is not None:
+        match = extras_require_search(egg)
+        if match is not None:
+            name = match.group('name')
+            extras = match.group('extras')
+            return name, [extra.strip() for extra in extras.split(',')]
+    return egg, []
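Note on `fragment.py`: the two helpers above are small enough to try in isolation. The following is a minimal standalone sketch that copies their logic (without the error wrapping) and runs it on a made-up fragment; the package name `mypkg` and the subdirectory `src` are hypothetical values chosen only for illustration.

```python
import re

# Same pattern as in fragment.py: matches "name[extra1,extra2]"
extras_require_search = re.compile(
    r'(?P<name>.+)\[(?P<extras>[^\]]+)\]').search


def parse_fragment(fragment_string):
    """Split '#key=value&key2=value2' into a dict of its components."""
    fragment_string = fragment_string.lstrip('#')
    return dict(pair.split('=') for pair in fragment_string.split('&'))


def parse_extras_require(egg):
    """Split 'name[extra1,extra2]' into (name, [extras])."""
    if egg is not None:
        match = extras_require_search(egg)
        if match is not None:
            extras = match.group('extras').split(',')
            return match.group('name'), [extra.strip() for extra in extras]
    return egg, []


# Hypothetical URL fragment, as it would appear after the '#' of a VCS URL
fragment = parse_fragment('#egg=mypkg[dev,test]&subdirectory=src')
name, extras = parse_extras_require(fragment.get('egg'))
print(fragment)
print(name, extras)
```

A key/value pair containing more than one `=` makes `dict()` raise `ValueError`, which is the case the `try`/`except` in the real `parse_fragment` reports as an invalid fragment string.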
--- a/trains_agent/external/requirements_parser/parser.py
+++ b/trains_agent/external/requirements_parser/parser.py
@@ -0,0 +1,50 @@
+import os
+import warnings
+
+from .requirement import Requirement
+
+
+def parse(reqstr):
+    """
+    Parse a requirements file into a list of Requirements
+
+    See: pip/req.py:parse_requirements()
+
+    :param reqstr: a string or file like object containing requirements
+    :returns: a *generator* of Requirement objects
+    """
+    filename = getattr(reqstr, 'name', None)
+    try:
+        # Python 2.x compatibility
+        if not isinstance(reqstr, basestring):
+            reqstr = reqstr.read()
+    except NameError:
+        # Python 3.x only
+        if not isinstance(reqstr, str):
+            reqstr = reqstr.read()
+
+    for line in reqstr.splitlines():
+        line = line.strip()
+        if line == '':
+            continue
+        elif not line or line.startswith('#'):
+            # comments are lines that start with # only
+            continue
+        elif line.startswith('-r') or line.startswith('--requirement'):
+            _, new_filename = line.split()
+            new_file_path = os.path.join(os.path.dirname(filename or '.'),
+                                         new_filename)
+            with open(new_file_path) as f:
+                for requirement in parse(f):
+                    yield requirement
+        elif line.startswith('-f') or line.startswith('--find-links') or \
+                line.startswith('-i') or line.startswith('--index-url') or \
+                line.startswith('--extra-index-url') or \
+                line.startswith('--no-index'):
+            warnings.warn('Private repos not supported. Skipping.')
+            continue
+        elif line.startswith('-Z') or line.startswith('--always-unzip'):
+            warnings.warn('Unused option --always-unzip. Skipping.')
+            continue
+        else:
+            yield Requirement.parse(line)
--- a/trains_agent/external/requirements_parser/requirement.py
+++ b/trains_agent/external/requirements_parser/requirement.py
@@ -0,0 +1,236 @@
+from __future__ import unicode_literals
+import re
+from pkg_resources import Requirement as Req
+
+from .fragment import get_hash_info, parse_fragment, parse_extras_require
+from .vcs import VCS, VCS_SCHEMES
+
+
+URI_REGEX = re.compile(
+    r'^(?P<scheme>https?|file|ftps?)://(?P<path>[^#]+)'
+    r'(#(?P<fragment>\S+))?'
+)
+
+VCS_REGEX = re.compile(
+    r'^(?P<scheme>{0})://'.format(r'|'.join(
+        [scheme.replace('+', r'\+') for scheme in VCS_SCHEMES])) +
+    r'((?P<login>[^/@]+)@)?'
+    r'(?P<path>[^#@]+)'
+    r'(@(?P<revision>[^#]+))?'
+    r'(#(?P<fragment>\S+))?'
+)
+
+# This matches just about everything
+LOCAL_REGEX = re.compile(
+    r'^((?P<scheme>file)://)?'
+    r'(?P<path>[^#]+)' +
+    r'(#(?P<fragment>\S+))?'
+)
+
+
+class Requirement(object):
+    """
+    Represents a single requirement
+
+    Typically instances of this class are created with ``Requirement.parse``.
+    For local file requirements, there's no verification that the file
+    exists. This class attempts to be *dict-like*.
+
+    See: http://www.pip-installer.org/en/latest/logic.html
+
+    **Members**:
+
+    * ``line`` - the actual requirement line being parsed
+    * ``editable`` - a boolean whether this requirement is "editable"
+    * ``local_file`` - a boolean whether this requirement is a local file/path
+    * ``specifier`` - a boolean whether this requirement used a requirement
+      specifier (eg. "django>=1.5" or "requirements")
+    * ``vcs`` - a string specifying the version control system
+    * ``revision`` - a version control system specifier
+    * ``name`` - the name of the requirement
+    * ``uri`` - the URI if this requirement was specified by URI
+    * ``subdirectory`` - the subdirectory fragment of the URI
+    * ``path`` - the local path to the requirement
+    * ``hash_name`` - the type of hashing algorithm indicated in the line
+    * ``hash`` - the hash value indicated by the requirement line
+    * ``extras`` - a list of extras for this requirement
+      (eg. "mymodule[extra1, extra2]")
+    * ``specs`` - a list of specs for this requirement
+      (eg. "mymodule>1.5,<1.6" => [('>', '1.5'), ('<', '1.6')])
+    """
+
+    def __init__(self, line):
+        # Do not call this private method
+        self.line = line
+        self.editable = False
+        self.local_file = False
+        self.specifier = False
+        self.vcs = None
+        self.name = None
+        self.subdirectory = None
+        self.uri = None
+        self.path = None
+        self.revision = None
+        self.hash_name = None
+        self.hash = None
+        self.extras = []
+        self.specs = []
+
+    def __repr__(self):
+        return '<Requirement: "{0}">'.format(self.line)
+
+    def __getitem__(self, key):
+        return getattr(self, key)
+
+    def keys(self):
+        return self.__dict__.keys()
+
+    @classmethod
+    def parse_editable(cls, line):
+        """
+        Parses a Requirement from an "editable" requirement which is either
+        a local project path or a VCS project URI.
+
+        See: pip/req.py:from_editable()
+
+        :param line: an "editable" requirement
+        :returns: a Requirement instance for the given line
+        :raises: ValueError on an invalid requirement
+        """
+
+        req = cls('-e {0}'.format(line))
+        req.editable = True
+        vcs_match = VCS_REGEX.match(line)
+        local_match = LOCAL_REGEX.match(line)
+
+        if vcs_match is not None:
+            groups = vcs_match.groupdict()
+            if groups.get('login'):
+                req.uri = '{scheme}://{login}@{path}'.format(**groups)
+            else:
+                req.uri = '{scheme}://{path}'.format(**groups)
+            req.revision = groups['revision']
+            if groups['fragment']:
+                fragment = parse_fragment(groups['fragment'])
+                egg = fragment.get('egg')
+                req.name, req.extras = parse_extras_require(egg)
+                req.hash_name, req.hash = get_hash_info(fragment)
+                req.subdirectory = fragment.get('subdirectory')
+            for vcs in VCS:
+                if req.uri.startswith(vcs):
+                    req.vcs = vcs
+        else:
+            assert local_match is not None, 'This should match everything'
+            groups = local_match.groupdict()
+            req.local_file = True
+            if groups['fragment']:
+                fragment = parse_fragment(groups['fragment'])
+                egg = fragment.get('egg')
+                req.name, req.extras = parse_extras_require(egg)
+                req.hash_name, req.hash = get_hash_info(fragment)
+                req.subdirectory = fragment.get('subdirectory')
+            req.path = groups['path']
+
+        return req
+
+    @classmethod
+    def parse_line(cls, line):
+        """
+        Parses a Requirement from a non-editable requirement.
+
+        See: pip/req.py:from_line()
+
+        :param line: a "non-editable" requirement
+        :returns: a Requirement instance for the given line
+        :raises: ValueError on an invalid requirement
+        """
+
+        req = cls(line)
+
+        vcs_match = VCS_REGEX.match(line)
+        uri_match = URI_REGEX.match(line)
+        local_match = LOCAL_REGEX.match(line)
+
+        if vcs_match is not None:
+            groups = vcs_match.groupdict()
+            if groups.get('login'):
+                req.uri = '{scheme}://{login}@{path}'.format(**groups)
+            else:
+                req.uri = '{scheme}://{path}'.format(**groups)
+            req.revision = groups['revision']
+            if groups['fragment']:
+                fragment = parse_fragment(groups['fragment'])
+                egg = fragment.get('egg')
+                req.name, req.extras = parse_extras_require(egg)
+                req.hash_name, req.hash = get_hash_info(fragment)
+                req.subdirectory = fragment.get('subdirectory')
+            for vcs in VCS:
+                if req.uri.startswith(vcs):
+                    req.vcs = vcs
+        elif uri_match is not None:
+            groups = uri_match.groupdict()
+            req.uri = '{scheme}://{path}'.format(**groups)
+            if groups['fragment']:
+                fragment = parse_fragment(groups['fragment'])
+                egg = fragment.get('egg')
+                req.name, req.extras = parse_extras_require(egg)
+                req.hash_name, req.hash = get_hash_info(fragment)
+                req.subdirectory = fragment.get('subdirectory')
+            if groups['scheme'] == 'file':
+                req.local_file = True
+        elif '#egg=' in line:
+            # Assume a local file match
+            assert local_match is not None, 'This should match everything'
+            groups = local_match.groupdict()
+            req.local_file = True
+            if groups['fragment']:
+                fragment = parse_fragment(groups['fragment'])
+                egg = fragment.get('egg')
+                name, extras = parse_extras_require(egg)
+                req.name, req.extras = name, extras
+                req.hash_name, req.hash = get_hash_info(fragment)
+                req.subdirectory = fragment.get('subdirectory')
+            req.path = groups['path']
+        else:
+            # This is a requirement specifier.
+            # Delegate to pkg_resources and hope for the best
+            req.specifier = True
+            pkg_req = Req.parse(line)
+            req.name = pkg_req.unsafe_name
+            req.extras = list(pkg_req.extras)
+            req.specs = pkg_req.specs
+        return req
+
+    @classmethod
+    def parse(cls, line):
+        """
+        Parses a Requirement from a line of a requirement file.
+
+        :param line: a line of a requirement file
+        :returns: a Requirement instance for the given line
+        :raises: ValueError on an invalid requirement
+        """
+        line = line.lstrip()
+        if line.startswith('-e') or line.startswith('--editable'):
+            # Editable installs are either a local project path
+            # or a VCS project URI
+            return cls.parse_editable(
+                re.sub(r'^(-e|--editable=?)\s*', '', line))
+        elif '@' in line:
+            # Allegro bug fix: support 'name @ git+' entries
+            name, vcs = line.split('@', 1)
+            name = name.strip()
+            vcs = vcs.strip()
+            # noinspection PyBroadException
+            try:
+                # check if the name is valid & parsed
+                Req.parse(name)
+                # if we are here, name is a valid package name, check if the vcs part is valid
+                if VCS_REGEX.match(vcs):
+                    req = cls.parse_line(vcs)
+                    req.name = name
+                    return req
+            except Exception:
+                pass
+
+        return cls.parse_line(line)
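Note on the `name @ git+...` branch in `Requirement.parse`: the sketch below illustrates the split-then-validate idea in isolation. It is a simplified stand-in rather than the code above: it uses only a subset of `VCS_SCHEMES`, and the `NAME_RE` pattern is a hypothetical replacement for the `pkg_resources` name check (`Req.parse(name)`). The helper name `split_direct_reference` and the sample URLs are likewise made up for the example.

```python
import re

# Subset of the VCS_SCHEMES list from vcs.py, enough for this illustration
VCS_SCHEMES = ['git', 'git+https', 'git+ssh', 'git+git']

# Same structure as VCS_REGEX in requirement.py
VCS_REGEX = re.compile(
    r'^(?P<scheme>{0})://'.format(
        '|'.join(s.replace('+', r'\+') for s in VCS_SCHEMES)) +
    r'((?P<login>[^/@]+)@)?'
    r'(?P<path>[^#@]+)'
    r'(@(?P<revision>[^#]+))?'
    r'(#(?P<fragment>\S+))?'
)

# Assumption: a simple pattern standing in for pkg_resources' name validation
NAME_RE = re.compile(r'^[A-Za-z0-9][A-Za-z0-9._-]*$')


def split_direct_reference(line):
    """Return (name, regex groups) for 'name @ vcs-url', else None."""
    if '@' not in line:
        return None
    # Split on the first '@' only; the URL may contain further '@' characters
    name, vcs = (part.strip() for part in line.split('@', 1))
    if not NAME_RE.match(name):
        return None
    match = VCS_REGEX.match(vcs)
    if match is None:
        return None
    return name, match.groupdict()


result = split_direct_reference(
    'mypkg @ git+https://github.com/user/repo.git@v1.0#egg=mypkg')
print(result)

# A plain VCS URL also contains '@', but the text before it is not a valid
# package name, so the caller would fall back to regular line parsing.
print(split_direct_reference('git+ssh://git@github.com/user/repo.git'))
```

Splitting on the first `@` is what lets the revision marker (`@v1.0`) stay attached to the URL part, where the regex picks it up as `revision`.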
--- a/trains_agent/external/requirements_parser/vcs.py
+++ b/trains_agent/external/requirements_parser/vcs.py
@@ -0,0 +1,30 @@
+from __future__ import unicode_literals
+
+VCS = [
+    'git',
+    'hg',
+    'svn',
+    'bzr',
+]
+
+VCS_SCHEMES = [
+    'git',
+    'git+https',
+    'git+ssh',
+    'git+git',
+    'hg+http',
+    'hg+https',
+    'hg+static-http',
+    'hg+ssh',
+    'svn',
+    'svn+svn',
+    'svn+http',
+    'svn+https',
+    'svn+ssh',
+    'bzr+http',
+    'bzr+https',
+    'bzr+ssh',
+    'bzr+sftp',
+    'bzr+ftp',
+    'bzr+lp',
+]
--- a/trains_agent/helper/base.py
+++ b/trains_agent/helper/base.py
@@ -173,13 +173,30 @@ def normalize_path(*paths):


 def safe_remove_file(filename, error_message=None):
+    # noinspection PyBroadException
    try:
-        os.remove(filename)
+        if filename:
+            os.remove(filename)
    except Exception:
        if error_message:
            print(error_message)


+def safe_remove_tree(filename):
+    if not filename:
+        return
+    # noinspection PyBroadException
+    try:
+        shutil.rmtree(filename, ignore_errors=True)
+    except Exception:
+        pass
+    # noinspection PyBroadException
+    try:
+        os.remove(filename)
+    except Exception:
+        pass
+
+
 def get_python_path(script_dir, entry_point, package_api):
    try:
        python_path_sep = ';' if is_windows_platform() else ':'
@@ -555,3 +572,17 @@ class ExecutionInfo(NonStrictAttrs):
            execution.working_dir = working_dir or ""

        return execution
+
+
+class safe_furl(furl.furl):
+
+    @property
+    def port(self):
+        return self._port
+
+    @port.setter
+    def port(self, port):
+        """
+        Any port value is valid
+        """
+        self._port = port
--- a/trains_agent/helper/gpu/gpustat.py
+++ b/trains_agent/helper/gpu/gpustat.py
@@ -200,24 +200,30 @@ class GPUStatCollection(object):
                    GPUStatCollection.global_processes[nv_process.pid] = \
                        psutil.Process(pid=nv_process.pid)
                ps_process = GPUStatCollection.global_processes[nv_process.pid]
-                process['username'] = ps_process.username()
-                # cmdline returns full path;
-                # as in `ps -o comm`, get short cmdnames.
-                _cmdline = ps_process.cmdline()
-                if not _cmdline:
-                    # sometimes, zombie or unknown (e.g. [kworker/8:2H])
-                    process['command'] = '?'
-                    process['full_command'] = ['?']
-                else:
-                    process['command'] = os.path.basename(_cmdline[0])
-                    process['full_command'] = _cmdline
-                # Bytes to MBytes
-                process['gpu_memory_usage'] = nv_process.usedGpuMemory // MB
-                process['cpu_percent'] = ps_process.cpu_percent()
-                process['cpu_memory_usage'] = \
-                    round((ps_process.memory_percent() / 100.0) *
-                          psutil.virtual_memory().total)
                process['pid'] = nv_process.pid
+                # noinspection PyBroadException
+                try:
+                    # we do not actually use these, so no point in collecting them
+                    # process['username'] = ps_process.username()
+                    # # cmdline returns full path;
+                    # # as in `ps -o comm`, get short cmdnames.
+                    # _cmdline = ps_process.cmdline()
+                    # if not _cmdline:
+                    #     # sometimes, zombie or unknown (e.g. [kworker/8:2H])
+                    #     process['command'] = '?'
+                    #     process['full_command'] = ['?']
+                    # else:
+                    #     process['command'] = os.path.basename(_cmdline[0])
+                    #     process['full_command'] = _cmdline
+                    # process['cpu_percent'] = ps_process.cpu_percent()
+                    # process['cpu_memory_usage'] = \
+                    #     round((ps_process.memory_percent() / 100.0) *
+                    #           psutil.virtual_memory().total)
+                    # Bytes to MBytes
+                    process['gpu_memory_usage'] = nv_process.usedGpuMemory // MB
+                except Exception:
+                    # insufficient permissions
+                    pass
                return process

            if not GPUStatCollection._gpu_device_info.get(index):
@@ -285,12 +291,13 @@ class GPUStatCollection(object):
                        # e.g. nvidia-smi reset  or  reboot the system
                        pass

-                # TODO: Do not block if full process info is not requested
-                time.sleep(0.1)
-                for process in processes:
-                    pid = process['pid']
-                    cache_process = GPUStatCollection.global_processes[pid]
-                    process['cpu_percent'] = cache_process.cpu_percent()
+                # we do not actually use these, so no point in collecting them
+                # # TODO: Do not block if full process info is not requested
+                # time.sleep(0.1)
+                # for process in processes:
+                #     pid = process['pid']
+                #     cache_process = GPUStatCollection.global_processes[pid]
+                #     process['cpu_percent'] = cache_process.cpu_percent()

            index = N.nvmlDeviceGetIndex(handle)
            gpu_info = {
--- a/trains_agent/helper/package/base.py
+++ b/trains_agent/helper/package/base.py
@@ -111,10 +111,12 @@ class PackageManager(object):
    def out_of_scope_install_package(cls, package_name, *args):
        if PackageManager._selected_manager is not None:
            try:
-                return PackageManager._selected_manager._install(package_name, *args)
+                result = PackageManager._selected_manager._install(package_name, *args)
+                if result not in (0, None, True):
+                    return False
            except Exception:
-                pass
-        return
+                return False
+        return True

    @classmethod
    def out_of_scope_freeze(cls):
--- a/trains_agent/helper/package/conda_api.py
+++ b/trains_agent/helper/package/conda_api.py
@@ -14,8 +14,8 @@ import yaml
 from time import time
 from attr import attrs, attrib, Factory
 from pathlib2 import Path
-from requirements import parse
-from requirements.requirement import Requirement
+from trains_agent.external.requirements_parser import parse
+from trains_agent.external.requirements_parser.requirement import Requirement

 from trains_agent.errors import CommandFailedError
 from trains_agent.helper.base import rm_tree, NonStrictAttrs, select_for_platform, is_windows_platform
@@ -378,7 +378,7 @@ class CondaAPI(PackageManager):
                print(e)
                raise e

-        self.requirements_manager.post_install()
+        self.requirements_manager.post_install(self.session)
        return True

    def _parse_conda_result_bad_packges(self, result_dict):
--- a/trains_agent/helper/package/external_req.py
+++ b/trains_agent/helper/package/external_req.py
@@ -1,8 +1,10 @@
+import re
 from collections import OrderedDict
 from typing import Text

 from .base import PackageManager
 from .requirements import SimpleSubstitution
+from ..base import safe_furl as furl


 class ExternalRequirements(SimpleSubstitution):
@@ -22,7 +24,7 @@ class ExternalRequirements(SimpleSubstitution):
            return False
        return True

-    def post_install(self):
+    def post_install(self, session):
        post_install_req = self.post_install_req
        self.post_install_req = []
        for req in post_install_req:
@@ -30,7 +32,34 @@ class ExternalRequirements(SimpleSubstitution):
                freeze_base = PackageManager.out_of_scope_freeze() or ''
            except:
                freeze_base = ''
-            PackageManager.out_of_scope_install_package(req.tostr(markers=False), "--no-deps")
+
+            req_line = req.tostr(markers=False)
+            if req_line.strip().startswith('-e ') or req_line.strip().startswith('--editable'):
+                req_line = re.sub(r'^(-e|--editable=?)\s*', '', req_line, count=1)
+
+            if req.req.vcs and req_line.startswith('git+'):
+                try:
+                    url_no_frag = furl(req_line)
+                    url_no_frag.set(fragment=None)
+                    # reverse replace
+                    fragment = req_line[::-1].replace(url_no_frag.url[::-1], '', 1)[::-1]
+                    vcs_url = req_line[4:]
+                    # reverse replace
+                    vcs_url = vcs_url[::-1].replace(fragment[::-1], '', 1)[::-1]
+                    from ..repo import Git
+                    vcs = Git(session=session, url=vcs_url, location=None, revision=None)
+                    vcs._set_ssh_url()
+                    new_req_line = 'git+{}{}'.format(vcs.url_with_auth, fragment)
+                    if new_req_line != req_line:
+                        furl_line = furl(new_req_line)
+                        print('Replacing original pip vcs \'{}\' with \'{}\''.format(
+                            req_line,
+                            furl_line.set(password='xxxxxx').tostr() if furl_line.password else new_req_line))
+                        req_line = new_req_line
+                except Exception:
+                    print('WARNING: Failed parsing pip git install, using original line {}'.format(req_line))
+
+            PackageManager.out_of_scope_install_package(req_line, "--no-deps")
            try:
                freeze_post = PackageManager.out_of_scope_freeze() or ''
                package_name = list(set(freeze_post['pip']) - set(freeze_base['pip']))
@@ -38,7 +67,8 @@ class ExternalRequirements(SimpleSubstitution):
                    self.post_install_req_lookup[package_name[0]] = req.req.line
            except:
                pass
-            PackageManager.out_of_scope_install_package(req.tostr(markers=False), "--ignore-installed")
+            if not PackageManager.out_of_scope_install_package(req_line, "--ignore-installed"):
+                raise ValueError("Failed installing GIT/HTTPs package \'{}\'".format(req_line))
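Note on the "reverse replace" comments in `post_install` above: `str.replace(old, new, 1)` only removes the first occurrence, so the code reverses the string, removes the first occurrence of the reversed piece, and reverses back, which amounts to removing the last occurrence. A minimal sketch of that idiom follows; the helper name `remove_last` and the sample URL are hypothetical and not part of the change.

```python
def remove_last(text, piece):
    """Remove the last occurrence of `piece` from `text` (no-op if absent)."""
    return text[::-1].replace(piece[::-1], '', 1)[::-1]


# Strip a trailing fragment from a pip VCS line
line = 'git+https://example.com/repo.git#egg=repo'
print(remove_last(line, '#egg=repo'))  # git+https://example.com/repo.git

# Only the last match is removed, earlier ones are left alone
print(remove_last('a-b-a', 'a'))  # a-b-
```

This matters for VCS lines because the fragment text could in principle also appear earlier in the URL, and only the trailing copy should be stripped.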

    def replace(self, req):
        """
--- a/trains_agent/helper/package/horovod_req.py
+++ b/trains_agent/helper/package/horovod_req.py
@@ -16,7 +16,7 @@ class HorovodRequirement(SimpleSubstitution):
        # match both horovod
        return req.name and self.name == req.name.lower()

-    def post_install(self):
+    def post_install(self, session):
        if self.post_install_req:
            PackageManager.out_of_scope_install_package(self.post_install_req.tostr(markers=False))
            self.post_install_req = None
--- a/trains_agent/helper/package/pip_api/venv.py
+++ b/trains_agent/helper/package/pip_api/venv.py
@@ -37,7 +37,7 @@ class VirtualenvPip(SystemPip, PackageManager):
        if isinstance(requirements, dict) and requirements.get("pip"):
            requirements["pip"] = self.requirements_manager.replace(requirements["pip"])
        super(VirtualenvPip, self).load_requirements(requirements)
-        self.requirements_manager.post_install()
+        self.requirements_manager.post_install(self.session)

    def create_flags(self):
        """
--- a/trains_agent/helper/package/pytorch.py
+++ b/trains_agent/helper/package/pytorch.py
@@ -270,7 +270,7 @@ class PytorchRequirement(SimpleSubstitution):
    def get_url_for_platform(self, req):
        # check if package is already installed with system packages
        try:
-            if self.config.get("agent.package_manager.system_site_packages"):
+            if self.config.get("agent.package_manager.system_site_packages", None):
                from pip._internal.commands.show import search_packages_info
                installed_torch = list(search_packages_info([req.name]))
                # notice the comparision order, the first part will make sure we have a valid installed package
@@ -295,7 +295,7 @@ class PytorchRequirement(SimpleSubstitution):

        torch_url, torch_url_key = SimplePytorchRequirement.get_torch_page(self.cuda_version)
        url = self._get_link_from_torch_page(req, torch_url)
-        if not url and self.config.get("agent.package_manager.torch_nightly"):
+        if not url and self.config.get("agent.package_manager.torch_nightly", None):
            torch_url, torch_url_key = SimplePytorchRequirement.get_torch_page(self.cuda_version, nightly=True)
            url = self._get_link_from_torch_page(req, torch_url)
        # try one more time, with a lower cuda version (never fallback to CPU):
--- a/trains_agent/helper/package/requirements.py
+++ b/trains_agent/helper/package/requirements.py
@@ -12,15 +12,15 @@ from typing import Text, List, Type, Optional, Tuple, Dict

 from pathlib2 import Path
 from pyhocon import ConfigTree
-from requirements import parse
-# noinspection PyPackageRequirements
-from requirements.requirement import Requirement

 import six
 from trains_agent.definitions import PIP_EXTRA_INDICES
 from trains_agent.helper.base import warning, is_conda, which, join_lines, is_windows_platform
 from trains_agent.helper.process import Argv, PathLike
 from trains_agent.session import Session, normalize_cuda_version
+from trains_agent.external.requirements_parser import parse
+from trains_agent.external.requirements_parser.requirement import Requirement
+
 from .translator import RequirementsTranslator


@@ -54,7 +54,17 @@ class MarkerRequirement(object):

            if self.specifier:
                parts.append(self.format_specs())
-
+        elif self.vcs:
+            # leave the line as is, let pip handle it
+            if self.line:
+                return self.line
+            else:
+                # let's build the line manually
+                parts = [
+                    self.uri,
+                    '@{}'.format(self.revision) if self.revision else '',
+                    '#subdirectory={}'.format(self.subdirectory) if self.subdirectory else ''
+                ]
        else:
            parts = [self.uri]

@@ -316,7 +326,7 @@ class RequirementSubstitution(object):
        """
        pass

-    def post_install(self):
+    def post_install(self, session):
        pass

    @classmethod
@@ -472,12 +482,13 @@ class RequirementsManager(object):
            result = map(self.translator.translate, result)
        return join_lines(result)

-    def post_install(self):
+    def post_install(self, session):
        for h in self.handlers:
            try:
-                h.post_install()
+                h.post_install(session)
            except Exception as ex:
                print('RequirementsManager handler {} raised exception: {}'.format(h, ex))
+                raise

    def replace_back(self, requirements):
        for h in self.handlers:
--- a/trains_agent/helper/process.py
+++ b/trains_agent/helper/process.py
@@ -11,6 +11,7 @@ from copy import deepcopy
 from distutils.spawn import find_executable
 from itertools import chain, repeat, islice
 from os.path import devnull
+from time import sleep
 from typing import Union, Text, Sequence, Any, TypeVar, Callable

 import psutil
@@ -41,6 +42,30 @@ def get_bash_output(cmd, strip=False, stderr=subprocess.STDOUT, stdin=False):
    return output if not strip or not output else output.strip()


+def terminate_process(pid, timeout=10.):
+    # noinspection PyBroadException
+    try:
+        proc = psutil.Process(pid)
+        proc.terminate()
+        cnt = 0
+        while proc.is_running() and cnt < timeout:
+            sleep(1.)
+            cnt += 1
+        proc.terminate()
+        cnt = 0
+        while proc.is_running() and cnt < timeout:
+            sleep(1.)
+            cnt += 1
+        proc.kill()
+    except Exception:
+        pass
+    # noinspection PyBroadException
+    try:
+        return not psutil.Process(pid).is_running()
+    except Exception:
+        return True
+
+
 def kill_all_child_processes(pid=None):
    # get current process if pid not provided
    include_parent = True
--- a/trains_agent/helper/repo.py
+++ b/trains_agent/helper/repo.py
@@ -97,7 +97,7 @@ class VCS(object):
        :param session: program session
        :param url: repository url
        :param location: (desired) clone location
-        :param: desired clone revision
+        :param revision: desired clone revision
        """
        self.session = session
        self.log = self.session.get_logger(
@@ -208,7 +208,7 @@ class VCS(object):
    )

    @classmethod
-    def resolve_ssh_url(cls, url):
+    def replace_ssh_url(cls, url):
        # type: (Text) -> Text
        """
        Replace SSH URL with HTTPS URL when applicable
@@ -242,11 +242,37 @@ class VCS(object):
            ).url
        return url

+    @classmethod
+    def replace_http_url(cls, url):
+        # type: (Text) -> Text
+        """
+        Replace HTTPS URL with SSH URL when applicable
+        """
+        parsed_url = furl(url)
+        if parsed_url.scheme == "https":
+            parsed_url.scheme = "ssh"
+            parsed_url.username = "git"
+            parsed_url.password = None
+            # make sure there is no port in the final url (safe_furl support)
+            parsed_url.port = None
+            url = parsed_url.url
+        return url
+
    def _set_ssh_url(self):
        """
        Replace instance URL with SSH substitution result and report to log.
        According to ``man ssh-add``, ``SSH_AUTH_SOCK`` must be set in order for ``ssh-add`` to work.
        """
+        if self.session.config.get('agent.force_git_ssh_protocol', None) and self.url:
+            parsed_url = furl(self.url)
+            if parsed_url.scheme == "https":
+                new_url = self.replace_http_url(self.url)
+                if new_url != self.url:
+                    print("Using SSH credentials - replacing https url '{}' with ssh url '{}'".format(
+                        self.url, new_url))
+                    self.url = new_url
+                return
+
        if not self.session.config.agent.translate_ssh:
            return

@@ -255,7 +281,7 @@ class VCS(object):
                (ENV_AGENT_GIT_USER.get() or self.session.config.get('agent.git_user', None)) and
                (ENV_AGENT_GIT_PASS.get() or self.session.config.get('agent.git_pass', None))
        ):
-            new_url = self.resolve_ssh_url(self.url)
+            new_url = self.replace_ssh_url(self.url)
            if new_url != self.url:
                print("Using user/pass credentials - replacing ssh url '{}' with https url '{}'".format(
                    self.url, new_url))
@@ -396,7 +422,10 @@ class VCS(object):
        Add username and password to URL if missing from URL and present in config.
        Does not modify ssh URLs.
        """
-        parsed_url = furl(url)
+        try:
+            parsed_url = furl(url)
+        except ValueError:
+            return url
        if parsed_url.scheme in ["", "ssh"] or parsed_url.scheme.startswith("git"):
            return parsed_url.url
        config_user = ENV_AGENT_GIT_USER.get() or config.get("agent.{}_user".format(cls.executable_name), None)
--- a/trains_agent/helper/singleton.py
+++ b/trains_agent/helper/singleton.py
@@ -37,6 +37,10 @@ class Singleton(object):
        except:
            pass

+    @classmethod
+    def get_lock_filename(cls):
+        return os.path.join(cls._get_temp_folder(), cls._lock_file_name)
+
    @classmethod
    def register_instance(cls, unique_worker_id=None, worker_name=None, api_client=None, allow_double=False):
        """
@@ -47,7 +51,7 @@ class Singleton(object):
        :return: (str worker_id, int slot_number) Return None value on instance already running
        """
        # try to lock file
-        lock_file = os.path.join(cls._get_temp_folder(), cls._lock_file_name)
+        lock_file = cls.get_lock_filename()
        timeout = 0
        while os.path.exists(lock_file):
            if timeout > cls._lock_timeout:
@@ -79,30 +83,41 @@ class Singleton(object):
        return ret

    @classmethod
-    def _register_instance(cls, unique_worker_id=None, worker_name=None, api_client=None, allow_double=False):
-        if cls.worker_id:
-            return cls.worker_id, cls.instance_slot
-        # make sure we have a unique name
-        instance_num = 0
+    def get_running_pids(cls):
        temp_folder = cls._get_temp_folder()
        files = glob(os.path.join(temp_folder, cls.prefix + cls.sep + '*' + cls.ext))
-        slots = {}
+        pids = []
        for file in files:
            parts = file.split(cls.sep)
+            # noinspection PyBroadException
            try:
                pid = int(parts[1])
+                if not psutil.pid_exists(pid):
+                    pid = -1
            except Exception:
                # something is wrong, use non existing pid and delete the file
                pid = -1

            uid, slot = None, None
+            # noinspection PyBroadException
            try:
                with open(file, 'r') as f:
                    uid, slot = str(f.read()).split('\n')
                    slot = int(slot)
            except Exception:
                pass
+            pids.append((pid, uid, slot, file))

+        return pids
+
+    @classmethod
+    def _register_instance(cls, unique_worker_id=None, worker_name=None, api_client=None, allow_double=False):
+        if cls.worker_id:
+            return cls.worker_id, cls.instance_slot
+        # make sure we have a unique name
+        instance_num = 0
+        slots = {}
+        for pid, uid, slot, file in cls.get_running_pids():
            worker = None
            if api_client and ENV_DOCKER_HOST_MOUNT.get() and uid:
                try:
@@ -111,7 +126,7 @@ class Singleton(object):
                    worker = None

            # count active instances and delete dead files
-            if not worker and not psutil.pid_exists(pid):
+            if not worker and pid < 0:
                # delete the file
                try:
                    os.remove(os.path.join(file))
@@ -165,3 +180,9 @@ class Singleton(object):
    @classmethod
    def get_slot(cls):
        return cls.instance_slot or 0
+
+    @classmethod
+    def get_pid_file(cls):
+        if not cls._pid_file:
+            return None
+        return cls._pid_file.name
--- a/trains_agent/interface/worker.py
+++ b/trains_agent/interface/worker.py
@@ -68,6 +68,10 @@ DAEMON_ARGS = dict({
        'dest': 'queues',
        'type': foreign_object_id('queues'),
    },
+    '--order-fairness': {
+        'help': 'Pull from each queue in a round-robin order, instead of priority order.',
+        'action': 'store_true',
+    },
    '--standalone-mode': {
        'help': 'Do not use any network connects, assume everything is pre-installed',
        'action': 'store_true',
@@ -85,6 +89,10 @@ DAEMON_ARGS = dict({
        'action': 'store_true',
        'aliases': ['-d'],
    },
+    '--stop': {
+        'help': 'Stop the running agent (based on the same set of arguments)',
+        'action': 'store_true',
+    },

 }, **WORKER_ARGS)

--- a/trains_agent/session.py
+++ b/trains_agent/session.py
@@ -73,9 +73,11 @@ class Session(_Session):
            os.environ[LOCAL_CONFIG_FILE_OVERRIDE_VAR] = config_file
            if not Path(config_file).is_file():
                raise ValueError("Could not open configuration file: {}".format(config_file))
+
        cpu_only = kwargs.get('cpu_only')
        if cpu_only:
            os.environ['CUDA_VISIBLE_DEVICES'] = os.environ['NVIDIA_VISIBLE_DEVICES'] = 'none'
+
        if kwargs.get('gpus') and not os.environ.get('KUBERNETES_SERVICE_HOST') \
                and not os.environ.get('KUBERNETES_PORT'):
            # CUDA_VISIBLE_DEVICES does not support 'all'
@@ -84,6 +86,7 @@ class Session(_Session):
                os.environ['NVIDIA_VISIBLE_DEVICES'] = kwargs.get('gpus')
            else:
                os.environ['CUDA_VISIBLE_DEVICES'] = os.environ['NVIDIA_VISIBLE_DEVICES'] = kwargs.get('gpus')
+
        if kwargs.get('only_load_config'):
            from trains_agent.backend_api.config import load
            self.config = load()
--- a/trains_agent/version.py
+++ b/trains_agent/version.py
@@ -1 +1 @@
-__version__ = '0.15.0'
+__version__ = '0.16.0'
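
The `replace_http_url` classmethod added to `trains_agent/helper/repo.py` in this diff rewrites an `https://` clone URL to an `ssh://git@host/...` URL, dropping any password and port. Below is a minimal sketch of the same rewrite, assuming the standard library's `urllib.parse` in place of the `furl` library the agent actually uses. The function name `https_to_ssh` and its `user` parameter are hypothetical and not part of trains-agent.

```python
from urllib.parse import urlsplit, urlunsplit


def https_to_ssh(url, user="git"):
    """Rewrite an https:// git URL as ssh://<user>@host/path.

    Hypothetical helper for illustration only. Any username, password,
    or port embedded in the original URL is discarded, mirroring what
    the diff does with furl.
    """
    parts = urlsplit(url)
    if parts.scheme != "https":
        # Leave ssh://, http://, and scp-style URLs untouched
        return url
    netloc = "{}@{}".format(user, parts.hostname or "")
    return urlunsplit(("ssh", netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    # prints: ssh://git@github.com/allegroai/trains-agent.git
    print(https_to_ssh("https://user:secret@github.com:443/allegroai/trains-agent.git"))
```

As in the diff, only the `https` scheme is rewritten; plain `http` URLs pass through unchanged.
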
Author	SHA1	Message	Date
allegroai	121dec2a62	Version bump to v0.16.0	2020-08-10 17:28:00 +03:00
allegroai	4aacf9005e	Fix GPU Windows monitoring support (Trains Issue #177)	2020-08-10 08:07:51 +03:00
allegroai	6b333202e9	Sync generated conf file with latest Trains	2020-08-08 14:44:45 +03:00
allegroai	ce6831368f	Fix GPU monitoring on Windows machines	2020-08-08 14:43:25 +03:00
allegroai	e4111c830b	Fix GIT user/pass in requirements and support for '-e git+http' lines	2020-07-30 14:30:23 +03:00
allegroai	52c1772b04	Add requirement_parser into trains-agent instead as a dependency. Fix requirement_parser to support 'package @ git+http' lines	2020-07-30 14:29:37 +03:00
allegroai	699d13bbb3	Fix task status change to queued should also never happen during Task runtime	2020-07-14 23:42:11 +03:00
allegroai	2c8d7d3d9a	Fix --debug to set all specified loggers to DEBUG; Add set_urllib_log_level, in debug set urllib log level to DEBUG	2020-07-11 01:45:46 +03:00
allegroai	b13cc1e8e7	Add error message when Trains API Server is not accessible on startup	2020-07-11 01:44:45 +03:00
allegroai	17d2bf2a3e	Change daemon --stop without any specific flag to terminate the agents by worker id lexicographic order	2020-07-11 01:43:54 +03:00
allegroai	94997f9c88	Add daemon --order-fairness for round-robin queue pulling; Add daemon --stop to terminate running agent (assume all the rest of the arguments are the same); Clean up all log files on termination unless executed with --debug	2020-07-11 01:42:56 +03:00
allegroai	c6d998c4df	Add terminate process and rmtree utilities	2020-07-11 01:40:50 +03:00
allegroai	f8ea445339	Fix docker to use UTF-8 encoding, so prints won't break it	2020-07-11 01:40:14 +03:00
allegroai	712efa208b	version bump	2020-07-06 21:09:21 +03:00
allegroai	09b6b6a9de	Fix non-root docker image usage; Fix broken trains-agent build; Improve support for dockers with preinstalled conda env; Improve trains-agent-docker spinning	2020-07-06 21:09:11 +03:00
allegroai	98ff9a50e6	Changed agent.docker_init_bash_script default value in comment	2020-07-06 21:05:55 +03:00
allegroai	1f4d358316	Changed default docker image from nvidia/cuda to "nvidia/cuda:10.1-runtime-ubuntu18.04" to support cudnn frameworks (TF)	2020-07-02 01:35:57 +03:00
allegroai	f693fa165c	Fix .git-credentials and .gitconfig mapping into docker; Add agent.docker_init_bash_script allow finer control over docker startup script	2020-07-02 01:33:13 +03:00
allegroai	c43084825c	Version bump to v0.15.1	2020-06-21 23:23:44 +03:00
allegroai	f1abee91dd	Add FORCE_LOCAL_TRAINS_AGENT_WHEEL to force the install of local trains agent wheel into the docker image	2020-06-21 23:23:26 +03:00
allegroai	c6b04edc34	version bump	2020-06-18 01:55:30 +03:00
allegroai	50b847f4f7	Add trains-agent dockerfile	2020-06-18 01:55:24 +03:00
allegroai	1f53a06299	Add agent.force_git_ssh_protocol option to force all git links to ssh:// (issue #16); Add git user/pass credentials for pip git packages (git+http and git+ssh) (issue #22)	2020-06-18 01:55:14 +03:00
allegroai	257dd95401	Add warning on --gpus without detected CUDA version (see issue #24)	2020-06-18 01:52:58 +03:00
allegroai	1736d205bb	Documentation	2020-06-18 00:31:44 +03:00
allegroai	6fef58df6c	embed jsonmodels 2.4 into trains-agent	2020-06-18 00:30:40 +03:00
allegroai	473a8de8bb	Fix trains-agent init (max two verification retries, then print error)	2020-06-11 15:39:38 +03:00
Allegro AI	ff6272f48f	Merge pull request #23 from H4dr1en/patch-1: remove six and pathlib2 dependencies from setup.py	2020-06-05 19:20:09 +03:00
H4dr1en	1b5bcebd10	remove six and pathlib2 dependencies from setup.py	2020-06-05 18:01:35 +02:00
Allegro AI	c4344d3afd	Update README.md	2020-06-02 01:02:34 +03:00
Allegro AI	45a44b087a	Update README.md	2020-06-02 00:58:52 +03:00
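
Commit 94997f9c88 above adds `daemon --order-fairness`, described in the diff as pulling "from each queue in a round-robin order, instead of priority order." The difference between the two pull orders can be illustrated with in-memory queues. This is only a sketch under stated assumptions: the function `drain` and the list-of-tuples queue layout are hypothetical, and the agent's real implementation talks to the Trains Server and may differ in detail.

```python
from collections import deque


def drain(queues, order_fairness=False):
    """Return tasks in the order a single worker would pull them.

    `queues` is a list of (queue_name, tasks) pairs, listed from highest
    to lowest priority. Hypothetical stand-in for server-side queues.
    """
    names = [name for name, _ in queues]
    pending = {name: deque(tasks) for name, tasks in queues}
    pulled = []
    while any(pending.values()):
        for name in list(names):
            if not pending[name]:
                continue  # empty queue, try the next one
            pulled.append(pending[name].popleft())
            if order_fairness:
                # Round-robin: the queue just served moves to the back
                names.remove(name)
                names.append(name)
            break  # restart the scan after every pulled task
    return pulled


if __name__ == "__main__":
    queues = [("high", ["h1", "h2"]), ("low", ["l1", "l2"])]
    print(drain(queues))                       # ['h1', 'h2', 'l1', 'l2']
    print(drain(queues, order_fairness=True))  # ['h1', 'l1', 'h2', 'l2']
```

In priority order the first queue is always emptied before the second is touched; with fairness enabled, a busy high-priority queue can no longer starve the queues listed after it.
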