Compare commits

...

66 Commits

Author SHA1 Message Date
allegroai
58e0dc42ec Version bump to v1.6.0 2023-09-05 15:05:11 +03:00
allegroai
d16825029d Add new pytorch no resolver mode and CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE to change resolver on a Task basis, now supports "pip", "direct", "none" 2023-09-02 17:45:10 +03:00
allegroai
fb639afcb9 Fix PyTorch extra index pip resolver 2023-09-02 17:43:41 +03:00
allegroai
eefb94d1bc Add Python 3.11 support 2023-09-02 17:42:27 +03:00
Alex Burlacu
f1e9266075 Adjust docker image versions in a couple more places 2023-08-24 19:03:24 +03:00
Alex Burlacu
e1e3c84a8d Update docker versions 2023-08-24 19:01:26 +03:00
Alex Burlacu
ed1356976b Move extra configurations to Worker init to make sure all available configurations can be overridden 2023-08-24 19:00:36 +03:00
Alex Burlacu
2b815354e0 Improve file mode comment 2023-08-24 18:53:00 +03:00
Alex Burlacu
edae380a9e Version bump 2023-08-24 18:51:47 +03:00
Alex Burlacu
946e9d9ce9 Fix invalid reference 2023-08-24 18:51:27 +03:00
jday1
a56343ffc7 Upgrade requests library (#162)
* Upgrade requests

* Modified package requirement

* Modified package requirement
2023-08-01 10:41:22 +03:00
allegroai
159a6e9a5a Fix runtime property overriding existing properties 2023-07-20 10:41:15 +03:00
pollfly
6b7ee12dc1 Edit README (#156) 2023-07-19 16:51:14 +03:00
allegroai
3838247716 Update k8s glue docker build resources 2023-07-19 16:47:50 +03:00
pollfly
6e7d35a42a Improve configuration files (#160) 2023-07-11 10:32:01 +03:00
allegroai
4c056a17b9 Add support for k8s jobs execution
Strip docker container obtained from task in k8s apply
2023-07-04 14:45:00 +03:00
allegroai
21d98afca5 Add support for extra docker arguments referencing machines environment variables using the agent.docker_allow_host_environ configuration option to allow users to also be able to use $ENV in the task's docker arguments 2023-07-04 14:42:28 +03:00
allegroai
6a1bf11549 Fix Task docker arguments passed twice 2023-07-04 14:41:07 +03:00
allegroai
7115a9b9a7 Add CLEARML_EXTRA_PIP_INSTALL_FLAGS / agent.package_manager.extra_pip_install_flags to control additional pip install flags
Fix pip version marking in "installed packages" is now preserved for and reinstalled
2023-07-04 14:39:40 +03:00
allegroai
450df2f8d3 Support skipping agent pip upgrade in container bash script using the CLEARML_AGENT_NO_UPDATE env var 2023-07-04 14:38:50 +03:00
allegroai
ccf752c4e4 Add support for setting mode on files applied by the agent 2023-07-04 14:37:58 +03:00
allegroai
3ed63e2154 Fix docker container backwards compatibility for API <2.13
Fix default docker match rules resolver (used incorrect field "container" instead of "image")
Remove "container" (image) match rule option from default docker image resolver
2023-07-04 14:37:18 +03:00
allegroai
a535f93cd6 Add support for CLEARML_AGENT_FORCE_MAX_API_VERSION for testing 2023-07-04 14:35:54 +03:00
allegroai
b380ec54c6 Improve config file comments 2023-07-04 14:34:43 +03:00
allegroai
a1274299ce Add support for CLEARML_AGENT_EXTRA_DOCKER_LABELS env var 2023-07-03 11:08:59 +03:00
allegroai
c77224af68 Add support for task field injection into container docker name 2023-07-03 11:07:12 +03:00
allegroai
95dadca45c Refactor k8s glue running/used pods getter 2023-05-21 22:56:12 +03:00
allegroai
685918fd9b Version bump to v1.5.3rc3 2023-05-21 22:54:38 +03:00
allegroai
bc85ddf78d Fix pytorch direct resolve replacing wheel link with directly installed version 2023-05-21 22:53:51 +03:00
allegroai
5b5fb0b8a6 Add agent.package_manager.pytorch_resolve configuration setting with pip or direct values. pip sets extra index based on cuda and lets pip resolve, direct is the previous parsing algorithm that does the matching and downloading (default pip) 2023-05-21 22:53:11 +03:00
allegroai
fec0ce1756 Better message for agent init when an existing clearml.conf is found 2023-05-21 22:51:11 +03:00
allegroai
1e09b88b7a Add alias CLEARML_AGENT_DOCKER_AGENT_REPO env var for the FORCE_CLEARML_AGENT_REPO env var 2023-05-21 22:50:01 +03:00
allegroai
b6ca0fa6a5 Print error on resource monitor failure 2023-05-11 16:18:11 +03:00
allegroai
307ec9213e Fix git+ssh:// links inside installed packages not being converted properly to HTTPS authenticated and vice versa 2023-05-11 16:16:51 +03:00
allegroai
a78a25d966 Support new Retry.DEFAULT_BACKOFF_MAX in a backwards-compatible way 2023-05-11 16:16:18 +03:00
allegroai
ebb6231f5a Add CLEARML_AGENT_STANDALONE_CONFIG_BC to support backwards compatibility in standalone mode 2023-05-11 16:15:06 +03:00
pollfly
e1d65cb280 Update clearml-agent gif (#137) 2023-04-10 10:58:10 +03:00
allegroai
3fe92a92ba Version bump to v1.5.2 2023-03-29 12:49:33 +03:00
allegroai
154db59ce6 Add agent.package_manager.poetry_install_extra_args configuration option 2023-03-28 14:37:48 +03:00
allegroai
afffa83063 Fix git+ssh:// links inside installed packages not being converted properly to https authenticated links 2023-03-28 14:35:51 +03:00
allegroai
787c7d88bb Fix additional poetry cwd support feature 2023-03-28 14:35:41 +03:00
allegroai
667c2ced3d Fix very old pip version support (<20) 2023-03-28 14:34:19 +03:00
allegroai
7f5b3c8df4 Fix None config file in session causes k8s agent to raise exception 2023-03-28 14:33:55 +03:00
allegroai
46ded2864d Fix restart feature should be tested against agent session 2023-03-28 14:33:33 +03:00
allegroai
40456be948 Black formatting
Refactor path support
2023-03-05 18:05:00 +02:00
allegroai
8d51aed679 Protect against cache folders without permission 2023-03-05 18:05:00 +02:00
allegroai
bfc4ba38cd Fix torch inside nvidia containers to use preinstalled version (i.e. ==x.y.z.* matching) 2023-03-05 18:05:00 +02:00
Niels ten Boom
3cedc104df Add poetry cwd support (#142)
Closes #138
2023-03-05 14:19:57 +02:00
Marijan Smetko
b367c80477 Switch entrypoint shell from sh to bash (#141) 2023-02-28 21:55:16 +02:00
allegroai
262b6d3a00 Update services agent entrypoint 2023-02-05 10:40:02 +02:00
allegroai
95e996bfda Reintroduce CLEARML_AGENT_SERVICES_DOCKER_RESTART accidentally reverted by a previous merge 2023-02-05 10:34:38 +02:00
allegroai
b6d132b226 Fix build fails when target is relative path 2023-02-05 10:33:32 +02:00
allegroai
4f17a2c17d Fix K8s glue does not delete pending pods if the tasks they represent were aborted 2023-02-05 10:32:16 +02:00
allegroai
00e8e9eb5a Do not allow request exceptions (only on the initial login call) 2023-02-05 10:30:45 +02:00
allegroai
af6a77918f Fix _ is allowed in k8s label names 2023-02-05 10:29:48 +02:00
allegroai
855622fd30 Support custom service on Worker.get() calls 2023-02-05 10:29:09 +02:00
allegroai
8cd12810f3 Fix login uses GET with payload which breaks when trying to connect a server running in GCP 2023-02-05 10:28:41 +02:00
achaiah
ebb955187d Fix agent update version (#132)
* Fix agent update version

Pip install command is missing the '=='  to execute successfully

* allow for fuzzy and direct version spec

adding logic to allow for flexible version specification

* Added regex to parse 1.2.3rc4 patterns
2023-01-08 19:10:26 +02:00
pollfly
85e1fadf9b Fix typos (#131) 2022-12-28 19:39:59 +02:00
allegroai
249b51a31b Version bump 2022-12-13 15:29:10 +02:00
allegroai
da19ef26c4 Fix pinging running task (and change default to once a minute) 2022-12-13 15:26:26 +02:00
allegroai
f69e16ea9d Fix clearml-agent build --docker stuck on certain containers 2022-12-13 15:24:32 +02:00
allegroai
efa1f71dac Version bump to v1.5.1 2022-12-10 22:18:21 +02:00
allegroai
692cb8cf13 Update six requirements 2022-12-10 22:18:10 +02:00
allegroai
ebdc215632 Remove " from pip commands in venv 2022-12-10 20:58:30 +02:00
allegroai
b2da639582 Add CLEARML_AGENT_FORCE_SYSTEM_SITE_PACKAGES env var (default true) to allow overriding default "system_site_packages: true" behavior when running tasks in containers (docker mode and k8s-glue) 2022-12-10 20:00:46 +02:00
46 changed files with 1719 additions and 576 deletions

3
.gitignore vendored
View File

@@ -11,3 +11,6 @@ build/
dist/
*.egg-info
# VSCode
.vscode

107
README.md
View File

@@ -24,8 +24,7 @@ ML-Ops scheduler & orchestration solution supporting Linux, macOS and Windows**
* Launch-and-Forget service containers
* [Cloud autoscaling](https://clear.ml/docs/latest/docs/guides/services/aws_autoscaler)
* [Customizable cleanup](https://clear.ml/docs/latest/docs/guides/services/cleanup_service)
*
Advanced [pipeline building and execution](https://clear.ml/docs/latest/docs/guides/frameworks/pytorch/notebooks/table/tabular_training_pipeline)
* Advanced [pipeline building and execution](https://clear.ml/docs/latest/docs/guides/frameworks/pytorch/notebooks/table/tabular_training_pipeline)
It is a zero configuration fire-and-forget execution agent, providing a full ML/DL cluster solution.
@@ -35,8 +34,8 @@ It is a zero configuration fire-and-forget execution agent, providing a full ML/
or [free tier hosting](https://app.clear.ml)
2. `pip install clearml-agent` ([install](#installing-the-clearml-agent) the ClearML Agent on any GPU machine:
on-premises / cloud / ...)
3. Create a [job](https://github.com/allegroai/clearml/docs/clearml-task.md) or
Add [ClearML](https://github.com/allegroai/clearml) to your code with just 2 lines
3. Create a [job](https://clear.ml/docs/latest/docs/apps/clearml_task) or
add [ClearML](https://github.com/allegroai/clearml) to your code with just 2 lines of code
4. Change the [parameters](#using-the-clearml-agent) in the UI & schedule for [execution](#using-the-clearml-agent) (or
automate with an [AutoML pipeline](#automl-and-orchestration-pipelines-))
5. :chart_with_downwards_trend: :chart_with_upwards_trend: :eyes: :beer:
@@ -81,21 +80,21 @@ Find Dockerfiles in the [docker](./docker) dir and a helm Chart in https://githu
**Two K8s integration flavours**
- Spin ClearML-Agent as a long-lasting service pod
- use [clearml-agent](https://hub.docker.com/r/allegroai/clearml-agent) docker image
- Spin ClearML-Agent as a long-lasting service pod:
- Use [clearml-agent](https://hub.docker.com/r/allegroai/clearml-agent) docker image
- map docker socket into the pod (soon replaced by [podman](https://github.com/containers/podman))
- allow the clearml-agent to manage sibling dockers
- benefits: full use of the ClearML scheduling, no need to worry about wrong container images / lost pods etc.
- downside: Sibling containers
- Kubernetes Glue, map ClearML jobs directly to K8s jobs
- Allow the clearml-agent to manage sibling dockers
- Benefits: full use of the ClearML scheduling, no need to worry about wrong container images / lost pods etc.
- Downside: sibling containers
- Kubernetes Glue, map ClearML jobs directly to K8s jobs:
- Run the [clearml-k8s glue](https://github.com/allegroai/clearml-agent/blob/master/examples/k8s_glue_example.py) on
a K8s cpu node
- The clearml-k8s glue pulls jobs from the ClearML job execution queue and prepares a K8s job (based on provided
yaml template)
- Inside the pod itself the clearml-agent will install the job (experiment) environment and spin and monitor the
experiment's process
- benefits: Kubernetes full view of all running jobs in the system
- downside: No real scheduling (k8s scheduler), no docker image verification (post-mortem only)
- Benefits: Kubernetes full view of all running jobs in the system
- Downside: No real scheduling (k8s scheduler), no docker image verification (post-mortem only)
### Using the ClearML Agent
@@ -110,15 +109,15 @@ A previously run experiment can be put into 'Draft' state by either of two metho
* Using the **'Reset'** action from the experiment right-click context menu in the ClearML UI - This will clear any
results and artifacts the previous run had created.
* Using the **'Clone'** action from the experiment right-click context menu in the ClearML UI - This will create a new '
Draft' experiment with the same configuration as the original experiment.
* Using the **'Clone'** action from the experiment right-click context menu in the ClearML UI - This will create a new
'Draft' experiment with the same configuration as the original experiment.
An experiment is scheduled for execution using the **'Enqueue'** action from the experiment right-click context menu in
the ClearML UI and selecting the execution queue.
See [creating an experiment and enqueuing it for execution](#from-scratch).
Once an experiment is enqueued, it will be picked up and executed by a ClearML agent monitoring this queue.
Once an experiment is enqueued, it will be picked up and executed by a ClearML Agent monitoring this queue.
The ClearML UI Workers & Queues page provides ongoing execution information:
@@ -170,22 +169,22 @@ clearml-agent init
```
Note: The ClearML Agent uses a cache folder to cache pip packages, apt packages and cloned repositories. The default
ClearML Agent cache folder is `~/.clearml`
ClearML Agent cache folder is `~/.clearml`.
See full details in your configuration file at `~/clearml.conf`
See full details in your configuration file at `~/clearml.conf`.
Note: The **ClearML agent** extends the **ClearML** configuration file `~/clearml.conf`
Note: The **ClearML Agent** extends the **ClearML** configuration file `~/clearml.conf`.
They are designed to share the same configuration file, see example [here](docs/clearml.conf)
#### Running the ClearML Agent
For debug and experimentation, start the ClearML agent in `foreground` mode, where all the output is printed to screen
For debug and experimentation, start the ClearML agent in `foreground` mode, where all the output is printed to screen:
```bash
clearml-agent daemon --queue default --foreground
```
For actual service mode, all the stdout will be stored automatically into a temporary file (no need to pipe)
For actual service mode, all the stdout will be stored automatically into a temporary file (no need to pipe).
Notice: with `--detached` flag, the *clearml-agent* will be running in the background
```bash
@@ -195,20 +194,21 @@ clearml-agent daemon --detached --queue default
GPU allocation is controlled via the standard OS environment `NVIDIA_VISIBLE_DEVICES` or `--gpus` flag (or disabled
with `--cpu-only`).
If no flag is set, and `NVIDIA_VISIBLE_DEVICES` variable doesn't exist, all GPU's will be allocated for
the `clearml-agent` <br>
If no flag is set, and `NVIDIA_VISIBLE_DEVICES` variable doesn't exist, all GPUs will be allocated for
the `clearml-agent`. <br>
If `--cpu-only` flag is set, or `NVIDIA_VISIBLE_DEVICES="none"`, no gpu will be allocated for
the `clearml-agent`
the `clearml-agent`.
Example: spin two agents, one per gpu on the same machine:
Notice: with `--detached` flag, the *clearml-agent* will be running in the background
Example: spin two agents, one per GPU on the same machine:
Notice: with `--detached` flag, the *clearml-agent* will run in the background
```bash
clearml-agent daemon --detached --gpus 0 --queue default
clearml-agent daemon --detached --gpus 1 --queue default
```
Example: spin two agents, pulling from dedicated `dual_gpu` queue, two gpu's per agent
Example: spin two agents, pulling from dedicated `dual_gpu` queue, two GPUs per agent
```bash
clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu
@@ -223,43 +223,43 @@ For debug and experimentation, start the ClearML agent in `foreground` mode, whe
clearml-agent daemon --queue default --docker --foreground
```
For actual service mode, all the stdout will be stored automatically into a file (no need to pipe)
Notice: with `--detached` flag, the *clearml-agent* will be running in the background
For actual service mode, all the stdout will be stored automatically into a file (no need to pipe).
Notice: with `--detached` flag, the *clearml-agent* will run in the background
```bash
clearml-agent daemon --detached --queue default --docker
```
Example: spin two agents, one per gpu on the same machine, with default nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
Example: spin two agents, one per gpu on the same machine, with default `nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04`
docker:
```bash
clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
clearml-agent daemon --detached --gpus 1 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04
clearml-agent daemon --detached --gpus 1 --queue default --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04
```
Example: spin two agents, pulling from dedicated `dual_gpu` queue, two gpu's per agent, with default nvidia/cuda:
10.1-cudnn7-runtime-ubuntu18.04 docker:
Example: spin two agents, pulling from dedicated `dual_gpu` queue, two GPUs per agent, with default
`nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04` docker:
```bash
clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
clearml-agent daemon --detached --gpus 2,3 --queue dual_gpu --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04
clearml-agent daemon --detached --gpus 2,3 --queue dual_gpu --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04
```
##### Starting the ClearML Agent - Priority Queues
Priority Queues are also supported, example use case:
High priority queue: `important_jobs` Low priority queue: `default`
High priority queue: `important_jobs`, low priority queue: `default`
```bash
clearml-agent daemon --queue important_jobs default
```
The **ClearML Agent** will first try to pull jobs from the `important_jobs` queue, only then it will fetch a job from
the `default` queue.
The **ClearML Agent** will first try to pull jobs from the `important_jobs` queue, and only if it is empty, the agent
will try to pull from the `default` queue.
Adding queues, managing job order within a queue and moving jobs between queues, is available using the Web UI, see
Adding queues, managing job order within a queue, and moving jobs between queues, is available using the Web UI, see
example on our [free server](https://app.clear.ml/workers-and-queues/queues)
##### Stopping the ClearML Agent
@@ -268,7 +268,7 @@ To stop a **ClearML Agent** running in the background, run the same command line
appended. For example, to stop the first of the above shown same machine, single gpu agents:
```bash
clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04 --stop
clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04 --stop
```
### How do I create an experiment on the ClearML Server? <a name="from-scratch"></a>
@@ -279,32 +279,33 @@ clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:10
- Git repository link and commit ID (or an entire jupyter notebook)
- Git diff (were not saying you never commit and push, but still...)
- Python packages used by your code (including specific versions used)
- Hyper-Parameters
- Input Artifacts
- Hyperparameters
- Input artifacts
You now have a 'template' of your experiment with everything required for automated execution
* In the ClearML UI, Right-click on the experiment and select 'clone'. A copy of your experiment will be created.
* In the ClearML UI, right-click on the experiment and select 'clone'. A copy of your experiment will be created.
* You now have a new draft experiment cloned from your original experiment, feel free to edit it
- Change the Hyper-Parameters
- Change the hyperparameters
- Switch to the latest code base of the repository
- Update package versions
- Select a specific docker image to run in (see docker execution mode section)
- Or simply change nothing to run the same experiment again...
* Schedule the newly created experiment for execution: Right-click the experiment and select 'enqueue'
* Schedule the newly created experiment for execution: right-click the experiment and select 'enqueue'
### ClearML-Agent Services Mode <a name="services"></a>
ClearML-Agent Services is a special mode of ClearML-Agent that provides the ability to launch long-lasting jobs that
previously had to be executed on local / dedicated machines. It allows a single agent to launch multiple dockers (Tasks)
for different use cases. To name a few use cases, auto-scaler service (spinning instances when the need arises and the
budget allows), Controllers (Implementing pipelines and more sophisticated DevOps logic), Optimizer (such as
Hyper-parameter Optimization or sweeping), and Application (such as interactive Bokeh apps for increased data
transparency)
for different use cases:
* Auto-scaler service (spinning instances when the need arises and the budget allows)
* Controllers (Implementing pipelines and more sophisticated DevOps logic)
* Optimizer (such as Hyperparameter Optimization or sweeping)
* Application (such as interactive Bokeh apps for increased data transparency)
ClearML-Agent Services mode will spin **any** task enqueued into the specified queue. Every task launched by
ClearML-Agent Services will be registered as a new node in the system, providing tracking and transparency capabilities.
Currently clearml-agent in services-mode supports cpu only configuration. ClearML-agent services mode can be launched
Currently, clearml-agent in services-mode supports CPU only configuration. ClearML-Agent services mode can be launched
alongside GPU agents.
```bash
@@ -321,15 +322,15 @@ ClearML package.
Sample AutoML & Orchestration examples can be found in the
ClearML [example/automation](https://github.com/allegroai/clearml/tree/master/examples/automation) folder.
AutoML examples
AutoML examples:
- [Toy Keras training experiment](https://github.com/allegroai/clearml/blob/master/examples/optimization/hyper-parameter-optimization/base_template_keras_simple.py)
- In order to create an experiment-template in the system, this code must be executed once manually
- [Random Search over the above Keras experiment-template](https://github.com/allegroai/clearml/blob/master/examples/automation/manual_random_param_search_example.py)
- This example will create multiple copies of the Keras experiment-template, with different hyper-parameter
- This example will create multiple copies of the Keras experiment-template, with different hyperparameter
combinations
Experiment Pipeline examples
Experiment Pipeline examples:
- [First step experiment](https://github.com/allegroai/clearml/blob/master/examples/automation/task_piping_example.py)
- This example will "process data", and once done, will launch a copy of the 'second step' experiment-template

View File

@@ -69,8 +69,9 @@
pip_version: ["<20.2 ; python_version < '3.10'", "<22.3 ; python_version >= '3.10'"],
# specify poetry version to use (examples "<2", "==1.1.1", "", empty string will install the latest version)
# poetry_version: "<2",
# poetry_install_extra_args: ["-v"]
# virtual environment inheres packages from system
# virtual environment inherits packages from system
system_site_packages: false,
# install with --upgrade
@@ -79,6 +80,17 @@
# additional artifact repositories to use when installing python packages
# extra_index_url: ["https://allegroai.jfrog.io/clearml/api/pypi/public/simple"]
# control the pytorch wheel resolving algorithm, options are: "pip", "direct", "none"
# Override with environment variable CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE
# "pip" (default): would automatically detect the cuda version, and supply pip with the correct
# extra-index-url, based on pytorch.org tables
# "direct": would resolve a direct link to the pytorch wheel by parsing the pytorch.org pip repository
# and matching the automatically detected cuda version with the required pytorch wheel.
# if the exact cuda version is not found for the required pytorch wheel, it will try
# a lower cuda version until a match is found
# "none": No resolver used, install pytorch like any other package
# pytorch_resolve: "pip"
# additional conda channels to use when installing with conda package manager
conda_channels: ["pytorch", "conda-forge", "defaults", ]
@@ -87,24 +99,32 @@
# force_repo_requirements_txt: false
# set the priority packages to be installed before the rest of the required packages
# Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
# priority_packages: ["cython", "numpy", "setuptools", ]
# set the optional priority packages to be installed before the rest of the required packages,
# In case a package installation fails, the package will be ignored,
# and the virtual environment process will continue
# Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
priority_optional_packages: ["pygobject", ]
# set the post packages to be installed after all the rest of the required packages
# Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
# post_packages: ["horovod", ]
# set the optional post packages to be installed after all the rest of the required packages,
# In case a package installation fails, the package will be ignored,
# and the virtual environment process will continue
# Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
# post_optional_packages: []
# set to True to support torch nightly build installation,
# notice: torch nightly builds are ephemeral and are deleted from time to time
torch_nightly: false,
# if set to true, the agent will look for the "poetry.lock" file
# in the passed current working directory instead of the repository's root directory.
poetry_files_from_repo_working_dir: false
},
# target folder for virtual environments builds, created when executing experiment
@@ -187,7 +207,7 @@
default_docker: {
# default docker image to use when running in docker mode
image: "nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04"
image: "nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04"
# optional arguments to pass to docker image
# arguments: ["--ipc=host", ]
@@ -254,10 +274,15 @@
# Name docker containers created by the daemon using the following string format (supported from Docker 0.6.5)
# Allowed variables are task_id, worker_id and rand_string (random lower-case letters string, up to 32 characters)
# Custom variables may be specified using the docker_container_name_format_fields option.
# Note: resulting name must start with an alphanumeric character and
# continue with alphanumeric characters, underscores (_), dots (.) and/or dashes (-)
# docker_container_name_format: "clearml-id-{task_id}-{rand_string:.8}"
# Specify custom variables for the docker_container_name_format option using a mapping of variable name
# to a (nested) task field (using "." as a task field separator, digits specify array index)
# docker_container_name_format_fields: { foo: "bar.moo" }
# Apply top-level environment section from configuration into os.environ
apply_environment: true
# Top-level environment section is in the form of:
@@ -278,6 +303,8 @@
# target_format: format used to encode contents before writing into the target file. Supported values are json,
# yaml, yml and bytes (in which case the file will be written in binary mode). Default is text mode.
# overwrite: overwrite the target file in case it exists. Default is true.
# mode: file-system mode to be applied to the file after its creation. The mode string will be parsed into an
# integer (e.g. "0o777" for -rwxrwxrwx)
#
# Example:
# files {
@@ -335,6 +362,42 @@
# Disable task docker override. If true, the agent will use the default docker image and ignore any docker image
# and arguments specified in the task's container section (setup shell script from the task container section will
# be used in any case, of specified).
# be used in any case, if specified).
disable_task_docker_override: false
# Choose the default docker based on the Task properties,
# Examples: 'script.requirements', 'script.binary', 'script.repository', 'script.branch', 'project'
# Notice: Matching is done via regular expression, for example "^searchme$" will match exactly "searchme$" string
#
# "default_docker": {
# "image": "nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04",
# # optional arguments to pass to docker image
# # arguments: ["--ipc=host", ]
# "match_rules": [
# {
# "image": "sample_container:tag",
# "arguments": "-e VALUE=1 --ipc=host",
# "match": {
# "script": {
# "requirements": {
# "pip": {
# "tensorflow": "~=1.6"
# }
# },
# "repository": "",
# "branch": "master"
# },
# "project": "example"
# }
# },
# {
# "image": "another_container:tag",
# "arguments": "",
# "match": {
# "project": "^examples", # anything that starts with "examples", e.g. "examples", "examples/sub_project"
# }
# }
# ]
# },
#
}

View File

@@ -3,7 +3,7 @@
storage {
cache {
# Defaults to system temp folder / cache
# Defaults to <system_temp_folder>/clearml_cache
default_base_dir: "~/.clearml/cache"
size {
# max_used_bytes = -1

View File

@@ -20,6 +20,7 @@ ENV_PROPAGATE_EXITCODE = EnvEntry("CLEARML_AGENT_PROPAGATE_EXITCODE", type=bool,
ENV_INITIAL_CONNECT_RETRY_OVERRIDE = EnvEntry(
'CLEARML_AGENT_INITIAL_CONNECT_RETRY_OVERRIDE', default=True, converter=safe_text_to_bool
)
ENV_FORCE_MAX_API_VERSION = EnvEntry("CLEARML_AGENT_FORCE_MAX_API_VERSION", type=str)
"""
Experimental option to set the request method for all API requests and auth login.

View File

@@ -2,22 +2,25 @@
import json as json_lib
import os
import sys
import time
import types
from random import SystemRandom
from socket import gethostname
from typing import Optional
import jwt
import requests
import six
from requests import RequestException
from requests.auth import HTTPBasicAuth
from six.moves.urllib.parse import urlparse, urlunparse
from clearml_agent.external.pyhocon import ConfigTree, ConfigFactory
from .callresult import CallResult
from .defs import (
ENV_VERBOSE, ENV_HOST, ENV_ACCESS_KEY, ENV_SECRET_KEY, ENV_WEB_HOST, ENV_FILES_HOST, ENV_AUTH_TOKEN,
ENV_NO_DEFAULT_SERVER, ENV_DISABLE_VAULT_SUPPORT, ENV_INITIAL_CONNECT_RETRY_OVERRIDE, ENV_API_DEFAULT_REQ_METHOD, )
ENV_NO_DEFAULT_SERVER, ENV_DISABLE_VAULT_SUPPORT, ENV_INITIAL_CONNECT_RETRY_OVERRIDE, ENV_API_DEFAULT_REQ_METHOD,
ENV_FORCE_MAX_API_VERSION)
from .request import Request, BatchRequest
from .token_manager import TokenManager
from ..config import load
@@ -25,6 +28,8 @@ from ..utils import get_http_session_with_retry, urllib_log_warning_setup
from ...backend_config.environment import backward_compatibility_support
from ...version import __version__
sys_random = SystemRandom()
class LoginError(Exception):
pass
@@ -49,6 +54,7 @@ class Session(TokenManager):
_session_initial_retry_connect_override = 4
_write_session_data_size = 15000
_write_session_timeout = (30.0, 30.)
_request_exception_retry_timeout = (2.0, 3.0)
api_version = '2.1'
feature_set = 'basic'
@@ -57,6 +63,7 @@ class Session(TokenManager):
default_files = "https://demofiles.demo.clear.ml"
default_key = "EGRTCO8JMSIGI6S39GTP43NFWXDQOW"
default_secret = "x!XTov_G-#vspE*Y(h$Anm&DIc5Ou-F)jsl$PdOyj5wG1&E!Z8"
force_max_api_version = ENV_FORCE_MAX_API_VERSION.get()
# TODO: add requests.codes.gateway_timeout once we support async commits
_retry_codes = [
@@ -111,19 +118,9 @@ class Session(TokenManager):
self._verbose = verbose if verbose is not None else ENV_VERBOSE.get()
self._logger = logger
self.__auth_token = None
self._propagate_exceptions_on_send = True
if ENV_API_DEFAULT_REQ_METHOD.get(default=None):
# Make sure we update the config object, so we pass it into the new containers when we map them
self.config.put("api.http.default_method", ENV_API_DEFAULT_REQ_METHOD.get())
# notice the default setting of Request.def_method are already set by the OS environment
elif self.config.get("api.http.default_method", None):
def_method = str(self.config.get("api.http.default_method", None)).strip()
if def_method.upper() not in ("GET", "POST", "PUT"):
raise ValueError(
"api.http.default_method variable must be 'get' or 'post' (any case is allowed)."
)
Request.def_method = def_method
Request._method = Request.def_method
self.update_default_api_method()
if ENV_AUTH_TOKEN.get(
value_cb=lambda key, value: print("Using environment access token {}=********".format(key))
@@ -178,6 +175,10 @@ class Session(TokenManager):
)
# try to connect with the server
self.refresh_token()
# for resilience, from now on we won't allow propagating exceptions when sending requests
self._propagate_exceptions_on_send = False
# create the default session with many retries
http_retries_config, self.__http_session = self._setup_session(http_retries_config)
@@ -198,6 +199,12 @@ class Session(TokenManager):
# notice: this is across the board warning omission
urllib_log_warning_setup(total_retries=http_retries_config.get('total', 0), display_warning_after=3)
if self.force_max_api_version and self.check_min_api_version(self.force_max_api_version):
print("Using forced API version {}".format(self.force_max_api_version))
Session.max_api_version = Session.api_version = str(self.force_max_api_version)
self.pre_vault_config = None
def _setup_session(self, http_retries_config, initial_session=False, default_initial_connect_override=None):
# type: (dict, bool, Optional[bool]) -> (dict, requests.Session)
http_retries_config = http_retries_config or self.config.get(
@@ -223,7 +230,22 @@ class Session(TokenManager):
return http_retries_config, get_http_session_with_retry(config=self.config or None, **http_retries_config)
def update_default_api_method(self):
if ENV_API_DEFAULT_REQ_METHOD.get(default=None):
# Make sure we update the config object, so we pass it into the new containers when we map them
self.config.put("api.http.default_method", ENV_API_DEFAULT_REQ_METHOD.get())
# notice the default setting of Request.def_method are already set by the OS environment
elif self.config.get("api.http.default_method", None):
def_method = str(self.config.get("api.http.default_method", None)).strip()
if def_method.upper() not in ("GET", "POST", "PUT"):
raise ValueError(
"api.http.default_method variable must be 'get', 'post' or 'put' (any case is allowed)."
)
Request.def_method = def_method
Request._method = Request.def_method
def load_vaults(self):
# () -> Optional[bool]
if not self.check_min_api_version("2.15") or self.feature_set == "basic":
return
@@ -234,7 +256,11 @@ class Session(TokenManager):
def parse(vault):
# noinspection PyBroadException
try:
d = vault.get('data', None)
print("Loaded {} vault: {}".format(
vault.get("scope", ""),
(vault.get("description", None) or "")[:50] or vault.get("id", ""))
)
d = vault.get("data", None)
if d:
r = ConfigFactory.parse_string(d)
if isinstance(r, (ConfigTree, dict)):
@@ -244,12 +270,15 @@ class Session(TokenManager):
# noinspection PyBroadException
try:
res = self.send_request("users", "get_vaults", json={"enabled": True, "types": ["config"]})
# Use params and not data/json otherwise payload might be dropped if we're using GET with a strict firewall
res = self.send_request("users", "get_vaults", params="enabled=true&types=config&types=config")
if res.ok:
vaults = res.json().get("data", {}).get("vaults", [])
data = list(filter(None, map(parse, vaults)))
if data:
self.pre_vault_config = self.config.copy()
self.config.set_overrides(*data)
return True
elif res.status_code != 404:
raise Exception(res.json().get("meta", {}).get("result_msg", res.text))
except Exception as ex:
@@ -272,6 +301,7 @@ class Session(TokenManager):
data=None,
json=None,
refresh_token_if_unauthorized=True,
params=None,
):
""" Internal implementation for making a raw API request.
- Constructs the api endpoint name
@@ -295,6 +325,7 @@ class Session(TokenManager):
if version
else "{host}/{service}.{action}"
).format(**locals())
while True:
if data and len(data) > self._write_session_data_size:
timeout = self._write_session_timeout
@@ -302,16 +333,29 @@ class Session(TokenManager):
timeout = self._session_initial_timeout
else:
timeout = self._session_timeout
res = self.__http_session.request(
method, url, headers=headers, auth=auth, data=data, json=json, timeout=timeout)
try:
res = self.__http_session.request(
method, url, headers=headers, auth=auth, data=data, json=json, timeout=timeout, params=params)
except RequestException as ex:
if self._propagate_exceptions_on_send:
raise
sleep_time = sys_random.uniform(*self._request_exception_retry_timeout)
self._logger.error(
"{} exception sending {} {}: {} (retrying in {:.1f}sec)".format(
type(ex).__name__, method.upper(), url, str(ex), sleep_time
)
)
time.sleep(sleep_time)
continue
if (
refresh_token_if_unauthorized
and res.status_code == requests.codes.unauthorized
and not token_refreshed_on_error
):
# it seems we're unauthorized, so we'll try to refresh our token once in case permissions changed since
# the last time we got the token, and try again
# it seems we're unauthorized, so we'll try to refresh our token once in case permissions changed
# since the last time we got the token, and try again
self.refresh_token()
token_refreshed_on_error = True
# try again
@@ -348,6 +392,7 @@ class Session(TokenManager):
data=None,
json=None,
async_enable=False,
params=None,
):
"""
Send a raw API request.
@@ -360,6 +405,7 @@ class Session(TokenManager):
content type will be application/json)
:param data: Dictionary, bytes, or file-like object to send in the request body
:param async_enable: whether request is asynchronous
:param params: additional query parameters
:return: requests Response instance
"""
headers = self.add_auth_headers(
@@ -376,6 +422,7 @@ class Session(TokenManager):
headers=headers,
data=data,
json=json,
params=params,
)
def send_request_batch(
@@ -628,15 +675,14 @@ class Session(TokenManager):
res = None
try:
data = {"expiration_sec": exp} if exp else {}
res = self._send_request(
method=Request.def_method,
service="auth",
action="login",
auth=auth,
json=data,
headers=headers,
refresh_token_if_unauthorized=False,
params={"expiration_sec": exp} if exp else {},
)
try:
resp = res.json()
@@ -675,3 +721,13 @@ class Session(TokenManager):
return "{self.__class__.__name__}[{self.host}, {self.access_key}/{secret_key}]".format(
self=self, secret_key=self.secret_key[:5] + "*" * (len(self.secret_key) - 5)
)
@property
def propagate_exceptions_on_send(self):
# type: () -> bool
return self._propagate_exceptions_on_send
@propagate_exceptions_on_send.setter
def propagate_exceptions_on_send(self, value):
# type: (bool) -> None
self._propagate_exceptions_on_send = value

View File

@@ -86,7 +86,10 @@ def get_http_session_with_retry(
session = requests.Session()
if backoff_max is not None:
Retry.BACKOFF_MAX = backoff_max
if "BACKOFF_MAX" in vars(Retry):
Retry.BACKOFF_MAX = backoff_max
else:
Retry.DEFAULT_BACKOFF_MAX = backoff_max
retry = Retry(
total=total, connect=connect, read=read, redirect=redirect, status=status,

View File

@@ -297,6 +297,9 @@ class Config(object):
def put(self, key, value):
self._config.put(key, value)
def pop(self, key, default=None):
return self._config.pop(key, default=default)
def to_dict(self):
return self._config.as_plain_ordered_dict()

View File

@@ -52,6 +52,7 @@ def apply_files(config):
target_fmt = data.get("target_format", "string")
overwrite = bool(data.get("overwrite", True))
contents = data.get("contents")
mode = data.get("mode")
target = Path(expanduser(expandvars(path)))
@@ -110,3 +111,14 @@ def apply_files(config):
except Exception as ex:
print("Skipped [{}]: failed saving file {} ({})".format(key, target, ex))
continue
try:
if mode:
if isinstance(mode, int):
mode = int(str(mode), 8)
else:
mode = int(mode, 8)
target.chmod(mode)
except Exception as ex:
print("Skipped [{}]: failed setting mode {} for {} ({})".format(key, mode, target, ex))
continue

View File

@@ -118,13 +118,15 @@ class ServiceCommandSection(BaseCommandSection):
""" The name of the REST service used by this command """
pass
def get(self, endpoint, *args, session=None, **kwargs):
def get(self, endpoint, *args, service=None, session=None, **kwargs):
session = session or self._session
return session.get(service=self.service, action=endpoint, *args, **kwargs)
service = service or self.service
return session.get(service=service, action=endpoint, *args, **kwargs)
def post(self, endpoint, *args, session=None, **kwargs):
def post(self, endpoint, *args, service=None, session=None, **kwargs):
session = session or self._session
return session.post(service=self.service, action=endpoint, *args, **kwargs)
service = service or self.service
return session.post(service=service, action=endpoint, *args, **kwargs)
def get_with_act_as(self, endpoint, *args, **kwargs):
return self._session.get_with_act_as(service=self.service, action=endpoint, *args, **kwargs)

View File

@@ -44,7 +44,7 @@ def main():
if conf_file.exists() and conf_file.is_file() and conf_file.stat().st_size > 0:
print('Configuration file already exists: {}'.format(str(conf_file)))
print('Leaving setup, feel free to edit the configuration file.')
print('Leaving setup. If you\'ve previously initialized the ClearML SDK on this machine, manually add an \'agent\' section to this file.')
return
print(description, end='')

View File

@@ -109,15 +109,15 @@ def resolve_default_container(session, task_id, container_config):
match.get('script.binary', None), entry))
continue
if match.get('container', None):
# noinspection PyBroadException
try:
if not re.search(match.get('container', None), requested_container.get('image', '')):
continue
except Exception:
print('Failed parsing regular expression \"{}\" in rule: {}'.format(
match.get('container', None), entry))
continue
# if match.get('image', None):
# # noinspection PyBroadException
# try:
# if not re.search(match.get('image', None), requested_container.get('image', '')):
# continue
# except Exception:
# print('Failed parsing regular expression \"{}\" in rule: {}'.format(
# match.get('image', None), entry))
# continue
matched = True
for req_section in ['script.requirements.pip', 'script.requirements.conda']:
@@ -156,8 +156,8 @@ def resolve_default_container(session, task_id, container_config):
break
if matched:
if not container_config.get('container'):
container_config['container'] = entry.get('image', None)
if not container_config.get('image'):
container_config['image'] = entry.get('image', None)
if not container_config.get('arguments'):
container_config['arguments'] = entry.get('arguments', None)
container_config['arguments'] = shlex.split(str(container_config.get('arguments') or '').strip())

View File

@@ -1,6 +1,7 @@
from __future__ import print_function, division, unicode_literals
import errno
import functools
import json
import logging
import os
@@ -11,6 +12,7 @@ import shlex
import shutil
import signal
import string
import socket
import subprocess
import sys
import traceback
@@ -23,7 +25,7 @@ from functools import partial
from os.path import basename
from tempfile import mkdtemp, NamedTemporaryFile
from time import sleep, time
from typing import Text, Optional, Any, Tuple, List
from typing import Text, Optional, Any, Tuple, List, Dict, Mapping, Union
import attr
import six
@@ -39,6 +41,7 @@ from clearml_agent.backend_api.session import CallResult, Request
from clearml_agent.backend_api.session.defs import (
ENV_ENABLE_ENV_CONFIG_SECTION, ENV_ENABLE_FILES_CONFIG_SECTION,
ENV_VENV_CONFIGURED, ENV_PROPAGATE_EXITCODE, )
from clearml_agent.backend_config import Config
from clearml_agent.backend_config.defs import UptimeConf
from clearml_agent.backend_config.utils import apply_environment, apply_files
from clearml_agent.backend_config.converters import text_to_int
@@ -57,10 +60,7 @@ from clearml_agent.definitions import (
ENV_WORKER_ID,
ENV_WORKER_TAGS,
ENV_DOCKER_SKIP_GPUS_FLAG,
ENV_AGENT_SECRET_KEY,
ENV_AGENT_AUTH_TOKEN,
ENV_AWS_SECRET_KEY,
ENV_AZURE_ACCOUNT_KEY,
ENV_AGENT_DISABLE_SSH_MOUNT,
ENV_SSH_AUTH_SOCK,
ENV_AGENT_SKIP_PIP_VENV_INSTALL,
@@ -71,6 +71,11 @@ from clearml_agent.definitions import (
ENV_DEBUG_INFO,
ENV_CHILD_AGENTS_COUNT_CMD,
ENV_DOCKER_ARGS_FILTERS,
ENV_FORCE_SYSTEM_SITE_PACKAGES,
ENV_SERVICES_DOCKER_RESTART,
ENV_CONFIG_BC_IN_STANDALONE,
ENV_FORCE_DOCKER_AGENT_REPO,
ENV_EXTRA_DOCKER_LABELS,
)
from clearml_agent.definitions import WORKING_REPOSITORY_DIR, PIP_EXTRA_INDICES
from clearml_agent.errors import (
@@ -316,6 +321,37 @@ def get_next_task(session, queue, get_task_info=False):
return data
def get_task_fields(session, task_id, fields: list, log=None) -> dict:
"""
Returns dict with Task docker container setup {container: '', arguments: '', setup_shell_script: ''}
"""
result = session.send_request(
service='tasks',
action='get_all',
json={'id': [task_id], 'only_fields': list(fields), 'search_hidden': True},
method=Request.def_method,
async_enable=False,
)
# noinspection PyBroadException
try:
results = {}
result = result.json()['data']['tasks'][0]
for field in fields:
cur = result
for part in field.split("."):
if part.isdigit():
cur = cur[part]
else:
cur = cur.get(part, {})
results[field] = cur
return results
except Exception as ex:
if log:
log.error("Failed obtaining values for task fields {}: {}", fields, ex)
pass
return {}
def get_task_container(session, task_id):
"""
Returns dict with Task docker container setup {container: '', arguments: '', setup_shell_script: ''}
@@ -333,20 +369,25 @@ def get_task_container(session, task_id):
container = result.json()['data']['tasks'][0]['container'] if result.ok else {}
if container.get('arguments'):
container['arguments'] = shlex.split(str(container.get('arguments')).strip())
if container.get('image'):
container['image'] = container.get('image').strip()
except (ValueError, TypeError):
container = {}
else:
response = get_task(session, task_id, only_fields=["execution.docker_cmd"])
task_docker_cmd_parts = shlex.split(str(response.execution.docker_cmd or '').strip())
try:
container = dict(
container=task_docker_cmd_parts[0],
arguments=task_docker_cmd_parts[1:] if len(task_docker_cmd_parts[0]) > 1 else ''
)
except (ValueError, TypeError):
container = {}
container = {}
if response.execution:
task_docker_cmd_parts = shlex.split(str(response.execution.docker_cmd or '').strip())
if task_docker_cmd_parts:
try:
container = dict(
image=task_docker_cmd_parts[0],
arguments=task_docker_cmd_parts[1:] if len(task_docker_cmd_parts[0]) > 1 else ''
)
except (ValueError, TypeError):
pass
if (not container or not container.get('container')) and session.check_min_api_version("2.13"):
if (not container or not container.get('image')) and session.check_min_api_version("2.13"):
container = resolve_default_container(session=session, task_id=task_id, container_config=container)
return container
@@ -597,6 +638,8 @@ class Worker(ServiceCommandSection):
_docker_fixed_user_cache = '/clearml_agent_cache'
_temp_cleanup_list = []
hostname_task_runtime_prop = "_exec_agent_hostname"
@property
def service(self):
""" Worker command service endpoint """
@@ -622,9 +665,13 @@ class Worker(ServiceCommandSection):
self.log = self._session.get_logger(__name__)
self.register_signal_handler()
self._worker_registered = False
self._apply_extra_configuration()
self.is_conda = is_conda(self._session.config) # type: bool
# Add extra index url - system wide
extra_url = None
# noinspection PyBroadException
try:
if self._session.config.get("agent.package_manager.extra_index_url", None):
extra_url = self._session.config.get("agent.package_manager.extra_index_url", [])
@@ -688,7 +735,7 @@ class Worker(ServiceCommandSection):
self._docker_args_filters = []
self._task_ping_interval_sec = max(
0, text_to_int(self._session.config.get("agent.task_ping_interval_sec", 120.0))
0, text_to_int(self._session.config.get("agent.task_ping_interval_sec", 60.0))
)
@classmethod
@@ -736,8 +783,68 @@ class Worker(ServiceCommandSection):
except Exception:
pass
def _get_docker_restart_value(self, task_session, task_id: str):
try:
self._session.verify_feature_set('advanced')
except ValueError:
return
restart = (ENV_SERVICES_DOCKER_RESTART.get() or "").strip()
if not restart:
return
# Parse value and selector
restart_value, _, selector = restart.partition(";")
if restart_value not in ("unless-stopped", "no", "always") and not restart_value.startswith("on-failure"):
self.log.error(
"Invalid value \"{}\" provided for {}, ignoring".format(restart, ENV_SERVICES_DOCKER_RESTART.vars[0])
)
return
if not selector:
return restart_value
path, _, expected_value = selector.partition("=")
result = task_session.send_request(
service='tasks',
action='get_all',
json={'id': [task_id], 'only_fields': [path], 'search_hidden': True},
method=Request.def_method,
)
if not result.ok:
result_msg = self._get_path(result.json(), 'meta', 'result_msg')
self.log.error(
"Failed obtaining selector value for restart option \"{}\", ignoring: {}".format(selector, result_msg)
)
return
not_found = object()
try:
value = self._get_path(result.json(), 'data', 'tasks', 0, *path.split("."), default=not_found)
except (ValueError, TypeError):
return
if value is not_found:
return
if not expected_value:
return restart_value
# noinspection PyBroadException
try:
if (
(isinstance(value, bool) and value == strtobool(expected_value)) # check first - bool is also an int
or (isinstance(value, (int, float)) and value == float(expected_value))
or (str(value) == str(expected_value))
):
return restart_value
except Exception as ex:
pass
def run_one_task(self, queue, task_id, worker_args, docker=None, task_session=None):
# type: (Text, Text, WorkerParams, Optional[Text], Optional[Session]) -> int
# type: (Text, Text, WorkerParams, Optional[Text], Optional[Session]) -> Optional[int]
"""
Run one task pulled from queue.
:param queue: ID of queue that task was pulled from
@@ -752,6 +859,31 @@ class Worker(ServiceCommandSection):
# "Running task '{}'".format(task_id)
print(self._task_logging_start_message.format(task_id))
task_session = task_session or self._session
# noinspection PyBroadException
try:
result = task_session.send_request(
service='tasks',
action='get_all',
version='2.15',
method=Request.def_method,
json={'id': [task_id], 'only_fields': ["runtime"], 'search_hidden': True}
)
runtime = result.json().get("data", {}).get("tasks", [])[0].get("runtime") or {}
runtime[self.hostname_task_runtime_prop] = socket.gethostname()
res = task_session.send_request(
service='tasks', action='edit', method=Request.def_method,
json={
"task": task_id, "force": True, "runtime": runtime
},
)
if not res.ok:
raise Exception("failed setting runtime property")
except Exception as ex:
print("Warning: failed obtaining/setting hostname for task '{}': {}".format(task_id, ex))
# set task status to in_progress so we know it was popped from the queue
# noinspection PyBroadException
try:
@@ -812,6 +944,7 @@ class Worker(ServiceCommandSection):
docker_image=docker_image,
docker_arguments=docker_arguments,
docker_bash_setup_script=docker_setup_script,
restart=self._get_docker_restart_value(task_session, task_id),
)
if self._impersonate_as_task_owner:
docker_params["auth_token"] = task_session.token
@@ -826,11 +959,21 @@ class Worker(ServiceCommandSection):
name_format = self._session.config.get('agent.docker_container_name_format', None)
if name_format:
custom_fields = {}
name_format_fields = self._session.config.get('agent.docker_container_name_format_fields', None)
if name_format_fields:
field_values = get_task_fields(task_session, task_id, name_format_fields.values(), log=self.log)
custom_fields = {
k: field_values.get(v)
for k, v in name_format_fields.items()
}
try:
name = name_format.format(
task_id=re.sub(r'[^a-zA-Z0-9._-]', '-', task_id),
worker_id=re.sub(r'[^a-zA-Z0-9._-]', '-', worker_id),
rand_string="".join(sys_random.choice(string.ascii_lowercase) for _ in range(32))
rand_string="".join(sys_random.choice(string.ascii_lowercase) for _ in range(32)),
**custom_fields,
)
except Exception as ex:
print("Warning: failed generating docker container name: {}".format(ex))
@@ -1398,8 +1541,6 @@ class Worker(ServiceCommandSection):
return self._resolve_queue_names(queues=queues, create_if_missing=create_if_missing)
def daemon(self, queues, log_level, foreground=False, docker=False, detached=False, order_fairness=False, **kwargs):
self._apply_extra_configuration()
# check that we have docker command if we need it
if docker not in (False, None) and not check_if_command_exists("docker"):
raise ValueError("Running in Docker mode, 'docker' command was not found")
@@ -1786,7 +1927,8 @@ class Worker(ServiceCommandSection):
if stderr:
stderr.flush()
if self._task_ping_interval_sec and time() - last_task_ping > self._task_ping_interval_sec:
if not stopping and self._task_ping_interval_sec and \
time() - last_task_ping > self._task_ping_interval_sec:
# noinspection PyBroadException
try:
res = (session or self._session).send(tasks_api.PingRequest(task=task_id))
@@ -1795,7 +1937,7 @@ class Worker(ServiceCommandSection):
except Exception as ex:
self.log.error("Failed sending ping: %s", str(ex))
finally:
self._task_ping_interval_sec = time()
last_task_ping = time()
# get diff from previous poll
printed_lines, stdout_pos_count = _print_file(stdout_path, stdout_pos_count)
@@ -1939,19 +2081,26 @@ class Worker(ServiceCommandSection):
def _apply_extra_configuration(self):
# store a few things we updated in runtime (TODO: we should list theme somewhere)
agent_config = self._session.config["agent"].copy()
vault_loaded = False
session = self._session
agent_config = session.config["agent"].copy()
agent_config_keys = ["cuda_version", "cudnn_version", "default_python", "worker_id", "worker_name", "debug"]
try:
self._session.load_vaults()
vault_loaded = session.load_vaults()
except Exception as ex:
print("Error: failed applying extra configuration: {}".format(ex))
# merge back
for restore_key in agent_config_keys:
if restore_key in agent_config:
self._session.config["agent"][restore_key] = agent_config[restore_key]
config = session.config
# merge back
if vault_loaded:
for restore_key in agent_config_keys:
if restore_key in agent_config and agent_config[restore_key] != config["agent"].get(restore_key, None):
print("Ignoring vault value for '{}' (agent config takes precedence), using '{}'".format(
restore_key, agent_config[restore_key]
))
config["agent"][restore_key] = agent_config[restore_key]
config = self._session.config
default = config.get("agent.apply_environment", False)
if ENV_ENABLE_ENV_CONFIG_SECTION.get(default=default):
try:
@@ -1968,6 +2117,11 @@ class Worker(ServiceCommandSection):
except Exception as ex:
print("Error: failed applying files from configuration: {}".format(ex))
try:
self._session.update_default_api_method()
except Exception as ex:
print("Error: failed updating default API method: {}".format(ex))
@resolve_names
def build(
self,
@@ -1982,6 +2136,10 @@ class Worker(ServiceCommandSection):
):
if not task_id:
raise CommandFailedError("Worker build must have valid task id")
if target and not os.path.isabs(target):
# Non absolute target path will lead to errors with relative python executable
target = os.path.abspath(target)
self._session.print_configuration()
@@ -2101,12 +2259,14 @@ class Worker(ServiceCommandSection):
print('Building Task {} inside docker image: {} {} setup_script={}\n'.format(
task_id, docker_image, docker_arguments or '', docker_setup_script or ''))
full_docker_cmd = self.docker_image_func(
docker_image=docker_image, docker_arguments=docker_arguments, docker_bash_setup_script=docker_setup_script)
docker_image=docker_image, docker_arguments=docker_arguments, docker_bash_setup_script=docker_setup_script
)
end_of_build_marker = "build.done=true"
docker_cmd_suffix = ' build --id {task_id} --install-globally; ' \
'echo "" >> {conf_file} ; ' \
'echo {end_of_build_marker} >> {conf_file} ; ' \
'ORG=$(stat -c "%u:%g" {conf_file}) ; chown $(whoami):$(whoami) {conf_file} ; ' \
'echo "" >> {conf_file} ; echo {end_of_build_marker} >> {conf_file} ; ' \
'chown $ORG {conf_file} ; ' \
'bash'.format(
task_id=task_id,
end_of_build_marker=end_of_build_marker,
@@ -2125,10 +2285,16 @@ class Worker(ServiceCommandSection):
# now we need to wait until the line shows on our configuration file.
while True:
while temp_config.stat().st_mtime == base_time_stamp:
sleep(5.0)
with open(temp_config.as_posix()) as f:
lines = [l.strip() for l in f.readlines()]
# noinspection PyBroadException
try:
while temp_config.stat().st_mtime == base_time_stamp:
sleep(5.0)
with open(temp_config.as_posix()) as f:
lines = [l.strip() for l in f.readlines()]
except Exception as ex:
# print("Failed reading status file [{}], retrying in 2 seconds".format(ex))
sleep(2.0)
if 'build.done=true' in lines:
break
base_time_stamp = temp_config.stat().st_mtime
@@ -2157,6 +2323,8 @@ class Worker(ServiceCommandSection):
print(commit_docker(container_name=target, docker_id=docker_id, apply_change=change))
shutdown_docker_process(docker_id=docker_id)
safe_remove_file(temp_config.as_posix())
return
def _get_task_python_version(self, task):
@@ -2214,8 +2382,10 @@ class Worker(ServiceCommandSection):
print("Cloning task id={}".format(task_id))
current_task = self._session.api_client.tasks.get_by_id(
self._session.send_api(
tasks_api.CloneRequest(task=current_task.id,
new_task_name='Clone of {}'.format(current_task.name))
tasks_api.CloneRequest(
task=current_task.id,
new_task_name="Clone of {}".format(current_task.name)
)
).id
)
print("Task cloned, new task id={}".format(current_task.id))
@@ -2223,11 +2393,23 @@ class Worker(ServiceCommandSection):
raise CommandFailedError("Cloning failed")
else:
# make sure this task is not stuck in an execution queue, it shouldn't have been, but just in case.
# noinspection PyBroadException
try:
res = self._session.api_client.tasks.dequeue(task=current_task.id)
if require_queue and res.meta.result_code != 200:
raise ValueError("Execution required enqueued task, "
"but task id={} is not queued.".format(current_task.id))
res = self._session.send_request(
service="tasks", action="dequeue", method=Request.def_method,
json={"task": current_task.id, "new_status": "in_progress"},
)
if require_queue and (not res.ok or res.json().get("data", {}).get("updated", 0) < 1):
raise ValueError(
"Execution required enqueued task, but task id={} is not queued.".format(current_task.id)
)
# Set task status to started to prevent any external monitoring from killing it
self._session.api_client.tasks.started(
task=current_task.id,
status_reason="starting execution soon",
status_message="",
force=True,
)
except Exception:
if require_queue:
raise
@@ -2238,14 +2420,14 @@ class Worker(ServiceCommandSection):
# We expect the same behaviour in case full_monitoring was set, and in case docker mode is used
if full_monitoring or docker is not False:
if full_monitoring:
if not (ENV_WORKER_ID.get() or '').strip():
self._session.config["agent"]["worker_id"] = ''
if not (ENV_WORKER_ID.get() or "").strip():
self._session.config["agent"]["worker_id"] = ""
# make sure we support multiple instances if we need to
self._singleton()
self.temp_config_path = self.temp_config_path or safe_mkstemp(
suffix=".cfg", prefix=".clearml_agent.", text=True, name_only=True
)
self.dump_config(self.temp_config_path)
self.dump_config(filename=self.temp_config_path, config=self._session.pre_vault_config)
self._session._config_file = self.temp_config_path
worker_params = WorkerParams(
@@ -2266,8 +2448,6 @@ class Worker(ServiceCommandSection):
Singleton.close_pid_file()
return status if ENV_PROPAGATE_EXITCODE.get() else 0
self._apply_extra_configuration()
self._session.print_configuration()
# now mark the task as started
@@ -2836,8 +3016,8 @@ class Worker(ServiceCommandSection):
# Todo: add support for poetry caching
if not self.poetry.enabled:
# add to cache
print('Adding venv into cache: {}'.format(add_venv_folder_cache))
if add_venv_folder_cache:
print('Adding venv into cache: {}'.format(add_venv_folder_cache))
self.package_api.add_cached_venv(
requirements=[freeze, previous_reqs],
docker_cmd=execution_info.docker_cmd if execution_info else None,
@@ -2858,19 +3038,27 @@ class Worker(ServiceCommandSection):
self.log_traceback(e)
return freeze
def _install_poetry_requirements(self, repo_info):
# type: (Optional[RepoInfo]) -> Optional[PoetryAPI]
def _install_poetry_requirements(self, repo_info, working_dir=None):
# type: (Optional[RepoInfo], Optional[str]) -> Optional[PoetryAPI]
if not repo_info:
return None
files_from_working_dir = self._session.config.get(
"agent.package_manager.poetry_files_from_repo_working_dir", False)
lockfile_path = Path(repo_info.root) / ((working_dir or "") if files_from_working_dir else "")
try:
if not self.poetry.enabled:
return None
self.poetry.initialize(cwd=repo_info.root)
api = self.poetry.get_api(repo_info.root)
self.poetry.initialize(cwd=lockfile_path)
api = self.poetry.get_api(lockfile_path)
if api.enabled:
print('Poetry Enabled: Ignoring requested python packages, using repository poetry lock file!')
api.install()
return api
print(f"Could not find pyproject.toml or poetry.lock file in {lockfile_path} \n")
except Exception as ex:
self.log.error("failed installing poetry requirements: {}".format(ex))
return None
@@ -2901,7 +3089,8 @@ class Worker(ServiceCommandSection):
"""
if package_api:
package_api.cwd = cwd
api = self._install_poetry_requirements(repo_info)
api = self._install_poetry_requirements(repo_info, execution.working_dir)
if api:
# update back the package manager, this hack should be fixed
if package_api == self.package_api:
@@ -3425,6 +3614,11 @@ class Worker(ServiceCommandSection):
requirements_manager.translator.enabled = False
print(requirements_manager.replace(contents))
def remove_non_backwards_compatible_entries(self, config: Config):
if not self._standalone_mode or not ENV_CONFIG_BC_IN_STANDALONE.get() or self._session.feature_set == "basic":
return
config.pop("agent.package_manager.pip_version") # removed due to a breaking change in v1.5.1
def get_docker_config_cmd(self, docker_args, clean_api_credentials=False):
docker_image = str(ENV_DOCKER_IMAGE.get() or
self._session.config.get("agent.default_docker.image", "nvidia/cuda")) \
@@ -3447,6 +3641,7 @@ class Worker(ServiceCommandSection):
DockerArgsSanitizer.sanitize_docker_command(self._session, self._docker_arguments) or ''))
temp_config = deepcopy(self._session.config)
self.remove_non_backwards_compatible_entries(temp_config)
mounted_cache_dir = temp_config.get(
"agent.docker_internal_mounts.sdk_cache", self._docker_fixed_user_cache)
mounted_pip_dl_dir = temp_config.get(
@@ -3458,7 +3653,6 @@ class Worker(ServiceCommandSection):
temp_config.put("sdk.storage.cache.default_base_dir", mounted_cache_dir)
temp_config.put("agent.pip_download_cache.path", mounted_pip_dl_dir)
temp_config.put("agent.vcs_cache.path", mounted_vcs_cache)
temp_config.put("agent.package_manager.system_site_packages", True)
temp_config.put("agent.package_manager.conda_env_as_base_docker", False)
temp_config.put("agent.default_python", "")
temp_config.put("agent.python_binary", "")
@@ -3470,6 +3664,11 @@ class Worker(ServiceCommandSection):
temp_config.put("agent.git_pass", (ENV_AGENT_GIT_PASS.get() or
self._session.config.get("agent.git_pass", None)))
force_system_site_packages = ENV_FORCE_SYSTEM_SITE_PACKAGES.get()
force_system_site_packages = force_system_site_packages if force_system_site_packages is not None else True
if force_system_site_packages:
temp_config.put("agent.package_manager.system_site_packages", True)
if temp_config.get("agent.venvs_cache.path", None):
temp_config.put("agent.venvs_cache.path", '/root/.clearml/venvs-cache')
@@ -3493,34 +3692,35 @@ class Worker(ServiceCommandSection):
def _get_docker_config_cmd(self, temp_config, clean_api_credentials=False, **kwargs):
self.debug("Setting up docker config command")
host_cache = Path(os.path.expandvars(
self._session.config["sdk.storage.cache.default_base_dir"])).expanduser().as_posix()
def load_path(field, default=None):
value = self._session.config.get(field, default)
return Path(os.path.expandvars(value)).expanduser().as_posix() if value else None
host_cache = load_path("sdk.storage.cache.default_base_dir")
self.debug("host_cache: {}".format(host_cache))
host_pip_dl = Path(os.path.expandvars(
self._session.config["agent.pip_download_cache.path"])).expanduser().as_posix()
host_pip_dl = load_path("agent.pip_download_cache.path")
self.debug("host_pip_dl: {}".format(host_pip_dl))
host_vcs_cache = Path(os.path.expandvars(
self._session.config["agent.vcs_cache.path"])).expanduser().as_posix()
host_vcs_cache = load_path("agent.vcs_cache.path")
self.debug("host_vcs_cache: {}".format(host_vcs_cache))
host_venvs_cache = Path(os.path.expandvars(
self._session.config["agent.venvs_cache.path"])).expanduser().as_posix() \
if self._session.config.get("agent.venvs_cache.path", None) else None
host_venvs_cache = load_path("agent.venvs_cache.path")
self.debug("host_venvs_cache: {}".format(host_venvs_cache))
host_ssh_cache = self._host_ssh_cache
self.debug("host_ssh_cache: {}".format(host_ssh_cache))
host_apt_cache = Path(os.path.expandvars(self._session.config.get(
"agent.docker_apt_cache", '~/.clearml/apt-cache'))).expanduser().as_posix()
host_apt_cache = load_path("agent.docker_apt_cache", default="~/.clearml/apt-cache")
self.debug("host_apt_cache: {}".format(host_apt_cache))
host_pip_cache = Path(os.path.expandvars(self._session.config.get(
"agent.docker_pip_cache", '~/.clearml/pip-cache'))).expanduser().as_posix()
host_pip_cache = load_path("agent.docker_pip_cache", default="~/.clearml/pip-cache")
self.debug("host_pip_cache: {}".format(host_pip_cache))
if self.poetry.enabled:
host_poetry_cache = Path(os.path.expandvars(self._session.config.get(
"agent.docker_poetry_cache", '~/.clearml/poetry-cache'))).expanduser().as_posix()
else:
host_poetry_cache = None
host_poetry_cache = (
load_path("agent.docker_poetry_cache", "~/.clearml/poetry-cache") if self.poetry.enabled else None
)
self.debug("host_poetry_cache: {}".format(host_poetry_cache))
# make sure all folders are valid
@@ -3672,6 +3872,60 @@ class Worker(ServiceCommandSection):
pass
return results
@staticmethod
def _resolve_docker_env_args(docker_args):
# type: (List[str]) -> List[str]
"""
Resolve -e / --env docker environment args matching $VAR or ${VAR} from the host environment
:argument docker_args: List of docker argument strings (flags and values)
"""
non_list_args = (
"rm", "read-only", "sig-proxy", "tty", "privileged", "publish-all", "interactive", "init", "help", "detach"
)
non_list_args_single = (
"t", "P", "i", "d",
)
# if no filtering, do nothing
if not docker_args:
return docker_args
args = docker_args[:]
skip_arg = False
for i, cmd in enumerate(docker_args):
if skip_arg and not cmd.startswith("-"):
continue
skip_arg = False
if cmd.startswith("--"):
# jump over single command
if cmd[2:] in non_list_args:
continue
elif cmd.startswith("-"):
# jump over single character non args
if cmd[1:] in non_list_args_single:
continue
# if we are here we have a command to bypass and the list after it
if cmd in ('-e', '--env'):
skip_arg = True
for j in range(i+1, len(args)):
if args[j].startswith("-"):
break
parts = args[j].split("=", 1)
if len(parts) != 2:
continue
args[j] = "{}={}".format(parts[0], os.path.expandvars(parts[1]))
elif cmd.startswith("-"):
skip_arg = True
return args
def _get_docker_cmd(
self,
worker_id, parent_worker_id,
@@ -3697,11 +3951,20 @@ class Worker(ServiceCommandSection):
name=None,
mount_ssh=None, mount_ssh_ro=None, mount_apt_cache=None, mount_pip_cache=None, mount_poetry_cache=None,
env_task_id=None,
restart=None,
):
self.debug("Constructing docker command", context="docker")
docker = 'docker'
base_cmd = [docker, 'run', '-t']
use_rm = True
if restart:
if restart in ("unless-stopped", "no", "always") or restart.startswith("on-failure"):
base_cmd += ["--restart", restart]
use_rm = False
else:
self.log.error("Invalid restart value \"{}\" , ignoring".format(restart))
update_scheme = ""
dockers_nvidia_visible_devices = 'all'
gpu_devices = Session.get_nvidia_visible_env()
@@ -3726,9 +3989,14 @@ class Worker(ServiceCommandSection):
docker_arguments = list(docker_arguments) \
if isinstance(docker_arguments, (list, tuple)) else [docker_arguments]
docker_arguments = self._filter_docker_args(docker_arguments)
if self._session.config.get("agent.docker_allow_host_environ", None):
docker_arguments = self._resolve_docker_env_args(docker_arguments)
base_cmd += [a for a in docker_arguments if a]
if extra_docker_arguments:
# we always resolve environments in the `extra_docker_arguments` becuase the admin set them (not users)
extra_docker_arguments = self._resolve_docker_env_args(extra_docker_arguments)
extra_docker_arguments = [extra_docker_arguments] \
if isinstance(extra_docker_arguments, six.string_types) else extra_docker_arguments
base_cmd += [str(a) for a in extra_docker_arguments if a]
@@ -3737,6 +4005,10 @@ class Worker(ServiceCommandSection):
base_cmd += ['-l', self._worker_label.format(worker_id)]
base_cmd += ['-l', self._parent_worker_label.format(parent_worker_id)]
extra_labels = ENV_EXTRA_DOCKER_LABELS.get()
for label in (extra_labels or []):
base_cmd += ['-l', label]
self.debug("Command: {}".format(base_cmd), context="docker")
# check if running inside a kubernetes
@@ -3800,7 +4072,7 @@ class Worker(ServiceCommandSection):
base_cmd += ['-e', 'CLEARML_WORKER_ID='+worker_id, ]
# update the docker image, so the system knows where it runs
base_cmd += ['-e', 'CLEARML_DOCKER_IMAGE={} {}'.format(docker_image, ' '.join(docker_arguments or [])).strip()]
base_cmd += ['-e', 'CLEARML_DOCKER_IMAGE={}'.format(docker_image)]
if env_task_id:
base_cmd += ['-e', 'CLEARML_TASK_ID={}'.format(env_task_id), ]
@@ -3819,6 +4091,7 @@ class Worker(ServiceCommandSection):
# if we are running a RC version, install the same version in the docker
# because the default latest, will be a release version (not RC)
specify_version = ''
# noinspection PyBroadException
try:
from clearml_agent.version import __version__
_version_parts = __version__.split('.')
@@ -3827,13 +4100,15 @@ class Worker(ServiceCommandSection):
except:
pass
force_agent_repo = ENV_FORCE_DOCKER_AGENT_REPO.get()
if os.environ.get('FORCE_LOCAL_CLEARML_AGENT_WHEEL'):
local_wheel = os.path.expanduser(os.environ.get('FORCE_LOCAL_CLEARML_AGENT_WHEEL'))
docker_wheel = '/tmp/{}'.format(basename(local_wheel))
base_cmd += ['-v', local_wheel + ':' + docker_wheel]
clearml_agent_wheel = '\"{}\"'.format(docker_wheel)
elif os.environ.get('FORCE_CLEARML_AGENT_REPO'):
clearml_agent_wheel = os.environ.get('FORCE_CLEARML_AGENT_REPO')
elif force_agent_repo:
clearml_agent_wheel = force_agent_repo
else:
# clearml-agent{specify_version}
clearml_agent_wheel = 'clearml-agent{specify_version}'.format(specify_version=specify_version)
@@ -3914,7 +4189,8 @@ class Worker(ServiceCommandSection):
(['-v', host_cache+':'+mounted_cache] if host_cache else []) +
(['-v', host_vcs_cache+':'+mounted_vcs_cache] if host_vcs_cache else []) +
(['-v', host_venvs_cache + ':' + mounted_venvs_cache] if host_venvs_cache else []) +
['--rm', docker_image, 'bash', '-c',
(['--rm'] if use_rm else []) +
[docker_image, 'bash', '-c',
update_scheme +
extra_shell_script +
"cp {} {} ; ".format(DOCKER_ROOT_CONF_FILE, DOCKER_DEFAULT_CONF_FILE) +
@@ -4120,6 +4396,15 @@ class Worker(ServiceCommandSection):
" found {})".format(role)
)
@staticmethod
def _get_path(d, *path, default=None):
try:
return functools.reduce(
lambda a, b: a[b], path, d
)
except (IndexError, KeyError):
return default
if __name__ == "__main__":
pass

View File

@@ -5,9 +5,9 @@ from enum import IntEnum
from os import getenv, environ
from typing import Text, Optional, Union, Tuple, Any
import six
from pathlib2 import Path
import six
from clearml_agent.helper.base import normalize_path
PROGRAM_NAME = "clearml-agent"
@@ -69,42 +69,65 @@ ENV_AWS_SECRET_KEY = EnvironmentConfig("AWS_SECRET_ACCESS_KEY")
ENV_AZURE_ACCOUNT_KEY = EnvironmentConfig("AZURE_STORAGE_KEY")
ENVIRONMENT_CONFIG = {
"api.api_server": EnvironmentConfig("CLEARML_API_HOST", "TRAINS_API_HOST", ),
"api.files_server": EnvironmentConfig("CLEARML_FILES_HOST", "TRAINS_FILES_HOST", ),
"api.web_server": EnvironmentConfig("CLEARML_WEB_HOST", "TRAINS_WEB_HOST", ),
"api.api_server": EnvironmentConfig(
"CLEARML_API_HOST",
"TRAINS_API_HOST",
),
"api.files_server": EnvironmentConfig(
"CLEARML_FILES_HOST",
"TRAINS_FILES_HOST",
),
"api.web_server": EnvironmentConfig(
"CLEARML_WEB_HOST",
"TRAINS_WEB_HOST",
),
"api.credentials.access_key": EnvironmentConfig(
"CLEARML_API_ACCESS_KEY", "TRAINS_API_ACCESS_KEY",
"CLEARML_API_ACCESS_KEY",
"TRAINS_API_ACCESS_KEY",
),
"api.credentials.secret_key": ENV_AGENT_SECRET_KEY,
"agent.worker_name": EnvironmentConfig("CLEARML_WORKER_NAME", "TRAINS_WORKER_NAME", ),
"agent.worker_id": EnvironmentConfig("CLEARML_WORKER_ID", "TRAINS_WORKER_ID", ),
"agent.cuda_version": EnvironmentConfig(
"CLEARML_CUDA_VERSION", "TRAINS_CUDA_VERSION", "CUDA_VERSION"
"agent.worker_name": EnvironmentConfig(
"CLEARML_WORKER_NAME",
"TRAINS_WORKER_NAME",
),
"agent.cudnn_version": EnvironmentConfig(
"CLEARML_CUDNN_VERSION", "TRAINS_CUDNN_VERSION", "CUDNN_VERSION"
),
"agent.cpu_only": EnvironmentConfig(
names=("CLEARML_CPU_ONLY", "TRAINS_CPU_ONLY", "CPU_ONLY"), type=bool
"agent.worker_id": EnvironmentConfig(
"CLEARML_WORKER_ID",
"TRAINS_WORKER_ID",
),
"agent.cuda_version": EnvironmentConfig("CLEARML_CUDA_VERSION", "TRAINS_CUDA_VERSION", "CUDA_VERSION"),
"agent.cudnn_version": EnvironmentConfig("CLEARML_CUDNN_VERSION", "TRAINS_CUDNN_VERSION", "CUDNN_VERSION"),
"agent.cpu_only": EnvironmentConfig(names=("CLEARML_CPU_ONLY", "TRAINS_CPU_ONLY", "CPU_ONLY"), type=bool),
"agent.crash_on_exception": EnvironmentConfig("CLEAMRL_AGENT_CRASH_ON_EXCEPTION", type=bool),
"sdk.aws.s3.key": EnvironmentConfig("AWS_ACCESS_KEY_ID"),
"sdk.aws.s3.secret": ENV_AWS_SECRET_KEY,
"sdk.aws.s3.region": EnvironmentConfig("AWS_DEFAULT_REGION"),
"sdk.azure.storage.containers.0": {'account_name': EnvironmentConfig("AZURE_STORAGE_ACCOUNT"),
'account_key': ENV_AZURE_ACCOUNT_KEY},
"sdk.azure.storage.containers.0": {
"account_name": EnvironmentConfig("AZURE_STORAGE_ACCOUNT"),
"account_key": ENV_AZURE_ACCOUNT_KEY,
},
"sdk.google.storage.credentials_json": EnvironmentConfig("GOOGLE_APPLICATION_CREDENTIALS"),
}
ENVIRONMENT_SDK_PARAMS = {
"task_id": ("CLEARML_TASK_ID", "TRAINS_TASK_ID", ),
"config_file": ("CLEARML_CONFIG_FILE", "TRAINS_CONFIG_FILE", ),
"log_level": ("CLEARML_LOG_LEVEL", "TRAINS_LOG_LEVEL", ),
"log_to_backend": ("CLEARML_LOG_TASK_TO_BACKEND", "TRAINS_LOG_TASK_TO_BACKEND", ),
"task_id": (
"CLEARML_TASK_ID",
"TRAINS_TASK_ID",
),
"config_file": (
"CLEARML_CONFIG_FILE",
"TRAINS_CONFIG_FILE",
),
"log_level": (
"CLEARML_LOG_LEVEL",
"TRAINS_LOG_LEVEL",
),
"log_to_backend": (
"CLEARML_LOG_TASK_TO_BACKEND",
"TRAINS_LOG_TASK_TO_BACKEND",
),
}
ENVIRONMENT_BACKWARD_COMPATIBLE = EnvironmentConfig(
names=("CLEARML_AGENT_ALG_ENV", "TRAINS_AGENT_ALG_ENV"), type=bool)
ENVIRONMENT_BACKWARD_COMPATIBLE = EnvironmentConfig(names=("CLEARML_AGENT_ALG_ENV", "TRAINS_AGENT_ALG_ENV"), type=bool)
VIRTUAL_ENVIRONMENT_PATH = {
"python2": normalize_path(CONFIG_DIR, "py2venv"),
@@ -123,38 +146,67 @@ TOKEN_EXPIRATION_SECONDS = int(timedelta(days=2).total_seconds())
METADATA_EXTENSION = ".json"
DEFAULT_VENV_UPDATE_URL = (
"https://raw.githubusercontent.com/Yelp/venv-update/v3.2.4/venv_update.py"
)
DEFAULT_VENV_UPDATE_URL = "https://raw.githubusercontent.com/Yelp/venv-update/v3.2.4/venv_update.py"
WORKING_REPOSITORY_DIR = "task_repository"
WORKING_STANDALONE_DIR = "code"
DEFAULT_VCS_CACHE = normalize_path(CONFIG_DIR, "vcs-cache")
PIP_EXTRA_INDICES = [
]
PIP_EXTRA_INDICES = []
DEFAULT_PIP_DOWNLOAD_CACHE = normalize_path(CONFIG_DIR, "pip-download-cache")
ENV_DOCKER_IMAGE = EnvironmentConfig('CLEARML_DOCKER_IMAGE', 'TRAINS_DOCKER_IMAGE')
ENV_WORKER_ID = EnvironmentConfig('CLEARML_WORKER_ID', 'TRAINS_WORKER_ID')
ENV_WORKER_TAGS = EnvironmentConfig('CLEARML_WORKER_TAGS')
ENV_AGENT_SKIP_PIP_VENV_INSTALL = EnvironmentConfig('CLEARML_AGENT_SKIP_PIP_VENV_INSTALL')
ENV_AGENT_SKIP_PYTHON_ENV_INSTALL = EnvironmentConfig('CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL', type=bool)
ENV_DOCKER_SKIP_GPUS_FLAG = EnvironmentConfig('CLEARML_DOCKER_SKIP_GPUS_FLAG', 'TRAINS_DOCKER_SKIP_GPUS_FLAG')
ENV_AGENT_GIT_USER = EnvironmentConfig('CLEARML_AGENT_GIT_USER', 'TRAINS_AGENT_GIT_USER')
ENV_AGENT_GIT_PASS = EnvironmentConfig('CLEARML_AGENT_GIT_PASS', 'TRAINS_AGENT_GIT_PASS')
ENV_AGENT_GIT_HOST = EnvironmentConfig('CLEARML_AGENT_GIT_HOST', 'TRAINS_AGENT_GIT_HOST')
ENV_AGENT_DISABLE_SSH_MOUNT = EnvironmentConfig('CLEARML_AGENT_DISABLE_SSH_MOUNT', type=bool)
ENV_SSH_AUTH_SOCK = EnvironmentConfig('SSH_AUTH_SOCK')
ENV_TASK_EXECUTE_AS_USER = EnvironmentConfig('CLEARML_AGENT_EXEC_USER', 'TRAINS_AGENT_EXEC_USER')
ENV_TASK_EXTRA_PYTHON_PATH = EnvironmentConfig('CLEARML_AGENT_EXTRA_PYTHON_PATH', 'TRAINS_AGENT_EXTRA_PYTHON_PATH')
ENV_DOCKER_HOST_MOUNT = EnvironmentConfig('CLEARML_AGENT_K8S_HOST_MOUNT', 'CLEARML_AGENT_DOCKER_HOST_MOUNT',
'TRAINS_AGENT_K8S_HOST_MOUNT', 'TRAINS_AGENT_DOCKER_HOST_MOUNT')
ENV_VENV_CACHE_PATH = EnvironmentConfig('CLEARML_AGENT_VENV_CACHE_PATH')
ENV_EXTRA_DOCKER_ARGS = EnvironmentConfig('CLEARML_AGENT_EXTRA_DOCKER_ARGS', type=list)
ENV_DEBUG_INFO = EnvironmentConfig('CLEARML_AGENT_DEBUG_INFO')
ENV_CHILD_AGENTS_COUNT_CMD = EnvironmentConfig('CLEARML_AGENT_CHILD_AGENTS_COUNT_CMD')
ENV_DOCKER_ARGS_FILTERS = EnvironmentConfig('CLEARML_AGENT_DOCKER_ARGS_FILTERS')
ENV_DOCKER_ARGS_HIDE_ENV = EnvironmentConfig('CLEARML_AGENT_DOCKER_ARGS_HIDE_ENV')
ENV_PIP_EXTRA_INSTALL_FLAGS = EnvironmentConfig("CLEARML_EXTRA_PIP_INSTALL_FLAGS", type=list)
ENV_DOCKER_IMAGE = EnvironmentConfig("CLEARML_DOCKER_IMAGE", "TRAINS_DOCKER_IMAGE")
ENV_WORKER_ID = EnvironmentConfig("CLEARML_WORKER_ID", "TRAINS_WORKER_ID")
ENV_WORKER_TAGS = EnvironmentConfig("CLEARML_WORKER_TAGS")
ENV_AGENT_SKIP_PIP_VENV_INSTALL = EnvironmentConfig("CLEARML_AGENT_SKIP_PIP_VENV_INSTALL")
ENV_AGENT_SKIP_PYTHON_ENV_INSTALL = EnvironmentConfig("CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL", type=bool)
ENV_DOCKER_SKIP_GPUS_FLAG = EnvironmentConfig("CLEARML_DOCKER_SKIP_GPUS_FLAG", "TRAINS_DOCKER_SKIP_GPUS_FLAG")
ENV_AGENT_GIT_USER = EnvironmentConfig("CLEARML_AGENT_GIT_USER", "TRAINS_AGENT_GIT_USER")
ENV_AGENT_GIT_PASS = EnvironmentConfig("CLEARML_AGENT_GIT_PASS", "TRAINS_AGENT_GIT_PASS")
ENV_AGENT_GIT_HOST = EnvironmentConfig("CLEARML_AGENT_GIT_HOST", "TRAINS_AGENT_GIT_HOST")
ENV_AGENT_DISABLE_SSH_MOUNT = EnvironmentConfig("CLEARML_AGENT_DISABLE_SSH_MOUNT", type=bool)
ENV_SSH_AUTH_SOCK = EnvironmentConfig("SSH_AUTH_SOCK")
ENV_TASK_EXECUTE_AS_USER = EnvironmentConfig("CLEARML_AGENT_EXEC_USER", "TRAINS_AGENT_EXEC_USER")
ENV_TASK_EXTRA_PYTHON_PATH = EnvironmentConfig("CLEARML_AGENT_EXTRA_PYTHON_PATH", "TRAINS_AGENT_EXTRA_PYTHON_PATH")
ENV_DOCKER_HOST_MOUNT = EnvironmentConfig(
"CLEARML_AGENT_K8S_HOST_MOUNT",
"CLEARML_AGENT_DOCKER_HOST_MOUNT",
"TRAINS_AGENT_K8S_HOST_MOUNT",
"TRAINS_AGENT_DOCKER_HOST_MOUNT",
)
ENV_VENV_CACHE_PATH = EnvironmentConfig("CLEARML_AGENT_VENV_CACHE_PATH")
ENV_EXTRA_DOCKER_ARGS = EnvironmentConfig("CLEARML_AGENT_EXTRA_DOCKER_ARGS", type=list)
ENV_EXTRA_DOCKER_LABELS = EnvironmentConfig("CLEARML_AGENT_EXTRA_DOCKER_LABELS", type=list)
ENV_DEBUG_INFO = EnvironmentConfig("CLEARML_AGENT_DEBUG_INFO")
ENV_CHILD_AGENTS_COUNT_CMD = EnvironmentConfig("CLEARML_AGENT_CHILD_AGENTS_COUNT_CMD")
ENV_DOCKER_ARGS_FILTERS = EnvironmentConfig("CLEARML_AGENT_DOCKER_ARGS_FILTERS")
ENV_DOCKER_ARGS_HIDE_ENV = EnvironmentConfig("CLEARML_AGENT_DOCKER_ARGS_HIDE_ENV")
ENV_CONFIG_BC_IN_STANDALONE = EnvironmentConfig("CLEARML_AGENT_STANDALONE_CONFIG_BC", type=bool)
""" Maintain backwards compatible configuration when launching in standalone mode """
ENV_CUSTOM_BUILD_SCRIPT = EnvironmentConfig('CLEARML_AGENT_CUSTOM_BUILD_SCRIPT')
ENV_FORCE_DOCKER_AGENT_REPO = EnvironmentConfig("FORCE_CLEARML_AGENT_REPO", "CLEARML_AGENT_DOCKER_AGENT_REPO")
ENV_SERVICES_DOCKER_RESTART = EnvironmentConfig("CLEARML_AGENT_SERVICES_DOCKER_RESTART")
"""
Specify a restart value for a services agent task containers.
Note that when a restart value is provided, task containers will not be run with the '--rm' flag and will
not be cleaned up automatically when completed (this will need to be done externally using the
'docker container prune' command to free up resources).
Value format for this env var is "<restart-value>;<task-selector>", where:
- <restart-value> can be any valid restart value for docker-run (see https://docs.docker.com/engine/reference/commandline/run/#restart)
- <task-selector> is optional, allowing to restrict this behaviour to specific tasks. The format is:
"<path-to-task-field>=<value>" where:
* <path-to-task-field> is a dot-separated path to a task field (e.g. "container.image")
* <value> is optional. If not provided, the restart policy till be applied for the task container if the
path provided exists. If provided, the restart policy will be applied if the value matches the value
obtained from the task (value parsing and comparison is based on the type of value obtained from the task)
For example:
CLEARML_AGENT_SERVICES_DOCKER_RESTART=unless-stopped
CLEARML_AGENT_SERVICES_DOCKER_RESTART=unless-stopped;container.image=some-image
"""
ENV_FORCE_SYSTEM_SITE_PACKAGES = EnvironmentConfig("CLEARML_AGENT_FORCE_SYSTEM_SITE_PACKAGES", type=bool)
""" Force system_site_packages: true when running tasks in containers (i.e. docker mode or k8s glue) """
ENV_CUSTOM_BUILD_SCRIPT = EnvironmentConfig("CLEARML_AGENT_CUSTOM_BUILD_SCRIPT")
"""
Specifies a custom environment setup script to be executed instead of installing a virtual environment.
If provided, this script is executed following Git cloning. Script command may include environment variable and
@@ -186,6 +238,8 @@ ENV_CUSTOM_BUILD_SCRIPT = EnvironmentConfig('CLEARML_AGENT_CUSTOM_BUILD_SCRIPT')
standard flow.
"""
ENV_PACKAGE_PYTORCH_RESOLVE = EnvironmentConfig("CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE")
class FileBuffering(IntEnum):
"""

View File

@@ -0,0 +1,15 @@
from threading import Thread
from clearml_agent.session import Session
class K8sDaemon(Thread):
def __init__(self, agent):
super(K8sDaemon, self).__init__(target=self.target)
self.daemon = True
self._agent = agent
self.log = agent.log
self._session: Session = agent._session
def target(self):
pass

View File

@@ -0,0 +1,12 @@
class GetPodsError(Exception):
pass
class GetJobsError(Exception):
pass
class GetPodCountError(Exception):
pass

View File

@@ -9,11 +9,10 @@ import os
import re
import subprocess
import tempfile
from collections import defaultdict
from collections import defaultdict, namedtuple
from copy import deepcopy
from pathlib import Path
from pprint import pformat
from threading import Thread
from time import sleep, time
from typing import Text, List, Callable, Any, Collection, Optional, Union, Iterable, Dict, Tuple, Set
@@ -22,9 +21,17 @@ import yaml
from clearml_agent.backend_api.session import Request
from clearml_agent.commands.events import Events
from clearml_agent.commands.worker import Worker, get_task_container, set_task_container, get_next_task
from clearml_agent.definitions import ENV_DOCKER_IMAGE, ENV_AGENT_GIT_USER, ENV_AGENT_GIT_PASS
from clearml_agent.errors import APIError
from clearml_agent.definitions import (
ENV_DOCKER_IMAGE,
ENV_AGENT_GIT_USER,
ENV_AGENT_GIT_PASS,
ENV_FORCE_SYSTEM_SITE_PACKAGES,
)
from clearml_agent.errors import APIError, UsageError
from clearml_agent.glue.definitions import ENV_START_AGENT_SCRIPT_PATH
from clearml_agent.glue.errors import GetPodCountError
from clearml_agent.glue.utilities import get_path, get_bash_output
from clearml_agent.glue.pending_pods_daemon import PendingPodsDaemon
from clearml_agent.helper.base import safe_remove_file
from clearml_agent.helper.dicts import merge_dicts
from clearml_agent.helper.process import get_bash_output, stringify_bash_output
@@ -32,21 +39,16 @@ from clearml_agent.helper.resource_monitor import ResourceMonitor
from clearml_agent.interface.base import ObjectID
class K8sIntegration(Worker):
SUPPORTED_KIND = ("pod", "job")
K8S_PENDING_QUEUE = "k8s_scheduler"
K8S_DEFAULT_NAMESPACE = "clearml"
AGENT_LABEL = "CLEARML=agent"
QUEUE_LABEL = "clearml-agent-queue"
KUBECTL_APPLY_CMD = "kubectl apply --namespace={namespace} -f"
KUBECTL_DELETE_CMD = "kubectl delete pods " \
"-l={agent_label} " \
"--field-selector=status.phase!=Pending,status.phase!=Running " \
"--namespace={namespace} " \
"--output name"
BASH_INSTALL_SSH_CMD = [
"apt-get update",
"apt-get install -y openssh-server",
@@ -77,7 +79,7 @@ class K8sIntegration(Worker):
"[ ! -z $LOCAL_PYTHON ] || apt-get install -y python3-pip",
"[ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3",
"{extra_bash_init_cmd}",
"$LOCAL_PYTHON -m pip install clearml-agent{agent_install_args}",
"[ ! -z $CLEARML_AGENT_NO_UPDATE ] || $LOCAL_PYTHON -m pip install clearml-agent{agent_install_args}",
"{extra_docker_bash_script}",
"$LOCAL_PYTHON -m clearml_agent execute {default_execution_agent_args} --id {task_id}"
]
@@ -131,13 +133,20 @@ class K8sIntegration(Worker):
:param int max_pods_limit: Maximum number of pods that K8S glue can run at the same time
"""
super(K8sIntegration, self).__init__()
self.kind = os.environ.get("CLEARML_K8S_GLUE_KIND", "pod").strip().lower()
if self.kind not in self.SUPPORTED_KIND:
raise UsageError(f"Kind '{self.kind}' not supported (expected {','.join(self.SUPPORTED_KIND)})")
self.using_jobs = self.kind == "job"
self.pod_name_prefix = pod_name_prefix or self.DEFAULT_POD_NAME_PREFIX
self.limit_pod_label = limit_pod_label or self.DEFAULT_LIMIT_POD_LABEL
self.k8s_pending_queue_name = k8s_pending_queue_name or self.K8S_PENDING_QUEUE
self.k8s_pending_queue_id = None
self.container_bash_script = container_bash_script or self.CONTAINER_BASH_SCRIPT
# Always do system packages, because by we will be running inside a docker
self._session.config.put("agent.package_manager.system_site_packages", True)
force_system_packages = ENV_FORCE_SYSTEM_SITE_PACKAGES.get()
self._force_system_site_packages = force_system_packages if force_system_packages is not None else True
if self._force_system_site_packages:
# Use system packages, because by we will be running inside a docker
self._session.config.put("agent.package_manager.system_site_packages", True)
# Add debug logging
if debug:
self.log.logger.disabled = False
@@ -172,11 +181,18 @@ class K8sIntegration(Worker):
self._agent_label = None
self._monitor_hanging_pods()
self._pending_pods_daemon = self._create_pending_pods_daemon(
cls_=PendingPodsDaemon,
polling_interval=self._polling_interval
)
self._pending_pods_daemon.start()
self._min_cleanup_interval_per_ns_sec = 1.0
self._last_pod_cleanup_per_ns = defaultdict(lambda: 0.)
def _create_pending_pods_daemon(self, cls_, **kwargs):
return cls_(agent=self, **kwargs)
def _load_overrides_yaml(self, overrides_yaml):
if not overrides_yaml:
return
@@ -202,11 +218,6 @@ class K8sIntegration(Worker):
self.log.warning('Removing containers section: {}'.format(overrides['spec'].pop('containers')))
self.overrides_json_string = json.dumps(overrides)
def _monitor_hanging_pods(self):
_check_pod_thread = Thread(target=self._monitor_hanging_pods_daemon)
_check_pod_thread.daemon = True
_check_pod_thread.start()
@staticmethod
def _load_template_file(path):
with open(os.path.expandvars(os.path.expanduser(str(path))), 'rt') as f:
@@ -221,16 +232,19 @@ class K8sIntegration(Worker):
except (IndexError, KeyError):
return default
def _get_kubectl_options(self, command, extra_labels=None, filters=None, output="json", labels=None):
# type: (str, Iterable[str], Iterable[str], str, Iterable[str]) -> Dict
if not labels:
def _get_kubectl_options(self, command, extra_labels=None, filters=None, output="json", labels=None, ns=None):
# type: (str, Iterable[str], Iterable[str], str, Iterable[str], str) -> Dict
if labels is False:
labels = []
elif not labels:
labels = [self._get_agent_label()]
labels = list(labels) + (list(extra_labels) if extra_labels else [])
d = {
"-l": ",".join(labels),
"-n": str(self.namespace),
"-n": ns or str(self.namespace),
"-o": output,
}
if labels:
d["-l"] = ",".join(labels)
if filters:
d["--field-selector"] = ",".join(filters)
return d
@@ -241,99 +255,6 @@ class K8sIntegration(Worker):
command=command, opts=" ".join(x for item in opts.items() for x in item)
)
def _monitor_hanging_pods_daemon(self):
last_tasks_msgs = {} # last msg updated for every task
while True:
kubectl_cmd = self.get_kubectl_command("get pods", filters=["status.phase=Pending"])
self.log.debug("Detecting hanging pods: {}".format(kubectl_cmd))
output = stringify_bash_output(get_bash_output(kubectl_cmd))
try:
output_config = json.loads(output)
except Exception as ex:
self.log.warning('K8S Glue pods monitor: Failed parsing kubectl output:\n{}\nEx: {}'.format(output, ex))
sleep(self._polling_interval)
continue
pods = output_config.get('items', [])
task_ids = set()
for pod in pods:
pod_name = pod.get('metadata', {}).get('name', None)
if not pod_name:
continue
task_id = pod_name.rpartition('-')[-1]
if not task_id:
continue
namespace = pod.get('metadata', {}).get('namespace', None)
if not namespace:
continue
task_ids.add(task_id)
msg = None
waiting = self._get_path(pod, 'status', 'containerStatuses', 0, 'state', 'waiting')
if not waiting:
condition = self._get_path(pod, 'status', 'conditions', 0)
if condition:
reason = condition.get('reason')
if reason == 'Unschedulable':
message = condition.get('message')
msg = reason + (" ({})".format(message) if message else "")
else:
reason = waiting.get("reason", None)
message = waiting.get("message", None)
msg = reason + (" ({})".format(message) if message else "")
if reason == 'ImagePullBackOff':
delete_pod_cmd = 'kubectl delete pods {} -n {}'.format(pod_name, namespace)
self.log.debug(" - deleting pod due to ImagePullBackOff: {}".format(delete_pod_cmd))
get_bash_output(delete_pod_cmd)
try:
self.log.debug(" - Detecting hanging pods: {}".format(kubectl_cmd))
self._session.api_client.tasks.failed(
task=task_id,
status_reason="K8S glue error: {}".format(msg),
status_message="Changed by K8S glue",
force=True
)
except Exception as ex:
self.log.warning(
'K8S Glue pods monitor: Failed deleting task "{}"\nEX: {}'.format(task_id, ex)
)
# clean up any msg for this task
last_tasks_msgs.pop(task_id, None)
continue
if msg and last_tasks_msgs.get(task_id, None) != msg:
try:
result = self._session.send_request(
service='tasks',
action='update',
json={"task": task_id, "status_message": "K8S glue status: {}".format(msg)},
method=Request.def_method,
async_enable=False,
)
if not result.ok:
result_msg = self._get_path(result.json(), 'meta', 'result_msg')
raise Exception(result_msg or result.text)
# update last msg for this task
last_tasks_msgs[task_id] = msg
except Exception as ex:
self.log.warning(
'K8S Glue pods monitor: Failed setting status message for task "{}"\nMSG: {}\nEX: {}'.format(
task_id, msg, ex
)
)
# clean up any last message for a task that wasn't seen as a pod
last_tasks_msgs = {k: v for k, v in last_tasks_msgs.items() if k in task_ids}
sleep(self._polling_interval)
def _set_task_user_properties(self, task_id: str, task_session=None, **properties: str):
session = task_session or self._session
if self._edit_hyperparams_support is not True:
@@ -377,34 +298,50 @@ class K8sIntegration(Worker):
return self._agent_label
def _get_used_pods(self):
# type: () -> Tuple[int, Set[str]]
# noinspection PyBroadException
RunningPod = namedtuple("RunningPod", "name queue namespace")
def _get_running_pods(self):
try:
kubectl_cmd = self.get_kubectl_command(
"get pods",
output="jsonpath=\"{range .items[*]}{.metadata.name}{' '}{.metadata.namespace}{'\\n'}{end}\""
output="jsonpath=\"{{range .items[*]}}{{.metadata.name}}{{' '}}{{.metadata.namespace}}{{' '}}"
"{{.metadata.labels.{}}}{{'\\n'}}{{end}}\"".format(self.QUEUE_LABEL)
)
self.log.debug("Getting used pods: {}".format(kubectl_cmd))
output = stringify_bash_output(get_bash_output(kubectl_cmd, raise_error=True))
if not output:
# No such pod exist so we can use the pod_number we found
return 0, set([])
return []
try:
items = output.splitlines()
current_pod_count = len(items)
namespaces = {item.rpartition(" ")[-1] for item in items}
self.log.debug(" - found {} pods in namespaces {}".format(current_pod_count, ", ".join(namespaces)))
except (KeyError, ValueError, TypeError, AttributeError) as ex:
print("Failed parsing used pods command response for cleanup: {}".format(ex))
return -1, set([])
return [
self.RunningPod(
name=parts[0],
namespace=parts[1],
queue=parts[2]
)
for parts in (line.split(" ") for line in output.splitlines())
]
except Exception as ex:
raise Exception("Failed parsing used pods command response for cleanup: {}".format(ex))
except Exception as ex:
raise Exception('Failed obtaining used pods information: {}'.format(ex))
def _get_used_pods(self):
# type: () -> Tuple[int, Set[str]]
# noinspection PyBroadException
try:
items = self._get_running_pods()
if not items:
return 0, set([])
current_pod_count = len(items)
namespaces = {item.namespace for item in items}
self.log.debug(" - found {} pods in namespaces {}".format(current_pod_count, ", ".join(namespaces)))
return current_pod_count, namespaces
except Exception as ex:
print('Failed obtaining used pods information: {}'.format(ex))
return -2, set([])
self.log.debug("Failed getting used pods: {}", ex)
return -1, set([])
def _is_same_tenant(self, task_session):
if not task_session or task_session is self._session:
@@ -417,6 +354,69 @@ class K8sIntegration(Worker):
except Exception as ex:
print("ERROR: Failed getting tenant for task session: {}".format(ex))
def get_jobs_info(self, info_path: str, condition: str = None, namespace=None, debug_msg: str = None)\
-> Dict[str, str]:
cond = "==".join((x.strip("=") for x in condition.partition("=")[::2]))
output = f"jsonpath='{{range .items[?(@.{cond})]}}{{@.{info_path}}}{{\" \"}}{{@.metadata.namespace}}{{\"\\n\"}}{{end}}'"
kubectl_cmd = self.get_kubectl_command("get job", output=output, ns=namespace)
if debug_msg:
self.log.debug(debug_msg.format(cmd=kubectl_cmd))
output = stringify_bash_output(get_bash_output(kubectl_cmd))
output = output.strip("'") # for Windows debugging :(
try:
data_items = dict(l.strip().partition(" ")[::2] for l in output.splitlines())
return data_items
except Exception as ex:
self.log.warning('Failed parsing kubectl output:\n{}\nEx: {}'.format(output, ex))
def get_pods_for_jobs(self, job_condition: str = None, pod_filters: List[str] = None, debug_msg: str = None):
controller_uids = self.get_jobs_info(
"spec.selector.matchLabels.controller-uid", condition=job_condition, debug_msg=debug_msg
)
if not controller_uids:
# No pods were found for these jobs
return []
pods = self.get_pods(filters=pod_filters, debug_msg=debug_msg)
return [
pod for pod in pods
if get_path(pod, "metadata", "labels", "controller-uid") in controller_uids
]
def get_pods(self, filters: List[str] = None, debug_msg: str = None):
kubectl_cmd = self.get_kubectl_command(
"get pods",
filters=filters,
labels=False if self.using_jobs else None,
)
if debug_msg:
self.log.debug(debug_msg.format(cmd=kubectl_cmd))
output = stringify_bash_output(get_bash_output(kubectl_cmd))
try:
output_config = json.loads(output)
except Exception as ex:
self.log.warning('Failed parsing kubectl output:\n{}\nEx: {}'.format(output, ex))
return
return output_config.get('items', [])
def _get_pod_count(self, extra_labels: List[str] = None, msg: str = None):
kubectl_cmd_new = self.get_kubectl_command(
f"get {self.kind}s",
extra_labels= extra_labels
)
self.log.debug("{}{}".format((msg + ": ") if msg else "", kubectl_cmd_new))
process = subprocess.Popen(kubectl_cmd_new.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
output, error = process.communicate()
output = stringify_bash_output(output)
error = stringify_bash_output(error)
try:
return len(json.loads(output).get("items", []))
except (ValueError, TypeError) as ex:
self.log.warning(
"K8S Glue pods monitor: Failed parsing kubectl output:\n{}\nEx: {}".format(output, ex)
)
raise GetPodCountError()
def run_one_task(self, queue: Text, task_id: Text, worker_args=None, task_session=None, **_):
print('Pulling task {} launching on kubernetes cluster'.format(task_id))
session = task_session or self._session
@@ -428,6 +428,12 @@ class K8sIntegration(Worker):
print('Pushing task {} into temporary pending queue'.format(task_id))
_ = session.api_client.tasks.stop(task_id, force=True)
# Just make sure to clean up in case the task is stuck in the queue (known issue)
self._session.api_client.queues.remove_task(
task=task_id,
queue=self.k8s_pending_queue_id,
)
res = self._session.api_client.tasks.enqueue(
task_id,
queue=self.k8s_pending_queue_id,
@@ -455,26 +461,26 @@ class K8sIntegration(Worker):
git_user = ENV_AGENT_GIT_USER.get() or self._session.config.get("agent.git_user", None)
git_pass = ENV_AGENT_GIT_PASS.get() or self._session.config.get("agent.git_pass", None)
extra_config_values = [
'agent.package_manager.system_site_packages: true',
'agent.package_manager.system_site_packages: true' if self._force_system_site_packages else '',
'agent.git_user: "{}"'.format(git_user) if git_user else '',
'agent.git_pass: "{}"'.format(git_pass) if git_pass else '',
]
# noinspection PyProtectedMember
config_content = (
self.conf_file_content or Path(session._config_file).read_text() or ""
self.conf_file_content or (session._config_file and Path(session._config_file).read_text()) or ""
) + '\n{}\n'.format('\n'.join(x for x in extra_config_values if x))
hocon_config_encoded = config_content.encode("ascii")
create_clearml_conf = ["echo '{}' | base64 --decode >> ~/clearml.conf".format(
clearml_conf_create_script = ["echo '{}' | base64 --decode >> ~/clearml.conf".format(
base64.b64encode(
hocon_config_encoded
).decode('ascii')
)]
if task_session:
create_clearml_conf.append(
clearml_conf_create_script.append(
"export CLEARML_AUTH_TOKEN=$(echo '{}' | base64 --decode)".format(
base64.b64encode(task_session.token.encode("ascii")).decode('ascii')
)
@@ -495,23 +501,15 @@ class K8sIntegration(Worker):
while self.ports_mode or self.max_pods_limit:
pod_number = self.base_pod_num + pod_count
kubectl_cmd_new = self.get_kubectl_command(
"get pods",
extra_labels=[self.limit_pod_label.format(pod_number=pod_number)] if self.ports_mode else None
)
self.log.debug("Looking for a free pod/port: {}".format(kubectl_cmd_new))
process = subprocess.Popen(kubectl_cmd_new.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
output, error = process.communicate()
output = stringify_bash_output(output)
error = stringify_bash_output(error)
try:
items_count = len(json.loads(output).get("items", []))
except (ValueError, TypeError) as ex:
items_count = self._get_pod_count(
extra_labels=[self.limit_pod_label.format(pod_number=pod_number)] if self.ports_mode else None,
msg="Looking for a free pod/port"
)
except GetPodCountError:
self.log.warning(
"K8S Glue pods monitor: Failed parsing kubectl output:\n{}\ntask '{}' "
"will be enqueued back to queue '{}'\nEx: {}".format(
output, task_id, queue, ex
"K8S Glue pods monitor: task '{}' will be enqueued back to queue '{}'".format(
task_id, queue
)
)
session.api_client.tasks.stop(task_id, force=True)
@@ -535,8 +533,6 @@ class K8sIntegration(Worker):
if current_pod_count >= max_count:
# All pods are taken, exit
self.log.debug(
"kubectl last result: {}\n{}".format(error, output))
self.log.warning(
"All k8s services are in use, task '{}' "
"will be enqueued back to queue '{}'".format(
@@ -581,7 +577,7 @@ class K8sIntegration(Worker):
output, error = self._kubectl_apply(
template=template,
pod_number=pod_number,
create_clearml_conf=create_clearml_conf,
clearml_conf_create_script=clearml_conf_create_script,
labels=labels,
docker_image=container['image'],
docker_args=container['arguments'],
@@ -591,11 +587,11 @@ class K8sIntegration(Worker):
namespace=namespace,
)
print('kubectl output:\n{}\n{}'.format(error, output))
if error:
send_log = "Running kubectl encountered an error: {}".format(error)
self.log.error(send_log)
self.send_logs(task_id, send_log.splitlines())
print('kubectl output:\n{}\n{}'.format(error, output))
if error:
send_log = "Running kubectl encountered an error: {}".format(error)
self.log.error(send_log)
self.send_logs(task_id, send_log.splitlines())
user_props = {"k8s-queue": str(queue_name)}
if self.ports_mode:
@@ -604,6 +600,7 @@ class K8sIntegration(Worker):
"k8s-pod-number": pod_number,
"k8s-pod-label": labels[0],
"k8s-internal-pod-count": pod_count,
"k8s-agent": self._get_agent_label(),
}
)
@@ -625,8 +622,8 @@ class K8sIntegration(Worker):
def _get_pod_labels(self, queue, queue_name):
return [
self._get_agent_label(),
"clearml-agent-queue={}".format(self._safe_k8s_label_value(queue)),
"clearml-agent-queue-name={}".format(self._safe_k8s_label_value(queue_name))
"{}={}".format(self.QUEUE_LABEL, self._safe_k8s_label_value(queue)),
"{}-name={}".format(self.QUEUE_LABEL, self._safe_k8s_label_value(queue_name))
]
def _get_docker_args(self, docker_args, flags, target=None, convert=None):
@@ -655,32 +652,10 @@ class K8sIntegration(Worker):
return {target: results} if results else {}
return results
def _kubectl_apply(
self,
create_clearml_conf,
docker_image,
docker_args,
docker_bash,
labels,
queue,
task_id,
namespace,
template=None,
pod_number=None
):
template.setdefault('apiVersion', 'v1')
template['kind'] = 'Pod'
template.setdefault('metadata', {})
name = self.pod_name_prefix + str(task_id)
template['metadata']['name'] = name
template.setdefault('spec', {})
template['spec'].setdefault('containers', [])
template['spec'].setdefault('restartPolicy', 'Never')
if labels:
labels_dict = dict(pair.split('=', 1) for pair in labels)
template['metadata'].setdefault('labels', {})
template['metadata']['labels'].update(labels_dict)
def _create_template_container(
self, pod_name: str, task_id: str, docker_image: str, docker_args: List[str],
docker_bash: str, clearml_conf_create_script: List[str]
) -> dict:
container = self._get_docker_args(
docker_args,
target="env",
@@ -704,7 +679,7 @@ class K8sIntegration(Worker):
agent_install_args=self.POD_AGENT_INSTALL_ARGS)
for line in container_bash_script])
extra_bash_commands = list(create_clearml_conf or [])
extra_bash_commands = list(clearml_conf_create_script or [])
start_agent_script_path = ENV_START_AGENT_SCRIPT_PATH.get() or "~/__start_agent__.sh"
@@ -718,20 +693,77 @@ class K8sIntegration(Worker):
)
# Notice: we always leave with exit code 0, so pods are never restarted
container = self._merge_containers(
return self._merge_containers(
container,
dict(name=name, image=docker_image,
dict(name=pod_name, image=docker_image,
command=['/bin/bash'],
args=['-c', '{} ; exit 0'.format(' ; '.join(extra_bash_commands))])
)
if template['spec']['containers']:
template['spec']['containers'][0] = self._merge_containers(template['spec']['containers'][0], container)
def _kubectl_apply(
self,
clearml_conf_create_script: List[str],
docker_image,
docker_args,
docker_bash,
labels,
queue,
task_id,
namespace,
template=None,
pod_number=None
):
if "apiVersion" not in template:
template["apiVersion"] = "batch/v1" if self.using_jobs else "v1"
if "kind" in template:
if template["kind"].lower() != self.kind:
return (
"", f"Template kind {template['kind']} does not maych kind {self.kind.capitalize()} set for agent"
)
else:
template['spec']['containers'].append(container)
template["kind"] = self.kind.capitalize()
metadata = template.setdefault('metadata', {})
name = self.pod_name_prefix + str(task_id)
metadata['name'] = name
def place_labels(metadata_dict):
labels_dict = dict(pair.split('=', 1) for pair in labels)
metadata_dict.setdefault('labels', {}).update(labels_dict)
if labels:
# Place labels on base resource (job or single pod)
place_labels(metadata)
spec = template.setdefault('spec', {})
if self.using_jobs:
spec.setdefault('backoffLimit', 0)
spec_template = spec.setdefault('template', {})
if labels:
# Place same labels fro any pod spawned by the job
place_labels(spec_template.setdefault('metadata', {}))
spec = spec_template.setdefault('spec', {})
containers = spec.setdefault('containers', [])
spec.setdefault('restartPolicy', 'Never')
container = self._create_template_container(
pod_name=name,
task_id=task_id,
docker_image=docker_image,
docker_args=docker_args,
docker_bash=docker_bash,
clearml_conf_create_script=clearml_conf_create_script
)
if containers:
containers[0] = self._merge_containers(containers[0], container)
else:
containers.append(container)
if self._docker_force_pull:
for c in template['spec']['containers']:
for c in containers:
c.setdefault('imagePullPolicy', 'Always')
fp, yaml_file = tempfile.mkstemp(prefix='clearml_k8stmpl_', suffix='.yml')
@@ -763,6 +795,83 @@ class K8sIntegration(Worker):
return stringify_bash_output(output), stringify_bash_output(error)
def _process_bash_lines_response(self, bash_cmd: str, raise_error=True):
res = get_bash_output(bash_cmd, raise_error=raise_error)
lines = [
line for line in
(r.strip().rpartition("/")[-1] for r in res.splitlines())
if line.startswith(self.pod_name_prefix)
]
return lines
def _delete_pods(self, selectors: List[str], namespace: str, msg: str = None) -> List[str]:
kubectl_cmd = \
"kubectl delete pod -l={agent_label} " \
"--namespace={namespace} --field-selector={selector} --output name".format(
selector=",".join(selectors),
agent_label=self._get_agent_label(),
namespace=namespace,
)
self.log.debug("Deleting old/failed pods{} for ns {}: {}".format(
msg or "", namespace, kubectl_cmd
))
lines = self._process_bash_lines_response(kubectl_cmd)
self.log.debug(" - deleted pods %s", ", ".join(lines))
return lines
def _delete_jobs_by_names(self, names_to_ns: Dict[str, str], msg: str = None) -> List[str]:
if not names_to_ns:
return []
ns_to_names = defaultdict(list)
for name, ns in names_to_ns.items():
ns_to_names[ns].append(name)
results = []
for ns, names in ns_to_names.items():
kubectl_cmd = "kubectl delete job --namespace={ns} --output=name {names}".format(
ns=ns, names=" ".join(names)
)
self.log.debug("Deleting jobs {}: {}".format(
msg or "", kubectl_cmd
))
lines = self._process_bash_lines_response(kubectl_cmd)
if not lines:
continue
self.log.debug(" - deleted jobs %s", ", ".join(lines))
results.extend(lines)
return results
def _delete_completed_or_failed_pods(self, namespace, msg: str = None):
if not self.using_jobs:
return self._delete_pods(
selectors=["status.phase!=Pending", "status.phase!=Running"], namespace=namespace, msg=msg
)
job_names_to_delete = {}
# locate failed pods for jobs
failed_pods = self.get_pods_for_jobs(
job_condition="status.active=1",
pod_filters=["status.phase!=Pending", "status.phase!=Running", "status.phase!=Terminating"],
debug_msg="Deleting failed pods: {cmd}"
)
if failed_pods:
job_names_to_delete = {
get_path(pod, "metadata", "labels", "job-name"): get_path(pod, "metadata", "namespace")
for pod in failed_pods
if get_path(pod, "metadata", "labels", "job-name")
}
self.log.debug(f" - found jobs with failed pods: {' '.join(job_names_to_delete)}")
completed_job_names = self.get_jobs_info(
"metadata.name", condition="status.succeeded=1", namespace=namespace, debug_msg=msg
)
if completed_job_names:
self.log.debug(f" - found completed jobs: {' '.join(completed_job_names)}")
job_names_to_delete.update(completed_job_names)
return self._delete_jobs_by_names(names_to_ns=job_names_to_delete, msg=msg)
def _cleanup_old_pods(self, namespaces, extra_msg=None):
# type: (Iterable[str], Optional[str]) -> Dict[str, List[str]]
self.log.debug("Cleaning up pods")
@@ -771,23 +880,71 @@ class K8sIntegration(Worker):
if time() - self._last_pod_cleanup_per_ns[namespace] < self._min_cleanup_interval_per_ns_sec:
# Do not try to cleanup the same namespace too quickly
continue
kubectl_cmd = self.KUBECTL_DELETE_CMD.format(namespace=namespace, agent_label=self._get_agent_label())
self.log.debug("Deleting old/failed pods{} for ns {}: {}".format(
extra_msg or "", namespace, kubectl_cmd
))
try:
res = get_bash_output(kubectl_cmd, raise_error=True)
lines = [
line for line in
(r.strip().rpartition("/")[-1] for r in res.splitlines())
if line.startswith(self.pod_name_prefix)
]
self.log.debug(" - deleted pod(s) %s", ", ".join(lines))
deleted_pods[namespace].extend(lines)
res = self._delete_completed_or_failed_pods(namespace, extra_msg)
deleted_pods[namespace].extend(res)
except Exception as ex:
self.log.error("Failed deleting old/failed pods for ns %s: %s", namespace, str(ex))
self.log.error("Failed deleting completed/failed pods for ns %s: %s", namespace, str(ex))
finally:
self._last_pod_cleanup_per_ns[namespace] = time()
# Locate tasks belonging to deleted pods that are still marked as pending or running
tasks_to_abort = []
try:
task_ids = list(filter(None, (
pod_name[len(self.pod_name_prefix):].strip()
for pod_names in deleted_pods.values()
for pod_name in pod_names
)))
if task_ids:
result = self._session.get(
service='tasks',
action='get_all',
json={"id": task_ids, "status": ["in_progress", "queued"], "only_fields": ["id", "status"]},
method=Request.def_method,
)
tasks_to_abort = result["tasks"]
except Exception as ex:
self.log.warning('Failed getting running tasks for deleted {}(s): {}'.format(self.kind, ex))
for task in tasks_to_abort:
task_id = task.get("id")
status = task.get("status")
if not task_id or not status:
self.log.warning('Failed getting task information: id={}, status={}'.format(task_id, status))
continue
try:
if status == "queued":
self._session.get(
service='tasks',
action='dequeue',
json={
"task": task_id,
"force": True,
"status_reason": "Pod deleted (not pending or running)",
"status_message": "{} deleted by agent {}".format(
self.kind.capitalize(), self.worker_id or "unknown"
)
},
method=Request.def_method,
)
self._session.get(
service='tasks',
action='failed',
json={
"task": task_id,
"force": True,
"status_reason": "Pod deleted (not pending or running)",
"status_message": "{} deleted by agent {}".format(
self.kind.capitalize(), self.worker_id or "unknown"
)
},
method=Request.def_method,
)
except Exception as ex:
self.log.warning('Failed setting task {} to status "failed": {}'.format(task_id, ex))
return deleted_pods
def run_tasks_loop(self, queues: List[Text], worker_params, **kwargs):
@@ -823,10 +980,10 @@ class K8sIntegration(Worker):
# check if have pod limit, then check if we hit it.
if self.max_pods_limit:
if current_pods >= self.max_pods_limit:
print("Maximum pod limit reached {}/{}, sleeping for {:.1f} seconds".format(
current_pods, self.max_pods_limit, self._polling_interval))
print("Maximum {} limit reached {}/{}, sleeping for {:.1f} seconds".format(
self.kind, current_pods, self.max_pods_limit, self._polling_interval))
# delete old completed / failed pods
self._cleanup_old_pods(namespaces, " due to pod limit")
self._cleanup_old_pods(namespaces, f" due to {self.kind} limit")
# go to sleep
sleep(self._polling_interval)
continue
@@ -834,7 +991,7 @@ class K8sIntegration(Worker):
# iterate over queues (priority style, queues[0] is highest)
for queue in queues:
# delete old completed / failed pods
self._cleanup_old_pods(namespaces)
self._cleanup_old_pods(namespaces, extra_msg="Cleanup cycle {cmd}")
# get next task in queue
try:
@@ -941,5 +1098,6 @@ class K8sIntegration(Worker):
value = re.sub(r'^[^A-Za-z0-9]+', '', value) # strip leading non-alphanumeric chars
value = re.sub(r'[^A-Za-z0-9]+$', '', value) # strip trailing non-alphanumeric chars
value = re.sub(r'\W+', '-', value) # allow only word chars (this removed "." which is supported, but nvm)
value = re.sub(r'_+', '-', value) # "_" is not allowed as well
value = re.sub(r'-+', '-', value) # don't leave messy "--" after replacing previous chars
return value[:63]

View File

@@ -0,0 +1,223 @@
from time import sleep
from typing import Dict, Tuple, Optional, List
from clearml_agent.backend_api.session import Request
from clearml_agent.glue.utilities import get_bash_output
from clearml_agent.helper.process import stringify_bash_output
from .daemon import K8sDaemon
from .utilities import get_path
from .errors import GetPodsError
class PendingPodsDaemon(K8sDaemon):
def __init__(self, polling_interval: float, agent):
super(PendingPodsDaemon, self).__init__(agent=agent)
self._polling_interval = polling_interval
self._last_tasks_msgs = {} # last msg updated for every task
def get_pods(self):
if self._agent.using_jobs:
return self._agent.get_pods_for_jobs(
job_condition="status.active=1",
pod_filters=["status.phase=Pending"],
debug_msg="Detecting pending pods: {cmd}"
)
return self._agent.get_pods(
filters=["status.phase=Pending"],
debug_msg="Detecting pending pods: {cmd}"
)
def _get_pod_name(self, pod: dict):
return get_path(pod, "metadata", "name")
def _get_k8s_resource_name(self, pod: dict):
if self._agent.using_jobs:
return get_path(pod, "metadata", "labels", "job-name")
return get_path(pod, "metadata", "name")
def _get_task_id(self, pod: dict):
return self._get_k8s_resource_name(pod).rpartition('-')[-1]
@staticmethod
def _get_k8s_resource_namespace(pod: dict):
return pod.get('metadata', {}).get('namespace', None)
def target(self):
"""
Handle pending objects (pods or jobs, depending on the agent mode).
- Delete any pending objects that are not expected to recover
- Delete any pending objects for whom the associated task was aborted
"""
while True:
# noinspection PyBroadException
try:
# Get pods (standalone pods if we're in pods mode, or pods associated to jobs if we're in jobs mode)
pods = self.get_pods()
if pods is None:
raise GetPodsError()
task_id_to_pod = dict()
for pod in pods:
pod_name = self._get_pod_name(pod)
if not pod_name:
continue
task_id = self._get_task_id(pod)
if not task_id:
continue
namespace = self._get_k8s_resource_namespace(pod)
if not namespace:
continue
task_id_to_pod[task_id] = pod
msg = None
tags = []
waiting = get_path(pod, 'status', 'containerStatuses', 0, 'state', 'waiting')
if not waiting:
condition = get_path(pod, 'status', 'conditions', 0)
if condition:
reason = condition.get('reason')
if reason == 'Unschedulable':
message = condition.get('message')
msg = reason + (" ({})".format(message) if message else "")
else:
reason = waiting.get("reason", None)
message = waiting.get("message", None)
msg = reason + (" ({})".format(message) if message else "")
if reason == 'ImagePullBackOff':
self.delete_k8s_resource(k8s_resource=pod, msg=reason)
try:
self._session.api_client.tasks.failed(
task=task_id,
status_reason="K8S glue error: {}".format(msg),
status_message="Changed by K8S glue",
force=True
)
self._agent.send_logs(
task_id, ["K8S Error: {}".format(msg)],
session=self._session
)
except Exception as ex:
self.log.warning(
'K8S Glue pending monitor: Failed deleting task "{}"\nEX: {}'.format(task_id, ex)
)
# clean up any msg for this task
self._last_tasks_msgs.pop(task_id, None)
continue
self._update_pending_task_msg(task_id, msg, tags)
if task_id_to_pod:
self._process_tasks_for_pending_pods(task_id_to_pod)
# clean up any last message for a task that wasn't seen as a pod
self._last_tasks_msgs = {k: v for k, v in self._last_tasks_msgs.items() if k in task_id_to_pod}
except GetPodsError:
pass
except Exception:
self.log.exception("Hanging pods daemon loop")
sleep(self._polling_interval)
def delete_k8s_resource(self, k8s_resource: dict, msg: str = None):
delete_cmd = "kubectl delete {kind} {name} -n {namespace} --output name".format(
kind=self._agent.kind,
name=self._get_k8s_resource_name(k8s_resource),
namespace=self._get_k8s_resource_namespace(k8s_resource)
).strip()
self.log.debug(" - deleting {} {}: {}".format(self._agent.kind, (" " + msg) if msg else "", delete_cmd))
return get_bash_output(delete_cmd).strip()
def _process_tasks_for_pending_pods(self, task_id_to_details: Dict[str, dict]):
self._handle_aborted_tasks(task_id_to_details)
def _handle_aborted_tasks(self, pending_tasks_details: Dict[str, dict]):
try:
result = self._session.get(
service='tasks',
action='get_all',
json={
"id": list(pending_tasks_details),
"status": ["stopped"],
"only_fields": ["id"]
},
method=Request.def_method,
async_enable=False,
)
aborted_task_ids = list(filter(None, (task.get("id") for task in result["tasks"])))
for task_id in aborted_task_ids:
pod = pending_tasks_details.get(task_id)
if not pod:
self.log.error("Failed locating aborted task {} in pending pods list".format(task_id))
continue
resource_name = self._get_k8s_resource_name(pod)
self.log.info(
"K8S Glue pending monitor: task {} was aborted but the k8s resource {} is still pending, "
"deleting pod".format(task_id, resource_name)
)
output = self.delete_k8s_resource(k8s_resource=pod, msg="Pending resource of an aborted task")
if not output:
self.log.warning("K8S Glue pending monitor: failed deleting resource {}".format(resource_name))
except Exception as ex:
self.log.warning(
'K8S Glue pending monitor: failed checking aborted tasks for pending resources: {}'.format(ex)
)
def _update_pending_task_msg(self, task_id: str, msg: str, tags: List[str] = None):
if not msg or self._last_tasks_msgs.get(task_id, None) == (msg, tags):
return
try:
# Make sure the task is queued
result = self._session.send_request(
service='tasks',
action='get_all',
json={"id": task_id, "only_fields": ["status"]},
method=Request.def_method,
async_enable=False,
)
if result.ok:
status = get_path(result.json(), 'data', 'tasks', 0, 'status')
# if task is in progress, change its status to enqueued
if status == "in_progress":
result = self._session.send_request(
service='tasks', action='enqueue',
json={
"task": task_id, "force": True, "queue": self._agent.k8s_pending_queue_id
},
method=Request.def_method,
async_enable=False,
)
if not result.ok:
result_msg = get_path(result.json(), 'meta', 'result_msg')
self.log.debug(
"K8S Glue pods monitor: failed forcing task status change"
" for pending task {}: {}".format(task_id, result_msg)
)
# Update task status message
payload = {"task": task_id, "status_message": "K8S glue status: {}".format(msg)}
if tags:
payload["tags"] = tags
result = self._session.send_request('tasks', 'update', json=payload, method=Request.def_method)
if not result.ok:
result_msg = get_path(result.json(), 'meta', 'result_msg')
raise Exception(result_msg or result.text)
# update last msg for this task
self._last_tasks_msgs[task_id] = msg
except Exception as ex:
self.log.warning(
'K8S Glue pods monitor: Failed setting status message for task "{}"\nMSG: {}\nEX: {}'.format(
task_id, msg, ex
)
)

View File

@@ -0,0 +1,18 @@
import functools
from subprocess import DEVNULL
from clearml_agent.helper.process import get_bash_output as _get_bash_output
def get_path(d, *path, default=None):
try:
return functools.reduce(
lambda a, b: a[b], path, d
)
except (IndexError, KeyError):
return default
def get_bash_output(cmd, stderr=DEVNULL, raise_error=False):
return _get_bash_output(cmd, stderr=stderr, raise_error=raise_error)

View File

@@ -20,20 +20,22 @@ from typing import Text, Dict, Any, Optional, AnyStr, IO, Union
import attr
import furl
import six
import yaml
from attr import fields_dict
from pathlib2 import Path
import six
from six.moves import reduce
from clearml_agent.external import pyhocon
from clearml_agent.errors import CommandFailedError
from clearml_agent.external import pyhocon
from clearml_agent.helper.dicts import filter_keys
pretty_lines = False
log = logging.getLogger(__name__)
use_powershell = os.getenv("CLEARML_AGENT_USE_POWERSHELL", None)
def which(cmd, path=None):
result = find_executable(cmd, path)
@@ -52,7 +54,7 @@ def select_for_platform(linux, windows):
def bash_c():
return 'bash -c' if not is_windows_platform() else 'cmd /c'
return 'bash -c' if not is_windows_platform() else ('powershell -Command' if use_powershell else 'cmd /c')
def return_list(arg):

View File

@@ -50,7 +50,7 @@ class PackageManager(object):
pass
@abc.abstractmethod
def freeze(self):
def freeze(self, freeze_full_environment=False):
pass
@abc.abstractmethod
@@ -141,8 +141,9 @@ class PackageManager(object):
@classmethod
def out_of_scope_install_package(cls, package_name, *args):
if PackageManager._selected_manager is not None:
# noinspection PyBroadException
try:
result = PackageManager._selected_manager._install(package_name, *args)
result = PackageManager._selected_manager.install_packages(package_name, *args)
if result not in (0, None, True):
return False
except Exception:
@@ -150,10 +151,11 @@ class PackageManager(object):
return True
@classmethod
def out_of_scope_freeze(cls):
def out_of_scope_freeze(cls, freeze_full_environment=False):
if PackageManager._selected_manager is not None:
# noinspection PyBroadException
try:
return PackageManager._selected_manager.freeze()
return PackageManager._selected_manager.freeze(freeze_full_environment)
except Exception:
pass
return []
@@ -177,7 +179,7 @@ class PackageManager(object):
cls._pip_version.append("==" + version)
@classmethod
def get_pip_versions(cls, pip="pip", wrap='"'):
def get_pip_versions(cls, pip="pip", wrap=''):
return [
(wrap + pip + version + wrap)
for version in cls._pip_version or [pip]
@@ -192,8 +194,13 @@ class PackageManager(object):
if not self._get_cache_manager():
return None
keys = self._generate_reqs_hash_keys(requirements, docker_cmd, python_version, cuda_version)
return self._get_cache_manager().copy_cached_entry(keys, destination_folder)
try:
keys = self._generate_reqs_hash_keys(requirements, docker_cmd, python_version, cuda_version)
return self._get_cache_manager().copy_cached_entry(keys, destination_folder)
except Exception as ex:
print("WARNING: Failed accessing venvs cache at {}: {}".format(destination_folder, ex))
print("WARNING: Skipping venv cache - folder not accessible!")
return None
def add_cached_venv(
self,
@@ -210,9 +217,15 @@ class PackageManager(object):
"""
if not self._get_cache_manager():
return
keys = self._generate_reqs_hash_keys(requirements, docker_cmd, python_version, cuda_version)
return self._get_cache_manager().add_entry(
keys=keys, source_folder=source_folder, exclude_sub_folders=exclude_sub_folders)
try:
keys = self._generate_reqs_hash_keys(requirements, docker_cmd, python_version, cuda_version)
return self._get_cache_manager().add_entry(
keys=keys, source_folder=source_folder, exclude_sub_folders=exclude_sub_folders)
except Exception as ex:
print("WARNING: Failed accessing venvs cache at {}: {}".format(source_folder, ex))
print("WARNING: Skipping venv cache - folder not accessible!")
return None
def get_cache_folder(self):
# type: () -> Optional[Path]
@@ -280,12 +293,19 @@ class PackageManager(object):
def _get_cache_manager(self):
if not self._cache_manager:
cache_folder = ENV_VENV_CACHE_PATH.get() or self.session.config.get(self._config_cache_folder, None)
if not cache_folder:
cache_folder = None
try:
cache_folder = ENV_VENV_CACHE_PATH.get() or self.session.config.get(self._config_cache_folder, None)
if not cache_folder:
return None
max_entries = int(self.session.config.get(self._config_cache_max_entries, 10))
free_space_threshold = float(self.session.config.get(self._config_cache_free_space_threshold, 0))
self._cache_manager = FolderCache(
cache_folder, max_cache_entries=max_entries, min_free_space_gb=free_space_threshold)
except Exception as ex:
print("WARNING: Failed accessing venvs cache at {}: {}".format(cache_folder, ex))
print("WARNING: Skipping venv cache - folder not accessible!")
return None
max_entries = int(self.session.config.get(self._config_cache_max_entries, 10))
free_space_threshold = float(self.session.config.get(self._config_cache_free_space_threshold, 0))
self._cache_manager = FolderCache(
cache_folder, max_cache_entries=max_entries, min_free_space_gb=free_space_threshold)
return self._cache_manager

View File

@@ -50,6 +50,14 @@ class ExternalRequirements(SimpleSubstitution):
print("No need to reinstall \'{}\' from VCS, "
"the exact same version is already installed".format(req.name))
continue
if not req.pip_new_version:
# noinspection PyBroadException
try:
freeze_base = PackageManager.out_of_scope_freeze() or dict(pip=[])
except Exception:
freeze_base = dict(pip=[])
req_line = self._add_vcs_credentials(req, session)
# if we have older pip version we have to make sure we replace back the package name with the
@@ -58,14 +66,14 @@ class ExternalRequirements(SimpleSubstitution):
PackageManager.out_of_scope_install_package(req_line, "--no-deps")
# noinspection PyBroadException
try:
freeze_post = PackageManager.out_of_scope_freeze() or ''
freeze_post = PackageManager.out_of_scope_freeze() or dict(pip=[])
package_name = list(set(freeze_post['pip']) - set(freeze_base['pip']))
if package_name and package_name[0] not in self.post_install_req_lookup:
self.post_install_req_lookup[package_name[0]] = req.req.line
except Exception:
pass
# no need to force reinstall, pip will always rebuilt if the package comes from git
# no need to force reinstall, pip will always rebuild if the package comes from git
# and make sure the required packages are installed (if they are not it will install them)
if not PackageManager.out_of_scope_install_package(req_line):
raise ValueError("Failed installing GIT/HTTPs package \'{}\'".format(req_line))
@@ -84,20 +92,14 @@ class ExternalRequirements(SimpleSubstitution):
vcs_url = req_line[4:]
# reverse replace
vcs_url = vcs_url[::-1].replace(fragment[::-1], '', 1)[::-1]
# remove ssh:// or git:// prefix for git detection and credentials
scheme = ''
if vcs_url and (vcs_url.startswith('ssh://') or vcs_url.startswith('git://')):
scheme = 'ssh://' # notice git:// is actually ssh://
vcs_url = vcs_url[6:]
# notice git:// is actually ssh://
if vcs_url and vcs_url.startswith('git://'):
vcs_url = vcs_url.replace('git://', 'ssh://', 1)
from ..repo import Git
vcs = Git(session=session, url=vcs_url, location=None, revision=None)
vcs._set_ssh_url()
new_req_line = 'git+{}{}{}'.format(
'' if scheme and '://' in vcs.url else scheme,
vcs_url if session.config.get('agent.force_git_ssh_protocol', None) else vcs.url_with_auth,
fragment
)
new_req_line = 'git+{}{}'.format(vcs.url_with_auth, fragment)
if new_req_line != req_line:
furl_line = furl(new_req_line)
print('Replacing original pip vcs \'{}\' with \'{}\''.format(

View File

@@ -4,7 +4,7 @@ from itertools import chain
from pathlib import Path
from typing import Text, Optional
from clearml_agent.definitions import PIP_EXTRA_INDICES, PROGRAM_NAME
from clearml_agent.definitions import PIP_EXTRA_INDICES, PROGRAM_NAME, ENV_PIP_EXTRA_INSTALL_FLAGS
from clearml_agent.helper.package.base import PackageManager
from clearml_agent.helper.process import Argv, DEVNULL
from clearml_agent.session import Session
@@ -12,8 +12,6 @@ from clearml_agent.session import Session
class SystemPip(PackageManager):
indices_args = None
def __init__(self, interpreter=None, session=None):
# type: (Optional[Text], Optional[Session]) -> ()
"""
@@ -52,7 +50,7 @@ class SystemPip(PackageManager):
package,
'--dest', cache_dir,
'--no-deps',
) + self.install_flags()
) + self.download_flags()
)
def load_requirements(self, requirements):
@@ -65,13 +63,14 @@ class SystemPip(PackageManager):
def uninstall(self, package):
self.run_with_env(('uninstall', '-y', package))
def freeze(self):
def freeze(self, freeze_full_environment=False):
"""
pip freeze to all install packages except the running program
:return: Dict contains pip as key and pip's packages to install
:rtype: Dict[str: List[str]]
"""
packages = self.run_with_env(('freeze',), output=True).splitlines()
packages = self.run_with_env(
('freeze',) if not freeze_full_environment else ('freeze', '--all'), output=True).splitlines()
packages_without_program = [package for package in packages if PROGRAM_NAME not in package]
return {'pip': packages_without_program}
@@ -87,14 +86,30 @@ class SystemPip(PackageManager):
# make sure we are not running it with our own PYTHONPATH
env = dict(**os.environ)
env.pop('PYTHONPATH', None)
# Debug print
if self.session.debug_mode:
print(command)
return (command.get_output if output else command.check_call)(stdin=DEVNULL, env=env, **kwargs)
def _make_command(self, command):
return Argv(self.bin, '-m', 'pip', '--disable-pip-version-check', *command)
def install_flags(self):
if self.indices_args is None:
self.indices_args = tuple(
chain.from_iterable(('--extra-index-url', x) for x in PIP_EXTRA_INDICES)
)
return self.indices_args
indices_args = tuple(
chain.from_iterable(('--extra-index-url', x) for x in PIP_EXTRA_INDICES)
)
extra_pip_flags = \
ENV_PIP_EXTRA_INSTALL_FLAGS.get() or \
self.session.config.get("agent.package_manager.extra_pip_install_flags", None)
return (indices_args + tuple(extra_pip_flags)) if extra_pip_flags else indices_args
def download_flags(self):
indices_args = tuple(
chain.from_iterable(('--extra-index-url', x) for x in PIP_EXTRA_INDICES)
)
return indices_args

View File

@@ -69,6 +69,11 @@ class PoetryConfig:
path = path.replace(':'+sys.base_prefix, ':'+sys.real_prefix, 1)
kwargs['env']['PATH'] = path
if self.session and self.session.config:
extra_args = self.session.config.get("agent.package_manager.poetry_install_extra_args", None)
if extra_args:
args = args + tuple(extra_args)
if check_if_command_exists("poetry"):
argv = Argv("poetry", *args)
else:
@@ -142,7 +147,7 @@ class PoetryAPI(object):
any((self.path / indicator).exists() for indicator in self.INDICATOR_FILES)
)
def freeze(self):
def freeze(self, freeze_full_environment=False):
lines = self.config.run("show", cwd=str(self.path)).splitlines()
lines = [[p for p in line.split(' ') if p] for line in lines]
return {"pip": [parts[0]+'=='+parts[1]+' # '+' '.join(parts[2:]) for parts in lines]}

View File

@@ -7,7 +7,7 @@ from .requirements import SimpleSubstitution
class PriorityPackageRequirement(SimpleSubstitution):
name = ("cython", "numpy", "setuptools", )
name = ("cython", "numpy", "setuptools", "pip", )
optional_package_names = tuple()
def __init__(self, *args, **kwargs):
@@ -50,31 +50,39 @@ class PriorityPackageRequirement(SimpleSubstitution):
"""
# if we replaced setuptools, it means someone requested it, and since freeze will not contain it,
# we need to add it manually
if not self._replaced_packages or "setuptools" not in self._replaced_packages:
if not self._replaced_packages:
return list_of_requirements
try:
for k, lines in list_of_requirements.items():
# k is either pip/conda
if k not in ('pip', 'conda'):
continue
for i, line in enumerate(lines):
if not line or line.lstrip().startswith('#'):
continue
parts = [p for p in re.split(r'\s|=|\.|<|>|~|!|@|#', line) if p]
if not parts:
continue
# if we found setuptools, do nothing
if parts[0] == "setuptools":
return list_of_requirements
if "pip" in self._replaced_packages:
full_freeze = PackageManager.out_of_scope_freeze(freeze_full_environment=True)
# now let's look for pip
pips = [line for line in full_freeze.get("pip", []) if line.split("==")[0] == "pip"]
if pips and "pip" in list_of_requirements:
list_of_requirements["pip"] = [pips[0]] + list_of_requirements["pip"]
# if we are here it means we have not found setuptools
# we should add it:
if "pip" in list_of_requirements:
list_of_requirements["pip"] = [self._replaced_packages["setuptools"]] + list_of_requirements["pip"]
if "setuptools" in self._replaced_packages:
try:
for k, lines in list_of_requirements.items():
# k is either pip/conda
if k not in ('pip', 'conda'):
continue
for i, line in enumerate(lines):
if not line or line.lstrip().startswith('#'):
continue
parts = [p for p in re.split(r'\s|=|\.|<|>|~|!|@|#', line) if p]
if not parts:
continue
# if we found setuptools, do nothing
if parts[0] == "setuptools":
return list_of_requirements
except Exception as ex: # noqa
return list_of_requirements
# if we are here it means we have not found setuptools
# we should add it:
if "pip" in list_of_requirements:
list_of_requirements["pip"] = [self._replaced_packages["setuptools"]] + list_of_requirements["pip"]
except Exception as ex: # noqa
return list_of_requirements
return list_of_requirements

View File

@@ -16,6 +16,7 @@ import six
from .requirements import (
SimpleSubstitution, FatalSpecsResolutionError, SimpleVersion, MarkerRequirement,
compare_version_rules, )
from ...definitions import ENV_PACKAGE_PYTORCH_RESOLVE
from ...external.requirements_parser.requirement import Requirement
OS_TO_WHEEL_NAME = {"linux": "linux_x86_64", "windows": "win_amd64"}
@@ -174,6 +175,7 @@ class PytorchRequirement(SimpleSubstitution):
extra_index_url_template = 'https://download.pytorch.org/whl/cu{}/'
nightly_extra_index_url_template = 'https://download.pytorch.org/whl/nightly/cu{}/'
torch_index_url_lookup = {}
resolver_types = ("pip", "direct", "none")
def __init__(self, *args, **kwargs):
os_name = kwargs.pop("os_override", None)
@@ -208,6 +210,13 @@ class PytorchRequirement(SimpleSubstitution):
if self.config.get("agent.package_manager.torch_url_template", None):
PytorchWheel.url_template = \
self.config.get("agent.package_manager.torch_url_template", None)
self.resolve_algorithm = str(
ENV_PACKAGE_PYTORCH_RESOLVE.get() or
self.config.get("agent.package_manager.pytorch_resolve", "pip")).lower()
if self.resolve_algorithm not in self.resolver_types:
print("WARNING: agent.package_manager.pytorch_resolve=={} not in {} reverting to '{}'".format(
self.resolve_algorithm, self.resolver_types, self.resolver_types[0]))
self.resolve_algorithm = self.resolver_types[0]
def _init_python_ver_cuda_ver(self):
if self.cuda is None:
@@ -261,6 +270,10 @@ class PytorchRequirement(SimpleSubstitution):
)
def match(self, req):
if self.resolve_algorithm == "none":
# skipping resolver
return False
return req.name in self.packages
@staticmethod
@@ -310,6 +323,12 @@ class PytorchRequirement(SimpleSubstitution):
# yes this is for linux python 2.7 support, this is the only python 2.7 we support...
if py_ver and py_ver[0] == '2' and len(parts) > 3 and not parts[3].endswith('u'):
continue
# check if this an actual match
if not req.compare_version(v) or \
(last_v and SimpleVersion.compare_versions(last_v, '>', v, ignore_sub_versions=False)):
continue
# update the closest matched version (from above)
if not closest_v:
closest_v = v
@@ -318,10 +337,6 @@ class PytorchRequirement(SimpleSubstitution):
SimpleVersion.compare_versions(
version_a=v, op='>=', version_b=req.specs[0][1], num_parts=3):
closest_v = v
# check if this an actual match
if not req.compare_version(v) or \
(last_v and SimpleVersion.compare_versions(last_v, '>', v, ignore_sub_versions=False)):
continue
url = '/'.join(torch_url.split('/')[:-1] + l.split('/'))
last_v = v
@@ -345,8 +360,10 @@ class PytorchRequirement(SimpleSubstitution):
from pip._internal.commands.show import search_packages_info
installed_torch = list(search_packages_info([req.name]))
# notice the comparison order, the first part will make sure we have a valid installed package
installed_torch_version = (getattr(installed_torch[0], 'version', None) or installed_torch[0]['version']) \
if installed_torch else None
installed_torch_version = \
(getattr(installed_torch[0], 'version', None) or
installed_torch[0]['version']) if installed_torch else None
if installed_torch and installed_torch_version and \
req.compare_version(installed_torch_version):
print('PyTorch: requested "{}" version {}, using pre-installed version {}'.format(
@@ -354,6 +371,7 @@ class PytorchRequirement(SimpleSubstitution):
# package already installed, do nothing
req.specs = [('==', str(installed_torch_version))]
return '{} {} {}'.format(req.name, req.specs[0][0], req.specs[0][1]), True
except Exception:
pass
@@ -475,6 +493,26 @@ class PytorchRequirement(SimpleSubstitution):
return self.match_version(req, base).replace(" ", "\n")
def replace(self, req):
# we first try to resolve things ourselves because pytorch pip is not always picking the correct
# versions from their pip repository
resolve_algorithm = self.resolve_algorithm
if resolve_algorithm == "none":
# skipping resolver
return None
elif resolve_algorithm == "direct":
# noinspection PyBroadException
try:
new_req = self._replace(req)
if new_req:
self._original_req.append((req, new_req))
return new_req
except Exception:
print("Warning: Failed resolving using `pytorch_resolve=direct` reverting to `pytorch_resolve=pip`")
elif resolve_algorithm not in self.resolver_types:
print("Warning: `agent.package_manager.pytorch_resolve={}` "
"unrecognized, default to `pip`".format(resolve_algorithm))
# check if package is already installed with system packages
self.validate_python_version()
@@ -493,13 +531,23 @@ class PytorchRequirement(SimpleSubstitution):
if req.specs and len(req.specs) == 1 and req.specs[0][0] == "==":
# remove any +cu extension and let pip resolve that
line = "{} {}".format(req.name, req.format_specs(max_num_parts=3))
# and add .* if we have 3 parts version to deal with nvidia container 'a' version
# i.e. "1.13.0" -> "1.13.0.*" so it should match preinstalled "1.13.0a0+936e930"
spec_3_parts = req.format_specs(num_parts=3)
spec_max3_parts = req.format_specs(max_num_parts=3)
if spec_3_parts == spec_max3_parts and not spec_max3_parts.endswith("*"):
line = "{} {}.*".format(req.name, spec_max3_parts)
else:
line = "{} {}".format(req.name, spec_max3_parts)
if req.marker:
line += " ; {}".format(req.marker)
else:
# return the original line
line = req.line
print("PyTorch: Adding index `{}` and installing `{}`".format(extra_index_url[0], line))
return line
except Exception: # noqa
@@ -558,6 +606,19 @@ class PytorchRequirement(SimpleSubstitution):
:param list_of_requirements: {'pip': ['a==1.0', ]}
:return: {'pip': ['a==1.0', ]}
"""
def build_specific_version_req(a_line, a_name, a_new_req):
try:
r = Requirement.parse(a_line)
wheel_parts = r.uri.split("/")[-1].split('-')
version = str(wheel_parts[1].split('%')[0].split('+')[0])
new_r = Requirement.parse("{} == {} # {}".format(a_name, version, str(a_new_req)))
if new_r.line:
# great it worked!
return new_r.line
except: # noqa
pass
return None
if not self._original_req:
return list_of_requirements
try:
@@ -581,9 +642,18 @@ class PytorchRequirement(SimpleSubstitution):
if req.local_file:
lines[i] = '{}'.format(str(new_req))
else:
lines[i] = '{} # {}'.format(str(req), str(new_req))
# try to rebuild requirements with specific version:
new_line = build_specific_version_req(line, req.req.name, new_req)
if new_line:
lines[i] = new_line
else:
lines[i] = '{} # {}'.format(str(req), str(new_req))
else:
lines[i] = '{} # {}'.format(line, str(new_req))
new_line = build_specific_version_req(line, req.req.name, new_req)
if new_line:
lines[i] = new_line
else:
lines[i] = '{} # {}'.format(line, str(new_req))
break
except:
pass
@@ -632,7 +702,7 @@ class PytorchRequirement(SimpleSubstitution):
# noinspection PyBroadException
try:
if requests.get(torch_url, timeout=10).ok:
print('Torch CUDA {} index page found'.format(c))
print('Torch CUDA {} index page found, adding `{}`'.format(c, torch_url))
cls.torch_index_url_lookup[c] = torch_url
return cls.torch_index_url_lookup[c], c
except Exception:

View File

@@ -240,6 +240,23 @@ class SimpleVersion:
if not version_b:
return True
# remove trailing "*" in both
if "*" in version_a:
ignore_sub_versions = True
while version_a.endswith(".*"):
version_a = version_a[:-2]
if version_a == "*":
version_a = ""
num_parts = min(len(version_a.split('.')), len(version_b.split('.')), )
if "*" in version_b:
ignore_sub_versions = True
while version_b.endswith(".*"):
version_b = version_b[:-2]
if version_b == "*":
version_b = ""
num_parts = min(len(version_a.split('.')), len(version_b.split('.')), )
if not num_parts:
num_parts = max(len(version_a.split('.')), len(version_b.split('.')), )

View File

@@ -1,3 +1,4 @@
from tempfile import mkdtemp
from typing import Text
from furl import furl
@@ -20,7 +21,16 @@ class RequirementsTranslator(object):
config = session.config
self.cache_dir = cache_dir or Path(config["agent.pip_download_cache.path"]).expanduser().as_posix()
self.enabled = config["agent.pip_download_cache.enabled"]
Path(self.cache_dir).mkdir(parents=True, exist_ok=True)
# noinspection PyBroadException
try:
Path(self.cache_dir).mkdir(parents=True, exist_ok=True)
except Exception:
temp_cache_folder = mkdtemp(prefix='pip_download_cache.')
print("Failed creating pip download cache folder at `{}` reverting to `{}`".format(
self.cache_dir, temp_cache_folder))
self.cache_dir = temp_cache_folder
Path(self.cache_dir).mkdir(parents=True, exist_ok=True)
self.config = Config()
self.pip = SystemPip(interpreter=interpreter, session=self._session)
self._translate_back = {}

View File

@@ -117,10 +117,11 @@ def terminate_all_child_processes(pid=None, timeout=10., include_parent=True):
def get_docker_id(docker_cmd_contains):
# noinspection PyBroadException
try:
containers_running = get_bash_output(cmd='docker ps --no-trunc --format \"{{.ID}}: {{.Command}}\"')
for docker_line in containers_running.split('\n'):
parts = docker_line.split(':')
parts = docker_line.split(':', 1)
if docker_cmd_contains in parts[-1]:
# we found our docker, return it
return parts[0]

View File

@@ -320,6 +320,7 @@ class VCS(object):
self.url, new_url))
self.url = new_url
return
# rewrite ssh URLs only if either ssh port or ssh user are forced in config
if parsed_url.scheme == "ssh" and (
self.session.config.get('agent.force_git_ssh_port', None) or
@@ -334,6 +335,9 @@ class VCS(object):
print("Using SSH credentials - ssh url '{}' with ssh url '{}'".format(
self.url, new_url))
self.url = new_url
return
elif parsed_url.scheme == "ssh":
return
if not self.session.config.agent.translate_ssh:
return
@@ -343,7 +347,7 @@ class VCS(object):
(ENV_AGENT_GIT_PASS.get() or self.session.config.get('agent.git_pass', None)):
# only apply to a specific domain (if requested)
config_domain = \
ENV_AGENT_GIT_HOST.get() or self.session.config.get("git_host", None)
ENV_AGENT_GIT_HOST.get() or self.session.config.get("agent.git_host", None)
if config_domain and config_domain != furl(self.url).host:
return

View File

@@ -139,42 +139,45 @@ class ResourceMonitor(object):
def _daemon(self):
seconds_since_started = 0
reported = 0
while True:
last_report = time()
current_report_frequency = (
self._report_frequency if reported != 0 else self._first_report_sec
)
while (time() - last_report) < current_report_frequency:
# wait for self._sample_frequency seconds, if event set quit
if self._exit_event.wait(1 / self._sample_frequency):
return
# noinspection PyBroadException
try:
self._update_readouts()
except Exception as ex:
log.warning("failed getting machine stats: %s", report_error(ex))
self._failure()
try:
while True:
last_report = time()
current_report_frequency = (
self._report_frequency if reported != 0 else self._first_report_sec
)
while (time() - last_report) < current_report_frequency:
# wait for self._sample_frequency seconds, if event set quit
if self._exit_event.wait(1 / self._sample_frequency):
return
# noinspection PyBroadException
try:
self._update_readouts()
except Exception as ex:
log.warning("failed getting machine stats: %s", report_error(ex))
self._failure()
seconds_since_started += int(round(time() - last_report))
# check if we do not report any metric (so it means the last iteration will not be changed)
seconds_since_started += int(round(time() - last_report))
# check if we do not report any metric (so it means the last iteration will not be changed)
# if we do not have last_iteration, we just use seconds as iteration
# if we do not have last_iteration, we just use seconds as iteration
# start reporting only when we figured out, if this is seconds based, or iterations based
average_readouts = self._get_average_readouts()
stats = {
# 3 points after the dot
key: round(value, 3) if isinstance(value, float) else [round(v, 3) for v in value]
for key, value in average_readouts.items()
}
# start reporting only when we figured out, if this is seconds based, or iterations based
average_readouts = self._get_average_readouts()
stats = {
# 3 points after the dot
key: round(value, 3) if isinstance(value, float) else [round(v, 3) for v in value]
for key, value in average_readouts.items()
}
# send actual report
if self.send_report(stats):
# clear readouts if this is update was sent
self._clear_readouts()
# send actual report
if self.send_report(stats):
# clear readouts if this is update was sent
self._clear_readouts()
# count reported iterations
reported += 1
# count reported iterations
reported += 1
except Exception as ex:
log.exception("Error reporting monitoring info: %s", str(ex))
def _update_readouts(self):
readouts = self._machine_stats()

View File

@@ -1 +1 @@
__version__ = '1.5.1rc0'
__version__ = '1.6.0'

View File

@@ -58,7 +58,7 @@ agent {
type: pip,
# specify pip version to use (examples "<20.2", "==19.3.1", "", empty string will install the latest version)
pip_version: "<21",
pip_version: ["<20.2 ; python_version < '3.10'", "<22.3 ; python_version >= '3.10'"],
# virtual environment inheres packages from system
system_site_packages: false,
@@ -171,7 +171,7 @@ agent {
default_docker: {
# default docker image to use when running in docker mode
image: "nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04"
image: "nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04"
# optional arguments to pass to docker image
# arguments: ["--ipc=host", ]
@@ -224,7 +224,7 @@ sdk {
storage {
cache {
# Defaults to system temp folder / cache
# Defaults to <system_temp_folder>/clearml_cache
default_base_dir: "~/.clearml/cache"
size {
# max_used_bytes = -1

View File

@@ -33,4 +33,9 @@ echo "api.files_server: ${CLEARML_FILES_HOST}" >> ~/clearml.conf
./provider_entrypoint.sh
python3 k8s_glue_example.py --queue ${QUEUE} --max-pods ${MAX_PODS} ${EXTRA_ARGS}
if [[ -z "${K8S_GLUE_MAX_PODS}" ]]
then
python3 k8s_glue_example.py --queue ${QUEUE} ${EXTRA_ARGS}
else
python3 k8s_glue_example.py --queue ${QUEUE} --max-pods ${MAX_PODS} ${EXTRA_ARGS}
fi

View File

@@ -65,6 +65,19 @@ def parse_args():
help="Limit the maximum number of pods that this service can run at the same time."
"Should not be used with ports-mode"
)
parser.add_argument(
"--use-owner-token", action="store_true", default=False,
help="Generate and use task owner token for the execution of each task"
)
parser.add_argument(
"--standalone-mode", action="store_true", default=False,
help="Do not use any network connects, assume everything is pre-installed"
)
parser.add_argument(
"--child-report-tags", type=str, nargs="+", default=None,
help="List of tags to send with the status reports from a worker that runs a task"
)
return parser.parse_args()
@@ -85,9 +98,14 @@ def main():
user_props_cb=user_props_cb, overrides_yaml=args.overrides_yaml, clearml_conf_file=args.pod_clearml_conf,
template_yaml=args.template_yaml, extra_bash_init_script=K8sIntegration.get_ssh_server_bash(
ssh_port_number=args.ssh_server_port) if args.ssh_server_port else None,
namespace=args.namespace, max_pods_limit=args.max_pods or None,
namespace=args.namespace, max_pods_limit=args.max_pods or None
)
k8s.k8s_daemon(
args.queue,
use_owner_token=args.use_owner_token,
standalone_mode=args.standalone_mode,
child_report_tags=args.child_report_tags
)
k8s.k8s_daemon(args.queue)
if __name__ == "__main__":

View File

@@ -2,13 +2,17 @@
chmod +x /root/entrypoint.sh
apt-get update -y
apt-get dist-upgrade -y
apt-get install -y curl unzip less locales
apt-get update -qqy
apt-get dist-upgrade -qqy
apt-get install -qqy curl unzip less locales
locale-gen en_US.UTF-8
apt-get install -y curl python3-pip git
apt-get update -qqy
apt-get install -qqy curl gcc python3-dev python3-pip apt-transport-https lsb-release openssh-client git gnupg
rm -rf /var/lib/apt/lists/*
apt clean
python3 -m pip install -U pip
python3 -m pip install clearml-agent
python3 -m pip install -U "cryptography>=2.9"
python3 -m pip install --no-cache-dir clearml-agent
python3 -m pip install -U --no-cache-dir "cryptography>=2.9"

View File

@@ -1,4 +1,4 @@
ARG TAG=3.7.12-alpine3.15
ARG TAG=3.7.17-alpine3.18
FROM python:${TAG} as build
@@ -20,7 +20,7 @@ FROM python:${TAG} as target
WORKDIR /app
ARG KUBECTL_VERSION=1.22.4
ARG KUBECTL_VERSION=1.24.0
# Not sure about these ENV vars
# ENV LC_ALL=en_US.UTF-8

View File

@@ -1,4 +1,4 @@
ARG TAG=3.7.12-slim-bullseye
ARG TAG=3.7.17-slim-bullseye
FROM python:${TAG} as target

View File

@@ -65,6 +65,10 @@ def parse_args():
help="Limit the maximum number of pods that this service can run at the same time."
"Should not be used with ports-mode"
)
parser.add_argument(
"--use-owner-token", action="store_true", default=False,
help="Generate and use task owner token for the execution of each task"
)
return parser.parse_args()
@@ -87,7 +91,7 @@ def main():
ssh_port_number=args.ssh_server_port) if args.ssh_server_port else None,
namespace=args.namespace, max_pods_limit=args.max_pods or None,
)
k8s.k8s_daemon(args.queue)
k8s.k8s_daemon(args.queue, use_owner_token=args.use_owner_token)
if __name__ == "__main__":

View File

@@ -1,16 +1,36 @@
#!/bin/sh
#!/bin/bash +x
CLEARML_FILES_HOST=${CLEARML_FILES_HOST:-$TRAINS_FILES_HOST}
if [ -n "$SHUTDOWN_IF_NO_ACCESS_KEY" ] && [ -z "$CLEARML_API_ACCESS_KEY" ] && [ -z "$TRAINS_API_ACCESS_KEY" ]; then
echo "CLEARML_API_ACCESS_KEY was not provided, service will not be started"
exit 0
fi
export CLEARML_FILES_HOST=${CLEARML_FILES_HOST:-$TRAINS_FILES_HOST}
if [ -z "$CLEARML_FILES_HOST" ]; then
CLEARML_HOST_IP=${CLEARML_HOST_IP:-${TRAINS_HOST_IP:-$(curl -s https://ifconfig.me/ip)}}
fi
CLEARML_FILES_HOST=${CLEARML_FILES_HOST:-${TRAINS_FILES_HOST:-"http://$CLEARML_HOST_IP:8081"}}
CLEARML_WEB_HOST=${CLEARML_WEB_HOST:-${TRAINS_WEB_HOST:-"http://$CLEARML_HOST_IP:8080"}}
CLEARML_API_HOST=${CLEARML_API_HOST:-${TRAINS_API_HOST:-"http://$CLEARML_HOST_IP:8008"}}
export CLEARML_FILES_HOST=${CLEARML_FILES_HOST:-${TRAINS_FILES_HOST:-"http://$CLEARML_HOST_IP:8081"}}
export CLEARML_WEB_HOST=${CLEARML_WEB_HOST:-${TRAINS_WEB_HOST:-"http://$CLEARML_HOST_IP:8080"}}
export CLEARML_API_HOST=${CLEARML_API_HOST:-${TRAINS_API_HOST:-"http://$CLEARML_HOST_IP:8008"}}
echo $CLEARML_FILES_HOST $CLEARML_WEB_HOST $CLEARML_API_HOST 1>&2
python3 -m pip install -q -U "clearml-agent${CLEARML_AGENT_UPDATE_VERSION:-$TRAINS_AGENT_UPDATE_VERSION}"
clearml-agent daemon --services-mode --queue services --create-queue --docker "${CLEARML_AGENT_DEFAULT_BASE_DOCKER:-$TRAINS_AGENT_DEFAULT_BASE_DOCKER}" --cpu-only ${CLEARML_AGENT_EXTRA_ARGS:-$TRAINS_AGENT_EXTRA_ARGS}
if [[ "$CLEARML_AGENT_UPDATE_VERSION" =~ ^[0-9]{1,3}\.[0-9]{1,3}(\.[0-9]{1,3}([a-zA-Z]{1,3}[0-9]{1,3})?)?$ ]]
then
CLEARML_AGENT_UPDATE_VERSION="==$CLEARML_AGENT_UPDATE_VERSION"
fi
DAEMON_OPTIONS=${CLEARML_AGENT_DAEMON_OPTIONS:---services-mode --create-queue}
QUEUES=${CLEARML_AGENT_QUEUES:-services}
if [ -z "$CLEARML_AGENT_NO_UPDATE" ]; then
if [ -n "$CLEARML_AGENT_UPDATE_REPO" ]; then
python3 -m pip install -q -U $CLEARML_AGENT_UPDATE_REPO
else
python3 -m pip install -q -U "clearml-agent${CLEARML_AGENT_UPDATE_VERSION:-$TRAINS_AGENT_UPDATE_VERSION}"
fi
fi
clearml-agent daemon $DAEMON_OPTIONS --queue $QUEUES --docker "${CLEARML_AGENT_DEFAULT_BASE_DOCKER:-$TRAINS_AGENT_DEFAULT_BASE_DOCKER}" --cpu-only ${CLEARML_AGENT_EXTRA_ARGS:-$TRAINS_AGENT_EXTRA_ARGS}

View File

@@ -82,6 +82,7 @@ agent {
# pip_version: ["<20.2 ; python_version < '3.10'", "<22.3 ; python_version >= '3.10'"]
# specify poetry version to use (examples "<2", "==1.1.1", "", empty string will install the latest version)
# poetry_version: "<2",
# poetry_install_extra_args: ["-v"]
# virtual environment inheres packages from system
system_site_packages: false,
@@ -92,25 +93,43 @@ agent {
# extra_index_url: ["https://allegroai.jfrog.io/clearml/api/pypi/public/simple"]
extra_index_url: []
# additional flags to use when calling pip install, example: ["--use-deprecated=legacy-resolver", ]
# extra_pip_install_flags: []
# control the pytorch wheel resolving algorithm, options are: "pip", "direct", "none"
# Override with environment variable CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE
# "pip" (default): would automatically detect the cuda version, and supply pip with the correct
# extra-index-url, based on pytorch.org tables
# "direct": would resolve a direct link to the pytorch wheel by parsing the pytorch.org pip repository
# and matching the automatically detected cuda version with the required pytorch wheel.
# if the exact cuda version is not found for the required pytorch wheel, it will try
# a lower cuda version until a match is found
# "none": No resolver used, install pytorch like any other package
# pytorch_resolve: "pip"
# additional conda channels to use when installing with conda package manager
conda_channels: ["pytorch", "conda-forge", "defaults", ]
# conda_full_env_update: false
# conda_env_as_base_docker: false
# set the priority packages to be installed before the rest of the required packages
# Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
# priority_packages: ["cython", "numpy", "setuptools", ]
# set the optional priority packages to be installed before the rest of the required packages,
# In case a package installation fails, the package will be ignored,
# and the virtual environment process will continue
# Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
# priority_optional_packages: ["pygobject", ]
# set the post packages to be installed after all the rest of the required packages
# Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
# post_packages: ["horovod", ]
# set the optional post packages to be installed after all the rest of the required packages,
# In case a package installation fails, the package will be ignored,
# and the virtual environment process will continue
# Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
# post_optional_packages: []
# set to True to support torch nightly build installation,
@@ -167,6 +186,7 @@ agent {
# optional arguments to pass to docker image
# these are local for this agent and will not be updated in the experiment's docker_cmd section
# You can also pass host environments into the container with ["-e", "HOST_NAME=$HOST_NAME"]
# extra_docker_arguments: ["--ipc=host", "-v", "/mnt/host/data:/mnt/data"]
# optional shell script to run in docker when started before the experiment is started
@@ -177,13 +197,19 @@ agent {
# change to false to skip installation and decrease docker spin up time
# docker_install_opencv_libs: true
# Allow passing host environments into docker container with Task's docker container args
# Example "-e HOST_NAME=$HOST_NAME"
# NOTICE this might introduce security risk allowing access to keys/secret on the host machine1
# Use with care!
# docker_allow_host_environ: false
# set to true in order to force "docker pull" before running an experiment using a docker image.
# This makes sure the docker image is updated.
docker_force_pull: false
default_docker: {
# default docker image to use when running in docker mode
image: "nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04"
image: "nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04"
# optional arguments to pass to docker image
# arguments: ["--ipc=host"]
@@ -193,7 +219,7 @@ agent {
# enterprise version only
# match_rules: [
# {
# image: "nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04"
# image: "nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04"
# arguments: "-e define=value"
# match: {
# script{
@@ -214,7 +240,7 @@ agent {
# }
# },
# {
# image: "nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04"
# image: "nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04"
# arguments: "-e define=value"
# match: {
# # must match all requirements (not partial)
@@ -227,8 +253,6 @@ agent {
# # no repository matching required
# repository: ""
# }
# # no container image matching required (allow to replace one requested container with another)
# container: ""
# # no repository matching required
# project: ""
# }
@@ -282,7 +306,7 @@ sdk {
storage {
cache {
# Defaults to system temp folder / cache
# Defaults to <system_temp_folder>/clearml_cache
default_base_dir: "~/.clearml/cache"
}
@@ -468,7 +492,8 @@ sdk {
# target_format: format used to encode contents before writing into the target file. Supported values are json,
# yaml, yml and bytes (in which case the file will be written in binary mode). Default is text mode.
# overwrite: overwrite the target file in case it exists. Default is true.
#
# mode: file-system mode to be applied to the file after its creation. The mode string will be parsed into an
# integer (e.g. "0o777" for -rwxrwxrwx)
# Example:
# files {
# myfile1 {

Binary file not shown.

Before

Width:  |  Height:  |  Size: 1.1 MiB

After

Width:  |  Height:  |  Size: 1018 KiB

View File

@@ -146,7 +146,7 @@ sdk {
storage {
cache {
# Defaults to system temp folder / cache
# Defaults to <system_temp_folder>/clearml_cache
default_base_dir: "~/.clearml/cache"
}

View File

@@ -156,7 +156,7 @@
"TRAINS_GIT_PASS = \"\"\n",
"\n",
"# Additional fields for trains.conf file created on the remote instance\n",
"# for example: 'agent.default_docker.image: \"nvidia/cuda:10.0-cudnn7-runtime\"'\n",
"# for example: 'agent.default_docker.image: \"nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04\"'\n",
"EXTRA_TRAINS_CONF = \"\"\"\n",
"\"\"\"\n",
"\n",
@@ -584,4 +584,4 @@
},
"nbformat": 4,
"nbformat_minor": 2
}
}

View File

@@ -8,8 +8,8 @@ pyparsing>=2.0.3,<3.1.0
python-dateutil>=2.4.2,<2.9.0
pyjwt>=2.4.0,<2.7.0
PyYAML>=3.12,<6.1
requests>=2.20.0,<2.29.0
six>=1.13.0,<1.16.0
requests>=2.29.0,<=2.31.0
six>=1.13.0,<1.17.0
typing>=3.6.4,<3.8.0 ; python_version < '3.5'
urllib3>=1.21.1,<1.27.0
virtualenv>=16,<21

View File

@@ -62,6 +62,7 @@ setup(
'Programming Language :: Python :: 3.8',
'Programming Language :: Python :: 3.9',
'Programming Language :: Python :: 3.10',
'Programming Language :: Python :: 3.11',
'License :: OSI Approved :: Apache Software License',
],