Compare commits

...

113 Commits

Author SHA1 Message Date
allegroai
73625bf00f Version bump 2021-12-21 14:29:43 +02:00
allegroai
f41ed09dc1 Add support for custom docker image resolving 2021-12-21 14:29:43 +02:00
allegroai
f03c4576f7 Update default docker image 2021-12-21 14:29:43 +02:00
pshowbs
6c5087e425 Update S3 bucket verify option for minio (#83)
Use verify configuration option to skip verify or set ca bundle path
2021-11-06 14:40:35 +02:00
allegroai
5a6caf6399 Fix "git+git://" requirements 2021-10-29 22:58:28 +03:00
allegroai
a07053d961 Version bump to v1.1.1 2021-10-26 10:12:21 +03:00
allegroai
aa9a9a25fb version bump 2021-10-21 12:03:29 +03:00
allegroai
cd4a39d8fc Fix config example 2021-10-21 12:03:07 +03:00
allegroai
92e3f00435 Add support for truncating task log file after reporting to server 2021-10-21 12:02:31 +03:00
allegroai
a890e36a36 Fix PY2.7 support for pytorch 2021-10-19 10:47:09 +03:00
allegroai
bed94ee431 Add support for configuration env and files section 2021-10-19 10:46:43 +03:00
allegroai
175e99b12b Fix if queue tag default does not exist and --queue not specified, try queue name "default" 2021-10-16 23:21:45 +03:00
allegroai
2a941e3abf Fix --stop checking default queue tag (issue #80) 2021-10-16 23:21:12 +03:00
allegroai
3c8e0ae5db Improve PyJWT resiliency support 2021-10-10 09:08:36 +03:00
allegroai
e416ab526b Fix Python 3.5 compatibility 2021-09-26 00:05:08 +03:00
pollfly
e17246d8ea Fix docstring typos (#79)
* edit doctring typo

* fix typos
2021-09-14 18:42:18 +03:00
allegroai
f6f043d1ca Version bump to v1.1.0 2021-09-13 15:25:25 +03:00
allegroai
db57441c5d Fix sensitive environment variable values are not masked in "executing docker" printout (issue #67) 2021-09-13 14:00:11 +03:00
allegroai
31d90be0a1 Fix package manager config documentation (issue #78) 2021-09-10 13:11:39 +03:00
allegroai
5a080798cb Add support for overriding initial server connection behavior using the CLEARML_AGENT_INITIAL_CONNECT_RETRY_OVERRIDE env var (defaults to true, allows boolean value or an explicit number specifying the number of connect retries) 2021-08-27 19:15:14 +03:00
pollfly
21c4857795 Fix doctring typo (#75) 2021-08-22 08:19:55 +03:00
allegroai
4149afa896 Add agent.docker_internal_mounts to control containers internal mounts (non-root containers) 2021-08-21 16:03:37 +03:00
allegroai
b196ab5793 Do not overwrite PYTHONIOENCODING if defined 2021-08-20 00:37:21 +03:00
allegroai
b39b54bbaf Add poetry cache into docker mapping (issue #74) 2021-08-13 11:02:21 +03:00
allegroai
26d76f52ac Fix venv cache cannot reinstall package from git with http credentials 2021-08-13 11:00:54 +03:00
allegroai
2fff28845d Fix support for unicode standalone scripts, changing default 'ascii' encoding to UTF-8. 2021-08-12 13:39:11 +03:00
allegroai
5e4c495d62 Add support for naming docker containers. Use agent.docker_container_name_format to configure the name format (disabled by default) (issue clearml/#412)
Add missing entries in docs/clearml.conf
2021-08-12 13:38:26 +03:00
allegroai
5c5802c089 Fix python package with git+git:// links or git+ssh:// conversion 2021-08-12 13:37:10 +03:00
allegroai
06010ef1b7 Disable default demo server (still available with CLEARML_NO_DEFAULT_SERVER=0) 2021-08-12 13:36:49 +03:00
allegroai
bd411a1984 version bump 2021-08-05 19:23:23 +03:00
allegroai
29d24e3eaa Update docker example for k8s glue 2021-08-05 19:22:56 +03:00
allegroai
0fbbe774fa Fix support for "-r requirements.txt" in installed packages 2021-08-05 19:19:54 +03:00
allegroai
aede6f4bac Fix README 2021-08-03 11:30:27 +03:00
allegroai
84706ba66d Add docker example for running the agent k8s glue as a pod in a k8s cluster 2021-08-03 11:23:33 +03:00
allegroai
6b602889a5 Fix import loop 2021-08-03 01:28:08 +03:00
allegroai
cd046927f3 Add k8s glue update task status_message in hanging pods daemon
Fix k8s glue not throwing error when failing to push to queue
2021-08-02 22:59:31 +03:00
allegroai
5ed47d2d2c Add support for CLEARML_NO_DEFAULT_SERVER env var to prevent agent from using the demo server
Add support for FORCE_CLEARML_AGENT_REPO env var to allow installing agent from a repo url when executing a task
Implement skip venv installation on execute and allow custom binary
Fix services mode limit implementation in docker mode
2021-08-02 22:51:26 +03:00
allegroai
fd068c0933 Add support for env vars containing bash-style string lists using shlex
Add support for CLEARML_AGENT_SKIP_PIP_VENV_INSTALL env var to skip venv installation on execute and allow custom binary
Add support for CLEARML_AGENT_VENV_CACHE_PATH env var to allow overriding venv cache folder configuration
Add support for CLEARML_AGENT_EXTRA_DOCKER_ARGS env var to allow overriding extra docker args configuration
2021-08-02 22:38:36 +03:00
Simon Gasse
9456e493ac Enable rewriting SSH URLs
ClearML Agent allows to force git cloning via SSH and also has a
setting to force a username. The relevant settings are:
agent.force_git_ssh_protocol: true
agent.force_git_ssh_user: "git"

However, forcing a specific username or port only worked so far if the
agent translated either from https->ssh or from ssh->https. A given
ssh URL was not rewritten.

This commit adds a helper function and includes it in `_set_ssh_url`
to allow rewriting ssh URLs with the username and/or port given in the
config `agent.force_git_ssh_user`.
If neither username nor port are forced in the config, the URL is not
touched.

This is somewhat related to issue #42.
Note that rewriting https->https is not covered in this commit.
2021-07-31 23:34:27 +03:00
Jake Henning
3b08a73245 Update README with artifacthub.io badge 2021-07-27 19:53:16 +03:00
allegroai
42606d9247 Fix multiple k8s glue instances with pod limits
Version bump
2021-07-15 10:28:43 +03:00
allegroai
499b3dfa66 Fix k8s glue, do not reset Task before re-enqueuing as it will remove runtime properties 2021-07-15 10:27:54 +03:00
allegroai
ca360b7d43 Improve max pod limit check 2021-07-15 10:26:49 +03:00
allegroai
6470b16b70 Add k8s set task container if using default image/arguments 2021-07-15 10:26:09 +03:00
allegroai
4c9410c5fe Fix auto mount SSH_AUTH_SOCK into docker (issue #45) 2021-07-11 09:44:49 +03:00
pollfly
351f0657c3 Update agent gif (#69) 2021-07-08 09:20:45 +03:00
allegroai
382604e923 Fix services mode killing child processes when running in services mode + venv 2021-06-30 23:58:25 +03:00
Jake Henning
b48f25a7f9 Merge pull request #68 from pollfly/master
Fix documentation links
2021-06-29 11:04:52 +03:00
Revital
b76e4fc02b Merge remote-tracking branch 'origin/master' 2021-06-29 07:59:02 +03:00
Revital
27cf7dd67f add clearml_architecture picture 2021-06-29 07:58:29 +03:00
pollfly
05ec45352c Merge branch 'allegroai:master' into master 2021-06-29 07:37:10 +03:00
allegroai
0e7546f248 Fix docker force pull in k8s glue _kubectl_apply() 2021-06-27 09:42:14 +03:00
allegroai
e3c8bd5666 Add support for agent.docker_force_pull configuration setting in k8s glue 2021-06-25 17:36:08 +03:00
allegroai
3ae1741343 Fix k8s glue task container arguments not supported in kubectl_run command
Fix k8s glue not passing required extra_docker_bash_script to string format
2021-06-25 17:35:01 +03:00
allegroai
53c106c3af Fix k8s glue task container handling fails parsing docker image
Fix k8s glue uses task container image arguments when no image is specified
2021-06-25 17:34:28 +03:00
allegroai
44fc7dffe6 Fix key/secret usage printout 2021-06-24 19:37:59 +03:00
allegroai
aaa6b32f9f Fix support for "-r requirements.txt" inside "installed packages" 2021-06-24 19:26:35 +03:00
allegroai
821a0c4a2b Fix parsing VCS links starting with "git+git@" (notice "git+git://" was already supported) 2021-06-24 19:25:41 +03:00
Revital
6373237960 switch allegro.ai link to clear.ml links 2021-06-22 13:59:37 +03:00
pollfly
1caf7b104f Merge branch 'allegroai:master' into master 2021-06-22 13:47:48 +03:00
allegroai
176b4a4cde Fix --services-mode when the execute agent fails when starting to run with error code 0 2021-06-16 18:32:29 +03:00
allegroai
29bf993be7 Add printout when using key/secret from env vars 2021-06-02 21:15:48 +03:00
allegroai
eda597dea5 Version bump 2021-06-02 13:17:57 +03:00
allegroai
8c56777125 Add CLEARML_AGENT_DISABLE_SSH_MOUNT allowing disabling the auto .ssh mount into the docker 2021-06-02 13:16:58 +03:00
allegroai
7e90ebd5db Fix _dynamic_gpu_get_available worker timeout increase to 10 minutes 2021-06-02 13:16:17 +03:00
allegroai
3a07bfe1d7 Version bump 2021-05-31 23:19:46 +03:00
allegroai
0694b9e8af Fix PyYAML supported versions 2021-05-26 18:33:35 +03:00
allegroai
742cbf5767 Add docker environment arguments log masking support (issue #67) 2021-05-25 19:31:45 +03:00
allegroai
e93384b99b Fix --stop with dynamic gpus 2021-05-20 10:58:46 +03:00
allegroai
3c4e976093 Add agent.ignore_requested_python_version to config file 2021-05-19 15:20:44 +03:00
allegroai
1e795beec8 Fix support for spaces in docker arguments (issue #358) 2021-05-19 15:20:03 +03:00
allegroai
4f7407084d Fix standalone script with pre-exiting conda venv 2021-05-12 15:46:25 +03:00
allegroai
ae3d034531 Protect against None in execution.repository 2021-05-12 15:45:31 +03:00
allegroai
a2db1f5ab5 Remove queue name from pod name in k8s glue, add queue name and ID to pod labels (issue #64) 2021-05-05 12:03:35 +03:00
allegroai
cec6420c8f Version bump to v1.0.0 2021-05-03 18:33:53 +03:00
allegroai
4f18bb7ea0 Add k8s glue default restartPolicy=Never to template to prevent pods from restarting 2021-04-28 13:20:13 +03:00
allegroai
3ec2a3a92e Add k8s pod limit to k8s glue example 2021-04-28 13:19:34 +03:00
allegroai
823b67a3ce Deprecate venv_update (replaced by the more robust venvs_cache) 2021-04-28 13:17:37 +03:00
Revital
24dc59e31f add space to help message 2021-04-27 13:50:44 +03:00
allegroai
08ff5e6db7 Add number of pods limit to k8s glue 2021-04-25 10:47:49 +03:00
allegroai
e60a6f9d14 Fix --stop support for dynamic gpus 2021-04-25 10:46:43 +03:00
Revital
161656d9e4 add space to help message 2021-04-22 14:14:38 +03:00
Allegro AI
8569c02b33 Merge pull request #58 from pollfly/master
fix --downtime help
2021-04-21 15:27:47 +03:00
Revital
35e714d8d9 fix --downtime help 2021-04-21 09:13:47 +03:00
allegroai
6f8d5710d6 Fix dynamic gpus priority queue 2021-04-20 18:11:59 +03:00
allegroai
a671692832 Fix --services-mode with instance limit 2021-04-20 18:11:36 +03:00
allegroai
5c8675e43a Add support for dynamic gpus opportunistic scheduling (with min/max gpus per queue) 2021-04-20 18:11:16 +03:00
allegroai
60a58f6fad Fix poetry support (issue #57) 2021-04-14 11:22:07 +03:00
allegroai
948fc4c6ce Add python 3.9 to the support table 2021-04-12 23:01:40 +03:00
allegroai
5be5f3209d Fix documentation links 2021-04-12 23:01:22 +03:00
allegroai
537b67e0cd Fix agent can return non-zero error code and pods will end up restarting forever (issue #56) 2021-04-12 23:00:59 +03:00
allegroai
82c5e55fe4 Fix usage of not_set in k8s template merge 2021-04-07 21:30:13 +03:00
allegroai
5f0d51d485 Add documentation for agent.docker_install_opencv_libs 2021-04-07 18:48:30 +03:00
allegroai
945dd816ad Fix no docker arguments 2021-04-07 18:47:13 +03:00
allegroai
45009e6cc2 Add support for updating back docker on new API v2.13 2021-04-07 18:46:58 +03:00
allegroai
8eace6d57b Bump virtualenv dependency version 2021-04-07 18:46:35 +03:00
allegroai
3774fa6abd Add support for new container base setup script feature 2021-04-07 18:46:14 +03:00
allegroai
e71e6865d2 Add agent.docker_install_opencv_libs (default: True) to enable auto opencv libs install for faster docker spin-up 2021-04-07 18:45:44 +03:00
allegroai
0e8f1528b1 Remove redundant py2 code 2021-04-07 18:44:59 +03:00
allegroai
c331babf51 Add stopping message on Task process termination
Fix --stop on dynamic gpus venv mode
2021-04-07 18:44:33 +03:00
allegroai
c59d268995 Fix venv cache crash on bad symbolic links 2021-04-07 18:44:11 +03:00
allegroai
9e9fcb0ba9 Add dynamic mode terminate dockers on sig_term 2021-04-07 18:43:44 +03:00
allegroai
f33e0b2f78 Verify docker command exists when running in docker mode 2021-04-07 18:42:27 +03:00
allegroai
0e4b99351f Add --stop support for dynamic gpus
Fix --stop mark tasks as aborted (not failed as before)
2021-04-07 18:42:10 +03:00
allegroai
81edd2860f Fix --dynamic-gpus should keep original queue priority order 2021-03-31 23:55:12 +03:00
allegroai
14ac584577 Support k8s glue container env vars merging 2021-03-31 23:53:58 +03:00
allegroai
9ce6baf074 Fix broken k8s glue docker args parsing
Fix empty env prevents override when merging template
2021-03-26 12:26:15 +03:00
allegroai
92a1e07b33 Fix local path replace back when using cache 2021-03-26 12:16:05 +03:00
allegroai
cb6bdece39 Fix cuda version from driver does not return minor version 2021-03-18 10:07:59 +02:00
allegroai
2ea38364bb Change the default conda channel order, so it pulls the correct pytorch 2021-03-18 10:07:58 +02:00
allegroai
cf6fdc0d81 Add support for PyJWT v2 2021-03-18 10:07:58 +02:00
allegroai
91eec99563 Add conda debug prints (--debug) 2021-03-18 10:07:58 +02:00
allegroai
f8cbaa9a06 documentation 2021-03-18 03:05:26 +02:00
57 changed files with 3378 additions and 730 deletions

View File

@@ -5,10 +5,11 @@
**ClearML Agent - ML-Ops made easy
ML-Ops scheduler & orchestration solution supporting Linux, macOS and Windows**
[![GitHub license](https://img.shields.io/github/license/allegroai/trains-agent.svg)](https://img.shields.io/github/license/allegroai/trains-agent.svg)
[![GitHub license](https://img.shields.io/github/license/allegroai/clearml-agent.svg)](https://img.shields.io/github/license/allegroai/clearml-agent.svg)
[![PyPI pyversions](https://img.shields.io/pypi/pyversions/clearml-agent.svg)](https://img.shields.io/pypi/pyversions/clearml-agent.svg)
[![PyPI version shields.io](https://img.shields.io/pypi/v/clearml-agent.svg)](https://img.shields.io/pypi/v/clearml-agent.svg)
[![Artifact Hub](https://img.shields.io/endpoint?url=https://artifacthub.io/badge/repository/allegroai)](https://artifacthub.io/packages/search?repo=allegroai)
</div>
---
@@ -21,23 +22,23 @@ ML-Ops scheduler & orchestration solution supporting Linux, macOS and Windows**
* Implement optimized resource utilization policies
* Deploy execution environments with either virtualenv or fully docker containerized with zero effort
* Launch-and-Forget service containers
* [Cloud autoscaling](https://allegro.ai/clearml/docs/docs/examples/services/aws_autoscaler/aws_autoscaler.html)
* [Customizable cleanup](https://allegro.ai/clearml/docs/docs/examples/services/cleanup/cleanup_service.html)
* Advanced [pipeline building and execution](https://allegro.ai/clearml/docs/docs/examples/frameworks/pytorch/notebooks/table/tabular_training_pipeline.html)
* [Cloud autoscaling](https://clear.ml/docs/latest/docs/guides/services/aws_autoscaler)
* [Customizable cleanup](https://clear.ml/docs/latest/docs/guides/services/cleanup_service)
* Advanced [pipeline building and execution](https://clear.ml/docs/latest/docs/guides/frameworks/pytorch/notebooks/table/tabular_training_pipeline)
It is a zero configuration fire-and-forget execution agent, providing a full ML/DL cluster solution.
**Full Automation in 5 steps**
1. ClearML Server [self-hosted](https://github.com/allegroai/trains-server) or [free tier hosting](https://app.community.clear.ml)
1. ClearML Server [self-hosted](https://github.com/allegroai/clearml-server) or [free tier hosting](https://app.community.clear.ml)
2. `pip install clearml-agent` ([install](#installing-the-clearml-agent) the ClearML Agent on any GPU machine: on-premises / cloud / ...)
3. Create a [job](https://github.com/allegroai/clearml/docs/clearml-task.md) or Add [ClearML](https://github.com/allegroai/trains) to your code with just 2 lines
3. Create a [job](https://github.com/allegroai/clearml/docs/clearml-task.md) or Add [ClearML](https://github.com/allegroai/clearml) to your code with just 2 lines
4. Change the [parameters](#using-the-clearml-agent) in the UI & schedule for [execution](#using-the-clearml-agent) (or automate with an [AutoML pipeline](#automl-and-orchestration-pipelines-))
5. :chart_with_downwards_trend: :chart_with_upwards_trend: :eyes: :beer:
"All the Deep/Machine-Learning DevOps your research needs, and then some... Because ain't nobody got time for that"
**Try ClearML now** [Self Hosted](https://github.com/allegroai/trains-server) or [Free tier Hosting](https://app.community.clear.ml)
<a href="https://app.community.clear.ml"><img src="https://raw.githubusercontent.com/allegroai/trains-agent/9f1e86c1ca45c984ee13edc9353c7b10c55d7257/docs/screenshots.gif" width="100%"></a>
**Try ClearML now** [Self Hosted](https://github.com/allegroai/clearml-server) or [Free tier Hosting](https://app.community.clear.ml)
<a href="https://app.community.clear.ml"><img src="https://github.com/allegroai/clearml-agent/blob/master/docs/screenshots.gif?raw=true" width="100%"></a>
### Simple, Flexible Experiment Orchestration
**The ClearML Agent was built to address the DL/ML R&D DevOps needs:**
@@ -68,13 +69,13 @@ We designed `clearml-agent` so you can run bare-metal or inside a pod with any m
**Two K8s integration flavours**
- Spin ClearML-Agent as a long-lasting service pod
- use [clearml-agent](https://hub.docker.com/r/allegroai/trains-agent) docker image
- use [clearml-agent](https://hub.docker.com/r/allegroai/clearml-agent) docker image
- map docker socket into the pod (soon replaced by [podman](https://github.com/containers/podman))
- allow the clearml-agent to manage sibling dockers
- benefits: full use of the ClearML scheduling, no need to worry about wrong container images / lost pods etc.
- downside: Sibling containers
- Kubernetes Glue, map ClearML jobs directly to K8s jobs
- Run the [clearml-k8s glue](https://github.com/allegroai/trains-agent/blob/master/examples/k8s_glue_example.py) on a K8s cpu node
- Run the [clearml-k8s glue](https://github.com/allegroai/clearml-agent/blob/master/examples/k8s_glue_example.py) on a K8s cpu node
- The clearml-k8s glue pulls jobs from the ClearML job execution queue and prepares a K8s job (based on provided yaml template)
- Inside the pod itself the clearml-agent will install the job (experiment) environment and spin and monitor the experiment's process
- benefits: Kubernetes full view of all running jobs in the system
@@ -122,7 +123,7 @@ The ClearML Agent executes experiments using the following process:
#### System Design & Flow
<img src="https://allegro.ai/clearml/docs/_images/ClearML_Architecture.png" width="100%" alt="clearml-architecture">
<img src="https://github.com/allegroai/clearml-agent/blob/master/docs/clearml_architecture.png" width="100%" alt="clearml-architecture">
#### Installing the ClearML Agent
@@ -196,16 +197,16 @@ Notice: with `--detached` flag, the *clearml-agent* will be running in the backg
clearml-agent daemon --detached --queue default --docker
```
Example: spin two agents, one per gpu on the same machine, with default nvidia/cuda docker:
Example: spin two agents, one per gpu on the same machine, with default nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04 docker:
```bash
clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda
clearml-agent daemon --detached --gpus 1 --queue default --docker nvidia/cuda
clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
clearml-agent daemon --detached --gpus 1 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
```
Example: spin two agents, pulling from dedicated `dual_gpu` queue, two gpu's per agent, with default nvidia/cuda docker:
Example: spin two agents, pulling from dedicated `dual_gpu` queue, two gpu's per agent, with default nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04 docker:
```bash
clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu --docker nvidia/cuda
clearml-agent daemon --detached --gpus 2,3 --queue dual_gpu --docker nvidia/cuda
clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
clearml-agent daemon --detached --gpus 2,3 --queue dual_gpu --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
```
##### Starting the ClearML Agent - Priority Queues
@@ -225,11 +226,11 @@ Adding queues, managing job order within a queue and moving jobs between queues,
To stop a **ClearML Agent** running in the background, run the same command line used to start the agent with `--stop` appended.
For example, to stop the first of the above shown same machine, single gpu agents:
```bash
clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda --stop
clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04 --stop
```
### How do I create an experiment on the ClearML Server? <a name="from-scratch"></a>
* Integrate [ClearML](https://github.com/allegroai/trains) with your code
* Integrate [ClearML](https://github.com/allegroai/clearml) with your code
* Execute the code on your machine (Manually / PyCharm / Jupyter Notebook)
* As your code is running, **ClearML** creates an experiment logging all the necessary execution information:
- Git repository link and commit ID (or an entire jupyter notebook)
@@ -273,18 +274,18 @@ clearml-agent daemon --services-mode --detached --queue services --create-queue
### AutoML and Orchestration Pipelines <a name="automl-pipes"></a>
The ClearML Agent can also be used to implement AutoML orchestration and Experiment Pipelines in conjunction with the ClearML package.
Sample AutoML & Orchestration examples can be found in the ClearML [example/automation](https://github.com/allegroai/trains/tree/master/examples/automation) folder.
Sample AutoML & Orchestration examples can be found in the ClearML [example/automation](https://github.com/allegroai/clearml/tree/master/examples/automation) folder.
AutoML examples
- [Toy Keras training experiment](https://github.com/allegroai/trains/blob/master/examples/optimization/hyper-parameter-optimization/base_template_keras_simple.py)
- [Toy Keras training experiment](https://github.com/allegroai/clearml/blob/master/examples/optimization/hyper-parameter-optimization/base_template_keras_simple.py)
- In order to create an experiment-template in the system, this code must be executed once manually
- [Random Search over the above Keras experiment-template](https://github.com/allegroai/trains/blob/master/examples/automation/manual_random_param_search_example.py)
- [Random Search over the above Keras experiment-template](https://github.com/allegroai/clearml/blob/master/examples/automation/manual_random_param_search_example.py)
- This example will create multiple copies of the Keras experiment-template, with different hyper-parameter combinations
Experiment Pipeline examples
- [First step experiment](https://github.com/allegroai/trains/blob/master/examples/automation/task_piping_example.py)
- [First step experiment](https://github.com/allegroai/clearml/blob/master/examples/automation/task_piping_example.py)
- This example will "process data", and once done, will launch a copy of the 'second step' experiment-template
- [Second step experiment](https://github.com/allegroai/trains/blob/master/examples/automation/toy_base_task.py)
- [Second step experiment](https://github.com/allegroai/clearml/blob/master/examples/automation/toy_base_task.py)
- In order to create an experiment-template in the system, this code must be executed once manually
### License

View File

@@ -26,10 +26,18 @@
# Example values: "/usr/bin/python3" or "/usr/local/bin/python3.6"
# The default is the python executing the clearml_agent
python_binary: ""
# ignore any requested python version (Default: False, if a Task was using a
# specific python version and the system supports multiple python the agent will use the requested python version)
# ignore_requested_python_version: true
# select python package manager:
# currently supported pip and conda
# poetry is used if pip selected and repository contains poetry.lock file
# currently supported: pip, conda and poetry
# if "pip" or "conda" are used, the agent installs the required packages
# based on the "installed packages" section of the Task. If the "installed packages" is empty,
# it will revert to using `requirements.txt` from the repository's root directory.
# If Poetry is selected and the root repository contains `poetry.lock` or `pyproject.toml`,
# the "installed packages" section is ignored, and poetry is used.
# If Poetry is selected and no lock file is found, it reverts to "pip" package manager behaviour.
package_manager: {
# supported options: pip, conda, poetry
type: pip,
@@ -47,7 +55,7 @@
# extra_index_url: ["https://allegroai.jfrog.io/clearmlai/api/pypi/public/simple"]
# additional conda channels to use when installing with conda package manager
conda_channels: ["defaults", "conda-forge", "pytorch", ]
conda_channels: ["pytorch", "conda-forge", "defaults", ]
# If set to true, Task's "installed packages" are ignored,
# and the repository's "requirements.txt" is used instead
@@ -121,6 +129,11 @@
# optional shell script to run in docker when started before the experiment is started
# extra_docker_shell_script: ["apt-get install -y bindfs", ]
# Install the required packages for opencv libraries (libsm6 libxext6 libxrender-dev libglib2.0-0),
# for backwards compatibility reasons, true as default,
# change to false to skip installation and decrease docker spin up time
# docker_install_opencv_libs: true
# optional uptime configuration, make sure to use only one of 'uptime/downtime' and not both.
# If uptime is specified, agent will actively poll (and execute) tasks in the time-spans defined here.
# Outside of the specified time-spans, the agent will be idle.
@@ -143,7 +156,7 @@
default_docker: {
# default docker image to use when running in docker mode
image: "nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04"
image: "nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04"
# optional arguments to pass to docker image
# arguments: ["--ipc=host", ]
@@ -177,4 +190,75 @@
# should be detected automatically. Override with os environment CUDA_VERSION / CUDNN_VERSION
# cuda_version: 10.1
# cudnn_version: 7.6
# Hide docker environment variables containing secrets when printing out the docker command by replacing their
# values with "********". Turning this feature on will hide the following environment variables values:
# CLEARML_API_SECRET_KEY, CLEARML_AGENT_GIT_PASS, AWS_SECRET_ACCESS_KEY, AZURE_STORAGE_KEY
# To include more environment variables, add their keys to the "extra_keys" list. E.g. to make sure the value of
# your custom environment variable named MY_SPECIAL_PASSWORD will not show in the logs when included in the
# docker command, set:
# extra_keys: ["MY_SPECIAL_PASSWORD"]
hide_docker_command_env_vars {
enabled: true
extra_keys: []
}
# allow to set internal mount points inside the docker,
# especially useful for non-root docker container images.
docker_internal_mounts {
sdk_cache: "/clearml_agent_cache"
apt_cache: "/var/cache/apt/archives"
ssh_folder: "/root/.ssh"
pip_cache: "/root/.cache/pip"
poetry_cache: "/root/.cache/pypoetry"
vcs_cache: "/root/.clearml/vcs-cache"
venv_build: "/root/.clearml/venvs-builds"
pip_download: "/root/.clearml/pip-download-cache"
}
# Name docker containers created by the daemon using the following string format (supported from Docker 0.6.5)
# Allowed variables are task_id, worker_id and rand_string (random lower-case letters string, up to 32 characters)
# Note: resulting name must start with an alphanumeric character and continue with alphanumeric characters,
# underscores (_), dots (.) and/or dashes (-)
#docker_container_name_format: "clearml-id-{task_id}-{rand_string:.8}"
# Apply top-level environment section from configuration into os.environ
apply_environment: true
# Top-level environment section is in the form of:
# environment {
# key: value
# ...
# }
# and is applied to the OS environment as `key=value` for each key/value pair
# Apply top-level files section from configuration into local file system
apply_files: true
# Top-level files section allows auto-generating files at designated paths with a predefined contents
# and target format. Options include:
# contents: the target file's content, typically a string (or any base type int/float/list/dict etc.)
# format: a custom format for the contents. Currently supported value is `base64` to automatically decode a
# base64-encoded contents string, otherwise ignored
# path: the target file's path, may include ~ and inplace env vars
# target_format: format used to encode contents before writing into the target file. Supported values are json,
# yaml, yml and bytes (in which case the file will be written in binary mode). Default is text mode.
# overwrite: overwrite the target file in case it exists. Default is true.
#
# Example:
# files {
# myfile1 {
# contents: "The quick brown fox jumped over the lazy dog"
# path: "/tmp/fox.txt"
# }
# myjsonfile {
# contents: {
# some {
# nested {
# value: [1, 2, 3, 4]
# }
# }
# }
# path: "/tmp/test.json"
# target_format: json
# }
# }
}

View File

@@ -31,7 +31,9 @@
}
auth {
# When creating a request, if token will expire in less than this value, try to refresh the token
token_expiration_threshold_sec = 360
# When creating a request, if token will expire in less than this value, try to refresh the token. Default 12 hours
token_expiration_threshold_sec: 43200
# When requesting a token, request specific expiration time. Server default (and maximum) is 30 days
# request_token_expiration_sec: 2592000
}
}

View File

@@ -1,3 +1,4 @@
from ...backend_config.converters import safe_text_to_bool
from ...backend_config.environment import EnvEntry
@@ -6,6 +7,14 @@ ENV_WEB_HOST = EnvEntry("CLEARML_WEB_HOST", "TRAINS_WEB_HOST")
ENV_FILES_HOST = EnvEntry("CLEARML_FILES_HOST", "TRAINS_FILES_HOST")
ENV_ACCESS_KEY = EnvEntry("CLEARML_API_ACCESS_KEY", "TRAINS_API_ACCESS_KEY")
ENV_SECRET_KEY = EnvEntry("CLEARML_API_SECRET_KEY", "TRAINS_API_SECRET_KEY")
ENV_AUTH_TOKEN = EnvEntry("CLEARML_AUTH_TOKEN")
ENV_VERBOSE = EnvEntry("CLEARML_API_VERBOSE", "TRAINS_API_VERBOSE", type=bool, default=False)
ENV_HOST_VERIFY_CERT = EnvEntry("CLEARML_API_HOST_VERIFY_CERT", "TRAINS_API_HOST_VERIFY_CERT", type=bool, default=True)
ENV_CONDA_ENV_PACKAGE = EnvEntry("CLEARML_CONDA_ENV_PACKAGE", "TRAINS_CONDA_ENV_PACKAGE")
ENV_NO_DEFAULT_SERVER = EnvEntry("CLEARML_NO_DEFAULT_SERVER", "TRAINS_NO_DEFAULT_SERVER", type=bool, default=True)
ENV_DISABLE_VAULT_SUPPORT = EnvEntry('CLEARML_AGENT_DISABLE_VAULT_SUPPORT', type=bool)
ENV_ENABLE_ENV_CONFIG_SECTION = EnvEntry('CLEARML_AGENT_ENABLE_ENV_CONFIG_SECTION', type=bool)
ENV_ENABLE_FILES_CONFIG_SECTION = EnvEntry('CLEARML_AGENT_ENABLE_FILES_CONFIG_SECTION', type=bool)
ENV_INITIAL_CONNECT_RETRY_OVERRIDE = EnvEntry(
'CLEARML_AGENT_INITIAL_CONNECT_RETRY_OVERRIDE', default=True, converter=safe_text_to_bool
)

View File

@@ -1,17 +1,21 @@
import json as json_lib
import os
import sys
import types
from socket import gethostname
from six.moves.urllib.parse import urlparse, urlunparse
from typing import Optional
import jwt
import requests
import six
from pyhocon import ConfigTree
from pyhocon import ConfigTree, ConfigFactory
from requests.auth import HTTPBasicAuth
from six.moves.urllib.parse import urlparse, urlunparse
from .callresult import CallResult
from .defs import ENV_VERBOSE, ENV_HOST, ENV_ACCESS_KEY, ENV_SECRET_KEY, ENV_WEB_HOST, ENV_FILES_HOST
from .defs import ENV_VERBOSE, ENV_HOST, ENV_ACCESS_KEY, ENV_SECRET_KEY, ENV_WEB_HOST, ENV_FILES_HOST, ENV_AUTH_TOKEN, \
ENV_NO_DEFAULT_SERVER, ENV_DISABLE_VAULT_SUPPORT, ENV_INITIAL_CONNECT_RETRY_OVERRIDE
from .request import Request, BatchRequest
from .token_manager import TokenManager
from ..config import load
@@ -40,11 +44,12 @@ class Session(TokenManager):
_session_requests = 0
_session_initial_timeout = (3.0, 10.)
_session_timeout = (10.0, 30.)
_session_initial_connect_retry = 4
_session_initial_retry_connect_override = 4
_write_session_data_size = 15000
_write_session_timeout = (30.0, 30.)
api_version = '2.1'
feature_set = 'basic'
default_host = "https://demoapi.demo.clear.ml"
default_web = "https://demoapp.demo.clear.ml"
default_files = "https://demofiles.demo.clear.ml"
@@ -99,42 +104,48 @@ class Session(TokenManager):
if initialize_logging:
self.config.initialize_logging(debug=kwargs.get('debug', False))
token_expiration_threshold_sec = self.config.get(
"auth.token_expiration_threshold_sec", 60
)
super(Session, self).__init__(
token_expiration_threshold_sec=token_expiration_threshold_sec, **kwargs
)
super(Session, self).__init__(config=config, **kwargs)
self._verbose = verbose if verbose is not None else ENV_VERBOSE.get()
self._logger = logger
self.__auth_token = None
self.__access_key = api_key or ENV_ACCESS_KEY.get(
default=(self.config.get("api.credentials.access_key", None) or self.default_key)
)
if not self.access_key:
raise ValueError(
"Missing access_key. Please set in configuration file or pass in session init."
if ENV_AUTH_TOKEN.get(
value_cb=lambda key, value: print("Using environment access token {}=********".format(key))
):
self.set_auth_token(ENV_AUTH_TOKEN.get())
else:
self.__access_key = api_key or ENV_ACCESS_KEY.get(
default=(self.config.get("api.credentials.access_key", None) or self.default_key),
value_cb=lambda key, value: print("Using environment access key {}={}".format(key, value))
)
if not self.access_key:
raise ValueError(
"Missing access_key. Please set in configuration file or pass in session init."
)
self.__secret_key = secret_key or ENV_SECRET_KEY.get(
default=(self.config.get("api.credentials.secret_key", None) or self.default_secret)
)
if not self.secret_key:
raise ValueError(
"Missing secret_key. Please set in configuration file or pass in session init."
self.__secret_key = secret_key or ENV_SECRET_KEY.get(
default=(self.config.get("api.credentials.secret_key", None) or self.default_secret),
value_cb=lambda key, value: print("Using environment secret key {}=********".format(key))
)
if not self.secret_key:
raise ValueError(
"Missing secret_key. Please set in configuration file or pass in session init."
)
if self.access_key == self.default_key and self.secret_key == self.default_secret:
print("Using built-in ClearML default key/secret")
host = host or self.get_api_server_host(config=self.config)
if not host:
raise ValueError("host is required in init or config")
raise ValueError(
"Could not find host server definition "
"(missing `~/clearml.conf` or Environment CLEARML_API_HOST)\n"
"To get started with ClearML: setup your own `clearml-server`, "
"or create a free account at https://app.community.clear.ml and run `clearml-agent init`"
)
self.__host = host.strip("/")
http_retries_config = http_retries_config or self.config.get(
"api.http.retries", ConfigTree()
).as_plain_ordered_dict()
http_retries_config["status_forcelist"] = self._retry_codes
self.__worker = worker or gethostname()
@@ -145,22 +156,25 @@ class Session(TokenManager):
self.client = client or "api-{}".format(__version__)
# limit the reconnect retries, so we get an error if we are starting the session
http_no_retries_config = dict(**http_retries_config)
http_no_retries_config['connect'] = self._session_initial_connect_retry
self.__http_session = get_http_session_with_retry(**http_no_retries_config)
_, self.__http_session = self._setup_session(
http_retries_config,
initial_session=True,
default_initial_connect_override=(False if kwargs.get("command") == "execute" else None)
)
# try to connect with the server
self.refresh_token()
# create the default session with many retries
self.__http_session = get_http_session_with_retry(**http_retries_config)
http_retries_config, self.__http_session = self._setup_session(http_retries_config)
# update api version from server response
try:
token_dict = jwt.decode(self.token, verify=False)
token_dict = TokenManager.get_decoded_token(self.token, verify=False)
api_version = token_dict.get('api_version')
if not api_version:
api_version = '2.2' if token_dict.get('env', '') == 'prod' else Session.api_version
Session.api_version = str(api_version)
Session.feature_set = str(token_dict.get('feature_set', self.feature_set) or "basic")
except (jwt.DecodeError, ValueError):
pass
@@ -169,6 +183,69 @@ class Session(TokenManager):
# notice: this is across the board warning omission
urllib_log_warning_setup(total_retries=http_retries_config.get('total', 0), display_warning_after=3)
def _setup_session(self, http_retries_config, initial_session=False, default_initial_connect_override=None):
# type: (dict, bool, Optional[bool]) -> (dict, requests.Session)
http_retries_config = http_retries_config or self.config.get(
"api.http.retries", ConfigTree()
).as_plain_ordered_dict()
http_retries_config["status_forcelist"] = self._retry_codes
if initial_session:
kwargs = {} if default_initial_connect_override is None else {
"default": default_initial_connect_override
}
if ENV_INITIAL_CONNECT_RETRY_OVERRIDE.get(**kwargs):
connect_retries = self._session_initial_retry_connect_override
try:
value = ENV_INITIAL_CONNECT_RETRY_OVERRIDE.get(converter=str)
if not isinstance(value, bool):
connect_retries = abs(int(value))
except ValueError:
pass
http_retries_config = dict(**http_retries_config)
http_retries_config['connect'] = connect_retries
return http_retries_config, get_http_session_with_retry(**http_retries_config)
def load_vaults(self):
if not self.check_min_api_version("2.15") or self.feature_set == "basic":
return
if ENV_DISABLE_VAULT_SUPPORT.get():
print("Vault support is disabled")
return
def parse(vault):
# noinspection PyBroadException
try:
d = vault.get('data', None)
if d:
r = ConfigFactory.parse_string(d)
if isinstance(r, (ConfigTree, dict)):
return r
except Exception as e:
print("Failed parsing vault {}: {}".format(vault.get("description", "<unknown>"), e))
# noinspection PyBroadException
try:
res = self.send_request("users", "get_vaults", json={"enabled": True, "types": ["config"]})
if res.ok:
vaults = res.json().get("data", {}).get("vaults", [])
data = list(filter(None, map(parse, vaults)))
if data:
self.config.set_overrides(*data)
elif res.status_code != 404:
raise Exception(res.json().get("meta", {}).get("result_msg", res.text))
except Exception as ex:
print("Failed getting vaults: {}".format(ex))
def verify_feature_set(self, feature_set):
if isinstance(feature_set, str):
feature_set = [feature_set]
if self.feature_set not in feature_set:
raise ValueError('ClearML-server does not support requested feature set {}'.format(feature_set))
def _send_request(
self,
service,
@@ -242,6 +319,10 @@ class Session(TokenManager):
headers[self._AUTHORIZATION_HEADER] = "Bearer {}".format(self.token)
return headers
def set_auth_token(self, auth_token):
self.__access_key = self.__secret_key = None
self._set_token(auth_token)
def send_request(
self,
service,
@@ -439,8 +520,11 @@ class Session(TokenManager):
if not config:
return None
return ENV_HOST.get(default=(config.get("api.api_server", None) or
config.get("api.host", None) or cls.default_host))
default = config.get("api.api_server", None) or config.get("api.host", None)
if not ENV_NO_DEFAULT_SERVER.get():
default = default or cls.default_host
return ENV_HOST.get(default=default)
@classmethod
def get_app_server_host(cls, config=None):
@@ -508,7 +592,7 @@ class Session(TokenManager):
return v + (0,) * max(0, 3 - len(v))
return version_tuple(cls.api_version) >= version_tuple(str(min_api_version))
def _do_refresh_token(self, old_token, exp=None):
def _do_refresh_token(self, current_token, exp=None):
""" TokenManager abstract method implementation.
Here we ignore the old token and simply obtain a new token.
"""
@@ -520,7 +604,13 @@ class Session(TokenManager):
)
)
auth = HTTPBasicAuth(self.access_key, self.secret_key)
auth = None
headers = None
if self.access_key and self.secret_key:
auth = HTTPBasicAuth(self.access_key, self.secret_key)
elif current_token:
headers = dict(Authorization="Bearer {}".format(current_token))
res = None
try:
data = {"expiration_sec": exp} if exp else {}
@@ -529,6 +619,7 @@ class Session(TokenManager):
action="login",
auth=auth,
json=data,
headers=headers,
refresh_token_if_unauthorized=False,
)
try:
@@ -544,7 +635,10 @@ class Session(TokenManager):
)
if verbose:
self._logger.info("Received new token")
return resp["data"]["token"]
token = resp["data"]["token"]
if ENV_AUTH_TOKEN.get():
os.environ[ENV_AUTH_TOKEN.key] = token
return token
except LoginError:
six.reraise(*sys.exc_info())
except KeyError as ex:

View File

@@ -3,11 +3,14 @@ from abc import ABCMeta, abstractmethod
from time import time
import jwt
from jwt.algorithms import get_default_algorithms
import six
@six.add_metaclass(ABCMeta)
class TokenManager(object):
_default_token_exp_threshold_sec = 12 * 60 * 60
_default_req_token_expiration_sec = None
@property
def token_expiration_threshold_sec(self):
@@ -40,17 +43,30 @@ class TokenManager(object):
return self.__token
def __init__(
self,
token=None,
req_token_expiration_sec=None,
token_history=None,
token_expiration_threshold_sec=60,
**kwargs
self,
token=None,
req_token_expiration_sec=None,
token_history=None,
token_expiration_threshold_sec=None,
config=None,
**kwargs
):
super(TokenManager, self).__init__()
assert isinstance(token_history, (type(None), dict))
self.token_expiration_threshold_sec = token_expiration_threshold_sec
self.req_token_expiration_sec = req_token_expiration_sec
if config:
req_token_expiration_sec = req_token_expiration_sec or config.get(
"api.auth.request_token_expiration_sec", None
)
token_expiration_threshold_sec = (
token_expiration_threshold_sec
or config.get("api.auth.token_expiration_threshold_sec", None)
)
self.token_expiration_threshold_sec = (
token_expiration_threshold_sec or self._default_token_exp_threshold_sec
)
self.req_token_expiration_sec = (
req_token_expiration_sec or self._default_req_token_expiration_sec
)
self._set_token(token)
def _calc_token_valid_period_sec(self, token, exp=None, at_least_sec=None):
@@ -58,7 +74,9 @@ class TokenManager(object):
try:
exp = exp or self._get_token_exp(token)
if at_least_sec:
at_least_sec = max(at_least_sec, self.token_expiration_threshold_sec)
at_least_sec = max(
at_least_sec, self.token_expiration_threshold_sec
)
else:
at_least_sec = self.token_expiration_threshold_sec
return max(0, (exp - time() - at_least_sec))
@@ -66,10 +84,26 @@ class TokenManager(object):
pass
return 0
@classmethod
def get_decoded_token(cls, token, verify=False):
""" Get token expiration time. If not present, assume forever """
if hasattr(jwt, '__version__') and jwt.__version__[0] == '1':
return jwt.decode(
token,
verify=verify,
algorithms=get_default_algorithms(),
)
return jwt.decode(
token,
options=dict(verify_signature=verify),
algorithms=get_default_algorithms(),
)
@classmethod
def _get_token_exp(cls, token):
""" Get token expiration time. If not present, assume forever """
return jwt.decode(token, verify=False).get('exp', sys.maxsize)
return cls.get_decoded_token(token).get("exp", sys.maxsize)
def _set_token(self, token):
if token:
@@ -80,7 +114,9 @@ class TokenManager(object):
self.__token_expiration_sec = 0
def get_token_valid_period_sec(self):
return self._calc_token_valid_period_sec(self.__token, self.token_expiration_sec)
return self._calc_token_valid_period_sec(
self.__token, self.token_expiration_sec
)
def _get_token(self):
if self.get_token_valid_period_sec() <= 0:
@@ -92,4 +128,6 @@ class TokenManager(object):
pass
def refresh_token(self):
self._set_token(self._do_refresh_token(self.__token, exp=self.req_token_expiration_sec))
self._set_token(
self._do_refresh_token(self.__token, exp=self.req_token_expiration_sec)
)

View File

@@ -6,16 +6,9 @@ import requests
from requests.adapters import HTTPAdapter
from urllib3.util import Retry
from urllib3 import PoolManager
import six
from .session.defs import ENV_HOST_VERIFY_CERT
if six.PY3:
from functools import lru_cache
elif six.PY2:
# python 2 support
from backports.functools_lru_cache import lru_cache
__disable_certificate_verification_warning = 0

View File

@@ -4,15 +4,13 @@ import functools
import json
import os
import sys
import warnings
from fnmatch import fnmatch
from os.path import expanduser
from typing import Any
import pyhocon
import six
from pathlib2 import Path
from pyhocon import ConfigTree
from pyhocon import ConfigTree, ConfigFactory
from pyparsing import (
ParseFatalException,
ParseException,
@@ -71,6 +69,10 @@ class Config(object):
# used in place of None in Config.get as default value because None is a valid value
_MISSING = object()
extra_config_values_env_key_sep = "__"
extra_config_values_env_key_prefix = [
"CLEARML_AGENT" + extra_config_values_env_key_sep,
]
def __init__(
self,
@@ -90,6 +92,7 @@ class Config(object):
self._env = env or os.environ.get("TRAINS_ENV", Environment.default)
self.config_paths = set()
self.is_server = is_server
self._overrides_configs = None
if self._verbose:
print("Config env:%s" % str(self._env))
@@ -100,6 +103,7 @@ class Config(object):
)
if self._env not in get_options(Environment):
raise ValueError("Invalid environment %s" % env)
if relative_to is not None:
self.load_relative_to(relative_to)
@@ -158,7 +162,9 @@ class Config(object):
if LOCAL_CONFIG_PATHS:
config = functools.reduce(
lambda cfg, path: ConfigTree.merge_configs(
cfg, self._read_recursive(path, verbose=self._verbose), copy_trees=True
cfg,
self._read_recursive(path, verbose=self._verbose),
copy_trees=True,
),
LOCAL_CONFIG_PATHS,
config,
@@ -181,9 +187,38 @@ class Config(object):
config,
)
config = ConfigTree.merge_configs(
config, self._read_extra_env_config_values(), copy_trees=True
)
if self._overrides_configs:
config = functools.reduce(
lambda cfg, override: ConfigTree.merge_configs(cfg, override, copy_trees=True),
self._overrides_configs,
config,
)
config["env"] = env
return config
def _read_extra_env_config_values(self) -> ConfigTree:
""" Loads extra configuration from environment-injected values """
result = ConfigTree()
for prefix in self.extra_config_values_env_key_prefix:
keys = sorted(k for k in os.environ if k.startswith(prefix))
for key in keys:
path = (
key[len(prefix) :]
.replace(self.extra_config_values_env_key_sep, ".")
.lower()
)
result = ConfigTree.merge_configs(
result, ConfigFactory.parse_string("{}: {}".format(path, os.environ[key]))
)
return result
def replace(self, config):
self._config = config
@@ -340,3 +375,10 @@ class Config(object):
except Exception as ex:
print("Failed loading %s: %s" % (file_path, ex))
raise
def set_overrides(self, *dicts):
""" Set several override dictionaries or ConfigTree objects which should be merged onto the configuration """
self._overrides_configs = [
d if isinstance(d, ConfigTree) else pyhocon.ConfigFactory.from_dict(d) for d in dicts
]
self.reload()

View File

@@ -24,6 +24,14 @@ def text_to_bool(value):
return bool(strtobool(value))
def safe_text_to_bool(value):
# type: (Text) -> bool
try:
return text_to_bool(value)
except ValueError:
return bool(value)
def any_to_bool(value):
# type: (Optional[Union[int, float, Text]]) -> bool
if isinstance(value, six.text_type):

View File

@@ -64,8 +64,8 @@ class Entry(object):
converter = self.default_conversions().get(self.type, self.type)
return converter(value)
def get_pair(self, default=NotSet, converter=None):
# type: (Any, Converter) -> Optional[Tuple[Text, Any]]
def get_pair(self, default=NotSet, converter=None, value_cb=None):
# type: (Any, Converter, Callable[[str, Any], None]) -> Optional[Tuple[Text, Any]]
for key in self.keys:
value = self._get(key)
if value is NotSet:
@@ -75,13 +75,20 @@ class Entry(object):
except Exception as ex:
self.error("invalid value {key}={value}: {ex}".format(**locals()))
break
# noinspection PyBroadException
try:
if value_cb:
value_cb(key, value)
except Exception:
pass
return key, value
result = self.default if default is NotSet else default
return self.key, result
def get(self, default=NotSet, converter=None):
# type: (Any, Converter) -> Optional[Any]
return self.get_pair(default=default, converter=converter)[1]
def get(self, default=NotSet, converter=None, value_cb=None):
# type: (Any, Converter, Callable[[str, Any], None]) -> Optional[Any]
return self.get_pair(default=default, converter=converter, value_cb=value_cb)[1]
def set(self, value):
# type: (Any, Any) -> (Text, Any)

View File

@@ -1,3 +1,14 @@
import base64
import os
from os.path import expandvars, expanduser
from pathlib import Path
from typing import List, TYPE_CHECKING
from pyhocon import HOCONConverter, ConfigTree
if TYPE_CHECKING:
from .config import Config
def get_items(cls):
""" get key/value items from an enum-like class (members represent enumeration key/value) """
@@ -7,3 +18,95 @@ def get_items(cls):
def get_options(cls):
""" get options from an enum-like class (members represent enumeration key/value) """
return get_items(cls).values()
def apply_environment(config):
# type: (Config) -> List[str]
env_vars = config.get("environment", None)
if not env_vars:
return []
if isinstance(env_vars, (list, tuple)):
env_vars = dict(env_vars)
keys = list(filter(None, env_vars.keys()))
for key in keys:
os.environ[str(key)] = str(env_vars[key] or "")
return keys
def apply_files(config):
# type: (Config) -> None
files = config.get("files", None)
if not files:
return
if isinstance(files, (list, tuple)):
files = dict(files)
print("Creating files from configuration")
for key, data in files.items():
path = data.get("path")
fmt = data.get("format", "string")
target_fmt = data.get("target_format", "string")
overwrite = bool(data.get("overwrite", True))
contents = data.get("contents")
target = Path(expanduser(expandvars(path)))
# noinspection PyBroadException
try:
if target.is_dir():
print("Skipped [{}]: is a directory {}".format(key, target))
continue
if not overwrite and target.is_file():
print("Skipped [{}]: file exists {}".format(key, target))
continue
except Exception as ex:
print("Skipped [{}]: can't access {} ({})".format(key, target, ex))
continue
if contents:
try:
if fmt == "base64":
contents = base64.b64decode(contents)
if target_fmt != "bytes":
contents = contents.decode("utf-8")
except Exception as ex:
print("Skipped [{}]: failed decoding {} ({})".format(key, fmt, ex))
continue
# noinspection PyBroadException
try:
target.parent.mkdir(parents=True, exist_ok=True)
except Exception as ex:
print("Skipped [{}]: failed creating path {} ({})".format(key, target.parent, ex))
continue
try:
if target_fmt == "bytes":
try:
target.write_bytes(contents)
except TypeError:
# simpler error so the user won't get confused
raise TypeError("a bytes-like object is required")
else:
try:
if target_fmt == "json":
text = HOCONConverter.to_json(contents)
elif target_fmt in ("yaml", "yml"):
text = HOCONConverter.to_yaml(contents)
else:
if isinstance(contents, ConfigTree):
contents = contents.as_plain_ordered_dict()
text = str(contents)
except Exception as ex:
print("Skipped [{}]: failed encoding to {} ({})".format(key, target_fmt, ex))
continue
target.write_text(text)
print("Saved [{}]: {}".format(key, target))
except Exception as ex:
print("Skipped [{}]: failed saving file {} ({})".format(key, target, ex))
continue

View File

@@ -118,11 +118,13 @@ class ServiceCommandSection(BaseCommandSection):
""" The name of the REST service used by this command """
pass
def get(self, endpoint, *args, **kwargs):
return self._session.get(service=self.service, action=endpoint, *args, **kwargs)
def get(self, endpoint, *args, session=None, **kwargs):
session = session or self._session
return session.get(service=self.service, action=endpoint, *args, **kwargs)
def post(self, endpoint, *args, **kwargs):
return self._session.post(service=self.service, action=endpoint, *args, **kwargs)
def post(self, endpoint, *args, session=None, **kwargs):
session = session or self._session
return session.post(service=self.service, action=endpoint, *args, **kwargs)
def get_with_act_as(self, endpoint, *args, **kwargs):
return self._session.get_with_act_as(service=self.service, action=endpoint, *args, **kwargs)

View File

@@ -11,8 +11,8 @@ from clearml_agent.backend_config.defs import LOCAL_CONFIG_FILES
description = """
Please create new clearml credentials through the profile page in your clearml web app (e.g. https://demoapp.demo.clear.ml/profile)
Or with the free hosted service at https://app.community.clear.ml/profile
Please create new clearml credentials through the profile page in your `clearml-server` web app,
or create a free account at https://app.community.clear.ml/profile
In the profile page, press "Create new credentials", then press "Copy to clipboard".

View File

@@ -21,14 +21,16 @@ class Events(ServiceCommandSection):
""" Events command service endpoint """
return 'events'
def send_events(self, list_events):
def send_events(self, list_events, session=None):
def send_packet(jsonlines):
if not jsonlines:
return 0
num_lines = len(jsonlines)
jsonlines = '\n'.join(jsonlines)
new_events = self.post('add_batch', data=jsonlines, headers={'Content-type': 'application/json-lines'})
new_events = self.post(
'add_batch', data=jsonlines, headers={'Content-type': 'application/json-lines'}, session=session
)
if new_events['added'] != num_lines:
print('Error (%s) sending events only %d of %d registered' %
(new_events['errors'], new_events['added'], num_lines))
@@ -57,7 +59,7 @@ class Events(ServiceCommandSection):
# print('Sending events done: %d / %d events sent' % (sent_events, len(list_events)))
return sent_events
def send_log_events(self, worker_id, task_id, lines, level='DEBUG'):
def send_log_events(self, worker_id, task_id, lines, level='DEBUG', session=None):
log_events = []
base_timestamp = int(time.time() * 1000)
base_log_items = {
@@ -94,4 +96,4 @@ class Events(ServiceCommandSection):
log_events.append(get_event(count))
# now send the events
return self.send_events(list_events=log_events)
return self.send_events(list_events=log_events, session=session)

View File

@@ -0,0 +1,166 @@
import json
import re
import shlex
from clearml_agent.helper.package.requirements import (
RequirementsManager, MarkerRequirement,
compare_version_rules, )
def resolve_default_container(session, task_id, container_config):
container_lookup = session.config.get('agent.default_docker.match_rules', None)
if not session.check_min_api_version("2.13") or not container_lookup:
return container_config
# check backend support before sending any more requests (because they will fail and crash the Task)
try:
session.verify_feature_set('advanced')
except ValueError:
return container_config
result = session.send_request(
service='tasks',
action='get_all',
version='2.14',
json={'id': [task_id],
'only_fields': ['script.requirements', 'script.binary',
'script.repository', 'script.branch',
'project', 'container'],
'search_hidden': True},
method='get',
async_enable=False,
)
try:
task_info = result.json()['data']['tasks'][0] if result.ok else {}
except (ValueError, TypeError):
return container_config
from clearml_agent.external.requirements_parser.requirement import Requirement
# store tasks repository
repository = task_info.get('script', {}).get('repository') or ''
branch = task_info.get('script', {}).get('branch') or ''
binary = task_info.get('script', {}).get('binary') or ''
requested_container = task_info.get('container', {})
# get project full path
project_full_name = ''
if task_info.get('project', None):
result = session.send_request(
service='projects',
action='get_all',
version='2.13',
json={
'id': [task_info.get('project')],
'only_fields': ['name'],
},
method='get',
async_enable=False,
)
try:
if result.ok:
project_full_name = result.json()['data']['projects'][0]['name'] or ''
except (ValueError, TypeError):
pass
task_packages_lookup = {}
for entry in container_lookup:
match = entry.get('match', None)
if not match:
continue
if match.get('project', None):
# noinspection PyBroadException
try:
if not re.search(match.get('project', None), project_full_name):
continue
except Exception:
print('Failed parsing regular expression \"{}\" in rule: {}'.format(
match.get('project', None), entry))
continue
if match.get('script.repository', None):
# noinspection PyBroadException
try:
if not re.search(match.get('script.repository', None), repository):
continue
except Exception:
print('Failed parsing regular expression \"{}\" in rule: {}'.format(
match.get('script.repository', None), entry))
continue
if match.get('script.branch', None):
# noinspection PyBroadException
try:
if not re.search(match.get('script.branch', None), branch):
continue
except Exception:
print('Failed parsing regular expression \"{}\" in rule: {}'.format(
match.get('script.branch', None), entry))
continue
if match.get('script.binary', None):
# noinspection PyBroadException
try:
if not re.search(match.get('script.binary', None), binary):
continue
except Exception:
print('Failed parsing regular expression \"{}\" in rule: {}'.format(
match.get('script.binary', None), entry))
continue
if match.get('container', None):
# noinspection PyBroadException
try:
if not re.search(match.get('container', None), requested_container.get('image', '')):
continue
except Exception:
print('Failed parsing regular expression \"{}\" in rule: {}'.format(
match.get('container', None), entry))
continue
matched = True
for req_section in ['script.requirements.pip', 'script.requirements.conda']:
if not match.get(req_section, None):
continue
match_pip_reqs = [MarkerRequirement(Requirement.parse('{} {}'.format(k, v)))
for k, v in match.get(req_section, None).items()]
if not task_packages_lookup.get(req_section):
req_section_parts = req_section.split('.')
task_packages_lookup[req_section] = \
RequirementsManager.parse_requirements_section_to_marker_requirements(
requirements=task_info.get(req_section_parts[0], {}).get(
req_section_parts[1], {}).get(req_section_parts[2], None)
)
matched_all_reqs = True
for mr in match_pip_reqs:
matched_req = False
for pr in task_packages_lookup[req_section]:
if mr.req.name != pr.req.name:
continue
if compare_version_rules(mr.specs, pr.specs):
matched_req = True
break
if not matched_req:
matched_all_reqs = False
break
# if ew have a match, check second section
if matched_all_reqs:
continue
# no match stop
matched = False
break
if matched:
if not container_config.get('container'):
container_config['container'] = entry.get('image', None)
if not container_config.get('arguments'):
container_config['arguments'] = entry.get('arguments', None)
container_config['arguments'] = shlex.split(str(container_config.get('arguments') or '').strip())
print('Matching default container with rule:\n{}'.format(json.dumps(entry)))
return container_config
return container_config

File diff suppressed because it is too large Load Diff

View File

@@ -1,10 +1,10 @@
import shlex
from datetime import timedelta
from distutils.util import strtobool
from enum import IntEnum
from os import getenv, environ
from typing import Text, Optional, Union, Tuple, Any
from furl import furl
from pathlib2 import Path
import six
@@ -34,6 +34,7 @@ class EnvironmentConfig(object):
conversions = {
bool: lambda value: bool(strtobool(value)),
six.text_type: lambda s: six.text_type(s).strip(),
list: lambda s: shlex.split(s.strip()),
}
def __init__(self, *names, **kwargs):
@@ -62,6 +63,11 @@ class EnvironmentConfig(object):
return None
ENV_AGENT_SECRET_KEY = EnvironmentConfig("CLEARML_API_SECRET_KEY", "TRAINS_API_SECRET_KEY")
ENV_AGENT_AUTH_TOKEN = EnvironmentConfig("CLEARML_AUTH_TOKEN")
ENV_AWS_SECRET_KEY = EnvironmentConfig("AWS_SECRET_ACCESS_KEY")
ENV_AZURE_ACCOUNT_KEY = EnvironmentConfig("AZURE_STORAGE_KEY")
ENVIRONMENT_CONFIG = {
"api.api_server": EnvironmentConfig("CLEARML_API_HOST", "TRAINS_API_HOST", ),
"api.files_server": EnvironmentConfig("CLEARML_FILES_HOST", "TRAINS_FILES_HOST", ),
@@ -69,9 +75,7 @@ ENVIRONMENT_CONFIG = {
"api.credentials.access_key": EnvironmentConfig(
"CLEARML_API_ACCESS_KEY", "TRAINS_API_ACCESS_KEY",
),
"api.credentials.secret_key": EnvironmentConfig(
"CLEARML_API_SECRET_KEY", "TRAINS_API_SECRET_KEY",
),
"api.credentials.secret_key": ENV_AGENT_SECRET_KEY,
"agent.worker_name": EnvironmentConfig("CLEARML_WORKER_NAME", "TRAINS_WORKER_NAME", ),
"agent.worker_id": EnvironmentConfig("CLEARML_WORKER_ID", "TRAINS_WORKER_ID", ),
"agent.cuda_version": EnvironmentConfig(
@@ -84,10 +88,10 @@ ENVIRONMENT_CONFIG = {
names=("CLEARML_CPU_ONLY", "TRAINS_CPU_ONLY", "CPU_ONLY"), type=bool
),
"sdk.aws.s3.key": EnvironmentConfig("AWS_ACCESS_KEY_ID"),
"sdk.aws.s3.secret": EnvironmentConfig("AWS_SECRET_ACCESS_KEY"),
"sdk.aws.s3.secret": ENV_AWS_SECRET_KEY,
"sdk.aws.s3.region": EnvironmentConfig("AWS_DEFAULT_REGION"),
"sdk.azure.storage.containers.0": {'account_name': EnvironmentConfig("AZURE_STORAGE_ACCOUNT"),
'account_key': EnvironmentConfig("AZURE_STORAGE_KEY")},
'account_key': ENV_AZURE_ACCOUNT_KEY},
"sdk.google.storage.credentials_json": EnvironmentConfig("GOOGLE_APPLICATION_CREDENTIALS"),
}
@@ -128,14 +132,20 @@ PIP_EXTRA_INDICES = [
DEFAULT_PIP_DOWNLOAD_CACHE = normalize_path(CONFIG_DIR, "pip-download-cache")
ENV_DOCKER_IMAGE = EnvironmentConfig('CLEARML_DOCKER_IMAGE', 'TRAINS_DOCKER_IMAGE')
ENV_WORKER_ID = EnvironmentConfig('CLEARML_WORKER_ID', 'TRAINS_WORKER_ID')
ENV_WORKER_TAGS = EnvironmentConfig('CLEARML_WORKER_TAGS')
ENV_AGENT_SKIP_PIP_VENV_INSTALL = EnvironmentConfig('CLEARML_AGENT_SKIP_PIP_VENV_INSTALL')
ENV_DOCKER_SKIP_GPUS_FLAG = EnvironmentConfig('CLEARML_DOCKER_SKIP_GPUS_FLAG', 'TRAINS_DOCKER_SKIP_GPUS_FLAG')
ENV_AGENT_GIT_USER = EnvironmentConfig('CLEARML_AGENT_GIT_USER', 'TRAINS_AGENT_GIT_USER')
ENV_AGENT_GIT_PASS = EnvironmentConfig('CLEARML_AGENT_GIT_PASS', 'TRAINS_AGENT_GIT_PASS')
ENV_AGENT_GIT_HOST = EnvironmentConfig('CLEARML_AGENT_GIT_HOST', 'TRAINS_AGENT_GIT_HOST')
ENV_AGENT_DISABLE_SSH_MOUNT = EnvironmentConfig('CLEARML_AGENT_DISABLE_SSH_MOUNT', type=bool)
ENV_SSH_AUTH_SOCK = EnvironmentConfig('SSH_AUTH_SOCK')
ENV_TASK_EXECUTE_AS_USER = EnvironmentConfig('CLEARML_AGENT_EXEC_USER', 'TRAINS_AGENT_EXEC_USER')
ENV_TASK_EXTRA_PYTHON_PATH = EnvironmentConfig('CLEARML_AGENT_EXTRA_PYTHON_PATH', 'TRAINS_AGENT_EXTRA_PYTHON_PATH')
ENV_DOCKER_HOST_MOUNT = EnvironmentConfig('CLEARML_AGENT_K8S_HOST_MOUNT', 'CLEARML_AGENT_DOCKER_HOST_MOUNT',
'TRAINS_AGENT_K8S_HOST_MOUNT', 'TRAINS_AGENT_DOCKER_HOST_MOUNT')
ENV_VENV_CACHE_PATH = EnvironmentConfig('CLEARML_AGENT_VENV_CACHE_PATH')
ENV_EXTRA_DOCKER_ARGS = EnvironmentConfig('CLEARML_AGENT_EXTRA_DOCKER_ARGS', type=list)
class FileBuffering(IntEnum):

View File

@@ -4,13 +4,14 @@ import warnings
from .requirement import Requirement
def parse(reqstr):
def parse(reqstr, cwd=None):
"""
Parse a requirements file into a list of Requirements
See: pip/req.py:parse_requirements()
:param reqstr: a string or file like object containing requirements
:param cwd: Optional current working dir for -r file.txt loading
:returns: a *generator* of Requirement objects
"""
filename = getattr(reqstr, 'name', None)
@@ -30,10 +31,12 @@ def parse(reqstr):
elif not line or line.startswith('#'):
# comments are lines that start with # only
continue
elif line.startswith('-r') or line.startswith('--requirement'):
elif line.startswith('-r ') or line.startswith('--requirement '):
_, new_filename = line.split()
new_file_path = os.path.join(os.path.dirname(filename or '.'),
new_filename)
new_file_path = os.path.join(
os.path.dirname(filename or '.') if filename or not cwd else cwd, new_filename)
if not os.path.exists(new_file_path):
continue
with open(new_file_path) as f:
for requirement in parse(f):
yield requirement

View File

@@ -20,6 +20,15 @@ VCS_REGEX = re.compile(
r'(#(?P<fragment>\S+))?'
)
VCS_EXT_REGEX = re.compile(
r'^(?P<scheme>{0})(@)'.format(r'|'.join(
[scheme.replace('+', r'\+') for scheme in ['git+git']])) +
r'((?P<login>[^/@]+)@)?'
r'(?P<path>[^#@]+)'
r'(@(?P<revision>[^#]+))?'
r'(#(?P<fragment>\S+))?'
)
# This matches just about everyting
LOCAL_REGEX = re.compile(
r'^((?P<scheme>file)://)?'
@@ -100,7 +109,7 @@ class Requirement(object):
req = cls('-e {0}'.format(line))
req.editable = True
vcs_match = VCS_REGEX.match(line)
vcs_match = VCS_REGEX.match(line) or VCS_EXT_REGEX.match(line)
local_match = LOCAL_REGEX.match(line)
if vcs_match is not None:
@@ -147,7 +156,7 @@ class Requirement(object):
req = cls(line)
vcs_match = VCS_REGEX.match(line)
vcs_match = VCS_REGEX.match(line) or VCS_EXT_REGEX.match(line)
uri_match = URI_REGEX.match(line)
local_match = LOCAL_REGEX.match(line)
@@ -226,7 +235,7 @@ class Requirement(object):
# check if the name is valid & parsed
Req.parse(name)
# if we are here, name is a valid package name, check if the vcs part is valid
if VCS_REGEX.match(uri):
if VCS_REGEX.match(uri) or VCS_EXT_REGEX.match(uri):
req = cls.parse_line(uri)
req.name = name
return req

View File

@@ -2,6 +2,7 @@ from __future__ import print_function, division, unicode_literals
import base64
import functools
import hashlib
import json
import logging
import os
@@ -12,12 +13,12 @@ from copy import deepcopy
from pathlib import Path
from threading import Thread
from time import sleep
from typing import Text, List
from typing import Text, List, Callable, Any, Collection, Optional, Union
import yaml
from clearml_agent.commands.events import Events
from clearml_agent.commands.worker import Worker
from clearml_agent.commands.worker import Worker, get_task_container, set_task_container
from clearml_agent.definitions import ENV_DOCKER_IMAGE
from clearml_agent.errors import APIError
from clearml_agent.helper.base import safe_remove_file
@@ -31,20 +32,23 @@ class K8sIntegration(Worker):
K8S_PENDING_QUEUE = "k8s_scheduler"
K8S_DEFAULT_NAMESPACE = "clearml"
AGENT_LABEL = "CLEARML=agent"
LIMIT_POD_LABEL = "ai.allegro.agent.serial=pod-{pod_number}"
KUBECTL_APPLY_CMD = "kubectl apply -f"
KUBECTL_APPLY_CMD = "kubectl apply --namespace={namespace} -f"
KUBECTL_RUN_CMD = "kubectl run clearml-{queue_name}-id-{task_id} " \
"--image {docker_image} " \
KUBECTL_RUN_CMD = "kubectl run clearml-id-{task_id} " \
"--image {docker_image} {docker_args} " \
"--restart=Never " \
"--namespace={namespace}"
KUBECTL_DELETE_CMD = "kubectl delete pods " \
"--selector=TRAINS=agent " \
"--selector={selector} " \
"--field-selector=status.phase!=Pending,status.phase!=Running " \
"--namespace={namespace}"
BASH_INSTALL_SSH_CMD = [
"apt-get update",
"apt-get install -y openssh-server",
"mkdir -p /var/run/sshd",
"echo 'root:training' | chpasswd",
@@ -71,12 +75,10 @@ class K8sIntegration(Worker):
"[ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3",
"$LOCAL_PYTHON -m pip install clearml-agent",
"{extra_bash_init_cmd}",
"{extra_docker_bash_script}",
"$LOCAL_PYTHON -m clearml_agent execute --full-monitoring --require-queue --id {task_id}"
]
AGENT_LABEL = "TRAINS=agent"
LIMIT_POD_LABEL = "ai.allegro.agent.serial=pod-{pod_number}"
_edit_hyperparams_version = "2.9"
def __init__(
@@ -94,6 +96,7 @@ class K8sIntegration(Worker):
clearml_conf_file=None,
extra_bash_init_script=None,
namespace=None,
max_pods_limit=None,
**kwargs
):
"""
@@ -102,7 +105,7 @@ class K8sIntegration(Worker):
:param str k8s_pending_queue_name: queue name to use when task is pending in the k8s scheduler
:param str|callable kubectl_cmd: kubectl command line str, supports formatting (default: KUBECTL_RUN_CMD)
example: "task={task_id} image={docker_image} queue_id={queue_id}"
or a callable function: kubectl_cmd(task_id, docker_image, queue_id, task_data)
or a callable function: kubectl_cmd(task_id, docker_image, docker_args, queue_id, task_data)
:param str container_bash_script: container bash script to be executed in k8s (default: CONTAINER_BASH_SCRIPT)
Notice this string will use format() call, if you have curly brackets they should be doubled { -> {{
Format arguments passed: {task_id} and {extra_bash_init_cmd}
@@ -121,6 +124,7 @@ class K8sIntegration(Worker):
:param str clearml_conf_file: clearml.conf file to be use by the pod itself (optional)
:param str extra_bash_init_script: Additional bash script to run before starting the Task inside the container
:param str namespace: K8S namespace to be used when creating the new pods (default: clearml)
:param int max_pods_limit: Maximum number of pods that K8S glue can run at the same time
"""
super(K8sIntegration, self).__init__()
self.k8s_pending_queue_name = k8s_pending_queue_name or self.K8S_PENDING_QUEUE
@@ -146,6 +150,7 @@ class K8sIntegration(Worker):
self.namespace = namespace or self.K8S_DEFAULT_NAMESPACE
self.pod_limits = []
self.pod_requests = []
self.max_pods_limit = max_pods_limit if not self.ports_mode else None
if overrides_yaml:
with open(os.path.expandvars(os.path.expanduser(str(overrides_yaml))), 'rt') as f:
overrides = yaml.load(f, Loader=getattr(yaml, 'FullLoader', None))
@@ -180,6 +185,8 @@ class K8sIntegration(Worker):
# make sure we use system packages!
self.conf_file_content += '\nagent.package_manager.system_site_packages=true\n'
self._agent_label = None
self._monitor_hanging_pods()
def _monitor_hanging_pods(self):
@@ -187,7 +194,18 @@ class K8sIntegration(Worker):
_check_pod_thread.daemon = True
_check_pod_thread.start()
@staticmethod
def _get_path(d, *path, default=None):
try:
return functools.reduce(
lambda a, b: a[b], path, d
)
except (IndexError, KeyError):
return default
def _monitor_hanging_pods_daemon(self):
last_tasks_msgs = {} # last msg updated for every task
while True:
output = get_bash_output('kubectl get pods -n {namespace} -o=JSON'.format(
namespace=self.namespace
@@ -200,23 +218,44 @@ class K8sIntegration(Worker):
sleep(self._polling_interval)
continue
pods = output_config.get('items', [])
task_ids = set()
for pod in pods:
try:
reason = functools.reduce(
lambda a, b: a[b], ('status', 'containerStatuses', 0, 'state', 'waiting', 'reason'), pod
)
except (IndexError, KeyError):
if self._get_path(pod, 'status', 'phase') != "Pending":
continue
if reason == 'ImagePullBackOff':
pod_name = pod.get('metadata', {}).get('name', None)
if pod_name:
task_id = pod_name.rpartition('-')[-1]
pod_name = pod.get('metadata', {}).get('name', None)
if not pod_name:
continue
task_id = pod_name.rpartition('-')[-1]
if not task_id:
continue
task_ids.add(task_id)
msg = None
waiting = self._get_path(pod, 'status', 'containerStatuses', 0, 'state', 'waiting')
if not waiting:
condition = self._get_path(pod, 'status', 'conditions', 0)
if condition:
reason = condition.get('reason')
if reason == 'Unschedulable':
message = condition.get('message')
msg = reason + (" ({})".format(message) if message else "")
else:
reason = waiting.get("reason", None)
message = waiting.get("message", None)
msg = reason + (" ({})".format(message) if message else "")
if reason == 'ImagePullBackOff':
delete_pod_cmd = 'kubectl delete pods {} -n {}'.format(pod_name, self.namespace)
get_bash_output(delete_pod_cmd)
try:
self._session.api_client.tasks.failed(
task=task_id,
status_reason="K8S glue error due to ImagePullBackOff",
status_reason="K8S glue error: {}".format(msg),
status_message="Changed by K8S glue",
force=True
)
@@ -224,6 +263,35 @@ class K8sIntegration(Worker):
self.log.warning(
'K8S Glue pods monitor: Failed deleting task "{}"\nEX: {}'.format(task_id, ex)
)
# clean up any msg for this task
last_tasks_msgs.pop(task_id, None)
continue
if msg and last_tasks_msgs.get(task_id, None) != msg:
try:
result = self._session.send_request(
service='tasks',
action='update',
json={"task": task_id, "status_message": "K8S glue status: {}".format(msg)},
method='get',
async_enable=False,
)
if not result.ok:
result_msg = self._get_path(result.json(), 'meta', 'result_msg')
raise Exception(result_msg or result.text)
# update last msg for this task
last_tasks_msgs[task_id] = msg
except Exception as ex:
self.log.warning(
'K8S Glue pods monitor: Failed setting status message for task "{}"\nEX: {}'.format(
task_id, ex
)
)
# clean up any last message for a task that wasn't seen as a pod
last_tasks_msgs = {k: v for k, v in last_tasks_msgs.items() if k in task_ids}
sleep(self._polling_interval)
def _set_task_user_properties(self, task_id: str, **properties: str):
@@ -256,6 +324,44 @@ class K8sIntegration(Worker):
if error.code == 404:
self._edit_hyperparams_support = self._session.api_version
def _get_agent_label(self):
if not self.worker_id:
print('WARNING! no worker ID found!!!')
return self.AGENT_LABEL
if not self._agent_label:
h = hashlib.md5()
h.update(str(self.worker_id).encode('utf-8'))
self._agent_label = '{}-{}'.format(self.AGENT_LABEL, h.hexdigest()[:8])
return self._agent_label
def _get_number_used_pods(self):
# noinspection PyBroadException
try:
kubectl_cmd_new = "kubectl get pods -l {agent_label} -n {namespace} -o json".format(
agent_label=self._get_agent_label(),
namespace=self.namespace,
)
process = subprocess.Popen(kubectl_cmd_new.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
output, error = process.communicate()
output = '' if not output else output if isinstance(output, str) else output.decode('utf-8')
error = '' if not error else error if isinstance(error, str) else error.decode('utf-8')
if not output:
# No such pod exist so we can use the pod_number we found
return 0
try:
current_pod_count = len(json.loads(output).get("items", []))
except (ValueError, TypeError) as ex:
return -1
return current_pod_count
except Exception as ex:
print('Failed getting number of used pods: {}'.format(ex))
return -2
def run_one_task(self, queue: Text, task_id: Text, worker_args=None, **_):
print('Pulling task {} launching on kubernetes cluster'.format(task_id))
task_data = self._session.api_client.tasks.get_all(id=[task_id])[0]
@@ -263,24 +369,28 @@ class K8sIntegration(Worker):
# push task into the k8s queue, so we have visibility on pending tasks in the k8s scheduler
try:
print('Pushing task {} into temporary pending queue'.format(task_id))
self._session.api_client.tasks.reset(task_id)
self._session.api_client.tasks.enqueue(task_id, queue=self.k8s_pending_queue_name,
status_reason='k8s pending scheduler')
res = self._session.api_client.tasks.stop(task_id, force=True)
res = self._session.api_client.tasks.enqueue(
task_id,
queue=self.k8s_pending_queue_name,
status_reason='k8s pending scheduler',
)
if res.meta.result_code != 200:
raise Exception(res.meta.result_msg)
except Exception as e:
self.log.error("ERROR: Could not push back task [{}] to k8s pending queue [{}], error: {}".format(
task_id, self.k8s_pending_queue_name, e))
return
if task_data.execution.docker_cmd:
docker_cmd = task_data.execution.docker_cmd
else:
docker_cmd = str(ENV_DOCKER_IMAGE.get() or
self._session.config.get("agent.default_docker.image", "nvidia/cuda"))
# take the first part, this is the docker image name (not arguments)
docker_parts = docker_cmd.split()
docker_image = docker_parts[0]
docker_args = docker_parts[1:] if len(docker_parts) > 1 else []
container = get_task_container(self._session, task_id)
if not container.get('image'):
container['image'] = str(
ENV_DOCKER_IMAGE.get() or self._session.config.get("agent.default_docker.image", "nvidia/cuda")
)
container['arguments'] = self._session.config.get("agent.default_docker.arguments", None)
set_task_container(
self._session, task_id, docker_image=container['image'], docker_arguments=container['arguments']
)
# get the clearml.conf encoded file
# noinspection PyProtectedMember
@@ -303,20 +413,22 @@ class K8sIntegration(Worker):
except Exception:
queue_name = 'k8s'
# conform queue name to k8s standards
safe_queue_name = queue_name.lower().strip()
safe_queue_name = re.sub(r'\W+', '', safe_queue_name).replace('_', '').replace('-', '')
# Search for a free pod number
pod_count = 0
pod_number = self.base_pod_num
while self.ports_mode:
while self.ports_mode or self.max_pods_limit:
pod_number = self.base_pod_num + pod_count
kubectl_cmd_new = "kubectl get pods -l {pod_label},{agent_label} -n {namespace}".format(
pod_label=self.LIMIT_POD_LABEL.format(pod_number=pod_number),
agent_label=self.AGENT_LABEL,
namespace=self.namespace,
)
if self.ports_mode:
kubectl_cmd_new = "kubectl get pods -l {pod_label},{agent_label} -n {namespace}".format(
pod_label=self.LIMIT_POD_LABEL.format(pod_number=pod_number),
agent_label=self._get_agent_label(),
namespace=self.namespace,
)
else:
kubectl_cmd_new = "kubectl get pods -l {agent_label} -n {namespace} -o json".format(
agent_label=self._get_agent_label(),
namespace=self.namespace,
)
process = subprocess.Popen(kubectl_cmd_new.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
output, error = process.communicate()
output = '' if not output else output if isinstance(output, str) else output.decode('utf-8')
@@ -325,38 +437,68 @@ class K8sIntegration(Worker):
if not output:
# No such pod exist so we can use the pod_number we found
break
if pod_count >= self.num_of_services - 1:
# All pod numbers are taken, exit
if self.max_pods_limit:
try:
current_pod_count = len(json.loads(output).get("items", []))
except (ValueError, TypeError) as ex:
self.log.warning(
"K8S Glue pods monitor: Failed parsing kubectl output:\n{}\ntask '{}' "
"will be enqueued back to queue '{}'\nEx: {}".format(
output, task_id, queue, ex
)
)
self._session.api_client.tasks.stop(task_id, force=True)
self._session.api_client.tasks.enqueue(task_id, queue=queue, status_reason='kubectl parsing error')
return
max_count = self.max_pods_limit
else:
current_pod_count = pod_count
max_count = self.num_of_services - 1
if current_pod_count >= max_count:
# All pods are taken, exit
self.log.debug(
"kubectl last result: {}\n{}".format(error, output))
self.log.warning(
"kubectl last result: {}\n{}\nAll k8s services are in use, task '{}' "
"All k8s services are in use, task '{}' "
"will be enqueued back to queue '{}'".format(
error, output, task_id, queue
task_id, queue
)
)
self._session.api_client.tasks.reset(task_id)
self._session.api_client.tasks.stop(task_id, force=True)
self._session.api_client.tasks.enqueue(
task_id, queue=queue, status_reason='k8s max pod limit (no free k8s service)')
return
elif self.max_pods_limit:
# max pods limit hasn't reached yet, so we can create the pod
break
pod_count += 1
labels = ([self.LIMIT_POD_LABEL.format(pod_number=pod_number)] if self.ports_mode else []) + [self.AGENT_LABEL]
labels = ([self.LIMIT_POD_LABEL.format(pod_number=pod_number)] if self.ports_mode else []) + \
[self._get_agent_label()]
labels.append("clearml-agent-queue={}".format(self._safe_k8s_label_value(queue)))
labels.append("clearml-agent-queue-name={}".format(self._safe_k8s_label_value(queue_name)))
if self.ports_mode:
print("Kubernetes scheduling task id={} on pod={} (pod_count={})".format(task_id, pod_number, pod_count))
else:
print("Kubernetes scheduling task id={}".format(task_id))
kubectl_kwargs = dict(
create_clearml_conf=create_clearml_conf,
labels=labels,
docker_image=container['image'],
docker_args=container['arguments'],
docker_bash=container.get('setup_shell_script'),
task_id=task_id,
queue=queue
)
if self.template_dict:
output, error = self._kubectl_apply(
create_clearml_conf=create_clearml_conf,
labels=labels, docker_image=docker_image, docker_args=docker_args,
task_id=task_id, queue=queue, queue_name=safe_queue_name)
output, error = self._kubectl_apply(**kubectl_kwargs)
else:
output, error = self._kubectl_run(
create_clearml_conf=create_clearml_conf,
labels=labels, docker_image=docker_cmd,
task_data=task_data,
task_id=task_id, queue=queue, queue_name=safe_queue_name)
output, error = self._kubectl_run(task_data=task_data, **kubectl_kwargs)
error = '' if not error else (error if isinstance(error, str) else error.decode('utf-8'))
output = '' if not output else (output if isinstance(output, str) else output.decode('utf-8'))
@@ -390,40 +532,66 @@ class K8sIntegration(Worker):
**user_props
)
def _parse_docker_args(self, docker_args):
# type: (list) -> dict
kube_args = {'env': []}
while docker_args:
cmd = docker_args.pop().strip()
if cmd in ('-e', '--env',):
env = docker_args.pop().strip()
key, value = env.split('=', 1)
kube_args[key] += {key: value}
def _get_docker_args(self, docker_args, flags, target=None, convert=None):
# type: (List[str], Collection[str], Optional[str], Callable[[str], Any]) -> Union[dict, List[str]]
"""
Get docker args matching specific flags.
:argument docker_args: List of docker argument strings (flags and values)
:argument flags: List of flags/names to intercept (e.g. "--env" etc.)
:argument target: Controls return format. If provided, returns a dict with a target field containing a list
of result strings, otherwise returns a list of result strings
:argument convert: Optional conversion function for each result string
"""
args = docker_args[:] if docker_args else []
results = []
while args:
cmd = args.pop(0).strip()
if cmd in flags:
env = args.pop(0).strip()
if convert:
env = convert(env)
results.append(env)
else:
self.log.warning('skipping docker argument {} (only -e --env supported)'.format(cmd))
return kube_args
if target:
return {target: results} if results else {}
return results
def _kubectl_apply(self, create_clearml_conf, docker_image, docker_args, labels, queue, task_id, queue_name):
def _kubectl_apply(self, create_clearml_conf, docker_image, docker_args, docker_bash, labels, queue, task_id):
template = deepcopy(self.template_dict)
template.setdefault('apiVersion', 'v1')
template['kind'] = 'Pod'
template.setdefault('metadata', {})
name = 'clearml-{queue}-id-{task_id}'.format(queue=queue_name, task_id=task_id)
name = 'clearml-id-{task_id}'.format(task_id=task_id)
template['metadata']['name'] = name
template.setdefault('spec', {})
template['spec'].setdefault('containers', [])
template['spec'].setdefault('restartPolicy', 'Never')
if labels:
labels_dict = dict(pair.split('=', 1) for pair in labels)
template['metadata'].setdefault('labels', {})
template['metadata']['labels'].update(labels_dict)
container = self._parse_docker_args(docker_args)
container = self._get_docker_args(
docker_args,
target="env",
flags={"-e", "--env"},
convert=lambda env: {'name': env.partition("=")[0], 'value': env.partition("=")[2]},
)
container_bash_script = [self.container_bash_script] if isinstance(self.container_bash_script, str) \
else self.container_bash_script
extra_docker_bash_script = '\n'.join(self._session.config.get("agent.extra_docker_shell_script", None) or [])
if docker_bash:
extra_docker_bash_script += '\n' + str(docker_bash) + '\n'
script_encoded = '\n'.join(
['#!/bin/bash', ] +
[line.format(extra_bash_init_cmd=self.extra_bash_init_script or '', task_id=task_id)
[line.format(extra_bash_init_cmd=self.extra_bash_init_script or '',
task_id=task_id,
extra_docker_bash_script=extra_docker_bash_script)
for line in container_bash_script])
create_init_script = \
@@ -433,18 +601,23 @@ class K8sIntegration(Worker):
script_encoded.encode('ascii')
).decode('ascii'))
container = merge_dicts(
# Notice: we always leave with exit code 0, so pods are never restarted
container = self._merge_containers(
container,
dict(name=name, image=docker_image,
command=['/bin/bash'],
args=['-c', '{} ; {}'.format(create_clearml_conf, create_init_script)])
args=['-c', '{} ; {} ; exit 0'.format(create_clearml_conf, create_init_script)])
)
if template['spec']['containers']:
template['spec']['containers'][0] = merge_dicts(template['spec']['containers'][0], container)
template['spec']['containers'][0] = self._merge_containers(template['spec']['containers'][0], container)
else:
template['spec']['containers'].append(container)
if self._docker_force_pull:
for c in template['spec']['containers']:
c.setdefault('imagePullPolicy', 'Always')
fp, yaml_file = tempfile.mkstemp(prefix='clearml_k8stmpl_', suffix='.yml')
os.close(fp)
with open(yaml_file, 'wt') as f:
@@ -472,14 +645,18 @@ class K8sIntegration(Worker):
return output, error
def _kubectl_run(self, create_clearml_conf, docker_image, labels, queue, task_data, task_id, queue_name):
def _kubectl_run(
self, create_clearml_conf, docker_image, docker_args, docker_bash, labels, queue, task_data, task_id
):
if callable(self.kubectl_cmd):
kubectl_cmd = self.kubectl_cmd(task_id, docker_image, queue, task_data, queue_name)
kubectl_cmd = self.kubectl_cmd(task_id, docker_image, docker_args, queue, task_data)
else:
kubectl_cmd = self.kubectl_cmd.format(
queue_name=queue_name,
task_id=task_id,
docker_image=docker_image,
docker_args=" ".join(self._get_docker_args(
docker_args, flags={"-e", "--env"}, convert=lambda env: '--env={}'.format(env))
),
queue_id=queue,
namespace=self.namespace,
)
@@ -495,6 +672,9 @@ class K8sIntegration(Worker):
if self.pod_requests:
kubectl_cmd += ['--requests', ",".join(self.pod_requests)]
if self._docker_force_pull and not any(x.startswith("--image-pull-policy=") for x in kubectl_cmd):
kubectl_cmd += ["--image-pull-policy='always'"]
container_bash_script = [self.container_bash_script] if isinstance(self.container_bash_script, str) \
else self.container_bash_script
container_bash_script = ' ; '.join(container_bash_script)
@@ -506,7 +686,10 @@ class K8sIntegration(Worker):
"/bin/sh",
"-c",
"{} ; {}".format(create_clearml_conf, container_bash_script.format(
extra_bash_init_cmd=self.extra_bash_init_script, task_id=task_id)),
extra_bash_init_cmd=self.extra_bash_init_script or "",
extra_docker_bash_script=docker_bash or "",
task_id=task_id
)),
]
process = subprocess.Popen(kubectl_cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
output, error = process.communicate()
@@ -536,10 +719,26 @@ class K8sIntegration(Worker):
_last_machine_update_ts = 0
while True:
# check if have pod limit, then check if we hit it.
if self.max_pods_limit:
current_pods = self._get_number_used_pods()
if current_pods >= self.max_pods_limit:
print("Maximum pod limit reached {}/{}, sleeping for {:.1f} seconds".format(
current_pods, self.max_pods_limit, self._polling_interval))
# delete old completed / failed pods
get_bash_output(
self.KUBECTL_DELETE_CMD.format(namespace=self.namespace, selector=self._get_agent_label())
)
# go to sleep
sleep(self._polling_interval)
continue
# iterate over queues (priority style, queues[0] is highest)
for queue in queues:
# delete old completed / failed pods
get_bash_output(self.KUBECTL_DELETE_CMD.format(namespace=self.namespace))
get_bash_output(
self.KUBECTL_DELETE_CMD.format(namespace=self.namespace, selector=self._get_agent_label())
)
# get next task in queue
try:
@@ -591,3 +790,27 @@ class K8sIntegration(Worker):
@classmethod
def get_ssh_server_bash(cls, ssh_port_number):
return ' ; '.join(line.format(port=ssh_port_number) for line in cls.BASH_INSTALL_SSH_CMD)
@staticmethod
def _merge_containers(c1, c2):
def merge_env(k, d1, d2, not_set):
if k != "env":
return not_set
# Merge environment lists, second list overrides first
return list({
item['name']: item for envs in (d1, d2) for item in envs
}.values())
return merge_dicts(
c1, c2, custom_merge_func=merge_env
)
@staticmethod
def _safe_k8s_label_value(value):
""" Conform string to k8s standards for a label value """
value = value.lower().strip()
value = re.sub(r'^[^A-Za-z0-9]+', '', value) # strip leading non-alphanumeric chars
value = re.sub(r'[^A-Za-z0-9]+$', '', value) # strip trailing non-alphanumeric chars
value = re.sub(r'\W+', '-', value) # allow only word chars (this removed "." which is supported, but nvm)
value = re.sub(r'-+', '-', value) # don't leave messy "--" after replacing previous chars
return value[:63]

View File

@@ -1,17 +1,23 @@
from typing import Callable, Dict, Any
from typing import Callable, Dict, Any, Optional
_not_set = object()
def filter_keys(filter_, dct): # type: (Callable[[Any], bool], Dict) -> Dict
return {key: value for key, value in dct.items() if filter_(key)}
def merge_dicts(dict1, dict2):
def merge_dicts(dict1, dict2, custom_merge_func=None):
# type: (Any, Any, Optional[Callable[[str, Any, Any, Any], Any]]) -> Any
""" Recursively merges dict2 into dict1 """
if not isinstance(dict1, dict) or not isinstance(dict2, dict):
return dict2
for k in dict2:
if k in dict1:
dict1[k] = merge_dicts(dict1[k], dict2[k])
res = None
if custom_merge_func:
res = custom_merge_func(k, dict1[k], dict2[k], _not_set)
dict1[k] = merge_dicts(dict1[k], dict2[k], custom_merge_func) if res is _not_set else res
else:
dict1[k] = dict2[k]
return dict1

View File

@@ -421,4 +421,8 @@ def get_driver_cuda_version():
except BaseException:
return None
# for some reason we get CUDA version 11020 instead of 11200, so this is the fix
if cuda_version and len(cuda_version) >= 4 and cuda_version[2] == '0' and cuda_version[3] != '0':
return cuda_version[:2]+cuda_version[3]
return cuda_version[:3] if cuda_version else None

View File

@@ -152,7 +152,14 @@ class FolderCache(object):
for f in source_folder.glob('*'):
if f.name in exclude_sub_folders:
continue
shutil.copytree(src=f.as_posix(), dst=(temp_folder / f.name).as_posix(), symlinks=True)
if f.is_dir():
shutil.copytree(
src=f.as_posix(), dst=(temp_folder / f.name).as_posix(),
symlinks=True, ignore_dangling_symlinks=True)
else:
shutil.copy(
src=f.as_posix(), dst=(temp_folder / f.name).as_posix(),
follow_symlinks=False)
# rename the target folder
target_cache_folder = self._cache_folder / '.'.join(keys)

View File

@@ -3,11 +3,13 @@ from __future__ import unicode_literals
import abc
from collections import OrderedDict
from contextlib import contextmanager
from typing import Text, Iterable, Union, Optional, Dict, List
from pathlib2 import Path
from hashlib import md5
from typing import Text, Iterable, Union, Optional, Dict, List
import six
from pathlib2 import Path
from clearml_agent.definitions import ENV_VENV_CACHE_PATH
from clearml_agent.helper.base import mkstemp, safe_remove_file, join_lines, select_for_platform
from clearml_agent.helper.console import ensure_binary
from clearml_agent.helper.os.folder_cache import FolderCache
@@ -239,6 +241,9 @@ class PackageManager(object):
if p.strip(strip_chars) and not p.strip(strip_chars).startswith('#')])
if not pip_reqs and not conda_reqs:
continue
# do not process "-r" or "--requirement" because we cannot know what we have in the git repo.
if any(r.strip().startswith('-r ') or r.strip().startswith('--requirement ') for r in pip_reqs):
continue
hash_text = '{class_type}\n{docker_cmd}\n{cuda_ver}\n{python_version}\n{pip_reqs}\n{conda_reqs}'.format(
class_type=str(cls),
docker_cmd=str(docker_cmd or ''),
@@ -252,7 +257,7 @@ class PackageManager(object):
def _get_cache_manager(self):
if not self._cache_manager:
cache_folder = self.session.config.get(self._config_cache_folder, None)
cache_folder = ENV_VENV_CACHE_PATH.get() or self.session.config.get(self._config_cache_folder, None)
if not cache_folder:
return None

View File

@@ -189,14 +189,6 @@ class CondaAPI(PackageManager):
if conda_env.is_file() and not is_windows_platform():
self.source = self.pip.source = CommandSequence(('source', conda_env.as_posix()), self.source)
# install cuda toolkit
# noinspection PyBroadException
try:
cuda_version = float(int(self.session.config['agent.cuda_version'])) / 10.0
if cuda_version > 0:
self._install('cudatoolkit={:.1f}'.format(cuda_version))
except Exception:
pass
return self
def _init_existing_environment(self, conda_pre_build_env_path):
@@ -456,7 +448,9 @@ class CondaAPI(PackageManager):
requirements['conda'] = requirements['conda'].split('\n')
has_torch = False
has_matplotlib = False
has_cudatoolkit = False
try:
# notice this is an integer version: 112 (means 11.2)
cuda_version = int(self.session.config.get('agent.cuda_version', 0))
except:
cuda_version = 0
@@ -488,6 +482,19 @@ class CondaAPI(PackageManager):
if '.' not in m.specs[0][1]:
continue
if m.name.lower() == 'cudatoolkit':
# skip cuda if we are running on CPU
if not cuda_version:
continue
has_cudatoolkit = True
# cuda version, only major.minor
requested_cuda_version = '.'.join(m.specs[0][1].split('.')[:2])
# make sure that the cuda_version we support can install the requested cuda (major version)
if int(float(requested_cuda_version)) > int(float(cuda_version)/10.0):
continue
m.specs = [(m.specs[0][0], str(requested_cuda_version)), ]
conda_supported_req_names.append(m.name.lower())
if m.req.name.lower() == 'matplotlib':
has_matplotlib = True
@@ -504,6 +511,10 @@ class CondaAPI(PackageManager):
reqs.append(m)
if not has_cudatoolkit and cuda_version:
m = MarkerRequirement(Requirement("cudatoolkit == {}".format(float(cuda_version) / 10.0)))
reqs.append(m)
# if we have a conda list, the rest should be installed with pip,
# this means any experiment that was executed with pip environment,
# will be installed using pip
@@ -559,8 +570,12 @@ class CondaAPI(PackageManager):
# change _ to - in name but not the prefix _ (as this is conda prefix)
if r.name and not r.name.startswith('_') and not requirements.get('conda', None):
r.name = r.name.replace('_', '-')
# remove .post from version numbers, it fails ~= version, and change == to ~=
if r.specs and r.specs[0]:
if has_cudatoolkit and r.specs and len(r.specs[0]) > 1 and r.name == 'cudatoolkit':
# select specific cuda version if it came from the requirements
r.specs = [(r.specs[0][0].replace('==', '='), r.specs[0][1].split('.post')[0])]
elif r.specs and r.specs[0] and len(r.specs[0]) > 1:
# remove .post from version numbers it fails with ~= version, and change == to ~=
r.specs = [(r.specs[0][0].replace('==', '~='), r.specs[0][1].split('.post')[0])]
while reqs:
@@ -573,6 +588,8 @@ class CondaAPI(PackageManager):
conda_env['dependencies'] = [clean_ver(r) for r in reqs]
with self.temp_file("conda_env", yaml.dump(conda_env), suffix=".yml") as name:
print('Conda: Trying to install requirements:\n{}'.format(conda_env['dependencies']))
if self.session.debug_mode:
print('{}:\n{}'.format(name, yaml.dump(conda_env)))
result = self._run_command(
("env", "update", "-p", self.path, "--file", name)
)
@@ -603,6 +620,8 @@ class CondaAPI(PackageManager):
pip_req_str = [r.tostr() for r in pip_requirements if r.name not in ('pip', 'virtualenv', )]
print('Conda: Installing requirements: step 2 - using pip:\n{}'.format(pip_req_str))
PackageManager._selected_manager = self.pip
if self.session.debug_mode:
print('pip requirements.txt:\n{}'.format('\n'.join(pip_req_str)))
self.pip.load_requirements({'pip': '\n'.join(pip_req_str)})
except Exception as e:
print(e)
@@ -646,12 +665,16 @@ class CondaAPI(PackageManager):
ansi_escape = re.compile(r'(?:\x1B[@-_]|[\x80-\x9F])[0-?]*[ -/]*[@-~]')
return ansi_escape.sub('', line)
# make sure we are not running it with our own PYTHONPATH
env = dict(**os.environ)
env.pop('PYTHONPATH', None)
command = Argv(*command) # type: Executable
if not raw:
command = (self.conda,) + command + ("--quiet", "--json")
try:
print('Executing Conda: {}'.format(command.serialize()))
result = command.get_output(stdin=DEVNULL, **kwargs)
result = command.get_output(stdin=DEVNULL, env=env, **kwargs)
if self.session.debug_mode:
print(result)
except Exception as e:

View File

@@ -2,6 +2,8 @@ import re
from collections import OrderedDict
from typing import Text
from pathlib2 import Path
from .base import PackageManager
from .requirements import SimpleSubstitution
from ..base import safe_furl as furl
@@ -10,22 +12,26 @@ from ..base import safe_furl as furl
class ExternalRequirements(SimpleSubstitution):
name = "external_link"
cwd = None
def __init__(self, *args, **kwargs):
super(ExternalRequirements, self).__init__(*args, **kwargs)
self.post_install_req = []
self.post_install_req_lookup = OrderedDict()
self.post_install_local_req_lookup = OrderedDict()
def match(self, req):
# match local folder building:
# noinspection PyBroadException
try:
if not req.name and req.req and not req.req.editable and not req.req.vcs and \
req.req.line and req.req.line.strip().split('#')[0] and \
not req.req.line.strip().split('#')[0].lower().endswith('.whl'):
return True
except Exception:
pass
if self.is_local_folder_package(req):
# noinspection PyBroadException
try:
folder_path = req.req.line.strip().split('#')[0].strip()
if self.cwd and not Path(folder_path).is_absolute():
folder_path = (Path(self.cwd) / Path(folder_path)).absolute().as_posix()
self.post_install_local_req_lookup['file://{}'.format(folder_path)] = req.req.line
except Exception:
pass
return True
# match both editable or code or unparsed
if not (not req.name or req.req and (req.req.editable or req.req.vcs)):
@@ -45,31 +51,7 @@ class ExternalRequirements(SimpleSubstitution):
except:
freeze_base = ''
req_line = req.tostr(markers=False)
if req_line.strip().startswith('-e ') or req_line.strip().startswith('--editable'):
req_line = re.sub(r'^(-e|--editable=?)\s*', '', req_line, count=1)
if req.req.vcs and req_line.startswith('git+'):
try:
url_no_frag = furl(req_line)
url_no_frag.set(fragment=None)
# reverse replace
fragment = req_line[::-1].replace(url_no_frag.url[::-1], '', 1)[::-1]
vcs_url = req_line[4:]
# reverse replace
vcs_url = vcs_url[::-1].replace(fragment[::-1], '', 1)[::-1]
from ..repo import Git
vcs = Git(session=session, url=vcs_url, location=None, revision=None)
vcs._set_ssh_url()
new_req_line = 'git+{}{}'.format(vcs.url_with_auth, fragment)
if new_req_line != req_line:
furl_line = furl(new_req_line)
print('Replacing original pip vcs \'{}\' with \'{}\''.format(
req_line,
furl_line.set(password='xxxxxx').tostr() if furl_line.password else new_req_line))
req_line = new_req_line
except Exception:
print('WARNING: Failed parsing pip git install, using original line {}'.format(req_line))
req_line = self._add_vcs_credentials(req, session)
# if we have older pip version we have to make sure we replace back the package name with the
# git repository link. In new versions this is supported and we get "package @ git+https://..."
@@ -89,6 +71,43 @@ class ExternalRequirements(SimpleSubstitution):
if not PackageManager.out_of_scope_install_package(req_line):
raise ValueError("Failed installing GIT/HTTPs package \'{}\'".format(req_line))
@staticmethod
def _add_vcs_credentials(req, session):
req_line = req.tostr(markers=False)
if req_line.strip().startswith('-e ') or req_line.strip().startswith('--editable'):
req_line = re.sub(r'^(-e|--editable=?)\s*', '', req_line, count=1)
if req.req.vcs and req_line.startswith('git+'):
try:
url_no_frag = furl(req_line)
url_no_frag.set(fragment=None)
# reverse replace
fragment = req_line[::-1].replace(url_no_frag.url[::-1], '', 1)[::-1]
vcs_url = req_line[4:]
# reverse replace
vcs_url = vcs_url[::-1].replace(fragment[::-1], '', 1)[::-1]
# remove ssh:// or git:// prefix for git detection and credentials
scheme = ''
if vcs_url and (vcs_url.startswith('ssh://') or vcs_url.startswith('git://')):
scheme = 'ssh://' # notice git:// is actually ssh://
vcs_url = vcs_url[6:]
from ..repo import Git
vcs = Git(session=session, url=vcs_url, location=None, revision=None)
vcs._set_ssh_url()
new_req_line = 'git+{}{}{}'.format(
'' if scheme and '://' in vcs.url else scheme,
vcs.url_with_auth, fragment
)
if new_req_line != req_line:
furl_line = furl(new_req_line)
print('Replacing original pip vcs \'{}\' with \'{}\''.format(
req_line,
furl_line.set(password='xxxxxx').tostr() if furl_line.password else new_req_line))
req_line = new_req_line
except Exception:
print('WARNING: Failed parsing pip git install, using original line {}'.format(req_line))
return req_line
def replace(self, req):
"""
Replace a requirement
@@ -113,15 +132,40 @@ class ExternalRequirements(SimpleSubstitution):
if r not in self.post_install_req_lookup]
list_of_requirements[k] += [self.post_install_req_lookup.get(r, '')
for r in self.post_install_req_lookup.keys() if r in original_requirements]
if self.post_install_local_req_lookup:
original_requirements = list_of_requirements[k]
list_of_requirements[k] = [
r for r in original_requirements
if len(r.split('@', 1)) != 2 or r.split('@', 1)[1].strip() not in self.post_install_local_req_lookup]
list_of_requirements[k] += [
self.post_install_local_req_lookup.get(r.split('@', 1)[1].strip(), '')
for r in original_requirements
if len(r.split('@', 1)) == 2 and r.split('@', 1)[1].strip() in self.post_install_local_req_lookup]
return list_of_requirements
@classmethod
def is_local_folder_package(cls, req):
# noinspection PyBroadException
try:
if not req.name and req.req and not req.req.editable and not req.req.vcs and \
req.req.line and req.req.line.strip().split('#')[0] and \
not req.req.line.strip().split('#')[0].lower().endswith('.whl') and \
not (req.req.line.strip().startswith('-r ') or req.req.line.strip().startswith('--requirement ')):
return True
except Exception:
pass
return False
class OnlyExternalRequirements(ExternalRequirements):
def __init__(self, *args, **kwargs):
super(OnlyExternalRequirements, self).__init__(*args, **kwargs)
def match(self, req):
return not super(OnlyExternalRequirements, self).match(req)
return True
def replace(self, req):
"""
@@ -130,4 +174,6 @@ class OnlyExternalRequirements(ExternalRequirements):
"""
# Do not store the skipped requirements
# mark skip package
if super(OnlyExternalRequirements, self).match(req):
return self._add_vcs_credentials(req, self._session)
return Text('')

View File

@@ -1,5 +1,7 @@
import os
import sys
from itertools import chain
from pathlib import Path
from typing import Text, Optional
from clearml_agent.definitions import PIP_EXTRA_INDICES, PROGRAM_NAME
@@ -18,7 +20,7 @@ class SystemPip(PackageManager):
Program interface to the system pip.
"""
super(SystemPip, self).__init__()
self._bin = interpreter or sys.executable
self._bin = Path(interpreter or sys.executable)
self.session = session
@property
@@ -82,7 +84,10 @@ class SystemPip(PackageManager):
:param kwargs: kwargs for get_output/check_output command
"""
command = self._make_command(command)
return (command.get_output if output else command.check_call)(stdin=DEVNULL, **kwargs)
# make sure we are not running it with our own PYTHONPATH
env = dict(**os.environ)
env.pop('PYTHONPATH', None)
return (command.get_output if output else command.check_call)(stdin=DEVNULL, env=env, **kwargs)
def _make_command(self, command):
return Argv(self.bin, '-m', 'pip', '--disable-pip-version-check', *command)

View File

@@ -263,6 +263,9 @@ class PytorchRequirement(SimpleSubstitution):
continue
if len(parts) < 5 or platform_wheel not in parts[4]:
continue
# yes this is for linux python 2.7 support, this is the only python 2.7 we support...
if py_ver and py_ver[0] == '2' and len(parts) > 3 and not parts[3].endswith('u'):
continue
# update the closest matched version (from above)
if not closest_v:
closest_v = v

View File

@@ -208,7 +208,11 @@ class SimpleVersion:
if not version_b:
return True
if not num_parts:
num_parts = max(len(version_a.split('.')), len(version_b.split('.')), )
if op == '~=':
num_parts = len(version_b.split('.')) - 1
num_parts = max(num_parts, 2)
op = '=='
ignore_sub_versions = True
@@ -245,6 +249,16 @@ class SimpleVersion:
return version_a_key < version_b_key
raise ValueError('Unrecognized comparison operator [{}]'.format(op))
@classmethod
def max_version(cls, version_a, version_b):
return version_a if cls.compare_versions(
version_a=version_a, op='>=', version_b=version_b, num_parts=None) else version_b
@classmethod
def min_version(cls, version_a, version_b):
return version_a if cls.compare_versions(
version_a=version_a, op='<=', version_b=version_b, num_parts=None) else version_b
@staticmethod
def _parse_letter_version(
letter, # type: str
@@ -313,6 +327,77 @@ class SimpleVersion:
return ()
def compare_version_rules(specs_a, specs_b):
# specs_a/b are a list of tuples: [('==', '1.2.3'), ] or [('>=', '1.2'), ('<', '1.3')]
# section definition:
class Section(object):
def __init__(self, left=None, left_eq=False, right=None, right_eq=False):
self.left, self.left_eq, self.right, self.right_eq = left, left_eq, right, right_eq
# first create a list of in/out sections for each spec
# >, >= are left rule
# <, <= are right rule
# ~= x.y.z is converted to: >= x.y and < x.y+1
# ==/=== are converted to: >= and <=
# != x.y.z will split a section into: left < x.y.z and right > x.y.z
def create_section(specs):
section = Section()
for op, v in specs:
a = section
if op == '>':
a.left = v
a.left_eq = False
elif op == '>=':
a.left = v
a.left_eq = True
elif op == '<':
a.right = v
a.right_eq = False
elif op == '<=':
a.right = v
a.right_eq = True
elif op == '==':
a.left = v
a.left_eq = True
a.right = v
a.right_eq = True
elif op == '~=':
new_v = v.split('.')
a_left = '.'.join(new_v[:-1])
a.left = a_left if not a.left else SimpleVersion.max_version(a_left, a.left)
a.left_eq = True
a_right = '.'.join(new_v[:-2] + [str(int(new_v[-2])+1)])
a.right = a_right if not a.right else SimpleVersion.min_version(a_right, a.right)
a.right_eq = False if a.right == a_right else a.right_eq
return section
section_a = create_section(specs_a)
section_b = create_section(specs_b)
i = Section()
# then we have a list of sections for spec A/B
if section_a.left == section_b.left:
i.left = section_a.left
i.left_eq = section_a.left_eq and section_b.left_eq
else:
i.left = SimpleVersion.max_version(section_a.left, section_b.left)
i.left_eq = section_a.left_eq if i.left == section_a.left else section_b.left_eq
if section_a.right == section_b.right:
i.right = section_a.right
i.right_eq = section_a.right_eq and section_b.right_eq
else:
i.right = SimpleVersion.min_version(section_a.right, section_b.right)
i.right_eq = section_a.right_eq if i.right == section_a.right else section_b.right_eq
# return true if any section from A intersects a section from B
valid = True
valid &= SimpleVersion.compare_versions(
version_a=i.left, op='<=' if i.left_eq else '<', version_b=i.right, num_parts=None)
valid &= SimpleVersion.compare_versions(
version_a=i.right, op='>=' if i.left_eq else '>', version_b=i.left, num_parts=None)
return valid
@six.add_metaclass(ABCMeta)
class RequirementSubstitution(object):
@@ -448,10 +533,14 @@ class RequirementsManager(object):
self.translator = RequirementsTranslator(session, interpreter=base_interpreter,
cache_dir=pip_cache_dir.as_posix())
self._base_interpreter = base_interpreter
self._cwd = None
def register(self, cls): # type: (Type[RequirementSubstitution]) -> None
self.handlers.append(cls(self._session))
def set_cwd(self, cwd):
self._cwd = str(cwd) if cwd else None
def _replace_one(self, req): # type: (MarkerRequirement) -> Optional[Text]
match = re.search(r';\s*(.*)', Text(req))
if match:
@@ -464,19 +553,9 @@ class RequirementsManager(object):
return None
def replace(self, requirements): # type: (Text) -> Text
def safe_parse(req_str):
try:
return next(parse(req_str))
except Exception as ex:
return Requirement(req_str)
parsed_requirements = self.parse_requirements_section_to_marker_requirements(
requirements=requirements, cwd=self._cwd)
parsed_requirements = tuple(
map(
MarkerRequirement,
[safe_parse(line) for line in (requirements.splitlines()
if isinstance(requirements, six.text_type) else requirements)]
)
)
if not parsed_requirements:
# return the original requirements just in case
return requirements
@@ -609,3 +688,24 @@ class RequirementsManager(object):
return (normalize_cuda_version(cuda_version or 0),
normalize_cuda_version(cudnn_version or 0))
@staticmethod
def parse_requirements_section_to_marker_requirements(requirements, cwd=None):
def safe_parse(req_str):
# noinspection PyBroadException
try:
return list(parse(req_str, cwd=cwd))
except Exception as ex:
return [Requirement(req_str)]
if not requirements:
return tuple()
parsed_requirements = tuple(
map(
MarkerRequirement,
[r for line in (requirements.splitlines() if isinstance(requirements, str) else requirements)
for r in safe_parse(line)]
)
)
return parsed_requirements

View File

@@ -42,20 +42,31 @@ def get_bash_output(cmd, strip=False, stderr=subprocess.STDOUT, stdin=False):
return output if not strip or not output else output.strip()
def terminate_process(pid, timeout=10.):
def terminate_process(pid, timeout=10., ignore_zombie=True, include_children=False):
# noinspection PyBroadException
try:
proc = psutil.Process(pid)
children = proc.children(recursive=True) if include_children else []
proc.terminate()
cnt = 0
while proc.is_running() and cnt < timeout:
while proc.is_running() and (ignore_zombie or proc.status() != 'zombie') and cnt < timeout:
sleep(1.)
cnt += 1
proc.terminate()
# terminate children
for c in children:
c.terminate()
cnt = 0
while proc.is_running() and cnt < timeout:
while proc.is_running() and (ignore_zombie or proc.status() != 'zombie') and cnt < timeout:
sleep(1.)
cnt += 1
# kill children
for c in children:
c.kill()
proc.kill()
except Exception:
pass
@@ -66,9 +77,8 @@ def terminate_process(pid, timeout=10.):
return True
def kill_all_child_processes(pid=None):
def kill_all_child_processes(pid=None, include_parent=True):
# get current process if pid not provided
include_parent = True
if not pid:
pid = os.getpid()
include_parent = False
@@ -84,6 +94,23 @@ def kill_all_child_processes(pid=None):
parent.kill()
def terminate_all_child_processes(pid=None, timeout=10., include_parent=True):
# get current process if pid not provided
if not pid:
pid = os.getpid()
include_parent = False
try:
parent = psutil.Process(pid)
except psutil.Error:
# could not find parent process id
return
for child in parent.children(recursive=False):
print('Terminating child process {}'.format(child.pid))
terminate_process(child.pid, timeout=timeout, ignore_zombie=False, include_children=True)
if include_parent:
terminate_process(parent.pid, timeout=timeout, ignore_zombie=False)
def get_docker_id(docker_cmd_contains):
try:
containers_running = get_bash_output(cmd='docker ps --no-trunc --format \"{{.ID}}: {{.Command}}\"')
@@ -103,9 +130,10 @@ def shutdown_docker_process(docker_cmd_contains=None, docker_id=None):
docker_id = get_docker_id(docker_cmd_contains=docker_cmd_contains)
if docker_id:
# we found our docker, stop it
get_bash_output(cmd='docker stop -t 1 {}'.format(docker_id))
return get_bash_output(cmd='docker stop -t 1 {}'.format(docker_id))
except Exception:
pass
return None
def commit_docker(container_name, docker_cmd_contains=None, docker_id=None, apply_change=None):
@@ -193,6 +221,7 @@ class Argv(Executable):
"""
self.argv = argv
self._log = kwargs.pop("log", None)
self._display_argv = kwargs.pop("display_argv", argv)
if not self._log:
self._log = logging.getLogger(__name__)
self._log.propagate = False
@@ -217,10 +246,10 @@ class Argv(Executable):
return self.argv
def __repr__(self):
return "<Argv{}>".format(self.argv)
return "<Argv{}>".format(self._display_argv)
def __str__(self):
return "Executing: {}".format(self.argv)
return "Executing: {}".format(self._display_argv)
def __iter__(self):
if is_windows_platform():

View File

@@ -274,6 +274,18 @@ class VCS(object):
url = parsed_url.url
return url
@classmethod
def rewrite_ssh_url(cls, url, port=None, username=None):
# type: (Text, Optional[int], Optional[str]) -> Text
"""
Rewrite SSH URL with custom port and username
"""
parsed_url = furl(url)
if parsed_url.scheme == "ssh":
parsed_url.username = username or "git"
parsed_url.port = port or None
return parsed_url.url
def _set_ssh_url(self):
"""
Replace instance URL with SSH substitution result and report to log.
@@ -297,6 +309,20 @@ class VCS(object):
self.url, new_url))
self.url = new_url
return
# rewrite ssh URLs only if either ssh port or ssh user are forced in config
if parsed_url.scheme == "ssh" and (
self.session.config.get('agent.force_git_ssh_port', None) or
self.session.config.get('agent.force_git_ssh_user', None)
):
new_url = self.rewrite_ssh_url(
self.url,
port=self.session.config.get('agent.force_git_ssh_port', None),
username=self.session.config.get('agent.force_git_ssh_user', None)
)
if new_url != self.url:
print("Using SSH credentials - ssh url '{}' with ssh url '{}'".format(
self.url, new_url))
self.url = new_url
if not self.session.config.agent.translate_ssh:
return
@@ -456,7 +482,7 @@ class VCS(object):
parsed_url = furl(url)
except ValueError:
return url
if parsed_url.scheme in ["", "ssh"] or parsed_url.scheme.startswith("git"):
if parsed_url.scheme in ["", "ssh"] or (parsed_url.scheme or '').startswith("git"):
return parsed_url.url
config_user = ENV_AGENT_GIT_USER.get() or config.get("agent.{}_user".format(cls.executable_name), None)
config_pass = ENV_AGENT_GIT_PASS.get() or config.get("agent.{}_pass".format(cls.executable_name), None)
@@ -591,7 +617,7 @@ def clone_repository_cached(session, execution, destination):
# mock lock
repo_lock = Lock()
repo_lock_timeout_sec = 300
repo_url = execution.repository # type: str
repo_url = execution.repository or '' # type: str
parsed_url = furl(repo_url)
no_password_url = parsed_url.copy().remove(password=True).url

View File

@@ -2,6 +2,7 @@ from __future__ import unicode_literals, division
import logging
import os
import shlex
from collections import deque
from itertools import starmap
from threading import Thread, Event
@@ -12,6 +13,7 @@ import attr
import psutil
from pathlib2 import Path
from clearml_agent.session import Session
from clearml_agent.definitions import ENV_WORKER_TAGS
try:
from .gpu import gpustat
@@ -59,6 +61,7 @@ class ResourceMonitor(object):
sample_frequency_per_sec=2.0,
report_frequency_sec=30.0,
first_report_sec=None,
worker_tags=None,
):
self.session = session
self.queue = deque(maxlen=1)
@@ -76,6 +79,9 @@ class ResourceMonitor(object):
self._gpustat_fail = 0
self._gpustat = gpustat
self._active_gpus = None
if not worker_tags and ENV_WORKER_TAGS.get():
worker_tags = shlex.split(ENV_WORKER_TAGS.get())
self._worker_tags = worker_tags
if os.environ.get('NVIDIA_VISIBLE_DEVICES') == 'none':
# NVIDIA_VISIBLE_DEVICES set to none, marks cpu_only flag
# active_gpus == False means no GPU reporting
@@ -118,6 +124,7 @@ class ResourceMonitor(object):
machine_stats=stats,
timestamp=(int(time()) * 1000),
worker=self._worker_id,
tags=self._worker_tags,
**self.get_report().to_dict()
)
log.debug("sending report: %s", report)

View File

@@ -50,7 +50,7 @@ DAEMON_ARGS = dict({
},
'--docker': {
'help': 'Run execution task inside a docker (v19.03 and above). Optional args <image> <arguments> or '
'specify default docker image in agent.default_docker.image / agent.default_docker.arguments'
'specify default docker image in agent.default_docker.image / agent.default_docker.arguments '
'use --gpus/--cpu-only (or set NVIDIA_VISIBLE_DEVICES) to limit gpu visibility for docker',
'nargs': '*',
'default': False,
@@ -83,6 +83,12 @@ DAEMON_ARGS = dict({
'type': int,
'default': None,
},
'--child-report-tags': {
'help': 'List of tags to send with the status reports from the worker that runs a task',
'nargs': '+',
'type': str,
'default': None,
},
'--create-queue': {
'help': 'Create requested queue if it does not exist already.',
'action': 'store_true',
@@ -98,8 +104,9 @@ DAEMON_ARGS = dict({
},
'--dynamic-gpus': {
'help': 'Allow to dynamically allocate gpus based on queue properties, '
'configure with \'--queues <queue_name>=<num_gpus>\'.'
' Example: \'--dynamic-gpus --queue dual_gpus=2 single_gpu=1\'',
'configure with \'--queue <queue_name>=<num_gpus>\'.'
' Example: \'--dynamic-gpus --gpus 0-3 --queue dual_gpus=2 single_gpu=1\''
' Example Opportunistic: \'--dynamic-gpus --gpus 0-3 --queue dual_gpus=2 max_quad_gpus=1-4 \'',
'action': 'store_true',
},
'--uptime': {
@@ -110,7 +117,7 @@ DAEMON_ARGS = dict({
'default': None,
},
'--downtime': {
'help': 'Specify uptime for clearml-agent in "<hours> <days>" format. for example, use "09-13 TUE" to set '
'help': 'Specify downtime for clearml-agent in "<hours> <days>" format. for example, use "09-13 TUE" to set '
'Tuesday\'s downtime to 09-13'
'Note: Make sure to have only on of uptime/downtime configuration and not both.',
'nargs': '*',
@@ -120,6 +127,10 @@ DAEMON_ARGS = dict({
'help': 'Print the worker\'s schedule (uptime properties, server\'s runtime properties and listening queues)',
'action': 'store_true',
},
'--use-owner-token': {
'help': 'Generate and use task owner token for the execution of the task',
'action': 'store_true',
}
}, **WORKER_ARGS)
COMMANDS = {

View File

@@ -229,26 +229,35 @@ class Session(_Session):
except:
pass
def print_configuration(self, remove_secret_keys=("secret", "pass", "token", "account_key")):
def print_configuration(
self,
remove_secret_keys=("secret", "pass", "token", "account_key", "contents"),
skip_value_keys=("environment", )
):
# remove all the secrets from the print
def recursive_remove_secrets(dictionary, secret_keys=()):
def recursive_remove_secrets(dictionary, secret_keys=(), empty_keys=()):
for k in list(dictionary):
for s in secret_keys:
if s in k:
dictionary.pop(k)
break
for s in empty_keys:
if s == k:
dictionary[k] = {key: '****' for key in dictionary[k]} \
if isinstance(dictionary[k], dict) else '****'
break
if isinstance(dictionary.get(k, None), dict):
recursive_remove_secrets(dictionary[k], secret_keys=secret_keys)
recursive_remove_secrets(dictionary[k], secret_keys=secret_keys, empty_keys=empty_keys)
elif isinstance(dictionary.get(k, None), (list, tuple)):
for item in dictionary[k]:
if isinstance(item, dict):
recursive_remove_secrets(item, secret_keys=secret_keys)
recursive_remove_secrets(item, secret_keys=secret_keys, empty_keys=empty_keys)
config = deepcopy(self.config.to_dict())
# remove the env variable, it's not important
config.pop('env', None)
if remove_secret_keys:
recursive_remove_secrets(config, secret_keys=remove_secret_keys)
if remove_secret_keys or skip_value_keys:
recursive_remove_secrets(config, secret_keys=remove_secret_keys, empty_keys=skip_value_keys)
# remove logging.loggers.urllib3.level from the print
try:
config['logging']['loggers']['urllib3'].pop('level', None)

View File

@@ -1 +1 @@
__version__ = '0.17.2'
__version__ = '1.2.0rc0'

14
docker/k8s-glue/README.md Normal file
View File

@@ -0,0 +1,14 @@
This folder contains an example docker and templates for running the k8s glue as a pod in a k8s cluster
Please note that ClearML credentials and server addresses should either be filled in the clearml.conf file before
building the glue docker or provided in the k8s-glue.yml template.
To run, you'll need to:
* Create a secret from pod_template.yml:
```bash
kubectl -n clearml create secret generic k8s-glue-pod-template --from-file=pod_template.yml
```
* Apply the k8s glue template:
```bash
kubectl -n clearml apply -f k8s-glue.yml
```

View File

@@ -0,0 +1,402 @@
# CLEARML-AGENT configuration file
api {
# Notice: 'host' is the api server (default port 8008), not the web server.
api_server: ""
web_server: ""
files_server: ""
# Override with os environment: CLEARML_API_ACCESS_KEY / CLEARML_API_SECRET_KEY
credentials {"access_key": "", "secret_key": ""}
}
# Set GIT user/pass credentials
# leave blank for GIT SSH credentials
agent.git_user=""
agent.git_pass=""
# extra_index_url: ["https://allegroai.jfrog.io/clearml/api/pypi/public/simple"]
agent.package_manager.extra_index_url= [
]
agent {
# unique name of this worker, if None, created based on hostname:process_id
# Override with os environment: CLEARML_WORKER_ID
# worker_id: "clearml-agent-machine1:gpu0"
worker_id: ""
# worker name, replaces the hostname when creating a unique name for this worker
# Override with os environment: CLEARML_WORKER_NAME
# worker_name: "clearml-agent-machine1"
worker_name: ""
# Set GIT user/pass credentials (if user/pass are set, GIT protocol will be set to https)
# leave blank for GIT SSH credentials (set force_git_ssh_protocol=true to force SSH protocol)
# git_user: ""
# git_pass: ""
# git_host: ""
# Force GIT protocol to use SSH regardless of the git url (Assumes GIT user/pass are blank)
force_git_ssh_protocol: false
# Force a specific SSH port when converting http to ssh links (the domain is kept the same)
# force_git_ssh_port: 0
# Force a specific SSH username when converting http to ssh links (the default username is 'git')
# force_git_ssh_user: git
# Set the python version to use when creating the virtual environment and launching the experiment
# Example values: "/usr/bin/python3" or "/usr/local/bin/python3.6"
# The default is the python executing the clearml_agent
python_binary: ""
# ignore any requested python version (Default: False, if a Task was using a
# specific python version and the system supports multiple python the agent will use the requested python version)
# ignore_requested_python_version: true
# select python package manager:
# currently supported pip and conda
# poetry is used if pip selected and repository contains poetry.lock file
package_manager: {
# supported options: pip, conda, poetry
type: pip,
# specify pip version to use (examples "<20", "==19.3.1", "", empty string will install the latest version)
pip_version: "<20.2",
# virtual environment inheres packages from system
system_site_packages: false,
# install with --upgrade
force_upgrade: false,
# additional artifact repositories to use when installing python packages
# extra_index_url: ["https://allegroai.jfrog.io/clearmlai/api/pypi/public/simple"]
# additional conda channels to use when installing with conda package manager
conda_channels: ["pytorch", "conda-forge", "defaults", ]
# If set to true, Task's "installed packages" are ignored,
# and the repository's "requirements.txt" is used instead
# force_repo_requirements_txt: false
# set the priority packages to be installed before the rest of the required packages
# priority_packages: ["cython", "numpy", "setuptools", ]
# set the optional priority packages to be installed before the rest of the required packages,
# In case a package installation fails, the package will be ignored,
# and the virtual environment process will continue
# priority_optional_packages: ["pygobject", ]
# set the post packages to be installed after all the rest of the required packages
# post_packages: ["horovod", ]
# set the optional post packages to be installed after all the rest of the required packages,
# In case a package installation fails, the package will be ignored,
# and the virtual environment process will continue
# post_optional_packages: []
# set to True to support torch nightly build installation,
# notice: torch nightly builds are ephemeral and are deleted from time to time
torch_nightly: false,
},
# target folder for virtual environments builds, created when executing experiment
venvs_dir = ~/.clearml/venvs-builds
# cached virtual environment folder
venvs_cache: {
# maximum number of cached venvs
max_entries: 10
# minimum required free space to allow for cache entry, disable by passing 0 or negative value
free_space_threshold_gb: 2.0
# unmark to enable virtual environment caching
# path: ~/.clearml/venvs-cache
},
# cached git clone folder
vcs_cache: {
enabled: true,
path: ~/.clearml/vcs-cache
},
# use venv-update in order to accelerate python virtual environment building
# Still in beta, turned off by default
venv_update: {
enabled: false,
},
# cached folder for specific python package download (used for pytorch package caching)
pip_download_cache {
enabled: true,
path: ~/.clearml/pip-download-cache
},
translate_ssh: true,
# reload configuration file every daemon execution
reload_config: false,
# pip cache folder mapped into docker, used for python package caching
docker_pip_cache = ~/.clearml/pip-cache
# apt cache folder mapped into docker, used for ubuntu package caching
docker_apt_cache = ~/.clearml/apt-cache
# optional arguments to pass to docker image
# these are local for this agent and will not be updated in the experiment's docker_cmd section
# extra_docker_arguments: ["--ipc=host", ]
# optional shell script to run in docker when started before the experiment is started
# extra_docker_shell_script: ["apt-get install -y bindfs", ]
# Install the required packages for opencv libraries (libsm6 libxext6 libxrender-dev libglib2.0-0),
# for backwards compatibility reasons, true as default,
# change to false to skip installation and decrease docker spin up time
# docker_install_opencv_libs: true
# optional uptime configuration, make sure to use only one of 'uptime/downtime' and not both.
# If uptime is specified, agent will actively poll (and execute) tasks in the time-spans defined here.
# Outside of the specified time-spans, the agent will be idle.
# Defined using a list of items of the format: "<hours> <days>".
# hours - use values 0-23, single values would count as start hour and end at midnight.
# days - use days in abbreviated format (SUN-SAT)
# use '-' for ranges and ',' to separate singular values.
# for example, to enable the workers every Sunday and Tuesday between 17:00-20:00 set uptime to:
# uptime: ["17-20 SUN,TUE"]
# optional downtime configuration, can be used only when uptime is not used.
# If downtime is specified, agent will be idle in the time-spans defined here.
# Outside of the specified time-spans, the agent will actively poll (and execute) tasks.
# Use the same format as described above for uptime
# downtime: []
# set to true in order to force "docker pull" before running an experiment using a docker image.
# This makes sure the docker image is updated.
docker_force_pull: false
default_docker: {
# default docker image to use when running in docker mode
image: "nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04"
# optional arguments to pass to docker image
# arguments: ["--ipc=host", ]
}
# set the OS environments based on the Task's Environment section before launching the Task process.
enable_task_env: false
# set the initial bash script to execute at the startup of any docker.
# all lines will be executed regardless of their exit code.
# {python_single_digit} is translated to 'python3' or 'python2' according to requested python version
# docker_init_bash_script = [
# "echo 'Binary::apt::APT::Keep-Downloaded-Packages \"true\";' > /etc/apt/apt.conf.d/docker-clean",
# "chown -R root /root/.cache/pip",
# "apt-get update",
# "apt-get install -y git libsm6 libxext6 libxrender-dev libglib2.0-0",
# "(which {python_single_digit} && {python_single_digit} -m pip --version) || apt-get install -y {python_single_digit}-pip",
# ]
# set the preprocessing bash script to execute at the startup of any docker.
# all lines will be executed regardless of their exit code.
# docker_preprocess_bash_script = [
# "echo \"starting docker\"",
#]
# If False replace \r with \n and display full console output
# default is True, report a single \r line in a sequence of consecutive lines, per 5 seconds.
# suppress_carriage_return: true
# cuda versions used for solving pytorch wheel packages
# should be detected automatically. Override with os environment CUDA_VERSION / CUDNN_VERSION
# cuda_version: 10.1
# cudnn_version: 7.6
# Hide docker environment variables containing secrets when printing out the docker command by replacing their
# values with "********". Turning this feature on will hide the following environment variables values:
# CLEARML_API_SECRET_KEY, CLEARML_AGENT_GIT_PASS, AWS_SECRET_ACCESS_KEY, AZURE_STORAGE_KEY
# To include more environment variables, add their keys to the "extra_keys" list. E.g. to make sure the value of
# your custom environment variable named MY_SPECIAL_PASSWORD will not show in the logs when included in the
# docker command, set:
# extra_keys: ["MY_SPECIAL_PASSWORD"]
hide_docker_command_env_vars {
enabled: true
extra_keys: []
}
}
sdk {
# ClearML - default SDK configuration
storage {
cache {
# Defaults to system temp folder / cache
default_base_dir: "~/.clearml/cache"
size {
# max_used_bytes = -1
min_free_bytes = 10GB
# cleanup_margin_percent = 5%
}
}
direct_access: [
# Objects matching are considered to be available for direct access, i.e. they will not be downloaded
# or cached, and any download request will return a direct reference.
# Objects are specified in glob format, available for url and content_type.
{ url: "file://*" } # file-urls are always directly referenced
]
}
metrics {
# History size for debug files per metric/variant. For each metric/variant combination with an attached file
# (e.g. debug image event), file names for the uploaded files will be recycled in such a way that no more than
# X files are stored in the upload destination for each metric/variant combination.
file_history_size: 100
# Max history size for matplotlib imshow files per plot title.
# File names for the uploaded images will be recycled in such a way that no more than
# X images are stored in the upload destination for each matplotlib plot title.
matplotlib_untitled_history_size: 100
# Limit the number of digits after the dot in plot reporting (reducing plot report size)
# plot_max_num_digits: 5
# Settings for generated debug images
images {
format: JPEG
quality: 87
subsampling: 0
}
# Support plot-per-graph fully matching Tensorboard behavior (i.e. if this is set to true, each series should have its own graph)
tensorboard_single_series_per_graph: false
}
network {
metrics {
# Number of threads allocated to uploading files (typically debug images) when transmitting metrics for
# a specific iteration
file_upload_threads: 4
# Warn about upload starvation if no uploads were made in specified period while file-bearing events keep
# being sent for upload
file_upload_starvation_warning_sec: 120
}
iteration {
# Max number of retries when getting frames if the server returned an error (http code 500)
max_retries_on_server_error: 5
# Backoff factory for consecutive retry attempts.
# SDK will wait for {backoff factor} * (2 ^ ({number of total retries} - 1)) between retries.
retry_backoff_factor_sec: 10
}
}
aws {
s3 {
# S3 credentials, used for read/write access by various SDK elements
# default, used for any bucket not specified below
key: ""
secret: ""
region: ""
credentials: [
# specifies key/secret credentials to use when handling s3 urls (read or write)
# {
# bucket: "my-bucket-name"
# key: "my-access-key"
# secret: "my-secret-key"
# },
# {
# # This will apply to all buckets in this host (unless key/value is specifically provided for a given bucket)
# host: "my-minio-host:9000"
# key: "12345678"
# secret: "12345678"
# multipart: false
# secure: false
# }
]
}
boto3 {
pool_connections: 512
max_multipart_concurrency: 16
}
}
google.storage {
# # Default project and credentials file
# # Will be used when no bucket configuration is found
# project: "clearml"
# credentials_json: "/path/to/credentials.json"
# # Specific credentials per bucket and sub directory
# credentials = [
# {
# bucket: "my-bucket"
# subdir: "path/in/bucket" # Not required
# project: "clearml"
# credentials_json: "/path/to/credentials.json"
# },
# ]
}
azure.storage {
# containers: [
# {
# account_name: "clearml"
# account_key: "secret"
# # container_name:
# }
# ]
}
log {
# debugging feature: set this to true to make null log propagate messages to root logger (so they appear in stdout)
null_log_propagate: false
task_log_buffer_capacity: 66
# disable urllib info and lower levels
disable_urllib3_info: true
}
development {
# Development-mode options
# dev task reuse window
task_reuse_time_window_in_hours: 72.0
# Run VCS repository detection asynchronously
vcs_repo_detect_async: true
# Store uncommitted git/hg source code diff in experiment manifest when training in development mode
# This stores "git diff" or "hg diff" into the experiment's "script.requirements.diff" section
store_uncommitted_code_diff: true
# Support stopping an experiment in case it was externally stopped, status was changed or task was reset
support_stopping: true
# Default Task output_uri. if output_uri is not provided to Task.init, default_output_uri will be used instead.
default_output_uri: ""
# Default auto generated requirements optimize for smaller requirements
# If True, analyze the entire repository regardless of the entry point.
# If False, first analyze the entry point script, if it does not contain other to local files,
# do not analyze the entire repository.
force_analyze_entire_repo: false
# If set to true, *clearml* update message will not be printed to the console
# this value can be overwritten with os environment variable CLEARML_SUPPRESS_UPDATE_MESSAGE=1
suppress_update_message: false
# If this flag is true (default is false), instead of analyzing the code with Pigar, analyze with `pip freeze`
detect_with_pip_freeze: false
# Development mode worker
worker {
# Status report period in seconds
report_period_sec: 2
# ping to the server - check connectivity
ping_period_sec: 30
# Log all stdout & stderr
log_stdout: true
# compatibility feature, report memory usage for the entire machine
# default (false), report only on the running process and its sub-processes
report_global_mem_used: false
}
}
}

View File

@@ -0,0 +1,36 @@
#!/bin/bash -x
export CLEARML_FILES_HOST=${CLEARML_FILES_HOST:-$TRAINS_FILES_HOST}
if [ -z "$CLEARML_FILES_HOST" ]; then
CLEARML_HOST_IP=${CLEARML_HOST_IP:-${TRAINS_HOST_IP:-$(curl -s https://ifconfig.me/ip)}}
fi
export CLEARML_FILES_HOST=${CLEARML_FILES_HOST:-${TRAINS_FILES_HOST:-"http://$CLEARML_HOST_IP:8081"}}
export CLEARML_WEB_HOST=${CLEARML_WEB_HOST:-${TRAINS_WEB_HOST:-"http://$CLEARML_HOST_IP:8080"}}
export CLEARML_API_HOST=${CLEARML_API_HOST:-${TRAINS_API_HOST:-"http://$CLEARML_HOST_IP:8008"}}
echo $CLEARML_FILES_HOST $CLEARML_WEB_HOST $CLEARML_API_HOST 1>&2
if [ -z "$CLEARML_AGENT_NO_UPDATE" ]; then
if [ -n "$CLEARML_AGENT_UPDATE_REPO" ]; then
python3 -m pip install -q -U "$CLEARML_AGENT_UPDATE_REPO"
else
python3 -m pip install -q -U "clearml-agent${CLEARML_AGENT_UPDATE_VERSION:-$TRAINS_AGENT_UPDATE_VERSION}"
fi
fi
QUEUE=${K8S_GLUE_QUEUE:-k8s_glue}
MAX_PODS=${K8S_GLUE_MAX_PODS:-2}
EXTRA_ARGS=${K8S_GLUE_EXTRA_ARGS:-}
# shellcheck disable=SC2129
echo "api.credentials.access_key: ${CLEARML_API_ACCESS_KEY}" >> ~/clearml.conf
echo "api.credentials.secret_key: ${CLEARML_API_SECRET_KEY}" >> ~/clearml.conf
echo "api.api_server: ${CLEARML_API_HOST}" >> ~/clearml.conf
echo "api.web_server: ${CLEARML_WEB_HOST}" >> ~/clearml.conf
echo "api.files_server: ${CLEARML_FILES_HOST}" >> ~/clearml.conf
./provider_entrypoint.sh
python3 k8s_glue_example.py --queue ${QUEUE} --max-pods ${MAX_PODS} ${EXTRA_ARGS}

View File

@@ -0,0 +1,94 @@
"""
This example assumes you have preconfigured services with selectors in the form of
"ai.allegro.agent.serial=pod-<number>" and a targetPort of 10022.
The K8sIntegration component will label each pod accordingly.
"""
from argparse import ArgumentParser
from clearml_agent.glue.k8s import K8sIntegration
def parse_args():
parser = ArgumentParser()
group = parser.add_mutually_exclusive_group()
parser.add_argument(
"--queue", type=str, help="Queue to pull tasks from"
)
group.add_argument(
"--ports-mode", action='store_true', default=False,
help="Ports-Mode will add a label to the pod which can be used as service, in order to expose ports"
"Should not be used with max-pods"
)
parser.add_argument(
"--num-of-services", type=int, default=20,
help="Specify the number of k8s services to be used. Use only with ports-mode."
)
parser.add_argument(
"--base-port", type=int,
help="Used in conjunction with ports-mode, specifies the base port exposed by the services. "
"For pod #X, the port will be <base-port>+X. Note that pod number is calculated based on base-pod-num"
"e.g. if base-port=20000 and base-pod-num=3, the port for the first pod will be 20003"
)
parser.add_argument(
"--base-pod-num", type=int, default=1,
help="Used in conjunction with ports-mode and base-port, specifies the base pod number to be used by the "
"service (default: %(default)s)"
)
parser.add_argument(
"--gateway-address", type=str, default=None,
help="Used in conjunction with ports-mode, specify the external address of the k8s ingress / ELB"
)
parser.add_argument(
"--pod-clearml-conf", type=str,
help="Configuration file to be used by the pod itself (if not provided, current configuration is used)"
)
parser.add_argument(
"--overrides-yaml", type=str,
help="YAML file containing pod overrides to be used when launching a new pod"
)
parser.add_argument(
"--template-yaml", type=str,
help="YAML file containing pod template. If provided pod will be scheduled with kubectl apply "
"and overrides are ignored, otherwise it will be scheduled with kubectl run"
)
parser.add_argument(
"--ssh-server-port", type=int, default=0,
help="If non-zero, every pod will also start an SSH server on the selected port (default: zero, not active)"
)
parser.add_argument(
"--namespace", type=str,
help="Specify the namespace in which pods will be created (default: %(default)s)", default="clearml"
)
group.add_argument(
"--max-pods", type=int,
help="Limit the maximum number of pods that this service can run at the same time."
"Should not be used with ports-mode"
)
return parser.parse_args()
def main():
args = parse_args()
user_props_cb = None
if args.ports_mode and args.base_port:
def k8s_user_props_cb(pod_number=0):
user_prop = {"k8s-pod-port": args.base_port + pod_number}
if args.gateway_address:
user_prop["k8s-gateway-address"] = args.gateway_address
return user_prop
user_props_cb = k8s_user_props_cb
k8s = K8sIntegration(
ports_mode=args.ports_mode, num_of_services=args.num_of_services, base_pod_num=args.base_pod_num,
user_props_cb=user_props_cb, overrides_yaml=args.overrides_yaml, clearml_conf_file=args.pod_clearml_conf,
template_yaml=args.template_yaml, extra_bash_init_script=K8sIntegration.get_ssh_server_bash(
ssh_port_number=args.ssh_server_port) if args.ssh_server_port else None,
namespace=args.namespace, max_pods_limit=args.max_pods or None,
)
k8s.k8s_daemon(args.queue)
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,14 @@
#!/bin/bash
chmod +x /root/entrypoint.sh
apt-get update -y
apt-get dist-upgrade -y
apt-get install -y curl unzip less locales
locale-gen en_US.UTF-8
apt-get install -y curl python3-pip git
python3 -m pip install -U pip
python3 -m pip install clearml-agent
python3 -m pip install -U "cryptography>=2.9"

View File

@@ -0,0 +1,22 @@
FROM ubuntu:18.04
USER root
WORKDIR /root
ENV LC_ALL=en_US.UTF-8
ENV LANG=en_US.UTF-8
ENV LANGUAGE=en_US.UTF-8
ENV PYTHONIOENCODING=UTF-8
COPY ../build-resources/setup.sh /root/setup.sh
RUN /root/setup.sh
COPY ./setup_aws.sh /root/setup_aws.sh
RUN /root/setup_aws.sh
COPY ../build-resources/entrypoint.sh /root/entrypoint.sh
COPY ./provider_entrypoint.sh /root/provider_entrypoint.sh
COPY ./build-resources/k8s_glue_example.py /root/k8s_glue_example.py
COPY ./clearml.conf /root/clearml.conf
ENTRYPOINT ["/root/entrypoint.sh"]

View File

@@ -0,0 +1,4 @@
#!/bin/bash -x
source /root/.bashrc
export PATH=$PATH:$HOME/bin

View File

@@ -0,0 +1,14 @@
#!/bin/bash
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
./aws/install
curl -o kubectl https://amazon-eks.s3-us-west-2.amazonaws.com/1.21.2/2021-07-05/bin/linux/amd64/kubectl
#curl -o kubectl https://amazon-eks.s3.us-west-2.amazonaws.com/1.19.6/2021-01-05/bin/linux/amd64/kubectl
chmod +x ./kubectl && mkdir -p $HOME/bin && cp ./kubectl $HOME/bin/kubectl && export PATH=$PATH:$HOME/bin
curl -o aws-iam-authenticator https://amazon-eks.s3-us-west-2.amazonaws.com/1.21.2/2021-07-05/bin/linux/amd64/aws-iam-authenticator
#curl -o aws-iam-authenticator https://amazon-eks.s3.us-west-2.amazonaws.com/1.19.6/2021-01-05/bin/linux/amd64/aws-iam-authenticator
chmod +x ./aws-iam-authenticator && mkdir -p $HOME/bin && cp ./aws-iam-authenticator $HOME/bin/aws-iam-authenticator && export PATH=$PATH:$HOME/bin
echo 'export PATH=$PATH:$HOME/bin' >> ~/.bashrc

View File

@@ -0,0 +1,22 @@
FROM ubuntu:18.04
USER root
WORKDIR /root
ENV LC_ALL=en_US.UTF-8
ENV LANG=en_US.UTF-8
ENV LANGUAGE=en_US.UTF-8
ENV PYTHONIOENCODING=UTF-8
COPY ../build-resources/setup.sh /root/setup.sh
RUN /root/setup.sh
COPY ./setup_gcp.sh /root/setup_gcp.sh
RUN /root/setup_gcp.sh
COPY ../build-resources/entrypoint.sh /root/entrypoint.sh
COPY ./provider_entrypoint.sh /root/provider_entrypoint.sh
COPY ./build-resources/k8s_glue_example.py /root/k8s_glue_example.py
COPY ./clearml.conf /root/clearml.conf
ENTRYPOINT ["/root/entrypoint.sh"]

View File

@@ -0,0 +1,4 @@
#!/bin/bash -x
gcloud auth activate-service-account ${CLEARML_SERVICE_ACC} --key-file=/root/keys/${SERVICE_ACC_KEY_JSON}
gcloud container clusters get-credentials ${CLUSTER_CRED}

View File

@@ -0,0 +1,14 @@
#!/bin/bash
curl -LO https://dl.k8s.io/release/v1.21.0/bin/linux/amd64/kubectl
install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
sudo apt-get install -y apt-transport-https ca-certificates gnupg
echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key --keyring /usr/share/keyrings/cloud.google.gpg add -
apt-get update -y
apt-get install -y google-cloud-sdk

View File

@@ -0,0 +1,47 @@
apiVersion: v1
kind: Pod
metadata:
name: k8s-glue
spec:
serviceAccountName: ""
containers:
- name: k8s-glue-container
image: allegroai/clearml-agent-k8s:aws-latest-1.21
imagePullPolicy: Always
command: [
"/bin/bash",
"-c",
"source /root/.bashrc && /root/entrypoint.sh"
]
volumeMounts:
- name: pod-template
mountPath: /root/template
env:
- name: CLEARML_API_HOST
value: ""
- name: CLEARML_WEB_HOST
value: ""
- name: CLEARML_FILES_HOST
value: ""
# - name: K8S_GLUE_MAX_PODS
# value: "2"
- name: K8S_GLUE_QUEUE
value: "k8s-glue"
- name: K8S_GLUE_EXTRA_ARGS
value: "--template-yaml /root/template/pod_template.yml"
- name: CLEARML_API_ACCESS_KEY
value: ""
- name: CLEARML_API_SECRET_KEY
value: ""
- name: CLEARML_WORKER_ID
value: "k8s-glue-agent"
- name: CLEARML_AGENT_UPDATE_REPO
value: ""
- name: FORCE_CLEARML_AGENT_REPO
value: ""
- name: CLEARML_DOCKER_IMAGE
value: "ubuntu:18.04"
volumes:
- name: pod-template
secret:
secretName: k8s-glue-pod-template

View File

@@ -0,0 +1,58 @@
apiVersion: v1
kind: Pod
metadata:
name: k8s-glue
spec:
serviceAccountName: ""
containers:
- name: k8s-glue-container
image: allegroai/clearml-agent-k8s:gcp-latest-1.21
imagePullPolicy: Always
command: [
"/bin/bash",
"-c",
"source /root/.bashrc && /root/entrypoint.sh"
]
volumeMounts:
- name: pod-template
mountPath: /root/template
- name: service-acc-key
mountPath: /root/keys
env:
- name: CLEARML_API_HOST
value: ""
- name: CLEARML_WEB_HOST
value: ""
- name: CLEARML_FILES_HOST
value: ""
# - name: K8S_GLUE_MAX_PODS
# value: "2"
- name: K8S_GLUE_QUEUE
value: "k8s-glue"
- name: K8S_GLUE_EXTRA_ARGS
value: "--template-yaml /root/template/pod_template.yml"
- name: CLEARML_API_ACCESS_KEY
value: ""
- name: CLEARML_API_SECRET_KEY
value: ""
- name: CLEARML_WORKER_ID
value: "k8s-glue-agent"
- name: CLEARML_AGENT_UPDATE_REPO
value: ""
- name: FORCE_CLEARML_AGENT_REPO
value: ""
- name: CLEARML_DOCKER_IMAGE
value: "ubuntu:18.04"
- name: CLEARML_SERVICE_ACC
value: ""
- name: SERVICE_ACC_KEY_JSON
value: service-account-key.json
- name: CLUSTER_CRED
value: ""
volumes:
- name: pod-template
secret:
secretName: k8s-glue-pod-template
- name: service-acc-key
secret:
secretName: k8s-glue-service-acc-key

View File

@@ -0,0 +1,13 @@
apiVersion: v1
metadata:
namespace: clearml
spec:
containers:
- resources:
limits:
cpu: 1000m
memory: 4G
requests:
cpu: 1000m
memory: 4G
restartPolicy: Never

View File

@@ -0,0 +1,7 @@
FROM ubuntu:18.04
USER root
WORKDIR /root
COPY ./setup.sh /root/setup.sh
RUN /root/setup.sh

View File

@@ -0,0 +1,10 @@
echo 'Binary::apt::APT::Keep-Downloaded-Packages \"true\";' > /etc/apt/apt.conf.d/docker-clean
chown -R root /root/.cache/pip
apt-get update -y
apt-get dist-upgrade -y
apt-get install -y git libsm6 libxext6 libxrender-dev libglib2.0-0 curl python3-pip
python3 -m pip install -U pip
python3 -m pip install clearml-agent
python3 -m pip install -U "cryptography>=2.9"

View File

@@ -42,10 +42,18 @@ agent {
# Example values: "/usr/bin/python3" or "/usr/local/bin/python3.6"
# The default is the python executing the clearml_agent
python_binary: ""
# ignore any requested python version (Default: False, if a Task was using a
# specific python version and the system supports multiple python the agent will use the requested python version)
# ignore_requested_python_version: true
# select python package manager:
# currently supported pip and conda
# poetry is used if pip selected and repository contains poetry.lock file
# currently supported: pip, conda and poetry
# if "pip" or "conda" are used, the agent installs the required packages
# based on the "installed packages" section of the Task. If the "installed packages" is empty,
# it will revert to using `requirements.txt` from the repository's root directory.
# If Poetry is selected and the root repository contains `poetry.lock` or `pyproject.toml`,
# the "installed packages" section is ignored, and poetry is used.
# If Poetry is selected and no lock file is found, it reverts to "pip" package manager behaviour.
package_manager: {
# supported options: pip, conda, poetry
type: pip,
@@ -63,7 +71,7 @@ agent {
extra_index_url: []
# additional conda channels to use when installing with conda package manager
conda_channels: ["pytorch", "conda-forge", ]
conda_channels: ["pytorch", "conda-forge", "defaults", ]
# conda_full_env_update: false
# conda_env_as_base_docker: false
@@ -107,11 +115,12 @@ agent {
path: ~/.clearml/vcs-cache
},
# DEPRECATED: please use `venvs_cache` and set `venvs_cache.path`
# use venv-update in order to accelerate python virtual environment building
# Still in beta, turned off by default
venv_update: {
enabled: false,
},
# venv_update: {
# enabled: false,
# },
# cached folder for specific python package download (mostly pytorch versions)
pip_download_cache {
@@ -135,16 +144,68 @@ agent {
# optional shell script to run in docker when started before the experiment is started
# extra_docker_shell_script: ["apt-get install -y bindfs", ]
# Install the required packages for opencv libraries (libsm6 libxext6 libxrender-dev libglib2.0-0),
# for backwards compatibility reasons, true as default,
# change to false to skip installation and decrease docker spin up time
# docker_install_opencv_libs: true
# set to true in order to force "docker pull" before running an experiment using a docker image.
# This makes sure the docker image is updated.
docker_force_pull: false
default_docker: {
# default docker image to use when running in docker mode
image: "nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04"
image: "nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04"
# optional arguments to pass to docker image
# arguments: ["--ipc=host"]
# lookup table rules for default container
# first matched rule will be picked, according to rule order
# enterprise version only
# match_rules: [
# {
# image: "nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04"
# arguments: "-e define=value"
# match: {
# script{
# # Optional: must match all requirements (not partial)
# requirements: {
# # version selection matching PEP-440
# pip: {
# tensorflow: "~=2.6"
# },
# }
# # Optional: matching based on regular expression, example: "^exact_match$"
# repository: "/my_repository/"
# branch: "main"
# binary: "python3.6"
# }
# # Optional: matching based on regular expression, example: "^exact_match$"
# project: "project/sub_project"
# }
# },
# {
# image: "nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04"
# arguments: "-e define=value"
# match: {
# # must match all requirements (not partial)
# script{
# requirements: {
# conda: {
# torch: ">=2.6,<2.8"
# }
# }
# # no repository matching required
# repository: ""
# }
# # no container image matching required (allow to replace one requested container with another)
# container: ""
# # no repository matching required
# project: ""
# }
# },
# ]
}
# set the OS environments based on the Task's Environment section before launching the Task process.
@@ -154,6 +215,37 @@ agent {
# it Should be detected automatically. Override with os environment CUDA_VERSION / CUDNN_VERSION
# cuda_version: 10.1
# cudnn_version: 7.6
# Hide docker environment variables containing secrets when printing out the docker command by replacing their
# values with "********". Turning this feature on will hide the following environment variables values:
# CLEARML_API_SECRET_KEY, CLEARML_AGENT_GIT_PASS, AWS_SECRET_ACCESS_KEY, AZURE_STORAGE_KEY
# To include more environment variables, add their keys to the "extra_keys" list. E.g. to make sure the value of
# your custom environment variable named MY_SPECIAL_PASSWORD will not show in the logs when included in the
# docker command, set:
# extra_keys: ["MY_SPECIAL_PASSWORD"]
hide_docker_command_env_vars {
enabled: true
extra_keys: []
}
# allow to set internal mount points inside the docker,
# especially useful for non-root docker container images.
# docker_internal_mounts {
# sdk_cache: "/clearml_agent_cache"
# apt_cache: "/var/cache/apt/archives"
# ssh_folder: "/root/.ssh"
# pip_cache: "/root/.cache/pip"
# poetry_cache: "/root/.cache/pypoetry"
# vcs_cache: "/root/.clearml/vcs-cache"
# venv_build: "/root/.clearml/venvs-builds"
# pip_download: "/root/.clearml/pip-download-cache"
# }
# Name docker containers created by the daemon using the following string format (supported from Docker 0.6.5)
# Allowed variables are task_id, worker_id and rand_string (random lower-case letters string, up to 32 characters)
# Note: resulting name must start with an alphanumeric character and
# continue with alphanumeric characters, underscores (_), dots (.) and/or dashes (-)
# docker_container_name_format: "clearml-id-{task_id}-{rand_string:.8}"
}
sdk {
@@ -240,6 +332,7 @@ sdk {
# secret: "12345678"
# multipart: false
# secure: false
# verify: /path/to/ca/bundle.crt OR false to not verify
# }
]
}
@@ -314,5 +407,45 @@ sdk {
log_stdout: True
}
}
# Apply top-level environment section from configuration into os.environ
apply_environment: true
# Top-level environment section is in the form of:
# environment {
# key: value
# ...
# }
# and is applied to the OS environment as `key=value` for each key/value pair
# Apply top-level files section from configuration into local file system
apply_files: true
# Top-level files section allows auto-generating files at designated paths with a predefined contents
# and target format. Options include:
# contents: the target file's content, typically a string (or any base type int/float/list/dict etc.)
# format: a custom format for the contents. Currently supported value is `base64` to automatically decode a
# base64-encoded contents string, otherwise ignored
# path: the target file's path, may include ~ and inplace env vars
# target_format: format used to encode contents before writing into the target file. Supported values are json,
# yaml, yml and bytes (in which case the file will be written in binary mode). Default is text mode.
# overwrite: overwrite the target file in case it exists. Default is true.
#
# Example:
# files {
# myfile1 {
# contents: "The quick brown fox jumped over the lazy dog"
# path: "/tmp/fox.txt"
# }
# myjsonfile {
# contents: {
# some {
# nested {
# value: [1, 2, 3, 4]
# }
# }
# }
# path: "/tmp/test.json"
# target_format: json
# }
# }
}

Binary file not shown.

After

Width:  |  Height:  |  Size: 123 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 2.0 MiB

After

Width:  |  Height:  |  Size: 1.1 MiB

View File

@@ -10,12 +10,15 @@ from clearml_agent.glue.k8s import K8sIntegration
def parse_args():
parser = ArgumentParser()
group = parser.add_mutually_exclusive_group()
parser.add_argument(
"--queue", type=str, help="Queue to pull tasks from"
)
parser.add_argument(
group.add_argument(
"--ports-mode", action='store_true', default=False,
help="Ports-Mode will add a label to the pod which can be used as service, in order to expose ports"
"Should not be used with max-pods"
)
parser.add_argument(
"--num-of-services", type=int, default=20,
@@ -57,6 +60,11 @@ def parse_args():
"--namespace", type=str,
help="Specify the namespace in which pods will be created (default: %(default)s)", default="clearml"
)
group.add_argument(
"--max-pods", type=int,
help="Limit the maximum number of pods that this service can run at the same time."
"Should not be used with ports-mode"
)
return parser.parse_args()
@@ -77,7 +85,7 @@ def main():
user_props_cb=user_props_cb, overrides_yaml=args.overrides_yaml, clearml_conf_file=args.pod_clearml_conf,
template_yaml=args.template_yaml, extra_bash_init_script=K8sIntegration.get_ssh_server_bash(
ssh_port_number=args.ssh_server_port) if args.ssh_server_port else None,
namespace=args.namespace,
namespace=args.namespace, max_pods_limit=args.max_pods or None,
)
k8s.k8s_daemon(args.queue)

View File

@@ -8,10 +8,10 @@ psutil>=3.4.2,<5.9.0
pyhocon>=0.3.38,<0.4.0
pyparsing>=2.0.3,<2.5.0
python-dateutil>=2.4.2,<2.9.0
pyjwt>=1.6.4,<1.8.0
PyYAML>=3.12,<5.4.0
pyjwt>=1.6.4,<2.1.0
PyYAML>=3.12,<5.5.0
requests>=2.20.0,<2.26.0
six>=1.11.0,<1.16.0
typing>=3.6.4,<3.8.0
urllib3>=1.21.1,<1.27.0
virtualenv>=16,<20
virtualenv>=16,<21

View File

@@ -60,6 +60,7 @@ setup(
'Programming Language :: Python :: 3.6',
'Programming Language :: Python :: 3.7',
'Programming Language :: Python :: 3.8',
'Programming Language :: Python :: 3.9',
'License :: OSI Approved :: Apache Software License',
],