Compare commits

141 Commits

Author SHA1 Message Date
Jake Henning
785e22dc87 Version bump to v1.9.1 2024-09-02 01:04:49 +03:00
Jake Henning
6a2b778d53 Add default pip version support for Python 3.12 2024-09-02 01:03:52 +03:00
allegroai
b2c3702830 Version bump to v1.9.0 2024-08-28 23:18:26 +03:00
allegroai
6302d43990 Add support for skipping container apt installs using CLEARML_AGENT_SKIP_CONTAINER_APT env var in k8s
Add runtime callback support for setting runtime properties per task in k8s
Fix remove task from pending queue and set to failed when kubectl apply fails
2024-08-27 23:01:27 +03:00
allegroai
760bbca74e Fix failed Task in services mode logged "User aborted" instead of failed, add Task reason string 2024-08-27 22:56:37 +03:00
allegroai
e63fd31420 Fix string format 2024-08-27 22:55:49 +03:00
allegroai
2ff9985db7 Add user ID to the vault loading print 2024-08-27 22:55:32 +03:00
allegroai
b8c762401b Fix use same state transition if supported by the server (instead of stopping the task before re-enqueue) 2024-08-27 22:54:45 +03:00
allegroai
99e1e54f94 Add support for tasks containing only bash script or python module command 2024-08-27 22:53:14 +03:00
allegroai
a4d3b5bad6 Fix only set Task started status on node rank 0 2024-08-27 22:52:31 +03:00
allegroai
b21665ed6e Fix do not cache venv cache if venv/python skip env var was set 2024-08-27 22:52:01 +03:00
Surya Teja
f877aa96e2 Update Docker base image to Ubuntu 22.04 and Kubectl to 1.29.3 (#201) 2024-07-29 18:41:50 +03:00
pollfly
f99344d194 Add queue priority info to CLI help (#211)
* add queue priority comment

* Add --order-fairness info

---------

Co-authored-by: Jake Henning <59198928+jkhenning@users.noreply.github.com>
2024-07-29 18:40:38 +03:00
allegroai
d9f2a1999a Fix Only send pip freeze update on RANK 0, only update task status on exit on RANK 0 2024-07-29 17:40:24 +03:00
Valentin Schabschneider
79d0abe707 Add NO_DOCKER flag to clearml-agent-services entrypoint (#206) 2024-07-26 19:09:54 +03:00
allegroai
6213ef4c02 Add /bin/bash -c "command" support. Task binary should be set to /bin/bash and entry_point should be set to -c command 2024-07-24 18:00:13 +03:00
allegroai
aef6aa9fc8 Fix a race condition where in rare conditions popping a Task from a queue that was aborted did not set it to started before the watchdog killed it. Does not happen in k8s/slurm 2024-07-24 17:59:46 +03:00
allegroai
0bb267115b Add venvs_cache.path mount override for non-root containers (use: agent.docker_internal_mounts.venvs_cache) 2024-07-24 17:59:18 +03:00
allegroai
f89a92556f Fix check logger is not None 2024-07-24 17:55:02 +03:00
allegroai
8ba4d75e80 Add CLEARML_TASK_ID and auth token to pod env vars in original entrypoint flow 2024-07-24 17:47:48 +03:00
allegroai
edc333ba5f Add K8S_GLUE_POD_USE_IMAGE_ENTRYPOINT to allow running images without overriding the entrypoint (useful for agents using prebuilt images in k8s) 2024-07-24 17:46:27 +03:00
allegroai
2f0553b873 Fix CLEARML_MULTI_NODE_SINGLE_TASK should be read once not every reported line 2024-07-24 17:45:02 +03:00
allegroai
b2a4bf08ac Fix pass --docker only (i.e. no default container image) for --dynamic-gpus feature 2024-07-24 17:44:35 +03:00
allegroai
f18c6b809f Fix slurm multi-node rank detection 2024-07-24 17:44:05 +03:00
allegroai
cd5b4d2186 Add "-m module args" in script entry now supports standalone script, standalone script is converted to "untitled.py" by default or if specified in working_dir such as <dir>:<target_file> for example ".:standalone.py" 2024-07-24 17:43:21 +03:00
allegroai
5f1bab6711 Add default docker match_rules for enterprise users,
NOTICE: matching_rules are ignored if `--docker container` is passed in command line
2024-07-24 17:42:55 +03:00
allegroai
ab9b9db0c9 Add CLEARML_MULTI_NODE_SINGLE_TASK (values -1, 0, 1, 2) for easier multi-node single Task workloads 2024-07-24 17:42:25 +03:00
allegroai
93df021108 Add support for .ipynb script entry files (install nbconvert in runtime, convert to python and execute the python script), including CLEARML_AGENT_FORCE_TASK_INIT patching of ipynb files (post python conversion) 2024-07-24 17:41:59 +03:00
allegroai
700ae85de0 Fix file mode should be optional in configuration files section 2024-07-24 17:41:06 +03:00
allegroai
f367c5a571 Fix git fetch did not update new tags #209 2024-07-24 17:39:53 +03:00
allegroai
ebc5944b44 Fix setting tasks that someone just marked as aborted to started - only force Task to started after dequeuing it otherwise leave it as is 2024-07-24 17:39:26 +03:00
allegroai
8f41002845 Add task.script.binary /bin/bash support
Fix -m module $env to support parsing the $env before launching
2024-07-24 17:37:26 +03:00
allegroai
7e8670d57f Find the correct python version when using a pre-installed python environment 2024-07-21 14:10:38 +03:00
allegroai
77de343863 Use "venv" module if virtualenv is not supported 2024-07-19 13:18:07 +03:00
allegroai
6b31883e45 Fix queue resolution when no queue is passed 2024-05-15 18:30:24 +03:00
allegroai
e48b4756fa Add Python 3.12 support 2024-05-15 18:25:29 +03:00
allegroai
47147e3237 Fix cached repositories were not passing user/token when pulling, agent.vcs_cache.clone_on_pull_fail now defaults to false 2024-04-19 23:50:17 +03:00
allegroai
41fc4ec646 Fix disabling vcs cache should not add vcs mount point to container 2024-04-19 23:48:50 +03:00
allegroai
441e5a73b2 Fix conda env should not be cached if installing into base conda or conda existing env exists 2024-04-19 23:48:10 +03:00
allegroai
27ed6821c4 Add mirrorD config files to gitignore 2024-04-19 23:47:34 +03:00
allegroai
10c6629982 Support skipping re-enqueue on suspected preempted k8s pods 2024-04-19 23:46:57 +03:00
allegroai
6fb48a4c6e Revert version to v1.8.1 2024-04-19 23:44:31 +03:00
allegroai
105ade31f1 Version bump to v1.8.2 2024-04-14 18:18:10 +03:00
allegroai
502e266b6b Fix polling interval missing when not using daemon mode 2024-04-14 18:17:57 +03:00
allegroai
cd9a3b9f4e Version bump to v1.8.1 2024-04-12 20:30:11 +03:00
allegroai
4179ac5234 Fix git pulling on cached invalid git entry. On error, re-clone the entire repo again (disable using "agent.vcs_cache.clone_on_pull_fail: false") 2024-04-12 20:29:36 +03:00
Liron Ilouz
98cc0d86ba Add option to set daemon polling interval (#197)
* add option to set worker polling interval

* polling interval minimum value

---------

Co-authored-by: Liron <liron@tapwithus.com>
2024-04-03 14:33:52 +03:00
allegroai
293cbc0ac6 Version bump to v1.8.0 2024-04-02 16:38:22 +03:00
allegroai
4387ed73b6 Fix None handling when no limits exist 2024-04-02 16:36:09 +03:00
allegroai
43443ccf08 Pass task_id when resolving k8s template 2024-04-01 11:37:01 +03:00
allegroai
3d43240c8f Improve conda package manager support
Add agent.package_manager.use_conda_base_env (CLEARML_USE_CONDA_BASE_ENV) allowing to use base conda environment (instead of installing a new one)
Fix conda support for python packages with markers and multiple specifications
Added "nvidia" conda channel and support for cuda-toolkit >= 12
2024-04-01 11:36:26 +03:00
allegroai
fc58ba947b Update requirements 2024-04-01 11:35:07 +03:00
allegroai
22672d2444 Improve GPU monitoring 2024-03-17 19:13:57 +02:00
allegroai
6a4fcda1bf Improve resource monitor 2024-03-17 19:06:57 +02:00
allegroai
a4ebf8293d Fix role support 2024-03-17 19:00:59 +02:00
allegroai
10fb157d58 Fix queue handling for backwards compatibility 2024-03-17 19:00:18 +02:00
allegroai
56058beec2 Update deprecated references 2024-03-17 18:59:48 +02:00
allegroai
9f207d5155 Fix dynamic GPU sometimes misses the initial print - if we found the closing print it should be good enough to signal everything is okay 2024-03-17 18:59:04 +02:00
allegroai
8a2bea3c14 Fix comment lines (#) are not ignored in docker startup bash script 2024-03-17 18:58:14 +02:00
allegroai
f1f9278928 Fix torch resolver settings applied to PytorchRequirement instance are not used 2024-03-17 18:56:47 +02:00
nfzd
2de1c926bf Use correct Python version in Poetry init (#179)
* Use correct Python version in Poetry init

* Use interpreter override if configured

* Don't use agent.python_binary if it is empty

---------

Co-authored-by: Michael Mueller <michael.mueller@wsa.com>
2024-03-11 23:36:10 +02:00
allegroai
e1104e60bb Update README 2024-03-11 16:58:38 +02:00
ae-ae
8b2970350c Fix FileNotFoundException crash in find_python_executable_for_version… (#192)
* Fix FileNotFoundException crash in find_python_executable_for_version (#164)

* Add a Windows check for error 9009 when searching for Python

---------

Co-authored-by: 12037964+ae-ae@users.noreply.github.com 12037964+ae-ae@users.noreply.github.com <ae-ae>
2024-03-06 09:17:31 +02:00
FeU-aKlos
a2758250b2 Fix queue handling in K8sIntegration and k8s_glue_example.py (#183)
* Fix queue handling in K8sIntegration and k8s_glue_example.py

* Update Dockerfile and k8s_glue_example.py

* Add executable permission to provider_entrypoint.sh

* ADJUST docker

* Update clearml-agent version

* ADDJUST stuff

* ADJUST queue string handling

* DELETE pip install from own repo
2024-02-29 14:20:54 +02:00
allegroai
01e8ffd854 Improve venv cache handling:
- Add FileLock readonly mode, default is write mode (i.e. exclusive lock, preserving behavior)
- Add venv cache now uses readonly lock when copying folders from venv cache into target folder. This enables multiple read, single write operation
- Do not lock the cache folder if we do not need to delete old entries
2024-02-29 14:19:24 +02:00
allegroai
74edf6aa36 Fix IOError on file lock when using shared folder 2024-02-29 14:16:25 +02:00
allegroai
09c5ef99af Fix Python 3.12 support by removing distutil imports 2024-02-29 14:12:21 +02:00
allegroai
17ae28a62f Add agent.venvs_cache.lock_timeout to control the venv cache folder lock timeout (in seconds, default 30) 2024-02-29 14:06:06 +02:00
allegroai
059a9385e9 Fix delete temp console pipe log files after Task execution is completed. This is important for long lasting services agents, avoiding collecting temp files on host machine 2024-02-29 14:03:30 +02:00
allegroai
9a321a410f Add CLEARML_AGENT_FORCE_TASK_INIT to allow runtime patching of script even if no repo is specified and the code is running a preinstalled docker 2024-02-29 14:02:27 +02:00
allegroai
919013d4fe Add CLEARML_AGENT_FORCE_POETRY to allow forcing poetry even when using pip requirements manager 2024-02-29 13:59:26 +02:00
allegroai
05530b712b Fix sanitization did not cover all keys 2024-02-29 13:56:14 +02:00
allegroai
8d15fd8798 Fix pippip is returned as a pip version if no value exists in agent.package_manager.pip_version 2024-02-29 13:55:41 +02:00
allegroai
b34329934b Add queue ID report before pulling task 2024-02-29 13:52:17 +02:00
allegroai
85049d8705 Move configuration sanitization settings to the default config file 2024-02-29 13:51:40 +02:00
allegroai
6fbd70786e Add protection for truncate() call 2024-02-29 13:51:09 +02:00
allegroai
05a65548da Fix agent.enable_git_ask_pass does not show in configuration dump 2024-02-29 13:50:52 +02:00
allegroai
6657003d65 Fix using controller-uid will not always return required pods 2024-02-29 13:49:30 +02:00
allegroai
95dde6ca0c Update README 2024-01-25 11:27:56 +02:00
allegroai
c9fc092f4e Support force_system_packages argument in k8s glue class 2023-12-26 10:12:32 +02:00
allegroai
432ee395e1 Version bump to v1.7.0 2023-12-20 18:08:38 +02:00
allegroai
98fc4f0fb9 Add agent.resource_monitoring.disk_use_path configuration option to allow monitoring a different volume than the one containing the home folder 2023-12-20 17:49:33 +02:00
allegroai
111e774c21 Add extra_index_url sanitization in configuration printout 2023-12-20 17:49:04 +02:00
allegroai
3dd8d783e1 Fix agent.git_host setting will cause git@domain URLs to not be replaced by SSH URLs since furl cannot parse them to obtain host 2023-12-20 17:48:18 +02:00
allegroai
7c3e420df4 Add git clone verbosity using CLEARML_AGENT_GIT_CLONE_VERBOSE env var 2023-12-20 17:47:52 +02:00
allegroai
55b065a114 Update GPU stats and pynvml support 2023-12-20 17:47:19 +02:00
allegroai
faa97b6cc2 Set worker ID in k8s glue mode 2023-12-20 17:45:34 +02:00
allegroai
f5861b1e4a Change default agent.enable_git_ask_pass to True 2023-12-20 17:44:41 +02:00
allegroai
030cbb69f1 Fix check if process return code is SIGKILL (-9 or 137) and abort callback was called, do not mark as failed but as aborted 2023-12-20 17:43:02 +02:00
allegroai
564f769ff7 Add agent.docker_args_extra_precedes_task, agent.protected_docker_extra_args
to prevent the same switch from being used by both `extra_docker_args` and a Task's docker args
2023-12-20 17:42:36 +02:00
pollfly
2c7f091e57 Update example (#177)
* Edit README

* Edit README

* small edits

* update example

* update example

* update example
2023-12-09 12:52:44 +02:00
allegroai
dd5d24b0ca Add CLEARML_AGENT_TEMP_STDOUT_FILE_DIR to allow specifying temp dir used for storing agent log files and temporary log files (daemon and execution) 2023-11-14 11:45:13 +02:00
allegroai
996bb797c3 Add env var in case we're running a service task 2023-11-14 11:44:36 +02:00
allegroai
9ad49a0d21 Fix KeyError if container does not contain the arguments field 2023-11-01 15:11:07 +02:00
allegroai
ba4fee7b19 Fix agent.package_manager.poetry_install_extra_args are used in all Poetry commands and not just in install (#173) 2023-11-01 15:10:40 +02:00
allegroai
0131db8b7d Add support for resource_applied() callback in k8s glue
Add support for sending log events with k8s-provided timestamps
Refactor env vars infrastructure
2023-11-01 15:10:08 +02:00
allegroai
d2384a9a95 Add example and support for prebuilt containers including services-mode support with overrides CLEARML_AGENT_FORCE_CODE_DIR CLEARML_AGENT_FORCE_EXEC_SCRIPT 2023-11-01 15:05:57 +02:00
allegroai
5b86c230c1 Fix an environment variable that should be set with a numerical value of 0 (i.e. end up as "0" or "0.0") is set to an empty string 2023-11-01 15:04:59 +02:00
allegroai
21e4be966f Fix recursion issue when deep-copying a session 2023-11-01 15:04:24 +02:00
allegroai
9c6cb421b3 When cleaning up pending pods, verify task is still aborted and pod is still pending before deleting the pod 2023-11-01 15:04:01 +02:00
allegroai
52405c343d Fix k8s glue configuration might be contaminated when changed during apply 2023-11-01 15:03:37 +02:00
allegroai
46f0c991c8 Add status reason when aborting before moving to k8s_scheduler queue 2023-11-01 15:02:24 +02:00
allegroai
0254279ed5 Version bump to v1.6.1 2023-09-06 15:41:29 +03:00
allegroai
0e1750f90e Fix requests library lower constraint breaks backwards compatibility 2023-09-06 15:40:48 +03:00
allegroai
58e0dc42ec Version bump to v1.6.0 2023-09-05 15:05:11 +03:00
allegroai
d16825029d Add new pytorch no resolver mode and CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE to change resolver on a Task basis, now supports "pip", "direct", "none" 2023-09-02 17:45:10 +03:00
allegroai
fb639afcb9 Fix PyTorch extra index pip resolver 2023-09-02 17:43:41 +03:00
allegroai
eefb94d1bc Add Python 3.11 support 2023-09-02 17:42:27 +03:00
Alex Burlacu
f1e9266075 Adjust docker image versions in a couple more places 2023-08-24 19:03:24 +03:00
Alex Burlacu
e1e3c84a8d Update docker versions 2023-08-24 19:01:26 +03:00
Alex Burlacu
ed1356976b Move extra configurations to Worker init to make sure all available configurations can be overridden 2023-08-24 19:00:36 +03:00
Alex Burlacu
2b815354e0 Improve file mode comment 2023-08-24 18:53:00 +03:00
Alex Burlacu
edae380a9e Version bump 2023-08-24 18:51:47 +03:00
Alex Burlacu
946e9d9ce9 Fix invalid reference 2023-08-24 18:51:27 +03:00
jday1
a56343ffc7 Upgrade requests library (#162)
* Upgrade requests

* Modified package requirement

* Modified package requirement
2023-08-01 10:41:22 +03:00
allegroai
159a6e9a5a Fix runtime property overriding existing properties 2023-07-20 10:41:15 +03:00
pollfly
6b7ee12dc1 Edit README (#156) 2023-07-19 16:51:14 +03:00
allegroai
3838247716 Update k8s glue docker build resources 2023-07-19 16:47:50 +03:00
pollfly
6e7d35a42a Improve configuration files (#160) 2023-07-11 10:32:01 +03:00
allegroai
4c056a17b9 Add support for k8s jobs execution
Strip docker container obtained from task in k8s apply
2023-07-04 14:45:00 +03:00
allegroai
21d98afca5 Add support for extra docker arguments referencing machines environment variables using the agent.docker_allow_host_environ configuration option to allow users to also be able to use $ENV in the task's docker arguments 2023-07-04 14:42:28 +03:00
allegroai
6a1bf11549 Fix Task docker arguments passed twice 2023-07-04 14:41:07 +03:00
allegroai
7115a9b9a7 Add CLEARML_EXTRA_PIP_INSTALL_FLAGS / agent.package_manager.extra_pip_install_flags to control additional pip install flags
Fix pip version marking in "installed packages" is now preserved for and reinstalled
2023-07-04 14:39:40 +03:00
allegroai
450df2f8d3 Support skipping agent pip upgrade in container bash script using the CLEARML_AGENT_NO_UPDATE env var 2023-07-04 14:38:50 +03:00
allegroai
ccf752c4e4 Add support for setting mode on files applied by the agent 2023-07-04 14:37:58 +03:00
allegroai
3ed63e2154 Fix docker container backwards compatibility for API <2.13
Fix default docker match rules resolver (used incorrect field "container" instead of "image")
Remove "container" (image) match rule option from default docker image resolver
2023-07-04 14:37:18 +03:00
allegroai
a535f93cd6 Add support for CLEARML_AGENT_FORCE_MAX_API_VERSION for testing 2023-07-04 14:35:54 +03:00
allegroai
b380ec54c6 Improve config file comments 2023-07-04 14:34:43 +03:00
allegroai
a1274299ce Add support for CLEARML_AGENT_EXTRA_DOCKER_LABELS env var 2023-07-03 11:08:59 +03:00
allegroai
c77224af68 Add support for task field injection into container docker name 2023-07-03 11:07:12 +03:00
allegroai
95dadca45c Refactor k8s glue running/used pods getter 2023-05-21 22:56:12 +03:00
allegroai
685918fd9b Version bump to v1.5.3rc3 2023-05-21 22:54:38 +03:00
allegroai
bc85ddf78d Fix pytorch direct resolve replacing wheel link with directly installed version 2023-05-21 22:53:51 +03:00
allegroai
5b5fb0b8a6 Add agent.package_manager.pytorch_resolve configuration setting with pip or direct values. pip sets extra index based on cuda and lets pip resolve, direct is the previous parsing algorithm that does the matching and downloading (default pip) 2023-05-21 22:53:11 +03:00
allegroai
fec0ce1756 Better message for agent init when an existing clearml.conf is found 2023-05-21 22:51:11 +03:00
allegroai
1e09b88b7a Add alias CLEARML_AGENT_DOCKER_AGENT_REPO env var for the FORCE_CLEARML_AGENT_REPO env var 2023-05-21 22:50:01 +03:00
allegroai
b6ca0fa6a5 Print error on resource monitor failure 2023-05-11 16:18:11 +03:00
allegroai
307ec9213e Fix git+ssh:// links inside installed packages not being converted properly to HTTPS authenticated and vice versa 2023-05-11 16:16:51 +03:00
allegroai
a78a25d966 Support new Retry.DEFAULT_BACKOFF_MAX in a backwards-compatible way 2023-05-11 16:16:18 +03:00
allegroai
ebb6231f5a Add CLEARML_AGENT_STANDALONE_CONFIG_BC to support backwards compatibility in standalone mode 2023-05-11 16:15:06 +03:00
pollfly
e1d65cb280 Update clearml-agent gif (#137) 2023-04-10 10:58:10 +03:00
71 changed files with 5719 additions and 2278 deletions
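
Commit 030cbb69f1 in the list above describes treating a SIGKILL exit (-9 or 137) as an abort rather than a failure when the abort callback was called. The following is a minimal sketch of that decision, not the agent's actual code: the names `classify_exit` and `abort_callback_called`, and the mapping of return code 0 to "completed", are assumptions made for illustration.

```python
# Hypothetical illustration only; these names are not taken from clearml-agent.
# -9 is how Python's subprocess reports death by signal 9,
# 137 (128 + 9) is how a shell reports the same event.
SIGKILL_RETURN_CODES = {-9, 137}


def classify_exit(return_code: int, abort_callback_called: bool) -> str:
    """Return 'completed', 'aborted' or 'failed' for a finished task process."""
    if return_code == 0:
        # Assumption: a zero exit code means the task finished normally.
        return "completed"
    if return_code in SIGKILL_RETURN_CODES and abort_callback_called:
        # The process was killed as a consequence of a requested abort,
        # so it is reported as aborted rather than failed.
        return "aborted"
    return "failed"
```

Under these assumptions, a SIGKILL without a preceding abort request is still reported as a failure.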

5
.gitignore vendored

@@ -11,3 +11,8 @@ build/
dist/
*.egg-info
# VSCode
.vscode
# MirrorD
.mirrord

149
README.md

@@ -2,14 +2,17 @@
<img src="https://github.com/allegroai/clearml-agent/blob/master/docs/clearml_agent_logo.png?raw=true" width="250px">
**ClearML Agent - ML-Ops made easy
ML-Ops scheduler & orchestration solution supporting Linux, macOS and Windows**
**ClearML Agent - MLOps/LLMOps made easy
MLOps/LLMOps scheduler & orchestration solution supporting Linux, macOS and Windows**
[![GitHub license](https://img.shields.io/github/license/allegroai/clearml-agent.svg)](https://img.shields.io/github/license/allegroai/clearml-agent.svg)
[![PyPI pyversions](https://img.shields.io/pypi/pyversions/clearml-agent.svg)](https://img.shields.io/pypi/pyversions/clearml-agent.svg)
[![PyPI version shields.io](https://img.shields.io/pypi/v/clearml-agent.svg)](https://img.shields.io/pypi/v/clearml-agent.svg)
[![PyPI Downloads](https://pepy.tech/badge/clearml-agent/month)](https://pypi.org/project/clearml-agent/)
[![Artifact Hub](https://img.shields.io/endpoint?url=https://artifacthub.io/badge/repository/allegroai)](https://artifacthub.io/packages/search?repo=allegroai)
`🌟 ClearML is open-source - Leave a star to support the project! 🌟`
</div>
---
@@ -24,8 +27,7 @@ ML-Ops scheduler & orchestration solution supporting Linux, macOS and Windows**
* Launch-and-Forget service containers
* [Cloud autoscaling](https://clear.ml/docs/latest/docs/guides/services/aws_autoscaler)
* [Customizable cleanup](https://clear.ml/docs/latest/docs/guides/services/cleanup_service)
*
Advanced [pipeline building and execution](https://clear.ml/docs/latest/docs/guides/frameworks/pytorch/notebooks/table/tabular_training_pipeline)
* Advanced [pipeline building and execution](https://clear.ml/docs/latest/docs/guides/frameworks/pytorch/notebooks/table/tabular_training_pipeline)
It is a zero configuration fire-and-forget execution agent, providing a full ML/DL cluster solution.
@@ -35,8 +37,8 @@ It is a zero configuration fire-and-forget execution agent, providing a full ML/
or [free tier hosting](https://app.clear.ml)
2. `pip install clearml-agent` ([install](#installing-the-clearml-agent) the ClearML Agent on any GPU machine:
on-premises / cloud / ...)
3. Create a [job](https://github.com/allegroai/clearml/docs/clearml-task.md) or
Add [ClearML](https://github.com/allegroai/clearml) to your code with just 2 lines
3. Create a [job](https://clear.ml/docs/latest/docs/apps/clearml_task) or
add [ClearML](https://github.com/allegroai/clearml) to your code with just 2 lines of code
4. Change the [parameters](#using-the-clearml-agent) in the UI & schedule for [execution](#using-the-clearml-agent) (or
automate with an [AutoML pipeline](#automl-and-orchestration-pipelines-))
5. :chart_with_downwards_trend: :chart_with_upwards_trend: :eyes: :beer:
@@ -66,36 +68,39 @@ or [Free tier Hosting](https://app.clear.ml)
### Kubernetes Integration (Optional)
We think Kubernetes is awesome, but it should be a choice. We designed `clearml-agent` so you can run bare-metal or
inside a pod with any mix that fits your environment.
We think Kubernetes is awesome, but it is not a must to get started with remote execution agents and cluster management.
We designed `clearml-agent` so you can run both bare-metal and on top of Kubernetes, in any combination that fits your environment.
Find Dockerfiles in the [docker](./docker) dir and a helm Chart in https://github.com/allegroai/clearml-helm-charts
You can find the Dockerfiles in the [docker folder](./docker) and the helm Chart in https://github.com/allegroai/clearml-helm-charts
#### Benefits of integrating existing K8s with ClearML-Agent
#### Benefits of integrating existing Kubernetes cluster with ClearML
- ClearML-Agent adds the missing scheduling capabilities to K8s
- Allowing for more flexible automation from code
- A programmatic interface for easier learning curve (and debugging)
- Seamless integration with ML/DL experiment manager
- ClearML-Agent adds the missing scheduling capabilities to your Kubernetes cluster
- Users do not need to have direct Kubernetes access!
- Easy learning curve with UI and CLI requiring no DevOps knowledge from end users
- Unlike other solutions, ClearML-Agents work in tandem with other customers of your Kubernetes cluster
- Allows for more flexible automation from code, building pipelines and visibility
- A programmatic interface for easy CI/CD workflows, enabling GitOps to trigger jobs inside your cluster
- Seamless integration with the ClearML ML/DL/GenAI experiment manager
- Web UI for customization, scheduling & prioritization of jobs
- **Enterprise Features**: RBAC, vault, multi-tenancy, scheduler, quota management, fractional GPU support
**Two K8s integration flavours**
**Run the agent in Kubernetes Glue mode and map ClearML jobs directly to K8s jobs:**
- Use the [ClearML Agent Helm Chart](https://github.com/allegroai/clearml-helm-charts/tree/main/charts/clearml-agent) to spin an agent pod acting as a controller
- Or run the [clearml-k8s glue](https://github.com/allegroai/clearml-agent/blob/master/examples/k8s_glue_example.py) on
a Kubernetes cpu node
- The clearml-k8s glue pulls jobs from the ClearML job execution queue and prepares a Kubernetes job (based on provided
yaml template)
- Inside each pod the clearml-agent will install the job (experiment) environment and spin and monitor the
experiment's process, fully visible in the clearml UI
- Benefits: Kubernetes full view of all running jobs in the system
- **Enterprise Features**
- Full scheduler features added on Top of Kubernetes, with quota/over-quota management, priorities and order.
- Fractional GPU support, allowing multiple isolated containers sharing the same GPU with memory/compute limit per container
- Spin ClearML-Agent as a long-lasting service pod
- use [clearml-agent](https://hub.docker.com/r/allegroai/clearml-agent) docker image
- map docker socket into the pod (soon replaced by [podman](https://github.com/containers/podman))
- allow the clearml-agent to manage sibling dockers
- benefits: full use of the ClearML scheduling, no need to worry about wrong container images / lost pods etc.
- downside: Sibling containers
- Kubernetes Glue, map ClearML jobs directly to K8s jobs
- Run the [clearml-k8s glue](https://github.com/allegroai/clearml-agent/blob/master/examples/k8s_glue_example.py) on
a K8s cpu node
- The clearml-k8s glue pulls jobs from the ClearML job execution queue and prepares a K8s job (based on provided
yaml template)
- Inside the pod itself the clearml-agent will install the job (experiment) environment and spin and monitor the
experiment's process
- benefits: Kubernetes full view of all running jobs in the system
- downside: No real scheduling (k8s scheduler), no docker image verification (post-mortem only)
### SLURM (Optional)
Yes! Slurm integration is available, check the [documentation](https://clear.ml/docs/latest/docs/clearml_agent/#slurm) for further details
### Using the ClearML Agent
@@ -110,15 +115,15 @@ A previously run experiment can be put into 'Draft' state by either of two metho
* Using the **'Reset'** action from the experiment right-click context menu in the ClearML UI - This will clear any
results and artifacts the previous run had created.
* Using the **'Clone'** action from the experiment right-click context menu in the ClearML UI - This will create a new '
Draft' experiment with the same configuration as the original experiment.
* Using the **'Clone'** action from the experiment right-click context menu in the ClearML UI - This will create a new
'Draft' experiment with the same configuration as the original experiment.
An experiment is scheduled for execution using the **'Enqueue'** action from the experiment right-click context menu in
the ClearML UI and selecting the execution queue.
See [creating an experiment and enqueuing it for execution](#from-scratch).
Once an experiment is enqueued, it will be picked up and executed by a ClearML agent monitoring this queue.
Once an experiment is enqueued, it will be picked up and executed by a ClearML Agent monitoring this queue.
The ClearML UI Workers & Queues page provides ongoing execution information:
@@ -170,22 +175,22 @@ clearml-agent init
```
Note: The ClearML Agent uses a cache folder to cache pip packages, apt packages and cloned repositories. The default
ClearML Agent cache folder is `~/.clearml`
ClearML Agent cache folder is `~/.clearml`.
See full details in your configuration file at `~/clearml.conf`
See full details in your configuration file at `~/clearml.conf`.
Note: The **ClearML agent** extends the **ClearML** configuration file `~/clearml.conf`
Note: The **ClearML Agent** extends the **ClearML** configuration file `~/clearml.conf`.
They are designed to share the same configuration file, see example [here](docs/clearml.conf)
#### Running the ClearML Agent
For debug and experimentation, start the ClearML agent in `foreground` mode, where all the output is printed to screen
For debug and experimentation, start the ClearML agent in `foreground` mode, where all the output is printed to screen:
```bash
clearml-agent daemon --queue default --foreground
```
For actual service mode, all the stdout will be stored automatically into a temporary file (no need to pipe)
For actual service mode, all the stdout will be stored automatically into a temporary file (no need to pipe).
Notice: with `--detached` flag, the *clearml-agent* will be running in the background
```bash
@@ -195,20 +200,21 @@ clearml-agent daemon --detached --queue default
GPU allocation is controlled via the standard OS environment `NVIDIA_VISIBLE_DEVICES` or `--gpus` flag (or disabled
with `--cpu-only`).
If no flag is set, and `NVIDIA_VISIBLE_DEVICES` variable doesn't exist, all GPU's will be allocated for
the `clearml-agent` <br>
If no flag is set, and `NVIDIA_VISIBLE_DEVICES` variable doesn't exist, all GPUs will be allocated for
the `clearml-agent`. <br>
If `--cpu-only` flag is set, or `NVIDIA_VISIBLE_DEVICES="none"`, no gpu will be allocated for
the `clearml-agent`
the `clearml-agent`.
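
The allocation rules above can be sketched as a small decision function. This is an illustrative sketch only: `resolve_gpus` is a hypothetical helper, not part of `clearml-agent`, and the precedence it applies when both `--gpus` and `NVIDIA_VISIBLE_DEVICES` are set is an assumption, since the text does not state it.

```python
from typing import Mapping, Optional


def resolve_gpus(gpus_flag: Optional[str], cpu_only: bool, environ: Mapping[str, str]) -> str:
    """Return 'none', 'all', or a comma-separated device list such as '0,1'.

    Hypothetical helper mirroring the rules described in the README text.
    """
    visible = environ.get("NVIDIA_VISIBLE_DEVICES")
    # --cpu-only or NVIDIA_VISIBLE_DEVICES="none": no GPU is allocated.
    if cpu_only or visible == "none":
        return "none"
    # Assumption: an explicit --gpus flag wins over the environment variable.
    if gpus_flag is not None:
        return gpus_flag
    if visible is not None:
        return visible
    # No flag and no environment variable: all GPUs are allocated.
    return "all"
```

For example, `resolve_gpus("0,1", False, {})` mirrors `--gpus 0,1`, while `resolve_gpus(None, False, {})` mirrors the default of allocating every GPU.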
Example: spin two agents, one per gpu on the same machine:
Notice: with `--detached` flag, the *clearml-agent* will be running in the background
Example: spin two agents, one per GPU on the same machine:
Notice: with `--detached` flag, the *clearml-agent* will run in the background
```bash
clearml-agent daemon --detached --gpus 0 --queue default
clearml-agent daemon --detached --gpus 1 --queue default
```
Example: spin two agents, pulling from dedicated `dual_gpu` queue, two gpu's per agent
Example: spin two agents, pulling from dedicated `dual_gpu` queue, two GPUs per agent
```bash
clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu
@@ -223,43 +229,43 @@ For debug and experimentation, start the ClearML agent in `foreground` mode, whe
clearml-agent daemon --queue default --docker --foreground
```
For actual service mode, all the stdout will be stored automatically into a file (no need to pipe)
Notice: with `--detached` flag, the *clearml-agent* will be running in the background
For actual service mode, all the stdout will be stored automatically into a file (no need to pipe).
Notice: with `--detached` flag, the *clearml-agent* will run in the background
```bash
clearml-agent daemon --detached --queue default --docker
```
Example: spin two agents, one per gpu on the same machine, with default nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
Example: spin two agents, one per gpu on the same machine, with default `nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04`
docker:
```bash
clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
clearml-agent daemon --detached --gpus 1 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04
clearml-agent daemon --detached --gpus 1 --queue default --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04
```
Example: spin two agents, pulling from dedicated `dual_gpu` queue, two gpu's per agent, with default nvidia/cuda:
10.1-cudnn7-runtime-ubuntu18.04 docker:
Example: spin two agents, pulling from dedicated `dual_gpu` queue, two GPUs per agent, with default
`nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04` docker:
```bash
clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
clearml-agent daemon --detached --gpus 2,3 --queue dual_gpu --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04
clearml-agent daemon --detached --gpus 2,3 --queue dual_gpu --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04
```
##### Starting the ClearML Agent - Priority Queues
Priority Queues are also supported, example use case:
High priority queue: `important_jobs` Low priority queue: `default`
High priority queue: `important_jobs`, low priority queue: `default`
```bash
clearml-agent daemon --queue important_jobs default
```
The **ClearML Agent** will first try to pull jobs from the `important_jobs` queue, only then it will fetch a job from
the `default` queue.
The **ClearML Agent** will first try to pull jobs from the `important_jobs` queue, and only if it is empty, the agent
will try to pull from the `default` queue.
Adding queues, managing job order within a queue and moving jobs between queues, is available using the Web UI, see
Adding queues, managing job order within a queue, and moving jobs between queues, is available using the Web UI, see
example on our [free server](https://app.clear.ml/workers-and-queues/queues)
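
The pull order described above can be sketched in a few lines of Python. This is only an illustration of the priority rule, not the agent's implementation: `pull_next_job` and the in-memory deques are hypothetical stand-ins for the queues that actually live on the server.

```python
from collections import deque


def pull_next_job(queues, order):
    """Return (queue_name, job) from the first non-empty queue in priority order."""
    for name in order:
        if queues[name]:
            return name, queues[name].popleft()
    return None  # every queue is empty


# Hypothetical in-memory stand-ins for the two queues
queues = {
    "important_jobs": deque(["train-a"]),
    "default": deque(["train-b", "train-c"]),
}
order = ["important_jobs", "default"]  # same order as on the command line

pulled = []
while True:
    item = pull_next_job(queues, order)
    if item is None:
        break
    pulled.append(item)

print(pulled)
```

The job in `important_jobs` is pulled first; `default` is only consulted once the higher-priority queue is empty.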
##### Stopping the ClearML Agent
@@ -268,7 +274,7 @@ To stop a **ClearML Agent** running in the background, run the same command line
appended. For example, to stop the first of the single-GPU agents shown above (both running on the same machine):
```bash
clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04 --stop
clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04 --stop
```
### How do I create an experiment on the ClearML Server? <a name="from-scratch"></a>
@@ -279,32 +285,33 @@ clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:10
- Git repository link and commit ID (or an entire jupyter notebook)
- Git diff (we're not saying you never commit and push, but still...)
- Python packages used by your code (including specific versions used)
- Hyper-Parameters
- Input Artifacts
- Hyperparameters
- Input artifacts
You now have a 'template' of your experiment with everything required for automated execution
* In the ClearML UI, Right-click on the experiment and select 'clone'. A copy of your experiment will be created.
* In the ClearML UI, right-click on the experiment and select 'clone'. A copy of your experiment will be created.
* You now have a new draft experiment cloned from your original experiment, feel free to edit it
- Change the Hyper-Parameters
- Change the hyperparameters
- Switch to the latest code base of the repository
- Update package versions
- Select a specific docker image to run in (see docker execution mode section)
- Or simply change nothing to run the same experiment again...
* Schedule the newly created experiment for execution: Right-click the experiment and select 'enqueue'
* Schedule the newly created experiment for execution: right-click the experiment and select 'enqueue'
### ClearML-Agent Services Mode <a name="services"></a>
ClearML-Agent Services is a special mode of ClearML-Agent that provides the ability to launch long-lasting jobs that
previously had to be executed on local / dedicated machines. It allows a single agent to launch multiple dockers (Tasks)
for different use cases. To name a few use cases, auto-scaler service (spinning instances when the need arises and the
budget allows), Controllers (Implementing pipelines and more sophisticated DevOps logic), Optimizer (such as
Hyper-parameter Optimization or sweeping), and Application (such as interactive Bokeh apps for increased data
transparency)
for different use cases:
* Auto-scaler service (spinning instances when the need arises and the budget allows)
* Controllers (Implementing pipelines and more sophisticated DevOps logic)
* Optimizer (such as Hyperparameter Optimization or sweeping)
* Application (such as interactive Bokeh apps for increased data transparency)
ClearML-Agent Services mode will spin **any** task enqueued into the specified queue. Every task launched by
ClearML-Agent Services will be registered as a new node in the system, providing tracking and transparency capabilities.
Currently clearml-agent in services-mode supports cpu only configuration. ClearML-agent services mode can be launched
Currently, clearml-agent in services-mode supports CPU only configuration. ClearML-Agent services mode can be launched
alongside GPU agents.
```bash
@@ -321,15 +328,15 @@ ClearML package.
Sample AutoML & Orchestration examples can be found in the
ClearML [example/automation](https://github.com/allegroai/clearml/tree/master/examples/automation) folder.
AutoML examples
AutoML examples:
- [Toy Keras training experiment](https://github.com/allegroai/clearml/blob/master/examples/optimization/hyper-parameter-optimization/base_template_keras_simple.py)
- In order to create an experiment-template in the system, this code must be executed once manually
- [Random Search over the above Keras experiment-template](https://github.com/allegroai/clearml/blob/master/examples/automation/manual_random_param_search_example.py)
- This example will create multiple copies of the Keras experiment-template, with different hyper-parameter
- This example will create multiple copies of the Keras experiment-template, with different hyperparameter
combinations
Experiment Pipeline examples
Experiment Pipeline examples:
- [First step experiment](https://github.com/allegroai/clearml/blob/master/examples/automation/task_piping_example.py)
- This example will "process data", and once done, will launch a copy of the 'second step' experiment-template


@@ -45,8 +45,8 @@
# it solves passing user/token to git submodules.
# this is a safer way to ensure multiple users using the same repository will
# not accidentally leak credentials
# Only supported on Linux systems, it will be the default in future releases
# enable_git_ask_pass: false
# Note: this is only supported on Linux systems
# enable_git_ask_pass: true
# in docker mode, if container's entrypoint automatically activated a virtual environment
# use the activated virtual environment and install everything there
@@ -66,7 +66,7 @@
type: pip,
# specify pip version to use (examples "<20.2", "==19.3.1", "", empty string will install the latest version)
pip_version: ["<20.2 ; python_version < '3.10'", "<22.3 ; python_version >= '3.10'"],
pip_version: ["<20.2 ; python_version < '3.10'", "<22.3 ; python_version >= '3.10' and python_version <= '3.11'", ">=23,<24.3 ; python_version >= '3.12'"]
# specify poetry version to use (examples "<2", "==1.1.1", "", empty string will install the latest version)
# poetry_version: "<2",
# poetry_install_extra_args: ["-v"]
@@ -80,27 +80,42 @@
# additional artifact repositories to use when installing python packages
# extra_index_url: ["https://allegroai.jfrog.io/clearml/api/pypi/public/simple"]
# control the pytorch wheel resolving algorithm, options are: "pip", "direct", "none"
# Override with environment variable CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE
# "pip" (default): would automatically detect the cuda version, and supply pip with the correct
# extra-index-url, based on pytorch.org tables
# "direct": would resolve a direct link to the pytorch wheel by parsing the pytorch.org pip repository
# and matching the automatically detected cuda version with the required pytorch wheel.
# if the exact cuda version is not found for the required pytorch wheel, it will try
# a lower cuda version until a match is found
# "none": No resolver used, install pytorch like any other package
# pytorch_resolve: "pip"
# additional conda channels to use when installing with conda package manager
conda_channels: ["pytorch", "conda-forge", "defaults", ]
conda_channels: ["pytorch", "conda-forge", "nvidia", "defaults", ]
# If set to true, Task's "installed packages" are ignored,
# and the repository's "requirements.txt" is used instead
# force_repo_requirements_txt: false
# set the priority packages to be installed before the rest of the required packages
# Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
# priority_packages: ["cython", "numpy", "setuptools", ]
# set the optional priority packages to be installed before the rest of the required packages,
# In case a package installation fails, the package will be ignored,
# and the virtual environment process will continue
# Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
priority_optional_packages: ["pygobject", ]
# set the post packages to be installed after all the rest of the required packages
# Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
# post_packages: ["horovod", ]
# set the optional post packages to be installed after all the rest of the required packages,
# In case a package installation fails, the package will be ignored,
# and the virtual environment process will continue
# Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
# post_optional_packages: []
# set to True to support torch nightly build installation,
@@ -162,6 +177,13 @@
# these are local for this agent and will not be updated in the experiment's docker_cmd section
# extra_docker_arguments: ["--ipc=host", ]
# Allow the extra docker arg to override task level docker arg (if the same argument is passed on both),
# if set to False, a task docker arg will override the docker extra arg
# docker_args_extra_precedes_task: true
# allows the following task docker args to be overridden by the extra_docker_arguments
# protected_docker_extra_args: ["privileged", "security-opt", "network", "ipc"]
# optional shell script to run in docker when started before the experiment is started
# extra_docker_shell_script: ["apt-get install -y bindfs", ]
@@ -192,10 +214,80 @@
default_docker: {
# default docker image to use when running in docker mode
image: "nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04"
image: "nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04"
# optional arguments to pass to docker image
# arguments: ["--ipc=host", ]
# Choose the default docker based on the Task properties,
# Notice: Enterprise feature, ignored otherwise
# Examples: 'script.requirements', 'script.binary', 'script.repository', 'script.branch', 'project'
# Notice: Matching is done via regular expression, for example "^searchme$" will match exactly "searchme" string
"match_rules": [
{
"image": "python:3.6-bullseye",
"arguments": "--ipc=host",
"match": {
"script": {
"binary": "python3.6$",
},
}
},
{
"image": "python:3.7-bullseye",
"arguments": "--ipc=host",
"match": {
"script": {
"binary": "python3.7$",
},
}
},
{
"image": "python:3.8-bullseye",
"arguments": "--ipc=host",
"match": {
"script": {
"binary": "python3.8$",
},
}
},
{
"image": "python:3.9-bullseye",
"arguments": "--ipc=host",
"match": {
"script": {
"binary": "python3.9$",
},
}
},
{
"image": "python:3.10-bullseye",
"arguments": "--ipc=host",
"match": {
"script": {
"binary": "python3.10$",
},
}
},
{
"image": "python:3.11-bullseye",
"arguments": "--ipc=host",
"match": {
"script": {
"binary": "python3.11$",
},
}
},
{
"image": "python:3.12-bullseye",
"arguments": "--ipc=host",
"match": {
"script": {
"binary": "python3.12$",
},
}
},
]
}
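
A rough sketch of how regular-expression rules like the ones above could select an image from a Task's Python binary. This is an assumption-laden illustration: `pick_image` is a hypothetical helper, the rule list is a shortened copy of the configuration, and the real resolver also matches on other Task fields.

```python
import re

# Shortened, hypothetical copy of the "match_rules" structure shown above
match_rules = [
    {"image": "python:3.9-bullseye", "arguments": "--ipc=host",
     "match": {"script": {"binary": "python3.9$"}}},
    {"image": "python:3.10-bullseye", "arguments": "--ipc=host",
     "match": {"script": {"binary": "python3.10$"}}},
]


def pick_image(binary, rules, default_image):
    """Return the image of the first rule whose regex matches the binary name."""
    for rule in rules:
        pattern = rule["match"]["script"]["binary"]
        if re.search(pattern, binary):
            return rule["image"]
    return default_image


default_image = "nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04"
print(pick_image("/usr/bin/python3.10", match_rules, default_image))
print(pick_image("python2.7", match_rules, default_image))
```

The trailing `$` anchors the pattern to the end of the string, so `python3.9$` is not satisfied by a binary named `python3.10`.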
# set the OS environments based on the Task's Environment section before launching the Task process.
@@ -227,6 +319,20 @@
# cuda_version: 10.1
# cudnn_version: 7.6
# Sanitize configuration printout using these settings
sanitize_config_printout {
# Hide values of configuration keys matching these regexps
hide_secrets: ["^sanitize_config_printout$", "secret", "pass", "token", "account_key", "contents"]
# As above, only show field's value keys if value is a dictionary
hide_secrets_recursive: ["^environment$"]
# Do not hide for keys matching these regexps
dont_hide_secrets: ["^enable_git_ask_pass$"]
# Hide secrets in docker commands, according to the 'agent.hide_docker_command_env_vars' settings
docker_commands: ["^extra_docker_arguments$"]
# Hide password in URLs found in keys matching these regexps (handles single URLs, lists and dictionaries)
urls: ["^extra_index_url$"]
}
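
To make the interplay of `hide_secrets` and `dont_hide_secrets` concrete, here is a minimal sketch, assuming keys are tested with `re.search` and that nested dictionaries are always descended into. The `sanitize` function and the mask string are hypothetical and simpler than whatever the agent does internally.

```python
import re

# Patterns taken from the settings above (the self-referencing pattern is omitted)
HIDE_SECRETS = ["secret", "pass", "token", "account_key", "contents"]
DONT_HIDE_SECRETS = ["^enable_git_ask_pass$"]


def sanitize(config, hide=HIDE_SECRETS, keep=DONT_HIDE_SECRETS, mask="****"):
    """Return a copy of config with values of secret-looking keys masked."""
    result = {}
    for key, value in config.items():
        if isinstance(value, dict):
            # Assumption: always recurse into nested sections
            result[key] = sanitize(value, hide, keep, mask)
        elif any(re.search(p, key) for p in hide) and not any(re.search(p, key) for p in keep):
            result[key] = mask
        else:
            result[key] = value
    return result


sample = {
    "git_pass": "hunter2",
    "enable_git_ask_pass": True,
    "api": {"secret_key": "abc", "host": "https://example.invalid"},
}
print(sanitize(sample))
```

`enable_git_ask_pass` contains the substring `pass`, which is why the configuration lists it explicitly under `dont_hide_secrets`.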
# Hide docker environment variables containing secrets when printing out the docker command by replacing their
# values with "********". Turning this feature on will hide the following environment variables values:
# CLEARML_API_SECRET_KEY, CLEARML_AGENT_GIT_PASS, AWS_SECRET_ACCESS_KEY, AZURE_STORAGE_KEY
@@ -253,16 +359,22 @@
pip_cache: "/root/.cache/pip"
poetry_cache: "/root/.cache/pypoetry"
vcs_cache: "/root/.clearml/vcs-cache"
venvs_cache: "/root/.clearml/venvs-cache"
venv_build: "~/.clearml/venvs-builds"
pip_download: "/root/.clearml/pip-download-cache"
}
# Name docker containers created by the daemon using the following string format (supported from Docker 0.6.5)
# Allowed variables are task_id, worker_id and rand_string (random lower-case letters string, up to 32 characters)
# Custom variables may be specified using the docker_container_name_format_fields option.
# Note: resulting name must start with an alphanumeric character and
# continue with alphanumeric characters, underscores (_), dots (.) and/or dashes (-)
# docker_container_name_format: "clearml-id-{task_id}-{rand_string:.8}"
# Specify custom variables for the docker_container_name_format option using a mapping of variable name
# to a (nested) task field (using "." as a task field separator, digits specify array index)
# docker_container_name_format_fields: { foo: "bar.moo" }
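
A small sketch of how such a name format expands, assuming the option follows Python's `str.format` rules (the `{rand_string:.8}` precision syntax suggests so). The task ID and worker ID below are made-up values.

```python
import random
import string

name_format = "clearml-id-{task_id}-{rand_string:.8}"

# Random lower-case letters, up to 32 characters as described above
rand_string = "".join(random.choice(string.ascii_lowercase) for _ in range(32))

# Unused variables (worker_id here) are simply ignored by str.format
name = name_format.format(task_id="abc123", worker_id="host:gpu0", rand_string=rand_string)
print(name)
```

The `.8` precision truncates the random string to its first eight characters, which keeps container names short while still making collisions unlikely.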
# Apply top-level environment section from configuration into os.environ
apply_environment: true
# Top-level environment section is in the form of:
@@ -283,6 +395,8 @@
# target_format: format used to encode contents before writing into the target file. Supported values are json,
# yaml, yml and bytes (in which case the file will be written in binary mode). Default is text mode.
# overwrite: overwrite the target file in case it exists. Default is true.
# mode: file-system mode to be applied to the file after its creation. The mode string will be parsed into an
# integer (e.g. "0o777" for -rwxrwxrwx)
#
# Example:
# files {
@@ -348,7 +462,7 @@
# Notice: Matching is done via regular expression, for example "^searchme$" will match exactly "searchme" string
#
# "default_docker": {
# "image": "nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04",
# "image": "nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04",
# # optional arguments to pass to docker image
# # arguments: ["--ipc=host", ]
# "match_rules": [
@@ -369,13 +483,6 @@
# }
# },
# {
# "image": "better_container:tag",
# "arguments": "",
# "match": {
# "container": "replace_me_please"
# }
# },
# {
# "image": "another_container:tag",
# "arguments": "",
# "match": {


@@ -3,7 +3,7 @@
storage {
cache {
# Defaults to system temp folder / cache
# Defaults to <system_temp_folder>/clearml_cache
default_base_dir: "~/.clearml/cache"
size {
# max_used_bytes = -1
@@ -140,7 +140,7 @@
vcs_repo_detect_async: true
# Store uncommitted git/hg source code diff in experiment manifest when training in development mode
# This stores "git diff" or "hg diff" into the experiment's "script.requirements.diff" section
# This stores "git diff" into the experiment's "script.requirements.diff" section
store_uncommitted_code_diff: true
# Support stopping an experiment in case it was externally stopped, status was changed or task was reset


@@ -1,5 +1,5 @@
from ...backend_config.converters import safe_text_to_bool
from ...backend_config.environment import EnvEntry
from clearml_agent.helper.environment import EnvEntry
from clearml_agent.helper.environment.converters import safe_text_to_bool
ENV_HOST = EnvEntry("CLEARML_API_HOST", "TRAINS_API_HOST")
@@ -11,6 +11,7 @@ ENV_AUTH_TOKEN = EnvEntry("CLEARML_AUTH_TOKEN")
ENV_VERBOSE = EnvEntry("CLEARML_API_VERBOSE", "TRAINS_API_VERBOSE", type=bool, default=False)
ENV_HOST_VERIFY_CERT = EnvEntry("CLEARML_API_HOST_VERIFY_CERT", "TRAINS_API_HOST_VERIFY_CERT", type=bool, default=True)
ENV_CONDA_ENV_PACKAGE = EnvEntry("CLEARML_CONDA_ENV_PACKAGE", "TRAINS_CONDA_ENV_PACKAGE")
ENV_USE_CONDA_BASE_ENV = EnvEntry("CLEARML_USE_CONDA_BASE_ENV", type=bool)
ENV_NO_DEFAULT_SERVER = EnvEntry("CLEARML_NO_DEFAULT_SERVER", "TRAINS_NO_DEFAULT_SERVER", type=bool, default=True)
ENV_DISABLE_VAULT_SUPPORT = EnvEntry('CLEARML_AGENT_DISABLE_VAULT_SUPPORT', type=bool)
ENV_ENABLE_ENV_CONFIG_SECTION = EnvEntry('CLEARML_AGENT_ENABLE_ENV_CONFIG_SECTION', type=bool)
@@ -20,6 +21,10 @@ ENV_PROPAGATE_EXITCODE = EnvEntry("CLEARML_AGENT_PROPAGATE_EXITCODE", type=bool,
ENV_INITIAL_CONNECT_RETRY_OVERRIDE = EnvEntry(
'CLEARML_AGENT_INITIAL_CONNECT_RETRY_OVERRIDE', default=True, converter=safe_text_to_bool
)
ENV_FORCE_MAX_API_VERSION = EnvEntry("CLEARML_AGENT_FORCE_MAX_API_VERSION", type=str)
# values are 0/None (task per node), 1/2 (multi-node reporting, colored console), -1 (only report rank 0 node)
ENV_MULTI_NODE_SINGLE_TASK = EnvEntry("CLEARML_MULTI_NODE_SINGLE_TASK", type=int, default=None)
"""
Experimental option to set the request method for all API requests and auth login.


@@ -16,11 +16,11 @@ from requests.auth import HTTPBasicAuth
from six.moves.urllib.parse import urlparse, urlunparse
from clearml_agent.external.pyhocon import ConfigTree, ConfigFactory
from .callresult import CallResult
from .defs import (
ENV_VERBOSE, ENV_HOST, ENV_ACCESS_KEY, ENV_SECRET_KEY, ENV_WEB_HOST, ENV_FILES_HOST, ENV_AUTH_TOKEN,
ENV_NO_DEFAULT_SERVER, ENV_DISABLE_VAULT_SUPPORT, ENV_INITIAL_CONNECT_RETRY_OVERRIDE, ENV_API_DEFAULT_REQ_METHOD, )
ENV_NO_DEFAULT_SERVER, ENV_DISABLE_VAULT_SUPPORT, ENV_INITIAL_CONNECT_RETRY_OVERRIDE, ENV_API_DEFAULT_REQ_METHOD,
ENV_FORCE_MAX_API_VERSION)
from .request import Request, BatchRequest
from .token_manager import TokenManager
from ..config import load
@@ -28,7 +28,6 @@ from ..utils import get_http_session_with_retry, urllib_log_warning_setup
from ...backend_config.environment import backward_compatibility_support
from ...version import __version__
sys_random = SystemRandom()
@@ -64,6 +63,9 @@ class Session(TokenManager):
default_files = "https://demofiles.demo.clear.ml"
default_key = "EGRTCO8JMSIGI6S39GTP43NFWXDQOW"
default_secret = "x!XTov_G-#vspE*Y(h$Anm&DIc5Ou-F)jsl$PdOyj5wG1&E!Z8"
force_max_api_version = ENV_FORCE_MAX_API_VERSION.get()
server_version = "1.0.0"
user_id = None
# TODO: add requests.codes.gateway_timeout once we support async commits
_retry_codes = [
@@ -191,6 +193,8 @@ class Session(TokenManager):
Session.api_version = str(api_version)
Session.feature_set = str(token_dict.get('feature_set', self.feature_set) or "basic")
Session.server_version = token_dict.get('server_version', self.server_version)
Session.user_id = (token_dict.get("identity") or {}).get("user") or ""
except (jwt.DecodeError, ValueError):
pass
@@ -199,6 +203,12 @@ class Session(TokenManager):
# notice: this is across the board warning omission
urllib_log_warning_setup(total_retries=http_retries_config.get('total', 0), display_warning_after=3)
if self.force_max_api_version and self.check_min_api_version(self.force_max_api_version):
print("Using forced API version {}".format(self.force_max_api_version))
Session.max_api_version = Session.api_version = str(self.force_max_api_version)
self.pre_vault_config = None
def _setup_session(self, http_retries_config, initial_session=False, default_initial_connect_override=None):
# type: (dict, bool, Optional[bool]) -> (dict, requests.Session)
http_retries_config = http_retries_config or self.config.get(
@@ -250,7 +260,12 @@ class Session(TokenManager):
def parse(vault):
# noinspection PyBroadException
try:
d = vault.get('data', None)
print("Loaded {} vault{}: {}".format(
vault.get("scope", ""),
"" if not self.user_id else " for user {}".format(self.user_id),
(vault.get("description", None) or "")[:50] or vault.get("id", ""))
)
d = vault.get("data", None)
if d:
r = ConfigFactory.parse_string(d)
if isinstance(r, (ConfigTree, dict)):
@@ -266,6 +281,7 @@ class Session(TokenManager):
vaults = res.json().get("data", {}).get("vaults", [])
data = list(filter(None, map(parse, vaults)))
if data:
self.pre_vault_config = self.config.copy()
self.config.set_overrides(*data)
return True
elif res.status_code != 404:
@@ -330,11 +346,12 @@ class Session(TokenManager):
if self._propagate_exceptions_on_send:
raise
sleep_time = sys_random.uniform(*self._request_exception_retry_timeout)
self._logger.error(
"{} exception sending {} {}: {} (retrying in {:.1f}sec)".format(
type(ex).__name__, method.upper(), url, str(ex), sleep_time
if self._logger:
self._logger.error(
"{} exception sending {} {}: {} (retrying in {:.1f}sec)".format(
type(ex).__name__, method.upper(), url, str(ex), sleep_time
)
)
)
time.sleep(sleep_time)
continue
@@ -353,11 +370,12 @@ class Session(TokenManager):
res.status_code == requests.codes.service_unavailable
and self.config.get("api.http.wait_on_maintenance_forever", True)
):
self._logger.warning(
"Service unavailable: {} is undergoing maintenance, retrying...".format(
host
if self._logger:
self._logger.warning(
"Service unavailable: {} is undergoing maintenance, retrying...".format(
host
)
)
)
continue
break
self._session_requests += 1
@@ -638,11 +656,14 @@ class Session(TokenManager):
"""
Return True if Session.api_version is greater than or equal to min_api_version
"""
def version_tuple(v):
v = tuple(map(int, (v.split("."))))
return v + (0,) * max(0, 3 - len(v))
return version_tuple(cls.api_version) >= version_tuple(str(min_api_version))
@classmethod
def check_min_server_version(cls, min_server_version):
"""
Return True if Session.server_version is greater than or equal to min_server_version
"""
return version_tuple(cls.server_version) >= version_tuple(str(min_server_version))
def _do_refresh_token(self, current_token, exp=None):
""" TokenManager abstract method implementation.
Here we ignore the old token and simply obtain a new token.
@@ -720,3 +741,8 @@ class Session(TokenManager):
def propagate_exceptions_on_send(self, value):
# type: (bool) -> None
self._propagate_exceptions_on_send = value
def version_tuple(v):
v = tuple(map(int, (v.split("."))))
return v + (0,) * max(0, 3 - len(v))
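
The reason for comparing padded integer tuples rather than raw strings can be shown with a short standalone sketch. The function body is the same as `version_tuple` in the hunk above; the sample version strings are illustrative.

```python
def version_tuple(v):
    # Split "2.13" into integers and pad to three components: (2, 13, 0)
    v = tuple(map(int, v.split(".")))
    return v + (0,) * max(0, 3 - len(v))


# A plain string comparison orders versions lexicographically, which is wrong here
string_says_newer = "2.13" >= "2.9"
tuple_says_newer = version_tuple("2.13") >= version_tuple("2.9")

print(version_tuple("2.13"), string_says_newer, tuple_says_newer)
```

Padding also makes `"1.0"` and `"1.0.0"` compare as equal, so a server reporting a two-part version is treated consistently.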


@@ -86,7 +86,10 @@ def get_http_session_with_retry(
session = requests.Session()
if backoff_max is not None:
Retry.BACKOFF_MAX = backoff_max
if "BACKOFF_MAX" in vars(Retry):
Retry.BACKOFF_MAX = backoff_max
else:
Retry.DEFAULT_BACKOFF_MAX = backoff_max
retry = Retry(
total=total, connect=connect, read=read, redirect=redirect, status=status,


@@ -297,6 +297,9 @@ class Config(object):
def put(self, key, value):
self._config.put(key, value)
def pop(self, key, default=None):
return self._config.pop(key, default=default)
def to_dict(self):
return self._config.as_plain_ordered_dict()


@@ -1,69 +1,8 @@
import base64
from distutils.util import strtobool
from typing import Union, Optional, Any, TypeVar, Callable, Tuple
import six
try:
from typing import Text
except ImportError:
# windows conda-less hack
Text = Any
ConverterType = TypeVar("ConverterType", bound=Callable[[Any], Any])
def text_to_int(value, default=0):
# type: (Any, int) -> int
try:
return int(value)
except (ValueError, TypeError):
return default
def base64_to_text(value):
# type: (Any) -> Text
return base64.b64decode(value).decode("utf-8")
def text_to_bool(value):
# type: (Text) -> bool
return bool(strtobool(value))
def safe_text_to_bool(value):
# type: (Text) -> bool
try:
return text_to_bool(value)
except ValueError:
return bool(value)
def any_to_bool(value):
# type: (Optional[Union[int, float, Text]]) -> bool
if isinstance(value, six.text_type):
return text_to_bool(value)
return bool(value)
def or_(*converters, **kwargs):
# type: (ConverterType, Tuple[Exception, ...]) -> ConverterType
"""
Wrapper that implements an "optional converter" pattern. Allows specifying a converter
for which a set of exceptions is ignored (and the original value is returned)
:param converters: A converter callable
:param exceptions: A tuple of exception types to ignore
"""
# noinspection PyUnresolvedReferences
exceptions = kwargs.get("exceptions", (ValueError, TypeError))
def wrapper(value):
for converter in converters:
try:
return converter(value)
except exceptions:
pass
return value
return wrapper
from clearml_agent.helper.environment.converters import (
base64_to_text,
text_to_bool,
text_to_int,
safe_text_to_bool,
any_to_bool,
or_,
)


@@ -1,111 +1,6 @@
import abc
from typing import Optional, Any, Tuple, Callable, Dict
from clearml_agent.helper.environment import Entry, NotSet
import six
from .converters import any_to_bool
try:
from typing import Text
except ImportError:
# windows conda-less hack
Text = Any
NotSet = object()
Converter = Callable[[Any], Any]
@six.add_metaclass(abc.ABCMeta)
class Entry(object):
"""
Configuration entry definition
"""
@classmethod
def default_conversions(cls):
# type: () -> Dict[Any, Converter]
return {
bool: any_to_bool,
six.text_type: lambda s: six.text_type(s).strip(),
}
def __init__(self, key, *more_keys, **kwargs):
# type: (Text, Text, Any) -> None
"""
:param key: Entry's key (at least one).
:param more_keys: More alternate keys for this entry.
:param type: Value type. If provided, will be used choosing a default conversion or
(if none exists) for casting the environment value.
:param converter: Value converter. If provided, will be used to convert the environment value.
:param default: Default value. If provided, will be used as the default value on calls to get() and get_pair()
in case no value is found for any key and no specific default value was provided in the call.
Default value is None.
:param help: Help text describing this entry
"""
self.keys = (key,) + more_keys
self.type = kwargs.pop("type", six.text_type)
self.converter = kwargs.pop("converter", None)
self.default = kwargs.pop("default", None)
self.help = kwargs.pop("help", None)
def __str__(self):
return str(self.key)
@property
def key(self):
return self.keys[0]
def convert(self, value, converter=None):
# type: (Any, Converter) -> Optional[Any]
converter = converter or self.converter
if not converter:
converter = self.default_conversions().get(self.type, self.type)
return converter(value)
def get_pair(self, default=NotSet, converter=None, value_cb=None):
# type: (Any, Converter, Callable[[str, Any], None]) -> Optional[Tuple[Text, Any]]
for key in self.keys:
value = self._get(key)
if value is NotSet:
continue
try:
value = self.convert(value, converter)
except Exception as ex:
self.error("invalid value {key}={value}: {ex}".format(**locals()))
break
# noinspection PyBroadException
try:
if value_cb:
value_cb(key, value)
except Exception:
pass
return key, value
result = self.default if default is NotSet else default
return self.key, result
def get(self, default=NotSet, converter=None, value_cb=None):
# type: (Any, Converter, Callable[[str, Any], None]) -> Optional[Any]
return self.get_pair(default=default, converter=converter, value_cb=value_cb)[1]
def set(self, value):
# type: (Any, Any) -> (Text, Any)
# key, _ = self.get_pair(default=None, converter=None)
for k in self.keys:
self._set(k, str(value))
def _set(self, key, value):
# type: (Text, Text) -> None
pass
@abc.abstractmethod
def _get(self, key):
# type: (Text) -> Any
pass
@abc.abstractmethod
def error(self, message):
# type: (Text) -> None
pass
__all__ = [
"Entry",
"NotSet"
]


@@ -1,32 +1,6 @@
from os import getenv, environ
from os import environ
from .converters import text_to_bool
from .entry import Entry, NotSet
class EnvEntry(Entry):
@classmethod
def default_conversions(cls):
conversions = super(EnvEntry, cls).default_conversions().copy()
conversions[bool] = text_to_bool
return conversions
def pop(self):
for k in self.keys:
environ.pop(k, None)
def _get(self, key):
value = getenv(key, "").strip()
return value or NotSet
def _set(self, key, value):
environ[key] = value
def __str__(self):
return "env:{}".format(super(EnvEntry, self).__str__())
def error(self, message):
print("Environment configuration: {}".format(message))
from clearml_agent.helper.environment import EnvEntry
def backward_compatibility_support():
@@ -34,6 +8,7 @@ def backward_compatibility_support():
if ENVIRONMENT_BACKWARD_COMPATIBLE.get():
# Add TRAINS_ prefix on every CLEARML_ os environment we support
for k, v in ENVIRONMENT_CONFIG.items():
# noinspection PyBroadException
try:
trains_vars = [var for var in v.vars if var.startswith('CLEARML_')]
if not trains_vars:
@@ -44,6 +19,7 @@ def backward_compatibility_support():
except:
continue
for k, v in ENVIRONMENT_SDK_PARAMS.items():
# noinspection PyBroadException
try:
trains_vars = [var for var in v if var.startswith('CLEARML_')]
if not trains_vars:
@@ -62,3 +38,9 @@ def backward_compatibility_support():
backwards_k = k.replace('CLEARML_', 'TRAINS_', 1)
if backwards_k not in keys:
environ[backwards_k] = environ[k]
__all__ = [
"EnvEntry",
"backward_compatibility_support"
]


@@ -31,7 +31,8 @@ def apply_environment(config):
keys = list(filter(None, env_vars.keys()))
for key in keys:
os.environ[str(key)] = str(env_vars[key] or "")
value = env_vars[key]
os.environ[str(key)] = str(value if value is not None else "")
return keys
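
The practical effect of the change above is easiest to see side by side. In this sketch, `old_style` and `new_style` are hypothetical names for the expression before and after the change.

```python
def old_style(value):
    # Any falsy value (0, False, "", None) collapses to an empty string
    return str(value or "")


def new_style(value):
    # Only None becomes an empty string; 0 and False are preserved
    return str(value if value is not None else "")


for sample in (0, False, None, "text"):
    print(repr(sample), repr(old_style(sample)), repr(new_style(sample)))
```

With the old expression, a configuration value of `0` was exported as an empty environment variable; the new expression exports it as `"0"`.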
@@ -52,6 +53,7 @@ def apply_files(config):
target_fmt = data.get("target_format", "string")
overwrite = bool(data.get("overwrite", True))
contents = data.get("contents")
mode = data.get("mode", None)
target = Path(expanduser(expandvars(path)))
@@ -110,3 +112,14 @@ def apply_files(config):
except Exception as ex:
print("Skipped [{}]: failed saving file {} ({})".format(key, target, ex))
continue
try:
if mode:
if isinstance(mode, int):
mode = int(str(mode), 8)
else:
mode = int(mode, 8)
target.chmod(mode)
except Exception as ex:
print("Skipped [{}]: failed setting mode {} for {} ({})".format(key, mode, target, ex))
continue
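
A standalone sketch of the mode parsing above, assuming the value arrives either as a string such as `"0o777"` or as an integer whose decimal digits are meant to be read as octal (for example `755`). `parse_mode` is a hypothetical wrapper around the same two `int(..., 8)` calls.

```python
def parse_mode(mode):
    """Interpret a configured file mode as octal permission bits."""
    if isinstance(mode, int):
        # An integer like 755 is re-read digit by digit as octal
        return int(str(mode), 8)
    # Strings may carry an explicit "0o" prefix, which int() accepts for base 8
    return int(mode, 8)


for sample in ("0o777", "644", 755):
    print(repr(sample), oct(parse_mode(sample)))
```

One consequence of this approach is that an integer containing the digit 8 or 9 cannot be parsed and raises `ValueError`, which the surrounding `try` block in the hunk reports as a skipped entry.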


@@ -44,7 +44,7 @@ def main():
if conf_file.exists() and conf_file.is_file() and conf_file.stat().st_size > 0:
print('Configuration file already exists: {}'.format(str(conf_file)))
print('Leaving setup, feel free to edit the configuration file.')
print('Leaving setup. If you\'ve previously initialized the ClearML SDK on this machine, manually add an \'agent\' section to this file.')
return
print(description, end='')


@@ -2,6 +2,7 @@ from __future__ import print_function
import json
import time
from typing import List, Tuple
from clearml_agent.commands.base import ServiceCommandSection
from clearml_agent.helper.base import return_list
@@ -57,6 +58,42 @@ class Events(ServiceCommandSection):
# print('Sending events done: %d / %d events sent' % (sent_events, len(list_events)))
return sent_events
def send_log_events_with_timestamps(
self, worker_id, task_id, lines_with_ts: List[Tuple[str, str]], level="DEBUG", session=None
):
log_events = []
# break log lines into event packets
for ts, line in return_list(lines_with_ts):
# HACK ignore terminal reset ANSI code
if line == '\x1b[0m':
continue
while line:
if len(line) <= self.max_event_size:
msg = line
line = None
else:
msg = line[:self.max_event_size]
line = line[self.max_event_size:]
log_events.append(
{
"type": "log",
"level": level,
"task": task_id,
"worker": worker_id,
"msg": msg,
"timestamp": ts,
}
)
if line and ts is not None:
# advance timestamp in case we break a line to more than one part
ts += 1
# now send the events
return self.send_events(list_events=log_events, session=session)
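
The chunking loop in the method above can be isolated into a small function for illustration. This sketch assumes integer millisecond timestamps (the loop increments `ts`, so it must be numeric) and uses a hypothetical `split_log_line` helper with a tiny event size for readability.

```python
def split_log_line(ts, line, max_event_size):
    """Break one log line into event-sized parts with increasing timestamps."""
    events = []
    while line:
        msg, line = line[:max_event_size], line[max_event_size:]
        events.append({"msg": msg, "timestamp": ts})
        if line and ts is not None:
            # Advance the timestamp so the parts keep their original order
            ts += 1
    return events


events = split_log_line(1000, "abcdefghij", max_event_size=4)
print(events)
```

Incrementing the timestamp by one per part keeps the fragments of a long line ordered when the server sorts events by time.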
def send_log_events(self, worker_id, task_id, lines, level='DEBUG', session=None):
log_events = []
base_timestamp = int(time.time() * 1000)


@@ -1,14 +1,16 @@
import json
import re
import shlex
from copy import copy
from clearml_agent.backend_api.session import Request
from clearml_agent.helper.docker_args import DockerArgsSanitizer
from clearml_agent.helper.package.requirements import (
RequirementsManager, MarkerRequirement,
compare_version_rules, )
def resolve_default_container(session, task_id, container_config):
def resolve_default_container(session, task_id, container_config, ignore_match_rules=False):
container_lookup = session.config.get('agent.default_docker.match_rules', None)
if not session.check_min_api_version("2.13") or not container_lookup:
return container_config
@@ -17,6 +19,12 @@ def resolve_default_container(session, task_id, container_config):
try:
session.verify_feature_set('advanced')
except ValueError:
# ignoring matching rules only supported in higher tiers
return container_config
if ignore_match_rules:
print("INFO: default docker command line override, ignoring default docker container match rules")
# ignoring matching rules only supported in higher tiers
return container_config
result = session.send_request(
@@ -109,15 +117,15 @@ def resolve_default_container(session, task_id, container_config):
match.get('script.binary', None), entry))
continue
if match.get('container', None):
# noinspection PyBroadException
try:
if not re.search(match.get('container', None), requested_container.get('image', '')):
continue
except Exception:
print('Failed parsing regular expression \"{}\" in rule: {}'.format(
match.get('container', None), entry))
continue
# if match.get('image', None):
# # noinspection PyBroadException
# try:
# if not re.search(match.get('image', None), requested_container.get('image', '')):
# continue
# except Exception:
# print('Failed parsing regular expression \"{}\" in rule: {}'.format(
# match.get('image', None), entry))
# continue
matched = True
for req_section in ['script.requirements.pip', 'script.requirements.conda']:
@@ -156,12 +164,13 @@ def resolve_default_container(session, task_id, container_config):
break
if matched:
if not container_config.get('container'):
container_config['container'] = entry.get('image', None)
if not container_config.get('image'):
container_config['image'] = entry.get('image', None)
if not container_config.get('arguments'):
container_config['arguments'] = entry.get('arguments', None)
container_config['arguments'] = shlex.split(str(container_config.get('arguments') or '').strip())
print('Matching default container with rule:\n{}'.format(json.dumps(entry)))
container_config['arguments'] = entry.get('arguments', None) or ''
if isinstance(container_config.get('arguments'), str):
container_config['arguments'] = shlex.split(str(container_config.get('arguments') or '').strip())
print('INFO: Matching default container with rule:\n{}'.format(json.dumps(entry)))
return container_config
return container_config
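The argument handling at the end of `resolve_default_container` accepts either a string or an already-split list. A minimal sketch of that normalization, assuming a hypothetical helper name `normalize_arguments` (it is not part of the agent):

```python
import shlex


def normalize_arguments(arguments):
    """Return docker arguments as a list, splitting shell-style strings."""
    if not arguments:
        arguments = ""
    if isinstance(arguments, str):
        # shlex.split honors quoting, so a quoted value with spaces stays one token
        return shlex.split(arguments.strip())
    return arguments


print(normalize_arguments("--ipc=host -e 'GREETING=hello world'"))
print(normalize_arguments(["--ipc=host"]))
print(normalize_arguments(None))
```

Splitting only when the value is a string is what keeps a list that was already split from being passed through `shlex.split` a second time.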

File diff suppressed because it is too large

View File

@@ -1,6 +1,5 @@
import shlex
from datetime import timedelta
from distutils.util import strtobool
from enum import IntEnum
from os import getenv, environ
from typing import Text, Optional, Union, Tuple, Any
@@ -9,6 +8,7 @@ import six
from pathlib2 import Path
from clearml_agent.helper.base import normalize_path
from clearml_agent.helper.environment.converters import strtobool
PROGRAM_NAME = "clearml-agent"
FROM_FILE_PREFIX_CHARS = "@"
@@ -152,16 +152,22 @@ WORKING_STANDALONE_DIR = "code"
DEFAULT_VCS_CACHE = normalize_path(CONFIG_DIR, "vcs-cache")
PIP_EXTRA_INDICES = []
DEFAULT_PIP_DOWNLOAD_CACHE = normalize_path(CONFIG_DIR, "pip-download-cache")
ENV_PIP_EXTRA_INSTALL_FLAGS = EnvironmentConfig("CLEARML_EXTRA_PIP_INSTALL_FLAGS", type=list)
ENV_DOCKER_IMAGE = EnvironmentConfig("CLEARML_DOCKER_IMAGE", "TRAINS_DOCKER_IMAGE")
ENV_WORKER_ID = EnvironmentConfig("CLEARML_WORKER_ID", "TRAINS_WORKER_ID")
ENV_WORKER_TAGS = EnvironmentConfig("CLEARML_WORKER_TAGS")
ENV_AGENT_SKIP_PIP_VENV_INSTALL = EnvironmentConfig("CLEARML_AGENT_SKIP_PIP_VENV_INSTALL")
ENV_AGENT_SKIP_PYTHON_ENV_INSTALL = EnvironmentConfig("CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL", type=bool)
ENV_AGENT_FORCE_CODE_DIR = EnvironmentConfig("CLEARML_AGENT_FORCE_CODE_DIR")
ENV_AGENT_FORCE_EXEC_SCRIPT = EnvironmentConfig("CLEARML_AGENT_FORCE_EXEC_SCRIPT")
ENV_AGENT_FORCE_POETRY = EnvironmentConfig("CLEARML_AGENT_FORCE_POETRY", type=bool)
ENV_AGENT_FORCE_TASK_INIT = EnvironmentConfig("CLEARML_AGENT_FORCE_TASK_INIT", type=bool)
ENV_DOCKER_SKIP_GPUS_FLAG = EnvironmentConfig("CLEARML_DOCKER_SKIP_GPUS_FLAG", "TRAINS_DOCKER_SKIP_GPUS_FLAG")
ENV_AGENT_GIT_USER = EnvironmentConfig("CLEARML_AGENT_GIT_USER", "TRAINS_AGENT_GIT_USER")
ENV_AGENT_GIT_PASS = EnvironmentConfig("CLEARML_AGENT_GIT_PASS", "TRAINS_AGENT_GIT_PASS")
ENV_AGENT_GIT_HOST = EnvironmentConfig("CLEARML_AGENT_GIT_HOST", "TRAINS_AGENT_GIT_HOST")
ENV_AGENT_DISABLE_SSH_MOUNT = EnvironmentConfig("CLEARML_AGENT_DISABLE_SSH_MOUNT", type=bool)
ENV_AGENT_DEBUG_GET_NEXT_TASK = EnvironmentConfig("CLEARML_AGENT_DEBUG_GET_NEXT_TASK", type=bool)
ENV_SSH_AUTH_SOCK = EnvironmentConfig("SSH_AUTH_SOCK")
ENV_TASK_EXECUTE_AS_USER = EnvironmentConfig("CLEARML_AGENT_EXEC_USER", "TRAINS_AGENT_EXEC_USER")
ENV_TASK_EXTRA_PYTHON_PATH = EnvironmentConfig("CLEARML_AGENT_EXTRA_PYTHON_PATH", "TRAINS_AGENT_EXTRA_PYTHON_PATH")
@@ -173,10 +179,15 @@ ENV_DOCKER_HOST_MOUNT = EnvironmentConfig(
)
ENV_VENV_CACHE_PATH = EnvironmentConfig("CLEARML_AGENT_VENV_CACHE_PATH")
ENV_EXTRA_DOCKER_ARGS = EnvironmentConfig("CLEARML_AGENT_EXTRA_DOCKER_ARGS", type=list)
ENV_EXTRA_DOCKER_LABELS = EnvironmentConfig("CLEARML_AGENT_EXTRA_DOCKER_LABELS", type=list)
ENV_DEBUG_INFO = EnvironmentConfig("CLEARML_AGENT_DEBUG_INFO")
ENV_CHILD_AGENTS_COUNT_CMD = EnvironmentConfig("CLEARML_AGENT_CHILD_AGENTS_COUNT_CMD")
ENV_DOCKER_ARGS_FILTERS = EnvironmentConfig("CLEARML_AGENT_DOCKER_ARGS_FILTERS")
ENV_DOCKER_ARGS_HIDE_ENV = EnvironmentConfig("CLEARML_AGENT_DOCKER_ARGS_HIDE_ENV")
ENV_CONFIG_BC_IN_STANDALONE = EnvironmentConfig("CLEARML_AGENT_STANDALONE_CONFIG_BC", type=bool)
""" Maintain backwards compatible configuration when launching in standalone mode """
ENV_FORCE_DOCKER_AGENT_REPO = EnvironmentConfig("FORCE_CLEARML_AGENT_REPO", "CLEARML_AGENT_DOCKER_AGENT_REPO")
ENV_SERVICES_DOCKER_RESTART = EnvironmentConfig("CLEARML_AGENT_SERVICES_DOCKER_RESTART")
"""
@@ -232,6 +243,14 @@ ENV_CUSTOM_BUILD_SCRIPT = EnvironmentConfig("CLEARML_AGENT_CUSTOM_BUILD_SCRIPT")
standard flow.
"""
ENV_PACKAGE_PYTORCH_RESOLVE = EnvironmentConfig("CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE")
ENV_TEMP_STDOUT_FILE_DIR = EnvironmentConfig("CLEARML_AGENT_TEMP_STDOUT_FILE_DIR")
ENV_GIT_CLONE_VERBOSE = EnvironmentConfig("CLEARML_AGENT_GIT_CLONE_VERBOSE", type=bool)
ENV_GPU_FRACTIONS = EnvironmentConfig("CLEARML_AGENT_GPU_FRACTIONS")
class FileBuffering(IntEnum):
"""

View File

@@ -39,7 +39,7 @@ LOCAL_REGEX = re.compile(
class Requirement(object):
"""
Represents a single requirementfrom clearml_agent.external.requirements_parser.requirement import Requirement
Represents a single requirement from clearml_agent.external.requirements_parser.requirement import Requirement
Typically instances of this class are created with ``Requirement.parse``.
For local file requirements, there's no verification that the file
@@ -214,6 +214,7 @@ class Requirement(object):
def parse(cls, line):
"""
Parses a Requirement from a line of a requirement file.
This is the main entry point for parsing a single requirements line (not parse_line!)
:param line: a line of a requirement file
:returns: a Requirement instance for the given line
@@ -226,7 +227,7 @@ class Requirement(object):
return cls.parse_editable(
re.sub(r'^(-e|--editable=?)\s*', '', line))
elif '@' in line and ('#' not in line or line.index('#') > line.index('@')):
# Allegro bug fix: support 'name @ git+' entries
# ClearML bug fix: support 'name @ git+' entries
name, uri = line.split('@', 1)
name = name.strip()
uri = uri.strip()

View File

@@ -0,0 +1,15 @@
from threading import Thread
from clearml_agent.session import Session
class K8sDaemon(Thread):
def __init__(self, agent):
super(K8sDaemon, self).__init__(target=self.target)
self.daemon = True
self._agent = agent
self.log = agent.log
self._session: Session = agent._session
def target(self):
pass

View File

@@ -1,7 +1,20 @@
from clearml_agent.definitions import EnvironmentConfig
from clearml_agent.helper.environment import EnvEntry
ENV_START_AGENT_SCRIPT_PATH = EnvironmentConfig('CLEARML_K8S_GLUE_START_AGENT_SCRIPT_PATH')
ENV_START_AGENT_SCRIPT_PATH = EnvEntry("CLEARML_K8S_GLUE_START_AGENT_SCRIPT_PATH", default="~/__start_agent__.sh")
"""
Script path to use when creating the bash script to run the agent inside the scheduled pod's docker container.
Script will be appended to the specified file.
"""
ENV_DEFAULT_EXECUTION_AGENT_ARGS = EnvEntry("K8S_GLUE_DEF_EXEC_AGENT_ARGS", default="--full-monitoring --require-queue")
ENV_POD_AGENT_INSTALL_ARGS = EnvEntry("K8S_GLUE_POD_AGENT_INSTALL_ARGS", default="", lstrip=False)
ENV_POD_MONITOR_LOG_BATCH_SIZE = EnvEntry("K8S_GLUE_POD_MONITOR_LOG_BATCH_SIZE", default=5, converter=int)
ENV_POD_MONITOR_DISABLE_ENQUEUE_ON_PREEMPTION = EnvEntry(
"K8S_GLUE_POD_MONITOR_DISABLE_ENQUEUE_ON_PREEMPTION", default=False, converter=bool
)
ENV_POD_USE_IMAGE_ENTRYPOINT = EnvEntry("K8S_GLUE_POD_USE_IMAGE_ENTRYPOINT", default=False, converter=bool)
"""
Do not inject a cmd and args to the container's image when building the k8s template (depend on the built-in image
entrypoint)
"""

View File

@@ -0,0 +1,12 @@
class GetPodsError(Exception):
pass
class GetJobsError(Exception):
pass
class GetPodCountError(Exception):
pass

File diff suppressed because it is too large

View File

@@ -0,0 +1,249 @@
from time import sleep
from typing import Dict, Tuple, Optional, List
from clearml_agent.backend_api.session import Request
from clearml_agent.glue.utilities import get_bash_output
from clearml_agent.helper.process import stringify_bash_output
from .daemon import K8sDaemon
from .utilities import get_path
from .errors import GetPodsError
from .definitions import ENV_POD_MONITOR_DISABLE_ENQUEUE_ON_PREEMPTION
class PendingPodsDaemon(K8sDaemon):
def __init__(self, polling_interval: float, agent):
super(PendingPodsDaemon, self).__init__(agent=agent)
self._polling_interval = polling_interval
self._last_tasks_msgs = {} # last msg updated for every task
def get_pods(self, pod_name=None, debug_msg="Detecting pending pods: {cmd}"):
filters = ["status.phase=Pending"]
if pod_name:
filters.append(f"metadata.name={pod_name}")
if self._agent.using_jobs:
return self._agent.get_pods_for_jobs(
job_condition="status.active=1", pod_filters=filters, debug_msg=debug_msg
)
return self._agent.get_pods(filters=filters, debug_msg=debug_msg)
def _get_pod_name(self, pod: dict):
return get_path(pod, "metadata", "name")
def _get_k8s_resource_name(self, pod: dict):
if self._agent.using_jobs:
return get_path(pod, "metadata", "labels", "job-name")
return get_path(pod, "metadata", "name")
def _get_task_id(self, pod: dict):
return self._get_k8s_resource_name(pod).rpartition('-')[-1]
@staticmethod
def _get_k8s_resource_namespace(pod: dict):
return pod.get('metadata', {}).get('namespace', None)
def target(self):
"""
Handle pending objects (pods or jobs, depending on the agent mode).
- Delete any pending objects that are not expected to recover
- Delete any pending objects whose associated task was aborted
"""
while True:
# noinspection PyBroadException
try:
# Get pods (standalone pods if we're in pods mode, or pods associated to jobs if we're in jobs mode)
pods = self.get_pods()
if pods is None:
raise GetPodsError()
task_id_to_pod = dict()
for pod in pods:
pod_name = self._get_pod_name(pod)
if not pod_name:
continue
task_id = self._get_task_id(pod)
if not task_id:
continue
namespace = self._get_k8s_resource_namespace(pod)
if not namespace:
continue
updated_pod = self.get_pods(pod_name=pod_name, debug_msg="Refreshing pod information: {cmd}")
if not updated_pod:
continue
pod = updated_pod[0]
task_id_to_pod[task_id] = pod
msg = None
tags = []
waiting = get_path(pod, 'status', 'containerStatuses', 0, 'state', 'waiting')
if not waiting:
condition = get_path(pod, 'status', 'conditions', 0)
if condition:
reason = condition.get('reason')
if reason == 'Unschedulable':
message = condition.get('message')
msg = reason + (" ({})".format(message) if message else "")
else:
reason = waiting.get("reason", None)
message = waiting.get("message", None)
msg = reason + (" ({})".format(message) if message else "")
if reason == 'ImagePullBackOff':
self.delete_k8s_resource(k8s_resource=pod, msg=reason)
try:
self._session.api_client.tasks.failed(
task=task_id,
status_reason="K8S glue error: {}".format(msg),
status_message="Changed by K8S glue",
force=True
)
self._agent.send_logs(
task_id, ["K8S Error: {}".format(msg)],
session=self._session
)
except Exception as ex:
self.log.warning(
'K8S Glue pending monitor: Failed deleting task "{}"\nEX: {}'.format(task_id, ex)
)
# clean up any msg for this task
self._last_tasks_msgs.pop(task_id, None)
continue
self._update_pending_task_msg(task_id, msg, tags)
if task_id_to_pod:
self._process_tasks_for_pending_pods(task_id_to_pod)
# clean up any last message for a task that wasn't seen as a pod
self._last_tasks_msgs = {k: v for k, v in self._last_tasks_msgs.items() if k in task_id_to_pod}
except GetPodsError:
pass
except Exception:
self.log.exception("Hanging pods daemon loop")
sleep(self._polling_interval)
def delete_k8s_resource(self, k8s_resource: dict, msg: str = None):
delete_cmd = "kubectl delete {kind} {name} -n {namespace} --output name".format(
kind=self._agent.kind,
name=self._get_k8s_resource_name(k8s_resource),
namespace=self._get_k8s_resource_namespace(k8s_resource)
).strip()
self.log.debug(" - deleting {} {}: {}".format(self._agent.kind, (" " + msg) if msg else "", delete_cmd))
return get_bash_output(delete_cmd).strip()
def _process_tasks_for_pending_pods(self, task_id_to_details: Dict[str, dict]):
self._handle_aborted_tasks(task_id_to_details)
def _handle_aborted_tasks(self, pending_tasks_details: Dict[str, dict]):
try:
result = self._session.get(
service='tasks',
action='get_all',
json={
"id": list(pending_tasks_details),
"status": ["stopped"],
"only_fields": ["id"]
}
)
aborted_task_ids = list(filter(None, (task.get("id") for task in result["tasks"])))
for task_id in aborted_task_ids:
pod = pending_tasks_details.get(task_id)
if not pod:
self.log.error("Failed locating aborted task {} in pending pods list".format(task_id))
continue
pod_name = self._get_pod_name(pod)
if not self.get_pods(pod_name=pod_name):
self.log.debug("K8S Glue pending monitor: pod {} is no longer pending, skipping".format(pod_name))
continue
resource_name = self._get_k8s_resource_name(pod)
self.log.info(
"K8S Glue pending monitor: task {} was aborted but the k8s resource {} is still pending, "
"deleting pod".format(task_id, resource_name)
)
result = self._session.get(
service='tasks',
action='get_all',
json={"id": [task_id], "status": ["stopped"], "only_fields": ["id"]},
)
if not result["tasks"]:
self.log.debug("K8S Glue pending monitor: task {} is no longer aborted, skipping".format(task_id))
continue
output = self.delete_k8s_resource(k8s_resource=pod, msg="Pending resource of an aborted task")
if not output:
self.log.warning("K8S Glue pending monitor: failed deleting resource {}".format(resource_name))
except Exception as ex:
self.log.warning(
'K8S Glue pending monitor: failed checking aborted tasks for pending resources: {}'.format(ex)
)
def _update_pending_task_msg(self, task_id: str, msg: str, tags: List[str] = None):
if not msg or self._last_tasks_msgs.get(task_id, None) == (msg, tags):
return
try:
if ENV_POD_MONITOR_DISABLE_ENQUEUE_ON_PREEMPTION.get():
# Skip re-enqueueing the task. Re-enqueueing is meant to sync the ClearML task status when the pod
# was preempted, but a pod can also look pending because of cluster communication lag. In that
# case the pod is actually already running, and re-enqueueing the task would effectively
# kill it.
pass
else:
# Make sure the task is queued
result = self._session.send_request(
service='tasks',
action='get_all',
json={"id": task_id, "only_fields": ["status"]},
method=Request.def_method,
async_enable=False,
)
if result.ok:
status = get_path(result.json(), 'data', 'tasks', 0, 'status')
# if task is in progress, change its status to enqueued
if status == "in_progress":
result = self._session.send_request(
service='tasks', action='enqueue',
json={
"task": task_id, "force": True, "queue": self._agent.k8s_pending_queue_id
},
method=Request.def_method,
async_enable=False,
)
if not result.ok:
result_msg = get_path(result.json(), 'meta', 'result_msg')
self.log.debug(
"K8S Glue pods monitor: failed forcing task status change"
" for pending task {}: {}".format(task_id, result_msg)
)
# Update task status message
payload = {"task": task_id, "status_message": "K8S glue status: {}".format(msg)}
if tags:
payload["tags"] = tags
result = self._session.send_request('tasks', 'update', json=payload, method=Request.def_method)
if not result.ok:
result_msg = get_path(result.json(), 'meta', 'result_msg')
raise Exception(result_msg or result.text)
# update last msg for this task
self._last_tasks_msgs[task_id] = (msg, tags)  # store the same (msg, tags) pair the early-return check compares against
except Exception as ex:
self.log.warning(
'K8S Glue pods monitor: Failed setting status message for task "{}"\nMSG: {}\nEX: {}'.format(
task_id, msg, ex
)
)
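How the pending monitor derives a status message from a pod can be sketched on plain dictionaries. The function below is a simplified, hypothetical restatement (`describe_pending_pod` is not a name from the agent), assuming the pod is the parsed JSON that `kubectl get pods -o json` would return for one item. It returns the message plus a flag saying whether the monitor would delete the resource and fail the task.

```python
def describe_pending_pod(pod):
    """Return (message, should_fail) for a pending pod dict."""
    status = pod.get("status") or {}
    container_statuses = status.get("containerStatuses") or [{}]
    waiting = (container_statuses[0].get("state") or {}).get("waiting")
    if waiting:
        # a container is waiting: report its reason, and treat image pull back-off as fatal
        reason = waiting.get("reason") or ""
        message = waiting.get("message")
        msg = reason + (" ({})".format(message) if message else "")
        return msg, reason == "ImagePullBackOff"
    conditions = status.get("conditions") or [{}]
    condition = conditions[0]
    if condition.get("reason") == "Unschedulable":
        # no container state yet: only the scheduler's verdict is available
        message = condition.get("message")
        return "Unschedulable" + (" ({})".format(message) if message else ""), False
    return None, False


backoff = {"status": {"containerStatuses": [
    {"state": {"waiting": {"reason": "ImagePullBackOff", "message": "Back-off pulling image"}}}
]}}
unschedulable = {"status": {"conditions": [
    {"reason": "Unschedulable", "message": "0/3 nodes are available"}
]}}
print(describe_pending_pod(backoff))
print(describe_pending_pod(unschedulable))
```

Only `ImagePullBackOff` is treated as unrecoverable here, matching the diff. An unschedulable pod just gets its status message updated and keeps waiting.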

View File

@@ -0,0 +1,18 @@
import functools
from subprocess import DEVNULL
from clearml_agent.helper.process import get_bash_output as _get_bash_output
def get_path(d, *path, default=None):
try:
return functools.reduce(
lambda a, b: a[b], path, d
)
except (IndexError, KeyError):
return default
def get_bash_output(cmd, stderr=DEVNULL, raise_error=False):
return _get_bash_output(cmd, stderr=stderr, raise_error=raise_error)
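A short usage sketch for `get_path`, with the function body copied from the diff so the block runs on its own. The pod dictionary and its name are made-up sample data; the task-ID extraction mirrors the `rpartition('-')` call used by the pending pods daemon.

```python
import functools


def get_path(d, *path, default=None):
    try:
        return functools.reduce(lambda a, b: a[b], path, d)
    except (IndexError, KeyError):
        return default


# hypothetical pod description, shaped like `kubectl get pod -o json` output
pod = {
    "metadata": {"name": "clearml-id-abc123", "namespace": "clearml"},
    "status": {"conditions": [{"reason": "Unschedulable"}]},
}

name = get_path(pod, "metadata", "name")
reason = get_path(pod, "status", "conditions", 0, "reason")
waiting = get_path(pod, "status", "containerStatuses", 0, "state", "waiting")  # missing key
task_id = name.rpartition("-")[-1]
print(name, reason, waiting, task_id)
```

Missing keys and out-of-range list indices fall back to `default`. A `None` in the middle of the path raises `TypeError`, which this version does not catch, so callers should not rely on it for values that may be explicitly null.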

View File

@@ -14,28 +14,30 @@ import sys
import tempfile
from abc import ABCMeta
from collections import OrderedDict
from distutils.spawn import find_executable
from functools import total_ordering
from typing import Text, Dict, Any, Optional, AnyStr, IO, Union
import attr
import furl
import six
import yaml
from attr import fields_dict
from pathlib2 import Path
import six
from six.moves import reduce
from clearml_agent.external import pyhocon
from clearml_agent.errors import CommandFailedError
from clearml_agent.external import pyhocon
from clearml_agent.helper.dicts import filter_keys
pretty_lines = False
log = logging.getLogger(__name__)
use_powershell = os.getenv("CLEARML_AGENT_USE_POWERSHELL", None)
def which(cmd, path=None):
from clearml_agent.helper.process import find_executable
result = find_executable(cmd, path)
if not result:
raise ValueError('command "{}" not found'.format(cmd))
@@ -52,7 +54,7 @@ def select_for_platform(linux, windows):
def bash_c():
return 'bash -c' if not is_windows_platform() else 'cmd /c'
return 'bash -c' if not is_windows_platform() else ('powershell -Command' if use_powershell else 'cmd /c')
def return_list(arg):
@@ -418,6 +420,7 @@ def mkstemp(
open_kwargs=None, # type: Optional[Dict[Text, Any]]
text=True, # type: bool
name_only=False, # type: bool
mode=None, # type: str
*args,
**kwargs):
# type: (...) -> Union[(IO[AnyStr], Text), Text]
@@ -427,12 +430,14 @@ def mkstemp(
:param open_kwargs: keyword arguments for ``io.open``
:param text: open in text mode
:param name_only: close the file and return its name
:param mode: open file mode
:param args: tempfile.mkstemp args
:param kwargs: tempfile.mkstemp kwargs
"""
fd, name = tempfile.mkstemp(text=text, *args, **kwargs)
mode = 'w+'
if not text:
if not mode:
mode = 'w+'
if not text and 'b' not in mode:
mode += 'b'
if name_only:
os.close(fd)
@@ -538,6 +543,36 @@ def convert_cuda_version_to_int_10_base_str(cuda_version):
return str(int(float(cuda_version)*10))
def get_python_version(python_executable, log=None):
from clearml_agent.helper.process import Argv
try:
output = Argv(python_executable, "--version").get_output(
stderr=subprocess.STDOUT
)
except subprocess.CalledProcessError as ex:
# Windows returns exit code 9009 and suggests installing Python from the Windows Store
if is_windows_platform() and ex.returncode == 9009:
if log:
log.debug("version not found: {}".format(ex))
else:
if log:
log.warning("error getting %s version: %s", python_executable, ex)
return None
except FileNotFoundError as ex:
if log:
log.debug("version not found: {}".format(ex))
return None
match = re.search(r"Python ({}(?:\.\d+)*)".format(r"\d+"), output)
if match:
if log:
log.debug("Found: {}".format(python_executable))
# only return major.minor version
return ".".join(str(match.group(1)).split(".")[:2])
return None
class NonStrictAttrs(object):
@classmethod
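The version parsing at the end of `get_python_version` can be tried without spawning an interpreter. A minimal sketch, assuming a hypothetical helper `parse_python_version` that takes the text printed by `python --version` and applies the same regular expression and major.minor truncation:

```python
import re


def parse_python_version(output):
    """Extract 'major.minor' from `python --version` output, or None if absent."""
    match = re.search(r"Python (\d+(?:\.\d+)*)", output)
    if not match:
        return None
    # keep only the first two components, e.g. 3.12.4 -> 3.12
    return ".".join(match.group(1).split(".")[:2])


for text in ("Python 3.12.4\n", "Python 2.7.18", "Python 3.10.0rc1", "command not found"):
    print(repr(text), "->", parse_python_version(text))
```

Pre-release suffixes such as `rc1` are ignored because the pattern only consumes dot-separated digits.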

View File

@@ -17,6 +17,30 @@ if TYPE_CHECKING:
from clearml_agent.session import Session
def sanitize_urls(s: str) -> Tuple[str, bool]:
"""
Replaces passwords in URLs with asterisks.
Returns the sanitized string and a boolean indicating whether any sanitization was performed.
"""
regex = re.compile("^([^:]*:)[^@]+(.*)$")
tokens = re.split(r"\s", s)
changed = False
for k in range(len(tokens)):
if "@" in tokens[k]:
res = urlparse(tokens[k])
if regex.match(res.netloc):
changed = True
tokens[k] = urlunparse((
res.scheme,
regex.sub("\\1********\\2", res.netloc),
res.path,
res.params,
res.query,
res.fragment
))
return " ".join(tokens) if changed else s, changed
class DockerArgsSanitizer:
@classmethod
def sanitize_docker_command(cls, session, docker_command):
@@ -62,11 +86,11 @@ class DockerArgsSanitizer:
elif key in keys:
val = "********"
elif parse_embedded_urls:
val = cls._sanitize_urls(val)[0]
val = sanitize_urls(val)[0]
result[i + 1] = "{}={}".format(key, val)
skip_next = True
elif parse_embedded_urls and not item.startswith("-"):
item, changed = cls._sanitize_urls(item)
item, changed = sanitize_urls(item)
if changed:
result[i] = item
except (KeyError, TypeError):
@@ -75,22 +99,71 @@ class DockerArgsSanitizer:
return result
@staticmethod
def _sanitize_urls(s: str) -> Tuple[str, bool]:
""" Replaces passwords in URLs with asterisks """
regex = re.compile("^([^:]*:)[^@]+(.*)$")
tokens = re.split(r"\s", s)
changed = False
for k in range(len(tokens)):
if "@" in tokens[k]:
res = urlparse(tokens[k])
if regex.match(res.netloc):
changed = True
tokens[k] = urlunparse((
res.scheme,
regex.sub("\\1********\\2", res.netloc),
res.path,
res.params,
res.query,
res.fragment
))
return " ".join(tokens) if changed else s, changed
def get_list_of_switches(docker_args: List[str]) -> List[str]:
args = []
for token in docker_args:
if token.strip().startswith("-"):
args += [token.strip().split("=")[0].lstrip("-")]
return args
@staticmethod
def filter_switches(docker_args: List[str], exclude_switches: List[str]) -> List[str]:
# shortcut if we are sure we have no matches
if (not exclude_switches or
not any("-{}".format(s) in " ".join(docker_args) for s in exclude_switches)):
return docker_args
args = []
in_switch_args = True
for token in docker_args:
if token.strip().startswith("-"):
if "=" in token:
switch = token.strip().split("=")[0]
in_switch_args = False
else:
switch = token
in_switch_args = True
if switch.lstrip("-") in exclude_switches:
# if in excluded, skip the switch and following arguments
in_switch_args = False
else:
args += [token]
elif in_switch_args:
args += [token]
else:
# these are the arguments of an excluded switch, which we need to skip
pass
return args
@staticmethod
def merge_docker_args(config, task_docker_arguments: List[str], extra_docker_arguments: List[str]) -> List[str]:
base_cmd = []
# by default only --privileged, --security-opt, --network and --ipc are resolved (see agent.protected_docker_extra_args)
override_switches = config.get(
"agent.protected_docker_extra_args",
["privileged", "security-opt", "network", "ipc"]
)
if config.get("agent.docker_args_extra_precedes_task", True):
switches = []
if extra_docker_arguments:
switches = DockerArgsSanitizer.get_list_of_switches(extra_docker_arguments)
switches = list(set(switches) & set(override_switches))
base_cmd += [str(a) for a in extra_docker_arguments if a]
if task_docker_arguments:
docker_arguments = DockerArgsSanitizer.filter_switches(task_docker_arguments, switches)
base_cmd += [a for a in docker_arguments if a]
else:
switches = []
if task_docker_arguments:
switches = DockerArgsSanitizer.get_list_of_switches(task_docker_arguments)
switches = list(set(switches) & set(override_switches))
base_cmd += [a for a in task_docker_arguments if a]
if extra_docker_arguments:
extra_docker_arguments = DockerArgsSanitizer.filter_switches(extra_docker_arguments, switches)
base_cmd += [a for a in extra_docker_arguments if a]
return base_cmd
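The precedence logic in `merge_docker_args` is easier to follow on a concrete command line. The sketch below restates `get_list_of_switches` and `filter_switches` as module-level functions and covers only the branch where the extra arguments precede the task arguments. `merge_extra_first` and `PROTECTED_SWITCHES` are hypothetical names, and the sample arguments are made up.

```python
from typing import List

# default list from the diff (config key "agent.protected_docker_extra_args")
PROTECTED_SWITCHES = ["privileged", "security-opt", "network", "ipc"]


def get_list_of_switches(docker_args: List[str]) -> List[str]:
    """Return switch names without leading dashes or '=value' suffixes."""
    return [t.strip().split("=")[0].lstrip("-") for t in docker_args if t.strip().startswith("-")]


def filter_switches(docker_args: List[str], exclude_switches: List[str]) -> List[str]:
    """Drop excluded switches together with the values that follow them."""
    # shortcut: nothing to exclude, or no excluded switch appears at all
    if not exclude_switches or not any(
        "-{}".format(s) in " ".join(docker_args) for s in exclude_switches
    ):
        return docker_args
    args = []
    keep_values = True
    for token in docker_args:
        if token.strip().startswith("-"):
            switch = token.strip().split("=")[0]
            # "--switch=value" carries its own value, so no following token belongs to it
            keep_values = "=" not in token
            if switch.lstrip("-") in exclude_switches:
                keep_values = False
            else:
                args.append(token)
        elif keep_values:
            args.append(token)
    return args


def merge_extra_first(task_args: List[str], extra_args: List[str]) -> List[str]:
    """Extra (agent-level) arguments win for protected switches; the rest of the task arguments are kept."""
    overridden = list(set(get_list_of_switches(extra_args)) & set(PROTECTED_SWITCHES))
    return [str(a) for a in extra_args if a] + [a for a in filter_switches(task_args, overridden) if a]


task_args = ["--network", "host", "--ipc=host", "-v", "/data:/data"]
extra_args = ["--network", "bridge"]
print(merge_extra_first(task_args, extra_args))
```

In this example the task's `--network host` is dropped because the agent-level arguments already set `--network`, while `--ipc=host` and the volume mount pass through unchanged.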

View File

@@ -0,0 +1,8 @@
from .entry import Entry, NotSet
from .environment import EnvEntry
__all__ = [
'Entry',
'NotSet',
'EnvEntry',
]

View File

@@ -0,0 +1,86 @@
import base64
from typing import Union, Optional, Any, TypeVar, Callable, Tuple
import six
try:
from typing import Text
except ImportError:
# windows conda-less hack
Text = Any
ConverterType = TypeVar("ConverterType", bound=Callable[[Any], Any])
def base64_to_text(value):
# type: (Any) -> Text
return base64.b64decode(value).decode("utf-8")
def text_to_int(value, default=0):
# type: (Any, int) -> int
try:
return int(value)
except (ValueError, TypeError):
return default
def text_to_bool(value):
# type: (Text) -> bool
return bool(strtobool(value))
def safe_text_to_bool(value):
# type: (Text) -> bool
try:
return text_to_bool(value)
except ValueError:
return bool(value)
def any_to_bool(value):
# type: (Optional[Union[int, float, Text]]) -> bool
if isinstance(value, six.text_type):
return text_to_bool(value)
return bool(value)
# noinspection PyIncorrectDocstring
def or_(*converters, **kwargs):
# type: (ConverterType, Tuple[Exception, ...]) -> ConverterType
"""
Wrapper that implements an "optional converter" pattern. Allows specifying a converter
for which a set of exceptions is ignored (and the original value is returned)
:param converters: One or more converter callables, tried in order
:param exceptions: A tuple of exception types to ignore
"""
# noinspection PyUnresolvedReferences
exceptions = kwargs.get("exceptions", (ValueError, TypeError))
def wrapper(value):
for converter in converters:
try:
return converter(value)
except exceptions:
pass
return value
return wrapper
def strtobool(val):
"""Convert a string representation of truth to true (1) or false (0).
True values are 'y', 'yes', 't', 'true', 'on', and '1'; false values
are 'n', 'no', 'f', 'false', 'off', and '0'. Raises ValueError if
'val' is anything else.
"""
val = val.lower()
if val in ('y', 'yes', 't', 'true', 'on', '1'):
return 1
elif val in ('n', 'no', 'f', 'false', 'off', '0'):
return 0
else:
raise ValueError("invalid truth value %r" % (val,))
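A short demonstration of why string-aware converters are needed for boolean settings, and of how the `or_` wrapper falls through a chain of converters. `strtobool` and `or_` are copied from the diff so the block is self-contained; the sample inputs are illustrative.

```python
def strtobool(val):
    """Convert a string representation of truth to 1 or 0, raising ValueError otherwise."""
    val = val.lower()
    if val in ('y', 'yes', 't', 'true', 'on', '1'):
        return 1
    elif val in ('n', 'no', 'f', 'false', 'off', '0'):
        return 0
    else:
        raise ValueError("invalid truth value %r" % (val,))


def or_(*converters, **kwargs):
    """Try each converter in turn; return the original value if all of them fail."""
    exceptions = kwargs.get("exceptions", (ValueError, TypeError))

    def wrapper(value):
        for converter in converters:
            try:
                return converter(value)
            except exceptions:
                pass
        return value

    return wrapper


# plain bool() treats every non-empty string as true, including "false"
print(bool("false"), bool(strtobool("false")))

number = or_(int, float)
print(number("7"), number("1.5"), number("abc"))
```

The first `print` shows the pitfall: an environment value such as `false` or `0` only turns a flag off when it is parsed with `strtobool` (or a wrapper around it), not with the built-in `bool`.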

View File

@@ -0,0 +1,134 @@
import abc
from typing import Optional, Any, Tuple, Callable, Dict
import six
from .converters import any_to_bool
try:
from typing import Text
except ImportError:
# windows conda-less hack
Text = Any
NotSet = object()
Converter = Callable[[Any], Any]
@six.add_metaclass(abc.ABCMeta)
class Entry(object):
"""
Configuration entry definition
"""
def default_conversions(self):
# type: () -> Dict[Any, Converter]
if self.lstrip and self.rstrip:
def str_convert(s):
return six.text_type(s).strip()
elif self.lstrip:
def str_convert(s):
return six.text_type(s).lstrip()
elif self.rstrip:
def str_convert(s):
return six.text_type(s).rstrip()
else:
def str_convert(s):
return six.text_type(s)
return {
bool: lambda x: any_to_bool(x.strip()),
six.text_type: str_convert,
}
def __init__(self, key, *more_keys, **kwargs):
# type: (Text, Text, Any) -> None
"""
:rtype: object
:param key: Entry's key (at least one).
:param more_keys: More alternate keys for this entry.
:param type: Value type. If provided, it is used to choose a default conversion or
(if none exists) to cast the environment value directly.
:param converter: Value converter. If provided, will be used to convert the environment value.
:param default: Default value. If provided, will be used as the default value on calls to get() and get_pair()
in case no value is found for any key and no specific default value was provided in the call.
Default value is None.
:param help: Help text describing this entry
"""
self.keys = (key,) + more_keys
self.type = kwargs.pop("type", six.text_type)
self.converter = kwargs.pop("converter", None)
self.default = kwargs.pop("default", None)
self.help = kwargs.pop("help", None)
self.lstrip = kwargs.pop("lstrip", True)
self.rstrip = kwargs.pop("rstrip", True)
def __str__(self):
return str(self.key)
@property
def key(self):
return self.keys[0]
def convert(self, value, converter=None):
# type: (Any, Converter) -> Optional[Any]
converter = converter or self.converter
if not converter:
converter = self.default_conversions().get(self.type, self.type)
return converter(value)
def get_pair(self, default=NotSet, converter=None, value_cb=None):
# type: (Any, Converter, Callable[[str, Any], None]) -> Optional[Tuple[Text, Any]]
for key in self.keys:
value = self._get(key)
if value is NotSet:
continue
try:
value = self.convert(value, converter)
except Exception as ex:
self.error("invalid value {key}={value}: {ex}".format(**locals()))
break
# noinspection PyBroadException
try:
if value_cb:
value_cb(key, value)
except Exception:
pass
return key, value
result = self.default if default is NotSet else default
return self.key, result
def get(self, default=NotSet, converter=None, value_cb=None):
# type: (Any, Converter, Callable[[str, Any], None]) -> Optional[Any]
return self.get_pair(default=default, converter=converter, value_cb=value_cb)[1]
def set(self, value):
# type: (Any) -> None
# key, _ = self.get_pair(default=None, converter=None)
for k in self.keys:
self._set(k, str(value))
def _set(self, key, value):
# type: (Text, Text) -> None
pass
@abc.abstractmethod
def _get(self, key):
# type: (Text) -> Any
pass
@abc.abstractmethod
def error(self, message):
# type: (Text) -> None
pass

View File

@@ -0,0 +1,28 @@
from os import getenv, environ
from .converters import text_to_bool
from .entry import Entry, NotSet
class EnvEntry(Entry):
def default_conversions(self):
conversions = super(EnvEntry, self).default_conversions().copy()
conversions[bool] = lambda x: text_to_bool(x.strip())
return conversions
def pop(self):
for k in self.keys:
environ.pop(k, None)
def _get(self, key):
value = getenv(key, "")
return value or NotSet
def _set(self, key, value):
environ[key] = value
def __str__(self):
return "env:{}".format(super(EnvEntry, self).__str__())
def error(self, message):
print("Environment configuration: {}".format(message))
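The lookup order implemented by `Entry.get_pair` and `EnvEntry._get` can be sketched without the class hierarchy. This is a simplified, hypothetical stand-in (a free function named `get_pair`, with no whitespace stripping and no `value_cb`), assuming only what the diff shows: keys are tried in order, empty values count as unset, and a failed conversion reports an error and falls back to the default.

```python
import os

NotSet = object()


def get_pair(keys, default=None, converter=str):
    """Return (key, value) for the first key that is set, else (first key, default)."""
    for key in keys:
        value = os.environ.get(key, "") or NotSet  # an empty string counts as unset
        if value is NotSet:
            continue
        try:
            return key, converter(value)
        except Exception as ex:
            print("Environment configuration: invalid value {}={}: {}".format(key, value, ex))
            break
    return keys[0], default


# the legacy key is used when the primary one is not set
os.environ.pop("CLEARML_WORKER_ID", None)
os.environ["TRAINS_WORKER_ID"] = "host:gpu0"
print(get_pair(("CLEARML_WORKER_ID", "TRAINS_WORKER_ID")))

# a value that cannot be converted falls back to the default
os.environ["K8S_GLUE_POD_MONITOR_LOG_BATCH_SIZE"] = "ten"
print(get_pair(("K8S_GLUE_POD_MONITOR_LOG_BATCH_SIZE",), default=5, converter=int))
```

Because a bad value stops the search instead of moving on to the next key, a malformed primary key is not silently overridden by a legacy key.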

View File

@@ -15,10 +15,8 @@ from __future__ import print_function
from __future__ import unicode_literals
import json
import os.path
import platform
import sys
import time
from datetime import datetime
from typing import Optional
@@ -59,6 +57,21 @@ class GPUStat(object):
"""
return self.entry['uuid']
@property
def mig_index(self):
"""
Returns the index of the MIG partition (as in nvidia-smi).
"""
return self.entry.get("mig_index")
@property
def mig_uuid(self):
"""
Returns the uuid of the MIG partition returned by nvidia-smi when running in MIG mode,
e.g. MIG-12345678-abcd-abcd-uuid-123456abcdef
"""
return self.entry.get("mig_uuid")
@property
def name(self):
"""
@@ -163,14 +176,16 @@ class GPUStatCollection(object):
_initialized = False
_device_count = None
_gpu_device_info = {}
_mig_device_info = {}
def __init__(self, gpu_list, driver_version=None):
def __init__(self, gpu_list, driver_version=None, driver_cuda_version=None):
self.gpus = gpu_list
# attach additional system information
self.hostname = platform.node()
self.query_time = datetime.now()
self.driver_version = driver_version
self.driver_cuda_version = driver_cuda_version
@staticmethod
def clean_processes():
@@ -181,17 +196,18 @@ class GPUStatCollection(object):
@staticmethod
def new_query(shutdown=False, per_process_stats=False, get_driver_info=False):
"""Query the information of all the GPUs on local machine"""
initialized = False
if not GPUStatCollection._initialized:
N.nvmlInit()
GPUStatCollection._initialized = True
initialized = True
def _decode(b):
if isinstance(b, bytes):
return b.decode() # for python3, to unicode
return b
def get_gpu_info(index, handle):
def get_gpu_info(index, handle, is_mig=False):
"""Get one GPU information specified by nvml handle"""
def get_process_info(nv_process):
@@ -200,10 +216,10 @@ class GPUStatCollection(object):
if nv_process.pid not in GPUStatCollection.global_processes:
GPUStatCollection.global_processes[nv_process.pid] = \
psutil.Process(pid=nv_process.pid)
ps_process = GPUStatCollection.global_processes[nv_process.pid]
process['pid'] = nv_process.pid
# noinspection PyBroadException
try:
# ps_process = GPUStatCollection.global_processes[nv_process.pid]
# we do not actually use these, so no point in collecting them
# process['username'] = ps_process.username()
# # cmdline returns full path;
@@ -227,12 +243,14 @@ class GPUStatCollection(object):
pass
return process
if not GPUStatCollection._gpu_device_info.get(index):
device_info = GPUStatCollection._mig_device_info if is_mig else GPUStatCollection._gpu_device_info
if not device_info.get(index):
name = _decode(N.nvmlDeviceGetName(handle))
uuid = _decode(N.nvmlDeviceGetUUID(handle))
GPUStatCollection._gpu_device_info[index] = (name, uuid)
device_info[index] = (name, uuid)
name, uuid = GPUStatCollection._gpu_device_info[index]
name, uuid = device_info[index]
try:
temperature = N.nvmlDeviceGetTemperature(
@@ -286,11 +304,11 @@ class GPUStatCollection(object):
for nv_process in nv_comp_processes + nv_graphics_processes:
try:
process = get_process_info(nv_process)
processes.append(process)
except psutil.NoSuchProcess:
# TODO: add some reminder for NVML broken context
# e.g. nvidia-smi reset or reboot the system
pass
process = None
processes.append(process)
# we do not actually use these, so no point in collecting them
# # TODO: Do not block if full process info is not requested
@@ -314,7 +332,7 @@ class GPUStatCollection(object):
# Convert bytes into MBytes
'memory.used': memory.used // MB if memory else None,
'memory.total': memory.total // MB if memory else None,
'processes': processes,
'processes': None if (processes and all(p is None for p in processes)) else processes
}
if per_process_stats:
GPUStatCollection.clean_processes()
@@ -328,8 +346,36 @@ class GPUStatCollection(object):
for index in range(GPUStatCollection._device_count):
handle = N.nvmlDeviceGetHandleByIndex(index)
gpu_info = get_gpu_info(index, handle)
gpu_stat = GPUStat(gpu_info)
gpu_list.append(gpu_stat)
mig_cnt = 0
# noinspection PyBroadException
try:
mig_cnt = N.nvmlDeviceGetMaxMigDeviceCount(handle)
except Exception:
pass
if mig_cnt <= 0:
gpu_list.append(GPUStat(gpu_info))
continue
got_mig_info = False
for mig_index in range(mig_cnt):
try:
mig_handle = N.nvmlDeviceGetMigDeviceHandleByIndex(handle, mig_index)
mig_info = get_gpu_info(mig_index, mig_handle, is_mig=True)
mig_info["mig_name"] = mig_info["name"]
mig_info["name"] = gpu_info["name"]
mig_info["mig_index"] = mig_info["index"]
mig_info["mig_uuid"] = mig_info["uuid"]
mig_info["index"] = gpu_info["index"]
mig_info["uuid"] = gpu_info["uuid"]
mig_info["temperature.gpu"] = gpu_info["temperature.gpu"]
mig_info["fan.speed"] = gpu_info["fan.speed"]
gpu_list.append(GPUStat(mig_info))
got_mig_info = True
except Exception as e:
pass
if not got_mig_info:
gpu_list.append(GPUStat(gpu_info))
# 2. additional info (driver version, etc).
if get_driver_info:
@@ -337,15 +383,32 @@ class GPUStatCollection(object):
driver_version = _decode(N.nvmlSystemGetDriverVersion())
except N.NVMLError:
driver_version = None # N/A
# noinspection PyBroadException
try:
cuda_driver_version = str(N.nvmlSystemGetCudaDriverVersion())
except BaseException:
# noinspection PyBroadException
try:
cuda_driver_version = str(N.nvmlSystemGetCudaDriverVersion_v2())
except BaseException:
cuda_driver_version = None
if cuda_driver_version:
try:
cuda_driver_version = '{}.{}'.format(
int(cuda_driver_version)//1000, (int(cuda_driver_version) % 1000)//10)
except (ValueError, TypeError):
pass
else:
driver_version = None
cuda_driver_version = None
# no need to shutdown:
if shutdown:
if shutdown and initialized:
N.nvmlShutdown()
GPUStatCollection._initialized = False
return GPUStatCollection(gpu_list, driver_version=driver_version)
return GPUStatCollection(gpu_list, driver_version=driver_version, driver_cuda_version=cuda_driver_version)
def __len__(self):
return len(self.gpus)
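
Note on the CUDA driver version hunk above: the integer-to-string conversion can be sketched as a standalone helper. This is a minimal sketch, assuming the raw value encodes `1000 * major + 10 * minor`, which is what the arithmetic in the diff implies; `format_cuda_driver_version` is a hypothetical name and not part of the changed file.

```python
def format_cuda_driver_version(raw):
    """Convert an integer-encoded CUDA version (e.g. 12040) to 'major.minor'.

    Hypothetical helper mirroring the arithmetic in the diff above.
    """
    if raw is None:
        return None
    text = str(raw)
    try:
        value = int(text)
    except (ValueError, TypeError):
        # Unparsable values are left untouched, as in the diff.
        return text
    return "{}.{}".format(value // 1000, (value % 1000) // 10)


if __name__ == "__main__":
    print(format_cuda_driver_version(12040))
    print(format_cuda_driver_version(11080))
```

Values that cannot be parsed as integers are returned unchanged, matching the `except (ValueError, TypeError): pass` branch in the hunk.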

File diff suppressed because it is too large


@@ -13,16 +13,17 @@ from .locks import FileLock
class FolderCache(object):
_lock_filename = '.clearml.lock'
_lock_timeout_seconds = 30
_def_lock_timeout_seconds = 30
_temp_entry_prefix = '_temp.'
def __init__(self, cache_folder, max_cache_entries=5, min_free_space_gb=None):
def __init__(self, cache_folder, max_cache_entries=5, min_free_space_gb=None, lock_timeout_seconds=None):
self._cache_folder = Path(os.path.expandvars(cache_folder)).expanduser().absolute()
self._cache_folder.mkdir(parents=True, exist_ok=True)
self._max_cache_entries = max_cache_entries
self._last_copied_entry_folder = None
self._min_free_space_gb = min_free_space_gb if min_free_space_gb and min_free_space_gb > 0 else None
self._lock = FileLock((self._cache_folder / self._lock_filename).as_posix())
self._lock_timeout_seconds = float(lock_timeout_seconds or self._def_lock_timeout_seconds)
def get_cache_folder(self):
# type: () -> Path
@@ -46,9 +47,11 @@ class FolderCache(object):
# lock so we make sure no one deletes it before we copy it
# noinspection PyBroadException
try:
self._lock.acquire(timeout=self._lock_timeout_seconds)
self._lock.acquire(timeout=self._lock_timeout_seconds, readonly=True)
except BaseException as ex:
warning('Could not lock cache folder {}: {}'.format(self._cache_folder, ex))
import traceback
warning('DEBUG: Exception {}: {}'.format(ex, traceback.format_exc()))
return None
src = None
@@ -115,6 +118,8 @@ class FolderCache(object):
self._lock.acquire(timeout=self._lock_timeout_seconds)
except BaseException as ex:
warning('Could not lock cache folder {}: {}'.format(self._cache_folder, ex))
import traceback
warning('DEBUG: Exception {}: {}'.format(ex, traceback.format_exc()))
# failed locking do nothing
return True
keys = sorted(list(set(keys) | set(cached_keys)))
@@ -194,16 +199,23 @@ class FolderCache(object):
if cache_folder.is_dir() and not cache_folder.name.startswith(self._temp_entry_prefix)]
folder_entries = sorted(folder_entries, key=lambda x: x[1], reverse=True)
number_of_entries_to_keep = self._max_cache_entries - 1 \
if max_cache_entries is None else max(0, int(max_cache_entries))
# if nothing to do, leave
if not folder_entries[number_of_entries_to_keep:]:
return
# lock so we make sure no one deletes it before we copy it
# noinspection PyBroadException
try:
self._lock.acquire(timeout=self._lock_timeout_seconds)
except BaseException as ex:
warning('Could not lock cache folder {}: {}'.format(self._cache_folder, ex))
import traceback
warning('DEBUG: Exception {}: {}'.format(ex, traceback.format_exc()))
return
number_of_entries_to_keep = self._max_cache_entries - 1 \
if max_cache_entries is None else max(0, int(max_cache_entries))
for folder, ts in folder_entries[number_of_entries_to_keep:]:
try:
shutil.rmtree(folder.as_posix(), ignore_errors=True)


@@ -32,7 +32,7 @@ def open_atomic(filename, binary=True):
... os.remove(filename)
>>> with open_atomic(filename) as fh:
... written = fh.write(b'test')
... written = fh.write(b"test")
>>> assert os.path.exists(filename)
>>> os.remove(filename)
@@ -67,7 +67,7 @@ class FileLock(object):
def __init__(
self, filename, mode='a', timeout=DEFAULT_TIMEOUT,
check_interval=DEFAULT_CHECK_INTERVAL, fail_when_locked=False,
flags=LOCK_METHOD, **file_open_kwargs):
**file_open_kwargs):
"""Lock manager with built-in timeout
filename -- filename
@@ -101,11 +101,12 @@ class FileLock(object):
self.timeout = timeout
self.check_interval = check_interval
self.fail_when_locked = fail_when_locked
self.flags = flags
self.flags_read = constants.LOCK_SH | constants.LOCK_NB
self.flags_write = constants.LOCK_EX | constants.LOCK_NB
self.file_open_kwargs = file_open_kwargs
def acquire(
self, timeout=None, check_interval=None, fail_when_locked=None):
self, timeout=None, check_interval=None, fail_when_locked=None, readonly=False):
"""Acquire the locked filehandle"""
if timeout is None:
timeout = self.timeout
@@ -123,12 +124,13 @@ class FileLock(object):
if fh:
return fh
# Get a new filehandler
fh = self._get_fh()
_fh = None
try:
# Get a new filehandler
_fh = self._get_fh()
# Try to lock
fh = self._get_lock(fh)
except exceptions.LockException as exception:
fh = self._get_lock(_fh, readonly=readonly)
except (exceptions.LockException, IOError) as exception:
# Try till the timeout has passed
timeoutend = current_time() + timeout
while timeoutend > current_time():
@@ -144,16 +146,18 @@ class FileLock(object):
raise exceptions.AlreadyLocked(exception)
else: # pragma: no cover
if not _fh:
_fh = self._get_fh()
# We've got the lock
fh = self._get_lock(fh)
fh = self._get_lock(_fh, readonly=readonly)
break
except exceptions.LockException:
except (exceptions.LockException, IOError):
pass
else:
# We got a timeout... reraising
raise exceptions.LockException(exception)
raise exceptions.LockTimeout(exception)
# Prepare the filehandle (truncate if needed)
fh = self._prepare_fh(fh)
@@ -176,16 +180,37 @@ class FileLock(object):
pass
self.fh = None
def delete_lock_file(self):
# type: () -> bool
"""
Remove the local file used for locking (fail if file is locked)
:return: True if successful
"""
if self.fh:
return False
# noinspection PyBroadException
try:
os.unlink(path=self.filename)
except BaseException:
return False
return True
def _get_fh(self):
"""Get a new filehandle"""
# Create the parent directory if it doesn't exist
path, name = os.path.split(self.filename)
if path and not os.path.isdir(path): # pragma: no cover
os.makedirs(path, exist_ok=True)
return open(self.filename, self.mode, **self.file_open_kwargs)
def _get_lock(self, fh):
def _get_lock(self, fh, readonly=False):
"""
Try to lock the given filehandle
returns LockException if it fails"""
lock(fh, self.flags)
lock(fh, self.flags_read if readonly else self.flags_write)
return fh
def _prepare_fh(self, fh):
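
Note on the `readonly` flag introduced above: the split into `flags_read` (`LOCK_SH | LOCK_NB`) and `flags_write` (`LOCK_EX | LOCK_NB`) lets several readers hold the cache lock at once while a writer is still excluded. The following is a minimal POSIX-only sketch that calls `fcntl.flock` directly instead of the `FileLock` class; `try_lock` and `demo` are hypothetical names, and the timeout and retry loop from `acquire()` is left out.

```python
import fcntl
import tempfile


def try_lock(fh, readonly=False):
    """Try a non-blocking shared (readonly) or exclusive lock; return True on success."""
    flags = (fcntl.LOCK_SH if readonly else fcntl.LOCK_EX) | fcntl.LOCK_NB
    try:
        fcntl.flock(fh.fileno(), flags)
        return True
    except OSError:
        # flock reports a would-block condition when the lock is held elsewhere
        return False


def demo():
    """Two readers share the lock; a writer opened separately is refused."""
    with tempfile.NamedTemporaryFile() as tmp:
        # Each open() creates its own open file description, so the locks
        # are independent even within a single process.
        with open(tmp.name, "a") as reader_1, \
                open(tmp.name, "a") as reader_2, \
                open(tmp.name, "a") as writer:
            return (
                try_lock(reader_1, readonly=True),
                try_lock(reader_2, readonly=True),
                try_lock(writer),
            )


if __name__ == "__main__":
    print(demo())
```

This is why the cache lookup path in `FolderCache` can pass `readonly=True`: concurrent lookups no longer block each other, while operations that modify the cache still take the exclusive lock.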


@@ -20,6 +20,9 @@ class exceptions:
class FileToLarge(BaseLockException):
pass
class LockTimeout(BaseLockException):
pass
class constants:
# The actual tests will execute the code anyhow so the following code can
@@ -185,6 +188,10 @@ elif os.name == 'posix': # pragma: no cover
# The exception code varies on different systems so we'll catch
# every IO error
raise exceptions.LockException(exc_value, fh=file_)
except BaseException as ex:
# DEBUG
print("Uncaught [{}] Exception [{}] in portalock: {}".format(locking_exceptions, type(ex), ex))
raise
def unlock(file_):
fcntl.flock(file_.fileno(), constants.LOCK_UN)


@@ -28,6 +28,7 @@ class PackageManager(object):
_config_cache_folder = 'agent.venvs_cache.path'
_config_cache_max_entries = 'agent.venvs_cache.max_entries'
_config_cache_free_space_threshold = 'agent.venvs_cache.free_space_threshold_gb'
_config_cache_lock_timeout = 'agent.venvs_cache.lock_timeout'
def __init__(self):
self._cache_manager = None
@@ -50,7 +51,7 @@ class PackageManager(object):
pass
@abc.abstractmethod
def freeze(self):
def freeze(self, freeze_full_environment=False):
pass
@abc.abstractmethod
@@ -141,8 +142,9 @@ class PackageManager(object):
@classmethod
def out_of_scope_install_package(cls, package_name, *args):
if PackageManager._selected_manager is not None:
# noinspection PyBroadException
try:
result = PackageManager._selected_manager._install(package_name, *args)
result = PackageManager._selected_manager.install_packages(package_name, *args)
if result not in (0, None, True):
return False
except Exception:
@@ -150,10 +152,11 @@ class PackageManager(object):
return True
@classmethod
def out_of_scope_freeze(cls):
def out_of_scope_freeze(cls, freeze_full_environment=False):
if PackageManager._selected_manager is not None:
# noinspection PyBroadException
try:
return PackageManager._selected_manager.freeze()
return PackageManager._selected_manager.freeze(freeze_full_environment)
except Exception:
pass
return []
@@ -180,7 +183,7 @@ class PackageManager(object):
def get_pip_versions(cls, pip="pip", wrap=''):
return [
(wrap + pip + version + wrap)
for version in cls._pip_version or [pip]
for version in cls._pip_version or [""]
]
def get_cached_venv(self, requirements, docker_cmd, python_version, cuda_version, destination_folder):
@@ -216,6 +219,8 @@ class PackageManager(object):
if not self._get_cache_manager():
return
print('Adding venv into cache: {}'.format(source_folder))
try:
keys = self._generate_reqs_hash_keys(requirements, docker_cmd, python_version, cuda_version)
return self._get_cache_manager().add_entry(
@@ -300,7 +305,9 @@ class PackageManager(object):
max_entries = int(self.session.config.get(self._config_cache_max_entries, 10))
free_space_threshold = float(self.session.config.get(self._config_cache_free_space_threshold, 0))
self._cache_manager = FolderCache(
cache_folder, max_cache_entries=max_entries, min_free_space_gb=free_space_threshold)
cache_folder, max_cache_entries=max_entries,
min_free_space_gb=free_space_threshold,
lock_timeout_seconds=self.session.config.get(self._config_cache_lock_timeout, None))
except Exception as ex:
print("WARNING: Failed accessing venvs cache at {}: {}".format(cache_folder, ex))
print("WARNING: Skipping venv cache - folder not accessible!")
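
Note on the `get_pip_versions` change above: the fallback list moved from `[pip]` to `[""]` because each entry is concatenated after the package name. A minimal sketch of that concatenation, written as a free function with a hypothetical `fallback` parameter so both behaviours can be compared (the real method reads `cls._pip_version`):

```python
def get_pip_versions(pip_version_specs, pip="pip", wrap="", fallback=("",)):
    """Build requirement strings such as 'pip<21' from a list of version specs.

    Hypothetical stand-in for the classmethod in the diff; `fallback` is the
    list used when no version spec is configured.
    """
    return [wrap + pip + version + wrap for version in (pip_version_specs or fallback)]


if __name__ == "__main__":
    print(get_pip_versions(None))                      # new fallback
    print(get_pip_versions(None, fallback=("pip",)))   # old fallback
    print(get_pip_versions(["<21"], wrap='"'))
```

With the old fallback, an unset version list yields the string `pippip`; with the new one it yields a plain `pip`.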


@@ -5,7 +5,6 @@ import re
import os
import subprocess
from collections import OrderedDict
from distutils.spawn import find_executable
from functools import partial
from itertools import chain
from typing import Text, Iterable, Union, Dict, Set, Sequence, Any
@@ -22,13 +21,13 @@ from clearml_agent.errors import CommandFailedError
from clearml_agent.helper.base import (
rm_tree, NonStrictAttrs, select_for_platform, is_windows_platform, ExecutionInfo,
convert_cuda_version_to_float_single_digit_str, convert_cuda_version_to_int_10_base_str, )
from clearml_agent.helper.process import Argv, Executable, DEVNULL, CommandSequence, PathLike
from clearml_agent.helper.process import Argv, Executable, DEVNULL, CommandSequence, PathLike, find_executable
from clearml_agent.helper.package.requirements import SimpleVersion
from clearml_agent.session import Session
from .base import PackageManager
from .pip_api.venv import VirtualenvPip
from .requirements import RequirementsManager, MarkerRequirement
from ...backend_api.session.defs import ENV_CONDA_ENV_PACKAGE
from ...backend_api.session.defs import ENV_CONDA_ENV_PACKAGE, ENV_USE_CONDA_BASE_ENV
package_normalize = partial(re.compile(r"""\[version=['"](.*)['"]\]""").sub, r"\1")
@@ -79,6 +78,11 @@ class CondaAPI(PackageManager):
self.path = path
self.env_read_only = False
self.extra_channels = self.session.config.get('agent.package_manager.conda_channels', [])
# install into base conda environment (should only be used if running in docker mode)
self.use_conda_base_env = ENV_USE_CONDA_BASE_ENV.get(
default=self.session.config.get('agent.package_manager.use_conda_base_env', None)
)
# notice this will not install any additional packages into the selected environment
self.conda_env_as_base_docker = \
self.session.config.get('agent.package_manager.conda_env_as_base_docker', None) or \
bool(ENV_CONDA_ENV_PACKAGE.get())
@@ -129,16 +133,38 @@ class CondaAPI(PackageManager):
def bin(self):
return self.pip.bin
def _parse_package_marker_match_python_ver(self, line=None, marker_req=None):
if line:
marker_req = MarkerRequirement(Requirement.parse(line))
try:
mock_req = MarkerRequirement(Requirement.parse(marker_req.marker.replace("'", "").replace("\"", "")))
except Exception as ex:
print("WARNING: failed parsing, assuming package is okay {}".format(ex))
return marker_req
if not mock_req.compare_version(requested_version=self.python):
print("SKIPPING package `{}` not required python version {}".format(marker_req.tostr(), self.python))
return None
return marker_req
# noinspection SpellCheckingInspection
def upgrade_pip(self):
# do not change pip version if a pre-built environment is used
if self.env_read_only:
print('Conda environment in read-only mode, skipping pip upgrade.')
return ''
pip_versions = []
for req_pip_line in self.pip.get_pip_versions():
req = self._parse_package_marker_match_python_ver(line=req_pip_line)
if req:
pip_versions.append(req.tostr(markers=False))
return self._install(
*select_for_platform(
windows=self.pip.get_pip_versions(),
linux=self.pip.get_pip_versions()
windows=pip_versions,
linux=pip_versions
)
)
@@ -173,6 +199,14 @@ class CondaAPI(PackageManager):
else:
raise ValueError("Could not restore Conda environment, cannot find {}".format(
self.conda_pre_build_env_path))
elif self.use_conda_base_env:
try:
base_path = Path(self.conda).parent.parent.as_posix()
print("Using base conda environment at {}".format(base_path))
self._init_existing_environment(base_path, is_readonly=False)
return self
except Exception as ex:
print("WARNING: Failed using base conda environment, reverting to new environment: {}".format(ex))
command = Argv(
self.conda,
@@ -200,10 +234,25 @@ class CondaAPI(PackageManager):
return self
def _init_existing_environment(self, conda_pre_build_env_path):
def _init_existing_environment(self, conda_pre_build_env_path, is_readonly=True):
print("Using pre-existing Conda environment from {}".format(conda_pre_build_env_path))
self.path = Path(conda_pre_build_env_path)
self.source = ("conda", "activate", self.path.as_posix())
conda_env = self._get_conda_sh()
self.source = CommandSequence(('source', conda_env.as_posix()), self.source)
conda_packages_json = json.loads(
self._run_command((self.conda, "list", "--json", "-p", self.path), raw=True))
try:
for package in conda_packages_json:
if package.get("name") == "python" and package.get("version"):
self.python = ".".join(package.get("version").split(".")[:2])
print("Existing conda environment, found python version {}".format(self.python))
break
except Exception as ex:
print("WARNING: failed detecting existing conda python version: {}".format(ex))
self.pip = CondaPip(
session=self.session,
source=self.source,
@@ -211,9 +260,9 @@ class CondaAPI(PackageManager):
requirements_manager=self.requirements_manager,
path=self.path,
)
conda_env = self._get_conda_sh()
self.source = self.pip.source = CommandSequence(('source', conda_env.as_posix()), self.source)
self.env_read_only = True
self.pip.source = self.source
self.env_read_only = is_readonly
def remove(self):
"""
@@ -223,7 +272,7 @@ class CondaAPI(PackageManager):
Conda seems to load "vcruntime140.dll" from all its environment on startup.
This means environment have to be deleted using 'conda env remove'.
If necessary, conda can be fooled into deleting a partially-deleted environment by creating an empty file
in '<ENV>\conda-meta\history' (value found in 'conda.gateways.disk.test.PREFIX_MAGIC_FILE').
in '<ENV>\\conda-meta\\history' (value found in 'conda.gateways.disk.test.PREFIX_MAGIC_FILE').
Otherwise, it complains that said directory is not a conda environment.
See: https://github.com/conda/conda/issues/7682
@@ -499,7 +548,7 @@ class CondaAPI(PackageManager):
if '.' not in m.specs[0][1]:
continue
if m.name.lower() == 'cudatoolkit':
if m.name.lower() in ('cudatoolkit', 'cuda-toolkit'):
# skip cuda if we are running on CPU
if not cuda_version:
continue
@@ -526,10 +575,22 @@ class CondaAPI(PackageManager):
has_torch = True
m.req.name = 'tensorflow-gpu' if cuda_version > 0 else 'tensorflow'
# push the clearml packages into the pip_requirements
if "clearml" in m.req.name and "clearml" not in self.extra_channels:
if self.session.debug_mode:
print("info: moving `{}` packages to `pip` section".format(m.req))
pip_requirements.append(m)
continue
reqs.append(m)
if not has_cudatoolkit and cuda_version:
m = MarkerRequirement(Requirement.parse("cudatoolkit == {}".format(cuda_version_full)))
# nvidia channel is using `cuda-toolkit` and has newer versions of cuda,
# older cuda can be picked from conda-forge (<12)
if "nvidia" in self.extra_channels:
m = MarkerRequirement(Requirement.parse("cuda-toolkit == {}".format(cuda_version_full)))
else:
m = MarkerRequirement(Requirement.parse("cudatoolkit == {}".format(cuda_version_full)))
has_cudatoolkit = True
reqs.append(m)
@@ -589,21 +650,30 @@ class CondaAPI(PackageManager):
if r.name and not r.name.startswith('_') and not requirements.get('conda', None):
r.name = r.name.replace('_', '-')
if has_cudatoolkit and r.specs and len(r.specs[0]) > 1 and r.name == 'cudatoolkit':
if has_cudatoolkit and r.specs and len(r.specs[0]) > 1 and r.name in ('cudatoolkit', 'cuda-toolkit'):
# select specific cuda version if it came from the requirements
r.specs = [(r.specs[0][0].replace('==', '='), r.specs[0][1].split('.post')[0])]
elif r.specs and r.specs[0] and len(r.specs[0]) > 1:
# remove .post from version numbers it fails with ~= version, and change == to ~=
r.specs = [(r.specs[0][0].replace('==', '~='), r.specs[0][1].split('.post')[0])]
r.specs = [(s[0].replace('==', '~='), s[1].split('.post')[0]) for s in r.specs]
while reqs:
# notice, we give conda more freedom in version selection, to help it choose best combination
def clean_ver(ar):
if not ar.specs:
return ar.tostr()
ar.specs = [(ar.specs[0][0], ar.specs[0][1] + '.0' if '.' not in ar.specs[0][1] else ar.specs[0][1])]
return ar.tostr()
conda_env['dependencies'] = [clean_ver(r) for r in reqs]
markers = None
if ar.marker:
# check if we really need it based on python version
ar = self._parse_package_marker_match_python_ver(marker_req=ar)
if not ar:
# empty lines should be skipped
return ""
# if we do make sure we note that we ignored markers
print("WARNING: ignoring marker in `{}`".format(ar.tostr()))
markers = False
if ar.specs:
ar.specs = [(s[0], s[1] + '.0' if '.' not in s[1] else s[1]) for s in ar.specs]
return ar.tostr(markers=markers)
conda_env['dependencies'] = [clean_ver(r) for r in reqs if clean_ver(r)]
with self.temp_file("conda_env", yaml.dump(conda_env), suffix=".yml") as name:
print('Conda: Trying to install requirements:\n{}'.format(conda_env['dependencies']))
if self.session.debug_mode:
@@ -730,6 +800,25 @@ class CondaAPI(PackageManager):
return conda_env
return base_conda_env
def add_cached_venv(self, *args, **kwargs):
"""
Copy the local venv folder into the venv cache (keys are based on the requirements+python+docker).
"""
# do not cache if this is a base conda environment
if self.conda_env_as_base_docker or self.use_conda_base_env:
return
return super().add_cached_venv(*args, **kwargs)
def get_cached_venv(self, *args, **kwargs):
"""
Copy a cached copy of the venv (based on the requirements) into destination_folder.
Return None if failed or cached entry does not exist
"""
# do not cache if this is a base conda environment
if self.conda_env_as_base_docker or self.use_conda_base_env:
return
return super().get_cached_venv(*args, **kwargs)
# enable hashing with cmp=False because pdb fails on un-hashable exceptions
exception = attrs(str=True, cmp=False)


@@ -92,21 +92,14 @@ class ExternalRequirements(SimpleSubstitution):
vcs_url = req_line[4:]
# reverse replace
vcs_url = vcs_url[::-1].replace(fragment[::-1], '', 1)[::-1]
# remove ssh:// or git:// prefix for git detection and credentials
scheme = ''
full_vcs_url = vcs_url
if vcs_url and (vcs_url.startswith('ssh://') or vcs_url.startswith('git://')):
scheme = 'ssh://' # notice git:// is actually ssh://
vcs_url = vcs_url[6:]
# notice git:// is actually ssh://
if vcs_url and vcs_url.startswith('git://'):
vcs_url = vcs_url.replace('git://', 'ssh://', 1)
from ..repo import Git
vcs = Git(session=session, url=full_vcs_url, location=None, revision=None)
vcs = Git(session=session, url=vcs_url, location=None, revision=None)
vcs._set_ssh_url()
new_req_line = 'git+{}{}{}'.format(
'' if scheme and '://' in vcs.url else scheme,
vcs_url if session.config.get('agent.force_git_ssh_protocol', None) else vcs.url_with_auth,
fragment
)
new_req_line = 'git+{}{}'.format(vcs.url_with_auth, fragment)
if new_req_line != req_line:
furl_line = furl(new_req_line)
print('Replacing original pip vcs \'{}\' with \'{}\''.format(


@@ -4,7 +4,7 @@ from itertools import chain
from pathlib import Path
from typing import Text, Optional
from clearml_agent.definitions import PIP_EXTRA_INDICES, PROGRAM_NAME
from clearml_agent.definitions import PIP_EXTRA_INDICES, PROGRAM_NAME, ENV_PIP_EXTRA_INSTALL_FLAGS
from clearml_agent.helper.package.base import PackageManager
from clearml_agent.helper.process import Argv, DEVNULL
from clearml_agent.session import Session
@@ -12,8 +12,6 @@ from clearml_agent.session import Session
class SystemPip(PackageManager):
indices_args = None
def __init__(self, interpreter=None, session=None):
# type: (Optional[Text], Optional[Session]) -> ()
"""
@@ -52,7 +50,7 @@ class SystemPip(PackageManager):
package,
'--dest', cache_dir,
'--no-deps',
) + self.install_flags()
) + self.download_flags()
)
def load_requirements(self, requirements):
@@ -65,13 +63,14 @@ class SystemPip(PackageManager):
def uninstall(self, package):
self.run_with_env(('uninstall', '-y', package))
def freeze(self):
def freeze(self, freeze_full_environment=False):
"""
pip freeze to all install packages except the running program
:return: Dict contains pip as key and pip's packages to install
:rtype: Dict[str: List[str]]
"""
packages = self.run_with_env(('freeze',), output=True).splitlines()
packages = self.run_with_env(
('freeze',) if not freeze_full_environment else ('freeze', '--all'), output=True).splitlines()
packages_without_program = [package for package in packages if PROGRAM_NAME not in package]
return {'pip': packages_without_program}
@@ -87,14 +86,30 @@ class SystemPip(PackageManager):
# make sure we are not running it with our own PYTHONPATH
env = dict(**os.environ)
env.pop('PYTHONPATH', None)
# Debug print
if self.session.debug_mode:
print(command)
return (command.get_output if output else command.check_call)(stdin=DEVNULL, env=env, **kwargs)
def _make_command(self, command):
return Argv(self.bin, '-m', 'pip', '--disable-pip-version-check', *command)
def install_flags(self):
if self.indices_args is None:
self.indices_args = tuple(
chain.from_iterable(('--extra-index-url', x) for x in PIP_EXTRA_INDICES)
)
return self.indices_args
indices_args = tuple(
chain.from_iterable(('--extra-index-url', x) for x in PIP_EXTRA_INDICES)
)
extra_pip_flags = \
ENV_PIP_EXTRA_INSTALL_FLAGS.get() or \
self.session.config.get("agent.package_manager.extra_pip_install_flags", None)
return (indices_args + tuple(extra_pip_flags)) if extra_pip_flags else indices_args
def download_flags(self):
indices_args = tuple(
chain.from_iterable(('--extra-index-url', x) for x in PIP_EXTRA_INDICES)
)
return indices_args


@@ -64,9 +64,18 @@ class VirtualenvPip(SystemPip, PackageManager):
Only valid if instantiated with path.
Use self.python as self.bin does not exist.
"""
self.session.command(
self.python, "-m", "virtualenv", self.path, *self.create_flags()
).check_call()
# noinspection PyBroadException
try:
self.session.command(
self.python, "-m", "virtualenv", self.path, *self.create_flags()
).check_call()
except Exception as ex:
# let's try with std library instead
print("WARNING: virtualenv call failed: {}\n INFO: Creating virtual environment with venv".format(ex))
self.session.command(
self.python, "-m", "venv", self.path, *self.create_flags()
).check_call()
return self
def remove(self):


@@ -6,6 +6,7 @@ import sys
import os
from pathlib2 import Path
from clearml_agent.definitions import ENV_AGENT_FORCE_POETRY
from clearml_agent.helper.process import Argv, DEVNULL, check_if_command_exists
from clearml_agent.session import Session, POETRY
@@ -39,11 +40,11 @@ def prop_guard(prop, log_prop=None):
class PoetryConfig:
def __init__(self, session, interpreter=None):
# type: (Session, str) -> ()
def __init__(self, session):
# type: (Session) -> None
self.session = session
self._log = session.get_logger(__name__)
self._python = interpreter or sys.executable
self._python = sys.executable # default, overwritten from session config in initialize()
self._initialized = False
@property
@@ -52,7 +53,7 @@ class PoetryConfig:
@property
def enabled(self):
return self.session.config["agent.package_manager.type"] == POETRY
return ENV_AGENT_FORCE_POETRY.get() or self.session.config["agent.package_manager.type"] == POETRY
_guard_enabled = prop_guard(enabled, log)
@@ -69,7 +70,7 @@ class PoetryConfig:
path = path.replace(':'+sys.base_prefix, ':'+sys.real_prefix, 1)
kwargs['env']['PATH'] = path
if self.session and self.session.config:
if self.session and self.session.config and args and args[0] == "install":
extra_args = self.session.config.get("agent.package_manager.poetry_install_extra_args", None)
if extra_args:
args = args + tuple(extra_args)
@@ -87,32 +88,53 @@ class PoetryConfig:
@_guard_enabled
def initialize(self, cwd=None):
if not self._initialized:
# use correct python version -- detected in Worker.install_virtualenv() and written to
# session
if self.session.config.get("agent.python_binary", None):
self._python = self.session.config.get("agent.python_binary")
if self.session.config.get("agent.package_manager.poetry_version", None) is not None:
version = str(self.session.config.get("agent.package_manager.poetry_version"))
print('Upgrading Poetry package {}'.format(version))
# first upgrade pip if we need to
try:
from clearml_agent.helper.package.pip_api.venv import VirtualenvPip
pip = VirtualenvPip(
session=self.session, python=self._python,
requirements_manager=None, path=None, interpreter=self._python)
pip.upgrade_pip()
except Exception as ex:
self.log.warning("failed upgrading pip: {}".format(ex))
# get poetry version
version = version.replace(' ', '')
if ('=' in version) or ('~' in version) or ('<' in version) or ('>' in version):
version = version
elif version:
version = "==" + version
# (we are not running it yet)
argv = Argv(self._python, "-m", "pip", "install", "poetry{}".format(version),
"--upgrade", "--disable-pip-version-check")
# this is just for beauty and checks, we already set the version in the Argv
if not version:
version = "latest"
else:
# mark to install poetry if not already installed (we are not running it yet)
argv = Argv(self._python, "-m", "pip", "install", "poetry", "--disable-pip-version-check")
version = ""
# first upgrade pip if we need to
try:
from clearml_agent.helper.package.pip_api.venv import VirtualenvPip
pip = VirtualenvPip(
session=self.session, python=self._python,
requirements_manager=None, path=None, interpreter=self._python)
pip.upgrade_pip()
except Exception as ex:
self.log.warning("failed upgrading pip: {}".format(ex))
# check if we do not have a specific version and poetry is found skip installation
if not version and check_if_command_exists("poetry"):
print("Notice: Poetry was found, no specific version required, skipping poetry installation")
else:
print('Installing / Upgrading Poetry package to {}'.format(version))
# now install poetry
try:
version = version.replace(' ', '')
if ('=' in version) or ('~' in version) or ('<' in version) or ('>' in version):
version = version
elif version:
version = "==" + version
argv = Argv(self._python, "-m", "pip", "install", "poetry{}".format(version),
"--upgrade", "--disable-pip-version-check")
print(argv.get_output())
except Exception as ex:
self.log.warning("failed upgrading poetry: {}".format(ex))
self.log.warning("failed installing poetry: {}".format(ex))
# now setup poetry
self._initialized = True
try:
self._config("--local", "virtualenvs.in-project", "true", cwd=cwd)
@@ -147,7 +169,7 @@ class PoetryAPI(object):
any((self.path / indicator).exists() for indicator in self.INDICATOR_FILES)
)
def freeze(self):
def freeze(self, freeze_full_environment=False):
lines = self.config.run("show", cwd=str(self.path)).splitlines()
lines = [[p for p in line.split(' ') if p] for line in lines]
return {"pip": [parts[0]+'=='+parts[1]+' # '+' '.join(parts[2:]) for parts in lines]}
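
Note on the Poetry version handling above: the configured `agent.package_manager.poetry_version` value is turned into a pip requirement by stripping spaces and adding `==` only when no comparison operator is present. A minimal sketch of that rule; `poetry_requirement` is a hypothetical name and not part of the changed file.

```python
def poetry_requirement(version):
    """Turn a configured Poetry version into a pip requirement string.

    Hypothetical helper mirroring the normalization in the diff above.
    """
    version = str(version or "").replace(" ", "")
    # A bare version number is pinned; anything carrying an operator is kept.
    if version and not any(op in version for op in ("=", "~", "<", ">")):
        version = "==" + version
    return "poetry" + version


if __name__ == "__main__":
    for spec in ("1.8.3", ">= 1.7", "~=1.6", None):
        print(repr(spec), "->", poetry_requirement(spec))
```

An empty or missing value produces plain `poetry`, which corresponds to the case where the diff skips installation if a `poetry` command is already available.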


@@ -7,7 +7,7 @@ from .requirements import SimpleSubstitution
class PriorityPackageRequirement(SimpleSubstitution):
name = ("cython", "numpy", "setuptools", )
name = ("cython", "numpy", "setuptools", "pip", )
optional_package_names = tuple()
def __init__(self, *args, **kwargs):
@@ -50,31 +50,39 @@ class PriorityPackageRequirement(SimpleSubstitution):
"""
# if we replaced setuptools, it means someone requested it, and since freeze will not contain it,
# we need to add it manually
if not self._replaced_packages or "setuptools" not in self._replaced_packages:
if not self._replaced_packages:
return list_of_requirements
try:
for k, lines in list_of_requirements.items():
# k is either pip/conda
if k not in ('pip', 'conda'):
continue
for i, line in enumerate(lines):
if not line or line.lstrip().startswith('#'):
continue
parts = [p for p in re.split(r'\s|=|\.|<|>|~|!|@|#', line) if p]
if not parts:
continue
# if we found setuptools, do nothing
if parts[0] == "setuptools":
return list_of_requirements
if "pip" in self._replaced_packages:
full_freeze = PackageManager.out_of_scope_freeze(freeze_full_environment=True)
# now let's look for pip
pips = [line for line in full_freeze.get("pip", []) if line.split("==")[0] == "pip"]
if pips and "pip" in list_of_requirements:
list_of_requirements["pip"] = [pips[0]] + list_of_requirements["pip"]
# if we are here it means we have not found setuptools
# we should add it:
if "pip" in list_of_requirements:
list_of_requirements["pip"] = [self._replaced_packages["setuptools"]] + list_of_requirements["pip"]
if "setuptools" in self._replaced_packages:
try:
for k, lines in list_of_requirements.items():
# k is either pip/conda
if k not in ('pip', 'conda'):
continue
for i, line in enumerate(lines):
if not line or line.lstrip().startswith('#'):
continue
parts = [p for p in re.split(r'\s|=|\.|<|>|~|!|@|#', line) if p]
if not parts:
continue
# if we found setuptools, do nothing
if parts[0] == "setuptools":
return list_of_requirements
except Exception as ex: # noqa
return list_of_requirements
# if we are here it means we have not found setuptools
# we should add it:
if "pip" in list_of_requirements:
list_of_requirements["pip"] = [self._replaced_packages["setuptools"]] + list_of_requirements["pip"]
except Exception as ex: # noqa
return list_of_requirements
return list_of_requirements
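Review note: the new `pip` branch above takes the `pip` entry from a full environment freeze and puts it at the head of the requirements list. A minimal sketch of that step, assuming the freeze is passed in as a plain dict rather than fetched through `PackageManager.out_of_scope_freeze`; `prepend_pip_from_freeze` is a hypothetical name.

```python
def prepend_pip_from_freeze(list_of_requirements, full_freeze):
    """Put the frozen `pip==X.Y` line first in the "pip" requirements.

    Hypothetical helper: both arguments are dicts shaped like
    {"pip": ["name==version", ...]}, as in the hunk above.
    """
    # keep only the line whose package name is exactly "pip"
    pips = [line for line in full_freeze.get("pip", []) if line.split("==")[0] == "pip"]
    if not pips or "pip" not in list_of_requirements:
        return list_of_requirements
    # copy so the caller's dict is left untouched
    updated = dict(list_of_requirements)
    updated["pip"] = [pips[0]] + list(list_of_requirements["pip"])
    return updated


if __name__ == "__main__":
    reqs = {"pip": ["numpy==1.26.4"]}
    freeze = {"pip": ["pip==24.2", "numpy==1.26.4", "setuptools==70.0.0"]}
    print(prepend_pip_from_freeze(reqs, freeze))
```

Matching on `line.split("==")[0] == "pip"` avoids picking up packages whose names merely start with `pip`.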


@@ -16,6 +16,7 @@ import six
from .requirements import (
SimpleSubstitution, FatalSpecsResolutionError, SimpleVersion, MarkerRequirement,
compare_version_rules, )
from ...definitions import ENV_PACKAGE_PYTORCH_RESOLVE
from ...external.requirements_parser.requirement import Requirement
OS_TO_WHEEL_NAME = {"linux": "linux_x86_64", "windows": "win_amd64"}
@@ -174,6 +175,7 @@ class PytorchRequirement(SimpleSubstitution):
extra_index_url_template = 'https://download.pytorch.org/whl/cu{}/'
nightly_extra_index_url_template = 'https://download.pytorch.org/whl/nightly/cu{}/'
torch_index_url_lookup = {}
resolver_types = ("pip", "direct", "none")
def __init__(self, *args, **kwargs):
os_name = kwargs.pop("os_override", None)
@@ -208,6 +210,13 @@ class PytorchRequirement(SimpleSubstitution):
if self.config.get("agent.package_manager.torch_url_template", None):
PytorchWheel.url_template = \
self.config.get("agent.package_manager.torch_url_template", None)
self.resolve_algorithm = str(
ENV_PACKAGE_PYTORCH_RESOLVE.get() or
self.config.get("agent.package_manager.pytorch_resolve", "pip")).lower()
if self.resolve_algorithm not in self.resolver_types:
print("WARNING: agent.package_manager.pytorch_resolve=={} not in {} reverting to '{}'".format(
self.resolve_algorithm, self.resolver_types, self.resolver_types[0]))
self.resolve_algorithm = self.resolver_types[0]
def _init_python_ver_cuda_ver(self):
if self.cuda is None:
@@ -261,6 +270,10 @@ class PytorchRequirement(SimpleSubstitution):
)
def match(self, req):
if self.resolve_algorithm == "none":
# skipping resolver
return False
return req.name in self.packages
@staticmethod
@@ -310,6 +323,12 @@ class PytorchRequirement(SimpleSubstitution):
# yes this is for linux python 2.7 support, this is the only python 2.7 we support...
if py_ver and py_ver[0] == '2' and len(parts) > 3 and not parts[3].endswith('u'):
continue
# check if this is an actual match
if not req.compare_version(v) or \
(last_v and SimpleVersion.compare_versions(last_v, '>', v, ignore_sub_versions=False)):
continue
# update the closest matched version (from above)
if not closest_v:
closest_v = v
@@ -318,10 +337,6 @@ class PytorchRequirement(SimpleSubstitution):
SimpleVersion.compare_versions(
version_a=v, op='>=', version_b=req.specs[0][1], num_parts=3):
closest_v = v
# check if this is an actual match
if not req.compare_version(v) or \
(last_v and SimpleVersion.compare_versions(last_v, '>', v, ignore_sub_versions=False)):
continue
url = '/'.join(torch_url.split('/')[:-1] + l.split('/'))
last_v = v
@@ -345,8 +360,10 @@ class PytorchRequirement(SimpleSubstitution):
from pip._internal.commands.show import search_packages_info
installed_torch = list(search_packages_info([req.name]))
# notice the comparison order, the first part will make sure we have a valid installed package
installed_torch_version = (getattr(installed_torch[0], 'version', None) or installed_torch[0]['version']) \
if installed_torch else None
installed_torch_version = \
(getattr(installed_torch[0], 'version', None) or
installed_torch[0]['version']) if installed_torch else None
if installed_torch and installed_torch_version and \
req.compare_version(installed_torch_version):
print('PyTorch: requested "{}" version {}, using pre-installed version {}'.format(
@@ -354,6 +371,7 @@ class PytorchRequirement(SimpleSubstitution):
# package already installed, do nothing
req.specs = [('==', str(installed_torch_version))]
return '{} {} {}'.format(req.name, req.specs[0][0], req.specs[0][1]), True
except Exception:
pass
@@ -475,6 +493,26 @@ class PytorchRequirement(SimpleSubstitution):
return self.match_version(req, base).replace(" ", "\n")
def replace(self, req):
# we first try to resolve things ourselves because pytorch pip is not always picking the correct
# versions from their pip repository
resolve_algorithm = self.resolve_algorithm
if resolve_algorithm == "none":
# skipping resolver
return None
elif resolve_algorithm == "direct":
# noinspection PyBroadException
try:
new_req = self._replace(req)
if new_req:
self._original_req.append((req, new_req))
return new_req
except Exception:
print("Warning: Failed resolving using `pytorch_resolve=direct` reverting to `pytorch_resolve=pip`")
elif resolve_algorithm not in self.resolver_types:
print("Warning: `agent.package_manager.pytorch_resolve={}` "
"unrecognized, default to `pip`".format(resolve_algorithm))
# check if package is already installed with system packages
self.validate_python_version()
@@ -508,6 +546,8 @@ class PytorchRequirement(SimpleSubstitution):
# return the original line
line = req.line
print("PyTorch: Adding index `{}` and installing `{}`".format(extra_index_url[0], line))
return line
except Exception: # noqa
@@ -566,6 +606,19 @@ class PytorchRequirement(SimpleSubstitution):
:param list_of_requirements: {'pip': ['a==1.0', ]}
:return: {'pip': ['a==1.0', ]}
"""
def build_specific_version_req(a_line, a_name, a_new_req):
try:
r = Requirement.parse(a_line)
wheel_parts = r.uri.split("/")[-1].split('-')
version = str(wheel_parts[1].split('%')[0].split('+')[0])
new_r = Requirement.parse("{} == {} # {}".format(a_name, version, str(a_new_req)))
if new_r.line:
# great it worked!
return new_r.line
except: # noqa
pass
return None
if not self._original_req:
return list_of_requirements
try:
@@ -589,9 +642,18 @@ class PytorchRequirement(SimpleSubstitution):
if req.local_file:
lines[i] = '{}'.format(str(new_req))
else:
lines[i] = '{} # {}'.format(str(req), str(new_req))
# try to rebuild requirements with specific version:
new_line = build_specific_version_req(line, req.req.name, new_req)
if new_line:
lines[i] = new_line
else:
lines[i] = '{} # {}'.format(str(req), str(new_req))
else:
lines[i] = '{} # {}'.format(line, str(new_req))
new_line = build_specific_version_req(line, req.req.name, new_req)
if new_line:
lines[i] = new_line
else:
lines[i] = '{} # {}'.format(line, str(new_req))
break
except:
pass
@@ -608,8 +670,7 @@ class PytorchRequirement(SimpleSubstitution):
return MarkerRequirement(Requirement.parse(self._fix_setuptools))
return None
@classmethod
def get_torch_index_url(cls, cuda_version, nightly=False):
def get_torch_index_url(self, cuda_version, nightly=False):
# noinspection PyBroadException
try:
cuda = int(cuda_version)
@@ -619,39 +680,39 @@ class PytorchRequirement(SimpleSubstitution):
if nightly:
for c in range(cuda, max(-1, cuda-15), -1):
# then try the nightly builds, it might be there...
torch_url = cls.nightly_extra_index_url_template.format(c)
torch_url = self.nightly_extra_index_url_template.format(c)
# noinspection PyBroadException
try:
if requests.get(torch_url, timeout=10).ok:
print('Torch nightly CUDA {} index page found'.format(c))
cls.torch_index_url_lookup[c] = torch_url
return cls.torch_index_url_lookup[c], c
self.torch_index_url_lookup[c] = torch_url
return self.torch_index_url_lookup[c], c
except Exception:
pass
return
# first check if key is valid
if cuda in cls.torch_index_url_lookup:
return cls.torch_index_url_lookup[cuda], cuda
if cuda in self.torch_index_url_lookup:
return self.torch_index_url_lookup[cuda], cuda
# then try a new cuda version page
for c in range(cuda, max(-1, cuda-15), -1):
torch_url = cls.extra_index_url_template.format(c)
torch_url = self.extra_index_url_template.format(c)
# noinspection PyBroadException
try:
if requests.get(torch_url, timeout=10).ok:
print('Torch CUDA {} index page found'.format(c))
cls.torch_index_url_lookup[c] = torch_url
return cls.torch_index_url_lookup[c], c
print('Torch CUDA {} index page found, adding `{}`'.format(c, torch_url))
self.torch_index_url_lookup[c] = torch_url
return self.torch_index_url_lookup[c], c
except Exception:
pass
keys = sorted(cls.torch_index_url_lookup.keys(), reverse=True)
keys = sorted(self.torch_index_url_lookup.keys(), reverse=True)
for k in keys:
if k <= cuda:
return cls.torch_index_url_lookup[k], k
return self.torch_index_url_lookup[k], k
# return default - zero
return cls.torch_index_url_lookup[0], 0
return self.torch_index_url_lookup[0], 0
MAP = {
"windows": {


@@ -240,6 +240,23 @@ class SimpleVersion:
if not version_b:
return True
# remove trailing "*" in both
if "*" in version_a:
ignore_sub_versions = True
while version_a.endswith(".*"):
version_a = version_a[:-2]
if version_a == "*":
version_a = ""
num_parts = min(len(version_a.split('.')), len(version_b.split('.')), )
if "*" in version_b:
ignore_sub_versions = True
while version_b.endswith(".*"):
version_b = version_b[:-2]
if version_b == "*":
version_b = ""
num_parts = min(len(version_a.split('.')), len(version_b.split('.')), )
if not num_parts:
num_parts = max(len(version_a.split('.')), len(version_b.split('.')), )
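Review note: the hunk above strips trailing `.*` components so that a spec such as `1.13.*` is compared only on its leading parts. A simplified sketch of that idea, limited to equality and plain string comparison of components; `strip_wildcard` and `wildcard_equal` are hypothetical names, not part of `SimpleVersion`.

```python
def strip_wildcard(version):
    """Drop trailing ".*" components; return (stripped_version, had_wildcard)."""
    had_wildcard = "*" in version
    while version.endswith(".*"):
        version = version[:-2]
    if version == "*":
        version = ""
    return version, had_wildcard


def wildcard_equal(spec, candidate):
    """True if `candidate` matches `spec` on the parts left after stripping."""
    spec, _ = strip_wildcard(spec)
    if not spec:
        # a bare "*" matches anything
        return True
    spec_parts = spec.split(".")
    # compare only as many leading parts as the spec still has
    return candidate.split(".")[:len(spec_parts)] == spec_parts


if __name__ == "__main__":
    print(strip_wildcard("1.13.*"))
    print(wildcard_equal("1.13.*", "1.13.1"))
    print(wildcard_equal("1.13.*", "1.14.0"))
```

Comparing whole components rather than string prefixes keeps `1.13.*` from matching `1.130.0`.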


@@ -8,7 +8,6 @@ import subprocess
import sys
from contextlib import contextmanager
from copy import copy
from distutils.spawn import find_executable
from itertools import chain, repeat, islice
from os.path import devnull
from time import sleep
@@ -492,3 +491,40 @@ def double_quote(s):
# use single quotes, and put single quotes into double quotes
# the string $"b is then quoted as "$"""b"
return '"' + s.replace('"', '"\'\"\'"') + '"'
def find_executable(executable, path=None):
"""Tries to find 'executable' in the directories listed in 'path'.
A string listing directories separated by 'os.pathsep'; defaults to
os.environ['PATH']. Returns the complete filename or None if not found.
"""
_, ext = os.path.splitext(executable)
if (sys.platform == 'win32') and (ext != '.exe'):
executable = executable + '.exe'
if os.path.isfile(executable):
return executable
if path is None:
path = os.environ.get('PATH', None)
if path is None:
try:
path = os.confstr("CS_PATH")
except (AttributeError, ValueError):
# os.confstr() or CS_PATH is not available
path = os.defpath
# bpo-35755: Don't use os.defpath if the PATH environment variable is
# set to an empty string
# PATH='' doesn't match, whereas PATH=':' looks in the current directory
if not path:
return None
paths = path.split(os.pathsep)
for p in paths:
f = os.path.join(p, executable)
if os.path.isfile(f):
# the file exists, we have a shot at spawn working
return f
return None
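Review note: `distutils` was removed from the standard library in Python 3.12, which is presumably why this hunk vendors `find_executable`. For comparison, the standard library's `shutil.which` performs a similar PATH lookup, with one behavioral difference: it requires the file to be executable, whereas the vendored function only checks `os.path.isfile`. A small sketch, assuming a POSIX system; `locate` and `demo` are hypothetical names.

```python
import os
import shutil
import stat
import tempfile


def locate(executable, path=None):
    """Look up `executable` on `path` (defaults to the PATH environment variable)."""
    return shutil.which(executable, path=path)


def demo():
    """Create a throwaway script and look it up before and after chmod +x."""
    with tempfile.TemporaryDirectory() as tmp:
        tool = os.path.join(tmp, "mytool")
        with open(tool, "w") as f:
            f.write("#!/bin/sh\nexit 0\n")
        # not yet marked executable
        found_before = locate("mytool", path=tmp)
        os.chmod(tool, os.stat(tool).st_mode | stat.S_IXUSR)
        found_after = locate("mytool", path=tmp)
        return found_before, found_after == tool


if __name__ == "__main__":
    print(demo())
```

Whether the stricter executable check is desirable depends on the caller, so this is not a drop-in replacement for the function in the diff.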


@@ -6,7 +6,6 @@ import stat
import subprocess
import sys
import tempfile
from distutils.spawn import find_executable
from hashlib import md5
from os import environ
from random import random
@@ -19,7 +18,7 @@ from pathlib2 import Path
import six
from clearml_agent.definitions import ENV_AGENT_GIT_USER, ENV_AGENT_GIT_PASS, ENV_AGENT_GIT_HOST
from clearml_agent.definitions import ENV_AGENT_GIT_USER, ENV_AGENT_GIT_PASS, ENV_AGENT_GIT_HOST, ENV_GIT_CLONE_VERBOSE
from clearml_agent.helper.console import ensure_text, ensure_binary
from clearml_agent.errors import CommandFailedError
from clearml_agent.helper.base import (
@@ -30,7 +29,7 @@ from clearml_agent.helper.base import (
create_file_if_not_exists, safe_remove_file,
)
from clearml_agent.helper.os.locks import FileLock
from clearml_agent.helper.process import DEVNULL, Argv, PathLike, COMMAND_SUCCESS
from clearml_agent.helper.process import DEVNULL, Argv, PathLike, COMMAND_SUCCESS, find_executable
from clearml_agent.session import Session
@@ -197,8 +196,9 @@ class VCS(object):
self.log.info("successfully applied uncommitted changes")
return True
# Command-line flags for clone command
clone_flags = ()
def clone_flags(self):
"""Command-line flags for clone command"""
return tuple()
@abc.abstractmethod
def executable_not_found_error_help(self):
@@ -320,7 +320,10 @@ class VCS(object):
self.url, new_url))
self.url = new_url
return
# rewrite ssh URLs only if either ssh port or ssh user are forced in config
# TODO: fix, when url is in the form of `git@domain.com:user/project.git` we will fail to get scheme
# need to add ssh:// and replace first ":" with / , unless port is specified
if parsed_url.scheme == "ssh" and (
self.session.config.get('agent.force_git_ssh_port', None) or
self.session.config.get('agent.force_git_ssh_user', None)
@@ -334,6 +337,9 @@ class VCS(object):
print("Using SSH credentials - ssh url '{}' with ssh url '{}'".format(
self.url, new_url))
self.url = new_url
return
elif parsed_url.scheme == "ssh":
return
if not self.session.config.agent.translate_ssh:
return
@@ -341,11 +347,18 @@ class VCS(object):
# if we have git_user / git_pass replace ssh credentials with https authentication
if (ENV_AGENT_GIT_USER.get() or self.session.config.get('agent.git_user', None)) and \
(ENV_AGENT_GIT_PASS.get() or self.session.config.get('agent.git_pass', None)):
# only apply to a specific domain (if requested)
config_domain = \
ENV_AGENT_GIT_HOST.get() or self.session.config.get("git_host", None)
if config_domain and config_domain != furl(self.url).host:
return
ENV_AGENT_GIT_HOST.get() or self.session.config.get("agent.git_host", None)
if config_domain:
if config_domain != furl(self.url).host:
# bail out here if we have a git_host configured and it's different than the URL host
# however, we should make sure this is not an ssh@ URL that furl failed to parse
ssh_git_url_match = self.SSH_URL_GIT_SYNTAX.match(self.url)
if not ssh_git_url_match or config_domain != ssh_git_url_match.groupdict().get("host"):
# do not replace to ssh url
return
new_url = self.replace_ssh_url(self.url)
if new_url != self.url:
@@ -362,7 +375,7 @@ class VCS(object):
self._set_ssh_url()
# if we are on linux no need for the full auth url because we use GIT_ASKPASS
url = self.url_without_auth if self._use_ask_pass else self.url_with_auth
clone_command = ("clone", url, self.location) + self.clone_flags
clone_command = ("clone", url, self.location) + self.clone_flags()
# clone all branches regardless of when we want to later checkout
# if branch:
# clone_command += ("-b", branch)
@@ -539,7 +552,6 @@ class VCS(object):
class Git(VCS):
executable_name = "git"
main_branch = ("master", "main")
clone_flags = ("--quiet", "--recursive")
checkout_flags = ("--force",)
COMMAND_ENV = {
# do not prompt for password
@@ -551,7 +563,7 @@ class Git(VCS):
def __init__(self, *args, **kwargs):
super(Git, self).__init__(*args, **kwargs)
self._use_ask_pass = False if not self.session.config.get('agent.enable_git_ask_pass', None) \
self._use_ask_pass = False if not self.session.config.get('agent.enable_git_ask_pass', True) \
else sys.platform == "linux"
try:
@@ -565,6 +577,12 @@ class Git(VCS):
"origin/{}".format(b) for b in ([branch] if isinstance(branch, str) else branch)
]
def clone_flags(self):
return (
"--recursive",
"--verbose" if ENV_GIT_CLONE_VERBOSE.get() else "--quiet"
)
def executable_not_found_error_help(self):
return 'Cannot find "{}" executable. {}'.format(
self.executable_name,
@@ -579,7 +597,8 @@ class Git(VCS):
)
def pull(self):
self.call("fetch", "--all", "--recurse-submodules", cwd=self.location)
self._set_ssh_url()
self.call("fetch", "--all", "--tags", "--recurse-submodules", cwd=self.location)
def _git_pass_auth_wrapper(self, func, *args, **kwargs):
try:
@@ -761,7 +780,22 @@ def clone_repository_cached(session, execution, destination):
# We clone the entire repository, not a specific branch
vcs.clone() # branch=execution.branch)
vcs.pull()
print("pulling git")
try:
vcs.pull()
except Exception as ex:
print("git pull failed: {}".format(ex))
if (
session.config.get("agent.vcs_cache.enabled", False) and
session.config.get("agent.vcs_cache.clone_on_pull_fail", False)
):
print("pulling git failed, re-cloning: {}".format(no_password_url))
rm_tree(cached_repo_path)
vcs.clone()
else:
raise ex
print("pulling git completed")
rm_tree(destination)
shutil.copytree(Text(cached_repo_path), Text(clone_folder),
symlinks=select_for_platform(linux=True, windows=False),
@@ -792,8 +826,8 @@ def fix_package_import_diff_patch(entry_script_file):
lines = f.readlines()
except Exception:
return
# make sre we are the first import (i.e. we patched the source code)
if not lines or not lines[0].strip().startswith('from clearml ') or 'Task.init' not in lines[1]:
# make sure we are the first import (i.e. we patched the source code)
if len(lines or []) < 2 or not lines[0].strip().startswith('from clearml ') or 'Task.init' not in lines[1]:
return
original_lines = lines
@@ -850,3 +884,90 @@ def fix_package_import_diff_patch(entry_script_file):
f.writelines(new_lines)
except Exception:
return
def _locate_future_import(lines):
# type: (list[str]) -> int
"""
:param lines: string lines of a python file
:return: line index of the last __future__ import, or -1 if no __future__ import was found
"""
# skip over the first two lines, they are ours
# then skip over empty or comment lines
lines = [(i, line.split('#', 1)[0].rstrip()) for i, line in enumerate(lines)
if line.strip('\r\n\t ') and not line.strip().startswith('#')]
# remove triple quotes ' """ '
nested_c = -1
skip_lines = []
for i, line_pair in enumerate(lines):
for _ in line_pair[1].split('"""')[1:]:
if nested_c >= 0:
skip_lines.extend(list(range(nested_c, i + 1)))
nested_c = -1
else:
nested_c = i
# now select all the lines that are not inside a triple-quoted block
lines = [pair for i, pair in enumerate(lines) if i not in skip_lines]
from_future = re.compile(r"^from[\s]*__future__[\s]*")
import_future = re.compile(r"^import[\s]*__future__[\s]*")
# test if we have __future__ import
found_index = -1
for a_i, (_, a_line) in enumerate(lines):
if found_index >= a_i:
continue
if from_future.match(a_line) or import_future.match(a_line):
found_index = a_i
# check the last import block
i, line = lines[found_index]
# either we have a \\ character at the end of the line or an unclosed parenthesis
parenthesized_lines = '(' in line and ')' not in line
while line.endswith('\\') or parenthesized_lines:
found_index += 1
i, line = lines[found_index]
if ')' in line:
break
else:
break
return found_index if found_index < 0 else lines[found_index][0]
def patch_add_task_init_call(local_filename):
if not local_filename or not Path(local_filename).is_file() or not str(local_filename).lower().endswith(".py"):
return
idx_a = 0
# find the right entry for the patch if we have a local file (basically after __future__)
try:
with open(local_filename, 'rt') as f:
lines = f.readlines()
except Exception as ex:
print("Failed patching entry point file {}: {}".format(local_filename, ex))
return
future_found = _locate_future_import(lines)
if future_found >= 0:
idx_a = future_found + 1
# check if we have not already patched it, no need to add another one
if len(lines or []) >= idx_a+2 and lines[idx_a].strip().startswith('from clearml ') and 'Task.init' in lines[idx_a+1]:
print("File {} already patched with Task.init()".format(local_filename))
return
patch = [
"from clearml import Task\n",
"(__name__ != \"__main__\") or Task.init()\n",
]
lines = lines[:idx_a] + patch + lines[idx_a:]
# noinspection PyBroadException
try:
with open(local_filename, 'wt') as f:
f.writelines(lines)
except Exception as ex:
print("Failed patching entry point file {}: {}".format(local_filename, ex))
return
print("Force clearml Task.init patch adding to entry point script: {}".format(local_filename))
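Review note: `patch_add_task_init_call` above inserts two lines right after the last `__future__` import and skips files that already carry them. A deliberately simplified sketch of that placement rule follows. It handles only single-line `from __future__ import ...` statements and, unlike `_locate_future_import`, does not skip docstrings, comments, or continuation lines; `insert_after_future` is a hypothetical name.

```python
PATCH = [
    "from clearml import Task\n",
    '(__name__ != "__main__") or Task.init()\n',
]


def insert_after_future(lines):
    """Return `lines` with PATCH inserted after the last __future__ import."""
    idx = 0
    for i, line in enumerate(lines):
        # __future__ imports must be top-level statements, so no indentation
        if line.startswith("from __future__"):
            idx = i + 1
    # already patched: leave the file content unchanged
    if lines[idx:idx + 2] == PATCH:
        return lines
    return lines[:idx] + PATCH + lines[idx:]


if __name__ == "__main__":
    source = ["from __future__ import division\n", "import os\n"]
    print("".join(insert_after_future(source)))
```

Placing the patch after the `__future__` block matters because Python requires those imports to precede any other statement except the module docstring.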


@@ -1,19 +1,20 @@
from __future__ import unicode_literals, division
import logging
import os
import re
import shlex
from collections import deque
from itertools import starmap
from threading import Thread, Event
from time import time
from typing import Text, Sequence
from typing import Sequence, List, Union, Dict, Optional
import attr
import psutil
from pathlib2 import Path
from clearml_agent.definitions import ENV_WORKER_TAGS, ENV_GPU_FRACTIONS
from clearml_agent.session import Session
from clearml_agent.definitions import ENV_WORKER_TAGS
try:
from .gpu import gpustat
@@ -54,6 +55,14 @@ class ResourceMonitor(object):
if value is not None
}
@attr.s
class ClusterReport:
cluster_key = attr.ib(type=str)
max_gpus = attr.ib(type=int, default=None)
max_workers = attr.ib(type=int, default=None)
max_cpus = attr.ib(type=int, default=None)
resource_groups = attr.ib(type=Sequence[str], factory=list)
def __init__(
self,
session, # type: Session
@@ -61,7 +70,7 @@ class ResourceMonitor(object):
sample_frequency_per_sec=2.0,
report_frequency_sec=30.0,
first_report_sec=None,
worker_tags=None,
worker_tags=None
):
self.session = session
self.queue = deque(maxlen=1)
@@ -79,6 +88,13 @@ class ResourceMonitor(object):
self._gpustat_fail = 0
self._gpustat = gpustat
self._active_gpus = None
self._default_gpu_utilization = session.config.get("agent.resource_monitoring.default_gpu_utilization", 100)
# allow default_gpu_utilization as null in the config, in which case we don't log anything
if self._default_gpu_utilization is not None:
self._default_gpu_utilization = int(self._default_gpu_utilization)
self._gpu_utilization_warning_sent = False
self._disk_use_path = str(session.config.get("agent.resource_monitoring.disk_use_path", None) or Path.home())
self._fractions_handler = GpuFractionsHandler() if session.feature_set != "basic" else None
if not worker_tags and ENV_WORKER_TAGS.get():
worker_tags = shlex.split(ENV_WORKER_TAGS.get())
self._worker_tags = worker_tags
@@ -91,6 +107,7 @@ class ResourceMonitor(object):
else:
# None means no filtering, report all gpus
self._active_gpus = None
# noinspection PyBroadException
try:
active_gpus = Session.get_nvidia_visible_env()
# None means no filtering, report all gpus
@@ -98,6 +115,10 @@ class ResourceMonitor(object):
self._active_gpus = [g.strip() for g in str(active_gpus).split(',')]
except Exception:
pass
self._cluster_report_interval_sec = int(session.config.get(
"agent.resource_monitoring.cluster_report_interval_sec", 60
))
self._cluster_report = None
def set_report(self, report):
# type: (ResourceMonitor.StatusReport) -> ()
@@ -129,6 +150,7 @@ class ResourceMonitor(object):
)
log.debug("sending report: %s", report)
# noinspection PyBroadException
try:
self.session.get(service="workers", action="status_report", **report)
except Exception:
@@ -136,45 +158,126 @@ class ResourceMonitor(object):
return False
return True
def send_cluster_report(self) -> bool:
if self.session.feature_set == "basic":
return False
# noinspection PyBroadException
try:
properties = {
"max_cpus": self._cluster_report.max_cpus,
"max_gpus": self._cluster_report.max_gpus,
"max_workers": self._cluster_report.max_workers,
}
payload = {
"key": self._cluster_report.cluster_key,
"timestamp": int(time() * 1000),
"timeout": int(self._cluster_report_interval_sec * 2),
# "resource_groups": self._cluster_report.resource_groups, # yet to be supported
"properties": {k: v for k, v in properties.items() if v is not None},
}
self.session.post(service="workers", action="cluster_report", **payload)
except Exception as ex:
log.warning("Failed sending cluster report: %s", ex)
return False
return True
def setup_cluster_report(self, available_gpus, gpu_queues, worker_id=None, cluster_key=None, resource_groups=None):
# type: (List[int], Dict[str, int], Optional[str], Optional[str], Optional[List[str]]) -> ()
"""
Set up a cluster report for the enterprise server dashboard feature.
If a worker_id is provided, cluster_key and resource_groups are inferred from it.
"""
if self.session.feature_set == "basic":
return
if not worker_id and not cluster_key:
print("Error: cannot set up dashboard reporting - worker_id or cluster key are required")
return
# noinspection PyBroadException
try:
if not cluster_key:
worker_id_parts = worker_id.split(":")
if len(worker_id_parts) < 3:
cluster_key = self.session.config.get("agent.resource_dashboard.default_cluster_name", "onprem")
resource_group = ":".join((cluster_key, worker_id_parts[0]))
print(
'WARNING: your worker ID "{}" is not suitable for proper resource dashboard reporting, please '
'set up agent.worker_name to be at least two colon-separated parts (i.e. "<category>:<name>"). '
'Using "{}" as the resource dashboard category and "{}" as the resource group.'.format(
worker_id, cluster_key, resource_group
)
)
else:
cluster_key = worker_id_parts[0]
resource_group = ":".join((worker_id_parts[:2]))
resource_groups = [resource_group]
self._cluster_report = ResourceMonitor.ClusterReport(
cluster_key=cluster_key,
max_gpus=len(available_gpus),
max_workers=len(available_gpus) // min(x for x, _ in gpu_queues.values()),
resource_groups=resource_groups
)
self.send_cluster_report()
except Exception as ex:
print("Error: failed setting cluster report: {}".format(ex))
def _daemon(self):
last_cluster_report = 0
seconds_since_started = 0
reported = 0
while True:
last_report = time()
current_report_frequency = (
self._report_frequency if reported != 0 else self._first_report_sec
)
while (time() - last_report) < current_report_frequency:
# wait for self._sample_frequency seconds, if event set quit
if self._exit_event.wait(1 / self._sample_frequency):
return
# noinspection PyBroadException
try:
self._update_readouts()
except Exception as ex:
log.warning("failed getting machine stats: %s", report_error(ex))
self._failure()
try:
while True:
last_report = time()
current_report_frequency = (
self._report_frequency if reported != 0 else self._first_report_sec
)
while (time() - last_report) < current_report_frequency:
# wait for self._sample_frequency seconds, if event set quit
if self._exit_event.wait(1 / self._sample_frequency):
return
# noinspection PyBroadException
try:
self._update_readouts()
except Exception as ex:
log.error("failed getting machine stats: %s", report_error(ex))
self._failure()
seconds_since_started += int(round(time() - last_report))
# check if we do not report any metric (so it means the last iteration will not be changed)
seconds_since_started += int(round(time() - last_report))
# check if we do not report any metric (so it means the last iteration will not be changed)
# if we do not have last_iteration, we just use seconds as iteration
# if we do not have last_iteration, we just use seconds as iteration
# start reporting only when we figured out, if this is seconds based, or iterations based
average_readouts = self._get_average_readouts()
stats = {
# 3 points after the dot
key: round(value, 3) if isinstance(value, float) else [round(v, 3) for v in value]
for key, value in average_readouts.items()
}
# start reporting only when we figured out, if this is seconds based, or iterations based
average_readouts = self._get_average_readouts()
stats = {
# 3 points after the dot
key: round(value, 3) if isinstance(value, float) else [round(v, 3) for v in value]
for key, value in average_readouts.items()
}
# send actual report
if self.send_report(stats):
# clear readouts if this update was sent
self._clear_readouts()
# send actual report
if self.send_report(stats):
# clear readouts if this update was sent
self._clear_readouts()
# count reported iterations
reported += 1
# count reported iterations
reported += 1
if (
self._cluster_report and
self._cluster_report_interval_sec
and time() - last_cluster_report > self._cluster_report_interval_sec
):
if self.send_cluster_report():
last_cluster_report = time()
except Exception as ex:
log.exception("Error reporting monitoring info: %s", str(ex))
def _update_readouts(self):
readouts = self._machine_stats()
@@ -239,7 +342,7 @@ class ResourceMonitor(object):
virtual_memory = psutil.virtual_memory()
stats["memory_used"] = BytesSizes.megabytes(virtual_memory.used)
stats["memory_free"] = BytesSizes.megabytes(virtual_memory.available)
disk_use_percentage = psutil.disk_usage(Text(Path.home())).percent
disk_use_percentage = psutil.disk_usage(self._disk_use_path).percent
stats["disk_free_percent"] = 100 - disk_use_percentage
sensor_stat = (
psutil.sensors_temperatures()
@@ -261,23 +364,47 @@ class ResourceMonitor(object):
if self._active_gpus is not False and self._gpustat:
try:
gpu_stat = self._gpustat.new_query()
report_index = 0
for i, g in enumerate(gpu_stat.gpus):
# only monitor the active gpu's, if none were selected, monitor everything
if self._active_gpus and str(i) not in self._active_gpus:
continue
stats["gpu_temperature_{:d}".format(i)] = g["temperature.gpu"]
stats["gpu_utilization_{:d}".format(i)] = g["utilization.gpu"]
stats["gpu_mem_usage_{:d}".format(i)] = (
if self._active_gpus:
uuid = getattr(g, "uuid", None)
mig_uuid = getattr(g, "mig_uuid", None)
if (
str(g.index) not in self._active_gpus
and (not uuid or uuid not in self._active_gpus)
and (not mig_uuid or mig_uuid not in self._active_gpus)
):
continue
stats["gpu_temperature_{}".format(report_index)] = g["temperature.gpu"]
if g["utilization.gpu"] is not None:
stats["gpu_utilization_{}".format(report_index)] = g["utilization.gpu"]
elif self._default_gpu_utilization is not None:
stats["gpu_utilization_{}".format(report_index)] = self._default_gpu_utilization
if getattr(g, "mig_index", None) is None and not self._gpu_utilization_warning_sent:
# this shouldn't happen for non-MIGs, warn the user about it
log.error("Failed fetching GPU utilization")
self._gpu_utilization_warning_sent = True
stats["gpu_mem_usage_{}".format(report_index)] = (
100.0 * g["memory.used"] / g["memory.total"]
)
# already in MBs
stats["gpu_mem_free_{:d}".format(i)] = (
stats["gpu_mem_free_{}".format(report_index)] = (
g["memory.total"] - g["memory.used"]
)
stats["gpu_mem_used_%d" % i] = g["memory.used"]
stats["gpu_mem_used_{}".format(report_index)] = g["memory.used"] or 0
if self._fractions_handler:
fractions = self._fractions_handler.fractions
stats["gpu_fraction_{}".format(report_index)] = \
(fractions[i] if i < len(fractions) else fractions[-1]) if fractions else 1.0
except Exception as ex:
# something happened and we can't use gpu stats,
log.warning("failed getting machine stats: %s", report_error(ex))
log.error("failed getting machine stats: %s", report_error(ex))
self._failure()
return stats
@@ -290,19 +417,142 @@ class ResourceMonitor(object):
)
self._gpustat = None
BACKEND_STAT_MAP = {"cpu_usage_*": "cpu_usage",
"cpu_temperature_*": "cpu_temperature",
"disk_free_percent": "disk_free_home",
"io_read_mbs": "disk_read",
"io_write_mbs": "disk_write",
"network_tx_mbs": "network_tx",
"network_rx_mbs": "network_rx",
"memory_free": "memory_free",
"memory_used": "memory_used",
"gpu_temperature_*": "gpu_temperature",
"gpu_mem_used_*": "gpu_memory_used",
"gpu_mem_free_*": "gpu_memory_free",
"gpu_utilization_*": "gpu_usage"}
BACKEND_STAT_MAP = {
"cpu_usage_*": "cpu_usage",
"cpu_temperature_*": "cpu_temperature",
"disk_free_percent": "disk_free_home",
"io_read_mbs": "disk_read",
"io_write_mbs": "disk_write",
"network_tx_mbs": "network_tx",
"network_rx_mbs": "network_rx",
"memory_free": "memory_free",
"memory_used": "memory_used",
"gpu_temperature_*": "gpu_temperature",
"gpu_mem_used_*": "gpu_memory_used",
"gpu_mem_free_*": "gpu_memory_free",
"gpu_utilization_*": "gpu_usage",
"gpu_fraction_*": "gpu_fraction"
}
class GpuFractionsHandler:
_number_re = re.compile(r"^clear\.ml/fraction(-\d+)?$")
_mig_re = re.compile(r"^nvidia\.com/mig-(?P<compute>[0-9]+)g\.(?P<memory>[0-9]+)gb$")
_frac_gpu_injector_re = re.compile(r"^clearml-injector/fraction$")
_gpu_name_to_memory_gb = {
"A30": 24,
"NVIDIA A30": 24,
"A100-SXM4-40GB": 40,
"NVIDIA-A100-40GB-PCIe": 40,
"NVIDIA A100-40GB-PCIe": 40,
"NVIDIA-A100-SXM4-40GB": 40,
"NVIDIA A100-SXM4-40GB": 40,
"NVIDIA-A100-SXM4-80GB": 79,
"NVIDIA A100-SXM4-80GB": 79,
"NVIDIA-A100-80GB-PCIe": 79,
"NVIDIA A100-80GB-PCIe": 79,
}
def __init__(self):
self._total_memory_gb = [
self._gpu_name_to_memory_gb.get(name, 0)
for name in (self._get_gpu_names() or [])
]
self._fractions = self._get_fractions()
@property
def fractions(self) -> List[float]:
return self._fractions
def _get_fractions(self) -> List[float]:
if not self._total_memory_gb:
# Can't compute
return [1.0]
fractions = (ENV_GPU_FRACTIONS.get() or "").strip()
if not fractions:
# No fractions
return [1.0]
decoded_fractions = self.decode_fractions(fractions)
if isinstance(decoded_fractions, list):
return decoded_fractions
totals = []
for i, (fraction, count) in enumerate(decoded_fractions.items()):
m = self._mig_re.match(fraction)
if not m:
continue
try:
total_gb = self._total_memory_gb[i] if i < len(self._total_memory_gb) else self._total_memory_gb[-1]
if not total_gb:
continue
totals.append((int(m.group("memory")) * count) / total_gb)
except ValueError:
pass
if not totals:
log.warning("Fractions count is empty for {}".format(fractions))
return [1.0]
return totals
@classmethod
def extract_custom_limits(cls, limits: dict):
for k, v in list((limits or {}).items()):
if cls._number_re.match(k):
limits.pop(k, None)
@classmethod
def get_simple_fractions_total(cls, limits: dict) -> float:
try:
if any(cls._number_re.match(x) for x in limits):
return sum(float(v) for k, v in limits.items() if cls._number_re.match(k))
except Exception as ex:
log.error("Failed summing up fractions from {}: {}".format(limits, ex))
return 0
@classmethod
def encode_fractions(cls, limits: dict, annotations: dict) -> str:
if limits:
if any(cls._number_re.match(x) for x in (limits or {})):
return ",".join(str(v) for k, v in sorted(limits.items()) if cls._number_re.match(k))
return ",".join(("{}:{}".format(k, v) for k, v in (limits or {}).items() if cls._mig_re.match(k)))
elif annotations:
if any(cls._frac_gpu_injector_re.match(x) for x in (annotations or {})):
return ",".join(str(v) for k, v in sorted(annotations.items()) if cls._frac_gpu_injector_re.match(k))
@staticmethod
def decode_fractions(fractions: str) -> Union[List[float], Dict[str, int]]:
try:
items = [f.strip() for f in fractions.strip().split(",")]
tuples = [(k.strip(), v.strip()) for k, v in (f.partition(":")[::2] for f in items)]
if all(not v for _, v in tuples):
# comma-separated float fractions
return [float(k) for k, _ in tuples]
# comma-separated slice:count items
return {
k.strip(): int(v.strip())
for k, v in tuples
}
except Exception as ex:
log.error("Failed decoding GPU fractions '{}': {}".format(fractions, ex))
return {}
@staticmethod
def _get_gpu_names():
# noinspection PyBroadException
try:
gpus = gpustat.new_query().gpus
names = [g["name"] for g in gpus]
print("GPU names: {}".format(names))
return names
except Exception as ex:
log.error("Failed getting GPU names: {}".format(ex))
def report_error(ex):
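
The parsing rule in `decode_fractions` and the MIG branch of `_get_fractions` above can be restated as a standalone sketch. This is not the agent's code: `mig_fractions` and its single `total_gb` argument are hypothetical simplifications (the diff indexes a per-GPU memory list instead), and errors propagate here rather than being logged.

```python
import re
from typing import Dict, List, Union

# Same pattern as GpuFractionsHandler._mig_re in the diff above
MIG_RE = re.compile(r"^nvidia\.com/mig-(?P<compute>[0-9]+)g\.(?P<memory>[0-9]+)gb$")


def decode_fractions(fractions: str) -> Union[List[float], Dict[str, int]]:
    """Parse either "0.5,0.25" (floats) or "slice:count,slice:count" (MIG slices)."""
    items = [f.strip() for f in fractions.strip().split(",")]
    # partition(":") returns (key, separator, value); [::2] keeps key and value
    tuples = [(k.strip(), v.strip()) for k, v in (f.partition(":")[::2] for f in items)]
    if all(not v for _, v in tuples):
        # no ":" anywhere, so every item is a plain float fraction
        return [float(k) for k, _ in tuples]
    return {k: int(v) for k, v in tuples}


def mig_fractions(decoded: Dict[str, int], total_gb: float) -> List[float]:
    """Hypothetical helper: fraction of GPU memory requested per MIG slice type."""
    totals = []
    for slice_name, count in decoded.items():
        m = MIG_RE.match(slice_name)
        if not m:
            continue  # not a MIG resource name, ignore it
        totals.append(int(m.group("memory")) * count / total_gb)
    return totals


if __name__ == "__main__":
    print(decode_fractions("0.5,0.25"))
    print(mig_fractions(decode_fractions("nvidia.com/mig-1g.5gb:2"), total_gb=40))
```

Two 5 GB slices on a 40 GB card give (5 * 2) / 40 = 0.25, which is the value the monitor would report as `gpu_fraction_*` under these assumptions.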

View File

@@ -44,6 +44,11 @@ WORKER_ARGS = {
}
DAEMON_ARGS = dict({
'--polling-interval': {
'help': 'Polling interval in seconds. Minimum is 5 (default 5)',
'type': int,
'default': 5,
},
'--foreground': {
'help': 'Pipe full log to stdout/stderr, should not be used if running in background',
'action': 'store_true',
@@ -62,7 +67,10 @@ DAEMON_ARGS = dict({
'group': 'Docker support',
},
'--queue': {
'help': 'Queue ID(s)/Name(s) to pull tasks from (\'default\' queue)',
'help': 'Queue ID(s)/Name(s) to pull tasks from (\'default\' queue).'
' Note that the queue list order determines priority, with the first listed queue having the'
' highest priority. To change this behavior, use --order-fairness to pull from each queue in a'
' round-robin order',
'nargs': '+',
'default': tuple(),
'dest': 'queues',
@@ -107,8 +115,11 @@ DAEMON_ARGS = dict({
'--dynamic-gpus': {
'help': 'Allow to dynamically allocate gpus based on queue properties, '
'configure with \'--queue <queue_name>=<num_gpus>\'.'
' Example: \'--dynamic-gpus --gpus 0-3 --queue dual_gpus=2 single_gpu=1\''
' Example Opportunistic: \'--dynamic-gpus --gpus 0-3 --queue dual_gpus=2 max_quad_gpus=1-4 \'',
' Example: \'--dynamic-gpus --gpus 0-3 --queue dual_gpus=2 single_gpu=1\'.'
' Example Opportunistic: \'--dynamic-gpus --gpus 0-3 --queue dual_gpus=2 max_quad_gpus=1-4\'.'
' Note that the queue list order determines priority, with the first listed queue having the'
' highest priority. To change this behavior, use --order-fairness to pull from each queue in a'
' round-robin order',
'action': 'store_true',
},
'--uptime': {
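
The two pulling orders described in the new `--queue` help text can be illustrated with a small simulation. This is only a sketch of the documented semantics, not the daemon's implementation: `pull_by_priority` and `pull_round_robin` are hypothetical names, and queues are modelled as in-memory deques rather than server-side queues.

```python
from collections import deque
from typing import Deque, Dict, List


def pull_by_priority(queues: Dict[str, Deque[str]]) -> List[str]:
    """Default order: always take from the first listed queue that has a task."""
    order = []
    while any(queues.values()):
        for tasks in queues.values():
            if tasks:
                order.append(tasks.popleft())
                break
    return order


def pull_round_robin(queues: Dict[str, Deque[str]]) -> List[str]:
    """--order-fairness style: resume scanning after the queue served last."""
    order = []
    names = list(queues)
    start = 0
    while any(queues.values()):
        for offset in range(len(names)):
            idx = (start + offset) % len(names)
            tasks = queues[names[idx]]
            if tasks:
                order.append(tasks.popleft())
                start = (idx + 1) % len(names)
                break
    return order


def make_queues() -> Dict[str, Deque[str]]:
    # dict insertion order stands in for the order given to --queue
    return {"high": deque(["h1", "h2"]), "low": deque(["l1", "l2"])}


if __name__ == "__main__":
    print(pull_by_priority(make_queues()))
    print(pull_round_robin(make_queues()))
```

With priority ordering the `low` queue is only served once `high` is empty; with round-robin the two queues alternate.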

View File

@@ -4,6 +4,7 @@ import json
import logging
import os
import platform
import re
import sys
from copy import deepcopy
from typing import Any, Callable
@@ -19,7 +20,7 @@ from clearml_agent.definitions import ENVIRONMENT_CONFIG, ENV_TASK_EXECUTE_AS_US
from clearml_agent.errors import APIError
from clearml_agent.helper.base import HOCONEncoder
from clearml_agent.helper.process import Argv
from clearml_agent.helper.docker_args import DockerArgsSanitizer
from clearml_agent.helper.docker_args import DockerArgsSanitizer, sanitize_urls
from .version import __version__
POETRY = "poetry"
@@ -240,38 +241,49 @@ class Session(_Session):
except:
pass
def print_configuration(
self,
remove_secret_keys=("secret", "pass", "token", "account_key", "contents"),
skip_value_keys=("environment", ),
docker_args_sanitize_keys=("extra_docker_arguments", ),
):
def print_configuration(self):
def load_config(key, default):
return [re.compile(x) for x in self.config.get(f"agent.sanitize_config_printout.{key}", default=default)]
dont_hide_secret_keys = load_config("dont_hide_secrets", ("^enable_git_ask_pass$",))
hide_secret_keys = load_config("hide_secrets", ("secret", "pass", "token", "account_key", "contents"))
hide_secret_section_keys = load_config("hide_secrets_recursive", ("^environment$",))
docker_cmd_keys = load_config("docker_commands", ("^extra_docker_arguments$",))
urls_keys = load_config("urls", ("^extra_index_url$",))
# remove all the secrets from the print
def recursive_remove_secrets(dictionary, secret_keys=(), empty_keys=()):
def recursive_remove_secrets(dictionary):
for k in list(dictionary):
for s in secret_keys:
if s in k:
dictionary.pop(k)
break
for s in empty_keys:
if s == k:
if not any(r.search(k) for r in dont_hide_secret_keys):
if any(r.search(k) for r in hide_secret_keys):
dictionary[k] = '****'
continue
if any(r.search(k) for r in hide_secret_section_keys):
dictionary[k] = {key: '****' for key in dictionary[k]} \
if isinstance(dictionary[k], dict) else '****'
break
continue
if any(r.search(k) for r in urls_keys):
value = dictionary.get(k, None)
if isinstance(value, str):
dictionary[k] = sanitize_urls(value)[0]
elif isinstance(value, (list, tuple)):
dictionary[k] = [sanitize_urls(v)[0] for v in value]
elif isinstance(value, dict):
dictionary[k] = {k_: sanitize_urls(v)[0] for k_, v in value.items()}
if isinstance(dictionary.get(k, None), dict):
recursive_remove_secrets(dictionary[k], secret_keys=secret_keys, empty_keys=empty_keys)
recursive_remove_secrets(dictionary[k])
elif isinstance(dictionary.get(k, None), (list, tuple)):
if k in (docker_args_sanitize_keys or []):
if any(r.match(k) for r in docker_cmd_keys):
dictionary[k] = DockerArgsSanitizer.sanitize_docker_command(self, dictionary[k])
for item in dictionary[k]:
if isinstance(item, dict):
recursive_remove_secrets(item, secret_keys=secret_keys, empty_keys=empty_keys)
recursive_remove_secrets(item)
config = deepcopy(self.config.to_dict())
# remove the env variable, it's not important
config.pop('env', None)
if remove_secret_keys or skip_value_keys or docker_args_sanitize_keys:
recursive_remove_secrets(config, secret_keys=remove_secret_keys, empty_keys=skip_value_keys)
if hide_secret_keys or hide_secret_section_keys or docker_cmd_keys or urls_keys:
recursive_remove_secrets(config)
# remove logging.loggers.urllib3.level from the print
try:
config['logging']['loggers']['urllib3'].pop('level', None)
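
The regex-driven masking introduced in `print_configuration` can be tried in isolation with the sketch below. It reuses the default patterns shown in the diff but is a simplified stand-in: `mask_secrets` is a hypothetical name, and the URL and docker-argument sanitizing steps are omitted because `sanitize_urls` and `DockerArgsSanitizer` are not shown in this hunk.

```python
import re
from copy import deepcopy

# Default patterns copied from the diff above
DONT_HIDE = [re.compile(p) for p in ("^enable_git_ask_pass$",)]
HIDE = [re.compile(p) for p in ("secret", "pass", "token", "account_key", "contents")]
HIDE_RECURSIVE = [re.compile(p) for p in ("^environment$",)]


def mask_secrets(dictionary: dict) -> dict:
    """Mask secret-looking keys in place, recursing into nested dicts and lists."""
    for k in list(dictionary):
        if not any(r.search(k) for r in DONT_HIDE):
            if any(r.search(k) for r in HIDE):
                dictionary[k] = "****"
                continue
            if any(r.search(k) for r in HIDE_RECURSIVE):
                # keep the key names of the section, hide every value
                dictionary[k] = (
                    {key: "****" for key in dictionary[k]}
                    if isinstance(dictionary[k], dict) else "****"
                )
                continue
        value = dictionary[k]
        if isinstance(value, dict):
            mask_secrets(value)
        elif isinstance(value, (list, tuple)):
            for item in value:
                if isinstance(item, dict):
                    mask_secrets(item)
    return dictionary


if __name__ == "__main__":
    config = {
        "api": {"credentials": {"access_key": "abc", "secret_key": "xyz"}},
        "agent": {
            "enable_git_ask_pass": True,
            "git_pass": "hunter2",
            "environment": {"FOO": "bar"},
        },
    }
    print(mask_secrets(deepcopy(config)))
```

Note that `enable_git_ask_pass` contains the substring `pass` yet survives unmasked, which is the case the `dont_hide_secrets` list exists for.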

View File

@@ -1 +1 @@
__version__ = '1.5.2'
__version__ = '1.9.1'

View File

@@ -0,0 +1,37 @@
#!/bin/bash
# Check if image name and Dockerfile path are provided
if [ -z "$1" ] || [ -z "$2" ]; then
echo "Usage: $0 <image_name> <dockerfile_path> <build_context>"
exit 1
fi
# Build the Docker image
image_name=$1
dockerfile_path=$2
build_context=$3
if [ $build_context == "glue-build-aws" ] || [ $build_context == "glue-build-gcp" ]; then
if [ ! -f $build_context/clearml.conf ]; then
cp build-resources/clearml.conf $build_context
fi
if [ ! -f $build_context/entrypoint.sh ]; then
cp build-resources/entrypoint.sh $build_context
chmod +x $build_context/entrypoint.sh
fi
if [ ! -f $build_context/setup.sh ]; then
cp build-resources/setup.sh $build_context
chmod +x $build_context/setup.sh
fi
fi
cp ../../examples/k8s_glue_example.py $build_context
docker build -f $dockerfile_path -t $image_name $build_context
# cleanup
if [ $build_context == "glue-build-aws" ] || [ $build_context == "glue-build-gcp" ]; then
rm $build_context/clearml.conf
rm $build_context/entrypoint.sh
rm $build_context/setup.sh
fi
rm $build_context/k8s_glue_example.py

View File

@@ -58,7 +58,7 @@ agent {
type: pip,
# specify pip version to use (examples "<20.2", "==19.3.1", "", empty string will install the latest version)
pip_version: "<21",
pip_version: ["<20.2 ; python_version < '3.10'", "<22.3 ; python_version >= '3.10'"],
# virtual environment inherits packages from system
system_site_packages: false,
@@ -171,7 +171,7 @@ agent {
default_docker: {
# default docker image to use when running in docker mode
image: "nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04"
image: "nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04"
# optional arguments to pass to docker image
# arguments: ["--ipc=host", ]
@@ -224,7 +224,7 @@ sdk {
storage {
cache {
# Defaults to system temp folder / cache
# Defaults to <system_temp_folder>/clearml_cache
default_base_dir: "~/.clearml/cache"
size {
# max_used_bytes = -1
@@ -361,7 +361,7 @@ sdk {
vcs_repo_detect_async: true
# Store uncommitted git/hg source code diff in experiment manifest when training in development mode
# This stores "git diff" or "hg diff" into the experiment's "script.requirements.diff" section
# This stores "git diff" into the experiment's "script.requirements.diff" section
store_uncommitted_code_diff: true
# Support stopping an experiment in case it was externally stopped, status was changed or task was reset

View File

@@ -33,4 +33,9 @@ echo "api.files_server: ${CLEARML_FILES_HOST}" >> ~/clearml.conf
./provider_entrypoint.sh
python3 k8s_glue_example.py --queue ${QUEUE} --max-pods ${MAX_PODS} ${EXTRA_ARGS}
if [[ -z "${K8S_GLUE_MAX_PODS}" ]]
then
python3 k8s_glue_example.py --queue ${QUEUE} ${EXTRA_ARGS}
else
python3 k8s_glue_example.py --queue ${QUEUE} --max-pods ${MAX_PODS} ${EXTRA_ARGS}
fi

View File

@@ -1,94 +0,0 @@
"""
This example assumes you have preconfigured services with selectors in the form of
"ai.allegro.agent.serial=pod-<number>" and a targetPort of 10022.
The K8sIntegration component will label each pod accordingly.
"""
from argparse import ArgumentParser
from clearml_agent.glue.k8s import K8sIntegration
def parse_args():
parser = ArgumentParser()
group = parser.add_mutually_exclusive_group()
parser.add_argument(
"--queue", type=str, help="Queue to pull tasks from"
)
group.add_argument(
"--ports-mode", action='store_true', default=False,
help="Ports-Mode will add a label to the pod which can be used as service, in order to expose ports"
"Should not be used with max-pods"
)
parser.add_argument(
"--num-of-services", type=int, default=20,
help="Specify the number of k8s services to be used. Use only with ports-mode."
)
parser.add_argument(
"--base-port", type=int,
help="Used in conjunction with ports-mode, specifies the base port exposed by the services. "
"For pod #X, the port will be <base-port>+X. Note that pod number is calculated based on base-pod-num"
"e.g. if base-port=20000 and base-pod-num=3, the port for the first pod will be 20003"
)
parser.add_argument(
"--base-pod-num", type=int, default=1,
help="Used in conjunction with ports-mode and base-port, specifies the base pod number to be used by the "
"service (default: %(default)s)"
)
parser.add_argument(
"--gateway-address", type=str, default=None,
help="Used in conjunction with ports-mode, specify the external address of the k8s ingress / ELB"
)
parser.add_argument(
"--pod-clearml-conf", type=str,
help="Configuration file to be used by the pod itself (if not provided, current configuration is used)"
)
parser.add_argument(
"--overrides-yaml", type=str,
help="YAML file containing pod overrides to be used when launching a new pod"
)
parser.add_argument(
"--template-yaml", type=str,
help="YAML file containing pod template. If provided pod will be scheduled with kubectl apply "
"and overrides are ignored, otherwise it will be scheduled with kubectl run"
)
parser.add_argument(
"--ssh-server-port", type=int, default=0,
help="If non-zero, every pod will also start an SSH server on the selected port (default: zero, not active)"
)
parser.add_argument(
"--namespace", type=str,
help="Specify the namespace in which pods will be created (default: %(default)s)", default="clearml"
)
group.add_argument(
"--max-pods", type=int,
help="Limit the maximum number of pods that this service can run at the same time."
"Should not be used with ports-mode"
)
return parser.parse_args()
def main():
args = parse_args()
user_props_cb = None
if args.ports_mode and args.base_port:
def k8s_user_props_cb(pod_number=0):
user_prop = {"k8s-pod-port": args.base_port + pod_number}
if args.gateway_address:
user_prop["k8s-gateway-address"] = args.gateway_address
return user_prop
user_props_cb = k8s_user_props_cb
k8s = K8sIntegration(
ports_mode=args.ports_mode, num_of_services=args.num_of_services, base_pod_num=args.base_pod_num,
user_props_cb=user_props_cb, overrides_yaml=args.overrides_yaml, clearml_conf_file=args.pod_clearml_conf,
template_yaml=args.template_yaml, extra_bash_init_script=K8sIntegration.get_ssh_server_bash(
ssh_port_number=args.ssh_server_port) if args.ssh_server_port else None,
namespace=args.namespace, max_pods_limit=args.max_pods or None,
)
k8s.k8s_daemon(args.queue)
if __name__ == "__main__":
main()

View File

@@ -2,13 +2,17 @@
chmod +x /root/entrypoint.sh
apt-get update -y
apt-get dist-upgrade -y
apt-get install -y curl unzip less locales
apt-get update -qqy
apt-get dist-upgrade -qqy
apt-get install -qqy curl unzip less locales
locale-gen en_US.UTF-8
apt-get install -y curl python3-pip git
apt-get update -qqy
apt-get install -qqy curl gcc python3-dev python3-pip apt-transport-https lsb-release openssh-client git gnupg
rm -rf /var/lib/apt/lists/*
apt clean
python3 -m pip install -U pip
python3 -m pip install clearml-agent
python3 -m pip install -U "cryptography>=2.9"
python3 -m pip install --no-cache-dir clearml-agent
python3 -m pip install -U --no-cache-dir "cryptography>=2.9"

View File

@@ -1,4 +1,4 @@
FROM ubuntu:18.04
FROM ubuntu:22.04
USER root
WORKDIR /root
@@ -8,15 +8,16 @@ ENV LANG=en_US.UTF-8
ENV LANGUAGE=en_US.UTF-8
ENV PYTHONIOENCODING=UTF-8
COPY ../build-resources/setup.sh /root/setup.sh
COPY ./setup.sh /root/setup.sh
RUN /root/setup.sh
COPY ./setup_aws.sh /root/setup_aws.sh
RUN /root/setup_aws.sh
RUN chmod +x /root/setup_aws.sh && /root/setup_aws.sh
COPY ../build-resources/entrypoint.sh /root/entrypoint.sh
COPY ./entrypoint.sh /root/entrypoint.sh
COPY ./provider_entrypoint.sh /root/provider_entrypoint.sh
COPY ./build-resources/k8s_glue_example.py /root/k8s_glue_example.py
RUN chmod +x /root/provider_entrypoint.sh
COPY ./k8s_glue_example.py /root/k8s_glue_example.py
COPY ./clearml.conf /root/clearml.conf
ENTRYPOINT ["/root/entrypoint.sh"]

View File

@@ -4,7 +4,8 @@ curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip
unzip awscliv2.zip
./aws/install
curl -o kubectl https://amazon-eks.s3-us-west-2.amazonaws.com/1.21.2/2021-07-05/bin/linux/amd64/kubectl
curl -o kubectl https://s3.us-west-2.amazonaws.com/amazon-eks/1.29.3/2024-04-19/bin/linux/amd64/kubectl
#curl -o kubectl https://amazon-eks.s3-us-west-2.amazonaws.com/1.21.2/2021-07-05/bin/linux/amd64/kubectl
#curl -o kubectl https://amazon-eks.s3.us-west-2.amazonaws.com/1.19.6/2021-01-05/bin/linux/amd64/kubectl
chmod +x ./kubectl && mkdir -p $HOME/bin && cp ./kubectl $HOME/bin/kubectl && export PATH=$PATH:$HOME/bin

View File

@@ -1,4 +1,4 @@
FROM ubuntu:18.04
FROM ubuntu:22.04
USER root
WORKDIR /root
@@ -8,15 +8,15 @@ ENV LANG=en_US.UTF-8
ENV LANGUAGE=en_US.UTF-8
ENV PYTHONIOENCODING=UTF-8
COPY ../build-resources/setup.sh /root/setup.sh
COPY ./setup.sh /root/setup.sh
RUN /root/setup.sh
COPY ./setup_gcp.sh /root/setup_gcp.sh
RUN /root/setup_gcp.sh
RUN chmod +x /root/setup_gcp.sh && /root/setup_gcp.sh
COPY ../build-resources/entrypoint.sh /root/entrypoint.sh
COPY ./entrypoint.sh /root/entrypoint.sh
COPY ./provider_entrypoint.sh /root/provider_entrypoint.sh
COPY ./build-resources/k8s_glue_example.py /root/k8s_glue_example.py
COPY ./k8s_glue_example.py /root/k8s_glue_example.py
COPY ./clearml.conf /root/clearml.conf
ENTRYPOINT ["/root/entrypoint.sh"]

View File

@@ -1,6 +1,6 @@
#!/bin/bash
curl -LO https://dl.k8s.io/release/v1.21.0/bin/linux/amd64/kubectl
curl -LO https://dl.k8s.io/release/v1.29.3/bin/linux/amd64/kubectl
install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl

View File

@@ -1,4 +1,4 @@
ARG TAG=3.7.12-alpine3.15
ARG TAG=3.7.17-alpine3.18
FROM python:${TAG} as build
@@ -20,7 +20,7 @@ FROM python:${TAG} as target
WORKDIR /app
ARG KUBECTL_VERSION=1.22.4
ARG KUBECTL_VERSION=1.29.3
# Not sure about these ENV vars
# ENV LC_ALL=en_US.UTF-8

View File

@@ -1,8 +1,8 @@
ARG TAG=3.7.12-slim-bullseye
ARG TAG=3.7.17-slim-bullseye
FROM python:${TAG} as target
ARG KUBECTL_VERSION=1.22.4
ARG KUBECTL_VERSION=1.29.3
WORKDIR /app

View File

@@ -1,94 +0,0 @@
"""
This example assumes you have preconfigured services with selectors in the form of
"ai.allegro.agent.serial=pod-<number>" and a targetPort of 10022.
The K8sIntegration component will label each pod accordingly.
"""
from argparse import ArgumentParser
from clearml_agent.glue.k8s import K8sIntegration
def parse_args():
parser = ArgumentParser()
group = parser.add_mutually_exclusive_group()
parser.add_argument(
"--queue", type=str, help="Queue to pull tasks from"
)
group.add_argument(
"--ports-mode", action='store_true', default=False,
help="Ports-Mode will add a label to the pod which can be used as service, in order to expose ports"
"Should not be used with max-pods"
)
parser.add_argument(
"--num-of-services", type=int, default=20,
help="Specify the number of k8s services to be used. Use only with ports-mode."
)
parser.add_argument(
"--base-port", type=int,
help="Used in conjunction with ports-mode, specifies the base port exposed by the services. "
"For pod #X, the port will be <base-port>+X. Note that pod number is calculated based on base-pod-num"
"e.g. if base-port=20000 and base-pod-num=3, the port for the first pod will be 20003"
)
parser.add_argument(
"--base-pod-num", type=int, default=1,
help="Used in conjunction with ports-mode and base-port, specifies the base pod number to be used by the "
"service (default: %(default)s)"
)
parser.add_argument(
"--gateway-address", type=str, default=None,
help="Used in conjunction with ports-mode, specify the external address of the k8s ingress / ELB"
)
parser.add_argument(
"--pod-clearml-conf", type=str,
help="Configuration file to be used by the pod itself (if not provided, current configuration is used)"
)
parser.add_argument(
"--overrides-yaml", type=str,
help="YAML file containing pod overrides to be used when launching a new pod"
)
parser.add_argument(
"--template-yaml", type=str,
help="YAML file containing pod template. If provided pod will be scheduled with kubectl apply "
"and overrides are ignored, otherwise it will be scheduled with kubectl run"
)
parser.add_argument(
"--ssh-server-port", type=int, default=0,
help="If non-zero, every pod will also start an SSH server on the selected port (default: zero, not active)"
)
parser.add_argument(
"--namespace", type=str,
help="Specify the namespace in which pods will be created (default: %(default)s)", default="clearml"
)
group.add_argument(
"--max-pods", type=int,
help="Limit the maximum number of pods that this service can run at the same time."
"Should not be used with ports-mode"
)
return parser.parse_args()
def main():
args = parse_args()
user_props_cb = None
if args.ports_mode and args.base_port:
def k8s_user_props_cb(pod_number=0):
user_prop = {"k8s-pod-port": args.base_port + pod_number}
if args.gateway_address:
user_prop["k8s-gateway-address"] = args.gateway_address
return user_prop
user_props_cb = k8s_user_props_cb
k8s = K8sIntegration(
ports_mode=args.ports_mode, num_of_services=args.num_of_services, base_pod_num=args.base_pod_num,
user_props_cb=user_props_cb, overrides_yaml=args.overrides_yaml, clearml_conf_file=args.pod_clearml_conf,
template_yaml=args.template_yaml, extra_bash_init_script=K8sIntegration.get_ssh_server_bash(
ssh_port_number=args.ssh_server_port) if args.ssh_server_port else None,
namespace=args.namespace, max_pods_limit=args.max_pods or None,
)
k8s.k8s_daemon(args.queue)
if __name__ == "__main__":
main()

View File

@@ -1,4 +1,4 @@
FROM ubuntu:18.04
FROM ubuntu:22.04
USER root
WORKDIR /root

View File

@@ -33,4 +33,10 @@ if [ -z "$CLEARML_AGENT_NO_UPDATE" ]; then
fi
fi
clearml-agent daemon $DAEMON_OPTIONS --queue $QUEUES --docker "${CLEARML_AGENT_DEFAULT_BASE_DOCKER:-$TRAINS_AGENT_DEFAULT_BASE_DOCKER}" --cpu-only ${CLEARML_AGENT_EXTRA_ARGS:-$TRAINS_AGENT_EXTRA_ARGS}
DOCKER_ARGS="--docker \"${CLEARML_AGENT_DEFAULT_BASE_DOCKER:-$TRAINS_AGENT_DEFAULT_BASE_DOCKER}\""
if [ -n "$CLEARML_AGENT_NO_DOCKER" ]; then
DOCKER_ARGS=""
fi
clearml-agent daemon $DAEMON_OPTIONS --queue $QUEUES $DOCKER_ARGS --cpu-only ${CLEARML_AGENT_EXTRA_ARGS:-$TRAINS_AGENT_EXTRA_ARGS}
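
The effect of the new `CLEARML_AGENT_NO_DOCKER` switch on the final command line can be sketched as follows. This is an illustrative restatement of the shell logic above, not part of the entrypoint: `build_daemon_argv` is a hypothetical helper, and it assumes the environment is passed in as a plain mapping.

```python
import shlex
from typing import List, Mapping


def build_daemon_argv(env: Mapping[str, str]) -> List[str]:
    """Assemble the daemon command the way the entrypoint does, as an argv list."""

    def first_set(*names: str) -> str:
        # mimics ${A:-$B}: first variable that is set and non-empty
        for name in names:
            value = env.get(name)
            if value:
                return value
        return ""

    argv = ["clearml-agent", "daemon"]
    argv += shlex.split(env.get("DAEMON_OPTIONS", ""))
    argv += ["--queue"] + shlex.split(env.get("QUEUES", ""))
    if not env.get("CLEARML_AGENT_NO_DOCKER"):
        # any non-empty value of CLEARML_AGENT_NO_DOCKER drops the --docker flag
        image = first_set("CLEARML_AGENT_DEFAULT_BASE_DOCKER", "TRAINS_AGENT_DEFAULT_BASE_DOCKER")
        argv += ["--docker", image]
    argv.append("--cpu-only")
    argv += shlex.split(first_set("CLEARML_AGENT_EXTRA_ARGS", "TRAINS_AGENT_EXTRA_ARGS"))
    return argv


if __name__ == "__main__":
    base = {"QUEUES": "services", "CLEARML_AGENT_DEFAULT_BASE_DOCKER": "ubuntu:22.04"}
    print(build_daemon_argv(base))
    print(build_daemon_argv({**base, "CLEARML_AGENT_NO_DOCKER": "1"}))
```

Building an argv list sidesteps the quoting question that arises when a flag and its quoted value are stored in one shell string.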

View File

@@ -58,8 +58,8 @@ agent {
# it solves passing user/token to git submodules.
# this is a safer way to ensure multiple users using the same repository will
# not accidentally leak credentials
# Only supported on Linux systems, it will be the default in future releases
# enable_git_ask_pass: false
# Note: this is only supported on Linux systems
# enable_git_ask_pass: true
# in docker mode, if container's entrypoint automatically activated a virtual environment
# use the activated virtual environment and install everything there
@@ -93,25 +93,50 @@ agent {
# extra_index_url: ["https://allegroai.jfrog.io/clearml/api/pypi/public/simple"]
extra_index_url: []
# additional flags to use when calling pip install, example: ["--use-deprecated=legacy-resolver", ]
# extra_pip_install_flags: []
# control the pytorch wheel resolving algorithm, options are: "pip", "direct", "none"
# Override with environment variable CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE
# "pip" (default): would automatically detect the cuda version, and supply pip with the correct
# extra-index-url, based on pytorch.org tables
# "direct": would resolve a direct link to the pytorch wheel by parsing the pytorch.org pip repository
# and matching the automatically detected cuda version with the required pytorch wheel.
# if the exact cuda version is not found for the required pytorch wheel, it will try
# a lower cuda version until a match is found
# "none": No resolver used, install pytorch like any other package
# pytorch_resolve: "pip"
# additional conda channels to use when installing with conda package manager
conda_channels: ["pytorch", "conda-forge", "defaults", ]
conda_channels: ["pytorch", "conda-forge", "nvidia", "defaults", ]
# conda_full_env_update: false
# notice this will not install any additional packages into the selected environment, should be used in
# conjunction with CLEARML_CONDA_ENV_PACKAGE which points to an existing conda environment directory
# conda_env_as_base_docker: false
# install into base conda environment
# (should only be used if running in docker mode, because it will change the conda base environment)
# use_conda_base_env: false
# set the priority packages to be installed before the rest of the required packages
# Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
# priority_packages: ["cython", "numpy", "setuptools", ]
# set the optional priority packages to be installed before the rest of the required packages,
# In case a package installation fails, the package will be ignored,
# and the virtual environment process will continue
# Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
# priority_optional_packages: ["pygobject", ]
# set the post packages to be installed after all the rest of the required packages
# Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
# post_packages: ["horovod", ]
# set the optional post packages to be installed after all the rest of the required packages,
# In case a package installation fails, the package will be ignored,
# and the virtual environment process will continue
# Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
# post_optional_packages: []
# set to True to support torch nightly build installation,
@@ -136,6 +161,9 @@ agent {
vcs_cache: {
enabled: true,
path: ~/.clearml/vcs-cache
# if git pull failed, always revert to re-cloning the repo, it protects against old user name changes
# clone_on_pull_fail: false
},
# DEPRECATED: please use `venvs_cache` and set `venvs_cache.path`
@@ -168,8 +196,16 @@ agent {
# optional arguments to pass to docker image
# these are local for this agent and will not be updated in the experiment's docker_cmd section
# You can also pass host environments into the container with ["-e", "HOST_NAME=$HOST_NAME"]
# extra_docker_arguments: ["--ipc=host", "-v", "/mnt/host/data:/mnt/data"]
# Allow the extra docker arg to override task level docker arg (if the same argument is passed on both),
# if set to False, a task docker arg will override the docker extra arg
# docker_args_extra_precedes_task: true
# prevent task docker args from being used if already specified in the extra_docker_arguments
# protected_docker_extra_args: ["privileged", "security-opt", "network", "ipc"]
# optional shell script to run in docker when started before the experiment is started
# extra_docker_shell_script: ["apt-get install -y bindfs", ]
@@ -178,13 +214,19 @@ agent {
# change to false to skip installation and decrease docker spin up time
# docker_install_opencv_libs: true
# Allow passing host environments into docker container with Task's docker container args
# Example "-e HOST_NAME=$HOST_NAME"
# NOTICE this might introduce security risk allowing access to keys/secret on the host machine
# Use with care!
# docker_allow_host_environ: false
# set to true in order to force "docker pull" before running an experiment using a docker image.
# This makes sure the docker image is updated.
docker_force_pull: false
default_docker: {
# default docker image to use when running in docker mode
image: "nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04"
image: "nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04"
# optional arguments to pass to docker image
# arguments: ["--ipc=host"]
@@ -194,7 +236,7 @@ agent {
# enterprise version only
# match_rules: [
# {
# image: "nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04"
# image: "nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04"
# arguments: "-e define=value"
# match: {
# script{
@@ -215,7 +257,7 @@ agent {
# }
# },
# {
# image: "nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04"
# image: "nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04"
# arguments: "-e define=value"
# match: {
# # must match all requirements (not partial)
@@ -228,8 +270,6 @@ agent {
# # no repository matching required
# repository: ""
# }
# # no container image matching required (allow to replace one requested container with another)
# container: ""
# # no repository matching required
# project: ""
# }
@@ -264,9 +304,11 @@ agent {
# sdk_cache: "/clearml_agent_cache"
# apt_cache: "/var/cache/apt/archives"
# ssh_folder: "/root/.ssh"
# ssh_ro_folder: "/.ssh"
# pip_cache: "/root/.cache/pip"
# poetry_cache: "/root/.cache/pypoetry"
# vcs_cache: "/root/.clearml/vcs-cache"
# venvs_cache: "/root/.clearml/venvs-cache"
# venv_build: "~/.clearml/venvs-builds"
# pip_download: "/root/.clearml/pip-download-cache"
# }
@@ -283,7 +325,7 @@ sdk {
storage {
cache {
# Defaults to system temp folder / cache
# Defaults to <system_temp_folder>/clearml_cache
default_base_dir: "~/.clearml/cache"
}
@@ -421,7 +463,7 @@ sdk {
vcs_repo_detect_async: True
# Store uncommitted git/hg source code diff in experiment manifest when training in development mode
# This stores "git diff" or "hg diff" into the experiment's "script.requirements.diff" section
# This stores "git diff" into the experiment's "script.requirements.diff" section
store_uncommitted_code_diff_on_train: True
# Support stopping an experiment in case it was externally stopped, status was changed or task was reset
@@ -469,7 +511,8 @@ sdk {
# target_format: format used to encode contents before writing into the target file. Supported values are json,
# yaml, yml and bytes (in which case the file will be written in binary mode). Default is text mode.
# overwrite: overwrite the target file in case it exists. Default is true.
#
# mode: file-system mode to be applied to the file after its creation. The mode string will be parsed into an
# integer (e.g. "0o777" for -rwxrwxrwx)
# Example:
# files {
# myfile1 {
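
The new `mode` option takes a string such as "0o777". How the agent parses it is not shown in this hunk; the sketch below assumes a plain Python `int()` conversion that honours the `0o` prefix, and `parse_mode` is a hypothetical name.

```python
import stat


def parse_mode(mode: str) -> int:
    """Convert a mode string such as "0o777" into an integer permission mask."""
    # base 0 tells int() to interpret the "0o" prefix as octal
    return int(mode, 0)


if __name__ == "__main__":
    mode = parse_mode("0o777")
    # S_IFREG marks a regular file so filemode() renders the leading "-"
    print(mode, stat.filemode(stat.S_IFREG | mode))
```

The resulting integer is the kind of value `os.chmod` expects for its `mode` argument.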

Binary file not shown (image; size before: 1.1 MiB, after: 1018 KiB).

View File

@@ -146,7 +146,7 @@ sdk {
storage {
cache {
# Defaults to system temp folder / cache
# Defaults to <system_temp_folder>/clearml_cache
default_base_dir: "~/.clearml/cache"
}

View File

@@ -5,27 +5,30 @@
"metadata": {},
"source": [
"# Auto-Magically Spin AWS EC2 Instances On Demand \n",
"# and Create a Dynamic Cluster Running *Trains-Agent*\n",
"# and Create a Dynamic Cluster Running *ClearML-Agent*\n",
"\n",
"### Define your budget and execute the notebook, that's it\n",
"### You now have a fully managed cluster on AWS 🎉 🎊 "
"## Define your budget and execute the notebook, that's it\n",
"## You now have a fully managed cluster on AWS 🎉 🎊"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**trains-agent**'s main goal is to quickly pull a job from an execution queue, setup the environment (as defined in the experiment, including git cloning, python packages etc.) then execute the experiment and monitor it.\n",
"**clearml-agent**'s main goal is to quickly pull a job from an execution queue, set up the environment (as defined in the experiment, including git cloning, python packages etc.), then execute the experiment and monitor it.\n",
"\n",
"This notebook defines a cloud budget (currently only AWS is supported, but feel free to expand with PRs), and spins an instance the minute a job is waiting for execution. It will also spin down idle machines, saving you some $$$ :)\n",
"\n",
"Configuration steps\n",
"> **Note:**\n",
"> This is just an example of how you can use ClearML Agent to implement custom autoscaling. For a more structured autoscaler script, see [here](https://github.com/allegroai/clearml/blob/master/clearml/automation/auto_scaler.py).\n",
"\n",
"Configuration steps:\n",
"- Define maximum budget to be used (instance type / number of instances).\n",
"- Create new execution *queues* in the **trains-server**.\n",
"- Define mapping between the created the *queues* and an instance budget.\n",
"- Create new execution *queues* in the **clearml-server**.\n",
"- Define mapping between the created *queues* and an instance budget.\n",
"\n",
"**TL;DR - This notebook:**\n",
"- Will spin instances if there are jobs in the execution *queues*, until it will hit the budget limit. \n",
"- Will spin instances if there are jobs in the execution *queues*, until it hits the budget limit.\n",
"- If machines are idle, it will spin them down.\n",
"\n",
"The controller implementation itself is stateless, meaning you can always re-execute the notebook, if for some reason it stopped.\n",
@@ -39,7 +42,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Install & import required packages"
"### Install & import required packages"
]
},
{
@@ -48,7 +51,7 @@
"metadata": {},
"outputs": [],
"source": [
"!pip install trains-agent\n",
"!pip install clearml-agent\n",
"!pip install boto3"
]
},
@@ -56,7 +59,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Define AWS instance types and configuration (Instance Type, EBS, AMI etc.)"
"### Define AWS instance types and configuration (Instance Type, EBS, AMI etc.)"
]
},
{
@@ -92,17 +95,17 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Define machine budget per execution queue\n",
"### Define machine budget per execution queue\n",
"\n",
"Now that we defined our budget, we need to connect it with the **Trains** cluster.\n",
"Now that we defined our budget, we need to connect it with the **ClearML** cluster.\n",
"\n",
"We map each queue to a resource type (instance type).\n",
"\n",
"Create two queues in the WebUI:\n",
"- Browse to http://your_trains_server_ip:8080/workers-and-queues/queues\n",
"Create two queues in the Web UI:\n",
"- Browse to http://your_clearml_server_ip:8080/workers-and-queues/queues\n",
"- Then click on the \"New Queue\" button and name your queues \"aws_normal\" and \"aws_high\" respectively\n",
"\n",
"The QUEUES dictionary hold the mapping between the queue name and the type/number of instances to spin connected to the specific queue.\n",
"The QUEUES dictionary maps each queue name to the type and maximum number of instances that can be spun up for that queue.\n",
"```\n",
"QUEUES = {\n",
" 'queue_name': [(\"instance-type-as-defined-in-RESOURCE_CONFIGURATIONS\", max_number_of_instances), ]\n",
@@ -116,7 +119,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Trains-Agent Queues - Machines budget per Queue\n",
"# ClearML Agent Queues - Machines budget per Queue\n",
"# Per queue: list of (machine type as defined in RESOURCE_CONFIGURATIONS,\n",
"# max instances for the specific queue). Order machines from most preferred to least.\n",
"QUEUES = {\n",
@@ -129,7 +132,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Credentials for your AWS account, as well as for your **Trains-Server**"
"### Credentials for your AWS account, as well as for your **ClearML Server**"
]
},
{
@@ -143,24 +146,25 @@
"CLOUD_CREDENTIALS_SECRET = \"\"\n",
"CLOUD_CREDENTIALS_REGION = \"us-east-1\"\n",
"\n",
"# TRAINS configuration\n",
"TRAINS_SERVER_WEB_SERVER = \"http://localhost:8080\"\n",
"TRAINS_SERVER_API_SERVER = \"http://localhost:8008\"\n",
"TRAINS_SERVER_FILES_SERVER = \"http://localhost:8081\"\n",
"# TRAINS credentials\n",
"TRAINS_ACCESS_KEY = \"\"\n",
"TRAINS_SECRET_KEY = \"\"\n",
"# Git User/Pass to be used by trains-agent,\n",
"# CLEARML configuration\n",
"CLEARML_WEB_SERVER = \"http://localhost:8080\"\n",
"CLEARML_API_SERVER = \"http://localhost:8008\"\n",
"CLEARML_FILES_SERVER = \"http://localhost:8081\"\n",
"# CLEARML credentials\n",
"CLEARML_API_ACCESS_KEY = \"\"\n",
"CLEARML_API_SECRET_KEY = \"\"\n",
"# Git User/Pass to be used by clearml-agent,\n",
"# leave empty if image already contains git ssh-key\n",
"TRAINS_GIT_USER = \"\"\n",
"TRAINS_GIT_PASS = \"\"\n",
"CLEARML_AGENT_GIT_USER = \"\"\n",
"CLEARML_AGENT_GIT_PASS = \"\"\n",
"\n",
"# Additional fields for trains.conf file created on the remote instance\n",
"# for example: 'agent.default_docker.image: \"nvidia/cuda:10.0-cudnn7-runtime\"'\n",
"EXTRA_TRAINS_CONF = \"\"\"\n",
"# Additional fields for clearml.conf file created on the remote instance\n",
"# for example: 'agent.default_docker.image: \"nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04\"'\n",
"\n",
"EXTRA_CLEARML_CONF = \"\"\"\n",
"\"\"\"\n",
"\n",
"# Bash script to run on instances before running trains-agent\n",
"# Bash script to run on instances before running clearml-agent\n",
"# Example: \"\"\"\n",
"# echo \"This is the first line\"\n",
"# echo \"This is the second line\"\n",
@@ -168,9 +172,9 @@
"EXTRA_BASH_SCRIPT = \"\"\"\n",
"\"\"\"\n",
"\n",
"# Default docker for trains-agent when running in docker mode (requires docker v19.03 and above). \n",
"# Leave empty to run trains-agent in non-docker mode.\n",
"DEFAULT_DOCKER_IMAGE = \"nvidia/cuda\""
"# Default docker for clearml-agent when running in docker mode (requires docker v19.03 and above).\n",
"# Leave empty to run clearml-agent in non-docker mode.\n",
"CLEARML_AGENT_DOCKER_IMAGE = \"nvidia/cuda\""
]
},
{
@@ -192,7 +196,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Import Packages and Budget Definition Sanity Check"
"### Import Packages and Budget Definition Sanity Check"
]
},
{
@@ -209,7 +213,7 @@
"from time import sleep, time\n",
"\n",
"import boto3\n",
"from trains_agent.backend_api.session.client import APIClient"
"from clearml_agent.backend_api.session.client import APIClient"
]
},
{
@@ -227,36 +231,36 @@
" \"A resource name can only appear in a single queue definition.\"\n",
" )\n",
"\n",
"# Encode EXTRA_TRAINS_CONF for later bash script usage\n",
"EXTRA_TRAINS_CONF_ENCODED = \"\\\\\\\"\".join(EXTRA_TRAINS_CONF.split(\"\\\"\"))"
"# Encode EXTRA_CLEARML_CONF for later bash script usage\n",
"EXTRA_CLEARML_CONF_ENCODED = \"\\\\\\\"\".join(EXTRA_CLEARML_CONF.split(\"\\\"\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Cloud specific implementation of spin up/down - currently supports AWS only"
"### Cloud specific implementation of spin up/down - currently supports AWS only"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Cloud-specific implementation (currently, only AWS EC2 is supported)\n",
"def spin_up_worker(resource, worker_id_prefix, queue_name):\n",
" \"\"\"\n",
" Creates a new worker for trains.\n",
" Creates a new worker for clearml.\n",
" First, create an instance in the cloud and install some required packages.\n",
" Then, define trains-agent environment variables and run \n",
" trains-agent for the specified queue.\n",
" Then, define clearml-agent environment variables and run\n",
" clearml-agent for the specified queue.\n",
" NOTE: - Will wait until instance is running\n",
" - This implementation assumes the instance image already has docker installed\n",
"\n",
" :param str resource: resource name, as defined in BUDGET and QUEUES.\n",
" :param str worker_id_prefix: worker name prefix\n",
" :param str queue_name: trains queue to listen to\n",
" :param str queue_name: clearml queue to listen to\n",
" \"\"\"\n",
" resource_conf = RESOURCE_CONFIGURATIONS[resource]\n",
" # Add worker type and AWS instance type to the worker name.\n",
@@ -267,8 +271,8 @@
" )\n",
"\n",
" # user_data script will automatically run when the instance is started. \n",
" # It will install the required packages for trains-agent configure it using \n",
" # environment variables and run trains-agent on the required queue\n",
" # It will install the required packages for clearml-agent, configure it using\n",
" # environment variables and run clearml-agent on the required queue\n",
" user_data = \"\"\"#!/bin/bash\n",
" sudo apt-get update\n",
" sudo apt-get install -y python3-dev\n",
@@ -278,36 +282,36 @@
" sudo apt-get install -y build-essential\n",
" python3 -m pip install -U pip\n",
" python3 -m pip install virtualenv\n",
" python3 -m virtualenv trains_agent_venv\n",
" source trains_agent_venv/bin/activate\n",
" python -m pip install trains-agent\n",
" echo 'agent.git_user=\\\"{git_user}\\\"' >> /root/trains.conf\n",
" echo 'agent.git_pass=\\\"{git_pass}\\\"' >> /root/trains.conf\n",
" echo \"{trains_conf}\" >> /root/trains.conf\n",
" export TRAINS_API_HOST={api_server}\n",
" export TRAINS_WEB_HOST={web_server}\n",
" export TRAINS_FILES_HOST={files_server}\n",
" python3 -m virtualenv clearml_agent_venv\n",
" source clearml_agent_venv/bin/activate\n",
" python -m pip install clearml-agent\n",
" echo 'agent.git_user=\\\"{git_user}\\\"' >> /root/clearml.conf\n",
" echo 'agent.git_pass=\\\"{git_pass}\\\"' >> /root/clearml.conf\n",
" echo \"{clearml_conf}\" >> /root/clearml.conf\n",
" export CLEARML_API_HOST={api_server}\n",
" export CLEARML_WEB_HOST={web_server}\n",
" export CLEARML_FILES_HOST={files_server}\n",
" export DYNAMIC_INSTANCE_ID=`curl http://169.254.169.254/latest/meta-data/instance-id`\n",
" export TRAINS_WORKER_ID={worker_id}:$DYNAMIC_INSTANCE_ID\n",
" export TRAINS_API_ACCESS_KEY='{access_key}'\n",
" export TRAINS_API_SECRET_KEY='{secret_key}'\n",
" export CLEARML_WORKER_ID={worker_id}:$DYNAMIC_INSTANCE_ID\n",
" export CLEARML_API_ACCESS_KEY='{access_key}'\n",
" export CLEARML_API_SECRET_KEY='{secret_key}'\n",
" {bash_script}\n",
" source ~/.bashrc\n",
" python -m trains_agent --config-file '/root/trains.conf' daemon --queue '{queue}' {docker}\n",
" python -m clearml_agent --config-file '/root/clearml.conf' daemon --queue '{queue}' {docker}\n",
" shutdown\n",
" \"\"\".format(\n",
" api_server=TRAINS_SERVER_API_SERVER,\n",
" web_server=TRAINS_SERVER_WEB_SERVER,\n",
" files_server=TRAINS_SERVER_FILES_SERVER,\n",
" api_server=CLEARML_API_SERVER,\n",
" web_server=CLEARML_WEB_SERVER,\n",
" files_server=CLEARML_FILES_SERVER,\n",
" worker_id=worker_id,\n",
" access_key=TRAINS_ACCESS_KEY,\n",
" secret_key=TRAINS_SECRET_KEY,\n",
" access_key=CLEARML_API_ACCESS_KEY,\n",
" secret_key=CLEARML_API_SECRET_KEY,\n",
" queue=queue_name,\n",
" git_user=TRAINS_GIT_USER,\n",
" git_pass=TRAINS_GIT_PASS,\n",
" trains_conf=EXTRA_TRAINS_CONF_ENCODED,\n",
" git_user=CLEARML_AGENT_GIT_USER,\n",
" git_pass=CLEARML_AGENT_GIT_PASS,\n",
" clearml_conf=EXTRA_CLEARML_CONF_ENCODED,\n",
" bash_script=EXTRA_BASH_SCRIPT,\n",
" docker=\"--docker '{}'\".format(DEFAULT_DOCKER_IMAGE) if DEFAULT_DOCKER_IMAGE else \"\"\n",
" docker=\"--docker '{}'\".format(CLEARML_AGENT_DOCKER_IMAGE) if CLEARML_AGENT_DOCKER_IMAGE else \"\"\n",
" )\n",
"\n",
" ec2 = boto3.client(\n",
@@ -405,7 +409,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"###### Controller Implementation and Logic"
"#### Controller Implementation and Logic"
]
},
{
@@ -430,18 +434,18 @@
"\n",
" # Internal definitions\n",
" workers_prefix = \"dynamic_aws\"\n",
" # Worker's id in trains would be composed from:\n",
" # Worker's id in clearml would be composed from:\n",
" # prefix, name, instance_type and cloud_id separated by ':'\n",
" workers_pattern = re.compile(\n",
" r\"^(?P<prefix>[^:]+):(?P<name>[^:]+):(?P<instance_type>[^:]+):(?P<cloud_id>[^:]+)\"\n",
" )\n",
"\n",
" # Set up the environment variables for trains\n",
" os.environ[\"TRAINS_API_HOST\"] = TRAINS_SERVER_API_SERVER\n",
" os.environ[\"TRAINS_WEB_HOST\"] = TRAINS_SERVER_WEB_SERVER\n",
" os.environ[\"TRAINS_FILES_HOST\"] = TRAINS_SERVER_FILES_SERVER\n",
" os.environ[\"TRAINS_API_ACCESS_KEY\"] = TRAINS_ACCESS_KEY\n",
" os.environ[\"TRAINS_API_SECRET_KEY\"] = TRAINS_SECRET_KEY\n",
" # Set up the environment variables for clearml\n",
" os.environ[\"CLEARML_API_HOST\"] = CLEARML_API_SERVER\n",
" os.environ[\"CLEARML_WEB_HOST\"] = CLEARML_WEB_SERVER\n",
" os.environ[\"CLEARML_FILES_HOST\"] = CLEARML_FILES_SERVER\n",
" os.environ[\"CLEARML_API_ACCESS_KEY\"] = CLEARML_API_ACCESS_KEY\n",
" os.environ[\"CLEARML_API_SECRET_KEY\"] = CLEARML_API_SECRET_KEY\n",
" api_client = APIClient()\n",
"\n",
" # Verify the requested queues exist and create those that don't exist\n",
@@ -520,7 +524,7 @@
" # skip resource types that might be needed\n",
" if resources in required_idle_resources:\n",
" continue\n",
" # Remove from both aws and trains all instances that are \n",
" # Remove from both aws and clearml all instances that are\n",
" # idle for longer than MAX_IDLE_TIME_MIN\n",
" if time() - timestamp > MAX_IDLE_TIME_MIN * 60.0:\n",
" cloud_id = workers_pattern.match(worker.id)[\"cloud_id\"]\n",
@@ -535,7 +539,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Execute Forever* (the controller is stateless, so you can always re-execute the notebook)"
"### Execute Forever* (the controller is stateless, so you can always re-execute the notebook)"
]
},
{
@@ -584,4 +588,4 @@
},
"nbformat": 4,
"nbformat_minor": 2
}
}
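
The worker ID convention used by the controller in the notebook above (four fields joined by ':') can be checked in isolation. The following is a minimal sketch that reuses the regular expression from the notebook; the sample worker ID, resource name, and instance ID are made up for illustration.

```python
import re

# Same pattern as in the notebook: prefix, name, instance_type and cloud_id separated by ':'
workers_pattern = re.compile(
    r"^(?P<prefix>[^:]+):(?P<name>[^:]+):(?P<instance_type>[^:]+):(?P<cloud_id>[^:]+)"
)

# Hypothetical worker ID, shaped like "<worker_id>:$DYNAMIC_INSTANCE_ID" from the user_data script
sample_worker_id = "dynamic_aws:aws4gpu:g4dn.4xlarge:i-0123456789abcdef0"

match = workers_pattern.match(sample_worker_id)
if match:
    # The cloud_id group is what the controller passes on when spinning an idle instance down
    print(match["prefix"], match["instance_type"], match["cloud_id"])
```

IDs that do not follow this four-field layout (for example, a manually started agent) simply do not match, so a controller can tell them apart from the workers it manages.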

View File

@@ -13,61 +13,86 @@ def parse_args():
group = parser.add_mutually_exclusive_group()
parser.add_argument(
"--queue", type=str, help="Queue to pull tasks from"
"--queue",
type=str,
help="Queues to pull tasks from. If multiple queues, use comma separated list, e.g. 'queue1,queue2'",
)
group.add_argument(
"--ports-mode", action='store_true', default=False,
"--ports-mode",
action="store_true",
default=False,
help="Ports-Mode will add a label to the pod which can be used as service, in order to expose ports. "
"Should not be used with max-pods"
)
parser.add_argument(
"--num-of-services", type=int, default=20,
help="Specify the number of k8s services to be used. Use only with ports-mode."
"--num-of-services",
type=int,
default=20,
help="Specify the number of k8s services to be used. Use only with ports-mode.",
)
parser.add_argument(
"--base-port", type=int,
"--base-port",
type=int,
help="Used in conjunction with ports-mode, specifies the base port exposed by the services. "
"For pod #X, the port will be <base-port>+X. Note that pod number is calculated based on base-pod-num, "
"e.g. if base-port=20000 and base-pod-num=3, the port for the first pod will be 20003"
)
parser.add_argument(
"--base-pod-num", type=int, default=1,
"--base-pod-num",
type=int,
default=1,
help="Used in conjunction with ports-mode and base-port, specifies the base pod number to be used by the "
"service (default: %(default)s)"
)
parser.add_argument(
"--gateway-address", type=str, default=None,
help="Used in conjunction with ports-mode, specify the external address of the k8s ingress / ELB"
"--gateway-address",
type=str,
default=None,
help="Used in conjunction with ports-mode, specify the external address of the k8s ingress / ELB",
)
parser.add_argument(
"--pod-clearml-conf", type=str,
help="Configuration file to be used by the pod itself (if not provided, current configuration is used)"
"--pod-clearml-conf",
type=str,
help="Configuration file to be used by the pod itself (if not provided, current configuration is used)",
)
parser.add_argument(
"--overrides-yaml", type=str,
help="YAML file containing pod overrides to be used when launching a new pod"
"--overrides-yaml", type=str, help="YAML file containing pod overrides to be used when launching a new pod"
)
parser.add_argument(
"--template-yaml", type=str,
"--template-yaml",
type=str,
help="YAML file containing pod template. If provided pod will be scheduled with kubectl apply "
"and overrides are ignored, otherwise it will be scheduled with kubectl run"
)
parser.add_argument(
"--ssh-server-port", type=int, default=0,
help="If non-zero, every pod will also start an SSH server on the selected port (default: zero, not active)"
"--ssh-server-port",
type=int,
default=0,
help="If non-zero, every pod will also start an SSH server on the selected port (default: zero, not active)",
)
parser.add_argument(
"--namespace", type=str,
help="Specify the namespace in which pods will be created (default: %(default)s)", default="clearml"
"--namespace",
type=str,
help="Specify the namespace in which pods will be created (default: %(default)s)",
default="clearml",
)
group.add_argument(
"--max-pods", type=int,
"--max-pods",
type=int,
help="Limit the maximum number of pods that this service can run at the same time. "
"Should not be used with ports-mode"
)
parser.add_argument(
"--use-owner-token", action="store_true", default=False,
help="Generate and use task owner token for the execution of each task"
"--use-owner-token",
action="store_true",
default=False,
help="Generate and use task owner token for the execution of each task",
)
parser.add_argument(
"--create-queue",
action="store_true",
default=False,
help="Create the queue if it does not exist (default: %(default)s)",
)
return parser.parse_args()
@@ -77,21 +102,32 @@ def main():
user_props_cb = None
if args.ports_mode and args.base_port:
def k8s_user_props_cb(pod_number=0):
user_prop = {"k8s-pod-port": args.base_port + pod_number}
if args.gateway_address:
user_prop["k8s-gateway-address"] = args.gateway_address
return user_prop
user_props_cb = k8s_user_props_cb
k8s = K8sIntegration(
ports_mode=args.ports_mode, num_of_services=args.num_of_services, base_pod_num=args.base_pod_num,
user_props_cb=user_props_cb, overrides_yaml=args.overrides_yaml, clearml_conf_file=args.pod_clearml_conf,
template_yaml=args.template_yaml, extra_bash_init_script=K8sIntegration.get_ssh_server_bash(
ssh_port_number=args.ssh_server_port) if args.ssh_server_port else None,
namespace=args.namespace, max_pods_limit=args.max_pods or None,
ports_mode=args.ports_mode,
num_of_services=args.num_of_services,
base_pod_num=args.base_pod_num,
user_props_cb=user_props_cb,
overrides_yaml=args.overrides_yaml,
clearml_conf_file=args.pod_clearml_conf,
template_yaml=args.template_yaml,
extra_bash_init_script=K8sIntegration.get_ssh_server_bash(ssh_port_number=args.ssh_server_port)
if args.ssh_server_port
else None,
namespace=args.namespace,
max_pods_limit=args.max_pods or None,
)
k8s.k8s_daemon(args.queue, use_owner_token=args.use_owner_token)
queue = [q.strip() for q in args.queue.split(",") if q.strip()] if args.queue else None
k8s.k8s_daemon(queue, use_owner_token=args.use_owner_token, create_queue=args.create_queue)
if __name__ == "__main__":
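
The comma-separated `--queue` handling introduced in the hunk above can be tried on its own. This is a minimal sketch, assuming the same splitting logic as the diff; the function name `split_queues` is hypothetical and only wraps the list comprehension for testing.

```python
def split_queues(raw):
    """Split a comma-separated queue argument into a list of names.

    Mirrors the expression in the diff: returns None when no value was given,
    and drops empty entries and surrounding whitespace otherwise.
    """
    if not raw:
        return None
    return [q.strip() for q in raw.split(",") if q.strip()]


print(split_queues("queue1, queue2"))
print(split_queues(None))
```

Note that an argument consisting only of separators (for example ",") yields an empty list rather than None, so the caller may want to treat both cases as "no queue given".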

View File

@@ -1,15 +1,15 @@
attrs>=18.0,<23.0.0
attrs>=18.0,<24.0.0
enum34>=0.9,<1.2.0 ; python_version < '3.6'
furl>=2.0.0,<2.2.0
jsonschema>=2.6.0,<5.0.0
pathlib2>=2.3.0,<2.4.0
psutil>=3.4.2,<5.10.0
pyparsing>=2.0.3,<3.1.0
pyparsing>=2.0.3,<3.2.0
python-dateutil>=2.4.2,<2.9.0
pyjwt>=2.4.0,<2.7.0
pyjwt>=2.4.0,<2.9.0
PyYAML>=3.12,<6.1
requests>=2.20.0,<2.29.0
requests>=2.20.0,<=2.31.0
six>=1.13.0,<1.17.0
typing>=3.6.4,<3.8.0 ; python_version < '3.5'
urllib3>=1.21.1,<1.27.0
urllib3>=1.21.1,<2
virtualenv>=16,<21

View File

@@ -62,6 +62,8 @@ setup(
'Programming Language :: Python :: 3.8',
'Programming Language :: Python :: 3.9',
'Programming Language :: Python :: 3.10',
'Programming Language :: Python :: 3.11',
'Programming Language :: Python :: 3.12',
'License :: OSI Approved :: Apache Software License',
],