Mirror of https://github.com/clearml/clearml-agent (synced 2025-06-26 18:16:15 +00:00)

Compare commits: 62 commits
| SHA1 |
|---|
| c9fc092f4e |
| 432ee395e1 |
| 98fc4f0fb9 |
| 111e774c21 |
| 3dd8d783e1 |
| 7c3e420df4 |
| 55b065a114 |
| faa97b6cc2 |
| f5861b1e4a |
| 030cbb69f1 |
| 564f769ff7 |
| 2c7f091e57 |
| dd5d24b0ca |
| 996bb797c3 |
| 9ad49a0d21 |
| ba4fee7b19 |
| 0131db8b7d |
| d2384a9a95 |
| 5b86c230c1 |
| 21e4be966f |
| 9c6cb421b3 |
| 52405c343d |
| 46f0c991c8 |
| 0254279ed5 |
| 0e1750f90e |
| 58e0dc42ec |
| d16825029d |
| fb639afcb9 |
| eefb94d1bc |
| f1e9266075 |
| e1e3c84a8d |
| ed1356976b |
| 2b815354e0 |
| edae380a9e |
| 946e9d9ce9 |
| a56343ffc7 |
| 159a6e9a5a |
| 6b7ee12dc1 |
| 3838247716 |
| 6e7d35a42a |
| 4c056a17b9 |
| 21d98afca5 |
| 6a1bf11549 |
| 7115a9b9a7 |
| 450df2f8d3 |
| ccf752c4e4 |
| 3ed63e2154 |
| a535f93cd6 |
| b380ec54c6 |
| a1274299ce |
| c77224af68 |
| 95dadca45c |
| 685918fd9b |
| bc85ddf78d |
| 5b5fb0b8a6 |
| fec0ce1756 |
| 1e09b88b7a |
| b6ca0fa6a5 |
| 307ec9213e |
| a78a25d966 |
| ebb6231f5a |
| e1d65cb280 |
.gitignore (vendored, 3 changed lines)

@@ -11,3 +11,6 @@ build/
 dist/
 *.egg-info
+
+# VSCode
+.vscode
README.md (107 changed lines)

@@ -24,8 +24,7 @@ ML-Ops scheduler & orchestration solution supporting Linux, macOS and Windows**
 * Launch-and-Forget service containers
 * [Cloud autoscaling](https://clear.ml/docs/latest/docs/guides/services/aws_autoscaler)
 * [Customizable cleanup](https://clear.ml/docs/latest/docs/guides/services/cleanup_service)
-*
-  Advanced [pipeline building and execution](https://clear.ml/docs/latest/docs/guides/frameworks/pytorch/notebooks/table/tabular_training_pipeline)
+* Advanced [pipeline building and execution](https://clear.ml/docs/latest/docs/guides/frameworks/pytorch/notebooks/table/tabular_training_pipeline)

 It is a zero configuration fire-and-forget execution agent, providing a full ML/DL cluster solution.

@@ -35,8 +34,8 @@ It is a zero configuration fire-and-forget execution agent, providing a full ML/
    or [free tier hosting](https://app.clear.ml)
 2. `pip install clearml-agent` ([install](#installing-the-clearml-agent) the ClearML Agent on any GPU machine:
    on-premises / cloud / ...)
-3. Create a [job](https://github.com/allegroai/clearml/docs/clearml-task.md) or
-   Add [ClearML](https://github.com/allegroai/clearml) to your code with just 2 lines
+3. Create a [job](https://clear.ml/docs/latest/docs/apps/clearml_task) or
+   add [ClearML](https://github.com/allegroai/clearml) to your code with just 2 lines of code
 4. Change the [parameters](#using-the-clearml-agent) in the UI & schedule for [execution](#using-the-clearml-agent) (or
    automate with an [AutoML pipeline](#automl-and-orchestration-pipelines-))
 5. :chart_with_downwards_trend: :chart_with_upwards_trend: :eyes: :beer:

@@ -81,21 +80,21 @@ Find Dockerfiles in the [docker](./docker) dir and a helm Chart in https://githu

 **Two K8s integration flavours**

-- Spin ClearML-Agent as a long-lasting service pod
-    - use [clearml-agent](https://hub.docker.com/r/allegroai/clearml-agent) docker image
+- Spin ClearML-Agent as a long-lasting service pod:
+    - Use [clearml-agent](https://hub.docker.com/r/allegroai/clearml-agent) docker image
     - map docker socket into the pod (soon replaced by [podman](https://github.com/containers/podman))
-    - allow the clearml-agent to manage sibling dockers
-    - benefits: full use of the ClearML scheduling, no need to worry about wrong container images / lost pods etc.
-    - downside: Sibling containers
-- Kubernetes Glue, map ClearML jobs directly to K8s jobs
+    - Allow the clearml-agent to manage sibling dockers
+    - Benefits: full use of the ClearML scheduling, no need to worry about wrong container images / lost pods etc.
+    - Downside: sibling containers
+- Kubernetes Glue, map ClearML jobs directly to K8s jobs:
     - Run the [clearml-k8s glue](https://github.com/allegroai/clearml-agent/blob/master/examples/k8s_glue_example.py) on
      a K8s cpu node
    - The clearml-k8s glue pulls jobs from the ClearML job execution queue and prepares a K8s job (based on provided
      yaml template)
    - Inside the pod itself the clearml-agent will install the job (experiment) environment and spin and monitor the
      experiment's process
-    - benefits: Kubernetes full view of all running jobs in the system
-    - downside: No real scheduling (k8s scheduler), no docker image verification (post-mortem only)
+    - Benefits: Kubernetes full view of all running jobs in the system
+    - Downside: No real scheduling (k8s scheduler), no docker image verification (post-mortem only)

 ### Using the ClearML Agent

@@ -110,15 +109,15 @@ A previously run experiment can be put into 'Draft' state by either of two metho

 * Using the **'Reset'** action from the experiment right-click context menu in the ClearML UI - This will clear any
   results and artifacts the previous run had created.
-* Using the **'Clone'** action from the experiment right-click context menu in the ClearML UI - This will create a new '
-  Draft' experiment with the same configuration as the original experiment.
+* Using the **'Clone'** action from the experiment right-click context menu in the ClearML UI - This will create a new
+  'Draft' experiment with the same configuration as the original experiment.

 An experiment is scheduled for execution using the **'Enqueue'** action from the experiment right-click context menu in
 the ClearML UI and selecting the execution queue.

 See [creating an experiment and enqueuing it for execution](#from-scratch).

-Once an experiment is enqueued, it will be picked up and executed by a ClearML agent monitoring this queue.
+Once an experiment is enqueued, it will be picked up and executed by a ClearML Agent monitoring this queue.

 The ClearML UI Workers & Queues page provides ongoing execution information:

@@ -170,22 +169,22 @@ clearml-agent init
 ```

 Note: The ClearML Agent uses a cache folder to cache pip packages, apt packages and cloned repositories. The default
-ClearML Agent cache folder is `~/.clearml`
+ClearML Agent cache folder is `~/.clearml`.

-See full details in your configuration file at `~/clearml.conf`
+See full details in your configuration file at `~/clearml.conf`.

-Note: The **ClearML agent** extends the **ClearML** configuration file `~/clearml.conf`
+Note: The **ClearML Agent** extends the **ClearML** configuration file `~/clearml.conf`.
 They are designed to share the same configuration file, see example [here](docs/clearml.conf)

 #### Running the ClearML Agent

-For debug and experimentation, start the ClearML agent in `foreground` mode, where all the output is printed to screen
+For debug and experimentation, start the ClearML agent in `foreground` mode, where all the output is printed to screen:

 ```bash
 clearml-agent daemon --queue default --foreground
 ```

-For actual service mode, all the stdout will be stored automatically into a temporary file (no need to pipe)
+For actual service mode, all the stdout will be stored automatically into a temporary file (no need to pipe).
 Notice: with `--detached` flag, the *clearml-agent* will be running in the background

 ```bash
@@ -195,20 +194,21 @@ clearml-agent daemon --detached --queue default
 GPU allocation is controlled via the standard OS environment `NVIDIA_VISIBLE_DEVICES` or `--gpus` flag (or disabled
 with `--cpu-only`).

-If no flag is set, and `NVIDIA_VISIBLE_DEVICES` variable doesn't exist, all GPU's will be allocated for
-the `clearml-agent` <br>
+If no flag is set, and `NVIDIA_VISIBLE_DEVICES` variable doesn't exist, all GPUs will be allocated for
+the `clearml-agent`. <br>
 If `--cpu-only` flag is set, or `NVIDIA_VISIBLE_DEVICES="none"`, no gpu will be allocated for
-the `clearml-agent`
+the `clearml-agent`.

-Example: spin two agents, one per gpu on the same machine:
-Notice: with `--detached` flag, the *clearml-agent* will be running in the background
+Example: spin two agents, one per GPU on the same machine:
+
+Notice: with `--detached` flag, the *clearml-agent* will run in the background

 ```bash
 clearml-agent daemon --detached --gpus 0 --queue default
 clearml-agent daemon --detached --gpus 1 --queue default
 ```

-Example: spin two agents, pulling from dedicated `dual_gpu` queue, two gpu's per agent
+Example: spin two agents, pulling from dedicated `dual_gpu` queue, two GPUs per agent

 ```bash
 clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu
@@ -223,43 +223,43 @@ For debug and experimentation, start the ClearML agent in `foreground` mode, whe
 clearml-agent daemon --queue default --docker --foreground
 ```

-For actual service mode, all the stdout will be stored automatically into a file (no need to pipe)
-Notice: with `--detached` flag, the *clearml-agent* will be running in the background
+For actual service mode, all the stdout will be stored automatically into a file (no need to pipe).
+Notice: with `--detached` flag, the *clearml-agent* will run in the background

 ```bash
 clearml-agent daemon --detached --queue default --docker
 ```

-Example: spin two agents, one per gpu on the same machine, with default nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
+Example: spin two agents, one per gpu on the same machine, with default `nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04`
 docker:

 ```bash
-clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
-clearml-agent daemon --detached --gpus 1 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
+clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04
+clearml-agent daemon --detached --gpus 1 --queue default --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04
 ```

-Example: spin two agents, pulling from dedicated `dual_gpu` queue, two gpu's per agent, with default nvidia/cuda:
-10.1-cudnn7-runtime-ubuntu18.04 docker:
+Example: spin two agents, pulling from dedicated `dual_gpu` queue, two GPUs per agent, with default
+`nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04` docker:

 ```bash
-clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
-clearml-agent daemon --detached --gpus 2,3 --queue dual_gpu --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
+clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04
+clearml-agent daemon --detached --gpus 2,3 --queue dual_gpu --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04
 ```

 ##### Starting the ClearML Agent - Priority Queues

 Priority Queues are also supported, example use case:

-High priority queue: `important_jobs` Low priority queue: `default`
+High priority queue: `important_jobs`, low priority queue: `default`

 ```bash
 clearml-agent daemon --queue important_jobs default
 ```

-The **ClearML Agent** will first try to pull jobs from the `important_jobs` queue, only then it will fetch a job from
-the `default` queue.
+The **ClearML Agent** will first try to pull jobs from the `important_jobs` queue, and only if it is empty, the agent
+will try to pull from the `default` queue.

-Adding queues, managing job order within a queue and moving jobs between queues, is available using the Web UI, see
+Adding queues, managing job order within a queue, and moving jobs between queues, is available using the Web UI, see
 example on our [free server](https://app.clear.ml/workers-and-queues/queues)

 ##### Stopping the ClearML Agent

@@ -268,7 +268,7 @@ To stop a **ClearML Agent** running in the background, run the same command line
 appended. For example, to stop the first of the above shown same machine, single gpu agents:

 ```bash
-clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04 --stop
+clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04 --stop
 ```

 ### How do I create an experiment on the ClearML Server? <a name="from-scratch"></a>

@@ -279,32 +279,33 @@ clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:10
 - Git repository link and commit ID (or an entire jupyter notebook)
 - Git diff (we’re not saying you never commit and push, but still...)
 - Python packages used by your code (including specific versions used)
-- Hyper-Parameters
-- Input Artifacts
+- Hyperparameters
+- Input artifacts

 You now have a 'template' of your experiment with everything required for automated execution

-* In the ClearML UI, Right-click on the experiment and select 'clone'. A copy of your experiment will be created.
+* In the ClearML UI, right-click on the experiment and select 'clone'. A copy of your experiment will be created.
 * You now have a new draft experiment cloned from your original experiment, feel free to edit it
-    - Change the Hyper-Parameters
+    - Change the hyperparameters
     - Switch to the latest code base of the repository
     - Update package versions
     - Select a specific docker image to run in (see docker execution mode section)
     - Or simply change nothing to run the same experiment again...
-* Schedule the newly created experiment for execution: Right-click the experiment and select 'enqueue'
+* Schedule the newly created experiment for execution: right-click the experiment and select 'enqueue'

 ### ClearML-Agent Services Mode <a name="services"></a>

 ClearML-Agent Services is a special mode of ClearML-Agent that provides the ability to launch long-lasting jobs that
 previously had to be executed on local / dedicated machines. It allows a single agent to launch multiple dockers (Tasks)
-for different use cases. To name a few use cases, auto-scaler service (spinning instances when the need arises and the
-budget allows), Controllers (Implementing pipelines and more sophisticated DevOps logic), Optimizer (such as
-Hyper-parameter Optimization or sweeping), and Application (such as interactive Bokeh apps for increased data
-transparency)
+for different use cases:
+* Auto-scaler service (spinning instances when the need arises and the budget allows)
+* Controllers (Implementing pipelines and more sophisticated DevOps logic)
+* Optimizer (such as Hyperparameter Optimization or sweeping)
+* Application (such as interactive Bokeh apps for increased data transparency)

 ClearML-Agent Services mode will spin **any** task enqueued into the specified queue. Every task launched by
 ClearML-Agent Services will be registered as a new node in the system, providing tracking and transparency capabilities.
-Currently clearml-agent in services-mode supports cpu only configuration. ClearML-agent services mode can be launched
+Currently, clearml-agent in services-mode supports CPU only configuration. ClearML-Agent services mode can be launched
 alongside GPU agents.

 ```bash
@@ -321,15 +322,15 @@ ClearML package.
 Sample AutoML & Orchestration examples can be found in the
 ClearML [example/automation](https://github.com/allegroai/clearml/tree/master/examples/automation) folder.

-AutoML examples
+AutoML examples:

 - [Toy Keras training experiment](https://github.com/allegroai/clearml/blob/master/examples/optimization/hyper-parameter-optimization/base_template_keras_simple.py)
     - In order to create an experiment-template in the system, this code must be executed once manually
 - [Random Search over the above Keras experiment-template](https://github.com/allegroai/clearml/blob/master/examples/automation/manual_random_param_search_example.py)
-    - This example will create multiple copies of the Keras experiment-template, with different hyper-parameter
+    - This example will create multiple copies of the Keras experiment-template, with different hyperparameter
      combinations

-Experiment Pipeline examples
+Experiment Pipeline examples:

 - [First step experiment](https://github.com/allegroai/clearml/blob/master/examples/automation/task_piping_example.py)
     - This example will "process data", and once done, will launch a copy of the 'second step' experiment-template
@@ -45,8 +45,8 @@
     # it solves passing user/token to git submodules.
     # this is a safer way to ensure multiple users using the same repository will
     # not accidentally leak credentials
-    # Only supported on Linux systems, it will be the default in future releases
-    # enable_git_ask_pass: false
+    # Note: this is only supported on Linux systems
+    # enable_git_ask_pass: true

     # in docker mode, if container's entrypoint automatically activated a virtual environment
     # use the activated virtual environment and install everything there

@@ -80,6 +80,17 @@
     # additional artifact repositories to use when installing python packages
     # extra_index_url: ["https://allegroai.jfrog.io/clearml/api/pypi/public/simple"]

+    # control the pytorch wheel resolving algorithm, options are: "pip", "direct", "none"
+    # Override with environment variable CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE
+    # "pip" (default): would automatically detect the cuda version, and supply pip with the correct
+    #     extra-index-url, based on pytorch.org tables
+    # "direct": would resolve a direct link to the pytorch wheel by parsing the pytorch.org pip repository
+    #     and matching the automatically detected cuda version with the required pytorch wheel.
+    #     if the exact cuda version is not found for the required pytorch wheel, it will try
+    #     a lower cuda version until a match is found
+    # "none": No resolver used, install pytorch like any other package
+    # pytorch_resolve: "pip"
+
     # additional conda channels to use when installing with conda package manager
     conda_channels: ["pytorch", "conda-forge", "defaults", ]
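For reference, the "lower cuda version until a match is found" behavior described above amounts to walking down a table of known CUDA versions until a matching wheel index exists. A minimal Python sketch of that idea (the version table below is made up for the example and is not the agent's actual pytorch.org resolution data):

```python
# Illustrative sketch only: fall back to the closest lower CUDA version.
KNOWN_CUDA_INDICES = {
    118: "https://download.pytorch.org/whl/cu118",
    117: "https://download.pytorch.org/whl/cu117",
    113: "https://download.pytorch.org/whl/cu113",
}

def pick_extra_index_url(detected_cuda: int):
    # walk down from the detected version to the closest supported one
    for version in sorted(KNOWN_CUDA_INDICES, reverse=True):
        if version <= detected_cuda:
            return KNOWN_CUDA_INDICES[version]
    return None  # no matching GPU wheel index; install like any other package

print(pick_extra_index_url(117))  # -> https://download.pytorch.org/whl/cu117
print(pick_extra_index_url(115))  # -> https://download.pytorch.org/whl/cu113
```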
@@ -88,19 +99,23 @@
     # force_repo_requirements_txt: false

     # set the priority packages to be installed before the rest of the required packages
+    # Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
     # priority_packages: ["cython", "numpy", "setuptools", ]

     # set the optional priority packages to be installed before the rest of the required packages,
     # In case a package installation fails, the package will be ignored,
     # and the virtual environment process will continue
+    # Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
     priority_optional_packages: ["pygobject", ]

     # set the post packages to be installed after all the rest of the required packages
+    # Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
     # post_packages: ["horovod", ]

     # set the optional post packages to be installed after all the rest of the required packages,
     # In case a package installation fails, the package will be ignored,
     # and the virtual environment process will continue
+    # Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
     # post_optional_packages: []

     # set to True to support torch nightly build installation,

@@ -162,6 +177,13 @@
     # these are local for this agent and will not be updated in the experiment's docker_cmd section
     # extra_docker_arguments: ["--ipc=host", ]

+    # Allow the extra docker arg to override task level docker arg (if the same argument is passed on both),
+    # if set to False, a task docker arg will override the docker extra arg
+    # docker_args_extra_precedes_task: true
+
+    # allows the following task docker args to be overridden by the extra_docker_arguments
+    # protected_docker_extra_args: ["privileged", "security-opt", "network", "ipc"]
+
     # optional shell script to run in docker when started before the experiment is started
     # extra_docker_shell_script: ["apt-get install -y bindfs", ]

@@ -192,7 +214,7 @@

     default_docker: {
         # default docker image to use when running in docker mode
-        image: "nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04"
+        image: "nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04"

         # optional arguments to pass to docker image
         # arguments: ["--ipc=host", ]
@@ -259,10 +281,15 @@

     # Name docker containers created by the daemon using the following string format (supported from Docker 0.6.5)
     # Allowed variables are task_id, worker_id and rand_string (random lower-case letters string, up to 32 characters)
+    # Custom variables may be specified using the docker_container_name_format_fields option.
     # Note: resulting name must start with an alphanumeric character and
     #       continue with alphanumeric characters, underscores (_), dots (.) and/or dashes (-)
     # docker_container_name_format: "clearml-id-{task_id}-{rand_string:.8}"

+    # Specify custom variables for the docker_container_name_format option using a mapping of variable name
+    # to a (nested) task field (using "." as a task field separator, digits specify array index)
+    # docker_container_name_format_fields: { foo: "bar.moo" }
+
     # Apply top-level environment section from configuration into os.environ
     apply_environment: true
     # Top-level environment section is in the form of:
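The resulting container name is produced with standard Python str.format() (see the worker.py change further down), so a custom field declared in docker_container_name_format_fields simply becomes another substitution variable. A small sketch, where the "project" field name is hypothetical:

```python
# Hypothetical example: "project" would be declared in
# docker_container_name_format_fields and resolved from a task field.
name_format = "clearml-{project}-id-{task_id}-{rand_string:.8}"
print(name_format.format(
    project="demo",
    task_id="abc123",
    rand_string="qwertyuiopasdfgh",
))
# -> clearml-demo-id-abc123-qwertyui  (":.8" truncates the random string)
```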
@@ -283,6 +310,8 @@
     #   target_format: format used to encode contents before writing into the target file. Supported values are json,
     #                  yaml, yml and bytes (in which case the file will be written in binary mode). Default is text mode.
     #   overwrite: overwrite the target file in case it exists. Default is true.
+    #   mode: file-system mode to be applied to the file after its creation. The mode string will be parsed into an
+    #         integer (e.g. "0o777" for -rwxrwxrwx)
     #
     # Example:
     #   files {

@@ -348,7 +377,7 @@
     # Notice: Matching is done via regular expression, for example "^searchme$" will match exactly "searchme$" string
     #
     # "default_docker": {
-    #     "image": "nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04",
+    #     "image": "nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04",
     #     # optional arguments to pass to docker image
     #     # arguments: ["--ipc=host", ]
     #     "match_rules": [

@@ -369,13 +398,6 @@
     #             }
     #         },
     #         {
-    #             "image": "better_container:tag",
-    #             "arguments": "",
-    #             "match": {
-    #                 "container": "replace_me_please"
-    #             }
-    #         },
-    #         {
     #             "image": "another_container:tag",
     #             "arguments": "",
     #             "match": {
@@ -3,7 +3,7 @@

 storage {
     cache {
-        # Defaults to system temp folder / cache
+        # Defaults to <system_temp_folder>/clearml_cache
         default_base_dir: "~/.clearml/cache"
         size {
             # max_used_bytes = -1
@@ -1,5 +1,5 @@
-from ...backend_config.converters import safe_text_to_bool
-from ...backend_config.environment import EnvEntry
+from clearml_agent.helper.environment import EnvEntry
+from clearml_agent.helper.environment.converters import safe_text_to_bool


 ENV_HOST = EnvEntry("CLEARML_API_HOST", "TRAINS_API_HOST")

@@ -20,6 +20,7 @@ ENV_PROPAGATE_EXITCODE = EnvEntry("CLEARML_AGENT_PROPAGATE_EXITCODE", type=bool,
 ENV_INITIAL_CONNECT_RETRY_OVERRIDE = EnvEntry(
     'CLEARML_AGENT_INITIAL_CONNECT_RETRY_OVERRIDE', default=True, converter=safe_text_to_bool
 )
+ENV_FORCE_MAX_API_VERSION = EnvEntry("CLEARML_AGENT_FORCE_MAX_API_VERSION", type=str)

 """
 Experimental option to set the request method for all API requests and auth login.
@@ -16,11 +16,11 @@ from requests.auth import HTTPBasicAuth
 from six.moves.urllib.parse import urlparse, urlunparse

 from clearml_agent.external.pyhocon import ConfigTree, ConfigFactory

 from .callresult import CallResult
 from .defs import (
     ENV_VERBOSE, ENV_HOST, ENV_ACCESS_KEY, ENV_SECRET_KEY, ENV_WEB_HOST, ENV_FILES_HOST, ENV_AUTH_TOKEN,
-    ENV_NO_DEFAULT_SERVER, ENV_DISABLE_VAULT_SUPPORT, ENV_INITIAL_CONNECT_RETRY_OVERRIDE, ENV_API_DEFAULT_REQ_METHOD, )
+    ENV_NO_DEFAULT_SERVER, ENV_DISABLE_VAULT_SUPPORT, ENV_INITIAL_CONNECT_RETRY_OVERRIDE, ENV_API_DEFAULT_REQ_METHOD,
+    ENV_FORCE_MAX_API_VERSION)
 from .request import Request, BatchRequest
 from .token_manager import TokenManager
 from ..config import load

@@ -28,7 +28,6 @@ from ..utils import get_http_session_with_retry, urllib_log_warning_setup
 from ...backend_config.environment import backward_compatibility_support
 from ...version import __version__

-
 sys_random = SystemRandom()


@@ -64,6 +63,7 @@ class Session(TokenManager):
     default_files = "https://demofiles.demo.clear.ml"
     default_key = "EGRTCO8JMSIGI6S39GTP43NFWXDQOW"
     default_secret = "x!XTov_G-#vspE*Y(h$Anm&DIc5Ou-F)jsl$PdOyj5wG1&E!Z8"
+    force_max_api_version = ENV_FORCE_MAX_API_VERSION.get()

     # TODO: add requests.codes.gateway_timeout once we support async commits
     _retry_codes = [

@@ -199,6 +199,12 @@ class Session(TokenManager):
             # notice: this is across the board warning omission
             urllib_log_warning_setup(total_retries=http_retries_config.get('total', 0), display_warning_after=3)

+        if self.force_max_api_version and self.check_min_api_version(self.force_max_api_version):
+            print("Using forced API version {}".format(self.force_max_api_version))
+            Session.max_api_version = Session.api_version = str(self.force_max_api_version)
+
+        self.pre_vault_config = None
+
     def _setup_session(self, http_retries_config, initial_session=False, default_initial_connect_override=None):
         # type: (dict, bool, Optional[bool]) -> (dict, requests.Session)
         http_retries_config = http_retries_config or self.config.get(

@@ -250,7 +256,11 @@ class Session(TokenManager):
         def parse(vault):
             # noinspection PyBroadException
             try:
-                d = vault.get('data', None)
+                print("Loaded {} vault: {}".format(
+                    vault.get("scope", ""),
+                    (vault.get("description", None) or "")[:50] or vault.get("id", ""))
+                )
+                d = vault.get("data", None)
                 if d:
                     r = ConfigFactory.parse_string(d)
                     if isinstance(r, (ConfigTree, dict)):

@@ -266,6 +276,7 @@ class Session(TokenManager):
             vaults = res.json().get("data", {}).get("vaults", [])
             data = list(filter(None, map(parse, vaults)))
             if data:
+                self.pre_vault_config = self.config.copy()
                 self.config.set_overrides(*data)
                 return True
             elif res.status_code != 404:
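Note that the new CLEARML_AGENT_FORCE_MAX_API_VERSION entry is read into a class attribute, so it appears to be evaluated when the Session class is defined; to take effect it would need to be set in the environment before the agent's session module is imported. A minimal sketch (the version value is illustrative):

```python
import os

# Cap the negotiated server API version before clearml_agent is imported.
# "2.20" is an example value, not a recommendation.
os.environ["CLEARML_AGENT_FORCE_MAX_API_VERSION"] = "2.20"
```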
@@ -86,7 +86,10 @@ def get_http_session_with_retry(
     session = requests.Session()

     if backoff_max is not None:
-        Retry.BACKOFF_MAX = backoff_max
+        if "BACKOFF_MAX" in vars(Retry):
+            Retry.BACKOFF_MAX = backoff_max
+        else:
+            Retry.DEFAULT_BACKOFF_MAX = backoff_max

     retry = Retry(
         total=total, connect=connect, read=read, redirect=redirect, status=status,
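This guards against an API change in urllib3: newer releases replace the Retry.BACKOFF_MAX class attribute with DEFAULT_BACKOFF_MAX, so assigning to the old name unconditionally would break there. The same probe can be reproduced standalone:

```python
from urllib3.util.retry import Retry

# Older urllib3 exposes Retry.BACKOFF_MAX; newer releases renamed it to
# Retry.DEFAULT_BACKOFF_MAX. Probing the class dict avoids version parsing.
attr = "BACKOFF_MAX" if "BACKOFF_MAX" in vars(Retry) else "DEFAULT_BACKOFF_MAX"
print("this urllib3 uses Retry.{}".format(attr))
```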
@@ -297,6 +297,9 @@ class Config(object):
     def put(self, key, value):
         self._config.put(key, value)

+    def pop(self, key, default=None):
+        return self._config.pop(key, default=default)
+
     def to_dict(self):
         return self._config.as_plain_ordered_dict()
@@ -1,69 +1,8 @@
-import base64
-from distutils.util import strtobool
-from typing import Union, Optional, Any, TypeVar, Callable, Tuple
-
-import six
-
-try:
-    from typing import Text
-except ImportError:
-    # windows conda-less hack
-    Text = Any
-
-
-ConverterType = TypeVar("ConverterType", bound=Callable[[Any], Any])
-
-
-def text_to_int(value, default=0):
-    # type: (Any, int) -> int
-    try:
-        return int(value)
-    except (ValueError, TypeError):
-        return default
-
-
-def base64_to_text(value):
-    # type: (Any) -> Text
-    return base64.b64decode(value).decode("utf-8")
-
-
-def text_to_bool(value):
-    # type: (Text) -> bool
-    return bool(strtobool(value))
-
-
-def safe_text_to_bool(value):
-    # type: (Text) -> bool
-    try:
-        return text_to_bool(value)
-    except ValueError:
-        return bool(value)
-
-
-def any_to_bool(value):
-    # type: (Optional[Union[int, float, Text]]) -> bool
-    if isinstance(value, six.text_type):
-        return text_to_bool(value)
-    return bool(value)
-
-
-def or_(*converters, **kwargs):
-    # type: (ConverterType, Tuple[Exception, ...]) -> ConverterType
-    """
-    Wrapper that implements an "optional converter" pattern. Allows specifying a converter
-    for which a set of exceptions is ignored (and the original value is returned)
-    :param converters: A converter callable
-    :param exceptions: A tuple of exception types to ignore
-    """
-    # noinspection PyUnresolvedReferences
-    exceptions = kwargs.get("exceptions", (ValueError, TypeError))
-
-    def wrapper(value):
-        for converter in converters:
-            try:
-                return converter(value)
-            except exceptions:
-                pass
-        return value
-
-    return wrapper
+from clearml_agent.helper.environment.converters import (
+    base64_to_text,
+    text_to_bool,
+    text_to_int,
+    safe_text_to_bool,
+    any_to_bool,
+    or_,
+)
@@ -1,111 +1,6 @@
-import abc
-from typing import Optional, Any, Tuple, Callable, Dict
-
-import six
-
-from .converters import any_to_bool
-
-try:
-    from typing import Text
-except ImportError:
-    # windows conda-less hack
-    Text = Any
-
-
-NotSet = object()
-
-Converter = Callable[[Any], Any]
-
-
-@six.add_metaclass(abc.ABCMeta)
-class Entry(object):
-    """
-    Configuration entry definition
-    """
-
-    @classmethod
-    def default_conversions(cls):
-        # type: () -> Dict[Any, Converter]
-        return {
-            bool: any_to_bool,
-            six.text_type: lambda s: six.text_type(s).strip(),
-        }
-
-    def __init__(self, key, *more_keys, **kwargs):
-        # type: (Text, Text, Any) -> None
-        """
-        :param key: Entry's key (at least one).
-        :param more_keys: More alternate keys for this entry.
-        :param type: Value type. If provided, will be used choosing a default conversion or
-            (if none exists) for casting the environment value.
-        :param converter: Value converter. If provided, will be used to convert the environment value.
-        :param default: Default value. If provided, will be used as the default value on calls to get() and get_pair()
-            in case no value is found for any key and no specific default value was provided in the call.
-            Default value is None.
-        :param help: Help text describing this entry
-        """
-        self.keys = (key,) + more_keys
-        self.type = kwargs.pop("type", six.text_type)
-        self.converter = kwargs.pop("converter", None)
-        self.default = kwargs.pop("default", None)
-        self.help = kwargs.pop("help", None)
-
-    def __str__(self):
-        return str(self.key)
-
-    @property
-    def key(self):
-        return self.keys[0]
-
-    def convert(self, value, converter=None):
-        # type: (Any, Converter) -> Optional[Any]
-        converter = converter or self.converter
-        if not converter:
-            converter = self.default_conversions().get(self.type, self.type)
-        return converter(value)
-
-    def get_pair(self, default=NotSet, converter=None, value_cb=None):
-        # type: (Any, Converter, Callable[[str, Any], None]) -> Optional[Tuple[Text, Any]]
-        for key in self.keys:
-            value = self._get(key)
-            if value is NotSet:
-                continue
-            try:
-                value = self.convert(value, converter)
-            except Exception as ex:
-                self.error("invalid value {key}={value}: {ex}".format(**locals()))
-                break
-            # noinspection PyBroadException
-            try:
-                if value_cb:
-                    value_cb(key, value)
-            except Exception:
-                pass
-            return key, value
-
-        result = self.default if default is NotSet else default
-        return self.key, result
-
-    def get(self, default=NotSet, converter=None, value_cb=None):
-        # type: (Any, Converter, Callable[[str, Any], None]) -> Optional[Any]
-        return self.get_pair(default=default, converter=converter, value_cb=value_cb)[1]
-
-    def set(self, value):
-        # type: (Any, Any) -> (Text, Any)
-        # key, _ = self.get_pair(default=None, converter=None)
-        for k in self.keys:
-            self._set(k, str(value))
-
-    def _set(self, key, value):
-        # type: (Text, Text) -> None
-        pass
-
-    @abc.abstractmethod
-    def _get(self, key):
-        # type: (Text) -> Any
-        pass
-
-    @abc.abstractmethod
-    def error(self, message):
-        # type: (Text) -> None
-        pass
+from clearml_agent.helper.environment import Entry, NotSet
+
+__all__ = [
+    "Entry",
+    "NotSet"
+]
@@ -1,32 +1,6 @@
-from os import getenv, environ
-
-from .converters import text_to_bool
-from .entry import Entry, NotSet
-
-
-class EnvEntry(Entry):
-    @classmethod
-    def default_conversions(cls):
-        conversions = super(EnvEntry, cls).default_conversions().copy()
-        conversions[bool] = text_to_bool
-        return conversions
-
-    def pop(self):
-        for k in self.keys:
-            environ.pop(k, None)
-
-    def _get(self, key):
-        value = getenv(key, "").strip()
-        return value or NotSet
-
-    def _set(self, key, value):
-        environ[key] = value
-
-    def __str__(self):
-        return "env:{}".format(super(EnvEntry, self).__str__())
-
-    def error(self, message):
-        print("Environment configuration: {}".format(message))
+from os import environ
+
+from clearml_agent.helper.environment import EnvEntry


 def backward_compatibility_support():

@@ -34,6 +8,7 @@ def backward_compatibility_support():
     if ENVIRONMENT_BACKWARD_COMPATIBLE.get():
         # Add TRAINS_ prefix on every CLEARML_ os environment we support
         for k, v in ENVIRONMENT_CONFIG.items():
+            # noinspection PyBroadException
             try:
                 trains_vars = [var for var in v.vars if var.startswith('CLEARML_')]
                 if not trains_vars:

@@ -44,6 +19,7 @@ def backward_compatibility_support():
             except:
                 continue
     for k, v in ENVIRONMENT_SDK_PARAMS.items():
+        # noinspection PyBroadException
         try:
             trains_vars = [var for var in v if var.startswith('CLEARML_')]
             if not trains_vars:

@@ -62,3 +38,9 @@ def backward_compatibility_support():
             backwards_k = k.replace('CLEARML_', 'TRAINS_', 1)
             if backwards_k not in keys:
                 environ[backwards_k] = environ[k]
+
+
+__all__ = [
+    "EnvEntry",
+    "backward_compatibility_support"
+]
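These modules now just re-export the implementations consolidated under clearml_agent.helper.environment. For reference, a typical entry is declared and read like this; the variable names below are hypothetical, not real agent entries:

```python
from clearml_agent.helper.environment import EnvEntry

# Hypothetical entry: the first matching environment variable wins,
# and the raw string is converted according to the declared type.
ENV_EXAMPLE_FLAG = EnvEntry(
    "CLEARML_EXAMPLE_FLAG", "TRAINS_EXAMPLE_FLAG", type=bool, default=False
)

print(ENV_EXAMPLE_FLAG.get())  # False unless CLEARML_EXAMPLE_FLAG is set, e.g. to "1"
```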
@@ -31,7 +31,8 @@ def apply_environment(config):
     keys = list(filter(None, env_vars.keys()))

     for key in keys:
-        os.environ[str(key)] = str(env_vars[key] or "")
+        value = env_vars[key]
+        os.environ[str(key)] = str(value if value is not None else "")

     return keys

@@ -52,6 +53,7 @@ def apply_files(config):
         target_fmt = data.get("target_format", "string")
         overwrite = bool(data.get("overwrite", True))
         contents = data.get("contents")
+        mode = data.get("mode")

         target = Path(expanduser(expandvars(path)))

@@ -110,3 +112,14 @@ def apply_files(config):
         except Exception as ex:
             print("Skipped [{}]: failed saving file {} ({})".format(key, target, ex))
             continue
+
+        try:
+            if mode:
+                if isinstance(mode, int):
+                    mode = int(str(mode), 8)
+                else:
+                    mode = int(mode, 8)
+                target.chmod(mode)
+        except Exception as ex:
+            print("Skipped [{}]: failed setting mode {} for {} ({})".format(key, mode, target, ex))
+            continue
@@ -44,7 +44,7 @@ def main():

     if conf_file.exists() and conf_file.is_file() and conf_file.stat().st_size > 0:
         print('Configuration file already exists: {}'.format(str(conf_file)))
-        print('Leaving setup, feel free to edit the configuration file.')
+        print('Leaving setup. If you\'ve previously initialized the ClearML SDK on this machine, manually add an \'agent\' section to this file.')
         return

     print(description, end='')
@@ -2,6 +2,7 @@ from __future__ import print_function

 import json
 import time
+from typing import List, Tuple

 from clearml_agent.commands.base import ServiceCommandSection
 from clearml_agent.helper.base import return_list

@@ -57,6 +58,42 @@ class Events(ServiceCommandSection):
         # print('Sending events done: %d / %d events sent' % (sent_events, len(list_events)))
         return sent_events

+    def send_log_events_with_timestamps(
+        self, worker_id, task_id, lines_with_ts: List[Tuple[str, str]], level="DEBUG", session=None
+    ):
+        log_events = []
+
+        # break log lines into event packets
+        for ts, line in return_list(lines_with_ts):
+            # HACK ignore terminal reset ANSI code
+            if line == '\x1b[0m':
+                continue
+            while line:
+                if len(line) <= self.max_event_size:
+                    msg = line
+                    line = None
+                else:
+                    msg = line[:self.max_event_size]
+                    line = line[self.max_event_size:]
+
+                log_events.append(
+                    {
+                        "type": "log",
+                        "level": level,
+                        "task": task_id,
+                        "worker": worker_id,
+                        "msg": msg,
+                        "timestamp": ts,
+                    }
+                )
+
+                if line and ts is not None:
+                    # advance timestamp in case we break a line to more than one part
+                    ts += 1
+
+        # now send the events
+        return self.send_events(list_events=log_events, session=session)
+
     def send_log_events(self, worker_id, task_id, lines, level='DEBUG', session=None):
         log_events = []
         base_timestamp = int(time.time() * 1000)
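A hypothetical call to the new method is sketched below, assuming `events` is an Events instance constructed the way the agent already does elsewhere; each tuple carries its own millisecond timestamp, and lines longer than max_event_size are split into consecutive events with the timestamp advanced by 1 ms per chunk:

```python
# Hypothetical usage of the new API (assumes `events` is an existing Events instance):
events.send_log_events_with_timestamps(
    worker_id="machine-a:gpu0",
    task_id="abc123",
    lines_with_ts=[
        (1700000000000, "epoch 1/10 started"),
        (1700000000500, "epoch 1/10 finished"),
    ],
    level="INFO",
)
```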
@@ -109,15 +109,15 @@ def resolve_default_container(session, task_id, container_config):
                     match.get('script.binary', None), entry))
                 continue

-        if match.get('container', None):
-            # noinspection PyBroadException
-            try:
-                if not re.search(match.get('container', None), requested_container.get('image', '')):
-                    continue
-            except Exception:
-                print('Failed parsing regular expression \"{}\" in rule: {}'.format(
-                    match.get('container', None), entry))
-                continue
+        # if match.get('image', None):
+        #     # noinspection PyBroadException
+        #     try:
+        #         if not re.search(match.get('image', None), requested_container.get('image', '')):
+        #             continue
+        #     except Exception:
+        #         print('Failed parsing regular expression \"{}\" in rule: {}'.format(
+        #             match.get('image', None), entry))
+        #         continue

         matched = True
         for req_section in ['script.requirements.pip', 'script.requirements.conda']:

@@ -156,8 +156,8 @@ def resolve_default_container(session, task_id, container_config):
                 break

         if matched:
-            if not container_config.get('container'):
-                container_config['container'] = entry.get('image', None)
+            if not container_config.get('image'):
+                container_config['image'] = entry.get('image', None)
             if not container_config.get('arguments'):
                 container_config['arguments'] = entry.get('arguments', None)
             container_config['arguments'] = shlex.split(str(container_config.get('arguments') or '').strip())
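The final shlex.split() normalization means a rule's arguments may be written as a single string and still reach docker as a proper argv list:

```python
import shlex

# How the arguments normalization above behaves:
print(shlex.split("--ipc=host -e FOO='bar baz'"))
# -> ['--ipc=host', '-e', 'FOO=bar baz']
print(shlex.split(""))  # empty/missing arguments collapse to []
```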
@@ -12,6 +12,7 @@ import shlex
 import shutil
 import signal
 import string
+import socket
 import subprocess
 import sys
 import traceback

@@ -24,7 +25,7 @@ from functools import partial
 from os.path import basename
 from tempfile import mkdtemp, NamedTemporaryFile
 from time import sleep, time
-from typing import Text, Optional, Any, Tuple, List
+from typing import Text, Optional, Any, Tuple, List, Dict, Mapping, Union

 import attr
 import six

@@ -40,6 +41,7 @@ from clearml_agent.backend_api.session import CallResult, Request
 from clearml_agent.backend_api.session.defs import (
     ENV_ENABLE_ENV_CONFIG_SECTION, ENV_ENABLE_FILES_CONFIG_SECTION,
     ENV_VENV_CONFIGURED, ENV_PROPAGATE_EXITCODE, )
+from clearml_agent.backend_config import Config
 from clearml_agent.backend_config.defs import UptimeConf
 from clearml_agent.backend_config.utils import apply_environment, apply_files
 from clearml_agent.backend_config.converters import text_to_int

@@ -71,6 +73,12 @@ from clearml_agent.definitions import (
     ENV_DOCKER_ARGS_FILTERS,
     ENV_FORCE_SYSTEM_SITE_PACKAGES,
     ENV_SERVICES_DOCKER_RESTART,
+    ENV_CONFIG_BC_IN_STANDALONE,
+    ENV_FORCE_DOCKER_AGENT_REPO,
+    ENV_EXTRA_DOCKER_LABELS,
+    ENV_AGENT_FORCE_CODE_DIR,
+    ENV_AGENT_FORCE_EXEC_SCRIPT,
+    ENV_TEMP_STDOUT_FILE_DIR,
 )
 from clearml_agent.definitions import WORKING_REPOSITORY_DIR, PIP_EXTRA_INDICES
 from clearml_agent.errors import (
@@ -316,6 +324,37 @@ def get_next_task(session, queue, get_task_info=False):
     return data


+def get_task_fields(session, task_id, fields: list, log=None) -> dict:
+    """
+    Returns dict with the values of the requested Task fields
+    (nested fields are specified using "." as a separator)
+    """
+    result = session.send_request(
+        service='tasks',
+        action='get_all',
+        json={'id': [task_id], 'only_fields': list(fields), 'search_hidden': True},
+        method=Request.def_method,
+        async_enable=False,
+    )
+    # noinspection PyBroadException
+    try:
+        results = {}
+        result = result.json()['data']['tasks'][0]
+        for field in fields:
+            cur = result
+            for part in field.split("."):
+                if part.isdigit():
+                    cur = cur[part]
+                else:
+                    cur = cur.get(part, {})
+            results[field] = cur
+        return results
+    except Exception as ex:
+        if log:
+            log.error("Failed obtaining values for task fields {}: {}", fields, ex)
+        pass
+    return {}
+
+
 def get_task_container(session, task_id):
     """
     Returns dict with Task docker container setup {container: '', arguments: '', setup_shell_script: ''}
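The dotted-path walk in get_task_fields, in isolation: digit parts are looked up directly as indices, everything else goes through .get(part, {}) so a missing field degrades to an empty dict instead of raising:

```python
# Standalone illustration of the field traversal used by get_task_fields():
task = {"hyperparams": {"properties": {"user_key": {"value": "42"}}}}

cur = task
for part in "hyperparams.properties.user_key.value".split("."):
    cur = cur[part] if part.isdigit() else cur.get(part, {})
print(cur)  # -> "42"
```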
@@ -333,20 +372,25 @@ def get_task_container(session, task_id):
             container = result.json()['data']['tasks'][0]['container'] if result.ok else {}
             if container.get('arguments'):
                 container['arguments'] = shlex.split(str(container.get('arguments')).strip())
+            if container.get('image'):
+                container['image'] = container.get('image').strip()
         except (ValueError, TypeError):
             container = {}
     else:
         response = get_task(session, task_id, only_fields=["execution.docker_cmd"])
-        task_docker_cmd_parts = shlex.split(str(response.execution.docker_cmd or '').strip())
-        try:
-            container = dict(
-                container=task_docker_cmd_parts[0],
-                arguments=task_docker_cmd_parts[1:] if len(task_docker_cmd_parts[0]) > 1 else ''
-            )
-        except (ValueError, TypeError):
-            container = {}
+        container = {}
+        if response.execution:
+            task_docker_cmd_parts = shlex.split(str(response.execution.docker_cmd or '').strip())
+            if task_docker_cmd_parts:
+                try:
+                    container = dict(
+                        image=task_docker_cmd_parts[0],
+                        arguments=task_docker_cmd_parts[1:] if len(task_docker_cmd_parts[0]) > 1 else ''
+                    )
+                except (ValueError, TypeError):
+                    pass

-    if (not container or not container.get('container')) and session.check_min_api_version("2.13"):
+    if (not container or not container.get('image')) and session.check_min_api_version("2.13"):
         container = resolve_default_container(session=session, task_id=task_id, container_config=container)

     return container

@@ -461,19 +505,7 @@ class TaskStopSignal(object):
             return True

         # check if abort callback is turned on
-        cb_completed = None
-        # TODO: add retries on network error with timeout
-        try:
-            task_info = self.session.get(
-                service="tasks", action="get_all", version="2.13", id=[self.task_id],
-                only_fields=["status", "status_message", "runtime._abort_callback_timeout",
-                             "runtime._abort_poll_freq", "runtime._abort_callback_completed"])
-            abort_timeout = task_info['tasks'][0]['runtime'].get('_abort_callback_timeout', 0)
-            poll_timeout = task_info['tasks'][0]['runtime'].get('_abort_poll_freq', 0)
-            cb_completed = task_info['tasks'][0]['runtime'].get('_abort_callback_completed', None)
-        except:  # noqa
-            abort_timeout = None
-            poll_timeout = None
+        abort_timeout, poll_timeout, cb_completed = self._get_abort_callback_stat()

         if not abort_timeout:
             # no callback set we can leave
@@ -496,8 +528,39 @@ class TaskStopSignal(object):
             self._active_callback_timeout = timeout
         return bool(cb_completed)

-    def was_abort_function_called(self):
-        return bool(self._active_callback_timestamp)
+    def _get_abort_callback_stat(self):
+        # TODO: add retries on network error with timeout
+        try:
+            task_info = self.session.get(
+                service="tasks", action="get_all", version="2.13", id=[self.task_id],
+                only_fields=["status", "status_message", "runtime._abort_callback_timeout",
+                             "runtime._abort_poll_freq", "runtime._abort_callback_completed"])
+            abort_timeout = task_info['tasks'][0]['runtime'].get('_abort_callback_timeout', 0)
+            poll_timeout = task_info['tasks'][0]['runtime'].get('_abort_poll_freq', 0)
+            cb_completed = task_info['tasks'][0]['runtime'].get('_abort_callback_completed', None)
+        except:  # noqa
+            abort_timeout = None
+            poll_timeout = None
+            cb_completed = None
+
+        return abort_timeout, poll_timeout, cb_completed
+
+    def was_abort_function_called(self, process_error_code=None):
+        if not self._support_callback:
+            return False
+
+        if self._active_callback_timestamp:
+            return True
+
+        # if the process error code is SIGKILL (exit code 137) -
+        # check the runtime info of the Task - it might have killed itself because it was aborted
+        if process_error_code in (-9, 137):
+            # check if abort callback is turned on
+            _, _, cb_completed = self._get_abort_callback_stat()
+            if cb_completed:
+                return True
+
+        return False

     def _test(self):
         # type: () -> TaskStopReason
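The two magic numbers cover both ways a SIGKILL death is reported: Python's subprocess uses the negative signal number (-9), while shell-style exit codes report 128 + signum = 137. A quick check:

```python
import signal

# SIGKILL is signal 9; subprocess.returncode reports it as -9,
# while shells (and docker) report 128 + 9 = 137.
print(int(signal.SIGKILL))        # 9
print(128 + int(signal.SIGKILL))  # 137
```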
@@ -597,6 +660,8 @@ class Worker(ServiceCommandSection):
|
||||
_docker_fixed_user_cache = '/clearml_agent_cache'
|
||||
_temp_cleanup_list = []
|
||||
|
||||
hostname_task_runtime_prop = "_exec_agent_hostname"
|
||||
|
||||
@property
|
||||
def service(self):
|
||||
""" Worker command service endpoint """
|
||||
@@ -622,9 +687,13 @@ class Worker(ServiceCommandSection):
|
||||
self.log = self._session.get_logger(__name__)
|
||||
self.register_signal_handler()
|
||||
self._worker_registered = False
|
||||
|
||||
self._apply_extra_configuration()
|
||||
|
||||
self.is_conda = is_conda(self._session.config) # type: bool
|
||||
# Add extra index url - system wide
|
||||
extra_url = None
|
||||
# noinspection PyBroadException
|
||||
try:
|
||||
if self._session.config.get("agent.package_manager.extra_index_url", None):
|
||||
extra_url = self._session.config.get("agent.package_manager.extra_index_url", [])
|
||||
@@ -812,6 +881,31 @@ class Worker(ServiceCommandSection):
|
||||
# "Running task '{}'".format(task_id)
|
||||
print(self._task_logging_start_message.format(task_id))
|
||||
task_session = task_session or self._session
|
||||
|
||||
# noinspection PyBroadException
|
||||
try:
|
||||
result = task_session.send_request(
|
||||
service='tasks',
|
||||
action='get_all',
|
||||
version='2.15',
|
||||
method=Request.def_method,
|
||||
json={'id': [task_id], 'only_fields': ["runtime"], 'search_hidden': True}
|
||||
)
|
||||
|
||||
runtime = result.json().get("data", {}).get("tasks", [])[0].get("runtime") or {}
|
||||
runtime[self.hostname_task_runtime_prop] = socket.gethostname()
|
||||
|
||||
res = task_session.send_request(
|
||||
service='tasks', action='edit', method=Request.def_method,
|
||||
json={
|
||||
"task": task_id, "force": True, "runtime": runtime
|
||||
},
|
||||
)
|
||||
if not res.ok:
|
||||
raise Exception("failed setting runtime property")
|
||||
except Exception as ex:
|
||||
print("Warning: failed obtaining/setting hostname for task '{}': {}".format(task_id, ex))
|
||||
|
||||
# set task status to in_progress so we know it was popped from the queue
|
||||
# noinspection PyBroadException
|
||||
try:
|
||||
@@ -821,7 +915,7 @@ class Worker(ServiceCommandSection):
|
||||
return
|
||||
# setup console log
|
||||
temp_stdout_name = safe_mkstemp(
|
||||
suffix=".txt", prefix=".clearml_agent_out.", name_only=True
|
||||
suffix=".txt", prefix=".clearml_agent_out.", name_only=True, dir=(ENV_TEMP_STDOUT_FILE_DIR.get() or None)
|
||||
)
|
||||
# temp_stderr_name = safe_mkstemp(suffix=".txt", prefix=".clearml_agent_err.", name_only=True)
|
||||
temp_stderr_name = None
|
||||
@@ -887,11 +981,21 @@ class Worker(ServiceCommandSection):
|
||||
|
||||
name_format = self._session.config.get('agent.docker_container_name_format', None)
|
||||
if name_format:
|
||||
custom_fields = {}
|
||||
name_format_fields = self._session.config.get('agent.docker_container_name_format_fields', None)
|
||||
if name_format_fields:
|
||||
field_values = get_task_fields(task_session, task_id, name_format_fields.values(), log=self.log)
|
||||
custom_fields = {
|
||||
k: field_values.get(v)
|
||||
for k, v in name_format_fields.items()
|
||||
}
|
||||
|
||||
try:
|
||||
name = name_format.format(
|
||||
task_id=re.sub(r'[^a-zA-Z0-9._-]', '-', task_id),
|
||||
worker_id=re.sub(r'[^a-zA-Z0-9._-]', '-', worker_id),
|
||||
rand_string="".join(sys_random.choice(string.ascii_lowercase) for _ in range(32))
|
||||
rand_string="".join(sys_random.choice(string.ascii_lowercase) for _ in range(32)),
|
||||
**custom_fields,
|
||||
)
|
||||
except Exception as ex:
|
||||
print("Warning: failed generating docker container name: {}".format(ex))
|
||||
@@ -1032,6 +1136,7 @@ class Worker(ServiceCommandSection):
|
||||
if not (result.ok() and result.response):
|
||||
return
|
||||
new_session = copy(session)
|
||||
new_session.config = deepcopy(session.config)
|
||||
new_session.api_client = None
|
||||
new_session.set_auth_token(result.response.token)
|
||||
return new_session
|
||||
@@ -1459,8 +1564,6 @@ class Worker(ServiceCommandSection):
|
||||
return self._resolve_queue_names(queues=queues, create_if_missing=create_if_missing)
|
||||
|
||||
def daemon(self, queues, log_level, foreground=False, docker=False, detached=False, order_fairness=False, **kwargs):
|
||||
self._apply_extra_configuration()
|
||||
|
||||
# check that we have docker command if we need it
|
||||
if docker not in (False, None) and not check_if_command_exists("docker"):
|
||||
raise ValueError("Running in Docker mode, 'docker' command was not found")
|
||||
@@ -1577,6 +1680,7 @@ class Worker(ServiceCommandSection):
|
||||
open_kwargs={
|
||||
"buffering": self._session.config.get("agent.log_files_buffering", 1)
|
||||
},
|
||||
dir=(ENV_TEMP_STDOUT_FILE_DIR.get() or None)
|
||||
)
|
||||
print(
|
||||
"Running CLEARML-AGENT daemon in background mode, writing stdout/stderr to {}".format(
|
||||
@@ -1631,7 +1735,11 @@ class Worker(ServiceCommandSection):
|
||||
if self._session.config.get("agent.crash_on_exception", False):
|
||||
raise e
|
||||
|
||||
crash_file, name = safe_mkstemp(prefix=".clearml_agent-crash", suffix=".log")
|
||||
crash_file, name = safe_mkstemp(
|
||||
prefix=".clearml_agent-crash",
|
||||
suffix=".log",
|
||||
dir=(ENV_TEMP_STDOUT_FILE_DIR.get() or None)
|
||||
)
|
||||
try:
|
||||
with crash_file:
|
||||
crash_file.write(tb)
|
||||
@@ -1916,7 +2024,7 @@ class Worker(ServiceCommandSection):
|
||||
stderr_line_count += report_lines(printed_lines, "stderr")
|
||||
|
||||
# make sure that if the abort function was called, the task is marked as aborted
|
||||
if stop_signal and stop_signal.was_abort_function_called():
|
||||
if stop_signal and stop_signal.was_abort_function_called(status):
|
||||
stop_reason = TaskStopReason.stopped
|
||||
|
||||
return status, stop_reason
|
||||
@@ -2001,19 +2109,26 @@ class Worker(ServiceCommandSection):

def _apply_extra_configuration(self):
# store a few things we updated in runtime (TODO: we should list them somewhere)
agent_config = self._session.config["agent"].copy()
vault_loaded = False
session = self._session
agent_config = session.config["agent"].copy()
agent_config_keys = ["cuda_version", "cudnn_version", "default_python", "worker_id", "worker_name", "debug"]
try:
self._session.load_vaults()
vault_loaded = session.load_vaults()
except Exception as ex:
print("Error: failed applying extra configuration: {}".format(ex))

# merge back
for restore_key in agent_config_keys:
if restore_key in agent_config:
self._session.config["agent"][restore_key] = agent_config[restore_key]
config = session.config

# merge back
if vault_loaded:
for restore_key in agent_config_keys:
if restore_key in agent_config and agent_config[restore_key] != config["agent"].get(restore_key, None):
print("Ignoring vault value for '{}' (agent config takes precedence), using '{}'".format(
restore_key, agent_config[restore_key]
))
config["agent"][restore_key] = agent_config[restore_key]

config = self._session.config
default = config.get("agent.apply_environment", False)
if ENV_ENABLE_ENV_CONFIG_SECTION.get(default=default):
try:
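For context, the merge-back logic above boils down to a simple precedence rule: keys already set in the local agent configuration win over values loaded from a vault. A minimal sketch, with plain dicts standing in for the real Config object (all names and values illustrative):

# Sketch of the merge-back rule above (illustrative values only)
agent_config = {"worker_id": "gpu-01", "default_python": "3.10"}              # snapshot before vaults
config = {"agent": {"worker_id": "vault-worker", "default_python": "3.10"}}   # after vaults were applied
for key, local_value in agent_config.items():
    if config["agent"].get(key) != local_value:
        print("Ignoring vault value for '{}' (agent config takes precedence), using '{}'".format(key, local_value))
        config["agent"][key] = local_value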
@@ -2139,7 +2254,11 @@ class Worker(ServiceCommandSection):
def _build_docker(self, docker, target, task_id, entry_point=None, force_docker=False):

self.temp_config_path = safe_mkstemp(
suffix=".cfg", prefix=".clearml_agent.", text=True, name_only=True
suffix=".cfg",
prefix=".clearml_agent.",
text=True,
name_only=True,
dir=(ENV_TEMP_STDOUT_FILE_DIR.get() or None)
)
if not target:
target = "task_id_{}".format(task_id)
@@ -2295,8 +2414,10 @@ class Worker(ServiceCommandSection):
print("Cloning task id={}".format(task_id))
current_task = self._session.api_client.tasks.get_by_id(
self._session.send_api(
tasks_api.CloneRequest(task=current_task.id,
new_task_name='Clone of {}'.format(current_task.name))
tasks_api.CloneRequest(
task=current_task.id,
new_task_name="Clone of {}".format(current_task.name)
)
).id
)
print("Task cloned, new task id={}".format(current_task.id))
@@ -2304,11 +2425,23 @@ class Worker(ServiceCommandSection):
raise CommandFailedError("Cloning failed")
else:
# make sure this task is not stuck in an execution queue, it shouldn't have been, but just in case.
# noinspection PyBroadException
try:
res = self._session.api_client.tasks.dequeue(task=current_task.id)
if require_queue and res.meta.result_code != 200:
raise ValueError("Execution required enqueued task, "
"but task id={} is not queued.".format(current_task.id))
res = self._session.send_request(
service="tasks", action="dequeue", method=Request.def_method,
json={"task": current_task.id, "new_status": "in_progress"},
)
if require_queue and (not res.ok or res.json().get("data", {}).get("updated", 0) < 1):
raise ValueError(
"Execution required enqueued task, but task id={} is not queued.".format(current_task.id)
)
# Set task status to started to prevent any external monitoring from killing it
self._session.api_client.tasks.started(
task=current_task.id,
status_reason="starting execution soon",
status_message="",
force=True,
)
except Exception:
if require_queue:
raise
@@ -2319,14 +2452,18 @@ class Worker(ServiceCommandSection):
# We expect the same behaviour in case full_monitoring was set, and in case docker mode is used
if full_monitoring or docker is not False:
if full_monitoring:
if not (ENV_WORKER_ID.get() or '').strip():
self._session.config["agent"]["worker_id"] = ''
if not (ENV_WORKER_ID.get() or "").strip():
self._session.config["agent"]["worker_id"] = ""
# make sure we support multiple instances if we need to
self._singleton()
self.temp_config_path = self.temp_config_path or safe_mkstemp(
suffix=".cfg", prefix=".clearml_agent.", text=True, name_only=True
suffix=".cfg",
prefix=".clearml_agent.",
text=True,
name_only=True,
dir=(ENV_TEMP_STDOUT_FILE_DIR.get() or None)
)
self.dump_config(self.temp_config_path)
self.dump_config(filename=self.temp_config_path, config=self._session.pre_vault_config)
self._session._config_file = self.temp_config_path

worker_params = WorkerParams(
@@ -2347,8 +2484,6 @@ class Worker(ServiceCommandSection):
Singleton.close_pid_file()
return status if ENV_PROPAGATE_EXITCODE.get() else 0

self._apply_extra_configuration()

self._session.print_configuration()

# now mark the task as started
@@ -2365,6 +2500,12 @@ class Worker(ServiceCommandSection):

execution = self.get_execution_info(current_task)

if ENV_AGENT_FORCE_EXEC_SCRIPT.get():
entry_point_parts = str(ENV_AGENT_FORCE_EXEC_SCRIPT.get()).split(":", 1)
execution.entry_point = entry_point_parts[-1]
execution.working_dir = entry_point_parts[0] if len(entry_point_parts) > 1 else "."
print("WARNING: Using forced script entry [{}:{}]".format(execution.working_dir, execution.entry_point))

python_ver = self._get_task_python_version(current_task)

freeze = None
@@ -2442,8 +2583,9 @@ class Worker(ServiceCommandSection):
code_folder = self._session.config.get("agent.venvs_dir")
code_folder = Path(os.path.expanduser(os.path.expandvars(code_folder)))
# let's make sure it is clear from previous runs
rm_tree(normalize_path(code_folder, WORKING_REPOSITORY_DIR))
rm_tree(normalize_path(code_folder, WORKING_STANDALONE_DIR))
if not standalone_mode:
rm_tree(normalize_path(code_folder, WORKING_REPOSITORY_DIR))
rm_tree(normalize_path(code_folder, WORKING_STANDALONE_DIR))
if not code_folder.exists():
code_folder.mkdir(parents=True, exist_ok=True)
alternative_code_folder = code_folder.as_posix()
@@ -2461,10 +2603,14 @@ class Worker(ServiceCommandSection):

print("\n")

# either use the venvs base folder for code or the cwd
directory, vcs, repo_info = self.get_repo_info(
execution, current_task, str(venv_folder or alternative_code_folder)
)
# if we force code directory - by definition we do not clone or apply any changes
if ENV_AGENT_FORCE_CODE_DIR.get():
directory, vcs, repo_info = ENV_AGENT_FORCE_CODE_DIR.get(), None, None
else:
# either use the venvs base folder for code or the cwd
directory, vcs, repo_info = self.get_repo_info(
execution, current_task, str(alternative_code_folder or venv_folder)
)

print("\n")

@@ -2634,7 +2780,10 @@ class Worker(ServiceCommandSection):
else:
# store stdout/stderr into file, and send to backend
temp_stdout_fname = log_file or safe_mkstemp(
suffix=".txt", prefix=".clearml_agent_out.", name_only=True
suffix=".txt",
prefix=".clearml_agent_out.",
name_only=True,
dir=(ENV_TEMP_STDOUT_FILE_DIR.get() or None)
)
print("Storing stdout and stderr log into [%s]" % temp_stdout_fname)
exit_code, _ = self._log_command_output(
@@ -2917,7 +3066,7 @@ class Worker(ServiceCommandSection):
# Todo: add support for poetry caching
if not self.poetry.enabled:
# add to cache
if add_venv_folder_cache:
if add_venv_folder_cache and not self._standalone_mode:
print('Adding venv into cache: {}'.format(add_venv_folder_cache))
self.package_api.add_cached_venv(
requirements=[freeze, previous_reqs],
@@ -3515,6 +3664,11 @@ class Worker(ServiceCommandSection):
requirements_manager.translator.enabled = False
print(requirements_manager.replace(contents))

def remove_non_backwards_compatible_entries(self, config: Config):
if not self._standalone_mode or not ENV_CONFIG_BC_IN_STANDALONE.get() or self._session.feature_set == "basic":
return
config.pop("agent.package_manager.pip_version")  # removed due to a breaking change in v1.5.1

def get_docker_config_cmd(self, docker_args, clean_api_credentials=False):
docker_image = str(ENV_DOCKER_IMAGE.get() or
self._session.config.get("agent.default_docker.image", "nvidia/cuda")) \
@@ -3537,6 +3691,7 @@ class Worker(ServiceCommandSection):
DockerArgsSanitizer.sanitize_docker_command(self._session, self._docker_arguments) or ''))

temp_config = deepcopy(self._session.config)
self.remove_non_backwards_compatible_entries(temp_config)
mounted_cache_dir = temp_config.get(
"agent.docker_internal_mounts.sdk_cache", self._docker_fixed_user_cache)
mounted_pip_dl_dir = temp_config.get(
@@ -3587,34 +3742,35 @@ class Worker(ServiceCommandSection):

def _get_docker_config_cmd(self, temp_config, clean_api_credentials=False, **kwargs):
self.debug("Setting up docker config command")
host_cache = Path(os.path.expandvars(
self._session.config["sdk.storage.cache.default_base_dir"])).expanduser().as_posix()

def load_path(field, default=None):
value = self._session.config.get(field, default)
return Path(os.path.expandvars(value)).expanduser().as_posix() if value else None

host_cache = load_path("sdk.storage.cache.default_base_dir")
self.debug("host_cache: {}".format(host_cache))
host_pip_dl = Path(os.path.expandvars(
self._session.config["agent.pip_download_cache.path"])).expanduser().as_posix()

host_pip_dl = load_path("agent.pip_download_cache.path")
self.debug("host_pip_dl: {}".format(host_pip_dl))
host_vcs_cache = Path(os.path.expandvars(
self._session.config["agent.vcs_cache.path"])).expanduser().as_posix()

host_vcs_cache = load_path("agent.vcs_cache.path")
self.debug("host_vcs_cache: {}".format(host_vcs_cache))
host_venvs_cache = Path(os.path.expandvars(
self._session.config["agent.venvs_cache.path"])).expanduser().as_posix() \
if self._session.config.get("agent.venvs_cache.path", None) else None

host_venvs_cache = load_path("agent.venvs_cache.path")
self.debug("host_venvs_cache: {}".format(host_venvs_cache))

host_ssh_cache = self._host_ssh_cache
self.debug("host_ssh_cache: {}".format(host_ssh_cache))

host_apt_cache = Path(os.path.expandvars(self._session.config.get(
"agent.docker_apt_cache", '~/.clearml/apt-cache'))).expanduser().as_posix()
host_apt_cache = load_path("agent.docker_apt_cache", default="~/.clearml/apt-cache")
self.debug("host_apt_cache: {}".format(host_apt_cache))
host_pip_cache = Path(os.path.expandvars(self._session.config.get(
"agent.docker_pip_cache", '~/.clearml/pip-cache'))).expanduser().as_posix()

host_pip_cache = load_path("agent.docker_pip_cache", default="~/.clearml/pip-cache")
self.debug("host_pip_cache: {}".format(host_pip_cache))

if self.poetry.enabled:
host_poetry_cache = Path(os.path.expandvars(self._session.config.get(
"agent.docker_poetry_cache", '~/.clearml/poetry-cache'))).expanduser().as_posix()
else:
host_poetry_cache = None
host_poetry_cache = (
load_path("agent.docker_poetry_cache", "~/.clearml/poetry-cache") if self.poetry.enabled else None
)
self.debug("host_poetry_cache: {}".format(host_poetry_cache))

# make sure all folders are valid
@@ -3674,7 +3830,11 @@ class Worker(ServiceCommandSection):
install_opencv_libs = self._session.config.get("agent.docker_install_opencv_libs", True)

self.temp_config_path = self.temp_config_path or safe_mkstemp(
suffix=".cfg", prefix=".clearml_agent.", text=True, name_only=True
suffix=".cfg",
prefix=".clearml_agent.",
text=True,
name_only=True,
dir=(ENV_TEMP_STDOUT_FILE_DIR.get() or None)
)

mounted_cache_dir = temp_config.get("sdk.storage.cache.default_base_dir")
@@ -3766,6 +3926,60 @@ class Worker(ServiceCommandSection):
pass
return results

@staticmethod
def _resolve_docker_env_args(docker_args):
# type: (List[str]) -> List[str]
"""
Resolve -e / --env docker environment args matching $VAR or ${VAR} from the host environment

:argument docker_args: List of docker argument strings (flags and values)
"""
non_list_args = (
"rm", "read-only", "sig-proxy", "tty", "privileged", "publish-all", "interactive", "init", "help", "detach"
)
non_list_args_single = (
"t", "P", "i", "d",
)

# if no filtering, do nothing
if not docker_args:
return docker_args

args = docker_args[:]
skip_arg = False
for i, cmd in enumerate(docker_args):
if skip_arg and not cmd.startswith("-"):
continue

skip_arg = False

if cmd.startswith("--"):
# jump over single command
if cmd[2:] in non_list_args:
continue
elif cmd.startswith("-"):
# jump over single character non args
if cmd[1:] in non_list_args_single:
continue

# if we are here we have a command to bypass and the list after it
if cmd in ('-e', '--env'):
skip_arg = True
for j in range(i+1, len(args)):
if args[j].startswith("-"):
break

parts = args[j].split("=", 1)
if len(parts) != 2:
continue

args[j] = "{}={}".format(parts[0], os.path.expandvars(parts[1]))

elif cmd.startswith("-"):
skip_arg = True

return args

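A standalone sketch of the expansion rule the new _resolve_docker_env_args helper applies (simplified: it ignores the flag skip-lists above and assumes a single value per -e/--env; all names and values are illustrative):

import os

os.environ["MY_TOKEN"] = "s3cr3t"  # hypothetical host-side variable referenced by a task's docker args

docker_args = ["--network", "host", "-e", "TOKEN=$MY_TOKEN", "-e", "PLAIN=1"]
resolved = []
expand_next = False
for arg in docker_args:
    if expand_next and "=" in arg:
        key, _, value = arg.partition("=")
        arg = "{}={}".format(key, os.path.expandvars(value))  # $VAR / ${VAR} resolved from the host environment
    expand_next = arg in ("-e", "--env")
    resolved.append(arg)

print(resolved)  # ['--network', 'host', '-e', 'TOKEN=s3cr3t', '-e', 'PLAIN=1']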
def _get_docker_cmd(
self,
worker_id, parent_worker_id,
@@ -3829,17 +4043,31 @@ class Worker(ServiceCommandSection):
docker_arguments = list(docker_arguments) \
if isinstance(docker_arguments, (list, tuple)) else [docker_arguments]
docker_arguments = self._filter_docker_args(docker_arguments)
base_cmd += [a for a in docker_arguments if a]
if self._session.config.get("agent.docker_allow_host_environ", None):
docker_arguments = self._resolve_docker_env_args(docker_arguments)

if extra_docker_arguments:
# we always resolve environments in the `extra_docker_arguments` because the admin set them (not users)
extra_docker_arguments = self._resolve_docker_env_args(extra_docker_arguments)
extra_docker_arguments = [extra_docker_arguments] \
if isinstance(extra_docker_arguments, six.string_types) else extra_docker_arguments
base_cmd += [str(a) for a in extra_docker_arguments if a]

# decide on order of docker args when merging overlapping arguments
# from extra_docker_args and the Task's docker_args
base_cmd += DockerArgsSanitizer.merge_docker_args(
config=self._session.config,
task_docker_arguments=docker_arguments,
extra_docker_arguments=extra_docker_arguments
)

# set docker labels
base_cmd += ['-l', self._worker_label.format(worker_id)]
base_cmd += ['-l', self._parent_worker_label.format(parent_worker_id)]

extra_labels = ENV_EXTRA_DOCKER_LABELS.get()
for label in (extra_labels or []):
base_cmd += ['-l', label]

self.debug("Command: {}".format(base_cmd), context="docker")

# check if running inside a kubernetes
@@ -3903,7 +4131,7 @@ class Worker(ServiceCommandSection):

base_cmd += ['-e', 'CLEARML_WORKER_ID='+worker_id, ]
# update the docker image, so the system knows where it runs
base_cmd += ['-e', 'CLEARML_DOCKER_IMAGE={} {}'.format(docker_image, ' '.join(docker_arguments or [])).strip()]
base_cmd += ['-e', 'CLEARML_DOCKER_IMAGE={}'.format(docker_image)]

if env_task_id:
base_cmd += ['-e', 'CLEARML_TASK_ID={}'.format(env_task_id), ]
@@ -3919,9 +4147,13 @@ class Worker(ServiceCommandSection):
if skip_pip_venv_install:
base_cmd += ['-e', '{}={}'.format(ENV_AGENT_SKIP_PIP_VENV_INSTALL.vars[0], skip_pip_venv_install)]

if self._services_mode:
base_cmd += ['-e', 'CLEARML_AGENT_SERVICE_TASK=1']

# if we are running a RC version, install the same version in the docker
# because the default latest, will be a release version (not RC)
specify_version = ''
# noinspection PyBroadException
try:
from clearml_agent.version import __version__
_version_parts = __version__.split('.')
@@ -3930,13 +4162,15 @@ class Worker(ServiceCommandSection):
except:
pass

force_agent_repo = ENV_FORCE_DOCKER_AGENT_REPO.get()

if os.environ.get('FORCE_LOCAL_CLEARML_AGENT_WHEEL'):
local_wheel = os.path.expanduser(os.environ.get('FORCE_LOCAL_CLEARML_AGENT_WHEEL'))
docker_wheel = '/tmp/{}'.format(basename(local_wheel))
base_cmd += ['-v', local_wheel + ':' + docker_wheel]
clearml_agent_wheel = '\"{}\"'.format(docker_wheel)
elif os.environ.get('FORCE_CLEARML_AGENT_REPO'):
clearml_agent_wheel = os.environ.get('FORCE_CLEARML_AGENT_REPO')
elif force_agent_repo:
clearml_agent_wheel = force_agent_repo
else:
# clearml-agent{specify_version}
clearml_agent_wheel = 'clearml-agent{specify_version}'.format(specify_version=specify_version)

clearml_agent/definitions.py
@@ -152,11 +152,14 @@ WORKING_STANDALONE_DIR = "code"
DEFAULT_VCS_CACHE = normalize_path(CONFIG_DIR, "vcs-cache")
PIP_EXTRA_INDICES = []
DEFAULT_PIP_DOWNLOAD_CACHE = normalize_path(CONFIG_DIR, "pip-download-cache")
ENV_PIP_EXTRA_INSTALL_FLAGS = EnvironmentConfig("CLEARML_EXTRA_PIP_INSTALL_FLAGS", type=list)
ENV_DOCKER_IMAGE = EnvironmentConfig("CLEARML_DOCKER_IMAGE", "TRAINS_DOCKER_IMAGE")
ENV_WORKER_ID = EnvironmentConfig("CLEARML_WORKER_ID", "TRAINS_WORKER_ID")
ENV_WORKER_TAGS = EnvironmentConfig("CLEARML_WORKER_TAGS")
ENV_AGENT_SKIP_PIP_VENV_INSTALL = EnvironmentConfig("CLEARML_AGENT_SKIP_PIP_VENV_INSTALL")
ENV_AGENT_SKIP_PYTHON_ENV_INSTALL = EnvironmentConfig("CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL", type=bool)
ENV_AGENT_FORCE_CODE_DIR = EnvironmentConfig("CLEARML_AGENT_FORCE_CODE_DIR")
ENV_AGENT_FORCE_EXEC_SCRIPT = EnvironmentConfig("CLEARML_AGENT_FORCE_EXEC_SCRIPT")
ENV_DOCKER_SKIP_GPUS_FLAG = EnvironmentConfig("CLEARML_DOCKER_SKIP_GPUS_FLAG", "TRAINS_DOCKER_SKIP_GPUS_FLAG")
ENV_AGENT_GIT_USER = EnvironmentConfig("CLEARML_AGENT_GIT_USER", "TRAINS_AGENT_GIT_USER")
ENV_AGENT_GIT_PASS = EnvironmentConfig("CLEARML_AGENT_GIT_PASS", "TRAINS_AGENT_GIT_PASS")
@@ -173,10 +176,15 @@ ENV_DOCKER_HOST_MOUNT = EnvironmentConfig(
)
ENV_VENV_CACHE_PATH = EnvironmentConfig("CLEARML_AGENT_VENV_CACHE_PATH")
ENV_EXTRA_DOCKER_ARGS = EnvironmentConfig("CLEARML_AGENT_EXTRA_DOCKER_ARGS", type=list)
ENV_EXTRA_DOCKER_LABELS = EnvironmentConfig("CLEARML_AGENT_EXTRA_DOCKER_LABELS", type=list)
ENV_DEBUG_INFO = EnvironmentConfig("CLEARML_AGENT_DEBUG_INFO")
ENV_CHILD_AGENTS_COUNT_CMD = EnvironmentConfig("CLEARML_AGENT_CHILD_AGENTS_COUNT_CMD")
ENV_DOCKER_ARGS_FILTERS = EnvironmentConfig("CLEARML_AGENT_DOCKER_ARGS_FILTERS")
ENV_DOCKER_ARGS_HIDE_ENV = EnvironmentConfig("CLEARML_AGENT_DOCKER_ARGS_HIDE_ENV")
ENV_CONFIG_BC_IN_STANDALONE = EnvironmentConfig("CLEARML_AGENT_STANDALONE_CONFIG_BC", type=bool)
""" Maintain backwards compatible configuration when launching in standalone mode """

ENV_FORCE_DOCKER_AGENT_REPO = EnvironmentConfig("FORCE_CLEARML_AGENT_REPO", "CLEARML_AGENT_DOCKER_AGENT_REPO")

ENV_SERVICES_DOCKER_RESTART = EnvironmentConfig("CLEARML_AGENT_SERVICES_DOCKER_RESTART")
"""
@@ -232,6 +240,12 @@ ENV_CUSTOM_BUILD_SCRIPT = EnvironmentConfig("CLEARML_AGENT_CUSTOM_BUILD_SCRIPT")
standard flow.
"""

ENV_PACKAGE_PYTORCH_RESOLVE = EnvironmentConfig("CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE")

ENV_TEMP_STDOUT_FILE_DIR = EnvironmentConfig("CLEARML_AGENT_TEMP_STDOUT_FILE_DIR")

ENV_GIT_CLONE_VERBOSE = EnvironmentConfig("CLEARML_AGENT_GIT_CLONE_VERBOSE", type=bool)


class FileBuffering(IntEnum):
"""
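Of the new definitions, ENV_TEMP_STDOUT_FILE_DIR is the one the worker changes above rely on: it feeds the dir= argument added to the safe_mkstemp() calls. A hedged usage sketch (assuming EnvironmentConfig reads the process environment at .get() time; the path is illustrative):

import os

# Redirect the agent's temporary stdout/stderr and config files to a scratch
# volume instead of the system tmp dir:
os.environ["CLEARML_AGENT_TEMP_STDOUT_FILE_DIR"] = "/mnt/scratch/clearml-agent"
# The worker code above then passes it along, e.g.:
#   safe_mkstemp(..., dir=(ENV_TEMP_STDOUT_FILE_DIR.get() or None))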
15 clearml_agent/glue/daemon.py Normal file
@@ -0,0 +1,15 @@
from threading import Thread
from clearml_agent.session import Session


class K8sDaemon(Thread):

def __init__(self, agent):
super(K8sDaemon, self).__init__(target=self.target)
self.daemon = True
self._agent = agent
self.log = agent.log
self._session: Session = agent._session

def target(self):
pass
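K8sDaemon is only a thin Thread wrapper; concrete daemons override target(). A sketch of the intended subclassing pattern (the real subclass, PendingPodsDaemon, appears later in this diff; the class below is purely illustrative):

from time import sleep

class NoopPollingDaemon(K8sDaemon):
    def target(self):
        # runs on the daemon thread started by the agent
        while True:
            self.log.debug("polling on behalf of agent %s", self._agent.worker_id)
            sleep(30)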
clearml_agent/glue/definitions.py
@@ -1,7 +1,11 @@
from clearml_agent.definitions import EnvironmentConfig
from clearml_agent.helper.environment import EnvEntry

ENV_START_AGENT_SCRIPT_PATH = EnvironmentConfig('CLEARML_K8S_GLUE_START_AGENT_SCRIPT_PATH')
ENV_START_AGENT_SCRIPT_PATH = EnvEntry("CLEARML_K8S_GLUE_START_AGENT_SCRIPT_PATH", default="~/__start_agent__.sh")
"""
Script path to use when creating the bash script to run the agent inside the scheduled pod's docker container.
Script will be appended to the specified file.
"""

ENV_DEFAULT_EXECUTION_AGENT_ARGS = EnvEntry("K8S_GLUE_DEF_EXEC_AGENT_ARGS", default="--full-monitoring --require-queue")
ENV_POD_AGENT_INSTALL_ARGS = EnvEntry("K8S_GLUE_POD_AGENT_INSTALL_ARGS", default="", lstrip=False)
ENV_POD_MONITOR_LOG_BATCH_SIZE = EnvEntry("K8S_GLUE_POD_MONITOR_LOG_BATCH_SIZE", default=5, converter=int)

12 clearml_agent/glue/errors.py Normal file
@@ -0,0 +1,12 @@

class GetPodsError(Exception):
pass


class GetJobsError(Exception):
pass


class GetPodCountError(Exception):
pass

clearml_agent/glue/k8s.py
@@ -9,17 +9,15 @@ import os
import re
import subprocess
import tempfile
from collections import defaultdict
from collections import defaultdict, namedtuple
from copy import deepcopy
from pathlib import Path
from pprint import pformat
from threading import Thread
from time import sleep, time
from typing import Text, List, Callable, Any, Collection, Optional, Union, Iterable, Dict, Tuple, Set

import yaml

from clearml_agent.backend_api.session import Request
from clearml_agent.commands.events import Events
from clearml_agent.commands.worker import Worker, get_task_container, set_task_container, get_next_task
from clearml_agent.definitions import (
@@ -28,29 +26,32 @@ from clearml_agent.definitions import (
ENV_AGENT_GIT_PASS,
ENV_FORCE_SYSTEM_SITE_PACKAGES,
)
from clearml_agent.errors import APIError
from clearml_agent.glue.definitions import ENV_START_AGENT_SCRIPT_PATH
from clearml_agent.errors import APIError, UsageError
from clearml_agent.glue.errors import GetPodCountError
from clearml_agent.glue.utilities import get_path, get_bash_output
from clearml_agent.glue.pending_pods_daemon import PendingPodsDaemon
from clearml_agent.helper.base import safe_remove_file
from clearml_agent.helper.dicts import merge_dicts
from clearml_agent.helper.process import get_bash_output, stringify_bash_output
from clearml_agent.helper.resource_monitor import ResourceMonitor
from clearml_agent.interface.base import ObjectID
from clearml_agent.backend_api.session import Request
from clearml_agent.glue.definitions import (
ENV_START_AGENT_SCRIPT_PATH,
ENV_DEFAULT_EXECUTION_AGENT_ARGS,
ENV_POD_AGENT_INSTALL_ARGS,
)


class K8sIntegration(Worker):
SUPPORTED_KIND = ("pod", "job")
K8S_PENDING_QUEUE = "k8s_scheduler"

K8S_DEFAULT_NAMESPACE = "clearml"
AGENT_LABEL = "CLEARML=agent"
QUEUE_LABEL = "clearml-agent-queue"

KUBECTL_APPLY_CMD = "kubectl apply --namespace={namespace} -f"

KUBECTL_CLEANUP_DELETE_CMD = "kubectl delete pods " \
"-l={agent_label} " \
"--field-selector=status.phase!=Pending,status.phase!=Running " \
"--namespace={namespace} " \
"--output name"

BASH_INSTALL_SSH_CMD = [
"apt-get update",
"apt-get install -y openssh-server",
@@ -66,9 +67,6 @@ class K8sIntegration(Worker):
'echo "ldconfig" >> /etc/profile',
"/usr/sbin/sshd -p {port}"]

DEFAULT_EXECUTION_AGENT_ARGS = os.getenv("K8S_GLUE_DEF_EXEC_AGENT_ARGS", "--full-monitoring --require-queue")
POD_AGENT_INSTALL_ARGS = os.getenv("K8S_GLUE_POD_AGENT_INSTALL_ARGS", "")

CONTAINER_BASH_SCRIPT = [
"export DEBIAN_FRONTEND='noninteractive'",
"echo 'Binary::apt::APT::Keep-Downloaded-Packages \"true\";' > /etc/apt/apt.conf.d/docker-clean",
@@ -81,7 +79,7 @@ class K8sIntegration(Worker):
"[ ! -z $LOCAL_PYTHON ] || apt-get install -y python3-pip",
"[ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3",
"{extra_bash_init_cmd}",
"$LOCAL_PYTHON -m pip install clearml-agent{agent_install_args}",
"[ ! -z $CLEARML_AGENT_NO_UPDATE ] || $LOCAL_PYTHON -m pip install clearml-agent{agent_install_args}",
"{extra_docker_bash_script}",
"$LOCAL_PYTHON -m clearml_agent execute {default_execution_agent_args} --id {task_id}"
]
@@ -108,6 +106,7 @@ class K8sIntegration(Worker):
max_pods_limit=None,
pod_name_prefix=None,
limit_pod_label=None,
force_system_packages=None,
**kwargs
):
"""
@@ -135,12 +134,17 @@ class K8sIntegration(Worker):
:param int max_pods_limit: Maximum number of pods that K8S glue can run at the same time
"""
super(K8sIntegration, self).__init__()
self.kind = os.environ.get("CLEARML_K8S_GLUE_KIND", "pod").strip().lower()
if self.kind not in self.SUPPORTED_KIND:
raise UsageError(f"Kind '{self.kind}' not supported (expected {','.join(self.SUPPORTED_KIND)})")
self.using_jobs = self.kind == "job"
self.pod_name_prefix = pod_name_prefix or self.DEFAULT_POD_NAME_PREFIX
self.limit_pod_label = limit_pod_label or self.DEFAULT_LIMIT_POD_LABEL
self.k8s_pending_queue_name = k8s_pending_queue_name or self.K8S_PENDING_QUEUE
self.k8s_pending_queue_id = None
self.container_bash_script = container_bash_script or self.CONTAINER_BASH_SCRIPT
force_system_packages = ENV_FORCE_SYSTEM_SITE_PACKAGES.get()
if force_system_packages is None:
force_system_packages = ENV_FORCE_SYSTEM_SITE_PACKAGES.get()
self._force_system_site_packages = force_system_packages if force_system_packages is not None else True
if self._force_system_site_packages:
# Use system packages, because we will be running inside a docker
@@ -179,11 +183,18 @@ class K8sIntegration(Worker):

self._agent_label = None

self._monitor_hanging_pods()
self._pending_pods_daemon = self._create_daemon_instance(
cls_=PendingPodsDaemon,
polling_interval=self._polling_interval
)
self._pending_pods_daemon.start()

self._min_cleanup_interval_per_ns_sec = 1.0
self._last_pod_cleanup_per_ns = defaultdict(lambda: 0.)

def _create_daemon_instance(self, cls_, **kwargs):
return cls_(agent=self, **kwargs)

def _load_overrides_yaml(self, overrides_yaml):
if not overrides_yaml:
return
@@ -209,26 +220,33 @@ class K8sIntegration(Worker):
self.log.warning('Removing containers section: {}'.format(overrides['spec'].pop('containers')))
self.overrides_json_string = json.dumps(overrides)

def _monitor_hanging_pods(self):
_check_pod_thread = Thread(target=self._monitor_hanging_pods_daemon)
_check_pod_thread.daemon = True
_check_pod_thread.start()

@staticmethod
def _load_template_file(path):
with open(os.path.expandvars(os.path.expanduser(str(path))), 'rt') as f:
return yaml.load(f, Loader=getattr(yaml, 'FullLoader', None))

def _get_kubectl_options(self, command, extra_labels=None, filters=None, output="json", labels=None):
# type: (str, Iterable[str], Iterable[str], str, Iterable[str]) -> Dict
if not labels:
@staticmethod
def _get_path(d, *path, default=None):
try:
return functools.reduce(
lambda a, b: a[b], path, d
)
except (IndexError, KeyError):
return default

def _get_kubectl_options(self, command, extra_labels=None, filters=None, output="json", labels=None, ns=None):
# type: (str, Iterable[str], Iterable[str], str, Iterable[str], str) -> Dict
if labels is False:
labels = []
elif not labels:
labels = [self._get_agent_label()]
labels = list(labels) + (list(extra_labels) if extra_labels else [])
d = {
"-l": ",".join(labels),
"-n": str(self.namespace),
"-n": ns or str(self.namespace),
"-o": output,
}
if labels:
d["-l"] = ",".join(labels)
if filters:
d["--field-selector"] = ",".join(filters)
return d
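The _get_path helper above walks nested dicts/lists without raising on missing keys. A standalone copy with illustrative pod data:

import functools

def get_path(d, *path, default=None):
    # same logic as K8sIntegration._get_path above
    try:
        return functools.reduce(lambda a, b: a[b], path, d)
    except (IndexError, KeyError):
        return default

pod = {"status": {"conditions": [{"reason": "Unschedulable", "message": "0/3 nodes are available"}]}}
print(get_path(pod, "status", "conditions", 0, "reason"))            # Unschedulable
print(get_path(pod, "status", "containerStatuses", 0, default={}))   # {} (path missing)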
@@ -239,132 +257,6 @@ class K8sIntegration(Worker):
command=command, opts=" ".join(x for item in opts.items() for x in item)
)

def _monitor_hanging_pods_daemon(self):
last_tasks_msgs = {}  # last msg updated for every task

while True:
kubectl_cmd = self.get_kubectl_command("get pods", filters=["status.phase=Pending"])
self.log.debug("Detecting hanging pods: {}".format(kubectl_cmd))
output = stringify_bash_output(get_bash_output(kubectl_cmd))
try:
output_config = json.loads(output)
except Exception as ex:
self.log.warning('K8S Glue pods monitor: Failed parsing kubectl output:\n{}\nEx: {}'.format(output, ex))
sleep(self._polling_interval)
continue
pods = output_config.get('items', [])
task_id_to_details = dict()
for pod in pods:
pod_name = pod.get('metadata', {}).get('name', None)
if not pod_name:
continue

task_id = pod_name.rpartition('-')[-1]
if not task_id:
continue

namespace = pod.get('metadata', {}).get('namespace', None)
if not namespace:
continue

task_id_to_details[task_id] = (pod_name, namespace)

msg = None

waiting = self._get_path(pod, 'status', 'containerStatuses', 0, 'state', 'waiting')
if not waiting:
condition = self._get_path(pod, 'status', 'conditions', 0)
if condition:
reason = condition.get('reason')
if reason == 'Unschedulable':
message = condition.get('message')
msg = reason + (" ({})".format(message) if message else "")
else:
reason = waiting.get("reason", None)
message = waiting.get("message", None)

msg = reason + (" ({})".format(message) if message else "")

if reason == 'ImagePullBackOff':
delete_pod_cmd = 'kubectl delete pods {} -n {}'.format(pod_name, namespace)
self.log.debug(" - deleting pod due to ImagePullBackOff: {}".format(delete_pod_cmd))
get_bash_output(delete_pod_cmd)
try:
self.log.debug(" - Detecting hanging pods: {}".format(kubectl_cmd))
self._session.api_client.tasks.failed(
task=task_id,
status_reason="K8S glue error: {}".format(msg),
status_message="Changed by K8S glue",
force=True
)
except Exception as ex:
self.log.warning(
'K8S Glue pods monitor: Failed deleting task "{}"\nEX: {}'.format(task_id, ex)
)

# clean up any msg for this task
last_tasks_msgs.pop(task_id, None)
continue
if msg and last_tasks_msgs.get(task_id, None) != msg:
try:
result = self._session.send_request(
service='tasks',
action='update',
json={"task": task_id, "status_message": "K8S glue status: {}".format(msg)},
method=Request.def_method,
async_enable=False,
)
if not result.ok:
result_msg = self._get_path(result.json(), 'meta', 'result_msg')
raise Exception(result_msg or result.text)

# update last msg for this task
last_tasks_msgs[task_id] = msg
except Exception as ex:
self.log.warning(
'K8S Glue pods monitor: Failed setting status message for task "{}"\nMSG: {}\nEX: {}'.format(
task_id, msg, ex
)
)

if task_id_to_details:
try:
result = self._session.get(
service='tasks',
action='get_all',
json={"id": list(task_id_to_details), "status": ["stopped"], "only_fields": ["id"]},
method=Request.def_method,
async_enable=False,
)
aborted_task_ids = list(filter(None, (task.get("id") for task in result["tasks"])))

for task_id in aborted_task_ids:
pod_name, namespace = task_id_to_details.get(task_id)
if not pod_name:
self.log.error("Failed locating aborted task {} in pending pods list".format(task_id))
continue
self.log.info(
"K8S Glue pods monitor: task {} was aborted by its pod {} is still pending, "
|
||||
"deleting pod".format(task_id, pod_name)
|
||||
)
|
||||
|
||||
kubectl_cmd = "kubectl delete pod {pod_name} --output name {namespace}".format(
|
||||
namespace=f"--namespace={namespace}" if namespace else "", pod_name=pod_name,
|
||||
).strip()
|
||||
self.log.debug("Deleting aborted task pending pod: {}".format(kubectl_cmd))
|
||||
output = stringify_bash_output(get_bash_output(kubectl_cmd))
|
||||
if not output:
|
||||
self.log.warning("K8S Glue pods monitor: failed deleting pod {}".format(pod_name))
|
||||
except Exception as ex:
|
||||
self.log.warning(
|
||||
'K8S Glue pods monitor: failed checking aborted tasks for hanging pods: {}'.format(ex)
|
||||
)
|
||||
|
||||
# clean up any last message for a task that wasn't seen as a pod
|
||||
last_tasks_msgs = {k: v for k, v in last_tasks_msgs.items() if k in task_id_to_details}
|
||||
|
||||
sleep(self._polling_interval)
|
||||
|
||||
def _set_task_user_properties(self, task_id: str, task_session=None, **properties: str):
session = task_session or self._session
if self._edit_hyperparams_support is not True:
@@ -408,34 +300,50 @@ class K8sIntegration(Worker):

return self._agent_label

def _get_used_pods(self):
# type: () -> Tuple[int, Set[str]]
# noinspection PyBroadException
RunningPod = namedtuple("RunningPod", "name queue namespace")

def _get_running_pods(self):
try:
kubectl_cmd = self.get_kubectl_command(
"get pods",
output="jsonpath=\"{range .items[*]}{.metadata.name}{' '}{.metadata.namespace}{'\\n'}{end}\""
output="jsonpath=\"{{range .items[*]}}{{.metadata.name}}{{' '}}{{.metadata.namespace}}{{' '}}"
"{{.metadata.labels.{}}}{{'\\n'}}{{end}}\"".format(self.QUEUE_LABEL)
)
self.log.debug("Getting used pods: {}".format(kubectl_cmd))
output = stringify_bash_output(get_bash_output(kubectl_cmd, raise_error=True))

if not output:
# No such pods exist, so we can use the pod_number we found
return 0, set([])
return []

try:
items = output.splitlines()
current_pod_count = len(items)
namespaces = {item.rpartition(" ")[-1] for item in items}
self.log.debug(" - found {} pods in namespaces {}".format(current_pod_count, ", ".join(namespaces)))
except (KeyError, ValueError, TypeError, AttributeError) as ex:
print("Failed parsing used pods command response for cleanup: {}".format(ex))
return -1, set([])
return [
self.RunningPod(
name=parts[0],
namespace=parts[1],
queue=parts[2]
)
for parts in (line.split(" ") for line in output.splitlines())
]
except Exception as ex:
raise Exception("Failed parsing used pods command response for cleanup: {}".format(ex))
except Exception as ex:
raise Exception('Failed obtaining used pods information: {}'.format(ex))

def _get_used_pods(self):
# type: () -> Tuple[int, Set[str]]
# noinspection PyBroadException
try:
items = self._get_running_pods()
if not items:
return 0, set([])
current_pod_count = len(items)
namespaces = {item.namespace for item in items}
self.log.debug(" - found {} pods in namespaces {}".format(current_pod_count, ", ".join(namespaces)))
return current_pod_count, namespaces
except Exception as ex:
print('Failed obtaining used pods information: {}'.format(ex))
return -2, set([])
self.log.debug("Failed getting used pods: {}", ex)
return -1, set([])

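The jsonpath output parsed by the new _get_running_pods is one "name namespace queue-label" triple per line. An illustrative round trip (fabricated kubectl output):

from collections import namedtuple

RunningPod = namedtuple("RunningPod", "name queue namespace")

output = "clearml-id-abc123 default training\nclearml-id-def456 gpu-ns services"
pods = [
    RunningPod(name=parts[0], namespace=parts[1], queue=parts[2])
    for parts in (line.split(" ") for line in output.splitlines())
]
print({pod.namespace for pod in pods})  # {'default', 'gpu-ns'}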
def _is_same_tenant(self, task_session):
if not task_session or task_session is self._session:
@@ -448,6 +356,73 @@ class K8sIntegration(Worker):
except Exception as ex:
print("ERROR: Failed getting tenant for task session: {}".format(ex))

def get_jobs_info(self, info_path: str, condition: str = None, namespace=None, debug_msg: str = None)\
-> Dict[str, str]:
cond = "==".join((x.strip("=") for x in condition.partition("=")[::2]))
output = f"jsonpath='{{range .items[?(@.{cond})]}}{{@.{info_path}}}{{\" \"}}{{@.metadata.namespace}}{{\"\\n\"}}{{end}}'"
kubectl_cmd = self.get_kubectl_command("get job", output=output, ns=namespace)
if debug_msg:
self.log.debug(debug_msg.format(cmd=kubectl_cmd))
output = stringify_bash_output(get_bash_output(kubectl_cmd))
output = output.strip("'")  # for Windows debugging :(
try:
data_items = dict(l.strip().partition(" ")[::2] for l in output.splitlines())
return data_items
except Exception as ex:
self.log.warning('Failed parsing kubectl output:\n{}\nEx: {}'.format(output, ex))

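For example, get_jobs_info("metadata.name", condition="status.succeeded=1") renders a kubectl jsonpath filter and then parses each "name namespace" line into a dict. A sketch of just the parsing step (fabricated output):

# kubectl would print one line per matching job: "<name> <namespace>"
output = "clearml-id-abc123 default\nclearml-id-def456 gpu-ns"
jobs = dict(line.strip().partition(" ")[::2] for line in output.splitlines())
print(jobs)  # {'clearml-id-abc123': 'default', 'clearml-id-def456': 'gpu-ns'}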
def get_pods_for_jobs(self, job_condition: str = None, pod_filters: List[str] = None, debug_msg: str = None):
controller_uids = self.get_jobs_info(
"spec.selector.matchLabels.controller-uid", condition=job_condition, debug_msg=debug_msg
)
if not controller_uids:
# No pods were found for these jobs
return []
pods = self.get_pods(filters=pod_filters, debug_msg=debug_msg)
return [
pod for pod in pods
if get_path(pod, "metadata", "labels", "controller-uid") in controller_uids
]

def get_pods(self, filters: List[str] = None, debug_msg: str = None):
kubectl_cmd = self.get_kubectl_command(
"get pods",
filters=filters,
labels=False if self.using_jobs else None,
)
if debug_msg:
self.log.debug(debug_msg.format(cmd=kubectl_cmd))
output = stringify_bash_output(get_bash_output(kubectl_cmd))
try:
output_config = json.loads(output)
except Exception as ex:
self.log.warning('Failed parsing kubectl output:\n{}\nEx: {}'.format(output, ex))
return
return output_config.get('items', [])

def _get_pod_count(self, extra_labels: List[str] = None, msg: str = None):
kubectl_cmd_new = self.get_kubectl_command(
f"get {self.kind}s",
extra_labels=extra_labels
)
self.log.debug("{}{}".format((msg + ": ") if msg else "", kubectl_cmd_new))
process = subprocess.Popen(kubectl_cmd_new.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
output, error = process.communicate()
output = stringify_bash_output(output)
error = stringify_bash_output(error)

try:
return len(json.loads(output).get("items", []))
except (ValueError, TypeError) as ex:
self.log.warning(
"K8S Glue pods monitor: Failed parsing kubectl output:\n{}\nEx: {}".format(output, ex)
)
raise GetPodCountError()

def resource_applied(self, resource_name: str, namespace: str, task_id: str, session):
""" Called when a resource (pod/job) was applied """
pass

def run_one_task(self, queue: Text, task_id: Text, worker_args=None, task_session=None, **_):
print('Pulling task {} launching on kubernetes cluster'.format(task_id))
session = task_session or self._session
@@ -457,7 +432,13 @@ class K8sIntegration(Worker):
if self._is_same_tenant(task_session):
try:
print('Pushing task {} into temporary pending queue'.format(task_id))
_ = session.api_client.tasks.stop(task_id, force=True)
_ = session.api_client.tasks.stop(task_id, force=True, status_reason="moving to k8s pending queue")

# Just make sure to clean up in case the task is stuck in the queue (known issue)
self._session.api_client.queues.remove_task(
task=task_id,
queue=self.k8s_pending_queue_id,
)

res = self._session.api_client.tasks.enqueue(
task_id,
@@ -498,14 +479,14 @@ class K8sIntegration(Worker):

hocon_config_encoded = config_content.encode("ascii")

create_clearml_conf = ["echo '{}' | base64 --decode >> ~/clearml.conf".format(
clearml_conf_create_script = ["echo '{}' | base64 --decode >> ~/clearml.conf".format(
base64.b64encode(
hocon_config_encoded
).decode('ascii')
)]

if task_session:
create_clearml_conf.append(
clearml_conf_create_script.append(
"export CLEARML_AUTH_TOKEN=$(echo '{}' | base64 --decode)".format(
base64.b64encode(task_session.token.encode("ascii")).decode('ascii')
)
@@ -526,23 +507,15 @@ class K8sIntegration(Worker):
while self.ports_mode or self.max_pods_limit:
pod_number = self.base_pod_num + pod_count

kubectl_cmd_new = self.get_kubectl_command(
"get pods",
extra_labels=[self.limit_pod_label.format(pod_number=pod_number)] if self.ports_mode else None
)
self.log.debug("Looking for a free pod/port: {}".format(kubectl_cmd_new))
process = subprocess.Popen(kubectl_cmd_new.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
output, error = process.communicate()
output = stringify_bash_output(output)
error = stringify_bash_output(error)

try:
items_count = len(json.loads(output).get("items", []))
except (ValueError, TypeError) as ex:
items_count = self._get_pod_count(
extra_labels=[self.limit_pod_label.format(pod_number=pod_number)] if self.ports_mode else None,
msg="Looking for a free pod/port"
)
except GetPodCountError:
self.log.warning(
"K8S Glue pods monitor: Failed parsing kubectl output:\n{}\ntask '{}' "
"will be enqueued back to queue '{}'\nEx: {}".format(
output, task_id, queue, ex
"K8S Glue pods monitor: task '{}' will be enqueued back to queue '{}'".format(
task_id, queue
)
)
session.api_client.tasks.stop(task_id, force=True)
@@ -566,8 +539,6 @@ class K8sIntegration(Worker):

if current_pod_count >= max_count:
# All pods are taken, exit
self.log.debug(
"kubectl last result: {}\n{}".format(error, output))
self.log.warning(
"All k8s services are in use, task '{}' "
"will be enqueued back to queue '{}'".format(
@@ -608,25 +579,34 @@ class K8sIntegration(Worker):
except (KeyError, TypeError, AttributeError):
namespace = self.namespace

if template:
output, error = self._kubectl_apply(
template=template,
pod_number=pod_number,
create_clearml_conf=create_clearml_conf,
labels=labels,
docker_image=container['image'],
docker_args=container['arguments'],
docker_bash=container.get('setup_shell_script'),
task_id=task_id,
queue=queue,
namespace=namespace,
)
if not template:
print("ERROR: no template for task {}, skipping".format(task_id))
return

output, error, pod_name = self._kubectl_apply(
template=template,
pod_number=pod_number,
clearml_conf_create_script=clearml_conf_create_script,
labels=labels,
docker_image=container['image'],
docker_args=container.get('arguments'),
docker_bash=container.get('setup_shell_script'),
task_id=task_id,
queue=queue,
namespace=namespace,
)

print('kubectl output:\n{}\n{}'.format(error, output))
if error:
send_log = "Running kubectl encountered an error: {}".format(error)
self.log.error(send_log)
self.send_logs(task_id, send_log.splitlines())
return

if pod_name:
self.resource_applied(
resource_name=pod_name, namespace=namespace, task_id=task_id, session=session
)

user_props = {"k8s-queue": str(queue_name)}
if self.ports_mode:
@@ -657,8 +637,8 @@ class K8sIntegration(Worker):
def _get_pod_labels(self, queue, queue_name):
return [
self._get_agent_label(),
"clearml-agent-queue={}".format(self._safe_k8s_label_value(queue)),
"clearml-agent-queue-name={}".format(self._safe_k8s_label_value(queue_name))
"{}={}".format(self.QUEUE_LABEL, self._safe_k8s_label_value(queue)),
"{}-name={}".format(self.QUEUE_LABEL, self._safe_k8s_label_value(queue_name))
]

def _get_docker_args(self, docker_args, flags, target=None, convert=None):
@@ -687,32 +667,13 @@ class K8sIntegration(Worker):
return {target: results} if results else {}
return results

def _kubectl_apply(
self,
create_clearml_conf,
docker_image,
docker_args,
docker_bash,
labels,
queue,
task_id,
namespace,
template=None,
pod_number=None
):
template.setdefault('apiVersion', 'v1')
template['kind'] = 'Pod'
template.setdefault('metadata', {})
name = self.pod_name_prefix + str(task_id)
template['metadata']['name'] = name
template.setdefault('spec', {})
template['spec'].setdefault('containers', [])
template['spec'].setdefault('restartPolicy', 'Never')
if labels:
labels_dict = dict(pair.split('=', 1) for pair in labels)
template['metadata'].setdefault('labels', {})
template['metadata']['labels'].update(labels_dict)
def get_task_worker_id(self, template, task_id, pod_name, namespace, queue):
return f"{self.worker_id}:{task_id}"

def _create_template_container(
self, pod_name: str, task_id: str, docker_image: str, docker_args: List[str],
docker_bash: str, clearml_conf_create_script: List[str], task_worker_id: str
) -> dict:
container = self._get_docker_args(
docker_args,
target="env",
@@ -720,6 +681,16 @@ class K8sIntegration(Worker):
convert=lambda env: {'name': env.partition("=")[0], 'value': env.partition("=")[2]},
)

# Set worker ID
env_vars = container.get('env', [])
found_worker_id = False
for entry in env_vars:
if entry.get('name') == 'CLEARML_WORKER_ID':
entry['value'] = task_worker_id
found_worker_id = True
if not found_worker_id:
container['env'] = env_vars + [{'name': 'CLEARML_WORKER_ID', 'value': task_worker_id}]

container_bash_script = [self.container_bash_script] if isinstance(self.container_bash_script, str) \
else self.container_bash_script

@@ -732,11 +703,11 @@ class K8sIntegration(Worker):
[line.format(extra_bash_init_cmd=self.extra_bash_init_script or '',
task_id=task_id,
extra_docker_bash_script=extra_docker_bash_script,
default_execution_agent_args=self.DEFAULT_EXECUTION_AGENT_ARGS,
agent_install_args=self.POD_AGENT_INSTALL_ARGS)
default_execution_agent_args=ENV_DEFAULT_EXECUTION_AGENT_ARGS.get(),
agent_install_args=ENV_POD_AGENT_INSTALL_ARGS.get())
for line in container_bash_script])

extra_bash_commands = list(create_clearml_conf or [])
extra_bash_commands = list(clearml_conf_create_script or [])

start_agent_script_path = ENV_START_AGENT_SCRIPT_PATH.get() or "~/__start_agent__.sh"

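To see how CONTAINER_BASH_SCRIPT is consumed: each line is .format()-ed with task-specific values, then the lines are joined into the container's single bash -c argument. A sketch with illustrative values (a reduced script, not the agent's actual call site):

container_bash_script = [
    "{extra_bash_init_cmd}",
    "$LOCAL_PYTHON -m clearml_agent execute {default_execution_agent_args} --id {task_id}",
]
rendered = [
    line.format(
        extra_bash_init_cmd="echo init",
        task_id="abc123",
        default_execution_agent_args="--full-monitoring --require-queue",
    )
    for line in container_bash_script
]
print(" ; ".join(rendered) + " ; exit 0")  # always exit 0, so pods are never restarted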
@@ -750,20 +721,82 @@ class K8sIntegration(Worker):
)

# Notice: we always leave with exit code 0, so pods are never restarted
container = self._merge_containers(
return self._merge_containers(
container,
dict(name=name, image=docker_image,
dict(name=pod_name, image=docker_image,
command=['/bin/bash'],
args=['-c', '{} ; exit 0'.format(' ; '.join(extra_bash_commands))])
)

if template['spec']['containers']:
template['spec']['containers'][0] = self._merge_containers(template['spec']['containers'][0], container)
def _kubectl_apply(
self,
clearml_conf_create_script: List[str],
docker_image,
docker_args,
docker_bash,
labels,
queue,
task_id,
namespace,
template,
pod_number=None
):
if "apiVersion" not in template:
template["apiVersion"] = "batch/v1" if self.using_jobs else "v1"
if "kind" in template:
if template["kind"].lower() != self.kind:
return (
"",
f"Template kind {template['kind']} does not maych kind {self.kind.capitalize()} set for agent",
|
||||
None
)
else:
template['spec']['containers'].append(container)
template["kind"] = self.kind.capitalize()

metadata = template.setdefault('metadata', {})
name = self.pod_name_prefix + str(task_id)
metadata['name'] = name

def place_labels(metadata_dict):
labels_dict = dict(pair.split('=', 1) for pair in labels)
metadata_dict.setdefault('labels', {}).update(labels_dict)

if labels:
# Place labels on base resource (job or single pod)
place_labels(metadata)

spec = template.setdefault('spec', {})
if self.using_jobs:
spec.setdefault('backoffLimit', 0)
spec_template = spec.setdefault('template', {})
if labels:
# Place same labels for any pod spawned by the job
place_labels(spec_template.setdefault('metadata', {}))

spec = spec_template.setdefault('spec', {})

containers = spec.setdefault('containers', [])
spec.setdefault('restartPolicy', 'Never')

task_worker_id = self.get_task_worker_id(template, task_id, name, namespace, queue)

container = self._create_template_container(
pod_name=name,
task_id=task_id,
docker_image=docker_image,
docker_args=docker_args,
docker_bash=docker_bash,
clearml_conf_create_script=clearml_conf_create_script,
task_worker_id=task_worker_id
)

if containers:
containers[0] = self._merge_containers(containers[0], container)
else:
containers.append(container)

if self._docker_force_pull:
for c in template['spec']['containers']:
for c in containers:
c.setdefault('imagePullPolicy', 'Always')

fp, yaml_file = tempfile.mkstemp(prefix='clearml_k8stmpl_', suffix='.yml')
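The net effect of the branch above, for kind == "job", is a Job wrapping a pod template, with the agent labels placed on both. An abbreviated sketch of the resulting structure (field values illustrative; command/args and the clearml.conf bootstrap omitted):

template = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "clearml-id-abc123", "labels": {"CLEARML": "agent"}},
    "spec": {
        "backoffLimit": 0,  # never retry; the container always exits 0 anyway
        "template": {
            "metadata": {"labels": {"CLEARML": "agent"}},
            "spec": {
                "restartPolicy": "Never",
                "containers": [{"name": "clearml-id-abc123", "image": "nvidia/cuda"}],
            },
        },
    },
}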
@@ -789,11 +822,88 @@ class K8sIntegration(Worker):
process = subprocess.Popen(kubectl_cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
output, error = process.communicate()
except Exception as ex:
return None, str(ex)
return None, str(ex), None
finally:
safe_remove_file(yaml_file)

return stringify_bash_output(output), stringify_bash_output(error)
return stringify_bash_output(output), stringify_bash_output(error), name

def _process_bash_lines_response(self, bash_cmd: str, raise_error=True):
res = get_bash_output(bash_cmd, raise_error=raise_error)
lines = [
line for line in
(r.strip().rpartition("/")[-1] for r in res.splitlines())
if line.startswith(self.pod_name_prefix)
]
return lines

def _delete_pods(self, selectors: List[str], namespace: str, msg: str = None) -> List[str]:
kubectl_cmd = \
"kubectl delete pod -l={agent_label} " \
"--namespace={namespace} --field-selector={selector} --output name".format(
selector=",".join(selectors),
agent_label=self._get_agent_label(),
namespace=namespace,
)
self.log.debug("Deleting old/failed pods{} for ns {}: {}".format(
msg or "", namespace, kubectl_cmd
))
lines = self._process_bash_lines_response(kubectl_cmd)
self.log.debug(" - deleted pods %s", ", ".join(lines))
return lines

def _delete_jobs_by_names(self, names_to_ns: Dict[str, str], msg: str = None) -> List[str]:
if not names_to_ns:
return []
ns_to_names = defaultdict(list)
for name, ns in names_to_ns.items():
ns_to_names[ns].append(name)

results = []
for ns, names in ns_to_names.items():
kubectl_cmd = "kubectl delete job --namespace={ns} --output=name {names}".format(
ns=ns, names=" ".join(names)
)
self.log.debug("Deleting jobs {}: {}".format(
msg or "", kubectl_cmd
))
lines = self._process_bash_lines_response(kubectl_cmd)
if not lines:
continue
self.log.debug(" - deleted jobs %s", ", ".join(lines))
results.extend(lines)
return results

def _delete_completed_or_failed_pods(self, namespace, msg: str = None):
if not self.using_jobs:
return self._delete_pods(
selectors=["status.phase!=Pending", "status.phase!=Running"], namespace=namespace, msg=msg
)

job_names_to_delete = {}

# locate failed pods for jobs
failed_pods = self.get_pods_for_jobs(
job_condition="status.active=1",
pod_filters=["status.phase!=Pending", "status.phase!=Running", "status.phase!=Terminating"],
debug_msg="Deleting failed pods: {cmd}"
)
if failed_pods:
job_names_to_delete = {
get_path(pod, "metadata", "labels", "job-name"): get_path(pod, "metadata", "namespace")
for pod in failed_pods
if get_path(pod, "metadata", "labels", "job-name")
}
self.log.debug(f" - found jobs with failed pods: {' '.join(job_names_to_delete)}")

completed_job_names = self.get_jobs_info(
"metadata.name", condition="status.succeeded=1", namespace=namespace, debug_msg=msg
)
if completed_job_names:
self.log.debug(f" - found completed jobs: {' '.join(completed_job_names)}")
job_names_to_delete.update(completed_job_names)

return self._delete_jobs_by_names(names_to_ns=job_names_to_delete, msg=msg)

def _cleanup_old_pods(self, namespaces, extra_msg=None):
|
||||
# type: (Iterable[str], Optional[str]) -> Dict[str, List[str]]
|
||||
@@ -803,23 +913,12 @@ class K8sIntegration(Worker):
|
||||
if time() - self._last_pod_cleanup_per_ns[namespace] < self._min_cleanup_interval_per_ns_sec:
|
||||
# Do not try to cleanup the same namespace too quickly
|
||||
continue
|
||||
kubectl_cmd = self.KUBECTL_CLEANUP_DELETE_CMD.format(
|
||||
namespace=namespace, agent_label=self._get_agent_label()
|
||||
)
|
||||
self.log.debug("Deleting old/failed pods{} for ns {}: {}".format(
|
||||
extra_msg or "", namespace, kubectl_cmd
|
||||
))
|
||||
|
||||
try:
|
||||
res = get_bash_output(kubectl_cmd, raise_error=True)
|
||||
lines = [
|
||||
line for line in
|
||||
(r.strip().rpartition("/")[-1] for r in res.splitlines())
|
||||
if line.startswith(self.pod_name_prefix)
|
||||
]
|
||||
self.log.debug(" - deleted pod(s) %s", ", ".join(lines))
|
||||
deleted_pods[namespace].extend(lines)
|
||||
res = self._delete_completed_or_failed_pods(namespace, extra_msg)
|
||||
deleted_pods[namespace].extend(res)
|
||||
except Exception as ex:
|
||||
self.log.error("Failed deleting old/failed pods for ns %s: %s", namespace, str(ex))
|
||||
self.log.error("Failed deleting completed/failed pods for ns %s: %s", namespace, str(ex))
|
||||
finally:
|
||||
self._last_pod_cleanup_per_ns[namespace] = time()
|
||||
|
||||
@@ -840,7 +939,7 @@ class K8sIntegration(Worker):
|
||||
)
|
||||
tasks_to_abort = result["tasks"]
|
||||
except Exception as ex:
|
||||
self.log.warning('Failed getting running tasks for deleted pods: {}'.format(ex))
|
||||
self.log.warning('Failed getting running tasks for deleted {}(s): {}'.format(self.kind, ex))
|
||||
|
||||
for task in tasks_to_abort:
|
||||
task_id = task.get("id")
|
||||
@@ -853,15 +952,27 @@ class K8sIntegration(Worker):
|
||||
self._session.get(
|
||||
service='tasks',
|
||||
action='dequeue',
|
||||
json={"task": task_id, "force": True, "status_reason": "Pod deleted (not pending or running)",
|
||||
"status_message": "Pod deleted by agent {}".format(self.worker_id or "unknown")},
|
||||
json={
|
||||
"task": task_id,
|
||||
"force": True,
|
||||
"status_reason": "Pod deleted (not pending or running)",
|
||||
"status_message": "{} deleted by agent {}".format(
|
||||
self.kind.capitalize(), self.worker_id or "unknown"
|
||||
)
|
||||
},
|
||||
method=Request.def_method,
|
||||
)
|
||||
self._session.get(
|
||||
service='tasks',
|
||||
action='failed',
|
||||
json={"task": task_id, "force": True, "status_reason": "Pod deleted (not pending or running)",
|
||||
"status_message": "Pod deleted by agent {}".format(self.worker_id or "unknown")},
|
||||
json={
|
||||
"task": task_id,
|
||||
"force": True,
|
||||
"status_reason": "Pod deleted (not pending or running)",
|
||||
"status_message": "{} deleted by agent {}".format(
|
||||
self.kind.capitalize(), self.worker_id or "unknown"
|
||||
)
|
||||
},
|
||||
method=Request.def_method,
|
||||
)
|
||||
except Exception as ex:
|
||||
@@ -902,10 +1013,10 @@ class K8sIntegration(Worker):
|
||||
# check if have pod limit, then check if we hit it.
|
||||
if self.max_pods_limit:
|
||||
if current_pods >= self.max_pods_limit:
|
||||
print("Maximum pod limit reached {}/{}, sleeping for {:.1f} seconds".format(
|
||||
current_pods, self.max_pods_limit, self._polling_interval))
|
||||
print("Maximum {} limit reached {}/{}, sleeping for {:.1f} seconds".format(
|
||||
self.kind, current_pods, self.max_pods_limit, self._polling_interval))
|
||||
# delete old completed / failed pods
|
||||
self._cleanup_old_pods(namespaces, " due to pod limit")
|
||||
self._cleanup_old_pods(namespaces, f" due to {self.kind} limit")
|
||||
# go to sleep
|
||||
sleep(self._polling_interval)
|
||||
continue
|
||||
@@ -913,7 +1024,7 @@ class K8sIntegration(Worker):
|
||||
# iterate over queues (priority style, queues[0] is highest)
|
||||
for queue in queues:
|
||||
# delete old completed / failed pods
|
||||
self._cleanup_old_pods(namespaces)
|
||||
self._cleanup_old_pods(namespaces, extra_msg="Cleanup cycle {cmd}")
|
||||
|
||||
# get next task in queue
|
||||
try:
|
||||
|
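A note on the job-deletion flow above: `kubectl delete ... --output=name` prints one `<kind>/<name>` token per line, and the agent keeps only names carrying its own prefix (the removed code filtered on `self.pod_name_prefix`). A minimal standalone sketch of that parsing — the helper name and the `clearml-id-` prefix are assumptions for the demo:

```python
def parse_kubectl_name_output(output: str, name_prefix: str = "clearml-id-") -> list:
    # "kubectl delete ... --output=name" prints one "<kind>/<name>" per line,
    # e.g. "job.batch/clearml-id-abc123"; keep only the bare resource names
    return [
        name for name in
        (line.strip().rpartition("/")[-1] for line in output.splitlines())
        if name.startswith(name_prefix)
    ]

print(parse_kubectl_name_output("job.batch/clearml-id-abc123\njob.batch/other-job\n"))
# -> ['clearml-id-abc123']
```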
clearml_agent/glue/pending_pods_daemon.py (new file, 236 lines)
@@ -0,0 +1,236 @@
from time import sleep
from typing import Dict, Tuple, Optional, List

from clearml_agent.backend_api.session import Request
from clearml_agent.glue.utilities import get_bash_output

from clearml_agent.helper.process import stringify_bash_output

from .daemon import K8sDaemon
from .utilities import get_path
from .errors import GetPodsError


class PendingPodsDaemon(K8sDaemon):
    def __init__(self, polling_interval: float, agent):
        super(PendingPodsDaemon, self).__init__(agent=agent)
        self._polling_interval = polling_interval
        self._last_tasks_msgs = {}  # last msg updated for every task

    def get_pods(self, pod_name=None):
        filters = ["status.phase=Pending"]
        if pod_name:
            filters.append(f"metadata.name={pod_name}")

        if self._agent.using_jobs:
            return self._agent.get_pods_for_jobs(
                job_condition="status.active=1", pod_filters=filters, debug_msg="Detecting pending pods: {cmd}"
            )
        return self._agent.get_pods(filters=filters, debug_msg="Detecting pending pods: {cmd}")

    def _get_pod_name(self, pod: dict):
        return get_path(pod, "metadata", "name")

    def _get_k8s_resource_name(self, pod: dict):
        if self._agent.using_jobs:
            return get_path(pod, "metadata", "labels", "job-name")
        return get_path(pod, "metadata", "name")

    def _get_task_id(self, pod: dict):
        return self._get_k8s_resource_name(pod).rpartition('-')[-1]

    @staticmethod
    def _get_k8s_resource_namespace(pod: dict):
        return pod.get('metadata', {}).get('namespace', None)

    def target(self):
        """
        Handle pending objects (pods or jobs, depending on the agent mode).
        - Delete any pending objects that are not expected to recover
        - Delete any pending objects for whom the associated task was aborted
        """
        while True:
            # noinspection PyBroadException
            try:
                # Get pods (standalone pods if we're in pods mode, or pods associated to jobs if we're in jobs mode)
                pods = self.get_pods()
                if pods is None:
                    raise GetPodsError()

                task_id_to_pod = dict()

                for pod in pods:
                    pod_name = self._get_pod_name(pod)
                    if not pod_name:
                        continue

                    task_id = self._get_task_id(pod)
                    if not task_id:
                        continue

                    namespace = self._get_k8s_resource_namespace(pod)
                    if not namespace:
                        continue

                    task_id_to_pod[task_id] = pod

                    msg = None
                    tags = []

                    waiting = get_path(pod, 'status', 'containerStatuses', 0, 'state', 'waiting')
                    if not waiting:
                        condition = get_path(pod, 'status', 'conditions', 0)
                        if condition:
                            reason = condition.get('reason')
                            if reason == 'Unschedulable':
                                message = condition.get('message')
                                msg = reason + (" ({})".format(message) if message else "")
                    else:
                        reason = waiting.get("reason", None)
                        message = waiting.get("message", None)

                        msg = reason + (" ({})".format(message) if message else "")

                        if reason == 'ImagePullBackOff':
                            self.delete_k8s_resource(k8s_resource=pod, msg=reason)
                            try:
                                self._session.api_client.tasks.failed(
                                    task=task_id,
                                    status_reason="K8S glue error: {}".format(msg),
                                    status_message="Changed by K8S glue",
                                    force=True
                                )
                                self._agent.send_logs(
                                    task_id, ["K8S Error: {}".format(msg)],
                                    session=self._session
                                )
                            except Exception as ex:
                                self.log.warning(
                                    'K8S Glue pending monitor: Failed deleting task "{}"\nEX: {}'.format(task_id, ex)
                                )

                            # clean up any msg for this task
                            self._last_tasks_msgs.pop(task_id, None)
                            continue

                    self._update_pending_task_msg(task_id, msg, tags)

                if task_id_to_pod:
                    self._process_tasks_for_pending_pods(task_id_to_pod)

                # clean up any last message for a task that wasn't seen as a pod
                self._last_tasks_msgs = {k: v for k, v in self._last_tasks_msgs.items() if k in task_id_to_pod}
            except GetPodsError:
                pass
            except Exception:
                self.log.exception("Hanging pods daemon loop")

            sleep(self._polling_interval)

    def delete_k8s_resource(self, k8s_resource: dict, msg: str = None):
        delete_cmd = "kubectl delete {kind} {name} -n {namespace} --output name".format(
            kind=self._agent.kind,
            name=self._get_k8s_resource_name(k8s_resource),
            namespace=self._get_k8s_resource_namespace(k8s_resource)
        ).strip()
        self.log.debug(" - deleting {} {}: {}".format(self._agent.kind, (" " + msg) if msg else "", delete_cmd))
        return get_bash_output(delete_cmd).strip()

    def _process_tasks_for_pending_pods(self, task_id_to_details: Dict[str, dict]):
        self._handle_aborted_tasks(task_id_to_details)

    def _handle_aborted_tasks(self, pending_tasks_details: Dict[str, dict]):
        try:
            result = self._session.get(
                service='tasks',
                action='get_all',
                json={
                    "id": list(pending_tasks_details),
                    "status": ["stopped"],
                    "only_fields": ["id"]
                }
            )
            aborted_task_ids = list(filter(None, (task.get("id") for task in result["tasks"])))

            for task_id in aborted_task_ids:
                pod = pending_tasks_details.get(task_id)
                if not pod:
                    self.log.error("Failed locating aborted task {} in pending pods list".format(task_id))
                    continue

                pod_name = self._get_pod_name(pod)
                if not self.get_pods(pod_name=pod_name):
                    self.log.debug("K8S Glue pending monitor: pod {} is no longer pending, skipping".format(pod_name))
                    continue

                resource_name = self._get_k8s_resource_name(pod)
                self.log.info(
                    "K8S Glue pending monitor: task {} was aborted but the k8s resource {} is still pending, "
                    "deleting pod".format(task_id, resource_name)
                )

                result = self._session.get(
                    service='tasks',
                    action='get_all',
                    json={"id": [task_id], "status": ["stopped"], "only_fields": ["id"]},
                )
                if not result["tasks"]:
                    self.log.debug("K8S Glue pending monitor: task {} is no longer aborted, skipping".format(task_id))
                    continue

                output = self.delete_k8s_resource(k8s_resource=pod, msg="Pending resource of an aborted task")
                if not output:
                    self.log.warning("K8S Glue pending monitor: failed deleting resource {}".format(resource_name))
        except Exception as ex:
            self.log.warning(
                'K8S Glue pending monitor: failed checking aborted tasks for pending resources: {}'.format(ex)
            )

    def _update_pending_task_msg(self, task_id: str, msg: str, tags: List[str] = None):
        if not msg or self._last_tasks_msgs.get(task_id, None) == (msg, tags):
            return
        try:
            # Make sure the task is queued
            result = self._session.send_request(
                service='tasks',
                action='get_all',
                json={"id": task_id, "only_fields": ["status"]},
                method=Request.def_method,
                async_enable=False,
            )
            if result.ok:
                status = get_path(result.json(), 'data', 'tasks', 0, 'status')
                # if task is in progress, change its status to enqueued
                if status == "in_progress":
                    result = self._session.send_request(
                        service='tasks', action='enqueue',
                        json={
                            "task": task_id, "force": True, "queue": self._agent.k8s_pending_queue_id
                        },
                        method=Request.def_method,
                        async_enable=False,
                    )
                    if not result.ok:
                        result_msg = get_path(result.json(), 'meta', 'result_msg')
                        self.log.debug(
                            "K8S Glue pods monitor: failed forcing task status change"
                            " for pending task {}: {}".format(task_id, result_msg)
                        )

            # Update task status message
            payload = {"task": task_id, "status_message": "K8S glue status: {}".format(msg)}
            if tags:
                payload["tags"] = tags
            result = self._session.send_request('tasks', 'update', json=payload, method=Request.def_method)
            if not result.ok:
                result_msg = get_path(result.json(), 'meta', 'result_msg')
                raise Exception(result_msg or result.text)

            # update last msg for this task
            self._last_tasks_msgs[task_id] = msg
        except Exception as ex:
            self.log.warning(
                'K8S Glue pods monitor: Failed setting status message for task "{}"\nMSG: {}\nEX: {}'.format(
                    task_id, msg, ex
                )
            )
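The daemon above recovers the ClearML task ID from the resource name's last dash-separated token (see `_get_task_id`). A tiny standalone illustration — the `clearml-id-` naming is an assumption based on the prefix used elsewhere in this diff:

```python
def task_id_from_resource_name(resource_name: str) -> str:
    # the agent names pods/jobs with the task ID as the last dash-separated token
    return resource_name.rpartition('-')[-1]

assert task_id_from_resource_name("clearml-id-4f6e1d20ab1f") == "4f6e1d20ab1f"
```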
clearml_agent/glue/utilities.py (new file, 18 lines)
@@ -0,0 +1,18 @@
import functools

from subprocess import DEVNULL

from clearml_agent.helper.process import get_bash_output as _get_bash_output


def get_path(d, *path, default=None):
    try:
        return functools.reduce(
            lambda a, b: a[b], path, d
        )
    except (IndexError, KeyError):
        return default


def get_bash_output(cmd, stderr=DEVNULL, raise_error=False):
    return _get_bash_output(cmd, stderr=stderr, raise_error=raise_error)
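`get_path` is the safe nested-lookup helper used throughout the new glue code: dict keys and list indices can be mixed freely, and missing paths fall back to `default`. A runnable usage sketch (the helper is copied verbatim from above):

```python
import functools

def get_path(d, *path, default=None):  # as defined above
    try:
        return functools.reduce(lambda a, b: a[b], path, d)
    except (IndexError, KeyError):
        return default

pod = {"metadata": {"labels": {"job-name": "clearml-id-abc"}},
       "status": {"conditions": [{"reason": "Unschedulable"}]}}
print(get_path(pod, "metadata", "labels", "job-name"))     # clearml-id-abc
print(get_path(pod, "status", "conditions", 0, "reason"))  # Unschedulable
print(get_path(pod, "status", "missing", default="n/a"))   # n/a
```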
@@ -20,20 +20,22 @@ from typing import Text, Dict, Any, Optional, AnyStr, IO, Union

 import attr
 import furl
+import six
 import yaml
 from attr import fields_dict
 from pathlib2 import Path

-import six
 from six.moves import reduce
-from clearml_agent.external import pyhocon

 from clearml_agent.errors import CommandFailedError
+from clearml_agent.external import pyhocon
 from clearml_agent.helper.dicts import filter_keys

 pretty_lines = False

 log = logging.getLogger(__name__)

+use_powershell = os.getenv("CLEARML_AGENT_USE_POWERSHELL", None)


 def which(cmd, path=None):
     result = find_executable(cmd, path)
@@ -52,7 +54,7 @@ def select_for_platform(linux, windows):


 def bash_c():
-    return 'bash -c' if not is_windows_platform() else 'cmd /c'
+    return 'bash -c' if not is_windows_platform() else ('powershell -Command' if use_powershell else 'cmd /c')


 def return_list(arg):
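The `bash_c()` change lets Windows workers opt into PowerShell via the `CLEARML_AGENT_USE_POWERSHELL` environment variable. A standalone sketch of the selection logic, with `is_windows_platform()` stubbed for the demo:

```python
import os

def is_windows_platform():  # stub for the example
    return os.name == "nt"

use_powershell = os.getenv("CLEARML_AGENT_USE_POWERSHELL", None)

def bash_c():
    return 'bash -c' if not is_windows_platform() else (
        'powershell -Command' if use_powershell else 'cmd /c')

print(bash_c() + ' "echo hello"')
```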
@@ -17,6 +17,30 @@ if TYPE_CHECKING:
|
||||
from clearml_agent.session import Session
|
||||
|
||||
|
||||
def sanitize_urls(s: str) -> Tuple[str, bool]:
|
||||
"""
|
||||
Replaces passwords in URLs with asterisks.
|
||||
Returns the sanitized string and a boolean indicating whether sanitation was performed.
|
||||
"""
|
||||
regex = re.compile("^([^:]*:)[^@]+(.*)$")
|
||||
tokens = re.split(r"\s", s)
|
||||
changed = False
|
||||
for k in range(len(tokens)):
|
||||
if "@" in tokens[k]:
|
||||
res = urlparse(tokens[k])
|
||||
if regex.match(res.netloc):
|
||||
changed = True
|
||||
tokens[k] = urlunparse((
|
||||
res.scheme,
|
||||
regex.sub("\\1********\\2", res.netloc),
|
||||
res.path,
|
||||
res.params,
|
||||
res.query,
|
||||
res.fragment
|
||||
))
|
||||
return " ".join(tokens) if changed else s, changed
|
||||
|
||||
|
||||
class DockerArgsSanitizer:
|
||||
@classmethod
|
||||
def sanitize_docker_command(cls, session, docker_command):
|
||||
@@ -62,11 +86,11 @@ class DockerArgsSanitizer:
|
||||
elif key in keys:
|
||||
val = "********"
|
||||
elif parse_embedded_urls:
|
||||
val = cls._sanitize_urls(val)[0]
|
||||
val = sanitize_urls(val)[0]
|
||||
result[i + 1] = "{}={}".format(key, val)
|
||||
skip_next = True
|
||||
elif parse_embedded_urls and not item.startswith("-"):
|
||||
item, changed = cls._sanitize_urls(item)
|
||||
item, changed = sanitize_urls(item)
|
||||
if changed:
|
||||
result[i] = item
|
||||
except (KeyError, TypeError):
|
||||
@@ -75,22 +99,71 @@ class DockerArgsSanitizer:
|
||||
return result
|
||||
|
||||
@staticmethod
|
||||
def _sanitize_urls(s: str) -> Tuple[str, bool]:
|
||||
""" Replaces passwords in URLs with asterisks """
|
||||
regex = re.compile("^([^:]*:)[^@]+(.*)$")
|
||||
tokens = re.split(r"\s", s)
|
||||
changed = False
|
||||
for k in range(len(tokens)):
|
||||
if "@" in tokens[k]:
|
||||
res = urlparse(tokens[k])
|
||||
if regex.match(res.netloc):
|
||||
changed = True
|
||||
tokens[k] = urlunparse((
|
||||
res.scheme,
|
||||
regex.sub("\\1********\\2", res.netloc),
|
||||
res.path,
|
||||
res.params,
|
||||
res.query,
|
||||
res.fragment
|
||||
))
|
||||
return " ".join(tokens) if changed else s, changed
|
||||
def get_list_of_switches(docker_args: List[str]) -> List[str]:
|
||||
args = []
|
||||
for token in docker_args:
|
||||
if token.strip().startswith("-"):
|
||||
args += [token.strip().split("=")[0].lstrip("-")]
|
||||
|
||||
return args
|
||||
|
||||
@staticmethod
|
||||
def filter_switches(docker_args: List[str], exclude_switches: List[str]) -> List[str]:
|
||||
# shortcut if we are sure we have no matches
|
||||
if (not exclude_switches or
|
||||
not any("-{}".format(s) in " ".join(docker_args) for s in exclude_switches)):
|
||||
return docker_args
|
||||
|
||||
args = []
|
||||
in_switch_args = True
|
||||
for token in docker_args:
|
||||
if token.strip().startswith("-"):
|
||||
if "=" in token:
|
||||
switch = token.strip().split("=")[0]
|
||||
in_switch_args = False
|
||||
else:
|
||||
switch = token
|
||||
in_switch_args = True
|
||||
|
||||
if switch.lstrip("-") in exclude_switches:
|
||||
# if in excluded, skip the switch and following arguments
|
||||
in_switch_args = False
|
||||
else:
|
||||
args += [token]
|
||||
|
||||
elif in_switch_args:
|
||||
args += [token]
|
||||
else:
|
||||
# this is the switch arguments we need to skip
|
||||
pass
|
||||
|
||||
return args
|
||||
|
||||
@staticmethod
|
||||
def merge_docker_args(config, task_docker_arguments: List[str], extra_docker_arguments: List[str]) -> List[str]:
|
||||
base_cmd = []
|
||||
# currently only resolving --network, --ipc, --privileged
|
||||
override_switches = config.get(
|
||||
"agent.protected_docker_extra_args",
|
||||
["privileged", "security-opt", "network", "ipc"]
|
||||
)
|
||||
|
||||
if config.get("agent.docker_args_extra_precedes_task", True):
|
||||
switches = []
|
||||
if extra_docker_arguments:
|
||||
switches = DockerArgsSanitizer.get_list_of_switches(extra_docker_arguments)
|
||||
switches = list(set(switches) & set(override_switches))
|
||||
base_cmd += [str(a) for a in extra_docker_arguments if a]
|
||||
if task_docker_arguments:
|
||||
docker_arguments = DockerArgsSanitizer.filter_switches(task_docker_arguments, switches)
|
||||
base_cmd += [a for a in docker_arguments if a]
|
||||
else:
|
||||
switches = []
|
||||
if task_docker_arguments:
|
||||
switches = DockerArgsSanitizer.get_list_of_switches(task_docker_arguments)
|
||||
switches = list(set(switches) & set(override_switches))
|
||||
base_cmd += [a for a in task_docker_arguments if a]
|
||||
if extra_docker_arguments:
|
||||
extra_docker_arguments = DockerArgsSanitizer.filter_switches(extra_docker_arguments, switches)
|
||||
base_cmd += [a for a in extra_docker_arguments if a]
|
||||
return base_cmd
|
||||
|
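The new switch handling above is what lets `merge_docker_args` give the admin-side `extra_docker_arguments` precedence over task-provided flags for protected switches (`privileged`, `security-opt`, `network`, `ipc`). The same logic, lifted into plain functions for a runnable demo (the early-exit shortcut is omitted):

```python
from typing import List

def get_list_of_switches(docker_args: List[str]) -> List[str]:
    # collect switch names: "--network=host" -> "network", "--ipc" -> "ipc"
    return [t.strip().split("=")[0].lstrip("-") for t in docker_args if t.strip().startswith("-")]

def filter_switches(docker_args: List[str], exclude_switches: List[str]) -> List[str]:
    args, in_switch_args = [], True
    for token in docker_args:
        if token.strip().startswith("-"):
            if "=" in token:
                switch, in_switch_args = token.strip().split("=")[0], False
            else:
                switch, in_switch_args = token, True
            if switch.lstrip("-") in exclude_switches:
                in_switch_args = False  # drop the switch and its value tokens
            else:
                args.append(token)
        elif in_switch_args:
            args.append(token)
    return args

# the task's "--network bridge" is dropped because the extra (admin) args already set --network
task_args = ["--network", "bridge", "--env", "A=1"]
print(filter_switches(task_args, ["network"]))  # ['--env', 'A=1']
```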
clearml_agent/helper/environment/__init__.py (new file, 8 lines)
@@ -0,0 +1,8 @@
from .entry import Entry, NotSet
from .environment import EnvEntry

__all__ = [
    'Entry',
    'NotSet',
    'EnvEntry',
]
clearml_agent/helper/environment/converters.py (new file, 70 lines)
@@ -0,0 +1,70 @@
import base64
from distutils.util import strtobool
from typing import Union, Optional, Any, TypeVar, Callable, Tuple

import six

try:
    from typing import Text
except ImportError:
    # windows conda-less hack
    Text = Any


ConverterType = TypeVar("ConverterType", bound=Callable[[Any], Any])


def base64_to_text(value):
    # type: (Any) -> Text
    return base64.b64decode(value).decode("utf-8")


def text_to_int(value, default=0):
    # type: (Any, int) -> int
    try:
        return int(value)
    except (ValueError, TypeError):
        return default


def text_to_bool(value):
    # type: (Text) -> bool
    return bool(strtobool(value))


def safe_text_to_bool(value):
    # type: (Text) -> bool
    try:
        return text_to_bool(value)
    except ValueError:
        return bool(value)


def any_to_bool(value):
    # type: (Optional[Union[int, float, Text]]) -> bool
    if isinstance(value, six.text_type):
        return text_to_bool(value)
    return bool(value)


# noinspection PyIncorrectDocstring
def or_(*converters, **kwargs):
    # type: (ConverterType, Tuple[Exception, ...]) -> ConverterType
    """
    Wrapper that implements an "optional converter" pattern. Allows specifying a converter
    for which a set of exceptions is ignored (and the original value is returned)

    :param converters: A converter callable
    :param exceptions: A tuple of exception types to ignore
    """
    # noinspection PyUnresolvedReferences
    exceptions = kwargs.get("exceptions", (ValueError, TypeError))

    def wrapper(value):
        for converter in converters:
            try:
                return converter(value)
            except exceptions:
                pass
        return value

    return wrapper
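`or_` chains converters and falls through on the listed exceptions, returning the raw value if nothing matched. A simplified local copy plus usage (the real version above takes `exceptions` via `**kwargs`):

```python
def or_(*converters, exceptions=(ValueError, TypeError)):
    def wrapper(value):
        for converter in converters:
            try:
                return converter(value)
            except exceptions:
                pass
        return value
    return wrapper

to_int_or_float = or_(int, float)
print(to_int_or_float("42"))    # 42 (int)
print(to_int_or_float("3.14"))  # 3.14 (float)
print(to_int_or_float("n/a"))   # 'n/a' (returned unchanged)
```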
clearml_agent/helper/environment/entry.py (new file, 134 lines)
@@ -0,0 +1,134 @@
import abc
from typing import Optional, Any, Tuple, Callable, Dict

import six

from .converters import any_to_bool

try:
    from typing import Text
except ImportError:
    # windows conda-less hack
    Text = Any


NotSet = object()

Converter = Callable[[Any], Any]


@six.add_metaclass(abc.ABCMeta)
class Entry(object):
    """
    Configuration entry definition
    """

    def default_conversions(self):
        # type: () -> Dict[Any, Converter]

        if self.lstrip and self.rstrip:

            def str_convert(s):
                return six.text_type(s).strip()

        elif self.lstrip:

            def str_convert(s):
                return six.text_type(s).lstrip()

        elif self.rstrip:

            def str_convert(s):
                return six.text_type(s).rstrip()

        else:

            def str_convert(s):
                return six.text_type(s)

        return {
            bool: lambda x: any_to_bool(x.strip()),
            six.text_type: str_convert,
        }

    def __init__(self, key, *more_keys, **kwargs):
        # type: (Text, Text, Any) -> None
        """
        :rtype: object
        :param key: Entry's key (at least one).
        :param more_keys: More alternate keys for this entry.
        :param type: Value type. If provided, will be used choosing a default conversion or
            (if none exists) for casting the environment value.
        :param converter: Value converter. If provided, will be used to convert the environment value.
        :param default: Default value. If provided, will be used as the default value on calls to get() and get_pair()
            in case no value is found for any key and no specific default value was provided in the call.
            Default value is None.
        :param help: Help text describing this entry
        """
        self.keys = (key,) + more_keys
        self.type = kwargs.pop("type", six.text_type)
        self.converter = kwargs.pop("converter", None)
        self.default = kwargs.pop("default", None)
        self.help = kwargs.pop("help", None)
        self.lstrip = kwargs.pop("lstrip", True)
        self.rstrip = kwargs.pop("rstrip", True)

    def __str__(self):
        return str(self.key)

    @property
    def key(self):
        return self.keys[0]

    def convert(self, value, converter=None):
        # type: (Any, Converter) -> Optional[Any]
        converter = converter or self.converter
        if not converter:
            converter = self.default_conversions().get(self.type, self.type)
        return converter(value)

    def get_pair(self, default=NotSet, converter=None, value_cb=None):
        # type: (Any, Converter, Callable[[str, Any], None]) -> Optional[Tuple[Text, Any]]
        for key in self.keys:
            value = self._get(key)
            if value is NotSet:
                continue
            try:
                value = self.convert(value, converter)
            except Exception as ex:
                self.error("invalid value {key}={value}: {ex}".format(**locals()))
                break
            # noinspection PyBroadException
            try:
                if value_cb:
                    value_cb(key, value)
            except Exception:
                pass
            return key, value

        result = self.default if default is NotSet else default
        return self.key, result

    def get(self, default=NotSet, converter=None, value_cb=None):
        # type: (Any, Converter, Callable[[str, Any], None]) -> Optional[Any]
        return self.get_pair(default=default, converter=converter, value_cb=value_cb)[1]

    def set(self, value):
        # type: (Any, Any) -> (Text, Any)
        # key, _ = self.get_pair(default=None, converter=None)
        for k in self.keys:
            self._set(k, str(value))

    def _set(self, key, value):
        # type: (Text, Text) -> None
        pass

    @abc.abstractmethod
    def _get(self, key):
        # type: (Text) -> Any
        pass

    @abc.abstractmethod
    def error(self, message):
        # type: (Text) -> None
        pass
clearml_agent/helper/environment/environment.py (new file, 28 lines)
@@ -0,0 +1,28 @@
from os import getenv, environ

from .converters import text_to_bool
from .entry import Entry, NotSet


class EnvEntry(Entry):
    def default_conversions(self):
        conversions = super(EnvEntry, self).default_conversions().copy()
        conversions[bool] = lambda x: text_to_bool(x.strip())
        return conversions

    def pop(self):
        for k in self.keys:
            environ.pop(k, None)

    def _get(self, key):
        value = getenv(key, "")
        return value or NotSet

    def _set(self, key, value):
        environ[key] = value

    def __str__(self):
        return "env:{}".format(super(EnvEntry, self).__str__())

    def error(self, message):
        print("Environment configuration: {}".format(message))
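Putting the three new modules together, typical usage of `EnvEntry` looks like the sketch below, assuming the package above is importable; the variable names are made up for the demo:

```python
import os

from clearml_agent.helper.environment import EnvEntry

os.environ["CLEARML_EXAMPLE_FLAG"] = " true "

# first key that resolves wins; the value is stripped and converted via text_to_bool
entry = EnvEntry("CLEARML_EXAMPLE_FLAG", "CLEARML_EXAMPLE_FLAG_ALT", type=bool, default=False)
print(entry.get())       # True
print(entry.get_pair())  # ('CLEARML_EXAMPLE_FLAG', True)
```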
@@ -15,10 +15,8 @@ from __future__ import print_function
from __future__ import unicode_literals

import json
import os.path
import platform
import sys
import time
from datetime import datetime
from typing import Optional

@@ -164,13 +162,14 @@ class GPUStatCollection(object):
     _device_count = None
     _gpu_device_info = {}

-    def __init__(self, gpu_list, driver_version=None):
+    def __init__(self, gpu_list, driver_version=None, driver_cuda_version=None):
         self.gpus = gpu_list

         # attach additional system information
         self.hostname = platform.node()
         self.query_time = datetime.now()
         self.driver_version = driver_version
+        self.driver_cuda_version = driver_cuda_version

     @staticmethod
     def clean_processes():
@@ -181,10 +180,11 @@ class GPUStatCollection(object):
     @staticmethod
     def new_query(shutdown=False, per_process_stats=False, get_driver_info=False):
         """Query the information of all the GPUs on local machine"""

+        initialized = False
         if not GPUStatCollection._initialized:
             N.nvmlInit()
             GPUStatCollection._initialized = True
+            initialized = True

         def _decode(b):
             if isinstance(b, bytes):
@@ -200,10 +200,10 @@ class GPUStatCollection(object):
             if nv_process.pid not in GPUStatCollection.global_processes:
                 GPUStatCollection.global_processes[nv_process.pid] = \
                     psutil.Process(pid=nv_process.pid)
-            ps_process = GPUStatCollection.global_processes[nv_process.pid]
             process['pid'] = nv_process.pid
             # noinspection PyBroadException
             try:
+                # ps_process = GPUStatCollection.global_processes[nv_process.pid]
                 # we do not actually use these, so no point in collecting them
                 # process['username'] = ps_process.username()
                 # # cmdline returns full path;
@@ -286,11 +286,11 @@ class GPUStatCollection(object):
         for nv_process in nv_comp_processes + nv_graphics_processes:
             try:
                 process = get_process_info(nv_process)
-                processes.append(process)
             except psutil.NoSuchProcess:
                 # TODO: add some reminder for NVML broken context
                 # e.g. nvidia-smi reset or reboot the system
-                pass
+                process = None
+            processes.append(process)

         # we do not actually use these, so no point in collecting them
         # # TODO: Do not block if full process info is not requested
@@ -314,7 +314,7 @@ class GPUStatCollection(object):
             # Convert bytes into MBytes
             'memory.used': memory.used // MB if memory else None,
             'memory.total': memory.total // MB if memory else None,
-            'processes': processes,
+            'processes': None if (processes and all(p is None for p in processes)) else processes
         }
         if per_process_stats:
             GPUStatCollection.clean_processes()
@@ -337,15 +337,32 @@ class GPUStatCollection(object):
                 driver_version = _decode(N.nvmlSystemGetDriverVersion())
             except N.NVMLError:
                 driver_version = None  # N/A
+
+            # noinspection PyBroadException
+            try:
+                cuda_driver_version = str(N.nvmlSystemGetCudaDriverVersion())
+            except BaseException:
+                # noinspection PyBroadException
+                try:
+                    cuda_driver_version = str(N.nvmlSystemGetCudaDriverVersion_v2())
+                except BaseException:
+                    cuda_driver_version = None
+            if cuda_driver_version:
+                try:
+                    cuda_driver_version = '{}.{}'.format(
+                        int(cuda_driver_version)//1000, (int(cuda_driver_version) % 1000)//10)
+                except (ValueError, TypeError):
+                    pass
         else:
             driver_version = None
+            cuda_driver_version = None

         # no need to shutdown:
-        if shutdown:
+        if shutdown and initialized:
             N.nvmlShutdown()
             GPUStatCollection._initialized = False

-        return GPUStatCollection(gpu_list, driver_version=driver_version)
+        return GPUStatCollection(gpu_list, driver_version=driver_version, driver_cuda_version=cuda_driver_version)

     def __len__(self):
         return len(self.gpus)
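NVML encodes the CUDA driver version as an integer, `1000 * major + 10 * minor`, which is exactly what the `'{}.{}'` formatting above unpacks. A standalone check:

```python
# NVML reports the CUDA driver version as an integer, e.g. 11040 for CUDA 11.4
def format_cuda_driver_version(raw):
    try:
        return '{}.{}'.format(int(raw) // 1000, (int(raw) % 1000) // 10)
    except (ValueError, TypeError):
        return None

print(format_cuda_driver_version("11040"))  # 11.4
print(format_cuda_driver_version(12020))    # 12.2
```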
(File diff suppressed because it is too large.)
@@ -50,7 +50,7 @@ class PackageManager(object):
         pass

     @abc.abstractmethod
-    def freeze(self):
+    def freeze(self, freeze_full_environment=False):
         pass

     @abc.abstractmethod
@@ -141,8 +141,9 @@ class PackageManager(object):
     @classmethod
     def out_of_scope_install_package(cls, package_name, *args):
         if PackageManager._selected_manager is not None:
+            # noinspection PyBroadException
             try:
-                result = PackageManager._selected_manager._install(package_name, *args)
+                result = PackageManager._selected_manager.install_packages(package_name, *args)
                 if result not in (0, None, True):
                     return False
             except Exception:
@@ -150,10 +151,11 @@ class PackageManager(object):
         return True

     @classmethod
-    def out_of_scope_freeze(cls):
+    def out_of_scope_freeze(cls, freeze_full_environment=False):
         if PackageManager._selected_manager is not None:
             # noinspection PyBroadException
             try:
-                return PackageManager._selected_manager.freeze()
+                return PackageManager._selected_manager.freeze(freeze_full_environment)
             except Exception:
                 pass
         return []
@@ -92,21 +92,14 @@ class ExternalRequirements(SimpleSubstitution):
             vcs_url = req_line[4:]
             # reverse replace
             vcs_url = vcs_url[::-1].replace(fragment[::-1], '', 1)[::-1]
-            # remove ssh:// or git:// prefix for git detection and credentials
-            scheme = ''
-            full_vcs_url = vcs_url
-            if vcs_url and (vcs_url.startswith('ssh://') or vcs_url.startswith('git://')):
-                scheme = 'ssh://'  # notice git:// is actually ssh://
-                vcs_url = vcs_url[6:]
+            # notice git:// is actually ssh://
+            if vcs_url and vcs_url.startswith('git://'):
+                vcs_url = vcs_url.replace('git://', 'ssh://', 1)

             from ..repo import Git
-            vcs = Git(session=session, url=full_vcs_url, location=None, revision=None)
+            vcs = Git(session=session, url=vcs_url, location=None, revision=None)
             vcs._set_ssh_url()
-            new_req_line = 'git+{}{}{}'.format(
-                '' if scheme and '://' in vcs.url else scheme,
-                vcs_url if session.config.get('agent.force_git_ssh_protocol', None) else vcs.url_with_auth,
-                fragment
-            )
+            new_req_line = 'git+{}{}'.format(vcs.url_with_auth, fragment)
             if new_req_line != req_line:
                 furl_line = furl(new_req_line)
                 print('Replacing original pip vcs \'{}\' with \'{}\''.format(
@@ -4,7 +4,7 @@ from itertools import chain
 from pathlib import Path
 from typing import Text, Optional

-from clearml_agent.definitions import PIP_EXTRA_INDICES, PROGRAM_NAME
+from clearml_agent.definitions import PIP_EXTRA_INDICES, PROGRAM_NAME, ENV_PIP_EXTRA_INSTALL_FLAGS
 from clearml_agent.helper.package.base import PackageManager
 from clearml_agent.helper.process import Argv, DEVNULL
 from clearml_agent.session import Session
@@ -12,8 +12,6 @@ from clearml_agent.session import Session

 class SystemPip(PackageManager):

-    indices_args = None
-
     def __init__(self, interpreter=None, session=None):
         # type: (Optional[Text], Optional[Session]) -> ()
         """
@@ -52,7 +50,7 @@ class SystemPip(PackageManager):
             package,
             '--dest', cache_dir,
             '--no-deps',
-        ) + self.install_flags()
+        ) + self.download_flags()
         )

     def load_requirements(self, requirements):
@@ -65,13 +63,14 @@ class SystemPip(PackageManager):
     def uninstall(self, package):
         self.run_with_env(('uninstall', '-y', package))

-    def freeze(self):
+    def freeze(self, freeze_full_environment=False):
         """
         pip freeze to all install packages except the running program
         :return: Dict contains pip as key and pip's packages to install
         :rtype: Dict[str: List[str]]
         """
-        packages = self.run_with_env(('freeze',), output=True).splitlines()
+        packages = self.run_with_env(
+            ('freeze',) if not freeze_full_environment else ('freeze', '--all'), output=True).splitlines()
         packages_without_program = [package for package in packages if PROGRAM_NAME not in package]
         return {'pip': packages_without_program}

@@ -87,14 +86,30 @@ class SystemPip(PackageManager):
         # make sure we are not running it with our own PYTHONPATH
         env = dict(**os.environ)
         env.pop('PYTHONPATH', None)

+        # Debug print
+        if self.session.debug_mode:
+            print(command)
+
         return (command.get_output if output else command.check_call)(stdin=DEVNULL, env=env, **kwargs)

     def _make_command(self, command):
         return Argv(self.bin, '-m', 'pip', '--disable-pip-version-check', *command)

     def install_flags(self):
-        if self.indices_args is None:
-            self.indices_args = tuple(
-                chain.from_iterable(('--extra-index-url', x) for x in PIP_EXTRA_INDICES)
-            )
-        return self.indices_args
+        indices_args = tuple(
+            chain.from_iterable(('--extra-index-url', x) for x in PIP_EXTRA_INDICES)
+        )
+
+        extra_pip_flags = \
+            ENV_PIP_EXTRA_INSTALL_FLAGS.get() or \
+            self.session.config.get("agent.package_manager.extra_pip_install_flags", None)
+
+        return (indices_args + tuple(extra_pip_flags)) if extra_pip_flags else indices_args
+
+    def download_flags(self):
+        indices_args = tuple(
+            chain.from_iterable(('--extra-index-url', x) for x in PIP_EXTRA_INDICES)
+        )
+
+        return indices_args
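The reworked `install_flags()` appends user-supplied flags (from `ENV_PIP_EXTRA_INSTALL_FLAGS` or the `agent.package_manager.extra_pip_install_flags` config key) after the extra-index arguments, while `download_flags()` keeps only the indices. A standalone sketch of the composition — the index URL and flag values are placeholders:

```python
from itertools import chain

PIP_EXTRA_INDICES = ["https://pypi.example.com/simple"]  # assumption for the demo
extra_pip_flags = ["--no-cache-dir"]                     # e.g. from the config key above

indices_args = tuple(chain.from_iterable(('--extra-index-url', x) for x in PIP_EXTRA_INDICES))
flags = (indices_args + tuple(extra_pip_flags)) if extra_pip_flags else indices_args
print(list(flags))
# ['--extra-index-url', 'https://pypi.example.com/simple', '--no-cache-dir']
```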
@@ -69,7 +69,7 @@ class PoetryConfig:
             path = path.replace(':'+sys.base_prefix, ':'+sys.real_prefix, 1)
             kwargs['env']['PATH'] = path

-        if self.session and self.session.config:
+        if self.session and self.session.config and args and args[0] == "install":
             extra_args = self.session.config.get("agent.package_manager.poetry_install_extra_args", None)
             if extra_args:
                 args = args + tuple(extra_args)
@@ -147,7 +147,7 @@ class PoetryAPI(object):
         any((self.path / indicator).exists() for indicator in self.INDICATOR_FILES)
     )

-    def freeze(self):
+    def freeze(self, freeze_full_environment=False):
         lines = self.config.run("show", cwd=str(self.path)).splitlines()
         lines = [[p for p in line.split(' ') if p] for line in lines]
         return {"pip": [parts[0]+'=='+parts[1]+' # '+' '.join(parts[2:]) for parts in lines]}
@@ -7,7 +7,7 @@ from .requirements import SimpleSubstitution

 class PriorityPackageRequirement(SimpleSubstitution):

-    name = ("cython", "numpy", "setuptools", )
+    name = ("cython", "numpy", "setuptools", "pip", )
     optional_package_names = tuple()

     def __init__(self, *args, **kwargs):
@@ -50,31 +50,39 @@ class PriorityPackageRequirement(SimpleSubstitution):
         """
         # if we replaced setuptools, it means someone requested it, and since freeze will not contain it,
         # we need to add it manually
-        if not self._replaced_packages or "setuptools" not in self._replaced_packages:
+        if not self._replaced_packages:
             return list_of_requirements

-        try:
-            for k, lines in list_of_requirements.items():
-                # k is either pip/conda
-                if k not in ('pip', 'conda'):
-                    continue
-                for i, line in enumerate(lines):
-                    if not line or line.lstrip().startswith('#'):
-                        continue
-                    parts = [p for p in re.split(r'\s|=|\.|<|>|~|!|@|#', line) if p]
-                    if not parts:
-                        continue
-                    # if we found setuptools, do nothing
-                    if parts[0] == "setuptools":
-                        return list_of_requirements
+        if "pip" in self._replaced_packages:
+            full_freeze = PackageManager.out_of_scope_freeze(freeze_full_environment=True)
+            # now let's look for pip
+            pips = [line for line in full_freeze.get("pip", []) if line.split("==")[0] == "pip"]
+            if pips and "pip" in list_of_requirements:
+                list_of_requirements["pip"] = [pips[0]] + list_of_requirements["pip"]

-            # if we are here it means we have not found setuptools
-            # we should add it:
-            if "pip" in list_of_requirements:
-                list_of_requirements["pip"] = [self._replaced_packages["setuptools"]] + list_of_requirements["pip"]
+        if "setuptools" in self._replaced_packages:
+            try:
+                for k, lines in list_of_requirements.items():
+                    # k is either pip/conda
+                    if k not in ('pip', 'conda'):
+                        continue
+                    for i, line in enumerate(lines):
+                        if not line or line.lstrip().startswith('#'):
+                            continue
+                        parts = [p for p in re.split(r'\s|=|\.|<|>|~|!|@|#', line) if p]
+                        if not parts:
+                            continue
+                        # if we found setuptools, do nothing
+                        if parts[0] == "setuptools":
+                            return list_of_requirements

-        except Exception as ex:  # noqa
-            return list_of_requirements
+                # if we are here it means we have not found setuptools
+                # we should add it:
+                if "pip" in list_of_requirements:
+                    list_of_requirements["pip"] = [self._replaced_packages["setuptools"]] + list_of_requirements["pip"]
+
+            except Exception as ex:  # noqa
+                return list_of_requirements

         return list_of_requirements
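The setuptools detection above tokenizes each requirement line with `re.split` and compares the first token. The same tokenization, runnable on its own:

```python
import re

def requirement_package_name(line):
    # same split the replacement logic above uses to find "setuptools"
    parts = [p for p in re.split(r'\s|=|\.|<|>|~|!|@|#', line) if p]
    return parts[0] if parts else None

print(requirement_package_name("setuptools>=41.0.0"))    # setuptools
print(requirement_package_name("numpy==1.24.2 # note"))  # numpy
```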
@@ -16,6 +16,7 @@ import six
 from .requirements import (
     SimpleSubstitution, FatalSpecsResolutionError, SimpleVersion, MarkerRequirement,
     compare_version_rules, )
+from ...definitions import ENV_PACKAGE_PYTORCH_RESOLVE
 from ...external.requirements_parser.requirement import Requirement

 OS_TO_WHEEL_NAME = {"linux": "linux_x86_64", "windows": "win_amd64"}
@@ -174,6 +175,7 @@ class PytorchRequirement(SimpleSubstitution):
     extra_index_url_template = 'https://download.pytorch.org/whl/cu{}/'
     nightly_extra_index_url_template = 'https://download.pytorch.org/whl/nightly/cu{}/'
     torch_index_url_lookup = {}
+    resolver_types = ("pip", "direct", "none")

     def __init__(self, *args, **kwargs):
         os_name = kwargs.pop("os_override", None)
@@ -208,6 +210,13 @@ class PytorchRequirement(SimpleSubstitution):
         if self.config.get("agent.package_manager.torch_url_template", None):
             PytorchWheel.url_template = \
                 self.config.get("agent.package_manager.torch_url_template", None)
+        self.resolve_algorithm = str(
+            ENV_PACKAGE_PYTORCH_RESOLVE.get() or
+            self.config.get("agent.package_manager.pytorch_resolve", "pip")).lower()
+        if self.resolve_algorithm not in self.resolver_types:
+            print("WARNING: agent.package_manager.pytorch_resolve=={} not in {} reverting to '{}'".format(
+                self.resolve_algorithm, self.resolver_types, self.resolver_types[0]))
+            self.resolve_algorithm = self.resolver_types[0]

     def _init_python_ver_cuda_ver(self):
         if self.cuda is None:
@@ -261,6 +270,10 @@ class PytorchRequirement(SimpleSubstitution):
         )

     def match(self, req):
+        if self.resolve_algorithm == "none":
+            # skipping resolver
+            return False
+
         return req.name in self.packages

     @staticmethod
@@ -310,6 +323,12 @@ class PytorchRequirement(SimpleSubstitution):
             # yes this is for linux python 2.7 support, this is the only python 2.7 we support...
             if py_ver and py_ver[0] == '2' and len(parts) > 3 and not parts[3].endswith('u'):
                 continue

+            # check if this an actual match
+            if not req.compare_version(v) or \
+                    (last_v and SimpleVersion.compare_versions(last_v, '>', v, ignore_sub_versions=False)):
+                continue
+
             # update the closest matched version (from above)
             if not closest_v:
                 closest_v = v
@@ -318,10 +337,6 @@ class PytorchRequirement(SimpleSubstitution):
                     SimpleVersion.compare_versions(
                         version_a=v, op='>=', version_b=req.specs[0][1], num_parts=3):
                 closest_v = v
-            # check if this an actual match
-            if not req.compare_version(v) or \
-                    (last_v and SimpleVersion.compare_versions(last_v, '>', v, ignore_sub_versions=False)):
-                continue

             url = '/'.join(torch_url.split('/')[:-1] + l.split('/'))
             last_v = v
@@ -345,8 +360,10 @@ class PytorchRequirement(SimpleSubstitution):
             from pip._internal.commands.show import search_packages_info
             installed_torch = list(search_packages_info([req.name]))
             # notice the comparison order, the first part will make sure we have a valid installed package
-            installed_torch_version = (getattr(installed_torch[0], 'version', None) or installed_torch[0]['version']) \
-                if installed_torch else None
+            installed_torch_version = \
+                (getattr(installed_torch[0], 'version', None) or
+                 installed_torch[0]['version']) if installed_torch else None

             if installed_torch and installed_torch_version and \
                     req.compare_version(installed_torch_version):
                 print('PyTorch: requested "{}" version {}, using pre-installed version {}'.format(
@@ -354,6 +371,7 @@ class PytorchRequirement(SimpleSubstitution):
                 # package already installed, do nothing
                 req.specs = [('==', str(installed_torch_version))]
                 return '{} {} {}'.format(req.name, req.specs[0][0], req.specs[0][1]), True
+
         except Exception:
             pass

@@ -475,6 +493,26 @@ class PytorchRequirement(SimpleSubstitution):
         return self.match_version(req, base).replace(" ", "\n")

     def replace(self, req):
+        # we first try to resolve things ourselves because pytorch pip is not always picking the correct
+        # versions from their pip repository
+
+        resolve_algorithm = self.resolve_algorithm
+        if resolve_algorithm == "none":
+            # skipping resolver
+            return None
+        elif resolve_algorithm == "direct":
+            # noinspection PyBroadException
+            try:
+                new_req = self._replace(req)
+                if new_req:
+                    self._original_req.append((req, new_req))
+                return new_req
+            except Exception:
+                print("Warning: Failed resolving using `pytorch_resolve=direct` reverting to `pytorch_resolve=pip`")
+        elif resolve_algorithm not in self.resolver_types:
+            print("Warning: `agent.package_manager.pytorch_resolve={}` "
+                  "unrecognized, default to `pip`".format(resolve_algorithm))
+
         # check if package is already installed with system packages
         self.validate_python_version()

@@ -508,6 +546,8 @@ class PytorchRequirement(SimpleSubstitution):
             # return the original line
             line = req.line

+            print("PyTorch: Adding index `{}` and installing `{}`".format(extra_index_url[0], line))
+
             return line

         except Exception:  # noqa
@@ -566,6 +606,19 @@ class PytorchRequirement(SimpleSubstitution):
         :param list_of_requirements: {'pip': ['a==1.0', ]}
         :return: {'pip': ['a==1.0', ]}
         """
+        def build_specific_version_req(a_line, a_name, a_new_req):
+            try:
+                r = Requirement.parse(a_line)
+                wheel_parts = r.uri.split("/")[-1].split('-')
+                version = str(wheel_parts[1].split('%')[0].split('+')[0])
+                new_r = Requirement.parse("{} == {} # {}".format(a_name, version, str(a_new_req)))
+                if new_r.line:
+                    # great it worked!
+                    return new_r.line
+            except:  # noqa
+                pass
+            return None
+
         if not self._original_req:
             return list_of_requirements
         try:
@@ -589,9 +642,18 @@ class PytorchRequirement(SimpleSubstitution):
                     if req.local_file:
                         lines[i] = '{}'.format(str(new_req))
                     else:
-                        lines[i] = '{} # {}'.format(str(req), str(new_req))
+                        # try to rebuild requirements with specific version:
+                        new_line = build_specific_version_req(line, req.req.name, new_req)
+                        if new_line:
+                            lines[i] = new_line
+                        else:
+                            lines[i] = '{} # {}'.format(str(req), str(new_req))
                 else:
-                    lines[i] = '{} # {}'.format(line, str(new_req))
+                    new_line = build_specific_version_req(line, req.req.name, new_req)
+                    if new_line:
+                        lines[i] = new_line
+                    else:
+                        lines[i] = '{} # {}'.format(line, str(new_req))
                 break
         except:
             pass
@@ -640,7 +702,7 @@ class PytorchRequirement(SimpleSubstitution):
             # noinspection PyBroadException
             try:
                 if requests.get(torch_url, timeout=10).ok:
-                    print('Torch CUDA {} index page found'.format(c))
+                    print('Torch CUDA {} index page found, adding `{}`'.format(c, torch_url))
                     cls.torch_index_url_lookup[c] = torch_url
                     return cls.torch_index_url_lookup[c], c
             except Exception:
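`build_specific_version_req` (added above) recovers a pinned `name == version` spec from a direct wheel URL by relying on the wheel filename convention `name-version-tags.whl`. A standalone sketch of the same parsing — the URL is illustrative only:

```python
def wheel_url_to_spec(name, url):
    # wheel filename convention: name-version-python-abi-platform.whl
    wheel_parts = url.split("/")[-1].split('-')
    version = wheel_parts[1].split('%')[0].split('+')[0]  # drop local/encoded suffixes
    return "{} == {}".format(name, version)

url = "https://download.pytorch.org/whl/cu117/torch-1.13.1%2Bcu117-cp310-cp310-linux_x86_64.whl"
print(wheel_url_to_spec("torch", url))  # torch == 1.13.1
```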
@@ -240,6 +240,23 @@ class SimpleVersion:
         if not version_b:
             return True

+        # remove trailing "*" in both
+        if "*" in version_a:
+            ignore_sub_versions = True
+            while version_a.endswith(".*"):
+                version_a = version_a[:-2]
+            if version_a == "*":
+                version_a = ""
+            num_parts = min(len(version_a.split('.')), len(version_b.split('.')), )
+
+        if "*" in version_b:
+            ignore_sub_versions = True
+            while version_b.endswith(".*"):
+                version_b = version_b[:-2]
+            if version_b == "*":
+                version_b = ""
+            num_parts = min(len(version_a.split('.')), len(version_b.split('.')), )
+
         if not num_parts:
             num_parts = max(len(version_a.split('.')), len(version_b.split('.')), )
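The new wildcard handling normalizes specs like `1.13.*` before comparison: trailing `.*` parts are trimmed, a bare `*` becomes the empty string, and sub-versions are then ignored. A runnable sketch:

```python
def strip_wildcard(version):
    # mirrors the trimming loop above; the flag means "ignore sub-versions"
    wildcard = "*" in version
    while version.endswith(".*"):
        version = version[:-2]
    if version == "*":
        version = ""
    return version, wildcard

print(strip_wildcard("1.13.*"))  # ('1.13', True)
print(strip_wildcard("*"))       # ('', True)
print(strip_wildcard("2.0.1"))   # ('2.0.1', False)
```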
@@ -19,7 +19,7 @@ from pathlib2 import Path

 import six

-from clearml_agent.definitions import ENV_AGENT_GIT_USER, ENV_AGENT_GIT_PASS, ENV_AGENT_GIT_HOST
+from clearml_agent.definitions import ENV_AGENT_GIT_USER, ENV_AGENT_GIT_PASS, ENV_AGENT_GIT_HOST, ENV_GIT_CLONE_VERBOSE
 from clearml_agent.helper.console import ensure_text, ensure_binary
 from clearml_agent.errors import CommandFailedError
 from clearml_agent.helper.base import (
@@ -197,8 +197,9 @@ class VCS(object):
         self.log.info("successfully applied uncommitted changes")
         return True

-    # Command-line flags for clone command
-    clone_flags = ()
+    def clone_flags(self):
+        """Command-line flags for clone command"""
+        return tuple()

     @abc.abstractmethod
     def executable_not_found_error_help(self):
@@ -320,6 +321,7 @@ class VCS(object):
                 self.url, new_url))
             self.url = new_url
             return

         # rewrite ssh URLs only if either ssh port or ssh user are forced in config
         if parsed_url.scheme == "ssh" and (
                 self.session.config.get('agent.force_git_ssh_port', None) or
@@ -334,6 +336,9 @@ class VCS(object):
             print("Using SSH credentials - ssh url '{}' with ssh url '{}'".format(
                 self.url, new_url))
             self.url = new_url
             return
+        elif parsed_url.scheme == "ssh":
+            return

         if not self.session.config.agent.translate_ssh:
             return
@@ -341,11 +346,18 @@ class VCS(object):
         # if we have git_user / git_pass replace ssh credentials with https authentication
         if (ENV_AGENT_GIT_USER.get() or self.session.config.get('agent.git_user', None)) and \
                 (ENV_AGENT_GIT_PASS.get() or self.session.config.get('agent.git_pass', None)):

             # only apply to a specific domain (if requested)
             config_domain = \
-                ENV_AGENT_GIT_HOST.get() or self.session.config.get("git_host", None)
-            if config_domain and config_domain != furl(self.url).host:
-                return
+                ENV_AGENT_GIT_HOST.get() or self.session.config.get("agent.git_host", None)
+            if config_domain:
+                if config_domain != furl(self.url).host:
+                    # bail out here if we have a git_host configured and it's different than the URL host
+                    # however, we should make sure this is not an ssh@ URL that furl failed to parse
+                    ssh_git_url_match = self.SSH_URL_GIT_SYNTAX.match(self.url)
+                    if not ssh_git_url_match or config_domain != ssh_git_url_match.groupdict().get("host"):
+                        # do not replace to ssh url
+                        return

             new_url = self.replace_ssh_url(self.url)
             if new_url != self.url:
@@ -362,7 +374,7 @@ class VCS(object):
         self._set_ssh_url()
         # if we are on linux no need for the full auth url because we use GIT_ASKPASS
         url = self.url_without_auth if self._use_ask_pass else self.url_with_auth
-        clone_command = ("clone", url, self.location) + self.clone_flags
+        clone_command = ("clone", url, self.location) + self.clone_flags()
         # clone all branches regardless of when we want to later checkout
         # if branch:
         #     clone_command += ("-b", branch)
@@ -539,7 +551,6 @@ class VCS(object):
 class Git(VCS):
     executable_name = "git"
     main_branch = ("master", "main")
-    clone_flags = ("--quiet", "--recursive")
     checkout_flags = ("--force",)
     COMMAND_ENV = {
         # do not prompt for password
@@ -551,7 +562,7 @@ class Git(VCS):
     def __init__(self, *args, **kwargs):
         super(Git, self).__init__(*args, **kwargs)

-        self._use_ask_pass = False if not self.session.config.get('agent.enable_git_ask_pass', None) \
+        self._use_ask_pass = False if not self.session.config.get('agent.enable_git_ask_pass', True) \
             else sys.platform == "linux"

         try:
@@ -565,6 +576,12 @@ class Git(VCS):
             "origin/{}".format(b) for b in ([branch] if isinstance(branch, str) else branch)
         ]

+    def clone_flags(self):
+        return (
+            "--recursive",
+            "--verbose" if ENV_GIT_CLONE_VERBOSE.get() else "--quiet"
+        )
+
     def executable_not_found_error_help(self):
         return 'Cannot find "{}" executable. {}'.format(
             self.executable_name,
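`clone_flags` changing from a class attribute to a method is what allows per-call decisions such as the new verbosity toggle. A standalone sketch of the Git variant — the raw environment-variable name behind `ENV_GIT_CLONE_VERBOSE` is an assumption here:

```python
import os

def clone_flags():
    # verbose clone when the (assumed) env var is set, quiet otherwise
    return (
        "--recursive",
        "--verbose" if os.getenv("CLEARML_AGENT_GIT_CLONE_VERBOSE") else "--quiet",
    )

print(("git", "clone", "https://example.com/repo.git", "/tmp/repo") + clone_flags())
```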
@@ -79,6 +79,7 @@ class ResourceMonitor(object):
         self._gpustat_fail = 0
         self._gpustat = gpustat
         self._active_gpus = None
+        self._disk_use_path = str(session.config.get("agent.resource_monitoring.disk_use_path", None) or Path.home())
         if not worker_tags and ENV_WORKER_TAGS.get():
             worker_tags = shlex.split(ENV_WORKER_TAGS.get())
         self._worker_tags = worker_tags
@@ -139,42 +140,45 @@ class ResourceMonitor(object):
     def _daemon(self):
         seconds_since_started = 0
         reported = 0
-        while True:
-            last_report = time()
-            current_report_frequency = (
-                self._report_frequency if reported != 0 else self._first_report_sec
-            )
-            while (time() - last_report) < current_report_frequency:
-                # wait for self._sample_frequency seconds, if event set quit
-                if self._exit_event.wait(1 / self._sample_frequency):
-                    return
-                # noinspection PyBroadException
-                try:
-                    self._update_readouts()
-                except Exception as ex:
-                    log.warning("failed getting machine stats: %s", report_error(ex))
-                    self._failure()
+        try:
+            while True:
+                last_report = time()
+                current_report_frequency = (
+                    self._report_frequency if reported != 0 else self._first_report_sec
+                )
+                while (time() - last_report) < current_report_frequency:
+                    # wait for self._sample_frequency seconds, if event set quit
+                    if self._exit_event.wait(1 / self._sample_frequency):
+                        return
+                    # noinspection PyBroadException
+                    try:
+                        self._update_readouts()
+                    except Exception as ex:
+                        log.warning("failed getting machine stats: %s", report_error(ex))
+                        self._failure()

-            seconds_since_started += int(round(time() - last_report))
-            # check if we do not report any metric (so it means the last iteration will not be changed)
+                seconds_since_started += int(round(time() - last_report))
+                # check if we do not report any metric (so it means the last iteration will not be changed)

-            # if we do not have last_iteration, we just use seconds as iteration
+                # if we do not have last_iteration, we just use seconds as iteration

-            # start reporting only when we figured out, if this is seconds based, or iterations based
-            average_readouts = self._get_average_readouts()
-            stats = {
-                # 3 points after the dot
-                key: round(value, 3) if isinstance(value, float) else [round(v, 3) for v in value]
-                for key, value in average_readouts.items()
-            }
+                # start reporting only when we figured out, if this is seconds based, or iterations based
+                average_readouts = self._get_average_readouts()
+                stats = {
+                    # 3 points after the dot
+                    key: round(value, 3) if isinstance(value, float) else [round(v, 3) for v in value]
+                    for key, value in average_readouts.items()
+                }

-            # send actual report
-            if self.send_report(stats):
-                # clear readouts if this is update was sent
-                self._clear_readouts()
+                # send actual report
+                if self.send_report(stats):
+                    # clear readouts if this is update was sent
+                    self._clear_readouts()

-            # count reported iterations
-            reported += 1
+                # count reported iterations
+                reported += 1
+        except Exception as ex:
+            log.exception("Error reporting monitoring info: %s", str(ex))

     def _update_readouts(self):
         readouts = self._machine_stats()
@@ -239,7 +243,7 @@ class ResourceMonitor(object):
         virtual_memory = psutil.virtual_memory()
         stats["memory_used"] = BytesSizes.megabytes(virtual_memory.used)
         stats["memory_free"] = BytesSizes.megabytes(virtual_memory.available)
-        disk_use_percentage = psutil.disk_usage(Text(Path.home())).percent
+        disk_use_percentage = psutil.disk_usage(self._disk_use_path).percent
         stats["disk_free_percent"] = 100 - disk_use_percentage
         sensor_stat = (
             psutil.sensors_temperatures()
@@ -263,8 +267,10 @@ class ResourceMonitor(object):
         gpu_stat = self._gpustat.new_query()
         for i, g in enumerate(gpu_stat.gpus):
             # only monitor the active gpu's, if none were selected, monitor everything
-            if self._active_gpus and str(i) not in self._active_gpus:
-                continue
+            if self._active_gpus:
+                uuid = getattr(g, "uuid", None)
+                if str(i) not in self._active_gpus and (not uuid or uuid not in self._active_gpus):
+                    continue
             stats["gpu_temperature_{:d}".format(i)] = g["temperature.gpu"]
             stats["gpu_utilization_{:d}".format(i)] = g["utilization.gpu"]
             stats["gpu_mem_usage_{:d}".format(i)] = (
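The widened active-GPU check above matches a GPU either by its index or by its UUID, so UUID-based device selections keep being monitored. The predicate in isolation:

```python
def is_active_gpu(index, uuid, active_gpus):
    if not active_gpus:
        return True  # nothing selected -> monitor everything
    return str(index) in active_gpus or (uuid is not None and uuid in active_gpus)

active = ["0", "GPU-8f4a2c1e"]  # illustrative selection
print(is_active_gpu(0, None, active))            # True (by index)
print(is_active_gpu(3, "GPU-8f4a2c1e", active))  # True (by UUID)
print(is_active_gpu(2, "GPU-other", active))     # False
```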
@@ -19,7 +19,7 @@ from clearml_agent.definitions import ENVIRONMENT_CONFIG, ENV_TASK_EXECUTE_AS_US
 from clearml_agent.errors import APIError
 from clearml_agent.helper.base import HOCONEncoder
 from clearml_agent.helper.process import Argv
-from clearml_agent.helper.docker_args import DockerArgsSanitizer
+from clearml_agent.helper.docker_args import DockerArgsSanitizer, sanitize_urls
 from .version import __version__

 POETRY = "poetry"
@@ -245,33 +245,43 @@ class Session(_Session):
             remove_secret_keys=("secret", "pass", "token", "account_key", "contents"),
             skip_value_keys=("environment", ),
             docker_args_sanitize_keys=("extra_docker_arguments", ),
+            sanitize_urls_keys=("extra_index_url", ),
     ):
         # remove all the secrets from the print
-        def recursive_remove_secrets(dictionary, secret_keys=(), empty_keys=()):
+        def recursive_remove_secrets(dictionary):
             for k in list(dictionary):
-                for s in secret_keys:
+                for s in remove_secret_keys:
                     if s in k:
                         dictionary.pop(k)
                         break
-                for s in empty_keys:
+                for s in skip_value_keys:
                     if s == k:
                         dictionary[k] = {key: '****' for key in dictionary[k]} \
                             if isinstance(dictionary[k], dict) else '****'
                         break
+                for s in sanitize_urls_keys:
+                    if s == k:
+                        value = dictionary.get(k, None)
+                        if isinstance(value, str):
+                            dictionary[k] = sanitize_urls(value)[0]
+                        elif isinstance(value, (list, tuple)):
+                            dictionary[k] = [sanitize_urls(v)[0] for v in value]
+                        elif isinstance(value, dict):
+                            dictionary[k] = {k_: sanitize_urls(v)[0] for k_, v in value.items()}
                 if isinstance(dictionary.get(k, None), dict):
-                    recursive_remove_secrets(dictionary[k], secret_keys=secret_keys, empty_keys=empty_keys)
+                    recursive_remove_secrets(dictionary[k])
                 elif isinstance(dictionary.get(k, None), (list, tuple)):
                     if k in (docker_args_sanitize_keys or []):
                         dictionary[k] = DockerArgsSanitizer.sanitize_docker_command(self, dictionary[k])
                     for item in dictionary[k]:
                         if isinstance(item, dict):
-                            recursive_remove_secrets(item, secret_keys=secret_keys, empty_keys=empty_keys)
+                            recursive_remove_secrets(item)

         config = deepcopy(self.config.to_dict())
         # remove the env variable, it's not important
         config.pop('env', None)
-        if remove_secret_keys or skip_value_keys or docker_args_sanitize_keys:
-            recursive_remove_secrets(config, secret_keys=remove_secret_keys, empty_keys=skip_value_keys)
+        if remove_secret_keys or skip_value_keys or docker_args_sanitize_keys or sanitize_urls_keys:
+            recursive_remove_secrets(config)
         # remove logging.loggers.urllib3.level from the print
         try:
             config['logging']['loggers']['urllib3'].pop('level', None)
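The new `sanitize_urls_keys` handling above routes values such as `extra_index_url` through `sanitize_urls()` before the effective configuration is printed, so credentials embedded in package-index URLs are not logged. The helper itself is not shown in this diff; as an illustration of the general idea only (this is not clearml-agent's implementation), masking userinfo in a URL can look like this:

```python
# Illustrative only -- NOT clearml-agent's sanitize_urls(); it just shows the
# general idea of masking user:password credentials embedded in a URL.
from urllib.parse import urlsplit, urlunsplit

def mask_url_credentials(url: str) -> str:
    parts = urlsplit(url)
    if "@" in parts.netloc:
        host = parts.netloc.rsplit("@", 1)[-1]  # drop the userinfo section
        parts = parts._replace(netloc="****:****@" + host)
    return urlunsplit(parts)

print(mask_url_credentials("https://user:secret@pypi.example.com/simple"))
# -> https://****:****@pypi.example.com/simple
```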
@@ -1 +1 @@
-__version__ = '1.5.2'
+__version__ = '1.7.0'
@@ -58,7 +58,7 @@ agent {
     type: pip,

     # specify pip version to use (examples "<20.2", "==19.3.1", "", empty string will install the latest version)
-    pip_version: "<21",
+    pip_version: ["<20.2 ; python_version < '3.10'", "<22.3 ; python_version >= '3.10'"],

     # virtual environment inherits packages from system
     system_site_packages: false,
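The new `pip_version` value is a list of specs qualified by standard PEP 508 environment markers, so a different pip pin can apply per Python version. A minimal sketch of evaluating such a list (assuming the third-party `packaging` library; `pick_pip_spec` is an illustrative helper name, not part of clearml-agent):

```python
# Sketch: pick the first pip version spec whose PEP 508 marker matches the
# running interpreter. Assumes `pip install packaging`.
from packaging.requirements import Requirement

pip_version = ["<20.2 ; python_version < '3.10'", "<22.3 ; python_version >= '3.10'"]

def pick_pip_spec(specs):
    for spec in specs:
        req = Requirement("pip" + spec)  # prepend a name so the spec parses
        if req.marker is None or req.marker.evaluate():
            return "pip" + str(req.specifier)
    return "pip"  # no marker matched: fall back to the latest pip

print(pick_pip_spec(pip_version))  # e.g. "pip<22.3" on Python 3.10+
```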
@@ -171,7 +171,7 @@ agent {

     default_docker: {
         # default docker image to use when running in docker mode
-        image: "nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04"
+        image: "nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04"

         # optional arguments to pass to docker image
         # arguments: ["--ipc=host", ]
@@ -224,7 +224,7 @@ sdk {

     storage {
         cache {
-            # Defaults to system temp folder / cache
+            # Defaults to <system_temp_folder>/clearml_cache
             default_base_dir: "~/.clearml/cache"
             size {
                 # max_used_bytes = -1
@@ -33,4 +33,9 @@ echo "api.files_server: ${CLEARML_FILES_HOST}" >> ~/clearml.conf

 ./provider_entrypoint.sh

-python3 k8s_glue_example.py --queue ${QUEUE} --max-pods ${MAX_PODS} ${EXTRA_ARGS}
+if [[ -z "${K8S_GLUE_MAX_PODS}" ]]
+then
+    python3 k8s_glue_example.py --queue ${QUEUE} ${EXTRA_ARGS}
+else
+    python3 k8s_glue_example.py --queue ${QUEUE} --max-pods ${MAX_PODS} ${EXTRA_ARGS}
+fi
@@ -65,6 +65,19 @@ def parse_args():
         help="Limit the maximum number of pods that this service can run at the same time."
              "Should not be used with ports-mode"
     )
+    parser.add_argument(
+        "--use-owner-token", action="store_true", default=False,
+        help="Generate and use task owner token for the execution of each task"
+    )
+    parser.add_argument(
+        "--standalone-mode", action="store_true", default=False,
+        help="Do not use any network connects, assume everything is pre-installed"
+    )
+    parser.add_argument(
+        "--child-report-tags", type=str, nargs="+", default=None,
+        help="List of tags to send with the status reports from a worker that runs a task"
+    )

     return parser.parse_args()


@@ -85,9 +98,14 @@ def main():
         user_props_cb=user_props_cb, overrides_yaml=args.overrides_yaml, clearml_conf_file=args.pod_clearml_conf,
         template_yaml=args.template_yaml, extra_bash_init_script=K8sIntegration.get_ssh_server_bash(
             ssh_port_number=args.ssh_server_port) if args.ssh_server_port else None,
-        namespace=args.namespace, max_pods_limit=args.max_pods or None,
+        namespace=args.namespace, max_pods_limit=args.max_pods or None
     )
-    k8s.k8s_daemon(args.queue)
+    k8s.k8s_daemon(
+        args.queue,
+        use_owner_token=args.use_owner_token,
+        standalone_mode=args.standalone_mode,
+        child_report_tags=args.child_report_tags
+    )


 if __name__ == "__main__":
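For reference, a typical invocation of the updated example using the new flags might look like this (queue name and values are illustrative):

```bash
python3 k8s_glue_example.py --queue k8s_queue --max-pods 10 \
    --use-owner-token --child-report-tags k8s-glue
```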
@@ -2,13 +2,17 @@

 chmod +x /root/entrypoint.sh

-apt-get update -y
-apt-get dist-upgrade -y
-apt-get install -y curl unzip less locales
+apt-get update -qqy
+apt-get dist-upgrade -qqy
+apt-get install -qqy curl unzip less locales

 locale-gen en_US.UTF-8

-apt-get install -y curl python3-pip git
+apt-get update -qqy
+apt-get install -qqy curl gcc python3-dev python3-pip apt-transport-https lsb-release openssh-client git gnupg
 rm -rf /var/lib/apt/lists/*
 apt clean

 python3 -m pip install -U pip
-python3 -m pip install clearml-agent
-python3 -m pip install -U "cryptography>=2.9"
+python3 -m pip install --no-cache-dir clearml-agent
+python3 -m pip install -U --no-cache-dir "cryptography>=2.9"
@@ -1,4 +1,4 @@
-ARG TAG=3.7.12-alpine3.15
+ARG TAG=3.7.17-alpine3.18

 FROM python:${TAG} as build
@@ -20,7 +20,7 @@ FROM python:${TAG} as target

 WORKDIR /app

-ARG KUBECTL_VERSION=1.22.4
+ARG KUBECTL_VERSION=1.24.0

 # Not sure about these ENV vars
 # ENV LC_ALL=en_US.UTF-8
@@ -1,4 +1,4 @@
-ARG TAG=3.7.12-slim-bullseye
+ARG TAG=3.7.17-slim-bullseye

 FROM python:${TAG} as target
@@ -65,6 +65,10 @@ def parse_args():
         help="Limit the maximum number of pods that this service can run at the same time."
              "Should not be used with ports-mode"
     )
+    parser.add_argument(
+        "--use-owner-token", action="store_true", default=False,
+        help="Generate and use task owner token for the execution of each task"
+    )
     return parser.parse_args()


@@ -87,7 +91,7 @@ def main():
             ssh_port_number=args.ssh_server_port) if args.ssh_server_port else None,
         namespace=args.namespace, max_pods_limit=args.max_pods or None,
     )
-    k8s.k8s_daemon(args.queue)
+    k8s.k8s_daemon(args.queue, use_owner_token=args.use_owner_token)


 if __name__ == "__main__":
@@ -58,8 +58,8 @@ agent {
     # it solves passing user/token to git submodules.
     # this is a safer way to ensure multiple users using the same repository will
     # not accidentally leak credentials
-    # Only supported on Linux systems, it will be the default in future releases
-    # enable_git_ask_pass: false
+    # Note: this is only supported on Linux systems
+    # enable_git_ask_pass: true

     # in docker mode, if container's entrypoint automatically activated a virtual environment
     # use the activated virtual environment and install everything there
@@ -93,25 +93,43 @@ agent {
     # extra_index_url: ["https://allegroai.jfrog.io/clearml/api/pypi/public/simple"]
     extra_index_url: []

+    # additional flags to use when calling pip install, example: ["--use-deprecated=legacy-resolver", ]
+    # extra_pip_install_flags: []
+
+    # control the pytorch wheel resolving algorithm, options are: "pip", "direct", "none"
+    # Override with environment variable CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE
+    # "pip" (default): would automatically detect the cuda version, and supply pip with the correct
+    #     extra-index-url, based on pytorch.org tables
+    # "direct": would resolve a direct link to the pytorch wheel by parsing the pytorch.org pip repository
+    #     and matching the automatically detected cuda version with the required pytorch wheel.
+    #     if the exact cuda version is not found for the required pytorch wheel, it will try
+    #     a lower cuda version until a match is found
+    # "none": No resolver used, install pytorch like any other package
+    # pytorch_resolve: "pip"
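The `pytorch_resolve: "pip"` mode described above amounts to choosing a PyTorch extra-index-url that matches the detected CUDA version, falling back to lower CUDA versions when no exact match exists. A rough sketch of that idea (the URL pattern follows pytorch.org's public wheel indexes; the candidate list and fallback order are illustrative, not the agent's actual table):

```python
# Rough sketch: map a detected CUDA version to a PyTorch extra-index-url,
# falling back to the closest lower CUDA index. The candidate list below is
# illustrative, not clearml-agent's actual lookup table.
KNOWN_CUDA_INDEXES = ["118", "117", "116", "113", "111", "102"]

def torch_extra_index_url(detected_cuda: str) -> str:
    # detected_cuda is e.g. "117" for CUDA 11.7
    for cu in KNOWN_CUDA_INDEXES:
        if int(cu) <= int(detected_cuda):
            return "https://download.pytorch.org/whl/cu{}".format(cu)
    return "https://download.pytorch.org/whl/cpu"  # no usable CUDA: CPU wheels

print(torch_extra_index_url("117"))  # -> https://download.pytorch.org/whl/cu117
```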
     # additional conda channels to use when installing with conda package manager
     conda_channels: ["pytorch", "conda-forge", "defaults", ]
     # conda_full_env_update: false
     # conda_env_as_base_docker: false

     # set the priority packages to be installed before the rest of the required packages
+    # Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
     # priority_packages: ["cython", "numpy", "setuptools", ]

     # set the optional priority packages to be installed before the rest of the required packages,
     # In case a package installation fails, the package will be ignored,
     # and the virtual environment process will continue
+    # Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
     # priority_optional_packages: ["pygobject", ]

     # set the post packages to be installed after all the rest of the required packages
+    # Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
     # post_packages: ["horovod", ]

     # set the optional post packages to be installed after all the rest of the required packages,
     # In case a package installation fails, the package will be ignored,
     # and the virtual environment process will continue
+    # Note: this only controls the installation order of existing requirement packages (and does not add additional packages)
     # post_optional_packages: []

     # set to True to support torch nightly build installation,
@@ -168,8 +186,16 @@ agent {

     # optional arguments to pass to docker image
     # these are local for this agent and will not be updated in the experiment's docker_cmd section
+    # You can also pass host environments into the container with ["-e", "HOST_NAME=$HOST_NAME"]
     # extra_docker_arguments: ["--ipc=host", "-v", "/mnt/host/data:/mnt/data"]

+    # Allow the extra docker arg to override task level docker arg (if the same argument is passed on both),
+    # if set to False, a task docker arg will override the docker extra arg
+    # docker_args_extra_precedes_task: true
+
+    # prevent a task docker args to be used if already specified in the extra_docker_arguments
+    # protected_docker_extra_args: ["privileged", "security-opt", "network", "ipc"]
+
     # optional shell script to run in docker when started before the experiment is started
     # extra_docker_shell_script: ["apt-get install -y bindfs", ]
@@ -178,13 +204,19 @@ agent {
     # change to false to skip installation and decrease docker spin up time
     # docker_install_opencv_libs: true

+    # Allow passing host environments into docker container with Task's docker container args
+    # Example "-e HOST_NAME=$HOST_NAME"
+    # NOTICE this might introduce security risk allowing access to keys/secret on the host machine!
+    # Use with care!
+    # docker_allow_host_environ: false
+
     # set to true in order to force "docker pull" before running an experiment using a docker image.
     # This makes sure the docker image is updated.
     docker_force_pull: false

     default_docker: {
         # default docker image to use when running in docker mode
-        image: "nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04"
+        image: "nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04"

         # optional arguments to pass to docker image
         # arguments: ["--ipc=host"]
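Taken together, `extra_docker_arguments`, `docker_args_extra_precedes_task` and `protected_docker_extra_args` above describe a precedence rule: agent-level arguments can win over task-level ones, and protected switches are never taken from the task. A simplified sketch of that rule (the flag matching here is a naive prefix check, purely illustrative; the agent's real merging logic is more involved):

```python
# Simplified sketch of the precedence rule: task-level docker args are merged
# in only if they do not touch a protected switch already owned by the agent.
def merge_docker_args(extra_args, task_args,
                      protected=("privileged", "security-opt", "network", "ipc")):
    def is_protected(arg):
        # naive prefix check against the protected switch names
        return any(arg.lstrip("-").startswith(p) for p in protected)

    return list(extra_args) + [a for a in task_args if not is_protected(a)]

print(merge_docker_args(["--network=host"], ["--network=bridge", "-e", "FOO=1"]))
# -> ['--network=host', '-e', 'FOO=1']
```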
@@ -194,7 +226,7 @@ agent {
     # enterprise version only
     # match_rules: [
     #     {
-    #         image: "nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04"
+    #         image: "nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04"
     #         arguments: "-e define=value"
     #         match: {
     #             script{
@@ -215,7 +247,7 @@ agent {
     #         }
     #     },
     #     {
-    #         image: "nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04"
+    #         image: "nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04"
     #         arguments: "-e define=value"
     #         match: {
     #             # must match all requirements (not partial)
@@ -228,8 +260,6 @@ agent {
     #             # no repository matching required
     #             repository: ""
     #         }
     #         # no container image matching required (allow to replace one requested container with another)
     #         container: ""
     #         # no repository matching required
     #         project: ""
     #     }
@@ -283,7 +313,7 @@ sdk {

     storage {
         cache {
-            # Defaults to system temp folder / cache
+            # Defaults to <system_temp_folder>/clearml_cache
             default_base_dir: "~/.clearml/cache"
         }
@@ -469,7 +499,8 @@ sdk {
     # target_format: format used to encode contents before writing into the target file. Supported values are json,
     #                yaml, yml and bytes (in which case the file will be written in binary mode). Default is text mode.
     # overwrite: overwrite the target file in case it exists. Default is true.
     #
     # mode: file-system mode to be applied to the file after its creation. The mode string will be parsed into an
     #       integer (e.g. "0o777" for -rwxrwxrwx)
     # Example:
     #   files {
     #     myfile1 {
Binary image file changed, not shown (before: 1.1 MiB, after: 1018 KiB).
@@ -146,7 +146,7 @@ sdk {

     storage {
         cache {
-            # Defaults to system temp folder / cache
+            # Defaults to <system_temp_folder>/clearml_cache
             default_base_dir: "~/.clearml/cache"
         }
@@ -5,27 +5,30 @@
    "metadata": {},
    "source": [
     "# Auto-Magically Spin AWS EC2 Instances On Demand \n",
-    "# and Create a Dynamic Cluster Running *Trains-Agent*\n",
+    "# and Create a Dynamic Cluster Running *ClearML-Agent*\n",
     "\n",
-    "### Define your budget and execute the notebook, that's it\n",
-    "### You now have a fully managed cluster on AWS 🎉 🎊 "
+    "## Define your budget and execute the notebook, that's it\n",
+    "## You now have a fully managed cluster on AWS 🎉 🎊"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "**trains-agent**'s main goal is to quickly pull a job from an execution queue, setup the environment (as defined in the experiment, including git cloning, python packages etc.) then execute the experiment and monitor it.\n",
+    "**clearml-agent**'s main goal is to quickly pull a job from an execution queue, set up the environment (as defined in the experiment, including git cloning, python packages etc.), then execute the experiment and monitor it.\n",
     "\n",
     "This notebook defines a cloud budget (currently only AWS is supported, but feel free to expand with PRs), and spins an instance the minute a job is waiting for execution. It will also spin down idle machines, saving you some $$$ :)\n",
     "\n",
-    "Configuration steps\n",
+    "> **Note:**\n",
+    "> This is just an example of how you can use ClearML Agent to implement custom autoscaling. For a more structured autoscaler script, see [here](https://github.com/allegroai/clearml/blob/master/clearml/automation/auto_scaler.py).\n",
+    "\n",
+    "Configuration steps:\n",
     "- Define maximum budget to be used (instance type / number of instances).\n",
-    "- Create new execution *queues* in the **trains-server**.\n",
-    "- Define mapping between the created the *queues* and an instance budget.\n",
+    "- Create new execution *queues* in the **clearml-server**.\n",
+    "- Define mapping between the created *queues* and an instance budget.\n",
     "\n",
     "**TL;DR - This notebook:**\n",
-    "- Will spin instances if there are jobs in the execution *queues*, until it will hit the budget limit. \n",
+    "- Will spin instances if there are jobs in the execution *queues* until it will hit the budget limit.\n",
     "- If machines are idle, it will spin them down.\n",
     "\n",
     "The controller implementation itself is stateless, meaning you can always re-execute the notebook, if for some reason it stopped.\n",
@@ -39,7 +42,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "##### Install & import required packages"
+    "### Install & import required packages"
    ]
   },
   {
@@ -48,7 +51,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "!pip install trains-agent\n",
+    "!pip install clearml-agent\n",
     "!pip install boto3"
    ]
   },
@@ -56,7 +59,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "##### Define AWS instance types and configuration (Instance Type, EBS, AMI etc.)"
+    "### Define AWS instance types and configuration (Instance Type, EBS, AMI etc.)"
    ]
   },
   {
@@ -92,17 +95,17 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "##### Define machine budget per execution queue\n",
+    "### Define machine budget per execution queue\n",
     "\n",
-    "Now that we defined our budget, we need to connect it with the **Trains** cluster.\n",
+    "Now that we defined our budget, we need to connect it with the **ClearML** cluster.\n",
     "\n",
     "We map each queue to a resource type (instance type).\n",
     "\n",
-    "Create two queues in the WebUI:\n",
-    "- Browse to http://your_trains_server_ip:8080/workers-and-queues/queues\n",
+    "Create two queues in the Web UI:\n",
+    "- Browse to http://your_clearml_server_ip:8080/workers-and-queues/queues\n",
     "- Then click on the \"New Queue\" button and name your queues \"aws_normal\" and \"aws_high\" respectively\n",
     "\n",
-    "The QUEUES dictionary hold the mapping between the queue name and the type/number of instances to spin connected to the specific queue.\n",
+    "The QUEUES dictionary holds the mapping between the queue name and the type/number of instances to spin connected to the specific queue.\n",
     "```\n",
     "QUEUES = {\n",
     "    'queue_name': [(\"instance-type-as-defined-in-RESOURCE_CONFIGURATIONS\", max_number_of_instances), ]\n",
@@ -116,7 +119,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Trains-Agent Queues - Machines budget per Queue\n",
+    "# ClearML Agent Queues - Machines budget per Queue\n",
     "# Per queue: list of (machine type as defined in RESOURCE_CONFIGURATIONS,\n",
     "# max instances for the specific queue). Order machines from most preferred to least.\n",
     "QUEUES = {\n",
@@ -129,7 +132,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "##### Credentials for your AWS account, as well as for your **Trains-Server**"
+    "### Credentials for your AWS account, as well as for your **ClearML Server**"
    ]
   },
   {
@@ -143,24 +146,25 @@
     "CLOUD_CREDENTIALS_SECRET = \"\"\n",
     "CLOUD_CREDENTIALS_REGION = \"us-east-1\"\n",
     "\n",
-    "# TRAINS configuration\n",
-    "TRAINS_SERVER_WEB_SERVER = \"http://localhost:8080\"\n",
-    "TRAINS_SERVER_API_SERVER = \"http://localhost:8008\"\n",
-    "TRAINS_SERVER_FILES_SERVER = \"http://localhost:8081\"\n",
-    "# TRAINS credentials\n",
-    "TRAINS_ACCESS_KEY = \"\"\n",
-    "TRAINS_SECRET_KEY = \"\"\n",
-    "# Git User/Pass to be used by trains-agent,\n",
+    "# CLEARML configuration\n",
+    "CLEARML_WEB_SERVER = \"http://localhost:8080\"\n",
+    "CLEARML_API_SERVER = \"http://localhost:8008\"\n",
+    "CLEARML_FILES_SERVER = \"http://localhost:8081\"\n",
+    "# CLEARML credentials\n",
+    "CLEARML_API_ACCESS_KEY = \"\"\n",
+    "CLEARML_API_SECRET_KEY = \"\"\n",
+    "# Git User/Pass to be used by clearml-agent,\n",
     "# leave empty if image already contains git ssh-key\n",
-    "TRAINS_GIT_USER = \"\"\n",
-    "TRAINS_GIT_PASS = \"\"\n",
+    "CLEARML_AGENT_GIT_USER = \"\"\n",
+    "CLEARML_AGENT_GIT_PASS = \"\"\n",
     "\n",
-    "# Additional fields for trains.conf file created on the remote instance\n",
-    "# for example: 'agent.default_docker.image: \"nvidia/cuda:10.0-cudnn7-runtime\"'\n",
-    "EXTRA_TRAINS_CONF = \"\"\"\n",
+    "# Additional fields for clearml.conf file created on the remote instance\n",
+    "# for example: 'agent.default_docker.image: \"nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04\"'\n",
+    "\n",
+    "EXTRA_CLEARML_CONF = \"\"\"\n",
     "\"\"\"\n",
     "\n",
-    "# Bash script to run on instances before running trains-agent\n",
+    "# Bash script to run on instances before running clearml-agent\n",
     "# Example: \"\"\"\n",
     "# echo \"This is the first line\"\n",
     "# echo \"This is the second line\"\n",
@@ -168,9 +172,9 @@
     "EXTRA_BASH_SCRIPT = \"\"\"\n",
     "\"\"\"\n",
     "\n",
-    "# Default docker for trains-agent when running in docker mode (requires docker v19.03 and above). \n",
-    "# Leave empty to run trains-agent in non-docker mode.\n",
-    "DEFAULT_DOCKER_IMAGE = \"nvidia/cuda\""
+    "# Default docker for clearml-agent when running in docker mode (requires docker v19.03 and above).\n",
+    "# Leave empty to run clearml-agent in non-docker mode.\n",
+    "CLEARML_AGENT_DOCKER_IMAGE = \"nvidia/cuda\""
    ]
   },
   {
@@ -192,7 +196,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "##### Import Packages and Budget Definition Sanity Check"
+    "### Import Packages and Budget Definition Sanity Check"
    ]
   },
   {
@@ -209,7 +213,7 @@
     "from time import sleep, time\n",
     "\n",
     "import boto3\n",
-    "from trains_agent.backend_api.session.client import APIClient"
+    "from clearml_agent.backend_api.session.client import APIClient"
    ]
   },
   {
@@ -227,36 +231,36 @@
     "        \"A resource name can only appear in a single queue definition.\"\n",
     "    )\n",
     "\n",
-    "# Encode EXTRA_TRAINS_CONF for later bash script usage\n",
-    "EXTRA_TRAINS_CONF_ENCODED = \"\\\\\\\"\".join(EXTRA_TRAINS_CONF.split(\"\\\"\"))"
+    "# Encode EXTRA_CLEARML_CONF for later bash script usage\n",
+    "EXTRA_CLEARML_CONF_ENCODED = \"\\\\\\\"\".join(EXTRA_CLEARML_CONF.split(\"\\\"\"))"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "##### Cloud specific implementation of spin up/down - currently supports AWS only"
+    "### Cloud specific implementation of spin up/down - currently supports AWS only"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 1,
    "metadata": {},
    "outputs": [],
    "source": [
     "# Cloud-specific implementation (currently, only AWS EC2 is supported)\n",
     "def spin_up_worker(resource, worker_id_prefix, queue_name):\n",
     "    \"\"\"\n",
-    "    Creates a new worker for trains.\n",
+    "    Creates a new worker for clearml.\n",
     "    First, create an instance in the cloud and install some required packages.\n",
-    "    Then, define trains-agent environment variables and run \n",
-    "    trains-agent for the specified queue.\n",
+    "    Then, define clearml-agent environment variables and run\n",
+    "    clearml-agent for the specified queue.\n",
     "    NOTE: - Will wait until instance is running\n",
     "          - This implementation assumes the instance image already has docker installed\n",
     "\n",
     "    :param str resource: resource name, as defined in BUDGET and QUEUES.\n",
     "    :param str worker_id_prefix: worker name prefix\n",
-    "    :param str queue_name: trains queue to listen to\n",
+    "    :param str queue_name: clearml queue to listen to\n",
     "    \"\"\"\n",
     "    resource_conf = RESOURCE_CONFIGURATIONS[resource]\n",
     "    # Add worker type and AWS instance type to the worker name.\n",
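The `EXTRA_CLEARML_CONF_ENCODED` line above escapes double quotes so that the extra configuration text survives being interpolated into the bash user-data script below; the split/join is equivalent to a plain replace:

```python
# The split/join used in the notebook escapes every double quote, i.e. it is
# equivalent to str.replace('"', '\\"').
EXTRA_CLEARML_CONF = 'agent.default_docker.image: "nvidia/cuda"'

encoded = "\\\"".join(EXTRA_CLEARML_CONF.split("\""))
assert encoded == EXTRA_CLEARML_CONF.replace('"', '\\"')
print(encoded)  # agent.default_docker.image: \"nvidia/cuda\"
```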
@@ -267,8 +271,8 @@
     "    )\n",
     "\n",
     "    # user_data script will automatically run when the instance is started. \n",
-    "    # It will install the required packages for trains-agent configure it using \n",
-    "    # environment variables and run trains-agent on the required queue\n",
+    "    # It will install the required packages for clearml-agent configure it using\n",
+    "    # environment variables and run clearml-agent on the required queue\n",
     "    user_data = \"\"\"#!/bin/bash\n",
     "    sudo apt-get update\n",
     "    sudo apt-get install -y python3-dev\n",
@@ -278,36 +282,36 @@
     "    sudo apt-get install -y build-essential\n",
     "    python3 -m pip install -U pip\n",
     "    python3 -m pip install virtualenv\n",
-    "    python3 -m virtualenv trains_agent_venv\n",
-    "    source trains_agent_venv/bin/activate\n",
-    "    python -m pip install trains-agent\n",
-    "    echo 'agent.git_user=\\\"{git_user}\\\"' >> /root/trains.conf\n",
-    "    echo 'agent.git_pass=\\\"{git_pass}\\\"' >> /root/trains.conf\n",
-    "    echo \"{trains_conf}\" >> /root/trains.conf\n",
-    "    export TRAINS_API_HOST={api_server}\n",
-    "    export TRAINS_WEB_HOST={web_server}\n",
-    "    export TRAINS_FILES_HOST={files_server}\n",
+    "    python3 -m virtualenv clearml_agent_venv\n",
+    "    source clearml_agent_venv/bin/activate\n",
+    "    python -m pip install clearml-agent\n",
+    "    echo 'agent.git_user=\\\"{git_user}\\\"' >> /root/clearml.conf\n",
+    "    echo 'agent.git_pass=\\\"{git_pass}\\\"' >> /root/clearml.conf\n",
+    "    echo \"{clearml_conf}\" >> /root/clearml.conf\n",
+    "    export CLEARML_API_HOST={api_server}\n",
+    "    export CLEARML_WEB_HOST={web_server}\n",
+    "    export CLEARML_FILES_HOST={files_server}\n",
     "    export DYNAMIC_INSTANCE_ID=`curl http://169.254.169.254/latest/meta-data/instance-id`\n",
-    "    export TRAINS_WORKER_ID={worker_id}:$DYNAMIC_INSTANCE_ID\n",
-    "    export TRAINS_API_ACCESS_KEY='{access_key}'\n",
-    "    export TRAINS_API_SECRET_KEY='{secret_key}'\n",
+    "    export CLEARML_WORKER_ID={worker_id}:$DYNAMIC_INSTANCE_ID\n",
+    "    export CLEARML_API_ACCESS_KEY='{access_key}'\n",
+    "    export CLEARML_API_SECRET_KEY='{secret_key}'\n",
     "    {bash_script}\n",
     "    source ~/.bashrc\n",
-    "    python -m trains_agent --config-file '/root/trains.conf' daemon --queue '{queue}' {docker}\n",
+    "    python -m clearml_agent --config-file '/root/clearml.conf' daemon --queue '{queue}' {docker}\n",
     "    shutdown\n",
     "    \"\"\".format(\n",
-    "        api_server=TRAINS_SERVER_API_SERVER,\n",
-    "        web_server=TRAINS_SERVER_WEB_SERVER,\n",
-    "        files_server=TRAINS_SERVER_FILES_SERVER,\n",
+    "        api_server=CLEARML_API_SERVER,\n",
+    "        web_server=CLEARML_WEB_SERVER,\n",
+    "        files_server=CLEARML_FILES_SERVER,\n",
     "        worker_id=worker_id,\n",
-    "        access_key=TRAINS_ACCESS_KEY,\n",
-    "        secret_key=TRAINS_SECRET_KEY,\n",
+    "        access_key=CLEARML_API_ACCESS_KEY,\n",
+    "        secret_key=CLEARML_API_SECRET_KEY,\n",
     "        queue=queue_name,\n",
-    "        git_user=TRAINS_GIT_USER,\n",
-    "        git_pass=TRAINS_GIT_PASS,\n",
-    "        trains_conf=EXTRA_TRAINS_CONF_ENCODED,\n",
+    "        git_user=CLEARML_AGENT_GIT_USER,\n",
+    "        git_pass=CLEARML_AGENT_GIT_PASS,\n",
+    "        clearml_conf=EXTRA_CLEARML_CONF_ENCODED,\n",
     "        bash_script=EXTRA_BASH_SCRIPT,\n",
-    "        docker=\"--docker '{}'\".format(DEFAULT_DOCKER_IMAGE) if DEFAULT_DOCKER_IMAGE else \"\"\n",
+    "        docker=\"--docker '{}'\".format(CLEARML_AGENT_DOCKER_IMAGE) if CLEARML_AGENT_DOCKER_IMAGE else \"\"\n",
     "    )\n",
     "\n",
     "    ec2 = boto3.client(\n",
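The completed `user_data` script runs via cloud-init when the instance first boots. The EC2 call that consumes it is outside this hunk; presumably it passes the script along these lines (a sketch using boto3's standard `run_instances` API, with placeholder AMI and instance values):

```python
# Sketch: launch the worker instance with the user_data boot script built
# above. The AMI id and instance type are placeholders, not values from the
# notebook.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder: image with docker installed
    InstanceType="g4dn.xlarge",       # placeholder GPU instance type
    MinCount=1,
    MaxCount=1,
    UserData=user_data,               # the bash script assembled above
)
instance_id = response["Instances"][0]["InstanceId"]
```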
@@ -405,7 +409,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "###### Controller Implementation and Logic"
+    "#### Controller Implementation and Logic"
    ]
   },
   {
@@ -430,18 +434,18 @@
     "\n",
     "    # Internal definitions\n",
     "    workers_prefix = \"dynamic_aws\"\n",
-    "    # Worker's id in trains would be composed from:\n",
+    "    # Worker's id in clearml would be composed from:\n",
     "    # prefix, name, instance_type and cloud_id, separated by ':'\n",
     "    workers_pattern = re.compile(\n",
     "        r\"^(?P<prefix>[^:]+):(?P<name>[^:]+):(?P<instance_type>[^:]+):(?P<cloud_id>[^:]+)\"\n",
     "    )\n",
     "\n",
-    "    # Set up the environment variables for trains\n",
-    "    os.environ[\"TRAINS_API_HOST\"] = TRAINS_SERVER_API_SERVER\n",
-    "    os.environ[\"TRAINS_WEB_HOST\"] = TRAINS_SERVER_WEB_SERVER\n",
-    "    os.environ[\"TRAINS_FILES_HOST\"] = TRAINS_SERVER_FILES_SERVER\n",
-    "    os.environ[\"TRAINS_API_ACCESS_KEY\"] = TRAINS_ACCESS_KEY\n",
-    "    os.environ[\"TRAINS_API_SECRET_KEY\"] = TRAINS_SECRET_KEY\n",
+    "    # Set up the environment variables for clearml\n",
+    "    os.environ[\"CLEARML_API_HOST\"] = CLEARML_API_SERVER\n",
+    "    os.environ[\"CLEARML_WEB_HOST\"] = CLEARML_WEB_SERVER\n",
+    "    os.environ[\"CLEARML_FILES_HOST\"] = CLEARML_FILES_SERVER\n",
+    "    os.environ[\"CLEARML_API_ACCESS_KEY\"] = CLEARML_API_ACCESS_KEY\n",
+    "    os.environ[\"CLEARML_API_SECRET_KEY\"] = CLEARML_API_SECRET_KEY\n",
     "    api_client = APIClient()\n",
     "\n",
     "    # Verify the requested queues exist and create those that don't exist\n",
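The `workers_pattern` regex above splits a worker id of the form `prefix:name:instance_type:cloud_id` into named groups; for example (the worker id below is made up):

```python
# Parsing a worker id with the notebook's pattern; the id is illustrative.
import re

workers_pattern = re.compile(
    r"^(?P<prefix>[^:]+):(?P<name>[^:]+):(?P<instance_type>[^:]+):(?P<cloud_id>[^:]+)"
)

m = workers_pattern.match("dynamic_aws:aws4gpu:g4dn.4xlarge:i-0123456789abcdef0")
print(m["cloud_id"])  # -> i-0123456789abcdef0
```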
@@ -520,7 +524,7 @@
     "            # skip resource types that might be needed\n",
     "            if resources in required_idle_resources:\n",
     "                continue\n",
-    "            # Remove from both aws and trains all instances that are \n",
+    "            # Remove from both aws and clearml all instances that are\n",
     "            # idle for longer than MAX_IDLE_TIME_MIN\n",
     "            if time() - timestamp > MAX_IDLE_TIME_MIN * 60.0:\n",
     "                cloud_id = workers_pattern.match(worker.id)[\"cloud_id\"]\n",
@@ -535,7 +539,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "##### Execute Forever* (the controller is stateless, so you can always re-execute the notebook)"
+    "### Execute Forever* (the controller is stateless, so you can always re-execute the notebook)"
    ]
   },
   {
@@ -584,4 +588,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 2
 }
@@ -8,7 +8,7 @@ pyparsing>=2.0.3,<3.1.0
 python-dateutil>=2.4.2,<2.9.0
 pyjwt>=2.4.0,<2.7.0
 PyYAML>=3.12,<6.1
-requests>=2.20.0,<2.29.0
+requests>=2.20.0,<=2.31.0
 six>=1.13.0,<1.17.0
 typing>=3.6.4,<3.8.0 ; python_version < '3.5'
 urllib3>=1.21.1,<1.27.0