Mirror of https://github.com/clearml/clearml-agent, synced 2025-06-26 18:16:15 +00:00

Compare commits (45 commits)

README.md
@@ -9,14 +9,14 @@ ML-Ops scheduler & orchestration solution supporting Linux, macOS and Windows**

[PyPI pyversions](https://img.shields.io/pypi/pyversions/clearml-agent.svg)
[PyPI version](https://img.shields.io/pypi/v/clearml-agent.svg)
[PyPI downloads](https://pypi.org/project/clearml-agent/)
[Artifact Hub](https://artifacthub.io/packages/search?repo=allegroai)

</div>

---

### ClearML-Agent
#### *Formerly known as Trains Agent*

* Run jobs (experiments) on any local or cloud-based resource
* Implement optimized resource utilization policies
@@ -24,23 +24,31 @@ ML-Ops scheduler & orchestration solution supporting Linux, macOS and Windows**

* Launch-and-Forget service containers
* [Cloud autoscaling](https://clear.ml/docs/latest/docs/guides/services/aws_autoscaler)
* [Customizable cleanup](https://clear.ml/docs/latest/docs/guides/services/cleanup_service)
* Advanced [pipeline building and execution](https://clear.ml/docs/latest/docs/guides/frameworks/pytorch/notebooks/table/tabular_training_pipeline)

It is a zero configuration fire-and-forget execution agent, providing a full ML/DL cluster solution.

**Full Automation in 5 steps**

1. ClearML Server [self-hosted](https://github.com/allegroai/clearml-server) or [free tier hosting](https://app.clear.ml)
2. `pip install clearml-agent` ([install](#installing-the-clearml-agent) the ClearML Agent on any GPU machine: on-premises / cloud / ...)
3. Create a [job](https://github.com/allegroai/clearml/docs/clearml-task.md) or add [ClearML](https://github.com/allegroai/clearml) to your code with just 2 lines (see the sketch below)
4. Change the [parameters](#using-the-clearml-agent) in the UI & schedule for [execution](#using-the-clearml-agent) (or automate with an [AutoML pipeline](#automl-and-orchestration-pipelines-))
5. :chart_with_downwards_trend: :chart_with_upwards_trend: :eyes: :beer:
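
Step 3's two-line integration, as a minimal sketch (the project and task names below are placeholders):

```python
from clearml import Task

# registers this run as an experiment on the ClearML Server
task = Task.init(project_name="examples", task_name="my experiment")
```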

"All the Deep/Machine-Learning DevOps your research needs, and then some... Because ain't nobody got time for that"

**Try ClearML now** [Self Hosted](https://github.com/allegroai/clearml-server) or [Free tier Hosting](https://app.clear.ml)

<a href="https://app.clear.ml"><img src="https://github.com/allegroai/clearml-agent/blob/master/docs/screenshots.gif?raw=true" width="100%"></a>

### Simple, Flexible Experiment Orchestration

**The ClearML Agent was built to address the DL/ML R&D DevOps needs:**

* Easily add & remove machines from the cluster

@@ -56,20 +64,23 @@ It is a zero configuration fire-and-forget execution agent, providing a full ML/

*epsilon - Because we are :triangular_ruler: and nothing is really zero work

### Kubernetes Integration (Optional)

We think Kubernetes is awesome, but it should be a choice. We designed `clearml-agent` so you can run bare-metal or inside a pod with any mix that fits your environment.

Find Dockerfiles in the [docker](./docker) dir and a Helm chart in https://github.com/allegroai/clearml-helm-charts

#### Benefits of integrating existing K8s with ClearML-Agent

- ClearML-Agent adds the missing scheduling capabilities to K8s
- Allowing for more flexible automation from code
- A programmatic interface for an easier learning curve (and debugging)
- Seamless integration with the ML/DL experiment manager
- Web UI for customization, scheduling & prioritization of jobs

**Two K8s integration flavours**

- Spin ClearML-Agent as a long-lasting service pod
  - use the [clearml-agent](https://hub.docker.com/r/allegroai/clearml-agent) docker image
  - map the docker socket into the pod (soon to be replaced by [podman](https://github.com/containers/podman))

@@ -77,57 +88,66 @@ Find Dockerfiles in the [docker](./docker) dir and a helm Chart in https://githu

  - benefits: full use of the ClearML scheduling, no need to worry about wrong container images / lost pods etc.
  - downside: sibling containers
- Kubernetes Glue, map ClearML jobs directly to K8s jobs (see the sketch after this list)
  - Run the [clearml-k8s glue](https://github.com/allegroai/clearml-agent/blob/master/examples/k8s_glue_example.py) on a K8s CPU node
  - The clearml-k8s glue pulls jobs from the ClearML job execution queue and prepares a K8s job (based on the provided YAML template)
  - Inside the pod itself the clearml-agent will install the job (experiment) environment and spin and monitor the experiment's process
  - benefits: Kubernetes full view of all running jobs in the system
  - downside: no real scheduling (k8s scheduler), no docker image verification (post-mortem only)
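
A minimal launch sketch for the glue flavour (assuming the example script exposes a `--queue` flag as in recent versions; the queue name is a placeholder):

```bash
# run the Kubernetes glue on a CPU node, pulling from the "k8s_glue" queue
python examples/k8s_glue_example.py --queue k8s_glue
```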

### Using the ClearML Agent

**Full scale HPC with a click of a button**

The ClearML Agent is a job scheduler that listens on job queue(s), pulls jobs, sets up the job environments, executes the jobs, and monitors their progress.

Any 'Draft' experiment can be scheduled for execution by a ClearML agent.

A previously run experiment can be put into 'Draft' state by either of two methods:
* Using the **'Reset'** action from the experiment right-click context menu in the ClearML UI - this will clear any results and artifacts the previous run had created.
* Using the **'Clone'** action from the experiment right-click context menu in the ClearML UI - this will create a new 'Draft' experiment with the same configuration as the original experiment.

An experiment is scheduled for execution using the **'Enqueue'** action from the experiment right-click context menu in the ClearML UI and selecting the execution queue.

See [creating an experiment and enqueuing it for execution](#from-scratch).

Once an experiment is enqueued, it will be picked up and executed by a ClearML agent monitoring this queue.

The ClearML UI Workers & Queues page provides ongoing execution information:
- Workers Tab: Monitor your cluster
  - Review available resources
  - Monitor machine statistics (CPU / GPU / Disk / Network)
- Queues Tab:
  - Control the scheduling order of jobs
  - Cancel or abort job execution
  - Move jobs between execution queues

#### What The ClearML Agent Actually Does

The ClearML Agent executes experiments using the following process:
- Create a new virtual environment (or launch the selected docker image)
- Clone the code into the virtual environment (or inside the docker)
- Install python packages based on the package requirements listed for the experiment
  - Special note for PyTorch: the ClearML Agent will automatically select the torch packages based on the `CUDA_VERSION` environment variable of the machine (see the sketch below)
- Execute the code, while monitoring the process
- Log all stdout/stderr in the ClearML UI, including the cloning and installation process, for easy debugging
- Monitor the execution and allow you to manually abort the job using the ClearML UI (or, in the unfortunate case of a code crash, catch the error and signal that the experiment has failed)
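
A brief illustration of the PyTorch note above (the CUDA version is an example, not a requirement): exporting `CUDA_VERSION` before starting the agent steers which torch build is resolved:

```bash
# tell the agent this machine runs CUDA 11.2 so it picks a matching torch build
export CUDA_VERSION=11.2
clearml-agent daemon --queue default
```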

#### System Design & Flow

<img src="https://github.com/allegroai/clearml-agent/blob/master/docs/clearml_architecture.png" width="100%" alt="clearml-architecture">

#### Installing the ClearML Agent

```bash
pip install clearml-agent
```

@@ -137,6 +157,7 @@ pip install clearml-agent

#### ClearML Agent Usage Examples

Full Interface and capabilities are available with

```bash
clearml-agent --help
clearml-agent daemon --help
```

@@ -148,7 +169,8 @@ clearml-agent daemon --help

```bash
clearml-agent init
```

Note: The ClearML Agent uses a cache folder to cache pip packages, apt packages and cloned repositories. The default ClearML Agent cache folder is `~/.clearml`.

See full details in your configuration file at `~/clearml.conf`.

@@ -158,29 +180,36 @@ They are designed to share the same configuration file, see example [here](docs/

#### Running the ClearML Agent

For debug and experimentation, start the ClearML agent in `foreground` mode, where all the output is printed to screen:

```bash
clearml-agent daemon --queue default --foreground
```

For actual service mode, all the stdout will be stored automatically into a temporary file (no need to pipe).
Notice: with the `--detached` flag, the *clearml-agent* will run in the background:

```bash
clearml-agent daemon --detached --queue default
```

GPU allocation is controlled via the standard OS environment variable `NVIDIA_VISIBLE_DEVICES` or the `--gpus` flag (or disabled with `--cpu-only`).

If no flag is set and the `NVIDIA_VISIBLE_DEVICES` variable doesn't exist, all GPUs will be allocated for the `clearml-agent`.<br>
If the `--cpu-only` flag is set, or `NVIDIA_VISIBLE_DEVICES="none"`, no GPU will be allocated for the `clearml-agent`; a minimal sketch of the "none" case follows below.
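
A one-line illustration of the "none" case (assuming a bash-like shell):

```bash
# force a CPU-only worker by hiding all GPUs from the agent
NVIDIA_VISIBLE_DEVICES=none clearml-agent daemon --queue default
```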

Example: spin two agents, one per GPU, on the same machine.
Notice: with the `--detached` flag, the *clearml-agent* will run in the background:

```bash
clearml-agent daemon --detached --gpus 0 --queue default
clearml-agent daemon --detached --gpus 1 --queue default
```

Example: spin two agents, pulling from the dedicated `dual_gpu` queue, two GPUs per agent:

```bash
clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu
clearml-agent daemon --detached --gpus 2,3 --queue dual_gpu
```

@@ -189,23 +218,29 @@ clearml-agent daemon --detached --gpus 2,3 --queue dual_gpu

##### Starting the ClearML Agent in docker mode

For debug and experimentation, start the ClearML agent in `foreground` mode, where all the output is printed to screen:

```bash
clearml-agent daemon --queue default --docker --foreground
```

For actual service mode, all the stdout will be stored automatically into a file (no need to pipe).
Notice: with the `--detached` flag, the *clearml-agent* will run in the background:

```bash
clearml-agent daemon --detached --queue default --docker
```

Example: spin two agents, one per GPU, on the same machine, with the default nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04 docker:

```bash
clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
clearml-agent daemon --detached --gpus 1 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
```

Example: spin two agents, pulling from the dedicated `dual_gpu` queue, two GPUs per agent, with the default nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04 docker:

```bash
clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
clearml-agent daemon --detached --gpus 2,3 --queue dual_gpu --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
```

@@ -216,55 +251,61 @@ clearml-agent daemon --detached --gpus 2,3 --queue dual_gpu --docker nvidia/cuda

Priority Queues are also supported, example use case:

High priority queue: `important_jobs`; low priority queue: `default`

```bash
clearml-agent daemon --queue important_jobs default
```

The **ClearML Agent** will first try to pull jobs from the `important_jobs` queue; only then will it fetch a job from the `default` queue.

Adding queues, managing job order within a queue, and moving jobs between queues are all available using the Web UI; see the example on our [free server](https://app.clear.ml/workers-and-queues/queues).

##### Stopping the ClearML Agent

To stop a **ClearML Agent** running in the background, run the same command line used to start the agent with `--stop` appended. For example, to stop the first of the above-shown same-machine, single-GPU agents:

```bash
clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04 --stop
```

### How do I create an experiment on the ClearML Server? <a name="from-scratch"></a>

* Integrate [ClearML](https://github.com/allegroai/clearml) with your code
* Execute the code on your machine (manually / PyCharm / Jupyter Notebook)
* As your code is running, **ClearML** creates an experiment logging all the necessary execution information:
  - Git repository link and commit ID (or an entire jupyter notebook)
  - Git diff (we're not saying you never commit and push, but still...)
  - Python packages used by your code (including specific versions used)
  - Hyper-Parameters
  - Input Artifacts

  You now have a 'template' of your experiment with everything required for automated execution

* In the ClearML UI, right-click on the experiment and select 'clone'. A copy of your experiment will be created.
* You now have a new draft experiment cloned from your original experiment; feel free to edit it
  - Change the Hyper-Parameters
  - Switch to the latest code base of the repository
  - Update package versions
  - Select a specific docker image to run in (see docker execution mode section)
  - Or simply change nothing to run the same experiment again...
* Schedule the newly created experiment for execution: right-click the experiment and select 'enqueue'

### ClearML-Agent Services Mode <a name="services"></a>

ClearML-Agent Services is a special mode of ClearML-Agent that provides the ability to launch long-lasting jobs that previously had to be executed on local / dedicated machines. It allows a single agent to launch multiple dockers (Tasks) for different use cases. To name a few use cases: auto-scaler services (spinning instances when the need arises and the budget allows), Controllers (implementing pipelines and more sophisticated DevOps logic), Optimizers (such as Hyper-parameter Optimization or sweeping), and Applications (such as interactive Bokeh apps for increased data transparency).

ClearML-Agent Services mode will spin **any** task enqueued into the specified queue. Every task launched by ClearML-Agent Services will be registered as a new node in the system, providing tracking and transparency capabilities. Currently, clearml-agent in services mode supports a CPU-only configuration. ClearML-Agent services mode can be launched alongside GPU agents.

```bash
clearml-agent daemon --services-mode --detached --queue services --create-queue --docker ubuntu:18.04 --cpu-only
```

@@ -272,22 +313,27 @@ clearml-agent daemon --services-mode --detached --queue services --create-queue

**Note**: It is the user's responsibility to make sure the proper tasks are pushed into the specified queue.

### AutoML and Orchestration Pipelines <a name="automl-pipes"></a>

The ClearML Agent can also be used to implement AutoML orchestration and Experiment Pipelines in conjunction with the ClearML package.

Sample AutoML & Orchestration examples can be found in the ClearML [example/automation](https://github.com/allegroai/clearml/tree/master/examples/automation) folder.

AutoML examples:
- [Toy Keras training experiment](https://github.com/allegroai/clearml/blob/master/examples/optimization/hyper-parameter-optimization/base_template_keras_simple.py)
  - In order to create an experiment-template in the system, this code must be executed once manually
- [Random Search over the above Keras experiment-template](https://github.com/allegroai/clearml/blob/master/examples/automation/manual_random_param_search_example.py)
  - This example will create multiple copies of the Keras experiment-template, with different hyper-parameter combinations

Experiment Pipeline examples:
- [First step experiment](https://github.com/allegroai/clearml/blob/master/examples/automation/task_piping_example.py)
  - This example will "process data", and once done, will launch a copy of the 'second step' experiment-template
- [Second step experiment](https://github.com/allegroai/clearml/blob/master/examples/automation/toy_base_task.py)
  - In order to create an experiment-template in the system, this code must be executed once manually

### License

@@ -39,6 +39,13 @@

# default false, only the working directory will be added to the PYTHONPATH
# force_git_root_python_path: false

# if set, use GIT_ASKPASS to pass user/pass when cloning / fetching repositories
# it solves passing user/token to git submodules.
# this is a safer way to ensure multiple users using the same repository will
# not accidentally leak credentials
# Only supported on Linux systems, it will be the default in future releases
# enable_git_ask_pass: false
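
A sketch of opting in, inside the `agent` section of the configuration file (hedged: the comment above notes this is Linux-only for now):

```
agent {
    # use GIT_ASKPASS to pass git credentials instead of embedding them in clone URLs
    enable_git_ask_pass: true
}
```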

# in docker mode, if container's entrypoint automatically activated a virtual environment
# use the activated virtual environment and install everything there
# set to False to disable, and always create a new venv inheriting from the system_site_packages

@@ -83,7 +90,7 @@

# set the optional priority packages to be installed before the rest of the required packages,
# In case a package installation fails, the package will be ignored,
# and the virtual environment process will continue
priority_optional_packages: ["pygobject", ]

# set the post packages to be installed after all the rest of the required packages
# post_packages: ["horovod", ]

@@ -130,6 +137,12 @@

},

translate_ssh: true,

# set "disable_ssh_mount: true" to disable the automatic mount of the ~/.ssh folder into the docker containers
# default is false, automatically mounts ~/.ssh
# Must be set to True if using "clearml-session" with this agent!
# disable_ssh_mount: false

# reload configuration file every daemon execution
reload_config: false,

@@ -220,16 +233,20 @@

parse_embedded_urls: true
}

# Maximum execution time (in seconds) for a Task's abort function call
abort_callback_max_timeout: 1800
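
For context, the task side arms this mechanism through the companion clearml SDK. A hedged sketch, assuming `Task.register_abort_callback` is available as in recent clearml releases (names and timeout below are placeholders):

```python
from clearml import Task

task = Task.init(project_name="examples", task_name="graceful shutdown demo")

def on_abort():
    # flush state here before the agent terminates the process
    print("cleaning up...")

# the agent-side wait is capped by agent.abort_callback_max_timeout (1800 sec above)
task.register_abort_callback(on_abort, callback_execution_timeout=60)
```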

# allow to set internal mount points inside the docker,
# especially useful for non-root docker container images.
docker_internal_mounts {
    sdk_cache: "/clearml_agent_cache"
    apt_cache: "/var/cache/apt/archives"
    ssh_folder: "~/.ssh"
    ssh_ro_folder: "/.ssh"
    pip_cache: "/root/.cache/pip"
    poetry_cache: "/root/.cache/pypoetry"
    vcs_cache: "/root/.clearml/vcs-cache"
    venv_build: "~/.clearml/venvs-builds"
    pip_download: "/root/.clearml/pip-download-cache"
}
@@ -28,6 +28,9 @@

pool_maxsize: 512
pool_connections: 512

# Override the default http method, use "put" if working behind a GCP load balancer (default: "get")
# default_method: "get"
}
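
The same default can be set from the environment; a one-line example (the variable name appears in the validation code further below):

```bash
# accepted values: get / post / put (any case)
export CLEARML_API_DEFAULT_REQ_METHOD=put
```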

auth {
@@ -8,13 +8,14 @@ from .datamodel import DataModel

from .defs import ENV_API_DEFAULT_REQ_METHOD


if ENV_API_DEFAULT_REQ_METHOD.get().upper() not in ("GET", "POST", "PUT"):
    raise ValueError(
        "CLEARML_API_DEFAULT_REQ_METHOD environment variable must be 'get', 'post' or 'put' (any case is allowed)."
    )


class Request(ApiModel):
    def_method = ENV_API_DEFAULT_REQ_METHOD.get(default="get")
    _method = ENV_API_DEFAULT_REQ_METHOD.get(default="get")

    def __init__(self, **kwargs):
@@ -14,8 +14,9 @@ from requests.auth import HTTPBasicAuth

from six.moves.urllib.parse import urlparse, urlunparse

from .callresult import CallResult
from .defs import (
    ENV_VERBOSE, ENV_HOST, ENV_ACCESS_KEY, ENV_SECRET_KEY, ENV_WEB_HOST, ENV_FILES_HOST, ENV_AUTH_TOKEN,
    ENV_NO_DEFAULT_SERVER, ENV_DISABLE_VAULT_SUPPORT, ENV_INITIAL_CONNECT_RETRY_OVERRIDE, ENV_API_DEFAULT_REQ_METHOD, )
from .request import Request, BatchRequest
from .token_manager import TokenManager
from ..config import load

@@ -110,6 +111,19 @@ class Session(TokenManager):

        self._logger = logger
        self.__auth_token = None

        if ENV_API_DEFAULT_REQ_METHOD.get(default=None):
            # Make sure we update the config object, so we pass it into the new containers when we map them
            self.config["api.http.default_method"] = ENV_API_DEFAULT_REQ_METHOD.get()
            # notice the default setting of Request.def_method is already set by the OS environment
        elif self.config.get("api.http.default_method", None):
            def_method = str(self.config.get("api.http.default_method", None)).strip()
            if def_method.upper() not in ("GET", "POST", "PUT"):
                raise ValueError(
                    "api.http.default_method variable must be 'get', 'post' or 'put' (any case is allowed)."
                )
            Request.def_method = def_method
            Request._method = Request.def_method

        if ENV_AUTH_TOKEN.get(
            value_cb=lambda key, value: print("Using environment access token {}=********".format(key))
        ):

@@ -251,7 +265,7 @@ class Session(TokenManager):

        service,
        action,
        version=None,
        method=Request.def_method,
        headers=None,
        auth=None,
        data=None,

@@ -328,7 +342,7 @@ class Session(TokenManager):

        service,
        action,
        version=None,
        method=Request.def_method,
        headers=None,
        data=None,
        json=None,

@@ -371,7 +385,7 @@ class Session(TokenManager):

        headers=None,
        data=None,
        json=None,
        method=Request.def_method,
    ):
        """
        Send a raw batch API request. Batch requests always use application/json-lines content type.

@@ -615,7 +629,7 @@ class Session(TokenManager):

        try:
            data = {"expiration_sec": exp} if exp else {}
            res = self._send_request(
                method=Request.def_method,
                service="auth",
                action="login",
                auth=auth,
@@ -347,7 +347,7 @@ class ServiceCommandSection(BaseCommandSection):

        except AttributeError:
            raise NameResolutionError('Name resolution unavailable for {}'.format(service))

        request = request_cls.from_dict(dict(name=re.escape(name), only_fields=['name', 'id']))
        # from_dict will ignore unrecognised keyword arguments - not all GetAll's have only_fields
        response = getattr(self._session.send_api(request), service)
        matches = [db_object for db_object in response if name.lower() == db_object.name.lower()]

@@ -122,7 +122,7 @@ def main():

            " Bitbucket: https://support.atlassian.com/bitbucket-cloud/docs/app-passwords/\n"
            " GitLab: https://docs.gitlab.com/ee/user/profile/personal_access_tokens.html\n"
        )
        print('Enter git personal token for user \'{}\': '.format(git_user), end='')
        git_pass = input()
        print('Git repository cloning will be using user={} token={}'.format(git_user, git_pass))
    else:
@@ -1,6 +1,8 @@

import json
import re
import shlex

from clearml_agent.backend_api.session import Request
from clearml_agent.helper.package.requirements import (
    RequirementsManager, MarkerRequirement,
    compare_version_rules, )

@@ -26,7 +28,7 @@ def resolve_default_container(session, task_id, container_config):

                   'script.repository', 'script.branch',
                   'project', 'container'],
              'search_hidden': True},
        method=Request.def_method,
        async_enable=False,
    )
    try:

@@ -53,7 +55,7 @@ def resolve_default_container(session, task_id, container_config):

            'id': [task_info.get('project')],
            'only_fields': ['name'],
        },
        method=Request.def_method,
        async_enable=False,
    )
    try:
@@ -38,7 +38,7 @@ from clearml_agent.backend_api.services import auth as auth_api

from clearml_agent.backend_api.services import queues as queues_api
from clearml_agent.backend_api.services import tasks as tasks_api
from clearml_agent.backend_api.services import workers as workers_api
from clearml_agent.backend_api.session import CallResult, Request
from clearml_agent.backend_api.session.defs import (
    ENV_ENABLE_ENV_CONFIG_SECTION, ENV_ENABLE_FILES_CONFIG_SECTION,
    ENV_VENV_CONFIGURED, ENV_PROPAGATE_EXITCODE, )

@@ -67,8 +67,12 @@ from clearml_agent.definitions import (

    ENV_SSH_AUTH_SOCK,
    ENV_AGENT_SKIP_PIP_VENV_INSTALL,
    ENV_EXTRA_DOCKER_ARGS,
    ENV_CUSTOM_BUILD_SCRIPT,
    ENV_AGENT_SKIP_PYTHON_ENV_INSTALL,
    WORKING_STANDALONE_DIR,
    ENV_DEBUG_INFO,
    ENV_CHILD_AGENTS_COUNT_CMD,
    ENV_DOCKER_ARGS_FILTERS,
)
from clearml_agent.definitions import WORKING_REPOSITORY_DIR, PIP_EXTRA_INDICES
from clearml_agent.errors import (
@@ -270,7 +274,7 @@ def get_task(session, task_id, **kwargs):

        action='get_all',
        version='2.14',
        json={"id": [task_id], "search_hidden": True, **kwargs},
        method=Request.def_method,
        async_enable=False,
    )
    result = CallResult.from_result(

@@ -302,7 +306,7 @@ def get_next_task(session, queue, get_task_info=False):

        action='get_next_task',
        version='2.14',
        json=request,
        method=Request.def_method,
        async_enable=False,
    )
    if not result.ok:

@@ -323,7 +327,7 @@ def get_task_container(session, task_id):

        action='get_all',
        version='2.14',
        json={'id': [task_id], 'only_fields': ['container'], 'search_hidden': True},
        method=Request.def_method,
        async_enable=False,
    )
    try:

@@ -364,7 +368,7 @@ def set_task_container(session, task_id, docker_image=None, docker_arguments=Non

        action='edit',
        version='2.13',
        json={'task': task_id, 'container': container, 'force': True},
        method=Request.def_method,
        async_enable=False,
    )
    return result.ok
@@ -406,6 +410,10 @@ class TaskStopSignal(object):

        self.worker_id = command.worker_id
        self._task_reset_state_counter = 0
        self.task_id = task_id
        self._support_callback = None
        self._active_callback_timestamp = None
        self._active_callback_timeout = None
        self._abort_callback_max_timeout = float(self.session.config.get('agent.abort_callback_max_timeout', 1800))

    def test(self):
        # type: () -> TaskStopReason

@@ -423,11 +431,84 @@ class TaskStopSignal(object):

        # make sure we break nothing
        return TaskStopSignal.default
    def _wait_for_abort_callback(self):
        if not self._support_callback:
            return None

        if self._active_callback_timestamp:
            if time() - self._active_callback_timestamp < self._active_callback_timeout:
                # print("waiting for callback to complete")
                self.command.log("waiting for callback to complete")
                # check state
                cb_completed = None
                try:
                    task_info = self.session.get(
                        service="tasks", action="get_all", version="2.13", id=[self.task_id],
                        only_fields=["status", "status_message", "runtime._abort_callback_completed"])
                    cb_completed = task_info['tasks'][0]['runtime'].get('_abort_callback_completed', None)
                except:  # noqa
                    pass

                if not bool(cb_completed):
                    return False

                msg = "Task abort callback completed in {:.2f} seconds".format(
                    time() - self._active_callback_timestamp)
            else:
                msg = "Task abort callback timed out [timeout: {}, elapsed: {:.2f}]".format(
                    self._active_callback_timeout, time() - self._active_callback_timestamp)

            self.command.send_logs(self.task_id, ["### " + msg + " ###"], session=self.session)
            return True

        # check if abort callback is turned on
        cb_completed = None
        # TODO: add retries on network error with timeout
        try:
            task_info = self.session.get(
                service="tasks", action="get_all", version="2.13", id=[self.task_id],
                only_fields=["status", "status_message", "runtime._abort_callback_timeout",
                             "runtime._abort_poll_freq", "runtime._abort_callback_completed"])
            abort_timeout = task_info['tasks'][0]['runtime'].get('_abort_callback_timeout', 0)
            poll_timeout = task_info['tasks'][0]['runtime'].get('_abort_poll_freq', 0)
            cb_completed = task_info['tasks'][0]['runtime'].get('_abort_callback_completed', None)
        except:  # noqa
            abort_timeout = None
            poll_timeout = None

        if not abort_timeout:
            # no callback set, we can leave
            return None

        try:
            timeout = min(float(abort_timeout) + float(poll_timeout), self._abort_callback_max_timeout)
        except:  # noqa
            self.command.log("Failed parsing runtime timeout shutdown callback [{}, {}]".format(
                abort_timeout, poll_timeout))
            return None

        self.command.send_logs(
            self.task_id,
            ["### Task abort callback timeout set, waiting for max {} sec ###".format(timeout)],
            session=self.session
        )

        self._active_callback_timestamp = time()
        self._active_callback_timeout = timeout
        return bool(cb_completed)

    def was_abort_function_called(self):
        return bool(self._active_callback_timestamp)

    def _test(self):
        # type: () -> TaskStopReason
        """
        "Unsafe" version of test()
        """
        if self._support_callback is None:
            # test if the backend supports the abort callback
            self._support_callback = self.session.check_min_api_version("2.13")

        task_info = get_task(
            self.session, self.task_id, only_fields=["status", "status_message"]
        )
@@ -439,10 +520,16 @@ class TaskStopSignal(object):

                "task status_message has '%s', task will terminate",
                self.stopping_message,
            )
            # actively waiting for task to complete
            if self._wait_for_abort_callback() is False:
                return TaskStopReason.no_stop
            return TaskStopReason.stopped

        if status in self.unexpected_statuses:  # ## and "worker" not in message:
            self.command.log("unexpected status change, task will terminate")
            # actively waiting for task to complete
            if self._wait_for_abort_callback() is False:
                return TaskStopReason.no_stop
            return TaskStopReason.status_changed

        if status == self.statuses.created:

@@ -451,13 +538,18 @@ class TaskStopSignal(object):

                >= self._number_of_consecutive_reset_tests
            ):
                self.command.log("task was reset, task will terminate")
                # actively waiting for task to complete
                if self._wait_for_abort_callback() is False:
                    return TaskStopReason.no_stop
                return TaskStopReason.reset

            self._task_reset_state_counter += 1
            warning_msg = "Warning: Task {} was reset! if state is consistent we shall terminate ({}/{}).".format(
                self.task_id,
                self._task_reset_state_counter,
                self._number_of_consecutive_reset_tests,
            )

            if self.events_service:
                self.events_service.send_log_events(
                    self.worker_id,
@@ -526,6 +618,7 @@ class Worker(ServiceCommandSection):

    def __init__(self, *args, **kwargs):
        super(Worker, self).__init__(*args, **kwargs)
        self._debug_context = ENV_DEBUG_INFO.get()
        self.monitor = None
        self.log = self._session.get_logger(__name__)
        self.register_signal_handler()

@@ -555,6 +648,7 @@ class Worker(ServiceCommandSection):

        self.worker_id = self._session.config["agent.worker_id"] or "{}:{}".format(
            self._session.config["agent.worker_name"], os.getpid()
        )
        self.parent_worker_id = None  # maybe add os env for overriding
        self._last_stats = defaultdict(lambda: 0)
        self._last_report_timestamp = psutil.time.time()
        self.temp_config_path = None

@@ -593,6 +687,16 @@ class Worker(ServiceCommandSection):

        # str - not supported, version string indicates last server version
        self._runtime_props_support = None

        # allow docker args sanitization, needs backend support
        if ENV_DOCKER_ARGS_FILTERS.get():
            self._docker_args_filters = \
                [re.compile(f) for f in shlex.split(ENV_DOCKER_ARGS_FILTERS.get())]
        elif self._session.config.get('agent.docker_args_filters', None):
            self._docker_args_filters = \
                [re.compile(f) for f in self._session.config.get('agent.docker_args_filters', [])]
        else:
            self._docker_args_filters = []
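
For context, a sketch of the matching configuration entry (the regex values mirror the `_filter_docker_args` docstring further below; treat them as examples, not defaults):

```
agent {
    # pass through only --env/-e docker flags (and their values)
    docker_args_filters: ["^--env$", "^-e$"]
}
```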

    @classmethod
    def _verify_command_states(cls, kwargs):
        """
@@ -943,7 +1047,7 @@ class Worker(ServiceCommandSection):

        # update available gpus
        if gpu_queues:
            available_gpus = self._dynamic_gpu_get_available(gpu_indexes)
            # if something went wrong, or we have no free gpus
            # start over from the highest priority queue
            if not available_gpus:
                if self._daemon_foreground or worker_params.debug:
@@ -1029,7 +1133,7 @@ class Worker(ServiceCommandSection):

        self.report_monitor(ResourceMonitor.StatusReport(queues=queues, queue=queue, task=task_id))

        org_gpus = Session.get_nvidia_visible_env()
        dynamic_gpus_worker_id = self.worker_id
        # the following is only executed in dynamic gpus mode
        if gpu_queues and gpu_queues.get(queue):

@@ -1040,10 +1144,10 @@ class Worker(ServiceCommandSection):

            available_gpus = available_gpus[gpu_queues.get(queue)[1]:]
            self.set_runtime_properties(
                key='available_gpus', value=','.join(str(g) for g in available_gpus))
            Session.set_nvidia_visible_env(gpus)
            list_task_gpus_ids.update({str(g): task_id for g in gpus})
            self.worker_id = ':'.join(
                self.worker_id.split(':')[:-1] + ['gpu' + ','.join(str(g) for g in gpus)])

            self.send_logs(
                task_id=task_id,

@@ -1056,8 +1160,7 @@ class Worker(ServiceCommandSection):

        if gpu_queues:
            self.worker_id = dynamic_gpus_worker_id
            Session.set_nvidia_visible_env(org_gpus)

        self.report_monitor(ResourceMonitor.StatusReport(queues=self.queues))

@@ -1065,7 +1168,7 @@ class Worker(ServiceCommandSection):

        runtime_props = None

        # if we are using priority, start pulling from the first always;
        # if we are doing round-robin, pull from the next one
        if priority_order:
            break
        else:
@@ -1097,6 +1200,8 @@ class Worker(ServiceCommandSection):

        self._unregister()

    def _dynamic_gpu_get_available(self, gpu_indexes):
        # cast to string
        gpu_indexes = [str(g) for g in gpu_indexes]
        # noinspection PyBroadException
        try:
            response = self._session.send_api(workers_api.GetAllRequest(last_seen=600))

@@ -1111,7 +1216,8 @@ class Worker(ServiceCommandSection):

        for w in our_workers:
            for g in w.split(':')[-1].lower().replace('gpu', '').split(','):
                try:
                    # verify "int.int" (keeps MIG-style indexes such as "0.1")
                    gpus += [str(g).strip()] if float(g.strip()) >= 0 else []
                except (ValueError, TypeError):
                    print("INFO: failed parsing GPU int('{}') - skipping".format(g))
        available_gpus = list(set(gpu_indexes) - set(gpus))

@@ -1127,10 +1233,12 @@ class Worker(ServiceCommandSection):

        gpus = []
        for g in available_gpus[-1].split(','):
            try:
                # verify "int.int" (keeps MIG-style indexes such as "0.1")
                gpus += [str(g).strip()] if float(g.strip()) >= 0 else []
            except (ValueError, TypeError):
                print("INFO: failed parsing GPU int('{}') - skipping".format(g))
        available_gpus = gpus

        if not isinstance(gpu_queues, dict):
            gpu_queues = dict(gpu_queues)
@@ -1283,6 +1391,9 @@ class Worker(ServiceCommandSection):

        self._session.print_configuration()

    def resolve_daemon_queue_names(self, queues, create_if_missing=False):
        return self._resolve_queue_names(queues=queues, create_if_missing=create_if_missing)

    def daemon(self, queues, log_level, foreground=False, docker=False, detached=False, order_fairness=False, **kwargs):
        self._apply_extra_configuration()

@@ -1325,7 +1436,7 @@ class Worker(ServiceCommandSection):

        # if we do not need to create queues, make sure they are valid
        # match previous behaviour when we validated queue names before everything else
        queues = self.resolve_daemon_queue_names(queues, create_if_missing=kwargs.get('create_queue', False))

        queues_info = [
            q.to_dict()
@@ -1487,12 +1598,14 @@ class Worker(ServiceCommandSection):

            if '-' in gpu_indexes:
                gpu_indexes = list(range(int(gpu_indexes.split('-')[0]), 1 + int(gpu_indexes.split('-')[1])))
            else:
                # accept MIG-style ids like "0:1" by normalizing ":" to "."
                gpu_indexes = [str(g).replace(":", ".").strip() for g in gpu_indexes.split(',')]
                # verify (basically numbers with a single "." dot)
                gpu_indexes = [str(g) for g in gpu_indexes if float(g) >= 0]
        except Exception:
            raise ValueError(
                'Failed parsing --gpus "{}". '
                '--dynamic_gpus must be used with '
                'specific gpus, for example "0-7" or "0,1,2,3" or "0:0,0:1,1:0,1:1"'.format(kwargs.get('gpus')))

        dynamic_gpus = []
        for s in queue_names:
@@ -1719,6 +1832,10 @@ class Worker(ServiceCommandSection):

        printed_lines, stderr_pos_count = _print_file(stderr_path, stderr_pos_count)
        stderr_line_count += report_lines(printed_lines, "stderr")

        # make sure that if the abort function was called, the task is marked as aborted
        if stop_signal and stop_signal.was_abort_function_called():
            stop_reason = TaskStopReason.stopped

        return status, stop_reason

    def _check_if_internal_agent_started(self, printed_lines, service_mode_internal_agent_started, task_id):
@@ -2024,7 +2141,10 @@ class Worker(ServiceCommandSection):

            python_ver = task.script.binary
            python_ver = python_ver.split('/')[-1].replace('python', '')
            # parse major.minor explicitly; float formatting would turn "3.10" into "3.1"
            return '{}.{}'.format(
                int(python_ver.partition(".")[0]),
                int(python_ver.partition(".")[-1].partition(".")[0] or 0)
            )
        except Exception:
            pass
@@ -2873,8 +2993,8 @@ class Worker(ServiceCommandSection):

        if self._session.debug_mode:
            self.log(traceback.format_exc())

    def debug(self, message, context=None):
        if self._session.debug_mode and (not context or context == self._debug_context):
            print("clearml_agent: {}".format(message))

    @staticmethod
@@ -3160,6 +3280,11 @@ class Worker(ServiceCommandSection):

            first_time=first_time,
        )

        # print message so users know they can enable cache
        if not self.package_api.is_cached_enabled():
            print('::: Python virtual environment cache is disabled. '
                  'To accelerate spin-up time set `agent.venvs_cache.path=~/.clearml/venvs-cache` :::\n')

        # check if we have a cached folder
        if cached_requirements and not skip_pip_venv_install and self.package_api.get_cached_venv(
                requirements=cached_requirements,

@@ -3303,7 +3428,7 @@ class Worker(ServiceCommandSection):

        mounted_vcs_cache = temp_config.get(
            "agent.docker_internal_mounts.vcs_cache", '/root/.clearml/vcs-cache')
        mounted_venv_dir = temp_config.get(
            "agent.docker_internal_mounts.venv_build", '~/.clearml/venvs-builds')
        temp_config.put("sdk.storage.cache.default_base_dir", mounted_cache_dir)
        temp_config.put("agent.pip_download_cache.path", mounted_pip_dl_dir)
        temp_config.put("agent.vcs_cache.path", mounted_vcs_cache)
@@ -3330,7 +3455,7 @@ class Worker(ServiceCommandSection):

                '-v', '{}:{}'.format(ENV_SSH_AUTH_SOCK.get(), ENV_SSH_AUTH_SOCK.get()),
                '-e', ssh_auth_sock_env,
            ]
        elif ENV_AGENT_DISABLE_SSH_MOUNT.get() or self._session.config.get("agent.disable_ssh_mount", None):
            self._host_ssh_cache = None
        else:
            self._host_ssh_cache = mkdtemp(prefix='clearml_agent.ssh.')
@@ -3341,27 +3466,36 @@ class Worker(ServiceCommandSection):

        )

    def _get_docker_config_cmd(self, temp_config, clean_api_credentials=False, **kwargs):
        self.debug("Setting up docker config command")
        host_cache = Path(os.path.expandvars(
            self._session.config["sdk.storage.cache.default_base_dir"])).expanduser().as_posix()
        self.debug("host_cache: {}".format(host_cache))
        host_pip_dl = Path(os.path.expandvars(
            self._session.config["agent.pip_download_cache.path"])).expanduser().as_posix()
        self.debug("host_pip_dl: {}".format(host_pip_dl))
        host_vcs_cache = Path(os.path.expandvars(
            self._session.config["agent.vcs_cache.path"])).expanduser().as_posix()
        self.debug("host_vcs_cache: {}".format(host_vcs_cache))
        host_venvs_cache = Path(os.path.expandvars(
            self._session.config["agent.venvs_cache.path"])).expanduser().as_posix() \
            if self._session.config.get("agent.venvs_cache.path", None) else None
        self.debug("host_venvs_cache: {}".format(host_venvs_cache))
        host_ssh_cache = self._host_ssh_cache
        self.debug("host_ssh_cache: {}".format(host_ssh_cache))

        host_apt_cache = Path(os.path.expandvars(self._session.config.get(
            "agent.docker_apt_cache", '~/.clearml/apt-cache'))).expanduser().as_posix()
        self.debug("host_apt_cache: {}".format(host_apt_cache))
        host_pip_cache = Path(os.path.expandvars(self._session.config.get(
            "agent.docker_pip_cache", '~/.clearml/pip-cache'))).expanduser().as_posix()
        self.debug("host_pip_cache: {}".format(host_pip_cache))

        if self.poetry.enabled:
            host_poetry_cache = Path(os.path.expandvars(self._session.config.get(
                "agent.docker_poetry_cache", '~/.clearml/poetry-cache'))).expanduser().as_posix()
        else:
            host_poetry_cache = None
        self.debug("host_poetry_cache: {}".format(host_poetry_cache))

        # make sure all folders are valid
        if host_apt_cache:
@@ -3389,8 +3523,16 @@ class Worker(ServiceCommandSection):

            shutil.rmtree(host_ssh_cache, ignore_errors=True)
            shutil.copytree(Path('~/.ssh').expanduser().as_posix(), host_ssh_cache)
        except Exception:
            # if we failed to copy / delete, try again with a fresh temp folder
            self.log.warning('Failed creating temporary copy of ~/.ssh for git credential, '
                             'creating a new temp folder')
            # noinspection PyBroadException
            try:
                host_ssh_cache = mkdtemp(prefix='clearml_agent.ssh.')
                shutil.copytree(Path('~/.ssh').expanduser().as_posix(), host_ssh_cache)
            except Exception:
                self.log.warning('Failed creating temporary copy of ~/.ssh for git credential, removing mount!')
                host_ssh_cache = None

        # check if the .git credentials exist:
        try:
@@ -3420,6 +3562,7 @@ class Worker(ServiceCommandSection):

        mounted_vcs_cache = temp_config.get("agent.vcs_cache.path")
        mounted_venvs_cache = temp_config.get("agent.venvs_cache.path", "")
        mount_ssh = temp_config.get("agent.docker_internal_mounts.ssh_folder", None)
        mount_ssh_ro = temp_config.get("agent.docker_internal_mounts.ssh_ro_folder", None)
        mount_apt_cache = temp_config.get("agent.docker_internal_mounts.apt_cache", None)
        mount_pip_cache = temp_config.get("agent.docker_internal_mounts.pip_cache", None)
        mount_poetry_cache = temp_config.get("agent.docker_internal_mounts.poetry_cache", None)

@@ -3430,7 +3573,7 @@ class Worker(ServiceCommandSection):

        docker_cmd = dict(
            worker_id=self.worker_id,
            parent_worker_id=self.parent_worker_id or self.worker_id,
            # docker_image=docker_image,
            # docker_arguments=docker_arguments,
            extra_docker_arguments=self._extra_docker_arguments,

@@ -3451,6 +3594,7 @@ class Worker(ServiceCommandSection):

            preprocess_bash_script=preprocess_bash_script,
            install_opencv_libs=install_opencv_libs,
            mount_ssh=mount_ssh,
            mount_ssh_ro=mount_ssh_ro,
            mount_apt_cache=mount_apt_cache,
            mount_pip_cache=mount_pip_cache,
            mount_poetry_cache=mount_poetry_cache,
@@ -3462,15 +3606,11 @@ class Worker(ServiceCommandSection):
|
||||
def _get_child_agents_count_for_worker(self):
|
||||
"""Get the amount of running child agents. In case of any error return 0"""
|
||||
parent_worker_label = self._parent_worker_label.format(self.worker_id)
|
||||
cmd = [
|
||||
'docker',
|
||||
'ps',
|
||||
'--filter',
|
||||
'label={}'.format(parent_worker_label),
|
||||
'--format',
|
||||
# get some fields for debugging
|
||||
'{"ID":"{{ .ID }}", "Image": "{{ .Image }}", "Names":"{{ .Names }}", "Labels":"{{ .Labels }}"}'
|
||||
]
|
||||
|
||||
default_cmd = 'docker ps --filter label={parent_worker_label} --format {{{{.ID}}}}'
|
||||
child_agents_cmd = ENV_CHILD_AGENTS_COUNT_CMD.get() or default_cmd
|
||||
|
||||
cmd = shlex.split(child_agents_cmd.format(parent_worker_label=parent_worker_label))
|
||||
try:
|
||||
output = Argv(*cmd).get_output(
|
||||
stderr=subprocess.STDOUT
|
||||
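The child-agent count above is now driven by a single shell command: by default `docker ps` filtered on the parent-worker label and formatted to print one container ID per line, so counting children reduces to counting output lines, and the whole command can be swapped out via `CLEARML_AGENT_CHILD_AGENTS_COUNT_CMD`. A minimal standalone sketch of the same counting logic (the label value is hypothetical, and the `docker` CLI is assumed to be available):

```python
import shlex
import subprocess

# default command used by the agent; {parent_worker_label} is substituted at runtime
DEFAULT_CMD = 'docker ps --filter label={parent_worker_label} --format {{{{.ID}}}}'

def count_child_agents(parent_worker_label):
    # one container ID per output line -> the line count is the child count
    cmd = shlex.split(DEFAULT_CMD.format(parent_worker_label=parent_worker_label))
    try:
        output = subprocess.check_output(cmd, stderr=subprocess.STDOUT).decode("utf-8")
    except Exception:
        return 0  # same contract as the agent: any error counts as zero
    return len(output.splitlines()) if output else 0

print(count_child_agents("clearml-agent-parent-worker=my-worker"))  # hypothetical label
```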
@@ -3481,9 +3621,33 @@ class Worker(ServiceCommandSection):

        return len(output.splitlines()) if output else 0

    @classmethod
    def _filter_docker_args(self, docker_args):
        # type: (List[str]) -> List[str]
        """
        Filter docker args matching specific flags.
        Supports list of Regular expressions, e.g self._docker_args_filters = ["^--env$", "^-e$"]

        :argument docker_args: List of docker argument strings (flags and values)
        """
        # if no filtering, do nothing
        if not docker_args or not self._docker_args_filters:
            return docker_args

        args = docker_args[:]
        results = []
        while args:
            cmd = args.pop(0).strip()
            if any(f.match(cmd) for f in self._docker_args_filters):
                results.append(cmd)
                if "=" not in cmd and args and not args[0].startswith("-"):
                    try:
                        results.append(args.pop(0).strip())
                    except IndexError:
                        pass
        return results

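`_filter_docker_args` keeps only the flags that match one of the compiled filter regexes, and also keeps a flag's value when it is passed as a separate token (no `=` inside the flag and the next token does not start with `-`). A self-contained sketch of the same pass, with illustrative filter patterns:

```python
import re

def filter_docker_args(docker_args, filters):
    # keep matching flags plus their separately-passed values
    patterns = [re.compile(f) for f in filters]
    if not docker_args or not patterns:
        return docker_args
    args = list(docker_args)
    results = []
    while args:
        cmd = args.pop(0).strip()
        if any(p.match(cmd) for p in patterns):
            results.append(cmd)
            if "=" not in cmd and args and not args[0].startswith("-"):
                results.append(args.pop(0).strip())
    return results

print(filter_docker_args(["-e", "A=1", "--network", "host", "--env=B=2"],
                         ["^--env", "^-e$"]))
# -> ['-e', 'A=1', '--env=B=2']  ("--network host" is filtered out)
```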
    def _get_docker_cmd(
        cls,
        self,
        worker_id, parent_worker_id,
        docker_image, docker_arguments,
        python_version,
@@ -3505,18 +3669,19 @@ class Worker(ServiceCommandSection):
        auth_token=None,
        worker_tags=None,
        name=None,
        mount_ssh=None, mount_apt_cache=None, mount_pip_cache=None, mount_poetry_cache=None,
        mount_ssh=None, mount_ssh_ro=None, mount_apt_cache=None, mount_pip_cache=None, mount_poetry_cache=None,
        env_task_id=None,
    ):
        self.debug("Constructing docker command", context="docker")
        docker = 'docker'

        base_cmd = [docker, 'run', '-t']
        update_scheme = ""
        dockers_nvidia_visible_devices = 'all'
        gpu_devices = os.environ.get('NVIDIA_VISIBLE_DEVICES', None)
        gpu_devices = Session.get_nvidia_visible_env()
        if gpu_devices is None or gpu_devices.lower().strip() == 'all':
            if ENV_DOCKER_SKIP_GPUS_FLAG.get():
                dockers_nvidia_visible_devices = os.environ.get('NVIDIA_VISIBLE_DEVICES') or \
                dockers_nvidia_visible_devices = Session.get_nvidia_visible_env() or \
                    dockers_nvidia_visible_devices
            else:
                base_cmd += ['--gpus', 'all', ]
@@ -3524,7 +3689,8 @@ class Worker(ServiceCommandSection):
            if ENV_DOCKER_SKIP_GPUS_FLAG.get():
                dockers_nvidia_visible_devices = gpu_devices
            else:
                base_cmd += ['--gpus', '\"device={}\"'.format(gpu_devices), ]
                # replace back "." to ":" MIG support
                base_cmd += ['--gpus', '\"device={}\"'.format(gpu_devices.replace(".", ":")), ]
            # We are using --gpu, so we should not pass NVIDIA_VISIBLE_DEVICES, I think.
            # base_cmd += ['-e', 'NVIDIA_VISIBLE_DEVICES=' + gpu_devices, ]
        elif gpu_devices.strip() == 'none':
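The `.replace(".", ":")` above restores NVIDIA MIG notation: inside the agent a MIG slice such as `0:1` is stored as `0.1` (the colon is reserved for worker-ID composition), so it must be converted back before it reaches `docker run --gpus`. A short illustration of the conversion:

```python
def gpus_to_docker_device(gpu_devices):
    # agent-internal "0.1,1.0" becomes the runtime's MIG notation "0:1,1:0"
    return '"device={}"'.format(gpu_devices.replace(".", ":"))

print(gpus_to_docker_device("0.1,1.0"))  # -> "device=0:1,1:0"
```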
@@ -3533,6 +3699,7 @@ class Worker(ServiceCommandSection):
        if docker_arguments:
            docker_arguments = list(docker_arguments) \
                if isinstance(docker_arguments, (list, tuple)) else [docker_arguments]
            docker_arguments = self._filter_docker_args(docker_arguments)
            base_cmd += [a for a in docker_arguments if a]

        if extra_docker_arguments:
@@ -3541,8 +3708,10 @@ class Worker(ServiceCommandSection):
            base_cmd += [str(a) for a in extra_docker_arguments if a]

        # set docker labels
        base_cmd += ['-l', cls._worker_label.format(worker_id)]
        base_cmd += ['-l', cls._parent_worker_label.format(parent_worker_id)]
        base_cmd += ['-l', self._worker_label.format(worker_id)]
        base_cmd += ['-l', self._parent_worker_label.format(parent_worker_id)]

        self.debug("Command: {}".format(base_cmd), context="docker")

        # check if running inside a kubernetes
        if ENV_DOCKER_HOST_MOUNT.get() or (os.environ.get('KUBERNETES_SERVICE_HOST') and
@@ -3559,6 +3728,8 @@ class Worker(ServiceCommandSection):
                pass
            base_cmd += ['-e', 'NVIDIA_VISIBLE_DEVICES={}'.format(dockers_nvidia_visible_devices)]

            self.debug("Running in k8s: {}".format(base_cmd), context="docker")

        # check if we need to map host folders
        if ENV_DOCKER_HOST_MOUNT.get():
            # expect CLEARML_AGENT_K8S_HOST_MOUNT = '/mnt/host/data:/root/.clearml'
@@ -3566,6 +3737,7 @@ class Worker(ServiceCommandSection):
            # search and replace all the host folders with the k8s
            host_mounts = [host_apt_cache, host_pip_cache, host_poetry_cache, host_pip_dl,
                           host_cache, host_vcs_cache, host_venvs_cache]
            self.debug("Mapping host mounts: {}".format(host_mounts), context="docker")
            for i, m in enumerate(host_mounts):
                if not m:
                    continue
@@ -3574,6 +3746,7 @@ class Worker(ServiceCommandSection):
                    host_mounts[i] = None
                else:
                    host_mounts[i] = m.replace(k8s_pod_mnt, k8s_node_mnt, 1)
            self.debug("Mapped host mounts: {}".format(host_mounts), context="docker")
            host_apt_cache, host_pip_cache, host_poetry_cache, host_pip_dl, \
                host_cache, host_vcs_cache, host_venvs_cache = host_mounts

@@ -3587,6 +3760,8 @@ class Worker(ServiceCommandSection):
            except Exception:
                raise ValueError('Error: could not copy configuration file into: {}'.format(new_conf_file))

            self.debug("Config file target: {}, host: {}".format(new_conf_file, conf_file), context="docker")

            if host_ssh_cache:
                new_ssh_cache = os.path.join(k8s_pod_mnt, '.clearml_agent.{}.ssh'.format(quote(worker_id, safe="")))
                try:
@@ -3595,6 +3770,7 @@ class Worker(ServiceCommandSection):
                    host_ssh_cache = new_ssh_cache.replace(k8s_pod_mnt, k8s_node_mnt)
                except Exception:
                    raise ValueError('Error: could not copy .ssh directory into: {}'.format(new_ssh_cache))
                self.debug("Copied host SSH cache to: {}, host {}".format(new_ssh_cache, host_ssh_cache), context="docker")

        base_cmd += ['-e', 'CLEARML_WORKER_ID='+worker_id, ]
        # update the docker image, so the system knows where it runs
@@ -3637,6 +3813,12 @@ class Worker(ServiceCommandSection):
        # clearml-agent{specify_version}
        clearml_agent_wheel = 'clearml-agent{specify_version}'.format(specify_version=specify_version)

        mount_ssh = mount_ssh or '/root/.ssh'
        mount_ssh_ro = mount_ssh_ro or "{}_ro".format(mount_ssh.rstrip("/"))
        mount_apt_cache = mount_apt_cache or '/var/cache/apt/archives'
        mount_pip_cache = mount_pip_cache or '/root/.cache/pip'
        mount_poetry_cache = mount_poetry_cache or '/root/.cache/pypoetry'

        if not standalone_mode:
            if not bash_script:
                # Find the highest python version installed, or install from apt-get
@@ -3647,6 +3829,7 @@ class Worker(ServiceCommandSection):
                    "export DEBIAN_FRONTEND=noninteractive",
                    "export CLEARML_APT_INSTALL=\"$CLEARML_APT_INSTALL{}\"".format(
                        ' libsm6 libxext6 libxrender-dev libglib2.0-0' if install_opencv_libs else ""),
                    "cp -Rf {mount_ssh_ro} -T {mount_ssh}" if host_ssh_cache else "",
                    "[ ! -z $(which git) ] || export CLEARML_APT_INSTALL=\"$CLEARML_APT_INSTALL git\"",
                    "declare LOCAL_PYTHON",
                    "[ ! -z $LOCAL_PYTHON ] || for i in {{15..5}}; do which {python_single_digit}.$i && " +
@@ -3674,7 +3857,9 @@ class Worker(ServiceCommandSection):
                    "$LOCAL_PYTHON -m pip install -U {clearml_agent_wheel} ; ").format(
                    python_single_digit=python_version.split('.')[0],
                    python=python_version, pip_version=PackageManager.get_pip_version(),
                    clearml_agent_wheel=clearml_agent_wheel)
                    clearml_agent_wheel=clearml_agent_wheel,
                    mount_ssh_ro=mount_ssh_ro, mount_ssh=mount_ssh,
                )

        if host_git_credentials:
            for git_credentials in host_git_credentials:
@@ -3686,16 +3871,20 @@ class Worker(ServiceCommandSection):
                for line in docker_bash_setup_script.split('\n') if line.strip()) + \
                ' ; '

        mount_ssh = mount_ssh or '/root/.ssh'
        mount_apt_cache = mount_apt_cache or '/var/cache/apt/archives'
        mount_pip_cache = mount_pip_cache or '/root/.cache/pip'
        mount_poetry_cache = mount_poetry_cache or '/root/.cache/pypoetry'
        self.debug(
            "Adding mounts: host_ssh_cache={}, host_apt_cache={}, host_pip_cache={}, host_poetry_cache={}, "
            "host_pip_dl={}, host_cache={}, host_vcs_cache={}, host_venvs_cache={}".format(
                host_ssh_cache, host_apt_cache, host_pip_cache, host_poetry_cache, host_pip_dl, host_cache,
                host_vcs_cache, host_venvs_cache,
            ),
            context="docker"
        )

        base_cmd += (
            (['--name', name] if name else []) +
            ['-v', conf_file+':'+DOCKER_ROOT_CONF_FILE] +
            ['-e', "CLEARML_CONFIG_FILE={}".format(DOCKER_ROOT_CONF_FILE)] +
            (['-v', host_ssh_cache+':'+mount_ssh] if host_ssh_cache else []) +
            (['-v', host_ssh_cache+':'+mount_ssh_ro] if host_ssh_cache else []) +
            (['-v', host_apt_cache+':'+mount_apt_cache] if host_apt_cache else []) +
            (['-v', host_pip_cache+':'+mount_pip_cache] if host_pip_cache else []) +
            (['-v', host_poetry_cache + ':'+mount_poetry_cache] if host_poetry_cache else []) +
@@ -3841,6 +4030,9 @@ class Worker(ServiceCommandSection):
            unique_worker_id=worker_id, worker_name=worker_name, api_client=self._session.api_client,
            allow_double=bool(ENV_DOCKER_HOST_MOUNT.get())  # and bool(self._services_mode),
        )
        # set the parent ID the first time we have a worker ID (it might change for services-mode / dgpus)
        if not self.parent_worker_id:
            self.parent_worker_id = self.worker_id

        if self.worker_id is None:
            error('Instance with the same WORKER_ID [{}] is already running'.format(worker_id))
@@ -3851,8 +4043,8 @@ class Worker(ServiceCommandSection):
    def _generate_worker_id_name(self, dynamic_gpus=False):
        worker_id = self._session.config["agent.worker_id"]
        worker_name = self._session.config["agent.worker_name"]
        if not worker_id and os.environ.get('NVIDIA_VISIBLE_DEVICES') is not None:
            nvidia_visible_devices = os.environ.get('NVIDIA_VISIBLE_DEVICES')
        if not worker_id and Session.get_nvidia_visible_env() is not None:
            nvidia_visible_devices = Session.get_nvidia_visible_env()
            if nvidia_visible_devices and nvidia_visible_devices.lower() != 'none':
                worker_id = '{}:{}gpu{}'.format(
                    worker_name, 'd' if dynamic_gpus else '', nvidia_visible_devices)
@@ -3969,6 +4161,13 @@ class Worker(ServiceCommandSection):
        if self._session.feature_set == "basic":
            raise ValueError("Server does not support --use-owner-token option")

        role = self._session.get_decoded_token(self._session.token).get("identity", {}).get("role", None)
        if role and role not in ["admin", "root", "system"]:
            raise ValueError(
                "User role not suitable for --use-owner-token option (requires at least admin,"
                " found {})".format(role)
            )


if __name__ == "__main__":
    pass

@@ -148,6 +148,9 @@ ENV_DOCKER_HOST_MOUNT = EnvironmentConfig('CLEARML_AGENT_K8S_HOST_MOUNT', 'CLEAR
                                          'TRAINS_AGENT_K8S_HOST_MOUNT', 'TRAINS_AGENT_DOCKER_HOST_MOUNT')
ENV_VENV_CACHE_PATH = EnvironmentConfig('CLEARML_AGENT_VENV_CACHE_PATH')
ENV_EXTRA_DOCKER_ARGS = EnvironmentConfig('CLEARML_AGENT_EXTRA_DOCKER_ARGS', type=list)
ENV_DEBUG_INFO = EnvironmentConfig('CLEARML_AGENT_DEBUG_INFO')
ENV_CHILD_AGENTS_COUNT_CMD = EnvironmentConfig('CLEARML_AGENT_CHILD_AGENTS_COUNT_CMD')
ENV_DOCKER_ARGS_FILTERS = EnvironmentConfig('CLEARML_AGENT_DOCKER_ARGS_FILTERS')

ENV_CUSTOM_BUILD_SCRIPT = EnvironmentConfig('CLEARML_AGENT_CUSTOM_BUILD_SCRIPT')
"""
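The two new definitions, `ENV_CHILD_AGENTS_COUNT_CMD` and `ENV_DOCKER_ARGS_FILTERS`, expose the behaviours above through environment variables. A hedged example of setting them before launching the agent; the variable names come from the diff, but the exact value formats shown here (a `docker ps` template and whitespace-separated regexes) are assumptions:

```python
import os

# override the command used to count running child agents
os.environ["CLEARML_AGENT_CHILD_AGENTS_COUNT_CMD"] = (
    "docker ps --filter label={parent_worker_label} --format {{{{.ID}}}}"
)
# only let env-style docker arguments through from the task configuration
os.environ["CLEARML_AGENT_DOCKER_ARGS_FILTERS"] = "^--env$ ^-e$"
```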
@@ -1,6 +1,9 @@
import os
import re
import warnings

from clearml_agent.definitions import PIP_EXTRA_INDICES

from .requirement import Requirement


@@ -42,9 +45,14 @@ def parse(reqstr, cwd=None):
            yield requirement
        elif line.startswith('-f') or line.startswith('--find-links') or \
                line.startswith('-i') or line.startswith('--index-url') or \
                line.startswith('--extra-index-url') or \
                line.startswith('--no-index'):
            warnings.warn('Private repos not supported. Skipping.')
        elif line.startswith('--extra-index-url'):
            extra_index = line[len('--extra-index-url'):].strip()
            extra_index = re.sub(r"\s+#.*$", "", extra_index)  # strip comments
            if extra_index and extra_index not in PIP_EXTRA_INDICES:
                PIP_EXTRA_INDICES.append(extra_index)
                print(f"appended {extra_index} to list of extra pip indices")
            continue
        elif line.startswith('-Z') or line.startswith('--always-unzip'):
            warnings.warn('Unused option --always-unzip. Skipping.')
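Instead of warning and skipping, `--extra-index-url` lines are now collected into the global `PIP_EXTRA_INDICES` list (with trailing comments stripped), so private indices survive requirement parsing. A standalone sketch of that branch; the URL is hypothetical:

```python
import re

PIP_EXTRA_INDICES = []

def collect_extra_index(line):
    extra_index = line[len('--extra-index-url'):].strip()
    extra_index = re.sub(r"\s+#.*$", "", extra_index)  # strip trailing comment
    if extra_index and extra_index not in PIP_EXTRA_INDICES:
        PIP_EXTRA_INDICES.append(extra_index)

collect_extra_index("--extra-index-url https://pypi.example.com/simple  # team index")
print(PIP_EXTRA_INDICES)  # -> ['https://pypi.example.com/simple']
```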
clearml_agent/glue/definitions.py (new file, 7 lines)
@@ -0,0 +1,7 @@
from clearml_agent.definitions import EnvironmentConfig

ENV_START_AGENT_SCRIPT_PATH = EnvironmentConfig('CLEARML_K8S_GLUE_START_AGENT_SCRIPT_PATH')
"""
Script path to use when creating the bash script to run the agent inside the scheduled pod's docker container.
Script will be appended to the specified file.
"""
@@ -11,6 +11,7 @@ import subprocess
import tempfile
from copy import deepcopy
from pathlib import Path
from pprint import pformat
from threading import Thread
from time import sleep
from typing import Text, List, Callable, Any, Collection, Optional, Union
@@ -26,6 +27,8 @@ from clearml_agent.helper.dicts import merge_dicts
from clearml_agent.helper.process import get_bash_output
from clearml_agent.helper.resource_monitor import ResourceMonitor
from clearml_agent.interface.base import ObjectID
from clearml_agent.backend_api.session import Request
from clearml_agent.glue.definitions import ENV_START_AGENT_SCRIPT_PATH


class K8sIntegration(Worker):
@@ -73,8 +76,8 @@ class K8sIntegration(Worker):
        "export LOCAL_PYTHON=$(which python3.$i) && break ; done",
        "[ ! -z $LOCAL_PYTHON ] || apt-get install -y python3-pip",
        "[ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3",
        "$LOCAL_PYTHON -m pip install clearml-agent",
        "{extra_bash_init_cmd}",
        "$LOCAL_PYTHON -m pip install clearml-agent",
        "{extra_docker_bash_script}",
        "$LOCAL_PYTHON -m clearml_agent execute --full-monitoring --require-queue --id {task_id}"
    ]
@@ -119,7 +122,7 @@ class K8sIntegration(Worker):
            when scheduling a task to run in a pod. Callable can receive an optional pod number and should return
            a dictionary of user properties (name and value). Signature is [[Optional[int]], Dict[str,str]]
        :param str overrides_yaml: YAML file containing the overrides for the pod (optional)
        :param str template_yaml: YAML file containing the template for the pod (optional).
            If provided the pod is scheduled with kubectl apply and overrides are ignored, otherwise with kubectl run.
        :param str clearml_conf_file: clearml.conf file to be used by the pod itself (optional)
        :param str extra_bash_init_script: Additional bash script to run before starting the Task inside the container
@@ -128,6 +131,7 @@ class K8sIntegration(Worker):
        """
        super(K8sIntegration, self).__init__()
        self.k8s_pending_queue_name = k8s_pending_queue_name or self.K8S_PENDING_QUEUE
        self.k8s_pending_queue_id = None
        self.kubectl_cmd = kubectl_cmd or self.KUBECTL_RUN_CMD
        self.container_bash_script = container_bash_script or self.CONTAINER_BASH_SCRIPT
        # Always do system packages, because we will be running inside a docker
@@ -135,7 +139,8 @@ class K8sIntegration(Worker):
        # Add debug logging
        if debug:
            self.log.logger.disabled = False
            self.log.logger.setLevel(logging.INFO)
            self.log.logger.setLevel(logging.DEBUG)
            self.log.logger.addHandler(logging.StreamHandler())
        self.ports_mode = ports_mode
        self.num_of_services = num_of_services
        self.base_pod_num = base_pod_num
@@ -152,8 +157,7 @@ class K8sIntegration(Worker):
        self.pod_requests = []
        self.max_pods_limit = max_pods_limit if not self.ports_mode else None
        if overrides_yaml:
            with open(os.path.expandvars(os.path.expanduser(str(overrides_yaml))), 'rt') as f:
                overrides = yaml.load(f, Loader=getattr(yaml, 'FullLoader', None))
            overrides = self._load_template_file(overrides_yaml)
            if overrides:
                containers = overrides.get('spec', {}).get('containers', [])
                for c in containers:
@@ -174,8 +178,7 @@ class K8sIntegration(Worker):
                    self.log.warning('Removing containers section: {}'.format(overrides['spec'].pop('containers')))
            self.overrides_json_string = json.dumps(overrides)
        if template_yaml:
            with open(os.path.expandvars(os.path.expanduser(str(template_yaml))), 'rt') as f:
                self.template_dict = yaml.load(f, Loader=getattr(yaml, 'FullLoader', None))
            self.template_dict = self._load_template_file(template_yaml)

        clearml_conf_file = clearml_conf_file or kwargs.get('trains_conf_file')

@@ -194,6 +197,11 @@ class K8sIntegration(Worker):
        _check_pod_thread.daemon = True
        _check_pod_thread.start()

    @staticmethod
    def _load_template_file(path):
        with open(os.path.expandvars(os.path.expanduser(str(path))), 'rt') as f:
            return yaml.load(f, Loader=getattr(yaml, 'FullLoader', None))

    @staticmethod
    def _get_path(d, *path, default=None):
        try:
@@ -203,13 +211,27 @@ class K8sIntegration(Worker):
        except (IndexError, KeyError):
            return default

    def _get_kubectl_options(self, command, extra_labels=None):
        labels = [self._get_agent_label()] + (list(extra_labels) if extra_labels else [])
        return {
            "-l": ",".join(labels),
            "-n": str(self.namespace),
            "-o": "json"
        }

    def get_kubectl_command(self, command, extra_labels=None):
        opts = self._get_kubectl_options(command, extra_labels)
        return 'kubectl {command} {opts}'.format(
            command=command, opts=" ".join(x for item in opts.items() for x in item)
        )

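`get_kubectl_command` flattens the options dict pairwise into a single command string, so every kubectl invocation carries the same agent label, namespace, and JSON output flags. A sketch reproducing the flattening (label and namespace values are illustrative):

```python
def get_kubectl_command(command, labels, namespace):
    opts = {"-l": ",".join(labels), "-n": str(namespace), "-o": "json"}
    return 'kubectl {command} {opts}'.format(
        command=command, opts=" ".join(x for item in opts.items() for x in item))

print(get_kubectl_command("get pods", ["clearml-agent=k8s-agent"], "clearml"))
# -> kubectl get pods -l clearml-agent=k8s-agent -n clearml -o json
```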
    def _monitor_hanging_pods_daemon(self):
        last_tasks_msgs = {}  # last msg updated for every task

        while True:
            output = get_bash_output('kubectl get pods -n {namespace} -o=JSON'.format(
                namespace=self.namespace
            ))
            kubectl_cmd = self.get_kubectl_command("get pods")
            self.log.debug("Detecting hanging pods: {}".format(kubectl_cmd))
            output = get_bash_output(kubectl_cmd)
            output = '' if not output else output if isinstance(output, str) else output.decode('utf-8')
            try:
                output_config = json.loads(output)
@@ -231,6 +253,10 @@ class K8sIntegration(Worker):
                if not task_id:
                    continue

                namespace = pod.get('metadata', {}).get('namespace', None)
                if not namespace:
                    continue

                task_ids.add(task_id)

                msg = None
@@ -250,7 +276,7 @@ class K8sIntegration(Worker):
                        msg = reason + (" ({})".format(message) if message else "")

                        if reason == 'ImagePullBackOff':
                            delete_pod_cmd = 'kubectl delete pods {} -n {}'.format(pod_name, self.namespace)
                            delete_pod_cmd = 'kubectl delete pods {} -n {}'.format(pod_name, namespace)
                            get_bash_output(delete_pod_cmd)
                            try:
                                self._session.api_client.tasks.failed(
@@ -273,7 +299,7 @@ class K8sIntegration(Worker):
                        service='tasks',
                        action='update',
                        json={"task": task_id, "status_message": "K8S glue status: {}".format(msg)},
                        method='get',
                        method=Request.def_method,
                        async_enable=False,
                    )
                    if not result.ok:
@@ -336,13 +362,11 @@ class K8sIntegration(Worker):

        return self._agent_label

    def _get_number_used_pods(self):
    def _get_used_pods(self):
        # noinspection PyBroadException
        try:
            kubectl_cmd_new = "kubectl get pods -l {agent_label} -n {namespace} -o json".format(
                agent_label=self._get_agent_label(),
                namespace=self.namespace,
            )
            kubectl_cmd_new = self.get_kubectl_command("get pods")
            self.log.debug("Getting used pods: {}".format(kubectl_cmd_new))
            process = subprocess.Popen(kubectl_cmd_new.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            output, error = process.communicate()
            output = '' if not output else output if isinstance(output, str) else output.decode('utf-8')
@@ -350,17 +374,20 @@ class K8sIntegration(Worker):

            if not output:
                # No such pod exist so we can use the pod_number we found
                return 0
                return 0, {}

            try:
                current_pod_count = len(json.loads(output).get("items", []))
            except (ValueError, TypeError) as ex:
                return -1
                items = json.loads(output).get("items", [])
                current_pod_count = len(items)
                namespaces = {item["metadata"]["namespace"] for item in items}
            except (KeyError, ValueError, TypeError, AttributeError) as ex:
                print("Failed parsing used pods command response for cleanup: {}".format(ex))
                return -1, {}

            return current_pod_count
            return current_pod_count, namespaces
        except Exception as ex:
            print('Failed getting number of used pods: {}'.format(ex))
            return -2
            print('Failed obtaining used pods information: {}'.format(ex))
            return -2, {}

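`_get_used_pods` now returns a `(count, namespaces)` pair instead of a bare count, with `(-1, {})` and `(-2, {})` as error sentinels, so callers can both check the pod limit and clean up per namespace from a single query. A sketch of the consuming pattern used further down in the daemon loop:

```python
def handle_pod_limit(get_used_pods, max_pods_limit):
    current_pods, namespaces = get_used_pods()
    if current_pods < 0:  # -1 / -2 signal query or parsing failures
        return
    if max_pods_limit and current_pods >= max_pods_limit:
        for namespace in namespaces:
            print("would delete completed/failed pods in namespace {}".format(namespace))

# stub standing in for K8sIntegration._get_used_pods
handle_pod_limit(lambda: (3, {"clearml", "default"}), max_pods_limit=2)
```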
    def run_one_task(self, queue: Text, task_id: Text, worker_args=None, task_session=None, **_):
        print('Pulling task {} launching on kubernetes cluster'.format(task_id))
@@ -369,17 +396,17 @@ class K8sIntegration(Worker):
        # push task into the k8s queue, so we have visibility on pending tasks in the k8s scheduler
        try:
            print('Pushing task {} into temporary pending queue'.format(task_id))
            res = self._session.api_client.tasks.stop(task_id, force=True)
            _ = self._session.api_client.tasks.stop(task_id, force=True)
            res = self._session.api_client.tasks.enqueue(
                task_id,
                queue=self.k8s_pending_queue_name,
                queue=self.k8s_pending_queue_id,
                status_reason='k8s pending scheduler',
            )
            if res.meta.result_code != 200:
                raise Exception(res.meta.result_msg)
        except Exception as e:
            self.log.error("ERROR: Could not push back task [{}] to k8s pending queue [{}], error: {}".format(
                task_id, self.k8s_pending_queue_name, e))
            self.log.error("ERROR: Could not push back task [{}] to k8s pending queue {} [{}], error: {}".format(
                task_id, self.k8s_pending_queue_name, self.k8s_pending_queue_id, e))
            return

        container = get_task_container(self._session, task_id)
@@ -426,39 +453,36 @@ class K8sIntegration(Worker):
        pod_number = self.base_pod_num
        while self.ports_mode or self.max_pods_limit:
            pod_number = self.base_pod_num + pod_count
            if self.ports_mode:
                kubectl_cmd_new = "kubectl get pods -l {pod_label},{agent_label} -n {namespace}".format(
                    pod_label=self.LIMIT_POD_LABEL.format(pod_number=pod_number),
                    agent_label=self._get_agent_label(),
                    namespace=self.namespace,
                )
            else:
                kubectl_cmd_new = "kubectl get pods -l {agent_label} -n {namespace} -o json".format(
                    agent_label=self._get_agent_label(),
                    namespace=self.namespace,
                )

            kubectl_cmd_new = self.get_kubectl_command(
                "get pods",
                extra_labels=[self.LIMIT_POD_LABEL.format(pod_number=pod_number)] if self.ports_mode else None
            )
            self.log.debug("Looking for a free pod/port: {}".format(kubectl_cmd_new))
            process = subprocess.Popen(kubectl_cmd_new.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            output, error = process.communicate()
            output = '' if not output else output if isinstance(output, str) else output.decode('utf-8')
            error = '' if not error else error if isinstance(error, str) else error.decode('utf-8')

            if not output:
                # No such pod exist so we can use the pod_number we found
            try:
                items_count = len(json.loads(output).get("items", []))
            except (ValueError, TypeError) as ex:
                self.log.warning(
                    "K8S Glue pods monitor: Failed parsing kubectl output:\n{}\ntask '{}' "
                    "will be enqueued back to queue '{}'\nEx: {}".format(
                        output, task_id, queue, ex
                    )
                )
                self._session.api_client.tasks.stop(task_id, force=True)
                self._session.api_client.tasks.enqueue(task_id, queue=queue, status_reason='kubectl parsing error')
                return

            if not items_count:
                # No such pod exist so we can use the pod_number we found (result exists but with no items)
                break

            if self.max_pods_limit:
                try:
                    current_pod_count = len(json.loads(output).get("items", []))
                except (ValueError, TypeError) as ex:
                    self.log.warning(
                        "K8S Glue pods monitor: Failed parsing kubectl output:\n{}\ntask '{}' "
                        "will be enqueued back to queue '{}'\nEx: {}".format(
                            output, task_id, queue, ex
                        )
                    )
                    self._session.api_client.tasks.stop(task_id, force=True)
                    self._session.api_client.tasks.enqueue(task_id, queue=queue, status_reason='kubectl parsing error')
                    return
                current_pod_count = items_count
                max_count = self.max_pods_limit
            else:
                current_pod_count = pod_count
@@ -483,10 +507,9 @@ class K8sIntegration(Worker):
                break
            pod_count += 1

        labels = ([self.LIMIT_POD_LABEL.format(pod_number=pod_number)] if self.ports_mode else []) + \
            [self._get_agent_label()]
        labels.append("clearml-agent-queue={}".format(self._safe_k8s_label_value(queue)))
        labels.append("clearml-agent-queue-name={}".format(self._safe_k8s_label_value(queue_name)))
        labels = self._get_pod_labels(queue, queue_name)
        if self.ports_mode:
            labels.append(self.LIMIT_POD_LABEL.format(pod_number=pod_number))

        if self.ports_mode:
            print("Kubernetes scheduling task id={} on pod={} (pod_count={})".format(task_id, pod_number, pod_count))
@@ -503,8 +526,14 @@ class K8sIntegration(Worker):
            queue=queue
        )

        if self.template_dict:
            output, error = self._kubectl_apply(**kubectl_kwargs)
        try:
            template = self._resolve_template(task_session, task_data, queue)
        except Exception as ex:
            print("ERROR: Failed resolving template (skipping): {}".format(ex))
            return

        if template:
            output, error = self._kubectl_apply(template=template, **kubectl_kwargs)
        else:
            output, error = self._kubectl_run(task_data=task_data, **kubectl_kwargs)

@@ -540,6 +569,13 @@ class K8sIntegration(Worker):
            **user_props
        )

    def _get_pod_labels(self, queue, queue_name):
        return [
            self._get_agent_label(),
            "clearml-agent-queue={}".format(self._safe_k8s_label_value(queue)),
            "clearml-agent-queue-name={}".format(self._safe_k8s_label_value(queue_name))
        ]

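Extracting `_get_pod_labels` gives every scheduled pod the same three labels: the agent label plus queue ID and queue name, both passed through `_safe_k8s_label_value`. A rough standalone rendering; the sanitizer below is a simplified stand-in, not the agent's exact implementation:

```python
def safe_k8s_label_value(value):
    # simplified stand-in for K8sIntegration._safe_k8s_label_value
    return str(value).replace(":", "-").replace("/", "-")[:63]

def get_pod_labels(agent_label, queue_id, queue_name):
    return [
        agent_label,
        "clearml-agent-queue={}".format(safe_k8s_label_value(queue_id)),
        "clearml-agent-queue-name={}".format(safe_k8s_label_value(queue_name)),
    ]

print(get_pod_labels("clearml-agent=k8s-agent", "a1b2c3d4", "gpu-queue"))
```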
    def _get_docker_args(self, docker_args, flags, target=None, convert=None):
        # type: (List[str], Collection[str], Optional[str], Callable[[str], Any]) -> Union[dict, List[str]]
        """
@@ -566,8 +602,16 @@ class K8sIntegration(Worker):
            return {target: results} if results else {}
        return results

    def _kubectl_apply(self, create_clearml_conf, docker_image, docker_args, docker_bash, labels, queue, task_id):
        template = deepcopy(self.template_dict)
    def _kubectl_apply(
        self, create_clearml_conf, docker_image, docker_args, docker_bash, labels, queue, task_id, template=None
    ):
        template = template or deepcopy(self.template_dict)

        try:
            namespace = template['metadata']['namespace'] or self.namespace
        except (KeyError, TypeError, AttributeError):
            namespace = self.namespace

        template.setdefault('apiVersion', 'v1')
        template['kind'] = 'Pod'
        template.setdefault('metadata', {})
@@ -604,12 +648,15 @@ class K8sIntegration(Worker):

        extra_bash_commands = list(create_clearml_conf or [])

        start_agent_script_path = ENV_START_AGENT_SCRIPT_PATH.get() or "~/__start_agent__.sh"

        extra_bash_commands.append(
            "echo '{}' | base64 --decode >> ~/__start_agent__.sh ; "
            "/bin/bash ~/__start_agent__.sh".format(
                base64.b64encode(
            "echo '{content}' | base64 --decode >> {script_path} ; /bin/bash {script_path}".format(
                content=base64.b64encode(
                    script_encoded.encode('ascii')
                ).decode('ascii'))
                ).decode('ascii'),
                script_path=start_agent_script_path
            )
        )

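The setup script is shipped base64-encoded inside a single shell command, appended to the (now configurable) script path and executed; `CLEARML_K8S_GLUE_START_AGENT_SCRIPT_PATH` overrides the default `~/__start_agent__.sh`. A sketch of how that command string is built:

```python
import base64

def build_start_agent_command(script, script_path="~/__start_agent__.sh"):
    content = base64.b64encode(script.encode("ascii")).decode("ascii")
    return "echo '{content}' | base64 --decode >> {script_path} ; /bin/bash {script_path}".format(
        content=content, script_path=script_path)

print(build_start_agent_command("echo hello"))
# -> echo 'ZWNobyBoZWxsbw==' | base64 --decode >> ~/__start_agent__.sh ; /bin/bash ~/__start_agent__.sh
```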
        # Notice: we always leave with exit code 0, so pods are never restarted
@@ -634,11 +681,13 @@ class K8sIntegration(Worker):
        with open(yaml_file, 'wt') as f:
            yaml.dump(template, f)

        self.log.debug("Applying template:\n{}".format(pformat(template, indent=2)))

        kubectl_cmd = self.KUBECTL_APPLY_CMD.format(
            task_id=task_id,
            docker_image=docker_image,
            queue_id=queue,
            namespace=self.namespace
            namespace=namespace
        )
        # make sure we provide a list
        if isinstance(kubectl_cmd, str):
@@ -720,26 +769,29 @@ class K8sIntegration(Worker):
        events_service = self.get_service(Events)

        # make sure we have a k8s pending queue
        # noinspection PyBroadException
        try:
            self._session.api_client.queues.create(self.k8s_pending_queue_name)
        except Exception:
            pass
        # get queue id
        self.k8s_pending_queue_name = self._resolve_name(self.k8s_pending_queue_name, "queues")
        if not self.k8s_pending_queue_id:
            resolved_ids = self._resolve_queue_names([self.k8s_pending_queue_name], create_if_missing=True)
            if not resolved_ids:
                raise ValueError(
                    "Failed resolving or creating k8s pending queue {}".format(self.k8s_pending_queue_name)
                )
            self.k8s_pending_queue_id = resolved_ids[0]

        _last_machine_update_ts = 0
        while True:
            # Get used pods and namespaces
            current_pods, namespaces = self._get_used_pods()

            # check if have pod limit, then check if we hit it.
            if self.max_pods_limit:
                current_pods = self._get_number_used_pods()
                if current_pods >= self.max_pods_limit:
                    print("Maximum pod limit reached {}/{}, sleeping for {:.1f} seconds".format(
                        current_pods, self.max_pods_limit, self._polling_interval))
                    # delete old completed / failed pods
                    get_bash_output(
                        self.KUBECTL_DELETE_CMD.format(namespace=self.namespace, selector=self._get_agent_label())
                    )
                    for namespace in namespaces:
                        kubectl_cmd = self.KUBECTL_DELETE_CMD.format(namespace=namespace, selector=self._get_agent_label())
                        self.log.debug("Deleting old/failed pods due to pod limit: {}".format(kubectl_cmd))
                        get_bash_output(kubectl_cmd)
                    # go to sleep
                    sleep(self._polling_interval)
                    continue
@@ -747,19 +799,20 @@ class K8sIntegration(Worker):
            # iterate over queues (priority style, queues[0] is highest)
            for queue in queues:
                # delete old completed / failed pods
                get_bash_output(
                    self.KUBECTL_DELETE_CMD.format(namespace=self.namespace, selector=self._get_agent_label())
                )
                for namespace in namespaces:
                    kubectl_cmd = self.KUBECTL_DELETE_CMD.format(namespace=namespace, selector=self._get_agent_label())
                    self.log.debug("Deleting old/failed pods: {}".format(kubectl_cmd))
                    get_bash_output(kubectl_cmd)

                # get next task in queue
                try:
                    response = get_next_task(
                        self._session, queue=queue, get_task_info=self._impersonate_as_task_owner
                    )
                    response = self._get_next_task(queue=queue, get_task_info=self._impersonate_as_task_owner)
                except Exception as e:
                    print("Warning: Could not access task queue [{}], error: {}".format(queue, e))
                    continue
                else:
                    if not response:
                        continue
                    try:
                        task_id = response["entry"]["task"]
                    except (KeyError, TypeError, AttributeError):
@@ -820,6 +873,15 @@ class K8sIntegration(Worker):
            log_level=logging.INFO, foreground=True, docker=False, **kwargs,
        )

    def _get_next_task(self, queue, get_task_info):
        return get_next_task(
            self._session, queue=queue, get_task_info=get_task_info
        )

    def _resolve_template(self, task_session, task_data, queue):
        if self.template_dict:
            return deepcopy(self.template_dict)

    @classmethod
    def get_ssh_server_bash(cls, ssh_port_number):
        return ' ; '.join(line.format(port=ssh_port_number) for line in cls.BASH_INSTALL_SSH_CMD)

@@ -213,6 +213,13 @@ class PackageManager(object):
            return
        return self._get_cache_manager().get_last_copied_entry()

    def is_cached_enabled(self):
        if not self._cache_manager:
            cache_folder = ENV_VENV_CACHE_PATH.get() or self.session.config.get(self._config_cache_folder, None)
            if not cache_folder:
                return False
        return True

    @classmethod
    def _generate_reqs_hash_keys(cls, requirements_list, docker_cmd, python_version, cuda_version):
        # type: (Union[Dict, List[Dict]], Optional[Union[dict, str]], Optional[str], Optional[str]) -> List[str]
@@ -95,7 +95,8 @@ class ExternalRequirements(SimpleSubstitution):
                vcs._set_ssh_url()
                new_req_line = 'git+{}{}{}'.format(
                    '' if scheme and '://' in vcs.url else scheme,
                    vcs.url_with_auth, fragment
                    vcs_url if session.config.get('agent.force_git_ssh_protocol', None) else vcs.url_with_auth,
                    fragment
                )
                if new_req_line != req_line:
                    furl_line = furl(new_req_line)
@@ -1,3 +1,4 @@
import re
from typing import Text

from .base import PackageManager
@@ -11,13 +12,14 @@ class PriorityPackageRequirement(SimpleSubstitution):

    def __init__(self, *args, **kwargs):
        super(PriorityPackageRequirement, self).__init__(*args, **kwargs)
        self._replaced_packages = {}
        # check if we need to replace the packages:
        priority_packages = self.config.get('agent.package_manager.priority_packages', None)
        if priority_packages:
            self.__class__.name = priority_packages
            self.__class__.name = [p.lower() for p in priority_packages]
        priority_optional_packages = self.config.get('agent.package_manager.priority_optional_packages', None)
        if priority_optional_packages:
            self.__class__.optional_package_names = priority_optional_packages
            self.__class__.optional_package_names = [p.lower() for p in priority_optional_packages]

    def match(self, req):
        # match both Cython & cython
@@ -28,7 +30,9 @@ class PriorityPackageRequirement(SimpleSubstitution):
        Replace a requirement
        :raises: ValueError if version is pre-release
        """
        if req.name in self.optional_package_names:
        self._replaced_packages[req.name] = req.line

        if req.name.lower() in self.optional_package_names:
            # noinspection PyBroadException
            try:
                if PackageManager.out_of_scope_install_package(str(req)):
@@ -39,6 +43,41 @@ class PriorityPackageRequirement(SimpleSubstitution):
            PackageManager.out_of_scope_install_package(str(req))
        return Text(req)

    def replace_back(self, list_of_requirements):
        """
        :param list_of_requirements: {'pip': ['a==1.0', ]}
        :return: {'pip': ['a==1.0', ]}
        """
        # if we replaced setuptools, it means someone requested it, and since freeze will not contain it,
        # we need to add it manually
        if not self._replaced_packages or "setuptools" not in self._replaced_packages:
            return list_of_requirements

        try:
            for k, lines in list_of_requirements.items():
                # k is either pip/conda
                if k not in ('pip', 'conda'):
                    continue
                for i, line in enumerate(lines):
                    if not line or line.lstrip().startswith('#'):
                        continue
                    parts = [p for p in re.split(r'\s|=|\.|<|>|~|!|@|#', line) if p]
                    if not parts:
                        continue
                    # if we found setuptools, do nothing
                    if parts[0] == "setuptools":
                        return list_of_requirements

            # if we are here it means we have not found setuptools
            # we should add it:
            if "pip" in list_of_requirements:
                list_of_requirements["pip"] = [self._replaced_packages["setuptools"]] + list_of_requirements["pip"]

        except Exception as ex:  # noqa
            return list_of_requirements

        return list_of_requirements

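`replace_back` exists because packages swapped out by `replace()` (notably `setuptools`) will not show up in the post-install `pip freeze`, so the originally requested line has to be re-inserted into the frozen requirements. A condensed illustration of the effect; the version pin is made up, and the `==`-based matching is a simplification of the regex split used above:

```python
replaced = {"setuptools": "setuptools==59.5.0"}     # recorded by replace()
freeze = {"pip": ["attrs==21.4.0", "furl==2.1.3"]}  # pip freeze output, no setuptools

if "setuptools" in replaced and not any(
        line.split("==")[0].strip() == "setuptools" for line in freeze["pip"]):
    freeze["pip"] = [replaced["setuptools"]] + freeze["pip"]

print(freeze)
# -> {'pip': ['setuptools==59.5.0', 'attrs==21.4.0', 'furl==2.1.3']}
```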
class PackageCollectorRequirement(SimpleSubstitution):
    """
@@ -7,7 +7,7 @@ from furl import furl
import urllib.parse
from operator import itemgetter
from html.parser import HTMLParser
from typing import Text, Optional
from typing import Text, Optional, Dict

import attr
import requests
@@ -53,17 +53,16 @@ class PytorchWheel(object):
    python = attr.ib(type=str, converter=lambda x: str(x).replace(".", ""))
    torch_version = attr.ib(type=str, converter=fix_version)

    url_template = (
        "http://download.pytorch.org/whl/"
        "{0.cuda_version}/torch-{0.torch_version}-cp{0.python}-cp{0.python}m{0.unicode}-{0.os_name}.whl"
    )
    url_template_prefix = "http://download.pytorch.org/whl/"
    url_template = "{0.cuda_version}/torch-{0.torch_version}" \
                   "-cp{0.python}-cp{0.python}m{0.unicode}-{0.os_name}.whl"

    def __attrs_post_init__(self):
        self.unicode = "u" if self.python.startswith("2") else ""

    def make_url(self):
        # type: () -> Text
        return self.url_template.format(self)
        return (self.url_template_prefix + self.url_template).format(self)

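Splitting the URL into `url_template_prefix` and `url_template` is what makes the new `agent.package_manager.torch_url_template_prefix` / `torch_url_template` configuration overrides (added below) possible: either half can now be replaced independently. A self-contained sketch of the resulting URL construction; the field values are illustrative, not tested wheel coordinates:

```python
class PytorchWheelSketch(object):
    # illustrative stand-in for PytorchWheel (the real class is attrs-based)
    url_template_prefix = "http://download.pytorch.org/whl/"
    url_template = "{0.cuda_version}/torch-{0.torch_version}" \
                   "-cp{0.python}-cp{0.python}m{0.unicode}-{0.os_name}.whl"

    def __init__(self, os_name, cuda_version, python, torch_version):
        self.os_name, self.cuda_version = os_name, cuda_version
        self.python, self.torch_version = python, torch_version
        self.unicode = "u" if python.startswith("2") else ""

    def make_url(self):
        return (self.url_template_prefix + self.url_template).format(self)

print(PytorchWheelSketch("linux_x86_64", "cu102", "37", "1.8.1").make_url())
# -> http://download.pytorch.org/whl/cu102/torch-1.8.1-cp37-cp37m-linux_x86_64.whl
```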
class PytorchResolutionError(FatalSpecsResolutionError):
@@ -183,6 +182,19 @@ class PytorchRequirement(SimpleSubstitution):
        self._fix_setuptools = None
        self.exceptions = []
        self._original_req = []
        # allow override pytorch lookup pages
        if self.config.get("agent.package_manager.torch_page", None):
            SimplePytorchRequirement.page_lookup_template = \
                self.config.get("agent.package_manager.torch_page", None)
        if self.config.get("agent.package_manager.torch_nightly_page", None):
            SimplePytorchRequirement.nightly_page_lookup_template = \
                self.config.get("agent.package_manager.torch_nightly_page", None)
        if self.config.get("agent.package_manager.torch_url_template_prefix", None):
            PytorchWheel.url_template_prefix = \
                self.config.get("agent.package_manager.torch_url_template_prefix", None)
        if self.config.get("agent.package_manager.torch_url_template", None):
            PytorchWheel.url_template = \
                self.config.get("agent.package_manager.torch_url_template", None)

    def _init_python_ver_cuda_ver(self):
        if self.cuda is None:
@@ -512,7 +524,7 @@ class PytorchRequirement(SimpleSubstitution):
            for i, line in enumerate(lines):
                if not line or line.lstrip().startswith('#'):
                    continue
                parts = [p for p in re.split('\s|=|\.|<|>|~|!|@|#', line) if p]
                parts = [p for p in re.split(r'\s|=|\.|<|>|~|!|@|#', line) if p]
                if not parts:
                    continue
                for req, new_req in self._original_req:
@@ -14,6 +14,7 @@ from pathlib2 import Path
from pyhocon import ConfigTree

import six
from six.moves.urllib.parse import unquote
import logging
from clearml_agent.definitions import PIP_EXTRA_INDICES
from clearml_agent.helper.base import (
@@ -175,11 +176,13 @@ class MarkerRequirement(object):
            return
        local_path = Path(self.uri[len("file://"):])
        if not local_path.exists():
            line = self.line
            if self.remove_local_file_ref():
                # print warning
                logging.getLogger(__name__).warning(
                    'Local file not found [{}], references removed'.format(line))
        local_path = Path(unquote(self.uri)[len("file://"):])
        if not local_path.exists():
            line = self.line
            if self.remove_local_file_ref():
                # print warning
                logging.getLogger(__name__).warning(
                    'Local file not found [{}], references removed'.format(line))


class SimpleVersion:

@@ -1,7 +1,11 @@
import abc
import os
import re
import shutil
import stat
import subprocess
import sys
import tempfile
from distutils.spawn import find_executable
from hashlib import md5
from os import environ
@@ -23,7 +27,7 @@ from clearml_agent.helper.base import (
    rm_tree,
    ExecutionInfo,
    normalize_path,
    create_file_if_not_exists,
    create_file_if_not_exists, safe_remove_file,
)
from clearml_agent.helper.os.locks import FileLock
from clearml_agent.helper.process import DEVNULL, Argv, PathLike, COMMAND_SUCCESS
@@ -118,6 +122,13 @@ class VCS(object):
        """
        return self.add_auth(self.session.config, self.url)

    @property
    def url_without_auth(self):
        """
        Return URL without configured user/password
        """
        return self.add_auth(self.session.config, self.url, reset_auth=True)

    @abc.abstractmethod
    def executable_name(self):
        """
@@ -349,7 +360,9 @@ class VCS(object):
        If not in debug mode, filter VCS password from output.
        """
        self._set_ssh_url()
        clone_command = ("clone", self.url_with_auth, self.location) + self.clone_flags
        # if we are on linux no need for the full auth url because we use GIT_ASKPASS
        url = self.url_without_auth if self._use_ask_pass else self.url_with_auth
        clone_command = ("clone", url, self.location) + self.clone_flags
        # clone all branches regardless of when we want to later checkout
        # if branch:
        #     clone_command += ("-b", branch)
@@ -357,34 +370,35 @@ class VCS(object):
            self.call(*clone_command)
            return

        def normalize_output(result):
            """
            Returns result string without user's password.
            NOTE: ``self.get_stderr``'s result might or might not have the same type as ``e.output`` in case of error.
            """
            string_type = (
                ensure_text
                if isinstance(result, six.text_type)
                else ensure_binary
            )
            return result.replace(
                string_type(self.url),
                string_type(furl(self.url).remove(password=True).tostr()),
            )

        def print_output(output):
            print(ensure_text(output))

        try:
            print_output(normalize_output(self.get_stderr(*clone_command)))
            self._print_output(self._normalize_output(self.get_stderr(*clone_command)))
        except subprocess.CalledProcessError as e:
            # In Python 3, subprocess.CalledProcessError has a `stderr` attribute,
            # but since stderr is redirect to `subprocess.PIPE` it will appear in the usual `output` attribute
            if e.output:
                e.output = normalize_output(e.output)
                print_output(e.output)
                e.output = self._normalize_output(e.output)
                self._print_output(e.output)
            raise

    def _normalize_output(self, result):
        """
        Returns result string without user's password.
        NOTE: ``self.get_stderr``'s result might or might not have the same type as ``e.output`` in case of error.
        """
        string_type = (
            ensure_text
            if isinstance(result, six.text_type)
            else ensure_binary
        )
        return result.replace(
            string_type(self.url),
            string_type(furl(self.url).remove(password=True).tostr()),
        )

    @staticmethod
    def _print_output(output):
        print(ensure_text(output))

    def checkout(self):
        # type: () -> None
        """
@@ -473,10 +487,12 @@ class VCS(object):
        return Argv(self.executable_name, *argv)

    @classmethod
    def add_auth(cls, config, url):
    def add_auth(cls, config, url, reset_auth=False):
        """
        Add username and password to URL if missing from URL and present in config.
        Does not modify ssh URLs.

        :param reset_auth: If true remove the user/pass from the URL (default False)
        """
        try:
            parsed_url = furl(url)
@@ -493,7 +509,10 @@ class VCS(object):
            and config_pass
            and (not config_domain or config_domain.lower() == parsed_url.host)
        ):
            parsed_url.set(username=config_user, password=config_pass)
            if reset_auth:
                parsed_url.set(username=None, password=None)
            else:
                parsed_url.set(username=config_user, password=config_pass)
        return parsed_url.url

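With the `reset_auth` flag, the same helper now both injects and strips credentials, which is what `url_without_auth` relies on. A runnable sketch of both directions using `furl`, simplified to skip the config-matching checks; the repository URL and token are hypothetical:

```python
from furl import furl

def add_auth(url, user, password, reset_auth=False):
    parsed_url = furl(url)
    if parsed_url.scheme in ("http", "https"):
        if reset_auth:
            parsed_url.set(username=None, password=None)
        else:
            parsed_url.set(username=user, password=password)
    return parsed_url.url

print(add_auth("https://github.com/org/repo.git", "bot", "token123"))
# -> https://bot:token123@github.com/org/repo.git
print(add_auth("https://bot:token123@github.com/org/repo.git", "", "", reset_auth=True))
# -> https://github.com/org/repo.git
```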
    @abc.abstractmethod
@@ -531,6 +550,10 @@ class Git(VCS):

    def __init__(self, *args, **kwargs):
        super(Git, self).__init__(*args, **kwargs)

        self._use_ask_pass = False if not self.session.config.get('agent.enable_git_ask_pass', None) \
            else sys.platform == "linux"

        try:
            self.call("config", "--global", "--replace-all", "safe.directory", "*", cwd=self.location)
        except:  # noqa
@@ -558,6 +581,66 @@ class Git(VCS):
    def pull(self):
        self.call("fetch", "--all", "--recurse-submodules", cwd=self.location)

    def _git_pass_auth_wrapper(self, func, *args, **kwargs):
        try:
            url_with_auth = furl(self.url_with_auth)
            password = url_with_auth.password if url_with_auth else None
            username = url_with_auth.username if url_with_auth else None
        except:  # noqa
            password = None
            username = None

        # if this is not linux or we do not have a password, just run as is
        if not self._use_ask_pass or not password or not username:
            return func(*args, **kwargs)

        # create the password file
        fp, pass_file = tempfile.mkstemp(prefix='clearml_git_', suffix='.sh')
        os.close(fp)
        with open(pass_file, 'wt') as f:
            # get first letter only (username / password are the argument options)
            # then echo the correct information
            f.writelines([
                '#!/bin/bash\n',
                'c="$1"\n',
                'c="${c%"${c#?}"}"\n',
                'if [ "$c" == "u" ] || [ "$c" == "U" ]; then echo "{}"; else echo "{}"; fi\n'.format(
                    username.replace('"', '\\"'), password.replace('"', '\\"')
                )
            ])
        # mark executable
        st = os.stat(pass_file)
        os.chmod(pass_file, st.st_mode | stat.S_IEXEC)
        # let GIT use it
        self.COMMAND_ENV["GIT_ASKPASS"] = pass_file
        # call git command
        try:
            ret = func(*args, **kwargs)
        finally:
            # delete temp password file
            self.COMMAND_ENV.pop("GIT_ASKPASS", None)
            safe_remove_file(pass_file)

        return ret

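Git invokes the `GIT_ASKPASS` program once per prompt, passing prompt text such as `Username for '...'` or `Password for '...'` as the first argument; the generated shell script above answers by dispatching on the first letter of that argument. The same dispatch, rendered in Python for clarity (credentials are hypothetical):

```python
def askpass(prompt, username, password):
    # 'U'/'u' prompts get the username, anything else gets the password
    return username if prompt[:1] in ("u", "U") else password

print(askpass("Username for 'https://github.com': ", "bot", "token123"))       # -> bot
print(askpass("Password for 'https://bot@github.com': ", "bot", "token123"))   # -> token123
```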
    def get_stderr(self, *argv, **kwargs):
        """
        Wrapper with git password authentication
        """
        return self._git_pass_auth_wrapper(super(Git, self).get_stderr, *argv, **kwargs)

    def call_with_stdin(self, *argv, **kwargs):
        """
        Wrapper with git password authentication
        """
        return self._git_pass_auth_wrapper(super(Git, self).call_with_stdin, *argv, **kwargs)

    def call(self, *argv, **kwargs):
        """
        Wrapper with git password authentication
        """
        return self._git_pass_auth_wrapper(super(Git, self).call, *argv, **kwargs)

    def checkout(self):  # type: () -> None
        """
        Checkout repository at specified revision

@@ -82,7 +82,7 @@ class ResourceMonitor(object):
        if not worker_tags and ENV_WORKER_TAGS.get():
            worker_tags = shlex.split(ENV_WORKER_TAGS.get())
        self._worker_tags = worker_tags
        if os.environ.get('NVIDIA_VISIBLE_DEVICES') == 'none':
        if Session.get_nvidia_visible_env() == 'none':
            # NVIDIA_VISIBLE_DEVICES set to none, marks cpu_only flag
            # active_gpus == False means no GPU reporting
            self._active_gpus = False
@@ -92,10 +92,10 @@ class ResourceMonitor(object):
            # None means no filtering, report all gpus
            self._active_gpus = None
            try:
                active_gpus = os.environ.get('NVIDIA_VISIBLE_DEVICES', '') or \
                    os.environ.get('CUDA_VISIBLE_DEVICES', '')
                if active_gpus:
                    self._active_gpus = [int(g.strip()) for g in active_gpus.split(',')]
                active_gpus = Session.get_nvidia_visible_env()
                # None means no filtering, report all gpus
                if active_gpus and active_gpus != "all":
                    self._active_gpus = [g.strip() for g in str(active_gpus).split(',')]
            except Exception:
                pass

@@ -263,7 +263,7 @@ class ResourceMonitor(object):
        gpu_stat = self._gpustat.new_query()
        for i, g in enumerate(gpu_stat.gpus):
            # only monitor the active gpu's, if none were selected, monitor everything
            if self._active_gpus and i not in self._active_gpus:
            if self._active_gpus and str(i) not in self._active_gpus:
                continue
            stats["gpu_temperature_{:d}".format(i)] = g["temperature.gpu"]
            stats["gpu_utilization_{:d}".format(i)] = g["utilization.gpu"]

@@ -22,7 +22,7 @@ WORKER_ARGS = {
        'help': 'git username for repository access',
    },
    '--git-pass': {
        'help': 'git password for repository access',
        'help': 'git password (personal access tokens) for repository access',
    },
    '--log-level': {
        'help': 'SDK log level',

@@ -76,7 +76,7 @@ class Session(_Session):

        cpu_only = kwargs.get('cpu_only')
        if cpu_only:
            os.environ['CUDA_VISIBLE_DEVICES'] = os.environ['NVIDIA_VISIBLE_DEVICES'] = 'none'
            Session.set_nvidia_visible_env('none')

        if kwargs.get('gpus') and not os.environ.get('KUBERNETES_SERVICE_HOST') \
                and not os.environ.get('KUBERNETES_PORT'):
@@ -85,7 +85,7 @@ class Session(_Session):
                os.environ.pop('CUDA_VISIBLE_DEVICES', None)
                os.environ['NVIDIA_VISIBLE_DEVICES'] = kwargs.get('gpus')
            else:
                os.environ['CUDA_VISIBLE_DEVICES'] = os.environ['NVIDIA_VISIBLE_DEVICES'] = kwargs.get('gpus')
                Session.set_nvidia_visible_env(kwargs.get('gpus'))

        if kwargs.get('only_load_config'):
            from clearml_agent.backend_api.config import load
@@ -288,7 +288,7 @@ class Session(_Session):
    def get(self, service, action, version=None, headers=None,
            data=None, json=None, async_enable=False, **kwargs):
        return self._manual_request(service=service, action=action,
                                    version=version, method="get", headers=headers,
                                    version=version, method=Request.def_method, headers=headers,
                                    data=data, async_enable=async_enable,
                                    json=json or kwargs)

@@ -299,7 +299,7 @@ class Session(_Session):
                                    data=data, async_enable=async_enable,
                                    json=json or kwargs)

    def _manual_request(self, service, action, version=None, method="get", headers=None,
    def _manual_request(self, service, action, version=None, method=Request.def_method, headers=None,
                        data=None, json=None, async_enable=False, **kwargs):

        res = self.send_request(service=service, action=action,
@@ -327,6 +327,23 @@ class Session(_Session):
    def command(self, *args):
        return Argv(*args, log=self.get_logger(Argv.__module__))

    @staticmethod
    def set_nvidia_visible_env(gpus):
        if not gpus:
            gpus = ""
        visible_env = gpus.replace(".", ":") if isinstance(gpus, str) else \
            ','.join(str(g).replace(".", ":") for g in gpus)

        os.environ['CUDA_VISIBLE_DEVICES'] = os.environ['NVIDIA_VISIBLE_DEVICES'] = visible_env

    @staticmethod
    def get_nvidia_visible_env():
        visible_env = os.environ.get('NVIDIA_VISIBLE_DEVICES') or os.environ.get('CUDA_VISIBLE_DEVICES')
        if visible_env is None:
            return None
        visible_env = str(visible_env).replace(":", ".")
        return visible_env

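These two helpers centralize all NVIDIA/CUDA device-visibility handling and normalize MIG identifiers: the runtime side uses `0:1`, while the agent keeps `0.1` because `:` is reserved inside worker IDs. A standalone rendering of the round-trip:

```python
import os

def set_nvidia_visible_env(gpus):
    gpus = gpus or ""
    visible = gpus.replace(".", ":") if isinstance(gpus, str) else \
        ",".join(str(g).replace(".", ":") for g in gpus)
    os.environ["CUDA_VISIBLE_DEVICES"] = os.environ["NVIDIA_VISIBLE_DEVICES"] = visible

def get_nvidia_visible_env():
    visible = os.environ.get("NVIDIA_VISIBLE_DEVICES") or os.environ.get("CUDA_VISIBLE_DEVICES")
    return None if visible is None else str(visible).replace(":", ".")

set_nvidia_visible_env("0.1")                 # agent-internal MIG id
print(os.environ["NVIDIA_VISIBLE_DEVICES"])   # -> 0:1 (what the runtime expects)
print(get_nvidia_visible_env())               # -> 0.1 (back to agent notation)
```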
@attr.s
class TrainsAgentLogger(object):

@@ -1 +1 @@
__version__ = '1.2.4rc1'
__version__ = '1.4.1'

@@ -136,6 +136,12 @@ agent {
    },

    translate_ssh: true,

    # set "disable_ssh_mount: true" to disable the automatic mount of ~/.ssh folder into the docker containers
    # default is false, automatically mounts ~/.ssh
    # Must be set to True if using "clearml-session" with this agent!
    # disable_ssh_mount: false

    # reload configuration file every daemon execution
    reload_config: false,

@@ -245,7 +251,7 @@ agent {
    # pip_cache: "/root/.cache/pip"
    # poetry_cache: "/root/.cache/pypoetry"
    # vcs_cache: "/root/.clearml/vcs-cache"
    # venv_build: "/root/.clearml/venvs-builds"
    # venv_build: "~/.clearml/venvs-builds"
    # pip_download: "/root/.clearml/pip-download-cache"
    # }

@@ -325,6 +331,11 @@ sdk {
            key: ""
            secret: ""
            region: ""
            # Or enable credentials chain to let Boto3 pick the right credentials.
            # This includes picking credentials from environment variables,
            # credential file and IAM role using metadata service.
            # Refer to the latest Boto3 docs
            use_credentials_chain: false

            credentials: [
                # specifies key/secret credentials to use when handling s3 urls (read or write)

@@ -8,7 +8,7 @@ psutil>=3.4.2,<5.9.0
pyhocon>=0.3.38,<0.4.0
pyparsing>=2.0.3,<2.5.0
python-dateutil>=2.4.2,<2.9.0
pyjwt>=1.6.4,<2.1.0
pyjwt>=2.4.0,<2.5.0
PyYAML>=3.12,<5.5.0
requests>=2.20.0,<2.26.0
six>=1.13.0,<1.16.0