Mirror of https://github.com/clearml/clearml-agent (synced 2025-06-26 18:16:15 +00:00)
initial clearml-agent v0.17.0
278
README.md
@@ -1,80 +1,107 @@
|
||||
# Allegro Trains Agent
|
||||
## Deep Learning DevOps For Everyone - Now supporting all platforms (Linux, macOS, and Windows)
|
||||
<div align="center">
|
||||
|
||||
"All the Deep-Learning DevOps your research needs, and then some... Because ain't nobody got time for that"
|
||||
<img src="docs/clearml_agent_logo.png" width="250px">
|
||||
|
||||
**ClearML Agent - ML-Ops made easy
|
||||
An ML-Ops scheduler & orchestration solution supporting Linux, macOS and Windows**
|
||||
|
||||
[](https://img.shields.io/github/license/allegroai/trains-agent.svg)
|
||||
[](https://img.shields.io/pypi/pyversions/trains-agent.svg)
|
||||
[](https://img.shields.io/pypi/v/trains-agent.svg)
|
||||
[](https://pypi.python.org/pypi/trains-agent/)
|
||||
[](https://img.shields.io/pypi/pyversions/clearml-agent.svg)
|
||||
[](https://img.shields.io/pypi/v/clearml-agent.svg)
|
||||
[](https://pypi.python.org/pypi/clearml-agent/)
|
||||
|
||||
### Help improve Trains by filling our 2-min [user survey](https://allegro.ai/lp/trains-user-survey/)
|
||||
</div>
|
||||
|
||||
**Trains Agent is an AI experiment cluster solution.**
|
||||
---
|
||||
|
||||
It is a zero configuration fire-and-forget execution agent, which combined with trains-server provides a full AI cluster solution.
|
||||
### ClearML-Agent
|
||||
#### *Formerly known as Trains Agent*
|
||||
|
||||
**Full AutoML in 5 steps**
|
||||
1. Install the [Trains Server](https://github.com/allegroai/trains-agent) (or use our [open server](https://demoapp.trains.allegro.ai))
|
||||
2. `pip install trains-agent` ([install](#installing-the-trains-agent) the Trains Agent on any GPU machine: on-premises / cloud / ...)
|
||||
3. Add [Trains](https://github.com/allegroai/trains) to your code with just 2 lines & run it once (on your machine / laptop)
|
||||
4. Change the [parameters](#using-the-trains-agent) in the UI & schedule for [execution](#using-the-trains-agent) (or automate with an [AutoML pipeline](#automl-and-orchestration-pipelines-))
|
||||
|
||||
* Run jobs (experiments) on any local or cloud based resource
|
||||
* Implement optimized resource utilization policies
|
||||
* Deploy execution environments with either virtualenv or fully docker containerized with zero effort
|
||||
* Launch-and-Forget service containers
|
||||
* [Cloud autoscaling](https://allegro.ai/clearml/docs/examples/services/aws_autoscaler/aws_autoscaler/)
|
||||
* [Customizable cleanup](https://allegro.ai/clearml/docs/examples/services/cleanup/cleanup_service/)
|
||||
* Advanced [pipeline building and execution](https://allegro.ai/clearml/docs/examples/frameworks/pytorch/notebooks/table/tabular_training_pipeline/)
|
||||
|
||||
It is a zero configuration fire-and-forget execution agent, providing a full ML/DL cluster solution.
|
||||
|
||||
**Full Automation in 5 steps**
|
||||
1. ClearML Server [self-hosted](https://github.com/allegroai/trains-server) or [free tier hosting](https://app.community.clear.ml)
|
||||
2. `pip install clearml-agent` ([install](#installing-the-clearml-agent) the ClearML Agent on any GPU machine: on-premises / cloud / ...)
|
||||
3. Create a [job](https://github.com/allegroai/clearml/docs/clearml-task.md) or Add [ClearML](https://github.com/allegroai/trains) to your code with just 2 lines
|
||||
4. Change the [parameters](#using-the-clearml-agent) in the UI & schedule for [execution](#using-the-clearml-agent) (or automate with an [AutoML pipeline](#automl-and-orchestration-pipelines-))
|
||||
5. :chart_with_downwards_trend: :chart_with_upwards_trend: :eyes: :beer:
|
||||
|
||||
"All the Deep/Machine-Learning DevOps your research needs, and then some... Because ain't nobody got time for that"
|
||||
|
||||
**Using the Trains Agent, you can now set up a dynamic cluster with \*epsilon DevOps**
|
||||
**Try ClearML now** [Self Hosted](https://github.com/allegroai/trains-server) or [Free tier Hosting](https://app.community.clear.ml)
|
||||
<a href="https://app.community.clear.ml"><img src="https://raw.githubusercontent.com/allegroai/trains-agent/9f1e86c1ca45c984ee13edc9353c7b10c55d7257/docs/screenshots.gif" width="100%"></a>
|
||||
|
||||
*epsilon - Because we are scientists :triangular_ruler: and nothing is really zero work
|
||||
|
||||
(Experience Trains live at [https://demoapp.trains.allegro.ai](https://demoapp.trains.allegro.ai))
|
||||
<a href="https://demoapp.trains.allegro.ai"><img src="https://raw.githubusercontent.com/allegroai/trains-agent/9f1e86c1ca45c984ee13edc9353c7b10c55d7257/docs/screenshots.gif" width="100%"></a>
|
||||
|
||||
## Simple, Flexible Experiment Orchestration
|
||||
**The Trains Agent was built to address the DL/ML R&D DevOps needs:**
|
||||
### Simple, Flexible Experiment Orchestration
|
||||
**The ClearML Agent was built to address the DL/ML R&D DevOps needs:**
|
||||
|
||||
* Easily add & remove machines from the cluster
|
||||
* Reuse machines without the need for any dedicated containers or images
|
||||
* **Combine GPU resources across any cloud and on-prem**
|
||||
* **No need for yaml/json/template configuration of any kind**
|
||||
* **No need for yaml / json / template configuration of any kind**
|
||||
* **User friendly UI**
|
||||
* Manageable resource allocation that can be used by researchers and engineers
|
||||
* Flexible and controllable scheduler with priority support
|
||||
* Automatic instance spinning in the cloud **(coming soon)**
|
||||
* Automatic instance spinning in the cloud
|
||||
|
||||
**Using the ClearML Agent, you can now set up a dynamic cluster with \*epsilon DevOps**
|
||||
|
||||
*epsilon - Because we are :triangular_ruler: and nothing is really zero work
|
||||
|
||||
|
||||
## But ... K8S?
|
||||
We think Kubernetes is awesome.
|
||||
Combined with KubeFlow it is a robust solution for production-grade DevOps.
|
||||
We've observed, however, that it can be a bit of an overkill as an R&D DL/ML solution.
|
||||
If you are considering K8S for your research, also consider that you will soon be managing **hundreds** of containers...
|
||||
### Kubernetes Integration (Optional)
|
||||
We think Kubernetes is awesome, but it should be a choice.
|
||||
We designed `clearml-agent` so you can run bare-metal or inside a pod with any mix that fits your environment.
|
||||
#### Benefits of integrating existing K8s with ClearML-Agent
|
||||
- ClearML-Agent adds the missing scheduling capabilities to K8s
|
||||
- Allowing for more flexible automation from code
|
||||
- A programmatic interface for easier learning curve (and debugging)
|
||||
- Seamless integration with ML/DL experiment manager
|
||||
- Web UI for customization, scheduling & prioritization of jobs
|
||||
|
||||
In our experience, handling and building the environments, having to package every experiment in a docker, managing those hundreds (or more) containers and building pipelines on top of it all, is very complicated (also, it’s usually out of scope for the research team, and overwhelming even for the DevOps team).
|
||||
**Two K8s integration flavours**
|
||||
- Spin ClearML-Agent as a long-lasting service pod
|
||||
- use [clearml-agent](https://hub.docker.com/r/allegroai/trains-agent) docker image
|
||||
- map docker socket into the pod (soon replaced by [podman](https://github.com/containers/podman))
|
||||
- allow the clearml-agent to manage sibling dockers
|
||||
- benefits: full use of the ClearML scheduling, no need to worry about wrong container images / lost pods etc.
|
||||
- downside: Sibling containers
|
||||
- Kubernetes Glue, map ClearML jobs directly to K8s jobs
|
||||
- Run the [clearml-k8s glue](https://github.com/allegroai/trains-agent/blob/master/examples/k8s_glue_example.py) on a K8s cpu node
|
||||
- The clearml-k8s glue pulls jobs from the ClearML job execution queue and prepares a K8s job (based on provided yaml template)
|
||||
- Inside the pod itself the clearml-agent will install the job (experiment) environment and spin and monitor the experiment's process
|
||||
- benefits: Kubernetes full view of all running jobs in the system
|
||||
- downside: No real scheduling (k8s scheduler), no docker image verification (post-mortem only)
|
||||
|
||||
We feel there has to be a better way, that can be just as powerful for R&D and at the same time allow integration with K8S **when the need arises**.
|
||||
(If you already have a K8S cluster for AI, detailed instructions on how to integrate Trains into your K8S cluster are [here](https://github.com/allegroai/trains-server-k8s/tree/master/trains-server-chart) with included [helm chart](https://github.com/allegroai/trains-server-helm))
|
||||
|
||||
|
||||
## Using the Trains Agent
|
||||
### Using the ClearML Agent
|
||||
**Full scale HPC with a click of a button**
|
||||
|
||||
The Trains Agent is a job scheduler that listens on job queue(s), pulls jobs, sets the job environments, executes the job and monitors its progress.
|
||||
The ClearML Agent is a job scheduler that listens on job queue(s), pulls jobs, sets the job environments, executes the job and monitors its progress.
|
||||
|
||||
Any 'Draft' experiment can be scheduled for execution by a Trains agent.
|
||||
Any 'Draft' experiment can be scheduled for execution by a ClearML agent.
|
||||
|
||||
A previously run experiment can be put into 'Draft' state by either of two methods:
|
||||
* Using the **'Reset'** action from the experiment right-click context menu in the
|
||||
Trains UI - This will clear any results and artifacts the previous run had created.
|
||||
ClearML UI - This will clear any results and artifacts the previous run had created.
|
||||
* Using the **'Clone'** action from the experiment right-click context menu in the
|
||||
Trains UI - This will create a new 'Draft' experiment with the same configuration as the original experiment.
|
||||
ClearML UI - This will create a new 'Draft' experiment with the same configuration as the original experiment.
|
||||
|
||||
An experiment is scheduled for execution using the **'Enqueue'** action from the experiment
|
||||
right-click context menu in the Trains UI and selecting the execution queue.
|
||||
right-click context menu in the ClearML UI and selecting the execution queue.
|
||||
|
||||
See [creating an experiment and enqueuing it for execution](#from-scratch).
|
||||
|
||||
Once an experiment is enqueued, it will be picked up and executed by a Trains agent monitoring this queue.
|
||||
Once an experiment is enqueued, it will be picked up and executed by a ClearML agent monitoring this queue.
|
||||
|
||||
The Trains UI Workers & Queues page provides ongoing execution information:
|
||||
The ClearML UI Workers & Queues page provides ongoing execution information:
|
||||
- Workers Tab: Monitor your cluster
|
||||
- Review available resources
|
||||
- Monitor machines statistics (CPU / GPU / Disk / Network)
|
||||
@@ -83,162 +110,129 @@ The Trains UI Workers & Queues page provides ongoing execution information:
|
||||
- Cancel or abort job execution
|
||||
- Move jobs between execution queues
|
||||
|
||||
### What The Trains Agent Actually Does
|
||||
The Trains Agent executes experiments using the following process:
|
||||
#### What The ClearML Agent Actually Does
|
||||
The ClearML Agent executes experiments using the following process:
|
||||
- Create a new virtual environment (or launch the selected docker image)
|
||||
- Clone the code into the virtual-environment (or inside the docker)
|
||||
- Install python packages based on the package requirements listed for the experiment
|
||||
- Special note for PyTorch: The Trains Agent will automatically select the
|
||||
- Special note for PyTorch: The ClearML Agent will automatically select the
|
||||
torch packages based on the CUDA_VERSION environment variable of the machine
|
||||
- Execute the code, while monitoring the process
|
||||
- Log all stdout/stderr in the Trains UI, including the cloning and installation process, for easy debugging
|
||||
- Monitor the execution and allow you to manually abort the job using the Trains UI (or, in the unfortunate case of a code crash, catch the error and signal the experiment has failed)
|
||||
- Log all stdout/stderr in the ClearML UI, including the cloning and installation process, for easy debugging
|
||||
- Monitor the execution and allow you to manually abort the job using the ClearML UI (or, in the unfortunate case of a code crash, catch the error and signal the experiment has failed)
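The process listed above can be sketched as an ordered plan of commands. This is only a minimal illustration under stated assumptions: `plan_execution`, its parameters, the `venv`/`code` folder names and the exact commands are hypothetical and are not taken from the ClearML Agent source. The function only builds the command list (POSIX-style paths assumed) and does not run anything.

```python
import shlex


def plan_execution(repo_url, commit, requirements, entry_point, venv_dir="venv"):
    """Return the ordered commands for one job (hypothetical sketch).

    Mirrors the listed steps: create an environment, clone the code,
    install the requirements, then execute the entry point.
    """
    python = f"{venv_dir}/bin/python"  # assumption: POSIX virtualenv layout
    plan = [
        # 1. create a fresh virtual environment
        ["python3", "-m", "venv", venv_dir],
        # 2. clone the code and check out the recorded commit
        ["git", "clone", repo_url, "code"],
        ["git", "-C", "code", "checkout", commit],
    ]
    if requirements:
        # 3. install the packages listed for the experiment
        plan.append([python, "-m", "pip", "install", *requirements])
    # 4. execute the code (a real agent would also monitor the process)
    plan.append([python, f"code/{entry_point}"])
    return plan


if __name__ == "__main__":
    for command in plan_execution(
        "https://example.com/repo.git", "abc123", ["numpy==1.19.4"], "train.py"
    ):
        print(shlex.join(command))
```

Keeping the plan as data rather than executing it directly makes the ordering easy to inspect and log, which is in the spirit of the stdout/stderr logging described above.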
|
||||
|
||||
### System Design & Flow
|
||||
```text
                                                                              +-----------------+
                                                                              |   GPU Machine   |
Development Machine                                                           |                 |
+------------------------+                                                    | +-------------+ |
|    Data Scientist's    |                            +--------------+        | |Trains Agent | |
|      DL/ML Code        |                            |    WEB UI    |        | |             | |
|                        |                            |              |        | | +---------+ | |
|                        |                            |              |        | | |  DL/ML  | | |
|                        |                            +--------------+        | | |  Code   | | |
|                        |   User Clones Exp #1      / . . . . . . . /        | | |         | | |
| +-------------------+  |   into Exp #2            / . . . . . . . /         | | +---------+ | |
| |      Trains       |  |         +---------------/-_____________-/          | |             | |
| +---------+---------+  |         |                                          | |      ^      | |
+-----------|------------+         |                                          | +------|------+ |
            |                      |                                          +--------|--------+
 Auto-Magically                    |                                                   |
 Creates Exp #1                    |                                    The Trains Agent
             \        User Change Hyper-Parameters                      Pulls Exp #2, setup the
             |                     |                                    environment & clone code.
             |                     |                                    Start execution with the
+------------|------------+        |            +--------------------+  new set of Hyper-Parameters.
|  +---------v---------+  |        |            |   Trains Server    |                 |
|  |   Experiment #1   |  |        |            |                    |                 |
|  +-------------------+  |        |            |  Execution Queue   |                 |
|           ||            |        |            |                    |                 |
|  +-------------------+<----------+            |                    |                 |
|  |                   |  |                     |                    |                 |
|  |   Experiment #2   |  |                     |                    |                 |
|  +-------------------<------------\           |                    |                 |
|                         |          ------------->---------------+  |                 |
|                         | User Send Exp #2    | |Execute Exp #2 +--------------------+
|                         |  For Execution      | +---------------+  |
|      Trains Server      |                     |                    |
+-------------------------+                     +--------------------+
```
|
||||
#### System Design & Flow
|
||||
|
||||
### Installing the Trains Agent
|
||||
<img src="https://allegro.ai/clearml/docs/img/ClearML_Architecture.png" width="100%" alt="clearml-architecture">
|
||||
|
||||
|
||||
#### Installing the ClearML Agent
|
||||
|
||||
```bash
pip install trains-agent
pip install clearml-agent
```
|
||||
|
||||
### Trains Agent Usage Examples
|
||||
#### ClearML Agent Usage Examples
|
||||
|
||||
Full Interface and capabilities are available with
|
||||
```bash
trains-agent --help
trains-agent daemon --help
clearml-agent --help
clearml-agent daemon --help
```
|
||||
|
||||
### Configuring the Trains Agent
|
||||
#### Configuring the ClearML Agent
|
||||
|
||||
```bash
trains-agent init
clearml-agent init
```
|
||||
|
||||
Note: The Trains Agent uses a cache folder to cache pip packages, apt packages and cloned repositories. The default Trains Agent cache folder is `~/.trains`
|
||||
Note: The ClearML Agent uses a cache folder to cache pip packages, apt packages and cloned repositories. The default ClearML Agent cache folder is `~/.clearml`
|
||||
|
||||
See full details in your configuration file at `~/trains.conf`
|
||||
See full details in your configuration file at `~/clearml.conf`
|
||||
|
||||
Note: The **Trains agent** extends the **Trains** configuration file `~/trains.conf`
|
||||
They are designed to share the same configuration file, see example [here](docs/trains.conf)
|
||||
Note: The **ClearML agent** extends the **ClearML** configuration file `~/clearml.conf`
|
||||
They are designed to share the same configuration file, see example [here](docs/clearml.conf)
|
||||
|
||||
### Running the Trains Agent
|
||||
#### Running the ClearML Agent
|
||||
|
||||
For debug and experimentation, start the Trains agent in `foreground` mode, where all the output is printed to screen
|
||||
For debug and experimentation, start the ClearML agent in `foreground` mode, where all the output is printed to screen
|
||||
```bash
trains-agent daemon --queue default --foreground
clearml-agent daemon --queue default --foreground
```
|
||||
|
||||
For actual service mode, all the stdout will be stored automatically into a temporary file (no need to pipe)
|
||||
Notice: with `--detached` flag, the *trains-agent* will be running in the background
|
||||
Notice: with `--detached` flag, the *clearml-agent* will be running in the background
|
||||
```bash
trains-agent daemon --detached --queue default
clearml-agent daemon --detached --queue default
```
|
||||
|
||||
GPU allocation is controlled via the standard OS environment `NVIDIA_VISIBLE_DEVICES` or `--gpus` flag (or disabled with `--cpu-only`).
|
||||
|
||||
If no flag is set, and `NVIDIA_VISIBLE_DEVICES` variable doesn't exist, all GPU's will be allocated for the `trains-agent` <br>
|
||||
If `--cpu-only` flag is set, or `NVIDIA_VISIBLE_DEVICES` is an empty string (""), no gpu will be allocated for the `trains-agent`
|
||||
If no flag is set, and `NVIDIA_VISIBLE_DEVICES` variable doesn't exist, all GPU's will be allocated for the `clearml-agent` <br>
|
||||
If `--cpu-only` flag is set, or `NVIDIA_VISIBLE_DEVICES` is an empty string (""), no gpu will be allocated for the `clearml-agent`
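The allocation rules above can be expressed as a small decision function. This is a hedged sketch, not the agent's actual code: `resolve_gpus` is a hypothetical helper, and the precedence between `--gpus` and `NVIDIA_VISIBLE_DEVICES` (flag first) is an assumption, since the README does not state it.

```python
import os


def resolve_gpus(gpus_flag=None, cpu_only=False, environ=None):
    """Return "all", or a list of GPU ids (an empty list means no GPU).

    Hypothetical helper. Assumed precedence:
    --cpu-only, then --gpus, then NVIDIA_VISIBLE_DEVICES.
    """
    environ = os.environ if environ is None else environ
    if cpu_only:
        return []  # --cpu-only: no GPU is allocated
    if gpus_flag is not None:
        return [g.strip() for g in gpus_flag.split(",") if g.strip()]
    visible = environ.get("NVIDIA_VISIBLE_DEVICES")
    if visible is None or visible == "all":
        return "all"  # variable missing (or explicitly "all"): every GPU
    if visible == "":
        return []  # empty string: no GPU is allocated
    return [g.strip() for g in visible.split(",") if g.strip()]


if __name__ == "__main__":
    print(resolve_gpus(gpus_flag="0,1", environ={}))
    print(resolve_gpus(environ={"NVIDIA_VISIBLE_DEVICES": ""}))
```

Passing `environ` explicitly keeps the function easy to exercise without touching the real process environment.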
|
||||
|
||||
Example: spin two agents, one per gpu on the same machine:
|
||||
Notice: with `--detached` flag, the *trains-agent* will be running in the background
|
||||
Notice: with `--detached` flag, the *clearml-agent* will be running in the background
|
||||
```bash
trains-agent daemon --detached --gpus 0 --queue default
trains-agent daemon --detached --gpus 1 --queue default
clearml-agent daemon --detached --gpus 0 --queue default
clearml-agent daemon --detached --gpus 1 --queue default
```
|
||||
|
||||
Example: spin two agents, pulling from dedicated `dual_gpu` queue, two gpu's per agent
|
||||
```bash
trains-agent daemon --detached --gpus 0,1 --queue dual_gpu
trains-agent daemon --detached --gpus 2,3 --queue dual_gpu
clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu
clearml-agent daemon --detached --gpus 2,3 --queue dual_gpu
```
|
||||
|
||||
#### Starting the Trains Agent in docker mode
|
||||
##### Starting the ClearML Agent in docker mode
|
||||
|
||||
For debug and experimentation, start the Trains agent in `foreground` mode, where all the output is printed to screen
|
||||
For debug and experimentation, start the ClearML agent in `foreground` mode, where all the output is printed to screen
|
||||
```bash
trains-agent daemon --queue default --docker --foreground
clearml-agent daemon --queue default --docker --foreground
```
|
||||
|
||||
For actual service mode, all the stdout will be stored automatically into a file (no need to pipe)
|
||||
Notice: with `--detached` flag, the *trains-agent* will be running in the background
|
||||
Notice: with `--detached` flag, the *clearml-agent* will be running in the background
|
||||
```bash
trains-agent daemon --detached --queue default --docker
clearml-agent daemon --detached --queue default --docker
```
|
||||
|
||||
Example: spin two agents, one per gpu on the same machine, with default nvidia/cuda docker:
|
||||
```bash
trains-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda
trains-agent daemon --detached --gpus 1 --queue default --docker nvidia/cuda
clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda
clearml-agent daemon --detached --gpus 1 --queue default --docker nvidia/cuda
```
|
||||
|
||||
Example: spin two agents, pulling from dedicated `dual_gpu` queue, two gpu's per agent, with default nvidia/cuda docker:
|
||||
```bash
trains-agent daemon --detached --gpus 0,1 --queue dual_gpu --docker nvidia/cuda
trains-agent daemon --detached --gpus 2,3 --queue dual_gpu --docker nvidia/cuda
clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu --docker nvidia/cuda
clearml-agent daemon --detached --gpus 2,3 --queue dual_gpu --docker nvidia/cuda
```
|
||||
|
||||
#### Starting the Trains Agent - Priority Queues
|
||||
##### Starting the ClearML Agent - Priority Queues
|
||||
|
||||
Priority Queues are also supported, example use case:
|
||||
|
||||
High priority queue: `important_jobs` Low priority queue: `default`
|
||||
```bash
trains-agent daemon --queue important_jobs default
clearml-agent daemon --queue important_jobs default
```
|
||||
The **Trains Agent** will first try to pull jobs from the `important_jobs` queue, only then it will fetch a job from the `default` queue.
|
||||
The **ClearML Agent** will first try to pull jobs from the `important_jobs` queue, only then it will fetch a job from the `default` queue.
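The pull order described above can be illustrated with plain in-memory queues. This is a toy sketch, not the agent's implementation: `pull_next_job` and the job names are hypothetical, and it assumes the higher-priority queue is re-checked before every pull.

```python
from collections import deque


def pull_next_job(queues, priority_order):
    """Pop the next job, scanning queues from highest to lowest priority."""
    for name in priority_order:
        pending = queues.get(name)
        if pending:
            return name, pending.popleft()
    return None  # every queue is empty


# Hypothetical queue contents, for illustration only.
queues = {
    "important_jobs": deque(["exp-A"]),
    "default": deque(["exp-B", "exp-C"]),
}
order = ["important_jobs", "default"]

pulled = []
while (job := pull_next_job(queues, order)) is not None:
    pulled.append(job)
print(pulled)
```

The order of the names passed to `--queue` plays the role of `order` here: a job in `default` is only taken once `important_jobs` has nothing pending.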
|
||||
|
||||
Adding queues, managing job order within a queue and moving jobs between queues, is available using the Web UI, see example on our [open server](https://demoapp.trains.allegro.ai/workers-and-queues/queues)
|
||||
Adding queues, managing job order within a queue and moving jobs between queues, is available using the Web UI, see example on our [free server](https://app.community.clear.ml/workers-and-queues/queues)
|
||||
|
||||
#### Stopping the Trains Agent
|
||||
##### Stopping the ClearML Agent
|
||||
|
||||
To stop a **Trains Agent** running in the background, run the same command line used to start the agent with `--stop` appended.
|
||||
To stop a **ClearML Agent** running in the background, run the same command line used to start the agent with `--stop` appended.
|
||||
For example, to stop the first of the above shown same machine, single gpu agents:
|
||||
```bash
trains-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda --stop
clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda --stop
```
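Since the stop command is simply the start command with `--stop` appended, scripts that launch agents can derive it mechanically. A minimal sketch, assuming the start command is kept as a string; `stop_command` is a hypothetical helper and not part of the agent CLI.

```python
import shlex


def stop_command(start_command):
    """Return the start command with `--stop` appended (unchanged if present)."""
    args = shlex.split(start_command)
    if "--stop" not in args:
        args.append("--stop")
    return shlex.join(args)


if __name__ == "__main__":
    start = "clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda"
    print(stop_command(start))
```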
|
||||
|
||||
## How do I create an experiment on the Trains Server? <a name="from-scratch"></a>
|
||||
* Integrate [Trains](https://github.com/allegroai/trains) with your code
|
||||
### How do I create an experiment on the ClearML Server? <a name="from-scratch"></a>
|
||||
* Integrate [ClearML](https://github.com/allegroai/trains) with your code
|
||||
* Execute the code on your machine (Manually / PyCharm / Jupyter Notebook)
|
||||
* As your code is running, **Trains** creates an experiment logging all the necessary execution information:
|
||||
* As your code is running, **ClearML** creates an experiment logging all the necessary execution information:
|
||||
- Git repository link and commit ID (or an entire jupyter notebook)
|
||||
- Git diff (we’re not saying you never commit and push, but still...)
|
||||
- Python packages used by your code (including specific versions used)
|
||||
@@ -247,7 +241,7 @@ trains-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda --s
|
||||
|
||||
You now have a 'template' of your experiment with everything required for automated execution
|
||||
|
||||
* In the Trains UI, Right click on the experiment and select 'clone'. A copy of your experiment will be created.
|
||||
* In the ClearML UI, Right click on the experiment and select 'clone'. A copy of your experiment will be created.
|
||||
* You now have a new draft experiment cloned from your original experiment, feel free to edit it
|
||||
- Change the Hyper-Parameters
|
||||
- Switch to the latest code base of the repository
|
||||
@@ -256,31 +250,31 @@ trains-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda --s
|
||||
- Or simply change nothing to run the same experiment again...
|
||||
* Schedule the newly created experiment for execution: Right-click the experiment and select 'enqueue'
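The clone, edit and enqueue steps above can also be scripted. The sketch below takes the task class as a parameter so it can be read and exercised without a server; in real use one would presumably pass `clearml.Task`, whose `get_task`, `clone`, `enqueue` and `update_parameters` methods this code relies on. The function name `clone_and_enqueue`, the clone naming scheme and the parameter keys are assumptions made for illustration.

```python
def clone_and_enqueue(task_cls, template_task_id, queue_name, overrides=None):
    """Clone a template experiment, optionally edit it, and enqueue the draft.

    `task_cls` is expected to behave like `clearml.Task`: class methods
    get_task / clone / enqueue, and update_parameters on instances.
    """
    template = task_cls.get_task(task_id=template_task_id)
    # The clone is created in 'Draft' state, so it can still be edited.
    draft = task_cls.clone(source_task=template, name="clone of " + template.name)
    if overrides:
        # Merge the new hyper-parameter values into the draft.
        draft.update_parameters(overrides)
    # Hand the draft to whichever agent is listening on this queue.
    task_cls.enqueue(draft, queue_name=queue_name)
    return draft


# Assumed real-world usage (requires a configured ClearML server):
#   from clearml import Task
#   clone_and_enqueue(Task, "<template task id>", "default")
```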
|
||||
|
||||
## Trains-Agent Services Mode <a name="services"></a>
|
||||
### ClearML-Agent Services Mode <a name="services"></a>
|
||||
|
||||
Trains-Agent Services is a special mode of Trains-Agent that provides the ability to launch long-lasting jobs
|
||||
that previously had to be executed on local / dedicated machines. It allows a single agent to
|
||||
launch multiple dockers (Tasks) for different use cases. To name a few use cases, auto-scaler service (spinning instances
|
||||
ClearML-Agent Services is a special mode of ClearML-Agent that provides the ability to launch long-lasting jobs
|
||||
that previously had to be executed on local / dedicated machines. It allows a single agent to
|
||||
launch multiple dockers (Tasks) for different use cases. To name a few use cases, auto-scaler service (spinning instances
|
||||
when the need arises and the budget allows), Controllers (Implementing pipelines and more sophisticated DevOps logic),
|
||||
Optimizer (such as Hyper-parameter Optimization or sweeping), and Application (such as interactive Bokeh apps for
|
||||
Optimizer (such as Hyper-parameter Optimization or sweeping), and Application (such as interactive Bokeh apps for
|
||||
increased data transparency)
|
||||
|
||||
Trains-Agent Services mode will spin **any** task enqueued into the specified queue.
|
||||
Every task launched by Trains-Agent Services will be registered as a new node in the system,
|
||||
providing tracking and transparency capabilities.
|
||||
Currently trains-agent in services-mode supports cpu only configuration. Trains-agent services mode can be launched alongside GPU agents.
|
||||
ClearML-Agent Services mode will spin **any** task enqueued into the specified queue.
|
||||
Every task launched by ClearML-Agent Services will be registered as a new node in the system,
|
||||
providing tracking and transparency capabilities.
|
||||
Currently clearml-agent in services-mode supports cpu only configuration. ClearML-agent services mode can be launched alongside GPU agents.
|
||||
|
||||
```bash
trains-agent daemon --services-mode --detached --queue services --create-queue --docker ubuntu:18.04 --cpu-only
clearml-agent daemon --services-mode --detached --queue services --create-queue --docker ubuntu:18.04 --cpu-only
```
|
||||
|
||||
**Note**: It is the user's responsibility to make sure the proper tasks are pushed into the specified queue.
|
||||
**Note**: It is the user's responsibility to make sure the proper tasks are pushed into the specified queue.
|
||||
|
||||
|
||||
## AutoML and Orchestration Pipelines <a name="automl-pipes"></a>
|
||||
The Trains Agent can also be used to implement AutoML orchestration and Experiment Pipelines in conjunction with the Trains package.
|
||||
### AutoML and Orchestration Pipelines <a name="automl-pipes"></a>
|
||||
The ClearML Agent can also be used to implement AutoML orchestration and Experiment Pipelines in conjunction with the ClearML package.
|
||||
|
||||
Sample AutoML & Orchestration examples can be found in the Trains [example/automation](https://github.com/allegroai/trains/tree/master/examples/automation) folder.
|
||||
Sample AutoML & Orchestration examples can be found in the ClearML [example/automation](https://github.com/allegroai/trains/tree/master/examples/automation) folder.
|
||||
|
||||
AutoML examples
|
||||
- [Toy Keras training experiment](https://github.com/allegroai/trains/blob/master/examples/optimization/hyper-parameter-optimization/base_template_keras_simple.py)
|
||||
@@ -294,6 +288,6 @@ Experiment Pipeline examples
|
||||
- [Second step experiment](https://github.com/allegroai/trains/blob/master/examples/automation/toy_base_task.py)
|
||||
- In order to create an experiment-template in the system, this code must be executed once manually
|
||||
|
||||
## License
|
||||
### License
|
||||
|
||||
Apache License, Version 2.0 (see the [LICENSE](https://www.apache.org/licenses/LICENSE-2.0.html) for more information)
|
||||
|
||||
@@ -4,13 +4,13 @@ import argparse
|
||||
import sys
|
||||
import warnings
|
||||
|
||||
from trains_agent.backend_api.session.datamodel import UnusedKwargsWarning
|
||||
from clearml_agent.backend_api.session.datamodel import UnusedKwargsWarning
|
||||
|
||||
import trains_agent
|
||||
from trains_agent.config import get_config
|
||||
from trains_agent.definitions import FileBuffering, CONFIG_FILE
|
||||
from trains_agent.helper.base import reverse_home_folder_expansion, chain_map, named_temporary_file
|
||||
from trains_agent.helper.process import ExitStatus
|
||||
import clearml_agent
|
||||
from clearml_agent.config import get_config
|
||||
from clearml_agent.definitions import FileBuffering, CONFIG_FILE
|
||||
from clearml_agent.helper.base import reverse_home_folder_expansion, chain_map, named_temporary_file
|
||||
from clearml_agent.helper.process import ExitStatus
|
||||
from . import interface, session, definitions, commands
|
||||
from .errors import ConfigFileNotFound, Sigterm, APIError
|
||||
from .helper.trace import PackageTrace
|
||||
@@ -47,7 +47,7 @@ def run_command(parser, args, command_name):
|
||||
except ConfigFileNotFound:
|
||||
message = 'Cannot find configuration file in "{}".\n' \
|
||||
'To create a configuration file, run:\n' \
|
||||
'$ trains_agent init'.format(reverse_home_folder_expansion(CONFIG_FILE))
|
||||
'$ clearml_agent init'.format(reverse_home_folder_expansion(CONFIG_FILE))
|
||||
command_class.exit(message)
|
||||
except APIError as api_error:
|
||||
if not debug:
|
||||
|
||||
@@ -1,12 +1,12 @@
|
||||
{
|
||||
# unique name of this worker, if None, created based on hostname:process_id
|
||||
# Override with os environment: TRAINS_WORKER_ID
|
||||
# worker_id: "trains-agent-machine1:gpu0"
|
||||
# Override with os environment: CLEARML_WORKER_ID
|
||||
# worker_id: "clearml-agent-machine1:gpu0"
|
||||
worker_id: ""
|
||||
|
||||
# worker name, replaces the hostname when creating a unique name for this worker
|
||||
# Override with os environment: TRAINS_WORKER_NAME
|
||||
# worker_name: "trains-agent-machine1"
|
||||
# Override with os environment: CLEARML_WORKER_NAME
|
||||
# worker_name: "clearml-agent-machine1"
|
||||
worker_name: ""
|
||||
|
||||
# Set GIT user/pass credentials (if user/pass are set, GIT protocol will be set to https)
|
||||
@@ -22,7 +22,7 @@
|
||||
|
||||
# Set the python version to use when creating the virtual environment and launching the experiment
|
||||
# Example values: "/usr/bin/python3" or "/usr/local/bin/python3.6"
|
||||
# The default is the python executing the trains_agent
|
||||
# The default is the python executing the clearml_agent
|
||||
python_binary: ""
|
||||
|
||||
# select python package manager:
|
||||
@@ -42,7 +42,7 @@
|
||||
force_upgrade: false,
|
||||
|
||||
# additional artifact repositories to use when installing python packages
|
||||
# extra_index_url: ["https://allegroai.jfrog.io/trainsai/api/pypi/public/simple"]
|
||||
# extra_index_url: ["https://allegroai.jfrog.io/clearmlai/api/pypi/public/simple"]
|
||||
|
||||
# additional conda channels to use when installing with conda package manager
|
||||
conda_channels: ["defaults", "conda-forge", "pytorch", ]
|
||||
@@ -73,12 +73,12 @@
|
||||
},
|
||||
|
||||
# target folder for virtual environments builds, created when executing experiment
|
||||
venvs_dir = ~/.trains/venvs-builds
|
||||
venvs_dir = ~/.clearml/venvs-builds
|
||||
|
||||
# cached git clone folder
|
||||
vcs_cache: {
|
||||
enabled: true,
|
||||
path: ~/.trains/vcs-cache
|
||||
path: ~/.clearml/vcs-cache
|
||||
},
|
||||
|
||||
# use venv-update in order to accelerate python virtual environment building
|
||||
@@ -90,7 +90,7 @@
|
||||
# cached folder for specific python package download (used for pytorch package caching)
|
||||
pip_download_cache {
|
||||
enabled: true,
|
||||
path: ~/.trains/pip-download-cache
|
||||
path: ~/.clearml/pip-download-cache
|
||||
},
|
||||
|
||||
translate_ssh: true,
|
||||
@@ -98,9 +98,9 @@
|
||||
reload_config: false,
|
||||
|
||||
# pip cache folder mapped into docker, used for python package caching
|
||||
docker_pip_cache = ~/.trains/pip-cache
|
||||
docker_pip_cache = ~/.clearml/pip-cache
|
||||
# apt cache folder mapped into docker, used for ubuntu package caching
|
||||
docker_apt_cache = ~/.trains/apt-cache
|
||||
docker_apt_cache = ~/.clearml/apt-cache
|
||||
|
||||
# optional arguments to pass to docker image
|
||||
# these are local for this agent and will not be updated in the experiment's docker_cmd section
|
||||
|
||||
@@ -1,10 +1,10 @@
|
||||
{
|
||||
# TRAINS - default SDK configuration
|
||||
# ClearML - default SDK configuration
|
||||
|
||||
storage {
|
||||
cache {
|
||||
# Defaults to system temp folder / cache
|
||||
default_base_dir: "~/.trains/cache"
|
||||
default_base_dir: "~/.clearml/cache"
|
||||
size {
|
||||
# max_used_bytes = -1
|
||||
min_free_bytes = 10GB
|
||||
@@ -98,7 +98,7 @@
|
||||
google.storage {
|
||||
# # Default project and credentials file
|
||||
# # Will be used when no bucket configuration is found
|
||||
# project: "trains"
|
||||
# project: "clearml"
|
||||
# credentials_json: "/path/to/credentials.json"
|
||||
|
||||
# # Specific credentials per bucket and sub directory
|
||||
@@ -106,7 +106,7 @@
|
||||
# {
|
||||
# bucket: "my-bucket"
|
||||
# subdir: "path/in/bucket" # Not required
|
||||
# project: "trains"
|
||||
# project: "clearml"
|
||||
# credentials_json: "/path/to/credentials.json"
|
||||
# },
|
||||
# ]
|
||||
@@ -114,7 +114,7 @@
|
||||
azure.storage {
|
||||
# containers: [
|
||||
# {
|
||||
# account_name: "trains"
|
||||
# account_name: "clearml"
|
||||
# account_key: "secret"
|
||||
# # container_name:
|
||||
# }
|
||||
@@ -155,8 +155,8 @@
|
||||
# do not analyze the entire repository.
|
||||
force_analyze_entire_repo: false
|
||||
|
||||
# If set to true, *trains* update message will not be printed to the console
|
||||
# this value can be overwritten with os environment variable TRAINS_SUPPRESS_UPDATE_MESSAGE=1
|
||||
# If set to true, *clearml* update message will not be printed to the console
|
||||
# this value can be overwritten with os environment variable CLEARML_SUPPRESS_UPDATE_MESSAGE=1
|
||||
suppress_update_message: false
|
||||
|
||||
# If this flag is true (default is false), instead of analyzing the code with Pigar, analyze with `pip freeze`
|
||||
|
||||
@@ -106,15 +106,15 @@ class StrictSession(Session):
init()
return

original = os.environ.get(LOCAL_CONFIG_FILE_OVERRIDE_VAR, None)
original = LOCAL_CONFIG_FILE_OVERRIDE_VAR.get() or None
try:
os.environ[LOCAL_CONFIG_FILE_OVERRIDE_VAR] = str(config_file)
LOCAL_CONFIG_FILE_OVERRIDE_VAR.set(str(config_file))
init()
finally:
if original is None:
os.environ.pop(LOCAL_CONFIG_FILE_OVERRIDE_VAR, None)
LOCAL_CONFIG_FILE_OVERRIDE_VAR.pop()
else:
os.environ[LOCAL_CONFIG_FILE_OVERRIDE_VAR] = original
LOCAL_CONFIG_FILE_OVERRIDE_VAR.set(original)
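The hunk above swaps direct `os.environ` access for the `EnvEntry` wrapper while keeping the same save / override / restore pattern. A minimal standalone sketch of that pattern follows. The helper name `temporary_config_file` and the `env` parameter are hypothetical and not part of clearml-agent, and unlike the diff (which keeps a single `original` value) this sketch saves and restores each key separately.

```python
import os
from contextlib import contextmanager

# Key names taken from the diff; the helper itself is illustrative only.
NEW_KEY = "CLEARML_CONFIG_FILE"
LEGACY_KEY = "TRAINS_CONFIG_FILE"


@contextmanager
def temporary_config_file(path, keys=(NEW_KEY, LEGACY_KEY), env=None):
    """Point every key at `path` inside the block, then restore prior values."""
    env = os.environ if env is None else env
    # Remember what each key held before (None means "was not set").
    saved = {k: env.get(k) for k in keys}
    try:
        for k in keys:
            env[k] = str(path)
        yield
    finally:
        for k, old in saved.items():
            if old is None:
                env.pop(k, None)  # key did not exist before: remove it again
            else:
                env[k] = old      # key existed: put the old value back
```

Usage would look like `with temporary_config_file("/tmp/agent.conf"): init()`, where `init` stands in for whatever needs to read the overridden configuration.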
|
||||
|
||||
def send(self, request, *args, **kwargs):
|
||||
result = super(StrictSession, self).send(request, *args, **kwargs)
|
||||
@@ -222,7 +222,7 @@ class TableResponse(Response):
|
||||
return "" if result is None else result
|
||||
|
||||
fields = fields or self.fields
|
||||
from trains_agent.helper.base import create_table
|
||||
from clearml_agent.helper.base import create_table
|
||||
return create_table(
|
||||
(dict((attr, getter(item, attr)) for attr in fields) for item in self),
|
||||
titles=fields, columns=fields, headers=True,
|
||||
|
||||
@@ -1,11 +1,11 @@
from ...backend_config import EnvEntry
from ...backend_config.environment import EnvEntry


ENV_HOST = EnvEntry("TRAINS_API_HOST", "TRAINS_API_HOST")
ENV_WEB_HOST = EnvEntry("TRAINS_WEB_HOST", "TRAINS_WEB_HOST")
ENV_FILES_HOST = EnvEntry("TRAINS_FILES_HOST", "TRAINS_FILES_HOST")
ENV_ACCESS_KEY = EnvEntry("TRAINS_API_ACCESS_KEY", "TRAINS_API_ACCESS_KEY")
ENV_SECRET_KEY = EnvEntry("TRAINS_API_SECRET_KEY", "TRAINS_API_SECRET_KEY")
ENV_VERBOSE = EnvEntry("TRAINS_API_VERBOSE", "TRAINS_API_VERBOSE", type=bool, default=False)
ENV_HOST_VERIFY_CERT = EnvEntry("TRAINS_API_HOST_VERIFY_CERT", "TRAINS_API_HOST_VERIFY_CERT", type=bool, default=True)
ENV_CONDA_ENV_PACKAGE = EnvEntry("TRAINS_CONDA_ENV_PACKAGE", "TRAINS_CONDA_ENV_PACKAGE")
ENV_HOST = EnvEntry("CLEARML_API_HOST", "TRAINS_API_HOST")
ENV_WEB_HOST = EnvEntry("CLEARML_WEB_HOST", "TRAINS_WEB_HOST")
ENV_FILES_HOST = EnvEntry("CLEARML_FILES_HOST", "TRAINS_FILES_HOST")
ENV_ACCESS_KEY = EnvEntry("CLEARML_API_ACCESS_KEY", "TRAINS_API_ACCESS_KEY")
ENV_SECRET_KEY = EnvEntry("CLEARML_API_SECRET_KEY", "TRAINS_API_SECRET_KEY")
ENV_VERBOSE = EnvEntry("CLEARML_API_VERBOSE", "TRAINS_API_VERBOSE", type=bool, default=False)
ENV_HOST_VERIFY_CERT = EnvEntry("CLEARML_API_HOST_VERIFY_CERT", "TRAINS_API_HOST_VERIFY_CERT", type=bool, default=True)
ENV_CONDA_ENV_PACKAGE = EnvEntry("CLEARML_CONDA_ENV_PACKAGE", "TRAINS_CONDA_ENV_PACKAGE")
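Each entry now lists the new `CLEARML_*` name first and the legacy `TRAINS_*` name second. The lookup logic of `EnvEntry` is not shown in this hunk, so the sketch below only illustrates the likely intent, assuming keys are tried in the order given and empty values count as unset (the `_get` method shown later in this diff strips the value and treats an empty string as not set). The function `first_set` is a hypothetical helper, not the library API.

```python
import os


def first_set(keys, default=None, env=None):
    """Return the first non-empty value among `keys`, checked in order."""
    env = os.environ if env is None else env
    for key in keys:
        # Whitespace-only values are treated as "not set".
        value = env.get(key, "").strip()
        if value:
            return value
    return default
```

With this ordering, a machine that still exports only `TRAINS_API_HOST` keeps working, while `CLEARML_API_HOST` takes precedence once it is defined.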
|
||||
|
||||
@@ -29,12 +29,12 @@ class MaxRequestSizeError(Exception):
|
||||
|
||||
|
||||
class Session(TokenManager):
|
||||
""" TRAINS API Session class. """
|
||||
""" ClearML API Session class. """
|
||||
|
||||
_AUTHORIZATION_HEADER = "Authorization"
|
||||
_WORKER_HEADER = "X-Trains-Worker"
|
||||
_ASYNC_HEADER = "X-Trains-Async"
|
||||
_CLIENT_HEADER = "X-Trains-Agent"
|
||||
_WORKER_HEADER = ("X-ClearML-Worker", "X-Trains-Worker", )
|
||||
_ASYNC_HEADER = ("X-ClearML-Async", "X-Trains-Async", )
|
||||
_CLIENT_HEADER = ("X-ClearML-Agent", "X-Trains-Agent", )
|
||||
|
||||
_async_status_code = 202
|
||||
_session_requests = 0
|
||||
@@ -45,9 +45,9 @@ class Session(TokenManager):
|
||||
_write_session_timeout = (30.0, 30.)
|
||||
|
||||
api_version = '2.1'
|
||||
default_host = "https://demoapi.trains.allegro.ai"
|
||||
default_web = "https://demoapp.trains.allegro.ai"
|
||||
default_files = "https://demofiles.trains.allegro.ai"
|
||||
default_host = "https://demoapi.demo.clear.ml"
|
||||
default_web = "https://demoapp.demo.clear.ml"
|
||||
default_files = "https://demofiles.demo.clear.ml"
|
||||
default_key = "EGRTCO8JMSIGI6S39GTP43NFWXDQOW"
|
||||
default_secret = "x!XTov_G-#vspE*Y(h$Anm&DIc5Ou-F)jsl$PdOyj5wG1&E!Z8"
|
||||
|
||||
@@ -192,8 +192,10 @@ class Session(TokenManager):
"""
host = self.host
headers = headers.copy() if headers else {}
headers[self._WORKER_HEADER] = self.worker
headers[self._CLIENT_HEADER] = self.client
for h in self._WORKER_HEADER:
headers[h] = self.worker
for h in self._CLIENT_HEADER:
headers[h] = self.client
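Because the header constants became tuples, every request now carries both the new `X-ClearML-*` header and its legacy `X-Trains-*` twin. A small sketch of that idea, outside the `Session` class, is given here. The header names come from the diff, while `build_headers` is a hypothetical function used only for illustration.

```python
# Header names as they appear in the diff: new name first, legacy name second.
WORKER_HEADER = ("X-ClearML-Worker", "X-Trains-Worker")
CLIENT_HEADER = ("X-ClearML-Agent", "X-Trains-Agent")


def build_headers(worker, client, headers=None):
    """Copy `headers` and add the worker/client value under every alias."""
    out = dict(headers) if headers else {}
    for name in WORKER_HEADER:
        out[name] = worker
    for name in CLIENT_HEADER:
        out[name] = client
    return out
```

Sending both names presumably lets the same agent talk to servers that only recognize the older header names; the diff itself does not state the reason.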
|
||||
|
||||
token_refreshed_on_error = False
|
||||
url = (
|
||||
@@ -268,7 +270,8 @@ class Session(TokenManager):
|
||||
headers.copy() if headers else {}
|
||||
)
|
||||
if async_enable:
|
||||
headers[self._ASYNC_HEADER] = "1"
|
||||
for h in self._ASYNC_HEADER:
|
||||
headers[h] = "1"
|
||||
return self._send_request(
|
||||
service=service,
|
||||
action=action,
|
||||
@@ -464,7 +467,7 @@ class Session(TokenManager):
if parsed.port == 8008:
return host.replace(':8008', ':8080', 1)

raise ValueError('Could not detect TRAINS web application server')
raise ValueError('Could not detect ClearML web application server')
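The visible context lines show one fallback rule for deriving the web application address: an API host on port 8008 maps to the same host on port 8080. The sketch below reproduces only that rule. Any other detection steps the real method performs are outside this hunk and are deliberately not guessed at, the function name `guess_web_host` is hypothetical, and the standard-library `urllib.parse` is used here in place of the `six.moves` import seen elsewhere in the diff.

```python
from urllib.parse import urlparse


def guess_web_host(api_host):
    """Map an API server URL on port 8008 to the web app URL on port 8080."""
    parsed = urlparse(api_host)
    if parsed.port == 8008:
        # Replace only the first occurrence, mirroring the diff's replace(..., 1).
        return api_host.replace(":8008", ":8080", 1)
    raise ValueError("Could not detect ClearML web application server")
```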
|
||||
|
||||
@classmethod
|
||||
def get_files_server_host(cls, config=None):
|
||||
@@ -548,13 +551,13 @@ class Session(TokenManager):
|
||||
# check if this is a misconfigured api server (getting 200 without the data section)
|
||||
if res and res.status_code == 200:
|
||||
raise ValueError('It seems *api_server* is misconfigured. '
|
||||
'Is this the TRAINS API server {} ?'.format(self.get_api_server_host()))
|
||||
'Is this the ClearML API server {} ?'.format(self.get_api_server_host()))
|
||||
else:
|
||||
raise LoginError("Response data mismatch: No 'token' in 'data' value from res, receive : {}, "
|
||||
"exception: {}".format(res, ex))
|
||||
except requests.ConnectionError as ex:
|
||||
raise ValueError('Connection Error: it seems *api_server* is misconfigured. '
|
||||
'Is this the TRAINS API server {} ?'.format('/'.join(ex.request.url.split('/')[:3])))
|
||||
'Is this the ClearML API server {} ?'.format('/'.join(ex.request.url.split('/')[:3])))
|
||||
except Exception as ex:
|
||||
raise LoginError('Unrecognized Authentication Error: {} {}'.format(type(ex), ex))
|
||||
|
||||
|
||||
@@ -107,7 +107,7 @@ def get_http_session_with_retry(
|
||||
if not session.verify and __disable_certificate_verification_warning < 2:
|
||||
# show warning
|
||||
__disable_certificate_verification_warning += 1
|
||||
logging.getLogger('TRAINS').warning(
|
||||
logging.getLogger('ClearML').warning(
|
||||
msg='InsecureRequestWarning: Certificate verification is disabled! Adding '
|
||||
'certificate verification is strongly advised. See: '
|
||||
'https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings')
|
||||
|
||||
@@ -1,4 +1,3 @@
|
||||
from .defs import Environment
|
||||
from .config import Config, ConfigEntry
|
||||
from .errors import ConfigurationError
|
||||
from .environment import EnvEntry
|
||||
|
||||
@@ -138,7 +138,7 @@ class Config(object):
|
||||
else:
|
||||
env_config_paths = []
|
||||
|
||||
env_config_path_override = os.environ.get(ENV_CONFIG_PATH_OVERRIDE_VAR)
|
||||
env_config_path_override = ENV_CONFIG_PATH_OVERRIDE_VAR.get()
|
||||
if env_config_path_override:
|
||||
env_config_paths = [expanduser(env_config_path_override)]
|
||||
|
||||
@@ -165,7 +165,7 @@ class Config(object):
|
||||
)
|
||||
|
||||
local_config_files = LOCAL_CONFIG_FILES
|
||||
local_config_override = os.environ.get(LOCAL_CONFIG_FILE_OVERRIDE_VAR)
|
||||
local_config_override = LOCAL_CONFIG_FILE_OVERRIDE_VAR.get()
|
||||
if local_config_override:
|
||||
local_config_files = [expanduser(local_config_override)]
|
||||
|
||||
|
||||
@@ -1,6 +1,8 @@
|
||||
from os.path import expanduser
|
||||
from pathlib2 import Path
|
||||
|
||||
from ..backend_config.environment import EnvEntry
|
||||
|
||||
ENV_VAR = 'TRAINS_ENV'
|
||||
""" Name of system environment variable that can be used to specify the config environment name """
|
||||
|
||||
@@ -17,23 +19,24 @@ ENV_CONFIG_PATHS = [
|
||||
|
||||
|
||||
LOCAL_CONFIG_PATHS = [
|
||||
# '/etc/opt/trains', # used by servers for docker-generated configuration
|
||||
# expanduser('~/.trains/config'),
|
||||
# '/etc/opt/clearml', # used by servers for docker-generated configuration
|
||||
# expanduser('~/.clearml/config'),
|
||||
]
|
||||
""" Local config paths, not related to environment """
|
||||
|
||||
|
||||
LOCAL_CONFIG_FILES = [
|
||||
expanduser('~/trains.conf'), # used for workstation configuration (end-users, workers)
|
||||
expanduser('~/clearml.conf'), # used for workstation configuration (end-users, workers)
|
||||
]
|
||||
""" Local config files (not paths) """
|
||||
|
||||
|
||||
LOCAL_CONFIG_FILE_OVERRIDE_VAR = 'TRAINS_CONFIG_FILE'
|
||||
LOCAL_CONFIG_FILE_OVERRIDE_VAR = EnvEntry('CLEARML_CONFIG_FILE', 'TRAINS_CONFIG_FILE', )
|
||||
""" Local config file override environment variable. If this is set, no other local config files will be used. """
|
||||
|
||||
|
||||
ENV_CONFIG_PATH_OVERRIDE_VAR = 'TRAINS_CONFIG_PATH'
|
||||
ENV_CONFIG_PATH_OVERRIDE_VAR = EnvEntry('CLEARML_CONFIG_PATH', 'TRAINS_CONFIG_PATH', )
|
||||
"""
|
||||
Environment-related config path override environment variable. If this is set, no other env config path will be used.
|
||||
"""
|
||||
|
||||
@@ -85,8 +85,9 @@ class Entry(object):
|
||||
|
||||
def set(self, value):
|
||||
# type: (Any, Any) -> (Text, Any)
|
||||
key, _ = self.get_pair(default=None, converter=None)
|
||||
self._set(key, str(value))
|
||||
# key, _ = self.get_pair(default=None, converter=None)
|
||||
for k in self.keys:
|
||||
self._set(k, str(value))
|
||||
|
||||
def _set(self, key, value):
|
||||
# type: (Text, Text) -> None
|
||||
|
||||
@@ -11,6 +11,10 @@ class EnvEntry(Entry):
conversions[bool] = text_to_bool
return conversions

def pop(self):
for k in self.keys:
environ.pop(k, None)
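Together with the `Entry.set` change in the previous hunk, `pop` means writes and removals now touch every alias of an entry rather than only the first key. A compact sketch of that behaviour is shown below. The class `MultiKeyEnv` is hypothetical, and the injectable `env` mapping is an assumption added so the example can run without modifying the real process environment.

```python
import os


class MultiKeyEnv(object):
    """Environment entry with several aliases that are written and removed together."""

    def __init__(self, *keys, env=None):
        self.keys = keys
        self._env = os.environ if env is None else env

    def set(self, value):
        # Write the value under every alias so old and new readers both see it.
        for k in self.keys:
            self._env[k] = str(value)

    def pop(self):
        # Remove every alias; missing keys are ignored.
        for k in self.keys:
            self._env.pop(k, None)
```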
|
||||
|
||||
def _get(self, key):
|
||||
value = getenv(key, "").strip()
|
||||
return value or NotSet
|
||||
@@ -27,27 +31,34 @@ class EnvEntry(Entry):
|
||||
|
||||
def backward_compatibility_support():
|
||||
from ..definitions import ENVIRONMENT_CONFIG, ENVIRONMENT_SDK_PARAMS, ENVIRONMENT_BACKWARD_COMPATIBLE
|
||||
if not ENVIRONMENT_BACKWARD_COMPATIBLE.get():
|
||||
return
|
||||
if ENVIRONMENT_BACKWARD_COMPATIBLE.get():
|
||||
# Add TRAINS_ prefix on every CLEARML_ os environment we support
|
||||
for k, v in ENVIRONMENT_CONFIG.items():
|
||||
try:
|
||||
trains_vars = [var for var in v.vars if var.startswith('CLEARML_')]
|
||||
if not trains_vars:
|
||||
continue
|
||||
alg_var = trains_vars[0].replace('CLEARML_', 'TRAINS_', 1)
|
||||
if alg_var not in v.vars:
|
||||
v.vars = tuple(list(v.vars) + [alg_var])
|
||||
except:
|
||||
continue
|
||||
for k, v in ENVIRONMENT_SDK_PARAMS.items():
|
||||
try:
|
||||
trains_vars = [var for var in v if var.startswith('CLEARML_')]
|
||||
if not trains_vars:
|
||||
continue
|
||||
alg_var = trains_vars[0].replace('CLEARML_', 'TRAINS_', 1)
|
||||
if alg_var not in v:
|
||||
ENVIRONMENT_SDK_PARAMS[k] = tuple(list(v) + [alg_var])
|
||||
except:
|
||||
continue
|
||||
|
||||
# Add ALG_ prefix on every TRAINS_ os environment we support
|
||||
for k, v in ENVIRONMENT_CONFIG.items():
|
||||
try:
|
||||
trains_vars = [var for var in v.vars if var.startswith('TRAINS_')]
|
||||
if not trains_vars:
|
||||
continue
|
||||
alg_var = trains_vars[0].replace('TRAINS_', 'ALG_', 1)
|
||||
if alg_var not in v.vars:
|
||||
v.vars = tuple(list(v.vars) + [alg_var])
|
||||
except:
|
||||
continue
|
||||
for k, v in ENVIRONMENT_SDK_PARAMS.items():
|
||||
try:
|
||||
trains_vars = [var for var in v if var.startswith('TRAINS_')]
|
||||
if not trains_vars:
|
||||
continue
|
||||
alg_var = trains_vars[0].replace('TRAINS_', 'ALG_', 1)
|
||||
if alg_var not in v:
|
||||
ENVIRONMENT_SDK_PARAMS[k] = tuple(list(v) + [alg_var])
|
||||
except:
|
||||
# set OS environ:
keys = environ.keys()
for k in keys:
if not k.startswith('CLEARML_'):
continue
backwards_k = k.replace('CLEARML_', 'TRAINS_', 1)
if backwards_k not in keys:
environ[backwards_k] = environ[k]
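The last part of `backward_compatibility_support` copies every `CLEARML_*` variable to a `TRAINS_*` name unless that legacy name is already defined. The same logic can be sketched against a plain dictionary as follows. `mirror_prefix` is a hypothetical name, and the sketch iterates over a snapshot of the keys as a precaution, since it adds entries to the mapping while looping.

```python
def mirror_prefix(env, new="CLEARML_", old="TRAINS_"):
    """Copy each `new`-prefixed variable to its `old`-prefixed name if unset."""
    for key in list(env):  # snapshot: the loop body adds keys to `env`
        if not key.startswith(new):
            continue
        legacy = key.replace(new, old, 1)
        if legacy not in env:
            env[legacy] = env[key]
    return env
```

An existing legacy value is never overwritten, so a user who sets both names explicitly keeps both values.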
|
||||
|
||||
@@ -4,11 +4,11 @@ from pathlib2 import Path


def logger(path=None):
name = "trains"
name = "clearml"
if path:
p = Path(path)
module = (p.parent if p.stem.startswith('_') else p).stem
name = "trains.%s" % module
name = "clearml.%s" % module
return logging.getLogger(name)
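The helper above derives a logger name from a file path: the module's stem is used, except for files whose stem starts with an underscore (such as `__init__.py`), where the parent directory name is used instead. A runnable sketch is given here, assuming the standard-library `pathlib` behaves like the `pathlib2` backport for this purpose. The split into `logger_name` and `logger` is an illustrative choice, not the original layout.

```python
import logging
from pathlib import Path


def logger_name(path=None, root="clearml"):
    """Build a dotted logger name such as 'clearml.session' from a file path."""
    if not path:
        return root
    p = Path(path)
    # For '_'-prefixed files (e.g. __init__.py) name the logger after the package.
    module = (p.parent if p.stem.startswith("_") else p).stem
    return "%s.%s" % (root, module)


def logger(path=None):
    """Return the logging.Logger for the given source file path."""
    return logging.getLogger(logger_name(path))
```

A typical call would be `log = logger(__file__)` at the top of a module.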
|
||||
|
||||
|
||||
|
||||
@@ -9,16 +9,16 @@ from operator import attrgetter
|
||||
from traceback import print_exc
|
||||
from typing import Text
|
||||
|
||||
from trains_agent.helper.console import ListFormatter, print_text
|
||||
from trains_agent.helper.dicts import filter_keys
|
||||
from clearml_agent.helper.console import ListFormatter, print_text
|
||||
from clearml_agent.helper.dicts import filter_keys
|
||||
|
||||
import six
|
||||
from trains_agent.backend_api import services
|
||||
from clearml_agent.backend_api import services
|
||||
|
||||
from trains_agent.errors import APIError, CommandFailedError
|
||||
from trains_agent.helper.base import Singleton, return_list, print_parameters, dump_yaml, load_yaml, error, warning
|
||||
from trains_agent.interface.base import ObjectID
|
||||
from trains_agent.session import Session
|
||||
from clearml_agent.errors import APIError, CommandFailedError
|
||||
from clearml_agent.helper.base import Singleton, return_list, print_parameters, dump_yaml, load_yaml, error, warning
|
||||
from clearml_agent.interface.base import ObjectID
|
||||
from clearml_agent.session import Session
|
||||
|
||||
|
||||
class NameResolutionError(CommandFailedError):
|
||||
@@ -74,7 +74,7 @@ class BaseCommandSection(object):
|
||||
|
||||
@staticmethod
|
||||
def log(message, *args):
|
||||
print("trains-agent: {}".format(message % args))
|
||||
print("clearml-agent: {}".format(message % args))
|
||||
|
||||
@classmethod
|
||||
def exit(cls, message, code=1): # type: (Text, int) -> ()
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
from trains_agent.commands.base import ServiceCommandSection
|
||||
from clearml_agent.commands.base import ServiceCommandSection
|
||||
|
||||
|
||||
class Config(ServiceCommandSection):
|
||||
|
||||
@@ -5,13 +5,15 @@ from pyhocon import ConfigFactory, ConfigMissingException
|
||||
from pathlib2 import Path
|
||||
from six.moves.urllib.parse import urlparse
|
||||
|
||||
from trains_agent.backend_api.session import Session
|
||||
from trains_agent.backend_api.session.defs import ENV_HOST
|
||||
from trains_agent.backend_config.defs import LOCAL_CONFIG_FILES
|
||||
from clearml_agent.backend_api.session import Session
|
||||
from clearml_agent.backend_api.session.defs import ENV_HOST
|
||||
from clearml_agent.backend_config.defs import LOCAL_CONFIG_FILES
|
||||
|
||||
|
||||
description = """
|
||||
Please create new trains credentials through the profile page in your trains web app (e.g. https://demoapp.trains.allegro.ai/profile)
|
||||
Please create new clearml credentials through the profile page in your clearml web app (e.g. https://demoapp.demo.clear.ml/profile)
|
||||
Or with the free hosted service at https://app.community.clear.ml/profile
|
||||
|
||||
In the profile page, press "Create new credentials", then press "Copy to clipboard".
|
||||
|
||||
Paste copied configuration here:
|
||||
@@ -25,7 +27,7 @@ except Exception:
|
||||
|
||||
host_description = """
|
||||
Editing configuration file: {CONFIG_FILE}
|
||||
Enter the url of the trains-server's Web service, for example: {HOST}
|
||||
Enter the url of the clearml-server's Web service, for example: {HOST}
|
||||
""".format(
|
||||
CONFIG_FILE=LOCAL_CONFIG_FILES[0],
|
||||
HOST=def_host,
|
||||
@@ -33,8 +35,12 @@ Enter the url of the trains-server's Web service, for example: {HOST}


def main():
print('TRAINS-AGENT setup process')
conf_file = Path(LOCAL_CONFIG_FILES[0]).absolute()
print('CLEARML-AGENT setup process')
for f in LOCAL_CONFIG_FILES:
conf_file = Path(f).absolute()
if conf_file.exists():
break
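The setup wizard now scans every entry of `LOCAL_CONFIG_FILES` and stops at the first file that exists, instead of always using the first entry. One consequence, visible from the loop, is that when no candidate exists the variable ends up holding the last candidate. A sketch of that selection, with the hypothetical helper name `pick_config_file` and standard-library `pathlib` standing in for `pathlib2`:

```python
from pathlib import Path


def pick_config_file(candidates):
    """Return the first existing candidate, or the last candidate if none exist."""
    conf_file = None
    for f in candidates:
        conf_file = Path(f).absolute()
        if conf_file.exists():
            break
    return conf_file
```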
|
||||
|
||||
if conf_file.exists() and conf_file.is_file() and conf_file.stat().st_size > 0:
|
||||
print('Configuration file already exists: {}'.format(str(conf_file)))
|
||||
print('Leaving setup, feel free to edit the configuration file.')
|
||||
@@ -42,7 +48,12 @@ def main():

print(description, end='')
sentinel = ''
parse_input = '\n'.join(iter(input, sentinel))
parse_input = ''
for line in iter(input, sentinel):
parse_input += line+'\n'
if line.rstrip() == '}':
break
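The new input loop stops either on an empty line (the `iter(input, sentinel)` form) or as soon as a line consisting of a closing brace is read, so a pasted credentials block ends the prompt without an extra Enter. The sketch below isolates that loop. The `read_line` parameter is an assumption added so the function can be exercised without a terminal; passing the built-in `input` reproduces the interactive behaviour.

```python
def read_block(read_line, sentinel=""):
    """Collect lines until `sentinel` is returned or a lone '}' line is read."""
    text = ""
    # iter(callable, sentinel) keeps calling read_line() until it returns sentinel.
    for line in iter(read_line, sentinel):
        text += line + "\n"
        if line.rstrip() == "}":
            break
    return text
```

Interactive use would be `parse_input = read_block(input)`.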
|
||||
|
||||
credentials = None
|
||||
api_server = None
|
||||
web_server = None
|
||||
@@ -86,7 +97,7 @@ def main():
|
||||
|
||||
files_host = input_url('File Store Host', files_host)
|
||||
|
||||
print('\nTRAINS Hosts configuration:\nWeb App: {}\nAPI: {}\nFile Store: {}\n'.format(
|
||||
print('\nClearML Hosts configuration:\nWeb App: {}\nAPI: {}\nFile Store: {}\n'.format(
|
||||
web_host, api_host, files_host))
|
||||
|
||||
retry = 1
|
||||
@@ -140,14 +151,14 @@ def main():
|
||||
# noinspection PyBroadException
|
||||
try:
|
||||
with open(str(conf_file), 'wt') as f:
|
||||
header = '# TRAINS-AGENT configuration file\n' \
|
||||
header = '# CLEARML-AGENT configuration file\n' \
|
||||
'api {\n' \
|
||||
' # Notice: \'host\' is the api server (default port 8008), not the web server.\n' \
|
||||
' api_server: %s\n' \
|
||||
' web_server: %s\n' \
|
||||
' files_server: %s\n' \
|
||||
' # Credentials are generated using the webapp, %s/profile\n' \
|
||||
' # Override with os environment: TRAINS_API_ACCESS_KEY / TRAINS_API_SECRET_KEY\n' \
|
||||
' # Override with os environment: CLEARML_API_ACCESS_KEY / CLEARML_API_SECRET_KEY\n' \
|
||||
' credentials {"access_key": "%s", "secret_key": "%s"}\n' \
|
||||
'}\n\n' % (api_host, web_host, files_host,
|
||||
web_host, credentials['access_key'], credentials['secret_key'])
|
||||
@@ -158,7 +169,7 @@ def main():
|
||||
'agent.git_pass=\"{}\"\n' \
|
||||
'\n'.format(git_user or '', git_pass or '')
|
||||
f.write(git_credentials)
|
||||
extra_index_str = '# extra_index_url: ["https://allegroai.jfrog.io/trainsai/api/pypi/public/simple"]\n' \
|
||||
extra_index_str = '# extra_index_url: ["https://allegroai.jfrog.io/clearml/api/pypi/public/simple"]\n' \
|
||||
'agent.package_manager.extra_index_url= ' \
|
||||
'[\n{}\n]\n\n'.format("\n".join(map("\"{}\"".format, extra_index_urls)))
|
||||
f.write(extra_index_str)
|
||||
@@ -168,7 +179,7 @@ def main():
|
||||
return
|
||||
|
||||
print('\nNew configuration stored in {}'.format(str(conf_file)))
|
||||
print('TRAINS-AGENT setup completed successfully.')
|
||||
print('CLEARML-AGENT setup completed successfully.')
|
||||
|
||||
|
||||
def parse_host(parsed_host, allow_input=True):
|
||||
@@ -309,7 +320,7 @@ def verify_url(parse_input):
|
||||
parsed_host = None
|
||||
except Exception:
|
||||
parsed_host = None
|
||||
print('Could not parse url {}\nEnter your trains-server host: '.format(parse_input), end='')
|
||||
print('Could not parse url {}\nEnter your clearml-server host: '.format(parse_input), end='')
|
||||
return parsed_host
|
||||
|
||||
|
||||
|
||||
@@ -5,8 +5,8 @@ import time
|
||||
|
||||
from future.builtins import super
|
||||
|
||||
from trains_agent.commands.base import ServiceCommandSection
|
||||
from trains_agent.helper.base import return_list
|
||||
from clearml_agent.commands.base import ServiceCommandSection
|
||||
from clearml_agent.helper.base import return_list
|
||||
|
||||
|
||||
class Events(ServiceCommandSection):
|
||||
|
||||
@@ -24,27 +24,28 @@ from typing import Text, Optional, Any, Tuple
|
||||
import attr
|
||||
import psutil
|
||||
import six
|
||||
from trains_agent.backend_api.services import queues as queues_api
|
||||
from trains_agent.backend_api.services import tasks as tasks_api
|
||||
from clearml_agent.backend_api.services import queues as queues_api
|
||||
from clearml_agent.backend_api.services import tasks as tasks_api
|
||||
from pathlib2 import Path
|
||||
from pyhocon import ConfigTree, ConfigFactory
|
||||
from six.moves.urllib.parse import quote
|
||||
|
||||
from trains_agent.backend_config.defs import UptimeConf
|
||||
from trains_agent.helper.check_update import start_check_update_daemon
|
||||
from trains_agent.commands.base import resolve_names, ServiceCommandSection
|
||||
from trains_agent.definitions import (
|
||||
from clearml_agent.backend_config.defs import UptimeConf
|
||||
from clearml_agent.helper.check_update import start_check_update_daemon
|
||||
from clearml_agent.commands.base import resolve_names, ServiceCommandSection
|
||||
from clearml_agent.definitions import (
|
||||
ENVIRONMENT_SDK_PARAMS,
|
||||
PROGRAM_NAME,
|
||||
DEFAULT_VENV_UPDATE_URL,
|
||||
ENV_DOCKER_IMAGE,
|
||||
ENV_TASK_EXECUTE_AS_USER,
|
||||
ENV_DOCKER_HOST_MOUNT,
|
||||
ENV_TASK_EXTRA_PYTHON_PATH,
|
||||
ENV_AGENT_GIT_USER,
|
||||
ENV_AGENT_GIT_PASS)
|
||||
from trains_agent.definitions import WORKING_REPOSITORY_DIR, PIP_EXTRA_INDICES
|
||||
from trains_agent.errors import APIError, CommandFailedError, Sigterm
|
||||
from trains_agent.helper.base import (
|
||||
ENV_AGENT_GIT_PASS, ENV_WORKER_ID, ENV_DOCKER_SKIP_GPUS_FLAG, )
|
||||
from clearml_agent.definitions import WORKING_REPOSITORY_DIR, PIP_EXTRA_INDICES
|
||||
from clearml_agent.errors import APIError, CommandFailedError, Sigterm
|
||||
from clearml_agent.helper.base import (
|
||||
return_list,
|
||||
print_parameters,
|
||||
dump_yaml,
|
||||
@@ -66,19 +67,19 @@ from trains_agent.helper.base import (
|
||||
is_linux_platform,
|
||||
rm_file,
|
||||
add_python_path, safe_remove_tree, )
|
||||
from trains_agent.helper.console import ensure_text, print_text, decode_binary_lines
|
||||
from trains_agent.helper.os.daemonize import daemonize_process
|
||||
from trains_agent.helper.package.base import PackageManager
|
||||
from trains_agent.helper.package.conda_api import CondaAPI
|
||||
from trains_agent.helper.package.post_req import PostRequirement
|
||||
from trains_agent.helper.package.external_req import ExternalRequirements
|
||||
from trains_agent.helper.package.pip_api.system import SystemPip
|
||||
from trains_agent.helper.package.pip_api.venv import VirtualenvPip
|
||||
from trains_agent.helper.package.poetry_api import PoetryConfig, PoetryAPI
|
||||
from trains_agent.helper.package.pytorch import PytorchRequirement
|
||||
from trains_agent.helper.package.requirements import RequirementsManager
|
||||
from trains_agent.helper.package.venv_update_api import VenvUpdateAPI
|
||||
from trains_agent.helper.process import (
|
||||
from clearml_agent.helper.console import ensure_text, print_text, decode_binary_lines
|
||||
from clearml_agent.helper.os.daemonize import daemonize_process
|
||||
from clearml_agent.helper.package.base import PackageManager
|
||||
from clearml_agent.helper.package.conda_api import CondaAPI
|
||||
from clearml_agent.helper.package.post_req import PostRequirement
|
||||
from clearml_agent.helper.package.external_req import ExternalRequirements
|
||||
from clearml_agent.helper.package.pip_api.system import SystemPip
|
||||
from clearml_agent.helper.package.pip_api.venv import VirtualenvPip
|
||||
from clearml_agent.helper.package.poetry_api import PoetryConfig, PoetryAPI
|
||||
from clearml_agent.helper.package.pytorch import PytorchRequirement
|
||||
from clearml_agent.helper.package.requirements import RequirementsManager
|
||||
from clearml_agent.helper.package.venv_update_api import VenvUpdateAPI
|
||||
from clearml_agent.helper.process import (
|
||||
kill_all_child_processes,
|
||||
WorkerParams,
|
||||
ExitStatus,
|
||||
@@ -90,17 +91,17 @@ from trains_agent.helper.process import (
|
||||
get_docker_id,
|
||||
commit_docker, terminate_process,
|
||||
)
|
||||
from trains_agent.helper.package.priority_req import PriorityPackageRequirement, PackageCollectorRequirement
|
||||
from trains_agent.helper.repo import clone_repository_cached, RepoInfo, VCS
|
||||
from trains_agent.helper.resource_monitor import ResourceMonitor
|
||||
from trains_agent.helper.runtime_verification import check_runtime, print_uptime_properties
|
||||
from trains_agent.session import Session
|
||||
from trains_agent.helper.singleton import Singleton
|
||||
from clearml_agent.helper.package.priority_req import PriorityPackageRequirement, PackageCollectorRequirement
|
||||
from clearml_agent.helper.repo import clone_repository_cached, RepoInfo, VCS
|
||||
from clearml_agent.helper.resource_monitor import ResourceMonitor
|
||||
from clearml_agent.helper.runtime_verification import check_runtime, print_uptime_properties
|
||||
from clearml_agent.session import Session
|
||||
from clearml_agent.helper.singleton import Singleton
|
||||
|
||||
from .events import Events
|
||||
|
||||
DOCKER_ROOT_CONF_FILE = "/root/trains.conf"
|
||||
DOCKER_DEFAULT_CONF_FILE = "/root/default_trains.conf"
|
||||
DOCKER_ROOT_CONF_FILE = "/root/clearml.conf"
|
||||
DOCKER_DEFAULT_CONF_FILE = "/root/default_clearml.conf"
|
||||
|
||||
|
||||
@attr.s
|
||||
@@ -337,8 +338,8 @@ class Worker(ServiceCommandSection):
|
||||
# last message before passing control to the actual task
|
||||
_task_logging_pass_control_message = "Running task id [{}]:"
|
||||
|
||||
_run_as_user_home = '/trains_agent_home'
|
||||
_docker_fixed_user_cache = '/trains_agent_cache'
|
||||
_run_as_user_home = '/clearml_agent_home'
|
||||
_docker_fixed_user_cache = '/clearml_agent_cache'
|
||||
_temp_cleanup_list = []
|
||||
|
||||
@property
|
||||
@@ -485,9 +486,9 @@ class Worker(ServiceCommandSection):
|
||||
return
|
||||
# setup console log
|
||||
temp_stdout_name = safe_mkstemp(
|
||||
suffix=".txt", prefix=".trains_agent_out.", name_only=True
|
||||
suffix=".txt", prefix=".clearml_agent_out.", name_only=True
|
||||
)
|
||||
# temp_stderr_name = safe_mkstemp(suffix=".txt", prefix=".trains_agent_err.", name_only=True)
|
||||
# temp_stderr_name = safe_mkstemp(suffix=".txt", prefix=".clearml_agent_err.", name_only=True)
|
||||
temp_stderr_name = None
|
||||
print(
|
||||
"Storing stdout and stderr log to '{}', '{}'".format(
|
||||
@@ -565,7 +566,7 @@ class Worker(ServiceCommandSection):
|
||||
status = ExitStatus.failure
|
||||
try:
|
||||
# set WORKER ID
|
||||
os.environ['TRAINS_WORKER_ID'] = self.worker_id
|
||||
ENV_WORKER_ID.set(self.worker_id)
|
||||
|
||||
if self._docker_force_pull and docker_image:
|
||||
full_pull_cmd = ['docker', 'pull', docker_image]
|
||||
@@ -619,7 +620,7 @@ class Worker(ServiceCommandSection):
|
||||
:param queues: IDs of queues to pull tasks from
|
||||
:type queues: list of ``Text``
|
||||
:param worker_params: Worker command line arguments
|
||||
:type worker_params: ``trains_agent.helper.process.WorkerParams``
|
||||
:type worker_params: ``clearml_agent.helper.process.WorkerParams``
|
||||
:param priority_order: If True pull order in priority manner. always from the first
|
||||
If False, pull from each queue once in a round robin manner
|
||||
:type priority_order: bool
|
||||
@@ -870,7 +871,7 @@ class Worker(ServiceCommandSection):
|
||||
|
||||
# create temp config file with current configuration
|
||||
self.temp_config_path = NamedTemporaryFile(
|
||||
suffix=".cfg", prefix=".trains_agent.", mode='w+t').name
|
||||
suffix=".cfg", prefix=".clearml_agent.", mode='w+t').name
|
||||
|
||||
# print docker image
|
||||
if docker is not False and docker is not None:
|
||||
@@ -893,19 +894,19 @@ class Worker(ServiceCommandSection):
|
||||
warning(message+'\n')
|
||||
|
||||
if self._services_mode:
|
||||
print('Trains-Agent running in services mode')
|
||||
print('ClearML-Agent running in services mode')
|
||||
|
||||
self._daemon_foreground = foreground
|
||||
if not foreground:
|
||||
out_file, name = safe_mkstemp(
|
||||
prefix=".trains_agent_daemon_out",
|
||||
prefix=".clearml_agent_daemon_out",
|
||||
suffix=".txt",
|
||||
open_kwargs={
|
||||
"buffering": self._session.config.get("agent.log_files_buffering", 1)
|
||||
},
|
||||
)
|
||||
print(
|
||||
"Running TRAINS-AGENT daemon in background mode, writing stdout/stderr to {}".format(
|
||||
"Running CLEARML-AGENT daemon in background mode, writing stdout/stderr to {}".format(
|
||||
name
|
||||
)
|
||||
)
|
||||
@@ -946,7 +947,7 @@ class Worker(ServiceCommandSection):
|
||||
tb = six.text_type(traceback.format_exc())
|
||||
print("FATAL ERROR:")
|
||||
print(tb)
|
||||
crash_file, name = safe_mkstemp(prefix=".trains_agent-crash", suffix=".log")
|
||||
crash_file, name = safe_mkstemp(prefix=".clearml_agent-crash", suffix=".log")
|
||||
try:
|
||||
with crash_file:
|
||||
crash_file.write(tb)
|
||||
@@ -1272,7 +1273,7 @@ class Worker(ServiceCommandSection):
|
||||
def _build_docker(self, docker, target, task_id, entry_point=None, standalone_mode=True):
|
||||
|
||||
self.temp_config_path = safe_mkstemp(
|
||||
suffix=".cfg", prefix=".trains_agent.", text=True, name_only=True
|
||||
suffix=".cfg", prefix=".clearml_agent.", text=True, name_only=True
|
||||
)
|
||||
if not target:
|
||||
target = "task_id_{}".format(task_id)
|
||||
@@ -1337,7 +1338,7 @@ class Worker(ServiceCommandSection):
|
||||
if entry_point == "clone_task" or entry_point == "reuse_task":
|
||||
change = 'ENTRYPOINT if [ ! -s "{trains_conf}" ] ; then ' \
|
||||
'cp {default_trains_conf} {trains_conf} ; ' \
|
||||
' fi ; trains-agent execute --id {task_id} --standalone-mode {clone}'.format(
|
||||
' fi ; clearml-agent execute --id {task_id} --standalone-mode {clone}'.format(
|
||||
default_trains_conf=DOCKER_DEFAULT_CONF_FILE,
|
||||
trains_conf=DOCKER_ROOT_CONF_FILE,
|
||||
task_id=task_id,
|
||||
@@ -1527,7 +1528,7 @@ class Worker(ServiceCommandSection):
|
||||
"task_id": current_task.id,
|
||||
"log_level": log_level,
|
||||
"log_to_backend": "0",
|
||||
"config_file": self._session.config_file, # The config file is the tmp file that trains_agent created
|
||||
"config_file": self._session.config_file, # The config file is the tmp file that clearml_agent created
|
||||
}
|
||||
os.environ.update(
|
||||
{
|
||||
@@ -1542,18 +1543,18 @@ class Worker(ServiceCommandSection):
|
||||
|
||||
# Add the script CWD to the python path
|
||||
python_path = get_python_path(script_dir, execution.entry_point, self.package_api, is_conda_env=self.is_conda)
|
||||
if os.environ.get(ENV_TASK_EXTRA_PYTHON_PATH):
|
||||
python_path = add_python_path(python_path, os.environ.get(ENV_TASK_EXTRA_PYTHON_PATH))
|
||||
if ENV_TASK_EXTRA_PYTHON_PATH.get():
|
||||
python_path = add_python_path(python_path, ENV_TASK_EXTRA_PYTHON_PATH.get())
|
||||
if python_path:
|
||||
os.environ['PYTHONPATH'] = python_path
|
||||
|
||||
# check if we want to run as another user, only supported on linux
|
||||
if os.environ.get(ENV_TASK_EXECUTE_AS_USER, None) and is_linux_platform():
|
||||
if ENV_TASK_EXECUTE_AS_USER.get() and is_linux_platform():
|
||||
command, script_dir = self._run_as_user_patch(
|
||||
command, self._session.config_file,
|
||||
script_dir, venv_folder,
|
||||
self._session.config.get('sdk.storage.cache.default_base_dir'),
|
||||
os.environ.get(ENV_TASK_EXECUTE_AS_USER))
|
||||
ENV_TASK_EXECUTE_AS_USER.get())
|
||||
use_execv = False
|
||||
else:
|
||||
use_execv = is_linux_platform() and not isinstance(self.package_api, (PoetryAPI, CondaAPI))
|
||||
@@ -1583,7 +1584,7 @@ class Worker(ServiceCommandSection):
|
||||
else:
|
||||
# store stdout/stderr into file, and send to backend
|
||||
temp_stdout_fname = log_file or safe_mkstemp(
|
||||
suffix=".txt", prefix=".trains_agent_out.", name_only=True
|
||||
suffix=".txt", prefix=".clearml_agent_out.", name_only=True
|
||||
)
|
||||
print("Storing stdout and stderr log into [%s]" % temp_stdout_fname)
|
||||
exit_code, _ = self._log_command_output(
|
||||
@@ -1966,7 +1967,7 @@ class Worker(ServiceCommandSection):
|
||||
|
||||
def debug(self, message):
|
||||
if self._session.debug_mode:
|
||||
print("trains_agent: {}".format(message))
|
||||
print("clearml_agent: {}".format(message))
|
||||
|
||||
def find_python_executable_for_version(self, config_version):
|
||||
# type: (Text) -> Tuple[Text, Text, Text]
|
||||
@@ -2166,7 +2167,7 @@ class Worker(ServiceCommandSection):
|
||||
args.update(kwargs)
|
||||
return self._get_docker_cmd(**args)
|
||||
|
||||
docker_image = str(os.environ.get("TRAINS_DOCKER_IMAGE") or
|
||||
docker_image = str(ENV_DOCKER_IMAGE.get() or
|
||||
self._session.config.get("agent.default_docker.image", "nvidia/cuda")) \
|
||||
if not docker_args else docker_args[0]
|
||||
docker_arguments = docker_image.split(' ') if docker_image else []
|
||||
@@ -2183,10 +2184,10 @@ class Worker(ServiceCommandSection):
|
||||
print("Running in Docker {} mode (v19.03 and above) - using default docker image: {} running {}\n".format(
|
||||
'*standalone*' if self._standalone_mode else '', docker_image, python_version))
|
||||
temp_config = deepcopy(self._session.config)
|
||||
mounted_cache_dir = self._docker_fixed_user_cache # '/root/.trains/cache'
|
||||
mounted_pip_dl_dir = '/root/.trains/pip-download-cache'
|
||||
mounted_vcs_cache = '/root/.trains/vcs-cache'
|
||||
mounted_venv_dir = '/root/.trains/venvs-builds'
|
||||
mounted_cache_dir = self._docker_fixed_user_cache # '/root/.clearml/cache'
|
||||
mounted_pip_dl_dir = '/root/.clearml/pip-download-cache'
|
||||
mounted_vcs_cache = '/root/.clearml/vcs-cache'
|
||||
mounted_venv_dir = '/root/.clearml/venvs-builds'
|
||||
host_cache = Path(os.path.expandvars(
|
||||
self._session.config["sdk.storage.cache.default_base_dir"])).expanduser().as_posix()
|
||||
host_pip_dl = Path(os.path.expandvars(
|
||||
@@ -2209,10 +2210,10 @@ class Worker(ServiceCommandSection):
|
||||
self._session.config.get("agent.git_pass", None)))
|
||||
|
||||
host_apt_cache = Path(os.path.expandvars(self._session.config.get(
|
||||
"agent.docker_apt_cache", '~/.trains/apt-cache'))).expanduser().as_posix()
|
||||
"agent.docker_apt_cache", '~/.clearml/apt-cache'))).expanduser().as_posix()
|
||||
host_pip_cache = Path(os.path.expandvars(self._session.config.get(
|
||||
"agent.docker_pip_cache", '~/.trains/pip-cache'))).expanduser().as_posix()
|
||||
host_ssh_cache = mkdtemp(prefix='trains_agent.ssh.')
|
||||
"agent.docker_pip_cache", '~/.clearml/pip-cache'))).expanduser().as_posix()
|
||||
host_ssh_cache = mkdtemp(prefix='clearml_agent.ssh.')
|
||||
self._temp_cleanup_list.append(host_ssh_cache)
|
||||
|
||||
# make sure all folders are valid
|
||||
@@ -2257,7 +2258,7 @@ class Worker(ServiceCommandSection):
|
||||
preprocess_bash_script = self._session.config.get("agent.docker_preprocess_bash_script", None)
|
||||
|
||||
self.temp_config_path = self.temp_config_path or safe_mkstemp(
|
||||
suffix=".cfg", prefix=".trains_agent.", text=True, name_only=True
|
||||
suffix=".cfg", prefix=".clearml_agent.", text=True, name_only=True
|
||||
)
|
||||
|
||||
docker_cmd = dict(worker_id=self.worker_id,
|
||||
@@ -2298,13 +2299,13 @@ class Worker(ServiceCommandSection):
|
||||
dockers_nvidia_visible_devices = 'all'
|
||||
gpu_devices = os.environ.get('NVIDIA_VISIBLE_DEVICES', None)
|
||||
if gpu_devices is None or gpu_devices.lower().strip() == 'all':
|
||||
if os.environ.get('TRAINS_DOCKER_SKIP_GPUS_FLAG', None):
|
||||
if ENV_DOCKER_SKIP_GPUS_FLAG.get():
|
||||
dockers_nvidia_visible_devices = os.environ.get('NVIDIA_VISIBLE_DEVICES') or \
|
||||
dockers_nvidia_visible_devices
|
||||
else:
|
||||
base_cmd += ['--gpus', 'all', ]
|
||||
elif gpu_devices.strip() and gpu_devices.strip() != 'none':
|
||||
if os.environ.get('TRAINS_DOCKER_SKIP_GPUS_FLAG', None):
|
||||
if ENV_DOCKER_SKIP_GPUS_FLAG.get():
|
||||
dockers_nvidia_visible_devices = gpu_devices
|
||||
else:
|
||||
base_cmd += ['--gpus', '\"device={}\"'.format(gpu_devices), ]
|
||||
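The branches in this hunk decide between passing `--gpus` to Docker and only forwarding `NVIDIA_VISIBLE_DEVICES`. A minimal standalone sketch of that decision follows, covering only the branches visible above. The function name `docker_gpu_args` and its `skip_gpus_flag` parameter are hypothetical stand-ins for the inline code and `ENV_DOCKER_SKIP_GPUS_FLAG.get()`.

```python
def docker_gpu_args(gpu_devices, skip_gpus_flag=False):
    """Return (extra docker args, NVIDIA_VISIBLE_DEVICES value for the container).

    gpu_devices mirrors os.environ.get('NVIDIA_VISIBLE_DEVICES') on the host.
    Hypothetical helper; it only reproduces the branches shown in the hunk.
    """
    visible = 'all'
    args = []
    if gpu_devices is None or gpu_devices.lower().strip() == 'all':
        if skip_gpus_flag:
            # no --gpus flag, rely on the environment variable only
            visible = gpu_devices or visible
        else:
            args += ['--gpus', 'all']
    elif gpu_devices.strip() and gpu_devices.strip() != 'none':
        if skip_gpus_flag:
            visible = gpu_devices
        else:
            args += ['--gpus', '"device={}"'.format(gpu_devices)]
    return args, visible


print(docker_gpu_args(None))
print(docker_gpu_args('0,1'))
print(docker_gpu_args('0,1', skip_gpus_flag=True))
```

With `'none'` (or an empty string) neither branch fires, so no GPU arguments are added.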
@@ -2338,7 +2339,7 @@ class Worker(ServiceCommandSection):
|
||||
|
||||
# check if we need to map host folders
|
||||
if ENV_DOCKER_HOST_MOUNT.get():
|
||||
# expect TRAINS_AGENT_K8S_HOST_MOUNT = '/mnt/host/data:/root/.trains'
|
||||
# expect CLEARML_AGENT_K8S_HOST_MOUNT = '/mnt/host/data:/root/.clearml'
|
||||
k8s_node_mnt, _, k8s_pod_mnt = ENV_DOCKER_HOST_MOUNT.get().partition(':')
|
||||
# search and replace all the host folders with the k8s
|
||||
host_mounts = [host_apt_cache, host_pip_cache, host_pip_dl, host_cache, host_vcs_cache]
|
||||
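The host-mount value has the form `node_path:pod_path`, split with `str.partition(':')`. The loop that rewrites the cache folders is elided from this hunk, so the sketch below is an assumption about its intent: paths under the pod mount are rewritten to the node mount, and paths outside it are dropped. `remap_host_mounts` is a hypothetical name, not a function in clearml-agent.

```python
def remap_host_mounts(mount_spec, paths):
    """Translate pod-side paths to node-side paths (hypothetical helper).

    mount_spec: e.g. '/mnt/host/data:/root/.clearml'
    Returns a list aligned with `paths`; entries outside the pod mount become None.
    """
    node_mnt, _, pod_mnt = mount_spec.partition(':')
    remapped = []
    for path in paths:
        if pod_mnt and path.startswith(pod_mnt):
            # keep the suffix, swap the mount prefix
            remapped.append(node_mnt + path[len(pod_mnt):])
        else:
            remapped.append(None)
    return remapped


print(remap_host_mounts(
    '/mnt/host/data:/root/.clearml',
    ['/root/.clearml/pip-cache', '/tmp/other'],
))
```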
@@ -2351,7 +2352,7 @@ class Worker(ServiceCommandSection):
|
||||
host_apt_cache, host_pip_cache, host_pip_dl, host_cache, host_vcs_cache = host_mounts
|
||||
|
||||
# copy the configuration file into the mounted folder
|
||||
new_conf_file = os.path.join(k8s_pod_mnt, '.trains_agent.{}.cfg'.format(quote(worker_id, safe="")))
|
||||
new_conf_file = os.path.join(k8s_pod_mnt, '.clearml_agent.{}.cfg'.format(quote(worker_id, safe="")))
|
||||
try:
|
||||
rm_tree(new_conf_file)
|
||||
rm_file(new_conf_file)
|
||||
@@ -2361,7 +2362,7 @@ class Worker(ServiceCommandSection):
|
||||
raise ValueError('Error: could not copy configuration file into: {}'.format(new_conf_file))
|
||||
|
||||
if host_ssh_cache:
|
||||
new_ssh_cache = os.path.join(k8s_pod_mnt, '.trains_agent.{}.ssh'.format(quote(worker_id, safe="")))
|
||||
new_ssh_cache = os.path.join(k8s_pod_mnt, '.clearml_agent.{}.ssh'.format(quote(worker_id, safe="")))
|
||||
try:
|
||||
rm_tree(new_ssh_cache)
|
||||
shutil.copytree(host_ssh_cache, new_ssh_cache)
|
||||
@@ -2369,29 +2370,29 @@ class Worker(ServiceCommandSection):
|
||||
except Exception:
|
||||
raise ValueError('Error: could not copy .ssh directory into: {}'.format(new_ssh_cache))
|
||||
|
||||
base_cmd += ['-e', 'TRAINS_WORKER_ID='+worker_id, ]
|
||||
base_cmd += ['-e', 'CLEARML_WORKER_ID='+worker_id, ]
|
||||
# update the docker image, so the system knows where it runs
|
||||
base_cmd += ['-e', 'TRAINS_DOCKER_IMAGE={} {}'.format(docker_image, ' '.join(docker_arguments)).strip()]
|
||||
base_cmd += ['-e', 'CLEARML_DOCKER_IMAGE={} {}'.format(docker_image, ' '.join(docker_arguments)).strip()]
|
||||
|
||||
# if we are running an RC version, install the same version in the docker,
|
||||
# because the default (latest) will be a release version, not an RC
|
||||
specify_version = ''
|
||||
try:
|
||||
from trains_agent.version import __version__
|
||||
from clearml_agent.version import __version__
|
||||
_version_parts = __version__.split('.')
|
||||
if force_current_version or 'rc' in _version_parts[-1].lower() or 'rc' in _version_parts[-2].lower():
|
||||
specify_version = '=={}'.format(__version__)
|
||||
except:
|
||||
pass
|
||||
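The version check above pins the in-container install to the agent's own version when that version is a release candidate. A minimal sketch of the same rule, assuming a dotted version string; `version_pin` is a hypothetical helper. It inspects the last two dot-separated parts with a slice, so a version with a single part does not raise.

```python
def version_pin(version, force_current_version=False):
    """Return a pip version specifier such as '==0.17.0rc1', or '' for no pin."""
    parts = version.split('.')
    # 'rc' may appear in either of the last two parts, e.g. '0.17.0rc1' or '0.17rc2.dev1'
    is_rc = any('rc' in part.lower() for part in parts[-2:])
    return '=={}'.format(version) if (force_current_version or is_rc) else ''


print('clearml-agent' + version_pin('0.17.0rc1'))
print('clearml-agent' + version_pin('0.17.0'))
```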
|
||||
if os.environ.get('FORCE_LOCAL_TRAINS_AGENT_WHEEL'):
|
||||
local_wheel = os.path.expanduser(os.environ.get('FORCE_LOCAL_TRAINS_AGENT_WHEEL'))
|
||||
if os.environ.get('FORCE_LOCAL_CLEARML_AGENT_WHEEL'):
|
||||
local_wheel = os.path.expanduser(os.environ.get('FORCE_LOCAL_CLEARML_AGENT_WHEEL'))
|
||||
docker_wheel = str(Path('/tmp') / Path(local_wheel).name)
|
||||
base_cmd += ['-v', local_wheel + ':' + docker_wheel]
|
||||
trains_agent_wheel = '\"{}\"'.format(docker_wheel)
|
||||
clearml_agent_wheel = '\"{}\"'.format(docker_wheel)
|
||||
else:
|
||||
# trains-agent{specify_version}
|
||||
trains_agent_wheel = 'trains-agent{specify_version}'.format(specify_version=specify_version)
|
||||
# clearml-agent{specify_version}
|
||||
clearml_agent_wheel = 'clearml-agent{specify_version}'.format(specify_version=specify_version)
|
||||
|
||||
if not standalone_mode:
|
||||
if not bash_script:
|
||||
@@ -2421,10 +2422,10 @@ class Worker(ServiceCommandSection):
|
||||
docker_bash_script + " ; " +
|
||||
"[ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON={python} ; " +
|
||||
"$LOCAL_PYTHON -m pip install -U \"pip{pip_version}\" ; " +
|
||||
"$LOCAL_PYTHON -m pip install -U {trains_agent_wheel} ; ").format(
|
||||
"$LOCAL_PYTHON -m pip install -U {clearml_agent_wheel} ; ").format(
|
||||
python_single_digit=python_version.split('.')[0],
|
||||
python=python_version, pip_version=PackageManager.get_pip_version(),
|
||||
trains_agent_wheel=trains_agent_wheel)
|
||||
clearml_agent_wheel=clearml_agent_wheel)
|
||||
|
||||
if host_git_credentials:
|
||||
for git_credentials in host_git_credentials:
|
||||
@@ -2442,7 +2443,7 @@ class Worker(ServiceCommandSection):
|
||||
update_scheme +
|
||||
extra_shell_script +
|
||||
"cp {} {} ; ".format(DOCKER_ROOT_CONF_FILE, DOCKER_DEFAULT_CONF_FILE) +
|
||||
"NVIDIA_VISIBLE_DEVICES={nv_visible} $LOCAL_PYTHON -u -m trains_agent ".format(
|
||||
"NVIDIA_VISIBLE_DEVICES={nv_visible} $LOCAL_PYTHON -u -m clearml_agent ".format(
|
||||
nv_visible=dockers_nvidia_visible_devices, python=python_version)
|
||||
])
|
||||
|
||||
@@ -2473,7 +2474,7 @@ class Worker(ServiceCommandSection):
|
||||
os.setuid(self.uid)
|
||||
|
||||
# create a home folder for our user
|
||||
trains_agent_home = self._run_as_user_home + '{}'.format(
|
||||
clearml_agent_home = self._run_as_user_home + '{}'.format(
|
||||
'.'+str(Singleton.get_slot()) if Singleton.get_slot() else '')
|
||||
try:
|
||||
home_folder = self._run_as_user_home
|
||||
@@ -2511,7 +2512,7 @@ class Worker(ServiceCommandSection):
|
||||
|
||||
# make sure we could access the trains_conf file
|
||||
try:
|
||||
user_trains_conf = os.path.join(home_folder, 'trains.conf')
|
||||
user_trains_conf = os.path.join(home_folder, 'clearml.conf')
|
||||
shutil.copy(trains_conf, user_trains_conf)
|
||||
Path(user_trains_conf).chmod(0o0777)
|
||||
except:
|
||||
@@ -2539,11 +2540,11 @@ class Worker(ServiceCommandSection):
|
||||
(worker_id and uid == worker_id) or
|
||||
(not worker_id and uid.startswith('{}:'.format(worker_name)))):
|
||||
# this is us, kill it
|
||||
print('Terminating trains-agent worker_id={} pid={}'.format(uid, pid))
|
||||
print('Terminating clearml-agent worker_id={} pid={}'.format(uid, pid))
|
||||
if not terminate_process(pid, timeout=10):
|
||||
error('Could not terminate process pid={}'.format(pid))
|
||||
return True
|
||||
print('Could not find a running trains-agent instance with worker_name={} worker_id={}'.format(
|
||||
print('Could not find a running clearml-agent instance with worker_name={} worker_id={}'.format(
|
||||
worker_name, worker_id))
|
||||
return False
|
||||
|
||||
|
||||
@@ -1,8 +1,8 @@
|
||||
"""
|
||||
Script for generating command-line completion.
|
||||
Called by trains_agent/utilities/complete.sh (or a copy of it) like so:
|
||||
Called by clearml_agent/utilities/complete.sh (or a copy of it) like so:
|
||||
|
||||
python -m trains_agent.complete "current command line"
|
||||
python -m clearml_agent.complete "current command line"
|
||||
|
||||
And writes line-separated completion targets to stdout.
|
||||
Results are line-separated in order to enable other whitespace in results.
|
||||
@@ -13,7 +13,7 @@ from __future__ import print_function
|
||||
import argparse
|
||||
import sys
|
||||
|
||||
from trains_agent.interface import get_parser
|
||||
from clearml_agent.interface import get_parser
|
||||
|
||||
|
||||
def is_argument_required(action):
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
from pyhocon import ConfigTree
|
||||
|
||||
import six
|
||||
from trains_agent.helper.base import Singleton
|
||||
from clearml_agent.helper.base import Singleton
|
||||
|
||||
|
||||
@six.add_metaclass(Singleton)
|
||||
|
||||
@@ -1,22 +1,22 @@
|
||||
from datetime import timedelta
|
||||
from distutils.util import strtobool
|
||||
from enum import IntEnum
|
||||
from os import getenv
|
||||
from os import getenv, environ
|
||||
from typing import Text, Optional, Union, Tuple, Any
|
||||
|
||||
from furl import furl
|
||||
from pathlib2 import Path
|
||||
|
||||
import six
|
||||
from trains_agent.helper.base import normalize_path
|
||||
from clearml_agent.helper.base import normalize_path
|
||||
|
||||
PROGRAM_NAME = "trains-agent"
|
||||
PROGRAM_NAME = "clearml-agent"
|
||||
FROM_FILE_PREFIX_CHARS = "@"
|
||||
|
||||
CONFIG_DIR = normalize_path("~/.trains")
|
||||
TOKEN_CACHE_FILE = normalize_path("~/.trains.trains_agent.tmp")
|
||||
CONFIG_DIR = normalize_path("~/.clearml")
|
||||
TOKEN_CACHE_FILE = normalize_path("~/.clearml.clearml_agent.tmp")
|
||||
|
||||
CONFIG_FILE_CANDIDATES = ["~/trains.conf"]
|
||||
CONFIG_FILE_CANDIDATES = ["~/clearml.conf"]
|
||||
|
||||
|
||||
def find_config_path():
|
||||
@@ -40,6 +40,14 @@ class EnvironmentConfig(object):
|
||||
self.vars = names
|
||||
self.type = kwargs.pop("type", six.text_type)
|
||||
|
||||
def pop(self):
|
||||
for k in self.vars:
|
||||
environ.pop(k, None)
|
||||
|
||||
def set(self, value):
|
||||
for k in self.vars:
|
||||
environ[k] = str(value)
|
||||
|
||||
def convert(self, value):
|
||||
return self.conversions.get(self.type, self.type)(value)
|
||||
|
||||
@@ -55,23 +63,23 @@ class EnvironmentConfig(object):
|
||||
|
||||
|
||||
ENVIRONMENT_CONFIG = {
|
||||
"api.api_server": EnvironmentConfig("TRAINS_API_HOST", ),
|
||||
"api.api_server": EnvironmentConfig("CLEARML_API_HOST", "TRAINS_API_HOST", ),
|
||||
"api.credentials.access_key": EnvironmentConfig(
|
||||
"TRAINS_API_ACCESS_KEY",
|
||||
"CLEARML_API_ACCESS_KEY", "TRAINS_API_ACCESS_KEY",
|
||||
),
|
||||
"api.credentials.secret_key": EnvironmentConfig(
|
||||
"TRAINS_API_SECRET_KEY",
|
||||
"CLEARML_API_SECRET_KEY", "TRAINS_API_SECRET_KEY",
|
||||
),
|
||||
"agent.worker_name": EnvironmentConfig("TRAINS_WORKER_NAME", ),
|
||||
"agent.worker_id": EnvironmentConfig("TRAINS_WORKER_ID", ),
|
||||
"agent.worker_name": EnvironmentConfig("CLEARML_WORKER_NAME", "TRAINS_WORKER_NAME", ),
|
||||
"agent.worker_id": EnvironmentConfig("CLEARML_WORKER_ID", "TRAINS_WORKER_ID", ),
|
||||
"agent.cuda_version": EnvironmentConfig(
|
||||
"TRAINS_CUDA_VERSION", "CUDA_VERSION"
|
||||
"CLEARML_CUDA_VERSION", "TRAINS_CUDA_VERSION", "CUDA_VERSION"
|
||||
),
|
||||
"agent.cudnn_version": EnvironmentConfig(
|
||||
"TRAINS_CUDNN_VERSION", "CUDNN_VERSION"
|
||||
"CLEARML_CUDNN_VERSION", "TRAINS_CUDNN_VERSION", "CUDNN_VERSION"
|
||||
),
|
||||
"agent.cpu_only": EnvironmentConfig(
|
||||
"TRAINS_CPU_ONLY", "CPU_ONLY", type=bool
|
||||
names=("CLEARML_CPU_ONLY", "TRAINS_CPU_ONLY", "CPU_ONLY"), type=bool
|
||||
),
|
||||
"sdk.aws.s3.key": EnvironmentConfig("AWS_ACCESS_KEY_ID"),
|
||||
"sdk.aws.s3.secret": EnvironmentConfig("AWS_SECRET_ACCESS_KEY"),
|
||||
@@ -82,13 +90,14 @@ ENVIRONMENT_CONFIG = {
|
||||
}
|
||||
|
||||
ENVIRONMENT_SDK_PARAMS = {
|
||||
"task_id": ("TRAINS_TASK_ID", ),
|
||||
"config_file": ("TRAINS_CONFIG_FILE", ),
|
||||
"log_level": ("TRAINS_LOG_LEVEL", ),
|
||||
"log_to_backend": ("TRAINS_LOG_TASK_TO_BACKEND", ),
|
||||
"task_id": ("CLEARML_TASK_ID", "TRAINS_TASK_ID", ),
|
||||
"config_file": ("CLEARML_CONFIG_FILE", "TRAINS_CONFIG_FILE", ),
|
||||
"log_level": ("CLEARML_LOG_LEVEL", "TRAINS_LOG_LEVEL", ),
|
||||
"log_to_backend": ("CLEARML_LOG_TASK_TO_BACKEND", "TRAINS_LOG_TASK_TO_BACKEND", ),
|
||||
}
|
||||
|
||||
ENVIRONMENT_BACKWARD_COMPATIBLE = EnvironmentConfig("TRAINS_AGENT_ALG_ENV", type=bool)
|
||||
ENVIRONMENT_BACKWARD_COMPATIBLE = EnvironmentConfig(
|
||||
names=("CLEARML_AGENT_ALG_ENV", "TRAINS_AGENT_ALG_ENV"), type=bool)
|
||||
|
||||
VIRTUAL_ENVIRONMENT_PATH = {
|
||||
"python2": normalize_path(CONFIG_DIR, "py2venv"),
|
||||
@@ -96,7 +105,7 @@ VIRTUAL_ENVIRONMENT_PATH = {
|
||||
}
|
||||
|
||||
DEFAULT_BASE_DIR = normalize_path(CONFIG_DIR, "data_cache")
|
||||
DEFAULT_HOST = "https://demoapi.trains.allegro.ai"
|
||||
DEFAULT_HOST = "https://demoapi.demo.clear.ml"
|
||||
MAX_DATASET_SOURCES_COUNT = 50000
|
||||
|
||||
INVALID_WORKER_ID = (400, 1001)
|
||||
@@ -105,11 +114,6 @@ WORKER_ALREADY_REGISTERED = (400, 1003)
|
||||
API_VERSION = "v1.5"
|
||||
TOKEN_EXPIRATION_SECONDS = int(timedelta(days=2).total_seconds())
|
||||
|
||||
HTTP_HEADERS = {
|
||||
"worker": "X-Trains-Worker",
|
||||
"act-as": "X-Trains-Act-As",
|
||||
"client": "X-Trains-Agent",
|
||||
}
|
||||
METADATA_EXTENSION = ".json"
|
||||
|
||||
DEFAULT_VENV_UPDATE_URL = (
|
||||
@@ -120,12 +124,16 @@ DEFAULT_VCS_CACHE = normalize_path(CONFIG_DIR, "vcs-cache")
|
||||
PIP_EXTRA_INDICES = [
|
||||
]
|
||||
DEFAULT_PIP_DOWNLOAD_CACHE = normalize_path(CONFIG_DIR, "pip-download-cache")
|
||||
ENV_AGENT_GIT_USER = EnvironmentConfig('TRAINS_AGENT_GIT_USER')
|
||||
ENV_AGENT_GIT_PASS = EnvironmentConfig('TRAINS_AGENT_GIT_PASS')
|
||||
ENV_AGENT_GIT_HOST = EnvironmentConfig('TRAINS_AGENT_GIT_HOST')
|
||||
ENV_TASK_EXECUTE_AS_USER = 'TRAINS_AGENT_EXEC_USER'
|
||||
ENV_TASK_EXTRA_PYTHON_PATH = 'TRAINS_AGENT_EXTRA_PYTHON_PATH'
|
||||
ENV_DOCKER_HOST_MOUNT = EnvironmentConfig('TRAINS_AGENT_K8S_HOST_MOUNT', 'TRAINS_AGENT_DOCKER_HOST_MOUNT')
|
||||
ENV_DOCKER_IMAGE = EnvironmentConfig('CLEARML_DOCKER_IMAGE', 'TRAINS_DOCKER_IMAGE')
|
||||
ENV_WORKER_ID = EnvironmentConfig('CLEARML_WORKER_ID', 'TRAINS_WORKER_ID')
|
||||
ENV_DOCKER_SKIP_GPUS_FLAG = EnvironmentConfig('CLEARML_DOCKER_SKIP_GPUS_FLAG', 'TRAINS_DOCKER_SKIP_GPUS_FLAG')
|
||||
ENV_AGENT_GIT_USER = EnvironmentConfig('CLEARML_AGENT_GIT_USER', 'TRAINS_AGENT_GIT_USER')
|
||||
ENV_AGENT_GIT_PASS = EnvironmentConfig('CLEARML_AGENT_GIT_PASS', 'TRAINS_AGENT_GIT_PASS')
|
||||
ENV_AGENT_GIT_HOST = EnvironmentConfig('CLEARML_AGENT_GIT_HOST', 'TRAINS_AGENT_GIT_HOST')
|
||||
ENV_TASK_EXECUTE_AS_USER = EnvironmentConfig('CLEARML_AGENT_EXEC_USER', 'TRAINS_AGENT_EXEC_USER')
|
||||
ENV_TASK_EXTRA_PYTHON_PATH = EnvironmentConfig('CLEARML_AGENT_EXTRA_PYTHON_PATH', 'TRAINS_AGENT_EXTRA_PYTHON_PATH')
|
||||
ENV_DOCKER_HOST_MOUNT = EnvironmentConfig('CLEARML_AGENT_K8S_HOST_MOUNT', 'CLEARML_AGENT_DOCKER_HOST_MOUNT',
|
||||
'TRAINS_AGENT_K8S_HOST_MOUNT', 'TRAINS_AGENT_DOCKER_HOST_MOUNT')
|
||||
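The renamed definitions list the new `CLEARML_*` name first and keep the legacy `TRAINS_*` name as a fallback. The `get()` implementation is not part of this diff, so the sketch below assumes the first non-empty variable in the list wins. `first_env` and its `environ` parameter are hypothetical and exist only for illustration.

```python
import os


def first_env(*names, default=None, environ=None):
    """Return the value of the first non-empty variable in `names` (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    for name in names:
        value = (environ.get(name) or '').strip()
        if value:
            return value
    return default


# legacy name only: the fallback is used
env = {'TRAINS_WORKER_ID': 'legacy:0'}
print(first_env('CLEARML_WORKER_ID', 'TRAINS_WORKER_ID', environ=env))

# both names set: the new name takes precedence
env['CLEARML_WORKER_ID'] = 'new:0'
print(first_env('CLEARML_WORKER_ID', 'TRAINS_WORKER_ID', environ=env))
```

Listing the new name first means existing deployments keep working with their old variables, while a deployment that sets both gets the new one.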
|
||||
|
||||
class FileBuffering(IntEnum):
|
||||
|
||||
@@ -30,7 +30,7 @@ LOCAL_REGEX = re.compile(
|
||||
|
||||
class Requirement(object):
|
||||
"""
|
||||
Represents a single requirementfrom trains_agent.external.requirements_parser.requirement import Requirement
|
||||
Represents a single requirementfrom clearml_agent.external.requirements_parser.requirement import Requirement
|
||||
|
||||
Typically instances of this class are created with ``Requirement.parse``.
|
||||
For local file requirements, there's no verification that the file
|
||||
|
||||
@@ -13,14 +13,15 @@ import json
|
||||
from time import sleep
|
||||
from typing import Text, List
|
||||
|
||||
from trains_agent.commands.events import Events
|
||||
from trains_agent.commands.worker import Worker
|
||||
from trains_agent.errors import APIError
|
||||
from trains_agent.helper.base import safe_remove_file
|
||||
from trains_agent.helper.dicts import merge_dicts
|
||||
from trains_agent.helper.process import get_bash_output
|
||||
from trains_agent.helper.resource_monitor import ResourceMonitor
|
||||
from trains_agent.interface.base import ObjectID
|
||||
from clearml_agent.commands.events import Events
|
||||
from clearml_agent.commands.worker import Worker
|
||||
from clearml_agent.definitions import ENV_DOCKER_IMAGE
|
||||
from clearml_agent.errors import APIError
|
||||
from clearml_agent.helper.base import safe_remove_file
|
||||
from clearml_agent.helper.dicts import merge_dicts
|
||||
from clearml_agent.helper.process import get_bash_output
|
||||
from clearml_agent.helper.resource_monitor import ResourceMonitor
|
||||
from clearml_agent.interface.base import ObjectID
|
||||
|
||||
|
||||
class K8sIntegration(Worker):
|
||||
@@ -28,16 +29,16 @@ class K8sIntegration(Worker):
|
||||
|
||||
KUBECTL_APPLY_CMD = "kubectl apply -f"
|
||||
|
||||
KUBECTL_RUN_CMD = "kubectl run trains-{queue_name}-id-{task_id} " \
|
||||
KUBECTL_RUN_CMD = "kubectl run clearml-{queue_name}-id-{task_id} " \
|
||||
"--image {docker_image} " \
|
||||
"--restart=Never --replicas=1 " \
|
||||
"--generator=run-pod/v1 " \
|
||||
"--namespace=trains"
|
||||
"--namespace=clearml"
|
||||
|
||||
KUBECTL_DELETE_CMD = "kubectl delete pods " \
|
||||
"--selector=TRAINS=agent " \
|
||||
"--field-selector=status.phase!=Pending,status.phase!=Running " \
|
||||
"--namespace=trains"
|
||||
"--namespace=clearml"
|
||||
|
||||
BASH_INSTALL_SSH_CMD = [
|
||||
"apt-get install -y openssh-server",
|
||||
@@ -46,7 +47,8 @@ class K8sIntegration(Worker):
|
||||
"echo 'PermitRootLogin yes' >> /etc/ssh/sshd_config",
|
||||
"sed -i 's/PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config",
|
||||
r"sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd",
|
||||
"echo 'AcceptEnv TRAINS_API_ACCESS_KEY TRAINS_API_SECRET_KEY' >> /etc/ssh/sshd_config",
|
||||
"echo 'AcceptEnv TRAINS_API_ACCESS_KEY TRAINS_API_SECRET_KEY CLEARML_API_ACCESS_KEY CLEARML_API_SECRET_KEY' "
|
||||
">> /etc/ssh/sshd_config",
|
||||
'echo "export VISIBLE=now" >> /etc/profile',
|
||||
'echo "export PATH=$PATH" >> /etc/profile',
|
||||
'echo "ldconfig" >> /etc/profile',
|
||||
@@ -63,9 +65,9 @@ class K8sIntegration(Worker):
|
||||
"export LOCAL_PYTHON=$(which python3.$i) && break ; done",
|
||||
"[ ! -z $LOCAL_PYTHON ] || apt-get install -y python3-pip",
|
||||
"[ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3",
|
||||
"$LOCAL_PYTHON -m pip install trains-agent",
|
||||
"$LOCAL_PYTHON -m pip install clearml-agent",
|
||||
"{extra_bash_init_cmd}",
|
||||
"$LOCAL_PYTHON -m trains_agent execute --full-monitoring --require-queue --id {task_id}"
|
||||
"$LOCAL_PYTHON -m clearml_agent execute --full-monitoring --require-queue --id {task_id}"
|
||||
]
|
||||
|
||||
AGENT_LABEL = "TRAINS=agent"
|
||||
@@ -108,7 +110,7 @@ class K8sIntegration(Worker):
|
||||
:param str overrides_yaml: YAML file containing the overrides for the pod (optional)
|
||||
:param str template_yaml: YAML file containing the template for the pod (optional).
|
||||
If provided the pod is scheduled with kubectl apply and overrides are ignored, otherwise with kubectl run.
|
||||
:param str trains_conf_file: trains.conf file to be used by the pod itself (optional)
|
||||
:param str trains_conf_file: clearml.conf file to be used by the pod itself (optional)
|
||||
:param str extra_bash_init_script: Additional bash script to run before starting the Task inside the container
|
||||
"""
|
||||
super(K8sIntegration, self).__init__()
|
||||
@@ -213,7 +215,7 @@ class K8sIntegration(Worker):
|
||||
if task_data.execution.docker_cmd:
|
||||
docker_parts = task_data.execution.docker_cmd
|
||||
else:
|
||||
docker_parts = str(os.environ.get("TRAINS_DOCKER_IMAGE") or
|
||||
docker_parts = str(ENV_DOCKER_IMAGE.get() or
|
||||
self._session.config.get("agent.default_docker.image", "nvidia/cuda"))
|
||||
|
||||
# take the first part, this is the docker image name (not arguments)
|
||||
@@ -221,10 +223,10 @@ class K8sIntegration(Worker):
|
||||
docker_image = docker_parts[0]
|
||||
docker_args = docker_parts[1:] if len(docker_parts) > 1 else []
|
||||
|
||||
# get the trains.conf encoded file
|
||||
# get the clearml.conf encoded file
|
||||
# noinspection PyProtectedMember
|
||||
hocon_config_encoded = (self.trains_conf_file or self._session._config_file).encode('ascii')
|
||||
create_trains_conf = "echo '{}' | base64 --decode >> ~/trains.conf".format(
|
||||
create_trains_conf = "echo '{}' | base64 --decode >> ~/clearml.conf".format(
|
||||
base64.b64encode(
|
||||
hocon_config_encoded
|
||||
).decode('ascii')
|
||||
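The configuration text is base64-encoded so it can be embedded in a single-quoted shell command and decoded inside the pod. A minimal sketch of that round trip; `build_conf_command` and its `target` parameter are hypothetical names, and the sketch assumes an ASCII-only configuration, as the `.encode('ascii')` call in the hunk does.

```python
import base64


def build_conf_command(conf_text, target='~/clearml.conf'):
    """Build a shell command that appends the decoded configuration to `target`."""
    encoded = base64.b64encode(conf_text.encode('ascii')).decode('ascii')
    # the base64 alphabet contains no single quote, so this quoting is safe
    return "echo '{}' | base64 --decode >> {}".format(encoded, target)


command = build_conf_command('api { api_server: "http://localhost:8008" }')
print(command)

# verify the payload decodes back to the original text
payload = command.split("'")[1]
print(base64.b64decode(payload).decode('ascii'))
```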
@@ -246,7 +248,7 @@ class K8sIntegration(Worker):
|
||||
# Search for a free pod number
|
||||
pod_number = 1
|
||||
while self.ports_mode:
|
||||
kubectl_cmd_new = "kubectl get pods -l {pod_label},{agent_label} -n trains".format(
|
||||
kubectl_cmd_new = "kubectl get pods -l {pod_label},{agent_label} -n clearml".format(
|
||||
pod_label=self.LIMIT_POD_LABEL.format(pod_number=pod_number),
|
||||
agent_label=self.AGENT_LABEL
|
||||
)
|
||||
@@ -333,7 +335,7 @@ class K8sIntegration(Worker):
|
||||
template.setdefault('apiVersion', 'v1')
|
||||
template['kind'] = 'Pod'
|
||||
template.setdefault('metadata', {})
|
||||
name = 'trains-{queue}-id-{task_id}'.format(queue=queue_name, task_id=task_id)
|
||||
name = 'clearml-{queue}-id-{task_id}'.format(queue=queue_name, task_id=task_id)
|
||||
template['metadata']['name'] = name
|
||||
template.setdefault('spec', {})
|
||||
template['spec'].setdefault('containers', [])
|
||||
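The template handling above fills in defaults with `dict.setdefault`, while always forcing `kind` and the pod name. A minimal sketch of that step in isolation; `prepare_pod_template` is a hypothetical wrapper, and the deep copy is an addition here so the caller's template is left untouched.

```python
from copy import deepcopy


def prepare_pod_template(template, queue_name, task_id):
    """Return a copy of `template` with the pod defaults applied (hypothetical helper)."""
    template = deepcopy(template or {})
    template.setdefault('apiVersion', 'v1')  # keep a user-provided apiVersion
    template['kind'] = 'Pod'                 # always overridden
    template.setdefault('metadata', {})
    template['metadata']['name'] = 'clearml-{queue}-id-{task_id}'.format(
        queue=queue_name, task_id=task_id)
    template.setdefault('spec', {})
    template['spec'].setdefault('containers', [])
    return template


pod = prepare_pod_template({'metadata': {'labels': {'team': 'ml'}}}, 'default', 'abc123')
print(pod['metadata'])
```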
@@ -370,7 +372,7 @@ class K8sIntegration(Worker):
|
||||
else:
|
||||
template['spec']['containers'].append(container)
|
||||
|
||||
fp, yaml_file = tempfile.mkstemp(prefix='trains_k8stmpl_', suffix='.yml')
|
||||
fp, yaml_file = tempfile.mkstemp(prefix='clearml_k8stmpl_', suffix='.yml')
|
||||
os.close(fp)
|
||||
with open(yaml_file, 'wt') as f:
|
||||
yaml.dump(template, f)
|
||||
@@ -444,7 +446,7 @@ class K8sIntegration(Worker):
|
||||
:param queues: IDs of queues to pull tasks from
|
||||
:type queues: list of ``Text``
|
||||
:param worker_params: Worker command line arguments
|
||||
:type worker_params: ``trains_agent.helper.process.WorkerParams``
|
||||
:type worker_params: ``clearml_agent.helper.process.WorkerParams``
|
||||
"""
|
||||
events_service = self.get_service(Events)
|
||||
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
""" TRAINS-AGENT Stdout Helper Functions """
|
||||
""" CLEARML-AGENT Stdout Helper Functions """
|
||||
from __future__ import print_function, unicode_literals
|
||||
|
||||
import io
|
||||
@@ -28,8 +28,8 @@ from tqdm import tqdm
|
||||
|
||||
import six
|
||||
from six.moves import reduce
|
||||
from trains_agent.errors import CommandFailedError
|
||||
from trains_agent.helper.dicts import filter_keys
|
||||
from clearml_agent.errors import CommandFailedError
|
||||
from clearml_agent.helper.dicts import filter_keys
|
||||
|
||||
pretty_lines = False
|
||||
|
||||
@@ -380,11 +380,11 @@ AllDumper.add_multi_representer(object, lambda dumper, data: dumper.represent_st
|
||||
|
||||
|
||||
def error(message):
|
||||
print('\ntrains_agent: ERROR: {}\n'.format(message))
|
||||
print('\nclearml_agent: ERROR: {}\n'.format(message))
|
||||
|
||||
|
||||
def warning(message):
|
||||
print('trains_agent: Warning: {}'.format(message))
|
||||
print('clearml_agent: Warning: {}'.format(message))
|
||||
|
||||
|
||||
class TqdmStream(object):
|
||||
|
||||
@@ -21,14 +21,14 @@ def start_check_update_daemon():
|
||||
|
||||
def _check_new_version_available():
|
||||
cur_version = __version__
|
||||
update_server_releases = requests.get('https://updates.trains.allegro.ai/updates',
|
||||
data=json.dumps({"versions": {"trains-agent": str(cur_version)}}),
|
||||
update_server_releases = requests.get('https://updates.clear.ml/updates',
|
||||
data=json.dumps({"versions": {"clearml-agent": str(cur_version)}}),
|
||||
timeout=3.0)
|
||||
if update_server_releases.ok:
|
||||
update_server_releases = update_server_releases.json()
|
||||
else:
|
||||
return None
|
||||
trains_answer = update_server_releases.get("trains-agent", {})
|
||||
trains_answer = update_server_releases.get("clearml-agent", {})
|
||||
latest_version = trains_answer.get("version")
|
||||
cur_version = cur_version
|
||||
latest_version = latest_version or ''
|
||||
@@ -48,7 +48,7 @@ def _check_update_daemon():
|
||||
if latest_version:
|
||||
if latest_version[1]:
|
||||
sep = os.linesep
|
||||
print('TRAINS-AGENT new package available: UPGRADE to v{} is recommended!\nRelease Notes:\n{}'.format(
|
||||
print('CLEARML-AGENT new package available: UPGRADE to v{} is recommended!\nRelease Notes:\n{}'.format(
|
||||
latest_version[0], sep.join(latest_version[2])))
|
||||
else:
|
||||
print('TRAINS-SERVER new version available: upgrade to v{} is recommended!'.format(
|
||||
|
||||
@@ -9,7 +9,7 @@ from attr import attrs, attrib
|
||||
|
||||
import six
|
||||
from six import binary_type, text_type
|
||||
from trains_agent.helper.base import nonstrict_in_place_sort
|
||||
from clearml_agent.helper.base import nonstrict_in_place_sort
|
||||
|
||||
|
||||
def print_text(text, newline=True):
|
||||
|
||||
@@ -5,8 +5,8 @@ from contextlib import contextmanager
|
||||
from typing import Text, Iterable, Union
|
||||
|
||||
import six
|
||||
from trains_agent.helper.base import mkstemp, safe_remove_file, join_lines, select_for_platform
|
||||
from trains_agent.helper.process import Executable, Argv, PathLike
|
||||
from clearml_agent.helper.base import mkstemp, safe_remove_file, join_lines, select_for_platform
|
||||
from clearml_agent.helper.process import Executable, Argv, PathLike
|
||||
|
||||
|
||||
@six.add_metaclass(abc.ABCMeta)
|
||||
|
||||
@@ -15,14 +15,14 @@ import yaml
|
||||
from time import time
|
||||
from attr import attrs, attrib, Factory
|
||||
from pathlib2 import Path
|
||||
from trains_agent.external.requirements_parser import parse
|
||||
from trains_agent.external.requirements_parser.requirement import Requirement
|
||||
from clearml_agent.external.requirements_parser import parse
|
||||
from clearml_agent.external.requirements_parser.requirement import Requirement
|
||||
|
||||
from trains_agent.errors import CommandFailedError
|
||||
from trains_agent.helper.base import rm_tree, NonStrictAttrs, select_for_platform, is_windows_platform, ExecutionInfo
|
||||
from trains_agent.helper.process import Argv, Executable, DEVNULL, CommandSequence, PathLike
|
||||
from trains_agent.helper.package.requirements import SimpleVersion
|
||||
from trains_agent.session import Session
|
||||
from clearml_agent.errors import CommandFailedError
|
||||
from clearml_agent.helper.base import rm_tree, NonStrictAttrs, select_for_platform, is_windows_platform, ExecutionInfo
|
||||
from clearml_agent.helper.process import Argv, Executable, DEVNULL, CommandSequence, PathLike
|
||||
from clearml_agent.helper.package.requirements import SimpleVersion
|
||||
from clearml_agent.session import Session
|
||||
from .base import PackageManager
|
||||
from .pip_api.venv import VirtualenvPip
|
||||
from .requirements import RequirementsManager, MarkerRequirement
|
||||
|
||||
@@ -2,10 +2,10 @@ import sys
|
||||
from itertools import chain
|
||||
from typing import Text, Optional
|
||||
|
||||
from trains_agent.definitions import PIP_EXTRA_INDICES, PROGRAM_NAME
|
||||
from trains_agent.helper.package.base import PackageManager
|
||||
from trains_agent.helper.process import Argv, DEVNULL
|
||||
from trains_agent.session import Session
|
||||
from clearml_agent.definitions import PIP_EXTRA_INDICES, PROGRAM_NAME
|
||||
from clearml_agent.helper.package.base import PackageManager
|
||||
from clearml_agent.helper.process import Argv, DEVNULL
|
||||
from clearml_agent.session import Session
|
||||
|
||||
|
||||
class SystemPip(PackageManager):
|
||||
|
||||
@@ -2,10 +2,10 @@ from typing import Any
|
||||
|
||||
from pathlib2 import Path
|
||||
|
||||
from trains_agent.helper.base import select_for_platform, rm_tree, ExecutionInfo
|
||||
from trains_agent.helper.package.base import PackageManager
|
||||
from trains_agent.helper.process import Argv, PathLike
|
||||
from trains_agent.session import Session
|
||||
from clearml_agent.helper.base import select_for_platform, rm_tree, ExecutionInfo
|
||||
from clearml_agent.helper.package.base import PackageManager
|
||||
from clearml_agent.helper.process import Argv, PathLike
|
||||
from clearml_agent.session import Session
|
||||
from ..pip_api.system import SystemPip
|
||||
from ..requirements import RequirementsManager
|
||||
|
||||
|
||||
@@ -5,8 +5,8 @@ import attr
|
||||
import sys
|
||||
import os
|
||||
from pathlib2 import Path
|
||||
from trains_agent.helper.process import Argv, DEVNULL, check_if_command_exists
|
||||
from trains_agent.session import Session, POETRY
|
||||
from clearml_agent.helper.process import Argv, DEVNULL, check_if_command_exists
|
||||
from clearml_agent.session import Session, POETRY
|
||||
|
||||
|
||||
def prop_guard(prop, log_prop=None):
|
||||
|
||||
@@ -14,12 +14,12 @@ from pathlib2 import Path
|
||||
from pyhocon import ConfigTree
|
||||
|
||||
import six
|
||||
from trains_agent.definitions import PIP_EXTRA_INDICES
|
||||
from trains_agent.helper.base import warning, is_conda, which, join_lines, is_windows_platform
|
||||
from trains_agent.helper.process import Argv, PathLike
|
||||
from trains_agent.session import Session, normalize_cuda_version
|
||||
from trains_agent.external.requirements_parser import parse
|
||||
from trains_agent.external.requirements_parser.requirement import Requirement
|
||||
from clearml_agent.definitions import PIP_EXTRA_INDICES
|
||||
from clearml_agent.helper.base import warning, is_conda, which, join_lines, is_windows_platform
|
||||
from clearml_agent.helper.process import Argv, PathLike
|
||||
from clearml_agent.session import Session, normalize_cuda_version
|
||||
from clearml_agent.external.requirements_parser import parse
|
||||
from clearml_agent.external.requirements_parser.requirement import Requirement
|
||||
|
||||
from .translator import RequirementsTranslator
|
||||
|
||||
|
||||
@@ -3,7 +3,7 @@ from typing import Text
|
||||
from furl import furl
|
||||
from pathlib2 import Path
|
||||
|
||||
from trains_agent.config import Config
|
||||
from clearml_agent.config import Config
|
||||
from .pip_api.system import SystemPip
|
||||
|
||||
|
||||
|
||||
@@ -4,8 +4,8 @@ import requests
|
||||
from pathlib2 import Path
|
||||
|
||||
import six
|
||||
from trains_agent.definitions import CONFIG_DIR
|
||||
from trains_agent.helper.process import Argv, DEVNULL
|
||||
from clearml_agent.definitions import CONFIG_DIR
|
||||
from clearml_agent.helper.process import Argv, DEVNULL
|
||||
from .pip_api.venv import VirtualenvPip
|
||||
|
||||
|
||||
|
||||
@@ -20,8 +20,8 @@ from future.builtins import super
|
||||
from pathlib2 import Path
|
||||
|
||||
import six
|
||||
from trains_agent.definitions import PROGRAM_NAME, CONFIG_FILE
|
||||
from trains_agent.helper.base import bash_c, is_windows_platform, select_for_platform, chain_map
|
||||
from clearml_agent.definitions import PROGRAM_NAME, CONFIG_FILE
|
||||
from clearml_agent.helper.base import bash_c, is_windows_platform, select_for_platform, chain_map
|
||||
|
||||
PathLike = Union[Text, Path]
|
||||
|
||||
|
||||
@@ -13,18 +13,18 @@ from pathlib2 import Path
|
||||
|
||||
import six
|
||||
|
||||
from trains_agent.definitions import ENV_AGENT_GIT_USER, ENV_AGENT_GIT_PASS, ENV_AGENT_GIT_HOST
|
||||
from trains_agent.helper.console import ensure_text, ensure_binary
|
||||
from trains_agent.errors import CommandFailedError
|
||||
from trains_agent.helper.base import (
|
||||
from clearml_agent.definitions import ENV_AGENT_GIT_USER, ENV_AGENT_GIT_PASS, ENV_AGENT_GIT_HOST
|
||||
from clearml_agent.helper.console import ensure_text, ensure_binary
|
||||
from clearml_agent.errors import CommandFailedError
|
||||
from clearml_agent.helper.base import (
|
||||
select_for_platform,
|
||||
rm_tree,
|
||||
ExecutionInfo,
|
||||
normalize_path,
|
||||
create_file_if_not_exists,
|
||||
)
|
||||
from trains_agent.helper.process import DEVNULL, Argv, PathLike, COMMAND_SUCCESS
|
||||
from trains_agent.session import Session
|
||||
from clearml_agent.helper.process import DEVNULL, Argv, PathLike, COMMAND_SUCCESS
|
||||
from clearml_agent.session import Session
|
||||
|
||||
|
||||
class VcsFactory(object):
|
||||
|
||||
@@ -11,7 +11,7 @@ from typing import Text, Sequence
|
||||
import attr
|
||||
import psutil
|
||||
from pathlib2 import Path
|
||||
from trains_agent.session import Session
|
||||
from clearml_agent.session import Session
|
||||
|
||||
try:
|
||||
from .gpu import gpustat
|
||||
@@ -81,7 +81,7 @@ class ResourceMonitor(object):
|
||||
# active_gpus == False means no GPU reporting
|
||||
self._active_gpus = False
|
||||
elif not self._gpustat:
|
||||
log.warning('Trains-Agent Resource Monitor: GPU monitoring is not available')
|
||||
log.warning('ClearML-Agent Resource Monitor: GPU monitoring is not available')
|
||||
else:
|
||||
# None means no filtering, report all gpus
|
||||
self._active_gpus = None
|
||||
|
||||
@@ -3,7 +3,7 @@ from datetime import datetime, timedelta
|
||||
|
||||
from typing import List, Tuple, Optional
|
||||
|
||||
from trains_agent.backend_config.defs import UptimeConf
|
||||
from clearml_agent.backend_config.defs import UptimeConf
|
||||
|
||||
DAYS = ["SUN", "MON", "TUE", "WED", "THU", "FRI", "SAT"]
|
||||
PATTERN = re.compile(r"^(?P<hours>[^\s]+)\s(?P<days>[^\s]+)")
|
||||
|
||||
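Note on the hunk above: `DAYS` and `PATTERN` are the pieces that split an uptime/downtime string such as "17-20 TUE" into an hours part and a days part. The following is a minimal sketch of that split, not the project's parser. The function name `split_uptime_spec` is hypothetical, and it assumes the simplest form only (one `start-end` hour range and a single day name); the real module may accept richer syntax.

```python
import re

# Copied from the hunk above.
DAYS = ["SUN", "MON", "TUE", "WED", "THU", "FRI", "SAT"]
PATTERN = re.compile(r"^(?P<hours>[^\s]+)\s(?P<days>[^\s]+)")


def split_uptime_spec(spec):
    """Hypothetical helper: split "<hours> <days>" into (start, end, day).

    Assumes a single "start-end" hour range and a single day name.
    """
    match = PATTERN.match(spec.strip())
    if not match:
        raise ValueError("expected '<hours> <days>', got: {!r}".format(spec))

    # "17-20" -> 17, 20
    start, end = (int(part) for part in match.group("hours").split("-"))

    day = match.group("days").upper()
    if day not in DAYS:
        raise ValueError("unknown day: {!r}".format(day))

    return start, end, day


if __name__ == "__main__":
    print(split_uptime_spec("17-20 TUE"))
```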
@@ -6,12 +6,12 @@ from tempfile import gettempdir, NamedTemporaryFile
|
||||
|
||||
from typing import List, Tuple, Optional
|
||||
|
||||
from trains_agent.definitions import ENV_DOCKER_HOST_MOUNT
|
||||
from trains_agent.helper.base import warning
|
||||
from clearml_agent.definitions import ENV_DOCKER_HOST_MOUNT
|
||||
from clearml_agent.helper.base import warning
|
||||
|
||||
|
||||
class Singleton(object):
|
||||
prefix = '.trainsagent'
|
||||
prefix = '.clearmlagent'
|
||||
sep = '_'
|
||||
ext = '.tmp'
|
||||
worker_id = None
|
||||
|
||||
@@ -3,7 +3,7 @@ from functools import partial
|
||||
from importlib import import_module
|
||||
import argparse
|
||||
|
||||
from trains_agent.definitions import PROGRAM_NAME
|
||||
from clearml_agent.definitions import PROGRAM_NAME
|
||||
from .base import Parser, base_arguments, add_service, OnlyPluralChoicesHelpFormatter
|
||||
|
||||
SERVICES = [
|
||||
|
||||
@@ -8,10 +8,10 @@ from functools import partial
|
||||
import six
|
||||
from pathlib2 import Path
|
||||
|
||||
from trains_agent import definitions
|
||||
from trains_agent.session import Session
|
||||
from clearml_agent import definitions
|
||||
from clearml_agent.session import Session
|
||||
|
||||
HEADER = 'TRAINS-AGENT Deep Learning DevOps'
|
||||
HEADER = 'CLEARML-AGENT Deep Learning DevOps'
|
||||
|
||||
|
||||
class Parser(argparse.ArgumentParser):
|
||||
@@ -192,7 +192,7 @@ def add_service(subparsers, name, commands, command_name_dest='command', formatt
|
||||
parser_class=partial(Parser, usage_on_error=False),
|
||||
dest='action')
|
||||
|
||||
# This is a fix for a bug in python3's argparse: running "trains-agent some_service" fails
|
||||
# This is a fix for a bug in python3's argparse: running "clearml-agent some_service" fails
|
||||
service_subparsers.required = True
|
||||
|
||||
for name, subparser in commands.pop('subparsers', {}).items():
|
||||
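Note on the hunk above: the comment refers to Python 3's argparse treating sub-commands as optional unless told otherwise. The sketch below shows the same `required = True` workaround on a toy parser. The program name `demo-agent` and the `list` sub-command are placeholders for illustration, not part of clearml-agent.

```python
import argparse


def build_parser():
    """Toy parser showing the `required = True` workaround."""
    parser = argparse.ArgumentParser(prog="demo-agent")
    subparsers = parser.add_subparsers(dest="action")
    # Without this line, Python 3 accepts an empty command line and
    # leaves `action` set to None instead of reporting an error.
    subparsers.required = True
    subparsers.add_parser("list")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["list"])
    print(args.action)
```

Setting `dest` alongside `required` matters here, since argparse uses the destination name when it builds the "arguments are required" error message.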
@@ -368,8 +368,8 @@ def base_arguments(top_parser):
|
||||
top_parser.add_argument(
|
||||
'--version',
|
||||
action='version',
|
||||
version='TRAINS-AGENT version %s' % Session.version,
|
||||
help='TRAINS-AGENT version number')
|
||||
version='CLEARML-AGENT version %s' % Session.version,
|
||||
help='CLEARML-AGENT version number')
|
||||
top_parser.add_argument(
|
||||
'--config-file',
|
||||
help='Use a different configuration file (default: "{}")'.format(definitions.CONFIG_FILE))
|
||||
|
||||
@@ -1,8 +1,8 @@
|
||||
import argparse
|
||||
from textwrap import dedent
|
||||
|
||||
from trains_agent.helper.base import warning, is_windows_platform
|
||||
from trains_agent.interface.base import foreign_object_id
|
||||
from clearml_agent.helper.base import warning, is_windows_platform
|
||||
from clearml_agent.interface.base import foreign_object_id
|
||||
|
||||
|
||||
class DeprecatedFlag(argparse.Action):
|
||||
@@ -57,7 +57,7 @@ DAEMON_ARGS = dict({
|
||||
'group': 'Docker support',
|
||||
},
|
||||
'--force-current-version': {
|
||||
'help': 'Force trains-agent to use the current trains-agent version when running in the docker',
|
||||
'help': 'Force clearml-agent to use the current clearml-agent version when running in the docker',
|
||||
'action': 'store_true',
|
||||
'group': 'Docker support',
|
||||
},
|
||||
@@ -94,14 +94,14 @@ DAEMON_ARGS = dict({
|
||||
'action': 'store_true',
|
||||
},
|
||||
'--uptime': {
|
||||
'help': 'Specify uptime for trains-agent in "<hours> <days>" format. for example, use "17-20 TUE" to set '
|
||||
'help': 'Specify uptime for clearml-agent in "<hours> <days>" format. for example, use "17-20 TUE" to set '
|
||||
'Tuesday\'s uptime to 17-20'
|
||||
'Note: Make sure to have only one of uptime/downtime configuration and not both.',
|
||||
'nargs': '*',
|
||||
'default': None,
|
||||
},
|
||||
'--downtime': {
|
||||
'help': 'Specify uptime for trains-agent in "<hours> <days>" format. for example, use "09-13 TUE" to set '
|
||||
'help': 'Specify uptime for clearml-agent in "<hours> <days>" format. for example, use "09-13 TUE" to set '
|
||||
'Tuesday\'s downtime to 09-13'
|
||||
'Note: Make sure to have only on of uptime/downtime configuration and not both.',
|
||||
'nargs': '*',
|
||||
@@ -199,13 +199,13 @@ COMMANDS = {
|
||||
'help': 'List all worker machines and status',
|
||||
},
|
||||
'daemon': {
|
||||
'help': 'Start Trains-Agent daemon worker',
|
||||
'help': 'Start ClearML-Agent daemon worker',
|
||||
'args': DAEMON_ARGS,
|
||||
},
|
||||
'config': {
|
||||
'help': 'Check daemon configuration and print it',
|
||||
},
|
||||
'init': {
|
||||
'help': 'Trains-Agent configuration wizard',
|
||||
'help': 'ClearML-Agent configuration wizard',
|
||||
}
|
||||
}
|
||||
|
||||
@@ -12,13 +12,13 @@ import attr
|
||||
from pathlib2 import Path
|
||||
from pyhocon import ConfigFactory, HOCONConverter, ConfigTree
|
||||
|
||||
from trains_agent.backend_api.session import Session as _Session, Request
|
||||
from trains_agent.backend_api.session.client import APIClient
|
||||
from trains_agent.backend_config.defs import LOCAL_CONFIG_FILE_OVERRIDE_VAR, LOCAL_CONFIG_FILES
|
||||
from trains_agent.definitions import ENVIRONMENT_CONFIG, ENV_TASK_EXECUTE_AS_USER, ENVIRONMENT_BACKWARD_COMPATIBLE
|
||||
from trains_agent.errors import APIError
|
||||
from trains_agent.helper.base import HOCONEncoder
|
||||
from trains_agent.helper.process import Argv
|
||||
from clearml_agent.backend_api.session import Session as _Session, Request
|
||||
from clearml_agent.backend_api.session.client import APIClient
|
||||
from clearml_agent.backend_config.defs import LOCAL_CONFIG_FILE_OVERRIDE_VAR, LOCAL_CONFIG_FILES
|
||||
from clearml_agent.definitions import ENVIRONMENT_CONFIG, ENV_TASK_EXECUTE_AS_USER, ENVIRONMENT_BACKWARD_COMPATIBLE
|
||||
from clearml_agent.errors import APIError
|
||||
from clearml_agent.helper.base import HOCONEncoder
|
||||
from clearml_agent.helper.process import Argv
|
||||
from .version import __version__
|
||||
|
||||
POETRY = "poetry"
|
||||
@@ -70,7 +70,7 @@ class Session(_Session):
|
||||
if kwargs.get('config_file'):
|
||||
config_file = Path(os.path.expandvars(kwargs.get('config_file'))).expanduser().absolute().as_posix()
|
||||
kwargs['config_file'] = config_file
|
||||
os.environ[LOCAL_CONFIG_FILE_OVERRIDE_VAR] = config_file
|
||||
LOCAL_CONFIG_FILE_OVERRIDE_VAR.set(config_file)
|
||||
if not Path(config_file).is_file():
|
||||
raise ValueError("Could not open configuration file: {}".format(config_file))
|
||||
|
||||
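Note on the hunk above: `LOCAL_CONFIG_FILE_OVERRIDE_VAR` changes from a plain string used as an `os.environ` key to an object with `.set()` and `.get()`. The diff does not show that object's definition, so the class below is only a guess at the general shape: a small wrapper that reads from several variable names in order, which would allow a new name and a legacy name to coexist. The class name `EnvVar` and the variable names in the usage example are hypothetical.

```python
import os


class EnvVar(object):
    """Hypothetical wrapper over one or more environment variable names."""

    def __init__(self, key, *fallback_keys):
        # The first key is the preferred name; the rest are fallbacks.
        self.keys = (key,) + fallback_keys

    def get(self, default=None):
        """Return the first non-empty value among the known names."""
        for key in self.keys:
            value = os.environ.get(key)
            if value:
                return value
        return default

    def set(self, value):
        """Write the value under the preferred (first) name."""
        os.environ[self.keys[0]] = str(value)


if __name__ == "__main__":
    override = EnvVar("DEMO_CONFIG_FILE", "DEMO_LEGACY_CONFIG_FILE")
    override.set("~/demo.conf")
    print(override.get())
```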
@@ -88,7 +88,7 @@ class Session(_Session):
|
||||
os.environ['CUDA_VISIBLE_DEVICES'] = os.environ['NVIDIA_VISIBLE_DEVICES'] = kwargs.get('gpus')
|
||||
|
||||
if kwargs.get('only_load_config'):
|
||||
from trains_agent.backend_api.config import load
|
||||
from clearml_agent.backend_api.config import load
|
||||
self.config = load()
|
||||
else:
|
||||
super(Session, self).__init__(*args, **kwargs)
|
||||
@@ -99,8 +99,12 @@ class Session(_Session):
|
||||
|
||||
self.log = self.get_logger(__name__)
|
||||
self.trace = kwargs.get('trace', False)
|
||||
self._config_file = kwargs.get('config_file') or \
|
||||
os.environ.get(LOCAL_CONFIG_FILE_OVERRIDE_VAR) or LOCAL_CONFIG_FILES[0]
|
||||
self._config_file = kwargs.get('config_file') or LOCAL_CONFIG_FILE_OVERRIDE_VAR.get()
|
||||
if not self._config_file:
|
||||
for f in reversed(LOCAL_CONFIG_FILES):
|
||||
if os.path.exists(os.path.expanduser(os.path.expandvars(f))):
|
||||
self._config_file = f
|
||||
break
|
||||
self.api_client = APIClient(session=self, api_version="2.5")
|
||||
# HACK make sure we have python version to execute,
|
||||
# if nothing was specific, use the one that runs us
|
||||
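Note on the hunk above: the new lookup order is the explicit `config_file` argument, then the environment override, then the first existing file found when walking `LOCAL_CONFIG_FILES` in reverse. Below is a standalone sketch of that order. The function name `resolve_config_file` and its parameters are illustrative and do not appear in the commit.

```python
import os


def resolve_config_file(explicit, env_override, candidates):
    """Return the configuration file path to use, or None.

    Order: explicit argument, environment override, then the last
    entry in `candidates` that exists on disk.
    """
    if explicit:
        return explicit
    if env_override:
        return env_override
    for candidate in reversed(candidates):
        expanded = os.path.expanduser(os.path.expandvars(candidate))
        if os.path.exists(expanded):
            # Return the unexpanded entry, as the hunk does.
            return candidate
    return None


if __name__ == "__main__":
    print(resolve_config_file(None, None, ["~/legacy.conf", "~/current.conf"]))
```

Unlike the removed code, which always fell back to `LOCAL_CONFIG_FILES[0]`, this order can leave the result empty when none of the candidates exists.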
@@ -147,7 +151,7 @@ class Session(_Session):
|
||||
|
||||
# initialize cuda versions
|
||||
try:
|
||||
from trains_agent.helper.package.requirements import RequirementsManager
|
||||
from clearml_agent.helper.package.requirements import RequirementsManager
|
||||
agent = self.config['agent']
|
||||
agent['cuda_version'], agent['cudnn_version'] = \
|
||||
RequirementsManager.get_cuda_version(self.config) if not cpu_only else ('0', '0')
|
||||
@@ -202,7 +206,7 @@ class Session(_Session):
|
||||
'agent.docker_pip_cache', 'agent.docker_apt_cache')
|
||||
singleton_folders = ('agent.venvs_dir', 'agent.vcs_cache.path', 'agent.docker_apt_cache')
|
||||
|
||||
if os.environ.get(ENV_TASK_EXECUTE_AS_USER):
|
||||
if ENV_TASK_EXECUTE_AS_USER.get():
|
||||
folder_keys = tuple(list(folder_keys) + ['sdk.storage.cache.default_base_dir'])
|
||||
singleton_folders = tuple(list(singleton_folders) + ['sdk.storage.cache.default_base_dir'])
|
||||
|
||||
@@ -257,7 +261,7 @@ class Session(_Session):
|
||||
config = ConfigFactory.from_dict(config)
|
||||
self.log.debug("Run by interpreter: %s", sys.executable)
|
||||
print(
|
||||
"Current configuration (trains_agent v{}, location: {}):\n"
|
||||
"Current configuration (clearml_agent v{}, location: {}):\n"
|
||||
"----------------------\n{}\n".format(
|
||||
self.version, self._config_file, HOCONConverter.convert(config, "properties")
|
||||
)
|
||||
|
||||
@@ -1 +1 @@
|
||||
__version__ = '0.16.3'
|
||||
__version__ = '0.0.0'
|
||||
|
||||
303
docs/clearml.conf
Normal file
@@ -0,0 +1,303 @@
|
||||
# CLEARML-AGENT configuration file
|
||||
api {
|
||||
api_server: https://demoapi.demo.clear.ml
|
||||
web_server: https://demoapp.demo.clear.ml
|
||||
files_server: https://demofiles.demo.clear.ml
|
||||
|
||||
# Credentials are generated in the webapp, https://demoapp.demo.clear.ml/profile
|
||||
# Overridden with os environment: CLEARML_API_ACCESS_KEY / CLEARML_API_SECRET_KEY
|
||||
credentials {"access_key": "EGRTCO8JMSIGI6S39GTP43NFWXDQOW", "secret_key": "x!XTov_G-#vspE*Y(h$Anm&DIc5Ou-F)jsl$PdOyj5wG1&E!Z8"}
|
||||
|
||||
# verify host ssl certificate, set to False only if you have a very good reason
|
||||
verify_certificate: True
|
||||
}
|
||||
|
||||
agent {
|
||||
# Set GIT user/pass credentials (if user/pass are set, GIT protocol will be set to https)
|
||||
# leave blank for GIT SSH credentials (set force_git_ssh_protocol=true to force SSH protocol)
|
||||
git_user=""
|
||||
git_pass=""
|
||||
# Limit credentials to a single domain, for example: github.com,
|
||||
# all other domains will use public access (no user/pass). Default: always send user/pass for any VCS domain
|
||||
git_host=""
|
||||
|
||||
# Force GIT protocol to use SSH regardless of the git url (Assumes GIT user/pass are blank)
|
||||
force_git_ssh_protocol: false
|
||||
# Force a specific SSH port when converting http to ssh links (the domain is kept the same)
|
||||
# force_git_ssh_port: ""
|
||||
|
||||
# unique name of this worker, if None, created based on hostname:process_id
|
||||
# Overridden with os environment: CLEARML_WORKER_NAME
|
||||
# worker_id: "clearml-agent-machine1:gpu0"
|
||||
worker_id: ""
|
||||
|
||||
# worker name, replaces the hostname when creating a unique name for this worker
|
||||
# Overridden with os environment: CLEARML_WORKER_ID
|
||||
# worker_name: "clearml-agent-machine1"
|
||||
worker_name: ""
|
||||
|
||||
# Set the python version to use when creating the virtual environment and launching the experiment
|
||||
# Example values: "/usr/bin/python3" or "/usr/local/bin/python3.6"
|
||||
# The default is the python executing the clearml_agent
|
||||
python_binary: ""
|
||||
|
||||
# select python package manager:
|
||||
# currently supported pip and conda
|
||||
# poetry is used if pip selected and repository contains poetry.lock file
|
||||
package_manager: {
|
||||
# supported options: pip, conda, poetry
|
||||
type: pip,
|
||||
|
||||
# specify pip version to use (examples "<20", "==19.3.1", "", empty string will install the latest version)
|
||||
# pip_version: "<20"
|
||||
|
||||
# virtual environment inherits packages from system
|
||||
system_site_packages: false,
|
||||
# install with --upgrade
|
||||
force_upgrade: false,
|
||||
|
||||
# additional artifact repositories to use when installing python packages
|
||||
# extra_index_url: ["https://allegroai.jfrog.io/clearml/api/pypi/public/simple"]
|
||||
extra_index_url: []
|
||||
|
||||
# additional conda channels to use when installing with conda package manager
|
||||
conda_channels: ["pytorch", "conda-forge", ]
|
||||
# conda_full_env_update: false
|
||||
# conda_env_as_base_docker: false
|
||||
|
||||
# set the priority packages to be installed before the rest of the required packages
|
||||
# priority_packages: ["cython", "numpy", "setuptools", ]
|
||||
|
||||
# set the optional priority packages to be installed before the rest of the required packages,
|
||||
# In case a package installation fails, the package will be ignored,
|
||||
# and the virtual environment process will continue
|
||||
# priority_optional_packages: ["pygobject", ]
|
||||
|
||||
# set the post packages to be installed after all the rest of the required packages
|
||||
# post_packages: ["horovod", ]
|
||||
|
||||
# set the optional post packages to be installed after all the rest of the required packages,
|
||||
# In case a package installation fails, the package will be ignored,
|
||||
# and the virtual environment process will continue
|
||||
# post_optional_packages: []
|
||||
|
||||
# set to True to support torch nightly build installation,
|
||||
# notice: torch nightly builds are ephemeral and are deleted from time to time
|
||||
torch_nightly: false,
|
||||
},
|
||||
|
||||
# target folder for virtual environments builds, created when executing experiment
|
||||
venvs_dir = ~/.clearml/venvs-builds
|
||||
|
||||
# cached git clone folder
|
||||
vcs_cache: {
|
||||
enabled: true,
|
||||
path: ~/.clearml/vcs-cache
|
||||
},
|
||||
|
||||
# use venv-update in order to accelerate python virtual environment building
|
||||
# Still in beta, turned off by default
|
||||
venv_update: {
|
||||
enabled: false,
|
||||
},
|
||||
|
||||
# cached folder for specific python package download (mostly pytorch versions)
|
||||
pip_download_cache {
|
||||
enabled: true,
|
||||
path: ~/.clearml/pip-download-cache
|
||||
},
|
||||
|
||||
translate_ssh: true,
|
||||
# reload configuration file every daemon execution
|
||||
reload_config: false,
|
||||
|
||||
# pip cache folder mapped into docker, used for python package caching
|
||||
docker_pip_cache = ~/.clearml/pip-cache
|
||||
# apt cache folder mapped into docker, used for ubuntu package caching
|
||||
docker_apt_cache = ~/.clearml/apt-cache
|
||||
|
||||
# optional arguments to pass to docker image
|
||||
# these are local for this agent and will not be updated in the experiment's docker_cmd section
|
||||
# extra_docker_arguments: ["--ipc=host", ]
|
||||
|
||||
# optional shell script to run in docker when started before the experiment is started
|
||||
# extra_docker_shell_script: ["apt-get install -y bindfs", ]
|
||||
|
||||
# set to true in order to force "docker pull" before running an experiment using a docker image.
|
||||
# This makes sure the docker image is updated.
|
||||
docker_force_pull: false
|
||||
|
||||
default_docker: {
|
||||
# default docker image to use when running in docker mode
|
||||
image: "nvidia/cuda:10.1-runtime-ubuntu18.04"
|
||||
|
||||
# optional arguments to pass to docker image
|
||||
# arguments: ["--ipc=host"]
|
||||
}
|
||||
|
||||
# CUDA versions used for Conda setup & solving PyTorch wheel packages
|
||||
# It should be detected automatically. Override with os environment CUDA_VERSION / CUDNN_VERSION
|
||||
# cuda_version: 10.1
|
||||
# cudnn_version: 7.6
|
||||
}
|
||||
|
||||
sdk {
|
||||
# CLEARML - default SDK configuration
|
||||
|
||||
storage {
|
||||
cache {
|
||||
# Defaults to system temp folder / cache
|
||||
default_base_dir: "~/.clearml/cache"
|
||||
}
|
||||
|
||||
direct_access: [
|
||||
# Objects matching are considered to be available for direct access, i.e. they will not be downloaded
|
||||
# or cached, and any download request will return a direct reference.
|
||||
# Objects are specified in glob format, available for url and content_type.
|
||||
{ url: "file://*" } # file-urls are always directly referenced
|
||||
]
|
||||
}
|
||||
|
||||
metrics {
|
||||
# History size for debug files per metric/variant. For each metric/variant combination with an attached file
|
||||
# (e.g. debug image event), file names for the uploaded files will be recycled in such a way that no more than
|
||||
# X files are stored in the upload destination for each metric/variant combination.
|
||||
file_history_size: 100
|
||||
|
||||
# Max history size for matplotlib imshow files per plot title.
|
||||
# File names for the uploaded images will be recycled in such a way that no more than
|
||||
# X images are stored in the upload destination for each matplotlib plot title.
|
||||
matplotlib_untitled_history_size: 100
|
||||
|
||||
# Limit the number of digits after the dot in plot reporting (reducing plot report size)
|
||||
# plot_max_num_digits: 5
|
||||
|
||||
# Settings for generated debug images
|
||||
images {
|
||||
format: JPEG
|
||||
quality: 87
|
||||
subsampling: 0
|
||||
}
|
||||
|
||||
# Support plot-per-graph fully matching Tensorboard behavior (i.e. if this is set to True, each series should have its own graph)
|
||||
tensorboard_single_series_per_graph: False
|
||||
}
|
||||
|
||||
network {
|
||||
metrics {
|
||||
# Number of threads allocated to uploading files (typically debug images) when transmitting metrics for
|
||||
# a specific iteration
|
||||
file_upload_threads: 4
|
||||
|
||||
# Warn about upload starvation if no uploads were made in specified period while file-bearing events keep
|
||||
# being sent for upload
|
||||
file_upload_starvation_warning_sec: 120
|
||||
}
|
||||
|
||||
iteration {
|
||||
# Max number of retries when getting frames if the server returned an error (http code 500)
|
||||
max_retries_on_server_error: 5
|
||||
# Backoff factor for consecutive retry attempts.
|
||||
# SDK will wait for {backoff factor} * (2 ^ ({number of total retries} - 1)) between retries.
|
||||
retry_backoff_factor_sec: 10
|
||||
}
|
||||
}
|
||||
aws {
|
||||
s3 {
|
||||
# S3 credentials, used for read/write access by various SDK elements
|
||||
|
||||
# default, used for any bucket not specified below
|
||||
key: ""
|
||||
secret: ""
|
||||
region: ""
|
||||
|
||||
credentials: [
|
||||
# specifies key/secret credentials to use when handling s3 urls (read or write)
|
||||
# {
|
||||
# bucket: "my-bucket-name"
|
||||
# key: "my-access-key"
|
||||
# secret: "my-secret-key"
|
||||
# },
|
||||
# {
|
||||
# # This will apply to all buckets in this host (unless key/value is specifically provided for a given bucket)
|
||||
# host: "my-minio-host:9000"
|
||||
# key: "12345678"
|
||||
# secret: "12345678"
|
||||
# multipart: false
|
||||
# secure: false
|
||||
# }
|
||||
]
|
||||
}
|
||||
boto3 {
|
||||
pool_connections: 512
|
||||
max_multipart_concurrency: 16
|
||||
}
|
||||
}
|
||||
google.storage {
|
||||
# # Default project and credentials file
|
||||
# # Will be used when no bucket configuration is found
|
||||
# project: "clearml"
|
||||
# credentials_json: "/path/to/credentials.json"
|
||||
|
||||
# # Specific credentials per bucket and sub directory
|
||||
# credentials = [
|
||||
# {
|
||||
# bucket: "my-bucket"
|
||||
# subdir: "path/in/bucket" # Not required
|
||||
# project: "clearml"
|
||||
# credentials_json: "/path/to/credentials.json"
|
||||
# },
|
||||
# ]
|
||||
}
|
||||
azure.storage {
|
||||
# containers: [
|
||||
# {
|
||||
# account_name: "clearml"
|
||||
# account_key: "secret"
|
||||
# # container_name:
|
||||
# }
|
||||
# ]
|
||||
}
|
||||
|
||||
log {
|
||||
# debugging feature: set this to true to make null log propagate messages to root logger (so they appear in stdout)
|
||||
null_log_propagate: False
|
||||
task_log_buffer_capacity: 66
|
||||
|
||||
# disable urllib info and lower levels
|
||||
disable_urllib3_info: True
|
||||
}
|
||||
|
||||
development {
|
||||
# Development-mode options
|
||||
|
||||
# dev task reuse window
|
||||
task_reuse_time_window_in_hours: 72.0
|
||||
|
||||
# Run VCS repository detection asynchronously
|
||||
vcs_repo_detect_async: True
|
||||
|
||||
# Store uncommitted git/hg source code diff in experiment manifest when training in development mode
|
||||
# This stores "git diff" or "hg diff" into the experiment's "script.requirements.diff" section
|
||||
store_uncommitted_code_diff_on_train: True
|
||||
|
||||
# Support stopping an experiment in case it was externally stopped, status was changed or task was reset
|
||||
support_stopping: True
|
||||
|
||||
# Default Task output_uri. if output_uri is not provided to Task.init, default_output_uri will be used instead.
|
||||
default_output_uri: ""
|
||||
|
||||
# Development mode worker
|
||||
worker {
|
||||
# Status report period in seconds
|
||||
report_period_sec: 2
|
||||
|
||||
# ping to the server - check connectivity
|
||||
ping_period_sec: 30
|
||||
|
||||
# Log all stdout & stderr
|
||||
log_stdout: True
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
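Note on `docs/clearml.conf` above: the comment on `retry_backoff_factor_sec` gives the wait between retries as {backoff factor} * (2 ^ ({number of total retries} - 1)). The short sketch below works through that formula with the defaults from the file (factor 10, 5 retries). The function name `retry_delays` is illustrative only.

```python
def retry_delays(backoff_factor_sec, max_retries):
    """List the wait in seconds before each retry, per the documented formula."""
    return [
        backoff_factor_sec * (2 ** (attempt - 1))
        for attempt in range(1, max_retries + 1)
    ]


if __name__ == "__main__":
    # Defaults from the file: retry_backoff_factor_sec=10, max_retries_on_server_error=5
    print(retry_delays(10, 5))  # [10, 20, 40, 80, 160]
```

With these defaults the waits add up to 310 seconds before the last retry is attempted.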
BIN
docs/clearml_agent_logo.png
Normal file
Binary file not shown. Size: 11 KiB
Binary file not shown. Size before: 845 KiB, after: 2.0 MiB
@@ -1,11 +1,11 @@
|
||||
# TRAINS-AGENT configuration file
|
||||
# CLEARML-AGENT configuration file - Please use ~/clearml.conf
|
||||
api {
|
||||
api_server: https://demoapi.trains.allegro.ai
|
||||
web_server: https://demoapp.trains.allegro.ai
|
||||
files_server: https://demofiles.trains.allegro.ai
|
||||
api_server: https://demoapi.demo.clear.ml
|
||||
web_server: https://demoapp.demo.clear.ml
|
||||
files_server: https://demofiles.demo.clear.ml
|
||||
|
||||
# Credentials are generated in the webapp, https://demoapp.trains.allegro.ai/profile
|
||||
# Overridden with os environment: TRAINS_API_ACCESS_KEY / TRAINS_API_SECRET_KEY
|
||||
# Credentials are generated in the webapp, https://demoapp.demo.clear.ml/profile
|
||||
# Overridden with os environment: CLEARML_API_ACCESS_KEY / CLEARML_API_SECRET_KEY
|
||||
credentials {"access_key": "EGRTCO8JMSIGI6S39GTP43NFWXDQOW", "secret_key": "x!XTov_G-#vspE*Y(h$Anm&DIc5Ou-F)jsl$PdOyj5wG1&E!Z8"}
|
||||
|
||||
# verify host ssl certificate, set to False only if you have a very good reason
|
||||
@@ -27,18 +27,18 @@ agent {
|
||||
# force_git_ssh_port: ""
|
||||
|
||||
# unique name of this worker, if None, created based on hostname:process_id
|
||||
# Overridden with os environment: TRAINS_WORKER_NAME
|
||||
# worker_id: "trains-agent-machine1:gpu0"
|
||||
# Overridden with os environment: CLEARML_WORKER_NAME
|
||||
# worker_id: "clearml-agent-machine1:gpu0"
|
||||
worker_id: ""
|
||||
|
||||
# worker name, replaces the hostname when creating a unique name for this worker
|
||||
# Overridden with os environment: TRAINS_WORKER_ID
|
||||
# worker_name: "trains-agent-machine1"
|
||||
# Overridden with os environment: CLEARML_WORKER_ID
|
||||
# worker_name: "clearml-agent-machine1"
|
||||
worker_name: ""
|
||||
|
||||
# Set the python version to use when creating the virtual environment and launching the experiment
|
||||
# Example values: "/usr/bin/python3" or "/usr/local/bin/python3.6"
|
||||
# The default is the python executing the trains_agent
|
||||
# The default is the python executing the clearml_agent
|
||||
python_binary: ""
|
||||
|
||||
# select python package manager:
|
||||
@@ -57,7 +57,7 @@ agent {
|
||||
force_upgrade: false,
|
||||
|
||||
# additional artifact repositories to use when installing python packages
|
||||
# extra_index_url: ["https://allegroai.jfrog.io/trains/api/pypi/public/simple"]
|
||||
# extra_index_url: ["https://allegroai.jfrog.io/clearml/api/pypi/public/simple"]
|
||||
extra_index_url: []
|
||||
|
||||
# additional conda channels to use when installing with conda package manager
|
||||
@@ -87,12 +87,12 @@ agent {
|
||||
},
|
||||
|
||||
# target folder for virtual environments builds, created when executing experiment
|
||||
venvs_dir = ~/.trains/venvs-builds
|
||||
venvs_dir = ~/.clearml/venvs-builds
|
||||
|
||||
# cached git clone folder
|
||||
vcs_cache: {
|
||||
enabled: true,
|
||||
path: ~/.trains/vcs-cache
|
||||
path: ~/.clearml/vcs-cache
|
||||
},
|
||||
|
||||
# use venv-update in order to accelerate python virtual environment building
|
||||
@@ -104,7 +104,7 @@ agent {
|
||||
# cached folder for specific python package download (mostly pytorch versions)
|
||||
pip_download_cache {
|
||||
enabled: true,
|
||||
path: ~/.trains/pip-download-cache
|
||||
path: ~/.clearml/pip-download-cache
|
||||
},
|
||||
|
||||
translate_ssh: true,
|
||||
@@ -112,9 +112,9 @@ agent {
|
||||
reload_config: false,
|
||||
|
||||
# pip cache folder mapped into docker, used for python package caching
|
||||
docker_pip_cache = ~/.trains/pip-cache
|
||||
docker_pip_cache = ~/.clearml/pip-cache
|
||||
# apt cache folder mapped into docker, used for ubuntu package caching
|
||||
docker_apt_cache = ~/.trains/apt-cache
|
||||
docker_apt_cache = ~/.clearml/apt-cache
|
||||
|
||||
# optional arguments to pass to docker image
|
||||
# these are local for this agent and will not be updated in the experiment's docker_cmd section
|
||||
@@ -142,12 +142,12 @@ agent {
}

sdk {
# TRAINS - default SDK configuration
# CLEARML - default SDK configuration

storage {
cache {
# Defaults to system temp folder / cache
default_base_dir: "~/.trains/cache"
default_base_dir: "~/.clearml/cache"
}

direct_access: [
@@ -236,7 +236,7 @@ sdk {
google.storage {
# # Default project and credentials file
# # Will be used when no bucket configuration is found
# project: "trains"
# project: "clearml"
# credentials_json: "/path/to/credentials.json"

# # Specific credentials per bucket and sub directory
@@ -244,7 +244,7 @@ sdk {
# {
# bucket: "my-bucket"
# subdir: "path/in/bucket" # Not required
# project: "trains"
# project: "clearml"
# credentials_json: "/path/to/credentials.json"
# },
# ]
@@ -252,7 +252,7 @@ sdk {
azure.storage {
# containers: [
# {
# account_name: "trains"
# account_name: "clearml"
# account_key: "secret"
# # container_name:
# }
@@ -5,7 +5,7 @@ An example script that cleans up failed experiments by moving them to the archiv
import argparse
from datetime import datetime

from trains_agent import APIClient
from clearml_agent import APIClient

parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument("--project", "-P", help="Project ID. Only clean up experiments from this project")
@@ -5,7 +5,7 @@ The K8sIntegration component will label each pod accordingly.
"""
from argparse import ArgumentParser

from trains_agent.glue.k8s import K8sIntegration
from clearml_agent.glue.k8s import K8sIntegration


def parse_args():
@@ -31,7 +31,7 @@ def parse_args():
help="Used in conjunction with ports-mode, specify the external address of the k8s ingress / ELB"
)
parser.add_argument(
"--pod-trains-conf", type=str,
"--pod-clearml-conf", type=str,
help="Configuration file to be used by the pod itself (if not provided, current configuration is used)"
)
parser.add_argument(
2
main.py
@@ -1,5 +1,5 @@
import sys
from trains_agent import __main__
from clearml_agent import __main__


if __name__ == "__main__":
@@ -9,7 +9,7 @@ psutil>=3.4.2,<5.9.0
pyhocon>=0.3.38,<0.4.0
pyparsing>=2.0.3,<2.5.0
python-dateutil>=2.4.2,<2.9.0
pyjwt>=1.6.4,<2.0.0
pyjwt>=1.6.4,<1.8.0
PyYAML>=3.12,<5.4.0
requests-file>=1.4.2,<1.6.0
requests>=2.20.0,<2.26.0
25
setup.py
@@ -1,17 +1,19 @@
"""
TRAINS - Artificial Intelligence Version Control
TRAINS-AGENT DevOps for machine/deep learning
https://github.com/allegroai/trains-agent
ClearML - Artificial Intelligence Version Control
CLEARML-AGENT DevOps for machine/deep learning
https://github.com/allegroai/clearml-agent
"""

import os.path
# Always prefer setuptools over distutils
from setuptools import setup, find_packages


def read_text(filepath):
with open(filepath, "r") as f:
return f.read()


here = os.path.dirname(__file__)
# Get the long description from the README file
long_description = read_text(os.path.join(here, 'README.md'))
@@ -26,20 +28,20 @@ def read_version_string(version_file):
raise RuntimeError("Unable to find version string.")


version = read_version_string("trains_agent/version.py")
version = read_version_string("clearml_agent/version.py")

requirements = read_text(os.path.join(here, 'requirements.txt')).splitlines()

setup(
name='trains_agent',
name='clearml_agent',
version=version,
description='TRAINS Agent - Auto-Magical DevOps for Deep Learning',
description='ClearML Agent - Auto-Magical DevOps for Deep Learning',
long_description=long_description,
long_description_content_type='text/markdown',
# The project's main homepage.
url='https://github.com/allegroai/trains-agent',
url='https://github.com/allegroai/clearml-agent',
author='Allegroai',
author_email='trains@allegro.ai',
author_email='clearml@allegro.ai',
license='Apache License 2.0',
classifiers=[
'Development Status :: 4 - Beta',
@@ -57,10 +59,11 @@ setup(
'Programming Language :: Python :: 3.5',
'Programming Language :: Python :: 3.6',
'Programming Language :: Python :: 3.7',
'Programming Language :: Python :: 3.8',
'License :: OSI Approved :: Apache Software License',
],

keywords='trains devops machine deep learning agent automation hpc cluster',
keywords='clearml trains devops machine deep learning agent automation hpc cluster',

packages=find_packages(exclude=['contrib', 'docs', 'data', 'examples', 'tests*']),
@@ -68,13 +71,13 @@ setup(
extras_require={
},
package_data={
'trains_agent': ['backend_api/config/default/*.conf']
'clearml_agent': ['backend_api/config/default/*.conf']
},
include_package_data=True,
# To provide executable scripts, use entry points in preference to the
# "scripts" keyword. Entry points provide cross-platform support and allow
# pip to create the appropriate form of executable for the target platform.
entry_points={
'console_scripts': ['trains-agent=trains_agent.__main__:main'],
'console_scripts': ['clearml-agent=clearml_agent.__main__:main'],
},
)
@@ -9,8 +9,8 @@ PROJECT_ROOT = Path(__file__).resolve().parents[1]


@pytest.fixture(scope='function')
def run_trains_agent(script_runner):
""" Execute trains_agent agent app in subprocess and return stdout as a string.
def run_clearml_agent(script_runner):
""" Execute clearml_agent agent app in subprocess and return stdout as a string.
Args:
script_runner (object): a pytest plugin for testing python scripts
installed via console_scripts entry point of setup.py.
@@ -23,17 +23,17 @@ def run_trains_agent(script_runner):
string: The return value. stdout output
"""
def _method(*args):
trains_agent_file = str(PROJECT_ROOT / "trains_agent.sh")
ret = script_runner.run(trains_agent_file, *args)
clearml_agent_file = str(PROJECT_ROOT / "clearml_agent.sh")
ret = script_runner.run(clearml_agent_file, *args)
return ret
return _method


@pytest.fixture(scope='function')
def trains_agentyaml(tmpdir):
def clearml_agentyaml(tmpdir):
@contextmanager
def _method(template_file):
file = tmpdir.join("trains_agent.yaml")
file = tmpdir.join("clearml_agent.yaml")
with (PROJECT_ROOT / "tests/templates" / template_file).open() as f:
code = yaml.load(f, Loader=yaml.SafeLoader)
yield Namespace(code=code, file=file.strpath)
@@ -43,7 +43,7 @@ def trains_agentyaml(tmpdir):

# class Test(object):
# def yaml_file(self, tmpdir, template_file):
# file = tmpdir.join("trains_agent.yaml")
# file = tmpdir.join("clearml_agent.yaml")
# with open(template_file) as f:
# test_object = yaml.load(f)
# self.let(test_object)
@@ -1,6 +1,6 @@
import pytest

from trains_agent.helper.repo import VCS
from clearml_agent.helper.repo import VCS


@pytest.mark.parametrize(
@@ -1,7 +1,7 @@
import pytest
from furl import furl

from trains_agent.helper.package.translator import RequirementsTranslator
from clearml_agent.helper.package.translator import RequirementsTranslator


@pytest.mark.parametrize(
@@ -4,7 +4,7 @@ import requests
from furl import furl

import six
from trains_agent.helper.package.pytorch import PytorchRequirement
from clearml_agent.helper.package.pytorch import PytorchRequirement


@attr.s
@@ -1,6 +1,6 @@
import pytest

from trains_agent.helper.repo import Git
from clearml_agent.helper.repo import Git

NO_CHANGE = object()
@@ -1,6 +1,6 @@
"""
Test handling of jupyter notebook tasks.
Logging is enabled in `trains_agent/tests/pytest.ini`. Search for `pytest live logging` for more info.
Logging is enabled in `clearml_agent/tests/pytest.ini`. Search for `pytest live logging` for more info.
"""
import logging
import re
@@ -11,12 +11,12 @@ from contextlib import contextmanager
from typing import Iterator, ContextManager, Sequence, IO, Text
from uuid import uuid4

from trains_agent.backend_api.services import tasks
from trains_agent.backend_api.session.client import APIClient
from clearml_agent.backend_api.services import tasks
from clearml_agent.backend_api.session.client import APIClient
from pathlib2 import Path
from pytest import fixture

from trains_agent.helper.process import Argv
from clearml_agent.helper.process import Argv

logging.getLogger("urllib3").setLevel(logging.CRITICAL)
log = logging.getLogger(__name__)
@@ -51,7 +51,7 @@ def select_read(file_obj, timeout):


def run_task(task):
return Argv("trains_agent", "--debug", "worker", "execute", "--id", task.id)
return Argv("clearml_agent", "--debug", "worker", "execute", "--id", task.id)


@contextmanager
@@ -10,12 +10,12 @@ from sys import platform as sys_platform
import pytest
import requirements

from trains_agent.commands.worker import Worker
from trains_agent.helper.package.pytorch import PytorchRequirement
from trains_agent.helper.package.requirements import RequirementsManager, \
from clearml_agent.commands.worker import Worker
from clearml_agent.helper.package.pytorch import PytorchRequirement
from clearml_agent.helper.package.requirements import RequirementsManager, \
RequirementSubstitution, MarkerRequirement
from trains_agent.helper.process import get_bash_output
from trains_agent.session import Session
from clearml_agent.helper.process import get_bash_output
from clearml_agent.session import Session

_cuda_based_packages_hack = ('seematics.caffe', 'lightnet')