clearml-docs/docs/guides/services/aws_autoscaler.md
2023-04-16 10:13:04 +03:00

---
title: ClearML AWS Autoscaler Service
---
The ClearML [AWS autoscaler example](https://github.com/allegroai/clearml/blob/master/examples/services/aws-autoscaler/aws_autoscaler.py)
demonstrates how to use the [`clearml.automation.auto_scaler`](https://github.com/allegroai/clearml/blob/master/clearml/automation/auto_scaler.py)
module to implement a service that optimizes AWS EC2 instance scaling according to a defined instance budget.
The autoscaler periodically polls your AWS cluster and automatically stops idle instances based on a defined maximum idle time or spins
up new instances when there aren't enough to execute pending tasks.
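
The decision made on each polling cycle can be pictured with a minimal sketch. It is a simplified, single-queue illustration and not the code in `clearml.automation.auto_scaler`; the names `Instance` and `plan_scaling` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    instance_id: str
    idle_minutes: float  # time since the instance last had a task to run


def plan_scaling(instances, pending_tasks, max_instances, max_idle_minutes):
    """Return (ids_to_stop, number_to_start) for one polling cycle.

    Simplifying assumption: while tasks are pending, nothing is stopped,
    and new instances are started only up to the budget limit.
    """
    if pending_tasks > 0:
        room = max(0, max_instances - len(instances))
        return [], min(pending_tasks, room)
    # No pending work: stop instances that reached the maximum idle time.
    to_stop = [i.instance_id for i in instances if i.idle_minutes >= max_idle_minutes]
    return to_stop, 0


if __name__ == "__main__":
    fleet = [Instance("i-a", idle_minutes=20), Instance("i-b", idle_minutes=3)]
    print(plan_scaling(fleet, pending_tasks=0, max_instances=3, max_idle_minutes=15))
    print(plan_scaling(fleet, pending_tasks=5, max_instances=3, max_idle_minutes=15))
```

With a budget of three instances and two already running, five pending tasks lead to only one additional instance being requested.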
## Running the ClearML AWS Autoscaler
Run the ClearML AWS autoscaler in one of these ways:
* Run the [aws_autoscaler.py](https://github.com/allegroai/clearml/blob/master/examples/services/aws-autoscaler/aws_autoscaler.py)
script locally
* Launch through your [`services` queue](../../clearml_agent.md#services-mode)
:::note Default AMI
By default, the autoscaler service uses the `NVIDIA Deep Learning AMI v20.11.0-46a68101-e56b-41cd-8e32-631ac6e5d02b` AMI.
:::
### Running the Script
:::info Self-deployed ClearML server
A template `AWS Auto-Scaler` task is available in the `DevOps Services` project.
You can clone it, adapt its [configuration](#configuration) to your needs, and enqueue it for execution directly from the ClearML UI.
:::
Launch the autoscaler locally by executing the following command:
```bash
python aws_autoscaler.py --run
```
When the script runs, a configuration wizard prompts for instance details and budget configuration.
1. Enter the AWS credentials and AWS region name.
```console
AWS Autoscaler setup wizard
---------------------------
Follow the wizard to configure your AWS auto-scaler service.
Once completed, you will be able to view and change the configuration in the clearml-server web UI.
It means there is no need to worry about typos or mistakes :)
Enter AWS Access Key ID :
Enter AWS Secret Access Key :
Enter AWS region name [us-east-1b]:
```
1. Enter Git credentials. These are required by ClearML Agent to set up a Task execution environment in an AWS EC2 instance.
```console
GIT credentials:
Enter GIT username for repository cloning (leave blank for SSH key authentication): []
Enter password for user '<username>':
```
The wizard reports the Git credentials it will use.
```console
Git repository cloning will be using user=*************** password=***********
```
1. Enter the default Docker image and parameters to use.
```console
Enter default docker image/parameters to use [nvidia/cuda:10.1-runtime-ubuntu18.04]:
```
1. For each AWS EC2 instance type that will be used in the budget, do the following:
```console
Configure the machine types for the auto-scaler:
------------------------------------------------
Select Amazon instance type ['g4dn.4xlarge']:
Use spot instances? [y/N]: y
Select availability zone ['us-east-1b']:
Select the Amazon Machine Image id ['ami-07c95cafbb788face']:
Enter the Amazon EBS device ['/dev/xvda']:
Enter the Amazon EBS volume size (in GiB) [100]:
Enter the Amazon EBS volume type ['gp2']:
```
Name the instance type that was configured. Later in the configuration, use this name to create the budget.
```console
Select a name for this instance type (used in the budget section) For example 'aws4gpu':
```
The wizard prompts whether to select another instance type.
```console
Define another instance type? [y/N]:
```
1. Enter any bash script to run on newly created instances before launching the ClearML Agent.
```console
Enter any pre-execution bash script to be executed on the newly created instances []:
```
1. Configure the AWS autoscaler budget. For each queue that will be used in the budget, enter the maximum number of
instances of a selected type that can be spun up simultaneously.
```console
Define the machines budget:
-----------------------------
Select a queue name (for example: 'aws_4gpu_machines') :
Select a instance type to attach to the queue ['aws-g4dn.xlarge', 'aws-g4dn.8xlarge', 'aws-g4dn.16xlarge']:
Enter maximum number of 'aws-g4dn.xlarge' instances to spin simultaneously (example: 3) :
```
1. If needed, add another instance type to the same queue. The previous step repeats.
```console
Do you wish to add another instance type to queue? [y/N]:
```
1. Enter the maximum idle time and the polling interval. The ClearML AWS autoscaler polls instances at that interval,
and spins down any instance that has been idle for the maximum idle time.
```console
Enter maximum idle time for the auto-scaler to spin down an instance (in minutes) [15]:
Enter instances polling interval for the auto-scaler (in minutes) [5]:
```
The configuration is complete, and a new task called `AWS Auto-Scaler` is created in the `DevOps` project. The service starts,
and the script prints a link to the task's log.
```console
CLEARML Task: created new task id=d0ee5309a9a3471d8802f2561da60dfa
CLEARML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring
CLEARML results page: https://app.clear.ml/projects/142a598b5d234bebb37a57d692f5689f/experiments/d0ee5309a9a3471d8802f2561da60dfa/output/log
Running AWS auto-scaler as a service
Execution log https://app.clear.ml/projects/142a598b5d234bebb37a57d692f5689f/experiments/d0ee5309a9a3471d8802f2561da60dfa/output/log
```
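
The settings gathered by the wizard can be pictured as two mappings: named instance types, and a budget that attaches those names to queues. The sketch below uses the wizard's default values shown above; the dictionary key names and the `check_budget` helper are hypothetical and are not taken from `aws_autoscaler.py`.

```python
# Hypothetical layout: key names are illustrative, not the script's own.
instance_types = {
    "aws4gpu": {
        "instance_type": "g4dn.4xlarge",
        "is_spot": True,
        "availability_zone": "us-east-1b",
        "ami_id": "ami-07c95cafbb788face",
        "ebs_device_name": "/dev/xvda",
        "ebs_volume_size_gib": 100,
        "ebs_volume_type": "gp2",
    },
}

# Budget: queue name -> list of (instance type name, max simultaneous instances).
budget = {
    "aws_4gpu_machines": [("aws4gpu", 3)],
}


def check_budget(budget, instance_types):
    """Return a list of problems in the budget (empty if it is consistent)."""
    problems = []
    for queue, entries in budget.items():
        for name, max_instances in entries:
            if name not in instance_types:
                problems.append(f"queue {queue!r}: unknown instance type {name!r}")
            if max_instances < 1:
                problems.append(f"queue {queue!r}: limit for {name!r} must be at least 1")
    return problems


if __name__ == "__main__":
    print(check_budget(budget, instance_types))
```

A check like this catches a budget entry that refers to an instance type name that was never defined, which is the link the wizard establishes when it asks for a name "used in the budget section".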
### Remote Execution
Using the `--remote` command line option will enqueue the autoscaler to your [`services` queue](../../clearml_agent.md#services-mode)
once the configuration wizard is complete:
```bash
python aws_autoscaler.py --remote
```
Make sure a `clearml-agent` is assigned to that queue.
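
As a rough sketch of how the two flags could select a launch mode, the following uses only the standard library. The help strings, the `choose_mode` helper, and the behavior when neither flag is given are assumptions for illustration; the actual script may handle its arguments differently.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="AWS autoscaler launcher (sketch)")
    parser.add_argument("--run", action="store_true",
                        help="run the autoscaler locally after the wizard completes")
    parser.add_argument("--remote", action="store_true",
                        help="enqueue the autoscaler to the services queue")
    return parser.parse_args(argv)


def choose_mode(args):
    """Map the parsed flags to a launch mode (assumed semantics)."""
    if args.remote:
        return "remote"
    if args.run:
        return "local"
    # Assumption: with neither flag, only the configuration is produced.
    return "configure-only"


if __name__ == "__main__":
    print(choose_mode(parse_args()))
```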
## WebApp
### Configuration
The values configured through the wizard are stored in the task's hyperparameters and configuration objects using the
[`Task.connect`](../../references/sdk/task.md#connect) and [`Task.set_configuration_object`](../../references/sdk/task.md#set_configuration_object)
methods, respectively. They can be viewed in the WebApp, in the task's **CONFIGURATION** page under **HYPERPARAMETERS** and **CONFIGURATION OBJECTS > General**.
ClearML automatically logs command line arguments defined with argparse. View them in the experiment's **CONFIGURATION**
page under **HYPERPARAMETERS > General**.
![Autoscaler configuration](../../img/examples_aws_autoscaler_config.png)
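
A minimal sketch of the two calls named above, assuming `task` is a `clearml.Task` (for example, one returned by `Task.init`). The `store_settings` helper and the dictionary keys in the usage comment are hypothetical; only `connect` and `set_configuration_object` come from the ClearML SDK.

```python
def store_settings(task, hyper_params, configurations):
    """Attach wizard-style settings to a ClearML task.

    `task` is expected to be a clearml.Task. The two dictionaries use
    whatever keys the caller chooses; none are prescribed here.
    """
    # Shown in the WebApp under HYPERPARAMETERS.
    task.connect(hyper_params)
    # Shown in the WebApp under CONFIGURATION OBJECTS > General.
    task.set_configuration_object(name="General", config_dict=configurations)


# Usage (requires the clearml package and a configured ClearML server):
#   from clearml import Task
#   task = Task.init(project_name="DevOps", task_name="AWS Auto-Scaler")
#   store_settings(
#       task,
#       {"max_idle_time_min": 15, "polling_interval_time_min": 5},  # hypothetical keys
#       {"queues": {"aws_4gpu_machines": [["aws4gpu", 3]]}},        # hypothetical keys
#   )
```

Because the values live on the task rather than in the script, they can be edited in the WebApp on a cloned task before it is enqueued.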
The task can be reused to launch another autoscaler instance: clone the task, then edit its parameters for the instance
types and budget configuration, and enqueue the task for execution (you'll typically want to use a ClearML Agent running
in [services mode](../../clearml_agent.md#services-mode) for such service tasks).
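
The same clone-and-enqueue flow can also be scripted. The sketch below is an illustration under assumptions: `clone_and_enqueue` is a hypothetical helper that receives the `clearml.Task` class as `task_cls`, and it skips the parameter-editing step, which would go between the clone and the enqueue.

```python
def clone_and_enqueue(task_cls, project_name, task_name, new_name, queue_name="services"):
    """Clone an existing autoscaler task and enqueue the copy.

    `task_cls` is expected to be clearml.Task; it is passed in so the
    helper itself has no import-time dependency on the SDK.
    """
    template = task_cls.get_task(project_name=project_name, task_name=task_name)
    cloned = task_cls.clone(source_task=template, name=new_name)
    # Edits to instance types or budget would be applied to `cloned` here.
    task_cls.enqueue(task=cloned, queue_name=queue_name)
    return cloned


# Usage (requires the clearml package and a configured ClearML server):
#   from clearml import Task
#   clone_and_enqueue(Task, "DevOps", "AWS Auto-Scaler", "AWS Auto-Scaler (copy)")
```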
### Console
All other console output appears in the experiment's **CONSOLE**.
![Autoscaler console](../../img/examples_aws_autoscaler_console.png)