diff --git a/examples/dynamic_cloud_cluster.ipynb b/examples/dynamic_cloud_cluster.ipynb
index b245aea..c9e5b46 100644
--- a/examples/dynamic_cloud_cluster.ipynb
+++ b/examples/dynamic_cloud_cluster.ipynb
@@ -5,27 +5,30 @@
    "metadata": {},
    "source": [
     "# Auto-Magically Spin AWS EC2 Instances On Demand \n",
-    "# and Create a Dynamic Cluster Running *Trains-Agent*\n",
+    "# and Create a Dynamic Cluster Running *ClearML-Agent*\n",
     "\n",
-    "### Define your budget and execute the notebook, that's it\n",
-    "### You now have a fully managed cluster on AWS 🎉 🎊 "
+    "## Define your budget and execute the notebook, that's it\n",
+    "## You now have a fully managed cluster on AWS 🎉 🎊"
    ]
   },
  {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "**trains-agent**'s main goal is to quickly pull a job from an execution queue, setup the environment (as defined in the experiment, including git cloning, python packages etc.) then execute the experiment and monitor it.\n",
+    "**clearml-agent**'s main goal is to quickly pull a job from an execution queue, set up the environment (as defined in the experiment, including git cloning, python packages etc.), then execute the experiment and monitor it.\n",
     "\n",
     "This notebook defines a cloud budget (currently only AWS is supported, but feel free to expand with PRs), and spins an instance the minute a job is waiting for execution. It will also spin down idle machines, saving you some $$$ :)\n",
     "\n",
-    "Configuration steps\n",
+    "> **Note:**\n",
+    "> This is just an example of how you can use ClearML Agent to implement custom autoscaling. For a more structured autoscaler script, see [here](https://github.com/allegroai/clearml/blob/master/clearml/automation/auto_scaler.py).\n",
+    "\n",
+    "Configuration steps:\n",
     "- Define maximum budget to be used (instance type / number of instances).\n",
-    "- Create new execution *queues* in the **trains-server**.\n",
-    "- Define mapping between the created the *queues* and an instance budget.\n",
+    "- Create new execution *queues* in the **clearml-server**.\n",
+    "- Define mapping between the created *queues* and an instance budget.\n",
     "\n",
     "**TL;DR - This notebook:**\n",
-    "- Will spin instances if there are jobs in the execution *queues*, until it will hit the budget limit. \n",
\n", + "- Will spin instances if there are jobs in the execution *queues* until it will hit the budget limit.\n", "- If machines are idle, it will spin them down.\n", "\n", "The controller implementation itself is stateless, meaning you can always re-execute the notebook, if for some reason it stopped.\n", @@ -39,7 +42,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "##### Install & import required packages" + "### Install & import required packages" ] }, { @@ -48,7 +51,7 @@ "metadata": {}, "outputs": [], "source": [ - "!pip install trains-agent\n", + "!pip install clearml-agent\n", "!pip install boto3" ] }, @@ -56,7 +59,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "##### Define AWS instance types and configuration (Instance Type, EBS, AMI etc.)" + "### Define AWS instance types and configuration (Instance Type, EBS, AMI etc.)" ] }, { @@ -92,17 +95,17 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "##### Define machine budget per execution queue\n", + "### Define machine budget per execution queue\n", "\n", - "Now that we defined our budget, we need to connect it with the **Trains** cluster.\n", + "Now that we defined our budget, we need to connect it with the **ClearML** cluster.\n", "\n", "We map each queue to a resource type (instance type).\n", "\n", - "Create two queues in the WebUI:\n", - "- Browse to http://your_trains_server_ip:8080/workers-and-queues/queues\n", + "Create two queues in the Web UI:\n", + "- Browse to http://your_clearml_server_ip:8080/workers-and-queues/queues\n", "- Then click on the \"New Queue\" button and name your queues \"aws_normal\" and \"aws_high\" respectively\n", "\n", - "The QUEUES dictionary hold the mapping between the queue name and the type/number of instances to spin connected to the specific queue.\n", + "The QUEUES dictionary holds the mapping between the queue name and the type/number of instances to spin connected to the specific queue.\n", "```\n", "QUEUES = {\n", " 'queue_name': [(\"instance-type-as-defined-in-RESOURCE_CONFIGURATIONS\", max_number_of_instances), ]\n", @@ -116,7 +119,7 @@ "metadata": {}, "outputs": [], "source": [ - "# Trains-Agent Queues - Machines budget per Queue\n", + "# ClearML Agent Queues - Machines budget per Queue\n", "# Per queue: list of (machine type as defined in RESOURCE_CONFIGURATIONS,\n", "# max instances for the specific queue). 
     "QUEUES = {\n",
@@ -129,7 +132,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "##### Credentials for your AWS account, as well as for your **Trains-Server**"
+    "### Credentials for your AWS account, as well as for your **ClearML Server**"
    ]
   },
  {
@@ -143,24 +146,25 @@
     "CLOUD_CREDENTIALS_SECRET = \"\"\n",
     "CLOUD_CREDENTIALS_REGION = \"us-east-1\"\n",
     "\n",
-    "# TRAINS configuration\n",
-    "TRAINS_SERVER_WEB_SERVER = \"http://localhost:8080\"\n",
-    "TRAINS_SERVER_API_SERVER = \"http://localhost:8008\"\n",
-    "TRAINS_SERVER_FILES_SERVER = \"http://localhost:8081\"\n",
-    "# TRAINS credentials\n",
-    "TRAINS_ACCESS_KEY = \"\"\n",
-    "TRAINS_SECRET_KEY = \"\"\n",
-    "# Git User/Pass to be used by trains-agent,\n",
+    "# CLEARML configuration\n",
+    "CLEARML_WEB_SERVER = \"http://localhost:8080\"\n",
+    "CLEARML_API_SERVER = \"http://localhost:8008\"\n",
+    "CLEARML_FILES_SERVER = \"http://localhost:8081\"\n",
+    "# CLEARML credentials\n",
+    "CLEARML_API_ACCESS_KEY = \"\"\n",
+    "CLEARML_API_SECRET_KEY = \"\"\n",
+    "# Git User/Pass to be used by clearml-agent,\n",
     "# leave empty if image already contains git ssh-key\n",
-    "TRAINS_GIT_USER = \"\"\n",
-    "TRAINS_GIT_PASS = \"\"\n",
+    "CLEARML_AGENT_GIT_USER = \"\"\n",
+    "CLEARML_AGENT_GIT_PASS = \"\"\n",
     "\n",
-    "# Additional fields for trains.conf file created on the remote instance\n",
+    "# Additional fields for clearml.conf file created on the remote instance\n",
     "# for example: 'agent.default_docker.image: \"nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04\"'\n",
-    "EXTRA_TRAINS_CONF = \"\"\"\n",
+    "\n",
+    "EXTRA_CLEARML_CONF = \"\"\"\n",
     "\"\"\"\n",
     "\n",
-    "# Bash script to run on instances before running trains-agent\n",
+    "# Bash script to run on instances before running clearml-agent\n",
     "# Example: \"\"\"\n",
     "# echo \"This is the first line\"\n",
     "# echo \"This is the second line\"\n",
@@ -168,9 +172,9 @@
     "EXTRA_BASH_SCRIPT = \"\"\"\n",
     "\"\"\"\n",
     "\n",
-    "# Default docker for trains-agent when running in docker mode (requires docker v19.03 and above). \n",
\n", - "# Leave empty to run trains-agent in non-docker mode.\n", - "DEFAULT_DOCKER_IMAGE = \"nvidia/cuda\"" + "# Default docker for clearml-agent when running in docker mode (requires docker v19.03 and above).\n", + "# Leave empty to run clearml-agent in non-docker mode.\n", + "CLEARML_AGENT_DOCKER_IMAGE = \"nvidia/cuda\"" ] }, { @@ -192,7 +196,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "##### Import Packages and Budget Definition Sanity Check" + "### Import Packages and Budget Definition Sanity Check" ] }, { @@ -209,7 +213,7 @@ "from time import sleep, time\n", "\n", "import boto3\n", - "from trains_agent.backend_api.session.client import APIClient" + "from clearml_agent.backend_api.session.client import APIClient" ] }, { @@ -227,36 +231,36 @@ " \"A resource name can only appear in a single queue definition.\"\n", " )\n", "\n", - "# Encode EXTRA_TRAINS_CONF for later bash script usage\n", - "EXTRA_TRAINS_CONF_ENCODED = \"\\\\\\\"\".join(EXTRA_TRAINS_CONF.split(\"\\\"\"))" + "# Encode EXTRA_CLEARML_CONF for later bash script usage\n", + "EXTRA_CLEARML_CONF_ENCODED = \"\\\\\\\"\".join(EXTRA_CLEARML_CONF.split(\"\\\"\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "##### Cloud specific implementation of spin up/down - currently supports AWS only" + "### Cloud specific implementation of spin up/down - currently supports AWS only" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Cloud-specific implementation (currently, only AWS EC2 is supported)\n", "def spin_up_worker(resource, worker_id_prefix, queue_name):\n", " \"\"\"\n", - " Creates a new worker for trains.\n", + " Creates a new worker for clearml.\n", " First, create an instance in the cloud and install some required packages.\n", - " Then, define trains-agent environment variables and run \n", - " trains-agent for the specified queue.\n", + " Then, define clearml-agent environment variables and run\n", + " clearml-agent for the specified queue.\n", " NOTE: - Will wait until instance is running\n", " - This implementation assumes the instance image already has docker installed\n", "\n", " :param str resource: resource name, as defined in BUDGET and QUEUES.\n", " :param str worker_id_prefix: worker name prefix\n", - " :param str queue_name: trains queue to listen to\n", + " :param str queue_name: clearml queue to listen to\n", " \"\"\"\n", " resource_conf = RESOURCE_CONFIGURATIONS[resource]\n", " # Add worker type and AWS instance type to the worker name.\n", @@ -267,8 +271,8 @@ " )\n", "\n", " # user_data script will automatically run when the instance is started. 
\n", - " # It will install the required packages for trains-agent configure it using \n", - " # environment variables and run trains-agent on the required queue\n", + " # It will install the required packages for clearml-agent configure it using\n", + " # environment variables and run clearml-agent on the required queue\n", " user_data = \"\"\"#!/bin/bash\n", " sudo apt-get update\n", " sudo apt-get install -y python3-dev\n", @@ -278,36 +282,36 @@ " sudo apt-get install -y build-essential\n", " python3 -m pip install -U pip\n", " python3 -m pip install virtualenv\n", - " python3 -m virtualenv trains_agent_venv\n", - " source trains_agent_venv/bin/activate\n", - " python -m pip install trains-agent\n", - " echo 'agent.git_user=\\\"{git_user}\\\"' >> /root/trains.conf\n", - " echo 'agent.git_pass=\\\"{git_pass}\\\"' >> /root/trains.conf\n", - " echo \"{trains_conf}\" >> /root/trains.conf\n", - " export TRAINS_API_HOST={api_server}\n", - " export TRAINS_WEB_HOST={web_server}\n", - " export TRAINS_FILES_HOST={files_server}\n", + " python3 -m virtualenv clearml_agent_venv\n", + " source clearml_agent_venv/bin/activate\n", + " python -m pip install clearml-agent\n", + " echo 'agent.git_user=\\\"{git_user}\\\"' >> /root/clearml.conf\n", + " echo 'agent.git_pass=\\\"{git_pass}\\\"' >> /root/clearml.conf\n", + " echo \"{clearml_conf}\" >> /root/clearml.conf\n", + " export CLEARML_API_HOST={api_server}\n", + " export CLEARML_WEB_HOST={web_server}\n", + " export CLEARML_FILES_HOST={files_server}\n", " export DYNAMIC_INSTANCE_ID=`curl http://169.254.169.254/latest/meta-data/instance-id`\n", - " export TRAINS_WORKER_ID={worker_id}:$DYNAMIC_INSTANCE_ID\n", - " export TRAINS_API_ACCESS_KEY='{access_key}'\n", - " export TRAINS_API_SECRET_KEY='{secret_key}'\n", + " export CLEARML_WORKER_ID={worker_id}:$DYNAMIC_INSTANCE_ID\n", + " export CLEARML_API_ACCESS_KEY='{access_key}'\n", + " export CLEARML_API_SECRET_KEY='{secret_key}'\n", " {bash_script}\n", " source ~/.bashrc\n", - " python -m trains_agent --config-file '/root/trains.conf' daemon --queue '{queue}' {docker}\n", + " python -m clearml_agent --config-file '/root/clearml.conf' daemon --queue '{queue}' {docker}\n", " shutdown\n", " \"\"\".format(\n", - " api_server=TRAINS_SERVER_API_SERVER,\n", - " web_server=TRAINS_SERVER_WEB_SERVER,\n", - " files_server=TRAINS_SERVER_FILES_SERVER,\n", + " api_server=CLEARML_API_SERVER,\n", + " web_server=CLEARML_WEB_SERVER,\n", + " files_server=CLEARML_FILES_SERVER,\n", " worker_id=worker_id,\n", - " access_key=TRAINS_ACCESS_KEY,\n", - " secret_key=TRAINS_SECRET_KEY,\n", + " access_key=CLEARML_API_ACCESS_KEY,\n", + " secret_key=CLEARML_API_SECRET_KEY,\n", " queue=queue_name,\n", - " git_user=TRAINS_GIT_USER,\n", - " git_pass=TRAINS_GIT_PASS,\n", - " trains_conf=EXTRA_TRAINS_CONF_ENCODED,\n", + " git_user=CLEARML_AGENT_GIT_USER,\n", + " git_pass=CLEARML_AGENT_GIT_PASS,\n", + " clearml_conf=EXTRA_CLEARML_CONF_ENCODED,\n", " bash_script=EXTRA_BASH_SCRIPT,\n", - " docker=\"--docker '{}'\".format(DEFAULT_DOCKER_IMAGE) if DEFAULT_DOCKER_IMAGE else \"\"\n", + " docker=\"--docker '{}'\".format(CLEARML_AGENT_DOCKER_IMAGE) if CLEARML_AGENT_DOCKER_IMAGE else \"\"\n", " )\n", "\n", " ec2 = boto3.client(\n", @@ -405,7 +409,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "###### Controller Implementation and Logic" + "#### Controller Implementation and Logic" ] }, { @@ -430,18 +434,18 @@ "\n", " # Internal definitions\n", " workers_prefix = \"dynamic_aws\"\n", - " # Worker's id in trains would be composed from:\n", + " 
+    "    # Worker's id in clearml is composed of:\n",
     "    # prefix, name, instance_type and cloud_id separated by ';'\n",
     "    workers_pattern = re.compile(\n",
     "        r\"^(?P<prefix>[^:]+):(?P<name>[^:]+):(?P<instance_type>[^:]+):(?P<cloud_id>[^:]+)\"\n",
     "    )\n",
     "\n",
-    "    # Set up the environment variables for trains\n",
-    "    os.environ[\"TRAINS_API_HOST\"] = TRAINS_SERVER_API_SERVER\n",
-    "    os.environ[\"TRAINS_WEB_HOST\"] = TRAINS_SERVER_WEB_SERVER\n",
-    "    os.environ[\"TRAINS_FILES_HOST\"] = TRAINS_SERVER_FILES_SERVER\n",
-    "    os.environ[\"TRAINS_API_ACCESS_KEY\"] = TRAINS_ACCESS_KEY\n",
-    "    os.environ[\"TRAINS_API_SECRET_KEY\"] = TRAINS_SECRET_KEY\n",
+    "    # Set up the environment variables for clearml\n",
+    "    os.environ[\"CLEARML_API_HOST\"] = CLEARML_API_SERVER\n",
+    "    os.environ[\"CLEARML_WEB_HOST\"] = CLEARML_WEB_SERVER\n",
+    "    os.environ[\"CLEARML_FILES_HOST\"] = CLEARML_FILES_SERVER\n",
+    "    os.environ[\"CLEARML_API_ACCESS_KEY\"] = CLEARML_API_ACCESS_KEY\n",
+    "    os.environ[\"CLEARML_API_SECRET_KEY\"] = CLEARML_API_SECRET_KEY\n",
     "    api_client = APIClient()\n",
     "\n",
     "    # Verify the requested queues exist and create those that doesn't exist\n",
@@ -520,7 +524,7 @@
     "            # skip resource types that might be needed\n",
     "            if resources in required_idle_resources:\n",
     "                continue\n",
-    "            # Remove from both aws and trains all instances that are \n",
+    "            # Remove from both aws and clearml all instances that are\n",
     "            # idle for longer than MAX_IDLE_TIME_MIN\n",
     "            if time() - timestamp > MAX_IDLE_TIME_MIN * 60.0:\n",
     "                cloud_id = workers_pattern.match(worker.id)[\"cloud_id\"]\n",
@@ -535,7 +539,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "##### Execute Forever* (the controller is stateless, so you can always re-execute the notebook)"
+    "### Execute Forever* (the controller is stateless, so you can always re-execute the notebook)"
    ]
   },
  {