5.8 KiB
title |
---|
Monitoring Service Posting Slack Alerts |
The Slack alerts example
demonstrates how to use the clearml.automation.monitor
class to implement a service that monitors the completion and
failure of tasks, and posts alert messages on a Slack channel.
Creating a Slack Bot
Before configuring and running the Slack alert service, create a new Slack Bot (ClearML Bot).
:::important The Slack API token and channel you create are required to configure the Slack alert service. :::
- Login to your Slack account.
- Go to https://api.slack.com/apps/new.
- In App Name, enter an app name; for example, "ClearML Bot".
- In Development Slack Workspace, select a workspace.
- Click Create App.
- In Basic Information, under Display Information, complete the following:
- In Short description, enter "Allegro Train Bot".
- In Background color, enter "#202432".
- Click Save Changes.
- In OAuth & Permissions, under Scopes, click Add an OAuth Scope, and then select the following permissions
on the list:
- channels:join
- channels:read
- chat:write
- In OAuth Tokens & Redirect URLs:
- Click Install App to Workspace
- In the confirmation dialog, click Allow.
- Click Copy to copy the Bot User OAuth Access Token.
Running the Script
:::info Self deployed ClearML server
A template Slack Alerts
task is available in the DevOps Services
project. You can clone it, adapt its configuration
to your needs and enqueue for execution directly from the ClearML UI.
:::
Run the monitoring service in one of these ways:
- Run locally
- Run in ClearML Agent services mode
To run the monitoring service:
python slack_alerts.py --channel <Slack-channel-name> --slack-api <Slack-API-token> --local True [...]
channel
- The Slack channel where alerts will be posted.slack_api
- Slack API key.local
- IfTrue
, run monitoring services locally. IfFalse
, enqueue the task in the queue passed as theservice_queue
(by defaultservices
queue) for remote execution.
The script supports the following additional command line options:
service_queue
- The queue to use when running remotely as a service. The default value isservices
(make sure your workspace has such a queue and to assign a ClearML Agent to this queue).message_prefix
- A message prefix for Slack alerts. For example, to alert all channel members use: "Hey <!here>".min_num_iterations
- Minimal iteration threshold below which experiments are ignored. Use this option to eliminate debug sessions that fail quickly. The default value is 0.project
- The name of the project to monitor. By default, all projects are monitored.include_manual_experiments
- Whether to include experiments that are running locally:True
- Monitor all experiments (both local and remote, executed by ClearML Agent).False
(default) - Monitor only remote experiments.
include_completed_experiments
- IfFalse
(default), send alerts only for failed tasks. IfTrue
, send an alert for completed and failed tasks.include_archived
- IfFalse
(default), only tasks that are not archived will be reported. This option can be useful if a task is archived between polling.refresh_rate
- How often to monitor the experiments in seconds. The default value is 10.0.include_users
- Only report tasks that were initiated by these users (usernames and user IDs are accepted). Mutually exclusive toexclude_users
.exclude_users
- Only report tasks that were NOT initiated by these users (usernames and user IDs are accepted). Mutually exclusive toinclude_users
.verbose
- IfTrue
, will increase verbosity of messages (such as when tasks are polled but filtered away).
Configuration
ClearML automatically logs command line options defined with argparse. They appear in the experiment's CONFIGURATION page under HYPERPARAMETERS > Args.
The task can be reused to launch another monitor instance: clone the task, edit its parameters, and enqueue the task for execution (you'll typically want to use a ClearML Agent running in services mode for such service tasks).
Console
All console output appears in the experiment's CONSOLE page.
Additional Information about slack_alerts.py
In slack_alerts.py
, the class SlackMonitor
inherits from the Monitor
class in clearml.automation.monitor
.
SlackMonitor
overrides the following Monitor
class methods:
get_query_parameters
- Get the query parameters for Task monitoring.process_task
- Get the information for a Task, post a Slack message, and output to console.- Allows skipping failed Tasks, if a Task ran for few iterations. Calls
Task.get_last_iteration
to get the number of iterations. - Builds the Slack message which includes the most recent output to the console (retrieved by calling
Task.get_reported_console_output
), and the URL of the Task's output log in the ClearML Web UI (retrieved by callingTask.get_output_log_web_page
).
- Allows skipping failed Tasks, if a Task ran for few iterations. Calls
You can run the example remotely by calling the Task.execute_remotely
method.
To interface to Slack, the example uses slack_sdk.WebClient
and slack_sdk.errors.SlackApiError
.