mirror of https://github.com/clearml/clearml-docs, synced 2025-06-26 18:17:44 +00:00
Clarify Non-responsive Task Watchdog section
parent ea2ab54e3f
commit cc4902c649
@@ -361,10 +361,16 @@ You can also use hashed passwords instead of plain-text passwords. To do that:
 
 ### Non-responsive Task Watchdog
 
-The non-responsive task watchdog monitors tasks that were not updated for a specified time interval, and then
-the watchdog marks them as `aborted`. The non-responsive experiment watchdog is always active.
+The non-responsive task watchdog monitors tasks that have stopped communicating with the ClearML Server for a specified
+time interval. If a task remains unresponsive beyond the set threshold, the watchdog marks it as `aborted`. The
+non-responsive task watchdog is always active.
 
-Modify the following settings for the watchdog:
+A task is considered non-responsive when it no longer sends updates to the ClearML Server. The non-responsiveness timer
+starts when the task stops communicating with the server. This typically happens if:
+* The task's main process is stuck but has not exited.
+* There is a network issue preventing the task from communicating with the server.
+
+You can configure the following watchdog settings:
 
 * Watchdog status - enabled / disabled
 * The time threshold (in seconds) of experiment inactivity (default value is 7200 seconds (2 hours)).
@@ -372,10 +378,15 @@ Modify the following settings for the watchdog:
 
 **To configure the non-responsive watchdog for the ClearML Server:**
 
-1. In the ClearML Server `/opt/clearml/config/services.conf` file, add or edit the `tasks.non_responsive_tasks_watchdog`
-section and specify the watchdog settings.
+1. Open the ClearML Server `/opt/clearml/config/services.conf` file.
+
+:::tip
+If the `services.conf` file does not exist, create your own in ClearML Server's `/opt/clearml/config` directory (or
+an alternate folder you configured).
+:::
+
+1. Add or edit the `tasks.non_responsive_tasks_watchdog` section and specify the watchdog settings. For example:
 
-For example:
 ```
 tasks {
 non_responsive_tasks_watchdog {
@@ -389,11 +400,6 @@ Modify the following settings for the watchdog:
 }
 }
 ```
 
-:::tip
-If the `services.conf` file does not exist, create your own in ClearML Server's `/opt/clearml/config` directory (or
-an alternate folder you configured), and input the modified configuration
-:::
-
 1. Restart ClearML Server.
 
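The behavior this change documents can be illustrated with a minimal Python sketch. It is not ClearML Server code: the `Task` class, the `run_watchdog_cycle` function, and the `in_progress` status name are hypothetical and exist only for illustration. The sketch assumes the watchdog compares the time since a task's last update against the configured threshold (7200 seconds by default, per the text) and marks the task as `aborted` once that threshold is exceeded.

```python
from dataclasses import dataclass

# Hypothetical names throughout; this only illustrates the documented behavior.
DEFAULT_THRESHOLD_SEC = 7200  # default threshold stated in the docs (2 hours)


@dataclass
class Task:
    task_id: str
    status: str         # assumed status names, e.g. "in_progress" or "aborted"
    last_update: float  # time of the last update received from the task (epoch seconds)


def run_watchdog_cycle(tasks, now, enabled=True, threshold_sec=DEFAULT_THRESHOLD_SEC):
    """Mark in-progress tasks as aborted when silent for longer than threshold_sec.

    Returns the IDs of the tasks aborted in this cycle.
    """
    if not enabled:
        return []
    aborted = []
    for task in tasks:
        silent_for = now - task.last_update
        if task.status == "in_progress" and silent_for > threshold_sec:
            task.status = "aborted"
            aborted.append(task.task_id)
    return aborted


now = 10_000.0
tasks = [
    Task("stuck", "in_progress", last_update=now - 7201),   # silent for just over 2 hours
    Task("healthy", "in_progress", last_update=now - 60),   # reported a minute ago
]
print(run_watchdog_cycle(tasks, now))  # ['stuck']
```

In this sketch the timer is measured from the last update the server received, which matches the description that the non-responsiveness timer starts when the task stops communicating. A stuck process and a network outage look identical from the server's side, so both lead to the same outcome.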