---
title: PyTorch Distributed
---
The [pytorch_distributed_example.py](https://github.com/allegroai/clearml/blob/master/examples/frameworks/pytorch/pytorch_distributed_example.py)
script demonstrates integrating ClearML into code that uses the [PyTorch Distributed Communications Package](https://pytorch.org/docs/stable/distributed.html)
(`torch.distributed`).
The script initializes a main Task and spawns subprocesses, each for an instance of that Task.
The Task in each subprocess trains a neural network over a partitioned dataset (the torchvision built-in [MNIST](https://pytorch.org/vision/stable/datasets.html#mnist)
dataset), and reports (uploads) the following to the main Task:
* Artifacts - A dictionary containing different key-value pairs.
* Scalars - Loss reported as a scalar during training in each Task in a subprocess.
* Hyperparameters - Hyperparameters created in each Task are added to the hyperparameters in the main Task.

Each Task in a subprocess references the main Task by calling [`Task.current_task`](../../references/sdk/task.md#taskcurrent_task), which always returns
the main Task.

When the script runs, it creates an experiment named `test torch distributed` in the `examples` project.
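
The per-subprocess dataset partitioning mentioned above can be pictured with a minimal, standard-library-only sketch. It is not the partitioner used in the example script: `partition_indices`, `world_size`, and the fixed seed are hypothetical names and values chosen for illustration. Every rank shuffles the same index list with the same seed and keeps only its own contiguous slice, so the shards are disjoint.

```python
import random


def partition_indices(num_samples, world_size, rank, seed=1234):
    """Return the sample indices assigned to `rank` (hypothetical helper)."""
    indices = list(range(num_samples))
    # Same seed on every rank, so all ranks agree on the shuffled order.
    random.Random(seed).shuffle(indices)
    # Any remainder is dropped here to keep the sketch simple.
    shard_size = num_samples // world_size
    start = rank * shard_size
    return indices[start:start + shard_size]


world_size = 4  # hypothetical number of subprocesses
shards = [partition_indices(100, world_size, rank) for rank in range(world_size)]
```

In a real run each subprocess would call such a helper only for its own rank. The list comprehension here just makes it easy to check that the shards do not overlap.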
## Artifacts
The example uploads a dictionary as an artifact in the main Task by calling the [`Task.upload_artifact`](../../references/sdk/task.md#upload_artifact)
method on [`Task.current_task`](../../references/sdk/task.md#taskcurrent_task) (the main Task). The dictionary contains the subprocess's rank, returned by [`dist.get_rank`](https://pytorch.org/docs/stable/distributed.html#torch.distributed.get_rank),
and the rank also appears in the artifact name, so each artifact is unique.
```python
Task.current_task().upload_artifact(
    'temp {:02d}'.format(dist.get_rank()),
    artifact_object={'worker_rank': dist.get_rank()}
)
```
All of these artifacts appear in the main Task under **ARTIFACTS** **>** **OTHER**.

![Experiment artifacts](../../img/examples_pytorch_distributed_example_09.png)
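
To see why the artifact names do not collide, the naming scheme can be reproduced without ClearML or PyTorch. This is only an illustrative sketch: the plain dictionary `uploaded` stands in for the main Task's artifact store, and `artifact_name` and `world_size` are hypothetical names that are not part of the example script.

```python
def artifact_name(rank):
    # Same format string as the snippet above: zero-padded, two digits wide.
    return 'temp {:02d}'.format(rank)


world_size = 4  # hypothetical number of subprocesses
uploaded = {}   # plain dict standing in for the main Task's artifact store
for rank in range(world_size):
    uploaded[artifact_name(rank)] = {'worker_rank': rank}

print(sorted(uploaded))  # ['temp 00', 'temp 01', 'temp 02', 'temp 03']
```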
## Scalars
Loss is reported to the main Task by calling the [`Logger.report_scalar`](../../references/sdk/logger.md#report_scalar)
method on the logger returned by `Task.current_task().get_logger()`, which is the logger for the main Task. Since `Logger.report_scalar` is called
with the same title (`loss`) but a different series name (containing the subprocess's rank), all loss scalar series are
plotted together in a single plot.
```python
Task.current_task().get_logger().report_scalar(
    'loss',
    'worker {:02d}'.format(dist.get_rank()),
    value=loss.item(),
    iteration=i
)
```
The single scalar plot for loss appears in **SCALARS**.

![Experiment scalars](../../img/examples_pytorch_distributed_example_08.png)
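
The grouping by title and series can be mimicked with a small local recorder. The sketch below does not use ClearML and does not reflect how ClearML stores scalars. `report_scalar` here is a hypothetical stand-in that takes the same arguments as the call above, and the loss values are made up.

```python
from collections import defaultdict

# Stand-in for the main Task's scalar store: title -> series -> [(iteration, value)].
plots = defaultdict(lambda: defaultdict(list))


def report_scalar(title, series, value, iteration):
    """Hypothetical local recorder taking the same arguments as the call above."""
    plots[title][series].append((iteration, value))


world_size = 2  # hypothetical number of subprocesses
for rank in range(world_size):
    for i in range(3):
        fake_loss = 1.0 / (i + 1 + rank)  # made-up values, not real training loss
        report_scalar('loss', 'worker {:02d}'.format(rank), value=fake_loss, iteration=i)
```

After the loop, `plots` holds a single title (`loss`) with one series per worker, which corresponds to one plot with several curves.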
## Hyperparameters
ClearML automatically logs the argparse command line options. In addition, each subprocess connects a dictionary of its own hyperparameters. Since the [`Task.connect`](../../references/sdk/task.md#connect)
method is called on [`Task.current_task`](../../references/sdk/task.md#taskcurrent_task), these are logged in the main Task. A different hyperparameter key is used in each
subprocess, so they do not overwrite each other in the main Task.
```python
param = {'worker_{}_stuff'.format(dist.get_rank()): 'some stuff ' + str(randint(0, 100))}
Task.current_task().connect(param)
```
All the hyperparameters appear in **CONFIGURATION** **>** **HYPERPARAMETERS**.

![Experiment hyperparameters Args](../../img/examples_pytorch_distributed_example_01.png)

![Experiment hyperparameters General](../../img/examples_pytorch_distributed_example_01a.png)
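
The effect of using a rank-specific key can be shown with plain dictionaries. In this sketch `dict.update` is only a rough stand-in for connecting parameters to the main Task, not a description of what `Task.connect` does internally. `worker_params`, `main_task_params`, and `shared_key_params` are hypothetical names.

```python
from random import randint


def worker_params(rank):
    # Same key pattern as the snippet above: the rank is embedded in the key.
    return {'worker_{}_stuff'.format(rank): 'some stuff ' + str(randint(0, 100))}


main_task_params = {}  # plain dict standing in for the main Task's hyperparameters
for rank in range(4):  # 4 is a hypothetical number of subprocesses
    main_task_params.update(worker_params(rank))

# With one shared key, each later worker would overwrite the earlier ones.
shared_key_params = {}
for rank in range(4):
    shared_key_params.update({'worker_stuff': 'some stuff ' + str(rank)})
```

`main_task_params` ends up with four entries, one per worker, while `shared_key_params` keeps only the value written last.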
## Console
Output to the console, including the text messages printed from the main Task object and from each subprocess, appears in **CONSOLE**.

![Experiment console log](../../img/examples_pytorch_distributed_example_06.png)