---
title: Dataset Management Using CIFAR10
---

In this tutorial, we are going to use a CIFAR example, manage the CIFAR dataset with `clearml-data`, and then replace our
current dataset read method with one that interfaces with `clearml-data`.

## Creating the Dataset

### Downloading the Data
Before we can register the CIFAR dataset with `clearml-data`, we need to obtain a local copy of it.

Execute this Python script to download the data:
```python
from clearml import StorageManager
# We're using the StorageManager to download the data for us!
# It's a neat little utility that helps us download
# files we need and cache them :)

manager = StorageManager()
dataset_path = manager.get_local_copy(remote_url="https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz")
# make sure to copy the printed value
print("COPY THIS DATASET PATH: {}".format(dataset_path))
```

Expected response:
```bash
COPY THIS DATASET PATH: ~/.clearml/cache/storage_manager/global/f2751d3a22ccb78db0e07874912b5c43.cifar-10-python_artifacts_archive_None
```
The script prints the path to the downloaded data. It'll be needed later on.

### Creating the Dataset
To create the dataset, execute the following in a CLI:
```
clearml-data create --project cifar --name cifar_dataset
```

Expected response:
```
clearml-data - Dataset Management & Versioning CLI
Creating a new dataset:
New dataset created id=*********
```
Where \*\*\*\*\*\*\*\*\* is the dataset ID.

## Adding Files
Add the files we just downloaded to the dataset:
```
clearml-data add --files <dataset_path>
```

where `dataset_path` is the path that was printed earlier, which denotes the location of the downloaded dataset.

:::note
There's no need to specify a *dataset_id*, as the *clearml-data* session stores it.
:::

## Finalizing the Dataset
Run the close command to upload the files (by default, they are uploaded to the file server):<br/>
```
clearml-data close
```

![image](../../../img/examples_data_management_cifar_dataset.png)

## Using the Dataset
Now that we have a new dataset registered, we can consume it.

We take [this script](https://github.com/allegroai/clearml/blob/master/examples/frameworks/ignite/cifar_ignite.py) as a base to train on the CIFAR dataset.

We replace the file load part with ClearML's Dataset object. The Dataset's `get_local_copy()` method will return a path
to the cached, downloaded dataset.
Then we provide the path to PyTorch's dataset object.

```python
dataset_id = "ee1c35f60f384e65bc800f42f0aca5ec"

from clearml import Dataset
dataset_path = Dataset.get(dataset_id=dataset_id).get_local_copy()

trainset = datasets.CIFAR10(root=dataset_path,
                            train=True,
                            download=False,
                            transform=transform)
```
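
If you need a standalone, modifiable copy of the data rather than the shared cache, the full example below mentions `get_mutable_local_copy()`. A minimal sketch (the target folder name here is just an illustration):

```python
from clearml import Dataset

# Download a standalone copy of the dataset into a folder you own,
# instead of reading from the shared cache ("./cifar_data" is an arbitrary example path)
dataset_path = Dataset.get(dataset_id=dataset_id).get_mutable_local_copy("./cifar_data")
```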

<details className="cml-expansion-panel info">
<summary className="cml-expansion-panel-summary">Full example code using the dataset:</summary>
<div className="cml-expansion-panel-content">

```python
# These are the obligatory imports
from pathlib import Path

import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from ignite.contrib.handlers import TensorboardLogger
from ignite.engine import Events, create_supervised_trainer, create_supervised_evaluator
from ignite.handlers import global_step_from_engine
from ignite.metrics import Accuracy, Loss, Recall
from ignite.utils import setup_logger
from torch.utils.tensorboard import SummaryWriter
from tqdm import tqdm

from clearml import Task, StorageManager

# Connecting ClearML with the current process,
# from here on everything is logged automatically
task = Task.init(project_name='Image Example', task_name='image classification CIFAR10')
params = {'number_of_epochs': 20, 'batch_size': 64, 'dropout': 0.25, 'base_lr': 0.001, 'momentum': 0.9, 'loss_report': 100}
params = task.connect(params)  # enabling configuration override by clearml
print(params)  # printing actual configuration (after override in remote mode)

# This is our original data retrieval code. It uses StorageManager to just download and cache our dataset.
'''
manager = StorageManager()

dataset_path = Path(manager.get_local_copy(remote_url="https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"))
'''

# Let's now modify it to utilize the new dataset API. You'll need to copy the created dataset id
# to the next variable

dataset_id = "ee1c35f60f384e65bc800f42f0aca5ec"

# The below gets the dataset and stores it in the cache. If you want to download the dataset regardless of whether
# it's in the cache, use Dataset.get(dataset_id).get_mutable_local_copy(path_to_download)
from clearml import Dataset
dataset_path = Dataset.get(dataset_id=dataset_id).get_local_copy()

# Dataset and Dataloader initializations
transform = transforms.Compose([transforms.ToTensor()])

trainset = datasets.CIFAR10(root=dataset_path,
                            train=True,
                            download=False,
                            transform=transform)
trainloader = torch.utils.data.DataLoader(trainset,
                                          batch_size=params.get('batch_size', 4),
                                          shuffle=True,
                                          num_workers=10)

testset = datasets.CIFAR10(root=dataset_path,
                           train=False,
                           download=False,
                           transform=transform)
testloader = torch.utils.data.DataLoader(testset,
                                         batch_size=params.get('batch_size', 4),
                                         shuffle=False,
                                         num_workers=10)

classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

tb_logger = TensorboardLogger(log_dir="cifar-output")


# Helper function to store predictions and scores using matplotlib
def predictions_gt_images_handler(engine, logger, *args, **kwargs):
    x, _ = engine.state.batch
    y_pred, y = engine.state.output

    num_x = num_y = 4
    le = num_x * num_y
    fig = plt.figure(figsize=(20, 20))
    trans = transforms.ToPILImage()
    for idx in range(le):
        preds = torch.argmax(F.softmax(y_pred[idx], dim=0))
        probs = torch.max(F.softmax(y_pred[idx], dim=0))
        ax = fig.add_subplot(num_x, num_y, idx + 1, xticks=[], yticks=[])
        ax.imshow(trans(x[idx]))
        ax.set_title("{0} {1:.1f}% (label: {2})".format(
            classes[preds],
            probs * 100,
            classes[y[idx]]),
            color=("green" if preds == y[idx] else "red")
        )
    logger.writer.add_figure('predictions vs actuals', figure=fig, global_step=engine.state.epoch)


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 3)
        self.conv2 = nn.Conv2d(6, 16, 3)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(16 * 6 * 6, 120)
        self.fc2 = nn.Linear(120, 84)
        self.dropout = nn.Dropout(p=params.get('dropout', 0.25))
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 6 * 6)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(self.dropout(x))
        return x


# Training
def run(epochs, lr, momentum, log_interval):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    net = Net().to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=lr, momentum=momentum)

    trainer = create_supervised_trainer(net, optimizer, criterion, device=device)
    trainer.logger = setup_logger("trainer")

    val_metrics = {"accuracy": Accuracy(), "loss": Loss(criterion), "recall": Recall()}
    evaluator = create_supervised_evaluator(net, metrics=val_metrics, device=device)
    evaluator.logger = setup_logger("evaluator")

    # Attach handler to plot trainer's loss every 100 iterations
    tb_logger.attach_output_handler(
        trainer,
        event_name=Events.ITERATION_COMPLETED(every=params.get('loss_report')),
        tag="training",
        output_transform=lambda loss: {"loss": loss},
    )

    # Attach handler to dump evaluator's metrics every epoch completed
    for tag, evaluator in [("training", trainer), ("validation", evaluator)]:
        tb_logger.attach_output_handler(
            evaluator,
            event_name=Events.EPOCH_COMPLETED,
            tag=tag,
            metric_names="all",
            global_step_transform=global_step_from_engine(trainer),
        )

    # Attach function to build debug images and report every epoch end
    tb_logger.attach(
        evaluator,
        log_handler=predictions_gt_images_handler,
        event_name=Events.EPOCH_COMPLETED(once=1),
    )

    desc = "ITERATION - loss: {:.2f}"
    pbar = tqdm(initial=0, leave=False, total=len(trainloader), desc=desc.format(0))

    @trainer.on(Events.ITERATION_COMPLETED(every=log_interval))
    def log_training_loss(engine):
        pbar.desc = desc.format(engine.state.output)
        pbar.update(log_interval)

    @trainer.on(Events.EPOCH_COMPLETED)
    def log_training_results(engine):
        pbar.refresh()
        evaluator.run(trainloader)
        metrics = evaluator.state.metrics
        avg_accuracy = metrics["accuracy"]
        avg_nll = metrics["loss"]
        tqdm.write(
            "Training Results - Epoch: {} Avg accuracy: {:.2f} Avg loss: {:.2f}".format(
                engine.state.epoch, avg_accuracy, avg_nll
            )
        )

    @trainer.on(Events.EPOCH_COMPLETED)
    def log_validation_results(engine):
        evaluator.run(testloader)
        metrics = evaluator.state.metrics
        avg_accuracy = metrics["accuracy"]
        avg_nll = metrics["loss"]
        tqdm.write(
            "Validation Results - Epoch: {} Avg accuracy: {:.2f} Avg loss: {:.2f}".format(
                engine.state.epoch, avg_accuracy, avg_nll
            )
        )

        pbar.n = pbar.last_print_n = 0

    @trainer.on(Events.EPOCH_COMPLETED | Events.COMPLETED)
    def log_time():
        tqdm.write(
            "{} took {} seconds".format(trainer.last_event_name.name, trainer.state.times[trainer.last_event_name.name])
        )

    trainer.run(trainloader, max_epochs=epochs)
    pbar.close()

    PATH = './cifar_net.pth'
    torch.save(net.state_dict(), PATH)

    print('Finished Training')
    print('Task ID number is: {}'.format(task.id))


run(params.get('number_of_epochs'), params.get('base_lr'), params.get('momentum'), 10)
```

</div></details>

<br/><br/>
That's it! All you need to do now is run the full script.
---
title: Folder Sync
---

This example shows how to use the *clearml-data* folder sync function.

*clearml-data* folder sync mode is useful for cases where users have a single point of truth (i.e., a folder) that updates
from time to time. When the point of truth is updated, users can call `clearml-data sync`, and the
changes (file addition, modification, or removal) will be reflected in ClearML.

## Prerequisites
First, make sure that you have cloned the [clearml](https://github.com/allegroai/clearml) repository. It contains all
the needed files.
1. Open a terminal and change directory to the cloned repository's examples folder:

   `cd clearml/examples/reporting`

## Creating Initial Version

### Syncing a Folder
Create a dataset and sync the `data_samples` folder from the repo to ClearML:
```bash
clearml-data sync --project datasets --name sync_folder --folder data_samples
```

Expected response:

```
clearml-data - Dataset Management & Versioning CLI
Creating a new dataset:
New dataset created id=0d8f5f3e5ebd4f849bfb218021be1ede
Syncing dataset id 0d8f5f3e5ebd4f849bfb218021be1ede to local folder data_samples
Generating SHA2 hash for 5 files
Hash generation completed
Sync completed: 0 files removed, 5 added / modified
Finalizing dataset
Pending uploads, starting dataset upload to https://files.community.clear.ml
Uploading compressed dataset changes (5 files, total 222.17 KB) to https://files.community.clear.ml
Upload completed (222.17 KB)
2021-05-04 09:57:56,809 - clearml.Task - INFO - Waiting to finish uploads
2021-05-04 09:57:57,581 - clearml.Task - INFO - Finished uploading
Dataset closed and finalized
```

As can be seen, the `clearml-data sync` command creates the dataset, then uploads the files, and closes the dataset.
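
If you prefer to drive the same flow from Python, the SDK exposes a folder-sync capability as well. The snippet below is a sketch, assuming `Dataset.create()` and `sync_folder()` behave as described in the SDK reference; verify the exact signatures for your `clearml` version:

```python
from clearml import Dataset

# Create a dataset version and sync it with the local folder
# (roughly what `clearml-data sync` does in one command)
dataset = Dataset.create(dataset_name="sync_folder", dataset_project="datasets")
dataset.sync_folder(local_path="data_samples")

# Upload the changes and finalize the version
dataset.upload()
dataset.finalize()
```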

## Modifying Synced Folder

Now we'll modify the folder:
1. Add another line to one of the files in the `data_samples` folder.
1. Add a file to the `data_samples` folder.<br/>
   Run `echo "data data data" > data_samples/new_data.txt` (this will create the file `new_data.txt` and put it in the `data_samples` folder).

We'll repeat the process of creating a new dataset with the previous one as its parent, and syncing the folder.

```bash
clearml-data sync --project datasets --name second_ds --parents a1ddc8b0711b4178828f6c6e6e994b7c --folder data_samples
```

Expected response:
```
clearml-data - Dataset Management & Versioning CLI
Creating a new dataset:
New dataset created id=0992dd6bae6144388e0f2ef131d9724a
Syncing dataset id 0992dd6bae6144388e0f2ef131d9724a to local folder data_samples
Generating SHA2 hash for 6 files
Hash generation completed
Sync completed: 0 files removed, 2 added / modified
Finalizing dataset
Pending uploads, starting dataset upload to https://files.community.clear.ml
Uploading compressed dataset changes (2 files, total 742 bytes) to https://files.community.clear.ml
Upload completed (742 bytes)
2021-05-04 10:05:42,353 - clearml.Task - INFO - Waiting to finish uploads
2021-05-04 10:05:43,106 - clearml.Task - INFO - Finished uploading
Dataset closed and finalized
```

We can see that 2 files were added or modified, just as we expected!
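
To consume the new version from code, fetch it by ID and inspect or download its contents. A short sketch using the ID printed above (assuming the standard `Dataset.get()` API):

```python
from clearml import Dataset

# Fetch the second dataset version
dataset = Dataset.get(dataset_id="0992dd6bae6144388e0f2ef131d9724a")

# List every file in this version, including files inherited from the parent
print(dataset.list_files())

# Download (and cache) a local copy of the full version
local_path = dataset.get_local_copy()
print(local_path)
```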
---
title: Data Management Example
---

In this example, we'll create a simple dataset and demonstrate basic actions on it.

## Prerequisites
First, make sure that you have cloned the [clearml](https://github.com/allegroai/clearml) repository. It contains all
the needed files.
1. Open a terminal and change directory to the cloned repository's examples folder:

   `cd clearml/examples/reporting`

## Creating Initial Dataset

1. To create the dataset, run this code:

   ```bash
   clearml-data create --project datasets --name HelloDataset
   ```

   Expected response:

   ```bash
   clearml-data - Dataset Management & Versioning CLI
   Creating a new dataset:
   New dataset created id=24d05040f3e14fbfbed8edb1bf08a88c
   ```

1. Now let's add a folder. File addition is recursive, so it's enough to point at the folder
   to capture all files and subfolders:

   ```bash
   clearml-data add --files data_samples
   ```

   Expected response:

   ```bash
   clearml-data - Dataset Management & Versioning CLI
   Adding files/folder to dataset id 24d05040f3e14fbfbed8edb1bf08a88c
   Generating SHA2 hash for 2 files
   Hash generation completed
   5 files added
   ```
   :::note
   After creating a dataset, we don't have to specify its ID when running commands such as *add*, *remove*, or *list*.
   :::

1. Close the dataset - this command uploads the files. By default, the files are uploaded to the file server, but
   this can be configured with the `--storage` flag to any of ClearML's supported storage mediums (see [storage](../../integrations/storage.md)).
   The command also finalizes the dataset, making it immutable and ready to be consumed.

   ```bash
   clearml-data close
   ```

   Expected response:

   ```bash
   clearml-data - Dataset Management & Versioning CLI
   Finalizing dataset id 24d05040f3e14fbfbed8edb1bf08a88c
   Pending uploads, starting dataset upload to https://files.community.clear.ml
   Uploading compressed dataset changes (4 files, total 221.56 KB) to https://files.community.clear.ml
   Upload completed (221.56 KB)
   2021-05-04 09:32:03,388 - clearml.Task - INFO - Waiting to finish uploads
   2021-05-04 09:32:04,067 - clearml.Task - INFO - Finished uploading
   Dataset closed and finalized
   ```

## Listing Dataset Content

To see that all the files were added to the created dataset, use `clearml-data list` and enter the ID of the dataset
that was just closed.

```bash
clearml-data list --id 24d05040f3e14fbfbed8edb1bf08a88c
```

Expected response:

```console
clearml-data - Dataset Management & Versioning CLI

List dataset content: 24d05040f3e14fbfbed8edb1bf08a88c
Listing dataset content
file name | size | hash
------------------------------------------------------------------------------------------------------------------------------------------------
dancing.jpg | 40,484 | 78e804c0c1d54da8d67e9d072c1eec514b91f4d1f296cdf9bf16d6e54d63116a
data.csv | 21,440 | b618696f57b822cd2e9b92564a52b3cc93a2206f41df3f022956bb6cfe4e7ad5
picasso.jpg | 114,573 | 6b3c67ea9ec82b09bd7520dd09dad2f1176347d740fd2042c88720e780691a7c
sample.json | 132 | 9c42a9a978ac7a71873ebd5c65985e613cfaaff1c98f655af0d2ee0246502fd7
sample.mp3 | 72,142 | fbb756ae14005420ff00ccdaff99416bebfcea3adb7e30963a69e68e9fbe361b
Total 5 files, 248771 bytes
```
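
The same listing can be done from Python; a minimal sketch, assuming the SDK's `list_files()` method:

```python
from clearml import Dataset

# Equivalent of `clearml-data list --id 24d05040f3e14fbfbed8edb1bf08a88c`
dataset = Dataset.get(dataset_id="24d05040f3e14fbfbed8edb1bf08a88c")
for file_name in dataset.list_files():
    print(file_name)
```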

## Creating a Child Dataset

In ClearML Data, it's possible to create datasets that inherit the content of other datasets; these are called child datasets.

1. Create a new dataset, specifying the previously created one as its parent:

   ```bash
   clearml-data create --project datasets --name HelloDataset-improved --parents 24d05040f3e14fbfbed8edb1bf08a88c
   ```
   :::note
   You'll need to input the dataset ID you received when you created the dataset above.
   :::

1. Now, we want to add a new file.
   * Create a new file: `echo "data data data" > new_data.txt` (this will create the file `new_data.txt`).
   * Now add the file to the dataset:

   ```bash
   clearml-data add --files new_data.txt
   ```
   Which should return this output:

   ```console
   clearml-data - Dataset Management & Versioning CLI
   Adding files/folder to dataset id 8b68686a4af040d081027ba3cf6bbca6
   1 file added
   ```

1. Let's also remove a file. We'll need to specify the file's full path (within the dataset, not locally) to remove it.

   ```bash
   clearml-data remove --files data_samples/dancing.jpg
   ```

   Expected response:
   ```bash
   clearml-data - Dataset Management & Versioning CLI
   Removing files/folder from dataset id 8b68686a4af040d081027ba3cf6bbca6
   1 files removed
   ```

1. Close and finalize the dataset:

   ```bash
   clearml-data close
   ```

1. Let's take a look again at the files in the dataset:

   ```
   clearml-data list --id 8b68686a4af040d081027ba3cf6bbca6
   ```

   And we see that our changes have been made! `new_data.txt` has been added, and `dancing.jpg` has been removed.

   ```
   file name | size | hash
   ------------------------------------------------------------------------------------------------------------------------------------------------
   data.csv | 21,440 | b618696f57b822cd2e9b92564a52b3cc93a2206f41df3f022956bb6cfe4e7ad5
   new_data.txt | 15 | 6df986a2154902260a836febc5a32543f5337eac60560c57db99257a7e012051
   picasso.jpg | 114,573 | 6b3c67ea9ec82b09bd7520dd09dad2f1176347d740fd2042c88720e780691a7c
   sample.json | 132 | 9c42a9a978ac7a71873ebd5c65985e613cfaaff1c98f655af0d2ee0246502fd7
   sample.mp3 | 72,142 | fbb756ae14005420ff00ccdaff99416bebfcea3adb7e30963a69e68e9fbe361b
   Total 5 files, 208302 bytes
   ```

By using `clearml-data`, a clear lineage is created for the data. As seen in this example, when a dataset is closed, the
only way to add or remove data is to create a new dataset, using the previous dataset as its parent. This way, the data
is not reliant on the code and is reproducible.
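
The same parent/child versioning is available through the SDK. The sketch below mirrors the CLI steps above, under the assumption that `Dataset.create()` accepts a `parent_datasets` argument and that `add_files()` / `remove_files()` behave as in the SDK reference:

```python
from clearml import Dataset

# Create a child version on top of the finalized dataset
# (roughly `clearml-data create --parents 24d05040f3e14fbfbed8edb1bf08a88c`)
child = Dataset.create(
    dataset_name="HelloDataset-improved",
    dataset_project="datasets",
    parent_datasets=["24d05040f3e14fbfbed8edb1bf08a88c"],
)

# Apply the same changes as the CLI steps: add one file, remove another
child.add_files(path="new_data.txt")
child.remove_files(dataset_path="data_samples/dancing.jpg")

# Upload the delta and finalize the new version
child.upload()
child.finalize()
```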