mirror of https://github.com/clearml/clearml-docs
synced 2025-06-26 18:17:44 +00:00
Commit: Refactor ClearML Data docs (#108)
---
title: Dataset Management with CLI and SDK
---

In this tutorial, we are going to manage the CIFAR dataset with the `clearml-data` CLI, and then use ClearML's [`Dataset`](../../references/sdk/dataset.md) class to ingest the data.

## Creating the Dataset

### Downloading the Data

Before we can register the CIFAR dataset with `clearml-data`, we need to obtain a local copy of it.

Execute this Python script to download the data:

```python
from clearml import StorageManager

manager = StorageManager()
dataset_path = manager.get_local_copy(
    remote_url="https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"
)
# make sure to copy the printed value
print("COPY THIS DATASET PATH: {}".format(dataset_path))
```

Expected response:
```bash
COPY THIS DATASET PATH: ~/.clearml/cache/storage_manager/global/f2751d3a22ccb78db0e07874912b5c43.cifar-10-python_artifacts_archive_None
```

The script prints the path to the downloaded data. It will be needed later on.

### Creating the Dataset

To create the dataset, execute the following command:

```
clearml-data create --project dataset_examples --name cifar_dataset
```

Expected response:

```
clearml-data - Dataset Management & Versioning CLI
Creating a new dataset:
New dataset created id=ee1c35f60f384e65bc800f42f0aca5ec
```

Where `ee1c35f60f384e65bc800f42f0aca5ec` is the dataset ID.

## Adding Files

Add the files we just downloaded to the dataset:

```
clearml-data add --files <dataset_path>
```

where `<dataset_path>` is the path that was printed earlier, which denotes the location of the downloaded dataset.

:::note
There's no need to specify a `dataset_id`, since the `clearml-data` session stores it.
:::

## Finalizing the Dataset

Run the [`close`](../../references/sdk/dataset.md#close) command to upload the files (they are uploaded to the ClearML Server by default):

```
clearml-data close
```

This command sets the dataset task's status to *completed*, so it will no longer be modifiable. This ensures future reproducibility.

The information about the dataset, including a list of files and their sizes, can be viewed in the WebApp, in the dataset task's **ARTIFACTS** tab.



## Using the Dataset

Now that we have a new dataset registered, we can consume it.

The [data_ingestion.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/data_ingestion.py) example script demonstrates using the dataset within Python code.

```python
from clearml import Dataset
from torchvision import datasets, transforms

dataset_name = "cifar_dataset"
dataset_project = "dataset_examples"

# fetch a local (cached) copy of the registered dataset
dataset_path = Dataset.get(dataset_name=dataset_name, dataset_project=dataset_project).get_local_copy()

transform = transforms.Compose([transforms.ToTensor()])
trainset = datasets.CIFAR10(
    root=dataset_path,
    train=True,
    download=False,
    transform=transform
)
```

The Dataset's [`get_local_copy`](../../references/sdk/dataset.md#get_local_copy) method will return a path to the cached, downloaded dataset, which we then pass to PyTorch's dataset object.
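To build intuition for why repeated calls are cheap, here is a minimal, self-contained sketch of content-keyed caching. This illustrates the general caching idea only, not ClearML's actual cache layout; `get_cached_copy`, `fake_fetch`, and the toy URL are hypothetical names made up for the demo.

```python
import hashlib
import os
import tempfile

CACHE_ROOT = tempfile.mkdtemp()  # stand-in for a cache directory such as ~/.clearml/cache

def get_cached_copy(remote_url, fetch):
    """Return a local path for remote_url, calling fetch() only on a cache miss."""
    key = hashlib.md5(remote_url.encode("utf-8")).hexdigest()
    local_path = os.path.join(CACHE_ROOT, key + "." + os.path.basename(remote_url))
    if not os.path.exists(local_path):  # cache miss: download and store
        with open(local_path, "wb") as f:
            f.write(fetch(remote_url))
    return local_path

calls = []
def fake_fetch(url):  # hypothetical downloader used only for this demo
    calls.append(url)
    return b"archive-bytes"

url = "https://example.com/cifar-10-python.tar.gz"
first = get_cached_copy(url, fake_fetch)
second = get_cached_copy(url, fake_fetch)  # second call is served from the cache
print(first == second, len(calls))  # True 1
```

The second lookup returns the same path without re-downloading, which is why the cached path printed earlier can be reused safely.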
The script then trains a neural network to classify images using the dataset created above.
---
title: Folder Sync with CLI
---

This example shows how to use the `clearml-data` folder sync function.

`clearml-data` folder sync mode is useful when users have a single point of truth (i.e. a folder) that updates from time to time. When the point of truth is updated, users can call `clearml-data sync`, and the changes (file addition, modification, or removal) will be reflected in ClearML.
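Conceptually, a sync-style change detection compares content hashes of the folder's files between versions. The following is a hedged, stdlib-only sketch of that idea (it is not ClearML's implementation; `snapshot` and `diff` are hypothetical helpers):

```python
import hashlib
import os

def snapshot(folder):
    """Map each file's relative path to a SHA-256 hash of its contents."""
    state = {}
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            rel = os.path.relpath(path, folder)
            with open(path, "rb") as f:
                state[rel] = hashlib.sha256(f.read()).hexdigest()
    return state

def diff(old, new):
    """Classify changes between two snapshots: added, modified, removed."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    modified = sorted(p for p in set(old) & set(new) if old[p] != new[p])
    return added, modified, removed

# demo on a pair of pre-computed snapshots
old = {"a.txt": "h1", "b.txt": "h2"}
new = {"a.txt": "h1", "b.txt": "changed", "c.txt": "h3"}
print(diff(old, new))  # (['c.txt'], ['b.txt'], [])
```

A file whose hash is unchanged costs nothing to re-sync, which is why only additions, modifications, and removals are uploaded.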
## Prerequisites

1. First, make sure that you have cloned the [clearml](https://github.com/allegroai/clearml) repository. It contains all the needed files.
1. Open a terminal and change directory to the cloned repository's examples folder:

```
cd clearml/examples/reporting
```

## Syncing a Folder

Create a dataset and sync the `data_samples` folder from the repo to ClearML:

```bash
clearml-data sync --project datasets --name sync_folder --folder data_samples
```

Expected response:

```
clearml-data - Dataset Management & Versioning CLI
Creating a new dataset:
New dataset created id=0d8f5f3e5ebd4f849bfb218021be1ede
Syncing dataset id 0d8f5f3e5ebd4f849bfb218021be1ede to local folder data_samples
Generating SHA2 hash for 5 files
Hash generation completed
Sync completed: 0 files removed, 5 added / modified
Finalizing dataset
Pending uploads, starting dataset upload to https://files.community.clear.ml
Uploading compressed dataset changes (5 files, total 222.17 KB) to https://files.community.clear.ml
Upload completed (222.17 KB)
2021-05-04 09:57:56,809 - clearml.Task - INFO - Waiting to finish uploads
2021-05-04 09:57:57,581 - clearml.Task - INFO - Finished uploading
Dataset closed and finalized
```

As can be seen, the `clearml-data sync` command creates the dataset, uploads the files, and closes the dataset.

## Modifying Synced Folder

Now we'll modify the folder:
1. Add another line to one of the files in the `data_samples` folder.
1. Add a file to the `data_samples` folder:<br/>
   Run `echo "data data data" > data_samples/new_data.txt` (this creates the file `new_data.txt` and puts it in the `data_samples` folder).

We'll repeat the process of creating a new dataset with the previous one as its parent, and syncing the folder (substitute the dataset ID you received earlier for the `--parents` value):

```bash
clearml-data sync --project datasets --name second_ds --parents a1ddc8b0711b4178828f6c6e6e994b7c --folder data_samples
```

Expected response:

```
clearml-data - Dataset Management & Versioning CLI
Creating a new dataset:
New dataset created id=0992dd6bae6144388e0f2ef131d9724a
Syncing dataset id 0992dd6bae6144388e0f2ef131d9724a to local folder data_samples
Generating SHA2 hash for 6 files
Hash generation completed
Sync completed: 0 files removed, 2 added / modified
Finalizing dataset
Pending uploads, starting dataset upload to https://files.community.clear.ml
Uploading compressed dataset changes (2 files, total 742 bytes) to https://files.community.clear.ml
Upload completed (742 bytes)
2021-05-04 10:05:42,353 - clearml.Task - INFO - Waiting to finish uploads
2021-05-04 10:05:43,106 - clearml.Task - INFO - Finished uploading
Dataset closed and finalized
```

We can see that 2 files were added or modified, just as we expected!
---
title: Data Management with Python
---

The [dataset_creation.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/dataset_creation.py) and [data_ingestion.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/data_ingestion.py) scripts together demonstrate how to use ClearML's [`Dataset`](../../references/sdk/dataset.md) class to create a dataset and subsequently ingest the data.

## Dataset Creation

The [dataset_creation.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/dataset_creation.py) script demonstrates how to do the following:
* Create a dataset and add files to it
* Upload the dataset to the ClearML Server
* Finalize the dataset

### Downloading the Data

We first need to obtain a local copy of the CIFAR dataset.

```python
from clearml import StorageManager

manager = StorageManager()
dataset_path = manager.get_local_copy(remote_url="https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz")
```

This script downloads the data, and `dataset_path` contains the path to the downloaded data.
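The download is a `tar.gz` archive. As a hedged aside (this is plain stdlib usage, not a ClearML API, and the file names are toy stand-ins for the real CIFAR batches), such an archive can be created and unpacked like this:

```python
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()

# build a tiny stand-in archive (in the tutorial this would be cifar-10-python.tar.gz)
data_file = os.path.join(workdir, "batch_1.bin")
with open(data_file, "wb") as f:
    f.write(b"\x00" * 16)
archive = os.path.join(workdir, "toy-dataset.tar.gz")
with tarfile.open(archive, "w:gz") as tar:
    tar.add(data_file, arcname="toy-dataset/batch_1.bin")

# unpack it next to the archive, as a dataset consumer would
extract_dir = os.path.join(workdir, "extracted")
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(extract_dir)

print(os.listdir(os.path.join(extract_dir, "toy-dataset")))  # ['batch_1.bin']
```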
### Creating the Dataset

```python
from clearml import Dataset

dataset = Dataset.create(dataset_name="cifar_dataset", dataset_project="dataset_examples")
```

This creates a data processing task called `cifar_dataset` in the `dataset_examples` project, which can be viewed in the WebApp.

### Adding Files

```python
dataset.add_files(path=dataset_path)
```

This adds the downloaded files to the current dataset.

### Uploading the Files

```python
dataset.upload()
```

This uploads the dataset to the ClearML Server by default. The dataset's destination can be changed by specifying the target storage with the `output_url` parameter of the [`upload`](../../references/sdk/dataset.md#upload) method.

### Finalizing the Dataset

Call the [`finalize`](../../references/sdk/dataset.md#finalize) method to close the dataset and set the dataset task's status to *completed*. The dataset can only be finalized if it doesn't have any pending uploads.

```python
dataset.finalize()
```

After a dataset has been closed, it can no longer be modified. This ensures future reproducibility.

The information about the dataset, including a list of files and their sizes, can be viewed in the WebApp, in the dataset task's **ARTIFACTS** tab.



## Data Ingestion

Now that we have a new dataset registered, we can consume it!

The [data_ingestion.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/data_ingestion.py) script demonstrates data ingestion using the dataset created in the first script.

```python
from clearml import Dataset

dataset_name = "cifar_dataset"
dataset_project = "dataset_examples"

dataset_path = Dataset.get(dataset_name=dataset_name, dataset_project=dataset_project).get_local_copy()
```

The script above gets the dataset and uses the [`Dataset.get_local_copy`](../../references/sdk/dataset.md#get_local_copy) method to return a path to the cached, read-only local dataset. If you need a modifiable copy of the dataset, use `Dataset.get(dataset_name, dataset_project).get_mutable_local_copy("path/to/download")` instead.
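The read-only vs. mutable distinction can be pictured with a small, self-contained sketch: the cache is kept pristine, while a mutable copy is a duplicate in a folder you own. This only illustrates the concept, not ClearML's internals; the file names and permission handling here are assumptions for the demo.

```python
import os
import shutil
import stat
import tempfile

# stand-in for a cached dataset folder: treat its contents as read-only
cache_dir = tempfile.mkdtemp()
cached_file = os.path.join(cache_dir, "data.csv")
with open(cached_file, "w") as f:
    f.write("a,b\n1,2\n")
os.chmod(cached_file, stat.S_IREAD)  # discourage in-place edits of the cache

# a "mutable local copy": duplicate the cache into a user-chosen folder
target = os.path.join(tempfile.mkdtemp(), "my_copy")
shutil.copytree(cache_dir, target)
mutable_file = os.path.join(target, "data.csv")
os.chmod(mutable_file, stat.S_IREAD | stat.S_IWRITE)  # the copy is ours to edit

# edits to the duplicate never touch the cached original
with open(mutable_file, "a") as f:
    f.write("3,4\n")
```

After the append, the mutable copy has three rows while the cached original still has two, which is exactly why experiments that mutate data should work on a mutable copy.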
The script then creates a neural network to train a model to classify images from the dataset created above.
docs/clearml_data/data_management_examples/data_man_simple.md

---
title: Data Management from CLI
---

In this example, we'll create a simple dataset and demonstrate basic actions on it, using the `clearml-data` CLI.

## Prerequisites

1. First, make sure that you have cloned the [clearml](https://github.com/allegroai/clearml) repository. It contains all the needed files.
1. Open a terminal and change directory to the cloned repository's examples folder:

```
cd clearml/examples/reporting
```

## Creating Initial Dataset

1. To create the dataset, run this command:

```bash
clearml-data create --project datasets --name HelloDataset
```

Expected response:

```bash
clearml-data - Dataset Management & Versioning CLI
Creating a new dataset:
New dataset created id=24d05040f3e14fbfbed8edb1bf08a88c
```

1. Now let's add a folder. File addition is recursive, so it's enough to point at the folder to capture all files and subfolders:

```bash
clearml-data add --files data_samples
```

Expected response:

```bash
clearml-data - Dataset Management & Versioning CLI
Adding files/folder to dataset id 24d05040f3e14fbfbed8edb1bf08a88c
Generating SHA2 hash for 2 files
Hash generation completed
5 files added
```

:::note
After creating a dataset, we don't have to specify its ID when running commands such as *add*, *remove*, or *list*.
:::

1. Close the dataset. This command uploads the files. By default, the files are uploaded to the file server, but this can be configured with the `--storage` flag to any of ClearML's supported storage mediums (see [storage](../../integrations/storage.md)). The command also finalizes the dataset, making it immutable and ready to be consumed.

```bash
clearml-data close
```

Expected response:

```bash
clearml-data - Dataset Management & Versioning CLI
Finalizing dataset id 24d05040f3e14fbfbed8edb1bf08a88c
Pending uploads, starting dataset upload to https://files.community.clear.ml
Uploading compressed dataset changes (4 files, total 221.56 KB) to https://files.community.clear.ml
Upload completed (221.56 KB)
2021-05-04 09:32:03,388 - clearml.Task - INFO - Waiting to finish uploads
2021-05-04 09:32:04,067 - clearml.Task - INFO - Finished uploading
Dataset closed and finalized
```
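The recursive file addition used in the second step above can be sketched with the standard library. This is an illustration of the traversal idea, not the CLI's actual code; `collect_files` and the toy folder layout are hypothetical.

```python
import os
import tempfile

def collect_files(folder):
    """Recursively list every file under folder, as dataset-relative paths."""
    found = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), os.path.dirname(folder))
            found.append(rel.replace(os.sep, "/"))
    return sorted(found)

# build a toy data_samples folder with a nested subfolder
base = tempfile.mkdtemp()
samples = os.path.join(base, "data_samples")
os.makedirs(os.path.join(samples, "nested"))
for rel in ("data.csv", "sample.json", os.path.join("nested", "more.txt")):
    with open(os.path.join(samples, rel), "w") as f:
        f.write("x")

print(collect_files(samples))
# ['data_samples/data.csv', 'data_samples/nested/more.txt', 'data_samples/sample.json']
```

Pointing at the top folder is enough: the walk picks up nested subfolders without listing them explicitly.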
## Listing Dataset Content

To see that all the files were added to the created dataset, use `clearml-data list` and enter the ID of the dataset that was just closed:

```bash
clearml-data list --id 24d05040f3e14fbfbed8edb1bf08a88c
```

Expected response:

```console
clearml-data - Dataset Management & Versioning CLI

List dataset content: 24d05040f3e14fbfbed8edb1bf08a88c
Listing dataset content
file name | size | hash
------------------------------------------------------------------------------------------------------------------------------------------------
dancing.jpg | 40,484 | 78e804c0c1d54da8d67e9d072c1eec514b91f4d1f296cdf9bf16d6e54d63116a
data.csv | 21,440 | b618696f57b822cd2e9b92564a52b3cc93a2206f41df3f022956bb6cfe4e7ad5
picasso.jpg | 114,573 | 6b3c67ea9ec82b09bd7520dd09dad2f1176347d740fd2042c88720e780691a7c
sample.json | 132 | 9c42a9a978ac7a71873ebd5c65985e613cfaaff1c98f655af0d2ee0246502fd7
sample.mp3 | 72,142 | fbb756ae14005420ff00ccdaff99416bebfcea3adb7e30963a69e68e9fbe361b
Total 5 files, 248771 bytes
```
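The `size` and `hash` columns above are derived from file contents. As a hedged sketch of how such a row can be produced (the real CLI's formatting and hashing pipeline may differ; `describe` is a hypothetical helper):

```python
import hashlib

def describe(name, payload):
    """Produce a (file name, size, hash) row like the `clearml-data list` table."""
    digest = hashlib.sha256(payload).hexdigest()
    return "{} | {} | {}".format(name, len(payload), digest)

row = describe("sample.txt", b"hello world\n")
print(row)
```

Because the hash is a pure function of the bytes, two files with identical contents always produce the same row, which is what makes change detection between dataset versions reliable.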
## Creating a Child Dataset

In ClearML Data, it's possible to create datasets that inherit the content of other datasets; these are called child datasets.

1. Create a new dataset, specifying the previously created one as its parent:

```bash
clearml-data create --project datasets --name HelloDataset-improved --parents 24d05040f3e14fbfbed8edb1bf08a88c
```
:::note
You'll need to input the dataset ID you received when you created the dataset above
:::

1. Now, we want to add a new file.
* Create a new file: `echo "data data data" > new_data.txt` (this creates the file `new_data.txt`)
* Now add the file to the dataset:

```bash
clearml-data add --files new_data.txt
```
Which should return this output:

```console
clearml-data - Dataset Management & Versioning CLI
Adding files/folder to dataset id 8b68686a4af040d081027ba3cf6bbca6
1 file added
```

1. Let's also remove a file. We'll need to specify the file's full path (within the dataset, not locally) to remove it:

```bash
clearml-data remove --files data_samples/dancing.jpg
```

Expected response:
```bash
clearml-data - Dataset Management & Versioning CLI
Removing files/folder from dataset id 8b68686a4af040d081027ba3cf6bbca6
1 files removed
```

1. Close and finalize the dataset:

```bash
clearml-data close
```

1. Let's take another look at the files in the dataset:

```
clearml-data list --id 8b68686a4af040d081027ba3cf6bbca6
```

And we see that our changes have been made! `new_data.txt` has been added, and `dancing.jpg` has been removed.

```
file name | size | hash
------------------------------------------------------------------------------------------------------------------------------------------------
data.csv | 21,440 | b618696f57b822cd2e9b92564a52b3cc93a2206f41df3f022956bb6cfe4e7ad5
new_data.txt | 15 | 6df986a2154902260a836febc5a32543f5337eac60560c57db99257a7e012051
picasso.jpg | 114,573 | 6b3c67ea9ec82b09bd7520dd09dad2f1176347d740fd2042c88720e780691a7c
sample.json | 132 | 9c42a9a978ac7a71873ebd5c65985e613cfaaff1c98f655af0d2ee0246502fd7
sample.mp3 | 72,142 | fbb756ae14005420ff00ccdaff99416bebfcea3adb7e30963a69e68e9fbe361b
Total 5 files, 208302 bytes
```

By using `clearml-data`, a clear lineage is created for the data. As seen in this example, once a dataset is closed, the only way to add or remove data is to create a new dataset with the previous one as a parent. This way, the data is versioned independently of the code and remains reproducible.
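The parent/child resolution just described can be modeled in a few lines: a child's effective content is its parents' files, plus its own additions, minus its removals. This is a conceptual sketch with a toy in-memory registry, not ClearML's storage format; `resolve` and the hash values are hypothetical.

```python
def resolve(datasets, dataset_id):
    """Resolve a dataset's effective file list by walking its parent chain."""
    spec = datasets[dataset_id]
    files = {}
    for parent in spec.get("parents", []):
        files.update(resolve(datasets, parent))  # inherit parent content first
    files.update(spec.get("added", {}))          # then apply this version's additions
    for name in spec.get("removed", []):
        files.pop(name, None)                    # and its removals
    return files

# toy registry mirroring the example: a closed parent and its child
datasets = {
    "parent": {"added": {"dancing.jpg": "h1", "data.csv": "h2"}},
    "child": {
        "parents": ["parent"],
        "added": {"new_data.txt": "h3"},
        "removed": ["dancing.jpg"],
    },
}
print(sorted(resolve(datasets, "child")))  # ['data.csv', 'new_data.txt']
```

Note that resolving the parent still yields its original files: closing a dataset freezes it, and every change lives in a new child version.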
docs/clearml_data/data_management_examples/workflows.md
---
title: Workflows
---

Take a look at the ClearML Data examples, which demonstrate common workflows using the `clearml-data` CLI and the `Dataset` class:
* [Data Management from CLI](data_man_simple.md) - Tutorial for creating, modifying, and consuming a dataset with the CLI.
* [Folder Sync with CLI](data_man_folder_sync.md) - Tutorial for using the `clearml-data sync` CLI option to update a dataset according to a local folder.
* [Dataset Management with CLI and SDK](data_man_cifar_classification.md) - Tutorial for creating a dataset with the CLI, then programmatically ingesting the data with the SDK.
* [Data Management with Python](data_man_python.md) - Example scripts for creating and consuming a dataset with the SDK.