add data examples to guides (#111)

pollfly 2021-11-09 12:27:52 +02:00 committed by GitHub
parent e155c49cfd
commit 3137f74e66
3 changed files with 193 additions and 0 deletions


@ -0,0 +1,98 @@
---
title: Dataset Management with CLI and SDK
---
In this tutorial, we are going to manage the CIFAR dataset with the `clearml-data` CLI, and then use ClearML's [`Dataset`](../../references/sdk/dataset.md)
class to ingest the data.
## Creating the Dataset
### Downloading the Data
Before we can register the CIFAR dataset with `clearml-data`, we need to obtain a local copy of it.
Execute this Python script to download the data:
```python
from clearml import StorageManager
manager = StorageManager()
dataset_path = manager.get_local_copy(
remote_url="https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"
)
# make sure to copy the printed value
print("COPY THIS DATASET PATH: {}".format(dataset_path))
```
Expected response:
```bash
COPY THIS DATASET PATH: ~/.clearml/cache/storage_manager/global/f2751d3a22ccb78db0e07874912b5c43.cifar-10-python_artifacts_archive_None
```
The script prints the path to the downloaded data. It will be needed later on.
### Creating the Dataset
To create the dataset, execute the following command:
```
clearml-data create --project dataset_examples --name cifar_dataset
```
Expected response:
```
clearml-data - Dataset Management & Versioning CLI
Creating a new dataset:
New dataset created id=ee1c35f60f384e65bc800f42f0aca5ec
```
Where `ee1c35f60f384e65bc800f42f0aca5ec` is the dataset ID.
## Adding Files
Add the files we just downloaded to the dataset:
```
clearml-data add --files <dataset_path>
```
where `dataset_path` is the path that was printed earlier, which denotes the location of the downloaded dataset.
:::note
There's no need to specify a `dataset_id`, since the `clearml-data` session stores it.
:::
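If you run `add` from a different shell session, where the stored session state is not available, the dataset ID printed by the `create` step can be passed explicitly. A sketch, using the ID from above:
```
clearml-data add --id ee1c35f60f384e65bc800f42f0aca5ec --files <dataset_path>
```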
## Finalizing the Dataset
Run the [`close`](../../references/sdk/dataset.md#close) command to upload the files (by default, they are uploaded to the ClearML Server):<br/>
```
clearml-data close
```
This command sets the dataset task's status to *completed*, so it will no longer be modifiable. This ensures future
reproducibility.
The information about the dataset, including a list of files and their sizes, can be viewed
in the WebApp, in the dataset task's **ARTIFACTS** tab.
![image](../../img/examples_data_management_cifar_dataset.png)
## Using the Dataset
Now that we have a new dataset registered, we can consume it.
The [data_ingestion.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/data_ingestion.py) example
script demonstrates using the dataset within Python code.
```python
from clearml import Dataset
from torchvision import datasets, transforms

dataset_name = "cifar_dataset"
dataset_project = "dataset_examples"

# get a local (cached) copy of the dataset registered above
dataset_path = Dataset.get(dataset_name=dataset_name, dataset_project=dataset_project).get_local_copy()

# pass the local path to PyTorch's CIFAR10 dataset object
transform = transforms.Compose([transforms.ToTensor()])
trainset = datasets.CIFAR10(
    root=dataset_path,
    train=True,
    download=False,
    transform=transform
)
```
The Dataset's [`get_local_copy`](../../references/sdk/dataset.md#get_local_copy) method returns a path to the cached,
downloaded dataset, which is then passed to PyTorch's `CIFAR10` dataset object.
The script then trains a neural network to classify images using the dataset created above.


@ -0,0 +1,94 @@
---
title: Data Management with Python
---
The [dataset_creation.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/dataset_creation.py) and
[data_ingestion.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/data_ingestion.py)
together demonstrate how to use ClearML's [`Dataset`](../../references/sdk/dataset.md) class to create a dataset and
subsequently ingest the data.
## Dataset Creation
The [dataset_creation.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/dataset_creation.py) script
demonstrates how to do the following:
* Create a dataset and add files to it
* Upload the dataset to the ClearML Server
* Finalize the dataset
### Downloading the Data
We first need to obtain a local copy of the CIFAR dataset.
```python
from clearml import StorageManager
manager = StorageManager()
dataset_path = manager.get_local_copy(remote_url="https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz")
```
This snippet downloads the data; `dataset_path` contains the path to the downloaded local copy.
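Before registering the archive, you can sanity-check that the download produced a valid tar file. This is a minimal sketch; the `is_valid_tar` helper is ours, not part of ClearML:

```python
import tarfile

def is_valid_tar(path: str) -> bool:
    """Return True if the file at `path` is a readable tar archive."""
    return tarfile.is_tarfile(path)

# is_valid_tar(dataset_path) should be True for cifar-10-python.tar.gz
```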
### Creating the Dataset
```python
from clearml import Dataset

dataset = Dataset.create(dataset_name="cifar_dataset", dataset_project="dataset_examples")
```
This creates a data processing task called `cifar_dataset` in the `dataset_examples` project, which
can be viewed in the WebApp.
### Adding Files
```python
dataset.add_files(path=dataset_path)
```
This adds the downloaded files to the current dataset.
### Uploading the Files
```python
dataset.upload()
```
This uploads the dataset to the ClearML Server by default. The dataset's destination can be changed by specifying the
target storage with the `output_url` parameter of the [`upload`](../../references/sdk/dataset.md#upload) method.
### Finalizing the Dataset
Use the [`finalize`](../../references/sdk/dataset.md#finalize) method to close the dataset and set the dataset task's
status to *completed*. A dataset can only be finalized if it doesn't have any pending uploads.
```python
dataset.finalize()
```
After a dataset has been closed, it can no longer be modified. This ensures future reproducibility.
The information about the dataset, including a list of files and their sizes, can be viewed
in the WebApp, in the dataset task's **ARTIFACTS** tab.
![image](../../img/examples_data_management_cifar_dataset.png)
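For reference, the creation steps above can be combined into one short function. This is a sketch rather than the example file itself, and it assumes a configured ClearML environment with network access:

```python
def create_cifar_dataset():
    """Download CIFAR-10, then create, populate, upload, and finalize a dataset (sketch)."""
    from clearml import Dataset, StorageManager

    # download a local copy of CIFAR-10
    dataset_path = StorageManager().get_local_copy(
        remote_url="https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"
    )
    # create the dataset, attach the files, upload, and finalize
    dataset = Dataset.create(dataset_name="cifar_dataset", dataset_project="dataset_examples")
    dataset.add_files(path=dataset_path)
    dataset.upload()
    dataset.finalize()
    return dataset
```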
## Data Ingestion
Now that we have a new dataset registered, we can consume it!
The [data_ingestion.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/data_ingestion.py) script
demonstrates data ingestion using the dataset created in the first script.
```python
from clearml import Dataset

dataset_name = "cifar_dataset"
dataset_project = "dataset_examples"

dataset_path = Dataset.get(dataset_name=dataset_name, dataset_project=dataset_project).get_local_copy()
```
The script above gets the dataset and uses the [`Dataset.get_local_copy`](../../references/sdk/dataset.md#get_local_copy)
method to return a path to the cached, read-only local dataset. If you need a modifiable copy of the dataset,
use `Dataset.get(dataset_name, dataset_project).get_mutable_local_copy(path/to/download)`.
The script then creates a neural network to train a model to classify images from the dataset that was
created above.
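If you need a working copy you can modify (for example, to derive a new dataset version), `get_mutable_local_copy` copies the dataset into a folder you choose instead of the read-only cache. A hedged sketch; the target folder name here is our own choice:

```python
def get_editable_copy(target_folder="./cifar_editable"):
    """Fetch a modifiable copy of the dataset into `target_folder` (sketch)."""
    from clearml import Dataset

    dataset = Dataset.get(dataset_name="cifar_dataset", dataset_project="dataset_examples")
    return dataset.get_mutable_local_copy(target_folder)
```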


@ -61,6 +61,7 @@ module.exports = {
{'Advanced': ['guides/advanced/execute_remotely', 'guides/advanced/multiple_tasks_single_process']},
{'Automation': ['guides/automation/manual_random_param_search_example', 'guides/automation/task_piping']},
{'Clearml Task': ['guides/clearml-task/clearml_task_tutorial']},
{'Datasets': ['guides/datasets/data_man_cifar_classification', 'guides/datasets/data_man_python']},
{'Distributed': ['guides/distributed/distributed_pytorch_example', 'guides/distributed/subprocess_example']},
{'Docker': ['guides/docker/extra_docker_shell_script']},
{'Frameworks': [