diff --git a/docs/guides/datasets/data_man_cifar_classification.md b/docs/guides/datasets/data_man_cifar_classification.md
new file mode 100644
index 00000000..48bf65f4
--- /dev/null
+++ b/docs/guides/datasets/data_man_cifar_classification.md
@@ -0,0 +1,98 @@
+---
+title: Dataset Management with CLI and SDK
+---
+
+In this tutorial, we are going to manage the CIFAR dataset with the `clearml-data` CLI, and then use ClearML's [`Dataset`](../../references/sdk/dataset.md)
+class to ingest the data.
+
+## Creating the Dataset
+
+### Downloading the Data
+Before we can register the CIFAR dataset with `clearml-data`, we need to obtain a local copy of it.
+
+Execute this Python script to download the data:
+```python
+from clearml import StorageManager
+
+manager = StorageManager()
+dataset_path = manager.get_local_copy(
+    remote_url="https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"
+)
+# make sure to copy the printed value
+print("COPY THIS DATASET PATH: {}".format(dataset_path))
+```
+
+Expected response:
+```bash
+COPY THIS DATASET PATH: ~/.clearml/cache/storage_manager/global/f2751d3a22ccb78db0e07874912b5c43.cifar-10-python_artifacts_archive_None
+```
+The script prints the path to the downloaded data. It will be needed later on.
+
+### Creating the Dataset
+To create the dataset, execute the following command:
+```
+clearml-data create --project dataset_examples --name cifar_dataset
+```
+
+Expected response:
+```
+clearml-data - Dataset Management & Versioning CLI
+Creating a new dataset:
+New dataset created id=ee1c35f60f384e65bc800f42f0aca5ec
+```
+where `ee1c35f60f384e65bc800f42f0aca5ec` is the dataset ID.
+
+## Adding Files
+Add the files we just downloaded to the dataset:
+
+```
+clearml-data add --files <dataset_path>
+```
+
+where `<dataset_path>` is the path that was printed earlier, which denotes the location of the downloaded dataset.
+
+:::note
+There's no need to specify a `dataset_id`, since the `clearml-data` session stores it.
+:::
+
+## Finalizing the Dataset
+Run the [`close`](../../references/sdk/dataset.md#close) command to upload the files (they are uploaded to the ClearML Server by default):
+
+```
+clearml-data close
+```
+
+This command sets the dataset task's status to *completed*, so it will no longer be modifiable. This ensures future
+reproducibility.
+
+The information about the dataset, including a list of files and their sizes, can be viewed
+in the WebApp, in the dataset task's **ARTIFACTS** tab.
+
+![image](../../img/examples_data_management_cifar_dataset.png)
+
+## Using the Dataset
+
+Now that we have a new dataset registered, we can consume it.
+
+The [data_ingestion.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/data_ingestion.py) example
+script demonstrates using the dataset within Python code.
+
+```python
+from clearml import Dataset
+from torchvision import datasets, transforms
+
+dataset_name = "cifar_dataset"
+dataset_project = "dataset_examples"
+
+dataset_path = Dataset.get(dataset_name=dataset_name, dataset_project=dataset_project).get_local_copy()
+
+# a minimal transform for illustration; the full example script defines its own
+transform = transforms.ToTensor()
+
+trainset = datasets.CIFAR10(
+    root=dataset_path,
+    train=True,
+    download=False,
+    transform=transform
+)
+```
+The Dataset's [`get_local_copy`](../../references/sdk/dataset.md#get_local_copy) method returns a path to the cached,
+downloaded dataset. Then we provide the path to PyTorch's dataset object.
+
+The script then trains a neural network to classify images using the dataset created above.
\ No newline at end of file
diff --git a/docs/guides/datasets/data_man_python.md b/docs/guides/datasets/data_man_python.md
new file mode 100644
index 00000000..2766a10f
--- /dev/null
+++ b/docs/guides/datasets/data_man_python.md
@@ -0,0 +1,94 @@
+---
+title: Data Management with Python
+---
+
+The [dataset_creation.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/dataset_creation.py) and
+[data_ingestion.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/data_ingestion.py) scripts
+together demonstrate how to use ClearML's [`Dataset`](../../references/sdk/dataset.md) class to create a dataset and
+subsequently ingest the data.
+
+## Dataset Creation
+
+The [dataset_creation.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/dataset_creation.py) script
+demonstrates how to do the following:
+* Create a dataset and add files to it
+* Upload the dataset to the ClearML Server
+* Finalize the dataset
+
+### Downloading the Data
+
+We first need to obtain a local copy of the CIFAR dataset.
+
+```python
+from clearml import StorageManager
+
+manager = StorageManager()
+dataset_path = manager.get_local_copy(remote_url="https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz")
+```
+
+This script downloads the data, and `dataset_path` contains the path to the downloaded data.
+
+### Creating the Dataset
+
+```python
+from clearml import Dataset
+
+dataset = Dataset.create(dataset_name="cifar_dataset", dataset_project="dataset examples")
+```
+
+This creates a data processing task called `cifar_dataset` in the `dataset examples` project, which
+can be viewed in the WebApp.
+
+### Adding Files
+
+```python
+dataset.add_files(path=dataset_path)
+```
+
+This adds the downloaded files to the current dataset.
+
+### Uploading the Files
+
+```python
+dataset.upload()
+```
+This uploads the dataset to the ClearML Server by default. The dataset's destination can be changed by specifying the
+target storage with the `output_url` parameter of the [`upload`](../../references/sdk/dataset.md#upload) method.
+
+### Finalizing the Dataset
+
+Run the [`finalize`](../../references/sdk/dataset.md#finalize) command to close the dataset and set the dataset task's
+status to *completed*. The dataset can only be finalized if it doesn't have any pending uploads.
+
+```python
+dataset.finalize()
+```
+
+After a dataset has been closed, it can no longer be modified. This ensures future reproducibility.
+
+The information about the dataset, including a list of files and their sizes, can be viewed
+in the WebApp, in the dataset task's **ARTIFACTS** tab.
+
+![image](../../img/examples_data_management_cifar_dataset.png)
+
+## Data Ingestion
+
+Now that we have a new dataset registered, we can consume it!
+
+The [data_ingestion.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/data_ingestion.py) script
+demonstrates data ingestion using the dataset created in the first script.
+
+```python
+from clearml import Dataset
+
+dataset_name = "cifar_dataset"
+dataset_project = "dataset_examples"
+
+dataset_path = Dataset.get(dataset_name=dataset_name, dataset_project=dataset_project).get_local_copy()
+```
+
+The script above gets the dataset and uses the [`Dataset.get_local_copy`](../../references/sdk/dataset.md#get_local_copy)
+method to return a path to the cached, read-only local dataset. If you need a modifiable copy of the dataset, use
+`Dataset.get(dataset_name, dataset_project).get_mutable_local_copy("path/to/download")` instead.
+
+The script then creates a neural network to train a model to classify images from the dataset that was
+created above.
\ No newline at end of file
diff --git a/sidebars.js b/sidebars.js
index 4e681985..209f4d69 100644
--- a/sidebars.js
+++ b/sidebars.js
@@ -61,6 +61,7 @@ module.exports = {
     {'Advanced': ['guides/advanced/execute_remotely', 'guides/advanced/multiple_tasks_single_process']},
     {'Automation': ['guides/automation/manual_random_param_search_example', 'guides/automation/task_piping']},
     {'Clearml Task': ['guides/clearml-task/clearml_task_tutorial']},
+    {'Datasets': ['guides/datasets/data_man_cifar_classification', 'guides/datasets/data_man_python']},
     {'Distributed': ['guides/distributed/distributed_pytorch_example', 'guides/distributed/subprocess_example']},
     {'Docker': ['guides/docker/extra_docker_shell_script']},
     {'Frameworks': [