---
title: Dataset Management with CLI and SDK
---

In this tutorial, you are going to manage the CIFAR dataset with the `clearml-data` CLI, and then use ClearML's
[`Dataset`](../../references/sdk/dataset.md) class to ingest the data.

## Creating the Dataset

### Downloading the Data
Before registering the CIFAR dataset with `clearml-data`, you need to obtain a local copy of it.

Execute this Python script to download the data:
```python
from clearml import StorageManager

manager = StorageManager()
dataset_path = manager.get_local_copy(
    remote_url="https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"
)
# make sure to copy the printed value
print("COPY THIS DATASET PATH: {}".format(dataset_path))
```

Expected response:
```bash
COPY THIS DATASET PATH: ~/.clearml/cache/storage_manager/global/f2751d3a22ccb78db0e07874912b5c43.cifar-10-python_artifacts_archive_None
```
The script prints the path to the downloaded data. It will be needed later on.

### Creating the Dataset
To create the dataset, execute the following command:
```
clearml-data create --project dataset_examples --name cifar_dataset
```

Expected response:
```
clearml-data - Dataset Management & Versioning CLI
Creating a new dataset:
New dataset created id=ee1c35f60f384e65bc800f42f0aca5ec
```
Where `ee1c35f60f384e65bc800f42f0aca5ec` is the dataset ID.

## Adding Files
Add the files that were just downloaded to the dataset:

```
clearml-data add --files <dataset_path>
```

Where `dataset_path` is the path that was printed earlier, which denotes the location of the downloaded dataset.

:::note
There's no need to specify a `dataset_id`, since the `clearml-data` session stores it.
:::
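
If you are working in a new session where the dataset ID is no longer stored, you can pass the ID explicitly. A minimal sketch, assuming the CLI's `--id` flag and reusing the dataset ID created above:
```
clearml-data add --id ee1c35f60f384e65bc800f42f0aca5ec --files <dataset_path>
```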

## Finalizing the Dataset
Run the [`close`](../../references/sdk/dataset.md#close) command to upload the files (they are uploaded to the ClearML Server by default):

```
clearml-data close
```

This command sets the dataset task's status to *completed*, so it will no longer be modifiable. This ensures future
reproducibility.

Information about the dataset can be viewed in the WebApp, in the dataset's [details panel](../../webapp/datasets/webapp_dataset_viewing.md#version-details-panel).
In the panel's **CONTENT** tab, you can see a table summarizing version contents, including file names, file sizes, and hashes.

![Dataset content tab](../../img/examples_data_management_cifar_dataset.png)

## Using the Dataset

Now that a new dataset is registered, you can consume it.

The [data_ingestion.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/data_ingestion.py) example
script demonstrates using the dataset within Python code.

```python
from clearml import Dataset
from torchvision import datasets, transforms

dataset_name = "cifar_dataset"
dataset_project = "dataset_examples"

# get the registered dataset and download a cached local copy
dataset_path = Dataset.get(
    dataset_name=dataset_name,
    dataset_project=dataset_project,
    alias="Cifar dataset"
).get_local_copy()

# a minimal transform so the snippet runs standalone;
# the full example script defines its own
transform = transforms.Compose([transforms.ToTensor()])

trainset = datasets.CIFAR10(
    root=dataset_path,
    train=True,
    download=False,
    transform=transform
)
```

In cases like this, where you use a dataset in a task, you can have the dataset's ID stored in the task's
hyperparameters: passing `alias=<dataset_alias_string>` stores the dataset's ID in the
`dataset_alias_string` parameter in the experiment's **CONFIGURATION > HYPERPARAMETERS > Datasets** section. This way
you can easily track which dataset the task is using.

The Dataset's [`get_local_copy`](../../references/sdk/dataset.md#get_local_copy) method returns a path to the cached,
downloaded dataset. The dataset path is then passed to PyTorch's `datasets.CIFAR10` constructor.

The script then trains a neural network to classify images using the dataset created above.
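
For orientation, here is a minimal sketch of such a training loop over `trainset`. The model, loss, and optimizer below are illustrative placeholders, not the example script's exact choices:
```python
import torch
from torch import nn
from torch.utils.data import DataLoader

# `trainset` is the datasets.CIFAR10 instance created above
loader = DataLoader(trainset, batch_size=64, shuffle=True)

# an intentionally simple model: flatten each 3x32x32 image into one linear layer
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for images, labels in loader:  # one epoch
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```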

---
title: Data Management with Python
---

The [dataset_creation.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/dataset_creation.py) and
[data_ingestion.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/data_ingestion.py) scripts
together demonstrate how to use ClearML's [`Dataset`](../../references/sdk/dataset.md) class to create a dataset and
subsequently ingest the data.

## Dataset Creation

The [dataset_creation.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/dataset_creation.py) script
demonstrates how to do the following:
* Create a dataset and add files to it
* Upload the dataset to the ClearML Server
* Finalize the dataset

### Downloading the Data

You first need to obtain a local copy of the CIFAR dataset.

```python
from clearml import StorageManager

manager = StorageManager()
dataset_path = manager.get_local_copy(
    remote_url="https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"
)
```

This script downloads the data; `dataset_path` contains the path to the downloaded copy.

### Creating the Dataset

```python
from clearml import Dataset

dataset = Dataset.create(
    dataset_name="cifar_dataset",
    dataset_project="dataset examples"
)
```

This creates a data processing task called `cifar_dataset` in the `dataset examples` project, which
can be viewed in the WebApp.

### Adding Files

```python
dataset.add_files(path=dataset_path)
```

This adds the downloaded files to the current dataset.

### Uploading the Files

```python
dataset.upload()
```
This uploads the dataset to the ClearML Server by default. The dataset's destination can be changed by specifying the
target storage with the `output_url` parameter of the [`upload`](../../references/sdk/dataset.md#upload) method.

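To illustrate, a minimal sketch of uploading to external storage instead (the bucket URL is a placeholder):
```python
# upload the dataset contents to S3 rather than the default ClearML file server
dataset.upload(output_url="s3://my-bucket/datasets")
```
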
### Finalizing the Dataset

Run the [`finalize`](../../references/sdk/dataset.md#finalize) command to close the dataset and set the dataset task's
status to *completed*. The dataset can only be finalized if it doesn't have any pending uploads.

```python
dataset.finalize()
```

After a dataset has been closed, it can no longer be modified. This ensures future reproducibility.

Information about the dataset can be viewed in the WebApp, in the dataset's [details panel](../../webapp/datasets/webapp_dataset_viewing.md#version-details-panel).
In the panel's **CONTENT** tab, you can see a table summarizing version contents, including file names, file sizes, and hashes.

![Dataset content tab](../../img/examples_data_management_cifar_dataset.png)

## Data Ingestion

Now that a new dataset is registered, you can consume it!

The [data_ingestion.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/data_ingestion.py) script
demonstrates data ingestion using the dataset created in the first script.

```python
from clearml import Dataset

dataset_name = "cifar_dataset"
dataset_project = "dataset_examples"

dataset_path = Dataset.get(
    dataset_name=dataset_name,
    dataset_project=dataset_project
).get_local_copy()
```

The script above gets the dataset and uses the [`Dataset.get_local_copy`](../../references/sdk/dataset.md#get_local_copy)
method to return a path to the cached, read-only local dataset.

If you need a modifiable copy of the dataset, use the following:
```python
Dataset.get(
    dataset_name=dataset_name,
    dataset_project=dataset_project
).get_mutable_local_copy("path/to/download")
```
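
A minimal sketch assuming the method's `overwrite` parameter, which lets the copy replace any existing contents of the target folder:
```python
# overwrite=True deletes any existing contents of the target folder first
Dataset.get(
    dataset_name=dataset_name,
    dataset_project=dataset_project
).get_mutable_local_copy("path/to/download", overwrite=True)
```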

The script then trains a neural network to classify images using the dataset created above.

```diff
@@ -128,7 +128,7 @@ module.exports = {
     {'Automation': ['guides/automation/manual_random_param_search_example', 'guides/automation/task_piping']},
     {'ClearML Task': ['guides/clearml-task/clearml_task_tutorial']},
     {'ClearML Agent': ['guides/clearml_agent/executable_exp_containers', 'guides/clearml_agent/exp_environment_containers']},
-    {'Datasets': ['guides/datasets/data_man_cifar_classification', 'guides/datasets/data_man_python']},
+    {'Datasets': ['clearml_data/data_management_examples/data_man_cifar_classification', 'clearml_data/data_management_examples/data_man_python']},
     {'Distributed': ['guides/distributed/distributed_pytorch_example', 'guides/distributed/subprocess_example']},
     {'Docker': ['guides/docker/extra_docker_shell_script']},
     {'Frameworks': [
```