---
title: Data Management with Python
---
The [dataset_creation.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/dataset_creation.py) and
[data_ingestion.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/data_ingestion.py) scripts
together demonstrate how to use ClearML's [`Dataset`](../../references/sdk/dataset.md) class to create a dataset and
subsequently ingest the data.
## Dataset Creation
The [dataset_creation.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/dataset_creation.py) script
demonstrates how to do the following:
* Create a dataset and add files to it
* Upload the dataset to the ClearML Server
* Finalize the dataset
### Downloading the Data
You first need a local copy of the CIFAR-10 dataset.
The code below downloads the data and stores the path to the local copy in `dataset_path`:
```python
from clearml import StorageManager

# Download the CIFAR-10 archive and keep the path to the local copy
manager = StorageManager()
dataset_path = manager.get_local_copy(
    remote_url="https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"
)
```
### Creating the Dataset
The following code creates a data processing task called `cifar_dataset` in the `dataset examples` project, which
can be viewed in the [WebApp](../../webapp/datasets/webapp_dataset_viewing.md).
```python
from clearml import Dataset

# Create a new dataset named "cifar_dataset" in the "dataset examples" project
dataset = Dataset.create(
    dataset_name="cifar_dataset",
    dataset_project="dataset examples"
)
```
### Adding Files
Add the downloaded files to the current dataset:
```python
dataset.add_files(path=dataset_path)
```
### Uploading the Files
Upload the dataset:
```python
dataset.upload()
```
By default, the dataset is uploaded to the ClearML file server. To change the destination, specify the
target storage with the `output_url` parameter of the [`upload`](../../references/sdk/dataset.md#upload) method.
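As a rough sketch of how `output_url` might be used, the helper below uploads to the default destination unless a target is given. The function name `upload_dataset` and the bucket URL in the comment are hypothetical and only serve as illustration; `dataset` is assumed to be an open (not yet finalized) `Dataset` object.

```python
def upload_dataset(dataset, output_url=None):
    """Upload the dataset's pending files.

    `dataset` is assumed to be a clearml Dataset that is not finalized yet.
    When `output_url` is None, the default destination is used (the ClearML
    file server, according to the text above).
    """
    if output_url is None:
        dataset.upload()
    else:
        dataset.upload(output_url=output_url)


# Example call (hypothetical bucket name, for illustration only):
# upload_dataset(dataset, output_url="s3://my-bucket/datasets")
```

Storage credentials for the chosen target are expected to be configured separately; that setup is outside the scope of this sketch.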
### Finalizing the Dataset
Run the [`finalize`](../../references/sdk/dataset.md#finalize) method to close the dataset and set the dataset task's
status to *completed*. The dataset can only be finalized if it doesn't have any pending uploads.
```python
dataset.finalize()
```
After a dataset has been closed, it can no longer be modified. This ensures future reproducibility.
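
Taken together, the steps in this section can be sketched as a single helper. This is a minimal illustration and not the code of `dataset_creation.py`: `build_dataset` and its `dataset_cls` parameter are hypothetical names introduced here so that the flow can also be exercised with a stand-in class. By default the helper uses `clearml.Dataset`.

```python
def build_dataset(files_path, name, project, dataset_cls=None):
    """Create a dataset, add files, upload them, and finalize it."""
    if dataset_cls is None:
        # Use the real ClearML class unless a stand-in was passed in
        from clearml import Dataset as dataset_cls

    dataset = dataset_cls.create(dataset_name=name, dataset_project=project)
    dataset.add_files(path=files_path)
    dataset.upload()    # finalizing requires that no uploads are pending
    dataset.finalize()  # after this call the dataset can no longer be modified
    return dataset
```

The order matters: files are added first, then uploaded, and only then is the dataset finalized.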

Information about the dataset can be viewed in the WebApp, in the dataset's [details panel](../../webapp/datasets/webapp_dataset_viewing.md#version-details-panel).
In the panel's **CONTENT** tab, you can see a table summarizing version contents, including file names, file sizes, and hashes.

![Dataset content tab](../../img/examples_data_management_cifar_dataset.png)

## Data Ingestion
Now that a new dataset is registered, you can consume it!

The [data_ingestion.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/data_ingestion.py) script
demonstrates data ingestion using the dataset created in the first script.

The following code gets the dataset and calls [`Dataset.get_local_copy()`](../../references/sdk/dataset.md#get_local_copy),
which returns the path to a cached, read-only local copy of the dataset.
```python
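from clearml import Dataset

# These must match the name and project used when the dataset was created
dataset_name = "cifar_dataset"
dataset_project = "dataset examples"

# Fetch the dataset and get the path to its cached, read-only local copy
dataset_path = Dataset.get(
    dataset_name=dataset_name,
    dataset_project=dataset_project
).get_local_copy()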
```
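
To check what the local copy contains before feeding it to a training script, a small helper like the one below can list the files under the returned path. It is only a sketch: `list_dataset_files` is a hypothetical name, and the helper assumes nothing beyond the returned path being a directory on the local file system.

```python
from pathlib import Path


def list_dataset_files(dataset_path):
    """Return (relative path, size in bytes) for every file under dataset_path."""
    root = Path(dataset_path)
    files = sorted(p for p in root.rglob("*") if p.is_file())
    return [(p.relative_to(root).as_posix(), p.stat().st_size) for p in files]


# Example usage with the path returned by get_local_copy():
# for name, size in list_dataset_files(dataset_path):
#     print(f"{name}: {size} bytes")
```

Because the cached copy is read-only, the helper only reads file metadata and never writes to the folder.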
If you need a modifiable copy of the dataset, use the following code:
```python
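# Keyword arguments are used because the first positional parameter of Dataset.get is the dataset ID
Dataset.get(dataset_name=dataset_name, dataset_project=dataset_project).get_mutable_local_copy("path/to/download")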
```
The script then builds a neural network and trains it to classify images from the dataset
created above.