clearml-docs/docs/guides/data management/data_man_simple.md

---
title: Data Management Example
---

In this example we'll create a simple dataset and demonstrate basic actions on it. 

## Prerequisites
First, make sure that you have cloned the [clearml](https://github.com/allegroai/clearml) repository. This contains all
the needed files.
1. Open terminal and change directory to the cloned repository's examples folder
    `cd clearml/examples/reporting`

## Creating initial dataset

1. To create the dataset, run this code:

    ```bash
    clearml-data create --project datasets --name HelloDataset
    ```

    Expected response:

    ```bash
    clearml-data - Dataset Management & Versioning CLI
    Creating a new dataset:
    New dataset created id=24d05040f3e14fbfbed8edb1bf08a88c
    ```

1. Now let's add a folder. File addition is recursive, so it's enough to point at the folder 
to captures all files and subfolders:
   
   ```bash
   clearml-data add --files data_samples
   ```
   
   Expected response:
   
   ```bash
   clearml-data - Dataset Management & Versioning CLI
   Adding files/folder to dataset id 24d05040f3e14fbfbed8edb1bf08a88c
   Generating SHA2 hash for 2 files
   Hash generation completed
   5 files added
   ```
:::note
After creating a dataset, we don't have to specify its ID when running commands, such as *add*, *remove* or *list*
:::

1. Close the dataset - this command uploads the files. By default, the files are uploaded to the file server, but  
this can be configured with the `--storage` flag to any of ClearML's supported storage mediums (see [storage](../../integrations/storage.md)).
The command also finalizes the dataset, making it immutable and ready to be consumed.

   ```bash
   clearml-data close
   ```

   Expected response:

   ```bash
   clearml-data - Dataset Management & Versioning CLI
   Finalizing dataset id 24d05040f3e14fbfbed8edb1bf08a88c
   Pending uploads, starting dataset upload to https://files.community-master.hosted.allegro.ai
   Pending uploads, starting dataset upload to https://files.community.clear.ml
   Uploading compressed dataset changes (4 files, total 221.56 KB) to https://files.community.clear.ml
   Upload completed (221.56 KB)
   2021-05-04 09:32:03,388 - clearml.Task - INFO - Waiting to finish uploads
   2021-05-04 09:32:04,067 - clearml.Task - INFO - Finished uploading
   Dataset closed and finalized
   ```

## Listing Dataset content

To see that all the files were added to the created dataset, use `clearml-data list` and enter the ID of the dataset
that was just closed.

   ```bash
  clearml-data list --id 24d05040f3e14fbfbed8edb1bf08a88c
   ```

Expected response:

```console
clearml-data - Dataset Management & Versioning CLI 

List dataset content: 24d05040f3e14fbfbed8edb1bf08a88c 
Listing dataset content
file name                                                        | size       | hash                                                            
------------------------------------------------------------------------------------------------------------------------------------------------
dancing.jpg                                                      |     40,484 | 78e804c0c1d54da8d67e9d072c1eec514b91f4d1f296cdf9bf16d6e54d63116a
data.csv                                                         |     21,440 | b618696f57b822cd2e9b92564a52b3cc93a2206f41df3f022956bb6cfe4e7ad5
picasso.jpg                                                      |    114,573 | 6b3c67ea9ec82b09bd7520dd09dad2f1176347d740fd2042c88720e780691a7c
sample.json                                                      |        132 | 9c42a9a978ac7a71873ebd5c65985e613cfaaff1c98f655af0d2ee0246502fd7
sample.mp3                                                       |     72,142 | fbb756ae14005420ff00ccdaff99416bebfcea3adb7e30963a69e68e9fbe361b
Total 5 files, 248771 bytes
```

## Creating a Child Dataset

In Clear Data, it's possible to create datasets that inherit the content of other datasets, there are called child datasets.

1. Create a new dataset, specifying the previously created one as its parent:

   ```bash
   clearml-data create --project datasets --name HelloDataset-improved --parents 24d05040f3e14fbfbed8edb1bf08a88c
   ```
:::note
You'll need to input the Dataset ID you received when created the dataset above 
:::

1. Now, we want to add a new file. 
   * Create a new file: `echo "data data data" > new_data.txt` (this will create the file `new_data.txt`),
   * Now add the file to the dataset:  

   ```bash
   clearml-data add --files new_data.txt
   ```
   Which should return this output:

   ```console
   clearml-data - Dataset Management & Versioning CLI
   Adding files/folder to dataset id 8b68686a4af040d081027ba3cf6bbca6
   1 file added
   ```
   
1. Let's also remove a file. We'll need to specify the file's full path (within the dataset, not locally) to remove it.

   ```bash
   clearml-data remove --files data_samples/dancing.jpg
   ```

   Expected response:
   ```bash
   clearml-data - Dataset Management & Versioning CLI
   Removing files/folder from dataset id 8b68686a4af040d081027ba3cf6bbca6
   1 files removed
   ```

1. Close and finalize the dataset

   ```bash
   clearml-data close
   ```
   
1. Let's take a look again at the files in the dataset:

   ```
   clearml-data list --id 8b68686a4af040d081027ba3cf6bbca6
   ```

   And we see that our changes have been made! `new_data.txt` has been added, and `dancing.jpg` has been removed. 

   ```
   file name                                                        | size       | hash                                                            
   ------------------------------------------------------------------------------------------------------------------------------------------------
   data.csv                                                         |     21,440 | b618696f57b822cd2e9b92564a52b3cc93a2206f41df3f022956bb6cfe4e7ad5
   new_data.txt                                                     |         15 | 6df986a2154902260a836febc5a32543f5337eac60560c57db99257a7e012051
   picasso.jpg                                                      |    114,573 | 6b3c67ea9ec82b09bd7520dd09dad2f1176347d740fd2042c88720e780691a7c
   sample.json                                                      |        132 | 9c42a9a978ac7a71873ebd5c65985e613cfaaff1c98f655af0d2ee0246502fd7
   sample.mp3                                                       |     72,142 | fbb756ae14005420ff00ccdaff99416bebfcea3adb7e30963a69e68e9fbe361b
   Total 5 files, 208302 bytes
   ```

By using `clearml-data`, a clear lineage is created for the data. As seen in this example, when a dataset is closed, the 
only way to add or remove data is to create a new dataset, and using the previous dataset as a parent. This way, the data 
is not reliant on the code and is reproducible.
Initial commit 2021-05-13 23:48:51 +00:00			`---`
			`title: Data Management Example`
			`---`

			`In this example we'll create a simple dataset and demonstrate basic actions on it.`

			`## Prerequisites`
			`First, make sure that you have cloned the [clearml](https://github.com/allegroai/clearml) repository. This contains all`
			`the needed files.`
			`1. Open terminal and change directory to the cloned repository's examples folder`
			`cd clearml/examples/reporting`

			`## Creating initial dataset`

			`1. To create the dataset, run this code:`

			```bash
			`clearml-data create --project datasets --name HelloDataset`
			```

			`Expected response:`

			```bash
			`clearml-data - Dataset Management & Versioning CLI`
			`Creating a new dataset:`
			`New dataset created id=24d05040f3e14fbfbed8edb1bf08a88c`
			```

			`1. Now let's add a folder. File addition is recursive, so it's enough to point at the folder`
			`to captures all files and subfolders:`

			```bash
			`clearml-data add --files data_samples`
			```

			`Expected response:`

			```bash
			`clearml-data - Dataset Management & Versioning CLI`
			`Adding files/folder to dataset id 24d05040f3e14fbfbed8edb1bf08a88c`
			`Generating SHA2 hash for 2 files`
			`Hash generation completed`
			`5 files added`
			```
			`:::note`
			`After creating a dataset, we don't have to specify its ID when running commands, such as add, remove or list`
			`:::`

			`1. Close the dataset - this command uploads the files. By default, the files are uploaded to the file server, but`
			this can be configured with the `--storage` flag to any of ClearML's supported storage mediums (see [storage](../../integrations/storage.md)).
			`The command also finalizes the dataset, making it immutable and ready to be consumed.`

			```bash
			`clearml-data close`
			```

			`Expected response:`

			```bash
			`clearml-data - Dataset Management & Versioning CLI`
			`Finalizing dataset id 24d05040f3e14fbfbed8edb1bf08a88c`
			`Pending uploads, starting dataset upload to https://files.community-master.hosted.allegro.ai`
			`Pending uploads, starting dataset upload to https://files.community.clear.ml`
			`Uploading compressed dataset changes (4 files, total 221.56 KB) to https://files.community.clear.ml`
			`Upload completed (221.56 KB)`
			`2021-05-04 09:32:03,388 - clearml.Task - INFO - Waiting to finish uploads`
			`2021-05-04 09:32:04,067 - clearml.Task - INFO - Finished uploading`
			`Dataset closed and finalized`
			```

			`## Listing Dataset content`

			To see that all the files were added to the created dataset, use `clearml-data list` and enter the ID of the dataset
			`that was just closed.`

			```bash
			`clearml-data list --id 24d05040f3e14fbfbed8edb1bf08a88c`
			```

			`Expected response:`

			```console
			`clearml-data - Dataset Management & Versioning CLI`

			`List dataset content: 24d05040f3e14fbfbed8edb1bf08a88c`
			`Listing dataset content`
			`file name \| size \| hash`
			`------------------------------------------------------------------------------------------------------------------------------------------------`
			`dancing.jpg \| 40,484 \| 78e804c0c1d54da8d67e9d072c1eec514b91f4d1f296cdf9bf16d6e54d63116a`
			`data.csv \| 21,440 \| b618696f57b822cd2e9b92564a52b3cc93a2206f41df3f022956bb6cfe4e7ad5`
			`picasso.jpg \| 114,573 \| 6b3c67ea9ec82b09bd7520dd09dad2f1176347d740fd2042c88720e780691a7c`
			`sample.json \| 132 \| 9c42a9a978ac7a71873ebd5c65985e613cfaaff1c98f655af0d2ee0246502fd7`
			`sample.mp3 \| 72,142 \| fbb756ae14005420ff00ccdaff99416bebfcea3adb7e30963a69e68e9fbe361b`
			`Total 5 files, 248771 bytes`
			```

			`## Creating a Child Dataset`

			`In Clear Data, it's possible to create datasets that inherit the content of other datasets, there are called child datasets.`

			`1. Create a new dataset, specifying the previously created one as its parent:`

			```bash
			`clearml-data create --project datasets --name HelloDataset-improved --parents 24d05040f3e14fbfbed8edb1bf08a88c`
			```
			`:::note`
			`You'll need to input the Dataset ID you received when created the dataset above`
			`:::`

			`1. Now, we want to add a new file.`
			* Create a new file: `echo "data data data" > new_data.txt` (this will create the file `new_data.txt`),
			`* Now add the file to the dataset:`

			```bash
			`clearml-data add --files new_data.txt`
			```
			`Which should return this output:`

			```console
			`clearml-data - Dataset Management & Versioning CLI`
			`Adding files/folder to dataset id 8b68686a4af040d081027ba3cf6bbca6`
			`1 file added`
			```

			`1. Let's also remove a file. We'll need to specify the file's full path (within the dataset, not locally) to remove it.`

			```bash
			`clearml-data remove --files data_samples/dancing.jpg`
			```

			`Expected response:`
			```bash
			`clearml-data - Dataset Management & Versioning CLI`
			`Removing files/folder from dataset id 8b68686a4af040d081027ba3cf6bbca6`
			`1 files removed`
			```

			`1. Close and finalize the dataset`

			```bash
			`clearml-data close`
			```

			`1. Let's take a look again at the files in the dataset:`

			```
			`clearml-data list --id 8b68686a4af040d081027ba3cf6bbca6`
			```

			And we see that our changes have been made! `new_data.txt` has been added, and `dancing.jpg` has been removed.

			```
			`file name \| size \| hash`
			`------------------------------------------------------------------------------------------------------------------------------------------------`
			`data.csv \| 21,440 \| b618696f57b822cd2e9b92564a52b3cc93a2206f41df3f022956bb6cfe4e7ad5`
			`new_data.txt \| 15 \| 6df986a2154902260a836febc5a32543f5337eac60560c57db99257a7e012051`
			`picasso.jpg \| 114,573 \| 6b3c67ea9ec82b09bd7520dd09dad2f1176347d740fd2042c88720e780691a7c`
			`sample.json \| 132 \| 9c42a9a978ac7a71873ebd5c65985e613cfaaff1c98f655af0d2ee0246502fd7`
			`sample.mp3 \| 72,142 \| fbb756ae14005420ff00ccdaff99416bebfcea3adb7e30963a69e68e9fbe361b`
			`Total 5 files, 208302 bytes`
			```

			By using `clearml-data`, a clear lineage is created for the data. As seen in this example, when a dataset is closed, the
			`only way to add or remove data is to create a new dataset, and using the previous dataset as a parent. This way, the data`
			`is not reliant on the code and is reproducible.`