---
title: ClearML Data
---
In Machine Learning, you are very likely dealing with a gargantuan amount of data that you need to put in a dataset,
which you then need to be able to share, reproduce, and track.
ClearML Data Management solves two important challenges:
- Accessibility - Making data easily accessible from every machine,
- Versioning - Linking data and experiments for better **traceability**.
**We believe Data is not code**. It should not be stored in a git tree, because progress on datasets is not always linear.
Moreover, it can be difficult and inefficient to find on a git tree the commit associated with a certain version of a dataset.
A `clearml-data` dataset is a collection of files, stored on a central storage location (S3 / GS / Azure / Network Storage).
Datasets can be set up to inherit from other datasets, so data lineages can be created,
and users can track when and how their data changes.
Dataset changes are stored using differential storage, meaning each version stores only the change-set relative to its parent datasets.
Local copies of datasets are always cached, so the same data never needs to be downloaded twice.
When a dataset is pulled it will automatically pull all parent datasets and merge them into one output folder for you to work with.
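
The change-set and parent-merge behavior described above can be pictured with a small toy model. This is only a conceptual sketch, not ClearML's implementation: the `DatasetVersion` class, its fields, and the rule that later parents override earlier ones are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class DatasetVersion:
    """Toy model (hypothetical): a version stores only its own change-set."""
    name: str
    parents: List["DatasetVersion"] = field(default_factory=list)
    added_or_modified: Dict[str, str] = field(default_factory=dict)  # path -> content hash
    removed: Set[str] = field(default_factory=set)

    def resolve(self) -> Dict[str, str]:
        """Merge all parents in order, then apply this version's change-set."""
        files: Dict[str, str] = {}
        for parent in self.parents:
            # Assumption: on conflict, a later parent overrides an earlier one.
            files.update(parent.resolve())
        for path in self.removed:
            files.pop(path, None)
        files.update(self.added_or_modified)
        return files


base = DatasetVersion("v1", added_or_modified={"a.csv": "h1", "b.csv": "h2"})
child = DatasetVersion(
    "v2",
    parents=[base],
    added_or_modified={"b.csv": "h2-new", "c.csv": "h3"},
    removed={"a.csv"},
)
merged = child.resolve()  # the full file listing a consumer of "v2" would see
```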
ClearML Data offers two interfaces:
- `clearml-data` - CLI utility for creating, uploading, and managing datasets.
- `clearml.Dataset` - A python interface for creating, retrieving, managing, and using datasets.
## Setup
`clearml-data` comes built-in with our `clearml` python package! Just check out the [getting started](getting_started/ds/ds_first_steps.md) guide for more info!
## Workflow
Below is an example workflow that uses ClearML Data's command line tool to create a dataset, and then integrates the dataset into code
using ClearML Data's Python interface.
### Creating a Dataset
Using the `clearml-data` CLI, users can create datasets using the following commands:
```bash
clearml-data create --project dataset_example --name initial_version
clearml-data add --files data_folder
clearml-data close
```
The commands will do the following:
1. Start a Data Processing Task called "initial_version" in the "dataset_example" project
1. The CLI will return a unique ID for the dataset
1. All the files from the "data_folder" folder will be added to the dataset and uploaded
by default to the [ClearML server](deploying_clearml/clearml_server.md).
1. The dataset will be finalized, making it immutable and ready to be consumed.
:::note
`clearml-data` is stateful and remembers the last created dataset so there's no need to specify a specific dataset ID unless
we want to work on another dataset.
:::
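
For comparison, the same create / add / close sequence can be sketched with the Python interface. This is a minimal sketch that assumes the SDK's `Dataset.create`, `add_files`, `upload`, and `finalize` calls; the helper name `create_initial_version` is hypothetical.

```python
def create_initial_version(data_folder: str) -> str:
    """Create, upload, and finalize a dataset; return its ID."""
    # Imported lazily so this snippet can be loaded where ClearML is not configured.
    from clearml import Dataset

    dataset = Dataset.create(
        dataset_name="initial_version",
        dataset_project="dataset_example",
    )
    dataset.add_files(path=data_folder)
    dataset.upload()    # by default, files go to the ClearML file server
    dataset.finalize()  # after this call the dataset can no longer be modified
    return dataset.id
```

Calling `create_initial_version("data_folder")` mirrors the three CLI commands above. Unlike `clearml-data close`, the SDK sketch uploads explicitly before finalizing.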
### Using a Dataset
Now in our python code, we can access and use the created dataset from anywhere:
```python
from clearml import Dataset
local_path = Dataset.get(dataset_id='dataset_id_from_previous_command').get_local_copy()
```
All our files are now available under `local_path`, in the same folder structure. It is that simple!<br/>
The next step is to set the `dataset_id` as a parameter for our code, and voilà! We can now train on any dataset we have in
the system.
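
One way to turn the dataset ID into a parameter is a command line argument. The following is a minimal sketch; the `--dataset-id` flag and the `parse_args` / `main` helpers are hypothetical names chosen for this example.

```python
import argparse


def parse_args(argv=None):
    """Parse the (hypothetical) --dataset-id flag of a training script."""
    parser = argparse.ArgumentParser(description="Train on a ClearML dataset")
    parser.add_argument("--dataset-id", required=True, help="ID of a finalized dataset")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Imported lazily so this snippet can be loaded where ClearML is not configured.
    from clearml import Dataset

    # Downloads the dataset (or reuses the local cache) and returns the folder path.
    local_path = Dataset.get(dataset_id=args.dataset_id).get_local_copy()
    print(f"Dataset files are available under: {local_path}")
    return local_path
```

A training script would call `main()` at startup and read its data from the returned folder.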
## CLI Options
It's possible to manage datasets (create / modify / upload / delete) with the `clearml-data` command line tool.
### Creating a Dataset
```bash
clearml-data create --project <project_name> --name <dataset_name> --parents <existing_dataset_id>
```
Creates a new dataset. <br/>
**Parameters**
|Name|Description|Optional|
|---|---|---|
|name |Dataset's name| <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|project|Dataset's project| <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|parents|IDs of the dataset's parents. The dataset inherits all of its parents' content. Multiple parents can be entered, but they are merged in the order they were entered| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|tags |Dataset user tags. The dataset can be labeled, which can be useful for organizing datasets| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
:::important
`clearml-data` works in a stateful mode, so once a new dataset is created, the following commands
do not require the `--id` flag.
:::
<br/>
### Add Files
```bash
clearml-data add --id <dataset_id> --files <filenames/folders_to_add>
```
It's possible to add individual files or complete folders.<br/>
**Parameters**
|Name|Description|Optional|
|---|---|---|
|id | Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|files|Files / folders to add. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json` | <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|dataset-folder | Dataset base folder to add the files to in the dataset. Default: dataset root| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|non-recursive | Disable recursive scan of files | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|verbose | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
<br/>
### Remove Files
```bash
clearml-data remove --id <dataset_id_to_remove_from> --files <filenames/folders_to_remove>
```
**Parameters**
|Name|Description|Optional|
|---|---|---|
|id | Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|files | Files / folders to remove (wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`). Notice: file path is the path within the dataset, not the local path.| <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|non-recursive | Disable recursive scan of files | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|verbose | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
<br/>
### Upload Dataset Content
```bash
clearml-data upload [--id <dataset_id>] [--storage <upload_destination>]
```
Uploads added files to [ClearML Server](deploying_clearml/clearml_server.md) by default. It's possible to specify a different storage
medium by entering an upload destination, such as `s3://bucket`, `gs://`, `azure://`, `/mnt/shared/`.
**Parameters**
|Name|Description|Optional|
|---|---|---|
|id| Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|storage| Remote storage to use for the dataset files. Default: files_server | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|verbose | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
<br/>
### Finalize Dataset
```bash
clearml-data close --id <dataset_id>
```
Finalizes the dataset and makes it ready to be consumed.
It automatically uploads all files that were not previously uploaded.
Once a dataset is finalized, it can no longer be modified.
**Parameters**
|Name|Description|Optional|
|---|---|---|
|id| Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|storage| Remote storage to use for the dataset files. Default: files_server | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|disable-upload | Disable automatic upload when closing the dataset | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|verbose | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
<br/>
### Sync Local Folder
```bash
clearml-data sync [--id <dataset_id>] --folder <folder_location> [--parents '<parent_id>']
```
This command syncs a folder's content with ClearML. It is useful when a user has a single source of truth (i.e. a folder) which
is updated from time to time.
When an update should be reflected in ClearML, users can call `clearml-data sync` with the folder's location to create a new dataset version.
The changes (file additions, modifications, and removals) will be reflected in ClearML.
By default, this command also uploads the data and finalizes the dataset automatically (see `skip-close` below).
**Parameters**
|Name|Description|Optional|
|---|---|---|
|id| Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|folder|Local folder to sync. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|storage|Remote storage to use for the dataset files. Default: files_server |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|parents|IDs of the dataset's parents (i.e. merge all parents). All modifications made to the folder since the parents were synced will be reflected in the dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|project|If creating a new dataset, specify the dataset's project name|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|name|If creating a new dataset, specify the dataset's name|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|tags|Dataset user tags|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|skip-close|Do not auto close dataset after syncing folders|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|verbose | Verbose reporting |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
<br/>
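
A roughly equivalent flow in Python might look like the sketch below. It assumes the SDK's `Dataset.create` (with `parent_datasets`), `sync_folder`, `upload` (with `output_url`), and `finalize` calls; the helper name and its arguments are hypothetical.

```python
def sync_folder_to_new_version(folder, parent_id, project, name, storage=None):
    """Create a child dataset version whose content mirrors `folder`; return its ID."""
    # Imported lazily so this snippet can be loaded where ClearML is not configured.
    from clearml import Dataset

    dataset = Dataset.create(
        dataset_name=name,
        dataset_project=project,
        parent_datasets=[parent_id],
    )
    # Registers additions, modifications, and removals relative to the parent version.
    dataset.sync_folder(local_path=folder)
    # storage=None falls back to the default upload destination.
    dataset.upload(output_url=storage)
    dataset.finalize()
    return dataset.id
```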
### List Dataset Content
```bash
clearml-data list [--id <dataset_id>]
```
**Parameters**
|Name|Description|Optional|
|---|---|---|
|id|Dataset ID whose contents will be shown (alternatively, use project / name combination). Default: previously accessed dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|project|Specify dataset project name (if used instead of ID, dataset name is also required)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|name|Specify dataset name (if used instead of ID, dataset project is also required)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|filter|Filter files based on folder / wildcard. Multiple filters are supported. Example: `folder/date_*.json folder/sub-folder`|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|modified|Only list file changes (add / remove / modify) introduced in this version|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
<br/>
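
To illustrate what a folder / wildcard filter does, here is a small standalone sketch built on Python's `fnmatch`. The exact matching rules `clearml-data list` applies are not described here, so treat the `filter_files` helper and its semantics as assumptions.

```python
from fnmatch import fnmatch


def filter_files(paths, filters):
    """Keep paths that match any filter (a wildcard pattern or a folder prefix)."""
    def matches(path, flt):
        folder = flt.rstrip("/")
        # A filter matches either as a wildcard pattern or as a containing folder.
        return fnmatch(path, flt) or path.startswith(folder + "/")

    return [p for p in paths if any(matches(p, f) for f in filters)]


files = [
    "folder/date_01.json",
    "folder/notes.txt",
    "folder/sub-folder/img.png",
    "other/date_02.json",
]
selected = filter_files(files, ["folder/date_*.json", "folder/sub-folder"])
```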
### Delete Dataset
```bash
clearml-data delete [--id <dataset_id_to_delete>]
```
Deletes an entire dataset from ClearML. This can also be used to delete a newly created dataset.
By default, this does not work on datasets with children. Use `--force` to delete a dataset that other dataset versions depend on.
**Parameters**
|Name|Description|Optional|
|---|---|---|
|id|ID of dataset to be deleted. Default: previously created / accessed dataset that hasn't been finalized yet|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|force|Force dataset deletion even if other dataset versions depend on it|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
<br/>
### Search for a Dataset
```bash
clearml-data search [--ids <dataset_ids>] [--name <name>] [--project <project_name>] [--tags <tag>]
```
Lists all datasets in the system that match the search request.
Datasets can be searched by project, name, ID, and tags.
**Parameters**
|Name|Description|Optional|
|---|---|---|
|ids|A list of dataset IDs|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|project|The project name of the datasets|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|name|A dataset name or a partial name to filter datasets by|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|tags|A list of dataset user tags|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
<br/>
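
The same kind of search can be sketched from Python. This assumes the SDK's `Dataset.list_datasets` call and that each result is a dictionary with `id`, `project`, and `name` keys; the `search_datasets` wrapper is a hypothetical helper.

```python
def search_datasets(project=None, name=None, tags=None, ids=None):
    """Return (id, project, name) tuples for datasets matching the filters."""
    # Imported lazily so this snippet can be loaded where ClearML is not configured.
    from clearml import Dataset

    results = Dataset.list_datasets(
        dataset_project=project,
        partial_name=name,  # a full or partial dataset name
        tags=tags,
        ids=ids,
    )
    return [(d["id"], d["project"], d["name"]) for d in results]
```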
### Compare Two Datasets
```bash
clearml-data compare --source SOURCE --target TARGET
```
Compare two datasets (target vs. source). The command returns a comparison summary that looks like this:
```
Comparison summary: 4 files removed, 3 files modified, 0 files added
```
**Parameters**
|Name|Description|Optional|
|---|---|---|
|source|Source dataset id (used as baseline)|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|target|Target dataset id (compare against the source baseline dataset)|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|verbose|Verbose report all file changes (instead of summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
### Merge Datasets
```bash
clearml-data squash --name NAME --ids [IDS [IDS ...]]
```
Squash (merge) multiple datasets into a single dataset version.
**Parameters**
|Name|Description|Optional|
|---|---|---|
|name|Create squashed dataset name|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|ids|Source dataset IDs to squash (merge down)|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|storage|Remote storage to use for the dataset files. Default: files_server |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|verbose|Verbose report all file changes (instead of summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
### Verify Dataset
```bash
clearml-data verify [--id ID] [--folder FOLDER]
```
Verify that the dataset content matches the data from the local source.
**Parameters**
|Name|Description|Optional|
|---|---|---|
|id|Specify dataset ID. Default: previously created/accessed dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|folder|Specify dataset local copy (if not provided the local cache folder will be verified)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|filesize| If True, only verify file size and skip hash checks (default: false)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|verbose|Verbose report all file changes (instead of summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
### Get a Dataset
```bash
clearml-data get [--id ID] [--copy COPY] [--link LINK] [--overwrite]
```
Get a local copy of a dataset. By default, you get a read-only cached folder, but you can get a mutable copy by using the
`--copy` flag.
**Parameters**
|Name|Description|Optional|
|---|---|---|
|id| Specify dataset ID. Default: previously created / accessed dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|copy| Get a writable copy of the dataset to a specific output folder|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|link| Create a soft link (not supported on Windows) to a read-only cached folder containing the dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|overwrite| If True, overwrite the target folder|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|verbose| Verbose report all file changes (instead of summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
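
In Python, the read-only versus writable distinction might be sketched as follows. This assumes the SDK's `Dataset.get`, `get_local_copy`, and `get_mutable_local_copy` calls; the `get_dataset_copy` wrapper is a hypothetical helper.

```python
def get_dataset_copy(dataset_id, copy_to=None, overwrite=False):
    """Return a path to the dataset: the cached folder, or a writable copy."""
    # Imported lazily so this snippet can be loaded where ClearML is not configured.
    from clearml import Dataset

    dataset = Dataset.get(dataset_id=dataset_id)
    if copy_to is None:
        # Cached folder: treat it as read-only.
        return dataset.get_local_copy()
    # Writable copy in a folder of your choice, similar in spirit to `--copy`.
    return dataset.get_mutable_local_copy(target_folder=copy_to, overwrite=overwrite)
```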
### Publish a Dataset
```bash
clearml-data publish --id ID
```
Publish the dataset for public use. The dataset must be [finalized](#finalize-dataset) before it is published.
**Parameters**
|Name|Description|Optional|
|---|---|---|
|id| The dataset task id to be published.|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
## Python API
It's also possible to manage a dataset using ClearML Data's python interface.
All API commands should be imported with:
```python
from clearml import Dataset
```
See all API commands in the [Dataset](references/sdk/dataset.md) reference page.
## Tutorials
Take a look at the ClearML Data tutorials:
* [Dataset Management with CLI and SDK](guides/data%20management/data_man_cifar_classification)
* [Dataset Management with CLI](guides/data%20management/data_man_simple)
* [Folder Sync with CLI](guides/data%20management/data_man_folder_sync)