Refactor ClearML Data docs (#108)

pollfly
2021-11-08 13:21:44 +02:00
committed by GitHub
parent 43751dc64b
commit e155c49cfd
17 changed files with 847 additions and 683 deletions


@@ -0,0 +1,46 @@
---
title: Best Practices
---
The following are some recommendations for using ClearML Data.
## Versioning Datasets
Use ClearML Data to version your datasets. Once a dataset is finalized, it can no longer be modified. This makes clear
which version of the dataset was used with which task, enabling the accurate reproduction of your experiments.
When you need to change the dataset's contents, create a new version of the dataset by specifying the previous
dataset as a parent. The new version inherits the previous version's contents, which you can then update.
## Organize Datasets for Easier Access
Organize the datasets according to use-cases and use tags. This makes managing multiple datasets and
accessing the most updated datasets for different use-cases easier.
Like any ClearML task, datasets can be organized into [projects (and sub-projects)](../fundamentals/projects.md#creating-sub-projects).
Additionally, when creating a dataset, tags can be applied to the dataset, which will make searching for the dataset easier.
Organizing your datasets into projects by use-case makes it easier to access the most recent dataset version for that use-case.
If only a project is specified when using [`Dataset.get`](../references/sdk/dataset.md#datasetget), the method returns the
most recent dataset in a project. The same is true with tags; if a tag is specified, the method will return the most recent dataset that is labeled with that tag.
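
The lookup described above could be sketched roughly as follows. The helper names are hypothetical, and the `clearml` import sits inside each helper only so that the module loads without a configured ClearML environment; the calls themselves assume `Dataset.get` accepts `dataset_project` and `dataset_tags` as described in this section.

```python
def latest_in_project(project):
    """Return the most recent dataset in `project` (hypothetical helper)."""
    # Lazy import: requires the `clearml` package and a configured server.
    from clearml import Dataset

    return Dataset.get(dataset_project=project)


def latest_with_tags(tags, project=None):
    """Return the most recent dataset labeled with `tags`, optionally within `project`."""
    from clearml import Dataset

    return Dataset.get(dataset_project=project, dataset_tags=list(tags))
```

For example, `latest_with_tags(["production"])` would fetch the newest dataset carrying a `production` tag, where the tag name is a placeholder.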
## Document your Datasets
Attach informative metrics or debug samples to the Dataset itself. Use the [`get_logger`](../references/sdk/dataset.md#get_logger)
method to access the dataset's logger object, then add any additional information to the dataset, using the methods
available with a [logger](../references/sdk/logger.md) object.
You can add some dataset summaries (like [table reporting](../references/sdk/logger.md#report_table)) to create a preview
of the data stored for better visibility, or attach any statistics generated by the data ingestion process.
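
A minimal sketch of attaching such a summary is shown below. The helper names, the table title, and the per-extension summary are illustrative assumptions; the sketch also assumes `report_table` accepts a list of rows (header first) for `table_plot`, with a pandas `DataFrame` being the other common form.

```python
from collections import Counter
from pathlib import PurePosixPath


def extension_summary(file_paths):
    """Build table rows (header first) counting files per extension."""
    counts = Counter(PurePosixPath(p).suffix or "<none>" for p in file_paths)
    rows = [["extension", "file count"]]
    rows += [[ext, n] for ext, n in sorted(counts.items())]
    return rows


def report_summary(dataset, file_paths):
    """Attach the summary table to the dataset through its logger."""
    logger = dataset.get_logger()
    logger.report_table(
        title="Dataset summary",       # illustrative title
        series="files per extension",  # illustrative series name
        iteration=0,
        table_plot=extension_summary(file_paths),
    )
```

Here `dataset` is expected to be a `Dataset` object that has not been finalized yet, and `file_paths` is whatever list of files the ingestion step produced.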
## Periodically Update Your Dataset
Your data probably changes from time to time. If the data is updated in the same local / network folder structure, which
serves as the dataset's single point of truth, you can schedule a script that uses the dataset `sync` functionality to
update the dataset based on the modifications made to the folder. This way, there is no need to modify a dataset manually.
This functionality also tracks the modifications made to the folder.
See the sync function with the [CLI](clearml_data_cli.md#syncing-local-storage) or [SDK](clearml_data_sdk.md#syncing-local-storage)
interface.
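
One possible shape for such a scheduled script is sketched below. It simply shells out to `clearml-data sync` with the flags listed in the CLI reference (`--folder`, `--project`, `--name`, `--parents`); the function names are hypothetical, and passing several parent IDs after a single `--parents` flag is an assumption.

```python
import subprocess


def build_sync_command(folder, project=None, name=None, parents=()):
    """Assemble a `clearml-data sync` invocation as an argument list."""
    cmd = ["clearml-data", "sync", "--folder", str(folder)]
    if project:
        cmd += ["--project", project]
    if name:
        cmd += ["--name", name]
    if parents:
        cmd += ["--parents", *parents]
    return cmd


def run_sync(folder, **kwargs):
    """Run the sync; raises CalledProcessError if the CLI exits with a non-zero status."""
    return subprocess.run(build_sync_command(folder, **kwargs), check=True)
```

A scheduler of your choice (cron, for instance) can then call `run_sync` periodically with the folder that serves as the single point of truth.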


@@ -0,0 +1,92 @@
---
title: Introduction
---
In Machine Learning, you are very likely dealing with a gargantuan amount of data that you need to put in a dataset,
which you then need to be able to share, reproduce, and track.
ClearML Data Management solves two important challenges:
- Accessibility - Making data easily accessible from every machine.
- Versioning - Linking data and experiments for better **traceability**.
**We believe Data is not code**. It should not be stored in a git tree, because progress on datasets is not always linear.
Moreover, it can be difficult and inefficient to find the commit associated with a certain version of a dataset in a git tree.
A `clearml-data` dataset is a collection of files, stored on a central storage location (S3 / GS / Azure / Network Storage).
Datasets can be set up to inherit from other datasets, so data lineages can be created,
and users can track when and how their data changes.
Dataset changes are stored using differential storage, meaning a version stores only the change-set relative to its parent datasets.
Local copies of datasets are always cached, so the same data never needs to be downloaded twice.
When a dataset is pulled it will automatically pull all parent datasets and merge them into one output folder for you to work with.
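
To make the change-set idea concrete, here is a small conceptual model in plain Python. It is not ClearML's implementation: the `Version` class and `resolve` function are hypothetical, file contents are represented by placeholder hash strings, and each version has a single parent for simplicity.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Version:
    """A dataset version that stores only its change-set relative to its parent."""
    changed: Dict[str, str] = field(default_factory=dict)  # path -> hash (added or modified)
    removed: Set[str] = field(default_factory=set)          # paths deleted in this version
    parent: Optional["Version"] = None


def resolve(version: Version) -> Dict[str, str]:
    """Rebuild the full file listing by applying change-sets from the root down."""
    files = resolve(version.parent) if version.parent else {}
    for path in version.removed:
        files.pop(path, None)
    files.update(version.changed)
    return files


v1 = Version(changed={"a.csv": "h1", "b.csv": "h2"})
v2 = Version(changed={"b.csv": "h3", "c.csv": "h4"}, removed={"a.csv"}, parent=v1)
print(resolve(v2))  # {'b.csv': 'h3', 'c.csv': 'h4'}
```

Pulling `v2` in this model walks up to `v1` first and then applies `v2`'s changes, which mirrors the way a pulled dataset merges its parents into one output folder.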
## Setup
`clearml-data` comes built-in with the `clearml` Python package! Just check out the [Getting Started](../getting_started/ds/ds_first_steps.md)
guide for more info!
## Using ClearML Data
ClearML Data offers two interfaces:
- `clearml-data` - A CLI utility for creating, uploading, and managing datasets. See [CLI](clearml_data_cli.md) for a reference of `clearml-data` commands.
- `clearml.Dataset` - A Python interface for creating, retrieving, managing, and using datasets. See [SDK](clearml_data_sdk.md) for an overview of the basic methods of the `Dataset` module.
For an overview of our recommendations for ClearML Data workflows and practices, see [Best Practices](best_practices.md).
## WebApp
ClearML's WebApp provides a visual interface to your datasets through dataset tasks. Dataset tasks are categorized
as data-processing [task type](../fundamentals/task.md#task-types), and they are labeled with a `DATASET` system tag.
The full log (calls / CLI) of the dataset creation process can be found in a dataset's **EXECUTION** section.
A listing of the dataset differential snapshot, a summary of files added / modified / removed, and details of the files in the
differential snapshot (location / size / hash) are available in the **ARTIFACTS** section. Download the dataset
by clicking <img src="/docs/latest/icons/ico-download-json.svg" alt="Download" className="icon size-sm space-sm" />,
next to the **FILE PATH**.
The full dataset listing (all files included) is available in the **CONFIGURATION** section under **Dataset Content**.
This allows you to quickly compare two dataset contents and visually see the difference.
The dataset genealogy DAG and change-set summary table are visualized in **RESULTS > PLOTS**.
<details className="cml-expansion-panel screenshot">
<summary className="cml-expansion-panel-summary">Dataset Contents</summary>
<div className="cml-expansion-panel-content">
![Dataset data WebApp](../img/dataset_data.png)
</div>
</details>
<br/>
View a DAG of the dataset dependencies (all previous dataset versions and their parents) in the dataset's page **> ARTIFACTS > state**.
<details className="cml-expansion-panel screenshot">
<summary className="cml-expansion-panel-summary">Data Dependency DAG</summary>
<div className="cml-expansion-panel-content">
![Dataset state WebApp](../img/dataset_data_state.png)
</div>
</details>
Once a dataset has been finalized, view its genealogy in the dataset's
page **>** **RESULTS** **>** **PLOTS**.
<details className="cml-expansion-panel screenshot">
<summary className="cml-expansion-panel-summary">Dataset Genealogy</summary>
<div className="cml-expansion-panel-content">
![Dataset genealogy and summary](../img/dataset_genealogy_summary.png)
</div>
</details>


@@ -0,0 +1,328 @@
---
title: CLI
---
The `clearml-data` utility is a CLI tool for controlling and managing your data with ClearML.
The following page provides a reference to `clearml-data`'s CLI commands.
### Creating a Dataset
```bash
clearml-data create --project <project_name> --name <dataset_name> --parents <existing_dataset_id>
```
Creates a new dataset. <br/>
**Parameters**
<div className="tbl-cmd">
|Name|Description|Optional|
|---|---|---|
|`--name` |Dataset's name| <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|`--project`|Dataset's project| <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|`--parents`|IDs of the dataset's parents. The dataset inherits all of its parents' content. Multiple parents can be entered, but they are merged in the order they were entered| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--tags` |Dataset user tags. The dataset can be labeled, which can be useful for organizing datasets| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
</div>
:::important
`clearml-data` works in a stateful mode, so once a new dataset is created, the following commands
do not require the `--id` flag.
:::
<br/>
### Adding Files
```bash
clearml-data add --id <dataset_id> --files <filenames/folders_to_add>
```
It's possible to add individual files or complete folders.<br/>
**Parameters**
<div className="tbl-cmd">
|Name|Description|Optional|
|---|---|---|
|`--id` | Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--files`|Files / folders to add. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json` | <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|`--dataset-folder` | Dataset base folder to add the files to in the dataset. Default: dataset root| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--non-recursive` | Disable recursive scan of files | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--verbose` | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
</div>
<br/>
### Removing Files
```bash
clearml-data remove --id <dataset_id_to_remove_from> --files <filenames/folders_to_remove>
```
**Parameters**
<div className="tbl-cmd">
|Name|Description|Optional|
|---|---|---|
|`--id` | Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--files` | Files / folders to remove (wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`). Notice: file path is the path within the dataset, not the local path.| <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|`--non-recursive` | Disable recursive scan of files | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--verbose` | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
</div>
<br/>
### Uploading Dataset Content
```bash
clearml-data upload [--id <dataset_id>] [--storage <upload_destination>]
```
Uploads added files to [ClearML Server](../deploying_clearml/clearml_server.md) by default. It's possible to specify a different storage
medium by entering an upload destination, such as `s3://bucket`, `gs://`, `azure://`, `/mnt/shared/`.
**Parameters**
<div className="tbl-cmd">
|Name|Description|Optional|
|---|---|---|
|`--id`| Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--storage`| Remote storage to use for the dataset files. Default: files_server | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--verbose` | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
</div>
<br/>
### Finalizing a Dataset
```bash
clearml-data close --id <dataset_id>
```
Finalizes the dataset and makes it ready to be consumed.
It automatically uploads all files that were not previously uploaded.
Once a dataset is finalized, it can no longer be modified.
**Parameters**
<div className="tbl-cmd">
|Name|Description|Optional|
|---|---|---|
|`--id`| Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--storage`| Remote storage to use for the dataset files. Default: files_server | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--disable-upload` | Disable automatic upload when closing the dataset | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--verbose` | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
</div>
<br/>
### Syncing Local Storage
```
clearml-data sync [--id <dataset_id>] --folder <folder_location> [--parents '<parent_id>']
```
This option syncs a folder's content with ClearML. It is useful when a user has a single point of truth (i.e. a folder) that
is updated from time to time.
When an update should be reflected in ClearML, call `clearml-data sync` to create a new dataset and specify the folder;
the changes (file addition, modification, or removal) will be reflected in ClearML.
This command also uploads the data and finalizes the dataset automatically.
**Parameters**
<div className="tbl-cmd">
|Name|Description|Optional|
|---|---|---|
|`--id`| Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--folder`|Local folder to sync. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|`--storage`|Remote storage to use for the dataset files. Default: files_server |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--parents`|IDs of the dataset's parents (i.e. merge all parents). All modifications made to the folder since the parents were synced will be reflected in the dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--project`|If creating a new dataset, specify the dataset's project name|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--name`|If creating a new dataset, specify the dataset's name|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--tags`|Dataset user tags|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--skip-close`|Do not auto close dataset after syncing folders|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--verbose` | Verbose reporting |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
</div>
<br/>
### Listing Dataset Content
```bash
clearml-data list [--id <dataset_id>]
```
**Parameters**
<div className="tbl-cmd">
|Name|Description|Optional|
|---|---|---|
|`--id`|Dataset ID whose contents will be shown (alternatively, use project / name combination). Default: previously accessed dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--project`|Specify dataset project name (if used instead of ID, dataset name is also required)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--name`|Specify dataset name (if used instead of ID, dataset project is also required)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--filter`|Filter files based on folder / wildcard. Multiple filters are supported. Example: `folder/date_*.json folder/sub-folder`|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--modified`|Only list file changes (add / remove / modify) introduced in this version|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
</div>
<br/>
### Deleting a Dataset
```
clearml-data delete [--id <dataset_id_to_delete>]
```
Deletes an entire dataset from ClearML. This can also be used to delete a newly created dataset.
By default, this does not work on datasets with children; use `--force` to delete a dataset even if other dataset versions depend on it.
**Parameters**
<div className="tbl-cmd">
|Name|Description|Optional|
|---|---|---|
|`--id`|ID of dataset to be deleted. Default: previously created / accessed dataset that hasn't been finalized yet|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--force`|Force dataset deletion even if other dataset versions depend on it|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
</div>
<br/>
### Searching for a Dataset
```
clearml-data search [--ids <dataset_id>] [--name <name>] [--project <project_name>] [--tags <tag>]
```
Lists all datasets in the system that match the search request.
Datasets can be searched by project, name, ID, and tags.
**Parameters**
<div className="tbl-cmd">
|Name|Description|Optional|
|---|---|---|
|`--ids`|A list of dataset IDs|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--project`|The project name of the datasets|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--name`|A dataset name or a partial name to filter datasets by|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--tags`|A list of dataset user tags|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
</div>
<br/>
### Comparing Two Datasets
```
clearml-data compare [--source SOURCE] [--target TARGET]
```
Compare two datasets (target vs. source). The command returns a comparison summary that looks like this:
```
Comparison summary: 4 files removed, 3 files modified, 0 files added
```
**Parameters**
<div className="tbl-cmd">
|Name|Description|Optional|
|---|---|---|
|`--source`|Source dataset id (used as baseline)|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|`--target`|Target dataset id (compare against the source baseline dataset)|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|`--verbose`|Verbose report all file changes (instead of summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
</div>
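
As an illustration of what the three counts in the comparison summary mean, the sketch below compares two file listings given as `{path: hash}` dictionaries. It is a conceptual model only, not the CLI's implementation, and the function names are hypothetical.

```python
def compare_listings(source, target):
    """Compare two {path: hash} listings, using `source` as the baseline."""
    removed = sorted(set(source) - set(target))
    added = sorted(set(target) - set(source))
    modified = sorted(p for p in set(source) & set(target) if source[p] != target[p])
    return removed, modified, added


def summary_line(source, target):
    """Format the counts in the same style as the summary shown above."""
    removed, modified, added = compare_listings(source, target)
    return "Comparison summary: {} files removed, {} files modified, {} files added".format(
        len(removed), len(modified), len(added)
    )


source = {"a.csv": "h1", "b.csv": "h2", "c.csv": "h3"}
target = {"b.csv": "h2", "c.csv": "h9", "d.csv": "h4"}
print(summary_line(source, target))
# Comparison summary: 1 files removed, 1 files modified, 1 files added
```

A file counts as modified when it exists in both listings under the same path but with a different hash.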
### Merging Datasets
```
clearml-data squash --name NAME --ids [IDS [IDS ...]]
```
Squash (merge) multiple datasets into a single dataset version.
**Parameters**
<div className="tbl-cmd">
|Name|Description|Optional|
|---|---|---|
|`--name`|Create squashed dataset name|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|`--ids`|Source dataset IDs to squash (merge down)|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|`--storage`|Remote storage to use for the dataset files. Default: files_server |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--verbose`|Verbose report all file changes (instead of summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
</div>
### Verifying a Dataset
```
clearml-data verify [--id ID] [--folder FOLDER]
```
Verify that the dataset content matches the data from the local source.
**Parameters**
<div className="tbl-cmd">
|Name|Description|Optional|
|---|---|---|
|`--id`|Specify dataset ID. Default: previously created/accessed dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--folder`|Specify dataset local copy (if not provided the local cache folder will be verified)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--filesize`| If True, only verify file size and skip hash checks (default: false)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--verbose`|Verbose report all file changes (instead of summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
</div>
### Getting a Dataset
```
clearml-data get [--id ID] [--copy COPY] [--link LINK] [--overwrite]
```
Get a local copy of a dataset. By default, you get a read-only cached folder, but you can get a mutable copy by using the
`--copy` flag.
**Parameters**
<div className="tbl-cmd">
|Name|Description|Optional|
|---|---|---|
|`--id`| Specify dataset ID. Default: previously created / accessed dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--copy`| Get a writable copy of the dataset to a specific output folder|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--link`| Create a soft link (not supported on Windows) to a read-only cached folder containing the dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--overwrite`| If True, overwrite the target folder|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--verbose`| Verbose report all file changes (instead of summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
</div>
### Publishing a Dataset
```
clearml-data publish --id ID
```
Publish the dataset for public use. The dataset must be [finalized](#finalizing-a-dataset) before it is published.
**Parameters**
<div className="tbl-cmd">
|Name|Description|Optional|
|---|---|---|
|`--id`| The dataset task id to be published.|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
</div>


@@ -0,0 +1,148 @@
---
title: SDK
---
Datasets can be created, modified, and managed with ClearML Data's Python interface. The following page provides an overview
of the most basic methods of the `Dataset` class. See the [Dataset reference page](../references/sdk/dataset.md)
for a complete list of available methods.
Import the `Dataset` class, and let's get started!
```python
from clearml import Dataset
```
## Creating Datasets
ClearML Data supports multiple ways to create datasets programmatically, which provides for a variety of use-cases:
* [`Dataset.create()`](#datasetcreate) - Create a new dataset. Parent datasets can be specified, from which the new dataset
will inherit its data
* [`Dataset.squash()`](#datasetsquash) - Generate a new dataset by squashing together a set of related datasets
### Dataset.create()
Use the [`Dataset.create`](../references/sdk/dataset.md#datasetcreate) class method to create a dataset.
Creating datasets programmatically is especially helpful when preprocessing the data so that the
preprocessing code and the resulting dataset are saved in a single task (see `use_current_task` parameter in [`Dataset.create`](../references/sdk/dataset.md#datasetcreate)).
```python
# Preprocessing code here
dataset = Dataset.create(
    dataset_name='dataset name',
    dataset_project='dataset project',
    parent_datasets=[PARENT_DS_ID_1, PARENT_DS_ID_2]
)
```
The created dataset inherits the content of the `parent_datasets`. When multiple dataset parents are listed,
they are merged in order of specification. Each parent overrides any overlapping files from a previous parent dataset.
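
The merge order can be pictured with plain dictionaries. This is a conceptual sketch, not ClearML code: `merge_parents` is a hypothetical helper and the hash strings are placeholders.

```python
def merge_parents(parent_listings):
    """Merge parent file listings in the order given; later parents override earlier ones."""
    merged = {}
    for listing in parent_listings:
        merged.update(listing)
    return merged


parent_1 = {"train/img_001.jpg": "hash-a", "labels.csv": "hash-b"}
parent_2 = {"labels.csv": "hash-c", "test/img_900.jpg": "hash-d"}

merged = merge_parents([parent_1, parent_2])
print(merged["labels.csv"])  # hash-c (parent_2 was listed last, so its copy wins)
```

Swapping the order of the two parents would keep `parent_1`'s copy of `labels.csv` instead.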
### Dataset.squash()
To improve deep dataset DAG storage and speed, dataset squashing was introduced. The [`Dataset.squash`](../references/sdk/dataset.md#datasetsquash)
class method generates a new dataset by squashing a set of dataset versions, and merging down all changes introduced in
their lineage DAG, creating a new, flat, independent version.
The datasets being squashed into a single dataset can be specified by their IDs or by project & name pairs.
```python
# option 1 - list dataset IDs
squashed_dataset_1 = Dataset.squash(
    dataset_name='squashed dataset\'s name',
    dataset_ids=[DS1_ID, DS2_ID, DS3_ID]
)
# option 2 - list project and dataset pairs
squashed_dataset_2 = Dataset.squash(
    dataset_name='squashed dataset 2',
    dataset_project_name_pairs=[('dataset1 project', 'dataset1 name'),
                                ('dataset2 project', 'dataset2 name')]
)
```
In addition, the target storage location for the squashed dataset can be specified using the `output_url` parameter of the
[`Dataset.squash`](../references/sdk/dataset.md#datasetsquash) method.
## Accessing Datasets
Once a dataset has been created and uploaded to a server, the dataset can be accessed programmatically from anywhere.
Use the [`Dataset.get`](../references/sdk/dataset.md#datasetget) class method to access a specific Dataset object, either
with the dataset's ID or with its project and name. If only a project name or tag is provided, the method returns the
most recent dataset in the specified project, or the most recent dataset with the specified tag.
Once a specific dataset object has been obtained, get a local copy of the dataset using one of the following options:
* [`Dataset.get_local_copy()`](../references/sdk/dataset.md#get_local_copy) - get a read-only local copy of an entire dataset.
This method returns a path to the dataset in local cache (downloading the dataset if it is not already in cache)
* [`Dataset.get_mutable_local_copy()`](../references/sdk/dataset.md#get_mutable_local_copy) - get a writable local copy
of an entire dataset. This method downloads the dataset to a specific folder (non-cached), specified with the `target_folder` parameter. If
the specified folder already has contents, specify whether to overwrite its contents with the dataset contents, using the `overwrite` parameter.
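
The choice between the two options above can be sketched as a single helper. `local_path_for` is a hypothetical name, and `dataset` is assumed to be an object returned by `Dataset.get`.

```python
def local_path_for(dataset, target_folder=None, overwrite=False):
    """Return a local path for `dataset`.

    Without `target_folder`, use the read-only cached copy; with it, download a
    writable copy into that folder.
    """
    if target_folder is None:
        # Cached copy: downloaded once, then reused on later calls
        return dataset.get_local_copy()
    # Writable, non-cached copy in a folder you control
    return dataset.get_mutable_local_copy(target_folder=target_folder, overwrite=overwrite)
```

Use the cached copy when the data is only read (for example, during training), and the writable copy when a script needs to edit the files in place.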
## Modifying Datasets
Once a dataset has been created, its contents can be modified and replaced. When your data changes, you can
add updated files or remove unnecessary files.
### add_files()
To add files or folders into the current dataset, use the [`Dataset.add_files`](../references/sdk/dataset.md#add_files)
method. If a file is already in a dataset, but it has been modified, it can be added again, and ClearML will
upload the file diff.
```python
dataset = Dataset.create(dataset_name='dataset name', dataset_project='dataset project')
dataset.add_files(path="path/to/folder_or_file")
```
There is an option to add a set of files based on wildcard matching of a single string or a list of strings, using the
`wildcard` parameter. Specify whether to match the wildcard files recursively using the `recursive` parameter.
For example:
```python
dataset.add_files(
    path="path/to/folder",
    wildcard="*.jpg",  # pattern matched against files under `path`
    recursive=True
)
```
### remove_files()
To remove files from a current dataset, use the [`Dataset.remove_files`](../references/sdk/dataset.md#remove_files) method.
Input the path to the folder or file to be removed in the `dataset_path` parameter. The path is relative to the dataset.
There is also an option to input a wildcard into `dataset_path` in order to remove a set of files matching the wildcard.
Set the `recursive` parameter to `True` in order to match all wildcard files recursively.
For example:
```python
dataset.remove_files(dataset_path="*.csv", recursive=True)
```
## Uploading Files
To upload the dataset files to network storage, use the [`Dataset.upload`](../references/sdk/dataset.md#upload) method.
Use the `output_url` parameter to specify a storage target, such as S3 / GS / Azure (e.g. `s3://bucket/data`, `gs://bucket/data`, `azure://bucket/data`, `/mnt/share/data`).
By default, the dataset uploads to ClearML's file server.
Dataset files must be uploaded before a dataset is [finalized](#finalizing-a-dataset).
## Finalizing a Dataset
Use the [`Dataset.finalize`](../references/sdk/dataset.md#finalize) method to close the current dataset. This marks the
dataset task as *Completed*, at which point, the dataset can no longer be modified.
Before closing a dataset, its files must first be [uploaded](#uploading-files).
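
Putting the steps in order, a minimal sketch might look like this. `add_upload_finalize` is a hypothetical helper, and `dataset` is assumed to be a dataset that has been created but not yet finalized.

```python
def add_upload_finalize(dataset, paths, output_url=None):
    """Add files, upload them, then finalize; the order matters."""
    for path in paths:
        dataset.add_files(path=path)
    # output_url=None keeps the default destination (ClearML's file server)
    dataset.upload(output_url=output_url)
    # Finalize only after the upload: the dataset can no longer be modified afterwards
    dataset.finalize()
    return dataset.id
```

The returned ID can later be passed to `Dataset.get` or used as a parent for the next version.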
## Syncing Local Storage
Use the [`Dataset.sync_folder`](../references/sdk/dataset.md#sync_folder) method in order to update a dataset according
to a specific folder's content changes. Specify the folder to sync with the `local_path` parameter (the method includes all files within the folder, recursively).
This method is useful in the case where there's a single point of truth, either a local or network folder, that gets updated periodically.
The folder changes will be reflected in a new dataset version. This method saves time since you don't have to manually
update (add / remove) files in a dataset.
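
A rough sketch of syncing a folder into a new version follows. The function name and its arguments are hypothetical, the parent ID is whatever version was synced previously, and the `clearml` import is kept inside the function so the module loads without a configured ClearML environment.

```python
def sync_as_new_version(folder, project, name, parent_id):
    """Create a child of `parent_id` and sync it with `folder`; returns the new dataset ID."""
    # Lazy import: requires the `clearml` package and a configured server.
    from clearml import Dataset

    dataset = Dataset.create(
        dataset_name=name,
        dataset_project=project,
        parent_datasets=[parent_id],
    )
    # Register additions, modifications, and removals relative to the parent
    dataset.sync_folder(local_path=folder)
    dataset.upload()
    dataset.finalize()
    return dataset.id
```

Because the new version inherits from its parent, only the files that changed in the folder need to be recorded in it.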


@@ -0,0 +1,98 @@
---
title: Dataset Management with CLI and SDK
---
In this tutorial, we are going to manage the CIFAR dataset with `clearml-data` CLI, and then use ClearML's [`Dataset`](../../references/sdk/dataset.md)
class to ingest the data.
## Creating the Dataset
### Downloading the Data
Before we can register the CIFAR dataset with `clearml-data`, we need to obtain a local copy of it.
Execute this Python script to download the data:
```python
from clearml import StorageManager
manager = StorageManager()
dataset_path = manager.get_local_copy(
    remote_url="https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"
)
# make sure to copy the printed value
print("COPY THIS DATASET PATH: {}".format(dataset_path))
```
Expected response:
```bash
COPY THIS DATASET PATH: ~/.clearml/cache/storage_manager/global/f2751d3a22ccb78db0e07874912b5c43.cifar-10-python_artifacts_archive_None
```
The script prints the path to the downloaded data. It will be needed later on.
### Creating the Dataset
To create the dataset, execute the following command:
```
clearml-data create --project dataset_examples --name cifar_dataset
```
Expected response:
```
clearml-data - Dataset Management & Versioning CLI
Creating a new dataset:
New dataset created id=ee1c35f60f384e65bc800f42f0aca5ec
```
Where `ee1c35f60f384e65bc800f42f0aca5ec` is the dataset ID.
## Adding Files
Add the files we just downloaded to the dataset:
```
clearml-data add --files <dataset_path>
```
Where `<dataset_path>` is the path that was printed earlier, which denotes the location of the downloaded dataset.
:::note
There's no need to specify a `dataset_id`, since the `clearml-data` session stores it.
:::
## Finalizing the Dataset
Run the [`close`](../../references/sdk/dataset.md#close) command to upload the files (they are uploaded to ClearML Server by default) and finalize the dataset:<br/>
```
clearml-data close
```
This command sets the dataset task's status to *completed*, so it will no longer be modifiable. This ensures future
reproducibility.
The information about the dataset, including a list of files and their sizes, can be viewed
in the WebApp, in the dataset task's **ARTIFACTS** tab.
![image](../../img/examples_data_management_cifar_dataset.png)
## Using the Dataset
Now that we have a new dataset registered, we can consume it.
The [data_ingestion.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/data_ingestion.py) example
script demonstrates using the dataset within Python code.
```python
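from clearml import Dataset
from torchvision import datasets, transforms

dataset_name = "cifar_dataset"
dataset_project = "dataset_examples"

# Get a local, cached copy of the dataset registered above
dataset_path = Dataset.get(dataset_name=dataset_name, dataset_project=dataset_project).get_local_copy()

# Minimal preprocessing (an assumption; the example script defines its own transform)
transform = transforms.Compose([transforms.ToTensor()])

trainset = datasets.CIFAR10(
    root=dataset_path,
    train=True,
    download=False,
    transform=transform
)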
```
The Dataset's [`get_local_copy`](../../references/sdk/dataset.md#get_local_copy) method will return a path to the cached,
downloaded dataset. Then we provide the path to PyTorch's dataset object.
The script then trains a neural network to classify images using the dataset created above.


@@ -0,0 +1,83 @@
---
title: Folder Sync with CLI
---
This example shows how to use the `clearml-data` folder sync function.
`clearml-data` folder sync mode is useful for cases when users have a single point of truth (i.e. a folder) that updates
from time to time. When the point of truth is updated, users can call `clearml-data sync` and the
changes (file addition, modification, or removal) will be reflected in ClearML.
## Prerequisites
1. First, make sure that you have cloned the [clearml](https://github.com/allegroai/clearml) repository. This contains all
the needed files.
1. Open a terminal and change directory to the cloned repository's examples folder:
```
cd clearml/examples/reporting
```
## Syncing a Folder
Create a dataset and sync the `data_samples` folder from the repo to ClearML:
```bash
clearml-data sync --project datasets --name sync_folder --folder data_samples
```
Expected response:
```
clearml-data - Dataset Management & Versioning CLI
Creating a new dataset:
New dataset created id=0d8f5f3e5ebd4f849bfb218021be1ede
Syncing dataset id 0d8f5f3e5ebd4f849bfb218021be1ede to local folder data_samples
Generating SHA2 hash for 5 files
Hash generation completed
Sync completed: 0 files removed, 5 added / modified
Finalizing dataset
Pending uploads, starting dataset upload to https://files.community.clear.ml
Uploading compressed dataset changes (5 files, total 222.17 KB) to https://files.community.clear.ml
Upload completed (222.17 KB)
2021-05-04 09:57:56,809 - clearml.Task - INFO - Waiting to finish uploads
2021-05-04 09:57:57,581 - clearml.Task - INFO - Finished uploading
Dataset closed and finalized
```
As can be seen, the `clearml-data sync` command creates the dataset, then uploads the files, and closes the dataset.
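The same sync, upload, and finalize sequence can also be scripted with the SDK. The following is a minimal sketch rather than the CLI's actual implementation. `sync_and_close` is a hypothetical helper name, and it assumes `dataset` is an open `Dataset` object (for example, one returned by `Dataset.create`) that provides the `sync_folder`, `upload`, and `finalize` methods.

```python
def sync_and_close(dataset, folder):
    """Mirror `folder` into an open dataset, then upload and finalize it.

    `dataset` is assumed to be a clearml.Dataset that has not been finalized,
    e.g. the object returned by Dataset.create(...).
    """
    # Register files that were added, modified, or removed in the folder
    dataset.sync_folder(local_path=folder)
    # Push the pending file changes to storage
    dataset.upload()
    # Close the version so it can no longer be modified
    dataset.finalize()
    return dataset.id
```

For example, passing the result of `Dataset.create(dataset_name="sync_folder", dataset_project="datasets")` together with `"data_samples"` would roughly mirror the CLI call above.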
## Modifying Synced Folder
Now we'll modify the folder:
1. Add another line to one of the files in the `data_samples` folder.
1. Add a file to the `data_samples` folder.<br/>
Run `echo "data data data" > data_samples/new_data.txt` (this creates the file `new_data.txt` in the `data_samples` folder).
Now create a new dataset with the previous one as its parent, and sync the folder again. Pass the ID of the dataset
created above to `--parents`:
```bash
clearml-data sync --project datasets --name second_ds --parents 0d8f5f3e5ebd4f849bfb218021be1ede --folder data_samples
```
Expected response:
```
clearml-data - Dataset Management & Versioning CLI
Creating a new dataset:
New dataset created id=0992dd6bae6144388e0f2ef131d9724a
Syncing dataset id 0992dd6bae6144388e0f2ef131d9724a to local folder data_samples
Generating SHA2 hash for 6 files
Hash generation completed
Sync completed: 0 files removed, 2 added / modified
Finalizing dataset
Pending uploads, starting dataset upload to https://files.community.clear.ml
Uploading compressed dataset changes (2 files, total 742 bytes) to https://files.community.clear.ml
Upload completed (742 bytes)
2021-05-04 10:05:42,353 - clearml.Task - INFO - Waiting to finish uploads
2021-05-04 10:05:43,106 - clearml.Task - INFO - Finished uploading
Dataset closed and finalized
```
We can see that 2 files were added or modified, just as we expected!
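To make the "added / modified" and "removed" counts more concrete, here is a small standard-library sketch of hash-based change detection. It is an illustration only, not ClearML's implementation: the function names are hypothetical, and SHA-256 is assumed as the SHA2 variant mentioned in the log output.

```python
import hashlib
from pathlib import Path


def hash_folder(folder):
    """Map each file's path (relative to `folder`) to its SHA-256 hex digest."""
    root = Path(folder)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_snapshots(previous, current):
    """Return (added_or_modified, removed) file names between two snapshots."""
    # A file counts as added or modified if it is new or its digest changed
    added_or_modified = sorted(
        name for name, digest in current.items() if previous.get(name) != digest
    )
    # A file counts as removed if it existed before but is gone now
    removed = sorted(name for name in previous if name not in current)
    return added_or_modified, removed
```

Taking a snapshot of `data_samples` before and after the modifications above and passing both to `diff_snapshots` would report the edited file and `new_data.txt` as added or modified, with nothing removed.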

View File

@@ -0,0 +1,94 @@
---
title: Data Management with Python
---
The [dataset_creation.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/dataset_creation.py) and
[data_ingestion.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/data_ingestion.py)
together demonstrate how to use ClearML's [`Dataset`](../../references/sdk/dataset.md) class to create a dataset and
subsequently ingest the data.
## Dataset Creation
The [dataset_creation.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/dataset_creation.py) script
demonstrates how to do the following:
* Create a dataset and add files to it
* Upload the dataset to the ClearML Server
* Finalize the dataset
### Downloading the Data
We first need to obtain a local copy of the CIFAR dataset.
```python
from clearml import StorageManager
manager = StorageManager()
dataset_path = manager.get_local_copy(remote_url="https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz")
```
This downloads the data, and `dataset_path` then holds the path to the local copy.
### Creating the Dataset
```python
from clearml import Dataset
dataset = Dataset.create(dataset_name="cifar_dataset", dataset_project="dataset_examples")
```
This creates a data processing task called `cifar_dataset` in the `dataset_examples` project, which
can be viewed in the WebApp.
### Adding Files
```python
dataset.add_files(path=dataset_path)
```
This adds the downloaded files to the current dataset.
### Uploading the Files
```python
dataset.upload()
```
This uploads the dataset to the ClearML Server by default. The dataset's destination can be changed by specifying the
target storage with the `output_url` parameter of the [`upload`](../../references/sdk/dataset.md#upload) method.
### Finalizing the Dataset
Call the [`finalize`](../../references/sdk/dataset.md#finalize) method to close the dataset and set the dataset task's
status to *completed*. The dataset can only be finalized if it doesn't have any pending uploads.
```python
dataset.finalize()
```
After a dataset has been finalized, it can no longer be modified. This ensures future reproducibility.
The information about the dataset, including a list of files and their sizes, can be viewed
in the WebApp, in the dataset task's **ARTIFACTS** tab.
![image](../../img/examples_data_management_cifar_dataset.png)
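Because finalizing requires that no uploads are pending, it can help to keep the upload and finalize calls together. The following is a minimal sketch, assuming `dataset` is the open `Dataset` object created above. `upload_and_finalize` is a hypothetical helper and not part of the SDK.

```python
def upload_and_finalize(dataset, output_url=None):
    """Upload pending files, then finalize the dataset.

    `dataset` is assumed to be an open clearml.Dataset.
    `output_url=None` keeps the default destination (the ClearML Server).
    """
    # Upload first: a dataset with pending uploads cannot be finalized
    dataset.upload(output_url=output_url)
    # Close the dataset so it can no longer be modified
    dataset.finalize()
```

Passing an `output_url` (for example, a hypothetical bucket such as `s3://my-bucket/datasets`) would upload the files to that storage instead of the default.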
## Data Ingestion
Now that we have a new dataset registered, we can consume it!
The [data_ingestion.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/data_ingestion.py) script
demonstrates data ingestion using the dataset created in the first script.
```python
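from clearml import Dataset

dataset_name = "cifar_dataset"
dataset_project = "dataset_examples"
# Fetch the dataset and get the path to its cached, read-only local copy
dataset_path = Dataset.get(dataset_name=dataset_name, dataset_project=dataset_project).get_local_copy()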
```
The script above gets the dataset and uses the [`Dataset.get_local_copy`](../../references/sdk/dataset.md#get_local_copy)
method to return a path to the cached, read-only local dataset. If you need a modifiable copy of the dataset,
use `Dataset.get(dataset_name=dataset_name, dataset_project=dataset_project).get_mutable_local_copy("path/to/download")`.
The script then creates a neural network to train a model to classify images from the dataset that was
created above.

View File

@@ -0,0 +1,170 @@
---
title: Data Management from CLI
---
In this example we'll create a simple dataset and demonstrate basic actions on it, using the `clearml-data` CLI.
## Prerequisites
1. First, make sure that you have cloned the [clearml](https://github.com/allegroai/clearml) repository. This contains all
the needed files.
1. Open a terminal and change directory to the cloned repository's examples folder:
```
cd clearml/examples/reporting
```
## Creating Initial Dataset
1. To create the dataset, run this code:
```bash
clearml-data create --project datasets --name HelloDataset
```
Expected response:
```bash
clearml-data - Dataset Management & Versioning CLI
Creating a new dataset:
New dataset created id=24d05040f3e14fbfbed8edb1bf08a88c
```
1. Now let's add a folder. File addition is recursive, so it's enough to point at the folder
to capture all files and sub-folders:
```bash
clearml-data add --files data_samples
```
Expected response:
```bash
clearml-data - Dataset Management & Versioning CLI
Adding files/folder to dataset id 24d05040f3e14fbfbed8edb1bf08a88c
Generating SHA2 hash for 2 files
Hash generation completed
5 files added
```
:::note
After creating a dataset, we don't have to specify its ID when running commands such as *add*, *remove*, or *list*.
:::
1. Close the dataset. This command uploads the files. By default, the files are uploaded to the file server, but
this can be configured with the `--storage` flag to any of ClearML's supported storage media (see [storage](../../integrations/storage.md)).
The command also finalizes the dataset, making it immutable and ready to be consumed.
```bash
clearml-data close
```
Expected response:
```bash
clearml-data - Dataset Management & Versioning CLI
Finalizing dataset id 24d05040f3e14fbfbed8edb1bf08a88c
Pending uploads, starting dataset upload to https://files.community.clear.ml
Uploading compressed dataset changes (4 files, total 221.56 KB) to https://files.community.clear.ml
Upload completed (221.56 KB)
2021-05-04 09:32:03,388 - clearml.Task - INFO - Waiting to finish uploads
2021-05-04 09:32:04,067 - clearml.Task - INFO - Finished uploading
Dataset closed and finalized
```
## Listing Dataset Content
To see that all the files were added to the created dataset, use `clearml-data list` and enter the ID of the dataset
that was just closed.
```bash
clearml-data list --id 24d05040f3e14fbfbed8edb1bf08a88c
```
Expected response:
```console
clearml-data - Dataset Management & Versioning CLI
List dataset content: 24d05040f3e14fbfbed8edb1bf08a88c
Listing dataset content
file name | size | hash
------------------------------------------------------------------------------------------------------------------------------------------------
dancing.jpg | 40,484 | 78e804c0c1d54da8d67e9d072c1eec514b91f4d1f296cdf9bf16d6e54d63116a
data.csv | 21,440 | b618696f57b822cd2e9b92564a52b3cc93a2206f41df3f022956bb6cfe4e7ad5
picasso.jpg | 114,573 | 6b3c67ea9ec82b09bd7520dd09dad2f1176347d740fd2042c88720e780691a7c
sample.json | 132 | 9c42a9a978ac7a71873ebd5c65985e613cfaaff1c98f655af0d2ee0246502fd7
sample.mp3 | 72,142 | fbb756ae14005420ff00ccdaff99416bebfcea3adb7e30963a69e68e9fbe361b
Total 5 files, 248771 bytes
```
## Creating a Child Dataset
In ClearML Data, it's possible to create datasets that inherit the content of other datasets. These are called child datasets.
1. Create a new dataset, specifying the previously created one as its parent:
```bash
clearml-data create --project datasets --name HelloDataset-improved --parents 24d05040f3e14fbfbed8edb1bf08a88c
```
:::note
You'll need to input the dataset ID you received when you created the dataset above.
:::
1. Now, we want to add a new file.
* Create a new file: `echo "data data data" > new_data.txt` (this creates the file `new_data.txt`).
* Now add the file to the dataset:
```bash
clearml-data add --files new_data.txt
```
Which should return this output:
```console
clearml-data - Dataset Management & Versioning CLI
Adding files/folder to dataset id 8b68686a4af040d081027ba3cf6bbca6
1 file added
```
1. Let's also remove a file. We'll need to specify the file's full path (within the dataset, not locally) to remove it.
```bash
clearml-data remove --files data_samples/dancing.jpg
```
Expected response:
```bash
clearml-data - Dataset Management & Versioning CLI
Removing files/folder from dataset id 8b68686a4af040d081027ba3cf6bbca6
1 files removed
```
1. Close and finalize the dataset:
```bash
clearml-data close
```
1. Let's take a look again at the files in the dataset:
```
clearml-data list --id 8b68686a4af040d081027ba3cf6bbca6
```
And we see that our changes have been made! `new_data.txt` has been added, and `dancing.jpg` has been removed.
```
file name | size | hash
------------------------------------------------------------------------------------------------------------------------------------------------
data.csv | 21,440 | b618696f57b822cd2e9b92564a52b3cc93a2206f41df3f022956bb6cfe4e7ad5
new_data.txt | 15 | 6df986a2154902260a836febc5a32543f5337eac60560c57db99257a7e012051
picasso.jpg | 114,573 | 6b3c67ea9ec82b09bd7520dd09dad2f1176347d740fd2042c88720e780691a7c
sample.json | 132 | 9c42a9a978ac7a71873ebd5c65985e613cfaaff1c98f655af0d2ee0246502fd7
sample.mp3 | 72,142 | fbb756ae14005420ff00ccdaff99416bebfcea3adb7e30963a69e68e9fbe361b
Total 5 files, 208302 bytes
```
By using `clearml-data`, a clear lineage is created for the data. As seen in this example, once a dataset is closed, the
only way to add or remove data is to create a new dataset with the previous dataset as its parent. This way, the data
is not reliant on the code and is reproducible.
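The parent and child relationship can be pictured as a simple computation over file tables. The sketch below is a conceptual model only, not how ClearML stores datasets internally. `derive_child` is a hypothetical function, and the file names and sizes are taken from the listings above.

```python
def derive_child(parent_files, added=None, removed=()):
    """Return the child's file table: parent content, plus additions, minus removals.

    File tables map a file name to its size in bytes.
    """
    child = dict(parent_files)   # a child starts with everything its parent holds
    child.update(added or {})    # new or changed files override inherited entries
    for name in removed:
        child.pop(name, None)    # removed files disappear from the child only
    return child


# File names and sizes taken from the listings above
parent = {
    "dancing.jpg": 40484,
    "data.csv": 21440,
    "picasso.jpg": 114573,
    "sample.json": 132,
    "sample.mp3": 72142,
}
child = derive_child(parent, added={"new_data.txt": 15}, removed=["dancing.jpg"])

print(len(child), sum(child.values()))    # 5 208302
print(len(parent), sum(parent.values()))  # 5 248771 (the parent is unchanged)
```

The totals match the two `clearml-data list` outputs: the child has five files and 208302 bytes, while the parent keeps its original five files and 248771 bytes.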

View File

@@ -0,0 +1,12 @@
---
title: Workflows
---
Take a look at the ClearML Data examples which demonstrate common workflows using the `clearml-data` CLI and the
`Dataset` class:
* [Data Management from CLI](data_man_simple.md) - Tutorial for creating, modifying, and consuming a dataset with the CLI.
* [Folder Sync with CLI](data_man_folder_sync.md) - Tutorial for using the `clearml-data sync` CLI option to update a dataset according
to a local folder.
* [Dataset Management with CLI and SDK](data_man_cifar_classification.md) - Tutorial for creating a dataset with the CLI,
then programmatically ingesting the data with the SDK.
* [Data Management with Python](data_man_python.md) - Example scripts for creating and consuming a dataset with the SDK.