Refactor ClearML Data docs (#108)

2025-06-26 18:17:44 +00:00 · 2021-11-08 13:21:44 +02:00
parent 43751dc64b
commit e155c49cfd
17 changed files with 847 additions and 683 deletions
--- a/docs/clearml_data/best_practices.md
+++ b/docs/clearml_data/best_practices.md
@@ -0,0 +1,46 @@
+---
+title: Best Practices
+---
+
+The following are some recommendations for using ClearML Data. 
+
+## Versioning Datasets
+
+Use ClearML Data to version your datasets. Once a dataset is finalized, it can no longer be modified. This makes clear
+which version of the dataset was used with which task, enabling the accurate reproduction of your experiments. 
+
+Once you need to change the dataset's contents, you can create a new version of the dataset by specifying the previous 
+dataset as a parent. This makes the new dataset version inherit the previous version's contents, with the dataset's new 
+version contents ready to be updated. 
+
+## Organize Datasets for Easier Access
+
+Organize the datasets according to use-cases and use tags. This makes managing multiple datasets and 
+accessing the most updated datasets for different use-cases easier. 
+
+Like any ClearML tasks, datasets can be organized into [projects (and sub-projects)](../fundamentals/projects.md#creating-sub-projects). 
+Additionally, when creating a dataset, tags can be applied to the dataset, which will make searching for the dataset easier.
+
+Organizing your datasets into projects by use-case makes it easier to access the most recent dataset version for that use-case. 
+If only a project is specified when using [`Dataset.get`](../references/sdk/dataset.md#datasetget), the method returns the 
+most recent dataset in a project. The same is true with tags; if a tag is specified, the method will return the most recent dataset that is labeled with that tag.
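+
+The lookup described above can be sketched as follows. This is only an illustration: the helper names 
+`latest_dataset_kwargs` and `get_latest` are hypothetical, and the sketch assumes that `Dataset.get` accepts the 
+`dataset_project` and `dataset_tags` arguments listed in the SDK reference.
+
```python
def latest_dataset_kwargs(project=None, tags=None):
    """Build the keyword arguments for Dataset.get from a project and/or tags."""
    kwargs = {}
    if project:
        kwargs["dataset_project"] = project
    if tags:
        kwargs["dataset_tags"] = list(tags)
    if not kwargs:
        raise ValueError("Specify a project, tags, or both")
    return kwargs


def get_latest(project=None, tags=None):
    """Return the most recent dataset matching the project and/or tags."""
    # Imported lazily: requires the clearml package and a configured ClearML server
    from clearml import Dataset

    return Dataset.get(**latest_dataset_kwargs(project, tags))


# Hypothetical usage (project and tag names are placeholders):
# dataset = get_latest(project="dataset_examples", tags=["production"])
```
+
+Keeping the argument building separate from the server call makes it easy to check what will be requested before 
+contacting ClearML.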
+
+## Document your Datasets 
+
+Attach informative metrics or debug samples to the Dataset itself. Use the [`get_logger`](../references/sdk/dataset.md#get_logger)
+method to access the dataset's logger object, then add any additional information to the dataset, using the methods
+available with a [logger](../references/sdk/logger.md) object. 
+
+You can add some dataset summaries (like [table reporting](../references/sdk/logger.md#report_table)) to create a preview 
+of the data stored for better visibility, or attach any statistics generated by the data ingestion process. 
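+
+A minimal sketch of attaching a preview table, under stated assumptions: `report_preview` is a hypothetical helper, 
+and the sketch assumes that `report_table` accepts a list of rows whose first row holds the column names (a pandas 
+DataFrame can be passed instead).
+
```python
def report_preview(dataset, header, rows, title="Dataset preview", series="first rows"):
    """Attach a small preview table to a dataset through its logger."""
    logger = dataset.get_logger()
    # Assumption: a list of rows is accepted, with the column names in the first row
    table = [list(header)] + [list(row) for row in rows]
    logger.report_table(title=title, series=series, iteration=0, table_plot=table)
    return table


# Hypothetical usage, with `dataset` obtained from Dataset.create() or Dataset.get():
# report_preview(dataset, ["file", "label"], [("img_001.jpg", "cat"), ("img_002.jpg", "dog")])
```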
+
+
+## Periodically Update Your Dataset 
+
+Your data probably changes from time to time. If the data is updated in the same local / network folder structure, which 
+serves as the dataset's single point of truth, you can schedule a script that uses the dataset `sync` functionality to 
+update the dataset based on the modifications made to the folder. This way, there is no need to modify the dataset manually. 
+The sync functionality also tracks the modifications made to the folder.
+
+See the sync function with the [CLI](clearml_data_cli.md#syncing-local-storage) or [SDK](clearml_data_sdk.md#syncing-local-storage)
+interface. 
--- a/docs/clearml_data/clearml_data.md
+++ b/docs/clearml_data/clearml_data.md
@@ -0,0 +1,92 @@
+---
+title: Introduction
+---
+
+In Machine Learning, you are very likely dealing with a gargantuan amount of data that you need to put in a dataset,
+which you then need to be able to share, reproduce, and track.
+
+ClearML Data Management solves two important challenges:
+- Accessibility - Making data easily accessible from every machine,
+- Versioning - Linking data and experiments for better **traceability**.
+
+**We believe Data is not code**. It should not be stored in a git tree, because progress on datasets is not always linear.
+Moreover, it can be difficult and inefficient to find on a git tree the commit associated with a certain version of a dataset.
+
+A `clearml-data` dataset is a collection of files, stored on a central storage location (S3 / GS / Azure / Network Storage).
+Datasets can be set up to inherit from other datasets, so data lineages can be created,
+and users can track when and how their data changes.
+
+Dataset changes are stored using differential storage, meaning a version stores only the change-set relative to its parent datasets.
+
+Local copies of datasets are always cached, so the same data never needs to be downloaded twice.
+When a dataset is pulled it will automatically pull all parent datasets and merge them into one output folder for you to work with.
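+
+To illustrate the idea of change-sets, here is a conceptual sketch in plain Python. It is not ClearML's 
+implementation: the `resolve_version` function and the dictionary layout are assumptions made for this example only.
+
```python
def resolve_version(change_sets):
    """Apply change-sets, oldest first, and return the merged file listing."""
    files = {}
    for change in change_sets:
        # Drop the files this version removed
        for path in change.get("removed", ()):
            files.pop(path, None)
        # Then record the files it added or modified
        files.update(change.get("added_or_modified", {}))
    return files


# Version 1 adds two files; version 2 modifies one and removes the other
v1 = {"added_or_modified": {"a.csv": "hash-a1", "b.csv": "hash-b1"}}
v2 = {"added_or_modified": {"b.csv": "hash-b2"}, "removed": ["a.csv"]}

merged = resolve_version([v1, v2])
print(merged)  # {'b.csv': 'hash-b2'}
```
+
+Each version only needs to store what changed, while pulling a version replays its lineage to rebuild the full content.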
+
+## Setup
+
+`clearml-data` comes built-in with the `clearml` Python package! Just check out the [Getting Started](../getting_started/ds/ds_first_steps.md) 
+guide for more info!
+
+## Using ClearML Data
+
+ClearML Data offers two interfaces:
+- `clearml-data` - A CLI utility for creating, uploading, and managing datasets. See [CLI](clearml_data_cli.md) for a reference of `clearml-data` commands.
+- `clearml.Dataset` - A Python interface for creating, retrieving, managing, and using datasets. See [SDK](clearml_data_sdk.md) for an overview of the basic methods of the `Dataset` module.
+
+For an overview of our recommendations for ClearML Data workflows and practices, see [Best Practices](best_practices.md).
+
+## WebApp 
+
+ClearML's WebApp provides a visual interface to your datasets through dataset tasks. Dataset tasks are categorized 
+as data-processing [task type](../fundamentals/task.md#task-types), and they are labeled with a `DATASET` system tag.
+
+Full log (calls / CLI) of the dataset creation process can be found in a dataset's **EXECUTION** section.
+
+Listing of the dataset differential snapshot, summary of files added / modified / removed and details of files in the 
+differential snapshot (location / size / hash), is available in the **ARTIFACTS** section. Download the dataset 
+by clicking  <img src="/docs/latest/icons/ico-download-json.svg" alt="Download" className="icon size-sm space-sm" />,
+next to the **FILE PATH**.
+
+The full dataset listing (all files included) is available in the **CONFIGURATION** section under **Dataset Content**. 
+This allows you to quickly compare two dataset contents and visually see the difference.
+The dataset genealogy DAG and change-set summary table are visualized in **RESULTS > PLOTS**.
+
+
+<details className="cml-expansion-panel screenshot">
+<summary className="cml-expansion-panel-summary">Dataset Contents</summary>
+<div className="cml-expansion-panel-content">
+
+![Dataset data WebApp](../img/dataset_data.png)
+
+</div>
+</details>
+
+<br/>
+
+View a DAG of the dataset dependencies (all previous dataset versions and their parents) in the dataset's page **> ARTIFACTS > state**.
+
+<details className="cml-expansion-panel screenshot">
+<summary className="cml-expansion-panel-summary">Data Dependency DAG</summary>
+<div className="cml-expansion-panel-content">
+
+![Dataset state WebApp](../img/dataset_data_state.png)
+
+</div>
+</details>
+
+
+Once a dataset has been finalized, view its genealogy in the dataset's
+page **>** **RESULTS** **>** **PLOTS**
+
+<details className="cml-expansion-panel screenshot">
+<summary className="cml-expansion-panel-summary">Dataset Genealogy</summary>
+<div className="cml-expansion-panel-content">
+
+![Dataset genealogy and summary](../img/dataset_genealogy_summary.png)
+
+</div>
+</details>
+
+
+
+
+
--- a/docs/clearml_data/clearml_data_cli.md
+++ b/docs/clearml_data/clearml_data_cli.md
@@ -0,0 +1,328 @@
+---
+title: CLI 
+--- 
+
+The `clearml-data` utility is a CLI tool for controlling and managing your data with ClearML.  
+
+The following page provides a reference to `clearml-data`'s CLI commands. 
+
+### Creating a Dataset
+```bash
+clearml-data create --project <project_name> --name <dataset_name> --parents <existing_dataset_id>
+```
+Creates a new dataset. <br/>
+
+**Parameters**
+
+<div className="tbl-cmd">
+
+|Name|Description|Optional|
+|---|---|---|
+|`--name` |Dataset's name| <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
+|`--project`|Dataset's project| <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
+|`--parents`|IDs of the dataset's parents. The dataset inherits all of its parents' content. Multiple parents can be entered, but they are merged in the order they were entered| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
+|`--tags` |Dataset user tags. The dataset can be labeled, which can be useful for organizing datasets| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
+
+</div>
+
+
+
+:::important
+clearml-data works in a stateful mode so once a new dataset is created, the following commands
+do not require the `--id` flag.
+:::
+
+<br/>
+
+### Adding Files
+```bash
+clearml-data add --id <dataset_id> --files <filenames/folders_to_add>
+```
+It's possible to add individual files or complete folders.<br/>
+
+**Parameters**
+
+<div className="tbl-cmd">
+
+|Name|Description|Optional|
+|---|---|---|
+|`--id` | Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
+|`--files`|Files / folders to add. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json` | <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
+|`--dataset-folder` | Dataset base folder to add the files to in the dataset. Default: dataset root| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
+|`--non-recursive` | Disable recursive scan of files | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
+|`--verbose` | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
+
+</div>
+
+<br/>
+
+### Removing Files
+```bash
+clearml-data remove --id <dataset_id_to_remove_from> --files <filenames/folders_to_remove>
+```
+
+**Parameters**
+
+<div className="tbl-cmd">
+
+|Name|Description|Optional|
+|---|---|---|
+|`--id` | Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
+|`--files` |  Files / folders to remove (wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`). Notice: file path is the path within the dataset, not the local path.| <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
+|`--non-recursive` | Disable recursive scan of files | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
+|`--verbose` | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
+
+</div>
+
+<br/>
+
+### Uploading Dataset Content
+```bash
+clearml-data upload [--id <dataset_id>] [--storage <upload_destination>]
+```
+Uploads added files to [ClearML Server](../deploying_clearml/clearml_server.md) by default. It's possible to specify a different storage
+medium by entering an upload destination, such as `s3://bucket`, `gs://`, `azure://`, `/mnt/shared/`.
+
+
+**Parameters**
+
+<div className="tbl-cmd">
+
+|Name|Description|Optional|
+|---|---|---|
+|`--id`| Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
+|`--storage`| Remote storage to use for the dataset files. Default: files_server | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
+|`--verbose` | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
+
+</div>
+
+<br/>
+
+### Finalizing a Dataset
+```bash
+clearml-data close --id <dataset_id>
+```
+Finalizes the dataset and makes it ready to be consumed.
+It automatically uploads all files that were not previously uploaded.
+Once a dataset is finalized, it can no longer be modified.
+
+**Parameters**
+
+<div className="tbl-cmd">
+
+|Name|Description|Optional|
+|---|---|---|
+|`--id`| Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
+|`--storage`| Remote storage to use for the dataset files. Default: files_server | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
+|`--disable-upload` | Disable automatic upload when closing the dataset | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
+|`--verbose` | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
+
+</div>
+
+<br/>
+
+### Syncing Local Storage
+```
+clearml-data sync [--id <dataset_id>] --folder <folder_location>  [--parents '<parent_id>']
+```
+This option syncs a folder's content with ClearML. It is useful in case a user has a single point of truth (i.e. a folder) which
+updates from time to time.
+
+
+When an update should be reflected in ClearML, users can call `clearml-data sync`, create a new dataset, and specify the folder;
+the changes (file additions, modifications, and removals) will then be reflected in ClearML.
+
+This command also uploads the data and finalizes the dataset automatically.
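+
+The kind of change detection that a sync reports can be pictured with the following plain-Python sketch. It is a 
+conceptual illustration only: `diff_listings` and the path-to-hash dictionaries are assumptions for this example, 
+not part of `clearml-data`.
+
```python
def diff_listings(previous, current):
    """Compare two {path: hash} listings and classify the changes."""
    added = sorted(p for p in current if p not in previous)
    removed = sorted(p for p in previous if p not in current)
    modified = sorted(p for p in current if p in previous and current[p] != previous[p])
    return {"added": added, "modified": modified, "removed": removed}


# Listing recorded in the last dataset version vs. the folder's current state
previous = {"a.jpg": "hash-1", "b.jpg": "hash-2", "c.jpg": "hash-3"}
current = {"a.jpg": "hash-1", "b.jpg": "hash-9", "d.jpg": "hash-4"}

changes = diff_listings(previous, current)
print(changes)  # {'added': ['d.jpg'], 'modified': ['b.jpg'], 'removed': ['c.jpg']}
```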
+
+**Parameters**
+
+<div className="tbl-cmd">
+
+|Name|Description|Optional|
+|---|---|---|
+|`--id`| Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
+|`--folder`|Local folder to sync. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
+|`--storage`|Remote storage to use for the dataset files. Default: files_server |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
+|`--parents`|IDs of the dataset's parents (i.e. merge all parents). All modifications made to the folder since the parents were synced will be reflected in the dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
+|`--project`|If creating a new dataset, specify the dataset's project name|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
+|`--name`|If creating a new dataset, specify the dataset's name|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
+|`--tags`|Dataset user tags|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
+|`--skip-close`|Do not auto close dataset after syncing folders|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
+|`--verbose` | Verbose reporting |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
+
+</div>
+
+<br/>
+
+### Listing Dataset Content
+```bash
+clearml-data list [--id <dataset_id>]
+```
+
+**Parameters**
+
+<div className="tbl-cmd">
+
+|Name|Description|Optional|
+|---|---|---|
+|`--id`|Dataset ID whose contents will be shown (alternatively, use project / name combination). Default: previously accessed dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
+|`--project`|Specify dataset project name (if used instead of ID, dataset name is also required)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
+|`--name`|Specify dataset name (if used instead of ID, dataset project is also required)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
+|`--filter`|Filter files based on folder / wildcard. Multiple filters are supported. Example: `folder/date_*.json folder/sub-folder`|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
+|`--modified`|Only list file changes (add / remove / modify) introduced in this version|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
+
+</div>
+
+<br/>
+
+###  Deleting a Dataset
+```
+clearml-data delete [--id <dataset_id_to_delete>]
+```
+Deletes an entire dataset from ClearML. This can also be used to delete a newly created dataset.
+
+This does not work on datasets with children.
+
+**Parameters**
+
+<div className="tbl-cmd">
+
+|Name|Description|Optional|
+|---|---|---|
+|`--id`|ID of dataset to be deleted. Default: previously created / accessed dataset that hasn't been finalized yet|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
+|`--force`|Force dataset deletion even if other dataset versions depend on it|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
+
+</div>
+
+<br/>
+
+### Searching for a Dataset
+```
+clearml-data search [--name <name>] [--project <project_name>] [--tags <tag>]
+```
+Lists all datasets in the system that match the search request.
+
+Datasets can be searched by project, name, ID, and tags. 
+
+**Parameters**
+
+<div className="tbl-cmd">
+
+|Name|Description|Optional|
+|---|---|---|
+|`--ids`|A list of dataset IDs|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
+|`--project`|The project name of the datasets|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
+|`--name`|A dataset name or a partial name to filter datasets by|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
+|`--tags`|A list of dataset user tags|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
+
+</div>
+
+<br/>
+
+### Comparing Two Datasets 
+
+```
+clearml-data compare [--source SOURCE] [--target TARGET] 
+```
+
+Compare two datasets (target vs. source). The command returns a comparison summary that looks like this:
+
+```
+Comparison summary: 4 files removed, 3 files modified, 0 files added
+```
+
+**Parameters**
+
+<div className="tbl-cmd">
+
+|Name|Description|Optional|
+|---|---|---|
+|`--source`|Source dataset id (used as baseline)|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
+|`--target`|Target dataset id (compare against the source baseline dataset)|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
+|`--verbose`|Verbose report all file changes (instead of summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
+
+</div>
+
+### Merging Datasets
+
+```
+clearml-data squash --name NAME --ids [IDS [IDS ...]] 
+```
+
+Squash (merge) multiple datasets into a single dataset version.
+
+**Parameters**
+
+<div className="tbl-cmd">
+
+|Name|Description|Optional|
+|---|---|---|
+|`--name`|Create squashed dataset name|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
+|`--ids`|Source dataset IDs to squash (merge down)|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
+|`--storage`|Remote storage to use for the dataset files. Default: files_server |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
+|`--verbose`|Verbose report all file changes (instead of summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
+
+</div>
+
+### Verifying a Dataset
+
+```
+clearml-data verify [--id ID] [--folder FOLDER] 
+```
+
+Verify that the dataset content matches the data from the local source.  
+
+**Parameters**
+
+<div className="tbl-cmd">
+
+|Name|Description|Optional|
+|---|---|---|
+|`--id`|Specify dataset ID. Default: previously created/accessed dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
+|`--folder`|Specify dataset local copy (if not provided the local cache folder will be verified)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
+|`--filesize`| If True, only verify file size and skip hash checks (default: false)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
+|`--verbose`|Verbose report all file changes (instead of summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
+
+</div>
+
+### Getting a Dataset 
+
+```
+clearml-data get [--id ID] [--copy COPY] [--link LINK] [--overwrite]
+```
+
+Get a local copy of a dataset. By default, you get a read-only cached folder, but you can get a mutable copy by using the 
+`--copy` flag. 
+
+**Parameters**
+
+<div className="tbl-cmd">
+
+|Name|Description|Optional|
+|---|---|---|
+|`--id`| Specify dataset ID. Default: previously created / accessed dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
+|`--copy`| Get a writable copy of the dataset to a specific output folder|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
+|`--link`| Create a soft link (not supported on Windows) to a read-only cached folder containing the dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
+|`--overwrite`| If True, overwrite the target folder|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
+|`--verbose`| Verbose report all file changes (instead of summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
+
+</div>
+
+### Publishing a Dataset
+
+```
+clearml-data publish --id ID
+```
+
+Publish the dataset for public use. The dataset must be [finalized](#finalizing-a-dataset) before it is published.
+
+
+**Parameters**
+
+<div className="tbl-cmd">
+
+|Name|Description|Optional|
+|---|---|---|
+|`--id`| The dataset task id to be published.|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
+
+</div>
--- a/docs/clearml_data/clearml_data_sdk.md
+++ b/docs/clearml_data/clearml_data_sdk.md
@@ -0,0 +1,148 @@
+---
+title: SDK
+---
+
+
+Datasets can be created, modified, and managed with ClearML Data's Python interface. The following page provides an overview
+of the most basic methods of the `Dataset` class. See the [Dataset reference page](../references/sdk/dataset.md) 
+for a complete list of available methods.
+
+Import the `Dataset` class, and let's get started!
+
+```python
+from clearml import Dataset
+```
+
+## Creating Datasets 
+
+ClearML Data supports multiple ways to create datasets programmatically, which provides for a variety of use-cases:
+* [`Dataset.create()`](#datasetcreate) - Create a new dataset. Parent datasets can be specified, from which the new dataset 
+  will inherit its data
+* [`Dataset.squash()`](#datasetsquash)  - Generate a new dataset by squashing together a set of related datasets
+
+### Dataset.create()
+
+Use the [`Dataset.create`](../references/sdk/dataset.md#datasetcreate) class method to create a dataset.
+
+Creating datasets programmatically is especially helpful when preprocessing the data so that the 
+preprocessing code and the resulting dataset are saved in a single task (see `use_current_task` parameter in [`Dataset.create`](../references/sdk/dataset.md#datasetcreate)).  
+
+```python
+# Preprocessing code here
+dataset = Dataset.create(
+  dataset_name='dataset name',
+  dataset_project='dataset project', 
+  parent_datasets=[PARENT_DS_ID_1, PARENT_DS_ID_2]
+)
+```
+
+The created dataset inherits the content of the `parent_datasets`. When multiple dataset parents are listed, 
+they are merged in order of specification. Each parent overrides any overlapping files from a previous parent dataset.
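+
+The merge order can be pictured with a small plain-Python sketch. This is a conceptual model, not ClearML's 
+implementation: `merge_parents` and the path-to-hash dictionaries are assumptions made for this example.
+
```python
def merge_parents(parent_listings):
    """Merge {path: hash} listings in the order the parents were specified."""
    merged = {}
    for listing in parent_listings:
        # A later parent overrides overlapping files from an earlier one
        merged.update(listing)
    return merged


parent_1 = {"train/a.jpg": "hash-1", "train/b.jpg": "hash-2"}
parent_2 = {"train/b.jpg": "hash-3", "val/c.jpg": "hash-4"}

merged = merge_parents([parent_1, parent_2])
print(merged["train/b.jpg"])  # hash-3
```
+
+Reversing the order of the parents would keep `hash-2` for `train/b.jpg` instead, so the order of `parent_datasets` matters 
+whenever parents share file paths.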
+
+### Dataset.squash()
+
+To improve deep dataset DAG storage and speed, dataset squashing was introduced. The [`Dataset.squash`](../references/sdk/dataset.md#datasetsquash) 
+class method generates a new dataset by squashing a set of dataset versions, and merging down all changes introduced in 
+their lineage DAG, creating a new, flat, independent version.
+
+The datasets being squashed into a single dataset can be specified by their IDs or by project & name pairs. 
+
+```python
+# option 1 - list dataset IDs
+squashed_dataset_1 = Dataset.squash(
+  dataset_name='squashed dataset\'s name',
+  dataset_ids=[DS1_ID, DS2_ID, DS3_ID]
+)
+
+# option 2 - list project and dataset pairs 
+squashed_dataset_2 = Dataset.squash(
+  dataset_name='squashed dataset 2',
+  dataset_project_name_pairs=[('dataset1 project', 'dataset1 name'), 
+                              ('dataset2 project', 'dataset2 name')]
+)
+```
+
+In addition, the target storage location for the squashed dataset can be specified using the `output_url` parameter of the 
+[`Dataset.squash`](../references/sdk/dataset.md#datasetsquash) method.
+
+## Accessing Datasets
+Once a dataset has been created and uploaded to a server, the dataset can be accessed programmatically from anywhere. 
+
+Use the [`Dataset.get`](../references/sdk/dataset.md#datasetget) class method to access a specific Dataset object, either 
+with the dataset's ID or with its project and name. If only a project name or tag is provided, the method returns the 
+most recent dataset in the specified project, or the most recent dataset with the specified tag.
+
+Once a specific dataset object has been obtained, get a local copy of the dataset using one of the following options:
+* [`Dataset.get_local_copy()`](../references/sdk/dataset.md#get_local_copy) - get a read-only local copy of an entire dataset. 
+  This method returns a path to the dataset in local cache (downloading the dataset if it is not already in cache)
+* [`Dataset.get_mutable_local_copy()`](../references/sdk/dataset.md#get_mutable_local_copy) - get a writable local copy 
+of an entire dataset. This method downloads the dataset to a specific folder (non-cached), specified with the `target_folder` parameter. If 
+the specified folder already has contents, specify whether to overwrite its contents with the dataset contents, using the `overwrite` parameter.
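+
+The two options above can be wrapped in one helper, sketched here under stated assumptions: `fetch_dataset` is a 
+hypothetical name, and `dataset` is expected to be an object returned by `Dataset.get`.
+
```python
def fetch_dataset(dataset, target_folder=None, overwrite=False):
    """Return a local path to the dataset's files."""
    if target_folder is None:
        # Read-only copy, served from the local cache
        return dataset.get_local_copy()
    # Writable copy in a folder of your choice (not cached)
    return dataset.get_mutable_local_copy(target_folder=target_folder, overwrite=overwrite)


# Hypothetical usage (the dataset ID and folder are placeholders):
# from clearml import Dataset
# dataset = Dataset.get(dataset_id="<dataset_id>")
# read_only_path = fetch_dataset(dataset)
# writable_path = fetch_dataset(dataset, target_folder="./my_copy", overwrite=True)
```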
+
+## Modifying Datasets
+
+Once a dataset has been created, its contents can be modified and replaced. When your data changes, you can 
+add updated files or remove unnecessary files. 
+
+### add_files()
+
+To add files or folders into the current dataset, use the [`Dataset.add_files`](../references/sdk/dataset.md#add_files) 
+method. If a file is already in a dataset, but it has been modified, it can be added again, and ClearML will 
+upload the file diff.
+
+```python
+dataset = Dataset.create(dataset_name='dataset name', dataset_project='dataset project')
+dataset.add_files(path="path/to/folder_or_file")
+```
+
+There is an option to add a set of files based on wildcard matching of a single string or a list of strings, using the 
+`wildcard` parameter. Specify whether to match the wildcard files recursively using the `recursive` parameter.
+
+For example:
+
+```python
+dataset.add_files(
+  path="path/to/folder",
+  wildcard="*.jpg",
+  recursive=True
+)
+```
+ 
+### remove_files()
+To remove files from a current dataset, use the [`Dataset.remove_files`](../references/sdk/dataset.md#remove_files) method.
+Input the path to the folder or file to be removed in the `dataset_path` parameter. The path is relative to the dataset. 
+
+There is also an option to input a wildcard into `dataset_path` in order to remove a set of files matching the wildcard. 
+Set the `recursive` parameter to `True` in order to match all wildcard files recursively.
+
+For example:
+
+```python
+dataset.remove_files(dataset_path="*.csv", recursive=True)
+```
+
+## Uploading Files
+
+To upload the dataset files to network storage, use the [`Dataset.upload`](../references/sdk/dataset.md#upload) method. 
+Use the `output_url` parameter to specify the storage target, such as S3 / GS / Azure or a network folder (e.g. `s3://bucket/data`, `gs://bucket/data`, `azure://bucket/data`, `/mnt/share/data`). 
+By default, the dataset uploads to ClearML's file server. 
+
+Dataset files must be uploaded before a dataset is [finalized](#finalizing-a-dataset). 
+
+
+## Finalizing a Dataset
+
+Use the [`Dataset.finalize`](../references/sdk/dataset.md#finalize) method to close the current dataset. This marks the 
+dataset task as *Completed*, at which point, the dataset can no longer be modified. 
+
+Before closing a dataset, its files must first be [uploaded](#uploading-files).
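+
+The required order (upload first, then finalize) can be captured in a small helper. This is a sketch: 
+`upload_and_finalize` is a hypothetical name, and `dataset` is expected to be a dataset object that has not been finalized yet.
+
```python
def upload_and_finalize(dataset, output_url=None):
    """Upload pending files, then close the dataset version."""
    # output_url=None keeps the default destination (ClearML's file server)
    dataset.upload(output_url=output_url)
    # After this call the dataset can no longer be modified
    dataset.finalize()


# Hypothetical usage (the bucket path is a placeholder):
# upload_and_finalize(dataset, output_url="s3://bucket/data")
```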
+
+
+## Syncing Local Storage
+
+Use the [`Dataset.sync_folder`](../references/sdk/dataset.md#sync_folder) method in order to update a dataset according
+to a specific folder's content changes. Specify the folder to sync with the `local_path` parameter (the method includes all files within the folder, recursively). 
+
+This method is useful in the case where there's a single point of truth, either a local or network folder, that gets updated periodically. 
+The folder changes will be reflected in a new dataset version. This method saves time since you don't have to manually 
+update (add / remove) files in a dataset.
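+
+A script that a scheduler could run periodically might look like the following sketch. The function name 
+`sync_folder_to_new_version` and its `create_dataset` argument are hypothetical; the sketch assumes the new version 
+is created as a child of the previous one, then synced, uploaded, and finalized.
+
```python
def sync_folder_to_new_version(parent_id, folder, name, project, create_dataset=None):
    """Create a child version of `parent_id` that mirrors `folder`, and return its ID."""
    if create_dataset is None:
        # Imported lazily: requires the clearml package and a configured ClearML server
        from clearml import Dataset

        create_dataset = Dataset.create

    dataset = create_dataset(
        dataset_name=name, dataset_project=project, parent_datasets=[parent_id]
    )
    dataset.sync_folder(local_path=folder)  # register added / modified / removed files
    dataset.upload()  # upload the changed files
    dataset.finalize()  # close the new version
    return dataset.id


# Hypothetical usage (the ID, folder, and names are placeholders):
# new_id = sync_folder_to_new_version("<parent_dataset_id>", "/mnt/share/data",
#                                     name="dataset name", project="dataset project")
```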
--- a/docs/clearml_data/data_management_examples/data_man_cifar_classification.md
+++ b/docs/clearml_data/data_management_examples/data_man_cifar_classification.md
@@ -0,0 +1,98 @@
+---
+title: Dataset Management with CLI and SDK
+---
+
+In this tutorial, we are going to manage the CIFAR dataset with `clearml-data` CLI, and then use ClearML's [`Dataset`](../../references/sdk/dataset.md) 
+class to ingest the data.
+
+## Creating the Dataset
+
+### Downloading the Data
+Before we can register the CIFAR dataset with `clearml-data`, we need to obtain a local copy of it.
+
+Execute this Python script to download the data:
+```python
+from clearml import StorageManager
+
+manager = StorageManager()
+dataset_path = manager.get_local_copy(
+    remote_url="https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"
+)
+# make sure to copy the printed value
+print("COPY THIS DATASET PATH: {}".format(dataset_path))
+```
+
+Expected response:
+```bash
+COPY THIS DATASET PATH: ~/.clearml/cache/storage_manager/global/f2751d3a22ccb78db0e07874912b5c43.cifar-10-python_artifacts_archive_None
+```
+The script prints the path to the downloaded data. It will be needed later on.
+
+### Creating the Dataset
+To create the dataset, execute the following command:
+ ```
+ clearml-data create --project dataset_examples --name cifar_dataset
+ ```
+
+Expected response:
+```
+clearml-data - Dataset Management & Versioning CLI 
+Creating a new dataset: 
+New dataset created id=ee1c35f60f384e65bc800f42f0aca5ec
+```
+Where `ee1c35f60f384e65bc800f42f0aca5ec` is the dataset ID.
+
+## Adding Files
+Add the files we just downloaded to the dataset: 
+
+```
+clearml-data add --files <dataset_path>
+```
+
+where `dataset_path` is the path that was printed earlier, which denotes the location of the downloaded dataset.
+
+:::note
+There's no need to specify a `dataset_id`, since the `clearml-data` session stores it.
+:::
+
+## Finalizing the Dataset
+Run the [`close`](../../references/sdk/dataset.md#close) command to upload the files (they'll be uploaded to ClearML Server by default) and finalize the dataset:<br/>
+
+```
+clearml-data close 
+```
+
+This command sets the dataset task's status to *completed*, so it will no longer be modifiable. This ensures future
+reproducibility. 
+
+The information about the dataset, including a list of files and their sizes, can be viewed
+in the WebApp, in the dataset task's **ARTIFACTS** tab.
+
+![image](../../img/examples_data_management_cifar_dataset.png)
+
+## Using the Dataset
+
+Now that we have a new dataset registered, we can consume it.
+
+The [data_ingestion.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/data_ingestion.py) example 
+script demonstrates using the dataset within Python code.
+
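+```python
+import torchvision.datasets as datasets
+import torchvision.transforms as transforms
+from clearml import Dataset
+
+dataset_name = "cifar_dataset"
+dataset_project = "dataset_examples"
+
+# Download the dataset (or reuse the cached copy) and get its local path
+dataset_path = Dataset.get(dataset_name=dataset_name, dataset_project=dataset_project).get_local_copy()
+
+# Convert images to tensors; the full example script may apply further transforms
+transform = transforms.ToTensor()
+
+trainset = datasets.CIFAR10(
+    root=dataset_path,
+    train=True,
+    download=False,
+    transform=transform
+)
+```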
+The Dataset's [`get_local_copy`](../../references/sdk/dataset.md#get_local_copy) method returns a path to the cached, 
+downloaded dataset. We then provide the path to PyTorch's dataset object.
+
+The script then trains a neural network to classify images using the dataset created above.
--- a/docs/clearml_data/data_management_examples/data_man_folder_sync.md
+++ b/docs/clearml_data/data_management_examples/data_man_folder_sync.md
@@ -0,0 +1,83 @@
+---
+title: Folder Sync with CLI
+---
+
+This example shows how to use the `clearml-data` folder sync function.
+
+`clearml-data` folder sync mode is useful for cases when users have a single point of truth (i.e. a folder) that updates 
+from time to time. When the point of truth is updated, users can call `clearml-data sync` and the 
+changes (file addition, modification, or removal) will be reflected in ClearML.
+
+## Creating Initial Version
+
+## Prerequisites
+1. First, make sure that you have cloned the [clearml](https://github.com/allegroai/clearml) repository. This contains all
+the needed files.
+
+1. Open a terminal and change directory to the cloned repository's examples folder:
+    
+   ```
+   cd clearml/examples/reporting
+   ```
+
+## Syncing a Folder
+Create a dataset and sync the `data_samples` folder from the repo to ClearML:
+```bash
+clearml-data sync --project datasets --name sync_folder --folder data_samples
+```
+
+Expected response:
+
+```
+clearml-data - Dataset Management & Versioning CLI
+Creating a new dataset:
+New dataset created id=0d8f5f3e5ebd4f849bfb218021be1ede
+Syncing dataset id 0d8f5f3e5ebd4f849bfb218021be1ede to local folder data_samples
+Generating SHA2 hash for 5 files
+Hash generation completed
+Sync completed: 0 files removed, 5 added / modified
+Finalizing dataset
+Pending uploads, starting dataset upload to https://files.community.clear.ml
+Uploading compressed dataset changes (5 files, total 222.17 KB) to https://files.community.clear.ml
+Upload completed (222.17 KB)
+2021-05-04 09:57:56,809 - clearml.Task - INFO - Waiting to finish uploads
+2021-05-04 09:57:57,581 - clearml.Task - INFO - Finished uploading
+Dataset closed and finalized
+```
+
+As can be seen, the `clearml-data sync` command creates the dataset, then uploads the files, and closes the dataset.
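+
+The same create, sync, upload, and finalize sequence can also be driven from Python. The sketch below uses the `Dataset` class's `create`, `sync_folder`, `upload`, and `finalize` methods; the wrapper function `sync_folder_to_new_dataset` and its parameter names are hypothetical and exist only for this illustration. It assumes a configured ClearML installation when called.
+
+```python
+def sync_folder_to_new_dataset(folder, name, project, parent_ids=None):
+    """Create a dataset version, sync it to a local folder, upload, and finalize."""
+    # Imported inside the function so that defining it does not require
+    # a configured ClearML installation.
+    from clearml import Dataset
+
+    dataset = Dataset.create(
+        dataset_name=name,
+        dataset_project=project,
+        parent_datasets=parent_ids,  # None for a first version
+    )
+    # Register file additions, modifications, and removals found in the folder
+    dataset.sync_folder(local_path=folder)
+    # Upload the pending files, then close the dataset
+    dataset.upload()
+    dataset.finalize()
+    return dataset.id
+```
+
+For example, `sync_folder_to_new_dataset("data_samples", "sync_folder", "datasets")` would mirror the CLI call above.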
+
+
+## Modifying Synced Folder
+
+Now we'll modify the folder:
+1. Add another line to one of the files in the `data_samples` folder.
+1. Add a file to the `data_samples` folder.<br/> 
+   Run `echo "data data data" > data_samples/new_data.txt` (this creates the file `new_data.txt` in the `data_samples` folder).
+
+
+We'll repeat the process of creating a new dataset with the previous one as its parent, and syncing the folder.
+
+```bash
+clearml-data sync --project datasets --name second_ds --parents 0d8f5f3e5ebd4f849bfb218021be1ede --folder data_samples
+```
+
+Expected response:
+```
+clearml-data - Dataset Management & Versioning CLI
+Creating a new dataset:
+New dataset created id=0992dd6bae6144388e0f2ef131d9724a
+Syncing dataset id 0992dd6bae6144388e0f2ef131d9724a to local folder data_samples
+Generating SHA2 hash for 6 files
+Hash generation completed
+Sync completed: 0 files removed, 2 added / modified
+Finalizing dataset
+Pending uploads, starting dataset upload to https://files.community.clear.ml
+Uploading compressed dataset changes (2 files, total 742 bytes) to https://files.community.clear.ml
+Upload completed (742 bytes)
+2021-05-04 10:05:42,353 - clearml.Task - INFO - Waiting to finish uploads
+2021-05-04 10:05:43,106 - clearml.Task - INFO - Finished uploading
+Dataset closed and finalized
+```
+
+We can see that 2 files were added or modified, just as we expected!
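+
+The "removed" and "added / modified" counts can be pictured as a comparison of content hashes between two states of the folder. The following is a simplified sketch of that idea, not ClearML's actual implementation: `snapshot` and `diff_snapshots` are hypothetical helpers, and SHA-256 is assumed as the SHA-2 variant.
+
+```python
+import hashlib
+import tempfile
+from pathlib import Path
+
+
+def snapshot(folder: Path) -> dict:
+    """Map each file's relative path to the SHA-256 hash of its content."""
+    return {
+        path.relative_to(folder).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
+        for path in sorted(folder.rglob("*"))
+        if path.is_file()
+    }
+
+
+def diff_snapshots(old: dict, new: dict) -> tuple:
+    """Return (removed, added_or_modified) lists of relative paths."""
+    removed = sorted(set(old) - set(new))
+    changed = sorted(name for name, digest in new.items() if old.get(name) != digest)
+    return removed, changed
+
+
+with tempfile.TemporaryDirectory() as tmp:
+    folder = Path(tmp)
+    (folder / "data.csv").write_text("a,b\n1,2\n")
+    (folder / "sample.json").write_text("{}")
+    before = snapshot(folder)
+
+    # Mirror the tutorial: modify one file and add another
+    (folder / "data.csv").write_text("a,b\n1,2\n3,4\n")
+    (folder / "new_data.txt").write_text("data data data\n")
+    removed, changed = diff_snapshots(before, snapshot(folder))
+
+print(f"{len(removed)} files removed, {len(changed)} added / modified")
+```
+
+Because the comparison is based on content rather than timestamps, a file that is rewritten with identical content does not count as modified in this sketch.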
--- a/docs/clearml_data/data_management_examples/data_man_python.md
+++ b/docs/clearml_data/data_management_examples/data_man_python.md
@@ -0,0 +1,94 @@
+---
+title: Data Management with Python
+---
+
+The [dataset_creation.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/dataset_creation.py) and 
+[data_ingestion.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/data_ingestion.py) 
+together demonstrate how to use ClearML's [`Dataset`](../../references/sdk/dataset.md) class to create a dataset and 
+subsequently ingest the data. 
+
+## Dataset Creation
+
+The [dataset_creation.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/dataset_creation.py) script 
+demonstrates how to do the following:
+* Create a dataset and add files to it
+* Upload the dataset to the ClearML Server
+* Finalize the dataset
+
+
+### Downloading the Data
+
+We first need to obtain a local copy of the CIFAR dataset.
+
+```python
+from clearml import StorageManager
+
+manager = StorageManager()
+dataset_path = manager.get_local_copy(remote_url="https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz")
+```
+
+This script downloads the data, and `dataset_path` holds the path to the downloaded data. 
+
+### Creating the Dataset
+
+```python
+from clearml import Dataset
+
+dataset = Dataset.create(dataset_name="cifar_dataset", dataset_project="dataset_examples")
+```
+
+This creates a data processing task called `cifar_dataset` in the `dataset_examples` project, which
+can be viewed in the WebApp.
+
+### Adding Files
+
+```python
+dataset.add_files(path=dataset_path)
+```
+
+This adds the downloaded files to the current dataset.  
+
+### Uploading the Files
+
+```python
+dataset.upload()
+```
+This uploads the dataset to the ClearML Server by default. The dataset's destination can be changed by specifying the 
+target storage with the `output_url` parameter of the [`upload`](../../references/sdk/dataset.md#upload) method. 
+
+### Finalizing the Dataset
+
+Call the [`finalize`](../../references/sdk/dataset.md#finalize) method to close the dataset and set the dataset task's
+status to *completed*. The dataset can only be finalized if it doesn't have any pending uploads. 
+
+```python
+dataset.finalize()
+```
+
+After a dataset has been closed, it can no longer be modified. This ensures future reproducibility. 
+
+The information about the dataset, including a list of files and their sizes, can be viewed
+in the WebApp, in the dataset task's **ARTIFACTS** tab.
+
+![image](../../img/examples_data_management_cifar_dataset.png)
+
+## Data Ingestion
+
+Now that we have a new dataset registered, we can consume it!
+
+The [data_ingestion.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/data_ingestion.py) script 
+demonstrates data ingestion using the dataset created in the first script.
+
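+```python
+from clearml import Dataset
+
+dataset_name = "cifar_dataset"
+dataset_project = "dataset_examples"
+
+# Get the dataset by name and project, and fetch a cached, read-only local copy
+dataset_path = Dataset.get(dataset_name=dataset_name, dataset_project=dataset_project).get_local_copy()
+```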
+
+The script above gets the dataset and uses the [`Dataset.get_local_copy`](../../references/sdk/dataset.md#get_local_copy) 
+method to return a path to the cached, read-only local dataset. If you need a modifiable copy of the dataset, 
+use `Dataset.get(dataset_name=dataset_name, dataset_project=dataset_project).get_mutable_local_copy("path/to/download")`.
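+
+As an illustration of the modifiable-copy option, the sketch below wraps the two calls in a small function. `get_editable_copy` and the example target folder are hypothetical names chosen for this example; `Dataset.get` and `get_mutable_local_copy` are the SDK methods mentioned above. It assumes a configured ClearML installation when called.
+
+```python
+def get_editable_copy(dataset_name, dataset_project, target_folder):
+    """Return the path of a modifiable copy of the dataset in target_folder."""
+    # Imported inside the function so that defining it does not require
+    # a configured ClearML installation.
+    from clearml import Dataset
+
+    dataset = Dataset.get(dataset_name=dataset_name, dataset_project=dataset_project)
+    # Unlike get_local_copy(), the files are copied into a folder we own,
+    # so changing them does not affect the shared read-only cache.
+    return dataset.get_mutable_local_copy(target_folder=target_folder)
+```
+
+A call such as `get_editable_copy("cifar_dataset", "dataset_examples", "./cifar_working_copy")` would place the files in a local working folder.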
+
+The script then creates a neural network to train a model to classify images from the dataset that was
+created above.
--- a/docs/clearml_data/data_management_examples/data_man_simple.md
+++ b/docs/clearml_data/data_management_examples/data_man_simple.md
@@ -0,0 +1,170 @@
+---
+title: Data Management from CLI
+---
+
+In this example we'll create a simple dataset and demonstrate basic actions on it, using the `clearml-data` CLI. 
+
+## Prerequisites
+1. First, make sure that you have cloned the [clearml](https://github.com/allegroai/clearml) repository. This contains all
+the needed files.
+1. Open a terminal and change directory to the cloned repository's examples folder:
+   
+   ```
+   cd clearml/examples/reporting
+   ```
+
+## Creating Initial Dataset
+
+1. To create the dataset, run this command:
+
+    ```bash
+    clearml-data create --project datasets --name HelloDataset
+    ```
+
+    Expected response:
+
+    ```bash
+    clearml-data - Dataset Management & Versioning CLI
+    Creating a new dataset:
+    New dataset created id=24d05040f3e14fbfbed8edb1bf08a88c
+    ```
+
+1. Now let's add a folder. File addition is recursive, so it's enough to point at the folder 
+to capture all files and sub-folders:
+   
+   ```bash
+   clearml-data add --files data_samples
+   ```
+   
+   Expected response:
+   
+   ```bash
+   clearml-data - Dataset Management & Versioning CLI
+   Adding files/folder to dataset id 24d05040f3e14fbfbed8edb1bf08a88c
+   Generating SHA2 hash for 2 files
+   Hash generation completed
+   5 files added
+   ```
+   
+   
+:::note
+After creating a dataset, we don't have to specify its ID when running commands such as *add*, *remove*, or *list*.
+:::
+
+3. Close the dataset - this command uploads the files. By default, the files are uploaded to the file server, but  
+this can be configured with the `--storage` flag to any of ClearML's supported storage mediums (see [storage](../../integrations/storage.md)).
+The command also finalizes the dataset, making it immutable and ready to be consumed.
+
+   ```bash
+   clearml-data close
+   ```
+
+   Expected response:
+
+   ```bash
+   clearml-data - Dataset Management & Versioning CLI
+   Finalizing dataset id 24d05040f3e14fbfbed8edb1bf08a88c
+   Pending uploads, starting dataset upload to https://files.community.clear.ml
+   Uploading compressed dataset changes (4 files, total 221.56 KB) to https://files.community.clear.ml
+   Upload completed (221.56 KB)
+   2021-05-04 09:32:03,388 - clearml.Task - INFO - Waiting to finish uploads
+   2021-05-04 09:32:04,067 - clearml.Task - INFO - Finished uploading
+   Dataset closed and finalized
+   ```
+
+## Listing Dataset Content
+
+To see that all the files were added to the created dataset, use `clearml-data list` and enter the ID of the dataset
+that was just closed.
+
+```bash
+clearml-data list --id 24d05040f3e14fbfbed8edb1bf08a88c
+```
+
+Expected response:
+
+```console
+clearml-data - Dataset Management & Versioning CLI 
+
+List dataset content: 24d05040f3e14fbfbed8edb1bf08a88c 
+Listing dataset content
+file name                                                        | size       | hash                                                            
+------------------------------------------------------------------------------------------------------------------------------------------------
+dancing.jpg                                                      |     40,484 | 78e804c0c1d54da8d67e9d072c1eec514b91f4d1f296cdf9bf16d6e54d63116a
+data.csv                                                         |     21,440 | b618696f57b822cd2e9b92564a52b3cc93a2206f41df3f022956bb6cfe4e7ad5
+picasso.jpg                                                      |    114,573 | 6b3c67ea9ec82b09bd7520dd09dad2f1176347d740fd2042c88720e780691a7c
+sample.json                                                      |        132 | 9c42a9a978ac7a71873ebd5c65985e613cfaaff1c98f655af0d2ee0246502fd7
+sample.mp3                                                       |     72,142 | fbb756ae14005420ff00ccdaff99416bebfcea3adb7e30963a69e68e9fbe361b
+Total 5 files, 248771 bytes
+```
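+
+The listing can also be checked programmatically. The sketch below parses the table format shown above (`file name | size | hash`) and confirms that the per-file sizes add up to the reported total. `parse_listing` is a hypothetical helper, and the parsing assumes the exact column layout of this sample output.
+
+```python
+def parse_listing(text):
+    """Parse `file name | size | hash` rows into (name, size_in_bytes, hash) tuples."""
+    rows = []
+    for line in text.splitlines():
+        parts = [part.strip() for part in line.split("|")]
+        # Data rows have exactly three columns and a numeric size;
+        # the header, separator, and total lines are skipped.
+        if len(parts) == 3 and parts[1].replace(",", "").isdigit():
+            rows.append((parts[0], int(parts[1].replace(",", "")), parts[2]))
+    return rows
+
+
+# Rows copied from the expected response above
+listing = """\
+file name   | size       | hash
+dancing.jpg |     40,484 | 78e804c0c1d54da8d67e9d072c1eec514b91f4d1f296cdf9bf16d6e54d63116a
+data.csv    |     21,440 | b618696f57b822cd2e9b92564a52b3cc93a2206f41df3f022956bb6cfe4e7ad5
+picasso.jpg |    114,573 | 6b3c67ea9ec82b09bd7520dd09dad2f1176347d740fd2042c88720e780691a7c
+sample.json |        132 | 9c42a9a978ac7a71873ebd5c65985e613cfaaff1c98f655af0d2ee0246502fd7
+sample.mp3  |     72,142 | fbb756ae14005420ff00ccdaff99416bebfcea3adb7e30963a69e68e9fbe361b
+Total 5 files, 248771 bytes
+"""
+
+rows = parse_listing(listing)
+total_bytes = sum(size for _, size, _ in rows)
+print(f"Total {len(rows)} files, {total_bytes} bytes")
+```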
+
+## Creating a Child Dataset
+
+In ClearML Data, it's possible to create datasets that inherit the content of other datasets. These are called child datasets.
+
+1. Create a new dataset, specifying the previously created one as its parent:
+
+   ```bash
+   clearml-data create --project datasets --name HelloDataset-improved --parents 24d05040f3e14fbfbed8edb1bf08a88c
+   ```
+:::note
+You'll need to input the dataset ID you received when you created the dataset above.
+:::
+
+1. Now, we want to add a new file. 
+   * Create a new file: `echo "data data data" > new_data.txt` (this will create the file `new_data.txt`).
+   * Now add the file to the dataset:  
+
+   ```bash
+   clearml-data add --files new_data.txt
+   ```
+   Which should return this output:
+
+   ```console
+   clearml-data - Dataset Management & Versioning CLI
+   Adding files/folder to dataset id 8b68686a4af040d081027ba3cf6bbca6
+   1 file added
+   ```
+   
+1. Let's also remove a file. We'll need to specify the file's full path (within the dataset, not locally) to remove it.
+
+   ```bash
+   clearml-data remove --files data_samples/dancing.jpg
+   ```
+
+   Expected response:
+   ```bash
+   clearml-data - Dataset Management & Versioning CLI
+   Removing files/folder from dataset id 8b68686a4af040d081027ba3cf6bbca6
+   1 files removed
+   ```
+
+1. Close and finalize the dataset:
+
+   ```bash
+   clearml-data close
+   ```
+   
+1. Let's take a look again at the files in the dataset:
+
+   ```
+   clearml-data list --id 8b68686a4af040d081027ba3cf6bbca6
+   ```
+
+   And we see that our changes have been made! `new_data.txt` has been added, and `dancing.jpg` has been removed. 
+
+   ```
+   file name                                                        | size       | hash                                                            
+   ------------------------------------------------------------------------------------------------------------------------------------------------
+   data.csv                                                         |     21,440 | b618696f57b822cd2e9b92564a52b3cc93a2206f41df3f022956bb6cfe4e7ad5
+   new_data.txt                                                     |         15 | 6df986a2154902260a836febc5a32543f5337eac60560c57db99257a7e012051
+   picasso.jpg                                                      |    114,573 | 6b3c67ea9ec82b09bd7520dd09dad2f1176347d740fd2042c88720e780691a7c
+   sample.json                                                      |        132 | 9c42a9a978ac7a71873ebd5c65985e613cfaaff1c98f655af0d2ee0246502fd7
+   sample.mp3                                                       |     72,142 | fbb756ae14005420ff00ccdaff99416bebfcea3adb7e30963a69e68e9fbe361b
+   Total 5 files, 208302 bytes
+   ```
+
+Using `clearml-data` creates a clear lineage for the data. As seen in this example, once a dataset is closed, the 
+only way to add or remove data is to create a new dataset that uses the previous dataset as a parent. This way, the data 
+does not rely on the code and is reproducible. 
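+
+The parent and child relationship in this example can be modeled as follows: the child starts from the parent's content, then applies its own additions and removals, while the parent stays unchanged. This is only a conceptual sketch using plain dictionaries, not how ClearML stores datasets; `derive_child` is a hypothetical helper, and the sizes are taken from the listings above.
+
+```python
+# Sizes in bytes, taken from the parent dataset listing above
+parent_files = {
+    "dancing.jpg": 40484,
+    "data.csv": 21440,
+    "picasso.jpg": 114573,
+    "sample.json": 132,
+    "sample.mp3": 72142,
+}
+
+
+def derive_child(parent, added=None, removed=()):
+    """Return the child's content: the parent's files, plus additions, minus removals."""
+    child = dict(parent)  # copy, so the finalized parent version stays untouched
+    child.update(added or {})
+    for name in removed:
+        child.pop(name, None)
+    return child
+
+
+child_files = derive_child(
+    parent_files,
+    added={"new_data.txt": 15},
+    removed=["dancing.jpg"],
+)
+print(f"Total {len(child_files)} files, {sum(child_files.values())} bytes")
+```
+
+The result matches the child listing above (5 files, 208302 bytes), and `parent_files` still contains `dancing.jpg`.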
--- a/docs/clearml_data/data_management_examples/workflows.md
+++ b/docs/clearml_data/data_management_examples/workflows.md
@@ -0,0 +1,12 @@
+---
+title: Workflows 
+---
+
+Take a look at the ClearML Data examples which demonstrate common workflows using the `clearml-data` CLI and the 
+`Dataset` class:
+* [Dataset Management with CLI](data_man_simple.md) - Tutorial for creating, modifying, and consuming a dataset with the CLI.
+* [Folder Sync with CLI](data_man_folder_sync.md) - Tutorial for using the `clearml-data sync` CLI option to update a dataset according 
+  to a local folder.
+* [Dataset Management with CLI and SDK](data_man_cifar_classification.md) - Tutorial for creating a dataset with the CLI,
+  then programmatically ingesting the data with the SDK.
+* [Data Management with Python](data_man_python.md) - Example scripts for creating and consuming a dataset with the SDK.