Refactor ClearML Data docs (#108)

This commit is contained in:
pollfly 2021-11-08 13:21:44 +02:00 committed by GitHub
parent 43751dc64b
commit e155c49cfd
17 changed files with 847 additions and 683 deletions

View File

@ -1,361 +0,0 @@
---
title: ClearML Data
---
In Machine Learning, you are very likely dealing with a gargantuan amount of data that you need to put in a dataset,
which you then need to be able to share, reproduce, and track.
ClearML Data Management solves two important challenges:
- Accessibility - Making data easily accessible from every machine,
- Versioning - Linking data and experiments for better **traceability**.
**We believe Data is not code**. It should not be stored in a git tree, because progress on datasets is not always linear.
Moreover, it can be difficult and inefficient to locate, in a git tree, the commit associated with a certain version of a dataset.
A `clearml-data` dataset is a collection of files, stored on a central storage location (S3 / GS / Azure / Network Storage).
Datasets can be set up to inherit from other datasets, so data lineages can be created,
and users can track when and how their data changes.
Dataset changes are stored using differential storage, meaning a version stores only the change-set from its parent datasets.
Local copies of datasets are always cached, so the same data never needs to be downloaded twice.
When a dataset is pulled, all of its parent datasets are automatically pulled and merged into one output folder for you to work with.
ClearML Data offers two interfaces:
- `clearml-data` - CLI utility for creating, uploading, and managing datasets.
- `clearml.Dataset` - A python interface for creating, retrieving, managing, and using datasets.
## Setup
`clearml-data` comes built-in with our `clearml` python package! Just check out the [getting started](getting_started/ds/ds_first_steps.md) guide for more info!
## Workflow
Below is an example of a workflow that uses ClearML Data's command line tool to create a dataset, and then integrates the
dataset into code using ClearML Data's python interface.
### Creating a Dataset
Using the `clearml-data` CLI, users can create datasets using the following commands:
```bash
clearml-data create --project dataset_example --name initial_version
clearml-data add --files data_folder
clearml-data close
```
The commands will do the following:
1. Start a Data Processing Task called "initial_version" in the "dataset_example" project
1. The CLI will return a unique ID for the dataset
1. All the files from the "data_folder" folder will be added to the dataset and uploaded
by default to the [ClearML server](deploying_clearml/clearml_server.md).
1. The dataset will be finalized, making it immutable and ready to be consumed.
:::note
`clearml-data` is stateful and remembers the last created dataset, so there's no need to specify a dataset ID unless
we want to work on another dataset.
:::
### Using a Dataset
Now in our python code, we can access and use the created dataset from anywhere:
```python
from clearml import Dataset
local_path = Dataset.get(dataset_id='dataset_id_from_previous_command').get_local_copy()
```
We have all our files in the same folder structure under `local_path`. It is that simple!<br/>
The next step is to set the dataset_id as a parameter for our code and voilà! We can now train on any dataset we have in
the system.
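For example, a minimal sketch of parameterizing the dataset ID (the task and parameter names here are illustrative, not part of the original example):

```python
from clearml import Dataset, Task

task = Task.init(project_name='examples', task_name='train_with_dataset')

# Connect the dataset ID as a configurable parameter (the name is illustrative)
args = {'dataset_id': 'dataset_id_from_previous_command'}
task.connect(args)

# Fetch a cached, read-only local copy of whichever dataset was configured
local_path = Dataset.get(dataset_id=args['dataset_id']).get_local_copy()
```

Changing `dataset_id` in the UI (or when cloning the task) now swaps in a different dataset without touching the code.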
## CLI Options
It's possible to manage datasets (create / modify / upload / delete) with the `clearml-data` command line tool.
### Creating a Dataset
```bash
clearml-data create --project <project_name> --name <dataset_name> --parents <existing_dataset_id>
```
Creates a new dataset. <br/>
**Parameters**
|Name|Description|Optional|
|---|---|---|
|name |Dataset's name| <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|project|Dataset's project| <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|parents|IDs of the dataset's parents. The dataset inherits all of its parents' content. Multiple parents can be entered, but they are merged in the order they were entered| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|tags |Dataset user tags. The dataset can be labeled, which can be useful for organizing datasets| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
:::important
clearml-data works in a stateful mode so once a new dataset is created, the following commands
do not require the `--id` flag.
:::
<br/>
### Add Files
```bash
clearml-data add --id <dataset_id> --files <filenames/folders_to_add>
```
It's possible to add individual files or complete folders.<br/>
**Parameters**
|Name|Description|Optional|
|---|---|---|
|id | Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|files|Files / folders to add. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json` | <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|dataset-folder | Dataset base folder to add the files to in the dataset. Default: dataset root| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|non-recursive | Disable recursive scan of files | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|verbose | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
<br/>
### Remove Files
```bash
clearml-data remove --id <dataset_id_to_remove_from> --files <filenames/folders_to_remove>
```
**Parameters**
|Name|Description|Optional|
|---|---|---|
|id | Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|files | Files / folders to remove (wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`). Notice: file path is the path within the dataset, not the local path.| <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|non-recursive | Disable recursive scan of files | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|verbose | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
<br/>
### Upload Dataset Content
```bash
clearml-data upload [--id <dataset_id>] [--storage <upload_destination>]
```
Uploads added files to [ClearML Server](deploying_clearml/clearml_server.md) by default. It's possible to specify a different storage
medium by entering an upload destination, such as `s3://bucket`, `gs://`, `azure://`, `/mnt/shared/`.
**Parameters**
|Name|Description|Optional|
|---|---|---|
|id| Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|storage| Remote storage to use for the dataset files. Default: files_server | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|verbose | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
<br/>
### Finalize Dataset
```bash
clearml-data close --id <dataset_id>
```
Finalizes the dataset and makes it ready to be consumed.
It automatically uploads all files that were not previously uploaded.
Once a dataset is finalized, it can no longer be modified.
**Parameters**
|Name|Description|Optional|
|---|---|---|
|id| Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|storage| Remote storage to use for the dataset files. Default: files_server | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|disable-upload | Disable automatic upload when closing the dataset | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|verbose | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
<br/>
### Sync Local Folder
```
clearml-data sync [--id <dataset_id>] --folder <folder_location> [--parents '<parent_id>']
```
This option syncs a folder's content with ClearML. It is useful when a user has a single point of truth (i.e. a folder) that
updates from time to time.
When an update should be reflected in ClearML, users can call `clearml-data sync` and point it at the folder, and the
changes (file additions, modifications, and removals) will be reflected in ClearML.
This command also uploads the data and finalizes the dataset automatically.
**Parameters**
|Name|Description|Optional|
|---|---|---|
|id| Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|folder|Local folder to sync. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|storage|Remote storage to use for the dataset files. Default: files_server |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|parents|IDs of the dataset's parents (i.e. merge all parents). All modifications made to the folder since the parents were synced will be reflected in the dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|project|If creating a new dataset, specify the dataset's project name|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|name|If creating a new dataset, specify the dataset's name|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|tags|Dataset user tags|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|skip-close|Do not auto close dataset after syncing folders|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|verbose | Verbose reporting |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
<br/>
### List Dataset Content
```bash
clearml-data list [--id <dataset_id>]
```
**Parameters**
|Name|Description|Optional|
|---|---|---|
|id|Dataset ID whose contents will be shown (alternatively, use project / name combination). Default: previously accessed dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|project|Specify dataset project name (if used instead of ID, dataset name is also required)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|name|Specify dataset name (if used instead of ID, dataset project is also required)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|filter|Filter files based on folder / wildcard. Multiple filters are supported. Example: `folder/date_*.json folder/sub-folder`|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|modified|Only list file changes (add / remove / modify) introduced in this version|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
<br/>
### Delete Dataset
```
clearml-data delete [--id <dataset_id_to_delete>]
```
Deletes an entire dataset from ClearML. This can also be used to delete a newly created dataset.
This does not work on datasets with children.
**Parameters**
|Name|Description|Optional|
|---|---|---|
|id|ID of dataset to be deleted. Default: previously created / accessed dataset that hasn't been finalized yet|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|force|Force dataset deletion even if other dataset versions depend on it|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
<br/>
### Search for a Dataset
```
clearml-data search [--name <name>] [--project <project_name>] [--tags <tag>]
```
Lists all datasets in the system that match the search request.
Datasets can be searched by project, name, ID, and tags.
**Parameters**
|Name|Description|Optional|
|---|---|---|
|ids|A list of dataset IDs|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|project|The project name of the datasets|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|name|A dataset name or a partial name to filter datasets by|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|tags|A list of dataset user tags|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
<br/>
### Compare Two Datasets
```
clearml-data compare [--source SOURCE] [--target TARGET]
```
Compare two datasets (target vs. source). The command returns a comparison summary that looks like this:
```
Comparison summary: 4 files removed, 3 files modified, 0 files added
```
**Parameters**
|Name|Description|Optional|
|---|---|---|
|source|Source dataset id (used as baseline)|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|target|Target dataset id (compare against the source baseline dataset)|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|verbose|Verbose report all file changes (instead of summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
### Merge Datasets
```
clearml-data squash --name NAME --ids [IDS [IDS ...]]
```
Squash (merge) multiple datasets into a single dataset version.
**Parameters**
|Name|Description|Optional|
|---|---|---|
|name|Name of the new squashed dataset|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|ids|Source dataset IDs to squash (merge down)|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|storage|Remote storage to use for the dataset files. Default: files_server |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|verbose|Verbose report all file changes (instead of summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
### Verify Dataset
```
clearml-data verify [--id ID] [--folder FOLDER]
```
Verify that the dataset content matches the data from the local source.
**Parameters**
|Name|Description|Optional|
|---|---|---|
|id|Specify dataset ID. Default: previously created/accessed dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|folder|Specify dataset local copy (if not provided the local cache folder will be verified)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|filesize| If True, only verify file size and skip hash checks (default: false)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|verbose|Verbose report all file changes (instead of summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
### Get a Dataset
```
clearml-data get [--id ID] [--copy COPY] [--link LINK] [--overwrite]
```
Get a local copy of a dataset. By default, you get a read-only cached folder, but you can get a mutable copy by using the
`--copy` flag.
**Parameters**
|Name|Description|Optional|
|---|---|---|
|id| Specify dataset ID. Default: previously created / accessed dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|copy| Get a writable copy of the dataset to a specific output folder|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|link| Create a soft link (not supported on Windows) to a read-only cached folder containing the dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|overwrite| If True, overwrite the target folder|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|verbose| Verbose report all file changes (instead of summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
### Publish a Dataset
```
clearml-data publish --id ID
```
Publish the dataset for public use. The dataset must be [finalized](#finalize-dataset) before it is published.
**Parameters**
|Name|Description|Optional|
|---|---|---|
|id| The dataset task id to be published.|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
## Python API
It's also possible to manage a dataset using ClearML Data's python interface.
All API commands should be imported with:
```python
from clearml import Dataset
```
See all API commands in the [Dataset](references/sdk/dataset.md) reference page.
## Tutorials
Take a look at the ClearML Data tutorials:
* [Dataset Management with CLI and SDK](guides/data%20management/data_man_cifar_classification)
* [Dataset Management with CLI](guides/data%20management/data_man_simple)
* [Folder Sync with CLI](guides/data%20management/data_man_folder_sync)

View File

@ -0,0 +1,46 @@
---
title: Best Practices
---
The following are some recommendations for using ClearML Data.
## Versioning Datasets
Use ClearML Data to version your datasets. Once a dataset is finalized, it can no longer be modified. This makes clear
which version of the dataset was used with which task, enabling the accurate reproduction of your experiments.
When you need to change the dataset's contents, create a new version of the dataset and specify the previous
version as a parent. The new dataset version inherits the previous version's contents and is ready to be updated.
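For instance, a minimal sketch of creating a new version on top of an existing one (the names and parent ID are placeholders):

```python
from clearml import Dataset

# Create a child version that inherits the previous version's contents
new_version = Dataset.create(
    dataset_name='my_dataset',
    dataset_project='my_project',
    parent_datasets=['<previous_version_id>']
)
new_version.add_files(path='path/to/new_or_changed_files')
new_version.upload()
new_version.finalize()  # the new version becomes immutable
```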
## Organize Datasets for Easier Access
Organize datasets according to use-case, and use tags. This makes it easier to manage multiple datasets and to
access the most up-to-date dataset for each use-case.
Like any ClearML task, datasets can be organized into [projects (and sub-projects)](../fundamentals/projects.md#creating-sub-projects).
Additionally, when creating a dataset, tags can be applied to the dataset, which will make searching for the dataset easier.
Organizing your datasets into projects by use-case makes it easier to access the most recent dataset version for that use-case.
If only a project is specified when using [`Dataset.get`](../references/sdk/dataset.md#datasetget), the method returns the
most recent dataset in a project. The same is true with tags; if a tag is specified, the method will return the most recent dataset that is labeled with that tag.
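For example, a sketch of retrieving the most recent dataset by project or by tag (the project name and tag are illustrative):

```python
from clearml import Dataset

# Most recent dataset in the project
latest_in_project = Dataset.get(dataset_project='my_project')

# Most recent dataset labeled with the given tag
latest_tagged = Dataset.get(dataset_tags=['production'])
```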
## Document your Datasets
Attach informative metrics or debug samples to the Dataset itself. Use the [`get_logger`](../references/sdk/dataset.md#get_logger)
method to access the dataset's logger object, then add any additional information to the dataset, using the methods
available with a [logger](../references/sdk/logger.md) object.
You can add some dataset summaries (like [table reporting](../references/sdk/logger.md#report_table)) to create a preview
of the data stored for better visibility, or attach any statistics generated by the data ingestion process.
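For example, a sketch of attaching a preview table (assuming `dataset` is a not-yet-finalized `Dataset` object; the DataFrame contents are illustrative):

```python
import pandas as pd

preview = pd.DataFrame({'file': ['img_001.jpg', 'img_002.jpg'], 'label': [0, 1]})

# Report the table through the dataset's logger so it appears as a preview
dataset.get_logger().report_table(
    title='Dataset Preview', series='head', iteration=0, table_plot=preview
)
```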
## Periodically Update Your Dataset
Your data probably changes from time to time. If the data is updated in the same local / network folder structure, which
serves as the dataset's single point of truth, you can schedule a script that uses the dataset `sync` functionality to
update the dataset based on the modifications made to the folder. This way, there is no need to manually modify a dataset.
This functionality also tracks the modifications made to the folder.
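A sketch of such a scheduled script might look like this (the folder path, names, and parent ID are placeholders):

```python
from clearml import Dataset

# New version on top of the previous one
dataset = Dataset.create(
    dataset_name='synced_dataset',
    dataset_project='my_project',
    parent_datasets=['<previous_version_id>']
)

# Record file additions / modifications / removals relative to the parent
dataset.sync_folder(local_path='/mnt/shared/single_point_of_truth')
dataset.upload()
dataset.finalize()
```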
See the sync function with the [CLI](clearml_data_cli.md#syncing-local-storage) or [SDK](clearml_data_sdk.md#syncing-local-storage)
interface.

View File

@ -0,0 +1,92 @@
---
title: Introduction
---
In Machine Learning, you are very likely dealing with a gargantuan amount of data that you need to put in a dataset,
which you then need to be able to share, reproduce, and track.
ClearML Data Management solves two important challenges:
- Accessibility - Making data easily accessible from every machine,
- Versioning - Linking data and experiments for better **traceability**.
**We believe Data is not code**. It should not be stored in a git tree, because progress on datasets is not always linear.
Moreover, it can be difficult and inefficient to locate, in a git tree, the commit associated with a certain version of a dataset.
A `clearml-data` dataset is a collection of files, stored on a central storage location (S3 / GS / Azure / Network Storage).
Datasets can be set up to inherit from other datasets, so data lineages can be created,
and users can track when and how their data changes.
Dataset changes are stored using differential storage, meaning a version stores only the change-set from its parent datasets.
Local copies of datasets are always cached, so the same data never needs to be downloaded twice.
When a dataset is pulled, all of its parent datasets are automatically pulled and merged into one output folder for you to work with.
## Setup
`clearml-data` comes built-in with the `clearml` python package! Just check out the [Getting Started](../getting_started/ds/ds_first_steps.md)
guide for more info!
## Using ClearML Data
ClearML Data offers two interfaces:
- `clearml-data` - A CLI utility for creating, uploading, and managing datasets. See [CLI](clearml_data_cli.md) for a reference of `clearml-data` commands.
- `clearml.Dataset` - A python interface for creating, retrieving, managing, and using datasets. See [SDK](clearml_data_sdk.md) for an overview of the basic methods of the `Dataset` module.
For an overview of our recommendations for ClearML Data workflows and practices, see [Best Practices](best_practices.md).
## WebApp
ClearML's WebApp provides a visual interface to your datasets through dataset tasks. Dataset tasks are categorized
as the data-processing [task type](../fundamentals/task.md#task-types), and they are labeled with a `DATASET` system tag.
The full log (calls / CLI) of the dataset creation process can be found in a dataset's **EXECUTION** section.
A listing of the dataset's differential snapshot, a summary of the files added / modified / removed, and the details of
the files in the differential snapshot (location / size / hash) are available in the **ARTIFACTS** section. Download the
dataset by clicking <img src="/docs/latest/icons/ico-download-json.svg" alt="Download" className="icon size-sm space-sm" />
next to the **FILE PATH**.
The full dataset listing (all files included) is available in the **CONFIGURATION** section under **Dataset Content**.
This allows you to quickly compare two dataset contents and visually see the difference.
The dataset genealogy DAG and change-set summary table are visualized in **RESULTS > PLOTS**.
<details className="cml-expansion-panel screenshot">
<summary className="cml-expansion-panel-summary">Dataset Contents</summary>
<div className="cml-expansion-panel-content">
![Dataset data WebApp](../img/dataset_data.png)
</div>
</details>
<br/>
View a DAG of the dataset dependencies (all previous dataset versions and their parents) in the dataset's page **> ARTIFACTS > state**.
<details className="cml-expansion-panel screenshot">
<summary className="cml-expansion-panel-summary">Data Dependency DAG</summary>
<div className="cml-expansion-panel-content">
![Dataset state WebApp](../img/dataset_data_state.png)
</div>
</details>
Once a dataset has been finalized, view its genealogy in the dataset's
page **>** **RESULTS** **>** **PLOTS**.
<details className="cml-expansion-panel screenshot">
<summary className="cml-expansion-panel-summary">Dataset Genealogy</summary>
<div className="cml-expansion-panel-content">
![Dataset genealogy and summary](../img/dataset_genealogy_summary.png)
</div>
</details>

View File

@ -0,0 +1,328 @@
---
title: CLI
---
The `clearml-data` utility is a CLI tool for controlling and managing your data with ClearML.
The following page provides a reference to `clearml-data`'s CLI commands.
### Creating a Dataset
```bash
clearml-data create --project <project_name> --name <dataset_name> --parents <existing_dataset_id>
```
Creates a new dataset. <br/>
**Parameters**
<div className="tbl-cmd">
|Name|Description|Optional|
|---|---|---|
|`--name` |Dataset's name| <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|`--project`|Dataset's project| <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|`--parents`|IDs of the dataset's parents. The dataset inherits all of its parents' content. Multiple parents can be entered, but they are merged in the order they were entered| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--tags` |Dataset user tags. The dataset can be labeled, which can be useful for organizing datasets| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
</div>
:::important
clearml-data works in a stateful mode so once a new dataset is created, the following commands
do not require the `--id` flag.
:::
<br/>
### Adding Files
```bash
clearml-data add --id <dataset_id> --files <filenames/folders_to_add>
```
It's possible to add individual files or complete folders.<br/>
**Parameters**
<div className="tbl-cmd">
|Name|Description|Optional|
|---|---|---|
|`--id` | Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--files`|Files / folders to add. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json` | <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|`--dataset-folder` | Dataset base folder to add the files to in the dataset. Default: dataset root| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--non-recursive` | Disable recursive scan of files | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--verbose` | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
</div>
<br/>
### Removing Files
```bash
clearml-data remove --id <dataset_id_to_remove_from> --files <filenames/folders_to_remove>
```
**Parameters**
<div className="tbl-cmd">
|Name|Description|Optional|
|---|---|---|
|`--id` | Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--files` | Files / folders to remove (wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`). Notice: file path is the path within the dataset, not the local path.| <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|`--non-recursive` | Disable recursive scan of files | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--verbose` | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
</div>
<br/>
### Uploading Dataset Content
```bash
clearml-data upload [--id <dataset_id>] [--storage <upload_destination>]
```
Uploads added files to [ClearML Server](../deploying_clearml/clearml_server.md) by default. It's possible to specify a different storage
medium by entering an upload destination, such as `s3://bucket`, `gs://`, `azure://`, `/mnt/shared/`.
**Parameters**
<div className="tbl-cmd">
|Name|Description|Optional|
|---|---|---|
|`--id`| Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--storage`| Remote storage to use for the dataset files. Default: files_server | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--verbose` | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
</div>
<br/>
### Finalizing a Dataset
```bash
clearml-data close --id <dataset_id>
```
Finalizes the dataset and makes it ready to be consumed.
It automatically uploads all files that were not previously uploaded.
Once a dataset is finalized, it can no longer be modified.
**Parameters**
<div className="tbl-cmd">
|Name|Description|Optional|
|---|---|---|
|`--id`| Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--storage`| Remote storage to use for the dataset files. Default: files_server | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--disable-upload` | Disable automatic upload when closing the dataset | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--verbose` | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
</div>
<br/>
### Syncing Local Storage
```
clearml-data sync [--id <dataset_id>] --folder <folder_location> [--parents '<parent_id>']
```
This option syncs a folder's content with ClearML. It is useful when a user has a single point of truth (i.e. a folder) that
updates from time to time.
When an update should be reflected in ClearML, users can call `clearml-data sync` and point it at the folder, and the
changes (file additions, modifications, and removals) will be reflected in ClearML.
This command also uploads the data and finalizes the dataset automatically.
**Parameters**
<div className="tbl-cmd">
|Name|Description|Optional|
|---|---|---|
|`--id`| Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--folder`|Local folder to sync. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|`--storage`|Remote storage to use for the dataset files. Default: files_server |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--parents`|IDs of the dataset's parents (i.e. merge all parents). All modifications made to the folder since the parents were synced will be reflected in the dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--project`|If creating a new dataset, specify the dataset's project name|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--name`|If creating a new dataset, specify the dataset's name|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--tags`|Dataset user tags|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--skip-close`|Do not auto close dataset after syncing folders|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--verbose` | Verbose reporting |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
</div>
<br/>
### Listing Dataset Content
```bash
clearml-data list [--id <dataset_id>]
```
**Parameters**
<div className="tbl-cmd">
|Name|Description|Optional|
|---|---|---|
|`--id`|Dataset ID whose contents will be shown (alternatively, use project / name combination). Default: previously accessed dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--project`|Specify dataset project name (if used instead of ID, dataset name is also required)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--name`|Specify dataset name (if used instead of ID, dataset project is also required)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--filter`|Filter files based on folder / wildcard. Multiple filters are supported. Example: `folder/date_*.json folder/sub-folder`|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--modified`|Only list file changes (add / remove / modify) introduced in this version|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
</div>
<br/>
### Deleting a Dataset
```
clearml-data delete [--id <dataset_id_to_delete>]
```
Deletes an entire dataset from ClearML. This can also be used to delete a newly created dataset.
This does not work on datasets with children.
**Parameters**
<div className="tbl-cmd">
|Name|Description|Optional|
|---|---|---|
|`--id`|ID of dataset to be deleted. Default: previously created / accessed dataset that hasn't been finalized yet|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--force`|Force dataset deletion even if other dataset versions depend on it|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
</div>
<br/>
### Searching for a Dataset
```
clearml-data search [--name <name>] [--project <project_name>] [--tags <tag>]
```
Lists all datasets in the system that match the search request.
Datasets can be searched by project, name, ID, and tags.
**Parameters**
<div className="tbl-cmd">
|Name|Description|Optional|
|---|---|---|
|`--ids`|A list of dataset IDs|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--project`|The project name of the datasets|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--name`|A dataset name or a partial name to filter datasets by|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--tags`|A list of dataset user tags|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
</div>
<br/>
### Comparing Two Datasets
```
clearml-data compare [--source SOURCE] [--target TARGET]
```
Compare two datasets (target vs. source). The command returns a comparison summary that looks like this:
```
Comparison summary: 4 files removed, 3 files modified, 0 files added
```
**Parameters**
<div className="tbl-cmd">
|Name|Description|Optional|
|---|---|---|
|`--source`|Source dataset id (used as baseline)|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|`--target`|Target dataset id (compare against the source baseline dataset)|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|`--verbose`|Verbose report all file changes (instead of summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
</div>
### Merging Datasets
```
clearml-data squash --name NAME --ids [IDS [IDS ...]]
```
Squash (merge) multiple datasets into a single dataset version.
**Parameters**
<div className="tbl-cmd">
|Name|Description|Optional|
|---|---|---|
|`--name`|Name of the new squashed dataset|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|`--ids`|Source dataset IDs to squash (merge down)|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|`--storage`|Remote storage to use for the dataset files. Default: files_server |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--verbose`|Verbose report all file changes (instead of summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
</div>
### Verifying a Dataset
```
clearml-data verify [--id ID] [--folder FOLDER]
```
Verify that the dataset content matches the data from the local source.
**Parameters**
<div className="tbl-cmd">
|Name|Description|Optional|
|---|---|---|
|`--id`|Specify dataset ID. Default: previously created/accessed dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--folder`|Specify dataset local copy (if not provided the local cache folder will be verified)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--filesize`| If True, only verify file size and skip hash checks (default: false)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--verbose`|Verbose report all file changes (instead of summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
</div>
### Getting a Dataset
```
clearml-data get [--id ID] [--copy COPY] [--link LINK] [--overwrite]
```
Get a local copy of a dataset. By default, you get a read-only cached folder, but you can get a mutable copy by using the
`--copy` flag.
**Parameters**
<div className="tbl-cmd">
|Name|Description|Optional|
|---|---|---|
|`--id`| Specify dataset ID. Default: previously created / accessed dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--copy`| Get a writable copy of the dataset to a specific output folder|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--link`| Create a soft link (not supported on Windows) to a read-only cached folder containing the dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--overwrite`| If True, overwrite the target folder|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--verbose`| Verbose report all file changes (instead of summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
</div>
### Publishing a Dataset
```
clearml-data publish --id ID
```
Publish the dataset for public use. The dataset must be [finalized](#finalizing-a-dataset) before it is published.
**Parameters**
<div className="tbl-cmd">
|Name|Description|Optional|
|---|---|---|
|`--id`| The dataset task id to be published.|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
</div>

View File

@ -0,0 +1,148 @@
---
title: SDK
---
Datasets can be created, modified, and managed with ClearML Data's python interface. The following page provides an overview
for using the most basic methods of the `Dataset` class. See the [Dataset reference page](../references/sdk/dataset.md)
for a complete list of available methods.
Import the `Dataset` class, and let's get started!
```python
from clearml import Dataset
```
## Creating Datasets
ClearML Data supports multiple ways to create datasets programmatically, serving a variety of use-cases:
* [`Dataset.create()`](#datasetcreate) - Create a new dataset. Parent datasets can be specified, from which the new dataset
will inherit its data
* [`Dataset.squash()`](#datasetsquash) - Generate a new dataset by squashing together a set of related datasets
### Dataset.create()
Use the [`Dataset.create`](../references/sdk/dataset.md#datasetcreate) class method to create a dataset.
Creating datasets programmatically is especially helpful when preprocessing the data so that the
preprocessing code and the resulting dataset are saved in a single task (see `use_current_task` parameter in [`Dataset.create`](../references/sdk/dataset.md#datasetcreate)).
```python
# Preprocessing code here
dataset = Dataset.create(
    dataset_name='dataset name',
    dataset_project='dataset project',
    parent_datasets=[PARENT_DS_ID_1, PARENT_DS_ID_2]
)
```
The created dataset inherits the content of the `parent_datasets`. When multiple dataset parents are listed,
they are merged in order of specification. Each parent overrides any overlapping files from a previous parent dataset.
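For instance, a sketch of binding the dataset to the currently running preprocessing task via `use_current_task` (the project and task names are illustrative):

```python
from clearml import Task, Dataset

task = Task.init(project_name='dataset_example', task_name='preprocessing')
# ... preprocessing code here ...

# Store the dataset on the current task instead of creating a separate one
dataset = Dataset.create(use_current_task=True)
dataset.add_files(path='path/to/processed_data')
```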
### Dataset.squash()
To improve deep dataset DAG storage and speed, dataset squashing was introduced. The [`Dataset.squash`](../references/sdk/dataset.md#datasetsquash)
class method generates a new dataset by squashing a set of dataset versions, and merging down all changes introduced in
their lineage DAG, creating a new, flat, independent version.
The datasets being squashed into a single dataset can be specified by their IDs or by project & name pairs.
```python
# option 1 - list dataset IDs
squashed_dataset_1 = Dataset.squash(
    dataset_name='squashed dataset\'s name',
    dataset_ids=[DS1_ID, DS2_ID, DS3_ID]
)

# option 2 - list project and dataset pairs
squashed_dataset_2 = Dataset.squash(
    dataset_name='squashed dataset 2',
    dataset_project_name_pairs=[('dataset1 project', 'dataset1 name'),
                                ('dataset2 project', 'dataset2 name')]
)
```
In addition, the target storage location for the squashed dataset can be specified using the `output_url` parameter of the
[`Dataset.squash`](../references/sdk/dataset.md#datasetsquash) method.
## Accessing Datasets
Once a dataset has been created and uploaded to a server, the dataset can be accessed programmatically from anywhere.
Use the [`Dataset.get`](../references/sdk/dataset.md#datasetget) class method to access a specific Dataset object, either
with the dataset's ID or with its project and name. If only a project name or tag is provided, the method returns the
most recent dataset in the specified project, or the most recent dataset with the specified tag.
Once a specific dataset object has been obtained, get a local copy of the dataset using one of the following options (see the sketch after this list):
* [`Dataset.get_local_copy()`](../references/sdk/dataset.md#get_local_copy) - get a read-only local copy of an entire dataset.
This method returns a path to the dataset in local cache (downloading the dataset if it is not already in cache)
* [`Dataset.get_mutable_local_copy()`](../references/sdk/dataset.md#get_mutable_local_copy) - get a writable local copy
of an entire dataset. This method downloads the dataset to a specific folder (non-cached), specified with the `target_folder` parameter. If
the specified folder already has contents, specify whether to overwrite its contents with the dataset contents, using the `overwrite` parameter.
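Putting the two options together, a minimal sketch (the project, name, and target folder are illustrative):

```python
from clearml import Dataset

dataset = Dataset.get(dataset_project='dataset project', dataset_name='dataset name')

# Read-only copy from the local cache (downloaded only if not already cached)
cached_path = dataset.get_local_copy()

# Writable copy downloaded to a folder of our choosing
mutable_path = dataset.get_mutable_local_copy(
    target_folder='./my_dataset_copy',
    overwrite=True
)
```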
## Modifying Datasets
Once a dataset has been created, its contents can be modified and replaced. When your data changes, you can
add updated files or remove unnecessary files.
### add_files()
To add files or folders into the current dataset, use the [`Dataset.add_files`](../references/sdk/dataset.md#add_files)
method. If a file is already in a dataset, but it has been modified, it can be added again, and ClearML will
upload the file diff.
```python
dataset = Dataset.create()
dataset.add_files(path="path/to/folder_or_file")
```
There is an option to add a set of files based on wildcard matching of a single string or a list of strings, using the
`wildcard` parameter. Specify whether to match the wildcard files recursively using the `recursive` parameter.
For example:
```python
dataset.add_files(
    path="path/to/folder",
    wildcard="~/data/*.jpg",
    recursive=True
)
```
### remove_files()
To remove files from a current dataset, use the [`Dataset.remove_files`](../references/sdk/dataset.md#remove_files) method.
Input the path to the folder or file to be removed in the `dataset_path` parameter. The path is relative to the dataset.
There is also an option to input a wildcard into `dataset_path` in order to remove a set of files matching the wildcard.
Set the `recursive` parameter to `True` in order to match all wildcard files recursively.
For example:
```python
dataset.remove_files(dataset_path="*.csv", recursive=True)
```
## Uploading Files
To upload the dataset files to network storage, use the [`Dataset.upload`](../references/sdk/dataset.md#upload) method.
Use the `output_url` parameter to specify the storage target, such as S3 / GS / Azure (e.g. `s3://bucket/data`, `gs://bucket/data`, `azure://bucket/data`, `/mnt/share/data`).
By default, the dataset uploads to ClearML's file server.
Dataset files must be uploaded before a dataset is [finalized](#finalizing-a-dataset).
## Finalizing a Dataset
Use the [`Dataset.finalize`](../references/sdk/dataset.md#finalize) method to close the current dataset. This marks the
dataset task as *Completed*, at which point, the dataset can no longer be modified.
Before closing a dataset, its files must first be [uploaded](#uploading-files).
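Putting the two steps together, a minimal sketch (the storage destination is illustrative, and `dataset` is the object created above):

```python
dataset.upload(output_url='s3://my-bucket/datasets')
dataset.finalize()  # the dataset task is marked Completed and becomes immutable
```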
## Syncing Local Storage
Use the [`Dataset.sync_folder`](../references/sdk/dataset.md#sync_folder) method in order to update a dataset according
to a specific folder's content changes. Specify the folder to sync with the `local_path` parameter (the method scans the folder and all of its subfolders).
This method is useful in the case where there's a single point of truth, either a local or network folder, that gets updated periodically.
The folder changes will be reflected in a new dataset version. This method saves time since you don't have to manually
update (add / remove) files in a dataset.
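For example, a minimal sketch (the folder path is illustrative, and `dataset` is a writable `Dataset` object):

```python
# Mirror the folder's current state into this dataset version
dataset.sync_folder(local_path='/mnt/shared/single_point_of_truth')
```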

View File

@ -0,0 +1,98 @@
---
title: Dataset Management with CLI and SDK
---
In this tutorial, we are going to manage the CIFAR dataset with `clearml-data` CLI, and then use ClearML's [`Dataset`](../../references/sdk/dataset.md)
class to ingest the data.
## Creating the Dataset
### Downloading the Data
Before we can register the CIFAR dataset with `clearml-data`, we need to obtain a local copy of it.
Execute this python script to download the data:
```python
from clearml import StorageManager
manager = StorageManager()
dataset_path = manager.get_local_copy(
    remote_url="https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"
)
# make sure to copy the printed value
print("COPY THIS DATASET PATH: {}".format(dataset_path))
```
Expected response:
```bash
COPY THIS DATASET PATH: ~/.clearml/cache/storage_manager/global/f2751d3a22ccb78db0e07874912b5c43.cifar-10-python_artifacts_archive_None
```
The script prints the path to the downloaded data. It will be needed later on.
### Creating the Dataset
To create the dataset, execute the following command:
```
clearml-data create --project dataset_examples --name cifar_dataset
```
Expected response:
```
clearml-data - Dataset Management & Versioning CLI
Creating a new dataset:
New dataset created id=ee1c35f60f384e65bc800f42f0aca5ec
```
Where `ee1c35f60f384e65bc800f42f0aca5ec` is the dataset ID.
## Adding Files
Add the files we just downloaded to the dataset:
```
clearml-data add --files <dataset_path>
```
where `dataset_path` is the path that was printed earlier, which denotes the location of the downloaded dataset.
:::note
There's no need to specify a `dataset_id`, since the `clearml-data` session stores it.
:::
## Finalizing the Dataset
Run the [`close`](../../references/sdk/dataset.md#close) command to upload the files (they'll be uploaded to the ClearML Server by default):<br/>
```
clearml-data close
```
This command sets the dataset task's status to *completed*, so it will no longer be modifiable. This ensures future
reproducibility.
The information about the dataset, including a list of files and their sizes, can be viewed
in the WebApp, in the dataset task's **ARTIFACTS** tab.
![image](../../img/examples_data_management_cifar_dataset.png)
## Using the Dataset
Now that we have a new dataset registered, we can consume it.
The [data_ingestion.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/data_ingestion.py) example
script demonstrates using the dataset within Python code.
```python
from clearml import Dataset
from torchvision import datasets, transforms

dataset_name = "cifar_dataset"
dataset_project = "dataset_examples"

# Get a cached local copy of the registered dataset
dataset_path = Dataset.get(dataset_name=dataset_name, dataset_project=dataset_project).get_local_copy()

# An illustrative CIFAR10 transform; the full example script defines its own
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

trainset = datasets.CIFAR10(
    root=dataset_path,
    train=True,
    download=False,
    transform=transform
)
```
The Dataset's [`get_local_copy`](../../references/sdk/dataset.md#get_local_copy) method returns a path to the cached,
downloaded dataset. Then we provide the path to PyTorch's dataset object.
The script then trains a neural network to classify images using the dataset created above.

View File

@ -1,20 +1,24 @@
---
title: Folder Sync
title: Folder Sync with CLI
---
This example shows how to use the *clearml-data* folder sync function.
This example shows how to use the `clearml-data` folder sync function.
*clearml-data* folder sync mode is useful for cases when users have a single point of truth (i.e. a folder) that updates
`clearml-data` folder sync mode is useful for cases when users have a single point of truth (i.e. a folder) that updates
from time to time. When the point of truth is updated, users can call `clearml-data sync` and the
changes (file addition, modification, or removal) will be reflected in ClearML.
## Creating Initial Version
## Prerequisites
First, make sure that you have cloned the [clearml](https://github.com/allegroai/clearml) repository. This contains all
1. First, make sure that you have cloned the [clearml](https://github.com/allegroai/clearml) repository. This contains all
the needed files.
1. Open terminal and change directory to the cloned repository's examples folder
`cd clearml/examples/reporting`
```
cd clearml/examples/reporting
```
## Syncing a Folder
Create a dataset and sync the `data_samples` folder from the repo to ClearML

View File

@ -0,0 +1,94 @@
---
title: Data Management with Python
---
The [dataset_creation.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/dataset_creation.py) and
[data_ingestion.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/data_ingestion.py)
together demonstrate how to use ClearML's [`Dataset`](../../references/sdk/dataset.md) class to create a dataset and
subsequently ingest the data.
## Dataset Creation
The [dataset_creation.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/dataset_creation.py) script
demonstrates how to do the following:
* Create a dataset and add files to it
* Upload the dataset to the ClearML Server
* Finalize the dataset
### Downloading the Data
We first need to obtain a local copy of the CIFAR dataset.
```python
from clearml import StorageManager
manager = StorageManager()
dataset_path = manager.get_local_copy(remote_url="https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz")
```
This script downloads the data and `dataset_path` contains the path to the downloaded data.
### Creating the Dataset
```python
from clearml import Dataset
dataset = Dataset.create(dataset_name="cifar_dataset", dataset_project="dataset examples")
```
This creates a data processing task called `cifar_dataset` in the `dataset examples` project, which
can be viewed in the WebApp.
### Adding Files
```python
dataset.add_files(path=dataset_path)
```
This adds the downloaded files to the current dataset.
### Uploading the Files
```python
dataset.upload()
```
This uploads the dataset to the ClearML Server by default. The dataset's destination can be changed by specifying the
target storage with the `output_url` parameter of the [`upload`](../../references/sdk/dataset.md#upload) method.
### Finalizing the Dataset
Run the [`finalize`](../../references/sdk/dataset.md#finalize) command to close the dataset and set the dataset task's
status to *completed*. The dataset can only be finalized if it doesn't have any pending uploads.
```python
dataset.finalize()
```
After a dataset has been closed, it can no longer be modified. This ensures future reproducibility.
The information about the dataset, including a list of files and their sizes, can be viewed
in the WebApp, in the dataset task's **ARTIFACTS** tab.
![image](../../img/examples_data_management_cifar_dataset.png)
## Data Ingestion
Now that we have a new dataset registered, we can consume it!
The [data_ingestion.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/data_ingestion.py) script
demonstrates data ingestion using the dataset created in the first script.
```python
from clearml import Dataset

dataset_name = "cifar_dataset"
dataset_project = "dataset_examples"

dataset_path = Dataset.get(dataset_name=dataset_name, dataset_project=dataset_project).get_local_copy()
```
The script above gets the dataset and uses the [`Dataset.get_local_copy`](../../references/sdk/dataset.md#get_local_copy)
method to return a path to the cached, read-only local dataset. If you need a modifiable copy of the dataset,
use `Dataset.get(dataset_name=dataset_name, dataset_project=dataset_project).get_mutable_local_copy('path/to/download')`.
The script then creates a neural network to train a model to classify images from the dataset that was
created above.

View File

@ -1,14 +1,17 @@
---
title: Data Management Example
title: Data Management from CLI
---
In this example we'll create a simple dataset and demonstrate basic actions on it.
In this example we'll create a simple dataset and demonstrate basic actions on it, using the `clearml-data` CLI.
## Prerequisites
First, make sure that you have cloned the [clearml](https://github.com/allegroai/clearml) repository. This contains all
1. First, make sure that you have cloned the [clearml](https://github.com/allegroai/clearml) repository. This contains all
the needed files.
1. Open terminal and change directory to the cloned repository's examples folder
`cd clearml/examples/reporting`
```
cd clearml/examples/reporting
```
## Creating Initial Dataset
@ -27,7 +30,7 @@ the needed files.
```
1. Now let's add a folder. File addition is recursive, so it's enough to point at the folder
to captures all files and subfolders:
to capture all files and sub-folders:
```bash
clearml-data add --files data_samples
@ -42,11 +45,13 @@ to captures all files and subfolders:
Hash generation completed
5 files added
```
:::note
After creating a dataset, we don't have to specify its ID when running commands, such as *add*, *remove* or *list*
:::
1. Close the dataset - this command uploads the files. By default, the files are uploaded to the file server, but
3. Close the dataset - this command uploads the files. By default, the files are uploaded to the file server, but
this can be configured with the `--storage` flag to any of ClearML's supported storage mediums (see [storage](../../integrations/storage.md)).
The command also finalizes the dataset, making it immutable and ready to be consumed.

View File

@ -0,0 +1,12 @@
---
title: Workflows
---
Take a look at the ClearML Data examples which demonstrate common workflows using the `clearml-data` CLI and the
`Dataset` class:
* [Dataset Management with CLI](data_man_simple.md) - Tutorial for creating, modifying, and consuming a dataset with the CLI.
* [Folder Sync with CLI](data_man_folder_sync.md) - Tutorial for using the `clearml-data sync` CLI option to update a dataset according
to changes in a local folder.
* [Dataset Management with CLI and SDK](data_man_cifar_classification.md) - Tutorial for creating a dataset with the CLI,
then programmatically ingesting the data with the SDK.
* [Data Management with Python](data_man_python.md) - Example scripts for creating and consuming a dataset with the SDK.

View File

@ -29,9 +29,9 @@ Once we have a Task in ClearML, we can clone and edit its definitions in the UI,
- Once there are two or more experiments that run after another, group them together into a [pipeline](../../fundamentals/pipelines.md).
## Manage Your Data
Use [ClearML Data](../../clearml_data.md) to version your data, then link it to running experiments for easy reproduction.
Make datasets machine agnostic (i.e. store original dataset in a shared storage location, e.g. shared-folder/S3/Gs/Azure).
ClearML Data supports efficient Dataset storage and caching, differentiable & compressed.
Use [ClearML Data](../../clearml_data/clearml_data.md) to version your data, then link it to running experiments for easy reproduction.
Make datasets machine agnostic (i.e. store the original dataset in a shared storage location, e.g. shared-folder/S3/GS/Azure).
ClearML Data supports efficient dataset storage and caching, differentiable & compressed.
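A minimal sketch of this flow using the SDK (the project, dataset, and folder names here are illustrative):

```python
from clearml import Dataset

# Create a new dataset version and upload it to shared storage
dataset = Dataset.create(dataset_name="my_dataset", dataset_project="data_versioning")
dataset.add_files(path="./data_folder")
dataset.upload()    # defaults to the file server; pass output_url= for S3/GS/Azure
dataset.finalize()  # freeze the version so experiments can rely on it
```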
## Scale Your Work
Use [ClearML Agent](../../clearml_agent.md) to scale work. Install the agent on machines (remote or local) and manage

View File

@ -87,7 +87,7 @@ Task.enqueue(cloned_task, queue_name='default')
## Logging Artifacts
Artifacts are a great way to pass and reuse data between Tasks in the system.
From anywhere in the code you can upload [multiple](../../fundamentals/artifacts.md#logging-artifacts) types of data, objects, and files.
Artifacts are the base of ClearML's [Data Management](../../clearml_data.md) solution and as a way to communicate complex objects between different
Artifacts are the base of ClearML's [Data Management](../../clearml_data/clearml_data.md) solution and a way to communicate complex objects between different
stages of a [pipeline](../../fundamentals/pipelines.md).
```python
@ -139,7 +139,7 @@ You can also search and query Tasks in the system.
Use the `Task.get_tasks` call to retrieve Task objects and filter based on the specific values of the Task - status, parameters, metrics, and more!
```python
from clearml import Task
tasks = Task.get_tasks(project_name='examples', task_name='partial_name_match', task_filter={'status': 'in_proress'})
tasks = Task.get_tasks(project_name='examples', task_name='partial_name_match', task_filter={'status': 'in_progress'})
```
## Manage Your Data
@ -147,7 +147,7 @@ Data is probably one of the biggest factors that determines the success of a pro
Associating the data a model used with the model's configuration, code, and results (such as accuracy) is key to deriving meaningful insights into how
models behave.
[ClearML Data](../../clearml_data.md) allows you to version your data so it's never lost, fetch it from every machine with minimal code changes,
[ClearML Data](../../clearml_data/clearml_data.md) allows you to version your data so it's never lost, fetch it from every machine with minimal code changes,
and associate data with experiment results.
Logging data can be done via the command line or via code. If any preprocessing code is involved, ClearML logs it as well! Once data is logged, it can be used by other experiments.
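As a rough sketch (the dataset and project names are placeholders), fetching logged data from code and recording which version an experiment used could look like this:

```python
from clearml import Dataset, Task

task = Task.init(project_name="examples", task_name="train_with_versioned_data")

# Get a cached local copy of the dataset
dataset = Dataset.get(dataset_name="my_dataset", dataset_project="data_versioning")
data_path = dataset.get_local_copy()

# Record which data version this experiment used
task.connect({"dataset_id": dataset.id})
```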

View File

@ -1,302 +0,0 @@
---
title: Dataset Management Using CIFAR10
---
In this tutorial, we are going to use a CIFAR example, manage the CIFAR dataset with `clearml-data`, and then replace our
current dataset read method with one that interfaces with `clearml-data`.
## Creating the Dataset
### Downloading the Data
Before we can register the CIFAR dataset with `clearml-data` we need to obtain a local copy of it.
Execute this Python script to download the data:
```python
from clearml import StorageManager
# We're using the StorageManager to download the data for us!
# It's a neat little utility that helps us download
# files we need and cache them :)
manager = StorageManager()
dataset_path = manager.get_local_copy(remote_url="https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz")
# make sure to copy the printed value
print("COPY THIS DATASET PATH: {}".format(dataset_path))
```
Expected response:
```bash
COPY THIS DATASET PATH: ~/.clearml/cache/storage_manager/global/f2751d3a22ccb78db0e07874912b5c43.cifar-10-python_artifacts_archive_None
```
The script prints the path to the downloaded data. It'll be needed later on.
### Creating the Dataset
To create the dataset, execute the following command in a CLI:
```
clearml-data create --project cifar --name cifar_dataset
```
Expected response:
```
clearml-data - Dataset Management & Versioning CLI
Creating a new dataset:
New dataset created id=*********
```
Where \*\*\*\*\*\*\*\*\* is the dataset ID.
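At this point you can inspect the dataset's contents from the CLI; for example (the placeholder below stands in for the ID printed above):

```bash
clearml-data list --id <dataset_id>
```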
## Adding Files
Add the files we just downloaded to the dataset:
```
clearml-data add --files <dataset_path>
```
where `dataset_path` is the path that was printed earlier, which denotes the location of the downloaded dataset.
:::note
There's no need to specify a *dataset_id*, as the *clearml-data* session stores it.
:::
## Finalizing the Dataset
Run the close command to upload the files (they'll be uploaded to the file server by default):<br/>
```
clearml-data close
```
![image](../../img/examples_data_management_cifar_dataset.png)
## Using the Dataset
Now that we have a new dataset registered, we can consume it.
We take [this script](https://github.com/allegroai/clearml/blob/master/examples/frameworks/ignite/cifar_ignite.py) as a base to train on the CIFAR dataset.
We replace the file load part with ClearML's Dataset object. The Dataset's `get_local_copy()` method will return a path
to the cached, downloaded dataset.
Then we provide the path to PyTorch's dataset object.
```python
dataset_id = "ee1c35f60f384e65bc800f42f0aca5ec"

from clearml import Dataset

dataset_path = Dataset.get(dataset_id=dataset_id).get_local_copy()

trainset = datasets.CIFAR10(root=dataset_path,
                            train=True,
                            download=False,
                            transform=transform)
```
<details className="cml-expansion-panel info">
<summary className="cml-expansion-panel-summary">Full example code using dataset:</summary>
<div className="cml-expansion-panel-content">
```python
# These are the required imports
from pathlib import Path

import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from ignite.contrib.handlers import TensorboardLogger
from ignite.engine import Events, create_supervised_trainer, create_supervised_evaluator
from ignite.handlers import global_step_from_engine
from ignite.metrics import Accuracy, Loss, Recall
from ignite.utils import setup_logger
from torch.utils.tensorboard import SummaryWriter
from tqdm import tqdm

from clearml import Task, StorageManager

# Connecting ClearML with the current process,
# from here on everything is logged automatically
task = Task.init(project_name='Image Example', task_name='image classification CIFAR10')
params = {'number_of_epochs': 20, 'batch_size': 64, 'dropout': 0.25, 'base_lr': 0.001, 'momentum': 0.9, 'loss_report': 100}
params = task.connect(params)  # enabling configuration override by clearml
print(params)  # printing actual configuration (after override in remote mode)

# This is our original data retrieval code. It uses the StorageManager to download and cache our dataset.
'''
manager = StorageManager()
dataset_path = Path(manager.get_local_copy(remote_url="https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"))
'''

# Let's now modify it to use the new Dataset API. You'll need to copy the created dataset ID
# into the next variable
dataset_id = "ee1c35f60f384e65bc800f42f0aca5ec"

# The line below gets the dataset and stores it in the cache. To download the dataset regardless of whether it's in the
# cache, use Dataset.get(dataset_id).get_mutable_local_copy(<path to download>)
from clearml import Dataset

dataset_path = Dataset.get(dataset_id=dataset_id).get_local_copy()

# Dataset and Dataloader initializations
transform = transforms.Compose([transforms.ToTensor()])

trainset = datasets.CIFAR10(root=dataset_path,
                            train=True,
                            download=False,
                            transform=transform)
trainloader = torch.utils.data.DataLoader(trainset,
                                          batch_size=params.get('batch_size', 4),
                                          shuffle=True,
                                          num_workers=10)

testset = datasets.CIFAR10(root=dataset_path,
                           train=False,
                           download=False,
                           transform=transform)
testloader = torch.utils.data.DataLoader(testset,
                                         batch_size=params.get('batch_size', 4),
                                         shuffle=False,
                                         num_workers=10)

classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

tb_logger = TensorboardLogger(log_dir="cifar-output")


# Helper function to store predictions and scores using matplotlib
def predictions_gt_images_handler(engine, logger, *args, **kwargs):
    x, _ = engine.state.batch
    y_pred, y = engine.state.output

    num_x = num_y = 4
    le = num_x * num_y
    fig = plt.figure(figsize=(20, 20))
    trans = transforms.ToPILImage()
    for idx in range(le):
        preds = torch.argmax(F.softmax(y_pred[idx], dim=0))
        probs = torch.max(F.softmax(y_pred[idx], dim=0))
        ax = fig.add_subplot(num_x, num_y, idx + 1, xticks=[], yticks=[])
        ax.imshow(trans(x[idx]))
        ax.set_title("{0} {1:.1f}% (label: {2})".format(
            classes[preds],
            probs * 100,
            classes[y[idx]]),
            color=("green" if preds == y[idx] else "red")
        )
    logger.writer.add_figure('predictions vs actuals', figure=fig, global_step=engine.state.epoch)


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 3)
        self.conv2 = nn.Conv2d(6, 16, 3)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(16 * 6 * 6, 120)
        self.fc2 = nn.Linear(120, 84)
        self.dropout = nn.Dropout(p=params.get('dropout', 0.25))
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 6 * 6)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(self.dropout(x))
        return x


# Training
def run(epochs, lr, momentum, log_interval):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    net = Net().to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=lr, momentum=momentum)

    trainer = create_supervised_trainer(net, optimizer, criterion, device=device)
    trainer.logger = setup_logger("trainer")

    val_metrics = {"accuracy": Accuracy(), "loss": Loss(criterion), "recall": Recall()}
    evaluator = create_supervised_evaluator(net, metrics=val_metrics, device=device)
    evaluator.logger = setup_logger("evaluator")

    # Attach handler to plot trainer's loss every 100 iterations
    tb_logger.attach_output_handler(
        trainer,
        event_name=Events.ITERATION_COMPLETED(every=params.get('loss_report')),
        tag="training",
        output_transform=lambda loss: {"loss": loss},
    )

    # Attach handler to dump evaluator's metrics every epoch completed
    for tag, evaluator in [("training", trainer), ("validation", evaluator)]:
        tb_logger.attach_output_handler(
            evaluator,
            event_name=Events.EPOCH_COMPLETED,
            tag=tag,
            metric_names="all",
            global_step_transform=global_step_from_engine(trainer),
        )

    # Attach function to build debug images and report every epoch end
    tb_logger.attach(
        evaluator,
        log_handler=predictions_gt_images_handler,
        event_name=Events.EPOCH_COMPLETED(once=1),
    )

    desc = "ITERATION - loss: {:.2f}"
    pbar = tqdm(initial=0, leave=False, total=len(trainloader), desc=desc.format(0))

    @trainer.on(Events.ITERATION_COMPLETED(every=log_interval))
    def log_training_loss(engine):
        pbar.desc = desc.format(engine.state.output)
        pbar.update(log_interval)

    @trainer.on(Events.EPOCH_COMPLETED)
    def log_training_results(engine):
        pbar.refresh()
        evaluator.run(trainloader)
        metrics = evaluator.state.metrics
        avg_accuracy = metrics["accuracy"]
        avg_nll = metrics["loss"]
        tqdm.write(
            "Training Results - Epoch: {} Avg accuracy: {:.2f} Avg loss: {:.2f}".format(
                engine.state.epoch, avg_accuracy, avg_nll
            )
        )

    @trainer.on(Events.EPOCH_COMPLETED)
    def log_validation_results(engine):
        evaluator.run(testloader)
        metrics = evaluator.state.metrics
        avg_accuracy = metrics["accuracy"]
        avg_nll = metrics["loss"]
        tqdm.write(
            "Validation Results - Epoch: {} Avg accuracy: {:.2f} Avg loss: {:.2f}".format(
                engine.state.epoch, avg_accuracy, avg_nll
            )
        )
        pbar.n = pbar.last_print_n = 0

    @trainer.on(Events.EPOCH_COMPLETED | Events.COMPLETED)
    def log_time():
        tqdm.write(
            "{} took {} seconds".format(trainer.last_event_name.name, trainer.state.times[trainer.last_event_name.name])
        )

    trainer.run(trainloader, max_epochs=epochs)
    pbar.close()

    PATH = './cifar_net.pth'
    torch.save(net.state_dict(), PATH)

    print('Finished Training')
    print('Task ID number is: {}'.format(task.id))


run(params.get('number_of_epochs'), params.get('base_lr'), params.get('momentum'), 10)
```
</div></details>
<br/><br/>
That's it! All you need to do now is run the full script.
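For instance, assuming you saved the full example above as `cifar_dataset_example.py` (a placeholder filename), simply run:

```bash
python cifar_dataset_example.py
```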

BIN docs/img/dataset_data.png (new image, 110 KiB; binary file not shown)
BIN (new image, 107 KiB; binary file not shown)
BIN (new image, 96 KiB; binary file not shown)

View File

@ -17,7 +17,8 @@ module.exports = {
'fundamentals/hpo', 'fundamentals/pipelines']},
'clearml_sdk',
'clearml_agent',
'clearml_data',
{'ClearML Data': ['clearml_data/clearml_data', 'clearml_data/clearml_data_cli', 'clearml_data/clearml_data_sdk', 'clearml_data/best_practices',
{'Workflows': ['clearml_data/data_management_examples/workflows', 'clearml_data/data_management_examples/data_man_simple', 'clearml_data/data_management_examples/data_man_folder_sync', 'clearml_data/data_management_examples/data_man_cifar_classification', 'clearml_data/data_management_examples/data_man_python']},]},
{'Applications': ['apps/clearml_session', 'apps/clearml_task']},
{'Integrations': ['integrations/libraries', 'integrations/storage']},
@ -59,8 +60,7 @@ module.exports = {
'guides/guidemain',
{'Advanced': ['guides/advanced/execute_remotely', 'guides/advanced/multiple_tasks_single_process']},
{'Automation': ['guides/automation/manual_random_param_search_example', 'guides/automation/task_piping']},
{'Data Management': ['guides/data management/data_man_simple', 'guides/data management/data_man_folder_sync', 'guides/data management/data_man_cifar_classification']},
{'ClearML Task': ['guides/clearml-task/clearml_task_tutorial']},
{'Clearml Task': ['guides/clearml-task/clearml_task_tutorial']},
{'Distributed': ['guides/distributed/distributed_pytorch_example', 'guides/distributed/subprocess_example']},
{'Docker': ['guides/docker/extra_docker_shell_script']},
{'Frameworks': [