Refactor ClearML Data docs (#108)
This commit is contained in:
parent
43751dc64b
commit
e155c49cfd
@ -1,361 +0,0 @@
|
||||
---
|
||||
title: ClearML Data
|
||||
---
|
||||
|
||||
In Machine Learning, you are very likely dealing with a gargantuan amount of data that you need to put in a dataset,
|
||||
which you then need to be able to share, reproduce, and track.
|
||||
|
||||
ClearML Data Management solves two important challenges:
|
||||
- Accessibility - Making data easily accessible from every machine,
|
||||
- Versioning - Linking data and experiments for better **traceability**.
|
||||
|
||||
**We believe Data is not code**. It should not be stored in a git tree, because progress on datasets is not always linear.
|
||||
Moreover, it can be difficult and inefficient to find on a git tree the commit associated with a certain version of a dataset.
|
||||
|
||||
A `clearml-data` dataset is a collection of files, stored on a central storage location (S3 / GS / Azure / Network Storage).
|
||||
Datasets can be set up to inherit from other datasets, so data lineages can be created,
|
||||
and users can track when and how their data changes.
|
||||
|
||||
Dataset changes are stored using differentiable storage, meaning a version will store the change-set from its previous dataset parents.
|
||||
|
||||
Local copies of datasets are always cached, so the same data never needs to be downloaded twice.
|
||||
When a dataset is pulled it will automatically pull all parent datasets and merge them into one output folder for you to work with.
|
||||
|
||||
ClearML Data offers two interfaces:
|
||||
- `clearml-data` - CLI utility for creating, uploading, and managing datasets.
|
||||
- `clearml.Dataset` - A python interface for creating, retrieving, managing, and using datasets.
|
||||
|
||||
## Setup
|
||||
|
||||
`clearml-data` comes built-in with our `clearml` python package! Just check out the [getting started](getting_started/ds/ds_first_steps.md) guide for more info!
|
||||
|
||||
## Workflow
|
||||
Below is an example of a workflow that uses ClearML Data's command line tool to create a dataset, and then integrates the dataset into code
|
||||
using ClearML Data's python interface.
|
||||
|
||||
### Creating a Dataset
|
||||
|
||||
Using the `clearml-data` CLI, users can create datasets using the following commands:
|
||||
```bash
|
||||
clearml-data create --project dataset_example --name initial_version
|
||||
clearml-data add --files data_folder
|
||||
clearml-data close
|
||||
```
|
||||
|
||||
The commands will do the following:
|
||||
|
||||
1. Start a Data Processing Task called "initial_version" in the "dataset_example" project
|
||||
|
||||
1. The CLI will return a unique ID for the dataset
|
||||
|
||||
1. All the files from the "data_folder" folder will be added to the dataset and uploaded
|
||||
by default to the [ClearML server](deploying_clearml/clearml_server.md).
|
||||
|
||||
1. The dataset will be finalized, making it immutable and ready to be consumed.
|
||||
|
||||
:::note
|
||||
`clearml-data` is stateful and remembers the last created dataset so there's no need to specify a specific dataset ID unless
|
||||
we want to work on another dataset.
|
||||
:::
|
||||
|
||||
### Using a Dataset
|
||||
Now in our python code, we can access and use the created dataset from anywhere:
|
||||
|
||||
```python
|
||||
from clearml import Dataset
|
||||
|
||||
local_path = Dataset.get(dataset_id='dataset_id_from_previous_command').get_local_copy()
|
||||
```
|
||||
|
||||
We have all our files in the same folder structure under `local_path`. It is that simple!<br/>
|
||||
|
||||
The next step is to set the dataset_id as a parameter for our code and voilà! We can now train on any dataset we have in
|
||||
the system.
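For example, a minimal sketch of parameterizing the dataset ID (the argument name and its handling here are illustrative, not part of the CLI output above):

```python
import argparse

from clearml import Dataset

parser = argparse.ArgumentParser()
# Illustrative parameter name; pass the ID returned by `clearml-data create`
parser.add_argument("--dataset-id", type=str, required=True)
args = parser.parse_args()

# Retrieve a cached, read-only local copy of the requested dataset version
local_path = Dataset.get(dataset_id=args.dataset_id).get_local_copy()
```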
|
||||
|
||||
## CLI Options
|
||||
|
||||
It's possible to manage datasets (create / modify / upload / delete) with the `clearml-data` command line tool.
|
||||
|
||||
### Creating a Dataset
|
||||
```bash
|
||||
clearml-data create --project <project_name> --name <dataset_name> --parents <existing_dataset_id>
|
||||
```
|
||||
Creates a new dataset. <br/>
|
||||
|
||||
**Parameters**
|
||||
|
||||
|Name|Description|Optional|
|
||||
|---|---|---|
|
||||
|name |Dataset's name| <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|
||||
|project|Dataset's project| <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|
||||
|parents|IDs of the dataset's parents. The dataset inherits all of its parents' content. Multiple parents can be entered, but they are merged in the order they were entered| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|
||||
|tags |Dataset user tags. The dataset can be labeled, which can be useful for organizing datasets| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|
||||
:::important
|
||||
clearml-data works in a stateful mode so once a new dataset is created, the following commands
|
||||
do not require the `--id` flag.
|
||||
:::
|
||||
|
||||
<br/>
|
||||
|
||||
### Add Files
|
||||
```bash
|
||||
clearml-data add --id <dataset_id> --files <filenames/folders_to_add>
|
||||
```
|
||||
It's possible to add individual files or complete folders.<br/>
|
||||
|
||||
**Parameters**
|
||||
|
||||
|Name|Description|Optional|
|
||||
|---|---|---|
|
||||
|id | Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|
||||
|files|Files / folders to add. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json` | <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|
||||
|dataset-folder | Dataset base folder to add the files to in the dataset. Default: dataset root| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|
||||
|non-recursive | Disable recursive scan of files | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|
||||
|verbose | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|
||||
<br/>
|
||||
|
||||
### Remove Files
|
||||
```bash
|
||||
clearml-data remove --id <dataset_id_to_remove_from> --files <filenames/folders_to_remove>
|
||||
```
|
||||
|
||||
**Parameters**
|
||||
|
||||
|Name|Description|Optional|
|
||||
|---|---|---|
|
||||
|id | Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|
||||
|files | Files / folders to remove (wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`). Notice: file path is the path within the dataset, not the local path.| <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|
||||
|non-recursive | Disable recursive scan of files | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|
||||
|verbose | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|
||||
<br/>
|
||||
|
||||
### Upload Dataset Content
|
||||
```bash
|
||||
clearml-data upload [--id <dataset_id>] [--storage <upload_destination>]
|
||||
```
|
||||
Uploads added files to [ClearML Server](deploying_clearml/clearml_server.md) by default. It's possible to specify a different storage
|
||||
medium by entering an upload destination, such as `s3://bucket`, `gs://`, `azure://`, `/mnt/shared/`.
|
||||
|
||||
|
||||
**Parameters**
|
||||
|
||||
|Name|Description|Optional|
|
||||
|---|---|---|
|
||||
|id| Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|
||||
|storage| Remote storage to use for the dataset files. Default: files_server | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|
||||
|verbose | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|
||||
<br/>
|
||||
|
||||
### Finalize Dataset
|
||||
```bash
|
||||
clearml-data close --id <dataset_id>
|
||||
```
|
||||
Finalizes the dataset and makes it ready to be consumed.
|
||||
It automatically uploads all files that were not previously uploaded.
|
||||
Once a dataset is finalized, it can no longer be modified.
|
||||
|
||||
**Parameters**
|
||||
|
||||
|Name|Description|Optional|
|
||||
|---|---|---|
|
||||
|id| Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|
||||
|storage| Remote storage to use for the dataset files. Default: files_server | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|
||||
|disable-upload | Disable automatic upload when closing the dataset | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|
||||
|verbose | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|
||||
<br/>
|
||||
|
||||
### Sync Local Folder
|
||||
```bash
clearml-data sync [--id <dataset_id>] --folder <folder_location> [--parents '<parent_id>']
```
|
||||
This option syncs a folder's content with ClearML. It is useful in case a user has a single point of truth (i.e. a folder) which
|
||||
updates from time to time.
|
||||
|
||||
|
||||
When an update should be reflected in ClearML, users can call `clearml-data sync`, create a new dataset, and specify the folder;
the changes (file additions, modifications, and removals) will be reflected in ClearML.
|
||||
|
||||
This command also uploads the data and finalizes the dataset automatically.
|
||||
|
||||
**Parameters**
|
||||
|
||||
|Name|Description|Optional|
|
||||
|---|---|---|
|
||||
|id| Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|
||||
|folder|Local folder to sync. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|
||||
|storage|Remote storage to use for the dataset files. Default: files_server |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|parents|IDs of the dataset's parents (i.e. merge all parents). All modifications made to the folder since the parents were synced will be reflected in the dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|project|If creating a new dataset, specify the dataset's project name|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|name|If creating a new dataset, specify the dataset's name|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|tags|Dataset user tags|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|skip-close|Do not auto close dataset after syncing folders|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|verbose | Verbose reporting |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|
||||
<br/>
|
||||
|
||||
### List Dataset Content
|
||||
```bash
|
||||
clearml-data list [--id <dataset_id>]
|
||||
```
|
||||
|
||||
**Parameters**
|
||||
|
||||
|Name|Description|Optional|
|
||||
|---|---|---|
|
||||
|id|Dataset ID whose contents will be shown (alternatively, use project / name combination). Default: previously accessed dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|project|Specify dataset project name (if used instead of ID, dataset name is also required)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|name|Specify dataset name (if used instead of ID, dataset project is also required)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|filter|Filter files based on folder / wildcard. Multiple filters are supported. Example: `folder/date_*.json folder/sub-folder`|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|modified|Only list file changes (add / remove / modify) introduced in this version|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|
||||
<br/>
|
||||
|
||||
### Delete Dataset
|
||||
```
|
||||
clearml-data delete [--id <dataset_id_to_delete>]
|
||||
```
|
||||
Deletes an entire dataset from ClearML. This can also be used to delete a newly created dataset.
|
||||
|
||||
This does not work on datasets with children.
|
||||
|
||||
**Parameters**
|
||||
|
||||
|Name|Description|Optional|
|
||||
|---|---|---|
|
||||
|id|ID of dataset to be deleted. Default: previously created / accessed dataset that hasn't been finalized yet|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|force|Force dataset deletion even if other dataset versions depend on it|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|
||||
<br/>
|
||||
|
||||
### Search for a Dataset
|
||||
```
|
||||
clearml-data search [--name <name>] [--project <project_name>] [--tags <tag>]
|
||||
```
|
||||
Lists all datasets in the system that match the search request.
|
||||
|
||||
Datasets can be searched by project, name, ID, and tags.
|
||||
|
||||
**Parameters**
|
||||
|
||||
|Name|Description|Optional|
|
||||
|---|---|---|
|
||||
|ids|A list of dataset IDs|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />|
|
||||
|project|The project name of the datasets|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />|
|
||||
|name|A dataset name or a partial name to filter datasets by|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />|
|
||||
|tags|A list of dataset user tags|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />|
|
||||
|
||||
<br/>
|
||||
|
||||
### Compare Two Datasets
|
||||
|
||||
```
|
||||
clearml-data compare [--source SOURCE] [--target TARGET]
|
||||
```
|
||||
|
||||
Compare two datasets (target vs. source). The command returns a comparison summary that looks like this:
|
||||
|
||||
```
|
||||
Comparison summary: 4 files removed, 3 files modified, 0 files added
|
||||
```
|
||||
|
||||
**Parameters**
|
||||
|
||||
|Name|Description|Optional|
|
||||
|---|---|---|
|
||||
|source|Source dataset id (used as baseline)|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|
||||
|target|Target dataset id (compare against the source baseline dataset)|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|
||||
|verbose|Verbose report all file changes (instead of summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|
||||
### Merge Datasets
|
||||
|
||||
```
|
||||
clearml-data squash --name NAME --ids [IDS [IDS ...]]
|
||||
```
|
||||
|
||||
Squash (merge) multiple datasets into a single dataset version.
|
||||
|
||||
**Parameters**
|
||||
|
||||
|Name|Description|Optional|
|
||||
|---|---|---|
|
||||
|name|Create squashed dataset name|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|
||||
|ids|Source dataset IDs to squash (merge down)|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|
||||
|storage|Remote storage to use for the dataset files. Default: files_server |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|verbose|Verbose report all file changes (instead of summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|
||||
### Verify Dataset
|
||||
|
||||
```
|
||||
clearml-data verify [--id ID] [--folder FOLDER]
|
||||
```
|
||||
|
||||
Verify that the dataset content matches the data from the local source.
|
||||
|
||||
**Parameters**
|
||||
|
||||
|Name|Description|Optional|
|
||||
|---|---|---|
|
||||
|id|Specify dataset ID. Default: previously created/accessed dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|folder|Specify dataset local copy (if not provided the local cache folder will be verified)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|filesize| If True, only verify file size and skip hash checks (default: false)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|verbose|Verbose report all file changes (instead of summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|
||||
### Get a Dataset
|
||||
|
||||
```
|
||||
clearml-data get [--id ID] [--copy COPY] [--link LINK] [--overwrite]
|
||||
```
|
||||
|
||||
Get a local copy of a dataset. By default, you get a read-only cached folder, but you can get a mutable copy by using the
|
||||
`--copy` flag.
|
||||
|
||||
**Parameters**
|
||||
|
||||
|Name|Description|Optional|
|
||||
|---|---|---|
|
||||
|id| Specify dataset ID. Default: previously created / accessed dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|copy| Get a writable copy of the dataset to a specific output folder|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|link| Create a soft link (not supported on Windows) to a read-only cached folder containing the dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|overwrite| If True, overwrite the target folder|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|verbose| Verbose report all file changes (instead of summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|
||||
### Publish a Dataset
|
||||
|
||||
```
|
||||
clearml-data publish --id ID
|
||||
```
|
||||
|
||||
Publish the dataset for public use. The dataset must be [finalized](#finalize-dataset) before it is published.
|
||||
|
||||
|
||||
**Parameters**
|
||||
|
||||
|Name|Description|Optional|
|
||||
|---|---|---|
|
||||
|id| The dataset task id to be published.|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|
||||
|
||||
|
||||
|
||||
## Python API
|
||||
|
||||
It's also possible to manage a dataset using ClearML Data's python interface.
|
||||
|
||||
All API commands should be imported with:
|
||||
|
||||
```python
|
||||
from clearml import Dataset
|
||||
```
|
||||
|
||||
See all API commands in the [Dataset](references/sdk/dataset.md) reference page.
|
||||
|
||||
## Tutorials
|
||||
|
||||
Take a look at the ClearML Data tutorials:
|
||||
* [Dataset Management with CLI and SDK](guides/data%20management/data_man_cifar_classification)
|
||||
* [Dataset Management with CLI](guides/data%20management/data_man_simple)
|
||||
* [Folder Sync with CLI](guides/data%20management/data_man_folder_sync)
|
46
docs/clearml_data/best_practices.md
Normal file
46
docs/clearml_data/best_practices.md
Normal file
@ -0,0 +1,46 @@
|
||||
---
|
||||
title: Best Practices
|
||||
---
|
||||
|
||||
The following are some recommendations for using ClearML Data.
|
||||
|
||||
## Versioning Datasets
|
||||
|
||||
Use ClearML Data to version your datasets. Once a dataset is finalized, it can no longer be modified. This makes clear
|
||||
which version of the dataset was used with which task, enabling the accurate reproduction of your experiments.
|
||||
|
||||
Once you need to change the dataset's contents, you can create a new version of the dataset by specifying the previous
|
||||
dataset as a parent. This makes the new dataset version inherit the previous version's contents, with the dataset's new
|
||||
version contents ready to be updated.
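For example, a minimal sketch of creating a new version that inherits from an existing, finalized dataset (the names and the parent ID are placeholders):

```python
from clearml import Dataset

# Create a new version that inherits the contents of the previous (finalized) version
new_version = Dataset.create(
    dataset_name="dataset name",
    dataset_project="dataset project",
    parent_datasets=["<previous_version_id>"]  # placeholder parent ID
)

# Update the new version's contents, then upload and finalize it
new_version.add_files(path="path/to/updated_files")
new_version.upload()
new_version.finalize()
```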
|
||||
|
||||
## Organize Datasets for Easier Access
|
||||
|
||||
Organize the datasets according to use-cases and use tags. This makes managing multiple datasets and
|
||||
accessing the most updated datasets for different use-cases easier.
|
||||
|
||||
Like any other ClearML task, datasets can be organized into [projects (and sub-projects)](../fundamentals/projects.md#creating-sub-projects).
|
||||
Additionally, when creating a dataset, tags can be applied to the dataset, which will make searching for the dataset easier.
|
||||
|
||||
Organizing your datasets into projects by use-case makes it easier to access the most recent dataset version for that use-case.
|
||||
If only a project is specified when using [`Dataset.get`](../references/sdk/dataset.md#datasetget), the method returns the
|
||||
most recent dataset in a project. The same is true with tags; if a tag is specified, the method will return the most recent dataset that is labeled with that tag.
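A minimal sketch of retrieving the most recent dataset by project and by tag (the project name, the tag, and the `dataset_tags` parameter name are assumptions used for illustration):

```python
from clearml import Dataset

# Most recent dataset in a project (project name is a placeholder)
latest_in_project = Dataset.get(dataset_project="dataset project")

# Most recent dataset labeled with a given tag (parameter name assumed here)
latest_with_tag = Dataset.get(dataset_project="dataset project", dataset_tags=["training"])
```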
|
||||
|
||||
## Document your Datasets
|
||||
|
||||
Attach informative metrics or debug samples to the Dataset itself. Use the [`get_logger`](../references/sdk/dataset.md#get_logger)
|
||||
method to access the dataset's logger object, then add any additional information to the dataset, using the methods
|
||||
available with a [logger](../references/sdk/logger.md) object.
|
||||
|
||||
You can add some dataset summaries (like [table reporting](../references/sdk/logger.md#report_table)) to create a preview
|
||||
of the data stored for better visibility, or attach any statistics generated by the data ingestion process.
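For example, a minimal sketch of attaching a preview table to a dataset, assuming a small pandas DataFrame summarizing the data (the dataset ID and the table contents are placeholders):

```python
import pandas as pd
from clearml import Dataset

dataset = Dataset.get(dataset_id="<dataset_id>")  # placeholder ID

# Illustrative summary of the dataset contents
preview_df = pd.DataFrame({"file": ["a.jpg", "b.jpg"], "label": ["cat", "dog"]})

# Attach the table to the dataset via its logger
dataset.get_logger().report_table(
    title="Dataset Preview",
    series="sample",
    iteration=0,
    table_plot=preview_df
)
```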
|
||||
|
||||
|
||||
## Periodically Update Your Dataset
|
||||
|
||||
Your data probably changes from time to time. If the data is updated into the same local / network folder structure, which
|
||||
serves as a dataset's single point of truth, you can schedule a script that uses the dataset `sync` functionality, which
will update the dataset based on the modifications made to the folder. This way, there is no need to manually modify a dataset.
|
||||
This functionality will also track the modifications made to a folder.
|
||||
|
||||
See the sync function with the [CLI](clearml_data_cli.md#syncing-local-storage) or [SDK](clearml_data_sdk.md#syncing-local-storage)
|
||||
interface.
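For example, a scheduled job could run something like the following (the project, dataset name, and folder path are placeholders):

```bash
# Run periodically (e.g. from a cron job) to reflect folder changes in a new dataset version
clearml-data sync --project dataset_examples --name my_dataset --folder /mnt/data/my_dataset
```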
|
92
docs/clearml_data/clearml_data.md
Normal file
92
docs/clearml_data/clearml_data.md
Normal file
@ -0,0 +1,92 @@
|
||||
---
|
||||
title: Introduction
|
||||
---
|
||||
|
||||
In Machine Learning, you are very likely dealing with a gargantuan amount of data that you need to put in a dataset,
|
||||
which you then need to be able to share, reproduce, and track.
|
||||
|
||||
ClearML Data Management solves two important challenges:
|
||||
- Accessibility - Making data easily accessible from every machine,
|
||||
- Versioning - Linking data and experiments for better **traceability**.
|
||||
|
||||
**We believe Data is not code**. It should not be stored in a git tree, because progress on datasets is not always linear.
|
||||
Moreover, it can be difficult and inefficient to find on a git tree the commit associated with a certain version of a dataset.
|
||||
|
||||
A `clearml-data` dataset is a collection of files, stored on a central storage location (S3 / GS / Azure / Network Storage).
|
||||
Datasets can be set up to inherit from other datasets, so data lineages can be created,
|
||||
and users can track when and how their data changes.
|
||||
|
||||
Dataset changes are stored using differentiable storage, meaning a version will store the change-set from its previous dataset parents.
|
||||
|
||||
Local copies of datasets are always cached, so the same data never needs to be downloaded twice.
|
||||
When a dataset is pulled it will automatically pull all parent datasets and merge them into one output folder for you to work with.
|
||||
|
||||
## Setup
|
||||
|
||||
`clearml-data` comes built-in with the `clearml` python package! Just check out the [Getting Started](../getting_started/ds/ds_first_steps.md)
|
||||
guide for more info!
|
||||
|
||||
## Using ClearML Data
|
||||
|
||||
ClearML Data offers two interfaces:
|
||||
- `clearml-data` - A CLI utility for creating, uploading, and managing datasets. See [CLI](clearml_data_cli.md) for a reference of `clearml-data` commands.
|
||||
- `clearml.Dataset` - A python interface for creating, retrieving, managing, and using datasets. See [SDK](clearml_data_sdk.md) for an overview of the basic methods of the `Dataset` module.
|
||||
|
||||
For an overview of our recommendations for ClearML Data workflows and practices, see [Best Practices](best_practices.md).
|
||||
|
||||
## WebApp
|
||||
|
||||
ClearML's WebApp provides a visual interface to your datasets through dataset tasks. Dataset tasks are categorized
|
||||
as a data-processing [task type](../fundamentals/task.md#task-types), and they are labeled with a `DATASET` system tag.
|
||||
|
||||
The full log of the dataset creation process (SDK calls / CLI commands) can be found in a dataset's **EXECUTION** section.
|
||||
|
||||
A listing of the dataset's differential snapshot (a summary of files added / modified / removed, and the location / size / hash of each file in the snapshot) is available in the **ARTIFACTS** section. Download the dataset
|
||||
by clicking <img src="/docs/latest/icons/ico-download-json.svg" alt="Download" className="icon size-sm space-sm" />,
|
||||
next to the **FILE PATH**.
|
||||
|
||||
The full dataset listing (all files included) is available in the **CONFIGURATION** section under **Dataset Content**.
|
||||
This allows you to quickly compare two dataset contents and visually see the difference.
|
||||
The dataset genealogy DAG and the change-set summary table are visualized in **RESULTS > PLOTS**.
|
||||
|
||||
|
||||
<details className="cml-expansion-panel screenshot">
|
||||
<summary className="cml-expansion-panel-summary">Dataset Contents</summary>
|
||||
<div className="cml-expansion-panel-content">
|
||||
|
||||

|
||||
|
||||
</div>
|
||||
</details>
|
||||
|
||||
<br/>
|
||||
|
||||
View a DAG of the dataset dependencies (all previous dataset versions and their parents) in the dataset's page **> ARTIFACTS > state**.
|
||||
|
||||
<details className="cml-expansion-panel screenshot">
|
||||
<summary className="cml-expansion-panel-summary">Data Dependency DAG</summary>
|
||||
<div className="cml-expansion-panel-content">
|
||||
|
||||

|
||||
|
||||
</div>
|
||||
</details>
|
||||
|
||||
|
||||
Once a dataset has been finalized, view its genealogy in the dataset's
|
||||
page **>** **RESULTS** **>** **PLOTS**
|
||||
|
||||
<details className="cml-expansion-panel screenshot">
|
||||
<summary className="cml-expansion-panel-summary">Dataset Genealogy</summary>
|
||||
<div className="cml-expansion-panel-content">
|
||||
|
||||

|
||||
|
||||
</div>
|
||||
</details>
|
||||
|
||||
|
||||
|
||||
|
||||
|
328
docs/clearml_data/clearml_data_cli.md
Normal file
328
docs/clearml_data/clearml_data_cli.md
Normal file
@ -0,0 +1,328 @@
|
||||
---
|
||||
title: CLI
|
||||
---
|
||||
|
||||
The `clearml-data` utility is a CLI tool for controlling and managing your data with ClearML.
|
||||
|
||||
The following page provides a reference to `clearml-data`'s CLI commands.
|
||||
|
||||
### Creating a Dataset
|
||||
```bash
|
||||
clearml-data create --project <project_name> --name <dataset_name> --parents <existing_dataset_id>
|
||||
```
|
||||
Creates a new dataset. <br/>
|
||||
|
||||
**Parameters**
|
||||
|
||||
<div className="tbl-cmd">
|
||||
|
||||
|Name|Description|Optional|
|
||||
|---|---|---|
|
||||
|`--name` |Dataset's name| <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|`--project`|Dataset's project| <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|
||||
|`--parents`|IDs of the dataset's parents. The dataset inherits all of its parents' content. Multiple parents can be entered, but they are merged in the order they were entered| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|
||||
|`--tags` |Dataset user tags. The dataset can be labeled, which can be useful for organizing datasets| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|
||||
</div>
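For example (the project name, dataset name, and tag are placeholders):

```bash
clearml-data create --project dataset_examples --name initial_version --tags images
```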
|
||||
|
||||
|
||||
|
||||
:::important
|
||||
clearml-data works in a stateful mode so once a new dataset is created, the following commands
|
||||
do not require the `--id` flag.
|
||||
:::
|
||||
|
||||
<br/>
|
||||
|
||||
### Adding Files
|
||||
```bash
|
||||
clearml-data add --id <dataset_id> --files <filenames/folders_to_add>
|
||||
```
|
||||
It's possible to add individual files or complete folders.<br/>
|
||||
|
||||
**Parameters**
|
||||
|
||||
<div className="tbl-cmd">
|
||||
|
||||
|Name|Description|Optional|
|
||||
|---|---|---|
|
||||
|`--id` | Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|
||||
|`--files`|Files / folders to add. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json` | <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|
||||
|`--dataset-folder` | Dataset base folder to add the files to in the dataset. Default: dataset root| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|
||||
|`--non-recursive` | Disable recursive scan of files | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|
||||
|`--verbose` | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|
||||
</div>
|
||||
|
||||
<br/>
|
||||
|
||||
### Removing Files
|
||||
```bash
|
||||
clearml-data remove --id <dataset_id_to_remove_from> --files <filenames/folders_to_remove>
|
||||
```
|
||||
|
||||
**Parameters**
|
||||
|
||||
<div className="tbl-cmd">
|
||||
|
||||
|Name|Description|Optional|
|
||||
|---|---|---|
|
||||
|`--id` | Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|
||||
|`--files` | Files / folders to remove (wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`). Notice: file path is the path within the dataset, not the local path.| <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|
||||
|`--non-recursive` | Disable recursive scan of files | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|
||||
|`--verbose` | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|
||||
</div>
|
||||
|
||||
<br/>
|
||||
|
||||
### Uploading Dataset Content
|
||||
```bash
|
||||
clearml-data upload [--id <dataset_id>] [--storage <upload_destination>]
|
||||
```
|
||||
Uploads added files to [ClearML Server](../deploying_clearml/clearml_server.md) by default. It's possible to specify a different storage
|
||||
medium by entering an upload destination, such as `s3://bucket`, `gs://`, `azure://`, `/mnt/shared/`.
|
||||
|
||||
|
||||
**Parameters**
|
||||
|
||||
<div className="tbl-cmd">
|
||||
|
||||
|Name|Description|Optional|
|
||||
|---|---|---|
|
||||
|`--id`| Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|
||||
|`--storage`| Remote storage to use for the dataset files. Default: files_server | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|
||||
|`--verbose` | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|
||||
</div>
|
||||
|
||||
<br/>
|
||||
|
||||
### Finalizing a Dataset
|
||||
```bash
|
||||
clearml-data close --id <dataset_id>
|
||||
```
|
||||
Finalizes the dataset and makes it ready to be consumed.
|
||||
It automatically uploads all files that were not previously uploaded.
|
||||
Once a dataset is finalized, it can no longer be modified.
|
||||
|
||||
**Parameters**
|
||||
|
||||
<div className="tbl-cmd">
|
||||
|
||||
|Name|Description|Optional|
|
||||
|---|---|---|
|
||||
|`--id`| Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|
||||
|`--storage`| Remote storage to use for the dataset files. Default: files_server | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|
||||
|`--disable-upload` | Disable automatic upload when closing the dataset | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|
||||
|`--verbose` | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|
||||
</div>
|
||||
|
||||
<br/>
|
||||
|
||||
### Syncing Local Storage
|
||||
```bash
clearml-data sync [--id <dataset_id>] --folder <folder_location> [--parents '<parent_id>']
```
|
||||
This option syncs a folder's content with ClearML. It is useful in case a user has a single point of truth (i.e. a folder) which
|
||||
updates from time to time.
|
||||
|
||||
|
||||
When an update should be reflected in ClearML, users can call `clearml-data sync`, create a new dataset, and specify the folder;
the changes (file additions, modifications, and removals) will be reflected in ClearML.
|
||||
|
||||
This command also uploads the data and finalizes the dataset automatically.
|
||||
|
||||
**Parameters**
|
||||
|
||||
<div className="tbl-cmd">
|
||||
|
||||
|Name|Description|Optional|
|
||||
|---|---|---|
|
||||
|`--id`| Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|
||||
|`--folder`|Local folder to sync. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|
||||
|`--storage`|Remote storage to use for the dataset files. Default: files_server |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|`--parents`|IDs of the dataset's parents (i.e. merge all parents). All modifications made to the folder since the parents were synced will be reflected in the dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|`--project`|If creating a new dataset, specify the dataset's project name|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|`--name`|If creating a new dataset, specify the dataset's name|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|`--tags`|Dataset user tags|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|`--skip-close`|Do not auto close dataset after syncing folders|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|`--verbose` | Verbose reporting |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|
||||
</div>
|
||||
|
||||
<br/>
|
||||
|
||||
### Listing Dataset Content
|
||||
```bash
|
||||
clearml-data list [--id <dataset_id>]
|
||||
```
|
||||
|
||||
**Parameters**
|
||||
|
||||
<div className="tbl-cmd">
|
||||
|
||||
|Name|Description|Optional|
|
||||
|---|---|---|
|
||||
|`--id`|Dataset ID whose contents will be shown (alternatively, use project / name combination). Default: previously accessed dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|`--project`|Specify dataset project name (if used instead of ID, dataset name is also required)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|`--name`|Specify dataset name (if used instead of ID, dataset project is also required)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|`--filter`|Filter files based on folder / wildcard. Multiple filters are supported. Example: `folder/date_*.json folder/sub-folder`|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|`--modified`|Only list file changes (add / remove / modify) introduced in this version|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|
||||
</div>
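For example, listing only the JSON files changed in the current version (the filter pattern is illustrative):

```bash
clearml-data list --id <dataset_id> --filter folder/*.json --modified
```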
|
||||
|
||||
<br/>
|
||||
|
||||
### Deleting a Dataset
|
||||
```
|
||||
clearml-data delete [--id <dataset_id_to_delete>]
|
||||
```
|
||||
Deletes an entire dataset from ClearML. This can also be used to delete a newly created dataset.
|
||||
|
||||
This does not work on datasets with children.
|
||||
|
||||
**Parameters**
|
||||
|
||||
<div className="tbl-cmd">
|
||||
|
||||
|Name|Description|Optional|
|
||||
|---|---|---|
|
||||
|`--id`|ID of dataset to be deleted. Default: previously created / accessed dataset that hasn't been finalized yet|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|`--force`|Force dataset deletion even if other dataset versions depend on it|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|
||||
</div>
|
||||
|
||||
<br/>
|
||||
|
||||
### Searching for a Dataset
|
||||
```
|
||||
clearml-data search [--name <name>] [--project <project_name>] [--tags <tag>]
|
||||
```
|
||||
Lists all datasets in the system that match the search request.
|
||||
|
||||
Datasets can be searched by project, name, ID, and tags.
|
||||
|
||||
**Parameters**
|
||||
|
||||
<div className="tbl-cmd">
|
||||
|
||||
|Name|Description|Optional|
|
||||
|---|---|---|
|
||||
|`--ids`|A list of dataset IDs|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />|
|
||||
|`--project`|The project name of the datasets|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />|
|
||||
|`--name`|A dataset name or a partial name to filter datasets by|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />|
|
||||
|`--tags`|A list of dataset user tags|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />|
|
||||
|
||||
</div>
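For example (the project name and tag are placeholders):

```bash
clearml-data search --project dataset_examples --tags training
```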
|
||||
|
||||
<br/>
|
||||
|
||||
### Comparing Two Datasets
|
||||
|
||||
```
|
||||
clearml-data compare [--source SOURCE] [--target TARGET]
|
||||
```
|
||||
|
||||
Compare two datasets (target vs. source). The command returns a comparison summary that looks like this:
|
||||
|
||||
```
|
||||
Comparison summary: 4 files removed, 3 files modified, 0 files added
|
||||
```
|
||||
|
||||
**Parameters**
|
||||
|
||||
<div className="tbl-cmd">
|
||||
|
||||
|Name|Description|Optional|
|
||||
|---|---|---|
|
||||
|`--source`|Source dataset id (used as baseline)|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|
||||
|`--target`|Target dataset id (compare against the source baseline dataset)|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|
||||
|`--verbose`|Verbose report all file changes (instead of summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|
||||
</div>
|
||||
|
||||
### Merging Datasets
|
||||
|
||||
```
|
||||
clearml-data squash --name NAME --ids [IDS [IDS ...]]
|
||||
```
|
||||
|
||||
Squash (merge) multiple datasets into a single dataset version.
|
||||
|
||||
**Parameters**
|
||||
|
||||
<div className="tbl-cmd">
|
||||
|
||||
|Name|Description|Optional|
|
||||
|---|---|---|
|
||||
|`--name`|Create squashed dataset name|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|
||||
|`--ids`|Source dataset IDs to squash (merge down)|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|
||||
|`--storage`|Remote storage to use for the dataset files. Default: files_server |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|`--verbose`|Verbose report all file changes (instead of summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|
||||
</div>
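For example, squashing two versions into a single new dataset (the name and IDs are placeholders):

```bash
clearml-data squash --name "squashed dataset" --ids <dataset_1_id> <dataset_2_id>
```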
|
||||
|
||||
### Verifying a Dataset
|
||||
|
||||
```
|
||||
clearml-data verify [--id ID] [--folder FOLDER]
|
||||
```
|
||||
|
||||
Verify that the dataset content matches the data from the local source.
|
||||
|
||||
**Parameters**
|
||||
|
||||
<div className="tbl-cmd">
|
||||
|
||||
|Name|Description|Optional|
|
||||
|---|---|---|
|
||||
|`--id`|Specify dataset ID. Default: previously created/accessed dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|`--folder`|Specify dataset local copy (if not provided the local cache folder will be verified)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|`--filesize`| If True, only verify file size and skip hash checks (default: false)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|`--verbose`|Verbose report all file changes (instead of summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|
||||
</div>
|
||||
|
||||
### Getting a Dataset
|
||||
|
||||
```
|
||||
clearml-data get [--id ID] [--copy COPY] [--link LINK] [--overwrite]
|
||||
```
|
||||
|
||||
Get a local copy of a dataset. By default, you get a read-only cached folder, but you can get a mutable copy by using the
|
||||
`--copy` flag.
|
||||
|
||||
**Parameters**
|
||||
|
||||
<div className="tbl-cmd">
|
||||
|
||||
|Name|Description|Optional|
|
||||
|---|---|---|
|
||||
|`--id`| Specify dataset ID. Default: previously created / accessed dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|`--copy`| Get a writable copy of the dataset to a specific output folder|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|`--link`| Create a soft link (not supported on Windows) to a read-only cached folder containing the dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|`--overwrite`| If True, overwrite the target folder|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|`--verbose`| Verbose report all file changes (instead of summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|
||||
|
||||
</div>
|
||||
|
||||
### Publishing a Dataset
|
||||
|
||||
```
|
||||
clearml-data publish --id ID
|
||||
```
|
||||
|
||||
Publish the dataset for public use. The dataset must be [finalized](#finalizing-a-dataset) before it is published.
|
||||
|
||||
|
||||
**Parameters**
|
||||
|
||||
<div className="tbl-cmd">
|
||||
|
||||
|Name|Description|Optional|
|
||||
|---|---|---|
|
||||
|`--id`| The dataset task id to be published.|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|
||||
|
||||
</div>
|
148
docs/clearml_data/clearml_data_sdk.md
Normal file
148
docs/clearml_data/clearml_data_sdk.md
Normal file
@ -0,0 +1,148 @@
|
||||
---
|
||||
title: SDK
|
||||
---
|
||||
|
||||
|
||||
Datasets can be created, modified, and managed with ClearML Data's python interface. The following page provides an overview
|
||||
for using the most basic methods of the `Dataset` class. See the [Dataset reference page](../references/sdk/dataset.md)
|
||||
for a complete list of available methods.
|
||||
|
||||
Import the `Dataset` class, and let's get started!
|
||||
|
||||
```python
|
||||
from clearml import Dataset
|
||||
```
|
||||
|
||||
## Creating Datasets
|
||||
|
||||
ClearML Data supports multiple ways to create datasets programmatically, which provides for a variety of use-cases:
|
||||
* [`Dataset.create()`](#datasetcreate) - Create a new dataset. Parent datasets can be specified, from which the new dataset
|
||||
will inherit its data
|
||||
* [`Dataset.squash()`](#datasetsquash) - Generate a new dataset by squashing together a set of related datasets
|
||||
|
||||
### Dataset.create()
|
||||
|
||||
Use the [`Dataset.create`](../references/sdk/dataset.md#datasetcreate) class method to create a dataset.
|
||||
|
||||
Creating datasets programmatically is especially helpful when preprocessing the data so that the
|
||||
preprocessing code and the resulting dataset are saved in a single task (see `use_current_task` parameter in [`Dataset.create`](../references/sdk/dataset.md#datasetcreate)).
|
||||
|
||||
```python
|
||||
# Preprocessing code here
|
||||
dataset = Dataset.create(
|
||||
dataset_name='dataset name',
|
||||
dataset_project='dataset project',
|
||||
parent_datasets=[PARENT_DS_ID_1, PARENT_DS_ID_2]
|
||||
)
|
||||
```
|
||||
|
||||
The created dataset inherits the content of the `parent_datasets`. When multiple dataset parents are listed,
|
||||
they are merged in order of specification. Each parent overrides any overlapping files from a previous parent dataset.
|
||||
|
||||
### Dataset.squash()
|
||||
|
||||
To improve deep dataset DAG storage and speed, dataset squashing was introduced. The [`Dataset.squash`](../references/sdk/dataset.md#datasetsquash)
|
||||
class method generates a new dataset by squashing a set of dataset versions, and merging down all changes introduced in
|
||||
their lineage DAG, creating a new, flat, independent version.
|
||||
|
||||
The datasets being squashed into a single dataset can be specified by their IDs or by project & name pairs.
|
||||
|
||||
```python
|
||||
# option 1 - list dataset IDs
|
||||
squashed_dataset_1 = Dataset.squash(
|
||||
dataset_name='squashed dataset\'s name',
|
||||
dataset_ids=[DS1_ID, DS2_ID, DS3_ID]
|
||||
)
|
||||
|
||||
# option 2 - list project and dataset pairs
|
||||
squashed_dataset_2 = Dataset.squash(
|
||||
dataset_name='squashed dataset 2',
|
||||
dataset_project_name_pairs=[('dataset1 project', 'dataset1 name'),
|
||||
('dataset2 project', 'dataset2 name')]
|
||||
)
|
||||
```
|
||||
|
||||
In addition, the target storage location for the squashed dataset can be specified using the `output_url` parameter of the
|
||||
[`Dataset.squash`](../references/sdk/dataset.md#datasetsquash) method.
|
||||
|
||||
## Accessing Datasets
|
||||
Once a dataset has been created and uploaded to a server, the dataset can be accessed programmatically from anywhere.
|
||||
|
||||
Use the [`Dataset.get`](../references/sdk/dataset.md#datasetget) class method to access a specific Dataset object, either
|
||||
with the dataset's ID or with its project and name. If only a project name or tag is provided, the method returns the
|
||||
most recent dataset in the specified project, or the most recent dataset with the specified tag.
|
||||
|
||||
Once a specific dataset object has been obtained, get a local copy of the dataset using one of the following options:
|
||||
* [`Dataset.get_local_copy()`](../references/sdk/dataset.md#get_local_copy) - get a read-only local copy of an entire dataset.
|
||||
This method returns a path to the dataset in local cache (downloading the dataset if it is not already in cache)
|
||||
* [`Dataset.get_mutable_local_copy()`](../references/sdk/dataset.md#get_mutable_local_copy) - get a writable local copy
|
||||
of an entire dataset. This method downloads the dataset to a specific folder (non-cached), specified with the `target_folder` parameter. If
|
||||
the specified folder already has contents, specify whether to overwrite its contents with the dataset contents, using the `overwrite` parameter.
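Putting it together, a minimal sketch (the dataset project / name and target folder are placeholders):

```python
from clearml import Dataset

# Retrieve the most recent dataset with this project / name combination
dataset = Dataset.get(dataset_name="dataset name", dataset_project="dataset project")

# Read-only copy from the local cache (downloaded if not already cached)
cached_path = dataset.get_local_copy()

# Writable copy in a specific folder, overwriting any existing contents
mutable_path = dataset.get_mutable_local_copy(target_folder="./dataset_copy", overwrite=True)
```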
|
||||
|
||||
## Modifying Datasets
|
||||
|
||||
Once a dataset has been created, its contents can be modified and replaced. When your data changes, you can
|
||||
add updated files or remove unnecessary files.
|
||||
|
||||
### add_files()
|
||||
|
||||
To add files or folders into the current dataset, use the [`Dataset.add_files`](../references/sdk/dataset.md#add_files)
|
||||
method. If a file is already in a dataset, but it has been modified, it can be added again, and ClearML will
|
||||
upload the file diff.
|
||||
|
||||
```python
|
||||
dataset = Dataset.create()
|
||||
dataset.add_files(path="path/to/folder_or_file")
|
||||
```
|
||||
|
||||
There is an option to add a set of files based on wildcard matching of a single string or a list of strings, using the
|
||||
`wildcard` parameter. Specify whether to match the wildcard files recursively using the `recursive` parameter.
|
||||
|
||||
For example:
|
||||
|
||||
```python
|
||||
dataset.add_files(
|
||||
path="path/to/folder",
|
||||
wildcard="~/data/*.jpg",
|
||||
recursive=True
|
||||
)
|
||||
```
|
||||
|
||||
### remove_files()
|
||||
To remove files from a current dataset, use the [`Dataset.remove_files`](../references/sdk/dataset.md#remove_files) method.
|
||||
Input the path to the folder or file to be removed in the `dataset_path` parameter. The path is relative to the dataset.
|
||||
|
||||
There is also an option to input a wildcard into `dataset_path` in order to remove a set of files matching the wildcard.
|
||||
Set the `recursive` parameter to `True` in order to match all wildcard files recursively.
|
||||
|
||||
For example:
|
||||
|
||||
```python
|
||||
dataset.remove_files(dataset_path="*.csv", recursive=True)
|
||||
```
|
||||
|
||||
## Uploading Files
|
||||
|
||||
To upload the dataset files to network storage, use the [`Dataset.upload`](../references/sdk/dataset.md#upload) method.
|
||||
Use the `output_url` parameter to specify storage target, such as S3 / GS / Azure (e.g. `s3://bucket/data`, `gs://bucket/data`, `azure://bucket/data` , `/mnt/share/data`).
|
||||
By default, the dataset uploads to ClearML's file server.
|
||||
|
||||
Dataset files must be uploaded before a dataset is [finalized](#finalizing-a-dataset).
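For example, continuing the `dataset` object from the examples above (the bucket path is a placeholder):

```python
# Upload to a specific storage target instead of the default ClearML file server
dataset.upload(output_url="s3://bucket/data")
```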
|
||||
|
||||
|
||||
## Finalizing a Dataset
|
||||
|
||||
Use the [`Dataset.finalize`](../references/sdk/dataset.md#finalize) method to close the current dataset. This marks the
|
||||
dataset task as *Completed*, at which point, the dataset can no longer be modified.
|
||||
|
||||
Before closing a dataset, its files must first be [uploaded](#uploading-files).
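Continuing the same `dataset` object:

```python
# After all files have been uploaded, close the dataset; it can no longer be modified
dataset.finalize()
```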
|
||||
|
||||
|
||||
## Syncing Local Storage
|
||||
|
||||
Use the [`Dataset.sync_folder`](../references/sdk/dataset.md#sync_folder) method in order to update a dataset according
|
||||
to a specific folder's content changes. Specify the folder to sync with the `local_path` parameter (the method includes all files within the folder, recursively).
|
||||
|
||||
This method is useful in the case where there's a single point of truth, either a local or network folder, that gets updated periodically.
|
||||
The folder changes will be reflected in a new dataset version. This method saves time since you don't have to manually
|
||||
update (add / remove) files in a dataset.
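A minimal sketch, assuming a new dataset version is created from a parent and then synced against a local folder (names, IDs, and paths are placeholders):

```python
from clearml import Dataset

# New version inheriting from the previous one (placeholder parent ID)
dataset = Dataset.create(
    dataset_name="dataset name",
    dataset_project="dataset project",
    parent_datasets=["<previous_version_id>"]
)

# Reflect the folder's additions / modifications / removals in the new version
dataset.sync_folder(local_path="/path/to/single_point_of_truth")

dataset.upload()
dataset.finalize()
```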
|
@ -0,0 +1,98 @@
|
||||
---
|
||||
title: Dataset Management with CLI and SDK
|
||||
---
|
||||
|
||||
In this tutorial, we are going to manage the CIFAR dataset with `clearml-data` CLI, and then use ClearML's [`Dataset`](../../references/sdk/dataset.md)
|
||||
class to ingest the data.
|
||||
|
||||
## Creating the Dataset
|
||||
|
||||
### Downloading the Data
|
||||
Before we can register the CIFAR dataset with `clearml-data`, we need to obtain a local copy of it.
|
||||
|
||||
Execute this Python script to download the data:
|
||||
```python
|
||||
from clearml import StorageManager
|
||||
|
||||
manager = StorageManager()
|
||||
dataset_path = manager.get_local_copy(
|
||||
remote_url="https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"
|
||||
)
|
||||
# make sure to copy the printed value
|
||||
print("COPY THIS DATASET PATH: {}".format(dataset_path))
|
||||
```
|
||||
|
||||
Expected response:
|
||||
```bash
|
||||
COPY THIS DATASET PATH: ~/.clearml/cache/storage_manager/global/f2751d3a22ccb78db0e07874912b5c43.cifar-10-python_artifacts_archive_None
|
||||
```
|
||||
The script prints the path to the downloaded data. It will be needed later on.
|
||||
|
||||
### Creating the Dataset
|
||||
To create the dataset, execute the following command:
|
||||
```
|
||||
clearml-data create --project dataset_examples --name cifar_dataset
|
||||
```
|
||||
|
||||
Expected response:
|
||||
```
|
||||
clearml-data - Dataset Management & Versioning CLI
|
||||
Creating a new dataset:
|
||||
New dataset created id=ee1c35f60f384e65bc800f42f0aca5ec
|
||||
```
|
||||
Where `ee1c35f60f384e65bc800f42f0aca5ec` is the dataset ID.
|
||||
|
||||
## Adding Files
|
||||
Add the files we just downloaded to the dataset:
|
||||
|
||||
```
|
||||
clearml-data add --files <dataset_path>
|
||||
```
|
||||
|
||||
where `dataset_path` is the path that was printed earlier, which denotes the location of the downloaded dataset.
|
||||
|
||||
:::note
|
||||
There's no need to specify a `dataset_id`, since the `clearml-data` session stores it.
|
||||
:::
|
||||
|
||||
## Finalizing the Dataset
|
||||
Run the [`close`](../../references/sdk/dataset.md#close) command to upload the files (they'll be uploaded to the ClearML Server by default):<br/>
|
||||
|
||||
```
|
||||
clearml-data close
|
||||
```
|
||||
|
||||
This command sets the dataset task's status to *completed*, so it will no longer be modifiable. This ensures future
|
||||
reproducibility.
|
||||
|
||||
The information about the dataset, including a list of files and their sizes, can be viewed
|
||||
in the WebApp, in the dataset task's **ARTIFACTS** tab.
|
||||
|
||||

|
||||
|
||||
## Using the Dataset
|
||||
|
||||
Now that we have a new dataset registered, we can consume it.
|
||||
|
||||
The [data_ingestion.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/data_ingestion.py) example
|
||||
script demonstrates using the dataset within Python code.
|
||||
|
||||
```python
from clearml import Dataset
# torchvision provides the CIFAR10 dataset wrapper and the transforms used below
from torchvision import datasets, transforms

dataset_name = "cifar_dataset"
dataset_project = "dataset_examples"

# Get a local (cached) copy of the registered dataset
dataset_path = Dataset.get(dataset_name=dataset_name, dataset_project=dataset_project).get_local_copy()

# Minimal transform; see the example script for the exact preprocessing used
transform = transforms.Compose([transforms.ToTensor()])

trainset = datasets.CIFAR10(
    root=dataset_path,
    train=True,
    download=False,
    transform=transform
)
```
|
||||
The Dataset's [`get_local_copy`](../../references/sdk/dataset.md#get_local_copy) method will return a path to the cached,
|
||||
downloaded dataset. We then provide the path to PyTorch's dataset object.

The script then trains a neural network to classify images using the dataset created above.

@ -1,20 +1,24 @@
---
title: Folder Sync
title: Folder Sync with CLI
---

This example shows how to use the *clearml-data* folder sync function.
This example shows how to use the `clearml-data` folder sync function.

*clearml-data* folder sync mode is useful for cases when users have a single point of truth (i.e. a folder) that updates
`clearml-data` folder sync mode is useful for cases when users have a single point of truth (i.e. a folder) that updates
from time to time. When the point of truth is updated, users can call `clearml-data sync` and the
changes (file addition, modification, or removal) will be reflected in ClearML.

## Creating Initial Version

## Prerequisites
First, make sure that you have cloned the [clearml](https://github.com/allegroai/clearml) repository. This contains all
1. First, make sure that you have cloned the [clearml](https://github.com/allegroai/clearml) repository. This contains all
the needed files.

1. Open a terminal and change directory to the cloned repository's examples folder:
`cd clearml/examples/reporting`

```
cd clearml/examples/reporting
```

## Syncing a Folder
Create a dataset and sync the `data_samples` folder from the repo to ClearML, for example with the command sketched below.
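
A sketch of such a sync command (the `--project`, `--name`, and `--folder` flags are assumed to be the standard `clearml-data sync` options; adjust the names to your setup):

```
clearml-data sync --project dataset_examples --name sync_folder --folder data_samples
```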

@ -0,0 +1,94 @@
---
title: Data Management with Python
---

The [dataset_creation.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/dataset_creation.py) and
[data_ingestion.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/data_ingestion.py) scripts
together demonstrate how to use ClearML's [`Dataset`](../../references/sdk/dataset.md) class to create a dataset and
subsequently ingest the data.

## Dataset Creation

The [dataset_creation.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/dataset_creation.py) script
demonstrates how to do the following:
* Create a dataset and add files to it
* Upload the dataset to the ClearML Server
* Finalize the dataset

### Downloading the Data

We first need to obtain a local copy of the CIFAR dataset.

```python
from clearml import StorageManager

manager = StorageManager()
dataset_path = manager.get_local_copy(remote_url="https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz")
```

This script downloads the data, and `dataset_path` contains the path to the downloaded data.

### Creating the Dataset

```python
from clearml import Dataset

dataset = Dataset.create(dataset_name="cifar_dataset", dataset_project="dataset examples")
```

This creates a data processing task called `cifar_dataset` in the `dataset examples` project, which
can be viewed in the WebApp.

### Adding Files

```python
dataset.add_files(path=dataset_path)
```

This adds the downloaded files to the current dataset.

### Uploading the Files

```python
dataset.upload()
```

This uploads the dataset to the ClearML Server by default. The dataset's destination can be changed by specifying the
target storage with the `output_url` parameter of the [`upload`](../../references/sdk/dataset.md#upload) method.
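
For example, a minimal sketch of uploading to object storage instead of the ClearML Server (the bucket URL is hypothetical):

```python
dataset.upload(output_url="s3://my-bucket/datasets")
```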

### Finalizing the Dataset

Run the [`finalize`](../../references/sdk/dataset.md#finalize) command to close the dataset and set the dataset task's
status to *completed*. The dataset can only be finalized if it doesn't have any pending uploads.

```python
dataset.finalize()
```

After a dataset has been closed, it can no longer be modified. This ensures future reproducibility.

The information about the dataset, including a list of files and their sizes, can be viewed
in the WebApp, in the dataset task's **ARTIFACTS** tab.

![image](../../img/examples_data_management_cifar_dataset.png)

## Data Ingestion

Now that we have a new dataset registered, we can consume it!

The [data_ingestion.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/data_ingestion.py) script
demonstrates data ingestion using the dataset created in the first script.

```python
from clearml import Dataset

dataset_name = "cifar_dataset"
dataset_project = "dataset_examples"

dataset_path = Dataset.get(dataset_name=dataset_name, dataset_project=dataset_project).get_local_copy()
```

The script above gets the dataset and uses the [`Dataset.get_local_copy`](../../references/sdk/dataset.md#get_local_copy)
method to return a path to the cached, read-only local dataset. If you need a modifiable copy of the dataset,
use `Dataset.get(dataset_name, dataset_project).get_mutable_local_copy(path/to/download)`.
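
A minimal sketch of that call (the target folder name is illustrative):

```python
from clearml import Dataset

# download a writable copy of the dataset into the given folder
mutable_path = Dataset.get(
    dataset_name="cifar_dataset", dataset_project="dataset_examples"
).get_mutable_local_copy("./cifar_mutable_copy")
```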

The script then creates a neural network to train a model to classify images from the dataset that was
created above.

@ -1,14 +1,17 @@
---
title: Data Management Example
title: Data Management from CLI
---

In this example we'll create a simple dataset and demonstrate basic actions on it.
In this example we'll create a simple dataset and demonstrate basic actions on it, using the `clearml-data` CLI.

## Prerequisites
First, make sure that you have cloned the [clearml](https://github.com/allegroai/clearml) repository. This contains all
1. First, make sure that you have cloned the [clearml](https://github.com/allegroai/clearml) repository. This contains all
the needed files.
1. Open a terminal and change directory to the cloned repository's examples folder:
`cd clearml/examples/reporting`

```
cd clearml/examples/reporting
```

## Creating Initial Dataset

@ -27,7 +30,7 @@ the needed files.
```

1. Now let's add a folder. File addition is recursive, so it's enough to point at the folder
to capture all files and subfolders:
to capture all files and sub-folders:

```bash
clearml-data add --files data_samples
@ -42,11 +45,13 @@ to captures all files and subfolders:
Hash generation completed
5 files added
```

:::note
After creating a dataset, we don't have to specify its ID when running commands such as *add*, *remove*, or *list*.
:::

1. Close the dataset - this command uploads the files. By default, the files are uploaded to the file server, but
3. Close the dataset - this command uploads the files. By default, the files are uploaded to the file server, but
this can be configured with the `--storage` flag to any of ClearML's supported storage mediums (see [storage](../../integrations/storage.md)).
The command also finalizes the dataset, making it immutable and ready to be consumed.
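
For example, a sketch of closing to object storage instead of the file server (the bucket URL is hypothetical; `--storage` is the flag described above):

```
clearml-data close --storage s3://my-bucket/datasets
```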

docs/clearml_data/data_management_examples/workflows.md (new file, 12 lines)
@ -0,0 +1,12 @@
---
title: Workflows
---

Take a look at the ClearML Data examples, which demonstrate common workflows using the `clearml-data` CLI and the
`Dataset` class:
* [Dataset Management with CLI](data_man_simple.md) - Tutorial for creating, modifying, and consuming a dataset with the CLI
* [Folder Sync with CLI](data_man_folder_sync.md) - Tutorial for using the `clearml-data sync` CLI option to update a dataset according
to a local folder.
* [Dataset Management with CLI and SDK](data_man_cifar_classification.md) - Tutorial for creating a dataset with the CLI,
then programmatically ingesting the data with the SDK.
* [Data Management with Python](data_man_python.md) - Example scripts for creating and consuming a dataset with the SDK.
@ -29,9 +29,9 @@ Once we have a Task in ClearML, we can clone and edit its definitions in the UI,
- Once there are two or more experiments that run after another, group them together into a [pipeline](../../fundamentals/pipelines.md).

## Manage Your Data
Use [ClearML Data](../../clearml_data.md) to version your data, then link it to running experiments for easy reproduction.
Make datasets machine agnostic (i.e. store original dataset in a shared storage location, e.g. shared-folder/S3/Gs/Azure).
ClearML Data supports efficient Dataset storage and caching, differentiable & compressed.
Use [ClearML Data](../../clearml_data/clearml_data.md) to version your data, then link it to running experiments for easy reproduction.
Make datasets machine agnostic (i.e. store original dataset in a shared storage location, e.g. shared-folder/S3/Gs/Azure)
ClearML Data supports efficient Dataset storage and caching, differentiable & compressed

## Scale Your Work
Use [ClearML Agent](../../clearml_agent.md) to scale work. Install the agent machines (Remote or local) and manage

@ -87,7 +87,7 @@ Task.enqueue(cloned_task, queue_name='default')
## Logging Artifacts
Artifacts are a great way to pass and reuse data between Tasks in the system.
From anywhere in the code you can upload [multiple](../../fundamentals/artifacts.md#logging-artifacts) types of data, objects, and files.
Artifacts are the base of ClearML's [Data Management](../../clearml_data.md) solution and a way to communicate complex objects between different
Artifacts are the base of ClearML's [Data Management](../../clearml_data/clearml_data.md) solution and a way to communicate complex objects between different
stages of a [pipeline](../../fundamentals/pipelines.md).

```python
@ -139,7 +139,7 @@ You can also search and query Tasks in the system.
Use the `Task.get_tasks` call to retrieve Task objects and filter based on the specific values of the Task - status, parameters, metrics, and more!
```python
from clearml import Task
tasks = Task.get_tasks(project_name='examples', task_name='partial_name_match', task_filter={'status': 'in_proress'})
tasks = Task.get_tasks(project_name='examples', task_name='partial_name_match', task_filter={'status': 'in_progress'})
```

## Manage Your Data
@ -147,7 +147,7 @@ Data is probably one of the biggest factors that determines the success of a pro
Associating the data a model used with the model's configuration, code, and results (such as accuracy) is key to deducing meaningful insights into how
models behave.

[ClearML Data](../../clearml_data.md) allows you to version your data so it's never lost, fetch it from every machine with minimal code changes,
[ClearML Data](../../clearml_data/clearml_data.md) allows you to version your data so it's never lost, fetch it from every machine with minimal code changes,
and associate data to experiment results.

Logging data can be done via command line or via code. If any preprocessing code is involved, ClearML logs it as well! Once data is logged, it can be used by other experiments.
@ -1,302 +0,0 @@
---
title: Dataset Management Using CIFAR10
---

In this tutorial, we are going to use a CIFAR example, manage the CIFAR dataset with `clearml-data`, and then replace our
current dataset read method with one that interfaces with `clearml-data`.

## Creating the Dataset

### Downloading the Data
Before we can register the CIFAR dataset with `clearml-data`, we need to obtain a local copy of it.

Execute this Python script to download the data:
```python
from clearml import StorageManager
# We're using the StorageManager to download the data for us!
# It's a neat little utility that helps us download
# files we need and cache them :)

manager = StorageManager()
dataset_path = manager.get_local_copy(remote_url="https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz")
# make sure to copy the printed value
print("COPY THIS DATASET PATH: {}".format(dataset_path))
```

Expected response:
```bash
COPY THIS DATASET PATH: ~/.clearml/cache/storage_manager/global/f2751d3a22ccb78db0e07874912b5c43.cifar-10-python_artifacts_archive_None
```
The script prints the path to the downloaded data. It'll be needed later on.

### Creating the Dataset
To create the dataset, in a CLI, execute:
```
clearml-data create --project cifar --name cifar_dataset
```

Expected response:
```
clearml-data - Dataset Management & Versioning CLI
Creating a new dataset:
New dataset created id=*********
```
Where \*\*\*\*\*\*\*\*\* is the dataset ID.

## Adding Files
Add the files we just downloaded to the dataset:
```
clearml-data add --files <dataset_path>
```

where `dataset_path` is the path that was printed earlier, which denotes the location of the downloaded dataset.

:::note
There's no need to specify a *dataset_id*, as the *clearml-data* session stores it.
:::

## Finalizing the Dataset
Run the close command to upload the files (they'll be uploaded to the file server by default):<br/>
```
clearml-data close
```

![image](../../img/examples_data_management_cifar_dataset.png)

## Using the Dataset
Now that we have a new dataset registered, we can consume it.

We take [this script](https://github.com/allegroai/clearml/blob/master/examples/frameworks/ignite/cifar_ignite.py) as a base to train on the CIFAR dataset.

We replace the file load part with ClearML's Dataset object. The Dataset's `get_local_copy()` method will return a path
to the cached, downloaded dataset.
Then we provide the path to PyTorch's dataset object.

```python
dataset_id = "ee1c35f60f384e65bc800f42f0aca5ec"

from clearml import Dataset
dataset_path = Dataset.get(dataset_id=dataset_id).get_local_copy()

trainset = datasets.CIFAR10(root=dataset_path,
                            train=True,
                            download=False,
                            transform=transform)
```
<details className="cml-expansion-panel info">
<summary className="cml-expansion-panel-summary">Full example code using dataset:</summary>
<div className="cml-expansion-panel-content">

```python
# These are the obligatory imports
from pathlib import Path

import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from ignite.contrib.handlers import TensorboardLogger
from ignite.engine import Events, create_supervised_trainer, create_supervised_evaluator
from ignite.handlers import global_step_from_engine
from ignite.metrics import Accuracy, Loss, Recall
from ignite.utils import setup_logger
from torch.utils.tensorboard import SummaryWriter
from tqdm import tqdm

from clearml import Task, StorageManager

# Connecting ClearML with the current process,
# from here on everything is logged automatically
task = Task.init(project_name='Image Example', task_name='image classification CIFAR10')
params = {'number_of_epochs': 20, 'batch_size': 64, 'dropout': 0.25, 'base_lr': 0.001, 'momentum': 0.9, 'loss_report': 100}
params = task.connect(params)  # enabling configuration override by clearml
print(params)  # printing actual configuration (after override in remote mode)

# This is our original data retrieval code. It uses the StorageManager to just download and cache our dataset.
'''
manager = StorageManager()

dataset_path = Path(manager.get_local_copy(remote_url="https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"))
'''

# Let's now modify it to use the new dataset API. You'll need to copy the created dataset id
# to the next variable

dataset_id = "ee1c35f60f384e65bc800f42f0aca5ec"

# The below gets the dataset and stores it in the cache. If you want to download the dataset regardless of whether it's
# in the cache, use Dataset.get(dataset_id).get_mutable_local_copy(path_to_download)
from clearml import Dataset
dataset_path = Dataset.get(dataset_id=dataset_id).get_local_copy()

# Dataset and DataLoader initializations
transform = transforms.Compose([transforms.ToTensor()])

trainset = datasets.CIFAR10(root=dataset_path,
                            train=True,
                            download=False,
                            transform=transform)
trainloader = torch.utils.data.DataLoader(trainset,
                                          batch_size=params.get('batch_size', 4),
                                          shuffle=True,
                                          num_workers=10)

testset = datasets.CIFAR10(root=dataset_path,
                           train=False,
                           download=False,
                           transform=transform)
testloader = torch.utils.data.DataLoader(testset,
                                         batch_size=params.get('batch_size', 4),
                                         shuffle=False,
                                         num_workers=10)

classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

tb_logger = TensorboardLogger(log_dir="cifar-output")


# Helper function to store predictions and scores using matplotlib
def predictions_gt_images_handler(engine, logger, *args, **kwargs):
    x, _ = engine.state.batch
    y_pred, y = engine.state.output

    num_x = num_y = 4
    le = num_x * num_y
    fig = plt.figure(figsize=(20, 20))
    trans = transforms.ToPILImage()
    for idx in range(le):
        preds = torch.argmax(F.softmax(y_pred[idx], dim=0))
        probs = torch.max(F.softmax(y_pred[idx], dim=0))
        ax = fig.add_subplot(num_x, num_y, idx + 1, xticks=[], yticks=[])
        ax.imshow(trans(x[idx]))
        ax.set_title("{0} {1:.1f}% (label: {2})".format(
            classes[preds],
            probs * 100,
            classes[y[idx]]),
            color=("green" if preds == y[idx] else "red")
        )
    logger.writer.add_figure('predictions vs actuals', figure=fig, global_step=engine.state.epoch)


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 3)
        self.conv2 = nn.Conv2d(6, 16, 3)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(16 * 6 * 6, 120)
        self.fc2 = nn.Linear(120, 84)
        self.dropout = nn.Dropout(p=params.get('dropout', 0.25))
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 6 * 6)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(self.dropout(x))
        return x


# Training
def run(epochs, lr, momentum, log_interval):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    net = Net().to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=lr, momentum=momentum)

    trainer = create_supervised_trainer(net, optimizer, criterion, device=device)
    trainer.logger = setup_logger("trainer")

    val_metrics = {"accuracy": Accuracy(), "loss": Loss(criterion), "recall": Recall()}
    evaluator = create_supervised_evaluator(net, metrics=val_metrics, device=device)
    evaluator.logger = setup_logger("evaluator")

    # Attach handler to plot trainer's loss every 100 iterations
    tb_logger.attach_output_handler(
        trainer,
        event_name=Events.ITERATION_COMPLETED(every=params.get('loss_report')),
        tag="training",
        output_transform=lambda loss: {"loss": loss},
    )

    # Attach handler to dump evaluator's metrics every epoch completed
    for tag, evaluator in [("training", trainer), ("validation", evaluator)]:
        tb_logger.attach_output_handler(
            evaluator,
            event_name=Events.EPOCH_COMPLETED,
            tag=tag,
            metric_names="all",
            global_step_transform=global_step_from_engine(trainer),
        )

    # Attach function to build debug images and report every epoch end
    tb_logger.attach(
        evaluator,
        log_handler=predictions_gt_images_handler,
        event_name=Events.EPOCH_COMPLETED(once=1),
    )

    desc = "ITERATION - loss: {:.2f}"
    pbar = tqdm(initial=0, leave=False, total=len(trainloader), desc=desc.format(0))

    @trainer.on(Events.ITERATION_COMPLETED(every=log_interval))
    def log_training_loss(engine):
        pbar.desc = desc.format(engine.state.output)
        pbar.update(log_interval)

    @trainer.on(Events.EPOCH_COMPLETED)
    def log_training_results(engine):
        pbar.refresh()
        evaluator.run(trainloader)
        metrics = evaluator.state.metrics
        avg_accuracy = metrics["accuracy"]
        avg_nll = metrics["loss"]
        tqdm.write(
            "Training Results - Epoch: {} Avg accuracy: {:.2f} Avg loss: {:.2f}".format(
                engine.state.epoch, avg_accuracy, avg_nll
            )
        )

    @trainer.on(Events.EPOCH_COMPLETED)
    def log_validation_results(engine):
        evaluator.run(testloader)
        metrics = evaluator.state.metrics
        avg_accuracy = metrics["accuracy"]
        avg_nll = metrics["loss"]
        tqdm.write(
            "Validation Results - Epoch: {} Avg accuracy: {:.2f} Avg loss: {:.2f}".format(
                engine.state.epoch, avg_accuracy, avg_nll
            )
        )

        pbar.n = pbar.last_print_n = 0

    @trainer.on(Events.EPOCH_COMPLETED | Events.COMPLETED)
    def log_time():
        tqdm.write(
            "{} took {} seconds".format(trainer.last_event_name.name, trainer.state.times[trainer.last_event_name.name])
        )

    trainer.run(trainloader, max_epochs=epochs)
    pbar.close()

    PATH = './cifar_net.pth'
    torch.save(net.state_dict(), PATH)

    print('Finished Training')
    print('Task ID number is: {}'.format(task.id))


run(params.get('number_of_epochs'), params.get('base_lr'), params.get('momentum'), 10)
```

</div></details>

<br/><br/>
That's it! All you need to do now is run the full script.
docs/img/dataset_data.png (new binary file, not shown; 110 KiB)
docs/img/dataset_data_state.png (new binary file, not shown; 107 KiB)
docs/img/dataset_genealogy_summary.png (new binary file, not shown; 96 KiB)
@ -17,7 +17,8 @@ module.exports = {
'fundamentals/hpo', 'fundamentals/pipelines']},
'clearml_sdk',
'clearml_agent',
'clearml_data',
{'ClearML Data': ['clearml_data/clearml_data', 'clearml_data/clearml_data_cli', 'clearml_data/clearml_data_sdk', 'clearml_data/best_practices',
{'Workflows': ['clearml_data/data_management_examples/workflows', 'clearml_data/data_management_examples/data_man_simple', 'clearml_data/data_management_examples/data_man_folder_sync', 'clearml_data/data_management_examples/data_man_cifar_classification', 'clearml_data/data_management_examples/data_man_python']},]},
{'Applications': ['apps/clearml_session', 'apps/clearml_task']},
{'Integrations': ['integrations/libraries', 'integrations/storage']},

@ -59,8 +60,7 @@ module.exports = {
'guides/guidemain',
{'Advanced': ['guides/advanced/execute_remotely', 'guides/advanced/multiple_tasks_single_process']},
{'Automation': ['guides/automation/manual_random_param_search_example', 'guides/automation/task_piping']},
{'Data Management': ['guides/data management/data_man_simple', 'guides/data management/data_man_folder_sync', 'guides/data management/data_man_cifar_classification']},
{'ClearML Task': ['guides/clearml-task/clearml_task_tutorial']},
{'Clearml Task': ['guides/clearml-task/clearml_task_tutorial']},
{'Distributed': ['guides/distributed/distributed_pytorch_example', 'guides/distributed/subprocess_example']},
{'Docker': ['guides/docker/extra_docker_shell_script']},
{'Frameworks': [