mirror of https://github.com/clearml/clearml-docs
synced 2025-06-26 18:17:44 +00:00
Refactor ClearML Data docs (#108)

docs/clearml_data/best_practices.md

---
title: Best Practices
---

The following are some recommendations for using ClearML Data.

## Versioning Datasets

Use ClearML Data to version your datasets. Once a dataset is finalized, it can no longer be modified. This makes it clear
which version of the dataset was used with which task, enabling accurate reproduction of your experiments.

When you need to change the dataset's contents, create a new version of the dataset by specifying the previous
dataset as a parent. The new dataset version inherits the previous version's contents, and its contents are
ready to be updated.
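As a minimal sketch of this pattern (the project name, dataset names, and file path below are placeholders, and running it requires a configured ClearML server):

```python
from clearml import Dataset

# Fetch the finalized previous version (placeholder project / name)
parent = Dataset.get(dataset_project="Example", dataset_name="dataset v1")

# Create a new version that inherits the parent's contents
child = Dataset.create(
    dataset_name="dataset v2",
    dataset_project="Example",
    parent_datasets=[parent.id],
)

child.add_files(path="path/to/updated_files")  # update the inherited contents
child.upload()
child.finalize()  # v2 is now immutable, like its parent
```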

## Organize Datasets for Easier Access

Organize the datasets according to use cases, and use tags. This makes managing multiple datasets and
accessing the most up-to-date datasets for different use cases easier.

Like any ClearML tasks, datasets can be organized into [projects (and sub-projects)](../fundamentals/projects.md#creating-sub-projects).
Additionally, tags can be applied to a dataset when it is created, which makes searching for the dataset easier.

Organizing your datasets into projects by use case makes it easier to access the most recent dataset version for that use case.
If only a project is specified when using [`Dataset.get`](../references/sdk/dataset.md#datasetget), the method returns the
most recent dataset in that project. The same is true with tags: if a tag is specified, the method returns the most recent dataset labeled with that tag.
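For example (the project name and tag below are illustrative), fetching the most recent matching dataset might look like this:

```python
from clearml import Dataset

# Returns the most recent dataset in the "Audio Processing" project
# that carries the "denoised" tag (both values are placeholders)
dataset = Dataset.get(
    dataset_project="Audio Processing",
    dataset_tags=["denoised"],
)
print(dataset.id)
```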

## Document your Datasets

Attach informative metrics or debug samples to the dataset itself. Use the [`get_logger`](../references/sdk/dataset.md#get_logger)
method to access the dataset's logger object, then add any additional information to the dataset using the methods
available on a [logger](../references/sdk/logger.md) object.

You can add dataset summaries (such as [table reporting](../references/sdk/logger.md#report_table)) to create a preview
of the stored data for better visibility, or attach any statistics generated by the data ingestion process.
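As an illustrative sketch (the class names and counts below are made-up statistics, and `dataset` is an existing `Dataset` object):

```python
import pandas as pd

# Placeholder class-balance statistics for the preview table
df = pd.DataFrame({"class": ["cat", "dog"], "count": [1000, 980]})

# Attach the summary table to the dataset task as a preview
dataset.get_logger().report_table(
    title="Dataset Preview",
    series="Class Balance",
    iteration=0,
    table_plot=df,
)
```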

## Periodically Update Your Dataset

Your data probably changes from time to time. If the data is updated in the same local / network folder structure, which
serves as the dataset's single point of truth, you can schedule a script that uses the dataset `sync` functionality to
update the dataset based on the modifications made to the folder. This way, there is no need to manually modify a dataset.
This functionality also tracks the modifications made to a folder.

See the sync function in the [CLI](clearml_data_cli.md#syncing-local-storage) or [SDK](clearml_data_sdk.md#syncing-local-storage)
interface.
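Such a scheduled script might look like this sketch (the project name, dataset name, and folder path are placeholders):

```python
from clearml import Dataset

# New version whose parent is the latest dataset in the project
latest = Dataset.get(dataset_project="Example", dataset_name="sensor data")
new_version = Dataset.create(
    dataset_name="sensor data",
    dataset_project="Example",
    parent_datasets=[latest.id],
)

# Reflect folder additions / modifications / removals in the new version
new_version.sync_folder(local_path="/mnt/shared/sensor_data")
new_version.upload()
new_version.finalize()
```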

docs/clearml_data/clearml_data.md

---
title: Introduction
---

In Machine Learning, you are very likely dealing with a gargantuan amount of data that you need to put in a dataset,
which you then need to be able to share, reproduce, and track.

ClearML Data Management solves two important challenges:
- Accessibility - Making data easily accessible from every machine
- Versioning - Linking data and experiments for better **traceability**

**We believe Data is not code**. It should not be stored in a git tree, because progress on datasets is not always linear.
Moreover, it can be difficult and inefficient to find the commit associated with a certain version of a dataset in a git tree.

A `clearml-data` dataset is a collection of files, stored on a central storage location (S3 / GS / Azure / Network Storage).
Datasets can be set up to inherit from other datasets, so data lineages can be created,
and users can track when and how their data changes.

Dataset changes are stored using differentiable storage, meaning a version stores the change-set from its previous dataset parents.

Local copies of datasets are always cached, so the same data never needs to be downloaded twice.
When a dataset is pulled, it will automatically pull all parent datasets and merge them into one output folder for you to work with.

## Setup

`clearml-data` comes built into the `clearml` python package! Just check out the [Getting Started](../getting_started/ds/ds_first_steps.md)
guide for more info!

## Using ClearML Data

ClearML Data offers two interfaces:
- `clearml-data` - A CLI utility for creating, uploading, and managing datasets. See [CLI](clearml_data_cli.md) for a reference of `clearml-data` commands.
- `clearml.Dataset` - A python interface for creating, retrieving, managing, and using datasets. See [SDK](clearml_data_sdk.md) for an overview of the basic methods of the `Dataset` class.

For an overview of our recommendations for ClearML Data workflows and practices, see [Best Practices](best_practices.md).
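For a quick sense of the workflow, a dataset's lifecycle with the SDK can be sketched as follows (the names and path are placeholders; the SDK page covers each step in detail):

```python
from clearml import Dataset

# Placeholder project / name / folder
ds = Dataset.create(dataset_name="my dataset", dataset_project="examples")
ds.add_files(path="data/")  # stage local files
ds.upload()                 # push file contents to storage
ds.finalize()               # freeze this version
```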

## WebApp

ClearML's WebApp provides a visual interface to your datasets through dataset tasks. Dataset tasks are categorized
as the data-processing [task type](../fundamentals/task.md#task-types), and they are labeled with a `DATASET` system tag.

The full log (calls / CLI) of the dataset creation process can be found in a dataset's **EXECUTION** section.

A listing of the dataset differential snapshot, a summary of files added / modified / removed, and details of the files in the
differential snapshot (location / size / hash) are available in the **ARTIFACTS** section. Download the dataset
by clicking <img src="/docs/latest/icons/ico-download-json.svg" alt="Download" className="icon size-sm space-sm" />,
next to the **FILE PATH**.

The full dataset listing (all files included) is available in the **CONFIGURATION** section under **Dataset Content**.
This allows you to quickly compare two datasets' contents and visually see the difference.
The dataset genealogy DAG and change-set summary table are visualized in **RESULTS > PLOTS**.

<details className="cml-expansion-panel screenshot">
<summary className="cml-expansion-panel-summary">Dataset Contents</summary>
<div className="cml-expansion-panel-content">



</div>
</details>

<br/>

View a DAG of the dataset dependencies (all previous dataset versions and their parents) in the dataset's page **> ARTIFACTS > state**.

<details className="cml-expansion-panel screenshot">
<summary className="cml-expansion-panel-summary">Data Dependency DAG</summary>
<div className="cml-expansion-panel-content">



</div>
</details>

Once a dataset has been finalized, view its genealogy in the dataset's
page **>** **RESULTS** **>** **PLOTS**.

<details className="cml-expansion-panel screenshot">
<summary className="cml-expansion-panel-summary">Dataset Genealogy</summary>
<div className="cml-expansion-panel-content">



</div>
</details>

docs/clearml_data/clearml_data_cli.md

---
title: CLI
---

The `clearml-data` utility is a CLI tool for controlling and managing your data with ClearML.

The following page provides a reference for `clearml-data`'s CLI commands.

### Creating a Dataset
```bash
clearml-data create --project <project_name> --name <dataset_name> --parents <existing_dataset_id>
```
Creates a new dataset. <br/>

**Parameters**

<div className="tbl-cmd">

|Name|Description|Optional|
|---|---|---|
|`--name` |Dataset's name| <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|`--project`|Dataset's project| <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|`--parents`|IDs of the dataset's parents. The dataset inherits all of its parents' content. Multiple parents can be entered, but they are merged in the order they were entered| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--tags` |Dataset user tags. The dataset can be labeled, which can be useful for organizing datasets| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|

</div>

:::important
`clearml-data` works in a stateful mode, so once a new dataset is created, the following commands
do not require the `--id` flag.
:::
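For example, a typical create-add-close sequence might look like this sketch (the project name, dataset name, and path are placeholders, and the commands require a configured ClearML environment):

```shell
# Create a dataset; thanks to the stateful mode, subsequent
# commands operate on it without --id
clearml-data create --project "Example" --name "dataset v1"

# Stage local files, then upload and finalize in one step
clearml-data add --files ./data
clearml-data close
```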

<br/>

### Adding Files
```bash
clearml-data add --id <dataset_id> --files <filenames/folders_to_add>
```
It's possible to add individual files or complete folders.<br/>

**Parameters**

<div className="tbl-cmd">

|Name|Description|Optional|
|---|---|---|
|`--id` | Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--files`|Files / folders to add. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json` | <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|`--dataset-folder` | Dataset base folder to add the files to in the dataset. Default: dataset root| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--non-recursive` | Disable recursive scan of files | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--verbose` | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|

</div>

<br/>

### Removing Files
```bash
clearml-data remove --id <dataset_id_to_remove_from> --files <filenames/folders_to_remove>
```

**Parameters**

<div className="tbl-cmd">

|Name|Description|Optional|
|---|---|---|
|`--id` | Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--files` | Files / folders to remove (wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`). Notice: the file path is the path within the dataset, not the local path.| <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|`--non-recursive` | Disable recursive scan of files | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--verbose` | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|

</div>

<br/>

### Uploading Dataset Content
```bash
clearml-data upload [--id <dataset_id>] [--storage <upload_destination>]
```
Uploads added files to [ClearML Server](../deploying_clearml/clearml_server.md) by default. It's possible to specify a different storage
medium by entering an upload destination, such as `s3://bucket`, `gs://`, `azure://`, or `/mnt/shared/`.

**Parameters**

<div className="tbl-cmd">

|Name|Description|Optional|
|---|---|---|
|`--id`| Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--storage`| Remote storage to use for the dataset files. Default: files_server | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--verbose` | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|

</div>

<br/>

### Finalizing a Dataset
```bash
clearml-data close --id <dataset_id>
```
Finalizes the dataset and makes it ready to be consumed.
This automatically uploads all files that were not previously uploaded.
Once a dataset is finalized, it can no longer be modified.

**Parameters**

<div className="tbl-cmd">

|Name|Description|Optional|
|---|---|---|
|`--id`| Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--storage`| Remote storage to use for the dataset files. Default: files_server | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--disable-upload` | Disable automatic upload when closing the dataset | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--verbose` | Verbose reporting | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|

</div>

<br/>

### Syncing Local Storage
```bash
clearml-data sync [--id <dataset_id>] --folder <folder_location> [--parents '<parent_id>']
```
This option syncs a folder's content with ClearML. It is useful when a user has a single point of truth (i.e. a folder) that
updates from time to time.

When an update should be reflected in ClearML's system, calling `clearml-data sync` creates a new dataset, enters the folder,
and the changes (file additions, modifications, and removals) are reflected in ClearML.

This command also uploads the data and finalizes the dataset automatically.
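A concrete invocation might look like this sketch (the project name, dataset name, folder, and parent ID are placeholders):

```shell
# Sync a folder into a new dataset version (child of the given parent),
# then upload and finalize it automatically
clearml-data sync --project "Example" --name "dataset v2" \
                  --folder ./data --parents 'aabb1122ccdd3344'
```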

**Parameters**

<div className="tbl-cmd">

|Name|Description|Optional|
|---|---|---|
|`--id`| Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--folder`|Local folder to sync. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|`--storage`|Remote storage to use for the dataset files. Default: files_server |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--parents`|IDs of the dataset's parents (i.e. merge all parents). All modifications made to the folder since the parents were synced will be reflected in the dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--project`|If creating a new dataset, specify the dataset's project name|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--name`|If creating a new dataset, specify the dataset's name|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--tags`|Dataset user tags|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--skip-close`|Do not auto close the dataset after syncing folders|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--verbose` | Verbose reporting |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|

</div>

<br/>

### Listing Dataset Content
```bash
clearml-data list [--id <dataset_id>]
```

**Parameters**

<div className="tbl-cmd">

|Name|Description|Optional|
|---|---|---|
|`--id`|Dataset ID whose contents will be shown (alternatively, use a project / name combination). Default: previously accessed dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--project`|Specify dataset project name (if used instead of ID, dataset name is also required)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--name`|Specify dataset name (if used instead of ID, dataset project is also required)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--filter`|Filter files based on folder / wildcard. Multiple filters are supported. Example: `folder/date_*.json folder/sub-folder`|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--modified`|Only list file changes (add / remove / modify) introduced in this version|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|

</div>

<br/>

### Deleting a Dataset
```bash
clearml-data delete [--id <dataset_id_to_delete>]
```
Deletes an entire dataset from ClearML. This can also be used to delete a newly created dataset.

This does not work on datasets with children.

**Parameters**

<div className="tbl-cmd">

|Name|Description|Optional|
|---|---|---|
|`--id`|ID of the dataset to be deleted. Default: previously created / accessed dataset that hasn't been finalized yet|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--force`|Force dataset deletion even if other dataset versions depend on it|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|

</div>

<br/>

### Searching for a Dataset
```bash
clearml-data search [--name <name>] [--project <project_name>] [--tags <tag>]
```
Lists all datasets in the system that match the search request.

Datasets can be searched by project, name, ID, and tags.

**Parameters**

<div className="tbl-cmd">

|Name|Description|Optional|
|---|---|---|
|`--ids`|A list of dataset IDs|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--project`|The project name of the datasets|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--name`|A dataset name or a partial name to filter datasets by|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--tags`|A list of dataset user tags|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|

</div>

<br/>

### Comparing Two Datasets

```bash
clearml-data compare [--source SOURCE] [--target TARGET]
```

Compare two datasets (target vs. source). The command returns a comparison summary that looks like this:

```
Comparison summary: 4 files removed, 3 files modified, 0 files added
```

**Parameters**

<div className="tbl-cmd">

|Name|Description|Optional|
|---|---|---|
|`--source`|Source dataset ID (used as baseline)|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|`--target`|Target dataset ID (compared against the source baseline dataset)|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|`--verbose`|Verbose report of all file changes (instead of a summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|

</div>

### Merging Datasets

```bash
clearml-data squash --name NAME --ids [IDS [IDS ...]]
```

Squash (merge) multiple datasets into a single dataset version.

**Parameters**

<div className="tbl-cmd">

|Name|Description|Optional|
|---|---|---|
|`--name`|Name for the created squashed dataset|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|`--ids`|Source dataset IDs to squash (merge down)|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|`--storage`|Remote storage to use for the dataset files. Default: files_server |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--verbose`|Verbose report of all file changes (instead of a summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|

</div>

### Verifying a Dataset

```bash
clearml-data verify [--id ID] [--folder FOLDER]
```

Verify that the dataset content matches the data from the local source.

**Parameters**

<div className="tbl-cmd">

|Name|Description|Optional|
|---|---|---|
|`--id`|Specify dataset ID. Default: previously created / accessed dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--folder`|Specify a dataset local copy (if not provided, the local cache folder will be verified)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--filesize`| If True, only verify file size and skip hash checks (default: false)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--verbose`|Verbose report of all file changes (instead of a summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|

</div>

### Getting a Dataset

```bash
clearml-data get [--id ID] [--copy COPY] [--link LINK] [--overwrite]
```

Get a local copy of a dataset. By default, you get a read-only cached folder, but you can get a mutable copy by using the
`--copy` flag.
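For example (the dataset ID and output folder below are placeholders):

```shell
# Writable copy into ./my_dataset; --overwrite replaces existing contents
clearml-data get --id aabb1122ccdd3344 --copy ./my_dataset --overwrite
```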

**Parameters**

<div className="tbl-cmd">

|Name|Description|Optional|
|---|---|---|
|`--id`| Specify dataset ID. Default: previously created / accessed dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--copy`| Get a writable copy of the dataset to a specific output folder|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--link`| Create a soft link (not supported on Windows) to a read-only cached folder containing the dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--overwrite`| If True, overwrite the target folder|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--verbose`| Verbose report of all file changes (instead of a summary)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|

</div>

### Publishing a Dataset

```bash
clearml-data publish --id ID
```

Publish the dataset for public use. The dataset must be [finalized](#finalizing-a-dataset) before it is published.

**Parameters**

<div className="tbl-cmd">

|Name|Description|Optional|
|---|---|---|
|`--id`| The dataset task ID to be published.|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|

</div>

docs/clearml_data/clearml_data_sdk.md

---
title: SDK
---

Datasets can be created, modified, and managed with ClearML Data's python interface. The following page provides an overview
of the most basic methods of the `Dataset` class. See the [Dataset reference page](../references/sdk/dataset.md)
for a complete list of available methods.

Import the `Dataset` class, and let's get started!

```python
from clearml import Dataset
```

## Creating Datasets

ClearML Data supports multiple ways to create datasets programmatically, which provides for a variety of use cases:
* [`Dataset.create()`](#datasetcreate) - Create a new dataset. Parent datasets can be specified, from which the new dataset
  will inherit its data
* [`Dataset.squash()`](#datasetsquash) - Generate a new dataset by squashing together a set of related datasets

### Dataset.create()

Use the [`Dataset.create`](../references/sdk/dataset.md#datasetcreate) class method to create a dataset.

Creating datasets programmatically is especially helpful when preprocessing the data, so that the
preprocessing code and the resulting dataset are saved in a single task (see the `use_current_task` parameter in [`Dataset.create`](../references/sdk/dataset.md#datasetcreate)).

```python
# Preprocessing code here
dataset = Dataset.create(
    dataset_name='dataset name',
    dataset_project='dataset project',
    parent_datasets=[PARENT_DS_ID_1, PARENT_DS_ID_2]
)
```

The created dataset inherits the content of the `parent_datasets`. When multiple dataset parents are listed,
they are merged in order of specification. Each parent overrides any overlapping files from a previous parent dataset.

### Dataset.squash()

To improve deep dataset DAG storage and speed, dataset squashing was introduced. The [`Dataset.squash`](../references/sdk/dataset.md#datasetsquash)
class method generates a new dataset by squashing a set of dataset versions, merging down all changes introduced in
their lineage DAG, and creating a new, flat, independent version.

The datasets being squashed into a single dataset can be specified by their IDs or by project & name pairs.

```python
# option 1 - list dataset IDs
squashed_dataset_1 = Dataset.squash(
    dataset_name='squashed dataset\'s name',
    dataset_ids=[DS1_ID, DS2_ID, DS3_ID]
)

# option 2 - list project and dataset pairs
squashed_dataset_2 = Dataset.squash(
    dataset_name='squashed dataset 2',
    dataset_project_name_pairs=[('dataset1 project', 'dataset1 name'),
                                ('dataset2 project', 'dataset2 name')]
)
```

In addition, the target storage location for the squashed dataset can be specified using the `output_url` parameter of the
[`Dataset.squash`](../references/sdk/dataset.md#datasetsquash) method.

## Accessing Datasets

Once a dataset has been created and uploaded to a server, it can be accessed programmatically from anywhere.

Use the [`Dataset.get`](../references/sdk/dataset.md#datasetget) class method to access a specific Dataset object, either
with the dataset's ID or with its project and name. If only a project name or tag is provided, the method returns the
most recent dataset in the specified project, or the most recent dataset with the specified tag.

Once a specific dataset object has been obtained, get a local copy of the dataset using one of the following options:
* [`Dataset.get_local_copy()`](../references/sdk/dataset.md#get_local_copy) - get a read-only local copy of an entire dataset.
  This method returns a path to the dataset in the local cache (downloading the dataset if it is not already in cache)
* [`Dataset.get_mutable_local_copy()`](../references/sdk/dataset.md#get_mutable_local_copy) - get a writable local copy
  of an entire dataset. This method downloads the dataset to a specific (non-cached) folder, specified with the `target_folder` parameter. If
  the specified folder already has contents, specify whether to overwrite them with the dataset contents, using the `overwrite` parameter.
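Putting this together, a minimal sketch (the project name, dataset name, and target folder are placeholders):

```python
from clearml import Dataset

dataset = Dataset.get(dataset_project="Example", dataset_name="dataset v1")

# Read-only cached copy (downloaded once, then reused)
cached_path = dataset.get_local_copy()

# Writable copy in a folder of your choice
mutable_path = dataset.get_mutable_local_copy(
    target_folder="./my_dataset", overwrite=True
)
```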

## Modifying Datasets

Once a dataset has been created, its contents can be modified and replaced. When your data changes, you can
add updated files or remove unnecessary ones.

### add_files()

To add files or folders to the current dataset, use the [`Dataset.add_files`](../references/sdk/dataset.md#add_files)
method. If a file is already in the dataset but has been modified, it can be added again, and ClearML will
upload the file diff.

```python
dataset = Dataset.create(
    dataset_name='dataset name',
    dataset_project='dataset project'
)
dataset.add_files(path="path/to/folder_or_file")
```

There is an option to add a set of files based on wildcard matching of a single string or a list of strings, using the
`wildcard` parameter. Specify whether to match the wildcard files recursively using the `recursive` parameter.

For example:

```python
dataset.add_files(
    path="path/to/folder",
    wildcard="*.jpg",
    recursive=True
)
```

### remove_files()
To remove files from a current dataset, use the [`Dataset.remove_files`](../references/sdk/dataset.md#remove_files) method.
Input the path to the folder or file to be removed in the `dataset_path` parameter. The path is relative to the dataset.

There is also an option to input a wildcard into `dataset_path` in order to remove a set of files matching the wildcard.
Set the `recursive` parameter to `True` in order to match all wildcard files recursively.

For example:

```python
dataset.remove_files(dataset_path="*.csv", recursive=True)
```

## Uploading Files

To upload the dataset files to network storage, use the [`Dataset.upload`](../references/sdk/dataset.md#upload) method.
Use the `output_url` parameter to specify the storage target, such as S3 / GS / Azure (e.g. `s3://bucket/data`, `gs://bucket/data`, `azure://bucket/data`, `/mnt/share/data`).
By default, the dataset uploads to ClearML's file server.

Dataset files must be uploaded before a dataset is [finalized](#finalizing-a-dataset).
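For example, uploading to an S3 bucket might look like this (the bucket path is a placeholder, and `dataset` is an existing `Dataset` object with staged files):

```python
# Upload staged files to S3 instead of the default file server
dataset.upload(output_url="s3://my-bucket/datasets")
```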
|
||||
|
||||
|
||||
## Finalizing a Dataset

Use the [`Dataset.finalize`](../references/sdk/dataset.md#finalize) method to close the current dataset. This marks the
dataset task as *Completed*, at which point the dataset can no longer be modified.

Before closing a dataset, its files must first be [uploaded](#uploading-files).

## Syncing Local Storage

Use the [`Dataset.sync_folder`](../references/sdk/dataset.md#sync_folder) method in order to update a dataset according
to a specific folder's content changes. Specify the folder to sync with the `local_path` parameter (the method includes all files in the folder, recursively).

This method is useful when there is a single point of truth, either a local or network folder, that gets updated periodically.
The folder changes will be reflected in a new dataset version. This method saves time since you don't have to manually
update (add / remove) files in a dataset.
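Conceptually, a folder sync boils down to a three-way diff between the dataset's recorded file list and the folder's current contents. A rough, self-contained sketch of that bookkeeping (illustrative only, not ClearML's actual code):

```python
import hashlib
import pathlib
import tempfile

def snapshot(folder):
    """Map each file's relative path to a SHA-256 hash of its contents."""
    folder = pathlib.Path(folder)
    return {
        str(p.relative_to(folder)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in folder.rglob("*") if p.is_file()
    }

def diff(old, new):
    """Classify changes between two snapshots, as a sync would."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    modified = sorted(k for k in set(old) & set(new) if old[k] != new[k])
    return added, removed, modified

# Build a sample folder and record its state
root = pathlib.Path(tempfile.mkdtemp())
(root / "keep.txt").write_text("same")
(root / "old.txt").write_text("goes away")
before = snapshot(root)

(root / "old.txt").unlink()              # a file is removed
(root / "new.txt").write_text("data")    # a file is added
(root / "keep.txt").write_text("SAME")   # a file is modified

added, removed, modified = diff(before, snapshot(root))
print(added, removed, modified)  # ['new.txt'] ['old.txt'] ['keep.txt']
```

The new dataset version then only needs to record the removals and upload the added / modified files.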
---
title: Dataset Management with CLI and SDK
---

In this tutorial, we are going to manage the CIFAR dataset with the `clearml-data` CLI, and then use ClearML's [`Dataset`](../../references/sdk/dataset.md)
class to ingest the data.

## Creating the Dataset

### Downloading the Data
Before we can register the CIFAR dataset with `clearml-data`, we need to obtain a local copy of it.

Execute this Python script to download the data:
```python
from clearml import StorageManager

manager = StorageManager()
dataset_path = manager.get_local_copy(
    remote_url="https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"
)
# make sure to copy the printed value
print("COPY THIS DATASET PATH: {}".format(dataset_path))
```

Expected response:
```bash
COPY THIS DATASET PATH: ~/.clearml/cache/storage_manager/global/f2751d3a22ccb78db0e07874912b5c43.cifar-10-python_artifacts_archive_None
```
The script prints the path to the downloaded data. It will be needed later on.

### Creating the Dataset
To create the dataset, execute the following command:
```
clearml-data create --project dataset_examples --name cifar_dataset
```

Expected response:
```
clearml-data - Dataset Management & Versioning CLI
Creating a new dataset:
New dataset created id=ee1c35f60f384e65bc800f42f0aca5ec
```
Where `ee1c35f60f384e65bc800f42f0aca5ec` is the dataset ID.

## Adding Files
Add the files we just downloaded to the dataset:

```
clearml-data add --files <dataset_path>
```

where `<dataset_path>` is the path that was printed earlier, which denotes the location of the downloaded dataset.

:::note
There's no need to specify a `dataset_id`, since the `clearml-data` session stores it.
:::

## Finalizing the Dataset
Run the [`close`](../../references/sdk/dataset.md#close) command to upload the files (by default, they are uploaded to the ClearML Server):

```
clearml-data close
```

This command sets the dataset task's status to *completed*, so it will no longer be modifiable. This ensures future
reproducibility.

The information about the dataset, including a list of files and their sizes, can be viewed
in the WebApp, in the dataset task's **ARTIFACTS** tab.

![Dataset's artifacts tab](../../img/examples_data_management_cifar_dataset.png)

## Using the Dataset

Now that we have a new dataset registered, we can consume it.

The [data_ingestion.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/data_ingestion.py) example
script demonstrates using the dataset within Python code.

```python
from clearml import Dataset
# torchvision supplies the CIFAR10 dataset object and transforms
from torchvision import datasets, transforms

# a minimal transform; the example script may define a richer one
transform = transforms.Compose([transforms.ToTensor()])

dataset_name = "cifar_dataset"
dataset_project = "dataset_examples"

dataset_path = Dataset.get(dataset_name=dataset_name, dataset_project=dataset_project).get_local_copy()

trainset = datasets.CIFAR10(
    root=dataset_path,
    train=True,
    download=False,
    transform=transform
)
```
The Dataset's [`get_local_copy`](../../references/sdk/dataset.md#get_local_copy) method returns a path to the cached,
downloaded dataset. That path is then provided to PyTorch's dataset object.

The script then trains a neural network to classify images using the dataset created above.

---
title: Folder Sync with CLI
---

This example shows how to use the `clearml-data` folder sync function.

`clearml-data` folder sync mode is useful for cases when users have a single point of truth (i.e. a folder) that updates
from time to time. When the point of truth is updated, users can call `clearml-data sync`, and the
changes (file addition, modification, or removal) will be reflected in ClearML.

## Prerequisites
1. First, make sure that you have cloned the [clearml](https://github.com/allegroai/clearml) repository. It contains all
   the needed files.

1. Open a terminal and change directory to the cloned repository's examples folder:

   ```
   cd clearml/examples/reporting
   ```

## Syncing a Folder
Create a dataset and sync the `data_samples` folder from the repo to ClearML:
```bash
clearml-data sync --project datasets --name sync_folder --folder data_samples
```

Expected response:

```
clearml-data - Dataset Management & Versioning CLI
Creating a new dataset:
New dataset created id=0d8f5f3e5ebd4f849bfb218021be1ede
Syncing dataset id 0d8f5f3e5ebd4f849bfb218021be1ede to local folder data_samples
Generating SHA2 hash for 5 files
Hash generation completed
Sync completed: 0 files removed, 5 added / modified
Finalizing dataset
Pending uploads, starting dataset upload to https://files.community.clear.ml
Uploading compressed dataset changes (5 files, total 222.17 KB) to https://files.community.clear.ml
Upload completed (222.17 KB)
2021-05-04 09:57:56,809 - clearml.Task - INFO - Waiting to finish uploads
2021-05-04 09:57:57,581 - clearml.Task - INFO - Finished uploading
Dataset closed and finalized
```

As can be seen, the `clearml-data sync` command creates the dataset, uploads the files, and closes the dataset.

## Modifying Synced Folder

Now we'll modify the folder:
1. Add another line to one of the files in the `data_samples` folder.
1. Add a file to the `data_samples` folder.<br/>
   Run `echo "data data data" > data_samples/new_data.txt` (this will create the file `new_data.txt` and put it in the `data_samples` folder).

We'll repeat the process of creating a new dataset with the previous one as its parent, and syncing the folder.

```bash
clearml-data sync --project datasets --name second_ds --parents 0d8f5f3e5ebd4f849bfb218021be1ede --folder data_samples
```

Expected response:
```
clearml-data - Dataset Management & Versioning CLI
Creating a new dataset:
New dataset created id=0992dd6bae6144388e0f2ef131d9724a
Syncing dataset id 0992dd6bae6144388e0f2ef131d9724a to local folder data_samples
Generating SHA2 hash for 6 files
Hash generation completed
Sync completed: 0 files removed, 2 added / modified
Finalizing dataset
Pending uploads, starting dataset upload to https://files.community.clear.ml
Uploading compressed dataset changes (2 files, total 742 bytes) to https://files.community.clear.ml
Upload completed (742 bytes)
2021-05-04 10:05:42,353 - clearml.Task - INFO - Waiting to finish uploads
2021-05-04 10:05:43,106 - clearml.Task - INFO - Finished uploading
Dataset closed and finalized
```

We can see that 2 files were added or modified, just as we expected!

---
title: Data Management with Python
---

The [dataset_creation.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/dataset_creation.py) and
[data_ingestion.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/data_ingestion.py) scripts
together demonstrate how to use ClearML's [`Dataset`](../../references/sdk/dataset.md) class to create a dataset and
subsequently ingest the data.

## Dataset Creation

The [dataset_creation.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/dataset_creation.py) script
demonstrates how to do the following:
* Create a dataset and add files to it
* Upload the dataset to the ClearML Server
* Finalize the dataset

### Downloading the Data

We first need to obtain a local copy of the CIFAR dataset.

```python
from clearml import StorageManager

manager = StorageManager()
dataset_path = manager.get_local_copy(remote_url="https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz")
```

This script downloads the data, and `dataset_path` contains the path to the downloaded files.

### Creating the Dataset

```python
from clearml import Dataset

dataset = Dataset.create(dataset_name="cifar_dataset", dataset_project="dataset examples")
```

This creates a data processing task called `cifar_dataset` in the `dataset examples` project, which
can be viewed in the WebApp.

### Adding Files

```python
dataset.add_files(path=dataset_path)
```

This adds the downloaded files to the current dataset.

### Uploading the Files

```python
dataset.upload()
```
This uploads the dataset to the ClearML Server by default. The dataset's destination can be changed by specifying the
target storage with the `output_url` parameter of the [`upload`](../../references/sdk/dataset.md#upload) method.

### Finalizing the Dataset

Run the [`finalize`](../../references/sdk/dataset.md#finalize) command to close the dataset and set the dataset task's
status to *completed*. The dataset can only be finalized if it doesn't have any pending uploads.

```python
dataset.finalize()
```

After a dataset has been closed, it can no longer be modified. This ensures future reproducibility.

The information about the dataset, including a list of files and their sizes, can be viewed
in the WebApp, in the dataset task's **ARTIFACTS** tab.

![Dataset's artifacts tab](../../img/examples_data_management_cifar_dataset.png)

## Data Ingestion

Now that we have a new dataset registered, we can consume it!

The [data_ingestion.py](https://github.com/allegroai/clearml/blob/master/examples/datasets/data_ingestion.py) script
demonstrates data ingestion using the dataset created in the first script.

```python
from clearml import Dataset

dataset_name = "cifar_dataset"
dataset_project = "dataset_examples"

dataset_path = Dataset.get(dataset_name=dataset_name, dataset_project=dataset_project).get_local_copy()
```

The script above gets the dataset and uses the [`Dataset.get_local_copy`](../../references/sdk/dataset.md#get_local_copy)
method to return a path to the cached, read-only local dataset. If you need a modifiable copy of the dataset,
use `Dataset.get(dataset_name, dataset_project).get_mutable_local_copy("path/to/download")`.

The script then trains a neural network to classify images from the dataset that was created above.
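Conceptually, a mutable local copy is "the cached copy plus a writable duplicate": the cached files are copied into a target folder that you are free to modify, leaving the cache untouched. A rough illustration of the difference (not the actual ClearML implementation):

```python
import pathlib
import shutil
import tempfile

# Pretend this is the read-only cached dataset folder
cache = pathlib.Path(tempfile.mkdtemp())
(cache / "sample.txt").write_text("original")

# A mutable copy duplicates the cache into a folder you own
target = pathlib.Path(tempfile.mkdtemp()) / "mutable_copy"
shutil.copytree(cache, target)

# Edits to the mutable copy leave the cache untouched
(target / "sample.txt").write_text("edited")
print((cache / "sample.txt").read_text())   # original
print((target / "sample.txt").read_text())  # edited
```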
docs/clearml_data/data_management_examples/data_man_simple.md
|
||||
---
|
||||
title: Data Management from CLI
|
||||
---
|
||||
|
||||
In this example we'll create a simple dataset and demonstrate basic actions on it, using the `clearml-data` CLI.
|
||||
|
||||
## Prerequisites
|
||||
1. First, make sure that you have cloned the [clearml](https://github.com/allegroai/clearml) repository. This contains all
|
||||
the needed files.
|
||||
1. Open terminal and change directory to the cloned repository's examples folder
|
||||
|
||||
```
|
||||
cd clearml/examples/reporting
|
||||
```
|
||||
|
||||
## Creating Initial Dataset

1. To create the dataset, run this command:

   ```bash
   clearml-data create --project datasets --name HelloDataset
   ```

   Expected response:

   ```bash
   clearml-data - Dataset Management & Versioning CLI
   Creating a new dataset:
   New dataset created id=24d05040f3e14fbfbed8edb1bf08a88c
   ```

1. Now let's add a folder. File addition is recursive, so it's enough to point at the folder
   to capture all files and sub-folders:

   ```bash
   clearml-data add --files data_samples
   ```

   Expected response:

   ```bash
   clearml-data - Dataset Management & Versioning CLI
   Adding files/folder to dataset id 24d05040f3e14fbfbed8edb1bf08a88c
   Generating SHA2 hash for 2 files
   Hash generation completed
   5 files added
   ```

   :::note
   After creating a dataset, we don't have to specify its ID when running commands such as *add*, *remove*, or *list*.
   :::

1. Close the dataset. This command uploads the files. By default, the files are uploaded to the file server, but
   this can be configured with the `--storage` flag to any of ClearML's supported storage mediums (see [storage](../../integrations/storage.md)).
   The command also finalizes the dataset, making it immutable and ready to be consumed.

   ```bash
   clearml-data close
   ```

   Expected response:

   ```bash
   clearml-data - Dataset Management & Versioning CLI
   Finalizing dataset id 24d05040f3e14fbfbed8edb1bf08a88c
   Pending uploads, starting dataset upload to https://files.community.clear.ml
   Uploading compressed dataset changes (4 files, total 221.56 KB) to https://files.community.clear.ml
   Upload completed (221.56 KB)
   2021-05-04 09:32:03,388 - clearml.Task - INFO - Waiting to finish uploads
   2021-05-04 09:32:04,067 - clearml.Task - INFO - Finished uploading
   Dataset closed and finalized
   ```

## Listing Dataset Content

To see that all the files were added to the created dataset, use `clearml-data list` and enter the ID of the dataset
that was just closed.

```bash
clearml-data list --id 24d05040f3e14fbfbed8edb1bf08a88c
```

Expected response:

```console
clearml-data - Dataset Management & Versioning CLI

List dataset content: 24d05040f3e14fbfbed8edb1bf08a88c
Listing dataset content
file name | size | hash
------------------------------------------------------------------------------------------------------------------------------------------------
dancing.jpg | 40,484 | 78e804c0c1d54da8d67e9d072c1eec514b91f4d1f296cdf9bf16d6e54d63116a
data.csv | 21,440 | b618696f57b822cd2e9b92564a52b3cc93a2206f41df3f022956bb6cfe4e7ad5
picasso.jpg | 114,573 | 6b3c67ea9ec82b09bd7520dd09dad2f1176347d740fd2042c88720e780691a7c
sample.json | 132 | 9c42a9a978ac7a71873ebd5c65985e613cfaaff1c98f655af0d2ee0246502fd7
sample.mp3 | 72,142 | fbb756ae14005420ff00ccdaff99416bebfcea3adb7e30963a69e68e9fbe361b
Total 5 files, 248771 bytes
```

## Creating a Child Dataset

In ClearML Data, it's possible to create datasets that inherit the content of other datasets; these are called child datasets.

1. Create a new dataset, specifying the previously created one as its parent:

   ```bash
   clearml-data create --project datasets --name HelloDataset-improved --parents 24d05040f3e14fbfbed8edb1bf08a88c
   ```
   :::note
   You'll need to input the dataset ID you received when you created the dataset above
   :::

1. Now, we want to add a new file.
   * Create a new file: `echo "data data data" > new_data.txt` (this will create the file `new_data.txt`).
   * Now add the file to the dataset:

   ```bash
   clearml-data add --files new_data.txt
   ```

   Which should return this output:

   ```console
   clearml-data - Dataset Management & Versioning CLI
   Adding files/folder to dataset id 8b68686a4af040d081027ba3cf6bbca6
   1 file added
   ```

1. Let's also remove a file. We'll need to specify the file's full path (within the dataset, not locally) to remove it.

   ```bash
   clearml-data remove --files data_samples/dancing.jpg
   ```

   Expected response:
   ```bash
   clearml-data - Dataset Management & Versioning CLI
   Removing files/folder from dataset id 8b68686a4af040d081027ba3cf6bbca6
   1 files removed
   ```

1. Close and finalize the dataset:

   ```bash
   clearml-data close
   ```

1. Let's take a look again at the files in the dataset:

   ```
   clearml-data list --id 8b68686a4af040d081027ba3cf6bbca6
   ```

   And we see that our changes have been made! `new_data.txt` has been added, and `dancing.jpg` has been removed.

   ```
   file name | size | hash
   ------------------------------------------------------------------------------------------------------------------------------------------------
   data.csv | 21,440 | b618696f57b822cd2e9b92564a52b3cc93a2206f41df3f022956bb6cfe4e7ad5
   new_data.txt | 15 | 6df986a2154902260a836febc5a32543f5337eac60560c57db99257a7e012051
   picasso.jpg | 114,573 | 6b3c67ea9ec82b09bd7520dd09dad2f1176347d740fd2042c88720e780691a7c
   sample.json | 132 | 9c42a9a978ac7a71873ebd5c65985e613cfaaff1c98f655af0d2ee0246502fd7
   sample.mp3 | 72,142 | fbb756ae14005420ff00ccdaff99416bebfcea3adb7e30963a69e68e9fbe361b
   Total 5 files, 208302 bytes
   ```

By using `clearml-data`, a clear lineage is created for the data. As seen in this example, once a dataset is closed, the
only way to add or remove data is to create a new dataset with the previous one as its parent. This way, the data
is not reliant on the code and is reproducible.
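The parent/child mechanics above can be modeled as simple set operations on file listings: a child starts from its parent's files, then applies its own additions and removals. A toy sketch of that lineage, using the file names from this example (illustrative only, not ClearML's internals):

```python
# Parent dataset's file listing (from the example above)
parent_files = {"dancing.jpg", "data.csv", "picasso.jpg", "sample.json", "sample.mp3"}

def child_dataset(parent, added=(), removed=()):
    """A child inherits the parent's files, then applies its own changes."""
    return (parent | set(added)) - set(removed)

child_files = child_dataset(
    parent_files,
    added=["new_data.txt"],
    removed=["dancing.jpg"],
)
print(sorted(child_files))
# ['data.csv', 'new_data.txt', 'picasso.jpg', 'sample.json', 'sample.mp3']
```

Because each version stores only its own changes, the full content of any version can always be rebuilt by walking its parent chain.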
docs/clearml_data/data_management_examples/workflows.md
---
title: Workflows
---

Take a look at the ClearML Data examples, which demonstrate common workflows using the `clearml-data` CLI and the
`Dataset` class:
* [Data Management from CLI](data_man_simple.md) - Tutorial for creating, modifying, and consuming a dataset with the CLI.
* [Folder Sync with CLI](data_man_folder_sync.md) - Tutorial for using the `clearml-data sync` CLI option to update a dataset according
  to a local folder.
* [Dataset Management with CLI and SDK](data_man_cifar_classification.md) - Tutorial for creating a dataset with the CLI,
  then programmatically ingesting the data with the SDK.
* [Data Management with Python](data_man_python.md) - Example scripts for creating and consuming a dataset with the SDK.