Update Clearml Data (#277)

pollfly 2022-06-30 09:27:45 +03:00 committed by GitHub
parent 48b70440a8
commit 110e7b5fe7
2 changed files with 189 additions and 23 deletions


@ -17,7 +17,8 @@ Creates a new dataset.
```bash ```bash
clearml-data create [-h] [--parents [PARENTS [PARENTS ...]]] [--project PROJECT] clearml-data create [-h] [--parents [PARENTS [PARENTS ...]]] [--project PROJECT]
--name NAME [--tags [TAGS [TAGS ...]]] --name NAME [--version VERSION] [--output-uri OUTPUT_URI]
[--tags [TAGS [TAGS ...]]]
``` ```
**Parameters** **Parameters**
@ -28,7 +29,9 @@ clearml-data create [-h] [--parents [PARENTS [PARENTS ...]]] [--project PROJECT]
|---|---|---| |---|---|---|
|`--name` |Dataset's name| <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> | |`--name` |Dataset's name| <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|`--project`|Dataset's project| <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> | |`--project`|Dataset's project| <img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" /> |
|`--version` |Dataset version. If not specified, a version is assigned automatically | <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--parents`|IDs of the dataset's parents. The dataset inherits all of its parents' content. Multiple parents can be entered, but they are merged in the order they were entered| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> | |`--parents`|IDs of the dataset's parents. The dataset inherits all of its parents' content. Multiple parents can be entered, but they are merged in the order they were entered| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--output-uri`| Sets where the dataset and its previews are uploaded| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--tags` |Dataset user tags. The dataset can be labeled, which can be useful for organizing datasets| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />| |`--tags` |Dataset user tags. The dataset can be labeled, which can be useful for organizing datasets| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
</div> </div>
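
The new `--version` and `--output-uri` flags can be exercised from a script. The following is a minimal sketch that only assembles the argument list for `clearml-data create` using the flags listed in the table above; the helper name `build_create_command` and the example values are hypothetical, and nothing is executed against a ClearML server.

```python
import shlex


def build_create_command(name, project, version=None, output_uri=None,
                         parents=(), tags=()):
    """Assemble the argv list for `clearml-data create` (hypothetical helper)."""
    # --project and --name are the two required arguments.
    argv = ["clearml-data", "create", "--project", project, "--name", name]
    if version is not None:
        argv += ["--version", version]
    if output_uri is not None:
        argv += ["--output-uri", output_uri]
    if parents:
        # Parents are merged in the order they are listed.
        argv += ["--parents", *parents]
    if tags:
        argv += ["--tags", *tags]
    return argv


command = build_create_command(
    name="example dataset",
    project="example project",
    version="1.0.0",
    output_uri="s3://bucket/data",
)
# shlex.join quotes arguments that contain spaces, so the line is shell-safe.
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` on a machine where `clearml-data` is installed and configured.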
@ -160,8 +163,8 @@ This command also uploads the data and finalizes the dataset automatically.
```bash ```bash
clearml-data sync [-h] [--id ID] [--dataset-folder DATASET_FOLDER] --folder FOLDER clearml-data sync [-h] [--id ID] [--dataset-folder DATASET_FOLDER] --folder FOLDER
[--parents [PARENTS [PARENTS ...]]] [--project PROJECT] [--name NAME] [--parents [PARENTS [PARENTS ...]]] [--project PROJECT] [--name NAME]
[--tags [TAGS [TAGS ...]]] [--storage STORAGE] [--skip-close] [--version VERSION] [--output-uri OUTPUT_URI] [--tags [TAGS [TAGS ...]]]
[--chunk-size CHUNK_SIZE] [--verbose] [--storage STORAGE] [--skip-close] [--chunk-size CHUNK_SIZE] [--verbose]
``` ```
**Parameters** **Parameters**
@ -173,10 +176,11 @@ clearml-data sync [-h] [--id ID] [--dataset-folder DATASET_FOLDER] --folder FOLD
|`--id`| Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> | |`--id`| Dataset's ID. Default: previously created / accessed dataset| <img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" /> |
|`--dataset-folder`|Dataset base folder to add the files to (default: Dataset root)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />| |`--dataset-folder`|Dataset base folder to add the files to (default: Dataset root)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--folder`|Local folder to sync. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />| |`--folder`|Local folder to sync. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|`--storage`|Remote storage to use for the dataset files. Default: files_server |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />| |`--storage`|Remote storage to use for the dataset files. Default: files server |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--parents`|IDs of the dataset's parents (i.e. merge all parents). All modifications made to the folder since the parents were synced will be reflected in the dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />| |`--parents`|IDs of the dataset's parents (i.e. merge all parents). All modifications made to the folder since the parents were synced will be reflected in the dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--project`|If creating a new dataset, specify the dataset's project name|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />| |`--project`|If creating a new dataset, specify the dataset's project name|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--name`|If creating a new dataset, specify the dataset's name|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />| |`--name`|If creating a new dataset, specify the dataset's name|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--version`|Specify the dataset's version. Default: `1.0.0`|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--tags`|Dataset user tags|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />| |`--tags`|Dataset user tags|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--skip-close`|Do not auto close dataset after syncing folders|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />| |`--skip-close`|Do not auto close dataset after syncing folders|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--chunk-size`| Set dataset artifact upload chunk size in MB. Default 512, (pass -1 for a single chunk). Example: 512, dataset will be split and uploaded in 512 MB chunks. |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />| |`--chunk-size`| Set dataset artifact upload chunk size in MB. Default 512, (pass -1 for a single chunk). Example: 512, dataset will be split and uploaded in 512 MB chunks. |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
@ -191,7 +195,7 @@ clearml-data sync [-h] [--id ID] [--dataset-folder DATASET_FOLDER] --folder FOLD
List a dataset's contents. List a dataset's contents.
```bash ```bash
clearml-data list [-h] [--id ID] [--project PROJECT] [--name NAME] clearml-data list [-h] [--id ID] [--project PROJECT] [--name NAME] [--version VERSION]
[--filter [FILTER [FILTER ...]]] [--modified] [--filter [FILTER [FILTER ...]]] [--modified]
``` ```
@ -204,6 +208,7 @@ clearml-data list [-h] [--id ID] [--project PROJECT] [--name NAME]
|`--id`|Dataset ID whose contents will be shown (alternatively, use project / name combination). Default: previously accessed dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />| |`--id`|Dataset ID whose contents will be shown (alternatively, use project / name combination). Default: previously accessed dataset|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--project`|Specify dataset project name (if used instead of ID, dataset name is also required)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />| |`--project`|Specify dataset project name (if used instead of ID, dataset name is also required)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--name`|Specify dataset name (if used instead of ID, dataset project is also required)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />| |`--name`|Specify dataset name (if used instead of ID, dataset project is also required)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--version`|Specify dataset version. Default: most recent version |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--filter`|Filter files based on folder / wildcard. Multiple filters are supported. Example: `folder/date_*.json folder/sub-folder`|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />| |`--filter`|Filter files based on folder / wildcard. Multiple filters are supported. Example: `folder/date_*.json folder/sub-folder`|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--modified`|Only list file changes (add / remove / modify) introduced in this version|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />| |`--modified`|Only list file changes (add / remove / modify) introduced in this version|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
@ -211,25 +216,103 @@ clearml-data list [-h] [--id ID] [--project PROJECT] [--name NAME]
<br/> <br/>
## delete ## set-description
Delete an entire dataset from ClearML. This can also be used to delete a newly created dataset. Sets the description of an existing dataset.
This does not work on datasets with children.
```bash ```bash
clearml-data delete [-h] [--id ID] [--force] clearml-data set-description [-h] [--id ID] [--description DESCRIPTION]
``` ```
**Parameters** **Parameters**
<div className="tbl-cmd"> <div className="tbl-cmd">
|Name|Description|Optional| |Name|Description|Optional|
|---|---|---| |---|---|---|
|`--id`|ID of dataset to be deleted. Default: previously created / accessed dataset that hasn't been finalized yet|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />| |`--id`|Dataset's ID|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|`--force`|Force dataset deletion even if other dataset versions depend on it|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|| |`--description`|Description to be set|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
</div>
<br/>
## delete
Deletes dataset(s). Pass any of the attributes of the dataset(s) you want to delete. Multiple datasets matching the
request will raise an exception, unless you pass `--entire-dataset` and `--force`. In this case, all matching datasets
will be deleted.
If a dataset is a parent of other datasets, you must pass `--force` in order to delete it.
:::warning
Deleting a parent dataset may cause child datasets to lose data!
:::
```bash
clearml-data delete [-h] [--id ID] [--project PROJECT] [--name NAME]
[--version VERSION] [--force] [--entire-dataset]
```
**Parameters**
<div className="tbl-cmd">
|Name|Description|Optional|
|---|---|---|
|`--id`|ID of the dataset to delete (alternatively, use project / name combination).|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--project`|Specify dataset project name (if used instead of ID, dataset name is also required)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--name`|Specify dataset name (if used instead of ID, dataset project is also required)|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--version`|Specify dataset version|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--force`|Force dataset deletion even if other dataset versions depend on it. Must also be used if `--entire-dataset` flag is used|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--entire-dataset`|Delete all found datasets|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
</div>
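
The flag rules described for `delete` can be checked before the command is ever run. The sketch below assembles the argument list and rejects combinations that the text above says are invalid. The helper name `build_delete_command` is hypothetical, and requiring either `--id` or both `--project` and `--name` is an assumption drawn from the table notes, not a documented CLI check.

```python
def build_delete_command(dataset_id=None, project=None, name=None,
                         version=None, force=False, entire_dataset=False):
    """Assemble the argv list for `clearml-data delete` (hypothetical helper)."""
    # Assumption: a dataset is identified by its ID or by project + name.
    if dataset_id is None and not (project and name):
        raise ValueError("pass --id, or both --project and --name")
    # --entire-dataset must be used together with --force.
    if entire_dataset and not force:
        raise ValueError("--entire-dataset must be used together with --force")

    argv = ["clearml-data", "delete"]
    for flag, value in (("--id", dataset_id), ("--project", project),
                        ("--name", name), ("--version", version)):
        if value is not None:
            argv += [flag, value]
    if force:
        argv.append("--force")
    if entire_dataset:
        argv.append("--entire-dataset")
    return argv


# Delete one specific version.
single = build_delete_command(
    project="example project", name="example dataset", version="3.0")

# Delete every dataset matching the project / name pair.
everything = build_delete_command(
    project="example project", name="example dataset",
    force=True, entire_dataset=True)
```

Keeping the validation in one place makes it harder to trigger an accidental bulk deletion from automation scripts.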
<br/>
## rename
Rename a dataset (and all of its versions).
```bash
clearml-data rename [-h] --new-name NEW_NAME --project PROJECT --name NAME
```
**Parameters**
<div className="tbl-cmd">
|Name|Description|Optional|
|---|---|---|
|`--new-name`|The new name of the dataset|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|`--project`|The project the dataset to be renamed belongs to|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|`--name`|The current name of the dataset(s) to be renamed|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
</div>
<br/>
## move
Moves a dataset to another project.
```bash
clearml-data move [-h] --new-project NEW_PROJECT --project PROJECT --name NAME
```
**Parameters**
<div className="tbl-cmd">
|Name|Description|Optional|
|---|---|---|
|`--new-project`|The new project of the dataset|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|`--project`|The current project that the dataset to be moved belongs to|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|`--name`|The name of the dataset to be moved|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
</div> </div>
@ -252,10 +335,10 @@ clearml-data search [-h] [--ids [IDS [IDS ...]]] [--project PROJECT]
|Name|Description|Optional| |Name|Description|Optional|
|---|---|---| |---|---|---|
|`--ids`|A list of dataset IDs|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />| |`--ids`|A list of dataset IDs|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--project`|The project name of the datasets|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />| |`--project`|The project name of the datasets|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--name`|A dataset name or a partial name to filter datasets by|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />| |`--name`|A dataset name or a partial name to filter datasets by|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--tags`|A list of dataset user tags|<img src="/docs/latest/icons/ico-optional-yes.svg" className="icon size-md center-md" />| |`--tags`|A list of dataset user tags|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
</div> </div>


@ -36,7 +36,9 @@ preprocessing code and the resulting dataset are saved in a single task (see `us
dataset = Dataset.create( dataset = Dataset.create(
dataset_name='dataset name', dataset_name='dataset name',
dataset_project='dataset project', dataset_project='dataset project',
parent_datasets=[PARENT_DS_ID_1, PARENT_DS_ID_2] parent_datasets=[PARENT_DS_ID_1, PARENT_DS_ID_2],
dataset_version="1.0",
output_uri="gs://bucket-name/folder"
) )
``` ```
@ -45,6 +47,11 @@ To locate a dataset's ID, go to the dataset task's info panel in the [WebApp](..
to the right of the dataset task name, click `ID` and the dataset ID appears to the right of the dataset task name, click `ID` and the dataset ID appears
::: :::
Use the `output_uri` parameter to specify a network storage target for the dataset files and associated information
(such as previews), e.g. `s3://bucket/data`, `gs://bucket/data`, `azure://bucket/data`, `file:///mnt/share/data`.
By default, the dataset uploads to ClearML's file server. The `output_url` parameter of [`Dataset.upload`](#uploading-files)
and the storage parameter of [`Dataset.sync_folder`](../references/sdk/dataset.md#sync_folder) override this parameter's value.
The created dataset inherits the content of the `parent_datasets`. When multiple dataset parents are listed, The created dataset inherits the content of the `parent_datasets`. When multiple dataset parents are listed,
they are merged in order of specification. Each parent overrides any overlapping files from a previous parent dataset. they are merged in order of specification. Each parent overrides any overlapping files from a previous parent dataset.
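
The override rule for multiple parents can be pictured with plain dictionaries. This is only an illustration of the ordering semantics described above, not ClearML's implementation; the function `merge_parents` and the file entries are hypothetical, with each parent modeled as a mapping from relative file path to content label.

```python
def merge_parents(parents):
    """Merge parent file listings in order; later parents win on overlap."""
    merged = {}
    for parent in parents:
        # dict.update overwrites keys already present, which mirrors
        # "each parent overrides overlapping files from a previous parent".
        merged.update(parent)
    return merged


parent_1 = {"data/a.csv": "a from parent 1", "data/b.csv": "b from parent 1"}
parent_2 = {"data/b.csv": "b from parent 2", "data/c.csv": "c from parent 2"}

merged = merge_parents([parent_1, parent_2])
for path, source in sorted(merged.items()):
    print(path, "<-", source)
```

Reversing the order of the list would make `parent_1` the winner for `data/b.csv`, which is why the order passed to `parent_datasets` matters.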
@ -71,15 +78,40 @@ squashed_dataset_2 = Dataset.squash(
) )
``` ```
In addition, the target storage location for the squashed dataset can be specified using the `output_url` parameter of the In addition, the target storage location for the squashed dataset can be specified using the `output_uri` parameter of the
[`Dataset.squash`](../references/sdk/dataset.md#datasetsquash) method. [`Dataset.squash`](../references/sdk/dataset.md#datasetsquash) method.
## Accessing Datasets ## Accessing Datasets
Once a dataset has been created and uploaded to a server, the dataset can be accessed programmatically from anywhere. Once a dataset has been created and uploaded to a server, the dataset can be accessed programmatically from anywhere.
Use the [`Dataset.get`](../references/sdk/dataset.md#datasetget) class method to access a specific Dataset object, either Use the [`Dataset.get`](../references/sdk/dataset.md#datasetget) class method to access a specific Dataset object, by
with the dataset's ID or with its project and name. If only a project name or tag is provided, the method returns the providing any of the dataset's following attributes: dataset ID, project, name, tags, and/or version. If multiple
most recent dataset in the specified project, or the most recent dataset with the specified tag. datasets match the query, the most recent one is returned.
```python
dataset = Dataset.get(
dataset_id=None,
dataset_project="Example Project",
dataset_name="Example Dataset",
dataset_tags="my tag",
dataset_version="1.2",
only_completed=True,
only_published=False,
)
```
Pass `auto_create=True`, and a dataset will be created on-the-fly with the input attributes (project name, dataset name,
and tags) if no datasets match the query.
When you use a dataset in a task (e.g. consuming a dataset), you can have its ID stored in the task's
hyperparameters: pass `alias=<dataset_alias_string>`, and the task using the dataset will store the dataset's ID in the
`dataset_alias_string` parameter under the `Datasets` hyperparameters section. This way you can easily track which
dataset the task is using. If you use `alias` with `overridable=True`, you can override the dataset ID from the UI's
**CONFIGURATION > HYPER PARAMETERS >** `Datasets` section, allowing you to change the dataset used when running a task
remotely.
To get a modifiable dataset, pass `writable_copy=True`. This returns a newly created mutable dataset with the current one as its
parent.
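
As a rough sketch of how the `alias`, `overridable`, and `writable_copy` options might be combined, the helper below collects the keyword arguments for `Dataset.get` and a second function performs the call. Both helper names are hypothetical, the check that `overridable` needs an `alias` is an assumption based on the description above, and `clearml` is imported lazily so that the argument-building part runs without a configured server.

```python
def build_get_kwargs(project, name, alias=None, overridable=False,
                     writable_copy=False):
    """Collect keyword arguments for Dataset.get (hypothetical helper)."""
    # Assumption: overriding from the UI only makes sense when the dataset
    # ID is stored under an alias in the task's hyperparameters.
    if overridable and alias is None:
        raise ValueError("overridable=True requires an alias")
    kwargs = {
        "dataset_project": project,
        "dataset_name": name,
        "writable_copy": writable_copy,
    }
    if alias is not None:
        kwargs["alias"] = alias
        kwargs["overridable"] = overridable
    return kwargs


def get_dataset(**kwargs):
    """Call Dataset.get; needs clearml installed and configured."""
    from clearml import Dataset  # imported lazily on purpose
    return Dataset.get(**kwargs)


kwargs = build_get_kwargs(
    "Example Project", "Example Dataset",
    alias="training_data", overridable=True)
print(kwargs)
```

Calling `get_dataset(**kwargs)` inside a task would then record the dataset ID under the `training_data` alias.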
Once a specific dataset object has been obtained, get a local copy of the dataset using one of the following options: Once a specific dataset object has been obtained, get a local copy of the dataset using one of the following options:
* [`Dataset.get_local_copy()`](../references/sdk/dataset.md#get_local_copy) - get a read-only local copy of an entire dataset. * [`Dataset.get_local_copy()`](../references/sdk/dataset.md#get_local_copy) - get a read-only local copy of an entire dataset.
@ -168,7 +200,8 @@ dataset.remove_files(dataset_path="*.csv", recursive=True)
To upload the dataset files to network storage, use the [`Dataset.upload`](../references/sdk/dataset.md#upload) method. To upload the dataset files to network storage, use the [`Dataset.upload`](../references/sdk/dataset.md#upload) method.
Use the `output_url` parameter to specify storage target, such as S3 / GS / Azure (e.g. `s3://bucket/data`, `gs://bucket/data`, `azure://bucket/data` , `/mnt/share/data`). Use the `output_url` parameter to specify storage target, such as S3 / GS / Azure (e.g. `s3://bucket/data`, `gs://bucket/data`, `azure://bucket/data` , `/mnt/share/data`).
By default, the dataset uploads to ClearML's file server. By default, the dataset uploads to ClearML's file server. This target storage overrides the `output_uri` value of the
[`Dataset.create`](#creating-datasets) method.
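
The precedence between the two storage settings can be summarized in a few lines. This sketch models only the rule stated above (an upload-time target wins over the creation-time one, and the file server is the fallback); `resolve_upload_target` and the `DEFAULT_TARGET` label are hypothetical and are not part of the ClearML SDK.

```python
# Placeholder label for the default destination, not a real URI.
DEFAULT_TARGET = "ClearML file server"


def resolve_upload_target(create_output_uri=None, upload_output_url=None):
    """Return the storage target that an upload would end up using."""
    # The target given at upload time overrides the one given at creation.
    if upload_output_url is not None:
        return upload_output_url
    if create_output_uri is not None:
        return create_output_uri
    # Neither was set: fall back to the default file server.
    return DEFAULT_TARGET


print(resolve_upload_target())
print(resolve_upload_target(create_output_uri="gs://bucket-name/folder"))
print(resolve_upload_target(create_output_uri="gs://bucket-name/folder",
                            upload_output_url="s3://bucket/data"))
```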
ClearML supports parallel uploading of datasets. Use the `max_workers` parameter to specify the number of threads to use ClearML supports parallel uploading of datasets. Use the `max_workers` parameter to specify the number of threads to use
when uploading the dataset. By default, it's the number of your machine's logical cores. when uploading the dataset. By default, it's the number of your machine's logical cores.
@ -192,3 +225,53 @@ to a specific folder's content changes. Specify the folder to sync with the `loc
This method is useful in the case where there's a single point of truth, either a local or network folder, that gets updated periodically. This method is useful in the case where there's a single point of truth, either a local or network folder, that gets updated periodically.
The folder changes will be reflected in a new dataset version. This method saves time since you don't have to manually The folder changes will be reflected in a new dataset version. This method saves time since you don't have to manually
update (add / remove) files in a dataset. update (add / remove) files in a dataset.
## Deleting Datasets
Delete a dataset using the [`Dataset.delete`](../references/sdk/dataset.md#datasetdelete) class method. Input any of the
attributes of the dataset(s) you want to delete, including ID, project name, version, and/or dataset name. Multiple
datasets matching the query will raise an exception, unless you pass `entire_dataset=True` and `force=True`. In this
case, all matching datasets will be deleted.
If a dataset is a parent of other datasets, you must pass `force=True` in order to delete it.
:::warning
Deleting a parent dataset may cause child datasets to lose data!
:::
```python
Dataset.delete(
dataset_id=None,
dataset_project="example project",
dataset_name="example dataset",
force=False,
dataset_version="3.0",
entire_dataset=False
)
```
## Renaming Datasets
Rename a dataset using the [`Dataset.rename`](../references/sdk/dataset.md#datasetrename) class method. All the datasets
with the given `dataset_project` and `dataset_name` will be renamed.
```python
Dataset.rename(
new_dataset_name="New name",
dataset_project="Example project",
dataset_name="Example dataset",
)
```
## Moving Datasets to Another Project
Move a dataset to another project using the [`Dataset.move_to_project`](../references/sdk/dataset.md#datasetmove_to_projetc)
class method. All the datasets with the given `dataset_project` and `dataset_name` will be moved to the new dataset
project.
```python
Dataset.move_to_project(
new_dataset_project="New project",
dataset_project="Example project",
dataset_name="Example dataset",
)
```