diff --git a/docs/clearml_data/clearml_data_cli.md b/docs/clearml_data/clearml_data_cli.md
index e549fe2a..26da4eed 100644
--- a/docs/clearml_data/clearml_data_cli.md
+++ b/docs/clearml_data/clearml_data_cli.md
@@ -17,7 +17,8 @@ Creates a new dataset.

```bash
clearml-data create [-h] [--parents [PARENTS [PARENTS ...]]] [--project PROJECT]
-                    --name NAME [--tags [TAGS [TAGS ...]]]
+                    --name NAME [--version VERSION] [--output-uri OUTPUT_URI]
+                    [--tags [TAGS [TAGS ...]]]
```

**Parameters**

@@ -28,7 +29,9 @@ clearml-data create [-h] [--parents [PARENTS [PARENTS ...]]] [--project PROJECT]
|---|---|---|
|`--name` |Dataset's name| No |
|`--project`|Dataset's project| No |
+|`--version` |Dataset version. If not specified, a version is automatically assigned | Yes |
|`--parents`|IDs of the dataset's parents. The dataset inherits all of its parents' content. Multiple parents can be entered, but they are merged in the order they were entered| Yes |
+|`--output-uri`| Sets where the dataset and its previews are uploaded| Yes|
|`--tags` |Dataset user tags. The dataset can be labeled, which can be useful for organizing datasets| Yes|
@@ -160,8 +163,8 @@ This command also uploads the data and finalizes the dataset automatically.

```bash
clearml-data sync [-h] [--id ID] [--dataset-folder DATASET_FOLDER] --folder FOLDER
                  [--parents [PARENTS [PARENTS ...]]] [--project PROJECT] [--name NAME]
-                  [--tags [TAGS [TAGS ...]]] [--storage STORAGE] [--skip-close]
-                  [--chunk-size CHUNK_SIZE] [--verbose]
+                  [--version VERSION] [--output-uri OUTPUT_URI] [--tags [TAGS [TAGS ...]]]
+                  [--storage STORAGE] [--skip-close] [--chunk-size CHUNK_SIZE] [--verbose]
```

**Parameters**

@@ -173,10 +176,11 @@ clearml-data sync [-h] [--id ID] [--dataset-folder DATASET_FOLDER] --folder FOLD
|`--id`| Dataset's ID. Default: previously created / accessed dataset| Yes |
|`--dataset-folder`|Dataset base folder to add the files to (default: Dataset root)|Yes|
|`--folder`|Local folder to sync. Wildcard selection is supported, for example: `~/data/*.jpg ~/data/json`|No|
-|`--storage`|Remote storage to use for the dataset files. Default: files_server |Yes|
+|`--storage`|Remote storage to use for the dataset files. Default: files server |Yes|
|`--parents`|IDs of the dataset's parents (i.e. merge all parents). All modifications made to the folder since the parents were synced will be reflected in the dataset|Yes|
|`--project`|If creating a new dataset, specify the dataset's project name|Yes|
|`--name`|If creating a new dataset, specify the dataset's name|Yes|
+|`--version`|Specify the dataset’s version. Default: `1.0.0`|Yes|
|`--tags`|Dataset user tags|Yes|
|`--skip-close`|Do not auto close dataset after syncing folders|Yes|
|`--chunk-size`| Set dataset artifact upload chunk size in MB. Default 512, (pass -1 for a single chunk). Example: 512, dataset will be split and uploaded in 512 MB chunks. |Yes|
@@ -191,7 +195,7 @@ clearml-data sync [-h] [--id ID] [--dataset-folder DATASET_FOLDER] --folder FOLD
List a dataset's contents.

```bash
-clearml-data list [-h] [--id ID] [--project PROJECT] [--name NAME]
+clearml-data list [-h] [--id ID] [--project PROJECT] [--name NAME] [--version VERSION]
                  [--filter [FILTER [FILTER ...]]] [--modified]
```

**Parameters**

@@ -204,6 +208,7 @@ clearml-data list [-h] [--id ID] [--project PROJECT] [--name NAME]
|`--id`|Dataset ID whose contents will be shown (alternatively, use project / name combination). Default: previously accessed dataset|Yes|
|`--project`|Specify dataset project name (if used instead of ID, dataset name is also required)|Yes|
|`--name`|Specify dataset name (if used instead of ID, dataset project is also required)|Yes|
+|`--version`|Specify dataset version. Default: most recent version |Yes|
|`--filter`|Filter files based on folder / wildcard. Multiple filters are supported. Example: `folder/date_*.json folder/sub-folder`|Yes|
|`--modified`|Only list file changes (add / remove / modify) introduced in this version|Yes|

@@ -211,25 +216,103 @@ clearml-data list [-h] [--id ID] [--project PROJECT] [--name NAME]
-## delete +## set-description -Delete an entire dataset from ClearML. This can also be used to delete a newly created dataset. - -This does not work on datasets with children. +Sets the description of an existing dataset. ```bash -clearml-data delete [-h] [--id ID] [--force] +clearml-data set-description [-h] [--id ID] [--description DESCRIPTION] ``` - **Parameters**
|Name|Description|Optional| |---|---|---| -|`--id`|ID of dataset to be deleted. Default: previously created / accessed dataset that hasn't been finalized yet|Yes| -|`--force`|Force dataset deletion even if other dataset versions depend on it|Yes|| +|`--id`|Dataset’s ID|No| +|`--description`|Description to be set|No| + + +
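+For example, a minimal invocation (the dataset ID and description below are placeholders):
+
+```bash
+# set a human-readable description on an existing dataset (ID is a placeholder)
+clearml-data set-description --id 24d05040f3e14fbfbed8edb1bf08a88c --description "City street images with annotations"
+```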
+ +
+
+
+## delete
+
+Deletes dataset(s). Pass any of the attributes of the dataset(s) you want to delete. If multiple datasets match the
+query, an exception is raised unless you pass `--entire-dataset` and `--force`; in that case, all matching datasets
+are deleted.
+
+If a dataset is a parent of other datasets, you must pass `--force` in order to delete it.
+
+:::warning
+Deleting a parent dataset may cause child datasets to lose data!
+:::
+
+```bash
+clearml-data delete [-h] [--id ID] [--project PROJECT] [--name NAME]
+                    [--version VERSION] [--force] [--entire-dataset]
+```
+
+**Parameters**
+
+
+|Name|Description|Optional|
+|---|---|---|
+|`--id`|ID of the dataset to delete (alternatively, use project / name combination)|Yes|
+|`--project`|Specify dataset project name (if used instead of ID, dataset name is also required)|Yes|
+|`--name`|Specify dataset name (if used instead of ID, dataset project is also required)|Yes|
+|`--version`|Specify dataset version|Yes|
+|`--force`|Force dataset deletion even if other dataset versions depend on it. Must also be used with the `--entire-dataset` flag|Yes|
+|`--entire-dataset`|Delete all found datasets|Yes|
+
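+For example, a typical invocation (the project, name, and version below are placeholders):
+
+```bash
+# delete one specific version of a dataset, selected by project / name
+clearml-data delete --project "Example project" --name "Example dataset" --version "1.0.0"
+```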
+ +
+ +## rename + +Rename a dataset (and all of its versions). + +```bash +clearml-data rename [-h] --new-name NEW_NAME --project PROJECT --name NAME +``` + +**Parameters** + +
+ +|Name|Description|Optional| +|---|---|---| +|`--new-name`|The new name of the dataset|No| +|`--project`|The project the dataset to be renamed belongs to|No| +|`--name`|The current name of the dataset(s) to be renamed|No| + +
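+For example (the project and dataset names below are placeholders):
+
+```bash
+# rename a dataset, selected by its current project and name
+clearml-data rename --project "Example project" --name "Example dataset" --new-name "Renamed dataset"
+```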
+ +
+
+
+## move
+
+Moves a dataset to another project.
+
+```bash
+clearml-data move [-h] --new-project NEW_PROJECT --project PROJECT --name NAME
+```
+
+**Parameters**
+
+
+|Name|Description|Optional|
+|---|---|---|
+|`--new-project`|The new project of the dataset|No|
+|`--project`|The current project the dataset to be moved belongs to|No|
+|`--name`|The name of the dataset to be moved|No|
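+
+For example (the project and dataset names below are placeholders):
+
+```bash
+# move a dataset, selected by its current project and name, to a new project
+clearml-data move --project "Example project" --name "Example dataset" --new-project "New project"
+```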

@@ -252,10 +335,10 @@ clearml-data search [-h] [--ids [IDS [IDS ...]]] [--project PROJECT]

|Name|Description|Optional|
|---|---|---|
-|`--ids`|A list of dataset IDs||
-|`--project`|The project name of the datasets||
-|`--name`|A dataset name or a partial name to filter datasets by||
-|`--tags`|A list of dataset user tags||
+|`--ids`|A list of dataset IDs|Yes|
+|`--project`|The project name of the datasets|Yes|
+|`--name`|A dataset name or a partial name to filter datasets by|Yes|
+|`--tags`|A list of dataset user tags|Yes|

diff --git a/docs/clearml_data/clearml_data_sdk.md b/docs/clearml_data/clearml_data_sdk.md
index 01201988..a3ca3409 100644
--- a/docs/clearml_data/clearml_data_sdk.md
+++ b/docs/clearml_data/clearml_data_sdk.md
@@ -36,7 +36,9 @@ preprocessing code and the resulting dataset are saved in a single task (see `us
dataset = Dataset.create(
  dataset_name='dataset name',
  dataset_project='dataset project',
-  parent_datasets=[PARENT_DS_ID_1, PARENT_DS_ID_2]
+  parent_datasets=[PARENT_DS_ID_1, PARENT_DS_ID_2],
+  dataset_version="1.0",
+  output_uri="gs://bucket-name/folder"
)
```

@@ -45,6 +47,11 @@ To locate a dataset's ID, go to the dataset task's info panel in the [WebApp](..
to the right of the dataset task name, click `ID` and the dataset ID appears
:::

+Use the `output_uri` parameter to specify a network storage target to which the dataset files and associated information
+(such as previews) are uploaded (e.g. `s3://bucket/data`, `gs://bucket/data`, `azure://bucket/data`, `file:///mnt/share/data`).
+By default, the dataset uploads to ClearML's file server. The `output_url` parameter of [`Dataset.upload`](#uploading-files)
+and the storage parameter of [`Dataset.sync_folder`](../references/sdk/dataset.md#sync_folder) override this parameter’s value.
+
The created dataset inherits the content of the `parent_datasets`. When multiple dataset parents are listed,
they are merged in order of specification. Each parent overrides any overlapping files from a previous parent dataset.

@@ -71,15 +78,40 @@ squashed_dataset_2 = Dataset.squash(
)
```

In addition, the target storage location for the squashed dataset can be specified using the `output_url` parameter of the
[`Dataset.squash`](../references/sdk/dataset.md#datasetsquash) method.

## Accessing Datasets
Once a dataset has been created and uploaded to a server, the dataset can be accessed programmatically from anywhere.
-Use the [`Dataset.get`](../references/sdk/dataset.md#datasetget) class method to access a specific Dataset object, either
-with the dataset's ID or with its project and name. If only a project name or tag is provided, the method returns the
-most recent dataset in the specified project, or the most recent dataset with the specified tag.
+Use the [`Dataset.get`](../references/sdk/dataset.md#datasetget) class method to access a specific Dataset object by
+providing any of the following dataset attributes: dataset ID, project, name, tags, and/or version. If multiple
+datasets match the query, the most recent one is returned.
+
+```python
+dataset = Dataset.get(
+    dataset_id=None,
+    dataset_project="Example Project",
+    dataset_name="Example Dataset",
+    dataset_tags="my tag",
+    dataset_version="1.2",
+    only_completed=True,
+    only_published=False,
+)
+```
+
+Pass `auto_create=True`, and a dataset will be created on the fly with the input attributes (project name, dataset name,
+and tags) if no datasets match the query.
+
+When you use a dataset in a task (e.g. consuming a dataset), you can have its ID stored in the task’s hyperparameters:
+pass `alias=<dataset_alias_string>`, and the task using the dataset will store the dataset’s ID in the
+`dataset_alias_string` parameter under the `Datasets` hyperparameters section. This way you can easily track which
+dataset the task is using. If you use `alias` with `overridable=True`, you can override the dataset ID from the UI’s
+**CONFIGURATION > HYPER PARAMETERS >** `Datasets` section, allowing you to change the dataset used when running a task
+remotely.
+
+To get a modifiable dataset, pass `writable_copy=True`. This returns a newly created, mutable dataset with the current
+dataset as its parent.

Once a specific dataset object has been obtained, get a local copy of the dataset using one of the following options:
* [`Dataset.get_local_copy()`](../references/sdk/dataset.md#get_local_copy) - get a read-only local copy of an entire dataset.
@@ -168,7 +200,8 @@ dataset.remove_files(dataset_path="*.csv", recursive=True)

To upload the dataset files to network storage, use the [`Dataset.upload`](../references/sdk/dataset.md#upload) method.
Use the `output_url` parameter to specify storage target, such as S3 / GS / Azure (e.g. `s3://bucket/data`, `gs://bucket/data`, `azure://bucket/data` , `/mnt/share/data`).
-By default, the dataset uploads to ClearML's file server.
+By default, the dataset uploads to ClearML's file server. This target storage overrides the `output_uri` value of the
+[`Dataset.create`](#creating-datasets) method.

ClearML supports parallel uploading of datasets. Use the `max_workers` parameter to specify the number of threads to use
when uploading the dataset. By default, it’s the number of your machine’s logical cores.
@@ -192,3 +225,53 @@ to a specific folder's content changes. Specify the folder to sync with the `loc

This method is useful in the case where there's a single point of truth, either a local or network folder, that gets updated periodically.
The folder changes will be reflected in a new dataset version. This method saves time since you don't have to manually
update (add / remove) files in a dataset.
+
+## Deleting Datasets
+Delete a dataset using the [`Dataset.delete`](../references/sdk/dataset.md#datasetdelete) class method. Pass any of the
+attributes of the dataset(s) you want to delete, including ID, project name, version, and/or dataset name. If multiple
+datasets match the query, an exception is raised unless you pass `entire_dataset=True` and `force=True`; in that case,
+all matching datasets are deleted.
+
+If a dataset is a parent of other datasets, you must pass `force=True` in order to delete it.
+
+:::warning
+Deleting a parent dataset may cause child datasets to lose data!
+:::
+
+```python
+Dataset.delete(
+    dataset_id=None,
+    dataset_project="example project",
+    dataset_name="example dataset",
+    force=False,
+    dataset_version="3.0",
+    entire_dataset=False
+)
+```
+
+## Renaming Datasets
+Rename a dataset using the [`Dataset.rename`](../references/sdk/dataset.md#datasetrename) class method. All the datasets
+with the given `dataset_project` and `dataset_name` will be renamed.
+
+```python
+Dataset.rename(
+    new_dataset_name="New name",
+    dataset_project="Example project",
+    dataset_name="Example dataset",
+)
+```
+
+## Moving Datasets to Another Project
+Move a dataset to another project using the [`Dataset.move_to_project`](../references/sdk/dataset.md#datasetmove_to_project)
+class method. All the datasets with the given `dataset_project` and `dataset_name` will be moved to the new dataset
+project.
+
+```python
+Dataset.move_to_project(
+    new_dataset_project="New project",
+    dataset_project="Example project",
+    dataset_name="Example dataset",
+)
+```