mirror of
https://github.com/clearml/clearml-docs
synced 2025-01-31 06:27:22 +00:00
410 lines
14 KiB
Markdown
410 lines
14 KiB
Markdown
---
|
|
title: Datasets and Dataset Versions
|
|
---
|
|
|
|
ClearML Enterprise's **Datasets** and **Dataset versions** provide the internal data structure
|
|
and functionality for the following purposes:
|
|
* Connecting source data to the ClearML Enterprise platform
|
|
* Using ClearML Enterprise's Git-like [Dataset versioning](#dataset-versioning)
|
|
* Integrating the powerful features of [Dataviews](dataviews.md) with an experiment
|
|
* [Annotating](webapp/webapp_datasets_frames.md#annotations) images and videos
|
|
|
|
Datasets consist of versions with SingleFrames and/or FrameGroups. Each Dataset can contain multiple versions, which
|
|
can have multiple children that inherit their parent's contents.
|
|
|
|
Mask-labels are defined at the DatasetVersion level, and are applied to all masks in a DatasetVersion.
|
|
|
|
## Example Datasets
|
|
|
|
ClearML Enterprise provides Example Datasets complete with frames,
|
|
and ready for your experimentation. Find these example Datasets in the ClearML Enterprise WebApp (UI). They appear
|
|
with an "Example" banner in the WebApp (UI).
|
|
|
|
## Usage
|
|
|
|
### Creating Datasets
|
|
|
|
Use the [`Dataset.create`](../references/hyperdataset/hyperdataset.md#datasetcreate) method to create a Dataset. It will
|
|
contain an empty version named `Current`.
|
|
|
|
```python
|
|
from allegroai import Dataset
|
|
|
|
myDataset = Dataset.create(dataset_name='myDataset')
|
|
```
|
|
|
|
Or, use the [`DatasetVersion.create_new_dataset`](../references/hyperdataset/hyperdatasetversion.md#datasetversioncreate_new_dataset)
|
|
method.
|
|
|
|
```python
|
|
from allegroai import DatasetVersion
|
|
|
|
myDataset = DatasetVersion.create_new_dataset(dataset_name='myDataset Two')
|
|
```
|
|
|
|
When creating a dataset, you can put it into a project. In this case, the dataset will adhere to the [access rules](../webapp/settings/webapp_settings_access_rules.md)
|
|
specified for its containing project. Use `dataset_project` parameter in `Dataset.create` or `DatasetVersion.create_new_dataset`
|
|
to specify a project name.
|
|
|
|
```python
|
|
myDataset_1 = Dataset.create(dataset_name="myDataset", dataset_project="myDataset Project")
|
|
|
|
myDataset_2 = DatasetVersion.create_new_dataset(
|
|
dataset_name="myDataset_2", dataset_project="myDatasetProject_2"
|
|
)
|
|
```
|
|
|
|
To raise a `ValueError` exception if the Dataset exists, specify the `raise_if_exists` parameters as `True`.
|
|
|
|
* With `Dataset.create`:
|
|
```python
|
|
try:
|
|
myDataset = Dataset.create(dataset_name='myDataset One', raise_if_exists=True)
|
|
except ValueError:
|
|
print('Dataset exists.')
|
|
```
|
|
|
|
* Or with `DatasetVersion.create_new_dataset`:
|
|
|
|
```python
|
|
try:
|
|
myDataset = DatasetVersion.create_new_dataset(dataset_name='myDataset Two', raise_if_exists=True)
|
|
except ValueError:
|
|
print('Dataset exists.')
|
|
```
|
|
|
|
Additionally, create a Dataset with tags and a description.
|
|
|
|
```python
|
|
myDataset = DatasetVersion.create_new_dataset(
|
|
dataset_name='myDataset',
|
|
tags=['One Tag', 'Another Tag', 'And one more tag'],
|
|
description='some description text'
|
|
)
|
|
```
|
|
|
|
### Accessing Current Dataset
|
|
|
|
To get the current Dataset, use the [`DatasetVersion.get_current`](../references/hyperdataset/hyperdatasetversion.md#datasetversionget_current)
|
|
method.
|
|
|
|
```python
|
|
myDataset = DatasetVersion.get_current(dataset_name='myDataset')
|
|
```
|
|
|
|
### Deleting Datasets
|
|
|
|
Use the [`Dataset.delete`](../references/hyperdataset/hyperdataset.md#datasetdelete) class method to delete a Dataset:
|
|
|
|
* Delete an empty Dataset (no versions):
|
|
|
|
```python
|
|
Dataset.delete(dataset_name='MyDataset', delete_all_versions=False, force=False)
|
|
```
|
|
|
|
* Delete a Dataset containing only versions whose status is *Draft*:
|
|
|
|
```python
|
|
Dataset.delete(dataset_name='MyDataset', delete_all_versions=True, force=False)
|
|
```
|
|
|
|
* Delete a Dataset even if it contains versions whose status is *Published*:
|
|
|
|
```python
|
|
Dataset.delete(dataset_name='MyDataset', delete_all_versions=True, force=True)
|
|
```
|
|
|
|
* Delete a Dataset and the sources associated with its deleted frames:
|
|
|
|
```python
|
|
Dataset.delete(
|
|
dataset_name='MyDataset', delete_all_versions=True, force=True, delete_sources=True
|
|
)
|
|
```
|
|
|
|
This supports deleting sources located in AWS S3, GCP, and Azure Storage (not local storage). The `delete_sources`
|
|
parameter is ignored if `delete_all_versions` is `False`. You can view the deletion process' progress by passing
|
|
`show_progress=True` (`tqdm` required).
|
|
|
|
### Tagging Datasets
|
|
|
|
Tags can be added to datasets, allowing to easily identify and group experiments.
|
|
|
|
Add tags to a dataset:
|
|
```python
|
|
MyDataset.add_tags(["coco", "dogs"])
|
|
```
|
|
|
|
Remove tags from a dataset:
|
|
|
|
```python
|
|
MyDataset.remove_tags(["dogs"])
|
|
```
|
|
|
|
## Dataset Versioning
|
|
|
|
Dataset versioning refers to the group of ClearML Enterprise SDK and WebApp (UI) features for creating,
|
|
modifying, and deleting Dataset versions.
|
|
|
|
ClearML Enterprise supports simple and advanced Dataset versioning paradigms. A **simple version structure** consists of
|
|
a single evolving version, with historic static snapshots. Continuously push your changes to your single dataset version,
|
|
and take a snapshot to record the content of your dataset at a specific point in time.
|
|
|
|
You can, alternatively, employ any **advanced structure**, where each version evolves in parallel, and you control which
|
|
versions are locked for further changes and which can be modified. See details [below](#dataset-version-structure).
|
|
|
|
## Dataset Version State
|
|
|
|
Dataset versions can have either *Draft* or *Published* state.
|
|
|
|
A *Draft* version is editable, so frames can be added to and deleted and/or modified.
|
|
|
|
A *Published* version is read-only, which ensures reproducible experiments and preserves the Dataset version contents.
|
|
Child versions can only be created from *Published* versions, as they inherit their predecessor version contents.
|
|
|
|
## Dataset Version Structure
|
|
To implement a simple version structure, where the dataset is ever evolving, with a linear set of historic snapshots,
|
|
a parent version can have one and only one child, with the last child in the Dataset versions tree in *Draft* state.
|
|
Different version structures, such as where at least one parent has more than one child, or the single last child in the
|
|
Dataset versions tree is *Published* are considered advanced version structures.
|
|
|
|
For details about programmatically implementing simple and advanced version structures, see [Creating Snapshots](#creating-snapshots)
|
|
and [Creating Child Versions](#creating-child-versions) respectively below.
|
|
|
|
|
|
## DatasetVersion Usage
|
|
|
|
Manage Dataset versioning using the DatasetVersion class in the ClearML Enterprise SDK.
|
|
|
|
### Creating Snapshots
|
|
|
|
If the Dataset contains only one version whose status is *Draft*, snapshots of the current version can be created.
|
|
When creating a snapshot, the current version becomes the snapshot (it keeps the same version ID),
|
|
and the newly created version (with its new version ID) becomes the current version.
|
|
|
|
To create a snapshot, use the `DatasetVersion.create_snapshot` method.
|
|
|
|
|
|
#### Snapshot Naming
|
|
|
|
In the simple version structure, ClearML Enterprise supports two methods for snapshot naming:
|
|
* **Timestamp naming** - If only the Dataset name or ID is provided, the snapshot is named `snapshot` with a timestamp
|
|
appended.
|
|
The timestamp format is ISO 8601 (`YYYY-MM-DDTHH:mm:ss.SSSSSS`). For example, `snapshot 2020-03-26T16:55:38.441671`.
|
|
|
|
**Example:**
|
|
```python
|
|
from allegroai import DatasetVersion
|
|
|
|
myDataset = DatasetVersion.create_snapshot(dataset_name='MyDataset')
|
|
```
|
|
|
|
After the statement above runs, the previous current version keeps its existing version ID, and it becomes a
|
|
snapshot named `snapshot` with a timestamp appended. The newly created version with a new version ID becomes
|
|
the current version, and its name is `Current`.
|
|
|
|
* **User-specified snapshot naming** - If the `publish_name` parameter is provided, it will be the name of the snapshot name.
|
|
|
|
**Example:**
|
|
```python
|
|
myDataset = DatasetVersion.create_snapshot(dataset_name='MyDataset', publish_name='NewSnapshotName')
|
|
```
|
|
After the above statement runs, the previous current version keeps its existing version ID and becomes a snapshot named
|
|
`NewSnapshotName`.
|
|
The newly created version (with a new version ID) becomes the current version, and its name is `Current`.
|
|
|
|
|
|
#### Current Version Naming
|
|
|
|
In the simple version structure, ClearML Enterprise supports two methods for current version naming:
|
|
|
|
* **Default naming** - If the `child_name` parameter is not provided, `Current` is the current version name.
|
|
* **User-specified current version naming** - If the `child_name` parameter is provided, that child name becomes the current
|
|
version name.
|
|
|
|
For example, after the following statement runs, the previous current version keeps its existing version ID and becomes
|
|
a snapshot named `snapshot` with the timestamp appended.
|
|
The newly created version (with a new version ID) is the current version, and its name is `NewCurrentVersionName`.
|
|
|
|
```python
|
|
myDataset = DatasetVersion.create_snapshot(
|
|
dataset_name='MyDataset',
|
|
child_name='NewCurrentVersionName'
|
|
)
|
|
```
|
|
|
|
#### Adding Metadata and Comments
|
|
|
|
Add a metadata dictionary and/or comment to a snapshot.
|
|
|
|
For example:
|
|
|
|
```python
|
|
myDataset = DatasetVersion.create_snapshot(
|
|
dataset_name='MyDataset',
|
|
child_metadata={'abc':'1234','def':'5678'},
|
|
child_comment='some text comment'
|
|
)
|
|
```
|
|
|
|
### Creating Child Versions
|
|
|
|
Create a new version from any version whose status is *Published*.
|
|
|
|
To create a new version, call the [`DatasetVersion.create_version`](../references/hyperdataset/hyperdataset.md#datasetversioncreate_version)
|
|
method, and provide:
|
|
* Either the Dataset name or ID
|
|
* The parent version name or ID from which the child inherits frames
|
|
* The new version's name.
|
|
|
|
For example, create a new version named `NewChildVersion` from the existing version `PublishedVersion`,
|
|
where the new version inherits the frames of the existing version. If `NewChildVersion` already exists,
|
|
it is returned.
|
|
|
|
```python
|
|
myVersion = DatasetVersion.create_version(
|
|
dataset_name='MyDataset',
|
|
parent_version_names=['PublishedVersion'],
|
|
version_name='NewChildVersion'
|
|
)
|
|
```
|
|
|
|
To raise a ValueError exception if `NewChildVersion` exists, set `raise_if_exists` to `True`.
|
|
|
|
```python
|
|
myVersion = DatasetVersion.create_version(
|
|
dataset_name='MyDataset',
|
|
parent_version_names=['PublishedVersion'],
|
|
version_name='NewChildVersion',
|
|
raise_if_exists=True
|
|
)
|
|
```
|
|
|
|
### Creating Root-level Parent Versions
|
|
|
|
Create a new version at the root-level. This is a version without a parent, and it contains no frames.
|
|
|
|
```python
|
|
myDataset = DatasetVersion.create_version(
|
|
dataset_name='MyDataset',
|
|
version_name='NewRootVersion'
|
|
)
|
|
```
|
|
|
|
### Getting Versions
|
|
|
|
To get a version or versions, use the [`DatasetVersion.get_version`](../references/hyperdataset/hyperdatasetversion.md#datasetversionget_version)
|
|
and [`DatasetVersion.get_versions`](../references/hyperdataset/hyperdatasetversion.md#datasetversionget_versions)
|
|
methods, respectively.
|
|
|
|
**Getting a list of all versions**
|
|
|
|
```python
|
|
myDatasetversion = DatasetVersion.get_versions(dataset_name='MyDataset')
|
|
```
|
|
|
|
**Getting a list of all _published_ versions**
|
|
|
|
```python
|
|
myDatasetversion = DatasetVersion.get_versions(
|
|
dataset_name='MyDataset',
|
|
only_published=True
|
|
)
|
|
```
|
|
|
|
**Getting a list of all _drafts_ versions**
|
|
|
|
```python
|
|
myDatasetversion = DatasetVersion.get_versions(
|
|
dataset_name='MyDataset',
|
|
only_draft=True
|
|
)
|
|
```
|
|
|
|
**Getting the current version**
|
|
|
|
If more than one version exists, ClearML Enterprise outputs a warning.
|
|
|
|
```python
|
|
myDatasetversion = DatasetVersion.get_version(dataset_name='MyDataset')
|
|
```
|
|
|
|
**Getting a specific version**
|
|
|
|
```python
|
|
myDatasetversion = DatasetVersion.get_version(
|
|
dataset_name='MyDataset',
|
|
version_name='VersionName'
|
|
)
|
|
```
|
|
|
|
### Deleting Versions
|
|
|
|
Delete versions which are status *Draft* using the [`Dataset.delete_version`](../references/hyperdataset/hyperdataset.md#delete_version)
|
|
method.
|
|
|
|
```python
|
|
from allegroai import Dataset
|
|
|
|
myDataset = Dataset.get(dataset_name='MyDataset')
|
|
myDataset.delete_version(version_name='VersionToDelete')
|
|
```
|
|
|
|
|
|
### Publishing Versions
|
|
|
|
Publish (make read-only) versions which are status *Draft* using the [`DatasetVersion.publish_version`](../references/hyperdataset/hyperdatasetversion.md#publish_version)
|
|
method. This includes the current version, if the Dataset is in the simple version structure.
|
|
|
|
```python
|
|
myVersion = DatasetVersion.get_version(
|
|
dataset_name='MyDataset',
|
|
version_name='VersionToPublish'
|
|
)
|
|
|
|
myVersion.publish_version()
|
|
```
|
|
|
|
### Managing Version Mask-labels
|
|
|
|
#### Setting Version Mask-label Mapping
|
|
|
|
In order to visualize masks in a dataset version, the mask values need to be mapped to their labels. Mask-label
|
|
mapping is stored in a version's metadata.
|
|
|
|
To define the DatasetVersion level mask-label mapping, use the [`DatasetVersion.set_masks_labels`](../references/hyperdataset/hyperdatasetversion.md#set_masks_labels)
|
|
method, and input a dictionary of RGB-value tuple keys and label-list values.
|
|
|
|
```python
|
|
from allegroai import DatasetVersion
|
|
|
|
# Getting a version
|
|
myDatasetversion = DatasetVersion.get_version(dataset_name='MyDataset',
|
|
version_name='VersionName')
|
|
|
|
# Mapping out colors and labels of masks
|
|
myDatasetversion.set_masks_labels(
|
|
{
|
|
(0, 0, 0): ["background"],
|
|
(1, 1, 1): ["person", "sitting"],
|
|
(2, 2, 2): ["cat"],
|
|
}
|
|
)
|
|
```
|
|
|
|
#### Accessing Version Mask-label Mapping
|
|
|
|
The mask values and labels are stored as a property in a dataset version's metadata.
|
|
|
|
```python
|
|
mapping = myDatasetversion.get_metadata()['mask_labels']
|
|
|
|
print(mapping)
|
|
```
|
|
|
|
This should return a dictionary of the version's masks and labels, which should look something like this:
|
|
|
|
```python
|
|
{'_all_': [{'value': [0, 0, 0], 'labels': ['background']}, {'value': [1, 1, 1], 'labels': ['person', 'sitting']}, {'value': [2, 2, 2], 'labels': ['cat']}]}
|
|
```
|