---
title: Datasets and Dataset Versions
---
ClearML Enterprise's **Datasets** and **Dataset versions** provide the internal data structure
and functionality for the following purposes:
* Connecting source data to the **ClearML Enterprise** platform
* Using **ClearML Enterprise**'s GIT-like [Dataset versioning](#dataset-versioning)
* Integrating the powerful features of [Dataviews](dataviews.md) with an experiment
* [Annotating](webapp/webapp_datasets_frames.md#annotations) images and videos

Datasets consist of versions with SingleFrames and / or FrameGroups. Each Dataset can contain multiple versions, where
each version can have multiple children that inherit their parent's SingleFrames and / or FrameGroups. This inheritance
includes the frame metadata and data connecting the source data to the ClearML Enterprise platform, as well as the other
metadata and data.
These parent-child version relationships can be represented as version trees with a root-level parent. A Dataset
can contain one or more trees.
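The inheritance described above can be pictured with a small stand-alone model. This is only an illustrative sketch: `ToyVersion` and its fields are hypothetical and are not part of the ClearML Enterprise SDK, and it assumes a child sees its parent's frames followed by its own.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyVersion:
    """Hypothetical stand-in for a Dataset version (not an SDK class)."""
    name: str
    parent: Optional["ToyVersion"] = None
    own_frames: List[str] = field(default_factory=list)

    def frames(self) -> List[str]:
        # A child inherits every frame along its parent chain, then adds its own.
        inherited = self.parent.frames() if self.parent else []
        return inherited + self.own_frames


# One version tree: a root-level parent with two children.
root = ToyVersion("root", own_frames=["frame_001", "frame_002"])
child_a = ToyVersion("child_a", parent=root, own_frames=["frame_003"])
child_b = ToyVersion("child_b", parent=root)

print(child_a.frames())
print(child_b.frames())
```

Here `child_b` adds nothing of its own, so it exposes exactly the frames of `root`.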
## Dataset version state
Dataset versions can have either **Draft** or **Published** status.
A **Draft** version is editable, so frames can be added to, deleted from, and / or modified in the Dataset.
A **Published** version is read-only, which ensures reproducible experiments and preserves a version of a Dataset.
Child versions can only be created from *Published* versions. To create a child of a *Draft* Dataset version,
it must be published first.
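The state rules can be summarized in a few lines of plain Python. The class below is a hypothetical model written for this page, not SDK code; it only encodes the two rules stated above (a *Published* version is read-only, and a child requires a *Published* parent).

```python
class StatefulToyVersion:
    """Hypothetical model of the Draft / Published rules (not an SDK class)."""

    def __init__(self, name, parent=None):
        if parent is not None and parent.status != "Published":
            # Children can only be created from Published versions.
            raise ValueError(f"parent '{parent.name}' must be published first")
        self.name = name
        self.parent = parent
        self.status = "Draft"
        self.frames = []

    def add_frame(self, frame):
        if self.status == "Published":
            raise ValueError(f"version '{self.name}' is read-only")
        self.frames.append(frame)

    def publish(self):
        self.status = "Published"


v1 = StatefulToyVersion("v1")
v1.add_frame("frame_001")  # allowed: v1 is a Draft

try:
    StatefulToyVersion("v2", parent=v1)  # rejected: v1 is still a Draft
except ValueError as err:
    print(err)

v1.publish()
v2 = StatefulToyVersion("v2", parent=v1)  # allowed once v1 is Published
```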
## Example Datasets
**ClearML Enterprise** provides Example Datasets, available in the **ClearML Enterprise** platform, with frames already built
and ready for your experimentation. Find these example Datasets in the **ClearML Enterprise** WebApp (UI), where they appear
with an "Example" banner.
## Usage
### Creating Datasets
Use the [Dataset.create](google.com) method to create a Dataset. It will contain an empty version named `Current`.
```python
from allegroai import Dataset
myDataset = Dataset.create(dataset_name='myDataset')
```
Or, use the [DatasetVersion.create_new_dataset](google.com) method.
```python
from allegroai import DatasetVersion
myDataset = DatasetVersion.create_new_dataset(dataset_name='myDataset Two')
```
To raise a `ValueError` exception if the Dataset exists, set the `raise_if_exists` parameter to `True`.
* With `Dataset.create`
```python
try:
    myDataset = Dataset.create(dataset_name='myDataset One', raise_if_exists=True)
except ValueError:
    print('Dataset exists.')
```
* Or with `DatasetVersion.create_new_dataset`
```python
try:
    myDataset = DatasetVersion.create_new_dataset(dataset_name='myDataset Two', raise_if_exists=True)
except ValueError:
    print('Dataset exists.')
```
Additionally, create a Dataset with tags and a description.
```python
myDataset = DatasetVersion.create_new_dataset(
    dataset_name='myDataset',
    tags=['One Tag', 'Another Tag', 'And one more tag'],
    description='some description text'
)
```
### Accessing current Dataset
To get the current Dataset, use the `DatasetVersion.get_current` method.
```python
myDataset = DatasetVersion.get_current(dataset_name='myDataset')
```
### Deleting Datasets
Use the `Dataset.delete` method to delete a Dataset.
Delete an empty Dataset (no versions).
```python
Dataset.delete(dataset_name='MyDataset', delete_all_versions=False, force=False)
```
Delete a Dataset containing only versions whose status is *Draft*.
```python
Dataset.delete(dataset_name='MyDataset', delete_all_versions=True, force=False)
```
Delete a Dataset even if it contains versions whose status is *Published*.
```python
Dataset.delete(dataset_name='MyDataset', delete_all_versions=True, force=True)
```
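The three calls above suggest a simple decision rule for the two flags. The function below is a reading of those examples, written as a hypothetical stand-alone check; it does not call the SDK, and the real `Dataset.delete` behavior (including what it does when a deletion is refused) may differ.

```python
def delete_allowed(version_statuses, delete_all_versions=False, force=False):
    """Hypothetical check mirroring the three cases above (not SDK code).

    `version_statuses` is a list such as ["Draft", "Published"].
    """
    if not version_statuses:
        return True  # empty Dataset: nothing blocks the deletion
    if not delete_all_versions:
        return False  # versions exist but were not included in the deletion
    if "Published" in version_statuses:
        return force  # Published versions additionally require force=True
    return True  # only Draft versions remain


print(delete_allowed([]))
print(delete_allowed(["Draft"], delete_all_versions=True))
print(delete_allowed(["Draft", "Published"], delete_all_versions=True, force=True))
```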
## Dataset Versioning
Dataset versioning refers to the group of **ClearML Enterprise** SDK and WebApp (UI) features for creating,
modifying, and deleting Dataset versions.
**ClearML Enterprise** supports two kinds of Dataset versioning: **simple version structures** and
**advanced version structures**.
In a **simple version structure**, a parent can have one and only one child, and the last child in the Dataset versions
tree must be a *Draft*. This simple structure allows working with a single set of versions of a Dataset. Create children
and publish versions to preserve data history. Each version whose status is *Published* in a simple version structure is
referred to as a **snapshot**.
In an **advanced version structure**, at least one parent has more than one child (this can include more than one parent
version at the root level), or the last child in the Dataset versions tree is *Published*.
Creating a version in a simple version structure may convert it to an advanced structure. This happens when creating
a Dataset version that yields a parent with two children, or when publishing the last child version.
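The distinction between the two structures can be expressed as a small classifier. This is a sketch under stated assumptions: `structure_kind` and the tuple layout are hypothetical, the input is a non-empty, acyclic list of versions, and nothing here is part of the SDK.

```python
from collections import Counter


def structure_kind(versions):
    """Classify a version list as 'simple' or 'advanced'.

    `versions` is a non-empty list of (name, parent_name_or_None, status) tuples.
    Hypothetical helper for illustration; not part of the SDK.
    """
    children_per_parent = Counter(parent for _, parent, _ in versions)
    # More than one child under any parent (None counts root-level versions).
    if any(count > 1 for count in children_per_parent.values()):
        return "advanced"
    # In a single chain, the last child is the one version that is nobody's parent.
    parents = {parent for _, parent, _ in versions}
    leaf_status = next(status for name, _, status in versions if name not in parents)
    return "simple" if leaf_status == "Draft" else "advanced"


chain = [("v1", None, "Published"), ("v2", "v1", "Published"), ("Current", "v2", "Draft")]
published_leaf = [("v1", None, "Published"), ("v2", "v1", "Published")]
branched = [("v1", None, "Published"), ("a", "v1", "Draft"), ("b", "v1", "Draft")]
two_roots = [("a", None, "Draft"), ("b", None, "Draft")]

for example in (chain, published_leaf, branched, two_roots):
    print(structure_kind(example))
```

Only `chain` is a simple structure; the other three each violate one of the two conditions.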
## Versioning Usage
Manage Dataset versioning using the [DatasetVersion](google.com) class in the ClearML Enterprise SDK.
### Creating snapshots
If the Dataset contains only one version whose status is *Draft*, snapshots of the current version can be created.
When creating a snapshot, the current version becomes the snapshot (it keeps the same version ID),
and the newly created version (with its new version ID) becomes the current version.
To create a snapshot, use the [DatasetVersion.create_snapshot](google.com) method.
#### Snapshot naming
In the simple version structure, ClearML Enterprise supports two methods for snapshot naming:
* **Timestamp naming** - If only the Dataset name or ID is provided, the snapshot is named `snapshot` with a timestamp
appended.
The timestamp format is ISO 8601 (`YYYY-MM-DDTHH:mm:ss.SSSSSS`). For example, `snapshot 2020-03-26T16:55:38.441671`.
**Example:**
```python
from allegroai import DatasetVersion
myDataset = DatasetVersion.create_snapshot(dataset_name='MyDataset')
```
After the statement above runs, the previous current version keeps its existing version ID, and it becomes a
snapshot named `snapshot` with a timestamp appended. The newly created version with a new version ID becomes
the current version, and its name is `Current`.
* **User-specified snapshot naming** - If the `publish_name` parameter is provided, it is used as the snapshot name.
**Example:**
```python
myDataset = DatasetVersion.create_snapshot(dataset_name='MyDataset', publish_name='NewSnapshotName')
```
After the above statement runs, the previous current version keeps its existing version ID and becomes a snapshot named
`NewSnapshotName`.
The newly created version (with a new version ID) becomes the current version, and its name is `Current`.
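The two naming rules can be reproduced with the standard library alone. The helper below is a hypothetical illustration of the rule, not the SDK's implementation; the `now` parameter exists only so the example is repeatable.

```python
from datetime import datetime


def snapshot_name(publish_name=None, now=None):
    """Hypothetical helper reproducing the naming rule described above."""
    if publish_name:
        return publish_name  # user-specified snapshot naming
    now = now or datetime.now()
    # timespec="microseconds" always yields YYYY-MM-DDTHH:MM:SS.ffffff
    return f"snapshot {now.isoformat(timespec='microseconds')}"


print(snapshot_name(now=datetime(2020, 3, 26, 16, 55, 38, 441671)))
# snapshot 2020-03-26T16:55:38.441671
print(snapshot_name(publish_name="NewSnapshotName"))
# NewSnapshotName
```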
#### Current version naming
In the simple version structure, ClearML Enterprise supports two methods for current version naming:
* **Default naming** - If the `child_name` parameter is not provided, `Current` is the current version name.
* **User-specified current version naming** - If the `child_name` parameter is provided, that child name becomes the current
version name.
For example, after the following statement runs, the previous current version keeps its existing version ID and becomes
a snapshot named `snapshot` with the timestamp appended.
The newly created version (with a new version ID) is the current version, and its name is `NewCurrentVersionName`.
```python
myDataset = DatasetVersion.create_snapshot(
    dataset_name='MyDataset',
    child_name='NewCurrentVersionName'
)
```
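To make the ID and name bookkeeping concrete, here is a toy model of the snapshot step. Everything in it is hypothetical (the dict layout, `toy_create_snapshot`, and the integer IDs); it assumes a simple version structure in which the last list item is the current version, and it does not use the SDK.

```python
from itertools import count

_next_id = count(1)  # hypothetical ID generator; real version IDs come from the platform


def toy_create_snapshot(versions, publish_name, child_name="Current"):
    """Toy model of the snapshot step on a list of version dicts (not SDK code)."""
    current = versions[-1]
    # The current version keeps its ID, takes the snapshot name, and becomes read-only.
    current["name"] = publish_name
    current["status"] = "Published"
    # A new Draft child with a new ID becomes the current version.
    child = {"id": next(_next_id), "name": child_name,
             "status": "Draft", "parent": current["id"]}
    versions.append(child)
    return child


versions = [{"id": next(_next_id), "name": "Current", "status": "Draft", "parent": None}]
new_current = toy_create_snapshot(versions, "NewSnapshotName",
                                  child_name="NewCurrentVersionName")
print(versions[0])
print(new_current)
```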
#### Adding metadata and comments
Add a metadata dictionary and / or comment to a snapshot.
For example:
```python
myDataset = DatasetVersion.create_snapshot(
    dataset_name='MyDataset',
    child_metadata={'abc': '1234', 'def': '5678'},
    child_comment='some text comment'
)
```
### Creating child versions
Create a new version from any version whose status is *Published*.
To create a new version, call the [DatasetVersion.create_version](google.com) method, and
provide:
* Either the Dataset name or ID
* The parent version name or ID from which the child inherits frames
* The new version's name

For example, create a new version named `NewChildVersion` from the existing version `PublishedVersion`,
where the new version inherits the frames of the existing version. If `NewChildVersion` already exists,
it is returned.
```python
myVersion = DatasetVersion.create_version(
    dataset_name='MyDataset',
    parent_version_names=['PublishedVersion'],
    version_name='NewChildVersion'
)
```
To raise a `ValueError` exception if `NewChildVersion` exists, set `raise_if_exists` to `True`.
```python
myVersion = DatasetVersion.create_version(
    dataset_name='MyDataset',
    parent_version_names=['PublishedVersion'],
    version_name='NewChildVersion',
    raise_if_exists=True
)
```
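The difference between the two calls above is a get-or-create pattern versus a strict create. The following stand-alone sketch models that behavior with a plain dictionary; `toy_create_version` and the dict layout are hypothetical and only mirror the semantics described in this section.

```python
def toy_create_version(existing, version_name, raise_if_exists=False):
    """Hypothetical get-or-create model of the behavior described above (not SDK code)."""
    if version_name in existing:
        if raise_if_exists:
            raise ValueError(f"version '{version_name}' already exists")
        return existing[version_name]  # default: return the existing version
    existing[version_name] = {"name": version_name, "status": "Draft"}
    return existing[version_name]


store = {}
first = toy_create_version(store, "NewChildVersion")
second = toy_create_version(store, "NewChildVersion")  # returns the same object

try:
    toy_create_version(store, "NewChildVersion", raise_if_exists=True)
except ValueError as err:
    print(err)
```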
### Creating root-level parent versions
Create a new version at the root-level. This is a version without a parent, and it contains no frames.
```python
myDataset = DatasetVersion.create_version(
    dataset_name='MyDataset',
    version_name='NewRootVersion'
)
```
### Getting versions
To get a version or versions, use the [DatasetVersion.get_version](google.com) and [DatasetVersion.get_versions](google.com)
methods, respectively.
**Getting a list of all versions**
```python
myDatasetversion = DatasetVersion.get_versions(dataset_name='MyDataset')
```
**Getting a list of all _published_ versions**
```python
myDatasetversion = DatasetVersion.get_versions(
    dataset_name='MyDataset',
    only_published=True
)
```
**Getting a list of all _draft_ versions**
```python
myDatasetversion = DatasetVersion.get_versions(
    dataset_name='MyDataset',
    only_draft=True
)
```
**Getting the current version**
If more than one version exists, ClearML Enterprise outputs a warning.
```python
myDatasetversion = DatasetVersion.get_version(dataset_name='MyDataset')
```
**Getting a specific version**
```python
myDatasetversion = DatasetVersion.get_version(
    dataset_name='MyDataset',
    version_name='VersionName'
)
```
### Deleting versions
Delete versions whose status is *Draft* using the [Dataset.delete_version](google.com) method.
```python
from allegroai import Dataset
myDataset = Dataset.get(dataset_name='MyDataset')
myDataset.delete_version(version_name='VersionToDelete')
```
### Publishing versions
Publish (make read-only) versions whose status is *Draft* using the [DatasetVersion.publish_version](google.com) method. This includes the current version, if the Dataset is in
the simple version structure.
```python
myVersion = DatasetVersion.get_version(
    dataset_name='MyDataset',
    version_name='VersionToPublish'
)
myVersion.publish_version()
```