mirror of
https://github.com/clearml/clearml-docs
synced 2025-02-07 21:24:49 +00:00
---
title: Datasets and Dataset Versions
---

ClearML Enterprise's **Datasets** and **Dataset versions** provide the internal data structure
and functionality for the following purposes:
* Connecting source data to the **ClearML Enterprise** platform
* Using **ClearML Enterprise**'s Git-like [Dataset versioning](#dataset-versioning)
* Integrating the powerful features of [Dataviews](dataviews.md) with an experiment
* [Annotating](webapp/webapp_datasets_frames.md#annotations) images and videos

Datasets consist of versions with SingleFrames and / or FrameGroups. Each Dataset can contain multiple versions, where
each version can have multiple children that inherit their parent's SingleFrames and / or FrameGroups. This inheritance
includes the frame metadata and data connecting the source data to the ClearML Enterprise platform, as well as the other
metadata and data.

These parent-child version relationships can be represented as version trees with a root-level parent. A Dataset
can contain one or more trees.
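
The parent-child inheritance described above can be pictured with a small in-memory model. This is only an illustrative sketch: the `Version` class, its fields, and the frame names are hypothetical and are not part of the ClearML Enterprise SDK. It also assumes the simplest case, where a child sees its parent's frames plus its own additions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Version:
    """Hypothetical stand-in for a Dataset version (not the SDK class)."""
    name: str
    # Excluded from repr/compare to avoid parent <-> child recursion.
    parent: Optional["Version"] = field(default=None, repr=False, compare=False)
    own_frames: List[str] = field(default_factory=list)
    children: List["Version"] = field(default_factory=list)

    def add_child(self, name: str) -> "Version":
        """Create a child version that points back at this version."""
        child = Version(name=name, parent=self)
        self.children.append(child)
        return child

    def frames(self) -> List[str]:
        """Return inherited frames followed by this version's own frames."""
        inherited = self.parent.frames() if self.parent else []
        return inherited + self.own_frames


# One tree: a root-level parent with two children.
root = Version("root", own_frames=["frame_001", "frame_002"])
child_a = root.add_child("child_a")
child_b = root.add_child("child_b")
child_a.own_frames.append("frame_003")

print(child_a.frames())  # ['frame_001', 'frame_002', 'frame_003']
print(child_b.frames())  # ['frame_001', 'frame_002']
```

A Dataset with several trees would simply hold several such root versions.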

## Dataset version state

Dataset versions can have either **Draft** or **Published** status.

A **Draft** version is editable, so frames can be added to, deleted from, and / or modified in the version.

A **Published** version is read-only, which ensures reproducible experiments and preserves a version of a Dataset.
Child versions can only be created from *Published* versions. To create a child of a *Draft* Dataset version,
publish it first.
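
The two rules above (a *Published* version is read-only, and only a *Published* version can have children) can be sketched as a tiny state model. The `VersionModel` class and its methods are hypothetical names used for illustration only; they are not the SDK's API.

```python
from enum import Enum


class Status(Enum):
    DRAFT = "Draft"
    PUBLISHED = "Published"


class VersionModel:
    """Hypothetical model of the Draft / Published rules (not the SDK class)."""

    def __init__(self, name, frames=None):
        self.name = name
        self.status = Status.DRAFT  # assumption: new versions start as Draft
        self.frames = list(frames or [])

    def add_frame(self, frame):
        # Published versions are read-only.
        if self.status is Status.PUBLISHED:
            raise ValueError(f"{self.name} is Published and read-only")
        self.frames.append(frame)

    def publish(self):
        self.status = Status.PUBLISHED

    def create_child(self, name):
        # Children can only be created from Published versions.
        if self.status is not Status.PUBLISHED:
            raise ValueError(f"publish {self.name} before creating a child")
        return VersionModel(name, frames=self.frames)


v1 = VersionModel("v1")
v1.add_frame("frame_001")
v1.publish()
v2 = v1.create_child("v2")  # allowed: v1 is Published
v2.add_frame("frame_002")   # allowed: v2 is a Draft
```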

## Example Datasets

**ClearML Enterprise** provides Example Datasets, available in the **ClearML Enterprise** platform, with frames already built
and ready for your experimentation. Find these example Datasets in the **ClearML Enterprise** WebApp (UI), where they appear
with an "Example" banner.

## Usage

### Creating Datasets

Use the [Dataset.create](google.com) method to create a Dataset. It will contain an empty version named `Current`.

```python
from allegroai import Dataset

myDataset = Dataset.create(dataset_name='myDataset')
```

Or, use the [DatasetVersion.create_new_dataset](google.com) method.

```python
from allegroai import DatasetVersion

myDataset = DatasetVersion.create_new_dataset(dataset_name='myDataset Two')
```

To raise a `ValueError` exception if the Dataset exists, set the `raise_if_exists` parameter to `True`.

* With `Dataset.create`

  ```python
  try:
      myDataset = Dataset.create(dataset_name='myDataset One', raise_if_exists=True)
  except ValueError:
      print('Dataset exists.')
  ```

* Or with `DatasetVersion.create_new_dataset`

  ```python
  try:
      myDataset = DatasetVersion.create_new_dataset(dataset_name='myDataset Two', raise_if_exists=True)
  except ValueError:
      print('Dataset exists.')
  ```

Additionally, create a Dataset with tags and a description.

```python
myDataset = DatasetVersion.create_new_dataset(
    dataset_name='myDataset',
    tags=['One Tag', 'Another Tag', 'And one more tag'],
    description='some description text',
)
```

### Accessing current Dataset

To get the current Dataset, use the `DatasetVersion.get_current` method.

```python
myDataset = DatasetVersion.get_current(dataset_name='myDataset')
```

### Deleting Datasets

Use the `Dataset.delete` method to delete a Dataset.

Delete an empty Dataset (no versions).

```python
Dataset.delete(dataset_name='MyDataset', delete_all_versions=False, force=False)
```

Delete a Dataset containing only versions whose status is *Draft*.

```python
Dataset.delete(dataset_name='MyDataset', delete_all_versions=True, force=False)
```

Delete a Dataset even if it contains versions whose status is *Published*.

```python
Dataset.delete(dataset_name='MyDataset', delete_all_versions=True, force=True)
```
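
The three cases above suggest a simple decision rule for the two flags. The sketch below is an interpretation of those cases only, not the SDK's implementation: `can_delete` is a hypothetical helper, and combinations the text does not describe (for example `force=True` with `delete_all_versions=False`) are assumed here to be refused.

```python
def can_delete(version_statuses, delete_all_versions=False, force=False):
    """Hypothetical check mirroring the three documented deletion cases.

    version_statuses: list of 'Draft' / 'Published' strings, one per version.
    """
    if not version_statuses:
        # An empty Dataset (no versions) can always be deleted.
        return True
    if not delete_all_versions:
        # Assumption: versions block deletion unless explicitly included.
        return False
    if "Published" in version_statuses:
        # Published versions additionally require force=True.
        return force
    return True


print(can_delete([]))                                                # True
print(can_delete(["Draft"], delete_all_versions=True))               # True
print(can_delete(["Draft", "Published"], delete_all_versions=True))  # False
print(can_delete(["Published"], delete_all_versions=True, force=True))  # True
```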

## Dataset Versioning

Dataset versioning refers to the group of **ClearML Enterprise** SDK and WebApp (UI) features for creating,
modifying, and deleting Dataset versions.

**ClearML Enterprise** supports simple and sophisticated Dataset versioning, including **simple version structures** and
**advanced version structures**.

In a **simple version structure**, a parent can have one and only one child, and the last child in the Dataset versions
tree must be a *Draft*. This simple structure allows working with a single set of versions of a Dataset. Create children
and publish versions to preserve data history. Each version whose status is *Published* in a simple version structure is
referred to as a **snapshot**.

In an **advanced version structure**, at least one parent has more than one child (this can include more than one parent
version at the root level), or the last child in the Dataset versions tree is *Published*.

Creating a version in a simple version structure may convert it to an advanced structure. This happens when creating
a Dataset version that yields a parent with two children, or when publishing the last child version.
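
The distinction between the two structures can be expressed as a short classification function. This is a sketch under stated assumptions: the dictionary layout (`name -> (parent name, status)`) and the `classify` function are hypothetical and exist only to illustrate the rules above, and the input is assumed to be non-empty.

```python
from collections import Counter


def classify(versions):
    """Classify a version layout as 'simple' or 'advanced'.

    versions: dict mapping version name -> (parent name or None, status).
    """
    # Count children per parent; the key None counts root-level versions.
    child_counts = Counter(parent for parent, _ in versions.values())
    if any(count > 1 for count in child_counts.values()):
        return "advanced"
    # A single chain remains: its last child is the version with no children.
    last_child = next(name for name in versions if name not in child_counts)
    _, status = versions[last_child]
    return "advanced" if status == "Published" else "simple"


chain = {
    "v1": (None, "Published"),
    "v2": ("v1", "Published"),
    "Current": ("v2", "Draft"),
}
print(classify(chain))  # simple

# Adding a second child under v1 converts the structure to advanced.
branched = dict(chain, v2b=("v1", "Draft"))
print(classify(branched))  # advanced

# Publishing the last child also converts it to advanced.
published_tip = dict(chain, Current=("v2", "Published"))
print(classify(published_tip))  # advanced
```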

## Versioning Usage

Manage Dataset versioning using the [DatasetVersion](google.com) class in the ClearML Enterprise SDK.

### Creating snapshots

If the Dataset contains only one version whose status is *Draft*, snapshots of the current version can be created.
When creating a snapshot, the current version becomes the snapshot (it keeps the same version ID),
and the newly created version (with its new version ID) becomes the current version.

To create a snapshot, use the [DatasetVersion.create_snapshot](google.com) method.

#### Snapshot naming

In the simple version structure, ClearML Enterprise supports two methods for snapshot naming:
* **Timestamp naming** - If only the Dataset name or ID is provided, the snapshot is named `snapshot` with a timestamp
  appended.
  The timestamp format is ISO 8601 (`YYYY-MM-DDTHH:mm:ss.SSSSSS`). For example, `snapshot 2020-03-26T16:55:38.441671`.

  **Example:**

  ```python
  from allegroai import DatasetVersion

  myDataset = DatasetVersion.create_snapshot(dataset_name='MyDataset')
  ```

  After the statement above runs, the previous current version keeps its existing version ID, and it becomes a
  snapshot named `snapshot` with a timestamp appended. The newly created version with a new version ID becomes
  the current version, and its name is `Current`.

* **User-specified snapshot naming** - If the `publish_name` parameter is provided, it becomes the snapshot name.

  **Example:**

  ```python
  myDataset = DatasetVersion.create_snapshot(dataset_name='MyDataset', publish_name='NewSnapshotName')
  ```

  After the above statement runs, the previous current version keeps its existing version ID and becomes a snapshot named
  `NewSnapshotName`.
  The newly created version (with a new version ID) becomes the current version, and its name is `Current`.
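
To make the two naming rules concrete, the sketch below reproduces the name format with the standard library. The `snapshot_name` helper is hypothetical and is not how the SDK necessarily builds the name; in particular, whether the SDK uses local time or UTC is not stated in this document, so the local clock is assumed here.

```python
from datetime import datetime


def snapshot_name(publish_name=None, now=None):
    """Return the snapshot name under the two documented naming rules."""
    if publish_name:
        # User-specified naming: the given name is used as-is.
        return publish_name
    # Timestamp naming: 'snapshot' plus an ISO 8601 timestamp with microseconds.
    now = now or datetime.now()  # assumption: local time
    return f"snapshot {now.isoformat(timespec='microseconds')}"


fixed = datetime(2020, 3, 26, 16, 55, 38, 441671)
print(snapshot_name(now=fixed))           # snapshot 2020-03-26T16:55:38.441671
print(snapshot_name("NewSnapshotName"))   # NewSnapshotName
```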

#### Current version naming

In the simple version structure, ClearML Enterprise supports two methods for current version naming:

* **Default naming** - If the `child_name` parameter is not provided, `Current` is the current version name.
* **User-specified current version naming** - If the `child_name` parameter is provided, that child name becomes the current
  version name.

For example, after the following statement runs, the previous current version keeps its existing version ID and becomes
a snapshot named `snapshot` with the timestamp appended.
The newly created version (with a new version ID) is the current version, and its name is `NewCurrentVersionName`.

```python
myDataset = DatasetVersion.create_snapshot(
    dataset_name='MyDataset',
    child_name='NewCurrentVersionName',
)
```
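
The hand-off between the old and new current versions can be traced with a minimal in-memory model. Everything here is illustrative: the `snapshot` function, the dictionary fields, and the use of `uuid` for version IDs are assumptions and do not reflect how the SDK stores versions.

```python
import uuid


def snapshot(versions, publish_name, child_name="Current"):
    """Turn the last (current, Draft) version into a snapshot and add a new current version.

    versions: list of dicts with 'id', 'name', 'status', 'parent' keys.
    """
    current = versions[-1]
    # The previous current version keeps its ID and becomes the snapshot.
    current["name"] = publish_name
    current["status"] = "Published"
    # A new version with a new ID becomes the current version.
    new_current = {
        "id": uuid.uuid4().hex,
        "name": child_name,
        "status": "Draft",
        "parent": current["id"],
    }
    versions.append(new_current)
    return new_current


versions = [{"id": uuid.uuid4().hex, "name": "Current", "status": "Draft", "parent": None}]
old_id = versions[0]["id"]
new_current = snapshot(versions, publish_name="NewSnapshotName",
                       child_name="NewCurrentVersionName")

print(versions[0]["name"], versions[0]["id"] == old_id)  # NewSnapshotName True
print(new_current["name"], new_current["status"])        # NewCurrentVersionName Draft
```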

#### Adding metadata and comments

Add a metadata dictionary and / or comment to a snapshot.

For example:

```python
myDataset = DatasetVersion.create_snapshot(
    dataset_name='MyDataset',
    child_metadata={'abc': '1234', 'def': '5678'},
    child_comment='some text comment',
)
```

### Creating child versions

Create a new version from any version whose status is *Published*.

To create a new version, call the [DatasetVersion.create_version](google.com) method, and
provide:
* Either the Dataset name or ID
* The parent version name or ID from which the child inherits frames
* The new version's name

For example, create a new version named `NewChildVersion` from the existing version `PublishedVersion`,
where the new version inherits the frames of the existing version. If `NewChildVersion` already exists,
it is returned.

```python
myVersion = DatasetVersion.create_version(
    dataset_name='MyDataset',
    parent_version_names=['PublishedVersion'],
    version_name='NewChildVersion',
)
```

To raise a `ValueError` exception if `NewChildVersion` exists, set `raise_if_exists` to `True`.

```python
myVersion = DatasetVersion.create_version(
    dataset_name='MyDataset',
    parent_version_names=['PublishedVersion'],
    version_name='NewChildVersion',
    raise_if_exists=True,
)
```
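
The difference between the default behavior (return the existing version) and `raise_if_exists=True` is a standard get-or-create pattern. The sketch below illustrates it with a plain dictionary; the `get_or_create_version` function and the `registry` structure are hypothetical and only mimic the documented behavior.

```python
def get_or_create_version(registry, version_name, parent=None, raise_if_exists=False):
    """Return the named version, creating it if missing.

    registry: dict mapping version name -> version record (a plain dict here).
    """
    if version_name in registry:
        if raise_if_exists:
            raise ValueError(f"version {version_name!r} already exists")
        # Default behavior: the existing version is returned unchanged.
        return registry[version_name]
    registry[version_name] = {"name": version_name, "parent": parent}
    return registry[version_name]


registry = {"PublishedVersion": {"name": "PublishedVersion", "parent": None}}
first = get_or_create_version(registry, "NewChildVersion", parent="PublishedVersion")
second = get_or_create_version(registry, "NewChildVersion", parent="PublishedVersion")
print(first is second)  # True

try:
    get_or_create_version(registry, "NewChildVersion", raise_if_exists=True)
except ValueError as err:
    print(err)  # version 'NewChildVersion' already exists
```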

### Creating root-level parent versions

Create a new version at the root level. This is a version without a parent, and it contains no frames.

```python
myDataset = DatasetVersion.create_version(
    dataset_name='MyDataset',
    version_name='NewRootVersion',
)
```

### Getting versions

To get a version or versions, use the [DatasetVersion.get_version](google.com) and [DatasetVersion.get_versions](google.com)
methods, respectively.

**Getting a list of all versions**

```python
myDatasetversion = DatasetVersion.get_versions(dataset_name='MyDataset')
```

**Getting a list of all _published_ versions**

```python
myDatasetversion = DatasetVersion.get_versions(
    dataset_name='MyDataset',
    only_published=True,
)
```

**Getting a list of all _draft_ versions**

```python
myDatasetversion = DatasetVersion.get_versions(
    dataset_name='MyDataset',
    only_draft=True,
)
```

**Getting the current version**

If more than one version exists, ClearML Enterprise outputs a warning.

```python
myDatasetversion = DatasetVersion.get_version(dataset_name='MyDataset')
```

**Getting a specific version**

```python
myDatasetversion = DatasetVersion.get_version(
    dataset_name='MyDataset',
    version_name='VersionName',
)
```

### Deleting versions

Delete versions whose status is *Draft* using the [Dataset.delete_version](google.com) method.

```python
from allegroai import Dataset

myDataset = Dataset.get(dataset_name='MyDataset')
myDataset.delete_version(version_name='VersionToDelete')
```

### Publishing versions

Publish (make read-only) versions whose status is *Draft* using the [DatasetVersion.publish_version](google.com) method.
This includes the current version, if the Dataset is in the simple version structure.

```python
myVersion = DatasetVersion.get_version(
    dataset_name='MyDataset',
    version_name='VersionToPublish',
)

myVersion.publish_version()
```