mirror of
https://github.com/clearml/clearml-docs
synced 2025-01-31 14:37:18 +00:00
62 lines
3.2 KiB
Markdown
62 lines
3.2 KiB
Markdown
---
|
||
title: Introduction
|
||
---
|
||
|
||
:::important
|
||
This page covers `clearml-data`, ClearML's file-based data management solution.
|
||
See [Hyper-Datasets](../hyperdatasets/overview.md) for ClearML's advanced queryable dataset management solution.
|
||
:::
|
||
|
||
In Machine Learning, you are very likely dealing with a gargantuan amount of data that you need to put in a dataset,
|
||
which you then need to be able to share, reproduce, and track.
|
||
|
||
ClearML Data Management solves two important challenges:
|
||
- Accessibility - Making data easily accessible from every machine,
|
||
- Versioning - Linking data and experiments for better **traceability**.
|
||
|
||
![Dataset lineage and preview](../img/webapp_dataset_lineage_preview.png)
|
||
|
||
**We believe Data is not code**. It should not be stored in a git tree, because progress on datasets is not always linear.
|
||
Moreover, it can be difficult and inefficient to find on a git tree the commit associated with a certain version of a dataset.
|
||
|
||
Use ClearML Data to create, manage, and version your datasets. Store your files in any storage location of your choice
|
||
(S3 / GS / Azure / Network Storage) by setting the dataset’s upload destination (see [`--storage`](clearml_data_cli.md#upload)
|
||
CLI option or [`output_url`](clearml_data_sdk.md#uploading-files) parameter).
|
||
|
||
Datasets can be set up to inherit from other datasets, so data lineages can be created, and users can track when and how
|
||
their data changes. Dataset changes are stored using differentiable storage, meaning a version will store the change-set
|
||
from its previous dataset parents.
|
||
|
||
You can get a local copy of your dataset on any machine. Local copies of datasets are always cached, so the same data
|
||
never needs to be downloaded twice. When a dataset is pulled it will automatically pull all parent datasets and merge
|
||
them into one output folder for you to work with.
|
||
|
||
The [Dataset Versions](../webapp/datasets/webapp_dataset_viewing.md) page in the web UI displays dataset versions'
|
||
lineage and content information. See [dataset UI](../webapp/datasets/webapp_dataset_page.md) for more details.
|
||
|
||
## Setup
|
||
|
||
`clearml-data` comes built-in with the `clearml` python package! Just check out the [Getting Started](../getting_started/ds/ds_first_steps.md)
|
||
guide for more info!
|
||
|
||
## Using ClearML Data
|
||
|
||
ClearML Data offers two interfaces:
|
||
- `clearml-data` - A CLI utility for creating, uploading, and managing datasets. See [CLI](clearml_data_cli.md) for a reference of `clearml-data` commands.
|
||
- `clearml.Dataset` - A python interface for creating, retrieving, managing, and using datasets. See [SDK](clearml_data_sdk.md) for an overview of the basic methods of the `Dataset` module.
|
||
|
||
For an overview of our recommendations for ClearML Data workflows and practices, see [Best Practices](best_practices.md).
|
||
|
||
## Dataset Version States
|
||
The following table displays the possible states for a dataset version.
|
||
|
||
|
||
| State | Description |
|
||
|---|---|
|
||
|*Uploading* | Dataset creation is in progress |
|
||
|*Failed* | Dataset creation was terminated with an error|
|
||
|*Aborted* | Dataset creation was aborted by user before it was finalization |
|
||
|*Final* | A dataset was created and finalized successfully |
|
||
|*Published* | The dataset is read-only. Publish a dataset to prevent changes to it |
|
||
|