Data storage clarification (#298)

pollfly 2022-08-01 11:25:54 +03:00 committed by GitHub
parent 5476908523
commit 03b8862d70
GPG Key ID: 4AEE18F83AFDEB23
3 changed files with 19 additions and 9 deletions

@@ -17,14 +17,17 @@ ClearML Data Management solves two important challenges:
 **We believe Data is not code**. It should not be stored in a git tree, because progress on datasets is not always linear.
 Moreover, it can be difficult and inefficient to find on a git tree the commit associated with a certain version of a dataset.
-A `clearml-data` dataset is a collection of files, stored on a central storage location (S3 / GS / Azure / Network Storage).
-Datasets can be set up to inherit from other datasets, so data lineages can be created,
-and users can track when and how their data changes.
+Use ClearML Data to create, manage, and version your datasets. Store your files in any storage location of your choice
+(S3 / GS / Azure / Network Storage) by setting the datasets upload destination (see [`--storage`](clearml_data_cli.md#upload)
+CLI option or [`output_url`](clearml_data_sdk.md#uploading-files) parameter).
-Dataset changes are stored using differentiable storage, meaning a version will store the change-set from its previous dataset parents.
+Datasets can be set up to inherit from other datasets, so data lineages can be created, and users can track when and how
+their data changes. Dataset changes are stored using differentiable storage, meaning a version will store the change-set
+from its previous dataset parents.
-Local copies of datasets are always cached, so the same data never needs to be downloaded twice.
-When a dataset is pulled it will automatically pull all parent datasets and merge them into one output folder for you to work with.
+You can get a local copy of your dataset on any machine. Local copies of datasets are always cached, so the same data
+never needs to be downloaded twice. When a dataset is pulled it will automatically pull all parent datasets and merge
+them into one output folder for you to work with.
 The [Dataset Versions](../webapp/datasets/webapp_dataset_viewing.md) page in the web UI displays dataset versions'
 lineage and content information. See [dataset UI](../webapp/datasets/webapp_dataset_page.md) for more details.
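
Note on the hunk above: the idea that a version stores only the change-set from its parents, and that pulling a dataset merges all parents into one output, can be illustrated with a small conceptual sketch. This is not ClearML's implementation; `DatasetVersion`, `resolve_files`, and the file hashes are hypothetical names and values invented for illustration, and the real storage format may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DatasetVersion:
    """Hypothetical model: a version holds only its changes relative to its parents."""
    name: str
    parents: List["DatasetVersion"] = field(default_factory=list)
    added: Dict[str, str] = field(default_factory=dict)   # path -> content hash (added or modified)
    removed: List[str] = field(default_factory=list)      # paths deleted in this version


def resolve_files(version: DatasetVersion) -> Dict[str, str]:
    """Merge the resolved content of all parents, then apply this version's change-set."""
    merged: Dict[str, str] = {}
    for parent in version.parents:
        merged.update(resolve_files(parent))
    for path in version.removed:
        merged.pop(path, None)
    merged.update(version.added)
    return merged


# Illustrative data only: v2 modifies a.csv, adds c.csv, and removes b.csv.
base = DatasetVersion("v1", added={"a.csv": "hash-a1", "b.csv": "hash-b1"})
child = DatasetVersion(
    "v2",
    parents=[base],
    added={"a.csv": "hash-a2", "c.csv": "hash-c1"},
    removed=["b.csv"],
)

print(resolve_files(child))
```

In this sketch the child version records three changes rather than a full copy of the data, yet resolving it yields the complete file listing a user would expect after a pull.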

@@ -7,7 +7,10 @@ This page covers `clearml-data`, ClearML's file-based data management solution.
 See [Hyper-Datasets](../hyperdatasets/overview.md) for ClearML's advanced queryable dataset management solution.
 :::
-The `clearml-data` utility is a CLI tool for controlling and managing your data with ClearML.
+`clearml-data` is a data management CLI tool that comes as part of the `clearml` python package. Use `clearml-data` to
+create, modify, and manage your datasets. You can upload your dataset to any storage service of your choice (S3 / GS /
+Azure / Network Storage) by setting the datasets upload destination (see [`--storage`](#upload)). Once you have uploaded
+your dataset, you can access it from any machine.
 The following page provides a reference to `clearml-data`'s CLI commands.

@@ -7,8 +7,12 @@ This page covers `clearml-data`, ClearML's file-based data management solution.
 See [Hyper-Datasets](../hyperdatasets/overview.md) for ClearML's advanced queryable dataset management solution.
 :::
-Datasets can be created, modified, and managed with ClearML Data's python interface. The following page provides an overview
-for using the most basic methods of the `Dataset` class. See the [Dataset reference page](../references/sdk/dataset.md)
+Datasets can be created, modified, and managed with ClearML Data's python interface. You can upload your dataset to any
+storage service of your choice (S3 / GS / Azure / Network Storage) by setting the datasets upload destination (see
+[`output_url`](#uploading-files) parameter of `Dataset.upload` method). Once you have uploaded your dataset, you can access
+it from any machine.
+The following page provides an overview for using the most basic methods of the `Dataset` class. See the [Dataset reference page](../references/sdk/dataset.md)
 for a complete list of available methods.
 Import the `Dataset` class, and let's get started!
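
Note on the hunk above: the workflow the new text describes (upload to a chosen storage destination via `output_url`, then access the dataset from any machine) might look roughly like the following sketch. It assumes a configured ClearML installation; the helper name `upload_and_fetch` and all argument values are hypothetical, and the Dataset reference page remains the authority on exact parameters.

```python
def upload_and_fetch(dataset_project: str, dataset_name: str,
                     local_path: str, output_url: str) -> str:
    """Create a dataset version, upload it to `output_url`, and return a cached local copy path.

    Hypothetical helper for illustration; requires the `clearml` package and a
    configured ClearML server when it is actually called.
    """
    # Imported inside the function so this sketch can be loaded without ClearML configured.
    from clearml import Dataset

    dataset = Dataset.create(dataset_name=dataset_name, dataset_project=dataset_project)
    dataset.add_files(path=local_path)        # register local files in this version
    dataset.upload(output_url=output_url)     # send files to the chosen storage destination
    dataset.finalize()                        # close the version so it can be consumed

    # On any machine with access to the storage, fetch a read-only cached copy.
    return Dataset.get(dataset_id=dataset.id).get_local_copy()
```

A call such as `upload_and_fetch("examples", "my-dataset", "./data", "s3://my-bucket/datasets")` would use placeholder project, dataset, path, and bucket names; substitute values that exist in your environment.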