Data storage clarification (#298)

2025-06-26 18:17:44 +00:00 · 2022-08-01 11:25:54 +03:00
commit 03b8862d70
parent 5476908523
3 changed files with 19 additions and 9 deletions
--- a/docs/clearml_data/clearml_data.md
+++ b/docs/clearml_data/clearml_data.md
@@ -17,14 +17,17 @@ ClearML Data Management solves two important challenges:
 **We believe Data is not code**. It should not be stored in a git tree, because progress on datasets is not always linear.
 Moreover, it can be difficult and inefficient to find on a git tree the commit associated with a certain version of a dataset.

-A `clearml-data` dataset is a collection of files, stored on a central storage location (S3 / GS / Azure / Network Storage).
-Datasets can be set up to inherit from other datasets, so data lineages can be created,
-and users can track when and how their data changes.
+Use ClearML Data to create, manage, and version your datasets. Store your files in any storage location of your choice 
+(S3 / GS / Azure / Network Storage) by setting the dataset’s upload destination (see [`--storage`](clearml_data_cli.md#upload) 
+CLI option or [`output_url`](clearml_data_sdk.md#uploading-files) parameter). 

-Dataset changes are stored using differentiable storage, meaning a version will store the change-set from its previous dataset parents.
+Datasets can be set up to inherit from other datasets, so data lineages can be created, and users can track when and how 
+their data changes. Dataset changes are stored using differentiable storage, meaning a version will store the change-set 
+from its previous dataset parents.

-Local copies of datasets are always cached, so the same data never needs to be downloaded twice.
-When a dataset is pulled it will automatically pull all parent datasets and merge them into one output folder for you to work with.
+You can get a local copy of your dataset on any machine. Local copies of datasets are always cached, so the same data 
+never needs to be downloaded twice. When a dataset is pulled it will automatically pull all parent datasets and merge 
+them into one output folder for you to work with.

 The [Dataset Versions](../webapp/datasets/webapp_dataset_viewing.md) page in the web UI displays dataset versions' 
 lineage and content information. See [dataset UI](../webapp/datasets/webapp_dataset_page.md) for more details.
--- a/docs/clearml_data/clearml_data_cli.md
+++ b/docs/clearml_data/clearml_data_cli.md
@@ -7,7 +7,10 @@ This page covers `clearml-data`, ClearML's file-based data management solution.
 See [Hyper-Datasets](../hyperdatasets/overview.md) for ClearML's advanced queryable dataset management solution.
 :::

-The `clearml-data` utility is a CLI tool for controlling and managing your data with ClearML.  
+`clearml-data` is a data management CLI tool that comes as part of the `clearml` python package. Use `clearml-data` to 
+create, modify, and manage your datasets. You can upload your dataset to any storage service of your choice (S3 / GS / 
+Azure / Network Storage) by setting the dataset’s upload destination (see [`--storage`](#upload)). Once you have uploaded 
+your dataset, you can access it from any machine. 

 The following page provides a reference to `clearml-data`'s CLI commands. 

--- a/docs/clearml_data/clearml_data_sdk.md
+++ b/docs/clearml_data/clearml_data_sdk.md
@@ -7,8 +7,12 @@ This page covers `clearml-data`, ClearML's file-based data management solution.
 See [Hyper-Datasets](../hyperdatasets/overview.md) for ClearML's advanced queryable dataset management solution.
 :::

-Datasets can be created, modified, and managed with ClearML Data's python interface. The following page provides an overview
-for using the most basic methods of the `Dataset` class. See the [Dataset reference page](../references/sdk/dataset.md) 
+Datasets can be created, modified, and managed with ClearML Data's python interface. You can upload your dataset to any 
+storage service of your choice  (S3 / GS / Azure / Network Storage) by setting the dataset’s upload destination (see 
+[`output_url`](#uploading-files) parameter of `Dataset.upload` method). Once you have uploaded your dataset, you can access 
+it from any machine.  
+
+The following page provides an overview for using the most basic methods of the `Dataset` class. See the [Dataset reference page](../references/sdk/dataset.md) 
 for a complete list of available methods.

 Import the `Dataset` class, and let's get started!
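
The storage model described in the `clearml_data.md` hunk above (each version stores only the change-set relative to its parents, and pulling a dataset merges all parents into one output folder) can be illustrated with a toy sketch. This is not ClearML's implementation: the `DatasetVersion` class, its `added` / `removed` fields, and the `resolve` helper are hypothetical names invented for illustration, and files are represented as a path-to-hash mapping rather than real data.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class DatasetVersion:
    """Hypothetical dataset version that records only its own change-set."""
    name: str
    parents: List["DatasetVersion"] = field(default_factory=list)
    # Files added or modified in this version: path -> content hash.
    added: Dict[str, str] = field(default_factory=dict)
    # Paths deleted in this version relative to its parents.
    removed: Set[str] = field(default_factory=set)


def resolve(version: DatasetVersion) -> Dict[str, str]:
    """Merge all parent versions, then apply this version's change-set."""
    merged: Dict[str, str] = {}
    # Later parents override earlier ones on conflicting paths (an assumption).
    for parent in version.parents:
        merged.update(resolve(parent))
    # Apply deletions before additions so a re-added path survives.
    for path in version.removed:
        merged.pop(path, None)
    merged.update(version.added)
    return merged


# A base version with two files, and a child that changes one,
# adds one, and removes one.
base = DatasetVersion("v1", added={"a.csv": "hash1", "b.csv": "hash2"})
child = DatasetVersion(
    "v2",
    parents=[base],
    added={"b.csv": "hash3", "c.csv": "hash4"},
    removed={"a.csv"},
)

full_view = resolve(child)
print(full_view)
```

In this sketch the child version holds only three entries of its own, yet resolving it yields the complete file listing a user would work with. This mirrors the idea of pulling a dataset together with all of its parents.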