From 03b8862d705f70c0d98c52e1ff2713d79b7431da Mon Sep 17 00:00:00 2001
From: pollfly <75068813+pollfly@users.noreply.github.com>
Date: Mon, 1 Aug 2022 11:25:54 +0300
Subject: [PATCH] Data storage clarification (#298)

---
 docs/clearml_data/clearml_data.md     | 15 +++++++++------
 docs/clearml_data/clearml_data_cli.md |  5 ++++-
 docs/clearml_data/clearml_data_sdk.md |  8 ++++++--
 3 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/docs/clearml_data/clearml_data.md b/docs/clearml_data/clearml_data.md
index c1b36c40..8609fbee 100644
--- a/docs/clearml_data/clearml_data.md
+++ b/docs/clearml_data/clearml_data.md
@@ -17,14 +17,17 @@ ClearML Data Management solves two important challenges:
 **We believe Data is not code**. It should not be stored in a git tree, because progress on datasets is not always linear.
 Moreover, it can be difficult and inefficient to find on a git tree the commit associated with a certain version of a dataset.

-A `clearml-data` dataset is a collection of files, stored on a central storage location (S3 / GS / Azure / Network Storage).
-Datasets can be set up to inherit from other datasets, so data lineages can be created,
-and users can track when and how their data changes.
+Use ClearML Data to create, manage, and version your datasets. Store your files in any storage location of your choice
+(S3 / GS / Azure / Network Storage) by setting the dataset’s upload destination (see [`--storage`](clearml_data_cli.md#upload)
+CLI option or [`output_url`](clearml_data_sdk.md#uploading-files) parameter).

-Dataset changes are stored using differentiable storage, meaning a version will store the change-set from its previous dataset parents.
+Datasets can be set up to inherit from other datasets, so data lineages can be created, and users can track when and how
+their data changes. Dataset changes are stored using differentiable storage, meaning a version will store the change-set
+from its previous dataset parents.

-Local copies of datasets are always cached, so the same data never needs to be downloaded twice.
-When a dataset is pulled it will automatically pull all parent datasets and merge them into one output folder for you to work with.
+You can get a local copy of your dataset on any machine. Local copies of datasets are always cached, so the same data
+never needs to be downloaded twice. When a dataset is pulled it will automatically pull all parent datasets and merge
+them into one output folder for you to work with.

 The [Dataset Versions](../webapp/datasets/webapp_dataset_viewing.md) page in the web UI displays dataset versions' lineage and content information.
 See [dataset UI](../webapp/datasets/webapp_dataset_page.md) for more details.
diff --git a/docs/clearml_data/clearml_data_cli.md b/docs/clearml_data/clearml_data_cli.md
index 1d7207cd..9632d3c2 100644
--- a/docs/clearml_data/clearml_data_cli.md
+++ b/docs/clearml_data/clearml_data_cli.md
@@ -7,7 +7,10 @@ This page covers `clearml-data`, ClearML's file-based data management solution.
 See [Hyper-Datasets](../hyperdatasets/overview.md) for ClearML's advanced queryable dataset management solution.
 :::

-The `clearml-data` utility is a CLI tool for controlling and managing your data with ClearML.
+`clearml-data` is a data management CLI tool that comes as part of the `clearml` python package. Use `clearml-data` to
+create, modify, and manage your datasets. You can upload your dataset to any storage service of your choice (S3 / GS /
+Azure / Network Storage) by setting the dataset’s upload destination (see [`--storage`](#upload)). Once you have uploaded
+your dataset, you can access it from any machine.

 The following page provides a reference to `clearml-data`'s CLI commands.

diff --git a/docs/clearml_data/clearml_data_sdk.md b/docs/clearml_data/clearml_data_sdk.md
index b29e04c4..a8d935d7 100644
--- a/docs/clearml_data/clearml_data_sdk.md
+++ b/docs/clearml_data/clearml_data_sdk.md
@@ -7,8 +7,12 @@ This page covers `clearml-data`, ClearML's file-based data management solution.
 See [Hyper-Datasets](../hyperdatasets/overview.md) for ClearML's advanced queryable dataset management solution.
 :::

-Datasets can be created, modified, and managed with ClearML Data's python interface. The following page provides an overview
-for using the most basic methods of the `Dataset` class. See the [Dataset reference page](../references/sdk/dataset.md)
+Datasets can be created, modified, and managed with ClearML Data's python interface. You can upload your dataset to any
+storage service of your choice (S3 / GS / Azure / Network Storage) by setting the dataset’s upload destination (see
+[`output_url`](#uploading-files) parameter of `Dataset.upload` method). Once you have uploaded your dataset, you can access
+it from any machine.
+
+The following page provides an overview for using the most basic methods of the `Dataset` class. See the [Dataset reference page](../references/sdk/dataset.md)
 for a complete list of available methods.

 Import the `Dataset` class, and let's get started!
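
The upload destination that this patch documents (the `output_url` parameter of `Dataset.upload`) could be exercised roughly as in the sketch below. It assumes the `clearml` package is installed and configured against a ClearML server. The helper name `publish_dataset`, the dataset and project names, the folder, and the bucket URL are hypothetical and not part of the patch; the `dataset_cls` parameter is an illustration-only hook so the flow can be tried without a server.

```python
from typing import Any, Optional


def publish_dataset(
    name: str,
    project: str,
    folder: str,
    output_url: str,
    dataset_cls: Optional[Any] = None,
) -> str:
    """Create a dataset version, upload its files to `output_url`, and return its ID.

    `dataset_cls` defaults to `clearml.Dataset`; passing a stand-in class is a
    hypothetical convenience for dry runs, not a ClearML feature.
    """
    if dataset_cls is None:
        # Imported lazily so the sketch can be read and dry-run without clearml.
        from clearml import Dataset as dataset_cls

    # Register a new dataset version in the given project.
    dataset = dataset_cls.create(dataset_name=name, dataset_project=project)

    # Stage the local files that belong to this version.
    dataset.add_files(path=folder)

    # Upload the staged files to the chosen storage location
    # (for example an S3 / GS / Azure URL or a network path).
    dataset.upload(output_url=output_url)

    # Close the version so it can be consumed from other machines.
    dataset.finalize()
    return dataset.id


# Hypothetical usage (needs a configured ClearML server and storage credentials):
# dataset_id = publish_dataset(
#     "example-dataset", "example-project", "./data", "s3://example-bucket/datasets"
# )
```

On another machine, the finalized version could then be fetched with `Dataset.get(dataset_id=...)` followed by `get_local_copy()`, which matches the cached local copy described in the first file of the patch.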