Update docs folder (#1384)

This commit is contained in:
pollfly 2025-03-05 14:20:41 +02:00 committed by GitHub
parent 4fa233ff76
commit 330abbf9c0
10 changed files with 7 additions and 1726 deletions

@@ -1,167 +1,6 @@
# `clearml-task` - Execute ANY python code on a remote machine

Using only your command line and __zero__ additional lines of code, you can easily integrate the ClearML platform
into your experiment. With the `clearml-task` command, you can create a [Task](https://clear.ml/docs/latest/docs/fundamentals/task)
using any script from **any python code or repository and launch it on a remote machine**.
The remote execution is fully monitored. All outputs - including console / tensorboard / matplotlib -
are logged in real-time into the ClearML UI.

For more information, see the [ClearML Documentation](https://clear.ml/docs/latest/docs/apps/clearml_task/).
## What does it do?
With the `clearml-task` command, you specify the details of your experiment including:
* Project and task name
* Repository / commit / branch
* [Queue](https://clear.ml/docs/latest/docs/fundamentals/agents_and_queues#what-is-a-queue)
name
* Optional: the base docker image to be used as the underlying environment
* Optional: alternative python requirements, in case `requirements.txt` is not found inside the repository.
Then `clearml-task` does the rest of the heavy lifting. It creates a new experiment (Task) on your `clearml-server`
according to your specifications, and then enqueues the experiment in the selected execution queue.
While the Task is executed on the remote machine (by an available `clearml-agent`), all console outputs
are logged in real-time, alongside your TensorBoard and Matplotlib outputs. During and after the Task execution, you can
track and visualize the results in the ClearML Web UI.
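The `clearml-task` CLI builds on the ClearML SDK, so the same flow can also be sketched in Python. Below is a minimal, illustrative sketch (not the CLI's exact internals); the project, repository, script, packages and queue names are placeholders taken from the tutorial further down:
```python
from clearml import Task

# Create a draft Task that points at the repository and entry script,
# without executing any code locally
task = Task.create(
    project_name="keras_examples",
    task_name="remote_test",
    repo="https://github.com/allegroai/events.git",
    script="webinar-0620/keras_mnist.py",
    packages=["keras", "tensorflow>2.2"],  # used when the repo has no requirements.txt
)

# Enqueue it so an available clearml-agent can pick it up and run it remotely
Task.enqueue(task, queue_name="default")
```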
### Use-cases for `clearml-task` remote execution
- You have off-the-shelf code, and you want to launch it on a remote machine with a specific resource (e.g., a GPU)
- You want to run hyper-parameter optimization on a codebase that is not yet connected to `clearml`
- You want to create a pipeline from an assortment of scripts, and you need to create Tasks for those scripts
- Sometimes, you just want to run some code on a remote machine, whether on an on-prem cluster or in the cloud...
### Prerequisites
- A single python script, or an up-to-date repository containing the codebase.
- `clearml` installed. `clearml` also offers direct [Task](https://clear.ml/docs/latest/docs/fundamentals/task)
integration, but that requires adding two lines of code to your script.
- `clearml-agent` running on at least one machine (to execute the experiment)
## Tutorial
### Launching a job from a repository
You will be launching this [script](https://github.com/allegroai/events/blob/master/webinar-0620/keras_mnist.py)
on a remote machine. You will be using the following command-line options:
1. Give the experiment a name and select a project, for example: `--project keras_examples --name remote_test`. If the project
doesn't exist, a new project will be created with the selected name.
2. Select the repository with your code. For example: `--repo https://github.com/allegroai/events.git`. You can specify a
branch and/or commit using `--branch <branch_name> --commit <commit_id>`. If you do not specify a
branch / commit, the latest commit from the master branch is used by default.
3. Specify which script in the repository needs to be run, for example: `--script /webinar-0620/keras_mnist.py`.
By default, the execution working directory will be the root of the repository. If you need to change it,
add `--cwd <folder>`
4. If you need to pass arguments to your script, use `--args`, followed by the arguments.
The argument names should match the script's argparse arguments, but without the '--' prefix. Instead
of `--key=value`, use `--args key=value`, for example `--args batch_size=64 epochs=1` (see the sketch after the command below).
5. Select the queue for your Task's execution, for example: `--queue default`. If a queue isn't chosen, the Task
will not be executed; it will be left in [draft mode](https://clear.ml/docs/latest/docs/fundamentals/task#task-states),
and you can enqueue and execute the Task at a later point.
6. Add required packages. If your repo has a requirements.txt file, you don't need to do anything; `clearml-task`
will automatically find the file and put it in your Task. If your repo does __not__ have a requirements file and
there are packages that are necessary for the execution of your code, use `--packages <package_name>`. For example:
`--packages "keras" "tensorflow>2.2"`.
``` bash
clearml-task --project keras_examples --name remote_test --repo https://github.com/allegroai/events.git \
--script /webinar-0620/keras_mnist.py --args batch_size=64 epochs=1 --queue default
```
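For reference, the `--args` names map directly onto the entry script's own argparse arguments. The following is a hypothetical fragment (not the actual contents of `keras_mnist.py`) showing an argument parser that `--args batch_size=64 epochs=1` would override:
```python
from argparse import ArgumentParser

parser = ArgumentParser()
# clearml-task replaces these defaults with the values passed via --args
parser.add_argument('--batch_size', type=int, default=128)
parser.add_argument('--epochs', type=int, default=10)
args = parser.parse_args()

print(f"training for {args.epochs} epochs with batch size {args.batch_size}")
```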
### Launching a job from a local script
You will be launching a single local script file (no git repo needed) on a remote machine:
1. Give the experiment a name and select a project (`--project examples --name remote_test`)
2. Select the script file on your machine, `--script /path/to/my/script.py`
3. If you require specific packages to run your code, you can specify them manually with `--packages "package_name" "package_name2"`,
for example: `--packages "keras" "tensorflow>2.2"`,
or you can pass a requirements file: `--requirements /path/to/my/requirements.txt`
4. If you need to pass arguments, like in the repo case, add `--args key=value` and make sure that the key names match
the argparse arguments (`--args batch_size=64 epochs=1`)
5. If you have a docker image with the full environment your script should run inside,
add it with, for example, `--docker nvcr.io/nvidia/pytorch:20.11-py3`
6. Select the queue for your Task's execution, for example: `--queue dual_gpu`. If a queue isn't chosen, the Task
will not be executed; it will be left in [draft mode](https://clear.ml/docs/latest/docs/fundamentals/task#task-states),
and you can enqueue and execute it at a later point (see the sketch after the command below).
``` bash
clearml-task --project examples --name remote_test --script /path/to/my/script.py \
--packages "keras" "tensorflow>2.2" --args epochs=1 batch_size=64 \
--queue dual_gpu
```
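If you skipped `--queue`, the Task is left in draft mode. You can enqueue it later from the UI, or programmatically; a minimal sketch, assuming you kept the Task ID that `clearml-task` printed (the ID and queue name below are placeholders):
```python
from clearml import Task

# Fetch the draft Task by its ID (placeholder value) and hand it to a queue
draft_task = Task.get_task(task_id="aabbccddeeff00112233445566778899")
Task.enqueue(draft_task, queue_name="dual_gpu")
```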
### CLI options
``` bash
clearml-task --help
```
``` console
ClearML launch - launch any codebase on remote machine running clearml-agent
optional arguments:
-h, --help show this help message and exit
--version Display the clearml-task utility version
--project PROJECT Required: set the project name for the task. If
--base-task-id is used, this argument is optional.
--name NAME Required: select a name for the remote task
--repo REPO remote URL for the repository to use. Example: --repo
https://github.com/allegroai/clearml.git
--branch BRANCH Select specific repository branch/tag (implies the
latest commit from the branch)
--commit COMMIT Select specific commit id to use (default: latest
commit, or when used with local repository matching
the local commit id)
--folder FOLDER Remotely execute the code in the local folder. Notice!
It assumes a git repository already exists. Current
state of the repo (commit id and uncommitted changes)
is logged and will be replicated on the remote machine
--script SCRIPT Specify the entry point script for the remote
execution. When used in tandem with --repo the script
should be a relative path inside the repository, for
example: --script source/train.py. When used with
--folder it supports a direct path to a file inside
the local repository itself, for example: --script
~/project/source/train.py
--cwd CWD Working directory to launch the script from. Default:
repository root folder. Relative to repo root or local
folder
--args [ARGS [ARGS ...]]
Arguments to pass to the remote execution, list of
<argument>=<value> strings. Currently only argparse
arguments are supported. Example: --args lr=0.003
batch_size=64
--queue QUEUE Select the queue to launch the task. If not provided a
Task will be created but it will not be launched.
--requirements REQUIREMENTS
Specify requirements.txt file to install when setting
the session. If not provided, the requirements.txt
from the repository will be used.
--packages [PACKAGES [PACKAGES ...]]
Manually specify a list of required packages. Example:
--packages "tqdm>=2.1" "scikit-learn"
--docker DOCKER Select the docker image to use in the remote session
--docker_args DOCKER_ARGS
Add docker arguments, pass a single string
--docker_bash_setup_script DOCKER_BASH_SETUP_SCRIPT
Add bash script to be executed inside the docker
before setting up the Task's environment
--output-uri OUTPUT_URI
Optional: set the Task `output_uri` (automatically
upload model destination)
--task-type TASK_TYPE
Set the Task type, optional values: training, testing,
inference, data_processing, application, monitor,
controller, optimizer, service, qc, custom
--skip-task-init If set, Task.init() call is not added to the entry
point, and is assumed to be called within the
script. Default: add a Task.init() call to the entry
point script
--base-task-id BASE_TASK_ID
Use a pre-existing task in the system, instead of a
local repo/script. Essentially clones an existing task
and overrides arguments/requirements.
```
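For example, `--base-task-id` reuses an existing Task as a template instead of a repo/script. A rough programmatic equivalent is cloning an existing Task, overriding a parameter, and enqueuing the clone; the Task ID and queue below are placeholders:
```python
from clearml import Task

template = Task.get_task(task_id="00112233445566778899aabbccddeeff")  # placeholder ID
cloned = Task.clone(source_task=template, name="remote_test (clone)")
cloned.set_parameter("Args/batch_size", 128)  # override an argparse-connected argument
Task.enqueue(cloned, queue_name="default")
```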


@@ -1,159 +1,5 @@
# ClearML introducing Dataset management!

## Decoupling Data from Code - The Dataset Paradigm

Simplify data management with ClearML: create, version, and access datasets from anywhere, ensuring traceability and reproducibility.

<a href="https://app.clear.ml"><img src="https://github.com/allegroai/clearml/blob/master/docs/dataset_screenshots.gif?raw=true" width="80%"></a>
### The ultimate goal of `clearml-data` is to transform datasets into configuration parameters
Just like any other argument, the dataset argument should retrieve a full local copy of the
dataset to be used by the experiment.
This means datasets can be efficiently retrieved by any machine in a reproducible way.
Together, this creates a full version-control solution for all your data,
one that is both machine and environment agnostic.
### Design Goals: Simple / Agnostic / File-based / Efficient
## Key Concepts:
1) **Dataset** is a **collection of files**: e.g. a folder with all the subdirectories and files included in the dataset
2) **Differential storage**: efficient storage / network usage
3) **Flexible**: support addition / removal / merge of files and datasets
4) **Descriptive, transparent & searchable**: support projects, names, descriptions, tags and searchable fields
5) **Simple interface** (CLI and programmatic)
6) **Accessible**: get a copy of the dataset files from anywhere on any machine
### Workflow:
#### Simple dataset creation with CLI:
- Create a dataset
``` bash
clearml-data create --project <my_project> --name <my_dataset_name>
```
- Add local files to the dataset
``` bash
clearml-data add --files ~/datasets/best_dataset/
```
- Close dataset and upload files (Optional: specify storage `--storage` `s3://bucket`, `gs://`, `azure://` or `/mnt/shared/`)
``` bash
clearml-data close --id <dataset_id>
```
#### Integrating datasets into your code:
```python
from argparse import ArgumentParser
from clearml import Dataset, Task
# adding command line interface, so it is easy to use
parser = ArgumentParser()
parser.add_argument('--dataset', default='aayyzz', type=str, help='Dataset ID to train on')
args = parser.parse_args()
# creating a task, so that later we could override the argparse from UI
task = Task.init(project_name='examples', task_name='dataset demo')
# getting a local copy of the dataset
dataset_folder = Dataset.get(dataset_id=args.dataset).get_local_copy()
# go over the files in `dataset_folder` and train your model
```
#### Create dataset from code
Creating datasets from code is especially helpful when some preprocessing is done on raw data and you want to save
the preprocessing code as well as the dataset in a single Task.
```python
from clearml import Dataset
# Preprocessing code here
dataset = Dataset.create(dataset_name='dataset name', dataset_project='dataset project')
dataset.add_files('/path_to_data')
dataset.upload()
dataset.finalize()
```
#### Modifying a dataset with CLI:
- Create a new dataset (specify the parent dataset id)
```bash
clearml-data create --name <improved_dataset> --parents <existing_dataset_id>
```
- Get a mutable copy of the current dataset
```bash
clearml-data get --id <created_dataset_id> --copy ~/datasets/working_dataset
```
- Change / add / remove files from the dataset folder
```bash
vim ~/datasets/working_dataset/everything.csv
```
#### Folder sync mode
Folder sync mode updates a dataset according to changes in a folder's content.<br/>
This is useful when there is a single source of truth, either a local or network folder, that gets updated periodically.
When using `clearml-data sync` and specifying a parent dataset, the folder changes will be reflected in a new dataset version.
This saves the time of manually adding / removing files (a programmatic sketch of the same flow follows this workflow).
- Sync local changes
``` bash
clearml-data sync --id <created_dataset_id> --folder ~/datasets/working_dataset
```
- Upload files (Optional: specify storage `--storage` `s3://bucket`, `gs://`, `azure://`, `/mnt/shared/`)
``` bash
clearml-data upload --id <created_dataset_id>
```
- Close dataset
``` bash
clearml-data close --id <created_dataset_id>
```
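The same folder-sync flow can also be driven from Python. A minimal sketch, assuming a parent dataset ID (placeholder below) and a local folder to mirror:
```python
import os
from clearml import Dataset

# Create a new dataset version on top of an existing parent (placeholder ID)
dataset = Dataset.create(
    dataset_name="working_dataset",
    dataset_project="datasets_examples",
    parent_datasets=["aabbccddeeff00112233445566778899"],
)

# Mirror the local folder into the new version, then upload and finalize it
dataset.sync_folder(local_path=os.path.expanduser("~/datasets/working_dataset"))
dataset.upload()
dataset.finalize()
```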
#### Command Line Interface Summary:
- **`search`** Search a dataset based on project / name / description / tag etc.
- **`list`** List the file directory content of a dataset (no need to download a copy of the dataset)
- **`verify`** Verify a local copy of a dataset (verify the dataset files' SHA2 hashes)
- **`create`** Create a new dataset (support extending/inheriting multiple parents)
- **`delete`** Delete a dataset
- **`add`** Add local files to a dataset
- **`sync`** Sync dataset with a local folder (source-of-truth being the local folder)
- **`remove`** Remove files from dataset (no need to download a copy of the dataset)
- **`get`** Get a local copy of the dataset (either readonly --link, or writable --copy)
- **`upload`** Upload the dataset (use --storage to specify storage target such as S3/GS/Azure/Folder, default: file server)
#### Under the hood (how it all works):
Each dataset instance stores the collection of files added/modified relative to the previous version (parent).
When requesting a copy of the dataset, all parent datasets in the graph are downloaded and merged into a new folder
containing all the changes introduced along the dataset DAG.
Implementation details:
A dataset's differential snapshot is stored in a single zip file, for efficiency in storage and network
bandwidth. A local cache is built into the process, making sure datasets are downloaded only once.
Each dataset stores the SHA2 hash of all its files.
To increase dataset fetching speed, only file sizes are verified automatically;
the SHA2 hashes are verified only on the user's request.
The design supports multiple parents per dataset, essentially merging all parents based on order.
To improve deep dataset DAG storage and speed, dataset squashing was introduced. A user can squash
a dataset, merging down all changes introduced in the DAG, creating a new flat version without parent datasets.
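Squashing is also exposed through the SDK. A hedged sketch, with placeholder dataset IDs along the same DAG:
```python
from clearml import Dataset

# Merge the listed versions into a single, flat dataset with no parents
flat = Dataset.squash(
    dataset_name="my_dataset_squashed",
    dataset_ids=["<first_version_id>", "<latest_version_id>"],
)
print(flat.id)
```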
### Datasets UI:
A dataset is represented as a special `Task` in the system. <br>
It is of type `data-processing` with a special tag `dataset`.
- Full log (calls / CLI) of the dataset creation process can be found in the "Execution" section.
- A listing of the dataset differential snapshot (summary of files added / modified / removed, and details of the files
in the snapshot: location / size / hash) is available in the Artifacts section.
- The full dataset listing (all files included) is available in the Configuration section under `Dataset Content`.
This allows you to quickly compare two dataset contents and visually see the difference.
- The dataset genealogy DAG and the change-set summary table are visualized under Results / Plots.
For more information, see the [ClearML Documentation](https://clear.ml/docs/latest/docs/clearml_data/).
