Small edits (#162)

This commit is contained in:
pollfly
2022-01-18 13:23:47 +02:00
committed by GitHub
parent 8f4851c5c1
commit e72ca23b54
24 changed files with 96 additions and 93 deletions


@@ -16,10 +16,10 @@ The below is only our opinion. ClearML was designed to fit into any workflow whe
 During early stages of model development, while code is still being modified heavily, this is the usual setup we'd expect to see used by data scientists:
-- A local development machine, usually a laptop (and usually using only CPU) with a fraction of the dataset for faster iterations - this is used for writing the training pipeline code, ensuring it knows to parse the data
-and there are no glaring bugs.
-- A workstation with a GPU, usually with a limited amount of memory for small batch-sizes. This is used to train the model and ensure the model we chose makes sense and that the training
-procedure works. Can be used to provide initial models for testing.
+- A local development machine, usually a laptop (and usually using only CPU) with a fraction of the dataset for faster
+iterations - Use a local machine for writing, training, and debugging pipeline code.
+- A workstation with a GPU, usually with a limited amount of memory for small batch-sizes - Use this workstation to train
+the model and ensure that you choose a model that makes sense, and the training procedure works. Can be used to provide initial models for testing.
 The abovementioned setups might be folded into each other and that's great! If you have a GPU machine for each researcher, that's awesome!
 The goal of this phase is to get a code, dataset and environment setup, so we can start digging to find the best model!


@@ -10,7 +10,7 @@ Now, we'll learn how to track Hyperparameters, Artifacts and Metrics!
 Every previously executed experiment is stored as a Task.
 A Task has a project and a name, both can be changed after the experiment has been executed.
-A Task is also automatically assigned an auto-generated unique identifier (UUID string) that cannot be changed and will always locate the same Task in the system.
+A Task is also automatically assigned an auto-generated unique identifier (UUID string) that cannot be changed and always locates the same Task in the system.
 It's possible to retrieve a Task object programmatically by querying the system based on either the Task ID,
 or project & name combination. It's also possible to query tasks based on their properties, like Tags.
@@ -26,7 +26,7 @@ Once we have a Task object we can query the state of the Task, get its Model, sc
 For full reproducibility, it's paramount to save Hyperparameters for each experiment. Since Hyperparameters can have substantial impact
 on Model performance, saving and comparing these between experiments is sometimes the key to understand model behavior.
-ClearML supports logging `argparse` module arguments out of the box, so once integrating it into the code, it will automatically log all parameters provided to the argument parser.
+ClearML supports logging `argparse` module arguments out of the box, so once ClearML is integrated into the code, it automatically logs all parameters provided to the argument parser.
 It's also possible to log parameter dictionaries (very useful when parsing an external config file and storing as a dict object),
 whole configuration files or even custom objects or [Hydra](https://hydra.cc/docs/intro/) configurations!
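The hunk above describes automatic `argparse` logging and explicit dictionary logging. A minimal sketch of both, assuming the `clearml` package is installed (the project/task names and the arguments are placeholders; `Task.init` would contact a ClearML server, so it is kept inside an uncalled `main()` here):

```python
import argparse

try:
    from clearml import Task  # assumed available; guarded so the sketch imports cleanly without it
except ImportError:
    Task = None

def build_parser():
    # Any argument added here is auto-logged once Task.init has been called
    parser = argparse.ArgumentParser()
    parser.add_argument('--lr', type=float, default=0.001)
    parser.add_argument('--batch-size', type=int, default=32)
    return parser

def main():
    # Hypothetical project/task names for illustration
    task = Task.init(project_name='examples', task_name='hyperparameters demo')
    args = build_parser().parse_args()

    # Plain dictionaries (e.g. parsed from an external config file) are
    # logged explicitly with task.connect()
    config = {'dropout': 0.25, 'epochs': 10}
    task.connect(config)

# In a real training script you would simply call main()
```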
@@ -46,7 +46,7 @@ Essentially, artifacts are files (or python objects) uploaded from a script and
 These Artifacts can be easily accessed by the web UI or programmatically.
 Artifacts can be stored anywhere, either on the ClearML server, or any object storage solution or shared folder.
-See all [storage capabilities](../../integrations/storage).
+See all [storage capabilities](../../integrations/storage.md).
 ### Adding Artifacts
@@ -84,9 +84,9 @@ local_csv = preprocess_task.artifacts['data'].get_local_copy()
 ```
 The `task.artifacts` is a dictionary where the keys are the Artifact names, and the returned object is the Artifact object.
-Calling `get_local_copy()` will return a local cached copy of the artifact,
-this means that the next time we execute the code we will not need to download the artifact again.
-Calling `get()` will get a deserialized pickled object.
+Calling `get_local_copy()` returns a local cached copy of the artifact. Therefore, next time we execute the code, we don't
+need to download the artifact again.
+Calling `get()` gets a deserialized pickled object.
 Check out the [artifacts retrieval](https://github.com/allegroai/clearml/blob/master/examples/reporting/artifacts_retrieval.py) example code.
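The `get_local_copy()` / `get()` distinction described in the hunk above can be sketched as follows, assuming `clearml` is installed and a task that logged an artifact named `'data'` exists (the project and task names are placeholders; nothing is executed at import time):

```python
try:
    from clearml import Task  # assumed available; guarded so the sketch imports cleanly without it
except ImportError:
    Task = None

def fetch_artifacts(project='examples', name='data processing'):
    # Look the task up by project & name combination (Task ID works too)
    preprocess_task = Task.get_task(project_name=project, task_name=name)

    # get_local_copy() downloads the file once; subsequent calls hit the cache
    csv_path = preprocess_task.artifacts['data'].get_local_copy()

    # get() returns the deserialized (unpickled) python object directly
    data_obj = preprocess_task.artifacts['data'].get()
    return csv_path, data_obj
```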
@@ -94,15 +94,15 @@ Check out the [artifacts retrieval](https://github.com/allegroai/clearml/blob/ma
 Models are a special kind artifact.
 Models created by popular frameworks (such as Pytorch, Tensorflow, Scikit-learn) are automatically logged by ClearML.
-All snapshots are automatically logged, in order to make sure we also automatically upload the model snapshot (instead of saving its local path)
+All snapshots are automatically logged. In order to make sure we also automatically upload the model snapshot (instead of saving its local path),
 we need to pass a storage location for the model files to be uploaded to.
-For example uploading all snapshots to our S3 bucket:
+For example, upload all snapshots to an S3 bucket:
 ```python
 task = Task.init(project_name='examples', task_name='storing model', output_uri='s3://my_models/')
 ```
-From now on, whenever the framework (TF/Keras/PyTorch etc.) will be storing a snapshot, the model file will automatically get uploaded to our bucket under a specific folder for the experiment.
+Now, whenever the framework (TF/Keras/PyTorch etc.) stores a snapshot, the model file is automatically uploaded to the bucket to a specific folder for the experiment.
 Loading models by a framework is also logged by the system, these models appear under the “Input Models” section, under the Artifacts tab.
@@ -124,7 +124,7 @@ Like before we have to get the instance of the Task training the original weight
 :::note
 Using Tensorflow, the snapshots are stored in a folder, meaning the `local_weights_path` will point to a folder containing our requested snapshot.
 :::
-As with Artifacts all models are cached, meaning the next time we will run this code, no model will need to be downloaded.
+As with Artifacts, all models are cached, meaning the next time we run this code, no model needs to be downloaded.
 Once one of the frameworks will load the weights file, the running Task will be automatically updated with “Input Model” pointing directly to the original training Tasks Model.
 This feature allows you to easily get a full genealogy of every trained and used model by your system!
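Fetching a stored snapshot from an earlier training task, as the hunk above describes, can be sketched like this (assuming `clearml` is installed; project/task names are placeholders and nothing runs at import time):

```python
try:
    from clearml import Task  # assumed available; guarded so the sketch imports cleanly without it
except ImportError:
    Task = None

def fetch_latest_snapshot(project='examples', name='storing model'):
    train_task = Task.get_task(project_name=project, task_name=name)

    # task.models holds the 'input' and 'output' model lists; the last
    # output model is the most recent snapshot
    last_model = train_task.models['output'][-1]

    # Like artifacts, model downloads are cached; with Tensorflow this
    # path points to a folder containing the requested snapshot
    local_weights_path = last_model.get_local_copy()
    return local_weights_path
```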
@@ -150,7 +150,7 @@ The experiment table is a powerful tool for creating dashboards and views of you
 ### Creating Leaderboards
-The [experiments table](../../webapp/webapp_exp_table.md) can be customized to your own needs, adding desired views of parameters, metrics and tags.
+Customize the [experiments table](../../webapp/webapp_exp_table.md) to fit your own needs, adding desired views of parameters, metrics and tags.
 It's possible to filter and sort based on parameters and metrics, so creating custom views is simple and flexible.
 Create a dashboard for a project, presenting the latest Models and their accuracy scores, for immediate insights.


@@ -115,7 +115,7 @@ Task.enqueue(task=cloned_task, queue_name='default')
 ```
 ### Advanced Usage
-Before execution, there are a variety of programmatic methods which can be used to manipulate a task object.
+Before execution, use a variety of programmatic methods to manipulate a task object.
 #### Modify Hyperparameters
 [Hyperparameters](../../fundamentals/hyperparameters.md) are an integral part of Machine Learning code as they let you
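The clone → modify hyperparameters → enqueue flow touched by this hunk can be sketched as one function, assuming `clearml` is installed (the task ID, parameter name `'Args/lr'`, and clone name are placeholders; `'Args/lr'` assumes the original task logged an argparse argument called `lr`; nothing runs at import time):

```python
try:
    from clearml import Task  # assumed available; guarded so the sketch imports cleanly without it
except ImportError:
    Task = None

def clone_and_enqueue(template_task_id, queue='default'):
    template = Task.get_task(task_id=template_task_id)

    # Clone the template task; the clone starts in Draft state
    cloned_task = Task.clone(source_task=template, name='clone with new lr')

    # Override a hyperparameter before execution
    cloned_task.set_parameters({'Args/lr': 0.01})

    # Push the modified clone onto an execution queue for an agent to run
    Task.enqueue(task=cloned_task, queue_name=queue)
    return cloned_task
```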


@@ -7,7 +7,10 @@ Pipelines provide users with a greater level of abstraction and automation, with
 Tasks can interface with other Tasks in the pipeline and leverage other Tasks' work products.
 We'll go through a scenario where users create a Dataset, process the data then consume it with another task, all running as a pipeline.
+The sections below describe the following scenarios:
+* Dataset creation
+* Data processing and consumption
+* Pipeline building
 ## Building Tasks
@@ -56,11 +59,11 @@ dataset.tags = []
 new_dataset.tags = ['latest']
 ```
-We passed the `parents` argument when we created v2 of the Dataset, this inherits all the parent's version content.
-This will not only help us in tracing back dataset changes with full genealogy, but will also make our storage more efficient,
-as it will only store the files that were changed / added from the parent versions.
-When we will later need access to the Dataset it will automatically merge the files from all parent versions
-in a fully automatic and transparent process, as if they were always part of the requested Dataset.
+We passed the `parents` argument when we created v2 of the Dataset, which inherits all the parent's version content.
+This not only helps trace back dataset changes with full genealogy, but also makes our storage more efficient,
+since it only stores the changed and / or added files from the parent versions.
+When we access the Dataset, it automatically merges the files from all parent versions
+in a fully automatic and transparent process, as if the files were always part of the requested Dataset.
 ### Training
 We can now train our model with the **latest** Dataset we have in the system.
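The parent-version mechanism described in the hunk above can be sketched as follows, assuming `clearml` is installed (the dataset project/name and directory path are placeholders; nothing runs at import time):

```python
try:
    from clearml import Dataset  # assumed available; guarded so the sketch imports cleanly without it
except ImportError:
    Dataset = None

def create_child_version(parent_dataset_id, new_files_dir):
    # Passing parent_datasets makes v2 inherit the parent's content and
    # store only the files changed / added relative to it; retrieval later
    # transparently merges all parent versions
    new_dataset = Dataset.create(
        dataset_project='examples',
        dataset_name='dataset v2',
        parent_datasets=[parent_dataset_id],
    )
    new_dataset.add_files(new_files_dir)
    new_dataset.upload()
    new_dataset.finalize()
    return new_dataset
```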