Small edits (#451)

This commit is contained in:
pollfly
2023-01-23 15:04:24 +02:00
committed by GitHub
parent 2fd532b2c3
commit e8d0267bbd
18 changed files with 44 additions and 42 deletions

@@ -1,6 +1,6 @@
---
title: Hyperdatasets Data Versioning
description: Learn more about the hyperdatasets, a supercharged version of ClearML Data.
title: Hyper-Datasets Data Versioning
description: Learn more about the Hyper-Datasets, a supercharged version of ClearML Data.
keywords: [mlops, components, hyperdatasets]
---
@@ -21,11 +21,11 @@ keywords: [mlops, components, hyperdatasets]
<summary className="cml-expansion-panel-summary">Read the transcript</summary>
<div className="cml-expansion-panel-content">
Hello and welcome to ClearML. In this video, we're taking a closer look at hyperdatasets, a supercharged version of ClearML Data.
Hello and welcome to ClearML. In this video, we're taking a closer look at Hyper-Datasets, a supercharged version of ClearML Data.
Hyperdatasets is a data management system that's designed for unstructured data like text, audio, or visual data. It is part of the ClearML paid offering, which means it brings along quite a few upgrades over the open source `clearml-data`.
Hyper-Datasets is a data management system that's designed for unstructured data like text, audio, or visual data. It is part of the ClearML paid offering, which means it brings along quite a few upgrades over the open source `clearml-data`.
The main conceptual difference between the two is that hyperdatasets decouple the metadata from the raw data files. This allows you to manipulate the metadata in all kinds of ways while abstracting away the logistics of having to deal with large amounts of data.
The main conceptual difference between the two is that Hyper-Datasets decouple the metadata from the raw data files. This allows you to manipulate the metadata in all kinds of ways while abstracting away the logistics of having to deal with large amounts of data.
Manipulating the metadata is done through queries and parameters, both of which can then be tracked using the experiment manager.
@@ -35,9 +35,9 @@ The data manipulations themselves become part of the experiment, we call it a da
By contrast, in ClearML Data, just like many other data versioning tools, the data and the metadata are entangled. Take this example where the label of the image is defined by which folder it is in, a common dataset structure. What if I want to train only on donuts? Or what if I have a large class imbalance? I still have to download the whole dataset even though I might only be using a small part of it. Then I have to change my code to only grab the donut images or to rebalance my classes by over or under sampling them. If later I want to add waffles to the mix, I have to change my code again.
Let's take a look at an example that will show you how to use hyperdatasets to debug an underperforming model. But first, we start where any good data science project starts: data exploration.
Let's take a look at an example that will show you how to use Hyper-Datasets to debug an underperforming model. But first, we start where any good data science project starts: data exploration.
When you open hyperdatasets to explore a dataset, you can find the version history of that dataset here. Datasets can have multiple versions, which in turn can have multiple child versions. Each of the child versions will inherit the contents of their parents.
When you open Hyper-Datasets to explore a dataset, you can find the version history of that dataset here. Datasets can have multiple versions, which in turn can have multiple child versions. Each of the child versions will inherit the contents of their parents.
By default, a dataset version will be in draft mode, meaning it can still be modified. You can press the publish button to essentially lock it to make sure it will not change anymore. If you want to make changes to a published dataset version, make a new version that's based on it.
@@ -53,13 +53,13 @@ The goal of these queries is not to simply serve as a neat filter for data explo
Enter the dataviews that I introduced in the beginning of this video. Dataviews can use sophisticated queries to connect specific data from one or more datasets to an experiment in the experiment manager. Essentially it creates and manages local views of remote Datasets.
As an example, imagine you have created an experiment that tries to train a model based on a specific subset of data using hyperdatasets.
As an example, imagine you have created an experiment that tries to train a model based on a specific subset of data using Hyper-Datasets.
To get the data you need to train on, you can easily create a dataview from code like so. Then you can add all sorts of constraints, like class filters, metadata filters, and class weights which will over or under sample the data as is required.
After running the task, we can see it in the experiment manager. The model is reporting scalars and training as we would expect. When using hyperdatasets, there is also a dataviews tab with all of the possibilities at your disposal. You can see which input datasets and versions you used and can see the querying system that is used to subset them. This will already give you a nice, clean way to train your models on a very specific subset of the data, but there is more!
After running the task, we can see it in the experiment manager. The model is reporting scalars and training as we would expect. When using Hyper-Datasets, there is also a dataviews tab with all of the possibilities at your disposal. You can see which input datasets and versions you used and can see the querying system that is used to subset them. This will already give you a nice, clean way to train your models on a very specific subset of the data, but there is more!
If you want to remap labels, or enumerate them to integers on-the-fly, ClearML will keep track of all the transformations that are done and make sure they are reproducible. There is, of course, more still, so if you're interested, check out our documentation on hyperdatasets.
If you want to remap labels, or enumerate them to integers on-the-fly, ClearML will keep track of all the transformations that are done and make sure they are reproducible. There is, of course, more still, so if you're interested, check out our documentation on Hyper-Datasets.
ClearML veterans already know what's coming next. Cloning.
@@ -73,6 +73,6 @@ After the remote machine has executed the experiment on the new dataview, we can
If you've been following along with the other Getting Started videos, you should already start to see the potential this approach can have. For example: we could now run hyperparameter optimization on the data itself, because all of the filters and settings previously shown are just parameters on a task. The whole process could be running in parallel on a cloud autoscaler, for example. Imagine finding the best training data confidence threshold for each class to optimize the model performance.
If you're interested in using Hyperdatasets for your team, then contact us using our website and we'll get you going in no time. In the meantime, you can enjoy the power of the open source components at app.clear.ml, and don't forget to join our Slack channel if you need any help!
If you're interested in using Hyper-Datasets for your team, then contact us using our website and we'll get you going in no time. In the meantime, you can enjoy the power of the open source components at app.clear.ml, and don't forget to join our Slack channel if you need any help!
</div>
</details>
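
The transcript describes dataviews as queries plus parameters (class filters, metadata filters, class weights) applied to metadata that is kept separate from the raw files. The sketch below illustrates that idea in plain Python only. It is not the ClearML SDK: `Frame`, `Query`, `ToyDataView`, the `s3://` paths, and the confidence field are all hypothetical names chosen for illustration.

```python
import random
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Frame:
    """Metadata for one raw file; the file itself stays in remote storage."""
    source: str        # hypothetical URI of the raw data file
    label: str
    confidence: float


@dataclass
class Query:
    """One rule of the view: which frames match, and how often to sample them."""
    label: str
    min_confidence: float = 0.0
    weight: float = 1.0


@dataclass
class ToyDataView:
    """Toy stand-in for a dataview: a list of queries over frame metadata."""
    frames: List[Frame]
    queries: List[Query] = field(default_factory=list)

    def add_query(self, label, min_confidence=0.0, weight=1.0):
        self.queries.append(Query(label, min_confidence, weight))

    def matches(self, query):
        """Return the frames selected by a single query (class + metadata filter)."""
        return [
            f for f in self.frames
            if f.label == query.label and f.confidence >= query.min_confidence
        ]

    def sample(self, n, seed=0):
        """Draw n frames; a query's weight over- or under-samples its matches."""
        rng = random.Random(seed)
        pools = [(self.matches(q), q.weight) for q in self.queries]
        pools = [(pool, w) for pool, w in pools if pool]  # skip empty queries
        if not pools:
            return []
        populations = [pool for pool, _ in pools]
        weights = [w for _, w in pools]
        picked = []
        for _ in range(n):
            pool = rng.choices(populations, weights=weights, k=1)[0]
            picked.append(rng.choice(pool))
        return picked


frames = [
    Frame("s3://bucket/img_001.jpg", "donut", 0.95),
    Frame("s3://bucket/img_002.jpg", "donut", 0.40),
    Frame("s3://bucket/img_003.jpg", "waffle", 0.90),
    Frame("s3://bucket/img_004.jpg", "bagel", 0.85),
]

view = ToyDataView(frames)
view.add_query("donut", min_confidence=0.5, weight=1.0)
view.add_query("waffle", weight=3.0)   # oversample waffles relative to donuts
batch = view.sample(8)
```

Because the label, the confidence threshold, and the weight are ordinary parameters, adding waffles or rebalancing classes means changing a query rather than the training code, and no raw file outside the view needs to be downloaded.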