diff --git a/docs/getting_started/video_tutorials/hands-on_mlops_tutorials/how_clearml_is_used_by_a_data_scientist.md b/docs/getting_started/video_tutorials/hands-on_mlops_tutorials/how_clearml_is_used_by_a_data_scientist.md
index 2be0aa82..41dc7878 100644
--- a/docs/getting_started/video_tutorials/hands-on_mlops_tutorials/how_clearml_is_used_by_a_data_scientist.md
+++ b/docs/getting_started/video_tutorials/hands-on_mlops_tutorials/how_clearml_is_used_by_a_data_scientist.md
@@ -19,7 +19,304 @@ title: How ClearML is used by a Data Scientist
Read the transcript
-Welcome to ClearML! In this video, I'll try to walk you through a day in my life where I try to optimize a model and I'll be teaching you how I used to do it before I was working for ClearML and then now that I'm using ClearML all the time, what kind of problems it solved and what, how it made my life easier. So let's get started here. You can see the overview of the code so I'm not going to dive into the code immediately, I'm just going to give you some context and then we'll go deeper from there. So the idea is that I'm doing audio classification here. I have a client on for which I want to give like a proof of concept on how well it can work and I'm doing that on the Urbansound dataset. So the first thing I'll do and you'll see that later is I'll get the data from the Urbansound servers. I'm using a script called get_data.py for that and then for reasons I'll go further into in the video I'm actually putting all of that data into a ClearML dataset which is a special kind of dataset task or like a special kind of ClearML task that can keep track of your data. Then the preprocessing.py script will get that data and then convert the WAV files or like the audio files to spectrum images. Essentially you're turning all your data into image data because the models that do image data are actually very very easy to work with and are pretty good so you can actually do the classification by using by classifying an image from your audio instead of classifying your audio as a whole. Really cool stuff. So that will convert the WAV files into spectrum images and then send it to a new version of that same dataset so that I can keep track of where everything is going and then that new data I will use to train a model right. And I'm using training.py exactly for that. So let's go to the code and get a look on how this looks in in real life, right? We have here the get_data.py script which looks like this: We have the preprocessing.py which looks like this and we have the training.py which looks like this. I've collapsed a lot of the functions here so that it's a lot easier to take a look. The first thing you'll notice when I'm going through these files is the Task.init command and essentially this is what ClearML uses to keep track of every every time you run this specific script. So you'll see it in get_data.py You'll see it in preprocessing.py and you'll see it in training.py as well. And so these this line is all you need to get started. It will already start capturing everything that you'll need and that the that the program produces like plots or hyper parameters, you name it. So let's take a look in depth first at what getdata.py does for me, right? So getting data is very simple, but what I used to do is I would get the data from like a remote location, You download a zip file or whatever and then you extract it to your local folder and then you start working on that. Now the problem with that is it's really difficult to keep that thing clean. So it's yeah. How would I version that right if I add data to it? For example, the preprocessed data we'll see later. How can I keep my correct version? How did I? How do I know if the data changes over time? When did I do that? Like can I rerun the models that I trained on previous data on the new data. like when just to keep an overview of how all of this data is flowing. it's a lot easier to use a ClearML dataset instead. So what I'm doing here and this is actually really cool. 
I'm using a single link to a zip file that I made which is a subset of the complete data so it only has like 120 samples or something and then we use that to iterate really quickly and then we. We also have the part to the Urbansounds full dataset which we then label as full dataset and that will give us the freedom to switch between subset and full dataset. So I will essentially create two ClearML data versions, one with the subset, one with the full dataset and that will allow me to very quickly change without having the whole thing on my um, yeah, with different versions on my desk all the time. And so what I used to do is then have different versions or different folders and then probably different folders with different names as well for every time you do it again. but then if you don't change the name, you overwrite it. so that's that's all the thing of the past Now we have nice and clear. I'll show it to you later in the UI we have a nice and clear overview of all of the different versions. I'll add some dataset statistics that's also something you can do and ClearML is just add some, for example, class distribution or other kind of plots that could be interesting and then I'm actually building the ClearML dataset here. Also an an extra thing that is really really useful if you use ClearML datasets is you can actually share it as well. so not only with colleagues and friends. for example, you can share the data with them and they can add to the data and always you will always have the latest version, you will always know what happened before that. There's also the possibility of using the data on a different machine, which is really, really useful because this machine isn't the most powerful of them all. And I want to train on the full dataset on a different machine that has a GPU and then I can just point it to the ID of the dataset and it will just grab the latest version and we're good to go. So that's a lot easier than trying to keep the dataset versions in sync over two machines, which is one of the main things that ClearML datasets tries to solve. So that's what the dataset or the get_data.py does for you, and then we have the preprocessing.py which is relatively simple. Essentially, what I'm doing is, I'll get the data from the get_data.py So the previous dataset version. I'll get that data and then each line by line. So each, every, each and every sample in that dataset will then be preprocessed using the preprocessing class, which will just calculate a mel spectrogram if you're into that kind of thing. but I won't go into depth in about it here. Essentially, we'll create a mel spectrogram for each sample that will give us an image, and then we take that image and put it into a different dataset, which now has the same structure as the previous dataset, but now also with images there. And because the WAV files or the audio files are already in the previous dataset, this new version will only upload the images that we just produced. It won't duplicate the data because it knows it's already in a previous version. It will just reuse that instead. So that also saves a bit of disk space, if you're trying to put it on on the cloud as well. Now how I used to do this before ClearML is actually creating a new folder with a unique name for that specific run and then putting all of the WAV files in there, or sorry, all of the images in there. But that's that's just a huge mess, right? We've all done this. 
But then you forget to change the name and then you overwrite your previous samples. But you also don't know if you're just running through it. You don't know what kind of code or like what the code was that created your previous versions right? so they're not saved together which is a huge mess. It gets out of hand really quickly. You end up with a huge folder full of different names and and like versions, but the original code isn't attached. The original plots aren't attached so that's really annoying. And that is what ClearML data does for you is it will keep track of this, but it will also keep track of which run of your code actually produced this, and that allows you to always go back and see if you made any mistakes. You can always go back, which allows you to iterate a lot faster. And then finally we have the training script. If I go to the training script, you also again see the task.init So we want to capture every single time that we run this code and then you can also see that I made a huge configuration dict. So this is essentially every parameter that I use in my code is in this dictionary and then I connect it to the task and we'll see later why that is really really useful. But for now at the very least what it will do is it will keep track of all of these parameters so I can very easily verify in the UI that we'll see later where those parameters came from, what they're doing, in which case which parameters were used. It just keeps track of everything which is really really nice. I just read set to random seeds, I put it on a CUDA device if it's available and then there is a Tensorboard writer. So I like to use Tensorboard which is like the industry standard right to keep track of my logs and outputs. And what is really cool about ClearML is it will automatically detect that you're using Tensorboard and you don't have to manually log everything to ClearML as well. ClearML will just say oh, you log this to Tensorboard, I'll take it and I'll log it to ClearML as well. Really nice. And then I just prepare the data. I get my model, which is dependent on the different parameters that I just showed you. Then I plot the signal train, eval and model. And if I plot things with MathPlotLib for example, that's also automatically captured by ClearML. So that's again something that I don't have to think about. But the plots are all saved together with my code together with those hyper parameters you just saw together with the output, which is really handy. But then there is a last thing that I want to focus on. And that is the model files. So again, before I used ClearML, the model files, I would essentially create one long name for the file name, essentially with just underscores and all of the different parameters in there, so that in every model file, I could easily see what the parameters were that I used over time to create those model files. But that's just a huge mess because the amount of parameters that you use changes over time, you add more parameters, you just destroy some, and then it gets a huge mess because you can't go back to the code that actually used those parameters. And if you're looking like this, a configuration dict is quite long. Look at those parameters. What if I want to include those classes? It's a huge parameter. I can't just add it to the file size or to the file name and and not have it become a mess. 
So to be able to connect these parameters to the model files that they output, to the plots that it produced to the original code that produced all of this. We need ClearML or you need some kind of experiment manager at the very least. And this is what ClearML is doing for me. So now that you've seen all of my code, let's take a look at the ClearML UI where all of this code actually gets tracked and you have this nice overview of your experiments. So if I go to the dashboard, you will see my recent projects and my recent experiments. Let's dive into the recent projects that I'm working on currently, which is the video or day in the life of video and that's exactly what we're working on right now. And as you can see, it's a mix between the training which is training.py that we just saw, downloading data which is get_data.py that we just saw and preprocessing which is preprocessing.py that we just saw. We also have tags, which is really really handy because every time I run a specific script on either the subset or the full dataset, I can tag it as such and in that way I can easily filter for example, on saying hey, I only want to see my tasks that have been run on the subset for now. Which is really really nice to get started with. And then there's also of course usually a huge mess right? So what I start to do as well is you have a status of your task and I tend to only show everything that is not failed or aborted just to get enough just to get all of these models or training runs that I did that failed or that I made a code mistake or whatever. Just get them out of there. They just clutter the whole thing. You can do whatever you want of course, but this is my preference. And then there is the option to sort as well. So in any of these columns you can just sort on that property and in this case I'm going to sort on name and then quick tip for you. If you use shift click you can sort secondarily on that column as well. So right now I'm sorted on training first and then all of the uh sorry on name first and then all of the experiments that have the same name are secondarily sorted on their started date or started moment which will give me the most recent one of each batch. And then you can see here the training. Last training that I did was 18 hours ago and if I scroll down a while ago I did some preprocessing. I did some downloading of the data and that is also sorted quite nicely. You can also see on the tag that this is specifically dataset and then you can also see if I go to this dataset which is really cool. As I said before, if we go to the diagram we have preprocessing.py creates a new version of the original dataset, so if we're going to look at the plots here, we can actually see this. This is an overview of your dataset, genealogy or lineage and it will keep track of every single time you created a new version and where that data came from. So you can always go back and see what the original data was, what the original task was I've made. And here you get a summary of what all the things were that were added. So in the preprocessed dataset, we created 109 new images that we added to the 111 WAV files that were already there, which actually immediately tells you there might be something wrong there because we missed two audio files that we didn't convert into image files. So it can actually help for debugging as well. 
Now if we go back to the experiment table here, we can take a little bit of a deeper look into the training runs that I've been doing recently just to get my bearings before I start making a new model. I can click this button or right click and press details to get more in-depth on this specific training run. So what you can see here is: I've tracked my repository. I've tracked every uncommitted change as well, which will allow you to always go back to the original code if you needed to. It installed every package or all the installed packages are tracked as well so that we can later reproduce this thing on another machine, but we'll go more into that in a later part of the video. There's also configuration, so these are the configuration dict values that I showed you before in the training script, right? So you can see everything here and it gives you a nice and clean overview of everything that is going on. There's artifacts, so this is the model that I was talking about that is being tracked and saved as well. Then there is info on which machine it was run and stuff like that. And then there is a console output or the results of everything that is basically outputted by your code. So the results here are either a console output which is how the training looked and essentially what has been printed to my console back then. There's a scalars, which is something that is of value in the training of your model. So in this case, it would be the F1 score or the performance of my model over time or over the iterations. And I also have a training loss, which looks very weird here, but we can figure out where that came from because we can analyze it right now. It will also monitor your machine usage and your GPU usage and stuff like that, and then the learning rate for example as well. So this will give you a really, really quick overview of the most important metrics that you're trying to solve. And keep in mind this F1 score because this is the thing that we're trying to optimize here. Then plots, I can for example, plot a confusion matrix every X iterations. So in this case for example, after a few iterations, I plot the confusion matrix again just so I can see over time how well the model becomes or starts performing. So as you can see here, a perfect confusion matrix will be a diagonal line because every true label will be combined with the exact same predicted label. And in this case, it's horribly wrong. But then over time it starts getting closer and closer to this diagonal shape that we're trying to get to. So this is showing me that at least in this point it's learning something, it's doing something so that actually is very interesting. And then you have debug samples as well, which you can use to show actually whatever kind of media you need. So these are for example, the images that I generated that are the mel spectrograms so that the preprocessing outputs uh, and you can just show them here with the name of what the label was and what to predict it was. So I can just have a very quick overview of how this is working and then I can actually even do it with audio samples as well. So I can for example here say this is labeled dog and it is predicted as children playing. So then I can listen to it and get an idea on, is this correct? Is it not correct? In this case, obviously it's not correct, but then I can go further into the iterations and then hopefully it will get better and better over time. 
But this is a quick way that I can just validate that what I'm seeing here, what the model is doing is actually translatable into: Yes, this is correct. This is a correct assumption of the audio here. The last thing I want to show you is how you can customize this table so it's quite easy to just say okay, I don't want for example the name of who run it or whatever. But you can also do this really cool thing which is called adding custom columns and I use this all the time in my daily life. It makes it makes everything so much easier because I can add a metric as well which is one of the scalars that we that we saw before. So if I use the F1 score here and take the maximum of that, I can see the max F1 score of every single training run in this list and then I can sort on that to just get a leaderboard essentially which will give me a nice overview of the best models that I have and then I can just dive in deeper to figure out why they were so good, right? So, But now it's time to actually start changing some stuff, right? So this is the beginning of my day. I've just gotten my bearings, I know what the latest models were, and the score to beat here is 0.571429, and that's the F1 score we're trying to beat on the subset and if the moment that we find a combination of parameters or a change of the code that does better than this or that is in the general ballpark of this, we can actually then run it on the full dataset as well. But I'll tell you more about that later. So the first thing we're going to do is going back to training the training.py script. I might want to change several parameters here, but what I've read something that I've been interested in and while getting the model, I see that here I still use the optimizer stochastic gradient descent and it could be really interesting to see how it compares if I change this to atom. Now the atom optimizer is a really, really good one, so maybe it can give me an edge here. Of course, Atom doesn't have the same parameters as SGD has, so I'm removing the momentum here because the atom optimizer doesn't really care about momentum. It's not using that. So all that I have to do now is just run my training run and all will be well. So you can see here that ClearML created a new task and it's starting to train the model. So it's using a specific dataset ID which you can find in the configuration dict. I set it to this dataset tag, use the latest dataset using a subset tag so in that case it will get the latest data that is only in the subset. So that's what we're training on here. You can see I have 102 samples in the training set, only seven in the test set. This is why it's subset and now you can see that it's training in the app box and if I go to the experiment overview and I take a look at what is here, I can see that the training run, here, I'll sort it on started up front so that we have it up top. I can see that the training run here is in status running, which means it's essentially reporting to ClearML which is what exactly what we want. And if I go to the details tab, I can go to results and see the console output being logged here in real time and the causal output might not be this interesting to yeah to keep staring at, but what is interesting to keep staring at is the scalars. So here you can see the F1 score and the training loss go up or down before your eyes and that's really cool because then I can keep track of it or like have it as a dashboard somewhere just so that I know what is happening out there. 
So I'm going to fast forward this a little bit until it's completely done and I will go into more in-depth analysis of this whole thing. So right now we see that it's completed and if we go back to what we had before and I sort again by the F1 score, we see that the newest training run that we just did two minutes ago and it was updated a few seconds ago is actually better than what we had before. So it seems that the atom optimizer in fact does have a big diff, a big effect on what we're doing. And just to make sure that we didn't overlook anything, what I can do is I can select both my new model my new best model and the previous model that I had and then compare them. So it's what I have right here and everything that you just saw that was tracked, be it hyper parameters or plots or whatever can all be compared between different training runs. So what we can see here if we click on execution, we have some uncommitted changes that are obviously different, and then if we scroll down, what we can see is that for example, here the optimizer, the atom optimizer was added and the optimizer SGD was removed. So this already gives us the idea of okay, this is what changed. This is really interesting and we can always also use these differences to then go back to the original code. Of course, hyper parameters. There weren't any differences. We didn't actually change any of the hyper parameters here, but if we did, that would also be highlighted in red in this section. So if we're going to look at the scalars, this is where it gets really interesting because now the plots are overlaid on top of each other and you can change the color if you don't if you don't like the color. I think green is a bit ugly. So let's take red for example. We can just change that here. And then we have a quick overview of the different the two different compared experiments and then how their scalars did over time. And because they have the same X-axis the iterations, we can actually compare them immediately to each other, which is really, really cool. We can actually even see how the GPU memory usage or the GPU utilization has fared over time, which is really interesting. And then things come up like like for example, in this case, the higher F1 score which is in our case, the atom optimizer, had a higher loss as well, which is really interesting and we might might want to take a look at that. Also, the spikes are still there. Why are they there? So this is really handy if you want to dive deep into the analysis of your model, then we also have plots. For example, we can compare the confusion matrix between the SGD optimizer and the atom optimizer. So again, very useful and the same thing happens with our debug samples. So if we want to see the same audio samples be compared between the different experiments, that makes it very, very easy. So now we can look at the label dog barked and see how both experiments predicted it. But now we're going to start with a very interesting part. Remember, I had a subset and a full dataset. Now the subset is very easy to iterate on quickly and to run for example, on CPU. But now we have a model. If we go back to our overview. Now we have a model that's very, very good, even on the subset. So the first thing I want to do now is run the same thing with on the full dataset instead and I don't want to do this on my current machine because it's too small and it doesn't have a GPU and whatnot so we can clone the experiment and then run it on a different machine straight from the UI. 
So strap in. I'm going to show you how. So what you're going to do is right click the experiment that you're interested in, and then go to clone. You'll get a clone experiment dialog. I want it on the same project of course and then I want to keep running. Keep calling it training just to have- it's what's my preference, but you can call it whatever you want and then I'm going to clone it. Now it's in draft mode. What draft mode allows you to do is remember all of the hyper parameters that we had before. So all of these hyper parameters are now editable so I can go into these hyper parameters that were tracked from our code before. I can show you these are the same parameters here. And then I can change whatever I want here. So essentially I can change the dataset tag to full dataset so that it will grab the latest dataset version with the full dataset tag and not the subset tag. And then because the data is so big, I can change the batch sizes to be a little bit higher, so I can change it to let's say 64 and 64. Let's keep the seed and the number of epochs and everything the same as it was and then I save it. So now I've edited these hyper parameters and gotten the script the whole thing ready for deployment. So now what I want to do is, I can right click or go to these these bars there and then say enqueue and what that will do is, it will put that experiment into the queue and I have a stronger machine, a machine with a GPU as a ClearML agent. And that agent is currently listening to a queue and it will grab experiments that are enqueued and start running them. And because we tracked everything, the code, the hyper parameters, the original packages. The agent, the different machine, has no issues with completely reproducing my experiment, but just on a different dataset. and so now I can click on enqueue. I just want to enqueue it in the default queue because my agent is listening to the default queue and it is currently pending. As we can see here, if I now go to workers and queues, what you can see is that I have my available worker here and it is currently running my training experiment. So if I click on the training experiment, it will get me back to the experiment that we just edited. So if I go to the configuration we see that the batch sizes and the full dataset is right here. And it's currently running, but it's not running on this machine. It's running on a different machine that is hosting the ClearML agent and it was listening to the queue. If we go to the results and the console, what you'll see here is that the output of this agent is essentially right now, just showing that it's trying to reproduce the environment that I had before. So now the agent is installing all the packages and is installing the whole environment to be able to run your code without any issues on its own. And then we'll be able to follow along with the scalers as well and the plots just as we would on any other task. How cool is that? That's that's awesome. So I'm going to let this run for a while and we'll come back. We'll keep it on the scalars tab so that you can see the progress being made and then we can see the whole loss and F1 score grow and go down over time. But on the full dataset this time and then I'll come back when it's done. All right, we're back. It finally trained. We can see the F1 score over time. We can see the training loss over time. So this is already a big difference with the subset that we saw before. 
And now if we go to projects and then little video, which is the current project, we can now change our sorts for our tag sort to full dataset. And this will give us the full range of experiments that we trained this way on the full dataset and you can clearly see that even though it got the most or the highest F1 score on the subset, we don't actually have the highest score on the full dataset yet. However, even though it is not the best model, it might be interesting to get a colleague or like a friend to take a look at it and see what we could do better or just show off the new model that you made. So the last thing I want to show you is that you can now easily click it, right click, and then go to share and you can share it publicly. If you create a link, you can send this link to your friend, colleague, whatever and they will be able to see the complete details of the whole experiment, of everything you did, you can see the graphs, they can see the hyper parameters, and I can help you find the best ways forward for your own models. So I hope this kind of inspired you a little bit to try out ClearML. It's free to try at App.clear.ml or you can even host your own open source server with the interface that you can see right now. So why not have a go at it? And thank you for watching.
+Welcome to ClearML! In this video, I'll walk you through a day in my life where I try to optimize a model. I'll show
+you how I used to do it before I was working for ClearML, and now that I'm using ClearML all the time, what kind of
+problems it solved and how it made my life easier. So let's get started here.
+
+You can see the overview of the code, so I'm not going to dive into the code immediately, I'm just going to give you some
+context, and then we'll go deeper from there.
+
+So the idea is that I'm doing audio classification here. I have a client for whom I want to put together a proof of
+concept of how well it can work, and I'm doing that on the Urbansound dataset. So the first thing I'll do, and you'll
+see that later, is get the data from the Urbansound servers. I'm using a script called `get_data.py` for that, and then,
+for reasons I'll go further into later in the video, I'm actually putting all of that data into a ClearML dataset, which
+is a special kind of ClearML task that can keep track of your data. Then the `preprocessing.py` script will get that
+data and convert the WAV files, the audio files, into spectrogram images. Essentially you're turning all your data into
+image data, because models that work on image data are very easy to work with and are pretty good, so you can do the
+classification by classifying an image of your audio instead of classifying your audio as a whole. Really cool stuff.
+
+So that will convert the WAV files into spectrogram images and then send them to a new version of that same dataset,
+so that I can keep track of where everything is going, and then that new data is what I'll use to train a model. And
+I'm using `training.py` exactly for that.
+
+So let's go to the code and get a look at how this works in real life. We have the `get_data.py` script, which
+looks like this; we have `preprocessing.py`, which looks like this; and we have `training.py`, which looks like this.
+I've collapsed a lot of the functions here so that it's a lot easier to take a look. The first thing you'll notice when
+going through these files is the `Task.init` command, and essentially this is what ClearML uses to keep track of every
+time you run this specific script. So you'll see it in `get_data.py`, you'll see it in `preprocessing.py`, and you'll
+see it in `training.py` as well. This one line is all you need to get started. It will already start capturing
+everything that you'll need and that the program produces: plots, hyperparameters, you name it.
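+
+For reference, a minimal sketch of what that single line looks like at the top of each script. The project name, task
+name, and task type below are placeholder assumptions, not necessarily the exact values used in the actual scripts:
+
+```python
+from clearml import Task
+
+# One call at the top of the script is enough: ClearML starts tracking this run,
+# including the code, installed packages, console output, plots, and reported scalars.
+task = Task.init(
+    project_name="Day in the Life",            # assumed project name
+    task_name="download data",                 # assumed task name
+    task_type=Task.TaskTypes.data_processing,  # e.g. data_processing for get_data.py
+)
+```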
+
+So let's first take an in-depth look at what `get_data.py` does for me. Getting data is very simple, but what I used
+to do is get the data from a remote location: you download a zip file or whatever, you extract it to your local
+folder, and then you start working on that. Now the problem with that is that it's really difficult to keep that thing
+clean. How would I version it if I add data to it, for example the preprocessed data we'll see later? How can I keep
+track of the correct version? How do I know if the data changed over time, and when I did that? Can I rerun the models
+that I trained on previous data on the new data? Just to keep an overview of how all of this data is flowing, it's a
+lot easier to use a ClearML dataset instead.
+
+So what I'm doing here is actually really cool. I'm using a single link to a zip file that I made, which is a subset of
+the complete data, so it only has like 120 samples or something, and we use that to iterate really quickly. We also
+have the path to the full Urbansound dataset, which we then label as `full dataset`, and that gives us the freedom to
+switch between the subset and the full dataset. So I will essentially create two ClearML dataset versions, one with the
+subset and one with the full dataset, and that will allow me to switch very quickly without having the whole thing,
+with different versions, on my disk all the time. What I used to do was have different versions in different folders,
+and then probably more folders with different names every time you do it again, but then if you don't change the name,
+you overwrite it. So that's all a thing of the past. Now, as I'll show you later in the UI, we have a nice and clear
+overview of all of the different versions.
+
+I'll add some dataset statistics, which is also something you can do in ClearML: just add, for example, a class
+distribution or other kinds of plots that could be interesting, and then I'm actually building the ClearML dataset here.
+Also, an extra thing that is really, really useful if you use ClearML datasets is that you can share them as well,
+with colleagues and friends for example. You can share the data with them and they can add to it, and you will always
+have the latest version, and you will always know what happened before that.
+
+There's also the possibility of using the data on a different machine, which is really, really useful because this machine isn't the most
+powerful of them all. And I want to train on the full dataset on a different machine that has a GPU, and then I can just
+point it to the ID of the dataset, and it will just grab the latest version, and we're good to go. So that's a lot easier
+than trying to keep the dataset versions in sync over two machines, which is one of the main things that ClearML Data
+tries to solve. So that's what the dataset or the `get_data.py` does for you.
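+
+A rough sketch of what this `get_data.py` flow could look like in code. The dataset name, project, tags, folder path,
+and the example statistics below are assumptions for illustration, not the exact values from the video:
+
+```python
+from clearml import Dataset
+
+# Create a new dataset version and tag it, so later runs can ask for
+# "the latest version tagged subset" or "the latest version tagged full dataset".
+dataset = Dataset.create(
+    dataset_name="Urbansound",          # assumed dataset name
+    dataset_project="Day in the Life",  # assumed project name
+    dataset_tags=["subset"],            # or ["full dataset"]
+)
+dataset.add_files(path="data/raw")      # assumed local folder holding the WAV files
+
+# Attach some dataset statistics, e.g. a class-distribution plot
+dataset.get_logger().report_histogram(
+    title="Class distribution", series="count", iteration=0,
+    values=[42, 39, 40], xlabels=["dog_bark", "children_playing", "drilling"],  # made-up numbers
+)
+
+dataset.upload()
+dataset.finalize()
+
+# On any other machine (e.g. the GPU box), the same data is one call away:
+local_path = Dataset.get(dataset_id=dataset.id).get_local_copy()
+```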
+
+Then we have `preprocessing.py`, which is relatively simple. Essentially, I'll get the data from `get_data.py`, so the
+previous dataset version, and then each and every sample in that dataset will be preprocessed using the preprocessing
+class, which will just calculate a mel spectrogram, if you're into that kind of thing, but I won't go into depth
+about it here. Essentially, we'll create
+a mel spectrogram for each sample that will give us an image, and then we take that image and put it into a different
+dataset, which now has the same structure as the previous dataset, but now also with images there. And because the WAV
+files, or the audio files, are already in the previous dataset, this new version will only upload the images that we
+just produced. It won't duplicate the data because it knows it's already in a previous version. It will just reuse that
+instead. So that also saves a bit of disk space, if you're trying to put it on the cloud as well.
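+
+Sketching that idea in code: the parent/child link between dataset versions is what lets ClearML reuse the WAV files
+instead of re-uploading them. The names, tags, and file paths below are placeholders, and the spectrogram step is
+reduced to a single file for brevity:
+
+```python
+from pathlib import Path
+
+import librosa
+import matplotlib.pyplot as plt
+from clearml import Dataset
+
+# Fetch the latest raw (WAV) dataset version
+parent = Dataset.get(dataset_project="Day in the Life", dataset_name="Urbansound",
+                     dataset_tags=["subset"])
+wav_folder = parent.get_local_copy()
+
+# For each WAV file: compute a mel spectrogram and save it as an image
+Path("data/preprocessed").mkdir(parents=True, exist_ok=True)
+signal, sample_rate = librosa.load(f"{wav_folder}/some_sample.wav")  # placeholder file
+mel = librosa.power_to_db(librosa.feature.melspectrogram(y=signal, sr=sample_rate))
+plt.imsave("data/preprocessed/some_sample.png", mel)
+
+# Create a child version: only the newly produced images get uploaded,
+# the WAV files are reused from the parent version.
+child = Dataset.create(dataset_name="Urbansound", dataset_project="Day in the Life",
+                       parent_datasets=[parent.id])
+child.add_files(path="data/preprocessed")
+child.upload()
+child.finalize()
+```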
+
+Now how I used to do this before ClearML is actually creating a new folder with a unique name for that specific run and
+then putting all of the images in there. But that's just a huge mess, right? We've all done this. You forget to change
+the name and then you overwrite your previous samples. And if you're just running through it, you don't know what the
+code was that created your previous versions, right? They're not saved together, which is a huge mess. It gets out of
+hand really quickly. You end up with a huge folder full of different names and versions, but the original code isn't
+attached, the original plots aren't attached, and that's really annoying. And that is what ClearML Data does for you:
+it will keep track of this, but it will also keep track of which run of your code actually produced it, and that allows
+you to always go back and see if you made any mistakes. You can always go back, which allows you to iterate a lot faster.
+
+And then finally we have the training script. If I go to the training script, you again see the `Task.init`, so we
+want to capture every single time that we run this code, and you can also see that I made a huge configuration dict.
+Essentially every parameter that I use in my code is in this dictionary, and then I connect it to the task. We'll see
+later why that is really, really useful, but for now, at the very least, it will keep track of all of these parameters,
+so in the UI, which we'll see later, I can very easily verify where those parameters came from, what they're doing, and
+which parameters were used in which case. It just keeps track of everything, which is really, really nice. I just set
+the random seeds, I put it on a CUDA device if it's available, and then there is a TensorBoard writer. So I like to use
+TensorBoard, which is like the industry standard to keep track of my logs and outputs. And what is really cool about
+ClearML is that it will automatically detect that you're using TensorBoard, so you don't have to manually log everything
+to ClearML as well. ClearML will just say: oh, you logged this to TensorBoard, I'll take it and log it to ClearML as
+well. Really nice.
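+
+As a sketch of how that configuration dict and the TensorBoard writer tie together (the parameter names and values here
+are made up for illustration):
+
+```python
+from clearml import Task
+from torch.utils.tensorboard import SummaryWriter
+
+task = Task.init(project_name="Day in the Life", task_name="training")  # assumed names
+
+# Every parameter lives in one dict; connecting it makes the values show up in the
+# ClearML UI and become editable when the task is later cloned.
+configuration_dict = {
+    "dataset_tag": "subset",   # assumed parameter names and values
+    "batch_size": 8,
+    "number_of_epochs": 10,
+    "optimizer": "SGD",
+    "base_lr": 0.005,
+}
+configuration_dict = task.connect(configuration_dict)
+
+# Anything written to TensorBoard is picked up automatically and logged to ClearML too.
+writer = SummaryWriter("runs/audio_classifier")
+writer.add_scalar("train/f1_score", 0.57, global_step=0)
+```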
+
+And then I just prepare the data. I get my model, which depends on the different parameters that I just showed you.
+Then I plot the signal, train, eval, and model. And if I plot things with Matplotlib, for example, that's also
+automatically captured by ClearML. So that's again something that I don't have to think about, but the plots are all
+saved together with my code, together with those hyperparameters you just saw, together with the output, which is
+really handy.
+
+But then there is a last thing that I want to focus on. And that is the model files. So again, before I used ClearML,
+the model files, I would essentially create one long name for the file with just underscores and all of the different
+parameters in there, so that in every model file, I could easily see what the parameters were that I used over time to
+create those model files. But that's just a huge mess, because the number of parameters that you use changes over time:
+you add more parameters, you drop some, and then it gets messy because you can't go back to the code that actually used
+those parameters. And if you look at it like this, a configuration dict is quite long. Look at those parameters. What
+if I want to include those classes? It's a huge parameter. I can't just add it to the file name and not have it become
+a mess.
+
+So to be able to connect these parameters to the model files that they output, to the plots that were produced, to the
+original code that produced all of this, we need ClearML, or at the very least some kind of experiment manager. And this
+is what ClearML is doing for me.
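+
+Nothing special is needed for the model files themselves once the task is initialized: framework saves are picked up
+automatically and linked to the run that produced them. A minimal sketch, where the project name, file name, and
+`output_uri=True` (asking ClearML to also upload the checkpoint to storage) are assumptions:
+
+```python
+import torch
+from clearml import Task
+
+task = Task.init(project_name="Day in the Life", task_name="training", output_uri=True)
+
+model = torch.nn.Linear(128, 10)  # stand-in for the real audio classifier
+# ... training happens here ...
+
+# This save is detected by ClearML and registered as an output model of the task,
+# tied to the exact code, hyperparameters, and plots of this run.
+torch.save(model.state_dict(), "audio_classifier.pt")
+```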
+
+So now that you've seen all of my code, let's take a look at the ClearML UI where all of this code actually gets tracked,
+and you have this nice overview of your experiments. So if I go to the dashboard, you will see my recent projects and my
+recent experiments.
+
+Let's dive into the recent projects that I'm working on currently, which is "The Day in the Life of" video and that's
+exactly what we're working on right now. And as you can see, it's a mix between the training which is `training.py` that
+we just saw, downloading data which is `get_data.py` that we just saw and preprocessing which is `preprocessing.py` that
+we just saw. We also have tags, which is really, really handy because every time I run a specific script on either the
+subset or the full dataset, I can tag it as such, and that way I can easily filter, for example saying: hey, I only
+want to see my tasks that have been run on the subset for now. Which is really, really nice to get started with.
+
+And then of course there's also usually a huge mess, right? So what I started to do as well: your tasks have a status,
+and I tend to only show everything that is not _failed_ or _aborted_, just to get all of those training runs that
+failed, or where I made a code mistake or whatever, out of there. They just clutter the whole thing. You can do
+whatever you want of course, but this is my preference.
+
+And then there is the option to sort as well. In any of these columns you can just sort on that property, and in this
+case I'm going to sort on name. A quick tip for you: if you shift-click, you can sort secondarily on that column as
+well. So right now I'm sorted on name first, and all of the experiments that have the same name are secondarily sorted
+on their started time, which will give me the most recent one of each batch. And then you can see here that the last
+training I did was 18 hours ago, and if I scroll down, a while ago I did some preprocessing and some downloading of the
+data, and that is also sorted quite nicely. You can also see from the tag that this is specifically a dataset, and then,
+which is really cool, I can go into this dataset.
+
+As I said before in the diagram, `preprocessing.py` creates a new version of the original dataset, and if we look at
+the plots here, we can actually see this. This is an overview of your dataset genealogy, or lineage, and it keeps track
+of every single time you created a new version and where that data came from. So you can always go back and see what
+the original data was and what the original task was that made it. And here you get a summary of all the things that
+were added. So in the preprocessed dataset, we created 109 new images that we added to the 111 WAV files that were
+already there, which actually immediately tells you there might be something wrong there because we missed two audio
+files that we didn't convert into image files. So it can actually help for debugging as well.
+
+Now if we go back to the experiment table here, we can take a little bit of a deeper look into the training runs that
+I've been doing recently, just to get my bearings before I start making a new model. I can click this button, or
+right-click and press Details, to get more in-depth on this specific training run. What you can see here is: I've
+tracked my repository, and I've tracked every uncommitted change as well, which allows you to always go back to the
+original code if you need to. All the installed packages are tracked as well, so that we can later
+reproduce this thing on another machine, but we'll go more into that in a later part of the video.
+
+There's also configuration, so these are the configuration dict values that I showed you before in the training script,
+right? So you can see everything here, and it gives you a nice and clean overview of everything that is going on.
+
+There's artifacts, so this is the model that I was talking about that is being tracked and saved as well.
+
+Then there is info on which machine it was run and stuff like that.
+
+And then there are the results: everything that is basically output by your code. The results here include the console
+output, which is how the training looked, essentially what was printed to my console back then.
+
+There are scalars, which are the values tracked during the training of your model. In this case, that would be the
+F1 score, the performance of my model over time, or over the iterations. I also have a training loss, which looks
+very weird here, but we can figure out where that came from because we can analyze it right now. It will also monitor
+your machine usage, your GPU usage and stuff like that, and the learning rate, for example, as well. So this gives you
+a really, really quick overview of the most important metrics that you're trying to optimize. And keep in mind this
+F1 score, because this is the thing that we're trying to optimize here.
+
+Then plots. I can, for example, plot a confusion matrix every X iterations. So in this case, for example, after a few
+iterations, I plot the confusion matrix again, just so I can see over time how well the model starts performing. As
+you can see here, a perfect confusion matrix would be a diagonal line, because every true label would be matched with
+the exact same predicted label. And in this case, it's horribly wrong. But then over time it starts getting closer and
+closer to this diagonal shape that we're trying to get to. So this is showing me that at least at this point it's
+learning something, it's doing something, and that is actually very interesting.
+
+And then you have debug samples as well, which you can use to show whatever kind of media you need. These are, for
+example, the images that I generated, the mel spectrograms that the preprocessing outputs, and you can just show them
+here with the label and what the prediction was. So I can just have a very quick
+overview of how this is working and then I can actually even do it with audio samples as well. So I can for example here
+say this is labeled "dog", and it is predicted as "children playing". So then I can listen to it and get an idea on, is
+this correct? Is it not correct? In this case, obviously it's not correct, but then I can go further into the iterations
+and then hopefully it will get better and better over time. But this is a quick way that I can just validate that what
+I'm seeing here, what the model is doing is actually translatable into: Yes, this is correct. This is a correct assumption
+of the audio here.
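+
+In the video these samples most likely flow in through the TensorBoard writer, which ClearML picks up automatically;
+reporting them directly would look roughly like this, with the titles, labels, and file paths below as placeholders:
+
+```python
+from clearml import Logger
+
+logger = Logger.current_logger()
+
+# A spectrogram image, shown under the task's debug samples
+logger.report_media(
+    title="spectrograms", series="label: dog_bark / predicted: children_playing",
+    iteration=10, local_path="data/preprocessed/some_sample.png",  # placeholder path
+)
+
+# The matching audio clip, playable straight from the UI
+logger.report_media(
+    title="audio", series="label: dog_bark / predicted: children_playing",
+    iteration=10, local_path="data/raw/some_sample.wav",           # placeholder path
+)
+```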
+
+The last thing I want to show you is how you can customize this table, so it's quite easy to just say okay, I don't want,
+for example, the name of who ran it or whatever. But you can also do this really cool thing which is called adding
+custom columns and I use this all the time in my daily life. It makes everything so much easier because I can add a metric
+as well which is one of the scalars that we saw before. So if I use the F1 score here and take the maximum of
+that, I can see the max F1 score of every single training run in this list, and then I can sort on that to just get a
+leaderboard essentially which will give me a nice overview of the best models that I have, and then I can just dive in
+deeper to figure out why they were so good.
+
+But now it's time to actually start changing some stuff. So this is the beginning of my day. I've just gotten my
+bearings, I know what the latest models were, and the score to beat here is 0.571429. That's the F1 score we're trying
+to beat on the subset, and the moment we find a combination of parameters, or a change to the code, that does better
+than this, or that is in the general ballpark of it, we can then run it on the full dataset as well. But I'll tell you
+more about that later.
+
+So the first thing we're going to do is go back to the `training.py` script. I might want to change several parameters
+here, but there's something I've read that I've been interested in: while getting the model, I see that here I still
+use the stochastic gradient descent (SGD) optimizer, and it could be really interesting to see how it compares if I
+change this to Adam. Now the Adam optimizer is a really, really good one, so maybe it can give me an edge here. Of
+course, Adam doesn't have the same parameters as SGD has, so I'm removing the momentum here, because the Adam optimizer
+doesn't really care about momentum. It's not using that. So all that I have to do now is just run my training run and
+all will be well. You can see here that ClearML created a new task, and it's starting to train the model. It's using a
+specific dataset ID, which you can find in the configuration dict. I set it to this dataset tag, to use the latest
+dataset carrying the subset tag, so in that case it will get the latest data that is only in the subset. That's what
+we're training on here. You can see I have 102 samples in the training set and only seven in the test set; this is why
+it's the subset. And now you can see that it's training, and if I go to the experiment overview and take a look at
+what is here, I can see that the training run here - I'll sort on started so that we have it up top - is in status
+running, which means it's essentially reporting to ClearML, which is exactly what we want. And if I go to the Details
+tab, I can go to Results and see the console output being logged here in real time. The console output might not be
+that interesting to keep staring at, but what is interesting to keep staring at is the scalars. Here you can see the
+F1 score and the training loss go up or down before your eyes, and that's really cool, because then I can keep track
+of it, or have it as a dashboard somewhere, just so that I know what is happening out there. So I'm going to
+fast-forward this a little bit until it's completely done, and then I will go into a more in-depth analysis of this
+whole thing.
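+
+A rough sketch of what the tag-based dataset lookup and the optimizer switch could look like inside `training.py` - the
+project, dataset name, and parameter names mirror the earlier sketches and are assumptions:
+
+```python
+import torch
+from clearml import Dataset
+
+configuration_dict = {"dataset_tag": "subset", "optimizer": "Adam", "base_lr": 0.005}
+model = torch.nn.Linear(128, 10)  # stand-in for the real audio classifier
+
+# Grab the newest dataset version carrying the requested tag
+dataset = Dataset.get(
+    dataset_project="Day in the Life",                  # assumed project
+    dataset_name="Urbansound",                          # assumed dataset name
+    dataset_tags=[configuration_dict["dataset_tag"]],   # "subset" or "full dataset"
+)
+data_path = dataset.get_local_copy()
+
+# Optimizer choice driven by the config: SGD takes a momentum term, Adam does not
+if configuration_dict["optimizer"] == "SGD":
+    optimizer = torch.optim.SGD(model.parameters(), lr=configuration_dict["base_lr"], momentum=0.9)
+else:
+    optimizer = torch.optim.Adam(model.parameters(), lr=configuration_dict["base_lr"])
+```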
+
+So right now we see that it's completed, and if we go back to what we had before and sort again by the F1 score, we see
+that the newest training run, which we just did two minutes ago and which was updated a few seconds ago, is actually
+better than what we had before. So it seems that the Adam optimizer does in fact have a big effect on what we're doing.
+And just to make sure that we didn't overlook anything, what I can do is select both my new best model and the previous
+model that I had and then compare them. That's what I have right here, and everything that you just saw being tracked,
+be it hyperparameters or plots or whatever, can all be compared between different training runs. What we can see here
+if we click on execution is that we have some uncommitted changes that are obviously different, and then if we scroll
+down, we can see that, for example, here the Adam optimizer was added and the SGD optimizer was removed. So this already
+gives us the idea of: okay, this is what changed. This is really interesting, and we can always use these differences
+to then go back to the original code.
+
+Then, of course, the hyperparameters. There weren't any differences; we didn't actually change any of the
+hyperparameters here, but if we did, that would also be highlighted in red in this section. If we look at the scalars,
+this is where it gets really interesting, because now the plots are overlaid on top of each other, and you can change
+the color if you don't like it. I think green is a bit ugly, so let's take red, for example. We can just change that
+here. And then we have a quick overview of the two compared experiments and how their scalars did over time. And
+because they have the same X-axis, the iterations, we can actually compare them immediately to each other, which is
+really, really cool. We can even see how the GPU memory usage or the GPU utilization has fared over time, which is
+really interesting. And then things come up, like, for example, in this case the run with the higher F1 score, which in
+our case is the Adam optimizer, had a higher loss as well, which is really interesting and something we might want to
+take a look at. Also, the spikes are still there. Why are they there? So this is really handy if you want to dive deep
+into the analysis of your model.
+
+Then we also have plots. For example, we can compare the confusion matrix between the SGD optimizer and the Adam
+optimizer. So again, very useful, and the same thing happens with our debug samples. If we want to see the same audio
+samples compared between the different experiments, that makes it very, very easy. So now we can look at the label
+dog bark and see how both experiments predicted it.
+
+But now we're going to start with a very interesting part. Remember, I had a subset and a full dataset. The subset is
+very easy to iterate on quickly and to run, for example, on CPU. But now, if we go back to our overview, we have a
+model that's very, very good, even on the subset. So the first thing I want to do now is run the same thing on the
+full dataset instead, and I don't want to do this on my current machine, because it's too small and it doesn't have a
+GPU and whatnot, so we can clone the experiment and then run it on a different machine straight from the UI. So strap
+in, I'm going to show you how. What you're going to do is right-click the experiment that you're interested in and
+then go to Clone. You'll get a Clone Experiment dialog. I want it in the same project of course, and I want to keep
+calling it training - that's my preference, but you can call it whatever you want - and then I'm going to clone it.
+Now it's in draft mode. Remember all of the hyperparameters that we had before? What draft mode allows you to do is
+edit them. All of these hyperparameters are now editable, so I can go into these hyperparameters that were tracked
+from our code before. I can show you these are the same parameters here. And then I
+can change whatever I want here. So essentially I can change the dataset tag to full dataset so that it will grab the
+latest dataset version with the full dataset tag and not the subset tag. And then because the data is so big, I can
+change the batch sizes to be a little bit higher, so I can change it to let's say 64 and 64. Let's keep the seed and the
+number of epochs and everything the same as it was, and then I save it. So now I've edited these hyperparameters and
+gotten the script, the whole thing, ready for deployment. Now what I want to do is right-click, or go to these bars
+there, and then say Enqueue, and what that will do is put that experiment into the queue. I have a stronger machine, a
+machine with a GPU, running as a ClearML Agent, and that agent is currently listening to a queue; it will grab
+experiments that are enqueued and start running them. And because we tracked everything - the code, the
+hyperparameters, the original packages - the agent, the different machine, has no issues completely reproducing my
+experiment, just on a different dataset. And so now I can click on Enqueue. I just want to enqueue it in the default
+queue, because my agent is listening to the default queue, and it is currently pending, as we can see here.
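+
+Everything done here through the UI can also be scripted. A rough sketch of the same clone, edit, and enqueue flow,
+where the source task ID, parameter names, and queue name are assumptions:
+
+```python
+from clearml import Task
+
+# Clone the best subset experiment (placeholder ID) into a new draft task
+template = Task.get_task(task_id="<source-task-id>")
+cloned = Task.clone(source_task=template, name="training")
+
+# Edit the tracked hyperparameters of the draft; parameters connected through
+# task.connect() live under the "General" section.
+cloned.set_parameter("General/dataset_tag", "full dataset")
+cloned.set_parameter("General/batch_size", 64)
+
+# Push it to a queue; a machine running `clearml-agent daemon --queue default`
+# will pick it up, recreate the environment, and run it.
+Task.enqueue(task=cloned, queue_name="default")
+```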
+
+If I now go to Workers and Queues, what you can see is that I have my available worker here, and it is currently running
+my training experiment. So if I click on the training experiment, it will get me back to the experiment that we just
+edited. If I go to the configuration, we see that the batch sizes and the full dataset are right here. And it's
+currently running, but it's not running on this machine. It's running on a different machine that is hosting the
+ClearML Agent, which was listening to the queue. If we go to the results and the console, what you'll see is that the
+output of this agent right now is essentially just showing that it's trying to reproduce the environment that I had
+before. So the agent is installing all the packages, setting up the whole environment, to be able to run
+your code without any issues on its own. And then we'll be able to follow along with the scalars as well and the plots
+just as we would on any other task. How cool is that? That's awesome. So I'm going to let this run for a while, and
+we'll come back. We'll keep it on the scalars tab so that you can see the progress being made, and then we can see the
+F1 score grow and the loss go down over time, but on the full dataset this time, and then I'll come back when
+it's done.
+
+All right, we're back. It finally trained. We can see the F1 score over time. We can see the training loss over time.
+So this is already a big difference from the subset run that we saw before. And now if we go to projects and then little
+video, which is the current project, we can change our tag filter to full dataset. And this will give
+us the full range of experiments that we trained this way on the full dataset, and you can clearly see that even though
+this model got the highest F1 score on the subset, it doesn't actually have the highest score on the full dataset yet.
+However, even though it is not the best model, it might be interesting to get a colleague or a friend to take a look at
+it and see what we could do better, or just to show off the new model that you made. So the last thing I want to show you
+is that you can now easily right-click it and then go to share, and you can share it publicly. If you create a
+link, you can send this link to your friend, colleague, whatever, and they will be able to see the complete details of
+the whole experiment, of everything you did: they can see the graphs, they can see the hyperparameters, and they can
+help you find the best ways forward for your own models.
+
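+By the way, the same kind of tag filtering can also be done from the SDK if you ever want to compare runs in a script.
+This is only a rough sketch; the project name, tag, and metric/series keys are assumptions and will differ in your own
+workspace:
+
+```python
+from clearml import Task
+
+# List the experiments tagged "full dataset" and print their last reported F1 score.
+tasks = Task.get_tasks(project_name="little video", tags=["full dataset"])
+
+for t in tasks:
+    metrics = t.get_last_scalar_metrics()  # nested dict: {title: {series: {"last": value, ...}}}
+    f1 = metrics.get("f1_score", {}).get("f1_score", {}).get("last")
+    print(f"{t.name}: last F1 = {f1}")
+```
+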
+So I hope this kind of inspired you a little bit to try out ClearML. It's free to try at [app.clear.ml](https://app.clear.ml/),
+or you can even host your own open source server with the interface that you can see right now. So why not have a go at
+it? And thank you for watching.
diff --git a/docs/getting_started/video_tutorials/hands-on_mlops_tutorials/how_clearml_is_used_by_an_mlops_engineer.md b/docs/getting_started/video_tutorials/hands-on_mlops_tutorials/how_clearml_is_used_by_an_mlops_engineer.md
index 8b580b7a..af6941c9 100644
--- a/docs/getting_started/video_tutorials/hands-on_mlops_tutorials/how_clearml_is_used_by_an_mlops_engineer.md
+++ b/docs/getting_started/video_tutorials/hands-on_mlops_tutorials/how_clearml_is_used_by_an_mlops_engineer.md
@@ -19,7 +19,436 @@ title: How ClearML is used by an MLOps Engineer
Read the transcript
-Hello again and welcome to ClearML. In this video we'll be going over a workflow of a potential MLOps Engineer. Now an MLOps Engineer is a vague term. This might be a specific person in your team that is doing only the Ops part of the machine learning. So the infrastructure and all of the workers and whatnot. Or it could be you as a data scientist. It could be just the data scientist of the team that is most into things like docker and deployments. And that person now has the job of a MLOps Engineer. So it really doesn't matter who you are. This video is going to be about what this kind of person will be doing and what ClearML can do to make the life of them a little easier. Just a little. So what we're going to do here is take a look or get started at the very least with our workers and queues tab. So if you've followed along with the getting started videos, this is actually something you've probably seen before. but I'm going to go a little bit more into depth in this video. So the workers and queues tab, what does it do? So we have what we can expect. We have the workers tab and we have the queues tab. Workers in ClearML are actually called agents. So you can see here that we have a bunch of available workers which are spun up by using the ClearML agent. I'll go more in depth in that in a minute. So we have a few available workers. We have Beast Zero, One, Two, and Three. I'm the person that called my own computer Beast. So my own computer is running a few workers here. And then we also have Apps Agents and I'll go a little bit more into detail what that means later. Essentially, what it means is you have the applications right here and what that's going to do is give you a few pre-made applications for automation that you can use straight out of the box. And if you use any of them, in this case a GPU scaler, an auto scaler from the cloud, then it will spin up an available worker for you that will just manage the orchestration there so that that worker will do nothing else but just tell things where to go and what they should do, right? So if we go a little bit more into depth here, we can also see which worker is running which experiment at this moment. So we have example: Task 1 2 3, and 4, oh sorry, 0 1, 2 and 3, programmer terms of course. We see the experiment running time. We see the iterations. In this case, it's a classical machine learning model, so we don't really have iterations, but if you have a deep learning model, this is where your amount of iterations would come into play. If we click on any of these, we can see the worker name and we can see its utilization over time in here as well. All right, so we can obviously make this longer. I've only been running this for a few hours or for an hour. So we have the worker name right here. We have the update time, so just to know that when was the last time that the worker actually sent in any new data, right? We have the current experiment on which we can click through so I'll do that in a minute and we have the experiment runtime and experiment iterations here as well. We also have the queues, which means that we can actually find out what queues is this worker listening to and I should give some more context here. So if we go into the queues, ClearML works with a system of queues as well as workers, right? So this actually comes from the fact that originally people were just giving out SSH keys to everyone to get them to work on a remote system. And this is far far far from perfect, right? 
So you have several people sshing in, you have several people running their own workloads on the same GPUs. They have to share everything. Because of the amount of people that are also running their stuff on the GPU, you actually have no idea how long your own task will take, so that's something. You can't have any priorities. So if everyone is just running their stuff and actually probably killing your stuff as well because it's out of memory, because too many people are using it. So that's just a disaster, right? If you have a large GPU machine that you have to share with multiple people, or just even want to orchestrate several different tasks on, with different priorities, it becomes a real hassle. So that's actually why ClearML has workers and queues to try and deal with that a little bit. And this is actually what we're calling orchestration. So if you look at our website, you'll see orchestration and automation. Those terms might not mean very much. So this is what I'm going to be talking about in this video. Orchestration in this case, is like a director in an orchestra, right? You're essentially saying who should do what when, so which worker should run which experiment, or which task at what time and in what priority. So this is what the queues are all for. Essentially, queues are just what they're called, right. They're queues, but you can have as many of them as you want. So in this case, we have the services queue, we have the default queue, GPU queue and CPU queue. You can create new queues by clicking the button here, so it's very simple. I can make a very simple queue, but this is very worthless, right? But you can make however many of them you want. I can delete that queue again. We can see for each view how many workers it has. So I'll show you that in a minute when we spin up a new worker. But we can actually pretty easily see how many workers are serving a specific queue. So listening to that queue and that actually has an effect on the overall waiting time. So for example, here we have four workers that we saw here before, right? So these are these four workers. They're all listening to the CPU queue. They're all running a CPU experiment. But then we have a bunch of other experiments in the queue still. So this is just a list of the next, like, essentially the order in which the next example tasks will be executed. So we see here that the next experiment is task four. We see that it was last updated there, and now we see the queue experiment time rising rapidly. Because people are waiting in queue here, there are many more tasks to be executed. We also have CPU queued experiments, which is just the amount of queued experiments per queue. So we also have a GPU queue and in this case, we see that we have zero workers here. Now we'll go a little bit more into depth on that later because we don't actually have zero workers there. We actually have an auto scaler listening to this. Then we have the default queue and we have the services queue and I should stop a little bit on the services queue because the services queue is relatively special in ClearML. You have all of your custom queues that you can create however you want CPU queue, GPU queue, whatever. The services queue is actually meant to host specific, not very heavy workloads, but that do this kind of orchestration that I was talking about. So imagine you have a pipeline for example, if you don't know what a pipeline does in ClearML, you can look at the getting started video that we made on pipelines. 
And if you want to run a pipeline, you will need a pipeline controller. Essentially, it's a very small piece of code that tells ClearML now you should run this step, then take that output, give it to the other step, run the other step, give that to the output of the next three steps, take those three steps, run them, and so on, and so forth. It's essentially the director. It's orchestration, right? And so that's what the services queue is meant for is it's usually meant to put in those orchestration tasks, those long running, not very heavy tasks. So that allows you to essentially assign a bunch of not very powerful CPU machines to that queue just to do the orchestration and then everything else, like your GPU machines or your heavy CPU machines, can be assigned to the CPU and GPU queues in which we can choose to just enqueue tasks or experiments that do simply that. So that's essentially what a services queue is when compared to other user-made queues. We can see here that we have a bunch of workers so we have the beast 0, 1, 2 and 3 that are assigned to this services queue. But as we can see if we go and take a look at the CPU queue, we have a whole bunch of tasks here. So there is a lot of people waiting for their turn. So actually one thing that we can do is we can already change the priority. So imagine we have example Person 19 that has a very, very tight deadline and they actually need to be first. So we can just drag them all the way up, let me scroll there, all the way up and now. Oh, it didn't quite work so we can drag them all the way up, all the way through here. There we go, all the way up top. So now we can see that example 18, for example task 18 is the first in the queue. So once any of the four workers finish off their example 0, 1, 2 or 3, the next one will be example 18. So in this case, you can actually very easily change the priority and make sure that the people that have their deadline meet their deadline. So that this could be a common question in the day of the life of an MLOps engineer is, please give me priority. Please let my stuff work first. So that's something you can do. Another thing you can do, which is relatively simple to do, is aborting a task. So a task is essentially just a Python script that was cloned or that was already in the system or that was put into the system that can be recreated on a different machine. So what we could do is, we could go into the queues here and then say clear, which will clear the complete queue. So that's something we don't necessarily want in this case. But in this case, we, for example, want to abort task 0. So one way of doing that would be to go to the current experiment, right here. And if I click on that ClearML will actually bring me to the original experiment view, the experiment manager, remember everything is integrated here. The experiment manager of that example task. So what I can do here if I look at the console, I have a bunch of output here. I can actually abort it as well. And if I abort it, what will happen is this task will stop executing. Essentially, it will send a control C right? So a quit command or a terminate command to the original task on the remote machine. So the remote machine will say okay, I'm done here. I will just quit it right here. If for example, your model is not performing very well or you see like oh, something is definitely wrong here, you can always just abort it. 
And the cool thing is if we go back to the workers and queues, we'll see that the Beast 0 has given up working on task 0 and has now picked task 18 instead. Which is the task that we put in there in terms priority. So this has the next priority. Work has already started on task 18. So this is really, really cool. But we can see that in the CPU queue, the amount of tasks is still very, very high. Even though we just aborted one, people are waiting. The waiting time is rising. The amount of tasks is very, very high. So what we should do now is actually start up a new worker. This could be something that is very much in daily life over an MLOps engineer. It's just to add workers to your worker pool. I'll put it on workers so that we can see very clearly when we added it here. Go out of full screen and we're going into a remote machine here. Oh, apparently it's lost connection. Let me just get in there. There we go. So you could remote in almost any machine, right? It doesn't really matter which type of machine it is, if it's a cloud VM, if it's on premise, if it's your own laptop, it could be any remote machine that you want. It's very easy to turn those into a ClearML agent or a worker for the ClearML ecosystem. The first thing you'll have to do though, is `pip install clearml-agent`. So this will install the very thing that is necessary to turn it into a worker and that's actually everything you need. That's all the packages you need. It is a Python package, but usually you have Pip3 available. So then the next thing you should do is `clearml-init`. Now clearml-init will connect this machine to your server, to the actual orchestration server, this one that will handle all the workers and queues. So in this case it's App.clear.ml which is the hosted version. You can also host your own open source server if you so require. If I run clearml-init, you'll see that I have already done this of course, but in this case you should be able to just add your credentials from the server and it should connect no problem. If you want more information on that, we have tutorial videos on that as well. And then the next thing we should do is `clearml-agent daemon --queue` Now we can decide which queues we want this ClearML agent to listen to. So in this case, if we go and take a look at queues, we have CPU queue, which is by far the most requested queue. So in this case, imagine we have an extra machine on hand in the faculty or in the the company you're working with and you should just add this machine or its resources to the pool. So in this case we're just going to say `CPU Queue` and that's everything. So we just want a simple extra machine for the CPU queue. Then we're going to add `--docker` because personally I quite like the fact that the machine would be using docker. So essentially what will happen here is that the ClearML agent will pull a new task from the queue and then we'll spin up a new container depending on either the default image that I give here or the image that is attached to the task itself. So people, the data scientists that are creating their remote tasks or their experiments, they can also assign a docker file or a docker image to that, that it should be running. So if you have very specific package requirements or very specific needs, you can, as a data scientist, already say, I want to attach this docker image to it and it will be run like such on the ClearML agent. So that's that gives you a lot of power. 
But in this case I will just say if the data scientists gave no indication of what docker container to use, just use Python 3.7. This is the standard in our company, let's say, and this is what we want to use. Okay, so if I run this, it should start up a newlearML agent. So in this case you can see it's running in docker mode, it's using default image Python 3.7 and it's listening to the CPU queue. Now if we go back to our workers and queues tab. We can see that here any-remote-machine:0. So we can actually see that we now immediately have a new remote worker and it's actually already started on the next task. So now we're currently running five workers on the CPU queue instead of four. So this was very, very easy to handle, very very easy to set up. So this is one part of what an MLOps engineer could be doing. Now this is very, very manual to set up and the MLOps engineer is king of the automation after all. So we want some kind of some kind of way to automate all of this, right? So what we can do here is go to applications. And what we have is AWS Autoscaler and GCP Autoscaler in essence. Also, Azure will come later so that will be out soon. So if we go into the AWS Autoscaler. What we see here is we have an MLOps GPU scaler and what that means is, we don't always have fixed demand for GPU resources, right? So imagine you have a company in this case that has a lot of demand for CPU compute in this case, five workers and a lot of tasks. We only have GPU requests only every so often. And it's not very economical to buy a few very, very powerful GPUs just for a few minutes every week, for example, or a few hours every week. So what is much more economical there is to use the cloud instead, in which you pay for the hours that you use a GPU and you don't pay for the hours you don't use it anymore. Of course, what you could be doing is if a data scientist needs a GPU machine, you could go to the cloud console, spin up a new VM SSH into it, and then create a ClearML agent for it, to be able to access it from the queue. But that could also be done automatically. And that's essentially what a Autoscaler is doing for you. So the Autoscaler will detect if there is a task in the queue, and then will spin up a new machine on the cloud, run the ClearML agent there, reconnect it to your own server, and then run that specific task. And then if that task is done and the agent and the machine is up for like a minute without doing anything, you can choose that minute by the way, if it's up for a while and it's not doing anything, it will just shut itself down again. And that actually makes it very, very economical. Because if you've ever forgotten to close down or to shut off a machine, especially a GPU machine on the cloud over the weekend, it's super expensive. So you can actually pay ClearML Pro for a year for just the same amount of money as forgetting to shut down a large GPU machine for a single day. So just to give an idea of how economical this can be. If we go and take a look at our configuration here, we can see that we have our AWS credentials and GCP will be obviously GCP credentials. We have our Git configuration and this is what the MLOps engineer will be doing. They will be configuring this kind of thing. They will be configuring the max idle time, which says how long should the machine be doing nothing before we shut it down again. It could be beneficial to keep it up for a little while because if then another task comes in like two minutes later, it's immediately launched. 
You don't have to wait for the machine to boot up. You can add prefixes, you can add polling intervals, you can add a base docker image that you can use, but obviously you can overwrite that again. And then there are obviously the computer resources. You can have GPU machines of a specific type, this obviously depends on which cloud provider that you're using, but it's all basically the same thing. You can run it in CPU mode mode, use Spot instances to have some savings there, availability zones, etc, etc, etc. So this is the kind of thing that an MLOps engineer would probably spend a lot of their time tuning and fine-tuning and getting up and working. Another way next to running your ClearML agent just manually and spinning up an autoscaler to get some extra agents is to run them on Kubernetes. So if I go to the ClearML Github and go to the ClearML agent repository, you have two different versions of integration of the ClearML agent in Kubernetes. And the really cool thing is, Kubernetes can scale. So you can actually use the scaling of Kubernetes to handle the load of your different tasks that are being pulled by the ClearML agent. Now, one way of doing this is spinning up a ClearML agent as a long lasting service pod that has access to the docker socket so it can actually spin up new docker containers as it sees fit. Or you can use Kubernetes glue, which is some code that we wrote that allows you to actually map ClearML jobs directly to Kubernetes jobs. So that's actually also a really neat way of doing this. Now we would go a little bit too far if we go straight into Kubernetes in this video, but if you're interested, let me know in the comments and we'll make a video about it. Now that we have all our agents, let's take a look at the code and I'll give you some examples on how you can enqueue some tasks into these queues and get get these workers that we've just spun up working their hardest. So if we go and take a look here, what we see is an a simple Python file that does some training. This is CPU based training. It's essentially using light Gbm to train a model. So this is the kind of thing that a data scientist would give to you that you would have made yourself and now you want to get it into the queue. Now one way of doing that is what we saw before you could do a `Task.init` which essentially tracks the run of your code as an experiment in the experiment manager and then you could go and clone the experiment and then queue it. This is something that we saw in the getting Started videos before. Now another way of doing this is to actually use what you can see here, which is `task.execute_remotely`. What this line specifically will do, is when you run the file right here. Let me just do that real quick. So if we do `python setup/example_task_CPU.py` what will happen is ClearML will do the task.init like it would always do, but then it would encounter the `task.execute_remotely` and what that will tell ClearML is say okay, take all of this code, take all of the packages that are installed, take all of the things that you would normally take as part of the experiment manager, but stop executing right here and then send the rest, send everything through to a ClearML agent or to the queue so that a ClearML agent can start working on it. 
So one way of doing this is to add a `task.execute_remotely` just all the way at the top and then once you run it, you will see here `clearml WARNING - Terminating local execution process` and so if we're seeing here if we're going to take a look we can see that Model Training currently running and if we go and take a look, at our queues here, we have any-remote-machine running Model Training right here. And if we go and click on this, we go back to our actual training task and we just can easily follow along with what is happening here. So okay let's take a look at how you can do this differently as well. So there is a specific different kind of way of doing this. And let me take the example task GPU here. So this is a lot larger in terms of what it does. But essentially it just trains a model. So you have trained tests and main. And what you can see here is, we have the `Task.init` in main. It's just a global scope so that's all fine. Then we parse a bunch of arguments and then something very interesting happens. So we create our train loader and our test loader right here. But then what we can also do is say okay for epoch in all our epochs, so for in the epoch range, what we can say is if the epoch is larger than one, it doesn't execute remotely in the GPU queue and so what this will do is it will train the model for one epoch locally. Which means that you can test that it works and if you get a single epoch it usually means it's working. And then if we get to that point and start epoch number two, we actually just run it remotely and then ClearML will take this whole bunch and start working on it remotely instead. Which means that you can very easily locally debug and see if everything works and once everything does work, it will just immediately send it to a remote machine that will do the actual heavy lifting instead of you having to do it on your laptop or computer. So if we actually run this, it will be GPU right here. Then it's very interesting to see this happen. I really like this workflow because you have this local debugging first. So as you can see here, let's wait a little bit so as you can see it's completed it's training. And we can also see that it's only been for one epoch so in this case it only went for one epoch and then once it reached that point as we saw in the code it will say `clearml WARNING - Terminating local execution process` so in this case it's already sent it to the remote machine. Now if we're going to take a look at the remote machine, we can see that we have our Model Training GPU in pending state and remember we had no workers at all in our GPU queue. We have zero workers and the next experiment is our Model Training GPU. But remember again that we also have the autoscaler. So if I go to applications and go to autoscaler, you'll see here that we have indeed one task in the GPU queue. And we also see that the GPU_machines Running Instances is one as well. So we can follow along with the logs here. And it actually detected that there is a task in a GPU queue and it's now spinning up a new machine, a new GPU machine to be running that specific task and then it will shut that back down again when it's done. So this is just one example of how you can use `task.execute_remotely` to very efficiently get your tasks into the queue. Actually, it could also be the first time. 
So if you don't want to use the experiment manager for example, you don't actually have to use a task that is already in the system, you can just say it does not execute remotely and it will just put it into the system for you and immediately launch it remotely. You don't ever have to run anything locally if you don't want to. So this is essentially what we're talking about when I'm talking about orchestration. So the the autoscaler, the workers, the queues, the ClearML agent, everything here is what we call orchestration. It's essentially saying you should run this then there and everything is managed for you. That's the idea here. But there's also something else that we usually quite a lot talk about. And that is Automations. And Automations is specifically trying to automate a lot of manual stuff that you probably would be doing, but without actually noticing that it could be automated. So let me tell you what I mean with that. Let's go into automation here and get for example, the task Scheduler. And the task Scheduler, it's very, very intuitive to know what it does right? So a task scheduler will essentially take a specific task and it will schedule it to run every X amount of time. So in this case, for example, the MLOps engineer gets called in by Project Team NASA. So Project Team NASA, which is really cool, they're actually creating a model here that is meant to detect if asteroids are going to be hazardous for earth or not,. So they come in, they have a specific project of their own. So if I go into Project Team NASA here, you see that they have a bunch of tasks here. For example, getting the data, they will pour that into a real asteroid dataset that will preprocess the data and actually put that preprocessed data in another dataset version which is preprocessed asteroid dataset. And then you have a bunch of model training that they do on top of it in which they have scalars with test and train. You all know the drill. This is the kind of thing they're doing. So they actually have their feature importances, which is really cool. So they call you in as an MLOps engineer or you are part of the team and you're the MLOps engineer, the designated developers engineer and you essentially, what you want to do is, if we go to get data here and we go into configuration, you see that there is a query date and if we go into the code of this query date. What we'll see here, this is the original repository of the NASA Team. We'll see that they actually query a database with a specific query and the specific areas select everything from asteroids which is their data, where the date is smaller than the data given. So they have an actual end date and everything before that given date is the data they want to work with. So if we're going to take a look, the query date here is a specific a specific date, but that's not today. So essentially what they want to do is rerun this, get data every single day or week or month depending on how quickly they can get their data labeled. So this could be done manually relatively easily. You could just do every week, Click, go here and it will just put a new entry in the experiment list or you could of course automate it. And that's essentially what we're going to do with the task scheduler. So you just get the task scheduler object from the automation module. You say the amount of sync frequency. So this is essentially just when you change something in the configuration of the task scheduler, it will poll every minute to get that. 
So this can be very low or very high depending on what you want. I want the scheduler itself, which again, everything in ClearML is a task. So I want the scheduler itself a scheduled task, to be in the Project MLOps because this is the actual scheduler. I want to be taking care of that and then the actual task will be in the original NASA project. Also, I want to call it NASA Scheduler because it just sounds cool. Then what we could do is get a task from the Project Team NASA project folder, and then the get data task. But there is something else that we can do. You could easily just clone a task. This is essentially what the task scheduler is doing. If you watch the getting started videos, you know that we can actually clone any of the experiments that are in the system and then change the parameters and rerun it. So we could get the data or clone the get data and then do a task parameter override of the query date with the current date today. That's very valid, but the NASA team actually made something really cool. If we go to pipelines here, you see that there is a NASA pipeline as well and the NASA pipeline is actually the exact steps that we saw before, but they train three different models with three different parameters and then pick the best model from there. And what we see is that the query date is actually a parameter of the pipeline as well. And if you remember correctly, pipelines are also tasks in ClearML, everything is a task, so that means that you can use this the task scheduler also to schedule a complete pipeline run. And then overwrite the parameters of the pipeline run just as easily as you could do with any other task. So if I go into the full details of this task here, you will see that this is actually the pipeline itself. The pipeline has just as any other task, these different tabs with info, consoles, scalars, etc, etc. and it has an ID as well. And this ID, if we copy it, we can actually use that instead. So let me paste it, it's already there. So the task to schedule is in fact the pipeline that we want to schedule. And then if I do scheduler add task. I take the ID of the task to schedule, which is the pipeline in this case, I want to enqueue it in the CPU queue. I want it to be at the hour 8:30. So 8:30 o'clock every Friday. So every week at 8:30, this will be run. So the pipeline will be cloned and run using this parameter override. And the parameter override says essentially, take the query date, but set it to the date of the day instead of whatever was before and then started remotely. So if I run this, we have automation and then task scheduler. So if I run this, it will create a new scheduler task in the Project MLOps folder and then it will start. Because I said here, execute immediately, it will immediately clone and recreate a pipeline already, right now. And then we'll start doing this every other week on Friday 8:30. So if we're looking at our projects here, we see that we have Project MLOps. We have the NASA scheduler, and the NASA scheduler is of course pending because it itself is a task that needs to be grabbed by one of the agents. And so in this case, we can probably see that our agents are still busy on the different tasks. So let's just abort one so that we can easily take over what we should do. Let's abort all of the example tasks just so we can get going here. Oh, actually, you can do that with multiple at the same time. So you have a board here. If you selected two, that will all work quite well. 
So we have our CPU queue here, we have our GPU queue here. There are all our workers. And now we see that the NASA scheduler is actually scheduled on Beast One, so it's currently running. If we go to Project MLOps, we see that our scheduler is in fact running. And then if we go to our console, we can follow along with its setup. And then if we go into our pipelines, we should be able to see that a new pipeline is being started up by the scheduler. Now there we go. It says here: Launching Jobs, Schedule Job, Base Task ID, Base Function, Blah blah blah. Essentially, it's saying I've launched the pipeline. So if we go into the NASA pipeline, we should see that in fact, there is now NASA Pipeline 2 that is currently running that is using the exact date of today instead of the previous version, which is using the date of before. So this is a very easy way of automating or scheduling essentially, just tasks, pipelines, datasets, whatever you want in the ClearML ecosystem, you can schedule it. Then there is a second type of automation that we can do. So we have the trigger scheduler as well. And the trigger scheduler is something relatively similar to a task scheduler. Only the difference is that instead of with a task scheduler, you create a task, or you clone a task and run it and then queue it based on a specific schedule. With a trigger is based on a trigger. So a trigger could be any event in the ClearML ecosystem. In this case, it could be a successful dataset, a tagging of a dataset, or any other kind of event that the ClearML ecosystem can produce, you can use as a base to launch a new kind of task. So if we're going to take a look here, we actually want to create a trigger scheduler for something called Alice Bob. And I'm going to explain what that means. So if we're going into our Alice project here, Alice is a data scientist of our team, and she essentially asked me to help her. So she has a bunch of model training tasks here. She actually uses the stable tag as well. We'll come back to that later. And essentially what she's doing is just training a model based on the dataset of another data scientist in another project. So if we're going to take a look at that other project, it's called Bob. So Bob is the other data scientist which is in charge of producing the dataset that is required. So essentially, he uses the production tag to tell Alice this is the kind of dataset that you want. This is the dataset that you want to use. This is the best so far. He has more recent datasets, but hasn't tagged them as production yet because he says they're not ready., we should do that later. So he can just keep continue to experiment, and to to do all the kind of things that he wants while Alice is still firmly using production. So what Alice is doing is she's essentially querying on the dataset of production. But it's annoying because they're in a different time zone, for example. And when Bob publishes a new dataset, Alice has to be notified by using a chat application or whatever. And then Alice has to re-run everything remotely so that her training is using the latest version. So this is not ideal, we can automate all of this. And this is essentially what the task trigger is trying to do. So again, we make a scheduler just like we did with task scheduler. Pooling frequency in minutes is again to poll the configuration as well as sync frequency minutes. We put again the scheduler itself. We put it in the Project MLOps. We call it Alice-Bob Trigger instead of the scheduler before. 
And then we get the task that we want to actually trigger. So not the task that triggers, but the task that we want to create when the trigger, triggers, if that makes sense. So the actual task that we want to make is, we want Project Alice, so we want the training job for from Alice. Actually, so we use the project name, Project Alice and or Project Alice I think. Let me just check real quick. Project Alice Not Project Alice. There was a little mistake there. So we have task name Model Training. So we want any of the Model Training and Alice actually uses a stable tag. Like I said before, she uses a stable tag to say, this is actually my latest good version of an experiment. So if there is a new dataset I want to retrain a clone and enqueue and retrain this specific version of my experiment. This also allows Alice to experiment and continue experimenting without the dread of having a new dataset come in and then be it being retrained on code that is not stable yet. So we can use the tag for that purpose. But if I go back to the task trigger, essentially what we're going to get here is a task from the project Project Alice with the name Model Training but also crucially with the tag stable. And then if there's multiple tasks that fit this description, it will just take the latest one. So it will take the latest task that has a tag stable from this project. Now we have to add a trigger. And you can add a dataset trigger, you can add a task trigger, you can add any kind of trigger that you wish. In this case it will be a dataset trigger. If we have a different dataset, a new kind of dataset that fits this description, we want to trigger this task. So essentially the scheduled task ID is the task that we want to run if the trigger triggers, which is in this case `training_task.id` is the Project Alice task, the Model Training task. We have the schedule queue so we want to obviously schedule it in any of the queues. We can use CPU queue in this case and then we can give it a name as well. And just to make it clear that this training is not actually training from Alice herself, but it's training on the new data of Bob. It's an automated training. We can give it a specific name so that Alice knows this was triggered automatically and then we can use `trigger_on_tags` where we should look to to actually trigger the trigger. Damn this is a lot of trigger. So what happens here is we look in the Project Bob folder and then if a new tag production is found that wasn't there before we trigger and in this case this means we create a new task project Alice. So if we're going to run this so automation, not task scheduler but task trigger, this will again create a new specific let's call it orchestration task or automation task. And these are kind of tasks that you want in the services queue. These are the tasks that essentially just keep track of, is there a new data set version or not and it's very light to do so. This is typically something you want in the services queue. So we have terminating local process so it should now be in Project MLOps right here. So we see that our NASA scheduler is running, but the Alice-Bob Trigger is still pending because obviously we have our pipeline running and our workers need to need to first work on that and then they can go on. So if we take a look at the queues, we're actually now using the tools that we need. So we see that in the services queue the Alice-Bob Trigger was the next experiment and it's just been picked up. 
So we should see here that indeed, one of the beasts workers here has picked up Alice-Bob Trigger which is essentially what the queues are meant for. We're pushing too much into the system so they're just waiting a little bit before the next thing has finished. If we take a look at our NASA pipeline, we see that it's actually going very well here. So these are the kind of tasks that our workers were busy with before they picked up the Alice-Bob Trigger. So now we see that the Alice-Bob Trigger is in fact running. We can take a look at the console and it can tell you that okay, everything is installed. It gives you a few error messages, which is usually a good thing because it says that it's actually doing something. It's sleeping until the next pool in one minute. So it's polling every minute. So now we should be able to go into Bob's project. And if we say okay, I want to add a tag here, production. And in this case, what I just did is I created a new dataset version with this specific tag. I said okay, this example dataset. I've tested it. I've checked it. I'm Bob in this case. So I've tested it. I've checked it. Everything seems to be in order. So I'm going to tag this as production and this should technically trigger the task trigger or the trigger task to pick that up and to then spin up a new Model Training run for Alice. And Alice will then pull the latest version that fits the production tag. So essentially she will pull this one and then end up with a new version or like with the new task that is running. So if we're going to take a look, it's sleeping for a while. Technically it won't be in Alice yet. So we should wait just a little bit before the Alice-Bob Trigger picks it up. But it shouldn't take very long. So as we can see the scheduling job, Alice-Bob, new training data or new data, training has been scheduled on the CPU queue, so it has essentially figured out that Okay, we actually do have a new tag now, so it is being scheduled. If we're going to take a look at Project Alice now, you can see that in fact, Model Training is running currently, so it's been enqueued, it's running, and it's been tagged as Alice-Bob New Data Training. So Alice actually knows that this time this model is automated. Finally, there are some things that I want to show you that might make your life easier. Yet again, that is the name of the game in this video but they're just a little bit smaller. So one of the things that I want to show you is the monitor. It's the ClearML monitor. It's essentially an object that you can implement that you can override and it allows you to take a look into the depths of the ClearML ecosystem, what happens there? So it can give you an idea of when tasks failed, when tasks succeeded, all of the types of events that ClearML can generate for you. So one of the things you can do with it and this is part of the example. It's also in the example repository, is create a slack bot for it. So essentially we've just used a bunch of slack APIs around this monitor, which is just a slack monitor that we created ourselves and that will essentially just give you a message whenever a task succeeds, fails, whatever you want to do. So in this case, it's fully equipped. We added a lot of arguments there so that you can just use it as a command line tool, but you can create your own script based on your own requirements. Now what it will do is, let me show you, is in this case, I'll make it a little bit bigger. 
You can see that there is a ClearML Alert Bot that you can just add to your slack if you want it, and it will essentially just tell you what kind of projects, what kind of tasks are succeeded or failed. You can set, I want only alerts from this project, I want only alerts that are failed, Only alerts that are completed, will give you a bunch of output as well, which is really, really useful to see, etc etc. So this is just a very small extra thing that you can take a look at to have some monitoring as well so that you don't even have to wait or take a look yourself at when your tasks finished. Another thing that I want to show you is a cool way to just get a little bit more of that juicy GPU power. One way you can add agents next to Kubernetes spinning up themselves, spinning up a ClearML agent on your own machines or the auto scaler is Colab. So the runtime was just connected here, but Colab is something we all know, we all love. It's an easy way to get notebooks on a GPU machine very easily, but it's also very easy to get a ClearML agent running on this,. So this is really, really cool. I personally really like it. So I can say `!clearml-agent daemon --queue "GPU Queue"` and if I run this, essentially we get a free GPU worker. So it is currently doing the ClearML agent thing. This is the output of the ClearML agent and if we go into our project here. We can now see we have a GPU all worker that is essentially just Google Colab. So you can spin up a bunch of Google Colabs, run all of your agents on here. And the only downside is that you can't use the docker mode. So this will mean that every single task that is being run by this Colab instance is actually going to be run in the environment of the Colab instance. So if the Colab instance has a different Python version than you, it's a bit annoying, you can't spin up a different container. But that's really only the only downside. So this is just a quick way. The actual notebook you can find on our github. But this is just a really cool way to get some extra GPU power as well. Now, all of these agents is one thing. You have the queues now, finally. Now, thank you for making it through this far. We haven't actually even covered everything that ClearML can automate for you. There is HPO, so which is hyper parameter optimization. There are pipelines as well that can chain everything together. You saw a little bit when I showed you the NASA project, but yeah, we're not there yet. There's also even a ClearML session that you can use to run on a specific machine, on a remote machine and it will give you a remote interactive Jupyter Notebook instance or even a VS code instance so that you can always code already on the remote machine. So that's also really, really cool. It's something we're going to cover soon, but I think the video is already long enough. So thank you very very much for watching. Thank you very very much for your attention. Let me know in the comments: if you want to see videos of these hyper parameters and pipelines and sessions and don't forget to join our Slack channel if you need any help.
+Hello again and welcome to ClearML. In this video we'll be going over a workflow of a potential MLOps Engineer. Now an
+MLOps Engineer is a vague term. This might be a specific person in your team that is doing only the Ops part of the
+machine learning. So the infrastructure and all of the workers and whatnot. Or it could be you as a data scientist. It
+could be just the data scientist of the team that is most into things like docker and deployments. And that person now
+has the job of an MLOps Engineer. So it really doesn't matter who you are. This video is going to be about what this
+kind of person will be doing and what ClearML can do to make their life a little easier. Just a little.
+
+So what we're going to do here is take a look or get started at the very least with our Workers and Queues tab. So if
+you've followed along with the Getting Started videos, this is actually something you've probably seen before, but I'm
+going to go a little bit more into depth in this video.
+
+So the workers and queues tab, what does it do? So we have what we can expect. We have the workers tab, and we have the
+queues tab. Workers in ClearML are actually called agents. So you can see here that we have a bunch of available workers
+which are spun up by using the ClearML agent. I'll go more in depth in that in a minute. So we have a few available
+workers. We have Beast Zero, One, Two, and Three. I'm the person that called my own computer Beast. So my own computer
+is running a few workers here. And then we also have Apps Agents, and I'll go a little bit more into detail on what that
+means later. Essentially, what it means is you have the applications right here and what that's going to do is give you
+a few pre-made applications for automation that you can use straight out of the box. And if you use any of them, in this
+case a GPU scaler, an auto scaler from the cloud, then it will spin up an available worker for you that will just manage
+the orchestration there so that that worker will do nothing else but just tell things where to go and what they should
+do.
+
+So if we go a little bit more into depth here, we can also see which worker is running which experiment at this moment.
+So we have Example Task 0, 1, 2, and 3 - programmer counting, of course. We see the experiment running time. We see the
+iterations. In this case, it's a classical machine learning model, so we don't really have iterations, but if you have a
+deep learning model, this is where your number of iterations would come into play.
+
+If we click on any of these, we can see the worker name, and we can see its utilization over time in here as well. All
+right, so we can obviously make this longer. I've only been running this for a few hours or for an hour. So we have the
+worker name right here. We have the update time, so we know the last time the worker actually
+sent in any new data. We have the current experiment, which we can click through - I'll do that in a minute - and we
+have the experiment runtime and experiment iterations here as well.
+
+We also have the queues, which means that we can actually find out what queues this worker is listening to. I should
+give some more context here. So if we go into the queues, ClearML works with a system of queues as well as workers. So
+this actually comes from the fact that originally people were just giving out SSH keys to everyone to get them to work
+on a remote system. And this is far far far from perfect, right? So you have several people SSHing in, you have several
+people running their own workloads on the same GPUs. They have to share everything. Because of the amount of people that
+are also running their stuff on the GPU, you actually have no idea how long your own task will take, so that's one thing.
+You can't have any priorities either. Everyone is just running their stuff, probably killing your stuff as
+well because the GPU runs out of memory when too many people are using it. So that's just a disaster, right? If you have a
+large GPU machine that you have to share with multiple people, or just even want to orchestrate several different tasks
+on, with different priorities, it becomes a real hassle. So that's actually why ClearML has workers and queues to try
+and deal with that a little bit. And this is actually what we're calling orchestration. So if you look at our website,
+you'll see orchestration and automation. Those terms might not mean very much. So this is what I'm going to be talking
+about in this video.
+
+Orchestration in this case is like a director in an orchestra. You're essentially saying who should do what when, so
+which worker should run which experiment, or which task at what time and in what priority. So this is what the queues
+are all for. Essentially, queues are just what they're called, right. They're queues, but you can have as many of them
+as you want. So in this case, we have the services queue, we have the default queue, GPU queue, and CPU queue. You can
+create new queues by clicking the button here, so it's very simple. I can make a very simple queue, but this one is
+pretty worthless, right? But you can make however many of them you want. I can delete that queue again. We can see for each
+queue how many workers it has. So I'll show you that in a minute when we spin up a new worker. But we can actually
+pretty easily see how many workers are serving a specific queue - so listening to that queue - and that actually has an
+effect on the overall waiting time. So for example, here we have four workers that we saw here before, right? So these
+are these four workers. They're all listening to the CPU queue. They're all running a CPU experiment. But then we have a
+bunch of other experiments in the queue still. So this is just a list of, essentially, the order in which the next
+example tasks will be executed. So we see here that the next experiment is task four. We see that it was last
+updated there, and now we see the queue experiment time rising rapidly, because people are waiting in this queue and
+there are many more tasks to be executed. We also have CPU queued experiments, which is just the number of queued
+experiments per queue.
+
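+Everything in this tab is also exposed through the server API, so a quick health check can be scripted instead of
+watched in the UI. A small sketch, assuming the standard `workers.get_all` and `queues.get_all` endpoints reachable
+through `APIClient`:
+
+```python
+from clearml.backend_api.session.client import APIClient
+
+# List workers and queues from code instead of the Workers and Queues tab.
+client = APIClient()
+
+for worker in client.workers.get_all():
+    print("worker:", worker.id)
+
+for queue in client.queues.get_all():
+    print("queue:", queue.name)
+```
+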
+So we also have a GPU queue and in this case, we see that we have zero workers here. Now we'll go a little
+bit more into depth on that later because we don't actually have zero workers there. We actually have an auto scaler
+listening to this. Then we have the default queue, and we have the services queue and I should stop a little bit on the
+services queue because the services queue is relatively special in ClearML. You have all of your custom queues that you
+can create however you want: CPU queue, GPU queue, whatever. The services queue is actually meant to host specific, not
+very heavy workloads that do this kind of orchestration that I was talking about. So imagine you have a pipeline,
+for example; if you don't know what a pipeline does in ClearML, you can look at the Getting Started video that we made
+on pipelines. If you want to run a pipeline, you will need a pipeline controller. Essentially, it's a very small
+piece of code that tells ClearML: now you should run this step, then take that output, give it to the next step, run
+that step, give its output to the next three steps, run those, and so on, and so forth.
+It's essentially the director. It's orchestration, right? And so that's what the services queue is meant for: it's
+usually meant to hold those orchestration tasks, those long-running, not very heavy tasks. So that allows you to
+essentially assign a bunch of not very powerful CPU machines to that queue just to do the orchestration and then
+everything else, like your GPU machines or your heavy CPU machines, can be assigned to the CPU and GPU queues in which
+we can choose to just enqueue tasks or experiments that do simply that. So that's essentially what a services queue is
+when compared to other user-made queues.
+
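+To make that concrete, here is a minimal, hedged sketch of what such a setup could look like with ClearML's
+`PipelineController`: the controller itself goes to the services queue, while the steps are enqueued on the heavier
+queues. The project, task, and queue names are placeholders, not the ones from this video:
+
+```python
+from clearml import PipelineController
+
+# A lightweight controller in the services queue, heavy steps elsewhere.
+pipe = PipelineController(name="example pipeline", project="Project MLOps", version="1.0.0")
+
+pipe.add_step(
+    name="get_data",
+    base_task_project="Project MLOps", base_task_name="get data",
+    execution_queue="CPU Queue",  # data work goes to the CPU workers
+)
+pipe.add_step(
+    name="train", parents=["get_data"],
+    base_task_project="Project MLOps", base_task_name="training",
+    execution_queue="GPU Queue",  # training goes to the GPU workers
+)
+
+# The controller only orchestrates, so it runs from the services queue.
+pipe.start(queue="services")
+```
+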
+We can see here that we have a bunch of workers, so we have the Beast 0, 1, 2 and 3 that are assigned to this services
+queue. But as we can see if we go and take a look at the CPU queue, we have a whole bunch of tasks here. So there is a
+lot of people waiting for their turn. So actually one thing that we can do is we can already change the priority. So
+imagine we have example Person 19 that has a very, very tight deadline, and they actually need to be first. So we can
+just drag them all the way up - let me scroll there - and there we go, all the way up top. So now we
+can see that example task 18 is the first in the queue. So once any of the four workers finish off their
+example 0, 1, 2 or 3, the next one will be example 18. So in this case, you can actually very easily change the priority
+and make sure that the people that have a deadline meet their deadline. This could be a common request in
+the day of the life of an MLOps engineer: please give me priority, please let my stuff run first. So that's something
+you can do.
+
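+Dragging tasks around in the UI is the easiest way to reprioritize, but for completeness this can also be scripted.
+The following is only a sketch, under the assumption that the `queues.get_all` and `queues.move_task_to_front` server
+endpoints are available through `APIClient`; the queue name and task ID are placeholders:
+
+```python
+from clearml.backend_api.session.client import APIClient
+
+# Bump a queued task to the front of a queue from code.
+client = APIClient()
+
+queue = client.queues.get_all(name="CPU Queue")[0]
+client.queues.move_task_to_front(queue=queue.id, task="<task-id-with-the-tight-deadline>")
+```
+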
+Another thing you can do, which is relatively simple, is aborting a task. A task is essentially just a Python script
+that was cloned, or that was already in the system, or that was put into the system, and that can be recreated on a
+different machine. So what we could do is go into the queues here and then say clear, which would clear the complete
+queue; that's something we don't necessarily want in this case. But we do, for example, want to abort task 0. One way
+of doing that would be to go to the current experiment, right here. If I click on that, ClearML will actually bring me
+to the original experiment view, the experiment manager, remember, everything is integrated here: the experiment
+manager of that example task. So what I can do here, if I look at the console, is see a bunch of output, and I can
+actually abort the task as well. If I abort it, this task will stop executing. Essentially, it will send a `ctrl c`, so
+a quit command or a terminate command, to the original task on the remote machine. The remote machine will say: okay,
+I'm done here, I will just quit it right here. If, for example, your model is not performing very well, or you see
+like, oh, something is definitely wrong here, you can always just abort it. And the cool thing is, if we go back to the
+workers and queues, we'll see that Beast 0 has given up working on task 0 and has now picked up task 18 instead, which
+is the task that we moved up in terms of priority, so it has the next priority. Work has already started on task 18. So
+this is really, really cool.
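+
+In the video this is done from the UI, but something similar can be scripted with the SDK. Treat this as a sketch:
+whether `mark_stopped()` aborts a remotely running task exactly like the UI Abort button does is an assumption on my
+part, and the task ID is a placeholder:
+
+```python
+from clearml import Task
+
+task = Task.get_task(task_id='<task_id_from_the_experiment_view>')  # placeholder ID
+task.mark_stopped()  # request the task to stop, like pressing Abort in the UI (assumed)
+```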
+
+But we can see that in the CPU queue, the number of tasks is still very, very high. Even though we just aborted one,
+people are waiting, the waiting time is rising, and the number of tasks is very, very high. So what we should do now is
+actually start up a new worker. This is very much part of the daily life of an MLOps engineer: adding workers to your
+worker pool. I'll keep the Workers tab open so that we can see very clearly when we add it here. Let's go out of full
+screen, and we're going into a remote machine here. You could remote into almost any machine, right? It doesn't really
+matter which type of machine it is, whether it's a cloud VM, on-premise, or your own laptop; it could be any remote
+machine that you want. It's very easy to turn those into a ClearML agent, or a worker for the ClearML ecosystem.
+The first thing you'll have to do, though, is `pip install clearml-agent`. This installs the very thing that is
+necessary to turn the machine into a worker, and that's actually everything you need, all the packages you need. It is
+a Python package, so you just need pip (or pip3) available, which you usually have. The next thing you should do is
+`clearml-init`. Now, `clearml-init` will connect this machine to your server, the actual orchestration server, the one
+that handles all the workers and queues. In this case it's [app.clear.ml](https://app.clear.ml), which is the hosted
+version; you can also host your own open source server if you so require. If I run `clearml-init`, you'll see that I
+have already done this, of course, but in this case you should be able to just add your credentials from the server,
+and it should connect no problem. If you want more information on that, we have tutorial videos on that as well. And
+then the next thing we should do is `clearml-agent daemon --queue`. Now we can decide which queues we want this ClearML
+agent to listen to. In this case, if we go and take a look at the queues, we have the CPU queue, which is by far the
+most requested queue. Imagine we have an extra machine on hand in the faculty or the company you're working with, and
+you want to add this machine, or its resources, to the pool. So in this case we're just going to say `CPU Queue`, and
+that's everything: we just want a simple extra machine for the CPU queue. Then we're going to add `--docker`, because
+personally I quite like the machine to be using Docker. Essentially what will happen here is that the ClearML agent
+will pull a new task from the queue and then will spin up a new container based on either the default image that I
+give here or the image that is attached to the task itself. So the data scientists that are creating their remote
+tasks, or their experiments, can also assign a Docker image that the task should run in. If you have very specific
+package requirements or very specific needs, you can, as a data scientist, already say: I want to attach this Docker
+image to it, and it will be run like that on the ClearML agent. So that gives you a lot of power. But in this case I
+will just say: if the data scientists gave no indication of what Docker container to use, just use Python 3.7. This is
+the standard in our company, let's say, and this is what we want to use. Okay, so if I run this, it should start up a
+new ClearML agent. In this case you can see it's running in Docker mode, it's using the default image Python 3.7, and
+it's listening to the CPU queue. Now if we go back to our Workers and Queues tab, we can see `any-remote-machine:0`
+here. So we now immediately have a new remote worker, and it has actually already started on the next task. So now
+we're currently running five workers on the CPU queue instead of four. This was very, very easy to handle and very,
+very easy to set up. So this is one part of what an MLOps engineer could be doing.
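+
+To recap, the commands from this section, roughly as you would type them on the remote machine; the `python:3.7`
+default image stands in for whichever default you pick:
+
+```bash
+pip install clearml-agent        # install the agent package
+clearml-init                     # paste the credentials created in the ClearML web UI
+clearml-agent daemon --queue "CPU Queue" --docker python:3.7
+```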
+
+Now, this is very manual to set up, and the MLOps engineer is king of automation after all. So we want some kind of way
+to automate all of this, right? What we can do here is go to Applications, where we have, in essence, an AWS Autoscaler
+and a GCP Autoscaler; Azure will come later, so that will be out soon. If we go into the AWS Autoscaler, what we see
+here is that we have an MLOps GPU scaler. What that means is that we don't always have fixed demand for GPU resources,
+right? Imagine a company that, like in this case, has a lot of demand for CPU compute, five workers and a lot of tasks,
+but only gets GPU requests every so often. It's not very economical to buy a few very, very powerful GPUs just for a
+few minutes, or a few hours, of use every week. What is much more economical there is to use the cloud instead, where
+you pay for the hours that you use a GPU and you don't pay for the hours you don't use it. Of course, what you could be
+doing is, whenever a data scientist needs a GPU machine, go to the cloud console, spin up a new VM, SSH into it, and
+then create a ClearML agent on it so it can be reached from the queue. But that could also be done automatically, and
+that's essentially what an autoscaler does for you. The autoscaler will detect that there is a task in the queue, spin
+up a new machine in the cloud, run the ClearML agent there, connect it to your own server, and then run that specific
+task. And if that task is done and the machine is up for, say, a minute without doing anything (you can choose that
+idle time, by the way), it will just shut itself down again. That actually makes it very, very economical, because if
+you've ever forgotten to shut off a machine, especially a GPU machine in the cloud, over the weekend, it's super
+expensive. You can actually pay for ClearML Pro for a year with the same amount of money as forgetting to shut down a
+large GPU machine for a single day. Just to give an idea of how economical this can be.
+
+If we go and take a look at our configuration here, we can see that we have our AWS credentials (for GCP it would
+obviously be GCP credentials), and we have our Git configuration. This is what the MLOps engineer will be doing:
+configuring this kind of thing. They will be configuring the max idle time, which says how long the machine should be
+doing nothing before we shut it down again. It could be beneficial to keep it up for a little while, because if another
+task comes in, say, two minutes later, it is launched immediately and you don't have to wait for the machine to boot
+up. You can add prefixes, you can set polling intervals, you can set a base Docker image to use, which obviously can be
+overridden again. And then there are, of course, the compute resources. You can have GPU machines of a specific type,
+which obviously depends on which cloud provider you're using, but it's all basically the same thing. You can run in CPU
+mode, use Spot instances to get some savings there, set availability zones, etc. So this is the kind of thing that an
+MLOps engineer would probably spend a lot of their time tuning and fine-tuning and getting up and working.
+
+Another way to get some extra agents, next to running the ClearML agent manually or spinning up an autoscaler, is to
+run them on Kubernetes. If I go to the ClearML GitHub and open the ClearML Agent repository, you'll find two different
+ways of integrating the ClearML agent with Kubernetes. And the really cool thing is, Kubernetes can scale, so you can
+actually use the scaling of Kubernetes to handle the load of the different tasks that are being pulled by the ClearML
+agent. One way of doing this is spinning up a ClearML agent as a long-lasting service pod that has access to the Docker
+socket, so it can spin up new Docker containers as it sees fit. Or you can use the Kubernetes glue, which is some code
+that we wrote that allows you to map ClearML jobs directly to Kubernetes jobs. That's also a really neat way of doing
+this. Now, it would take us a bit too far to go straight into Kubernetes in this video, but if you're interested, let
+me know in the comments, and we'll make a video about it.
+
+Now that we have all our agents, let's take a look at the code, and I'll give you some examples on how you can enqueue
+some tasks into these queues and get these workers that we've just spun up working their hardest. So if we go and
+take a look here, what we see is a simple Python file that does some training. This is CPU-based training; it
+essentially uses LightGBM to train a model. So this is the kind of thing that a data scientist would give to you, or
+that you would have made yourself, and now you want to get it into the queue. One way of doing that is what we saw
+before: you do a `Task.init`, which essentially tracks the run of your code as an experiment in the experiment manager,
+and then you go and clone the experiment and enqueue it. This is something that we saw in the Getting Started videos before.
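+
+As a rough sketch of that clone-and-enqueue route in code (the experiment ID is a placeholder you would copy from the
+experiment manager):
+
+```python
+from clearml import Task
+
+# clone an experiment that was already tracked with Task.init, then enqueue the clone
+original = Task.get_task(task_id='<experiment_id>')                      # placeholder ID
+cloned = Task.clone(source_task=original, name='Model Training (clone)')
+Task.enqueue(cloned, queue_name='CPU Queue')
+```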
+
+Now, another way of doing this is to use what you can see here, which is `task.execute_remotely`. What this line
+specifically does is the following: when you run the file right here, let me just do that real quick, so if we do
+`python setup/example_task_CPU.py`, ClearML will do the `Task.init` like it always does, but then it will encounter the
+`task.execute_remotely`, and what that tells ClearML is: okay, take all of this code, take all of the packages that are
+installed, take all of the things that you would normally capture as part of the experiment manager, but stop executing
+right here and send everything through to the queue so that a ClearML agent can start working on it. So one way of
+doing this is to add a `task.execute_remotely` all the way at the top, and once you run it, you will see
+`clearml WARNING - Terminating local execution process`. And if we take a look, we can see that Model Training is
+currently running, and if we look at our queues here, we have `any-remote-machine` running Model Training right here.
+And if we click on this, we go back to our actual training task, and we can easily follow along with what is happening here.
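+
+In code, that pattern looks roughly like this; the project, task, and queue names are illustrative:
+
+```python
+from clearml import Task
+
+task = Task.init(project_name='Project MLOps', task_name='Model Training')
+# stop the local run right here and enqueue everything that follows to the CPU queue
+task.execute_remotely(queue_name='CPU Queue')
+
+# ...the actual training code below this line only runs on the ClearML agent
+```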
+
+So okay, let's take a look at how you can do this differently as well. There is another way of doing this, and let me
+take the example GPU task here. It is a lot larger in terms of what it does, but essentially it just trains a model. So
+you have train, test, and main, and what you can see here is that we have the `Task.init` in main. It's just in the
+global scope, so that's all fine. Then we parse a bunch of arguments, and then something very interesting happens. We
+create our train loader and our test loader right here. But then what we can also do is say: okay, for every epoch in
+the epoch range, if the epoch is larger than one, execute remotely in the GPU queue. What this will do is train the
+model for one epoch locally, which means you can test that it works, and if you get through a single epoch, it usually
+means it's working. Then, when we get to that point and start epoch number two, we actually just run it remotely, and
+ClearML will take this whole bunch and start working on it remotely instead. Which means that you can very easily debug
+locally, see if everything works, and once everything does work, it will immediately send it to a remote machine that
+will do the actual heavy lifting instead of you having to do it on your laptop or computer. So if we actually run this,
+it will be the GPU task right here, and it's very interesting to see this happen. I really like this workflow because
+you have this local debugging first. So as you can see here, let's wait a little bit, it's completed its training. We
+can also see that it only went for one epoch, and once it reached that point, as we saw in the code, it says
+`clearml WARNING - Terminating local execution process`, so in this case it has already sent it to the remote machine.
+Now if we take a look at the remote side, we can see that we have our Model Training GPU in `pending` state, and
+remember, we had no workers at all in our GPU queue. We have zero workers and the next experiment is our Model Training
+GPU. But remember again that we also have the autoscaler. So if I go to Applications and go to the autoscaler, you'll
+see that we have indeed one task in the GPU queue, and we also see that the `GPU_machines` Running Instances count is
+one as well. So we can follow along with the logs here. It actually detected that there is a task in the GPU queue, and
+it's now spinning up a new GPU machine to run that specific task, and then it will shut it back down again when it's
+done. So this is just one example of how you can use `task.execute_remotely` to very efficiently get your tasks into
+the queue. Actually, it could also be the very first run: if you don't want to use the experiment manager first, you
+don't have to start from a task that is already in the system; you can just call `task.execute_remotely`, and it will
+put it into the system for you and immediately launch it remotely.
+You don't ever have to run anything locally if you don't want to.
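+
+A minimal sketch of that epoch trick, with the training loop stubbed out (the real example script obviously does more,
+so treat the details here as illustrative):
+
+```python
+from clearml import Task
+
+
+def train_one_epoch(epoch: int) -> None:
+    print(f'training epoch {epoch}')  # stand-in for the real training loop
+
+
+task = Task.init(project_name='Project MLOps', task_name='Model Training GPU')
+epochs = 10  # illustrative value
+
+for epoch in range(1, epochs + 1):
+    if epoch > 1:
+        # the first epoch already ran locally, so the setup clearly works;
+        # hand the remaining epochs off to a GPU worker
+        task.execute_remotely(queue_name='GPU Queue')
+    train_one_epoch(epoch)
+```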
+
+So this is essentially what we're talking about when
+I'm talking about orchestration. So the autoscaler, the workers, the queues, the ClearML agent, everything here is
+what we call orchestration. It's essentially saying you should run this then there and everything is managed for you.
+That's the idea here.
+
+But there's also something else that we usually talk about quite a lot, and that is Automations.
+Automations are specifically about automating a lot of manual stuff that you would probably be doing without
+actually noticing that it could be automated. So let me tell you what I mean by that.
+
+Let's go into automation here
+and take, for example, the Task Scheduler. The Task Scheduler is very, very intuitive; you can guess what it does,
+right? A task scheduler will essentially take a specific task and schedule it to run every X amount of time. In this
+case, for example, the MLOps engineer gets called in by Project Team NASA. Project Team NASA, which is really cool, are
+actually creating a model here that is meant to detect whether asteroids are going to be hazardous to Earth or not. So
+they come in, and they have a specific project of their own. If I go into Project Team NASA here, you see that they
+have a bunch of tasks. For example, they get the data and pour that into an asteroid dataset, then they preprocess the
+data and put that preprocessed data in another dataset version, the preprocessed asteroid dataset. And then you have a
+bunch of model training that they do on top of it, in which they have scalars with test and train. You all know the
+drill; this is the kind of thing they're doing. They actually have their feature importances as well, which is really
+cool.
+
+So they call you in as an MLOps engineer, or you are part of the team and you are the designated MLOps engineer. What
+you want to do is the following: if we go to `get data` here and go into its configuration, you see that there is a
+query date, and if we go into the code behind this query date, what we'll see here, this is the original repository of
+the NASA team, is that they actually query a database with a specific query. That query is: select everything from
+asteroids, which is their data, where the date is smaller than the given date. So they have an end date, and everything
+before that given date is the data they want to work with. If we take a look, the query date here is a specific date,
+but it's not today. So essentially what they want to do is rerun this `get data` task every single day, week, or month,
+depending on how quickly they can get their data labeled.
+
+So this could be done manually relatively easily. You could just, every week, click here, and it will put a new entry
+in the experiment list; or you could of course automate it. And that's essentially what we're going to do with the task
+scheduler. So you just get the task scheduler object from the automation module. You set the sync frequency, which
+essentially means that when you change something in the configuration of the task scheduler, it will poll at that
+interval to pick it up, so it can be very low or very high depending on what you want. I want the scheduler itself,
+which, again, everything in ClearML is a task, so the scheduler itself is also a task, to be in Project MLOps, because
+it is the actual scheduler that I want to be taking care of, while the scheduled runs will end up in the original NASA
+project. Also, I want to call it NASA Scheduler because it just sounds cool. Then what we could do is get a task from
+the Project Team NASA project folder, namely the `get data` task. But there is something else that we can do. You could
+easily just clone a task; this is essentially what the task scheduler is doing. If you watched the Getting Started
+videos, you know that we can clone any of the experiments that are in the system, change the parameters, and rerun it.
+So we could clone the `get data` task and then do a task parameter override of the query date with today's date. That's
+very valid, but the NASA team actually made something really cool. If we go to pipelines here, you see that there is a
+NASA pipeline as well, and the NASA pipeline is actually the exact steps that we saw before, but they train three
+different models with three different parameters and then pick the best model from there. And what we see is that the
+query date is actually a parameter of the pipeline as well. If you remember correctly, pipelines are also tasks in
+ClearML, everything is a task, so that means you can also use the task scheduler to schedule a complete pipeline run,
+and then override the parameters of the pipeline run just as easily as you could with any other task. So if I go into
+the full details of this task here, you will see that this is actually the pipeline itself. The pipeline has, just like
+any other task, these different tabs with info, console, scalars, etc., and it has an ID as well. And this ID, if we
+copy it, we can actually use it instead. So let me paste it, it's already there. So the task to schedule is in fact the
+pipeline that we want to schedule. And then if I do `scheduler.add_task`, I take the ID of the task to schedule, which
+is the pipeline in this case, I want to enqueue it in the CPU queue, and I want it to run at 08:30 every Friday. So
+every week at 08:30, the pipeline will be cloned and run using this parameter override. And the parameter override
+essentially says: take the query date, but set it to today's date instead of whatever was there before, and then start
+it remotely. So if I run this, so automation and then task scheduler, it will create a new scheduler task in the
+Project MLOps folder, and then it will start. Because I said execute immediately here, it will immediately clone and
+launch a pipeline right now, and then it will keep doing this every week on Friday at 08:30.
+
+So if we're looking at our projects here, we see that we have Project MLOps. We have the NASA Scheduler, and the NASA
+Scheduler is of course `pending` because it is itself a task that needs to be grabbed by one of the agents. And in this
+case, we can see that our agents are still busy with the different example tasks. So let's just abort one so that we
+can move things along. Let's abort all of the example tasks just so we can get going here. Oh, actually, you can do
+that with multiple at the same time. So you have abort here; if you select a few, that all works quite well.
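+
+Pulling the scheduler pieces described above together, a minimal sketch could look like the following. The pipeline ID
+is a placeholder, the parameter key `Args/query_date` is illustrative, and note that a plain dictionary captures the
+date once at creation time, whereas the setup in the video recomputes it for every run:
+
+```python
+from datetime import date
+from clearml.automation import TaskScheduler
+
+scheduler = TaskScheduler(
+    sync_frequency_minutes=60,
+    force_create_task_project='Project MLOps',
+    force_create_task_name='NASA Scheduler',
+)
+scheduler.add_task(
+    schedule_task_id='<pipeline_task_id>',  # copied from the pipeline's Full Details page
+    queue='CPU Queue',
+    weekdays=['friday'],
+    hour=8,
+    minute=30,
+    task_parameters={'Args/query_date': date.today().isoformat()},  # evaluated once here
+    execute_immediately=True,
+)
+# the scheduler itself is a lightweight task, so run it from the services queue
+scheduler.start_remotely(queue='services')
+```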
+
+So we have our CPU queue here, we have our GPU queue here. There are all our workers. And now we see that the NASA
+scheduler is actually scheduled on Beast 1, so it's currently running. If we go to Project MLOps, we see that our
+scheduler is in fact running. And then if we go to our console, we can follow along with its setup. And then if we go
+into our pipelines, we should be able to see that a new pipeline is being started up by the scheduler. Now there we go.
+It says here: Launching Jobs, Schedule Job, Base Task ID, Base Function, Blah blah blah. Essentially, it's saying I've
+launched the pipeline. So if we go into the NASA pipeline, we should see that in fact, there is now NASA Pipeline 2 that
+is currently running, using today's exact date instead of the earlier date used by the previous version. So this is a
+very easy way of automating or scheduling things: tasks, pipelines, datasets, whatever you want in the ClearML
+ecosystem, you can schedule it.
+
+Then there is a second type of automation that we can do: we have the trigger scheduler as well. The trigger scheduler
+is relatively similar to a task scheduler. The only difference is that with a task scheduler you create or clone a task
+and enqueue it based on a specific schedule, while with a trigger scheduler it is based on a trigger. A trigger could
+be any event in the ClearML ecosystem: a successfully created dataset, a tag added to a dataset, or any other kind of
+event that the ClearML ecosystem can produce; you can use it as the basis to launch a new kind of task.
+
+So if we take a look here, we actually want to create a trigger scheduler for something called Alice-Bob, and I'm going
+to explain what that means. If we go into our Alice project here, Alice is a data scientist on our team, and she
+essentially asked me to help her. She has a bunch of model training tasks here, and she actually uses the stable tag as
+well; we'll come back to that later. Essentially, what she's doing is just training a model based on the dataset of
+another data scientist in another project. If we take a look at that other project, it's called Bob. Bob is the other
+data scientist, the one in charge of producing the dataset that is required. He essentially uses the production tag to
+tell Alice: this is the dataset that you want to use, this is the best so far. He has more recent datasets, but hasn't
+tagged them as production yet because he says they're not ready. So he can just keep experimenting and do all the kinds
+of things that he wants, while Alice keeps firmly using the production version. What Alice is doing is essentially
+querying for the dataset tagged production. But it's annoying, because they're in a different time zone, for example,
+and when Bob publishes a new dataset, Alice has to be notified through a chat application or whatever, and then Alice
+has to re-run everything remotely so that her training uses the latest version. This is not ideal, and we can automate
+all of it. That is essentially what the task trigger is trying to do.
+
+So again, we make a scheduler just like we did with the task scheduler. The pooling frequency and sync frequency in
+minutes again control how often it checks for events and re-reads its configuration. We again put the scheduler itself
+in Project MLOps, and we call it Alice-Bob Trigger instead of the scheduler from before. Then we get the task that we
+actually want to trigger. So not the task that does the triggering, but the task that we want to create when the
+trigger fires, if that makes sense. The actual task that we want to launch is the training job from Alice, so we use
+her project name, let me just check real quick, Project Alice; there was a little mistake there. Then we have the task
+name, Model Training. We want any of the Model Training tasks, and Alice actually uses a stable tag. Like I said
+before, she uses the stable tag to say: this is my latest good version of an experiment. So if there is a new dataset,
+I want to clone, enqueue, and retrain this specific version of my experiment. This also allows Alice to keep
+experimenting without the dread of a new dataset coming in and being retrained on code that is not stable yet. We can
+use the tag for that purpose. So if I go back to the task trigger, essentially what we're going to get here is a task
+from the project Project Alice, with the name Model Training, but also, crucially, with the tag stable. And if there
+are multiple tasks that fit this description, it will just take the latest one, so the latest task with the stable tag
+from this project.
+
+Now we have to add a trigger. You can add a dataset trigger, a task trigger, any kind of trigger that you wish; in this
+case it will be a dataset trigger. If a new dataset that fits this description appears, we want to trigger this task.
+So essentially the scheduled task ID is the task that we want to run when the trigger fires, which in this case is
+`training_task.id`, the Model Training task from Project Alice. We have the schedule queue, because we obviously want
+to schedule it in one of the queues; we can use the CPU queue in this case. Then we can give it a name as well, just to
+make it clear that this training is not a run from Alice herself but an automated training on Bob's new data, so that
+Alice knows it was triggered automatically. And then we can use `trigger_on_tags` to say which tags we should look at
+to actually trigger the trigger. Damn, this is a lot of trigger.
+
+So what happens here is that we look in the Project Bob folder, and if a new production tag is found that wasn't there
+before, we trigger; in this case, that means we create a new task in Project Alice. So if we run this, so automation,
+not the task scheduler but the task trigger, this will again create a new, let's call it orchestration or automation
+task. And these are the kind of tasks that you want in the services queue: tasks that essentially just keep track of
+whether there is a new dataset version or not, which is very lightweight to do. This is typically
+something you want in the services queue.
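+
+A minimal sketch of the Alice-Bob trigger described above; the project, task, queue, and tag names follow the video,
+but the exact construction is an approximation rather than the code shown on screen:
+
+```python
+from clearml import Task
+from clearml.automation import TriggerScheduler
+
+trigger = TriggerScheduler(
+    pooling_frequency_minutes=1,
+    force_create_task_project='Project MLOps',
+    force_create_task_name='Alice-Bob Trigger',
+)
+
+# the latest 'stable' Model Training task from Alice's project
+training_task = Task.get_task(
+    project_name='Project Alice',
+    task_name='Model Training',
+    tags=['stable'],
+)
+
+trigger.add_dataset_trigger(
+    schedule_task_id=training_task.id,  # clone and enqueue this task when triggered
+    schedule_queue='CPU Queue',
+    name='Alice-Bob new data training',
+    trigger_project='Project Bob',      # watch Bob's datasets
+    trigger_on_tags=['production'],     # fire when a new 'production' tag appears
+)
+
+# the trigger itself is lightweight, so run it from the services queue
+trigger.start_remotely(queue='services')
+```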
+
+So we have `Terminating local execution process` again, which means it should now be in Project MLOps right here. We
+see that our NASA Scheduler is running, but the Alice-Bob Trigger is still pending, because obviously we have our
+pipeline running and our workers need to work on that first before they can move on. If we take a look at the queues,
+we're now actually making full use of the resources we have. We see that in the services queue the Alice-Bob Trigger
+was the next experiment, and it has just been picked up. So we should see here that indeed one of the Beast workers has
+picked up the Alice-Bob Trigger, which is essentially what the queues are meant for: we're pushing more into the system
+than it can run at once, so things just wait a little bit until the previous thing has finished. If we take a look at
+our NASA pipeline, we see that it's actually going very well here. These are the kind of tasks that our workers were
+busy with before they picked up the Alice-Bob Trigger.
+
+So now we see that the Alice-Bob Trigger is in fact running. We can take a look at the console, and it tells you that
+okay, everything is installed. It gives you a few error messages, which is usually a good thing here, because it means
+it's actually doing something. It's sleeping until the next poll in one minute, so it's polling every minute. Now we
+should be able to go into Bob's project and say: okay, I want to add a tag here, production. What I just did, in this
+case, is create a new dataset version with this specific tag. I'm Bob in this case, and I'm saying: okay, this example
+dataset, I've tested it, I've checked it, everything seems to be in order, so I'm going to tag it as production. This
+should trigger the task trigger, or the trigger task, to pick that up and then spin up a new Model Training run for
+Alice. Alice will then pull the latest version that fits the production tag, so essentially she will pull this one and
+end up with a new task that is running. If we take a look, it's sleeping for a while, so technically it won't be in
+Alice's project yet; we should wait just a little bit before the Alice-Bob Trigger picks it up. But it shouldn't take
+very long.
+
+So as we can see, the scheduled job, Alice-Bob new data training, has been scheduled on the CPU queue; it has
+essentially figured out that okay, we actually do have a new tag now, so the task is being scheduled. If we take a look
+at Project Alice now, you can see that Model Training is indeed currently running: it's been enqueued, it's running,
+and it's been tagged as Alice-Bob New Data Training. So Alice actually knows that this time the model run was
+automated.
+
+Finally, there are some things that I want to show you that might make your life easier, which, yet again, is the name
+of the game in this video, but they're just a little bit smaller. One of the things that I want to show you is the
+monitor, the ClearML Monitor. It's essentially an object that you can implement and override, and it allows you to take
+a look into the depths of the ClearML ecosystem and what happens there. It can give you an idea of when tasks failed,
+when tasks succeeded, all of the types of events that ClearML can generate for you. One of the things you can do with
+it, and this is part of the examples, it's also in the example repository, is create a Slack bot. Essentially, we've
+just used a bunch of Slack APIs around this monitor, a Slack monitor that we created ourselves, and it will just send
+you a message whenever a task succeeds, fails, whatever you want. In this case it's fully equipped: we added a lot of
+arguments so that you can just use it as a command line tool, but you can create your own script based on your own
+requirements. Now, what it will do, let me show you, and let me make it a little bit bigger, is this: you can see that
+there is a ClearML Alert Bot that you can just add to your Slack if you want, and it will essentially tell you which
+projects and which tasks succeeded or failed. You can say: I want only alerts from this project, only alerts for failed
+tasks, or only alerts for completed tasks, and it will give you a bunch of output as well, which is really, really
+useful to see. So this is just a small extra thing that you can take a look at to get some monitoring as well, so that
+you don't have to wait around or keep checking yourself to see when your tasks finish.
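+
+The ready-made bot lives in the ClearML examples repository. As a rough illustration of the same idea, and not the
+example's actual code, you could also poll for failed tasks yourself and post them to a Slack incoming webhook; the
+webhook URL, project name, and filter below are placeholders:
+
+```python
+import time
+
+import requests  # standard HTTP client, used here to call a Slack incoming webhook
+from clearml import Task
+
+SLACK_WEBHOOK_URL = 'https://hooks.slack.com/services/...'  # placeholder webhook URL
+already_reported = set()
+
+while True:
+    # query failed tasks in a project, newest first
+    failed_tasks = Task.get_tasks(
+        project_name='Project MLOps',
+        task_filter={'status': ['failed'], 'order_by': ['-last_update']},
+    )
+    for failed in failed_tasks:
+        if failed.id in already_reported:
+            continue
+        already_reported.add(failed.id)
+        requests.post(SLACK_WEBHOOK_URL, json={
+            'text': f'Task "{failed.name}" failed: {failed.get_output_log_web_page()}'
+        })
+    time.sleep(60)  # poll every minute
+```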
+
+Another thing that I want to show you is a cool way to get a little bit more of that juicy GPU power. One more way you
+can add agents, next to Kubernetes spinning them up, running a ClearML agent on your own machines, or the autoscaler,
+is Colab. So the runtime was just connected here. Colab is something we all know and all love; it's an easy way to get
+notebooks on a GPU machine, and it's also very easy to get a ClearML agent running on it. This is really, really cool,
+and I personally really like it. So I can say `!clearml-agent daemon --queue "GPU Queue"`, and if I run this, we
+essentially get a free GPU worker. It is currently doing the ClearML agent thing, this is the output of the ClearML
+agent, and if we go into our project here, we can now see that we have a worker on the GPU queue that is essentially
+just Google Colab. So you can spin up a bunch of Google Colabs and run all of your agents there. The only downside is
+that you can't use Docker mode. This means that every single task that is run by this Colab instance is actually going
+to run in the environment of the Colab instance. So if the Colab instance has a different Python version than you, it's
+a bit annoying that you can't spin up a different container, but that's really the only downside. So this is just a
+quick way; the actual notebook you can find on our GitHub. It's just a really cool way to get some extra GPU power as
+well.
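+
+The actual notebook is on the ClearML GitHub; stripped down to its essence, the agent part is roughly these two cells
+(your ClearML credentials still need to be configured in the runtime, for example via environment variables):
+
+```
+!pip install clearml-agent
+!clearml-agent daemon --queue "GPU Queue"
+```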
+
+Now, all of these agents and queues are one thing, and you finally have them set up. Thank you for making it this far.
+We haven't actually even covered everything that ClearML can automate for you. There is HPO, which is hyperparameter
+optimization, and there are pipelines as well, which can chain everything together; you saw a little bit of that when I
+showed you the NASA project, but yeah, we're not there yet. There's also even ClearML Session, which you can use to run
+on a specific remote machine and get a remote interactive Jupyter Notebook instance or even a VS Code instance, so that
+you can code directly on the remote machine. That's also really, really cool, and it's something we're going to cover
+soon, but I think the video is already long enough. So thank you very, very much for watching, and thank you very, very
+much for your attention. Let me know in the comments if you want to see videos on hyperparameter optimization,
+pipelines, and sessions, and don't forget to join our Slack channel if you need any help.
diff --git a/docs/getting_started/video_tutorials/hands-on_mlops_tutorials/ml_ci_cd_using_github_actions_and_clearml.md b/docs/getting_started/video_tutorials/hands-on_mlops_tutorials/ml_ci_cd_using_github_actions_and_clearml.md
index 301a3a55..c775461e 100644
--- a/docs/getting_started/video_tutorials/hands-on_mlops_tutorials/ml_ci_cd_using_github_actions_and_clearml.md
+++ b/docs/getting_started/video_tutorials/hands-on_mlops_tutorials/ml_ci_cd_using_github_actions_and_clearml.md
@@ -1,5 +1,5 @@
---
-title: Machine Learning CI/CD using Github Actions and ClearML
+title: Machine Learning CI/CD using GitHub Actions and ClearML
---
@@ -19,1423 +19,266 @@ title: Machine Learning CI/CD using Github Actions and ClearML
Read the transcript
-hello welcome back to ClearML my name is
-
-Victor and in this video I'll be going
-
-through some cicd tips and tricks you
-
-can do with clearml
-
-for this video I'm going to assume that
-
-you already know about clearml and cicd
-
-in general the cicd stuff will be
-
-relatively easy to understand but if
-
-this is your first time working with
-
-caramel you better check out our getting
-
-started series first now there's three
-
-specific cicd jobs that I want to talk
-
-about in this video that you can
-
-accomplish with clearml the first job
-
-is about visibility imagine I have an
-
-experiment that I am tracking in git
-
-somewhere I open a new PR to add a new
-
-feature and now I want to make sure that
-
-curamal has at least one task in its
-
-database that has been successfully run
-
-using this PR code right to make this
-
-very visible I want to automatically add
-
-the model metrics from that task as a
-
-comment on the open PR
-
-the second job is similar to the first
-
-in the sense that I still want to take
-
-the task that corresponds to the open
-
-PR's code but in this case I want to
-
-make sure that the model residing in
-
-this task is equal or better than the
-
-previous best model in clearml I can
-
-easily keep track of that with tags in
-
-the clearml UI and in this way I can
-
-always guarantee that my main branch
-
-contains the best model finally for the
-
-last job usually I use my local computer
-
-and environment to quickly iterate and
-
-develop my code and then only later I'll
-
-send it to a clearml agent to be
-
-executed remotely and properly trained
-
-on some gpus for example now to make
-
-sure that always works I want to add a
-
-check to my PR that basically checks out
-
-this PR code runs it on a clearml agent
-
-and then listens to it and the moment
-
-that the clearml agents starts spitting
-
-out iterations it means that the whole
-
-setup process was successful and in this
-
-way I can make sure that every single
-
-commit in my main branch is remotely
-
-runnable right so those were the three
-
-jobs that I want to talk about in this
-
-video let's get started so as you can
-
-see I have here my example project uh
-
-with me and there's a few things
-
-immediately apparent so one is we have
-
-the dot GitHub folder with workflows
-
-we're using GitHub actions in this
-
-specific video again you don't have to
-
-use GitHub actions if you don't want to
-
-it's just as an example for General CI
-
-CD stuff then we have a few scripts here
-
-and we have our task as well now I'll
-
-start with the task because that's the
-
-thing we're going to run as the
-
-experiment you want to keep track of in
-
-your git and in clearml and in this case
-
-we'll just take like a dummy task we'll
-
-take a very very simple example here so
-
-we just do import from clearml import
-
-task we do the task that initialize if
-
-you're familiar with clearml this will
-
-be very familiar to you as well it's
-
-just the task dot initialize give it a
-
-project give it a name and then I
-
-basically always set to reuse lost task
-
-ID to false which basically means that
-
-it will never override the previous task
-
-if it didn't complete properly it's more
-
-or less a thing of taste then I set
-
-random C to do something completely
-
-random
-
-and then for in with 10 times basically
-
-we're going to be reporting a scalar
-
-which is called performance metric in
-
-series series one and it will it will
-
-have a random value so in this case it's
-
-super super simple it's just a dummy
-
-task this of course this report scalar
-
-should be your metric your output metric
-
-that you're trying to check could be F1
-
-score could be map whatever
-
-and fix your fancy right
-
-um if I then go to uh clearml itself
-
-let me make this a little bigger for you
-
-if I then do go to clearml itself
-
-you'll see the dummy task right here so
-
-we actually take care of the repository
-
-here we also have the commit ID which
-
-will come in handy later and then we
-
-also have the script path and the
-
-working directory as you might know we
-
-also keep track of any uncommitted
-
-changes so if you add anything in the
-
-code that isn't already tracked by clear
-
-by git
-
-um in itself we also take care of that
-
-but that will come in handy a little bit
-
-later as well we also keep track of
-
-install packages and stuff like that in
-
-this case of course we don't really keep
-
-track of very much it's it's only the
-
-task that in it and then just reporting
-
-some scalars but what we do have is some
-
-scalars so this is what it would look
-
-like and we'll be using this one later
-
-down the line right so if I go back here
-
-to my code you can also see we have a
-
-GitHub folder with the workflow
-
-subfolder in there this basically tells
-
-GitHub that whatever you do a push or
-
-commit or whatever it will check this
-
-yaml file to see if it has to do any
-
-kind of checks right in this case we'll
-
-call it clearml checks and we'll set
-
-the on to pull requests now most of the
-
-time that you're using clearml it's going to
-
-be interesting to do checks on a pull
-
-request because it can take some time
-
-it's machine learning after all but it
-
-highly depends on what you want to do of
-
-course now I'll be setting it to pull
-
-requests specifically to branches main
-
-so if I want to do a pull request to my
-
-main branch I will want those checks
-
-being fired and then I wanted them to be
-
-added to like several different actions
-
-there specifically the edited and opened
-
-are the ones that I'm interested in so
-
-every time I open a PR but also every
-
-time I update a PR like send a new
-
-commit to it it will trigger
-
-and then what do we actually want to
-
-trigger right so this is the meat of the
-
-story this is the jobs in this video
-
-we're going to run three specific jobs
-
-one is Task starts to comment the other
-
-one is compare models and the third one
-
-is test remote runnable now the first
-
-one task starts to come to comment
-
-basically wants to take a task that
-
-corresponds to the code you're trying to
-
-merge and then add a comment on the pr
-
-with the different performance metrics
-
-from clearml so that it's like kind of
-
-neat you can easily see what the task is
-
-doing how good it is stuff like that so
-
-that's what we're going to do first
-
-now how this is built up
-
-um I'll run down this and I will go into
-
-the code later in a second but then to
-
-start with we have the environment
-
-variables now to be sure that the clear
-
-that the GitHub action worker or the
-
-gitlab runner or whatever you're going
-
-to run these actions on has access to
-
-clearml you have to give it the clear
-
-remote credentials right and you can do
-
-that with the environment variable clear
-
-ml API access key and clearml API
-
-secret key these are the kind of these
-
-are the keys you get when you create new
-
-um new credentials in the main UI in the
-
-web UI
-
-uh in this case I'll get them from the
-
-secrets I've added them to GitHub as a
-
-secret and we can gather them from there
-
-same thing with the clearml API host in
-
-our case it will just be app.clear.ml
-
-which is the free tier version
-
-um of caramel you also want a GitHub
-
-token because we want to actually
-
-comment add a comment to a PR right so
-
-we also need to GitHub token which is
-
-very easy to easy to generate I'll put a
-
-link for that down in the description
-
-then we also have the comment commit ID
-
-so specifically we want the pull request
-
-head shot which is the latest commit in
-
-the pull request we're going to do some
-
-things with that we'll run this uh these
-
-this job basically on Ubuntu and then we
-
-have some steps here so first we want to
-
-check out our code which is just the pr
-
-then we want to set up python with 3.10
-
-which depends on on whatever you would
-
-you might be running with and then also
-
-install clearml so we have some
-
-packages here that we want to install in
-
-order to be able to run our code now
-
-most of the time I like to to just have
-
-a very simple job like this that just
-
-uses a python script that does the
-
-actual logic because command line logic
-
-is not very handy to work with so it's
-
-usually easier to just use a python file
-
-like this so we'll be doing python task
-
-starts to comment.pi which will check
-
-out right away
-
-I'll collapse some of these functions
-
-for you because they're not actually
-
-that interesting most of the code here
-
-is not related to clearml specifically
-
-it's mainly related to getting the
-
-comment out out to the PR but in this
-
-case we'll just walk through the if
-
-name.main and we'll go from there so
-
-first off
-
-this is running on a PR right so we want
-
-to say we're running on the commit hash
-
-with the commit hash just so we know and
-
-then we already have our first
-
-interesting function so the first step
-
-that we want to do is to make sure that
-
-we already have a task in clearml
-
-present in clearml that basically runs
-
-the code that wants to be committed
-
-right now so we have to check that the
-
-two are the same right we have a PR
-
-opened right now we have a commit hash
-
-we want to check if that commit hash is
-
-in any of the tasks in clearml so we
-
-can say like this is the code in clear
-
-ml that we want to track right so we
-
-know where to get the statistics
-
-basically I'll check this open so this
-
-is the first cool thing is uh querying a
-
-lot of people don't know that you can
-
-actually use the clearml SDK to just
-
-query the database in clear about so in
-
-this case I'll want to query all of our
-
-tasks with the task filter basically
-
-order it by the latest first then set
-
-the script version number and the script
-
-version number tag or the the key here
-
-actually corresponds here to the commit
-
-ID so we'll basically get this
-
-and I wanted to fit the commit ID that
-
-we get from the pr right so now we've
-
-opened the pr we get the commit ID that
-
-is the latest in this case you'll see
-
-actually here it's uh this one so the
-
-commit ID is the one that we set here as
-
-the pull request head
-
-we get that from the environment here
-
-and pass it through this function and if
-
-we go to this function this commit ID we
-
-basically want to check if this
-
-committed ID is already in a task in
-
-clear amount
-
-and I also want the task to be completed
-
-I don't want any failed tasks here we
-
-just want to make sure that that the
-
-code can run right that it all has
-
-already run in caramel and I also want
-
-the script diff which is the uncommitted
-
-changes as well we'll check that in just
-
-a sec so basically this query will just
-
-return all the tasks that fit these
-
-descriptions basically every single task
-
-that was run on this code base
-
-essentially
-
-but we don't just want the commit ID to
-
-to match we also want to make sure that
-
-there weren't any uncommitted changes so
-
-we make very very sure that the task in
-
-clearml has the exact same code as the pr
-
-we're looking at right now
-
-so we basically check if tasks so if any
-
-tasks were
-
-returned then we can go through them if
-
-none of these tasks have no if no task
-
-was found
-
-so if no task was found we basically
-
-want to raise a value error saying you
-
-at least have to run it once in clearml
-
-with this code base before you can
-
-actually merge it into main seems like a
-
-reasonable request if we actually do
-
-find a task we go for each task in the
-
-task there could be multiple but again
-
-they're sorted on last update remember
-
-so we just can take the first one and
-
-then if not task script.diff basically
-
-if there's not any uncommitted changes
-
-we know the exact code that was used
-
-there then we can just return the task
-
-and that's it so now we have our task
-
-object we know for sure that was run
-
-with the same code AS was done in the pr
-
-and we also know that was completed
-
-successfully so we want to add a tag for
-
-example main branch just in your clear
-
-ml you will be able to see a tag there
-
-main branch
-
-then we also want to get the statistics
-
-right because we still want to log it to
-
-the pr in as part of a comment so if I
-
-go there and open it up we first get the
-
-status of the task just to be sure
-
-remember we queried it on completed but
-
-something else might have happened in
-
-the meantime if the status is not
-
-completed we want to say this is the
-
-status it isn't completed this should
-
-not happen but if it is completed we are
-
-going to create a table with these
-
-functions that I won't go deeper into
-
-basically they format the dictionary of
-
-the state of the task scalars into
-
-um markdown that we can actually use let
-
-me just go into this though one quick
-
-time so we can basically do task dot get
-
-lost scalar metrics and this function is
-
-built into clearml which basically gives
-
-you a dictionary with all the metrics on
-
-your task right we'll just get that
-
-formatted into a table make it into a
-
-pandas data frame and then tabulate it
-
-with this cool package that basic turns
-
-it into markdown
-
-right so now that we have marked down in
-
-the table we then want to return results
-
-table you can view the full task this is
-
-basically the comment content right this
-
-is what we want to be in the comment
-
-that will later end up in the pr if
-
-something else went wrong we want to log
-
-it here
-
-it will also end up in a comment by the
-
-way so then we know that something went
-
-wrong from the pr itself
-
-right so this is what get task stats
-
-returns so basically in stats now we
-
-have our markdown that can be used to
-
-create a GitHub comment and then we have
-
-create stats comment which just uses the
-
-GitHub
-
-API to essentially get the repository
-
-get the full name take your token and
-
-then get the pull request and create the
-
-comment using the project stats that we
-
-gave here now to check if everything is
-
-working we can open a new PR for example
-
-I have I'm here on the branch video
-
-example I'll just add
-
-a small change here just so we know that
-
-everything uh that there is a change and
-
-then we'll change the we'll add a PR so
-
-add PR for video example
-
-there we go let's do that
-
-publish the branch and then I can create
-
-a pull request straight from vs code
-
-because it's an awesome tool
-
-created and now if I go to our PR here
-
-which we can go into oh in
-
-GitHub here you can actually see that
-
-there's a little bubble here it's
-
-already checking everything that it
-
-should so you can see here we have the
-
-add PR for Via example we changed our
-
-tests here
-
-or test five here and you can see all
-
-the checks here so tasks that's the
-
-comment is the one that we're interested
-
-right now it will basically set up
-
-everything install clearml and then run
-
-the task now no task based on this code
-
-was found in clearml right because we
-
-just changed the code
-
-it has an uncommitted change remember so
-
-there is no task in clearml yet with
-
-the change that we just made so in order
-
-to get that running we have to go into
-
-task run this first
-
-with this new PR and now we actually get
-
-get a um a new task right here with the
-
-exact commits in Branch video example
-
-without any uncommitted changes and if
-
-we now rerun our pipeline we should be
-
-good to go
-
-so let me just go there it is almost
-
-done here yep it's done so this should
-
-now be completed there we go
-
-and if I go back to our tests here we
-
-can see that some of them have failed so
-
-let's rerun the failed jobs rerun now
-
-this in this case we actually do or we
-
-should actually find a task in clearml
-
-that has all our code changes and it
-
-should work just like nicely
-
-right we're back this actually worked
-
-totally fine this time um so it actually
-
-only took 25 or 35 seconds depending on
-
-which the tasks you run but task starts
-
-to comment
-
-was successful so this means that if we
-
-now go to the pull request we see our
-
-little checkbox here that all the checks
-
-worked out perfectly fine and if I go in
-
-here you can see that the actual
-
-performance metric of series 1 is right
-
-there so that's really really cool we
-
-just changed it and there's already an
-
-example there right
-
-so that was actually the first one
-
-um task starts to comment which is
-
-really handy you can just slap it on any
-
-task and you'll always get the output
-
-there if you add a new commit to your PR
-
-you'll just get a new comment from these
-
-checks just to be sure that it's always
-
-up to date
-
-right so let's get to the second part we
-
-now have oh these are all the jobs so we
-
-had our task starts to comment what else
-
-might you want to do with uh GitHub CI
-
-CD right another thing you might want to
-
-do is compare models basically compare
-
-the output of the model or like the last
-
-metric that we just pulled
-
-from the current task which is the task
-
-connected to the pr that we want to open
-
-or that we we've just opened and compare
-
-it compare its performance to the
-
-performance of the best model before it
-
-right so we can always know that it's
-
-either equal or better performance than
-
-last commit so if we go to compare
-
-models here and we have our environments
-
-again so this is all the same thing we
-
-run again on Ubuntu 20.04 we check out
-
-the code we set up python we install our
-
-packages and then we run compare
-
-models.py compare models.py is very very
-
-similar it is very simple
-
-so here we say we print running on
-
-Commit hash which we get from the
-
-environment that we just gave to the to
-
-GitHub
-
-and then we run compare and tag task
-
-right so what we want to do is basically
-
-compare and then if it's better
-
-tag it as such right so if I do now
-
-current current task is get clear maltas
-
-from current commit which is basically
-
-the same thing that we used before in
-
-the last check basically it goes again
-
-it goes to clearml to check if there's
-
-already a task that has been run with
-
-this exact same code as in the pr so we
-
-get a task from there which is the
-
-current task and then we want to get the
-
-best task as well so in this case it's
-
-very simple to get it so you just run
-
-get task give the project name to the
-
-project that we want to run in right now
-
-give the task name which will be the
-
-same probably as the one that we're
-
-running now but also with the tags best
-
-performance and then if I go into our
-
-clearml overview here what you'll
-
-get is the best performance here because
-
-our checks already run so you solve the
-
-three checks right before we open the pr
-
-so basically the dummy task here was
-
-found to be the best performance and it
-
-has been tagged but that means that
-
-every single time I open a PR or I
-
-update a PR it will search clearml and
-
-get this dummy task
-
-it will get this one and then we say if
-
-we find the best task if not we'll just
-
-add best performance anyway because
-
-you're the first task in the list you'll
-
-always be getting best performance but
-
-if you're not then we'll get the best
-
-latest metric for example get reported
-
-scalers get performance metrics get
-
-scale get series 1 and get y so the the
-
-why value there so this could basically
-
-be the best or the highest map from a
-
-task or like the highest F1 score from a
-
-task or any some such then you have the
-
-best metric we do the same thing for the
-
-current task as well and then it's
-
-fairly easy we just say hey if the
-
-current metric is larger or equal than
-
-the best metric then this means we're
-
-better or equal we're good to go current
-
-task add tags best performance if not
-
-this means the current metric is worse
-
-and the pr you're trying to merge
-
-actually has worse performance than what
-
-was there before
-
-we at least want to say that but you
-
-could also easily say I want to raise a
-
-value error for example that says must
-
-be better
-
-and then the pipeline will fail which
-
-can allow you to block the pr until it
-
-actually is equal or better right so now
-
-it's time for the third check and the
-
-last one as well this is a little more
-
-complicated so that's why I keep it kept
-
-it for last but it's a really cool one
-
-as well specifically we're going to be
-
-using the remote execution capabilities
-
-of clearml next to the CI CD so
-
-will basically test if whatever you want
-
-to add to the main branch so whatever is
-
-in your PR we want to check if that code
-
-is even remotely runnable using a clearml
-
-agent because most of the time what
-
-you want to be doing is you want to be
-
-running stuff locally and testing
-
-locally and iterating very very fast and
-
-then whenever your code is good to go
-
-you want to check if that actually runs
-
-on a remote machine because that's where
-
-you want to end up doing the real heavy
-
-lifting the real training so the only
-
-thing we want to check is is there
-
-anything missing from the requirements
-
-if there's anything other that might
-
-break if it's going to run on the remote
-
-machine the cool thing about that is
-
-that you know for sure that every commit
-
-on the main branch is also runnable on a
-
-remote machine just to be sure so how
-
-can we do that
-
-we can add again our environment
-
-variables so that our runner has access
-
-to clearml we run on Ubuntu 20.04 we
-
-check out this time we check out
-
-specifically to the branch because
-
-sometimes the agent might have issues
-
-with that so we want to make sure that
-
-we're actually in the headshot
-
-um and then we set up our python
-
-environments again we pip install
-
-clearml and we also add some rib grab
-
-function that we'll just use in just a
-
-second now the first thing we want to do
-
-in this whole pipeline is we want to
-
-start the task remotely
-
-we want to make sure that it doesn't
-
-fail and then we actually want to pull
-
-every so often to to capture if it
-
-starts its iterations if only one
-
-iteration is already reported it means
-
-that the loop the main training Loop
-
-will probably work just fine and we can
-
-quit it there so that's exactly what
-
-we're going to do first step launching
-
-the task so we want to start a task here
-
-we'll give it an ID so that we can
-
-actually use the output of that process
-
-and then there is this small tool that
-
-not a lot of people know about but it's
-
-actually clear my task as a command line
-
-utility and the cool thing about that is
-
-clearml Task allows you to basically
-
-run any kind of GitHub repository
-
-remotely from the get-go so you don't
-
-have to add anything to the code in to
-
-begin with right so in this case this is
-
-perfect because we've just checked out
-
-our code and the only thing we want to
-
-do is throw that to a remote machine and
-
-make sure that it works so what we're
-
-going to do let me just copy paste this
-
-for a second so that I can show you
-
-I'll open my
-
-command line here so what I'll do is
-
-I'll put it into a queue that is
-
-non-existent
-
-so that it will fail but then we'll see
-
-the output just to be sure
-
-and then I'll keep make sure that the
-
-branch is gone here because it's an
-
-interpolated value that we don't have in
-
-this case so if I run this in my GitHub
-
-actions example repository here what I
-
-will do is it will launch the task on a
-
-remote machine using pyraml agent so it
-
-will set up the requirements it will set
-
-up everything and it says new task
-
-created with this ID right of course we
-
-can't actually queue it because the
-
-queue is non-existent but what we want
-
-to do here is we use this command to
-
-actually launch the clear mail task and
-
-then we use rib grab to basically get
-
-this task ID out of the console output
-
-we'll store that into a value GitHub
-
-output that we can access here so we'll
-
-give this task ID that we just started
-
-on the remote machine to this python
-
-file which will check out right now so
-
-it's again very simple so we check the
-
-task status of the first argument which
-
-again will be the task ID we'll check
-
-the task status
-
-we'll get the task itself which is a
-
-task object from clearml we'll start a
-
-timer and then we'll say if the task if
-
-we have a task at all right if it starts
-
-it might have broken somewhere so always
-
-check if the task exists we do we check
-
-for a timeout right for a while so what
-
-we want to do is a while loop where you
-
-say okay whenever uh the the time that
-
-I've been checking has not been longer
-
-than a certain timeout I want to be
-
-pulling the task and making sure that
-
-it's still running right so I get the
-
-task status which hopefully should be
-
-either queued pending in progress or
-
-whatever hopefully not failed of course
-
-but that can always happen so we get a
-
-task status we print some stuff and then
-
-if the task status is skewed which means
-
-that there's tasks in the queue before
-
-it and it can't actually be run yet
-
-because all the agents are currently
-
-working we actually just want to reset
-
-the timer so we reset the start time to
-
-be time.time which basically will not
-
-allow this timeout to be triggered this
-
-is kind of nice because we don't want
-
-the timer to be triggered because it's
-
-waiting in the queue like there's
-
-nothing happening to it so we only want
-
-the timer to be started whenever it's
-
-actually being executed by clearml agent
-
-so we've reseted the timer at some point
-
-the task status will change from queued
-
-to anything else if this task status is
-
-failed or stopped it means we did have
-
-an error which is not ideal which is
-
-exactly what we want to catch in this
-
-case so we'll raise a value error with
-
-saying tiles did not return correctly
-
-check the logs in the web UI you'll see
-
-probably in clearml that the task will
-
-actually have failed and then you can
-
-check and debug there also raising a
-
-value error will actually fail the
-
-pipeline as well which is exactly what
-
-we want we don't want this PR to go
-
-through if the pipeline fails because of
-
-a task that can be run remotely this is
-
-exactly what we want to catch
-
-but if the task status is in progress we
-
-go into a next Loop in which we say okay
-
-if the task get lost iteration is larger
-
-than zero basically if we get only one
-
-iteration it means that the whole setups
-
-process was successful the model is
-
-training and we're good to go
-
-so in that case we just clean up we've
-
-just checked everything is good so we
-
-set the task as Mark stopped we set the
-
-task as set archived and we return true
-
-which basically says get the task out of
-
-the way it shouldn't be in the project
-
-anymore we just checked everything works
-
-get it out of my site right so that was
-
-the last of the three checks that I
-
-wanted to cover today I hope you found
-
-this interesting I mean if we go back to
-
-rpr here it's really nice to see all of
-
-these checks coming back green it's very
-
-easy to just use the clearml API and
-
-even clearml task for example to
-
-launch stuff remotely it's not that far
-
-of a fetch either to just think why not
-
-use clearml agent as for example a test
-
-bed for GPU tests right so you could
-
-very easily add things to the queue for
-
-the agent to work on and then just pull
-
-its performance in this way or like pull
-
-its status in this very way so you could
-
-actually run tests that are supposed to
-
-be run on GPU machines this way because
-
-GitHub doesn't automatically or just
-
-isn't out of the books allow you to run
-
-on GPU workers so it's just one of the
-
-very many ways that you can use clearml
-
-to do these kind of things and I hope
-
-you learned something valuable today all
-
-of the code that you saw in this example
-
-will be available in the from a link in
-
-the description and if you need any help
-
-follow us on slack or like join our
-
-slack Channel we're always there always
-
-happy to help and yeah thank you for
-
-watching
+Hello, welcome back to ClearML. My name is Victor, and in this video I'll be going through some CI/CD tips and tricks
+you can do with ClearML. For this video, I'm going to assume that you already know about ClearML and CI/CD.
+In general, the CI/CD stuff will be relatively easy to understand, but if this is your first time working with ClearML,
+you'd better check out our Getting Started series first.
+
+Now, there are three specific CI/CD jobs that I want to talk about in this video that you can accomplish with ClearML.
+
+The first job is about visibility. Imagine I have an experiment that I am tracking in Git somewhere. I open a new PR to
+add a new feature, and now I want to make sure that ClearML has at least one task in its database that has been
+successfully run using this PR code. To make this very visible, I want to automatically add the model metrics from that
+task as a comment on the open PR.
+
+The second job is similar to the first in the sense that I still want to take the task that corresponds to the open PR's
+code, but in this case I want to make sure that the model residing in this task is equal or better than the previous
+best model in ClearML. I can easily keep track of that with tags in the ClearML UI, and in this way I can always
+guarantee that my main branch contains the best model.
+
+Finally, for the last job, usually I use my local computer and environment to quickly iterate and develop my code, and
+then, only later, I'll send it to a ClearML Agent to be executed remotely and properly trained on some GPUs for example.
+Now, to make sure that always works, I want to add a check to my PR that basically checks out this PR code, runs it on
+a ClearML Agent, and then listens to it. The moment the ClearML Agent starts spitting out iterations, it means that the
+whole setup process was successful, and in this way I can make sure that every single commit in my main branch is
+remotely runnable.
+
+So those were the three jobs that I want to talk about in this video. Let's get started.
+
+So as you can see, I have my example project here, and there are a few things immediately apparent. One is that we
+have the `.github` folder with workflows. We're using GitHub Actions in this specific video; again, you don't have to
+use GitHub Actions if you don't want to, it's just an example for general CI/CD stuff. Then we have a few scripts here,
+and we have our task as well.
+
+Now, I'll start with the task, because that's the thing we're going to run as the experiment you want to keep track of
+in your Git and in ClearML, and in this case we'll just take a dummy task. We'll take a very, very simple example
+here, so we just do `from clearml import Task`. If you're familiar with ClearML, this will be very familiar to you as
+well. It's just the `Task.init`: give it a project, give it a name, and then I basically always set `reuse_last_task_id`
+to `False`, which basically means that it will never overwrite the previous task if that one didn't complete properly.
+It's more or less a matter of taste. Then I set the random seed to get something completely random. Then, 10 times,
+we're going to be reporting a scalar titled "Performance Metric" with series "Series 1", and it will have a random
+value, so in this case it's super, super simple, it's just a dummy task. This reported scalar should be the output
+metric that you're trying to check: it could be an F1 score, it could be mAP, whatever takes your fancy.
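+
+To make that concrete, here is a minimal sketch of what such a dummy task could look like; the project name is a
+hypothetical placeholder, while the metric title and series match the ones referenced later in this example.
+
+```python
+from random import random, seed
+
+from clearml import Task
+
+# Register this run in ClearML; reuse_last_task_id=False means every run creates a new
+# task instead of overwriting a previous one that didn't complete properly.
+task = Task.init(
+    project_name="CICD example",  # hypothetical project name
+    task_name="dummy task",
+    reuse_last_task_id=False,
+)
+
+seed()  # seed the RNG, so the reported values are "completely random"
+logger = task.get_logger()
+
+for iteration in range(10):
+    # Report a dummy output metric; in a real project this would be e.g. F1 or mAP.
+    logger.report_scalar(
+        title="Performance Metric",
+        series="Series 1",
+        value=random(),
+        iteration=iteration,
+    )
+```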
+
+If I then go to ClearML itself, you'll see the dummy task right here, so we actually take care of the repository here, we
+also have the commit ID, which will come in handy later, and then we also have the script path, and the working
+directory. As you might know, we also keep track of any uncommitted changes, so if you add anything in the code that
+isn't already tracked by Git, we also take care of that. But that will come in handy a little bit later as well. We also
+keep track of installed packages and stuff like that. In this case, of course, we don't really keep track of very much,
+it's only the `task.init` and then just reporting some scalars.
+
+What we do have is some scalars, so this is what it would look like, and we'll be using this one later down the line.
+Right, so if I go back here to my code, you can also see we have a `.github` folder with a `workflows` sub-folder in
+there. This basically tells GitHub that whatever you do, a push or a commit or whatever, it will check this `yaml` file
+to see if it has to run any kind of checks. In this case, we'll call it "ClearML checks," and we'll set it to run on
+pull requests. Now, most of the time that you're using ClearML, it's going to be interesting to do checks on a pull
+request, because they can take some time. It's machine learning after all, but it highly depends on what you want to
+do, of course. I'll be setting it to pull requests specifically to branch `main`. So if I open a pull request to my
+`main` branch, I want those checks to be fired, and I want them triggered by several different actions; specifically,
+`edited` and `opened` are the ones that I'm interested in. So every time I open a PR, but also every time I update a
+PR, like send a new commit to it, it will trigger. And then, what do we actually want to trigger? That is the meat of
+the story: the jobs.
+
+In this video, we're going to run three specific jobs. One is `task-stats-to-comment`, the other one is `compare-models`,
+and the third one is `test-remote-runnable`.
+
+Now, the first one, `task-stats-to-comment`, basically wants to take a task that corresponds to the code you're
+trying to merge, and then add a comment on the PR with the different performance metrics from ClearML. That's
+kind of neat; you can easily see what the task is doing, how good it is, stuff like that. So that's what we're going
+to do first.
+
+Now, how is this built up? I'll run down this and go into the code in a second, but to start with, we
+have the environment variables. To be sure that the GitHub Actions worker, or the GitLab runner, or whatever you're
+going to run these actions on, has access to ClearML, you have to give it the ClearML credentials. You can do that with
+the environment variables `CLEARML_API_ACCESS_KEY` and `CLEARML_API_SECRET_KEY`; these are the keys you get when you
+create new credentials in the UI. In this case I'll get them from the secrets; I've added them to GitHub as
+secrets, and we can gather them from there. Same thing with the ClearML API host; in our case it will just be
+`app.clear.ml`, which is the free tier version of ClearML. We also need a GitHub token, because we want to actually
+add a comment to a PR, and that token is very easy to generate; I'll put a link for that down
+in the description. Then we also have the commit ID. Specifically, we want the pull request head SHA, which
+is the latest commit in the pull request. We're going to do some things with that.
+
+We'll run this job basically on Ubuntu, and then we have some steps here. First, we want to check out our code, which
+is just the PR. Then we want to set up Python 3.10, which depends on whatever you might be running with, and then
+also install ClearML. So we have some packages here that we want to install in order to be able to run our code. Now,
+most of the time I like to just have a very simple job like this that just uses a Python script that does the
+actual logic, because command line logic is not very handy to work with, so it's usually easier to just use a Python
+file, like this one: we'll be running `python task_stats_to_comment.py`, which we'll check out right away.
+
+I'll collapse some of these functions for you because they're not actually that interesting. Most of the code here
+is not related to ClearML specifically, it's mainly related to getting the comment out to the PR, but in this
+case we'll just walk through the `if __name__ == '__main__'` block and we'll go from there.
+
+So first off, this is running on a PR, so we start by printing that we're running on the commit hash, together with the
+commit hash itself, just so we know, and then we already have our first interesting function. The first step is to make
+sure that we already have a task present in ClearML that runs the exact code that is about to be committed, so we have
+to check that the two are the same. We have a PR opened right now, and we have a commit hash. We want to check if that
+commit hash is in any of the tasks in ClearML, so we can say: this is the code in ClearML that we want to track, and
+this is where we'll get the statistics from. I'll open this up, because this is the first cool bit: querying. A lot of
+people don't know that you can actually use the ClearML SDK to query the database in ClearML. In this case I want to
+query all of our tasks with a task filter: order them latest first, and filter on the script version number, a field
+that actually corresponds to the commit ID. I want it to match the commit ID that we get from the PR. So now we've
+opened the PR, we get the latest commit ID. In this case you'll see it's this one, so the commit ID is the one that we
+set here as the pull request head. We get that from the environment and pass it through this function, and in this
+function we basically want to check if this commit ID is already in a task in ClearML. I also want the task to be
+completed, I don't want any failed tasks here; we just want to make sure that the code has already run successfully in
+ClearML. And I also want the script diff, which is the uncommitted changes, as well. We'll check that in just a sec. So
+basically this query will return all the tasks that fit these descriptions; essentially every single task that was run
+on this code base.
+
+But we don't just want the commit ID to match, we also want to make sure that there weren't any uncommitted changes.
+That way, we make very, very sure that the task in ClearML has the exact same code as the PR we're looking at right now.
+So we basically check if any tasks were returned, and then we can go through them. If no task was found, we
+want to raise a value error saying you have to run it at least once in ClearML with this code base before you can
+actually merge it into `main`. Seems like a reasonable request. If we actually do find tasks, we go over each task in
+the list, because there could be multiple, but again they're sorted on last update, so we can just take the first one
+for which `task['script.diff']` is empty. Basically, if there aren't any uncommitted changes, we know the exact code
+that was used there, so we can just return that task, and that's it.
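+
+Based on that description, a minimal sketch of `get_clearml_task_from_current_commit` could look like the following.
+The error message, the environment variable name, and the choice to filter the returned tasks in Python (rather than
+pushing the commit filter into the query itself) are my assumptions; `Task.query_tasks` and the `script.diff` check
+are what is being described above.
+
+```python
+import os
+
+from clearml import Task
+
+
+def get_clearml_task_from_current_commit(commit_id):
+    """Return a completed ClearML task whose code matches this exact commit, or raise."""
+    # Query the ClearML database: completed tasks only, newest first, and also return
+    # each task's commit hash and uncommitted changes ("diff").
+    tasks = Task.query_tasks(
+        task_filter={"status": ["completed"], "order_by": ["-last_update"]},
+        additional_return_fields=["script.version_num", "script.diff"],
+    )
+    for task in tasks:
+        # Only accept a task that ran this exact commit without any uncommitted changes.
+        if task.get("script.version_num") == commit_id and not task.get("script.diff"):
+            return Task.get_task(task_id=task["id"])
+    raise ValueError(
+        "No completed ClearML task found for this commit. Run the experiment at least "
+        "once with this exact code before merging it into main."
+    )
+
+
+if __name__ == "__main__":
+    # The commit hash is assumed to arrive through an environment variable set by the workflow.
+    task = get_clearml_task_from_current_commit(os.environ["COMMIT_ID"])
+    print(f"Found matching task: {task.id}")
+```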
+
+So now we have our task object. We know for sure that it was run with the same code as in the PR, and we also know
+that it was completed successfully. So we want to add a tag, for example `main_branch`, so that in the ClearML UI you
+will be able to see the `main_branch` tag on it.
+
+Then we also want to get the statistics, because we still want to log them to the PR as part of a comment. So if I go
+there and open it up, we first get the status of the task, just to be sure. Remember, we queried it on `completed`, but
+something else might have happened in the meantime. If the status is not `completed`, we want to report that status;
+it should not happen, but we log it anyway. If it is completed, we are going to create a table with these
+functions that I won't go deeper into. Basically, they format the dictionary of the task's scalars into
+Markdown that we can actually use. Let me just go into this one quick time. We can basically call `task.get_last_scalar_metrics()`,
+which is a function built into ClearML that gives you a dictionary with all the metrics on your task.
+We'll get that formatted into a table: make it into a pandas DataFrame, and then tabulate it with this cool package
+that turns it into Markdown. Now that we have the table in Markdown, we return the results table together with
+a "You can view the full task here" link. This is basically the content we want to be in the comment that will later
+end up in the PR. If something else went wrong, we want to log it here. That will also end up in a comment, by the way,
+so then we know from the PR itself that something went wrong.
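+
+As a minimal sketch, assuming the `tabulate` package for the Markdown conversion and keeping only the last reported
+value per series, `get_task_stats` boils down to something like this:
+
+```python
+import pandas as pd
+from clearml import Task
+from tabulate import tabulate
+
+
+def get_task_stats(task: Task) -> str:
+    """Turn the task's latest scalar values into a Markdown comment body."""
+    # We queried on status 'completed', but double-check: something may have changed since.
+    if str(task.get_status()) != "completed":
+        return f"Task is in `{task.get_status()}` status, this should not happen!"
+
+    # Built-in ClearML call: a nested dict of {title: {series: {'last': ..., 'min': ..., 'max': ...}}}
+    scalars = task.get_last_scalar_metrics()
+    rows = [
+        (title, series, values.get("last"))
+        for title, series_dict in scalars.items()
+        for series, values in series_dict.items()
+    ]
+    # Make a pandas DataFrame out of it and turn it into a GitHub-flavoured Markdown table.
+    table = tabulate(
+        pd.DataFrame(rows, columns=["Metric", "Series", "Last value"]),
+        headers="keys",
+        tablefmt="github",
+        showindex=False,
+    )
+    # Link back to the task in the ClearML web UI so reviewers can inspect the full run.
+    return f"{table}\n\nYou can view the full task [here]({task.get_output_log_web_page()})"
+```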
+
+So this is what `get_task_stats` returns. So basically, in `stats` we now have our Markdown that can be used to
+create a GitHub comment, and then we have `create_stats_comment`, which just uses the GitHub API to essentially get the
+repository, get the full name, take your token, and then get the pull request and create the
+comment using the project stats that we gave here. Now to check if everything is working, we can open a new PR. For
+example, I'm here on the branch `video_example`, I'll just add a small change here just so we know that there is a
+change, and then we'll add a PR. So "add PR for video example," there we go, let's do that. Publish the branch, and
+then I can create a pull request straight from VS Code, because it's an awesome tool. Created, and now if I go to our
+PR here, which we can go into in GitHub here, you can actually see that there's a little bubble here. It's
+already checking everything that it should, so you can see here we have the "add PR for video example," and you can see
+all the checks here. So `task-stats-to-comment` is the one that we're interested in right now. It will basically set up
+everything, install ClearML, and then run the check. Now, no task based on this code was found in ClearML, because we
+just changed the code; it has an uncommitted change, remember? So there is no task in ClearML yet with the change that
+we just made. In order to get that running, we have to go into the task and run it first with this new PR, and now we
+actually get a new task right here with the exact commit in branch `video_example`, without any uncommitted changes,
+and if we now rerun our pipeline we should be good to go. So let me just go there; it is almost done. Yep, it's done,
+so this should now be completed. And if I go back to our checks here, we can see that some of them have failed, so let's
+rerun the failed jobs. Now, in this case we should actually find a task in ClearML that has all our
+code changes, and it should work just nicely.
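+
+As an aside, the `create_stats_comment` helper mentioned above could be sketched with the PyGithub package, which is an
+assumption on my part (any GitHub API client would do); `GITHUB_REPOSITORY` and `GITHUB_REF` are exposed by GitHub
+Actions by default, and the token is whatever the workflow passes in.
+
+```python
+import os
+
+from github import Github  # PyGithub
+
+
+def create_stats_comment(project_stats: str):
+    """Post the Markdown stats table as a comment on the open PR."""
+    gh = Github(os.environ["GITHUB_TOKEN"])
+    repo = gh.get_repo(os.environ["GITHUB_REPOSITORY"])  # e.g. "my-org/my-repo"
+    # For pull_request events, GITHUB_REF looks like "refs/pull/<number>/merge".
+    pr_number = int(os.environ["GITHUB_REF"].split("/")[2])
+    pull_request = repo.get_pull(pr_number)
+    pull_request.create_issue_comment(project_stats)
+```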
+
+Right, we're back. This actually worked totally fine this time. So it actually only took 25 or 35 seconds, depending on
+which of the tasks you run, but `task_stats_to_comment` was successful, so this means that if we now go to the pull
+request, we see our little checkmark here showing that all the checks worked out perfectly fine, and if I go in
+here, you can see that the actual Performance Metric of Series 1 is right there, so that's really, really cool. We
+just changed it and there's already an example there. So that was actually the first one, `task_stats_to_comment`, which
+is really handy. You can just slap it on any task, and you'll always get the output there. If you add a new commit to
+your PR, you'll just get a new comment from these checks, so it's always up-to-date.
+
+So let's get to the second part. So we had our `task_stats_to_comment`, what else might you want to do with GitHub CI/CD?
+Another thing you might want to do is compare models, basically compare the output of the model or like the last metric
+that we just pulled from the current task, which is the task connected to the PR that we want to open, or that we've
+just opened, and compare its performance to the performance of the best model before it. So we can always know that it's
+either equal or better performance than the last commit. So if we go to `compare-models` here, we have our environment
+variables again, so this is all the same thing. We run again on Ubuntu 20.04, we check out the code, we set up Python,
+we install our packages, and then we run `compare_models.py`. `compare_models.py` is very, very similar. It is very
+simple. Here we print "running on commit hash," with the hash that we get from the environment variable that we just
+set up in GitHub, and then we run `compare_and_tag_task`. What we want to do is basically compare, and then, if it's
+better, tag it as such. So now `current_task` comes from `get_clearml_task_from_current_commit`, which is basically the same thing
+that we used before in the last check. Basically, it goes to ClearML to check if there's already a task that has been
+run with this exact same code as in the PR. So we get a task from there, which is the current task, and then we want to
+get the best task as well. So in this case it's very simple to get it, so you just run `get_task`, give the project name
+to the project that we want to run in right now, give the task name which will be the same probably as the one that
+we're running now, but also with the tag `Best Performance`. If I go into our ClearML overview here, you'll see the
+`Best Performance` tag here, because our checks have already run; you saw the three checks right before we opened the
+PR. Basically, the dummy task here was found to be the best performance, and it has been tagged as such. That means
+that every single time I open a PR or update a PR, it will search ClearML and get this dummy task. Then we check
+whether we actually found a best task; if not, we just add the `Best Performance` tag anyway, because if you're the
+first task in the list, you'll always be the best performance. If we did find one, we get its best latest metric, for
+example `get_reported_scalars().get('Performance Metric').get('Series 1').get('y')`, so the `y` value there. This could
+basically be the highest mAP from a task, or the highest F1 score from a task, or some such. Then you have the best
+metric. We do the same thing for the current task as well, and then it's fairly easy. We just say: if the current
+metric is larger than or equal to the best metric, this means we're better or equal and we're good to go, so
+`current_task.add_tags("Best Performance")`. If not, this means the current metric is worse and the PR you're
+trying to merge actually has worse performance than what was there before. We at least want to say that, but you could
+also easily say I want to raise a value error, for example, that says that it must be better and then the pipeline will
+fail, which can allow you to block the PR until it actually is equal or better.
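+
+Putting that together, a minimal sketch of `compare_and_tag_task` could look like this; reusing the query helper from
+the previous script, deriving the project and task names from the current task, and taking the maximum of the reported
+series as "the best metric" are assumptions on my part.
+
+```python
+import sys
+
+from clearml import Task
+
+# Hypothetical import: the same helper used by the previous check.
+from task_stats_to_comment import get_clearml_task_from_current_commit
+
+
+def compare_and_tag_task(commit_hash: str):
+    """Compare the current task's metric to the best one so far and tag accordingly."""
+    current_task = get_clearml_task_from_current_commit(commit_hash)
+    try:
+        # Look for the task currently holding the "Best Performance" tag.
+        best_task = Task.get_task(
+            project_name=current_task.get_project_name(),
+            task_name=current_task.name,
+            tags=["Best Performance"],
+        )
+    except Exception:
+        best_task = None  # defensive: no previously tagged task was found
+
+    if not best_task:
+        # First task ever: it is the best by definition.
+        current_task.add_tags(["Best Performance"])
+        return
+
+    def highest_metric(task):
+        # The y-values of the reported scalar; take the highest one as "the best metric".
+        return max(task.get_reported_scalars().get("Performance Metric").get("Series 1").get("y"))
+
+    best_metric, current_metric = highest_metric(best_task), highest_metric(current_task)
+    print(f"Best metric so far: {best_metric}, current metric: {current_metric}")
+    if current_metric >= best_metric:
+        # Equal or better than the previous best: take over the tag.
+        current_task.add_tags(["Best Performance"])
+    else:
+        # Optionally raise ValueError here to fail the pipeline and block the PR.
+        print("Current metric is worse than the best model, not tagging.")
+
+
+if __name__ == "__main__":
+    print(f"Running on commit hash: {sys.argv[1]}")
+    compare_and_tag_task(sys.argv[1])
+```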
+
+So now it's time for the third check, and the last one as well. This is a little more complicated so that's why I
+kept it for last, but it's a really cool one as well. Specifically, we're going to be using the remote execution
+capabilities of ClearML next to the CI/CD. So basically, for whatever you want to add to the `main` branch, so
+whatever is in your PR, we want to check if that code is even remotely runnable using a ClearML Agent, because most of
+the time what you want to be doing is you want to be running stuff locally and testing locally and iterating very, very
+fast, and then whenever your code is good to go, you want to check if that actually runs on a remote machine because
+that's where you want to end up doing the real heavy lifting, the real training. So, the only thing we want to check is
+if there's anything missing from the requirements, if there's anything else that might break if it's going to run on the
+remote machine. The cool thing about that is that you know for sure that every commit on the `main` branch is also
+runnable on a remote machine, just to be sure.
+
+So how can we do that? We again add our environment variables so that our runner has access to ClearML. We run on
+Ubuntu 20.04, and this time we check out the branch specifically, because sometimes the agent might have issues with
+that, so we want to make sure that we're actually at the head SHA. Then we set up our Python environment again, we
+pip install ClearML, and we also add `ripgrep`, which we'll use in just a second. Now, the first thing
+we want to do in this whole pipeline is we want to start the task remotely, we want to make sure that it doesn't fail,
+and then we actually want to poll every so often to capture whether it starts its iterations. If even one iteration is
+already reported, it means that the loop, the main training loop, will probably work just fine, and we can quit it there.
+So that's exactly what we're going to do.
+
+First step: launching the task. So we want to start a task here. We'll give it an ID so that we can actually use the
+output of that process and then there is this small tool that not a lot of people know about, but it's actually `clearml-task`
+as a command line utility, and the cool thing about that is `clearml-task` allows you to basically run any kind of
+GitHub repository remotely from the get-go. So you don't have to add anything to the code to begin with. So in this case,
+this is perfect because we've just checked out our code, and the only thing we want to do is throw that to a remote
+machine and make sure that it works. To show what we're going to do, I'll open my command line here. What I'll do is
+put it into a queue that is non-existent, so that it will fail but we'll still see the output just to be sure, and
+I'll make sure that the branch is removed here, because it's an interpolated value that we don't have in this case. So
+if I run this in my GitHub actions example repository here, what it will do is it will launch the task on a
+remote machine using a ClearML Agent, so it will set up the requirements, it will set up everything, and it says new
+task created with this ID. Of course, we can't actually queue it because the queue is non-existent, but what we want to
+do here is use this command to actually launch the ClearML task, and then we use `ripgrep` to basically get this task ID
+out of the console output. We'll store that into a GitHub output value that we can access here, so we'll give this task
+ID that we just started on the remote machine to this Python file, which we'll check out right now. So it's again very
+simple: we check the task status of the first argument, which again will be the task ID. We'll get the task itself,
+which is a Task object from ClearML, we'll start a timer, and then, if the task exists at all, we check for a timeout.
+So what we want to do is a `while` loop that says: as long as the time I've been checking hasn't exceeded a certain
+timeout, I want to keep polling the task and making sure that it's still running. So I get the task status, which
+hopefully should be either `queued`, `pending`, `in_progress`, or whatever, hopefully not `failed` of course, but that
+can always happen. So we get the task status, we print some stuff, and then, if the task status is `queued`, which means
+that there are tasks in the queue before it and it can't actually be run yet because all the agents are currently
+working, we actually just want to reset the timer. So we reset the start time to be `time.time()`, which basically will
+not allow this timeout to be triggered. This is kind of
+nice because we don't want the timer to be triggered because it's waiting in the queue, like there's nothing happening
+to it, so we only want the timer to be started whenever it's actually being executed by ClearML agent. So we've reset
+the timer. At some point the task status will change from `queued` to anything else. If this task status is `failed` or
+`stopped`, it means we did have an error, which is not ideal but is exactly what we want to catch in this case, so
+we'll raise a value error saying "Task did not return correctly, check the logs in the web UI." You'll probably see in
+ClearML that the task will actually have failed, and then you can check and debug there. Also raising a value error
+will actually fail the pipeline as well, which is exactly what we want: we don't want this PR to go through if its
+task can't be run remotely, and this is exactly what we want to catch.
+
+But if the task status is `in_progress`, we go into the next loop, in which we say: if the task's `get_last_iteration()`
+is larger than zero, basically if we get even one iteration, it means that the whole setup process was successful, the
+model is training, and we're good to go. So in that case we just clean up; we've just checked that everything is good,
+so we call `mark_stopped` on the task, we call `set_archived`, and we return `True`, which basically says: get the task
+out of the way, it shouldn't be in the project anymore. We just checked that everything works, get it out of my sight.
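+
+A minimal sketch of that polling script, assuming a 20-minute timeout and a 30-second poll interval, could look like
+this:
+
+```python
+import sys
+import time
+
+from clearml import Task
+
+TIMEOUT = 20 * 60   # only counts time spent actually running, not waiting in the queue
+POLL_INTERVAL = 30  # seconds between status checks
+
+
+def check_remote_run(task_id: str) -> bool:
+    task = Task.get_task(task_id=task_id)
+    if not task:
+        # The launch might have broken somewhere, so always check that the task exists.
+        raise ValueError(f"No task found with ID {task_id}")
+
+    start_time = time.time()
+    while time.time() - start_time < TIMEOUT:
+        task.reload()
+        status = str(task.get_status())
+        print(f"Task status: {status}")
+
+        if status == "queued":
+            # Still waiting for a free agent: don't let the timeout tick while queued.
+            start_time = time.time()
+        elif status in ("failed", "stopped"):
+            raise ValueError("Task did not return correctly, check the logs in the web UI")
+        elif status == "in_progress" and task.get_last_iteration() > 0:
+            # At least one iteration was reported, so the remote setup works. Clean up.
+            task.mark_stopped()
+            task.set_archived(True)
+            return True
+        elif status == "completed":
+            # The task even ran to completion remotely; just archive it and report success.
+            task.set_archived(True)
+            return True
+        time.sleep(POLL_INTERVAL)
+
+    raise ValueError("Timed out waiting for the remote task to start iterating")
+
+
+if __name__ == "__main__":
+    # The first argument is the task ID extracted from the `clearml-task` console output.
+    check_remote_run(sys.argv[1])
+```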
+
+So that was the last of the three checks that I wanted to cover today. I hope you found this interesting. I mean, if we
+go back to the PR here, it's really nice to see all of these checks coming back green. It's very easy to just use the
+ClearML API, and even `clearml-task` for example, to launch stuff remotely. It's not that far-fetched either to
+think: why not use ClearML Agent as, for example, a test bed for GPU tests? You could very easily add things to the
+queue for the agent to work on and then just poll its performance in this way, or poll its status in this very way. So
+you could actually run tests that are supposed to be run on GPU machines this way, because GitHub doesn't
+out-of-the-box allow you to run on GPU workers.
+
+So this is just one of the very many ways that you can use ClearML to do
+these kinds of things, and I hope you learned something valuable today. All of the code that you saw in this example
+will be available via the link in the description, and if you need any help, join our Slack channel; we're always there,
+always happy to help. Thank you for watching.
diff --git a/sidebars.js b/sidebars.js
index 69f30e29..d2c8ccd7 100644
--- a/sidebars.js
+++ b/sidebars.js
@@ -27,9 +27,9 @@ module.exports = {
'getting_started/video_tutorials/hyperdatasets_data_versioning',
{
'Hands-on MLOps Tutorials':[
- 'getting_started/video_tutorials/hands-on_mlops_tutorials/how_clearml_is_used_by_a_data_scientist.md',
- 'getting_started/video_tutorials/hands-on_mlops_tutorials/how_clearml_is_used_by_an_mlops_engineer.md',
- 'getting_started/video_tutorials/hands-on_mlops_tutorials/ml_ci_cd_using_github_actions_and_clearml.md'
+ 'getting_started/video_tutorials/hands-on_mlops_tutorials/how_clearml_is_used_by_a_data_scientist',
+ 'getting_started/video_tutorials/hands-on_mlops_tutorials/how_clearml_is_used_by_an_mlops_engineer',
+ 'getting_started/video_tutorials/hands-on_mlops_tutorials/ml_ci_cd_using_github_actions_and_clearml'
]
}
]}]},