{
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# HuggingFace Transformers\n",
        "\n",
        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/allegroai/clearml/blob/master/examples/frameworks/huggingface/transformers.ipynb)\n",
        "\n",
        "HuggingFace's [Transformers](https://huggingface.co/docs/transformers/index) is a popular deep learning framework. You can seamlessly integrate ClearML into your Transformers PyTorch `Trainer` code using the built-in [ClearMLCallback](https://huggingface.co/docs/transformers/v4.34.1/en/main_classes/callback#transformers.integrations.ClearMLCallback). ClearML automatically logs Transformers models, parameters, scalars, and more."
      ],
      "metadata": {
        "id": "jF2e1XVCazr3"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Set up ClearML\n",
        "1. To keep track of your experiments and/or data, ClearML needs to communicate with a server. You have two server options:\n",
        "   * Sign up for free to the [ClearML Hosted Service](https://app.clear.ml/)\n",
        "   * Set up your own server; see [here](https://clear.ml/docs/latest/docs/deploying_clearml/clearml_server).\n",
        "1. Add your ClearML credentials below. ClearML will use these credentials to connect to your server (see instructions for generating credentials [here](https://clear.ml/docs/latest/docs/getting_started/ds/ds_first_steps/#jupyter-notebook)).\n"
      ],
      "metadata": {
        "id": "hkRlrlpoKu7X"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "1F-V3rDzDPKj"
      },
      "outputs": [],
      "source": [
        "# ClearML credentials\n",
        "%env CLEARML_WEB_HOST=\n",
        "%env CLEARML_API_HOST=\n",
        "%env CLEARML_FILES_HOST=\n",
        "%env CLEARML_API_ACCESS_KEY=\n",
        "%env CLEARML_API_SECRET_KEY="
      ]
    },
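    {
      "cell_type": "markdown",
      "source": [
        "Alternatively, you can set the same credentials programmatically through the ClearML SDK instead of environment variables. The commented-out cell below is a minimal sketch using `Task.set_credentials()`; if you take this route, run it after installing `clearml` in the next section, and replace the placeholder hosts and keys with your own values."
      ],
      "metadata": {
        "id": "sdk-credentials-note"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Optional alternative (sketch): set credentials via the SDK instead of %env.\n",
        "# Uncomment and fill in your own values after installing clearml below.\n",
        "# from clearml import Task\n",
        "#\n",
        "# Task.set_credentials(\n",
        "#     web_host=\"https://app.clear.ml\",\n",
        "#     api_host=\"https://api.clear.ml\",\n",
        "#     files_host=\"https://files.clear.ml\",\n",
        "#     key=\"YOUR_ACCESS_KEY\",  # placeholder\n",
        "#     secret=\"YOUR_SECRET_KEY\",  # placeholder\n",
        "# )"
      ],
      "metadata": {
        "id": "sdk-credentials-code"
      },
      "execution_count": null,
      "outputs": []
    },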
|
||
    {
      "cell_type": "code",
      "source": [
        "# Set to True so ClearML will log the models created by the trainer\n",
        "%env CLEARML_LOG_MODEL=True\n",
        "\n",
        "# Set the ClearML task name (default: \"Trainer\")\n",
        "# %env CLEARML_TASK=\n",
        "\n",
        "# Set the task's project (default: \"HuggingFace Transformers\")\n",
        "# %env CLEARML_PROJECT="
      ],
      "metadata": {
        "id": "2rrFgf_4OPaW"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Set up Environment"
      ],
      "metadata": {
        "id": "pmA00rW9yQnA"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "DdLigEyL9_Sr"
      },
      "outputs": [],
      "source": [
        "# ClearML installation\n",
        "!pip install clearml\n",
        "# Transformers installation\n",
        "!pip install transformers[torch] datasets\n",
        "# Accelerate is required by the Trainer\n",
        "!pip install accelerate -U"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1hA8BaUo9_Tm"
      },
      "source": [
        "## Create Trainer - a PyTorch-optimized training loop"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "LG-TqpPf9_Tn"
      },
      "source": [
        "Create a trainer. You'll typically pass the following parameters to [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer):\n",
        "\n",
        "1. A [PreTrainedModel](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel) or a [`torch.nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module):\n"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "from transformers import AutoModelForSequenceClassification\n",
        "\n",
        "model = AutoModelForSequenceClassification.from_pretrained(\"distilbert-base-uncased\")\n"
      ],
      "metadata": {
        "id": "_seFDUhq_Mt9"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "2. [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments) contains the model hyperparameters you can change, such as the learning rate, batch size, and number of epochs to train for. Default values are used if you don't specify any training arguments.\n",
        "ClearML will capture all of the training arguments.\n"
      ],
      "metadata": {
        "id": "Va3iOGxa_Vra"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from transformers import TrainingArguments\n",
        "\n",
        "training_args = TrainingArguments(\n",
        "    output_dir=\"path/to/save/folder/\",\n",
        "    learning_rate=2e-5,\n",
        "    per_device_train_batch_size=8,\n",
        "    per_device_eval_batch_size=8,\n",
        "    num_train_epochs=2,\n",
        ")\n"
      ],
      "metadata": {
        "id": "LjSvf25e_XuL"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "3. A preprocessing class like a tokenizer, image processor, feature extractor, or processor:\n"
      ],
      "metadata": {
        "id": "dt38Jd46_gJT"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from transformers import AutoTokenizer\n",
        "\n",
        "tokenizer = AutoTokenizer.from_pretrained(\"distilbert-base-uncased\")\n"
      ],
      "metadata": {
        "id": "Yfr18wjk_h89"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "4. Load a dataset:\n"
      ],
      "metadata": {
        "id": "o_lcuw5b_l76"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from datasets import load_dataset\n",
        "\n",
        "dataset = load_dataset(\"rotten_tomatoes\")\n"
      ],
      "metadata": {
        "id": "fpZhsScX_nZh"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "5. Create a function to tokenize the dataset, and then apply it over the entire dataset with [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map):\n"
      ],
      "metadata": {
        "id": "zHQ3A5bC_r5e"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def tokenize_dataset(dataset):\n",
        "    # Truncate inputs to the model's maximum sequence length\n",
        "    return tokenizer(dataset[\"text\"], truncation=True)\n",
        "\n",
        "dataset = dataset.map(tokenize_dataset, batched=True)\n"
      ],
      "metadata": {
        "id": "SdyzdgrS_tdf"
      },
      "execution_count": null,
      "outputs": []
    },
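    {
      "cell_type": "markdown",
      "source": [
        "Optionally, as a quick sanity check, inspect a single tokenized example. You should see `input_ids` and `attention_mask` alongside the original `text` and `label` columns:"
      ],
      "metadata": {
        "id": "tokenized-sanity-check-note"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Optional sanity check: one tokenized training example.\n",
        "# Expect \"text\" and \"label\" plus the new \"input_ids\" and \"attention_mask\".\n",
        "dataset[\"train\"][0]"
      ],
      "metadata": {
        "id": "tokenized-sanity-check-code"
      },
      "execution_count": null,
      "outputs": []
    },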
|
||
    {
      "cell_type": "markdown",
      "source": [
        "6. A [DataCollatorWithPadding](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorWithPadding) to batch and pad the examples from your dataset:\n"
      ],
      "metadata": {
        "id": "w_ailAc5AA2w"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from transformers import DataCollatorWithPadding\n",
        "\n",
        "data_collator = DataCollatorWithPadding(tokenizer=tokenizer)\n"
      ],
      "metadata": {
        "id": "l4L_HvPpAG8y"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Now gather all these classes in [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer):"
      ],
      "metadata": {
        "id": "Q3WimRh3AOQ6"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "RMTBiL159_Tn"
      },
      "outputs": [],
      "source": [
        "from transformers import Trainer\n",
        "\n",
        "trainer = Trainer(\n",
        "    model=model,\n",
        "    args=training_args,\n",
        "    train_dataset=dataset[\"train\"],\n",
        "    eval_dataset=dataset[\"test\"],\n",
        "    tokenizer=tokenizer,\n",
        "    data_collator=data_collator,\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1rc5xo-F9_Tn"
      },
      "source": [
        "## Start Training\n",
        "When you're ready, call [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to start training:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "jO6FkQM_9_To"
      },
      "outputs": [],
      "source": [
        "trainer.train()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "QdLmkuBF9_To"
      },
      "source": [
        "Since `clearml` is installed, the trainer uses the [ClearMLCallback](https://huggingface.co/docs/transformers/main/en/main_classes/callback#transformers.integrations.ClearMLCallback), which creates a ClearML task that captures your experiment's models, scalars, images, and more.\n",
        "\n",
        "By default, a task called `Trainer` is created in the `HuggingFace Transformers` project. To change the task's name or project, use the `CLEARML_TASK` and `CLEARML_PROJECT` environment variables.\n",
        "\n",
        "The console output displays the task ID and a link to the task's page in the [ClearML web UI](https://clear.ml/docs/latest/docs/webapp/webapp_exp_track_visual), where you can visualize all the information captured by the task.\n"
      ]
    },
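    {
      "cell_type": "markdown",
      "source": [
        "Beyond what the callback captures automatically, you can also report your own values to the same task through the ClearML SDK. The sketch below grabs the current task and logs one extra scalar; the `title` and `series` names are illustrative. Note that, depending on your `transformers` version, the callback may close the task when training ends, in which case `Task.current_task()` returns `None`."
      ],
      "metadata": {
        "id": "manual-reporting-note"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from clearml import Task\n",
        "\n",
        "# Fetch the task created by the ClearMLCallback during training.\n",
        "# It may be None if the callback already closed the task at train end.\n",
        "task = Task.current_task()\n",
        "if task is not None:\n",
        "    print(f\"Task ID: {task.id}\")\n",
        "    # Report an additional custom scalar to the same task\n",
        "    task.get_logger().report_scalar(\n",
        "        title=\"custom metrics\", series=\"my_metric\", value=0.9, iteration=0\n",
        "    )"
      ],
      "metadata": {
        "id": "manual-reporting-code"
      },
      "execution_count": null,
      "outputs": []
    }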
|
||
  ],
  "metadata": {
    "colab": {
      "provenance": [],
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}