diff --git a/.gitattributes b/.gitattributes new file mode 100644 index 0000000000000000000000000000000000000000..959fc77ba7fb3fee1c8580527116493c6bd734a5 --- /dev/null +++ b/.gitattributes @@ -0,0 +1,2 @@ +*.lance filter=lfs diff=lfs merge=lfs -text +*.idx filter=lfs diff=lfs merge=lfs -text diff --git a/Copy_of_rag_homework_fin.ipynb b/Copy_of_rag_homework_fin.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..c582cffd783912e094797b51d43da221b9c71b4f --- /dev/null +++ b/Copy_of_rag_homework_fin.ipynb @@ -0,0 +1,1315 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "844fe3af-9cf1-4c66-aa78-b88a3429acc6", + "metadata": { + "id": "844fe3af-9cf1-4c66-aa78-b88a3429acc6" + }, + "source": [ + "### 0. Setup\n", + "1) Clone https://github.com/plaggy/rag-gradio-sample-project and set up an environment with gradio_app/requirements.txt.\n", + "\n", + "There you'll find the following files:\n", + "- [prep_scripts/markdown_to_text.py](https://github.com/plaggy/rag-gradio-sample-project/blob/main/prep_scripts/markdown_to_text.py) processes markdown into text; you won't need to change it.\n", + "- [prep_scripts/lancedb_setup.py](https://github.com/plaggy/rag-gradio-sample-project/blob/main/prep_scripts/lancedb_setup.py) is the file where the database is created and, in particular, an embedding model is defined.\n", + "- [gradio_app/backend/query_llm.py](https://github.com/plaggy/rag-gradio-sample-project/blob/main/gradio_app/backend/query_llm.py) defines which LLM is used.\n", + "- [gradio_app/app.py](https://github.com/plaggy/rag-gradio-sample-project/blob/main/gradio_app/app.py) creates the gradio app.\n", + "\n", + "In this task you'll try not only OpenAI models, but also open-source models from the Hugging Face Hub through the InferenceClient interface (see [gradio_app/backend/query_llm.py](https://github.com/plaggy/rag-gradio-sample-project/blob/main/gradio_app/backend/query_llm.py)). 
Please don't forget to obtain a Hugging Face token for that (see https://huggingface.co/settings/tokens).\n", + "\n", + "\n", + "A convenient way to work through the project is to test locally and keep committing the changes to the [HF Spaces](https://huggingface.co/spaces) repo. A space is automatically rebuilt after each commit, so you get a new version of your application up and running.\n", + "\n", + "2) Create a new space with the Gradio SDK. You'll get an almost empty repo; the only thing you'll need from it is README.md, which has a config letting the space builder know that it's a Gradio app. Reset the remote upstream of your local rag-gradio-sample-project clone to be your freshly created Spaces repository.\n", + "\n", + "The easiest way to set your space up is to set up the gradio_app folder as a git repo, set remote origin to your space repo, and check out the remote README:\n", + "\n", + "```\n", + "cd gradio_app\n", + "git init\n", + "git remote add origin \n", + "git fetch\n", + "git checkout origin/main README.md\n", + "```\n", + "\n", + "The space is not working yet. You'll get the first working version after Step 3.\n", + "\n", + "- Clone https://github.com/huggingface/transformers to a local machine and run the prep_scripts/markdown_to_text.py script to extract raw text from transformers/docs/source/en/. This will be your knowledge base; you don't need it to be a part of your repository.\n", + "\n", + "Run the command as follows (pass arguments that work for you):\n", + "```\n", + "python prep_scripts/markdown_to_text.py --input-dir transformers/docs/source/en/ --output-dir docs\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "id": "762e9fde-c1f4-464c-b12b-dca602fac5ba", + "metadata": { + "id": "762e9fde-c1f4-464c-b12b-dca602fac5ba" + }, + "source": [ + "**By design, you'll be running your experiments in a [Gradio space](https://huggingface.co/docs/hub/en/spaces-sdks-gradio). 
Apart from deliverables for each step, you'll need to provide a link to a functioning RAG space in its final state!**" + ] + }, + { + "cell_type": "code", + "source": [ + "!git clone https://github.com/plaggy/rag-gradio-sample-project" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "BUHKUeqR7unC", + "outputId": "92617e28-da69-45e3-b34e-2b88876ae3dd" + }, + "id": "BUHKUeqR7unC", + "execution_count": 1, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Cloning into 'rag-gradio-sample-project'...\n", + "remote: Enumerating objects: 73, done.\u001b[K\n", + "remote: Counting objects: 100% (73/73), done.\u001b[K\n", + "remote: Compressing objects: 100% (59/59), done.\u001b[K\n", + "remote: Total 73 (delta 23), reused 57 (delta 14), pack-reused 0\u001b[K\n", + "Receiving objects: 100% (73/73), 31.10 KiB | 10.37 MiB/s, done.\n", + "Resolving deltas: 100% (23/23), done.\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "!pip install -r /content/rag-gradio-sample-project/gradio_app/requirements.txt" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "FFIvgBYDcVMt", + "outputId": "3c53faf0-f87e-4d19-bbac-90401cc70b71" + }, + "id": "FFIvgBYDcVMt", + "execution_count": 2, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Collecting lancedb==0.5.3 (from -r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1))\n", + " Downloading lancedb-0.5.3-py3-none-any.whl (106 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m107.0/107.0 kB\u001b[0m \u001b[31m2.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hCollecting openai==1.11.1 (from -r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 2))\n", + " Downloading openai-1.11.1-py3-none-any.whl (226 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m226.1/226.1 
kB\u001b[0m \u001b[31m19.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hCollecting sentence-transformers==2.3.1 (from -r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 3))\n", + " Downloading sentence_transformers-2.3.1-py3-none-any.whl (132 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m132.8/132.8 kB\u001b[0m \u001b[31m15.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hCollecting tqdm==4.66.1 (from -r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 4))\n", + " Downloading tqdm-4.66.1-py3-none-any.whl (78 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m78.3/78.3 kB\u001b[0m \u001b[31m9.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hCollecting torch==2.1.1 (from -r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5))\n", + " Downloading torch-2.1.1-cp310-cp310-manylinux1_x86_64.whl (670.2 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m670.2/670.2 MB\u001b[0m \u001b[31m2.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hCollecting transformers==4.37.2 (from -r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 6))\n", + " Downloading transformers-4.37.2-py3-none-any.whl (8.4 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m8.4/8.4 MB\u001b[0m \u001b[31m89.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hCollecting deprecation (from lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1))\n", + " Downloading deprecation-2.1.0-py2.py3-none-any.whl (11 kB)\n", + "Collecting pylance==0.9.12 (from lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1))\n", + " Downloading pylance-0.9.12-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (21.4 MB)\n", + "\u001b[2K 
\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m21.4/21.4 MB\u001b[0m \u001b[31m68.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hCollecting ratelimiter~=1.0 (from lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1))\n", + " Downloading ratelimiter-1.2.0.post0-py3-none-any.whl (6.6 kB)\n", + "Collecting retry>=0.9.2 (from lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1))\n", + " Downloading retry-0.9.2-py2.py3-none-any.whl (8.0 kB)\n", + "Requirement already satisfied: pydantic>=1.10 in /usr/local/lib/python3.10/dist-packages (from lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1)) (2.6.1)\n", + "Requirement already satisfied: attrs>=21.3.0 in /usr/local/lib/python3.10/dist-packages (from lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1)) (23.2.0)\n", + "Collecting semver>=3.0 (from lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1))\n", + " Downloading semver-3.0.2-py3-none-any.whl (17 kB)\n", + "Requirement already satisfied: cachetools in /usr/local/lib/python3.10/dist-packages (from lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1)) (5.3.2)\n", + "Requirement already satisfied: pyyaml>=6.0 in /usr/local/lib/python3.10/dist-packages (from lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1)) (6.0.1)\n", + "Requirement already satisfied: click>=8.1.7 in /usr/local/lib/python3.10/dist-packages (from lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1)) (8.1.7)\n", + "Requirement already satisfied: requests>=2.31.0 in /usr/local/lib/python3.10/dist-packages (from lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1)) (2.31.0)\n", + "Collecting overrides>=0.7 (from 
lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1))\n", + " Downloading overrides-7.7.0-py3-none-any.whl (17 kB)\n", + "Requirement already satisfied: anyio<5,>=3.5.0 in /usr/local/lib/python3.10/dist-packages (from openai==1.11.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 2)) (3.7.1)\n", + "Requirement already satisfied: distro<2,>=1.7.0 in /usr/lib/python3/dist-packages (from openai==1.11.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 2)) (1.7.0)\n", + "Collecting httpx<1,>=0.23.0 (from openai==1.11.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 2))\n", + " Downloading httpx-0.26.0-py3-none-any.whl (75 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m75.9/75.9 kB\u001b[0m \u001b[31m7.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hRequirement already satisfied: sniffio in /usr/local/lib/python3.10/dist-packages (from openai==1.11.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 2)) (1.3.0)\n", + "Requirement already satisfied: typing-extensions<5,>=4.7 in /usr/local/lib/python3.10/dist-packages (from openai==1.11.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 2)) (4.9.0)\n", + "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from sentence-transformers==2.3.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 3)) (1.25.2)\n", + "Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (from sentence-transformers==2.3.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 3)) (1.2.2)\n", + "Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from sentence-transformers==2.3.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 3)) (1.11.4)\n", + "Requirement already satisfied: 
nltk in /usr/local/lib/python3.10/dist-packages (from sentence-transformers==2.3.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 3)) (3.8.1)\n", + "Requirement already satisfied: sentencepiece in /usr/local/lib/python3.10/dist-packages (from sentence-transformers==2.3.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 3)) (0.1.99)\n", + "Requirement already satisfied: huggingface-hub>=0.15.1 in /usr/local/lib/python3.10/dist-packages (from sentence-transformers==2.3.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 3)) (0.20.3)\n", + "Requirement already satisfied: Pillow in /usr/local/lib/python3.10/dist-packages (from sentence-transformers==2.3.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 3)) (9.4.0)\n", + "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5)) (3.13.1)\n", + "Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5)) (1.12)\n", + "Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5)) (3.2.1)\n", + "Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5)) (3.1.3)\n", + "Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5)) (2023.6.0)\n", + "Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5))\n", + " Downloading 
nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m23.7/23.7 MB\u001b[0m \u001b[31m59.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hCollecting nvidia-cuda-runtime-cu12==12.1.105 (from torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5))\n", + " Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m823.6/823.6 kB\u001b[0m \u001b[31m43.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hCollecting nvidia-cuda-cupti-cu12==12.1.105 (from torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5))\n", + " Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m14.1/14.1 MB\u001b[0m \u001b[31m80.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hCollecting nvidia-cudnn-cu12==8.9.2.26 (from torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5))\n", + " Downloading nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m731.7/731.7 MB\u001b[0m \u001b[31m764.7 kB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hCollecting nvidia-cublas-cu12==12.1.3.1 (from torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5))\n", + " Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m410.6/410.6 MB\u001b[0m \u001b[31m2.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hCollecting nvidia-cufft-cu12==11.0.2.54 (from torch==2.1.1->-r 
/content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5))\n", + " Downloading nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m121.6/121.6 MB\u001b[0m \u001b[31m8.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hCollecting nvidia-curand-cu12==10.3.2.106 (from torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5))\n", + " Downloading nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m56.5/56.5 MB\u001b[0m \u001b[31m10.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hCollecting nvidia-cusolver-cu12==11.4.5.107 (from torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5))\n", + " Downloading nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl (124.2 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m124.2/124.2 MB\u001b[0m \u001b[31m8.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hCollecting nvidia-cusparse-cu12==12.1.0.106 (from torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5))\n", + " Downloading nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl (196.0 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m196.0/196.0 MB\u001b[0m \u001b[31m2.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hCollecting nvidia-nccl-cu12==2.18.1 (from torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5))\n", + " Downloading nvidia_nccl_cu12-2.18.1-py3-none-manylinux1_x86_64.whl (209.8 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m209.8/209.8 MB\u001b[0m \u001b[31m5.8 MB/s\u001b[0m eta 
\u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hCollecting nvidia-nvtx-cu12==12.1.105 (from torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5))\n", + " Downloading nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (99 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m99.1/99.1 kB\u001b[0m \u001b[31m13.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hRequirement already satisfied: triton==2.1.0 in /usr/local/lib/python3.10/dist-packages (from torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5)) (2.1.0)\n", + "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from transformers==4.37.2->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 6)) (23.2)\n", + "Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers==4.37.2->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 6)) (2023.12.25)\n", + "Requirement already satisfied: tokenizers<0.19,>=0.14 in /usr/local/lib/python3.10/dist-packages (from transformers==4.37.2->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 6)) (0.15.2)\n", + "Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from transformers==4.37.2->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 6)) (0.4.2)\n", + "Collecting nvidia-nvjitlink-cu12 (from nvidia-cusolver-cu12==11.4.5.107->torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5))\n", + " Downloading nvidia_nvjitlink_cu12-12.3.101-py3-none-manylinux1_x86_64.whl (20.5 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m20.5/20.5 MB\u001b[0m \u001b[31m72.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hCollecting pyarrow>=12 (from 
pylance==0.9.12->lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1))\n", + " Downloading pyarrow-15.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (38.3 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m38.3/38.3 MB\u001b[0m \u001b[31m14.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hRequirement already satisfied: idna>=2.8 in /usr/local/lib/python3.10/dist-packages (from anyio<5,>=3.5.0->openai==1.11.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 2)) (3.6)\n", + "Requirement already satisfied: exceptiongroup in /usr/local/lib/python3.10/dist-packages (from anyio<5,>=3.5.0->openai==1.11.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 2)) (1.2.0)\n", + "Requirement already satisfied: certifi in /usr/local/lib/python3.10/dist-packages (from httpx<1,>=0.23.0->openai==1.11.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 2)) (2024.2.2)\n", + "Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai==1.11.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 2))\n", + " Downloading httpcore-1.0.3-py3-none-any.whl (77 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m77.0/77.0 kB\u001b[0m \u001b[31m10.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hCollecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->openai==1.11.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 2))\n", + " Downloading h11-0.14.0-py3-none-any.whl (58 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m58.3/58.3 kB\u001b[0m \u001b[31m7.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hRequirement already satisfied: annotated-types>=0.4.0 in /usr/local/lib/python3.10/dist-packages (from pydantic>=1.10->lancedb==0.5.3->-r 
/content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1)) (0.6.0)\n", + "Requirement already satisfied: pydantic-core==2.16.2 in /usr/local/lib/python3.10/dist-packages (from pydantic>=1.10->lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1)) (2.16.2)\n", + "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests>=2.31.0->lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1)) (3.3.2)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.31.0->lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1)) (2.0.7)\n", + "Requirement already satisfied: decorator>=3.4.2 in /usr/local/lib/python3.10/dist-packages (from retry>=0.9.2->lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1)) (4.4.2)\n", + "Collecting py<2.0.0,>=1.4.26 (from retry>=0.9.2->lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1))\n", + " Downloading py-1.11.0-py2.py3-none-any.whl (98 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m98.7/98.7 kB\u001b[0m \u001b[31m12.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hRequirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5)) (2.1.5)\n", + "Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk->sentence-transformers==2.3.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 3)) (1.3.2)\n", + "Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn->sentence-transformers==2.3.1->-r 
/content/rag-gradio-sample-project/gradio_app/requirements.txt (line 3)) (3.2.0)\n", + "Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5)) (1.3.0)\n", + "Installing collected packages: ratelimiter, tqdm, semver, pyarrow, py, overrides, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, h11, deprecation, retry, pylance, nvidia-cusparse-cu12, nvidia-cudnn-cu12, httpcore, nvidia-cusolver-cu12, lancedb, httpx, transformers, torch, openai, sentence-transformers\n", + " Attempting uninstall: tqdm\n", + " Found existing installation: tqdm 4.66.2\n", + " Uninstalling tqdm-4.66.2:\n", + " Successfully uninstalled tqdm-4.66.2\n", + " Attempting uninstall: pyarrow\n", + " Found existing installation: pyarrow 10.0.1\n", + " Uninstalling pyarrow-10.0.1:\n", + " Successfully uninstalled pyarrow-10.0.1\n", + " Attempting uninstall: transformers\n", + " Found existing installation: transformers 4.35.2\n", + " Uninstalling transformers-4.35.2:\n", + " Successfully uninstalled transformers-4.35.2\n", + " Attempting uninstall: torch\n", + " Found existing installation: torch 2.1.0+cu121\n", + " Uninstalling torch-2.1.0+cu121:\n", + " Successfully uninstalled torch-2.1.0+cu121\n", + "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 
This behaviour is the source of the following dependency conflicts.\n", + "llmx 0.0.15a0 requires cohere, which is not installed.\n", + "llmx 0.0.15a0 requires tiktoken, which is not installed.\n", + "ibis-framework 7.1.0 requires pyarrow<15,>=2, but you have pyarrow 15.0.0 which is incompatible.\n", + "torchaudio 2.1.0+cu121 requires torch==2.1.0, but you have torch 2.1.1 which is incompatible.\n", + "torchdata 0.7.0 requires torch==2.1.0, but you have torch 2.1.1 which is incompatible.\n", + "torchtext 0.16.0 requires torch==2.1.0, but you have torch 2.1.1 which is incompatible.\n", + "torchvision 0.16.0+cu121 requires torch==2.1.0, but you have torch 2.1.1 which is incompatible.\u001b[0m\u001b[31m\n", + "\u001b[0mSuccessfully installed deprecation-2.1.0 h11-0.14.0 httpcore-1.0.3 httpx-0.26.0 lancedb-0.5.3 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-8.9.2.26 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.18.1 nvidia-nvjitlink-cu12-12.3.101 nvidia-nvtx-cu12-12.1.105 openai-1.11.1 overrides-7.7.0 py-1.11.0 pyarrow-15.0.0 pylance-0.9.12 ratelimiter-1.2.0.post0 retry-0.9.2 semver-3.0.2 sentence-transformers-2.3.1 torch-2.1.1 tqdm-4.66.1 transformers-4.37.2\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "!pip install huggingface_hub" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "uVHHnyoedIPy", + "outputId": "527c17ae-a7db-45db-cf12-46cb07f90342" + }, + "id": "uVHHnyoedIPy", + "execution_count": 3, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Requirement already satisfied: huggingface_hub in /usr/local/lib/python3.10/dist-packages (0.20.3)\n", + "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from huggingface_hub) (3.13.1)\n", + "Requirement already 
satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.10/dist-packages (from huggingface_hub) (2023.6.0)\n", + "Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from huggingface_hub) (2.31.0)\n", + "Requirement already satisfied: tqdm>=4.42.1 in /usr/local/lib/python3.10/dist-packages (from huggingface_hub) (4.66.1)\n", + "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from huggingface_hub) (6.0.1)\n", + "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface_hub) (4.9.0)\n", + "Requirement already satisfied: packaging>=20.9 in /usr/local/lib/python3.10/dist-packages (from huggingface_hub) (23.2)\n", + "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface_hub) (3.3.2)\n", + "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface_hub) (3.6)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface_hub) (2.0.7)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface_hub) (2024.2.2)\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "!huggingface-cli login" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "8-M0jyfGdKYe", + "outputId": "c7f7d369-c51b-43e6-8aa4-182af93a7f4a" + }, + "id": "8-M0jyfGdKYe", + "execution_count": 4, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "\n", + " _| _| _| _| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _|_|_|_| _|_| _|_|_| _|_|_|_|\n", + " _| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _|\n", + " _|_|_|_| _| _| _| _|_| _| _|_| _| _| _| _| _| _|_| _|_|_| _|_|_|_| _| _|_|_|\n", + " _| _| _| _| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _|\n", + " _| _| _|_| 
_|_|_| _|_|_| _|_|_| _| _| _|_|_| _| _| _| _|_|_| _|_|_|_|\n", + "\n", + " To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .\n", + "Token: \n", + "Add token as git credential? (Y/n) \n", + "Token is valid (permission: read).\n", + "\u001b[1m\u001b[31mCannot authenticate through git-credential as no helper is defined on your machine.\n", + "You might have to re-authenticate when pushing to the Hugging Face Hub.\n", + "Run the following command in your terminal in case you want to set the 'store' credential helper as default.\n", + "\n", + "git config --global credential.helper store\n", + "\n", + "Read https://git-scm.com/book/en/v2/Git-Tools-Credential-Storage for more details.\u001b[0m\n", + "Token has not been saved to git credential helper.\n", + "Your token has been saved to /root/.cache/huggingface/token\n", + "Login successful\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "%cd rag-gradio-sample-project/gradio_app/\n", + "%ls" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "HjXOD1nH1fx5", + "outputId": "874e8e62-8730-47e2-b185-c7c1e9cc6cfe" + }, + "id": "HjXOD1nH1fx5", + "execution_count": 5, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "/content/rag-gradio-sample-project/gradio_app\n", + "app.py \u001b[0m\u001b[01;34mbackend\u001b[0m/ requirements.txt \u001b[01;34mtemplates\u001b[0m/\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "%pwd" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 36 + }, + "id": "55IyOwgr1kNR", + "outputId": "6235574d-e278-40eb-f389-bdae96090556" + }, + "id": "55IyOwgr1kNR", + "execution_count": 6, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "'/content/rag-gradio-sample-project/gradio_app'" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + } + }, + "metadata": {}, + 
"execution_count": 6 + } + ] + }, + { + "cell_type": "code", + "source": [ + "!git init\n", + "!git remote add origin https://huggingface.co/spaces/Ahmadzei/RAG\n", + "!git config --global init.defaultBranch main\n", + "!git fetch\n", + "!git checkout origin/main README.md" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "1wY0VaL-9c14", + "outputId": "27d6b540-2d9a-4dee-d397-25601878c187" + }, + "id": "1wY0VaL-9c14", + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "\u001b[33mhint: Using 'master' as the name for the initial branch. This default branch name\u001b[m\n", + "\u001b[33mhint: is subject to change. To configure the initial branch name to use in all\u001b[m\n", + "\u001b[33mhint: of your new repositories, which will suppress this warning, call:\u001b[m\n", + "\u001b[33mhint: \u001b[m\n", + "\u001b[33mhint: \tgit config --global init.defaultBranch \u001b[m\n", + "\u001b[33mhint: \u001b[m\n", + "\u001b[33mhint: Names commonly chosen instead of 'master' are 'main', 'trunk' and\u001b[m\n", + "\u001b[33mhint: 'development'. 
The just-created branch can be renamed via this command:\u001b[m\n", + "\u001b[33mhint: \u001b[m\n", + "\u001b[33mhint: \tgit branch -m \u001b[m\n", + "Initialized empty Git repository in /content/rag-gradio-sample-project/gradio_app/.git/\n", + "remote: Enumerating objects: 4, done.\u001b[K\n", + "remote: Total 4 (delta 0), reused 0 (delta 0), pack-reused 4\u001b[K\n", + "Unpacking objects: 100% (4/4), 1.27 KiB | 1.27 MiB/s, done.\n", + "From https://huggingface.co/spaces/Ahmadzei/RAG\n", + " * [new branch] main -> origin/main\n", + "Updated 1 path from b4805fb\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "!git clone https://github.com/huggingface/transformers" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "cgX7Aqujk37U", + "outputId": "6294a191-642f-41f2-bb07-0ce528fae8c2" + }, + "id": "cgX7Aqujk37U", + "execution_count": 7, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Cloning into 'transformers'...\n", + "remote: Enumerating objects: 185037, done.\u001b[K\n", + "remote: Counting objects: 100% (1681/1681), done.\u001b[K\n", + "remote: Compressing objects: 100% (1231/1231), done.\u001b[K\n", + "remote: Total 185037 (delta 824), reused 742 (delta 374), pack-reused 183356\u001b[K\n", + "Receiving objects: 100% (185037/185037), 205.20 MiB | 19.65 MiB/s, done.\n", + "Resolving deltas: 100% (130045/130045), done.\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "# !python transformers/prep_scripts/markdown_to_text.py --input_dir transformers/docs/source/en/ --output_dir /content/knowledge_base/\n", + "!python /content/rag-gradio-sample-project/prep_scripts/markdown_to_text.py --input-dir /content/rag-gradio-sample-project/gradio_app/transformers/docs/source/en/ --output-dir /content/docs/" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "2NYMq3KIlMAz", + "outputId": "d24cd17b-2f77-4f3a-b8c0-449acd9b0f80" + }, + "id": 
"2NYMq3KIlMAz", + "execution_count": 8, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "\r0it [00:00, ?it/s]/content/rag-gradio-sample-project/prep_scripts/markdown_to_text.py:22: DeprecationWarning: The 'text' argument to find()-type methods is deprecated. Use 'string' instead.\n", + " text = ''.join(soup.findAll(text=True))\n", + "385it [00:06, 60.38it/s]\n" + ] + } + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6c813d03-33a7-4ce1-836f-11afc541f291", + "metadata": { + "id": "6c813d03-33a7-4ce1-836f-11afc541f291" + }, + "outputs": [], + "source": [ + "# Add the link to the space you've just created here:\n", + "# https://huggingface.co/spaces/Ahmadzei/RAG" + ] + }, + { + "cell_type": "markdown", + "id": "c970d0a4-fee8-48ac-9377-4a6def7712b2", + "metadata": { + "id": "c970d0a4-fee8-48ac-9377-4a6def7712b2" + }, + "source": [ + "### Step 1: Chunk Your Data\n", + "\n", + "To efficiently retrieve documents relevant to a query, the documents in a knowledge base are embedded and stored as vectors. Whole documents are not expected to fit into the context length of an embedding model (most have a 512-token limit). Hence, chunking your documents into smaller pieces is required. Take a deeper dive into why chunking is important and what the options are [here](https://www.pinecone.io/learn/chunking-strategies/).\n", + "\n", + "Your task is to implement and compare two chunking strategies: fixed-sized chunking and content-aware chunking. 
For content-aware chunking, you could split by sentences, paragraphs, or in some other way that makes sense.\n", + "\n", + "The deliverables are:\n", + "- The code for chunk splitting" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f7bad8c8", + "metadata": { + "id": "f7bad8c8" + }, + "outputs": [], + "source": [ + "# Chunk splitting deliverables" + ] + }, + { + "cell_type": "code", + "source": [ + "def fixed_size_chunking(text, chunk_size=512):\n", + "    \"\"\"\n", + "    Splits the text into fixed-sized chunks.\n", + "\n", + "    :param text: The input text to be chunked.\n", + "    :param chunk_size: The size of each chunk in number of characters.\n", + "    :return: A list of chunks.\n", + "    \"\"\"\n", + "    return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]\n" + ], + "metadata": { + "id": "n9qEj8jfvlPj" + }, + "id": "n9qEj8jfvlPj", + "execution_count": 9, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "def content_aware_chunking(text, max_chunk_size=512):\n", + "    \"\"\"\n", + "    Splits the text into content-aware chunks by sentences.\n", + "\n", + "    :param text: The input text to be chunked.\n", + "    :param max_chunk_size: The maximum size of each chunk in number of characters.\n", + "    :return: A list of chunks.\n", + "    \"\"\"\n", + "    sentences = text.split('. ')  # Simple sentence splitting, can be improved with NLP libraries\n", + "    chunks = []\n", + "    current_chunk = \"\"\n", + "\n", + "    for sentence in sentences:\n", + "        if len(current_chunk) + len(sentence) < max_chunk_size:\n", + "            current_chunk += sentence + \". \"\n", + "        else:\n", + "            # Skip the flush when nothing has been accumulated yet, so an\n", + "            # oversized first sentence does not produce an empty chunk\n", + "            if current_chunk:\n", + "                chunks.append(current_chunk.strip())\n", + "            current_chunk = sentence + \". 
\"\n", + "    if current_chunk:\n", + "        chunks.append(current_chunk.strip())\n", + "\n", + "    return chunks" + ], + "metadata": { + "id": "DB5IlJAdL6Bq" + }, + "id": "DB5IlJAdL6Bq", + "execution_count": 10, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "import nltk\n", + "nltk.download('punkt')\n", + "from nltk.tokenize import sent_tokenize\n", + "\n", + "def nltk_chunking(text):\n", + "    \"\"\"\n", + "    Divide text into chunks based on sentences.\n", + "\n", + "    Args:\n", + "        text (str): The text to be chunked.\n", + "\n", + "    Returns:\n", + "        list of str: A list containing the text chunks (sentences).\n", + "    \"\"\"\n", + "    return sent_tokenize(text)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "8eYOiabGvl00", + "outputId": "abf76bf5-09cb-43f6-b40e-0fffcbf37b3a" + }, + "id": "8eYOiabGvl00", + "execution_count": 11, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "[nltk_data] Downloading package punkt to /root/nltk_data...\n", + "[nltk_data] Unzipping tokenizers/punkt.zip.\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "def paragraph_chunking(text):\n", + "    \"\"\"\n", + "    Divide text into chunks based on paragraphs.\n", + "\n", + "    Args:\n", + "        text (str): The text to be chunked.\n", + "\n", + "    Returns:\n", + "        list of str: A list containing the text chunks (paragraphs).\n", + "    \"\"\"\n", + "    return text.split('\\n\\n')" + ], + "metadata": { + "id": "Sk2M6tYmvosj" + }, + "id": "Sk2M6tYmvosj", + "execution_count": 12, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "import os\n", + "import glob\n", + "\n", + "def chunk_and_write_docs(input_dir, output_dir_fixed, output_dir_content_aware):\n", + "    # Ensure output directories exist\n", + "    os.makedirs(output_dir_fixed, exist_ok=True)\n", + "    os.makedirs(output_dir_content_aware, exist_ok=True)\n", + "\n", + "    # List all text 
files in the input directory\n", + "    file_paths = 
glob.glob(os.path.join(input_dir, '*.txt'))\n", + "\n", + "    for file_path in file_paths:\n", + "        # Read the content of the file\n", + "        with open(file_path, 'r', encoding='utf-8') as file:\n", + "            text_content = file.read()\n", + "\n", + "        # Generate chunks using both methods\n", + "        fixed_chunks = fixed_size_chunking(text_content)\n", + "        content_aware_chunks = content_aware_chunking(text_content)\n", + "\n", + "        # Extract base name without extension for use in chunk directory names\n", + "        base_name = os.path.splitext(os.path.basename(file_path))[0]\n", + "\n", + "        # Fixed-size chunking: one sub-directory per source document\n", + "        fixed_chunk_dir = os.path.join(output_dir_fixed, base_name)\n", + "        os.makedirs(fixed_chunk_dir, exist_ok=True)\n", + "        for i, chunk in enumerate(fixed_chunks):\n", + "            with open(os.path.join(fixed_chunk_dir, f'chunk_{i}.txt'), 'w', encoding='utf-8') as chunk_file:\n", + "                chunk_file.write(chunk)\n", + "\n", + "        # Content-aware chunking: one sub-directory per source document\n", + "        content_aware_chunk_dir = os.path.join(output_dir_content_aware, base_name)\n", + "        os.makedirs(content_aware_chunk_dir, exist_ok=True)\n", + "        for i, chunk in enumerate(content_aware_chunks):\n", + "            with open(os.path.join(content_aware_chunk_dir, f'chunk_{i}.txt'), 'w', encoding='utf-8') as chunk_file:\n", + "                chunk_file.write(chunk)\n", + "\n", + "# Define input and output directories\n", + "# The output paths must match the INPUT_DIR values passed to lancedb_setup.py in Step 2;\n", + "# otherwise the ingestion script finds no files and index creation fails with 0 vectors\n", + "input_dir = '/content/docs'\n", + "output_dir_fixed = '/content/chunked/docs/fixed_size_chunking'\n", + "output_dir_content_aware = '/content/chunked/docs/content_aware_chunking'\n", + "\n", + "# Process the documents\n", + 
"chunk_and_write_docs(input_dir, output_dir_fixed, output_dir_content_aware)\n", + "\n", + "# To indicate completion and the count of processed files\n", + "processed_files_count = len(glob.glob(os.path.join(input_dir, '*.txt')))\n", + "processed_files_count\n" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "FGDf40tqSK2C", + "outputId": 
"39033395-444e-4579-a387-1128ec73bc41" + }, + "id": "FGDf40tqSK2C", + "execution_count": 13, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "381" + ] + }, + "metadata": {}, + "execution_count": 13 + } + ] + }, + { + "cell_type": "markdown", + "id": "5e5ebaad-8d42-430c-b00b-18198cdb9ce8", + "metadata": { + "id": "5e5ebaad-8d42-430c-b00b-18198cdb9ce8" + }, + "source": [ + "### Step 2: Ingest chunks into a database and create an index\n", + "\n", + "Chunks need to be vectorized and made accessible to an LLM to enable semantic search with embedding models. A current industry standard is to use a vector database to store and retrieve texts both conveniently and efficiently. There are many products out there; we'll be using [LanceDB](https://lancedb.github.io/lancedb/). LanceDB is a young product; one way it stands out is that it's embedded: it's designed not to be a standalone service but rather a part of an application, more on this [here](https://lancedb.github.io/lancedb/basic/).\n", + "\n", + "Find more details on how different databases compare in [this](https://thedataquarry.com/tags/vector-db/) series of posts.\n", + "\n", + "Your task is to vectorize and ingest chunked documents into the database.\n", + "**For each chunking strategy from the previous step, create a separate table with one of the embedding models. Compare the chunking strategies and choose one. With the second model, perform vectorization and ingestion only for the chunking strategy you chose**.\n", + "Use prep_scripts/lancedb_setup.py to vectorize chunks and store vector representations along with raw text in a LanceDB instance. The script also creates an index for fast ANN retrieval (not really needed for this exercise but necessary at scale). Try different embedding models and see how results differ. 
The options are:\n", + "\n", + "- `sentence-transformers/all-MiniLM-L6-v2`: a light model, produces vectors of length 384\n", + "- `BAAI/bge-large-en-v1.5`: a much heavier model, embedding vector length is 1024\n", + "\n", + "Feel free to explore other embedding models and justify your choice.\n", + "For different embedding models and different chunking strategies, create different tables in the database so you can easily switch between them and compare.\n", + "\n", + "Run the embedding+ingestion script as follows; make sure to look into the script and go over the arguments. Note that the number of sub-vectors for indexing must be a divisor of the model embedding size.\n", + "\n", + "```\n", + "python prep_scripts/lancedb_setup.py --emb-model --table --input-dir --num-sub-vectors \n", + "```\n", + "\n", + "Before committing to your space, set up environment variables on the settings tab of your space; use `.env` as a reference list of all the things you can customize. Make sure to add HF_TOKEN and OPENAI_API_KEY as secrets.\n", + "Not all parameters need to be set via environment variables; most have default values.\n", + "\n", + "*The database is expected to be in the `gradio_app` folder under `.lancedb`; make sure to move it there if it was initialized elsewhere.* It can be parametrized but it's unnecessary here.\n", + "\n", + "To commit large files to GitHub use `git lfs`:\n", + "```\n", + "git lfs install\n", + "git lfs track \"*.lance\"\n", + "git lfs track \"*.idx\"\n", + "git add .gitattributes\n", + "```\n", + "Then proceed as usual.\n", + "\n", + "For experimenting you can easily switch between embedding models/tables by changing the values of the corresponding env variables in your space (`EMB_MODEL`, `TABLE_NAME`). Overall, every time you change the value of an environment variable, the space gets automatically rebuilt.\n", + "\n", + "The deliverables are:\n", + "1. 
The illustration of how retrieved documents differ depending on the embedding model and the chunking strategy. You should create at least 3 tables: model_1 + chunking_strategy_1, model_1 + chunking_strategy_2, model_2 + chunking_strategy_<1 or 2>\n", + "2. The analysis of pros and cons of chunking strategies\n", + "3. The analysis of how retrieved documents differ between embedding models (is one better than the other?)\n", + "4. The analysis of how the embedding time differs between models" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f7db282e-e03c-41de-9c03-54abf455481f", + "metadata": { + "id": "f7db282e-e03c-41de-9c03-54abf455481f" + }, + "outputs": [], + "source": [ + "# Embed documents with different chunking strategies and ingest into the database" + ] + }, + { + "cell_type": "code", + "source": [ + "!pip install lancedb openai pyarrow pandas numpy sentence-transformers" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "vrrCjCs3-lNy", + "outputId": "c1a20049-d733-4390-ef65-cd9df1c0109f" + }, + "id": "vrrCjCs3-lNy", + "execution_count": 14, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Requirement already satisfied: lancedb in /usr/local/lib/python3.10/dist-packages (0.5.3)\n", + "Requirement already satisfied: openai in /usr/local/lib/python3.10/dist-packages (1.11.1)\n", + "Requirement already satisfied: pyarrow in /usr/local/lib/python3.10/dist-packages (15.0.0)\n", + "Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (1.5.3)\n", + "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (1.25.2)\n", + "Requirement already satisfied: sentence-transformers in /usr/local/lib/python3.10/dist-packages (2.3.1)\n", + "Requirement already satisfied: deprecation in /usr/local/lib/python3.10/dist-packages (from lancedb) (2.1.0)\n", + "Requirement already satisfied: pylance==0.9.12 in 
/usr/local/lib/python3.10/dist-packages (from lancedb) (0.9.12)\n", + "Requirement already satisfied: ratelimiter~=1.0 in /usr/local/lib/python3.10/dist-packages (from lancedb) (1.2.0.post0)\n", + "Requirement already satisfied: retry>=0.9.2 in /usr/local/lib/python3.10/dist-packages (from lancedb) (0.9.2)\n", + "Requirement already satisfied: tqdm>=4.27.0 in /usr/local/lib/python3.10/dist-packages (from lancedb) (4.66.1)\n", + "Requirement already satisfied: pydantic>=1.10 in /usr/local/lib/python3.10/dist-packages (from lancedb) (2.6.1)\n", + "Requirement already satisfied: attrs>=21.3.0 in /usr/local/lib/python3.10/dist-packages (from lancedb) (23.2.0)\n", + "Requirement already satisfied: semver>=3.0 in /usr/local/lib/python3.10/dist-packages (from lancedb) (3.0.2)\n", + "Requirement already satisfied: cachetools in /usr/local/lib/python3.10/dist-packages (from lancedb) (5.3.2)\n", + "Requirement already satisfied: pyyaml>=6.0 in /usr/local/lib/python3.10/dist-packages (from lancedb) (6.0.1)\n", + "Requirement already satisfied: click>=8.1.7 in /usr/local/lib/python3.10/dist-packages (from lancedb) (8.1.7)\n", + "Requirement already satisfied: requests>=2.31.0 in /usr/local/lib/python3.10/dist-packages (from lancedb) (2.31.0)\n", + "Requirement already satisfied: overrides>=0.7 in /usr/local/lib/python3.10/dist-packages (from lancedb) (7.7.0)\n", + "Requirement already satisfied: anyio<5,>=3.5.0 in /usr/local/lib/python3.10/dist-packages (from openai) (3.7.1)\n", + "Requirement already satisfied: distro<2,>=1.7.0 in /usr/lib/python3/dist-packages (from openai) (1.7.0)\n", + "Requirement already satisfied: httpx<1,>=0.23.0 in /usr/local/lib/python3.10/dist-packages (from openai) (0.26.0)\n", + "Requirement already satisfied: sniffio in /usr/local/lib/python3.10/dist-packages (from openai) (1.3.0)\n", + "Requirement already satisfied: typing-extensions<5,>=4.7 in /usr/local/lib/python3.10/dist-packages (from openai) (4.9.0)\n", + "Requirement already satisfied: 
python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas) (2.8.2)\n", + "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas) (2023.4)\n", + "Requirement already satisfied: transformers<5.0.0,>=4.32.0 in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (4.37.2)\n", + "Requirement already satisfied: torch>=1.11.0 in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (2.1.1)\n", + "Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (1.2.2)\n", + "Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (1.11.4)\n", + "Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (3.8.1)\n", + "Requirement already satisfied: sentencepiece in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (0.1.99)\n", + "Requirement already satisfied: huggingface-hub>=0.15.1 in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (0.20.3)\n", + "Requirement already satisfied: Pillow in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (9.4.0)\n", + "Requirement already satisfied: idna>=2.8 in /usr/local/lib/python3.10/dist-packages (from anyio<5,>=3.5.0->openai) (3.6)\n", + "Requirement already satisfied: exceptiongroup in /usr/local/lib/python3.10/dist-packages (from anyio<5,>=3.5.0->openai) (1.2.0)\n", + "Requirement already satisfied: certifi in /usr/local/lib/python3.10/dist-packages (from httpx<1,>=0.23.0->openai) (2024.2.2)\n", + "Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.10/dist-packages (from httpx<1,>=0.23.0->openai) (1.0.3)\n", + "Requirement already satisfied: h11<0.15,>=0.13 in /usr/local/lib/python3.10/dist-packages (from httpcore==1.*->httpx<1,>=0.23.0->openai) (0.14.0)\n", + "Requirement already satisfied: filelock in 
/usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.15.1->sentence-transformers) (3.13.1)\n", + "Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.15.1->sentence-transformers) (2023.6.0)\n", + "Requirement already satisfied: packaging>=20.9 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.15.1->sentence-transformers) (23.2)\n", + "Requirement already satisfied: annotated-types>=0.4.0 in /usr/local/lib/python3.10/dist-packages (from pydantic>=1.10->lancedb) (0.6.0)\n", + "Requirement already satisfied: pydantic-core==2.16.2 in /usr/local/lib/python3.10/dist-packages (from pydantic>=1.10->lancedb) (2.16.2)\n", + "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)\n", + "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests>=2.31.0->lancedb) (3.3.2)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.31.0->lancedb) (2.0.7)\n", + "Requirement already satisfied: decorator>=3.4.2 in /usr/local/lib/python3.10/dist-packages (from retry>=0.9.2->lancedb) (4.4.2)\n", + "Requirement already satisfied: py<2.0.0,>=1.4.26 in /usr/local/lib/python3.10/dist-packages (from retry>=0.9.2->lancedb) (1.11.0)\n", + "Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (1.12)\n", + "Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (3.2.1)\n", + "Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (3.1.3)\n", + "Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (12.1.105)\n", 
+ "Requirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (12.1.105)\n", + "Requirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (12.1.105)\n", + "Requirement already satisfied: nvidia-cudnn-cu12==8.9.2.26 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (8.9.2.26)\n", + "Requirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (12.1.3.1)\n", + "Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (11.0.2.54)\n", + "Requirement already satisfied: nvidia-curand-cu12==10.3.2.106 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (10.3.2.106)\n", + "Requirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (11.4.5.107)\n", + "Requirement already satisfied: nvidia-cusparse-cu12==12.1.0.106 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (12.1.0.106)\n", + "Requirement already satisfied: nvidia-nccl-cu12==2.18.1 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (2.18.1)\n", + "Requirement already satisfied: nvidia-nvtx-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (12.1.105)\n", + "Requirement already satisfied: triton==2.1.0 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (2.1.0)\n", + "Requirement already satisfied: nvidia-nvjitlink-cu12 in /usr/local/lib/python3.10/dist-packages (from nvidia-cusolver-cu12==11.4.5.107->torch>=1.11.0->sentence-transformers) 
(12.3.101)\n", + "Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers<5.0.0,>=4.32.0->sentence-transformers) (2023.12.25)\n", + "Requirement already satisfied: tokenizers<0.19,>=0.14 in /usr/local/lib/python3.10/dist-packages (from transformers<5.0.0,>=4.32.0->sentence-transformers) (0.15.2)\n", + "Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from transformers<5.0.0,>=4.32.0->sentence-transformers) (0.4.2)\n", + "Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk->sentence-transformers) (1.3.2)\n", + "Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn->sentence-transformers) (3.2.0)\n", + "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch>=1.11.0->sentence-transformers) (2.1.5)\n", + "Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch>=1.11.0->sentence-transformers) (1.3.0)\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "# Setting environment variables\n", + "os.environ['EMB_MODEL'] = 'sentence-transformers/all-MiniLM-L6-v2' #sentence-transformers/all-MiniLM-L6-v2: a light model, produces vectors of length 384 / BAAI/bge-large-en-v1.5: a much heavier model, embedding vector length is 1024\n", + "os.environ['TABLE_NAME'] = 'fixed_size_chunking' # fixed_size_chunking / content_aware_chunking\n", + "os.environ['INPUT_DIR'] = '/content/chunked/docs/fixed_size_chunking/' # fixed_size_chunking / content_aware_chunking\n", + "os.environ['NUM_SUB_VECTORS'] = '12'" + ], + "metadata": { + "id": "o3TCdDIEYwk6" + }, + "id": "o3TCdDIEYwk6", + "execution_count": 15, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "EMB_MODEL = os.getenv('EMB_MODEL')\n", + "TABLE_NAME = os.getenv('TABLE_NAME')\n", + "INPUT_DIR = 
os.getenv('INPUT_DIR')\n", + "NUM_SUB_VECTORS = os.getenv('NUM_SUB_VECTORS')" + ], + "metadata": { + "id": "1tVGE7JYZc3i" + }, + "id": "1tVGE7JYZc3i", + "execution_count": 16, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "print(INPUT_DIR)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "uL8Gzk6TgLtK", + "outputId": "68c608cf-e685-45c6-fc5f-e51ba204c074" + }, + "id": "uL8Gzk6TgLtK", + "execution_count": 17, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "/content/chunked/docs/fixed_size_chunking/\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "!python /content/rag-gradio-sample-project/prep_scripts/lancedb_setup.py --emb-model {EMB_MODEL} --table {TABLE_NAME} --input-dir {INPUT_DIR} --num-sub-vectors {NUM_SUB_VECTORS}" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Xy1cyu7_zFgO", + "outputId": "89ade558-d3bf-4aab-9b29-35f72950a07d" + }, + "id": "Xy1cyu7_zFgO", + "execution_count": 19, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2\n", + "/usr/local/lib/python3.10/dist-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()\n", + " return self.fget.__get__(instance, owner)()\n", + "INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cpu\n", + "INFO:__main__:using cpu device\n", + "0it [00:00, ?it/s]\n", + "Traceback (most recent call last):\n", + " File \"/content/rag-gradio-sample-project/prep_scripts/lancedb_setup.py\", line 96, in \n", + " main()\n", + " File \"/content/rag-gradio-sample-project/prep_scripts/lancedb_setup.py\", line 88, in main\n", + " tbl.create_index(\n", + " File \"/usr/local/lib/python3.10/dist-packages/lancedb/table.py\", line 858, in create_index\n", + " self._dataset.create_index(\n", + " File \"/usr/local/lib/python3.10/dist-packages/lance/dataset.py\", line 1269, in create_index\n", + " self._ds.create_index(column, index_type, name, replace, kwargs)\n", + "OSError: LanceError(Index): KMeans: can not train 256 centroids with 0 vectors, choose a smaller K (< 0) instead, /home/runner/work/lance/lance/rust/lance-index/src/vector/kmeans.rs:45:21\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "# Setting environment variables\n", + "os.environ['EMB_MODEL'] = 'sentence-transformers/all-MiniLM-L6-v2' #sentence-transformers/all-MiniLM-L6-v2: a light model, produces vectors of length 384 / BAAI/bge-large-en-v1.5: a much heavier model, embedding vector length is 1024\n", + "os.environ['TABLE_NAME'] = 'content_aware_chunking' # fixed_size_chunking / content_aware_chunking\n", + "os.environ['INPUT_DIR'] = '/content/chunked/docs/content_aware_chunking/' # fixed_size_chunking / content_aware_chunking\n", + "os.environ['NUM_SUB_VECTORS'] = '12'" + ], + "metadata": { + "id": "t7aqMOI3bh2s" + }, + "id": "t7aqMOI3bh2s", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "EMB_MODEL2 = os.getenv('EMB_MODEL')\n", + "TABLE_NAME2 = os.getenv('TABLE_NAME')\n", + "INPUT_DIR2 = os.getenv('INPUT_DIR')\n", + 
"NUM_SUB_VECTORS2 = os.getenv('NUM_SUB_VECTORS')" + ], + "metadata": { + "id": "Gk9ynF4Bbslu" + }, + "id": "Gk9ynF4Bbslu", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "!python /content/rag-gradio-sample-project/prep_scripts/lancedb_setup.py --emb-model {EMB_MODEL2} --table {TABLE_NAME2} --input-dir {INPUT_DIR2} --num-sub-vectors {NUM_SUB_VECTORS2}" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "rc0n7a9zbwh2", + "outputId": "50251872-bad0-473b-9ac3-36ed6d7a2e5f" + }, + "id": "rc0n7a9zbwh2", + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2\n", + "/usr/local/lib/python3.10/dist-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()\n", + " return self.fget.__get__(instance, owner)()\n", + "INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cpu\n", + "INFO:__main__:using cpu device\n", + "0it [00:00, ?it/s]\n", + "Traceback (most recent call last):\n", + " File \"/content/rag-gradio-sample-project/prep_scripts/lancedb_setup.py\", line 100, in \n", + " main()\n", + " File \"/content/rag-gradio-sample-project/prep_scripts/lancedb_setup.py\", line 92, in main\n", + " tbl.create_index(\n", + " File \"/usr/local/lib/python3.10/dist-packages/lancedb/table.py\", line 858, in create_index\n", + " self._dataset.create_index(\n", + " File \"/usr/local/lib/python3.10/dist-packages/lance/dataset.py\", line 1269, in create_index\n", + " self._ds.create_index(column, index_type, name, replace, kwargs)\n", + "OSError: LanceError(Index): KMeans: can not train 256 centroids with 0 vectors, choose a smaller K (< 0) instead, /home/runner/work/lance/lance/rust/lance-index/src/vector/kmeans.rs:45:21\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "!git lfs install\n", + "!git lfs track \"*.lance\"\n", + "!git lfs track \"*.idx\"\n", + "!git add .gitattributes\n", + "# Then commit and push as usual\n" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "3Mlmy4j7x9Ln", + "outputId": "c4940d06-37a5-4861-a101-d6cbf753b5d2" + }, + "id": "3Mlmy4j7x9Ln", + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Updated git hooks.\n", + "Git LFS initialized.\n", + "Tracking \"*.lance\"\n", + "Tracking \"*.idx\"\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "id": "7d818b4f-ba5a-4c81-b6d7-f3474c398d9c", + "metadata": { + "id": "7d818b4f-ba5a-4c81-b6d7-f3474c398d9c" + }, + "source": [ + "### Step 3: Add a reranker\n", + "\n", + "A reranker is a second-level model which produces similarity scores for pairs of 
(input query + retrieved document). Cross-encoders are conventionally used for reranking; their architecture is slightly different from that of retrieval models (more on it [here] and [here](https://www.sbert.net/examples/applications/retrieve_rerank/README.html)). Cross-encoders are much more costly to run, so a retrieval model is used to get a few dozen highest-scoring items, and a reranker picks the best among these. The overall pipeline is similar to the recommender system industry standard: a light model retrieves the top n, a precise and heavy model reranks those n to get the top k, and n is orders of magnitude larger than k.\n", + "\n", + "Cross-encoders are optional because of the overhead their usage implies. Your task is to implement a reranker using a cross-encoder and assess the pros and cons of having it. Do not forget that the process of pulling the most relevant documents becomes two-staged: retrieve a larger number of items first, then rerank and keep the best top-k for context.\n", + "\n", + "The models fit for the task:\n", + "1. BAAI/bge-reranker-large\n", + "2. cross-encoder/ms-marco-MiniLM-L-6-v2\n", + "\n", + "As usual, feel free to pick another model and provide some description to it.\n", + "\n", + "The deliverables are:\n", + "\n", + "1. The code that enables a reranker.\n", + "2. A comparison of how the prompt and the model output change after adding a reranker\n", + "3. The analysis of pros and cons. 
The evaluation aspects should include the relevance of the top-k documents and the response time.\n", + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ee1b0160-0ba0-4b5f-81c4-ef3ea76850e5", + "metadata": { + "id": "ee1b0160-0ba0-4b5f-81c4-ef3ea76850e5" + }, + "outputs": [], + "source": [ + "# Implement code for selecting the final documents using a cross-encoder and compare with and without" + ] + }, + { + "cell_type": "code", + "source": [ + "from sentence_transformers import SentenceTransformer\n", + "\n", + "# Load the embedding model used for retrieval\n", + "model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')\n", + "\n", + "# Vectorize the query\n", + "query = \"Your search query here\"\n", + "query_vector = model.encode(query)" + ], + "metadata": { + "id": "peSWSL0lXOK5" + }, + "id": "peSWSL0lXOK5", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "import lancedb\n", + "import numpy as np\n", + "\n", + "# Connect to LanceDB and open your table\n", + "db = lancedb.connect(\"/content/rag-gradio-sample-project/gradio_app/.lancedb/\")\n", + "tbl = db.open_table(TABLE_NAME2) # TABLE_NAME2 is assumed to be defined in an earlier cell\n", + "\n", + "# Perform a vector search for the top-N documents\n", + "df = tbl.search(query_vector) \\\n", + "    .metric(\"cosine\") \\\n", + "    .limit(10) \\\n", + "    .to_list() # Or use .to_pandas(), .to_arrow(), etc., based on your preference\n", + "\n", + "# `df` now contains the top-N documents and their distances to the query" + ], + "metadata": { + "id": "xd10rndiUCIW" + }, + "id": "xd10rndiUCIW", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "# `.to_list()` already returns one dict per row with the stored columns, so no separate fetch is needed\n", + "documents = df" + ], + "metadata": { + "id": "8KWuDzhxTLTX" + }, + "id": "8KWuDzhxTLTX", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "from transformers 
import AutoTokenizer, AutoModelForSequenceClassification\n", + "import torch\n", + "\n", + "# Initialize the tokenizer and the cross-encoder (named `rerank_model` so it does not shadow the embedding `model`)\n", + "tokenizer = AutoTokenizer.from_pretrained(\"cross-encoder/ms-marco-MiniLM-L-6-v2\")\n", + "rerank_model = AutoModelForSequenceClassification.from_pretrained(\"cross-encoder/ms-marco-MiniLM-L-6-v2\")\n", + "\n", + "def rerank(query, documents):\n", + "    # Assuming `documents` is a list of dicts with a 'text' field\n", + "    pairs = [[query, doc['text']] for doc in documents] # Adjust based on your `results` structure\n", + "    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors=\"pt\")\n", + "    with torch.no_grad():\n", + "        # This checkpoint outputs a single relevance logit per (query, document) pair\n", + "        scores = rerank_model(**inputs).logits.squeeze(-1).tolist()\n", + "    # Sort documents by scores in descending order and return\n", + "    documents = [doc for _, doc in sorted(zip(scores, documents), key=lambda x: x[0], reverse=True)]\n", + "    return documents" + ], + "metadata": { + "id": "O6xMyqFjRp_m" + }, + "id": "O6xMyqFjRp_m", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "top_k_documents = rerank(query, documents)[:K] # Keep top K after reranking" + ], + "metadata": { + "id": "dZtiwhPBRtnS" + }, + "id": "dZtiwhPBRtnS", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "id": "f5816c54-a290-4cb0-b7db-3b8901998cb0", + "metadata": { + "id": "f5816c54-a290-4cb0-b7db-3b8901998cb0" + }, + "source": [ + "### Step 4: Try a different LLM\n", + "\n", + "The suggested `Mistral-7b-instruct` is a great but small model for an LLM. A larger model can be applied to a wider range of problems and do more complex reasoning. Within the scope of this project a larger model may not be beneficial but for more complex cases the difference would become apparent. Another dimension to explore is a base model which was not instruction fine-tuned - it won't respond to your queries the way you'd expect. 
It may be a great exercise to see the value of fine-tuning.\n", + "\n", + "The task here is to try out an alternative LLM to explore the differences.\n", + "\n", + "The options are:\n", + "1. mistralai/Mistral-7B-v0.1\n", + "2. mistralai/Mixtral-8x7B-Instruct-v0.1\n", + "\n", + "Of course, feel free to choose another one and give some details on how different it is from the initial model.\n", + "\n", + "The deliverables are:\n", + "\n", + "1. The comparison between outputs of the Mistral-7b-instruct and a different model of your choice.\n", + "2. The difference in response times if a larger model was chosen. Make sure to make multiple queries to make the comparison meaningful.\n", + "3. Analyse the differences between outputs and share the conclusions.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "942f39d4-eb27-4f2d-ae47-a5d65f102faa", + "metadata": { + "id": "942f39d4-eb27-4f2d-ae47-a5d65f102faa" + }, + "outputs": [], + "source": [ + "# Analysis of the difference between LLMs" + ] + }, + { + "cell_type": "markdown", + "id": "70c16440", + "metadata": { + "id": "70c16440" + }, + "source": [ + "### Step 5 (Bonus): Use an LLM to quantitatively compare outputs of different variants of the system (LLM as a Judge)\n", + "\n", + "Use a powerful LLM (e.g. GPT-4) to quantitatively evaluate outputs of two alternative setups (different embedding models, different LLMs, both etc.). For inspiration and for prompts refer to [1](https://arxiv.org/pdf/2306.05685.pdf), [2](https://arxiv.org/pdf/2401.10020.pdf), [3](https://www.airtrain.ai/blog/the-comprehensive-guide-to-llm-evaluation#high-level-approach)\n", + "\n", + "The deliverables:\n", + "\n", + "1. The code you put together\n", + "2. The high-level description of the setup\n", + "3. 
The results of the qualitative comparison\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "39c18ba0-e54a-478f-9e60-0ea65c29238a", + "metadata": { + "id": "39c18ba0-e54a-478f-9e60-0ea65c29238a" + }, + "outputs": [], + "source": [ + "# The code implementing LLM-as-a-Judge and the evaluation results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2ce78700-2578-4719-8b6b-d59fc669d1c1", + "metadata": { + "id": "2ce78700-2578-4719-8b6b-d59fc669d1c1" + }, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.11" + }, + "colab": { + "provenance": [] + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file diff --git a/README.md b/README.md index a95f7be2bdb24b73352337c874742072cca479fd..635df9a912c80c812d35d6418377f1e781c3b1d4 100644 --- a/README.md +++ b/README.md @@ -1,10 +1,11 @@ --- title: RAG emoji: ⚡ -colorFrom: pink -colorTo: gray +colorFrom: yellow +colorTo: indigo sdk: gradio -sdk_version: 4.19.0 -app_file: gradio_app\app.py +sdk_version: 4.4.1 +app_file: app.py pinned: false +license: apache-2.0 --- diff --git a/gradio_app/app.py b/app.py similarity index 100% rename from gradio_app/app.py rename to app.py diff --git a/chunked/content_aware_chunking/__config/chunk_0.txt b/chunked/content_aware_chunking/__config/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..452517e93b8075d54836afceb23dc8ee074c8fb0 --- /dev/null +++ b/chunked/content_aware_chunking/__config/chunk_0.txt @@ -0,0 +1,13 @@ +docstyle-ignore +INSTALL_CONTENT = """ +Transformers installation +! 
pip install transformers datasets +To install from source instead of the last release, comment the command above and uncomment the following one. +! pip install git+https://github.com/huggingface/transformers.git +""" +notebook_first_cells = [{"type": "code", "content": INSTALL_CONTENT}] +black_avoid_patterns = { + "{processor_class}": "FakeProcessorClass", + "{model_class}": "FakeModelClass", + "{object_class}": "FakeObjectClass", +}. \ No newline at end of file diff --git a/chunked/content_aware_chunking/__redirects/chunk_0.txt b/chunked/content_aware_chunking/__redirects/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..7668155a452385855ca2c7208680ecf5816a34e2 --- /dev/null +++ b/chunked/content_aware_chunking/__redirects/chunk_0.txt @@ -0,0 +1,2 @@ +Optimizing inference +perf_infer_gpu_many: perf_infer_gpu_one. \ No newline at end of file diff --git a/chunked/content_aware_chunking/__toctree/chunk_0.txt b/chunked/content_aware_chunking/__toctree/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/chunked/content_aware_chunking/__toctree/chunk_1.txt b/chunked/content_aware_chunking/__toctree/chunk_1.txt new file mode 100644 index 0000000000000000000000000000000000000000..26d1d9cde63014e3992dca14b4740b4b06588a03 --- /dev/null +++ b/chunked/content_aware_chunking/__toctree/chunk_1.txt @@ -0,0 +1,836 @@ +sections: +local: index + title: 🤗 Transformers +local: quicktour + title: Quick tour +local: installation + title: Installation + title: Get started +sections: +local: pipeline_tutorial + title: Run inference with pipelines +local: autoclass_tutorial + title: Write portable code with AutoClass +local: preprocessing + title: Preprocess data +local: training + title: Fine-tune a pretrained model +local: run_scripts + title: Train with a script +local: accelerate + title: Set up distributed training with 🤗 Accelerate +local: peft + title: Load and train 
adapters with 🤗 PEFT +local: model_sharing + title: Share your model +local: transformers_agents + title: Agents +local: llm_tutorial + title: Generation with LLMs + title: Tutorials +sections: +isExpanded: false + sections: +local: tasks/sequence_classification + title: Text classification +local: tasks/token_classification + title: Token classification +local: tasks/question_answering + title: Question answering +local: tasks/language_modeling + title: Causal language modeling +local: tasks/masked_language_modeling + title: Masked language modeling +local: tasks/translation + title: Translation +local: tasks/summarization + title: Summarization +local: tasks/multiple_choice + title: Multiple choice +title: Natural Language Processing + +isExpanded: false + sections: +local: tasks/audio_classification + title: Audio classification +local: tasks/asr + title: Automatic speech recognition +title: Audio + +isExpanded: false + sections: +local: tasks/image_classification + title: Image classification +local: tasks/semantic_segmentation + title: Image segmentation +local: tasks/video_classification + title: Video classification +local: tasks/object_detection + title: Object detection +local: tasks/zero_shot_object_detection + title: Zero-shot object detection +local: tasks/zero_shot_image_classification + title: Zero-shot image classification +local: tasks/monocular_depth_estimation + title: Depth estimation +local: tasks/image_to_image + title: Image-to-Image +local: tasks/mask_generation + title: Mask Generation +local: tasks/knowledge_distillation_for_image_classification + title: Knowledge Distillation for Computer Vision +title: Computer Vision + +isExpanded: false + sections: +local: tasks/image_captioning + title: Image captioning +local: tasks/document_question_answering + title: Document Question Answering +local: tasks/visual_question_answering + title: Visual Question Answering +local: tasks/text-to-speech + title: Text to speech +title: Multimodal + 
+isExpanded: false + sections: +local: generation_strategies + title: Customize the generation strategy +title: Generation + +isExpanded: false + sections: +local: tasks/idefics + title: Image tasks with IDEFICS +local: tasks/prompting + title: LLM prompting guide +title: Prompting + title: Task Guides + +sections: +local: fast_tokenizers + title: Use fast tokenizers from 🤗 Tokenizers +local: multilingual + title: Run inference with multilingual models +local: create_a_model + title: Use model-specific APIs +local: custom_models + title: Share a custom model +local: chat_templating + title: Templates for chat models +local: trainer + title: Trainer +local: sagemaker + title: Run training on Amazon SageMaker +local: serialization + title: Export to ONNX +local: tflite + title: Export to TFLite +local: torchscript + title: Export to TorchScript +local: benchmarks + title: Benchmarks +local: notebooks + title: Notebooks with examples +local: community + title: Community resources +local: custom_tools + title: Custom Tools and Prompts +local: troubleshooting + title: Troubleshoot +local: hf_quantizer + title: Contribute new quantization method + title: Developer guides +sections: +local: performance + title: Overview +local: quantization + title: Quantization +sections: +local: perf_train_gpu_one + title: Methods and tools for efficient training on a single GPU +local: perf_train_gpu_many + title: Multiple GPUs and parallelism +local: fsdp + title: Fully Sharded Data Parallel +local: deepspeed + title: DeepSpeed +local: perf_train_cpu + title: Efficient training on CPU +local: perf_train_cpu_many + title: Distributed CPU training +local: perf_train_tpu_tf + title: Training on TPU with TensorFlow +local: perf_train_special + title: PyTorch training on Apple silicon +local: perf_hardware + title: Custom hardware for training +local: hpo_train + title: Hyperparameter Search using Trainer API +title: Efficient training techniques + +sections: +local: perf_infer_cpu + 
title: CPU inference +local: perf_infer_gpu_one + title: GPU inference +title: Optimizing inference + +local: big_models + title: Instantiating a big model +local: debugging + title: Debugging +local: tf_xla + title: XLA Integration for TensorFlow Models +local: perf_torch_compile + title: Optimize inference using torch.compile() + title: Performance and scalability +sections: +local: contributing + title: How to contribute to 🤗 Transformers? +local: add_new_model + title: How to add a model to 🤗 Transformers? +local: add_tensorflow_model + title: How to convert a 🤗 Transformers model to TensorFlow? +local: add_new_pipeline + title: How to add a pipeline to 🤗 Transformers? +local: testing + title: Testing +local: pr_checks + title: Checks on a Pull Request + title: Contribute +sections: +local: philosophy + title: Philosophy +local: glossary + title: Glossary +local: task_summary + title: What 🤗 Transformers can do +local: tasks_explained + title: How 🤗 Transformers solve tasks +local: model_summary + title: The Transformer model family +local: tokenizer_summary + title: Summary of the tokenizers +local: attention + title: Attention mechanisms +local: pad_truncation + title: Padding and truncation +local: bertology + title: BERTology +local: perplexity + title: Perplexity of fixed-length models +local: pipeline_webserver + title: Pipelines for webserver inference +local: model_memory_anatomy + title: Model training anatomy +local: llm_tutorial_optimization + title: Getting the most out of LLMs + title: Conceptual guides +sections: +sections: +local: main_classes/agent + title: Agents and Tools +local: model_doc/auto + title: Auto Classes +local: main_classes/backbones + title: Backbones +local: main_classes/callback + title: Callbacks +local: main_classes/configuration + title: Configuration +local: main_classes/data_collator + title: Data Collator +local: main_classes/keras_callbacks + title: Keras callbacks +local: main_classes/logging + title: Logging +local: 
main_classes/model + title: Models +local: main_classes/text_generation + title: Text Generation +local: main_classes/onnx + title: ONNX +local: main_classes/optimizer_schedules + title: Optimization +local: main_classes/output + title: Model outputs +local: main_classes/pipelines + title: Pipelines +local: main_classes/processors + title: Processors +local: main_classes/quantization + title: Quantization +local: main_classes/tokenizer + title: Tokenizer +local: main_classes/trainer + title: Trainer +local: main_classes/deepspeed + title: DeepSpeed +local: main_classes/feature_extractor + title: Feature Extractor +local: main_classes/image_processor + title: Image Processor +title: Main Classes + +sections: +isExpanded: false + sections: +local: model_doc/albert + title: ALBERT +local: model_doc/bart + title: BART +local: model_doc/barthez + title: BARThez +local: model_doc/bartpho + title: BARTpho +local: model_doc/bert + title: BERT +local: model_doc/bert-generation + title: BertGeneration +local: model_doc/bert-japanese + title: BertJapanese +local: model_doc/bertweet + title: Bertweet +local: model_doc/big_bird + title: BigBird +local: model_doc/bigbird_pegasus + title: BigBirdPegasus +local: model_doc/biogpt + title: BioGpt +local: model_doc/blenderbot + title: Blenderbot +local: model_doc/blenderbot-small + title: Blenderbot Small +local: model_doc/bloom + title: BLOOM +local: model_doc/bort + title: BORT +local: model_doc/byt5 + title: ByT5 +local: model_doc/camembert + title: CamemBERT +local: model_doc/canine + title: CANINE +local: model_doc/codegen + title: CodeGen +local: model_doc/code_llama + title: CodeLlama +local: model_doc/convbert + title: ConvBERT +local: model_doc/cpm + title: CPM +local: model_doc/cpmant + title: CPMANT +local: model_doc/ctrl + title: CTRL +local: model_doc/deberta + title: DeBERTa +local: model_doc/deberta-v2 + title: DeBERTa-v2 +local: model_doc/dialogpt + title: DialoGPT +local: model_doc/distilbert + title: DistilBERT 
+local: model_doc/dpr + title: DPR +local: model_doc/electra + title: ELECTRA +local: model_doc/encoder-decoder + title: Encoder Decoder Models +local: model_doc/ernie + title: ERNIE +local: model_doc/ernie_m + title: ErnieM +local: model_doc/esm + title: ESM +local: model_doc/falcon + title: Falcon +local: model_doc/fastspeech2_conformer + title: FastSpeech2Conformer +local: model_doc/flan-t5 + title: FLAN-T5 +local: model_doc/flan-ul2 + title: FLAN-UL2 +local: model_doc/flaubert + title: FlauBERT +local: model_doc/fnet + title: FNet +local: model_doc/fsmt + title: FSMT +local: model_doc/funnel + title: Funnel Transformer +local: model_doc/fuyu + title: Fuyu +local: model_doc/openai-gpt + title: GPT +local: model_doc/gpt_neo + title: GPT Neo +local: model_doc/gpt_neox + title: GPT NeoX +local: model_doc/gpt_neox_japanese + title: GPT NeoX Japanese +local: model_doc/gptj + title: GPT-J +local: model_doc/gpt2 + title: GPT2 +local: model_doc/gpt_bigcode + title: GPTBigCode +local: model_doc/gptsan-japanese + title: GPTSAN Japanese +local: model_doc/gpt-sw3 + title: GPTSw3 +local: model_doc/herbert + title: HerBERT +local: model_doc/ibert + title: I-BERT +local: model_doc/jukebox + title: Jukebox +local: model_doc/led + title: LED +local: model_doc/llama + title: LLaMA +local: model_doc/llama2 + title: Llama2 +local: model_doc/longformer + title: Longformer +local: model_doc/longt5 + title: LongT5 +local: model_doc/luke + title: LUKE +local: model_doc/m2m_100 + title: M2M100 +local: model_doc/madlad-400 + title: MADLAD-400 +local: model_doc/marian + title: MarianMT +local: model_doc/markuplm + title: MarkupLM +local: model_doc/mbart + title: MBart and MBart-50 +local: model_doc/mega + title: MEGA +local: model_doc/megatron-bert + title: MegatronBERT +local: model_doc/megatron_gpt2 + title: MegatronGPT2 +local: model_doc/mistral + title: Mistral +local: model_doc/mixtral + title: Mixtral +local: model_doc/mluke + title: mLUKE +local: model_doc/mobilebert + title: 
MobileBERT +local: model_doc/mpnet + title: MPNet +local: model_doc/mpt + title: MPT +local: model_doc/mra + title: MRA +local: model_doc/mt5 + title: MT5 +local: model_doc/mvp + title: MVP +local: model_doc/nezha + title: NEZHA +local: model_doc/nllb + title: NLLB +local: model_doc/nllb-moe + title: NLLB-MoE +local: model_doc/nystromformer + title: Nyströmformer +local: model_doc/open-llama + title: Open-Llama +local: model_doc/opt + title: OPT +local: model_doc/pegasus + title: Pegasus +local: model_doc/pegasus_x + title: PEGASUS-X +local: model_doc/persimmon + title: Persimmon +local: model_doc/phi + title: Phi +local: model_doc/phobert + title: PhoBERT +local: model_doc/plbart + title: PLBart +local: model_doc/prophetnet + title: ProphetNet +local: model_doc/qdqbert + title: QDQBert +local: model_doc/qwen2 + title: Qwen2 +local: model_doc/rag + title: RAG +local: model_doc/realm + title: REALM +local: model_doc/reformer + title: Reformer +local: model_doc/rembert + title: RemBERT +local: model_doc/retribert + title: RetriBERT +local: model_doc/roberta + title: RoBERTa +local: model_doc/roberta-prelayernorm + title: RoBERTa-PreLayerNorm +local: model_doc/roc_bert + title: RoCBert +local: model_doc/roformer + title: RoFormer +local: model_doc/rwkv + title: RWKV +local: model_doc/splinter + title: Splinter +local: model_doc/squeezebert + title: SqueezeBERT +local: model_doc/stablelm + title: StableLm +local: model_doc/switch_transformers + title: SwitchTransformers +local: model_doc/t5 + title: T5 +local: model_doc/t5v1.1 + title: T5v1.1 +local: model_doc/tapex + title: TAPEX +local: model_doc/transfo-xl + title: Transformer XL +local: model_doc/ul2 + title: UL2 +local: model_doc/umt5 + title: UMT5 +local: model_doc/xmod + title: X-MOD +local: model_doc/xglm + title: XGLM +local: model_doc/xlm + title: XLM +local: model_doc/xlm-prophetnet + title: XLM-ProphetNet +local: model_doc/xlm-roberta + title: XLM-RoBERTa +local: model_doc/xlm-roberta-xl + title: 
XLM-RoBERTa-XL +local: model_doc/xlm-v + title: XLM-V +local: model_doc/xlnet + title: XLNet +local: model_doc/yoso + title: YOSO + title: Text models +isExpanded: false + sections: +local: model_doc/beit + title: BEiT +local: model_doc/bit + title: BiT +local: model_doc/conditional_detr + title: Conditional DETR +local: model_doc/convnext + title: ConvNeXT +local: model_doc/convnextv2 + title: ConvNeXTV2 +local: model_doc/cvt + title: CvT +local: model_doc/deformable_detr + title: Deformable DETR +local: model_doc/deit + title: DeiT +local: model_doc/depth_anything + title: Depth Anything +local: model_doc/deta + title: DETA +local: model_doc/detr + title: DETR +local: model_doc/dinat + title: DiNAT +local: model_doc/dinov2 + title: DINOV2 +local: model_doc/dit + title: DiT +local: model_doc/dpt + title: DPT +local: model_doc/efficientformer + title: EfficientFormer +local: model_doc/efficientnet + title: EfficientNet +local: model_doc/focalnet + title: FocalNet +local: model_doc/glpn + title: GLPN +local: model_doc/imagegpt + title: ImageGPT +local: model_doc/levit + title: LeViT +local: model_doc/mask2former + title: Mask2Former +local: model_doc/maskformer + title: MaskFormer +local: model_doc/mobilenet_v1 + title: MobileNetV1 +local: model_doc/mobilenet_v2 + title: MobileNetV2 +local: model_doc/mobilevit + title: MobileViT +local: model_doc/mobilevitv2 + title: MobileViTV2 +local: model_doc/nat + title: NAT +local: model_doc/poolformer + title: PoolFormer +local: model_doc/pvt + title: Pyramid Vision Transformer (PVT) +local: model_doc/regnet + title: RegNet +local: model_doc/resnet + title: ResNet +local: model_doc/segformer + title: SegFormer +local: model_doc/swiftformer + title: SwiftFormer +local: model_doc/swin + title: Swin Transformer +local: model_doc/swinv2 + title: Swin Transformer V2 +local: model_doc/swin2sr + title: Swin2SR +local: model_doc/table-transformer + title: Table Transformer +local: model_doc/upernet + title: UperNet +local: 
model_doc/van + title: VAN +local: model_doc/vit + title: Vision Transformer (ViT) +local: model_doc/vit_hybrid + title: ViT Hybrid +local: model_doc/vitdet + title: ViTDet +local: model_doc/vit_mae + title: ViTMAE +local: model_doc/vitmatte + title: ViTMatte +local: model_doc/vit_msn + title: ViTMSN +local: model_doc/yolos + title: YOLOS + title: Vision models +isExpanded: false + sections: +local: model_doc/audio-spectrogram-transformer + title: Audio Spectrogram Transformer +local: model_doc/bark + title: Bark +local: model_doc/clap + title: CLAP +local: model_doc/encodec + title: EnCodec +local: model_doc/hubert + title: Hubert +local: model_doc/mctct + title: MCTCT +local: model_doc/mms + title: MMS +local: model_doc/musicgen + title: MusicGen +local: model_doc/pop2piano + title: Pop2Piano +local: model_doc/seamless_m4t + title: Seamless-M4T +local: model_doc/seamless_m4t_v2 + title: SeamlessM4T-v2 +local: model_doc/sew + title: SEW +local: model_doc/sew-d + title: SEW-D +local: model_doc/speech_to_text + title: Speech2Text +local: model_doc/speech_to_text_2 + title: Speech2Text2 +local: model_doc/speecht5 + title: SpeechT5 +local: model_doc/unispeech + title: UniSpeech +local: model_doc/unispeech-sat + title: UniSpeech-SAT +local: model_doc/univnet + title: UnivNet +local: model_doc/vits + title: VITS +local: model_doc/wav2vec2 + title: Wav2Vec2 +local: model_doc/wav2vec2-bert + title: Wav2Vec2-BERT +local: model_doc/wav2vec2-conformer + title: Wav2Vec2-Conformer +local: model_doc/wav2vec2_phoneme + title: Wav2Vec2Phoneme +local: model_doc/wavlm + title: WavLM +local: model_doc/whisper + title: Whisper +local: model_doc/xls_r + title: XLS-R +local: model_doc/xlsr_wav2vec2 + title: XLSR-Wav2Vec2 + title: Audio models +isExpanded: false + sections: +local: model_doc/timesformer + title: TimeSformer +local: model_doc/videomae + title: VideoMAE +local: model_doc/vivit + title: ViViT + title: Video models +isExpanded: false + sections: +local: model_doc/align + 
title: ALIGN +local: model_doc/altclip + title: AltCLIP +local: model_doc/blip + title: BLIP +local: model_doc/blip-2 + title: BLIP-2 +local: model_doc/bridgetower + title: BridgeTower +local: model_doc/bros + title: BROS +local: model_doc/chinese_clip + title: Chinese-CLIP +local: model_doc/clip + title: CLIP +local: model_doc/clipseg + title: CLIPSeg +local: model_doc/clvp + title: CLVP +local: model_doc/data2vec + title: Data2Vec +local: model_doc/deplot + title: DePlot +local: model_doc/donut + title: Donut +local: model_doc/flava + title: FLAVA +local: model_doc/git + title: GIT +local: model_doc/groupvit + title: GroupViT +local: model_doc/idefics + title: IDEFICS +local: model_doc/instructblip + title: InstructBLIP +local: model_doc/kosmos-2 + title: KOSMOS-2 +local: model_doc/layoutlm + title: LayoutLM +local: model_doc/layoutlmv2 + title: LayoutLMV2 +local: model_doc/layoutlmv3 + title: LayoutLMV3 +local: model_doc/layoutxlm + title: LayoutXLM +local: model_doc/lilt + title: LiLT +local: model_doc/llava + title: Llava +local: model_doc/lxmert + title: LXMERT +local: model_doc/matcha + title: MatCha +local: model_doc/mgp-str + title: MGP-STR +local: model_doc/nougat + title: Nougat +local: model_doc/oneformer + title: OneFormer +local: model_doc/owlvit + title: OWL-ViT +local: model_doc/owlv2 + title: OWLv2 +local: model_doc/perceiver + title: Perceiver +local: model_doc/pix2struct + title: Pix2Struct +local: model_doc/sam + title: Segment Anything +local: model_doc/siglip + title: SigLIP +local: model_doc/speech-encoder-decoder + title: Speech Encoder Decoder Models +local: model_doc/tapas + title: TAPAS +local: model_doc/trocr + title: TrOCR +local: model_doc/tvlt + title: TVLT +local: model_doc/tvp + title: TVP +local: model_doc/vilt + title: ViLT +local: model_doc/vipllava + title: VipLlava +local: model_doc/vision-encoder-decoder + title: Vision Encoder Decoder Models +local: model_doc/vision-text-dual-encoder + title: Vision Text Dual Encoder +local: 
model_doc/visual_bert + title: VisualBERT +local: model_doc/xclip + title: X-CLIP + title: Multimodal models +isExpanded: false + sections: +local: model_doc/decision_transformer + title: Decision Transformer +local: model_doc/trajectory_transformer + title: Trajectory Transformer + title: Reinforcement learning models +isExpanded: false + sections: +local: model_doc/autoformer + title: Autoformer +local: model_doc/informer + title: Informer +local: model_doc/patchtsmixer + title: PatchTSMixer +local: model_doc/patchtst + title: PatchTST +local: model_doc/time_series_transformer + title: Time Series Transformer + title: Time series models +isExpanded: false + sections: +local: model_doc/graphormer + title: Graphormer + title: Graph models +title: Models + +sections: +local: internal/modeling_utils + title: Custom Layers and Utilities +local: internal/pipelines_utils + title: Utilities for pipelines +local: internal/tokenization_utils + title: Utilities for Tokenizers +local: internal/trainer_utils + title: Utilities for Trainer +local: internal/generation_utils + title: Utilities for Generation +local: internal/image_processing_utils + title: Utilities for Image Processors +local: internal/audio_utils + title: Utilities for Audio processing +local: internal/file_utils + title: General Utilities +local: internal/time_series_utils + title: Utilities for Time Series +title: Internal Helpers + title: API + +. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_accelerate/chunk_0.txt b/chunked/content_aware_chunking/_accelerate/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..fe4490daa7653b809880ae0b79503b949ff9878a --- /dev/null +++ b/chunked/content_aware_chunking/_accelerate/chunk_0.txt @@ -0,0 +1,2 @@ +Distributed training with 🤗 Accelerate +As models get bigger, parallelism has emerged as a strategy for training larger models on limited hardware and accelerating training speed by several orders of magnitude. 
At Hugging Face, we created the 🤗 Accelerate library to help users easily train a 🤗 Transformers model on any type of distributed setup, whether it is multiple GPU's on one machine or multiple GPU's across several machines. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_accelerate/chunk_1.txt b/chunked/content_aware_chunking/_accelerate/chunk_1.txt new file mode 100644 index 0000000000000000000000000000000000000000..e76e9310e76fea500a19c487990b8e58aa5c78c4 --- /dev/null +++ b/chunked/content_aware_chunking/_accelerate/chunk_1.txt @@ -0,0 +1,6 @@ +In this tutorial, learn how to customize your native PyTorch training loop to enable training in a distributed environment. +Setup +Get started by installing 🤗 Accelerate: + +pip install accelerate +Then import and create an [~accelerate.Accelerator] object. The [~accelerate.Accelerator] will automatically detect your type of distributed setup and initialize all the necessary components for training. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_accelerate/chunk_2.txt b/chunked/content_aware_chunking/_accelerate/chunk_2.txt new file mode 100644 index 0000000000000000000000000000000000000000..0a51d00ba99ea7c902ce120d936377a4aa05d0ca --- /dev/null +++ b/chunked/content_aware_chunking/_accelerate/chunk_2.txt @@ -0,0 +1,7 @@ +You don't need to explicitly place your model on a device. + +from accelerate import Accelerator +accelerator = Accelerator() + +Prepare to accelerate +The next step is to pass all the relevant training objects to the [~accelerate.Accelerator.prepare] method. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_accelerate/chunk_3.txt b/chunked/content_aware_chunking/_accelerate/chunk_3.txt new file mode 100644 index 0000000000000000000000000000000000000000..337f2f7cf8fea6239353c3c8a784fb517d4a675b --- /dev/null +++ b/chunked/content_aware_chunking/_accelerate/chunk_3.txt @@ -0,0 +1,72 @@ +This includes your training and evaluation DataLoaders, a model and an optimizer: + +train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare( + train_dataloader, eval_dataloader, model, optimizer + ) + +Backward +The last addition is to replace the typical loss.backward() in your training loop with 🤗 Accelerate's [~accelerate.Accelerator.backward]method: + +for epoch in range(num_epochs): + for batch in train_dataloader: + outputs = model(**batch) + loss = outputs.loss + accelerator.backward(loss) + + optimizer.step() + lr_scheduler.step() + optimizer.zero_grad() + progress_bar.update(1) + +As you can see in the following code, you only need to add four additional lines of code to your training loop to enable distributed training! 
+ ++ from accelerate import Accelerator + from transformers import AdamW, AutoModelForSequenceClassification, get_scheduler + +accelerator = Accelerator() + +model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2) + optimizer = AdamW(model.parameters(), lr=3e-5) + +device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu") + +model.to(device) + +train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare( + +train_dataloader, eval_dataloader, model, optimizer +) + +num_epochs = 3 + num_training_steps = num_epochs * len(train_dataloader) + lr_scheduler = get_scheduler( + "linear", + optimizer=optimizer, + num_warmup_steps=0, + num_training_steps=num_training_steps + ) +progress_bar = tqdm(range(num_training_steps)) +model.train() + for epoch in range(num_epochs): + for batch in train_dataloader: + + outputs = model(**batch) + loss = outputs.loss + ++ accelerator.backward(loss) + optimizer.step() + lr_scheduler.step() + optimizer.zero_grad() + progress_bar.update(1) + +Train +Once you've added the relevant lines of code, launch your training in a script or a notebook like Colaboratory. +Train with a script +If you are running your training from a script, run the following command to create and save a configuration file: + +accelerate config +Then launch your training with: + +accelerate launch train.py +Train with a notebook +🤗 Accelerate can also run in a notebook if you're planning on using Colaboratory's TPUs. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_accelerate/chunk_4.txt b/chunked/content_aware_chunking/_accelerate/chunk_4.txt new file mode 100644 index 0000000000000000000000000000000000000000..7018e655942e23c4d6e3ca32dc0d91002b2b3387 --- /dev/null +++ b/chunked/content_aware_chunking/_accelerate/chunk_4.txt @@ -0,0 +1,6 @@ +Wrap all the code responsible for training in a function, and pass it to [~accelerate.notebook_launcher]: + +from accelerate import notebook_launcher +notebook_launcher(training_function) + +For more information about 🤗 Accelerate and its rich features, refer to the documentation. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_0.txt b/chunked/content_aware_chunking/_add_new_model/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..7dd2fc6d4832175833008de5769a5ac2127a0916 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_0.txt @@ -0,0 +1,2 @@ +How to add a model to 🤗 Transformers? +The 🤗 Transformers library is often able to offer new models thanks to community contributors. But this can be a challenging project and requires an in-depth knowledge of the 🤗 Transformers library and the model to implement. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_1.txt b/chunked/content_aware_chunking/_add_new_model/chunk_1.txt new file mode 100644 index 0000000000000000000000000000000000000000..d6f2841b75e81f0a40e4e8af8aeaed5acfbb29d1 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_1.txt @@ -0,0 +1,12 @@ +At Hugging Face, we're trying to empower more of the community to actively add models and we've put together this guide to walk you through the process of adding a PyTorch model (make sure you have PyTorch installed). + +If you're interested in implementing a TensorFlow model, take a look at the How to convert a 🤗 Transformers model to TensorFlow guide! 
+ +Along the way, you'll: + +get insights into open-source best practices +understand the design principles behind one of the most popular deep learning libraries +learn how to efficiently test large models +learn how to integrate Python utilities like black, ruff, and make fix-copies to ensure clean and readable code + +A Hugging Face team member will be available to help you along the way so you'll never be alone. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_10.txt b/chunked/content_aware_chunking/_add_new_model/chunk_10.txt new file mode 100644 index 0000000000000000000000000000000000000000..2e830144269a38032ba01947a7db2b2f312289bf --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_10.txt @@ -0,0 +1,2 @@ +Note that the configuration and the model are always serialized into two +different formats - the model to a pytorch_model.bin file and the configuration to a config.json file. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_100.txt b/chunked/content_aware_chunking/_add_new_model/chunk_100.txt new file mode 100644 index 0000000000000000000000000000000000000000..682d50bc398931387ba065dce6de11f630643460 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_100.txt @@ -0,0 +1,5 @@ +Hence, the documentation must be understandable and concise. It is very useful for +the community to add some Tips to show how the model should be used. Don't hesitate to ping the Hugging Face team +regarding the docstrings. +Next, make sure that the docstring added to src/transformers/models/brand_new_bert/modeling_brand_new_bert.py is +correct and included all necessary inputs and outputs. We have a detailed guide about writing documentation and our docstring format here. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_101.txt b/chunked/content_aware_chunking/_add_new_model/chunk_101.txt new file mode 100644 index 0000000000000000000000000000000000000000..1f6bd11d0354ba1c33578ff92075be507bec9705 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_101.txt @@ -0,0 +1,5 @@ +It is always good to remind oneself that documentation should +be treated at least as carefully as the code in 🤗 Transformers since the documentation is usually the first contact +point of the community with the model. +Code refactor +Great, now you have added all the necessary code for brand_new_bert. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_102.txt b/chunked/content_aware_chunking/_add_new_model/chunk_102.txt new file mode 100644 index 0000000000000000000000000000000000000000..183acbe4fdf5f344de3320003d3cd3d1f926f303 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_102.txt @@ -0,0 +1,10 @@ +At this point, you should correct any potentially +incorrect code style by running: + +make style +and verify that your coding style passes the quality check: + +make quality +There are a couple of other very strict design tests in 🤗 Transformers that might still be failing, which show up in +the tests of your pull request. This is often because of some missing information in the docstring or some incorrect +naming. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_103.txt b/chunked/content_aware_chunking/_add_new_model/chunk_103.txt new file mode 100644 index 0000000000000000000000000000000000000000..0ecb6e49a2af501a4e7e8bbce00fd3527b597973 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_103.txt @@ -0,0 +1,5 @@ +The Hugging Face team will surely help you if you're stuck here. 
+Lastly, it is always a good idea to refactor one's code after having ensured that the code works correctly. With all +tests passing, now is a good time to go over the added code again and do some refactoring. +You have now finished the coding part, congratulations! 🎉 You are Awesome! 😎 +12. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_104.txt b/chunked/content_aware_chunking/_add_new_model/chunk_104.txt new file mode 100644 index 0000000000000000000000000000000000000000..7ef946dba1cda2c4bf8c1db2b988d4c82d758b8f --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_104.txt @@ -0,0 +1,5 @@ +Upload the models to the model hub +In this final part, you should convert and upload all checkpoints to the model hub and add a model card for each +uploaded model checkpoint. You can get familiar with the hub functionalities by reading our Model sharing and uploading Page. You should work alongside the Hugging Face team here to decide on a fitting name for each +checkpoint and to get the required access rights to be able to upload the model under the author's organization of +brand_new_bert. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_105.txt b/chunked/content_aware_chunking/_add_new_model/chunk_105.txt new file mode 100644 index 0000000000000000000000000000000000000000..c36c4ba953899e6041ecaf06623c7e910ded4db0 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_105.txt @@ -0,0 +1,8 @@ +The push_to_hub method, present in all models in transformers, is a quick and efficient way to push your checkpoint to the hub. A little snippet is pasted below: +python +brand_new_bert.push_to_hub("brand_new_bert") +Uncomment the following line to push to an organization. +brand_new_bert.push_to_hub("/brand_new_bert") + +It is worth spending some time to create fitting model cards for each checkpoint. 
The model cards should highlight the +specific characteristics of this particular checkpoint, e.g. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_106.txt b/chunked/content_aware_chunking/_add_new_model/chunk_106.txt new file mode 100644 index 0000000000000000000000000000000000000000..e3380cb4604be6e15c964464b5eef318ca6e2f6a --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_106.txt @@ -0,0 +1,7 @@ +On which dataset was the checkpoint +pretrained/fine-tuned on? On what down-stream task should the model be used? And also include some code on how to +correctly use the model. +13. (Optional) Add notebook +It is very helpful to add a notebook that showcases in-detail how brand_new_bert can be used for inference and/or +fine-tuned on a downstream task. This is not mandatory to merge your PR, but very useful for the community. +14. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_107.txt b/chunked/content_aware_chunking/_add_new_model/chunk_107.txt new file mode 100644 index 0000000000000000000000000000000000000000..85e2c0109c0800dc999b3c7cf1e109a38c7d7480 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_107.txt @@ -0,0 +1,2 @@ +Submit your finished PR +You're done programming now and can move to the last step, which is getting your PR merged into main. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_108.txt b/chunked/content_aware_chunking/_add_new_model/chunk_108.txt new file mode 100644 index 0000000000000000000000000000000000000000..eff096b18d5d75f85be58d6e81b8bc8b91ade575 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_108.txt @@ -0,0 +1,7 @@ +Usually, the +Hugging Face team should have helped you already at this point, but it is worth taking some time to give your finished +PR a nice description and eventually add comments to your code, if you want to point out certain design choices to your +reviewer. +Share your work!! +Now, it's time to get some credit from the community for your work! Having completed a model addition is a major +contribution to Transformers and the whole NLP community. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_109.txt b/chunked/content_aware_chunking/_add_new_model/chunk_109.txt new file mode 100644 index 0000000000000000000000000000000000000000..ef22c206a10aad3a9119a78539f34b6a5e6c093a --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_109.txt @@ -0,0 +1,4 @@ +Your code and the ported pre-trained models will certainly be +used by hundreds and possibly even thousands of developers and researchers. You should be proud of your work and share +your achievements with the community. +You have made another model that is super easy to access for everyone in the community! 🤯. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_11.txt b/chunked/content_aware_chunking/_add_new_model/chunk_11.txt new file mode 100644 index 0000000000000000000000000000000000000000..4402695e84900febdc3a9964b36a5763ec21f076 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_11.txt @@ -0,0 +1,9 @@ +Calling +[~PreTrainedModel.save_pretrained] will automatically call +[~PretrainedConfig.save_pretrained], so that both model and configuration are saved. +Code style +When coding your new model, keep in mind that Transformers is an opinionated library and we have a few quirks of our +own regarding how code should be written :-) + +The forward pass of your model should be fully written in the modeling file while being fully independent of other + models in the library. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_12.txt b/chunked/content_aware_chunking/_add_new_model/chunk_12.txt new file mode 100644 index 0000000000000000000000000000000000000000..dff4a0e479914ede76843999abe5ad5e8e7f0de6 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_12.txt @@ -0,0 +1,5 @@ +If you want to reuse a block from another model, copy the code and paste it with a + # Copied from comment on top (see here + for a good example and there for more documentation on Copied from). +The code should be fully understandable, even by a non-native English speaker. This means you should pick + descriptive variable names and avoid abbreviations. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_13.txt b/chunked/content_aware_chunking/_add_new_model/chunk_13.txt new file mode 100644 index 0000000000000000000000000000000000000000..b19062f1aecaed26e33a027e2e24581ee41a59bd --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_13.txt @@ -0,0 +1,6 @@ +As an example, activation is preferred to act. 
+ One-letter variable names are strongly discouraged unless it's an index in a for loop. +More generally we prefer longer explicit code to short magical code. +Avoid subclassing nn.Sequential in PyTorch but subclass nn.Module and write the forward pass, so that anyone + using your code can quickly debug it by adding print statements or breakpoints. +Your function signature should be type-annotated. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_14.txt b/chunked/content_aware_chunking/_add_new_model/chunk_14.txt new file mode 100644 index 0000000000000000000000000000000000000000..14d6903085a15f35b31114eead80851013f79a5f --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_14.txt @@ -0,0 +1,8 @@ +For the rest, good variable names are way more readable and + understandable than type annotations. + +Overview of tokenizers +Not quite ready yet :-( This section will be added soon! +Step-by-step recipe to add a model to 🤗 Transformers +Everyone has different preferences of how to port a model so it can be very helpful for you to take a look at summaries +of how other contributors ported models to Hugging Face. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_15.txt b/chunked/content_aware_chunking/_add_new_model/chunk_15.txt new file mode 100644 index 0000000000000000000000000000000000000000..4c5d66221b6e20a88e5dc6c091ffd72b3cc2d553 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_15.txt @@ -0,0 +1,11 @@ +Here is a list of community blog posts on how to port a model: + +Porting GPT2 Model by Thomas +Porting WMT19 MT Model by Stas + +From experience, we can tell you that the most important things to keep in mind when adding a model are: + +Don't reinvent the wheel! Most parts of the code you will add for the new 🤗 Transformers model already exist + somewhere in 🤗 Transformers. 
Take some time to find similar, already existing models and tokenizers you can copy + from. grep and rg are your + friends. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_16.txt b/chunked/content_aware_chunking/_add_new_model/chunk_16.txt new file mode 100644 index 0000000000000000000000000000000000000000..e32123f33ed0d00f066a70540eb0cd9a12e32adc --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_16.txt @@ -0,0 +1,4 @@ +Note that it might very well happen that your model's tokenizer is based on one model implementation, and + your model's modeling code on another one. E.g. FSMT's modeling code is based on BART, while FSMT's tokenizer code + is based on XLM. +It's more of an engineering challenge than a scientific challenge. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_17.txt b/chunked/content_aware_chunking/_add_new_model/chunk_17.txt new file mode 100644 index 0000000000000000000000000000000000000000..749710c60c750f6a6e37a55beceb9a0f79c493c6 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_17.txt @@ -0,0 +1,4 @@ +You should spend more time creating an + efficient debugging environment rather than trying to understand all theoretical aspects of the model in the paper. +Ask for help, when you're stuck! Models are the core component of 🤗 Transformers so we at Hugging Face are more + than happy to help you at every step to add your model. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_18.txt b/chunked/content_aware_chunking/_add_new_model/chunk_18.txt new file mode 100644 index 0000000000000000000000000000000000000000..47d6c024496afbcc342f923eeae2e9d753c120db --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_18.txt @@ -0,0 +1,21 @@ +Don't hesitate to ask if you notice you are not making + progress. 
+ +In the following, we try to give you a general recipe that we found most useful when porting a model to 🤗 Transformers. +The following list is a summary of everything that has to be done to add a model and can be used by you as a To-Do +List: +☐ (Optional) Understood the model's theoretical aspects +☐ Prepared 🤗 Transformers dev environment +☐ Set up debugging environment of the original repository +☐ Created script that successfully runs the forward() pass using the original repository and checkpoint +☐ Successfully added the model skeleton to 🤗 Transformers +☐ Successfully converted original checkpoint to 🤗 Transformers checkpoint +☐ Successfully ran forward() pass in 🤗 Transformers that gives identical output to original checkpoint +☐ Finished model tests in 🤗 Transformers +☐ Successfully added tokenizer in 🤗 Transformers +☐ Run end-to-end integration tests +☐ Finished docs +☐ Uploaded model weights to the Hub +☐ Submitted the pull request +☐ (Optional) Added a demo notebook +To begin with, we usually recommend starting by getting a good theoretical understanding of BrandNewBert. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_19.txt b/chunked/content_aware_chunking/_add_new_model/chunk_19.txt new file mode 100644 index 0000000000000000000000000000000000000000..09ca8f14034b0b98e2491d5d3ec35be463d23061 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_19.txt @@ -0,0 +1,6 @@ +However, +if you prefer to understand the theoretical aspects of the model on-the-job, then it is totally fine to directly dive +into the BrandNewBert's code-base. This option might suit you better if your engineering skills are better than +your theoretical skill, if you have trouble understanding BrandNewBert's paper, or if you just enjoy programming +much more than reading scientific papers. +1. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_2.txt b/chunked/content_aware_chunking/_add_new_model/chunk_2.txt new file mode 100644 index 0000000000000000000000000000000000000000..5c8b13e17bc32866d67f5438b09d22fbf3c8db61 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_2.txt @@ -0,0 +1,5 @@ +🤗 ❤️ +To get started, open a New model addition issue for the model you want to see in 🤗 Transformers. If you're not especially picky about contributing a specific model, you can filter by the New model label to see if there are any unclaimed model requests and work on it. +Once you've opened a new model request, the first step is to get familiar with 🤗 Transformers if you aren't already! +General overview of 🤗 Transformers +First, you should get a general overview of 🤗 Transformers. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_20.txt b/chunked/content_aware_chunking/_add_new_model/chunk_20.txt new file mode 100644 index 0000000000000000000000000000000000000000..eda013fbfbef3685841aad42e2b1b833df6ca5c9 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_20.txt @@ -0,0 +1,5 @@ +(Optional) Theoretical aspects of BrandNewBert +You should take some time to read BrandNewBert's paper, if such descriptive work exists. There might be large +sections of the paper that are difficult to understand. If this is the case, this is fine - don't worry! The goal is +not to get a deep theoretical understanding of the paper, but to extract the necessary information required to +effectively re-implement the model in 🤗 Transformers. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_21.txt b/chunked/content_aware_chunking/_add_new_model/chunk_21.txt new file mode 100644 index 0000000000000000000000000000000000000000..ca96d03c8d91b82b94ec58e779e556b9e59e8656 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_21.txt @@ -0,0 +1,15 @@ +That being said, you don't have to spend too much time on the +theoretical aspects, but rather focus on the practical ones, namely: + +What type of model is brand_new_bert? BERT-like encoder-only model? GPT2-like decoder-only model? BART-like + encoder-decoder model? Look at the model_summary if you're not familiar with the differences between those. +What are the applications of brand_new_bert? Text classification? Text generation? Seq2Seq tasks, e.g., + summarization? +What is the novel feature of the model that makes it different from BERT/GPT-2/BART? +Which of the already existing 🤗 Transformers models is most + similar to brand_new_bert? +What type of tokenizer is used? A sentencepiece tokenizer? Word piece tokenizer? Is it the same tokenizer as used + for BERT or BART? + +After you feel like you have gotten a good overview of the architecture of the model, you might want to write to the +Hugging Face team with any questions you might have. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_22.txt b/chunked/content_aware_chunking/_add_new_model/chunk_22.txt new file mode 100644 index 0000000000000000000000000000000000000000..6e6a4a9b02e46dad7a3444e663f5440f539bb114 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_22.txt @@ -0,0 +1,6 @@ +This might include questions regarding the model's architecture, +its attention layer, etc. We will be more than happy to help you. +2. Next prepare your environment + +Fork the repository by clicking on the ‘Fork' button on the + repository's page. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_23.txt b/chunked/content_aware_chunking/_add_new_model/chunk_23.txt new file mode 100644 index 0000000000000000000000000000000000000000..87e895dd1b96469fec5463cf9e05a6fdb2523666 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_23.txt @@ -0,0 +1,15 @@ +This creates a copy of the code under your GitHub user account. + +Clone your transformers fork to your local disk, and add the base repository as a remote: + +git clone https://github.com/[your Github handle]/transformers.git +cd transformers +git remote add upstream https://github.com/huggingface/transformers.git + +Set up a development environment, for instance by running the following command: + +python -m venv .env +source .env/bin/activate +pip install -e ".[dev]" +Depending on your OS, and since the number of optional dependencies of Transformers is growing, you might get a +failure with this command. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_24.txt b/chunked/content_aware_chunking/_add_new_model/chunk_24.txt new file mode 100644 index 0000000000000000000000000000000000000000..1ef6644f8b5e5a1263176997277be39dabb866d1 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_24.txt @@ -0,0 +1,12 @@ +If that's the case make sure to install the Deep Learning framework you are working with +(PyTorch, TensorFlow and/or Flax) then do: + +pip install -e ".[quality]" +which should be enough for most use cases. You can then return to the parent directory + +cd .. + +We recommend adding the PyTorch version of brand_new_bert to Transformers. To install PyTorch, please follow the + instructions on https://pytorch.org/get-started/locally/. + +Note: You don't need to have CUDA installed. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_25.txt b/chunked/content_aware_chunking/_add_new_model/chunk_25.txt new file mode 100644 index 0000000000000000000000000000000000000000..a52ff69fcd80c8071abf1dcf5a972191b955318b --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_25.txt @@ -0,0 +1,10 @@ +Making the new model work on CPU is sufficient. + +To port brand_new_bert, you will also need access to its original repository: + +git clone https://github.com/org_that_created_brand_new_bert_org/brand_new_bert.git +cd brand_new_bert +pip install -e . +Now you have set up a development environment to port brand_new_bert to 🤗 Transformers. +3.-4. Run a pretrained checkpoint using the original repository +At first, you will work on the original brand_new_bert repository. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_26.txt b/chunked/content_aware_chunking/_add_new_model/chunk_26.txt new file mode 100644 index 0000000000000000000000000000000000000000..46070bdfead06e5050c1302907e3d0c88255b89f --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_26.txt @@ -0,0 +1,5 @@ +Often, the original implementation is very +“researchy”. Meaning that documentation might be lacking and the code can be difficult to understand. But this should +be exactly your motivation to reimplement brand_new_bert. At Hugging Face, one of our main goals is to make people +stand on the shoulders of giants which translates here very well into taking a working model and rewriting it to make +it as accessible, user-friendly, and beautiful as possible. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_27.txt b/chunked/content_aware_chunking/_add_new_model/chunk_27.txt new file mode 100644 index 0000000000000000000000000000000000000000..ec395294c6c0cec1ab09aba0e43895d46232d665 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_27.txt @@ -0,0 +1,5 @@ +This is the number-one motivation to re-implement +models into 🤗 Transformers - trying to make complex new NLP technology accessible to everybody. +You should therefore start by diving into the original repository. +Successfully running the official pretrained model in the original repository is often the most difficult step. +From our experience, it is very important to spend some time getting familiar with the original code-base. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_28.txt b/chunked/content_aware_chunking/_add_new_model/chunk_28.txt new file mode 100644 index 0000000000000000000000000000000000000000..9bf19a22218deedbbe4f0c987d1797093bdb7e20 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_28.txt @@ -0,0 +1,10 @@ +You need to +figure out the following: + +Where to find the pretrained weights? +How to load the pretrained weights into the corresponding model? +How to run the tokenizer independently from the model? +Trace one forward pass so that you know which classes and functions are required for a simple forward pass. Usually, + you only have to reimplement those functions. +Be able to locate the important components of the model: Where is the model's class? Are there model sub-classes, + e.g. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_29.txt b/chunked/content_aware_chunking/_add_new_model/chunk_29.txt new file mode 100644 index 0000000000000000000000000000000000000000..82f5e3eef54846f85e23bca044665cbd981a4666 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_29.txt @@ -0,0 +1,2 @@ +EncoderModel, DecoderModel? Where is the self-attention layer? Are there multiple different attention layers, + e.g. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_3.txt b/chunked/content_aware_chunking/_add_new_model/chunk_3.txt new file mode 100644 index 0000000000000000000000000000000000000000..b02734ad30abc1eb7e20a8bea49846f00cb0cc6a --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_3.txt @@ -0,0 +1,5 @@ +🤗 Transformers is a very opinionated library, so there is a +chance that you don't agree with some of the library's philosophies or design choices. From our experience, however, we +found that the fundamental design choices and philosophies of the library are crucial to efficiently scale 🤗 +Transformers while keeping maintenance costs at a reasonable level. +A good first starting point to better understand the library is to read the documentation of our philosophy. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_30.txt b/chunked/content_aware_chunking/_add_new_model/chunk_30.txt new file mode 100644 index 0000000000000000000000000000000000000000..3b8e29133f92daf942d0f936d429a9b77d6e9c35 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_30.txt @@ -0,0 +1,7 @@ +self-attention, cross-attention? +How can you debug the model in the original environment of the repo? Do you have to add print statements, can you + work with an interactive debugger like ipdb, or should you use an efficient IDE to debug the model, like PyCharm? 
+ +It is very important that before you start the porting process, you can efficiently debug code in the original +repository! Also, remember that you are working with an open-source library, so do not hesitate to open an issue, or +even a pull request in the original repository. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_31.txt b/chunked/content_aware_chunking/_add_new_model/chunk_31.txt new file mode 100644 index 0000000000000000000000000000000000000000..ae524323e269cf7cc93dc896d0df97ffa708e111 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_31.txt @@ -0,0 +1,5 @@ +The maintainers of this repository are most likely very happy about +someone looking into their code! +At this point, it is really up to you which debugging environment and strategy you prefer to use to debug the original +model. We strongly advise against setting up a costly GPU environment, but simply work on a CPU both when starting to +dive into the original repository and also when starting to write the 🤗 Transformers implementation of the model. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_32.txt b/chunked/content_aware_chunking/_add_new_model/chunk_32.txt new file mode 100644 index 0000000000000000000000000000000000000000..115dc00e64b657ab23a1291e0d14a73d2983d0d7 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_32.txt @@ -0,0 +1,10 @@ +Only +at the very end, when the model has already been successfully ported to 🤗 Transformers, one should verify that the +model also works as expected on GPU. +In general, there are two possible debugging environments for running the original model + +Jupyter notebooks / google colab +Local python scripts. 
+ +Jupyter notebooks have the advantage that they allow for cell-by-cell execution which can be helpful to better split +logical components from one another and to have faster debugging cycles as intermediate results can be stored. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_33.txt b/chunked/content_aware_chunking/_add_new_model/chunk_33.txt new file mode 100644 index 0000000000000000000000000000000000000000..6097aaae0bbeff07f2f4e2a495a463307d164456 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_33.txt @@ -0,0 +1,3 @@ +Also, +notebooks are often easier to share with other contributors, which might be very helpful if you want to ask the Hugging +Face team for help. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_34.txt b/chunked/content_aware_chunking/_add_new_model/chunk_34.txt new file mode 100644 index 0000000000000000000000000000000000000000..1f467da2546de6cebaed95a72d606657f8e26c4d --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_34.txt @@ -0,0 +1,6 @@ +If you are familiar with Jupyter notebooks, we strongly recommend you work with them. +The obvious disadvantage of Jupyter notebooks is that if you are not used to working with them you will have to spend +some time adjusting to the new programming environment and you might not be able to use your known debugging tools +anymore, like ipdb. +For each code-base, a good first step is always to load a small pretrained checkpoint and to be able to reproduce a +single forward pass using a dummy integer vector of input IDs as an input. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_35.txt b/chunked/content_aware_chunking/_add_new_model/chunk_35.txt new file mode 100644 index 0000000000000000000000000000000000000000..4b6315e657aa0e7d174658becba1485cd33f45c9 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_35.txt @@ -0,0 +1,14 @@ +Such a script could look like this (in +pseudocode): +python +model = BrandNewBertModel.load_pretrained_checkpoint("/path/to/checkpoint/") +input_ids = [0, 4, 5, 2, 3, 7, 9] # vector of input ids +original_output = model.predict(input_ids) +Next, regarding the debugging strategy, there are generally a few to choose from: + +Decompose the original model into many small testable components and run a forward pass on each of those for + verification +Decompose the original model only into the original tokenizer and the original model, run a forward pass on + those, and use intermediate print statements or breakpoints for verification + +Again, it is up to you which strategy to choose. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_36.txt b/chunked/content_aware_chunking/_add_new_model/chunk_36.txt new file mode 100644 index 0000000000000000000000000000000000000000..9b07372d9643a66fbbbd6d111604af4fc172af96 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_36.txt @@ -0,0 +1,4 @@ +Often, one or the other is advantageous depending on the original code +base. +If the original code-base allows you to decompose the model into smaller sub-components, e.g. if the original +code-base can easily be run in eager mode, it is usually worth the effort to do so.
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_37.txt b/chunked/content_aware_chunking/_add_new_model/chunk_37.txt new file mode 100644 index 0000000000000000000000000000000000000000..66c637099ff5bd0ff74e864cbf8bba5a16e2623f --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_37.txt @@ -0,0 +1,17 @@ +There are some important advantages +to taking the more difficult road in the beginning: + +at a later stage when comparing the original model to the Hugging Face implementation, you can verify automatically + for each component individually that the corresponding component of the 🤗 Transformers implementation matches instead + of relying on visual comparison via print statements +it can give you some rope to decompose the big problem of porting a model into smaller problems of just porting + individual components and thus structure your work better +separating the model into logical meaningful components will help you to get a better overview of the model's design + and thus to better understand the model +at a later stage those component-by-component tests help you to ensure that no regression occurs as you continue + changing your code + +Lysandre's integration checks for ELECTRA +gives a nice example of how this can be done. +However, if the original code-base is very complex or only allows intermediate components to be run in a compiled mode, +it might be too time-consuming or even impossible to separate the model into smaller testable sub-components. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_38.txt b/chunked/content_aware_chunking/_add_new_model/chunk_38.txt new file mode 100644 index 0000000000000000000000000000000000000000..a30cc7dba977d15c47c1bee69d80b9f9cae55507 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_38.txt @@ -0,0 +1,3 @@ +A good +example is T5's MeshTensorFlow library which is +very complex and does not offer a simple way to decompose the model into its sub-components. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_39.txt b/chunked/content_aware_chunking/_add_new_model/chunk_39.txt new file mode 100644 index 0000000000000000000000000000000000000000..8b8841b4c8e6a59df6bce307bd87a62daf90349b --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_39.txt @@ -0,0 +1,15 @@ +For such libraries, one +often relies on verifying print statements. +No matter which strategy you choose, the recommended procedure is often the same: debug the +first layers first and the last layers last. +It is recommended that you retrieve the output, either by print statements or sub-component functions, of the following +layers in the following order: + +Retrieve the input IDs passed to the model +Retrieve the word embeddings +Retrieve the input of the first Transformer layer +Retrieve the output of the first Transformer layer +Retrieve the output of the following n - 1 Transformer layers +Retrieve the output of the whole BrandNewBert Model + +Input IDs should thereby consist of an array of integers, e.g.
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_4.txt b/chunked/content_aware_chunking/_add_new_model/chunk_4.txt new file mode 100644 index 0000000000000000000000000000000000000000..cba1ceb6d79d171f33e4c74e5895e26d94d1c9c0 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_4.txt @@ -0,0 +1,8 @@ +As a result of our way of working, there are some choices that we try to apply to all models: + +Composition is generally favored over abstraction +Duplicating code is not always bad if it strongly improves the readability or accessibility of a model +Model files are as self-contained as possible so that when you read the code of a specific model, you ideally only + have to look into the respective modeling_.py file. + +In our opinion, the library's code is not just a means to provide a product, e.g. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_40.txt b/chunked/content_aware_chunking/_add_new_model/chunk_40.txt new file mode 100644 index 0000000000000000000000000000000000000000..fd1dde7b3c6ca7ec1c4823acf9f75ee0e0fd08eb --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_40.txt @@ -0,0 +1,14 @@ +input_ids = [0, 4, 4, 3, 2, 4, 1, 7, 19] +The outputs of the following layers often consist of multi-dimensional float arrays and can look like this: +[[ + [-0.1465, -0.6501, 0.1993, ..., 0.1451, 0.3430, 0.6024], + [-0.4417, -0.5920, 0.3450, ..., -0.3062, 0.6182, 0.7132], + [-0.5009, -0.7122, 0.4548, ..., -0.3662, 0.6091, 0.7648], + ..., + [-0.5613, -0.6332, 0.4324, ..., -0.3792, 0.7372, 0.9288], + [-0.5416, -0.6345, 0.4180, ..., -0.3564, 0.6992, 0.9191], + [-0.5334, -0.6403, 0.4271, ..., -0.3339, 0.6533, 0.8694]]], +We expect that every model added to 🤗 Transformers passes a couple of integration tests, meaning that the original +model and the reimplemented version in 🤗 Transformers have to give the exact same output up to a precision of 0.001!
+Since it is normal that the exact same model written in different libraries can give a slightly different output +depending on the library framework, we accept an error tolerance of 1e-3 (0.001). \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_41.txt b/chunked/content_aware_chunking/_add_new_model/chunk_41.txt new file mode 100644 index 0000000000000000000000000000000000000000..a31fec8286e4a6778773f62172d0190590010f94 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_41.txt @@ -0,0 +1,5 @@ +It is not enough if the model gives +nearly the same output, they have to be almost identical. Therefore, you will certainly compare the intermediate +outputs of the 🤗 Transformers version multiple times against the intermediate outputs of the original implementation of +brand_new_bert in which case an efficient debugging environment of the original repository is absolutely +important. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_42.txt b/chunked/content_aware_chunking/_add_new_model/chunk_42.txt new file mode 100644 index 0000000000000000000000000000000000000000..b93de8c9f3eefc3b170cea479d596a8eb3a7e774 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_42.txt @@ -0,0 +1,7 @@ +Here is some advice to make your debugging environment as efficient as possible. + +Find the best way of debugging intermediate results. Is the original repository written in PyTorch? Then you should + probably take the time to write a longer script that decomposes the original model into smaller sub-components to + retrieve intermediate values. Is the original repository written in Tensorflow 1? Then you might have to rely on + TensorFlow print operations like tf.print to output + intermediate values. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_43.txt b/chunked/content_aware_chunking/_add_new_model/chunk_43.txt new file mode 100644 index 0000000000000000000000000000000000000000..c203248598b9cce5578c6a342f722172a6b2aaf4 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_43.txt @@ -0,0 +1,4 @@ +Is the original repository written in Jax? Then make sure that the model is not jitted when + running the forward pass, e.g. check-out this link. +Use the smallest pretrained checkpoint you can find. The smaller the checkpoint, the faster your debug cycle + becomes. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_44.txt b/chunked/content_aware_chunking/_add_new_model/chunk_44.txt new file mode 100644 index 0000000000000000000000000000000000000000..20a44f1d7d558690c10b943a94ea58718233731b --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_44.txt @@ -0,0 +1,5 @@ +It is not efficient if your pretrained model is so big that your forward pass takes more than 10 seconds. + In case only very large checkpoints are available, it might make more sense to create a dummy model in the new + environment with randomly initialized weights and save those weights for comparison with the 🤗 Transformers version + of your model +Make sure you are using the easiest way of calling a forward pass in the original repository. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_45.txt b/chunked/content_aware_chunking/_add_new_model/chunk_45.txt new file mode 100644 index 0000000000000000000000000000000000000000..3d168aec28a973c268b72fdec617027f2fcdc476 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_45.txt @@ -0,0 +1,5 @@ +Ideally, you want to + find the function in the original repository that only calls a single forward pass, i.e. that is often called + predict, evaluate, forward or __call__. 
You don't want to debug a function that calls forward + multiple times, e.g. to generate text, like autoregressive_sample, generate. +Try to separate the tokenization from the model's forward pass. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_46.txt b/chunked/content_aware_chunking/_add_new_model/chunk_46.txt new file mode 100644 index 0000000000000000000000000000000000000000..b6a6e4bf29cc79f604a8cd180600c8d581690968 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_46.txt @@ -0,0 +1,3 @@ +If the original repository shows examples where + you have to input a string, then try to find out where in the forward call the string input is changed to input ids + and start from this point. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_47.txt b/chunked/content_aware_chunking/_add_new_model/chunk_47.txt new file mode 100644 index 0000000000000000000000000000000000000000..8e11c80a4e18a6bac362c4ff09406b410351f1ed --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_47.txt @@ -0,0 +1,5 @@ +This might mean that you have to possibly write a small script yourself or change the + original code so that you can directly input the ids instead of an input string. +Make sure that the model in your debugging setup is not in training mode, which often causes the model to yield + random outputs due to multiple dropout layers in the model. Make sure that the forward pass in your debugging + environment is deterministic so that the dropout layers are not used. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_48.txt b/chunked/content_aware_chunking/_add_new_model/chunk_48.txt new file mode 100644 index 0000000000000000000000000000000000000000..24b17f01764c8c13b3fcb249118ac8d13e7bfb00 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_48.txt @@ -0,0 +1,6 @@ +Or use transformers.utils.set_seed + if the old and new implementations are in the same framework. + +The following section gives you more specific details/tips on how you can do this for brand_new_bert. +5.-14. Port BrandNewBert to 🤗 Transformers +Next, you can finally start adding new code to 🤗 Transformers. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_49.txt b/chunked/content_aware_chunking/_add_new_model/chunk_49.txt new file mode 100644 index 0000000000000000000000000000000000000000..35b6b23a5a0c87bcab6fcac6f8e05dfca1bcd48d --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_49.txt @@ -0,0 +1,7 @@ +Go into the clone of your 🤗 Transformers' fork: + +cd transformers +In the special case that you are adding a model whose architecture exactly matches the model architecture of an +existing model you only have to add a conversion script as described in this section. +In this case, you can just re-use the whole model architecture of the already existing model. +Otherwise, let's start generating a new model. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_5.txt b/chunked/content_aware_chunking/_add_new_model/chunk_5.txt new file mode 100644 index 0000000000000000000000000000000000000000..52acc3a8e4eedd36fbdde911ff24a4c561408538 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_5.txt @@ -0,0 +1,7 @@ +the ability to use BERT for +inference, but also as the very product that we want to improve. 
Hence, when adding a model, the user is not only the +person who will use your model, but also everybody who will read, try to understand, and possibly tweak your code. +With this in mind, let's go a bit deeper into the general library design. +Overview of models +To successfully add a model, it is important to understand the interaction between your model and its config, +[PreTrainedModel], and [PretrainedConfig]. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_50.txt b/chunked/content_aware_chunking/_add_new_model/chunk_50.txt new file mode 100644 index 0000000000000000000000000000000000000000..f5e2b852e764ff3aef280032556dbe281d0ca8c8 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_50.txt @@ -0,0 +1,6 @@ +You have two choices here: + +transformers-cli add-new-model-like to add a new model like an existing one +transformers-cli add-new-model to add a new model from our template (will look like BERT or BART depending on the type of model you select) + +In both cases, you will be prompted with a questionnaire to fill in the basic information of your model. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_51.txt b/chunked/content_aware_chunking/_add_new_model/chunk_51.txt new file mode 100644 index 0000000000000000000000000000000000000000..1360869aeec7dd0f0f4d9a4c577f8ca19d6fcb77 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_51.txt @@ -0,0 +1,4 @@ +The second command requires installing cookiecutter; you can find more information on it here. +Open a Pull Request on the main huggingface/transformers repo +Before starting to adapt the automatically generated code, now is the time to open a “Work in progress (WIP)” pull +request, e.g.
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_52.txt b/chunked/content_aware_chunking/_add_new_model/chunk_52.txt new file mode 100644 index 0000000000000000000000000000000000000000..c24afbd31da523fad11fbc9257ae25c44acd1d15 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_52.txt @@ -0,0 +1,23 @@ +“[WIP] Add brand_new_bert”, in 🤗 Transformers so that you and the Hugging Face team can work +side-by-side on integrating the model into 🤗 Transformers. +You should do the following: + +Create a branch with a descriptive name from your main branch + +git checkout -b add_brand_new_bert + +Commit the automatically generated code: + +git add . +git commit + +Fetch and rebase to current main + +git fetch upstream +git rebase upstream/main + +Push the changes to your account using: + +git push -u origin add_brand_new_bert + +Once you are satisfied, go to the webpage of your fork on GitHub. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_53.txt b/chunked/content_aware_chunking/_add_new_model/chunk_53.txt new file mode 100644 index 0000000000000000000000000000000000000000..22d3d761d81c612748f323e7bb308c197e102693 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_53.txt @@ -0,0 +1,8 @@ +Click on “Pull request”. Make sure to add the + GitHub handle of some members of the Hugging Face team as reviewers, so that the Hugging Face team gets notified for + future changes. + +Change the PR into a draft by clicking on “Convert to draft” on the right of the GitHub pull request web page. + +In the following, whenever you have made some progress, don't forget to commit your work and push it to your account so +that it shows in the pull request.
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_54.txt b/chunked/content_aware_chunking/_add_new_model/chunk_54.txt new file mode 100644 index 0000000000000000000000000000000000000000..2ffbfac4070efd02a5a653b1480abeea664d4fb6 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_54.txt @@ -0,0 +1,8 @@ +Additionally, you should make sure to update your work with the current main from +time to time by doing: + +git fetch upstream +git merge upstream/main +In general, all questions you might have regarding the model or your implementation should be asked in your PR and +discussed/solved in the PR. This way, the Hugging Face team will always be notified when you are committing new code or +if you have a question. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_55.txt b/chunked/content_aware_chunking/_add_new_model/chunk_55.txt new file mode 100644 index 0000000000000000000000000000000000000000..808cb9fe11e6678f1492e85e8d149be7e04d1fb0 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_55.txt @@ -0,0 +1,4 @@ +It is often very helpful to point the Hugging Face team to your added code so that the Hugging +Face team can efficiently understand your problem or question. +To do so, you can go to the “Files changed” tab where you see all of your changes, go to a line regarding which you +want to ask a question, and click on the “+” symbol to add a comment. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_56.txt b/chunked/content_aware_chunking/_add_new_model/chunk_56.txt new file mode 100644 index 0000000000000000000000000000000000000000..ae8954d0b9721d1ae244b21a21b1f5fb1180dcac --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_56.txt @@ -0,0 +1,6 @@ +Whenever a question or problem has been solved, +you can click on the “Resolve” button of the created comment. 
+In the same way, the Hugging Face team will open comments when reviewing your code. We recommend asking most questions +on GitHub on your PR. For some very general questions that are not very useful for the public, feel free to ping the +Hugging Face team by Slack or email. +5. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_57.txt b/chunked/content_aware_chunking/_add_new_model/chunk_57.txt new file mode 100644 index 0000000000000000000000000000000000000000..744ec6db95fce160b0e4ad50e1d4ab1fa18debce --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_57.txt @@ -0,0 +1,5 @@ +Adapt the generated model code for brand_new_bert +At first, we will focus only on the model itself and not care about the tokenizer. All the relevant code should be +found in the generated files src/transformers/models/brand_new_bert/modeling_brand_new_bert.py and +src/transformers/models/brand_new_bert/configuration_brand_new_bert.py. +Now you can finally start coding :). \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_58.txt b/chunked/content_aware_chunking/_add_new_model/chunk_58.txt new file mode 100644 index 0000000000000000000000000000000000000000..3250783d05ae643d5ebeac5e4856890dc38e9f67 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_58.txt @@ -0,0 +1,5 @@ +The generated code in +src/transformers/models/brand_new_bert/modeling_brand_new_bert.py will either have the same architecture as BERT if +it's an encoder-only model or BART if it's an encoder-decoder model. At this point, you should remind yourself what +you've learned in the beginning about the theoretical aspects of the model: How is the model different from BERT or +BART?
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_59.txt b/chunked/content_aware_chunking/_add_new_model/chunk_59.txt new file mode 100644 index 0000000000000000000000000000000000000000..fa26d2f32abc71cc0dd989bd7abbdb0ef9800b75 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_59.txt @@ -0,0 +1,4 @@ +Implement those changes which often means changing the self-attention layer, the order of the normalization +layer, etc… Again, it is often useful to look at the similar architecture of already existing models in Transformers to +get a better feeling of how your model should be implemented. +Note that at this point, you don't have to be very sure that your code is fully correct or clean. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_6.txt b/chunked/content_aware_chunking/_add_new_model/chunk_6.txt new file mode 100644 index 0000000000000000000000000000000000000000..782cd51dde871119f71ba8edb083589cb58cc03c --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_6.txt @@ -0,0 +1,8 @@ +For exemplary purposes, we will +call the model to be added to 🤗 Transformers BrandNewBert. +Let's take a look: + +As you can see, we do make use of inheritance in 🤗 Transformers, but we keep the level of abstraction to an absolute +minimum. There are never more than two levels of abstraction for any model in the library. BrandNewBertModel +inherits from BrandNewBertPreTrainedModel which in turn inherits from [PreTrainedModel] and +that's it. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_60.txt b/chunked/content_aware_chunking/_add_new_model/chunk_60.txt new file mode 100644 index 0000000000000000000000000000000000000000..fa5448c4bd7ffe059352557dd8a87257b541a02f --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_60.txt @@ -0,0 +1,5 @@ +Rather, it is +advised to add a first unclean, copy-pasted version of the original code to +src/transformers/models/brand_new_bert/modeling_brand_new_bert.py until you feel like all the necessary code is +added. From our experience, it is much more efficient to quickly add a first version of the required code and +improve/correct the code iteratively with the conversion script as described in the next section. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_61.txt b/chunked/content_aware_chunking/_add_new_model/chunk_61.txt new file mode 100644 index 0000000000000000000000000000000000000000..a9a28f708d30cd6d3eefb688b38c21a1207018a1 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_61.txt @@ -0,0 +1,2 @@ +The only thing that +has to work at this point is that you can instantiate the 🤗 Transformers implementation of brand_new_bert, i.e. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_62.txt b/chunked/content_aware_chunking/_add_new_model/chunk_62.txt new file mode 100644 index 0000000000000000000000000000000000000000..3bdd2c9c99d720543bed9575b8cfe5015c8f5472 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_62.txt @@ -0,0 +1,10 @@ +the +following command should work: +python +from transformers import BrandNewBertModel, BrandNewBertConfig +model = BrandNewBertModel(BrandNewBertConfig()) + +The above command will create a model according to the default parameters as defined in BrandNewBertConfig() with +random weights, thus making sure that the __init__() methods of all components work. +Note that all random initialization should happen in the _init_weights method of your BrandNewBertPreTrainedModel +class. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_63.txt b/chunked/content_aware_chunking/_add_new_model/chunk_63.txt new file mode 100644 index 0000000000000000000000000000000000000000..a70cb3780ea5cd1dd32e6382d01eac3982aeccb5 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_63.txt @@ -0,0 +1 @@ +It should initialize all leaf modules depending on the variables of the config.
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_64.txt b/chunked/content_aware_chunking/_add_new_model/chunk_64.txt new file mode 100644 index 0000000000000000000000000000000000000000..105386cfe99da3b416269e4c41c34e53d1d45d70 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_64.txt @@ -0,0 +1,17 @@ +Here is an example with the +BERT _init_weights method: +py +def _init_weights(self, module): + """Initialize the weights""" + if isinstance(module, nn.Linear): + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.bias is not None: + module.bias.data.zero_() + elif isinstance(module, nn.Embedding): + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.padding_idx is not None: + module.weight.data[module.padding_idx].zero_() + elif isinstance(module, nn.LayerNorm): + module.bias.data.zero_() + module.weight.data.fill_(1.0) +You can have some more custom schemes if you need a special initialization for some modules. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_65.txt b/chunked/content_aware_chunking/_add_new_model/chunk_65.txt new file mode 100644 index 0000000000000000000000000000000000000000..e7dc83ef1106badd6452f0ea0eea91b8e1b4d05d --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_65.txt @@ -0,0 +1,3 @@ +For instance, in +Wav2Vec2ForPreTraining, the last two linear layers need to have the initialization of the regular PyTorch nn.Linear +but all the other ones should use an initialization as above. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_66.txt b/chunked/content_aware_chunking/_add_new_model/chunk_66.txt new file mode 100644 index 0000000000000000000000000000000000000000..90d95220ca5265b1cdab721d81f83952a39fcc72 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_66.txt @@ -0,0 +1,14 @@ +This is coded like this: +py +def _init_weights(self, module): + """Initialize the weights""" + if isinstance(module, Wav2Vec2ForPreTraining): + module.project_hid.reset_parameters() + module.project_q.reset_parameters() + module.project_hid._is_hf_initialized = True + module.project_q._is_hf_initialized = True + elif isinstance(module, nn.Linear): + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.bias is not None: + module.bias.data.zero_() +The _is_hf_initialized flag is internally used to make sure we only initialize a submodule once. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_67.txt b/chunked/content_aware_chunking/_add_new_model/chunk_67.txt new file mode 100644 index 0000000000000000000000000000000000000000..f0f3cd0dee45d254e02c47d9a93760453c2956c5 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_67.txt @@ -0,0 +1,7 @@ +By setting it to +True for module.project_q and module.project_hid, we make sure the custom initialization we did is not overridden later on, +the _init_weights function won't be applied to them. +6. Write a conversion script +Next, you should write a conversion script that lets you convert the checkpoint you used to debug brand_new_bert in +the original repository to a checkpoint compatible with your just created 🤗 Transformers implementation of +brand_new_bert. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_68.txt b/chunked/content_aware_chunking/_add_new_model/chunk_68.txt new file mode 100644 index 0000000000000000000000000000000000000000..14d8058d55b673d968e19a56d759cb9ecb83a0b8 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_68.txt @@ -0,0 +1,4 @@ +It is not advised to write the conversion script from scratch, but rather to look through already +existing conversion scripts in 🤗 Transformers for one that has been used to convert a similar model that was written in +the same framework as brand_new_bert. Usually, it is enough to copy an already existing conversion script and +slightly adapt it for your use case. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_69.txt b/chunked/content_aware_chunking/_add_new_model/chunk_69.txt new file mode 100644 index 0000000000000000000000000000000000000000..1ce7489aaac7531aeda7c98a6088c45c209fc696 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_69.txt @@ -0,0 +1,7 @@ +Don't hesitate to ask the Hugging Face team to point you to a similar already +existing conversion script for your model. + +If you are porting a model from TensorFlow to PyTorch, a good starting point might be BERT's conversion script here +If you are porting a model from PyTorch to PyTorch, a good starting point might be BART's conversion script here + +In the following, we'll quickly explain how PyTorch models store layer weights and define layer names. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_7.txt b/chunked/content_aware_chunking/_add_new_model/chunk_7.txt new file mode 100644 index 0000000000000000000000000000000000000000..9c2ba1b6256a0be6087428447a732073f90a073f --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_7.txt @@ -0,0 +1,6 @@ +As a general rule, we want to make sure that a new model only depends on +[PreTrainedModel]. The important functionalities that are automatically provided to every new +model are [~PreTrainedModel.from_pretrained] and +[~PreTrainedModel.save_pretrained], which are used for serialization and deserialization. All of the +other important functionalities, such as BrandNewBertModel.forward should be completely defined in the new +modeling_brand_new_bert.py script. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_70.txt b/chunked/content_aware_chunking/_add_new_model/chunk_70.txt new file mode 100644 index 0000000000000000000000000000000000000000..a6e680681af74cecac09590c9a37a88d4191ce4b --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_70.txt @@ -0,0 +1,2 @@ +In PyTorch, the +name of a layer is defined by the name of the class attribute you give the layer. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_71.txt b/chunked/content_aware_chunking/_add_new_model/chunk_71.txt new file mode 100644 index 0000000000000000000000000000000000000000..32f30eba53fad9c9f72e307f6c218f65f1778457 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_71.txt @@ -0,0 +1,13 @@ +Let's define a dummy model in +PyTorch, called SimpleModel as follows: +python +from torch import nn +class SimpleModel(nn.Module): + def __init__(self): + super().__init__() + self.dense = nn.Linear(10, 10) + self.intermediate = nn.Linear(10, 10) + self.layer_norm = nn.LayerNorm(10) + +Now we can create an instance of this model definition which will fill all weights: dense, intermediate, +layer_norm with random weights. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_72.txt b/chunked/content_aware_chunking/_add_new_model/chunk_72.txt new file mode 100644 index 0000000000000000000000000000000000000000..00e8e5ad7d2169d975fc0b7edcb6fb1847827101 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_72.txt @@ -0,0 +1,12 @@ +We can print the model to see its architecture: +python +model = SimpleModel() +print(model) + +This will print out the following: +SimpleModel( + (dense): Linear(in_features=10, out_features=10, bias=True) + (intermediate): Linear(in_features=10, out_features=10, bias=True) + (layer_norm): LayerNorm((10,), eps=1e-05, elementwise_affine=True) +) +We can see that the layer names are defined by the name of the class attribute in PyTorch. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_73.txt b/chunked/content_aware_chunking/_add_new_model/chunk_73.txt new file mode 100644 index 0000000000000000000000000000000000000000..bc8bd4d45bc94e87e3bc2f3bfd56ae6f4f102b36 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_73.txt @@ -0,0 +1,27 @@ +You can print out the weight +values of a specific layer: +python +print(model.dense.weight.data) +to see that the weights were randomly initialized +tensor([[-0.0818, 0.2207, -0.0749, -0.0030, 0.0045, -0.1569, -0.1598, 0.0212, + -0.2077, 0.2157], + [ 0.1044, 0.0201, 0.0990, 0.2482, 0.3116, 0.2509, 0.2866, -0.2190, + 0.2166, -0.0212], + [-0.2000, 0.1107, -0.1999, -0.3119, 0.1559, 0.0993, 0.1776, -0.1950, + -0.1023, -0.0447], + [-0.0888, -0.1092, 0.2281, 0.0336, 0.1817, -0.0115, 0.2096, 0.1415, + -0.1876, -0.2467], + [ 0.2208, -0.2352, -0.1426, -0.2636, -0.2889, -0.2061, -0.2849, -0.0465, + 0.2577, 0.0402], + [ 0.1502, 0.2465, 0.2566, 0.0693, 0.2352, -0.0530, 0.1859, -0.0604, + 0.2132, 0.1680], + [ 0.1733, -0.2407, -0.1721, 0.1484, 0.0358, -0.0633, -0.0721, -0.0090, + 0.2707, -0.2509], + [-0.1173, 0.1561, 0.2945, 0.0595, -0.1996, 0.2988, -0.0802, 0.0407, + 0.1829, -0.1568], + [-0.1164, -0.2228, -0.0403, 0.0428, 0.1339, 0.0047, 0.1967, 0.2923, + 0.0333, -0.0536], + [-0.1492, -0.1616, 0.1057, 0.1950, -0.2807, -0.2710, -0.1586, 0.0739, + 0.2220, 0.2358]]). +In the conversion script, you should fill those randomly initialized weights with the exact weights of the +corresponding layer in the checkpoint. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_74.txt b/chunked/content_aware_chunking/_add_new_model/chunk_74.txt new file mode 100644 index 0000000000000000000000000000000000000000..67ea49a73c7e77d73cd96bef7a8cec55f91f0862 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_74.txt @@ -0,0 +1,11 @@ +E.g. 
+python +# retrieve matching layer weights, e.g. by +# recursive algorithm +layer_name = "dense" +pretrained_weight = array_of_dense_layer +model_pointer = getattr(model, "dense") +model_pointer.weight.data = torch.from_numpy(pretrained_weight) + +While doing so, you must verify that each randomly initialized weight of your PyTorch model and its corresponding +pretrained checkpoint weight exactly match in both shape and name. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_75.txt b/chunked/content_aware_chunking/_add_new_model/chunk_75.txt new file mode 100644 index 0000000000000000000000000000000000000000..dc52d4547336a80019fb31ece6cd9c53aeea8007 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_75.txt @@ -0,0 +1,2 @@ +To do so, it is necessary to add assert +statements for the shape and print out the names of the checkpoint weights. E.g. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_76.txt b/chunked/content_aware_chunking/_add_new_model/chunk_76.txt new file mode 100644 index 0000000000000000000000000000000000000000..430f7b7b09ae2c07faf5bcb583ad4e4936485159 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_76.txt @@ -0,0 +1,12 @@ +you should add statements like: +python +assert ( + model_pointer.weight.shape == pretrained_weight.shape +), f"Pointer shape of random weight {model_pointer.weight.shape} and array shape of checkpoint weight {pretrained_weight.shape} mismatched" +Besides, you should also print out the names of both weights to make sure they match, e.g. +python +logger.info(f"Initialize PyTorch weight {layer_name} from {pretrained_weight.name}") +If either the shape or the name doesn't match, you probably assigned the wrong checkpoint weight to a randomly +initialized layer of the 🤗 Transformers implementation. 
+An incorrect shape is most likely due to an incorrect setting of the config parameters in BrandNewBertConfig() that +do not exactly match those that were used for the checkpoint you want to convert. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_77.txt b/chunked/content_aware_chunking/_add_new_model/chunk_77.txt new file mode 100644 index 0000000000000000000000000000000000000000..c152a35ec41175af7778437cd70491d2ec5f59b3 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_77.txt @@ -0,0 +1,5 @@ +However, it could also be that +PyTorch's implementation of a layer requires the weight to be transposed beforehand. +Finally, you should also check that all required weights are initialized and print out all checkpoint weights that +were not used for initialization to make sure the model is correctly converted. It is completely normal, that the +conversion trials fail with either a wrong shape statement or a wrong name assignment. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_78.txt b/chunked/content_aware_chunking/_add_new_model/chunk_78.txt new file mode 100644 index 0000000000000000000000000000000000000000..b0ef56a06b06ed3687d152566c4fdc3b24790113 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_78.txt @@ -0,0 +1,6 @@ +This is most likely because either +you used incorrect parameters in BrandNewBertConfig(), have a wrong architecture in the 🤗 Transformers +implementation, you have a bug in the init() functions of one of the components of the 🤗 Transformers +implementation or you need to transpose one of the checkpoint weights. +This step should be iterated with the previous step until all weights of the checkpoint are correctly loaded in the +Transformers model. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_79.txt b/chunked/content_aware_chunking/_add_new_model/chunk_79.txt new file mode 100644 index 0000000000000000000000000000000000000000..1c5084595221d56a964d52bee3671e5d78692ad1 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_79.txt @@ -0,0 +1,6 @@ +Having correctly loaded the checkpoint into the 🤗 Transformers implementation, you can then save +the model under a folder of your choice /path/to/converted/checkpoint/folder that should then contain both a +pytorch_model.bin file and a config.json file: +python +model.save_pretrained("/path/to/converted/checkpoint/folder") +7. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_8.txt b/chunked/content_aware_chunking/_add_new_model/chunk_8.txt new file mode 100644 index 0000000000000000000000000000000000000000..9ba2db0ff28b953ed0d31bd66b7571a0dc7ec691 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_8.txt @@ -0,0 +1,4 @@ +Next, we want to make sure that a model with a specific head layer, such as +BrandNewBertForMaskedLM does not inherit from BrandNewBertModel, but rather uses BrandNewBertModel +as a component that can be called in its forward pass to keep the level of abstraction low. Every new model requires a +configuration class, called BrandNewBertConfig. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_80.txt b/chunked/content_aware_chunking/_add_new_model/chunk_80.txt new file mode 100644 index 0000000000000000000000000000000000000000..672c7397ecc54b1b7e03f15203a5c0b739659c4e --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_80.txt @@ -0,0 +1,5 @@ +Implement the forward pass +Having managed to correctly load the pretrained weights into the 🤗 Transformers implementation, you should now make +sure that the forward pass is correctly implemented. 
In Get familiar with the original repository, you have already created a script that runs a forward +pass of the model using the original repository. Now you should write an analogous script using the 🤗 Transformers +implementation instead of the original one. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_81.txt b/chunked/content_aware_chunking/_add_new_model/chunk_81.txt new file mode 100644 index 0000000000000000000000000000000000000000..8288f768b988da514fe0d481436ee07754e7e22c --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_81.txt @@ -0,0 +1,8 @@ +It should look as follows: +python +model = BrandNewBertModel.from_pretrained("/path/to/converted/checkpoint/folder") +input_ids = [0, 4, 4, 3, 2, 4, 1, 7, 19] +output = model(input_ids).last_hidden_state +It is very likely that the 🤗 Transformers implementation and the original model implementation don't give the exact +same output the very first time or that the forward pass throws an error. Don't be disappointed - it's expected! First, +you should make sure that the forward pass doesn't throw any errors. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_82.txt b/chunked/content_aware_chunking/_add_new_model/chunk_82.txt new file mode 100644 index 0000000000000000000000000000000000000000..bcd8aa2373ba3b0572fb93ed8e38810f23a58cdf --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_82.txt @@ -0,0 +1,6 @@ +It often happens that the wrong dimensions are +used, leading to a Dimensionality mismatch error, or that the wrong data type object is used, e.g. torch.long +instead of torch.float32. Don't hesitate to ask the Hugging Face team for help if you don't manage to solve +certain errors. +The final part to make sure the 🤗 Transformers implementation works correctly is to ensure that the outputs are +equivalent to a precision of 1e-3. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_83.txt b/chunked/content_aware_chunking/_add_new_model/chunk_83.txt new file mode 100644 index 0000000000000000000000000000000000000000..d55587d2a9b866065416225563b2b52b68e61a47 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_83.txt @@ -0,0 +1,6 @@ +First, you should ensure that the output shapes are identical, i.e. +outputs.shape should yield the same value for the script of the 🤗 Transformers implementation and the original +implementation. Next, you should make sure that the output values are identical as well. This is one of the most difficult +parts of adding a new model. Common reasons why the outputs are not identical are: + +Some layers were not added, i.e. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_84.txt b/chunked/content_aware_chunking/_add_new_model/chunk_84.txt new file mode 100644 index 0000000000000000000000000000000000000000..582ea1ce5f3989b3f58dc2228cd767367452c5a1 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_84.txt @@ -0,0 +1,5 @@ +an activation layer was not added, or the residual connection was forgotten +The word embedding matrix was not tied +The wrong positional embeddings are used because the original implementation uses an offset +Dropout is applied during the forward pass. To fix this, make sure model.training is False and that no dropout + layer is falsely activated during the forward pass, i.e. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_85.txt b/chunked/content_aware_chunking/_add_new_model/chunk_85.txt new file mode 100644 index 0000000000000000000000000000000000000000..4eab99464f15e47c4d9ee341d827f4d040ddabcb --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_85.txt @@ -0,0 +1,6 @@ +pass self.training to PyTorch's functional dropout + +The best way to fix the problem is usually to look at the forward pass of the original implementation and the 🤗 +Transformers implementation side-by-side and check if there are any differences. Ideally, you should debug/print out +intermediate outputs of both implementations of the forward pass to find the exact position in the network where the 🤗 +Transformers implementation shows a different output than the original implementation. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_86.txt b/chunked/content_aware_chunking/_add_new_model/chunk_86.txt new file mode 100644 index 0000000000000000000000000000000000000000..8349c6d998d5cb10fcbb34e1115301b74ba35fe7 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_86.txt @@ -0,0 +1,5 @@ +First, make sure that the +hard-coded input_ids in both scripts are identical. Next, verify that the outputs of the first transformation of +the input_ids (usually the word embeddings) are identical. And then work your way up to the very last layer of the +network. At some point, you will notice a difference between the two implementations, which should point you to the bug +in the 🤗 Transformers implementation. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_87.txt b/chunked/content_aware_chunking/_add_new_model/chunk_87.txt new file mode 100644 index 0000000000000000000000000000000000000000..a07257eb87f40400f323e13eb77adae2f97ecfda --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_87.txt @@ -0,0 +1,7 @@ +From our experience, a simple and efficient way is to add many print statements +in both the original implementation and 🤗 Transformers implementation, at the same positions in the network +respectively, and to successively remove print statements showing the same values for intermediate representations. +When you're confident that both implementations yield the same output and have verified the outputs with +torch.allclose(original_output, output, atol=1e-3), you're done with the most difficult part! Congratulations - the +work left to be done should be a cakewalk 😊. +8. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_88.txt b/chunked/content_aware_chunking/_add_new_model/chunk_88.txt new file mode 100644 index 0000000000000000000000000000000000000000..eff78d4ce1982dd357d47f659e2b70755c6477ec --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_88.txt @@ -0,0 +1,5 @@ +Adding all necessary model tests +At this point, you have successfully added a new model. However, it is very much possible that the model does not yet +fully comply with the required design. To make sure the implementation is fully compatible with 🤗 Transformers, all +common tests should pass. The Cookiecutter should have automatically added a test file for your model, probably under +tests/models/brand_new_bert/test_modeling_brand_new_bert.py. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_89.txt b/chunked/content_aware_chunking/_add_new_model/chunk_89.txt new file mode 100644 index 0000000000000000000000000000000000000000..4ba59dfb5a4dfec5417726d09f2f721d461ff740 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_89.txt @@ -0,0 +1,10 @@ +Run this test file to verify that all common +tests pass: + +pytest tests/models/brand_new_bert/test_modeling_brand_new_bert.py +Having fixed all common tests, it is now crucial to ensure that all the nice work you have done is well tested, so that + +a) The community can easily understand your work by looking at specific tests of brand_new_bert +b) Future changes to your model will not break any important feature of the model. + +At first, integration tests should be added. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_9.txt b/chunked/content_aware_chunking/_add_new_model/chunk_9.txt new file mode 100644 index 0000000000000000000000000000000000000000..782e453a5f39834e74b1d489dc993c98c259ea63 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_9.txt @@ -0,0 +1,8 @@ +This configuration is always stored as an attribute in +[PreTrainedModel], and thus can be accessed via the config attribute for all classes +inheriting from BrandNewBertPreTrainedModel: +python +model = BrandNewBertModel.from_pretrained("brandy/brand_new_bert") +model.config # model has access to its config +Similar to the model, the configuration inherits basic serialization and deserialization functionalities from +[PretrainedConfig]. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_90.txt b/chunked/content_aware_chunking/_add_new_model/chunk_90.txt new file mode 100644 index 0000000000000000000000000000000000000000..561c9e4478ec108b1adb70ba6908a8e4715d426b --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_90.txt @@ -0,0 +1,3 @@ +Those integration tests essentially do the same as the debugging scripts +you used earlier to implement the model in 🤗 Transformers. A template of those model tests has already been added by the +Cookiecutter, called BrandNewBertModelIntegrationTests, and only has to be filled out by you. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_91.txt b/chunked/content_aware_chunking/_add_new_model/chunk_91.txt new file mode 100644 index 0000000000000000000000000000000000000000..e654211eb62a1a8ebab5bf44f923b22e6f2d0091 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_91.txt @@ -0,0 +1,9 @@ +To ensure that those +tests are passing, run + +RUN_SLOW=1 pytest -sv tests/models/brand_new_bert/test_modeling_brand_new_bert.py::BrandNewBertModelIntegrationTests + +In case you are using Windows, you should replace RUN_SLOW=1 with SET RUN_SLOW=1 + +Second, all features that are special to brand_new_bert should be tested additionally in a separate test under +BrandNewBertModelTester/BrandNewBertModelTest. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_92.txt b/chunked/content_aware_chunking/_add_new_model/chunk_92.txt new file mode 100644 index 0000000000000000000000000000000000000000..af780c600c03f433b9b6d59ab430b4044c52f5e9 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_92.txt @@ -0,0 +1,9 @@ +This part is often forgotten but is extremely useful in two +ways: + +It helps to transfer the knowledge you have acquired during the model addition to the community by showing how the + special features of brand_new_bert should work. +Future contributors can quickly test changes to the model by running those special tests. + +9. Implement the tokenizer +Next, we should add the tokenizer of brand_new_bert. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_93.txt b/chunked/content_aware_chunking/_add_new_model/chunk_93.txt new file mode 100644 index 0000000000000000000000000000000000000000..47a8a3599b142261c7fca3364f78def256497555 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_93.txt @@ -0,0 +1,6 @@ +Usually, the tokenizer is equivalent to or very similar to an +already existing tokenizer of 🤗 Transformers. +It is very important to find/extract the original tokenizer file and to manage to load this file into the 🤗 +Transformers' implementation of the tokenizer. +To ensure that the tokenizer works correctly, it is recommended to first create a script in the original repository +that inputs a string and returns the input_ids. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_94.txt b/chunked/content_aware_chunking/_add_new_model/chunk_94.txt new file mode 100644 index 0000000000000000000000000000000000000000..598369a3b5a483ea8f48b1be002b6571aee76cbd --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_94.txt @@ -0,0 +1,7 @@ +It could look similar to this (in pseudo-code): +python +input_str = "This is a long example input string containing special characters .$?-, numbers 2872 234 12 and words." +model = BrandNewBertModel.load_pretrained_checkpoint("/path/to/checkpoint/") +input_ids = model.tokenize(input_str) +You might have to take a deeper look again into the original repository to find the correct tokenizer function or you +might even have to do changes to your clone of the original repository to only output the input_ids. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_95.txt b/chunked/content_aware_chunking/_add_new_model/chunk_95.txt new file mode 100644 index 0000000000000000000000000000000000000000..34c17bb5fd0d16297ca79a6a8a0c366c808ae198 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_95.txt @@ -0,0 +1,3 @@ +Having written +a functional tokenization script that uses the original repository, an analogous script for 🤗 Transformers should be +created. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_96.txt b/chunked/content_aware_chunking/_add_new_model/chunk_96.txt new file mode 100644 index 0000000000000000000000000000000000000000..7005dfc0f24d7d4c1d80cbc3d5a8ab1ff471a88c --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_96.txt @@ -0,0 +1,11 @@ +It should look similar to this: +thon +from transformers import BrandNewBertTokenizer +input_str = "This is a long example input string containing special characters .$?-, numbers 2872 234 12 and words." 
+tokenizer = BrandNewBertTokenizer.from_pretrained("/path/to/tokenizer/folder/") +input_ids = tokenizer(input_str).input_ids + +When both input_ids yield the same values, as a final step a tokenizer test file should also be added. +Analogous to the modeling test files of brand_new_bert, the tokenization test files of brand_new_bert should +contain a couple of hard-coded integration tests. +10. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_97.txt b/chunked/content_aware_chunking/_add_new_model/chunk_97.txt new file mode 100644 index 0000000000000000000000000000000000000000..da67a1a138b2716a8f0247efaaa3986cd17d72d3 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_97.txt @@ -0,0 +1,6 @@ +Run End-to-end integration tests +Having added the tokenizer, you should also add a couple of end-to-end integration tests using both the model and the +tokenizer to tests/models/brand_new_bert/test_modeling_brand_new_bert.py in 🤗 Transformers. +Such a test should show on a meaningful +text-to-text sample that the 🤗 Transformers implementation works as expected. A meaningful text-to-text sample can +include e.g. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_98.txt b/chunked/content_aware_chunking/_add_new_model/chunk_98.txt new file mode 100644 index 0000000000000000000000000000000000000000..ec4536901395e68c7ad46174107d59562c79c8c3 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_98.txt @@ -0,0 +1,5 @@ +a source-to-target-translation pair, an article-to-summary pair, a question-to-answer pair, etc… If none +of the ported checkpoints has been fine-tuned on a downstream task it is enough to simply rely on the model tests. In a +final step to ensure that the model is fully functional, it is advised that you also run all tests on GPU. 
It can +happen that you forgot to add some .to(self.device) statements to internal tensors of the model, which in such a +test would show in an error. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_model/chunk_99.txt b/chunked/content_aware_chunking/_add_new_model/chunk_99.txt new file mode 100644 index 0000000000000000000000000000000000000000..3e2b567c59fda9a4cc45874ade55c8c51f7aecc2 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_model/chunk_99.txt @@ -0,0 +1,7 @@ +In case you have no access to a GPU, the Hugging Face team can take care of running those +tests for you. +11. Add Docstring +Now, all the necessary functionality for brand_new_bert is added - you're almost done! The only thing left to add is +a nice docstring and a doc page. The Cookiecutter should have added a template file called +docs/source/model_doc/brand_new_bert.md that you should fill out. Users of your model will usually first look at +this page before using your model. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_pipeline/chunk_0.txt b/chunked/content_aware_chunking/_add_new_pipeline/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..673c5658adae3a00e53c68b888c46011c3cee50f --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_pipeline/chunk_0.txt @@ -0,0 +1,6 @@ +How to create a custom pipeline? +In this guide, we will see how to create a custom pipeline and share it on the Hub or add it to the +🤗 Transformers library. +First and foremost, you need to decide the raw entries the pipeline will be able to take. It can be strings, raw bytes, +dictionaries or whatever seems to be the most likely desired input. Try to keep these inputs as pure Python as possible +as it makes compatibility easier (even through other languages via JSON). 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_pipeline/chunk_1.txt b/chunked/content_aware_chunking/_add_new_pipeline/chunk_1.txt new file mode 100644 index 0000000000000000000000000000000000000000..3ec51637662acba671c440d5c78070be4ee491b1 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_pipeline/chunk_1.txt @@ -0,0 +1,3 @@ +Those will be the inputs of the +pipeline (preprocess). +Then define the outputs. Same policy as the inputs. The simpler, the better. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_pipeline/chunk_10.txt b/chunked/content_aware_chunking/_add_new_pipeline/chunk_10.txt new file mode 100644 index 0000000000000000000000000000000000000000..ac70d3913dda9fbb1781e8b5cc7da459579180bc --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_pipeline/chunk_10.txt @@ -0,0 +1,14 @@ +If we have saved this in +a file named pair_classification.py, we can then import it and register it like this: + +from pair_classification import PairClassificationPipeline +from transformers.pipelines import PIPELINE_REGISTRY +from transformers import AutoModelForSequenceClassification, TFAutoModelForSequenceClassification +PIPELINE_REGISTRY.register_pipeline( + "pair-classification", + pipeline_class=PairClassificationPipeline, + pt_model=AutoModelForSequenceClassification, + tf_model=TFAutoModelForSequenceClassification, +) + +Once this is done, we can use it with a pretrained model. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_pipeline/chunk_11.txt b/chunked/content_aware_chunking/_add_new_pipeline/chunk_11.txt new file mode 100644 index 0000000000000000000000000000000000000000..dc312d358246f3e72d84443a838e771561098b66 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_pipeline/chunk_11.txt @@ -0,0 +1,16 @@ +For instance sgugger/finetuned-bert-mrpc has been +fine-tuned on the MRPC dataset, which classifies pairs of sentences as paraphrases or not. 
+ +from transformers import pipeline +classifier = pipeline("pair-classification", model="sgugger/finetuned-bert-mrpc") + +Then we can share it on the Hub by using the save_pretrained method in a Repository: + +from huggingface_hub import Repository +repo = Repository("test-dynamic-pipeline", clone_from="{your_username}/test-dynamic-pipeline") +classifier.save_pretrained("test-dynamic-pipeline") +repo.push_to_hub() + +This will copy the file where you defined PairClassificationPipeline inside the folder "test-dynamic-pipeline", +along with saving the model and tokenizer of the pipeline, before pushing everything into the repository +{your_username}/test-dynamic-pipeline. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_pipeline/chunk_12.txt b/chunked/content_aware_chunking/_add_new_pipeline/chunk_12.txt new file mode 100644 index 0000000000000000000000000000000000000000..ed5f097f698c9f556299b4b31b8394f49a67a255 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_pipeline/chunk_12.txt @@ -0,0 +1,10 @@ +After that, anyone can use it as long as they provide the option +trust_remote_code=True: + +from transformers import pipeline +classifier = pipeline(model="{your_username}/test-dynamic-pipeline", trust_remote_code=True) + +Add the pipeline to 🤗 Transformers +If you want to contribute your pipeline to 🤗 Transformers, you will need to add a new module in the pipelines submodule +with the code of your pipeline, then add it to the list of tasks defined in pipelines/__init__.py. +Then you will need to add tests. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_pipeline/chunk_13.txt b/chunked/content_aware_chunking/_add_new_pipeline/chunk_13.txt new file mode 100644 index 0000000000000000000000000000000000000000..540a72ca103538d71e128fafb91dea6117d5f6f4 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_pipeline/chunk_13.txt @@ -0,0 +1,5 @@ +Create a new file tests/test_pipelines_MY_PIPELINE.py with examples of the other tests. +The run_pipeline_test function will be very generic and run on small random models on every possible +architecture as defined by model_mapping and tf_model_mapping. +This is very important to test future compatibility, meaning if someone adds a new model for +XXXForQuestionAnswering then the pipeline test will attempt to run on it. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_pipeline/chunk_14.txt b/chunked/content_aware_chunking/_add_new_pipeline/chunk_14.txt new file mode 100644 index 0000000000000000000000000000000000000000..f79b5f2b05e588fc0b34318dd40d7ae708b44cd0 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_pipeline/chunk_14.txt @@ -0,0 +1,7 @@ +Because the models are random it's +impossible to check for actual values, that's why there is a helper ANY that will simply attempt to match the +output of the pipeline TYPE. +You also need to implement 2 (ideally 4) tests. + +test_small_model_pt : Define 1 small model for this pipeline (doesn't matter if the results don't make sense) + and test the pipeline outputs. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_pipeline/chunk_15.txt b/chunked/content_aware_chunking/_add_new_pipeline/chunk_15.txt new file mode 100644 index 0000000000000000000000000000000000000000..c145d7dfd2ffa094b91bdfeae5e7c41896e7f0de --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_pipeline/chunk_15.txt @@ -0,0 +1,5 @@ +The results should be the same as test_small_model_tf. 
+test_small_model_tf : Define 1 small model for this pipeline (doesn't matter if the results don't make sense) + and test the pipeline outputs. The results should be the same as test_small_model_pt. +test_large_model_pt (optional): Tests the pipeline on a real pipeline where the results are supposed to + make sense. These tests are slow and should be marked as such. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_pipeline/chunk_16.txt b/chunked/content_aware_chunking/_add_new_pipeline/chunk_16.txt new file mode 100644 index 0000000000000000000000000000000000000000..8d199008139f999f5e2136370752b18db20dec3e --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_pipeline/chunk_16.txt @@ -0,0 +1,6 @@ +Here the goal is to showcase the pipeline and to make + sure there is no drift in future releases. +test_large_model_tf (optional): Tests the pipeline on a real pipeline where the results are supposed to + make sense. These tests are slow and should be marked as such. Here the goal is to showcase the pipeline and to make + sure there is no drift in future releases. +. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_pipeline/chunk_2.txt b/chunked/content_aware_chunking/_add_new_pipeline/chunk_2.txt new file mode 100644 index 0000000000000000000000000000000000000000..acb13f048d91f38a7ce5eb5dad6e21b6b95eb52f --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_pipeline/chunk_2.txt @@ -0,0 +1,29 @@ +Those will be the outputs of +postprocess method. +Start by inheriting the base class Pipeline with the 4 methods needed to implement preprocess, +_forward, postprocess, and _sanitize_parameters. 
+thon +from transformers import Pipeline +class MyPipeline(Pipeline): + def _sanitize_parameters(self, **kwargs): + preprocess_kwargs = {} + if "maybe_arg" in kwargs: + preprocess_kwargs["maybe_arg"] = kwargs["maybe_arg"] + return preprocess_kwargs, {}, {} +def preprocess(self, inputs, maybe_arg=2): + model_input = Tensor(inputs["input_ids"]) + return {"model_input": model_input} + +def _forward(self, model_inputs): + # model_inputs == {"model_input": model_input} + outputs = self.model(**model_inputs) + # Maybe {"logits": Tensor()} + return outputs + +def postprocess(self, model_outputs): + best_class = model_outputs["logits"].softmax(-1) + return best_class + +The structure of this breakdown is to support relatively seamless support for CPU/GPU, while supporting doing +pre/postprocessing on the CPU on different threads +preprocess will take the originally defined inputs, and turn them into something feedable to the model. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_pipeline/chunk_3.txt b/chunked/content_aware_chunking/_add_new_pipeline/chunk_3.txt new file mode 100644 index 0000000000000000000000000000000000000000..66e0d4f7f8f8222fec2456aa94d190812bcdd9a3 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_pipeline/chunk_3.txt @@ -0,0 +1,4 @@ +It might +contain more information and is usually a Dict. +_forward is the implementation detail and is not meant to be called directly. forward is the preferred +called method as it contains safeguards to make sure everything is working on the expected device. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_pipeline/chunk_4.txt b/chunked/content_aware_chunking/_add_new_pipeline/chunk_4.txt new file mode 100644 index 0000000000000000000000000000000000000000..e0d68adf29dea6774164aef27b2811ff7e48f789 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_pipeline/chunk_4.txt @@ -0,0 +1,8 @@ +If anything is +linked to a real model it belongs in the _forward method, anything else is in the preprocess/postprocess. +postprocess methods will take the output of _forward and turn it into the final output that was decided +earlier. +_sanitize_parameters exists to allow users to pass any parameters whenever they wish, be it at initialization +time pipeline(., maybe_arg=4) or at call time pipe = pipeline(); output = pipe(., maybe_arg=4). +The returns of _sanitize_parameters are the 3 dicts of kwargs that will be passed directly to preprocess, +_forward, and postprocess. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_pipeline/chunk_5.txt b/chunked/content_aware_chunking/_add_new_pipeline/chunk_5.txt new file mode 100644 index 0000000000000000000000000000000000000000..1b016b7f50761e4c0e656e2c6fe5c68aba08fbc2 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_pipeline/chunk_5.txt @@ -0,0 +1 @@ +Don't fill anything if the caller didn't call with any extra parameter. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_pipeline/chunk_6.txt b/chunked/content_aware_chunking/_add_new_pipeline/chunk_6.txt new file mode 100644 index 0000000000000000000000000000000000000000..82c2c5f93a79dbc5cdacbeac8a4bfa3e3900357e --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_pipeline/chunk_6.txt @@ -0,0 +1,13 @@ +That +allows to keep the default arguments in the function definition which is always more "natural". +A classic example would be a top_k argument in the post processing in classification tasks. 
+thon + +pipe = pipeline("my-new-task") +pipe("This is a test") +[{"label": "1-star", "score": 0.8}, {"label": "2-star", "score": 0.1}, {"label": "3-star", "score": 0.05} +{"label": "4-star", "score": 0.025}, {"label": "5-star", "score": 0.025}] +pipe("This is a test", top_k=2) +[{"label": "1-star", "score": 0.8}, {"label": "2-star", "score": 0.1}] + +In order to achieve that, we'll update our postprocess method with a default parameter to 5. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_pipeline/chunk_7.txt b/chunked/content_aware_chunking/_add_new_pipeline/chunk_7.txt new file mode 100644 index 0000000000000000000000000000000000000000..407e340c0d8eea6810a7c158e834a85f2a7d0ea0 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_pipeline/chunk_7.txt @@ -0,0 +1,18 @@ +and edit +_sanitize_parameters to allow this new parameter. +thon +def postprocess(self, model_outputs, top_k=5): + best_class = model_outputs["logits"].softmax(-1) + # Add logic to handle top_k + return best_class +def _sanitize_parameters(self, **kwargs): + preprocess_kwargs = {} + if "maybe_arg" in kwargs: + preprocess_kwargs["maybe_arg"] = kwargs["maybe_arg"] +postprocess_kwargs = {} +if "top_k" in kwargs: + postprocess_kwargs["top_k"] = kwargs["top_k"] +return preprocess_kwargs, {}, postprocess_kwargs + +Try to keep the inputs/outputs very simple and ideally JSON-serializable as it makes the pipeline usage very easy +without requiring users to understand new kinds of objects. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_pipeline/chunk_8.txt b/chunked/content_aware_chunking/_add_new_pipeline/chunk_8.txt new file mode 100644 index 0000000000000000000000000000000000000000..83eabcb6047becdec5f9cbcce9295f8881853a36 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_pipeline/chunk_8.txt @@ -0,0 +1,24 @@ +It's also relatively common to support many different types +of arguments for ease of use (audio files, which can be filenames, URLs or pure bytes) +Adding it to the list of supported tasks +To register your new-task to the list of supported tasks, you have to add it to the PIPELINE_REGISTRY: +thon +from transformers.pipelines import PIPELINE_REGISTRY +PIPELINE_REGISTRY.register_pipeline( + "new-task", + pipeline_class=MyPipeline, + pt_model=AutoModelForSequenceClassification, +) + +You can specify a default model if you want, in which case it should come with a specific revision (which can be the name of a branch or a commit hash, here we took "abcdef") as well as the type: +python +PIPELINE_REGISTRY.register_pipeline( + "new-task", + pipeline_class=MyPipeline, + pt_model=AutoModelForSequenceClassification, + default={"pt": ("user/awesome_model", "abcdef")}, + type="text", # current support type: text, audio, image, multimodal +) +Share your pipeline on the Hub +To share your custom pipeline on the Hub, you just have to save the custom code of your Pipeline subclass in a +python file. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_new_pipeline/chunk_9.txt b/chunked/content_aware_chunking/_add_new_pipeline/chunk_9.txt new file mode 100644 index 0000000000000000000000000000000000000000..1658cdaf72837a9afb89c256120e40a686b91508 --- /dev/null +++ b/chunked/content_aware_chunking/_add_new_pipeline/chunk_9.txt @@ -0,0 +1,31 @@ +For instance, let's say we want to use a custom pipeline for sentence pair classification like this: + +import numpy as np +from transformers import Pipeline +def softmax(outputs): + maxes = np.max(outputs, axis=-1, keepdims=True) + shifted_exp = np.exp(outputs - maxes) + return shifted_exp / shifted_exp.sum(axis=-1, keepdims=True) +class PairClassificationPipeline(Pipeline): + def _sanitize_parameters(self, **kwargs): + preprocess_kwargs = {} + if "second_text" in kwargs: + preprocess_kwargs["second_text"] = kwargs["second_text"] + return preprocess_kwargs, {}, {} +def preprocess(self, text, second_text=None): + return self.tokenizer(text, text_pair=second_text, return_tensors=self.framework) + +def _forward(self, model_inputs): + return self.model(**model_inputs) + +def postprocess(self, model_outputs): + logits = model_outputs.logits[0].numpy() + probabilities = softmax(logits) + + best_class = np.argmax(probabilities) + label = self.model.config.id2label[best_class] + score = probabilities[best_class].item() + logits = logits.tolist() + return {"label": label, "score": score, "logits": logits} + +The implementation is framework agnostic, and will work for PyTorch and TensorFlow models. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_0.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..93217e34484a36f5d4ed299052d3d608b450b4f6 --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_0.txt @@ -0,0 +1,3 @@ +How to convert a 🤗 Transformers model to TensorFlow? +Having multiple frameworks available to use with 🤗 Transformers gives you flexibility to play their strengths when +designing your application, but it implies that compatibility must be added on a per-model basis. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_1.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_1.txt new file mode 100644 index 0000000000000000000000000000000000000000..858cec7df2fb2e1684baa15c9b7892863fca9d53 --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_1.txt @@ -0,0 +1,6 @@ +The good news is that +adding TensorFlow compatibility to an existing model is simpler than adding a new model from scratch! +Whether you wish to have a deeper understanding of large TensorFlow models, make a major open-source contribution, or +enable TensorFlow for your model of choice, this guide is for you. +This guide empowers you, a member of our community, to contribute TensorFlow model weights and/or +architectures to be used in 🤗 Transformers, with minimal supervision from the Hugging Face team. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_10.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_10.txt new file mode 100644 index 0000000000000000000000000000000000000000..4605d086b944363ac9e059327c225155b2e724b3 --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_10.txt @@ -0,0 +1,13 @@ +If the specific model you want to use with TensorFlow already has a TensorFlow architecture implementation in +🤗 Transformers but is lacking weights, feel free to jump straight into the +weight conversion section +of this page. +For simplicity, the remainder of this guide assumes you've decided to contribute with the TensorFlow version of +BrandNewBert (the same example as in the guide to add a new model from scratch). + +Before starting the work on a TensorFlow model architecture, double-check that there is no ongoing effort to do so. +You can search for BrandNewBert on the +pull request GitHub page to confirm that there is no +TensorFlow-related pull request. + +2. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_11.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_11.txt new file mode 100644 index 0000000000000000000000000000000000000000..a584c3c0d9684fe441579b9f5aa3d433d0df8948 --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_11.txt @@ -0,0 +1,6 @@ +Prepare transformers dev environment +Having selected the model architecture, open a draft PR to signal your intention to work on it. Follow the +instructions below to set up your environment and open a draft PR. + +Fork the repository by clicking on the 'Fork' button on the + repository's page. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_12.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_12.txt new file mode 100644 index 0000000000000000000000000000000000000000..87e895dd1b96469fec5463cf9e05a6fdb2523666 --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_12.txt @@ -0,0 +1,15 @@ +This creates a copy of the code under your GitHub user account. + +Clone your transformers fork to your local disk, and add the base repository as a remote: + +git clone https://github.com/[your Github handle]/transformers.git +cd transformers +git remote add upstream https://github.com/huggingface/transformers.git + +Set up a development environment, for instance by running the following command: + +python -m venv .env +source .env/bin/activate +pip install -e ".[dev]" +Depending on your OS, and since the number of optional dependencies of Transformers is growing, you might get a +failure with this command. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_13.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_13.txt new file mode 100644 index 0000000000000000000000000000000000000000..ea3960e85469134efe5f7273c67dd1159482c10b --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_13.txt @@ -0,0 +1,15 @@ +If that's the case make sure to install TensorFlow then do: + +pip install -e ".[quality]" +Note: You don't need to have CUDA installed. Making the new model work on CPU is sufficient. + +Create a branch with a descriptive name from your main branch + +git checkout -b add_tf_brand_new_bert + +Fetch and rebase to current main + +git fetch upstream +git rebase upstream/main + +Add an empty .py file in transformers/src/models/brandnewbert/ named modeling_tf_brandnewbert.py. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_14.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_14.txt new file mode 100644 index 0000000000000000000000000000000000000000..fd808efe3a262e18753022e888fc5506facc190b --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_14.txt @@ -0,0 +1,10 @@ +This will +be your TensorFlow model file. + +Push the changes to your account using: + +git add . +git commit -m "initial commit" +git push -u origin add_tf_brand_new_bert + +Once you are satisfied, go to the webpage of your fork on GitHub. Click on “Pull request”. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_15.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_15.txt new file mode 100644 index 0000000000000000000000000000000000000000..2a0235f69721aaaceb57dd6109a08ba523a30ff9 --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_15.txt @@ -0,0 +1,8 @@ +Make sure to add the + GitHub handle of some members of the Hugging Face team as reviewers, so that the Hugging Face team gets notified for + future changes. + +Change the PR into a draft by clicking on “Convert to draft” on the right of the GitHub pull request web page. + +Now you have set up a development environment to port BrandNewBert to TensorFlow in 🤗 Transformers. +3. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_16.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_16.txt new file mode 100644 index 0000000000000000000000000000000000000000..410c2b7f9aa8f5918af7254cdf1ee5ac8ab518e5 --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_16.txt @@ -0,0 +1,5 @@ +(Optional) Understand theoretical aspects and the existing implementation +You should take some time to read BrandNewBert's paper, if such descriptive work exists. 
There might be large +sections of the paper that are difficult to understand. If this is the case, this is fine - don't worry! The goal is +not to get a deep theoretical understanding of the paper, but to extract the necessary information required to +effectively re-implement the model in 🤗 Transformers using TensorFlow. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_17.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_17.txt new file mode 100644 index 0000000000000000000000000000000000000000..5f2c90b88f646f7789e9a389055a42c7a280eb12 --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_17.txt @@ -0,0 +1,5 @@ +That being said, you don't have to spend too +much time on the theoretical aspects, but rather focus on the practical ones, namely the existing model documentation +page (e.g. model docs for BERT). +After you've grasped the basics of the models you are about to implement, it's important to understand the existing +implementation. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_18.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_18.txt new file mode 100644 index 0000000000000000000000000000000000000000..60572a6e531c2878b3545dbca8edc75be083e3d3 --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_18.txt @@ -0,0 +1,6 @@ +This is a great chance to confirm that a working implementation matches your expectations for the +model, as well as to foresee technical challenges on the TensorFlow side. +It's perfectly natural that you feel overwhelmed with the amount of information that you've just absorbed. It is +definitely not a requirement that you understand all facets of the model at this stage. Nevertheless, we highly +encourage you to clear any pressing questions in our forum. +4. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_19.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_19.txt new file mode 100644 index 0000000000000000000000000000000000000000..d17c13111d62b1c73cd4a75dd2100d7e6b23e90f --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_19.txt @@ -0,0 +1,4 @@ +Model implementation +Now it's time to finally start coding. Our suggested starting point is the PyTorch file itself: copy the contents of +modeling_brand_new_bert.py inside src/transformers/models/brand_new_bert/ into +modeling_tf_brand_new_bert.py. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_2.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_2.txt new file mode 100644 index 0000000000000000000000000000000000000000..efa6e35acacc3549d50b2714cc6c968716831239 --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_2.txt @@ -0,0 +1,10 @@ +Writing a new model +is no small feat, but hopefully this guide will make it less of a rollercoaster 🎢 and more of a walk in the park 🚶. +Harnessing our collective experiences is absolutely critical to make this process increasingly easier, and thus we +highly encourage that you suggest improvements to this guide! +Before you dive deeper, it is recommended that you check the following resources if you're new to 🤗 Transformers: +- General overview of 🤗 Transformers +- Hugging Face's TensorFlow Philosophy +In the remainder of this guide, you will learn what's needed to add a new TensorFlow model architecture, the +procedure to convert PyTorch into TensorFlow model weights, and how to efficiently debug mismatches across ML +frameworks. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_20.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_20.txt new file mode 100644 index 0000000000000000000000000000000000000000..675437bc1bcb55069e4dfb5899da8d3aa116632c --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_20.txt @@ -0,0 +1,6 @@ +The goal of this section is to modify the file and update the import structure of +🤗 Transformers such that you can import TFBrandNewBert and +TFBrandNewBert.from_pretrained(model_repo, from_pt=True) successfully loads a working TensorFlow BrandNewBert model. +Sadly, there is no prescription to convert a PyTorch model into TensorFlow. You can, however, follow our selection of +tips to make the process as smooth as possible: +- Prepend TF to the name of all classes (e.g. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_21.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_21.txt new file mode 100644 index 0000000000000000000000000000000000000000..7d2c1944aa16bf3a9af55017ba96b25b828f7a82 --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_21.txt @@ -0,0 +1,6 @@ +BrandNewBert becomes TFBrandNewBert). +- Most PyTorch operations have a direct TensorFlow replacement. For example, torch.nn.Linear corresponds to + tf.keras.layers.Dense, torch.nn.Dropout corresponds to tf.keras.layers.Dropout, etc. If you're not sure + about a specific operation, you can use the TensorFlow documentation + or the PyTorch documentation. +- Look for patterns in the 🤗 Transformers codebase. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_22.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_22.txt new file mode 100644 index 0000000000000000000000000000000000000000..24942cc809be4c2fdf6b3ecd5a114a95b06b2ca2 --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_22.txt @@ -0,0 +1,5 @@ +If you come across a certain operation that doesn't have a direct + replacement, the odds are that someone else already had the same problem. +- By default, keep the same variable names and structure as in PyTorch. This will make it easier to debug, track + issues, and add fixes down the line. +- Some layers have different default values in each framework. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_23.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_23.txt new file mode 100644 index 0000000000000000000000000000000000000000..b9c58324d8a7823814a2fc961f2281f7ba5148a6 --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_23.txt @@ -0,0 +1,5 @@ +A notable example is the batch normalization layer's + epsilon (1e-5 in PyTorch + and 1e-3 in TensorFlow). + Double-check the documentation! +- PyTorch's nn.Parameter variables typically need to be initialized within TF Layer's build(). 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_24.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_24.txt new file mode 100644 index 0000000000000000000000000000000000000000..633a8635c76f3c4e6577191ff77b9425bd858850 --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_24.txt @@ -0,0 +1,7 @@ +See the following + example: PyTorch / + TensorFlow +- If the PyTorch model has a #copied from on top of a function, the odds are that your TensorFlow model can also + borrow that function from the architecture it was copied from, assuming it has a TensorFlow architecture. +- Assigning the name attribute correctly in TensorFlow functions is critical to do the from_pt=True weight + cross-loading. name is almost always the name of the corresponding variable in the PyTorch code. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_25.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_25.txt new file mode 100644 index 0000000000000000000000000000000000000000..32a479acd45acaf1293922e8c6dad9d52daa8c0d --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_25.txt @@ -0,0 +1,6 @@ +If name is not + properly set, you will see it in the error message when loading the model weights. +- The logic of the base model class, BrandNewBertModel, will actually reside in TFBrandNewBertMainLayer, a Keras + layer subclass (example). + TFBrandNewBertModel will simply be a wrapper around this layer. +- Keras models need to be built in order to load pretrained weights. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_26.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_26.txt new file mode 100644 index 0000000000000000000000000000000000000000..5632503e0e94a018720ab5de557b2c5f3b26bea3 --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_26.txt @@ -0,0 +1,7 @@ +For that reason, TFBrandNewBertPreTrainedModel + will need to hold an example of inputs to the model, the dummy_inputs + (example). +- If you get stuck, ask for help - we're here to help you! 🤗 +In addition to the model file itself, you will also need to add the pointers to the model classes and related +documentation pages. You can complete this part entirely following the patterns in other PRs +(example). \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_27.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_27.txt new file mode 100644 index 0000000000000000000000000000000000000000..7636e39dc0e4bdd4449318de6944c49d1ff070cd --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_27.txt @@ -0,0 +1,12 @@ +Here's a list of the needed manual +changes: +- Include all public classes of BrandNewBert in src/transformers/__init__.py +- Add BrandNewBert classes to the corresponding Auto classes in src/transformers/models/auto/modeling_tf_auto.py +- Add the lazy loading classes related to BrandNewBert in src/transformers/utils/dummy_tf_objects.py +- Update the import structures for the public classes in src/transformers/models/brand_new_bert/__init__.py +- Add the documentation pointers to the public methods of BrandNewBert in docs/source/en/model_doc/brand_new_bert.md +- Add yourself to the list of contributors to BrandNewBert in docs/source/en/model_doc/brand_new_bert.md +- Finally, add a green tick ✅ to the TensorFlow column of BrandNewBert in docs/source/en/index.md +When you're happy with your 
implementation, run the following checklist to confirm that your model architecture is +ready: +1. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_28.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_28.txt new file mode 100644 index 0000000000000000000000000000000000000000..9f3758bb5595329f16e759faf39841f053bd08dc --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_28.txt @@ -0,0 +1,7 @@ +All layers that behave differently at train time (e.g. Dropout) are called with a training argument, which is +propagated all the way from the top-level classes +2. You have used #copied from whenever possible +3. TFBrandNewBertMainLayer and all classes that use it have their call function decorated with @unpack_inputs +4. TFBrandNewBertMainLayer is decorated with @keras_serializable +5. A TensorFlow model can be loaded from PyTorch weights using TFBrandNewBert.from_pretrained(model_repo, from_pt=True) +6. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_29.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_29.txt new file mode 100644 index 0000000000000000000000000000000000000000..c37014b42a2a4acfdae633625c47b045a589878d --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_29.txt @@ -0,0 +1,6 @@ +You can call the TensorFlow model using the expected input format +5. Add model tests +Hurray, you've implemented a TensorFlow model! Now it's time to add tests to make sure that your model behaves as +expected. As in the previous section, we suggest you start by copying the test_modeling_brand_new_bert.py file in +tests/models/brand_new_bert/ into test_modeling_tf_brand_new_bert.py, and continue by making the necessary +TensorFlow replacements. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_3.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_3.txt new file mode 100644 index 0000000000000000000000000000000000000000..f89dbff8566e22692fa8afc5567e13d60c1a1af8 --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_3.txt @@ -0,0 +1,6 @@ +Let's get started! + +Are you unsure whether the model you wish to use already has a corresponding TensorFlow architecture? +  +Check the model_type field of the config.json of your model of choice +(example). \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_30.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_30.txt new file mode 100644 index 0000000000000000000000000000000000000000..c99018497b42743b77155d72a3e8cc9a9b77e5f4 --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_30.txt @@ -0,0 +1,7 @@ +For now, in all .from_pretrained() calls, you should use the from_pt=True flag to load +the existing PyTorch weights. +After you're done, it's time for the moment of truth: run the tests! 😬 + +NVIDIA_TF32_OVERRIDE=0 RUN_SLOW=1 RUN_PT_TF_CROSS_TESTS=1 \ +py.test -vv tests/models/brand_new_bert/test_modeling_tf_brand_new_bert.py +The most likely outcome is that you'll see a bunch of errors. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_31.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_31.txt new file mode 100644 index 0000000000000000000000000000000000000000..967e20011041a92ecd7b39c2078708cfd4f03dac --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_31.txt @@ -0,0 +1,5 @@ +Don't worry, this is expected! Debugging ML models is +notoriously hard, and the key ingredient to success is patience (and breakpoint()). 
In our experience, the hardest +problems arise from subtle mismatches between ML frameworks, for which we have a few pointers at the end of this guide. +In other cases, a general test might not be directly applicable to your model, in which case we suggest an override +at the model test class level. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_32.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_32.txt new file mode 100644 index 0000000000000000000000000000000000000000..f0fdd747483cdab66c2ea6b7eefcbdb209463a5a --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_32.txt @@ -0,0 +1,7 @@ +Regardless of the issue, don't hesitate to ask for help in your draft pull request if +you're stuck. +When all tests pass, congratulations, your model is nearly ready to be added to the 🤗 Transformers library! 🎉 +6.-7. Ensure everyone can use your model +6. Submit the pull request +Once you're done with the implementation and the tests, it's time to submit a pull request. Before pushing your code, +run our code formatting utility, make fixup 🪄. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_33.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_33.txt new file mode 100644 index 0000000000000000000000000000000000000000..11944041b5fc8948e3ca3066365f0148ba19ec8a --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_33.txt @@ -0,0 +1,4 @@ +This will automatically fix any formatting issues, which would cause +our automatic checks to fail. +It's now time to convert your draft pull request into a real pull request. To do so, click on the "Ready for +review" button and add Joao (@gante) and Matt (@Rocketknight1) as reviewers. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_34.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_34.txt new file mode 100644 index 0000000000000000000000000000000000000000..0f23faef6288abc295489477fbef90de8c661dbe --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_34.txt @@ -0,0 +1,4 @@ +A model pull request will need +at least 3 reviewers, but they will take care of finding appropriate additional reviewers for your model. +After all reviewers are happy with the state of your PR, the final action point is to remove the from_pt=True flag in +.from_pretrained() calls. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_35.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_35.txt new file mode 100644 index 0000000000000000000000000000000000000000..98ea23fba95bd13bfe887c973bfa4d3049203c2a --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_35.txt @@ -0,0 +1,9 @@ +Since there are no TensorFlow weights, you will have to add them! Check the section +below for instructions on how to do it. +Finally, when the TensorFlow weights get merged, you have at least 3 reviewer approvals, and all CI checks are +green, double-check the tests locally one last time + +NVIDIA_TF32_OVERRIDE=0 RUN_SLOW=1 RUN_PT_TF_CROSS_TESTS=1 \ +py.test -vv tests/models/brand_new_bert/test_modeling_tf_brand_new_bert.py +and we will merge your PR! Congratulations on the milestone 🎉 +7. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_36.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_36.txt new file mode 100644 index 0000000000000000000000000000000000000000..0337b67161a48642fc2021cfc7b4adf204291f35 --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_36.txt @@ -0,0 +1,5 @@ +(Optional) Build demos and share with the world +One of the hardest parts about open-source is discovery. How can the other users learn about the existence of your +fabulous TensorFlow contribution? With proper communication, of course! 📣 +There are two main ways to share your model with the community: +- Build demos. These include Gradio demos, notebooks, and other fun ways to show off your model. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_37.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_37.txt new file mode 100644 index 0000000000000000000000000000000000000000..110e3a9f1a9ffb3eedaaa755b31de131cb838405 --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_37.txt @@ -0,0 +1,3 @@ +We highly + encourage you to add a notebook to our community-driven demos. +- Share stories on social media like Twitter and LinkedIn. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_38.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_38.txt new file mode 100644 index 0000000000000000000000000000000000000000..1e69acc8e7444d3026b6619ab5d984a5a3a79115 --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_38.txt @@ -0,0 +1,8 @@ +You should be proud of your work and share + your achievement with the community - your model can now be used by thousands of engineers and researchers around + the world 🌍! We will be happy to retweet your posts and help you share your work with the community. 
+Adding TensorFlow weights to 🤗 Hub +Assuming that the TensorFlow model architecture is available in 🤗 Transformers, converting PyTorch weights into +TensorFlow weights is a breeze! +Here's how to do it: +1. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_39.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_39.txt new file mode 100644 index 0000000000000000000000000000000000000000..f252758ae0ad20bea02a5428280e5059a2ba264d --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_39.txt @@ -0,0 +1,5 @@ +Make sure you are logged into your Hugging Face account in your terminal. You can log in using the command + huggingface-cli login (you can find your access tokens here) +2. Run transformers-cli pt-to-tf --model-name foo/bar, where foo/bar is the name of the model repository + containing the PyTorch weights you want to convert +3. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_4.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_4.txt new file mode 100644 index 0000000000000000000000000000000000000000..6f51de2d4fef78fbefa75405b4b1c87930158be0 --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_4.txt @@ -0,0 +1,6 @@ +If the corresponding model folder in +🤗 Transformers has a file whose name starts with "modeling_tf", it means that it has a corresponding TensorFlow +architecture (example). + +Step-by-step guide to add TensorFlow model architecture code +There are many ways to design a large model architecture, and multiple ways of implementing said design. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_40.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_40.txt new file mode 100644 index 0000000000000000000000000000000000000000..980f36fd8b571224c766a03b4dd2e52bbc5e35ee --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_40.txt @@ -0,0 +1,6 @@ +Tag @joaogante and @Rocketknight1 in the 🤗 Hub PR the command above has just created +That's it! 🎉 +Debugging mismatches across ML frameworks 🐛 +At some point, when adding a new architecture or when creating TensorFlow weights for an existing architecture, you +might come across errors complaining about mismatches between PyTorch and TensorFlow. You might even decide to open the +model architecture code for the two frameworks, and find that they look identical. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_41.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_41.txt new file mode 100644 index 0000000000000000000000000000000000000000..61340c1a56858e3cd81cc3e4420a297e3809d315 --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_41.txt @@ -0,0 +1,5 @@ +What's going on? 🤔 +First of all, let's talk about why understanding these mismatches matters. Many community members will use 🤗 +Transformers models out of the box, and trust that our models behave as expected. When there is a large mismatch +between the two frameworks, it implies that the model is not following the reference implementation for at least one +of the frameworks. This might lead to silent failures, in which the model runs but has poor performance. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_42.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_42.txt new file mode 100644 index 0000000000000000000000000000000000000000..71bbd53ad739bd7b5ce6d64e2d9d69e8698ee1d9 --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_42.txt @@ -0,0 +1,7 @@ +This is +arguably worse than a model that fails to run at all! To that end, we aim at having a framework mismatch smaller than +1e-5 at all stages of the model. +As in other numerical problems, the devil is in the details. And as in any detail-oriented craft, the secret +ingredient here is patience. Here is our suggested workflow for when you come across this type of issues: +1. Locate the source of mismatches. The model you're converting probably has near identical inner variables up to a + certain point. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_43.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_43.txt new file mode 100644 index 0000000000000000000000000000000000000000..d031ff2f56a574981c58a9ac0eecf0d7e5c8535e --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_43.txt @@ -0,0 +1,6 @@ +Place breakpoint() statements in the two frameworks' architectures, and compare the values of the + numerical variables in a top-down fashion until you find the source of the problems. +2. Now that you've pinpointed the source of the issue, get in touch with the 🤗 Transformers team. It is possible + that we've seen a similar problem before and can promptly provide a solution. As a fallback, scan popular pages + like StackOverflow and GitHub issues. +3. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_44.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_44.txt new file mode 100644 index 0000000000000000000000000000000000000000..50aa039f5d367a7b3bcdb34967aada019800e5fd --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_44.txt @@ -0,0 +1,3 @@ +If there is no solution in sight, it means you'll have to go deeper. The good news is that you've located the + issue, so you can focus on the problematic instruction, abstracting away the rest of the model! The bad news is + that you'll have to venture into the source implementation of said instruction. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_45.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_45.txt new file mode 100644 index 0000000000000000000000000000000000000000..07018f40f4c80e66f16b73b033e8df9d60cf056b --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_45.txt @@ -0,0 +1,5 @@ +In some cases, you might find an + issue with a reference implementation - don't abstain from opening an issue in the upstream repository. +In some cases, in discussion with the 🤗 Transformers team, we might find that fixing the mismatch is infeasible. +When the mismatch is very small in the output layers of the model (but potentially large in the hidden states), we +might decide to ignore it in favor of distributing the model. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_46.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_46.txt new file mode 100644 index 0000000000000000000000000000000000000000..07c8e20f316dfdcb1000bb18277ddd631187e566 --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_46.txt @@ -0,0 +1,2 @@ +The pt-to-tf CLI mentioned above has a --max-error +flag to override the error message at weight conversion time.. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_5.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_5.txt new file mode 100644 index 0000000000000000000000000000000000000000..69dbe0303794e6d53fa7cdcfe9028603a82f865c --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_5.txt @@ -0,0 +1,3 @@ +However, +you might recall from our general overview of 🤗 Transformers +that we are an opinionated bunch - the ease of use of 🤗 Transformers relies on consistent design choices. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_6.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_6.txt new file mode 100644 index 0000000000000000000000000000000000000000..b0e040c82e2c8a165f608134700e1090c4ce4630 --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_6.txt @@ -0,0 +1,7 @@ +From +experience, we can tell you a few important things about adding TensorFlow models: + +Don't reinvent the wheel! More often than not, there are at least two reference implementations you should check: the +PyTorch equivalent of the model you are implementing and other TensorFlow models for the same class of problems. +Great model implementations survive the test of time. This doesn't happen because the code is pretty, but rather +because the code is clear, easy to debug and build upon. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_7.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_7.txt new file mode 100644 index 0000000000000000000000000000000000000000..f8b299cdc3f810bac2b145021c1cdcb63b5254b0 --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_7.txt @@ -0,0 +1,8 @@ +If you make the life of the maintainers easy with your +TensorFlow implementation, by replicating the same patterns as in other TensorFlow models and minimizing the mismatch +to the PyTorch implementation, you ensure your contribution will be long lived. +Ask for help when you're stuck! The 🤗 Transformers team is here to help, and we've probably found solutions to the same +problems you're facing. + +Here's an overview of the steps needed to add a TensorFlow model architecture: +1. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_8.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_8.txt new file mode 100644 index 0000000000000000000000000000000000000000..0f4f1e55c4d11dba2d0c0ab24e48ac054a65fc58 --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_8.txt @@ -0,0 +1,10 @@ +Select the model you wish to convert +2. Prepare transformers dev environment +3. (Optional) Understand theoretical aspects and the existing implementation +4. Implement the model architecture +5. Implement model tests +6. Submit the pull request +7. (Optional) Build demos and share with the world +1.-3. Prepare your model contribution +1. Select the model you wish to convert +Let's start off with the basics: the first thing you need to know is the architecture you want to convert. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_add_tensorflow_model/chunk_9.txt b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_9.txt new file mode 100644 index 0000000000000000000000000000000000000000..403af0e746f8fc145b10ff58823a71758589a276 --- /dev/null +++ b/chunked/content_aware_chunking/_add_tensorflow_model/chunk_9.txt @@ -0,0 +1,4 @@ +If you +don't have your eyes set on a specific architecture, asking the 🤗 Transformers team for suggestions is a great way to +maximize your impact - we will guide you towards the most prominent architectures that are missing on the TensorFlow +side. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_attention/chunk_0.txt b/chunked/content_aware_chunking/_attention/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..75ccc2f53f635437a68f4260b045e26d054c0362 --- /dev/null +++ b/chunked/content_aware_chunking/_attention/chunk_0.txt @@ -0,0 +1,7 @@ +Attention mechanisms +Most transformer models use full attention in the sense that the attention matrix is square. It can be a big +computational bottleneck when you have long texts. Longformer and reformer are models that try to be more efficient and +use a sparse version of the attention matrix to speed up training. +LSH attention +Reformer uses LSH attention. In the softmax(QK^t), only the biggest elements (in the softmax +dimension) of the matrix QK^t are going to give useful contributions. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_attention/chunk_1.txt b/chunked/content_aware_chunking/_attention/chunk_1.txt new file mode 100644 index 0000000000000000000000000000000000000000..33e44714e59b00608c6c254aff180f0799290922 --- /dev/null +++ b/chunked/content_aware_chunking/_attention/chunk_1.txt @@ -0,0 +1,4 @@ +So for each query q in Q, we can consider only +the keys k in K that are close to q. A hash function is used to determine if q and k are close. 
The attention mask is +modified to mask the current token (except at the first position), because it will give a query and a key equal (so +very similar to each other). \ No newline at end of file diff --git a/chunked/content_aware_chunking/_attention/chunk_2.txt b/chunked/content_aware_chunking/_attention/chunk_2.txt new file mode 100644 index 0000000000000000000000000000000000000000..9e41782f92feba13d943742618a6450e032cfc7b --- /dev/null +++ b/chunked/content_aware_chunking/_attention/chunk_2.txt @@ -0,0 +1,5 @@ +Since the hash can be a bit random, several hash functions are used in practice +(determined by a n_rounds parameter) and then are averaged together. +Local attention +Longformer uses local attention: often, the local context (e.g., what are the two tokens to the +left and right?) is enough to take action for a given token. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_attention/chunk_3.txt b/chunked/content_aware_chunking/_attention/chunk_3.txt new file mode 100644 index 0000000000000000000000000000000000000000..c4867b909b716cf09ae816888c3e565abc22eee0 --- /dev/null +++ b/chunked/content_aware_chunking/_attention/chunk_3.txt @@ -0,0 +1,6 @@ +Also, by stacking attention layers that have a small +window, the last layer will have a receptive field of more than just the tokens in the window, allowing them to build a +representation of the whole sentence. +Some preselected input tokens are also given global attention: for those few tokens, the attention matrix can access +all tokens and this process is symmetric: all other tokens have access to those specific tokens (on top of the ones in +their local window). 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_attention/chunk_4.txt b/chunked/content_aware_chunking/_attention/chunk_4.txt new file mode 100644 index 0000000000000000000000000000000000000000..0c654b813aa4653a198c6ae9510bde131102ac77 --- /dev/null +++ b/chunked/content_aware_chunking/_attention/chunk_4.txt @@ -0,0 +1,9 @@ +This is shown in Figure 2d of the paper, see below for a sample attention mask: + +Using those attention matrices with less parameters then allows the model to have inputs having a bigger sequence +length. +Other tricks +Axial positional encodings +Reformer uses axial positional encodings: in traditional transformer models, the positional encoding +E is a matrix of size \(l\) by \(d\), \(l\) being the sequence length and \(d\) the dimension of the +hidden state. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_attention/chunk_5.txt b/chunked/content_aware_chunking/_attention/chunk_5.txt new file mode 100644 index 0000000000000000000000000000000000000000..fcaad09140d60bc2f4211fda359dedc846913dc8 --- /dev/null +++ b/chunked/content_aware_chunking/_attention/chunk_5.txt @@ -0,0 +1,4 @@ +If you have very long texts, this matrix can be huge and take way too much space on the GPU. To alleviate +that, axial positional encodings consist of factorizing that big matrix E in two smaller matrices E1 and E2, with +dimensions \(l_{1} \times d_{1}\) and \(l_{2} \times d_{2}\), such that \(l_{1} \times l_{2} = l\) and +\(d_{1} + d_{2} = d\) (with the product for the lengths, this ends up being way smaller). 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_attention/chunk_6.txt b/chunked/content_aware_chunking/_attention/chunk_6.txt new file mode 100644 index 0000000000000000000000000000000000000000..4433ca6198247f62c1ec5c5650782ba22116db37 --- /dev/null +++ b/chunked/content_aware_chunking/_attention/chunk_6.txt @@ -0,0 +1,3 @@ +The embedding for time +step \(j\) in E is obtained by concatenating the embeddings for timestep \(j \% l1\) in E1 and \(j // l1\) +in E2.. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_autoclass_tutorial/chunk_0.txt b/chunked/content_aware_chunking/_autoclass_tutorial/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..45a6037674ef61b70732c96c5a7a7f5871a139af --- /dev/null +++ b/chunked/content_aware_chunking/_autoclass_tutorial/chunk_0.txt @@ -0,0 +1,2 @@ +Load pretrained instances with an AutoClass +With so many different Transformer architectures, it can be challenging to create one for your checkpoint. As a part of 🤗 Transformers core philosophy to make the library easy, simple and flexible to use, an AutoClass automatically infers and loads the correct architecture from a given checkpoint. The from_pretrained() method lets you quickly load a pretrained model for any architecture so you don't have to devote time and resources to train a model from scratch. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_autoclass_tutorial/chunk_1.txt b/chunked/content_aware_chunking/_autoclass_tutorial/chunk_1.txt new file mode 100644 index 0000000000000000000000000000000000000000..5e793a8a93e1065cfac1610b8e6ab363474ff830 --- /dev/null +++ b/chunked/content_aware_chunking/_autoclass_tutorial/chunk_1.txt @@ -0,0 +1,3 @@ +Producing this type of checkpoint-agnostic code means if your code works for one checkpoint, it will work with another checkpoint - as long as it was trained for a similar task - even if the architecture is different. 
+ +Remember, architecture refers to the skeleton of the model and checkpoints are the weights for a given architecture. For example, BERT is an architecture, while google-bert/bert-base-uncased is a checkpoint. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_autoclass_tutorial/chunk_10.txt b/chunked/content_aware_chunking/_autoclass_tutorial/chunk_10.txt new file mode 100644 index 0000000000000000000000000000000000000000..1f2ed76b09d7fedb75df5b8520ed8fa9eefda846 --- /dev/null +++ b/chunked/content_aware_chunking/_autoclass_tutorial/chunk_10.txt @@ -0,0 +1,3 @@ +In the next tutorial, learn how to use your newly loaded tokenizer, image processor, feature extractor and processor to preprocess a dataset for fine-tuning. + +Finally, the TFAutoModelFor classes let you load a pretrained model for a given task (see here for a complete list of available tasks). \ No newline at end of file diff --git a/chunked/content_aware_chunking/_autoclass_tutorial/chunk_11.txt b/chunked/content_aware_chunking/_autoclass_tutorial/chunk_11.txt new file mode 100644 index 0000000000000000000000000000000000000000..f5f3a1ad0b2f9502e0a1c3a75e569d88f7642221 --- /dev/null +++ b/chunked/content_aware_chunking/_autoclass_tutorial/chunk_11.txt @@ -0,0 +1,11 @@ +For example, load a model for sequence classification with [TFAutoModelForSequenceClassification.from_pretrained]: + +from transformers import TFAutoModelForSequenceClassification +model = TFAutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") + +Easily reuse the same checkpoint to load an architecture for a different task: + +from transformers import TFAutoModelForTokenClassification +model = TFAutoModelForTokenClassification.from_pretrained("distilbert/distilbert-base-uncased") + +Generally, we recommend using the AutoTokenizer class and the TFAutoModelFor class to load pretrained instances of models. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_autoclass_tutorial/chunk_12.txt b/chunked/content_aware_chunking/_autoclass_tutorial/chunk_12.txt new file mode 100644 index 0000000000000000000000000000000000000000..54ed77532368e50c3ac4c60bb6dac23693ca698c --- /dev/null +++ b/chunked/content_aware_chunking/_autoclass_tutorial/chunk_12.txt @@ -0,0 +1,3 @@ +This will ensure you load the correct architecture every time. In the next tutorial, learn how to use your newly loaded tokenizer, image processor, feature extractor and processor to preprocess a dataset for fine-tuning. + +. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_autoclass_tutorial/chunk_2.txt b/chunked/content_aware_chunking/_autoclass_tutorial/chunk_2.txt new file mode 100644 index 0000000000000000000000000000000000000000..ab86a5255dccf15754e10c7675a1327480fc869f --- /dev/null +++ b/chunked/content_aware_chunking/_autoclass_tutorial/chunk_2.txt @@ -0,0 +1,13 @@ +Model is a general term that can mean either architecture or checkpoint. + +In this tutorial, learn to: + +Load a pretrained tokenizer. +Load a pretrained image processor +Load a pretrained feature extractor. +Load a pretrained processor. +Load a pretrained model. +Load a model as a backbone. + +AutoTokenizer +Nearly every NLP task begins with a tokenizer. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_autoclass_tutorial/chunk_3.txt b/chunked/content_aware_chunking/_autoclass_tutorial/chunk_3.txt new file mode 100644 index 0000000000000000000000000000000000000000..01a4203944c1abd58bc07fbeed533223b247d39e --- /dev/null +++ b/chunked/content_aware_chunking/_autoclass_tutorial/chunk_3.txt @@ -0,0 +1,25 @@ +A tokenizer converts your input into a format that can be processed by the model. 
+Load a tokenizer with [AutoTokenizer.from_pretrained]: + +from transformers import AutoTokenizer +tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") + +Then tokenize your input as shown below: + +sequence = "In a hole in the ground there lived a hobbit." +print(tokenizer(sequence)) +{'input_ids': [101, 1999, 1037, 4920, 1999, 1996, 2598, 2045, 2973, 1037, 7570, 10322, 4183, 1012, 102], + 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]} + +AutoImageProcessor +For vision tasks, an image processor processes the image into the correct input format. + +from transformers import AutoImageProcessor +image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224") + +AutoBackbone + +A Swin backbone with multiple stages for outputting a feature map. + +The [AutoBackbone] lets you use pretrained models as backbones to get feature maps from different stages of the backbone. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_autoclass_tutorial/chunk_4.txt b/chunked/content_aware_chunking/_autoclass_tutorial/chunk_4.txt new file mode 100644 index 0000000000000000000000000000000000000000..2cf25f619e03bf6e30468f3c77a5126072d13ab8 --- /dev/null +++ b/chunked/content_aware_chunking/_autoclass_tutorial/chunk_4.txt @@ -0,0 +1,8 @@ +You should specify one of the following parameters in [~PretrainedConfig.from_pretrained]: + +out_indices is the index of the layer you'd like to get the feature map from +out_features is the name of the layer you'd like to get the feature map from + +These parameters can be used interchangeably, but if you use both, make sure they're aligned with each other! If you don't pass any of these parameters, the backbone returns the feature map from the last layer. + +A feature map from the first stage of the backbone. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_autoclass_tutorial/chunk_5.txt b/chunked/content_aware_chunking/_autoclass_tutorial/chunk_5.txt new file mode 100644 index 0000000000000000000000000000000000000000..f3a4bea3689c1f32b98a9ff0f9ed128623564249 --- /dev/null +++ b/chunked/content_aware_chunking/_autoclass_tutorial/chunk_5.txt @@ -0,0 +1,32 @@ +The patch partition refers to the model stem. + +For example, in the above diagram, to return the feature map from the first stage of the Swin backbone, you can set out_indices=(1,): + +from transformers import AutoImageProcessor, AutoBackbone +import torch +from PIL import Image +import requests +url = "http://images.cocodataset.org/val2017/000000039769.jpg" +image = Image.open(requests.get(url, stream=True).raw) +processor = AutoImageProcessor.from_pretrained("microsoft/swin-tiny-patch4-window7-224") +model = AutoBackbone.from_pretrained("microsoft/swin-tiny-patch4-window7-224", out_indices=(1,)) +inputs = processor(image, return_tensors="pt") +outputs = model(**inputs) +feature_maps = outputs.feature_maps + +Now you can access the feature_maps object from the first stage of the backbone: + +list(feature_maps[0].shape) +[1, 96, 56, 56] + +AutoFeatureExtractor +For audio tasks, a feature extractor processes the audio signal the correct input format. +Load a feature extractor with [AutoFeatureExtractor.from_pretrained]: + +from transformers import AutoFeatureExtractor +feature_extractor = AutoFeatureExtractor.from_pretrained( + "ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition" + ) + +AutoProcessor +Multimodal tasks require a processor that combines two types of preprocessing tools. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_autoclass_tutorial/chunk_6.txt b/chunked/content_aware_chunking/_autoclass_tutorial/chunk_6.txt new file mode 100644 index 0000000000000000000000000000000000000000..6fbd8180b3d98a88a3973c9fa33e7b3c1f04de77 --- /dev/null +++ b/chunked/content_aware_chunking/_autoclass_tutorial/chunk_6.txt @@ -0,0 +1,9 @@ +For example, the LayoutLMV2 model requires an image processor to handle images and a tokenizer to handle text; a processor combines both of them. +Load a processor with [AutoProcessor.from_pretrained]: + +from transformers import AutoProcessor +processor = AutoProcessor.from_pretrained("microsoft/layoutlmv2-base-uncased") + +AutoModel + +The AutoModelFor classes let you load a pretrained model for a given task (see here for a complete list of available tasks). \ No newline at end of file diff --git a/chunked/content_aware_chunking/_autoclass_tutorial/chunk_7.txt b/chunked/content_aware_chunking/_autoclass_tutorial/chunk_7.txt new file mode 100644 index 0000000000000000000000000000000000000000..83f19848c1a29b4cdee1dc3f35e1aada85956468 --- /dev/null +++ b/chunked/content_aware_chunking/_autoclass_tutorial/chunk_7.txt @@ -0,0 +1,11 @@ +For example, load a model for sequence classification with [AutoModelForSequenceClassification.from_pretrained]: + +from transformers import AutoModelForSequenceClassification +model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") + +Easily reuse the same checkpoint to load an architecture for a different task: + +from transformers import AutoModelForTokenClassification +model = AutoModelForTokenClassification.from_pretrained("distilbert/distilbert-base-uncased") + +For PyTorch models, the from_pretrained() method uses torch.load() which internally uses pickle and is known to be insecure. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_autoclass_tutorial/chunk_8.txt b/chunked/content_aware_chunking/_autoclass_tutorial/chunk_8.txt new file mode 100644 index 0000000000000000000000000000000000000000..dd34f1e3108d898a5939f4d1fcaa0381943a0793 --- /dev/null +++ b/chunked/content_aware_chunking/_autoclass_tutorial/chunk_8.txt @@ -0,0 +1 @@ +In general, never load a model that could have come from an untrusted source, or that could have been tampered with. This security risk is partially mitigated for public models hosted on the Hugging Face Hub, which are scanned for malware at each commit. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_autoclass_tutorial/chunk_9.txt b/chunked/content_aware_chunking/_autoclass_tutorial/chunk_9.txt new file mode 100644 index 0000000000000000000000000000000000000000..90e4af824b89ddff363d1a8120b5c0cc919086a4 --- /dev/null +++ b/chunked/content_aware_chunking/_autoclass_tutorial/chunk_9.txt @@ -0,0 +1,4 @@ +See the Hub documentation for best practices like signed commit verification with GPG. +TensorFlow and Flax checkpoints are not affected, and can be loaded within PyTorch architectures using the from_tf and from_flax kwargs for the from_pretrained method to circumvent this issue. + +Generally, we recommend using the AutoTokenizer class and the AutoModelFor class to load pretrained instances of models. This will ensure you load the correct architecture every time. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_benchmarks/chunk_0.txt b/chunked/content_aware_chunking/_benchmarks/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/chunked/content_aware_chunking/_benchmarks/chunk_1.txt b/chunked/content_aware_chunking/_benchmarks/chunk_1.txt new file mode 100644 index 0000000000000000000000000000000000000000..6468d74fbb12d427cbb2c2c3aa6627f9f918ee2d --- /dev/null +++ b/chunked/content_aware_chunking/_benchmarks/chunk_1.txt @@ -0,0 +1,10 @@ +Benchmarks + +Hugging Face's Benchmarking tools are deprecated and it is advised to use external Benchmarking libraries to measure the speed +and memory complexity of Transformer models. + +[[open-in-colab]] +Let's take a look at how 🤗 Transformers models can be benchmarked, best practices, and already available benchmarks. +A notebook explaining in more detail how to benchmark 🤗 Transformers models can be found here. +How to benchmark 🤗 Transformers models +The classes [PyTorchBenchmark] and [TensorFlowBenchmark] allow to flexibly benchmark 🤗 Transformers models. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_benchmarks/chunk_10.txt b/chunked/content_aware_chunking/_benchmarks/chunk_10.txt new file mode 100644 index 0000000000000000000000000000000000000000..16170248ba18591d601a83ae2bd80946b4506317 --- /dev/null +++ b/chunked/content_aware_chunking/_benchmarks/chunk_10.txt @@ -0,0 +1,136 @@ +In this case, a list of +configurations must be inserted with the benchmark args as follows. 
+ +from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments, BertConfig +args = PyTorchBenchmarkArguments( + models=["bert-base", "bert-384-hid", "bert-6-lay"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512] + ) +config_base = BertConfig() +config_384_hid = BertConfig(hidden_size=384) +config_6_lay = BertConfig(num_hidden_layers=6) +benchmark = PyTorchBenchmark(args, configs=[config_base, config_384_hid, config_6_lay]) +benchmark.run() +==================== INFERENCE - SPEED - RESULT ==================== + +Model Name Batch Size Seq Length Time in s +bert-base 8 8 0.006 +bert-base 8 32 0.006 +bert-base 8 128 0.018 +bert-base 8 512 0.088 +bert-384-hid 8 8 0.006 +bert-384-hid 8 32 0.006 +bert-384-hid 8 128 0.011 +bert-384-hid 8 512 0.054 +bert-6-lay 8 8 0.003 +bert-6-lay 8 32 0.004 +bert-6-lay 8 128 0.009 +bert-6-lay 8 512 0.044 + +==================== INFERENCE - MEMORY - RESULT ==================== +Model Name Batch Size Seq Length Memory in MB +bert-base 8 8 1277 +bert-base 8 32 1281 +bert-base 8 128 1307 +bert-base 8 512 1539 +bert-384-hid 8 8 1005 +bert-384-hid 8 32 1027 +bert-384-hid 8 128 1035 +bert-384-hid 8 512 1255 +bert-6-lay 8 8 1097 +bert-6-lay 8 32 1101 +bert-6-lay 8 128 1127 +bert-6-lay 8 512 1359 + +==================== ENVIRONMENT INFORMATION ==================== + +transformers_version: 2.11.0 +framework: PyTorch +use_torchscript: False +framework_version: 1.4.0 +python_version: 3.6.10 +system: Linux +cpu: x86_64 +architecture: 64bit +date: 2020-06-29 +time: 09:35:25.143267 +fp16: False +use_multiprocessing: True +only_pretrain_model: False +cpu_ram_mb: 32088 +use_gpu: True +num_gpus: 1 +gpu: TITAN RTX +gpu_ram_mb: 24217 +gpu_power_watts: 280.0 +gpu_performance_state: 2 +use_tpu: False + +py + +from transformers import TensorFlowBenchmark, TensorFlowBenchmarkArguments, BertConfig + +args = TensorFlowBenchmarkArguments( + models=["bert-base", "bert-384-hid", "bert-6-lay"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512] + ) 
+config_base = BertConfig() +config_384_hid = BertConfig(hidden_size=384) +config_6_lay = BertConfig(num_hidden_layers=6) +benchmark = TensorFlowBenchmark(args, configs=[config_base, config_384_hid, config_6_lay]) +benchmark.run() +==================== INFERENCE - SPEED - RESULT ==================== + +Model Name Batch Size Seq Length Time in s +bert-base 8 8 0.005 +bert-base 8 32 0.008 +bert-base 8 128 0.022 +bert-base 8 512 0.106 +bert-384-hid 8 8 0.005 +bert-384-hid 8 32 0.007 +bert-384-hid 8 128 0.018 +bert-384-hid 8 512 0.064 +bert-6-lay 8 8 0.002 +bert-6-lay 8 32 0.003 +bert-6-lay 8 128 0.0011 +bert-6-lay 8 512 0.074 + +==================== INFERENCE - MEMORY - RESULT ==================== +Model Name Batch Size Seq Length Memory in MB +bert-base 8 8 1330 +bert-base 8 32 1330 +bert-base 8 128 1330 +bert-base 8 512 1770 +bert-384-hid 8 8 1330 +bert-384-hid 8 32 1330 +bert-384-hid 8 128 1330 +bert-384-hid 8 512 1540 +bert-6-lay 8 8 1330 +bert-6-lay 8 32 1330 +bert-6-lay 8 128 1330 +bert-6-lay 8 512 1540 + +==================== ENVIRONMENT INFORMATION ==================== + +transformers_version: 2.11.0 +framework: Tensorflow +use_xla: False +framework_version: 2.2.0 +python_version: 3.6.10 +system: Linux +cpu: x86_64 +architecture: 64bit +date: 2020-06-29 +time: 09:38:15.487125 +fp16: False +use_multiprocessing: True +only_pretrain_model: False +cpu_ram_mb: 32088 +use_gpu: True +num_gpus: 1 +gpu: TITAN RTX +gpu_ram_mb: 24217 +gpu_power_watts: 280.0 +gpu_performance_state: 2 +use_tpu: False + +Again, inference time and required memory for inference are measured, but this time for customized configurations +of the BertModel class. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_benchmarks/chunk_11.txt b/chunked/content_aware_chunking/_benchmarks/chunk_11.txt new file mode 100644 index 0000000000000000000000000000000000000000..d4cbff7569c706cdbf2586d0cb7fb342708206e3 --- /dev/null +++ b/chunked/content_aware_chunking/_benchmarks/chunk_11.txt @@ -0,0 +1,8 @@ +This feature can especially be helpful when deciding for which configuration the model +should be trained. +Benchmark best practices +This section lists a couple of best practices one should be aware of when benchmarking a model. + +Currently, only single device benchmarking is supported. When benchmarking on GPU, it is recommended that the user + specifies on which device the code should be run by setting the CUDA_VISIBLE_DEVICES environment variable in the + shell, e.g. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_benchmarks/chunk_12.txt b/chunked/content_aware_chunking/_benchmarks/chunk_12.txt new file mode 100644 index 0000000000000000000000000000000000000000..fd491c882c2d6a9cf9d8b3be01d71bbb2d816f2a --- /dev/null +++ b/chunked/content_aware_chunking/_benchmarks/chunk_12.txt @@ -0,0 +1,5 @@ +export CUDA_VISIBLE_DEVICES=0 before running the code. +The option no_multi_processing should only be set to True for testing and debugging. To ensure accurate + memory measurement it is recommended to run each memory benchmark in a separate process by making sure + no_multi_processing is set to False. +One should always state the environment information when sharing the results of a model benchmark. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_benchmarks/chunk_13.txt b/chunked/content_aware_chunking/_benchmarks/chunk_13.txt new file mode 100644 index 0000000000000000000000000000000000000000..cd7a8d7f93e1e26ce6c2e54227566082e3ee6ecb --- /dev/null +++ b/chunked/content_aware_chunking/_benchmarks/chunk_13.txt @@ -0,0 +1,7 @@ +Results can vary + heavily between different GPU devices, library versions, etc., so that benchmark results on their own are not very + useful for the community. + +Sharing your benchmark +Previously all available core models (10 at the time) have been benchmarked for inference time, across many different +settings: using PyTorch, with and without TorchScript, using TensorFlow, with and without XLA. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_benchmarks/chunk_14.txt b/chunked/content_aware_chunking/_benchmarks/chunk_14.txt new file mode 100644 index 0000000000000000000000000000000000000000..3e486827a49c0cffa9335ee21c967bcaa81a9a00 --- /dev/null +++ b/chunked/content_aware_chunking/_benchmarks/chunk_14.txt @@ -0,0 +1,9 @@ +All of those tests were +done across CPUs (except for TensorFlow XLA) and GPUs. +The approach is detailed in the following blogpost and the results are +available here. +With the new benchmark tools, it is easier than ever to share your benchmark results with the community + +PyTorch Benchmarking Results. +TensorFlow Benchmarking Results. +. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_benchmarks/chunk_2.txt b/chunked/content_aware_chunking/_benchmarks/chunk_2.txt new file mode 100644 index 0000000000000000000000000000000000000000..cf4f01bc0b9d3a6297496782ee7d69dd6be9a6f7 --- /dev/null +++ b/chunked/content_aware_chunking/_benchmarks/chunk_2.txt @@ -0,0 +1,7 @@ +The benchmark classes allow us to measure the peak memory usage and required time for both inference and training. 
+ +Hereby, inference is defined by a single forward pass, and training is defined by a single forward pass and +backward pass. + +The benchmark classes [PyTorchBenchmark] and [TensorFlowBenchmark] expect an object of type [PyTorchBenchmarkArguments] and +[TensorFlowBenchmarkArguments], respectively, for instantiation. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_benchmarks/chunk_3.txt b/chunked/content_aware_chunking/_benchmarks/chunk_3.txt new file mode 100644 index 0000000000000000000000000000000000000000..d17b3c0f76e73a2e10c88c59107f9f37f92ea61f --- /dev/null +++ b/chunked/content_aware_chunking/_benchmarks/chunk_3.txt @@ -0,0 +1 @@ +[PyTorchBenchmarkArguments] and [TensorFlowBenchmarkArguments] are data classes and contain all relevant configurations for their corresponding benchmark class. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_benchmarks/chunk_4.txt b/chunked/content_aware_chunking/_benchmarks/chunk_4.txt new file mode 100644 index 0000000000000000000000000000000000000000..3d244dfcad03c340972fac40e5e25cfdf7518ac9 --- /dev/null +++ b/chunked/content_aware_chunking/_benchmarks/chunk_4.txt @@ -0,0 +1,15 @@ +The following example shows how a BERT model of type google-bert/bert-base-uncased can be benchmarked. + +from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments +args = PyTorchBenchmarkArguments(models=["google-bert/bert-base-uncased"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512]) +benchmark = PyTorchBenchmark(args) + +py +from transformers import TensorFlowBenchmark, TensorFlowBenchmarkArguments +args = TensorFlowBenchmarkArguments( + models=["google-bert/bert-base-uncased"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512] + ) +benchmark = TensorFlowBenchmark(args) + +Here, three arguments are given to the benchmark argument data classes, namely models, batch_sizes, and +sequence_lengths. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_benchmarks/chunk_5.txt b/chunked/content_aware_chunking/_benchmarks/chunk_5.txt new file mode 100644 index 0000000000000000000000000000000000000000..cd367b33d494d4a8d01b1b0be30240d48d376bf0 --- /dev/null +++ b/chunked/content_aware_chunking/_benchmarks/chunk_5.txt @@ -0,0 +1,4 @@ +The argument models is required and expects a list of model identifiers from the +model hub. The list arguments batch_sizes and sequence_lengths define +the size of the input_ids on which the model is benchmarked. There are many more parameters that can be configured +via the benchmark argument data classes. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_benchmarks/chunk_6.txt b/chunked/content_aware_chunking/_benchmarks/chunk_6.txt new file mode 100644 index 0000000000000000000000000000000000000000..7bf79525da625e64331c88e2ed26ae90aeb1394f --- /dev/null +++ b/chunked/content_aware_chunking/_benchmarks/chunk_6.txt @@ -0,0 +1,3 @@ +For more detail on these one can either directly consult the files +src/transformers/benchmark/benchmark_args_utils.py, src/transformers/benchmark/benchmark_args.py (for PyTorch) +and src/transformers/benchmark/benchmark_args_tf.py (for Tensorflow). \ No newline at end of file diff --git a/chunked/content_aware_chunking/_benchmarks/chunk_7.txt b/chunked/content_aware_chunking/_benchmarks/chunk_7.txt new file mode 100644 index 0000000000000000000000000000000000000000..c2206ef855d73dd2f6f854586620ee7a943943ef --- /dev/null +++ b/chunked/content_aware_chunking/_benchmarks/chunk_7.txt @@ -0,0 +1,97 @@ +Alternatively, running the following shell +commands from root will print out a descriptive list of all configurable parameters for PyTorch and Tensorflow +respectively. + +python examples/pytorch/benchmarking/run_benchmark.py --help +An instantiated benchmark object can then simply be run by calling benchmark.run(). 
+ +results = benchmark.run() +print(results) +==================== INFERENCE - SPEED - RESULT ==================== + +Model Name Batch Size Seq Length Time in s +google-bert/bert-base-uncased 8 8 0.006 +google-bert/bert-base-uncased 8 32 0.006 +google-bert/bert-base-uncased 8 128 0.018 +google-bert/bert-base-uncased 8 512 0.088 + +==================== INFERENCE - MEMORY - RESULT ==================== +Model Name Batch Size Seq Length Memory in MB +google-bert/bert-base-uncased 8 8 1227 +google-bert/bert-base-uncased 8 32 1281 +google-bert/bert-base-uncased 8 128 1307 +google-bert/bert-base-uncased 8 512 1539 + +==================== ENVIRONMENT INFORMATION ==================== + +transformers_version: 2.11.0 +framework: PyTorch +use_torchscript: False +framework_version: 1.4.0 +python_version: 3.6.10 +system: Linux +cpu: x86_64 +architecture: 64bit +date: 2020-06-29 +time: 08:58:43.371351 +fp16: False +use_multiprocessing: True +only_pretrain_model: False +cpu_ram_mb: 32088 +use_gpu: True +num_gpus: 1 +gpu: TITAN RTX +gpu_ram_mb: 24217 +gpu_power_watts: 280.0 +gpu_performance_state: 2 +use_tpu: False + +bash +python examples/tensorflow/benchmarking/run_benchmark_tf.py --help + +An instantiated benchmark object can then simply be run by calling benchmark.run(). 
+ +results = benchmark.run() +print(results) +results = benchmark.run() +print(results) +==================== INFERENCE - SPEED - RESULT ==================== + +Model Name Batch Size Seq Length Time in s +google-bert/bert-base-uncased 8 8 0.005 +google-bert/bert-base-uncased 8 32 0.008 +google-bert/bert-base-uncased 8 128 0.022 +google-bert/bert-base-uncased 8 512 0.105 + +==================== INFERENCE - MEMORY - RESULT ==================== +Model Name Batch Size Seq Length Memory in MB +google-bert/bert-base-uncased 8 8 1330 +google-bert/bert-base-uncased 8 32 1330 +google-bert/bert-base-uncased 8 128 1330 +google-bert/bert-base-uncased 8 512 1770 + +==================== ENVIRONMENT INFORMATION ==================== + +transformers_version: 2.11.0 +framework: Tensorflow +use_xla: False +framework_version: 2.2.0 +python_version: 3.6.10 +system: Linux +cpu: x86_64 +architecture: 64bit +date: 2020-06-29 +time: 09:26:35.617317 +fp16: False +use_multiprocessing: True +only_pretrain_model: False +cpu_ram_mb: 32088 +use_gpu: True +num_gpus: 1 +gpu: TITAN RTX +gpu_ram_mb: 24217 +gpu_power_watts: 280.0 +gpu_performance_state: 2 +use_tpu: False + +By default, the time and the required memory for inference are benchmarked. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_benchmarks/chunk_8.txt b/chunked/content_aware_chunking/_benchmarks/chunk_8.txt new file mode 100644 index 0000000000000000000000000000000000000000..23c8b7ee5d999cb2d84dd04bdaedf5ea067d86cf --- /dev/null +++ b/chunked/content_aware_chunking/_benchmarks/chunk_8.txt @@ -0,0 +1,6 @@ +In the example output above the first +two sections show the result corresponding to inference time and inference memory. In addition, all relevant +information about the computing environment, e.g. the GPU type, the system, the library versions, etc are printed +out in the third section under ENVIRONMENT INFORMATION. 
This information can optionally be saved in a .csv file +by adding the argument save_to_csv=True to [PyTorchBenchmarkArguments] and +[TensorFlowBenchmarkArguments] respectively. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_benchmarks/chunk_9.txt b/chunked/content_aware_chunking/_benchmarks/chunk_9.txt new file mode 100644 index 0000000000000000000000000000000000000000..95ada0b7918931e1846c054404270bebf4fe949e --- /dev/null +++ b/chunked/content_aware_chunking/_benchmarks/chunk_9.txt @@ -0,0 +1,4 @@ +In this case, every section is saved in a separate +.csv file. The path to each .csv file can optionally be defined via the argument data classes. +Instead of benchmarking pre-trained models via their model identifier, e.g. google-bert/bert-base-uncased, the user can +alternatively benchmark an arbitrary configuration of any available model class. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_bertology/chunk_0.txt b/chunked/content_aware_chunking/_bertology/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..1479e3ccd228bba0b64215a1f60940914a0a1e64 --- /dev/null +++ b/chunked/content_aware_chunking/_bertology/chunk_0.txt @@ -0,0 +1,3 @@ +BERTology +There is a growing field of study concerned with investigating the inner workings of large-scale transformers like BERT +(that some call "BERTology"). \ No newline at end of file diff --git a/chunked/content_aware_chunking/_bertology/chunk_1.txt b/chunked/content_aware_chunking/_bertology/chunk_1.txt new file mode 100644 index 0000000000000000000000000000000000000000..b0b8ddbeb0db0f29dfb54d77fbdc16e6d4736eea --- /dev/null +++ b/chunked/content_aware_chunking/_bertology/chunk_1.txt @@ -0,0 +1,20 @@ +Some good examples of this field are: + +BERT Rediscovers the Classical NLP Pipeline by Ian Tenney, Dipanjan Das, Ellie Pavlick: + https://arxiv.org/abs/1905.05950 +Are Sixteen Heads Really Better than One? 
by Paul Michel, Omer Levy, Graham Neubig: https://arxiv.org/abs/1905.10650 +What Does BERT Look At? An Analysis of BERT's Attention by Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. + Manning: https://arxiv.org/abs/1906.04341 +CAT-probing: A Metric-based Approach to Interpret How Pre-trained Models for Programming Language Attend Code Structure: https://arxiv.org/abs/2210.04633 + +In order to help this new field develop, we have included a few additional features in the BERT/GPT/GPT-2 models to +help people access the inner representations, mainly adapted from the great work of Paul Michel +(https://arxiv.org/abs/1905.10650): + +accessing all the hidden-states of BERT/GPT/GPT-2, +accessing all the attention weights for each head of BERT/GPT/GPT-2, +retrieving the heads' output values and gradients in order to compute head importance scores and prune heads as explained + in https://arxiv.org/abs/1905.10650. + +To help you understand and use these features, we have added a specific example script: bertology.py, which extracts information from and prunes a model pre-trained on +GLUE. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_big_models/chunk_0.txt b/chunked/content_aware_chunking/_big_models/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..6957a033d4f83b09b8ddf03f4fe3d37f938c5f75 --- /dev/null +++ b/chunked/content_aware_chunking/_big_models/chunk_0.txt @@ -0,0 +1,9 @@ +Instantiating a big model +When you want to use a very big pretrained model, one challenge is to minimize the use of the RAM. The usual workflow +from PyTorch is: + +Create your model with random weights. +Load your pretrained weights. +Put those pretrained weights in your random model. + +Steps 1 and 2 both require a full version of the model in memory, which is not a problem in most cases, but if your model starts weighing several gigabytes, those two copies can make you run out of RAM. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_big_models/chunk_1.txt b/chunked/content_aware_chunking/_big_models/chunk_1.txt new file mode 100644 index 0000000000000000000000000000000000000000..81eee8f26c58233419e99e0e52c230ae0a13047e --- /dev/null +++ b/chunked/content_aware_chunking/_big_models/chunk_1.txt @@ -0,0 +1,3 @@ +Even worse, if you are using torch.distributed to launch a distributed training, each process will load the pretrained model and store these two copies in RAM. + +Note that the randomly created model is initialized with "empty" tensors, which take the space in memory without filling it (thus the random values are whatever was in this chunk of memory at a given time). \ No newline at end of file diff --git a/chunked/content_aware_chunking/_big_models/chunk_2.txt b/chunked/content_aware_chunking/_big_models/chunk_2.txt new file mode 100644 index 0000000000000000000000000000000000000000..9b24f006ff76b29439129afc8e1c08ded0d6f60c --- /dev/null +++ b/chunked/content_aware_chunking/_big_models/chunk_2.txt @@ -0,0 +1,3 @@ +The random initialization following the appropriate distribution for the kind of model/parameters instantiated (like a normal distribution for instance) is only performed after step 3 on the non-initialized weights, to be as fast as possible! + +In this guide, we explore the solutions Transformers offer to deal with this issue. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_big_models/chunk_3.txt b/chunked/content_aware_chunking/_big_models/chunk_3.txt new file mode 100644 index 0000000000000000000000000000000000000000..e63217b2d253d3cc5ab2cfbe977bc477558ec2c8 --- /dev/null +++ b/chunked/content_aware_chunking/_big_models/chunk_3.txt @@ -0,0 +1,3 @@ +Note that this is an area of active development, so the APIs explained here may change slightly in the future. 
+Sharded checkpoints +Since version 4.18.0, model checkpoints that end up taking more than 10GB of space are automatically sharded into smaller pieces. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_big_models/chunk_4.txt b/chunked/content_aware_chunking/_big_models/chunk_4.txt new file mode 100644 index 0000000000000000000000000000000000000000..af2a01682fccc55368012aa84dc20d4f334a0b63 --- /dev/null +++ b/chunked/content_aware_chunking/_big_models/chunk_4.txt @@ -0,0 +1,23 @@ +Instead of having one single checkpoint when you do model.save_pretrained(save_dir), you will end up with several partial checkpoints (each smaller than 10GB) and an index that maps parameter names to the files they are stored in. +You can control the maximum size before sharding with the max_shard_size parameter, so for the sake of an example, we'll use a normal-size model with a small shard size: let's take a traditional BERT model. + +from transformers import AutoModel +model = AutoModel.from_pretrained("google-bert/bert-base-cased") + +If you save it using [~PreTrainedModel.save_pretrained], you will get a new folder with two files: the config of the model and its weights: + +import os +import tempfile +with tempfile.TemporaryDirectory() as tmp_dir: + model.save_pretrained(tmp_dir) + print(sorted(os.listdir(tmp_dir))) +['config.json', 'pytorch_model.bin'] + +Now let's use a maximum shard size of 200MB: + +with tempfile.TemporaryDirectory() as tmp_dir: + model.save_pretrained(tmp_dir, max_shard_size="200MB") + print(sorted(os.listdir(tmp_dir))) +['config.json', 'pytorch_model-00001-of-00003.bin', 'pytorch_model-00002-of-00003.bin', 'pytorch_model-00003-of-00003.bin', 'pytorch_model.bin.index.json'] + +On top of the configuration of the model, we see three different weights files, and an index.json file which is our index. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_big_models/chunk_5.txt b/chunked/content_aware_chunking/_big_models/chunk_5.txt new file mode 100644 index 0000000000000000000000000000000000000000..688a01975c2a3bcc5607efd6e60add3209b35217 --- /dev/null +++ b/chunked/content_aware_chunking/_big_models/chunk_5.txt @@ -0,0 +1,8 @@ +A checkpoint like this can be fully reloaded using the [~PreTrainedModel.from_pretrained] method: + +with tempfile.TemporaryDirectory() as tmp_dir: + model.save_pretrained(tmp_dir, max_shard_size="200MB") + new_model = AutoModel.from_pretrained(tmp_dir) + +The main advantage of doing this for big models is that during step 2 of the workflow shown above, each shard of the checkpoint is loaded after the previous one, capping the memory usage in RAM to the model size plus the size of the biggest shard. +Behind the scenes, the index file is used to determine which keys are in the checkpoint, and where the corresponding weights are stored. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_big_models/chunk_6.txt b/chunked/content_aware_chunking/_big_models/chunk_6.txt new file mode 100644 index 0000000000000000000000000000000000000000..212b0e52c4b7a78161975b2a0fc040c3426d5412 --- /dev/null +++ b/chunked/content_aware_chunking/_big_models/chunk_6.txt @@ -0,0 +1,11 @@ +We can load that index like any json and get a dictionary: + +import json +with tempfile.TemporaryDirectory() as tmp_dir: + model.save_pretrained(tmp_dir, max_shard_size="200MB") + with open(os.path.join(tmp_dir, "pytorch_model.bin.index.json"), "r") as f: + index = json.load(f) +print(index.keys()) +dict_keys(['metadata', 'weight_map']) + +The metadata just consists of the total size of the model for now. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_big_models/chunk_7.txt b/chunked/content_aware_chunking/_big_models/chunk_7.txt new file mode 100644 index 0000000000000000000000000000000000000000..7ba080d3050ab2045e3ec5753ef7afe68475bfc9 --- /dev/null +++ b/chunked/content_aware_chunking/_big_models/chunk_7.txt @@ -0,0 +1,22 @@ +We plan to add other information in the future: + +index["metadata"] +{'total_size': 433245184} + +The weights map is the main part of this index, which maps each parameter name (as usually found in a PyTorch model state_dict) to the file it's stored in: + +index["weight_map"] +{'embeddings.LayerNorm.bias': 'pytorch_model-00001-of-00003.bin', + 'embeddings.LayerNorm.weight': 'pytorch_model-00001-of-00003.bin', + + +If you want to directly load such a sharded checkpoint inside a model without using [~PreTrainedModel.from_pretrained] (like you would do model.load_state_dict() for a full checkpoint) you should use [~modeling_utils.load_sharded_checkpoint]: + +from transformers.modeling_utils import load_sharded_checkpoint +with tempfile.TemporaryDirectory() as tmp_dir: + model.save_pretrained(tmp_dir, max_shard_size="200MB") + load_sharded_checkpoint(model, tmp_dir) + +Low memory loading +Sharded checkpoints reduce the memory usage during step 2 of the workflow mentioned above, but in order to use that model in a low memory setting, we recommend leveraging our tools based on the Accelerate library. +Please read the following guide for more information: Large model loading using Accelerate. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_0.txt b/chunked/content_aware_chunking/_chat_templating/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..1c87d484f3ac6fdca61808e80500a6fbb8b7a900 --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_0.txt @@ -0,0 +1,6 @@ +Templates for Chat Models +Introduction +An increasingly common use case for LLMs is chat. In a chat context, rather than continuing a single string +of text (as is the case with a standard language model), the model instead continues a conversation that consists +of one or more messages, each of which includes a role, like "user" or "assistant", as well as message text. +Much like tokenization, different models expect very different input formats for chat. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_1.txt b/chunked/content_aware_chunking/_chat_templating/chunk_1.txt new file mode 100644 index 0000000000000000000000000000000000000000..107eab72f456dbc353b0bcc49739b97de81b46d8 --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_1.txt @@ -0,0 +1,4 @@ +This is the reason we added +chat templates as a feature. Chat templates are part of the tokenizer. They specify how to convert conversations, +represented as lists of messages, into a single tokenizable string in the format that the model expects. +Let's make this concrete with a quick example using the BlenderBot model. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_10.txt b/chunked/content_aware_chunking/_chat_templating/chunk_10.txt new file mode 100644 index 0000000000000000000000000000000000000000..6ba01bd313140b6587a40c8b5d6e26b28d9ab3cb --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_10.txt @@ -0,0 +1,16 @@ +Let's try the Zephyr example again, but this time using +a pipeline: +thon +from transformers import pipeline +pipe = pipeline("text-generation", "HuggingFaceH4/zephyr-7b-beta") +messages = [ + { + "role": "system", + "content": "You are a friendly chatbot who always responds in the style of a pirate", + }, + {"role": "user", "content": "How many helicopters can a human eat in one sitting?"}, +] +print(pipe(messages, max_new_tokens=128)[0]['generated_text'][-1]) # Print the assistant's response + +text +{'role': 'assistant', 'content': "Matey, I'm afraid I must inform ye that humans cannot eat helicopters. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_11.txt b/chunked/content_aware_chunking/_chat_templating/chunk_11.txt new file mode 100644 index 0000000000000000000000000000000000000000..8aee07dd14317a9d60a94637b5eb040625242f78 --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_11.txt @@ -0,0 +1 @@ +Helicopters are not food, they are flying machines. Food is meant to be eaten, like a hearty plate o' grog, a savory bowl o' stew, or a delicious loaf o' bread. But helicopters, they be for transportin' and movin' around, not for eatin'. So, I'd say none, me hearties. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_12.txt b/chunked/content_aware_chunking/_chat_templating/chunk_12.txt new file mode 100644 index 0000000000000000000000000000000000000000..c6883b7f1a824685bf17321870e1983d3f39e281 --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_12.txt @@ -0,0 +1,6 @@ +None at all."} +The pipeline will take care of all the details of tokenization and calling apply_chat_template for you - +once the model has a chat template, all you need to do is initialize the pipeline and pass it the list of messages! +What are "generation prompts"? +You may have noticed that the apply_chat_template method has an add_generation_prompt argument. This argument tells +the template to add tokens that indicate the start of a bot response. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_13.txt b/chunked/content_aware_chunking/_chat_templating/chunk_13.txt new file mode 100644 index 0000000000000000000000000000000000000000..98892e5b8ea639040d7b4dc11de7fa67940bda91 --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_13.txt @@ -0,0 +1,29 @@ +For example, consider the following chat: +python +messages = [ + {"role": "user", "content": "Hi there!"}, + {"role": "assistant", "content": "Nice to meet you!"}, + {"role": "user", "content": "Can I ask a question?"} +] +Here's what this will look like without a generation prompt, using the ChatML template we saw in the Zephyr example: +python +tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False) +"""<|im_start|>user +Hi there!<|im_end|> +<|im_start|>assistant +Nice to meet you!<|im_end|> +<|im_start|>user +Can I ask a question?<|im_end|> +""" +And here's what it looks like with a generation prompt: +python +tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) +"""<|im_start|>user +Hi there!<|im_end|> 
+<|im_start|>assistant +Nice to meet you!<|im_end|> +<|im_start|>user +Can I ask a question?<|im_end|> +<|im_start|>assistant +""" +Note that this time, we've added the tokens that indicate the start of a bot response. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_14.txt b/chunked/content_aware_chunking/_chat_templating/chunk_14.txt new file mode 100644 index 0000000000000000000000000000000000000000..41a128e9bfd38f65d177494a9020dcb2070fdb0d --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_14.txt @@ -0,0 +1,6 @@ +This ensures that when the model +generates text it will write a bot response instead of doing something unexpected, like continuing the user's +message. Remember, chat models are still just language models - they're trained to continue text, and chat is just a +special kind of text to them! You need to guide them with appropriate control tokens, so they know what they're +supposed to be doing. +Not all models require generation prompts. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_15.txt b/chunked/content_aware_chunking/_chat_templating/chunk_15.txt new file mode 100644 index 0000000000000000000000000000000000000000..4e87b578b12b4c5988c64496ac2a02800a48e246 --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_15.txt @@ -0,0 +1,6 @@ +Some models, like BlenderBot and LLaMA, don't have any +special tokens before bot responses. In these cases, the add_generation_prompt argument will have no effect. The exact +effect that add_generation_prompt has will depend on the template being used. +Can I use chat templates in training? +Yes! We recommend that you apply the chat template as a preprocessing step for your dataset. After this, you +can simply continue like any other language model training task. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_16.txt b/chunked/content_aware_chunking/_chat_templating/chunk_16.txt new file mode 100644 index 0000000000000000000000000000000000000000..da9ada9d0ca014e199daebf48e5256300962eb57 --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_16.txt @@ -0,0 +1,3 @@ +When training, you should usually set +add_generation_prompt=False, because the added tokens to prompt an assistant response will not be helpful during +training. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_17.txt b/chunked/content_aware_chunking/_chat_templating/chunk_17.txt new file mode 100644 index 0000000000000000000000000000000000000000..60205047e793fe9746980e311d6667129488e2d4 --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_17.txt @@ -0,0 +1,25 @@ +Let's see an example: +thon +from transformers import AutoTokenizer +from datasets import Dataset +tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta") +chat1 = [ + {"role": "user", "content": "Which is bigger, the moon or the sun?"}, + {"role": "assistant", "content": "The sun."} +] +chat2 = [ + {"role": "user", "content": "Which is bigger, a virus or a bacterium?"}, + {"role": "assistant", "content": "A bacterium."} +] +dataset = Dataset.from_dict({"chat": [chat1, chat2]}) +dataset = dataset.map(lambda x: {"formatted_chat": tokenizer.apply_chat_template(x["chat"], tokenize=False, add_generation_prompt=False)}) +print(dataset['formatted_chat'][0]) +And we get:text +<|user|> +Which is bigger, the moon or the sun? +<|assistant|> +The sun. + +From here, just continue training like you would with a standard language modelling task, using the formatted_chat column. +Advanced: How do chat templates work? +The chat template for a model is stored on the tokenizer.chat_template attribute. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_18.txt b/chunked/content_aware_chunking/_chat_templating/chunk_18.txt new file mode 100644 index 0000000000000000000000000000000000000000..dd976097fd26ddc73c2545d51732f919d4cd8cff --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_18.txt @@ -0,0 +1,10 @@ +If no chat template is set, the +default template for that model class is used instead. Let's take a look at the template for BlenderBot: +thon + +from transformers import AutoTokenizer +tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill") +tokenizer.default_chat_template +"{% for message in messages %}{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}{{ message['content'] }}{% if not loop.last %}{{ ' ' }}{% endif %}{% endfor %}{{ eos_token }}" + +That's kind of intimidating. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_19.txt b/chunked/content_aware_chunking/_chat_templating/chunk_19.txt new file mode 100644 index 0000000000000000000000000000000000000000..3f1a3598fd26de4e23ab9c1b87546999a6eff5c1 --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_19.txt @@ -0,0 +1,4 @@ +Let's add some newlines and indentation to make it more readable. Note that the first +newline after each block as well as any preceding whitespace before a block are ignored by default, using the +Jinja trim_blocks and lstrip_blocks flags. However, be cautious - although leading whitespace on each +line is stripped, spaces between blocks on the same line are not. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_2.txt b/chunked/content_aware_chunking/_chat_templating/chunk_2.txt new file mode 100644 index 0000000000000000000000000000000000000000..e2f79a3717220d0536f09d6ad461a78f119ca7b8 --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_2.txt @@ -0,0 +1,9 @@ +BlenderBot has an extremely simple default +template, which mostly just adds whitespace between rounds of dialogue: +thon + +from transformers import AutoTokenizer +tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill") +chat = [ + {"role": "user", "content": "Hello, how are you?"}, + {"role": "assistant", "content": "I'm doing great. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_20.txt b/chunked/content_aware_chunking/_chat_templating/chunk_20.txt new file mode 100644 index 0000000000000000000000000000000000000000..e504e37337080514d13b7dc0309244ed34e0ed7b --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_20.txt @@ -0,0 +1,14 @@ +We strongly recommend checking that your template +isn't printing extra spaces where it shouldn't be! +{% for message in messages %} + {% if message['role'] == 'user' %} + {{ ' ' }} + {% endif %} + {{ message['content'] }} + {% if not loop.last %} + {{ ' ' }} + {% endif %} +{% endfor %} +{{ eos_token }} +If you've never seen one of these before, this is a Jinja template. +Jinja is a templating language that allows you to write simple code that generates text. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_21.txt b/chunked/content_aware_chunking/_chat_templating/chunk_21.txt new file mode 100644 index 0000000000000000000000000000000000000000..dea8a5f0f9296835473ef2c6df70ca416cf50174 --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_21.txt @@ -0,0 +1,12 @@ +In many ways, the code and +syntax resembles Python. 
In pure Python, this template would look something like this: +python +for idx, message in enumerate(messages): + if message['role'] == 'user': + print(' ') + print(message['content']) + if not idx == len(messages) - 1: # Check for the last message in the conversation + print(' ') +print(eos_token) +Effectively, the template does three things: +1. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_22.txt b/chunked/content_aware_chunking/_chat_templating/chunk_22.txt new file mode 100644 index 0000000000000000000000000000000000000000..763d80f387348329b5c02b2a80537ebadfa47bd2 --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_22.txt @@ -0,0 +1,3 @@ +For each message, if the message is a user message, add a blank space before it, otherwise print nothing. +2. Add the message content +3. If the message is not the last message, add two spaces after it. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_23.txt b/chunked/content_aware_chunking/_chat_templating/chunk_23.txt new file mode 100644 index 0000000000000000000000000000000000000000..529bde786fadea2e419556707a8bb294d0c3838e --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_23.txt @@ -0,0 +1,17 @@ +After the final message, print the EOS token. +This is a pretty simple template - it doesn't add any control tokens, and it doesn't support "system" messages, which +are a common way to give the model directives about how it should behave in the subsequent conversation. +But Jinja gives you a lot of flexibility to do those things! Let's see a Jinja template that can format inputs +similarly to the way LLaMA formats them (note that the real LLaMA template includes handling for default system +messages and slightly different system message handling in general - don't use this one in your actual code!) 
+{% for message in messages %} + {% if message['role'] == 'user' %} + {{ bos_token + '[INST] ' + message['content'] + ' [/INST]' }} + {% elif message['role'] == 'system' %} + {{ '<>\\n' + message['content'] + '\\n<>\\n\\n' }} + {% elif message['role'] == 'assistant' %} + {{ ' ' + message['content'] + ' ' + eos_token }} + {% endif %} +{% endfor %} +Hopefully if you stare at this for a little bit you can see what this template is doing - it adds specific tokens based +on the "role" of each message, which represents who sent it. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_24.txt b/chunked/content_aware_chunking/_chat_templating/chunk_24.txt new file mode 100644 index 0000000000000000000000000000000000000000..c3a1a736525d00598e674cf932097415df29c69c --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_24.txt @@ -0,0 +1,5 @@ +User, assistant and system messages are clearly +distinguishable to the model because of the tokens they're wrapped in. +Advanced: Adding and editing chat templates +How do I create a chat template? +Simple, just write a jinja template and set tokenizer.chat_template. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_25.txt b/chunked/content_aware_chunking/_chat_templating/chunk_25.txt new file mode 100644 index 0000000000000000000000000000000000000000..0e3d959afd25dfa11c4efd6829e2444f04ebd5c1 --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_25.txt @@ -0,0 +1,13 @@ +You may find it easier to start with an +existing template from another model and simply edit it for your needs! 
For example, we could take the LLaMA template +above and add "[ASST]" and "[/ASST]" to assistant messages: +{% for message in messages %} + {% if message['role'] == 'user' %} + {{ bos_token + '[INST] ' + message['content'].strip() + ' [/INST]' }} + {% elif message['role'] == 'system' %} + {{ '<>\\n' + message['content'].strip() + '\\n<>\\n\\n' }} + {% elif message['role'] == 'assistant' %} + {{ '[ASST] ' + message['content'] + ' [/ASST]' + eos_token }} + {% endif %} +{% endfor %} +Now, simply set the tokenizer.chat_template attribute. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_26.txt b/chunked/content_aware_chunking/_chat_templating/chunk_26.txt new file mode 100644 index 0000000000000000000000000000000000000000..8b409da19a55a64570a35c4985755069339a15d4 --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_26.txt @@ -0,0 +1,14 @@ +Next time you use [~PreTrainedTokenizer.apply_chat_template], it will +use your new template! This attribute will be saved in the tokenizer_config.json file, so you can use +[~utils.PushToHubMixin.push_to_hub] to upload your new template to the Hub and make sure everyone's using the right +template for your model! +python +template = tokenizer.chat_template +template = template.replace("SYS", "SYSTEM") # Change the system token +tokenizer.chat_template = template # Set the new template +tokenizer.push_to_hub("model_name") # Upload your new template to the Hub! +The method [~PreTrainedTokenizer.apply_chat_template] which uses your chat template is called by the [TextGenerationPipeline] class, so +once you set the correct chat template, your model will automatically become compatible with [TextGenerationPipeline]. + +If you're fine-tuning a model for chat, in addition to setting a chat template, you should probably add any new chat +control tokens as special tokens in the tokenizer. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_27.txt b/chunked/content_aware_chunking/_chat_templating/chunk_27.txt new file mode 100644 index 0000000000000000000000000000000000000000..aa3e2648335601bcfc39b9f2d402430a2d885d95 --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_27.txt @@ -0,0 +1,7 @@ +Special tokens are never split, +ensuring that your control tokens are always handled as single tokens rather than being tokenized in pieces. You +should also set the tokenizer's eos_token attribute to the token that marks the end of assistant generations in your +template. This will ensure that text generation tools can correctly figure out when to stop generating text. + +What are "default" templates? +Before the introduction of chat templates, chat handling was hardcoded at the model class level. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_28.txt b/chunked/content_aware_chunking/_chat_templating/chunk_28.txt new file mode 100644 index 0000000000000000000000000000000000000000..964e8a51ad10c1f8fdd8df85f76bf3c865329f18 --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_28.txt @@ -0,0 +1,4 @@ +For backwards +compatibility, we have retained this class-specific handling as default templates, also set at the class level. If a +model does not have a chat template set, but there is a default template for its model class, the TextGenerationPipeline +class and methods like apply_chat_template will use the class template instead. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_29.txt b/chunked/content_aware_chunking/_chat_templating/chunk_29.txt new file mode 100644 index 0000000000000000000000000000000000000000..c1a990409928bd1f9dce403a4512862611409d76 --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_29.txt @@ -0,0 +1,3 @@ +You can find out what the default +template for your tokenizer is by checking the tokenizer.default_chat_template attribute. +This is something we do purely for backward compatibility reasons, to avoid breaking any existing workflows. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_3.txt b/chunked/content_aware_chunking/_chat_templating/chunk_3.txt new file mode 100644 index 0000000000000000000000000000000000000000..ba2afb043a2e925d725e0080371cd2000c044315 --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_3.txt @@ -0,0 +1,8 @@ +How can I help you today?"}, + {"role": "user", "content": "I'd like to show off how chat templating works!"}, + ] +tokenizer.apply_chat_template(chat, tokenize=False) +" Hello, how are you? I'm doing great. How can I help you today? I'd like to show off how chat templating works!" + +Notice how the entire chat is condensed into a single string. If we use tokenize=True, which is the default setting, +that string will also be tokenized for us. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_30.txt b/chunked/content_aware_chunking/_chat_templating/chunk_30.txt new file mode 100644 index 0000000000000000000000000000000000000000..2c3a0ce9ba2ef8a3dc3201e026a63f445928b2eb --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_30.txt @@ -0,0 +1,8 @@ +Even when +the class template is appropriate for your model, we strongly recommend overriding the default template by +setting the chat_template attribute explicitly to make it clear to users that your model has been correctly configured +for chat, and to future-proof in case the default templates are ever altered or deprecated. +What template should I use? +When setting the template for a model that's already been trained for chat, you should ensure that the template +exactly matches the message formatting that the model saw during training, or else you will probably experience +performance degradation. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_31.txt b/chunked/content_aware_chunking/_chat_templating/chunk_31.txt new file mode 100644 index 0000000000000000000000000000000000000000..342755bd82e90c96b18640dac163c138e5d50615 --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_31.txt @@ -0,0 +1,2 @@ +This is true even if you're training the model further - you will probably get the best +performance if you keep the chat tokens constant. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_32.txt b/chunked/content_aware_chunking/_chat_templating/chunk_32.txt new file mode 100644 index 0000000000000000000000000000000000000000..650c5215179561fd6d204dfff69749ec4f858416 --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_32.txt @@ -0,0 +1,5 @@ +This is very analogous to tokenization - you generally get the +best performance for inference or fine-tuning when you precisely match the tokenization used during training. +If you're training a model from scratch, or fine-tuning a base language model for chat, on the other hand, +you have a lot of freedom to choose an appropriate template! LLMs are smart enough to learn to handle lots of different +input formats. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_33.txt b/chunked/content_aware_chunking/_chat_templating/chunk_33.txt new file mode 100644 index 0000000000000000000000000000000000000000..2f6f4635050f225a70ed77ea12721a057207c0be --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_33.txt @@ -0,0 +1,6 @@ +Our default template for models that don't have a class-specific template follows the +ChatML format, and this is a good, flexible choice for many use-cases. It looks like this: +{% for message in messages %} + {{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}} +{% endfor %} +If you like this one, here it is in one-liner form, ready to copy into your code. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_34.txt b/chunked/content_aware_chunking/_chat_templating/chunk_34.txt new file mode 100644 index 0000000000000000000000000000000000000000..98fee7a2f096957c79e67f571efeb6c8214af564 --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_34.txt @@ -0,0 +1,5 @@ +The one-liner also includes +handy support for generation prompts, but note that it doesn't add BOS or EOS tokens! +If your model expects those, they won't be added automatically by apply_chat_template - in other words, the +text will be tokenized with add_special_tokens=False. This is to avoid potential conflicts between the template and +the add_special_tokens logic. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_35.txt b/chunked/content_aware_chunking/_chat_templating/chunk_35.txt new file mode 100644 index 0000000000000000000000000000000000000000..939b85d38592d46ebc434518521c9ff27c3fa102 --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_35.txt @@ -0,0 +1,5 @@ +If your model expects special tokens, make sure to add them to the template! +python +tokenizer.chat_template = "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}" +This template wraps each message in <|im_start|> and <|im_end|> tokens, and simply writes the role as a string, which +allows for flexibility in the roles you train with. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_36.txt b/chunked/content_aware_chunking/_chat_templating/chunk_36.txt new file mode 100644 index 0000000000000000000000000000000000000000..b89f99a512226c9158a3cc946d71a90bd1cdd72a --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_36.txt @@ -0,0 +1,10 @@ +The output looks like this: +text +<|im_start|>system +You are a helpful chatbot that will do its best not to say anything so stupid that people tweet about it.<|im_end|> +<|im_start|>user +How are you?<|im_end|> +<|im_start|>assistant +I'm doing great!<|im_end|> +The "user", "system" and "assistant" roles are the standard for chat, and we recommend using them when it makes sense, +particularly if you want your model to operate well with [TextGenerationPipeline]. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_37.txt b/chunked/content_aware_chunking/_chat_templating/chunk_37.txt new file mode 100644 index 0000000000000000000000000000000000000000..e48b0f5df53e1ca3b8adef01bc3dec37e502e40b --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_37.txt @@ -0,0 +1,5 @@ +However, you are not limited +to these roles - templating is extremely flexible, and any string can be a role. +I want to add some chat templates! How should I get started? +If you have any chat models, you should set their tokenizer.chat_template attribute and test it using +[~PreTrainedTokenizer.apply_chat_template], then push the updated tokenizer to the Hub. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_38.txt b/chunked/content_aware_chunking/_chat_templating/chunk_38.txt new file mode 100644 index 0000000000000000000000000000000000000000..14b33efc682a14f280d6b7de792a09b3f5f1be86 --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_38.txt @@ -0,0 +1,7 @@ +This applies even if you're +not the model owner - if you're using a model with an empty chat template, or one that's still using the default class +template, please open a pull request to the model repository so that this attribute can be set properly! +Once the attribute is set, that's it, you're done! tokenizer.apply_chat_template will now work correctly for that +model, which means it is also automatically supported in places like TextGenerationPipeline! +By ensuring that models have this attribute, we can make sure that the whole community gets to use the full power of +open-source models. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_39.txt b/chunked/content_aware_chunking/_chat_templating/chunk_39.txt new file mode 100644 index 0000000000000000000000000000000000000000..67ecadc46ba5aebf92b21c0f63627361b1edcdf4 --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_39.txt @@ -0,0 +1,6 @@ +Formatting mismatches have been haunting the field and silently harming performance for too long - +it's time to put an end to them! +Advanced: Template writing tips +If you're unfamiliar with Jinja, we generally find that the easiest way to write a chat template is to first +write a short Python script that formats messages the way you want, and then convert that script into a template. +Remember that the template handler will receive the conversation history as a variable called messages. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_4.txt b/chunked/content_aware_chunking/_chat_templating/chunk_4.txt new file mode 100644 index 0000000000000000000000000000000000000000..07d5bcafffc09ac2462d9c5378a8b603bd52174e --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_4.txt @@ -0,0 +1,9 @@ +To see a more complex template in action, though, let's use the +mistralai/Mistral-7B-Instruct-v0.1 model. +thon + +from transformers import AutoTokenizer +tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1") +chat = [ + {"role": "user", "content": "Hello, how are you?"}, + {"role": "assistant", "content": "I'm doing great. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_40.txt b/chunked/content_aware_chunking/_chat_templating/chunk_40.txt new file mode 100644 index 0000000000000000000000000000000000000000..7b3ce11866fd407b32683e0e2700f23092c0a149 --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_40.txt @@ -0,0 +1,2 @@ +Each +message is a dictionary with two keys, role and content. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_41.txt b/chunked/content_aware_chunking/_chat_templating/chunk_41.txt new file mode 100644 index 0000000000000000000000000000000000000000..12c870ad7e397eee122df8bc92d4247467ea749a --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_41.txt @@ -0,0 +1,10 @@ +You will be able to access messages in your template +just like you can in Python, which means you can loop over it with {% for message in messages %} or access +individual messages with, for example, {{ messages[0] }}. 
+You can also use the following tips to convert your code to Jinja: +For loops +For loops in Jinja look like this: +{% for message in messages %} +{{ message['content'] }} +{% endfor %} +Note that whatever's inside the {{ expression block }} will be printed to the output. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_42.txt b/chunked/content_aware_chunking/_chat_templating/chunk_42.txt new file mode 100644 index 0000000000000000000000000000000000000000..e413058a66afd3bd22a087d53d12bcb9bb1bf552 --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_42.txt @@ -0,0 +1,12 @@ +You can use operators like ++ to combine strings inside expression blocks. +If statements +If statements in Jinja look like this: +{% if message['role'] == 'user' %} +{{ message['content'] }} +{% endif %} +Note how where Python uses whitespace to mark the beginnings and ends of for and if blocks, Jinja requires you +to explicitly end them with {% endfor %} and {% endif %}. +Special variables +Inside your template, you will have access to the list of messages, but you can also access several other special +variables. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_43.txt b/chunked/content_aware_chunking/_chat_templating/chunk_43.txt new file mode 100644 index 0000000000000000000000000000000000000000..d0223e30f5f3ae79202498363c9ea8616398e923 --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_43.txt @@ -0,0 +1,4 @@ +These include special tokens like bos_token and eos_token, as well as the add_generation_prompt +variable that we discussed above. You can also use the loop variable to access information about the current loop +iteration, for example using {% if loop.last %} to check if the current message is the last message in the +conversation. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_44.txt b/chunked/content_aware_chunking/_chat_templating/chunk_44.txt new file mode 100644 index 0000000000000000000000000000000000000000..12924d2af1d87971a4a8cc46f490f2e5fd50f5ac --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_44.txt @@ -0,0 +1,7 @@ +Here's an example that puts these ideas together to add a generation prompt at the end of the +conversation if add_generation_prompt is True: +{% if loop.last and add_generation_prompt %} +{{ bos_token + 'Assistant:\n' }} +{% endif %} +Notes on whitespace +As much as possible, we've tried to get Jinja to ignore whitespace outside of {{ expressions }}. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_45.txt b/chunked/content_aware_chunking/_chat_templating/chunk_45.txt new file mode 100644 index 0000000000000000000000000000000000000000..3b8d199bdf2697e7fbf18eda266c63f6d5593df7 --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_45.txt @@ -0,0 +1,4 @@ +However, be aware +that Jinja is a general-purpose templating engine, and it may treat whitespace between blocks on the same line +as significant and print it to the output. We strongly recommend checking that your template isn't printing extra +spaces where it shouldn't be before you upload it!. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_5.txt b/chunked/content_aware_chunking/_chat_templating/chunk_5.txt new file mode 100644 index 0000000000000000000000000000000000000000..5bbdd2d44252c2ca813a4dd297c29380ef4754ec --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_5.txt @@ -0,0 +1,8 @@ +How can I help you today?"}, + {"role": "user", "content": "I'd like to show off how chat templating works!"}, + ] +tokenizer.apply_chat_template(chat, tokenize=False) +"[INST] Hello, how are you? [/INST]I'm doing great. 
How can I help you today? [INST] I'd like to show off how chat templating works! [/INST]" + +Note that this time, the tokenizer has added the control tokens [INST] and [/INST] to indicate the start and end of +user messages (but not assistant messages!). \ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_6.txt b/chunked/content_aware_chunking/_chat_templating/chunk_6.txt new file mode 100644 index 0000000000000000000000000000000000000000..ff5e2c2fe4f3785f059e884672ae6f8713b6efaf --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_6.txt @@ -0,0 +1,6 @@ +Mistral-instruct was trained with these tokens, but BlenderBot was not. +How do I use chat templates? +As you can see in the example above, chat templates are easy to use. Simply build a list of messages, with role +and content keys, and then pass it to the [~PreTrainedTokenizer.apply_chat_template] method. Once you do that, +you'll get output that's ready to go! When using chat templates as input for model generation, it's also a good idea +to use add_generation_prompt=True to add a generation prompt. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_7.txt b/chunked/content_aware_chunking/_chat_templating/chunk_7.txt new file mode 100644 index 0000000000000000000000000000000000000000..7b1448fc3b0fc62cbb6b0a51e8964de4b7735d52 --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_7.txt @@ -0,0 +1,34 @@ +Here's an example of preparing input for model.generate(), using the Zephyr assistant model: +thon +from transformers import AutoModelForCausalLM, AutoTokenizer +checkpoint = "HuggingFaceH4/zephyr-7b-beta" +tokenizer = AutoTokenizer.from_pretrained(checkpoint) +model = AutoModelForCausalLM.from_pretrained(checkpoint) # You may want to use bfloat16 and/or move to GPU here +messages = [ + { + "role": "system", + "content": "You are a friendly chatbot who always responds in the style of a pirate", + }, + {"role": "user", "content": "How many helicopters can a human eat in one sitting?"}, + ] +tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt") +print(tokenizer.decode(tokenized_chat[0])) +This will yield a string in the input format that Zephyr expects.text +<|system|> +You are a friendly chatbot who always responds in the style of a pirate +<|user|> +How many helicopters can a human eat in one sitting? +<|assistant|> + +Now that our input is formatted correctly for Zephyr, we can use the model to generate a response to the user's question: +python +outputs = model.generate(tokenized_chat, max_new_tokens=128) +print(tokenizer.decode(outputs[0])) +This will yield: +text +<|system|> +You are a friendly chatbot who always responds in the style of a pirate +<|user|> +How many helicopters can a human eat in one sitting? +<|assistant|> +Matey, I'm afraid I must inform ye that humans cannot eat helicopters. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_8.txt b/chunked/content_aware_chunking/_chat_templating/chunk_8.txt new file mode 100644 index 0000000000000000000000000000000000000000..1e4536de8579cbb9c7a7c0e49ae1b9f872c96a5f --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_8.txt @@ -0,0 +1,4 @@ +Helicopters are not food, they are flying machines. Food is meant to be eaten, like a hearty plate o' grog, a savory bowl o' stew, or a delicious loaf o' bread. But helicopters, they be for transportin' and movin' around, not for eatin'. So, I'd say none, me hearties. None at all. +Arr, 'twas easy after all! +Is there an automated pipeline for chat? +Yes, there is! Our text generation pipelines support chat inputs, which makes it easy to use chat models. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_chat_templating/chunk_9.txt b/chunked/content_aware_chunking/_chat_templating/chunk_9.txt new file mode 100644 index 0000000000000000000000000000000000000000..27a1e3283c9355222aabf63209780a200a753872 --- /dev/null +++ b/chunked/content_aware_chunking/_chat_templating/chunk_9.txt @@ -0,0 +1,3 @@ +In the past, +we used to use a dedicated "ConversationalPipeline" class, but this has now been deprecated and its functionality +has been merged into the [TextGenerationPipeline]. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_community/chunk_0.txt b/chunked/content_aware_chunking/_community/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..186d1294ec9a6bddb6fdcafcfeb54f0961635b2e --- /dev/null +++ b/chunked/content_aware_chunking/_community/chunk_0.txt @@ -0,0 +1,6 @@ +Community +This page regroups resources around 🤗 Transformers developed by the community. 
+Community resources: +| Resource | Description | Author | +|:----------|:-------------|------:| +| Hugging Face Transformers Glossary Flashcards | A set of flashcards based on the Transformers Docs Glossary that has been put into a form which can be easily learned/revised using Anki an open source, cross platform app specifically designed for long term knowledge retention. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_community/chunk_1.txt b/chunked/content_aware_chunking/_community/chunk_1.txt new file mode 100644 index 0000000000000000000000000000000000000000..1b293ed6f554141d90c0be88aee2367930d329aa --- /dev/null +++ b/chunked/content_aware_chunking/_community/chunk_1.txt @@ -0,0 +1,6 @@ +See this Introductory video on how to use the flashcards. | Darigov Research | +Community notebooks: +| Notebook | Description | Author | | +|:----------|:-------------|:-------------|------:| +| Fine-tune a pre-trained Transformer to generate lyrics | How to generate lyrics in the style of your favorite artist by fine-tuning a GPT-2 model | Aleksey Korshuk | | +| Train T5 in Tensorflow 2 | How to train T5 for any task using Tensorflow 2. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_community/chunk_2.txt b/chunked/content_aware_chunking/_community/chunk_2.txt new file mode 100644 index 0000000000000000000000000000000000000000..8399f3a7c130767345b24d9608d3fc3a09cb5c23 --- /dev/null +++ b/chunked/content_aware_chunking/_community/chunk_2.txt @@ -0,0 +1,18 @@ +This notebook demonstrates a Question & Answer task implemented in Tensorflow 2 using SQUAD | Muhammad Harris | | +| Train T5 on TPU | How to train T5 on SQUAD with Transformers and Nlp | Suraj Patil | | +| Fine-tune T5 for Classification and Multiple Choice | How to fine-tune T5 for classification and multiple choice tasks using a text-to-text format with PyTorch Lightning | Suraj Patil | | +| Fine-tune DialoGPT on New Datasets and Languages | How to fine-tune the DialoGPT model on a new dataset for open-dialog conversational chatbots | Nathan Cooper | | +| Long Sequence Modeling with Reformer | How to train on sequences as long as 500,000 tokens with Reformer | Patrick von Platen | | +| Fine-tune BART for Summarization | How to fine-tune BART for summarization with fastai using blurr | Wayde Gilliam | | +| Fine-tune a pre-trained Transformer on anyone's tweets | How to generate tweets in the style of your favorite Twitter account by fine-tuning a GPT-2 model | Boris Dayma | | +| Optimize 🤗 Hugging Face models with Weights & Biases | A complete tutorial showcasing W&B integration with Hugging Face | Boris Dayma | | +| Pretrain Longformer | How to build a "long" version of existing pretrained models | Iz Beltagy | | +| Fine-tune Longformer for QA | How to fine-tune longformer model for QA task | Suraj Patil | | +| Evaluate Model with 🤗nlp | How to evaluate longformer on TriviaQA with nlp | Patrick von Platen | | +| Fine-tune T5 for Sentiment Span Extraction | How to fine-tune T5 for sentiment span extraction using a text-to-text format with PyTorch Lightning | Lorenzo Ampil | | +| Fine-tune DistilBert for Multiclass 
Classification | How to fine-tune DistilBert for multiclass classification with PyTorch | Abhishek Kumar Mishra | | +|Fine-tune BERT for Multi-label Classification|How to fine-tune BERT for multi-label classification using PyTorch|Abhishek Kumar Mishra || +|Fine-tune T5 for Summarization|How to fine-tune T5 for summarization in PyTorch and track experiments with WandB|Abhishek Kumar Mishra || +|Speed up Fine-Tuning in Transformers with Dynamic Padding / Bucketing|How to speed up fine-tuning by a factor of 2 using dynamic padding / bucketing|Michael Benesty || +|Pretrain Reformer for Masked Language Modeling| How to train a Reformer model with bi-directional self-attention layers | Patrick von Platen | | +|Expand and Fine Tune Sci-BERT| How to increase vocabulary of a pretrained SciBERT model from AllenAI on the CORD dataset and pipeline it. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_community/chunk_3.txt b/chunked/content_aware_chunking/_community/chunk_3.txt new file mode 100644 index 0000000000000000000000000000000000000000..5db1570ff2f62431fc721d22960f499a07ccc374 --- /dev/null +++ b/chunked/content_aware_chunking/_community/chunk_3.txt @@ -0,0 +1,2 @@ +| Tanmay Thakur | | +|Fine Tune BlenderBotSmall for Summarization using the Trainer API| How to fine-tune BlenderBotSmall for summarization on a custom dataset, using the Trainer API. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_community/chunk_4.txt b/chunked/content_aware_chunking/_community/chunk_4.txt new file mode 100644 index 0000000000000000000000000000000000000000..3fbe1d059671f733f6eda51250e29c449696c694 --- /dev/null +++ b/chunked/content_aware_chunking/_community/chunk_4.txt @@ -0,0 +1,32 @@ +| Tanmay Thakur | | +|Fine-tune Electra and interpret with Integrated Gradients | How to fine-tune Electra for sentiment analysis and interpret predictions with Captum Integrated Gradients | Eliza Szczechla | | +|fine-tune a non-English GPT-2 Model with Trainer class | How to fine-tune a non-English GPT-2 Model with Trainer class | Philipp Schmid | | +|Fine-tune a DistilBERT Model for Multi Label Classification task | How to fine-tune a DistilBERT Model for Multi Label Classification task | Dhaval Taunk | | +|Fine-tune ALBERT for sentence-pair classification | How to fine-tune an ALBERT model or another BERT-based model for the sentence-pair classification task | Nadir El Manouzi | | +|Fine-tune Roberta for sentiment analysis | How to fine-tune a Roberta model for sentiment analysis | Dhaval Taunk | | +|Evaluating Question Generation Models | How accurate are the answers to questions generated by your seq2seq transformer model? 
| Pascal Zoleko | | +|Classify text with DistilBERT and Tensorflow | How to fine-tune DistilBERT for text classification in TensorFlow | Peter Bayerle | | +|Leverage BERT for Encoder-Decoder Summarization on CNN/Dailymail | How to warm-start a EncoderDecoderModel with a google-bert/bert-base-uncased checkpoint for summarization on CNN/Dailymail | Patrick von Platen | | +|Leverage RoBERTa for Encoder-Decoder Summarization on BBC XSum | How to warm-start a shared EncoderDecoderModel with a FacebookAI/roberta-base checkpoint for summarization on BBC/XSum | Patrick von Platen | | +|Fine-tune TAPAS on Sequential Question Answering (SQA) | How to fine-tune TapasForQuestionAnswering with a tapas-base checkpoint on the Sequential Question Answering (SQA) dataset | Niels Rogge | | +|Evaluate TAPAS on Table Fact Checking (TabFact) | How to evaluate a fine-tuned TapasForSequenceClassification with a tapas-base-finetuned-tabfact checkpoint using a combination of the 🤗 datasets and 🤗 transformers libraries | Niels Rogge | | +|Fine-tuning mBART for translation | How to fine-tune mBART using Seq2SeqTrainer for Hindi to English translation | Vasudev Gupta | | +|Fine-tune LayoutLM on FUNSD (a form understanding dataset) | How to fine-tune LayoutLMForTokenClassification on the FUNSD dataset for information extraction from scanned documents | Niels Rogge | | +|Fine-Tune DistilGPT2 and Generate Text | How to fine-tune DistilGPT2 and generate text | Aakash Tripathi | | +|Fine-Tune LED on up to 8K tokens | How to fine-tune LED on pubmed for long-range summarization | Patrick von Platen | | +|Evaluate LED on Arxiv | How to effectively evaluate LED on long-range summarization | Patrick von Platen | | +|Fine-tune LayoutLM on RVL-CDIP (a document image classification dataset) | How to fine-tune LayoutLMForSequenceClassification on the RVL-CDIP dataset for scanned document classification | Niels Rogge | | +|Wav2Vec2 CTC decoding with GPT2 adjustment | How to decode CTC sequence with language 
model adjustment | Eric Lam | | +|Fine-tune BART for summarization in two languages with Trainer class | How to fine-tune BART for summarization in two languages with Trainer class | Eliza Szczechla | | +|Evaluate Big Bird on Trivia QA | How to evaluate BigBird on long document question answering on Trivia QA | Patrick von Platen | | +| Create video captions using Wav2Vec2 | How to create YouTube captions from any video by transcribing the audio with Wav2Vec | Niklas Muennighoff | | +| Fine-tune the Vision Transformer on CIFAR-10 using PyTorch Lightning | How to fine-tune the Vision Transformer (ViT) on CIFAR-10 using HuggingFace Transformers, Datasets and PyTorch Lightning | Niels Rogge | | +| Fine-tune the Vision Transformer on CIFAR-10 using the 🤗 Trainer | How to fine-tune the Vision Transformer (ViT) on CIFAR-10 using HuggingFace Transformers, Datasets and the 🤗 Trainer | Niels Rogge | | +| Evaluate LUKE on Open Entity, an entity typing dataset | How to evaluate LukeForEntityClassification on the Open Entity dataset | Ikuya Yamada | | +| Evaluate LUKE on TACRED, a relation extraction dataset | How to evaluate LukeForEntityPairClassification on the TACRED dataset | Ikuya Yamada | | +| Evaluate LUKE on CoNLL-2003, an important NER benchmark | How to evaluate LukeForEntitySpanClassification on the CoNLL-2003 dataset | Ikuya Yamada | | +| Evaluate BigBird-Pegasus on PubMed dataset | How to evaluate BigBirdPegasusForConditionalGeneration on PubMed dataset | Vasudev Gupta | | +| Speech Emotion Classification with Wav2Vec2 | How to leverage a pretrained Wav2Vec2 model for Emotion Classification on the MEGA dataset | Mehrdad Farahani | | +| Detect objects in an image with DETR | How to use a trained DetrForObjectDetection model to detect objects in an image and visualize attention | Niels Rogge | | +| Fine-tune DETR on a custom object detection dataset | How to fine-tune DetrForObjectDetection on a custom object detection dataset | Niels Rogge | | +| Finetune T5 for 
Named Entity Recognition | How to fine-tune T5 on a Named Entity Recognition Task | Ogundepo Odunayo | |. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_contributing/chunk_0.txt b/chunked/content_aware_chunking/_contributing/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..b2590733aed30709824c1afb292458dc96adcd55 --- /dev/null +++ b/chunked/content_aware_chunking/_contributing/chunk_0.txt @@ -0,0 +1 @@ +../../../CONTRIBUTING.md. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_create_a_model/chunk_0.txt b/chunked/content_aware_chunking/_create_a_model/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..80ef7d00867ffad1558c62112863cb4ca2980594 --- /dev/null +++ b/chunked/content_aware_chunking/_create_a_model/chunk_0.txt @@ -0,0 +1,2 @@ +Create a custom architecture +An AutoClass automatically infers the model architecture and downloads pretrained configuration and weights. Generally, we recommend using an AutoClass to produce checkpoint-agnostic code. But users who want more control over specific model parameters can create a custom 🤗 Transformers model from just a few base classes. This could be particularly useful for anyone who is interested in studying, training or experimenting with a 🤗 Transformers model. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_create_a_model/chunk_1.txt b/chunked/content_aware_chunking/_create_a_model/chunk_1.txt new file mode 100644 index 0000000000000000000000000000000000000000..f9112b68dd26d1ccb3b81ce46f37d13f8d8bdc37 --- /dev/null +++ b/chunked/content_aware_chunking/_create_a_model/chunk_1.txt @@ -0,0 +1,11 @@ +In this guide, dive deeper into creating a custom model without an AutoClass. Learn how to: + +Load and customize a model configuration. +Create a model architecture. +Create a slow and fast tokenizer for text. +Create an image processor for vision tasks. 
+Create a feature extractor for audio tasks. +Create a processor for multimodal tasks. + +Configuration +A configuration refers to a model's specific attributes. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_create_a_model/chunk_10.txt b/chunked/content_aware_chunking/_create_a_model/chunk_10.txt new file mode 100644 index 0000000000000000000000000000000000000000..bcd661141a974014c1e53ed963e20642c9243952 --- /dev/null +++ b/chunked/content_aware_chunking/_create_a_model/chunk_10.txt @@ -0,0 +1,11 @@ +However, you can still replace - some or all of - the default model configuration attributes with your own if you'd like: + +model = DistilBertModel.from_pretrained("distilbert/distilbert-base-uncased", config=my_config) + +Load your custom configuration attributes into the model: + +from transformers import TFDistilBertModel +my_config = DistilBertConfig.from_pretrained("./your_model_save_path/my_config.json") +tf_model = TFDistilBertModel(my_config) + +This creates a model with random values instead of pretrained weights. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_create_a_model/chunk_11.txt b/chunked/content_aware_chunking/_create_a_model/chunk_11.txt new file mode 100644 index 0000000000000000000000000000000000000000..eb16c7232d560c5e140b1d31546a1572968bd560 --- /dev/null +++ b/chunked/content_aware_chunking/_create_a_model/chunk_11.txt @@ -0,0 +1 @@ +You won't be able to use this model for anything useful yet until you train it. Training is a costly and time-consuming process. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_create_a_model/chunk_12.txt b/chunked/content_aware_chunking/_create_a_model/chunk_12.txt new file mode 100644 index 0000000000000000000000000000000000000000..5cd53edccd0a407229802881fe9b67610a9e17ff --- /dev/null +++ b/chunked/content_aware_chunking/_create_a_model/chunk_12.txt @@ -0,0 +1,6 @@ +It is generally better to use a pretrained model to obtain better results faster, while using only a fraction of the resources required for training. +Create a pretrained model with [~TFPreTrainedModel.from_pretrained]: + +tf_model = TFDistilBertModel.from_pretrained("distilbert/distilbert-base-uncased") + +When you load pretrained weights, the default model configuration is automatically loaded if the model is provided by 🤗 Transformers. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_create_a_model/chunk_13.txt b/chunked/content_aware_chunking/_create_a_model/chunk_13.txt new file mode 100644 index 0000000000000000000000000000000000000000..3a063b8c0aad2d645304ab23ac7675745f4e8350 --- /dev/null +++ b/chunked/content_aware_chunking/_create_a_model/chunk_13.txt @@ -0,0 +1,6 @@ +However, you can still replace - some or all of - the default model configuration attributes with your own if you'd like: + +tf_model = TFDistilBertModel.from_pretrained("distilbert/distilbert-base-uncased", config=my_config) + +Model heads +At this point, you have a base DistilBERT model which outputs the hidden states. The hidden states are passed as inputs to a model head to produce the final output. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_create_a_model/chunk_14.txt b/chunked/content_aware_chunking/_create_a_model/chunk_14.txt new file mode 100644 index 0000000000000000000000000000000000000000..5454d6b78725e642becfb1f8d6a2a2075ef38493 --- /dev/null +++ b/chunked/content_aware_chunking/_create_a_model/chunk_14.txt @@ -0,0 +1,3 @@ +🤗 Transformers provides a different model head for each task as long as a model supports the task (i.e., you can't use DistilBERT for a sequence-to-sequence task like translation). + +For example, [DistilBertForSequenceClassification] is a base DistilBERT model with a sequence classification head. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_create_a_model/chunk_15.txt b/chunked/content_aware_chunking/_create_a_model/chunk_15.txt new file mode 100644 index 0000000000000000000000000000000000000000..1f7a99755104ca643fa6a853c330205cb85ab082 --- /dev/null +++ b/chunked/content_aware_chunking/_create_a_model/chunk_15.txt @@ -0,0 +1,6 @@ +The sequence classification head is a linear layer on top of the pooled outputs. + +from transformers import DistilBertForSequenceClassification +model = DistilBertForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") + +Easily reuse this checkpoint for another task by switching to a different model head. For a question answering task, you would use the [DistilBertForQuestionAnswering] model head. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_create_a_model/chunk_16.txt b/chunked/content_aware_chunking/_create_a_model/chunk_16.txt new file mode 100644 index 0000000000000000000000000000000000000000..1217d4ca75c1f00f03099852bb00a529973820d9 --- /dev/null +++ b/chunked/content_aware_chunking/_create_a_model/chunk_16.txt @@ -0,0 +1,8 @@ +The question answering head is similar to the sequence classification head except it is a linear layer on top of the hidden states output. 
+ +from transformers import DistilBertForQuestionAnswering +model = DistilBertForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased") + + + +For example, [TFDistilBertForSequenceClassification] is a base DistilBERT model with a sequence classification head. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_create_a_model/chunk_17.txt b/chunked/content_aware_chunking/_create_a_model/chunk_17.txt new file mode 100644 index 0000000000000000000000000000000000000000..29fcaa74f29144751f9742ed859e6ad3bee9d884 --- /dev/null +++ b/chunked/content_aware_chunking/_create_a_model/chunk_17.txt @@ -0,0 +1,6 @@ +The sequence classification head is a linear layer on top of the pooled outputs. + +from transformers import TFDistilBertForSequenceClassification +tf_model = TFDistilBertForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") + +Easily reuse this checkpoint for another task by switching to a different model head. For a question answering task, you would use the [TFDistilBertForQuestionAnswering] model head. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_create_a_model/chunk_18.txt b/chunked/content_aware_chunking/_create_a_model/chunk_18.txt new file mode 100644 index 0000000000000000000000000000000000000000..0baed266ccab71cda276f0ff49b80071fe2e3fb8 --- /dev/null +++ b/chunked/content_aware_chunking/_create_a_model/chunk_18.txt @@ -0,0 +1,7 @@ +The question answering head is similar to the sequence classification head except it is a linear layer on top of the hidden states output. + +from transformers import TFDistilBertForQuestionAnswering +tf_model = TFDistilBertForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased") + +Tokenizer +The last base class you need before using a model for textual data is a tokenizer to convert raw text to tensors. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_create_a_model/chunk_19.txt b/chunked/content_aware_chunking/_create_a_model/chunk_19.txt new file mode 100644 index 0000000000000000000000000000000000000000..f255808d45d7a6867f884fa61ae2f84e5c913169 --- /dev/null +++ b/chunked/content_aware_chunking/_create_a_model/chunk_19.txt @@ -0,0 +1,4 @@ +There are two types of tokenizers you can use with 🤗 Transformers: + +[PreTrainedTokenizer]: a Python implementation of a tokenizer. +[PreTrainedTokenizerFast]: a tokenizer from our Rust-based 🤗 Tokenizer library. This tokenizer type is significantly faster - especially during batch tokenization - due to its Rust implementation. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_create_a_model/chunk_2.txt b/chunked/content_aware_chunking/_create_a_model/chunk_2.txt new file mode 100644 index 0000000000000000000000000000000000000000..88f09011a41e6570e533e4220ded90b4be35726b --- /dev/null +++ b/chunked/content_aware_chunking/_create_a_model/chunk_2.txt @@ -0,0 +1 @@ +Each model configuration has different attributes; for instance, all NLP models have the hidden_size, num_attention_heads, num_hidden_layers and vocab_size attributes in common. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_create_a_model/chunk_20.txt b/chunked/content_aware_chunking/_create_a_model/chunk_20.txt new file mode 100644 index 0000000000000000000000000000000000000000..c268a853b2e8df311c41f683fd6c056cda1bcd3c --- /dev/null +++ b/chunked/content_aware_chunking/_create_a_model/chunk_20.txt @@ -0,0 +1,5 @@ +The fast tokenizer also offers additional methods like offset mapping which maps tokens to their original words or characters. + +Both tokenizers support common methods such as encoding and decoding, adding new tokens, and managing special tokens. + +Not every model supports a fast tokenizer. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_create_a_model/chunk_21.txt b/chunked/content_aware_chunking/_create_a_model/chunk_21.txt new file mode 100644 index 0000000000000000000000000000000000000000..333702ce78f4fc46480021e522f86bdc3e4f5f28 --- /dev/null +++ b/chunked/content_aware_chunking/_create_a_model/chunk_21.txt @@ -0,0 +1,8 @@ +Take a look at this table to check if a model has fast tokenizer support. + +If you trained your own tokenizer, you can create one from your vocabulary file: + +from transformers import DistilBertTokenizer +my_tokenizer = DistilBertTokenizer(vocab_file="my_vocab_file.txt", do_lower_case=False, padding_side="left") + +It is important to remember the vocabulary from a custom tokenizer will be different from the vocabulary generated by a pretrained model's tokenizer. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_create_a_model/chunk_22.txt b/chunked/content_aware_chunking/_create_a_model/chunk_22.txt new file mode 100644 index 0000000000000000000000000000000000000000..f418f02b1b72358b666f61296ede1a46471fd37b --- /dev/null +++ b/chunked/content_aware_chunking/_create_a_model/chunk_22.txt @@ -0,0 +1 @@ +You need to use a pretrained model's vocabulary if you are using a pretrained model, otherwise the inputs won't make sense. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_create_a_model/chunk_23.txt b/chunked/content_aware_chunking/_create_a_model/chunk_23.txt new file mode 100644 index 0000000000000000000000000000000000000000..5b4e21935cf193fbbed46dd33cc84baac026402d --- /dev/null +++ b/chunked/content_aware_chunking/_create_a_model/chunk_23.txt @@ -0,0 +1,11 @@ +Create a tokenizer with a pretrained model's vocabulary with the [DistilBertTokenizer] class: + +from transformers import DistilBertTokenizer +slow_tokenizer = DistilBertTokenizer.from_pretrained("distilbert/distilbert-base-uncased") + +Create a fast tokenizer with the [DistilBertTokenizerFast] class: + +from transformers import DistilBertTokenizerFast +fast_tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert/distilbert-base-uncased") + +By default, [AutoTokenizer] will try to load a fast tokenizer. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_create_a_model/chunk_24.txt b/chunked/content_aware_chunking/_create_a_model/chunk_24.txt new file mode 100644 index 0000000000000000000000000000000000000000..28ab013b1554fb36d6f8fd399a96fe1634bf6b57 --- /dev/null +++ b/chunked/content_aware_chunking/_create_a_model/chunk_24.txt @@ -0,0 +1,5 @@ +You can disable this behavior by setting use_fast=False in from_pretrained. + +Image processor +An image processor processes vision inputs. It inherits from the base [~image_processing_utils.ImageProcessingMixin] class. +To use, create an image processor associated with the model you're using. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_create_a_model/chunk_25.txt b/chunked/content_aware_chunking/_create_a_model/chunk_25.txt new file mode 100644 index 0000000000000000000000000000000000000000..85a5917d1474e1d32bc4ac777e49997cb47f9ff5 --- /dev/null +++ b/chunked/content_aware_chunking/_create_a_model/chunk_25.txt @@ -0,0 +1,51 @@ +For example, create a default [ViTImageProcessor] if you are using ViT for image classification: + +from transformers import ViTImageProcessor +vit_extractor = ViTImageProcessor() +print(vit_extractor) +ViTImageProcessor { + "do_normalize": true, + "do_resize": true, + "image_processor_type": "ViTImageProcessor", + "image_mean": [ + 0.5, + 0.5, + 0.5 + ], + "image_std": [ + 0.5, + 0.5, + 0.5 + ], + "resample": 2, + "size": 224 +} + +If you aren't looking for any customization, just use the from_pretrained method to load a model's default image processor parameters. + +Modify any of the [ViTImageProcessor] parameters to create your custom image processor: + +from transformers import ViTImageProcessor +my_vit_extractor = ViTImageProcessor(resample="PIL.Image.BOX", do_normalize=False, image_mean=[0.3, 0.3, 0.3]) +print(my_vit_extractor) +ViTImageProcessor { + "do_normalize": false, + "do_resize": true, + "image_processor_type": "ViTImageProcessor", + "image_mean": [ + 0.3, + 0.3, + 0.3 + ], + "image_std": [ + 0.5, + 0.5, + 0.5 + ], + "resample": "PIL.Image.BOX", + "size": 224 +} + +Backbone + +Computer vision models consist of a backbone, neck, and head. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_create_a_model/chunk_26.txt b/chunked/content_aware_chunking/_create_a_model/chunk_26.txt new file mode 100644 index 0000000000000000000000000000000000000000..84ce0e1d618b3154281e99397d11ea024578bc8b --- /dev/null +++ b/chunked/content_aware_chunking/_create_a_model/chunk_26.txt @@ -0,0 +1 @@ +The backbone extracts features from an input image, the neck combines and enhances the extracted features, and the head is used for the main task (e.g., object detection). Start by initializing a backbone in the model config and specify whether you want to load pretrained weights or load randomly initialized weights. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_create_a_model/chunk_27.txt b/chunked/content_aware_chunking/_create_a_model/chunk_27.txt new file mode 100644 index 0000000000000000000000000000000000000000..ba5234d3475cabd91adba022a6d22ee8b1de4d3b --- /dev/null +++ b/chunked/content_aware_chunking/_create_a_model/chunk_27.txt @@ -0,0 +1,37 @@ +Then you can pass the model config to the model head. +For example, to load a ResNet backbone into a MaskFormer model with an instance segmentation head: + +Set use_pretrained_backbone=True to load pretrained ResNet weights for the backbone. + +from transformers import MaskFormerConfig, MaskFormerForInstanceSegmentation, ResNetConfig +config = MaskFormerConfig(backbone="microsoft/resnet-50", use_pretrained_backbone=True) # backbone and neck config +model = MaskFormerForInstanceSegmentation(config) # head + +You could also load the backbone config separately and then pass it to the model config. 
+ +from transformers import MaskFormerConfig, MaskFormerForInstanceSegmentation, ResNetConfig +backbone_config = ResNetConfig.from_pretrained("microsoft/resnet-50") +config = MaskFormerConfig(backbone_config=backbone_config) +model = MaskFormerForInstanceSegmentation(config) + +Set use_pretrained_backbone=False to randomly initialize a ResNet backbone. + +from transformers import MaskFormerConfig, MaskFormerForInstanceSegmentation, ResNetConfig +config = MaskFormerConfig(backbone="microsoft/resnet-50", use_pretrained_backbone=False) # backbone and neck config +model = MaskFormerForInstanceSegmentation(config) # head + +You could also load the backbone config separately and then pass it to the model config. + +from transformers import MaskFormerConfig, MaskFormerForInstanceSegmentation, ResNetConfig +backbone_config = ResNetConfig() +config = MaskFormerConfig(backbone_config=backbone_config) +model = MaskFormerForInstanceSegmentation(config) + +timm models are loaded with [TimmBackbone] and [TimmBackboneConfig]. + +from transformers import TimmBackboneConfig, TimmBackbone +backbone_config = TimmBackboneConfig("resnet50") +model = TimmBackbone(config=backbone_config) + +Feature extractor +A feature extractor processes audio inputs. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_create_a_model/chunk_28.txt b/chunked/content_aware_chunking/_create_a_model/chunk_28.txt new file mode 100644 index 0000000000000000000000000000000000000000..af3f48bfbc10920a9fc6cedc68dc8c8eb0a468aa --- /dev/null +++ b/chunked/content_aware_chunking/_create_a_model/chunk_28.txt @@ -0,0 +1,2 @@ +It inherits from the base [~feature_extraction_utils.FeatureExtractionMixin] class, and may also inherit from the [SequenceFeatureExtractor] class for processing audio inputs. +To use, create a feature extractor associated with the model you're using. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_create_a_model/chunk_29.txt b/chunked/content_aware_chunking/_create_a_model/chunk_29.txt new file mode 100644 index 0000000000000000000000000000000000000000..a4b806f3260f7be82e1f0c8a99cab59f1cb827a0 --- /dev/null +++ b/chunked/content_aware_chunking/_create_a_model/chunk_29.txt @@ -0,0 +1,34 @@ +For example, create a default [Wav2Vec2FeatureExtractor] if you are using Wav2Vec2 for audio classification: + +from transformers import Wav2Vec2FeatureExtractor +w2v2_extractor = Wav2Vec2FeatureExtractor() +print(w2v2_extractor) +Wav2Vec2FeatureExtractor { + "do_normalize": true, + "feature_extractor_type": "Wav2Vec2FeatureExtractor", + "feature_size": 1, + "padding_side": "right", + "padding_value": 0.0, + "return_attention_mask": false, + "sampling_rate": 16000 +} + +If you aren't looking for any customization, just use the from_pretrained method to load a model's default feature extractor parameters. + +Modify any of the [Wav2Vec2FeatureExtractor] parameters to create your custom feature extractor: + +from transformers import Wav2Vec2FeatureExtractor +w2v2_extractor = Wav2Vec2FeatureExtractor(sampling_rate=8000, do_normalize=False) +print(w2v2_extractor) +Wav2Vec2FeatureExtractor { + "do_normalize": false, + "feature_extractor_type": "Wav2Vec2FeatureExtractor", + "feature_size": 1, + "padding_side": "right", + "padding_value": 0.0, + "return_attention_mask": false, + "sampling_rate": 8000 +} + +Processor +For models that support multimodal tasks, 🤗 Transformers offers a processor class that conveniently wraps processing classes such as a feature extractor and a tokenizer into a single object. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_create_a_model/chunk_3.txt b/chunked/content_aware_chunking/_create_a_model/chunk_3.txt new file mode 100644 index 0000000000000000000000000000000000000000..b9d7a95b0cb894c65f1d3c0b8c453b193c63ab91 --- /dev/null +++ b/chunked/content_aware_chunking/_create_a_model/chunk_3.txt @@ -0,0 +1,26 @@ +These attributes specify the number of attention heads or hidden layers to construct a model with. +Get a closer look at DistilBERT by accessing [DistilBertConfig] to inspect its attributes: + +from transformers import DistilBertConfig +config = DistilBertConfig() +print(config) +DistilBertConfig { + "activation": "gelu", + "attention_dropout": 0.1, + "dim": 768, + "dropout": 0.1, + "hidden_dim": 3072, + "initializer_range": 0.02, + "max_position_embeddings": 512, + "model_type": "distilbert", + "n_heads": 12, + "n_layers": 6, + "pad_token_id": 0, + "qa_dropout": 0.1, + "seq_classif_dropout": 0.2, + "sinusoidal_pos_embds": false, + "transformers_version": "4.16.2", + "vocab_size": 30522 +} + +[DistilBertConfig] displays all the default attributes used to build a base [DistilBertModel]. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_create_a_model/chunk_30.txt b/chunked/content_aware_chunking/_create_a_model/chunk_30.txt new file mode 100644 index 0000000000000000000000000000000000000000..328b2b200f3504090808e7af6c780383277d724e --- /dev/null +++ b/chunked/content_aware_chunking/_create_a_model/chunk_30.txt @@ -0,0 +1 @@ +For example, let's use the [Wav2Vec2Processor] for an automatic speech recognition task (ASR). 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_create_a_model/chunk_31.txt b/chunked/content_aware_chunking/_create_a_model/chunk_31.txt new file mode 100644 index 0000000000000000000000000000000000000000..5b6410e34e9ebdf66255d18328ed905edb0e4ab8 --- /dev/null +++ b/chunked/content_aware_chunking/_create_a_model/chunk_31.txt @@ -0,0 +1,17 @@ +ASR transcribes audio to text, so you will need a feature extractor and a tokenizer. +Create a feature extractor to handle the audio inputs: + +from transformers import Wav2Vec2FeatureExtractor +feature_extractor = Wav2Vec2FeatureExtractor(padding_value=1.0, do_normalize=True) + +Create a tokenizer to handle the text inputs: + +from transformers import Wav2Vec2CTCTokenizer +tokenizer = Wav2Vec2CTCTokenizer(vocab_file="my_vocab_file.txt") + +Combine the feature extractor and tokenizer in [Wav2Vec2Processor]: + +from transformers import Wav2Vec2Processor +processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer) + +With two basic classes - configuration and model - and an additional preprocessing class (tokenizer, image processor, feature extractor, or processor), you can create any of the models supported by 🤗 Transformers. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_create_a_model/chunk_32.txt b/chunked/content_aware_chunking/_create_a_model/chunk_32.txt new file mode 100644 index 0000000000000000000000000000000000000000..b4b70ce12b90eff5c009513a536e22a1b60de5f8 --- /dev/null +++ b/chunked/content_aware_chunking/_create_a_model/chunk_32.txt @@ -0,0 +1 @@ +Each of these base classes are configurable, allowing you to use the specific attributes you want. You can easily setup a model for training or modify an existing pretrained model to fine-tune.. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_create_a_model/chunk_4.txt b/chunked/content_aware_chunking/_create_a_model/chunk_4.txt new file mode 100644 index 0000000000000000000000000000000000000000..f2e6bd45bebd828c1048f6bf7b7e86904f75701f --- /dev/null +++ b/chunked/content_aware_chunking/_create_a_model/chunk_4.txt @@ -0,0 +1 @@ +All attributes are customizable, creating space for experimentation. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_create_a_model/chunk_5.txt b/chunked/content_aware_chunking/_create_a_model/chunk_5.txt new file mode 100644 index 0000000000000000000000000000000000000000..69c182665a0dada821b28afdc5d760a9f8c5c3a0 --- /dev/null +++ b/chunked/content_aware_chunking/_create_a_model/chunk_5.txt @@ -0,0 +1,31 @@ +For example, you can customize a default model to: + +Try a different activation function with the activation parameter. +Use a higher dropout ratio for the attention probabilities with the attention_dropout parameter. + +my_config = DistilBertConfig(activation="relu", attention_dropout=0.4) +print(my_config) +DistilBertConfig { + "activation": "relu", + "attention_dropout": 0.4, + "dim": 768, + "dropout": 0.1, + "hidden_dim": 3072, + "initializer_range": 0.02, + "max_position_embeddings": 512, + "model_type": "distilbert", + "n_heads": 12, + "n_layers": 6, + "pad_token_id": 0, + "qa_dropout": 0.1, + "seq_classif_dropout": 0.2, + "sinusoidal_pos_embds": false, + "transformers_version": "4.16.2", + "vocab_size": 30522 +} + +Pretrained model attributes can be modified in the [~PretrainedConfig.from_pretrained] function: + +my_config = DistilBertConfig.from_pretrained("distilbert/distilbert-base-uncased", activation="relu", attention_dropout=0.4) + +Once you are satisfied with your model configuration, you can save it with [~PretrainedConfig.save_pretrained]. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_create_a_model/chunk_6.txt b/chunked/content_aware_chunking/_create_a_model/chunk_6.txt new file mode 100644 index 0000000000000000000000000000000000000000..255f49e8253ac6c1a8974bb8d65d32cbf99d9e4a --- /dev/null +++ b/chunked/content_aware_chunking/_create_a_model/chunk_6.txt @@ -0,0 +1,12 @@ +Your configuration file is stored as a JSON file in the specified save directory: + +my_config.save_pretrained(save_directory="./your_model_save_path") + +To reuse the configuration file, load it with [~PretrainedConfig.from_pretrained]: + +my_config = DistilBertConfig.from_pretrained("./your_model_save_path/config.json") + +You can also save your configuration file as a dictionary or even just the difference between your custom configuration attributes and the default configuration attributes! See the configuration documentation for more details. + +Model +The next step is to create a model. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_create_a_model/chunk_7.txt b/chunked/content_aware_chunking/_create_a_model/chunk_7.txt new file mode 100644 index 0000000000000000000000000000000000000000..f36560a026be3ee8b173288302b1cd59bbf75db6 --- /dev/null +++ b/chunked/content_aware_chunking/_create_a_model/chunk_7.txt @@ -0,0 +1 @@ +The model - also loosely referred to as the architecture - defines what each layer is doing and what operations are happening. Attributes like num_hidden_layers from the configuration are used to define the architecture. Every model shares the base class [PreTrainedModel] and a few common methods like resizing input embeddings and pruning self-attention heads. In addition, all models are also either a torch.nn.Module, tf.keras.Model or flax.linen.Module subclass. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_create_a_model/chunk_8.txt b/chunked/content_aware_chunking/_create_a_model/chunk_8.txt new file mode 100644 index 0000000000000000000000000000000000000000..f484c88416d2f31eed345824d1e244be819fa53d --- /dev/null +++ b/chunked/content_aware_chunking/_create_a_model/chunk_8.txt @@ -0,0 +1,9 @@ +This means models are compatible with each of their respective framework's usage. + +Load your custom configuration attributes into the model: + +from transformers import DistilBertModel +my_config = DistilBertConfig.from_pretrained("./your_model_save_path/config.json") +model = DistilBertModel(my_config) + +This creates a model with random values instead of pretrained weights. You won't be able to use this model for anything useful yet until you train it. Training is a costly and time-consuming process. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_create_a_model/chunk_9.txt b/chunked/content_aware_chunking/_create_a_model/chunk_9.txt new file mode 100644 index 0000000000000000000000000000000000000000..8d2a6749c6cd392198363938343e23033f619337 --- /dev/null +++ b/chunked/content_aware_chunking/_create_a_model/chunk_9.txt @@ -0,0 +1,6 @@ +It is generally better to use a pretrained model to obtain better results faster, while using only a fraction of the resources required for training. +Create a pretrained model with [~PreTrainedModel.from_pretrained]: + +model = DistilBertModel.from_pretrained("distilbert/distilbert-base-uncased") + +When you load pretrained weights, the default model configuration is automatically loaded if the model is provided by 🤗 Transformers. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_models/chunk_0.txt b/chunked/content_aware_chunking/_custom_models/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..707dd4cdd6806f3de0b6047d1bcc8ee4d91e21f7 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_models/chunk_0.txt @@ -0,0 +1,4 @@ +Building custom models +The 🤗 Transformers library is designed to be easily extensible. Every model is fully coded in a given subfolder +of the repository with no abstraction, so you can easily copy a modeling file and tweak it to your needs. +If you are writing a brand new model, it might be easier to start from scratch. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_models/chunk_1.txt b/chunked/content_aware_chunking/_custom_models/chunk_1.txt new file mode 100644 index 0000000000000000000000000000000000000000..65dbb2ecce9c3847bc9885b9af75af32d080166d --- /dev/null +++ b/chunked/content_aware_chunking/_custom_models/chunk_1.txt @@ -0,0 +1,4 @@ +In this tutorial, we will show you +how to write a custom model and its configuration so it can be used inside Transformers, and how you can share it +with the community (with the code it relies on) so that anyone can use it, even if it's not present in the 🤗 +Transformers library. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_models/chunk_10.txt b/chunked/content_aware_chunking/_custom_models/chunk_10.txt new file mode 100644 index 0000000000000000000000000000000000000000..6ab10b738ac7a2e5741d666ba6ffe333843c782d --- /dev/null +++ b/chunked/content_aware_chunking/_custom_models/chunk_10.txt @@ -0,0 +1,56 @@ +Then the +model is defined from the configuration by passing everything to the ResNet class: + +from transformers import PreTrainedModel +from timm.models.resnet import BasicBlock, Bottleneck, ResNet +from .configuration_resnet import ResnetConfig +BLOCK_MAPPING = {"basic": BasicBlock, "bottleneck": Bottleneck} +class ResnetModel(PreTrainedModel): + config_class = ResnetConfig +def __init__(self, config): + super().__init__(config) + block_layer = BLOCK_MAPPING[config.block_type] + self.model = ResNet( + block_layer, + config.layers, + num_classes=config.num_classes, + in_chans=config.input_channels, + cardinality=config.cardinality, + base_width=config.base_width, + stem_width=config.stem_width, + stem_type=config.stem_type, + avg_down=config.avg_down, + ) + +def forward(self, tensor): + return self.model.forward_features(tensor) + +For the model that will classify images, we just change the forward method: + +import torch +class ResnetModelForImageClassification(PreTrainedModel): + config_class = ResnetConfig +def __init__(self, config): + super().__init__(config) + block_layer = BLOCK_MAPPING[config.block_type] + self.model = ResNet( + block_layer, + config.layers, + num_classes=config.num_classes, + in_chans=config.input_channels, + cardinality=config.cardinality, + base_width=config.base_width, + stem_width=config.stem_width, + stem_type=config.stem_type, + avg_down=config.avg_down, + ) + +def forward(self, tensor, labels=None): + logits = self.model(tensor) + if labels is not None: + loss = torch.nn.functional.cross_entropy(logits, labels) + return {"loss": loss, "logits": logits} + return {"logits": logits} 
+ +In both cases, notice how we inherit from PreTrainedModel and call the superclass initialization with the config +(a bit like when you write a regular torch.nn.Module). \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_models/chunk_11.txt b/chunked/content_aware_chunking/_custom_models/chunk_11.txt new file mode 100644 index 0000000000000000000000000000000000000000..3c11cb6074c881745bc5f777f61cd7f26fcea353 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_models/chunk_11.txt @@ -0,0 +1,8 @@ +The line that sets the config_class is not mandatory, unless +you want to register your model with the auto classes (see last section). + +If your model is very similar to a model inside the library, you can re-use the same configuration as this model. + +You can have your model return anything you want, but returning a dictionary like we did for +ResnetModelForImageClassification, with the loss included when labels are passed, will make your model directly +usable inside the [Trainer] class. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_models/chunk_12.txt b/chunked/content_aware_chunking/_custom_models/chunk_12.txt new file mode 100644 index 0000000000000000000000000000000000000000..4e4b81878d9f6d0689080046758d83d3565dbdfd --- /dev/null +++ b/chunked/content_aware_chunking/_custom_models/chunk_12.txt @@ -0,0 +1,8 @@ +Using another output format is fine as long as you are planning on using your own +training loop or another library for training. +Now that we have our model class, let's create one: +py +resnet50d = ResnetModelForImageClassification(resnet50d_config) +Again, you can use any of the methods of [PreTrainedModel], like [~PreTrainedModel.save_pretrained] or +[~PreTrainedModel.push_to_hub]. We will use the second in the next section, and see how to push the model weights +with the code of our model. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_models/chunk_13.txt b/chunked/content_aware_chunking/_custom_models/chunk_13.txt new file mode 100644 index 0000000000000000000000000000000000000000..4258faaf4e8f15d69440c187064bcee625b2111a --- /dev/null +++ b/chunked/content_aware_chunking/_custom_models/chunk_13.txt @@ -0,0 +1,3 @@ +But first, let's load some pretrained weights inside our model. +In your own use case, you will probably be training your custom model on your own data. To go fast for this tutorial, +we will use the pretrained version of the resnet50d. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_models/chunk_14.txt b/chunked/content_aware_chunking/_custom_models/chunk_14.txt new file mode 100644 index 0000000000000000000000000000000000000000..1a9d200eabb15378eb8537eec6dab2efef9dd94e --- /dev/null +++ b/chunked/content_aware_chunking/_custom_models/chunk_14.txt @@ -0,0 +1,12 @@ +Since our model is just a wrapper around it, it's going to be +easy to transfer those weights: + +import timm +pretrained_model = timm.create_model("resnet50d", pretrained=True) +resnet50d.model.load_state_dict(pretrained_model.state_dict()) + +Now let's see how to make sure that when we do [~PreTrainedModel.save_pretrained] or [~PreTrainedModel.push_to_hub], the +code of the model is saved. +Registering a model with custom code to the auto classes +If you are writing a library that extends 🤗 Transformers, you may want to extend the auto classes to include your own +model. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_models/chunk_15.txt b/chunked/content_aware_chunking/_custom_models/chunk_15.txt new file mode 100644 index 0000000000000000000000000000000000000000..daa96783e31158990d9d8c3899458685fb4a899b --- /dev/null +++ b/chunked/content_aware_chunking/_custom_models/chunk_15.txt @@ -0,0 +1,18 @@ +This is different from pushing the code to the Hub in the sense that users will need to import your library to +get the custom models (contrarily to automatically downloading the model code from the Hub). +As long as your config has a model_type attribute that is different from existing model types, and that your model +classes have the right config_class attributes, you can just add them to the auto classes like this: + +from transformers import AutoConfig, AutoModel, AutoModelForImageClassification +AutoConfig.register("resnet", ResnetConfig) +AutoModel.register(ResnetConfig, ResnetModel) +AutoModelForImageClassification.register(ResnetConfig, ResnetModelForImageClassification) + +Note that the first argument used when registering your custom config to [AutoConfig] needs to match the model_type +of your custom config, and the first argument used when registering your custom models to any auto model class needs +to match the config_class of those models. +Sending the code to the Hub + +This API is experimental and may have some slight breaking changes in the next releases. + +First, make sure your model is fully defined in a .py file. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_models/chunk_16.txt b/chunked/content_aware_chunking/_custom_models/chunk_16.txt new file mode 100644 index 0000000000000000000000000000000000000000..88e9083cfec7d11d855f18afbe2e712c20cf822d --- /dev/null +++ b/chunked/content_aware_chunking/_custom_models/chunk_16.txt @@ -0,0 +1,4 @@ +It can rely on relative imports to some other files as +long as all the files are in the same directory (we don't support submodules for this feature yet). For our example, +we'll define a modeling_resnet.py file and a configuration_resnet.py file in a folder of the current working +directory named resnet_model. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_models/chunk_17.txt b/chunked/content_aware_chunking/_custom_models/chunk_17.txt new file mode 100644 index 0000000000000000000000000000000000000000..30791f55bbf5a9fe08749858f0a90a4f323d912c --- /dev/null +++ b/chunked/content_aware_chunking/_custom_models/chunk_17.txt @@ -0,0 +1,26 @@ +The configuration file contains the code for ResnetConfig and the modeling file +contains the code of ResnetModel and ResnetModelForImageClassification. +. +└── resnet_model + ├── __init__.py + ├── configuration_resnet.py + └── modeling_resnet.py +The __init__.py can be empty, it's just there so that Python detects resnet_model can be used as a module. + +If copying a modeling file from the library, you will need to replace all the relative imports at the top of the file +to import from the transformers package. + +Note that you can re-use (or subclass) an existing configuration/model. 
+To share your model with the community, follow those steps: first import the ResNet model and config from the newly +created files: +py +from resnet_model.configuration_resnet import ResnetConfig +from resnet_model.modeling_resnet import ResnetModel, ResnetModelForImageClassification +Then you have to tell the library you want to copy the code files of those objects when using the save_pretrained +method and properly register them with a given Auto class (especially for models), just run: +py +ResnetConfig.register_for_auto_class() +ResnetModel.register_for_auto_class("AutoModel") +ResnetModelForImageClassification.register_for_auto_class("AutoModelForImageClassification") +Note that there is no need to specify an auto class for the configuration (there is only one auto class for them, +[AutoConfig]) but it's different for models. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_models/chunk_18.txt b/chunked/content_aware_chunking/_custom_models/chunk_18.txt new file mode 100644 index 0000000000000000000000000000000000000000..cac434bb6b36b82c7a8c278054769a3f594a014c --- /dev/null +++ b/chunked/content_aware_chunking/_custom_models/chunk_18.txt @@ -0,0 +1,5 @@ +Your custom model could be suitable for many different tasks, so you +have to specify which one of the auto classes is the correct one for your model. + +Use register_for_auto_class() if you want the code files to be copied. If you instead prefer to use code on the Hub from another repo, +you don't need to call it. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_models/chunk_19.txt b/chunked/content_aware_chunking/_custom_models/chunk_19.txt new file mode 100644 index 0000000000000000000000000000000000000000..14f703570299acd4ac390bff49164e43403a62d6 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_models/chunk_19.txt @@ -0,0 +1,17 @@ +In cases where there's more than one auto class, you can modify the config.json directly using the +following structure: +json +"auto_map": { + "AutoConfig": "--", + "AutoModel": "--", + "AutoModelFor": "--", +}, + +Next, let's create the config and models as we did before: + +resnet50d_config = ResnetConfig(block_type="bottleneck", stem_width=32, stem_type="deep", avg_down=True) +resnet50d = ResnetModelForImageClassification(resnet50d_config) +pretrained_model = timm.create_model("resnet50d", pretrained=True) +resnet50d.model.load_state_dict(pretrained_model.state_dict()) + +Now to send the model to the Hub, make sure you are logged in. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_models/chunk_2.txt b/chunked/content_aware_chunking/_custom_models/chunk_2.txt new file mode 100644 index 0000000000000000000000000000000000000000..fdfabae22bef526887d46291323cd6167962b67a --- /dev/null +++ b/chunked/content_aware_chunking/_custom_models/chunk_2.txt @@ -0,0 +1,7 @@ +We'll see how to build upon transformers and extend the framework with your hooks and +custom code. +We will illustrate all of this on a ResNet model, by wrapping the ResNet class of the +timm library into a [PreTrainedModel]. +Writing a custom configuration +Before we dive into the model, let's first write its configuration. The configuration of a model is an object that +will contain all the necessary information to build the model. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_models/chunk_20.txt b/chunked/content_aware_chunking/_custom_models/chunk_20.txt new file mode 100644 index 0000000000000000000000000000000000000000..9f82fc911849b1b3aa24a68d2a656fe24293d8bc --- /dev/null +++ b/chunked/content_aware_chunking/_custom_models/chunk_20.txt @@ -0,0 +1,13 @@ +Either run in your terminal: + +huggingface-cli login +or from a notebook: + +from huggingface_hub import notebook_login +notebook_login() + +You can then push to your own namespace (or an organization you are a member of) like this: +py +resnet50d.push_to_hub("custom-resnet50d") +On top of the modeling weights and the configuration in json format, this also copied the modeling and +configuration .py files in the folder custom-resnet50d and uploaded the result to the Hub. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_models/chunk_21.txt b/chunked/content_aware_chunking/_custom_models/chunk_21.txt new file mode 100644 index 0000000000000000000000000000000000000000..291e724e49d46089455b742e13fd56937f6cee7b --- /dev/null +++ b/chunked/content_aware_chunking/_custom_models/chunk_21.txt @@ -0,0 +1,6 @@ +You can check the result +in this model repo. +See the sharing tutorial for more information on the push to Hub method. +Using a model with custom code +You can use any configuration, model or tokenizer with custom code files in its repository with the auto-classes and +the from_pretrained method. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_models/chunk_22.txt b/chunked/content_aware_chunking/_custom_models/chunk_22.txt new file mode 100644 index 0000000000000000000000000000000000000000..c0da27216e9cd7b3debe9fd461e48dad75967ec8 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_models/chunk_22.txt @@ -0,0 +1,2 @@ +All files and code uploaded to the Hub are scanned for malware (refer to the Hub security documentation for more information), but you should still +review the model code and author to avoid executing malicious code on your machine. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_models/chunk_23.txt b/chunked/content_aware_chunking/_custom_models/chunk_23.txt new file mode 100644 index 0000000000000000000000000000000000000000..a426bf47d4812882f04ae4aee534f6c4911a8b1a --- /dev/null +++ b/chunked/content_aware_chunking/_custom_models/chunk_23.txt @@ -0,0 +1,15 @@ +Set trust_remote_code=True to use +a model with custom code: + +from transformers import AutoModelForImageClassification +model = AutoModelForImageClassification.from_pretrained("sgugger/custom-resnet50d", trust_remote_code=True) + +It is also strongly encouraged to pass a commit hash as a revision to make sure the author of the models did not +update the code with some malicious new lines (unless you fully trust the authors of the models). +py +commit_hash = "ed94a7c6247d8aedce4647f00f20de6875b5b292" +model = AutoModelForImageClassification.from_pretrained( + "sgugger/custom-resnet50d", trust_remote_code=True, revision=commit_hash +) +Note that when browsing the commit history of the model repo on the Hub, there is a button to easily copy the commit +hash of any commit. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_models/chunk_3.txt b/chunked/content_aware_chunking/_custom_models/chunk_3.txt new file mode 100644 index 0000000000000000000000000000000000000000..7924d04e2a2e2f8403455f40dddf3d1666966d69 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_models/chunk_3.txt @@ -0,0 +1,6 @@ +As we will see in the next section, the model can only +take a config to be initialized, so we really need that object to be as complete as possible. + +Models in the transformers library itself generally follow the convention that they accept a config object +in their __init__ method, and then pass the whole config to sub-layers in the model, rather than breaking the +config object into multiple arguments that are all passed individually to sub-layers. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_models/chunk_4.txt b/chunked/content_aware_chunking/_custom_models/chunk_4.txt new file mode 100644 index 0000000000000000000000000000000000000000..a5e9ff783da38f7d6ffcd3ae965c0264082dbd6a --- /dev/null +++ b/chunked/content_aware_chunking/_custom_models/chunk_4.txt @@ -0,0 +1,6 @@ +Writing your model in this +style results in simpler code with a clear "source of truth" for any hyperparameters, and also makes it easier +to reuse code from other models in transformers. + +In our example, we will take a couple of arguments of the ResNet class that we might want to tweak. Different +configurations will then give us the different types of ResNets that are possible. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_models/chunk_5.txt b/chunked/content_aware_chunking/_custom_models/chunk_5.txt new file mode 100644 index 0000000000000000000000000000000000000000..7ee29b9274154a08a2683cc24bb94c10ae3eee18 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_models/chunk_5.txt @@ -0,0 +1,42 @@ +We then just store those arguments, +after checking the validity of a few of them. +thon +from transformers import PretrainedConfig +from typing import List +class ResnetConfig(PretrainedConfig): + model_type = "resnet" +def __init__( + self, + block_type="bottleneck", + layers: List[int] = [3, 4, 6, 3], + num_classes: int = 1000, + input_channels: int = 3, + cardinality: int = 1, + base_width: int = 64, + stem_width: int = 64, + stem_type: str = "", + avg_down: bool = False, + **kwargs, +): + if block_type not in ["basic", "bottleneck"]: + raise ValueError(f"`block_type` must be 'basic' or 'bottleneck', got {block_type}.") + if stem_type not in ["", "deep", "deep-tiered"]: + raise ValueError(f"`stem_type` must be '', 'deep' or 'deep-tiered', got {stem_type}.") + + self.block_type = block_type + self.layers = layers + self.num_classes = num_classes + self.input_channels = input_channels + self.cardinality = cardinality + self.base_width = base_width + self.stem_width = stem_width + self.stem_type = stem_type + self.avg_down = avg_down + super().__init__(**kwargs) + +The three important things to remember when writing your own configuration are the following: +- you have to inherit from PretrainedConfig, +- the __init__ of your PretrainedConfig must accept any kwargs, +- those kwargs need to be passed to the superclass __init__. +The inheritance is to make sure you get all the functionality from the 🤗 Transformers library, while the two other +constraints come from the fact a PretrainedConfig has more fields than the ones you are setting. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_models/chunk_6.txt b/chunked/content_aware_chunking/_custom_models/chunk_6.txt new file mode 100644 index 0000000000000000000000000000000000000000..ccabfa53b3fba0c47e8324c97120f25227d787b2 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_models/chunk_6.txt @@ -0,0 +1,7 @@ +When reloading a +config with the from_pretrained method, those fields need to be accepted by your config and then sent to the +superclass. +Defining a model_type for your configuration (here model_type="resnet") is not mandatory, unless you want to +register your model with the auto classes (see last section). +With this done, you can easily create and save your configuration like you would do with any other model config of the +library. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_models/chunk_7.txt b/chunked/content_aware_chunking/_custom_models/chunk_7.txt new file mode 100644 index 0000000000000000000000000000000000000000..6b2d9d873c4205a3c9e3de8822254c520b2869b4 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_models/chunk_7.txt @@ -0,0 +1,5 @@ +Here is how we can create a resnet50d config and save it: +py +resnet50d_config = ResnetConfig(block_type="bottleneck", stem_width=32, stem_type="deep", avg_down=True) +resnet50d_config.save_pretrained("custom-resnet") +This will save a file named config.json inside the folder custom-resnet. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_models/chunk_8.txt b/chunked/content_aware_chunking/_custom_models/chunk_8.txt new file mode 100644 index 0000000000000000000000000000000000000000..8b1ed57cfd3f3f4e4da84f8967a8df25c73d8a86 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_models/chunk_8.txt @@ -0,0 +1,8 @@ +You can then reload your config with the +from_pretrained method: +py +resnet50d_config = ResnetConfig.from_pretrained("custom-resnet") +You can also use any other method of the [PretrainedConfig] class, like [~PretrainedConfig.push_to_hub] to +directly upload your config to the Hub. +Writing a custom model +Now that we have our ResNet configuration, we can go on writing the model. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_models/chunk_9.txt b/chunked/content_aware_chunking/_custom_models/chunk_9.txt new file mode 100644 index 0000000000000000000000000000000000000000..73df0fcc97e00187c640c70ef8c6427732a30b91 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_models/chunk_9.txt @@ -0,0 +1,5 @@ +We will actually write two: one that +extracts the hidden features from a batch of images (like [BertModel]) and one that is suitable for image +classification (like [BertForSequenceClassification]). +As we mentioned before, we'll only write a loose wrapper of the model to keep it simple for this example. The only +thing we need to do before writing this class is a map between the block types and actual block classes. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_0.txt b/chunked/content_aware_chunking/_custom_tools/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..b5283f945481f29101eb993b5101f166df191fb2 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_0.txt @@ -0,0 +1,6 @@ +Custom Tools and Prompts + +If you are not aware of what tools and agents are in the context of transformers, we recommend you read the +Transformers Agents page first. + +Transformers Agents is an experimental API that is subject to change at any time. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_1.txt b/chunked/content_aware_chunking/_custom_tools/chunk_1.txt new file mode 100644 index 0000000000000000000000000000000000000000..e1e0bfdada0b36f4aa329360d8ec3755c5bde11a --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_1.txt @@ -0,0 +1,13 @@ +Results returned by the agents +can vary as the APIs or underlying models are prone to change. + +Creating and using custom tools and prompts is paramount to empowering the agent and having it perform new tasks. +In this guide we'll take a look at: + +How to customize the prompt +How to use custom tools +How to create custom tools + +Customizing the prompt +As explained in Transformers Agents agents can run in [~Agent.run] and [~Agent.chat] mode. +Both the run and chat modes underlie the same logic. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_10.txt b/chunked/content_aware_chunking/_custom_tools/chunk_10.txt new file mode 100644 index 0000000000000000000000000000000000000000..d041b28535e47290ee87659bc72c7b260b84885f --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_10.txt @@ -0,0 +1,8 @@ +Note that the only +information the agent has about the tool is its name and description, so one should make sure that both +are precisely written and match the style of the existing tools in the toolbox. In particular make sure the description +mentions all the arguments expected by name in code-style, along with the expected type and a description of what they +are. + +Check the naming and description of the curated Transformers tools to better understand what name and +description a tool is expected to have. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_11.txt b/chunked/content_aware_chunking/_custom_tools/chunk_11.txt new file mode 100644 index 0000000000000000000000000000000000000000..cc145aaad553d8d6f84dba59f8c9da756f043b72 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_11.txt @@ -0,0 +1,7 @@ +You can see all tools with the [Agent.toolbox] property. + +The third part includes a set of curated examples that show the agent exactly what code it should produce +for what kind of user request. The large language models empowering the agent are extremely good at +recognizing patterns in a prompt and repeating the pattern with new data. Therefore, it is very important +that the examples are written in a way that maximizes the likelihood of the agent generating correct, +executable code in practice. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_12.txt b/chunked/content_aware_chunking/_custom_tools/chunk_12.txt new file mode 100644 index 0000000000000000000000000000000000000000..7989da7d82782cfb430e3460f261b5cf67e6778e --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_12.txt @@ -0,0 +1,12 @@ +Let's have a look at one example: +```text +Task: "Identify the oldest person in the document and create an image showcasing the result as a banner." +I will use the following tools: document_qa to find the oldest person in the document, then image_generator to generate an image according to the answer. +Answer: +py +answer = document_qa(document, question="What is the oldest person?") +print(f"The answer is {answer}.") +image = image_generator("A banner showing " + answer) +` +The pattern the model is prompted to repeat has three parts: The task statement, the agent's explanation of +what it intends to do, and finally the generated code. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_13.txt b/chunked/content_aware_chunking/_custom_tools/chunk_13.txt new file mode 100644 index 0000000000000000000000000000000000000000..24248b97e51275a0042df4d360ade1cc5aea1695 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_13.txt @@ -0,0 +1,11 @@ +Every example that is part of the prompt has this exact +pattern, thus making sure that the agent will reproduce exactly the same pattern when generating new tokens. +The prompt examples are curated by the Transformers team and rigorously evaluated on a set of +problem statements +to ensure that the agent's prompt is as good as possible to solve real use cases of the agent. +The final part of the prompt corresponds to: +```text +Task: "Draw me a picture of rivers and lakes" +I will use the following + +is a final and unfinished example that the agent is tasked to complete. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_14.txt b/chunked/content_aware_chunking/_custom_tools/chunk_14.txt new file mode 100644 index 0000000000000000000000000000000000000000..020968130c1a6a60a70068d4c13060e1b3ed2c35 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_14.txt @@ -0,0 +1,6 @@ +The unfinished example +is dynamically created based on the actual user input. For the above example, the user ran: +py +agent.run("Draw me a picture of rivers and lakes") +The user input - a.k.a the task: "Draw me a picture of rivers and lakes" is cast into the +prompt template: "Task: \n\n I will use the following". \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_15.txt b/chunked/content_aware_chunking/_custom_tools/chunk_15.txt new file mode 100644 index 0000000000000000000000000000000000000000..63a239b2cc738efea404a05b9329be487e6d2d18 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_15.txt @@ -0,0 +1,14 @@ +This sentence makes up the final lines of the +prompt the agent is conditioned on, therefore strongly influencing the agent to finish the example +exactly in the same way it was previously done in the examples. +Without going into too much detail, the chat template has the same prompt structure with the +examples having a slightly different style, e.g.: +````text +[] +===== +Human: Answer the question in the variable question about the image stored in the variable image. +Assistant: I will use the tool image_qa to answer the question on the input image. +py +answer = image_qa(text=question, image=image) +print(f"The answer is {answer}") +Human: I tried this code, it worked but didn't give me a good result. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_16.txt b/chunked/content_aware_chunking/_custom_tools/chunk_16.txt new file mode 100644 index 0000000000000000000000000000000000000000..e6e107ffea240293835abe1acc39b3d5d679d525 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_16.txt @@ -0,0 +1,2 @@ +The question is in French +Assistant: In this case, the question needs to be translated first. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_17.txt b/chunked/content_aware_chunking/_custom_tools/chunk_17.txt new file mode 100644 index 0000000000000000000000000000000000000000..30e6d9bff9e3949ad72fe3f6d0c53bd3d9974446 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_17.txt @@ -0,0 +1,11 @@ +I will use the tool translator to do this. +py +translated_question = translator(question=question, src_lang="French", tgt_lang="English") +print(f"The translated question is {translated_question}.") +answer = image_qa(text=translated_question, image=image) +print(f"The answer is {answer}") +===== +[] +` +Contrary to the examples of the run prompt, each chat prompt example has one or more exchanges between the +Human and the Assistant. Every exchange is structured similarly to the example of the run prompt. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_18.txt b/chunked/content_aware_chunking/_custom_tools/chunk_18.txt new file mode 100644 index 0000000000000000000000000000000000000000..ca428797affff0913d1562b3d9463a513fe85d4f --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_18.txt @@ -0,0 +1,3 @@ +The user's input is appended after Human: and the agent is prompted to first generate what needs to be done +before generating code. An exchange can be based on previous exchanges, therefore allowing the user to refer +to past exchanges as is done e.g. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_19.txt b/chunked/content_aware_chunking/_custom_tools/chunk_19.txt new file mode 100644 index 0000000000000000000000000000000000000000..02b6cfa0d2954c618b890b0aafd2099cafee0fd9 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_19.txt @@ -0,0 +1,6 @@ +above by the user's input of "I tried this code" refers to the +previously generated code of the agent. +Upon running .chat, the user's input or task is cast into an unfinished example of the form: +text +Human: \n\nAssistant: +which the agent completes. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_2.txt b/chunked/content_aware_chunking/_custom_tools/chunk_2.txt new file mode 100644 index 0000000000000000000000000000000000000000..b2d96cba8fce7fcdc29ffbcb188638f944b1b01b --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_2.txt @@ -0,0 +1,4 @@ +The language model powering the agent is conditioned on a long +prompt and completes the prompt by generating the next tokens until the stop token is reached. +The only difference between the two modes is that during the chat mode the prompt is extended with +previous user inputs and model generations. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_20.txt b/chunked/content_aware_chunking/_custom_tools/chunk_20.txt new file mode 100644 index 0000000000000000000000000000000000000000..00fe044403d82e41341f48bbeeb3eaf4c1d1be07 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_20.txt @@ -0,0 +1,6 @@ +Contrary to the run command, the chat command then appends the completed example +to the prompt, thus giving the agent more context for the next chat turn. +Great, now that we know how the prompt is structured, let's see how we can customize it! 
+Writing good user inputs +While large language models are getting better and better at understanding users' intentions, it helps +enormously to be as precise as possible to help the agent pick the correct task. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_21.txt b/chunked/content_aware_chunking/_custom_tools/chunk_21.txt new file mode 100644 index 0000000000000000000000000000000000000000..4395b22f87ae7581a944c147ac059c6b2243df7e --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_21.txt @@ -0,0 +1,5 @@ +What does it mean to be +as precise as possible? +The agent sees a list of tool names and their description in its prompt. The more tools are added the +more difficult it becomes for the agent to choose the correct tool and it's even more difficult to choose +the correct sequences of tools to run. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_22.txt b/chunked/content_aware_chunking/_custom_tools/chunk_22.txt new file mode 100644 index 0000000000000000000000000000000000000000..3ca407aa354f8c692577ce0d6146625ce1a54bf2 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_22.txt @@ -0,0 +1,15 @@ +Let's look at a common failure case, here we will only return +the code to analyze it. + +from transformers import HfAgent +agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder") +agent.run("Show me a tree", return_code=True) + +gives: +``text +==Explanation from the agent== +I will use the following tool:image_segmenter` to create a segmentation mask for the image. +==Code generated by the agent== +mask = image_segmenter(image, prompt="tree") + +which is probably not what we wanted. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_23.txt b/chunked/content_aware_chunking/_custom_tools/chunk_23.txt new file mode 100644 index 0000000000000000000000000000000000000000..f36cf0376d9a9a3249e3ef276f80801fb8f7e7d2 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_23.txt @@ -0,0 +1,7 @@ +Instead, it is more likely that we want an image of a tree to be generated. +To steer the agent more towards using a specific tool it can therefore be very helpful to use important keywords that +are present in the tool's name and description. Let's have a look. +py +agent.toolbox["image_generator"].description +text +'This is a tool that creates an image according to a prompt, which is a text description. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_24.txt b/chunked/content_aware_chunking/_custom_tools/chunk_24.txt new file mode 100644 index 0000000000000000000000000000000000000000..26ea5660f1b0213603a5f08b10bc5ffe543e76c8 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_24.txt @@ -0,0 +1,2 @@ +It takes an input named `prompt` which contains the image description and outputs an image. +The name and description make use of the keywords "image", "prompt", "create" and "generate". Using these words will most likely work better here. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_25.txt b/chunked/content_aware_chunking/_custom_tools/chunk_25.txt new file mode 100644 index 0000000000000000000000000000000000000000..003269f9f80ffc21c3014dd55ad23213a7fef4f1 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_25.txt @@ -0,0 +1,11 @@ +Let's refine our prompt a bit. +py +agent.run("Create an image of a tree", return_code=True) +gives: +``text +==Explanation from the agent== +I will use the following toolimage_generator` to generate an image of a tree. 
+==Code generated by the agent== +image = image_generator(prompt="tree") + +Much better! That looks more like what we want. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_26.txt b/chunked/content_aware_chunking/_custom_tools/chunk_26.txt new file mode 100644 index 0000000000000000000000000000000000000000..5d8d819e8649824aede6677ed5810ed369c3ce5f --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_26.txt @@ -0,0 +1,5 @@ +In short, when you notice that the agent struggles to +correctly map your task to the correct tools, try looking up the most pertinent keywords of the tool's name +and description and try refining your task request with it. +Customizing the tool descriptions +As we've seen before the agent has access to each of the tools' names and descriptions. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_27.txt b/chunked/content_aware_chunking/_custom_tools/chunk_27.txt new file mode 100644 index 0000000000000000000000000000000000000000..c6ad9ceb70be35488fdcd096c1c7cd1080428786 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_27.txt @@ -0,0 +1,5 @@ +The base tools +should have very precise names and descriptions, however, you might find that it could help to change the +description or name of a tool for your specific use case. This might become especially important +when you've added multiple tools that are very similar or if you want to use your agent only for a certain +domain, e.g. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_28.txt b/chunked/content_aware_chunking/_custom_tools/chunk_28.txt new file mode 100644 index 0000000000000000000000000000000000000000..915d458a86982ff0d9806cde88dbf20e8fed4d7c --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_28.txt @@ -0,0 +1,15 @@ +image generation and transformations. 
+A common problem is that the agent confuses image generation with image transformation/modification when +used a lot for image generation tasks, e.g. +py +agent.run("Make an image of a house and a car", return_code=True) +returns +``text +==Explanation from the agent== +I will use the following toolsimage_generatorto generate an image of a house andimage_transformer` to transform the image of a car into the image of a house. +==Code generated by the agent== +house_image = image_generator(prompt="A house") +car_image = image_generator(prompt="A car") +house_car_image = image_transformer(image=car_image, prompt="A house") + +which is probably not exactly what we want here. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_29.txt b/chunked/content_aware_chunking/_custom_tools/chunk_29.txt new file mode 100644 index 0000000000000000000000000000000000000000..8bc2ce51b2fd8f57445ec25e872b1bb9c1f7e9f3 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_29.txt @@ -0,0 +1,3 @@ +It seems like the agent has a difficult time +to understand the difference between image_generator and image_transformer and often uses the two together. +We can help the agent here by changing the tool name and description of image_transformer. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_3.txt b/chunked/content_aware_chunking/_custom_tools/chunk_3.txt new file mode 100644 index 0000000000000000000000000000000000000000..e7348e905c37bb9318bcef4109d7d14c1160b15e --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_3.txt @@ -0,0 +1,9 @@ +This allows the agent to have access to past interactions, +seemingly giving the agent some kind of memory. +Structure of the prompt +Let's take a closer look at how the prompt is structured to understand how it can be best customized. +The prompt is structured broadly into four parts. 
+ +Introduction: how the agent should behave, explanation of the concept of tools. + +Description of all the tools. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_30.txt b/chunked/content_aware_chunking/_custom_tools/chunk_30.txt new file mode 100644 index 0000000000000000000000000000000000000000..f81bf0689fa0d7a1e321d31a621c240064fac3bd --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_30.txt @@ -0,0 +1,8 @@ +Let's instead call it modifier +to disassociate it a bit from "image" and "prompt": +py +agent.toolbox["modifier"] = agent.toolbox.pop("image_transformer") +agent.toolbox["modifier"].description = agent.toolbox["modifier"].description.replace( + "transforms an image according to a prompt", "modifies an image" +) +Now "modify" is a strong cue to use the new image processor which should help with the above prompt. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_31.txt b/chunked/content_aware_chunking/_custom_tools/chunk_31.txt new file mode 100644 index 0000000000000000000000000000000000000000..e3af9e5813f10846c78d6c78968b700f8b5fd06b --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_31.txt @@ -0,0 +1,12 @@ +Let's run it again. +py +agent.run("Make an image of a house and a car", return_code=True) +Now we're getting: +``text +==Explanation from the agent== +I will use the following tools:image_generatorto generate an image of a house, thenimage_generator` to generate an image of a car. +==Code generated by the agent== +house_image = image_generator(prompt="A house") +car_image = image_generator(prompt="A car") + +which is definitely closer to what we had in mind! However, we want to have both the house and car in the same image. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_32.txt b/chunked/content_aware_chunking/_custom_tools/chunk_32.txt new file mode 100644 index 0000000000000000000000000000000000000000..aac641ea60895fd368a228bb9cddbf68467c3aa7 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_32.txt @@ -0,0 +1,17 @@ +Steering the task more toward single image generation should help: +py +agent.run("Create image: 'A house and car'", return_code=True) +``text +==Explanation from the agent== +I will use the following tool:image_generator` to generate an image. +==Code generated by the agent== +image = image_generator(prompt="A house and car") + +Agents are still brittle for many use cases, especially when it comes to +slightly more complex use cases like generating an image of multiple objects. +Both the agent itself and the underlying prompt will be further improved in the coming +months making sure that agents become more robust to a variety of user inputs. + +Customizing the whole prompt +To give the user maximum flexibility, the whole prompt template as explained above +can be overwritten by the user. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_33.txt b/chunked/content_aware_chunking/_custom_tools/chunk_33.txt new file mode 100644 index 0000000000000000000000000000000000000000..dd0cf33a8595f48685734b3b9c83f081ea19007e --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_33.txt @@ -0,0 +1,2 @@ +In this case make sure that your custom prompt includes an introduction section, +a tool section, an example section, and an unfinished example section. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_34.txt b/chunked/content_aware_chunking/_custom_tools/chunk_34.txt new file mode 100644 index 0000000000000000000000000000000000000000..93714bd3ca8b4af5d94d3d715ea406198389fb60 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_34.txt @@ -0,0 +1,10 @@ +If you want to overwrite the run prompt template, +you can do as follows: + +template = """ [] """ +agent = HfAgent(your_endpoint, run_prompt_template=template) + +Please make sure to have the <> string and the <> defined somewhere in the template so that the agent can be aware +of the tools it has available, as well as correctly insert the user's prompt. + +Similarly, one can overwrite the chat prompt template. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_35.txt b/chunked/content_aware_chunking/_custom_tools/chunk_35.txt new file mode 100644 index 0000000000000000000000000000000000000000..dadfc768abca09c4b2895b100ec343e2492b2115 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_35.txt @@ -0,0 +1,15 @@ +Note that the chat mode always uses the following format for the exchanges: +```text +Human: <> +Assistant: + +Therefore, it is important that the examples of the custom chat prompt template also make use of this format. +You can overwrite the chat template at instantiation as follows. +thon +template = """ [] """ +agent = HfAgent(url_endpoint=your_endpoint, chat_prompt_template=template) + +Please make sure to have the <> string defined somewhere in the template so that the agent can be aware +of the tools it has available. + +In both cases, you can pass a repo ID instead of the prompt template if you would like to use a template hosted by someone in the community. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_36.txt b/chunked/content_aware_chunking/_custom_tools/chunk_36.txt new file mode 100644 index 0000000000000000000000000000000000000000..cc52f093e5853430ed7c9a52e0d228a075575e7a --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_36.txt @@ -0,0 +1,22 @@ +The default prompts live in this repo as an example. +To upload your custom prompt on a repo on the Hub and share it with the community just make sure: +- to use a dataset repository +- to put the prompt template for the run command in a file named run_prompt_template.txt +- to put the prompt template for the chat command in a file named chat_prompt_template.txt +Using custom tools +In this section, we'll be leveraging two existing custom tools that are specific to image generation: + +We replace huggingface-tools/image-transformation, + with diffusers/controlnet-canny-tool + to allow for more image modifications. +We add a new tool for image upscaling to the default toolbox: + diffusers/latent-upscaler-tool replace the existing image-transformation tool. + +We'll start by loading the custom tools with the convenient [load_tool] function: + +from transformers import load_tool +controlnet_transformer = load_tool("diffusers/controlnet-canny-tool") +upscaler = load_tool("diffusers/latent-upscaler-tool") + +Upon adding custom tools to an agent, the tools' descriptions and names are automatically +included in the agents' prompts. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_37.txt b/chunked/content_aware_chunking/_custom_tools/chunk_37.txt new file mode 100644 index 0000000000000000000000000000000000000000..abe8339fa0f8fc8d8526c75abe0ee925d66bdf6b --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_37.txt @@ -0,0 +1,9 @@ +Thus, it is imperative that custom tools have +a well-written description and name in order for the agent to understand how to use them. +Let's take a look at the description and name of controlnet_transformer: +py +print(f"Description: '{controlnet_transformer.description}'") +print(f"Name: '{controlnet_transformer.name}'") +gives +text +Description: 'This is a tool that transforms an image with ControlNet according to a prompt. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_38.txt b/chunked/content_aware_chunking/_custom_tools/chunk_38.txt new file mode 100644 index 0000000000000000000000000000000000000000..0040646ecccd126cd51780fbf5b3404aa4793b2c --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_38.txt @@ -0,0 +1 @@ +It takes two inputs: `image`, which should be the image to transform, and `prompt`, which should be the prompt to use to change it. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_39.txt b/chunked/content_aware_chunking/_custom_tools/chunk_39.txt new file mode 100644 index 0000000000000000000000000000000000000000..71a98b7bf59f81b8051ac09bf064fda27a13e5cd --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_39.txt @@ -0,0 +1,15 @@ +It returns the modified image.' +Name: 'image_transformer' +The name and description are accurate and fit the style of the curated set of tools. 
+Next, let's instantiate an agent with controlnet_transformer and upscaler: +py +tools = [controlnet_transformer, upscaler] +agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder", additional_tools=tools) +This command should give you the following info: +text +image_transformer has been replaced by as provided in `additional_tools` +The set of curated tools already has an image_transformer tool which is hereby replaced with our custom tool. + +Overwriting existing tools can be beneficial if we want to use a custom tool exactly for the same task as an existing tool +because the agent is well-versed in using the specific task. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_4.txt b/chunked/content_aware_chunking/_custom_tools/chunk_4.txt new file mode 100644 index 0000000000000000000000000000000000000000..4d4b56dccf289e5dc03f79cbe0c66a42ad311f22 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_4.txt @@ -0,0 +1,13 @@ +This is defined by a <> token that is dynamically replaced at runtime with the tools defined/chosen by the user. + +A set of examples of tasks and their solution + +Current example, and request for solution. + +To better understand each part, let's look at a shortened version of what the run prompt can look like: +````text +I will ask you to perform a task, your job is to come up with a series of simple commands in Python that will perform the task. +[] +You can print intermediate results if it makes sense to do so. +Tools: +- document_qa: This is a tool that answers a question about a document (pdf). 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_40.txt b/chunked/content_aware_chunking/_custom_tools/chunk_40.txt new file mode 100644 index 0000000000000000000000000000000000000000..e2569dcf9911ca3b0146078bf26d200c0fb6cdd2 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_40.txt @@ -0,0 +1,45 @@ +Beware that the custom tool should follow the exact same API +as the overwritten tool in this case, or you should adapt the prompt template to make sure all examples using that +tool are updated. + +The upscaler tool was given the name image_upscaler which is not yet present in the default toolbox and is therefore simply added to the list of tools. +You can always have a look at the toolbox that is currently available to the agent via the agent.toolbox attribute: +py +print("\n".join([f"- {a}" for a in agent.toolbox.keys()])) +text +- document_qa +- image_captioner +- image_qa +- image_segmenter +- transcriber +- summarizer +- text_classifier +- text_qa +- text_reader +- translator +- image_transformer +- text_downloader +- image_generator +- video_generator +- image_upscaler +Note how image_upscaler is now part of the agents' toolbox. +Let's now try out the new tools! We will re-use the image we generated in Transformers Agents Quickstart. + +from diffusers.utils import load_image +image = load_image( + "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rivers_and_lakes.png" +) + + +Let's transform the image into a beautiful winter landscape: +py +image = agent.run("Transform the image: 'A frozen lake and snowy forest'", image=image) +``text +==Explanation from the agent== +I will use the following tool:image_transformer` to transform the image. +==Code generated by the agent== +image = image_transformer(image, prompt="A frozen lake and snowy forest") + + +The new image processing tool is based on ControlNet which can make very strong modifications to the image. 
+By default the image processing tool returns an image of size 512x512 pixels. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_41.txt b/chunked/content_aware_chunking/_custom_tools/chunk_41.txt new file mode 100644 index 0000000000000000000000000000000000000000..8c81af794df23f9a1252fd13f42c2ab7ee3d429d --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_41.txt @@ -0,0 +1,17 @@ +Let's see if we can upscale it. +py +image = agent.run("Upscale the image", image) +``text +==Explanation from the agent== +I will use the following tool:image_upscaler` to upscale the image. +==Code generated by the agent== +upscaled_image = image_upscaler(image) + + +The agent automatically mapped our prompt "Upscale the image" to the just added upscaler tool purely based on the description and name of the upscaler tool +and was able to correctly run it. +Next, let's have a look at how you can create a new custom tool. +Adding new tools +In this section, we show how to create a new tool that can be added to the agent. +Creating a new tool +We'll first start by creating a tool. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_42.txt b/chunked/content_aware_chunking/_custom_tools/chunk_42.txt new file mode 100644 index 0000000000000000000000000000000000000000..09989e05c7b4e67d03d03167fe3b708510eefede --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_42.txt @@ -0,0 +1,12 @@ +We'll add the not-so-useful yet fun task of fetching the model on the Hugging Face +Hub with the most downloads for a given task. +We can do that with the following code: +thon +from huggingface_hub import list_models +task = "text-classification" +model = next(iter(list_models(filter=task, sort="downloads", direction=-1))) +print(model.id) + +For the task text-classification, this returns 'facebook/bart-large-mnli', for translation it returns 'google-t5/t5-base'. 
+How do we convert this to a tool that the agent can leverage? All tools depend on the superclass Tool that holds the +main attributes necessary. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_43.txt b/chunked/content_aware_chunking/_custom_tools/chunk_43.txt new file mode 100644 index 0000000000000000000000000000000000000000..a40965e6f776970c75b1c25de620a7dd3ddedc2d --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_43.txt @@ -0,0 +1,11 @@ +We'll create a class that inherits from it: +thon +from transformers import Tool +class HFModelDownloadsTool(Tool): + pass + +This class has a few needs: +- An attribute name, which corresponds to the name of the tool itself. To be in tune with other tools which have a + performative name, we'll name it model_download_counter. +- An attribute description, which will be used to populate the prompt of the agent. +- inputs and outputs attributes. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_44.txt b/chunked/content_aware_chunking/_custom_tools/chunk_44.txt new file mode 100644 index 0000000000000000000000000000000000000000..2d2ae0f247a701041b715b281454531ae1fadc6e --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_44.txt @@ -0,0 +1,4 @@ +Defining this will help the python interpreter make educated choices about types, + and will allow for a gradio-demo to be spawned when we push our tool to the Hub. They're both a list of expected + values, which can be text, image, or audio. +- A __call__ method which contains the inference code. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_45.txt b/chunked/content_aware_chunking/_custom_tools/chunk_45.txt new file mode 100644 index 0000000000000000000000000000000000000000..6fde99e3805a5fcd6066c09481b23d0271118878 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_45.txt @@ -0,0 +1,9 @@ +This is the code we've played with above! +Here's what our class looks like now: +thon +from transformers import Tool +from huggingface_hub import list_models +class HFModelDownloadsTool(Tool): + name = "model_download_counter" + description = ( + "This is a tool that returns the most downloaded model of a given task on the Hugging Face Hub. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_46.txt b/chunked/content_aware_chunking/_custom_tools/chunk_46.txt new file mode 100644 index 0000000000000000000000000000000000000000..77f633757971b43698a29b0a0603697921c0e899 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_46.txt @@ -0,0 +1,12 @@ +" + "It takes the name of the category (such as text-classification, depth-estimation, etc), and " + "returns the name of the checkpoint." + ) +inputs = ["text"] +outputs = ["text"] + +def __call__(self, task: str): + model = next(iter(list_models(filter=task, sort="downloads", direction=-1))) + return model.id + +We now have our tool handy. Save it in a file and import it from your main script. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_47.txt b/chunked/content_aware_chunking/_custom_tools/chunk_47.txt new file mode 100644 index 0000000000000000000000000000000000000000..0a2c69f47a96dfa3d478c0352fd37beccb6caca5 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_47.txt @@ -0,0 +1,8 @@ +Let's name this file +model_downloads.py, so the resulting import code looks like this: +thon +from model_downloads import HFModelDownloadsTool +tool = HFModelDownloadsTool() + +In order to let others benefit from it and for simpler initialization, we recommend pushing it to the Hub under your +namespace. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_48.txt b/chunked/content_aware_chunking/_custom_tools/chunk_48.txt new file mode 100644 index 0000000000000000000000000000000000000000..8771da0131f8009e3f55f354c08992d3dbc65a42 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_48.txt @@ -0,0 +1,31 @@ +To do so, just call push_to_hub on the tool variable: +python +tool.push_to_hub("hf-model-downloads") +You now have your code on the Hub! Let's take a look at the final step, which is to have the agent use it. +Having the agent use the tool +We now have our tool that lives on the Hub which can be instantiated as such (change the user name for your tool): +thon +from transformers import load_tool +tool = load_tool("lysandre/hf-model-downloads") + +In order to use it in the agent, simply pass it in the additional_tools parameter of the agent initialization method: +thon +from transformers import HfAgent +agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder", additional_tools=[tool]) +agent.run( + "Can you read out loud the name of the model that has the most downloads in the 'text-to-video' task on the Hugging Face Hub?" 
+) +which outputs the following:text +==Code generated by the agent== +model = model_download_counter(task="text-to-video") +print(f"The model with the most downloads is {model}.") +audio_model = text_reader(model) +==Result== +The model with the most downloads is damo-vilab/text-to-video-ms-1.7b. + +and generates the following audio. +| Audio | +|------------------------------------------------------------------------------------------------------------------------------------------------------| +| | + +Depending on the LLM, some are quite brittle and require very exact prompts in order to work well. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_49.txt b/chunked/content_aware_chunking/_custom_tools/chunk_49.txt new file mode 100644 index 0000000000000000000000000000000000000000..711f4c84d5c32622270300b316a799db41f19bd9 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_49.txt @@ -0,0 +1,5 @@ +Having a well-defined +name and description of the tool is paramount to having it be leveraged by the agent. + +Replacing existing tools +Replacing existing tools can be done simply by assigning a new item to the agent's toolbox. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_5.txt b/chunked/content_aware_chunking/_custom_tools/chunk_5.txt new file mode 100644 index 0000000000000000000000000000000000000000..696aedf89bddb8d47e2867afd9ddf1a5e9fc4179 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_5.txt @@ -0,0 +1,2 @@ +It takes an input named document which should be the document containing the information, as well as a question that is the question about the document. It returns a text that contains the answer to the question. +- image_captioner: This is a tool that generates a description of an image. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_50.txt b/chunked/content_aware_chunking/_custom_tools/chunk_50.txt new file mode 100644 index 0000000000000000000000000000000000000000..737dd0bdf1bc3ac0d0da0621b282f445809ac837 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_50.txt @@ -0,0 +1,7 @@ +Here's how one would do so: +thon +from transformers import HfAgent, load_tool +agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder") +agent.toolbox["image-transformation"] = load_tool("diffusers/controlnet-canny-tool") + +Beware when replacing tools with others! This will also adjust the agent's prompt. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_51.txt b/chunked/content_aware_chunking/_custom_tools/chunk_51.txt new file mode 100644 index 0000000000000000000000000000000000000000..6ad0b771839ab882024c9003b0445f904f68f3fd --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_51.txt @@ -0,0 +1,8 @@ +This can be good if you have a better +prompt suited for the task, but it can also result in your tool being selected way more than others or for other +tools to be selected instead of the one you have defined. + +Leveraging gradio-tools +gradio-tools is a powerful library that allows using Hugging +Face Spaces as tools. It supports many existing Spaces as well as custom Spaces to be designed with it. +We offer support for gradio_tools by using the Tool.from_gradio method. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_52.txt b/chunked/content_aware_chunking/_custom_tools/chunk_52.txt new file mode 100644 index 0000000000000000000000000000000000000000..24354b09d4c730d8e5765ea444a2ed0cdb2247fa --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_52.txt @@ -0,0 +1,14 @@ +For example, we want to take +advantage of the StableDiffusionPromptGeneratorTool tool offered in the gradio-tools toolkit so as to +improve our prompts and generate better images. +We first import the tool from gradio_tools and instantiate it: +thon +from gradio_tools import StableDiffusionPromptGeneratorTool +gradio_tool = StableDiffusionPromptGeneratorTool() + +We pass that instance to the Tool.from_gradio method: +thon +from transformers import Tool +tool = Tool.from_gradio(gradio_tool) + +Now we can manage it exactly as we would a usual custom tool. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_53.txt b/chunked/content_aware_chunking/_custom_tools/chunk_53.txt new file mode 100644 index 0000000000000000000000000000000000000000..1153ba478115e64448943016bb5ba34a1d4d5ec8 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_53.txt @@ -0,0 +1,19 @@ +We leverage it to improve our prompt +a rabbit wearing a space suit: +thon +from transformers import HfAgent +agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder", additional_tools=[tool]) +agent.run("Generate an image of the prompt after improving it.", prompt="A rabbit wearing a space suit") + +The model adequately leverages the tool: +``text +==Explanation from the agent== +I will use the following tools:StableDiffusionPromptGeneratorto improve the prompt, thenimage_generator` to generate an image according to the improved prompt. 
+==Code generated by the agent== +improved_prompt = StableDiffusionPromptGenerator(prompt) +print(f"The improved prompt is {improved_prompt}.") +image = image_generator(improved_prompt) + +Before finally generating the image: + +gradio-tools requires textual inputs and outputs, even when working with different modalities. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_54.txt b/chunked/content_aware_chunking/_custom_tools/chunk_54.txt new file mode 100644 index 0000000000000000000000000000000000000000..fbb8f9a93fac0556b5bf005450e52de466671496 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_54.txt @@ -0,0 +1,6 @@ +This implementation +works with image and audio objects. The two are currently incompatible, but will rapidly become compatible as we +work to improve the support. + +Future compatibility with Langchain +We love Langchain and think it has a very compelling suite of tools. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_55.txt b/chunked/content_aware_chunking/_custom_tools/chunk_55.txt new file mode 100644 index 0000000000000000000000000000000000000000..1811107edd8f7bec6b1547a98ce652a3cb328108 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_55.txt @@ -0,0 +1,7 @@ +In order to handle these tools, +Langchain requires textual inputs and outputs, even when working with different modalities. +This is often the serialized version (i.e., saved to disk) of the objects. +This difference means that multi-modality isn't handled between transformers-agents and langchain. +We aim for this limitation to be resolved in future versions, and welcome any help from avid langchain +users to help us achieve this compatibility. +We would love to have better support. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_56.txt b/chunked/content_aware_chunking/_custom_tools/chunk_56.txt new file mode 100644 index 0000000000000000000000000000000000000000..bdc1ee25d0ff5e2eb4fdec2a3c2d91e6cbdf2c6d --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_56.txt @@ -0,0 +1,2 @@ +If you would like to help, please +open an issue and share what you have in mind.. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_6.txt b/chunked/content_aware_chunking/_custom_tools/chunk_6.txt new file mode 100644 index 0000000000000000000000000000000000000000..423b9356171050dc4ffc8952561a7c0edcfd2c2c --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_6.txt @@ -0,0 +1,3 @@ +It takes an input named image which should be the image to the caption and returns a text that contains the description in English. +[] +Task: "Answer the question in the variable question about the image stored in the variable image. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_7.txt b/chunked/content_aware_chunking/_custom_tools/chunk_7.txt new file mode 100644 index 0000000000000000000000000000000000000000..32378a5f6ef4af49a9cfc77f36f8d36c354d430e --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_7.txt @@ -0,0 +1,22 @@ +The question is in French." +I will use the following tools: translator to translate the question into English and then image_qa to answer the question on the input image. +Answer: +py +translated_question = translator(question=question, src_lang="French", tgt_lang="English") +print(f"The translated question is {translated_question}.") +answer = image_qa(image=image, question=translated_question) +print(f"The answer is {answer}") +Task: "Identify the oldest person in the document and create an image showcasing the result as a banner." 
+I will use the following tools: document_qa to find the oldest person in the document, then image_generator to generate an image according to the answer. +Answer: +py +answer = document_qa(document, question="What is the oldest person?") +print(f"The answer is {answer}.") +image = image_generator("A banner showing " + answer) +[] +Task: "Draw me a picture of rivers and lakes" +I will use the following +` +The introduction (the text before "Tools:") explains precisely how the model shall behave and what it should do. +This part most likely does not need to be customized as the agent shall always behave the same way. +The second part (the bullet points below "Tools") is dynamically added upon calling run or chat. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_8.txt b/chunked/content_aware_chunking/_custom_tools/chunk_8.txt new file mode 100644 index 0000000000000000000000000000000000000000..1cbc8f7d44ddac6d82e0f778ff9a82dc33212957 --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_8.txt @@ -0,0 +1,14 @@ +There are +exactly as many bullet points as there are tools in agent.toolbox and each bullet point consists of the name +and description of the tool: +text +- : +Let's verify this quickly by loading the document_qa tool and printing out the name and description. + +from transformers import load_tool +document_qa = load_tool("document-question-answering") +print(f"- {document_qa.name}: {document_qa.description}") + +which gives: +text +- document_qa: This is a tool that answers a question about a document (pdf). 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_custom_tools/chunk_9.txt b/chunked/content_aware_chunking/_custom_tools/chunk_9.txt new file mode 100644 index 0000000000000000000000000000000000000000..df980886565ae6911c02b1961f56ccd80048a87a --- /dev/null +++ b/chunked/content_aware_chunking/_custom_tools/chunk_9.txt @@ -0,0 +1,4 @@ +It takes an input named `document` which should be the document containing the information, as well as a `question` that is the question about the document. It returns a text that contains the answer to the question. +We can see that the tool name is short and precise. The description includes two parts, the first explaining +what the tool does and the second states what input arguments and return values are expected. +A good tool name and tool description are very important for the agent to correctly use it. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_debugging/chunk_0.txt b/chunked/content_aware_chunking/_debugging/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..d03b63e9748127ebe42a5b3f5d4df6a3d2a5540e --- /dev/null +++ b/chunked/content_aware_chunking/_debugging/chunk_0.txt @@ -0,0 +1,7 @@ +Debugging +Training on multiple GPUs can be a tricky endeavor whether you're running into installation issues or communication problems between your GPUs. This debugging guide covers some issues you may run into and how to resolve them. +DeepSpeed CUDA installation +If you're using DeepSpeed, you've probably already installed it with the following command. + +pip install deepspeed +DeepSpeed compiles CUDA C++ code and it can be a potential source of errors when building PyTorch extensions that require CUDA. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_debugging/chunk_1.txt b/chunked/content_aware_chunking/_debugging/chunk_1.txt new file mode 100644 index 0000000000000000000000000000000000000000..737b624202aae2a5e72766103ab6ec80d8b7df05 --- /dev/null +++ b/chunked/content_aware_chunking/_debugging/chunk_1.txt @@ -0,0 +1,6 @@ +These errors depend on how CUDA is installed on your system, and this section focuses on PyTorch built with CUDA 10.2. + +For any other installation issues, please open an issue with the DeepSpeed team. + +Non-identical CUDA toolkits +PyTorch comes with its own CUDA toolkit, but to use DeepSpeed with PyTorch, you need to have an identical version of CUDA installed system-wide. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_debugging/chunk_10.txt b/chunked/content_aware_chunking/_debugging/chunk_10.txt new file mode 100644 index 0000000000000000000000000000000000000000..c1c783af3c1ae414565d1f9e00caac63d15cbdc5 --- /dev/null +++ b/chunked/content_aware_chunking/_debugging/chunk_10.txt @@ -0,0 +1,6 @@ +To make a local build for DeepSpeed: + +git clone https://github.com/microsoft/DeepSpeed/ +cd DeepSpeed +rm -rf build +TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 pip install . \ No newline at end of file diff --git a/chunked/content_aware_chunking/_debugging/chunk_11.txt b/chunked/content_aware_chunking/_debugging/chunk_11.txt new file mode 100644 index 0000000000000000000000000000000000000000..a9e38ae665216d8e2741b8c577074950033ff01a --- /dev/null +++ b/chunked/content_aware_chunking/_debugging/chunk_11.txt @@ -0,0 +1,7 @@ +\ +--global-option="build_ext" --global-option="-j8" --no-cache -v \ +--disable-pip-version-check 2>&1 | tee build.log + +To use NVMe offload, add the DS_BUILD_AIO=1 parameter to the build command and make sure you install the libaio-dev package system-wide. 
+ +Next, you'll have to specify your GPU's architecture by editing the TORCH_CUDA_ARCH_LIST variable (find a complete list of NVIDIA GPUs and their corresponding architectures on this page). \ No newline at end of file diff --git a/chunked/content_aware_chunking/_debugging/chunk_12.txt b/chunked/content_aware_chunking/_debugging/chunk_12.txt new file mode 100644 index 0000000000000000000000000000000000000000..1533bcdf1ad2da6acab912970c55f8933feb72b9 --- /dev/null +++ b/chunked/content_aware_chunking/_debugging/chunk_12.txt @@ -0,0 +1,15 @@ +To check the PyTorch version that corresponds to your architecture, run the following command: + +python -c "import torch; print(torch.cuda.get_arch_list())" +Find the architecture for a GPU with the following command: + +CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_capability())" + +To find the architecture for GPU 0: + +CUDA_VISIBLE_DEVICES=0 python -c "import torch; \ +print(torch.cuda.get_device_properties(torch.device('cuda'))) +"_CudaDeviceProperties(name='GeForce RTX 3090', major=8, minor=6, total_memory=24268MB, multi_processor_count=82)" +This means your GPU architecture is 8.6. + +If you get 8, 6, then you can set TORCH_CUDA_ARCH_LIST="8.6". \ No newline at end of file diff --git a/chunked/content_aware_chunking/_debugging/chunk_13.txt b/chunked/content_aware_chunking/_debugging/chunk_13.txt new file mode 100644 index 0000000000000000000000000000000000000000..4e5a8ceaaed378a2924d84c9a26580723d84dec0 --- /dev/null +++ b/chunked/content_aware_chunking/_debugging/chunk_13.txt @@ -0,0 +1,2 @@ +For multiple GPUs with different architectures, list them like TORCH_CUDA_ARCH_LIST="6.1;8.6". +It is also possible to not specify TORCH_CUDA_ARCH_LIST and the build program automatically queries the GPU architecture of the build. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_debugging/chunk_14.txt b/chunked/content_aware_chunking/_debugging/chunk_14.txt new file mode 100644 index 0000000000000000000000000000000000000000..89b124d0388b092ff7acd51e62695a2f78475720 --- /dev/null +++ b/chunked/content_aware_chunking/_debugging/chunk_14.txt @@ -0,0 +1,9 @@ +However, it may or may not match the actual GPU on the target machine, which is why it is better to explicitly specify the correct architecture. +For training on multiple machines with the same setup, you'll need to make a binary wheel: + +git clone https://github.com/microsoft/DeepSpeed/ +cd DeepSpeed +rm -rf build +TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 \ +python setup.py build_ext -j8 bdist_wheel +This command generates a binary wheel that'll look something like dist/deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_debugging/chunk_15.txt b/chunked/content_aware_chunking/_debugging/chunk_15.txt new file mode 100644 index 0000000000000000000000000000000000000000..234afd5aafe336a0df1dba7c789108166df74e23 --- /dev/null +++ b/chunked/content_aware_chunking/_debugging/chunk_15.txt @@ -0,0 +1,17 @@ +Now you can install this wheel locally or on another machine. + +pip install deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl +Multi-GPU Network Issues Debug +When training or running inference with DistributedDataParallel and multiple GPUs, if you run into communication issues between processes and/or nodes, you can use the following script to diagnose network issues. + +wget https://raw.githubusercontent.com/huggingface/transformers/main/scripts/distributed/torch-distributed-gpu-test.py +For example, to test how 2 GPUs interact, run: + +python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py +If both processes can talk to each other and allocate GPU memory, each will print an OK status. 
+For more GPUs or nodes, adjust the arguments in the script. +You will find a lot more details inside the diagnostics script, and even a recipe for running it in a SLURM environment. +An additional level of debugging is to set the NCCL_DEBUG=INFO environment variable as follows: + +NCCL_DEBUG=INFO python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py +This will dump a lot of NCCL-related debug information, which you can then search online if you find that some problems are reported. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_debugging/chunk_16.txt b/chunked/content_aware_chunking/_debugging/chunk_16.txt new file mode 100644 index 0000000000000000000000000000000000000000..2438958ce9cecdf9a715be0adebecb3251e1e634 --- /dev/null +++ b/chunked/content_aware_chunking/_debugging/chunk_16.txt @@ -0,0 +1,11 @@ +Or if you're not sure how to interpret the output, you can share the log file in an Issue. +Underflow and Overflow Detection + +This feature is currently available for PyTorch only. + +For multi-GPU training it requires DDP (torch.distributed.launch). + +This feature can be used with any nn.Module-based model. + +If you start getting loss=NaN or the model exhibits some other abnormal behavior due to inf or nan in +activations or weights, you need to discover where the first underflow or overflow happens and what led to it. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_debugging/chunk_17.txt b/chunked/content_aware_chunking/_debugging/chunk_17.txt new file mode 100644 index 0000000000000000000000000000000000000000..69ddba5e34c4184a6e71f504b8ab3cad951d908e --- /dev/null +++ b/chunked/content_aware_chunking/_debugging/chunk_17.txt @@ -0,0 +1,14 @@ +Luckily +you can accomplish that easily by activating a special module that will do the detection automatically. 
+If you're using [Trainer], you just need to add: + +--debug underflow_overflow +to the normal command line arguments, or pass debug="underflow_overflow" when creating the +[TrainingArguments] object. +If you're using your own training loop or another Trainer you can accomplish the same with: +thon +from transformers.debug_utils import DebugUnderflowOverflow +debug_overflow = DebugUnderflowOverflow(model) + +[~debug_utils.DebugUnderflowOverflow] inserts hooks into the model that immediately after each +forward call will test input and output variables and also the corresponding module's weights. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_debugging/chunk_18.txt b/chunked/content_aware_chunking/_debugging/chunk_18.txt new file mode 100644 index 0000000000000000000000000000000000000000..71b95d8008415b29e1ba1b2d6935100124246cd2 --- /dev/null +++ b/chunked/content_aware_chunking/_debugging/chunk_18.txt @@ -0,0 +1,43 @@ +As soon as inf or +nan is detected in at least one element of the activations or weights, the program will assert and print a report +like this (this was caught with google/mt5-small under fp16 mixed precision): +Detected inf/nan during batch_number=0 +Last 21 forward frames: +abs min abs max metadata + encoder.block.1.layer.1.DenseReluDense.dropout Dropout +0.00e+00 2.57e+02 input[0] +0.00e+00 2.85e+02 output +[] + encoder.block.2.layer.0 T5LayerSelfAttention +6.78e-04 3.15e+03 input[0] +2.65e-04 3.42e+03 output[0] + None output[1] +2.25e-01 1.00e+04 output[2] + encoder.block.2.layer.1.layer_norm T5LayerNorm +8.69e-02 4.18e-01 weight +2.65e-04 3.42e+03 input[0] +1.79e-06 4.65e+00 output + encoder.block.2.layer.1.DenseReluDense.wi_0 Linear +2.17e-07 4.50e+00 weight +1.79e-06 4.65e+00 input[0] +2.68e-06 3.70e+01 output + encoder.block.2.layer.1.DenseReluDense.wi_1 Linear +8.08e-07 2.66e+01 weight +1.79e-06 4.65e+00 input[0] +1.27e-04 2.37e+02 output + encoder.block.2.layer.1.DenseReluDense.dropout Dropout +0.00e+00 8.76e+03 
input[0] +0.00e+00 9.74e+03 output + encoder.block.2.layer.1.DenseReluDense.wo Linear +1.01e-06 6.44e+00 weight +0.00e+00 9.74e+03 input[0] +3.18e-04 6.27e+04 output + encoder.block.2.layer.1.DenseReluDense T5DenseGatedGeluDense +1.79e-06 4.65e+00 input[0] +3.18e-04 6.27e+04 output + encoder.block.2.layer.1.dropout Dropout +3.18e-04 6.27e+04 input[0] +0.00e+00 inf output +The example output has been trimmed in the middle for brevity. +The second column shows the value of the absolute largest element, so if you have a closer look at the last few frames, +the inputs and outputs were in the range of 1e4. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_debugging/chunk_19.txt b/chunked/content_aware_chunking/_debugging/chunk_19.txt new file mode 100644 index 0000000000000000000000000000000000000000..15f3a439e75f70abfa11302b03c8f41294a8cb83 --- /dev/null +++ b/chunked/content_aware_chunking/_debugging/chunk_19.txt @@ -0,0 +1,2 @@ +So when this training was done under fp16 mixed precision the very +last step overflowed (since under fp16 the largest number before inf is 64e3). \ No newline at end of file diff --git a/chunked/content_aware_chunking/_debugging/chunk_2.txt b/chunked/content_aware_chunking/_debugging/chunk_2.txt new file mode 100644 index 0000000000000000000000000000000000000000..90c337c5a7218469c3b52e5341d4782f4c6dba44 --- /dev/null +++ b/chunked/content_aware_chunking/_debugging/chunk_2.txt @@ -0,0 +1,2 @@ +For example, if you installed PyTorch with cudatoolkit==10.2 in your Python environment, then you'll also need to have CUDA 10.2 installed system-wide. If you don't have CUDA installed system-wide, you should install it first. +The exact location may vary from system to system, but usr/local/cuda-10.2 is the most common location on many Unix systems. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_debugging/chunk_20.txt b/chunked/content_aware_chunking/_debugging/chunk_20.txt new file mode 100644 index 0000000000000000000000000000000000000000..439ea074a99448d3fd4a9543816115e4f0e298d7 --- /dev/null +++ b/chunked/content_aware_chunking/_debugging/chunk_20.txt @@ -0,0 +1,6 @@ +To avoid overflows under +fp16 the activations must remain way below 1e4, because 1e4 * 1e4 = 1e8 so any matrix multiplication with +large activations is going to lead to a numerical overflow condition. +At the very start of the trace you can discover at which batch number the problem occurred (here Detected inf/nan during batch_number=0 means the problem occurred on the first batch). +Each reported frame starts by declaring the fully qualified entry for the corresponding module this frame is reporting +for. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_debugging/chunk_21.txt b/chunked/content_aware_chunking/_debugging/chunk_21.txt new file mode 100644 index 0000000000000000000000000000000000000000..9d9951c9c753b912e5ede9233e9fdee9ff113902 --- /dev/null +++ b/chunked/content_aware_chunking/_debugging/chunk_21.txt @@ -0,0 +1,7 @@ +If we look just at this frame: +encoder.block.2.layer.1.layer_norm T5LayerNorm +8.69e-02 4.18e-01 weight +2.65e-04 3.42e+03 input[0] +1.79e-06 4.65e+00 output +Here, encoder.block.2.layer.1.layer_norm indicates that it was a layer norm for the first layer, of the second +block of the encoder. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_debugging/chunk_22.txt b/chunked/content_aware_chunking/_debugging/chunk_22.txt new file mode 100644 index 0000000000000000000000000000000000000000..26c3010fda0a23b46162e1edb51cb725b1b3a2cc --- /dev/null +++ b/chunked/content_aware_chunking/_debugging/chunk_22.txt @@ -0,0 +1,26 @@ +And the specific calls of the forward is T5LayerNorm. 
+Let's look at the last few frames of that report: +Detected inf/nan during batch_number=0 +Last 21 forward frames: +abs min abs max metadata +[] + encoder.block.2.layer.1.DenseReluDense.wi_0 Linear +2.17e-07 4.50e+00 weight +1.79e-06 4.65e+00 input[0] +2.68e-06 3.70e+01 output + encoder.block.2.layer.1.DenseReluDense.wi_1 Linear +8.08e-07 2.66e+01 weight +1.79e-06 4.65e+00 input[0] +1.27e-04 2.37e+02 output + encoder.block.2.layer.1.DenseReluDense.wo Linear +1.01e-06 6.44e+00 weight +0.00e+00 9.74e+03 input[0] +3.18e-04 6.27e+04 output + encoder.block.2.layer.1.DenseReluDense T5DenseGatedGeluDense +1.79e-06 4.65e+00 input[0] +3.18e-04 6.27e+04 output + encoder.block.2.layer.1.dropout Dropout +3.18e-04 6.27e+04 input[0] +0.00e+00 inf output +The last frame reports for Dropout.forward function with the first entry for the only input and the second for the +only output. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_debugging/chunk_23.txt b/chunked/content_aware_chunking/_debugging/chunk_23.txt new file mode 100644 index 0000000000000000000000000000000000000000..5e7fa603048ee4bef9d5a992c6f2a9d87123bbc3 --- /dev/null +++ b/chunked/content_aware_chunking/_debugging/chunk_23.txt @@ -0,0 +1,5 @@ +You can see that it was called from an attribute dropout inside DenseReluDense class. We can see +that it happened during the first layer, of the 2nd block, during the very first batch. Finally, the absolute largest +input elements was 6.27e+04 and same for the output was inf. +You can see here, that T5DenseGatedGeluDense.forward resulted in output activations, whose absolute max value was +around 62.7K, which is very close to fp16's top limit of 64K. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_debugging/chunk_24.txt b/chunked/content_aware_chunking/_debugging/chunk_24.txt new file mode 100644 index 0000000000000000000000000000000000000000..3431c95bca3cfcf6055abbb54d3663522f28bf3d --- /dev/null +++ b/chunked/content_aware_chunking/_debugging/chunk_24.txt @@ -0,0 +1,29 @@ +In the next frame we have Dropout which renormalizes +the weights, after it zeroed some of the elements, which pushes the absolute max value to more than 64K, and we get an +overflow (inf). +As you can see, it's the previous frames that we need to look into when the numbers start becoming very large for fp16 +numbers. +Let's match the report to the code from models/t5/modeling_t5.py: +thon +class T5DenseGatedGeluDense(nn.Module): + def __init__(self, config): + super().__init__() + self.wi_0 = nn.Linear(config.d_model, config.d_ff, bias=False) + self.wi_1 = nn.Linear(config.d_model, config.d_ff, bias=False) + self.wo = nn.Linear(config.d_ff, config.d_model, bias=False) + self.dropout = nn.Dropout(config.dropout_rate) + self.gelu_act = ACT2FN["gelu_new"] +def forward(self, hidden_states): + hidden_gelu = self.gelu_act(self.wi_0(hidden_states)) + hidden_linear = self.wi_1(hidden_states) + hidden_states = hidden_gelu * hidden_linear + hidden_states = self.dropout(hidden_states) + hidden_states = self.wo(hidden_states) + return hidden_states + +Now it's easy to see the dropout call, and all the previous calls as well. +Since the detection is happening in a forward hook, these reports are printed immediately after each forward +returns. +Going back to the full report, to act on it and to fix the problem, we need to go a few frames up where the numbers +started to go up and most likely switch to the fp32 mode here, so that the numbers don't overflow when multiplied +or summed up. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_debugging/chunk_25.txt b/chunked/content_aware_chunking/_debugging/chunk_25.txt new file mode 100644 index 0000000000000000000000000000000000000000..934204e08859eea7050b70449f96c3d95a6af032 --- /dev/null +++ b/chunked/content_aware_chunking/_debugging/chunk_25.txt @@ -0,0 +1 @@ +Of course, there might be other solutions. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_debugging/chunk_26.txt b/chunked/content_aware_chunking/_debugging/chunk_26.txt new file mode 100644 index 0000000000000000000000000000000000000000..737d2069d795ce2c2a748c98f76f5d356a5d7b99 --- /dev/null +++ b/chunked/content_aware_chunking/_debugging/chunk_26.txt @@ -0,0 +1,20 @@ +For example, we could turn off amp temporarily if it's +enabled, after moving the original forward into a helper wrapper, like so: +thon +def _forward(self, hidden_states): + hidden_gelu = self.gelu_act(self.wi_0(hidden_states)) + hidden_linear = self.wi_1(hidden_states) + hidden_states = hidden_gelu * hidden_linear + hidden_states = self.dropout(hidden_states) + hidden_states = self.wo(hidden_states) + return hidden_states +import torch +def forward(self, hidden_states): + if torch.is_autocast_enabled(): + with torch.cuda.amp.autocast(enabled=False): + return self._forward(hidden_states) + else: + return self._forward(hidden_states) + +Since the automatic detector only reports on inputs and outputs of full frames, once you know where to look, you may +want to analyse the intermediary stages of any specific forward function as well. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_debugging/chunk_27.txt b/chunked/content_aware_chunking/_debugging/chunk_27.txt new file mode 100644 index 0000000000000000000000000000000000000000..d5c38d8748065798b31f957c5a90448cfa7489cf --- /dev/null +++ b/chunked/content_aware_chunking/_debugging/chunk_27.txt @@ -0,0 +1,27 @@ +In such a case you can use the +detect_overflow helper function to inject the detector where you want it, for example: +thon +from debug_utils import detect_overflow +class T5LayerFF(nn.Module): + [] +def forward(self, hidden_states): + forwarded_states = self.layer_norm(hidden_states) + detect_overflow(forwarded_states, "after layer_norm") + forwarded_states = self.DenseReluDense(forwarded_states) + detect_overflow(forwarded_states, "after DenseReluDense") + return hidden_states + self.dropout(forwarded_states) + +You can see that we added 2 of these and now we track if inf or nan for forwarded_states was detected +somewhere in between. +Actually, the detector already reports these because each of the calls in the example above is a nn.Module, but +let's say if you had some local direct calculations this is how you'd do that. +Additionally, if you're instantiating the debugger in your own code, you can adjust the number of frames printed from +its default, e.g.: +thon +from transformers.debug_utils import DebugUnderflowOverflow +debug_overflow = DebugUnderflowOverflow(model, max_frames_to_save=100) + +Specific batch absolute min and max value tracing +The same debugging class can be used for per-batch tracing with the underflow/overflow detection feature turned off. +Let's say you want to watch the absolute min and max values for all the ingredients of each forward call of a given +batch, and only do that for batches 1 and 3. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_debugging/chunk_28.txt b/chunked/content_aware_chunking/_debugging/chunk_28.txt new file mode 100644 index 0000000000000000000000000000000000000000..fe5b0b0c51885a9e2d0baeb31e6fabe69a59039b --- /dev/null +++ b/chunked/content_aware_chunking/_debugging/chunk_28.txt @@ -0,0 +1,7 @@ +Then you instantiate this class as: +python +debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1, 3]) +And now full batches 1 and 3 will be traced using the same format as the underflow/overflow detector does. +Batches are 0-indexed. +This is helpful if you know that the program starts misbehaving after a certain batch number, so you can fast-forward +right to that area. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_debugging/chunk_29.txt b/chunked/content_aware_chunking/_debugging/chunk_29.txt new file mode 100644 index 0000000000000000000000000000000000000000..a89e44e6e9339e5acaec68b6b32e3934c78082d3 --- /dev/null +++ b/chunked/content_aware_chunking/_debugging/chunk_29.txt @@ -0,0 +1,31 @@ +Here is a sample truncated output for such configuration: + + *** Starting batch number=1 *** +abs min abs max metadata + shared Embedding +1.01e-06 7.92e+02 weight +0.00e+00 2.47e+04 input[0] +5.36e-05 7.92e+02 output +[] + decoder.dropout Dropout +1.60e-07 2.27e+01 input[0] +0.00e+00 2.52e+01 output + decoder T5Stack + not a tensor output + lm_head Linear +1.01e-06 7.92e+02 weight +0.00e+00 1.11e+00 input[0] +6.06e-02 8.39e+01 output + T5ForConditionalGeneration + not a tensor output + *** Starting batch number=3 *** + +abs min abs max metadata + shared Embedding +1.01e-06 7.92e+02 weight +0.00e+00 2.78e+04 input[0] +5.36e-05 7.92e+02 output +[] + +Here you will get a huge number of frames dumped - as many as there were forward calls in your model, so it may or may +not be what you want, but sometimes it can be easier to use for debugging purposes than a normal debugger. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_debugging/chunk_3.txt b/chunked/content_aware_chunking/_debugging/chunk_3.txt new file mode 100644 index 0000000000000000000000000000000000000000..dfad2768b3e09f3472ad88e4f86d820ebd6da42d --- /dev/null +++ b/chunked/content_aware_chunking/_debugging/chunk_3.txt @@ -0,0 +1,9 @@ +When CUDA is correctly setup and added to your PATH environment variable, you can find the installation location with the following command: + +which nvcc +Multiple CUDA toolkits +You may also have more than one CUDA toolkit installed system-wide. + +/usr/local/cuda-10.2 +/usr/local/cuda-11.0 +Typically, package installers set the paths to whatever the last version was installed. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_debugging/chunk_30.txt b/chunked/content_aware_chunking/_debugging/chunk_30.txt new file mode 100644 index 0000000000000000000000000000000000000000..c4eafb8ee8e87b7c1d4e8d285032a2ccc9fb3acc --- /dev/null +++ b/chunked/content_aware_chunking/_debugging/chunk_30.txt @@ -0,0 +1,6 @@ +For example, if +a problem starts happening at batch number 150. So you can dump traces for batches 149 and 150 and compare where +numbers started to diverge. +You can also specify the batch number after which to stop the training, with: +python +debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1, 3], abort_after_batch_num=3). 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_debugging/chunk_4.txt b/chunked/content_aware_chunking/_debugging/chunk_4.txt new file mode 100644 index 0000000000000000000000000000000000000000..00c15a281bc9d8c9f1c6f9dc8255e760e79b1d05 --- /dev/null +++ b/chunked/content_aware_chunking/_debugging/chunk_4.txt @@ -0,0 +1,6 @@ +If the package build fails because it can't find the right CUDA version (despite it being installed system-wide already), then you need to configure the PATH and LD_LIBRARY_PATH environment variables to point to the correct path. +Take a look at the contents of these environment variables first: + +echo $PATH +echo $LD_LIBRARY_PATH +PATH lists the locations of the executables and LD_LIBRARY_PATH lists where to look for shared libraries. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_debugging/chunk_5.txt b/chunked/content_aware_chunking/_debugging/chunk_5.txt new file mode 100644 index 0000000000000000000000000000000000000000..d3569c70a8fbee5686b7303126e09848d0decdb0 --- /dev/null +++ b/chunked/content_aware_chunking/_debugging/chunk_5.txt @@ -0,0 +1 @@ +Earlier entries are prioritized over later ones, and : is used to separate multiple entries. To tell the build program where to find the specific CUDA toolkit you want, insert the correct path to list first. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_debugging/chunk_6.txt b/chunked/content_aware_chunking/_debugging/chunk_6.txt new file mode 100644 index 0000000000000000000000000000000000000000..a2270614dc2e0a13ee4dc84c593a6f3c0c983321 --- /dev/null +++ b/chunked/content_aware_chunking/_debugging/chunk_6.txt @@ -0,0 +1,7 @@ +This command prepends rather than overwrites the existing values. 
+```bash +adjust the version and full path if needed +export PATH=/usr/local/cuda-10.2/bin:$PATH +export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64:$LD_LIBRARY_PATH + +In addition, you should also check the directories you assign actually exist. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_debugging/chunk_7.txt b/chunked/content_aware_chunking/_debugging/chunk_7.txt new file mode 100644 index 0000000000000000000000000000000000000000..ce335d9c339d89db5cf75d608e5bca31d2784e67 --- /dev/null +++ b/chunked/content_aware_chunking/_debugging/chunk_7.txt @@ -0,0 +1,3 @@ +The lib64 sub-directory contains various CUDA .so objects (like libcudart.so) and while it is unlikely your system names them differently, you should check the actual names and change them accordingly. +Older CUDA versions +Sometimes, older CUDA versions may refuse to build with newer compilers. For example, if you have gcc-9 but CUDA wants gcc-7. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_debugging/chunk_8.txt b/chunked/content_aware_chunking/_debugging/chunk_8.txt new file mode 100644 index 0000000000000000000000000000000000000000..daad1dfd2f906d3d3fc3a852b09fed12ae4e998b --- /dev/null +++ b/chunked/content_aware_chunking/_debugging/chunk_8.txt @@ -0,0 +1,2 @@ +Usually, installing the latest CUDA toolkit enables support for the newer compiler. +You could also install an older version of the compiler in addition to the one you're currently using (or it may already be installed but it's not used by default and the build system can't see it). 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_debugging/chunk_9.txt b/chunked/content_aware_chunking/_debugging/chunk_9.txt new file mode 100644 index 0000000000000000000000000000000000000000..ad0b7f8a979f1d64b257bfe76a9585ab1964f4f3 --- /dev/null +++ b/chunked/content_aware_chunking/_debugging/chunk_9.txt @@ -0,0 +1,8 @@ +To resolve this, you can create a symlink to give the build system visibility to the older compiler. +```bash +adapt the path to your system +sudo ln -s /usr/bin/gcc-7 /usr/local/cuda-10.2/bin/gcc +sudo ln -s /usr/bin/g++-7 /usr/local/cuda-10.2/bin/g++ + +Prebuild +If you're still having issues with installing DeepSpeed or if you're building DeepSpeed at run time, you can try to prebuild the DeepSpeed modules before installing them. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_0.txt b/chunked/content_aware_chunking/_deepspeed/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..ef6969fd8309ab197dfddd259cbd1e290ca22aa1 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_0.txt @@ -0,0 +1,2 @@ +DeepSpeed +DeepSpeed is a PyTorch optimization library that makes distributed training memory-efficient and fast. At its core is the Zero Redundancy Optimizer (ZeRO) which enables training large models at scale. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_1.txt b/chunked/content_aware_chunking/_deepspeed/chunk_1.txt new file mode 100644 index 0000000000000000000000000000000000000000..5730df10e85cb85417fc634694d3e1ac63094627 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_1.txt @@ -0,0 +1,7 @@ +ZeRO works in several stages: + +ZeRO-1, optimizer state partitioning across GPUs +ZeRO-2, gradient partitioning across GPUs +ZeRO-3, parameter partitioning across GPUs + +In GPU-limited environments, ZeRO also enables offloading optimizer memory and computation from the GPU to the CPU to fit and train really large models on a single GPU. DeepSpeed is integrated with the Transformers [Trainer] class for all ZeRO stages and offloading. All you need to do is provide a config file or you can use a provided template. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_10.txt b/chunked/content_aware_chunking/_deepspeed/chunk_10.txt new file mode 100644 index 0000000000000000000000000000000000000000..1a58c0b0319ad657a5f42ff45f53571c53337d60 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_10.txt @@ -0,0 +1,16 @@ +-name '*json') + +The DeepSpeed configuration file is passed as a path to a JSON file if you're training from the command line interface or as a nested dict object if you're using the [Trainer] in a notebook setting. + +py +TrainingArguments(, deepspeed="path/to/deepspeed_config.json") + +py +ds_config_dict = dict(scheduler=scheduler_params, optimizer=optimizer_params) +args = TrainingArguments(, deepspeed=ds_config_dict) +trainer = Trainer(model, args, ) + +DeepSpeed and Trainer parameters +There are three types of configuration parameters: + +Some of the configuration parameters are shared by [Trainer] and DeepSpeed, and it can be difficult to identify errors when there are conflicting definitions. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_11.txt b/chunked/content_aware_chunking/_deepspeed/chunk_11.txt new file mode 100644 index 0000000000000000000000000000000000000000..e7028ffb6832141bc2b5d86e6741eda379628cbb --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_11.txt @@ -0,0 +1,3 @@ +To make it easier, these shared configuration parameters are configured from the [Trainer] command line arguments. + +Some configuration parameters are automatically derived from the model configuration so you don't need to manually adjust these values. The [Trainer] uses a configuration value auto to set the most correct or efficient value. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_12.txt b/chunked/content_aware_chunking/_deepspeed/chunk_12.txt new file mode 100644 index 0000000000000000000000000000000000000000..0df00ae8f110e45fcedea1f42220d6b97947fb1c --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_12.txt @@ -0,0 +1 @@ +You could set your own configuration parameters explicitly, but you must take care to ensure the [Trainer] arguments and DeepSpeed configuration parameters agree. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_13.txt b/chunked/content_aware_chunking/_deepspeed/chunk_13.txt new file mode 100644 index 0000000000000000000000000000000000000000..65c191c16e5efab74c987a9ffb4340caa9f96ae9 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_13.txt @@ -0,0 +1,12 @@ +Mismatches may cause the training to fail in very difficult to detect ways! + +Some configuration parameters are specific to DeepSpeed only and need to be manually set based on your training needs. 
+ +You could also modify the DeepSpeed configuration and edit [TrainingArguments] from it: + +Create or load a DeepSpeed configuration to use as the main configuration +Create a [TrainingArguments] object based on these DeepSpeed configuration values + +Some values, such as scheduler.params.total_num_steps are calculated by the [Trainer] during training. +ZeRO configuration +There are three configurations, each corresponding to a different ZeRO stage. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_14.txt b/chunked/content_aware_chunking/_deepspeed/chunk_14.txt new file mode 100644 index 0000000000000000000000000000000000000000..471aa00767c27cf965dbfe0dfe778ceecd068b2c --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_14.txt @@ -0,0 +1,3 @@ +Stage 1 is not as interesting for scalability, and this guide focuses on stages 2 and 3. The zero_optimization configuration contains all the options for what to enable and how to configure them. For a more detailed explanation of each parameter, take a look at the DeepSpeed Configuration JSON reference. + +DeepSpeed doesn’t validate parameter names and any typos fall back on the parameter's default setting. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_15.txt b/chunked/content_aware_chunking/_deepspeed/chunk_15.txt new file mode 100644 index 0000000000000000000000000000000000000000..70def7d2ac11eeff3b62b0052862580573bf80a9 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_15.txt @@ -0,0 +1,13 @@ +You can watch the DeepSpeed engine startup log messages to see what values it is going to use. + +The following configurations must be set up with DeepSpeed because the [Trainer] doesn't provide equivalent command line arguments. + +ZeRO-1 shards the optimizer states across GPUs, and you can expect a tiny speed up. 
The ZeRO-1 config can be set up like this: +yml +{ + "zero_optimization": { + "stage": 1 + } +} + +ZeRO-2 shards the optimizer and gradients across GPUs. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_16.txt b/chunked/content_aware_chunking/_deepspeed/chunk_16.txt new file mode 100644 index 0000000000000000000000000000000000000000..af675421b6d005a0e1edb2f3936e7b08d695a2cf --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_16.txt @@ -0,0 +1,4 @@ +This stage is primarily used for training since its features are not relevant to inference. Some important parameters to configure for better performance include: + +offload_optimizer should be enabled to reduce GPU memory usage. +overlap_comm when set to true trades off increased GPU memory usage to lower allreduce latency. This feature uses 4.5x the allgather_bucket_size and reduce_bucket_size values. In this example, they're set to 5e8 which means it requires 9GB of GPU memory. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_17.txt b/chunked/content_aware_chunking/_deepspeed/chunk_17.txt new file mode 100644 index 0000000000000000000000000000000000000000..8d5115d563c3000c2a455effbc6502ec40d571bf --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_17.txt @@ -0,0 +1,2 @@ +If your GPU memory is 8GB or less, you should reduce overlap_comm to lower the memory requirements and prevent an out-of-memory (OOM) error. +allgather_bucket_size and reduce_bucket_size trade off available GPU memory for communication speed. The smaller their values, the slower communication is and the more GPU memory is available. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_18.txt b/chunked/content_aware_chunking/_deepspeed/chunk_18.txt new file mode 100644 index 0000000000000000000000000000000000000000..865c8d788de17997c868478b153e4d9e31d707de --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_18.txt @@ -0,0 +1,2 @@ +You can balance, for example, whether a bigger batch size is more important than a slightly slower training time. +round_robin_gradients is available in DeepSpeed 0.4.4 for CPU offloading. It parallelizes gradient copying to CPU memory among ranks by fine-grained gradient partitioning. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_19.txt b/chunked/content_aware_chunking/_deepspeed/chunk_19.txt new file mode 100644 index 0000000000000000000000000000000000000000..647ebd5f73298e42863d9434b93403282f7bce12 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_19.txt @@ -0,0 +1,21 @@ +Performance benefit grows with gradient accumulation steps (more copying between optimizer steps) or GPU count (increased parallelism). + +yml +{ + "zero_optimization": { + "stage": 2, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "allgather_partitions": true, + "allgather_bucket_size": 5e8, + "overlap_comm": true, + "reduce_scatter": true, + "reduce_bucket_size": 5e8, + "contiguous_gradients": true, + "round_robin_gradients": true + } +} + +ZeRO-3 shards the optimizer, gradient, and parameters across GPUs. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_2.txt b/chunked/content_aware_chunking/_deepspeed/chunk_2.txt new file mode 100644 index 0000000000000000000000000000000000000000..58aa5266ea2ee0e0208484201b657954e0df4ede --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_2.txt @@ -0,0 +1,6 @@ +For inference, Transformers support ZeRO-3 and offloading since it allows loading huge models. 
+This guide will walk you through how to deploy DeepSpeed training, the features you can enable, how to setup the config files for different ZeRO stages, offloading, inference, and using DeepSpeed without the [Trainer]. +Installation +DeepSpeed is available to install from PyPI or Transformers (for more detailed installation options, take a look at the DeepSpeed installation details or the GitHub README). + +If you're having difficulties installing DeepSpeed, check the DeepSpeed CUDA installation guide. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_20.txt b/chunked/content_aware_chunking/_deepspeed/chunk_20.txt new file mode 100644 index 0000000000000000000000000000000000000000..85486d47f84ae9015435e18809b759391ce3a045 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_20.txt @@ -0,0 +1,3 @@ +Unlike ZeRO-2, ZeRO-3 can also be used for inference, in addition to training, because it allows large models to be loaded on multiple GPUs. Some important parameters to configure include: + +device: "cpu" can help if you're running out of GPU memory and if you have free CPU memory available. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_21.txt b/chunked/content_aware_chunking/_deepspeed/chunk_21.txt new file mode 100644 index 0000000000000000000000000000000000000000..d443716691d02b44b2b519282be41008e08839ee --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_21.txt @@ -0,0 +1,3 @@ +This allows offloading model parameters to the CPU. +pin_memory: true can improve throughput, but less memory becomes available for other processes because the pinned memory is reserved for the specific process that requested it and it's typically accessed much faster than normal CPU memory. +stage3_max_live_parameters is the upper limit on how many full parameters you want to keep on the GPU at any given time. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_22.txt b/chunked/content_aware_chunking/_deepspeed/chunk_22.txt new file mode 100644 index 0000000000000000000000000000000000000000..1991220f4eadecf5fed94c4649d2ef7b5d8a7342 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_22.txt @@ -0,0 +1,2 @@ +Reduce this value if you encounter an OOM error. +stage3_max_reuse_distance is a value for determining when a parameter is used again in the future, and it helps decide whether to throw the parameter away or to keep it. If the parameter is going to be reused (if the value is less than stage3_max_reuse_distance), then it is kept to reduce communication overhead. This is super helpful when activation checkpointing is enabled and you want to keep the parameter in the forward recompute until the backward pass. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_23.txt b/chunked/content_aware_chunking/_deepspeed/chunk_23.txt new file mode 100644 index 0000000000000000000000000000000000000000..595370a6bf27d4afef9c4fe18f58ab39828eac45 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_23.txt @@ -0,0 +1,4 @@ +But reduce this value if you encounter an OOM error. +stage3_gather_16bit_weights_on_model_save consolidates fp16 weights when a model is saved. For large models and multiple GPUs, this is expensive in terms of memory and speed. You should enable it if you're planning on resuming training. + +sub_group_size controls which parameters are updated during the optimizer step. Parameters are grouped into buckets of sub_group_size and each bucket is updated one at a time. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_24.txt b/chunked/content_aware_chunking/_deepspeed/chunk_24.txt new file mode 100644 index 0000000000000000000000000000000000000000..299747fa63a6a753b49563e9c0868464e36a1066 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_24.txt @@ -0,0 +1,3 @@ +When used with NVMe offload, sub_group_size determines when model states are moved in and out of CPU memory during the optimization step. This prevents running out of CPU memory for extremely large models. sub_group_size can be left to its default value if you aren't using NVMe offload, but you may want to change it if you: + +Run into an OOM error during the optimizer step. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_25.txt b/chunked/content_aware_chunking/_deepspeed/chunk_25.txt new file mode 100644 index 0000000000000000000000000000000000000000..515274924c5c755efbbbc6c66be5721d061c5691 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_25.txt @@ -0,0 +1,4 @@ +In this case, reduce sub_group_size to reduce memory usage of the temporary buffers. +The optimizer step is taking a really long time. In this case, increase sub_group_size to improve bandwidth utilization as a result of increased data buffers. + +reduce_bucket_size, stage3_prefetch_bucket_size, and stage3_param_persistence_threshold are dependent on a model's hidden size. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_26.txt b/chunked/content_aware_chunking/_deepspeed/chunk_26.txt new file mode 100644 index 0000000000000000000000000000000000000000..637f0e07378b9896b3dca4afba382d3a99148663 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_26.txt @@ -0,0 +1,34 @@ +It is recommended to set these values to auto and allow the [Trainer] to automatically assign the values. 
+ +yml +{ + "zero_optimization": { + "stage": 3, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "offload_param": { + "device": "cpu", + "pin_memory": true + }, + "overlap_comm": true, + "contiguous_gradients": true, + "sub_group_size": 1e9, + "reduce_bucket_size": "auto", + "stage3_prefetch_bucket_size": "auto", + "stage3_param_persistence_threshold": "auto", + "stage3_max_live_parameters": 1e9, + "stage3_max_reuse_distance": 1e9, + "stage3_gather_16bit_weights_on_model_save": true + } +} +You can use the deepspeed.zero.Init context manager to initialize a model faster: + +from transformers import T5ForConditionalGeneration, T5Config +import deepspeed +with deepspeed.zero.Init(): + config = T5Config.from_pretrained("google-t5/t5-small") + model = T5ForConditionalGeneration(config) + +For pretrained models, the DeepSpeed config file needs to have is_deepspeed_zero3_enabled: true set up in [TrainingArguments] and it needs a ZeRO configuration enabled. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_27.txt b/chunked/content_aware_chunking/_deepspeed/chunk_27.txt new file mode 100644 index 0000000000000000000000000000000000000000..28bc65f580c46fb95a47bd4e473cb77f51b959c9 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_27.txt @@ -0,0 +1,8 @@ +The [TrainingArguments] object must be created before calling the model [~PreTrainedModel.from_pretrained]. + +from transformers import AutoModel, Trainer, TrainingArguments +training_args = TrainingArguments(, deepspeed=ds_config) +model = AutoModel.from_pretrained("google-t5/t5-small") +trainer = Trainer(model=model, args=training_args, ) + +You'll need ZeRO-3 if the fp16 weights don't fit on a single GPU. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_28.txt b/chunked/content_aware_chunking/_deepspeed/chunk_28.txt new file mode 100644 index 0000000000000000000000000000000000000000..3e9a9fdc51ef3611c9b9f886208ee7d9c887b4db --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_28.txt @@ -0,0 +1,2 @@ +If you're able to load fp16 weights, then make sure you specify torch_dtype=torch.float16 in [~PreTrainedModel.from_pretrained]. +Another consideration for ZeRO-3 is if you have multiple GPUs, no single GPU has all the parameters unless it's the parameters for the currently executing layer. To access all parameters from all the layers at once, such as loading pretrained model weights in [~PreTrainedModel.from_pretrained], one layer is loaded at a time and immediately partitioned to all GPUs. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_29.txt b/chunked/content_aware_chunking/_deepspeed/chunk_29.txt new file mode 100644 index 0000000000000000000000000000000000000000..09f034524ac61bbdcc5a932fdfaa6e39107e9c04 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_29.txt @@ -0,0 +1,9 @@ +This is because for very large models, it isn't possible to load the weights on one GPU and then distribute them across the other GPUs due to memory limitations. +If you encounter a model parameter weight that looks like the following, where tensor([1.]) or the parameter size is 1 instead of a larger multi-dimensional shape, this means the parameter is partitioned and this is a ZeRO-3 placeholder. +py +tensor([1.0], device="cuda:0", dtype=torch.float16, requires_grad=True) + +For more information about initializing large models with ZeRO-3 and accessing the parameters, take a look at the Constructing Massive Models and Gathering Parameters guides. + +NVMe configuration +ZeRO-Infinity allows offloading model states to the CPU and/or NVMe to save even more memory. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_3.txt b/chunked/content_aware_chunking/_deepspeed/chunk_3.txt new file mode 100644 index 0000000000000000000000000000000000000000..01645b1662ce876be98c88e1c9ba595d506be84d --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_3.txt @@ -0,0 +1,8 @@ +While DeepSpeed has a pip installable PyPI package, it is highly recommended to install it from source to best match your hardware and to support certain features, like 1-bit Adam, which aren’t available in the PyPI distribution. + +pip install deepspeed + +pip install transformers[deepspeed] + +Memory requirements +Before you begin, it is a good idea to check whether you have enough GPU and CPU memory to fit your model. DeepSpeed provides a tool for estimating the required CPU/GPU memory. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_30.txt b/chunked/content_aware_chunking/_deepspeed/chunk_30.txt new file mode 100644 index 0000000000000000000000000000000000000000..8373365c051372c99dddf2ee746fa89dd5dd9a71 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_30.txt @@ -0,0 +1,2 @@ +Smart partitioning and tiling algorithms allow each GPU to send and receive very small amounts of data during offloading such that a modern NVMe can fit an even larger total memory pool than is available to your training process. ZeRO-Infinity requires ZeRO-3. +Depending on the CPU and/or NVMe memory available, you can offload both the optimizer states and parameters, just one of them, or none. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_31.txt b/chunked/content_aware_chunking/_deepspeed/chunk_31.txt new file mode 100644 index 0000000000000000000000000000000000000000..17dd67c56a4830b3abb0cf75c0694a1f82696b16 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_31.txt @@ -0,0 +1 @@ +You should also make sure the nvme_path is pointing to an NVMe device, because while it still works with a normal hard drive or solid state drive, it'll be significantly slower. With a modern NVMe, you can expect peak transfer speeds of ~3.5GB/s for read and ~3GB/s for write operations. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_32.txt b/chunked/content_aware_chunking/_deepspeed/chunk_32.txt new file mode 100644 index 0000000000000000000000000000000000000000..a9ff55dd643ebfd17e165ca60ed201a8c3d50e73 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_32.txt @@ -0,0 +1,79 @@ +Lastly, run a benchmark on your training setup to determine the optimal aio configuration. +The example ZeRO-3/Infinity configuration file below sets most of the parameter values to auto, but you could also manually add these values. 
+```yml +{ + "fp16": { + "enabled": "auto", + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + }, +"optimizer": { + "type": "AdamW", + "params": { + "lr": "auto", + "betas": "auto", + "eps": "auto", + "weight_decay": "auto" + } +}, + +"scheduler": { + "type": "WarmupLR", + "params": { + "warmup_min_lr": "auto", + "warmup_max_lr": "auto", + "warmup_num_steps": "auto" + } +}, + +"zero_optimization": { + "stage": 3, + "offload_optimizer": { + "device": "nvme", + "nvme_path": "/local_nvme", + "pin_memory": true, + "buffer_count": 4, + "fast_init": false + }, + "offload_param": { + "device": "nvme", + "nvme_path": "/local_nvme", + "pin_memory": true, + "buffer_count": 5, + "buffer_size": 1e8, + "max_in_cpu": 1e9 + }, + "aio": { + "block_size": 262144, + "queue_depth": 32, + "thread_count": 1, + "single_submit": false, + "overlap_events": true + }, + "overlap_comm": true, + "contiguous_gradients": true, + "sub_group_size": 1e9, + "reduce_bucket_size": "auto", + "stage3_prefetch_bucket_size": "auto", + "stage3_param_persistence_threshold": "auto", + "stage3_max_live_parameters": 1e9, + "stage3_max_reuse_distance": 1e9, + "stage3_gather_16bit_weights_on_model_save": true +}, + +"gradient_accumulation_steps": "auto", +"gradient_clipping": "auto", +"steps_per_print": 2000, +"train_batch_size": "auto", +"train_micro_batch_size_per_gpu": "auto", +"wall_clock_breakdown": false + +} + +DeepSpeed features +There are a number of important parameters to specify in the DeepSpeed configuration file which are briefly described in this section. +Activation/gradient checkpointing +Activation and gradient checkpointing trades speed for more GPU memory which allows you to overcome scenarios where your GPU is out of memory or to increase your batch size for better performance. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_33.txt b/chunked/content_aware_chunking/_deepspeed/chunk_33.txt new file mode 100644 index 0000000000000000000000000000000000000000..8eacc38458202e4d730ade3ade250f3b55fa0728 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_33.txt @@ -0,0 +1,4 @@ +To enable this feature: + +For a Hugging Face model, call model.gradient_checkpointing_enable() or pass --gradient_checkpointing to the [Trainer]. +For a non-Hugging Face model, use the DeepSpeed Activation Checkpointing API. You could also modify the Transformers modeling code and replace torch.utils.checkpoint with the DeepSpeed API. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_34.txt b/chunked/content_aware_chunking/_deepspeed/chunk_34.txt new file mode 100644 index 0000000000000000000000000000000000000000..2edc3967bcca70717cbe77698a014276151cf3aa --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_34.txt @@ -0,0 +1,4 @@ +This approach is more flexible because you can offload the forward activations to the CPU memory instead of recalculating them. + +Optimizer and scheduler +DeepSpeed and Transformers optimizers and schedulers can be mixed and matched as long as you don't enable offload_optimizer. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_35.txt b/chunked/content_aware_chunking/_deepspeed/chunk_35.txt new file mode 100644 index 0000000000000000000000000000000000000000..847a2a80fc20597ba13ce84705c034e5c366436e --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_35.txt @@ -0,0 +1,3 @@ +When offload_optimizer is enabled, you could use a non-DeepSpeed optimizer (except for LAMB) as long as it has both a CPU and GPU implementation. + +The optimizer and scheduler parameters for the config file can be set from the command line to avoid hard-to-find errors. 
For example, if the learning rate is set to a different value in another place you can override it from the command line. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_36.txt b/chunked/content_aware_chunking/_deepspeed/chunk_36.txt new file mode 100644 index 0000000000000000000000000000000000000000..5a0b824cbcb0e1b6130d7d9441fc1b897068eea3 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_36.txt @@ -0,0 +1,3 @@ +Aside from the optimizer and scheduler parameters, you'll need to ensure your [Trainer] command line arguments match the DeepSpeed configuration. + +DeepSpeed offers several optimizers (Adam, AdamW, OneBitAdam, and LAMB) but you can also import other optimizers from PyTorch. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_37.txt b/chunked/content_aware_chunking/_deepspeed/chunk_37.txt new file mode 100644 index 0000000000000000000000000000000000000000..d339b1cf21bc6d692e1ad7d9c7e9a68df792b122 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_37.txt @@ -0,0 +1,50 @@ +If you don't configure the optimizer in the config, the [Trainer] automatically selects AdamW and either uses the supplied values or the default values for the following parameters from the command line: lr, adam_beta1, adam_beta2, adam_epsilon, weight_decay. +You can set the parameters to "auto" or manually input your own desired values. +yaml +{ + "optimizer": { + "type": "AdamW", + "params": { + "lr": "auto", + "betas": "auto", + "eps": "auto", + "weight_decay": "auto" + } + } +} +You can also use an unsupported optimizer by adding the following to the top level configuration. +yaml +{ + "zero_allow_untested_optimizer": true +} +From DeepSpeed==0.8.3 on, if you want to use offload, you'll also need to add the following to the top level configuration because offload works best with DeepSpeed's CPU Adam optimizer. 
+yaml +{ + "zero_force_ds_cpu_optimizer": false +} + +DeepSpeed supports the LRRangeTest, OneCycle, WarmupLR and WarmupDecayLR learning rate schedulers. +Transformers and DeepSpeed provide two of the same schedulers: + +WarmupLR is the same as --lr_scheduler_type constant_with_warmup in Transformers +WarmupDecayLR is the same as --lr_scheduler_type linear in Transformers (this is the default scheduler used in Transformers) + +If you don't configure the scheduler in the config, the [Trainer] automatically selects WarmupDecayLR and either uses the supplied values or the default values for the following parameters from the command line: warmup_min_lr, warmup_max_lr, warmup_num_steps, total_num_steps (automatically calculated during run time if max_steps is not provided). +You can set the parameters to "auto" or manually input your own desired values. +yaml +{ + "scheduler": { + "type": "WarmupDecayLR", + "params": { + "total_num_steps": "auto", + "warmup_min_lr": "auto", + "warmup_max_lr": "auto", + "warmup_num_steps": "auto" + } + } +} + +Precision +Deepspeed supports fp32, fp16, and bf16 mixed precision. + +If your model doesn't work well with mixed precision, for example if it wasn't pretrained in mixed precision, you may encounter overflow or underflow issues which can cause NaN loss. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_38.txt b/chunked/content_aware_chunking/_deepspeed/chunk_38.txt new file mode 100644 index 0000000000000000000000000000000000000000..21ab3ec299030175480b6dd915797bf9d2d6de2a --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_38.txt @@ -0,0 +1,8 @@ +For these cases, you should use full fp32 precision by explicitly disabling the default fp16 mode. +yaml +{ + "fp16": { + "enabled": false + } +} +For Ampere GPUs and PyTorch > 1.7, it automatically switches to the more efficient tf32 format for some operations but the results are still in fp32. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_39.txt b/chunked/content_aware_chunking/_deepspeed/chunk_39.txt new file mode 100644 index 0000000000000000000000000000000000000000..1fd78c69f0bbaf82750df9e2196e18173e69e604 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_39.txt @@ -0,0 +1,3 @@ +You can control it from the [Trainer] by setting --tf32 to enable it, and --tf32 0 or --no_tf32 to disable it. + +PyTorch AMP-like fp16 mixed precision reduces memory usage and accelerates training. [Trainer] automatically enables or disables fp16 based on the value of args.fp16_backend, and you can set the rest of the config yourself. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_4.txt b/chunked/content_aware_chunking/_deepspeed/chunk_4.txt new file mode 100644 index 0000000000000000000000000000000000000000..284e9907f41827f211dd31ccd591f93def03732f --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_4.txt @@ -0,0 +1,18 @@ +For example, to estimate the memory requirements for the bigscience/T0_3B model on a single GPU: + +$ python -c 'from transformers import AutoModel; \ +from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live; \ +model = AutoModel.from_pretrained("bigscience/T0_3B"); \ +estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)' +[] +Estimated memory needed for params, optim states and gradients for a: +HW: Setup with 1 node, 1 GPU per node. +SW: Model with 2783M total params, 65M largest layer params. 
+ per CPU | per GPU | Options + 70.00GB | 0.25GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1 + 70.00GB | 0.25GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0 + 62.23GB | 5.43GB | offload_param=none, offload_optimizer=cpu , zero_init=1 + 62.23GB | 5.43GB | offload_param=none, offload_optimizer=cpu , zero_init=0 + 0.37GB | 46.91GB | offload_param=none, offload_optimizer=none, zero_init=1 + 15.56GB | 46.91GB | offload_param=none, offload_optimizer=none, zero_init=0 +This means you either need a single 80GB GPU without CPU offload or an 8GB GPU and ~60GB of CPU memory to offload to (these are just the memory requirements for the parameters, optimizer states and gradients, and you'll need a bit more for the CUDA kernels and activations). \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_40.txt b/chunked/content_aware_chunking/_deepspeed/chunk_40.txt new file mode 100644 index 0000000000000000000000000000000000000000..0ff1656c93888849529b063d185a615f7d19682a --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_40.txt @@ -0,0 +1,14 @@ +fp16 is enabled from the command line when the following arguments are passed: --fp16, --fp16_backend amp or --fp16_full_eval. +yaml +{ + "fp16": { + "enabled": "auto", + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + } +} +For additional DeepSpeed fp16 training options, take a look at the FP16 Training Options reference. +To configure Apex-like fp16 mixed precision, set up the config as shown below with "auto" or your own values. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_41.txt b/chunked/content_aware_chunking/_deepspeed/chunk_41.txt new file mode 100644 index 0000000000000000000000000000000000000000..e1d6a889c224ec5baf6e561cfae98c201f0e79e7 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_41.txt @@ -0,0 +1,10 @@ +[Trainer] automatically configures amp based on the values of args.fp16_backend and args.fp16_opt_level. It can also be enabled from the command line when the following arguments are passed: --fp16, --fp16_backend apex or --fp16_opt_level O1. +yaml +{ + "amp": { + "enabled": "auto", + "opt_level": "auto" + } +} + +To use bf16, you'll need at least DeepSpeed==0.6.0. bf16 has the same dynamic range as fp32 and doesn’t require loss scaling. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_42.txt b/chunked/content_aware_chunking/_deepspeed/chunk_42.txt new file mode 100644 index 0000000000000000000000000000000000000000..696eeeab73281b60bcbc71f73e0c04383234ed6e --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_42.txt @@ -0,0 +1,11 @@ +However, if you use gradient accumulation with bf16, gradients are accumulated in bf16 which may not be desired because this format's low precision can lead to lossy accumulation. +bf16 can be set up in the config file or enabled from the command line when the following arguments are passed: --bf16 or --bf16_full_eval. +yaml +{ + "bf16": { + "enabled": "auto" + } +} + +Batch size +The batch size can be auto-configured or explicitly set. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_43.txt b/chunked/content_aware_chunking/_deepspeed/chunk_43.txt new file mode 100644 index 0000000000000000000000000000000000000000..6e9353ad0da20b4ebba292b4b41bf12962ff8c1d --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_43.txt @@ -0,0 +1,8 @@ +If you choose to use the "auto" option, [Trainer] sets train_micro_batch_size_per_gpu to the value of args.per_device_train_batch_size and train_batch_size to args.world_size * args.per_device_train_batch_size * args.gradient_accumulation_steps. +yaml +{ + "train_micro_batch_size_per_gpu": "auto", + "train_batch_size": "auto" +} +Gradient accumulation +Gradient accumulation can be auto-configured or explicitly set. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_44.txt b/chunked/content_aware_chunking/_deepspeed/chunk_44.txt new file mode 100644 index 0000000000000000000000000000000000000000..3fa41053c5ac4cacfbade95c120eda0651297546 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_44.txt @@ -0,0 +1,8 @@ +If you choose to use the "auto" option, [Trainer] sets it to the value of args.gradient_accumulation_steps. +```yaml +{ + "gradient_accumulation_steps": "auto" +} + +Gradient clipping +Gradient clipping can be auto-configured or explicitly set. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_45.txt b/chunked/content_aware_chunking/_deepspeed/chunk_45.txt new file mode 100644 index 0000000000000000000000000000000000000000..ee3d7b1e3109d60bb0ffa5cd050a7ade27f71f25 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_45.txt @@ -0,0 +1,8 @@ +If you choose to use the "auto" option, [Trainer] sets it to the value of args.max_grad_norm. 
+yaml +{ + "gradient_clipping": "auto" +} +Communication data type +For communication collectives like reduction, gathering and scattering operations, a separate data type is used. +All gather and scatter operations are performed in the same data type the data is in. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_46.txt b/chunked/content_aware_chunking/_deepspeed/chunk_46.txt new file mode 100644 index 0000000000000000000000000000000000000000..4e8211dbed3f3316472197d8a0214a7c48879eca --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_46.txt @@ -0,0 +1,2 @@ +For example, if you're training with bf16, the data is also gathered in bf16 because gathering is a non-lossy operation. +Reduce operations are lossy, for example when gradients are averaged across multiple GPUs. When the communication is done in fp16 or bf16, it is more likely to be lossy because adding multiple numbers in low precision isn't exact. This is especially the case with bf16 which has a lower precision than fp16. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_47.txt b/chunked/content_aware_chunking/_deepspeed/chunk_47.txt new file mode 100644 index 0000000000000000000000000000000000000000..41bdc59ff2b31c2f3db64066002a57b633e6a8a3 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_47.txt @@ -0,0 +1,2 @@ +For this reason, fp16 is the default for reduction operations because the loss is minimal when averaging gradients. +You can choose the communication data type by setting the communication_data_type parameter in the config file. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_48.txt b/chunked/content_aware_chunking/_deepspeed/chunk_48.txt new file mode 100644 index 0000000000000000000000000000000000000000..a41843f0af3b4f1a16607dbb8df00fe46102951a --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_48.txt @@ -0,0 +1,7 @@ +For example, choosing fp32 adds a small amount of overhead but ensures the reduction is accumulated in fp32; when the result is ready, it is downcast to whichever half-precision dtype you're training in. +yaml +{ + "communication_data_type": "fp32" +} +Deployment +DeepSpeed can be deployed by different launchers such as torchrun, the deepspeed launcher, or Accelerate. To deploy, add --deepspeed ds_config.json to the [Trainer] command line. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_49.txt b/chunked/content_aware_chunking/_deepspeed/chunk_49.txt new file mode 100644 index 0000000000000000000000000000000000000000..d24f9e8ca6aab1fa8c27819ceabb13daafd7c89c --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_49.txt @@ -0,0 +1,4 @@ +It’s recommended to use DeepSpeed’s add_config_arguments utility to add any necessary command line arguments to your code. +This guide will show you how to deploy DeepSpeed with the deepspeed launcher for different training setups. You can check out this post for more practical usage examples. + +To deploy DeepSpeed on multiple GPUs, add the --num_gpus parameter. If you want to use all available GPUs, you don't need to add --num_gpus. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_5.txt b/chunked/content_aware_chunking/_deepspeed/chunk_5.txt new file mode 100644 index 0000000000000000000000000000000000000000..527e58eaa175aa63d4775bccbc9cf0f271dde025 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_5.txt @@ -0,0 +1,4 @@ +You should also consider the tradeoff between cost and speed because it'll be cheaper to rent or buy a smaller GPU but it'll take longer to train your model. +If you have enough GPU memory make sure you disable CPU/NVMe offload to make everything faster. +Select a ZeRO stage +After you've installed DeepSpeed and have a better idea of your memory requirements, the next step is selecting a ZeRO stage to use. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_50.txt b/chunked/content_aware_chunking/_deepspeed/chunk_50.txt new file mode 100644 index 0000000000000000000000000000000000000000..b778d4a28c7b85f777033b8a29568f3075020b04 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_50.txt @@ -0,0 +1,11 @@ +The example below uses 2 GPUs. + +deepspeed --num_gpus=2 examples/pytorch/translation/run_translation.py \ +--deepspeed tests/deepspeed/ds_config_zero3.json \ +--model_name_or_path google-t5/t5-small --per_device_train_batch_size 1 \ +--output_dir output_dir --overwrite_output_dir --fp16 \ +--do_train --max_train_samples 500 --num_train_epochs 1 \ +--dataset_name wmt16 --dataset_config "ro-en" \ +--source_lang en --target_lang ro + +To deploy DeepSpeed on a single GPU, add the --num_gpus parameter. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_51.txt b/chunked/content_aware_chunking/_deepspeed/chunk_51.txt new file mode 100644 index 0000000000000000000000000000000000000000..52e33e9cba6429ffe4a0fbafe0454ea481293277 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_51.txt @@ -0,0 +1,18 @@ +It isn't necessary to explicitly set this value if you only have 1 GPU because DeepSpeed deploys all GPUs it can see on a given node. + +deepspeed --num_gpus=1 examples/pytorch/translation/run_translation.py \ +--deepspeed tests/deepspeed/ds_config_zero2.json \ +--model_name_or_path google-t5/t5-small --per_device_train_batch_size 1 \ +--output_dir output_dir --overwrite_output_dir --fp16 \ +--do_train --max_train_samples 500 --num_train_epochs 1 \ +--dataset_name wmt16 --dataset_config "ro-en" \ +--source_lang en --target_lang ro +DeepSpeed is still useful with just 1 GPU because you can: + +Offload some computations and memory to the CPU to make more GPU resources available to your model to use a larger batch size or fit a very large model that normally won't fit. +Minimize memory fragmentation with its smart GPU memory management system which also allows you to fit bigger models and data batches. + +Set the allgather_bucket_size and reduce_bucket_size values to 2e8 in the ZeRO-2 configuration file to get better performance on a single GPU. + +Multi-node deployment +A node is one or more GPUs for running a workload. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_52.txt b/chunked/content_aware_chunking/_deepspeed/chunk_52.txt new file mode 100644 index 0000000000000000000000000000000000000000..5263545ed6990fefe7b57d2e819230907c2bce1a --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_52.txt @@ -0,0 +1,2 @@ +A more powerful setup is a multi-node setup which can be launched with the deepspeed launcher. 
For this guide, let's assume there are two nodes with 8 GPUs each. The first node can be accessed with ssh hostname1 and the second node with ssh hostname2. Both nodes must be able to communicate with each other locally over ssh without a password. +By default, DeepSpeed expects your multi-node environment to use shared storage. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_53.txt b/chunked/content_aware_chunking/_deepspeed/chunk_53.txt new file mode 100644 index 0000000000000000000000000000000000000000..ab3d37bfaf977923eee4cbe07480a957f6965d6a --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_53.txt @@ -0,0 +1,10 @@ +If this is not the case and each node can only see the local filesystem, you need to adjust the config file to include a checkpoint to allow loading without access to a shared filesystem: +yaml +{ + "checkpoint": { + "use_node_local_storage": true + } +} +You could also use the [Trainer]'s --save_on_each_node argument to automatically add the above checkpoint to your config. + +For torchrun, you have to ssh to each node and run the following command on both of them. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_54.txt b/chunked/content_aware_chunking/_deepspeed/chunk_54.txt new file mode 100644 index 0000000000000000000000000000000000000000..b398d36e11a202e1736ebba83e20eee982534310 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_54.txt @@ -0,0 +1,10 @@ +The launcher waits until both nodes are synchronized before launching the training. + +python -m torch.distributed.run --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=hostname1 \ +--master_port=9901 your_program.py --deepspeed ds_config.json + +For the deepspeed launcher, start by creating a hostfile. + +hostname1 slots=8 +hostname2 slots=8 +Then you can launch the training with the following command. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_55.txt b/chunked/content_aware_chunking/_deepspeed/chunk_55.txt new file mode 100644 index 0000000000000000000000000000000000000000..028e344bc777afdddd87400dca4499f661622eca --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_55.txt @@ -0,0 +1,8 @@ +The deepspeed launcher automatically launches the command on both nodes at once. + +deepspeed --num_gpus 8 --num_nodes 2 --hostfile hostfile --master_addr hostname1 --master_port=9901 \ +your_program.py --deepspeed ds_config.json +Check out the Resource Configuration (multi-node) guide for more details about configuring multi-node compute resources. + +SLURM +In a SLURM environment, you'll need to adapt your SLURM script to your specific SLURM environment. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_56.txt b/chunked/content_aware_chunking/_deepspeed/chunk_56.txt new file mode 100644 index 0000000000000000000000000000000000000000..6da53079a3044aff7da863585b9ffed7083b86fc --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_56.txt @@ -0,0 +1,22 @@ +An example SLURM script may look like: +```bash +#SBATCH --job-name=test-nodes # name +#SBATCH --nodes=2 # nodes +#SBATCH --ntasks-per-node=1 # crucial - only 1 task per dist per node! 
+#SBATCH --cpus-per-task=10 # number of cores per task +#SBATCH --gres=gpu:8 # number of gpus +#SBATCH --time 20:00:00 # maximum execution time (HH:MM:SS) +#SBATCH --output=%x-%j.out # output file name +export GPUS_PER_NODE=8 +export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1) +export MASTER_PORT=9901 +srun --jobid $SLURM_JOBID bash -c 'python -m torch.distributed.run \ + --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES --node_rank $SLURM_PROCID \ + --master_addr $MASTER_ADDR --master_port $MASTER_PORT \ +your_program.py --deepspeed ds_config.json' + +Then you can schedule your multi-node deployment with the following command which launches training simultaneously on all nodes. + +sbatch launch.slurm +Notebook +The deepspeed launcher doesn't support deployment from a notebook so you'll need to emulate the distributed environment. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_57.txt b/chunked/content_aware_chunking/_deepspeed/chunk_57.txt new file mode 100644 index 0000000000000000000000000000000000000000..306a97e378d22162752ccc1b23bbf83762d1a3c9 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_57.txt @@ -0,0 +1 @@ +However, this only works for 1 GPU. If you want to use more than 1 GPU, you must use a multi-process environment for DeepSpeed to work. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_58.txt b/chunked/content_aware_chunking/_deepspeed/chunk_58.txt new file mode 100644 index 0000000000000000000000000000000000000000..918c5ea6cf52d67c21472d0c34004401a9bcd966 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_58.txt @@ -0,0 +1,79 @@ +This means you have to use the deepspeed launcher which can't be emulated as shown here. + +DeepSpeed requires a distributed environment even when only one process is used. 
+This emulates a launcher in the notebook +import os +os.environ["MASTER_ADDR"] = "localhost" +os.environ["MASTER_PORT"] = "9994" # modify if RuntimeError: Address already in use +os.environ["RANK"] = "0" +os.environ["LOCAL_RANK"] = "0" +os.environ["WORLD_SIZE"] = "1" +Now proceed as normal, plus pass the DeepSpeed config file +training_args = TrainingArguments(, deepspeed="ds_config_zero3.json") +trainer = Trainer() +trainer.train() + +If you want to create the config file on the fly in the notebook in the current directory, you could have a dedicated cell. + +%%bash +cat <<'EOT' > ds_config_zero3.json +{ + "fp16": { + "enabled": "auto", + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + }, +"optimizer": { + "type": "AdamW", + "params": { + "lr": "auto", + "betas": "auto", + "eps": "auto", + "weight_decay": "auto" + } +}, + +"scheduler": { + "type": "WarmupLR", + "params": { + "warmup_min_lr": "auto", + "warmup_max_lr": "auto", + "warmup_num_steps": "auto" + } +}, + +"zero_optimization": { + "stage": 3, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "offload_param": { + "device": "cpu", + "pin_memory": true + }, + "overlap_comm": true, + "contiguous_gradients": true, + "sub_group_size": 1e9, + "reduce_bucket_size": "auto", + "stage3_prefetch_bucket_size": "auto", + "stage3_param_persistence_threshold": "auto", + "stage3_max_live_parameters": 1e9, + "stage3_max_reuse_distance": 1e9, + "stage3_gather_16bit_weights_on_model_save": true +}, + +"gradient_accumulation_steps": "auto", +"gradient_clipping": "auto", +"steps_per_print": 2000, +"train_batch_size": "auto", +"train_micro_batch_size_per_gpu": "auto", +"wall_clock_breakdown": false + +} +EOT + +If the training script is in a file and not in a notebook cell, you can launch deepspeed normally from the shell in a notebook cell. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_59.txt b/chunked/content_aware_chunking/_deepspeed/chunk_59.txt new file mode 100644 index 0000000000000000000000000000000000000000..c96aa920e0b27760816f4214fb41460aa233c176 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_59.txt @@ -0,0 +1,5 @@ +For example, to launch run_translation.py: +py +!git clone https://github.com/huggingface/transformers +!cd transformers; deepspeed examples/pytorch/translation/run_translation.py +You could also use %%bash magic and write multi-line code to run the shell program, but you won't be able to view the logs until training is complete. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_6.txt b/chunked/content_aware_chunking/_deepspeed/chunk_6.txt new file mode 100644 index 0000000000000000000000000000000000000000..f4259f8a79d02b6095124caf2ed2788cfa4e7674 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_6.txt @@ -0,0 +1,9 @@ +In order of fastest and most memory-efficient: +| Fastest | Memory efficient | +|------------------|------------------| +| ZeRO-1 | ZeRO-3 + offload | +| ZeRO-2 | ZeRO-3 | +| ZeRO-2 + offload | ZeRO-2 + offload | +| ZeRO-3 | ZeRO-2 | +| ZeRO-3 + offload | ZeRO-1 | +To find what works best for you, start with the fastest approach and if you run out of memory, try the next stage which is slower but more memory efficient. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_60.txt b/chunked/content_aware_chunking/_deepspeed/chunk_60.txt new file mode 100644 index 0000000000000000000000000000000000000000..796a0c081500b2219880f668cd05460f5608e5e0 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_60.txt @@ -0,0 +1,11 @@ +With %%bash magic, you don't need to emulate a distributed environment. 
+ +%%bash +git clone https://github.com/huggingface/transformers +cd transformers +deepspeed examples/pytorch/translation/run_translation.py + +Save model weights +DeepSpeed stores the main full precision fp32 weights in custom checkpoint optimizer files (the glob pattern looks like global_step*/*optim_states.pt), which are saved under the normal checkpoint. + +A model trained with ZeRO-2 saves the pytorch_model.bin weights in fp16. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_61.txt b/chunked/content_aware_chunking/_deepspeed/chunk_61.txt new file mode 100644 index 0000000000000000000000000000000000000000..697e4ec42670ad6c45d48438e7df62b1ebb996dd --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_61.txt @@ -0,0 +1 @@ +To save the model weights in fp16 for a model trained with ZeRO-3, you need to set "stage3_gather_16bit_weights_on_model_save": true because the model weights are partitioned across multiple GPUs. Otherwise, the [Trainer] won't save the weights in fp16 and it won't create a pytorch_model.bin file. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_62.txt b/chunked/content_aware_chunking/_deepspeed/chunk_62.txt new file mode 100644 index 0000000000000000000000000000000000000000..486c2baf29c1e1a78bfe1794add2c2fc09cd0cd8 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_62.txt @@ -0,0 +1,9 @@ +This is because DeepSpeed's state_dict contains a placeholder instead of the real weights and you won't be able to load them. +yaml +{ + "zero_optimization": { + "stage3_gather_16bit_weights_on_model_save": true + } +} + +The full precision weights shouldn't be saved during training because doing so can require a lot of memory. It is usually best to save the fp32 weights offline after training is complete. But if you have a lot of free CPU memory, it is possible to save the fp32 weights during training. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_63.txt b/chunked/content_aware_chunking/_deepspeed/chunk_63.txt new file mode 100644 index 0000000000000000000000000000000000000000..3b3e977f4dc535cbb6ee537a7dd0be5ad8d7a82d --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_63.txt @@ -0,0 +1,10 @@ +This section covers both online and offline approaches. +Online +You must have saved at least one checkpoint to load the latest checkpoint as shown in the following: + +from transformers.trainer_utils import get_last_checkpoint +from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint +checkpoint_dir = get_last_checkpoint(trainer.args.output_dir) +fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir) + +If you've enabled the --load_best_model_at_end parameter to track the best checkpoint in [TrainingArguments], you can finish training first and save the final model explicitly. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_64.txt b/chunked/content_aware_chunking/_deepspeed/chunk_64.txt new file mode 100644 index 0000000000000000000000000000000000000000..bb16fe0679e3057906ccb59bd5c2652257cd42da --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_64.txt @@ -0,0 +1,8 @@ +Then you can reload it as shown below: + +from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint +checkpoint_dir = os.path.join(trainer.args.output_dir, "checkpoint-final") +trainer.deepspeed.save_checkpoint(checkpoint_dir) +fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir) + +Once load_state_dict_from_zero_checkpoint is run, the model is no longer usable in DeepSpeed in the context of the same application. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_65.txt b/chunked/content_aware_chunking/_deepspeed/chunk_65.txt new file mode 100644 index 0000000000000000000000000000000000000000..0bc5e1baa81d3f632c1076cc03fcc560f459efe3 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_65.txt @@ -0,0 +1 @@ +You'll need to initialize the DeepSpeed engine again since model.load_state_dict(state_dict) removes all the DeepSpeed magic from it. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_66.txt b/chunked/content_aware_chunking/_deepspeed/chunk_66.txt new file mode 100644 index 0000000000000000000000000000000000000000..7119384e3a60f6261df136b61fd188add09c65b9 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_66.txt @@ -0,0 +1,11 @@ +Only use this at the very end of training. + +You can also extract and load the state_dict of the fp32 weights: + +from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint +state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir) # already on cpu +model = model.cpu() +model.load_state_dict(state_dict) + +Offline +DeepSpeed provides a zero_to_fp32.py script at the top-level of the checkpoint folder for extracting weights at any point. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_67.txt b/chunked/content_aware_chunking/_deepspeed/chunk_67.txt new file mode 100644 index 0000000000000000000000000000000000000000..a9d95973d73768db1a70e6caf677e77b01fee096 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_67.txt @@ -0,0 +1,17 @@ +This is a standalone script and you don't need a configuration file or [Trainer]. 
+For example, if your checkpoint folder looked like this: + +$ ls -l output_dir/checkpoint-1/ +-rw-rw-r-- 1 stas stas 1.4K Mar 27 20:42 config.json +drwxrwxr-x 2 stas stas 4.0K Mar 25 19:52 global_step1/ +-rw-rw-r-- 1 stas stas 12 Mar 27 13:16 latest +-rw-rw-r-- 1 stas stas 827K Mar 27 20:42 optimizer.pt +-rw-rw-r-- 1 stas stas 231M Mar 27 20:42 pytorch_model.bin +-rw-rw-r-- 1 stas stas 623 Mar 27 20:42 scheduler.pt +-rw-rw-r-- 1 stas stas 1.8K Mar 27 20:42 special_tokens_map.json +-rw-rw-r-- 1 stas stas 774K Mar 27 20:42 spiece.model +-rw-rw-r-- 1 stas stas 1.9K Mar 27 20:42 tokenizer_config.json +-rw-rw-r-- 1 stas stas 339 Mar 27 20:42 trainer_state.json +-rw-rw-r-- 1 stas stas 2.3K Mar 27 20:42 training_args.bin +-rwxrw-r-- 1 stas stas 5.5K Mar 27 13:16 zero_to_fp32.py* +To reconstruct the fp32 weights from the DeepSpeed checkpoint (ZeRO-2 or ZeRO-3) subfolder global_step1, run the following command to create and consolidate the full fp32 weights from multiple GPUs into a single pytorch_model.bin file. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_68.txt b/chunked/content_aware_chunking/_deepspeed/chunk_68.txt new file mode 100644 index 0000000000000000000000000000000000000000..35dcff90ef067799464a7f186621a28a312fa59d --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_68.txt @@ -0,0 +1,8 @@ +The script automatically discovers the subfolder containing the checkpoint. +py +python zero_to_fp32.py . pytorch_model.bin + +Run python zero_to_fp32.py -h for more usage details. The script requires 2x the general RAM of the final fp32 weights. + +ZeRO Inference +ZeRO Inference places the model weights in CPU or NVMe memory to avoid burdening the GPU which makes it possible to run inference with huge models on a GPU. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_69.txt b/chunked/content_aware_chunking/_deepspeed/chunk_69.txt new file mode 100644 index 0000000000000000000000000000000000000000..d90b787b7d15e8e73503905e75d96780afeb8309 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_69.txt @@ -0,0 +1,7 @@ +Inference doesn't require any large additional amounts of memory for the optimizer states and gradients so you can fit much larger batches and/or sequence lengths on the same hardware. +ZeRO Inference shares the same configuration file as ZeRO-3, and ZeRO-2 and ZeRO-1 configs won't work because they don't provide any benefits for inference. +To run ZeRO Inference, pass your usual training arguments to the [TrainingArguments] class and add the --do_eval argument. + +deepspeed --num_gpus=2 your_program.py --do_eval --deepspeed ds_config.json +Non-Trainer DeepSpeed integration +DeepSpeed also works with Transformers without the [Trainer] class. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_7.txt b/chunked/content_aware_chunking/_deepspeed/chunk_7.txt new file mode 100644 index 0000000000000000000000000000000000000000..cfc35e3459b94ff67a4774d3812f25176ace08b3 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_7.txt @@ -0,0 +1,17 @@ +Feel free to work in whichever direction you prefer (starting with the most memory efficient or fastest) to discover the appropriate balance between speed and memory usage. 
+A general process you can use is (start with batch size of 1): + +enable gradient checkpointing +try ZeRO-2 +try ZeRO-2 and offload the optimizer +try ZeRO-3 +try ZeRO-3 and offload parameters to the CPU +try ZeRO-3 and offload parameters and the optimizer to the CPU +try lowering various default values like a narrower search beam if you're using the [~GenerationMixin.generate] method +try mixed half-precision (fp16 on older GPU architectures and bf16 on Ampere) over full-precision weights +add more hardware if possible or enable Infinity to offload parameters and the optimizer to an NVMe +once you're not running out of memory, measure effective throughput and then try to increase the batch size as large as you can to maximize GPU efficiency +lastly, try to optimize your training setup by disabling some offload features or using a faster ZeRO stage and increasing/decreasing the batch size to find the best tradeoff between speed and memory usage + +DeepSpeed configuration file +DeepSpeed works with the [Trainer] class by way of a config file containing all the parameters for configuring how you want to set up your training run. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_70.txt b/chunked/content_aware_chunking/_deepspeed/chunk_70.txt new file mode 100644 index 0000000000000000000000000000000000000000..5c3412985832b405c5f197c9941859639b1d299a --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_70.txt @@ -0,0 +1,29 @@ +This is handled by the [HfDeepSpeedConfig] which only takes care of gathering ZeRO-3 parameters and splitting a model across multiple GPUs when you call [~PreTrainedModel.from_pretrained]. + +If you want everything automatically taken care of for you, try using DeepSpeed with the [Trainer]! You'll need to follow the DeepSpeed documentation, and manually configure the parameter values in the config file (you can't use the "auto" value).
+ +To efficiently deploy ZeRO-3, you must instantiate the [HfDeepSpeedConfig] object before the model and keep that object alive: + +from transformers.integrations import HfDeepSpeedConfig +from transformers import AutoModel +import deepspeed +ds_config = {} # deepspeed config object or path to the file +# must run before instantiating the model to detect zero 3 +dschf = HfDeepSpeedConfig(ds_config) # keep this object alive +model = AutoModel.from_pretrained("openai-community/gpt2") +engine = deepspeed.initialize(model=model, config_params=ds_config, ) + +[HfDeepSpeedConfig] is not required for ZeRO-1 or ZeRO-2. + +from transformers.integrations import HfDeepSpeedConfig +from transformers import AutoModel, AutoConfig +import deepspeed +ds_config = {} # deepspeed config object or path to the file +# must run before instantiating the model to detect zero 3 +dschf = HfDeepSpeedConfig(ds_config) # keep this object alive +config = AutoConfig.from_pretrained("openai-community/gpt2") +model = AutoModel.from_config(config) +engine = deepspeed.initialize(model=model, config_params=ds_config, ) + +Non-Trainer ZeRO Inference +To run ZeRO Inference without the [Trainer] in cases where you can’t fit a model onto a single GPU, try using additional GPUs and/or offloading to CPU memory. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_71.txt b/chunked/content_aware_chunking/_deepspeed/chunk_71.txt new file mode 100644 index 0000000000000000000000000000000000000000..6c472d6336d018849847b18b9d2953143adccd9b --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_71.txt @@ -0,0 +1,5 @@ +The important nuance to understand here is that the way ZeRO is designed, you can process different inputs on different GPUs in parallel. +Make sure to: + +disable CPU offload if you have enough GPU memory (since it slows things down). +enable bf16 if you have an Ampere or newer GPU to make things faster.
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_72.txt b/chunked/content_aware_chunking/_deepspeed/chunk_72.txt new file mode 100644 index 0000000000000000000000000000000000000000..6315b5aa78c60f89579189c6a17cd5d4ead08819 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_72.txt @@ -0,0 +1,10 @@ +If you don’t have one of these GPUs, you may enable fp16 as long as you don’t use a model pretrained in bf16 (T5 models) because it may lead to an overflow error. + +Take a look at the following script to get a better idea of how to run ZeRO Inference without the [Trainer] on a model that won't fit on a single GPU. + +!/usr/bin/env python +This script demonstrates how to use Deepspeed ZeRO in an inference mode when one can't fit a model +into a single GPU + +1. Use 1 GPU with CPU offload +2. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_73.txt b/chunked/content_aware_chunking/_deepspeed/chunk_73.txt new file mode 100644 index 0000000000000000000000000000000000000000..3a6be27ebb91588dd50c19497a13260dfd95aaa1 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_73.txt @@ -0,0 +1,9 @@ +Or use multiple GPUs instead + +First you need to install deepspeed: pip install deepspeed + +Here we use a 3B "bigscience/T0_3B" model which needs about 15GB GPU RAM - so 1 largish or 2 +small GPUs can handle it. or 1 small GPU and a lot of CPU memory. + +To use a larger model like "bigscience/T0" which needs about 50GB, unless you have an 80GB GPU - +you will need 2-4 gpus. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_74.txt b/chunked/content_aware_chunking/_deepspeed/chunk_74.txt new file mode 100644 index 0000000000000000000000000000000000000000..cfdd68fcbf1520bdeb6786f15b2a9a29f3d83ec0 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_74.txt @@ -0,0 +1,6 @@ +And then you can adapt the script to handle more gpus if you want to +process multiple inputs at once. + +The provided deepspeed config also activates CPU memory offloading, so chances are that if you +have a lot of available CPU memory and you don't mind a slowdown you should be able to load a +model that doesn't normally fit into a single GPU. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_75.txt b/chunked/content_aware_chunking/_deepspeed/chunk_75.txt new file mode 100644 index 0000000000000000000000000000000000000000..75eef872cbaa08d469441531ebf97fec6679eebb --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_75.txt @@ -0,0 +1,87 @@ +If you have enough GPU memory the program will +run faster if you don't want offload to CPU - so disable that section then. 
+ +To deploy on 1 gpu: + +deepspeed --num_gpus 1 t0.py +or: +python -m torch.distributed.run --nproc_per_node=1 t0.py + +To deploy on 2 gpus: + +deepspeed --num_gpus 2 t0.py +or: +python -m torch.distributed.run --nproc_per_node=2 t0.py +from transformers import AutoTokenizer, AutoConfig, AutoModelForSeq2SeqLM +from transformers.integrations import HfDeepSpeedConfig +import deepspeed +import os +import torch +os.environ["TOKENIZERS_PARALLELISM"] = "false" # To avoid warnings about parallelism in tokenizers +distributed setup +local_rank = int(os.getenv("LOCAL_RANK", "0")) +world_size = int(os.getenv("WORLD_SIZE", "1")) +torch.cuda.set_device(local_rank) +deepspeed.init_distributed() +model_name = "bigscience/T0_3B" +config = AutoConfig.from_pretrained(model_name) +model_hidden_size = config.d_model +batch size has to be divisible by world_size, but can be bigger than world_size +train_batch_size = 1 * world_size +ds_config notes + +- enable bf16 if you use Ampere or higher GPU - this will run in mixed precision and will be +faster. + +- for older GPUs you can enable fp16, but it'll only work for non-bf16 pretrained models - e.g. 
+all official t5 models are bf16-pretrained + +- set offload_param.device to "none" or completely remove the offload_param section if you don't +- want CPU offload + +- if using offload_param you can manually finetune stage3_param_persistence_threshold to control +- which params should remain on gpus - the larger the value the smaller the offload size + +For in-depth info on Deepspeed config see +https://huggingface.co/docs/transformers/main/main_classes/deepspeed +keeping the same format as json for consistency, except it uses lower case for true/false +fmt: off +ds_config = { + "fp16": { + "enabled": False + }, + "bf16": { + "enabled": False + }, + "zero_optimization": { + "stage": 3, + "offload_param": { + "device": "cpu", + "pin_memory": True + }, + "overlap_comm": True, + "contiguous_gradients": True, + "reduce_bucket_size": model_hidden_size * model_hidden_size, + "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size, + "stage3_param_persistence_threshold": 10 * model_hidden_size + }, + "steps_per_print": 2000, + "train_batch_size": train_batch_size, + "train_micro_batch_size_per_gpu": 1, + "wall_clock_breakdown": False +} +fmt: on +next line instructs transformers to partition the model directly over multiple gpus using +deepspeed.zero.Init when model's from_pretrained method is called. + +it has to be run before loading the model AutoModelForSeq2SeqLM.from_pretrained(model_name) + +otherwise the model will first be loaded normally and only partitioned at forward time which is +less efficient and when there is little CPU RAM may fail +dschf = HfDeepSpeedConfig(ds_config) # keep this object alive +now a model can be loaded. +model = AutoModelForSeq2SeqLM.from_pretrained(model_name) +initialise Deepspeed ZeRO and store only the engine object +ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0] +ds_engine.module.eval() # inference +Deepspeed ZeRO can process unrelated inputs on each GPU. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_76.txt b/chunked/content_aware_chunking/_deepspeed/chunk_76.txt new file mode 100644 index 0000000000000000000000000000000000000000..4cfd784e2d49503bfbbf1f14aca7fdecc7cca51a --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_76.txt @@ -0,0 +1,28 @@ +So for 2 gpus you process 2 inputs at once. +If you use more GPUs adjust for more. +And of course if you have just one input to process you then need to pass the same string to both gpus +If you use only one GPU, then you will have only rank 0. +rank = torch.distributed.get_rank() +if rank == 0: + text_in = "Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy" +elif rank == 1: + text_in = "Is this review positive or negative? Review: this is the worst restaurant ever" +tokenizer = AutoTokenizer.from_pretrained(model_name) +inputs = tokenizer.encode(text_in, return_tensors="pt").to(device=local_rank) +with torch.no_grad(): + outputs = ds_engine.module.generate(inputs, synced_gpus=True) +text_out = tokenizer.decode(outputs[0], skip_special_tokens=True) +print(f"rank{rank}:\n in={text_in}\n out={text_out}") + +Save the script as t0.py and launch it: + +$ deepspeed --num_gpus 2 t0.py +rank0: + in=Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy + out=Positive +rank1: + in=Is this review positive or negative? Review: this is the worst restaurant ever + out=negative +This is a very basic example and you'll want to adapt it to your use case. +Generate +Using multiple GPUs with ZeRO-3 for generation requires synchronizing the GPUs by setting synced_gpus=True in the [~GenerationMixin.generate] method. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_77.txt b/chunked/content_aware_chunking/_deepspeed/chunk_77.txt new file mode 100644 index 0000000000000000000000000000000000000000..cdd7cf3dea1fa92d5a5e61d8122dbff7f70a8b25 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_77.txt @@ -0,0 +1,4 @@ +Otherwise, if one GPU is finished generating before another one, the whole system hangs because the remaining GPUs haven't received the weight shard from the GPU that finished first. +For Transformers>=4.28, synced_gpus is automatically set to True if multiple GPUs are detected during generation. +Troubleshoot +When you encounter an issue, you should consider whether DeepSpeed is the cause of the problem because often it isn't (unless it's super obvious and you can see DeepSpeed modules in the exception)! The first step should be to retry your setup without DeepSpeed, and if the problem persists, then you can report the issue. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_78.txt b/chunked/content_aware_chunking/_deepspeed/chunk_78.txt new file mode 100644 index 0000000000000000000000000000000000000000..9177cfa3354e392aa493736fc119218a1d5295e6 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_78.txt @@ -0,0 +1,20 @@ +If the issue is a core DeepSpeed problem and unrelated to the Transformers integration, open an Issue on the DeepSpeed repository.
+For issues related to the Transformers integration, please provide the following information: + +the full DeepSpeed config file + +the command line arguments of the [Trainer], or [TrainingArguments] arguments if you're scripting the [Trainer] setup yourself (don't dump the [TrainingArguments] which has dozens of irrelevant entries) + +the outputs of: + +python -c 'import torch; print(f"torch: {torch.__version__}")' +python -c 'import transformers; print(f"transformers: {transformers.__version__}")' +python -c 'import deepspeed; print(f"deepspeed: {deepspeed.__version__}")' + +a link to a Google Colab notebook to reproduce the issue + +if impossible, a standard and non-custom dataset we can use and also try to use an existing example to reproduce the issue with + +The following sections provide a guide for resolving two of the most common issues. +DeepSpeed process killed at startup +When the DeepSpeed process is killed during launch without a traceback, that usually means the program tried to allocate more CPU memory than your system has or your process tried to allocate more CPU memory than allowed leading the OS kernel to terminate the process. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_79.txt b/chunked/content_aware_chunking/_deepspeed/chunk_79.txt new file mode 100644 index 0000000000000000000000000000000000000000..43fc14c609b98acc017361f41aad75480c1293a4 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_79.txt @@ -0,0 +1,4 @@ +In this case, check whether your configuration file has either offload_optimizer, offload_param or both configured to offload to the CPU. +If you have NVMe and ZeRO-3 setup, experiment with offloading to the NVMe (estimate the memory requirements for your model). +NaN loss +NaN loss often occurs when a model is pretrained in bf16 and then you try to use it with fp16 (especially relevant for TPU trained models). 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_8.txt b/chunked/content_aware_chunking/_deepspeed/chunk_8.txt new file mode 100644 index 0000000000000000000000000000000000000000..1b30e6260390675a43a5789a38a06e55bf1239b6 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_8.txt @@ -0,0 +1,3 @@ +When you execute your training script, DeepSpeed logs the configuration it received from [Trainer] to the console so you can see exactly what configuration was used. + +Find a complete list of DeepSpeed configuration options on the DeepSpeed Configuration JSON reference. You can also find more practical examples of various DeepSpeed configuration examples on the DeepSpeedExamples repository or the main DeepSpeed repository. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_80.txt b/chunked/content_aware_chunking/_deepspeed/chunk_80.txt new file mode 100644 index 0000000000000000000000000000000000000000..6989d5cc7c22fb683a3a1b9a4fda749ce5ea1078 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_80.txt @@ -0,0 +1,2 @@ +To resolve this, use fp32 or bf16 if your hardware supports it (TPU, Ampere GPUs or newer). +The other issue may be related to using fp16. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_deepspeed/chunk_81.txt b/chunked/content_aware_chunking/_deepspeed/chunk_81.txt new file mode 100644 index 0000000000000000000000000000000000000000..08fc264eea30339314b40f3f3341154594968467 --- /dev/null +++ b/chunked/content_aware_chunking/_deepspeed/chunk_81.txt @@ -0,0 +1,16 @@ +For example, if this is your fp16 configuration: +yaml +{ + "fp16": { + "enabled": "auto", + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + } +} +You might see the following OVERFLOW! messages in the logs: + +0%| | 0/189 [00:0015 billion parameter model. 
+While we see very little degradation in accuracy for our model here, 4-bit quantization can in practice often lead to different results compared to 8-bit quantization or full bfloat16 inference. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_2.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_2.txt new file mode 100644 index 0000000000000000000000000000000000000000..9318ca1e7064ac1199d9f6b090b8784d719c6cdc --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_2.txt @@ -0,0 +1,10 @@ +This necessitates the model's capability to manage very long input sequences during inference. + +The crux of these challenges lies in augmenting the computational and memory capabilities of LLMs, especially when handling expansive input sequences. +In this guide, we will go over the effective techniques for efficient LLM deployment: + +Lower Precision: Research has shown that operating at reduced numerical precision, namely 8-bit and 4-bit can achieve computational advantages without a considerable decline in model performance. + +Flash Attention: Flash Attention is a variation of the attention algorithm that not only provides a more memory-efficient approach but also realizes increased efficiency due to optimized GPU memory utilization. + +Architectural Innovations: Considering that LLMs are always deployed in the same way during inference, namely autoregressive text generation with a long input context, specialized model architectures have been proposed that allow for more efficient inference. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_20.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_20.txt new file mode 100644 index 0000000000000000000000000000000000000000..6c747e4912c161a17c3210a92434dc92feff5f8c --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_20.txt @@ -0,0 +1,14 @@ +It is up to the user to try it out. +Also note that inference here was again a bit slower compared to 8-bit quantization which is due to the more aggressive quantization method used for 4-bit quantization leading to \( \text{quantize} \) and \( \text{dequantize} \) taking longer during inference. +python +del model +del pipe +python +flush() +Overall, we saw that running OctoCoder in 8-bit precision reduced the required GPU VRAM from 32G GPU VRAM to only 15GB and running the model in 4-bit precision further reduces the required GPU VRAM to just a bit over 9GB. +4-bit quantization allows the model to be run on GPUs such as RTX3090, V100, and T4 which are quite accessible for most people. +For more information on quantization and to see how one can quantize models to require even less GPU VRAM memory than 4-bit, we recommend looking into the AutoGPTQ implementation. + +As a conclusion, it is important to remember that model quantization trades improved memory efficiency against accuracy and in some cases inference time. + +If GPU memory is not a constraint for your use case, there is often no need to look into quantization. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_21.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_21.txt new file mode 100644 index 0000000000000000000000000000000000000000..c4638b5a481209f68652251eaa1f7948b15c90b4 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_21.txt @@ -0,0 +1,4 @@ +However, many GPUs simply can't run LLMs without quantization methods and in this case, 4-bit and 8-bit quantization schemes are extremely useful tools. +For more detailed usage information, we strongly recommend taking a look at the Transformers Quantization Docs. +Next, let's look into how we can improve computational and memory efficiency by using better algorithms and an improved model architecture. +2. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_22.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_22.txt new file mode 100644 index 0000000000000000000000000000000000000000..e5abee9c0e3280501faee5febe0f55d752557582 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_22.txt @@ -0,0 +1,6 @@ +Flash Attention +Today's top-performing LLMs share more or less the same fundamental architecture that consists of feed-forward layers, activation layers, layer normalization layers, and most crucially, self-attention layers. +Self-attention layers are central to Large Language Models (LLMs) in that they enable the model to understand the contextual relationships between input tokens. +However, both the compute cost and the peak GPU memory consumption of self-attention layers grow quadratically with the number of input tokens (also called sequence length), which we denote in the following by \( N \) .
+While this is not really noticeable for shorter input sequences (of up to 1000 input tokens), it becomes a serious problem for longer input sequences (at around 16000 input tokens). +Let's take a closer look. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_23.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_23.txt new file mode 100644 index 0000000000000000000000000000000000000000..7463537b41d474517b90cc4c6b0223f00f850f60 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_23.txt @@ -0,0 +1,3 @@ +The formula to compute the output \( \mathbf{O} \) of a self-attention layer for an input \( \mathbf{X} \) of length \( N \) is: +$$ \textbf{O} = \text{Attn}(\mathbf{X}) = \mathbf{V} \times \text{Softmax}(\mathbf{QK}^T) \text{ with } \mathbf{Q} = \mathbf{W}_q \mathbf{X}, \mathbf{V} = \mathbf{W}_v \mathbf{X}, \mathbf{K} = \mathbf{W}_k \mathbf{X} $$ +\( \mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_{N}) \) is the input sequence to the attention layer. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_24.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_24.txt new file mode 100644 index 0000000000000000000000000000000000000000..22d46663ea3c7ef8cd7f973e32ee22a23ffe956a --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_24.txt @@ -0,0 +1,3 @@ +The projections \( \mathbf{Q} \) and \( \mathbf{K} \) will each consist of \( N \) vectors, so the \( \mathbf{QK}^T \) matrix has \( N^2 \) entries. +LLMs usually have multiple attention heads, thus doing multiple self-attention computations in parallel. +Assuming the LLM has 40 attention heads and runs in bfloat16 precision (2 bytes per entry), we can calculate the memory requirement to store the \( \mathbf{QK}^T \) matrices to be \( 40 * 2 * N^2 \) bytes.
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_25.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_25.txt new file mode 100644 index 0000000000000000000000000000000000000000..349277fe1243ec5fc36995e7efe222aca60b7c70 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_25.txt @@ -0,0 +1,3 @@ +For \( N=1000 \) only around 80 MB of VRAM are needed; however, for \( N=16000 \) we would need about 20 GB of VRAM, and for \( N=100,000 \) we would need almost 1 TB just to store the \( \mathbf{QK}^T \) matrices. +Long story short, the default self-attention algorithm quickly becomes prohibitively memory-expensive for large input contexts. +As LLMs improve in text comprehension and generation, they are applied to increasingly complex tasks. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_26.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_26.txt new file mode 100644 index 0000000000000000000000000000000000000000..5d6f595335685487a91129e756232197464f0d5f --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_26.txt @@ -0,0 +1,2 @@ +While models once handled the translation or summarization of a few sentences, they now manage entire pages, demanding the capability to process extensive input lengths. +How can we get rid of the exorbitant memory requirements for large input lengths? We need a new way to compute the self-attention mechanism that gets rid of the \( QK^T \) matrix. Tri Dao et al.
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_27.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_27.txt new file mode 100644 index 0000000000000000000000000000000000000000..9c3d40bde12e76cb107b7e197b76092c2ac62cfa --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_27.txt @@ -0,0 +1,5 @@ +developed exactly such a new algorithm and called it Flash Attention. +In a nutshell, Flash Attention breaks the \( \mathbf{V} \times \text{Softmax}(\mathbf{QK}^T) \) computation apart and instead computes smaller chunks of the output by iterating over multiple softmax computation steps: +$$ \textbf{O}_i \leftarrow s^a_{ij} * \textbf{O}_i + s^b_{ij} * \mathbf{V}_{j} \times \text{Softmax}(\mathbf{QK}^T_{i,j}) \text{ for multiple } i, j \text{ iterations} $$ +with \( s^a_{ij} \) and \( s^b_{ij} \) being some softmax normalization statistics that need to be recomputed for every \( i \) and \( j \) . +Please note that the whole Flash Attention is a bit more complex and is greatly simplified here as going in too much depth is out of scope for this guide. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_28.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_28.txt new file mode 100644 index 0000000000000000000000000000000000000000..97f967564ae9bd248f85f1d499b7034e1bad9c90 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_28.txt @@ -0,0 +1,6 @@ +The reader is invited to take a look at the well-written Flash Attention paper for more details. +The main takeaway here is: + +By keeping track of softmax normalization statistics and by using some smart mathematics, Flash Attention gives numerically identical outputs compared to the default self-attention layer at a memory cost that only increases linearly with \( N \) .
+ +Looking at the formula, one would intuitively say that Flash Attention must be much slower compared to the default self-attention formula as more computation needs to be done. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_29.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_29.txt new file mode 100644 index 0000000000000000000000000000000000000000..e0cd3a66ed7b6e37d343f9c20f5e0a996b1048de --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_29.txt @@ -0,0 +1,6 @@ +Indeed Flash Attention requires more FLOPs compared to normal attention as the softmax normalization statistics have to constantly be recomputed (see paper for more details if interested) + +However, Flash Attention is much faster in inference compared to default attention which comes from its ability to significantly reduce the demands on the slower, high-bandwidth memory of the GPU (VRAM), focusing instead on the faster on-chip memory (SRAM). + +Essentially, Flash Attention makes sure that all intermediate write and read operations can be done using the fast on-chip SRAM memory instead of having to access the slower VRAM memory to compute the output vector \( \mathbf{O} \) . +In practice, there is currently absolutely no reason to not use Flash Attention if available. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_3.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_3.txt new file mode 100644 index 0000000000000000000000000000000000000000..d79e0663fc68a00f72de081bba96e22e6e0a33c3 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_3.txt @@ -0,0 +1,3 @@ +The most important advancement in model architectures hereby are Alibi, Rotary embeddings, Multi-Query Attention (MQA) and Grouped-Query-Attention (GQA). 
+ +Throughout this guide, we will offer an analysis of auto-regressive generation from a tensor's perspective. We delve into the pros and cons of adopting lower precision, provide a comprehensive exploration of the latest attention algorithms, and discuss improved LLM architectures. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_30.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_30.txt new file mode 100644 index 0000000000000000000000000000000000000000..21d9755d2b1e7f4f602eaa4cea9c88a19c95dd14 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_30.txt @@ -0,0 +1,3 @@ +The algorithm gives mathematically the same outputs, and is both faster and more memory-efficient. +Let's look at a practical example. +Our OctoCoder model now gets a significantly longer input prompt which includes a so-called system prompt. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_31.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_31.txt new file mode 100644 index 0000000000000000000000000000000000000000..a7e6192bed641af1ed4f21e04c5f1a636e0adb7f --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_31.txt @@ -0,0 +1,13 @@ +System prompts are used to steer the LLM into a better assistant that is tailored to the users' task. +In the following, we use a system prompt that will make OctoCoder a better coding assistant. +thon +system_prompt = """Below are a series of dialogues between various people and an AI technical assistant. +The assistant tries to be helpful, polite, honest, sophisticated, emotionally aware, and humble but knowledgeable. +The assistant is happy to help with code questions and will do their best to understand exactly what is needed. 
+It also tries to avoid giving false or misleading information, and it caveats when it isn't entirely sure about the right answer. +That said, the assistant is practical really does its best, and doesn't let caution get too much in the way of being useful. +The Starcoder models are a series of 15.5B parameter models trained on 80+ programming languages from The Stack (v1.2) (excluding opt-out requests). +The model uses Multi Query Attention, was trained using the Fill-in-the-Middle objective, and with 8,192 tokens context window for a trillion tokens of heavily deduplicated data. + +Question: Write a function that takes two lists and returns a list that has alternating elements from each input list. +Answer: Sure. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_32.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_32.txt new file mode 100644 index 0000000000000000000000000000000000000000..bc5c8bdf440366183fa2cee0a9f484f6045c8b6d --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_32.txt @@ -0,0 +1,13 @@ +Here is a function that does that. +def alternating(list1, list2): + results = [] + for i in range(len(list1)): + results.append(list1[i]) + results.append(list2[i]) + return results +Question: Can you write some test cases for this function? +Answer: Sure, here are some tests. +assert alternating([10, 20, 30], [1, 2, 3]) == [10, 1, 20, 2, 30, 3] +assert alternating([True, False], [4, 5]) == [True, 4, False, 5] +assert alternating([], []) == [] +Question: Modify the function so that it returns all input elements when the lists have uneven length. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_33.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_33.txt new file mode 100644 index 0000000000000000000000000000000000000000..8dd1a4eaca756d531d0d19a745222e29f19380f7 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_33.txt @@ -0,0 +1,37 @@ +The elements from the longer list should be at the end. +Answer: Here is the modified function. +def alternating(list1, list2): + results = [] + for i in range(min(len(list1), len(list2))): + results.append(list1[i]) + results.append(list2[i]) + if len(list1) > len(list2): + results.extend(list1[i+1:]) + else: + results.extend(list2[i+1:]) + return results + +""" +`` +For demonstration purposes, we duplicate the system prompt by ten so that the input length is long enough to observe Flash Attention's memory savings. +We append the original text prompt"Question: Please write a function in Python that transforms bytes to Giga bytes.\n\nAnswer: Here"` +python +long_prompt = 10 * system_prompt + prompt +We instantiate our model again in bfloat16 precision. +thon +model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder", torch_dtype=torch.bfloat16, device_map="auto") +tokenizer = AutoTokenizer.from_pretrained("bigcode/octocoder") +pipe = pipeline("text-generation", model=model, tokenizer=tokenizer) + +Let's now run the model just like before without Flash Attention and measure the peak GPU memory requirement and inference time. +thon +import time +start_time = time.time() +result = pipe(long_prompt, max_new_tokens=60)[0]["generated_text"][len(long_prompt):] +print(f"Generated in {time.time() - start_time} seconds.") +result + +Output: + +Generated in 10.96854019165039 seconds. +Sure. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_34.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_34.txt new file mode 100644 index 0000000000000000000000000000000000000000..972c72ce2f52d26464de7d7c57d1a605ddc3a741 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_34.txt @@ -0,0 +1,3 @@ +Here is a function that does that.\n\ndef bytes_to_giga(bytes):\n return bytes / 1024 / 1024 / 1024\n\nAnswer: Sure. Here is a function that does that.\n\ndef +` +We're getting the same output as before, however this time, the model repeats the answer multiple times until it's 60 tokens cut-off. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_35.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_35.txt new file mode 100644 index 0000000000000000000000000000000000000000..52727cde1282af3d645ba0c6f4e75c9e43ae6ded --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_35.txt @@ -0,0 +1,9 @@ +This is not surprising as we've repeated the system prompt ten times for demonstration purposes and thus cued the model to repeat itself. +Note that the system prompt should not be repeated ten times in real-world applications - one time is enough! +Let's measure the peak GPU memory requirement. +python +bytes_to_giga_bytes(torch.cuda.max_memory_allocated()) +Output: + +37.668193340301514 +As we can see the peak GPU memory requirement is now significantly higher than in the beginning, which is largely due to the longer input sequence. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_36.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_36.txt new file mode 100644 index 0000000000000000000000000000000000000000..9fcb5309c031db4b4fd2082846912f5db758b64a --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_36.txt @@ -0,0 +1,19 @@ +Also the generation takes a little over a minute now. +We call flush() to free GPU memory for our next experiment. +python +flush() +For comparison, let's run the same function, but enable Flash Attention instead. +To do so, we convert the model to BetterTransformer and by doing so enabling PyTorch's SDPA self-attention which in turn is able to use Flash Attention. +python +model.to_bettertransformer() +Now we run the exact same code snippet as before and under the hood Transformers will make use of Flash Attention. + +start_time = time.time() +with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False): + result = pipe(long_prompt, max_new_tokens=60)[0]["generated_text"][len(long_prompt):] +print(f"Generated in {time.time() - start_time} seconds.") +result + +Output: +Generated in 3.0211617946624756 seconds. + Sure. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_37.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_37.txt new file mode 100644 index 0000000000000000000000000000000000000000..40bf948f8a64dfd3ffee41e031bc2e56e23b8947 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_37.txt @@ -0,0 +1 @@ +Here is a function that does that.\n\ndef bytes_to_giga(bytes):\n return bytes / 1024 / 1024 / 1024\n\nAnswer: Sure. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_38.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_38.txt new file mode 100644 index 0000000000000000000000000000000000000000..ca5e51c33b87a3d407355c82e7296e68cc1991a1 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_38.txt @@ -0,0 +1,13 @@ +Here is a function that does that.\n\ndef +We're getting the exact same result as before, but can observe a very significant speed-up thanks to Flash Attention. +Let's measure the memory consumption one last time. +python +bytes_to_giga_bytes(torch.cuda.max_memory_allocated()) +Output: +32.617331981658936 +And we're almost back to our original 29GB peak GPU memory from the beginning. +We can observe that we only use roughly 100MB more GPU memory when passing a very long input sequence with Flash Attention compared to passing a short input sequence as done in the beginning. +py +flush() +For more information on how to use Flash Attention, please have a look at this doc page. +3. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_39.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_39.txt new file mode 100644 index 0000000000000000000000000000000000000000..490e50bc23a26437c0809489a604a75a5f1b695c --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_39.txt @@ -0,0 +1,23 @@ +Architectural Innovations +So far we have looked into improving computational and memory efficiency by: + +Casting the weights to a lower precision format +Replacing the self-attention algorithm with a more memory- and compute-efficient version + +Let's now look into how we can change the architecture of an LLM so that it is most effective and efficient for tasks that require long text inputs, e.g.: +- Retrieval-augmented Question Answering, +- Summarization, +- Chat +Note that chat not only requires the LLM to handle long text inputs, but it also necessitates that the LLM is able to efficiently handle the back-and-forth dialogue between user and assistant (such as ChatGPT). +Once trained, the fundamental LLM architecture is difficult to change, so it is important to make considerations about the LLM's tasks beforehand and accordingly optimize the model's architecture. +There are two important components of the model architecture that quickly become memory and/or performance bottlenecks for large input sequences. + +The positional embeddings +The key-value cache + +Let's go over each component in more detail +3.1 Improving positional embeddings of LLMs +Self-attention puts each token in relation to every other token. +As an example, the \( \text{Softmax}(\mathbf{QK}^T) \) matrix of the text input sequence "Hello", "I", "love", "you" could look as follows: + +Each word token is given a probability mass with which it attends to all other word tokens and is therefore put into relation with all other word tokens. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_4.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_4.txt new file mode 100644 index 0000000000000000000000000000000000000000..7f1284f93d73081bce938a82fcbf1bd74f05e8b1 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_4.txt @@ -0,0 +1,4 @@ +While doing so, we run practical examples showcasing each of the feature improvements. +1. Lower Precision +Memory requirements of LLMs can be best understood by seeing the LLM as a set of weight matrices and vectors and the text inputs as a sequence of vectors. In the following, the definition weights will be used to signify all model weight matrices and vectors. +At the time of writing this guide, LLMs consist of at least a couple billion parameters. Each parameter thereby is made of a decimal number, e.g. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_40.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_40.txt new file mode 100644 index 0000000000000000000000000000000000000000..0c6c42b6981712ccfb35860e2422a687ff0e1372 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_40.txt @@ -0,0 +1 @@ +E.g. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_41.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_41.txt new file mode 100644 index 0000000000000000000000000000000000000000..237f5da569d2045b1d9500a08f3266d8c15dec8c --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_41.txt @@ -0,0 +1,4 @@ +the word "love" attends to the word "Hello" with 5%, to "I" with 30%, and to itself with 65%. +A LLM based on self-attention, but without position embeddings would have great difficulties in understanding the positions of the text inputs to each other. 
+This is because the probability score computed by \( \mathbf{QK}^T \) relates each word token to each other word token in \( O(1) \) computations regardless of their relative positional distance to each other. +Therefore, for the LLM without position embeddings each token appears to have the same distance to all other tokens, e.g. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_42.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_42.txt new file mode 100644 index 0000000000000000000000000000000000000000..3afbff7e760c4fec76823ae54ae9a55ea79a35f3 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_42.txt @@ -0,0 +1,11 @@ +differentiating between "Hello I love you" and "You love I hello" would be very challenging. +For the LLM to understand sentence order, an additional cue is needed and is usually applied in the form of positional encodings (or also called positional embeddings). +Positional encodings, encode the position of each token into a numerical presentation that the LLM can leverage to better understand sentence order. +The authors of the Attention Is All You Need paper introduced sinusoidal positional embeddings \( \mathbf{P} = \mathbf{p}_1, \ldots, \mathbf{p}_N \) . +where each vector \( \mathbf{p}_i \) is computed as a sinusoidal function of its position \( i \) . +The positional encodings are then simply added to the input sequence vectors \( \mathbf{\hat{X}} = \mathbf{\hat{x}}_1, \ldots, \mathbf{\hat{x}}_N \) = \( \mathbf{x}_1 + \mathbf{p}_1, \ldots, \mathbf{x}_N + \mathbf{p}_N \) thereby cueing the model to better learn sentence order. +Instead of using fixed position embeddings, others (such as Devlin et al.) used learned positional encodings for which the positional embeddings +\( \mathbf{P} \) are learned during training. 
+Sinusoidal and learned position embeddings used to be the predominant methods to encode sentence order into LLMs, but a couple of problems related to these positional encodings were found: + +Sinusoidal and learned position embeddings are both absolute positional embeddings, i.e. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_43.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_43.txt new file mode 100644 index 0000000000000000000000000000000000000000..d0ea1162330a3f9e3641cd9574e9759ca08d6528 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_43.txt @@ -0,0 +1 @@ +encoding a unique embedding for each position id: \( 0, \ldots, N \) . As shown by Huang et al. and Su et al., absolute positional embeddings lead to poor LLM performance for long text inputs. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_44.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_44.txt new file mode 100644 index 0000000000000000000000000000000000000000..5f8ac5cc170b2c931fdc6b4d139c33a5ca5ece9f --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_44.txt @@ -0,0 +1,9 @@ +For long text inputs, it is advantageous if the model learns the relative positional distance input tokens have to each other instead of their absolute position. +When using learned position embeddings, the LLM has to be trained on a fixed input length \( N \), which makes it difficult to extrapolate to an input length longer than what it was trained on. + +Recently, relative positional embeddings that can tackle the above mentioned problems have become more popular, most notably: + +Rotary Position Embedding (RoPE) +ALiBi + +Both RoPE and ALiBi argue that it's best to cue the LLM about sentence order directly in the self-attention algorithm as it's there that word tokens are put into relation with each other. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_45.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_45.txt new file mode 100644 index 0000000000000000000000000000000000000000..d649b63af466d42cee2add80e7769acc82d73c79 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_45.txt @@ -0,0 +1,3 @@ +More specifically, sentence order should be cued by modifying the \( \mathbf{QK}^T \) computation. +Without going into too many details, RoPE notes that positional information can be encoded into query-key pairs, e.g. \( \mathbf{q}_i \) and \( \mathbf{x}_j \) by rotating each vector by an angle \( \theta * i \) and \( \theta * j \) respectively with \( i, j \) describing each vector's sentence position: +$$ \mathbf{\hat{q}}_i^T \mathbf{\hat{x}}_j = \mathbf{{q}}_i^T \mathbf{R}_{\theta, i -j} \mathbf{{x}}_j. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_46.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_46.txt new file mode 100644 index 0000000000000000000000000000000000000000..81700878221bacbf300acd8960f386680c73e303 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_46.txt @@ -0,0 +1,2 @@ +$$ +\( \mathbf{R}_{\theta, i - j} \) thereby represents a rotational matrix. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_47.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_47.txt new file mode 100644 index 0000000000000000000000000000000000000000..174e3d74dcc945c957d518fd0090005829cede4e --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_47.txt @@ -0,0 +1,11 @@ +\( \theta \) is not learned during training, but instead set to a pre-defined value that depends on the maximum input sequence length during training. 
+ +By doing so, the probability score between \( \mathbf{q}_i \) and \( \mathbf{x}_j \) is only affected if \( i \ne j \) and solely depends on the relative distance \( i - j \) regardless of each vector's specific positions \( i \) and \( j \) . + +RoPE is used in multiple of today's most important LLMs, such as: + +Falcon +Llama +PaLM + +As an alternative, ALiBi proposes a much simpler relative position encoding scheme. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_48.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_48.txt new file mode 100644 index 0000000000000000000000000000000000000000..664193e01f271f1b03cd52289591ab6a5e149082 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_48.txt @@ -0,0 +1,11 @@ +The relative distance that input tokens have to each other is added as a negative integer scaled by a pre-defined value m to each query-key entry of the \( \mathbf{QK}^T \) matrix right before the softmax computation. + +As shown in the ALiBi paper, this simple relative positional encoding allows the model to retain a high performance even at very long text input sequences. +ALiBi is used in multiple of today's most important LLMs, such as: + +MPT +BLOOM + +Both RoPE and ALiBi position encodings can extrapolate to input lengths not seen during training, although it has been shown that extrapolation works much better out-of-the-box for ALiBi than for RoPE. +For ALiBi, one simply increases the values of the lower triangular position matrix to match the length of the input sequence. +For RoPE, keeping the same \( \theta \) that was used during training leads to poor results when passing text inputs much longer than those seen during training, cf. Press et al. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_49.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_49.txt new file mode 100644 index 0000000000000000000000000000000000000000..d98cd49df71cc3612ec6a6dec0333aa447c3d5f1 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_49.txt @@ -0,0 +1,6 @@ +However, the community has found a couple of effective tricks that adapt \( \theta \), thereby allowing RoPE position embeddings to work well for extrapolated text input sequences (see here). + +Both RoPE and ALiBi are relative positional embeddings that are not learned during training, but instead are based on the following intuitions: + - Positional cues about the text inputs should be given directly to the \( QK^T \) matrix of the self-attention layer + - The LLM should be incentivized to learn a constant relative distance positional encodings have to each other + - The further text input tokens are from each other, the lower their query-key probability. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_5.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_5.txt new file mode 100644 index 0000000000000000000000000000000000000000..ae0e299fa13837e058ec4f416bdb949ba355c961 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_5.txt @@ -0,0 +1,5 @@ +4.5689 which is usually stored in either float32, bfloat16, or float16 format. This allows us to easily compute the memory requirement to load the LLM into memory: + +Loading the weights of a model having X billion parameters requires roughly 4 * X GB of VRAM in float32 precision + +Nowadays, models are however rarely trained in full float32 precision, but usually in bfloat16 precision or less frequently in float16 precision. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_50.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_50.txt new file mode 100644 index 0000000000000000000000000000000000000000..b94ebbf4f91e4178b517b8764877724e361037b0 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_50.txt @@ -0,0 +1,3 @@ +Both RoPE and ALiBi lower the query-key probability of tokens far away from each other. RoPE by decreasing their vector product by increasing the angle between the query-key vectors. ALiBi by adding large negative numbers to the vector product + +In conclusion, LLMs that are intended to be deployed in tasks that require handling large text inputs are better trained with relative positional embeddings, such as RoPE and ALiBi. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_51.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_51.txt new file mode 100644 index 0000000000000000000000000000000000000000..62e4ed1e57b2cf585b7d201dfead1f50763ed281 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_51.txt @@ -0,0 +1,5 @@ +Also note that even if an LLM with RoPE and ALiBi has been trained only on a fixed length of say \( N_1 = 2048 \) it can still be used in practice with text inputs much larger than \( N_1 \), like \( N_2 = 8192 > N_1 \) by extrapolating the positional embeddings. +3.2 The key-value cache +Auto-regressive text generation with LLMs works by iteratively putting in an input sequence, sampling the next token, appending the next token to the input sequence, and continuing to do so until the LLM produces a token that signifies that the generation has finished. +Please have a look at Transformer's Generate Text Tutorial to get a more visual explanation of how auto-regressive generation works. 
+Let's run a quick code snippet to show how auto-regressive generation works in practice. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_52.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_52.txt new file mode 100644 index 0000000000000000000000000000000000000000..8813d0527dd1e5c44824cd1d3006d6b6f2226c9a --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_52.txt @@ -0,0 +1,20 @@ +We will simply take the most likely next token via torch.argmax. +thon +input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].to("cuda") +for _ in range(5): + next_logits = model(input_ids)["logits"][:, -1:] + next_token_id = torch.argmax(next_logits, dim=-1) + input_ids = torch.cat([input_ids, next_token_id], dim=-1) + print("shape of input_ids", input_ids.shape) +generated_text = tokenizer.batch_decode(input_ids[:, -5:]) +generated_text + +Output: +shape of input_ids torch.Size([1, 21]) +shape of input_ids torch.Size([1, 22]) +shape of input_ids torch.Size([1, 23]) +shape of input_ids torch.Size([1, 24]) +shape of input_ids torch.Size([1, 25]) +[' Here is a Python function'] +As we can see, the text input grows by the just-sampled token at every step. +With very few exceptions, LLMs are trained using the causal language modeling objective and therefore mask the upper triangle matrix of the attention score - this is why in the two diagrams above the attention scores are left blank (a.k.a have 0 probability). 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_53.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_53.txt new file mode 100644 index 0000000000000000000000000000000000000000..ded977dbed150ff2ebb34b54efe84811185253d9 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_53.txt @@ -0,0 +1,2 @@ +For a quick recap on causal language modeling you can refer to the Illustrated Self Attention blog. +As a consequence, tokens never depend on later tokens, more specifically the \( \mathbf{q}_i \) vector is never put in relation with any key-value vectors \( \mathbf{k}_j, \mathbf{v}_j \) if \( j > i \) . Instead \( \mathbf{q}_i \) only attends to previous key-value vectors \( \mathbf{k}_{m < i}, \mathbf{v}_{m < i} \text{ , for } m \in \{0, \ldots, i - 1\} \). \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_54.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_54.txt new file mode 100644 index 0000000000000000000000000000000000000000..9cc7294de33fb1930c1815679d492f67353230fc --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_54.txt @@ -0,0 +1,30 @@ +In order to reduce unnecessary computation, one can therefore cache each layer's key-value vectors for all previous timesteps. +In the following, we will tell the LLM to make use of the key-value cache by retrieving and forwarding it for each forward pass. +In Transformers, we can retrieve the key-value cache by passing the use_cache flag to the forward call and can then pass it with the current token. 
+thon +past_key_values = None # past_key_values is the key-value cache +generated_tokens = [] +next_token_id = tokenizer(prompt, return_tensors="pt")["input_ids"].to("cuda") +for _ in range(5): + next_logits, past_key_values = model(next_token_id, past_key_values=past_key_values, use_cache=True).to_tuple() + next_logits = next_logits[:, -1:] + next_token_id = torch.argmax(next_logits, dim=-1) + print("shape of input_ids", next_token_id.shape) + print("length of key-value cache", len(past_key_values[0][0])) # past_key_values are of shape [num_layers, 0 for k, 1 for v, batch_size, length, hidden_dim] + generated_tokens.append(next_token_id.item()) +generated_text = tokenizer.batch_decode(generated_tokens) +generated_text + +Output: +shape of input_ids torch.Size([1, 1]) +length of key-value cache 20 +shape of input_ids torch.Size([1, 1]) +length of key-value cache 21 +shape of input_ids torch.Size([1, 1]) +length of key-value cache 22 +shape of input_ids torch.Size([1, 1]) +length of key-value cache 23 +shape of input_ids torch.Size([1, 1]) +length of key-value cache 24 +[' Here', ' is', ' a', ' Python', ' function'] +As one can see, when using the key-value cache the text input tokens are not increased in length, but remain a single input vector. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_55.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_55.txt new file mode 100644 index 0000000000000000000000000000000000000000..25969685262c29156d186006bf8e5e81dccb8e81 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_55.txt @@ -0,0 +1,6 @@ +The length of the key-value cache on the other hand is increased by one at every decoding step. 
+ +Making use of the key-value cache means that the \( \mathbf{QK}^T \) is essentially reduced to \( \mathbf{q}_c\mathbf{K}^T \) with \( \mathbf{q}_c \) being the query projection of the currently passed input token which is always just a single vector. + +Using the key-value cache has two advantages: +- Significant increase in computational efficiency as less computations are performed compared to computing the full \( \mathbf{QK}^T \) matrix. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_56.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_56.txt new file mode 100644 index 0000000000000000000000000000000000000000..9535fffefaed8df87d1c43a37532abe224897139 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_56.txt @@ -0,0 +1,4 @@ +This leads to an increase in inference speed +- The maximum required memory is not increased quadratically with the number of generated tokens, but only increases linearly. + +One should always make use of the key-value cache as it leads to identical results and a significant speed-up for longer input sequences. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_57.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_57.txt new file mode 100644 index 0000000000000000000000000000000000000000..d069968c14c614bdc8db223ef723ffee067771b4 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_57.txt @@ -0,0 +1,6 @@ +Transformers has the key-value cache enabled by default when making use of the text pipeline or the generate method. + +Note that, despite our advice to use key-value caches, your LLM output may be slightly different when you use them. This is a property of the matrix multiplication kernels themselves -- you can read more about it here. 
+ +3.2.1 Multi-round conversation +The key-value cache is especially useful for applications such as chat where multiple passes of auto-regressive decoding are required. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_58.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_58.txt new file mode 100644 index 0000000000000000000000000000000000000000..f18ed5b39e18da5e699f80006ebef4b717223923 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_58.txt @@ -0,0 +1,7 @@ +Let's look at an example. +User: How many people live in France? +Assistant: Roughly 75 million people live in France +User: And how many are in Germany? +Assistant: Germany has ca. 81 million inhabitants +In this chat, the LLM runs auto-regressive decoding twice: + 1. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_59.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_59.txt new file mode 100644 index 0000000000000000000000000000000000000000..a6151c08281dd124d74f9ec75912523042070120 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_59.txt @@ -0,0 +1,2 @@ +The first time, the key-value cache is empty and the input prompt is "User: How many people live in France?" and the model auto-regressively generates the text "Roughly 75 million people live in France" while increasing the key-value cache at every decoding step. + 2. The second time the input prompt is "User: How many people live in France? \n Assistant: Roughly 75 million people live in France \n User: And how many in Germany?". 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_6.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_6.txt new file mode 100644 index 0000000000000000000000000000000000000000..861f88616ea8adfc7138bf0a004a019767a18d9c --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_6.txt @@ -0,0 +1,5 @@ +Therefore the rule of thumb becomes: + +Loading the weights of a model having X billion parameters requires roughly 2 * X GB of VRAM in bfloat16/float16 precision + +For shorter text inputs (fewer than 1024 tokens), the memory requirement for inference is very much dominated by the memory requirement to load the weights. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_60.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_60.txt new file mode 100644 index 0000000000000000000000000000000000000000..2c9de3449f4950239c3c10b6061d5708ef0069fe --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_60.txt @@ -0,0 +1 @@ +Thanks to the cache, all key-value vectors for the first two sentences are already computed. Therefore the input prompt only consists of "User: And how many in Germany?". While processing the shortened input prompt, its computed key-value vectors are concatenated to the key-value cache of the first decoding. The Assistant's second answer "Germany has ca.
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_61.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_61.txt new file mode 100644 index 0000000000000000000000000000000000000000..b8c2af4e69654d164ae54e84c92dedfc4356a705 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_61.txt @@ -0,0 +1,3 @@ +81 million inhabitants" is then auto-regressively generated with the key-value cache consisting of encoded key-value vectors of "User: How many people live in France? \n Assistant: Roughly 75 million people live in France \n User: And how many are in Germany?". +Two things should be noted here: + 1. Keeping all the context is crucial for LLMs deployed in chat so that the LLM understands all the previous context of the conversation. E.g. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_62.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_62.txt new file mode 100644 index 0000000000000000000000000000000000000000..f671f683c0c3cfec368686101e1b1179fef0c446 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_62.txt @@ -0,0 +1,2 @@ +for the example above the LLM needs to understand that the user refers to the population when asking "And how many are in Germany". + 2. The key-value cache is extremely useful for chat as it allows us to continuously grow the encoded chat history instead of having to re-encode the chat history again from scratch (as e.g. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_63.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_63.txt new file mode 100644 index 0000000000000000000000000000000000000000..8d2d4b0c1e8605735cc151d84ea49f9e57898f6b --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_63.txt @@ -0,0 +1,2 @@ +would be the case when using an encoder-decoder architecture). +In transformers, a generate call will return past_key_values when return_dict_in_generate=True is passed, in addition to the default use_cache=True. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_64.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_64.txt new file mode 100644 index 0000000000000000000000000000000000000000..c7130e24e68067ded17cb34a300e7081657d339c --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_64.txt @@ -0,0 +1,26 @@ +Note that it is not yet available through the pipeline interface. 
+thon +Generation as usual +prompt = system_prompt + "Question: Please write a function in Python that transforms bytes to Giga bytes.\n\nAnswer: Here" +model_inputs = tokenizer(prompt, return_tensors='pt') +generation_output = model.generate(**model_inputs, max_new_tokens=60, return_dict_in_generate=True) +decoded_output = tokenizer.batch_decode(generation_output.sequences)[0] +Piping the returned past_key_values to speed up the next conversation round +prompt = decoded_output + "\nQuestion: How can I modify the function above to return Mega bytes instead?\n\nAnswer: Here" +model_inputs = tokenizer(prompt, return_tensors='pt') +generation_output = model.generate( + **model_inputs, + past_key_values=generation_output.past_key_values, + max_new_tokens=60, + return_dict_in_generate=True +) +tokenizer.batch_decode(generation_output.sequences)[0][len(prompt):] + +Output: + + is a modified version of the function that returns Mega bytes instead. +def bytes_to_megabytes(bytes): + return bytes / 1024 / 1024 +Answer: The function takes a number of bytes as input and returns the number of + +Great, no additional time is spent recomputing the same key and values for the attention layer! There is however one catch. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_65.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_65.txt new file mode 100644 index 0000000000000000000000000000000000000000..6b52d451a0db23331837894f8fc4808cf1c8e9ff --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_65.txt @@ -0,0 +1 @@ +While the required peak memory for the \( \mathbf{QK}^T \) matrix is significantly reduced, holding the key-value cache in memory can become very memory expensive for long input sequences or multi-turn chat. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_66.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_66.txt new file mode 100644 index 0000000000000000000000000000000000000000..701c80b57ad2773368ffec1d657e867607285fe8 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_66.txt @@ -0,0 +1,13 @@ +Remember that the key-value cache needs to store the key-value vectors for all previous input vectors \( \mathbf{x}_i \text{, for } i \in \{1, \ldots, c - 1\} \) for all self-attention layers and for all attention heads. +Let's compute the number of float values that need to be stored in the key-value cache for the LLM bigcode/octocoder that we used before. +The number of float values amounts to two times the sequence length times the number of attention heads times the attention head dimension and times the number of layers. +Computing this for our LLM at a hypothetical input sequence length of 16000 gives: +python +config = model.config +2 * 16_000 * config.n_layer * config.n_head * config.n_embd // config.n_head +Output: +7864320000 +Roughly 8 billion float values! Storing 8 billion float values in float16 precision requires around 15 GB of RAM, which is roughly half as much as the model weights themselves! +Researchers have proposed two methods that significantly reduce the memory cost of storing the key-value cache, which are explored in the next subsections. +3.2.2 Multi-Query-Attention (MQA) +Multi-Query-Attention was proposed in Noam Shazeer's Fast Transformer Decoding: One Write-Head is All You Need paper.
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_67.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_67.txt new file mode 100644 index 0000000000000000000000000000000000000000..522a0dc107b51b0023e63541ea7b887c933ba6c3 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_67.txt @@ -0,0 +1,5 @@ +As the title says, Noam found out that instead of using n_head key-value projection weights, one can use a single key-value projection weight pair that is shared across all attention heads without significantly degrading the model's performance. + +By using a single key-value projection weight pair, the key-value vectors \( \mathbf{k}_i, \mathbf{v}_i \) have to be identical across all attention heads, which in turn means that we only need to store 1 key-value projection pair in the cache instead of n_head ones. + +As most LLMs use between 20 and 100 attention heads, MQA significantly reduces the memory consumption of the key-value cache. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_68.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_68.txt new file mode 100644 index 0000000000000000000000000000000000000000..3b8c883e2e42d30f61639e5383e8e65c920e9abe --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_68.txt @@ -0,0 +1,3 @@ +For the LLM used in this notebook we could therefore reduce the required memory consumption from 15 GB to less than 400 MB at an input sequence length of 16000. +In addition to memory savings, MQA also leads to improved computational efficiency as explained in the following. +In auto-regressive decoding, large key-value vectors need to be reloaded, concatenated with the current key-value vector pair, and then fed into the \( \mathbf{q}_c\mathbf{K}^T \) computation at every step.
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_69.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_69.txt new file mode 100644 index 0000000000000000000000000000000000000000..8b32678c472e23d0e61e0cc28aba7377b1400bf1 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_69.txt @@ -0,0 +1,2 @@ +For auto-regressive decoding, the required memory bandwidth for the constant reloading can become a serious time bottleneck. By reducing the size of the key-value vectors less memory needs to be accessed, thus reducing the memory bandwidth bottleneck. For more detail, please have a look at Noam's paper. +The important part to understand here is that reducing the number of key-value attention heads to 1 only makes sense if a key-value cache is used. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_7.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_7.txt new file mode 100644 index 0000000000000000000000000000000000000000..684a662c0f3c100c76d1e0b3f2e14d17413435e3 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_7.txt @@ -0,0 +1,11 @@ +Therefore, for now, let's assume that the memory requirement for inference is equal to the memory requirement to load the model into the GPU VRAM. +To give some examples of how much VRAM it roughly takes to load a model in bfloat16: + +GPT3 requires 2 * 175 GB = 350 GB VRAM +Bloom requires 2 * 176 GB = 352 GB VRAM +Llama-2-70b requires 2 * 70 GB = 140 GB VRAM +Falcon-40b requires 2 * 40 GB = 80 GB VRAM +MPT-30b requires 2 * 30 GB = 60 GB VRAM +bigcode/starcoder requires 2 * 15.5 = 31 GB VRAM + +As of writing this document, the largest GPU chip on the market is the A100 & H100 offering 80GB of VRAM. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_70.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_70.txt new file mode 100644 index 0000000000000000000000000000000000000000..f5a3ae0ce38cb702c2b30fe5cd5de93312da89b2 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_70.txt @@ -0,0 +1,11 @@ +The peak memory consumption of the model for a single forward pass without key-value cache stays unchanged as every attention head still has a unique query vector so that each attention head still has a different \( \mathbf{QK}^T \) matrix. +MQA has seen wide adoption by the community and is now used by many of the most popular LLMs: + +Falcon +PaLM +MPT +BLOOM + +Also, the checkpoint used in this notebook - bigcode/octocoder - makes use of MQA. +3.2.3 Grouped-Query-Attention (GQA) +Grouped-Query-Attention, as proposed by Ainslie et al. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_71.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_71.txt new file mode 100644 index 0000000000000000000000000000000000000000..65ca39d9467c1e9c8a97d9ce59799ed933127f60 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_71.txt @@ -0,0 +1 @@ +from Google, found that using MQA can often lead to quality degradation compared to using vanilla multi-key-value head projections. The paper argues that more model performance can be kept by less drastically reducing the number of key-value head projection weights. Instead of using just a single key-value projection weight, n < n_head key-value projection weights should be used.
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_72.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_72.txt new file mode 100644 index 0000000000000000000000000000000000000000..2b028553abaf0186127b070cd4adf998f661ef06 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_72.txt @@ -0,0 +1,2 @@ +By choosing n to a significantly smaller value than n_head, such as 2,4 or 8 almost all of the memory and speed gains from MQA can be kept while sacrificing less model capacity and thus arguably less performance. +Moreover, the authors of GQA found out that existing model checkpoints can be uptrained to have a GQA architecture with as little as 5% of the original pre-training compute. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_73.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_73.txt new file mode 100644 index 0000000000000000000000000000000000000000..64ef1857f14f1cbf2619c2e565b93add232dc1fd --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_73.txt @@ -0,0 +1,8 @@ +While 5% of the original pre-training compute can still be a massive amount, GQA uptraining allows existing checkpoints to be useful for longer input sequences. +GQA was only recently proposed which is why there is less adoption at the time of writing this notebook. +The most notable application of GQA is Llama-v2. + +As a conclusion, it is strongly recommended to make use of either GQA or MQA if the LLM is deployed with auto-regressive decoding and is required to handle large input sequences as is the case for example for chat. + +Conclusion +The research community is constantly coming up with new, nifty ways to speed up inference time for ever-larger LLMs. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_74.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_74.txt new file mode 100644 index 0000000000000000000000000000000000000000..19c14f3a003b2e19852198cff8ef3e4c0fc60cb1 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_74.txt @@ -0,0 +1 @@ +As an example, one such promising research direction is speculative decoding where "easy tokens" are generated by smaller, faster language models and only "hard tokens" are generated by the LLM itself. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_75.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_75.txt new file mode 100644 index 0000000000000000000000000000000000000000..1531948ced950164a978bdf30b731d59d49f06c3 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_75.txt @@ -0,0 +1,3 @@ +Going into more detail is out of the scope of this notebook, but can be read upon in this nice blog post. +The reason massive LLMs such as GPT3/4, Llama-2-70b, Claude, PaLM can run so quickly in chat-interfaces such as Hugging Face Chat or ChatGPT is to a big part thanks to the above-mentioned improvements in precision, algorithms, and architecture. +Going forward, accelerators such as GPUs, TPUs, etc will only get faster and allow for more memory, but one should nevertheless always make sure to use the best available algorithms and architectures to get the most bang for your buck 🤗. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_8.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_8.txt new file mode 100644 index 0000000000000000000000000000000000000000..813169919cd12df28b6ffe4fc65c3fbeb62316f2 --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_8.txt @@ -0,0 +1,3 @@ +Most of the models listed before require more than 80GB just to be loaded and therefore necessarily require tensor parallelism and/or pipeline parallelism. +🤗 Transformers does not support tensor parallelism out of the box as it requires the model architecture to be written in a specific way. If you're interested in writing models in a tensor-parallelism-friendly way, feel free to have a look at the text-generation-inference library. +Naive pipeline parallelism is supported out of the box. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_9.txt b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_9.txt new file mode 100644 index 0000000000000000000000000000000000000000..0cc418dc3d3b9c6b593771aa61fdf037e1ce17ec --- /dev/null +++ b/chunked/content_aware_chunking/_llm_tutorial_optimization/chunk_9.txt @@ -0,0 +1,2 @@ +For this, simply load the model with device_map="auto" which will automatically place the different layers on the available GPUs as explained here. +Note, however, that while very effective, this naive pipeline parallelism does not tackle the issues of GPU idling.
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_memory_anatomy/chunk_0.txt b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..db82d086f2b191c64ec4e30806f8679ed3e82efa --- /dev/null +++ b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_0.txt @@ -0,0 +1,5 @@ +Model training anatomy +To understand performance optimization techniques that one can apply to improve efficiency of model training +speed and memory utilization, it's helpful to get familiar with how GPU is utilized during training, and how compute +intensity varies depending on an operation performed. +Let's start by exploring a motivating example of GPU utilization and the training run of a model. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_memory_anatomy/chunk_1.txt b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_1.txt new file mode 100644 index 0000000000000000000000000000000000000000..047a96a2eb5d98fc1bbdf1cfaaf9a7a8e25780cc --- /dev/null +++ b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_1.txt @@ -0,0 +1,7 @@ +For the demonstration, +we'll need to install a few libraries: + +pip install transformers datasets accelerate nvidia-ml-py3 +The nvidia-ml-py3 library allows us to monitor the memory usage of the models from within Python. You might be familiar +with the nvidia-smi command in the terminal - this library allows to access the same information in Python directly. +Then, we create some dummy data: random token IDs between 100 and 30000 and binary labels for a classifier. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_memory_anatomy/chunk_10.txt b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_10.txt new file mode 100644 index 0000000000000000000000000000000000000000..7e78d218bbdb0112c68c88d2d79fe50d27621470 --- /dev/null +++ b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_10.txt @@ -0,0 +1,2 @@ +So now we can +start training the model and see how the GPU memory consumption changes. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_memory_anatomy/chunk_11.txt b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_11.txt new file mode 100644 index 0000000000000000000000000000000000000000..c2ed0e2bd73742c0e081a1e1a39ad11600274eaa --- /dev/null +++ b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_11.txt @@ -0,0 +1,28 @@ +First, we set up a few standard training +arguments: +py +default_args = { + "output_dir": "tmp", + "evaluation_strategy": "steps", + "num_train_epochs": 1, + "log_level": "error", + "report_to": "none", +} + +If you plan to run multiple experiments, in order to properly clear the memory between experiments, restart the Python + kernel between experiments. + +Memory utilization at vanilla training +Let's use the [Trainer] and train the model without using any GPU performance optimization techniques and a batch size of 4: + +from transformers import TrainingArguments, Trainer, logging +logging.set_verbosity_error() +training_args = TrainingArguments(per_device_train_batch_size=4, **default_args) +trainer = Trainer(model=model, args=training_args, train_dataset=ds) +result = trainer.train() +print_summary(result) + +Time: 57.82 +Samples/second: 8.86 +GPU memory occupied: 14949 MB. +We see that already a relatively small batch size almost fills up our GPU's entire memory. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_memory_anatomy/chunk_12.txt b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_12.txt new file mode 100644 index 0000000000000000000000000000000000000000..0978da1aef2bec5eb8616689b87b36fd483d166b --- /dev/null +++ b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_12.txt @@ -0,0 +1,3 @@ +However, a larger batch size +can often result in faster model convergence or better end performance. So ideally we want to tune the batch size to our +model's needs and not to the GPU limitations. What's interesting is that we use much more memory than the size of the model. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_memory_anatomy/chunk_13.txt b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_13.txt new file mode 100644 index 0000000000000000000000000000000000000000..3e2dd9c6e175446a75d9eddc227afcf5080fa674 --- /dev/null +++ b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_13.txt @@ -0,0 +1,6 @@ +To understand a bit better why this is the case let's have a look at a model's operations and memory needs. +Anatomy of Model's Operations +Transformers architecture includes 3 main groups of operations grouped below by compute-intensity. + +Tensor Contractions +Linear layers and components of Multi-Head Attention all do batched matrix-matrix multiplications. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_memory_anatomy/chunk_14.txt b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_14.txt new file mode 100644 index 0000000000000000000000000000000000000000..a6935426b47c48c2027d37309f212ac7e4a85454 --- /dev/null +++ b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_14.txt @@ -0,0 +1,7 @@ +These operations are the most compute-intensive part of training a transformer. 
+ +Statistical Normalizations +Softmax and layer normalization are less compute-intensive than tensor contractions, and involve one or more reduction operations, the result of which is then applied via a map. + +Element-wise Operators +These are the remaining operators: biases, dropout, activations, and residual connections. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_memory_anatomy/chunk_15.txt b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_15.txt new file mode 100644 index 0000000000000000000000000000000000000000..d5c2d541e093b72b9cac204d9d74744773dcd103 --- /dev/null +++ b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_15.txt @@ -0,0 +1,7 @@ +These are the least compute-intensive operations. + +This knowledge can be helpful to know when analyzing performance bottlenecks. +This summary is derived from Data Movement Is All You Need: A Case Study on Optimizing Transformers 2020 +Anatomy of Model's Memory +We've seen that training the model uses much more memory than just putting the model on the GPU. This is because there +are many components during training that use GPU memory. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_memory_anatomy/chunk_16.txt b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_16.txt new file mode 100644 index 0000000000000000000000000000000000000000..57e2291a8a1665ebd09b94379602e45b28b43ec5 --- /dev/null +++ b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_16.txt @@ -0,0 +1,11 @@ +The components on GPU memory are the following: + +model weights +optimizer states +gradients +forward activations saved for gradient computation +temporary buffers +functionality-specific memory + +A typical model trained in mixed precision with AdamW requires 18 bytes per model parameter plus activation memory. For +inference there are no optimizer states and gradients, so we can subtract those. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_memory_anatomy/chunk_17.txt b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_17.txt new file mode 100644 index 0000000000000000000000000000000000000000..5c89349dfdd3a375863f6d64f527231d494b1078 --- /dev/null +++ b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_17.txt @@ -0,0 +1,27 @@ +And thus we end up with 6 bytes per +model parameter for mixed precision inference, plus activation memory. +Let's look at the details. +Model Weights: + +4 bytes * number of parameters for fp32 training +6 bytes * number of parameters for mixed precision training (maintains a model in fp32 and one in fp16 in memory) + +Optimizer States: + +8 bytes * number of parameters for normal AdamW (maintains 2 states) +2 bytes * number of parameters for 8-bit AdamW optimizers like bitsandbytes +4 bytes * number of parameters for optimizers like SGD with momentum (maintains only 1 state) + +Gradients + +4 bytes * number of parameters for either fp32 or mixed precision training (gradients are always kept in fp32) + +Forward Activations + +size depends on many factors, the key ones being sequence length, hidden size and batch size. + +There are the input and output that are being passed and returned by the forward and the backward functions and the +forward activations saved for gradient computation. +Temporary Memory +Additionally, there are all kinds of temporary variables which get released once the calculation is done, but in the +moment these could require additional memory and could push to OOM. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_memory_anatomy/chunk_18.txt b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_18.txt new file mode 100644 index 0000000000000000000000000000000000000000..ea3714e2cb016f63fa42c212101db49a10ac4687 --- /dev/null +++ b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_18.txt @@ -0,0 +1,4 @@ +Therefore, when coding it's crucial to think +strategically about such temporary variables and sometimes to explicitly free those as soon as they are no longer needed. +Functionality-specific memory +Then, your software could have special memory needs. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_memory_anatomy/chunk_19.txt b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_19.txt new file mode 100644 index 0000000000000000000000000000000000000000..f2ae95507daa73151388ccfb1eabcc2c92398c97 --- /dev/null +++ b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_19.txt @@ -0,0 +1,5 @@ +For example, when generating text using beam search, the software +needs to maintain multiple copies of inputs and outputs. +forward vs backward Execution Speed +For convolutions and linear layers there are 2x flops in the backward compared to the forward, which generally translates +into ~2x slower (sometimes more, because sizes in the backward tend to be more awkward). \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_memory_anatomy/chunk_2.txt b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_2.txt new file mode 100644 index 0000000000000000000000000000000000000000..59ecb2b296291d0d5865a792c7290c64e5dd2415 --- /dev/null +++ b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_2.txt @@ -0,0 +1,31 @@ +In total, we get 512 sequences each with length 512 and store them in a [~datasets.Dataset] with PyTorch format. 
+ +import numpy as np +from datasets import Dataset +seq_len, dataset_size = 512, 512 +dummy_data = { + "input_ids": np.random.randint(100, 30000, (dataset_size, seq_len)), + "labels": np.random.randint(0, 1, (dataset_size)), + } +ds = Dataset.from_dict(dummy_data) +ds.set_format("pt") + +To print summary statistics for the GPU utilization and the training run with the [Trainer] we define two helper functions: + +from pynvml import * +def print_gpu_utilization(): + nvmlInit() + handle = nvmlDeviceGetHandleByIndex(0) + info = nvmlDeviceGetMemoryInfo(handle) + print(f"GPU memory occupied: {info.used//1024**2} MB.") +def print_summary(result): + print(f"Time: {result.metrics['train_runtime']:.2f}") + print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}") + print_gpu_utilization() + +Let's verify that we start with a free GPU memory: + +print_gpu_utilization() +GPU memory occupied: 0 MB. + +That looks good: the GPU memory is not occupied as we would expect before we load any models. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_memory_anatomy/chunk_20.txt b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_20.txt new file mode 100644 index 0000000000000000000000000000000000000000..4fd9fb1d697edd004e19f0bdd431c9b8ce0e5c66 --- /dev/null +++ b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_20.txt @@ -0,0 +1,5 @@ +Activations are usually +bandwidth-limited, and it’s typical for an activation to have to read more data in the backward than in the forward +(e.g. activation forward reads once, writes once, activation backward reads twice, gradOutput and output of the forward, +and writes once, gradInput). +As you can see, there are potentially a few places where we could save GPU memory or speed up operations. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_memory_anatomy/chunk_21.txt b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_21.txt new file mode 100644 index 0000000000000000000000000000000000000000..0c013adaeb594eb8ef56822eebcd5b9725a9d1dc --- /dev/null +++ b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_21.txt @@ -0,0 +1,3 @@ +Now that you understand what affects GPU utilization and computation speed, refer to +the Methods and tools for efficient training on a single GPU documentation page to learn about +performance optimization techniques. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_memory_anatomy/chunk_3.txt b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_3.txt new file mode 100644 index 0000000000000000000000000000000000000000..e8720d713fc9e6736aae6058ff79cb888639930d --- /dev/null +++ b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_3.txt @@ -0,0 +1,3 @@ +If that's not the case on +your machine make sure to stop all processes that are using GPU memory. However, not all free GPU memory can be used by +the user. When a model is loaded to the GPU the kernels are also loaded, which can take up 1-2GB of memory. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_memory_anatomy/chunk_4.txt b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_4.txt new file mode 100644 index 0000000000000000000000000000000000000000..b5ac0a37bb01462b832bcdf5b90d7a0d88a34818 --- /dev/null +++ b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_4.txt @@ -0,0 +1,11 @@ +To see how +much it is we load a tiny tensor into the GPU which triggers the kernels to be loaded as well. + +import torch +torch.ones((1, 1)).to("cuda") +print_gpu_utilization() +GPU memory occupied: 1343 MB. + +We see that the kernels alone take up 1.3GB of GPU memory. Now let's see how much space the model uses.
+Load Model +First, we load the google-bert/bert-large-uncased model. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_memory_anatomy/chunk_5.txt b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_5.txt new file mode 100644 index 0000000000000000000000000000000000000000..63de236255b0603f3e808842d23ba76a8e868f52 --- /dev/null +++ b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_5.txt @@ -0,0 +1,10 @@ +We load the model weights directly to the GPU so that we can check +how much space just the weights use. + +from transformers import AutoModelForSequenceClassification +model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-large-uncased").to("cuda") +print_gpu_utilization() +GPU memory occupied: 2631 MB. + +We can see that the model weights alone take up 1.3 GB of GPU memory. The exact number depends on the specific +GPU you are using. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_memory_anatomy/chunk_6.txt b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_6.txt new file mode 100644 index 0000000000000000000000000000000000000000..d2671d4e6c1dbd9b0a71c574406bc0d953a46bf3 --- /dev/null +++ b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_6.txt @@ -0,0 +1,2 @@ +Note that on newer GPUs a model can sometimes take up more space since the weights are loaded in an +optimized fashion that speeds up the usage of the model. 
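The two readings reported above (1343 MB with only the kernels loaded, 2631 MB after loading the model) can be cross-checked against the 4 bytes per parameter that fp32 weights need. This is only a sketch: the parameter count of roughly 336 million for bert-large-uncased is an assumption that the text does not state, and the helper names are hypothetical.

```python
MB = 1024**2  # the readings above are reported in units of 1024**2 bytes

KERNELS_ONLY_MB = 1343  # reading after loading a tiny tensor
WITH_MODEL_MB = 2631    # reading after loading the model weights


def weights_delta_mb(before_mb, after_mb):
    """Memory attributed to the weights: difference between the two readings."""
    return after_mb - before_mb


def fp32_weights_mb(num_params):
    """Expected size of fp32 weights at 4 bytes per parameter, in MB."""
    return num_params * 4 / MB


if __name__ == "__main__":
    assumed_params = 336_000_000  # ASSUMPTION: approximate size of bert-large-uncased
    measured = weights_delta_mb(KERNELS_ONLY_MB, WITH_MODEL_MB)
    expected = fp32_weights_mb(assumed_params)
    print(f"measured from readings: {measured} MB")
    print(f"expected for fp32 weights: {expected:.0f} MB")
```

Under that assumption the two values land close to each other, which is consistent with the weights accounting for about 1.3 GB on top of the kernels.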
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_memory_anatomy/chunk_7.txt b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_7.txt new file mode 100644 index 0000000000000000000000000000000000000000..8aaa5fa33f75e0f19f9b86fe123a638c3e5c1400 --- /dev/null +++ b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_7.txt @@ -0,0 +1,10 @@ +Now we can also quickly check if we get the same result +as with nvidia-smi CLI: + +nvidia-smi +```bash +Tue Jan 11 08:58:05 2022 ++-----------------------------------------------------------------------------+ +| NVIDIA-SMI 460.91.03 Driver Version: 460.91.03 CUDA Version: 11.2 | +|-------------------------------+----------------------+----------------------+ +| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_memory_anatomy/chunk_8.txt b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_8.txt new file mode 100644 index 0000000000000000000000000000000000000000..20e2dc0a596c4d195d0e6d0c147dd9be364a3005 --- /dev/null +++ b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_8.txt @@ -0,0 +1,3 @@ +ECC | +| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | +| | | MIG M. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_memory_anatomy/chunk_9.txt b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_9.txt new file mode 100644 index 0000000000000000000000000000000000000000..8b96ab1cffac806efb2b6041ce7a21fa9f5d766d --- /dev/null +++ b/chunked/content_aware_chunking/_model_memory_anatomy/chunk_9.txt @@ -0,0 +1,15 @@ +| +|===============================+======================+======================| +| 0 Tesla V100-SXM2 On | 00000000:00:04.0 Off | 0 | +| N/A 37C P0 39W / 300W | 2631MiB / 16160MiB | 0% Default | +| | | N/A | ++-------------------------------+----------------------+----------------------+ ++-----------------------------------------------------------------------------+ +| Processes: | +| GPU GI CI PID Type Process name GPU Memory | +| ID ID Usage | +|=============================================================================| +| 0 N/A N/A 3721 C nvs/codeparrot/bin/python 2629MiB | ++-----------------------------------------------------------------------------+ + +We get the same number as before and you can also see that we are using a V100 GPU with 16GB of memory. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_sharing/chunk_0.txt b/chunked/content_aware_chunking/_model_sharing/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..9aa6be321b6480d5b009a3cc01fb9759a44fda9f --- /dev/null +++ b/chunked/content_aware_chunking/_model_sharing/chunk_0.txt @@ -0,0 +1,2 @@ +Share a model +The last two tutorials showed how you can fine-tune a model with PyTorch, Keras, and 🤗 Accelerate for distributed setups. The next step is to share your model with the community! At Hugging Face, we believe in openly sharing knowledge and resources to democratize artificial intelligence for everyone. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_sharing/chunk_1.txt b/chunked/content_aware_chunking/_model_sharing/chunk_1.txt new file mode 100644 index 0000000000000000000000000000000000000000..be357938612650d873d89e9c5a639a82cf6b41b6 --- /dev/null +++ b/chunked/content_aware_chunking/_model_sharing/chunk_1.txt @@ -0,0 +1,7 @@ +We encourage you to consider sharing your model with the community to help others save time and resources. +In this tutorial, you will learn two methods for sharing a trained or fine-tuned model on the Model Hub: + +Programmatically push your files to the Hub. +Drag-and-drop your files to the Hub with the web interface. + +To share a model with the community, you need an account on huggingface.co. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_sharing/chunk_10.txt b/chunked/content_aware_chunking/_model_sharing/chunk_10.txt new file mode 100644 index 0000000000000000000000000000000000000000..d34ad6001a97e8f2441fa79169ab9e8d57d63f86 --- /dev/null +++ b/chunked/content_aware_chunking/_model_sharing/chunk_10.txt @@ -0,0 +1,7 @@ +🤗 Transformers will even automatically add training hyperparameters, training results and framework versions to your model card! + +trainer.push_to_hub() + + + +Share a model to the Hub with [PushToHubCallback]. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_sharing/chunk_11.txt b/chunked/content_aware_chunking/_model_sharing/chunk_11.txt new file mode 100644 index 0000000000000000000000000000000000000000..97390ca35df63f3f903da8fc388ed5b6267ff302 --- /dev/null +++ b/chunked/content_aware_chunking/_model_sharing/chunk_11.txt @@ -0,0 +1,22 @@ +In the [PushToHubCallback] function, add: + +An output directory for your model. +A tokenizer. +The hub_model_id, which is your Hub username and model name.
+ +from transformers import PushToHubCallback +push_to_hub_callback = PushToHubCallback( + output_dir="./your_model_save_path", tokenizer=tokenizer, hub_model_id="your-username/my-awesome-model" + ) + +Add the callback to fit, and 🤗 Transformers will push the trained model to the Hub: + +model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=3, callbacks=push_to_hub_callback) + +Use the push_to_hub function +You can also call push_to_hub directly on your model to upload it to the Hub. +Specify your model name in push_to_hub: + +pt_model.push_to_hub("my-awesome-model") + +This creates a repository under your username with the model name my-awesome-model. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_sharing/chunk_12.txt b/chunked/content_aware_chunking/_model_sharing/chunk_12.txt new file mode 100644 index 0000000000000000000000000000000000000000..e6d27f6dbe8f4197bdffca5c12bc00f4d5cbe45d --- /dev/null +++ b/chunked/content_aware_chunking/_model_sharing/chunk_12.txt @@ -0,0 +1,10 @@ +Users can now load your model with the from_pretrained function: + +from transformers import AutoModel +model = AutoModel.from_pretrained("your_username/my-awesome-model") + +If you belong to an organization and want to push your model under the organization name instead, just add it to the repo_id: + +pt_model.push_to_hub("my-awesome-org/my-awesome-model") + +The push_to_hub function can also be used to add other files to a model repository. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_sharing/chunk_13.txt b/chunked/content_aware_chunking/_model_sharing/chunk_13.txt new file mode 100644 index 0000000000000000000000000000000000000000..465c30696539c225e95dbf3052e0341426064cef --- /dev/null +++ b/chunked/content_aware_chunking/_model_sharing/chunk_13.txt @@ -0,0 +1,9 @@ +For example, add a tokenizer to a model repository: + +tokenizer.push_to_hub("my-awesome-model") + +Or perhaps you'd like to add the TensorFlow version of your fine-tuned PyTorch model: + +tf_model.push_to_hub("my-awesome-model") + +Now when you navigate to your Hugging Face profile, you should see your newly created model repository. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_sharing/chunk_14.txt b/chunked/content_aware_chunking/_model_sharing/chunk_14.txt new file mode 100644 index 0000000000000000000000000000000000000000..326bb6bef315ccd4636649c37ce3ed71bc29e5ea --- /dev/null +++ b/chunked/content_aware_chunking/_model_sharing/chunk_14.txt @@ -0,0 +1,8 @@ +Clicking on the Files tab will display all the files you've uploaded to the repository. +For more details on how to create and upload files to a repository, refer to the Hub documentation here. +Upload with the web interface +Users who prefer a no-code approach are able to upload a model through the Hub's web interface. Visit huggingface.co/new to create a new repository: + +From here, add some information about your model: + +Select the owner of the repository. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_sharing/chunk_15.txt b/chunked/content_aware_chunking/_model_sharing/chunk_15.txt new file mode 100644 index 0000000000000000000000000000000000000000..3382bd3814d0fe83ff51125e5901c68740e36905 --- /dev/null +++ b/chunked/content_aware_chunking/_model_sharing/chunk_15.txt @@ -0,0 +1,6 @@ +This can be yourself or any of the organizations you belong to. 
+Pick a name for your model, which will also be the repository name. +Choose whether your model is public or private. +Specify the license usage for your model. + +Now click on the Files tab and click on the Add file button to upload a new file to your repository. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_sharing/chunk_16.txt b/chunked/content_aware_chunking/_model_sharing/chunk_16.txt new file mode 100644 index 0000000000000000000000000000000000000000..46070036e5b8e38b665c8990e4a26923c5b55c37 --- /dev/null +++ b/chunked/content_aware_chunking/_model_sharing/chunk_16.txt @@ -0,0 +1,4 @@ +Then drag-and-drop a file to upload and add a commit message. + +Add a model card +To make sure users understand your model's capabilities, limitations, potential biases and ethical considerations, please add a model card to your repository. The model card is defined in the README.md file. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_sharing/chunk_17.txt b/chunked/content_aware_chunking/_model_sharing/chunk_17.txt new file mode 100644 index 0000000000000000000000000000000000000000..70e77fb7e8b7b94bc8ab33bdce523ea94b0dfb1d --- /dev/null +++ b/chunked/content_aware_chunking/_model_sharing/chunk_17.txt @@ -0,0 +1,6 @@ +You can add a model card by: + +Manually creating and uploading a README.md file. +Clicking on the Edit model card button in your model repository. + +Take a look at the DistilBert model card for a good example of the type of information a model card should include. For more details about other options you can control in the README.md file such as a model's carbon footprint or widget examples, refer to the documentation here.
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_sharing/chunk_2.txt b/chunked/content_aware_chunking/_model_sharing/chunk_2.txt new file mode 100644 index 0000000000000000000000000000000000000000..9f234efb65afebee7117fdc480995251cbc69fa0 --- /dev/null +++ b/chunked/content_aware_chunking/_model_sharing/chunk_2.txt @@ -0,0 +1,5 @@ +You can also join an existing organization or create a new one. + +Repository features +Each repository on the Model Hub behaves like a typical GitHub repository. Our repositories offer versioning, commit history, and the ability to visualize differences. +The Model Hub's built-in versioning is based on git and git-lfs. In other words, you can treat one model as one repository, enabling greater access control and scalability. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_sharing/chunk_3.txt b/chunked/content_aware_chunking/_model_sharing/chunk_3.txt new file mode 100644 index 0000000000000000000000000000000000000000..c92a2fcdef647dcb3d7b66d6ed8351fc2f80aa98 --- /dev/null +++ b/chunked/content_aware_chunking/_model_sharing/chunk_3.txt @@ -0,0 +1,11 @@ +Version control allows revisions, a method for pinning a specific version of a model with a commit hash, tag or branch. +As a result, you can load a specific model version with the revision parameter: + +model = AutoModel.from_pretrained( + "julien-c/EsperBERTo-small", revision="v2.0.1" # tag name, or branch name, or commit hash + ) + +Files are also easily edited in a repository, and you can view the commit history as well as the difference: + +Setup +Before sharing a model to the Hub, you will need your Hugging Face credentials. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_sharing/chunk_4.txt b/chunked/content_aware_chunking/_model_sharing/chunk_4.txt new file mode 100644 index 0000000000000000000000000000000000000000..e245ff9253934a1b371489cbd0762da4f8f61ef0 --- /dev/null +++ b/chunked/content_aware_chunking/_model_sharing/chunk_4.txt @@ -0,0 +1,4 @@ +If you have access to a terminal, run the following command in the virtual environment where 🤗 Transformers is installed. This will store your access token in your Hugging Face cache folder (~/.cache/ by default): + +huggingface-cli login +If you are using a notebook like Jupyter or Colaboratory, make sure you have the huggingface_hub library installed. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_sharing/chunk_5.txt b/chunked/content_aware_chunking/_model_sharing/chunk_5.txt new file mode 100644 index 0000000000000000000000000000000000000000..c6922891f564fc9f20232ab157c59794b99e3624 --- /dev/null +++ b/chunked/content_aware_chunking/_model_sharing/chunk_5.txt @@ -0,0 +1,10 @@ +This library allows you to programmatically interact with the Hub. + +pip install huggingface_hub +Then use notebook_login to sign-in to the Hub, and follow the link here to generate a token to login with: + +from huggingface_hub import notebook_login +notebook_login() + +Convert a model for all frameworks +To ensure your model can be used by someone working with a different framework, we recommend you convert and upload your model with both PyTorch and TensorFlow checkpoints. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_sharing/chunk_6.txt b/chunked/content_aware_chunking/_model_sharing/chunk_6.txt new file mode 100644 index 0000000000000000000000000000000000000000..ae3f2ecd5676daa6d5e35d9e90c30c38994d2a31 --- /dev/null +++ b/chunked/content_aware_chunking/_model_sharing/chunk_6.txt @@ -0,0 +1,2 @@ +While users are still able to load your model from a different framework if you skip this step, it will be slower because 🤗 Transformers will need to convert the checkpoint on-the-fly. +Converting a checkpoint for another framework is easy. Make sure you have PyTorch and TensorFlow installed (see here for installation instructions), and then find the specific model for your task in the other framework. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_sharing/chunk_7.txt b/chunked/content_aware_chunking/_model_sharing/chunk_7.txt new file mode 100644 index 0000000000000000000000000000000000000000..8de194264fc7cda0942060e975a0a34b90c57eee --- /dev/null +++ b/chunked/content_aware_chunking/_model_sharing/chunk_7.txt @@ -0,0 +1,24 @@ +Specify from_tf=True to convert a checkpoint from TensorFlow to PyTorch: + +pt_model = DistilBertForSequenceClassification.from_pretrained("path/to/awesome-name-you-picked", from_tf=True) +pt_model.save_pretrained("path/to/awesome-name-you-picked") + + + +Specify from_pt=True to convert a checkpoint from PyTorch to TensorFlow: + +tf_model = TFDistilBertForSequenceClassification.from_pretrained("path/to/awesome-name-you-picked", from_pt=True) + +Then you can save your new TensorFlow model with its new checkpoint: + +tf_model.save_pretrained("path/to/awesome-name-you-picked") + +If a model is available in Flax, you can also convert a checkpoint from PyTorch to Flax: + +flax_model = FlaxDistilBertForSequenceClassification.from_pretrained( + "path/to/awesome-name-you-picked", from_pt=True + ) + +Push a model during training + +Sharing a model to the Hub
is as simple as adding an extra parameter or callback. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_sharing/chunk_8.txt b/chunked/content_aware_chunking/_model_sharing/chunk_8.txt new file mode 100644 index 0000000000000000000000000000000000000000..8b52fdecd0036e16b6d59b33691a3545a4e6565a --- /dev/null +++ b/chunked/content_aware_chunking/_model_sharing/chunk_8.txt @@ -0,0 +1 @@ +Remember from the fine-tuning tutorial, the [TrainingArguments] class is where you specify hyperparameters and additional training options. One of these training options includes the ability to push a model directly to the Hub. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_sharing/chunk_9.txt b/chunked/content_aware_chunking/_model_sharing/chunk_9.txt new file mode 100644 index 0000000000000000000000000000000000000000..67cd43d6f6be6eb3ac1c8cfaa3acd917d7f611d5 --- /dev/null +++ b/chunked/content_aware_chunking/_model_sharing/chunk_9.txt @@ -0,0 +1,15 @@ +Set push_to_hub=True in your [TrainingArguments]: + +training_args = TrainingArguments(output_dir="my-awesome-model", push_to_hub=True) + +Pass your training arguments as usual to [Trainer]: + +trainer = Trainer( + model=model, + args=training_args, + train_dataset=small_train_dataset, + eval_dataset=small_eval_dataset, + compute_metrics=compute_metrics, + ) + +After you fine-tune your model, call [~transformers.Trainer.push_to_hub] on [Trainer] to push the trained model to the Hub. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_summary/chunk_0.txt b/chunked/content_aware_chunking/_model_summary/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..3e56efb2abe2faeadb9b2dce0f18c53d884ee321 --- /dev/null +++ b/chunked/content_aware_chunking/_model_summary/chunk_0.txt @@ -0,0 +1,2 @@ +The Transformer model family +Since its introduction in 2017, the original Transformer model has inspired many new and exciting models that extend beyond natural language processing (NLP) tasks. There are models for predicting the folded structure of proteins, training a cheetah to run, and time series forecasting. With so many Transformer variants available, it can be easy to miss the bigger picture. What all these models have in common is they're based on the original Transformer architecture. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_summary/chunk_1.txt b/chunked/content_aware_chunking/_model_summary/chunk_1.txt new file mode 100644 index 0000000000000000000000000000000000000000..8e5fb8901cc3b03f2b9cde890ddba25a7f8532e2 --- /dev/null +++ b/chunked/content_aware_chunking/_model_summary/chunk_1.txt @@ -0,0 +1 @@ +Some models only use the encoder or decoder, while others use both. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_summary/chunk_10.txt b/chunked/content_aware_chunking/_model_summary/chunk_10.txt new file mode 100644 index 0000000000000000000000000000000000000000..2df9c5422941b854ebc2b491c3109b37ecf14c68 --- /dev/null +++ b/chunked/content_aware_chunking/_model_summary/chunk_10.txt @@ -0,0 +1,3 @@ +In addition to image generation, ImageGPT could also be finetuned for image classification. +Encoder-decoder[[cv-encoder-decoder]] +Vision models commonly use an encoder (also known as a backbone) to extract important image features before passing them to a Transformer decoder. 
DETR has a pretrained backbone, but it also uses the complete Transformer encoder-decoder architecture for object detection. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_summary/chunk_11.txt b/chunked/content_aware_chunking/_model_summary/chunk_11.txt new file mode 100644 index 0000000000000000000000000000000000000000..cfa8ae296648bad1a1d3e8c4a37833e0d2e9fffd --- /dev/null +++ b/chunked/content_aware_chunking/_model_summary/chunk_11.txt @@ -0,0 +1,5 @@ +The encoder learns image representations and combines them with object queries (each object query is a learned embedding that focuses on a region or object in an image) in the decoder. DETR predicts the bounding box coordinates and class label for each object query. +Natural language processing + +Encoder[[nlp-encoder]] +BERT is an encoder-only Transformer that randomly masks certain tokens in the input to avoid seeing other tokens, which would allow it to "cheat". \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_summary/chunk_12.txt b/chunked/content_aware_chunking/_model_summary/chunk_12.txt new file mode 100644 index 0000000000000000000000000000000000000000..03489da522e01485af7a3e585a637b28f5150e40 --- /dev/null +++ b/chunked/content_aware_chunking/_model_summary/chunk_12.txt @@ -0,0 +1 @@ +The pretraining objective is to predict the masked token based on the context. This allows BERT to fully use the left and right contexts to help it learn a deeper and richer representation of the inputs. However, there was still room for improvement in BERT's pretraining strategy. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_summary/chunk_13.txt b/chunked/content_aware_chunking/_model_summary/chunk_13.txt new file mode 100644 index 0000000000000000000000000000000000000000..b77010612c12c793b9e9b82809e2732b6d3c32e8 --- /dev/null +++ b/chunked/content_aware_chunking/_model_summary/chunk_13.txt @@ -0,0 +1,2 @@ +RoBERTa improved upon this by introducing a new pretraining recipe that includes training for longer and on larger batches, randomly masking tokens at each epoch instead of just once during preprocessing, and removing the next-sentence prediction objective. +The dominant strategy to improve performance is to increase the model size. But training large models is computationally expensive. One way to reduce computational costs is using a smaller model like DistilBERT. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_summary/chunk_14.txt b/chunked/content_aware_chunking/_model_summary/chunk_14.txt new file mode 100644 index 0000000000000000000000000000000000000000..f4204ec03e96930a16c8c587ce093280957eb288 --- /dev/null +++ b/chunked/content_aware_chunking/_model_summary/chunk_14.txt @@ -0,0 +1,2 @@ +DistilBERT uses knowledge distillation - a compression technique - to create a smaller version of BERT while keeping nearly all of its language understanding capabilities. +However, most Transformer models continued to trend towards more parameters, leading to new models focused on improving training efficiency. ALBERT reduces memory consumption by lowering the number of parameters in two ways: separating the larger vocabulary embedding into two smaller matrices and allowing layers to share parameters. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_summary/chunk_15.txt b/chunked/content_aware_chunking/_model_summary/chunk_15.txt new file mode 100644 index 0000000000000000000000000000000000000000..bfbdb7ac1fcb74e562ec1c66033b8721895e8c73 --- /dev/null +++ b/chunked/content_aware_chunking/_model_summary/chunk_15.txt @@ -0,0 +1 @@ +DeBERTa added a disentangled attention mechanism where the word and its position are separately encoded in two vectors. The attention is computed from these separate vectors instead of a single vector containing the word and position embeddings. Longformer also focused on making attention more efficient, especially for processing documents with longer sequence lengths. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_summary/chunk_16.txt b/chunked/content_aware_chunking/_model_summary/chunk_16.txt new file mode 100644 index 0000000000000000000000000000000000000000..7525d2257cecad0905c14a8c757fb5bab6057d44 --- /dev/null +++ b/chunked/content_aware_chunking/_model_summary/chunk_16.txt @@ -0,0 +1,3 @@ +It uses a combination of local windowed attention (attention only calculated from fixed window size around each token) and global attention (only for specific task tokens like [CLS] for classification) to create a sparse attention matrix instead of a full attention matrix. +Decoder[[nlp-decoder]] +GPT-2 is a decoder-only Transformer that predicts the next word in the sequence. It masks tokens to the right so the model can't "cheat" by looking ahead. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_summary/chunk_17.txt b/chunked/content_aware_chunking/_model_summary/chunk_17.txt new file mode 100644 index 0000000000000000000000000000000000000000..a1be7a88a3615b02807f739b7783ce7b2cdd41d5 --- /dev/null +++ b/chunked/content_aware_chunking/_model_summary/chunk_17.txt @@ -0,0 +1 @@ +By pretraining on a massive body of text, GPT-2 became really good at generating text, even if the text is only sometimes accurate or true. But GPT-2 lacked the bidirectional context from BERT's pretraining, which made it unsuitable for certain tasks. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_summary/chunk_18.txt b/chunked/content_aware_chunking/_model_summary/chunk_18.txt new file mode 100644 index 0000000000000000000000000000000000000000..308c4b41f3e45b8b90e6ed2675dcd68a71c29459 --- /dev/null +++ b/chunked/content_aware_chunking/_model_summary/chunk_18.txt @@ -0,0 +1,2 @@ +XLNET combines the best of both BERT and GPT-2's pretraining objectives by using a permutation language modeling objective (PLM) that allows it to learn bidirectionally. +After GPT-2, language models grew even bigger and are now known as large language models (LLMs). LLMs demonstrate few- or even zero-shot learning if pretrained on a large enough dataset. GPT-J is an LLM with 6B parameters and trained on 400B tokens. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_summary/chunk_19.txt b/chunked/content_aware_chunking/_model_summary/chunk_19.txt new file mode 100644 index 0000000000000000000000000000000000000000..93c5035d5ce79a3b7a5453d32012306b67d1f27b --- /dev/null +++ b/chunked/content_aware_chunking/_model_summary/chunk_19.txt @@ -0,0 +1 @@ +GPT-J was followed by OPT, a family of decoder-only models, the largest of which is 175B and trained on 180B tokens. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_summary/chunk_2.txt b/chunked/content_aware_chunking/_model_summary/chunk_2.txt new file mode 100644 index 0000000000000000000000000000000000000000..bb7dff91027339dcd73a66622b1c850f1e9f0eee --- /dev/null +++ b/chunked/content_aware_chunking/_model_summary/chunk_2.txt @@ -0,0 +1,7 @@ +This provides a useful taxonomy to categorize and examine the high-level differences within models in the Transformer family, and it'll help you understand Transformers you haven't encountered before. +If you aren't familiar with the original Transformer model or need a refresher, check out the How do Transformers work chapter from the Hugging Face course. + +Computer vision + +Convolutional network +For a long time, convolutional networks (CNNs) were the dominant paradigm for computer vision tasks until the Vision Transformer demonstrated its scalability and efficiency. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_summary/chunk_20.txt b/chunked/content_aware_chunking/_model_summary/chunk_20.txt new file mode 100644 index 0000000000000000000000000000000000000000..fad2aaa1a46259a9b5d1b015430df71a3a082cfd --- /dev/null +++ b/chunked/content_aware_chunking/_model_summary/chunk_20.txt @@ -0,0 +1,3 @@ +BLOOM was released around the same time, and the largest model in the family has 176B parameters and is trained on 366B tokens in 46 languages and 13 programming languages. +Encoder-decoder[[nlp-encoder-decoder]] +BART keeps the original Transformer architecture, but it modifies the pretraining objective with text infilling corruption, where some text spans are replaced with a single mask token. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_summary/chunk_21.txt b/chunked/content_aware_chunking/_model_summary/chunk_21.txt new file mode 100644 index 0000000000000000000000000000000000000000..094fa5b976ba02f1d3d37b8f43ad65a3cbfddde4 --- /dev/null +++ b/chunked/content_aware_chunking/_model_summary/chunk_21.txt @@ -0,0 +1 @@ +The decoder predicts the uncorrupted tokens (future tokens are masked) and uses the encoder's hidden states to help it. Pegasus is similar to BART, but Pegasus masks entire sentences instead of text spans. In addition to masked language modeling, Pegasus is pretrained by gap sentence generation (GSG). The GSG objective masks whole sentences important to a document, replacing them with a mask token. The decoder must generate the output from the remaining sentences. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_summary/chunk_22.txt b/chunked/content_aware_chunking/_model_summary/chunk_22.txt new file mode 100644 index 0000000000000000000000000000000000000000..ddcbb588ee98dd5b997e328db72eaaf9fe4cec08 --- /dev/null +++ b/chunked/content_aware_chunking/_model_summary/chunk_22.txt @@ -0,0 +1,5 @@ +T5 is a more unique model that casts all NLP tasks into a text-to-text problem using specific prefixes. For example, the prefix Summarize: indicates a summarization task. T5 is pretrained by supervised (GLUE and SuperGLUE) training and self-supervised training (randomly sample and drop out 15% of tokens). +Audio + +Encoder[[audio-encoder]] +Wav2Vec2 uses a Transformer encoder to learn speech representations directly from raw audio waveforms. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_summary/chunk_23.txt b/chunked/content_aware_chunking/_model_summary/chunk_23.txt new file mode 100644 index 0000000000000000000000000000000000000000..448d247164513c70f8f7a120cd662b3b7fb849c9 --- /dev/null +++ b/chunked/content_aware_chunking/_model_summary/chunk_23.txt @@ -0,0 +1 @@ +It is pretrained with a contrastive task to determine the true speech representation from a set of false ones. HuBERT is similar to Wav2Vec2 but has a different training process. Target labels are created by a clustering step in which segments of similar audio are assigned to a cluster which becomes a hidden unit. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_summary/chunk_24.txt b/chunked/content_aware_chunking/_model_summary/chunk_24.txt new file mode 100644 index 0000000000000000000000000000000000000000..1a4b552cbe870be15454c0c8cc4a0ca5f93c48ae --- /dev/null +++ b/chunked/content_aware_chunking/_model_summary/chunk_24.txt @@ -0,0 +1,3 @@ +The hidden unit is mapped to an embedding to make a prediction. +Encoder-decoder[[audio-encoder-decoder]] +Speech2Text is a speech model designed for automatic speech recognition (ASR) and speech translation. The model accepts log mel-filter bank features extracted from the audio waveform and pretrained autoregressively to generate a transcript or translation. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_summary/chunk_25.txt b/chunked/content_aware_chunking/_model_summary/chunk_25.txt new file mode 100644 index 0000000000000000000000000000000000000000..97fe108a95fc6d4ddef0142e62a23d131fec43b1 --- /dev/null +++ b/chunked/content_aware_chunking/_model_summary/chunk_25.txt @@ -0,0 +1 @@ +Whisper is also an ASR model, but unlike many other speech models, it is pretrained on a massive amount of ✨ labeled ✨ audio transcription data for zero-shot performance. 
A large chunk of the dataset also contains non-English languages, meaning Whisper can also be used for low-resource languages. Structurally, Whisper is similar to Speech2Text. The audio signal is converted to a log-mel spectrogram encoded by the encoder. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_summary/chunk_26.txt b/chunked/content_aware_chunking/_model_summary/chunk_26.txt new file mode 100644 index 0000000000000000000000000000000000000000..399d0c24e0f04f4d6838592ccd069f984846e747 --- /dev/null +++ b/chunked/content_aware_chunking/_model_summary/chunk_26.txt @@ -0,0 +1,5 @@ +The decoder generates the transcript autoregressively from the encoder's hidden states and the previous tokens. +Multimodal + +Encoder[[mm-encoder]] +VisualBERT is a multimodal model for vision-language tasks released shortly after BERT. It combines BERT and a pretrained object detection system to extract image features into visual embeddings, passed alongside text embeddings to BERT. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_summary/chunk_27.txt b/chunked/content_aware_chunking/_model_summary/chunk_27.txt new file mode 100644 index 0000000000000000000000000000000000000000..f15497ba861b6cb764a96b6cfe03bc693a9c2d1c --- /dev/null +++ b/chunked/content_aware_chunking/_model_summary/chunk_27.txt @@ -0,0 +1 @@ +VisualBERT predicts the masked text based on the unmasked text and the visual embeddings, and it also has to predict whether the text is aligned with the image. When ViT was released, ViLT adopted ViT in its architecture because it was easier to get the image embeddings this way. The image embeddings are jointly processed with the text embeddings. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_summary/chunk_28.txt b/chunked/content_aware_chunking/_model_summary/chunk_28.txt new file mode 100644 index 0000000000000000000000000000000000000000..1d73a2ba5e9554469fcee01e3ffd51a473b2bbf7 --- /dev/null +++ b/chunked/content_aware_chunking/_model_summary/chunk_28.txt @@ -0,0 +1,2 @@ +From there, ViLT is pretrained by image text matching, masked language modeling, and whole word masking. +CLIP takes a different approach and makes a pair prediction of (image, text) . An image encoder (ViT) and a text encoder (Transformer) are jointly trained on a 400 million (image, text) pair dataset to maximize the similarity between the image and text embeddings of the (image, text) pairs. After pretraining, you can use natural language to instruct CLIP to predict the text given an image or vice versa. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_summary/chunk_29.txt b/chunked/content_aware_chunking/_model_summary/chunk_29.txt new file mode 100644 index 0000000000000000000000000000000000000000..2ca1717e42eb6a7dfa5d381065da55ea95722fcd --- /dev/null +++ b/chunked/content_aware_chunking/_model_summary/chunk_29.txt @@ -0,0 +1,3 @@ +OWL-ViT builds on top of CLIP by using it as its backbone for zero-shot object detection. After pretraining, an object detection head is added to make a set prediction over the (class, bounding box) pairs. +Encoder-decoder[[mm-encoder-decoder]] +Optical character recognition (OCR) is a long-standing text recognition task that typically involves several components to understand the image and generate the text. TrOCR simplifies the process using an end-to-end Transformer. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_summary/chunk_3.txt b/chunked/content_aware_chunking/_model_summary/chunk_3.txt new file mode 100644 index 0000000000000000000000000000000000000000..f5d4df11a5da3b5bb42869cd067d573d5d27a487 --- /dev/null +++ b/chunked/content_aware_chunking/_model_summary/chunk_3.txt @@ -0,0 +1 @@ +Even then, some of a CNN's best qualities, like translation invariance, are so powerful (especially for certain tasks) that some Transformers incorporate convolutions in their architecture. ConvNeXt flipped this exchange around and incorporated design choices from Transformers to modernize a CNN. For example, ConvNeXt uses non-overlapping sliding windows to patchify an image and a larger kernel to increase its global receptive field. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_summary/chunk_30.txt b/chunked/content_aware_chunking/_model_summary/chunk_30.txt new file mode 100644 index 0000000000000000000000000000000000000000..bb8a5e68b7199b76d8ea3a92de9da40937f3f196 --- /dev/null +++ b/chunked/content_aware_chunking/_model_summary/chunk_30.txt @@ -0,0 +1 @@ +The encoder is a ViT-style model for image understanding and processes the image as fixed-size patches. The decoder accepts the encoder's hidden states and autoregressively generates text. Donut is a more general visual document understanding model that doesn't rely on OCR-based approaches. It uses a Swin Transformer as the encoder and multilingual BART as the decoder. Donut is pretrained to read text by predicting the next word based on the image and text annotations. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_summary/chunk_31.txt b/chunked/content_aware_chunking/_model_summary/chunk_31.txt new file mode 100644 index 0000000000000000000000000000000000000000..2a58f0a81ec9ec38403596a1b0aa8e9a5a1bdf67 --- /dev/null +++ b/chunked/content_aware_chunking/_model_summary/chunk_31.txt @@ -0,0 +1,5 @@ +The decoder generates a token sequence given a prompt. The prompt is represented by a special token for each downstream task. For example, document parsing has a special parsing token that is combined with the encoder hidden states to parse the document into a structured output format (JSON). +Reinforcement learning + +Decoder[[rl-decoder]] +The Decision and Trajectory Transformer casts the state, action, and reward as a sequence modeling problem. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_summary/chunk_32.txt b/chunked/content_aware_chunking/_model_summary/chunk_32.txt new file mode 100644 index 0000000000000000000000000000000000000000..c2d360316bd37a44ff8fc3f328f080c0e6b08294 --- /dev/null +++ b/chunked/content_aware_chunking/_model_summary/chunk_32.txt @@ -0,0 +1 @@ +The Decision Transformer generates a series of actions that lead to a future desired return based on returns-to-go, past states, and actions. For the last K timesteps, each of the three modalities are converted into token embeddings and processed by a GPT-like model to predict a future action token. Trajectory Transformer also tokenizes the states, actions, and rewards and processes them with a GPT architecture. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_summary/chunk_33.txt b/chunked/content_aware_chunking/_model_summary/chunk_33.txt new file mode 100644 index 0000000000000000000000000000000000000000..5ceff65a6215b94426625676a6ad0c2c2b456ed0 --- /dev/null +++ b/chunked/content_aware_chunking/_model_summary/chunk_33.txt @@ -0,0 +1 @@ +Unlike the Decision Transformer, which is focused on reward conditioning, the Trajectory Transformer generates future actions with beam search.. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_summary/chunk_4.txt b/chunked/content_aware_chunking/_model_summary/chunk_4.txt new file mode 100644 index 0000000000000000000000000000000000000000..eeb92e710c39c2e4edd00f122603a3464f4d2f6c --- /dev/null +++ b/chunked/content_aware_chunking/_model_summary/chunk_4.txt @@ -0,0 +1,3 @@ +ConvNeXt also makes several layer design choices to be more memory-efficient and improve performance, so it competes favorably with Transformers! +Encoder[[cv-encoder]] +The Vision Transformer (ViT) opened the door to computer vision tasks without convolutions. ViT uses a standard Transformer encoder, but its main breakthrough was how it treated an image. It splits an image into fixed-size patches and uses them to create an embedding, just like how a sentence is split into tokens. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_summary/chunk_5.txt b/chunked/content_aware_chunking/_model_summary/chunk_5.txt new file mode 100644 index 0000000000000000000000000000000000000000..1aa9c9108030798580fc84936b176fdf8d25e8c0 --- /dev/null +++ b/chunked/content_aware_chunking/_model_summary/chunk_5.txt @@ -0,0 +1,2 @@ +ViT capitalized on the Transformers' efficient architecture to demonstrate competitive results with the CNNs at the time while requiring fewer resources to train. 
ViT was soon followed by other vision models that could also handle dense vision tasks like segmentation as well as detection. +One of these models is the Swin Transformer. It builds hierarchical feature maps (like a CNN 👀 and unlike ViT) from smaller-sized patches and merges them with neighboring patches in deeper layers. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_summary/chunk_6.txt b/chunked/content_aware_chunking/_model_summary/chunk_6.txt new file mode 100644 index 0000000000000000000000000000000000000000..21f4b23b5e299e4d7efb7a679b8a495f5d03a528 --- /dev/null +++ b/chunked/content_aware_chunking/_model_summary/chunk_6.txt @@ -0,0 +1 @@ +Attention is only computed within a local window, and the window is shifted between attention layers to create connections to help the model learn better. Since the Swin Transformer can produce hierarchical feature maps, it is a good candidate for dense prediction tasks like segmentation and detection. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_summary/chunk_7.txt b/chunked/content_aware_chunking/_model_summary/chunk_7.txt new file mode 100644 index 0000000000000000000000000000000000000000..4d91e422a230ded6ce86ace8e7ed5bf1a4e270ee --- /dev/null +++ b/chunked/content_aware_chunking/_model_summary/chunk_7.txt @@ -0,0 +1,2 @@ +The SegFormer also uses a Transformer encoder to build hierarchical feature maps, but it adds a simple multilayer perceptron (MLP) decoder on top to combine all the feature maps and make a prediction. +Other vision models, like BeIT and ViTMAE, drew inspiration from BERT's pretraining objective. BeIT is pretrained by masked image modeling (MIM); the image patches are randomly masked, and the image is also tokenized into visual tokens. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_summary/chunk_8.txt b/chunked/content_aware_chunking/_model_summary/chunk_8.txt new file mode 100644 index 0000000000000000000000000000000000000000..d5190ec9e6d8db440110b055e9ea262bdba8b170 --- /dev/null +++ b/chunked/content_aware_chunking/_model_summary/chunk_8.txt @@ -0,0 +1 @@ +BeIT is trained to predict the visual tokens corresponding to the masked patches. ViTMAE has a similar pretraining objective, except it must predict the pixels instead of visual tokens. What's unusual is 75% of the image patches are masked! The decoder reconstructs the pixels from the masked tokens and encoded patches. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_model_summary/chunk_9.txt b/chunked/content_aware_chunking/_model_summary/chunk_9.txt new file mode 100644 index 0000000000000000000000000000000000000000..6afeeeb4aae1c67343cadb7eb8039e6e0b536a1d --- /dev/null +++ b/chunked/content_aware_chunking/_model_summary/chunk_9.txt @@ -0,0 +1,3 @@ +After pretraining, the decoder is thrown away, and the encoder is ready to be used in downstream tasks. +Decoder[[cv-decoder]] +Decoder-only vision models are rare because most vision models rely on an encoder to learn an image representation. But for use cases like image generation, the decoder is a natural fit, as we've seen from text generation models like GPT-2. ImageGPT uses the same architecture as GPT-2, but instead of predicting the next token in a sequence, it predicts the next pixel in an image. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_multilingual/chunk_0.txt b/chunked/content_aware_chunking/_multilingual/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..23135f28e02a2665c6e8be90202cdd8fd24d27b4 --- /dev/null +++ b/chunked/content_aware_chunking/_multilingual/chunk_0.txt @@ -0,0 +1,5 @@ +Multilingual models for inference +[[open-in-colab]] +There are several multilingual models in 🤗 Transformers, and their inference usage differs from monolingual models. Not all multilingual model usage is different though. Some models, like google-bert/bert-base-multilingual-uncased, can be used just like a monolingual model. This guide will show you how to use multilingual models whose usage differs for inference. +XLM +XLM has ten different checkpoints, only one of which is monolingual. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_multilingual/chunk_1.txt b/chunked/content_aware_chunking/_multilingual/chunk_1.txt new file mode 100644 index 0000000000000000000000000000000000000000..c6e33cd224d7c13931fcb454fe87e3add2b1cfa0 --- /dev/null +++ b/chunked/content_aware_chunking/_multilingual/chunk_1.txt @@ -0,0 +1,13 @@ +The nine remaining model checkpoints can be split into two categories: the checkpoints that use language embeddings and those that don't. 
+XLM with language embeddings +The following XLM models use language embeddings to specify the language used at inference: + +FacebookAI/xlm-mlm-ende-1024 (Masked language modeling, English-German) +FacebookAI/xlm-mlm-enfr-1024 (Masked language modeling, English-French) +FacebookAI/xlm-mlm-enro-1024 (Masked language modeling, English-Romanian) +FacebookAI/xlm-mlm-xnli15-1024 (Masked language modeling, XNLI languages) +FacebookAI/xlm-mlm-tlm-xnli15-1024 (Masked language modeling + translation, XNLI languages) +FacebookAI/xlm-clm-enfr-1024 (Causal language modeling, English-French) +FacebookAI/xlm-clm-ende-1024 (Causal language modeling, English-German) + +Language embeddings are represented as a tensor of the same shape as the input_ids passed to the model. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_multilingual/chunk_10.txt b/chunked/content_aware_chunking/_multilingual/chunk_10.txt new file mode 100644 index 0000000000000000000000000000000000000000..e478affd540edc77f65ce5093b45f6dfc19f5eb5 --- /dev/null +++ b/chunked/content_aware_chunking/_multilingual/chunk_10.txt @@ -0,0 +1,7 @@ +Set the forced_bos_token_id to en in the generate method to translate to English: + +generated_tokens = model.generate(**encoded_en, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"]) +tokenizer.batch_decode(generated_tokens, skip_special_tokens=True) +"Don't interfere with the wizard's affairs, because they are subtle, will soon get angry." + +If you are using the facebook/mbart-large-50-many-to-one-mmt checkpoint, you don't need to force the target language id as the first generated token otherwise the usage is the same.. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_multilingual/chunk_2.txt b/chunked/content_aware_chunking/_multilingual/chunk_2.txt new file mode 100644 index 0000000000000000000000000000000000000000..be167bf91737b9eab7152862bc3c3f139a994825 --- /dev/null +++ b/chunked/content_aware_chunking/_multilingual/chunk_2.txt @@ -0,0 +1,18 @@ +The values in these tensors depend on the language used and are identified by the tokenizer's lang2id and id2lang attributes. +In this example, load the FacebookAI/xlm-clm-enfr-1024 checkpoint (Causal language modeling, English-French): + +import torch +from transformers import XLMTokenizer, XLMWithLMHeadModel +tokenizer = XLMTokenizer.from_pretrained("FacebookAI/xlm-clm-enfr-1024") +model = XLMWithLMHeadModel.from_pretrained("FacebookAI/xlm-clm-enfr-1024") + +The lang2id attribute of the tokenizer displays this model's languages and their ids: + +print(tokenizer.lang2id) +{'en': 0, 'fr': 1} + +Next, create an example input: + +input_ids = torch.tensor([tokenizer.encode("Wikipedia was used to")]) # batch size of 1 + +Set the language id as "en" and use it to define the language embedding. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_multilingual/chunk_3.txt b/chunked/content_aware_chunking/_multilingual/chunk_3.txt new file mode 100644 index 0000000000000000000000000000000000000000..6bc9e0fdf90a0914b59ab00e3bba3cca0e4d6e5a --- /dev/null +++ b/chunked/content_aware_chunking/_multilingual/chunk_3.txt @@ -0,0 +1 @@ +The language embedding is a tensor filled with 0 since that is the language id for English. This tensor should be the same size as input_ids. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_multilingual/chunk_4.txt b/chunked/content_aware_chunking/_multilingual/chunk_4.txt new file mode 100644 index 0000000000000000000000000000000000000000..7da90561890771b358e92d22c604bab11355fa9f --- /dev/null +++ b/chunked/content_aware_chunking/_multilingual/chunk_4.txt @@ -0,0 +1,24 @@ +language_id = tokenizer.lang2id["en"] # 0 +langs = torch.tensor([language_id] * input_ids.shape[1]) # torch.tensor([0, 0, 0, , 0]) +We reshape it to be of size (batch_size, sequence_length) +langs = langs.view(1, -1) # is now of shape [1, sequence_length] (we have a batch size of 1) + +Now you can pass the input_ids and language embedding to the model: + +outputs = model(input_ids, langs=langs) + +The run_generation.py script can generate text with language embeddings using the xlm-clm checkpoints. +XLM without language embeddings +The following XLM models do not require language embeddings during inference: + +FacebookAI/xlm-mlm-17-1280 (Masked language modeling, 17 languages) +FacebookAI/xlm-mlm-100-1280 (Masked language modeling, 100 languages) + +These models are used for generic sentence representations, unlike the previous XLM checkpoints. +BERT +The following BERT models can be used for multilingual tasks: + +google-bert/bert-base-multilingual-uncased (Masked language modeling + Next sentence prediction, 102 languages) +google-bert/bert-base-multilingual-cased (Masked language modeling + Next sentence prediction, 104 languages) + +These models do not require language embeddings during inference. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_multilingual/chunk_5.txt b/chunked/content_aware_chunking/_multilingual/chunk_5.txt new file mode 100644 index 0000000000000000000000000000000000000000..8dafbc0ea724dcb03327d06b8581db1e07226568 --- /dev/null +++ b/chunked/content_aware_chunking/_multilingual/chunk_5.txt @@ -0,0 +1,9 @@ +They should identify the language from the +context and infer accordingly. +XLM-RoBERTa +The following XLM-RoBERTa models can be used for multilingual tasks: + +FacebookAI/xlm-roberta-base (Masked language modeling, 100 languages) +FacebookAI/xlm-roberta-large (Masked language modeling, 100 languages) + +XLM-RoBERTa was trained on 2.5TB of newly created and cleaned CommonCrawl data in 100 languages. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_multilingual/chunk_6.txt b/chunked/content_aware_chunking/_multilingual/chunk_6.txt new file mode 100644 index 0000000000000000000000000000000000000000..163f3e46d7544fe2f6116632ef4fbe3062d0b339 --- /dev/null +++ b/chunked/content_aware_chunking/_multilingual/chunk_6.txt @@ -0,0 +1,8 @@ +It provides strong gains over previously released multilingual models like mBERT or XLM on downstream tasks like classification, sequence labeling, and question answering. +M2M100 +The following M2M100 models can be used for multilingual translation: + +facebook/m2m100_418M (Translation) +facebook/m2m100_1.2B (Translation) + +In this example, load the facebook/m2m100_418M checkpoint to translate from Chinese to English. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_multilingual/chunk_7.txt b/chunked/content_aware_chunking/_multilingual/chunk_7.txt new file mode 100644 index 0000000000000000000000000000000000000000..71a9a7bac003ebdc6c65707754a4ac533355da1b --- /dev/null +++ b/chunked/content_aware_chunking/_multilingual/chunk_7.txt @@ -0,0 +1,13 @@ +You can set the source language in the tokenizer: + +from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer +en_text = "Do not meddle in the affairs of wizards, for they are subtle and quick to anger." +chinese_text = "不要插手巫師的事務, 因為他們是微妙的, 很快就會發怒." +tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M", src_lang="zh") +model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M") + +Tokenize the text: + +encoded_zh = tokenizer(chinese_text, return_tensors="pt") + +M2M100 forces the target language id as the first generated token to translate to the target language. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_multilingual/chunk_8.txt b/chunked/content_aware_chunking/_multilingual/chunk_8.txt new file mode 100644 index 0000000000000000000000000000000000000000..4ef96eab7f46a1d1ed2725633db3a936ce78aff4 --- /dev/null +++ b/chunked/content_aware_chunking/_multilingual/chunk_8.txt @@ -0,0 +1,16 @@ +Set the forced_bos_token_id to en in the generate method to translate to English: + +generated_tokens = model.generate(**encoded_zh, forced_bos_token_id=tokenizer.get_lang_id("en")) +tokenizer.batch_decode(generated_tokens, skip_special_tokens=True) +'Do not interfere with the matters of the witches, because they are delicate and will soon be angry.' 
+ +MBart +The following MBart models can be used for multilingual translation: + +facebook/mbart-large-50-one-to-many-mmt (One-to-many multilingual machine translation, 50 languages) +facebook/mbart-large-50-many-to-many-mmt (Many-to-many multilingual machine translation, 50 languages) +facebook/mbart-large-50-many-to-one-mmt (Many-to-one multilingual machine translation, 50 languages) +facebook/mbart-large-50 (Multilingual translation, 50 languages) +facebook/mbart-large-cc25 + +In this example, load the facebook/mbart-large-50-many-to-many-mmt checkpoint to translate Finnish to English. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_multilingual/chunk_9.txt b/chunked/content_aware_chunking/_multilingual/chunk_9.txt new file mode 100644 index 0000000000000000000000000000000000000000..b5d45bbf0e4fcbce28d7ccce5ee54a31ca9c021a --- /dev/null +++ b/chunked/content_aware_chunking/_multilingual/chunk_9.txt @@ -0,0 +1,13 @@ +You can set the source language in the tokenizer: + +from transformers import AutoTokenizer, AutoModelForSeq2SeqLM +en_text = "Do not meddle in the affairs of wizards, for they are subtle and quick to anger." +fi_text = "Älä sekaannu velhojen asioihin, sillä ne ovat hienovaraisia ja nopeasti vihaisia." +tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50-many-to-many-mmt", src_lang="fi_FI") +model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-50-many-to-many-mmt") + +Tokenize the text: + +encoded_en = tokenizer(en_text, return_tensors="pt") + +MBart forces the target language id as the first generated token to translate to the target language. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_notebooks/chunk_0.txt b/chunked/content_aware_chunking/_notebooks/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..2e6fa58690a5783fce158b49043429a1a6632a0d --- /dev/null +++ b/chunked/content_aware_chunking/_notebooks/chunk_0.txt @@ -0,0 +1 @@ +../../../notebooks/README.md. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_pad_truncation/chunk_0.txt b/chunked/content_aware_chunking/_pad_truncation/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..cae26b29f34679a9bdceba420fbd2ab6a76472a9 --- /dev/null +++ b/chunked/content_aware_chunking/_pad_truncation/chunk_0.txt @@ -0,0 +1,2 @@ +Padding and truncation +Batched inputs are often different lengths, so they can't be converted to fixed-size tensors. Padding and truncation are strategies for dealing with this problem, to create rectangular tensors from batches of varying lengths. Padding adds a special padding token to ensure shorter sequences will have the same length as either the longest sequence in a batch or the maximum length accepted by the model. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_pad_truncation/chunk_1.txt b/chunked/content_aware_chunking/_pad_truncation/chunk_1.txt new file mode 100644 index 0000000000000000000000000000000000000000..d3b78d25fba7e63cd10184e9499dd5a39cd51e8b --- /dev/null +++ b/chunked/content_aware_chunking/_pad_truncation/chunk_1.txt @@ -0,0 +1,3 @@ +Truncation works in the other direction by truncating long sequences. +In most cases, padding your batch to the length of the longest sequence and truncating to the maximum length a model can accept works pretty well. However, the API supports more strategies if you need them. The three arguments you need to are: padding, truncation and max_length. +The padding argument controls padding. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_pad_truncation/chunk_2.txt b/chunked/content_aware_chunking/_pad_truncation/chunk_2.txt new file mode 100644 index 0000000000000000000000000000000000000000..79df339cccb9ff8f4357d47ecf27e6ee7dc65167 --- /dev/null +++ b/chunked/content_aware_chunking/_pad_truncation/chunk_2.txt @@ -0,0 +1,7 @@ +It can be a boolean or a string: + +True or 'longest': pad to the longest sequence in the batch (no padding is applied if you only provide + a single sequence). +'max_length': pad to a length specified by the max_length argument or the maximum length accepted + by the model if no max_length is provided (max_length=None). Padding will still be applied if you only provide a single sequence. +False or 'do_not_pad': no padding is applied. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_pad_truncation/chunk_3.txt b/chunked/content_aware_chunking/_pad_truncation/chunk_3.txt new file mode 100644 index 0000000000000000000000000000000000000000..73988b6a2b9c25c24f763e17bb8710c03879a25b --- /dev/null +++ b/chunked/content_aware_chunking/_pad_truncation/chunk_3.txt @@ -0,0 +1,6 @@ +This is the default behavior. + +The truncation argument controls truncation. It can be a boolean or a string: + +True or 'longest_first': truncate to a maximum length specified by the max_length argument or + the maximum length accepted by the model if no max_length is provided (max_length=None). \ No newline at end of file diff --git a/chunked/content_aware_chunking/_pad_truncation/chunk_4.txt b/chunked/content_aware_chunking/_pad_truncation/chunk_4.txt new file mode 100644 index 0000000000000000000000000000000000000000..bc6db6f6deaed72878b7d4b7535ed318eb637884 --- /dev/null +++ b/chunked/content_aware_chunking/_pad_truncation/chunk_4.txt @@ -0,0 +1,5 @@ +This will + truncate token by token, removing a token from the longest sequence in the pair until the proper length is + reached. 
+'only_second': truncate to a maximum length specified by the max_length argument or the maximum + length accepted by the model if no max_length is provided (max_length=None). \ No newline at end of file diff --git a/chunked/content_aware_chunking/_pad_truncation/chunk_5.txt b/chunked/content_aware_chunking/_pad_truncation/chunk_5.txt new file mode 100644 index 0000000000000000000000000000000000000000..0053bc554486b8031f0368aa0fe67427a89ecb18 --- /dev/null +++ b/chunked/content_aware_chunking/_pad_truncation/chunk_5.txt @@ -0,0 +1,6 @@ +This will only truncate + the second sentence of a pair if a pair of sequences (or a batch of pairs of sequences) is provided. +'only_first': truncate to a maximum length specified by the max_length argument or the maximum + length accepted by the model if no max_length is provided (max_length=None). This will only truncate + the first sentence of a pair if a pair of sequences (or a batch of pairs of sequences) is provided. +False or 'do_not_truncate': no truncation is applied. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_pad_truncation/chunk_6.txt b/chunked/content_aware_chunking/_pad_truncation/chunk_6.txt new file mode 100644 index 0000000000000000000000000000000000000000..ac261bb5b40092ab27b8e562c0762a00aabf6c98 --- /dev/null +++ b/chunked/content_aware_chunking/_pad_truncation/chunk_6.txt @@ -0,0 +1,4 @@ +This is the default behavior. + +The max_length argument controls the length of the padding and truncation. It can be an integer or None, in which case it will default to the maximum length the model can accept. If the model has no specific maximum input length, truncation or padding to max_length is deactivated. +The following table summarizes the recommended way to setup padding and truncation. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_pad_truncation/chunk_7.txt b/chunked/content_aware_chunking/_pad_truncation/chunk_7.txt new file mode 100644 index 0000000000000000000000000000000000000000..3f96ed4618f3a7fccf85b80770ceea469d68d9ac --- /dev/null +++ b/chunked/content_aware_chunking/_pad_truncation/chunk_7.txt @@ -0,0 +1,2 @@ +If you use pairs of input sequences in any of the following examples, you can replace truncation=True by a STRATEGY selected in +['only_first', 'only_second', 'longest_first'], i.e. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_pad_truncation/chunk_8.txt b/chunked/content_aware_chunking/_pad_truncation/chunk_8.txt new file mode 100644 index 0000000000000000000000000000000000000000..15196440cad9f7cc48a0bfad19a5e4b0bf3ece5b --- /dev/null +++ b/chunked/content_aware_chunking/_pad_truncation/chunk_8.txt @@ -0,0 +1,23 @@ +truncation='only_second' or truncation='longest_first' to control how both sequences in the pair are truncated as detailed before. 
+| Truncation | Padding | Instruction | +|--------------------------------------|-----------------------------------|---------------------------------------------------------------------------------------------| +| no truncation | no padding | tokenizer(batch_sentences) | +| | padding to max sequence in batch | tokenizer(batch_sentences, padding=True) or | +| | | tokenizer(batch_sentences, padding='longest') | +| | padding to max model input length | tokenizer(batch_sentences, padding='max_length') | +| | padding to specific length | tokenizer(batch_sentences, padding='max_length', max_length=42) | +| | padding to a multiple of a value | tokenizer(batch_sentences, padding=True, pad_to_multiple_of=8) | +| truncation to max model input length | no padding | tokenizer(batch_sentences, truncation=True) or | +| | | tokenizer(batch_sentences, truncation=STRATEGY) | +| | padding to max sequence in batch | tokenizer(batch_sentences, padding=True, truncation=True) or | +| | | tokenizer(batch_sentences, padding=True, truncation=STRATEGY) | +| | padding to max model input length | tokenizer(batch_sentences, padding='max_length', truncation=True) or | +| | | tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY) | +| | padding to specific length | Not possible | +| truncation to specific length | no padding | tokenizer(batch_sentences, truncation=True, max_length=42) or | +| | | tokenizer(batch_sentences, truncation=STRATEGY, max_length=42) | +| | padding to max sequence in batch | tokenizer(batch_sentences, padding=True, truncation=True, max_length=42) or | +| | | tokenizer(batch_sentences, padding=True, truncation=STRATEGY, max_length=42) | +| | padding to max model input length | Not possible | +| | padding to specific length | tokenizer(batch_sentences, padding='max_length', truncation=True, max_length=42) or | +| | | tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY, max_length=42) |. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_peft/chunk_0.txt b/chunked/content_aware_chunking/_peft/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..945c9b46d684f08ec84cb316e1dc0061e361f794 --- /dev/null +++ b/chunked/content_aware_chunking/_peft/chunk_0.txt @@ -0,0 +1 @@ +. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_hardware/chunk_0.txt b/chunked/content_aware_chunking/_perf_hardware/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..a234f685e9632a371e01b5d33d158f7ffa5d8817 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_hardware/chunk_0.txt @@ -0,0 +1,2 @@ +Custom hardware for training +The hardware you use to run model training and inference can have a big effect on performance. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_hardware/chunk_1.txt b/chunked/content_aware_chunking/_perf_hardware/chunk_1.txt new file mode 100644 index 0000000000000000000000000000000000000000..f1fed45a603a24e83132f5459ebde1c25bd86847 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_hardware/chunk_1.txt @@ -0,0 +1,14 @@ +For a deep dive into GPUs make sure to check out Tim Dettmer's excellent blog post. +Let's have a look at some practical advice for GPU setups. +GPU +When you train bigger models you have essentially three options: + +bigger GPUs +more GPUs +more CPU and NVMe (offloaded to by DeepSpeed-Infinity) + +Let's start at the case where you have a single GPU. +Power and Cooling +If you bought an expensive high end GPU make sure you give it the correct power and sufficient cooling. +Power: +Some high end consumer GPU cards have 2 and sometimes 3 PCI-E 8-Pin power sockets. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_hardware/chunk_10.txt b/chunked/content_aware_chunking/_perf_hardware/chunk_10.txt new file mode 100644 index 0000000000000000000000000000000000000000..6344915a81abb5feae2b03b6bf9c23664f2c8714 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_hardware/chunk_10.txt @@ -0,0 +1,7 @@ +here is a quote from Nvidia Ampere GA102 GPU Architecture: + +Third-Generation NVLink® +GA102 GPUs utilize NVIDIA’s third-generation NVLink interface, which includes four x4 links, +with each link providing 14.0625 GB/sec bandwidth in each direction between two GPUs. Four +links provide 56.25 GB/sec bandwidth in each direction, and 112.5 GB/sec total bandwidth +between two GPUs. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_hardware/chunk_11.txt b/chunked/content_aware_chunking/_perf_hardware/chunk_11.txt new file mode 100644 index 0000000000000000000000000000000000000000..c3eb1bd790f1d6fe68d8b568b2c56f11f48e08c2 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_hardware/chunk_11.txt @@ -0,0 +1,4 @@ +Two RTX 3090 GPUs can be connected together for SLI using NVLink. +(Note that 3-Way and 4-Way SLI configurations are not supported.) + +So the higher X you get in the report of NVX in the output of nvidia-smi topo -m the better. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_hardware/chunk_12.txt b/chunked/content_aware_chunking/_perf_hardware/chunk_12.txt new file mode 100644 index 0000000000000000000000000000000000000000..31069d15a957079e6262e185a4121392d7a190ec --- /dev/null +++ b/chunked/content_aware_chunking/_perf_hardware/chunk_12.txt @@ -0,0 +1,8 @@ +The generation will depend on your GPU architecture. +Let's compare the execution of a openai-community/gpt2 language model training over a small sample of wikitext. 
+The results are: +| NVlink | Time | +| ----- | ---: | +| Y | 101s | +| N | 131s | +You can see that NVLink completes the training ~23% faster. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_hardware/chunk_13.txt b/chunked/content_aware_chunking/_perf_hardware/chunk_13.txt new file mode 100644 index 0000000000000000000000000000000000000000..c4ad678e8a13ee7e7ad48da9e37821351658ee6f --- /dev/null +++ b/chunked/content_aware_chunking/_perf_hardware/chunk_13.txt @@ -0,0 +1,18 @@ +In the second benchmark we use NCCL_P2P_DISABLE=1 to tell the GPUs not to use NVLink. +Here is the full benchmark code and outputs: +```bash +DDP w/ NVLink +rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 torchrun \ +--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path openai-community/gpt2 \ +--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \ +--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 +{'train_runtime': 101.9003, 'train_samples_per_second': 1.963, 'epoch': 0.69} +DDP w/o NVLink +rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 torchrun \ +--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path openai-community/gpt2 \ +--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train +--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 +{'train_runtime': 131.4367, 'train_samples_per_second': 1.522, 'epoch': 0.69} + +Hardware: 2x TITAN RTX 24GB each + NVlink with 2 NVLinks (NV2 in nvidia-smi topo -m) +Software: pytorch-1.8-to-be + cuda-11.0 / transformers==4.3.0.dev0. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_hardware/chunk_2.txt b/chunked/content_aware_chunking/_perf_hardware/chunk_2.txt new file mode 100644 index 0000000000000000000000000000000000000000..f718bd3c4bf00576df7f735ba86993561a6cb477 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_hardware/chunk_2.txt @@ -0,0 +1 @@ +Make sure you have as many independent 12V PCI-E 8-Pin cables plugged into the card as there are sockets. Do not use the 2 splits at one end of the same cable (also known as pigtail cable). \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_hardware/chunk_3.txt b/chunked/content_aware_chunking/_perf_hardware/chunk_3.txt new file mode 100644 index 0000000000000000000000000000000000000000..dea4c0966e2dd1d8d5aa8704ff25d67fab95abd5 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_hardware/chunk_3.txt @@ -0,0 +1,5 @@ +That is if you have 2 sockets on the GPU, you want 2 PCI-E 8-Pin cables going from your PSU to the card and not one that has 2 PCI-E 8-Pin connectors at the end! You won't get the full performance out of your card otherwise. +Each PCI-E 8-Pin power cable needs to be plugged into a 12V rail on the PSU side and can supply up to 150W of power. +Some other cards may use a PCI-E 12-Pin connectors, and these can deliver up to 500-600W of power. +Low end cards may use 6-Pin connectors, which supply up to 75W of power. +Additionally you want the high-end PSU that has stable voltage. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_hardware/chunk_4.txt b/chunked/content_aware_chunking/_perf_hardware/chunk_4.txt new file mode 100644 index 0000000000000000000000000000000000000000..19de2cf93c59df925bb8c21f730259f7657d1fe3 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_hardware/chunk_4.txt @@ -0,0 +1,5 @@ +Some lower quality ones may not give the card the stable voltage it needs to function at its peak. 
+And of course the PSU needs to have enough unused Watts to power the card. +Cooling: +When a GPU gets overheated it will start throttling down and will not deliver full performance and it can even shutdown if it gets too hot. +It's hard to tell the exact best temperature to strive for when a GPU is heavily loaded, but probably anything under +80C is good, but lower is better - perhaps 70-75C is an excellent range to be in. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_hardware/chunk_5.txt b/chunked/content_aware_chunking/_perf_hardware/chunk_5.txt new file mode 100644 index 0000000000000000000000000000000000000000..c8306818ee32e0951f09065bfbf194b160622e55 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_hardware/chunk_5.txt @@ -0,0 +1,4 @@ +The throttling down is likely to start at around 84-90C. But other than throttling performance a prolonged very high temperature is likely to reduce the lifespan of a GPU. +Next let's have a look at one of the most important aspects when having multiple GPUs: connectivity. +Multi-GPU Connectivity +If you use multiple GPUs the way cards are inter-connected can have a huge impact on the total training time. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_hardware/chunk_6.txt b/chunked/content_aware_chunking/_perf_hardware/chunk_6.txt new file mode 100644 index 0000000000000000000000000000000000000000..a21af511e9fa7802be4fb4792a36956417ef57ea --- /dev/null +++ b/chunked/content_aware_chunking/_perf_hardware/chunk_6.txt @@ -0,0 +1,4 @@ +If the GPUs are on the same physical node, you can run: + +nvidia-smi topo -m +and it will tell you how the GPUs are inter-connected. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_hardware/chunk_7.txt b/chunked/content_aware_chunking/_perf_hardware/chunk_7.txt new file mode 100644 index 0000000000000000000000000000000000000000..3aa6ca23a56a6ca2889615248030be13b8ff4176 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_hardware/chunk_7.txt @@ -0,0 +1,18 @@ +On a machine with dual-GPU and which are connected with NVLink, you will most likely see something like: +GPU0 GPU1 CPU Affinity NUMA Affinity +GPU0 X NV2 0-23 N/A +GPU1 NV2 X 0-23 N/A +on a different machine w/o NVLink we may see: +GPU0 GPU1 CPU Affinity NUMA Affinity +GPU0 X PHB 0-11 N/A +GPU1 PHB X 0-11 N/A +The report includes this legend: +X = Self + SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) + NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node + PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) + PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) + PIX = Connection traversing at most a single PCIe bridge + NV# = Connection traversing a bonded set of # NVLinks +So the first report NV2 tells us the GPUs are interconnected with 2 NVLinks, and the second report PHB we have a typical consumer-level PCIe+Bridge setup. +Check what type of connectivity you have on your setup. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_hardware/chunk_8.txt b/chunked/content_aware_chunking/_perf_hardware/chunk_8.txt new file mode 100644 index 0000000000000000000000000000000000000000..c31b1f6643b894caef38645e791d9282ec52023a --- /dev/null +++ b/chunked/content_aware_chunking/_perf_hardware/chunk_8.txt @@ -0,0 +1,2 @@ +Some of these will make the communication between cards faster (e.g. NVLink), others slower (e.g. PHB). 
+Depending on the type of scalability solution used, the connectivity speed could have a major or a minor impact. If the GPUs need to sync rarely, as in DDP, the impact of a slower connection will be less significant. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_hardware/chunk_9.txt b/chunked/content_aware_chunking/_perf_hardware/chunk_9.txt new file mode 100644 index 0000000000000000000000000000000000000000..855ff0f59f8c9f9ef480eeeae2e0299ed9772542 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_hardware/chunk_9.txt @@ -0,0 +1,4 @@ +If the GPUs need to send messages to each other often, as in ZeRO-DP, then faster connectivity becomes super important to achieve faster training. +NVlink +NVLink is a wire-based serial multi-lane near-range communications link developed by Nvidia. +Each new generation provides a faster bandwidth, e.g. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_cpu/chunk_0.txt b/chunked/content_aware_chunking/_perf_infer_cpu/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..a557e3b72137cab2134a64dba31544e519ec9230 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_cpu/chunk_0.txt @@ -0,0 +1,3 @@ +CPU inference +With some optimizations, it is possible to efficiently run large model inference on a CPU. One of these optimization techniques involves compiling the PyTorch code into an intermediate format for high-performance environments like C++. The other technique fuses multiple operations into one kernel to reduce the overhead of running each operation separately. +You'll learn how to use BetterTransformer for faster inference, and how to convert your PyTorch code to TorchScript. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_cpu/chunk_1.txt b/chunked/content_aware_chunking/_perf_infer_cpu/chunk_1.txt new file mode 100644 index 0000000000000000000000000000000000000000..ef2b2d2978bd038f4f8653eb41453b20372162ed --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_cpu/chunk_1.txt @@ -0,0 +1,3 @@ +If you're using an Intel CPU, you can also use graph optimizations from Intel Extension for PyTorch to boost inference speed even more. Finally, learn how to use 🤗 Optimum to accelerate inference with ONNX Runtime or OpenVINO (if you're using an Intel CPU). +BetterTransformer +BetterTransformer accelerates inference with its fastpath (native PyTorch specialized implementation of Transformer functions) execution. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_cpu/chunk_2.txt b/chunked/content_aware_chunking/_perf_infer_cpu/chunk_2.txt new file mode 100644 index 0000000000000000000000000000000000000000..d8063f7feef74c16215f2f94ec2ec59394bfe351 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_cpu/chunk_2.txt @@ -0,0 +1,8 @@ +The two optimizations in the fastpath execution are: + +fusion, which combines multiple sequential operations into a single "kernel" to reduce the number of computation steps +skipping the inherent sparsity of padding tokens to avoid unnecessary computation with nested tensors + +BetterTransformer also converts all attention operations to use the more memory-efficient scaled dot product attention. + +BetterTransformer is not supported for all models. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_cpu/chunk_3.txt b/chunked/content_aware_chunking/_perf_infer_cpu/chunk_3.txt new file mode 100644 index 0000000000000000000000000000000000000000..197b66d83b116b7cadb2b3b3c30d5df4e1799e4d --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_cpu/chunk_3.txt @@ -0,0 +1,11 @@ +Check this list to see if a model supports BetterTransformer. + +Before you start, make sure you have 🤗 Optimum installed. +Enable BetterTransformer with the [PreTrainedModel.to_bettertransformer] method: + +from transformers import AutoModelForCausalLM +model = AutoModelForCausalLM.from_pretrained("bigcode/starcoder") +model.to_bettertransformer() + +TorchScript +TorchScript is an intermediate PyTorch model representation that can be run in production environments where performance is important. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_cpu/chunk_4.txt b/chunked/content_aware_chunking/_perf_infer_cpu/chunk_4.txt new file mode 100644 index 0000000000000000000000000000000000000000..127e7b3a1d165bfdffc796aa30250de2a1e03fde --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_cpu/chunk_4.txt @@ -0,0 +1 @@ +You can train a model in PyTorch and then export it to TorchScript to free the model from Python performance constraints. PyTorch traces a model to return a [ScriptFunction] that is optimized with just-in-time compilation (JIT). \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_cpu/chunk_5.txt b/chunked/content_aware_chunking/_perf_infer_cpu/chunk_5.txt new file mode 100644 index 0000000000000000000000000000000000000000..ad220ad88750142b03d9eaad924824804d86c696 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_cpu/chunk_5.txt @@ -0,0 +1,16 @@ +Compared to the default eager mode, JIT mode in PyTorch typically yields better performance for inference using optimization techniques like operator fusion. 
+For a gentle introduction to TorchScript, see the Introduction to PyTorch TorchScript tutorial. +With the [Trainer] class, you can enable JIT mode for CPU inference by setting the --jit_mode_eval flag: + +python run_qa.py \ +--model_name_or_path csarron/bert-base-uncased-squad-v1 \ +--dataset_name squad \ +--do_eval \ +--max_seq_length 384 \ +--doc_stride 128 \ +--output_dir /tmp/ \ +--no_cuda \ +--jit_mode_eval + +For PyTorch >= 1.14.0, JIT-mode could benefit any model for prediction and evaluation since the dict input is supported in jit.trace. +For PyTorch < 1.14.0, JIT-mode could benefit a model if its forward parameter order matches the tuple input order in jit.trace, such as a question-answering model. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_cpu/chunk_6.txt b/chunked/content_aware_chunking/_perf_infer_cpu/chunk_6.txt new file mode 100644 index 0000000000000000000000000000000000000000..f83d7c0584cd0cf839a185395aeac51f5d81f54a --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_cpu/chunk_6.txt @@ -0,0 +1,4 @@ +If the forward parameter order does not match the tuple input order in jit.trace, like a text classification model, jit.trace will fail and we are capturing this with the exception here to make it fallback. Logging is used to notify users. + +IPEX graph optimization +Intel® Extension for PyTorch (IPEX) provides further optimizations in JIT mode for Intel CPUs, and we recommend combining it with TorchScript for even faster performance. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_cpu/chunk_7.txt b/chunked/content_aware_chunking/_perf_infer_cpu/chunk_7.txt new file mode 100644 index 0000000000000000000000000000000000000000..a14fde9f6e7fcd7b5fe4b23cee65cfb68333b0ec --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_cpu/chunk_7.txt @@ -0,0 +1,19 @@ +The IPEX graph optimization fuses operations like Multi-head attention, Concat Linear, Linear + Add, Linear + Gelu, Add + LayerNorm, and more. +To take advantage of these graph optimizations, make sure you have IPEX installed: + +pip install intel_extension_for_pytorch +Set the --use_ipex and --jit_mode_eval flags in the [Trainer] class to enable JIT mode with the graph optimizations: + +python run_qa.py \ +--model_name_or_path csarron/bert-base-uncased-squad-v1 \ +--dataset_name squad \ +--do_eval \ +--max_seq_length 384 \ +--doc_stride 128 \ +--output_dir /tmp/ \ +--no_cuda \ +--use_ipex \ +--jit_mode_eval +🤗 Optimum + +Learn more details about using ORT with 🤗 Optimum in the Optimum Inference with ONNX Runtime guide. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_cpu/chunk_8.txt b/chunked/content_aware_chunking/_perf_infer_cpu/chunk_8.txt new file mode 100644 index 0000000000000000000000000000000000000000..528a5cf01c3705a8e5e447b61475017a1018656e --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_cpu/chunk_8.txt @@ -0,0 +1,3 @@ +This section only provides a brief and simple example. + +ONNX Runtime (ORT) is a model accelerator that runs inference on CPUs by default. ORT is supported by 🤗 Optimum which can be used in 🤗 Transformers, without making too many changes to your code. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_cpu/chunk_9.txt b/chunked/content_aware_chunking/_perf_infer_cpu/chunk_9.txt new file mode 100644 index 0000000000000000000000000000000000000000..1e66064209a571cec57247e79d450f47d308af81 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_cpu/chunk_9.txt @@ -0,0 +1,13 @@ +You only need to replace the 🤗 Transformers AutoClass with its equivalent [~optimum.onnxruntime.ORTModel] for the task you're solving, and load a checkpoint in the ONNX format. +For example, if you're running inference on a question answering task, load the optimum/roberta-base-squad2 checkpoint which contains a model.onnx file: + +from transformers import AutoTokenizer, pipeline +from optimum.onnxruntime import ORTModelForQuestionAnswering +model = ORTModelForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2") +tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2") +onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer) +question = "What's my name?" +context = "My name is Philipp and I live in Nuremberg." +pred = onnx_qa(question, context) + +If you have an Intel CPU, take a look at 🤗 Optimum Intel which supports a variety of compression techniques (quantization, pruning, knowledge distillation) and tools for converting models to the OpenVINO format for higher performance inference.. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_0.txt b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..188e0c0d12cfb1ed1bfc0aac171f8dd430a508bb --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_0.txt @@ -0,0 +1,2 @@ +GPU inference +GPUs are the standard choice of hardware for machine learning, unlike CPUs, because they are optimized for memory bandwidth and parallelism. 
To keep up with the larger sizes of modern models or to run these large models on existing and older hardware, there are several optimizations you can use to speed up GPU inference. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_1.txt b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_1.txt new file mode 100644 index 0000000000000000000000000000000000000000..38038732707830c1d6976405d24692a76c5f83f3 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_1.txt @@ -0,0 +1 @@ +In this guide, you'll learn how to use FlashAttention-2 (a more memory-efficient attention mechanism), BetterTransformer (a PyTorch native fastpath execution), and bitsandbytes to quantize your model to a lower precision. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_10.txt b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_10.txt new file mode 100644 index 0000000000000000000000000000000000000000..5093499944fd0af67ca2d4d3846acd65e88dad1f --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_10.txt @@ -0,0 +1,34 @@ +SDPA support is currently being added natively in Transformers and is used by default for torch>=2.1.1 when an implementation is available. +For now, Transformers supports SDPA inference and training for the following architectures: +* Bart +* GPTBigCode +* Falcon +* Llama +* Phi +* Idefics +* Whisper +* Mistral +* Mixtral +* Qwen2 + +FlashAttention can only be used for models with the fp16 or bf16 torch type, so make sure to cast your model to the appropriate type first. 
+ +By default, SDPA selects the most performant kernel available but you can check whether a backend is available in a given setting (hardware, problem size) with torch.backends.cuda.sdp_kernel as a context manager: + +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer +tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m") +model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", torch_dtype=torch.float16).to("cuda") +convert the model to BetterTransformer +model.to_bettertransformer() +input_text = "Hello my dog is cute and" +inputs = tokenizer(input_text, return_tensors="pt").to("cuda") + +with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False): + outputs = model.generate(**inputs) + +print(tokenizer.decode(outputs[0], skip_special_tokens=True)) + +If you see a bug with the traceback below, try using the nightly version of PyTorch which may have broader coverage for FlashAttention: +```bash +RuntimeError: No available kernel. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_11.txt b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_11.txt new file mode 100644 index 0000000000000000000000000000000000000000..fe631efeb54994ee14b6f6993fe3d589251bf942 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_11.txt @@ -0,0 +1,7 @@ +Aborting execution. +install PyTorch nightly +pip3 install -U --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118 + +BetterTransformer + +Some BetterTransformer features are being upstreamed to Transformers with default support for native torch.nn.scaled_dot_product_attention. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_12.txt b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_12.txt new file mode 100644 index 0000000000000000000000000000000000000000..a754f0c625fa1520b8b76ab0df4259c5ee1697bf --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_12.txt @@ -0,0 +1,5 @@ +BetterTransformer still has a wider coverage than the Transformers SDPA integration, but you can expect more and more architectures to natively support SDPA in Transformers. + +Check out our benchmarks with BetterTransformer and scaled dot product attention in the Out of the box acceleration and memory savings of 🤗 decoder models with PyTorch 2.0 and learn more about the fastpath execution in the BetterTransformer blog post. + +BetterTransformer accelerates inference with its fastpath (native PyTorch specialized implementation of Transformer functions) execution. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_13.txt b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_13.txt new file mode 100644 index 0000000000000000000000000000000000000000..0ee119f88aa679665a608fc777550dc86bc03d07 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_13.txt @@ -0,0 +1,11 @@ +The two optimizations in the fastpath execution are: + +fusion, which combines multiple sequential operations into a single "kernel" to reduce the number of computation steps +skipping the inherent sparsity of padding tokens to avoid unnecessary computation with nested tensors + +BetterTransformer also converts all attention operations to use the more memory-efficient scaled dot product attention (SDPA), and it calls optimized kernels like FlashAttention under the hood. +Before you start, make sure you have 🤗 Optimum installed. 
+Then you can enable BetterTransformer with the [PreTrainedModel.to_bettertransformer] method: +python +model = model.to_bettertransformer() +You can return the original Transformers model with the [~PreTrainedModel.reverse_bettertransformer] method. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_14.txt b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_14.txt new file mode 100644 index 0000000000000000000000000000000000000000..92a6b7fe880c56a7c3a371e2b43f5fe0e4f1e736 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_14.txt @@ -0,0 +1,6 @@ +You should use this before saving your model to use the canonical Transformers modeling: +py +model = model.reverse_bettertransformer() +model.save_pretrained("saved_model") +bitsandbytes +bitsandbytes is a quantization library that includes support for 4-bit and 8-bit quantization. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_15.txt b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_15.txt new file mode 100644 index 0000000000000000000000000000000000000000..f727ea5a2b27fcd29c705f0392b3501b439fda7a --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_15.txt @@ -0,0 +1,10 @@ +Quantization reduces your model size compared to its native full precision version, making it easier to fit large models onto GPUs with limited memory. +Make sure you have bitsandbytes and 🤗 Accelerate installed: +```bash +these versions support 8-bit and 4-bit +pip install bitsandbytes>=0.39.0 accelerate>=0.20.0 +install Transformers +pip install transformers + +4-bit +To load a model in 4-bit for inference, use the load_in_4bit parameter. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_16.txt b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_16.txt new file mode 100644 index 0000000000000000000000000000000000000000..323d85ed078ae399760b3c5d26828e5e82d22d06 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_16.txt @@ -0,0 +1,7 @@ +The device_map parameter is optional, but we recommend setting it to "auto" to allow 🤗 Accelerate to automatically and efficiently allocate the model given the available resources in the environment. + +from transformers import AutoModelForCausalLM +model_name = "bigscience/bloom-2b5" +model_4bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_4bit=True) + +To load a model in 4-bit for inference with multiple GPUs, you can control how much GPU RAM you want to allocate to each GPU. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_17.txt b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_17.txt new file mode 100644 index 0000000000000000000000000000000000000000..043f02070a34cd0a2737f799b7cb19218da2d79a --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_17.txt @@ -0,0 +1,12 @@ +For example, to distribute 600MB of memory to the first GPU and 1GB of memory to the second GPU: +py +max_memory_mapping = {0: "600MB", 1: "1GB"} +model_name = "bigscience/bloom-3b" +model_4bit = AutoModelForCausalLM.from_pretrained( + model_name, device_map="auto", load_in_4bit=True, max_memory=max_memory_mapping +) +8-bit + +If you're curious and interested in learning more about the concepts underlying 8-bit quantization, read the Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes blog post. + +To load a model in 8-bit for inference, use the load_in_8bit parameter. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_18.txt b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_18.txt new file mode 100644 index 0000000000000000000000000000000000000000..99b6c2a7858193160f96e3ac7b7015f736f44a48 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_18.txt @@ -0,0 +1,7 @@ +The device_map parameter is optional, but we recommend setting it to "auto" to allow 🤗 Accelerate to automatically and efficiently allocate the model given the available resources in the environment: + +from transformers import AutoModelForCausalLM +model_name = "bigscience/bloom-2b5" +model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True) + +If you're loading a model in 8-bit for text generation, you should use the [~transformers.GenerationMixin.generate] method instead of the [Pipeline] function which is not optimized for 8-bit models and will be slower. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_19.txt b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_19.txt new file mode 100644 index 0000000000000000000000000000000000000000..a211224317f81fb0a043bedbe2c3eb7b12e12a0c --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_19.txt @@ -0,0 +1 @@ +Some sampling strategies, like nucleus sampling, are also not supported by the [Pipeline] for 8-bit models. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_2.txt b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_2.txt new file mode 100644 index 0000000000000000000000000000000000000000..c16eccdd7d7a4760cc08767345f14efffb059600 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_2.txt @@ -0,0 +1,39 @@ +Finally, learn how to use 🤗 Optimum to accelerate inference with ONNX Runtime on Nvidia and AMD GPUs. 
+ +The majority of the optimizations described here also apply to multi-GPU setups! + +FlashAttention-2 + +FlashAttention-2 is experimental and may change considerably in future versions. + +FlashAttention-2 is a faster and more efficient implementation of the standard attention mechanism that can significantly speedup inference by: + +additionally parallelizing the attention computation over sequence length +partitioning the work between GPU threads to reduce communication and shared memory reads/writes between them + +FlashAttention-2 is currently supported for the following architectures: +* Bark +* Bart +* DistilBert +* GPTBigCode +* GPTNeo +* GPTNeoX +* Falcon +* Llama +* Llava +* VipLlava +* MBart +* Mistral +* Mixtral +* OPT +* Phi +* StableLm +* Qwen2 +* Whisper +You can request to add FlashAttention-2 support for another model by opening a GitHub Issue or Pull Request. +Before you begin, make sure you have FlashAttention-2 installed. + +pip install flash-attn --no-build-isolation +We strongly suggest referring to the detailed installation instructions to learn more about supported hardware and data types! + +FlashAttention-2 is also supported on AMD GPUs and current support is limited to Instinct MI210 and Instinct MI250. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_20.txt b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_20.txt new file mode 100644 index 0000000000000000000000000000000000000000..de684a8821f12e2a9eb4637d498f53aaf0791687 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_20.txt @@ -0,0 +1,12 @@ +You should also place all inputs on the same device as the model: + +from transformers import AutoModelForCausalLM, AutoTokenizer +model_name = "bigscience/bloom-2b5" +tokenizer = AutoTokenizer.from_pretrained(model_name) +model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True) +prompt = "Hello, my llama is cute" +inputs = tokenizer(prompt, return_tensors="pt").to("cuda") +generated_ids = model_8bit.generate(**inputs) +outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True) + +To load a model in 8-bit for inference with multiple GPUs, you can control how much GPU RAM you want to allocate to each GPU. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_21.txt b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_21.txt new file mode 100644 index 0000000000000000000000000000000000000000..d702f17e8302309462573fa92283919c51510d9b --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_21.txt @@ -0,0 +1,13 @@ +For example, to distribute 1GB of memory to the first GPU and 2GB of memory to the second GPU: +py +max_memory_mapping = {0: "1GB", 1: "2GB"} +model_name = "bigscience/bloom-3b" +model_8bit = AutoModelForCausalLM.from_pretrained( + model_name, device_map="auto", load_in_8bit=True, max_memory=max_memory_mapping +) + +Feel free to try running an 11-billion-parameter T5 model or the 3-billion-parameter BLOOM model for inference on Google Colab's free tier GPUs!
+ +🤗 Optimum + +Learn more details about using ORT with 🤗 Optimum in the Accelerated inference on NVIDIA GPUs and Accelerated inference on AMD GPUs guides. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_22.txt b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_22.txt new file mode 100644 index 0000000000000000000000000000000000000000..605bd54a02a47e813893fb0dbd1a0b005fe89fa1 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_22.txt @@ -0,0 +1,3 @@ +This section only provides a brief and simple example. + +ONNX Runtime (ORT) is a model accelerator that supports accelerated inference on Nvidia GPUs, and AMD GPUs that use ROCm stack. ORT uses optimization techniques like fusing common operations into a single node and constant folding to reduce the number of computations performed and speedup inference. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_23.txt b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_23.txt new file mode 100644 index 0000000000000000000000000000000000000000..46caa867124d7d6588bc2862c5626c221bd9e42c --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_23.txt @@ -0,0 +1,2 @@ +ORT also places the most computationally intensive operations on the GPU and the rest on the CPU to intelligently distribute the workload between the two devices. +ORT is supported by 🤗 Optimum which can be used in 🤗 Transformers. You'll need to use an [~optimum.onnxruntime.ORTModel] for the task you're solving, and specify the provider parameter which can be set to either CUDAExecutionProvider, ROCMExecutionProvider or TensorrtExecutionProvider. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_24.txt b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_24.txt new file mode 100644 index 0000000000000000000000000000000000000000..a33aa69e4a0d8e0a5965f11956ec06ea1141fc8b --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_24.txt @@ -0,0 +1,19 @@ +If you want to load a model that was not yet exported to ONNX, you can set export=True to convert your model on-the-fly to the ONNX format: + +from optimum.onnxruntime import ORTModelForSequenceClassification +ort_model = ORTModelForSequenceClassification.from_pretrained( + "distilbert/distilbert-base-uncased-finetuned-sst-2-english", + export=True, + provider="CUDAExecutionProvider", +) + +Now you're free to use the model for inference: + +from optimum.pipelines import pipeline +from transformers import AutoTokenizer +tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased-finetuned-sst-2-english") +pipeline = pipeline(task="text-classification", model=ort_model, tokenizer=tokenizer, device="cuda:0") +result = pipeline("Both the music and visual were astounding, not to mention the actors performance.") + +Combine optimizations +It is often possible to combine several of the optimization techniques described above to get the best inference performance possible for your model. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_25.txt b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_25.txt new file mode 100644 index 0000000000000000000000000000000000000000..05cfb97aadf0e4f0bac16f4e2d5d93afa839e6f6 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_25.txt @@ -0,0 +1,20 @@ +For example, you can load a model in 4-bit, and then enable BetterTransformer with FlashAttention: + +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig +load model in 4-bit +quantization_config = BitsAndBytesConfig( + load_in_4bit=True, + bnb_4bit_compute_dtype=torch.float16 +) +tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m") +model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", quantization_config=quantization_config) +enable BetterTransformer +model = model.to_bettertransformer() +input_text = "Hello my dog is cute and" +inputs = tokenizer(input_text, return_tensors="pt").to("cuda") +enable FlashAttention +with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False): + outputs = model.generate(**inputs) +print(tokenizer.decode(outputs[0], skip_special_tokens=True)) +```. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_3.txt b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_3.txt new file mode 100644 index 0000000000000000000000000000000000000000..8a17af91d3f5831449b831e401cf5ee53e0a0f90 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_3.txt @@ -0,0 +1,15 @@ +We strongly suggest using this Dockerfile to use FlashAttention-2 on AMD GPUs. 
+ +To enable FlashAttention-2, pass the argument attn_implementation="flash_attention_2" to [~AutoModelForCausalLM.from_pretrained]: +python +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer +model_id = "tiiuae/falcon-7b" +tokenizer = AutoTokenizer.from_pretrained(model_id) +model = AutoModelForCausalLM.from_pretrained( + model_id, + torch_dtype=torch.bfloat16, + attn_implementation="flash_attention_2", +) + +FlashAttention-2 can only be used when the model's dtype is fp16 or bf16. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_4.txt b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_4.txt new file mode 100644 index 0000000000000000000000000000000000000000..780566434aeb8b344f6a23be518d9652c1c50af7 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_4.txt @@ -0,0 +1,5 @@ +Make sure to cast your model to the appropriate dtype and load it on a supported device before using FlashAttention-2. + +You can also set use_flash_attention_2=True to enable FlashAttention-2, but it is deprecated in favor of attn_implementation="flash_attention_2". + +FlashAttention-2 can be combined with other optimization techniques like quantization to further speed up inference.
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_5.txt b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_5.txt new file mode 100644 index 0000000000000000000000000000000000000000..1d1a1f4d0740fb4a18582d316323c13739b621ea --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_5.txt @@ -0,0 +1,21 @@ +For example, you can combine FlashAttention-2 with 8-bit or 4-bit quantization: + +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM +model_id = "tiiuae/falcon-7b" +tokenizer = AutoTokenizer.from_pretrained(model_id) +load in 8bit +model = AutoModelForCausalLM.from_pretrained( + model_id, + load_in_8bit=True, + attn_implementation="flash_attention_2", +) +load in 4bit +model = AutoModelForCausalLM.from_pretrained( + model_id, + load_in_4bit=True, + attn_implementation="flash_attention_2", +) + +Expected speedups +You can benefit from considerable speedups for inference, especially for inputs with long sequences. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_6.txt b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_6.txt new file mode 100644 index 0000000000000000000000000000000000000000..379b1ca2e371c3ad01d3271c699af758f116aab0 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_6.txt @@ -0,0 +1 @@ +However, since FlashAttention-2 does not support computing attention scores with padding tokens, you must manually pad/unpad the attention scores for batched inference when the sequence contains padding tokens. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_7.txt b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_7.txt new file mode 100644 index 0000000000000000000000000000000000000000..7c102156783e22dc3bab7003aede826197e0af4c --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_7.txt @@ -0,0 +1,7 @@ +This leads to a significant slowdown for batched generations with padding tokens. +To overcome this, you should use FlashAttention-2 without padding tokens in the sequence during training (by packing a dataset or concatenating sequences until reaching the maximum sequence length). +For a single forward pass on tiiuae/falcon-7b with a sequence length of 4096 and various batch sizes without padding tokens, the expected speedup is: + +For a single forward pass on meta-llama/Llama-7b-hf with a sequence length of 4096 and various batch sizes without padding tokens, the expected speedup is: + +For sequences with padding tokens (generating with padding tokens), you need to unpad/pad the input sequences to correctly compute the attention scores. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_8.txt b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_8.txt new file mode 100644 index 0000000000000000000000000000000000000000..342891138dcf715742296260d727e50c5147f848 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_8.txt @@ -0,0 +1,5 @@ +With a relatively small sequence length, a single forward pass creates overhead leading to a small speedup (in the example below, 30% of the input is filled with padding tokens): + +But for larger sequence lengths, you can expect even more speedup benefits: + +FlashAttention is more memory efficient, meaning you can train on much larger sequence lengths without running into out-of-memory issues. You can potentially reduce memory usage up to 20x for larger sequence lengths. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_9.txt b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_9.txt new file mode 100644 index 0000000000000000000000000000000000000000..b2d5197f17a965faf75b8a0f61bb9c08bccae5c7 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_infer_gpu_one/chunk_9.txt @@ -0,0 +1,4 @@ +Take a look at the flash-attention repository for more details. + +PyTorch scaled dot product attention +PyTorch's torch.nn.functional.scaled_dot_product_attention (SDPA) can also call FlashAttention and memory-efficient attention kernels under the hood. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_torch_compile/chunk_0.txt b/chunked/content_aware_chunking/_perf_torch_compile/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..bc5cb03ef6c3d4bf0f642a47f95232a7e48191b8 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_torch_compile/chunk_0.txt @@ -0,0 +1,4 @@ +Optimize inference using torch.compile() +This guide aims to provide a benchmark on the inference speed-ups introduced with torch.compile() for computer vision models in 🤗 Transformers. +Benefits of torch.compile +Depending on the model and the GPU, torch.compile() yields up to 30% speed-up during inference. To use torch.compile(), simply install any version of torch above 2.0. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_torch_compile/chunk_1.txt b/chunked/content_aware_chunking/_perf_torch_compile/chunk_1.txt new file mode 100644 index 0000000000000000000000000000000000000000..9fe545210e53ba2ae64f2636d4050750cc220193 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_torch_compile/chunk_1.txt @@ -0,0 +1,8 @@ +Compiling a model takes time, so it's useful if you are compiling the model only once instead of every time you infer. 
+To compile any computer vision model of your choice, call torch.compile() on the model as shown below: + +from transformers import AutoModelForImageClassification +model = AutoModelForImageClassification.from_pretrained(MODEL_ID).to("cuda") ++ model = torch.compile(model) + +compile() comes with multiple modes for compiling, which essentially differ in compilation time and inference overhead. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_torch_compile/chunk_2.txt b/chunked/content_aware_chunking/_perf_torch_compile/chunk_2.txt new file mode 100644 index 0000000000000000000000000000000000000000..7b1d162bd71b4ebc7239cdb2fca43392adf76b65 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_torch_compile/chunk_2.txt @@ -0,0 +1,4 @@ +max-autotune takes longer than reduce-overhead but results in faster inference. Default mode is fastest for compilation but is not as efficient compared to reduce-overhead for inference time. In this guide, we used the default mode. You can learn more about it here. +We benchmarked torch.compile with different computer vision models, tasks, types of hardware, and batch sizes on torch version 2.0.1. +Benchmarking code +Below you can find the benchmarking code for each task. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_torch_compile/chunk_3.txt b/chunked/content_aware_chunking/_perf_torch_compile/chunk_3.txt new file mode 100644 index 0000000000000000000000000000000000000000..555dd7a14e74c1fa04bcac62c1fe84c7fb8e0b36 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_torch_compile/chunk_3.txt @@ -0,0 +1,54 @@ +We warm up the GPU before inference and take the mean time of 300 inferences, using the same image each time. 
+Image Classification with ViT +python +import torch +from PIL import Image +import requests +import numpy as np +from transformers import AutoImageProcessor, AutoModelForImageClassification +url = 'http://images.cocodataset.org/val2017/000000039769.jpg' +image = Image.open(requests.get(url, stream=True).raw) +processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224") +model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224").to("cuda") +model = torch.compile(model) +processed_input = processor(image, return_tensors='pt').to(device="cuda") +with torch.no_grad(): + _ = model(**processed_input) + +Object Detection with DETR +python +from transformers import AutoImageProcessor, AutoModelForObjectDetection +processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50") +model = AutoModelForObjectDetection.from_pretrained("facebook/detr-resnet-50").to("cuda") +model = torch.compile(model) +# DETR is image-only, so the processor takes no text prompts (unlike OWL-ViT) +inputs = processor(images=image, return_tensors="pt").to("cuda") +with torch.no_grad(): + _ = model(**inputs) + +Image Segmentation with Segformer +python +from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation +processor = SegformerImageProcessor.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512") +model = SegformerForSemanticSegmentation.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512").to("cuda") +model = torch.compile(model) +seg_inputs = processor(images=image, return_tensors="pt").to("cuda") +with torch.no_grad(): + _ = model(**seg_inputs) + +Below you can find the list of the models we benchmarked.
+Image Classification +- google/vit-base-patch16-224 +- microsoft/beit-base-patch16-224-pt22k-ft22k +- facebook/convnext-large-224 +- microsoft/resnet-50 +Image Segmentation +- nvidia/segformer-b0-finetuned-ade-512-512 +- facebook/mask2former-swin-tiny-coco-panoptic +- facebook/maskformer-swin-base-ade +- google/deeplabv3_mobilenet_v2_1.0_513 +Object Detection +- google/owlvit-base-patch32 +- facebook/detr-resnet-101 +- microsoft/conditional-detr-resnet-50 +Below you can find visualization of inference durations with and without torch.compile() and percentage improvements for each model in different hardware and batch sizes. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_torch_compile/chunk_4.txt b/chunked/content_aware_chunking/_perf_torch_compile/chunk_4.txt new file mode 100644 index 0000000000000000000000000000000000000000..dbb4429b3eb45a439fb7f52e72f928ea6ce47c44 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_torch_compile/chunk_4.txt @@ -0,0 +1 @@ +Below you can find inference durations in milliseconds for each model with and without compile(). \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_torch_compile/chunk_5.txt b/chunked/content_aware_chunking/_perf_torch_compile/chunk_5.txt new file mode 100644 index 0000000000000000000000000000000000000000..090e0156e956bca621430f93b77d9a38e0a28cf9 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_torch_compile/chunk_5.txt @@ -0,0 +1,132 @@ +Note that OwlViT results in OOM in larger batch sizes. 
+A100 (batch size: 1) +| Task/Model | torch 2.0 - no compile | torch 2.0 - compile | +|:---:|:---:|:---:| +| Image Classification/ViT | 9.325 | 7.584 | +| Image Segmentation/Segformer | 11.759 | 10.500 | +| Object Detection/OwlViT | 24.978 | 18.420 | +| Image Classification/BeiT | 11.282 | 8.448 | +| Object Detection/DETR | 34.619 | 19.040 | +| Image Classification/ConvNeXT | 10.410 | 10.208 | +| Image Classification/ResNet | 6.531 | 4.124 | +| Image Segmentation/Mask2former | 60.188 | 49.117 | +| Image Segmentation/Maskformer | 75.764 | 59.487 | +| Image Segmentation/MobileNet | 8.583 | 3.974 | +| Object Detection/Resnet-101 | 36.276 | 18.197 | +| Object Detection/Conditional-DETR | 31.219 | 17.993 | +A100 (batch size: 4) +| Task/Model | torch 2.0 - no compile | torch 2.0 - compile | +|:---:|:---:|:---:| +| Image Classification/ViT | 14.832 | 14.499 | +| Image Segmentation/Segformer | 18.838 | 16.476 | +| Image Classification/BeiT | 13.205 | 13.048 | +| Object Detection/DETR | 48.657 | 32.418| +| Image Classification/ConvNeXT | 22.940 | 21.631 | +| Image Classification/ResNet | 6.657 | 4.268 | +| Image Segmentation/Mask2former | 74.277 | 61.781 | +| Image Segmentation/Maskformer | 180.700 | 159.116 | +| Image Segmentation/MobileNet | 14.174 | 8.515 | +| Object Detection/Resnet-101 | 68.101 | 44.998 | +| Object Detection/Conditional-DETR | 56.470 | 35.552 | +A100 (batch size: 16) +| Task/Model | torch 2.0 - no compile | torch 2.0 - compile | +|:---:|:---:|:---:| +| Image Classification/ViT | 40.944 | 40.010 | +| Image Segmentation/Segformer | 37.005 | 31.144 | +| Image Classification/BeiT | 41.854 | 41.048 | +| Object Detection/DETR | 164.382 | 161.902 | +| Image Classification/ConvNeXT | 82.258 | 75.561 | +| Image Classification/ResNet | 7.018 | 5.024 | +| Image Segmentation/Mask2former | 178.945 | 154.814 | +| Image Segmentation/Maskformer | 638.570 | 579.826 | +| Image Segmentation/MobileNet | 51.693 | 30.310 | +| Object Detection/Resnet-101 | 232.887 | 155.021 
| +| Object Detection/Conditional-DETR | 180.491 | 124.032 | +V100 (batch size: 1) +| Task/Model | torch 2.0 - no compile | torch 2.0 - compile | +|:---:|:---:|:---:| +| Image Classification/ViT | 10.495 | 6.00 | +| Image Segmentation/Segformer | 13.321 | 5.862 | +| Object Detection/OwlViT | 25.769 | 22.395 | +| Image Classification/BeiT | 11.347 | 7.234 | +| Object Detection/DETR | 33.951 | 19.388 | +| Image Classification/ConvNeXT | 11.623 | 10.412 | +| Image Classification/ResNet | 6.484 | 3.820 | +| Image Segmentation/Mask2former | 64.640 | 49.873 | +| Image Segmentation/Maskformer | 95.532 | 72.207 | +| Image Segmentation/MobileNet | 9.217 | 4.753 | +| Object Detection/Resnet-101 | 52.818 | 28.367 | +| Object Detection/Conditional-DETR | 39.512 | 20.816 | +V100 (batch size: 4) +| Task/Model | torch 2.0 - no compile | torch 2.0 - compile | +|:---:|:---:|:---:| +| Image Classification/ViT | 15.181 | 14.501 | +| Image Segmentation/Segformer | 16.787 | 16.188 | +| Image Classification/BeiT | 15.171 | 14.753 | +| Object Detection/DETR | 88.529 | 64.195 | +| Image Classification/ConvNeXT | 29.574 | 27.085 | +| Image Classification/ResNet | 6.109 | 4.731 | +| Image Segmentation/Mask2former | 90.402 | 76.926 | +| Image Segmentation/Maskformer | 234.261 | 205.456 | +| Image Segmentation/MobileNet | 24.623 | 14.816 | +| Object Detection/Resnet-101 | 134.672 | 101.304 | +| Object Detection/Conditional-DETR | 97.464 | 69.739 | +V100 (batch size: 16) +| Task/Model | torch 2.0 - no compile | torch 2.0 - compile | +|:---:|:---:|:---:| +| Image Classification/ViT | 52.209 | 51.633 | +| Image Segmentation/Segformer | 61.013 | 55.499 | +| Image Classification/BeiT | 53.938 | 53.581 | +| Object Detection/DETR | OOM | OOM | +| Image Classification/ConvNeXT | 109.682 | 100.771 | +| Image Classification/ResNet | 14.857 | 12.089 | +| Image Segmentation/Mask2former | 249.605 | 222.801 | +| Image Segmentation/Maskformer | 831.142 | 743.645 | +| Image Segmentation/MobileNet | 93.129 | 
55.365 | +| Object Detection/Resnet-101 | 482.425 | 361.843 | +| Object Detection/Conditional-DETR | 344.661 | 255.298 | +T4 (batch size: 1) +| Task/Model | torch 2.0 - no compile | torch 2.0 - compile | +|:---:|:---:|:---:| +| Image Classification/ViT | 16.520 | 15.786 | +| Image Segmentation/Segformer | 16.116 | 14.205 | +| Object Detection/OwlViT | 53.634 | 51.105 | +| Image Classification/BeiT | 16.464 | 15.710 | +| Object Detection/DETR | 73.100 | 53.99 | +| Image Classification/ConvNeXT | 32.932 | 30.845 | +| Image Classification/ResNet | 6.031 | 4.321 | +| Image Segmentation/Mask2former | 79.192 | 66.815 | +| Image Segmentation/Maskformer | 200.026 | 188.268 | +| Image Segmentation/MobileNet | 18.908 | 11.997 | +| Object Detection/Resnet-101 | 106.622 | 82.566 | +| Object Detection/Conditional-DETR | 77.594 | 56.984 | +T4 (batch size: 4) +| Task/Model | torch 2.0 - no compile | torch 2.0 - compile | +|:---:|:---:|:---:| +| Image Classification/ViT | 43.653 | 43.626 | +| Image Segmentation/Segformer | 45.327 | 42.445 | +| Image Classification/BeiT | 52.007 | 51.354 | +| Object Detection/DETR | 277.850 | 268.003 | +| Image Classification/ConvNeXT | 119.259 | 105.580 | +| Image Classification/ResNet | 13.039 | 11.388 | +| Image Segmentation/Mask2former | 201.540 | 184.670 | +| Image Segmentation/Maskformer | 764.052 | 711.280 | +| Image Segmentation/MobileNet | 74.289 | 48.677 | +| Object Detection/Resnet-101 | 421.859 | 357.614 | +| Object Detection/Conditional-DETR | 289.002 | 226.945 | +T4 (batch size: 16) +| Task/Model | torch 2.0 - no compile | torch 2.0 - compile | +|:---:|:---:|:---:| +| Image Classification/ViT | 163.914 | 160.907 | +| Image Segmentation/Segformer | 192.412 | 163.620 | +| Image Classification/BeiT | 188.978 | 187.976 | +| Object Detection/DETR | OOM | OOM | +| Image Classification/ConvNeXT | 422.886 | 388.078 | +| Image Classification/ResNet | 44.114 | 37.604 | +| Image Segmentation/Mask2former | 756.337 | 695.291 | +| Image 
Segmentation/Maskformer | 2842.940 | 2656.88 | +| Image Segmentation/MobileNet | 299.003 | 201.942 | +| Object Detection/Resnet-101 | 1619.505 | 1262.758 | +| Object Detection/Conditional-DETR | 1137.513 | 897.390| +PyTorch Nightly +We also benchmarked on PyTorch nightly (2.1.0dev, find the wheel here) and observed improvement in latency both for uncompiled and compiled models. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_torch_compile/chunk_6.txt b/chunked/content_aware_chunking/_perf_torch_compile/chunk_6.txt new file mode 100644 index 0000000000000000000000000000000000000000..6bc1993ab4918cf0558ebeaa7835ef013ca5df4f --- /dev/null +++ b/chunked/content_aware_chunking/_perf_torch_compile/chunk_6.txt @@ -0,0 +1,51 @@ +A100 +| Task/Model | Batch Size | torch 2.0 - no compile | torch 2.0 - compile | +|:---:|:---:|:---:|:---:| +| Image Classification/BeiT | Unbatched | 12.462 | 6.954 | +| Image Classification/BeiT | 4 | 14.109 | 12.851 | +| Image Classification/BeiT | 16 | 42.179 | 42.147 | +| Object Detection/DETR | Unbatched | 30.484 | 15.221 | +| Object Detection/DETR | 4 | 46.816 | 30.942 | +| Object Detection/DETR | 16 | 163.749 | 163.706 | +T4 +| Task/Model | Batch Size | torch 2.0 - no compile | torch 2.0 - compile | +|:---:|:---:|:---:|:---:| +| Image Classification/BeiT | Unbatched | 14.408 | 14.052 | +| Image Classification/BeiT | 4 | 47.381 | 46.604 | +| Image Classification/BeiT | 16 | 42.179 | 42.147 | +| Object Detection/DETR | Unbatched | 68.382 | 53.481 | +| Object Detection/DETR | 4 | 269.615 | 204.785 | +| Object Detection/DETR | 16 | OOM | OOM | +V100 +| Task/Model | Batch Size | torch 2.0 - no compile | torch 2.0 - compile | +|:---:|:---:|:---:|:---:| +| Image Classification/BeiT | Unbatched | 13.477 | 7.926 | +| Image Classification/BeiT | 4 | 15.103 | 14.378 | +| Image Classification/BeiT | 16 | 52.517 | 51.691 | +| Object Detection/DETR | Unbatched | 28.706 | 19.077 | +| Object Detection/DETR | 4 | 88.402 | 
62.949| +| Object Detection/DETR | 16 | OOM | OOM | +Reduce Overhead +We benchmarked reduce-overhead compilation mode for A100 and T4 in Nightly. +A100 +| Task/Model | Batch Size | torch 2.0 - no compile | torch 2.0 - compile | +|:---:|:---:|:---:|:---:| +| Image Classification/ConvNeXT | Unbatched | 11.758 | 7.335 | +| Image Classification/ConvNeXT | 4 | 23.171 | 21.490 | +| Image Classification/ResNet | Unbatched | 7.435 | 3.801 | +| Image Classification/ResNet | 4 | 7.261 | 2.187 | +| Object Detection/Conditional-DETR | Unbatched | 32.823 | 11.627 | +| Object Detection/Conditional-DETR | 4 | 50.622 | 33.831 | +| Image Segmentation/MobileNet | Unbatched | 9.869 | 4.244 | +| Image Segmentation/MobileNet | 4 | 14.385 | 7.946 | +T4 +| Task/Model | Batch Size | torch 2.0 - no compile | torch 2.0 - compile | +|:---:|:---:|:---:|:---:| +| Image Classification/ConvNeXT | Unbatched | 32.137 | 31.84 | +| Image Classification/ConvNeXT | 4 | 120.944 | 110.209 | +| Image Classification/ResNet | Unbatched | 9.761 | 7.698 | +| Image Classification/ResNet | 4 | 15.215 | 13.871 | +| Object Detection/Conditional-DETR | Unbatched | 72.150 | 57.660 | +| Object Detection/Conditional-DETR | 4 | 301.494 | 247.543 | +| Image Segmentation/MobileNet | Unbatched | 22.266 | 19.339 | +| Image Segmentation/MobileNet | 4 | 78.311 | 50.983 |. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_cpu/chunk_0.txt b/chunked/content_aware_chunking/_perf_train_cpu/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..594b2e672c9453a954d53bc1b490b831ee4332e8 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_cpu/chunk_0.txt @@ -0,0 +1,4 @@ +Efficient Training on CPU +This guide focuses on training large models efficiently on CPU. 
+Mixed precision with IPEX +Mixed precision uses single (fp32) and half-precision (bf16/fp16) data types in a model to accelerate training or inference while still preserving much of the single-precision accuracy. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_cpu/chunk_1.txt b/chunked/content_aware_chunking/_perf_train_cpu/chunk_1.txt new file mode 100644 index 0000000000000000000000000000000000000000..50a0bbf1c349daec5850042ab396602152befe36 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_cpu/chunk_1.txt @@ -0,0 +1,2 @@ +Modern CPUs such as 3rd and 4th Gen Intel® Xeon® Scalable processors natively support bf16, so you should get more performance out of the box by enabling mixed precision training with bf16. +To further maximize training performance, you can use Intel® Extension for PyTorch (IPEX), which is a library built on PyTorch and adds additional CPU instruction level architecture (ISA) level support such as Intel® Advanced Vector Extensions 512 Vector Neural Network Instructions (Intel® AVX512-VNNI), and Intel® Advanced Matrix Extensions (Intel® AMX) for an extra performance boost on Intel CPUs. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_cpu/chunk_2.txt b/chunked/content_aware_chunking/_perf_train_cpu/chunk_2.txt new file mode 100644 index 0000000000000000000000000000000000000000..3dc18b16fd8af619a0173ad8a431fcfff72e9a36 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_cpu/chunk_2.txt @@ -0,0 +1,2 @@ +However, CPUs with only AVX2 (e.g., AMD or older Intel CPUs) are not guaranteed to have better performance under IPEX. +Auto Mixed Precision (AMP) for CPU backends has been enabled since PyTorch 1.10. AMP support for bf16 on CPUs and bf16 operator optimization is also supported in IPEX and partially upstreamed to the main PyTorch branch. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_cpu/chunk_3.txt b/chunked/content_aware_chunking/_perf_train_cpu/chunk_3.txt new file mode 100644 index 0000000000000000000000000000000000000000..cce9f64841554700bd4175d5576d5f8d2dfc8695 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_cpu/chunk_3.txt @@ -0,0 +1,46 @@ +You can get better performance and user experience with IPEX AMP. +Check more detailed information for Auto Mixed Precision. +IPEX installation: +IPEX releases follow PyTorch versions. To install via pip: +| PyTorch Version | IPEX version | +| :---------------: | :----------: | +| 2.1.x | 2.1.100+cpu | +| 2.0.x | 2.0.100+cpu | +| 1.13 | 1.13.0+cpu | +| 1.12 | 1.12.300+cpu | +Please run pip list | grep torch to get your pytorch_version, so you can get the IPEX version_name. + +pip install intel_extension_for_pytorch== -f https://developer.intel.com/ipex-whl-stable-cpu +You can check the latest versions in ipex-whl-stable-cpu if needed. +Check more approaches for IPEX installation. +Usage in Trainer +To enable auto mixed precision with IPEX in Trainer, users should add use_ipex, bf16 and use_cpu to the training command arguments. +Take the Transformers question-answering example as a use case: + +Training with IPEX using BF16 auto mixed precision on CPU: + + python run_qa.py \ +--model_name_or_path google-bert/bert-base-uncased \ +--dataset_name squad \ +--do_train \ +--do_eval \ +--per_device_train_batch_size 12 \ +--learning_rate 3e-5 \ +--num_train_epochs 2 \ +--max_seq_length 384 \ +--doc_stride 128 \ +--output_dir /tmp/debug_squad/ \ +--use_ipex \ +--bf16 \ +--use_cpu +If you want to enable use_ipex and bf16 in your script, add these parameters to TrainingArguments like this: +diff +training_args = TrainingArguments( + output_dir=args.output_path, ++ bf16=True, ++ use_ipex=True, ++ use_cpu=True, + **kwargs +) +Practice example +Blog: Accelerating PyTorch Transformers with Intel Sapphire Rapids.
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_0.txt b/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..7197d8a601eac0fca10e24d62cc3e74d8396a8c8 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_0.txt @@ -0,0 +1,5 @@ +Efficient Training on Multiple CPUs +When training on a single CPU is too slow, we can use multiple CPUs. This guide focuses on PyTorch-based DDP enabling +distributed CPU training efficiently on bare metal and Kubernetes. +Intel® oneCCL Bindings for PyTorch +Intel® oneCCL (collective communications library) is a library for efficient distributed deep learning training that implements collectives such as allreduce, allgather, and alltoall. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_1.txt b/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_1.txt new file mode 100644 index 0000000000000000000000000000000000000000..e363365fb914ed378665e745decfe2a136e01dc4 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_1.txt @@ -0,0 +1,24 @@ +For more information on oneCCL, please refer to the oneCCL documentation and oneCCL specification. +The oneccl_bindings_for_pytorch module (torch_ccl before version 1.12) implements the PyTorch C10D ProcessGroup API and can be dynamically loaded as an external ProcessGroup. It currently works only on Linux. +Check more detailed information for oneccl_bind_pt.
+Intel® oneCCL Bindings for PyTorch installation +Wheel files are available for the following Python versions: +| Extension Version | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10 | +| :---------------: | :--------: | :--------: | :--------: | :--------: | :---------: | +| 2.1.0 | | √ | √ | √ | √ | +| 2.0.0 | | √ | √ | √ | √ | +| 1.13.0 | | √ | √ | √ | √ | +| 1.12.100 | | √ | √ | √ | √ | +| 1.12.0 | | √ | √ | √ | √ | +Please run pip list | grep torch to get your pytorch_version. + +pip install oneccl_bind_pt=={pytorch_version} -f https://developer.intel.com/ipex-whl-stable-cpu +where {pytorch_version} should be your PyTorch version, for instance 2.1.0. +Check more approaches for oneccl_bind_pt installation. +Versions of oneCCL and PyTorch must match. + +oneccl_bindings_for_pytorch 1.12.0 prebuilt wheel does not work with PyTorch 1.12.1 (it is for PyTorch 1.12.0) +PyTorch 1.12.1 should work with oneccl_bindings_for_pytorch 1.12.100 + +Intel® MPI library +Use this standards-based MPI implementation to deliver flexible, efficient, scalable cluster messaging on Intel® architecture. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_10.txt b/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_10.txt new file mode 100644 index 0000000000000000000000000000000000000000..c2126e17cd6dd03a7fb69aa29f4b177b3af737dd --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_10.txt @@ -0,0 +1,3 @@ +The amount of CPU and memory limits/requests defined in the yaml should be less than the amount of +available CPU/memory capacity on a single machine. It is usually a good idea to not use the entire machine's capacity in +order to leave some resources for the kubelet and OS. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_11.txt b/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_11.txt new file mode 100644 index 0000000000000000000000000000000000000000..8b7164319c0888bb32a985585a6347d6c33d9b67 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_11.txt @@ -0,0 +1,11 @@ +In order to get "guaranteed" +quality of service for the worker pods, +set the same CPU and memory amounts for both the resource limits and requests. + +Deploy +After the PyTorchJob spec has been updated with values appropriate for your cluster and training job, it can be deployed +to the cluster using: + +kubectl create -f pytorchjob.yaml +The kubectl get pods -n kubeflow command can then be used to list the pods in the kubeflow namespace. You should see +the worker pods for the PyTorchJob that was just deployed. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_12.txt b/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_12.txt new file mode 100644 index 0000000000000000000000000000000000000000..9d12c1f980cd6a143aba73add850b928cd813a7c --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_12.txt @@ -0,0 +1,10 @@ +At first, they will probably have a status of "Pending" as +the containers get pulled and created, then the status should change to "Running". +NAME READY STATUS RESTARTS AGE + +transformers-pytorchjob-worker-0 1/1 Running 0 7m37s +transformers-pytorchjob-worker-1 1/1 Running 0 7m37s +transformers-pytorchjob-worker-2 1/1 Running 0 7m37s +transformers-pytorchjob-worker-3 1/1 Running 0 7m37s + +The logs for worker can be viewed using kubectl logs -n kubeflow . 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_13.txt b/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_13.txt new file mode 100644 index 0000000000000000000000000000000000000000..e8e5e199f97f8c43d588d778ef4f13faa947a9c3 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_13.txt @@ -0,0 +1,8 @@ +Add -f to stream the logs, for example: + +kubectl logs -n kubeflow transformers-pytorchjob-worker-0 -f +After the training job completes, the trained model can be copied from the PVC or storage location. When you are done +with the job, the PyTorchJob resource can be deleted from the cluster using kubectl delete -f pytorchjob.yaml. +Summary +This guide covered running distributed PyTorch training jobs using multiple CPUs on bare metal and on a Kubernetes +cluster. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_14.txt b/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_14.txt new file mode 100644 index 0000000000000000000000000000000000000000..34b96154b84e46a89c2a3b60d079240a06905db8 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_14.txt @@ -0,0 +1,2 @@ +Both cases utilize Intel Extension for PyTorch and Intel oneCCL Bindings for PyTorch for optimal training +performance, and can be used as a template to run your own workload on multiple nodes. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_2.txt b/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_2.txt new file mode 100644 index 0000000000000000000000000000000000000000..8c79700553d47112878dd987602a25ea9625f101 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_2.txt @@ -0,0 +1,2 @@ +This component is part of the Intel® oneAPI HPC Toolkit. +oneccl_bindings_for_pytorch is installed along with the MPI tool set.
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_3.txt b/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_3.txt new file mode 100644 index 0000000000000000000000000000000000000000..2bc2e0af22b65079bfbacfb2c5c9130524a4afec --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_3.txt @@ -0,0 +1,16 @@ +Need to source the environment before using it. +for Intel® oneCCL >= 1.12.0 + +oneccl_bindings_for_pytorch_path=$(python -c "from oneccl_bindings_for_pytorch import cwd; print(cwd)") +source $oneccl_bindings_for_pytorch_path/env/setvars.sh +for Intel® oneCCL whose version < 1.12.0 + +torch_ccl_path=$(python -c "import torch; import torch_ccl; import os; print(os.path.abspath(os.path.dirname(torch_ccl.__file__)))") +source $torch_ccl_path/env/setvars.sh +Intel® Extension for PyTorch installation +Intel Extension for PyTorch (IPEX) provides performance optimizations for CPU training with both Float32 and BFloat16 (refer to the single CPU section to learn more). +The following "Usage in Trainer" takes mpirun in Intel® MPI library as an example. +Usage in Trainer +To enable multi CPU distributed training in the Trainer with the ccl backend, users should add --ddp_backend ccl in the command arguments. +Let's see an example with the question-answering example +The following command enables training with 2 processes on one Xeon node, with one process running per one socket. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_4.txt b/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_4.txt new file mode 100644 index 0000000000000000000000000000000000000000..b2f28e34254de88080ca9212d445eb0c8a32473e --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_4.txt @@ -0,0 +1,20 @@ +The variables OMP_NUM_THREADS/CCL_WORKER_COUNT can be tuned for optimal performance. 
+shell script + export CCL_WORKER_COUNT=1 + export MASTER_ADDR=127.0.0.1 + mpirun -n 2 -genv OMP_NUM_THREADS=23 \ + python3 run_qa.py \ + --model_name_or_path google-bert/bert-large-uncased \ + --dataset_name squad \ + --do_train \ + --do_eval \ + --per_device_train_batch_size 12 \ + --learning_rate 3e-5 \ + --num_train_epochs 2 \ + --max_seq_length 384 \ + --doc_stride 128 \ + --output_dir /tmp/debug_squad/ \ + --no_cuda \ + --ddp_backend ccl \ + --use_ipex +The following command enables training with a total of four processes on two Xeons (node0 and node1, taking node0 as the main process), ppn (processes per node) is set to 2, with one process running per one socket. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_5.txt b/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_5.txt new file mode 100644 index 0000000000000000000000000000000000000000..3c88f7caa6a5c088fb73e84c1bd2419f1e247e99 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_5.txt @@ -0,0 +1,36 @@ +The variables OMP_NUM_THREADS/CCL_WORKER_COUNT can be tuned for optimal performance. +In node0, you need to create a configuration file which contains the IP addresses of each node (for example hostfile) and pass that configuration file path as an argument. 
+shell script + cat hostfile + xxx.xxx.xxx.xxx #node0 ip + xxx.xxx.xxx.xxx #node1 ip +Now, run the following command in node0 and 4DDP will be enabled in node0 and node1 with BF16 auto mixed precision: +shell script + export CCL_WORKER_COUNT=1 + export MASTER_ADDR=xxx.xxx.xxx.xxx #node0 ip + mpirun -f hostfile -n 4 -ppn 2 \ + -genv OMP_NUM_THREADS=23 \ + python3 run_qa.py \ + --model_name_or_path google-bert/bert-large-uncased \ + --dataset_name squad \ + --do_train \ + --do_eval \ + --per_device_train_batch_size 12 \ + --learning_rate 3e-5 \ + --num_train_epochs 2 \ + --max_seq_length 384 \ + --doc_stride 128 \ + --output_dir /tmp/debug_squad/ \ + --no_cuda \ + --ddp_backend ccl \ + --use_ipex \ + --bf16 +Usage with Kubernetes +The same distributed training job from the previous section can be deployed to a Kubernetes cluster using the +Kubeflow PyTorchJob training operator. +Setup +This example assumes that you have: +* Access to a Kubernetes cluster with Kubeflow installed +* kubectl installed and configured to access the Kubernetes cluster +* A Persistent Volume Claim (PVC) that can be used + to store datasets and model files. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_6.txt b/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_6.txt new file mode 100644 index 0000000000000000000000000000000000000000..1f64df536903869e1470668b5d4065383ac92f4e --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_6.txt @@ -0,0 +1,3 @@ +There are multiple options for setting up the PVC including using an NFS + storage class or a cloud storage bucket. +* A Docker container that includes your model training script and all the dependencies needed to run the script. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_7.txt b/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_7.txt new file mode 100644 index 0000000000000000000000000000000000000000..4668bd78bc05a639b9b38f6749bdca5ca0560595 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_7.txt @@ -0,0 +1,18 @@ +For + distributed CPU training jobs, this typically includes PyTorch, Transformers, Intel Extension for PyTorch, Intel + oneCCL Bindings for PyTorch, and OpenSSH to communicate between the containers. +The snippet below is an example of a Dockerfile that uses a base image that supports distributed CPU training and then +extracts a Transformers release to the /workspace directory, so that the example scripts are included in the image: +```dockerfile +FROM intel/ai-workflows:torch-2.0.1-huggingface-multinode-py3.9 +WORKDIR /workspace +# Download and extract the transformers code +ARG HF_TRANSFORMERS_VER="4.35.2" +RUN mkdir transformers && \ + curl -sSL --retry 5 https://github.com/huggingface/transformers/archive/refs/tags/v${HF_TRANSFORMERS_VER}.tar.gz | tar -C transformers --strip-components=1 -xzf - + +The image needs to be built and copied to the cluster's nodes or pushed to a container registry prior to deploying the +PyTorchJob to the cluster. +PyTorchJob Specification File +The Kubeflow PyTorchJob is used to run the distributed +training job on the cluster.
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_8.txt b/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_8.txt new file mode 100644 index 0000000000000000000000000000000000000000..7d6a5bd30d2255773083f7dfc87acac505afbb92 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_8.txt @@ -0,0 +1,9 @@ +The yaml file for the PyTorchJob defines parameters such as: + * The name of the PyTorchJob + * The number of replicas (workers) + * The Python script and its parameters that will be used to run the training job + * The types of resources (node selector, memory, and CPU) needed for each worker + * The image/tag for the Docker container to use + * Environment variables + * A volume mount for the PVC +The volume mount defines a path where the PVC will be mounted in the container for each worker pod. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_9.txt b/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_9.txt new file mode 100644 index 0000000000000000000000000000000000000000..5c1e8c5d7cfe0e826647248059a3027f8f87ce0e --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_cpu_many/chunk_9.txt @@ -0,0 +1,92 @@ +This location can be +used for the dataset, checkpoint files, and the saved model after training completes. +The snippet below is an example of a yaml file for a PyTorchJob with 4 workers running the +question-answering example.
+yaml +apiVersion: "kubeflow.org/v1" +kind: PyTorchJob +metadata: + name: transformers-pytorchjob + namespace: kubeflow +spec: + elasticPolicy: + rdzvBackend: c10d + minReplicas: 1 + maxReplicas: 4 + maxRestarts: 10 + pytorchReplicaSpecs: + Worker: + replicas: 4 # The number of worker pods + restartPolicy: OnFailure + template: + spec: + containers: + - name: pytorch + image: : # Specify the docker image to use for the worker pods + imagePullPolicy: IfNotPresent + command: + - torchrun + - /workspace/transformers/examples/pytorch/question-answering/run_qa.py + - --model_name_or_path + - "google-bert/bert-large-uncased" + - --dataset_name + - "squad" + - --do_train + - --do_eval + - --per_device_train_batch_size + - "12" + - --learning_rate + - "3e-5" + - --num_train_epochs + - "2" + - --max_seq_length + - "384" + - --doc_stride + - "128" + - --output_dir + - "/tmp/pvc-mount/output" + - --no_cuda + - --ddp_backend + - "ccl" + - --use_ipex + - --bf16 # Specify --bf16 if your hardware supports bfloat16 + env: + - name: LD_PRELOAD + value: "/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4.5.9:/usr/local/lib/libiomp5.so" + - name: TRANSFORMERS_CACHE + value: "/tmp/pvc-mount/transformers_cache" + - name: HF_DATASETS_CACHE + value: "/tmp/pvc-mount/hf_datasets_cache" + - name: LOGLEVEL + value: "INFO" + - name: CCL_WORKER_COUNT + value: "1" + - name: OMP_NUM_THREADS # Can be tuned for optimal performance + + resources: + limits: + cpu: 200 # Update the CPU and memory limit values based on your nodes + memory: 128Gi + requests: + cpu: 200 # Update the CPU and memory request values based on your nodes + memory: 128Gi + volumeMounts: + - name: pvc-volume + mountPath: /tmp/pvc-mount + - mountPath: /dev/shm + name: dshm + restartPolicy: Never + nodeSelector: # Optionally use the node selector to specify what types of nodes to use for the workers + node-type: spr + volumes: + - name: pvc-volume + persistentVolumeClaim: + claimName: transformers-pvc + - name: dshm + emptyDir: + medium: 
Memory +To run this example, update the yaml based on your training script and the nodes in your cluster. + +The CPU resource limits/requests in the yaml are defined in cpu units +where 1 CPU unit is equivalent to 1 physical CPU core or 1 virtual core (depending on whether the node is a physical +host or a VM). \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_0.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..8d7b94845d3068b0ed30b365476a49d447443d2f --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_0.txt @@ -0,0 +1,5 @@ +Efficient Training on Multiple GPUs +If training a model on a single GPU is too slow or if the model's weights do not fit in a single GPU's memory, transitioning +to a multi-GPU setup may be a viable option. Prior to making this transition, thoroughly explore all the strategies covered +in the Methods and tools for efficient training on a single GPU as they are universally applicable +to model training on any number of GPUs. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_1.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_1.txt new file mode 100644 index 0000000000000000000000000000000000000000..ac92c03474b9efb9f705daa27fd1122c084c50f6 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_1.txt @@ -0,0 +1,5 @@ +Once you have employed those strategies and found them insufficient for your +case on a single GPU, consider moving to multiple GPUs. +Transitioning from a single GPU to multiple GPUs requires the introduction of some form of parallelism, as the workload +must be distributed across the resources. Multiple techniques can be employed to achieve parallelism, such as data +parallelism, tensor parallelism, and pipeline parallelism. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_10.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_10.txt new file mode 100644 index 0000000000000000000000000000000000000000..ac94d8a01294ff49df94fc804799cbf86e66b61d --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_10.txt @@ -0,0 +1,4 @@ +DDP performs only a single communication per batch - sending gradients, while DP performs five different data exchanges per batch. +DDP copies data using torch.distributed, while DP copies data within +the process via Python threads (which introduces limitations associated with GIL). As a result, DistributedDataParallel (DDP) is generally faster than DataParallel (DP) unless you have slow GPU card inter-connectivity. +2. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_11.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_11.txt new file mode 100644 index 0000000000000000000000000000000000000000..ea8e036ec83d338666d27be3aaff05fdee83f0c5 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_11.txt @@ -0,0 +1,5 @@ +Under DP, GPU 0 performs significantly more work than other GPUs, resulting in GPU under-utilization. +3. DDP supports distributed training across multiple machines, whereas DP does not. +This is not an exhaustive list of differences between DP and DDP, however, other nuances are out of scope of this guide. +You can get a deeper understanding of these methods by reading this article. +Let's illustrate the differences between DP and DDP with an experiment. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_12.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_12.txt new file mode 100644 index 0000000000000000000000000000000000000000..f8b7ab3d07cd03aaef23c1e6761a7c84636343f6 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_12.txt @@ -0,0 +1,7 @@ +We'll benchmark the differences between DP and +DDP with an added context of NVLink presence: + +Hardware: 2x TITAN RTX 24GB each + NVlink with 2 NVLinks (NV2 in nvidia-smi topo -m). +Software: pytorch-1.8-to-be + cuda-11.0 / transformers==4.3.0.dev0. + +To disable the NVLink feature on one of the benchmarks, we use NCCL_P2P_DISABLE=1. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_13.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_13.txt new file mode 100644 index 0000000000000000000000000000000000000000..413f3b5b65c35ab9461aa27a3961075a48a1b0f0 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_13.txt @@ -0,0 +1,38 @@ +Here is the benchmarking code and outputs: +DP +```bash +rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 \ +python examples/pytorch/language-modeling/run_clm.py \ +--model_name_or_path openai-community/gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \ +--do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 +{'train_runtime': 110.5948, 'train_samples_per_second': 1.808, 'epoch': 0.69} + +DDP w/ NVlink +```bash +rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 \ +torchrun --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \ +--model_name_or_path openai-community/gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \ +--do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 +{'train_runtime': 101.9003, 'train_samples_per_second': 1.963, 'epoch': 0.69} + +DDP w/o NVlink 
+```bash +rm -r /tmp/test-clm; NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1 \ +torchrun --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \ +--model_name_or_path openai-community/gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \ +--do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 +{'train_runtime': 131.4367, 'train_samples_per_second': 1.522, 'epoch': 0.69} + +Here are the same benchmarking results gathered in a table for convenience: +| Type | NVlink | Time | +| :----- | ----- | ---: | +| 2:DP | Y | 110s | +| 2:DDP | Y | 101s | +| 2:DDP | N | 131s | +As you can see, in this case DP is ~10% slower than DDP with NVlink, but ~15% faster than DDP without NVlink. +The real difference will depend on how much data each GPU needs to sync with the others - the more there is to sync, +the more a slow link will impede the overall runtime. +ZeRO Data Parallelism +ZeRO-powered data parallelism (ZeRO-DP) is illustrated in the following diagram from this blog post. + +While it may appear complex, it is a very similar concept to DataParallel (DP). \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_14.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_14.txt new file mode 100644 index 0000000000000000000000000000000000000000..1cb7ce01de158ae5905c5bd0b8175ec7c8eade59 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_14.txt @@ -0,0 +1,5 @@ +The difference is that instead of +replicating the full model parameters, gradients and optimizer states, each GPU stores only a slice of it. Then, at +run-time when the full layer parameters are needed just for the given layer, all GPUs synchronize to give each other +parts that they miss. +To illustrate this idea, consider a simple model with 3 layers (La, Lb, and Lc), where each layer has 3 parameters. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_15.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_15.txt new file mode 100644 index 0000000000000000000000000000000000000000..cc78f23f3ea44df67dc16e6f4de50a485077d764 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_15.txt @@ -0,0 +1,23 @@ +Layer La, for example, has weights a0, a1 and a2: +La | Lb | Lc +---|----|--- +a0 | b0 | c0 +a1 | b1 | c1 +a2 | b2 | c2 +If we have 3 GPUs, ZeRO-DP splits the model onto 3 GPUs like so: + +GPU0: +La | Lb | Lc +---|----|--- +a0 | b0 | c0 +GPU1: +La | Lb | Lc +---|----|--- +a1 | b1 | c1 +GPU2: +La | Lb | Lc +---|----|--- +a2 | b2 | c2 + +In a way, this is the same horizontal slicing as tensor parallelism, as opposed to Vertical +slicing, where one puts whole layer-groups on different GPUs. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_16.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_16.txt new file mode 100644 index 0000000000000000000000000000000000000000..4b60278964a049ef37747c4ef29367af8b1eaeb8 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_16.txt @@ -0,0 +1,8 @@ +Now let's see how this works: +Each of these GPUs will get the usual mini-batch as it works in DP: +x0 => GPU0 +x1 => GPU1 +x2 => GPU2 +The inputs are passed without modifications as if they would be processed by the original model. +First, the inputs get to the layer La. What happens at this point? +On GPU0: the x0 mini-batch requires the a0, a1, a2 parameters to do its forward path through the layer, but the GPU0 has only a0. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_17.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_17.txt new file mode 100644 index 0000000000000000000000000000000000000000..b560fc525792feffc4b04ea1d7ad2ce9fc0d2817 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_17.txt @@ -0,0 +1,3 @@ +It will get a1 from GPU1 and a2 from GPU2, bringing all the pieces of the model together. +In parallel, GPU1 gets another mini-batch - x1. GPU1 has the a1 parameter, but needs a0 and a2, so it gets those from GPU0 and GPU2. +Same happens to GPU2 that gets the mini-batch x2. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_18.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_18.txt new file mode 100644 index 0000000000000000000000000000000000000000..210ba8a7e361d5f7391b5351942629bf15ad0c14 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_18.txt @@ -0,0 +1,3 @@ +It gets a0 and a1 from GPU0 and GPU1. +This way each of the 3 GPUs gets the full tensors reconstructed and makes a forward pass with its own mini-batch. +As soon as the calculation is done, the data that is no longer needed gets dropped - it's only used during the calculation. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_19.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_19.txt new file mode 100644 index 0000000000000000000000000000000000000000..2cca0060c7a20016e8cd3460e7aa637fb7a88420 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_19.txt @@ -0,0 +1,6 @@ +The reconstruction is done efficiently via a pre-fetch. +Then the whole process is repeated for layer Lb, then Lc forward-wise, and then backward Lc -> Lb -> La. 
+ +This mechanism is similar to an efficient group backpacking strategy: person A carries the tent, person B carries the stove, +and person C carries the axe. Each night they all share what they have with others and get from others what they don't have, +and in the morning they pack up their allocated type of gear and continue on their way. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_2.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_2.txt new file mode 100644 index 0000000000000000000000000000000000000000..5f4183366a839dd06ef7ada38ecbdb40821bc26b --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_2.txt @@ -0,0 +1,5 @@ +It's important to note that there isn't a one-size-fits-all +solution, and the optimal settings depend on the specific hardware configuration you are using. +This guide offers an in-depth overview of individual types of parallelism, as well as guidance on ways to combine +techniques and choosing an appropriate approach. For step-by-step tutorials on distributed training, please refer to +the 🤗 Accelerate documentation. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_20.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_20.txt new file mode 100644 index 0000000000000000000000000000000000000000..cb8de17e849ae2527e13baaf3dfff2ceebbff387 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_20.txt @@ -0,0 +1,7 @@ +This is what ZeRO DP/Sharded DDP is. +Compare this strategy to the simple one where each person has to carry their own tent, stove and axe (similar to +DataParallel (DP and DDP) in PyTorch), which would be far more inefficient. + +While reading the literature on this topic you may encounter the following synonyms: Sharded, Partitioned. 
+If you pay close attention to the way ZeRO partitions the model's weights, it looks very similar to tensor parallelism +which will be discussed later. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_21.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_21.txt new file mode 100644 index 0000000000000000000000000000000000000000..4df417ecf4d6edbcf736f2dffab474da533ed144 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_21.txt @@ -0,0 +1,10 @@ +This is because it partitions/shards each layer's weights, unlike vertical model parallelism +which is discussed next. +Implementations: + +DeepSpeed ZeRO-DP stages 1+2+3 +Accelerate integration +transformers integration + +From Naive Model Parallelism to Pipeline Parallelism +To explain Pipeline parallelism, we'll first look into Naive Model Parallelism (MP), also known as Vertical MP. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_22.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_22.txt new file mode 100644 index 0000000000000000000000000000000000000000..ce558799d0640d4c9ec4c27f13deb3107c00aaa6 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_22.txt @@ -0,0 +1,4 @@ +This approach +involves distributing groups of model layers across multiple GPUs by assigning specific layers to specific GPUs with .to(). +As data flows through these layers, it is moved to the same GPU as the layer, while the other layers remain untouched. +We refer to this Model parallelism as "Vertical" because of how models are typically visualized.
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_23.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_23.txt new file mode 100644 index 0000000000000000000000000000000000000000..622aab1511f1c6e69f73903662f9eda9648b46d3 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_23.txt @@ -0,0 +1,18 @@ +For example, the +following diagram shows an 8-layer model split vertically into two slices, placing layers 0-3 onto +GPU0 and 4-7 to GPU1: + +| Layer | | +| 0 | | +| 1 | GPU0 | +| 2 | | +| 3 | | +================ +| Layer | | +| 4 | | +| 5 | GPU1 | +| 6 | | +| 7 | | +================ + +In this example, when data moves from layer 0 to 3, it's no different from regular forward pass. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_24.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_24.txt new file mode 100644 index 0000000000000000000000000000000000000000..1461f0634cf932b7935b42a42122faf57100fe3d --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_24.txt @@ -0,0 +1,5 @@ +However, passing data +from layer 3 to 4 requires moving it from GPU0 to GPU1, introducing a communication overhead. If the participating +GPUs are on the same compute node (e.g. same physical machine) this copying is fast, but if the GPUs are distributed +across different compute nodes (e.g. multiple machines), the communication overhead could be substantially greater. +Following that, layers 4 to 7 work as they would in the original model. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_25.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_25.txt new file mode 100644 index 0000000000000000000000000000000000000000..b8413a0bbe7dda177ca9d41b0579aa4bd67d8398 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_25.txt @@ -0,0 +1,5 @@ +Upon completion of the 7th layer, there is often +a need to send the data back to layer 0 where the labels are (or alternatively send the labels to the last layer). Now the loss can be +computed and the optimizer can do its work. +Naive Model Parallelism comes with several shortcomings: +- All but one GPU are idle at any given moment: if 4 GPUs are used, it's nearly identical to quadrupling the amount of memory of a single GPU, and ignoring the rest of the hardware. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_26.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_26.txt new file mode 100644 index 0000000000000000000000000000000000000000..65c584dfabd7e777d23e74c560fbaba8c5541672 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_26.txt @@ -0,0 +1 @@ +- Overhead in data transfer between devices: E.g. 4x 6GB cards will be able to accommodate the same size as 1x 24GB card using naive MP, but a single 24GB card will complete the training faster, because it doesn't have the data copying overhead.
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_27.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_27.txt new file mode 100644 index 0000000000000000000000000000000000000000..86e1f7a02669d0c554ae447d6ea19593bf47b498 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_27.txt @@ -0,0 +1,10 @@ +But, say, if you have 40GB cards and need to fit a 45GB model you can with 4x 40GB cards (but barely because of the gradient and optimizer states) +- Copying shared embeddings: Shared embeddings may need to get copied back and forth between GPUs. +Now that you are familiar with how the naive approach to model parallelism works and its shortcomings, let's look at Pipeline Parallelism (PP). +PP is almost identical to a naive MP, but it solves the GPU idling problem by chunking the incoming batch into micro-batches +and artificially creating a pipeline, which allows different GPUs to concurrently participate in the computation process. +The following illustration from the GPipe paper +shows the naive MP on the top, and PP on the bottom: + +At the bottom of the diagram, you can observe that the Pipeline Parallelism (PP) approach minimizes the number of idle +GPU zones, referred to as 'bubbles'. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_28.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_28.txt new file mode 100644 index 0000000000000000000000000000000000000000..59f9bc8cd0dd4644351567451ba988a17fae9812 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_28.txt @@ -0,0 +1,5 @@ +Both parts of the diagram show a parallelism level of degree 4, meaning that 4 GPUs +are involved in the pipeline. You can see that there's a forward path of 4 pipe stages (F0, F1, F2 and F3) followed by +a backward path in reverse order (B3, B2, B1, and B0). 
+PP introduces a new hyperparameter to tune - chunks, which determines how many data chunks are sent in a sequence +through the same pipe stage. For example, in the bottom diagram you can see chunks=4. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_29.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_29.txt new file mode 100644 index 0000000000000000000000000000000000000000..749a8db0fac7c3d26f3908e34bf83883dd9645df --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_29.txt @@ -0,0 +1,5 @@ +GPU0 performs the same +forward path on chunk 0, 1, 2 and 3 (F0,0, F0,1, F0,2, F0,3) and then it waits for other GPUs to complete their work. +Only when the other GPUs begin to complete their work, GPU0 starts to work again doing the backward path for chunks +3, 2, 1 and 0 (B0,3, B0,2, B0,1, B0,0). +Note that this is the same concept as gradient accumulation steps. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_3.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_3.txt new file mode 100644 index 0000000000000000000000000000000000000000..38d84ed7135cdd1d0e048542a5bd98f7f3bb0e09 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_3.txt @@ -0,0 +1,7 @@ +While the main concepts discussed in this guide are likely applicable across frameworks, here we focus on +PyTorch-based implementations. + +Before diving deeper into the specifics of each technique, let's go over the rough decision process when training +large models on a large infrastructure. +Scalability strategy +Begin by estimating how much vRAM is required to train your model.
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_30.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_30.txt new file mode 100644 index 0000000000000000000000000000000000000000..1c85eb816e78bc67cd22cbaa2d5266fd740e0d08 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_30.txt @@ -0,0 +1,6 @@ +PyTorch uses chunks, while DeepSpeed refers +to the same hyperparameter as gradient accumulation steps. +Because of the chunks, PP introduces the notion of micro-batches (MBS). DP splits the global data batch size into +mini-batches, so if you have a DP degree of 4, a global batch size of 1024 gets split up into 4 mini-batches of +256 each (1024/4). And if the number of chunks (or GAS) is 32 we end up with a micro-batch size of 8 (256/32). Each +Pipeline stage works with a single micro-batch at a time. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_31.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_31.txt new file mode 100644 index 0000000000000000000000000000000000000000..ae756ca358e018567a04f99411b4a088e512c166 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_31.txt @@ -0,0 +1,4 @@ +To calculate the global batch size of the DP + PP setup, +use the formula: mbs * chunks * dp_degree (8 * 32 * 4 = 1024). +With chunks=1 you end up with the naive MP, which is inefficient. With a large chunks value you end up with +tiny micro-batch sizes which is also inefficient. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_32.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_32.txt new file mode 100644 index 0000000000000000000000000000000000000000..55637ca28411f40dcf588e18360784269d97739f --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_32.txt @@ -0,0 +1,4 @@ +For this reason, we encourage you to experiment with the chunks value to +find the one that leads to the most efficient GPU utilization. +You may notice a bubble of "dead" time on the diagram that can't be parallelized because the last forward stage +has to wait for backward to complete the pipeline. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_33.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_33.txt new file mode 100644 index 0000000000000000000000000000000000000000..ec8b34845684748ea9cca9144a28fb9231da8762 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_33.txt @@ -0,0 +1,9 @@ +The purpose of finding the best value for chunks is to enable a high +concurrent GPU utilization across all participating GPUs which translates to minimizing the size of the bubble. +Pipeline API solutions have been implemented in: +- PyTorch +- DeepSpeed +- Megatron-LM +These come with some shortcomings: +- They have to modify the model quite heavily, because Pipeline requires one to rewrite the normal flow of modules into a nn.Sequential sequence of the same, which may require changes to the design of the model. +- Currently the Pipeline API is very restricted.
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_34.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_34.txt new file mode 100644 index 0000000000000000000000000000000000000000..e7b7a30ab3ac7783425d6bb7025a4893f5710f4d --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_34.txt @@ -0,0 +1 @@ +If you had a bunch of Python variables being passed in the very first stage of the Pipeline, you will have to find a way around it. Currently, the pipeline interface requires either a single Tensor or a tuple of Tensors as the only input and output. These tensors must have a batch size as the very first dimension, since pipeline is going to chunk the mini batch into micro-batches. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_35.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_35.txt new file mode 100644 index 0000000000000000000000000000000000000000..5968df26dc99961016a0a5b7f78f03740e83cb69 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_35.txt @@ -0,0 +1,10 @@ +Possible improvements are being discussed here https://github.com/pytorch/pytorch/pull/50693 +- Conditional control flow at the level of pipe stages is not possible - e.g., Encoder-Decoder models like T5 require special workarounds to handle a conditional encoder stage. +- They have to arrange each layer so that the output of one layer becomes an input to the other layer. +More recent solutions include: +- Varuna +- Sagemaker +We have not experimented with Varuna and SageMaker but their papers report that they have overcome the list of problems +mentioned above and that they require smaller changes to the user's model. +Implementations: +- PyTorch (initial support in pytorch-1.8, and progressively getting improved in 1.9 and more so in 1.10). 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_36.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_36.txt new file mode 100644 index 0000000000000000000000000000000000000000..89590567350567bf957fc546e5a675350de4a0dc --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_36.txt @@ -0,0 +1,8 @@ +Some examples +- DeepSpeed +- Megatron-LM has an internal implementation - no API. +- Varuna +- SageMaker - this is a proprietary solution that can only be used on AWS. +- OSLO - this is implemented based on the Hugging Face Transformers. +🤗 Transformers status: as of this writing none of the models supports full-PP. GPT2 and T5 models have naive MP support. +The main obstacle is being unable to convert the models to nn.Sequential and have all the inputs to be Tensors. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_37.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_37.txt new file mode 100644 index 0000000000000000000000000000000000000000..3ca383fb1883d017fc6d36889124415b8ba4c18a --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_37.txt @@ -0,0 +1,7 @@ +This +is because currently the models include many features that make the conversion very complicated, and will need to be removed to accomplish that. +DeepSpeed and Megatron-LM integrations are available in 🤗 Accelerate +Other approaches: +DeepSpeed, Varuna and SageMaker use the concept of an Interleaved Pipeline + +Here the bubble (idle time) is further minimized by prioritizing backward passes. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_38.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_38.txt new file mode 100644 index 0000000000000000000000000000000000000000..2292109b68d09d1e68ee7f469667fa564cb4d494 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_38.txt @@ -0,0 +1,17 @@ +Varuna further attempts to improve the +schedule by using simulations to discover the most efficient scheduling. +OSLO has pipeline parallelism implementation based on the Transformers without nn.Sequential conversion. +Tensor Parallelism +In Tensor Parallelism, each GPU processes a slice of a tensor and only aggregates the full tensor for operations requiring it. +To describe this method, this section of the guide relies on the concepts and diagrams from the Megatron-LM +paper: Efficient Large-Scale Language Model Training on GPU Clusters. +The main building block of any transformer is a fully connected nn.Linear followed by a nonlinear activation GeLU. +The dot-product part of it, following the Megatron paper's notation, can be written as Y = GeLU(XA), where X is +an input vector, Y is the output vector, and A is the weight matrix. +If we look at the computation in matrix form, you can see how the matrix multiplication can be split between multiple GPUs: + +If we split the weight matrix A column-wise across N GPUs and perform matrix multiplications XA_1 through XA_n in parallel, +then we will end up with N output vectors Y_1, Y_2, ..., Y_n which can be fed into GeLU independently: + +Using this principle, we can update a multi-layer perceptron of arbitrary depth, without the need for any synchronization +between GPUs until the very end, where we need to reconstruct the output vector from shards.
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_39.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_39.txt new file mode 100644 index 0000000000000000000000000000000000000000..e4df96badf75c82393dfe522a3cd91821e59ed98 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_39.txt @@ -0,0 +1,8 @@ +The Megatron-LM paper authors +provide a helpful illustration for that: + +Parallelizing the multi-headed attention layers is even simpler, since they are already inherently parallel, due to having +multiple independent heads! + +Special considerations: TP requires very fast network, and therefore it's not advisable to do TP across more than one node. +Practically, if a node has 4 GPUs, the highest TP degree is therefore 4. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_4.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_4.txt new file mode 100644 index 0000000000000000000000000000000000000000..375bdaedd0ec8cb88882d6e09188819a2555d567 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_4.txt @@ -0,0 +1,6 @@ +For models hosted on the 🤗 Hub, use our +Model Memory Calculator, which gives you +accurate calculations within a few percent margin. +Parallelization strategy for a single Node / multi-GPU setup +When training a model on a single node with multiple GPUs, your choice of parallelization strategy can significantly +impact performance. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_40.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_40.txt new file mode 100644 index 0000000000000000000000000000000000000000..635aac6e89b180d9c90d75ea230061dd2194293e --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_40.txt @@ -0,0 +1,15 @@ +If you need a TP degree of 8, you need to use +nodes that have at least 8 GPUs. 
+This section is based on the original much more detailed TP overview. +by @anton-l. +Alternative names: +- DeepSpeed calls it tensor slicing +Implementations: +- Megatron-LM has an internal implementation, as it's very model-specific +- parallelformers (only inference at the moment) +- SageMaker - this is a proprietary solution that can only be used on AWS. +- OSLO has the tensor parallelism implementation based on the Transformers. +SageMaker combines TP with DP for a more efficient processing. +🤗 Transformers status: +- core: not yet implemented in the core +- but if you want inference parallelformers provides this support for most of our models. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_41.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_41.txt new file mode 100644 index 0000000000000000000000000000000000000000..6214910f0e1de43f782536ca8785b33fdd160153 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_41.txt @@ -0,0 +1 @@ +So until this is implemented in the core you can use theirs. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_42.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_42.txt new file mode 100644 index 0000000000000000000000000000000000000000..9525edf2c5b987d8b7c4499eb694764061573967 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_42.txt @@ -0,0 +1,8 @@ +And hopefully training mode will be supported too. +- Deepspeed-Inference also supports our BERT, GPT-2, and GPT-Neo models in their super-fast CUDA-kernel-based inference mode, see more here +🤗 Accelerate integrates with TP from Megatron-LM. +Data Parallelism + Pipeline Parallelism +The following diagram from the DeepSpeed pipeline tutorial demonstrates +how one can combine DP with PP. + +Here it's important to see how DP rank 0 doesn't see GPU2 and DP rank 1 doesn't see GPU3. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_43.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_43.txt new file mode 100644 index 0000000000000000000000000000000000000000..0b344c0b45661ebb99e9804ae359b97e955eb077 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_43.txt @@ -0,0 +1,2 @@ +To DP there is just GPUs 0 +and 1 where it feeds data as if there were just 2 GPUs. GPU0 "secretly" offloads some of its load to GPU2 using PP. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_44.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_44.txt new file mode 100644 index 0000000000000000000000000000000000000000..50120e51fa44523ed7005cedf02c9a8dd58123a6 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_44.txt @@ -0,0 +1,11 @@ +And GPU1 does the same by enlisting GPU3 to its aid. +Since each dimension requires at least 2 GPUs, here you'd need at least 4 GPUs. +Implementations: +- DeepSpeed +- Megatron-LM +- Varuna +- SageMaker +- OSLO +🤗 Transformers status: not yet implemented +Data Parallelism + Pipeline Parallelism + Tensor Parallelism +To get an even more efficient training a 3D parallelism is used where PP is combined with TP and DP. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_45.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_45.txt new file mode 100644 index 0000000000000000000000000000000000000000..b02e25e767036e5dd0f1d897a31256d41e095ccb --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_45.txt @@ -0,0 +1,13 @@ +This can be seen in the following diagram. + +This diagram is from a blog post 3D parallelism: Scaling to trillion-parameter models, which is a good read as well. +Since each dimension requires at least 2 GPUs, here you'd need at least 8 GPUs. 
+Implementations: +- DeepSpeed - DeepSpeed also includes an even more efficient DP, which they call ZeRO-DP. +- Megatron-LM +- Varuna +- SageMaker +- OSLO +🤗 Transformers status: not yet implemented, since we have no PP and TP. +ZeRO Data Parallelism + Pipeline Parallelism + Tensor Parallelism +One of the main features of DeepSpeed is ZeRO, which is a super-scalable extension of DP. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_46.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_46.txt new file mode 100644 index 0000000000000000000000000000000000000000..35b4d08e96fb60e1f2811099ebd828612e253900 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_46.txt @@ -0,0 +1,6 @@ +It has already been +discussed in ZeRO Data Parallelism. Normally it's a standalone feature that doesn't require PP or TP. +But it can be combined with PP and TP. +When ZeRO-DP is combined with PP (and optionally TP) it typically enables only ZeRO stage 1 (optimizer sharding). +While it's theoretically possible to use ZeRO stage 2 (gradient sharding) with Pipeline Parallelism, it will have negative +performance impacts. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_47.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_47.txt new file mode 100644 index 0000000000000000000000000000000000000000..ade8d2399d9b69fb9c7d2d23cdd74e35c200ca45 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_47.txt @@ -0,0 +1,4 @@ +There would need to be an additional reduce-scatter collective for every micro-batch to aggregate +the gradients before sharding, which adds a potentially significant communication overhead. By nature of Pipeline Parallelism, +small micro-batches are used and instead the focus is on trying to balance arithmetic intensity (micro-batch size) with +minimizing the Pipeline bubble (number of micro-batches). 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_48.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_48.txt new file mode 100644 index 0000000000000000000000000000000000000000..8e2b18983954c435f0fdc4a202cd341cb9d9c400 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_48.txt @@ -0,0 +1,5 @@ +Therefore those communication costs are going to impact the performance. +In addition, there are already fewer layers than normal due to PP and so the memory savings won't be huge. PP already +reduces gradient size by 1/PP, and so gradient sharding savings on top of that are less significant than pure DP. +ZeRO stage 3 is not a good choice either for the same reason - more inter-node communications required. +And since we have ZeRO, the other benefit is ZeRO-Offload. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_49.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_49.txt new file mode 100644 index 0000000000000000000000000000000000000000..058a169a8939ab551484bd3dc32742655fddfccf --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_49.txt @@ -0,0 +1,22 @@ +Since this is stage 1 optimizer states can be offloaded to CPU. +Implementations: +- Megatron-DeepSpeed and Megatron-Deepspeed from BigScience, which is the fork of the former repo. +- OSLO +Important papers: + +Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model + +🤗 Transformers status: not yet implemented, since we have no PP and TP. +FlexFlow +FlexFlow also solves the parallelization problem in a slightly different approach. +Paper: "Beyond Data and Model Parallelism for Deep Neural Networks" by Zhihao Jia, Matei Zaharia, Alex Aiken +It performs a sort of 4D Parallelism over Sample-Operator-Attribute-Parameter. 
+ +Sample = Data Parallelism (sample-wise parallel) +Operator = Parallelize a single operation into several sub-operations +Attribute = Data Parallelism (length-wise parallel) +Parameter = Model Parallelism (regardless of dimension - horizontal or vertical) + +Examples: +* Sample +Let's take 10 batches of sequence length 512. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_5.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_5.txt new file mode 100644 index 0000000000000000000000000000000000000000..8eaee5a392979d8edacb7f8ffbbc6d3c180a0fd9 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_5.txt @@ -0,0 +1,16 @@ +Here's a breakdown of your options: +Case 1: Your model fits onto a single GPU +If your model can comfortably fit onto a single GPU, you have two primary options: + +DDP - Distributed DataParallel +ZeRO - depending on the situation and configuration used, this method may or may not be faster, however, it's worth experimenting with it. + +Case 2: Your model doesn't fit onto a single GPU: +If your model is too large for a single GPU, you have several alternatives to consider: + +PipelineParallel (PP) +ZeRO +TensorParallel (TP) + +With very fast inter-node connectivity (e.g., NVLINK or NVSwitch) all three strategies (PP, ZeRO, TP) should result in +similar performance. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_50.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_50.txt new file mode 100644 index 0000000000000000000000000000000000000000..a07b190c097708c447a09f4e3157bc577372993c --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_50.txt @@ -0,0 +1,6 @@ +If we parallelize them by sample dimension into 2 devices, we get 10 x 512 which becomes 5 x 2 x 512. + +Operator + +If we perform layer normalization, we compute std first and mean second, and then we can normalize data.
+Operator parallelism allows computing std and mean in parallel. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_51.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_51.txt new file mode 100644 index 0000000000000000000000000000000000000000..5f2bbb953f722137b9fe59b32df5695fd61211d6 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_51.txt @@ -0,0 +1,12 @@ +So if we parallelize them by operator dimension into 2 +devices (cuda:0, cuda:1), first we copy input data into both devices, and cuda:0 computes std, cuda:1 computes mean at the same time. + +Attribute + +We have 10 batches of 512 length. If we parallelize them by attribute dimension into 2 devices, 10 x 512 will be 10 x 2 x 256. + +Parameter + +It is similar with tensor model parallelism or naive layer-wise model parallelism. + +The significance of this framework is that it takes resources like (1) GPU/TPU/CPU vs. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_52.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_52.txt new file mode 100644 index 0000000000000000000000000000000000000000..99a06913ff8f629ac96dfa5c6a19e201def01f74 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_52.txt @@ -0,0 +1 @@ +(2) RAM/DRAM vs. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_53.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_53.txt new file mode 100644 index 0000000000000000000000000000000000000000..604ebffe811cd5bc264b5cdc6a3b51a30195c390 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_53.txt @@ -0,0 +1,7 @@ +(3) +fast-intra-connect/slow-inter-connect and it automatically optimizes all these algorithmically deciding which +parallelisation to use where. 
+One very important aspect is that FlexFlow is designed for optimizing DNN parallelizations for models with static and +fixed workloads, since models with dynamic behavior may prefer different parallelization strategies across iterations. +So the promise is very attractive - it runs a 30min simulation on the cluster of choice and it comes up with the best +strategy to utilise this specific environment. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_54.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_54.txt new file mode 100644 index 0000000000000000000000000000000000000000..580d1453b46b61e0933dcce37c3ee931a868e473 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_54.txt @@ -0,0 +1,6 @@ +If you add/remove/replace any parts it'll run and re-optimize the plan +for that. And then you can train. A different setup will have its own custom optimization. +🤗 Transformers status: Transformers models are FX-trace-able via transformers.utils.fx, +which is a prerequisite for FlexFlow, however, changes are required on the FlexFlow side to make it work with Transformers models. +GPU selection +When training on multiple GPUs, you can specify the number of GPUs to use and in what order. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_55.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_55.txt new file mode 100644 index 0000000000000000000000000000000000000000..01c0e6977e1bb40efba9827ab9328d54a70a749b --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_55.txt @@ -0,0 +1 @@ +This can be useful for instance when you have GPUs with different computing power and want to use the faster GPU first. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_56.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_56.txt new file mode 100644 index 0000000000000000000000000000000000000000..bbad8cd99d365ba88eb7e32c6b8803fc9250ca62 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_56.txt @@ -0,0 +1,18 @@ +The selection process works for both DistributedDataParallel and DataParallel to use only a subset of the available GPUs, and you don't need Accelerate or the DeepSpeed integration. +Number of GPUs +For example, if you have 4 GPUs and you only want to use the first 2: + +Use the --nproc_per_node to select how many GPUs to use. + +torchrun --nproc_per_node=2 trainer-program.py + +Use --num_processes to select how many GPUs to use. + +accelerate launch --num_processes 2 trainer-program.py + +Use --num_gpus to select how many GPUs to use. + +deepspeed --num_gpus 2 trainer-program.py + +Order of GPUs +Now, to select which GPUs to use and their order, you'll use the CUDA_VISIBLE_DEVICES environment variable. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_57.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_57.txt new file mode 100644 index 0000000000000000000000000000000000000000..b540cec5d9e75dcc10e70bf7b6d4638548ae2ab2 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_57.txt @@ -0,0 +1,4 @@ +It is easiest to set the environment variable in a ~/.bashrc or another startup config file. CUDA_VISIBLE_DEVICES is used to map which GPUs are used. For example, if you have 4 GPUs (0, 1, 2, 3) and you only want to run GPUs 0 and 2: + +CUDA_VISIBLE_DEVICES=0,2 torchrun trainer-program.py +Only the 2 physical GPUs (0 and 2) are "visible" to PyTorch and these are mapped to cuda:0 and cuda:1 respectively. You can also reverse the order of the GPUs to use 2 first.
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_58.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_58.txt new file mode 100644 index 0000000000000000000000000000000000000000..a887144d4ee62a9646f0ef71f3efb354149cfa39 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_58.txt @@ -0,0 +1,8 @@ +Now, the mapping is cuda:1 for GPU 0 and cuda:0 for GPU 2. + +CUDA_VISIBLE_DEVICES=2,0 torchrun trainer-program.py +You can also set the CUDA_VISIBLE_DEVICES environment variable to an empty value to create an environment without GPUs. + +CUDA_VISIBLE_DEVICES= python trainer-program.py + +As with any environment variable, it can be exported instead of being added to the command line. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_59.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_59.txt new file mode 100644 index 0000000000000000000000000000000000000000..6ba6aa870c118f251e3a8fbb728f594e349edfa2 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_59.txt @@ -0,0 +1,3 @@ +However, this is not recommended because it can be confusing if you forget how the environment variable was set up and you end up using the wrong GPUs. Instead, it is common practice to set the environment variable for a specific training run on the same command line. + +CUDA_DEVICE_ORDER is an alternative environment variable you can use to control how the GPUs are ordered. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_6.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_6.txt new file mode 100644 index 0000000000000000000000000000000000000000..7d23d76811684ed4fba398a56971f21e48af6266 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_6.txt @@ -0,0 +1,3 @@ +However, without these, PP will be faster than TP or ZeRO.
The degree of TP may also +make a difference. It's best to experiment with your specific setup to determine the most suitable strategy. +TP is almost always used within a single node. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_60.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_60.txt new file mode 100644 index 0000000000000000000000000000000000000000..d5ef8df5c9787340dcd953344a3ada63f9cedd74 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_60.txt @@ -0,0 +1,10 @@ +You can order them by either: + +PCIe bus ID, which matches the order reported by nvidia-smi and rocm-smi for NVIDIA and AMD GPUs respectively + +export CUDA_DEVICE_ORDER=PCI_BUS_ID + +GPU compute capability + +export CUDA_DEVICE_ORDER=FASTEST_FIRST +CUDA_DEVICE_ORDER is especially useful if your training setup consists of an older and a newer GPU, where the older GPU appears first, but you cannot physically swap the cards to make the newer GPU appear first. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_61.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_61.txt new file mode 100644 index 0000000000000000000000000000000000000000..bd9872bddb28160ab046836767b4694b8ebb51b0 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_61.txt @@ -0,0 +1 @@ +In this case, set CUDA_DEVICE_ORDER=FASTEST_FIRST to always use the newer and faster GPU first (nvidia-smi or rocm-smi still reports the GPUs in their PCIe order). Alternatively, you could set export CUDA_VISIBLE_DEVICES=1,0.
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_7.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_7.txt new file mode 100644 index 0000000000000000000000000000000000000000..9303d7516d1ecf3681099a380baf1dc9f92d5245 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_7.txt @@ -0,0 +1,21 @@ +That is TP size <= GPUs per node. +Case 3: Largest layer of your model does not fit onto a single GPU + +If you are not using ZeRO, you have to use TensorParallel (TP), because PipelineParallel (PP) alone won't be sufficient to accommodate the large layer. +If you are using ZeRO, additionally adopt techniques from the Methods and tools for efficient training on a single GPU. + +Parallelization strategy for a multi-Node / multi-GPU setup + +When you have fast inter-node connectivity (e.g., NVLINK or NVSwitch) consider using one of these options: + +ZeRO - as it requires close to no modifications to the model +A combination of PipelineParallel(PP) with TensorParallel(TP) and DataParallel(DP) - this approach will result in fewer communications, but requires significant changes to the model + +When you have slow inter-node connectivity and are still low on GPU memory: + +Employ a combination of DataParallel(DP) with PipelineParallel(PP), TensorParallel(TP), and ZeRO. + +In the following sections of this guide we dig deeper into how these different parallelism methods work. +Data Parallelism +Even with only 2 GPUs, you can readily leverage the accelerated training capabilities offered by PyTorch's built-in features, +such as DataParallel (DP) and DistributedDataParallel (DDP).
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_8.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_8.txt new file mode 100644 index 0000000000000000000000000000000000000000..11a2e705fd4509809648a8cefbe0212bf5b6b490 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_8.txt @@ -0,0 +1,16 @@ +Note that +PyTorch documentation recommends to prefer +DistributedDataParallel (DDP) over DataParallel (DP) for multi-GPU training as it works for all models. +Let's take a look at how these two methods work and what makes them different. +DataParallel vs DistributedDataParallel +To understand the key differences in inter-GPU communication overhead between the two methods, let's review the processes per batch: +DDP: + +At the start time the main process replicates the model once from GPU 0 to the rest of GPUs +Then for each batch: +Each GPU directly consumes its mini-batch of data. +During backward, once the local gradients are ready, they are averaged across all processes. + +DP: +For each batch: + 1. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_9.txt b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_9.txt new file mode 100644 index 0000000000000000000000000000000000000000..48ce455cdc4f2bd18f782a5b2bc154c1f7c9efcf --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_many/chunk_9.txt @@ -0,0 +1,7 @@ +GPU 0 reads the batch of data and then sends a mini-batch to each GPU. + 2. The up-to-date model is replicated from GPU 0 to each GPU. + 3. forward is executed, and output from each GPU is sent to GPU 0 to compute the loss. + 4. The loss is distributed from GPU 0 to all GPUs, and backward is run. + 5. Gradients from each GPU are sent to GPU 0 and averaged. +Key differences include: +1. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_0.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..23c419f713e8ff49853ab50318eb02c233baffaf --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_0.txt @@ -0,0 +1,5 @@ +Methods and tools for efficient training on a single GPU +This guide demonstrates practical techniques that you can use to increase the efficiency of your model's training by +optimizing memory utilization, speeding up the training, or both. If you'd like to understand how GPU is utilized during +training, please refer to the Model training anatomy conceptual guide first. This guide +focuses on practical techniques. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_1.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_1.txt new file mode 100644 index 0000000000000000000000000000000000000000..749ff5ec60cd075c224c1cc23df51ae5e1bd4317 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_1.txt @@ -0,0 +1,9 @@ +If you have access to a machine with multiple GPUs, these approaches are still valid, plus you can leverage additional methods outlined in the multi-GPU section. + +When training large models, there are two aspects that should be considered at the same time: + +Data throughput/training time +Model performance + +Maximizing the throughput (samples/second) leads to lower training cost. This is generally achieved by utilizing the GPU +as much as possible and thus filling GPU memory to its limit. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_10.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_10.txt new file mode 100644 index 0000000000000000000000000000000000000000..aa6efa59f89c92aa6532463f8e6ca4115c389fae --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_10.txt @@ -0,0 +1,5 @@ +Alternatively, use 🤗 Accelerate to gain full control over the training loop. Find the 🤗 Accelerate example +further down in this guide. +While it is advised to max out GPU usage as much as possible, a high number of gradient accumulation steps can +result in a more pronounced training slowdown. Consider the following example. Let's say, the per_device_train_batch_size=4 +without gradient accumulation hits the GPU's limit. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_11.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_11.txt new file mode 100644 index 0000000000000000000000000000000000000000..0e30b9ceff674ed1bcf726081c9019814d5f6471 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_11.txt @@ -0,0 +1,3 @@ +If you would like to train with batches of size 64, do not set the +per_device_train_batch_size to 1 and gradient_accumulation_steps to 64. Instead, keep per_device_train_batch_size=4 +and set gradient_accumulation_steps=16. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_12.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_12.txt new file mode 100644 index 0000000000000000000000000000000000000000..a9414335fa5ec27abf73f20458c0e48d65bb4a3f --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_12.txt @@ -0,0 +1,6 @@ +This results in the same effective batch size while making better use of +the available GPU resources. 
+For additional information, please refer to batch size and gradient accumulation benchmarks for RTX-3090 +and A100. +Gradient Checkpointing +Some large models may still face memory issues even when the batch size is set to 1 and gradient accumulation is used. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_13.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_13.txt new file mode 100644 index 0000000000000000000000000000000000000000..42ead21afd63bbb9d83118ed542249c21bc793fb --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_13.txt @@ -0,0 +1,3 @@ +This is because there are other components that also require memory storage. +Saving all activations from the forward pass in order to compute the gradients during the backward pass can result in +significant memory overhead. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_14.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_14.txt new file mode 100644 index 0000000000000000000000000000000000000000..8cc8eec5da7956fb094ff5ef7fc1e49702299541 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_14.txt @@ -0,0 +1,4 @@ +The alternative approach of discarding the activations and recalculating them when needed +during the backward pass, would introduce a considerable computational overhead and slow down the training process. +Gradient checkpointing offers a compromise between these two approaches and saves strategically selected activations +throughout the computational graph so only a fraction of the activations need to be re-computed for the gradients. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_15.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_15.txt new file mode 100644 index 0000000000000000000000000000000000000000..01db124af80fe70c9a93ec06af15b8bb53229464 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_15.txt @@ -0,0 +1,8 @@ +For +an in-depth explanation of gradient checkpointing, refer to this great article. +To enable gradient checkpointing in the [Trainer], pass the corresponding flag to [TrainingArguments]: +py +training_args = TrainingArguments( + per_device_train_batch_size=1, gradient_accumulation_steps=4, gradient_checkpointing=True, **default_args +) +Alternatively, use 🤗 Accelerate - find the 🤗 Accelerate example further in this guide. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_16.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_16.txt new file mode 100644 index 0000000000000000000000000000000000000000..a2e6f4a7b56d7e823dbea7e1f2eb46d840d9b327 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_16.txt @@ -0,0 +1,6 @@ +While gradient checkpointing may improve memory efficiency, it slows training by approximately 20%. + +Mixed precision training +Mixed precision training is a technique that aims to optimize the computational efficiency of training models by +utilizing lower-precision numerical formats for certain variables. Traditionally, most models use 32-bit floating point +precision (fp32 or float32) to represent and process variables.
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_17.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_17.txt new file mode 100644 index 0000000000000000000000000000000000000000..bbb55213da2b7d7352c269d808a9dc8aecada78a --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_17.txt @@ -0,0 +1,3 @@ +However, not all variables require this high precision +level to achieve accurate results. By reducing the precision of certain variables to lower numerical formats like 16-bit +floating point (fp16 or float16), we can speed up the computations. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_18.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_18.txt new file mode 100644 index 0000000000000000000000000000000000000000..135ef66c62b584adb40898bbaadbf3e52a50c0d5 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_18.txt @@ -0,0 +1,4 @@ +Because in this approach some computations are performed +in half-precision, while some are still in full precision, the approach is called mixed precision training. +Most commonly mixed precision training is achieved by using fp16 (float16) data types, however, some GPU architectures +(such as the Ampere architecture) offer bf16 and tf32 (CUDA internal data type) data types. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_19.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_19.txt new file mode 100644 index 0000000000000000000000000000000000000000..74abb0365c277cc52a0bab89ea564ac06967638a --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_19.txt @@ -0,0 +1,7 @@ +Check +out the NVIDIA Blog to learn more about +the differences between these data types. +fp16 +The main advantage of mixed precision training comes from saving the activations in half precision (fp16). 
+Although the gradients are also computed in half precision they are converted back to full precision for the optimization +step so no memory is saved here. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_2.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_2.txt new file mode 100644 index 0000000000000000000000000000000000000000..e78b99708dfb8a489a3ee894e0ff8f03e313c52c --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_2.txt @@ -0,0 +1,4 @@ +If the desired batch size exceeds the limits of the GPU memory, +the memory optimization techniques, such as gradient accumulation, can help. +However, if the preferred batch size fits into memory, there's no reason to apply memory-optimizing techniques because they can +slow down the training. Just because one can use a large batch size, does not necessarily mean they should. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_20.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_20.txt new file mode 100644 index 0000000000000000000000000000000000000000..3c57001b6df111ec7976372ab81923e74846f798 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_20.txt @@ -0,0 +1,6 @@ +While mixed precision training results in faster computations, it can also lead to more GPU memory being utilized, especially for small batch sizes. +This is because the model is now present on the GPU in both 16-bit and 32-bit precision (1.5x the original model on the GPU). +To enable mixed precision training, set the fp16 flag to True: +py +training_args = TrainingArguments(per_device_train_batch_size=4, fp16=True, **default_args) +If you prefer to use 🤗 Accelerate, find the 🤗 Accelerate example further in this guide. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_21.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_21.txt new file mode 100644 index 0000000000000000000000000000000000000000..3fd50c9237a0c5b54476ee7ad5215d8c5abd638c --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_21.txt @@ -0,0 +1,4 @@ +BF16 +If you have access to Ampere or newer hardware you can use bf16 for mixed precision training and evaluation. While +bf16 has a worse precision than fp16, it has a much bigger dynamic range. In fp16 the biggest number you can have +is 65504 and any number above that will result in an overflow. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_22.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_22.txt new file mode 100644 index 0000000000000000000000000000000000000000..df8c47a23c7f1665d2b57c6011f407b6c2dcc701 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_22.txt @@ -0,0 +1,8 @@ +A bf16 number can be as large as 3.39e+38 (!) which +is about the same as fp32 - because both have 8 bits used for the numerical range. +You can enable BF16 in the 🤗 Trainer with: +python +training_args = TrainingArguments(bf16=True, **default_args) +TF32 +The Ampere hardware uses a magical data type called tf32. It has the same numerical range as fp32 (8 bits), but instead +of 23 bits precision it has only 10 bits (same as fp16) and uses only 19 bits in total.
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_23.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_23.txt new file mode 100644 index 0000000000000000000000000000000000000000..5d5db5233b9c7ac08898a4b4e2f671fff87bc13f --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_23.txt @@ -0,0 +1,3 @@ +It's "magical" in the sense that +you can use the normal fp32 training and/or inference code and by enabling tf32 support you can get up to 3x throughput +improvement. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_24.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_24.txt new file mode 100644 index 0000000000000000000000000000000000000000..71b60a7be283b944079ff986fa7be2870a8d3c57 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_24.txt @@ -0,0 +1,8 @@ +All you need to do is to add the following to your code: +python +import torch +torch.backends.cuda.matmul.allow_tf32 = True +torch.backends.cudnn.allow_tf32 = True +CUDA will automatically switch to using tf32 instead of fp32 where possible, assuming that the used GPU is from the Ampere series. +According to NVIDIA research, the +majority of machine learning training workloads show the same perplexity and convergence with tf32 training as with fp32. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_25.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_25.txt new file mode 100644 index 0000000000000000000000000000000000000000..910e335f8e8f8abb35f1b08b8b2e69a7489f94ed --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_25.txt @@ -0,0 +1,6 @@ +If you're already using fp16 or bf16 mixed precision it may help with the throughput as well. 
+You can enable this mode in the 🤗 Trainer: +python +TrainingArguments(tf32=True, **default_args) + +tf32 can't be accessed directly via tensor.to(dtype=torch.tf32) because it is an internal CUDA data type. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_26.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_26.txt new file mode 100644 index 0000000000000000000000000000000000000000..d78bb7309c9ebf277c12a46c68b9f8c011c9d5a0 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_26.txt @@ -0,0 +1,7 @@ +You need torch>=1.7 to use tf32 data types. + +For additional information on tf32 vs other precisions, please refer to the following benchmarks: +RTX-3090 and +A100. +Flash Attention 2 +You can speed up the training throughput by using the Flash Attention 2 integration in transformers. Check out the appropriate section in the single GPU section to learn more about how to load a model with Flash Attention 2 modules. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_27.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_27.txt new file mode 100644 index 0000000000000000000000000000000000000000..54c9bbf4cd76aeadb5fff9d48012136a72307484 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_27.txt @@ -0,0 +1,4 @@ +Optimizer choice +The most common optimizer used to train transformer models is Adam or AdamW (Adam with weight decay). Adam achieves +good convergence by storing the rolling average of the previous gradients; however, it adds an additional memory +footprint of the order of the number of model parameters. To remedy this, you can use an alternative optimizer.
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_28.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_28.txt new file mode 100644 index 0000000000000000000000000000000000000000..d7e28d3af73438f799996a2d01d2347e295005fb --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_28.txt @@ -0,0 +1,4 @@ +For example, if you have NVIDIA/apex installed for NVIDIA GPUs, or ROCmSoftwarePlatform/apex for AMD GPUs, adamw_apex_fused will give you the +fastest training experience among all supported AdamW optimizers. +[Trainer] integrates a variety of optimizers that can be used out of the box: adamw_hf, adamw_torch, adamw_torch_fused, +adamw_apex_fused, adamw_anyprecision, adafactor, or adamw_bnb_8bit. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_29.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_29.txt new file mode 100644 index 0000000000000000000000000000000000000000..b358bcb8b600d6c4ddc6fe0dcc8be7e3f5c35028 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_29.txt @@ -0,0 +1,4 @@ +More optimizers can be plugged in via a third-party implementation. +Let's take a closer look at two alternatives to the AdamW optimizer: +1. adafactor which is available in [Trainer] +2. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_3.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_3.txt new file mode 100644 index 0000000000000000000000000000000000000000..6f726aee62f9d529fb100dc7e47925601a4f0517 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_3.txt @@ -0,0 +1,19 @@ +As part of +hyperparameter tuning, you should determine which batch size yields the best results and then optimize resources accordingly.
+The methods and tools covered in this guide can be classified based on the effect they have on the training process: +| Method/tool | Improves training speed | Optimizes memory utilization | +|:-----------------------------------------------------------|:------------------------|:-----------------------------| +| Batch size choice | Yes | Yes | +| Gradient accumulation | No | Yes | +| Gradient checkpointing | No | Yes | +| Mixed precision training | Yes | (No) | +| Optimizer choice | Yes | Yes | +| Data preloading | Yes | No | +| DeepSpeed ZeRO | No | Yes | +| torch.compile | Yes | No | +| Parameter-Efficient Fine Tuning (PEFT) | No | Yes | + +Note: when using mixed precision with a small model and a large batch size, there will be some memory savings but with a +large model and a small batch size, the memory use will be larger. + +You can combine the above methods to get a cumulative effect. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_30.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_30.txt new file mode 100644 index 0000000000000000000000000000000000000000..9fcbd26ce3e3125bd4b9b9edd2d6c19f0233e296 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_30.txt @@ -0,0 +1,4 @@ +adamw_bnb_8bit is also available in Trainer, but a third-party integration is provided below for demonstration. +For comparison, for a 3B-parameter model, like “google-t5/t5-3b”: +* A standard AdamW optimizer will need 24GB of GPU memory because it uses 8 bytes for each parameter (8*3 => 24GB) +* Adafactor optimizer will need more than 12GB.
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_31.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_31.txt new file mode 100644 index 0000000000000000000000000000000000000000..a7ad940606e9ee204873ceec6c99f64af8fae6e9 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_31.txt @@ -0,0 +1,5 @@ +It uses slightly more than 4 bytes for each parameter, so 4*3 and then some extra. +* 8-bit BNB quantized optimizer will use only (2*3) 6GB if all optimizer states are quantized. +Adafactor +Adafactor doesn't store rolling averages for each element in weight matrices. Instead, it keeps aggregated information +(sums of rolling averages row- and column-wise), significantly reducing its footprint. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_32.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_32.txt new file mode 100644 index 0000000000000000000000000000000000000000..77f7a66aec86ec70b834977cb32cb795963486ff --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_32.txt @@ -0,0 +1,8 @@ +However, compared to Adam, +Adafactor may have slower convergence in certain cases. +You can switch to Adafactor by setting optim="adafactor" in [TrainingArguments]: +py +training_args = TrainingArguments(per_device_train_batch_size=4, optim="adafactor", **default_args) +Combined with other approaches (gradient accumulation, gradient checkpointing, and mixed precision training) +you can notice up to 3x improvement while maintaining the throughput! However, as mentioned before, the convergence of +Adafactor can be worse than Adam.
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_33.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_33.txt new file mode 100644 index 0000000000000000000000000000000000000000..02dc2f06ccf13a81a597fc4e47d649edd10c3718 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_33.txt @@ -0,0 +1,3 @@ +8-bit Adam +Instead of aggregating optimizer states like Adafactor, 8-bit Adam keeps the full state and quantizes it. Quantization +means that it stores the state with lower precision and dequantizes it only for the optimization. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_34.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_34.txt new file mode 100644 index 0000000000000000000000000000000000000000..02f81a2eeb38c05d3fd26c061d80ca4ae3ea2bbb --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_34.txt @@ -0,0 +1,9 @@ +This is similar to the +idea behind mixed precision training. +To use adamw_bnb_8bit, you simply need to set optim="adamw_bnb_8bit" in [TrainingArguments]: +py +training_args = TrainingArguments(per_device_train_batch_size=4, optim="adamw_bnb_8bit", **default_args) +However, we can also use a third-party implementation of the 8-bit optimizer for demonstration purposes to see how that can be integrated. +First, follow the installation guide in the GitHub repo to install the bitsandbytes library +that implements the 8-bit Adam optimizer. +Next you need to initialize the optimizer. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_35.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_35.txt new file mode 100644 index 0000000000000000000000000000000000000000..5d33fb3f2936add8f19b8688948338b14ea4ca68 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_35.txt @@ -0,0 +1,2 @@ +This involves two steps: +* First, group the model's parameters into two groups - one where weight decay should be applied, and the other one where it should not. Usually, biases and layer norm parameters are not weight decayed. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_36.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_36.txt new file mode 100644 index 0000000000000000000000000000000000000000..25887c5725a179ea8d9cdbc645fe83d9e681d2aa --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_36.txt @@ -0,0 +1,35 @@ +* Then do some argument housekeeping to use the same parameters as the previously used AdamW optimizer. 
+ +import bitsandbytes as bnb +from torch import nn +from transformers.trainer_pt_utils import get_parameter_names +training_args = TrainingArguments(per_device_train_batch_size=4, **default_args) +decay_parameters = get_parameter_names(model, [nn.LayerNorm]) +decay_parameters = [name for name in decay_parameters if "bias" not in name] +optimizer_grouped_parameters = [ + { + "params": [p for n, p in model.named_parameters() if n in decay_parameters], + "weight_decay": training_args.weight_decay, + }, + { + "params": [p for n, p in model.named_parameters() if n not in decay_parameters], + "weight_decay": 0.0, + }, +] +optimizer_kwargs = { + "betas": (training_args.adam_beta1, training_args.adam_beta2), + "eps": training_args.adam_epsilon, +} +optimizer_kwargs["lr"] = training_args.learning_rate +adam_bnb_optim = bnb.optim.Adam8bit( + optimizer_grouped_parameters, + betas=(training_args.adam_beta1, training_args.adam_beta2), + eps=training_args.adam_epsilon, + lr=training_args.learning_rate, +) + +Finally, pass the custom optimizer as an argument to the Trainer: +py +trainer = Trainer(model=model, args=training_args, train_dataset=ds, optimizers=(adam_bnb_optim, None)) +Combined with other approaches (gradient accumulation, gradient checkpointing, and mixed precision training), +you can expect to get about a 3x memory improvement and even slightly higher throughput than when using Adafactor. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_37.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_37.txt new file mode 100644 index 0000000000000000000000000000000000000000..8414c965667cef50b6174cbf71d0bdf1dda68a40 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_37.txt @@ -0,0 +1,6 @@ +multi_tensor +pytorch-nightly introduced torch.optim._multi_tensor which should significantly speed up the optimizers for situations +with lots of small feature tensors.
It should eventually become the default, but if you want to experiment with it sooner, take a look at this GitHub issue. +Data preloading +One of the important requirements to reach great training speed is the ability to feed the GPU at the maximum speed it +can handle. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_38.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_38.txt new file mode 100644 index 0000000000000000000000000000000000000000..b05524426502b5ecb32ad5fce4bf8c51f777f4b1 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_38.txt @@ -0,0 +1,5 @@ +By default, everything happens in the main process, and it might not be able to read the data from disk fast +enough, and thus create a bottleneck, leading to GPU under-utilization. Configure the following arguments to reduce the bottleneck: + +DataLoader(pin_memory=True, ) - ensures the data gets preloaded into the pinned memory on CPU and typically leads to much faster transfers from CPU to GPU memory. +DataLoader(num_workers=4, ) - spawn several workers to preload data faster. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_39.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_39.txt new file mode 100644 index 0000000000000000000000000000000000000000..59e16fa53ad5f99a0342518deea80db069fbb304 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_39.txt @@ -0,0 +1 @@ +During training, watch the GPU utilization stats; if it's far from 100%, experiment with increasing the number of workers. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_4.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_4.txt new file mode 100644 index 0000000000000000000000000000000000000000..95d422c5189af8eadf06a7024f4210b0a1e28048 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_4.txt @@ -0,0 +1,9 @@ +These techniques are available to you whether you are +training your model with [Trainer] or writing a pure PyTorch loop, in which case you can configure these optimizations +with 🤗 Accelerate. +If these methods do not result in sufficient gains, you can explore the following options: +* Look into building your own custom Docker container with efficient software prebuilds +* Consider a model that uses Mixture of Experts (MoE) +* Convert your model to BetterTransformer to leverage PyTorch native attention +Finally, if all of the above is still not enough, even after switching to a server-grade GPU like A100, consider moving +to a multi-GPU setup. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_40.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_40.txt new file mode 100644 index 0000000000000000000000000000000000000000..38f26586c90f5d1ec3209d943c99aac363dcbe9f --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_40.txt @@ -0,0 +1,9 @@ +Of course, the problem could be elsewhere, so many workers won't necessarily lead to better performance. + +When using [Trainer], the corresponding [TrainingArguments] are: dataloader_pin_memory (True by default), and dataloader_num_workers (defaults to 0). +DeepSpeed ZeRO +DeepSpeed is an open-source deep learning optimization library that is integrated with 🤗 Transformers and 🤗 Accelerate. +It provides a wide range of features and optimizations designed to improve the efficiency and scalability of large-scale +deep learning training.
+If your model fits onto a single GPU and you have enough space to fit a small batch size, you don't need to use DeepSpeed +as it'll only slow things down. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_41.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_41.txt new file mode 100644 index 0000000000000000000000000000000000000000..da62274198f9cd24a58b779b0a796181f10b3456 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_41.txt @@ -0,0 +1,7 @@ +However, if the model doesn't fit onto a single GPU or you can't fit a small batch, you can +leverage DeepSpeed ZeRO + CPU Offload, or NVMe Offload for much larger models. In this case, you need to separately +install the library, then follow one of the guides to create a configuration file +and launch DeepSpeed: + +For an in-depth guide on DeepSpeed integration with [Trainer], review the corresponding documentation, specifically the +section for a single GPU. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_42.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_42.txt new file mode 100644 index 0000000000000000000000000000000000000000..eea5696093e2d047c9c6e15f34c00a4c40993179 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_42.txt @@ -0,0 +1,10 @@ +Some adjustments are required to use DeepSpeed in a notebook; please take a look at the corresponding guide. +If you prefer to use 🤗 Accelerate, refer to 🤗 Accelerate DeepSpeed guide. + +Using torch.compile +PyTorch 2.0 introduced a new compile function that doesn't require any modification to existing PyTorch code but can +optimize your code by adding a single line of code: model = torch.compile(model). 
+If using [Trainer], you only need to pass the torch_compile option in the [TrainingArguments]: +python +training_args = TrainingArguments(torch_compile=True, **default_args) +torch.compile uses Python's frame evaluation API to automatically create a graph from existing PyTorch programs. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_43.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_43.txt new file mode 100644 index 0000000000000000000000000000000000000000..adb16236eda0bc6a4b1924a4e3e2728cfa73b1b0 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_43.txt @@ -0,0 +1,5 @@ +After +capturing the graph, different backends can be deployed to lower the graph to an optimized engine. +You can find more details and benchmarks in PyTorch documentation. +torch.compile has a growing list of backends, which can be found by calling torchdynamo.list_backends(), each with its own optional dependencies. +Choose which backend to use by specifying it via torch_compile_backend in the [TrainingArguments]. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_44.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_44.txt new file mode 100644 index 0000000000000000000000000000000000000000..b8ba9911947a6a72e43141f3ee1be300818305d8 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_44.txt @@ -0,0 +1,4 @@ +Some of the most commonly used backends are: +Debugging backends: +* dynamo.optimize("eager") - Uses PyTorch to run the extracted GraphModule. This is quite useful in debugging TorchDynamo issues. +* dynamo.optimize("aot_eager") - Uses AotAutograd with no compiler, i.e, just using PyTorch eager for the AotAutograd's extracted forward and backward graphs.
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_45.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_45.txt new file mode 100644 index 0000000000000000000000000000000000000000..42bd176abfe2ebe3614ae123e28e8dddfb97716f --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_45.txt @@ -0,0 +1,6 @@ +This is useful for debugging, and unlikely to give speedups. +Training & inference backends: +* dynamo.optimize("inductor") - Uses TorchInductor backend with AotAutograd and cudagraphs by leveraging codegened Triton kernels Read more +* dynamo.optimize("nvfuser") - nvFuser with TorchScript. Read more +* dynamo.optimize("aot_nvfuser") - nvFuser with AotAutograd. Read more +* dynamo.optimize("aot_cudagraphs") - cudagraphs with AotAutograd. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_46.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_46.txt new file mode 100644 index 0000000000000000000000000000000000000000..f53bc0f0c9a1fb8a67d7e74e0aada7415bb70dfb --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_46.txt @@ -0,0 +1,6 @@ +Read more +Inference-only backends: +* dynamo.optimize("ofi") - Uses Torchscript optimize_for_inference. Read more +* dynamo.optimize("fx2trt") - Uses NVIDIA TensorRT for inference optimizations. Read more +* dynamo.optimize("onnxrt") - Uses ONNXRT for inference on CPU/GPU. Read more +* dynamo.optimize("ipex") - Uses IPEX for inference on CPU. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_47.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_47.txt new file mode 100644 index 0000000000000000000000000000000000000000..1a3bc46c967a23111e24bd0e395d6738ff0cb8c8 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_47.txt @@ -0,0 +1,16 @@ +Read more +For an example of using torch.compile with 🤗 Transformers, check out this blog post on fine-tuning a BERT model for Text Classification using the newest PyTorch 2.0 features +Using 🤗 PEFT +Parameter-Efficient Fine Tuning (PEFT) methods freeze the pretrained model parameters during fine-tuning and add a small number of trainable parameters (the adapters) on top of it. +As a result the memory associated to the optimizer states and gradients are greatly reduced. +For example with a vanilla AdamW, the memory requirement for the optimizer state would be: +* fp32 copy of parameters: 4 bytes/param +* Momentum: 4 bytes/param +* Variance: 4 bytes/param +Suppose a model with 7B parameters and 200 millions parameters injected with Low Rank Adapters. +The memory requirement for the optimizer state of the plain model would be 12 * 7 = 84 GB (assuming 7B trainable parameters). +Adding Lora increases slightly the memory associated to the model weights and substantially decreases memory requirement for the optimizer state to 12 * 0.2 = 2.4GB. +Read more about PEFT and its detailed usage in the PEFT documentation or PEFT repository. +Using 🤗 Accelerate +With 🤗 Accelerate you can use the above methods while gaining full +control over the training loop and can essentially write the loop in pure PyTorch with some minor modifications. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_48.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_48.txt new file mode 100644 index 0000000000000000000000000000000000000000..3a44e840ebdc3bb8689daa82d80bbf680bbd3fef --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_48.txt @@ -0,0 +1,28 @@ +Suppose you have combined the methods in the [TrainingArguments] like so: +py +training_args = TrainingArguments( + per_device_train_batch_size=1, + gradient_accumulation_steps=4, + gradient_checkpointing=True, + fp16=True, + **default_args, +) +The full example training loop with 🤗 Accelerate is only a handful of lines of code long: + +from accelerate import Accelerator +from torch.utils.data.dataloader import DataLoader +dataloader = DataLoader(ds, batch_size=training_args.per_device_train_batch_size) +if training_args.gradient_checkpointing: + model.gradient_checkpointing_enable() +accelerator = Accelerator(fp16=training_args.fp16) +model, optimizer, dataloader = accelerator.prepare(model, adam_bnb_optim, dataloader) +model.train() +for step, batch in enumerate(dataloader, start=1): + loss = model(**batch).loss + loss = loss / training_args.gradient_accumulation_steps + accelerator.backward(loss) + if step % training_args.gradient_accumulation_steps == 0: + optimizer.step() + optimizer.zero_grad() + +First we wrap the dataset in a DataLoader. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_49.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_49.txt new file mode 100644 index 0000000000000000000000000000000000000000..e7995d058b24d88298597c36fd90356aca4cadb8 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_49.txt @@ -0,0 +1,6 @@ +Then we can enable gradient checkpointing by calling the model's [~PreTrainedModel.gradient_checkpointing_enable] method. 
+When we initialize the Accelerator +we can specify if we want to use mixed precision training and it will take care of it for us in the [prepare] call. +During the prepare +call the dataloader will also be distributed across workers should we use multiple GPUs. We use the same 8-bit optimizer from the earlier example. +Finally, we can add the main training loop. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_5.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_5.txt new file mode 100644 index 0000000000000000000000000000000000000000..36ca99aed7bbe499d9106115c3424f9fa785656c --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_5.txt @@ -0,0 +1,5 @@ +All these approaches are still valid in a multi-GPU setup, plus you can leverage additional parallelism +techniques outlined in the multi-GPU section. +Batch size choice +To achieve optimal performance, start by identifying the appropriate batch size. It is recommended to use batch sizes and +input/output neuron counts that are of size 2^N. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_50.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_50.txt new file mode 100644 index 0000000000000000000000000000000000000000..99ca7a09fb9e8f320459f3e689ac24a54a783979 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_50.txt @@ -0,0 +1,5 @@ +Note that the backward call is handled by 🤗 Accelerate. We can also see +how gradient accumulation works: we normalize the loss, so we get the average at the end of accumulation and once we have +enough steps we run the optimization. +Implementing these optimization techniques with 🤗 Accelerate only takes a handful of lines of code and comes with the +benefit of more flexibility in the training loop. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_51.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_51.txt new file mode 100644 index 0000000000000000000000000000000000000000..83a7fbf4b597a73919f568e86873a8323afb26f1 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_51.txt @@ -0,0 +1,7 @@ +For a full documentation of all features have a look at the +Accelerate documentation. +Efficient Software Prebuilds +PyTorch's pip and conda builds come prebuilt with the cuda toolkit +which is enough to run PyTorch, but it is insufficient if you need to build cuda extensions. +At times, additional efforts may be required to pre-build some components. For instance, if you're using libraries like apex that +don't come pre-compiled. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_52.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_52.txt new file mode 100644 index 0000000000000000000000000000000000000000..e46186c1e2ff5fa3d797d6b480a6c96ceb329d92 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_52.txt @@ -0,0 +1,3 @@ +In other situations figuring out how to install the right cuda toolkit system-wide can be complicated. +To address these scenarios PyTorch and NVIDIA released a new version of NGC docker container which already comes with +everything prebuilt. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_53.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_53.txt new file mode 100644 index 0000000000000000000000000000000000000000..cc21403822aa24c598e740c80bce8bf3dd5c4573 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_53.txt @@ -0,0 +1,4 @@ +You just need to install your programs on it, and it will run out of the box. 
+This approach is also useful if you want to tweak the pytorch source and/or make a new customized build. +To find the docker image version you want start with PyTorch release notes, +choose one of the latest monthly releases. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_54.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_54.txt new file mode 100644 index 0000000000000000000000000000000000000000..c2fc15fbe93d8fcd684e4b8b4260582643ddc65a --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_54.txt @@ -0,0 +1,3 @@ +Go into the release's notes for the desired release, check that the environment's +components are matching your needs (including NVIDIA Driver requirements!) and then at the very top of that document go +to the corresponding NGC page. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_55.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_55.txt new file mode 100644 index 0000000000000000000000000000000000000000..59c2194faeaffa6999c3a6a81535a8e544d5c097 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_55.txt @@ -0,0 +1,14 @@ +If for some reason you get lost, here is the index of all PyTorch NGC images. +Next follow the instructions to download and deploy the docker image. +Mixture of Experts +Some recent papers reported a 4-5x training speedup and a faster inference by integrating +Mixture of Experts (MoE) into the Transformer models. +Since it has been discovered that more parameters lead to better performance, this technique allows to increase the +number of parameters by an order of magnitude without increasing training costs. +In this approach every other FFN layer is replaced with a MoE Layer which consists of many experts, with a gated function +that trains each expert in a balanced way depending on the input token's position in a sequence. 
+ +(source: GLAM) +You can find exhaustive details and comparison tables in the papers listed at the end of this section. +The main drawback of this approach is that it requires staggering amounts of GPU memory - almost an order of magnitude +larger than its dense equivalent. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_56.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_56.txt new file mode 100644 index 0000000000000000000000000000000000000000..b610d3190f3df38319bad861eb8cc64335210c57 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_56.txt @@ -0,0 +1,22 @@ +Various distillation and approaches are proposed to how to overcome the much higher memory requirements. +There is direct trade-off though, you can use just a few experts with a 2-3x smaller base model instead of dozens or +hundreds experts leading to a 5x smaller model and thus increase the training speed moderately while increasing the +memory requirements moderately as well. +Most related papers and implementations are built around Tensorflow/TPUs: + +GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding +Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity +GLaM: Generalist Language Model (GLaM) + +And for Pytorch DeepSpeed has built one as well: DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale, Mixture of Experts - blog posts: 1, 2 and specific deployment with large transformer-based natural language generation models: blog post, Megatron-Deepspeed branch. +Using PyTorch native attention and Flash Attention +PyTorch 2.0 released a native torch.nn.functional.scaled_dot_product_attention (SDPA), +that allows using fused GPU kernels such as memory-efficient attention and flash attention. 
+After installing the optimum package, the relevant internal modules can be +replaced to use PyTorch's native attention with: +python +model = model.to_bettertransformer() +Once converted, train the model as usual. + +The PyTorch-native scaled_dot_product_attention operator can only dispatch to Flash Attention if no attention_mask is provided. +By default, in training mode, the BetterTransformer integration drops the mask support and can only be used for training that does not require a padding mask for batched training. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_57.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_57.txt new file mode 100644 index 0000000000000000000000000000000000000000..6ead3bdde3445b2367f7e6a93b713b42d20b30d0 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_57.txt @@ -0,0 +1,3 @@ +This is the case, for example, during masked language modeling or causal language modeling. BetterTransformer is not suited for fine-tuning models on tasks that require a padding mask. + +Check out this blogpost to learn more about acceleration and memory-savings with SDPA.. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_6.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_6.txt new file mode 100644 index 0000000000000000000000000000000000000000..ca444687881f4947ed3f1e1cf86aecfbda10d542 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_6.txt @@ -0,0 +1,7 @@ +Often it's a multiple of 8, but it can be +higher depending on the hardware being used and the model's dtype. +For reference, check out NVIDIA's recommendation for input/output neuron counts and +batch size for +fully connected layers (which are involved in GEMMs (General Matrix Multiplications)). +Tensor Core Requirements +define the multiplier based on the dtype and the hardware. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_7.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_7.txt new file mode 100644 index 0000000000000000000000000000000000000000..bebf1ede5366f4b6219dd3c8f9c4252a3591f729 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_7.txt @@ -0,0 +1,7 @@ +For instance, for fp16 data type a multiple of 8 is recommended, unless +it's an A100 GPU, in which case use multiples of 64. +For parameters that are small, consider also Dimension Quantization Effects. +This is where tiling happens and the right multiplier can have a significant speedup. +Gradient Accumulation +The gradient accumulation method aims to calculate gradients in smaller increments instead of computing them for the +entire batch at once. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_8.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_8.txt new file mode 100644 index 0000000000000000000000000000000000000000..1347405e16174ab6967cb2747ec12cf52d738e13 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_8.txt @@ -0,0 +1,4 @@ +This approach involves iteratively calculating gradients in smaller batches by performing forward +and backward passes through the model and accumulating the gradients during the process. Once a sufficient number of +gradients have been accumulated, the model's optimization step is executed. By employing gradient accumulation, it +becomes possible to increase the effective batch size beyond the limitations imposed by the GPU's memory capacity. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_9.txt b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_9.txt new file mode 100644 index 0000000000000000000000000000000000000000..21d514974e1edca3f999ef4792d8ccef5d1f0d84 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_gpu_one/chunk_9.txt @@ -0,0 +1,6 @@ +However, it is important to note that the additional forward and backward passes introduced by gradient accumulation can +slow down the training process. +You can enable gradient accumulation by adding the gradient_accumulation_steps argument to [TrainingArguments]: +py +training_args = TrainingArguments(per_device_train_batch_size=1, gradient_accumulation_steps=4, **default_args) +In the above example, your effective batch size becomes 4. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_special/chunk_0.txt b/chunked/content_aware_chunking/_perf_train_special/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..1ab59848cc025da6d810e9a7d422d35c1cdc5c38 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_special/chunk_0.txt @@ -0,0 +1,2 @@ +PyTorch training on Apple silicon +Previously, training models on a Mac was limited to the CPU only. With the release of PyTorch v1.12, you can take advantage of training models with Apple's silicon GPUs for significantly faster performance and training. This is powered in PyTorch by integrating Apple's Metal Performance Shaders (MPS) as a backend. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_special/chunk_1.txt b/chunked/content_aware_chunking/_perf_train_special/chunk_1.txt new file mode 100644 index 0000000000000000000000000000000000000000..5b47852d4b3d25a14e488bca90c28c135976ec5b --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_special/chunk_1.txt @@ -0,0 +1,3 @@ +The MPS backend implements PyTorch operations as custom Metal shaders and places these modules on a mps device. + +Some PyTorch operations are not implemented in MPS yet and will throw an error. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_special/chunk_2.txt b/chunked/content_aware_chunking/_perf_train_special/chunk_2.txt new file mode 100644 index 0000000000000000000000000000000000000000..1427bc6de7c40dd3a5584d572195856c2bda7edd --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_special/chunk_2.txt @@ -0,0 +1,11 @@ +To avoid this, you should set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to use the CPU kernels instead (you'll still see a UserWarning). + +If you run into any other errors, please open an issue in the PyTorch repository because the [Trainer] only integrates the MPS backend. + +With the mps device set, you can: + +train larger networks or batch sizes locally +reduce data retrieval latency because the GPU's unified memory architecture allows direct access to the full memory store +reduce costs because you don't need to train on cloud-based GPUs or add additional local GPUs + +Get started by making sure you have PyTorch installed. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_special/chunk_3.txt b/chunked/content_aware_chunking/_perf_train_special/chunk_3.txt new file mode 100644 index 0000000000000000000000000000000000000000..adbd6579233031a237f757ad3617a1e2ebb5c8fa --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_special/chunk_3.txt @@ -0,0 +1,4 @@ +MPS acceleration is supported on macOS 12.3+. + +pip install torch torchvision torchaudio +[TrainingArguments] uses the mps device by default if it's available which means you don't need to explicitly set the device. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_special/chunk_4.txt b/chunked/content_aware_chunking/_perf_train_special/chunk_4.txt new file mode 100644 index 0000000000000000000000000000000000000000..78001e05f4f123de6d0c71d3c5fa3c9e9d64610a --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_special/chunk_4.txt @@ -0,0 +1,18 @@ +For example, you can run the run_glue.py script with the MPS backend automatically enabled without making any changes. + +export TASK_NAME=mrpc +python examples/pytorch/text-classification/run_glue.py \ + --model_name_or_path google-bert/bert-base-cased \ + --task_name $TASK_NAME \ +- --use_mps_device \ + --do_train \ + --do_eval \ + --max_seq_length 128 \ + --per_device_train_batch_size 32 \ + --learning_rate 2e-5 \ + --num_train_epochs 3 \ + --output_dir /tmp/$TASK_NAME/ \ + --overwrite_output_dir + +Backends for distributed setups like gloo and nccl are not supported by the mps device which means you can only train on a single GPU with the MPS backend. +You can learn more about the MPS backend in the Introducing Accelerated PyTorch Training on Mac blog post.. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_0.txt b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..3c2273a9217ed2e9b7ebb294f944f199b12b2ff3 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_0.txt @@ -0,0 +1,6 @@ +Training on TPU with TensorFlow + +If you don't need long explanations and just want TPU code samples to get started with, check out our TPU example notebook! + +What is a TPU? +A TPU is a Tensor Processing Unit. They are hardware designed by Google, which are used to greatly speed up the tensor computations within neural networks, much like GPUs. They can be used for both network training and inference. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_1.txt b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_1.txt new file mode 100644 index 0000000000000000000000000000000000000000..c3fc176b11f8df88033a5eaddd7ba74ea149439a --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_1.txt @@ -0,0 +1,4 @@ +They are generally accessed through Google’s cloud services, but small TPUs can also be accessed directly for free through Google Colab and Kaggle Kernels. +Because all TensorFlow models in 🤗 Transformers are Keras models, most of the methods in this document are generally applicable to TPU training for any Keras model! However, there are a few points that are specific to the HuggingFace ecosystem (hug-o-system?) of Transformers and Datasets, and we’ll make sure to flag them up when we get to them. +What kinds of TPU are available? +New users are often very confused by the range of TPUs, and the different ways to access them. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_10.txt b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_10.txt new file mode 100644 index 0000000000000000000000000000000000000000..3796d9ac68598cd838057bf601fe867f3db54936 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_10.txt @@ -0,0 +1,3 @@ +Be sure to note the caveats below about XLA compatibility, though! + +Tip born of painful experience: Although using jit_compile=True is a good way to get a speed boost and test if your CPU/GPU code is XLA-compatible, it can actually cause a lot of problems if you leave it in when actually training on TPU. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_11.txt b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_11.txt new file mode 100644 index 0000000000000000000000000000000000000000..614017c77e1944a9446d00bc6826af07c1750c3b --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_11.txt @@ -0,0 +1,6 @@ +XLA compilation will happen implicitly on TPU, so remember to remove that line before actually running your code on a TPU! + +How do I make my model XLA compatible? +In many cases, your code is probably XLA-compatible already! However, there are a few things that work in normal TensorFlow that don’t work in XLA. We’ve distilled them into three core rules below: + +🤗Specific HuggingFace Tip🤗: We’ve put a lot of effort into rewriting our TensorFlow models and loss functions to be XLA-compatible. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_12.txt b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_12.txt new file mode 100644 index 0000000000000000000000000000000000000000..cc5809cd51e844d78ff97f38538a18d85fb37d66 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_12.txt @@ -0,0 +1,4 @@ +Our models and loss functions generally obey rule #1 and #2 by default, so you can skip over them if you’re using transformers models. Don’t forget about these rules when writing your own models and loss functions, though! + +XLA Rule #1: Your code cannot have “data-dependent conditionals” +What that means is that any if statement cannot depend on values inside a tf.Tensor. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_13.txt b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_13.txt new file mode 100644 index 0000000000000000000000000000000000000000..392a2ef054779b7b638f94fe4afe7a728dab3ea2 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_13.txt @@ -0,0 +1,5 @@ +For example, this code block cannot be compiled with XLA! +python +if tf.reduce_sum(tensor) > 10: + tensor = tensor / 2.0 +This might seem very restrictive at first, but most neural net code doesn’t need to do this. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_14.txt b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_14.txt new file mode 100644 index 0000000000000000000000000000000000000000..a53ed6c51468155df2550b9d61bec46b85ae5a99 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_14.txt @@ -0,0 +1,7 @@ +You can often get around this restriction by using tf.cond (see the documentation here) or by removing the conditional and finding a clever math trick with indicator variables instead, like so: +python +sum_over_10 = tf.cast(tf.reduce_sum(tensor) > 10, tf.float32) +tensor = tensor / (1.0 + sum_over_10) +This code has exactly the same effect as the code above, but by avoiding a conditional, we ensure it will compile with XLA without problems! +XLA Rule #2: Your code cannot have “data-dependent shapes” +What this means is that the shape of all of the tf.Tensor objects in your code cannot depend on their values. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_15.txt b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_15.txt new file mode 100644 index 0000000000000000000000000000000000000000..be2157b8b3aa029a8666b8049440f3b872485e9f --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_15.txt @@ -0,0 +1,2 @@ +For example, the function tf.unique cannot be compiled with XLA, because it returns a tensor containing one instance of each unique value in the input. The shape of this output will obviously be different depending on how repetitive the input Tensor was, and so XLA refuses to handle it! +In general, most neural network code obeys rule #2 by default. However, there are a few common cases where it becomes a problem. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_16.txt b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_16.txt new file mode 100644 index 0000000000000000000000000000000000000000..a512589d18a7ce649941ea7005dbbad699514c7c --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_16.txt @@ -0,0 +1 @@ +One very common one is when you use label masking, setting your labels to a negative value to indicate that those positions should be ignored when computing the loss. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_17.txt b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_17.txt new file mode 100644 index 0000000000000000000000000000000000000000..c33316359b9623cb6e46e31e05569c189f4ebac6 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_17.txt @@ -0,0 +1,8 @@ +If you look at NumPy or PyTorch loss functions that support label masking, you will often see code like this that uses boolean indexing: +python +label_mask = labels >= 0 +masked_outputs = outputs[label_mask] +masked_labels = labels[label_mask] +loss = compute_loss(masked_outputs, masked_labels) +mean_loss = torch.mean(loss) +This code is totally fine in NumPy or PyTorch, but it breaks in XLA! Why? Because the shape of masked_outputs and masked_labels depends on how many positions are masked - that makes it a data-dependent shape. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_18.txt b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_18.txt new file mode 100644 index 0000000000000000000000000000000000000000..72d1c4e55b4ed306e9ca0a436dff07d56b041389 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_18.txt @@ -0,0 +1,7 @@ +However, just like for rule #1, we can often rewrite this code to yield exactly the same output without any data-dependent shapes. 
+python +label_mask = tf.cast(labels >= 0, tf.float32) +loss = compute_loss(outputs, labels) +loss = loss * label_mask # Set negative label positions to 0 +mean_loss = tf.reduce_sum(loss) / tf.reduce_sum(label_mask) +Here, we avoid data-dependent shapes by computing the loss for every position, but zeroing out the masked positions in both the numerator and denominator when we calculate the mean, which yields exactly the same result as the first block while maintaining XLA compatibility. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_19.txt b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_19.txt new file mode 100644 index 0000000000000000000000000000000000000000..bf4cfccaebfdbfd5e718577a1412228c3dc379b4 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_19.txt @@ -0,0 +1,3 @@ +Note that we use the same trick as in rule #1 - converting a tf.bool to tf.float32 and using it as an indicator variable. This is a really useful trick, so remember it if you need to convert your own code to XLA! +XLA Rule #3: XLA will need to recompile your model for every different input shape it sees +This is the big one. What this means is that if your input shapes are very variable, XLA will have to recompile your model over and over, which will create huge performance problems. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_2.txt b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_2.txt new file mode 100644 index 0000000000000000000000000000000000000000..af3424aeb88d0d730aa825e0818a3aa646ff1455 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_2.txt @@ -0,0 +1,2 @@ +The first key distinction to understand is the difference between TPU Nodes and TPU VMs. +When you use a TPU Node, you are effectively indirectly accessing a remote TPU. 
You will need a separate VM, which will initialize your network and data pipeline and then forward them to the remote node. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_20.txt b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_20.txt new file mode 100644 index 0000000000000000000000000000000000000000..2e8a60a86beba514a1a0c08e3b5526d4bf01c5a9 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_20.txt @@ -0,0 +1,2 @@ +This commonly arises in NLP models, where input texts have variable lengths after tokenization. In other modalities, static shapes are more common and this rule is much less of a problem. +How can you get around rule #3? The key is padding - if you pad all your inputs to the same length, and then use an attention_mask, you can get the same results as you’d get from variable shapes, but without any XLA issues. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_21.txt b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_21.txt new file mode 100644 index 0000000000000000000000000000000000000000..fc262b93361bf4fda56d8ec521a375853f75b74c --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_21.txt @@ -0,0 +1,2 @@ +However, excessive padding can cause severe slowdown too - if you pad all your samples to the maximum length in the whole dataset, you might end up with batches consisting of endless padding tokens, which will waste a lot of compute and memory! +There isn’t a perfect solution to this problem. However, you can try some tricks. One very useful trick is to pad batches of samples up to a multiple of a number like 32 or 64 tokens.
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_22.txt b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_22.txt new file mode 100644 index 0000000000000000000000000000000000000000..faeb42ac04b045baaf0217e0f5b4896d94ecb76a --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_22.txt @@ -0,0 +1,3 @@ +This often only increases the number of tokens by a small amount, but it hugely reduces the number of unique input shapes, because every input shape now has to be a multiple of 32 or 64. Fewer unique input shapes means fewer XLA compilations! + +🤗Specific HuggingFace Tip🤗: Our tokenizers and data collators have methods that can help you here. You can use padding="max_length" or padding="longest" when calling tokenizers to get them to output padded data. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_23.txt b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_23.txt new file mode 100644 index 0000000000000000000000000000000000000000..021f83c291fe9cf094aa3ed3d63e527502beaa0b --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_23.txt @@ -0,0 +1,4 @@ +Our tokenizers and data collators also have a pad_to_multiple_of argument that you can use to reduce the number of unique input shapes you see! + +How do I actually train my model on TPU? +Once your training is XLA-compatible and (if you’re using TPU Node / Colab) your dataset has been prepared appropriately, running on TPU is surprisingly easy! All you really need to change in your code is to add a few lines to initialize your TPU, and to ensure that your model and dataset are created inside a TPUStrategy scope. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_24.txt b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_24.txt new file mode 100644 index 0000000000000000000000000000000000000000..d6eddb92411dc69ba7e5f39ede68b177108ae60c --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_24.txt @@ -0,0 +1,15 @@ +Take a look at our TPU example notebook to see this in action! +Summary +There was a lot in here, so let’s summarize with a quick checklist you can follow when you want to get your model ready for TPU training: + +Make sure your code follows the three rules of XLA +Compile your model with jit_compile=True on CPU/GPU and confirm that you can train it with XLA +Either load your dataset into memory or use a TPU-compatible dataset loading approach (see notebook) +Migrate your code either to Colab (with accelerator set to “TPU”) or a TPU VM on Google Cloud +Add TPU initializer code (see notebook) +Create your TPUStrategy and make sure dataset loading and model creation are inside the strategy.scope() (see notebook) +Don’t forget to take jit_compile=True out again when you move to TPU! +🙏🙏🙏🥺🥺🥺 +Call model.fit() +You did it! +. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_3.txt b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_3.txt new file mode 100644 index 0000000000000000000000000000000000000000..bb1be55e0de95563133487c548ade4678c8a9b75 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_3.txt @@ -0,0 +1,6 @@ +When you use a TPU on Google Colab, you are accessing it in the TPU Node style. +Using TPU Nodes can have some quite unexpected behaviour for people who aren’t used to them! 
In particular, because the TPU is located on a physically different system to the machine you’re running your Python code on, your data cannot be local to your machine - any data pipeline that loads from your machine’s internal storage will totally fail! Instead, data must be stored in Google Cloud Storage where your data pipeline can still access it, even when the pipeline is running on the remote TPU node. + +If you can fit all your data in memory as np.ndarray or tf.Tensor, then you can fit() on that data even when using Colab or a TPU Node, without needing to upload it to Google Cloud Storage. + +🤗Specific Hugging Face Tip🤗: The methods Dataset.to_tf_dataset() and its higher-level wrapper model.prepare_tf_dataset() , which you will see throughout our TF code examples, will both fail on a TPU Node. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_4.txt b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_4.txt new file mode 100644 index 0000000000000000000000000000000000000000..0d4517bf6e2e7cc5c271e0b760741ca479b8f826 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_4.txt @@ -0,0 +1,3 @@ +The reason for this is that even though they create a tf.data.Dataset it is not a “pure” tf.data pipeline and uses tf.numpy_function or Dataset.from_generator() to stream data from the underlying HuggingFace Dataset. This HuggingFace Dataset is backed by data that is on a local disc and which the remote TPU Node will not be able to read. + +The second way to access a TPU is via a TPU VM. When using a TPU VM, you connect directly to the machine that the TPU is attached to, much like training on a GPU VM. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_5.txt b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_5.txt new file mode 100644 index 0000000000000000000000000000000000000000..ea937cfdeff98719d3587d1452087f67e9118975 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_5.txt @@ -0,0 +1,2 @@ +TPU VMs are generally easier to work with, particularly when it comes to your data pipeline. All of the above warnings do not apply to TPU VMs! +This is an opinionated document, so here’s our opinion: Avoid using TPU Node if possible. It is more confusing and more difficult to debug than TPU VMs. It is also likely to be unsupported in future - Google’s latest TPU, TPUv4, can only be accessed as a TPU VM, which suggests that TPU Nodes are increasingly going to become a “legacy” access method. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_6.txt b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_6.txt new file mode 100644 index 0000000000000000000000000000000000000000..b43fc15a23791cbd405abeb6d5a5653dbf287b35 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_6.txt @@ -0,0 +1,3 @@ +However, we understand that the only free TPU access is on Colab and Kaggle Kernels, which uses TPU Node - so we’ll try to explain how to handle it if you have to! Check the TPU example notebook for code samples that explain this in more detail. +What sizes of TPU are available? +A single TPU (a v2-8/v3-8/v4-8) runs 8 replicas. TPUs exist in pods that can run hundreds or thousands of replicas simultaneously. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_7.txt b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_7.txt new file mode 100644 index 0000000000000000000000000000000000000000..74085d6cdbc9f6ca82293a53187f3ca003611657 --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_7.txt @@ -0,0 +1,4 @@ +When you use more than a single TPU but less than a whole pod (for example, a v3-32), your TPU fleet is referred to as a pod slice. +When you access a free TPU via Colab, you generally get a single v2-8 TPU. +I keep hearing about this XLA thing. What’s XLA, and how does it relate to TPUs? +XLA is an optimizing compiler, used by both TensorFlow and JAX. In JAX it is the only compiler, whereas in TensorFlow it is optional (but mandatory on TPU!). \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_8.txt b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_8.txt new file mode 100644 index 0000000000000000000000000000000000000000..60cbce9a341497ab41bd2d21ce9080116270fc0f --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_8.txt @@ -0,0 +1,2 @@ +The easiest way to enable it when training a Keras model is to pass the argument jit_compile=True to model.compile(). If you don’t get any errors and performance is good, that’s a great sign that you’re ready to move to TPU! +Debugging on TPU is generally a bit harder than on CPU/GPU, so we recommend getting your code running on CPU/GPU with XLA first before trying it on TPU. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_9.txt b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_9.txt new file mode 100644 index 0000000000000000000000000000000000000000..61cce2cc666c2a9b6ce903994a77b6c96aa0bbbb --- /dev/null +++ b/chunked/content_aware_chunking/_perf_train_tpu_tf/chunk_9.txt @@ -0,0 +1,3 @@ +You don’t have to train for long, of course - just for a few steps to make sure that your model and data pipeline are working like you expect them to. + +XLA compiled code is usually faster - so even if you’re not planning to run on TPU, adding jit_compile=True can improve your performance. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_performance/chunk_0.txt b/chunked/content_aware_chunking/_performance/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..692471173a8fd94956f07c9950c6e94f2d423e99 --- /dev/null +++ b/chunked/content_aware_chunking/_performance/chunk_0.txt @@ -0,0 +1,5 @@ +Performance and Scalability +Training large transformer models and deploying them to production present various challenges. +During training, the model may require more GPU memory than available or exhibit slow training speed. In the deployment +phase, the model can struggle to handle the required throughput in a production environment. +This documentation aims to assist you in overcoming these challenges and finding the optimal setting for your use-case. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_performance/chunk_1.txt b/chunked/content_aware_chunking/_performance/chunk_1.txt new file mode 100644 index 0000000000000000000000000000000000000000..e84de479b110eb7dbfd6dee4a7916f854f428997 --- /dev/null +++ b/chunked/content_aware_chunking/_performance/chunk_1.txt @@ -0,0 +1,6 @@ +The guides are divided into training and inference sections, as each comes with different challenges and solutions. 
+Within each section you'll find separate guides for different hardware configurations, such as single GPU vs. multi-GPU +for training or CPU vs. GPU for inference. +Use this document as your starting point to navigate further to the methods that match your scenario. +Training +Training large transformer models efficiently requires an accelerator such as a GPU or TPU. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_performance/chunk_2.txt b/chunked/content_aware_chunking/_performance/chunk_2.txt new file mode 100644 index 0000000000000000000000000000000000000000..4a68e18051ed2de1b147b8e3d027fba5cac67ee6 --- /dev/null +++ b/chunked/content_aware_chunking/_performance/chunk_2.txt @@ -0,0 +1,6 @@ +The most common case is where +you have a single GPU. The methods that you can apply to improve training efficiency on a single GPU extend to other setups +such as multiple GPUs. However, there are also techniques that are specific to multi-GPU or CPU training. We cover them in +separate sections. + +Methods and tools for efficient training on a single GPU: start here to learn common approaches that can help optimize GPU memory utilization, speed up the training, or both. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_performance/chunk_3.txt b/chunked/content_aware_chunking/_performance/chunk_3.txt new file mode 100644 index 0000000000000000000000000000000000000000..295cea4f14826b11192cb1af562d91c1c077cb66 --- /dev/null +++ b/chunked/content_aware_chunking/_performance/chunk_3.txt @@ -0,0 +1,4 @@ +Multi-GPU training section: explore this section to learn about further optimization methods that apply to multi-GPU settings, such as data, tensor, and pipeline parallelism. +CPU training section: learn about mixed precision training on CPU. +Efficient Training on Multiple CPUs: learn about distributed CPU training.
+Training on TPU with TensorFlow: if you are new to TPUs, refer to this section for an opinionated introduction to training on TPUs and using XLA. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_performance/chunk_4.txt b/chunked/content_aware_chunking/_performance/chunk_4.txt new file mode 100644 index 0000000000000000000000000000000000000000..2f7059a104ebfe0c59bc4f81fb73bcb3931cb2cd --- /dev/null +++ b/chunked/content_aware_chunking/_performance/chunk_4.txt @@ -0,0 +1,5 @@ +Custom hardware for training: find tips and tricks when building your own deep learning rig. +Hyperparameter Search using Trainer API + +Inference +Efficient inference with large models in a production environment can be as challenging as training them. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_performance/chunk_5.txt b/chunked/content_aware_chunking/_performance/chunk_5.txt new file mode 100644 index 0000000000000000000000000000000000000000..40aaec3aa2e78ff98ec95afc2a6b289e13e2f946 --- /dev/null +++ b/chunked/content_aware_chunking/_performance/chunk_5.txt @@ -0,0 +1,19 @@ +In the following +sections we go through the steps to run inference on CPU and single/multi-GPU setups. + +Inference on a single CPU +Inference on a single GPU +Multi-GPU inference +XLA Integration for TensorFlow Models + +Training and inference +Here you'll find techniques, tips and tricks that apply whether you are training a model, or running inference with it. + +Instantiating a big model +Troubleshooting performance issues + +Contribute +This document is far from being complete and a lot more needs to be added, so if you have additions or corrections to +make please don't hesitate to open a PR or if you aren't sure start an Issue and we can discuss the details there. +When making contributions that A is better than B, please try to include a reproducible benchmark and/or a link to the +source of that information (unless it comes directly from you).. 
\ No newline at end of file diff --git a/chunked/content_aware_chunking/_perplexity/chunk_0.txt b/chunked/content_aware_chunking/_perplexity/chunk_0.txt new file mode 100644 index 0000000000000000000000000000000000000000..04b6a642448c2c950bce9e033fc46b3b8c098c32 --- /dev/null +++ b/chunked/content_aware_chunking/_perplexity/chunk_0.txt @@ -0,0 +1,6 @@ +Perplexity of fixed-length models +[[open-in-colab]] +Perplexity (PPL) is one of the most common metrics for evaluating language models. Before diving in, we should note +that the metric applies specifically to classical language models (sometimes called autoregressive or causal language +models) and is not well defined for masked language models like BERT (see summary of the models). +Perplexity is defined as the exponentiated average negative log-likelihood of a sequence. \ No newline at end of file diff --git a/chunked/content_aware_chunking/_perplexity/chunk_1.txt b/chunked/content_aware_chunking/_perplexity/chunk_1.txt new file mode 100644 index 0000000000000000000000000000000000000000..5899c32813566902e42dee1eb47ce07c678ee4eb --- /dev/null +++ b/chunked/content_aware_chunking/_perplexity/chunk_1.txt @@ -0,0 +1,4 @@ +If we have a tokenized +sequence \(X = (x_0, x_1, \dots, x_t)\), then the perplexity of \(X\) is, +$$\text{PPL}(X) = \exp \left{ {-\frac{1}{t}\sum_i^t \log p_\theta (x_i|x_{