Ahmadzei committed on
Commit
57bdca5
1 Parent(s): 190997b
This view is limited to 50 files because it contains too many changes. See raw diff
Files changed (50)
  1. .gitattributes +2 -0
  2. Copy_of_rag_homework_fin.ipynb +1315 -0
  3. README.md +5 -4
  4. gradio_app/app.py → app.py +0 -0
  5. chunked/content_aware_chunking/__config/chunk_0.txt +13 -0
  6. chunked/content_aware_chunking/__redirects/chunk_0.txt +2 -0
  7. chunked/content_aware_chunking/__toctree/chunk_0.txt +0 -0
  8. chunked/content_aware_chunking/__toctree/chunk_1.txt +836 -0
  9. chunked/content_aware_chunking/_accelerate/chunk_0.txt +2 -0
  10. chunked/content_aware_chunking/_accelerate/chunk_1.txt +6 -0
  11. chunked/content_aware_chunking/_accelerate/chunk_2.txt +7 -0
  12. chunked/content_aware_chunking/_accelerate/chunk_3.txt +72 -0
  13. chunked/content_aware_chunking/_accelerate/chunk_4.txt +6 -0
  14. chunked/content_aware_chunking/_add_new_model/chunk_0.txt +2 -0
  15. chunked/content_aware_chunking/_add_new_model/chunk_1.txt +12 -0
  16. chunked/content_aware_chunking/_add_new_model/chunk_10.txt +2 -0
  17. chunked/content_aware_chunking/_add_new_model/chunk_100.txt +5 -0
  18. chunked/content_aware_chunking/_add_new_model/chunk_101.txt +5 -0
  19. chunked/content_aware_chunking/_add_new_model/chunk_102.txt +10 -0
  20. chunked/content_aware_chunking/_add_new_model/chunk_103.txt +5 -0
  21. chunked/content_aware_chunking/_add_new_model/chunk_104.txt +5 -0
  22. chunked/content_aware_chunking/_add_new_model/chunk_105.txt +8 -0
  23. chunked/content_aware_chunking/_add_new_model/chunk_106.txt +7 -0
  24. chunked/content_aware_chunking/_add_new_model/chunk_107.txt +2 -0
  25. chunked/content_aware_chunking/_add_new_model/chunk_108.txt +7 -0
  26. chunked/content_aware_chunking/_add_new_model/chunk_109.txt +4 -0
  27. chunked/content_aware_chunking/_add_new_model/chunk_11.txt +9 -0
  28. chunked/content_aware_chunking/_add_new_model/chunk_12.txt +5 -0
  29. chunked/content_aware_chunking/_add_new_model/chunk_13.txt +6 -0
  30. chunked/content_aware_chunking/_add_new_model/chunk_14.txt +8 -0
  31. chunked/content_aware_chunking/_add_new_model/chunk_15.txt +11 -0
  32. chunked/content_aware_chunking/_add_new_model/chunk_16.txt +4 -0
  33. chunked/content_aware_chunking/_add_new_model/chunk_17.txt +4 -0
  34. chunked/content_aware_chunking/_add_new_model/chunk_18.txt +21 -0
  35. chunked/content_aware_chunking/_add_new_model/chunk_19.txt +6 -0
  36. chunked/content_aware_chunking/_add_new_model/chunk_2.txt +5 -0
  37. chunked/content_aware_chunking/_add_new_model/chunk_20.txt +5 -0
  38. chunked/content_aware_chunking/_add_new_model/chunk_21.txt +15 -0
  39. chunked/content_aware_chunking/_add_new_model/chunk_22.txt +6 -0
  40. chunked/content_aware_chunking/_add_new_model/chunk_23.txt +15 -0
  41. chunked/content_aware_chunking/_add_new_model/chunk_24.txt +12 -0
  42. chunked/content_aware_chunking/_add_new_model/chunk_25.txt +10 -0
  43. chunked/content_aware_chunking/_add_new_model/chunk_26.txt +5 -0
  44. chunked/content_aware_chunking/_add_new_model/chunk_27.txt +5 -0
  45. chunked/content_aware_chunking/_add_new_model/chunk_28.txt +10 -0
  46. chunked/content_aware_chunking/_add_new_model/chunk_29.txt +2 -0
  47. chunked/content_aware_chunking/_add_new_model/chunk_3.txt +5 -0
  48. chunked/content_aware_chunking/_add_new_model/chunk_30.txt +7 -0
  49. chunked/content_aware_chunking/_add_new_model/chunk_31.txt +5 -0
  50. chunked/content_aware_chunking/_add_new_model/chunk_32.txt +10 -0
.gitattributes ADDED
@@ -0,0 +1,2 @@
1
+ *.lance filter=lfs diff=lfs merge=lfs -text
2
+ *.idx filter=lfs diff=lfs merge=lfs -text
Copy_of_rag_homework_fin.ipynb ADDED
@@ -0,0 +1,1315 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "844fe3af-9cf1-4c66-aa78-b88a3429acc6",
6
+ "metadata": {
7
+ "id": "844fe3af-9cf1-4c66-aa78-b88a3429acc6"
8
+ },
9
+ "source": [
10
+ "### 0. Setup\n",
11
+ "1) Clone https://github.com/plaggy/rag-gradio-sample-project and set up an environment with gradio_app/requirements.txt.\n",
12
+ "\n",
13
+ "There you'll find the following files:\n",
14
+ "- [prep_scripts/markdown_to_text.py](https://github.com/plaggy/rag-gradio-sample-project/blob/main/prep_scripts/markdown_to_text.py) processes markdown into text; you won't need to change it.\n",
15
+ "- [prep_scripts/lancedb_setup.py](https://github.com/plaggy/rag-gradio-sample-project/blob/main/prep_scripts/lancedb_setup.py) is the file where the database is created and, in particular, an embedding model is defined.\n",
16
+ "- [gradio_app/backend/query_llm.py](https://github.com/plaggy/rag-gradio-sample-project/blob/main/gradio_app/backend/query_llm.py) defines what LLM is used.\n",
17
+ "- [gradio_app/app.py](https://github.com/plaggy/rag-gradio-sample-project/blob/main/gradio_app/app.py) creates the gradio app.\n",
18
+ "\n",
19
+ "In this task you'll try not only OpenAI models, but also open-source models from the Hugging Face Hub through the InferenceClient interface (see [gradio_app/backend/query_llm.py](https://github.com/plaggy/rag-gradio-sample-project/blob/main/gradio_app/backend/query_llm.py)). Please don't forget to obtain a Hugging Face token for that (see https://huggingface.co/settings/tokens).\n",
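+ "\n",
+ "For reference, the call pattern used there is roughly the following (the model name is only an example and `hf_...` stands for your token; see query_llm.py for the actual models and prompt handling):\n",
+ "\n",
+ "```python\n",
+ "from huggingface_hub import InferenceClient\n",
+ "\n",
+ "# token from https://huggingface.co/settings/tokens\n",
+ "client = InferenceClient(model='mistralai/Mistral-7B-Instruct-v0.2', token='hf_...')\n",
+ "answer = client.text_generation('What is RAG?', max_new_tokens=256)\n",
+ "print(answer)\n",
+ "```\n",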
20
+ "\n",
21
+ "\n",
22
+ "A convenient way to work through the project is to test locally and keep committing the changes to the [HF Spaces](https://huggingface.co/spaces) repo. A space gets automatically rebuilt after each commit and you get a new version of your application up and running.\n",
23
+ "\n",
24
+ "2) Create a new Space with the Gradio SDK. You'll get an almost empty repo; the only thing you'll need from it is README.md, which has a config letting the Space builder know that it's a Gradio app. Reset the remote upstream of your local rag-gradio-sample-project clone to be your freshly created Spaces repository.\n",
25
+ "\n",
26
+ "The easiest way to set your Space up is to make the gradio_app folder a git repo, set the remote origin to your Space repo, and check out the remote README:\n",
27
+ "\n",
28
+ "```\n",
29
+ "cd gradio_app\n",
30
+ "git init\n",
31
+ "git remote add origin <your spaces repo url>\n",
32
+ "git fetch\n",
33
+ "git checkout origin/main README.md\n",
34
+ "```\n",
35
+ "\n",
36
+ "The Space is not working yet. You'll get the first working version after Step 3.\n",
37
+ "\n",
38
+ "- Clone https://github.com/huggingface/transformers to a local machine and run the prep_scripts/markdown_to_text.py script to extract raw text from transformers/docs/source/en/. This will be your knowledge base; it doesn't need to be part of your repository.\n",
39
+ "\n",
40
+ "Run the command as follows (pass arguments that work for you):\n",
41
+ "```\n",
42
+ "python prep_scripts/markdown_to_text.py --input-dir transformers/docs/source/en/ --output-dir docs\n",
43
+ "```\n"
44
+ ]
45
+ },
46
+ {
47
+ "cell_type": "markdown",
48
+ "id": "762e9fde-c1f4-464c-b12b-dca602fac5ba",
49
+ "metadata": {
50
+ "id": "762e9fde-c1f4-464c-b12b-dca602fac5ba"
51
+ },
52
+ "source": [
53
+ "**By design, you'll be running your experiments in a [Gradio Space](https://huggingface.co/docs/hub/en/spaces-sdks-gradio). Apart from the deliverables for each step, you'll need to provide a link to a functioning RAG Space in its final state!**"
54
+ ]
55
+ },
56
+ {
57
+ "cell_type": "code",
58
+ "source": [
59
+ "!git clone https://github.com/plaggy/rag-gradio-sample-project"
60
+ ],
61
+ "metadata": {
62
+ "colab": {
63
+ "base_uri": "https://localhost:8080/"
64
+ },
65
+ "id": "BUHKUeqR7unC",
66
+ "outputId": "92617e28-da69-45e3-b34e-2b88876ae3dd"
67
+ },
68
+ "id": "BUHKUeqR7unC",
69
+ "execution_count": 1,
70
+ "outputs": [
71
+ {
72
+ "output_type": "stream",
73
+ "name": "stdout",
74
+ "text": [
75
+ "Cloning into 'rag-gradio-sample-project'...\n",
76
+ "remote: Enumerating objects: 73, done.\u001b[K\n",
77
+ "remote: Counting objects: 100% (73/73), done.\u001b[K\n",
78
+ "remote: Compressing objects: 100% (59/59), done.\u001b[K\n",
79
+ "remote: Total 73 (delta 23), reused 57 (delta 14), pack-reused 0\u001b[K\n",
80
+ "Receiving objects: 100% (73/73), 31.10 KiB | 10.37 MiB/s, done.\n",
81
+ "Resolving deltas: 100% (23/23), done.\n"
82
+ ]
83
+ }
84
+ ]
85
+ },
86
+ {
87
+ "cell_type": "code",
88
+ "source": [
89
+ "!pip install -r /content/rag-gradio-sample-project/gradio_app/requirements.txt"
90
+ ],
91
+ "metadata": {
92
+ "colab": {
93
+ "base_uri": "https://localhost:8080/"
94
+ },
95
+ "id": "FFIvgBYDcVMt",
96
+ "outputId": "3c53faf0-f87e-4d19-bbac-90401cc70b71"
97
+ },
98
+ "id": "FFIvgBYDcVMt",
99
+ "execution_count": 2,
100
+ "outputs": [
101
+ {
102
+ "output_type": "stream",
103
+ "name": "stdout",
104
+ "text": [
105
+ "Collecting lancedb==0.5.3 (from -r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1))\n",
106
+ " Downloading lancedb-0.5.3-py3-none-any.whl (106 kB)\n",
107
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m107.0/107.0 kB\u001b[0m \u001b[31m2.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
108
+ "\u001b[?25hCollecting openai==1.11.1 (from -r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 2))\n",
109
+ " Downloading openai-1.11.1-py3-none-any.whl (226 kB)\n",
110
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m226.1/226.1 kB\u001b[0m \u001b[31m19.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
111
+ "\u001b[?25hCollecting sentence-transformers==2.3.1 (from -r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 3))\n",
112
+ " Downloading sentence_transformers-2.3.1-py3-none-any.whl (132 kB)\n",
113
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m132.8/132.8 kB\u001b[0m \u001b[31m15.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
114
+ "\u001b[?25hCollecting tqdm==4.66.1 (from -r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 4))\n",
115
+ " Downloading tqdm-4.66.1-py3-none-any.whl (78 kB)\n",
116
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m78.3/78.3 kB\u001b[0m \u001b[31m9.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
117
+ "\u001b[?25hCollecting torch==2.1.1 (from -r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5))\n",
118
+ " Downloading torch-2.1.1-cp310-cp310-manylinux1_x86_64.whl (670.2 MB)\n",
119
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m670.2/670.2 MB\u001b[0m \u001b[31m2.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
120
+ "\u001b[?25hCollecting transformers==4.37.2 (from -r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 6))\n",
121
+ " Downloading transformers-4.37.2-py3-none-any.whl (8.4 MB)\n",
122
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m8.4/8.4 MB\u001b[0m \u001b[31m89.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
123
+ "\u001b[?25hCollecting deprecation (from lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1))\n",
124
+ " Downloading deprecation-2.1.0-py2.py3-none-any.whl (11 kB)\n",
125
+ "Collecting pylance==0.9.12 (from lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1))\n",
126
+ " Downloading pylance-0.9.12-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (21.4 MB)\n",
127
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m21.4/21.4 MB\u001b[0m \u001b[31m68.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
128
+ "\u001b[?25hCollecting ratelimiter~=1.0 (from lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1))\n",
129
+ " Downloading ratelimiter-1.2.0.post0-py3-none-any.whl (6.6 kB)\n",
130
+ "Collecting retry>=0.9.2 (from lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1))\n",
131
+ " Downloading retry-0.9.2-py2.py3-none-any.whl (8.0 kB)\n",
132
+ "Requirement already satisfied: pydantic>=1.10 in /usr/local/lib/python3.10/dist-packages (from lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1)) (2.6.1)\n",
133
+ "Requirement already satisfied: attrs>=21.3.0 in /usr/local/lib/python3.10/dist-packages (from lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1)) (23.2.0)\n",
134
+ "Collecting semver>=3.0 (from lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1))\n",
135
+ " Downloading semver-3.0.2-py3-none-any.whl (17 kB)\n",
136
+ "Requirement already satisfied: cachetools in /usr/local/lib/python3.10/dist-packages (from lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1)) (5.3.2)\n",
137
+ "Requirement already satisfied: pyyaml>=6.0 in /usr/local/lib/python3.10/dist-packages (from lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1)) (6.0.1)\n",
138
+ "Requirement already satisfied: click>=8.1.7 in /usr/local/lib/python3.10/dist-packages (from lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1)) (8.1.7)\n",
139
+ "Requirement already satisfied: requests>=2.31.0 in /usr/local/lib/python3.10/dist-packages (from lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1)) (2.31.0)\n",
140
+ "Collecting overrides>=0.7 (from lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1))\n",
141
+ " Downloading overrides-7.7.0-py3-none-any.whl (17 kB)\n",
142
+ "Requirement already satisfied: anyio<5,>=3.5.0 in /usr/local/lib/python3.10/dist-packages (from openai==1.11.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 2)) (3.7.1)\n",
143
+ "Requirement already satisfied: distro<2,>=1.7.0 in /usr/lib/python3/dist-packages (from openai==1.11.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 2)) (1.7.0)\n",
144
+ "Collecting httpx<1,>=0.23.0 (from openai==1.11.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 2))\n",
145
+ " Downloading httpx-0.26.0-py3-none-any.whl (75 kB)\n",
146
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m75.9/75.9 kB\u001b[0m \u001b[31m7.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
147
+ "\u001b[?25hRequirement already satisfied: sniffio in /usr/local/lib/python3.10/dist-packages (from openai==1.11.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 2)) (1.3.0)\n",
148
+ "Requirement already satisfied: typing-extensions<5,>=4.7 in /usr/local/lib/python3.10/dist-packages (from openai==1.11.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 2)) (4.9.0)\n",
149
+ "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from sentence-transformers==2.3.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 3)) (1.25.2)\n",
150
+ "Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (from sentence-transformers==2.3.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 3)) (1.2.2)\n",
151
+ "Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from sentence-transformers==2.3.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 3)) (1.11.4)\n",
152
+ "Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (from sentence-transformers==2.3.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 3)) (3.8.1)\n",
153
+ "Requirement already satisfied: sentencepiece in /usr/local/lib/python3.10/dist-packages (from sentence-transformers==2.3.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 3)) (0.1.99)\n",
154
+ "Requirement already satisfied: huggingface-hub>=0.15.1 in /usr/local/lib/python3.10/dist-packages (from sentence-transformers==2.3.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 3)) (0.20.3)\n",
155
+ "Requirement already satisfied: Pillow in /usr/local/lib/python3.10/dist-packages (from sentence-transformers==2.3.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 3)) (9.4.0)\n",
156
+ "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5)) (3.13.1)\n",
157
+ "Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5)) (1.12)\n",
158
+ "Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5)) (3.2.1)\n",
159
+ "Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5)) (3.1.3)\n",
160
+ "Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5)) (2023.6.0)\n",
161
+ "Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5))\n",
162
+ " Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)\n",
163
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m23.7/23.7 MB\u001b[0m \u001b[31m59.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
164
+ "\u001b[?25hCollecting nvidia-cuda-runtime-cu12==12.1.105 (from torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5))\n",
165
+ " Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)\n",
166
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m823.6/823.6 kB\u001b[0m \u001b[31m43.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
167
+ "\u001b[?25hCollecting nvidia-cuda-cupti-cu12==12.1.105 (from torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5))\n",
168
+ " Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)\n",
169
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m14.1/14.1 MB\u001b[0m \u001b[31m80.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
170
+ "\u001b[?25hCollecting nvidia-cudnn-cu12==8.9.2.26 (from torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5))\n",
171
+ " Downloading nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)\n",
172
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m731.7/731.7 MB\u001b[0m \u001b[31m764.7 kB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
173
+ "\u001b[?25hCollecting nvidia-cublas-cu12==12.1.3.1 (from torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5))\n",
174
+ " Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)\n",
175
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m410.6/410.6 MB\u001b[0m \u001b[31m2.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
176
+ "\u001b[?25hCollecting nvidia-cufft-cu12==11.0.2.54 (from torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5))\n",
177
+ " Downloading nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)\n",
178
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m121.6/121.6 MB\u001b[0m \u001b[31m8.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
179
+ "\u001b[?25hCollecting nvidia-curand-cu12==10.3.2.106 (from torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5))\n",
180
+ " Downloading nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)\n",
181
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m56.5/56.5 MB\u001b[0m \u001b[31m10.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
182
+ "\u001b[?25hCollecting nvidia-cusolver-cu12==11.4.5.107 (from torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5))\n",
183
+ " Downloading nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl (124.2 MB)\n",
184
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m124.2/124.2 MB\u001b[0m \u001b[31m8.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
185
+ "\u001b[?25hCollecting nvidia-cusparse-cu12==12.1.0.106 (from torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5))\n",
186
+ " Downloading nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl (196.0 MB)\n",
187
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m196.0/196.0 MB\u001b[0m \u001b[31m2.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
188
+ "\u001b[?25hCollecting nvidia-nccl-cu12==2.18.1 (from torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5))\n",
189
+ " Downloading nvidia_nccl_cu12-2.18.1-py3-none-manylinux1_x86_64.whl (209.8 MB)\n",
190
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m209.8/209.8 MB\u001b[0m \u001b[31m5.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
191
+ "\u001b[?25hCollecting nvidia-nvtx-cu12==12.1.105 (from torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5))\n",
192
+ " Downloading nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (99 kB)\n",
193
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m99.1/99.1 kB\u001b[0m \u001b[31m13.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
194
+ "\u001b[?25hRequirement already satisfied: triton==2.1.0 in /usr/local/lib/python3.10/dist-packages (from torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5)) (2.1.0)\n",
195
+ "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from transformers==4.37.2->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 6)) (23.2)\n",
196
+ "Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers==4.37.2->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 6)) (2023.12.25)\n",
197
+ "Requirement already satisfied: tokenizers<0.19,>=0.14 in /usr/local/lib/python3.10/dist-packages (from transformers==4.37.2->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 6)) (0.15.2)\n",
198
+ "Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from transformers==4.37.2->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 6)) (0.4.2)\n",
199
+ "Collecting nvidia-nvjitlink-cu12 (from nvidia-cusolver-cu12==11.4.5.107->torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5))\n",
200
+ " Downloading nvidia_nvjitlink_cu12-12.3.101-py3-none-manylinux1_x86_64.whl (20.5 MB)\n",
201
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m20.5/20.5 MB\u001b[0m \u001b[31m72.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
202
+ "\u001b[?25hCollecting pyarrow>=12 (from pylance==0.9.12->lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1))\n",
203
+ " Downloading pyarrow-15.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (38.3 MB)\n",
204
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m38.3/38.3 MB\u001b[0m \u001b[31m14.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
205
+ "\u001b[?25hRequirement already satisfied: idna>=2.8 in /usr/local/lib/python3.10/dist-packages (from anyio<5,>=3.5.0->openai==1.11.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 2)) (3.6)\n",
206
+ "Requirement already satisfied: exceptiongroup in /usr/local/lib/python3.10/dist-packages (from anyio<5,>=3.5.0->openai==1.11.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 2)) (1.2.0)\n",
207
+ "Requirement already satisfied: certifi in /usr/local/lib/python3.10/dist-packages (from httpx<1,>=0.23.0->openai==1.11.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 2)) (2024.2.2)\n",
208
+ "Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai==1.11.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 2))\n",
209
+ " Downloading httpcore-1.0.3-py3-none-any.whl (77 kB)\n",
210
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m77.0/77.0 kB\u001b[0m \u001b[31m10.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
211
+ "\u001b[?25hCollecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->openai==1.11.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 2))\n",
212
+ " Downloading h11-0.14.0-py3-none-any.whl (58 kB)\n",
213
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m58.3/58.3 kB\u001b[0m \u001b[31m7.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
214
+ "\u001b[?25hRequirement already satisfied: annotated-types>=0.4.0 in /usr/local/lib/python3.10/dist-packages (from pydantic>=1.10->lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1)) (0.6.0)\n",
215
+ "Requirement already satisfied: pydantic-core==2.16.2 in /usr/local/lib/python3.10/dist-packages (from pydantic>=1.10->lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1)) (2.16.2)\n",
216
+ "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests>=2.31.0->lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1)) (3.3.2)\n",
217
+ "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.31.0->lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1)) (2.0.7)\n",
218
+ "Requirement already satisfied: decorator>=3.4.2 in /usr/local/lib/python3.10/dist-packages (from retry>=0.9.2->lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1)) (4.4.2)\n",
219
+ "Collecting py<2.0.0,>=1.4.26 (from retry>=0.9.2->lancedb==0.5.3->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 1))\n",
220
+ " Downloading py-1.11.0-py2.py3-none-any.whl (98 kB)\n",
221
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m98.7/98.7 kB\u001b[0m \u001b[31m12.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
222
+ "\u001b[?25hRequirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5)) (2.1.5)\n",
223
+ "Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk->sentence-transformers==2.3.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 3)) (1.3.2)\n",
224
+ "Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn->sentence-transformers==2.3.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 3)) (3.2.0)\n",
225
+ "Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch==2.1.1->-r /content/rag-gradio-sample-project/gradio_app/requirements.txt (line 5)) (1.3.0)\n",
226
+ "Installing collected packages: ratelimiter, tqdm, semver, pyarrow, py, overrides, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, h11, deprecation, retry, pylance, nvidia-cusparse-cu12, nvidia-cudnn-cu12, httpcore, nvidia-cusolver-cu12, lancedb, httpx, transformers, torch, openai, sentence-transformers\n",
227
+ " Attempting uninstall: tqdm\n",
228
+ " Found existing installation: tqdm 4.66.2\n",
229
+ " Uninstalling tqdm-4.66.2:\n",
230
+ " Successfully uninstalled tqdm-4.66.2\n",
231
+ " Attempting uninstall: pyarrow\n",
232
+ " Found existing installation: pyarrow 10.0.1\n",
233
+ " Uninstalling pyarrow-10.0.1:\n",
234
+ " Successfully uninstalled pyarrow-10.0.1\n",
235
+ " Attempting uninstall: transformers\n",
236
+ " Found existing installation: transformers 4.35.2\n",
237
+ " Uninstalling transformers-4.35.2:\n",
238
+ " Successfully uninstalled transformers-4.35.2\n",
239
+ " Attempting uninstall: torch\n",
240
+ " Found existing installation: torch 2.1.0+cu121\n",
241
+ " Uninstalling torch-2.1.0+cu121:\n",
242
+ " Successfully uninstalled torch-2.1.0+cu121\n",
243
+ "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
244
+ "llmx 0.0.15a0 requires cohere, which is not installed.\n",
245
+ "llmx 0.0.15a0 requires tiktoken, which is not installed.\n",
246
+ "ibis-framework 7.1.0 requires pyarrow<15,>=2, but you have pyarrow 15.0.0 which is incompatible.\n",
247
+ "torchaudio 2.1.0+cu121 requires torch==2.1.0, but you have torch 2.1.1 which is incompatible.\n",
248
+ "torchdata 0.7.0 requires torch==2.1.0, but you have torch 2.1.1 which is incompatible.\n",
249
+ "torchtext 0.16.0 requires torch==2.1.0, but you have torch 2.1.1 which is incompatible.\n",
250
+ "torchvision 0.16.0+cu121 requires torch==2.1.0, but you have torch 2.1.1 which is incompatible.\u001b[0m\u001b[31m\n",
251
+ "\u001b[0mSuccessfully installed deprecation-2.1.0 h11-0.14.0 httpcore-1.0.3 httpx-0.26.0 lancedb-0.5.3 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-8.9.2.26 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.18.1 nvidia-nvjitlink-cu12-12.3.101 nvidia-nvtx-cu12-12.1.105 openai-1.11.1 overrides-7.7.0 py-1.11.0 pyarrow-15.0.0 pylance-0.9.12 ratelimiter-1.2.0.post0 retry-0.9.2 semver-3.0.2 sentence-transformers-2.3.1 torch-2.1.1 tqdm-4.66.1 transformers-4.37.2\n"
252
+ ]
253
+ }
254
+ ]
255
+ },
256
+ {
257
+ "cell_type": "code",
258
+ "source": [
259
+ "!pip install huggingface_hub"
260
+ ],
261
+ "metadata": {
262
+ "colab": {
263
+ "base_uri": "https://localhost:8080/"
264
+ },
265
+ "id": "uVHHnyoedIPy",
266
+ "outputId": "527c17ae-a7db-45db-cf12-46cb07f90342"
267
+ },
268
+ "id": "uVHHnyoedIPy",
269
+ "execution_count": 3,
270
+ "outputs": [
271
+ {
272
+ "output_type": "stream",
273
+ "name": "stdout",
274
+ "text": [
275
+ "Requirement already satisfied: huggingface_hub in /usr/local/lib/python3.10/dist-packages (0.20.3)\n",
276
+ "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from huggingface_hub) (3.13.1)\n",
277
+ "Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.10/dist-packages (from huggingface_hub) (2023.6.0)\n",
278
+ "Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from huggingface_hub) (2.31.0)\n",
279
+ "Requirement already satisfied: tqdm>=4.42.1 in /usr/local/lib/python3.10/dist-packages (from huggingface_hub) (4.66.1)\n",
280
+ "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from huggingface_hub) (6.0.1)\n",
281
+ "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface_hub) (4.9.0)\n",
282
+ "Requirement already satisfied: packaging>=20.9 in /usr/local/lib/python3.10/dist-packages (from huggingface_hub) (23.2)\n",
283
+ "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface_hub) (3.3.2)\n",
284
+ "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface_hub) (3.6)\n",
285
+ "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface_hub) (2.0.7)\n",
286
+ "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface_hub) (2024.2.2)\n"
287
+ ]
288
+ }
289
+ ]
290
+ },
291
+ {
292
+ "cell_type": "code",
293
+ "source": [
294
+ "!huggingface-cli login"
295
+ ],
296
+ "metadata": {
297
+ "colab": {
298
+ "base_uri": "https://localhost:8080/"
299
+ },
300
+ "id": "8-M0jyfGdKYe",
301
+ "outputId": "c7f7d369-c51b-43e6-8aa4-182af93a7f4a"
302
+ },
303
+ "id": "8-M0jyfGdKYe",
304
+ "execution_count": 4,
305
+ "outputs": [
306
+ {
307
+ "output_type": "stream",
308
+ "name": "stdout",
309
+ "text": [
310
+ "\n",
311
+ " _| _| _| _| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _|_|_|_| _|_| _|_|_| _|_|_|_|\n",
312
+ " _| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _|\n",
313
+ " _|_|_|_| _| _| _| _|_| _| _|_| _| _| _| _| _| _|_| _|_|_| _|_|_|_| _| _|_|_|\n",
314
+ " _| _| _| _| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _|\n",
315
+ " _| _| _|_| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _| _| _| _|_|_| _|_|_|_|\n",
316
+ "\n",
317
+ " To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .\n",
318
+ "Token: \n",
319
+ "Add token as git credential? (Y/n) \n",
320
+ "Token is valid (permission: read).\n",
321
+ "\u001b[1m\u001b[31mCannot authenticate through git-credential as no helper is defined on your machine.\n",
322
+ "You might have to re-authenticate when pushing to the Hugging Face Hub.\n",
323
+ "Run the following command in your terminal in case you want to set the 'store' credential helper as default.\n",
324
+ "\n",
325
+ "git config --global credential.helper store\n",
326
+ "\n",
327
+ "Read https://git-scm.com/book/en/v2/Git-Tools-Credential-Storage for more details.\u001b[0m\n",
328
+ "Token has not been saved to git credential helper.\n",
329
+ "Your token has been saved to /root/.cache/huggingface/token\n",
330
+ "Login successful\n"
331
+ ]
332
+ }
333
+ ]
334
+ },
335
+ {
336
+ "cell_type": "code",
337
+ "source": [
338
+ "%cd rag-gradio-sample-project/gradio_app/\n",
339
+ "%ls"
340
+ ],
341
+ "metadata": {
342
+ "colab": {
343
+ "base_uri": "https://localhost:8080/"
344
+ },
345
+ "id": "HjXOD1nH1fx5",
346
+ "outputId": "874e8e62-8730-47e2-b185-c7c1e9cc6cfe"
347
+ },
348
+ "id": "HjXOD1nH1fx5",
349
+ "execution_count": 5,
350
+ "outputs": [
351
+ {
352
+ "output_type": "stream",
353
+ "name": "stdout",
354
+ "text": [
355
+ "/content/rag-gradio-sample-project/gradio_app\n",
356
+ "app.py \u001b[0m\u001b[01;34mbackend\u001b[0m/ requirements.txt \u001b[01;34mtemplates\u001b[0m/\n"
357
+ ]
358
+ }
359
+ ]
360
+ },
361
+ {
362
+ "cell_type": "code",
363
+ "source": [
364
+ "%pwd"
365
+ ],
366
+ "metadata": {
367
+ "colab": {
368
+ "base_uri": "https://localhost:8080/",
369
+ "height": 36
370
+ },
371
+ "id": "55IyOwgr1kNR",
372
+ "outputId": "6235574d-e278-40eb-f389-bdae96090556"
373
+ },
374
+ "id": "55IyOwgr1kNR",
375
+ "execution_count": 6,
376
+ "outputs": [
377
+ {
378
+ "output_type": "execute_result",
379
+ "data": {
380
+ "text/plain": [
381
+ "'/content/rag-gradio-sample-project/gradio_app'"
382
+ ],
383
+ "application/vnd.google.colaboratory.intrinsic+json": {
384
+ "type": "string"
385
+ }
386
+ },
387
+ "metadata": {},
388
+ "execution_count": 6
389
+ }
390
+ ]
391
+ },
392
+ {
393
+ "cell_type": "code",
394
+ "source": [
395
+ "!git init\n",
396
+ "!git remote add origin https://huggingface.co/spaces/Ahmadzei/RAG\n",
397
+ "!git config --global init.defaultBranch main\n",
398
+ "!git fetch\n",
399
+ "!git checkout origin/main README.md"
400
+ ],
401
+ "metadata": {
402
+ "colab": {
403
+ "base_uri": "https://localhost:8080/"
404
+ },
405
+ "id": "1wY0VaL-9c14",
406
+ "outputId": "27d6b540-2d9a-4dee-d397-25601878c187"
407
+ },
408
+ "id": "1wY0VaL-9c14",
409
+ "execution_count": null,
410
+ "outputs": [
411
+ {
412
+ "output_type": "stream",
413
+ "name": "stdout",
414
+ "text": [
415
+ "\u001b[33mhint: Using 'master' as the name for the initial branch. This default branch name\u001b[m\n",
416
+ "\u001b[33mhint: is subject to change. To configure the initial branch name to use in all\u001b[m\n",
417
+ "\u001b[33mhint: of your new repositories, which will suppress this warning, call:\u001b[m\n",
418
+ "\u001b[33mhint: \u001b[m\n",
419
+ "\u001b[33mhint: \tgit config --global init.defaultBranch <name>\u001b[m\n",
420
+ "\u001b[33mhint: \u001b[m\n",
421
+ "\u001b[33mhint: Names commonly chosen instead of 'master' are 'main', 'trunk' and\u001b[m\n",
422
+ "\u001b[33mhint: 'development'. The just-created branch can be renamed via this command:\u001b[m\n",
423
+ "\u001b[33mhint: \u001b[m\n",
424
+ "\u001b[33mhint: \tgit branch -m <name>\u001b[m\n",
425
+ "Initialized empty Git repository in /content/rag-gradio-sample-project/gradio_app/.git/\n",
426
+ "remote: Enumerating objects: 4, done.\u001b[K\n",
427
+ "remote: Total 4 (delta 0), reused 0 (delta 0), pack-reused 4\u001b[K\n",
428
+ "Unpacking objects: 100% (4/4), 1.27 KiB | 1.27 MiB/s, done.\n",
429
+ "From https://huggingface.co/spaces/Ahmadzei/RAG\n",
430
+ " * [new branch] main -> origin/main\n",
431
+ "Updated 1 path from b4805fb\n"
432
+ ]
433
+ }
434
+ ]
435
+ },
436
+ {
437
+ "cell_type": "code",
438
+ "source": [
439
+ "!git clone https://github.com/huggingface/transformers"
440
+ ],
441
+ "metadata": {
442
+ "colab": {
443
+ "base_uri": "https://localhost:8080/"
444
+ },
445
+ "id": "cgX7Aqujk37U",
446
+ "outputId": "6294a191-642f-41f2-bb07-0ce528fae8c2"
447
+ },
448
+ "id": "cgX7Aqujk37U",
449
+ "execution_count": 7,
450
+ "outputs": [
451
+ {
452
+ "output_type": "stream",
453
+ "name": "stdout",
454
+ "text": [
455
+ "Cloning into 'transformers'...\n",
456
+ "remote: Enumerating objects: 185037, done.\u001b[K\n",
457
+ "remote: Counting objects: 100% (1681/1681), done.\u001b[K\n",
458
+ "remote: Compressing objects: 100% (1231/1231), done.\u001b[K\n",
459
+ "remote: Total 185037 (delta 824), reused 742 (delta 374), pack-reused 183356\u001b[K\n",
460
+ "Receiving objects: 100% (185037/185037), 205.20 MiB | 19.65 MiB/s, done.\n",
461
+ "Resolving deltas: 100% (130045/130045), done.\n"
462
+ ]
463
+ }
464
+ ]
465
+ },
466
+ {
467
+ "cell_type": "code",
468
+ "source": [
469
+ "# !python transformers/prep_scripts/markdown_to_text.py --input_dir transformers/docs/source/en/ --output_dir /content/knowledge_base/\n",
470
+ "!python /content/rag-gradio-sample-project/prep_scripts/markdown_to_text.py --input-dir /content/rag-gradio-sample-project/gradio_app/transformers/docs/source/en/ --output-dir /content/docs/"
471
+ ],
472
+ "metadata": {
473
+ "colab": {
474
+ "base_uri": "https://localhost:8080/"
475
+ },
476
+ "id": "2NYMq3KIlMAz",
477
+ "outputId": "d24cd17b-2f77-4f3a-b8c0-449acd9b0f80"
478
+ },
479
+ "id": "2NYMq3KIlMAz",
480
+ "execution_count": 8,
481
+ "outputs": [
482
+ {
483
+ "output_type": "stream",
484
+ "name": "stdout",
485
+ "text": [
486
+ "\r0it [00:00, ?it/s]/content/rag-gradio-sample-project/prep_scripts/markdown_to_text.py:22: DeprecationWarning: The 'text' argument to find()-type methods is deprecated. Use 'string' instead.\n",
487
+ " text = ''.join(soup.findAll(text=True))\n",
488
+ "385it [00:06, 60.38it/s]\n"
489
+ ]
490
+ }
491
+ ]
492
+ },
493
+ {
494
+ "cell_type": "code",
495
+ "execution_count": null,
496
+ "id": "6c813d03-33a7-4ce1-836f-11afc541f291",
497
+ "metadata": {
498
+ "id": "6c813d03-33a7-4ce1-836f-11afc541f291"
499
+ },
500
+ "outputs": [],
501
+ "source": [
502
+ "# Add the link to the space you've just created here:\n",
503
+ "# https://huggingface.co/spaces/Ahmadzei/RAG"
504
+ ]
505
+ },
506
+ {
507
+ "cell_type": "markdown",
508
+ "id": "c970d0a4-fee8-48ac-9377-4a6def7712b2",
509
+ "metadata": {
510
+ "id": "c970d0a4-fee8-48ac-9377-4a6def7712b2"
511
+ },
512
+ "source": [
513
+ "### Step 1: Chunk Your Data\n",
514
+ "\n",
515
+ "To efficiently pull up documents relevant to a query from a knowledge base, documents are embedded and stored as vectors. Documents in your knowledge base are not expected to fit into the context length of an embedding model (most have a 512-token limit), so chunking your documents into smaller pieces is required. Take a deeper dive into why chunking is important and what the options are [here](https://www.pinecone.io/learn/chunking-strategies/).\n",
516
+ "\n",
517
+ "Your task is to implement and compare two chunking strategies: fixed-size chunking and content-aware chunking. For content-aware chunking you could split by sentences, by paragraphs, or in some other way that makes sense.\n",
518
+ "\n",
519
+ "The deliverables are:\n",
520
+ "- The code for chunk splitting"
521
+ ]
522
+ },
523
+ {
524
+ "cell_type": "code",
525
+ "execution_count": null,
526
+ "id": "f7bad8c8",
527
+ "metadata": {
528
+ "id": "f7bad8c8"
529
+ },
530
+ "outputs": [],
531
+ "source": [
532
+ "# Chunk splitting deliverables"
533
+ ]
534
+ },
535
+ {
536
+ "cell_type": "code",
537
+ "source": [
538
+ "def fixed_size_chunking(text, chunk_size=512):\n",
539
+ " \"\"\"\n",
540
+ " Splits the text into fixed-sized chunks.\n",
541
+ "\n",
542
+ " :param text: The input text to be chunked.\n",
543
+ " :param chunk_size: The size of each chunk in number of characters.\n",
544
+ " :return: A list of chunks.\n",
545
+ " \"\"\"\n",
546
+ " return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]\n"
547
+ ],
548
+ "metadata": {
549
+ "id": "n9qEj8jfvlPj"
550
+ },
551
+ "id": "n9qEj8jfvlPj",
552
+ "execution_count": 9,
553
+ "outputs": []
554
+ },
555
+ {
556
+ "cell_type": "code",
557
+ "source": [
558
+ "def content_aware_chunking(text, max_chunk_size=512):\n",
559
+ " \"\"\"\n",
560
+ " Splits the text into content-aware chunks by sentences.\n",
561
+ "\n",
562
+ " :param text: The input text to be chunked.\n",
563
+ " :param max_chunk_size: The maximum size of each chunk in number of characters.\n",
564
+ " :return: A list of chunks.\n",
565
+ " \"\"\"\n",
566
+ " sentences = text.split('. ') # Simple sentence splitting, can be improved with NLP libraries\n",
567
+ " chunks = []\n",
568
+ " current_chunk = \"\"\n",
569
+ "\n",
570
+ " for sentence in sentences:\n",
571
+ " if len(current_chunk) + len(sentence) < max_chunk_size:\n",
572
+ " current_chunk += sentence + \". \"\n",
573
+ " else:\n",
574
+ " chunks.append(current_chunk.strip())\n",
575
+ " current_chunk = sentence + \". \"\n",
576
+ " if current_chunk:\n",
577
+ " chunks.append(current_chunk.strip())\n",
578
+ "\n",
579
+ " return chunks"
580
+ ],
581
+ "metadata": {
582
+ "id": "DB5IlJAdL6Bq"
583
+ },
584
+ "id": "DB5IlJAdL6Bq",
585
+ "execution_count": 10,
586
+ "outputs": []
587
+ },
588
+ {
589
+ "cell_type": "code",
590
+ "source": [
591
+ "import nltk\n",
592
+ "nltk.download('punkt')\n",
593
+ "from nltk.tokenize import sent_tokenize\n",
594
+ "\n",
595
+ "def nltk_chunking(text):\n",
596
+ " \"\"\"\n",
597
+ " Divide text into chunks based on sentences.\n",
598
+ "\n",
599
+ " Args:\n",
600
+ " text (str): The text to be chunked.\n",
601
+ "\n",
602
+ " Returns:\n",
603
+ " list of str: A list containing the text chunks (sentences).\n",
604
+ " \"\"\"\n",
605
+ " return sent_tokenize(text)"
606
+ ],
607
+ "metadata": {
608
+ "colab": {
609
+ "base_uri": "https://localhost:8080/"
610
+ },
611
+ "id": "8eYOiabGvl00",
612
+ "outputId": "abf76bf5-09cb-43f6-b40e-0fffcbf37b3a"
613
+ },
614
+ "id": "8eYOiabGvl00",
615
+ "execution_count": 11,
616
+ "outputs": [
617
+ {
618
+ "output_type": "stream",
619
+ "name": "stderr",
620
+ "text": [
621
+ "[nltk_data] Downloading package punkt to /root/nltk_data...\n",
622
+ "[nltk_data] Unzipping tokenizers/punkt.zip.\n"
623
+ ]
624
+ }
625
+ ]
626
+ },
627
+ {
628
+ "cell_type": "code",
629
+ "source": [
630
+ "def paragraph_chunking(text):\n",
631
+ " \"\"\"\n",
632
+ " Divide text into chunks based on paragraphs.\n",
633
+ "\n",
634
+ " Args:\n",
635
+ " text (str): The text to be chunked.\n",
636
+ "\n",
637
+ " Returns:\n",
638
+ " list of str: A list containing the text chunks (paragraphs).\n",
639
+ " \"\"\"\n",
640
+ " return text.split('\\n\\n')"
641
+ ],
642
+ "metadata": {
643
+ "id": "Sk2M6tYmvosj"
644
+ },
645
+ "id": "Sk2M6tYmvosj",
646
+ "execution_count": 12,
647
+ "outputs": []
648
+ },
649
+ {
650
+ "cell_type": "code",
651
+ "source": [
652
+ "import os\n",
653
+ "import glob\n",
654
+ "\n",
655
+ "def chunk_and_write_docs(input_dir, output_dir_fixed, output_dir_content_aware):\n",
656
+ " # Ensure output directories exist\n",
657
+ " os.makedirs(output_dir_fixed, exist_ok=True)\n",
658
+ " os.makedirs(output_dir_content_aware, exist_ok=True)\n",
659
+ "\n",
660
+ " # List all text files in the input directory\n",
661
+ " file_paths = glob.glob(os.path.join(input_dir, '*.txt'))\n",
662
+ "\n",
663
+ " for file_path in file_paths:\n",
664
+ " # Read the content of the file\n",
665
+ " with open(file_path, 'r', encoding='utf-8') as file:\n",
666
+ " text_content = file.read()\n",
667
+ "\n",
668
+ " # Generate chunks using both methods\n",
669
+ " fixed_chunks = fixed_size_chunking(text_content)\n",
670
+ " content_aware_chunks = content_aware_chunking(text_content)\n",
671
+ "\n",
672
+ " # Extract base name without extension for use in chunk file names\n",
673
+ " base_name = os.path.splitext(os.path.basename(file_path))[0]\n",
674
+ "\n",
675
+ " # Fixed-size chunking\n",
676
+ " fixed_chunk_dir = os.path.join(output_dir_fixed, base_name.replace('.txt', ''))\n",
677
+ " os.makedirs(fixed_chunk_dir, exist_ok=True)\n",
678
+ " for i, chunk in enumerate(fixed_chunks):\n",
679
+ " with open(os.path.join(fixed_chunk_dir, f'chunk_{i}.txt'), 'w', encoding='utf-8') as chunk_file:\n",
680
+ " chunk_file.write(chunk)\n",
681
+ "\n",
682
+ " # Content-aware chunking\n",
683
+ " content_aware_chunk_dir = os.path.join(output_dir_content_aware, base_name.replace('.txt', ''))\n",
684
+ " os.makedirs(content_aware_chunk_dir, exist_ok=True)\n",
685
+ " for i, chunk in enumerate(content_aware_chunks):\n",
686
+ " with open(os.path.join(content_aware_chunk_dir, f'chunk_{i}.txt'), 'w', encoding='utf-8') as chunk_file:\n",
687
+ " chunk_file.write(chunk)\n",
688
+ "\n",
689
+ "# Define input and output directories\n",
690
+ "input_dir = '/content/docs'\n",
691
+ "output_dir_fixed = '/content/chunked/fixed_size_chunking'\n",
692
+ "output_dir_content_aware = '/content/chunked/content_aware_chunking'\n",
693
+ "\n",
694
+ "# Process the documents\n",
695
+ "chunk_and_write_docs(input_dir, output_dir_fixed, output_dir_content_aware)\n",
696
+ "\n",
697
+ "# To indicate completion and the count of processed files\n",
698
+ "processed_files_count = len(glob.glob(os.path.join(input_dir, '*.txt')))\n",
699
+ "processed_files_count\n"
700
+ ],
701
+ "metadata": {
702
+ "colab": {
703
+ "base_uri": "https://localhost:8080/"
704
+ },
705
+ "id": "FGDf40tqSK2C",
706
+ "outputId": "39033395-444e-4579-a387-1128ec73bc41"
707
+ },
708
+ "id": "FGDf40tqSK2C",
709
+ "execution_count": 13,
710
+ "outputs": [
711
+ {
712
+ "output_type": "execute_result",
713
+ "data": {
714
+ "text/plain": [
715
+ "381"
716
+ ]
717
+ },
718
+ "metadata": {},
719
+ "execution_count": 13
720
+ }
721
+ ]
722
+ },
723
+ {
724
+ "cell_type": "markdown",
725
+ "id": "5e5ebaad-8d42-430c-b00b-18198cdb9ce8",
726
+ "metadata": {
727
+ "id": "5e5ebaad-8d42-430c-b00b-18198cdb9ce8"
728
+ },
729
+ "source": [
730
+ "### Step 2: Ingest chunks into a database and create an index\n",
731
+ "\n",
732
+ "Chunks need to be vectorized and made accessible to an LLM to enable semantic search with embedding models. A current industry standard is to use a vector database to store and retrieve texts both conveniently and efficiently. There are many products out there; we'll be using [LanceDB](https://lancedb.github.io/lancedb/). LanceDB is a young product, and one way it stands out is that it's embedded: it's designed not as a standalone service but as a part of an application. More on this [here](https://lancedb.github.io/lancedb/basic/).\n",
733
+ "\n",
734
+ "Find more details on how different databases compare in [this](https://thedataquarry.com/tags/vector-db/) series of posts.\n",
735
+ "\n",
736
+ "Your task is to vectorize and ingest chunked documents into the database.\n",
737
+ "**For each chunking strategy from the previous step, create a separate table with one of the embedding models. Compare the chunking strategies and choose one. Perform vectorization+ingestion with the second model using only one chunking strategy of your choice.**\n",
738
+ "Use prep_scripts/lancedb_setup.py to vectorize chunks and store vector representations along with raw text in a LanceDB instance. The script also creates an index for fast ANN retrieval (not really needed for this exercise but necessary at scale). Try different embedding models and see how results differ. The options are:\n",
739
+ "\n",
740
+ "- `sentence-transformers/all-MiniLM-L6-v2`: a light model, produces vectors of length 384\n",
741
+ "- `BAAI/bge-large-en-v1.5`: a much heavier model, embedding vector length is 1024\n",
742
+ "\n",
743
+ "Feel free to explore other embedding models and justify your choice.\n",
744
+ "For different embedding models and different chunking strategies create different tables in the database so you can easily switch between them and compare.\n",
745
+ "\n",
746
+ "Run the embedding+ingestion script as follows; make sure to look into the script and go over the arguments. Note that the number of sub-vectors for indexing must be a divisor of the model embedding size.\n",
747
+ "\n",
748
+ "```\n",
749
+ "python prep_scripts/lancedb_setup.py --emb-model <model name> --table <db table name> --input-dir <folder with chunked docs> --num-sub-vectors <a number which is a divisor of the embedding dim>\n",
750
+ "```\n",
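+ "\n",
+ "If it helps to see the idea in code, here is a minimal sketch of roughly what the script does: embed each chunk, store the vector together with the raw text in a LanceDB table, and build an ANN index. The table name, folder layout and `num_sub_vectors=32` (for the 384-dim MiniLM model) are example values; the real script takes them as the arguments shown above.\n",
+ "\n",
+ "```python\n",
+ "import glob\n",
+ "import time\n",
+ "\n",
+ "import lancedb\n",
+ "from sentence_transformers import SentenceTransformer\n",
+ "\n",
+ "model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')\n",
+ "db = lancedb.connect('.lancedb')  # keep the database inside gradio_app\n",
+ "\n",
+ "# example folder layout produced by the chunking step\n",
+ "paths = glob.glob('chunked/content_aware_chunking/*/*.txt')\n",
+ "chunks = [open(p, encoding='utf-8').read() for p in paths]\n",
+ "\n",
+ "t0 = time.time()\n",
+ "vectors = model.encode(chunks, show_progress_bar=True)  # one 384-dim vector per chunk\n",
+ "print(f'embedding took {time.time() - t0:.1f}s')  # compare across models for deliverable 4\n",
+ "\n",
+ "table = db.create_table(\n",
+ "    'docs_minilm_content_aware',  # example table name\n",
+ "    data=[{'vector': v.tolist(), 'text': t} for v, t in zip(vectors, chunks)],\n",
+ ")\n",
+ "# num_sub_vectors must divide the embedding size (384 = 32 * 12)\n",
+ "table.create_index(num_sub_vectors=32)\n",
+ "\n",
+ "# sanity check: retrieve the 3 closest chunks for a query\n",
+ "hits = table.search(model.encode('How do I add a new model?')).limit(3).to_pandas()\n",
+ "print(hits['text'].tolist())\n",
+ "```\n",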
751
+ "\n",
752
+ "Before committing to your Space, set up environment variables on the settings tab of your Space; use `.env` as a reference list of all the things you can customize. Make sure to add HF_TOKEN and OPENAI_API_KEY as secrets.\n",
753
+ "Not all the parameters need to be set via environment variables; most have default values.\n",
754
+ "\n",
755
+ "*The database is expected to be in the `gradio_app` folder under `.lancedb`; make sure to move it there if it was initialized elsewhere.* The location can be parametrized, but that's unnecessary here.\n",
756
+ "\n",
757
+ "To commit large files to your Space repo use `git lfs`:\n",
758
+ "```\n",
759
+ "git lfs install\n",
760
+ "git lfs track \"*.lance\"\n",
761
+ "git lfs track \"*.idx\"\n",
762
+ "git add .gitattributes\n",
763
+ "```\n",
764
+ "Then proceed as usual.\n",
765
+ "\n",
766
+ "For experimenting you can easily switch between embedding models/tables by changing the values of the corresponding env variables in your Space (`EMB_MODEL`, `TABLE_NAME`). Every time you change the value of an environment variable, the Space gets automatically rebuilt.\n",
767
+ "\n",
768
+ "The deliverables are:\n",
769
+ "1. The illustration of how retrieved documents differ depending on the embedding model and the chunking strategy. You should create at least 3 tables: model_1 + chunking_strategy_1, model_1 + chunking_strategy_2, model_2 + chunking_strategy_<1 or 2>\n",
770
+ "2. The analysis of pros and cons of chunking strategies\n",
771
+ "3. The analysis of how retrieved documents differ between embedding models (is one better than the other?)\n",
772
+ "4. The analysis of how the embedding time differs between models"
773
+ ]
774
+ },
775
+ {
776
+ "cell_type": "code",
777
+ "execution_count": null,
778
+ "id": "f7db282e-e03c-41de-9c03-54abf455481f",
779
+ "metadata": {
780
+ "id": "f7db282e-e03c-41de-9c03-54abf455481f"
781
+ },
782
+ "outputs": [],
783
+ "source": [
784
+ "# Embed documents with different chunking strategies and ingest into the database"
785
+ ]
786
+ },
787
+ {
788
+ "cell_type": "code",
789
+ "source": [
790
+ "!pip install lancedb openai pyarrow pandas numpy sentence-transformers"
791
+ ],
792
+ "metadata": {
793
+ "colab": {
794
+ "base_uri": "https://localhost:8080/"
795
+ },
796
+ "id": "vrrCjCs3-lNy",
797
+ "outputId": "c1a20049-d733-4390-ef65-cd9df1c0109f"
798
+ },
799
+ "id": "vrrCjCs3-lNy",
800
+ "execution_count": 14,
801
+ "outputs": [
802
+ {
803
+ "output_type": "stream",
804
+ "name": "stdout",
805
+ "text": [
806
+ "Requirement already satisfied: lancedb in /usr/local/lib/python3.10/dist-packages (0.5.3)\n",
807
+ "Requirement already satisfied: openai in /usr/local/lib/python3.10/dist-packages (1.11.1)\n",
808
+ "Requirement already satisfied: pyarrow in /usr/local/lib/python3.10/dist-packages (15.0.0)\n",
809
+ "Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (1.5.3)\n",
810
+ "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (1.25.2)\n",
811
+ "Requirement already satisfied: sentence-transformers in /usr/local/lib/python3.10/dist-packages (2.3.1)\n",
812
+ "Requirement already satisfied: deprecation in /usr/local/lib/python3.10/dist-packages (from lancedb) (2.1.0)\n",
813
+ "Requirement already satisfied: pylance==0.9.12 in /usr/local/lib/python3.10/dist-packages (from lancedb) (0.9.12)\n",
814
+ "Requirement already satisfied: ratelimiter~=1.0 in /usr/local/lib/python3.10/dist-packages (from lancedb) (1.2.0.post0)\n",
815
+ "Requirement already satisfied: retry>=0.9.2 in /usr/local/lib/python3.10/dist-packages (from lancedb) (0.9.2)\n",
816
+ "Requirement already satisfied: tqdm>=4.27.0 in /usr/local/lib/python3.10/dist-packages (from lancedb) (4.66.1)\n",
817
+ "Requirement already satisfied: pydantic>=1.10 in /usr/local/lib/python3.10/dist-packages (from lancedb) (2.6.1)\n",
818
+ "Requirement already satisfied: attrs>=21.3.0 in /usr/local/lib/python3.10/dist-packages (from lancedb) (23.2.0)\n",
819
+ "Requirement already satisfied: semver>=3.0 in /usr/local/lib/python3.10/dist-packages (from lancedb) (3.0.2)\n",
820
+ "Requirement already satisfied: cachetools in /usr/local/lib/python3.10/dist-packages (from lancedb) (5.3.2)\n",
821
+ "Requirement already satisfied: pyyaml>=6.0 in /usr/local/lib/python3.10/dist-packages (from lancedb) (6.0.1)\n",
822
+ "Requirement already satisfied: click>=8.1.7 in /usr/local/lib/python3.10/dist-packages (from lancedb) (8.1.7)\n",
823
+ "Requirement already satisfied: requests>=2.31.0 in /usr/local/lib/python3.10/dist-packages (from lancedb) (2.31.0)\n",
824
+ "Requirement already satisfied: overrides>=0.7 in /usr/local/lib/python3.10/dist-packages (from lancedb) (7.7.0)\n",
825
+ "Requirement already satisfied: anyio<5,>=3.5.0 in /usr/local/lib/python3.10/dist-packages (from openai) (3.7.1)\n",
826
+ "Requirement already satisfied: distro<2,>=1.7.0 in /usr/lib/python3/dist-packages (from openai) (1.7.0)\n",
827
+ "Requirement already satisfied: httpx<1,>=0.23.0 in /usr/local/lib/python3.10/dist-packages (from openai) (0.26.0)\n",
828
+ "Requirement already satisfied: sniffio in /usr/local/lib/python3.10/dist-packages (from openai) (1.3.0)\n",
829
+ "Requirement already satisfied: typing-extensions<5,>=4.7 in /usr/local/lib/python3.10/dist-packages (from openai) (4.9.0)\n",
830
+ "Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas) (2.8.2)\n",
831
+ "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas) (2023.4)\n",
832
+ "Requirement already satisfied: transformers<5.0.0,>=4.32.0 in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (4.37.2)\n",
833
+ "Requirement already satisfied: torch>=1.11.0 in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (2.1.1)\n",
834
+ "Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (1.2.2)\n",
835
+ "Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (1.11.4)\n",
836
+ "Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (3.8.1)\n",
837
+ "Requirement already satisfied: sentencepiece in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (0.1.99)\n",
838
+ "Requirement already satisfied: huggingface-hub>=0.15.1 in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (0.20.3)\n",
839
+ "Requirement already satisfied: Pillow in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (9.4.0)\n",
840
+ "Requirement already satisfied: idna>=2.8 in /usr/local/lib/python3.10/dist-packages (from anyio<5,>=3.5.0->openai) (3.6)\n",
841
+ "Requirement already satisfied: exceptiongroup in /usr/local/lib/python3.10/dist-packages (from anyio<5,>=3.5.0->openai) (1.2.0)\n",
842
+ "Requirement already satisfied: certifi in /usr/local/lib/python3.10/dist-packages (from httpx<1,>=0.23.0->openai) (2024.2.2)\n",
843
+ "Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.10/dist-packages (from httpx<1,>=0.23.0->openai) (1.0.3)\n",
844
+ "Requirement already satisfied: h11<0.15,>=0.13 in /usr/local/lib/python3.10/dist-packages (from httpcore==1.*->httpx<1,>=0.23.0->openai) (0.14.0)\n",
845
+ "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.15.1->sentence-transformers) (3.13.1)\n",
846
+ "Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.15.1->sentence-transformers) (2023.6.0)\n",
847
+ "Requirement already satisfied: packaging>=20.9 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.15.1->sentence-transformers) (23.2)\n",
848
+ "Requirement already satisfied: annotated-types>=0.4.0 in /usr/local/lib/python3.10/dist-packages (from pydantic>=1.10->lancedb) (0.6.0)\n",
849
+ "Requirement already satisfied: pydantic-core==2.16.2 in /usr/local/lib/python3.10/dist-packages (from pydantic>=1.10->lancedb) (2.16.2)\n",
850
+ "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)\n",
851
+ "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests>=2.31.0->lancedb) (3.3.2)\n",
852
+ "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.31.0->lancedb) (2.0.7)\n",
853
+ "Requirement already satisfied: decorator>=3.4.2 in /usr/local/lib/python3.10/dist-packages (from retry>=0.9.2->lancedb) (4.4.2)\n",
854
+ "Requirement already satisfied: py<2.0.0,>=1.4.26 in /usr/local/lib/python3.10/dist-packages (from retry>=0.9.2->lancedb) (1.11.0)\n",
855
+ "Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (1.12)\n",
856
+ "Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (3.2.1)\n",
857
+ "Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (3.1.3)\n",
858
+ "Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (12.1.105)\n",
859
+ "Requirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (12.1.105)\n",
860
+ "Requirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (12.1.105)\n",
861
+ "Requirement already satisfied: nvidia-cudnn-cu12==8.9.2.26 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (8.9.2.26)\n",
862
+ "Requirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (12.1.3.1)\n",
863
+ "Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (11.0.2.54)\n",
864
+ "Requirement already satisfied: nvidia-curand-cu12==10.3.2.106 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (10.3.2.106)\n",
865
+ "Requirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (11.4.5.107)\n",
866
+ "Requirement already satisfied: nvidia-cusparse-cu12==12.1.0.106 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (12.1.0.106)\n",
867
+ "Requirement already satisfied: nvidia-nccl-cu12==2.18.1 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (2.18.1)\n",
868
+ "Requirement already satisfied: nvidia-nvtx-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (12.1.105)\n",
869
+ "Requirement already satisfied: triton==2.1.0 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (2.1.0)\n",
870
+ "Requirement already satisfied: nvidia-nvjitlink-cu12 in /usr/local/lib/python3.10/dist-packages (from nvidia-cusolver-cu12==11.4.5.107->torch>=1.11.0->sentence-transformers) (12.3.101)\n",
871
+ "Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers<5.0.0,>=4.32.0->sentence-transformers) (2023.12.25)\n",
872
+ "Requirement already satisfied: tokenizers<0.19,>=0.14 in /usr/local/lib/python3.10/dist-packages (from transformers<5.0.0,>=4.32.0->sentence-transformers) (0.15.2)\n",
873
+ "Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from transformers<5.0.0,>=4.32.0->sentence-transformers) (0.4.2)\n",
874
+ "Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk->sentence-transformers) (1.3.2)\n",
875
+ "Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn->sentence-transformers) (3.2.0)\n",
876
+ "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch>=1.11.0->sentence-transformers) (2.1.5)\n",
877
+ "Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch>=1.11.0->sentence-transformers) (1.3.0)\n"
878
+ ]
879
+ }
880
+ ]
881
+ },
882
+ {
883
+ "cell_type": "code",
884
+ "source": [
885
+ "# Setting environment variables\n",
886
+ "os.environ['EMB_MODEL'] = 'sentence-transformers/all-MiniLM-L6-v2' #sentence-transformers/all-MiniLM-L6-v2: a light model, produces vectors of length 384 / BAAI/bge-large-en-v1.5: a much heavier model, embedding vector length is 1024\n",
887
+ "os.environ['TABLE_NAME'] = 'fixed_size_chunking' # fixed_size_chunking / content_aware_chunking\n",
888
+ "os.environ['INPUT_DIR'] = '/content/chunked/docs/fixed_size_chunking/' # fixed_size_chunking / content_aware_chunking\n",
889
+ "os.environ['NUM_SUB_VECTORS'] = '12'"
890
+ ],
891
+ "metadata": {
892
+ "id": "o3TCdDIEYwk6"
893
+ },
894
+ "id": "o3TCdDIEYwk6",
895
+ "execution_count": 15,
896
+ "outputs": []
897
+ },
898
+ {
899
+ "cell_type": "code",
900
+ "source": [
901
+ "EMB_MODEL = os.getenv('EMB_MODEL')\n",
902
+ "TABLE_NAME = os.getenv('TABLE_NAME')\n",
903
+ "INPUT_DIR = os.getenv('INPUT_DIR')\n",
904
+ "NUM_SUB_VECTORS = os.getenv('NUM_SUB_VECTORS')"
905
+ ],
906
+ "metadata": {
907
+ "id": "1tVGE7JYZc3i"
908
+ },
909
+ "id": "1tVGE7JYZc3i",
910
+ "execution_count": 16,
911
+ "outputs": []
912
+ },
913
+ {
914
+ "cell_type": "code",
915
+ "source": [
916
+ "print(INPUT_DIR)"
917
+ ],
918
+ "metadata": {
919
+ "colab": {
920
+ "base_uri": "https://localhost:8080/"
921
+ },
922
+ "id": "uL8Gzk6TgLtK",
923
+ "outputId": "68c608cf-e685-45c6-fc5f-e51ba204c074"
924
+ },
925
+ "id": "uL8Gzk6TgLtK",
926
+ "execution_count": 17,
927
+ "outputs": [
928
+ {
929
+ "output_type": "stream",
930
+ "name": "stdout",
931
+ "text": [
932
+ "/content/chunked/docs/fixed_size_chunking/\n"
933
+ ]
934
+ }
935
+ ]
936
+ },
937
+ {
938
+ "cell_type": "code",
939
+ "source": [
940
+ "!python /content/rag-gradio-sample-project/prep_scripts/lancedb_setup.py --emb-model {EMB_MODEL} --table {TABLE_NAME} --input-dir {INPUT_DIR} --num-sub-vectors {NUM_SUB_VECTORS}"
941
+ ],
942
+ "metadata": {
943
+ "colab": {
944
+ "base_uri": "https://localhost:8080/"
945
+ },
946
+ "id": "Xy1cyu7_zFgO",
947
+ "outputId": "89ade558-d3bf-4aab-9b29-35f72950a07d"
948
+ },
949
+ "id": "Xy1cyu7_zFgO",
950
+ "execution_count": 19,
951
+ "outputs": [
952
+ {
953
+ "output_type": "stream",
954
+ "name": "stdout",
955
+ "text": [
956
+ "INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2\n",
957
+ "/usr/local/lib/python3.10/dist-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()\n",
958
+ " return self.fget.__get__(instance, owner)()\n",
959
+ "INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cpu\n",
960
+ "INFO:__main__:using cpu device\n",
961
+ "0it [00:00, ?it/s]\n",
962
+ "Traceback (most recent call last):\n",
963
+ " File \"/content/rag-gradio-sample-project/prep_scripts/lancedb_setup.py\", line 96, in <module>\n",
964
+ " main()\n",
965
+ " File \"/content/rag-gradio-sample-project/prep_scripts/lancedb_setup.py\", line 88, in main\n",
966
+ " tbl.create_index(\n",
967
+ " File \"/usr/local/lib/python3.10/dist-packages/lancedb/table.py\", line 858, in create_index\n",
968
+ " self._dataset.create_index(\n",
969
+ " File \"/usr/local/lib/python3.10/dist-packages/lance/dataset.py\", line 1269, in create_index\n",
970
+ " self._ds.create_index(column, index_type, name, replace, kwargs)\n",
971
+ "OSError: LanceError(Index): KMeans: can not train 256 centroids with 0 vectors, choose a smaller K (< 0) instead, /home/runner/work/lance/lance/rust/lance-index/src/vector/kmeans.rs:45:21\n"
972
+ ]
973
+ }
974
+ ]
975
+ },
976
+ {
977
+ "cell_type": "code",
978
+ "source": [
979
+ "# Setting environment variables\n",
980
+ "os.environ['EMB_MODEL'] = 'sentence-transformers/all-MiniLM-L6-v2' #sentence-transformers/all-MiniLM-L6-v2: a light model, produces vectors of length 384 / BAAI/bge-large-en-v1.5: a much heavier model, embedding vector length is 1024\n",
981
+ "os.environ['TABLE_NAME'] = 'content_aware_chunking' # fixed_size_chunking / content_aware_chunking\n",
982
+ "os.environ['INPUT_DIR'] = '/content/chunked/docs/content_aware_chunking/' # fixed_size_chunking / content_aware_chunking\n",
983
+ "os.environ['NUM_SUB_VECTORS'] = '12'"
984
+ ],
985
+ "metadata": {
986
+ "id": "t7aqMOI3bh2s"
987
+ },
988
+ "id": "t7aqMOI3bh2s",
989
+ "execution_count": null,
990
+ "outputs": []
991
+ },
992
+ {
993
+ "cell_type": "code",
994
+ "source": [
995
+ "EMB_MODEL2 = os.getenv('EMB_MODEL')\n",
996
+ "TABLE_NAME2 = os.getenv('TABLE_NAME')\n",
997
+ "INPUT_DIR2 = os.getenv('INPUT_DIR')\n",
998
+ "NUM_SUB_VECTORS2 = os.getenv('NUM_SUB_VECTORS')"
999
+ ],
1000
+ "metadata": {
1001
+ "id": "Gk9ynF4Bbslu"
1002
+ },
1003
+ "id": "Gk9ynF4Bbslu",
1004
+ "execution_count": null,
1005
+ "outputs": []
1006
+ },
1007
+ {
1008
+ "cell_type": "code",
1009
+ "source": [
1010
+ "!python /content/rag-gradio-sample-project/prep_scripts/lancedb_setup.py --emb-model {EMB_MODEL2} --table {TABLE_NAME2} --input-dir {INPUT_DIR2} --num-sub-vectors {NUM_SUB_VECTORS2}"
1011
+ ],
1012
+ "metadata": {
1013
+ "colab": {
1014
+ "base_uri": "https://localhost:8080/"
1015
+ },
1016
+ "id": "rc0n7a9zbwh2",
1017
+ "outputId": "50251872-bad0-473b-9ac3-36ed6d7a2e5f"
1018
+ },
1019
+ "id": "rc0n7a9zbwh2",
1020
+ "execution_count": null,
1021
+ "outputs": [
1022
+ {
1023
+ "output_type": "stream",
1024
+ "name": "stdout",
1025
+ "text": [
1026
+ "INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2\n",
1027
+ "/usr/local/lib/python3.10/dist-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()\n",
1028
+ " return self.fget.__get__(instance, owner)()\n",
1029
+ "INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cpu\n",
1030
+ "INFO:__main__:using cpu device\n",
1031
+ "0it [00:00, ?it/s]\n",
1032
+ "Traceback (most recent call last):\n",
1033
+ " File \"/content/rag-gradio-sample-project/prep_scripts/lancedb_setup.py\", line 100, in <module>\n",
1034
+ " main()\n",
1035
+ " File \"/content/rag-gradio-sample-project/prep_scripts/lancedb_setup.py\", line 92, in main\n",
1036
+ " tbl.create_index(\n",
1037
+ " File \"/usr/local/lib/python3.10/dist-packages/lancedb/table.py\", line 858, in create_index\n",
1038
+ " self._dataset.create_index(\n",
1039
+ " File \"/usr/local/lib/python3.10/dist-packages/lance/dataset.py\", line 1269, in create_index\n",
1040
+ " self._ds.create_index(column, index_type, name, replace, kwargs)\n",
1041
+ "OSError: LanceError(Index): KMeans: can not train 256 centroids with 0 vectors, choose a smaller K (< 0) instead, /home/runner/work/lance/lance/rust/lance-index/src/vector/kmeans.rs:45:21\n"
1042
+ ]
1043
+ }
1044
+ ]
1045
+ },
1046
+ {
1047
+ "cell_type": "code",
1048
+ "source": [
1049
+ "!git lfs install\n",
1050
+ "!git lfs track \"*.lance\"\n",
1051
+ "!git lfs track \"*.idx\"\n",
1052
+ "!git add .gitattributes\n",
1053
+ "# Then commit and push as usual\n"
1054
+ ],
1055
+ "metadata": {
1056
+ "colab": {
1057
+ "base_uri": "https://localhost:8080/"
1058
+ },
1059
+ "id": "3Mlmy4j7x9Ln",
1060
+ "outputId": "c4940d06-37a5-4861-a101-d6cbf753b5d2"
1061
+ },
1062
+ "id": "3Mlmy4j7x9Ln",
1063
+ "execution_count": null,
1064
+ "outputs": [
1065
+ {
1066
+ "output_type": "stream",
1067
+ "name": "stdout",
1068
+ "text": [
1069
+ "Updated git hooks.\n",
1070
+ "Git LFS initialized.\n",
1071
+ "Tracking \"*.lance\"\n",
1072
+ "Tracking \"*.idx\"\n"
1073
+ ]
1074
+ }
1075
+ ]
1076
+ },
1077
+ {
1078
+ "cell_type": "markdown",
1079
+ "id": "7d818b4f-ba5a-4c81-b6d7-f3474c398d9c",
1080
+ "metadata": {
1081
+ "id": "7d818b4f-ba5a-4c81-b6d7-f3474c398d9c"
1082
+ },
1083
+ "source": [
1084
+ "### Step 3: Add a reranker\n",
1085
+ "\n",
1086
+ "A reranker is a second-level model which produces similarity scores for pairs of (input query + retrieved document). Cross-encoders are conventionally used for reranking; their architecture is slightly different from that of retrieval models (more on them [here] and [here](https://www.sbert.net/examples/applications/retrieve_rerank/README.html)). Cross-encoders are much more costly to run, so a retrieval model is first used to get a few dozen of the highest-scoring items, and the reranker then picks the best among these. The overall pipeline is similar to the recommender-system industry standard: a light model retrieves the top n, a precise but heavy model reranks those n to get the top k, where n is orders of magnitude larger than k.\n",
1087
+ "\n",
1088
+ "Cross-encoders are optional because of the overhead their usage implies. Your task is to implement a reranker using a cross-encoder and assess the pros and cons of having it. Do not forget that the process of pulling the most relevant documents becomes two-stage: retrieve a larger number of items first, then rerank and keep the best top-k for context.\n",
1089
+ "\n",
1090
+ "Models suitable for the task:\n",
1091
+ "1. BAAI/bge-reranker-large\n",
1092
+ "2. cross-encoder/ms-marco-MiniLM-L-6-v2\n",
1093
+ "\n",
1094
+ "As usual, feel free to pick another model and provide a short description of it; a compact `CrossEncoder`-based sketch is shown below.\n",
1095
+ "\n",
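+ "An equivalent, more compact option is the `CrossEncoder` helper from `sentence-transformers` (a sketch, not the required implementation; `documents` is assumed to be the list of retrieved rows, each with a `text` field):\n",
+ "```\n",
+ "from sentence_transformers import CrossEncoder\n",
+ "\n",
+ "reranker = CrossEncoder(\"cross-encoder/ms-marco-MiniLM-L-6-v2\")\n",
+ "scores = reranker.predict([(query, doc[\"text\"]) for doc in documents])\n",
+ "```\n",
+ "\n",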
1096
+ "The deliverables are:\n",
1097
+ "\n",
1098
+ "1. The code that enables a reranker.\n",
1099
+ "2. A comparison of how the prompt and the model output change after adding a reranker\n",
1100
+ "3. An analysis of the pros and cons. The evaluation aspects should include the relevance of the top-k documents and the response time.\n"
1101
+ ]
1102
+ },
1103
+ {
1104
+ "cell_type": "code",
1105
+ "execution_count": null,
1106
+ "id": "ee1b0160-0ba0-4b5f-81c4-ef3ea76850e5",
1107
+ "metadata": {
1108
+ "id": "ee1b0160-0ba0-4b5f-81c4-ef3ea76850e5"
1109
+ },
1110
+ "outputs": [],
1111
+ "source": [
1112
+ "# Implement code for selecting the final documents using a cross-encoder and compare with and without"
1113
+ ]
1114
+ },
1115
+ {
1116
+ "cell_type": "code",
1117
+ "source": [
1118
+ "from sentence_transformers import SentenceTransformer\n",
1119
+ "\n",
1120
+ "# Load the model\n",
1121
+ "model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')  # must match the model used to embed the table (e.g. BAAI/bge-large-en-v1.5)\n",
1122
+ "\n",
1123
+ "# Vectorize the query\n",
1124
+ "query = \"Your search query here\"\n",
1125
+ "query_vector = model.encode(query)"
1126
+ ],
1127
+ "metadata": {
1128
+ "id": "peSWSL0lXOK5"
1129
+ },
1130
+ "id": "peSWSL0lXOK5",
1131
+ "execution_count": null,
1132
+ "outputs": []
1133
+ },
1134
+ {
1135
+ "cell_type": "code",
1136
+ "source": [
1137
+ "import lancedb\n",
1138
+ "import numpy as np\n",
1139
+ "\n",
1140
+ "# Connect to LanceDB and open your table\n",
1141
+ "db = lancedb.connect(\"/content/rag-gradio-sample-project/gradio_app/.lancedb/\")\n",
1142
+ "tbl = db.open_table(TABLE_NAME2)\n",
1143
+ "\n",
1144
+ "# Perform a vector search for the top-N documents\n",
1145
+ "df = tbl.search(query_vector) \\\n",
1146
+ " .metric(\"cosine\") \\\n",
1147
+ " .limit(10) \\\n",
1148
+ " .to_list() # Or use .to_pandas(), .to_arrow(), etc., based on your preference\n",
1149
+ "\n",
1150
+ "# `df` now contains the top-N documents and their similarity scores"
1151
+ ],
1152
+ "metadata": {
1153
+ "id": "xd10rndiUCIW"
1154
+ },
1155
+ "id": "xd10rndiUCIW",
1156
+ "execution_count": null,
1157
+ "outputs": []
1158
+ },
1159
+ {
1160
+ "cell_type": "code",
1161
+ "source": [
1162
+ "# The rows returned by .to_list() already include the stored columns, so no separate\n",
+ "# fetch step is needed. NOTE: the 'text' column name is an assumption - adjust it to\n",
+ "# whatever column your ingestion script used for the chunk contents.\n",
+ "documents = [{\"text\": row[\"text\"]} for row in df]"
1164
+ ],
1165
+ "metadata": {
1166
+ "id": "8KWuDzhxTLTX"
1167
+ },
1168
+ "id": "8KWuDzhxTLTX",
1169
+ "execution_count": null,
1170
+ "outputs": []
1171
+ },
1172
+ {
1173
+ "cell_type": "code",
1174
+ "source": [
1175
+ "from transformers import AutoTokenizer, AutoModelForSequenceClassification\n",
1176
+ "from torch.utils.data import DataLoader\n",
1177
+ "import torch\n",
1178
+ "\n",
1179
+ "# Initialize the tokenizer and model\n",
1180
+ "tokenizer = AutoTokenizer.from_pretrained(\"cross-encoder/ms-marco-MiniLM-L-6-v2\")\n",
1181
+ "rerank_model = AutoModelForSequenceClassification.from_pretrained(\"cross-encoder/ms-marco-MiniLM-L-6-v2\").eval()\n",
1182
+ "\n",
1183
+ "def rerank(query, documents):\n",
1184
+ " # `documents` is a list of dicts, each exposing its chunk text under 'text'\n",
1185
+ " pairs = [[query, doc['text']] for doc in documents] # Adjust the field name to your retrieval output structure\n",
1186
+ " inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors=\"pt\")\n",
1187
+ " with torch.no_grad():\n",
1188
+ " scores = rerank_model(**inputs).logits.squeeze(-1).tolist() # this cross-encoder outputs a single relevance logit per pair\n",
1189
+ " # Sort documents by scores in descending order and return\n",
1190
+ " documents = [doc for _, doc in sorted(zip(scores, documents), key=lambda x: x[0], reverse=True)]\n",
1191
+ " return documents"
1192
+ ],
1193
+ "metadata": {
1194
+ "id": "O6xMyqFjRp_m"
1195
+ },
1196
+ "id": "O6xMyqFjRp_m",
1197
+ "execution_count": null,
1198
+ "outputs": []
1199
+ },
1200
+ {
1201
+ "cell_type": "code",
1202
+ "source": [
1203
+ "K = 5  # number of documents to keep for the prompt context (an illustrative choice)\n",
+ "top_k_documents = rerank(query, documents)[:K]  # Keep top K after reranking"
1204
+ ],
1205
+ "metadata": {
1206
+ "id": "dZtiwhPBRtnS"
1207
+ },
1208
+ "id": "dZtiwhPBRtnS",
1209
+ "execution_count": null,
1210
+ "outputs": []
1211
+ },
1212
+ {
1213
+ "cell_type": "markdown",
1214
+ "id": "f5816c54-a290-4cb0-b7db-3b8901998cb0",
1215
+ "metadata": {
1216
+ "id": "f5816c54-a290-4cb0-b7db-3b8901998cb0"
1217
+ },
1218
+ "source": [
1219
+ "### Step 4: Try a different LLM\n",
1220
+ "\n",
1221
+ "The suggested `Mistral-7b-instruct` is a great but relatively small LLM. A larger model can be applied to a wider range of problems and do more complex reasoning. Within the scope of this project a larger model may not be beneficial, but for more complex cases the difference would become apparent. Another dimension to explore is a base model which was not instruction fine-tuned - it won't respond to your queries the way you'd expect. It may be a great exercise to see the value of fine-tuning.\n",
1222
+ "\n",
1223
+ "The task here is to try out an alternative LLM to explore the differences.\n",
1224
+ "\n",
1225
+ "The options are:\n",
1226
+ "1. mistralai/Mistral-7B-v0.1\n",
1227
+ "2. mistralai/Mixtral-8x7B-Instruct-v0.1\n",
1228
+ "\n",
1229
+ "Of course, feel free to choose another one and give some details on how it differs from the initial model; a minimal querying sketch is given below.\n",
1230
+ "\n",
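+ "A minimal sketch of querying an alternative model through the Hugging Face Inference API (the model id, prompt format and parameters are illustrative; adapt them to however your app actually calls the LLM):\n",
+ "```\n",
+ "from huggingface_hub import InferenceClient\n",
+ "\n",
+ "client = InferenceClient(model=\"mistralai/Mixtral-8x7B-Instruct-v0.1\")  # token is picked up from your HF login\n",
+ "prompt = \"[INST] Answer using the provided context: ... [/INST]\"  # build this from your RAG prompt template\n",
+ "answer = client.text_generation(prompt, max_new_tokens=256)\n",
+ "print(answer)\n",
+ "```\n",
+ "\n",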
1231
+ "The deliverables are:\n",
1232
+ "\n",
1233
+ "1. The comparison between the outputs of Mistral-7b-instruct and a different model of your choice.\n",
1234
+ "2. The difference in response times if a larger model was chosen. Make sure to run multiple queries so that the comparison is meaningful.\n",
1235
+ "3. Analyse the differences between outputs and share the conclusions.\n"
1236
+ ]
1237
+ },
1238
+ {
1239
+ "cell_type": "code",
1240
+ "execution_count": null,
1241
+ "id": "942f39d4-eb27-4f2d-ae47-a5d65f102faa",
1242
+ "metadata": {
1243
+ "id": "942f39d4-eb27-4f2d-ae47-a5d65f102faa"
1244
+ },
1245
+ "outputs": [],
1246
+ "source": [
1247
+ "# Analysis of the difference between LLMs"
1248
+ ]
1249
+ },
1250
+ {
1251
+ "cell_type": "markdown",
1252
+ "id": "70c16440",
1253
+ "metadata": {
1254
+ "id": "70c16440"
1255
+ },
1256
+ "source": [
1257
+ "### Step 5 (Bonus): Use an LLM to quantitatively compare outputs of different variants of the system (LLM as a Judge)\n",
1258
+ "\n",
1259
+ "Use a powerful LLM (e.g. GPT-4) to quantitatively evaluate the outputs of two alternative setups (different embedding models, different LLMs, or both). For inspiration and for prompts refer to [1](https://arxiv.org/pdf/2306.05685.pdf), [2](https://arxiv.org/pdf/2401.10020.pdf), [3](https://www.airtrain.ai/blog/the-comprehensive-guide-to-llm-evaluation#high-level-approach); a minimal judging sketch is given below.\n",
1260
+ "\n",
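+ "A minimal pairwise-judging sketch with the `openai` client (the judge model, prompt and answer parsing are illustrative assumptions - adapt them to the approaches in the papers above):\n",
+ "```\n",
+ "from openai import OpenAI\n",
+ "\n",
+ "client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment\n",
+ "\n",
+ "def judge(question, answer_a, answer_b):\n",
+ "    prompt = (\n",
+ "        \"You are an impartial judge. Reply with 'A', 'B' or 'tie' \"\n",
+ "        \"depending on which answer is better for the question.\\n\"\n",
+ "        f\"Question: {question}\\nAnswer A: {answer_a}\\nAnswer B: {answer_b}\"\n",
+ "    )\n",
+ "    response = client.chat.completions.create(\n",
+ "        model=\"gpt-4\",\n",
+ "        messages=[{\"role\": \"user\", \"content\": prompt}],\n",
+ "        temperature=0,\n",
+ "    )\n",
+ "    return response.choices[0].message.content.strip()\n",
+ "```\n",
+ "\n",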
1261
+ "The deliverables:\n",
1262
+ "\n",
1263
+ "1. The code you put together\n",
1264
+ "2. The high-level description of the setup\n",
1265
+ "3. The results of the quantitative comparison\n"
1266
+ ]
1267
+ },
1268
+ {
1269
+ "cell_type": "code",
1270
+ "execution_count": null,
1271
+ "id": "39c18ba0-e54a-478f-9e60-0ea65c29238a",
1272
+ "metadata": {
1273
+ "id": "39c18ba0-e54a-478f-9e60-0ea65c29238a"
1274
+ },
1275
+ "outputs": [],
1276
+ "source": [
1277
+ "# The code implementing LLM-as-a-Judge and the evaluation results"
1278
+ ]
1279
+ },
1280
+ {
1281
+ "cell_type": "code",
1282
+ "execution_count": null,
1283
+ "id": "2ce78700-2578-4719-8b6b-d59fc669d1c1",
1284
+ "metadata": {
1285
+ "id": "2ce78700-2578-4719-8b6b-d59fc669d1c1"
1286
+ },
1287
+ "outputs": [],
1288
+ "source": []
1289
+ }
1290
+ ],
1291
+ "metadata": {
1292
+ "kernelspec": {
1293
+ "display_name": "Python 3 (ipykernel)",
1294
+ "language": "python",
1295
+ "name": "python3"
1296
+ },
1297
+ "language_info": {
1298
+ "codemirror_mode": {
1299
+ "name": "ipython",
1300
+ "version": 3
1301
+ },
1302
+ "file_extension": ".py",
1303
+ "mimetype": "text/x-python",
1304
+ "name": "python",
1305
+ "nbconvert_exporter": "python",
1306
+ "pygments_lexer": "ipython3",
1307
+ "version": "3.10.11"
1308
+ },
1309
+ "colab": {
1310
+ "provenance": []
1311
+ }
1312
+ },
1313
+ "nbformat": 4,
1314
+ "nbformat_minor": 5
1315
+ }
README.md CHANGED
@@ -1,10 +1,11 @@
1
  ---
2
  title: RAG
3
  emoji: ⚡
4
- colorFrom: pink
5
- colorTo: gray
6
  sdk: gradio
7
- sdk_version: 4.19.0
8
- app_file: gradio_app\app.py
9
  pinned: false
 
10
  ---
 
1
  ---
2
  title: RAG
3
  emoji: ⚡
4
+ colorFrom: yellow
5
+ colorTo: indigo
6
  sdk: gradio
7
+ sdk_version: 4.4.1
8
+ app_file: app.py
9
  pinned: false
10
+ license: apache-2.0
11
  ---
gradio_app/app.py → app.py RENAMED
File without changes
chunked/content_aware_chunking/__config/chunk_0.txt ADDED
@@ -0,0 +1,13 @@
1
+ docstyle-ignore
2
+ INSTALL_CONTENT = """
3
+ Transformers installation
4
+ ! pip install transformers datasets
5
+ To install from source instead of the last release, comment the command above and uncomment the following one.
6
+ ! pip install git+https://github.com/huggingface/transformers.git
7
+ """
8
+ notebook_first_cells = [{"type": "code", "content": INSTALL_CONTENT}]
9
+ black_avoid_patterns = {
10
+ "{processor_class}": "FakeProcessorClass",
11
+ "{model_class}": "FakeModelClass",
12
+ "{object_class}": "FakeObjectClass",
13
+ }.
chunked/content_aware_chunking/__redirects/chunk_0.txt ADDED
@@ -0,0 +1,2 @@
1
+ Optimizing inference
2
+ perf_infer_gpu_many: perf_infer_gpu_one.
chunked/content_aware_chunking/__toctree/chunk_0.txt ADDED
File without changes
chunked/content_aware_chunking/__toctree/chunk_1.txt ADDED
@@ -0,0 +1,836 @@
1
+ sections:
2
+ local: index
3
+ title: 🤗 Transformers
4
+ local: quicktour
5
+ title: Quick tour
6
+ local: installation
7
+ title: Installation
8
+ title: Get started
9
+ sections:
10
+ local: pipeline_tutorial
11
+ title: Run inference with pipelines
12
+ local: autoclass_tutorial
13
+ title: Write portable code with AutoClass
14
+ local: preprocessing
15
+ title: Preprocess data
16
+ local: training
17
+ title: Fine-tune a pretrained model
18
+ local: run_scripts
19
+ title: Train with a script
20
+ local: accelerate
21
+ title: Set up distributed training with 🤗 Accelerate
22
+ local: peft
23
+ title: Load and train adapters with 🤗 PEFT
24
+ local: model_sharing
25
+ title: Share your model
26
+ local: transformers_agents
27
+ title: Agents
28
+ local: llm_tutorial
29
+ title: Generation with LLMs
30
+ title: Tutorials
31
+ sections:
32
+ isExpanded: false
33
+ sections:
34
+ local: tasks/sequence_classification
35
+ title: Text classification
36
+ local: tasks/token_classification
37
+ title: Token classification
38
+ local: tasks/question_answering
39
+ title: Question answering
40
+ local: tasks/language_modeling
41
+ title: Causal language modeling
42
+ local: tasks/masked_language_modeling
43
+ title: Masked language modeling
44
+ local: tasks/translation
45
+ title: Translation
46
+ local: tasks/summarization
47
+ title: Summarization
48
+ local: tasks/multiple_choice
49
+ title: Multiple choice
50
+ title: Natural Language Processing
51
+
52
+ isExpanded: false
53
+ sections:
54
+ local: tasks/audio_classification
55
+ title: Audio classification
56
+ local: tasks/asr
57
+ title: Automatic speech recognition
58
+ title: Audio
59
+
60
+ isExpanded: false
61
+ sections:
62
+ local: tasks/image_classification
63
+ title: Image classification
64
+ local: tasks/semantic_segmentation
65
+ title: Image segmentation
66
+ local: tasks/video_classification
67
+ title: Video classification
68
+ local: tasks/object_detection
69
+ title: Object detection
70
+ local: tasks/zero_shot_object_detection
71
+ title: Zero-shot object detection
72
+ local: tasks/zero_shot_image_classification
73
+ title: Zero-shot image classification
74
+ local: tasks/monocular_depth_estimation
75
+ title: Depth estimation
76
+ local: tasks/image_to_image
77
+ title: Image-to-Image
78
+ local: tasks/mask_generation
79
+ title: Mask Generation
80
+ local: tasks/knowledge_distillation_for_image_classification
81
+ title: Knowledge Distillation for Computer Vision
82
+ title: Computer Vision
83
+
84
+ isExpanded: false
85
+ sections:
86
+ local: tasks/image_captioning
87
+ title: Image captioning
88
+ local: tasks/document_question_answering
89
+ title: Document Question Answering
90
+ local: tasks/visual_question_answering
91
+ title: Visual Question Answering
92
+ local: tasks/text-to-speech
93
+ title: Text to speech
94
+ title: Multimodal
95
+
96
+ isExpanded: false
97
+ sections:
98
+ local: generation_strategies
99
+ title: Customize the generation strategy
100
+ title: Generation
101
+
102
+ isExpanded: false
103
+ sections:
104
+ local: tasks/idefics
105
+ title: Image tasks with IDEFICS
106
+ local: tasks/prompting
107
+ title: LLM prompting guide
108
+ title: Prompting
109
+ title: Task Guides
110
+
111
+ sections:
112
+ local: fast_tokenizers
113
+ title: Use fast tokenizers from 🤗 Tokenizers
114
+ local: multilingual
115
+ title: Run inference with multilingual models
116
+ local: create_a_model
117
+ title: Use model-specific APIs
118
+ local: custom_models
119
+ title: Share a custom model
120
+ local: chat_templating
121
+ title: Templates for chat models
122
+ local: trainer
123
+ title: Trainer
124
+ local: sagemaker
125
+ title: Run training on Amazon SageMaker
126
+ local: serialization
127
+ title: Export to ONNX
128
+ local: tflite
129
+ title: Export to TFLite
130
+ local: torchscript
131
+ title: Export to TorchScript
132
+ local: benchmarks
133
+ title: Benchmarks
134
+ local: notebooks
135
+ title: Notebooks with examples
136
+ local: community
137
+ title: Community resources
138
+ local: custom_tools
139
+ title: Custom Tools and Prompts
140
+ local: troubleshooting
141
+ title: Troubleshoot
142
+ local: hf_quantizer
143
+ title: Contribute new quantization method
144
+ title: Developer guides
145
+ sections:
146
+ local: performance
147
+ title: Overview
148
+ local: quantization
149
+ title: Quantization
150
+ sections:
151
+ local: perf_train_gpu_one
152
+ title: Methods and tools for efficient training on a single GPU
153
+ local: perf_train_gpu_many
154
+ title: Multiple GPUs and parallelism
155
+ local: fsdp
156
+ title: Fully Sharded Data Parallel
157
+ local: deepspeed
158
+ title: DeepSpeed
159
+ local: perf_train_cpu
160
+ title: Efficient training on CPU
161
+ local: perf_train_cpu_many
162
+ title: Distributed CPU training
163
+ local: perf_train_tpu_tf
164
+ title: Training on TPU with TensorFlow
165
+ local: perf_train_special
166
+ title: PyTorch training on Apple silicon
167
+ local: perf_hardware
168
+ title: Custom hardware for training
169
+ local: hpo_train
170
+ title: Hyperparameter Search using Trainer API
171
+ title: Efficient training techniques
172
+
173
+ sections:
174
+ local: perf_infer_cpu
175
+ title: CPU inference
176
+ local: perf_infer_gpu_one
177
+ title: GPU inference
178
+ title: Optimizing inference
179
+
180
+ local: big_models
181
+ title: Instantiating a big model
182
+ local: debugging
183
+ title: Debugging
184
+ local: tf_xla
185
+ title: XLA Integration for TensorFlow Models
186
+ local: perf_torch_compile
187
+ title: Optimize inference using torch.compile()
188
+ title: Performance and scalability
189
+ sections:
190
+ local: contributing
191
+ title: How to contribute to 🤗 Transformers?
192
+ local: add_new_model
193
+ title: How to add a model to 🤗 Transformers?
194
+ local: add_tensorflow_model
195
+ title: How to convert a 🤗 Transformers model to TensorFlow?
196
+ local: add_new_pipeline
197
+ title: How to add a pipeline to 🤗 Transformers?
198
+ local: testing
199
+ title: Testing
200
+ local: pr_checks
201
+ title: Checks on a Pull Request
202
+ title: Contribute
203
+ sections:
204
+ local: philosophy
205
+ title: Philosophy
206
+ local: glossary
207
+ title: Glossary
208
+ local: task_summary
209
+ title: What 🤗 Transformers can do
210
+ local: tasks_explained
211
+ title: How 🤗 Transformers solve tasks
212
+ local: model_summary
213
+ title: The Transformer model family
214
+ local: tokenizer_summary
215
+ title: Summary of the tokenizers
216
+ local: attention
217
+ title: Attention mechanisms
218
+ local: pad_truncation
219
+ title: Padding and truncation
220
+ local: bertology
221
+ title: BERTology
222
+ local: perplexity
223
+ title: Perplexity of fixed-length models
224
+ local: pipeline_webserver
225
+ title: Pipelines for webserver inference
226
+ local: model_memory_anatomy
227
+ title: Model training anatomy
228
+ local: llm_tutorial_optimization
229
+ title: Getting the most out of LLMs
230
+ title: Conceptual guides
231
+ sections:
232
+ sections:
233
+ local: main_classes/agent
234
+ title: Agents and Tools
235
+ local: model_doc/auto
236
+ title: Auto Classes
237
+ local: main_classes/backbones
238
+ title: Backbones
239
+ local: main_classes/callback
240
+ title: Callbacks
241
+ local: main_classes/configuration
242
+ title: Configuration
243
+ local: main_classes/data_collator
244
+ title: Data Collator
245
+ local: main_classes/keras_callbacks
246
+ title: Keras callbacks
247
+ local: main_classes/logging
248
+ title: Logging
249
+ local: main_classes/model
250
+ title: Models
251
+ local: main_classes/text_generation
252
+ title: Text Generation
253
+ local: main_classes/onnx
254
+ title: ONNX
255
+ local: main_classes/optimizer_schedules
256
+ title: Optimization
257
+ local: main_classes/output
258
+ title: Model outputs
259
+ local: main_classes/pipelines
260
+ title: Pipelines
261
+ local: main_classes/processors
262
+ title: Processors
263
+ local: main_classes/quantization
264
+ title: Quantization
265
+ local: main_classes/tokenizer
266
+ title: Tokenizer
267
+ local: main_classes/trainer
268
+ title: Trainer
269
+ local: main_classes/deepspeed
270
+ title: DeepSpeed
271
+ local: main_classes/feature_extractor
272
+ title: Feature Extractor
273
+ local: main_classes/image_processor
274
+ title: Image Processor
275
+ title: Main Classes
276
+
277
+ sections:
278
+ isExpanded: false
279
+ sections:
280
+ local: model_doc/albert
281
+ title: ALBERT
282
+ local: model_doc/bart
283
+ title: BART
284
+ local: model_doc/barthez
285
+ title: BARThez
286
+ local: model_doc/bartpho
287
+ title: BARTpho
288
+ local: model_doc/bert
289
+ title: BERT
290
+ local: model_doc/bert-generation
291
+ title: BertGeneration
292
+ local: model_doc/bert-japanese
293
+ title: BertJapanese
294
+ local: model_doc/bertweet
295
+ title: Bertweet
296
+ local: model_doc/big_bird
297
+ title: BigBird
298
+ local: model_doc/bigbird_pegasus
299
+ title: BigBirdPegasus
300
+ local: model_doc/biogpt
301
+ title: BioGpt
302
+ local: model_doc/blenderbot
303
+ title: Blenderbot
304
+ local: model_doc/blenderbot-small
305
+ title: Blenderbot Small
306
+ local: model_doc/bloom
307
+ title: BLOOM
308
+ local: model_doc/bort
309
+ title: BORT
310
+ local: model_doc/byt5
311
+ title: ByT5
312
+ local: model_doc/camembert
313
+ title: CamemBERT
314
+ local: model_doc/canine
315
+ title: CANINE
316
+ local: model_doc/codegen
317
+ title: CodeGen
318
+ local: model_doc/code_llama
319
+ title: CodeLlama
320
+ local: model_doc/convbert
321
+ title: ConvBERT
322
+ local: model_doc/cpm
323
+ title: CPM
324
+ local: model_doc/cpmant
325
+ title: CPMANT
326
+ local: model_doc/ctrl
327
+ title: CTRL
328
+ local: model_doc/deberta
329
+ title: DeBERTa
330
+ local: model_doc/deberta-v2
331
+ title: DeBERTa-v2
332
+ local: model_doc/dialogpt
333
+ title: DialoGPT
334
+ local: model_doc/distilbert
335
+ title: DistilBERT
336
+ local: model_doc/dpr
337
+ title: DPR
338
+ local: model_doc/electra
339
+ title: ELECTRA
340
+ local: model_doc/encoder-decoder
341
+ title: Encoder Decoder Models
342
+ local: model_doc/ernie
343
+ title: ERNIE
344
+ local: model_doc/ernie_m
345
+ title: ErnieM
346
+ local: model_doc/esm
347
+ title: ESM
348
+ local: model_doc/falcon
349
+ title: Falcon
350
+ local: model_doc/fastspeech2_conformer
351
+ title: FastSpeech2Conformer
352
+ local: model_doc/flan-t5
353
+ title: FLAN-T5
354
+ local: model_doc/flan-ul2
355
+ title: FLAN-UL2
356
+ local: model_doc/flaubert
357
+ title: FlauBERT
358
+ local: model_doc/fnet
359
+ title: FNet
360
+ local: model_doc/fsmt
361
+ title: FSMT
362
+ local: model_doc/funnel
363
+ title: Funnel Transformer
364
+ local: model_doc/fuyu
365
+ title: Fuyu
366
+ local: model_doc/openai-gpt
367
+ title: GPT
368
+ local: model_doc/gpt_neo
369
+ title: GPT Neo
370
+ local: model_doc/gpt_neox
371
+ title: GPT NeoX
372
+ local: model_doc/gpt_neox_japanese
373
+ title: GPT NeoX Japanese
374
+ local: model_doc/gptj
375
+ title: GPT-J
376
+ local: model_doc/gpt2
377
+ title: GPT2
378
+ local: model_doc/gpt_bigcode
379
+ title: GPTBigCode
380
+ local: model_doc/gptsan-japanese
381
+ title: GPTSAN Japanese
382
+ local: model_doc/gpt-sw3
383
+ title: GPTSw3
384
+ local: model_doc/herbert
385
+ title: HerBERT
386
+ local: model_doc/ibert
387
+ title: I-BERT
388
+ local: model_doc/jukebox
389
+ title: Jukebox
390
+ local: model_doc/led
391
+ title: LED
392
+ local: model_doc/llama
393
+ title: LLaMA
394
+ local: model_doc/llama2
395
+ title: Llama2
396
+ local: model_doc/longformer
397
+ title: Longformer
398
+ local: model_doc/longt5
399
+ title: LongT5
400
+ local: model_doc/luke
401
+ title: LUKE
402
+ local: model_doc/m2m_100
403
+ title: M2M100
404
+ local: model_doc/madlad-400
405
+ title: MADLAD-400
406
+ local: model_doc/marian
407
+ title: MarianMT
408
+ local: model_doc/markuplm
409
+ title: MarkupLM
410
+ local: model_doc/mbart
411
+ title: MBart and MBart-50
412
+ local: model_doc/mega
413
+ title: MEGA
414
+ local: model_doc/megatron-bert
415
+ title: MegatronBERT
416
+ local: model_doc/megatron_gpt2
417
+ title: MegatronGPT2
418
+ local: model_doc/mistral
419
+ title: Mistral
420
+ local: model_doc/mixtral
421
+ title: Mixtral
422
+ local: model_doc/mluke
423
+ title: mLUKE
424
+ local: model_doc/mobilebert
425
+ title: MobileBERT
426
+ local: model_doc/mpnet
427
+ title: MPNet
428
+ local: model_doc/mpt
429
+ title: MPT
430
+ local: model_doc/mra
431
+ title: MRA
432
+ local: model_doc/mt5
433
+ title: MT5
434
+ local: model_doc/mvp
435
+ title: MVP
436
+ local: model_doc/nezha
437
+ title: NEZHA
438
+ local: model_doc/nllb
439
+ title: NLLB
440
+ local: model_doc/nllb-moe
441
+ title: NLLB-MoE
442
+ local: model_doc/nystromformer
443
+ title: Nyströmformer
444
+ local: model_doc/open-llama
445
+ title: Open-Llama
446
+ local: model_doc/opt
447
+ title: OPT
448
+ local: model_doc/pegasus
449
+ title: Pegasus
450
+ local: model_doc/pegasus_x
451
+ title: PEGASUS-X
452
+ local: model_doc/persimmon
453
+ title: Persimmon
454
+ local: model_doc/phi
455
+ title: Phi
456
+ local: model_doc/phobert
457
+ title: PhoBERT
458
+ local: model_doc/plbart
459
+ title: PLBart
460
+ local: model_doc/prophetnet
461
+ title: ProphetNet
462
+ local: model_doc/qdqbert
463
+ title: QDQBert
464
+ local: model_doc/qwen2
465
+ title: Qwen2
466
+ local: model_doc/rag
467
+ title: RAG
468
+ local: model_doc/realm
469
+ title: REALM
470
+ local: model_doc/reformer
471
+ title: Reformer
472
+ local: model_doc/rembert
473
+ title: RemBERT
474
+ local: model_doc/retribert
475
+ title: RetriBERT
476
+ local: model_doc/roberta
477
+ title: RoBERTa
478
+ local: model_doc/roberta-prelayernorm
479
+ title: RoBERTa-PreLayerNorm
480
+ local: model_doc/roc_bert
481
+ title: RoCBert
482
+ local: model_doc/roformer
483
+ title: RoFormer
484
+ local: model_doc/rwkv
485
+ title: RWKV
486
+ local: model_doc/splinter
487
+ title: Splinter
488
+ local: model_doc/squeezebert
489
+ title: SqueezeBERT
490
+ local: model_doc/stablelm
491
+ title: StableLm
492
+ local: model_doc/switch_transformers
493
+ title: SwitchTransformers
494
+ local: model_doc/t5
495
+ title: T5
496
+ local: model_doc/t5v1.1
497
+ title: T5v1.1
498
+ local: model_doc/tapex
499
+ title: TAPEX
500
+ local: model_doc/transfo-xl
501
+ title: Transformer XL
502
+ local: model_doc/ul2
503
+ title: UL2
504
+ local: model_doc/umt5
505
+ title: UMT5
506
+ local: model_doc/xmod
507
+ title: X-MOD
508
+ local: model_doc/xglm
509
+ title: XGLM
510
+ local: model_doc/xlm
511
+ title: XLM
512
+ local: model_doc/xlm-prophetnet
513
+ title: XLM-ProphetNet
514
+ local: model_doc/xlm-roberta
515
+ title: XLM-RoBERTa
516
+ local: model_doc/xlm-roberta-xl
517
+ title: XLM-RoBERTa-XL
518
+ local: model_doc/xlm-v
519
+ title: XLM-V
520
+ local: model_doc/xlnet
521
+ title: XLNet
522
+ local: model_doc/yoso
523
+ title: YOSO
524
+ title: Text models
525
+ isExpanded: false
526
+ sections:
527
+ local: model_doc/beit
528
+ title: BEiT
529
+ local: model_doc/bit
530
+ title: BiT
531
+ local: model_doc/conditional_detr
532
+ title: Conditional DETR
533
+ local: model_doc/convnext
534
+ title: ConvNeXT
535
+ local: model_doc/convnextv2
536
+ title: ConvNeXTV2
537
+ local: model_doc/cvt
538
+ title: CvT
539
+ local: model_doc/deformable_detr
540
+ title: Deformable DETR
541
+ local: model_doc/deit
542
+ title: DeiT
543
+ local: model_doc/depth_anything
544
+ title: Depth Anything
545
+ local: model_doc/deta
546
+ title: DETA
547
+ local: model_doc/detr
548
+ title: DETR
549
+ local: model_doc/dinat
550
+ title: DiNAT
551
+ local: model_doc/dinov2
552
+ title: DINOV2
553
+ local: model_doc/dit
554
+ title: DiT
555
+ local: model_doc/dpt
556
+ title: DPT
557
+ local: model_doc/efficientformer
558
+ title: EfficientFormer
559
+ local: model_doc/efficientnet
560
+ title: EfficientNet
561
+ local: model_doc/focalnet
562
+ title: FocalNet
563
+ local: model_doc/glpn
564
+ title: GLPN
565
+ local: model_doc/imagegpt
566
+ title: ImageGPT
567
+ local: model_doc/levit
568
+ title: LeViT
569
+ local: model_doc/mask2former
570
+ title: Mask2Former
571
+ local: model_doc/maskformer
572
+ title: MaskFormer
573
+ local: model_doc/mobilenet_v1
574
+ title: MobileNetV1
575
+ local: model_doc/mobilenet_v2
576
+ title: MobileNetV2
577
+ local: model_doc/mobilevit
578
+ title: MobileViT
579
+ local: model_doc/mobilevitv2
580
+ title: MobileViTV2
581
+ local: model_doc/nat
582
+ title: NAT
583
+ local: model_doc/poolformer
584
+ title: PoolFormer
585
+ local: model_doc/pvt
586
+ title: Pyramid Vision Transformer (PVT)
587
+ local: model_doc/regnet
588
+ title: RegNet
589
+ local: model_doc/resnet
590
+ title: ResNet
591
+ local: model_doc/segformer
592
+ title: SegFormer
593
+ local: model_doc/swiftformer
594
+ title: SwiftFormer
595
+ local: model_doc/swin
596
+ title: Swin Transformer
597
+ local: model_doc/swinv2
598
+ title: Swin Transformer V2
599
+ local: model_doc/swin2sr
600
+ title: Swin2SR
601
+ local: model_doc/table-transformer
602
+ title: Table Transformer
603
+ local: model_doc/upernet
604
+ title: UperNet
605
+ local: model_doc/van
606
+ title: VAN
607
+ local: model_doc/vit
608
+ title: Vision Transformer (ViT)
609
+ local: model_doc/vit_hybrid
610
+ title: ViT Hybrid
611
+ local: model_doc/vitdet
612
+ title: ViTDet
613
+ local: model_doc/vit_mae
614
+ title: ViTMAE
615
+ local: model_doc/vitmatte
616
+ title: ViTMatte
617
+ local: model_doc/vit_msn
618
+ title: ViTMSN
619
+ local: model_doc/yolos
620
+ title: YOLOS
621
+ title: Vision models
622
+ isExpanded: false
623
+ sections:
624
+ local: model_doc/audio-spectrogram-transformer
625
+ title: Audio Spectrogram Transformer
626
+ local: model_doc/bark
627
+ title: Bark
628
+ local: model_doc/clap
629
+ title: CLAP
630
+ local: model_doc/encodec
631
+ title: EnCodec
632
+ local: model_doc/hubert
633
+ title: Hubert
634
+ local: model_doc/mctct
635
+ title: MCTCT
636
+ local: model_doc/mms
637
+ title: MMS
638
+ local: model_doc/musicgen
639
+ title: MusicGen
640
+ local: model_doc/pop2piano
641
+ title: Pop2Piano
642
+ local: model_doc/seamless_m4t
643
+ title: Seamless-M4T
644
+ local: model_doc/seamless_m4t_v2
645
+ title: SeamlessM4T-v2
646
+ local: model_doc/sew
647
+ title: SEW
648
+ local: model_doc/sew-d
649
+ title: SEW-D
650
+ local: model_doc/speech_to_text
651
+ title: Speech2Text
652
+ local: model_doc/speech_to_text_2
653
+ title: Speech2Text2
654
+ local: model_doc/speecht5
655
+ title: SpeechT5
656
+ local: model_doc/unispeech
657
+ title: UniSpeech
658
+ local: model_doc/unispeech-sat
659
+ title: UniSpeech-SAT
660
+ local: model_doc/univnet
661
+ title: UnivNet
662
+ local: model_doc/vits
663
+ title: VITS
664
+ local: model_doc/wav2vec2
665
+ title: Wav2Vec2
666
+ local: model_doc/wav2vec2-bert
667
+ title: Wav2Vec2-BERT
668
+ local: model_doc/wav2vec2-conformer
669
+ title: Wav2Vec2-Conformer
670
+ local: model_doc/wav2vec2_phoneme
671
+ title: Wav2Vec2Phoneme
672
+ local: model_doc/wavlm
673
+ title: WavLM
674
+ local: model_doc/whisper
675
+ title: Whisper
676
+ local: model_doc/xls_r
677
+ title: XLS-R
678
+ local: model_doc/xlsr_wav2vec2
679
+ title: XLSR-Wav2Vec2
680
+ title: Audio models
681
+ isExpanded: false
682
+ sections:
683
+ local: model_doc/timesformer
684
+ title: TimeSformer
685
+ local: model_doc/videomae
686
+ title: VideoMAE
687
+ local: model_doc/vivit
688
+ title: ViViT
689
+ title: Video models
690
+ isExpanded: false
691
+ sections:
692
+ local: model_doc/align
693
+ title: ALIGN
694
+ local: model_doc/altclip
695
+ title: AltCLIP
696
+ local: model_doc/blip
697
+ title: BLIP
698
+ local: model_doc/blip-2
699
+ title: BLIP-2
700
+ local: model_doc/bridgetower
701
+ title: BridgeTower
702
+ local: model_doc/bros
703
+ title: BROS
704
+ local: model_doc/chinese_clip
705
+ title: Chinese-CLIP
706
+ local: model_doc/clip
707
+ title: CLIP
708
+ local: model_doc/clipseg
709
+ title: CLIPSeg
710
+ local: model_doc/clvp
711
+ title: CLVP
712
+ local: model_doc/data2vec
713
+ title: Data2Vec
714
+ local: model_doc/deplot
715
+ title: DePlot
716
+ local: model_doc/donut
717
+ title: Donut
718
+ local: model_doc/flava
719
+ title: FLAVA
720
+ local: model_doc/git
721
+ title: GIT
722
+ local: model_doc/groupvit
723
+ title: GroupViT
724
+ local: model_doc/idefics
725
+ title: IDEFICS
726
+ local: model_doc/instructblip
727
+ title: InstructBLIP
728
+ local: model_doc/kosmos-2
729
+ title: KOSMOS-2
730
+ local: model_doc/layoutlm
731
+ title: LayoutLM
732
+ local: model_doc/layoutlmv2
733
+ title: LayoutLMV2
734
+ local: model_doc/layoutlmv3
735
+ title: LayoutLMV3
736
+ local: model_doc/layoutxlm
737
+ title: LayoutXLM
738
+ local: model_doc/lilt
739
+ title: LiLT
740
+ local: model_doc/llava
741
+ title: Llava
742
+ local: model_doc/lxmert
743
+ title: LXMERT
744
+ local: model_doc/matcha
745
+ title: MatCha
746
+ local: model_doc/mgp-str
747
+ title: MGP-STR
748
+ local: model_doc/nougat
749
+ title: Nougat
750
+ local: model_doc/oneformer
751
+ title: OneFormer
752
+ local: model_doc/owlvit
753
+ title: OWL-ViT
754
+ local: model_doc/owlv2
755
+ title: OWLv2
756
+ local: model_doc/perceiver
757
+ title: Perceiver
758
+ local: model_doc/pix2struct
759
+ title: Pix2Struct
760
+ local: model_doc/sam
761
+ title: Segment Anything
762
+ local: model_doc/siglip
763
+ title: SigLIP
764
+ local: model_doc/speech-encoder-decoder
765
+ title: Speech Encoder Decoder Models
766
+ local: model_doc/tapas
767
+ title: TAPAS
768
+ local: model_doc/trocr
769
+ title: TrOCR
770
+ local: model_doc/tvlt
771
+ title: TVLT
772
+ local: model_doc/tvp
773
+ title: TVP
774
+ local: model_doc/vilt
775
+ title: ViLT
776
+ local: model_doc/vipllava
777
+ title: VipLlava
778
+ local: model_doc/vision-encoder-decoder
779
+ title: Vision Encoder Decoder Models
780
+ local: model_doc/vision-text-dual-encoder
781
+ title: Vision Text Dual Encoder
782
+ local: model_doc/visual_bert
783
+ title: VisualBERT
784
+ local: model_doc/xclip
785
+ title: X-CLIP
786
+ title: Multimodal models
787
+ isExpanded: false
788
+ sections:
789
+ local: model_doc/decision_transformer
790
+ title: Decision Transformer
791
+ local: model_doc/trajectory_transformer
792
+ title: Trajectory Transformer
793
+ title: Reinforcement learning models
794
+ isExpanded: false
795
+ sections:
796
+ local: model_doc/autoformer
797
+ title: Autoformer
798
+ local: model_doc/informer
799
+ title: Informer
800
+ local: model_doc/patchtsmixer
801
+ title: PatchTSMixer
802
+ local: model_doc/patchtst
803
+ title: PatchTST
804
+ local: model_doc/time_series_transformer
805
+ title: Time Series Transformer
806
+ title: Time series models
807
+ isExpanded: false
808
+ sections:
809
+ local: model_doc/graphormer
810
+ title: Graphormer
811
+ title: Graph models
812
+ title: Models
813
+
814
+ sections:
815
+ local: internal/modeling_utils
816
+ title: Custom Layers and Utilities
817
+ local: internal/pipelines_utils
818
+ title: Utilities for pipelines
819
+ local: internal/tokenization_utils
820
+ title: Utilities for Tokenizers
821
+ local: internal/trainer_utils
822
+ title: Utilities for Trainer
823
+ local: internal/generation_utils
824
+ title: Utilities for Generation
825
+ local: internal/image_processing_utils
826
+ title: Utilities for Image Processors
827
+ local: internal/audio_utils
828
+ title: Utilities for Audio processing
829
+ local: internal/file_utils
830
+ title: General Utilities
831
+ local: internal/time_series_utils
832
+ title: Utilities for Time Series
833
+ title: Internal Helpers
834
+ title: API
835
+
836
+ .
chunked/content_aware_chunking/_accelerate/chunk_0.txt ADDED
@@ -0,0 +1,2 @@
1
+ Distributed training with 🤗 Accelerate
2
+ As models get bigger, parallelism has emerged as a strategy for training larger models on limited hardware and accelerating training speed by several orders of magnitude. At Hugging Face, we created the 🤗 Accelerate library to help users easily train a 🤗 Transformers model on any type of distributed setup, whether it is multiple GPU's on one machine or multiple GPU's across several machines.
chunked/content_aware_chunking/_accelerate/chunk_1.txt ADDED
@@ -0,0 +1,6 @@
1
+ In this tutorial, learn how to customize your native PyTorch training loop to enable training in a distributed environment.
2
+ Setup
3
+ Get started by installing 🤗 Accelerate:
4
+
5
+ pip install accelerate
6
+ Then import and create an [~accelerate.Accelerator] object. The [~accelerate.Accelerator] will automatically detect your type of distributed setup and initialize all the necessary components for training.
chunked/content_aware_chunking/_accelerate/chunk_2.txt ADDED
@@ -0,0 +1,7 @@
1
+ You don't need to explicitly place your model on a device.
2
+
3
+ from accelerate import Accelerator
4
+ accelerator = Accelerator()
5
+
6
+ Prepare to accelerate
7
+ The next step is to pass all the relevant training objects to the [~accelerate.Accelerator.prepare] method.
chunked/content_aware_chunking/_accelerate/chunk_3.txt ADDED
@@ -0,0 +1,72 @@
1
+ This includes your training and evaluation DataLoaders, a model and an optimizer:
2
+
3
+ train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
4
+ train_dataloader, eval_dataloader, model, optimizer
5
+ )
6
+
7
+ Backward
8
+ The last addition is to replace the typical loss.backward() in your training loop with 🤗 Accelerate's [~accelerate.Accelerator.backward]method:
9
+
10
+ for epoch in range(num_epochs):
11
+ for batch in train_dataloader:
12
+ outputs = model(**batch)
13
+ loss = outputs.loss
14
+ accelerator.backward(loss)
15
+
16
+ optimizer.step()
17
+ lr_scheduler.step()
18
+ optimizer.zero_grad()
19
+ progress_bar.update(1)
20
+
21
+ As you can see in the following code, you only need to add four additional lines of code to your training loop to enable distributed training!
22
+
23
+ + from accelerate import Accelerator
24
+ from transformers import AdamW, AutoModelForSequenceClassification, get_scheduler
25
+
26
+ accelerator = Accelerator()
27
+
28
+ model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
29
+ optimizer = AdamW(model.parameters(), lr=3e-5)
30
+
31
+ device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
32
+
33
+ model.to(device)
34
+
35
+ train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
36
+
37
+ train_dataloader, eval_dataloader, model, optimizer
38
+ )
39
+
40
+ num_epochs = 3
41
+ num_training_steps = num_epochs * len(train_dataloader)
42
+ lr_scheduler = get_scheduler(
43
+ "linear",
44
+ optimizer=optimizer,
45
+ num_warmup_steps=0,
46
+ num_training_steps=num_training_steps
47
+ )
48
+ progress_bar = tqdm(range(num_training_steps))
49
+ model.train()
50
+ for epoch in range(num_epochs):
51
+ for batch in train_dataloader:
52
+
53
+ outputs = model(**batch)
54
+ loss = outputs.loss
55
+
56
+ + accelerator.backward(loss)
57
+ optimizer.step()
58
+ lr_scheduler.step()
59
+ optimizer.zero_grad()
60
+ progress_bar.update(1)
61
+
62
+ Train
63
+ Once you've added the relevant lines of code, launch your training in a script or a notebook like Colaboratory.
64
+ Train with a script
65
+ If you are running your training from a script, run the following command to create and save a configuration file:
66
+
67
+ accelerate config
68
+ Then launch your training with:
69
+
70
+ accelerate launch train.py
71
+ Train with a notebook
72
+ 🤗 Accelerate can also run in a notebook if you're planning on using Colaboratory's TPUs.
chunked/content_aware_chunking/_accelerate/chunk_4.txt ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
1
+ Wrap all the code responsible for training in a function, and pass it to [~accelerate.notebook_launcher]:
2
+
3
+ from accelerate import notebook_launcher
4
+ notebook_launcher(training_function)
5
+
6
+ For more information about 🤗 Accelerate and its rich features, refer to the documentation.
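To make the notebook path more concrete, here is a minimal, self-contained sketch of such a training_function. The bert-base-cased checkpoint, the small GLUE/MRPC slice, and the batch size are illustrative choices only, not something this guide prescribes:

from accelerate import Accelerator, notebook_launcher
from datasets import load_dataset
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def training_function():
    # Create the Accelerator inside the launched function so each process
    # (GPU or TPU core) sets up its own distributed state.
    accelerator = Accelerator()

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

    # Tiny illustrative dataset slice; swap in your own data pipeline.
    dataset = load_dataset("glue", "mrpc", split="train[:128]")
    dataset = dataset.map(
        lambda batch: tokenizer(
            batch["sentence1"], batch["sentence2"], truncation=True, padding="max_length", max_length=128
        ),
        batched=True,
    )
    dataset = dataset.rename_column("label", "labels")
    dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
    train_dataloader = DataLoader(dataset, shuffle=True, batch_size=8)

    optimizer = AdamW(model.parameters(), lr=3e-5)
    model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        accelerator.backward(outputs.loss)  # replaces the usual loss.backward()
        optimizer.step()
        optimizer.zero_grad()

notebook_launcher(training_function)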
chunked/content_aware_chunking/_add_new_model/chunk_0.txt ADDED
@@ -0,0 +1,2 @@
 
 
 
1
+ How to add a model to 🤗 Transformers?
2
+ The 🤗 Transformers library is often able to offer new models thanks to community contributors. But adding a model can be a challenging project that requires in-depth knowledge of the 🤗 Transformers library and of the model to implement.
chunked/content_aware_chunking/_add_new_model/chunk_1.txt ADDED
@@ -0,0 +1,12 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ At Hugging Face, we're trying to empower more of the community to actively add models and we've put together this guide to walk you through the process of adding a PyTorch model (make sure you have PyTorch installed).
2
+
3
+ If you're interested in implementing a TensorFlow model, take a look at the How to convert a 🤗 Transformers model to TensorFlow guide!
4
+
5
+ Along the way, you'll:
6
+
7
+ get insights into open-source best practices
8
+ understand the design principles behind one of the most popular deep learning libraries
9
+ learn how to efficiently test large models
10
+ learn how to integrate Python utilities like black, ruff, and make fix-copies to ensure clean and readable code
11
+
12
+ A Hugging Face team member will be available to help you along the way so you'll never be alone.
chunked/content_aware_chunking/_add_new_model/chunk_10.txt ADDED
@@ -0,0 +1,2 @@
 
 
 
1
+ Note that the configuration and the model are always serialized into two
2
+ different formats - the model to a pytorch_model.bin file and the configuration to a config.json file.
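As a quick illustration, saving any pretrained model writes both files next to each other; the checkpoint name below is only a stand-in:

from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-cased")
model.save_pretrained("./saved_brand_new_bert")
# ./saved_brand_new_bert/config.json        <- serialized configuration
# ./saved_brand_new_bert/pytorch_model.bin  <- serialized weights
#    (recent versions of Transformers may write model.safetensors instead)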
chunked/content_aware_chunking/_add_new_model/chunk_100.txt ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ Hence, the documentation must be understandable and concise. It is very useful for
2
+ the community to add some Tips to show how the model should be used. Don't hesitate to ping the Hugging Face team
3
+ regarding the docstrings.
4
+ Next, make sure that the docstring added to src/transformers/models/brand_new_bert/modeling_brand_new_bert.py is
5
+ correct and includes all necessary inputs and outputs. We have a detailed guide about writing documentation and our docstring format here.
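As a rough, hedged sketch of the shape such a docstring usually takes (the wording, shapes, and arguments below are illustrative; follow the linked guide for the authoritative format):

from typing import Optional

import torch

# (In the real modeling file this would be the forward method of the model class.)
def forward(input_ids: torch.LongTensor, attention_mask: Optional[torch.Tensor] = None):
    r"""
    Args:
        input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
            Indices of input sequence tokens in the vocabulary.
        attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
            Mask to avoid performing attention on padding token indices.

    Returns:
        `torch.Tensor` of shape `(batch_size, sequence_length, hidden_size)`: the final hidden states.
    """
    ...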
chunked/content_aware_chunking/_add_new_model/chunk_101.txt ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ It is always good to remind oneself that documentation should
2
+ be treated at least as carefully as the code in 🤗 Transformers since the documentation is usually the first contact
3
+ point of the community with the model.
4
+ Code refactor
5
+ Great, now you have added all the necessary code for brand_new_bert.
chunked/content_aware_chunking/_add_new_model/chunk_102.txt ADDED
@@ -0,0 +1,10 @@
 
 
 
 
 
 
 
 
 
 
 
1
+ At this point, you should correct any potentially
2
+ incorrect code style by running:
3
+
4
+ make style
5
+ and verify that your coding style passes the quality check:
6
+
7
+ make quality
8
+ There are a couple of other very strict design tests in 🤗 Transformers that might still be failing, which will show up in
9
+ the tests of your pull request. This is often because of some missing information in the docstring or some incorrect
10
+ naming.
chunked/content_aware_chunking/_add_new_model/chunk_103.txt ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ The Hugging Face team will surely help you if you're stuck here.
2
+ Lastly, it is always a good idea to refactor one's code after having ensured that the code works correctly. With all
3
+ tests passing, now it's a good time to go over the added code again and do some refactoring.
4
+ You have now finished the coding part, congratulations! 🎉 You are awesome! 😎
5
+ 12.
chunked/content_aware_chunking/_add_new_model/chunk_104.txt ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ Upload the models to the model hub
2
+ In this final part, you should convert and upload all checkpoints to the model hub and add a model card for each
3
+ uploaded model checkpoint. You can get familiar with the hub functionalities by reading our Model sharing and uploading Page. You should work alongside the Hugging Face team here to decide on a fitting name for each
4
+ checkpoint and to get the required access rights to be able to upload the model under the author's organization of
5
+ brand_new_bert.
chunked/content_aware_chunking/_add_new_model/chunk_105.txt ADDED
@@ -0,0 +1,8 @@
 
 
 
 
 
 
 
 
 
1
+ The push_to_hub method, present in all models in transformers, is a quick and efficient way to push your checkpoint to the hub. A little snippet is pasted below:
2
+
3
+ brand_new_bert.push_to_hub("brand_new_bert")
4
+ # Uncomment the following line to push to an organization.
5
+ # brand_new_bert.push_to_hub("<organization>/brand_new_bert")
6
+
7
+ It is worth spending some time to create fitting model cards for each checkpoint. The model cards should highlight the
8
+ specific characteristics of this particular checkpoint, e.g.
chunked/content_aware_chunking/_add_new_model/chunk_106.txt ADDED
@@ -0,0 +1,7 @@
 
 
 
 
 
 
 
 
1
+ On which dataset was the checkpoint
2
+ pretrained/fine-tuned? On what downstream task should the model be used? Also include some code showing how to
3
+ correctly use the model.
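For example, a usage snippet of the sort a model card often contains might look roughly like this (the sentiment checkpoint below is a real model used purely as a stand-in for the brand_new_bert checkpoints you will upload):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"  # stand-in checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

inputs = tokenizer("This movie was great!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_label = model.config.id2label[logits.argmax(dim=-1).item()]
print(predicted_label)  # e.g. "POSITIVE"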
4
+ 13. (Optional) Add notebook
5
+ It is very helpful to add a notebook that showcases in-detail how brand_new_bert can be used for inference and/or
6
+ fine-tuned on a downstream task. This is not mandatory to merge your PR, but very useful for the community.
7
+ 14.
chunked/content_aware_chunking/_add_new_model/chunk_107.txt ADDED
@@ -0,0 +1,2 @@
 
 
 
1
+ Submit your finished PR
2
+ You're done programming now and can move to the last step, which is getting your PR merged into main.
chunked/content_aware_chunking/_add_new_model/chunk_108.txt ADDED
@@ -0,0 +1,7 @@
 
 
 
 
 
 
 
 
1
+ Usually, the
2
+ Hugging Face team should have helped you already at this point, but it is worth taking some time to give your finished
3
+ PR a nice description and add comments to your code if you want to point out certain design choices to your
4
+ reviewer.
5
+ Share your work!!
6
+ Now, it's time to get some credit from the community for your work! Having completed a model addition is a major
7
+ contribution to Transformers and the whole NLP community.
chunked/content_aware_chunking/_add_new_model/chunk_109.txt ADDED
@@ -0,0 +1,4 @@
 
 
 
 
 
1
+ Your code and the ported pre-trained models will certainly be
2
+ used by hundreds and possibly even thousands of developers and researchers. You should be proud of your work and share
3
+ your achievements with the community.
4
+ You have made another model that is super easy to access for everyone in the community! 🤯.
chunked/content_aware_chunking/_add_new_model/chunk_11.txt ADDED
@@ -0,0 +1,9 @@
 
 
 
 
 
 
 
 
 
 
1
+ Calling
2
+ [~PreTrainedModel.save_pretrained] will automatically call
3
+ [~PretrainedConfig.save_pretrained], so that both model and configuration are saved.
4
+ Code style
5
+ When coding your new model, keep in mind that Transformers is an opinionated library and we have a few quirks of our
6
+ own regarding how code should be written :-)
7
+
8
+ The forward pass of your model should be fully written in the modeling file while being fully independent of other
9
+ models in the library.
chunked/content_aware_chunking/_add_new_model/chunk_12.txt ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ If you want to reuse a block from another model, copy the code and paste it with a
2
+ # Copied from comment on top (see here
3
+ for a good example and there for more documentation on Copied from).
4
+ The code should be fully understandable, even by a non-native English speaker. This means you should pick
5
+ descriptive variable names and avoid abbreviations.
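To make the # Copied from convention above concrete, here is a sketch of how the marker is typically used, borrowing BERT's pooler as the copied block (the target class name is illustrative; the marker must reference a real module path so that make fix-copies can keep the copy in sync):

from torch import nn

# Copied from transformers.models.bert.modeling_bert.BertPooler with Bert->BrandNewBert
class BrandNewBertPooler(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # We "pool" the model by simply taking the hidden state corresponding to the first token.
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output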
chunked/content_aware_chunking/_add_new_model/chunk_13.txt ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
1
+ As an example, activation is preferred to act.
2
+ One-letter variable names are strongly discouraged unless it's an index in a for loop.
3
+ More generally, we prefer longer, explicit code to short, magical code.
4
+ Avoid subclassing nn.Sequential in PyTorch but subclass nn.Module and write the forward pass, so that anyone
5
+ using your code can quickly debug it by adding print statements or breakpoints.
6
+ Your function signature should be type-annotated.
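Putting these style points together, here is a hedged sketch of a hypothetical block written the preferred way (explicit nn.Module instead of nn.Sequential, descriptive names, annotated signature); the layer itself is invented for illustration:

import torch
from torch import nn

# Discouraged: nn.Sequential hides the forward pass.
# block = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))

class FeedForwardBlock(nn.Module):
    """Small hypothetical block with an explicit, debuggable forward pass."""

    def __init__(self, hidden_size: int, intermediate_size: int) -> None:
        super().__init__()
        self.dense_in = nn.Linear(hidden_size, intermediate_size)
        self.activation = nn.GELU()  # descriptive name rather than `act`
        self.dense_out = nn.Linear(intermediate_size, hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        hidden_states = self.dense_in(hidden_states)
        hidden_states = self.activation(hidden_states)
        # Easy to drop a print() or breakpoint() here while debugging.
        hidden_states = self.dense_out(hidden_states)
        return hidden_states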
chunked/content_aware_chunking/_add_new_model/chunk_14.txt ADDED
@@ -0,0 +1,8 @@
 
 
 
 
 
 
 
 
 
1
+ For the rest, good variable names are way more readable and
2
+ understandable than type annotations.
3
+
4
+ Overview of tokenizers
5
+ Not quite ready yet :-( This section will be added soon!
6
+ Step-by-step recipe to add a model to 🤗 Transformers
7
+ Everyone has different preferences for how to port a model, so it can be very helpful for you to take a look at summaries
8
+ of how other contributors ported models to Hugging Face.
chunked/content_aware_chunking/_add_new_model/chunk_15.txt ADDED
@@ -0,0 +1,11 @@
 
 
 
 
 
 
 
 
 
 
 
 
1
+ Here is a list of community blog posts on how to port a model:
2
+
3
+ Porting GPT2 Model by Thomas
4
+ Porting WMT19 MT Model by Stas
5
+
6
+ From experience, we can tell you that the most important things to keep in mind when adding a model are:
7
+
8
+ Don't reinvent the wheel! Most parts of the code you will add for the new 🤗 Transformers model already exist
9
+ somewhere in 🤗 Transformers. Take some time to find similar, already existing models and tokenizers you can copy
10
+ from. grep and rg are your
11
+ friends.
chunked/content_aware_chunking/_add_new_model/chunk_16.txt ADDED
@@ -0,0 +1,4 @@
 
 
 
 
 
1
+ Note that it might very well happen that your model's tokenizer is based on one model implementation, and
2
+ your model's modeling code on another one. E.g. FSMT's modeling code is based on BART, while FSMT's tokenizer code
3
+ is based on XLM.
4
+ It's more of an engineering challenge than a scientific challenge.
chunked/content_aware_chunking/_add_new_model/chunk_17.txt ADDED
@@ -0,0 +1,4 @@
 
 
 
 
 
1
+ You should spend more time creating an
2
+ efficient debugging environment rather than trying to understand all theoretical aspects of the model in the paper.
3
+ Ask for help when you're stuck! Models are the core component of 🤗 Transformers, so we at Hugging Face are more
4
+ than happy to help you at every step to add your model.
chunked/content_aware_chunking/_add_new_model/chunk_18.txt ADDED
@@ -0,0 +1,21 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ Don't hesitate to ask if you notice you are not making
2
+ progress.
3
+
4
+ In the following, we try to give you a general recipe that we found most useful when porting a model to 🤗 Transformers.
5
+ The following list is a summary of everything that has to be done to add a model and can be used by you as a To-Do
6
+ List:
7
+ ☐ (Optional) Understood the model's theoretical aspects
8
+ ☐ Prepared 🤗 Transformers dev environment
9
+ ☐ Set up debugging environment of the original repository
10
+ ☐ Created script that successfully runs the forward() pass using the original repository and checkpoint
11
+ ☐ Successfully added the model skeleton to 🤗 Transformers
12
+ ☐ Successfully converted original checkpoint to 🤗 Transformers checkpoint
13
+ ☐ Successfully ran forward() pass in 🤗 Transformers that gives identical output to original checkpoint
14
+ ☐ Finished model tests in 🤗 Transformers
15
+ ☐ Successfully added tokenizer in 🤗 Transformers
16
+ ☐ Run end-to-end integration tests
17
+ ☐ Finished docs
18
+ ☐ Uploaded model weights to the Hub
19
+ ☐ Submitted the pull request
20
+ ☐ (Optional) Added a demo notebook
21
+ To begin with, we usually recommend starting by getting a good theoretical understanding of BrandNewBert.
chunked/content_aware_chunking/_add_new_model/chunk_19.txt ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
1
+ However,
2
+ if you prefer to understand the theoretical aspects of the model on-the-job, then it is totally fine to directly dive
3
+ into BrandNewBert's codebase. This option might suit you better if your engineering skills are better than
4
+ your theoretical skills, if you have trouble understanding BrandNewBert's paper, or if you just enjoy programming
5
+ much more than reading scientific papers.
6
+ 1.
chunked/content_aware_chunking/_add_new_model/chunk_2.txt ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ 🤗 ❤️
2
+ To get started, open a New model addition issue for the model you want to see in 🤗 Transformers. If you're not especially picky about contributing a specific model, you can filter by the New model label to see if there are any unclaimed model requests and work on one.
3
+ Once you've opened a new model request, the first step is to get familiar with 🤗 Transformers if you aren't already!
4
+ General overview of 🤗 Transformers
5
+ First, you should get a general overview of 🤗 Transformers.
chunked/content_aware_chunking/_add_new_model/chunk_20.txt ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ (Optional) Theoretical aspects of BrandNewBert
2
+ You should take some time to read BrandNewBert's paper, if such descriptive work exists. There might be large
3
+ sections of the paper that are difficult to understand. If this is the case, this is fine - don't worry! The goal is
4
+ not to get a deep theoretical understanding of the paper, but to extract the necessary information required to
5
+ effectively re-implement the model in 🤗 Transformers.
chunked/content_aware_chunking/_add_new_model/chunk_21.txt ADDED
@@ -0,0 +1,15 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ That being said, you don't have to spend too much time on the
2
+ theoretical aspects, but rather focus on the practical ones, namely:
3
+
4
+ What type of model is brand_new_bert? BERT-like encoder-only model? GPT2-like decoder-only model? BART-like
5
+ encoder-decoder model? Look at the model_summary if you're not familiar with the differences between those.
6
+ What are the applications of brand_new_bert? Text classification? Text generation? Seq2Seq tasks, e.g.,
7
+ summarization?
8
+ What is the novel feature of the model that makes it different from BERT/GPT-2/BART?
9
+ Which of the already existing 🤗 Transformers models is most
10
+ similar to brand_new_bert?
11
+ What type of tokenizer is used? A SentencePiece tokenizer? A WordPiece tokenizer? Is it the same tokenizer as used
12
+ for BERT or BART?
13
+
14
+ After you feel like you have gotten a good overview of the architecture of the model, you might want to write to the
15
+ Hugging Face team with any questions you might have.
chunked/content_aware_chunking/_add_new_model/chunk_22.txt ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
1
+ This might include questions regarding the model's architecture,
2
+ its attention layer, etc. We will be more than happy to help you.
3
+ 2. Next prepare your environment
4
+
5
+ Fork the repository by clicking on the 'Fork' button on the
6
+ repository's page.
chunked/content_aware_chunking/_add_new_model/chunk_23.txt ADDED
@@ -0,0 +1,15 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ This creates a copy of the code under your GitHub user account.
2
+
3
+ Clone your transformers fork to your local disk, and add the base repository as a remote:
4
+
5
+ git clone https://github.com/[your Github handle]/transformers.git
6
+ cd transformers
7
+ git remote add upstream https://github.com/huggingface/transformers.git
8
+
9
+ Set up a development environment, for instance by running the following command:
10
+
11
+ python -m venv .env
12
+ source .env/bin/activate
13
+ pip install -e ".[dev]"
14
+ Depending on your OS, and since the number of optional dependencies of Transformers is growing, you might get a
15
+ failure with this command.
chunked/content_aware_chunking/_add_new_model/chunk_24.txt ADDED
@@ -0,0 +1,12 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ If that's the case, make sure to install the deep learning framework you are working with
2
+ (PyTorch, TensorFlow and/or Flax) then do:
3
+
4
+ pip install -e ".[quality]"
5
+ which should be enough for most use cases. You can then return to the parent directory
6
+
7
+ cd ..
8
+
9
+ We recommend adding the PyTorch version of brand_new_bert to Transformers. To install PyTorch, please follow the
10
+ instructions on https://pytorch.org/get-started/locally/.
11
+
12
+ Note: You don't need to have CUDA installed.
chunked/content_aware_chunking/_add_new_model/chunk_25.txt ADDED
@@ -0,0 +1,10 @@
 
 
 
 
 
 
 
 
 
 
 
1
+ Making the new model work on CPU is sufficient.
2
+
3
+ To port brand_new_bert, you will also need access to its original repository:
4
+
5
+ git clone https://github.com/org_that_created_brand_new_bert_org/brand_new_bert.git
6
+ cd brand_new_bert
7
+ pip install -e .
8
+ Now you have set up a development environment to port brand_new_bert to 🤗 Transformers.
9
+ 3.-4. Run a pretrained checkpoint using the original repository
10
+ At first, you will work on the original brand_new_bert repository.
chunked/content_aware_chunking/_add_new_model/chunk_26.txt ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ Often, the original implementation is very
2
+ “researchy”, meaning that documentation might be lacking and the code can be difficult to understand. But this should
3
+ be exactly your motivation to reimplement brand_new_bert. At Hugging Face, one of our main goals is to make people
4
+ stand on the shoulders of giants which translates here very well into taking a working model and rewriting it to make
5
+ it as accessible, user-friendly, and beautiful as possible.
chunked/content_aware_chunking/_add_new_model/chunk_27.txt ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ This is the number-one motivation to re-implement
2
+ models into 🤗 Transformers - trying to make complex new NLP technology accessible to everybody.
3
+ Therefore, you should start by diving into the original repository.
4
+ Successfully running the official pretrained model in the original repository is often the most difficult step.
5
+ From our experience, it is very important to spend some time getting familiar with the original code-base.
chunked/content_aware_chunking/_add_new_model/chunk_28.txt ADDED
@@ -0,0 +1,10 @@
 
 
 
 
 
 
 
 
 
 
 
1
+ You need to
2
+ figure out the following:
3
+
4
+ Where to find the pretrained weights?
5
+ How to load the pretrained weights into the corresponding model?
6
+ How to run the tokenizer independently from the model?
7
+ Trace one forward pass so that you know which classes and functions are required for a simple forward pass. Usually,
8
+ you only have to reimplement those functions.
9
+ Be able to locate the important components of the model: Where is the model's class? Are there model sub-classes,
10
+ e.g.
chunked/content_aware_chunking/_add_new_model/chunk_29.txt ADDED
@@ -0,0 +1,2 @@
 
 
 
1
+ EncoderModel, DecoderModel? Where is the self-attention layer? Are there multiple different attention layers,
2
+ e.g.
chunked/content_aware_chunking/_add_new_model/chunk_3.txt ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ 🤗 Transformers is a very opinionated library, so there is a
2
+ chance that you don't agree with some of the library's philosophies or design choices. From our experience, however, we
3
+ found that the fundamental design choices and philosophies of the library are crucial to efficiently scale 🤗
4
+ Transformers while keeping maintenance costs at a reasonable level.
5
+ A good first starting point to better understand the library is to read the documentation of our philosophy.
chunked/content_aware_chunking/_add_new_model/chunk_30.txt ADDED
@@ -0,0 +1,7 @@
 
 
 
 
 
 
 
 
1
+ self-attention, cross-attention?
2
+ How can you debug the model in the original environment of the repo? Do you have to add print statements, can you
3
+ work with an interactive debugger like ipdb, or should you use an efficient IDE to debug the model, like PyCharm?
4
+
5
+ It is very important that before you start the porting process, you can efficiently debug code in the original
6
+ repository! Also, remember that you are working with an open-source library, so do not hesitate to open an issue, or
7
+ even a pull request in the original repository.
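Whatever debugging setup you choose, a small script that pins down a single forward pass of the original model is usually worth writing early. Everything below is hypothetical, imports, names, and signatures included; the real repository will expose its own entry points:

import torch

# Hypothetical imports: replace them with whatever the original repository provides.
from brand_new_bert import BrandNewBertModel, build_tokenizer, load_checkpoint

tokenizer = build_tokenizer("path/to/vocab")                 # hypothetical helper
model = BrandNewBertModel(hidden_size=768, num_layers=12)    # hypothetical signature
load_checkpoint(model, "path/to/original_checkpoint.pt")     # hypothetical helper
model.eval()

input_ids = torch.tensor([tokenizer.encode("Hello, my dog is cute")])
with torch.no_grad():
    original_output = model(input_ids)

print(original_output.shape)  # keep this script around: it becomes your reference forward pass while porting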
chunked/content_aware_chunking/_add_new_model/chunk_31.txt ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ The maintainers of this repository are most likely very happy about
2
+ someone looking into their code!
3
+ At this point, it is really up to you which debugging environment and strategy you prefer to use to debug the original
4
+ model. We strongly advise against setting up a costly GPU environment; simply work on a CPU, both when starting to
5
+ dive into the original repository and also when starting to write the 🤗 Transformers implementation of the model.
chunked/content_aware_chunking/_add_new_model/chunk_32.txt ADDED
@@ -0,0 +1,10 @@
 
 
 
 
 
 
 
 
 
 
 
1
+ Only
2
+ at the very end, when the model has already been successfully ported to 🤗 Transformers, one should verify that the
3
+ model also works as expected on GPU.
4
+ In general, there are two possible debugging environments for running the original model:
5
+
6
+ Jupyter notebooks / Google Colab
7
+ Local Python scripts.
8
+
9
+ Jupyter notebooks have the advantage that they allow for cell-by-cell execution, which can be helpful to better split
10
+ logical components from one another and to have faster debugging cycles as intermediate results can be stored.
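One practical way to exploit this, in either environment, is to save reference tensors from the original model once and compare against them while porting. A small hedged sketch follows; the file name and tolerances are arbitrary choices, and original_model stands for whatever object the original repository exposes:

import torch

def compare_to_reference(ported_output: torch.Tensor, reference_path: str = "original_forward_output.pt") -> None:
    """Compare a tensor produced by the 🤗 Transformers port against a tensor saved from the original repo."""
    reference = torch.load(reference_path)
    torch.testing.assert_close(ported_output, reference, rtol=1e-3, atol=1e-3)

# In the original environment, run once:
#   torch.save(original_model(input_ids), "original_forward_output.pt")
# Then call compare_to_reference(...) on the corresponding tensor from your 🤗 implementation.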