mariacyepes96 committed on
Commit 87b6881
1 Parent(s): 59166e4

Training in progress epoch 0

Files changed (5)
  1. README.md +4 -6
  2. test_2 copy.ipynb +0 -0
  3. test_2.ipynb +408 -0
  4. tf_model.h5 +1 -1
  5. tokenizer.json +3 -3
README.md CHANGED
@@ -16,9 +16,9 @@ probably proofread and complete it, then remove this comment. -->
 
 This model is a fine-tuned version of [distilbert/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased) on an unknown dataset.
 It achieves the following results on the evaluation set:
- - Train Loss: 5.6767
- - Validation Loss: 5.6980
- - Epoch: 2
+ - Train Loss: 6.2045
+ - Validation Loss: 6.1385
+ - Epoch: 0
 
 ## Model description
 
@@ -44,9 +44,7 @@ The following hyperparameters were used during training:
 
 | Train Loss | Validation Loss | Epoch |
 |:----------:|:---------------:|:-----:|
- | 5.8854 | 5.7677 | 0 |
- | 5.7170 | 5.6980 | 1 |
- | 5.6767 | 5.6980 | 2 |
+ | 6.2045 | 6.1385 | 0 |
 
 
 ### Framework versions
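This commit records epoch 0 of a fresh training run, replacing the epoch-2 metrics from the previous run. The commit message "Training in progress epoch 0" matches the default message of the Keras `PushToHubCallback` in transformers, which pushes a checkpoint (tf_model.h5, tokenizer files, and the updated README metrics) to the Hub after every epoch. Below is a minimal sketch of that setup using the standard transformers Keras API; the repo id, task head, optimizer, and datasets are placeholders, not taken from this commit.

```python
# Hedged sketch: how per-epoch "Training in progress epoch N" commits are typically produced
# with transformers' Keras callback. The actual training script is not part of this commit.
from transformers import AutoTokenizer, TFAutoModelForQuestionAnswering
from transformers.keras_callbacks import PushToHubCallback

base_checkpoint = "distilbert/distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(base_checkpoint)
model = TFAutoModelForQuestionAnswering.from_pretrained(base_checkpoint)  # task head is an assumption

model.compile(optimizer="adam")  # the README's hyperparameter section lists the real optimizer settings

push_to_hub_callback = PushToHubCallback(
    output_dir="./model_checkpoints",           # local staging directory (placeholder)
    tokenizer=tokenizer,                        # so tokenizer.json is pushed alongside tf_model.h5
    hub_model_id="mariacyepes96/<model-repo>",  # placeholder repo id
)

# model.fit(train_set, validation_data=validation_set, epochs=3, callbacks=[push_to_hub_callback])
```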
test_2 copy.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
test_2.ipynb ADDED
@@ -0,0 +1,408 @@
+ {
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# !python3 -m venv env \n",
+ "# !source env/bin/activate \n",
+ "# !pip3 install langchain\n",
+ "# !pip3 install pypdf2"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import PyPDF2\n",
+ "import re"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "with open(\"bk_example.pdf\", \"rb\") as file:\n",
+ " reader = PyPDF2.PdfReader(file)\n",
+ " text_all = ''\n",
+ " # Extract text from each page\n",
+ " for page_num in range(len(reader.pages)):\n",
+ " page = reader.pages[page_num]\n",
+ " text = page.extract_text()\n",
+ " text_all = text_all +text"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import getpass\n",
+ "import os\n",
+ "\n",
+ "os.environ[\"LANGCHAIN_TRACING_V2\"] = \"true\"\n",
+ "os.environ[\"LANGCHAIN_API_KEY\"] = getpass.getpass()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from typing import Optional\n",
+ "\n",
+ "from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder\n",
+ "from langchain_core.pydantic_v1 import BaseModel, Field\n",
+ "\n",
+ "# Define a custom prompt to provide instructions and any additional context.\n",
+ "# 1) You can add examples into the prompt template to improve extraction quality\n",
+ "# 2) Introduce additional parameters to take context into account (e.g., include metadata\n",
+ "# about the document from which the text was extracted.)\n",
+ "prompt = ChatPromptTemplate.from_messages(\n",
+ " [\n",
+ " (\n",
+ " \"system\",\n",
+ " \"You are an expert extraction algorithm. \"\n",
+ " \"Only extract relevant information from the text. \"\n",
+ " \"If you do not know the value of an attribute asked to extract, \"\n",
+ " \"return null for the attribute's value.\",\n",
+ " ),\n",
+ " # Please see the how-to about improving performance with\n",
+ " # reference examples.\n",
+ " # MessagesPlaceholder('examples'),\n",
+ " (\"human\", \"{text}\"),\n",
+ " ]\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from typing import Optional\n",
+ "\n",
+ "from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder\n",
+ "from langchain_core.pydantic_v1 import BaseModel, Field\n",
+ "\n",
+ "# Define a custom prompt to provide instructions and any additional context.\n",
+ "# 1) You can add examples into the prompt template to improve extraction quality\n",
+ "# 2) Introduce additional parameters to take context into account (e.g., include metadata\n",
+ "# about the document from which the text was extracted.)\n",
+ "prompt = ChatPromptTemplate.from_messages(\n",
+ " [\n",
+ " (\n",
+ " \"system\",\n",
+ " \"You are an expert extraction algorithm. \"\n",
+ " \"Only extract relevant information from the text. \"\n",
+ " \"If you do not know the value of an attribute asked to extract, \"\n",
+ " \"return null for the attribute's value.\",\n",
+ " ),\n",
+ " # Please see the how-to about improving performance with\n",
+ " # reference examples.\n",
+ " # MessagesPlaceholder('examples'),\n",
+ " (\"human\", \"{text}\"),\n",
+ " ]\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [
+ {
+ "ename": "ModuleNotFoundError",
+ "evalue": "No module named 'langchain_mistralai'",
+ "output_type": "error",
+ "traceback": [
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+ "\u001b[0;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)",
+ "Cell \u001b[0;32mIn[11], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mlangchain_mistralai\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m ChatMistralAI\n\u001b[1;32m 3\u001b[0m llm \u001b[38;5;241m=\u001b[39m ChatMistralAI(model\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mmistral-large-latest\u001b[39m\u001b[38;5;124m\"\u001b[39m, temperature\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m0\u001b[39m)\n\u001b[1;32m 5\u001b[0m runnable \u001b[38;5;241m=\u001b[39m prompt \u001b[38;5;241m|\u001b[39m llm\u001b[38;5;241m.\u001b[39mwith_structured_output(schema\u001b[38;5;241m=\u001b[39mPerson)\n",
+ "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'langchain_mistralai'"
+ ]
+ }
+ ],
+ "source": [
+ "from langchain_mistralai import ChatMistralAI\n",
+ "\n",
+ "llm = ChatMistralAI(model=\"mistral-large-latest\", temperature=0)\n",
+ "\n",
+ "runnable = prompt | llm.with_structured_output(schema=Person)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from typing import List, Optional\n",
+ "\n",
+ "from langchain_core.pydantic_v1 import BaseModel, Field\n",
+ "\n",
+ "\n",
+ "class Bankruptcy(BaseModel):\n",
+ " \"\"\"Information about a bankruptcy declaration.\"\"\"\n",
+ "\n",
+ " # ^ Doc-string for the entity Person.\n",
+ " # This doc-string is sent to the LLM as the description of the schema Person,\n",
+ " # and it can help to improve extraction results.\n",
+ "\n",
+ " # Note that:\n",
+ " # 1. Each field is an `optional` -- this allows the model to decline to extract it!\n",
+ " # 2. Each field has a `description` -- this description is used by the LLM.\n",
+ " # Having a good description can help improve extraction results.\n",
+ " ssns: Optional[list] = Field(default=None, description=\"The ssns of the persons\")\n",
+ " chapter: Optional[str] = Field(\n",
+ " default=None, description=\"The chapter of the bankruptcy declaration\"\n",
+ " )\n",
+ " country: Optional[str] = Field(\n",
+ " default=None, description=\"Country were the bankruptcy declaration is made\"\n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "class Data(BaseModel):\n",
+ " \"\"\"Extracted data about bankruptcy declaration..\"\"\"\n",
+ "\n",
+ " # Creates a model so that we can extract multiple entities.\n",
+ " people: List[Bankruptcy]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "ename": "NameError",
+ "evalue": "name 'prompt' is not defined",
+ "output_type": "error",
+ "traceback": [
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+ "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
+ "Cell \u001b[0;32mIn[8], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m runnable \u001b[38;5;241m=\u001b[39m \u001b[43mprompt\u001b[49m \u001b[38;5;241m|\u001b[39m llm\u001b[38;5;241m.\u001b[39mwith_structured_output(schema\u001b[38;5;241m=\u001b[39mData)\n\u001b[1;32m 2\u001b[0m runnable\u001b[38;5;241m.\u001b[39minvoke({\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mtext\u001b[39m\u001b[38;5;124m\"\u001b[39m: text_all})\n",
+ "\u001b[0;31mNameError\u001b[0m: name 'prompt' is not defined"
+ ]
+ }
+ ],
+ "source": [
+ "runnable = prompt | llm.with_structured_output(schema=Data)\n",
+ "runnable.invoke({\"text\": text_all})"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#print(text_all)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#Find SSNs\n",
+ "ssn_pattern = r'\\b(?:Social Security number|ITIN)\\D*(\\d{3}[−\\s]\\d{2}[−\\s]\\d{4})\\b'\n",
+ "ssns = re.findall(ssn_pattern, text_all)\n",
+ "\n",
+ "def find_ssns(text):\n",
+ " ssns = re.findall(ssn_pattern, text_all)\n",
+ " return ssns"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#Find chapter\n",
+ "chapter_pattern = r'Notice of Chapter (\\d+) Bankruptcy Case \\d{1,2}/\\d{2}'\n",
+ "\n",
+ "def find_chapter(text):\n",
+ " chapters = re.findall(chapter_pattern, text_all)\n",
+ " return chapters[0]\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "country_code = {\"United States\": \"US\", \"Canada\":\"CA\"}\n",
+ "\n",
+ "country_pattern = r'\\b(?:United States|Canada)\\b'\n",
+ "\n",
+ "def find_country_code(text):\n",
+ " country_match = re.search(country_pattern, text, re.IGNORECASE)\n",
+ " return country_code.get(country_match[0],None) "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#Find State\n",
+ "state_pattern = r'\\nDistrict of (\\w+)'\n",
+ "\n",
+ "# Dictionaries for state codes\n",
+ "us_states = {\n",
+ " \"Alabama\": \"AL\", \"Alaska\": \"AK\", \"Arizona\": \"AZ\", \"Arkansas\": \"AR\", \"California\": \"CA\",\n",
+ " \"Colorado\": \"CO\", \"Connecticut\": \"CT\", \"Delaware\": \"DE\", \"Florida\": \"FL\", \"Georgia\": \"GA\",\n",
+ " \"Hawaii\": \"HI\", \"Idaho\": \"ID\", \"Illinois\": \"IL\", \"Indiana\": \"IN\", \"Iowa\": \"IA\",\n",
+ " \"Kansas\": \"KS\", \"Kentucky\": \"KY\", \"Louisiana\": \"LA\", \"Maine\": \"ME\", \"Maryland\": \"MD\",\n",
+ " \"Massachusetts\": \"MA\", \"Michigan\": \"MI\", \"Minnesota\": \"MN\", \"Mississippi\": \"MS\", \"Missouri\": \"MO\",\n",
+ " \"Montana\": \"MT\", \"Nebraska\": \"NE\", \"Nevada\": \"NV\", \"New Hampshire\": \"NH\", \"New Jersey\": \"NJ\",\n",
+ " \"New Mexico\": \"NM\", \"New York\": \"NY\", \"North Carolina\": \"NC\", \"North Dakota\": \"ND\", \"Ohio\": \"OH\",\n",
+ " \"Oklahoma\": \"OK\", \"Oregon\": \"OR\", \"Pennsylvania\": \"PA\", \"Rhode Island\": \"RI\", \"South Carolina\": \"SC\",\n",
+ " \"South Dakota\": \"SD\", \"Tennessee\": \"TN\", \"Texas\": \"TX\", \"Utah\": \"UT\", \"Vermont\": \"VT\",\n",
+ " \"Virginia\": \"VA\", \"Washington\": \"WA\", \"West Virginia\": \"WV\", \"Wisconsin\": \"WI\", \"Wyoming\": \"WY\"\n",
+ "}\n",
+ "\n",
+ "canadian_provinces = {\n",
+ " \"Alberta\": \"AB\", \"British Columbia\": \"BC\", \"Manitoba\": \"MB\", \"New Brunswick\": \"NB\", \"Newfoundland and Labrador\": \"NL\",\n",
+ " \"Northwest Territories\": \"NT\", \"Nova Scotia\": \"NS\", \"Nunavut\": \"NU\", \"Ontario\": \"ON\", \"Prince Edward Island\": \"PE\",\n",
+ " \"Quebec\": \"QC\", \"Saskatchewan\": \"SK\", \"Yukon\": \"YT\"\n",
+ "}\n",
+ "\n",
+ "def find_state_code(text,country_code):\n",
+ " state_match = re.search(state_pattern, text)\n",
+ " \n",
+ " if state_match:\n",
+ " # Extract the state or province name from the match\n",
+ " state_name = state_match.group(1).strip()\n",
+ " \n",
+ " if country_code == 'US':\n",
+ " state_code = us_states.get(state_name,None)\n",
+ " elif country_code == 'CA':\n",
+ " state_code = canadian_provinces.get(state_name,None)\n",
+ " else:\n",
+ " state_code = None\n",
+ " \n",
+ " return state_code\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#Find stage\n",
+ "stage_patterns = {\n",
+ " 'Petition': r'\\b(case filed|petition filed|automatic stay)\\b',\n",
+ " 'Discharge': r'\\b(discharge of debts|discharge order|case discharged)\\b',\n",
+ " 'Dismissed': r'\\b(case dismissed|dismissal|converted to Chapter 7)\\b'\n",
+ "}\n",
+ "\n",
+ "# Function to categorize bankruptcy stages from text\n",
+ "def categorize_stage(text):\n",
+ " categorized_stages = {'Petition': False, 'Discharge': False, 'Dismissed': False}\n",
+ " \n",
+ " for stage, pattern in stage_patterns.items():\n",
+ " if re.search(pattern, text, re.IGNORECASE):\n",
+ " categorized_stages[stage] = True\n",
+ " \n",
+ " # Determine the final stage based on the presence of keywords\n",
+ " if categorized_stages['Petition']:\n",
+ " return 'Petition'\n",
+ " elif categorized_stages['Discharge']:\n",
+ " return 'Discharge'\n",
+ " elif categorized_stages['Dismissed']:\n",
+ " return 'Dismissed'\n",
+ " else:\n",
+ " return 'Unknown'"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Data found: {'ssns': ['461−81−0513', '529−97−1200'], 'chapter': '13', 'country_code': 'US', 'state': 'UT', 'stage': 'Petition'}\n"
+ ]
+ }
+ ],
+ "source": [
+ "data = { \"ssns\": find_ssns(text_all),\n",
+ " \"chapter\": find_chapter(text_all),\n",
+ " \"country_code\": find_country_code(text_all),\n",
+ " \"state\": find_state_code(text_all, find_country_code(text_all)),\n",
+ " \"stage\": categorize_stage(text_all)\n",
+ " }\n",
+ "\n",
+ "print(f\"Data found: {data}\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.6"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+ }
tf_model.h5 CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:93da27aaa4b11ffaa209720a01eb150ecdf0fd014c1bd1f5e623c90b293e841c
+ oid sha256:142c651418d7bc9ced202b8cb5b0154f7378acee3eb6d55b7bea7affea11c187
 size 265583592
tokenizer.json CHANGED
@@ -2,13 +2,13 @@
 "version": "1.0",
 "truncation": {
 "direction": "Right",
- "max_length": 384,
+ "max_length": 512,
 "strategy": "OnlySecond",
- "stride": 0
+ "stride": 128
 },
 "padding": {
 "strategy": {
- "Fixed": 384
+ "Fixed": 512
 },
 "direction": "Right",
 "pad_to_multiple_of": null,