JordiAb committed on
Commit b3e53a0
1 Parent(s): d75de74

model documentation

Files changed (1): README.md (+166 −1)
README.md CHANGED
@@ -4,4 +4,169 @@ language:
  pipeline_tag: text-classification
  ---

- asdf
# Named Entity Disambiguation with a BERT model

This is a task-specific model: a BERT model trained to disambiguate WikiData candidates for entities mentioned in news articles.
Given an entity mentioned in a news article, the context in which it is mentioned, and several WikiData candidates for that entity, the model scores which candidate the context is actually referring to.
Suppose you have a news article that mentions Donald Trump in the following sentence:
```python
entity = "Donald Trump"
sentence_mention = "It's not the first scare for Donald Trump. In early March it was announced that Brazilian President Jair Bolsonaro, gave positive to coronavirus, after participating in a meeting in Florida where the US president was..."
```
If you query WikiData for a specific entity, you may get several candidates for that single entity. For example, for Donald Trump:
```python
options = [
    'Donald Trump, president of the United States from 2017 to 2021',
    'Donald Trump, American physician',
    'Donald Trump, Wikimedia disambiguation page',
    'Donald Trump, song by Mac Miller',
    'Donald Trump, segment of an episode of Last Week Tonight',
    "Donald Trump, character Donald Trump in Anthony Davis's opera The Central Park Five",
    '2016 United States presidential election, 58th quadrennial U.S. presidential election'
]
```

The model is trained to score the following query string:

`f"Is '{entity}' in the context of: '{sentence_mention}', referring to [SEP] {option}?"`

**Make sure the query string doesn't exceed 512 tokens.** If it does, it is advisable to reduce the context around the entity mention in order to avoid truncation of the query string.
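One simple way to stay under that budget is to shrink the context to a window of words around the mention. The helper below is an illustrative sketch, not part of this repo; a precise check should count tokens with the model's tokenizer, since whitespace words only approximate BERT wordpieces.

```python
def shrink_context(entity: str, sentence_mention: str, max_words: int = 200) -> str:
    """Rough heuristic (assumption, not from the repo): keep a window of
    max_words words centered on the entity mention so the final query
    string is likely to fit the 512-token limit."""
    words = sentence_mention.split()
    if len(words) <= max_words:
        return sentence_mention
    # center the window on the first word of the entity, if it appears
    first = entity.split()[0]
    try:
        center = next(i for i, w in enumerate(words) if first in w)
    except StopIteration:
        center = 0
    half = max_words // 2
    # clamp the window so it never runs past either end of the sentence
    start = max(0, min(center - half, len(words) - max_words))
    return " ".join(words[start:start + max_words])
```

For a production check, tokenize the full query and verify `len(input_ids) <= 512` before falling back to a smaller window.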

To disambiguate, compute scores for all options of a single entity mentioned in a context, then pick the option with the maximum score.

```python
qry_strings = [
    f"Is '{entity}' in the context of: '{sentence_mention}', referring to [SEP] {option}?"
    for option in options
]
```

Option scores for the above example:

- Donald Trump, president of the United States from 2017 to 2021: 0.9990746974945068
- Donald Trump, American physician: 0.00032277879654429853
- Donald Trump, Wikimedia disambiguation page: 0.00044132230686955154
- Donald Trump, song by Mac Miller: 0.0003152454155497253
- Donald Trump, segment of an episode of Last Week Tonight: 0.00031540714553557336
- Donald Trump, character Donald Trump in Anthony Davis's opera The Central Park Five: 0.00030414783395826817
- 2016 United States presidential election, 58th quadrennial U.S. presidential election: 0.0005287989042699337

## Using the Model

To compute the score of a single query string:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

entity = "Donald Trump"
sentence_mention = "It's not the first scare for Donald Trump. In early March it was announced that Brazilian President Jair Bolsonaro, gave positive to coronavirus, after participating in a meeting in Florida where the US president was..."

options = [
    'Donald Trump, president of the United States from 2017 to 2021',
    'Donald Trump, American physician',
    'Donald Trump, Wikimedia disambiguation page',
    'Donald Trump, song by Mac Miller',
    'Donald Trump, segment of an episode of Last Week Tonight',
    "Donald Trump, character Donald Trump in Anthony Davis's opera The Central Park Five",
    '2016 United States presidential election, 58th quadrennial U.S. presidential election'
]
option = options[0]

# predictions will be made on the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# load the BERT NED model
model = BertForSequenceClassification.from_pretrained(
    'JordiAb/BERT_NED'
).eval().to(device)

# load the BERT NED tokenizer
tokenizer = BertTokenizer.from_pretrained('JordiAb/BERT_NED')

# build the query string required by the model
query = f"Is '{entity}' in the context of: '{sentence_mention}', referring to [SEP] {option}?"

# encode and tokenize the query string
encoded_dict = tokenizer.encode_plus(
    query,
    add_special_tokens=True,    # add '[CLS]' and '[SEP]'
    max_length=512,             # pad & truncate to 512 tokens
    padding='max_length',
    truncation=True,
    return_attention_mask=True, # construct attention masks
    return_tensors='pt',        # return PyTorch tensors
)

# move inputs to the GPU (if available)
input_ids = encoded_dict['input_ids'].to(device)
attention_mask = encoded_dict['attention_mask'].to(device)

with torch.no_grad():  # skip gradient computation to save memory
    # forward pass of the model
    outputs = model(
        input_ids=input_ids,
        token_type_ids=None,
        attention_mask=attention_mask
    )

# get the prediction logits
logits = outputs.logits
# use softmax to turn logits into probabilities
probabilities = torch.nn.functional.softmax(logits, dim=1)
# single observation: take row 0, move it to the CPU and convert to a numpy array
probabilities = probabilities[0].cpu().numpy()
```

`probabilities` is a numpy array containing two probabilities: the probability of belonging to class 0 and the probability of belonging to class 1, i.e. `np.array([prob0, prob1])`.

In this case we are interested in the probability of belonging to class 1, since class 1 is the positive label: the YES answer to the query string `"Is '{entity}' in the context of: '{sentence_mention}', referring to [SEP] {option}?"`
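Putting it together: once each option's class-1 probability has been computed with the forward pass above, disambiguation reduces to an argmax over the candidates. The sketch below is illustrative (the `disambiguate` helper is not part of the repo) and reuses the example scores listed earlier:

```python
# Class-1 probability per candidate (example scores reported above)
scores = {
    'Donald Trump, president of the United States from 2017 to 2021': 0.9990746974945068,
    'Donald Trump, American physician': 0.00032277879654429853,
    'Donald Trump, Wikimedia disambiguation page': 0.00044132230686955154,
    'Donald Trump, song by Mac Miller': 0.0003152454155497253,
    'Donald Trump, segment of an episode of Last Week Tonight': 0.00031540714553557336,
    "Donald Trump, character Donald Trump in Anthony Davis's opera The Central Park Five": 0.00030414783395826817,
    '2016 United States presidential election, 58th quadrennial U.S. presidential election': 0.0005287989042699337,
}

def disambiguate(scores: dict) -> str:
    # the predicted entity is the candidate with the highest class-1 probability
    return max(scores, key=scores.get)

best = disambiguate(scores)
# best -> 'Donald Trump, president of the United States from 2017 to 2021'
```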

## About the dataset used for this training

The dataset consists of news articles obtained from a Mexican newspaper, processed with Named Entity Recognition (NER) to identify the entities in each article. WikiData was queried for each identified entity to gather all of its potential matches. The `StableBeluga-7B` large language model (LLM) assisted in disambiguating the selected entities, with its outputs serving as labels for training.

This project approaches the task as a binary classification problem. The training data includes the entities from the articles, the relevant sentences (context) in which each entity is mentioned, and all of the WikiData options. Each entity-context-option triplet was paired with a binary label (1/0) to form a single training observation. To respect the model's limitations, inputs were truncated to fit within the 512-token maximum.
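The triplet-to-row construction described above can be sketched as follows; `build_rows` is a hypothetical helper (not from the repo) that turns one entity, its context, its candidate options, and the LLM-selected correct option into binary-labeled observations:

```python
def build_rows(entity: str, sentence_mention: str, options: list, correct_option: str) -> list:
    """Illustrative sketch: one labeled row per entity-context-option triplet.
    The option the LLM selected gets label 1; every other option gets label 0."""
    rows = []
    for option in options:
        rows.append({
            "bert_qry": (
                f"Is '{entity}' in the context of: '{sentence_mention}', "
                f"referring to [SEP] {option}?"
            ),
            "label": 1 if option == correct_option else 0,
        })
    return rows
```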

For example, with the Donald Trump example above, the dataset would look like:

```python
[
    {
        "bert_qry": "Is 'Donald Trump' in the context of: 'It's not the first scare for Donald Trump. In early March it was announced that Brazilian President Jair Bolsonaro, gave positive to coronavirus...', referring to [SEP] Donald Trump, president of the United States from 2017 to 2021?",
        "label": 1,
    },
    {
        "bert_qry": "Is 'Donald Trump' in the context of: 'It's not the first scare for Donald Trump. In early March it was announced that Brazilian President Jair Bolsonaro, gave positive to coronavirus...', referring to [SEP] Donald Trump, American physician?",
        "label": 0,
    },
    {
        "bert_qry": "Is 'Donald Trump' in the context of: 'It's not the first scare for Donald Trump. In early March it was announced that Brazilian President Jair Bolsonaro, gave positive to coronavirus...', referring to [SEP] Donald Trump, Wikimedia disambiguation page?",
        "label": 0,
    },
    {
        "bert_qry": "Is 'Donald Trump' in the context of: 'It's not the first scare for Donald Trump. In early March it was announced that Brazilian President Jair Bolsonaro, gave positive to coronavirus...', referring to [SEP] Donald Trump, song by Mac Miller?",
        "label": 0,
    },
    {
        "bert_qry": "Is 'Donald Trump' in the context of: 'It's not the first scare for Donald Trump. In early March it was announced that Brazilian President Jair Bolsonaro, gave positive to coronavirus...', referring to [SEP] Donald Trump, segment of an episode of Last Week Tonight?",
        "label": 0,
    },
    {
        "bert_qry": "Is 'Donald Trump' in the context of: 'It's not the first scare for Donald Trump. In early March it was announced that Brazilian President Jair Bolsonaro, gave positive to coronavirus...', referring to [SEP] Donald Trump, character Donald Trump in Anthony Davis's opera The Central Park Five?",
        "label": 0,
    },
    {
        "bert_qry": "Is 'Donald Trump' in the context of: 'It's not the first scare for Donald Trump. In early March it was announced that Brazilian President Jair Bolsonaro, gave positive to coronavirus...', referring to [SEP] 2016 United States presidential election, 58th quadrennial U.S. presidential election?",
        "label": 0,
    }
]
```

Repo of the project: https://github.com/Jordi-Ab/BERT_NED