---
pipeline_tag: text-classification
---

# Named Entity Disambiguation with a BERT Model

This is a task-specific model: a BERT model trained to disambiguate among WikiData candidates for entities mentioned in news articles.
Given an entity mentioned in a news article, the set of WikiData candidates for that entity, and the context in which the entity is mentioned, the model answers: which candidate is the context referring to?

Suppose you have a news article that mentions Donald Trump in the following sentence:

```python
entity = "Donald Trump"
sentence_mention = "It's not the first scare for Donald Trump. In early March it was announced that Brazilian President Jair Bolsonaro tested positive for coronavirus, after participating in a meeting in Florida where the US president was..."
```

If you query WikiData for a specific entity, you may get several candidates for a single mention. For example, for Donald Trump:

```python
options = [
    'Donald Trump, president of the United States from 2017 to 2021',
    'Donald Trump, American physician',
    'Donald Trump, Wikimedia disambiguation page',
    'Donald Trump, song by Mac Miller',
    'Donald Trump, segment of an episode of Last Week Tonight',
    "Donald Trump, character Donald Trump in Anthony Davis's opera The Central Park Five",
    '2016 United States presidential election, 58th quadrennial U.S. presidential election'
]
```

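Each option above is a `label, description` pair. As a rough illustration of how such candidates can be fetched (a minimal sketch using WikiData's public `wbsearchentities` API; the retrieval pipeline actually used for this model may differ, and `wikidata_candidates` is our name for illustration):

```python
import requests

def wikidata_candidates(entity: str, limit: int = 10) -> list:
    """Fetch candidate entities from WikiData as 'label, description' strings."""
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbsearchentities",  # WikiData entity-search endpoint
            "search": entity,
            "language": "en",
            "format": "json",
            "limit": limit,
        },
        timeout=10,
    )
    resp.raise_for_status()
    return [
        f"{hit['label']}, {hit.get('description', '')}"
        for hit in resp.json().get("search", [])
    ]

# e.g. wikidata_candidates("Donald Trump") returns a list similar to `options` above
```
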
This model is trained to assign a score to the following query string:

`f"Is '{entity}' in the context of: '{sentence_mention}', referring to [SEP] {option}?"`

**Make sure the query string does not exceed 512 tokens.** If it does, it is advisable to shorten the context around the entity mention to avoid truncating the query string.

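As a quick guard, you can count tokens with the model's tokenizer before scoring. A minimal sketch follows; `shorten_context` is a hypothetical helper (not part of this repo) that keeps a window of words around the mention, and the window size is arbitrary:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('JordiAb/BERT_NED')

def shorten_context(sentence: str, entity: str, window: int = 100) -> str:
    """Hypothetical helper: keep about `window` words on each side of the mention."""
    words = sentence.split()
    first_word = entity.split()[0]
    # index of the first word of the mention (fall back to the start of the sentence)
    idx = next((i for i, w in enumerate(words) if first_word in w), 0)
    return " ".join(words[max(0, idx - window):idx + window])

query = f"Is '{entity}' in the context of: '{sentence_mention}', referring to [SEP] {options[0]}?"
if len(tokenizer.encode(query)) > 512:  # token count includes [CLS] and [SEP]
    sentence_mention = shorten_context(sentence_mention, entity)
```
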
To disambiguate, compute scores for all candidates of a single entity mention in its context, then pick the candidate with the maximum score:

```python
qry_strings = [
    f"Is '{entity}' in the context of: '{sentence_mention}', referring to [SEP] {option}?"
    for option in options
]
```

Option scores for the above example:

- Donald Trump, president of the United States from 2017 to 2021: 0.9990746974945068
- Donald Trump, American physician: 0.00032277879654429853
- Donald Trump, Wikimedia disambiguation page: 0.00044132230686955154
- Donald Trump, song by Mac Miller: 0.0003152454155497253
- Donald Trump, segment of an episode of Last Week Tonight: 0.00031540714553557336
- Donald Trump, character Donald Trump in Anthony Davis's opera The Central Park Five: 0.00030414783395826817
- 2016 United States presidential election, 58th quadrennial U.S. presidential election: 0.0005287989042699337

## Using the Model

To compute the score of a single query string:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

entity = "Donald Trump"
sentence_mention = "It's not the first scare for Donald Trump. In early March it was announced that Brazilian President Jair Bolsonaro tested positive for coronavirus, after participating in a meeting in Florida where the US president was..."

options = [
    'Donald Trump, president of the United States from 2017 to 2021',
    'Donald Trump, American physician',
    'Donald Trump, Wikimedia disambiguation page',
    'Donald Trump, song by Mac Miller',
    'Donald Trump, segment of an episode of Last Week Tonight',
    "Donald Trump, character Donald Trump in Anthony Davis's opera The Central Park Five",
    '2016 United States presidential election, 58th quadrennial U.S. presidential election'
]
option = options[0]

# predictions run on the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# load the BERT NED model
model = BertForSequenceClassification.from_pretrained('JordiAb/BERT_NED').eval().to(device)
# load the BERT NED tokenizer
tokenizer = BertTokenizer.from_pretrained('JordiAb/BERT_NED')

# build the query string required by the model
query = f"Is '{entity}' in the context of: '{sentence_mention}', referring to [SEP] {option}?"

# encode and tokenize the query string
encoded_dict = tokenizer.encode_plus(
    query,                       # sentence to encode
    add_special_tokens=True,     # add '[CLS]' and '[SEP]'
    max_length=512,              # pad and truncate all sentences
    padding='max_length',        # apply padding as needed
    truncation=True,
    return_attention_mask=True,  # construct attention masks
    return_tensors='pt',         # return PyTorch tensors
)

# move input ids and attention mask to the GPU (if available)
input_ids = encoded_dict['input_ids'].to(device)
attention_mask = encoded_dict['attention_mask'].to(device)

with torch.no_grad():  # skip gradient computation to save memory
    # forward pass of the model
    outputs = model(
        input_ids=input_ids,
        token_type_ids=None,
        attention_mask=attention_mask
    )

# get the prediction logits
logits = outputs.logits
# use softmax to turn logits into probabilities
probabilities = torch.nn.functional.softmax(logits, dim=1)
# single observation, so take probabilities[0]; move it to the CPU and convert to a numpy array
probabilities = probabilities[0].cpu().numpy()
```

`probabilities` is a numpy array containing two probabilities, `np.array([prob0, prob1])`: the probability of belonging to class 0 and the probability of belonging to class 1.

Here we are interested in the probability of class 1, since class 1 is the positive label, i.e. a YES answer to the query string `"Is '{entity}' in the context of: '{sentence_mention}', referring to [SEP] {option}?"`.

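Putting it together, disambiguation scores every candidate and keeps the one with the highest class-1 probability. A minimal sketch reusing the `model`, `tokenizer`, and `device` from the block above (`ned_score` is our helper name, not part of the repo):

```python
def ned_score(entity, sentence_mention, option):
    """Return the class-1 (YES) probability for one entity/context/option triplet."""
    query = f"Is '{entity}' in the context of: '{sentence_mention}', referring to [SEP] {option}?"
    enc = tokenizer.encode_plus(
        query, add_special_tokens=True, max_length=512,
        padding='max_length', truncation=True,
        return_attention_mask=True, return_tensors='pt',
    )
    with torch.no_grad():
        logits = model(
            input_ids=enc['input_ids'].to(device),
            attention_mask=enc['attention_mask'].to(device),
        ).logits
    return torch.nn.functional.softmax(logits, dim=1)[0, 1].item()

scores = [ned_score(entity, sentence_mention, opt) for opt in options]
best_option = options[scores.index(max(scores))]  # candidate with the highest YES probability
```
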
## About the Dataset Used for Training

The dataset consists of news articles from a Mexican newspaper, processed with Named Entity Recognition (NER) to identify the entities in each article. WikiData was queried for each identified entity to gather all of its potential matches. The `StableBeluga-7B` large language model (LLM) was used to disambiguate a selection of entities from the dataset, and its outputs served as training labels.

This project treats the task as a binary classification problem. The training data combines each entity, the relevant sentence (context) in which it is mentioned, and its WikiData options; each entity-context-option triplet carries a binary label (1/0) and forms a single training observation. Inputs were truncated to the 512-token maximum to respect the model's limits.

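A minimal sketch of how such rows can be generated (`build_rows` and `correct_idx`, which marks the LLM-chosen option, are our names for illustration):

```python
def build_rows(entity, sentence_mention, options, correct_idx):
    """Turn one entity mention and its WikiData options into labeled training rows."""
    return [
        {
            "bert_qry": f"Is '{entity}' in the context of: '{sentence_mention}', "
                        f"referring to [SEP] {option}?",
            "label": int(i == correct_idx),  # 1 only for the correct candidate
        }
        for i, option in enumerate(options)
    ]
```
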
For example, the Donald Trump mention above would yield the following dataset rows:

```python
[
    {
        "bert_qry": "Is 'Donald Trump' in the context of: 'It's not the first scare for Donald Trump. In early March it was announced that Brazilian President Jair Bolsonaro tested positive for coronavirus...', referring to [SEP] Donald Trump, president of the United States from 2017 to 2021?",
        "label": 1,
    },
    {
        "bert_qry": "Is 'Donald Trump' in the context of: 'It's not the first scare for Donald Trump. In early March it was announced that Brazilian President Jair Bolsonaro tested positive for coronavirus...', referring to [SEP] Donald Trump, American physician?",
        "label": 0,
    },
    {
        "bert_qry": "Is 'Donald Trump' in the context of: 'It's not the first scare for Donald Trump. In early March it was announced that Brazilian President Jair Bolsonaro tested positive for coronavirus...', referring to [SEP] Donald Trump, Wikimedia disambiguation page?",
        "label": 0,
    },
    {
        "bert_qry": "Is 'Donald Trump' in the context of: 'It's not the first scare for Donald Trump. In early March it was announced that Brazilian President Jair Bolsonaro tested positive for coronavirus...', referring to [SEP] Donald Trump, song by Mac Miller?",
        "label": 0,
    },
    {
        "bert_qry": "Is 'Donald Trump' in the context of: 'It's not the first scare for Donald Trump. In early March it was announced that Brazilian President Jair Bolsonaro tested positive for coronavirus...', referring to [SEP] Donald Trump, segment of an episode of Last Week Tonight?",
        "label": 0,
    },
    {
        "bert_qry": "Is 'Donald Trump' in the context of: 'It's not the first scare for Donald Trump. In early March it was announced that Brazilian President Jair Bolsonaro tested positive for coronavirus...', referring to [SEP] Donald Trump, character Donald Trump in Anthony Davis's opera The Central Park Five?",
        "label": 0,
    },
    {
        "bert_qry": "Is 'Donald Trump' in the context of: 'It's not the first scare for Donald Trump. In early March it was announced that Brazilian President Jair Bolsonaro tested positive for coronavirus...', referring to [SEP] 2016 United States presidential election, 58th quadrennial U.S. presidential election?",
        "label": 0,
    }
]
```

Project repository: https://github.com/Jordi-Ab/BERT_NED