---
language:
- en
pipeline_tag: text-classification
widget:
- text: "Is 'AMLO' in the context of: 'Andres Manuel Lopez Obrador defends Mexico's military after report on missing students case', referring to [SEP] 'Andrés Manuel López Obrador, President of Mexico since 2018'?"
- text: "Is 'AMLO' in the context of: 'Andres Manuel Lopez Obrador defends Mexico's military after report on missing students case', referring to [SEP] 'Andrés Manuel López Obrador contenderá por la presidencia de México en 2012, Wikinews article'?"
---

# Named Entity Disambiguation with a BERT model

This is a very task-specific model. A BERT model was trained to disambiguate WikiData options for entities mentioned in news articles. 
Given an entity mentioned in a news article, the context in which it is mentioned, and several candidate WikiData entities, the model answers the question: which of the candidates is the context referring to?

Suppose a news article mentions Donald Trump in the following sentence:
```python
entity = "Donald Trump"
sentence_mention = "It's not the first scare for Donald Trump. In early March it was announced that Brazilian President Jair Bolsonaro, gave positive to coronavirus, after participating in a meeting in Florida where the US president was..."
```
A WikiData search for a specific entity may return several options. For example, for Donald Trump:
```python
options = [
    'Donald Trump, president of the United States from 2017 to 2021',
    'Donald Trump, American physician',
    'Donald Trump, Wikimedia disambiguation page',
    'Donald Trump, song by Mac Miller',
    'Donald Trump, segment of an episode of Last Week Tonight',
    "Donald Trump, character Donald Trump in Anthony Davis's opera The Central Park Five",
    '2016 United States presidential election, 58th quadrennial U.S. presidential election'
]
```

This model is trained to give a score to the following query string:

`f"Is '{entity}' in the context of: '{sentence_mention}', referring to [SEP] {option}?"`

**Make sure the query string doesn't exceed 512 tokens.** If it does, reduce the context around the entity mention to avoid truncation of the query string.
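One possible way to shorten the context is to drop words from the ends of the sentence while keeping the entity mention itself. The sketch below is illustrative only: `build_query`, `fit_context` and `count_tokens` are hypothetical names that are not part of this model's code, the toy counter treats whitespace-separated words as tokens, and the mention is assumed to start at a word boundary.

```python
from typing import Callable

QUERY_TEMPLATE = "Is '{entity}' in the context of: '{context}', referring to [SEP] {option}?"


def build_query(entity: str, context: str, option: str) -> str:
    """Fill the query template expected by the model."""
    return QUERY_TEMPLATE.format(entity=entity, context=context, option=option)


def fit_context(entity: str, context: str, option: str,
                count_tokens: Callable[[str], int], max_tokens: int = 512) -> str:
    """Hypothetical helper: drop context words until the whole query fits."""
    words = context.split()
    pos = context.find(entity)
    if pos >= 0:
        # Word span [start, end) covered by the mention (assumes a word boundary).
        start = len(context[:pos].split())
        end = min(start + len(entity.split()), len(words))
    else:
        # Mention not found: nothing to protect, so words are dropped from the end.
        start = end = 0
    lo, hi = 0, len(words)
    while (lo < start or hi > end) and count_tokens(
        build_query(entity, " ".join(words[lo:hi]), option)
    ) > max_tokens:
        # Trim the side that extends further away from the mention.
        if start - lo > hi - end:
            lo += 1
        else:
            hi -= 1
    return " ".join(words[lo:hi])


# Toy token counter (assumption: whitespace words stand in for BERT tokens).
count_words = lambda text: len(text.split())
shortened = fit_context(
    "Donald Trump", "a b c Donald Trump d e f g h", "x", count_words, max_tokens=15
)
print(shortened)
```

With the real tokenizer, `count_tokens` could be something like `lambda text: len(tokenizer.encode(text))`, which also counts the special tokens added around the query.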

To disambiguate, compute the score of every option for an entity mentioned in a given context, then pick the option with the maximum score.

```python
qry_strings = [
  f"Is '{entity}' in the context of: '{sentence_mention}', referring to [SEP] {option}?"
  for option in options
]
```

Option scores for the above example:
- Donald Trump, president of the United States from 2017 to 2021:                         0.9990746974945068
- Donald Trump, American physician:                                                       0.00032277879654429853
- Donald Trump, Wikimedia disambiguation page:                                            0.00044132230686955154
- Donald Trump, song by Mac Miller:                                                       0.0003152454155497253
- Donald Trump, segment of an episode of Last Week Tonight:                               0.00031540714553557336
- Donald Trump, character Donald Trump in Anthony Davis's opera The Central Park Five:    0.00030414783395826817
- 2016 United States presidential election, 58th quadrennial U.S. presidential election:  0.0005287989042699337
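
Given one score per option, selecting the answer is an argmax over the scores. A minimal sketch in plain Python, using the scores listed above; `pick_best_option` is a hypothetical helper name, not something provided by the model:

```python
def pick_best_option(options, scores):
    """Return the (option, score) pair with the highest score."""
    if not options or len(options) != len(scores):
        raise ValueError("options and scores must be non-empty and of equal length")
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return options[best_index], scores[best_index]


options = [
    'Donald Trump, president of the United States from 2017 to 2021',
    'Donald Trump, American physician',
    'Donald Trump, Wikimedia disambiguation page',
    'Donald Trump, song by Mac Miller',
    'Donald Trump, segment of an episode of Last Week Tonight',
    "Donald Trump, character Donald Trump in Anthony Davis's opera The Central Park Five",
    '2016 United States presidential election, 58th quadrennial U.S. presidential election',
]
# Scores copied from the list above, in the same order as the options.
scores = [
    0.9990746974945068,
    0.00032277879654429853,
    0.00044132230686955154,
    0.0003152454155497253,
    0.00031540714553557336,
    0.00030414783395826817,
    0.0005287989042699337,
]

best_option, best_score = pick_best_option(options, scores)
print(best_option)
```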

## Using the Model

To compute the score of a single query string:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

entity = "Donald Trump"
sentence_mention = "It's not the first scare for Donald Trump. In early March it was announced that Brazilian President Jair Bolsonaro, gave positive to coronavirus, after participating in a meeting in Florida where the US president was..."

options = [
    'Donald Trump, president of the United States from 2017 to 2021',
    'Donald Trump, American physician',
    'Donald Trump, Wikimedia disambiguation page',
    'Donald Trump, song by Mac Miller',
    'Donald Trump, segment of an episode of Last Week Tonight',
    "Donald Trump, character Donald Trump in Anthony Davis's opera The Central Park Five",
    '2016 United States presidential election, 58th quadrennial U.S. presidential election'
]
option = options[0]

# predictions will be made on the gpu if there is a gpu available
device = torch.device(
  "cuda" if torch.cuda.is_available() else "cpu"
)
# load the BERT NED model
model = BertForSequenceClassification.from_pretrained(
  'JordiAb/BERT_NED'
).eval().to(device)
# load the BERT NED tokenizer
tokenizer = BertTokenizer.from_pretrained(
  'JordiAb/BERT_NED'
)

# build the query string required by our BERT model. Namely:
query = f"Is '{entity}' in the context of: '{sentence_mention}', referring to [SEP] {option}?"
                
# encode and tokenize the query string
encoded_dict = tokenizer.encode_plus(
  query,                           # Sentence to encode.
  add_special_tokens = True,       # Add '[CLS]' and '[SEP]'
  max_length = 512,                # Pad & truncate all sentences.
  padding='max_length',            # Make sure this applies padding as needed
  truncation=True,
  return_attention_mask = True,    # Construct attention masks.
  return_tensors = 'pt',           # Return pytorch tensors.
)

# move input ids to GPU (if available)
input_ids=encoded_dict['input_ids'].to(device)
# move attention mask to GPU (if available)
attention_mask=encoded_dict['attention_mask'].to(device)

with torch.no_grad(): # avoid gradient computation to save memory
  # forward pass of the model
  outputs = model(
    input_ids=input_ids, 
    token_type_ids=None, 
    attention_mask=attention_mask
  )
    
# get logits of prediction
logits = outputs.logits
# Use softmax to get probabilities
probabilities = torch.nn.functional.softmax(logits, dim=1)
# a single query was scored, so take row 0, move it to the CPU and convert it to a NumPy array
probabilities=probabilities[0].cpu().numpy()
```

`probabilities` is a NumPy array containing two probabilities, the probability of belonging to class 0 and the probability of belonging to class 1:
`np.array([prob0, prob1])`

In this case we are interested in the probability of class 1, since class 1 is the positive label, i.e. the YES answer to the query string `"Is '{entity}' in the context of: '{sentence_mention}', referring to [SEP] {option}?"`
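
To make the last step concrete, the sketch below applies a softmax to a pair of logits in plain Python and reads off the class 1 probability. The logit values are made up for illustration, and `softmax` and `positive_probability` are hypothetical helper names; in practice `torch.nn.functional.softmax` does this work, as in the code above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def positive_probability(logits):
    """Probability of class 1 (the YES answer) for a two-class head."""
    return softmax(logits)[1]


# Hypothetical logits for one query: [logit_class_0, logit_class_1].
example_logits = [-3.5, 3.5]
score = positive_probability(example_logits)
print(score)
```

For two classes this score is equivalent to a sigmoid of the logit difference, `1 / (1 + exp(logit0 - logit1))`.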

## About the dataset used for training

The dataset consists of news articles from a Mexican newspaper, processed with Named Entity Recognition (NER) to identify the entities in each article. Each identified entity was queried against WikiData to gather all of its potential matches. The `StableBeluga-7B` large language model (LLM) assisted in disambiguating selected entities from the dataset, and its outputs served as training labels.

This project approaches the task as a binary classification problem. The training data includes the entities from the articles, the sentences (context) in which each entity is mentioned, and all WikiData options. Each entity-context-option triplet was paired with a binary label (1/0) to form a single training observation, and the resulting dataset was used to fine-tune the model. To stay within the model's limits, inputs were truncated to a maximum of 512 tokens.

For the Donald Trump example above, the dataset would look like:

```python
[
  {
    "bert_qry": "Is 'Donald Trump' in the context of: 'It's not the first scare for Donald Trump. In early March it was announced that Brazilian President Jair Bolsonaro, gave positive to coronavirus...', referring to [SEP] Donald Trump, president of the United States from 2017 to 2021?",
    "label": 1,
  },
  {
    "bert_qry": "Is 'Donald Trump' in the context of: 'It's not the first scare for Donald Trump. In early March it was announced that Brazilian President Jair Bolsonaro, gave positive to coronavirus...', referring to [SEP] Donald Trump, American physician?",
    "label": 0,
  },
  {
    "bert_qry": "Is 'Donald Trump' in the context of: 'It's not the first scare for Donald Trump. In early March it was announced that Brazilian President Jair Bolsonaro, gave positive to coronavirus...', referring to [SEP] Donald Trump, Wikimedia disambiguation page?",
    "label": 0,
  },
  {
    "bert_qry": "Is 'Donald Trump' in the context of: 'It's not the first scare for Donald Trump. In early March it was announced that Brazilian President Jair Bolsonaro, gave positive to coronavirus...', referring to [SEP] Donald Trump, song by Mac Miller?",
    "label": 0,
  },
  {
    "bert_qry": "Is 'Donald Trump' in the context of: 'It's not the first scare for Donald Trump. In early March it was announced that Brazilian President Jair Bolsonaro, gave positive to coronavirus...', referring to [SEP] Donald Trump, segment of an episode of Last Week Tonight?",
    "label": 0,
  },
  {
    "bert_qry": "Is 'Donald Trump' in the context of: 'It's not the first scare for Donald Trump. In early March it was announced that Brazilian President Jair Bolsonaro, gave positive to coronavirus...', referring to [SEP] Donald Trump, character Donald Trump in Anthony Davis's opera The Central Park Five?",
    "label": 0,
  },
  {
    "bert_qry": "Is 'Donald Trump' in the context of: 'It's not the first scare for Donald Trump. In early March it was announced that Brazilian President Jair Bolsonaro, gave positive to coronavirus...', referring to [SEP] 2016 United States presidential election, 58th quadrennial U.S. presidential election?",
    "label": 0,
  }
]

```
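
Records of this shape could be assembled with a few lines of Python. The sketch below is an assumption about how one might do it, not the project's actual pipeline: `build_observations` is a hypothetical helper, and the correct option is passed in directly, whereas in the original dataset the labels came from the LLM's output.

```python
QUERY_TEMPLATE = "Is '{entity}' in the context of: '{context}', referring to [SEP] {option}?"


def build_observations(entity, context, options, correct_option):
    """Create one labeled record per entity-context-option triplet."""
    return [
        {
            "bert_qry": QUERY_TEMPLATE.format(entity=entity, context=context, option=option),
            "label": int(option == correct_option),  # 1 for the matching option, else 0
        }
        for option in options
    ]


entity = "Donald Trump"
context = "It's not the first scare for Donald Trump."
options = [
    "Donald Trump, president of the United States from 2017 to 2021",
    "Donald Trump, American physician",
    "Donald Trump, song by Mac Miller",
]

observations = build_observations(entity, context, options, correct_option=options[0])
for observation in observations:
    print(observation["label"], observation["bert_qry"])
```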

Repo of the project:

https://github.com/Jordi-Ab/BERT_NED