## Pretrained Model: BERT base model (cased)
BERT base model (cased) is a model pretrained on English-language text with a masked language modeling (MLM) objective. It was introduced in this [paper](https://arxiv.org/abs/1810.04805) and first released in this [repository](https://github.com/google-research/bert). The model is case-sensitive: it makes a difference between english and English.
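
In practice, "cased" means the tokenizer does not lower-case its input, so differently capitalized spellings map to different token ids. A minimal sketch, assuming the public `bert-base-cased` checkpoint as the base model:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

# The cased tokenizer preserves capitalization, so these two inputs
# produce different token id sequences.
print(tokenizer("English")["input_ids"])
print(tokenizer("english")["input_ids"])
```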



## Pretrained Model Description
BERT is an auto-encoding (encoder-only) Transformer model pretrained on a large corpus of English data (English Wikipedia + BookCorpus) in a self-supervised fashion. This means the targets are computed from the inputs themselves, so no human labeling is needed. It was pretrained with two objectives:

- Masked language modeling (MLM): a fraction of the input tokens is masked and the model is trained to predict the masked tokens (see the sketch after this list)
- Next sentence prediction (NSP): the model is trained to predict whether two concatenated sentences followed each other in the original text
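
As a quick illustration of the MLM objective (a hedged sketch using the public `bert-base-cased` checkpoint, not this fine-tuned model), a fill-mask pipeline predicts the token hidden behind `[MASK]`:
```python
from transformers import pipeline

# Masked language modeling: the model predicts the token behind [MASK]
fill_mask = pipeline('fill-mask', model='bert-base-cased')
fill_mask("Paris is the [MASK] of France.")
# Returns the top candidate tokens with their scores (e.g. 'capital')
```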

## Fine-tuned Model Description: BERT fine-tuned on CoLA
The pretrained model can be fine-tuned on downstream NLP tasks. Here, BERT has been fine-tuned on the CoLA (Corpus of Linguistic Acceptability) dataset from the GLUE benchmark, an academic benchmark that aims to measure the performance of language understanding models. CoLA is one of the nine tasks in the GLUE benchmark.

By fine-tuning BERT on the CoLA dataset, the model learns to classify a given sentence as grammatically acceptable or not acceptable.
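
The exact training recipe for this checkpoint is not documented here, so the following is only a minimal sketch of how such a fine-tuning run could look with the `datasets` library and Keras (the learning rate, batch size, and number of epochs are illustrative assumptions, and `prepare_tf_dataset` requires a recent version of `transformers`):
```python
from datasets import load_dataset
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import tensorflow as tf

# CoLA from the GLUE benchmark: sentences labeled 0 (unacceptable) / 1 (acceptable)
raw = load_dataset("glue", "cola")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True)

tokenized = raw.map(tokenize, batched=True)

# Two-class sequence classification head on top of the pretrained encoder
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

train_set = model.prepare_tf_dataset(tokenized["train"], batch_size=16, shuffle=True, tokenizer=tokenizer)
val_set = model.prepare_tf_dataset(tokenized["validation"], batch_size=16, shuffle=False, tokenizer=tokenizer)

# Illustrative hyperparameters, not the ones used for this checkpoint
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(train_set, validation_data=val_set, epochs=3)
```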

## How to use
###### Directly with a pipeline for a text-classification NLP task
```python
from transformers import pipeline

# The pipeline bundles tokenization, inference, and postprocessing in one call
cola = pipeline('text-classification', model='Abirate/bert_fine_tuned_cola')
cola("Tunisia is a beautiful country")
# [{'label': 'acceptable', 'score': 0.989352285861969}]
```
###### Breaking down all the steps (Tokenization, Modeling, Postprocessing)
```python
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import tensorflow as tf
import numpy as np

# Tokenization
tokenizer = AutoTokenizer.from_pretrained('Abirate/bert_fine_tuned_cola')
model = TFAutoModelForSequenceClassification.from_pretrained('Abirate/bert_fine_tuned_cola')
text = "Tunisia is a beautiful country."
encoded_input = tokenizer(text, return_tensors='tf')

# Modeling: the forward pass returns the raw logits
output = model(encoded_input)

# Postprocessing: softmax over the logits, then pick the most probable class
probas_output = tf.math.softmax(tf.squeeze(output['logits']), axis=-1)
class_preds = int(np.argmax(probas_output, axis=-1))

# Map the predicted class index to its label: 'acceptable' or 'not acceptable'
model.config.id2label[class_preds]
# 'acceptable'
```