Spanish News Classification Headlines
SNCH: this model was developed by M47Labs for text classification of Spanish news headlines. The base model is BETO, fine-tuned on a dataset of 1,000 examples.
Dataset Sample
Dataset size: 1000
Columns: idTask, task content 1, idTag, tag.
idTask | task content 1 | idTag | tag |
---|---|---|---|
3637d9ac-119c-4a8f-899c-339cf5b42ae0 | Alcalá de Guadaíra celebra la IV Semana de la Diversidad Sexual con acciones de sensibilización | 81b36360-6cbf-4ffa-b558-9ef95c136714 | sociedad |
d56bab52-0029-45dd-ad90-5c17d4ed4c88 | El Archipiélago Chinijo Graciplus se impone en el Trofeo Centro Comercial Rubicón | ed198b6d-a5b9-4557-91ff-c0be51707dec | deportes |
dec70bc5-4932-4fa2-aeac-31a52377be02 | Un total de 39 personas padecen ELA actualmente en la provincia | 81b36360-6cbf-4ffa-b558-9ef95c136714 | sociedad |
fb396ba9-fbf1-4495-84d9-5314eb731405 | Eurocopa 2021 : Italia vence a Gales y pasa a octavos con su candidatura reforzada | ed198b6d-a5b9-4557-91ff-c0be51707dec | deportes |
bc5a36ca-4e0a-422e-9167-766b41008c01 | Resolución de 10 de junio de 2021, del Ayuntamiento de Tarazona de La Mancha (Albacete), referente a la convocatoria para proveer una plaza. | 81b36360-6cbf-4ffa-b558-9ef95c136714 | sociedad |
a87f8703-ce34-47a5-9c1b-e992c7fe60f6 | El primer ministro sueco pierde una moción de censura | 209ae89e-55b4-41fd-aac0-5400feab479e | politica |
d80bdaad-0ad5-43a0-850e-c473fd612526 | El dólar se dispara tras la reunión de la Fed | 11925830-148e-4890-a2bc-da9dc059dc17 | economia |
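As a quick sanity check on the sample above, the following sketch counts the tags and verifies that each `idTag` value corresponds to exactly one `tag`. Only the `(idTag, tag)` pairs of the seven rows shown are used; the variable names are illustrative and not part of the original dataset tooling.

```python
from collections import Counter

# (idTag, tag) pairs copied from the seven sample rows above.
sample_pairs = [
    ("81b36360-6cbf-4ffa-b558-9ef95c136714", "sociedad"),
    ("ed198b6d-a5b9-4557-91ff-c0be51707dec", "deportes"),
    ("81b36360-6cbf-4ffa-b558-9ef95c136714", "sociedad"),
    ("ed198b6d-a5b9-4557-91ff-c0be51707dec", "deportes"),
    ("81b36360-6cbf-4ffa-b558-9ef95c136714", "sociedad"),
    ("209ae89e-55b4-41fd-aac0-5400feab479e", "politica"),
    ("11925830-148e-4890-a2bc-da9dc059dc17", "economia"),
]

# Count how often each tag appears in the sample.
tag_counts = Counter(tag for _, tag in sample_pairs)

# Collect the set of tags seen for each idTag; a single tag per id
# means the two columns encode the same label.
tags_by_id = {}
for id_tag, tag in sample_pairs:
    tags_by_id.setdefault(id_tag, set()).add(tag)
one_to_one = all(len(tags) == 1 for tags in tags_by_id.values())

print(tag_counts.most_common())
print(one_to_one)
```

In this sample, `idTag` and `tag` carry the same information, so either column could serve as the training target. Whether that holds for the full dataset is not stated here.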
Labels:
ciencia_tecnologia
clickbait
cultura
deportes
economia
educacion
medio_ambiente
opinion
politica
sociedad
Example of Use
Pipeline
from transformers import AutoTokenizer, BertForSequenceClassification, TextClassificationPipeline

review_text = 'los vehiculos que esten esperando pasajaeros deberan estar apagados para reducir emisiones'
path = "M47Labs/spanish_news_classification_headlines"

# Load the fine-tuned tokenizer and model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained(path)
model = BertForSequenceClassification.from_pretrained(path)

# Wrap both in a text-classification pipeline.
nlp = TextClassificationPipeline(task="text-classification",
                                 model=model,
                                 tokenizer=tokenizer)
print(nlp(review_text))
[{'label': 'medio_ambiente', 'score': 0.5648820996284485}]
PyTorch
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = 'M47Labs/spanish_news_classification_headlines'
MAX_LEN = 32

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

texto = "las emisiones estan bajando, debido a las medidas ambientales tomadas por el gobierno"

# Tokenize, padding or truncating to the MAX_LEN used during fine-tuning.
encoded_review = tokenizer(
    texto,
    max_length=MAX_LEN,
    add_special_tokens=True,
    padding='max_length',
    truncation=True,
    return_attention_mask=True,
    return_tensors='pt',
)
input_ids = encoded_review['input_ids']
attention_mask = encoded_review['attention_mask']

# Run inference without tracking gradients.
with torch.no_grad():
    output = model(input_ids=input_ids, attention_mask=attention_mask)

# The predicted class is the index of the largest logit.
prediction = torch.argmax(output['logits'], dim=1)
print(f'Review text: {texto}')
print(f'Predicted label: {model.config.id2label[prediction.item()]}')
Review text: las emisiones estan bajando, debido a las medidas ambientales tomadas por el gobierno
Predicted label: medio_ambiente
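The PyTorch example only picks the label with the largest logit, whereas the pipeline example also reports a `score`. For single-label classification that score is, to my understanding, the softmax probability of the winning class. The sketch below shows the conversion in plain Python. The three labels and logit values are hypothetical placeholders, not real model outputs; in practice the logits would come from `output['logits']`.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# HYPOTHETICAL values for illustration only (not produced by the model).
labels = ["medio_ambiente", "politica", "sociedad"]
logits = [2.0, 1.0, 0.5]

probs = softmax(logits)

# Index of the most probable label, equivalent to an argmax over the logits.
best = max(range(len(probs)), key=probs.__getitem__)
print(labels[best], round(probs[best], 4))
```

Because softmax preserves the ordering of the logits, the predicted label is the same with or without it. Only the reported confidence changes.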
A more in-depth example of how to use the model can be found in this Colab notebook: https://colab.research.google.com/drive/1XsKea6oMyEckye2FePW_XN7Rf8v41Cw_?usp=sharing
Fine-tuning Hyperparameters
- MAX_LEN = 32
- TRAIN_BATCH_SIZE = 8
- VALID_BATCH_SIZE = 4
- EPOCHS = 5
- LEARNING_RATE = 1e-05
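To give a feel for how small these training runs are, the sketch below derives the number of optimizer steps from the hyperparameters above. The 80/20 train/validation split and the choice to keep the last partial batch are assumptions made for illustration; the card does not state how the data was split.

```python
import math

# Hyperparameters as listed above.
MAX_LEN = 32
TRAIN_BATCH_SIZE = 8
VALID_BATCH_SIZE = 4
EPOCHS = 5
LEARNING_RATE = 1e-05


def training_steps(n_train, batch_size=TRAIN_BATCH_SIZE, epochs=EPOCHS):
    """Return (steps per epoch, total steps), keeping the last partial batch."""
    steps_per_epoch = math.ceil(n_train / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


# ASSUMPTION: 80% of each dataset is used for training (not stated in the card).
schedule = {}
for n_examples in (100, 500, 1000):
    n_train = n_examples * 8 // 10
    schedule[n_examples] = training_steps(n_train)
    print(n_examples, n_train, *schedule[n_examples])
```

Under these assumptions even the largest run takes only a few hundred updates, which is worth keeping in mind when reading the results below.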
Train Results
n_example | epoch | loss | acc (%) |
---|---|---|---|
100 | 0 | 2.286327266693115 | 12.5 |
100 | 1 | 2.018876111507416 | 40.0 |
100 | 2 | 1.8016730904579163 | 43.75 |
100 | 3 | 1.6121837735176086 | 46.25 |
100 | 4 | 1.41565443277359 | 68.75 |
n_example | epoch | loss | acc (%) |
---|---|---|---|
500 | 0 | 2.0770938420295715 | 24.5 |
500 | 1 | 1.6953029704093934 | 50.25 |
500 | 2 | 1.258900796175003 | 64.25 |
500 | 3 | 0.8342628020048142 | 78.25 |
500 | 4 | 0.5135736921429634 | 90.25 |
n_example | epoch | loss | acc (%) |
---|---|---|---|
1000 | 0 | 1.916002897115854 | 36.1997226074896 |
1000 | 1 | 1.2941598492664295 | 62.2746185852982 |
1000 | 2 | 0.8201534710415117 | 76.97642163661581 |
1000 | 3 | 0.524806430051615 | 86.9625520110957 |
1000 | 4 | 0.30662027455784463 | 92.64909847434119 |
Validation Results
n_examples | 100 |
---|---|
Accuracy Score | 0.35 |
Precision (Macro) | 0.35 |
Recall (Macro) | 0.16 |
n_examples | 500 |
---|---|
Accuracy Score | 0.62 |
Precision (Macro) | 0.60 |
Recall (Macro) | 0.47 |
n_examples | 1000 |
---|---|
Accuracy Score | 0.68 |
Precision (Macro) | 0.68 |
Recall (Macro) | 0.64 |
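Macro-averaged metrics weight every label equally, which is why macro recall can sit well below accuracy when rare labels are missed (for example 0.16 versus 0.35 with 100 examples). The sketch below illustrates this on hypothetical toy data. It is not the evaluation code used for the tables above, and it treats an undefined per-class precision or recall as 0, which is one common convention.

```python
def macro_precision_recall(y_true, y_pred):
    """Average per-label precision and recall, weighting each label equally."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = sum(t == label for t in y_true)
        # Undefined ratios (no predictions or no true examples) count as 0.
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual if actual else 0.0)
    return sum(precisions) / len(labels), sum(recalls) / len(labels)


# HYPOTHETICAL toy data: a classifier that always predicts the majority label.
y_true = ["sociedad", "sociedad", "sociedad", "sociedad", "deportes", "politica"]
y_pred = ["sociedad"] * 6

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall = macro_precision_recall(y_true, y_pred)
print(round(accuracy, 2), round(precision, 2), round(recall, 2))
```

Here the classifier gets two thirds of the headlines right, yet its macro recall is only one third, because two of the three labels are never recovered.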