---
license: mit
datasets:
- ljvmiranda921/tlunified-ner
language:
- tl
metrics:
- f1
tags:
- gliner
pipeline_tag: token-classification
model-index:
- name: tl_gliner_small
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      type: tlunified-ner
      name: TLUnified-NER
      split: test
      revision: 3f7dab9d232414ec6204f8d6934b9a35f90a254f
    metrics:
    - type: f1
      value: 0.854
      name: F1
---

# GLiNER (medium) model finetuned on Tagalog data

This model was finetuned using the [GLiNER v2.5 suite](https://github.com/urchade/GLiNER) of models.
You can find and replicate the training pipeline on [GitHub](https://github.com/ljvmiranda921/calamanCy/tree/master/models/v0.1.0-gliner).

## Usage

```python
from gliner import GLiNER

# Load the finetuned Tagalog GLiNER model
model = GLiNER.from_pretrained("ljvmiranda921/tl_gliner_medium")

# Sample text for entity prediction
# Reference: Leni Robredo’s speech at the 2022 UP College of Law recognition rites
text = """
Nagsimula ako sa Public Attorney’s Office, kung saan araw-araw, mula Lunes hanggang Biyernes, nasa loob ako ng iba’t ibang court room at tambak ang kaso.
Bawat Sabado, nasa BJMP ako para ihanda ang aking mga kliyente. Nahasa ako sa crim law at litigation. Pero kinalaunan, lumipat ako sa isang NGO,
‘yung Sentro ng Alternatibong Lingap Panligal. Sa SALIGAN talaga ako nahubog bilang abugado: imbes na tinatanggap na lang ang mga batas na kailangang
sundin, nagtatanong din kung ito ba ay tunay na instrumento para makapagbigay ng katarungan sa ordinaryong Pilipino. Imbes na maghintay ng mga kliyente
sa de-aircon na opisina, dinadayo namin ang mga malalayong komunidad. Kadalasan, naka-tsinelas, naka-t-shirt at maong, hinahanap namin ang mga komunidad,
tinatawid ang mga bundok, palayan, at mga ilog para tumungo sa mga lugar kung saan hirap ang mga batayang sektor na makakuha ng access to justice.
Naaalala ko pa noong naging lead lawyer ako para sa isang proyekto: sa loob ng mahigit dalawang taon, bumibiyahe ako buwan-buwan papunta sa malayong
isla ng Masbate, nagpa-paralegal training sa mga batayang sektor doon, ipinapaliwanag, itinituturo, at sinasanay sila sa mga batas na nagbibigay-proteksyon
sa mga karapatan nila.
"""

# Labels for entity prediction
# Most GLiNER models work best when entity types are in lower case or title case
labels = ["person", "organization", "location"]

# Perform entity prediction
entities = model.predict_entities(text, labels, threshold=0.5)

# Display predicted entities and their labels
for entity in entities:
    print(entity["text"], "=>", entity["label"])

# Sample output:
# Public Attorney’s Office => organization
# BJMP => organization
# Sentro ng Alternatibong Lingap Panligal => organization
# Masbate => location
```
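
The predictions come back as a flat list, so it is often handy to group them by label before using them downstream. Below is a minimal sketch of one way to do that. It assumes each prediction is a dictionary with `"text"`, `"label"`, and `"score"` keys, as in the loop above. The helper `group_entities`, its `min_score` parameter, and the scores in `sample_entities` are hypothetical and only serve as a stand-in for real model output.

```python
from collections import defaultdict


def group_entities(entities, min_score=0.0):
    """Group entity texts by label, skipping low-score and duplicate spans.

    Hypothetical helper, not part of the GLiNER API.
    """
    grouped = defaultdict(list)
    for entity in entities:
        # Assumption: a missing "score" is treated as fully confident.
        if entity.get("score", 1.0) < min_score:
            continue
        texts = grouped[entity["label"]]
        if entity["text"] not in texts:
            texts.append(entity["text"])
    return dict(grouped)


# Hand-written stand-in for `model.predict_entities(...)` output.
# The scores are made up for illustration and are not real model scores.
sample_entities = [
    {"text": "Sentro ng Alternatibong Lingap Panligal", "label": "organization", "score": 0.93},
    {"text": "BJMP", "label": "organization", "score": 0.88},
    {"text": "SALIGAN", "label": "organization", "score": 0.41},
    {"text": "Masbate", "label": "location", "score": 0.97},
    {"text": "BJMP", "label": "organization", "score": 0.90},
]

for label, texts in group_entities(sample_entities, min_score=0.5).items():
    print(label, "=>", texts)
```

With real predictions, you would pass `entities` from `model.predict_entities(...)` in place of `sample_entities`. Raising `min_score` trades recall for precision in the same way as the `threshold` argument, but without rerunning the model.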

## Citation

Please cite the following papers when using these models:

```
@misc{zaratiana2023gliner,
    title={GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer},
    author={Urchade Zaratiana and Nadi Tomeh and Pierre Holat and Thierry Charnois},
    year={2023},
    eprint={2311.08526},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```

```
@inproceedings{miranda-2023-calamancy,
    title = "calaman{C}y: A {T}agalog Natural Language Processing Toolkit",
    author = "Miranda, Lester James",
    booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)",
    month = dec,
    year = "2023",
    address = "Singapore, Singapore",
    publisher = "Empirical Methods in Natural Language Processing",
    url = "https://aclanthology.org/2023.nlposs-1.1",
    pages = "1--7",
}
```

If you're using the NER dataset:

```
@inproceedings{miranda-2023-developing,
    title = "Developing a Named Entity Recognition Dataset for {T}agalog",
    author = "Miranda, Lester James",
    booktitle = "Proceedings of the First Workshop in South East Asian Language Processing",
    month = nov,
    year = "2023",
    address = "Nusa Dua, Bali, Indonesia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.sealp-1.2",
    doi = "10.18653/v1/2023.sealp-1.2",
    pages = "13--20",
}
```