sinhala-nlp/NSINA-Category-xlmr-base


	---
	license: cc-by-sa-4.0
	datasets:
	- sinhala-nlp/NSINA
	- sinhala-nlp/NSINA-Categories
	language:
	- si
	---

	# Sinhala News Category Prediction
	This is a text classification task created with the [NSINA dataset](https://github.com/Sinhala-NLP/NSINA). The dataset is released under the same license as NSINA. Given the content of a news article, a model should predict its pre-defined category.


	## Data
	First, we dropped all news articles in NSINA 1.0 that have no category, since some news sources prefer not to categorise their articles. Next, we identified the top 100 news categories among the remaining news instances. We grouped the possible categories into four main categories: local news, international news, sports news, and business news. To avoid bias, we undersampled the dataset: we used only 10,000 instances from each category, except for the "Business" category, where we kept all 8,777 available articles. We divided this dataset into training and test sets following a 0.8 split.
	The data can be loaded into pandas DataFrames using the following code.

	```python
	from datasets import load_dataset

	# Load each split from the Hugging Face Hub and convert it to a pandas DataFrame
	train = load_dataset('sinhala-nlp/NSINA-Categories', split='train').to_pandas()
	test = load_dataset('sinhala-nlp/NSINA-Categories', split='test').to_pandas()
	```
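
The undersampling and 0.8 split described above can be illustrated with a minimal sketch on toy data. This is not the authors' preprocessing script: the helper names `undersample` and `split_frame`, the column names `text` and `category`, the random seed, and the choice of a simple shuffled (non-stratified) split are all assumptions made for illustration.

```python
import pandas as pd


def undersample(df, label_col, cap, seed=42):
    """Keep at most `cap` rows per label; smaller classes are kept whole."""
    parts = []
    for _, group in df.groupby(label_col):
        n = min(len(group), cap)
        parts.append(group.sample(n=n, random_state=seed))
    return pd.concat(parts).reset_index(drop=True)


def split_frame(df, train_frac=0.8, seed=42):
    """Shuffle the rows, then cut them into train and test parts."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    cut = int(len(shuffled) * train_frac)
    return shuffled.iloc[:cut], shuffled.iloc[cut:]


# Toy stand-in for the real data (hypothetical column names and sizes):
# two classes exceed the cap of 10, one class is smaller and is kept whole.
toy = pd.DataFrame({
    "text": [f"article {i}" for i in range(35)],
    "category": ["local"] * 15 + ["international"] * 12 + ["business"] * 8,
})

balanced = undersample(toy, "category", cap=10)
train_df, test_df = split_frame(balanced, train_frac=0.8)

print(balanced["category"].value_counts().to_dict())
print(len(train_df), len(test_df))
```

With the real data the cap would be 10,000, so the "Business" category, with 8,777 articles, would be kept whole in the same way as the smallest class here.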



	## Citation
	If you are using the dataset or the models, please cite the following paper.

	~~~
	@inproceedings{Nsina2024,
	author={Hettiarachchi, Hansi and Premasiri, Damith and Uyangodage, Lasitha and Ranasinghe, Tharindu},
	title={{NSINA: A News Corpus for Sinhala}},
	booktitle={The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
	year={2024},
	month={May},
	}
	~~~