tharindu committed on
Commit
c47fd80
1 Parent(s): b4d7be0

Create README.md

  1. README.md +37 -0
README.md ADDED
@@ -0,0 +1,37 @@
+ ---
+ license: cc-by-sa-4.0
+ datasets:
+ - sinhala-nlp/NSINA
+ - sinhala-nlp/NSINA-Media
+ language:
+ - si
+ ---
+
+ # Sinhala News Media Identification
+ This is a text classification task built on the [NSINA dataset](https://github.com/Sinhala-NLP/NSINA). The dataset is released under the same license as NSINA. The task is to identify the news media source of an article from its content.
+
+
+ ## Data
+ We used only 10,000 instances from each news source in NSINA 1.0. For the two sources with fewer than 10,000 instances ("Dinamina" and "Siyatha"), we used all the instances they contain. We divided this dataset into training and test sets with a 0.8 split.
+ The data can be loaded into pandas DataFrames with the following code.
+
+ ```python
+ from datasets import load_dataset
+
+ # Load each split from the Hugging Face Hub and convert it to a pandas DataFrame.
+ train = load_dataset('sinhala-nlp/NSINA-Media', split='train').to_pandas()
+ test = load_dataset('sinhala-nlp/NSINA-Media', split='test').to_pandas()
+ ```
+
+ ## Citation
+ If you use the dataset or the models, please cite the following paper.
+
+ ~~~
+ @inproceedings{Nsina2024,
+ author={Hettiarachchi, Hansi and Premasiri, Damith and Uyangodage, Lasitha and Ranasinghe, Tharindu},
+ title={{NSINA: A News Corpus for Sinhala}},
+ booktitle={The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
+ year={2024},
+ month={May},
+ }
+ ~~~
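
The Data section of the README added in this commit describes capping each news source at 10,000 instances and then applying a 0.8 split. A minimal sketch of that kind of procedure, in plain Python, might look like the following. The function name `cap_and_split`, the random per-source sampling, the non-stratified shuffle, the fixed seed, and the toy source names are all assumptions made for illustration; the README does not say how instances were selected or how the split was drawn.

```python
import random
from collections import defaultdict


def cap_and_split(rows, cap=10_000, train_fraction=0.8, seed=0):
    """Cap each source at `cap` rows, then split the pooled rows into train/test.

    `rows` is a list of (source, text) pairs. Random sampling and a random,
    non-stratified split are assumptions; the README specifies neither.
    """
    rng = random.Random(seed)

    # Group rows by news source.
    by_source = defaultdict(list)
    for source, text in rows:
        by_source[source].append((source, text))

    # Keep at most `cap` rows per source; smaller sources are kept whole.
    pooled = []
    for source in sorted(by_source):
        items = by_source[source]
        if len(items) > cap:
            items = rng.sample(items, cap)
        pooled.extend(items)

    # Shuffle the pooled rows, then cut at the train fraction.
    rng.shuffle(pooled)
    cut = int(len(pooled) * train_fraction)
    return pooled[:cut], pooled[cut:]


# Toy data with hypothetical source names and a small cap for illustration.
toy = [("A", f"a{i}") for i in range(50)] + [("B", f"b{i}") for i in range(8)]
train, test = cap_and_split(toy, cap=20, train_fraction=0.8)
print(len(train), len(test))
```

In this toy run, source "A" is reduced to 20 rows while source "B" keeps all 8, mirroring how the README treats sources below the cap. A stratified split per source would be an equally plausible reading of the description.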