---
license: cc-by-nc-4.0
language:
- ar
- cs
- de
- el
- en
- fr
- hi
- he
- it
- id
- ja
- ko
- nl
- fa
- pl
- pt
- ro
- ru
- es
- tr
- uk
- vi
- zh
---

# Model Checkpoints for Multilingual Machine-Generated Text Portion Detection

## Model Details

### Model Description

- **Developed by:** 1-800-SHARED-TASKS
- **Funded by:** Cohere's Research Compute Grant (July 2024)
- **Model type:** Transformer-based token classifier for multilingual machine-generated text portion detection
- **Languages (NLP):** 23 languages (expanding to 102)
- **License:** Non-commercial (CC BY-NC 4.0); derivatives must remain non-commercial with proper attribution

### Model Sources

- **Code Repository:** [Github Placeholder]
- **Paper:** [ACL Anthology Placeholder]
- **Presentation:** [Multi-lingual Machine-Generated Text Portion(s) Detection](https://static1.squarespace.com/static/659ac5de66fdf20e1d607f2e/t/66d977a49597da76b6c260a1/1725527974250/MMGTD-Cohere.pdf)

## Uses

This model (and its accompanying dataset) is suitable for machine-generated text portion detection, token classification, and related linguistic tasks. The methods applied here aim to improve the accuracy of detecting which portions of a text are machine-generated, particularly in multilingual settings, and should benefit research and development in areas such as AI-generated text moderation, natural language processing, and the study of how AI is integrated into content generation. A minimal inference sketch is provided after the results table below.

## Training Details

The model was trained on approximately 330k text samples generated with two LLMs: Command-R-Plus (100k) and Aya-23-35B (230k). The dataset includes 10k samples per language for each LLM, distributed as 10% fully human-written texts, 10% entirely machine-generated texts, and 80% mixed cases; a sketch of the corresponding word-level labeling follows the results table.

## Evaluation

### Testing Data, Factors & Metrics

The model was evaluated on a multilingual test set covering all 23 languages. The reported metrics are Accuracy, Precision, Recall, and F1 Score, computed at the word level (at the character level for Japanese and Chinese); a sketch of this computation is given after the results table.

### Results

Word-level metrics for each language are listed below; rows marked ** (Japanese, JPN; Chinese, ZHO) report character-level metrics:
Language | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|
ARA | 0.923 | 0.832 | 0.992 | 0.905 |
CES | 0.884 | 0.869 | 0.975 | 0.919 |
DEU | 0.917 | 0.895 | 0.983 | 0.937 |
ELL | 0.929 | 0.905 | 0.984 | 0.943 |
ENG | 0.917 | 0.818 | 0.986 | 0.894 |
FRA | 0.927 | 0.929 | 0.966 | 0.947 |
HEB | 0.963 | 0.961 | 0.988 | 0.974 |
HIN | 0.890 | 0.736 | 0.975 | 0.839 |
IND | 0.861 | 0.794 | 0.988 | 0.881 |
ITA | 0.941 | 0.906 | 0.989 | 0.946 |
JPN** | 0.832 | 0.747 | 0.965 | 0.842 |
KOR | 0.937 | 0.918 | 0.992 | 0.954 |
NLD | 0.916 | 0.872 | 0.985 | 0.925 |
PES | 0.822 | 0.668 | 0.972 | 0.792 |
POL | 0.903 | 0.884 | 0.986 | 0.932 |
POR | 0.805 | 0.679 | 0.987 | 0.804 |
RON | 0.931 | 0.924 | 0.985 | 0.953 |
RUS | 0.885 | 0.818 | 0.971 | 0.888 |
SPA | 0.888 | 0.809 | 0.990 | 0.890 |
TUR | 0.849 | 0.735 | 0.981 | 0.840 |
UKR | 0.768 | 0.637 | 0.987 | 0.774 |
VIE | 0.866 | 0.757 | 0.975 | 0.853 |
ZHO** | 0.803 | 0.698 | 0.970 | 0.814 |
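
### How to Use

A minimal inference sketch, assuming the checkpoint loads as a standard Hugging Face token-classification model with per-token human/machine labels; the repository id is a placeholder, and a fast tokenizer is assumed for offset mapping:

```python
# Minimal sketch: run the detector over one text and print a label per span.
# Assumptions: the checkpoint has a token-classification head, its id2label
# distinguishes human vs. machine tokens, and the tokenizer is a fast tokenizer
# (required for return_offsets_mapping). The repo id below is a placeholder.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_ID = "1-800-SHARED-TASKS/placeholder-checkpoint"  # hypothetical repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(MODEL_ID)
model.eval()

text = "A human wrote this sentence. This continuation may be machine-generated."
enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True, truncation=True)
offsets = enc.pop("offset_mapping")[0].tolist()  # model forward() does not accept offsets

with torch.no_grad():
    logits = model(**enc).logits  # shape: (1, seq_len, num_labels)
preds = logits.argmax(dim=-1)[0].tolist()

# Map each subword back to its character span and print its predicted label.
for (start, end), label_id in zip(offsets, preds):
    if start == end:  # special tokens have empty spans
        continue
    print(f"{text[start:end]!r} -> {model.config.id2label.get(label_id, str(label_id))}")
```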
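For the 80% mixed cases, detection reduces to labeling each word of a sample as human-written or machine-generated. A toy sketch of what such a sample and its labels could look like; the example texts and the simple prefix/continuation construction are invented for illustration and are not the actual data pipeline:

```python
# Illustrative only: a "mixed" sample as a human prefix plus an LLM continuation,
# with one binary label per word (0 = human-written, 1 = machine-generated).
human_prefix = "The city council met on Tuesday to discuss the budget."
machine_continuation = "Members unanimously praised the proposal's visionary scope."

words = human_prefix.split() + machine_continuation.split()
labels = [0] * len(human_prefix.split()) + [1] * len(machine_continuation.split())
assert len(words) == len(labels)  # one label per word, as in word-level evaluation
```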
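A sketch of how the reported word-level metrics can be computed, treating "machine-generated" as the positive class; the segmentation helper and the gold/predicted labels are illustrative assumptions, not the official scoring script:

```python
# Word-level evaluation sketch (character-level for Japanese/Chinese).
# "machine-generated" (label 1) is the positive class for Precision/Recall/F1.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def units(text: str, lang: str) -> list[str]:
    # Word level for most languages; character level for JPN and ZHO,
    # which lack whitespace word boundaries.
    return list(text) if lang in {"jpn", "zho"} else text.split()

# Hypothetical gold and predicted labels, one per word.
gold = [0, 0, 1, 1, 1, 0]
pred = [0, 1, 1, 1, 1, 0]

acc = accuracy_score(gold, pred)
prec, rec, f1, _ = precision_recall_fscore_support(gold, pred, average="binary")
print(f"Accuracy={acc:.3f} Precision={prec:.3f} Recall={rec:.3f} F1={f1:.3f}")
```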