|
--- |
|
license: mit |
|
datasets: |
|
- opus_infopankki
|
|
|
--- |
|
|
|
LlTRA stands for Language to Language Transformer. The project builds the Transformer model from the paper "Attention Is All You Need" from scratch in PyTorch and uses it for translation.
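
For orientation, here is a minimal sketch of the kind of encoder-decoder model this entails. It is an illustration only, not the repository's code: it leans on `torch.nn.Transformer` and learned positional embeddings rather than the from-scratch attention blocks and sinusoidal encodings of the paper, and the hyperparameters mirror the configuration shown further below.

```python
# Minimal sketch of an encoder-decoder translation Transformer (illustrative only;
# the project itself builds the attention/feed-forward blocks from scratch).
import math
import torch
import torch.nn as nn

class TranslationTransformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model=512, n_heads=8,
                 num_layers=6, d_ff=2048, dropout=0.1, max_len=100):
        super().__init__()
        self.d_model = d_model
        self.src_embed = nn.Embedding(src_vocab_size, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)  # learned positions, for brevity
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=n_heads,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            dim_feedforward=d_ff, dropout=dropout, batch_first=True,
        )
        self.projection = nn.Linear(d_model, tgt_vocab_size)

    def _embed(self, tokens, embedding):
        positions = torch.arange(tokens.size(1), device=tokens.device)
        return embedding(tokens) * math.sqrt(self.d_model) + self.pos_embed(positions)

    def forward(self, src, tgt):
        # Causal mask: each target position may only attend to earlier positions.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt.size(1)).to(tgt.device)
        out = self.transformer(self._embed(src, self.src_embed),
                               self._embed(tgt, self.tgt_embed),
                               tgt_mask=tgt_mask)
        return self.projection(out)  # (batch, tgt_len, tgt_vocab_size)
```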
|
|
|
--- |
|
|
|
Problem Statement: |
|
In the rapidly evolving landscape of natural language processing (NLP) and machine translation, there exists a persistent challenge in achieving accurate and contextually rich language-to-language transformations. Existing models often struggle with capturing nuanced semantic meanings, context preservation, and maintaining grammatical coherence across different languages. Additionally, the demand for efficient cross-lingual communication and content generation has underscored the need for a versatile language transformer model that can seamlessly navigate the intricacies of diverse linguistic structures. |
|
|
|
--- |
|
|
|
Goal: |
|
Develop a specialized language-to-language transformer model that accurately translates from the Arabic language to the English language, ensuring semantic fidelity, contextual awareness, cross-lingual adaptability, and the retention of grammar and style. The model should provide efficient training and inference processes to make it practical and accessible for a wide range of applications, ultimately contributing to the advancement of Arabic-to-English language translation capabilities. |
|
|
|
--- |
|
|
|
Dataset used: |
|
From Hugging Face: `opus_infopankki`.
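
A minimal loading sketch with the Hugging Face `datasets` library is shown below; the `"ar-en"` config name and the `translation` field layout follow the usual OPUS convention and should be checked against the dataset card.

```python
# Hedged sketch: load the Arabic-English pair of opus_infopankki with the datasets library.
from datasets import load_dataset

raw_dataset = load_dataset("opus_infopankki", "ar-en", split="train")
print(raw_dataset[0]["translation"])  # e.g. {'ar': '...', 'en': '...'}
```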
|
|
|
--- |
|
|
|
Configuration: |
|
These are the settings of the model. You can customize the source and target languages, the sequence length, the number of epochs, the batch size, and more.
|
|
|
```python |
|
def Get_configuration():
    # Training and model hyperparameters, plus file-naming settings.
    return {
        "batch_size": 8,
        "num_epochs": 30,
        "lr": 10**-4,                              # learning rate
        "sequence_length": 100,                    # maximum tokens per sentence
        "d_model": 512,                            # embedding / model dimension
        "datasource": 'opus_infopankki',
        "source_language": "ar",                   # Arabic
        "target_language": "en",                   # English
        "model_folder": "weights",                 # where checkpoints are stored
        "model_basename": "tmodel_",               # checkpoint filename prefix
        "preload": "latest",                       # resume from the latest checkpoint
        "tokenizer_file": "tokenizer_{0}.json",    # per-language tokenizer file
        "experiment_name": "runs/tmodel"           # TensorBoard log directory
    }
|
``` |
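
The snippet below shows one plausible way these keys could be combined into tokenizer and checkpoint paths. The helper names are illustrative, not necessarily the repository's exact functions.

```python
# Illustrative helpers (hypothetical names) showing how the configuration keys
# above might be turned into concrete file paths.
from pathlib import Path

def get_weights_file_path(config, epoch: str) -> str:
    # e.g. "opus_infopankki_weights/tmodel_09.pt"
    model_folder = f"{config['datasource']}_{config['model_folder']}"
    model_filename = f"{config['model_basename']}{epoch}.pt"
    return str(Path(".") / model_folder / model_filename)

def get_tokenizer_file_path(config, language: str) -> str:
    # "tokenizer_{0}.json" -> "tokenizer_ar.json" / "tokenizer_en.json"
    return config["tokenizer_file"].format(language)

config = Get_configuration()
print(get_weights_file_path(config, "09"))
print(get_tokenizer_file_path(config, config["source_language"]))
```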
|
|
|
--- |
|
|
|
Training: |
|
I uploaded the project to my Google Drive and connected it to Google Colab for training:
|
|
|
- Training time: 4 hours.
- Epochs: 20.
- Number of dataset rows: 2,934,399.
- Size of the dataset: 95 MB.
- Size of the auto-converted Parquet files: 153 MB.
- Arabic tokens: 29,999.
- English tokens: 15,697.
- Pre-trained model available in Colab.
- BLEU score from Arabic to English: 19.7 (see the evaluation sketch below).
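
The BLEU figure can be reproduced with any standard corpus-level BLEU implementation; the sketch below uses `sacrebleu` on placeholder outputs and is not the project's exact evaluation script.

```python
# Hedged sketch: corpus-level BLEU with sacrebleu on decoded model outputs.
# `model_translations` and `reference_translations` are placeholders for the
# decoded English hypotheses and the reference sentences of the test split.
import sacrebleu

model_translations = ["the weather is nice today", "where is the train station ?"]
reference_translations = [["the weather is nice today", "where is the train station ?"]]

bleu = sacrebleu.corpus_bleu(model_translations, reference_translations)
print(f"BLEU: {bleu.score:.1f}")
```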
|
|
|
|
|
--- |