	---
	license: apache-2.0
	language:
	- hi
	- en
	metrics:
	- perplexity
	widget:
	  - text: >-
	      BCCI ने टी-20 वर्ल्ड कप के बीच जिम्बाब्वे सीरीज
	    example_title: Example 1
	  - text: >-
	      7 अक्टूबर को हमास से जंग शुरू होने के सात महीने बाद इजरायली सेना
	    example_title: Example 2
	  - text: >-
	      हवा में अवांछित गैसों की उपस्थिति से मनुष्य, पशुओं तथा पक्षियों को
	    example_title: Example 3
	  - text: >-
	      पहले संदिग्ध मामलों को 31 दिसंबर 2019 को WHO को सूचित किया गया था,
	    example_title: Example 4
	  - text: >-
	      13 समन्वित बम विस्फोटों के बाद से मुंबई में कई गैर-राज्य हमले
	    example_title: Example 5
	  - text: >-
	      निकोला टेस्ला का जन्म 10 जुलाई 1856 को स्किमडज़, क्रोएरिया में हुआ था,
	    example_title: Example 6
	  - text: >-
	      2007 टूर्नामेंट में क्रिकट विश्व कप के लिए टिकटों से सबसे ज्यादा आमदनी हुई
	    example_title: Example 7
	---

	# Model Card for Ganga-1b! 🌊

	The base model ``Ganga-1b`` was trained on a monolingual Hindi language dataset as part of *Project Unity*. We propose the name *Ganga* 🌊 to honor the longest river flowing through the Hindi-speaking region of India 🇮🇳.

	*(The first pre-trained Hindi model by any academic research lab in India 🇮🇳!)*


	![image/png](https://cdn-uploads.huggingface.co/production/uploads/667b8f8ba271fc5a8e6929de/jG3tZnGPvH6vcGrvxO-YC.png)



	### Model Description 📚

	Project Unity is an initiative to address India's linguistic diversity and richness by creating a comprehensive resource covering the country's major languages. We strive to achieve state-of-the-art performance in understanding and generating text in Indian languages.
	To achieve this, we train models on monolingual data for India's regional languages. Our first release is the Ganga-1b model, which was trained on a large dataset of public-domain, web-crawled Hindi data, including news articles, web documents, books, government publications, educational materials, and social media conversations (filtered for quality). The dataset was further curated by native speakers to ensure high quality.
	In our evaluation, Ganga-1b achieves the lowest perplexity on our dataset and the lowest tokenizer fertility among the open-source models we compared, including models of up to 7 billion parameters (see the Results section).



	- Developed by: [Lingo Research Group at IIT Gandhinagar](https://labs.iitgn.ac.in/lingo/)
	- Model type: Autoregressive Language Model
	- Language(s) (NLP): Bilingual (Primary: Hindi [hi], Secondary: English [en])
	- License: Apache 2.0



	## How to Get Started with the Model 👨🏻‍💻

	Use the code below to get started with the model.

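	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	# Load the tokenizer and model; device_map="auto" places the weights on the available device(s).
	tokenizer = AutoTokenizer.from_pretrained("LingoIITGN/ganga-1b")
	model = AutoModelForCausalLM.from_pretrained("LingoIITGN/ganga-1b", device_map="auto")

	# Hindi prompt for the model to continue.
	input_text = "BCCI ने टी-20 वर्ल्ड कप के बीच जिम्बाब्वे सीरीज "
	# Move the inputs to the model's device instead of hard-coding "cuda".
	input_ids = tokenizer.encode(input_text, return_tensors="pt").to(model.device)

	# Sample up to 100 new tokens with top-k and nucleus (top-p) sampling.
	outputs = model.generate(
	    input_ids,
	    max_new_tokens=100,
	    do_sample=True,
	    top_k=50,
	    top_p=0.95,
	    temperature=0.7,
	)

	print(tokenizer.decode(outputs[0]))
	```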

	## Technical Specifications 🤖

	- Precision: Float32
	- Context Length: 2,048
	- Learning Rate: 4e-4
	- Optimizer: AdamW
	- LR Scheduler: Cosine
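
As a rough illustration of the scheduler setting above, the sketch below computes a cosine-decayed learning rate starting from the listed peak of 4e-4. The card does not state the number of training steps, any warmup phase, or a minimum learning rate, so `total_steps`, `warmup_steps`, and `min_lr` are hypothetical parameters used only for illustration, not the actual training configuration.

```python
import math

PEAK_LR = 4e-4  # peak learning rate listed in the specifications above


def cosine_lr(step, total_steps, peak_lr=PEAK_LR, min_lr=0.0, warmup_steps=0):
    """Learning rate at `step` under optional linear warmup, then cosine decay.

    total_steps, min_lr and warmup_steps are hypothetical; the model card
    does not state them.
    """
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(max(progress, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Print the schedule at a few points of a hypothetical 10,000-step run.
    for step in (0, 2_500, 5_000, 7_500, 10_000):
        print(step, cosine_lr(step, total_steps=10_000))
```

With no warmup, the rate starts at the peak, reaches half the peak at the midpoint, and decays to `min_lr` at the final step.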

	### Model Architecture and Objective


	Ganga-1b is a decoder-only transformer model, featuring the following specifications:


	* Layers: 16
	* Attention heads: 32
	* Embedding dimension: 2,048
	* Vocabulary size: 30,000
	* Sliding window: 512
	* Intermediate dimension: 7,168


	## Evaluation

	### Results 🏆

	<details open>
	<summary>Tokenizer Results</summary>
	<br>

	| Model | Fertility |
	|:-----------:|:---------:|
	| *Ganga-1b* | *1.12* |
	| Pragna-1b | 1.58 |
	| Bloom-1b1 | 1.27 |
	| Bloom-1b7 | 1.27 |
	| Gemma-2b | 1.89 |
	| Bloom-3b | 1.27 |
	| Airavata-7b | 1.69 |
	| Sarvam-2b | 1.38 |

	</details>
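
The card does not say how fertility was measured. A common definition is the average number of tokens a tokenizer produces per word, where lower values mean less fragmentation. The sketch below computes that quantity under the assumption that words are whitespace-delimited. `toy_tokenize` is a hypothetical stand-in so the example runs without downloading a model.

```python
def fertility(texts, tokenize):
    """Average number of tokens per whitespace-delimited word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("no words found in the input texts")
    return total_tokens / total_words


def toy_tokenize(text, max_piece=4):
    """Hypothetical tokenizer: cut each word into pieces of at most max_piece characters."""
    pieces = []
    for word in text.split():
        pieces.extend(word[i:i + max_piece] for i in range(0, len(word), max_piece))
    return pieces


if __name__ == "__main__":
    sample = ["hello world", "tokenization"]
    print(fertility(sample, toy_tokenize))
```

To measure a real tokenizer, one could pass the `tokenize` method of a tokenizer loaded with `AutoTokenizer.from_pretrained` in place of `toy_tokenize`. The evaluation corpus behind the table above is not specified here.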


	<details open>
	<summary>Metrics</summary>
	<br>

	| Model | PPL<sub>Our Dataset</sub> | PPL<sub>Sangraha Dataset</sub> |
	|:-----------:|:---------:|:------:|
	| *Ganga-1b* | *17.92* | *15.82* |
	| Pragna-1b | 98.16 | 9.37 |
	| Bloom-1b1 | 27.81 | 17.49 |
	| Bloom-1b7 | 22.49 | 14.28 |
	| Gemma-2b | 49.27 | 31.01 |
	| Bloom-3b | 19.99 | 12.82 |
	| OpenHathi-7B | 42.95 | 25.73 |
	| Airavata-7b | 60.87 | 38.24 |
	| Sarvam-2b | 18.56 | 10.31 |

	</details>
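
For readers unfamiliar with the metric, perplexity (PPL) is the exponential of the average negative log-likelihood a model assigns to each token, so lower is better. The sketch below shows that calculation on hypothetical log-probabilities. It assumes token-level perplexity with natural logarithms, and the card does not describe the exact evaluation procedure used for the table above.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities: exp(-mean log p)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # Hypothetical example: the model gives every token probability 0.25,
    # so the perplexity is approximately 4.
    log_probs = [math.log(0.25)] * 4
    print(perplexity(log_probs))
```

Because token-level perplexity depends on how each tokenizer segments the text, comparisons between models with different vocabularies should be read as approximate.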



	## Bias, Risks, and Limitations 🚨


	### Recommendations ‼️

	<span style="color:red">This model is a research preview under ongoing iterative updates, and as such it provides only limited safety measures. Additionally, it may generate offensive content. Using the model for any illegal, harmful, violent, racist, or sexual purposes is strictly prohibited.</span>

	## More Information

	DEMO: [https://huggingface.co/spaces/Lingo-IITGN/ganga-1b](https://huggingface.co/spaces/Lingo-IITGN/ganga-1b)

	## Model Card Contact ✉️

	[Lingo Research Group at IIT Gandhinagar, India](https://labs.iitgn.ac.in/lingo/) <br>
	Email: [lingo@iitgn.ac.in](mailto:lingo@iitgn.ac.in)