language:
- tr
pipeline_tag: image-text-to-text
license: apache-2.0
---

<!-- # TraVisionLM - Fast and Native Turkish Visual Language Model -->
</div>
<!-- Provide a quick summary of what the model is/does. -->

## English
# 🎉 Introducing TraVisionLM: The First of Its Kind! 🚀

🌟 This is the very first fast and compact (875M parameters) visual language model on Hugging Face that responds to Turkish instructions given an image input! 🌟

✨ Developed to be compatible with the Transformers library, TraVisionLM is a breeze to load, fine-tune, and run for lightning-fast inference, all without needing any external libraries! ⚡️

Ready to experience the Turkish visual language model? Let's go! 🇹🇷🖼️🤖

## Türkçe
# 🎉 TraVisionLM: The First of Its Kind! 🚀

🌟 The first fast and compact (875M parameters) Turkish visual language model! Given an image and a Turkish instruction, it generates a response in Turkish! 🌟

✨ Developed to be compatible with the Transformers library, TraVisionLM is very easy to load, fine-tune, and run for fast inference without any external libraries! ⚡️

Ready to experience the Turkish visual language model? Let's get started! 🇹🇷🖼️🤖

---

# Model Details

## English
This model is a multimodal large language model that combines [SigLIP](https://huggingface.co/docs/transformers/en/model_doc/siglip) as its vision encoder with [GPT2-large](https://huggingface.co/docs/transformers/en/model_doc/gpt2) as its language model. A vision projector connects the two modalities.
Its architecture closely resembles [PaliGemma](https://arxiv.org/pdf/2407.07726), with some refined adjustments to the vision projector and the causal language modeling.

Here's a glimpse into the development process:
1) **Unimodal pretraining**
    - In this stage, instead of pretraining both modalities from scratch, I leverage the image encoder from [google/siglip-base-patch16-256-multilingual](https://huggingface.co/google/siglip-base-patch16-256-multilingual) and the language model from [ytu-ce-cosmos/turkish-gpt2-large](https://huggingface.co/ytu-ce-cosmos/turkish-gpt2-large).
2) **Feature Alignment**
    - Following the [LLaVA training recipe](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#train), I train only the vision projector on 500K image-text pairs to align the visual and textual features.
3) **Task-Specific Training**
    - The aligned model then undergoes further training for tasks such as short captioning, detailed captioning, and simple visual question answering, using over 1M image-prompt-completion triplets.
4) **Finetuning on Downstream Tasks**
    - Finally, the model is fine-tuned for object detection to demonstrate its versatility on downstream tasks. Explore the fine-tuned model at [ucsahin/TraVisionLM-Object-Detection-ft](https://huggingface.co/ucsahin/TraVisionLM-Object-Detection-ft) for more details.
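
The projection step described above can be sketched numerically. The dimensions below are illustrative assumptions (a 768-dim SigLIP-base patch feature and a 1280-dim GPT2-large hidden state), not the released configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not the exact released config):
# 256 image patches of 768 dims from the vision encoder,
# 1280-dim token embeddings in the language model.
num_patches, vision_dim, text_dim, prompt_len = 256, 768, 1280, 8

image_feats = rng.standard_normal((num_patches, vision_dim))  # vision encoder output
W_proj = 0.02 * rng.standard_normal((vision_dim, text_dim))   # vision projector weights

# Project patch features into the language model's embedding space.
image_tokens = image_feats @ W_proj                           # shape (256, 1280)

# Prepend the projected image tokens to the prompt's token embeddings;
# the language model then decodes causally over the combined sequence.
prompt_embeds = rng.standard_normal((prompt_len, text_dim))
lm_input = np.concatenate([image_tokens, prompt_embeds], axis=0)

print(lm_input.shape)  # (264, 1280)
```

This only shows the shape bookkeeping of a PaliGemma-style design; the real projector and token interleaving may differ in detail.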

## Türkçe
This model is a multimodal large language model that combines the [SigLIP](https://huggingface.co/docs/transformers/en/model_doc/siglip) vision encoder with the [GPT2-large](https://huggingface.co/docs/transformers/en/model_doc/gpt2) language model. A vision projector brings the two modalities together.
Its architecture closely resembles [PaliGemma](https://arxiv.org/pdf/2407.07726), with some adaptations made to the vision projector and the causal language modeling.

A summary of the development process:

1) **Unimodal Pretraining**
    - In this stage, instead of training both modalities from scratch, I use the vision encoder of [google/siglip-base-patch16-256-multilingual](https://huggingface.co/google/siglip-base-patch16-256-multilingual) and the language model of [ytu-ce-cosmos/turkish-gpt2-large](https://huggingface.co/ytu-ce-cosmos/turkish-gpt2-large).
2) **Feature Alignment**
    - Following the [LLaVA training recipe](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#train), I train only the vision projector on 500K image-text pairs to align the visual and textual features.
3) **Task-Specific Training**
    - The aligned model receives further training for tasks such as short captioning, detailed captioning, and simple visual question answering, using more than 1M image-prompt-completion triplets.
4) **Finetuning on Downstream Tasks**
    - Finally, the model is fine-tuned for object detection to demonstrate its versatility across tasks. You can explore the fine-tuned model at [ucsahin/TraVisionLM-Object-Detection-ft](https://huggingface.co/ucsahin/TraVisionLM-Object-Detection-ft) for more details.
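
The Feature Alignment stage trains only the projector while both pretrained backbones stay frozen. A minimal PyTorch sketch of that setup, using toy stand-in modules rather than the actual SigLIP/GPT2 weights:

```python
import torch
from torch import nn

# Toy stand-ins for the real modules (dimensions are illustrative).
vision_encoder = nn.Linear(768, 768)    # stands in for the SigLIP image encoder
projector = nn.Linear(768, 1280)        # the only part trained in this stage
language_model = nn.Linear(1280, 1280)  # stands in for GPT2-large

# Freeze both pretrained backbones; only the projector stays trainable.
for module in (vision_encoder, language_model):
    for p in module.parameters():
        p.requires_grad = False

optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-3)

x = torch.randn(4, 768)        # fake image features for one alignment step
target = torch.randn(4, 1280)  # fake target embeddings
with torch.no_grad():
    feats = vision_encoder(x)  # frozen forward pass
loss = nn.functional.mse_loss(projector(feats), target)
loss.backward()                # gradients flow only into the projector
optimizer.step()

trainable = sum(p.numel() for p in projector.parameters() if p.requires_grad)
print(trainable)  # 768 * 1280 + 1280 = 984320
```

The real recipe optimizes a language-modeling loss over 500K image-text pairs; the point here is only the freezing pattern.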

### Model Description

- **Developed by:** [ucsahin](https://huggingface.co/ucsahin)
- **Model type:** [Image-Text-to-Text](https://huggingface.co/tasks/image-text-to-text)
- **Language(s) (NLP):** *Turkish*
- **License:** *Apache license 2.0*

### Model Sources [optional]

<!-- Provide the basic links for the model. -->

- **Repository:** [https://huggingface.co/ucsahin/TraVisionLM-base](https://huggingface.co/ucsahin/TraVisionLM-base)
- **Paper [optional]:** More info on this later.
- **Demo [optional]:** [More Information Needed]

## Uses

Use the code below to get started with the model.

[More Information Needed]
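
Until the official snippet is filled in, here is a minimal quick-start sketch, assuming the model loads through the standard Transformers auto classes with `trust_remote_code=True`; the exact prompt format is not documented here, so `build_prompt` is a hypothetical helper:

```python
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "ucsahin/TraVisionLM-base"

def build_prompt(instruction: str) -> str:
    """Hypothetical helper: pass a Turkish instruction as the text prompt."""
    return instruction.strip()

def generate(image, instruction: str, max_new_tokens: int = 128) -> str:
    # Heavy calls live inside this function so importing the sketch stays cheap.
    processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True)
    inputs = processor(text=build_prompt(instruction), images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0]

# Example usage (requires the downloaded weights and a PIL image):
# from PIL import Image
# print(generate(Image.open("ornek.jpg"), "Açıkla"))
```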

### Model Architecture and Objective

[More Information Needed]

#### Software