ucsahin committed
Commit c1ac296
1 Parent(s): 4773b56

Update README.md

Files changed (1)
  1. README.md +53 -38
README.md CHANGED
@@ -5,6 +5,7 @@ datasets:
language:
- tr
pipeline_tag: image-text-to-text
+ license: apache-2.0
---

<!-- # TraVisionLM - Fast and Native Turkish Visual Language Model -->
@@ -13,20 +14,59 @@ pipeline_tag: image-text-to-text
</div>
<!-- Provide a quick summary of what the model is/does. -->

- This is the very first fast and small (875M parameters) visual language model in Hugging Face that given an image input and a Turkish instruction generates a response in Turkish. The model is developed natively in accordance with the Transformers library. So, you can easily load, fine-tune and make some blazingly fast inferences without using any external library!
-
-
- ## Model Details
-
- This model is a multimodal large language model that uses [SigLIP](https://huggingface.co/docs/transformers/en/model_doc/siglip) as its vision encoder and [GPT2-large](https://huggingface.co/docs/transformers/en/model_doc/gpt2) as its language model. The vision projector is used to connect two modalities together.
- The architecture of the model is very similar to that of [PaliGemma](https://arxiv.org/pdf/2407.07726) with some adjustments to the vision projector and the causal language modeling.
-
- The development process took place as follows:
+ ## English
+ # 🎉 Introducing TraVisionLM: The First of Its Kind! 🚀
+
+ 🌟 This is the very first fast and compact (875M parameters) visual language model on Hugging Face that responds to Turkish instructions given an image input! 🌟
+
+ Developed to be compatible with the Transformers library, TraVisionLM is a breeze to load, fine-tune, and use for lightning-fast inference, all without needing any external libraries! ⚡️
+
+ Ready to experience the Turkish visual language model? Let's go! 🇹🇷🖼️🤖
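As a minimal sketch of the easy loading and fast inference promised above: the snippet below assumes the repository ships its custom Transformers code, so `AutoModelForCausalLM` and `AutoProcessor` need `trust_remote_code=True`; the `processor(text=..., images=...)` call signature and the short Turkish prompt are illustrative assumptions, not a documented interface.

```python
# Hedged usage sketch: load TraVisionLM and generate a Turkish response.
# Assumes the repo provides custom model/processor code (trust_remote_code=True)
# and a PaliGemma-style processor(text=..., images=...) interface.
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "ucsahin/TraVisionLM-base"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "Açıkla"  # illustrative Turkish instruction ("Describe")

inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```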
+
+ ## Turkish
+ # 🎉 TraVisionLM: The First of Its Kind! 🚀
+
+ 🌟 The first fast and compact (875M parameters) Turkish visual language model! Given an image and a Turkish instruction, it generates a response in Turkish! 🌟
+
+ ✨ Developed to be compatible with the Transformers library, TraVisionLM is very easy to load, train, and use for fast inference without any external libraries! ⚡️
+
+ Ready to experience the Turkish visual language model? Let's get started! 🇹🇷🖼️🤖
+
+ ---
+
+ # Model Details
+
+ ## English
+ This model is a multimodal large language model that combines [SigLIP](https://huggingface.co/docs/transformers/en/model_doc/siglip) as its vision encoder with [GPT2-large](https://huggingface.co/docs/transformers/en/model_doc/gpt2) as its language model. The vision projector connects the two modalities together.
+ Its architecture closely resembles [PaliGemma](https://arxiv.org/pdf/2407.07726), with some refined adjustments to the vision projector and the causal language modeling.
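To make that wiring concrete, here is a schematic sketch of the described design, not the released implementation: SigLIP patch features pass through a projector into the GPT2 embedding space and are prepended to the prompt embeddings. The single linear projector is an assumption; the card explicitly mentions adjustments to the projector.

```python
# Schematic sketch of the architecture described above (not the released code):
# SigLIP vision encoder -> projector -> prefix embeddings for GPT2-large.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, SiglipVisionModel

class TraVisionLikeModel(nn.Module):
    def __init__(self,
                 vision_id="google/siglip-base-patch16-256-multilingual",
                 lm_id="ytu-ce-cosmos/turkish-gpt2-large"):
        super().__init__()
        self.vision = SiglipVisionModel.from_pretrained(vision_id)
        self.lm = GPT2LMHeadModel.from_pretrained(lm_id)
        # Vision projector: SigLIP hidden size -> GPT2 embedding size.
        # A single linear layer is an assumption; the card notes adjustments here.
        self.projector = nn.Linear(self.vision.config.hidden_size,
                                   self.lm.config.hidden_size)

    def forward(self, pixel_values, input_ids):
        patches = self.vision(pixel_values=pixel_values).last_hidden_state
        image_embeds = self.projector(patches)             # (B, num_patches, d_lm)
        text_embeds = self.lm.transformer.wte(input_ids)   # (B, seq_len, d_lm)
        # Image embeddings act as a prefix to the Turkish prompt tokens.
        inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
        return self.lm(inputs_embeds=inputs_embeds)
```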
+
+ Here's a glimpse into the development process:
+
  1) **Unimodal pretraining**
- - In this stage, instead of pretraining both modalities from scratch, the image encoder of the [google/siglip-base-patch16-256-multilingual](https://huggingface.co/google/siglip-base-patch16-256-multilingual) model and [ytu-ce-cosmos/turkish-gpt2-large](https://huggingface.co/ytu-ce-cosmos/turkish-gpt2-large) are selected as the vision encoder and language models, respectively.
- 3) **Feature Alignment**
- 4) **Task Specific Training**
- 5) **Finetuning on Downstream Tasks**
+ - In this stage, instead of pretraining both modalities from scratch, I leverage the image encoder from [google/siglip-base-patch16-256-multilingual](https://huggingface.co/google/siglip-base-patch16-256-multilingual) and the language model from [ytu-ce-cosmos/turkish-gpt2-large](https://huggingface.co/ytu-ce-cosmos/turkish-gpt2-large).
+ 2) **Feature Alignment**
+ - Following the [LLaVA training recipe](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#train), I train only the vision projector, using 500K image-text pairs to align visual and textual features (see the projector-freezing sketch after this list).
+ 3) **Task Specific Training**
+ - The aligned model undergoes further training for tasks such as short captioning, detailed captioning, and simple visual question answering, using over 1M image-prompt-completion triplets.
+ 4) **Finetuning on Downstream Tasks**
+ - Finally, the model is fine-tuned for object detection to demonstrate its versatility on various downstream tasks. Explore the fine-tuned object detection model at [ucsahin/TraVisionLM-Object-Detection-ft](https://huggingface.co/ucsahin/TraVisionLM-Object-Detection-ft) for more details.
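The projector-freezing sketch referenced in stage 2 above: during feature alignment, only the vision projector receives gradients while both unimodal backbones stay frozen. `model` here is the hypothetical `TraVisionLikeModel` from the earlier sketch, not the released training code.

```python
# Feature-alignment sketch: freeze both pretrained backbones and train only
# the vision projector, following the LLaVA-style recipe described above.
model = TraVisionLikeModel()

for param in model.vision.parameters():
    param.requires_grad = False
for param in model.lm.parameters():
    param.requires_grad = False
for param in model.projector.parameters():
    param.requires_grad = True  # only the projector weights are updated

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```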
+
+ ## Turkish
+ This model is a multimodal large language model that combines the [SigLIP](https://huggingface.co/docs/transformers/en/model_doc/siglip) vision encoder with the [GPT2-large](https://huggingface.co/docs/transformers/en/model_doc/gpt2) language model. The vision projector brings the two modalities together.
+ Its architecture closely resembles [PaliGemma](https://arxiv.org/pdf/2407.07726), with some adaptations to the vision projector and the causal language modeling.
+
+ A summary of the development process:
+
+ 1) **Unimodal Pretraining**
+ - In this stage, instead of training both modalities from scratch, I use the vision encoder of the [google/siglip-base-patch16-256-multilingual](https://huggingface.co/google/siglip-base-patch16-256-multilingual) model and the language model [ytu-ce-cosmos/turkish-gpt2-large](https://huggingface.co/ytu-ce-cosmos/turkish-gpt2-large).
+ 2) **Feature Alignment**
+ - Following the [LLaVA training recipe](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#train), I train only the vision projector on 500K image-text pairs to bring the visual and textual features into alignment.
+ 3) **Task-Specific Training**
+ - The aligned model is trained further on tasks such as short captioning, detailed captioning, and simple visual question answering, using more than 1M image-prompt-completion triplets.
+ 4) **Finetuning on Downstream Tasks**
+ - Finally, the model is fine-tuned for object detection to demonstrate its versatility across downstream tasks. You can explore the fine-tuned object detection model at [ucsahin/TraVisionLM-Object-Detection-ft](https://huggingface.co/ucsahin/TraVisionLM-Object-Detection-ft) for more details.

  ### Model Description
@@ -34,15 +74,15 @@ The development process took place as follows:
- **Developed by:** [ucsahin](https://huggingface.co/ucsahin)
- **Model type:** [Image-Text-to-Text](https://huggingface.co/tasks/image-text-to-text)
- - **Language(s) (NLP):** [Turkish]
- - **License:** More info on this later...
+ - **Language(s) (NLP):** *Turkish*
+ - **License:** *Apache license 2.0*

### Model Sources [optional]

<!-- Provide the basic links for the model. -->

- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
+ - **Repository:** [https://huggingface.co/ucsahin/TraVisionLM-base](https://huggingface.co/ucsahin/TraVisionLM-base)
+ - **Paper [optional]:** More info on this later.
- **Demo [optional]:** [More Information Needed]

  ## Uses
@@ -140,29 +180,7 @@ Use the code below to get started with the model.
[More Information Needed]

- #### Summary
-
-
-
- ## Model Examination [optional]
-
- <!-- Relevant interpretability work for the model goes here -->
-
- [More Information Needed]

- ## Environmental Impact
-
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-
- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]
-
- ## Technical Specifications [optional]

  ### Model Architecture and Objective
 
@@ -172,9 +190,6 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]
[More Information Needed]

- #### Hardware
-
- [More Information Needed]

  #### Software
 
 