ucsahin committed on
Commit
6258b01
1 Parent(s): 29f7d7e

Update README.md

Files changed (1)
  1. README.md +25 -13
README.md CHANGED
@@ -1,29 +1,41 @@
  ---
  library_name: transformers
- tags: []
  ---

- # Model Card for Model ID
-
  <!-- Provide a quick summary of what the model is/does. -->

  ## Model Details

- ### Model Description

- <!-- Provide a longer summary of what this model is. -->

- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]

  ### Model Sources [optional]

  ---
  library_name: transformers
+ datasets:
+ - ucsahin/Turkish-VLM-Mix-Benchmark
+ language:
+ - tr
+ pipeline_tag: image-text-to-text
  ---

+ <!-- # TraVisionLM - Fast and Native Turkish Visual Language Model -->
+ <div style="text-align: center;">
+ <img src="logo-no-background.png" alt="logo" style="width: 50%; height: auto;">
+ </div>
  <!-- Provide a quick summary of what the model is/does. -->

+ This is the very first fast and compact (875M parameters) visual language model on Hugging Face that, given an image and a Turkish instruction, generates a response in Turkish. The model is developed natively with the Transformers library, so you can easily load it, fine-tune it, and run blazingly fast inference without any external libraries!

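+ A minimal inference sketch is given below. The repository ID `ucsahin/TraVisionLM-base` is a placeholder, `trust_remote_code=True` assumes the custom architecture ships with the repo, and the exact prompt format may differ from this example:
+ 
+ ```python
+ import torch
+ from PIL import Image
+ from transformers import AutoModelForCausalLM, AutoProcessor
+ 
+ model_id = "ucsahin/TraVisionLM-base"  # placeholder repo ID
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ 
+ # Load the model together with its processor (image processor + tokenizer)
+ model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).to(device)
+ processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
+ 
+ image = Image.open("ornek.jpg")    # any local image
+ prompt = "Görseli kısaca açıkla."  # Turkish instruction: "Describe the image briefly."
+ 
+ inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
+ output_ids = model.generate(**inputs, max_new_tokens=128)
+ print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
+ ```
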
  ## Model Details

+ This model is a multimodal large language model that uses [SigLIP](https://huggingface.co/docs/transformers/en/model_doc/siglip) as its vision encoder and [GPT2-large](https://huggingface.co/docs/transformers/en/model_doc/gpt2) as its language model, with a vision projector connecting the two modalities.
+ The architecture closely follows that of [PaliGemma](https://arxiv.org/pdf/2407.07726), with some adjustments to the vision projector and the causal language modeling setup.

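+ The snippet below sketches the role of the vision projector. The single linear layer and the hidden sizes (768 for the SigLIP base encoder, 1280 for GPT2-large) are illustrative assumptions, not the exact implementation:
+ 
+ ```python
+ import torch
+ import torch.nn as nn
+ 
+ class VisionProjector(nn.Module):
+     """Maps SigLIP patch features into the language model's embedding space (illustrative)."""
+     def __init__(self, vision_dim=768, text_dim=1280):
+         super().__init__()
+         self.proj = nn.Linear(vision_dim, text_dim)
+ 
+     def forward(self, image_features):    # (batch, num_patches, vision_dim)
+         return self.proj(image_features)  # (batch, num_patches, text_dim)
+ 
+ # A 256x256 image split into 16x16 patches yields 256 patch tokens.
+ patch_features = torch.randn(1, 256, 768)
+ image_tokens = VisionProjector()(patch_features)
+ print(image_tokens.shape)  # torch.Size([1, 256, 1280]); prepended to the text embeddings, PaliGemma-style
+ ```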
 
+ The development process took place as follows:
+ 1) **Unimodal pretraining**
+ - In this stage, instead of pretraining both modalities from scratch, the image encoder of [google/siglip-base-patch16-256-multilingual](https://huggingface.co/google/siglip-base-patch16-256-multilingual) and [ytu-ce-cosmos/turkish-gpt2-large](https://huggingface.co/ytu-ce-cosmos/turkish-gpt2-large) are selected as the vision encoder and the language model, respectively (a loading sketch follows this list).
+ 2) **Feature Alignment**
+ 3) **Task-Specific Training**
+ 4) **Finetuning on Downstream Tasks**

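+ A sketch of how the two pretrained unimodal components referenced above could be loaded with Transformers before being wired together (the exact classes used during training are an assumption here):
+ 
+ ```python
+ from transformers import AutoTokenizer, GPT2LMHeadModel, SiglipImageProcessor, SiglipVisionModel
+ 
+ # Reuse pretrained unimodal checkpoints instead of pretraining from scratch
+ vision_encoder = SiglipVisionModel.from_pretrained("google/siglip-base-patch16-256-multilingual")
+ image_processor = SiglipImageProcessor.from_pretrained("google/siglip-base-patch16-256-multilingual")
+ 
+ language_model = GPT2LMHeadModel.from_pretrained("ytu-ce-cosmos/turkish-gpt2-large")
+ tokenizer = AutoTokenizer.from_pretrained("ytu-ce-cosmos/turkish-gpt2-large")
+ 
+ # These are the widths the vision projector has to bridge
+ print(vision_encoder.config.hidden_size)   # vision encoder hidden size
+ print(language_model.config.hidden_size)   # language model hidden size
+ ```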
 
 
+ ### Model Description
+ <!-- Provide a longer summary of what this model is. -->
+
+ - **Developed by:** [ucsahin](https://huggingface.co/ucsahin)
+ - **Model type:** [Image-Text-to-Text](https://huggingface.co/tasks/image-text-to-text)
+ - **Language(s) (NLP):** Turkish
+ - **License:** More info on this later...

  ### Model Sources [optional]