ucsahin/TraVisionLM-base

Image-Text-to-Text

text-generation


ucsahin committed on Aug 8

Commit 6258b01 • 1 Parent(s): 29f7d7e

Update README.md

Files changed (1)

README.md +25 -13


@@ -1,29 +1,41 @@
 ---
 library_name: transformers
-tags: []
 ---
-# Model Card for Model ID
 <!-- Provide a quick summary of what the model is/does. -->
 ## Model Details
-### Model Description
-<!-- Provide a longer summary of what this model is. -->
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
 ### Model Sources [optional]

 ---
 library_name: transformers
+datasets:
+- ucsahin/Turkish-VLM-Mix-Benchmark
+language:
+- tr
+pipeline_tag: image-text-to-text
 ---
+<!-- # TraVisionLM - Fast and Native Turkish Visual Language Model -->
+<div style="text-align: center;">
+    <img src="logo-no-background.png" alt="logo" style="width: 50%; height: auto;">
+</div>
 <!-- Provide a quick summary of what the model is/does. -->
+This is the first fast and small (875M parameters) visual language model on Hugging Face that, given an image and a Turkish instruction, generates a response in Turkish. The model is implemented natively on top of the Transformers library, so you can load it, fine-tune it, and run fast inference without any external library.
 ## Model Details
+This model is a multimodal large language model that uses [SigLIP](https://huggingface.co/docs/transformers/en/model_doc/siglip) as its vision encoder and [GPT2-large](https://huggingface.co/docs/transformers/en/model_doc/gpt2) as its language model. A vision projector connects the two modalities.
+The architecture of the model is very similar to that of [PaliGemma](https://arxiv.org/pdf/2407.07726) with some adjustments to the vision projector and the causal language modeling.
+The development process took place as follows:
+1) **Unimodal pretraining**
+    - In this stage, rather than pretraining both modalities from scratch, the image encoder of the [google/siglip-base-patch16-256-multilingual](https://huggingface.co/google/siglip-base-patch16-256-multilingual) model and the [ytu-ce-cosmos/turkish-gpt2-large](https://huggingface.co/ytu-ce-cosmos/turkish-gpt2-large) model are selected as the vision encoder and the language model, respectively.
+2) **Feature Alignment**
+3) **Task Specific Training**
+4) **Finetuning on Downstream Tasks**
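Reviewer note (not part of the commit): the architecture described in the added lines above, a SigLIP image encoder joined to a GPT2-large language model by a vision projector, can be illustrated with a small shape sketch. This is only a sketch under stated assumptions. The single linear projector, the helper names (`num_image_tokens`, `project`, `build_prefix`), and the PaliGemma-style ordering that places image tokens before text tokens are hypothetical; the README only says the projector was adjusted. The hidden sizes 768 and 1280 are assumed from the standard SigLIP-base and GPT2-large configurations, not taken from this commit.

```python
# Sketch only: shape bookkeeping for a vision encoder -> projector -> LM pipeline.
# ASSUMPTION: a single linear projector; the actual TraVisionLM projector may differ.


def num_image_tokens(image_size: int, patch_size: int) -> int:
    """Number of patch tokens a ViT-style encoder emits for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def project(features, weight, bias):
    """Apply y = x @ W + b to every token (pure Python, for illustration)."""
    out_dim = len(bias)
    return [
        [sum(x[i] * weight[i][j] for i in range(len(x))) + bias[j]
         for j in range(out_dim)]
        for x in features
    ]


def build_prefix(image_tokens, text_embeddings):
    """Place projected image tokens before the text embeddings (assumed order)."""
    return image_tokens + text_embeddings


# "patch16-256" in the encoder name: 256x256 input split into 16x16 patches.
N_IMAGE_TOKENS = num_image_tokens(256, 16)
VISION_DIM = 768   # assumed SigLIP-base hidden size
TEXT_DIM = 1280    # assumed GPT2-large hidden size
# Under the linear-projector assumption, W has shape (VISION_DIM, TEXT_DIM).

# Toy-sized run so the arithmetic is easy to follow: 2 tokens, dim 2 -> dim 3.
toy_features = [[1.0, 2.0], [0.0, 1.0]]
toy_weight = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
toy_bias = [0.0, 0.0, 0.5]
toy_projected = project(toy_features, toy_weight, toy_bias)
toy_text = [[0.0, 0.0, 0.0]]  # one text-token embedding in the LM's space
toy_sequence = build_prefix(toy_projected, toy_text)
```

With these assumptions, each image would contribute `N_IMAGE_TOKENS` vectors to the language model's input sequence, each mapped from `VISION_DIM` to `TEXT_DIM` by the projector.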
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+- **Developed by:** [ucsahin](https://huggingface.co/ucsahin)
+- **Model type:** [Image-Text-to-Text](https://huggingface.co/tasks/image-text-to-text)
+- **Language(s) (NLP):** Turkish
+- **License:** More info on this later...
 ### Model Sources [optional]