metadata

library_name: transformers
datasets:
  - ucsahin/Turkish-VLM-Mix-Benchmark
language:
  - tr
pipeline_tag: image-text-to-text
license: apache-2.0

English

🎉 Introducing TraVisionLM: The First of Its Kind! 🚀

🌟 This is the very first fast and compact (875M parameters) visual language model on Hugging Face that responds to Turkish instructions given an image input! 🌟

✨ Built to be compatible with the Transformers library, TraVisionLM is a breeze to load, fine-tune, and use for lightning-fast inference, all without needing any external libraries! ⚡️

Ready to experience the Turkish visual language model? Let's go! 🇹🇷🖼️🤖

Türkçe

🎉 TraVisionLM: Türünün İlk Örneği! 🚀

🌟 Türkçe görsel dil modelinin ilk hızlı ve kompakt (875M parametre) versiyonu! Bir görüntü ve Türkçe talimat verildiğinde Türkçe yanıt üretir! 🌟

✨ Transformers kütüphanesi ile uyumlu olarak geliştirilen TraVisionLM, yüklemek, eğitmek ve dış kütüphaneler kullanmadan hızlı sonuçlar almak için kullanımı çok kolay! ⚡️

Türkçe görsel dil modelini deneyimlemeye hazır mısınız? Hadi başlayalım! 🇹🇷🖼️🤖

Model Details

English

This model is a multimodal large language model that combines SigLIP as its vision encoder with GPT2-large as its language model. A vision projector connects the two modalities. The architecture closely resembles PaliGemma, with some adjustments to the vision projector and to the causal language modeling.
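As a rough illustration of how the pieces fit together, the sketch below follows a PaliGemma-style layout: patch features from the vision encoder pass through a projector and are placed in front of the text embeddings. The sizes are assumptions based on the public base checkpoints (256x256 images, 16x16 patches, 768-dimensional SigLIP features, 1280-dimensional GPT2-large embeddings), and the single linear layer is a stand-in, since this card does not spell out the actual projector design. All function names are hypothetical.

```python
import random

# Assumed sizes; they are not stated in the card itself.
IMAGE_SIZE = 256   # input resolution implied by "patch16-256"
PATCH_SIZE = 16    # patch side length implied by "patch16-256"
VISION_DIM = 768   # assumed hidden size of a "base" SigLIP vision tower
TEXT_DIM = 1280    # assumed embedding size of GPT2-large
PROJECTOR_SHAPE = (VISION_DIM, TEXT_DIM)  # weight shape if the projector were one linear layer


def num_image_tokens(image_size: int, patch_size: int) -> int:
    """Number of patch features the vision encoder emits per image."""
    per_side = image_size // patch_size
    return per_side * per_side


def project(features, weight):
    """Linear projector: map each vision feature into the text embedding space.

    features: list of vectors of length d_in
    weight:   d_in x d_out matrix given as a list of rows
    """
    d_out = len(weight[0])
    return [
        [sum(vec[i] * weight[i][j] for i in range(len(vec))) for j in range(d_out)]
        for vec in features
    ]


def build_sequence(image_features, text_embeddings, weight):
    """Place the projected image tokens in front of the text tokens."""
    return project(image_features, weight) + text_embeddings


# Toy-sized demo so the pure-Python matrix product stays fast.
rng = random.Random(0)
d_in, d_out, n_patches, n_text = 6, 10, 4, 3
weight = [[rng.uniform(-1, 1) for _ in range(d_out)] for _ in range(d_in)]
image_features = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(n_patches)]
text_embeddings = [[rng.uniform(-1, 1) for _ in range(d_out)] for _ in range(n_text)]

sequence = build_sequence(image_features, text_embeddings, weight)
print(len(sequence), len(sequence[0]))           # 7 10
print(num_image_tokens(IMAGE_SIZE, PATCH_SIZE))  # 256
```

Under these assumptions each image contributes 256 tokens to the language model's input, which is why the projector's output width has to match the text embedding size.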

Here's the summary of the development process:

Unimodal pretraining
- In this stage, instead of pretraining both modalities from scratch, I leverage the image encoder from google/siglip-base-patch16-256-multilingual and the language model from ytu-ce-cosmos/turkish-gpt2-large.
Feature Alignment
- Following the LLaVA training recipe, I train only the vision projector using 500K image-text pairs to align visual and textual features.
Task Specific Training
- The aligned model undergoes further training for tasks such as short captioning, detailed captioning, and simple visual question answering, using over 1M image-prompt-completion triplets.
Finetuning on Downstream Tasks
- Finally, the model is fine-tuned for object detection to demonstrate its versatility across downstream tasks. See the fine-tuned model at ucsahin/TraVisionLM-Object-Detection-ft for more details.
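The staged recipe above can be summarized as a small schedule of which components receive gradient updates. In the sketch below, only the feature-alignment row (projector only) and the reuse of existing checkpoints in the first stage come from this card. The trainable sets for the last two stages are not stated, so they are left as `None` rather than guessed. The `Stage` class, the component names, and the parameter-name prefixes are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical component names used as parameter-name prefixes.
COMPONENTS = ("vision_encoder", "projector", "language_model")


@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    trainable: Optional[Tuple[str, ...]]  # None = not stated in the card


STAGES = (
    Stage("unimodal pretraining", "existing checkpoints are reused", ()),
    Stage("feature alignment", "500K image-text pairs", ("projector",)),
    Stage("task-specific training", "over 1M image-prompt-completion triplets", None),
    Stage("downstream fine-tuning", "not stated", None),
)


def trainable_parameters(param_names, stage):
    """Return the parameter names that would be updated in the given stage."""
    if stage.trainable is None:
        raise ValueError(f"the card does not say what is trained in '{stage.name}'")
    return [name for name in param_names if name.split(".")[0] in stage.trainable]


# Hypothetical parameter names, each prefixed by its component.
params = [
    "vision_encoder.patch_embed.weight",
    "projector.linear.weight",
    "projector.linear.bias",
    "language_model.wte.weight",
]
print(trainable_parameters(params, STAGES[1]))
```

In a real training script the same selection would typically be applied by toggling gradient tracking on each parameter; that step is omitted here to keep the sketch dependency-free.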

Türkçe

Bu model, SigLIP görsel kodlayıcısını ve GPT2-large dil modelini birleştiren çok modlu büyük bir dil modelidir. Görsel projektör, iki modaliteyi bir araya getirir. Mimarisi, PaliGemma ile yakından benzerlik gösterir, ancak görsel projektör ve neden-sonuç dil modellemesinde bazı uyarlamalar yapılmıştır.

Geliştirme sürecinin özeti:

Tek Modalite Ön Eğitimi
- Bu aşamada, her iki modaliteyi sıfırdan eğitmek yerine, google/siglip-base-patch16-256-multilingual modelinin görsel kodlayıcısını ve ytu-ce-cosmos/turkish-gpt2-large modelinin dil kodlayıcısını kullanıyorum.
Özellik Uyarlama
- LLaVA eğitim tarifesi izlenerek, sadece görsel projektörü 500K görüntü-metin çiftleri ile eğiterek görsel ve metin özelliklerini uyumlu hale getiriyorum.
Görev Spesifik Eğitim
- Bu adımda, uyumlulaştırılmış model, kısa açıklama, detaylı açıklama ve basit görsel soru cevaplama gibi görevler için daha fazla eğitilmiştir; 1M'den fazla resim-istek-tamamlanma üçlüsünden oluşan veri seti kullanılmıştır.
İndirgeme Görevlerinde İnce Ayar
- Son olarak, modelin çeşitli görevlerdeki çok yönlülüğünü göstermek amacıyla nesne tespiti için ince ayarı yapılmıştır. Nesne tespiti için ince ayar yapılmış modele detaylar için ucsahin/TraVisionLM-Object-Detection-ft adresinden ulaşabilirsiniz.

Model Description

Developed by: ucsahin
Model type: Image-Text-to-Text
Language(s) (NLP): Turkish
License: Apache License 2.0

Model Sources [optional]

Repository: https://huggingface.co/ucsahin/TraVisionLM-base
Paper [optional]: More info on this later.
Demo [optional]: [More Information Needed]

Friendly Reminder:

First of all, thanks for your interest if you plan to use this model. I developed this model to primarily show that you can build

Kullanıcılar için Önemli Bir Hatırlatma:

Uses

Direct Use

The model can be used directly for the following tasks:

- Short captioning
- Detailed captioning
- Visual question answering

Downstream Use [optional]

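- (Video-Text-to-Text) The model can be adapted to question answering over videos. Without any change to the architecture, frames can be sampled from the video and the model can be asked to answer on each frame.
- (Retrieval) The model can be used as is, without modification, for text-based image retrieval.
- (Fine-tuning) For any other task the architecture supports, such as image classification, the model can be fine-tuned with the Transformers library. See ucsahin/TraVisionLM-Object-Detection-ft for an example.

I plan to share examples of these downstream uses as time permits. In the meantime, I look forward to support and collaboration requests from the community 🤝💪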
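One downstream use noted in this card is video question answering: frames are sampled from a video and the unchanged model is queried on each frame. The sketch below shows one possible sampling scheme, evenly spaced frame indices. The function name and the choice of uniform spacing are assumptions; the card does not prescribe a sampling strategy, and decoding the frames themselves is left to whatever video library is in use.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Pick `num_samples` evenly spaced frame indices from a video.

    Each index is the midpoint of one of `num_samples` equal segments,
    so the picks are spread over the whole clip.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


print(sample_frame_indices(300, 4))  # [37, 112, 187, 262]
```

Each sampled frame would then be passed to the model with the same question, and the per-frame answers combined by the caller.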

Out-of-Scope Use

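This model is not suitable for the following scenarios:

- Although the model answers simple questions about an image, it is not suitable for complex multi-turn chat. No history is kept, and the model does not use earlier questions as context. For this task, you can prepare a chat template and train the model accordingly.
- The model does not accept multiple image inputs. For example, it cannot answer questions that compare two different images. Adding this capability requires changes to the architecture. For a model of this kind, see HuggingFaceM4/idefics2-8b (English only).
- The model is not trained for optical character recognition (OCR), segmentation, or multi-object detection. Vision-language models such as google/paligemma-3b-pt-224 and microsoft/Florence-2-large reach acceptable performance on these tasks only after training on billions of documents and images.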

Türkçe: Kullanım Alanları

Aşağıda TraVisionLM görsel dil modelinin, hangi görevler için doğrudan ve dolaylı kullanılabileceği durumlar verilmiştir. Ayrıca alan dışı kullanımlar kısmına da göz atmayı unutmayın.

Doğrudan Kullanım Alanları

Kısa Açıklama
Detaylı Açıklama
Görsel Soru Cevaplama

Dolaylı Kullanım Alanları

(Video-Text-to-Text) Model videolarınızla ilgili soru cevap görevi için adapte edilebilir. Mimariye hiçbir değişiklik yapmadan, video kareleri örneklenerek, her bir kare üzerinden modele cevap ürettirilebilir.
(Retrieval) Metne dayalı en uygun görüntü alma görevi için model, herhangi bir değişiklik yapılmadan doğrudan kullanılabilir.
(Finetuning) Model mimarisini destekleyen görsel sınıflandırma gibi geri kalan bütün görevler için model Transformers kütüphanesiyle uyumlu bir şekilde eğitilebilir. Bir örnek için ucsahin/TraVisionLM-Object-Detection-ft adresine bakabilirsiniz.

Zaman buldukça bu dolaylı kullanım uygulamaları ile paylaşımlar yapmayı planlıyorum. Bu sürede topluluktan da destek ya da işbirliği isteklerini dört gözle bekliyorum 🤝💪

Alan-dışı Kullanımlar

Bu modelin aşağıdaki senaryolar için kullanımı uygun değildir:

Model, resimlerinizle ilgili basit sorulara cevap verse de, çok turlu kompleks chat senaryoları için uygun değildir. Geçmiş bilgisi tutulmamaktadır, model daha önce sorduğunuz soruları kontekst olarak kullanmamaktadır. Fakat bu görev için, bir chat şablonu hazırlayıp bu doğrultuda modeli kolayca eğitebilirsiniz.
Model çoklu görsel girdi kabul etmemektedir. Örneğin, iki farklı resmi karşılaştıran sorulara cevap vermeye uygun değildir. Bu özelliği kazandırmak için mimariye değişiklikler yapmak gerekmektedir. Bu tarz bir model için HuggingFaceM4/idefics2-8b (sadece ingilizce) modeline bakabilirsiniz.
Model, karakter ve yazı tanıma (OCR), segmentasyon ve çoklu obje tanıma görevleri için eğitilmemiştir. Bu görevlerde kabul edilebilir başarılar alabilmek için google/paligemma-3b-pt-224 ve microsoft/Florence-2-large gibi görsel dil modelleri milyarlarca doküman ve resimle eğitilmiştir.

Bias, Risks, and Limitations

[More Information Needed]

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]
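Until an official snippet is added here, the following is a minimal sketch of what loading and querying the model might look like. It assumes that the repository exposes its custom classes through `AutoModelForCausalLM` and `AutoProcessor` with `trust_remote_code=True`, and that the processor accepts `text` and `images` keyword arguments as most Transformers processors do. None of this is confirmed by the card, and the helper name `describe_image` is hypothetical.

```python
MODEL_ID = "ucsahin/TraVisionLM-base"


def describe_image(image_path: str, prompt: str, max_new_tokens: int = 128) -> str:
    """Generate a response for one image and one Turkish prompt."""
    # Imported lazily so the heavy dependencies load only when the function is called.
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    # Assumption: the repository's custom code registers with these Auto* classes.
    processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True)

    image = Image.open(image_path).convert("RGB")
    # Assumption: the processor follows the usual text/images call convention.
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0]
```

A call might look like `describe_image("photo.jpg", "Açıkla")`, where both the file name and the prompt text are placeholders; the prompts the model was actually trained with are not listed in this card.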

Training Details

Training Data

[More Information Needed]

Training Procedure

Preprocessing [optional]

[More Information Needed]

Training Hyperparameters

Training regime: [More Information Needed]

Speeds, Sizes, Times [optional]

[More Information Needed]

Evaluation

Testing Data, Factors & Metrics

Testing Data

[More Information Needed]

Metrics

[More Information Needed]

Results

More information will come

Model Architecture and Objective

[More Information Needed]

Compute Infrastructure

[More Information Needed]

Citation

BibTeX:

[More Information Needed]

APA:

[More Information Needed]

Model Card Contact

If you have questions or suggestions about the model, I would prefer that you reach me directly on Hugging Face (e.g., by opening an issue). For other matters, or if you have ideas for collaboration on future projects, reach me at sahin.umitcan@gmail.com

Modelle ilgili sorularınız veya önerileriniz varsa, doğrudan bana Hugging Face üzerinden (örneğin, bir issue açarak) ulaşmanızı tercih ederim. Diğer konular veya gelecekteki projelerde işbirliği için herhangi bir fikriniz varsa, bana sahin.umitcan@gmail.com adresinden ulaşabilirsiniz.