


Idefics2-8B-SFT is an SFT fine-tune of HuggingFaceM4/idefics2-8b on the 35k TextVQA dataset. Training was performed on an RTX A5000 for 10 hours. Wandb report:


This fine-tuned model achieves a Levenshtein score of 82.29%.
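The card does not say how the Levenshtein score is computed. As a rough sketch, assuming it is a normalized edit-distance similarity averaged over prediction/reference pairs, it could be reproduced as below. The function names and the normalization by the longer string are assumptions for illustration, not the author's evaluation code.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the inner row short
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(prediction: str, reference: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein_distance(prediction, reference) / longest


def mean_similarity(predictions, references) -> float:
    """Average similarity over paired predictions and references."""
    scores = [levenshtein_similarity(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores)
```

Under this assumed definition, a score of 82.29% would mean that predictions are on average about 82% similar, character by character, to the reference answers.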

Model Summary

💻 Usage

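import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"

# Placeholder image paths (not part of the original card); replace with your own files or URLs
image1 = load_image("path/to/first_image.jpg")
image2 = load_image("path/to/second_image.jpg")

processor = AutoProcessor.from_pretrained("Syed-Hasan-8503/Idefics2-8B-SFT")
model = AutoModelForVision2Seq.from_pretrained("Syed-Hasan-8503/Idefics2-8B-SFT").to(DEVICE)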

# Create inputs
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What do we see in this image?"},
        ],
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "In this image, we can see the city of New York, and more specifically the Statue of Liberty."},
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "And how about this image?"},
        ],
    },
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

# ['User: What do we see in this image? \nAssistant: In this image, we can see the city of New York, and more specifically the Statue of Liberty. \nUser: And how about this image? \nAssistant: In this image we can see buildings, trees, lights, water and sky.']
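The decoded text contains the whole conversation, so downstream code usually needs only the final assistant turn. A minimal sketch, assuming the decoded string uses the `Assistant:` marker exactly as in the example output above (the helper name `last_assistant_reply` is hypothetical):

```python
def last_assistant_reply(decoded: str, marker: str = "Assistant:") -> str:
    """Return the text after the last occurrence of marker, stripped of whitespace."""
    _, found, reply = decoded.rpartition(marker)
    # If the marker is absent, fall back to the full decoded string.
    return reply.strip() if found else decoded.strip()


# Example string copied from the sample output shown in the card.
example = (
    "User: What do we see in this image? \n"
    "Assistant: In this image, we can see the city of New York, "
    "and more specifically the Statue of Liberty. \n"
    "User: And how about this image? \n"
    "Assistant: In this image we can see buildings, trees, lights, water and sky."
)

print(last_assistant_reply(example))
```

With the generation code above, this would be applied to each entry of `generated_texts`.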

πŸ† Evaluation

Coming Soon!

Model size: 8.4B params
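As a back-of-the-envelope check on hardware requirements, the weight memory for 8.4B parameters can be estimated from the bytes stored per parameter. This is only a sketch: it counts weights alone and ignores activations, the KV cache, and framework overhead, and the precision table below is an illustrative assumption rather than a statement about how this checkpoint is stored.

```python
NUM_PARAMS = 8.4e9  # parameter count reported on the card

# Bytes per parameter for common storage precisions (illustrative assumption).
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16/bfloat16": 2.0,
    "int8": 1.0,
    "4-bit": 0.5,
}


def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight memory in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


for precision, size in BYTES_PER_PARAM.items():
    print(f"{precision}: ~{weight_memory_gb(NUM_PARAMS, size):.1f} GB")
```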
