metadata
license: cc-by-nc-4.0
language:
  - en
  - bn
  - cs
  - da
  - de
  - el
  - ar
  - es
  - fa
  - fi
  - fr
  - he
  - hi
  - hr
  - hu
  - id
  - it
  - ja
  - ko
  - mi
  - nl
  - 'no'
  - pl
  - pt
  - qu
  - ro
  - ru
  - sw
  - sv
  - te
  - th
  - tr
  - uk
  - vi
  - zh
  - ta
  - bg
  - ca
  - et
  - ur
  - eu
  - my
  - ht
datasets:
  - oscar-corpus/mOSCAR

Multilingual OpenFlamingo

Multilingual OpenFlamingo is a multilingual version of OpenFlamingo trained on mOSCAR and a translated version of LAION-400M. The model was trained on 43 languages and is based on google/gemma-2b. Multilingual OpenFlamingo models process arbitrarily interleaved sequences of images and text and output text in multiple languages. The model responds in the language of the prompt; no special language token is required.
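
Because the output language follows the prompt, switching languages only requires writing the few-shot demonstrations in the target language. The sketch below builds such a prompt string using the `<image>` and `<|endofchunk|>` tokens described in the generation example of this README. The helper name `build_few_shot_prompt` and the French captions are illustrative assumptions, not part of the OpenFlamingo API.

```python
IMAGE_TOKEN = "<image>"
END_OF_CHUNK = "<|endofchunk|>"


def build_few_shot_prompt(demo_captions, query_prefix):
    """Build an interleaved prompt (hypothetical helper, not part of open_flamingo).

    Each demonstration caption is preceded by one <image> token and closed with
    <|endofchunk|>. The final <image> token opens the query, and query_prefix is
    the text the model is expected to continue.
    """
    demos = "".join(
        f"{IMAGE_TOKEN}{caption}{END_OF_CHUNK}" for caption in demo_captions
    )
    return f"{demos}{IMAGE_TOKEN}{query_prefix}"


# Illustrative French demonstrations: the prompt language is what steers the output.
french_prompt = build_few_shot_prompt(
    ["Une image de deux chats.", "Une image d'un lavabo de salle de bain."],
    "Une image de",
)
print(french_prompt)
```

The resulting string can be passed to the tokenizer in place of the English prompt, with the images supplied in the same order as the `<image>` tokens.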

Multilingual OpenFlamingo is available for research purposes only. We did not conduct any safety alignment training, so the model may output harmful content if prompted to do so.

Installation

git clone https://github.com/MatthieuFP/open_flamingo
cd open_flamingo
pip install --editable ./
pip install numpy==1.26

Initialization

from open_flamingo import create_model_and_transforms

model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="google/gemma-2b",
    tokenizer_path="google/gemma-2b",
    cross_attn_every_n_layers=1,
)

# grab model checkpoint from huggingface hub
from huggingface_hub import hf_hub_download
import torch

checkpoint_path = hf_hub_download("matthieufp/multilingual_open_flamingo", "checkpoint.pt")
# map_location="cpu" lets the checkpoint load on machines without a GPU
_ = model.load_state_dict(torch.load(checkpoint_path, map_location="cpu"), strict=False)

Generation example

From OpenFlamingo:

Below is an example of generating text conditioned on interleaved images/text. In particular, let's try few-shot image captioning.

from PIL import Image
import requests

"""
Step 1: Load images
"""
demo_image_one = Image.open(
    requests.get(
        "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True
    ).raw
)

demo_image_two = Image.open(
    requests.get(
        "http://images.cocodataset.org/test-stuff2017/000000028137.jpg",
        stream=True
    ).raw
)

query_image = Image.open(
    requests.get(
        "http://images.cocodataset.org/test-stuff2017/000000028352.jpg", 
        stream=True
    ).raw
)


"""
Step 2: Preprocessing images
Details: For OpenFlamingo, we expect the image to be a torch tensor of shape 
 batch_size x num_media x num_frames x channels x height x width. 
 In this case batch_size = 1, num_media = 3, num_frames = 1,
 channels = 3, height = 224, width = 224.
"""
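vision_x = [
    image_processor(demo_image_one).unsqueeze(0),
    image_processor(demo_image_two).unsqueeze(0),
    image_processor(query_image).unsqueeze(0),
]
vision_x = torch.cat(vision_x, dim=0)  # (3, 3, 224, 224): num_media x channels x height x width
vision_x = vision_x.unsqueeze(1).unsqueeze(0)  # (1, 3, 1, 3, 224, 224): adds num_frames and batch_size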

"""
Step 3: Preprocessing text
Details: In the text we expect an <image> special token to indicate where an image is.
 We also expect an <|endofchunk|> special token to indicate the end of the text 
 portion associated with an image.
"""
tokenizer.padding_side = "left" # For generation padding tokens should be on the left
lang_x = tokenizer(
    ["<image>An image of two cats.<|endofchunk|><image>An image of a bathroom sink.<|endofchunk|><image>An image of"],
    return_tensors="pt",
)


"""
Step 4: Generate text
"""
generated_text = model.generate(
    vision_x=vision_x,
    lang_x=lang_x["input_ids"],
    attention_mask=lang_x["attention_mask"],
    max_new_tokens=20,
    num_beams=3,
)

print("Generated text: ", tokenizer.decode(generated_text[0]))
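
The decoded string above contains the whole prompt followed by the continuation, special tokens included. The following is a minimal sketch of one way to keep only the caption generated for the query image. The helper `extract_completion` is hypothetical, the sample decoded string is made up for illustration (it is not real model output), and the `<bos>`, `<eos>` and `<pad>` token strings are assumptions about the Gemma tokenizer.

```python
IMAGE_TOKEN = "<image>"
END_OF_CHUNK = "<|endofchunk|>"


def extract_completion(
    decoded, num_prompt_images, extra_special_tokens=("<bos>", "<eos>", "<pad>")
):
    """Return the text attached to the last prompt image (hypothetical helper).

    decoded: full decoded sequence (prompt followed by generated text).
    num_prompt_images: number of <image> tokens in the prompt, query included.
    extra_special_tokens: assumed tokenizer special tokens to strip.
    """
    pieces = decoded.split(IMAGE_TOKEN)
    if len(pieces) <= num_prompt_images:
        raise ValueError("decoded text has fewer <image> tokens than expected")
    # pieces[k] is the text that follows the k-th <image> token (1-indexed).
    query_chunk = pieces[num_prompt_images]
    # Keep only what comes before the first end-of-chunk marker, if any.
    query_chunk = query_chunk.split(END_OF_CHUNK, 1)[0]
    for token in extra_special_tokens:
        query_chunk = query_chunk.replace(token, "")
    return query_chunk.strip()


# Made-up decoded string, for illustration only.
decoded_example = (
    "<bos><image>An image of two cats.<|endofchunk|>"
    "<image>An image of a bathroom sink.<|endofchunk|>"
    "<image>An image of a buffet table.<|endofchunk|>"
)
print(extract_completion(decoded_example, num_prompt_images=3))
```

With the example above, `extract_completion(tokenizer.decode(generated_text[0]), num_prompt_images=3)` would return the query caption including its "An image of" prefix.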

Citations

If you use this model, please consider citing the following works:

@article{futeral2024moscar,
  title={mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus},
  author={Futeral, Matthieu and Zebaze, Armel and Suarez, Pedro Ortiz and Abadji, Julien and Lacroix, R{\'e}mi and Schmid, Cordelia and Bawden, Rachel and Sagot, Beno{\^\i}t},
  journal={arXiv preprint arXiv:2406.08707},
  year={2024}
}
@article{awadalla2023openflamingo,
  title={OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models},
  author={Anas Awadalla and Irena Gao and Josh Gardner and Jack Hessel and Yusuf Hanafy and Wanrong Zhu and Kalyani Marathe and Yonatan Bitton and Samir Gadre and Shiori Sagawa and Jenia Jitsev and Simon Kornblith and Pang Wei Koh and Gabriel Ilharco and Mitchell Wortsman and Ludwig Schmidt},
  journal={arXiv preprint arXiv:2308.01390},
  year={2023}
}
@software{anas_awadalla_2023_7733589,
  author = {Awadalla, Anas and Gao, Irena and Gardner, Joshua and Hessel, Jack and Hanafy, Yusuf and Zhu, Wanrong and Marathe, Kalyani and Bitton, Yonatan and Gadre, Samir and Jitsev, Jenia and Kornblith, Simon and Koh, Pang Wei and Ilharco, Gabriel and Wortsman, Mitchell and Schmidt, Ludwig},
  title = {OpenFlamingo},
  month        = mar,
  year         = 2023,
  publisher    = {Zenodo},
  version      = {v0.1.1},
  doi          = {10.5281/zenodo.7733589},
  url          = {https://doi.org/10.5281/zenodo.7733589}
}