---
license: cc-by-nc-4.0
language:
- en
- bn
- cs
- da
- de
- el
- ar
- es
- fa
- fi
- fr
- he
- hi
- hr
- hu
- id
- it
- ja
- ko
- mi
- nl
- 'no'
- pl
- pt
- qu
- ro
- ru
- sw
- sv
- te
- th
- tr
- uk
- vi
- zh
- ta
- bg
- ca
- et
- ur
- eu
- my
- ht
datasets:
- oscar-corpus/mOSCAR
---

# Multilingual OpenFlamingo

Multilingual OpenFlamingo is a multilingual version of [OpenFlamingo](https://arxiv.org/abs/2308.01390) trained on [mOSCAR](https://arxiv.org/abs/2406.08707) and a translated version of [LAION-400M](https://arxiv.org/abs/2111.02114). The model was trained on 43 languages and is based on `google/gemma-2b`.
Multilingual OpenFlamingo models process arbitrarily interleaved sequences of images and text and output text in multiple languages. The model responds in the language used in the prompt; no special token is required to specify the language.
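
Since the output language follows the prompt, switching languages should only require writing the few-shot demonstrations in the target language. The sketch below assembles such a prompt with the `<image>` and `<|endofchunk|>` markers used in the generation example of this card. The helper `build_fewshot_prompt` and the French captions are hypothetical and serve only as an illustration; they are not part of the `open_flamingo` package.

``` python
# Special tokens as they appear in the generation example of this card.
IMAGE_TOKEN = "<image>"
END_OF_CHUNK = "<|endofchunk|>"


def build_fewshot_prompt(demo_captions, query_prefix):
    """Interleave one image placeholder per caption, then an open-ended query.

    Hypothetical helper, not part of open_flamingo.
    """
    demos = "".join(f"{IMAGE_TOKEN}{caption}{END_OF_CHUNK}" for caption in demo_captions)
    return f"{demos}{IMAGE_TOKEN}{query_prefix}"


# Illustrative French demonstrations (assumed captions, not model output).
prompt_fr = build_fewshot_prompt(
    ["Une image de deux chats.", "Une image d'un lavabo."],
    "Une image de",
)
print(prompt_fr)
```

The resulting string can be passed to the tokenizer in place of the English prompt, with one image in `vision_x` per `<image>` marker.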

Multilingual OpenFlamingo is available for research purposes only. No safety alignment training was conducted, so the model may produce harmful content when prompted to do so.

## Installation
```
git clone https://github.com/MatthieuFP/open_flamingo
cd open_flamingo
pip install --editable ./
pip install numpy==1.26
```

### Initialization

``` python
from open_flamingo import create_model_and_transforms

# Build the architecture: CLIP ViT-L/14 vision encoder and Gemma-2B language model
model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="google/gemma-2b",
    tokenizer_path="google/gemma-2b",
    cross_attn_every_n_layers=1,
)

# grab model checkpoint from huggingface hub
from huggingface_hub import hf_hub_download
import torch

checkpoint_path = hf_hub_download("matthieufp/multilingual_open_flamingo", "checkpoint.pt")
# map_location="cpu" lets the checkpoint load on machines without a GPU
_ = model.load_state_dict(torch.load(checkpoint_path, map_location="cpu"), strict=False)
```
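
Because the checkpoint is loaded with `strict=False`, parameters absent from `checkpoint.pt` are skipped silently. `load_state_dict` returns an object with `missing_keys` and `unexpected_keys` lists, and inspecting them is a cheap sanity check. The sketch below groups missing parameter names by their top-level module. The helper `summarize_missing_keys` and the example key names are hypothetical; the assumption that only frozen encoder weights should be missing is ours and is not stated by the authors.

``` python
from collections import Counter


def summarize_missing_keys(missing_keys, depth=1):
    """Count missing parameter names by their first `depth` dotted components.

    Hypothetical helper, not part of open_flamingo or PyTorch.
    """
    counts = Counter(".".join(key.split(".")[:depth]) for key in missing_keys)
    return dict(counts)


# Made-up key names, for illustration only.
example_missing = [
    "vision_encoder.conv1.weight",
    "vision_encoder.ln_pre.weight",
    "lang_encoder.model.norm.weight",
]
summary = summarize_missing_keys(example_missing)
print(summary)

# With the model from the block above (not executed here):
# result = model.load_state_dict(torch.load(checkpoint_path), strict=False)
# print(summarize_missing_keys(result.missing_keys))
# print(result.unexpected_keys)
```

A non-empty `unexpected_keys` list, or missing keys under a module that was supposed to be trained, would suggest that the arguments passed to `create_model_and_transforms` do not match the checkpoint.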
### Generation example
From [OpenFlamingo](https://huggingface.co/openflamingo/OpenFlamingo-9B-vitl-mpt7b):

Below is an example of generating text conditioned on interleaved images/text. In particular, let's try few-shot image captioning.

``` python
from PIL import Image
import requests

"""
Step 1: Load images
"""
demo_image_one = Image.open(
    requests.get(
        "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True
    ).raw
)

demo_image_two = Image.open(
    requests.get(
        "http://images.cocodataset.org/test-stuff2017/000000028137.jpg",
        stream=True
    ).raw
)

query_image = Image.open(
    requests.get(
        "http://images.cocodataset.org/test-stuff2017/000000028352.jpg",
        stream=True
    ).raw
)


"""
Step 2: Preprocessing images
Details: For OpenFlamingo, we expect the image to be a torch tensor of shape
 batch_size x num_media x num_frames x channels x height x width.
 In this case batch_size = 1, num_media = 3, num_frames = 1,
 channels = 3, height = 224, width = 224.
"""
vision_x = [
    image_processor(demo_image_one).unsqueeze(0),
    image_processor(demo_image_two).unsqueeze(0),
    image_processor(query_image).unsqueeze(0),
]
vision_x = torch.cat(vision_x, dim=0)
vision_x = vision_x.unsqueeze(1).unsqueeze(0)

"""
Step 3: Preprocessing text
Details: In the text we expect an <image> special token to indicate where an image is.
 We also expect an <|endofchunk|> special token to indicate the end of the text
 portion associated with an image.
"""
tokenizer.padding_side = "left" # For generation padding tokens should be on the left
lang_x = tokenizer(
    ["<image>An image of two cats.<|endofchunk|><image>An image of a bathroom sink.<|endofchunk|><image>An image of"],
    return_tensors="pt",
)


"""
Step 4: Generate text
"""
generated_text = model.generate(
    vision_x=vision_x,
    lang_x=lang_x["input_ids"],
    attention_mask=lang_x["attention_mask"],
    max_new_tokens=20,
    num_beams=3,
)

print("Generated text: ", tokenizer.decode(generated_text[0]))
```
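
The decoded string typically contains the few-shot prompt followed by the continuation, and the model may keep generating past the first `<|endofchunk|>`. A minimal post-processing sketch, assuming the decoded text still contains the `<image>` and `<|endofchunk|>` markers, keeps only the caption produced for the query image. The helper `extract_completion` and the sample decoded string are hypothetical and are not actual model output.

``` python
def extract_completion(decoded, num_images,
                       image_token="<image>", end_token="<|endofchunk|>"):
    """Return the text attached to the last prompt image, cut at the end-of-chunk marker.

    Hypothetical helper. `num_images` is the number of <image> markers in the prompt.
    """
    segments = decoded.split(image_token)
    # segments[0] is whatever precedes the first image; segments[k] follows the k-th image.
    if len(segments) <= num_images:
        raise ValueError("decoded text contains fewer image markers than expected")
    query_segment = segments[num_images]
    return query_segment.split(end_token, 1)[0].strip()


# Made-up decoded string for illustration; real output will differ.
sample_decoded = (
    "<image>An image of two cats.<|endofchunk|>"
    "<image>An image of a bathroom sink.<|endofchunk|>"
    "<image>An image of a buffet table.<|endofchunk|><image>An image of"
)
caption = extract_completion(sample_decoded, num_images=3)
print(caption)
```

With the generation example above, this would be called as `extract_completion(tokenizer.decode(generated_text[0]), num_images=3)`.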

## Citations
If you use this model, please consider citing the following works:

```
@article{futeral2024moscar,
  title={mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus},
  author={Futeral, Matthieu and Zebaze, Armel and Suarez, Pedro Ortiz and Abadji, Julien and Lacroix, R{\'e}mi and Schmid, Cordelia and Bawden, Rachel and Sagot, Beno{\^\i}t},
  journal={arXiv preprint arXiv:2406.08707},
  year={2024}
}
```

```
@article{awadalla2023openflamingo,
  title={OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models},
  author={Anas Awadalla and Irena Gao and Josh Gardner and Jack Hessel and Yusuf Hanafy and Wanrong Zhu and Kalyani Marathe and Yonatan Bitton and Samir Gadre and Shiori Sagawa and Jenia Jitsev and Simon Kornblith and Pang Wei Koh and Gabriel Ilharco and Mitchell Wortsman and Ludwig Schmidt},
  journal={arXiv preprint arXiv:2308.01390},
  year={2023}
}
```

```
@software{anas_awadalla_2023_7733589,
  author = {Awadalla, Anas and Gao, Irena and Gardner, Joshua and Hessel, Jack and Hanafy, Yusuf and Zhu, Wanrong and Marathe, Kalyani and Bitton, Yonatan and Gadre, Samir and Jitsev, Jenia and Kornblith, Simon and Koh, Pang Wei and Ilharco, Gabriel and Wortsman, Mitchell and Schmidt, Ludwig},
  title = {OpenFlamingo},
  month = mar,
  year = 2023,
  publisher = {Zenodo},
  version = {v0.1.1},
  doi = {10.5281/zenodo.7733589},
  url = {https://doi.org/10.5281/zenodo.7733589}
}
```