---
license: cc-by-nc-4.0
language:
- en
- bn
- cs
- da
- de
- el
- ar
- es
- fa
- fi
- fr
- he
- hi
- hr
- hu
- id
- it
- ja
- ko
- mi
- nl
- 'no'
- pl
- pt
- qu
- ro
- ru
- sw
- sv
- te
- th
- tr
- uk
- vi
- zh
- ta
- bg
- ca
- et
- ur
- eu
- my
- ht
datasets:
- oscar-corpus/mOSCAR
---

# Multilingual OpenFlamingo

Multilingual OpenFlamingo is a multilingual version of [OpenFlamingo](https://arxiv.org/abs/2308.01390) trained on [mOSCAR](https://arxiv.org/abs/2406.08707) and a translated version of [LAION-400M](https://arxiv.org/abs/2111.02114). It was trained on 43 languages and uses `google/gemma-2b` as its language model.
Like OpenFlamingo, it processes arbitrarily interleaved sequences of images and text and generates text, here in multiple languages. The model answers in the language of the prompt; no special token is needed to specify the output language.
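Because the output language is set only by the prompt text, switching languages amounts to writing the few-shot prompt in the target language. The sketch below assembles such a prompt from the `<image>` and `<|endofchunk|>` special tokens used in the generation example later in this card. `build_fewshot_prompt` is a hypothetical helper, not part of `open_flamingo`, and the French captions are illustrative; with them, the model should continue in French.

```python
from typing import Sequence

IMAGE_TOKEN = "<image>"
END_OF_CHUNK = "<|endofchunk|>"


def build_fewshot_prompt(captions: Sequence[str], query_prefix: str) -> str:
    """Hypothetical helper: one <image> per example caption, each caption closed
    by <|endofchunk|>, then a final <image> followed by the unfinished query.

    The prompt contains len(captions) + 1 <image> tokens, which must match
    num_media in the vision_x tensor.
    """
    shots = "".join(f"{IMAGE_TOKEN}{caption}{END_OF_CHUNK}" for caption in captions)
    return f"{shots}{IMAGE_TOKEN}{query_prefix}"


# French few-shot prompt: no language token, only French text
prompt_fr = build_fewshot_prompt(
    ["Une image de deux chats.", "Une image d'un lavabo de salle de bain."],
    "Une image de",
)
print(prompt_fr)
```

Called with the captions `"An image of two cats."` and `"An image of a bathroom sink."` and the prefix `"An image of"`, the helper returns exactly the English prompt used in the generation example.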

Multilingual OpenFlamingo is intended for research purposes only. It has not undergone any safety alignment training, so it may generate harmful content if prompted to do so.

## Installation
```
# Install the OpenFlamingo fork used by this model, in editable mode
git clone https://github.com/MatthieuFP/open_flamingo
cd open_flamingo
pip install --editable ./
# Pin NumPy to 1.26
pip install numpy==1.26
```
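The install steps pin NumPy to 1.26. As a minimal, standard-library-only sanity check (a sketch; `installed_version` and `matches_pin` are hypothetical helper names), you can confirm that the pin took effect and that `open_flamingo` is importable before loading the model.

```python
import importlib.util
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def matches_pin(installed: Optional[str], pin: str = "1.26") -> bool:
    """True if the leading release components of `installed` equal those of `pin`."""
    if installed is None:
        return False
    pin_parts = pin.split(".")
    return installed.split(".")[: len(pin_parts)] == pin_parts


numpy_version = installed_version("numpy")
print("numpy:", numpy_version, "| matches the 1.26 pin:", matches_pin(numpy_version))
# find_spec returns None (rather than raising) when a top-level package is missing
print("open_flamingo importable:", importlib.util.find_spec("open_flamingo") is not None)
```

Comparing release components rather than string prefixes avoids false positives such as `1.2` matching `1.26`.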

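### Initialization

``` python
import torch
from huggingface_hub import hf_hub_download
from open_flamingo import create_model_and_transforms

# Build the architecture: CLIP ViT-L-14 vision encoder + Gemma-2B language model,
# with a gated cross-attention block in every decoder layer (cross_attn_every_n_layers=1).
# Note: google/gemma-2b is a gated model on the Hugging Face Hub; accept its license
# and log in (e.g. `huggingface-cli login`) before running this.
model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="google/gemma-2b",
    tokenizer_path="google/gemma-2b",
    cross_attn_every_n_layers=1,
)

# Download the Multilingual OpenFlamingo checkpoint from the Hugging Face Hub
checkpoint_path = hf_hub_download("matthieufp/multilingual_open_flamingo", "checkpoint.pt")

# strict=False: parameters absent from the checkpoint keep their current weights
_ = model.load_state_dict(torch.load(checkpoint_path, map_location="cpu"), strict=False)
model.eval()  # inference mode: disables dropout
```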
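With `strict=False`, PyTorch does not raise on mismatched parameter names; instead, `load_state_dict` returns a named tuple with `missing_keys` and `unexpected_keys`, which the snippet above discards with `_ =`. A minimal sketch for inspecting it instead: the hypothetical pure-Python helper `summarize_keys` counts key names by their leading module prefix, which makes it easier to see which submodules the checkpoint did not cover. The example key names are illustrative only, not taken from this checkpoint.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize_keys(keys: Iterable[str], depth: int = 1) -> Dict[str, int]:
    """Count state-dict key names by their first `depth` dotted components."""
    return dict(Counter(".".join(key.split(".")[:depth]) for key in keys))


# Usage with the objects from the initialization snippet (not executed here):
# result = model.load_state_dict(torch.load(checkpoint_path, map_location="cpu"), strict=False)
# print("missing:", summarize_keys(result.missing_keys))
# print("unexpected:", summarize_keys(result.unexpected_keys))

# Illustrative (hypothetical) key names
example_keys = [
    "perceiver.layers.0.norm.weight",
    "perceiver.latents",
    "lang_encoder.model.layers.0.mlp.up_proj.weight",
]
print(summarize_keys(example_keys))  # {'perceiver': 2, 'lang_encoder': 1}
```

Increasing `depth` (e.g. `depth=2`) gives a finer breakdown when one top-level module dominates the counts.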
### Generation example
The following example is adapted from the [OpenFlamingo](https://huggingface.co/openflamingo/OpenFlamingo-9B-vitl-mpt7b) model card. It generates text conditioned on interleaved images and text, here for few-shot image captioning.

``` python
import requests
import torch
from PIL import Image

"""
Step 1: Load images
"""
demo_image_one = Image.open(
    requests.get(
        "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True
    ).raw
)

demo_image_two = Image.open(
    requests.get(
        "http://images.cocodataset.org/test-stuff2017/000000028137.jpg",
        stream=True
    ).raw
)

query_image = Image.open(
    requests.get(
        "http://images.cocodataset.org/test-stuff2017/000000028352.jpg",
        stream=True
    ).raw
)

"""
Step 2: Preprocess images
Details: OpenFlamingo expects the images as a torch tensor of shape
batch_size x num_media x num_frames x channels x height x width.
In this case batch_size = 1, num_media = 3, num_frames = 1,
channels = 3, height = 224, width = 224.
"""
vision_x = [
    image_processor(demo_image_one).unsqueeze(0),
    image_processor(demo_image_two).unsqueeze(0),
    image_processor(query_image).unsqueeze(0),
]
vision_x = torch.cat(vision_x, dim=0)  # (num_media, channels, height, width)
vision_x = vision_x.unsqueeze(1).unsqueeze(0)  # (1, num_media, 1, channels, height, width)

"""
Step 3: Preprocess text
Details: the text must contain an <image> special token wherever an image appears,
and an <|endofchunk|> special token to mark the end of the text portion
associated with an image.
"""
tokenizer.padding_side = "left"  # for generation, padding tokens must be on the left
lang_x = tokenizer(
    ["<image>An image of two cats.<|endofchunk|><image>An image of a bathroom sink.<|endofchunk|><image>An image of"],
    return_tensors="pt",
)

"""
Step 4: Generate text
"""
generated_text = model.generate(
    vision_x=vision_x,
    lang_x=lang_x["input_ids"],
    attention_mask=lang_x["attention_mask"],
    max_new_tokens=20,
    num_beams=3,
)

print("Generated text: ", tokenizer.decode(generated_text[0]))
```
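The example sets `tokenizer.padding_side = "left"`. With a single prompt no padding is added, but once prompts of different lengths are batched, left padding keeps a real token in the last position of every row, so generation continues directly from each prompt rather than from padding. The pure-Python sketch below shows what left padding produces, together with the matching attention mask. `left_pad` is a hypothetical helper and the token ids are made up; in practice the Hugging Face tokenizer does this when called on a list of prompts with `padding=True`.

```python
from typing import List, Sequence, Tuple


def left_pad(
    sequences: Sequence[Sequence[int]], pad_id: int
) -> Tuple[List[List[int]], List[List[int]]]:
    """Left-pad token id sequences to a common length.

    Returns (input_ids, attention_mask); the mask is 0 on padding and 1 on real tokens.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


ids, mask = left_pad([[5, 6, 7], [8]], pad_id=0)
print(ids)   # [[5, 6, 7], [0, 0, 8]]
print(mask)  # [[1, 1, 1], [0, 0, 1]]
```

Every row ends with a mask value of 1, which is the property generation relies on; with right padding, the shorter row would end in padding tokens and new tokens would be appended after them.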

## Citations
If you use this model, please consider citing the following works:

```
@article{futeral2024moscar,
  title={mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus},
  author={Futeral, Matthieu and Zebaze, Armel and Suarez, Pedro Ortiz and Abadji, Julien and Lacroix, R{\'e}mi and Schmid, Cordelia and Bawden, Rachel and Sagot, Beno{\^\i}t},
  journal={arXiv preprint arXiv:2406.08707},
  year={2024}
}
```

```
@article{awadalla2023openflamingo,
  title={OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models},
  author={Anas Awadalla and Irena Gao and Josh Gardner and Jack Hessel and Yusuf Hanafy and Wanrong Zhu and Kalyani Marathe and Yonatan Bitton and Samir Gadre and Shiori Sagawa and Jenia Jitsev and Simon Kornblith and Pang Wei Koh and Gabriel Ilharco and Mitchell Wortsman and Ludwig Schmidt},
  journal={arXiv preprint arXiv:2308.01390},
  year={2023}
}
```

```
@software{anas_awadalla_2023_7733589,
  author = {Awadalla, Anas and Gao, Irena and Gardner, Joshua and Hessel, Jack and Hanafy, Yusuf and Zhu, Wanrong and Marathe, Kalyani and Bitton, Yonatan and Gadre, Samir and Jitsev, Jenia and Kornblith, Simon and Koh, Pang Wei and Ilharco, Gabriel and Wortsman, Mitchell and Schmidt, Ludwig},
  title = {OpenFlamingo},
  month = mar,
  year = 2023,
  publisher = {Zenodo},
  version = {v0.1.1},
  doi = {10.5281/zenodo.7733589},
  url = {https://doi.org/10.5281/zenodo.7733589}
}
```