---
license: gemma
library_name: keras-hub
extra_gated_heading: Access PaliGemma on Hugging Face
extra_gated_prompt: >-
To access PaliGemma on Hugging Face, you’re required to review and agree to
Google’s usage license. To do this, please ensure you’re logged-in to Hugging
Face and click below. Requests are processed immediately.
extra_gated_button_content: Acknowledge license
pipeline_tag: image-text-to-text
---
PaliGemma is a set of multimodal large language models published by Google, based on the Gemma model. Both pre-trained and instruction-tuned variants are available. See the model card below for benchmarks, data sources, and intended use cases.
## Links
* [PaliGemma API Documentation](https://keras.io/api/keras_hub/models/pali_gemma/)
* [KerasHub Beginner Guide](https://keras.io/guides/keras_hub/getting_started/)
* [KerasHub Model Publishing Guide](https://keras.io/guides/keras_hub/upload/)
## Installation
Keras and KerasHub can be installed with:
```
pip install -U -q keras-hub
pip install -U -q "keras>=3"
```
JAX, TensorFlow, and PyTorch come preinstalled in Kaggle Notebooks. For instructions on installing them in another environment, see the [Keras Getting Started](https://keras.io/getting_started/) page.
## Presets
The following model checkpoints are provided by the Keras team. Full code examples for each are available below.
| Preset name | Parameters | Description |
|-----------------------|------------|-------------------------------------------------------------|
| [paligemma-3b-224-mix-keras](https://huggingface.co/google/paligemma-3b-224-mix-keras) | 2.92B | image size 224, mix fine tuned, text sequence length is 256 |
| [paligemma-3b-448-mix-keras](https://huggingface.co/google/paligemma-3b-448-mix-keras) | 2.92B | image size 448, mix fine tuned, text sequence length is 512 |
| [paligemma-3b-224-keras](https://huggingface.co/google/paligemma-3b-224-keras) | 2.92B | image size 224, pre trained, text sequence length is 128 |
| [**paligemma-3b-448-keras**](https://huggingface.co/google/paligemma-3b-448-keras) | 2.92B | image size 448, pre trained, text sequence length is 512 |
| [paligemma-3b-896-keras](https://huggingface.co/google/paligemma-3b-896-keras) | 2.93B | image size 896, pre trained, text sequence length is 512 |
## Prompts
The PaliGemma `"mix"` models can handle a number of prompting structures out of the box. It is important to stick exactly to these prompts, including the trailing newline. `{lang}` can be a language code such as `"en"` or `"fr"`; support for languages other than English varies by prompt type.
* `"cap {lang}\n"`: very raw short caption (from WebLI-alt).
* `"caption {lang}\n"`: COCO-like short captions.
* `"describe {lang}\n"`: somewhat longer, more descriptive captions.
* `"ocr\n"`: optical character recognition.
* `"answer en {question}\n"`: question answering about the image contents.
* `"question {lang} {answer}\n"`: question generation for a given answer.
* `"detect {thing} ; {thing}\n"`: locate the listed objects in the scene (returned as bounding-box location tokens).
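The templates above can be assembled programmatically. A minimal sketch (the `make_prompt` helper is hypothetical, not part of KerasHub; note the required trailing newline):

```python
# Hypothetical helper that fills in the prompt templates above.
# The trailing newline is required by the model.
def make_prompt(task, lang="en", text=""):
    templates = {
        "cap": "cap {lang}\n",
        "caption": "caption {lang}\n",
        "describe": "describe {lang}\n",
        "ocr": "ocr\n",
        "answer": "answer {lang} {text}\n",
        "question": "question {lang} {text}\n",
        "detect": "detect {text}\n",
    }
    return templates[task].format(lang=lang, text=text)

print(make_prompt("answer", "en", "where is the cow standing?"))
# -> answer en where is the cow standing?
```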
Non-`"mix"` presets should be fine-tuned for a specific task.
```
!pip install -U -q keras-hub
```
Pick a backend of your choice. The `KERAS_BACKEND` environment variable must be set before Keras is imported.
```
import os
os.environ["KERAS_BACKEND"] = "jax"  # or "tensorflow", "torch"
```
Now we can load the PaliGemma "causal language model" from Hugging Face. A causal language model is an LLM ready for generation: it is trained with a causal mask and generates one token at a time in a recurrent loop.
```
import keras
import keras_hub

keras.config.set_floatx("bfloat16")
pali_gemma_lm = keras_hub.models.PaliGemmaCausalLM.from_preset(
    "hf://google/paligemma-3b-448-keras"
)
```
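Setting the default dtype to `"bfloat16"` halves the weight memory relative to `"float32"`. A rough back-of-the-envelope check for the ~2.92B-parameter presets:

```python
# Approximate weight memory for a 2.92B-parameter model.
params = 2.92e9
bytes_fp32 = params * 4   # float32: 4 bytes per parameter
bytes_bf16 = params * 2   # bfloat16: 2 bytes per parameter

print(f"float32:  {bytes_fp32 / 1e9:.1f} GB")   # ~11.7 GB
print(f"bfloat16: {bytes_bf16 / 1e9:.1f} GB")   # ~5.8 GB
```

Activations and optimizer state add to this, so treat these numbers as a lower bound for required accelerator memory.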
A helper function that reads an image from a given URL:
```
import io

import numpy as np
import PIL.Image
import requests

def read_image(url):
    contents = io.BytesIO(requests.get(url).content)
    image = PIL.Image.open(contents)
    image = np.array(image)
    # Remove the alpha channel if necessary.
    if image.shape[2] == 4:
        image = image[:, :, :3]
    return image
```
```
image_url = 'https://storage.googleapis.com/keras-cv/models/paligemma/cow_beach_1.png'
image = read_image(image_url)
```
Call `generate()` with a single image and prompt. The text prompt must end with `\n`.
```
prompt = 'answer en where is the cow standing?\n'
output = pali_gemma_lm.generate(
    inputs={
        "images": image,
        "prompts": prompt,
    }
)
print(output)
```
Call `generate()` with batched images and prompts.
```
prompts = [
    'answer en where is the cow standing?\n',
    'answer en what color is the cow?\n',
    'describe en\n',
    'detect cow\n',
    'segment cow\n',
]
images = [image, image, image, image, image]
outputs = pali_gemma_lm.generate(
    inputs={
        "images": images,
        "prompts": prompts,
    }
)
for output in outputs:
    print(output)
```
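The `detect` prompt produces location tokens of the form `<locYYYY>`, four per bounding box, with coordinates quantized to a 1024-bin grid in `y_min x_min y_max x_max` order. A minimal parsing sketch written from that format (`parse_detection` is a hypothetical helper, not a KerasHub utility):

```python
import re

def parse_detection(output, width, height):
    """Parse '<locYYYY><locXXXX><locYYYY><locXXXX> label' spans into
    pixel-space boxes (x_min, y_min, x_max, y_max, label)."""
    boxes = []
    pattern = r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;<]+)"
    for y0, x0, y1, x1, label in re.findall(pattern, output):
        boxes.append((
            int(x0) / 1024 * width,
            int(y0) / 1024 * height,
            int(x1) / 1024 * width,
            int(y1) / 1024 * height,
            label.strip(),
        ))
    return boxes

print(parse_detection("<loc0256><loc0128><loc0768><loc0896> cow", 448, 448))
# -> [(56.0, 112.0, 392.0, 336.0, 'cow')]
```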
Fine-tune on a single batch by calling `fit()`:
```
import numpy as np

import keras_hub

# The 448 preset expects 448x448 images.
image = np.random.uniform(-1, 1, size=(448, 448, 3))
x = {
    "images": [image, image],
    "prompts": ["answer en Where is the cow standing?\n", "caption en\n"],
}
y = {
    "responses": ["beach", "A brown cow standing on a beach next to the ocean."],
}
pali_gemma_lm = keras_hub.models.PaliGemmaCausalLM.from_preset(
    "hf://google/paligemma-3b-448-keras"
)
pali_gemma_lm.fit(x=x, y=y, batch_size=2)
```
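For more than one batch, the same `x`/`y` dicts can be produced from in-memory lists in fixed-size chunks. A minimal stdlib sketch (the `iter_batches` helper is hypothetical, not a KerasHub API):

```python
def iter_batches(images, prompts, responses, batch_size):
    """Yield (x, y) dicts in the shape fit() expects, in fixed-size chunks."""
    for i in range(0, len(images), batch_size):
        x = {
            "images": images[i:i + batch_size],
            "prompts": prompts[i:i + batch_size],
        }
        y = {"responses": responses[i:i + batch_size]}
        yield x, y

# Usage: for x, y in iter_batches(imgs, prompts, resps, 2):
#            pali_gemma_lm.fit(x=x, y=y, batch_size=2)
batches = list(iter_batches(["i1", "i2", "i3"],
                            ["p1\n", "p2\n", "p3\n"],
                            ["r1", "r2", "r3"], batch_size=2))
print(len(batches))  # -> 2
```

For large datasets, a `tf.data.Dataset` or similar streaming pipeline that yields the same dict structure is a better fit than in-memory lists.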