	---
	license: apache-2.0
	datasets:
	- THUdyh/Oryx-SFT-Data
	base_model:
	- Qwen/Qwen2-7B-Instruct
	pipeline_tag: text-generation
	language:
	- en
	- zh
	---
	# Oryx-7B

	## Model Summary

	The Oryx models are 7B and 34B parameter models trained on [Oryx-SFT-Data](https://huggingface.co/datasets/THUdyh/Oryx-SFT-Data), built on the Qwen2 language model with a context window of 32K tokens.

	Oryx offers an on-demand solution to seamlessly and efficiently process visual inputs with arbitrary spatial sizes and temporal lengths.

	- Repository: https://github.com/Oryx-mllm/Oryx
	- Languages: English, Chinese
	- Paper: https://arxiv.org/abs/2409.12961

	## Use

	We provide a simple generation example for using our model. For more details, please refer to our [GitHub repo](https://github.com/liuzuyan/oryx).

	```
	from oryx.model.builder import load_pretrained_model
	from oryx.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
	from oryx.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
	from oryx.conversation import conv_templates, SeparatorStyle
	from PIL import Image
	import requests
	import copy
	import torch
	import sys
	import warnings
	from decord import VideoReader, cpu
	import numpy as np

	def load_video(video_path, max_frames_num, fps=1, force_sample=False):
	    """Decode a video and return (frames, frame timestamps, duration in seconds)."""
	    if max_frames_num == 0:
	        # Keep the three-value return shape that the caller unpacks.
	        return np.zeros((1, 336, 336, 3)), "", 0.0
	    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
	    total_frame_num = len(vr)
	    avg_fps = vr.get_avg_fps()
	    video_time = total_frame_num / avg_fps
	    # Step between sampled frames so that roughly `fps` frames per second are kept.
	    step = max(1, round(avg_fps / fps))
	    frame_idx = list(range(0, total_frame_num, step))
	    if len(frame_idx) > max_frames_num or force_sample:
	        # Sample exactly max_frames_num frames, evenly spread over the video.
	        uniform_sampled_frames = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int)
	        frame_idx = uniform_sampled_frames.tolist()
	    frame_time = [i / avg_fps for i in frame_idx]
	    frame_time = ",".join([f"{t:.2f}s" for t in frame_time])
	    spare_frames = vr.get_batch(frame_idx).asnumpy()
	    return spare_frames, frame_time, video_time

	pretrained = "THUdyh/Oryx-7B"
	model_name = "oryx_qwen"
	device = "cuda"
	device_map = "auto"
	tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)
	model.eval()
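	video_path = ""  # set this to the path of a local video file
	max_frames_num = 64  # must be an int: it is compared with frame counts and passed to np.linspace
	video, frame_time, video_time = load_video(video_path, max_frames_num, 1, force_sample=True)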
	video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().bfloat16()
	video = [video]
	video_data = (video, video)
	input_data = (video_data, (384, 384), "video")
	conv_template = "qwen_1_5"
	question = DEFAULT_IMAGE_TOKEN + "\nPlease describe this video in detail."
	conv = copy.deepcopy(conv_templates[conv_template])
	conv.append_message(conv.roles[0], question)
	conv.append_message(conv.roles[1], None)
	prompt_question = conv.get_prompt()
	input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
	output_ids = model.generate(
	    inputs=input_ids,
	    images=input_data[0][0],
	    images_highres=input_data[0][1],
	    modalities=input_data[2],  # "video"; video_data has only two elements, so video_data[2] would raise IndexError
	    do_sample=False,
	    temperature=0,
	    max_new_tokens=128,
	    use_cache=True,
	)

	text_outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
	print(text_outputs)
	```
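The frame-sampling step inside `load_video` can be tried on its own, without a video file or GPU. The sketch below is a pure-Python analogue of the `np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int)` call and of the timestamp string built from it. The helper names `uniform_frame_indices` and `frame_timestamps` are hypothetical and are not part of the Oryx codebase, and the integer arithmetic is an assumption chosen to approximate the truncating behaviour of `dtype=int`.

```python
def uniform_frame_indices(total_frames, num_samples):
    """Pick num_samples frame indices spread evenly over [0, total_frames - 1]."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples == 1:
        return [0]
    last = total_frames - 1
    # Floor division truncates toward zero here, like casting linspace values to int.
    return [(i * last) // (num_samples - 1) for i in range(num_samples)]


def frame_timestamps(indices, avg_fps):
    """Format each frame index as seconds from the start of the video."""
    return ",".join(f"{i / avg_fps:.2f}s" for i in indices)


# A 10-second clip at 30 fps (300 frames), sampled down to 4 frames.
indices = uniform_frame_indices(total_frames=300, num_samples=4)
print(indices)                                  # [0, 99, 199, 299]
print(frame_timestamps(indices, avg_fps=30.0))  # 0.00s,3.30s,6.63s,9.97s
```

With `force_sample=True`, as in the example above, the model always receives exactly `max_frames_num` frames regardless of video length, so longer videos are sampled more sparsely in time.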


	### Results

	#### General Video Benchmark

	<img src="https://cdn-uploads.huggingface.co/production/uploads/652965773a416e1f2173443b/hKfOK0u3OXly_u4hgGLDB.png" alt="image/png" style="zoom: 33%;" />

	#### Long-Form Video Understanding

	<img src="https://cdn-uploads.huggingface.co/production/uploads/652965773a416e1f2173443b/Xweq9f4OWkqeVc_FZIMuO.png" alt="image/png" style="zoom:33%;" />

	#### Common Image Benchmark

	<img src="https://cdn-uploads.huggingface.co/production/uploads/652965773a416e1f2173443b/ybfroSA9WaKXtJbP_9cLR.png" alt="image/png" style="zoom:33%;" />

	#### 3D Spatial Understanding

	<img src="https://cdn-uploads.huggingface.co/production/uploads/652965773a416e1f2173443b/5v8ACRzAoKS0FbcVBXZhT.png" alt="image/png" style="zoom:33%;" />



	### Model Architecture

	- Architecture: Pre-trained [Oryx-ViT](https://huggingface.co/THUdyh/Oryx-ViT) + Qwen2-7B
	- Init Model: [Oryx-7B-Image](https://huggingface.co/THUdyh/Oryx-7B-Image)
	- Data: a mixture of 1.2M image/video data
	- Precision: BFloat16

	#### Hardware & Software

	- Hardware: 64 × NVIDIA Tesla A100
	- Orchestration: Hugging Face Trainer
	- Code: PyTorch

	## Citation