VisionParser-VL-Expert
Developed by: Daemontatox
Model Type: Fine-tuned Vision-Language Model (VLM)
Base Model: unsloth/Qwen2-VL-7B-Instruct
Finetuned from model: unsloth/Qwen2-VL-7B-Instruct
License: apache-2.0
Languages: en
Tags:
- document-parsing
- information-extraction
- vision-language
- unsloth
- qwen2_vl
Model Description
VisionParser-VL-Expert is a fine-tuned version of unsloth/Qwen2-VL-7B-Instruct, designed specifically for document parsing and extraction tasks. It excels in interpreting and extracting structured data from images of documents, such as invoices, forms, and reports.
The finetuning process utilized QLoRA with Unsloth and the Hugging Face TRL library, enabling efficient training with minimal resource overhead. This model demonstrates significant improvements in:
- Extracting textual information from visually complex layouts.
- Recognizing tabular and hierarchical data structures.
- Generating accurate and contextually rich text outputs for document understanding.
Datasets used include a combination of publicly available document datasets (e.g., FUNSD, DocVQA) and proprietary annotated data for domain-specific applications.
Intended Uses
VisionParser-VL-Expert is intended for:
- Extracting data from scanned documents, invoices, and forms.
- Parsing and analyzing structured layouts such as tables and charts.
- Generating textual summaries of visual content in documents.
- Supporting OCR systems by providing contextually enriched outputs.
Limitations
While VisionParser-VL-Expert is powerful, it has certain limitations:
- May struggle with low-quality or heavily distorted images.
- Biases from training data might influence performance.
- Limited support for languages other than English.
- Performance can vary with highly complex or novel document layouts.
How to Use
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "daemontatox/visionparser-vl-expert" # Replace with the actual model name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# Example usage with text and image
prompt = "Extract key details from the document: "
image_path = "path/to/your/document_image.jpg" # Replace with your image path
inputs = tokenizer(prompt, images=image_path, return_tensors="pt")
outputs = model.generate(**inputs)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
Acknowledgements
Special thanks to the Unsloth team for their robust tools enabling efficient fine-tuning. This model was developed with the help of open-source libraries and community datasets.
- Downloads last month
- 37
Model tree for critical-hf/visionparser-vl-expert
Base model
Qwen/Qwen2-VL-7B