library_name: transformers
license: mit
datasets:
- SpursgoZmy/MMTab
- apoidea/pubtabnet-html
language:
- en
base_model: google/pix2struct-base
pipeline_tag: image-to-text
pix2struct-base-table2html
Turn table images into HTML!
Demo app
Try the demo app, which combines table detection and recognition!
About
This model takes an image of a table and outputs HTML. It parses the image, performs optical character recognition (OCR) and table structure recognition, and emits the result as HTML.
The model expects an image containing only a table. If the table is embedded in a document, first use a table detection model to extract it (e.g. Microsoft's Table Transformer model).
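As a rough sketch of the cropping step, the helper below pads a detected bounding box and clamps it to the image bounds before cropping. The function name `pad_box`, the 10-pixel default padding, and the example coordinates are illustrative assumptions, not part of this model or of any detection library; the box itself is assumed to come from whatever table detection model you use.

```python
import math


def pad_box(box, image_size, padding=10):
    """Expand a detected table box by `padding` pixels, clamped to the image.

    box: (left, top, right, bottom) in pixels, assumed to come from a
         table detection model.
    image_size: (width, height), e.g. `image.size` from PIL.
    """
    left, top, right, bottom = box
    width, height = image_size
    return (
        max(0, math.floor(left) - padding),
        max(0, math.floor(top) - padding),
        min(width, math.ceil(right) + padding),
        min(height, math.ceil(bottom) + padding),
    )


# Hypothetical detection on an 800x1000 page.
crop_box = pad_box((100.4, 200.7, 500.2, 450.9), image_size=(800, 1000))
print(crop_box)
```

With PIL, the result can be passed directly to `image.crop(crop_box)`. A small margin may help avoid clipping cell borders, but the right amount of padding likely depends on the detector and the documents.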
The model is fine-tuned from the Pix2Struct base model using a max_patch_length of 1024 (passed to the processor as max_patches=1024 in the usage example) and a maximum generation length of 1024 tokens. The max_patch_length should likely not be changed for inference, but the generation length (max_new_tokens) can be changed.
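Because the generation length is adjustable, large tables may hit the limit before the output is complete. One possible heuristic, sketched below, is to check whether the decoded text closes its table. The helper name `looks_truncated` is hypothetical, and the check assumes predictions end with `</table>` as in the example output on this page; it is not a documented guarantee of the model.

```python
def looks_truncated(html):
    """Return True if the generated HTML appears to be cut off.

    Assumes a complete prediction ends with a closing </table> tag.
    """
    return not html.strip().endswith("</table>")


complete = "<table><tr><td>1</td></tr></table>\n"
cut_off = "<table><tr><td>1</td><td>2"

print(looks_truncated(complete))
print(looks_truncated(cut_off))
```

If the check fires, rerunning `model.generate` with a larger `max_new_tokens` is one option to try.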
The model has been trained using two datasets: MMTab and PubTabNet.
Usage
Below is a complete example of loading the model and performing inference on an example table image (example from the MMTab dataset):
import torch
from transformers import AutoProcessor, Pix2StructForConditionalGeneration
from PIL import Image
import requests
from io import BytesIO
# Load model and processor
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained("KennethTM/pix2struct-base-table2html")
model = Pix2StructForConditionalGeneration.from_pretrained("KennethTM/pix2struct-base-table2html")
model.to(device)
model.eval()
# Load example image from URL
url = "https://huggingface.co/KennethTM/pix2struct-base-table2html/resolve/main/example_recog_1.jpg"
response = requests.get(url)
image = Image.open(BytesIO(response.content))
# Run model inference
encoding = processor(image, return_tensors="pt", max_patches=1024)
with torch.inference_mode():
    flattened_patches = encoding.pop("flattened_patches").to(device)
    attention_mask = encoding.pop("attention_mask").to(device)
    predictions = model.generate(flattened_patches=flattened_patches, attention_mask=attention_mask, max_new_tokens=1024)
predictions_decoded = processor.tokenizer.batch_decode(predictions, skip_special_tokens=True)
# Show predictions as text
print(predictions_decoded[0])
Example image:
Model HTML output for example image:
<table border="1" cellspacing="0">
<tr>
<th>
Rank
</th>
<th>
Lane
</th>
<th>
Name
</th>
<th>
Nationality
</th>
<th>
Time
</th>
<th>
Notes
</th>
</tr>
<tr>
<td>
</td>
<td>
4
</td>
<td>
Michael Phelps
</td>
<td>
United States
</td>
<td>
51.25
</td>
<td>
OR
</td>
</tr>
<tr>
<td>
</td>
<td>
3
</td>
<td>
Ian Crocker
</td>
<td>
United States
</td>
<td>
51.29
</td>
<td>
</td>
</tr>
<tr>
<td>
</td>
<td>
5
</td>
<td>
Andriy Serdinov
</td>
<td>
Ukraine
</td>
<td>
51.36
</td>
<td>
EU
</td>
</tr>
<tr>
<td>
4
</td>
<td>
1
</td>
<td>
Thomas Rupprath
</td>
<td>
Germany
</td>
<td>
52.27
</td>
<td>
</td>
</tr>
<tr>
<td>
5
</td>
<td>
6
</td>
<td>
Igor Marchenko
</td>
<td>
Russia
</td>
<td>
52.32
</td>
<td>
</td>
</tr>
<tr>
<td>
6
</td>
<td>
2
</td>
<td>
Gabriel Mangabeira
</td>
<td>
Brazil
</td>
<td>
52.34
</td>
<td>
</td>
</tr>
<tr>
<td>
7
</td>
<td>
8
</td>
<td>
Duje Draganja
</td>
<td>
Croatia
</td>
<td>
52.46
</td>
<td>
</td>
</tr>
<tr>
<td>
8
</td>
<td>
7
</td>
<td>
Geoff Huegill
</td>
<td>
Australia
</td>
<td>
52.56
</td>
<td>
</td>
</tr>
</table>
And the rendered HTML table:
| Rank | Lane | Name | Nationality | Time | Notes |
|---|---|---|---|---|---|
| | 4 | Michael Phelps | United States | 51.25 | OR |
| | 3 | Ian Crocker | United States | 51.29 | |
| | 5 | Andriy Serdinov | Ukraine | 51.36 | EU |
| 4 | 1 | Thomas Rupprath | Germany | 52.27 | |
| 5 | 6 | Igor Marchenko | Russia | 52.32 | |
| 6 | 2 | Gabriel Mangabeira | Brazil | 52.34 | |
| 7 | 8 | Duje Draganja | Croatia | 52.46 | |
| 8 | 7 | Geoff Huegill | Australia | 52.56 | |
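
For downstream use, the HTML string can be turned into a list of rows. The sketch below uses only Python's standard `html.parser` module; the class `TableRows` and the function `html_to_rows` are illustrative names, and the parser deliberately ignores `rowspan`/`colspan` and nested tables, so treat it as a starting point rather than a complete converter.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of each <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text pieces of the open cell, None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            if self.rows:
                # Empty cells are kept as empty strings to preserve columns.
                self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def html_to_rows(html):
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Shortened version of the example output above.
sample = """<table border="1" cellspacing="0">
<tr><th>Rank</th><th>Lane</th><th>Name</th></tr>
<tr><td></td><td>4</td><td>Michael Phelps</td></tr>
</table>"""
rows = html_to_rows(sample)
print(rows)
```

Passing `predictions_decoded[0]` from the usage example to `html_to_rows` should give one list per table row, with the header row first.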