Model Card for kbsooo/layoutlmv3_finetuned_doclaynet
Model Details
Model Description
This model is a fine-tuned version of LayoutLMv3 for token classification on the DocLayNet dataset.
It is designed to classify each token in a document image based on both textual and layout information.
- Developed by: kbsooo
- Model type: LayoutLMv3ForTokenClassification
- Language(s) (NLP): English (DocLayNet documents are predominantly English)
- License: Check DocLayNet and LayoutLMv3 licenses
- Finetuned from model: microsoft/layoutlmv3-base
Model Sources
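- Repository: https://huggingface.co/kbsooo/layoutlmv3_finetuned_doclaynet
- Paper: LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking (arXiv:2204.08387)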
Uses
Direct Use
This model can be used for:
- Token classification in document images (e.g., identifying headings, paragraphs, tables, images, lists)
- Document understanding tasks where layout + text information is important
Downstream Use
- Can be integrated into pipelines for document information extraction
- Useful for document analysis applications: invoice parsing, form processing, etc.
Out-of-Scope Use
- Not intended for languages or layouts not represented in the DocLayNet dataset
- Not suitable for free-form text without document structure
Bias, Risks, and Limitations
- The model may misclassify tokens if the document layout or language differs from the training data
- Biases may exist due to dataset composition (DocLayNet)
- Limited to 10 classes of document layout elements
Recommendations
- Preprocess documents the same way as in training: tokenized words, one bounding box per word, and the page image
- Verify predictions before acting on them, especially in production or high-stakes scenarios
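LayoutLMv3 expects each bounding box as `[x0, y0, x1, y1]` scaled to the 0-1000 range, independent of the page's pixel size. The card does not show how boxes were prepared for this fine-tune, so the following is only a minimal sketch of the usual normalization; the helper name `normalize_box` and the sample page size are illustrative assumptions.

```python
def normalize_box(box, page_width, page_height):
    """Scale a pixel-space [x0, y0, x1, y1] box to the 0-1000 range."""
    x0, y0, x1, y1 = box
    scaled = [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]
    # Clamp to guard against boxes that spill slightly past the page edge.
    return [min(1000, max(0, v)) for v in scaled]


# Hypothetical word box on a 1025 x 1025 pixel page image.
print(normalize_box([205, 410, 615, 451], 1025, 1025))
```

If boxes come from an OCR engine in pixel coordinates, a step like this would run once per word before the processor is called.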
How to Get Started with the Model
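import torch
from transformers import AutoProcessor, LayoutLMv3ForTokenClassification
repo = "kbsooo/layoutlmv3_finetuned_doclaynet"
model = LayoutLMv3ForTokenClassification.from_pretrained(repo)
# apply_ocr=False: pass words and boxes yourself instead of running built-in OCR
processor = AutoProcessor.from_pretrained(repo, apply_ocr=False)
image = ... # PIL.Image or np.array (RGB page image)
words = ["Sample", "document", "text"]
# Illustrative values: one [x0, y0, x1, y1] box per word, normalized to 0-1000
boxes = [[100, 50, 220, 80], [230, 50, 380, 80], [390, 50, 460, 80]]
encoding = processor(image, words, boxes=boxes, return_tensors="pt")
with torch.no_grad():
    outputs = model(**encoding)
preds = torch.argmax(outputs.logits, dim=-1)  # predicted class id per token
print([model.config.id2label[i] for i in preds[0].tolist()])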
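The predictions above are per token, while a word can be split into several sub-tokens. A common convention, assumed here rather than taken from this card, is to keep the prediction of each word's first sub-token. The sketch below shows that step in plain Python; `word_level_labels` is a hypothetical helper, and the `id2label` mapping and inputs are made-up values. With a fast tokenizer, the `word_ids` list would typically come from `encoding.word_ids(0)` and the mapping from `model.config.id2label`.

```python
def word_level_labels(token_preds, word_ids, id2label):
    """Keep the prediction of the first sub-token of each word."""
    labels = []
    previous = None
    for pred, word_id in zip(token_preds, word_ids):
        # word_id is None for special tokens and padding.
        if word_id is not None and word_id != previous:
            labels.append(id2label[pred])
        previous = word_id
    return labels


# Hypothetical values: three words, the second split into two sub-tokens.
id2label = {0: "Text", 1: "Title"}
token_preds = [0, 1, 1, 0, 0, 0]
word_ids = [None, 0, 1, 1, 2, None]
print(word_level_labels(token_preds, word_ids, id2label))
```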
Training Details
Training Data
- Dataset: DocLayNet-v1.2
- Train/Validation split: 200/100 samples
- Columns: input_ids, attention_mask, bbox, labels, pixel_values, n_words_in, n_words_out
Training Procedure
- Optimizer: AdamW
- Learning rate: 5e-5
- Epochs: 5
- Mixed precision: FP16 optional
- Loss: Cross-entropy per token
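To make "cross-entropy per token" concrete, the sketch below computes the mean loss over tokens in plain Python, skipping positions labeled -100 (the default `ignore_index` of PyTorch's `CrossEntropyLoss`, commonly used to mask special tokens and padding). It illustrates the arithmetic only and is not the training code used for this model; the function name and sample logits are assumptions.

```python
import math

IGNORE_INDEX = -100  # label value skipped by the loss


def token_cross_entropy(logits, labels):
    """Mean cross-entropy over tokens whose label is not IGNORE_INDEX."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue
        # Log-sum-exp with the max subtracted for numerical stability.
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_norm - row[label]
        count += 1
    # Return 0.0 when every token is masked to avoid dividing by zero.
    return total / count if count else 0.0


# Hypothetical logits for three tokens over two classes; the last token is masked.
logits = [[2.0, 0.0], [0.0, 0.0], [5.0, -5.0]]
labels = [0, 1, IGNORE_INDEX]
print(token_cross_entropy(logits, labels))
```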
Evaluation
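- Sample metrics:
- Avg train loss: 0.134
- Avg validation loss: 0.458
- The validation loss is more than three times the training loss, which may indicate overfitting on the 200-sample training set
- Token-level accuracy is not reported here; check predictions against the DocLayNet labels before relying on the model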
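One way to run that check is a token-level accuracy that skips masked positions. The following is a minimal sketch under the assumption that labels use -100 for ignored tokens; `token_accuracy` and the sample lists are hypothetical, not results from this model.

```python
def token_accuracy(preds, labels, ignore_index=-100):
    """Fraction of non-ignored tokens whose predicted class matches the label."""
    pairs = [(p, l) for p, l in zip(preds, labels) if l != ignore_index]
    if not pairs:
        return 0.0
    return sum(p == l for p, l in pairs) / len(pairs)


# Hypothetical predictions and labels; the last position is masked.
preds = [1, 2, 2, 0]
labels = [1, 2, 3, -100]
print(token_accuracy(preds, labels))
```

Because layout classes are usually imbalanced, per-class precision and recall would give a fuller picture than a single accuracy number.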
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: NVIDIA A100
- Hours used: approximately 1 hour for 5 epochs on the small training subset
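As a rough illustration of the kind of estimate such a calculator produces, emissions can be approximated as energy used times the carbon intensity of the grid. Every number below except the 1-hour runtime is an assumption, not a measurement from this training run: the 400 W power draw, the 0.4 kg CO2eq/kWh grid intensity, and the PUE of 1.0 are placeholders to be replaced with real values.

```python
def estimate_co2_kg(power_watts, hours, kg_co2_per_kwh, pue=1.0):
    """Estimate emissions as energy (kWh) times grid carbon intensity."""
    energy_kwh = power_watts / 1000 * hours * pue
    return energy_kwh * kg_co2_per_kwh


# Assumed inputs: 400 W draw, 1 hour of training, 0.4 kg CO2eq per kWh.
print(round(estimate_co2_kg(400, 1.0, 0.4), 3))
```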
Technical Specifications
Model Architecture and Objective
- Base model: LayoutLMv3
- Task: Token classification for document layout elements
- Input: Tokenized text, bounding boxes, and document images
- Output: Token-wise logits for 10 classes
Compute Infrastructure
- Training performed on Google Colab Pro (A100 GPU)
- Framework: PyTorch + Hugging Face Transformers
Citation
BibTeX:
@article{huang2022layoutlmv3,
title={LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking},
author={Huang, Yupan and Lv, Tengchao and Cui, Lei and Lu, Yutong and Wei, Furu},
journal={arXiv preprint arXiv:2204.08387},
year={2022}
}
APA:
Huang, Y., Lv, T., Cui, L., Lu, Y., & Wei, F. (2022). LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. arXiv preprint arXiv:2204.08387.