Model Card for kbsooo/layoutlmv3_finetuned_doclaynet

Model Details

Model Description

This model is a fine-tuned version of LayoutLMv3 for token classification on the DocLayNet dataset.
It is designed to classify each token in a document image based on both textual and layout information.

Developed by: kbsooo
Model type: LayoutLMv3ForTokenClassification
Language(s) (NLP): Korean (document-oriented)
License: Not specified; check the DocLayNet and LayoutLMv3 licenses before use
Finetuned from model: microsoft/layoutlmv3-base

Model Sources

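Repository: Hugging Face Model Hub (kbsooo/layoutlmv3_finetuned_doclaynet)
Paper: LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking (arXiv:2204.08387)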

Uses

Direct Use

This model can be used for:

Token classification in document images (e.g., identifying headings, paragraphs, tables, images, lists)
Document understanding tasks where layout + text information is important

Downstream Use

Can be integrated into pipelines for document information extraction
Useful for document analysis applications: invoice parsing, form processing, etc.

Out-of-Scope Use

Not intended for languages or layouts not represented in the DocLayNet dataset
Not suitable for free-form text without document structure

Bias, Risks, and Limitations

The model may misclassify tokens if the document layout or language differs from the training data
Biases may exist due to dataset composition (DocLayNet)
Limited to 10 classes of document layout elements

Recommendations

Users should preprocess documents similarly to the training setup (tokenization + bounding boxes + image)
Verify predictions, especially in production or high-stakes scenarios
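
LayoutLMv3 expects each bounding box on a 0 to 1000 scale relative to the page size. A minimal sketch of that normalization, assuming pixel-coordinate boxes in (x0, y0, x1, y1) order; the helper name normalize_box and the page size in the example are illustrative and not taken from this repository:

```python
def normalize_box(box, page_width, page_height):
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 range."""
    x0, y0, x1, y1 = box
    scaled = [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]
    # Clamp so boxes touching or slightly beyond the page edge stay in range.
    return [min(1000, max(0, v)) for v in scaled]


# Hypothetical example: one word box on an 800 x 1000 pixel page.
print(normalize_box((100, 50, 400, 200), 800, 1000))
```

The normalized boxes are what would be passed as the bounding-box input alongside the words and the page image.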

How to Get Started with the Model

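import torch
from transformers import AutoProcessor, LayoutLMv3ForTokenClassification

repo = "kbsooo/layoutlmv3_finetuned_doclaynet"
model = LayoutLMv3ForTokenClassification.from_pretrained(repo)
# apply_ocr=False: words and boxes are supplied by the caller, not by Tesseract OCR
processor = AutoProcessor.from_pretrained(repo, apply_ocr=False)
model.eval()

image = ...  # replace with a PIL.Image (RGB) or np.array of the page
# Placeholder words with one box each, as [x0, y0, x1, y1] on a 0-1000 scale
words = ["Sample", "document", "text"]
boxes = [[80, 60, 200, 90], [210, 60, 360, 90], [370, 60, 440, 90]]

encoding = processor(image, words, boxes=boxes, truncation=True, return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**encoding)
preds = torch.argmax(outputs.logits, dim=-1).squeeze(0).tolist()
# Map class indices to the label names stored in the model config
print([model.config.id2label[p] for p in preds])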
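
The model predicts one label per sub-word token, while layout labels are usually wanted per word. A minimal sketch of one common way to collapse them, keeping the prediction of each word's first sub-token. It assumes word indices in the form returned by `encoding.word_ids()` on a fast tokenizer (None for special tokens); the helper name and the sample values are illustrative:

```python
def first_token_predictions(word_ids, token_preds):
    """Return one prediction per word, taken from the word's first sub-token."""
    word_preds = []
    previous = None
    for word_id, pred in zip(word_ids, token_preds):
        # Skip special tokens (None) and continuation sub-tokens of the same word.
        if word_id is not None and word_id != previous:
            word_preds.append(pred)
        previous = word_id
    return word_preds


# Hypothetical values: three words, two of them split into two sub-tokens.
word_ids = [None, 0, 0, 1, 2, 2, None]
token_preds = [0, 3, 3, 7, 1, 4, 0]
print(first_token_predictions(word_ids, token_preds))  # [3, 7, 1]
```

Taking the first sub-token is only one option; a majority vote over a word's sub-tokens would be an equally reasonable choice.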

Training Details

Training Data

Dataset: DocLayNet-v1.2
Train/Validation split: 200/100 samples
Columns: input_ids, attention_mask, bbox, labels, pixel_values, n_words_in, n_words_out

Training Procedure

Optimizer: AdamW
Learning rate: 5e-5
Epochs: 5
Mixed precision: FP16 (optional)
Loss: Cross-entropy per token
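
To make "cross-entropy per token" concrete, the sketch below computes it in plain Python: the negative log-softmax probability of the correct class, averaged over labeled tokens. It assumes that tokens labeled -100 (PyTorch's default ignore index, commonly used for special and padding tokens) are skipped. It illustrates the loss only and is not the training code of this model:

```python
import math

IGNORE_INDEX = -100  # PyTorch's default ignore_index for cross-entropy


def token_cross_entropy(logits, labels):
    """Mean cross-entropy over tokens whose label is not IGNORE_INDEX."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue
        # Numerically stable log-sum-exp: subtract the row maximum first.
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_norm - row[label]
        count += 1
    return total / count if count else 0.0


# Hypothetical logits for three tokens and three classes; the second token is ignored.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
labels = [0, IGNORE_INDEX, 1]
print(round(token_cross_entropy(logits, labels), 4))
```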

Evaluation

Sample metrics (average losses on the training and validation splits):
- Avg Train Loss: 0.134
- Avg Val Loss: 0.458
- Token-level accuracy is not reported here; check predictions against the DocLayNet labels before relying on the model
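
Token-level accuracy against the DocLayNet labels can be computed along these lines. This is a minimal sketch, assuming flat lists of predicted and reference class indices and that label -100 marks tokens excluded from scoring; the function name and sample values are illustrative:

```python
def token_accuracy(predictions, labels, ignore_index=-100):
    """Fraction of labeled tokens whose predicted class matches the label."""
    correct = total = 0
    for pred, label in zip(predictions, labels):
        if label == ignore_index:
            continue  # special or padding token, not scored
        total += 1
        correct += int(pred == label)
    return correct / total if total else 0.0


# Hypothetical values: four tokens, the last one excluded from scoring.
print(token_accuracy([1, 2, 3, 0], [1, 2, 0, -100]))
```

Because layout classes are typically imbalanced, per-class precision and recall would complement this single number.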

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Hardware Type: NVIDIA A100
Hours used: approximately 1 hour for 5 epochs on the small training subset
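
A rough estimate in the spirit of that calculator multiplies the energy used by the carbon intensity of the electricity grid. The sketch below shows the arithmetic only. The 400 W power draw and the 0.4 kg CO2eq/kWh grid intensity are placeholder assumptions, not measured values for this training run, and the function name is illustrative:

```python
def estimate_co2_kg(gpu_power_watts, hours, grid_kg_co2_per_kwh):
    """Estimate emissions as energy used (kWh) times grid carbon intensity."""
    energy_kwh = gpu_power_watts / 1000 * hours
    return energy_kwh * grid_kg_co2_per_kwh


# Placeholder inputs: 400 W draw for 1 hour on a 0.4 kg CO2eq/kWh grid.
print(round(estimate_co2_kg(400, 1.0, 0.4), 3))
```

Real figures depend on the actual power draw, data-center efficiency, and the region where the job ran.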

Technical Specifications

Model Architecture and Objective

Base model: LayoutLMv3
Task: Token classification for document layout elements
Input: Tokenized text, bounding boxes, and document images
Output: Token-wise logits for 10 classes
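
Token-wise logits can be turned into class probabilities with a softmax, which also gives a simple way to flag uncertain tokens for manual review. A minimal sketch in plain Python; the 0.5 default threshold and the function names are assumptions for illustration, not settings from this model:

```python
import math


def softmax(logits):
    """Convert one token's logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_confidence(token_logits, threshold=0.5):
    """Return (class_index, probability, is_confident) for each token."""
    results = []
    for row in token_logits:
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((best, probs[best], probs[best] >= threshold))
    return results


# Hypothetical logits for two tokens over three classes.
for item in predict_with_confidence([[0.1, 0.0, 0.2], [4.0, 0.5, 0.1]]):
    print(item)
```

Tokens flagged as not confident are natural candidates for the manual verification recommended for high-stakes use.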

Compute Infrastructure

Training performed on Google Colab Pro (A100 GPU)
Framework: PyTorch + Hugging Face Transformers

Citation

BibTeX:

@article{huang2022layoutlmv3,
  title={LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking},
  author={Huang, Yupan and Lv, Tengchao and Cui, Lei and Lu, Yutong and Wei, Furu},
  journal={arXiv preprint arXiv:2204.08387},
  year={2022}
}

APA:

Huang, Y., Lv, T., Cui, L., Lu, Y., & Wei, F. (2022). LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. arXiv preprint arXiv:2204.08387.
