Model Card for kbsooo/layoutlmv3_finetuned_doclaynet

Model Details

Model Description

This model is a fine-tuned version of LayoutLMv3 for token classification on the DocLayNet dataset.
It classifies each token in a document image using textual, layout (bounding box), and visual information.

  • Developed by: kbsooo
  • Model type: LayoutLMv3ForTokenClassification
  • Language(s) (NLP): Korean (document-oriented)
  • License: Not specified; consult the DocLayNet and LayoutLMv3 licenses before use
  • Finetuned from model: microsoft/layoutlmv3-base

Uses

Direct Use

This model can be used for:

  • Token classification in document images (e.g., identifying headings, paragraphs, tables, images, lists)
  • Document understanding tasks where layout + text information is important

Downstream Use

  • Can be integrated into pipelines for document information extraction
  • Useful for document analysis applications: invoice parsing, form processing, etc.

Out-of-Scope Use

  • Not intended for languages or layouts not represented in the DocLayNet dataset
  • Not suitable for free-form text without document structure

Bias, Risks, and Limitations

  • The model may misclassify tokens if the document layout or language differs from the training data
  • Biases may exist due to dataset composition (DocLayNet)
  • Limited to 10 classes of document layout elements

Recommendations

  • Users should preprocess documents similarly to the training setup (tokenization + bounding boxes + image)
  • Verify predictions, especially in production or high-stakes scenarios
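
As one concrete piece of that preprocessing, LayoutLMv3 expects bounding boxes normalized to a 0-1000 integer range rather than raw pixel coordinates. The sketch below shows one way to do that scaling; the helper name `normalize_box` and the example page size are illustrative assumptions, not taken from this model's training code.

```python
def normalize_box(box, page_width, page_height):
    """Scale a pixel-space [x0, y0, x1, y1] box to the 0-1000 integer range."""
    x0, y0, x1, y1 = box
    scaled = [
        1000 * x0 / page_width,
        1000 * y0 / page_height,
        1000 * x1 / page_width,
        1000 * y1 / page_height,
    ]
    # Clamp so that boxes slightly outside the page stay within 0-1000
    return [min(1000, max(0, int(v))) for v in scaled]


# Hypothetical word box on a 2000 x 1000 pixel page
print(normalize_box([200, 100, 600, 150], 2000, 1000))  # → [100, 100, 300, 150]
```

Boxes produced this way can be passed to the processor alongside the words and the page image.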

How to Get Started with the Model

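from PIL import Image
import torch
from transformers import AutoProcessor, LayoutLMv3ForTokenClassification

repo = "kbsooo/layoutlmv3_finetuned_doclaynet"
model = LayoutLMv3ForTokenClassification.from_pretrained(repo)
model.eval()
# apply_ocr=False: words and boxes are passed in explicitly instead of running OCR
processor = AutoProcessor.from_pretrained(repo, apply_ocr=False)

image = Image.open("page.png").convert("RGB")  # placeholder path to a page image
words = ["Sample", "document", "text"]  # placeholder words
# One [x0, y0, x1, y1] box per word, normalized to the 0-1000 range (placeholder values)
boxes = [[100, 100, 200, 130], [210, 100, 330, 130], [340, 100, 400, 130]]

encoding = processor(image, words, boxes=boxes, return_tensors="pt")
with torch.no_grad():
    outputs = model(**encoding)

# Predicted class index per token, mapped to label names from the model config
preds = torch.argmax(outputs.logits, dim=-1)
print([model.config.id2label[i] for i in preds[0].tolist()])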
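
The snippet above yields one prediction per token, while most applications want one label per word. A minimal sketch of that aggregation follows, keeping the prediction of each word's first sub-token. The helper name `word_level_predictions` and the sample values are hypothetical; the `word_ids` list is assumed to have the format returned by `encoding.word_ids()` on a fast tokenizer (a word index per token, `None` for special tokens).

```python
def word_level_predictions(token_preds, word_ids):
    """Keep the prediction of the first sub-token of each word.

    token_preds: predicted class index per token.
    word_ids: word index per token, None for special tokens.
    """
    word_preds = {}
    for pred, word_id in zip(token_preds, word_ids):
        # Skip special tokens and any sub-token after the first one of a word
        if word_id is not None and word_id not in word_preds:
            word_preds[word_id] = pred
    return [word_preds[i] for i in sorted(word_preds)]


# Hypothetical values for the tokens: <s>, "Sam", "ple", "document", "text", </s>
token_preds = [0, 3, 3, 3, 7, 0]
word_ids = [None, 0, 0, 1, 2, None]
print(word_level_predictions(token_preds, word_ids))  # → [3, 3, 7]
```

Taking the first sub-token is only one possible choice; a majority vote over a word's sub-tokens is a common alternative.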

Training Details

Training Data

  • Dataset: DocLayNet-v1.2
  • Train/Validation split: 200/100 samples
  • Columns: input_ids, attention_mask, bbox, labels, pixel_values, n_words_in, n_words_out

Training Procedure

  • Optimizer: AdamW
  • Learning rate: 5e-5
  • Epochs: 5
  • Mixed precision: FP16 (optional)
  • Loss: Cross-entropy per token
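
To make the per-token cross-entropy concrete, here is a small pure-Python sketch of the quantity being averaged. It assumes the common PyTorch and Transformers convention of marking positions excluded from the loss (special tokens, padding) with the label -100; whether this model's training used exactly that masking is an assumption, and the function name is illustrative.

```python
import math

IGNORE_INDEX = -100  # assumed convention for positions excluded from the loss


def token_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over tokens whose label is not ignore_index.

    logits: list of per-token score lists (one score per class).
    labels: list of gold class indices, ignore_index for skipped tokens.
    """
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue
        # log-sum-exp with max subtraction for numerical stability
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[label]
        count += 1
    return total / count if count else 0.0


# Two scored tokens plus one ignored position (hypothetical values)
logits = [[0.0, 0.0], [2.0, 0.0], [5.0, 1.0]]
labels = [0, 0, IGNORE_INDEX]
print(round(token_cross_entropy(logits, labels), 4))
```

In an actual training loop this computation is handled by the model itself when `labels` are passed in the forward call.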

Evaluation

  • Sample metrics (from the training run):
    • Avg Train Loss: 0.134
    • Avg Val Loss: 0.458
  • Token-level accuracy is not reported; check predictions against the DocLayNet labels
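
Token-level accuracy against the DocLayNet labels can be computed along the following lines. This is a sketch, not the evaluation script used for this model: the function name is illustrative, and it assumes that positions to skip are labeled -100, as is conventional in Transformers token classification.

```python
def token_accuracy(preds, labels, ignore_index=-100):
    """Fraction of correctly predicted tokens, skipping ignored positions."""
    kept = [(p, y) for p, y in zip(preds, labels) if y != ignore_index]
    if not kept:
        return 0.0
    return sum(p == y for p, y in kept) / len(kept)


# Hypothetical predictions and gold labels for a five-token sequence
preds = [0, 3, 3, 7, 0]
labels = [-100, 3, 4, 7, -100]
print(token_accuracy(preds, labels))  # 2 of the 3 scored tokens are correct
```

Because layout classes in DocLayNet are unevenly distributed, per-class precision and recall are worth reporting alongside overall accuracy.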

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: NVIDIA A100
  • Hours used: ~1 hr for 5 epochs (on the small 200-sample training set)
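
A rough estimate in the spirit of that calculator multiplies hardware power draw, runtime, and the carbon intensity of the local grid. The sketch below shows only the arithmetic; the function name is illustrative, and the power draw and carbon intensity in the example are hypothetical placeholders, not measured values for this training run.

```python
def estimate_co2_kg(power_watts, hours, kg_co2_per_kwh):
    """Estimate emissions as energy used (kWh) times grid carbon intensity."""
    energy_kwh = power_watts / 1000 * hours
    return energy_kwh * kg_co2_per_kwh


# Hypothetical inputs: 250 W average draw, 1 hour, 0.5 kg CO2eq per kWh
print(estimate_co2_kg(250, 1.0, 0.5))  # → 0.125
```

Substituting the actual GPU power draw and the carbon intensity of the cloud region used gives a more meaningful figure.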

Technical Specifications

Model Architecture and Objective

  • Base model: LayoutLMv3
  • Task: Token classification for document layout elements
  • Input: Tokenized text, bounding boxes, and document images
  • Output: Token-wise logits for 10 classes

Compute Infrastructure

  • Training performed on Google Colab Pro (A100 GPU)
  • Framework: PyTorch + Hugging Face Transformers

Citation

BibTeX:

@article{huang2022layoutlmv3,
  title={LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking},
  author={Huang, Yupan and Lv, Tengchao and Cui, Lei and Lu, Yutong and Wei, Furu},
  journal={arXiv preprint arXiv:2204.08387},
  year={2022}
}

APA:

Huang, Y., Lv, T., Cui, L., Lu, Y., & Wei, F. (2022). LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. arXiv preprint arXiv:2204.08387.
