
Korean Standard Classification of Science and Technology Model

This is a sequence classification model based on the "google/gemma-2-9b" architecture, fine-tuned to classify Korean patent documents into Korean Standard Classification of Science and Technology (KSCST) categories.

The model takes long patent texts (IPC code, title, abstract, and claims) as input and applies a few-shot learning approach to classify them into 188 classes.
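
A minimal inference sketch is shown below. The repo id `your-org/kscst-gemma-2-9b` and the field-concatenation format are placeholders, not confirmed by this card; only the tokenizer name, the 512-token limit, and the 188-class head come from the sections on this page.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical repo id -- substitute the actual model repository.
model_id = "your-org/kscst-gemma-2-9b"

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b")
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=188,              # KSCST classes, per this card
    torch_dtype=torch.bfloat16,  # matches the bf16 training precision
)
model.eval()

# The exact field-concatenation format is an assumption; the card only says
# IPC, title, abstract, and claims are used as input.
text = "IPC: G06F 16/35 | Title: ... | Abstract: ... | Claims: ..."
inputs = tokenizer(
    text, truncation=True, padding="max_length", max_length=512, return_tensors="pt"
)
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class = int(logits.argmax(dim=-1))
print(predicted_class)  # index into the 188 KSCST categories
```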

Intended Uses & Limitations

  • Intended Uses:

    • Classifying Korean patent documents into the Korean Standard Classification of Science and Technology.
    • Identifying the technological field of patents based on titles, abstracts, and claims.
  • Limitations:

    • Input is limited to 512 tokens; longer documents are truncated, so information beyond that limit is lost.
    • The model is specialized for Korean patent data and may not generalize well to other domains or text types without further fine-tuning.

Training and Evaluation Data

  • Dataset Composition: The dataset consists of Korean patent documents, including IPC, title, abstract, and claims.

    • Training Data: 6,008 instances
    • Validation Data: 20,559 instances
  • Data Source: The dataset is sourced from AI Hub.

Training Procedure

  • Tokenizer: The "google/gemma-2-9b" tokenizer was used, with padding and truncation applied at a maximum length of 512 tokens.
  • Loss Function: A custom MultiClassFocalLoss with gamma=2.0 and alpha=0.25 was applied to address class imbalance (a sketch is shown after this list).
  • Optimizer: The AdaLomo optimizer was used with a learning rate of 1e-4 and gradient clipping at a maximum gradient norm of 1.0.
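
The card names a custom MultiClassFocalLoss but does not publish it. The following is a minimal sketch of a standard multi-class focal loss (Lin et al., 2017) using the stated gamma=2.0 and alpha=0.25; the authors' exact implementation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiClassFocalLoss(nn.Module):
    """Sketch of a multi-class focal loss; not the repo's exact code."""

    def __init__(self, gamma: float = 2.0, alpha: float = 0.25):
        super().__init__()
        self.gamma = gamma
        self.alpha = alpha

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # Per-sample cross entropy, without reduction.
        ce = F.cross_entropy(logits, targets, reduction="none")
        pt = torch.exp(-ce)  # probability assigned to the true class
        # Down-weight easy examples by (1 - pt)^gamma; alpha rescales the loss.
        focal = self.alpha * (1.0 - pt) ** self.gamma * ce
        return focal.mean()
```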

Training Hyperparameters

  • Batch Size: 8
  • Gradient Accumulation Steps: 2
  • Warmup Steps: 100
  • Epochs: 20
  • Mixed Precision: bf16
  • Logging Steps: 100
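
Taken together with the procedure above, these settings might map onto Transformers `TrainingArguments` roughly as follows. This assumes the Hugging Face `Trainer` was used (the card does not say), and `output_dir` is a placeholder; `optim="adalomo"` requires the `lomo-optim` package and Transformers >= 4.42.

```python
from transformers import TrainingArguments

# Effective batch size is 8 * 2 = 16 with gradient accumulation.
training_args = TrainingArguments(
    output_dir="kscst-gemma-2-9b",   # hypothetical output path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    warmup_steps=100,
    num_train_epochs=20,
    bf16=True,
    logging_steps=100,
    learning_rate=1e-4,
    max_grad_norm=1.0,
    optim="adalomo",                 # AdaLomo via the Trainer's optim flag
)
```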

Training Results

  • Training Time: 42:32:04 (153,124 seconds)
  • Best Validation Loss: 0.001751
  • Metrics:
    • Samples per second: 0.785
    • Steps per second: 0.049
    • Total FLOPs: 6.137764e+18

Training and Validation Losses

| Epoch | Training Loss | Validation Loss |
|------:|--------------:|----------------:|
|     0 |      0.008100 |        0.004401 |
|     2 |      0.004800 |        0.003040 |
|     4 |      0.003300 |        0.002230 |
|     6 |      0.002600 |        0.001864 |
|     8 |      0.002300 |        0.001784 |
|    10 |      0.002100 |        0.001759 |
|    12 |      0.001900 |        0.001751 |
|    14 |      0.001700 |        0.001768 |
|    16 |      0.001700 |        0.001803 |
|    18 |      0.001700 |        0.001799 |
|    19 |      0.001600 |        0.001800 |

Framework Versions

  • PyTorch: Version 2.2
  • Transformers: Version 4.44.2

Resources and Server Environment

The model was developed using computing resources provided by the 2024 High-Performance Computing Support Project from the National IT Industry Promotion Agency (NIPA).

  • Server Specifications:
    • GPU: NVIDIA A100 (40 GB each)
    • CPU: 24-core 2.1 GHz
    • Memory: 192 GB
    • Storage: 2 TB SSD