Korean Standard Classification of Science and Technology Model
This model is a sequence classification model based on the "google/gemma-2-9b" architecture, fine-tuned to classify Korean patent documents into the Korean Standard Classification of Science and Technology categories.
It processes long patent texts built from fields such as the IPC code, title, abstract, and claims, and applies a few-shot learning approach to classify each document into one of 188 classes.
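A minimal inference sketch is shown below. The repository id is a placeholder (the card does not state the published checkpoint name), and the field-concatenation format is an assumption:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder repository id -- substitute the actual fine-tuned checkpoint.
model_id = "your-org/gemma-2-9b-kstc-patent"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
)
model.eval()

# Patent fields are concatenated into a single input string (assumed format).
text = "IPC: ...\nTitle: ...\nAbstract: ...\nClaims: ..."
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, 188)
predicted_class = logits.argmax(dim=-1).item()
print(predicted_class)
```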
Intended Uses & Limitations
Intended Uses:
- Classifying Korean patent documents into the Korean Standard Classification of Science and Technology.
- Identifying the technological field of patents based on titles, abstracts, and claims.
Limitations:
- Inputs are limited to 512 tokens; longer documents are truncated.
- The model is specialized for Korean patent data and may not generalize well to other text without fine-tuning.
Training and Evaluation Data
Dataset Composition: The dataset consists of Korean patent documents, each comprising an IPC code, title, abstract, and claims.
- Training Data: 6,008 instances
- Validation Data: 20,559 instances
Data Source: The dataset is sourced from AI Hub.
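The exact preprocessing is not published; a plausible sketch of how the four fields might be combined into one model input (the field names and separator are assumptions):

```python
def build_input_text(record: dict) -> str:
    """Concatenate the patent fields into a single classification input.

    The field names and newline separator here are illustrative
    assumptions; the card does not document the exact format.
    """
    parts = [
        f"IPC: {record.get('ipc', '')}",
        f"Title: {record.get('title', '')}",
        f"Abstract: {record.get('abstract', '')}",
        f"Claims: {record.get('claims', '')}",
    ]
    return "\n".join(parts)
```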
Training Procedure
- Tokenizer: The "google/gemma-2-9b" tokenizer was used, with padding and truncation applied at a maximum length of 512 tokens.
- Loss Function: A custom MultiClassFocalLoss with gamma=2.0 and alpha=0.25 was applied to address class imbalance (see the sketch after this list).
- Optimizer: The AdaLomo optimizer was used with a learning rate of 1e-4 and gradient clipping set to a maximum gradient norm of 1.0.
Training Hyperparameters
- Batch Size: 8
- Gradient Accumulation Steps: 2
- Warmup Steps: 100
- Epochs: 20
- Mixed Precision: bf16
- Logging Steps: 100
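These settings map onto transformers TrainingArguments roughly as below. This is a sketch: the output path is a placeholder, per-epoch evaluation and saving are assumptions, and optim="adalomo" requires the separate lomo-optim package:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="kstc-gemma2-9b",       # placeholder output path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,     # effective batch size 16
    warmup_steps=100,
    num_train_epochs=20,
    bf16=True,
    logging_steps=100,
    learning_rate=1e-4,
    max_grad_norm=1.0,                 # gradient clipping
    optim="adalomo",                   # AdaLomo via the lomo-optim package
    eval_strategy="epoch",             # assumption: evaluated once per epoch
    save_strategy="epoch",
    load_best_model_at_end=True,       # keep the best validation-loss checkpoint
)
```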
Training Results
- Training Time: 41:44:18 (153,124 seconds)
- Best Validation Loss: 0.001751
- Metrics:
- Samples per second: 0.785
- Steps per second: 0.049
- Total FLOPs: 6.137764e+18
Training and Validation Losses
| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 0     | 0.008100      | 0.004401        |
| 2     | 0.004800      | 0.003040        |
| 4     | 0.003300      | 0.002230        |
| 6     | 0.002600      | 0.001864        |
| 8     | 0.002300      | 0.001784        |
| 10    | 0.002100      | 0.001759        |
| 12    | 0.001900      | 0.001751        |
| 14    | 0.001700      | 0.001768        |
| 16    | 0.001700      | 0.001803        |
| 18    | 0.001700      | 0.001799        |
| 19    | 0.001600      | 0.001800        |
Framework Versions
- PyTorch: Version 2.2
- Transformers: Version 4.44.2
Resources and Server Environment
The model was developed using computing resources provided by the 2024 High-Performance Computing Support Project from the National IT Industry Promotion Agency (NIPA).
- Server Specifications:
- GPU: NVIDIA A100 (40 GB each)
- CPU: 24-core 2.1 GHz
- Memory: 192 GB
- Storage: 2 TB SSD