|
--- |
|
datasets: |
|
- nur-dev/kaz-for-lm |
|
language: |
|
- kk |
|
library_name: transformers |
|
pipeline_tag: text-generation |
|
license: afl-3.0 |
|
--- |
|
|
|
# GPT-J-3.48B-Kazakh |
|
<div style="display: flex; justify-content: center;"> |
|
<img src="https://github.com/Nurgali-Kadyrbek/assets/blob/main/gpt-tree.png?raw=true" alt="GPT-J-3.48B-Kazakh Model Logo" width="300" height="300"/>
|
</div> |
|
<div style="position:relative; text-align: center; padding: 20px; border-radius: 10px; box-shadow: 0 4px 10px rgba(0, 0, 0, 0.1); margin-top: 20px; background: linear-gradient(135deg, #00a3e0, #ffc72c);"> |
|
<h1 style="font-size: 2.5em; margin: 0; color: #ffffff; text-shadow: 1px 1px #005b96; font-family: 'Arial', sans-serif; border-bottom: 5px solid #ffffff; padding-bottom: 10px;"> |
|
<span style="border-bottom: 4px double #ffffff; padding-bottom: 5px;">Kazakh Language GPT-J-3.48B</span> |
|
</h1> |
|
<p style="font-size: 1.2em; margin: 10px 0 0; color: #ffffff; font-family: 'Arial', sans-serif;"> |
|
<span style="background: linear-gradient(to right, #ffc72c 0%, #00a3e0 100%); color: white; padding: 2px 8px; border-radius: 5px;">General-purpose Kazakh Language Model</span> |
|
</p> |
|
</div> |
|
|
|
|
|
Architecture: GPTJForCausalLM<br> |
|
Tokenizer: retrained GPT2Tokenizer (Vocabulary size: 50,400, Model Max Length: 2048)<br> |
|
|
|
## Overview |
|
This model is a 3.48B-parameter Kazakh language model based on the GPT-J architecture, designed for general-purpose language modeling. It was trained on a diverse collection of Kazakh texts (the nur-dev/kaz-for-lm dataset) and is intended to support a wide range of natural language processing applications in Kazakh.
|
|
|
## Usage Example |
|
The model can be used with the Hugging Face Transformers library: |
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model and tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained("nur-dev/gpt-j-3.4B-kaz")
tokenizer = AutoTokenizer.from_pretrained("nur-dev/gpt-j-3.4B-kaz")

# Switch to inference mode
model.eval()
```
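
For example, continuing from the snippet above, a simple generation call might look like this (the sampling parameters below are illustrative defaults, not settings validated for this model):

```python
import torch

prompt = "Қазақстанның астанасы"  # "The capital of Kazakhstan"
inputs = tokenizer(prompt, return_tensors="pt")

# Illustrative sampling settings; tune them for your use case.
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        top_p=0.95,
        temperature=0.8,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```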
|
## Training Details
|
The model is being trained with the DeepSpeed library using ZeRO optimization Stage 2, with the optimizer offloaded to the CPU and pinned memory enabled. The configuration also uses allgather partitions with a 200M bucket size, overlapped communication, reduce scatter, an automatically sized reduce bucket, and contiguous gradients.
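
A ZeRO Stage 2 configuration matching this description would look roughly like the sketch below; the exact configuration used for training is not published, so the bucket sizes and "auto" values here are assumptions based on the text above.

```python
# Sketch of a DeepSpeed ZeRO Stage 2 config matching the description above.
# The actual file used for training is not published; treat these values
# (e.g. the 200M allgather bucket size) as assumptions, not the exact setup.
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "allgather_partitions": True,
        "allgather_bucket_size": 2e8,   # ~200M elements
        "overlap_comm": True,
        "reduce_scatter": True,
        "reduce_bucket_size": "auto",   # resolved by the HF Trainer integration
        "contiguous_gradients": True,
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
```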
|
- **Hardware:** 4 NVIDIA A100 GPUs (40 GB each)
- **Training Steps:** approximately 180,000 (ongoing)
- **Epochs:** 1 (ongoing)
- **Batch Size:** 2 per device (for both training and evaluation)
- **Gradient Accumulation Steps:** 4
- **Learning Rate:** 5e-5
- **Weight Decay:** 0.05
- **Learning Rate Scheduler:** cosine with restarts
- **Warmup Steps:** 15,000
- **Checkpointing:** every 10,000 steps
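
Assuming the Hugging Face `Trainer` is used together with the DeepSpeed configuration sketched above, these hyperparameters translate roughly into the following `TrainingArguments` (a sketch for illustration; the actual training script is not published):

```python
from transformers import TrainingArguments

# Hypothetical TrainingArguments mirroring the hyperparameters listed above.
training_args = TrainingArguments(
    output_dir="gpt-j-3.4B-kaz",        # assumed output directory
    num_train_epochs=1,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    weight_decay=0.05,
    lr_scheduler_type="cosine_with_restarts",
    warmup_steps=15_000,
    save_steps=10_000,
    deepspeed=ds_config,                # ZeRO Stage 2 config sketched above
)
```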
|
|
|
## Model Authors |
|
|
|
**Name:** Kadyrbek Nurgali |
|
- **Email:** nurgaliqadyrbek@gmail.com |
|
- **LinkedIn:** [Kadyrbek Nurgali](https://www.linkedin.com/in/nurgali-kadyrbek-504260231/) |