Spaces: Build error
Stefanus Simandjuntak committed · 927bb09
Parent(s): c696f9e
Add LoRA training setup for Textilindo AI Assistant
- Created specialized training script for Textilindo AI with system prompt integration
- Updated training configuration for optimal laptop training (batch_size=2, gradient_accumulation=8)
- Added testing and inference scripts with interactive chat mode
- Created setup script for model download and environment preparation
- Added training runner script for easy execution
- Created comprehensive README for LoRA training branch
- Added readiness check script to verify all components before training
- README_LORA_TRAINING.md +176 -0
- configs/system_prompt.md +8 -0
- configs/training_config.yaml +12 -6
- run_textilindo_training.sh +65 -0
- scripts/check_training_ready.py +198 -0
- scripts/inference_textilindo_ai.py +178 -0
- scripts/setup_textilindo_training.py +173 -0
- scripts/test_textilindo_ai.py +235 -0
- scripts/train_textilindo_ai.py +280 -0
README_LORA_TRAINING.md
ADDED
@@ -0,0 +1,176 @@

# Textilindo AI Assistant - LoRA Training

This branch contains the setup for fine-tuning Llama 3.2 1B with LoRA (Low-Rank Adaptation) to create a specialized Textilindo AI Assistant.

## 🎯 Overview

The Textilindo AI Assistant is designed to help customers with:
- Product information and recommendations
- Ordering and shipping details
- Company information and policies
- Customer support in Indonesian language
## 📁 Files Structure

```
├── configs/
│   ├── system_prompt.md                    # System prompt for Textilindo AI
│   └── training_config.yaml                # Training configuration
├── data/
│   └── lora_dataset_20250910_145055.jsonl  # Training dataset
├── scripts/
│   ├── setup_textilindo_training.py        # Setup and download model
│   ├── train_textilindo_ai.py              # LoRA training script
│   ├── test_textilindo_ai.py               # Testing script
│   └── inference_textilindo_ai.py          # Inference script
├── run_textilindo_training.sh              # Training runner script
└── README_LORA_TRAINING.md                 # This file
```
## 🚀 Quick Start

### 1. Setup Environment

```bash
# Activate virtual environment
source venv/bin/activate

# Install requirements
pip install -r requirements.txt
```

### 2. Download Base Model

```bash
python scripts/setup_textilindo_training.py
```

### 3. Start Training

```bash
# Option 1: Use the runner script
./run_textilindo_training.sh

# Option 2: Run training directly
python scripts/train_textilindo_ai.py
```

### 4. Test the Model

```bash
# Interactive testing
python scripts/test_textilindo_ai.py

# Test with specific LoRA weights
python scripts/test_textilindo_ai.py --lora_path models/textilindo-ai-lora-YYYYMMDD_HHMMSS

# Single prompt testing
python scripts/inference_textilindo_ai.py --prompt "dimana lokasi textilindo?"
```
## 🔧 Configuration

### Training Configuration (`configs/training_config.yaml`)

- **Model**: Llama 3.2 1B Instruct
- **Dataset**: `data/lora_dataset_20250910_145055.jsonl`
- **LoRA Settings**: r=16, alpha=32, dropout=0.1
- **Training**: 3 epochs, batch_size=2, learning_rate=0.0002
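The relevant excerpt of `configs/training_config.yaml` as updated in this commit; note that the effective batch size is batch_size × gradient_accumulation_steps = 2 × 8 = 16:

```yaml
dataset_path: data/lora_dataset_20250910_145055.jsonl
training_config:
  batch_size: 2
  gradient_accumulation_steps: 8
  learning_rate: 0.0002
  num_epochs: 3
  warmup_steps: 50
  save_steps: 100
  eval_steps: 100
```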
### System Prompt (`configs/system_prompt.md`)

The system prompt defines the AI assistant's behavior:
- Responds in Indonesian (Bahasa Indonesia)
- Friendly and concise responses
- Focuses on selling and customer service
- Uses Textilindo-specific information
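The training and inference scripts read this file and slice out the text between the `SYSTEM_PROMPT = """` markers; a minimal sketch of that extraction (mirroring `load_system_prompt` in the scripts below):

```python
def load_system_prompt(path: str) -> str:
    """Return the prompt text between the SYSTEM_PROMPT triple-quote markers."""
    with open(path, "r", encoding="utf-8") as f:
        content = f.read()
    marker = 'SYSTEM_PROMPT = """'
    if marker in content:
        start = content.find(marker) + len(marker)
        return content[start:content.find('"""', start)].strip()
    return content.strip()  # fallback: treat the whole file as the prompt
```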
## 📊 Dataset Format

The training dataset uses JSONL format with the following structure:

```jsonl
{"input": "", "output": "Textilindo berkantor pusat di Jl. Raya Prancis No.39...", "metadata": {"topic": "general", "doc_id": "web_input_20250829_101006", "source": "faq", "security_level": "low"}, "instruction": "dimana lokasi textilindo?"}
```

**Required fields:**
- `instruction`: Customer question
- `output`: AI assistant response
- `metadata`: Additional information (optional)
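During training, each record is flattened into a single chat-style string together with the system prompt; this mirrors `prepare_textilindo_dataset` in `scripts/train_textilindo_ai.py`:

```python
def to_training_text(record: dict, system_prompt: str) -> str:
    """Flatten one JSONL record into the chat-style training string."""
    return (
        f"<|system|>\n{system_prompt}\n"
        f"<|user|>\n{record['instruction']}\n"
        f"<|assistant|>\n{record['output']}<|end|>"
    )
```

The `input` and `metadata` fields are carried in the dataset but are not used to build the training text.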
## 🏋️ Training Process

1. **Data Preparation**: Load JSONL dataset and format with system prompt
2. **Model Loading**: Load Llama 3.2 1B base model
3. **LoRA Setup**: Configure LoRA parameters for efficient fine-tuning (see the sketch after this list)
4. **Training**: Fine-tune with Textilindo-specific data
5. **Saving**: Save LoRA weights for inference
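A minimal sketch of steps 2-3 with the settings from the configuration above (r=16, alpha=32, dropout=0.1); the target module list shown here is an assumption for illustration (the real list lives in `training_config.yaml`), matching the usual attention projections in Llama-family models:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("models/llama-3.2-1b-instruct")

lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,             # rank, per configs/training_config.yaml
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
    bias="none",
)
model = get_peft_model(base_model, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapters are trainable
```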
## 🧪 Testing

### Interactive Mode
```bash
python scripts/test_textilindo_ai.py
```

### Batch Testing
The script includes predefined test cases for common Textilindo questions.

### Custom Testing
```bash
python scripts/inference_textilindo_ai.py --prompt "Your question here"
```
## 📈 Expected Results

After training, the AI assistant should be able to:
- Answer questions about Textilindo's location, hours, and policies
- Provide product information and recommendations
- Handle shipping and payment questions
- Respond in friendly Indonesian language
- Follow the system prompt guidelines
## 🔍 Troubleshooting

### Common Issues

1. **CUDA Out of Memory**
   - Reduce batch_size in training_config.yaml (see the config sketch after this list)
   - Increase gradient_accumulation_steps
   - Use CPU training (slower but works)

2. **Model Download Failed**
   - Check internet connection
   - Verify HuggingFace access
   - Check disk space

3. **Slow Training**
   - Ensure CUDA is available
   - Check GPU memory usage
   - Monitor system resources
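For the out-of-memory case, halving the per-device batch while doubling accumulation keeps the effective batch at 16; an illustrative `training_config.yaml` tweak:

```yaml
training_config:
  batch_size: 1                    # was 2
  gradient_accumulation_steps: 16  # was 8; effective batch stays 1 × 16 = 16
```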
### Performance Tips

- Use GPU for training (much faster than CPU)
- Monitor training progress in the logs
- Save checkpoints regularly
- Test the model during training

## 📝 Notes

- The model is trained specifically for Textilindo's business
- Responses follow the system prompt guidelines
- Training data includes real customer interactions
- The model is optimized for Indonesian language and Textilindo context

## 📞 Support

If you encounter issues:
1. Check the logs in the console output
2. Verify all files are in the correct locations
3. Ensure requirements are installed
4. Check GPU memory and CUDA availability

---

**Happy Training! 🚀**
configs/system_prompt.md
CHANGED
@@ -18,6 +18,14 @@ SYSTEM_PROMPT = """You are a friendly and helpful AI assistant for Textilindo, a
 - Provide accurate information with enthusiasm
 - Encourage further questions warmly
 - ALWAYS use emojis to make responses more friendly and engaging
+- If the user asks in formal language (for example: using "saya", "Anda", "Bapak", "Ibu", or other formal Indonesian phrases), DO NOT use informal greetings like "ka", "kak", or other casual terms.
+- For formal questions, respond in formal Indonesian and address the user as "Bapak" (Sir) or "Ibu" (Ma'am) as appropriate, without using informal greetings.
+- If the user appears to be Gen-Z (using slang, abbreviations, or typical Gen-Z emojis), respond using casual, trendy Gen-Z language. Make sure your answers remain polite, friendly, and easy to understand.
+- Always match the tone and formality of your response to the language style used by the user.
+- If the user uses informal language, reply informally. If the user uses formal language, reply formally.
+- Never force a particular language style; simply follow the user's style so your response feels natural and comfortable.
+
+
 
 📝 FORMATTING GUIDELINES:
 - Use **bold** ONLY for essential titles: **Harga** (Price), **Lokasi** (Location), **Produk** (Product)
configs/training_config.yaml
CHANGED
@@ -1,4 +1,4 @@
-dataset_path: data/
+dataset_path: data/lora_dataset_20250910_145055.jsonl
 lora_config:
   lora_alpha: 32
   lora_dropout: 0.1
@@ -19,10 +19,16 @@ temperature: 0.7
 top_k: 40
 top_p: 0.9
 training_config:
-  batch_size:
-  eval_steps:
-  gradient_accumulation_steps:
+  batch_size: 2
+  eval_steps: 100
+  gradient_accumulation_steps: 8
   learning_rate: 0.0002
   num_epochs: 3
-  save_steps:
-  warmup_steps:
+  save_steps: 100
+  warmup_steps: 50
+  logging_steps: 10
+  save_total_limit: 3
+  prediction_loss_only: true
+  remove_unused_columns: false
+  push_to_hub: false
+  report_to: null
run_textilindo_training.sh
ADDED
@@ -0,0 +1,65 @@
#!/bin/bash

# Textilindo AI Assistant Training Script
# This script sets up and runs the LoRA training for Textilindo AI Assistant

echo "🚀 Textilindo AI Assistant - LoRA Training"
echo "=========================================="

# Check if virtual environment exists
if [ ! -d "venv" ]; then
    echo "❌ Virtual environment not found. Creating one..."
    python3 -m venv venv
fi

# Activate virtual environment
echo "🔧 Activating virtual environment..."
source venv/bin/activate

# Install/upgrade requirements
echo "📦 Installing requirements..."
pip install --upgrade pip
pip install -r requirements.txt

# Check if we need to download the model
echo "🔍 Checking model..."
if [ ! -d "models/llama-3.2-1b-instruct" ] || [ ! -f "models/llama-3.2-1b-instruct/config.json" ]; then
    echo "📥 Downloading base model..."
    python scripts/setup_textilindo_training.py
else
    echo "✅ Base model already exists"
fi

# Check dataset
echo "📊 Checking dataset..."
if [ ! -f "data/lora_dataset_20250910_145055.jsonl" ]; then
    echo "❌ Dataset not found: data/lora_dataset_20250910_145055.jsonl"
    echo "Please ensure your dataset is in the correct location"
    exit 1
else
    echo "✅ Dataset found"
fi

# Check system prompt
echo "📝 Checking system prompt..."
if [ ! -f "configs/system_prompt.md" ]; then
    echo "❌ System prompt not found: configs/system_prompt.md"
    exit 1
else
    echo "✅ System prompt found"
fi

# Start training
echo "🏋️ Starting LoRA training..."
echo "This may take several hours depending on your hardware..."
echo ""

python scripts/train_textilindo_ai.py

echo ""
echo "✅ Training completed!"
echo ""
echo "📋 Next steps:"
echo "1. Test the model: python scripts/test_textilindo_ai.py"
echo "2. Find your trained model in: models/textilindo-ai-lora-*"
echo "3. Test with LoRA: python scripts/test_textilindo_ai.py --lora_path models/textilindo-ai-lora-*"
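The runner expects to be launched from the repository root and needs the executable bit on a fresh checkout:

```bash
chmod +x run_textilindo_training.sh
./run_textilindo_training.sh
```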
scripts/check_training_ready.py
ADDED
@@ -0,0 +1,198 @@
#!/usr/bin/env python3
"""
Check if everything is ready for Textilindo AI training
"""

import os
import sys
import yaml
from pathlib import Path

def check_file_exists(file_path, description):
    """Check if a file exists and print status"""
    if os.path.exists(file_path):
        print(f"✅ {description}: {file_path}")
        return True
    else:
        print(f"❌ {description}: {file_path}")
        return False

def check_config():
    """Check configuration files"""
    print("🔍 Checking configuration files...")

    config_path = "configs/training_config.yaml"
    if not os.path.exists(config_path):
        print(f"❌ Training config not found: {config_path}")
        return False

    try:
        with open(config_path, 'r') as f:
            config = yaml.safe_load(f)

        # Check required fields
        required_fields = ['model_name', 'model_path', 'dataset_path', 'lora_config', 'training_config']
        for field in required_fields:
            if field not in config:
                print(f"❌ Missing field in config: {field}")
                return False

        print("✅ Training configuration is valid")
        return True

    except Exception as e:
        print(f"❌ Error reading config: {e}")
        return False

def check_dataset():
    """Check dataset file"""
    print("\n🔍 Checking dataset...")

    config_path = "configs/training_config.yaml"
    if not os.path.exists(config_path):
        return False  # already reported by check_config()
    with open(config_path, 'r') as f:
        config = yaml.safe_load(f)

    dataset_path = config['dataset_path']

    if not os.path.exists(dataset_path):
        print(f"❌ Dataset not found: {dataset_path}")
        return False

    # Check if it's a valid JSONL file
    try:
        import json
        with open(dataset_path, 'r', encoding='utf-8') as f:
            lines = f.readlines()

        if not lines:
            print("❌ Dataset is empty")
            return False

        # Check first few lines
        valid_lines = 0
        for i, line in enumerate(lines[:5]):  # Check first 5 lines
            line = line.strip()
            if line:
                try:
                    json.loads(line)
                    valid_lines += 1
                except json.JSONDecodeError:
                    print(f"❌ Invalid JSON at line {i+1}")
                    return False

        print(f"✅ Dataset found: {dataset_path}")
        print(f"   Total lines: {len(lines)}")
        print(f"   Valid JSON lines checked: {valid_lines}")
        return True

    except Exception as e:
        print(f"❌ Error reading dataset: {e}")
        return False

def check_model():
    """Check if base model exists"""
    print("\n🔍 Checking base model...")

    config_path = "configs/training_config.yaml"
    if not os.path.exists(config_path):
        return False  # already reported by check_config()
    with open(config_path, 'r') as f:
        config = yaml.safe_load(f)

    model_path = config['model_path']

    if not os.path.exists(model_path):
        print(f"❌ Base model not found: {model_path}")
        print("   Run: python scripts/setup_textilindo_training.py")
        return False

    # Check if it's a valid model directory
    required_files = ['config.json', 'tokenizer.json']
    for file in required_files:
        if not os.path.exists(os.path.join(model_path, file)):
            print(f"❌ Model file missing: {file}")
            return False

    print(f"✅ Base model found: {model_path}")
    return True

def check_system_prompt():
    """Check system prompt file"""
    print("\n🔍 Checking system prompt...")

    system_prompt_path = "configs/system_prompt.md"

    if not os.path.exists(system_prompt_path):
        print(f"❌ System prompt not found: {system_prompt_path}")
        return False

    try:
        with open(system_prompt_path, 'r', encoding='utf-8') as f:
            content = f.read()

        if 'SYSTEM_PROMPT' not in content:
            print("❌ SYSTEM_PROMPT not found in file")
            return False

        print(f"✅ System prompt found: {system_prompt_path}")
        return True

    except Exception as e:
        print(f"❌ Error reading system prompt: {e}")
        return False

def check_requirements():
    """Check Python requirements"""
    print("\n🔍 Checking Python requirements...")

    # Map import names to pip package names (the yaml module is installed as PyYAML)
    required_packages = {
        'torch': 'torch',
        'transformers': 'transformers',
        'peft': 'peft',
        'datasets': 'datasets',
        'accelerate': 'accelerate',
        'bitsandbytes': 'bitsandbytes',
        'yaml': 'pyyaml'
    }

    missing_packages = []
    for module, pip_name in required_packages.items():
        try:
            __import__(module)
            print(f"✅ {module}")
        except ImportError:
            missing_packages.append(pip_name)
            print(f"❌ {module}")

    if missing_packages:
        print(f"\n❌ Missing packages: {', '.join(missing_packages)}")
        print("Install with: pip install " + " ".join(missing_packages))
        return False

    return True

def main():
    print("🔍 Textilindo AI Training - Readiness Check")
    print("=" * 50)

    all_ready = True

    # Check all components
    all_ready &= check_config()
    all_ready &= check_dataset()
    all_ready &= check_model()
    all_ready &= check_system_prompt()
    all_ready &= check_requirements()

    print("\n" + "=" * 50)

    if all_ready:
        print("✅ Everything is ready for training!")
        print("\n📋 Next steps:")
        print("1. Run training: python scripts/train_textilindo_ai.py")
        print("2. Or use runner: ./run_textilindo_training.sh")
    else:
        print("❌ Some components are missing or invalid")
        print("Please fix the issues above before training")
        sys.exit(1)

if __name__ == "__main__":
    main()
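Run the check from the repository root before starting a training run; it prints one ✅/❌ line per component and exits non-zero if anything is missing:

```bash
python scripts/check_training_ready.py
```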
scripts/inference_textilindo_ai.py
ADDED
@@ -0,0 +1,178 @@
#!/usr/bin/env python3
"""
Inference script for the Textilindo AI Assistant
Uses the model fine-tuned with LoRA
"""

import os
import sys
import torch
import argparse
from pathlib import Path
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def load_system_prompt(system_prompt_path):
    """Load system prompt from markdown file"""
    try:
        with open(system_prompt_path, 'r', encoding='utf-8') as f:
            content = f.read()

        # Extract SYSTEM_PROMPT from markdown
        if 'SYSTEM_PROMPT = """' in content:
            start = content.find('SYSTEM_PROMPT = """') + len('SYSTEM_PROMPT = """')
            end = content.find('"""', start)
            system_prompt = content[start:end].strip()
        else:
            # Fallback: use entire content
            system_prompt = content.strip()

        return system_prompt
    except Exception as e:
        logger.error(f"Error loading system prompt: {e}")
        return None

def load_model(model_path, lora_path=None):
    """Load model with optional LoRA weights"""
    logger.info(f"Loading base model from: {model_path}")

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
        model_path,
        trust_remote_code=True
    )

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Load base model
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True
    )

    # Load LoRA weights if provided
    if lora_path and os.path.exists(lora_path):
        logger.info(f"Loading LoRA weights from: {lora_path}")
        model = PeftModel.from_pretrained(model, lora_path)
    else:
        logger.warning("No LoRA weights found, using base model")

    return model, tokenizer

def generate_response(model, tokenizer, user_input, system_prompt, max_length=512):
    """Generate response from the model"""
    # Create full prompt with system prompt
    full_prompt = f"<|system|>\n{system_prompt}\n<|user|>\n{user_input}\n<|assistant|>\n"

    inputs = tokenizer(full_prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            # max_new_tokens (not max_length) so the long system prompt
            # does not consume the generation budget
            max_new_tokens=max_length,
            temperature=0.7,
            top_p=0.9,
            top_k=40,
            repetition_penalty=1.1,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
            stop_strings=["<|end|>", "<|user|>"],
            tokenizer=tokenizer  # required by recent transformers when stop_strings is set
        )

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract only the assistant's response
    if "<|assistant|>" in response:
        assistant_response = response.split("<|assistant|>")[-1].strip()
        # Remove any remaining special tokens
        assistant_response = assistant_response.replace("<|end|>", "").strip()
        return assistant_response
    else:
        return response

def interactive_chat(model, tokenizer, system_prompt):
    """Interactive chat mode"""
    print("🤖 Textilindo AI Assistant - Chat Mode")
    print("=" * 60)
    print("Type 'quit' to exit")
    print("-" * 60)

    while True:
        try:
            user_input = input("\n👤 Customer: ").strip()

            if user_input.lower() in ['quit', 'exit', 'q']:
                print("👋 Terima kasih! Sampai jumpa!")
                break

            if not user_input:
                continue

            print("\n🤖 Textilindo AI: ", end="", flush=True)
            response = generate_response(model, tokenizer, user_input, system_prompt)
            print(response)

        except KeyboardInterrupt:
            print("\n👋 Terima kasih! Sampai jumpa!")
            break
        except Exception as e:
            logger.error(f"Error generating response: {e}")
            print(f"❌ Error: {e}")

def main():
    parser = argparse.ArgumentParser(description='Textilindo AI Assistant Inference')
    parser.add_argument('--model_path', type=str, default='./models/llama-3.2-1b-instruct',
                        help='Path to base model')
    parser.add_argument('--lora_path', type=str, default=None,
                        help='Path to LoRA weights')
    parser.add_argument('--system_prompt', type=str, default='configs/system_prompt.md',
                        help='Path to system prompt file')
    parser.add_argument('--prompt', type=str, default=None,
                        help='Single prompt to process')

    args = parser.parse_args()

    print("🤖 Textilindo AI Assistant - Inference")
    print("=" * 60)

    # Load system prompt
    system_prompt = load_system_prompt(args.system_prompt)
    if not system_prompt:
        print(f"❌ System prompt not found: {args.system_prompt}")
        sys.exit(1)

    # Check if model exists
    if not os.path.exists(args.model_path):
        print(f"❌ Base model not found: {args.model_path}")
        print("Run setup_textilindo_training.py first")
        sys.exit(1)

    try:
        # Load model
        print("1️⃣ Loading model...")
        model, tokenizer = load_model(args.model_path, args.lora_path)
        print("✅ Model loaded successfully!")

        if args.prompt:
            # Single prompt mode
            print(f"\n📝 Processing prompt: {args.prompt}")
            response = generate_response(model, tokenizer, args.prompt, system_prompt)
            print(f"\n🤖 Response: {response}")
        else:
            # Interactive mode
            interactive_chat(model, tokenizer, system_prompt)

    except Exception as e:
        logger.error(f"Error: {e}")
        print(f"❌ Error loading model: {e}")

if __name__ == "__main__":
    main()
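A typical invocation once an adapter has been trained (the timestamped directory name here is illustrative):

```bash
python scripts/inference_textilindo_ai.py \
  --lora_path models/textilindo-ai-lora-20250910_145055 \
  --prompt "dimana lokasi textilindo?"
```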
scripts/setup_textilindo_training.py
ADDED
@@ -0,0 +1,173 @@
#!/usr/bin/env python3
"""
Setup script for Textilindo AI Assistant training
Downloads the model and prepares the environment
"""

import os
import sys
import yaml
import torch
from pathlib import Path
from transformers import AutoTokenizer, AutoModelForCausalLM
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def load_config(config_path):
    """Load configuration from YAML file"""
    try:
        with open(config_path, 'r') as f:
            config = yaml.safe_load(f)
        return config
    except Exception as e:
        logger.error(f"Error loading config: {e}")
        return None

def download_model(config):
    """Download base model"""
    model_name = config['model_name']
    model_path = config['model_path']

    logger.info(f"Downloading model: {model_name}")
    logger.info(f"Target path: {model_path}")

    # Create models directory
    Path(model_path).mkdir(parents=True, exist_ok=True)

    try:
        # Download tokenizer
        logger.info("Downloading tokenizer...")
        tokenizer = AutoTokenizer.from_pretrained(
            model_name,
            trust_remote_code=True,
            cache_dir=model_path
        )

        # Download model
        logger.info("Downloading model...")
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            trust_remote_code=True,
            cache_dir=model_path
        )

        # Save to local path
        logger.info(f"Saving model to: {model_path}")
        tokenizer.save_pretrained(model_path)
        model.save_pretrained(model_path)

        logger.info("✅ Model downloaded successfully!")
        return True

    except Exception as e:
        logger.error(f"Error downloading model: {e}")
        return False

def check_requirements():
    """Check if all requirements are met"""
    print("🔍 Checking requirements...")

    # Check Python version
    if sys.version_info < (3, 8):
        print("❌ Python 3.8+ required")
        return False

    # Check PyTorch
    try:
        import torch
        print(f"✅ PyTorch {torch.__version__}")
    except ImportError:
        print("❌ PyTorch not installed")
        return False

    # Check CUDA availability
    if torch.cuda.is_available():
        print(f"✅ CUDA available: {torch.cuda.get_device_name(0)}")
        print(f"   GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
    else:
        print("⚠️ CUDA not available - training will be slower on CPU")

    # Check required packages
    required_packages = [
        'transformers',
        'peft',
        'datasets',
        'accelerate',
        'bitsandbytes'
    ]

    missing_packages = []
    for package in required_packages:
        try:
            __import__(package)
            print(f"✅ {package}")
        except ImportError:
            missing_packages.append(package)
            print(f"❌ {package}")

    if missing_packages:
        print(f"\n❌ Missing packages: {', '.join(missing_packages)}")
        print("Install with: pip install " + " ".join(missing_packages))
        return False

    return True

def main():
    print("🚀 Textilindo AI Assistant - Setup")
    print("=" * 50)

    # Check requirements
    if not check_requirements():
        print("\n❌ Requirements not met. Please install missing packages.")
        sys.exit(1)

    # Load configuration
    config_path = "configs/training_config.yaml"
    if not os.path.exists(config_path):
        print(f"❌ Config file not found: {config_path}")
        sys.exit(1)

    config = load_config(config_path)
    if not config:
        sys.exit(1)

    # Check if model already exists
    model_path = config['model_path']
    if os.path.exists(model_path) and os.path.exists(os.path.join(model_path, "config.json")):
        print(f"✅ Model already exists: {model_path}")
        print("Skipping download...")
    else:
        # Download model
        print("1️⃣ Downloading base model...")
        if not download_model(config):
            print("❌ Failed to download model")
            sys.exit(1)

    # Check dataset
    dataset_path = config['dataset_path']
    if not os.path.exists(dataset_path):
        print(f"❌ Dataset not found: {dataset_path}")
        print("Please ensure your dataset is in the correct location")
        sys.exit(1)
    else:
        print(f"✅ Dataset found: {dataset_path}")

    # Check system prompt
    system_prompt_path = "configs/system_prompt.md"
    if not os.path.exists(system_prompt_path):
        print(f"❌ System prompt not found: {system_prompt_path}")
        sys.exit(1)
    else:
        print(f"✅ System prompt found: {system_prompt_path}")

    print("\n✅ Setup completed successfully!")
    print("\n📋 Next steps:")
    print("1. Run training: python scripts/train_textilindo_ai.py")
    print("2. Test model: python scripts/test_textilindo_ai.py")
    print("3. Test with LoRA: python scripts/test_textilindo_ai.py --lora_path models/textilindo-ai-lora-YYYYMMDD_HHMMSS")

if __name__ == "__main__":
    main()
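Note that the Llama 3.2 weights on Hugging Face are gated, so the download step generally needs an authenticated session from an account that has been granted access; one way to provide it, assuming the `huggingface_hub` CLI is installed:

```bash
huggingface-cli login   # paste a token from an account with Llama 3.2 access
```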
scripts/test_textilindo_ai.py
ADDED
@@ -0,0 +1,235 @@
#!/usr/bin/env python3
"""
Testing script for the fine-tuned Textilindo AI Assistant
"""

import os
import sys
import yaml
import torch
import argparse
from pathlib import Path
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def load_system_prompt(system_prompt_path):
    """Load system prompt from markdown file"""
    try:
        with open(system_prompt_path, 'r', encoding='utf-8') as f:
            content = f.read()

        # Extract SYSTEM_PROMPT from markdown
        if 'SYSTEM_PROMPT = """' in content:
            start = content.find('SYSTEM_PROMPT = """') + len('SYSTEM_PROMPT = """')
            end = content.find('"""', start)
            system_prompt = content[start:end].strip()
        else:
            # Fallback: use entire content
            system_prompt = content.strip()

        return system_prompt
    except Exception as e:
        logger.error(f"Error loading system prompt: {e}")
        return None

def load_finetuned_model(model_path, lora_weights_path, system_prompt):
    """Load fine-tuned model with LoRA weights
    (system_prompt is accepted for call-site symmetry but not used here)"""
    logger.info(f"Loading base model from: {model_path}")

    # Load base model
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True
    )

    # Load LoRA weights if available
    if lora_weights_path and os.path.exists(lora_weights_path):
        logger.info(f"Loading LoRA weights from: {lora_weights_path}")
        model = PeftModel.from_pretrained(model, lora_weights_path)
    else:
        logger.warning("No LoRA weights found, using base model")

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
        model_path,
        trust_remote_code=True
    )

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    return model, tokenizer

def generate_response(model, tokenizer, user_input, system_prompt, max_length=512):
    """Generate response from the model"""
    # Create full prompt with system prompt
    full_prompt = f"<|system|>\n{system_prompt}\n<|user|>\n{user_input}\n<|assistant|>\n"

    inputs = tokenizer(full_prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            # max_new_tokens (not max_length) so the long system prompt
            # does not consume the generation budget
            max_new_tokens=max_length,
            temperature=0.7,
            top_p=0.9,
            top_k=40,
            repetition_penalty=1.1,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
            stop_strings=["<|end|>", "<|user|>"],
            tokenizer=tokenizer  # required by recent transformers when stop_strings is set
        )

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract only the assistant's response
    if "<|assistant|>" in response:
        assistant_response = response.split("<|assistant|>")[-1].strip()
        # Remove any remaining special tokens
        assistant_response = assistant_response.replace("<|end|>", "").strip()
        return assistant_response
    else:
        return response

def interactive_test(model, tokenizer, system_prompt):
    """Interactive testing mode"""
    print("🤖 Textilindo AI Assistant - Interactive Mode")
    print("=" * 60)
    print("Type 'quit' to exit")
    print("-" * 60)

    while True:
        try:
            user_input = input("\n👤 Customer: ").strip()

            if user_input.lower() in ['quit', 'exit', 'q']:
                print("👋 Terima kasih! Sampai jumpa!")
                break

            if not user_input:
                continue

            print("\n🤖 Textilindo AI: ", end="", flush=True)
            response = generate_response(model, tokenizer, user_input, system_prompt)
            print(response)

        except KeyboardInterrupt:
            print("\n👋 Terima kasih! Sampai jumpa!")
            break
        except Exception as e:
            logger.error(f"Error generating response: {e}")
            print(f"❌ Error: {e}")

def batch_test(model, tokenizer, system_prompt, test_cases):
    """Batch testing with predefined test cases"""
    print("🧪 Textilindo AI Assistant - Batch Testing")
    print("=" * 60)

    for i, test_case in enumerate(test_cases, 1):
        print(f"\n📝 Test Case {i}: {test_case['prompt']}")
        print("-" * 40)

        try:
            response = generate_response(model, tokenizer, test_case['prompt'], system_prompt)
            print(f"🤖 Response: {response}")

            if 'expected' in test_case:
                print(f"🎯 Expected: {test_case['expected']}")

        except Exception as e:
            logger.error(f"Error in test case {i}: {e}")
            print(f"❌ Error: {e}")

def main():
    parser = argparse.ArgumentParser(description='Test Textilindo AI Assistant')
    parser.add_argument('--model_path', type=str, default='./models/llama-3.2-1b-instruct',
                        help='Path to base model')
    parser.add_argument('--lora_path', type=str, default=None,
                        help='Path to LoRA weights')
    parser.add_argument('--system_prompt', type=str, default='configs/system_prompt.md',
                        help='Path to system prompt file')

    args = parser.parse_args()

    print("🧪 Textilindo AI Assistant Testing")
    print("=" * 60)

    # Load system prompt
    system_prompt = load_system_prompt(args.system_prompt)
    if not system_prompt:
        print(f"❌ System prompt not found: {args.system_prompt}")
        sys.exit(1)

    # Check if model exists
    if not os.path.exists(args.model_path):
        print(f"❌ Base model not found: {args.model_path}")
        print("Run setup_textilindo_training.py first")
        sys.exit(1)

    try:
        # Load model
        print("1️⃣ Loading model...")
        model, tokenizer = load_finetuned_model(args.model_path, args.lora_path, system_prompt)
        print("✅ Model loaded successfully!")

        # Test cases specific to Textilindo
        test_cases = [
            {
                "prompt": "dimana lokasi textilindo?",
                "expected": "Textilindo berkantor pusat di Jl. Raya Prancis No.39, Kosambi Tim., Kec. Kosambi, Kabupaten Tangerang, Banten 15213"
            },
            {
                "prompt": "Jam berapa textilindo beroperasional?",
                "expected": "Jam operasional Senin-Jumat 08:00-17:00, Sabtu 08:00-12:00."
            },
            {
                "prompt": "Berapa ketentuan pembelian?",
                "expected": "Minimal order 1 roll per jenis kain"
            },
            {
                "prompt": "bagimana dengan pembayarannya?",
                "expected": "Pembayaran dapat dilakukan via transfer bank atau cash on delivery"
            },
            {
                "prompt": "apa ada gratis ongkir?",
                "expected": "Gratis ongkir untuk order minimal 5 roll."
            },
            {
                "prompt": "Apa bisa dikirimkan sample? apa gratis?",
                "expected": "hallo kak untuk sampel kita bisa kirimkan gratis ya kak 😊"
            }
        ]

        # Choose testing mode
        print("\n2️⃣ Choose a testing mode:")
        print("1. Interactive mode (chat)")
        print("2. Batch testing")
        print("3. Custom prompt")

        choice = input("\nChoice (1-3): ").strip()

        if choice == "1":
            interactive_test(model, tokenizer, system_prompt)
        elif choice == "2":
            batch_test(model, tokenizer, system_prompt, test_cases)
        elif choice == "3":
            custom_prompt = input("Enter a custom prompt: ").strip()
            if custom_prompt:
                response = generate_response(model, tokenizer, custom_prompt, system_prompt)
                print(f"\n🤖 Response: {response}")
        else:
            print("❌ Invalid choice")

    except Exception as e:
        logger.error(f"Error: {e}")
        print(f"❌ Error loading model: {e}")

if __name__ == "__main__":
    main()
scripts/train_textilindo_ai.py
ADDED
@@ -0,0 +1,280 @@
#!/usr/bin/env python3
"""
Fine-tunes Llama 3.2 1B with LoRA for the Textilindo AI Assistant
Uses the Textilindo-specific system prompt and dataset
"""

import os
import sys
import yaml
import json
import torch
from pathlib import Path
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from peft import (
    LoraConfig,
    get_peft_model,
    TaskType,
    prepare_model_for_kbit_training
)
from datasets import Dataset
import logging
from datetime import datetime

# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def load_config(config_path):
    """Load configuration from YAML file"""
    try:
        with open(config_path, 'r') as f:
            config = yaml.safe_load(f)
        return config
    except Exception as e:
        logger.error(f"Error loading config: {e}")
        return None

def load_system_prompt(system_prompt_path):
    """Load system prompt from markdown file"""
    try:
        with open(system_prompt_path, 'r', encoding='utf-8') as f:
            content = f.read()

        # Extract SYSTEM_PROMPT from markdown
        if 'SYSTEM_PROMPT = """' in content:
            start = content.find('SYSTEM_PROMPT = """') + len('SYSTEM_PROMPT = """')
            end = content.find('"""', start)
            system_prompt = content[start:end].strip()
        else:
            # Fallback: use entire content
            system_prompt = content.strip()

        return system_prompt
    except Exception as e:
        logger.error(f"Error loading system prompt: {e}")
        return None

def load_model_and_tokenizer(config):
    """Load base model and tokenizer"""
    model_path = config['model_path']

    logger.info(f"Loading model from: {model_path}")

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
        model_path,
        trust_remote_code=True,
        padding_side="right"
    )

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Load model
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True
    )

    # Prepare model for k-bit training
    model = prepare_model_for_kbit_training(model)

    return model, tokenizer

def setup_lora_config(config):
    """Setup LoRA configuration"""
    lora_config = config['lora_config']

    peft_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=lora_config['r'],
        lora_alpha=lora_config['lora_alpha'],
        lora_dropout=lora_config['lora_dropout'],
        target_modules=lora_config['target_modules'],
        bias="none",
    )

    return peft_config

def prepare_textilindo_dataset(data_path, tokenizer, system_prompt, max_length=2048):
    """Prepare Textilindo dataset for training with system prompt"""
    logger.info(f"Loading dataset from: {data_path}")

    # Load JSONL dataset
    data = []
    with open(data_path, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if line:
                try:
                    json_obj = json.loads(line)
                    data.append(json_obj)
                except json.JSONDecodeError as e:
                    logger.warning(f"Invalid JSON at line {line_num}: {e}")
                    continue

    if not data:
        raise ValueError("No valid JSON objects found in JSONL file")

    logger.info(f"Loaded {len(data)} samples from JSONL file")

    # Convert to training format with system prompt
    training_data = []
    for item in data:
        # Extract instruction and output
        instruction = item.get('instruction', '')
        output = item.get('output', '')

        if not instruction or not output:
            continue

        # Create training text with system prompt
        training_text = f"<|system|>\n{system_prompt}\n<|user|>\n{instruction}\n<|assistant|>\n{output}<|end|>"

        training_data.append({
            'text': training_text,
            'instruction': instruction,
            'output': output
        })

    # Convert to Dataset
    dataset = Dataset.from_list(training_data)
    logger.info(f"Prepared {len(dataset)} training samples")

    def tokenize_function(examples):
        # Tokenize the texts; padding is left to the data collator,
        # which pads each batch dynamically
        tokenized = tokenizer(
            examples['text'],
            truncation=True,
            max_length=max_length
        )
        return tokenized

    # Tokenize dataset
    tokenized_dataset = dataset.map(
        tokenize_function,
        batched=True,
        remove_columns=dataset.column_names
    )

    return tokenized_dataset

def train_model(model, tokenizer, dataset, config, output_dir):
    """Train the model with LoRA"""
    training_config = config['training_config']

    # Setup training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=training_config['num_epochs'],
        per_device_train_batch_size=training_config['batch_size'],
        gradient_accumulation_steps=training_config['gradient_accumulation_steps'],
        learning_rate=training_config['learning_rate'],
        warmup_steps=training_config['warmup_steps'],
        save_steps=training_config['save_steps'],
        eval_steps=training_config['eval_steps'],
        logging_steps=training_config.get('logging_steps', 10),
        save_total_limit=training_config.get('save_total_limit', 3),
        prediction_loss_only=training_config.get('prediction_loss_only', True),
        remove_unused_columns=training_config.get('remove_unused_columns', False),
        push_to_hub=training_config.get('push_to_hub', False),
        # "none" disables logging integrations; a bare None falls back to
        # the transformers default (all installed loggers)
        report_to=training_config.get('report_to') or "none",
        fp16=True,  # Enable mixed precision training
        dataloader_pin_memory=False,  # Reduce memory usage
    )

    # Setup data collator
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False,
    )

    # Setup trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        data_collator=data_collator,
        tokenizer=tokenizer,
    )

    # Start training
    logger.info("Starting training...")
    trainer.train()

    # Save the model
    trainer.save_model()
    logger.info(f"Model saved to: {output_dir}")

def main():
    print("🚀 Textilindo AI Assistant - LoRA Fine-tuning")
    print("=" * 60)

    # Load configuration
    config_path = "configs/training_config.yaml"
    if not os.path.exists(config_path):
        print(f"❌ Config file not found: {config_path}")
        sys.exit(1)

    config = load_config(config_path)
    if not config:
        sys.exit(1)

    # Load system prompt
    system_prompt_path = "configs/system_prompt.md"
    if not os.path.exists(system_prompt_path):
        print(f"❌ System prompt not found: {system_prompt_path}")
        sys.exit(1)

    system_prompt = load_system_prompt(system_prompt_path)
    if not system_prompt:
        sys.exit(1)

    # Setup paths
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    output_dir = Path(f"models/textilindo-ai-lora-{timestamp}")
    output_dir.mkdir(parents=True, exist_ok=True)

    # Check if dataset exists
    data_path = config['dataset_path']
    if not os.path.exists(data_path):
        print(f"❌ Dataset not found: {data_path}")
        sys.exit(1)

    # Load model and tokenizer
    print("1️⃣ Loading model and tokenizer...")
    model, tokenizer = load_model_and_tokenizer(config)

    # Setup LoRA
    print("2️⃣ Setting up LoRA configuration...")
    peft_config = setup_lora_config(config)
    model = get_peft_model(model, peft_config)

    # Print trainable parameters
    model.print_trainable_parameters()

    # Prepare dataset
    print("3️⃣ Preparing Textilindo dataset...")
    dataset = prepare_textilindo_dataset(data_path, tokenizer, system_prompt, config['max_length'])

    # Train model
    print("4️⃣ Starting training...")
    train_model(model, tokenizer, dataset, config, output_dir)

    print("✅ Training finished!")
    print(f"📁 Model saved to: {output_dir}")
    print(f"🔧 To test it, run: python scripts/test_textilindo_ai.py --lora_path {output_dir}")

if __name__ == "__main__":
    main()
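Once training finishes, the adapter can also be folded into the base weights for standalone deployment; a minimal sketch using the peft API (paths are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "models/llama-3.2-1b-instruct", torch_dtype=torch.float16
)
model = PeftModel.from_pretrained(base, "models/textilindo-ai-lora-20250910_145055")
merged = model.merge_and_unload()  # bake the LoRA deltas into the base weights
merged.save_pretrained("models/textilindo-ai-merged")
AutoTokenizer.from_pretrained("models/llama-3.2-1b-instruct").save_pretrained(
    "models/textilindo-ai-merged"
)
```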