Stefanus Simandjuntak committed
Commit 927bb09 · 1 Parent(s): c696f9e

Add LoRA training setup for Textilindo AI Assistant

- Created specialized training script for Textilindo AI with system prompt integration
- Updated training configuration for optimal laptop training (batch_size=2, gradient_accumulation=8)
- Added testing and inference scripts with interactive chat mode
- Created setup script for model download and environment preparation
- Added training runner script for easy execution
- Created comprehensive README for LoRA training branch
- Added readiness check script to verify all components before training

README_LORA_TRAINING.md ADDED
@@ -0,0 +1,176 @@
# Textilindo AI Assistant - LoRA Training

This branch contains the setup for fine-tuning Llama 3.2 1B with LoRA (Low-Rank Adaptation) to create a specialized Textilindo AI Assistant.

## 🎯 Overview

The Textilindo AI Assistant is designed to help customers with:
- Product information and recommendations
- Ordering and shipping details
- Company information and policies
- Customer support in Indonesian

## 📁 File Structure

```
├── configs/
│   ├── system_prompt.md                    # System prompt for Textilindo AI
│   └── training_config.yaml                # Training configuration
├── data/
│   └── lora_dataset_20250910_145055.jsonl  # Training dataset
├── scripts/
│   ├── setup_textilindo_training.py        # Setup and model download
│   ├── train_textilindo_ai.py              # LoRA training script
│   ├── test_textilindo_ai.py               # Testing script
│   ├── check_training_ready.py             # Pre-training readiness check
│   └── inference_textilindo_ai.py          # Inference script
├── run_textilindo_training.sh              # Training runner script
└── README_LORA_TRAINING.md                 # This file
```

## 🚀 Quick Start

### 1. Setup Environment

```bash
# Activate the virtual environment
source venv/bin/activate

# Install requirements
pip install -r requirements.txt
```

### 2. Download Base Model

```bash
python scripts/setup_textilindo_training.py
```
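
Optionally, verify that the config, dataset, base model, system prompt, and dependencies are all in place before training:

```bash
python scripts/check_training_ready.py
```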

### 3. Start Training

```bash
# Option 1: Use the runner script
./run_textilindo_training.sh

# Option 2: Run training directly
python scripts/train_textilindo_ai.py
```
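
If the runner script is not executable yet, mark it with `chmod +x run_textilindo_training.sh` first.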

### 4. Test the Model

```bash
# Interactive testing
python scripts/test_textilindo_ai.py

# Test with specific LoRA weights
python scripts/test_textilindo_ai.py --lora_path models/textilindo-ai-lora-YYYYMMDD_HHMMSS

# Single prompt testing
python scripts/inference_textilindo_ai.py --prompt "dimana lokasi textilindo?"
```

## 🔧 Configuration

### Training Configuration (`configs/training_config.yaml`)

- **Model**: Llama 3.2 1B Instruct
- **Dataset**: `data/lora_dataset_20250910_145055.jsonl`
- **LoRA Settings**: r=16, alpha=32, dropout=0.1 (see the `LoraConfig` sketch below)
- **Training**: 3 epochs, batch_size=2, gradient_accumulation_steps=8 (effective batch size 16), learning_rate=0.0002
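
A minimal sketch of how these values are wired into `peft`, mirroring `setup_lora_config()` in `scripts/train_textilindo_ai.py` (the `target_modules` list shown here is a placeholder; the real list lives in the YAML):

```python
from peft import LoraConfig, TaskType

# LoRA hyperparameters as set in configs/training_config.yaml
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,              # rank of the low-rank update matrices
    lora_alpha=32,     # scaling factor applied to the LoRA update
    lora_dropout=0.1,  # dropout on the LoRA layers
    target_modules=["q_proj", "v_proj"],  # placeholder: taken from the YAML in practice
    bias="none",
)
```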

### System Prompt (`configs/system_prompt.md`)

The system prompt defines the AI assistant's behavior:
- Responds in Indonesian (Bahasa Indonesia)
- Friendly and concise responses
- Focuses on selling and customer service
- Uses Textilindo-specific information

## 📊 Dataset Format

The training dataset uses the JSONL format (one JSON object per line) with the following structure:

```jsonl
{"input": "", "output": "Textilindo berkantor pusat di Jl. Raya Prancis No.39...", "metadata": {"topic": "general", "doc_id": "web_input_20250829_101006", "source": "faq", "security_level": "low"}, "instruction": "dimana lokasi textilindo?"}
```

**Fields:**
- `instruction` (required): Customer question
- `output` (required): AI assistant response
- `metadata` (optional): Additional information
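
A minimal sketch of how the training script consumes these records (rows missing `instruction` or `output` are skipped, matching `prepare_textilindo_dataset()` in `scripts/train_textilindo_ai.py`):

```python
import json

samples = []
with open("data/lora_dataset_20250910_145055.jsonl", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        # Only complete instruction/output pairs are used for training
        if row.get("instruction") and row.get("output"):
            samples.append(row)

print(f"{len(samples)} usable samples")
```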

## 🏋️ Training Process

1. **Data Preparation**: Load the JSONL dataset and format each sample with the system prompt (using the template shown below)
2. **Model Loading**: Load the Llama 3.2 1B base model
3. **LoRA Setup**: Configure LoRA parameters for efficient fine-tuning
4. **Training**: Fine-tune on the Textilindo-specific data
5. **Saving**: Save the LoRA weights for inference
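
Each sample is rendered into the following chat template before tokenization (this is the exact format built in `prepare_textilindo_dataset()`):

```
<|system|>
{system_prompt}
<|user|>
{instruction}
<|assistant|>
{output}<|end|>
```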

## 🧪 Testing

### Interactive Mode
```bash
python scripts/test_textilindo_ai.py
```

### Batch Testing
The script includes predefined test cases for common Textilindo questions.

### Custom Testing
```bash
python scripts/inference_textilindo_ai.py --prompt "Your question here"
```

## 📈 Expected Results

After training, the AI assistant should be able to:
- Answer questions about Textilindo's location, hours, and policies
- Provide product information and recommendations
- Handle shipping and payment questions
- Respond in friendly, natural Indonesian
- Follow the system prompt guidelines

## 🔍 Troubleshooting

### Common Issues

1. **CUDA Out of Memory**
   - Reduce `batch_size` in `training_config.yaml` (see the example after this list)
   - Increase `gradient_accumulation_steps` to keep the effective batch size
   - Fall back to CPU training (slower, but it works)

2. **Model Download Failed**
   - Check your internet connection
   - Verify HuggingFace access
   - Check available disk space

3. **Slow Training**
   - Ensure CUDA is available
   - Check GPU memory usage
   - Monitor system resources
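
For example, a lower-memory configuration could look like this (illustrative values; 1 × 16 keeps the effective batch size at 16):

```yaml
training_config:
  batch_size: 1
  gradient_accumulation_steps: 16
```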

### Performance Tips

- Use a GPU for training (much faster than CPU)
- Monitor training progress in the logs
- Save checkpoints regularly
- Test the model during training

## 📝 Notes

- The model is trained specifically for Textilindo's business
- Responses follow the system prompt guidelines
- The training data includes real customer interactions
- The model is optimized for Indonesian and the Textilindo context

## 🆘 Support

If you encounter issues:
1. Check the logs in the console output
2. Verify all files are in the correct locations
3. Ensure requirements are installed
4. Check GPU memory and CUDA availability

---

**Happy Training! 🚀**
configs/system_prompt.md CHANGED
@@ -18,6 +18,14 @@ SYSTEM_PROMPT = """You are a friendly and helpful AI assistant for Textilindo, a
 - Provide accurate information with enthusiasm
 - Encourage further questions warmly
 - ALWAYS use emojis to make responses more friendly and engaging
+- If the user asks in formal language (for example: using "saya", "Anda", "Bapak", "Ibu", or other formal Indonesian phrases), DO NOT use informal greetings like "ka", "kak", or other casual terms.
+- For formal questions, respond in formal Indonesian and address the user as "Bapak" (Sir) or "Ibu" (Ma'am) as appropriate, without using informal greetings.
+- If the user appears to be Gen-Z (using slang, abbreviations, or typical Gen-Z emojis), respond using casual, trendy Gen-Z language. Make sure your answers remain polite, friendly, and easy to understand.
+- Always match the tone and formality of your response to the language style used by the user.
+- If the user uses informal language, reply informally. If the user uses formal language, reply formally.
+- Never force a particular language style; simply follow the user's style so your response feels natural and comfortable.
+
+

 📝 FORMATTING GUIDELINES:
 - Use **bold** ONLY for essential titles: **Harga** (Price), **Lokasi** (Location), **Produk** (Product)
configs/training_config.yaml CHANGED
@@ -1,4 +1,4 @@
-dataset_path: data/textilindo_training_data.jsonl
+dataset_path: data/lora_dataset_20250910_145055.jsonl
 lora_config:
   lora_alpha: 32
   lora_dropout: 0.1
@@ -19,10 +19,16 @@ temperature: 0.7
 top_k: 40
 top_p: 0.9
 training_config:
-  batch_size: 4
-  eval_steps: 500
-  gradient_accumulation_steps: 4
+  batch_size: 2
+  eval_steps: 100
+  gradient_accumulation_steps: 8
   learning_rate: 0.0002
   num_epochs: 3
-  save_steps: 500
-  warmup_steps: 100
+  save_steps: 100
+  warmup_steps: 50
+  logging_steps: 10
+  save_total_limit: 3
+  prediction_loss_only: true
+  remove_unused_columns: false
+  push_to_hub: false
+  report_to: null
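
Note that the effective batch size is unchanged: 2 × 8 = 16 now versus 4 × 4 = 16 before. The smaller per-device batch mainly lowers peak GPU memory (the point of the laptop-friendly settings), and the shorter save/eval intervals (100 steps instead of 500) yield more frequent checkpoints.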
run_textilindo_training.sh ADDED
@@ -0,0 +1,65 @@
#!/bin/bash

# Textilindo AI Assistant Training Script
# Sets up the environment and runs the LoRA training for the Textilindo AI Assistant

echo "🚀 Textilindo AI Assistant - LoRA Training"
echo "=========================================="

# Check if the virtual environment exists
if [ ! -d "venv" ]; then
    echo "❌ Virtual environment not found. Creating one..."
    python3 -m venv venv
fi

# Activate the virtual environment
echo "🔧 Activating virtual environment..."
source venv/bin/activate

# Install/upgrade requirements
echo "📦 Installing requirements..."
pip install --upgrade pip
pip install -r requirements.txt

# Download the base model if it is not present yet
echo "🔍 Checking model..."
if [ ! -d "models/llama-3.2-1b-instruct" ] || [ ! -f "models/llama-3.2-1b-instruct/config.json" ]; then
    echo "📥 Downloading base model..."
    python scripts/setup_textilindo_training.py
else
    echo "✅ Base model already exists"
fi

# Check the dataset
echo "🔍 Checking dataset..."
if [ ! -f "data/lora_dataset_20250910_145055.jsonl" ]; then
    echo "❌ Dataset not found: data/lora_dataset_20250910_145055.jsonl"
    echo "Please ensure your dataset is in the correct location"
    exit 1
else
    echo "✅ Dataset found"
fi

# Check the system prompt
echo "🔍 Checking system prompt..."
if [ ! -f "configs/system_prompt.md" ]; then
    echo "❌ System prompt not found: configs/system_prompt.md"
    exit 1
else
    echo "✅ System prompt found"
fi

# Start training
echo "🏋️ Starting LoRA training..."
echo "This may take several hours depending on your hardware..."
echo ""

python scripts/train_textilindo_ai.py

echo ""
echo "✅ Training completed!"
echo ""
echo "📋 Next steps:"
echo "1. Test the model: python scripts/test_textilindo_ai.py"
echo "2. Find your trained model in: models/textilindo-ai-lora-*"
echo "3. Test with LoRA: python scripts/test_textilindo_ai.py --lora_path models/textilindo-ai-lora-*"
scripts/check_training_ready.py ADDED
@@ -0,0 +1,198 @@
#!/usr/bin/env python3
"""
Check that everything is ready for Textilindo AI training
"""

import json
import os
import sys

import yaml


def check_file_exists(file_path, description):
    """Check if a file exists and print its status"""
    if os.path.exists(file_path):
        print(f"✅ {description}: {file_path}")
        return True
    else:
        print(f"❌ {description}: {file_path}")
        return False


def check_config():
    """Check the training configuration file"""
    print("🔍 Checking configuration files...")

    config_path = "configs/training_config.yaml"
    if not os.path.exists(config_path):
        print(f"❌ Training config not found: {config_path}")
        return False

    try:
        with open(config_path, 'r') as f:
            config = yaml.safe_load(f)

        # Check required fields
        required_fields = ['model_name', 'model_path', 'dataset_path', 'lora_config', 'training_config']
        for field in required_fields:
            if field not in config:
                print(f"❌ Missing field in config: {field}")
                return False

        print("✅ Training configuration is valid")
        return True

    except Exception as e:
        print(f"❌ Error reading config: {e}")
        return False


def check_dataset():
    """Check the dataset file"""
    print("\n🔍 Checking dataset...")

    config_path = "configs/training_config.yaml"
    if not os.path.exists(config_path):
        # check_config() already reported the missing config
        return False
    with open(config_path, 'r') as f:
        config = yaml.safe_load(f)

    dataset_path = config['dataset_path']

    if not os.path.exists(dataset_path):
        print(f"❌ Dataset not found: {dataset_path}")
        return False

    # Check that it is a valid JSONL file
    try:
        with open(dataset_path, 'r', encoding='utf-8') as f:
            lines = f.readlines()

        if not lines:
            print("❌ Dataset is empty")
            return False

        # Validate the first few lines
        valid_lines = 0
        for i, line in enumerate(lines[:5]):  # Check the first 5 lines
            line = line.strip()
            if line:
                try:
                    json.loads(line)
                    valid_lines += 1
                except json.JSONDecodeError:
                    print(f"❌ Invalid JSON at line {i+1}")
                    return False

        print(f"✅ Dataset found: {dataset_path}")
        print(f"   Total lines: {len(lines)}")
        print(f"   Valid JSON lines checked: {valid_lines}")
        return True

    except Exception as e:
        print(f"❌ Error reading dataset: {e}")
        return False


def check_model():
    """Check that the base model exists"""
    print("\n🔍 Checking base model...")

    config_path = "configs/training_config.yaml"
    if not os.path.exists(config_path):
        return False
    with open(config_path, 'r') as f:
        config = yaml.safe_load(f)

    model_path = config['model_path']

    if not os.path.exists(model_path):
        print(f"❌ Base model not found: {model_path}")
        print("   Run: python scripts/setup_textilindo_training.py")
        return False

    # Check that it looks like a valid model directory
    required_files = ['config.json', 'tokenizer.json']
    for file in required_files:
        if not os.path.exists(os.path.join(model_path, file)):
            print(f"❌ Model file missing: {file}")
            return False

    print(f"✅ Base model found: {model_path}")
    return True


def check_system_prompt():
    """Check the system prompt file"""
    print("\n🔍 Checking system prompt...")

    system_prompt_path = "configs/system_prompt.md"

    if not os.path.exists(system_prompt_path):
        print(f"❌ System prompt not found: {system_prompt_path}")
        return False

    try:
        with open(system_prompt_path, 'r', encoding='utf-8') as f:
            content = f.read()

        if 'SYSTEM_PROMPT' not in content:
            print("❌ SYSTEM_PROMPT not found in file")
            return False

        print(f"✅ System prompt found: {system_prompt_path}")
        return True

    except Exception as e:
        print(f"❌ Error reading system prompt: {e}")
        return False


def check_requirements():
    """Check Python requirements"""
    print("\n🔍 Checking Python requirements...")

    # Map import names to pip package names where they differ
    required_packages = {
        'torch': 'torch',
        'transformers': 'transformers',
        'peft': 'peft',
        'datasets': 'datasets',
        'accelerate': 'accelerate',
        'bitsandbytes': 'bitsandbytes',
        'yaml': 'pyyaml',  # PyYAML is imported as `yaml`
    }

    missing_packages = []
    for module, pip_name in required_packages.items():
        try:
            __import__(module)
            print(f"✅ {module}")
        except ImportError:
            missing_packages.append(pip_name)
            print(f"❌ {module}")

    if missing_packages:
        print(f"\n❌ Missing packages: {', '.join(missing_packages)}")
        print("Install with: pip install " + " ".join(missing_packages))
        return False

    return True


def main():
    print("🔍 Textilindo AI Training - Readiness Check")
    print("=" * 50)

    all_ready = True

    # Run every check so all problems are reported at once
    all_ready &= check_config()
    all_ready &= check_dataset()
    all_ready &= check_model()
    all_ready &= check_system_prompt()
    all_ready &= check_requirements()

    print("\n" + "=" * 50)

    if all_ready:
        print("✅ Everything is ready for training!")
        print("\n📋 Next steps:")
        print("1. Run training: python scripts/train_textilindo_ai.py")
        print("2. Or use runner: ./run_textilindo_training.sh")
    else:
        print("❌ Some components are missing or invalid")
        print("Please fix the issues above before training")
        sys.exit(1)


if __name__ == "__main__":
    main()
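
Because the script exits with status 1 when any check fails, it can gate the runner from a shell:

```bash
python scripts/check_training_ready.py && ./run_textilindo_training.sh
```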
scripts/inference_textilindo_ai.py ADDED
@@ -0,0 +1,178 @@
#!/usr/bin/env python3
"""
Inference script for the Textilindo AI Assistant
Uses the model fine-tuned with LoRA
"""

import argparse
import logging
import os
import sys

import torch
from peft import PeftModel
from transformers import AutoTokenizer, AutoModelForCausalLM

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def load_system_prompt(system_prompt_path):
    """Load the system prompt from the markdown file"""
    try:
        with open(system_prompt_path, 'r', encoding='utf-8') as f:
            content = f.read()

        # Extract SYSTEM_PROMPT from the markdown
        if 'SYSTEM_PROMPT = """' in content:
            start = content.find('SYSTEM_PROMPT = """') + len('SYSTEM_PROMPT = """')
            end = content.find('"""', start)
            system_prompt = content[start:end].strip()
        else:
            # Fallback: use the entire file content
            system_prompt = content.strip()

        return system_prompt
    except Exception as e:
        logger.error(f"Error loading system prompt: {e}")
        return None


def load_model(model_path, lora_path=None):
    """Load the model with optional LoRA weights"""
    logger.info(f"Loading base model from: {model_path}")

    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
        model_path,
        trust_remote_code=True
    )

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Load the base model
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True
    )

    # Load LoRA weights if provided
    if lora_path and os.path.exists(lora_path):
        logger.info(f"Loading LoRA weights from: {lora_path}")
        model = PeftModel.from_pretrained(model, lora_path)
    else:
        logger.warning("No LoRA weights found, using base model")

    return model, tokenizer


def generate_response(model, tokenizer, user_input, system_prompt, max_new_tokens=512):
    """Generate a response from the model"""
    # Build the full prompt with the system prompt
    full_prompt = f"<|system|>\n{system_prompt}\n<|user|>\n{user_input}\n<|assistant|>\n"

    inputs = tokenizer(full_prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            # max_new_tokens counts only generated tokens, so the long
            # system prompt does not eat into the generation budget
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            top_p=0.9,
            top_k=40,
            repetition_penalty=1.1,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
            stop_strings=["<|end|>", "<|user|>"],
            tokenizer=tokenizer,  # required by generate() when stop_strings is set
        )

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract only the assistant's response
    if "<|assistant|>" in response:
        assistant_response = response.split("<|assistant|>")[-1].strip()
        # Remove any remaining special tokens
        assistant_response = assistant_response.replace("<|end|>", "").strip()
        return assistant_response
    else:
        return response


def interactive_chat(model, tokenizer, system_prompt):
    """Interactive chat mode"""
    print("🤖 Textilindo AI Assistant - Chat Mode")
    print("=" * 60)
    print("Type 'quit' to exit")
    print("-" * 60)

    while True:
        try:
            user_input = input("\n👀 Customer: ").strip()

            if user_input.lower() in ['quit', 'exit', 'q']:
                print("👋 Terima kasih! Sampai jumpa!")
                break

            if not user_input:
                continue

            print("\n🤖 Textilindo AI: ", end="", flush=True)
            response = generate_response(model, tokenizer, user_input, system_prompt)
            print(response)

        except KeyboardInterrupt:
            print("\n👋 Terima kasih! Sampai jumpa!")
            break
        except Exception as e:
            logger.error(f"Error generating response: {e}")
            print(f"❌ Error: {e}")


def main():
    parser = argparse.ArgumentParser(description='Textilindo AI Assistant Inference')
    parser.add_argument('--model_path', type=str, default='./models/llama-3.2-1b-instruct',
                        help='Path to base model')
    parser.add_argument('--lora_path', type=str, default=None,
                        help='Path to LoRA weights')
    parser.add_argument('--system_prompt', type=str, default='configs/system_prompt.md',
                        help='Path to system prompt file')
    parser.add_argument('--prompt', type=str, default=None,
                        help='Single prompt to process')

    args = parser.parse_args()

    print("🤖 Textilindo AI Assistant - Inference")
    print("=" * 60)

    # Load the system prompt
    system_prompt = load_system_prompt(args.system_prompt)
    if not system_prompt:
        print(f"❌ System prompt not found: {args.system_prompt}")
        sys.exit(1)

    # Check that the base model exists
    if not os.path.exists(args.model_path):
        print(f"❌ Base model not found: {args.model_path}")
        print("Run setup_textilindo_training.py first")
        sys.exit(1)

    try:
        # Load the model
        print("1️⃣ Loading model...")
        model, tokenizer = load_model(args.model_path, args.lora_path)
        print("✅ Model loaded successfully!")

        if args.prompt:
            # Single prompt mode
            print(f"\n📝 Processing prompt: {args.prompt}")
            response = generate_response(model, tokenizer, args.prompt, system_prompt)
            print(f"\n🤖 Response: {response}")
        else:
            # Interactive mode
            interactive_chat(model, tokenizer, system_prompt)

    except Exception as e:
        logger.error(f"Error: {e}")
        print(f"❌ Error loading model: {e}")


if __name__ == "__main__":
    main()
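
Without `--lora_path` the script falls back to the plain base model (see the warning in `load_model`), which is useful for comparing the fine-tuned assistant against the baseline.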
scripts/setup_textilindo_training.py ADDED
@@ -0,0 +1,173 @@
#!/usr/bin/env python3
"""
Setup script for Textilindo AI Assistant training
Downloads the base model and prepares the environment
"""

import logging
import os
import sys
from pathlib import Path

import torch
import yaml
from transformers import AutoTokenizer, AutoModelForCausalLM

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def load_config(config_path):
    """Load configuration from a YAML file"""
    try:
        with open(config_path, 'r') as f:
            config = yaml.safe_load(f)
        return config
    except Exception as e:
        logger.error(f"Error loading config: {e}")
        return None


def download_model(config):
    """Download the base model"""
    model_name = config['model_name']
    model_path = config['model_path']

    logger.info(f"Downloading model: {model_name}")
    logger.info(f"Target path: {model_path}")

    # Create the models directory
    Path(model_path).mkdir(parents=True, exist_ok=True)

    try:
        # Download the tokenizer
        logger.info("Downloading tokenizer...")
        tokenizer = AutoTokenizer.from_pretrained(
            model_name,
            trust_remote_code=True,
            cache_dir=model_path
        )

        # Download the model
        logger.info("Downloading model...")
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            trust_remote_code=True,
            cache_dir=model_path
        )

        # Save to the local path
        logger.info(f"Saving model to: {model_path}")
        tokenizer.save_pretrained(model_path)
        model.save_pretrained(model_path)

        logger.info("✅ Model downloaded successfully!")
        return True

    except Exception as e:
        logger.error(f"Error downloading model: {e}")
        return False


def check_requirements():
    """Check that all requirements are met"""
    print("🔍 Checking requirements...")

    # Check Python version
    if sys.version_info < (3, 8):
        print("❌ Python 3.8+ required")
        return False

    # Check PyTorch
    try:
        import torch
        print(f"✅ PyTorch {torch.__version__}")
    except ImportError:
        print("❌ PyTorch not installed")
        return False

    # Check CUDA availability
    if torch.cuda.is_available():
        print(f"✅ CUDA available: {torch.cuda.get_device_name(0)}")
        print(f"   GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
    else:
        print("⚠️ CUDA not available - training will be slower on CPU")

    # Check required packages
    required_packages = [
        'transformers',
        'peft',
        'datasets',
        'accelerate',
        'bitsandbytes'
    ]

    missing_packages = []
    for package in required_packages:
        try:
            __import__(package)
            print(f"✅ {package}")
        except ImportError:
            missing_packages.append(package)
            print(f"❌ {package}")

    if missing_packages:
        print(f"\n❌ Missing packages: {', '.join(missing_packages)}")
        print("Install with: pip install " + " ".join(missing_packages))
        return False

    return True


def main():
    print("🚀 Textilindo AI Assistant - Setup")
    print("=" * 50)

    # Check requirements
    if not check_requirements():
        print("\n❌ Requirements not met. Please install missing packages.")
        sys.exit(1)

    # Load configuration
    config_path = "configs/training_config.yaml"
    if not os.path.exists(config_path):
        print(f"❌ Config file not found: {config_path}")
        sys.exit(1)

    config = load_config(config_path)
    if not config:
        sys.exit(1)

    # Skip the download if the model already exists
    model_path = config['model_path']
    if os.path.exists(model_path) and os.path.exists(os.path.join(model_path, "config.json")):
        print(f"✅ Model already exists: {model_path}")
        print("Skipping download...")
    else:
        # Download the model
        print("1️⃣ Downloading base model...")
        if not download_model(config):
            print("❌ Failed to download model")
            sys.exit(1)

    # Check the dataset
    dataset_path = config['dataset_path']
    if not os.path.exists(dataset_path):
        print(f"❌ Dataset not found: {dataset_path}")
        print("Please ensure your dataset is in the correct location")
        sys.exit(1)
    else:
        print(f"✅ Dataset found: {dataset_path}")

    # Check the system prompt
    system_prompt_path = "configs/system_prompt.md"
    if not os.path.exists(system_prompt_path):
        print(f"❌ System prompt not found: {system_prompt_path}")
        sys.exit(1)
    else:
        print(f"✅ System prompt found: {system_prompt_path}")

    print("\n✅ Setup completed successfully!")
    print("\n📋 Next steps:")
    print("1. Run training: python scripts/train_textilindo_ai.py")
    print("2. Test model: python scripts/test_textilindo_ai.py")
    print("3. Test with LoRA: python scripts/test_textilindo_ai.py --lora_path models/textilindo-ai-lora-YYYYMMDD_HHMMSS")


if __name__ == "__main__":
    main()
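
Note: the Llama 3.2 checkpoints on Hugging Face are gated, so the download step may require `huggingface-cli login` with an account that has accepted Meta's license.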
scripts/test_textilindo_ai.py ADDED
@@ -0,0 +1,235 @@
#!/usr/bin/env python3
"""
Script for testing the fine-tuned Textilindo AI Assistant
"""

import argparse
import logging
import os
import sys

import torch
from peft import PeftModel
from transformers import AutoTokenizer, AutoModelForCausalLM

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def load_system_prompt(system_prompt_path):
    """Load the system prompt from the markdown file"""
    try:
        with open(system_prompt_path, 'r', encoding='utf-8') as f:
            content = f.read()

        # Extract SYSTEM_PROMPT from the markdown
        if 'SYSTEM_PROMPT = """' in content:
            start = content.find('SYSTEM_PROMPT = """') + len('SYSTEM_PROMPT = """')
            end = content.find('"""', start)
            system_prompt = content[start:end].strip()
        else:
            # Fallback: use the entire file content
            system_prompt = content.strip()

        return system_prompt
    except Exception as e:
        logger.error(f"Error loading system prompt: {e}")
        return None


def load_finetuned_model(model_path, lora_weights_path):
    """Load the fine-tuned model with LoRA weights"""
    logger.info(f"Loading base model from: {model_path}")

    # Load the base model
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True
    )

    # Load LoRA weights if available
    if lora_weights_path and os.path.exists(lora_weights_path):
        logger.info(f"Loading LoRA weights from: {lora_weights_path}")
        model = PeftModel.from_pretrained(model, lora_weights_path)
    else:
        logger.warning("No LoRA weights found, using base model")

    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
        model_path,
        trust_remote_code=True
    )

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    return model, tokenizer


def generate_response(model, tokenizer, user_input, system_prompt, max_new_tokens=512):
    """Generate a response from the model"""
    # Build the full prompt with the system prompt
    full_prompt = f"<|system|>\n{system_prompt}\n<|user|>\n{user_input}\n<|assistant|>\n"

    inputs = tokenizer(full_prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            # max_new_tokens counts only generated tokens, so the long
            # system prompt does not eat into the generation budget
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            top_p=0.9,
            top_k=40,
            repetition_penalty=1.1,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
            stop_strings=["<|end|>", "<|user|>"],
            tokenizer=tokenizer,  # required by generate() when stop_strings is set
        )

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract only the assistant's response
    if "<|assistant|>" in response:
        assistant_response = response.split("<|assistant|>")[-1].strip()
        # Remove any remaining special tokens
        assistant_response = assistant_response.replace("<|end|>", "").strip()
        return assistant_response
    else:
        return response


def interactive_test(model, tokenizer, system_prompt):
    """Interactive testing mode"""
    print("🤖 Textilindo AI Assistant - Interactive Mode")
    print("=" * 60)
    print("Type 'quit' to exit")
    print("-" * 60)

    while True:
        try:
            user_input = input("\n👀 Customer: ").strip()

            if user_input.lower() in ['quit', 'exit', 'q']:
                print("👋 Terima kasih! Sampai jumpa!")
                break

            if not user_input:
                continue

            print("\n🤖 Textilindo AI: ", end="", flush=True)
            response = generate_response(model, tokenizer, user_input, system_prompt)
            print(response)

        except KeyboardInterrupt:
            print("\n👋 Terima kasih! Sampai jumpa!")
            break
        except Exception as e:
            logger.error(f"Error generating response: {e}")
            print(f"❌ Error: {e}")


def batch_test(model, tokenizer, system_prompt, test_cases):
    """Batch testing with predefined test cases"""
    print("🧪 Textilindo AI Assistant - Batch Testing")
    print("=" * 60)

    for i, test_case in enumerate(test_cases, 1):
        print(f"\n📝 Test Case {i}: {test_case['prompt']}")
        print("-" * 40)

        try:
            response = generate_response(model, tokenizer, test_case['prompt'], system_prompt)
            print(f"🤖 Response: {response}")

            if 'expected' in test_case:
                print(f"🎯 Expected: {test_case['expected']}")

        except Exception as e:
            logger.error(f"Error in test case {i}: {e}")
            print(f"❌ Error: {e}")


def main():
    parser = argparse.ArgumentParser(description='Test Textilindo AI Assistant')
    parser.add_argument('--model_path', type=str, default='./models/llama-3.2-1b-instruct',
                        help='Path to base model')
    parser.add_argument('--lora_path', type=str, default=None,
                        help='Path to LoRA weights')
    parser.add_argument('--system_prompt', type=str, default='configs/system_prompt.md',
                        help='Path to system prompt file')

    args = parser.parse_args()

    print("🧪 Textilindo AI Assistant Testing")
    print("=" * 60)

    # Load the system prompt
    system_prompt = load_system_prompt(args.system_prompt)
    if not system_prompt:
        print(f"❌ System prompt not found: {args.system_prompt}")
        sys.exit(1)

    # Check that the base model exists
    if not os.path.exists(args.model_path):
        print(f"❌ Base model not found: {args.model_path}")
        print("Run setup_textilindo_training.py first")
        sys.exit(1)

    try:
        # Load the model
        print("1️⃣ Loading model...")
        model, tokenizer = load_finetuned_model(args.model_path, args.lora_path)
        print("✅ Model loaded successfully!")

        # Test cases specific to Textilindo
        test_cases = [
            {
                "prompt": "dimana lokasi textilindo?",
                "expected": "Textilindo berkantor pusat di Jl. Raya Prancis No.39, Kosambi Tim., Kec. Kosambi, Kabupaten Tangerang, Banten 15213"
            },
            {
                "prompt": "Jam berapa textilindo beroperasional?",
                "expected": "Jam operasional Senin-Jumat 08:00-17:00, Sabtu 08:00-12:00."
            },
            {
                "prompt": "Berapa ketentuan pembelian?",
                "expected": "Minimal order 1 roll per jenis kain"
            },
            {
                "prompt": "bagimana dengan pembayarannya?",
                "expected": "Pembayaran dapat dilakukan via transfer bank atau cash on delivery"
            },
            {
                "prompt": "apa ada gratis ongkir?",
                "expected": "Gratis ongkir untuk order minimal 5 roll."
            },
            {
                "prompt": "Apa bisa dikirimkan sample? apa gratis?",
                "expected": "hallo kak untuk sampel kita bisa kirimkan gratis ya kak 😊"
            }
        ]

        # Choose a testing mode
        print("\n2️⃣ Choose a testing mode:")
        print("1. Interactive mode (chat)")
        print("2. Batch testing")
        print("3. Custom prompt")

        choice = input("\nChoice (1-3): ").strip()

        if choice == "1":
            interactive_test(model, tokenizer, system_prompt)
        elif choice == "2":
            batch_test(model, tokenizer, system_prompt, test_cases)
        elif choice == "3":
            custom_prompt = input("Enter a custom prompt: ").strip()
            if custom_prompt:
                response = generate_response(model, tokenizer, custom_prompt, system_prompt)
                print(f"\n🤖 Response: {response}")
        else:
            print("❌ Invalid choice")

    except Exception as e:
        logger.error(f"Error: {e}")
        print(f"❌ Error loading model: {e}")


if __name__ == "__main__":
    main()
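
The `expected` values in `test_cases` are reference answers for manual comparison; the script prints them next to the generated response but does not assert on them.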
scripts/train_textilindo_ai.py ADDED
@@ -0,0 +1,280 @@
#!/usr/bin/env python3
"""
Fine-tunes Llama 3.2 1B with LoRA for the Textilindo AI Assistant,
using the Textilindo system prompt and dataset
"""

import json
import logging
import os
import sys
from datetime import datetime
from pathlib import Path

import torch
import yaml
from datasets import Dataset
from peft import (
    LoraConfig,
    get_peft_model,
    TaskType,
    prepare_model_for_kbit_training
)
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)

# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def load_config(config_path):
    """Load configuration from a YAML file"""
    try:
        with open(config_path, 'r') as f:
            config = yaml.safe_load(f)
        return config
    except Exception as e:
        logger.error(f"Error loading config: {e}")
        return None


def load_system_prompt(system_prompt_path):
    """Load the system prompt from the markdown file"""
    try:
        with open(system_prompt_path, 'r', encoding='utf-8') as f:
            content = f.read()

        # Extract SYSTEM_PROMPT from the markdown
        if 'SYSTEM_PROMPT = """' in content:
            start = content.find('SYSTEM_PROMPT = """') + len('SYSTEM_PROMPT = """')
            end = content.find('"""', start)
            system_prompt = content[start:end].strip()
        else:
            # Fallback: use the entire file content
            system_prompt = content.strip()

        return system_prompt
    except Exception as e:
        logger.error(f"Error loading system prompt: {e}")
        return None


def load_model_and_tokenizer(config):
    """Load the base model and tokenizer"""
    model_path = config['model_path']

    logger.info(f"Loading model from: {model_path}")

    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
        model_path,
        trust_remote_code=True,
        padding_side="right"
    )

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Load the model
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True
    )

    # Prepare the model for k-bit training
    model = prepare_model_for_kbit_training(model)

    return model, tokenizer


def setup_lora_config(config):
    """Set up the LoRA configuration"""
    lora_config = config['lora_config']

    peft_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=lora_config['r'],
        lora_alpha=lora_config['lora_alpha'],
        lora_dropout=lora_config['lora_dropout'],
        target_modules=lora_config['target_modules'],
        bias="none",
    )

    return peft_config


def prepare_textilindo_dataset(data_path, tokenizer, system_prompt, max_length=2048):
    """Prepare the Textilindo dataset for training with the system prompt"""
    logger.info(f"Loading dataset from: {data_path}")

    # Load the JSONL dataset
    data = []
    with open(data_path, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if line:
                try:
                    json_obj = json.loads(line)
                    data.append(json_obj)
                except json.JSONDecodeError as e:
                    logger.warning(f"Invalid JSON at line {line_num}: {e}")
                    continue

    if not data:
        raise ValueError("No valid JSON objects found in JSONL file")

    logger.info(f"Loaded {len(data)} samples from JSONL file")

    # Convert to the training format with the system prompt
    training_data = []
    for item in data:
        # Extract instruction and output
        instruction = item.get('instruction', '')
        output = item.get('output', '')

        if not instruction or not output:
            continue

        # Build the training text with the system prompt
        training_text = f"<|system|>\n{system_prompt}\n<|user|>\n{instruction}\n<|assistant|>\n{output}<|end|>"

        training_data.append({
            'text': training_text,
            'instruction': instruction,
            'output': output
        })

    # Convert to a Dataset
    dataset = Dataset.from_list(training_data)
    logger.info(f"Prepared {len(dataset)} training samples")

    def tokenize_function(examples):
        # Tokenize the texts; plain lists are what datasets.map expects,
        # and the data collator handles per-batch padding during training
        tokenized = tokenizer(
            examples['text'],
            truncation=True,
            max_length=max_length,
        )
        return tokenized

    # Tokenize the dataset
    tokenized_dataset = dataset.map(
        tokenize_function,
        batched=True,
        remove_columns=dataset.column_names
    )

    return tokenized_dataset


def train_model(model, tokenizer, dataset, config, output_dir):
    """Train the model with LoRA"""
    training_config = config['training_config']

    # Set up the training arguments
    training_args = TrainingArguments(
        output_dir=str(output_dir),
        num_train_epochs=training_config['num_epochs'],
        per_device_train_batch_size=training_config['batch_size'],
        gradient_accumulation_steps=training_config['gradient_accumulation_steps'],
        learning_rate=training_config['learning_rate'],
        warmup_steps=training_config['warmup_steps'],
        save_steps=training_config['save_steps'],
        eval_steps=training_config['eval_steps'],
        logging_steps=training_config.get('logging_steps', 10),
        save_total_limit=training_config.get('save_total_limit', 3),
        prediction_loss_only=training_config.get('prediction_loss_only', True),
        remove_unused_columns=training_config.get('remove_unused_columns', False),
        push_to_hub=training_config.get('push_to_hub', False),
        report_to=training_config.get('report_to', None),
        fp16=True,  # Enable mixed precision training
        dataloader_pin_memory=False,  # Reduce memory usage
    )

    # Set up the data collator (causal LM, so no masked-LM objective)
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False,
    )

    # Set up the trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        data_collator=data_collator,
        tokenizer=tokenizer,
    )

    # Start training
    logger.info("Starting training...")
    trainer.train()

    # Save the model (the LoRA adapter weights)
    trainer.save_model()
    logger.info(f"Model saved to: {output_dir}")


def main():
    print("🚀 Textilindo AI Assistant - LoRA Fine-tuning")
    print("=" * 60)

    # Load the configuration
    config_path = "configs/training_config.yaml"
    if not os.path.exists(config_path):
        print(f"❌ Config file not found: {config_path}")
        sys.exit(1)

    config = load_config(config_path)
    if not config:
        sys.exit(1)

    # Load the system prompt
    system_prompt_path = "configs/system_prompt.md"
    if not os.path.exists(system_prompt_path):
        print(f"❌ System prompt not found: {system_prompt_path}")
        sys.exit(1)

    system_prompt = load_system_prompt(system_prompt_path)
    if not system_prompt:
        sys.exit(1)

    # Set up paths
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    output_dir = Path(f"models/textilindo-ai-lora-{timestamp}")
    output_dir.mkdir(parents=True, exist_ok=True)

    # Check that the dataset exists
    data_path = config['dataset_path']
    if not os.path.exists(data_path):
        print(f"❌ Dataset not found: {data_path}")
        sys.exit(1)

    # Load model and tokenizer
    print("1️⃣ Loading model and tokenizer...")
    model, tokenizer = load_model_and_tokenizer(config)

    # Set up LoRA
    print("2️⃣ Setting up LoRA configuration...")
    peft_config = setup_lora_config(config)
    model = get_peft_model(model, peft_config)

    # Print trainable parameters
    model.print_trainable_parameters()

    # Prepare the dataset
    print("3️⃣ Preparing Textilindo dataset...")
    dataset = prepare_textilindo_dataset(data_path, tokenizer, system_prompt, config['max_length'])

    # Train the model
    print("4️⃣ Starting training...")
    train_model(model, tokenizer, dataset, config, output_dir)

    print("✅ Training complete!")
    print(f"📁 Model saved to: {output_dir}")
    # The trained directory holds LoRA adapter weights, so it is passed as --lora_path
    print(f"🔧 To test it, run: python scripts/test_textilindo_ai.py --lora_path {output_dir}")


if __name__ == "__main__":
    main()