JamesANZ committed on
Commit 5a5c571 · verified · 1 Parent(s): 3d44051

Replace gh readme with model card

Files changed (1)
  1. README.md +150 -275
README.md CHANGED
@@ -1,330 +1,200 @@
- # Australian Legal Small Language Model (SLM)
-
- A tiny domain-specific Small Language Model fine-tuned on Australian legal documents scraped from AustLII. This model learns patterns from the legal corpus and generates answers using what it learned during training, **biasing its behavior strongly toward Australian legal language and concepts**.
-
- ## Overview
-
- This project implements a Small Language Model (SLM) by fine-tuning DistilGPT2 (~82M parameters) on a corpus of Australian legal documents. The model is trained using causal language modeling, where it learns to predict the next token in sequences of legal text, thereby internalizing domain-specific knowledge, terminology, and reasoning patterns.
-
- ### What This Is
-
- - ✅ **A true SLM**: Fine-tuning DistilGPT2 on the legal corpus _is_ training a small language model - the model weights are updated during training
- - ✅ **Domain-adapted**: Domain knowledge is internalized in the model parameters (unlike RAG which relies on external document retrieval)
- - ✅ **Standalone**: This is fundamentally different from RAG - no external retrieval at inference time
- - ✅ **Learns patterns**: The model learns legal terminology, citation patterns, reasoning structures, and domain-specific syntax from the training corpus
-
- ### What This Is Not
-
- - ⚠️ **Not trained from scratch**: DistilGPT2 still retains its original pre-training on a large generic corpus; fine-tuning adapts it to the legal domain rather than replacing its prior knowledge
- - ⚠️ **Not hallucination-proof**: Fine-tuning _reduces_ general-domain hallucinations by biasing toward legal text, but does **not** prevent hallucinations entirely
- - ⚠️ **Not a replacement for legal advice**: This is a research/educational tool for exploring domain-specific language modeling
-
- ## Architecture
-
- ### Model
-
- - **Base Model**: `distilgpt2` (82M parameters) - the smallest GPT-2 variant
- - **Training Objective**: Causal language modeling (next token prediction)
- - **Fine-tuning**: Domain adaptation on Australian legal corpus
- - **Architecture**: Transformer decoder (GPT-2 style)
-
- ### Training Process
-
- 1. **Data Preparation**: Legal documents are loaded, cleaned, and tokenized into fixed-length sequences (512 tokens)
- 2. **Fine-tuning**: DistilGPT2 is fine-tuned using Hugging Face Trainer API with causal language modeling
- 3. **Evaluation**: Model is evaluated on a held-out validation set to compute perplexity
- 4. **Inference**: Trained model generates responses to legal questions
-
- ## Installation
-
- ### Prerequisites
-
- - Python 3.8+
- - PyTorch (CPU or GPU)
- - CUDA (optional, for GPU training)
-
- ### Setup
-
- 1. Clone or navigate to this repository:
-
- ```bash
- cd auslegal-slm
- ```
-
- 2. Create a virtual environment (recommended):
-
- ```bash
- python3 -m venv venv
- source venv/bin/activate # On Windows: venv\Scripts\activate
- ```
-
- 3. Install dependencies:
-
- ```bash
- pip install -r requirements.txt
- ```
-
- ## Usage
-
- ### Step 0: Clean Data Files (One-Time)
-
- Clean the raw legal documents by removing metadata headers and irrelevant content:
-
- ```bash
- python clean_data.py
- ```
-
- This script:
-
- - Processes all `.txt` files in the `data/` directory
- - Strips metadata headers (URL, scraped date, separators)
- - Removes navigation/UI elements and irrelevant text
- - Cleans and normalizes whitespace
- - Saves cleaned versions back to the same files
-
- **Note**: This is a one-time operation. Once files are cleaned, you don't need to run this again unless you scrape new data.
-
- ### Step 1: Prepare Data
-
- Preprocess the cleaned legal documents for training:

- ```bash
- python prepare_data.py
- ```

- This script:

- - Loads all cleaned `.txt` files from the `data/` directory
- - Tokenizes documents into fixed-length sequences (512 tokens)
- - Splits into training (90%) and validation (10%) sets
- - Saves preprocessed data to `preprocessed_data/`

- **Output**: `preprocessed_data/train_data.json` and `preprocessed_data/val_data.json`

- ### Step 2: Train the Model

- Fine-tune DistilGPT2 on the legal corpus:

- ```bash
- python train_slm.py
- ```

- **Training Configuration**:

- - Epochs: 5
- - Learning rate: 2e-5
- - Batch size: 4 (effective: 16 with gradient accumulation)
- - Max sequence length: 512 tokens
- - Optimizer: AdamW with warmup

- **Output**: Trained model saved to `models/legal_slm/`

- **Training Time**:

- - CPU: Several hours (depending on corpus size)
- - GPU: 30-60 minutes (depending on GPU)

- ### Step 3: Query the Model

- #### Interactive Mode

- Run the interactive query interface:

- ```bash
- python query_slm.py
- ```

- Then enter legal questions interactively. Type `quit` or `exit` to stop.

- #### Single Question

- Ask a single question:

- ```bash
- python query_slm.py --question "What is the legal precedent for negligence in Australian law?"
- ```

- #### Custom Parameters

- Adjust generation parameters:

- ```bash
- python query_slm.py \
-   --question "Your question here" \
-   --temperature 0.3 \
-   --max-length 300
- ```

- **Parameters**:

- - `--model-dir`: Path to fine-tuned model (default: `models/legal_slm`)
- - `--question`: Single question to ask
- - `--temperature`: Sampling temperature, 0.0-1.0 (lower = more deterministic, default: 0.4)
- - `--max-length`: Maximum response length in tokens (default: 250)

- ## Technical Specifications (NatSpec)

- ### Data Format

- Legal documents are stored as plain text files with the following structure:

- ```
- URL: https://www.austlii.edu.au/...
- Scraped: YYYY-MM-DD HH:MM:SS
- ================================================================================

- [Legal content here]
- ```

- The data preparation script automatically strips the metadata header and extracts only the legal content.

- ### Tokenization

- - **Tokenizer**: GPT-2 tokenizer (BPE-based)
- - **Vocabulary size**: 50,257 tokens
- - **Special tokens**: `<|endoftext|>` (EOS), padding token set to EOS
- - **Sequence length**: 512 tokens (fixed)
- - **Sliding window**: 256 token stride (50% overlap)

- ### Model Architecture

- - **Architecture**: GPT-2 (Transformer decoder)
- - **Parameters**: ~82M (DistilGPT2)
- - **Layers**: 6 transformer decoder blocks
- - **Hidden size**: 768
- - **Attention heads**: 12
- - **Max position embeddings**: 1024

- ### Training Configuration

  ```python
- Model: distilgpt2
- Objective: Causal Language Modeling (CLM)
- Loss: Cross-entropy
- Optimizer: AdamW
- Learning rate: 2e-5
- Learning rate schedule: Linear warmup + cosine decay
- Batch size: 4 (per device)
- Gradient accumulation: 4 steps (effective batch: 16)
- Epochs: 5
- Warmup steps: 100
- Max sequence length: 512 tokens
- Mixed precision: FP16 (if GPU available)
- ```

- ### Inference Configuration

- ```python
- Temperature: 0.4 (default, range: 0.0-1.0)
- Top-p (nucleus): 0.9
- Top-k: 50
- Repetition penalty: 1.2
- Max new tokens: 250
- Do sample: True
  ```
- ## Limitations and Considerations
-
- ### Hallucination Mitigation
-
- The following strategies are employed to reduce hallucinations and off-domain content:
-
- - **Domain fine-tuning**: Model is fine-tuned only on the legal corpus (though base model retains general pre-training)
- - **Low temperature**: 0.3-0.5 during inference to reduce randomness
- - **Capped generation length**: Limits response length to prevent rambling
- - **Prompt engineering**: Prompts explicitly reference "Australian legal documents"
- - **Manual monitoring**: Test prompts should be used to detect off-domain or invented content

- **Important Note**: Fine-tuning and decoding settings _reduce_ hallucinations and off-domain content but **cannot guarantee their absence**. The model may still:

- - Generate plausible-sounding but incorrect legal information
- - Mix general knowledge with legal domain knowledge
- - Produce responses that don't directly cite the training corpus

- For stricter factual grounding, consider pairing this SLM with a RAG (Retrieval-Augmented Generation) pipeline over the same corpus.

- ### Data Limitations

- - **Corpus size**: Training on 119 documents (or ~856 files as scraped) is relatively small for language model training
- - **Coverage**: The model may not have seen all areas of Australian law
- - **Temporal**: Documents reflect the state of law at scraping time; laws may have changed

- ### Model Limitations

- - **Context window**: 512 tokens limits the amount of context the model can consider
- - **Generalization**: May overfit to specific documents or underperform on unseen legal topics
- - **No citations**: Model doesn't explicitly cite sources (unlike RAG systems)
-
- ## Evaluation
-
- Training metrics are saved to `models/legal_slm/training_metrics.json`:
-
- ```json
- {
-   "training_loss": 2.3456,
-   "eval_loss": 2.4567,
-   "perplexity": 11.67,
-   "num_epochs": 5,
-   "learning_rate": 2e-5,
-   "batch_size": 4
- }
- ```
-
- **Perplexity**: Lower is better. Measures how well the model predicts the next token. A perplexity of ~10-15 is reasonable for domain-adapted models.
-
- ## Future Enhancements
-
- ### Comparison Models
-
- For learning and comparison purposes, additional training approaches can be implemented:
-
- - **N-gram model** (`train_ngram.py`): Classic n-gram language model trained from scratch
- - **Char-RNN** (`train_charrnn.py`): Character-level LSTM/GRU trained from scratch
- - **Tiny transformer from scratch**: Fully custom transformer trained only on legal corpus
-
- These can be compared on:
-
- - Training time
- - Validation loss/perplexity
- - Qualitative sample quality
- - Memory requirements
-
- ### Hybrid Approaches
-
- - **SLM + RAG**: Combine fine-tuned SLM with retrieval over the same corpus for stricter factual grounding
- - **LoRA fine-tuning**: More parameter-efficient fine-tuning approach
- - **Gradient checkpointing**: Reduce memory usage for larger batch sizes
-
- ## File Structure

- ```
- auslegal-slm/
- ├── data/                  # Legal documents (scraped, cleaned)
- ├── preprocessed_data/     # Tokenized training data
- │   ├── train_data.json
- │   └── val_data.json
- ├── models/                # Trained models
- │   └── legal_slm/         # Fine-tuned DistilGPT2
- │       ├── config.json
- │       ├── pytorch_model.bin
- │       ├── tokenizer_config.json
- │       ├── vocab.json
- │       ├── merges.txt
- │       └── training_metrics.json
- ├── scraper/               # Data collection tools
- │   ├── scraper.py         # Legal document scraper
- │   └── requirements.txt   # Scraper dependencies
- ├── clean_data.py          # One-time data cleaning script
- ├── prepare_data.py        # Data preparation script
- ├── train_slm.py           # Training script
- ├── query_slm.py           # Query interface
- ├── requirements.txt       # SLM dependencies
- └── README.md              # This file
- ```

  ## Citation

- If you use this code or model, please cite:

  ```bibtex
  @software{auslegal_slm,
@@ -338,5 +208,10 @@ If you use this code or model, please cite:
  ## Acknowledgments

  - Legal documents scraped from [AustLII](https://www.austlii.edu.au/)
- - Model architecture based on [DistilGPT2](https://huggingface.co/distilgpt2) by Hugging Face
  - Built with [Transformers](https://huggingface.co/docs/transformers) library

+ ---
+ language:
+ - en
+ tags:
+ - legal
+ - australia
+ - law
+ - causal-lm
+ - text-generation
+ - domain-adapted
+ - slm
+ - distilgpt2
+ license: mit
+ base_model: distilgpt2
+ library_name: transformers
+ pipeline_tag: text-generation
+ datasets:
+ - custom
+ metrics:
+ - perplexity
+ model-index:
+ - name: auslegal-slm
+   results:
+   - task:
+       type: text-generation
+     dataset:
+       name: Australian Legal Corpus (AustLII)
+       type: custom
+     metrics:
+     - name: Perplexity
+       type: perplexity
+       value: 24.34
+     - name: Validation Loss
+       type: loss
+       value: 3.19
+ ---

+ # Australian Legal Small Language Model (SLM)

+ A domain-specific Small Language Model fine-tuned on Australian legal documents from AustLII. This model is based on DistilGPT2 and has been adapted to generate text in the style of Australian legal documents.

+ ## Model Details

+ ### Model Description

+ - **Model type**: GPT-2 (Transformer decoder)
+ - **Architecture**: DistilGPT2 fine-tuned on an Australian legal corpus
+ - **Parameters**: ~82M
+ - **Language**: English (Australian legal domain)
+ - **License**: MIT

+ ### Base Model

+ This model is a fine-tuned version of [distilgpt2](https://huggingface.co/distilgpt2), a distilled version of GPT-2 with 82M parameters.

+ ### Training Data

+ The model was fine-tuned on a corpus of Australian legal documents scraped from [AustLII](https://www.austlii.edu.au/). The training corpus consists of legal cases, legislation, and other legal documents from Australian jurisdictions.

+ **Data Processing**:
+ - Documents were cleaned to remove metadata headers
+ - Tokenized using the GPT-2 tokenizer with a maximum sequence length of 512 tokens
+ - Split into training (90%) and validation (10%) sets
+ - Used a sliding-window approach with a 256-token stride for sequence creation (see the sketch below)
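+ To make the sequence-creation step concrete, the following is a minimal sketch of a 512-token sliding window with a 256-token stride. It is illustrative only; the function name and the direct use of `GPT2TokenizerFast` are assumptions, not the repository's actual `prepare_data.py` code.
+
+ ```python
+ from transformers import GPT2TokenizerFast
+
+ tokenizer = GPT2TokenizerFast.from_pretrained("distilgpt2")
+
+ def sliding_window_chunks(text, max_length=512, stride=256):
+     """Split one document into overlapping token windows (hypothetical helper)."""
+     ids = tokenizer(text)["input_ids"]
+     chunks = []
+     for start in range(0, len(ids), stride):
+         chunks.append(ids[start:start + max_length])
+         if start + max_length >= len(ids):
+             break
+     return chunks
+ ```
+
+ Each window would then be padded to the fixed 512-token length before being written out for training.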
 
+ ### Training Procedure

+ **Training Hyperparameters**:
+ - **Training regime**: Fine-tuning (not from scratch)
+ - **Epochs**: 1 (as per training metrics)
+ - **Learning rate**: 2e-5
+ - **Batch size**: 4 (per device)
+ - **Gradient accumulation steps**: 1
+ - **Max sequence length**: 512 tokens
+ - **Optimizer**: AdamW
+ - **Warmup steps**: 100
+ - **Mixed precision**: FP16 (when a GPU is available)

+ **Training Infrastructure**:
+ - Framework: PyTorch with Hugging Face Transformers (a configuration sketch follows below)
+ - Hardware: CPU/GPU compatible
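+ As a rough illustration of how these hyperparameters map onto the Hugging Face `Trainer` API (the repository's actual `train_slm.py` may differ; the tiny inline dataset below is a placeholder):
+
+ ```python
+ import torch
+ from transformers import (
+     AutoModelForCausalLM,
+     AutoTokenizer,
+     DataCollatorForLanguageModeling,
+     Trainer,
+     TrainingArguments,
+ )
+
+ tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
+ tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
+ model = AutoModelForCausalLM.from_pretrained("distilgpt2")
+
+ # Placeholder corpus; in practice this would be the 512-token sequences
+ # described under "Data Processing".
+ texts = ["The appellant submits that the primary judge erred in law."]
+ encodings = tokenizer(texts, truncation=True, max_length=512)
+ train_dataset = [{"input_ids": ids} for ids in encodings["input_ids"]]
+
+ args = TrainingArguments(
+     output_dir="models/legal_slm",
+     num_train_epochs=1,
+     learning_rate=2e-5,
+     per_device_train_batch_size=4,
+     gradient_accumulation_steps=1,
+     warmup_steps=100,
+     fp16=torch.cuda.is_available(),
+ )
+
+ trainer = Trainer(
+     model=model,
+     args=args,
+     train_dataset=train_dataset,
+     data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
+ )
+ trainer.train()
+ ```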
 
+ ## Evaluation Results

+ ### Metrics

+ | Metric          | Value |
+ |-----------------|-------|
+ | Validation Loss | 3.19  |
+ | Perplexity      | 24.34 |
+ | Training Loss   | 3.29  |

+ **Note**: Lower perplexity indicates better performance. Perplexity is the exponential of the validation loss (see the snippet below), and a perplexity of ~24 is reasonable for a domain-adapted model of this size.
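+ For reference, here is how the perplexity figure relates to the reported loss (the small gap versus 24.34 comes from the loss being rounded to two decimals):
+
+ ```python
+ import math
+
+ eval_loss = 3.19  # mean cross-entropy per token on the validation set
+ print(round(math.exp(eval_loss), 2))  # 24.29, matching the reported perplexity up to rounding
+ ```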
 
+ ## Intended Use

+ ### Direct Use

+ This model is intended for:
+ - **Research and educational purposes**: Exploring domain-specific language modeling
+ - **Legal text generation**: Generating text in the style of Australian legal documents
+ - **Domain adaptation experiments**: As a baseline for legal domain language models

+ ### Out-of-Scope Use

+ ⚠️ **This model should NOT be used for**:
+ - Legal advice or legal decision-making
+ - Production legal applications without additional safeguards
+ - Any application requiring guaranteed factual accuracy
+ - Replacing professional legal research or consultation

+ ## Limitations and Bias

+ ### Known Limitations

+ 1. **Hallucination Risk**: The model may generate plausible-sounding but incorrect legal information. Fine-tuning reduces but does not eliminate hallucinations.

+ 2. **Limited Coverage**: Training on a relatively small corpus (~10,000+ documents) means the model may not have seen all areas of Australian law.

+ 3. **Temporal Limitations**: Documents reflect the state of law at scraping time; laws may have changed since training.

+ 4. **Context Window**: Limited to 512 tokens, restricting the amount of context the model can consider.

+ 5. **No Citations**: The model doesn't explicitly cite sources (unlike RAG systems).

+ 6. **Generalization**: May overfit to specific documents or underperform on unseen legal topics.

+ ### Bias Considerations

+ - The model inherits biases from both the base model (DistilGPT2) and the training corpus
+ - Legal documents may reflect historical biases present in the legal system
+ - The model may reproduce or amplify biases found in the training data
+ - Users should be aware that legal language and concepts may not be neutral

+ ### Ethical Considerations

+ - **Not for Legal Advice**: This model is a research tool and should not be used to provide legal advice
+ - **Factual Accuracy**: Generated content should be verified against authoritative legal sources
+ - **Bias Awareness**: Users should be aware of potential biases in generated content
+ - **Responsible Use**: Should be used responsibly and with appropriate safeguards

+ ## How to Use
+ ### Basic Usage

  ```python
+ from transformers import GPT2LMHeadModel, GPT2Tokenizer

+ # Load model and tokenizer
+ model = GPT2LMHeadModel.from_pretrained("JamesANZ/auslegal-slm")
+ tokenizer = GPT2Tokenizer.from_pretrained("JamesANZ/auslegal-slm")

+ # Generate text
+ prompt = "In Australian law, negligence is defined as"
+ inputs = tokenizer.encode(prompt, return_tensors="pt")
+ outputs = model.generate(
+     inputs,
+     max_length=250,
+     temperature=0.4,
+     do_sample=True,
+     pad_token_id=tokenizer.eos_token_id
+ )
+ generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
+ print(generated_text)
  ```
 
+ ### Recommended Generation Parameters

+ - **Temperature**: 0.3-0.5 (lower = more deterministic, reduces hallucinations)
+ - **Max length**: 250 tokens (prevents rambling)
+ - **Top-p (nucleus)**: 0.9
+ - **Top-k**: 50
+ - **Repetition penalty**: 1.2 (all of these are shown together in the sketch below)
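+ A minimal, self-contained sketch that combines the recommended settings with `generate()` (the prompt string is only an example):
+
+ ```python
+ from transformers import GPT2LMHeadModel, GPT2Tokenizer
+
+ model = GPT2LMHeadModel.from_pretrained("JamesANZ/auslegal-slm")
+ tokenizer = GPT2Tokenizer.from_pretrained("JamesANZ/auslegal-slm")
+
+ inputs = tokenizer.encode("The duty of care owed by an occupier", return_tensors="pt")
+ outputs = model.generate(
+     inputs,
+     max_new_tokens=250,        # the "max length" budget above, expressed as new tokens
+     do_sample=True,
+     temperature=0.4,
+     top_p=0.9,
+     top_k=50,
+     repetition_penalty=1.2,
+     pad_token_id=tokenizer.eos_token_id,
+ )
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```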
 
+ ## Training Details

+ ### Training Data

+ - **Source**: AustLII (Australasian Legal Information Institute)
+ - **Document count**: ~10,000+ legal documents
+ - **Content types**: Legal cases, legislation, legal commentary
+ - **Jurisdictions**: Australian federal and state jurisdictions

+ ### Preprocessing

+ 1. **Data Cleaning**: Removed metadata headers, navigation elements, and irrelevant text
+ 2. **Tokenization**: GPT-2 BPE tokenizer with a vocabulary size of 50,257 tokens
+ 3. **Sequence Creation**: Sliding window with 512 token max length and 256 token stride
+ 4. **Train/Val Split**: 90% training, 10% validation

+ ### Training Configuration

+ See the main repository README for detailed training configuration and code.
 
  ## Citation

+ If you use this model, please cite:

  ```bibtex
  @software{auslegal_slm,

  ## Acknowledgments

  - Legal documents scraped from [AustLII](https://www.austlii.edu.au/)
+ - Base model: [DistilGPT2](https://huggingface.co/distilgpt2) by Hugging Face
  - Built with [Transformers](https://huggingface.co/docs/transformers) library
+
+ ## Model Card Contact
+
+ For questions or issues, please open an issue in the [repository](https://github.com/JamesANZ/auslegal-slm).