Australian Legal Small Language Model (SLM)
A domain-specific Small Language Model fine-tuned on Australian legal documents from AustLII. This model is based on DistilGPT2 and has been adapted to generate text in the style of Australian legal documents.
Model Details
Model Description
- Model type: GPT-2 (Transformer decoder)
- Architecture: DistilGPT2 fine-tuned on Australian legal corpus
- Parameters: ~82M
- Language: English (Australian legal domain)
- License: MIT
Base Model
This model fine-tunes distilgpt2, a distilled version of GPT-2 with approximately 82M parameters.
Training Data
The model was fine-tuned on a corpus of Australian legal documents scraped from AustLII. The training corpus consists of legal cases, legislation, and other legal documents from Australian jurisdictions.
Data Processing:
- Documents were cleaned to remove metadata headers
- Tokenized using GPT-2 tokenizer with a maximum sequence length of 512 tokens
- Split into training (90%) and validation (10%) sets
- Used a sliding-window approach with a 256-token stride to create training sequences (see the sketch below)
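A minimal sketch of this chunking step (not the repository's exact preprocessing code; the `chunk_document` helper and the example string are illustrative only) might look like this:

```python
from transformers import GPT2TokenizerFast

# Illustrative sketch of the sliding-window chunking described above;
# chunk_document is a hypothetical helper, not code from the repository.
tokenizer = GPT2TokenizerFast.from_pretrained("distilgpt2")

MAX_LEN = 512  # maximum sequence length in tokens
STRIDE = 256   # the window advances 256 tokens per step, so chunks overlap

def chunk_document(text: str) -> list[list[int]]:
    ids = tokenizer.encode(text)
    # Slide a 512-token window across the document in 256-token steps.
    return [ids[i:i + MAX_LEN] for i in range(0, max(len(ids) - STRIDE, 1), STRIDE)]

chunks = chunk_document("The appellant submits that the primary judge erred in finding ...")
print(len(chunks), [len(c) for c in chunks])
```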
Training Procedure
Training Hyperparameters:
- Training regime: Fine-tuning (not from scratch)
- Epochs: 1 (as per training metrics)
- Learning rate: 2e-5
- Batch size: 4 (per device)
- Gradient accumulation steps: 1
- Max sequence length: 512 tokens
- Optimizer: AdamW
- Warmup steps: 100
- Mixed precision: FP16 (when GPU available)
Training Infrastructure:
- Framework: PyTorch with Hugging Face Transformers
- Hardware: CPU/GPU compatible
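As an illustration only (not the repository's training script), the hyperparameters listed above map onto Hugging Face `TrainingArguments` roughly as follows; the `output_dir` value is a placeholder:

```python
import torch
from transformers import TrainingArguments

# Hypothetical mapping of the listed hyperparameters onto TrainingArguments;
# output_dir is a placeholder, not the repository's actual path.
training_args = TrainingArguments(
    output_dir="auslegal-slm-checkpoints",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    learning_rate=2e-5,
    warmup_steps=100,
    fp16=torch.cuda.is_available(),  # mixed precision only when a GPU is present
)

# These arguments would be passed to transformers.Trainer along with the
# DistilGPT2 model and the tokenized train/validation sets; Trainer's default
# optimizer is AdamW, matching the optimizer listed above.
print(training_args.learning_rate, training_args.warmup_steps)
```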
Evaluation Results
Metrics
| Metric | Value |
|---|---|
| Validation Loss | 3.19 |
| Perplexity | 24.34 |
| Training Loss | 3.29 |
Note: Lower perplexity indicates better performance. A perplexity of ~24 is reasonable for a domain-adapted model of this size.
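The two headline numbers are consistent with each other: perplexity is the exponential of the validation cross-entropy loss, so the rounded loss of 3.19 reproduces the reported perplexity to within rounding.

```python
import math

# Perplexity = exp(cross-entropy loss); exp(3.19) ≈ 24.29, and the reported
# 24.34 follows from the unrounded validation loss.
print(math.exp(3.19))
```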
Intended Use
Direct Use
This model is intended for:
- Research and educational purposes: Exploring domain-specific language modeling
- Legal text generation: Generating text in the style of Australian legal documents
- Domain adaptation experiments: As a baseline for legal domain language models
Out-of-Scope Use
⚠️ This model should NOT be used for:
- Legal advice or legal decision-making
- Production legal applications without additional safeguards
- Any application requiring guaranteed factual accuracy
- Replacing professional legal research or consultation
Limitations and Bias
Known Limitations
- Hallucination Risk: The model may generate plausible-sounding but incorrect legal information. Fine-tuning reduces but does not eliminate hallucinations.
- Limited Coverage: Training on a relatively small corpus (roughly 10,000 documents) means the model may not have seen all areas of Australian law.
- Temporal Limitations: Documents reflect the state of the law at scraping time; laws may have changed since training.
- Context Window: Limited to 512 tokens, restricting the amount of context the model can consider.
- No Citations: The model does not explicitly cite sources (unlike retrieval-augmented generation systems).
- Generalization: The model may overfit to specific documents or underperform on unseen legal topics.
Bias Considerations
- The model inherits biases from both the base model (DistilGPT2) and the training corpus
- Legal documents may reflect historical biases present in the legal system
- The model may reproduce or amplify biases found in the training data
- Users should be aware that legal language and concepts may not be neutral
Ethical Considerations
- Not for Legal Advice: This model is a research tool and should not be used to provide legal advice
- Factual Accuracy: Generated content should be verified against authoritative legal sources
- Bias Awareness: Users should be aware of potential biases in generated content
- Responsible Use: Should be used responsibly and with appropriate safeguards
How to Use
Basic Usage
```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the fine-tuned model and tokenizer
model = GPT2LMHeadModel.from_pretrained("JamesANZ/auslegal-slm")
tokenizer = GPT2Tokenizer.from_pretrained("JamesANZ/auslegal-slm")

# Generate text from a legal-style prompt
prompt = "In Australian law, negligence is defined as"
inputs = tokenizer.encode(prompt, return_tensors="pt")
outputs = model.generate(
    inputs,
    max_length=250,
    temperature=0.4,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
Recommended Generation Parameters
- Temperature: 0.3-0.5 (lower = more deterministic, reduces hallucinations)
- Max length: 250 tokens (prevents rambling)
- Top-p (nucleus): 0.9
- Top-k: 50
- Repetition penalty: 1.2
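Continuing from the Basic Usage snippet above (reusing `model`, `tokenizer`, and `inputs`), these recommendations can be combined in a single `generate` call; the temperature of 0.4 is one choice from the suggested 0.3-0.5 range:

```python
# Reuses model, tokenizer, and inputs from the Basic Usage snippet above.
outputs = model.generate(
    inputs,
    max_length=250,             # keeps generations from rambling
    do_sample=True,
    temperature=0.4,            # one choice from the suggested 0.3-0.5 range
    top_p=0.9,                  # nucleus sampling
    top_k=50,
    repetition_penalty=1.2,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```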
Training Details
Training Data
- Source: AustLII (Australasian Legal Information Institute)
- Document count: roughly 10,000 legal documents
- Content types: Legal cases, legislation, legal commentary
- Jurisdictions: Australian federal and state jurisdictions
Preprocessing
- Data Cleaning: Removed metadata headers, navigation elements, and irrelevant text
- Tokenization: GPT-2 BPE tokenizer with vocabulary size of 50,257 tokens
- Sequence Creation: Sliding window with a 512-token maximum length and a 256-token stride
- Train/Val Split: 90% training, 10% validation (a sketch of the split follows this list)
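A sketch of the 90/10 split, reusing `chunks` from the sliding-window sketch in the Training Data section; the seed and shuffling strategy shown here are assumptions, not documented choices:

```python
import random

# Hypothetical 90/10 split; the seed and shuffle are assumptions, and `chunks`
# is the list of token sequences from the sliding-window sketch above.
random.seed(42)
random.shuffle(chunks)
split_idx = int(0.9 * len(chunks))
train_chunks, val_chunks = chunks[:split_idx], chunks[split_idx:]
print(f"{len(train_chunks)} training sequences, {len(val_chunks)} validation sequences")
```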
Training Configuration
See the main repository README for detailed training configuration and code.
Citation
If you use this model, please cite:
```bibtex
@software{auslegal_slm,
  title  = {Australian Legal Small Language Model},
  author = {James Sangalli},
  year   = {2025},
  url    = {https://github.com/JamesANZ/auslegal-slm}
}
```
Acknowledgments
- Legal documents scraped from AustLII
- Base model: DistilGPT2 by Hugging Face
- Built with Transformers library
Model Card Contact
For questions or issues, please open an issue in the repository.