Australian Legal Small Language Model (SLM)
A domain-specific Small Language Model fine-tuned on Australian legal documents from AustLII. This model is based on DistilGPT2 and has been adapted to generate text in the style of Australian legal documents.
Model Details
Model Description
- Model type: GPT-2 (Transformer decoder)
- Architecture: DistilGPT2 fine-tuned on Australian legal corpus
- Parameters: ~82M
- Language: English (Australian legal domain)
- License: MIT
Base Model
This model fine-tunes distilgpt2, a distilled version of GPT-2 with approximately 82M parameters.
Training Data
The model was fine-tuned on a corpus of Australian legal documents scraped from AustLII. The training corpus consists of legal cases, legislation, and other legal documents from Australian jurisdictions.
Data Processing:
- Documents were cleaned to remove metadata headers
- Tokenized using GPT-2 tokenizer with a maximum sequence length of 512 tokens
- Split into training (90%) and validation (10%) sets
- Used a sliding-window approach with a 256-token stride to create training sequences (see the sketch below)
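A minimal sketch of this chunking step (not the repository's exact preprocessing code; the `chunk_document` helper and the example string are illustrative only) might look like this:

```python
from transformers import GPT2TokenizerFast

# Illustrative sketch of the sliding-window chunking described above;
# chunk_document is a hypothetical helper, not code from the repository.
tokenizer = GPT2TokenizerFast.from_pretrained("distilgpt2")

MAX_LEN = 512  # maximum sequence length in tokens
STRIDE = 256   # the window advances 256 tokens per step, so chunks overlap

def chunk_document(text: str) -> list[list[int]]:
    ids = tokenizer.encode(text)
    # Slide a 512-token window across the document in 256-token steps.
    return [ids[i:i + MAX_LEN] for i in range(0, max(len(ids) - STRIDE, 1), STRIDE)]

chunks = chunk_document("The appellant submits that the primary judge erred in finding ...")
print(len(chunks), [len(c) for c in chunks])
```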
Training Procedure
Training Hyperparameters:
- Training regime: Fine-tuning (not from scratch)
- Epochs: 1 (as per training metrics)
- Learning rate: 2e-5
- Batch size: 4 (per device)
- Gradient accumulation steps: 1
- Max sequence length: 512 tokens
- Optimizer: AdamW
- Warmup steps: 100
- Mixed precision: FP16 (when GPU available)
Training Infrastructure:
- Framework: PyTorch with Hugging Face Transformers
- Hardware: CPU/GPU compatible
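As an illustration only (not the repository's training script), the hyperparameters listed above map onto Hugging Face `TrainingArguments` roughly as follows; the `output_dir` value is a placeholder:

```python
import torch
from transformers import TrainingArguments

# Hypothetical mapping of the listed hyperparameters onto TrainingArguments;
# output_dir is a placeholder, not the repository's actual path.
training_args = TrainingArguments(
    output_dir="auslegal-slm-checkpoints",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    learning_rate=2e-5,
    warmup_steps=100,
    fp16=torch.cuda.is_available(),  # mixed precision only when a GPU is present
)

# These arguments would be passed to transformers.Trainer along with the
# DistilGPT2 model and the tokenized train/validation sets; Trainer's default
# optimizer is AdamW, matching the optimizer listed above.
print(training_args.learning_rate, training_args.warmup_steps)
```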
Evaluation Results
Metrics
| Metric | Value |
|---|---|
| Validation Loss | 3.19 |
| Perplexity | 24.34 |
| Training Loss | 3.29 |
Note: Lower perplexity indicates better performance. A perplexity of ~24 is reasonable for a domain-adapted model of this size.
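The two headline numbers are consistent with each other: perplexity is the exponential of the validation cross-entropy loss, so the rounded loss of 3.19 reproduces the reported perplexity to within rounding.

```python
import math

# Perplexity = exp(cross-entropy loss); exp(3.19) ≈ 24.29, and the reported
# 24.34 follows from the unrounded validation loss.
print(math.exp(3.19))
```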
Intended Use
Direct Use
This model is intended for:
- Research and educational purposes: Exploring domain-specific language modeling
- Legal text generation: Generating text in the style of Australian legal documents
- Domain adaptation experiments: As a baseline for legal domain language models
Out-of-Scope Use
⚠️ This model should NOT be used for:
- Legal advice or legal decision-making
- Production legal applications without additional safeguards
- Any application requiring guaranteed factual accuracy
- Replacing professional legal research or consultation
Limitations and Bias
Known Limitations
- Hallucination Risk: The model may generate plausible-sounding but incorrect legal information. Fine-tuning reduces but does not eliminate hallucinations.
- Limited Coverage: Training on a relatively small corpus (roughly 10,000 documents) means the model may not have seen all areas of Australian law.
- Temporal Limitations: Documents reflect the state of the law at scraping time; laws may have changed since training.
- Context Window: Limited to 512 tokens, restricting the amount of context the model can consider.
- No Citations: The model does not explicitly cite sources (unlike retrieval-augmented generation systems).
- Generalization: The model may overfit to specific documents or underperform on unseen legal topics.
Bias Considerations
- The model inherits biases from both the base model (DistilGPT2) and the training corpus
- Legal documents may reflect historical biases present in the legal system
- The model may reproduce or amplify biases found in the training data
- Users should be aware that legal language and concepts may not be neutral
Ethical Considerations
- Not for Legal Advice: This model is a research tool and should not be used to provide legal advice
- Factual Accuracy: Generated content should be verified against authoritative legal sources
- Bias Awareness: Users should be aware of potential biases in generated content
- Responsible Use: Should be used responsibly and with appropriate safeguards
How to Use
Basic Usage
```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the fine-tuned model and tokenizer
model = GPT2LMHeadModel.from_pretrained("JamesANZ/auslegal-slm")
tokenizer = GPT2Tokenizer.from_pretrained("JamesANZ/auslegal-slm")

# Generate text from a legal-style prompt
prompt = "In Australian law, negligence is defined as"
inputs = tokenizer.encode(prompt, return_tensors="pt")
outputs = model.generate(
    inputs,
    max_length=250,
    temperature=0.4,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
Recommended Generation Parameters
- Temperature: 0.3-0.5 (lower = more deterministic, reduces hallucinations)
- Max length: 250 tokens (prevents rambling)
- Top-p (nucleus): 0.9
- Top-k: 50
- Repetition penalty: 1.2
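Continuing from the Basic Usage snippet above (reusing `model`, `tokenizer`, and `inputs`), these recommendations can be combined in a single `generate` call; the temperature of 0.4 is one choice from the suggested 0.3-0.5 range:

```python
# Reuses model, tokenizer, and inputs from the Basic Usage snippet above.
outputs = model.generate(
    inputs,
    max_length=250,             # keeps generations from rambling
    do_sample=True,
    temperature=0.4,            # one choice from the suggested 0.3-0.5 range
    top_p=0.9,                  # nucleus sampling
    top_k=50,
    repetition_penalty=1.2,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```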
Training Details
Training Data
- Source: AustLII (Australasian Legal Information Institute)
- Document count: roughly 10,000 legal documents
- Content types: Legal cases, legislation, legal commentary
- Jurisdictions: Australian federal and state jurisdictions
Preprocessing
- Data Cleaning: Removed metadata headers, navigation elements, and irrelevant text
- Tokenization: GPT-2 BPE tokenizer with vocabulary size of 50,257 tokens
- Sequence Creation: Sliding window with a 512-token maximum length and a 256-token stride
- Train/Val Split: 90% training, 10% validation (a sketch of the split follows this list)
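A sketch of the 90/10 split, reusing `chunks` from the sliding-window sketch in the Training Data section; the seed and shuffling strategy shown here are assumptions, not documented choices:

```python
import random

# Hypothetical 90/10 split; the seed and shuffle are assumptions, and `chunks`
# is the list of token sequences from the sliding-window sketch above.
random.seed(42)
random.shuffle(chunks)
split_idx = int(0.9 * len(chunks))
train_chunks, val_chunks = chunks[:split_idx], chunks[split_idx:]
print(f"{len(train_chunks)} training sequences, {len(val_chunks)} validation sequences")
```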
Training Configuration
See the main repository README for detailed training configuration and code.
Citation
If you use this model, please cite:
```bibtex
@software{auslegal_slm,
  title  = {Australian Legal Small Language Model},
  author = {James Sangalli},
  year   = {2025},
  url    = {https://github.com/JamesANZ/auslegal-slm}
}
```
Acknowledgments
- Legal documents scraped from AustLII
- Base model: DistilGPT2 by Hugging Face
- Built with Transformers library
Model Card Contact
For questions or issues, please open an issue in the repository.