Australian Legal Small Language Model (SLM)

A domain-specific Small Language Model fine-tuned on Australian legal documents from AustLII. This model is based on DistilGPT2 and has been adapted to generate text in the style of Australian legal documents.

Model Details

Model Description

  • Model type: GPT-2 (Transformer decoder)
  • Architecture: DistilGPT2 fine-tuned on an Australian legal corpus
  • Parameters: ~82M
  • Language: English (Australian legal domain)
  • License: MIT

Base Model

This model fine-tunes distilgpt2, a distilled version of GPT-2 with approximately 82M parameters.
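
To verify the parameter count locally, a quick check looks like this (a minimal sketch; it assumes the transformers and torch packages are installed):

from transformers import AutoModelForCausalLM

# Load the checkpoint and count its parameters; distilgpt2-based
# checkpoints report roughly 82M.
model = AutoModelForCausalLM.from_pretrained("JamesANZ/auslegal-slm")
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.1f}M parameters")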

Training Data

The model was fine-tuned on a corpus of Australian legal documents scraped from AustLII. The training corpus consists of legal cases, legislation, and other legal documents from Australian jurisdictions.

Data Processing:

  • Documents were cleaned to remove metadata headers
  • Tokenized using the GPT-2 tokenizer with a maximum sequence length of 512 tokens
  • Split into training (90%) and validation (10%) sets
  • Used a sliding-window approach with a 256-token stride for sequence creation (see the sketch below)
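
The sliding-window step can be sketched roughly as follows; the helper name and the use of GPT2TokenizerFast are illustrative rather than taken from the actual training code:

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("distilgpt2")

def make_sequences(text, max_length=512, stride=256):
    # Tokenize one cleaned document and slice it into overlapping
    # windows of at most max_length tokens, advancing by stride.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    sequences = []
    for start in range(0, max(len(ids), 1), stride):
        sequences.append(ids[start:start + max_length])
        if start + max_length >= len(ids):
            break
    return sequences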

Training Procedure

Training Hyperparameters:

  • Training regime: Fine-tuning (not from scratch)
  • Epochs: 1 (as per training metrics)
  • Learning rate: 2e-5
  • Batch size: 4 (per device)
  • Gradient accumulation steps: 1
  • Max sequence length: 512 tokens
  • Optimizer: AdamW
  • Warmup steps: 100
  • Mixed precision: FP16 (when a GPU is available); see the configuration sketch below

Training Infrastructure:

  • Framework: PyTorch with Hugging Face Transformers
  • Hardware: CPU/GPU compatible
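
A minimal sketch of how the hyperparameters above map onto Hugging Face TrainingArguments, assuming the standard Trainer API was used; the output_dir name is illustrative and the actual training script may differ (see the repository README):

import torch
from transformers import TrainingArguments

# AdamW is the Trainer's default optimizer, so it is not set explicitly.
training_args = TrainingArguments(
    output_dir="auslegal-slm",           # illustrative path
    num_train_epochs=1,
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    warmup_steps=100,
    fp16=torch.cuda.is_available(),      # mixed precision only when a GPU is present
)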

Evaluation Results

Metrics

Metric           Value
Validation Loss  3.19
Perplexity       24.34
Training Loss    3.29

Note: Lower perplexity indicates better performance. A perplexity of ~24 is reasonable for a domain-adapted model of this size.
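
The perplexity figure follows directly from the validation loss, since perplexity is the exponential of the average cross-entropy loss:

import math

validation_loss = 3.19
perplexity = math.exp(validation_loss)
# ~24.29; the reported 24.34 corresponds to a less-rounded loss value (exp(3.192) ≈ 24.34)
print(round(perplexity, 2))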

Intended Use

Direct Use

This model is intended for:

  • Research and educational purposes: Exploring domain-specific language modeling
  • Legal text generation: Generating text in the style of Australian legal documents
  • Domain adaptation experiments: As a baseline for legal domain language models

Out-of-Scope Use

โš ๏ธ This model should NOT be used for:

  • Legal advice or legal decision-making
  • Production legal applications without additional safeguards
  • Any application requiring guaranteed factual accuracy
  • Replacing professional legal research or consultation

Limitations and Bias

Known Limitations

  1. Hallucination Risk: The model may generate plausible-sounding but incorrect legal information. Fine-tuning reduces but does not eliminate hallucinations.

  2. Limited Coverage: Training on a relatively small corpus (approximately 10,000 documents) means the model may not have seen all areas of Australian law.

  3. Temporal Limitations: Documents reflect the state of law at scraping time; laws may have changed since training.

  4. Context Window: Limited to 512 tokens, restricting the amount of context the model can consider.

  5. No Citations: The model does not explicitly cite sources (unlike retrieval-augmented generation systems).

  6. Generalization: May overfit to specific documents or underperform on unseen legal topics.

Bias Considerations

  • The model inherits biases from both the base model (DistilGPT2) and the training corpus
  • Legal documents may reflect historical biases present in the legal system
  • The model may reproduce or amplify biases found in the training data
  • Users should be aware that legal language and concepts may not be neutral

Ethical Considerations

  • Not for Legal Advice: This model is a research tool and should not be used to provide legal advice
  • Factual Accuracy: Generated content should be verified against authoritative legal sources
  • Bias Awareness: Users should be aware of potential biases in generated content
  • Responsible Use: Should be used responsibly and with appropriate safeguards

How to Use

Basic Usage

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load model and tokenizer
model = GPT2LMHeadModel.from_pretrained("JamesANZ/auslegal-slm")
tokenizer = GPT2Tokenizer.from_pretrained("JamesANZ/auslegal-slm")

# Generate text
prompt = "In Australian law, negligence is defined as"
inputs = tokenizer.encode(prompt, return_tensors="pt")
outputs = model.generate(
    inputs,
    max_length=250,
    temperature=0.4,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

Recommended Generation Parameters

  • Temperature: 0.3-0.5 (lower = more deterministic, reduces hallucinations)
  • Max length: 250 tokens (prevents rambling)
  • Top-p (nucleus): 0.9
  • Top-k: 50
  • Repetition penalty: 1.2 (see the example below)
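
Applied to the basic usage example above, these settings look roughly like this (reusing the model, tokenizer, and inputs objects defined there):

# Sample with the recommended decoding parameters.
outputs = model.generate(
    inputs,
    max_length=250,
    do_sample=True,
    temperature=0.4,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.2,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))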

Training Details

Training Data

  • Source: AustLII (Australasian Legal Information Institute)
  • Document count: approximately 10,000 legal documents
  • Content types: Legal cases, legislation, legal commentary
  • Jurisdictions: Australian federal and state jurisdictions

Preprocessing

  1. Data Cleaning: Removed metadata headers, navigation elements, and irrelevant text
  2. Tokenization: GPT-2 BPE tokenizer with vocabulary size of 50,257 tokens
  3. Sequence Creation: Sliding window with 512 token max length and 256 token stride
  4. Train/Val Split: 90% training, 10% validation

Training Configuration

See the main repository README for detailed training configuration and code.

Citation

If you use this model, please cite:

@software{auslegal_slm,
  title = {Australian Legal Small Language Model},
  author = {James Sangalli},
  year = {2025},
  url = {https://github.com/JamesANZ/auslegal-slm}
}

Model Card Contact

For questions or issues, please open an issue in the repository.
