---
language:
- en
tags:
- legal
- australia
- law
- causal-lm
- text-generation
- domain-adapted
- slm
- distilgpt2
license: mit
base_model: distilgpt2
library_name: transformers
pipeline_tag: text-generation
datasets:
- custom
metrics:
- perplexity
model-index:
- name: auslegal-slm
  results:
  - task:
      type: text-generation
    dataset:
      name: Australian Legal Corpus (AustLII)
      type: custom
    metrics:
    - name: Perplexity
      type: perplexity
      value: 24.34
    - name: Validation Loss
      type: loss
      value: 3.19
---

# Australian Legal Small Language Model (SLM)

A domain-specific Small Language Model fine-tuned on Australian legal documents from AustLII. This model is based on DistilGPT2 and has been adapted to generate text in the style of Australian legal documents.

## Model Details

### Model Description

- **Model type**: GPT-2 (Transformer decoder)
- **Architecture**: DistilGPT2 fine-tuned on an Australian legal corpus
- **Parameters**: ~82M
- **Language**: English (Australian legal domain)
- **License**: MIT

### Base Model

This model is a fine-tuned version of [distilgpt2](https://huggingface.co/distilgpt2), a distilled version of GPT-2 with 82M parameters.

### Training Data

The model was fine-tuned on a corpus of Australian legal documents scraped from [AustLII](https://www.austlii.edu.au/). The training corpus consists of legal cases, legislation, and other legal documents from Australian jurisdictions.

**Data Processing**:

- Documents were cleaned to remove metadata headers
- Tokenized using the GPT-2 tokenizer with a maximum sequence length of 512 tokens
- Split into training (90%) and validation (10%) sets
- Sequences were created with a sliding-window approach using a 256-token stride (see the sketch below)

### Training Procedure

**Training Hyperparameters**:

- **Training regime**: Fine-tuning (not from scratch)
- **Epochs**: 1
- **Learning rate**: 2e-5
- **Batch size**: 4 (per device)
- **Gradient accumulation steps**: 1
- **Max sequence length**: 512 tokens
- **Optimizer**: AdamW
- **Warmup steps**: 100
- **Mixed precision**: FP16 (when a GPU is available)

**Training Infrastructure**:

- Framework: PyTorch with Hugging Face Transformers
- Hardware: CPU/GPU compatible
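For orientation, the preprocessing and hyperparameters listed above map onto a standard Hugging Face `Trainer` setup roughly as sketched below. This is a minimal, hypothetical sketch rather than the project's actual training script: the corpus is reduced to a placeholder list (`documents`), the 90/10 split is applied at the sequence level for brevity, and `stride=256` follows the Hugging Face tokenizer convention of overlapping tokens between adjacent 512-token windows.

```python
import torch
from transformers import (
    DataCollatorForLanguageModeling,
    GPT2LMHeadModel,
    GPT2TokenizerFast,
    Trainer,
    TrainingArguments,
)

# Placeholder for the cleaned AustLII corpus (one string per document).
documents = ["The appellant submits that the primary judge erred in ..."]

tokenizer = GPT2TokenizerFast.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no dedicated pad token

# Sliding-window sequence creation: 512-token windows, with adjacent windows
# overlapping by 256 tokens.
encodings = tokenizer(
    documents,
    max_length=512,
    stride=256,
    truncation=True,
    return_overflowing_tokens=True,
)
examples = [{"input_ids": ids} for ids in encodings["input_ids"]]

# 90% / 10% train/validation split (applied per sequence in this sketch).
split = max(1, int(0.9 * len(examples)))
train_ds, val_ds = examples[:split], examples[split:]

model = GPT2LMHeadModel.from_pretrained("distilgpt2")

# AdamW is the Trainer's default optimizer, matching the card above.
args = TrainingArguments(
    output_dir="auslegal-slm",
    num_train_epochs=1,
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    warmup_steps=100,
    fp16=torch.cuda.is_available(),  # mixed precision only when a GPU is present
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```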
## Evaluation Results

### Metrics

| Metric | Value |
|--------|-------|
| Validation Loss | 3.19 |
| Perplexity | 24.34 |
| Training Loss | 3.29 |

**Note**: Lower perplexity indicates better performance. The reported perplexity is the exponential of the validation loss (exp(3.19) ≈ 24.3); a perplexity of ~24 is reasonable for a domain-adapted model of this size.

## Intended Use

### Direct Use

This model is intended for:

- **Research and educational purposes**: Exploring domain-specific language modeling
- **Legal text generation**: Generating text in the style of Australian legal documents
- **Domain adaptation experiments**: As a baseline for legal domain language models

### Out-of-Scope Use

⚠️ **This model should NOT be used for**:

- Legal advice or legal decision-making
- Production legal applications without additional safeguards
- Any application requiring guaranteed factual accuracy
- Replacing professional legal research or consultation

## Limitations and Bias

### Known Limitations

1. **Hallucination Risk**: The model may generate plausible-sounding but incorrect legal information. Fine-tuning reduces but does not eliminate hallucinations.
2. **Limited Coverage**: Training on a relatively small corpus (~10,000+ documents) means the model may not have seen all areas of Australian law.
3. **Temporal Limitations**: Documents reflect the state of the law at scraping time; laws may have changed since training.
4. **Context Window**: Limited to 512 tokens, restricting the amount of context the model can consider.
5. **No Citations**: The model does not explicitly cite sources (unlike RAG systems).
6. **Generalization**: May overfit to specific documents or underperform on unseen legal topics.

### Bias Considerations

- The model inherits biases from both the base model (DistilGPT2) and the training corpus
- Legal documents may reflect historical biases present in the legal system
- The model may reproduce or amplify biases found in the training data
- Users should be aware that legal language and concepts may not be neutral

### Ethical Considerations

- **Not for Legal Advice**: This model is a research tool and should not be used to provide legal advice
- **Factual Accuracy**: Generated content should be verified against authoritative legal sources
- **Bias Awareness**: Users should be aware of potential biases in generated content
- **Responsible Use**: Should be used responsibly and with appropriate safeguards

## How to Use

### Basic Usage

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load model and tokenizer
model = GPT2LMHeadModel.from_pretrained("JamesANZ/auslegal-slm")
tokenizer = GPT2Tokenizer.from_pretrained("JamesANZ/auslegal-slm")

# Generate text
prompt = "In Australian law, negligence is defined as"
inputs = tokenizer.encode(prompt, return_tensors="pt")
outputs = model.generate(
    inputs,
    max_length=250,
    temperature=0.4,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

### Recommended Generation Parameters

- **Temperature**: 0.3-0.5 (lower = more deterministic, reduces hallucinations)
- **Max length**: 250 tokens (prevents rambling)
- **Top-p (nucleus)**: 0.9
- **Top-k**: 50
- **Repetition penalty**: 1.2

A short example applying these settings together is included as an appendix at the end of this card.

## Training Details

### Training Data

- **Source**: AustLII (Australasian Legal Information Institute)
- **Document count**: ~10,000+ legal documents
- **Content types**: Legal cases, legislation, legal commentary
- **Jurisdictions**: Australian federal and state jurisdictions

### Preprocessing

1. **Data Cleaning**: Removed metadata headers, navigation elements, and irrelevant text
2. **Tokenization**: GPT-2 BPE tokenizer with a vocabulary of 50,257 tokens
3. **Sequence Creation**: Sliding window with a 512-token maximum length and a 256-token stride
4. **Train/Val Split**: 90% training, 10% validation

### Training Configuration

See the main repository README for detailed training configuration and code.

## Citation

If you use this model, please cite:

```bibtex
@software{auslegal_slm,
  title = {Australian Legal Small Language Model},
  author = {James Sangalli},
  year = {2025},
  url = {https://github.com/JamesANZ/auslegal-slm}
}
```

## Acknowledgments

- Legal documents scraped from [AustLII](https://www.austlii.edu.au/)
- Base model: [DistilGPT2](https://huggingface.co/distilgpt2) by Hugging Face
- Built with the [Transformers](https://huggingface.co/docs/transformers) library

## Model Card Contact

For questions or issues, please open an issue in the [repository](https://github.com/JamesANZ/auslegal-slm).
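## Appendix: Generating with the Recommended Parameters

As referenced in the Recommended Generation Parameters section, the sketch below shows one way to apply those settings together in a single `generate()` call. It reuses the model ID and prompt from the Basic Usage example; the exact values remain suggestions rather than tuned defaults.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("JamesANZ/auslegal-slm")
tokenizer = GPT2Tokenizer.from_pretrained("JamesANZ/auslegal-slm")

prompt = "In Australian law, negligence is defined as"
inputs = tokenizer.encode(prompt, return_tensors="pt")

# Sampling with the recommended settings: low temperature plus nucleus/top-k
# filtering and a repetition penalty to keep output focused and non-repetitive.
outputs = model.generate(
    inputs,
    max_length=250,          # prevents rambling
    do_sample=True,
    temperature=0.4,         # within the recommended 0.3-0.5 range
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.2,
    pad_token_id=tokenizer.eos_token_id,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```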