---
language:
- en
tags:
- legal
- australia
- law
- causal-lm
- text-generation
- domain-adapted
- slm
- distilgpt2
license: mit
base_model: distilgpt2
library_name: transformers
pipeline_tag: text-generation
datasets:
- custom
metrics:
- perplexity
model-index:
- name: auslegal-slm
  results:
  - task:
      type: text-generation
    dataset:
      name: Australian Legal Corpus (AustLII)
      type: custom
    metrics:
    - name: Perplexity
      type: perplexity
      value: 24.34
    - name: Validation Loss
      type: loss
      value: 3.19
---

# Australian Legal Small Language Model (SLM)

A domain-specific Small Language Model fine-tuned on Australian legal documents from AustLII. This model is based on DistilGPT2 and has been adapted to generate text in the style of Australian legal documents.

## Model Details

### Model Description

- **Model type**: GPT-2 (Transformer decoder)
- **Architecture**: DistilGPT2 fine-tuned on an Australian legal corpus
- **Parameters**: ~82M
- **Language**: English (Australian legal domain)
- **License**: MIT

### Base Model

This model is a fine-tuned version of [distilgpt2](https://huggingface.co/distilgpt2), a distilled version of GPT-2 with 82M parameters.

### Training Data

The model was fine-tuned on a corpus of Australian legal documents scraped from [AustLII](https://www.austlii.edu.au/). The training corpus consists of legal cases, legislation, and other legal documents from Australian jurisdictions.

**Data Processing**:

- Documents were cleaned to remove metadata headers
- Tokenized using the GPT-2 tokenizer with a maximum sequence length of 512 tokens
- Split into training (90%) and validation (10%) sets
- Sequences were created with a sliding-window approach using a 256-token stride (see the sketch below)

### Training Procedure

**Training Hyperparameters**:

- **Training regime**: Fine-tuning (not from scratch)
- **Epochs**: 1
- **Learning rate**: 2e-5
- **Batch size**: 4 (per device)
- **Gradient accumulation steps**: 1
- **Max sequence length**: 512 tokens
- **Optimizer**: AdamW
- **Warmup steps**: 100
- **Mixed precision**: FP16 (when a GPU is available)

**Training Infrastructure**:

- Framework: PyTorch with Hugging Face Transformers
- Hardware: CPU/GPU compatible
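For orientation, the preprocessing and hyperparameters listed above map onto a standard Hugging Face `Trainer` setup roughly as sketched below. This is a minimal, hypothetical sketch rather than the project's actual training script: the corpus is reduced to a placeholder list (`documents`), the 90/10 split is applied at the sequence level for brevity, and `stride=256` follows the Hugging Face tokenizer convention of overlapping tokens between adjacent 512-token windows.

```python
import torch
from transformers import (
    DataCollatorForLanguageModeling,
    GPT2LMHeadModel,
    GPT2TokenizerFast,
    Trainer,
    TrainingArguments,
)

# Placeholder for the cleaned AustLII corpus (one string per document).
documents = ["The appellant submits that the primary judge erred in ..."]

tokenizer = GPT2TokenizerFast.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no dedicated pad token

# Sliding-window sequence creation: 512-token windows, with adjacent windows
# overlapping by 256 tokens.
encodings = tokenizer(
    documents,
    max_length=512,
    stride=256,
    truncation=True,
    return_overflowing_tokens=True,
)
examples = [{"input_ids": ids} for ids in encodings["input_ids"]]

# 90% / 10% train/validation split (applied per sequence in this sketch).
split = max(1, int(0.9 * len(examples)))
train_ds, val_ds = examples[:split], examples[split:]

model = GPT2LMHeadModel.from_pretrained("distilgpt2")

# AdamW is the Trainer's default optimizer, matching the card above.
args = TrainingArguments(
    output_dir="auslegal-slm",
    num_train_epochs=1,
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    warmup_steps=100,
    fp16=torch.cuda.is_available(),  # mixed precision only when a GPU is present
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```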
## Evaluation Results

### Metrics

| Metric | Value |
|--------|-------|
| Validation Loss | 3.19 |
| Perplexity | 24.34 |
| Training Loss | 3.29 |

**Note**: Lower perplexity indicates better performance. The reported perplexity is the exponential of the validation loss (exp(3.19) ≈ 24.3); a perplexity of ~24 is reasonable for a domain-adapted model of this size.

## Intended Use

### Direct Use

This model is intended for:

- **Research and educational purposes**: Exploring domain-specific language modeling
- **Legal text generation**: Generating text in the style of Australian legal documents
- **Domain adaptation experiments**: As a baseline for legal domain language models

### Out-of-Scope Use

⚠️ **This model should NOT be used for**:

- Legal advice or legal decision-making
- Production legal applications without additional safeguards
- Any application requiring guaranteed factual accuracy
- Replacing professional legal research or consultation

## Limitations and Bias

### Known Limitations

1. **Hallucination Risk**: The model may generate plausible-sounding but incorrect legal information. Fine-tuning reduces but does not eliminate hallucinations.
2. **Limited Coverage**: Training on a relatively small corpus (~10,000+ documents) means the model may not have seen all areas of Australian law.
3. **Temporal Limitations**: Documents reflect the state of the law at scraping time; laws may have changed since training.
4. **Context Window**: Limited to 512 tokens, restricting the amount of context the model can consider.
5. **No Citations**: The model does not explicitly cite sources (unlike RAG systems).
6. **Generalization**: May overfit to specific documents or underperform on unseen legal topics.

### Bias Considerations

- The model inherits biases from both the base model (DistilGPT2) and the training corpus
- Legal documents may reflect historical biases present in the legal system
- The model may reproduce or amplify biases found in the training data
- Users should be aware that legal language and concepts may not be neutral

### Ethical Considerations

- **Not for Legal Advice**: This model is a research tool and should not be used to provide legal advice
- **Factual Accuracy**: Generated content should be verified against authoritative legal sources
- **Bias Awareness**: Users should be aware of potential biases in generated content
- **Responsible Use**: Should be used responsibly and with appropriate safeguards

## How to Use

### Basic Usage

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load model and tokenizer
model = GPT2LMHeadModel.from_pretrained("JamesANZ/auslegal-slm")
tokenizer = GPT2Tokenizer.from_pretrained("JamesANZ/auslegal-slm")

# Generate text
prompt = "In Australian law, negligence is defined as"
inputs = tokenizer.encode(prompt, return_tensors="pt")
outputs = model.generate(
    inputs,
    max_length=250,
    temperature=0.4,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

### Recommended Generation Parameters

- **Temperature**: 0.3-0.5 (lower = more deterministic, reduces hallucinations)
- **Max length**: 250 tokens (prevents rambling)
- **Top-p (nucleus)**: 0.9
- **Top-k**: 50
- **Repetition penalty**: 1.2

A short example applying these settings together is included as an appendix at the end of this card.

## Training Details

### Training Data

- **Source**: AustLII (Australasian Legal Information Institute)
- **Document count**: ~10,000+ legal documents
- **Content types**: Legal cases, legislation, legal commentary
- **Jurisdictions**: Australian federal and state jurisdictions

### Preprocessing

1. **Data Cleaning**: Removed metadata headers, navigation elements, and irrelevant text
2. **Tokenization**: GPT-2 BPE tokenizer with a vocabulary of 50,257 tokens
3. **Sequence Creation**: Sliding window with a 512-token maximum length and a 256-token stride
4. **Train/Val Split**: 90% training, 10% validation

### Training Configuration

See the main repository README for detailed training configuration and code.

## Citation

If you use this model, please cite:

```bibtex
@software{auslegal_slm,
  title = {Australian Legal Small Language Model},
  author = {James Sangalli},
  year = {2025},
  url = {https://github.com/JamesANZ/auslegal-slm}
}
```

## Acknowledgments

- Legal documents scraped from [AustLII](https://www.austlii.edu.au/)
- Base model: [DistilGPT2](https://huggingface.co/distilgpt2) by Hugging Face
- Built with the [Transformers](https://huggingface.co/docs/transformers) library

## Model Card Contact

For questions or issues, please open an issue in the [repository](https://github.com/JamesANZ/auslegal-slm).
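## Appendix: Generating with the Recommended Parameters

As referenced in the Recommended Generation Parameters section, the sketch below shows one way to apply those settings together in a single `generate()` call. It reuses the model ID and prompt from the Basic Usage example; the exact values remain suggestions rather than tuned defaults.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("JamesANZ/auslegal-slm")
tokenizer = GPT2Tokenizer.from_pretrained("JamesANZ/auslegal-slm")

prompt = "In Australian law, negligence is defined as"
inputs = tokenizer.encode(prompt, return_tensors="pt")

# Sampling with the recommended settings: low temperature plus nucleus/top-k
# filtering and a repetition penalty to keep output focused and non-repetitive.
outputs = model.generate(
    inputs,
    max_length=250,          # prevents rambling
    do_sample=True,
    temperature=0.4,         # within the recommended 0.3-0.5 range
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.2,
    pad_token_id=tokenizer.eos_token_id,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```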