metadata

license: apache-2.0
language:
  - zh
tags:
  - legal

This repository contains DISC-LawLLM, which uses Baichuan-13B-Base as its base model.

DISC-LawLLM is a large language model specialized in the Chinese legal domain, developed and open-sourced by the Data Intelligence and Social Computing Laboratory of Fudan University (Fudan-DISC). It aims to provide comprehensive intelligent legal services. The advantages of DISC-LawLLM are as follows:

Legal Texts Generic Processing Capability
Legal Thinking and Reasoning
Legal Knowledge Retrieval Capacity

In addition, the contributions include:

High-quality SFT datasets and effective training paradigms
An evaluation framework for Chinese legal LLMs

Check our HOME for more information.

DISC-Law-SFT Dataset

We construct a high-quality supervised fine-tuning dataset, DISC-Law-SFT, with two subsets: DISC-Law-SFT-Pair and DISC-Law-SFT-Triplet. The dataset covers a range of legal tasks, including legal information extraction, judgment prediction, document summarization, and legal question answering, to ensure coverage of diverse scenarios.

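Dataset	Task/Source	Size	Scenario
DISC-LawLLM-SFT-Pair	Legal information extraction	32K	Legal professional assistant
DISC-LawLLM-SFT-Pair	Legal event detection	27K	Legal professional assistant
DISC-LawLLM-SFT-Pair	Legal case classification	20K	Legal professional assistant
DISC-LawLLM-SFT-Pair	Legal judgement prediction	11K	Legal professional assistant
DISC-LawLLM-SFT-Pair	Legal case matching	8K	Legal professional assistant
DISC-LawLLM-SFT-Pair	Legal text summarization	9K	Legal professional assistant
DISC-LawLLM-SFT-Pair	Judicial public opinion summarization	6K	Legal professional assistant
DISC-LawLLM-SFT-Pair	Legal question answering	93K	Legal consultation services
DISC-LawLLM-SFT-Pair	Legal reading comprehension	38K	Judicial examination assistant
DISC-LawLLM-SFT-Pair	Judicial examination	12K	Judicial examination assistant
DISC-LawLLM-SFT-Triple	Legal judgement prediction	16K	Legal professional assistant
DISC-LawLLM-SFT-Triple	Legal question answering	23K	Legal consultation services
General	Alpaca-GPT4	48K	General scenarios
General	Firefly	60K	General scenarios
Total		403K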
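
The sizes in the table can be cross-checked with a few lines of Python. The sketch below hard-codes the per-task sizes (in thousands of samples) from the table above and assumes that any row with a blank Dataset cell belongs to the dataset named in the nearest preceding row. The variable names are illustrative and are not part of any DISC-LawLLM tooling.

```python
# Sizes in thousands of samples, copied from the table above.
# Assumption: rows with a blank "Dataset" cell belong to the dataset named above them.
SIZES_K = {
    "DISC-LawLLM-SFT-Pair": {
        "Legal information extraction": 32,
        "Legal event detection": 27,
        "Legal case classification": 20,
        "Legal judgement prediction": 11,
        "Legal case matching": 8,
        "Legal text summarization": 9,
        "Judicial public opinion summarization": 6,
        "Legal question answering": 93,
        "Legal reading comprehension": 38,
        "Judicial examination": 12,
    },
    "DISC-LawLLM-SFT-Triple": {
        "Legal judgement prediction": 16,
        "Legal question answering": 23,
    },
    "General": {
        "Alpaca-GPT4": 48,
        "Firefly": 60,
    },
}

# Sum the task sizes within each subset, then across subsets.
subset_totals = {name: sum(tasks.values()) for name, tasks in SIZES_K.items()}
grand_total = sum(subset_totals.values())

for name, total in subset_totals.items():
    print(f"{name}: {total}K ({total / grand_total:.1%} of all samples)")
print(f"Total: {grand_total}K")
```

The grand total comes to 403K, which matches the Total row of the table.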

Using DISC-LawLLM through Hugging Face Transformers

>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> from transformers.generation.utils import GenerationConfig
>>> # Load the tokenizer, the model in half precision, and the repository's generation settings.
>>> tokenizer = AutoTokenizer.from_pretrained("ShengbinYue/DISC-LawLLM", use_fast=False, trust_remote_code=True)
>>> model = AutoModelForCausalLM.from_pretrained("ShengbinYue/DISC-LawLLM", device_map="auto", torch_dtype=torch.float16, trust_remote_code=True)
>>> model.generation_config = GenerationConfig.from_pretrained("ShengbinYue/DISC-LawLLM")
>>> # Prompt: "How is the crime of producing and selling counterfeit and shoddy goods sentenced?"
>>> messages = []
>>> messages.append({"role": "user", "content": "生产销售假冒伪劣商品罪如何判刑？"})
>>> response = model.chat(tokenizer, messages)
>>> print(response)
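
For multi-turn use, the conversation history has to be carried along in `messages`. The sketch below shows one way to do that. It assumes that the `model.chat(tokenizer, messages)` call shown above returns the reply as a string and that the model accepts alternating `user` and `assistant` roles, as Baichuan-style chat models do. The helper `ask` and the stand-in `fake_chat` are hypothetical names used only for illustration, so the sketch runs without downloading the weights.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def ask(chat_fn: Callable[[List[Message]], str],
        messages: List[Message],
        question: str) -> str:
    """Append a user turn, obtain the reply, and record it in the history."""
    messages.append({"role": "user", "content": question})
    reply = chat_fn(messages)
    # Assumption: earlier replies are passed back under the "assistant" role.
    messages.append({"role": "assistant", "content": reply})
    return reply


def fake_chat(messages: List[Message]) -> str:
    """Hypothetical stand-in for the real model; it only counts user turns."""
    n_user_turns = sum(1 for m in messages if m["role"] == "user")
    return f"(reply to user turn {n_user_turns})"


history: List[Message] = []
# First question: how the crime of producing and selling counterfeit goods is sentenced.
first = ask(fake_chat, history, "生产销售假冒伪劣商品罪如何判刑？")
# Follow-up question: what if the amount involved is small?
second = ask(fake_chat, history, "如果涉案金额较小呢？")

print(first)
print(second)
print([m["role"] for m in history])
```

With the real model loaded as above, `chat_fn` could be `lambda msgs: model.chat(tokenizer, msgs)`, under the same assumptions.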

Disclaimer

DISC-LawLLM comes with issues and limitations that current LLMs have yet to overcome. While it can provide Chinese legal services across a wide variety of tasks and scenarios, the model should be used for reference purposes only and cannot replace professional lawyers and legal experts. We encourage users of DISC-LawLLM to evaluate the model critically. We do not take responsibility for any issues, risks, or adverse consequences that may arise from the use of DISC-LawLLM.

Citation

If our work is helpful to you, please cite it as follows:

@misc{yue2023disclawllm,
    title={DISC-LawLLM: Fine-tuning Large Language Models for Intelligent Legal Services}, 
    author={Shengbin Yue and Wei Chen and Siyuan Wang and Bingxuan Li and Chenchen Shen and Shujun Liu and Yuxuan Zhou and Yao Xiao and Song Yun and Wei Lin and Xuanjing Huang and Zhongyu Wei},
    year={2023},
    eprint={2309.11325},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

License

DISC-LawLLM is available under the Apache License 2.0.