---
license: apache-2.0
language:
- zh
pipeline_tag: text-generation
---

---
|
<div style="text-align:center">
<h2>📈 CFGPT2: Chinese Financial Assistant with Large Language Model (CFGPT2-7B)</h2>
</div>
|
|
|
## Introduction
|
|
|
We introduce **CFGPT2**, an open-source language model trained in two stages. First, InternLM2 is further pretrained on collected and cleaned Chinese financial text data (CFData-pt), which combines finance-specific data (announcements, finance articles, finance exams, finance news, and finance research papers) with general data (Wikipedia). Second, the model is fine-tuned on knowledge-intensive instruction-tuning data (CFData-sft).

For preliminary evaluation, we use CFBenchmark. CFGPT2 outperforms several baseline models of similar parameter count on both objective and subjective financial tasks.
|
|
|
This repository shares the CFGPT2-7B model:

- [CFGPT2-7B](https://huggingface.co/TongjiFinLab/CFGPT2-7B): The 7B version of our CFGPT2 model.
|
|
|
## How to Use

**1. Prepare the code and the environment**

Clone the [CFGPT](https://github.com/TongjiFinLab/CFGPT.git) repository, create a Python environment, and activate it with the following commands:
|
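```bash
# Clone the repository and enter it
git clone https://github.com/TongjiFinLab/CFGPT.git
cd CFGPT

# Create and activate a conda environment (replace env_name with a name of your choice)
conda create -n env_name python=3.10
conda activate env_name

# Install the Python dependencies
pip install -r requirements.txt
```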
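After installation, it can help to confirm that the packages imported in the usage example are actually visible to the active environment. The following is a minimal sketch using only the standard library. The helper name `missing_packages` is hypothetical, and the list of required packages is an assumption based solely on the imports in the example (`torch` and `transformers`), not on the contents of `requirements.txt`.

```python
import importlib.util
import sys
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumption: only the two packages imported by the usage example are checked.
    required = ["torch", "transformers"]
    print("Python", sys.version.split()[0])
    missing = missing_packages(required)
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

If the script reports missing packages, the environment created above is probably not the active one, or the `pip install` step did not complete.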
|
|
|
**2. Use CFGPT2-7B**
|
|
|
|
|
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Note: the model link above uses the ID TongjiFinLab/CFGPT2-7B; adjust if this ID does not resolve
base_model = 'TongjiFinLab/CFGPT2'
device_map = 'cuda:0'  # place the whole model on the first GPU

# trust_remote_code=True loads the custom model code shipped with the checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map=device_map
)
model = model.eval()

# Start with an empty history; chat() returns the updated history
history = []

# Prompt (in Chinese): "Could you write me a market analysis report, in Chinese, on the
# new energy vehicle sector? Make it fairly long and detailed."
query = "能否给我用中文写一篇有关新能源汽车领域的市场分析报告,写的长一些,细致一些"

response, history = model.chat(
    tokenizer=tokenizer,
    query=query,
    history=history,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.8,
    top_p=0.8,
    repetition_penalty=1.1,
    meta_instruction=''
)
print(response)
```
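The example above makes a single call, but `model.chat` both accepts and returns `history`, which suggests that follow-up questions can be asked by passing the returned history back in. The sketch below shows that threading pattern without loading the model. `run_dialogue` and `fake_chat` are hypothetical names, and the history format (a list of `(query, response)` pairs) is an assumption about the checkpoint's remote code rather than something stated in this README.

```python
from typing import Callable, List, Tuple

# Assumed history format: a list of (query, response) pairs.
History = List[Tuple[str, str]]
ChatFn = Callable[[str, History], Tuple[str, History]]


def run_dialogue(chat_fn: ChatFn, queries: List[str]) -> History:
    """Send each query in order, feeding the returned history into the next call."""
    history: History = []
    for query in queries:
        _response, history = chat_fn(query, history)
    return history


def fake_chat(query: str, history: History) -> Tuple[str, History]:
    """Stand-in for the real model: labels each reply with its turn number."""
    response = f"turn {len(history) + 1}: {query}"
    return response, history + [(query, response)]


if __name__ == "__main__":
    for query, response in run_dialogue(fake_chat, ["first question", "follow-up"]):
        print(query, "->", response)
```

With the model loaded as in the example, `chat_fn` could be a small wrapper such as `lambda q, h: model.chat(tokenizer=tokenizer, query=q, history=h)`, so that each turn sees the earlier exchanges.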
|
|
|
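## Introduction

**CFGPT2** is an open-source language model. It is first further pretrained from InternLM2 on collected and cleaned Chinese financial text data (CFData-pt), which includes finance-specific data (announcements, finance articles, finance exams, finance news, and finance research papers) and general data (Wikipedia), and is then fine-tuned on knowledge-intensive instruction-tuning data (CFData-sft).

We use CFBenchmark for preliminary evaluation. Compared with several baseline models of similar parameter count, CFGPT2 performs better on multiple financial tasks.

In this repository, we share the following CFGPT2-7B model.

- [CFGPT2-7B](https://huggingface.co/TongjiFinLab/CFGPT2-7B): The 7B version of CFGPT2.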
|
|
|
## How to Use

**1. Prepare the code and the environment**

Clone the [CFGPT](https://github.com/TongjiFinLab/CFGPT.git) repository, create a Python environment, and activate it with the following commands:
|
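```bash
# Clone the repository and enter it
git clone https://github.com/TongjiFinLab/CFGPT.git
cd CFGPT

# Create and activate a conda environment (replace env_name with a name of your choice)
conda create -n env_name python=3.10
conda activate env_name

# Install the Python dependencies
pip install -r requirements.txt
```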
|
|
|
**2. Use CFGPT2-7B**
|
|
|
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Note: the model link above uses the ID TongjiFinLab/CFGPT2-7B; adjust if this ID does not resolve
base_model = 'TongjiFinLab/CFGPT2'
device_map = 'cuda:0'  # place the whole model on the first GPU

# trust_remote_code=True loads the custom model code shipped with the checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map=device_map
)
model = model.eval()

# Start with an empty history; chat() returns the updated history
history = []

# Prompt (in Chinese): "Could you write me a market analysis report, in Chinese, on the
# new energy vehicle sector? Make it fairly long and detailed."
query = "能否给我用中文写一篇有关新能源汽车领域的市场分析报告,写的长一些,细致一些"

response, history = model.chat(
    tokenizer=tokenizer,
    query=query,
    history=history,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.8,
    top_p=0.8,
    repetition_penalty=1.1,
    meta_instruction=''
)
print(response)
```
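The `temperature` and `top_p` arguments passed to `model.chat` control how the next token is sampled. As a rough illustration of what these two knobs do, here is a toy sketch in pure Python over a made-up three-token vocabulary. It is not the implementation used by `transformers` or by this model; the function names `temperature_softmax` and `top_p_filter` and the example logits are hypothetical.

```python
import math
from typing import Dict


def temperature_softmax(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Turn logits into probabilities; a temperature below 1 sharpens the distribution."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the maximum for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_p_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the most likely tokens until their cumulative mass reaches top_p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


if __name__ == "__main__":
    # Made-up logits for illustration only.
    logits = {"rise": 2.0, "fall": 1.0, "hold": 0.1}
    probs = temperature_softmax(logits, temperature=0.8)
    print(top_p_filter(probs, top_p=0.8))
```

Lower values of either parameter concentrate sampling on the most likely tokens, which tends to make output more conservative; higher values allow more varied text.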