metadata

license: apache-2.0
language:
  - zh
pipeline_tag: text-generation

📈 CFGPT2: Chinese Financial Assistant with Large Language Model (CFGPT2-7B)

Introduction

We introduce CFGPT2, an open-source language model trained in two stages. First, InternLM2 is further pretrained on collected and cleaned Chinese financial text data (CFData-pt), which includes financial domain-specific data (announcements, finance articles, finance exams, finance news, finance research papers) and general data (Wikipedia). Second, the model is fine-tuned on knowledge-intensive instruction-tuning data (CFData-sft). For preliminary evaluation, we use CFBenchmark. CFGPT2 outperforms several baseline models of similar parameter count on both objective and subjective financial tasks.

In this repository, we share the following model:

CFGPT2-7B: The 7B version of our CFGPT2 model.

How to Use

1. Prepare the code and the environment

Clone the CFGPT repository, create a Python environment, and activate it with the following commands:

git clone https://github.com/TongjiFinLab/CFGPT.git
cd CFGPT
conda create -n env_name python=3.10
conda activate env_name
pip install -r requirements.txt
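
Before loading the model, it can help to confirm that the environment provides what the usage step imports. The following is a minimal sketch and not part of the CFGPT repository: the helper names `check_environment` and `cuda_available` are hypothetical, and the sketch assumes that `torch` and `transformers` are the packages worth checking because they are the ones the usage example imports.

```python
import importlib.util


def check_environment(packages=("torch", "transformers")):
    """Return a dict mapping each top-level package name to whether it is importable."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


def cuda_available():
    """Return True only if torch is installed and reports a usable CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch  # imported lazily so the check also works without torch installed
    return torch.cuda.is_available()


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
    print(f"CUDA available: {cuda_available()}")
```

If a package is reported as missing, rerunning `pip install -r requirements.txt` inside the activated environment is the first thing to try.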

2. Use CFGPT2-7B

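import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository ID of this model on the Hugging Face Hub
base_model = 'TongjiFinLab/CFGPT2-7B'
# Place the whole model on the first GPU
device_map = 'cuda:0'

# trust_remote_code=True loads the model's custom code, which provides model.chat
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map=device_map,
)
model = model.eval()  # switch to inference mode

# Conversation history; model.chat returns the updated history
history = []

# English: "Could you write me a market analysis report on the new energy
# vehicle sector in Chinese? Make it longer and more detailed."
query = "能否给我用中文写一篇有关新能源汽车领域的市场分析报告，写的长一些，细致一些"

response, history = model.chat(
    tokenizer=tokenizer,
    query=query,
    history=history,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.8,
    top_p=0.8,
    repetition_penalty=1.1,
    meta_instruction='',
)
print(response)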
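
Because `model.chat` returns the updated `history` together with the response, follow-up questions can reuse it. The sketch below shows one way to thread the history through several turns. The helper `run_dialogue` and the `EchoModel` stand-in are hypothetical names introduced for illustration. The stand-in only mimics the `(response, history)` return shape used above, under the assumption that the history is a list of `(query, response)` pairs, so the loop can be tried without a GPU.

```python
def run_dialogue(model, tokenizer, queries, **generation_kwargs):
    """Ask each query in turn, passing the running history back to model.chat."""
    history = []
    responses = []
    for query in queries:
        response, history = model.chat(
            tokenizer=tokenizer,
            query=query,
            history=history,
            **generation_kwargs,
        )
        responses.append(response)
    return responses, history


class EchoModel:
    """Hypothetical stand-in for the real model, used only to exercise the loop."""

    def chat(self, tokenizer, query, history=None, **kwargs):
        # Assumption: history is a list of (query, response) pairs.
        history = list(history or [])
        response = f"[turn {len(history) + 1}] {query}"
        history.append((query, response))
        return response, history


if __name__ == "__main__":
    replies, final_history = run_dialogue(
        EchoModel(),
        tokenizer=None,  # the stand-in ignores the tokenizer
        queries=["What is a convertible bond?", "How is it priced?"],
        max_new_tokens=256,
    )
    for reply in replies:
        print(reply)
```

With the real model, pass the loaded `model` and `tokenizer` in place of `EchoModel()` and `None`, along with the sampling arguments you want to keep from the single-turn call.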
