	---
	license: apache-2.0
	language:
	- zh
	pipeline_tag: text-generation
	---
	---
	<div style="text-align:center">
	<h2>📈 CFGPT2: Chinese Financial Assistant with Large Language Model (CFGPT2-7B)</h2>
	</div>

	## Introduction

	We introduce CFGPT2, an open-source language model trained in two stages. First, we continue pretraining InternLM2 on collected and cleaned Chinese financial text (CFData-pt), which combines finance-specific data (announcements, finance articles, finance exams, finance news, and finance research papers) with general data (Wikipedia). Second, we fine-tune the model on knowledge-intensive instruction-tuning data (CFData-sft).
	For preliminary evaluation, we use CFBenchmark.
	The model outperforms several baseline models of similar parameter count on both objective and subjective financial tasks.

	In this repository, we share CFGPT2-7B.

	- [CFGPT2-7B](https://huggingface.co/TongjiFinLab/CFGPT2-7B): The 7B version of our CFGPT2 model.

	## How to Use

	1. Prepare the code and the environment

	Clone the [CFGPT](https://github.com/TongjiFinLab/CFGPT.git) repository, create a Python environment, and activate it with the following commands:
	```bash
	git clone https://github.com/TongjiFinLab/CFGPT.git
	cd CFGPT
	conda create -n env_name python=3.10
	conda activate env_name
	pip install -r requirements.txt
	```
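After installing, it can help to confirm that the environment matches what the usage snippet expects. The following is a minimal sketch, not part of the CFGPT repository: the helper name `check_environment` is hypothetical, the package list (`torch`, `transformers`) is taken from the imports used in this README rather than from `requirements.txt`, and the minimum Python version mirrors the `python=3.10` argument in the conda command.

```python
import importlib.util
import sys


def check_environment(packages=("torch", "transformers"), min_python=(3, 10)):
    """Report whether the interpreter and the listed top-level packages look usable.

    Hypothetical helper: the defaults are assumptions based on this README,
    not on the contents of requirements.txt.
    """
    report = {"python_ok": sys.version_info[:2] >= tuple(min_python)}
    for name in packages:
        # find_spec returns None when a top-level package cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'ok' if ok else 'not satisfied'}")
```

Running the script inside the activated environment prints one line per check. This only verifies that the packages can be located, not that their versions match `requirements.txt`.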

	2. Use CFGPT2-7B


	```python
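	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer

	# Hugging Face repo id, matching the model link in the Introduction.
	base_model = 'TongjiFinLab/CFGPT2-7B'
	device_map = 'cuda:0'  # place the whole model on the first GPU

	# trust_remote_code=True is needed because the model ships its own modeling code.
	tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True, use_fast=True)
	model = AutoModelForCausalLM.from_pretrained(
	    base_model,
	    trust_remote_code=True,
	    torch_dtype=torch.bfloat16,
	    device_map=device_map,
	)
	model = model.eval()

	history = []

	# Prompt (Chinese): "Could you write me a market analysis report in Chinese on the
	# new energy vehicle sector? Make it fairly long and detailed."
	query = "能否给我用中文写一篇有关新能源汽车领域的市场分析报告，写的长一些，细致一些"

	response, history = model.chat(
	    tokenizer=tokenizer,
	    query=query,
	    history=history,
	    max_new_tokens=1024,
	    do_sample=True,
	    temperature=0.8,
	    top_p=0.8,
	    repetition_penalty=1.1,
	    meta_instruction='',
	)
	print(response)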
	```
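Because `model.chat` in the snippet above returns both the response and an updated `history`, a multi-turn conversation can be driven by passing that history into the next call. The sketch below illustrates the loop under stated assumptions: `run_dialogue` and `EchoModel` are hypothetical names introduced here, and the stub's history format (a list of `(query, response)` pairs) is an assumption for the dry run, not a documented guarantee of the model's remote code.

```python
def run_dialogue(model, tokenizer, queries, **generation_kwargs):
    """Send queries one at a time, feeding the returned history into the next call."""
    history = []
    responses = []
    for query in queries:
        # Same call shape as the single-turn example in this README.
        response, history = model.chat(
            tokenizer=tokenizer,
            query=query,
            history=history,
            **generation_kwargs,
        )
        responses.append(response)
    return responses, history


class EchoModel:
    """Hypothetical stand-in for a loaded model, used only to dry-run the loop."""

    def chat(self, tokenizer, query, history, **kwargs):
        response = f"echo: {query}"
        # Assumed history format: list of (query, response) pairs.
        return response, history + [(query, response)]


if __name__ == "__main__":
    replies, final_history = run_dialogue(
        EchoModel(), None, ["first question", "follow-up"], max_new_tokens=64
    )
    for reply in replies:
        print(reply)
```

With the real model, `EchoModel()` and `None` would be replaced by the `model` and `tokenizer` objects loaded earlier, and the sampling arguments from the single-turn call can be passed as keyword arguments.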
