---
language:
- en
- zh
library_name: transformers
pipeline_tag: visual-question-answering
---

# YuRen BaiChuan 7B (羽人-百川7B)

YuRen BaiChuan 7B is an open-source multimodal large language model based on [baichuan-inc/baichuan-7B](https://huggingface.co/baichuan-inc/baichuan-7B), trained with multi-task supervised fine-tuning and built on top of [Pleisto](https://github.com/pleisto)'s data-centric AI work. YuRen delivers strong performance on multi-turn dialogue, open-domain question answering, role-playing, text generation, text understanding, image understanding, and other tasks.

## Why use yuren-baichuan-7B

- **Multimodal**: Following work such as [LLaVA](https://github.com/haotian-liu/LLaVA) and [mPLUG-Owl](https://arxiv.org/abs/2304.14178), YuRen fuses the LLM's language modality with the vision encoder of the current state-of-the-art CLIP model [laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K](https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K) through a linear projection layer, achieving strong image understanding (a sketch of the idea follows this list).
- **Super High-Quality SFT Dataset**: The base of YuRen's SFT dataset is a subset of Pleisto's own commercial multi-turn dialogue and instruction fine-tuning dataset, in which every instruction has passed multiple rounds of manual and algorithmic quality checks. Following [Orca LLM](https://arxiv.org/abs/2306.02707), we further augmented this subset with GPT-4-based data augmentation. The image-modality data combines the public COCO 2017 dataset, a subset of ScienceQA, a subset of LAION-5B, and the Chinese subset of Pleisto's own diffusion-model training dataset.
- **Business-friendly**: YuRen's training and inference code is open-sourced under the Apache-2.0 license, while the model weights fully inherit the [baichuan-7B model license agreement](https://huggingface.co/baichuan-inc/baichuan-7B/blob/main/baichuan-7B%20%E6%A8%A1%E5%9E%8B%E8%AE%B8%E5%8F%AF%E5%8D%8F%E8%AE%AE.pdf); simply contact the [baichuan team](mailto:opensource@baichuan-inc.com) for free registration to obtain commercial-use authorization.
- **Fully Compatible with ChatML**: YuRen fully supports the [ChatML format](https://github.com/openai/openai-python/blob/main/chatml.md) used by GPT-4, which both minimizes the security risk of prompt injection and achieves GPT-4-level system prompt adherence. (Yes, our training dataset contains a considerable amount of dialogue data with system prompts; an illustrative prompt-building helper follows the text-only example below.)

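The fusion described in the first bullet can be pictured as a single linear layer mapping CLIP patch features into the LLM's embedding space. The sketch below is a minimal, hypothetical illustration (the class and variable names are ours, and the hidden sizes assume CLIP ViT-L/14 and baichuan-7B); it is not the actual yuren-baichuan-7b implementation.

```python
# Hypothetical sketch of the modality fusion described above: CLIP image
# features are mapped into the LLM embedding space by one linear layer.
import torch
import torch.nn as nn


class VisualProjector(nn.Module):
    # Hidden sizes assume CLIP ViT-L/14 (1024) and baichuan-7B (4096).
    def __init__(self, clip_hidden: int = 1024, llm_hidden: int = 4096):
        super().__init__()
        self.proj = nn.Linear(clip_hidden, llm_hidden)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, clip_hidden) -> (batch, num_patches, llm_hidden)
        return self.proj(image_features)


# The projected patch embeddings can then be spliced into the LLM's input
# embedding sequence alongside the text token embeddings.
proj = VisualProjector()
fake_clip_output = torch.randn(1, 256, 1024)
print(proj(fake_clip_output).shape)  # torch.Size([1, 256, 4096])
```
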
## How to Get Started with the Model

### Text-only

The baichuan-7B weights were converted to a LLaMA-compatible format before training, so text-only inference can load the model directly with transformers' `LlamaForCausalLM` and `LlamaTokenizer`:

```python
from transformers import LlamaTokenizer, LlamaForCausalLM
import torch

device = torch.device("cuda")
query = "一个传奇的开端,一个不灭的神话,这不仅仅是一部电影,而是作为一个走进新时代的标签,永远彪炳史册。\nWould you rate the previous review as positive, neutral or negative?\nReturn in json object"
model = LlamaForCausalLM.from_pretrained(
    "pleisto/yuren-baichuan-7b", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = LlamaTokenizer.from_pretrained("pleisto/yuren-baichuan-7b", use_fast=False)
# Build a ChatML prompt and leave the assistant turn open for generation.
system_prompt = "<|im_start|>system\nYou are a helpful AI assistant.<|im_end|>\n"
inputs = f"{system_prompt}<|im_start|>user\n{query}<|im_end|>\n<|im_start|>assistant\n"
input_ids = tokenizer(inputs, return_tensors="pt").input_ids.to(device)
generate_ids = model.generate(
    input_ids,
    max_new_tokens=4096,
    do_sample=True,
    top_p=1.0,
    temperature=0.42,
    eos_token_id=64002,  # id of <|im_end|>, so generation stops at the end of the turn
)
output = tokenizer.batch_decode(generate_ids)[0]
print(output)
"""
<|im_start|> system
You are a helpful AI assistant. <|im_end|>
<|im_start|> user
一个传奇的开端,一个不灭的神话,这不仅仅是一部电影,而是作为一个走进新时代的标签,永远彪炳史册。
Would you rate the previous review as positive, neutral or negative?
Return in json object <|im_end|>
<|im_start|> assistant
{
  "rating": "positive"
} <|im_end|>
"""
```
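
For multi-turn chat, the same ChatML layout simply repeats: every turn is wrapped in `<|im_start|>role ... <|im_end|>` markers and the final assistant turn is left open. A small helper like the one below can build the prompt; it is an illustrative sketch, not part of the yuren codebase.

```python
def to_chatml(messages: list[dict[str, str]]) -> str:
    """Render a list of {role, content} messages into a ChatML prompt string."""
    prompt = ""
    for m in messages:
        prompt += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    # Leave the assistant turn open so the model continues from here.
    return prompt + "<|im_start|>assistant\n"


print(to_chatml([
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "你好,请介绍一下你自己。"},
]))
```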

### Multimodal

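The image-understanding demo lives in the companion GitHub repository and is launched through [Rye](https://rye-up.com), which manages the Python environment: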
77 |
+
```bash
|
78 |
+
git clone https://github.com/pleisto/yuren-baichuan-7b.git
|
79 |
+
curl -sSf https://rye-up.com/get | bash
|
80 |
+
source "$HOME/.rye/env"
|
81 |
+
rye sync
|
82 |
+
rye run webui "pleisto/yuren-baichuan-7b" # --load_8bit True --server_name "0.0.0.0" --share True
|
83 |
+
```
|
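
The commented-out flags are optional: judging by their names, `--load_8bit True` loads the weights in 8-bit to reduce GPU memory, while `--server_name "0.0.0.0"` and `--share True` expose the web UI beyond localhost.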

## Bias, Risks, and Limitations

yuren-baichuan-7B can produce factually incorrect output and should not be relied on to produce factually accurate information.

## License

- The inference code is released under the [Apache-2.0](https://github.com/pleisto/yuren-baichuan-7b/blob/main/LICENSE) license; its copyright belongs to Pleisto.
- The model weights were trained by Pleisto and remain subject to the upstream [baichuan-7B license](https://huggingface.co/baichuan-inc/baichuan-7B/blob/main/baichuan-7B%20%E6%A8%A1%E5%9E%8B%E8%AE%B8%E5%8F%AF%E5%8D%8F%E8%AE%AE.pdf).