0xDing committed on
Commit
af8a3ee
1 Parent(s): b0263c5

Create README.md

Files changed (1)
  1. README.md +98 -0
README.md ADDED
@@ -0,0 +1,98 @@
+ ---
+ language:
+ - en
+ - zh
+ library_name: transformers
+ pipeline_tag: visual-question-answering
+ ---
+
+ # 羽人-百川7B (YuRen BaiChuan 7B)
+
+ YuRen BaiChuan 7B is an open-source multimodal large language model built on [baichuan-inc/baichuan-7B](https://huggingface.co/baichuan-inc/baichuan-7B) through multi-task supervised fine-tuning. It builds on [Pleisto](https://github.com/pleisto)'s data-centric AI work. YuRen performs well on multi-turn dialogue, open-domain question answering, role-playing, text generation, text understanding, image understanding, and other tasks.
+
+ ## Why use yuren-baichuan-7B
+
+ - **Multimodal**: Following related work such as [LLaVA](https://github.com/haotian-liu/LLaVA) and [mPLUG-Owl](https://arxiv.org/abs/2304.14178), YuRen uses a linear projection layer to connect the LLM's language modality to the visual encoder of [laion/clip-vit-l-14-datacomp.xl-s13b-b90k](https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K), a state-of-the-art CLIP model at the time of writing. This fusion is what gives the model its image understanding ability.
+ - **Super High-Quality SFT Dataset**: The base data of YuRen's SFT dataset is a subset of Pleisto's own commercial multi-turn dialogue and instruction fine-tuning dataset. Every instruction in that dataset has passed multiple rounds of manual and algorithmic quality checks. Following the work of [Orca LLM](https://arxiv.org/abs/2306.02707), we then applied GPT-4-based data augmentation to this subset. The image-modality data combines the public dataset coco2017, a subset of ScienceQA, a subset of laion5b, and the Chinese subset of Pleisto's own diffusion model training dataset.
+ - **Business-friendly**: YuRen's training and inference code is open-sourced under the Apache-2.0 license, and the license for the model weights is fully inherited from the [baichuan-7B model license agreement](https://huggingface.co/baichuan-inc/baichuan-7B/blob/main/baichuan-7B%20%E6%A8%A1%E5%9E%8B%E8%AE%B8%E5%8F%AF%E5%8D%8F%E8%AE%AE.pdf). Commercial use only requires a free registration with the [baichuan team](mailto:opensource@baichuan-inc.com).
+ - **Fully Compatible with ChatML**: YuRen is fully compatible with the [ChatML format](https://github.com/openai/openai-python/blob/main/chatml.md) also used by GPT-4. This minimizes the security risks of prompt injection and, as with GPT-4, lets the model follow system prompts closely. (Yes, our training dataset contains a considerable amount of dialogue data with system prompts.)
+
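The Multimodal point above describes connecting a CLIP visual encoder to the LLM through a linear projection layer. The following is a minimal, dependency-free sketch of that idea only. The toy dimensions, the random initialization, the function names, and the choice to place image embeddings before the text embeddings are all illustrative assumptions, not the repository's actual implementation.

```python
import random


def make_projection(in_dim, out_dim, seed=0):
    """Create a toy linear layer: an (out_dim x in_dim) weight matrix and a zero bias."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def project(patch_features, weight, bias):
    """Map each visual patch feature into the LLM embedding space: y = W x + b."""
    return [
        [sum(w * x for w, x in zip(row, feature)) + b for row, b in zip(weight, bias)]
        for feature in patch_features
    ]


def build_input_embeddings(image_embeds, text_embeds):
    """Concatenate projected image embeddings and text embeddings into one sequence.

    Placing the image first is an assumption made for this sketch.
    """
    return image_embeds + text_embeds


# Toy sizes standing in for the real vision-feature and LLM hidden sizes.
VISION_DIM, LLM_DIM = 4, 6

# Three fake patch features from the vision encoder, five fake text token embeddings.
patches = [[0.1 * i + j for j in range(VISION_DIM)] for i in range(3)]
text_embeds = [[0.0] * LLM_DIM for _ in range(5)]

weight, bias = make_projection(VISION_DIM, LLM_DIM)
image_embeds = project(patches, weight, bias)
sequence = build_input_embeddings(image_embeds, text_embeds)

print(len(sequence), len(sequence[0]))  # 8 6
```

After projection, every visual patch has the same width as a text token embedding, so the language model can consume both in one sequence. In a setup like this the projection weights would be learned during fine-tuning rather than drawn at random.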
+ ## How to Get Started with the Model
+
+ ### Text-only
+
+ The Baichuan-7B weights were converted to a LLaMA-compatible format before yuren-baichuan-7B was trained. For text-only inference, the model can therefore be loaded directly with transformers' `LlamaForCausalLM` and `LlamaTokenizer`.
+
+ ```python
+ from transformers import LlamaTokenizer, LlamaForCausalLM
+ import torch
+
+ device = torch.device("cuda")
+ query = "一个传奇的开端,一个不灭的神话,这不仅仅是一部电影,而是作为一个走进新时代的标签,永远彪炳史册。\nWould you rate the previous review as positive, neutral or negative?\nReturn in json object"
+ # The weights are stored in a LLaMA-compatible format, so the stock LLaMA classes can load them.
+ model = LlamaForCausalLM.from_pretrained(
+     "pleisto/yuren-baichuan-7b", torch_dtype=torch.bfloat16, device_map="auto"
+ )
+ tokenizer = LlamaTokenizer.from_pretrained("pleisto/yuren-baichuan-7b", use_fast=False)
+ # Build a ChatML prompt: a system turn, the user turn, then an open assistant turn.
+ system_prompt = "<|im_start|>system\nYou are a helpful AI assistant.<|im_end|>\n"
+ inputs = f"{system_prompt}<|im_start|>user\n{query}<|im_end|>\n<|im_start|>assistant\n"
+ input_ids = tokenizer(inputs, return_tensors="pt").input_ids.to(device)
+ generate_ids = model.generate(
+     input_ids,
+     max_new_tokens=4096,
+     do_sample=True,
+     top_p=1.0,
+     temperature=0.42,
+     eos_token_id=64002,  # token id at which generation stops
+ )
+ # The decoded text contains the prompt followed by the generated reply.
+ output = tokenizer.batch_decode(generate_ids)[0]
+ print(output)
+ # Sample output:
+ """
+ <|im_start|> system
+ You are a helpful AI assistant. <|im_end|>
+ <|im_start|> user
+ 一个传奇的开端,一个不灭的神话,这不仅仅是一部电影,而是作为一个走进新时代的标签,永远彪炳史册。
+ Would you rate the previous review as positive, neutral or negative?
+ Return in json object <|im_end|>
+ <|im_start|> assistant
+ {
+ "rating": "positive"
+ } <|im_end|>
+ """
+ ```
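The prompt in the example above is assembled by hand. For multi-turn conversations, a small helper can build the same ChatML layout from a list of messages and pull the assistant's reply back out of the decoded text. This is only a sketch: `build_chatml_prompt` and `extract_reply` are hypothetical helper names, not part of transformers or the YuRen repository, and the parsing assumes the decoded text resembles the sample output above, where special tokens may be surrounded by spaces.

```python
import json

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def build_chatml_prompt(messages):
    """Render [{'role': ..., 'content': ...}, ...] as a ChatML prompt.

    The prompt ends with an open assistant turn so the model writes the reply.
    """
    parts = [f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages]
    parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


def extract_reply(decoded):
    """Return the text of the last assistant turn in a decoded generation."""
    # Everything after the final <|im_start|> belongs to the last turn.
    last_turn = decoded.rsplit(IM_START, 1)[-1]
    # Cut at the end marker, if the model produced one.
    last_turn = last_turn.split(IM_END, 1)[0].strip()
    # The first line of a turn is its role label.
    role, _, body = last_turn.partition("\n")
    if role.strip() != "assistant":
        raise ValueError(f"last turn is not an assistant turn: {role!r}")
    return body.strip()


messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Return a rating in json object"},
]
prompt = build_chatml_prompt(messages)

# A decoded generation shaped like the sample output above.
sample = (
    "<|im_start|> system\nYou are a helpful AI assistant. <|im_end|>\n"
    "<|im_start|> user\nReturn a rating in json object <|im_end|>\n"
    '<|im_start|> assistant\n{\n"rating": "positive"\n} <|im_end|>'
)
reply = extract_reply(sample)
print(json.loads(reply)["rating"])  # positive
```

The string returned by `build_chatml_prompt` can be passed to the tokenizer in place of the hand-written `inputs` string. Parsing the reply with `json.loads` only works when the model actually returns valid JSON, so production code should handle a `json.JSONDecodeError`.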
+
+
+ ### Multimodal
+
+ ```bash
+ git clone https://github.com/pleisto/yuren-baichuan-7b.git
+ cd yuren-baichuan-7b
+ # Install the rye package manager and load it into the current shell
+ curl -sSf https://rye-up.com/get | bash
+ source "$HOME/.rye/env"
+ # Install the project's dependencies
+ rye sync
+ # Launch the web UI; the flags after "#" are optional
+ rye run webui "pleisto/yuren-baichuan-7b" # --load_8bit True --server_name "0.0.0.0" --share True
+ ```
+
+
+ ## Bias, Risks, and Limitations
+
+ yuren-baichuan-7B can produce factually incorrect output and should not be relied on for factually accurate information.
+
+ ## License
+
+ - The inference code is released under the [Apache-2.0](https://github.com/pleisto/yuren-baichuan-7b/blob/main/LICENSE) license, and the copyright belongs to Pleisto.
+ - The model weights were trained by Pleisto and remain subject to the upstream [Baichuan-7B](https://huggingface.co/baichuan-inc/baichuan-7B/blob/main/baichuan-7B%20%E6%A8%A1%E5%9E%8B%E8%AE%B8%E5%8F%AF%E5%8D%8F%E8%AE%AE.pdf) license.