---
language:
- en
- zh
library_name: transformers
pipeline_tag: visual-question-answering
---

# YuRen BaiChuan 7B (羽人-百川7B)

YuRen BaiChuan 7B is an open-source multimodal large language model based on [baichuan-inc/baichuan-7B](https://huggingface.co/baichuan-inc/baichuan-7B), trained with multi-task supervised fine-tuning and built on top of [Pleisto](https://github.com/pleisto)'s data-centric AI work. YuRen delivers strong performance on multi-turn dialogue, open-domain question answering, role-playing, text generation, text understanding, image understanding, and other tasks.

## Why use yuren-baichuan-7B

- **Multimodal**: Following work such as [LLaVA](https://github.com/haotian-liu/LLaVA) and [mPLUG-Owl](https://arxiv.org/abs/2304.14178), YuRen fuses the LLM's language modality with the vision encoder of the current state-of-the-art CLIP model [laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K](https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K) through a linear projection layer, achieving strong image understanding (a sketch of the idea follows this list).
- **Super High-Quality SFT Dataset**: The base of YuRen's SFT dataset is a subset of Pleisto's own commercial multi-turn dialogue and instruction fine-tuning dataset, in which every instruction has passed multiple rounds of manual and algorithmic quality checks. Following [Orca LLM](https://arxiv.org/abs/2306.02707), we further augmented this subset with GPT-4-based data augmentation. The image-modality data combines the public COCO 2017 dataset, a subset of ScienceQA, a subset of LAION-5B, and the Chinese subset of Pleisto's own diffusion-model training dataset.
- **Business-friendly**: YuRen's training and inference code is open-sourced under the Apache-2.0 license, while the model weights fully inherit the [baichuan-7B model license agreement](https://huggingface.co/baichuan-inc/baichuan-7B/blob/main/baichuan-7B%20%E6%A8%A1%E5%9E%8B%E8%AE%B8%E5%8F%AF%E5%8D%8F%E8%AE%AE.pdf); simply contact the [baichuan team](mailto:opensource@baichuan-inc.com) for free registration to obtain commercial-use authorization.
- **Fully Compatible with ChatML**: YuRen fully supports the [ChatML format](https://github.com/openai/openai-python/blob/main/chatml.md) used by GPT-4, which both minimizes the security risk of prompt injection and achieves GPT-4-level system prompt adherence. (Yes, our training dataset contains a considerable amount of dialogue data with system prompts; an illustrative prompt-building helper follows the text-only example below.)

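The fusion described in the first bullet can be pictured as a single linear layer mapping CLIP patch features into the LLM's embedding space. The sketch below is a minimal, hypothetical illustration (the class and variable names are ours, and the hidden sizes assume CLIP ViT-L/14 and baichuan-7B); it is not the actual yuren-baichuan-7b implementation.

```python
# Hypothetical sketch of the modality fusion described above: CLIP image
# features are mapped into the LLM embedding space by one linear layer.
import torch
import torch.nn as nn


class VisualProjector(nn.Module):
    # Hidden sizes assume CLIP ViT-L/14 (1024) and baichuan-7B (4096).
    def __init__(self, clip_hidden: int = 1024, llm_hidden: int = 4096):
        super().__init__()
        self.proj = nn.Linear(clip_hidden, llm_hidden)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, clip_hidden) -> (batch, num_patches, llm_hidden)
        return self.proj(image_features)


# The projected patch embeddings can then be spliced into the LLM's input
# embedding sequence alongside the text token embeddings.
proj = VisualProjector()
fake_clip_output = torch.randn(1, 256, 1024)
print(proj(fake_clip_output).shape)  # torch.Size([1, 256, 4096])
```
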
## How to Get Started with the Model

### Text-only

The baichuan-7B weights were converted to a LLaMA-compatible format before training, so text-only inference can load the model directly with transformers' `LlamaForCausalLM` and `LlamaTokenizer`:

```python
from transformers import LlamaTokenizer, LlamaForCausalLM
import torch

device = torch.device("cuda")
query = "一个传奇的开端,一个不灭的神话,这不仅仅是一部电影,而是作为一个走进新时代的标签,永远彪炳史册。\nWould you rate the previous review as positive, neutral or negative?\nReturn in json object"
model = LlamaForCausalLM.from_pretrained(
    "pleisto/yuren-baichuan-7b", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = LlamaTokenizer.from_pretrained("pleisto/yuren-baichuan-7b", use_fast=False)
# Build a ChatML prompt and leave the assistant turn open for generation.
system_prompt = "<|im_start|>system\nYou are a helpful AI assistant.<|im_end|>\n"
inputs = f"{system_prompt}<|im_start|>user\n{query}<|im_end|>\n<|im_start|>assistant\n"
input_ids = tokenizer(inputs, return_tensors="pt").input_ids.to(device)
generate_ids = model.generate(
    input_ids,
    max_new_tokens=4096,
    do_sample=True,
    top_p=1.0,
    temperature=0.42,
    eos_token_id=64002,  # id of <|im_end|>, so generation stops at the end of the turn
)
output = tokenizer.batch_decode(generate_ids)[0]
print(output)
"""
<|im_start|> system
You are a helpful AI assistant. <|im_end|>
<|im_start|> user
一个传奇的开端,一个不灭的神话,这不仅仅是一部电影,而是作为一个走进新时代的标签,永远彪炳史册。
Would you rate the previous review as positive, neutral or negative?
Return in json object <|im_end|>
<|im_start|> assistant
{
  "rating": "positive"
} <|im_end|>
"""
```
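
For multi-turn chat, the same ChatML layout simply repeats: every turn is wrapped in `<|im_start|>role ... <|im_end|>` markers and the final assistant turn is left open. A small helper like the one below can build the prompt; it is an illustrative sketch, not part of the yuren codebase.

```python
def to_chatml(messages: list[dict[str, str]]) -> str:
    """Render a list of {role, content} messages into a ChatML prompt string."""
    prompt = ""
    for m in messages:
        prompt += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    # Leave the assistant turn open so the model continues from here.
    return prompt + "<|im_start|>assistant\n"


print(to_chatml([
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "你好,请介绍一下你自己。"},
]))
```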

### Multimodal

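The image-understanding demo lives in the companion GitHub repository and is launched through [Rye](https://rye-up.com), which manages the Python environment: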
77 |
+
```bash
|
78 |
+
git clone https://github.com/pleisto/yuren-baichuan-7b.git
|
79 |
+
curl -sSf https://rye-up.com/get | bash
|
80 |
+
source "$HOME/.rye/env"
|
81 |
+
rye sync
|
82 |
+
rye run webui "pleisto/yuren-baichuan-7b" # --load_8bit True --server_name "0.0.0.0" --share True
|
83 |
+
```
|
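
The commented-out flags are optional: judging by their names, `--load_8bit True` loads the weights in 8-bit to reduce GPU memory, while `--server_name "0.0.0.0"` and `--share True` expose the web UI beyond localhost.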

## Bias, Risks, and Limitations

yuren-baichuan-7B can produce factually incorrect output and should not be relied on to produce factually accurate information.

## License

- The inference code is released under the [Apache-2.0](https://github.com/pleisto/yuren-baichuan-7b/blob/main/LICENSE) license; its copyright belongs to Pleisto.
- The model weights were trained by Pleisto and remain subject to the upstream [baichuan-7B license](https://huggingface.co/baichuan-inc/baichuan-7B/blob/main/baichuan-7B%20%E6%A8%A1%E5%9E%8B%E8%AE%B8%E5%8F%AF%E5%8D%8F%E8%AE%AE.pdf).