metadata
license: mit
datasets:
  - OpenAssistant/oasst1
language:
  - en
tags:
  - sft
pipeline_tag: text-generation
widget:
  - text: >-
      <|prompter|>What is a meme, and what's the history behind this
      word?<|endoftext|><|assistant|>
  - text: <|prompter|>What's the Earth total population<|endoftext|><|assistant|>
  - text: >-
      <|prompter|>Write a story about future of AI
      development<|endoftext|><|assistant|>

Load Merged Model (Recommended, identical configuration to a fine-tuned model)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

repo_id = "jordiclive/falcon-40b-lora-sft-stage2-1.1k"
dtype = torch.bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"  # used by the inference code below
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=dtype,
    trust_remote_code=True,
)

Model Details

  • Developed as part of the OpenAssistant Project
  • Model type: LoRA (PEFT)
  • Languages: English, German, Spanish, French (with limited capabilities in Italian, Portuguese, Polish, Dutch, Romanian, Czech, and Swedish)
  • Finetuned from: tiiuae/falcon-40b
  • Architecture: Causal decoder-only transformer language model
  • Weights & Biases: Training log1 Training log2

LoRA Adapter for Falcon 40B trained on oasst-top1

This repo contains a Falcon 40B model fine-tuned with LoRA, together with the low-rank adapter itself. Both were fit on datasets that are part of the OpenAssistant project.

This version of the weights was trained with the following hyperparameters:

SFT1

  • Epochs: 2
  • Batch size: 128
  • Max Length: 2048
  • Learning rate: 1e-4
  • Lora r: 64
  • Lora Alpha: 16
  • Lora target modules: ["dense_4h_to_h", "dense", "query_key_value", "dense_h_to_4h"]
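To make the LoRA settings above more concrete, here is a minimal sketch of what `r` and `alpha` imply. It assumes the standard LoRA formulation, in which the low-rank update is scaled by `alpha / r` and each adapted linear layer gains `r * (d_in + d_out)` trainable parameters. The `LoraHyperparams` class, the `lora_param_count` helper, and the layer shape in the example are hypothetical and used only for illustration; they are not taken from the training code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraHyperparams:
    """Hypothetical container for the SFT1 LoRA settings listed above."""

    r: int
    alpha: int
    target_modules: tuple

    @property
    def scaling(self) -> float:
        # Standard LoRA multiplies the low-rank update B @ A by alpha / r.
        return self.alpha / self.r


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    # A has shape (r, d_in) and B has shape (d_out, r).
    return r * (d_in + d_out)


sft1 = LoraHyperparams(
    r=64,
    alpha=16,
    target_modules=("dense_4h_to_h", "dense", "query_key_value", "dense_h_to_4h"),
)
print(sft1.scaling)  # 0.25

# Hypothetical layer shape, for illustration only (not Falcon's real dimensions).
print(lora_param_count(d_in=1024, d_out=4096, r=sft1.r))  # 327680
```

With `r` larger than `alpha`, the scaling factor is below 1, so the adapter's contribution is damped relative to an unscaled update.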

SFT2

  • Epochs: 10
  • Batch size: 128

The model was trained with flash attention, gradient checkpointing, and DeepSpeed stage 3 on 8 x A100 80GB GPUs.
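As a rough illustration of how a batch size of 128 could map onto 8 GPUs, the sketch below derives the gradient accumulation steps and assembles a DeepSpeed-style config dictionary. It assumes that 128 is the global batch size, that the per-GPU micro-batch is 2, and that training ran in bf16; none of these are stated in this card. The actual training config may differ.

```python
NUM_GPUS = 8             # from the text: 8 x A100 80GB
GLOBAL_BATCH_SIZE = 128  # from the hyperparameter list; assumed to be the global batch size
MICRO_BATCH_PER_GPU = 2  # hypothetical value, not taken from the training run


def gradient_accumulation_steps(global_batch: int, num_gpus: int, micro_batch: int) -> int:
    """Return how many micro-batches each GPU accumulates per optimizer step."""
    per_step = num_gpus * micro_batch
    if global_batch % per_step != 0:
        raise ValueError("global batch must be divisible by num_gpus * micro_batch")
    return global_batch // per_step


accum = gradient_accumulation_steps(GLOBAL_BATCH_SIZE, NUM_GPUS, MICRO_BATCH_PER_GPU)

# Minimal ZeRO stage 3 config fragment; bf16 is an assumption.
ds_config = {
    "train_batch_size": GLOBAL_BATCH_SIZE,
    "train_micro_batch_size_per_gpu": MICRO_BATCH_PER_GPU,
    "gradient_accumulation_steps": accum,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 3},
}
print(accum)  # 8
```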

Datasets

SFT1

  - oa_leet10k:
      val_split: 0.05
      max_val_set: 250
  - cmu_wiki_qa:
      val_split: 0.05
  - joke:
      val_split: 0.05
  - webgpt:
      val_split: 0.05
      max_val_set: 250
  - alpaca_gpt4:
      val_split: 0.025
      max_val_set: 250
  - gpteacher_roleplay:
      val_split: 0.05
  - wizardlm_70k:
      val_split: 0.05
      max_val_set: 500
  - poem_instructions:
      val_split: 0.025
  - tell_a_joke:
      val_split: 0.05
      max_val_set: 250
  - gpt4all:
      val_split: 0.01
      max_val_set: 1000
  - minimath:
      val_split: 0.05
  - humaneval_mbpp_codegen_qa:
      val_split: 0.05
  - humaneval_mbpp_testgen_qa:
      val_split: 0.05
  - dolly15k:
      val_split: 0.05
      max_val_set: 300
  - recipes:
      val_split: 0.05
  - code_alpaca:
      val_split: 0.05
      max_val_set: 250
  - vicuna:
      fraction: 0.5
      val_split: 0.025
      max_val_set: 250
  - oa_wiki_qa_bart_10000row:
      val_split: 0.05
      max_val_set: 250
  - grade_school_math_instructions:
      val_split: 0.05

SFT2

- oasst_export:
    lang: "bg,ca,cs,da,de,en,es,fr,hr,hu,it,nl,pl,pt,ro,ru,sl,sr,sv,uk" # sft-8.0
    input_file_path: 2023-05-06_OASST_labels.jsonl.gz
    val_split: 0.05
    top_k: 1
- lima:
    val_split: 0.05
    max_val_set: 50

Prompting

Two special tokens are used to mark the beginning of user and assistant turns: <|prompter|> and <|assistant|>. Each turn ends with a <|endoftext|> token.

Input prompt example:

<|prompter|>What is a meme, and what's the history behind this word?<|endoftext|><|assistant|>

The input ends with the <|assistant|> token to signal that the model should start generating the assistant reply.
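The format described above can be sketched as a small prompt builder. The single-turn case reproduces the example input exactly. The multi-turn case is an extrapolation from the rule that every turn ends with <|endoftext|>, and `build_prompt` is a hypothetical helper rather than part of this repo.

```python
PROMPTER = "<|prompter|>"
ASSISTANT = "<|assistant|>"
EOS = "<|endoftext|>"


def build_prompt(turns):
    """Build a prompt from (role, text) pairs, where role is "prompter" or "assistant"."""
    markers = {"prompter": PROMPTER, "assistant": ASSISTANT}
    parts = []
    for role, text in turns:
        # Each completed turn is wrapped as <marker>text<|endoftext|>.
        parts.append(f"{markers[role]}{text}{EOS}")
    # End with the assistant marker so the model generates the next reply.
    parts.append(ASSISTANT)
    return "".join(parts)


single = build_prompt(
    [("prompter", "What is a meme, and what's the history behind this word?")]
)
print(single)
```

For a conversation with history, pass the earlier turns in order, for example `build_prompt([("prompter", "Hi"), ("assistant", "Hello!"), ("prompter", "Tell me a joke")])`.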

Example Inference code (Prompt Template)

model = model.to(device)
if dtype == torch.float16:
    model = model.half()


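# Choose generation parameters.
# Note: temperature, top_p, and top_k only take effect when sampling is enabled
# (do_sample=True); as written, generation uses beam search with 4 beams.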

generation_config = GenerationConfig(
    temperature=0.1,
    top_p=0.75,
    top_k=40,
    num_beams=4,
)


def format_system_prompt(prompt, eos_token=tokenizer.eos_token):
    return "{}{}{}{}".format("<|prompter|>", prompt, eos_token, "<|assistant|>")

def generate(prompt, generation_config=generation_config, max_new_tokens=2048, device=device):
    prompt = format_system_prompt(prompt, eos_token=tokenizer.eos_token)  # OpenAssistant prompt format expected
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        generation_output = model.generate(
            input_ids=input_ids,
            generation_config=generation_config,
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=max_new_tokens,
            eos_token_id=tokenizer.eos_token_id,
        )
    s = generation_output.sequences[0]
    output = tokenizer.decode(s)
    print("Text generated:")
    print(output)
    return output
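Because `generate` decodes the full sequence, the returned text contains the prompt and special tokens as well as the reply. A minimal sketch of one way to pull out just the assistant's answer follows. `extract_reply` is a hypothetical helper, and it assumes the decoded string contains the special tokens literally, as in the prompt format above.

```python
def extract_reply(decoded, assistant_token="<|assistant|>", eos_token="<|endoftext|>"):
    """Return only the text generated after the last assistant marker."""
    # Keep what follows the last assistant marker (the whole string if absent).
    _, _, reply = decoded.rpartition(assistant_token)
    # Drop the end-of-turn token and anything after it.
    reply = reply.split(eos_token, 1)[0]
    return reply.strip()


# Hypothetical decoded output, for illustration only.
example = "<|prompter|>Hi<|endoftext|><|assistant|>Hello there!<|endoftext|>"
print(extract_reply(example))  # Hello there!
```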

LoRA weights

If you want to use the LoRA weights separately, several special token embeddings also need to be added.

import torch
import transformers
from huggingface_hub import hf_hub_download
from peft import PeftModel

base_model_id = "tiiuae/falcon-40b"


def add_embeddings(model, embed_path, tokenizer):
    """Write the extra special-token embeddings into the rows after the base vocabulary."""
    old_embeddings = model.get_input_embeddings()
    old_num_tokens, old_embedding_dim = old_embeddings.weight.size()
    # Build a fresh embedding matrix with the same shape, device, and dtype.
    new_embeddings = torch.nn.Embedding(old_num_tokens, old_embedding_dim)
    new_embeddings.to(old_embeddings.weight.device, dtype=old_embeddings.weight.dtype)
    model._init_weights(new_embeddings)
    embed_weights = torch.load(embed_path, map_location=old_embeddings.weight.device)
    vocab_size = tokenizer.vocab_size
    # Copy the base vocabulary rows unchanged.
    new_embeddings.weight.data[:vocab_size, :] = old_embeddings.weight.data[:vocab_size, :]
    # Fill the rows that follow with the extra embeddings loaded from embed_path.
    new_embeddings.weight.data[vocab_size : vocab_size + embed_weights.shape[0], :] = embed_weights.to(
        new_embeddings.weight.dtype
    ).to(new_embeddings.weight.device)
    model.set_input_embeddings(new_embeddings)
    model.tie_weights()


def load_peft_model(model, peft_model_path, tokenizer):
    embed_weights = hf_hub_download(peft_model_path, "extra_embeddings.pt")
    model.resize_token_embeddings(tokenizer.vocab_size + torch.load(embed_weights).shape[0])
    model.config.eos_token_id = tokenizer.eos_token_id
    model.config.bos_token_id = tokenizer.bos_token_id
    model.config.pad_token_id = tokenizer.pad_token_id
    model = PeftModel.from_pretrained(
        model,
        model_id=peft_model_path,
        torch_dtype=model.dtype,
    )
    model.eos_token_id = tokenizer.eos_token_id
    add_embeddings(model, embed_weights, tokenizer)
    return model


def load_lora_model(base_model_id, tokenizer, device, dtype):
    model = transformers.AutoModelForCausalLM.from_pretrained(
        base_model_id,
        torch_dtype=dtype,
        trust_remote_code=True,
    )
    # repo_id is the module-level adapter repository defined in the first code block.
    model = load_peft_model(model, repo_id, tokenizer)
    model = model.to(device)
    return model


model = load_lora_model(base_model_id=base_model_id, tokenizer=tokenizer, device=device, dtype=dtype)