QuantFactory/gemma-2-baku-2b-it-GGUF

This is a quantized version of rinna/gemma-2-baku-2b-it, created using llama.cpp.
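
Because this repository holds GGUF files, one way to run them from Python is the llama-cpp-python bindings. The following is a minimal sketch, not a tested recipe for this repository: the `ask` helper is a hypothetical name, and the filename pattern `*Q4_K_M.gguf` is an assumption about how the 4-bit file is named, so check the repository's file list and adjust it before use.

```python
REPO_ID = "QuantFactory/gemma-2-baku-2b-it-GGUF"


def ask(question, filename_pattern="*Q4_K_M.gguf", max_tokens=512):
    """Download one GGUF file from the repo and answer a single user question.

    `filename_pattern` is an assumed glob; it must match exactly one file
    in the repository.
    """
    # Imported lazily so this module loads even without llama-cpp-python installed.
    from llama_cpp import Llama

    llm = Llama.from_pretrained(
        repo_id=REPO_ID,
        filename=filename_pattern,
        n_ctx=2048,      # context window chosen for illustration
        verbose=False,
    )
    result = llm.create_chat_completion(
        messages=[{"role": "user", "content": question}],
        max_tokens=max_tokens,
    )
    return result["choices"][0]["message"]["content"]


# Example call (downloads the model file on first use):
# print(ask("西田幾多郎とはどんな人物ですか?"))
```

`Llama.from_pretrained` requires the `huggingface_hub` package to fetch the file.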

Original Model Card

Gemma 2 Baku 2B Instruct (rinna/gemma-2-baku-2b-it)

Overview

The model is an instruction-tuned variant of rinna/gemma-2-baku-2b, utilizing Chat Vector and Odds Ratio Preference Optimization (ORPO) for fine-tuning. It adheres to the gemma-2 chat format.
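
For readers who build prompts by hand (for example when passing raw text to a GGUF runtime), the sketch below shows what the gemma-2 chat format is believed to look like. The function name `format_gemma2_prompt` is hypothetical, and the exact template is an assumption based on the published Gemma 2 turn markers; `tokenizer.apply_chat_template` remains the authoritative source.

```python
def format_gemma2_prompt(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in the Gemma 2 turn format.

    Assumption: turns are wrapped in <start_of_turn>/<end_of_turn> markers and
    the assistant role is written as "model".
    """
    parts = ["<bos>"]
    for message in messages:
        role = "model" if message["role"] == "assistant" else message["role"]
        content = message["content"].strip()
        parts.append(f"<start_of_turn>{role}\n{content}<end_of_turn>\n")
    if add_generation_prompt:
        # Open a model turn so that generation continues as the assistant.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


prompt = format_gemma2_prompt([{"role": "user", "content": "Hello"}])
print(prompt)
```

If the tokenizer adds `<bos>` on its own, drop it from the string or encode with `add_special_tokens=False`, as the usage code in this card does.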

| Size | Continual Pre-Training | Instruction-Tuning |
| --- | --- | --- |
| 2B | Gemma 2 Baku 2B [HF] | Gemma 2 Baku 2B Instruct [HF] |

  • Model architecture

    A 26-layer, 2304-hidden-size transformer-based language model. Please refer to the Gemma 2 Model Card for detailed information on the model's architecture.

  • Training

    Model merging. The base model was endowed with instruction-following capabilities through a chat vector addition process. The chat vector was derived by subtracting the parameter vectors of google/gemma-2-2b from those of google/gemma-2-2b-it, as follows:

      rinna/gemma-2-baku-2b + 1.0 * (google/gemma-2-2b-it - google/gemma-2-2b)
    

    The embedding layer was excluded from both the subtraction and the addition of parameter vectors.

    ORPO was applied using a subset of the following dataset to further refine the performance of the merged model.

    • rinna's internal dataset
  • Contributors
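
The chat vector addition described under Training can be sketched as follows. This is an illustration under stated assumptions, not rinna's actual merging script: the function name `add_chat_vector` is hypothetical, the embedding layer is identified by the substring `embed_tokens` (the usual Hugging Face parameter naming, assumed here), and plain floats stand in for weight tensors so the example runs without downloading any model.

```python
def add_chat_vector(base, pretrained, instruct, scale=1.0, skip=("embed_tokens",)):
    """Return base + scale * (instruct - pretrained) for each parameter.

    Parameters whose name contains any substring in `skip` are copied from
    `base` unchanged, mirroring the exclusion of the embedding layer.
    Values may be any objects supporting +, - and * (e.g. torch tensors).
    """
    merged = {}
    for name, weight in base.items():
        if any(key in name for key in skip):
            merged[name] = weight  # excluded from the merge: keep base weights
        else:
            merged[name] = weight + scale * (instruct[name] - pretrained[name])
    return merged


# Toy "state dicts": floats stand in for tensors (assumption for illustration).
base = {"model.embed_tokens.weight": 0.5, "model.layers.0.mlp.up_proj.weight": 1.0}
pretrained = {"model.embed_tokens.weight": 0.1, "model.layers.0.mlp.up_proj.weight": 2.0}
instruct = {"model.embed_tokens.weight": 0.9, "model.layers.0.mlp.up_proj.weight": 2.5}

merged = add_chat_vector(base, pretrained, instruct, scale=1.0)
print(merged)
```

With real checkpoints, the three dictionaries would be the `state_dict()` of rinna/gemma-2-baku-2b, google/gemma-2-2b, and google/gemma-2-2b-it respectively.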


Benchmarking

Please refer to rinna's LM benchmark page.


How to use the model

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "rinna/gemma-2-baku-2b-it"
dtype = torch.bfloat16

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype=dtype,
    attn_implementation="eager",
)

chat = [
    { "role": "user", "content": "西田幾多郎とはどんな人物ですか?" },
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

input_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to(model.device)
outputs = model.generate(
    input_ids,
    max_new_tokens=512,
)

response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)

It is recommended to use eager attention when conducting batch inference under bfloat16 precision. Currently, Gemma 2 yields NaN values for input sequences with padding when the default attention mechanism (torch.scaled_dot_product_attention) is employed in conjunction with bfloat16.
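
A batched variant of the usage code, where padding actually occurs, might look like the sketch below. The helper names `build_chats` and `generate_batch` are hypothetical, and setting left padding is an assumption that follows the usual practice for decoder-only generation; the model loading arguments are the same as in the single-prompt example.

```python
def build_chats(questions):
    """Wrap each question in a single-turn chat, as in the single-prompt example."""
    return [[{"role": "user", "content": question}] for question in questions]


def generate_batch(questions, model_id="rinna/gemma-2-baku-2b-it", max_new_tokens=512):
    """Generate one response per question in a single padded batch."""
    # Imported lazily so this module loads even without torch/transformers installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.padding_side = "left"  # assumption: pad on the left for generation
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="cuda",
        torch_dtype=torch.bfloat16,
        attn_implementation="eager",  # avoids the NaN issue with padded bfloat16 inputs
    )

    prompts = [
        tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
        for chat in build_chats(questions)
    ]
    inputs = tokenizer(
        prompts, add_special_tokens=False, padding=True, return_tensors="pt"
    ).to(model.device)

    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the (padded) prompt tokens so only newly generated text is decoded.
    new_tokens = outputs[:, inputs["input_ids"].shape[-1]:]
    return tokenizer.batch_decode(new_tokens, skip_special_tokens=True)


# Example call (requires a CUDA GPU and downloads the model):
# for answer in generate_batch(["西田幾多郎とはどんな人物ですか?", "京都学派とは何ですか?"]):
#     print(answer)
```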


Tokenization

The model uses the original google/gemma-2-2b-it tokenizer.


How to cite

@misc{rinna-gemma-2-baku-2b-it,
    title = {rinna/gemma-2-baku-2b-it},
    author = {Chen, Xinqi and Wakatsuki, Toshiaki and Sawada, Kei},
    url = {https://huggingface.co/rinna/gemma-2-baku-2b-it}
}

@inproceedings{sawada2024release,
    title = {Release of Pre-Trained Models for the {J}apanese Language},
    author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
    booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
    month = {5},
    year = {2024},
    pages = {13898--13905},
    url = {https://aclanthology.org/2024.lrec-main.1213},
    note = {\url{https://arxiv.org/abs/2404.01657}}
}

References

@article{gemma-2-2024,
    title = {Gemma 2},
    url = {https://www.kaggle.com/models/google/gemma-2},
    publisher = {Kaggle},
    author = {Gemma Team},
    year = {2024}
}

@article{huang2023chat,
    title = {Chat Vector: A Simple Approach to Equip LLMs with Instruction Following and Model Alignment in New Languages},
    author = {Huang, Shih-Cheng and Li, Pin-Zu and Hsu, Yu-Chi and Chen, Kuang-Ming and Lin, Yu Tung and Hsiao, Shih-Kai and Tzong-Han Tsai, Richard and Lee, Hung-yi},
    year = {2023},
    url = {https://arxiv.org/abs/2310.04799}
}

@article{hong2024orpo,
  title = {ORPO: Monolithic Preference Optimization without Reference Model},
  author = {Hong, Jiwoo and Lee, Noah and Thorne, James},
  year = {2024},
  url = {https://arxiv.org/abs/2403.07691}
}

License

Gemma Terms of Use

GGUF

Model size: 2.61B params
Architecture: gemma2
Available quantizations: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, 8-bit
