bge-m3_RAG-conversational-IR

Introduction

This is a conversational retrieval embedding model, specifically designed for multi-turn RAG systems.

Quickstart

RAG QA format

The input question format is as follows:

# QUESTION: {question3}

</s> # HISTORY:
A: {Question2}
B: {Answer2}||A: {Question1}
B: {Answer1}

The document text is directly encoded without prompt.

Code Example

Here is a code snippet to generate the retrieval query with history.

from transformers import AutoTokenizer


def genereate_retrieval_prompt(
        question: str, 
        history: list, 
        tokenizer: AutoTokenizer, 
        history_max_len: int = 510
    ) -> str:
    def cut_utterance(utterance):
        utterance_len = 128
        utterance_ids = tokenizer.encode(utterance, add_special_tokens=False)
        if len(utterance_ids) > utterance_len:
            utterance = tokenizer.decode(utterance_ids[:utterance_len])
        return utterance

    prompt = "# QUESTION: {question}\n\n{sep_token} # HISTORY:\n{history_string}"

    query = prompt.format(question=question, sep_token=tokenizer.sep_token, history_string="None")
    for i in range(0, len(history)):
        if len(history[i:]) > 0:
            history_list = ["A: {}\nB: {}".format(cut_utterance(history[j][0]), cut_utterance(history[j][1]))
                            for j in range(i, len(history))]
            history_list.reverse()
            history_string = "||".join(history_list)
        else:
            history_string = "None"
        query = prompt.format(question=question, sep_token=tokenizer.sep_token, history_string=history_string)

        if len(tokenizer.encode(query)) < history_max_len:
            break
    
    return query


tokenizer = AutoTokenizer.from_pretrained("HIT-TMG/bge-m3_RAG-conversational-IR")

question = "That's quite an interesting history! what awards and achievements has the brewery received?"
history = [
    [
        "How many people speak dholuo and where is it spoken?", 
        "Dholuo is spoken by around 4.2 million luo people. They live in kenya and tanzania, particularly in the areas around the eastern shore of lake victoria and further south. Dholuo is a dialect of the luo group of nilotic languages and is used in broadcasts on the kenya broadcasting corporation."
    ],
    [
        "Hey there! i recently heard about lost coast brewery and i'm curious to know more about its history and the types of beers they produce. can you help me with that?", 
        "Certainly! Lost coast brewery is located in eureka, california, and was founded by wendy pound and barbara groom in 1989. It's one of only two female-owned breweries in humboldt county. The cofounders got the idea in 1986, and after three years of research and development, they opened the brewery and pub in 1990 in the historical pythian castle. They eventually outgrew this location and moved to a larger building in eureka in 1994.\n\nTheir production increased to 60,000 barrels per year by 2011, and a larger facility built in 2014 allowed for the potential production of 200,000 to 600,000 barrels annually. In 2013, lost coast brewery distributed beer to 21 states and 11 nations.\n\nLost coast brews a variety of year-round and seasonal beers. Some of their well-known offerings include great white, tangerine wheat, downtown brown, alleycat amber, 8-ball stout, and indica india pale ale. They also have seasonal beers like fogcutter double ipa, revenant ipa, hazy ipa, watermelon wheat, raspberry brown, and winterbraun, as well as other limited releases."
    ],
]

retrieval_query = genereate_retrieval_prompt(question=question, history=history, tokenizer=tokenizer, history_max_len=510)

Citation

If you find our model helpful, feel free to give us a citation.

@misc{truthreader,
  author = {Xinshuo Hu and Zetian Sun and Dongfang Li and Shaolin Ye and Zifei Shan and Qian Chen and Baotian Hu and Min Zhang},
  title = {TruthReader: Towards Trustworthy Document Assistant Chatbot with Reliable Attribution},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/HITsz-TMG/TruthReader-document-assistant}},
}

HIT-TMG
/

bge-m3_RAG-conversational-IR