---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
library_name: transformers
---
# Granite RAG 3.0 8b
## Model Summary
**Granite RAG 3.0 8b** is a RAG-specific LoRA adapter for [ibm-granite/granite-3.0-8b-instruct](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct),
which adds hallucination detection and citation generation capabilities while retaining the full abilities of the [ibm-granite/granite-3.0-8b-instruct](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct) model.
- **Developer:** IBM Research
- **Model type:** LoRA adapter for [ibm-granite/granite-3.0-8b-instruct](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct)
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
## Usage
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
### Intended use
**Granite RAG 3.0 8b** is a LoRA adapter for [ibm-granite/granite-3.0-8b-instruct](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct). This RAG-specific adapter adds the ability to generate a response, detect whether the generated response contains hallucinations, and generate citations for the generated response. The output is generated as JSON, which contains the output sentences, hallucination detections and citations.
**Model input**: The input to the model is a list of conversational turns converted to a string using the `apply_chat_template` function. The first turn of the conversation is a `system` turn whose `content` field contains a JSON structure (serialized to a string). The JSON structure includes:
1. `instruction` : the model is trained with the following system instruction: `Respond to the user's latest question based solely on the information provided in the documents. Ensure that your response is strictly aligned with the facts in the provided documents. If the information needed to answer the question is not available in the documents, inform the user that the question cannot be answered based on the available data. Make sure that your response follows the attributes mentioned in the 'meta' field.`
2. `documents`: a list of documents, where each item in the list is a dictionary with the fields `doc_id` and `text`.
3. `meta`: a dictionary with two fields, `hallucination_tags` and `citations`, that control which output features are generated.
The remaining turns in the conversation are `user` and `assistant` turns, where the `content` field contains a string.
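
The input structure described above can be assembled programmatically rather than written as an escaped string. The following is a minimal sketch, not part of the released code: `build_system_turn` is a hypothetical helper name, the document text and user question are placeholders, and the boolean values for the `meta` fields follow the Quickstart example.

```python
import json

# System instruction the model was trained with (quoted from this model card).
INSTRUCTION = (
    "Respond to the user's latest question based solely on the information "
    "provided in the documents. Ensure that your response is strictly aligned "
    "with the facts in the provided documents. If the information needed to "
    "answer the question is not available in the documents, inform the user "
    "that the question cannot be answered based on the available data. Make "
    "sure that your response follows the attributes mentioned in the 'meta' field."
)


def build_system_turn(documents, hallucination_tags=True, citations=True):
    """Return a `system` turn whose content is the JSON structure as a string.

    Hypothetical helper for illustration; `documents` is a list of
    dictionaries with `doc_id` and `text` fields.
    """
    payload = {
        "instruction": INSTRUCTION,
        "documents": documents,
        "meta": {"hallucination_tags": hallucination_tags, "citations": citations},
    }
    return {"role": "system", "content": json.dumps(payload)}


# Placeholder document and question, for illustration only.
documents = [{"doc_id": 1, "text": "Example passage retrieved for the question."}]
conversation = [
    build_system_turn(documents),
    {"role": "user", "content": "What does the passage say?"},
]
```

The resulting `conversation` list is what would be passed to `tokenizer.apply_chat_template`. Using `json.dumps` avoids hand-escaping quotes inside the document text.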
**Model output**: The model output is a JSON structure containing a list, where each item has the following fields:
1. `sentence`: the output sentence.
2. `meta`: a dictionary that contains two fields, `hallucination_level` and `citation`. `hallucination_level` takes the value `high` or `low`, where `high` indicates that the generated sentence likely contains hallucinated content that is not grounded in the provided documents. `citation` is a dictionary with the fields `snippet`, which holds the sentences that ground the response, and `doc_id`, which points to the document that contains the `snippet`.
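
As a rough illustration of consuming this output, the sketch below parses a JSON string in the documented shape and separates sentences flagged `high` from the rest. It is an assumption-laden example: `split_by_hallucination` and `SENTENCE_KEY` are hypothetical names, the key holding the sentence text is assumed to be spelled `sentence` (adjust the constant if the model emits a different spelling), and the sample is hand-written rather than real model output.

```python
import json

# Assumption: key that holds the sentence text; change it if the model's
# actual output uses a different spelling.
SENTENCE_KEY = "sentence"


def split_by_hallucination(output_text):
    """Parse the JSON output and split sentences by hallucination level."""
    grounded, flagged = [], []
    for item in json.loads(output_text):
        meta = item.get("meta", {})
        record = {
            "text": item.get(SENTENCE_KEY, ""),
            "citation": meta.get("citation"),
        }
        # Sentences marked `high` are likely not grounded in the documents.
        if meta.get("hallucination_level") == "high":
            flagged.append(record)
        else:
            grounded.append(record)
    return grounded, flagged


# Hand-written sample in the documented shape (not real model output).
sample_output = json.dumps([
    {
        SENTENCE_KEY: "Yes, she took a three-year break from recording.",
        "meta": {
            "hallucination_level": "low",
            "citation": {"doc_id": 1, "snippet": "Hill took a three-year break from recording"},
        },
    },
    {
        SENTENCE_KEY: "She spent the break abroad.",
        "meta": {"hallucination_level": "high", "citation": {}},
    },
])

grounded, flagged = split_by_hallucination(sample_output)
```

In practice `json.loads` raises `json.JSONDecodeError` if the generation is truncated (for example by `max_new_tokens`), so callers may want to catch that case.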
**Granite RAG 3.0 8b** is not intended to detect hallucinations in responses generated by any model other than itself or [ibm-granite/granite-3.0-8b-instruct](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct).
### Quickstart Example
The following code shows how to use Granite RAG 3.0 8b in a RAG setting to generate answers, detect hallucinations and generate citations.
```python
import json
import torch
from peft import PeftModel
from transformers import AutoTokenizer, AutoModelForCausalLM
BASE_NAME = "ibm-granite/granite-3.0-8b-instruct"
LORA_NAME = "ibm-granite/granite-rag-3.0-8b-lora"
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Load the tokenizer and the base model, then attach the RAG LoRA adapter
tokenizer = AutoTokenizer.from_pretrained(BASE_NAME, padding_side='left', trust_remote_code=True)
model_base = AutoModelForCausalLM.from_pretrained(BASE_NAME, device_map="auto")
model_rag = PeftModel.from_pretrained(model_base, LORA_NAME)
# Conversation: a system turn holding the JSON structure (as a string), followed by the user question
question_chat = [
    {
        "role": "system",
        "content": "{\"instruction\": \"Respond to the user's latest question based solely on the information provided in the documents. Ensure that your response is strictly aligned with the facts in the provided documents. If the information needed to answer the question is not available in the documents, inform the user that the question cannot be answered based on the available data. Make sure that your response follows the attributes mentioned in the 'meta' field.\", \"documents\": [{\"doc_id\": 1, \"text\": \"Audrey Faith McGraw (born September 21, 1967) is an American singer and record producer. She is one of the most successful country artists of all time, having sold more than 40 million albums worldwide. Hill is married to American singer Tim McGraw, with whom she has recorded several duets. Hill's first two albums, Take Me as I Am (1993) and It Matters to Me (1995), were major successes and placed a combined three number ones on Billboard's country charts. Hill's debut album was Take Me as I Am (1993); sales were strong, buoyed by the chart success of \\\"Wild One\\\". Hill became the first female country singer in 30 years to hold Billboard's number one position for four consecutive weeks when \\\"Wild One\\\" managed the feat in 1994. Her version of \\\"Piece of My Heart\\\", also went to the top of the country charts in 1994. The album sold a total of 3 million copies. Other singles from the album include \\\"Take Me as I Am\\\". The recording of Faith's second album was delayed by surgery to repair a ruptured blood vessel on her vocal cords. It Matters to Me finally appeared in 1995 and was another success, with the title track becoming her third number-one country single. Several other top 10 singles followed, and more than 3 million copies of the album were sold. The fifth single from the album, \\\"I Can't Do That Anymore\\\", was written by country music artist Alan Jackson. Other singles from the album include \\\"You Can't Lose Me\\\", \\\"Someone Else's Dream\\\", and \\\"Let's Go to Vegas\\\". During this period, Hill appeared on the acclaimed PBS music program Austin City Limits. In spring 1996, Hill began the Spontaneous Combustion Tour with country singer Tim McGraw. At that time, Hill had recently become engaged to her former producer, Scott Hendricks, and McGraw had recently broken an engagement. McGraw and Hill were quickly attracted to each other and began an affair. After discovering that Hill was pregnant with their first child, the couple married on October 6, 1996. The couple have three daughters together: Gracie Katherine (born 1997), Maggie Elizabeth (born 1998) and Audrey Caroline (born 2001). Since their marriage, Hill and McGraw have endeavored never to be apart for more than three consecutive days. After the release of It Matters to Me, Hill took a three-year break from recording to give herself a rest from four years of touring and to begin a family with McGraw. During her break, she joined forces with her husband for their first duet, \\\"It's Your Love\\\". The song stayed at number one for six weeks, and won awards from both the Academy of Country Music and the Country Music Association. Hill has remarked that sometimes when they perform the song together, \\\"it [doesn't] feel like anybody else was really watching.\\\"\"}], \"meta\": {\"hallucination_tags\": true, \"citations\": true}}"
    },
    {
        "role": "user",
        "content": "Did Faith Hill take a break from recording after releasing her second album, It Matters to Me?"
    }
]
# Generate answer
# Render the conversation into a single prompt string and tokenize it
input_text = tokenizer.apply_chat_template(question_chat, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt")
output = model_rag.generate(inputs["input_ids"].to(device), attention_mask=inputs["attention_mask"].to(device), max_new_tokens=500)
# Decode only the newly generated tokens (skip the prompt tokens)
output_text = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print("Output: " + output_text)
```
## Training Details
The **Granite RAG 3.0 8b** model is a LoRA adapter fine-tuned to generate responses, detect hallucinations and generate citations.
The details of data creation for RAG response generation are available in the [Granite Technical Report](https://github.com/ibm-granite/granite-3.0-language-models/blob/main/paper.pdf).
To create the hallucination labels for responses, the technique described in [Achintalwar, et al.](https://arxiv.org/pdf/2403.06009) was used.
Citations were created by using the ROUGE-L metric to identify the document snippets that best overlap with the responses.
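
To make the citation step more concrete, here is a minimal sketch of ROUGE-L-based snippet selection. It is not the authors' implementation: the regex tokenization, the sentence-level splitting of documents, the use of the F-measure variant, and the names `lcs_length`, `rouge_l_f1` and `best_snippet` are all assumptions made for illustration.

```python
import re


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F-measure between two strings (lowercased word tokens)."""
    cand = re.findall(r"\w+", candidate.lower())
    ref = re.findall(r"\w+", reference.lower())
    lcs = lcs_length(cand, ref) if cand and ref else 0
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def best_snippet(response_sentence, documents):
    """Return the document sentence that overlaps most with the response."""
    best = None
    for doc in documents:
        # Naive sentence split on terminal punctuation (an assumption).
        for snippet in re.split(r"(?<=[.!?])\s+", doc["text"]):
            score = rouge_l_f1(snippet, response_sentence)
            if best is None or score > best["score"]:
                best = {"doc_id": doc["doc_id"], "snippet": snippet, "score": score}
    return best


# Toy documents for illustration only.
docs = [
    {"doc_id": 1, "text": "Hill took a three-year break from recording. The song stayed at number one."},
    {"doc_id": 2, "text": "McGraw is a country singer."},
]
match = best_snippet("Yes, Hill took a three-year break from recording.", docs)
```

Here `match` holds the `doc_id` and `snippet` of the best-overlapping sentence, mirroring the fields of the `citation` dictionary in the model output.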
### Training Data
The following public datasets were used to fine-tune the RAG model.
* [MultiDoc2Dial](https://huggingface.co/datasets/IBM/multidoc2dial)
* [QuAC](https://huggingface.co/datasets/allenai/quac)
## Evaluation
1. Evaluation of RAG response generation on the test sets of the [RAGBench](https://arxiv.org/pdf/2407.11005) benchmark using the [RAGAS](https://arxiv.org/abs/2309.15217) evaluation framework.
*Note: All the evaluations are done using Azure OpenAI Service.*
a. Evaluation using RAGAS Faithfulness metric.
| Dataset | Granite 3.0 8B Instruct | Granite RAG 3.0 8b |
| ---------- | ----------------------- | ------------------ |
| CovidQA | 87.08 | 84.00 |
| DelucionQA | 87.97 | 89.85 |
| EManual | 83.83 | 87.74 |
| ExpertQA | 61.01 | 63.36 |
| HAGRID | 85.51 | 85.44 |
| HotpotQA | 88.34 | 86.99 |
| MS Marco | 88.85 | 90.52 |
| PubMedQA | 81.25 | 80.67 |
| TAT-QA | 81.28 | 76.12 |
| TechQA | 31.69 | 54.76 |
| FinQA | 63.34 | 47.79 |
| Average | 76.38 | **77.02** |
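
The `Average` row in the table above is consistent with an unweighted mean over the 11 datasets. The short sketch below recomputes it from the per-dataset Faithfulness scores as printed; `macro_average` is a hypothetical helper name, and the assumption that each dataset is weighted equally is inferred from the numbers rather than stated by the authors.

```python
# Faithfulness scores copied from the table:
# (Granite 3.0 8B Instruct, Granite RAG 3.0 8b)
faithfulness = {
    "CovidQA": (87.08, 84.00),
    "DelucionQA": (87.97, 89.85),
    "EManual": (83.83, 87.74),
    "ExpertQA": (61.01, 63.36),
    "HAGRID": (85.51, 85.44),
    "HotpotQA": (88.34, 86.99),
    "MS Marco": (88.85, 90.52),
    "PubMedQA": (81.25, 80.67),
    "TAT-QA": (81.28, 76.12),
    "TechQA": (31.69, 54.76),
    "FinQA": (63.34, 47.79),
}


def macro_average(scores, column):
    """Unweighted mean of one column across all datasets."""
    return sum(row[column] for row in scores.values()) / len(scores)


instruct_avg = round(macro_average(faithfulness, 0), 2)
rag_avg = round(macro_average(faithfulness, 1), 2)
print(instruct_avg, rag_avg)  # 76.38 77.02
```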
b. Evaluation using RAGAS Answer Correctness metric.
| Dataset | Granite 3.0 8B Instruct | Granite RAG 3.0 8b |
| ---------- | ----------------------- | ------------------ |
| CovidQA | 63.23 | 65.70 |
| DelucionQA | 66.95 | 71.05 |
| EManual | 66.30 | 67.85 |
| ExpertQA | 55.71 | 52.00 |
| HAGRID | 64.06 | 69.10 |
| HotpotQA | 75.87 | 76.24 |
| MS Marco | 65.13 | 65.62 |
| PubMedQA | 65.15 | 66.36 |
| TAT-QA | 73.05 | 71.67 |
| TechQA | 38.19 | 41.68 |
| FinQA | 55.78 | 73.25 |
| Average | 62.68 | **65.50** |
2. Evaluation of the hallucination detection accuracy of the model.
The hallucination detection of the model was evaluated using the BEGIN dataset of the [TRUE](https://arxiv.org/pdf/2204.04991) Factual Consistency Evaluation framework.
The following table shows the class-wise F1 scores of Granite RAG 3.0 8b compared with the teacher model from [Achintalwar, et al.](https://arxiv.org/pdf/2403.06009).
| Dataset | Class            | Achintalwar et al. (DeBERTa) | Granite RAG 3.0 8b |
| ------- | ---------------- | ---------------------------- | ------------------ |
| BEGIN   | Hallucination    | 0.823                        | 0.811              |
| BEGIN   | No Hallucination | 0.531                        | 0.493              |
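
For readers unfamiliar with class-wise F1, the sketch below shows how the score is computed separately for each class by treating that class as the positive label. The labels are toy data, not drawn from the BEGIN dataset, and `class_f1` is a hypothetical helper; the evaluation scripts behind the reported numbers may differ.

```python
def class_f1(y_true, y_pred, positive):
    """F1 score for one class, treating `positive` as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for illustration only.
y_true = ["hallucination", "hallucination", "no_hallucination", "no_hallucination", "hallucination"]
y_pred = ["hallucination", "no_hallucination", "no_hallucination", "hallucination", "hallucination"]

f1_hallucination = class_f1(y_true, y_pred, "hallucination")
f1_no_hallucination = class_f1(y_true, y_pred, "no_hallucination")
```

Reporting both classes, as the table does, makes it visible when a detector performs well on one class but poorly on the other.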
## Model Card Authors
[Chulaka Gunasekara](mailto:chulaka.gunasekara@ibm.com)