---
language:
- en
pipeline_tag: text-classification
---

# Llama-3.1-Bespoke-MiniCheck-7B

[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1s-5TYnGV3kGFMLp798r5N-FXPD8lt2dm?usp=sharing)

This is a fact-checking model developed by [Bespoke Labs](https://bespokelabs.ai) and maintained by [Liyan Tang](https://www.tangliyan.com/) and Bespoke Labs. 
The model is an improvement of the MiniCheck model proposed in the following paper:

📃 [**MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents**](https://arxiv.org/pdf/2404.10774.pdf)

[GitHub Repo](https://github.com/Liyan06/MiniCheck)

The model takes as input a document and a sentence and determines whether the sentence is supported by the document: **MiniCheck-Model(document, claim) -> {0, 1}**

In order to fact-check a multi-sentence claim, the claim should first be broken up into sentences. The document does not need to be chunked unless it exceeds `32K` tokens.

`Llama-3.1-Bespoke-MiniCheck-7B` is finetuned from `internlm/internlm2_5-7b-chat` ([Cai et al., 2024](https://arxiv.org/pdf/2403.17297))
on the combination of 35K data points only:
- 21K ANLI examples ([Nie et al., 2020](https://aclanthology.org/2020.acl-main.441.pdf))
- 14K synthetically-generated examples following the scheme in the MiniCheck paper, but with additional proprietary data curation techniques (sampling, selecting additional high quality data sources, etc.) from Bespoke Labs. Specifically, we generate 7K "claim-to-document" (C2D) and 7K "doc-to-claim" (D2C) examples. The following steps were taken to avoid benchmark contamination: the error types of the model in the benchmark data were not used, and the data sources were curated independent of the benchmark.

All synthetic data is generated by [`meta-llama/Meta-Llama-3.1-405B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct), thus the name `Llama-3.1-Bespoke-MiniCheck-7B`.

**While scaling up the model (compared to what is in MiniCheck) helped, many improvements come from high-quality curation, thus establishing the superiority of Bespoke Labs's curation technology.**

### Model Variants
We also have other three MiniCheck model variants:
- [lytang/MiniCheck-Flan-T5-Large](https://huggingface.co/lytang/MiniCheck-Flan-T5-Large) (Model Size: 0.8B)
- [lytang/MiniCheck-RoBERTa-Large](https://huggingface.co/lytang/MiniCheck-RoBERTa-Large) (Model Size: 0.4B)
- [lytang/MiniCheck-DeBERTa-v3-Large](https://huggingface.co/lytang/MiniCheck-DeBERTa-v3-Large) (Model Size: 0.4B)


### Model Performance

<p align="center">
    <img src="./performance.png" width="550">
</p>


The performance of these models is evaluated on our new collected benchmark (unseen by our models during training), [LLM-AggreFact](https://huggingface.co/datasets/lytang/LLM-AggreFact),
from 11 recent human annotated datasets on fact-checking and grounding LLM generations. **Llama-3.1-Bespoke-MiniCheck-7B is the SOTA
fact-checking model despite its small size.**


# Model Usage

Please run the following command to install the **MiniCheck package** and all necessary dependencies.
```sh
pip install "minicheck[llm] @ git+https://github.com/Liyan06/MiniCheck.git@main"
```

### Below is a simple use case

```python
from minicheck.minicheck import MiniCheck
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

doc = "A group of students gather in the school library to study for their upcoming final exams."
claim_1 = "The students are preparing for an examination."
claim_2 = "The students are on vacation."

# model_name can be one of:
# ['roberta-large', 'deberta-v3-large', 'flan-t5-large', 'Bespoke-MiniCheck-7B']
scorer = MiniCheck(model_name='Bespoke-MiniCheck-7B', enable_prefix_caching=False, cache_dir='./ckpts')
pred_label, raw_prob, _, _ = scorer.score(docs=[doc, doc], claims=[claim_1, claim_2])

print(pred_label) # [1, 0]
print(raw_prob)   # [0.9840446675150499, 0.010986349594852094]
```

### Throughput

We speed up Llama-3.1-Bespoke-MiniCheck-7B inference with [vLLM](https://github.com/vllm-project/vllm). Based on our test on 
a single A6000 (48 VRAM), Llama-3.1-Bespoke-MiniCheck-7B with vLLM and MiniCheck-Flan-T5-Large have throughputs > 500 docs/min.

### Automatic Prefix Caching

> Automatic Prefix Caching (APC in short) caches the KV cache of existing queries, so that a new query can directly reuse the KV
> cache if it shares the same prefix with one of the existing queries, allowing the new query to skip the computation of the shared part.

To enable automatic prefix caching for `Bespoke-MiniCheck-7B`, simply set `enable_prefix_caching=True` when initializing the 
MiniCheck model (no other changes are needed):

```python
scorer = MiniCheck(model_name='Bespoke-MiniCheck-7B', enable_prefix_caching=True, cache_dir='./ckpts')
```

How automatic prefix caching affects the throughput and model performance can be found in the [GitHub Repo](https://github.com/Liyan06/MiniCheck).


### Test on our [LLM-AggreFact](https://huggingface.co/datasets/lytang/LLM-AggreFact) Benchmark

```python
import pandas as pd
from datasets import load_dataset
from minicheck.minicheck import MiniCheck
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# load 29K test data
df = pd.DataFrame(load_dataset("lytang/LLM-AggreFact")['test'])
docs = df.doc.values
claims = df.claim.values

scorer = MiniCheck(model_name='Bespoke-MiniCheck-7B', enable_prefix_caching=False, cache_dir='./ckpts')
pred_label, raw_prob, _, _ = scorer.score(docs=[doc, doc], claims=[claim_1, claim_2])  # ~ 500 docs/min, depending on hardware
```

To evaluate the result on the benchmark
```python
from sklearn.metrics import balanced_accuracy_score

df['preds'] = pred_label
result_df = pd.DataFrame(columns=['Dataset', 'BAcc'])
for dataset in df.dataset.unique():
    sub_df = df[df.dataset == dataset]
    bacc = balanced_accuracy_score(sub_df.label, sub_df.preds) * 100
    result_df.loc[len(result_df)] = [dataset, bacc]

result_df.loc[len(result_df)] = ['Average', result_df.BAcc.mean()]
result_df.round(1)
```

# License

This work is licensed under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/).
For commercial licensing, please contact company@bespokelabs.ai.

# Citation

```
@misc{tang2024minicheck,
      title={MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents},
      author={Liyan Tang and Philippe Laban and Greg Durrett},
      year={2024},
      eprint={2404.10774},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@misc{tang2024bespokeminicheck,
  title={Bespoke-Minicheck-7B},
  author={Bespoke Labs},
  year={2024},
  url={https://huggingface.co/bespokelabs/Bespoke-MiniCheck-7B},
}
```

# Acknowledgements

Model perfected at [Bespoke Labs](https://www.bespokelabs.ai).

Team:
1. [Liyan Tang](https://tangliyan.com/)
2. [Negin Raoof](https://neginraoof.com/)
3. [Trung Vu](https://x.com/trungthvu)
4. [Greg Durrett](https://www.cs.utexas.edu/~gdurrett/)
5. [Alex Dimakis](https://users.ece.utexas.edu/~dimakis/)
6. [Mahesh Sathiamoorthy](https://smahesh.com)

We also thank Giannis Daras for feedback and Sarthak Malhotra for market research.