---
license: apache-2.0
language:
- en
- zh
pipeline_tag: token-classification
---
|
# bert-chunker |
|
|
|
[Paper](https://github.com/jackfsuia/BertChunker/blob/main/main.pdf) | [Github](https://github.com/jackfsuia/BertChunker) |
|
|
|
## Introduction |
|
|
|
bert-chunker is a text chunker based on BERT with a classifier head that predicts the start token of each chunk (for use in RAG, etc.); using a sliding window, it can cut documents of any size into chunks. It was finetuned from [nreimers/MiniLM-L6-H384-uncased](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased). Training took 10 minutes on an Nvidia P40 GPU on a 50 MB synthesized dataset.
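The idea above can be sketched in isolation: the classifier head assigns each token a "chunk start" probability, and the text is cut wherever that probability crosses a threshold. The toy function below illustrates this logic only; its name and inputs are illustrative, not the repo's actual API.

```python
# Toy sketch of threshold-based chunking: a new chunk begins at every
# token whose predicted start probability is >= prob_threshold.
def split_by_start_probs(tokens, start_probs, prob_threshold=0.5):
    chunks, current = [], []
    for token, p in zip(tokens, start_probs):
        if p >= prob_threshold and current:
            chunks.append(current)  # close the previous chunk
            current = []
        current.append(token)
    if current:
        chunks.append(current)
    return chunks

tokens = ["Sarah", "wrote", "novels", "Dr", "Thompson", "flew", "spaceships"]
probs = [0.9, 0.1, 0.2, 0.8, 0.1, 0.1, 0.2]
print(split_by_start_probs(tokens, probs, prob_threshold=0.5))
# -> [['Sarah', 'wrote', 'novels'], ['Dr', 'Thompson', 'flew', 'spaceships']]
```

In the real model, the probabilities come from the BERT classifier head evaluated over a sliding window, so arbitrarily long documents can be processed.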
|
|
|
This repo includes the model checkpoint, the BertChunker class definition file, and all other files needed.
|
|
|
## Quickstart |
|
Download this repository, enter its directory, and run the following:
|
|
|
```python
import torch
import safetensors.torch
from transformers import AutoConfig, AutoTokenizer
from modeling_bertchunker import BertChunker

# load the BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "tim1900/bert-chunker",
    padding_side="right",
    model_max_length=255,
    trust_remote_code=True,
)

# load the MiniLM-L6-H384-uncased BERT config
config = AutoConfig.from_pretrained(
    "tim1900/bert-chunker",
    trust_remote_code=True,
)

# initialize the model and move it to the GPU (fall back to CPU if none is available)
model = BertChunker(config)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# load the parameters from tim1900/bert-chunker, model.safetensors
state_dict = safetensors.torch.load_file("./model.safetensors")
model.load_state_dict(state_dict)

# text to be chunked; there is no limit on its length
text = '''In the heart of the bustling city, where towering skyscrapers touch the clouds and the symphony
of honking cars never ceases, Sarah, an aspiring novelist, found solace in the quiet corners of the ancient library.
Surrounded by shelves that whispered stories of centuries past, she crafted her own world with words, oblivious to the rush outside.
Dr. Alexander Thompson, aboard the spaceship 'Pandora's Venture', was en route to the newly discovered exoplanet Zephyr-7.
As the lead astrobiologist of the expedition, his mission was to uncover signs of microbial life within the planet's subterranean ice caves.
With each passing light year, the anticipation of unraveling secrets that could alter humanity's
understanding of life in the universe grew ever stronger.'''

# chunk the text. prob_threshold should be in (0, 1);
# the lower it is, the more chunks are generated.
chunks = model.chunk_text(text, tokenizer, prob_threshold=0.5)

# print the chunks
for i, c in enumerate(chunks):
    print(f'-----chunk: {i}------------')
    print(c)
```
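
The effect of `prob_threshold` noted in the comment above can be seen with toy numbers: lowering the threshold can only add cut points, so the chunk count never decreases. The function below is an illustrative stand-in, not the model's output.

```python
# Hedged illustration: each start probability at or above the threshold
# (after the first token) opens a new chunk, so chunks = cut points + 1.
def count_chunks(start_probs, prob_threshold):
    cuts = sum(1 for p in start_probs[1:] if p >= prob_threshold)
    return cuts + 1

probs = [0.9, 0.2, 0.7, 0.4, 0.95, 0.1]
for t in (0.8, 0.5, 0.3):
    print(t, count_chunks(probs, t))
# 0.8 -> 2 chunks, 0.5 -> 3 chunks, 0.3 -> 4 chunks
```

In practice, sweep `prob_threshold` on a few representative documents and pick the value that yields chunk sizes suited to your RAG retriever.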