---
license: apache-2.0
language:
- en
- zh
pipeline_tag: token-classification
---
# BertChunker

[Paper](https://github.com/jackfsuia/BertChunker/blob/main/main.pdf) | [Github](https://github.com/jackfsuia/BertChunker)

## Introduction

BertChunker is a text chunker based on BERT with a classifier head that predicts the start token of each chunk (for use in RAG, etc.). It was finetuned on [nreimers/MiniLM-L6-H384-uncased](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2). The whole training took 10 minutes on an Nvidia P40 GPU with a 50 MB synthesized dataset. This repo includes the model checkpoint, the BertChunker class definition file, and all the other files needed.

## Quickstart

Download this repository, enter it, and run the following:

```python
import safetensors.torch
from transformers import AutoConfig, AutoTokenizer
from modeling_bertchunker import BertChunker

# load the BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "tim1900/BertChunker",
    padding_side="right",
    model_max_length=255,
    trust_remote_code=True,
)

# load the MiniLM-L6-H384-uncased BERT config
config = AutoConfig.from_pretrained(
    "tim1900/BertChunker",
    trust_remote_code=True,
)

# initialize the model
model = BertChunker(config)
device = "cuda"
model.to(device)

# load parameters from tim1900/BertChunker/model.safetensors
state_dict = safetensors.torch.load_file("./model.safetensors")
model.load_state_dict(state_dict)

# text to be chunked
text = '''In the heart of the bustling city, where towering skyscrapers touch the clouds and the symphony of honking cars never ceases, Sarah, an aspiring novelist, found solace in the quiet corners of the ancient library. Surrounded by shelves that whispered stories of centuries past, she crafted her own world with words, oblivious to the rush outside. Dr. Alexander Thompson, aboard the spaceship 'Pandora's Venture', was en route to the newly discovered exoplanet Zephyr-7. As the lead astrobiologist of the expedition, his mission was to uncover signs of microbial life within the planet's subterranean ice caves. With each passing light year, the anticipation of unraveling secrets that could alter humanity's understanding of life in the universe grew ever stronger.'''

# chunk the text. The threshold can be any value in (-inf, +inf);
# the lower the threshold, the more chunks are generated.
chunks = model.chunk_text(text, tokenizer, threshold=0)

# print the chunks
for i, c in enumerate(chunks):
    print(f'-----chunk: {i}------------')
    print(c)

# chunk the text faster by using a fixed context window;
# batchsize is the number of windows processed per batch.
print('----->Here is the result of fast chunk method<------:')
chunks = model.chunk_text_fast(text, tokenizer, batchsize=20, threshold=0)

# print the chunks
for i, c in enumerate(chunks):
    print(f'-----chunk: {i}------------')
    print(c)
```
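
The threshold's effect can be sketched without the model: a token-classification chunker scores each token, and every token whose start score exceeds the threshold opens a new chunk, so lowering the threshold produces more (and shorter) chunks. The helper and scores below are hypothetical illustrations, not part of this repo:

```python
# Hypothetical sketch of start-score chunking (not the repo's implementation):
# a token whose score exceeds the threshold starts a new chunk.
def split_by_start_scores(tokens, scores, threshold=0.0):
    chunks, current = [], []
    for tok, score in zip(tokens, scores):
        if score > threshold and current:
            chunks.append(" ".join(current))
            current = []
        current.append(tok)
    if current:
        chunks.append(" ".join(current))
    return chunks

tokens = ["Sarah", "wrote", "novels", "Dr", "Thompson", "flew", "to", "Zephyr-7"]
scores = [9.0, -5.0, -4.0, 7.5, -3.0, -6.0, -2.0, -1.0]  # made-up logits

print(split_by_start_scores(tokens, scores, threshold=0))
# ['Sarah wrote novels', 'Dr Thompson flew to Zephyr-7']
```

With `threshold=-3.5` the same scores yield more boundaries, mirroring how a lower threshold makes BertChunker emit more chunks.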