
BERT-Wiki-Paragraphs

Authors: Satya Almasian*, Dennis Aumiller*, Michael Gertz
Details of the training method can be found in our work Structural Text Segmentation of Legal Documents. The training procedure follows the same setup, but for this model the legal documents are replaced with Wikipedia articles.

Training is performed in a weakly supervised fashion to determine whether two paragraphs topically belong together. We utilize automatically generated samples from Wikipedia for training, where paragraphs from within the same section are assumed to be topically coherent.
We use the same articles as Koshorek et al. (2018), albeit from a 2021 dump of Wikipedia, and split at paragraph boundaries instead of the sentence level.
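
To illustrate how the resulting model can be used at inference time, here is a minimal sketch with the Hugging Face transformers library. The repository id dennlinger/bert-wiki-paragraphs, the example paragraphs, and the assumption that label index 1 corresponds to "topically coherent" are not stated above and should be checked against the model's config.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Assumed repository id; the card above only names the model "BERT-Wiki-Paragraphs".
model_id = "dennlinger/bert-wiki-paragraphs"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

paragraph_1 = "The Eiffel Tower is a wrought-iron lattice tower in Paris."
paragraph_2 = "It is named after the engineer Gustave Eiffel, whose company designed it."

# Paragraph pairs are encoded as a single sequence-pair input and truncated to
# 512 tokens, mirroring the training setup described below.
inputs = tokenizer(paragraph_1, paragraph_2, truncation=True, max_length=512,
                   return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Assumption: index 1 is the "topically coherent" class.
coherence_prob = torch.softmax(logits, dim=-1)[0, 1].item()
print(f"Probability that the paragraphs belong together: {coherence_prob:.3f}")
```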

Training Setup

The model was trained for 3 epochs from the "bert-base-uncased" checkpoint on paragraph pairs (truncated to a maximum length of 512 tokens). Training was performed on a single Titan RTX GPU over a duration of 3 weeks.
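
For orientation, a sketch of what such a fine-tuning run might look like with the transformers Trainer is shown below. Only the bert-base-uncased checkpoint, the 3 epochs, the 512-token truncation, and the use of same-section vs. different-section paragraph pairs come from the card; the toy data, batch size, and all other details are assumptions.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Start from the checkpoint named above; binary labels mark whether two
# paragraphs come from the same Wikipedia section (1) or not (0).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Toy stand-in for the weakly supervised Wikipedia pairs; the real data is
# generated automatically from section boundaries as described above.
pairs = Dataset.from_dict({
    "paragraph_1": ["The Eiffel Tower is a lattice tower in Paris.",
                    "The Eiffel Tower is a lattice tower in Paris."],
    "paragraph_2": ["It is named after the engineer Gustave Eiffel.",
                    "Penguins are flightless birds of the Southern Hemisphere."],
    "label": [1, 0],
})

def encode(batch):
    # Pairs are truncated to a maximum length of 512 tokens, as stated above.
    return tokenizer(batch["paragraph_1"], batch["paragraph_2"],
                     truncation=True, max_length=512)

pairs = pairs.map(encode, batched=True)

args = TrainingArguments(
    output_dir="bert-wiki-paragraphs",
    num_train_epochs=3,             # stated above
    per_device_train_batch_size=8,  # assumption, not stated in the card
)

Trainer(model=model, args=args, train_dataset=pairs).train()
```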