StoriesLM: A Family of Language Models With Sequentially-Expanding Pretraining Windows

Model Family

StoriesLM is a family of language models with sequentially expanding pretraining windows. The pretraining data comes from the American Stories dataset, a collection of text from historical American news articles. The first model in the family is pretrained on data from 1900. Each subsequent model continues training from the previous year's checkpoint on data from the following year, up until 1963. As a result, the model for a given year has seen text from 1900 up to that year and nothing later.
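The chaining described above can be sketched in a few lines of Python. This is only an illustration of the window structure, not the authors' training code. It assumes that 1963 is itself the final model year and that repository names follow the pattern `StoriesLM/StoriesLM-v1-<year>`. The helper names (`training_window`, `repo_id`, `training_schedule`) are hypothetical.

```python
from typing import Iterator, Optional, Tuple

FIRST_YEAR = 1900
LAST_YEAR = 1963  # assumption: the final model's window ends in 1963 (inclusive)


def training_window(year: int) -> range:
    """Return the years of data seen by the model whose window ends in `year`."""
    if not FIRST_YEAR <= year <= LAST_YEAR:
        raise ValueError(f"year must be in [{FIRST_YEAR}, {LAST_YEAR}], got {year}")
    return range(FIRST_YEAR, year + 1)


def repo_id(year: int) -> str:
    """Build a repository name for a given year.

    Assumption: every model in the family is named like the 1900 model,
    with only the year suffix changing.
    """
    training_window(year)  # reuse the range check
    return f"StoriesLM/StoriesLM-v1-{year}"


def training_schedule() -> Iterator[Tuple[int, Optional[int]]]:
    """Yield (year, parent_year) pairs; the 1900 model has no parent checkpoint."""
    for year in range(FIRST_YEAR, LAST_YEAR + 1):
        parent = None if year == FIRST_YEAR else year - 1
        yield year, parent


if __name__ == "__main__":
    for year, parent in list(training_schedule())[:3]:
        start = "from scratch" if parent is None else f"from the {parent} checkpoint"
        print(f"{repo_id(year)}: trained {start} on {year} data")
```

Under these assumptions, picking a model for text from a given period amounts to choosing the latest year whose window does not extend past the period of interest.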

Dataset

The StoriesLM family is pretrained on the American Stories dataset. If you use a model from this family, please also cite the original dataset's authors:

@article{dell2024american,
  title={American stories: A large-scale structured text dataset of historical us newspapers},
  author={Dell, Melissa and Carlson, Jacob and Bryan, Tom and Silcock, Emily and Arora, Abhishek and Shen, Zejiang and D'Amico-Wong, Luca and Le, Quan and Querubin, Pablo and Heldring, Leander},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}
