
CoLoR-filter

See accompanying code at: https://github.com/davidbrandfonbrener/color-filter-olmo

If you only want to download the filtered, untokenized data, see: https://huggingface.co/datasets/davidbrandfonbrener/color-filtered-c4

Usage

To download the data, we recommend using the huggingface-cli.

To download all the data, run huggingface-cli download hlzhang109/CoLoR-filter --local-dir YOUR_PATH.

This downloads the data to your Hugging Face cache and creates YOUR_PATH as a directory of symbolic links to the cached files. If you want the files themselves stored at YOUR_PATH, pass YOUR_PATH as the --cache-dir in the command.

WARNING: the data is large because it includes a copy of tokenized C4, which ensures that the selected data indices match the tokenized raw data. The C4 data is ~300GB and the rest of the repo is ~50GB, of which ~45GB is the 1.2B model and optimizer checkpoints.
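Given these sizes, it may be worth checking free disk space before starting a download. The following is a minimal sketch using only the Python standard library. The helper name `has_room` and the 10% headroom factor are illustrative assumptions, and the sizes are the approximate figures quoted above, not exact values. Point `path` at whichever directory will actually hold the files (the cache directory, if YOUR_PATH only contains symbolic links).

```python
import shutil

# Approximate sizes quoted in this README, in GB (not exact values).
C4_TOKENIZED_GB = 300
REST_OF_REPO_GB = 50  # of which ~45GB is model and optimizer checkpoints
FULL_DOWNLOAD_GB = C4_TOKENIZED_GB + REST_OF_REPO_GB


def has_room(path, required_gb, headroom=1.1):
    """Return True if the filesystem containing `path` has at least
    required_gb * headroom gigabytes (decimal GB) free.

    `headroom` is an arbitrary safety margin chosen for this sketch.
    """
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb * headroom


# Check the current directory against the approximate full-repo size.
print(f"Full download needs roughly {FULL_DOWNLOAD_GB} GB")
print("Enough free space here:", has_room(".", FULL_DOWNLOAD_GB))
```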

To download only some of the files (e.g. just the models), add the --include option. For example, huggingface-cli download hlzhang109/CoLoR-filter --local-dir YOUR_PATH --include "models/*".
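If you prefer to script the download, the commands above can be assembled and launched from Python. This is a sketch, not part of the released code: the helper names `build_download_command` and `download` are hypothetical, and it assumes huggingface-cli is installed and on your PATH. It only uses the --local-dir, --cache-dir, and --include options mentioned in this section, and it defaults to a dry run that prints the command without downloading anything.

```python
import shlex
import subprocess

REPO_ID = "hlzhang109/CoLoR-filter"


def build_download_command(local_dir, include=None, cache_dir=None):
    """Build the huggingface-cli argument list for this repo (hypothetical helper)."""
    cmd = ["huggingface-cli", "download", REPO_ID, "--local-dir", local_dir]
    if cache_dir is not None:
        # Store the cached files at an explicit location.
        cmd += ["--cache-dir", cache_dir]
    if include is not None:
        # Restrict the download to files matching a glob, e.g. "models/*".
        cmd += ["--include", include]
    return cmd


def download(local_dir, include=None, cache_dir=None, dry_run=True):
    """Print the command; run it only when dry_run is False."""
    cmd = build_download_command(local_dir, include, cache_dir)
    print(shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


# Dry run: shows the models-only command without downloading.
download("YOUR_PATH", include="models/*")
```

Pass dry_run=False to actually start the download. Building the command as a list rather than a single string avoids shell quoting problems with the glob pattern.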

Citation

If you use this code in your research, please cite the following paper:

@article{brandfonbrener2024color,
  title={CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training},
  author={Brandfonbrener, David and Zhang, Hanlin and Kirsch, Andreas and Schwarz, Jonathan Richard and Kakade, Sham M},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2024}
}