Memorisation-Profiles
Artefacts for the paper "Causal Estimation of Memorisation Profiles" (Lesci et al., 2024)
- Paper: arXiv:2406.04327
pietrolesci/pythia-deduped-stats-raw
Note: Model evaluations (or "stats") for each model size included in the study. This is the "raw" version, with statistics at the token level. We gathered these token-level statistics "just in case", since the inference process was expensive; we also provide the aggregated sequence-level statistics in `pietrolesci/pythia-deduped-stats`.
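As a rough sketch of how the raw and aggregated datasets relate, the snippet below collapses token-level statistics to one row per sequence. The column names (`seq_idx`, `loss`) and the single default config/split are assumptions for illustration, not the documented schema; check the dataset viewer for the actual fields.

```python
# Hypothetical sketch: aggregate token-level stats to the sequence level.
# The column names ("seq_idx", "loss") and the "train" split are
# assumptions, not the documented schema of this dataset.
from datasets import load_dataset

raw = load_dataset("pietrolesci/pythia-deduped-stats-raw", split="train")
df = raw.to_pandas()

# One row per sequence: average the per-token statistic within a sequence.
seq_stats = df.groupby("seq_idx", as_index=False)["loss"].mean()
print(seq_stats.head())
```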
pietrolesci/pythia-deduped-stats
Note: Model evaluations (or "stats") for each model size included in the study, already aggregated at the sequence level. Derived from the token-level "raw" version (`pietrolesci/pythia-deduped-stats-raw`).
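If you only need the sequence-level statistics, a minimal loading sketch follows (assuming a single default config and a `train` split; the dataset may instead be organised per model size):

```python
from datasets import load_dataset

stats = load_dataset("pietrolesci/pythia-deduped-stats", split="train")
print(stats.column_names)  # inspect the actual schema before use
```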
EleutherAI/pile-deduped-pythia-preshuffled
Note: The dataset used in our study; it corresponds to the training set of the models listed below.
pietrolesci/pile-deduped-subset
Note: Sample from the Pile (`EleutherAI/pile-deduped-pythia-preshuffled`) used in the experiments. The unique sequence identifier (`seq_idx`) is simply the position of the sequence in the Pile. The dataset is already tokenized.
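A minimal sketch for inspecting this subset, assuming the token ids are stored in an `input_ids` column (that column name is a guess; `seq_idx` is documented above). The Pythia models share the GPT-NeoX tokenizer, so any Pythia checkpoint works for decoding:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

subset = load_dataset("pietrolesci/pile-deduped-subset", split="train")
tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m-deduped")

example = subset[0]
print(example["seq_idx"])                # position of the sequence in the Pile
print(tok.decode(example["input_ids"]))  # detokenised text (column name assumed)
```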
pietrolesci/pile-validation
Note: The validation data used in our study. The Pythia suite does not have an official validation split; however, we confirmed with the authors that this Pile validation split was not seen during training. It remains unclear whether the Pile data can be redistributed freely, so we will remove this dataset if required.
pietrolesci/pythia-deduped-memorisation-profiles
EleutherAI/pythia-70m-deduped
EleutherAI/pythia-160m-deduped
EleutherAI/pythia-410m-deduped
EleutherAI/pythia-1.4b-deduped
EleutherAI/pythia-2.8b-deduped
Note: This model size was ultimately excluded from the analysis because we found what appears to be a mismatch between checkpoints and batch indices: no instantaneous memorisation was detected, which is puzzling given that even the 70M-parameter model exhibits it.
EleutherAI/pythia-6.9b-deduped
EleutherAI/pythia-12b-deduped