Usage document for pipeline
6/7/23: Will be updated and built out as required.
(1) Generate a bunch of samples
The point of all this code is to construct paired examples of human text, unwatermarked model output, and watermarked model output in something resembling an unbiased or IID manner, despite the difficulty of this ask.
The key functionality is oversampling. A series of arguments control how the raw dataset samples are turned into prompts; then, provided a raw prompt passes some checks, it is fed to the model and generations are produced under both normal decoding and watermarked decoding, each running for however many tokens the model naturally produces. If the generations match the given (length) output filtering criteria, then the row "counts" as one of the N requested samples.
Otherwise, the generations are stored, but the global counter of progress towards N is not incremented. This "overhead" is the cost of being very restrictive in desiring a "square" (N x T) shaped table of samples, in which all three of the human text, unwatermarked output, and watermarked output columns always have the same tokenized length.
At evaluation time, by default, all the point estimates, means, and ROC and AUC calculations are performed on the subset of rows that all have about the target length (i.e. a subset with shape ~ N x T).
The `generation_pipeline.py` call in `run_pipeline.sh` demonstrates the basic usage.
Key arguments controlling the oversampling logic...
'Shape' Controls

- `max_new_tokens`: an upper bound, i.e. the target length T=200
- `min_prompt_tokens`: a prompt length lower bound, such as 50
- `min_generations`: the number of 'good' samples we'd like, i.e. N=500
Prompt construction strategy

- `input_truncation_strategy`: one of `["completion_length", "prompt_length"]`. If the former, the last `max_new_tokens` tokens are sliced off the raw sample to become the `baseline_completion`, or gold output, and the leading prefix (which can have variable length) becomes the prompt. If the latter, the leading `min_prompt_tokens` tokens of the raw sample are selected as the prompt, leaving the remaining (variable length) tokens as the `baseline_completion`.
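To make the two strategies concrete, here is a minimal sketch (a hypothetical helper operating on token-id lists, not the pipeline's actual implementation):

```python
def truncate_example(tokens, strategy, max_new_tokens=200, min_prompt_tokens=50):
    """Split a raw tokenized sample into (prompt, baseline_completion)."""
    if strategy == "completion_length":
        # Last max_new_tokens tokens become the gold completion; the
        # variable-length leading prefix becomes the prompt.
        prompt, baseline_completion = tokens[:-max_new_tokens], tokens[-max_new_tokens:]
    elif strategy == "prompt_length":
        # Leading min_prompt_tokens tokens become the prompt; the
        # variable-length remainder becomes the gold completion.
        prompt, baseline_completion = tokens[:min_prompt_tokens], tokens[min_prompt_tokens:]
    else:
        raise ValueError(f"unknown input_truncation_strategy: {strategy}")
    return prompt, baseline_completion
```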
Filtering/oversampling criteria

- `input_filtering_strategy`: can be one of `["completion_length", "prompt_length", "prompt_and_completion_length"]`. In each case, if the relevant field doesn't meet the minimum length given by `max_new_tokens` or `min_prompt_tokens` respectively, then the raw sample is thrown away before ever being fed to the model.
- `output_filtering_strategy`: can be one of `["no_filter", "max_new_tokens"]`. If the former, no output filtering is performed after generations are sampled from the model. If `max_new_tokens`, then both the unwatermarked and watermarked generations are checked to ensure that they are at least `max_new_tokens` long.
This is a subtle way of adaptively collecting samples (online, from any dataset) such that we eventually end up with at least a subset that matches the squareness (N x T) criterion we desire, without forcing this to happen on every sample by turning off the EOS token, which would amount to a potentially pathological distribution shift in the unwatermarked and watermarked output distributions and could confound the generality of the results.
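Putting the shape controls, prompt construction, and filtering rules together, the oversampling logic behaves roughly like the following sketch (the `make_prompt` and `generate` callables are hypothetical stand-ins for the pipeline's prompt construction and decoding; the real logic lives in `generation_pipeline.py`):

```python
def collect_samples(dataset, make_prompt, generate,
                    max_new_tokens=200, min_prompt_tokens=50, min_generations=500,
                    input_filtering_strategy="prompt_and_completion_length",
                    output_filtering_strategy="max_new_tokens"):
    """Oversample until `min_generations` rows satisfy the length criteria."""
    all_rows, good_rows = [], []
    for raw_sample in dataset:
        prompt, completion = make_prompt(raw_sample)
        # Input filtering: discard the raw sample before it reaches the model.
        if input_filtering_strategy in ("prompt_length", "prompt_and_completion_length") \
                and len(prompt) < min_prompt_tokens:
            continue
        if input_filtering_strategy in ("completion_length", "prompt_and_completion_length") \
                and len(completion) < max_new_tokens:
            continue
        no_wm_output = generate(prompt, watermark=False)
        w_wm_output = generate(prompt, watermark=True)
        row = {"prompt": prompt, "baseline_completion": completion,
               "no_wm_output": no_wm_output, "w_wm_output": w_wm_output}
        all_rows.append(row)  # "overhead" rows are stored even if they don't count
        # Output filtering: only full-length generation pairs count toward N.
        if (output_filtering_strategy == "no_filter"
                or (len(no_wm_output) >= max_new_tokens
                    and len(w_wm_output) >= max_new_tokens)):
            good_rows.append(row)
        if len(good_rows) >= min_generations:
            break
    return all_rows, good_rows
```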
Other generation arguments are explained by their argparse definitions, but these in particular control the watermarking:
- `seeding_scheme`: the watermark embedding scheme being used, such as `lefthash` (formerly `simple_1`) or `selfhash` (formerly `algorithm-3`, in reference to the previous paper)
- `gamma`: parameter controlling the size of the green partition for watermarking
- `delta`: parameter controlling how much bias is added to the green token logits before sampling
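For intuition, one decoding step of a lefthash-style watermark looks roughly like the sketch below: the previous token seeds a PRNG that selects a green partition covering a `gamma` fraction of the vocabulary, and `delta` is added to the green logits before sampling. This is a simplified illustration, not the repo's actual logits processor.

```python
import torch

def bias_green_logits(logits, prev_token_id, vocab_size, gamma=0.25, delta=2.0):
    """Sketch of one watermarked decoding step with left-hash seeding: a
    pseudorandom fraction `gamma` of the vocabulary (seeded by the previous
    token) is marked 'green', and `delta` is added to the green-token logits."""
    rng = torch.Generator()
    rng.manual_seed(15485863 * int(prev_token_id))  # illustrative prime mixes the seed
    green_ids = torch.randperm(vocab_size, generator=rng)[: int(gamma * vocab_size)]
    biased = logits.clone()
    biased[green_ids] += delta
    return biased
```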
(2) Optionally, apply an attack transformation to weaken the watermark, or make detection harder (for non-watermarking methods as well).
We implement three types of attacks in this pipeline: `gpt`, `dipper`, and `copy-paste`.
The key parameters for each are as follows:
`gpt`:

- `attack_model_name`: the OpenAI model variant to use
- `attack_prompt_id`: the index of the prompt to use, see `utils/prompts.json`
- `no_wm_attack`: whether to attack the un-watermarked generation column (`no_wm_output`). Default is the watermarked generation (`w_wm_output`)
`dipper`:

- `lex`: lexical diversity knob for the dipper model/method
- `order`: order diversity knob for the paraphrase attack
`copy-paste`:

- `cp_attack_type`: k-t, meaning k insertions of length t
- `cp_attack_num_insertions`: k, spec'd as an integer
- `cp_attack_insertion_len`: t, generally spec'd as a percent of the full starting sequence length (i.e. `25%`)
- `cp_attack_src_col`: the sequence we're taking the tokens "to be detected" from, i.e. "positive" examples for the detector of interest. For watermarking this is `w_wm_output`
- `cp_attack_dst_col`: the sequence we treat as "negative" surrounding context for the detector of interest. For watermarking this is `no_wm_output`.
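The copy-paste attack can be pictured with the following sketch (a hypothetical helper over token lists; the pipeline's implementation differs in details such as how insertion positions are chosen):

```python
def copy_paste_attack(src_tokens, dst_tokens, k=3, t=0.25):
    """k-t copy-paste sketch: overwrite k evenly spaced spans of the destination
    (e.g. no_wm_output) with spans of length t taken from the source
    (e.g. w_wm_output); t < 1 is read as a fraction of the source length."""
    span = int(t * len(src_tokens)) if t < 1 else int(t)
    attacked = list(dst_tokens)
    for i in range(k):
        # Evenly spaced source/destination offsets, for illustration only.
        src_start = i * (len(src_tokens) - span) // max(k - 1, 1)
        dst_start = i * (len(attacked) - span) // max(k - 1, 1)
        attacked[dst_start:dst_start + span] = src_tokens[src_start:src_start + span]
    return attacked
```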
All parameters have an associated help string in their argparse definition.
The `attack_pipeline.py` call in `run_pipeline.sh` demonstrates the basic usage of the attack functionality.
(3) Run evaluation and watermark detection
This batches the process of applying a combination of metric functions to the dataset of generations (jsonl) and returns a new dataset of generations (jsonl) just with extra columns for a bunch of metrics.
This is separated from the generation phase so that a given set of expensive generations can be reanalyzed in different ways with different metric flavors as necessary.
Key parameters and usage notes for metrics and detection:
- `evaluation_metrics`: a comma-separated list of metrics to evaluate, such as `p-sp,repetition,diversity,z-score,windowed-z-score`
- `window_settings`: if running windowed detection, specifies the comma-separated windowing strategies (such as `20,40,max`)
- `retrieval_technique`: if running retrieval detection, whether to use the `sim` or `bm25` strategy
All (other) parameters have a help string in their argparse definition.
The `evaluation_pipeline.py` call in `run_pipeline.sh` demonstrates the basic usage.
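For reference, the `z-score` metric is the standard one-proportion z-test over the green token count; a minimal sketch of the statistic itself is below (the windowed variant computes a similar statistic over fixed-size windows):

```python
import math

def z_score(num_green, num_scored, gamma=0.25):
    """One-proportion z-test for watermark detection: under the null hypothesis
    (no watermark) each scored token is green with probability gamma."""
    expected = gamma * num_scored
    variance = num_scored * gamma * (1 - gamma)
    return (num_green - expected) / math.sqrt(variance)

# e.g. 95 green tokens out of 200 scored at gamma=0.25 gives z ~ 7.35
print(z_score(95, 200, gamma=0.25))
```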
Argument union and precedence
First, all arguments used at generation time (from the metadata file) are loaded by the evaluation pipeline. Then the command-line args passed to the eval pipeline are added via an update, or "overwriting union", operation: args that are new, evaluation-only settings are added to the current metadata object, while args that were also present at generation time are overwritten by those included in the evaluation argparse.
If the values match, this is standard behavior. Overwriting shared arguments is disabled by default, but can be allowed via the `overwrite_args` flag.
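In other words, the merge behaves roughly like this sketch (a hypothetical helper; argument names mirror the description above, not the pipeline's exact code):

```python
def overwriting_union(gen_args: dict, eval_args: dict, overwrite_args: bool = False) -> dict:
    """Start from the generation-time metadata and add evaluation args, but only
    clobber keys that already existed when overwrite_args is set (or values agree)."""
    merged = dict(gen_args)
    for key, value in eval_args.items():
        if key not in merged or merged[key] == value or overwrite_args:
            merged[key] = value
    return merged
```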
Additionally, the code writes the metrics file into the same directory as the generations file if only `input_dir` is passed. However, for safety, clarity, and organization, one can pass an output dir in which to write the new dataset with metrics, as well as the evaluation metadata, as demonstrated in the `run_pipeline.sh` example.
(3.1) Retrieval and DetectGPT detection
Creating prefixes:
Retrieval detection is implemented as a metric, i.e. it is run by the evaluation script. To perform retrieval detection on full examples, nothing extra is required. To run retrieval at T, you must first run `broadcast_token_prefixes.py` with the `save_per_prefix` argument set to `False` and a `prefix_stride` of choice, such as 50, using a clean generation or attacked generation directory (with the `jsonl` and meta file inside) as input. This will create a version of the dataset (a new `jsonl` file) that contains all of the original rows, duplicated and then sliced to each prefix length defined by stepping by `prefix_stride` along the sequence length dimension.
For example, if you have a file with N=500 rows of length about T=200 each, then running this script with `prefix_stride=50` would create a new file with N=2000 rows, where the first 500 rows all have length 50, the next 500 have length 100, etc. If a given row, say of length 119, is too short for prefix length i, say the 3rd slice size in this example, 150, then in the third block it is marked as `None`. This prevents a prefix block that is expected to consist entirely of a certain prefix length from containing a bunch of sequences that are shorter than expected, which would confound the measurement.
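Conceptually, the broadcast does something like the following sketch (column names such as `w_wm_output` and `prefix_length` are illustrative here; see `broadcast_token_prefixes.py` for the real field handling):

```python
def broadcast_prefixes(rows, prefix_stride=50, max_len=200):
    """Duplicate every row at each prefix length (50, 100, ..., max_len); rows
    too short for a given prefix length are marked None so a prefix block never
    silently contains shorter-than-expected sequences."""
    broadcast = []
    for prefix_len in range(prefix_stride, max_len + 1, prefix_stride):
        for row in rows:
            tokens = row["w_wm_output"]
            sliced = tokens[:prefix_len] if len(tokens) >= prefix_len else None
            broadcast.append({**row, "w_wm_output": sliced, "prefix_length": prefix_len})
    return broadcast  # 500 input rows with stride 50 and max_len 200 -> 2000 output rows
```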
For DetectGPT, a separate script, `detectgpt/detectgpt_main.py`, must be run pointing at a clean generation or attacked generation `jsonl` file. Additionally, to run DetectGPT @ T, similar prefixing logic must be used. However, `broadcast_token_prefixes.py` must be run with `save_per_prefix` set to `True` this time, which creates a set of new files, each containing all the rows of the input `jsonl` file but truncated to each prefix length as described above. Each run of the DetectGPT script then produces a new `jsonl` file (of length N=500 in the above example) with the DetectGPT score column added. Then, the notebook `join_jsonl_prefix_files.ipynb` can be used to join all those separate `jsonl` files for each individual prefix into one full file (N=2000).
Running detection
For retrieval detection, all that is necessary is to run the evaluation script on the `jsonl` containing all the prefixes, and point estimates for detection at each prefix length will be created by grouping on the prefix length column and reducing. Note, the retrieval method will load only the full sequences into the retrieval database (by loading only the longest sample for each original row, so just 500 sequences in our example), but will query, or perform detection, using all of the different prefixes.
For DetectGPT, the evaluation script must also be run, but with `evaluation_metrics=detectgpt` alone and no other metrics. This is because most of the script is a no-op at this point: every row already contains a DetectGPT score, and the scores just need to be turned into ROC plots or AUC measurements. As with retrieval detection, these will be automatically grouped by prefix length and reduced.
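The grouped reduction amounts to something like this sketch (assuming a long-format dataframe with illustrative `prefix_length`, `is_watermarked`, and `detection_score` columns; the pipeline's actual column names differ):

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def auc_per_prefix(df: pd.DataFrame) -> pd.Series:
    """One ROC-AUC point estimate per prefix length: group the rows by prefix
    length, then score the detector within each group."""
    return df.groupby("prefix_length").apply(
        lambda g: roc_auc_score(g["is_watermarked"], g["detection_score"])
    )
```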