# 💧 [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226) 🔍

### [Demo](https://huggingface.co/spaces/tomg-group-umd/lm-watermarking) | [Paper](https://arxiv.org/abs/2301.10226)

Official implementation of the watermarking and detection algorithms presented in the paper:

"A Watermark for Large Language Models" by _John Kirchenbauer*, Jonas Geiping*, Yuxin Wen, Jonathan Katz, Ian Miers, Tom Goldstein_
### Updates:

- **(6/7/23)** We're thrilled to announce the release of ["On the Reliability of Watermarks for Large Language Models"](https://arxiv.org/abs/2306.04634). Our new preprint documents a deep dive into the robustness properties of more advanced watermarks.
- **(6/9/23)** Initial code release implementing the alternate watermark and detector variants in the new work. Files are located in the subdirectory [`watermark_reliability_release`](watermark_reliability_release).
- **(9/23/23)** Update to the docs with recommendations on parameter settings. The extended implementation (recommended) is available in `extended_watermark_processor.py`.
---
Implementation is based on the "logit processor" abstraction provided by the [huggingface/transformers 🤗](https://github.com/huggingface/transformers) library.
The `WatermarkLogitsProcessor` is designed to be readily compatible with any model that supports the `generate` API.
Any model that can be constructed using the `AutoModelForCausalLM` or `AutoModelForSeq2SeqLM` factories _should_ be compatible.
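For orientation, here is a minimal, self-contained sketch of that abstraction (the toy processor, model name, token id, and bias value are illustrative, not part of this repo): any `LogitsProcessor` passed to `generate` gets to edit the next-token scores at every decoding step, which is exactly the hook the watermark uses.

```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          LogitsProcessor, LogitsProcessorList)

class ToyBiasProcessor(LogitsProcessor):
    """Toy example (not the watermark): add a constant bias to one token id at every step."""
    def __init__(self, token_id: int, bias: float):
        self.token_id = token_id
        self.bias = bias

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        scores[:, self.token_id] += self.bias  # edit the next-token scores before sampling
        return scores

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")  # any causal LM works
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
inputs = tokenizer("The quick brown", return_tensors="pt")
out = model.generate(**inputs,
                     logits_processor=LogitsProcessorList([ToyBiasProcessor(token_id=42, bias=2.0)]),
                     max_new_tokens=10)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```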
### Repo contents

The core implementation is defined by the `WatermarkBase`, `WatermarkLogitsProcessor`, and `WatermarkDetector` classes, found in `watermark_processor.py` for a minimal implementation and in `extended_watermark_processor.py` for the more full-featured implementation (recommended).
The `demo_watermark.py` script implements a gradio demo interface as well as a minimal working example in the `main` function, using the minimal version.
Details about the parameters and the detection outputs are provided in the gradio app markdown blocks as well as the argparse definition.
The `homoglyphs.py` and `normalizers.py` modules implement algorithms used by the `WatermarkDetector`. `homoglyphs.py` (and its raw data in `homoglyph_data`) is an updated version of the homoglyph code from the deprecated package described here: https://github.com/life4/homoglyphs.
The `experiments` directory contains pipeline code that we used to run the original experiments in the paper. However, this is stale/deprecated in favor of the implementation in `watermark_processor.py`.
### Demo Usage

As a quickstart, the app can be launched with default args (or deployed to a [huggingface Space](https://huggingface.co/spaces)) using `app.py`,
which is just a thin wrapper around the demo script.
```sh
python app.py
gradio app.py # for hot reloading
# or
python demo_watermark.py --model_name_or_path facebook/opt-6.7b
```
### How to Watermark - A short guide on watermark hyperparameters

What watermark hyperparameters are optimal for your task, or for a comparison to new watermarks? We provide a brief overview of all important settings below, along with best practices for future work. This guide represents our current understanding of optimal settings as of August 2023, and so is a bit more up to date than our ICML 2023 conference paper. A short standalone sketch after the list below makes the key quantities concrete.

**TL;DR**: As a baseline generation setting, we suggest default values of `gamma=0.25` and `delta=2.0`. Reduce delta if text quality is negatively impacted. For the context width, h, we recommend a moderate value, e.g., h=4, and as a default PRF we recommend `selfhash`, though `minhash` is also a reasonable choice. Reduce h if more robustness against edits is required. Note, however, that the choice of PRF only matters if h>1. The recommended PRF and context width can be easily selected by instantiating the watermark processor and detector with `seeding_scheme="selfhash"` (a shorthand for `seeding_scheme="ff-anchored_minhash_prf-4-True-15485863"`, but do use a different base key if actually deploying). For detection, always run with `--ignore-repeated-ngrams=True`.
1) **Logit bias delta**: The magnitude of delta determines the strength of the watermark. A sufficiently large value of delta recovers a "hard" watermark that encodes 1 bit of information at every token, but this is not an advisable setting, as it strongly affects model quality. A moderate delta in the range of [0.5, 2.0] is appropriate for normal use cases, but the strength of delta is relative to the entropy of the output distribution. Models that are overconfident, such as instruction-tuned models, may benefit from choosing a larger delta value. With non-infinite delta values, the watermark strength is directly proportional to the (spike) entropy of the text and exp(delta) (see Theorem 4.2 in our paper).
2) **Context width h**: Context width is the length of the context which is taken into account when seeding the watermark at each location. The longer the context, the "more random" the red/green list partitions are, and the less detectable the watermark is. For private watermarks, this implies that the watermark is harder to discover via brute-force (with an exponential increase in hardness with increasing context width h).
In the limit of a very long context width, we approach the "undetectable" setting of https://eprint.iacr.org/2023/763. However, the longer the context width, the more fragile the watermark is: robustness to paraphrasing and other attacks decreases. In the limit of h=0, the watermark is independent of local context and, as such, is minimally random, but maximally robust against edits (see https://arxiv.org/abs/2306.17439).
3) **Ignoring repeated ngrams**: The watermark is only pseudo-random based on the local context. Whenever local context repeats, this constitutes a violation of the assumption that the PRNG numbers used to seed the green/red partition operation are drawn i.i.d. (see Sec. 4 in our paper for details). For this reason, p-values for text with repeated n-grams (n-gram here meaning context + chosen token) will be misleading. As such, detection should be run with `--ignore-repeated-ngrams` set to `True`. An additional, detailed analysis of this effect can be found in http://arxiv.org/abs/2308.00113.
4) **Choice of pseudo-random-function** (PRF): This choice is only relevant if context width h>1 and determines the robustness of the hash of the context against edits. In our experiments we find "min"-hash PRFs to be the most performant in striking a balance between maximizing robustness and minimizing impact on text quality. In comparison to a PRF that depends on the entire context, this PRF only depends on a single, randomly chosen token from the context.
5) **Self-Hashing**: It is possible to extend the context width of the watermark onto the current token. This effectively extends the context width "for free" by one. The only downside is that this approach requires hashing all possible next tokens, applying the logit bias only to those tokens that, once included in the context, hash onto their own green list. This is slow in the way we implement it, because we use CUDA's pseudorandom number generator and a simple inner-loop implementation, but in principle it has a negligible cost compared to generating new tokens, if engineered for deployment. A generalized algorithm for self-hashing can be found as Alg. 1 in http://arxiv.org/abs/2306.04634.
6) **Gamma**: Gamma denotes the fraction of the vocabulary that will be in each green list. We find gamma=0.25 to be slightly better empirically, but this is a minor effect and reasonable values of gamma between 0.25 and 0.75 will lead to a reasonable watermark. Intuitively, a lower gamma makes it easier to achieve a fraction of green tokens sufficiently higher than gamma to reject the null hypothesis.
7) **Base Key**: Our watermark is salted with a small base key of 15485863 (the millionth prime). If you deploy this watermark, we do not advise re-using this key.
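To make the roles of gamma, delta, the context seed, and the detection statistic concrete, here is a small standalone sketch. It is *not* the repo's implementation: it hard-codes the simplest h=1 seeding (seed the RNG from the previous token and the base key) rather than the recommended `selfhash` scheme, and all names and the seed construction are illustrative. The z-statistic is the one-proportion test from the paper.

```python
import math
import torch

VOCAB_SIZE = 50_272   # e.g., the OPT vocabulary size
GAMMA, DELTA = 0.25, 2.0
BASE_KEY = 15485863   # use a different key if actually deploying

def green_list(prev_token: int) -> torch.Tensor:
    """Pseudo-randomly pick gamma*|V| green token ids, seeded by the context (h=1 sketch)."""
    rng = torch.Generator().manual_seed(BASE_KEY * prev_token)
    perm = torch.randperm(VOCAB_SIZE, generator=rng)
    return perm[: int(GAMMA * VOCAB_SIZE)]

def bias_logits(logits: torch.Tensor, prev_token: int) -> torch.Tensor:
    """Soft watermark: add delta to the logits of green tokens before sampling."""
    logits = logits.clone()
    logits[green_list(prev_token)] += DELTA
    return logits

def z_score(num_green: int, num_scored: int, gamma: float = GAMMA) -> float:
    """z = (|s|_G - gamma*T) / sqrt(T * gamma * (1 - gamma)), per the paper."""
    return (num_green - gamma * num_scored) / math.sqrt(num_scored * gamma * (1 - gamma))

# e.g., 74 green tokens out of T=200 scored tokens:
print(round(z_score(74, 200), 2))  # ~3.92, just under the z_threshold=4.0 used below
```

Note how a lower gamma shrinks the null expectation gamma*T, so the same green count produces a larger z; this is the intuition behind item 6 above.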
### How to use the watermark in your own code

Our implementation can be added into any huggingface generation pipeline as an additional `LogitsProcessor`; only the classes `WatermarkLogitsProcessor` and `WatermarkDetector` from the `extended_watermark_processor.py` file are required.

Example snippet to generate watermarked text:
```python
from transformers import LogitsProcessorList
from extended_watermark_processor import WatermarkLogitsProcessor

watermark_processor = WatermarkLogitsProcessor(vocab=list(tokenizer.get_vocab().values()),
                                               gamma=0.25,
                                               delta=2.0,
                                               seeding_scheme="selfhash") # equivalent to `ff-anchored_minhash_prf-4-True-15485863`
# Note:
# You can turn off self-hashing by setting the seeding scheme to `minhash`.

tokenized_input = tokenizer(input_text, return_tensors="pt").to(model.device)
# note that if the model is on cuda, then the input is on cuda
# and thus the watermarking rng is cuda-based.
# This is a different generator than the cpu-based rng in pytorch!

output_tokens = model.generate(**tokenized_input,
                               logits_processor=LogitsProcessorList([watermark_processor]))

# if decoder-only model, then we need to isolate the
# newly generated tokens as only those are watermarked, the input/prompt is not
output_tokens = output_tokens[:,tokenized_input["input_ids"].shape[-1]:]

output_text = tokenizer.batch_decode(output_tokens, skip_special_tokens=True)[0]
```
Example snippet to detect watermarked text:
```python
from extended_watermark_processor import WatermarkDetector

watermark_detector = WatermarkDetector(vocab=list(tokenizer.get_vocab().values()),
                                       gamma=0.25, # should match original setting
                                       seeding_scheme="selfhash", # should match original setting
                                       device=model.device, # must match the original rng device type
                                       tokenizer=tokenizer,
                                       z_threshold=4.0,
                                       normalizers=[],
                                       ignore_repeated_ngrams=True)

score_dict = watermark_detector.detect(output_text) # or any other text of interest to analyze
```
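The returned `score_dict` bundles the detection statistics. The key names below reflect our reading of the implementation and may differ across versions, so treat them as assumptions and verify against the source:

```python
# Key names assumed from the implementation; verify against your version.
print(score_dict.get("z_score"))     # test statistic
print(score_dict.get("p_value"))     # corresponding p-value
print(score_dict.get("prediction"))  # True if z_score exceeds z_threshold
```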
To recover the main settings of the experiments in the original work (for historical reasons), use the seeding scheme `simple_1` and set `ignore_repeated_ngrams=False` at detection time.
### Contributing

Suggestions and PRs welcome!