A Case for Watermarking Generative Models

Laurence Liang

This essay is a work in progress. New paragraphs may emerge, and existing ones may evolve.

It is September 2025.

Anyone with reliable Internet access can generate text, audio, images, video, and code quickly and very cheaply.

There are some direct implications, including but not limited to the two questions below.

A first question that arises is: will these foundation models eventually replace human jobs for these respective tasks?

We can draw an interesting analogy to speculative decoding: at a very high level, a very large, costly model $L(X)$ periodically verifies the outputs of a small, cheap and fast model $P(X)$. Speculative decoding, while it may sound paradoxical at first in terms of cost and speed improvements, works in practice. We can apply the same analogy to generative models and humans - perhaps a human plays the role of the verifier $L(X)$, because humans are more sensitive to the energy they expend, while a generative model plays the role of $P(X)$, running faster and longer within reasonable bounds, with the human checking in at regular intervals that the model is doing its job correctly. To answer the first question, it's possible that foundation models will not replace human jobs so much as shift humans into the role of verifiers.
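To make the analogy concrete, here is a toy sketch of the draft/verify loop in speculative decoding. Everything here is a stand-in: `draft_model` and `target_model` are hypothetical toy functions rather than real model APIs, and the actual speed win comes from the target model verifying the whole drafted run in a single batched pass, which this sketch does not simulate.

```python
import random

random.seed(0)

def draft_model(context):
    """Cheap, fast proposer P(X): guesses the next token (toy heuristic)."""
    return (context[-1] + 1) % 50 if context else 0

def target_model(context):
    """Costly verifier L(X): the token we treat as ground truth (toy heuristic)."""
    nxt = (context[-1] + 1) % 50 if context else 0
    # Occasionally disagrees with the draft, forcing a correction.
    return nxt if random.random() < 0.9 else (nxt + 7) % 50

def speculative_decode(prompt, num_tokens, draft_len=4):
    """Draft `draft_len` tokens cheaply, then have the target model verify them."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < num_tokens:
        # 1) The small model drafts a short run of tokens.
        drafted, ctx = [], list(tokens)
        for _ in range(draft_len):
            t = draft_model(ctx)
            drafted.append(t)
            ctx.append(t)
        # 2) The large model checks each drafted token; on the first
        #    disagreement it substitutes its own token and the rest
        #    of the draft is thrown away.
        for t in drafted:
            verified = target_model(tokens)
            if verified == t:
                tokens.append(t)        # accept the cheap guess
            else:
                tokens.append(verified) # reject and correct
                break
    return tokens[len(prompt):len(prompt) + num_tokens]

print(speculative_decode([0], num_tokens=12))
```

The shape of the loop is the point: the cheap model does most of the generating, and the expensive one only intervenes when the draft goes off the rails - which is roughly the role I am suggesting humans could play.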

A second question that arises is: if foundation models are so accessible and cheap, is there still a demand for human-generated work?

Let's take the case of writing. Anecdotally, people can still tell when a blog post or essay has the "characteristics" of being generated by a language model - and people still yearn for meticulously crafted written works by human authors.

This may not hold universally. A simple counter-example is a hackathon environment, where participants simply want code that works, not necessarily hand-crafted code (though there was a time when hand-crafted code was commonplace).

However, assuming that the demand for LLM-free content is still generalizable to a variety of other tasks, how can we verify that an output (text, image, audio or other) is devoid of any LLM-generated content?

A complication is that language models are already in the wild. They are not sandboxed: text on the web that came from a language model carries no metadata linking it back to that model - the text, by itself, is free of provenance.

Logging every model out there may also prove difficult. Open weights models can run on any local machine, and the recipes for training them are public. Sure, maybe there are moats around the most sophisticated models, which only the select companies that can afford the compute are able to run. Perhaps the most capable frontier models can be fully sandboxed and logged. But even an "average" open weights model today can generate convincingly consistent output in a variety of modalities.

It would also be difficult to watermark every LLM out there, unless we could magically recall all of the existing open weights models. Or maybe we could hope that all future LLM releases will carry identifiable, jailbreak-safe watermarks in their outputs, and that these models will eventually phase out pre-watermark models by sheer volume. Though this is speculation.
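To make "identifiable watermarks in the outputs" concrete: one published family of token-level schemes, often described as "green list" watermarking, pseudorandomly splits the vocabulary at each generation step (seeded by the previous token) and biases sampling toward the "green" half; a detector that knows the seeding rule counts green tokens and checks whether the count is improbably high. Below is a minimal detector-side sketch under that assumption - the function names and constants are hypothetical, and a real system would operate on tokenizer IDs and model logits.

```python
import hashlib
import math

GREEN_FRACTION = 0.5  # fraction of the vocabulary marked "green" at each step

def is_green(prev_token: int, token: int) -> bool:
    """Pseudorandomly assign `token` to the green list, seeded by the previous token."""
    digest = hashlib.sha256(f"{prev_token}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < GREEN_FRACTION

def detect(tokens: list) -> float:
    """Return a z-score: how far the green-token count sits above chance.

    A generator that biased its sampling toward green tokens leaves a
    statistical trace; unwatermarked text should hover around z ~ 0."""
    n = len(tokens) - 1
    if n <= 0:
        return 0.0
    hits = sum(is_green(prev, tok) for prev, tok in zip(tokens, tokens[1:]))
    expected = GREEN_FRACTION * n
    std = math.sqrt(n * GREEN_FRACTION * (1 - GREEN_FRACTION))
    return (hits - expected) / std
```

The catch is that detection relies on the generator having cooperated in the first place: an open weights model that never applied the green-list bias scores like noise, which is exactly the recall problem above.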

Perhaps the following points are nothing more than speculation as well.

One approach is analogous to the tests in Blade Runner that assess whether a subject is human or android: a series of questions and answers.

Perhaps we would have to have a long conversation with a model to determine if it has a breaking point.

Or maybe we could just find one-shot cases that require a very long output. Suppose that for any generated output $o_k$ there exists a length $L_k$ such that, once $o_k$ grows longer than $L_k$, there emerges some content $s_k \in o_k$ that is identifiably from a language model.
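Written out (my formalization of the claim above, with $|o_k|$ denoting the length of the output):

$$
\forall\, o_k \;\; \exists\, L_k \;\text{ such that }\; |o_k| > L_k \;\Longrightarrow\; \exists\, s_k \in o_k \;\text{ that is identifiably model-generated.}
$$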

Proving the existence of such content $s_k$, and that a length $L_k$ exists for every single generated output $o_k$, may be challenging. Though if we assume that current open weights models cannot be put back in a sandbox, nor be superseded by future releases of watermarked foundation models, the $L_k$ approach may be a first remedy for identifying LLM-generated outputs.

Though I would love to be proven wrong.