
A Recipe for Steganogravy

With AI scrapers and government agencies roaming the internet and snarfing down every last byte (hoping to profit when you mistakenly publish useful information online), it’s gotten harder to share data without it becoming a future liability.

One wrong step and you find yourself accidentally contributing to automating your own job, having your identity stolen, or offending the kind of person who always seems to be complaining about other people being offended.

What if we could hide data in a place no one would ever think to look? What if we could submerge our delicious morsels of knowledge in a flavorless slop so devoid of nutritional value even the most ravenous AI agents would spit it out?

tbrockman/recipe-blog-encoding is a vibe-coded (and at least partially plagiarized[1]) Python CLI that lets you encode data as completely natural language[2] using neural linguistic steganography.

Given a shared prompt and a model, it can hide your secrets where they're least expected: recipe blog introductions. Any reader who knows the original prompt and model used can then recover the political messaging hidden in your favorite garlic butter chicken recipe. Just how grandma would have made it.

how it works

The implementation largely follows arithmetic coding steganography. At a high level, you can imagine the following:

1. We convert our secret into a binary fraction, which represents a point somewhere on the number line [0, 1).
2. We use the model's next-token probability distribution to carve out adjacent intervals on the line, where the width of each interval is proportional to that token's probability.
3. We repeatedly choose the token whose interval contains our point, narrowing the interval further and further, until enough of the leading bits of the interval's start and end points agree that our message is encoded.

Here's a simple example with a 3-bit secret, where the generated text ends up reading "This recipe uses".

Decoding is just the reverse: run the same model with the same prompt, reconstruct the probability distribution at each step, and read the secret bits back out by checking which tokens were chosen.
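The encode/decode round trip can be sketched with a toy, fixed next-token distribution standing in for the model (a real LM conditions on the context so far; all names here are illustrative, not the CLI's actual API). Exact `Fraction` arithmetic keeps the encoder's and decoder's intervals identical:

```python
from fractions import Fraction

# Hypothetical stand-in for a language model: the same next-token
# distribution at every step (a real LM would recompute this per context).
TOKENS = ["This", "recipe", "uses", "garlic", "butter"]
PROBS = [Fraction(2, 5), Fraction(1, 4), Fraction(3, 20), Fraction(1, 10), Fraction(1, 10)]

def bits_to_point(bits: str) -> Fraction:
    """Read a bit string as a binary fraction in [0, 1)."""
    return sum(Fraction(int(b), 2 ** (i + 1)) for i, b in enumerate(bits))

def encode(bits: str, n_tokens: int) -> list:
    """At each step, emit the token whose probability interval contains the secret point."""
    point = bits_to_point(bits)
    low, high = Fraction(0), Fraction(1)
    out = []
    for _ in range(n_tokens):
        cum = low
        for tok, p in zip(TOKENS, PROBS):
            width = (high - low) * p
            if cum <= point < cum + width:
                out.append(tok)
                low, high = cum, cum + width
                break
            cum += width
    return out

def decode(tokens: list, n_bits: int) -> str:
    """Re-carve the same intervals from the token choices, then read off the
    leading bits that every point in the final interval agrees on."""
    low, high = Fraction(0), Fraction(1)
    for tok in tokens:
        cum = low
        for t, p in zip(TOKENS, PROBS):
            width = (high - low) * p
            if t == tok:
                low, high = cum, cum + width
                break
            cum += width
    bits = ""
    while len(bits) < n_bits:
        if high <= Fraction(1, 2):
            bits += "0"
            low, high = 2 * low, 2 * high
        elif low >= Fraction(1, 2):
            bits += "1"
            low, high = 2 * low - 1, 2 * high - 1
        else:
            break  # interval straddles 1/2: more tokens needed to pin this bit down
    return bits

tokens = encode("101", n_tokens=3)
print(tokens)                    # → ['recipe', 'butter', 'This']
print(decode(tokens, n_bits=3))  # → 101
```

Note the `break` in `decode`: if the final interval still straddles a bit boundary, some secret bits stay ambiguous, which is exactly why more tokens are sometimes needed than a naive bits-per-token count suggests.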

It's important to note that both sides need the exact same model, quantization, top-k, and prompt: any mismatch and the distributions diverge, producing garbage.

limitations

it's pretty wasteful

You're loading massive models to encode and decode a small amount of information, slowly, at under 2-3 bits per token.

bpe tokenization

It turns out that if you pick a token during encoding, decode it to text, and then re-tokenize that text, you don't always get the same token back. For instance, if the text so far tokenizes to […, "hel"] and the model picks "lo" as the next token, the combined text "hello" might re-tokenize as the single token "hello" rather than "hel" + "lo".

Then, when decoding, the decoder sees a completely different token at that position, and everything after it diverges.

claude's fix

Add a filter that, at each step, tests whether a candidate token would survive a round trip through decoding and re-tokenization.

Tokens that wouldn’t are excluded from the CDF before any interval math happens. You lose some encoding capacity, but you can be certain that if your message can be encoded, it can also be decoded.
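The round-trip check can be illustrated with a toy longest-match tokenizer over a tiny hypothetical vocabulary (the function names and vocabulary are made up for illustration, not taken from the repo):

```python
# Toy longest-match tokenizer illustrating the BPE round-trip problem.
VOCAB = ["hello", "hel", "lo", " ", "world"]

def tokenize(text: str) -> list:
    """Greedy longest-match tokenization (assumes every prefix matches some
    vocab entry, which holds for this toy example)."""
    tokens = []
    while text:
        match = max((v for v in VOCAB if text.startswith(v)), key=len)
        tokens.append(match)
        text = text[len(match):]
    return tokens

def survives_roundtrip(context_tokens: list, candidate: str) -> bool:
    """Would appending `candidate` still re-tokenize to the same token sequence?"""
    proposed = context_tokens + [candidate]
    return tokenize("".join(proposed)) == proposed

# "hel" + "lo" re-tokenizes as the single token "hello", so "lo" gets filtered:
print(survives_roundtrip(["hel"], "lo"))   # → False
print(survives_roundtrip(["hello"], " "))  # → True
```

Any candidate failing this check is the kind of token the fix drops from the CDF before the interval math, trading a little capacity for a decodable message.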

model end-of-sequence can be reached before the secret is fully encoded

question: What do we do if the prompt we've chosen doesn't provide a path to generate sufficient tokens to encode our secret, converging on end-of-sequence before giving us enough bits?

answer: 🤷 choose a better prompt and try again.

The prompt acts as a shared key, but a leaky one: the generated text is statistically conditioned on the prompt, so the output partially reveals the prompt itself (generally not considered an ideal property for an encryption scheme).

threat model: passing a note to your friend about which girl you like in class, through an untrusted intermediary

local LLM only

It's not that remote APIs can't work (they should, so long as they provide sufficient determinism and logits); local is just all that was implemented.

Available on Google Colab or from source.

have fun cooking ✌️
