Consistent Stable Diffusion Characters Without Training: A Prompt Compilation Approach
No LoRA, no reference photos, no training. Just a description, a name to anchor the face in embedding space, and a little geometry borrowed from directional statistics.
If you have ever tried to build a recurring character in Stable Diffusion, you already know the problem.
You write a careful description. You get a face you like. Then you try to put that same person in a second scene, and a different face comes back. Close, sometimes. The same person, almost never. Change the pose, the lighting, the framing, and the character quietly becomes someone else.
This is one of the most common frustrations in AI image generation, and it is a genuine operational problem the moment you need the same invented person to appear across many images. I spent a while on it as a side project, and the path I ended up on turned out to be more interesting than the destination. It taught me a few concrete things about how Stable Diffusion actually reads a prompt. This article is about that path. The method I landed on is one I call "prompt compilation," and I will build up to it rather than drop it on you.
A note on what this is: a tinkerer's writeup, not a research paper. I am a hobbyist having fun with a bit of geometry, sharing what worked and what broke, not making formal claims.
A 90-second primer, if you are new to this
Stable Diffusion is a text-to-image model. You give it a prompt, a few words or a few sentences, and it produces an image that matches.
Here is the part that matters for everything below. The model does not read your words. It reads numbers. A component called the text encoder turns your prompt into a long list of numbers, a vector, and the image generator works from that vector, never from the letters you typed. Two sentences that mean the same thing land near each other in this number space. Two sentences that mean different things land far apart.
So, to keep things simple for now, you can picture every prompt as a single point in a very high-dimensional space. That sounds abstract, but it has one practical consequence the whole method rests on. If a prompt is a point, you can do geometry on it. You can measure distances, find neighbors, and combine points. Hold on to that idea, because it is the whole trick.
Why this is harder than it sounds
The standard answer to "I want a consistent character" is to train something.
You can train a LoRA on images of your character. You can do textual inversion. You can condition the model on a reference photo with an IP-Adapter. All of these work, and all of them share one cost: you need reference images of the very character you are trying to invent, and in most cases a training run on top. If the character does not exist yet, that is a chicken-and-egg problem. You cannot photograph someone who is not real.
What I wanted was narrower, and I think more honest to the medium: a consistent character from a description alone. No training, no reference images, just text. If the model already knows how to draw faces, the information is in there somewhere. The only question is how to address it reliably.
The idea: compile the prompt, do not just write it
Here is the keystone, and it is worth slowing down for.
A full prompt is a point in the encoder's number space. But individual words are points in that same space too. The word "viking" has a position. So does "auburn." So does any tag, any name, anything you can type. They all live in the same space as your whole sentence.
That means you can approximate a sentence with a combination of simpler, well-known words. Take a dictionary of tokens the model has clean, strong associations for, measure where your description lands, and find the small set of dictionary tokens whose weighted combination sits closest to it. The math for this is well understood (standard linear algebra, nothing exotic). The output is not your original prose. It is a short, weighted list of tokens that points at roughly the same place in the space.
I started calling this "prompt compilation," by analogy with a compiler. You write something readable, and a mechanical step turns it into something the machine responds to better. You are not inventing new vocabulary. You are navigating the vocabulary the model already understands, to find the tokens that sit where your description sits.
That is the tool. The rest of this article is what happened when I pointed it at the consistent-character problem, using a dictionary not of tags but of names.
The first thing I tried, and why it failed
There is a well-known trick in the Stable Diffusion community: certain names act like identity presets. Put a name in the prompt and you get a stable, recognizable face. So my first idea was the obvious one. If I can decompose a description into a weighted blend of names, the model should blend their faces. Half this person, half that person, average the two, get a new consistent face in between.
It failed, and it failed in an instructive way.
Across two model versions and two different name lists (one of real celebrities, one of 882 randomly generated names), every single blended prompt produced the same kind of result: not one merged face, but several different people sharing the frame. A group photo, not a portrait. The weights changed who was most prominent. They never merged anyone.
The reason is worth understanding, because it is a fact about how the model works, not a quirk of my code. Averaging vectors is meaningful in the number space. But the image generator does not consume one averaged vector. It consumes a sequence of word vectors, and an internal mechanism called cross-attention lets each named token pull its own region of the image toward itself. Two names means two attention targets, which means two people. The blend is real in the math and invisible in the picture. The two operations, averaging and rendering, simply do not commute.
I find this kind of result more useful than a success would have been, because it tells you where the wall is. You cannot blend identities by blending name vectors. So I stopped trying to.
Names are anchors
The failure came with a gift. When I tested names one at a time, including invented ones that correspond to no real person, the faces were remarkably stable. One made-up name gave me the same woman across every seed I tried. Another gave me the same man, with a consistent ethnicity, age, and bone structure, again and again.
This is not just folklore, although the community folklore is right about it. There is a paper, MagicNaming (arXiv:2412.14902), that studies exactly this. It shows that SDXL's text encoder has a structured "Name Space," a region where identity information is organized and separable from the rest of the prompt's meaning. A name carries a face. The model learned stable visual priors for names during training, even for names it never saw attached to a specific celebrity.
So the move was to stop blending and start casting. Use the compiler to find the single closest name for a description, and use only that one name as the identity. One name, one subject, no group photo. The compiler's job becomes "find me the face that best matches this description," and the chosen name is the casting decision.
That fixed consistency within a scene. It did not, on its own, fix consistency across scenes.
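The casting step can be sketched in a few lines. This is a minimal illustration, not the article's actual code: the three-dimensional vectors and the names `name_a`, `name_b`, `name_c` are made-up stand-ins for real text-encoder embeddings, and cosine similarity is assumed as the measure of "closest."

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cast_name(description_vec, name_vecs):
    """Return the (name, similarity) pair whose vector is closest in direction."""
    return max(
        ((name, cosine(description_vec, vec)) for name, vec in name_vecs.items()),
        key=lambda pair: pair[1],
    )

# Toy stand-ins (hypothetical): real encoder vectors have hundreds of dimensions.
name_vecs = {
    "name_a": [0.9, 0.1, 0.2],
    "name_b": [0.1, 0.8, 0.3],
    "name_c": [0.2, 0.2, 0.9],
}
description_vec = [0.7, 0.2, 0.4]

best_name, score = cast_name(description_vec, name_vecs)
print(best_name, round(score, 3))
```

Only the single winner is kept. Returning a ranked list would invite the multi-name blend that produced group photos earlier.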
Two layers: a name, then the leftovers
A single name holds a face steady across seeds in the same prompt. But push the character into very different contexts (a tight studio portrait, an outdoor full-body shot, a candid indoor scene) and a name alone often drifts. The gender slips, the age jumps, an attribute the description clearly stated just disappears. On a benchmark of ten characters run across multiple model checkpoints and several scene types, a name by itself held its identity in only two of the eighteen character-and-model pairings I could score cleanly. The name is necessary, not sufficient.
What helped was adding back a few of the description's own attributes alongside the name. Not the whole description (that brings back the noise the name was meant to replace), only the parts the name did not already capture. This is where prompt compilation earns its keep, because it can compute "the parts the name did not capture" instead of me guessing them by hand.
This is the two-layer method, and it is the heart of the approach.
Layer one is casting. Run the compiler on the full description against the name dictionary, take the single best name, output it at full strength. That is the identity anchor.
Layer two is the leftovers. Work out what the description still says after you account for what the name already encodes, and add those attributes back. The key design choice is where layer two gets its vocabulary. I do not pull from a generic tag dictionary. I build a small dictionary on the fly from the description itself, every one-, two-, and three-word phrase in the user's own text, and let the compiler pick which of those phrases best explain the leftover.
That choice has a property I like. If the candidate words are the user's own words, the method literally cannot substitute something else for what the user wrote. When I tested the generic-dictionary version instead, "auburn" got rounded off to "brown hair," "Mediterranean" turned into a different ethnicity entirely, and "curly hair" was dropped. The description-derived version cannot do that, by construction. It can only choose among the attributes you actually stated.
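Building that on-the-fly dictionary is simple enough to sketch. The version below is an assumption about the mechanics, not the original code: it lowercases the text, drops punctuation while keeping hyphenated words whole, and slides a window of one to three words across the result. Whether the real implementation filters words such as "with" is not stated, so this sketch keeps everything.

```python
import re

def phrase_candidates(description, max_words=3):
    """Every run of 1..max_words consecutive words, in first-seen order."""
    # Keep hyphenated tokens such as "30-year-old" as single words; drop commas.
    words = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", description.lower())
    phrases = [
        " ".join(words[i:i + n])
        for n in range(1, max_words + 1)
        for i in range(len(words) - n + 1)
    ]
    # dict.fromkeys removes duplicates while preserving order.
    return list(dict.fromkeys(phrases))

description = (
    "30-year-old woman with high cheekbones, green eyes, "
    "auburn hair, angular jaw, confident expression"
)
candidates = phrase_candidates(description)
print(len(candidates))
print("auburn hair" in candidates, "brown hair" in candidates)
```

Because the candidate list is closed over the user's own words, a phrase like "brown hair" can never be selected for a description that says "auburn hair": it is simply not in the list.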
A compiled character ends up looking like this: one name at full weight, then a handful of short phrases lifted straight from your own description and weighted by how much they add, then your scene context left untouched.
In: "30-year-old woman with high cheekbones, green eyes, auburn hair, angular jaw, confident expression"
Out:
(jennifer_taylor:1.00), (eyes auburn hair:1.30), (high cheekbones green:1.17), (jaw confident expression:1.00), (cheekbones green eyes:0.85), (30-year-old woman:0.58), close-up portrait, studio lighting
The anchor is one of the 882 randomly generated names. They are random first-and-last combinations, which is why they look so ordinary, and why a few, like this one, happen to collide with a real person by chance. And the phrases are a little rough, with words that overlap: they are lifted mechanically from your own description, not polished by hand. Rough, but faithful to what you wrote.
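For reference, assembling the final string from its parts might look like the sketch below. The function name `format_compiled_prompt` is hypothetical, and the ordering rule (anchor first at weight 1.00, phrases by descending weight, scene context last) is inferred from the example output above.

```python
def format_compiled_prompt(anchor, weighted_phrases, scene):
    """Build '(token:weight), ...' with the anchor first and phrases heaviest first."""
    parts = [f"({anchor}:1.00)"]
    ordered = sorted(weighted_phrases, key=lambda pw: pw[1], reverse=True)
    parts.extend(f"({phrase}:{weight:.2f})" for phrase, weight in ordered)
    parts.append(scene)  # scene context is passed through untouched
    return ", ".join(parts)

# Weights copied from the example above, deliberately given out of order.
phrases = [
    ("30-year-old woman", 0.58),
    ("jaw confident expression", 1.00),
    ("eyes auburn hair", 1.30),
    ("cheekbones green eyes", 0.85),
    ("high cheekbones green", 1.17),
]
compiled = format_compiled_prompt(
    "jennifer_taylor", phrases, "close-up portrait, studio lighting"
)
print(compiled)
```

Run as written, this reproduces the compiled prompt shown above character for character.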
Where the 0.5 comes from
Now the one piece of real math, because it is the part I understand best now and understood worst at the time.
"The leftover" sounds vague, so here is the concrete version. Every description and every name is a vector. To get what the name did not capture, you subtract the name's vector from the description's vector. The amount you subtract is a coefficient:
leftover = description - (0.5 × name)
The natural coefficient is 1.0. Subtract the whole name, keep what remains. That is what a textbook would tell you. In practice, 1.0 over-corrects badly: it strips out too much, the leftover collapses toward noise, and layer two comes back nearly empty. I found by trial that 0.5 worked, and for a long time that was the whole of my explanation. Half, because half worked. Not a satisfying answer.
The satisfying answer came from somewhere I did not expect. In May 2026 I sat in on a talk by Christophe Ley, professor of applied statistics at the University of Luxembourg, at the annual meeting of the Luxembourg Institute of Actuaries (ILAC). The subject was directional statistics, the branch of statistics for data that lives on a sphere rather than on a flat line. I was in the room for the actuarial side. The geometry is what followed me home.
Here is the connection, in plain terms. The encoder's raw vectors vary in length, but the moment you normalize them, which is what comparing directions amounts to, they live on the surface of a high-dimensional sphere rather than in open space. And they do not spread out evenly over that sphere. They bunch up in a narrow cone, a documented property of these embedding spaces, first described for language models. I checked it on my own embeddings: two random names sit at a cosine around 0.5, where a uniform spread would sit near zero. On a surface like that, ordinary subtraction is the wrong instinct, because subtracting the whole name vector shoves the leftover out of the cone and into an empty region of the sphere, where the nearest-word lookup turns to mush. That is exactly the over-correction I had seen at 1.0.
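The "cosine near zero versus cosine near 0.5" contrast is easy to reproduce on synthetic data. The sketch below does not use real embeddings: it compares an evenly spread Gaussian cloud with the same cloud pushed along one shared direction, which is one simple (assumed) way to fake a narrow cone. The dimension, sample size, and seed are arbitrary choices.

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_pairwise_cosine(vectors):
    """Average cosine over every unordered pair of vectors."""
    total, count = 0.0, 0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            total += cosine(vectors[i], vectors[j])
            count += 1
    return total / count

rng = random.Random(0)
dim, n = 256, 40

# Even spread: independent Gaussian coordinates, no preferred direction.
isotropic = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

# Synthetic cone: the same cloud plus one shared offset of comparable length.
shared = [rng.gauss(0, 1) for _ in range(dim)]
cone = [[x + s for x, s in zip(vec, shared)] for vec in isotropic]

iso_mean = mean_pairwise_cosine(isotropic)
cone_mean = mean_pairwise_cosine(cone)
print(round(iso_mean, 2), round(cone_mean, 2))
```

The first number should land close to zero and the second in the neighborhood of 0.5. Running the same `mean_pairwise_cosine` on real name embeddings is the check described above.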
The clean operation on a sphere is not full subtraction. It is projection: remove only the part of the description that points in the same direction as the name, no more and no less. In symbols, with both vectors normalized to unit length, the principled version replaces the 0.5 with the cosine of the angle between the two vectors:
leftover = description - (cos θ × name)
That cosine has an exact value: how closely the description lines up with its best-matching name. In my data it sits around 0.3. That is lower than the 0.5 I measured between two names, which makes sense, since a full description is a busier piece of text than a bare name and aligns less tightly with any single one. My empirical 0.5 therefore breaks down as projection (about 0.3) plus a small extra push away from the name (about 0.2). And the little extra push, it turns out, does real work. A name is a very loud signal inside the model. Nudging the leftover slightly against it forces the description's own attributes to assert themselves a bit harder. That is why 0.5 quietly beat the geometrically pure value in side-by-side tests, though only by a hair.
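The three coefficients can be compared side by side on toy vectors. This is a sketch under stated assumptions: both vectors are normalized to unit length before subtracting, and the four-dimensional values are invented so that their cosine comes out a little under 0.3. For unit vectors, what remains along the name after subtracting is exactly cos θ minus the coefficient.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def leftover(description, name, coefficient=None):
    """description - coefficient * name, computed on unit vectors.

    coefficient=None uses cos(theta), which is pure projection.
    """
    d, n = normalize(description), normalize(name)
    if coefficient is None:
        coefficient = dot(d, n)
    return [a - coefficient * b for a, b in zip(d, n)]

# Invented toy vectors (hypothetical), chosen so that cos(theta) is about 0.27.
description = [0.1, 0.9, 0.4, 0.1]
name = [0.8, 0.1, 0.1, 0.6]
n_hat = normalize(name)

along_projected = dot(leftover(description, name), n_hat)
along_half = dot(leftover(description, name, coefficient=0.5), n_hat)
along_full = dot(leftover(description, name, coefficient=1.0), n_hat)
print(along_projected, along_half, along_full)
```

Projection leaves nothing along the name, 0.5 leaves a small negative component (the "extra push"), and 1.0 leaves a large negative one, which matches the over-correction described above.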
None of this geometry is mine. Projection on a sphere and the narrow-cone behavior of text embeddings are both well-documented, and I am far from the first to apply spherical geometry to these models. Directional statistics is one minor tool in this story, not its headline. But it turned a vague knob into a precise statement, and I would not have reached for it without that talk. So, credit where it is due.
Does it actually work?
Yes, with the usual honesty about how I know.
I ran a controlled test: five characters, four scene contexts each, four random seeds per context, and two competing leftover strategies, with the name anchor held identical across them so that only the leftover step differed. That is 5 × 4 × 4 = 80 images per run; with two strategies, each generated alongside a plain baseline to score against, it comes to 320 images. Then I reviewed them in side-by-side grids and scored them automatically. This particular head-to-head was five characters deep, but it sits on top of a broader investigation that ran past two thousand images, ten characters, and three model checkpoints.
The description-derived two-layer method won sixteen of twenty head-to-head comparisons against the obvious alternative (pulling those same attributes from a generic tag dictionary instead of the user's own words). It drew four and lost none. The wins were not scattered: it won on exactly the characters whose descriptions held an attribute the generic dictionary handled badly (the auburn hair, the Mediterranean look, the curly hair), and it drew on the one character the generic dictionary happened to cover cleanly. The method wins where the alternative has gaps, and ties where it does not. That is a mechanism, not luck.
For a numeric handle, I used ArcFace, a standard face-recognition model, to measure how much a character's face stays the same across the four contexts. Here the comparison point is a different one: the older hand-written approach of typing the age and gender in myself. On that scale, the hand-written version sat around 0.25, and the two-layer compiled method around 0.42, a clear lift. I will be honest about the metric: face-recognition models are trained on real photographs, so using one as a "same person?" judge on synthetic faces is approximate. It is a useful signal, not a verdict.
One last check mattered to me. Everything above was on base SDXL. When I reran the whole thing on a popular community fine-tune of the same family, the method held, and the identity score actually rose, to about 0.55. So this is not a quirk of one checkpoint. It travels across the SDXL family.
Where this travels, and where it does not
The shape of the trick is general: one strong anchor, plus a few modifiers pulled out of the user's own description by a residual step. Anywhere you have that shape, the same two-layer idea should apply. A garment as the anchor, plus style details. An object as the anchor, plus material and finish. A setting as the anchor, plus mood and light.
Where I do not expect it to work is fully open scenes with no single dominant subject. "Two figures fighting on a bridge at sunset" has no one anchor to cast; every element is its own identity competing for the model's attention, which is the same cross-attention wall that killed multi-name blending in the first place. The method is good precisely when there is one thing that should stay the same and a few things that should vary around it.
A note on the method, for anyone who wants to reproduce it
A few pointers, without turning this into a tutorial. The core move is an old and well-studied one: approximate a target vector with a small, weighted combination of other vectors. The general name is sparse approximation, or sparse coding, and I solve it with non-negative least squares (NNLS). It is a few lines on top of a standard solver. Fancier options exist if you want them, orthogonal matching pursuit among others; I did not need them.
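To make the NNLS step concrete without pulling in a numerical library, here is a small stand-in solver. It is not the standard solver mentioned above: it uses plain projected gradient descent (take a gradient step, then clip negative weights to zero), which reaches the same answer on small problems. The atoms, labels, step size, and iteration count are all illustrative assumptions.

```python
def nnls_projected_gradient(atoms, target, steps=2000, lr=0.05):
    """Minimise ||sum_k w_k * atoms[k] - target||^2 subject to w_k >= 0."""
    weights = [0.0] * len(atoms)
    for _ in range(steps):
        # Current reconstruction and its error against the target.
        approx = [
            sum(w * atom[i] for w, atom in zip(weights, atoms))
            for i in range(len(target))
        ]
        error = [a - t for a, t in zip(approx, target)]
        # Gradient of the squared error with respect to each weight.
        grads = [2 * sum(e * x for e, x in zip(error, atom)) for atom in atoms]
        # Gradient step, then projection back onto the non-negative orthant.
        weights = [max(0.0, w - lr * g) for w, g in zip(weights, grads)]
    return weights

# Hypothetical three-token dictionary in a three-dimensional toy space.
labels = ["token_a", "token_b", "token_c"]
atoms = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 0.0, 1.0]]
target = [0.9, 0.3, -0.2]

weights = nnls_projected_gradient(atoms, target)
compiled = [(label, round(w, 2)) for label, w in zip(labels, weights) if w > 1e-6]
print(compiled)
```

On this toy input the first two tokens should each get a weight close to 0.5, while `token_c` is dropped: matching the target would need a negative weight on it, and the constraint forbids that. This is also where the sparsity comes from, since tokens that would need negative weights fall out of the list.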
There is one twist I find genuinely pretty. When you are choosing several pieces from a big dictionary, you do not just want the closest ones, you want a set that does not pile onto the same meaning. So the selection can reward diversity: a candidate that adds something new beats one that echoes what is already chosen. That idea has a name of its own, Maximal Marginal Relevance, and it stops the method from spending its whole budget on five ways of saying the same thing.
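A greedy version of Maximal Marginal Relevance fits in a short function. The sketch below follows the usual formulation (relevance to the target minus similarity to the closest item already chosen, balanced by a trade-off parameter). The two-dimensional vectors and the 0.5 trade-off are illustrative assumptions, not values from the experiments.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mmr_select(query, candidates, k=2, trade_off=0.5):
    """Greedily pick k labels, trading relevance against redundancy."""
    selected = []
    remaining = dict(candidates)
    while remaining and len(selected) < k:
        def score(label):
            relevance = cosine(query, remaining[label])
            # Similarity to the closest already-selected item (0 if none yet).
            redundancy = max(
                (cosine(remaining[label], candidates[s]) for s in selected),
                default=0.0,
            )
            return trade_off * relevance - (1 - trade_off) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected

# Toy vectors (hypothetical): two near-duplicate hair phrases and one eye phrase.
query = [1.0, 0.5]
candidates = {
    "red hair": [1.0, 0.4],
    "ginger hair": [1.0, 0.45],
    "green eyes": [0.3, 1.0],
}

diverse = mmr_select(query, candidates, k=2, trade_off=0.5)
greedy = mmr_select(query, candidates, k=2, trade_off=1.0)
print(diverse, greedy)
```

With the trade-off at 1.0 the function degenerates to plain nearest-neighbor ranking and picks both hair phrases. At 0.5 the second slot goes to the eye phrase instead, which is the behavior the paragraph above describes.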
The persona method leans on a lighter version of all this. Layer one simply takes the single closest name, with no set to diversify. And layer two has barely a dictionary to manage, because the candidates are your own words: the one-, two-, and three-word phrases of your description, and nothing else.
Token order is not cosmetic either, and it comes back to cross-attention. The model reads the prompt as a positional sequence, so the same pieces in a different order can give a different picture. I list them heaviest first. When I tried the alternatives, a "meaningful" ordering or a random one, they were worse, sometimes flipping the style or collapsing the image outright. The boring choice held up.
One honest caveat sits under all of this. The whole method works in the text encoder's space, lining up vectors before a single pixel exists, while the part of Stable Diffusion that actually paints the image, the U-Net, sits downstream and does not read those vectors quite the way my arithmetic assumes. So matching a description in embedding space is a cheap yet imperfect proxy for matching it in the picture itself; closing that gap properly would mean scoring against what the U-Net renders, which I have not done here.
What I take away from it
The consistent-character result is useful, but the thing I will keep is the prompt-compilation idea itself.
Treat a prompt as a point in the text encoder's space. Use a dictionary of things the model already knows well, plus a bit of geometry, to find the tokens that sit where your description sits. It works for tags. It works for names. It very likely works for other anchored vocabularies I have not tried yet. And as a bonus, it keeps surfacing real facts about how the model behaves: that names are identity anchors, that you cannot blend them, that the embedding space is a curved cone where projection beats subtraction. You do not see any of that while hand-tuning a prompt word by word. You see it when you start treating the prompt as something you can measure, and aim.