50 First Prompts

TL;DR: LLMs do not remember anything between calls. Every “conversation” you’ve ever had with one was reconstructed from scratch by replaying history into the context window. If your architecture treats memory like a feature you turn on, you will pay for it twice: once in token spend, and once in the slow erosion of consistency that has your users playing Henry Roth, re-establishing context every morning so Lucy can function. And yes, I often use humorous analogies, so please subscribe or follow (or un-) according to your tastes.


If you have not seen 50 First Dates, the premise is that Lucy Whitmore (Drew Barrymore) wakes up every day with no memory of anything that happened the day before, and Henry Roth (Adam Sandler) has to remind her of their entire relationship, every morning, forever. Sweet movie. Terrible AI pattern (in most cases).

True story: when I went to see this one in the theater, the projector died about twenty minutes in. It was weeks before we made it back to finish it, and the second viewing had this faint déjà vu quality, the film meeting me halfway while I reconstructed the rest from partial memory. That kind of reconstruction is something humans do automatically (if unreliably) and LLMs can’t do at all, at least not on their own.

The movie plot is also a reasonable analogy for how a Large Language Model works under the hood. The LLM is Lucy. Every developer who builds on top of it is Henry. Every API call is the first call. Every conversation is reconstructed from a transcript that the application hands the model on the way in. The model itself remembers nothing. The illusion of continuity is something your application is doing on its behalf, on every turn, at your expense.

Most teams do not build for this. They build as if “the AI” remembers things, get surprised when it doesn’t, bolt on a memory layer that gets tested as if it were deterministic automation, and then watch their token bill quietly compound. We’ve all heard horror stories about this happening. It’s part of why enterprises prefer vendor tools and outside consultants, which is a good way to get up and running but carries its own costs if the relationship isn’t built on trust and reciprocal ROI.

The Architecture Reality Behind the Humorous Analogy

LLMs are stateless. Full stop. The model is a function: tokens in, tokens out. Whatever “memory” you experience in ChatGPT, Claude, Gemini, or your own agent is some other system managing the flow of prior context back into the prompt before the model sees it.

This has three implications that drive everything else:

First, there is no “the conversation.” There is a transcript that gets re-sent every turn. The model is not pulling up your last message; you are handing it back, every time. (A minimal sketch of that loop follows the third implication below.)

Second, the context window is the entire universe of what the model knows in that moment. Anything not in that window does not exist. Anything in that window is being paid for, in tokens, on every single call.

Third, “memory” in vendor marketing rarely means one thing. It is a category that includes at least five different mechanisms with different costs, different failure modes, and different retrieval semantics. Conflating them is how you end up with an expensive system that still forgets the user’s name. There are, however, better ways.
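
To make the first implication concrete, here is a minimal sketch of the replay loop every chat application runs. The call_model stub is a placeholder for whatever completion API you actually use; the point is that the history lives in the application, gets rebuilt into the request on every turn, and is forgotten by the model the moment it responds.

```python
# Minimal replay loop. The application owns the transcript; the model never does.
history: list[dict] = []

def call_model(messages: list[dict]) -> str:
    # Stand-in for a real completion API (OpenAI, Anthropic, etc.).
    return f"(model reply, given {len(messages)} messages of context)"

def send(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    reply = call_model(history)  # the ENTIRE transcript goes in, every turn
    history.append({"role": "assistant", "content": reply})
    return reply                 # the model has already "forgotten" all of it

send("My name is Lucy.")
print(send("What is my name?"))  # only answerable because we replayed turn one
```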

Memory Is a Marketing Word

When a vendor or framework says “memory,” they could mean any of the following, and the differences matter:

Conversation history replay. The full transcript, prepended on every call. Simple, perfect recall, terrible cost curve: per-call cost grows linearly with the turn count (and cumulative spend quadratically) until it crashes into your context limit.

Running summary. A compacted version of the transcript, regenerated periodically. Cheaper, lossy, drifts over time. The model is now reading its own paraphrase of what happened, with all the small infidelities that implies.

Vector retrieval (RAG over chat history). Past turns are embedded and indexed; only relevant snippets get pulled into the next prompt. Cheap, scalable, but only as good as your embeddings and your retrieval thresholds. It will confidently fail to surface the one thing the user expected it to remember.

Structured profile / entity store. Key-value or graph storage of facts about the user, product, or domain (“user’s tone preference: dry,” “preferred billing currency: USD”). Cheap to read, easy to audit, but only as good as the extraction logic that populates it.

Procedural / skill memory. Instructions, playbooks, or skills the agent loads on demand. Closer to “here is how we do things here” than “here is what you said yesterday.” Different beast entirely.

A reliable and practical AI memory architecture uses several of these in combination. A bad one picks one and pretends it covers everything. If your team is having an argument about “should we add memory,” the real argument is which of these five you are talking about, why it is the best choice for a given context, and when that context (and therefore the best option) changes.
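
To make “in combination” concrete, here is a hedged sketch of per-turn prompt assembly with the tiers kept separate. The in-memory dicts stand in for real profile, summary, and vector stores, and every name here is hypothetical; the shape of the idea is that only the new user turn goes in verbatim, while everything else is a cheap read from a purpose-built tier.

```python
# Hypothetical per-turn assembly from separate memory tiers.
# The dicts stand in for real profile, summary, and vector stores.
PROFILES = {"u1": "preferred tone: dry; billing currency: USD"}
SUMMARIES = {"u1": "User is migrating billing off Stripe; decision pending on Adyen."}
SNIPPETS = {"u1": ["2024-05-02: user said invoices must stay in USD."]}

def build_context(user_id: str, user_message: str) -> list[dict]:
    system = "\n\n".join([
        "You are the product assistant.",
        f"Known user facts:\n{PROFILES.get(user_id, '(none)')}",   # structured profile
        f"History summary:\n{SUMMARIES.get(user_id, '(none)')}",   # running summary
        "Possibly relevant past turns:\n"
        + "\n".join(SNIPPETS.get(user_id, ["(none)"])),            # vector retrieval
    ])
    return [{"role": "system", "content": system},
            {"role": "user", "content": user_message}]  # only this is verbatim

print(build_context("u1", "Which currency will the new invoices use?"))
```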

What Lost in the Middle Actually Costs You

Even if you stuff the entire history into the context window, you do not get what you think you are paying for. Liu et al. at Stanford published Lost in the Middle: How Language Models Use Long Contexts in 2023, and the finding has been replicated enough times that it should be a load-bearing assumption in any architecture: model attention is not uniform across the context window. Information at the beginning and end gets used. Information in the middle gets quietly ignored, even by models that advertise long-context support.

So the naive “just give it the whole history” approach is doubly bad: you pay for every token, the model uses some of them less than others, and you have no easy way to tell which.

This is one of the reasons selective retrieval beats full replay almost everywhere. You are not just saving tokens. You are putting the relevant tokens in positions where the model will actually use them.

The Token Bill (Yes, Again)

Here is the part that gets glossed over in the demos.

Every token in your context window is paid for, every turn. If your “memory” is “we keep prepending the full conversation,” then by turn 50 you have paid for turn one’s tokens fifty times, turn two’s forty-nine times, and so on down the line, while the model works harder to find the signal each time. This is the closest thing to a structural cost trap in LLM architecture, and it is almost always invisible in development because nobody runs 50-turn conversations against the dev key.
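
The arithmetic is worth seeing once. Assuming a flat 200 tokens per turn (an illustrative number, not a benchmark), cumulative input spend under full replay grows quadratically:

```python
# Back-of-envelope cost of full-history replay, assuming ~200 tokens per turn.
TOKENS_PER_TURN = 200

def cumulative_input_tokens(turns: int) -> int:
    # Turn n re-sends all n turns so far: 200 * n input tokens.
    return sum(TOKENS_PER_TURN * n for n in range(1, turns + 1))

print(cumulative_input_tokens(10))  # 11,000 input tokens
print(cumulative_input_tokens(50))  # 255,000: five times the turns, ~23x the spend
```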

Anthropic’s prompt caching, introduced in August 2024, helps for the parts of your context that genuinely repeat (system prompts, fixed instructions, large reference documents): cached read tokens cost about 10% of the standard input price. That is real money saved on the parts that don’t change. But caching is not memory. It does not summarize, retrieve, or forget. It just makes paying for the same prefix cheaper. Use it where it fits, but do not let “we turned on caching” stand in for an actual memory strategy.
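
For reference, here is roughly what caching looks like with the Anthropic Python SDK. Treat the specifics (model name, minimum cacheable prefix size, cache lifetime) as moving targets and check the current docs; the shape, marking the large unchanging block with cache_control, is the part that matters.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
REFERENCE_DOC = open("policies.md").read()  # large and identical on every call

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # illustrative; use a current model
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": REFERENCE_DOC,
        "cache_control": {"type": "ephemeral"},  # cache everything up to here
    }],
    messages=[{"role": "user", "content": "What does the refund policy say?"}],
)
print(response.content[0].text)
```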

Memory architecture is cost architecture. They are the same conversation. Any team treating them separately is going to be surprised by one of them.

Patterns That Actually Earn Their Keep

A few that hold up in production (as of this writing, a caveat I’m guilty of not always stating, and one you should mentally append to everything you read about AI):

Hierarchical / paged memory. MemGPT (Packer et al., 2023) is the canonical paper here: a small “main context” of hot facts plus a larger “external context” the model can page in and out, modeled on operating-system virtual memory. Even if you never use the framework (now continued as Letta), the mental model is the right one. Most context is cold most of the time. Stop paying to keep it warm.
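
A toy sketch of the paging idea follows. In MemGPT proper, the model itself requests page-ins and page-outs via tool calls; here a simple keyword match stands in for that decision, and everything else is hypothetical scaffolding.

```python
# Toy MemGPT-style paging: small hot context, larger cold store.
HOT_LIMIT = 4

hot: list[str] = ["user prefers dry tone", "current task: billing migration"]
cold: dict[str, str] = {
    "stripe": "2024-03: decided to leave Stripe over fee structure",
    "adyen": "2024-04: Adyen shortlisted as the replacement processor",
}

def page_for(user_message: str) -> list[str]:
    # Pull in cold facts the new message touches; MemGPT lets the model do this.
    for key, fact in cold.items():
        if key in user_message.lower() and fact not in hot:
            hot.append(fact)
    while len(hot) > HOT_LIMIT:
        hot.pop(0)  # evict oldest; a real system summarizes before dropping
    return hot

print(page_for("Where did we land on Adyen pricing?"))
```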

Compaction at boundaries. Summarize aggressively at natural breakpoints (session end, topic change, day rollover). Throw away the verbatim transcript once the structured summary is written. Track what got compacted so you can audit later if a user complains the model “forgot.”
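
A sketch of what that looks like, assuming a summarize() that is itself a cheap model call in production (it is stubbed here), plus an append-only audit log for the inevitable “it forgot” ticket:

```python
import json, time

def summarize(transcript: list[dict]) -> str:
    # Stub: in production this is a cheap LLM call with a summarization prompt.
    return f"{len(transcript)} turns compacted; decisions and open items extracted."

def compact_session(user_id: str, transcript: list[dict], summaries: dict) -> None:
    summaries[user_id] = summarize(transcript)      # what future turns will read
    audit = {"user": user_id, "turns": len(transcript), "at": time.time()}
    with open("compaction_audit.jsonl", "a") as f:  # evidence for later complaints
        f.write(json.dumps(audit) + "\n")
    transcript.clear()                              # verbatim transcript is gone
```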

Structured extraction over raw recall. Pull stable facts (preferences, identifiers, decisions) out of conversation into a structured store. Read those on every turn. Let the conversational history age out. The user’s preferred tone of voice does not need to live in 12,000 tokens of transcript.
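
A deliberately tiny sketch: in production the extractor is usually a small schema-constrained model call, but a regex makes the shape visible. The stable fact lands in a store that costs a few tokens to read, not thousands.

```python
import re

profile: dict[str, str] = {}

def extract_facts(user_message: str) -> None:
    # Toy rule standing in for schema-driven extraction: "call me X".
    m = re.search(r"\bcall me (\w+)", user_message, re.IGNORECASE)
    if m:
        profile["preferred_name"] = m.group(1)

extract_facts("Please call me Sam, and keep the summaries short.")
print(profile)  # {'preferred_name': 'Sam'}
```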

Retrieval over replay. Index past turns, retrieve only what is relevant to the current input, accept the occasional miss as a cost of doing business. Tune your retrieval thresholds with the same seriousness you tune any other production query.
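
Here is the thresholding logic in miniature. A real system uses an embedding model and a vector index; a bag-of-words cosine stands in here so the tunable parts (k and the score threshold) are the visible ones.

```python
from collections import Counter
from math import sqrt

past_turns = [
    "My invoices must stay in USD.",
    "The dashboard export keeps timing out.",
]

def similarity(a: str, b: str) -> float:
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 2, threshold: float = 0.2) -> list[str]:
    scored = sorted(((similarity(query, t), t) for t in past_turns), reverse=True)
    return [t for score, t in scored[:k] if score >= threshold]  # tune like any query

print(retrieve("Which currency do my invoices use?"))  # -> ['My invoices must stay in USD.']
```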

Skills and procedural memory as a separate tier. “How we do things” is not the same as “what we said.” Keep them in separate stores with separate update rules. Skills change rarely; episodic facts change constantly.
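
Separating the tiers can be as mundane as two stores with different write paths, as in this hypothetical sketch:

```python
# Hypothetical split: skills change via review; episodic facts change constantly.
skills = {"refunds": "Check order age, then apply policy v3 before issuing."}
episodic = {"u1": {"last_topic": "billing migration"}}

def update_skill(name: str, text: str, approved_by: str) -> None:
    assert approved_by, "skill edits go through review, not live conversation"
    skills[name] = text

def update_fact(user_id: str, key: str, value: str) -> None:
    episodic.setdefault(user_id, {})[key] = value  # cheap, automatic, per turn
```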

A Practical Framework

Four scenarios, four answers:

A user opens the same chat tomorrow and expects continuity: structured profile plus retrieval over summarized history. Do not replay the full transcript.

An agent loops on a long-running task: hierarchical memory with compaction at step boundaries. Hot working set stays small; cold context pages out.

A system prompt or large reference document is reused on every call: prompt caching. Cheap, easy, do it today.

A model needs to “know how we do things”: procedural / skill memory in its own tier. Keep it separate from episodic memory so updating one doesn’t disturb the other.

The wrong answer in all four cases is “just send the whole history.” That is the architecture equivalent of walking Lucy through the entire relationship from scratch, every morning, in hopes that this time some of it sticks. Romantic in the movie. Expensive in production.

Paddling off into the Sunset

The model forgets. That is not a bug; it is the current state of the art. The work is in deciding what your application remembers, where it stores it, when it retrieves it, and what it costs you per turn. Treat memory as architecture and most of the surprises go away.


Sources:

Liu et al., “Lost in the Middle: How Language Models Use Long Contexts,” 2023. arXiv:2307.03172.
Packer et al., “MemGPT: Towards LLMs as Operating Systems,” 2023. arXiv:2310.13560.
Anthropic, “Prompt caching with Claude,” August 2024.

If you found this interesting, please share.

© Scott S. Nelson
