AILLMsMay 20, 2026schedule11 MIN READ

Learning How Context Windows Work Will Change How You Use AI Tools

Ravi Vuruturu

Ravi Vuruturu

Principal Architect & Director of AI

Learning How Context Windows Work Will Change How You Use AI Tools

Learning How Context Windows Work Will Change How You Use AI Tools

The Same Three Mistakes

In the last year I have watched most people using AI tools make the same three mistakes. Not the people you might expect. I mean engineers, analysts, executives, clients, friends. Smart people, frustrated for the same reason, tripping over the same missing concept.

The frustration shows up in a few flavors. One person tells me a chatbot "used to be better" or "is getting dumber." Another burns through cost or quota and cannot figure out why. And I keep meeting people who assume the tool remembers them from last week. None of those reactions are unreasonable. The behavior they are reacting to is real. They have just misread the cause.

In this post I want to name the three patterns I keep seeing, walk through the one idea that explains all of them, and then show what you can do differently starting tomorrow. The idea is not technical. Once you have it, the three patterns stop being mysterious.

Three Patterns I Keep Seeing

The endless thread

I keep meeting people who have one chat open for months. They started it back in January, kept replying inside it, and now it is their working thread for everything. At first the answers were sharp. A few weeks in, things felt off. By month three they tell me the model "got worse" or "is dumber than it used to be." Some of them suspect the company quietly downgraded the product.

The model did not get worse. The thread got longer. But the user is not thinking about the thread as a thing with size. They are thinking about it as a relationship that should be deepening over time. The assumption underneath is that the chat is a relationship. It is not. It is a transcript that gets longer with every reply, and the length itself is what changes the answers.

Pasting the same huge document every turn

The second pattern is more expensive. Someone has a 50-page policy document, or a long contract, or a product spec. They paste the whole thing into the chat. Good so far. Then on the next message they paste it again. And again. Every turn, the full document, re-attached.

From their point of view, they are being thorough. They want to make sure the model "has" the document. But every turn re-bills the entire attached content, because each message is its own request carrying its own payload. Cost climbs fast, and the user does not see it on the screen. Quality sometimes drops too, because the model is wading through the same repetition over and over instead of focusing on the new question. The assumption underneath is that "attached" means "remembered." It does not.

Expecting it to remember

The third pattern shows up in client conversations. Someone will say, with full confidence, "we already told it our company structure last week, so it knows." Or "I explained my role in a different chat, it should have that." They are speaking about the AI tool the way they would speak about a human assistant who has been onboarded.

The tool did not learn anything. Each new session is a fresh stranger. Whatever you told it last week lives in last week's thread, and that thread is not magically loaded into today's. The assumption underneath is that AI assistants accumulate knowledge across sessions the way a human teammate does. They do not, at least not by default. Some products do layer a memory feature on top, but the underlying model is still being briefed from scratch on every call. The memory is just text the product pastes in for you.

The Mental Model That Fixes All Three

Three small ideas, in order. Each one is plain on its own, and together they explain every pattern above.

LLMs are stateless

Every call to a large language model starts blank. There is no persistent memory sitting on the other end, waiting for you, remembering your tone and your project. The model itself does not know who you are between calls. Anything it appears to "know" about the current conversation had to be sent to it again, inside the current request.

The cleanest analogy I use is this. Imagine a brilliant consultant with no long-term memory. Inside a meeting they are sharp, fast, and useful. The moment the meeting ends, they forget it ever happened. If you want them to be useful next week, you have to brief them again from scratch. They are excellent inside the meeting and useless about last week. That is the model.

The context window is working memory, for one call only

The context window is the space the model has to think inside, for that one call. It has a hard size limit, measured in tokens (roughly chunks of words). Everything the model is allowed to see right now (your question, the conversation history, any pasted documents, any system instructions) lives inside that window.

When the window fills up, something has to give. Older content gets dropped, or summarized away, or pushed out. The model is not choosing to ignore the early part of your thread. It physically cannot see what no longer fits. Working memory has an edge, and once you cross it, you lose what was on the other side.

Every turn re-sends the entire conversation

This is the part most people miss. Turn 5 is not just turn 5. It is turns 1, 2, 3, 4, and 5, all packaged together and sent in one request. The model on turn 5 is not picking up where it left off. It is reading the whole thing again, from the top, every single time.

That is why long threads get expensive: each new reply has to carry everything before it. They also get slower, because more tokens in means more work. Quality starts to drift too. The window fills, early context competes with the new question, and pieces get dropped.

Imagine a chat where you asked a question on turn 1, got an answer on turn 2, asked a follow-up on turn 3, got an answer on turn 4, and now you ask something new on turn 5. From the model's point of view on turn 5, all of that history arrives in a single packet. Every turn is the first turn, just with more pages in front of it.

What the model receives on turn 5 of a chat

Principles

A few rules fall out of the mental model. None of them require new tools. They only require treating the context window the way it actually behaves.

Treat context as a resource you spend.

The window is not free, and it is not infinite. Every token you spend on context is a token the model cannot use on the answer. If you fill it with old conversation, restated background, and yesterday's documents, the model has less room to actually think about your current question. You are paying twice, once in money and once in quality.

Context as a resource you spend

Start fresh when the goal changes.

A new task deserves a new thread. Old context costs you money on every turn and competes with the new question for the model's attention. People resist this because the old chat feels valuable, like a relationship you have built up. It is not a relationship. It is a transcript, and yesterday's transcript is not helping you with today's problem.

Summarize before you continue.

When a thread gets long, do not just keep typing. Ask the model to summarize the state of the work, what has been decided, and what is still open. Paste that summary into a new thread and keep going. You compress what mattered and drop what did not. The new thread starts light and focused, with the useful context intact.

Stable context belongs outside the chat.

If you keep re-pasting the same five paragraphs into every conversation, that text should live somewhere persistent. A persistent system prompt, a custom assistant, a project workspace, or a retrieval system. The chat window is the wrong container for things that should not change between sessions. Set the stable context once, somewhere the product can reuse it, and stop paying the tax of typing it again.

The Playbook: What To Do Tomorrow

The principles above are easy to nod at. Here is what you actually do, starting with your next conversation.

One thread per task, not per relationship

Stop using a single chat for everything. New project, new thread. The "everything in one place" instinct is the same one that fills email inboxes until they collapse under their own weight. Threads scoped to a single task stay sharp and are easy to summarize when you are done. You are not abandoning context by closing a thread. You are choosing what to carry forward.

Summarize-and-restart when a thread gets long

When you can feel a thread getting slower or less useful, that is the signal. Ask the model to summarize what has been decided and what is still open. Read the summary, fix anything it got wrong, then paste it into a fresh thread. You keep the work and lose the drag. This habit alone fixes most of the "the model is getting dumber" complaints I hear.

Use the product's built-in surfaces for context you reuse

If you are pasting the same paragraphs about your company, your role, or your style preferences into every chat, that is configuration, not conversation. Most consumer AI products now have a place to set this once, whether that is a project workspace, a custom assistant, or a personal system prompt. The product handles the re-paste for you on every turn, and you stop spending attention on the boilerplate. If your tool offers a project or workspace feature, use that for stable context tied to one body of work. If it offers a persistent system prompt or a custom assistant, use that for context you want everywhere. Either way, the rule is the same: set it once, stop paying the tax every message.

Attach big documents once, reference them, do not re-paste

A 100-page document does not need to be in every message. Most tools let you attach a file once and ask follow-up questions against it without resending the file each turn. Use that pattern. If the tool you are using does not have it, that is a tool problem worth solving. Re-pasting big documents is one of the most expensive habits I see in everyday use.

When stable knowledge needs to live across sessions, that is a retrieval problem

If your company's policies, customer history, or product docs need to be available every time anyone asks a question, that information belongs in a retrieval system (often called RAG), not pasted into a chat. You may not be the person who builds that system, but knowing the line exists is what lets you ask for the right thing from your engineering or AI team.

A Note for Builders

If you are building on top of LLMs instead of just using them, the same mental model raises the stakes. Cost in an agent loop compounds because every step re-sends the accumulated context from every step before it. A 20-step agent loop is not 20 cheap calls. It is 20 calls where each one carries everything that came before, and the total token count grows quadratically. The bill at the end of a long agent run can shock a team that priced their feature off a single round trip.

Chunking strategy, retrieval design, and your summarization policy are not implementation details. They decide whether your system stays affordable at the scale you actually need. The model choice gets all the attention in demos, but the context strategy is what determines whether the product survives contact with real usage.

"Context engineering" is the actual job of anyone building on LLMs. What you put in the window and what you leave out. How you compress long state when a thread runs long, and where stable knowledge actually lives. Those are the real product decisions. The model name on the box is the easy part.

Once You See It

Once you understand context windows, you stop blaming the model and start designing your usage. The chat stops being a black box you complain about and becomes a small system you can shape. That is the difference between using AI tools and using them well.