
memory-runtime: bounding LLM context without a database

June 13, 2026 · By Debrev Team
✦ AI Overview

memory-runtime is the stateless context engine behind Debrev Interview's AI chat, now open-sourced under MIT. Instead of replaying a growing chat history into every prompt, it runs a deterministic ingest → compile → observe loop: artifacts and messages go in, a budget-aware compiler emits a message array that never exceeds a token ceiling, and the assistant's reply is parsed back into structured state. The entire session lives in a JSON snapshot the application owns: no SQLite, no filesystem, no server affinity. In a 50-turn benchmark the naive prompt grew to 19,888 tokens while the compiled prompt stayed under its 2,000-token budget, an 82.4% average reduction, and a fact pinned at turn 1 was still present at turn 50. In a separate test, a fact stated at turn 20 was surfaced at turn 45 in answer to a paraphrased question that shared no keywords with it. Compilation is deterministic, and compact snapshots are about 87% smaller than full ones.

Every LLM chat application has the same leak. To answer turn 50, you replay turns 1 through 49 into the prompt — the messages, the code you pasted, the documents you attached. The context grows linearly, the cost grows with it, and eventually you hit the model's window and start dropping the oldest content, which is often exactly the constraint someone set at the start of the conversation.

The usual fixes each have a tax. A sliding window is cheap but forgets. LLM summarization is lossy, non-deterministic, and costs an extra model call per turn. A vector database retrieves well but adds infrastructure, embedding latency, and a service to operate.

memory-runtime is the path we took for Debrev Interview, and we've open-sourced it. It is a small, dependency-free TypeScript SDK that treats context assembly as a deterministic compilation problem against a fixed token budget — and keeps zero state of its own.

The model: ingest → compile → observe

The runtime has three verbs and one data structure. You ingest artifacts (code snippets, document chunks, diffs) and messages. You compile them into a message array that fits a token budget. You observe the assistant's response to extract structured state. Everything the session knows is captured in a serializable snapshot that you — not the library — store and pass back on the next turn.

[Figure: A stateless context loop. The session holds no state of its own; it round-trips as a JSON snapshot the application owns. Stages: ingest (snippets, docs, diffs, messages, pinned) → compile (select by priority, drop or truncate to budget) → observe (extract decisions, constraints, glossary). The prompt goes out to your LLM provider and the reply comes back in. The snapshot (state + artifacts + events) is JSON you store anywhere: DB, Redis, request payload.]
The library never touches the filesystem. Every compile and observe returns an updated snapshot; persistence is the application's choice, which makes the whole thing serverless-native.

In code, the loop is about as long as its description:

import { createSession } from "memory-runtime"

// Restore from last turn's snapshot, or start fresh
const session = createSession({ snapshot: previousSnapshot })

session.ingest({
  type: "snippet",
  payload: { source: "auth.ts", content: code, pinned: true },
})

const { messages, snapshot } = session.compile({
  userMessage: "Explain the auth flow",
  budgetTokens: 2000,
})

const reply = await llm.chat(messages)   // your provider, your model

const { snapshot: next } = session.observe({ assistantText: reply })
await store.save(userId, next)           // your storage

The key property: compile() is a pure function of the snapshot and the inputs. Same snapshot, same message → byte-identical output, every time. No hidden state, nothing to migrate, nothing to mock in tests.
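One way to see what byte-identical output demands is a toy compiler. The sketch below is not memory-runtime's implementation. The names `compileSketch` and `byPriorityThenId` and the `Artifact` shape are hypothetical. It only illustrates two ingredients that purity generally requires: a total ordering with an explicit tie-breaker, and no reads from the clock, randomness, or globals.

```typescript
interface Artifact {
  id: string;
  priority: number;
  content: string;
}

interface Message {
  role: "system" | "user";
  content: string;
}

// Total order: higher priority first, then id ascending as a tie-breaker.
// Plain code-unit comparison is used instead of localeCompare so the result
// does not depend on the host's locale.
function byPriorityThenId(a: Artifact, b: Artifact): number {
  if (a.priority !== b.priority) return b.priority - a.priority;
  return a.id < b.id ? -1 : a.id > b.id ? 1 : 0;
}

// Pure: reads only its arguments, never the clock, randomness, or globals.
function compileSketch(artifacts: readonly Artifact[], userMessage: string): Message[] {
  const evidence = [...artifacts]
    .sort(byPriorityThenId)
    .map((a) => `[${a.id}] ${a.content}`)
    .join("\n");
  return [
    { role: "system", content: evidence },
    { role: "user", content: userMessage },
  ];
}

const sampleArtifacts: Artifact[] = [
  { id: "b", priority: 1, content: "session.ts" },
  { id: "a", priority: 1, content: "auth.ts" },
  { id: "c", priority: 5, content: "system rules" },
];

const firstRun = JSON.stringify(compileSketch(sampleArtifacts, "Explain the auth flow"));
const secondRun = JSON.stringify(compileSketch(sampleArtifacts, "Explain the auth flow"));
console.log(firstRun === secondRun); // true
```

In this sketch the tie-breaker on `id` is what keeps two equal-priority artifacts from swapping places between runs or between insertion orders. Whether the library orders evidence this way is not stated in the post.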

Why stateless matters

The original version of this engine kept a SQLite database. It worked on a laptop and fell over the moment we deployed to serverless functions, where there is no durable local disk and no guarantee two requests hit the same instance.

Making the snapshot the entire source of truth removed the problem class. The snapshot is plain JSON. You can put it in Postgres, in Redis, in an encrypted cookie, or simply pass it in the request and response body — which is exactly what Debrev Interview's chat routes do. There is no connection to pool, no migration to run, and no server affinity to engineer around. It also means the library has zero runtime dependencies, so it adds nothing to your attack surface or your bundle.
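As an illustration of the request-and-response-body option, here is a minimal sketch of a stateless turn handler. The `Snapshot`, `ChatRequest`, and `ChatResponse` shapes and the `handleTurn` function are hypothetical stand-ins, not the library's types or Debrev Interview's routes. The real snapshot shape is whatever memory-runtime serializes.

```typescript
// Hypothetical snapshot shape, for illustration only.
interface Snapshot {
  turn: number;
  decisions: string[];
}

interface ChatRequest {
  message: string;
  snapshot?: Snapshot; // absent on the first turn
}

interface ChatResponse {
  reply: string;
  snapshot: Snapshot; // the client sends this back on the next turn
}

// The handler keeps nothing between calls: everything it knows arrives in the
// request body, and everything it learned leaves in the response body.
function handleTurn(rawBody: string): string {
  const request = JSON.parse(rawBody) as ChatRequest;
  const previous: Snapshot = request.snapshot ?? { turn: 0, decisions: [] };

  // Placeholder for compile → LLM call → observe.
  const reply = `echo: ${request.message}`;
  const next: Snapshot = { turn: previous.turn + 1, decisions: [...previous.decisions] };

  const response: ChatResponse = { reply, snapshot: next };
  return JSON.stringify(response);
}

// Two turns that could be served by two different instances.
const turnOne = JSON.parse(handleTurn(JSON.stringify({ message: "hello" }))) as ChatResponse;
const turnTwo = JSON.parse(
  handleTurn(JSON.stringify({ message: "again", snapshot: turnOne.snapshot })),
) as ChatResponse;
console.log(turnTwo.snapshot.turn); // 2
```

The trade-off of this design is payload size: the snapshot travels on every request, so its size becomes a per-turn cost.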

Snapshots contain user content. Treat them as sensitive: encrypt at rest and in transit. The runtime gives you the bytes; protecting them is the application's job.
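For encryption at rest, one common option is authenticated encryption with Node's built-in `node:crypto` module. The sketch below uses AES-256-GCM. The helper names `encryptSnapshot` and `decryptSnapshot` and the iv-tag-ciphertext layout are assumptions made for illustration, not part of memory-runtime, and key management is out of scope here.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

const IV_BYTES = 12;  // recommended nonce size for GCM
const TAG_BYTES = 16; // default GCM authentication tag size

// Serializes the snapshot and seals it as base64(iv || tag || ciphertext).
// `key` must be 32 bytes for AES-256.
function encryptSnapshot(snapshot: unknown, key: Buffer): string {
  const iv = randomBytes(IV_BYTES); // fresh nonce for every encryption
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([
    cipher.update(JSON.stringify(snapshot), "utf8"),
    cipher.final(),
  ]);
  const tag = cipher.getAuthTag();
  return Buffer.concat([iv, tag, ciphertext]).toString("base64");
}

// Reverses encryptSnapshot. Throws if the payload was tampered with or the
// key is wrong, because GCM verifies the tag in final().
function decryptSnapshot(sealed: string, key: Buffer): unknown {
  const raw = Buffer.from(sealed, "base64");
  const iv = raw.subarray(0, IV_BYTES);
  const tag = raw.subarray(IV_BYTES, IV_BYTES + TAG_BYTES);
  const ciphertext = raw.subarray(IV_BYTES + TAG_BYTES);
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  const plaintext = Buffer.concat([decipher.update(ciphertext), decipher.final()]);
  return JSON.parse(plaintext.toString("utf8"));
}
```

A caller would generate or load a 32-byte key once (for example from a secrets manager), call `encryptSnapshot` before writing to storage or a cookie, and call `decryptSnapshot` before handing the snapshot back to the session.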

The headline: bounded context

Here is the behavior that motivates the whole design. We ran a 50-turn session — each turn ingests a fresh code snippet plus a message exchange — and measured two prompts side by side: the naive prompt (full history plus every artifact, what most apps send) versus the memory-runtime prompt compiled against a 2,000-token budget.

[Figure: Prompt size over a 50-turn session. Naive grows linearly; the compiled prompt holds flat under a 2,000-token budget. Horizontal axis: conversation turn (1 to 50). Vertical axis: prompt tokens (0 to 20k). Series: naive (full history + all artifacts), reaching 19,888 at turn 50; memory-runtime (budget = 2,000), compiled ≤ 1,999.]
By turn 50 the naive prompt carries 19,888 tokens and climbing. The compiled prompt never crosses its budget line: it averages 1,768 tokens across the run, an 82.4% reduction, and every turn stays at or below 1,999.

The naive line is a straight ramp to nowhere — it is bounded only by the model's context window, and when it hits that ceiling, content starts falling out the back uncontrolled. The compiled line is flat. That flatness is the product: predictable cost per turn, and predictable behavior at the boundary because you decide what fits, not the truncation logic of whatever happens to be oldest.

The budget is a hard ceiling, not a suggestion

The compiler's contract is that tokenEstimate never exceeds budgetTokens. Lower-priority evidence is dropped or truncated until the prompt fits. We ran the same 50-turn workload at three budgets to see the trade curve.
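The drop-or-truncate contract can be sketched as a greedy pass over evidence sorted by priority. This is a simplified illustration under stated assumptions, not the library's compiler: `compileToBudget`, the `Evidence` shape, the 4-characters-per-token estimate, and the minimum-fragment threshold are all hypothetical.

```typescript
interface Evidence {
  id: string;
  priority: number; // higher priority is considered first
  content: string;
}

interface CompiledPrompt {
  included: { id: string; content: string; truncated: boolean }[];
  dropped: string[];
  tokenEstimate: number;
}

// Assumption: a rough 4-characters-per-token estimate stands in for a tokenizer.
const CHARS_PER_TOKEN = 4;
const estimateTokens = (text: string): number => Math.ceil(text.length / CHARS_PER_TOKEN);

// Assumption: a truncated fragment smaller than this is not worth keeping.
const MIN_TRUNCATED_TOKENS = 16;

function compileToBudget(
  evidence: readonly Evidence[],
  userMessage: string,
  budgetTokens: number,
): CompiledPrompt {
  let remaining = budgetTokens - estimateTokens(userMessage);
  if (remaining < 0) throw new RangeError("user message alone exceeds the budget");

  // Deterministic order: priority descending, id ascending as tie-breaker.
  const ordered = [...evidence].sort(
    (a, b) => b.priority - a.priority || (a.id < b.id ? -1 : a.id > b.id ? 1 : 0),
  );

  const included: CompiledPrompt["included"] = [];
  const dropped: string[] = [];
  for (const item of ordered) {
    const cost = estimateTokens(item.content);
    if (cost <= remaining) {
      included.push({ id: item.id, content: item.content, truncated: false });
      remaining -= cost;
    } else if (remaining >= MIN_TRUNCATED_TOKENS) {
      // Keep the head of the item and spend what is left of the budget.
      const content = item.content.slice(0, remaining * CHARS_PER_TOKEN);
      included.push({ id: item.id, content, truncated: true });
      remaining -= estimateTokens(content);
    } else {
      dropped.push(item.id);
    }
  }
  // remaining never goes below zero, so the estimate never exceeds the budget.
  return { included, dropped, tokenEstimate: budgetTokens - remaining };
}
```

Because every branch either spends at most `remaining` tokens or spends nothing, the invariant `tokenEstimate <= budgetTokens` holds by construction in this sketch, rather than being checked after the fact.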

[Figure: Compiled size vs. budget. Average compiled tokens across 50 turns, against a naive average of 10,056. Budget 800: 609 (−93.9%). Budget 2,000: 1,768 (−82.4%). Budget 8,000: 6,252 (−37.8%). You pick the point on the curve.]
Across all three budgets the peak compiled prompt stayed within the ceiling (686, 1,995, and 7,992 tokens respectively), and the fact pinned at turn 1 survived to turn 50 in every case. The budget is a dial, not a guess.

A tight 800-token budget cuts 94% of the context and still keeps the pinned needle. An 8,000-token budget keeps far more detail and still trims 38%. The point is that the number is yours to set per surface — Debrev Interview uses 3,000 tokens for single-candidate chat and 8,000 for the multi-candidate overview — and the runtime honors it exactly.
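The percentages quoted for the three budgets follow from the averages reported above, as reduction = 1 − compiled / naive. A short check, with `reductionPercent` as a hypothetical helper name:

```typescript
// Percentage reduction from naive to compiled, rounded to one decimal place.
function reductionPercent(naiveTokens: number, compiledTokens: number): number {
  return Math.round((1 - compiledTokens / naiveTokens) * 1000) / 10;
}

const NAIVE_AVERAGE = 10_056;

console.log(reductionPercent(NAIVE_AVERAGE, 609));   // 93.9 (budget 800)
console.log(reductionPercent(NAIVE_AVERAGE, 1_768)); // 82.4 (budget 2,000)
console.log(reductionPercent(NAIVE_AVERAGE, 6_252)); // 37.8 (budget 8,000)
```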

Keeping the fact that matters

Bounding tokens is only half the job. The other half is making sure the right tokens survive. memory-runtime has two mechanisms for this, and we tested both adversarially.

Pinning. A pinned artifact is never evicted by the rolling buffer. We ingested a needle, then buried it under 25 later snippets — past the artifact limit — and compiled. Unpinned, the needle was gone (13 artifacts dropped, the oldest first). Pinned, it survived and made it into the compiled prompt. This is how the interview app guarantees that the system prompt and the candidate roster are always present, no matter how deep the conversation goes.
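The pinning behavior can be sketched as a rolling buffer that evicts the oldest unpinned entry whenever it overflows. The code below is an illustration, not the library's buffer: `ingestInto` and `buryNeedle` are hypothetical names, and the limit of 13 is an assumption chosen only so that the toy run reproduces the 13 drops described above.

```typescript
interface BufferedArtifact {
  id: string;
  pinned: boolean;
}

interface IngestResult {
  buffer: BufferedArtifact[];
  dropped: string[];
}

// Appends an artifact, then evicts the oldest unpinned entries until the
// buffer is back within `limit`. Pinned entries are never evicted.
function ingestInto(
  buffer: readonly BufferedArtifact[],
  artifact: BufferedArtifact,
  limit: number,
): IngestResult {
  const next = [...buffer, artifact];
  const dropped: string[] = [];
  while (next.length > limit) {
    const victim = next.find((a) => !a.pinned); // oldest unpinned comes first
    if (!victim) break; // everything left is pinned; nothing can be evicted
    dropped.push(victim.id);
    next.splice(next.indexOf(victim), 1);
  }
  return { buffer: next, dropped };
}

// Ingests a needle followed by 25 later snippets, as in the test above.
function buryNeedle(pinNeedle: boolean, limit: number): IngestResult {
  let buffer: BufferedArtifact[] = [];
  const dropped: string[] = [];
  const stream: BufferedArtifact[] = [
    { id: "needle", pinned: pinNeedle },
    ...Array.from({ length: 25 }, (_, i) => ({ id: `snippet-${i + 1}`, pinned: false })),
  ];
  for (const artifact of stream) {
    const result = ingestInto(buffer, artifact, limit);
    buffer = result.buffer;
    dropped.push(...result.dropped);
  }
  return { buffer, dropped };
}

const unpinnedRun = buryNeedle(false, 13);
const pinnedRun = buryNeedle(true, 13);
console.log(unpinnedRun.buffer.some((a) => a.id === "needle")); // false
console.log(pinnedRun.buffer.some((a) => a.id === "needle"));   // true
```

In both runs 13 artifacts are dropped. The only difference is which ones: without the pin the needle is the first to go, and with it the eviction starts at the oldest snippet instead.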

Observed state. This is the subtler one. When the assistant writes a Decision:, Constraint:, or Glossary: line, observe() lifts it out of the prose and into structured state that rides along in every future compile — independent of whether the source artifact is still in the buffer. We set a config fact at turn 20 ("authentication timeout = 30 minutes"), ran 25 more turns of unrelated work, then asked a deliberately paraphrased question — "how long a user session stays valid before they need to log in again" — that shares no keywords with the original. The compiler dropped 37 artifacts to fit a 1,500-token budget and still surfaced the fact, because it had been promoted to a decision at observe time.
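To make the promotion step concrete, here is a minimal sketch of lifting labeled lines out of a reply. It assumes each label starts its own line and that state is three string lists. The `ObservedState` shape, the `observeSketch` name, and the regular expression are illustrative assumptions. The post does not specify how observe() actually parses.

```typescript
interface ObservedState {
  decisions: string[];
  constraints: string[];
  glossary: string[];
}

// Matches lines such as "Decision: use JWT" and captures the label and value.
const LABELED_LINE = /^\s*(Decision|Constraint|Glossary):\s*(.+?)\s*$/;

// Returns a new state with any labeled lines from the reply appended.
// The previous state is not mutated, and duplicates are skipped.
function observeSketch(assistantText: string, prev: ObservedState): ObservedState {
  const next: ObservedState = {
    decisions: [...prev.decisions],
    constraints: [...prev.constraints],
    glossary: [...prev.glossary],
  };
  for (const line of assistantText.split(/\r?\n/)) {
    const match = LABELED_LINE.exec(line);
    if (!match) continue;
    const label = match[1];
    const value = match[2];
    if (!label || !value) continue;
    const bucket =
      label === "Decision" ? next.decisions :
      label === "Constraint" ? next.constraints :
      next.glossary;
    if (!bucket.includes(value)) bucket.push(value);
  }
  return next;
}

const emptyState: ObservedState = { decisions: [], constraints: [], glossary: [] };
const observed = observeSketch(
  "Here is the plan.\nDecision: authentication timeout = 30 minutes\nLet me know if that works.",
  emptyState,
);
console.log(observed.decisions); // [ 'authentication timeout = 30 minutes' ]
```

Once a fact lives in structured state like this, it no longer depends on its source artifact surviving the buffer, which is the property the turn-20 test above relies on.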

82.4%: average token reduction at budget 2,000
19,888 → ≤1,999: naive vs. compiled at turn 50
86.8%: smaller compact snapshots
0: runtime dependencies

Snapshots stay small too

Because the snapshot round-trips on every turn, its size is itself a cost. Compact mode strips artifact bodies (keeping IDs and metadata) and caps the event log at the last ten entries. In our test a full snapshot of 10 artifacts and 30 events measured 22,682 bytes; the compact form was 3,002 — an 86.8% reduction — while preserving all extracted state. You keep the full snapshot server-side when you can, and reach for compact mode when the snapshot has to fit in a cookie or a constrained payload.
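The two compaction rules described above (strip artifact bodies, keep the last ten events) can be sketched in a few lines. The `FullSnapshot` and `CompactSnapshot` shapes and the `toCompact` helper are hypothetical. The library's actual snapshot fields and compact-mode API are not shown in this post.

```typescript
interface ArtifactMeta {
  id: string;
  source: string;
  pinned: boolean;
}

interface FullArtifact extends ArtifactMeta {
  content: string; // the body that compact mode strips
}

interface SessionEvent {
  seq: number;
  type: string;
}

interface ExtractedState {
  decisions: string[];
}

interface FullSnapshot {
  state: ExtractedState;
  artifacts: FullArtifact[];
  events: SessionEvent[];
}

interface CompactSnapshot {
  state: ExtractedState;
  artifacts: ArtifactMeta[];
  events: SessionEvent[];
}

// Keeps extracted state intact, reduces artifacts to IDs and metadata,
// and caps the event log at the most recent `maxEvents` entries.
function toCompact(snapshot: FullSnapshot, maxEvents = 10): CompactSnapshot {
  const firstKept = Math.max(0, snapshot.events.length - maxEvents);
  return {
    state: snapshot.state,
    artifacts: snapshot.artifacts.map(({ id, source, pinned }) => ({ id, source, pinned })),
    events: snapshot.events.slice(firstKept),
  };
}
```

The size reduction in any real run depends on how large the artifact bodies are relative to the metadata, so the 86.8% figure above is specific to the tested snapshot.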

The whole test suite, reproducible

None of these numbers are hand-waved. They come from the benchmark and test scripts that ship in the repository; clone it and run npm run bench.

| Property | What we tested | Result |
| --- | --- | --- |
| Budget enforcement | Compile 10 artifacts at a 500-token budget | 334 tokens, 9 artifacts dropped; budget never exceeded |
| Token reduction | Naive vs. compiled, 50 turns @ 2,000 | 10,056 → 1,768 avg, 82.4% smaller |
| Pinning | Needle buried under 25 later artifacts | Dropped unpinned; survived pinned |
| Paraphrase recall | Turn-20 fact, paraphrased query at turn 45 | Surfaced despite 37 artifacts dropped |
| Determinism | Compile same snapshot + message twice | Byte-identical output |
| Compact snapshot | Full vs. compact serialization | 22,682 → 3,002 bytes, 86.8% smaller |

What it deliberately is not

memory-runtime is not a summarizer and not a vector database. There is no model call inside compile(), which is why it is deterministic and free to run. Evidence selection is priority-based and keyword-aware, not semantic — for the interview app we layer our own vector retrieval on top of the runtime to pick which artifacts to ingest, then let the runtime handle budgeting and protection. The two compose cleanly precisely because the runtime keeps no opinions and no state of its own.

It is intentionally small: one data structure, three verbs, zero dependencies, and a guarantee that the same inputs always produce the same prompt. That smallness is what made it safe to put on the critical path of a production hiring product.

memory-runtime is MIT-licensed and available on npm as memory-runtime. It is the same engine running in Debrev Interview today, where every AI conversation about a candidate is compiled, budgeted, and bounded by the loop described above.
