memory-runtime: bounding LLM context without a database
memory-runtime is the stateless context engine behind Debrev Interview's AI chat, now open-sourced under MIT. Instead of replaying a growing chat history into every prompt, it runs a deterministic ingest → compile → observe loop: artifacts and messages go in, a budget-aware compiler emits a message array that never exceeds a token ceiling, and the assistant's reply is parsed back into structured state. The entire session lives in a JSON snapshot the application owns — no SQLite, no filesystem, no server affinity. In a 50-turn benchmark the naive approach grew to 19,888 prompt tokens while memory-runtime held flat under 2,000 — an 82% average reduction — and still surfaced a fact pinned at turn 1 in answer to a paraphrased question at turn 50 that shared no keywords with it. Compilation is deterministic and compact snapshots run ~87% smaller than full ones.
Every LLM chat application has the same leak. To answer turn 50, you replay turns 1 through 49 into the prompt — the messages, the code you pasted, the documents you attached. The context grows linearly, the cost grows with it, and eventually you hit the model's window and start dropping the oldest content, which is often exactly the constraint someone set at the start of the conversation.
The usual fixes each have a tax. A sliding window is cheap but forgets. LLM summarization is lossy, non-deterministic, and costs an extra model call per turn. A vector database retrieves well but adds infrastructure, embedding latency, and a service to operate.
memory-runtime is the path we took for Debrev Interview, and we've open-sourced it. It is a small, dependency-free TypeScript SDK that treats context assembly as a deterministic compilation problem against a fixed token budget — and keeps zero state of its own.
The model: ingest → compile → observe
The runtime has three verbs and one data structure. You ingest artifacts (code snippets, document chunks, diffs) and messages. You compile them into a message array that fits a token budget. You observe the assistant's response to extract structured state. Everything the session knows is captured in a serializable snapshot that you — not the library — store and pass back on the next turn.
In code, the loop is about as long as its description:
```typescript
import { createSession } from "memory-runtime"

// Restore from last turn's snapshot, or start fresh
const session = createSession({ snapshot: previousSnapshot })

session.ingest({
  type: "snippet",
  payload: { source: "auth.ts", content: code, pinned: true },
})

const { messages, snapshot } = session.compile({
  userMessage: "Explain the auth flow",
  budgetTokens: 2000,
})

const reply = await llm.chat(messages) // your provider, your model
const { snapshot: next } = session.observe({ assistantText: reply })
await store.save(userId, next) // your storage
```
The key property: compile() is a pure function of the snapshot and the inputs. Same snapshot, same message → byte-identical output, every time. No hidden state, nothing to migrate, nothing to mock in tests.
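One way to hold a test suite to that property is a helper that calls a function twice and compares the serialized results. The sketch below is illustrative only: `isByteIdentical`, `compileStub`, and `impureStub` are hypothetical names, and `compileStub` is a stand-in, not the library's `compile()`.

```typescript
// Hypothetical helper: true when two independent calls with the same input
// serialize to exactly the same string.
function isByteIdentical<I, O>(fn: (input: I) => O, input: I): boolean {
  const first = JSON.stringify(fn(input));
  const second = JSON.stringify(fn(input));
  return first === second;
}

// Assumed input shape for this sketch; the real snapshot type is the library's.
type StubInput = { snapshot: { facts: string[] }; userMessage: string };
type StubMessage = { role: string; content: string };

// Stand-in for a pure compile step: output depends only on its input.
function compileStub(input: StubInput): StubMessage[] {
  return [
    { role: "system", content: input.snapshot.facts.join("\n") },
    { role: "user", content: input.userMessage },
  ];
}

// Counter-example: hidden state (a call counter) leaks into the output.
let callCount = 0;
function impureStub(input: StubInput): StubMessage[] {
  callCount += 1;
  return [...compileStub(input), { role: "system", content: `call ${callCount}` }];
}

const stubInput: StubInput = {
  snapshot: { facts: ["authentication timeout = 30 minutes"] },
  userMessage: "Explain the auth flow",
};

console.log(isByteIdentical(compileStub, stubInput)); // true
console.log(isByteIdentical(impureStub, stubInput)); // false
```

The same helper could wrap a real compile call in a unit test, since a pure function needs no mocks or fixtures beyond the snapshot itself.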
Why stateless matters
The original version of this engine kept a SQLite database. It worked on a laptop and fell over the moment we deployed to serverless functions, where there is no durable local disk and no guarantee two requests hit the same instance.
Making the snapshot the entire source of truth removed the problem class. The snapshot is plain JSON. You can put it in Postgres, in Redis, in an encrypted cookie, or simply pass it in the request and response body — which is exactly what Debrev Interview's chat routes do. There is no connection to pool, no migration to run, and no server affinity to engineer around. It also means the library has zero runtime dependencies, so it adds nothing to your attack surface or your bundle.
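Because the snapshot is plain JSON, a storage adapter can be very thin. The following is a minimal sketch under assumptions: the `SnapshotStore` interface and `MemorySnapshotStore` class are hypothetical (they are not part of the library), and the snapshot is treated as opaque JSON.

```typescript
// Assumption: the snapshot is opaque JSON as far as storage is concerned.
type Snapshot = Record<string, unknown>;

// Hypothetical adapter interface; back it with Postgres, Redis, or a cookie.
interface SnapshotStore {
  save(userId: string, snapshot: Snapshot): Promise<void>;
  load(userId: string): Promise<Snapshot | undefined>;
}

// In-memory stand-in for a real backend: it only ever holds a string per user.
class MemorySnapshotStore implements SnapshotStore {
  private rows = new Map<string, string>();

  async save(userId: string, snapshot: Snapshot): Promise<void> {
    this.rows.set(userId, JSON.stringify(snapshot));
  }

  async load(userId: string): Promise<Snapshot | undefined> {
    const raw = this.rows.get(userId);
    return raw === undefined ? undefined : (JSON.parse(raw) as Snapshot);
  }
}
```

A first-time user simply gets `undefined` back from `load`, which matches the "restore or start fresh" pattern in the loop shown earlier.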
Snapshots contain user content. Treat them as sensitive: encrypt at rest and in transit. The runtime gives you the bytes; protecting them is the application's job.
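As one possible way to do that at rest, the sketch below seals a snapshot with AES-256-GCM from Node's built-in `node:crypto` module. The function names `sealSnapshot` and `openSnapshot`, the byte layout, and the example snapshot shape are assumptions of this sketch, not something the library provides.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Encrypts a JSON-serializable snapshot with AES-256-GCM.
// Layout chosen for this sketch: base64(iv [12 bytes] | authTag [16 bytes] | ciphertext).
function sealSnapshot(snapshot: unknown, key: Buffer): string {
  const iv = randomBytes(12); // fresh 96-bit nonce for every encryption
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([
    cipher.update(JSON.stringify(snapshot), "utf8"),
    cipher.final(),
  ]);
  const tag = cipher.getAuthTag(); // 16 bytes by default
  return Buffer.concat([iv, tag, ciphertext]).toString("base64");
}

// Reverses sealSnapshot; throws if the key is wrong or the payload was modified.
function openSnapshot(sealed: string, key: Buffer): unknown {
  const raw = Buffer.from(sealed, "base64");
  const iv = raw.subarray(0, 12);
  const tag = raw.subarray(12, 28);
  const ciphertext = raw.subarray(28);
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  const plaintext = Buffer.concat([decipher.update(ciphertext), decipher.final()]);
  return JSON.parse(plaintext.toString("utf8"));
}

// Demo key only; a real deployment would load a 32-byte key from a secret manager.
const demoKey = randomBytes(32);
const sealed = sealSnapshot({ decisions: ["authentication timeout = 30 minutes"] }, demoKey);
const restored = openSnapshot(sealed, demoKey);
console.log(restored);
```

GCM is an authenticated mode, so a snapshot that was altered in a cookie or request body fails to decrypt instead of being silently accepted.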
The headline: bounded context
Here is the behavior that motivates the whole design. We ran a 50-turn session — each turn ingests a fresh code snippet plus a message exchange — and measured two prompts side by side: the naive prompt (full history plus every artifact, what most apps send) versus the memory-runtime prompt compiled against a 2,000-token budget.
Plotted turn by turn, the naive prompt is a straight ramp: it is bounded only by the model's context window, and when it hits that ceiling, content starts falling out the back uncontrolled. The compiled prompt is flat. That flatness is the product: predictable cost per turn, and predictable behavior at the boundary, because you decide what fits instead of leaving it to truncation logic that drops whatever happens to be oldest.
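The difference compounds over a session, which a few lines of arithmetic make concrete. The numbers below are illustrative assumptions (a flat 400 tokens added per turn), not the benchmark's data, and the function names are hypothetical.

```typescript
// Illustrative assumptions only, not measurements from the benchmark.
const TOKENS_PER_TURN = 400;
const BUDGET_TOKENS = 2000;
const TURNS = 50;

// Naive prompt: everything so far is replayed, so size grows linearly with the turn.
function naivePromptTokens(turn: number): number {
  return turn * TOKENS_PER_TURN;
}

// Bounded prompt: same content, capped at the budget.
function boundedPromptTokens(turn: number): number {
  return Math.min(naivePromptTokens(turn), BUDGET_TOKENS);
}

// Total prompt tokens paid for across the whole session.
function totalOverSession(perTurn: (turn: number) => number): number {
  let total = 0;
  for (let turn = 1; turn <= TURNS; turn++) total += perTurn(turn);
  return total;
}

console.log(naivePromptTokens(TURNS)); // 20000
console.log(boundedPromptTokens(TURNS)); // 2000
console.log(totalOverSession(naivePromptTokens)); // 510000
console.log(totalOverSession(boundedPromptTokens)); // 96000
```

A prompt that grows linearly per turn costs quadratically over the session, while a capped prompt costs at most turns times budget.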
The budget is a hard ceiling, not a suggestion
The compiler's contract is that tokenEstimate never exceeds budgetTokens. Lower-priority evidence is dropped or truncated until the prompt fits. We ran the same 50-turn workload at three budgets to see the trade-off.
A tight 800-token budget cuts 94% of the context and still keeps the pinned needle. An 8,000-token budget keeps far more detail and still trims 38%. The point is that the number is yours to set per surface — Debrev Interview uses 3,000 tokens for single-candidate chat and 8,000 for the multi-candidate overview — and the runtime honors it exactly.
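The hard-ceiling contract can be illustrated with a greedy packer. This is a sketch of the general technique, not the library's compiler: the `Evidence` type, `packToBudget`, and the four-characters-per-token estimate are all assumptions, and for brevity it only drops items and never truncates them.

```typescript
// Hypothetical evidence shape for this sketch.
type Evidence = { id: string; content: string; priority: number; pinned?: boolean };

// Rough heuristic assumed here: about 4 characters per token.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

function packToBudget(items: Evidence[], budgetTokens: number) {
  // Pinned items first, then higher priority first.
  // Array.prototype.sort is stable, so ties keep their input order.
  const ordered = [...items].sort(
    (a, b) =>
      Number(b.pinned ?? false) - Number(a.pinned ?? false) || b.priority - a.priority,
  );

  const kept: Evidence[] = [];
  const dropped: string[] = [];
  let tokenEstimate = 0;

  for (const item of ordered) {
    const cost = estimateTokens(item.content);
    if (tokenEstimate + cost <= budgetTokens) {
      kept.push(item);
      tokenEstimate += cost;
    } else {
      dropped.push(item.id); // does not fit: drop it, keep scanning for smaller items
    }
  }
  // Invariant: tokenEstimate <= budgetTokens, because items are only added when they fit.
  return { kept, dropped, tokenEstimate };
}

const packed = packToBudget(
  [
    { id: "a", content: "x".repeat(1200), priority: 5 },
    { id: "b", content: "x".repeat(1200), priority: 3 },
    { id: "needle", content: "x".repeat(400), priority: 0, pinned: true },
    { id: "c", content: "x".repeat(1200), priority: 1 },
  ],
  500,
);

console.log(packed.kept.map((e) => e.id)); // ["needle", "a"]
console.log(packed.dropped); // ["b", "c"]
console.log(packed.tokenEstimate); // 400
```

Because an item is admitted only when it fits, the ceiling holds for any input, including the case where nothing fits and the output is empty.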
Keeping the fact that matters
Bounding tokens is only half the job. The other half is making sure the right tokens survive. memory-runtime has two mechanisms for this, and we tested both adversarially.
Pinning. A pinned artifact is never evicted by the rolling buffer. We ingested a needle, then buried it under 25 later snippets — past the artifact limit — and compiled. Unpinned, the needle was gone (13 artifacts dropped, the oldest first). Pinned, it survived and made it into the compiled prompt. This is how the interview app guarantees that the system prompt and the candidate roster are always present, no matter how deep the conversation goes.
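The eviction rule described here can be sketched as a rolling buffer that removes the oldest unpinned entry whenever it overflows. The names `Artifact`, `ingestWithLimit`, and `runScenario`, and the limit of 5, are hypothetical choices for illustration and do not reflect the library's actual limit or internals.

```typescript
type Artifact = { id: string; pinned: boolean };

// Appends an artifact, then evicts the oldest unpinned entries until the
// buffer is back within `limit`. Pinned entries are never chosen as victims.
function ingestWithLimit(buffer: Artifact[], incoming: Artifact, limit: number) {
  const next = [...buffer, incoming];
  const evicted: string[] = [];
  while (next.length > limit) {
    const victim = next.findIndex((a) => !a.pinned);
    if (victim === -1) break; // everything is pinned, nothing can be evicted
    evicted.push(...next.splice(victim, 1).map((a) => a.id));
  }
  return { buffer: next, evicted };
}

// Ingest a needle followed by 8 later snippets into a buffer limited to 5.
function runScenario(pinNeedle: boolean): string[] {
  let buffer: Artifact[] = [];
  const incoming: Artifact[] = [
    { id: "needle", pinned: pinNeedle },
    ...Array.from({ length: 8 }, (_, i) => ({ id: `snippet-${i + 1}`, pinned: false })),
  ];
  for (const artifact of incoming) {
    buffer = ingestWithLimit(buffer, artifact, 5).buffer;
  }
  return buffer.map((a) => a.id);
}

console.log(runScenario(false)); // ["snippet-4", "snippet-5", "snippet-6", "snippet-7", "snippet-8"]
console.log(runScenario(true)); // ["needle", "snippet-5", "snippet-6", "snippet-7", "snippet-8"]
```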
Observed state. This is the subtler one. When the assistant writes a Decision:, Constraint:, or Glossary: line, observe() lifts it out of the prose and into structured state that rides along in every future compile — independent of whether the source artifact is still in the buffer. We set a config fact at turn 20 ("authentication timeout = 30 minutes"), ran 25 more turns of unrelated work, then asked a deliberately paraphrased question — "how long a user session stays valid before they need to log in again" — that shares no keywords with the original. The compiler dropped 37 artifacts to fit a 1,500-token budget and still surfaced the fact, because it had been promoted to a decision at observe time.
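To make the promotion step concrete, here is a minimal sketch of lifting labeled lines out of a reply. It assumes each fact sits on its own line starting with the label; the `ObservedState` shape and `extractState` function are hypothetical, and the library's actual parsing rules may differ.

```typescript
// Hypothetical structured state for this sketch.
type ObservedState = { decisions: string[]; constraints: string[]; glossary: string[] };

// Line prefixes named in the text, mapped to the field they populate.
const PREFIXES = [
  ["Decision:", "decisions"],
  ["Constraint:", "constraints"],
  ["Glossary:", "glossary"],
] as const;

function extractState(assistantText: string): ObservedState {
  const state: ObservedState = { decisions: [], constraints: [], glossary: [] };
  for (const rawLine of assistantText.split(/\r?\n/)) {
    const line = rawLine.trim();
    for (const [prefix, field] of PREFIXES) {
      if (!line.startsWith(prefix)) continue;
      const value = line.slice(prefix.length).trim();
      if (value.length > 0) state[field].push(value); // skip empty labels
    }
  }
  return state;
}

const observed = extractState(
  [
    "Here is what we settled on.",
    "Decision: authentication timeout = 30 minutes",
    "Constraint: no new runtime dependencies",
    "Everything else stays as it was.",
  ].join("\n"),
);

console.log(observed.decisions); // ["authentication timeout = 30 minutes"]
console.log(observed.constraints); // ["no new runtime dependencies"]
console.log(observed.glossary); // []
```

Once a fact lives in structured state like this, it no longer depends on keyword overlap with a later question or on the source artifact surviving eviction.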
Snapshots stay small too
Because the snapshot round-trips on every turn, its size is itself a cost. Compact mode strips artifact bodies (keeping IDs and metadata) and caps the event log at the last ten entries. In our test a full snapshot of 10 artifacts and 30 events measured 22,682 bytes; the compact form was 3,002 — an 86.8% reduction — while preserving all extracted state. You keep the full snapshot server-side when you can, and reach for compact mode when the snapshot has to fit in a cookie or a constrained payload.
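The compaction described above amounts to two transformations, sketched below. The `FullSnapshot` shape and the `compact` function are assumptions made for illustration; only the two rules (strip bodies, keep the last ten events) come from the text.

```typescript
// Hypothetical snapshot shape for this sketch.
type FullArtifact = { id: string; source: string; pinned: boolean; content: string };
type FullSnapshot = {
  artifacts: FullArtifact[];
  events: { turn: number; kind: string }[];
  decisions: string[];
};

const MAX_EVENTS = 10; // "the last ten entries"

function compact(snapshot: FullSnapshot) {
  return {
    // Keep IDs and metadata, drop the artifact bodies.
    artifacts: snapshot.artifacts.map(({ id, source, pinned }) => ({ id, source, pinned })),
    // Keep only the most recent events.
    events: snapshot.events.slice(-MAX_EVENTS),
    // Extracted state is preserved untouched.
    decisions: snapshot.decisions,
  };
}

const fullSnapshot: FullSnapshot = {
  artifacts: Array.from({ length: 10 }, (_, i) => ({
    id: `artifact-${i}`,
    source: `file-${i}.ts`,
    pinned: i === 0,
    content: "x".repeat(2000),
  })),
  events: Array.from({ length: 30 }, (_, i) => ({ turn: i + 1, kind: "ingest" })),
  decisions: ["authentication timeout = 30 minutes"],
};

const compactSnapshot = compact(fullSnapshot);
console.log(JSON.stringify(compactSnapshot).length < JSON.stringify(fullSnapshot).length); // true
console.log(compactSnapshot.events.length); // 10
console.log(compactSnapshot.decisions); // ["authentication timeout = 30 minutes"]
```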
The whole test suite, reproducible
None of these numbers are hand-waved. They come from the benchmark and test scripts that ship in the repository; clone it and run npm run bench.
| Property | What we tested | Result |
|---|---|---|
| Budget enforcement | Compile 10 artifacts at a 500-token budget | 334 tokens, 9 artifacts dropped — never exceeded |
| Token reduction | Naive vs. compiled, 50 turns @ 2,000 | 10,056 → 1,768 avg, 82.4% smaller |
| Pinning | Needle buried under 25 later artifacts | Dropped unpinned; survived pinned |
| Paraphrase recall | Turn-20 fact, paraphrased query at turn 45 | Surfaced despite 37 artifacts dropped |
| Determinism | Compile same snapshot + message twice | Byte-identical output |
| Compact snapshot | Full vs. compact serialization | 22,682 → 3,002 bytes, 86.8% smaller |
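The percentages in the table follow directly from the before and after figures. A quick check, using a hypothetical `reductionPercent` helper that rounds to one decimal place:

```typescript
// Percentage reduction from `before` to `after`, rounded to one decimal place.
function reductionPercent(before: number, after: number): number {
  return Math.round(((before - after) / before) * 1000) / 10;
}

console.log(reductionPercent(10056, 1768)); // 82.4 (average prompt tokens)
console.log(reductionPercent(22682, 3002)); // 86.8 (snapshot bytes)
```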
What it deliberately is not
memory-runtime is not a summarizer and not a vector database. There is no model call inside compile(), which is why it is deterministic and free to run. Evidence selection is priority-based and keyword-aware, not semantic — for the interview app we layer our own vector retrieval on top of the runtime to pick which artifacts to ingest, then let the runtime handle budgeting and protection. The two compose cleanly precisely because the runtime keeps no opinions and no state of its own.
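To see what "keyword-aware, not semantic" means in practice, consider a plain term-overlap score. This is an illustration of keyword matching in general, with hypothetical `tokenize` and `keywordOverlap` functions; it is not the runtime's actual scoring.

```typescript
// Lowercased alphanumeric terms, de-duplicated.
const tokenize = (text: string): Set<string> =>
  new Set(text.toLowerCase().match(/[a-z0-9]+/g) ?? []);

// Number of distinct terms the query and the content have in common.
function keywordOverlap(query: string, content: string): number {
  const contentTerms = tokenize(content);
  return Array.from(tokenize(query)).filter((term) => contentTerms.has(term)).length;
}

const fact = "authentication timeout = 30 minutes";

console.log(keywordOverlap("what is the authentication timeout", fact)); // 2
console.log(
  keywordOverlap("how long a user session stays valid before they need to log in again", fact),
); // 0
```

The paraphrased question scores zero against the fact, which is why a semantic retriever layered on top, or promotion into observed state, has to cover the cases keyword matching cannot.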
It is intentionally small: one data structure, three verbs, zero dependencies, and a guarantee that the same inputs always produce the same prompt. That smallness is what made it safe to put on the critical path of a production hiring product.
memory-runtime is MIT-licensed and available on npm as memory-runtime. It is the same engine running in Debrev Interview today, where every AI conversation about a candidate is compiled, budgeted, and bounded by the loop described above.