
From Raw Evidence to Structured Intelligence: The Five Phases of Strata

April 2, 2026 · By Debrev Team

Most hiring teams make decisions on gut feel. They transcribe interviews, they jot notes, they bring people into a room—and then opinions collide. The strongest voice usually wins.

Strata is built on a different thesis: qualitative human evidence should become normalized evidence objects, then rubric-aware candidate state, then structured, comparable, decision-ready intelligence.

Not a frontier model wrapper. Not an opinion generator. A reasoning system that turns messy real-world evidence into something auditable, traceable, and useful.

Here's how we got there—and where we're headed next.

The Core Problem

Frontier models are powerful at many things. But when you feed a hiring decision into GPT-4 and ask for a summary, you get:

  • Hallucinated details that sound plausible
  • Buried reasoning in prose you can't audit
  • Generic hiring advice, not domain-specific judgment
  • No way to compare candidates fairly (one interviewer writes novels, another writes bullet points)

For hiring decisions to work at scale, you need:

  • Extracted, normalized evidence (what actually happened)
  • Evidence-linked reasoning (why we think this)
  • Rubric-aware judgment (does it fit this role?)
  • Comparable outputs (candidate A vs candidate B on the same scale)
  • Explicit uncertainty (what do we actually know vs guess)

That's not what frontier models give you. That's what Strata is building.

Phase 1: Base Model Proof of Concept

The question: Can you take a compact open model, fine-tune it cheaply, and prove it works on a narrow hiring task?

We started here. Pick a lean instruct model (not a 70B frontier behemoth). Define a strict structured output schema—JSON fields with specific meaning. Create the first hiring dataset. Run QLoRA fine-tuning to adapt the model without retraining from scratch.
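The "strict structured output schema" can be sketched as a parse-and-validate step using only the standard library. A minimal sketch, with hypothetical field names rather than Strata's actual contract:

```python
import json

# Hypothetical output contract for the narrow Phase 1 task. The field
# names are illustrative assumptions, not Strata's actual schema.
REQUIRED_FIELDS = {
    "recommendation": str,  # e.g. "advance" / "hold" / "reject"
    "confidence": float,    # 0.0-1.0
    "evidence": list,       # note fragments backing the call
}

def parse_model_output(raw):
    """Parse one completion and enforce the schema, or raise."""
    out = json.loads(raw)  # malformed JSON fails loudly here
    for field, ftype in REQUIRED_FIELDS.items():
        if not isinstance(out.get(field), ftype):
            raise ValueError(f"bad or missing field: {field}")
    return out

sample = ('{"recommendation": "advance", "confidence": 0.8, '
          '"evidence": ["led the data-platform migration"]}')
decision = parse_model_output(sample)
```

Format compliance in Phase 1 just means this function stops raising on held-out examples.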

The goal was simple: prove the economics worked. Could you:

  • Fine-tune for a few hundred dollars
  • Get repeatable format compliance
  • Beat the base model on a narrow task

Phase 1 succeeded when the tuned model clearly outperformed the base on held-out examples, the output format was stable, and the training loop was cheap and repeatable.

This phase proved that domain specialization was viable, not expensive.

Phase 2: Domain Behavior Refinement

The question: Now make the model actually good at hiring reasoning, not just structurally correct.

Fine-tuning gets you format compliance. But we needed the model to understand hiring depth. What does "strong technical foundation" really mean? When does an interviewer note point to a risk vs a strength?

Phase 2 meant:

  • Cleaning and expanding the dataset with better examples
  • Removing ambiguous category labels that confused the model
  • Finding edge cases where the model was getting it wrong (over-softening decisions, missing role context)
  • Hardening decision boundaries
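A cleaning pass of the kind listed above can be sketched in a few lines. The label names and the disagreement flag are illustrative assumptions, not the actual dataset format:

```python
# Toy version of the Phase 2 cleaning pass: drop ambiguous labels outright
# and route boundary cases to human re-review. Label names and the
# rater_disagreement flag are illustrative assumptions.

AMBIGUOUS = {"maybe", "mixed"}  # category labels that confused the model
ADJACENT = [("strong_hire", "hire"), ("no_hire", "strong_no_hire")]

def clean(dataset):
    """Split examples into a keep set and a human re-review queue."""
    keep, review = [], []
    for ex in dataset:
        if ex["label"] in AMBIGUOUS:
            continue  # removed from training entirely
        boundary = any(ex["label"] in pair for pair in ADJACENT)
        if boundary and ex.get("rater_disagreement"):
            review.append(ex)  # likely source of adjacent-label confusion
        else:
            keep.append(ex)
    return keep, review

keep, review = clean([
    {"label": "maybe"},
    {"label": "hire", "rater_disagreement": True},
    {"label": "strong_hire"},
])
```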

Success looked like clear outperformance on held-out examples, fewer adjacent-label confusions, and tight alignment between the cited evidence and the final recommendation.

The model started reasoning like it understood hiring.

Phase 3: App-Ready Structured Outputs

The question: Can you lock the output schema and make it reliable enough for product logic to consume?

Phases 1 and 2 were about model quality. Phase 3 was about product readiness.

A production system needs:

  • Locked schema versioning (no surprise field changes breaking downstream code)
  • Deterministic, parseable outputs (the JSON is always valid)
  • Required reasoning (not just a score, but evidence-backed claims)
  • Confidence signals (how sure is the model, really?)
  • Consistency checks (does the reasoning support the recommendation?)
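The consistency check in the last bullet might look something like this sketch; the field names and direction labels are assumptions carried over for illustration:

```python
# Sketch of a consistency gate run before product code consumes an
# output. Field names and direction labels are illustrative assumptions.

def consistent(output):
    """Reject outputs whose evidence does not support the recommendation."""
    directions = {item["direction"] for item in output["evidence"]}
    if output["recommendation"] == "advance" and "positive" not in directions:
        return False  # positive call with no supporting evidence
    if output["recommendation"] == "reject" and "negative" not in directions:
        return False  # negative call with no supporting evidence
    return 0.0 <= output["confidence"] <= 1.0

ok = consistent({
    "recommendation": "advance",
    "confidence": 0.7,
    "evidence": [{"claim": "shipped the billing rewrite", "direction": "positive"}],
})
```

Outputs that fail a gate like this get regenerated or flagged rather than shown to users.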

Phase 3 succeeded when the outputs became stable enough that product engineers could build on top without constant firefighting.

The model stopped being a research prototype. It became a component.

Phase 4: Retrieval and Company-Aware Context

The question: How do you let one trained model adapt to different companies' hiring standards without retraining for each one?

Every company has different role requirements, different competency frameworks, different hiring standards. Retraining for each one would be expensive and slow.

Phase 4 introduced retrieval: at runtime, fetch the company's hiring rubric, role requirements, and interviewer guidelines. Combine them with the candidate's evidence. Let the model reason over both.

Now the model could:

  • Work with Company A's rubric (emphasizing execution and a bias toward action)
  • Work with Company B's rubric (emphasizing collaboration and thoughtfulness)

Same core model. Different reasoning. No retraining.

Success: the model adapts to different company standards, retrieved context improves real-world judgment, and company knowledge lives outside the weights.
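Runtime retrieval of this kind can be sketched with an in-memory store standing in for the real retrieval layer; the rubric text and prompt layout are assumptions for illustration:

```python
# Minimal sketch of runtime retrieval: company standards live outside the
# weights and are fetched per request. The in-memory store and the prompt
# layout are illustrative assumptions.

RUBRICS = {
    "company_a": "Weigh execution and bias toward action heavily.",
    "company_b": "Weigh collaboration and thoughtfulness heavily.",
}

def build_context(company_id, role, evidence):
    """Combine the retrieved rubric with candidate evidence for one call."""
    rubric = RUBRICS[company_id]  # in production: a real retrieval layer
    lines = [f"Rubric: {rubric}", f"Role: {role}", "Evidence:"]
    lines += [f"- {item}" for item in evidence]
    return "\n".join(lines)

ctx = build_context("company_a", "Staff Engineer", ["drove incident response"])
```

Swapping `company_a` for `company_b` changes the reasoning context without touching the weights.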

Phase 5: Runtime Efficiency and Deployment

The question: Is this cheap enough to run in production?

No amount of intelligence matters if the cost per decision is prohibitive. Hiring teams make decisions many times—on many candidates, across many roles. The system needs to scale economically.

Phase 5 meant:

  • Benchmarking latency and throughput on real hiring workflows
  • Testing quantized inference (smaller memory footprint, faster responses)
  • Comparing hosted vs local deployment costs
  • Building a path toward even more efficient runtime

Success: clear cost/quality tradeoff, viable deployment paths (local or server), and economics that make repetition practical.
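The economics question reduces to simple token arithmetic. A back-of-the-envelope sketch, with all prices and token counts as illustrative assumptions rather than measured Strata figures:

```python
# Back-of-the-envelope cost per decision. All prices and token counts
# here are illustrative assumptions, not measured Strata figures.

def cost_per_decision(prompt_tokens, output_tokens,
                      usd_per_1k_in, usd_per_1k_out):
    """Token-metered cost of one structured hiring decision."""
    return (prompt_tokens / 1000) * usd_per_1k_in \
         + (output_tokens / 1000) * usd_per_1k_out

# A 3k-token prompt (rubric + evidence) and a 500-token structured output
# on a small hosted model: roughly a quarter of a cent per decision.
c = cost_per_decision(3000, 500, 0.0005, 0.0015)
```

At rates like these, running the model on every candidate after every interview stays cheap enough to be routine.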

By Phase 5, Strata wasn't theoretical. It was production-viable.

What We've Built

After five phases, the system does something frontier models can't:

  1. Takes messy evidence (notes, scorecards, resumes, interviewer comments)
  2. Normalizes it into structured evidence objects with competency, direction, strength, source
  3. Applies company context (role requirements, hiring standards, rubrics)
  4. Produces structured output (competencies with confidence, evidence links, blockers, recommendation)
  5. Runs efficiently enough to use repeatedly, not just as an expensive one-off
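Step 2's evidence objects can be sketched as a small dataclass. The toy normalizer below stands in for the tuned model, and the hard-coded competency and strength are placeholders:

```python
from dataclasses import dataclass

# Sketch of a normalized evidence object with the four attributes named
# above. The toy keyword normalizer stands in for the tuned model.

@dataclass
class Evidence:
    competency: str  # which skill the note speaks to
    direction: str   # "positive" or "negative"
    strength: float  # 0.0 (weak signal) to 1.0 (strong signal)
    source: str      # who said it, and in what setting

def normalize(raw_note, interviewer):
    """Toy extraction: the real system infers all four fields."""
    negative = any(w in raw_note.lower()
                   for w in ("struggled", "missed", "unclear"))
    return Evidence(
        competency="technical_depth",  # placeholder; the model infers this
        direction="negative" if negative else "positive",
        strength=0.5,                  # placeholder strength
        source=f"{interviewer} (onsite notes)",
    )

e = normalize("Struggled to explain the caching design", "R. Chen")
```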

And critically: everything is auditable. You can trace the recommendation back to the evidence. You can see where the model is confident and where it's uncertain. You can compare candidates fairly because they're scored on the same rubric.

This is what Debrev Interview is built on.

What's Next: Beyond Phase 5

Phases 1-5 proved the thesis. The next work is about depth and breadth.

The foundation work:

  • Evidence normalization as a first-class primitive. Right now evidence is implicit in the reasoning. Making it explicit—with source attribution, strength, direction—unlocks new product features like evidence-by-competency views and follow-up question suggestions.
  • Candidate-state engine. Instead of one-shot outputs, build persistent state that updates with each interview. This is where the system becomes a durable reasoning system instead of a template.
  • Comparative decision intelligence. Supporting not just "here's a summary of candidate X" but "here's how candidate X compares to candidate Y on these shared dimensions."
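The candidate-state and comparison ideas above can be sketched together. The running-average update rule is an illustrative assumption, not Strata's actual method:

```python
from collections import defaultdict

# Sketch of a persistent candidate-state engine: each interview's evidence
# updates the state instead of producing a one-shot summary. The running-
# average update rule is an illustrative assumption.

class CandidateState:
    def __init__(self):
        # competency -> (running mean of signals, evidence count)
        self.scores = defaultdict(lambda: (0.0, 0))

    def update(self, competency, signal):
        """Fold one evidence signal (in [-1, 1]) into the state."""
        mean, n = self.scores[competency]
        self.scores[competency] = ((mean * n + signal) / (n + 1), n + 1)

    def compare(self, other, competency):
        """Positive result: this candidate leads on the shared dimension."""
        return self.scores[competency][0] - other.scores[competency][0]

a, b = CandidateState(), CandidateState()
a.update("execution", 0.8)
a.update("execution", 0.4)  # a second interview refines, not replaces
b.update("execution", 0.3)
```

Because both candidates are scored on the same dimensions, `compare` gives the shared-scale view the comparative track is after.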

The research tracks:

  • Scoring models that are useful but don't obscure the underlying evidence
  • Advanced evidence fusion across several input modalities (not just notes, but resumes, exercises, references)
  • Specialized model tracks for narrow workflows (maybe a distilled expert for evidence extraction, another for comparison)
  • Efficiency optimization (retrieval performance, prompt caching, quantization)

The philosophy stays the same: never autonomous, always auditable, always evidence-linked.

Why This Matters

The standard in hiring today is still hunches. Structured, evidence-backed reasoning is a competitive advantage.

Strata isn't about removing human judgment. It's about giving humans better signal: domain-specialized AI that understands hiring more deeply than a generic frontier model, costs less to run, and works inside the workflows that matter.

That's what the five phases built. That's what's next.
