Future-Proofing CTI: An Architecture Review and Roadmap
The previous post — Future-Proofing AI Systems — covered the general principle: build systems that become simpler as models improve, isolate the accommodations to current model weaknesses, and run a litmus test on every architectural decision. That's the theory.
This post applies the theory to a real production system: CTI's coaching layer. The architecture has been deployed for months, has thousands of real conversations behind it, and contains exactly the kind of pragmatic accommodations the general post warned about. Some of them will age well. Some of them are scaffolding that should come out the moment a Sonnet release makes them unnecessary. The point of this post is to be specific about which is which, and to put a review cadence around the system so that those decisions get made rather than just accreting.
The structure follows the litmus test:
- First: what should be reviewed on every model release.
- Second: where the architecture is already due for a revision regardless of what the model does.
- Third: what's missing — gaps between the system as built and the system the principle would suggest.
- Fourth: a concrete cadence so the reviews actually happen.
Architecture Areas to Review Periodically
These are the parts of the system that are defensible today but that should be re-examined on a schedule. Each one is a candidate for simplification or removal as model capabilities move.
1. The Intent Router — Biggest Candidate for Simplification
Haiku-as-router is a sensible accommodation: it's cheap, it's fast, and it gives the pipeline a deterministic short-circuit for off-topic messages. But it's also exactly the kind of scaffolding that earns the "review on every model release" label.
What to test on every Sonnet release:
- Does the new Sonnet reliably pick the right tool without an `<intent_hint>` injected into the system prompt? If yes, retire the hint generation step.
- Does it refuse off-topic messages reliably from the safety block alone? If yes, the off-topic short-circuit becomes a cost optimisation rather than a correctness mechanism. You might keep it — but for very different reasons, and with a different ownership story.
- The five-intent taxonomy is a snapshot of current product surface area. As features are added (the MCP server, more skills), the taxonomy will need to grow or generalise. Watch for the case where you're adding a sixth, seventh, eighth intent — that's the signal that the categorical approach is starting to strain and the model could probably do better with tool descriptions and `toolChoice: 'auto'`.
The 0.8 vs 0.6 confidence threshold asymmetry is a heuristic that should be backed by eval data, not vibes. Run it through the golden set quarterly and see whether the asymmetry still pays.
2. Hybrid Search — The Heaviest Piece of Legacy Reasoning
Four parallel RPCs (keyword + semantic across chat messages and session memories), Reciprocal Rank Fusion, query analysis, and a top-15 cap is a lot of machinery, and it was designed for a world where the model couldn't be trusted to retrieve its own context. Things to test periodically:
- Does query analysis still earn its keep? A Haiku call to extract keywords and timeframes adds latency and a failure mode. A modern Sonnet might extract those cleanly inline, or you might find that the raw user message works as well as the analysed version through `ts_rank` and embeddings.
- RRF K=60 is folklore. That value comes from the original Cormack et al. paper and isn't tuned to CTI's distribution. Worth running an eval comparing K=10, 30, 60, 100 on retrieval quality (the fusion step is sketched after this list).
- Top-15 cap is arbitrary. As context windows grow, the right number of retrieved chunks may be 30 or 50. As models get better at ignoring irrelevant context, the cost of including more drops.
- Threshold values (0.3 / 0.4) are static. Different query types likely warrant different thresholds. A specific question ("threshold power test") wants high precision; a vague one ("how am I going") wants high recall.
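For reference, the fusion step itself is small. Below is a minimal sketch of Reciprocal Rank Fusion over the four result lists, with K exposed as a parameter so an eval can sweep it; the type and variable names are illustrative rather than CTI's actual code:

```typescript
type RankedResult = { id: string; score: number };

// Reciprocal Rank Fusion: each source contributes 1 / (K + rank) per document.
// K=60 is the value from Cormack et al.; exposing it as a parameter makes the
// K=10/30/60/100 eval comparison a one-line change, and the limit parameter
// covers the top-15-cap question the same way.
function rrfMerge(
  resultLists: RankedResult[][], // e.g. [keywordChats, semanticChats, keywordMemories, semanticMemories]
  k = 60,
  limit = 15,
): RankedResult[] {
  const fused = new Map<string, number>();
  for (const list of resultLists) {
    list.forEach((result, index) => {
      const rank = index + 1; // ranks are 1-based
      fused.set(result.id, (fused.get(result.id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...fused.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}
```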
The bigger question to revisit annually: should hybrid search become a tool the model calls when it needs context, rather than a pipeline step that always runs? That inverts the architecture in a way that's hard to undo, but it's the natural endpoint as model agency improves. Today's "always retrieve, always inject" pattern is a pure expression of "the model can't be trusted to know what it doesn't know" — a temporary truth, not a permanent one.
3. The Ten-Layer System Prompt
Already flagged in the architecture post's "What I'd Do Differently" section. A few specific review prompts:
- Are all ten layers always justified? For a `route_search` intent on the globe view, do you really need profile memory, episodic context, and session insights? Each layer dilutes attention. Per-intent layer selection is a small change with potentially large quality wins (a sketch follows this list).
- Layer ordering is currently strict but never tested. Worth running an A/B where you swap, say, profile memory and ride data, or move the safety block to last. Recency effects in long prompts are real and shift between model versions.
- The `{{localDatetime}}` injection is the kind of thing that breaks subtly across model upgrades. The timezone bug history is the canary — check this on every major model bump.
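A per-intent layer map could be as small as the sketch below. The layer and intent names are inferred from this post's descriptions and are illustrative, not CTI's actual identifiers:

```typescript
type PromptLayer =
  | 'safety' | 'base' | 'profile_memory' | 'episodic_context'
  | 'session_insights' | 'ride_data' | 'skills' | 'intent_hint'
  | 'local_datetime' | 'user_message';

// Which layers each intent actually needs. Anything not listed for an intent
// is never assembled, so it cannot dilute attention or burn budget.
const layersByIntent: Record<string, PromptLayer[]> = {
  route_search: ['safety', 'base', 'local_datetime', 'user_message'],
  coaching_chat: [
    'safety', 'base', 'profile_memory', 'episodic_context',
    'session_insights', 'ride_data', 'local_datetime', 'user_message',
  ],
  // ...remaining intents
};

function selectLayers(intent: string, allLayers: PromptLayer[]): PromptLayer[] {
  const wanted = layersByIntent[intent];
  // Unknown intents fall back to the full ten-layer prompt.
  return wanted ? allLayers.filter((layer) => wanted.includes(layer)) : allLayers;
}
```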
4. Memory Extraction Merge Strategy
Shallow merge with `||` is the simplest possible thing and that's a virtue, but it's also lossy. Periodic review questions:
- What proportion of extractions overwrite a meaningful prior value vs. add a new key vs. confirm an existing one? There's no visibility into this today.
- The append-only `profile_memory_history` table is the right next step. Without it, you can't answer "when did the coach learn the FTP changed" — which matters for both debugging and training-load reasoning (see the sketch after this list).
- Memory extraction at temperature 0 over the last 10 messages is reasonable, but the 10-message window is arbitrary. A long coaching conversation might have the relevant fact at message 15.
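To make the lossiness concrete, here's a minimal sketch of the current-style `||` merge and the append-only companion it argues for; table and field names are hypothetical:

```typescript
type ProfileMemory = Record<string, unknown>;

// Today's merge: per-key "new || old". Any non-empty extracted value silently
// replaces the prior one, so "FTP: 250" overwriting "FTP: 235" leaves no record
// of when the change happened or what the value was before.
function mergeProfileMemory(prior: ProfileMemory, extracted: ProfileMemory): ProfileMemory {
  const merged: ProfileMemory = { ...prior };
  for (const [key, value] of Object.entries(extracted)) {
    merged[key] = value || merged[key];
  }
  return merged;
}

// The append-only companion: one row per changed key, written alongside the
// merge, so "when did the coach learn the FTP changed" becomes a query.
type ProfileMemoryHistoryRow = {
  userId: string;
  key: string;
  previousValue: unknown;
  newValue: unknown;
  sourceMessageId: string; // correlation back to the originating conversation
  recordedAt: string;      // ISO timestamp
};

function historyRows(
  userId: string,
  sourceMessageId: string,
  prior: ProfileMemory,
  extracted: ProfileMemory,
): ProfileMemoryHistoryRow[] {
  return Object.entries(extracted)
    .filter(([key, value]) => value && prior[key] !== value)
    .map(([key, value]) => ({
      userId,
      key,
      previousValue: prior[key],
      newValue: value,
      sourceMessageId,
      recordedAt: new Date().toISOString(),
    }));
}
```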
5. PII Redaction Coverage
Mutating in-place is a strong design — single source of truth. But:
- Redaction runs on user messages but not on tool results being fed back into context. A Strava activity description could contain an email. A ride title could contain a phone number. Worth auditing every place text enters the conversation, not just the user-typed parts.
- Disabling name detection is right for now, but as the model becomes more capable of inferring identity from combinations of signals (location + ride times + FTP), the privacy surface area grows. This isn't a 2026 problem but it's a 2027 one.
6. Model Slot Assignments
The slot system is well-designed for swapping. Things that should trigger a re-shuffle:
- Every Haiku release: does the new Haiku close the gap on tasks currently assigned to Sonnet? `coaching_chat` is the obvious candidate — most coaching responses don't need Sonnet's depth.
- Every Sonnet release: are there `analysis` slot tasks that now warrant Opus, or `smart` slot tasks where Sonnet is now overkill?
- Cross-vendor choices (Gemini Flash Lite for insights, Grok for image generation) are hedges but also a maintenance burden. Re-justify annually.
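In code, a re-shuffle amounts to a config diff plus an eval run. A sketch, with slot names taken from this post and model identifiers as placeholders:

```typescript
// Each slot is one line of config, so a post-release re-shuffle is a one-line
// change rather than a refactor. Model strings below are placeholders.
const modelSlots = {
  router: 'claude-haiku-latest',          // intent classification + off-topic short-circuit
  coaching_chat: 'claude-sonnet-latest',  // candidate to move down a tier on the next Haiku release
  analysis: 'claude-sonnet-latest',       // candidate to move up to Opus if depth becomes the constraint
  insights: 'gemini-flash-lite',          // cross-vendor hedge; re-justify annually
} as const;

type Slot = keyof typeof modelSlots;

function modelForSlot(slot: Slot): string {
  return modelSlots[slot];
}
```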
Where the Architecture Should Be Revised
These aren't "review periodically" items. They're known gaps that warrant scoping today, regardless of which model ships next.
Replace Fire-and-Forget With a Durable Job Queue
The current pattern — onFinish triggers extraction, indexing, and summary generation without retry or visibility — is a known reliability gap, called out in the architecture post. The right shape is probably:
- Enqueue extraction, indexing, and summary generation as jobs (Supabase queues, or a lightweight pg-boss / Inngest setup).
- Structured logging with correlation IDs from the originating chat message.
- Retry with exponential backoff for transient failures (model rate limits, embedding API hiccups).
- A dead-letter table for inspection.
This is the highest-leverage reliability improvement available. It converts silent data loss into visible, debuggable failures — and silent data loss is the worst kind because it shows up as inexplicable model behaviour weeks later when the missing memories never resurface.
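A minimal sketch of the enqueue side, assuming pg-boss; queue names, payload shape, and retry numbers are illustrative, and depending on the pg-boss version you may also need to create the queues and register workers explicitly:

```typescript
import PgBoss from 'pg-boss';

const boss = new PgBoss(process.env.DATABASE_URL!);
await boss.start();

// Instead of firing extraction, indexing, and summarisation from onFinish and
// hoping they succeed, enqueue each as a job that carries a correlation ID
// back to the originating chat message.
async function enqueuePostChatWork(chatMessageId: string, conversationId: string) {
  const payload = { chatMessageId, conversationId };
  const options = {
    retryLimit: 5,      // transient failures (rate limits, embedding hiccups) get retried
    retryBackoff: true, // with exponential backoff between attempts
  };
  await boss.send('memory-extraction', payload, options);
  await boss.send('message-indexing', payload, options);
  await boss.send('episode-summary', payload, options);
}
```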
Context Budget Management
Mentioned in the architecture post but worth scoping concretely. The right primitive is probably:
- A `ContextBudget` object passed through the pipeline that knows the current model's window, the tokens already committed to non-negotiable content (safety, base prompt, user message), and the remainder available for retrieved context (sketched below).
- Each context source declares a priority tier (safety: critical, profile: high, episodic: medium, session insights: low).
- A trim step that evicts lowest-priority content first when the budget is exceeded.
- Telemetry on what got trimmed, so you can see when the budget is binding.
Without this, you'll eventually hit a long-conversation case where context exceeds the window and the failure mode will be opaque.
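A sketch of the primitive under those assumptions; all names are hypothetical:

```typescript
type PriorityTier = 'critical' | 'high' | 'medium' | 'low';

type ContextSource = {
  name: string;    // e.g. 'profile_memory', 'episodic_context'
  tier: PriorityTier;
  tokens: number;  // measured at assembly time
  content: string;
};

const tierOrder: PriorityTier[] = ['low', 'medium', 'high', 'critical'];

class ContextBudget {
  constructor(
    private readonly windowTokens: number,    // current model's context window
    private readonly committedTokens: number, // safety block, base prompt, user message
  ) {}

  get remaining(): number {
    return this.windowTokens - this.committedTokens;
  }

  // Evict lowest-priority sources first until the retrieved context fits.
  // The trimmed list is returned so it can be emitted as telemetry.
  fit(sources: ContextSource[]): { kept: ContextSource[]; trimmed: ContextSource[] } {
    const kept = [...sources];
    const trimmed: ContextSource[] = [];
    const total = () => kept.reduce((sum, source) => sum + source.tokens, 0);
    for (const tier of tierOrder) {
      while (total() > this.remaining) {
        const index = kept.findIndex((source) => source.tier === tier);
        if (index === -1) break;
        trimmed.push(...kept.splice(index, 1));
      }
      if (total() <= this.remaining) break;
    }
    return { kept, trimmed };
  }
}
```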
Eval Feedback Into Routing Thresholds
The 0.8 / 0.6 confidence thresholds and the K=60 RRF value should not be hand-tuned constants in code. They should be config values that the eval suite can sweep. Tying the existing Evalite setup to these knobs gives you principled tuning instead of intuition tuning — and it makes the per-release review tractable, because "is the threshold still right" becomes a one-command answer.
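The shape of that sweep is straightforward once the knobs live in config rather than in code. A hypothetical sketch, not tied to Evalite's actual API; the config fields, golden-case shape, and `classifyIntent` function are assumptions for illustration:

```typescript
type RoutingConfig = { acceptThreshold: number; offTopicThreshold: number };
type GoldenCase = { message: string; expectedIntent: string };

// Assumed to exist: the router, parameterised by the config under test.
declare function classifyIntent(
  message: string,
  config: RoutingConfig,
): Promise<{ intent: string; confidence: number }>;

async function sweepThresholds(golden: GoldenCase[]) {
  const results: Array<RoutingConfig & { accuracy: number }> = [];
  for (const acceptThreshold of [0.6, 0.7, 0.8, 0.9]) {
    for (const offTopicThreshold of [0.5, 0.6, 0.7]) {
      const config = { acceptThreshold, offTopicThreshold };
      let correct = 0;
      for (const testCase of golden) {
        const { intent } = await classifyIntent(testCase.message, config);
        if (intent === testCase.expectedIntent) correct++;
      }
      results.push({ ...config, accuracy: correct / golden.length });
    }
  }
  // Highest-accuracy configuration first: "is the threshold still right" in one command.
  return results.sort((a, b) => b.accuracy - a.accuracy);
}
```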
Skills System: Discovery vs Preloading
Today there are two ways to surface skills to the model: pre-loading via slash commands (skills injected as layer 3) and runtime discovery (list_skills / load_skill). That's two ways to do the same thing, which is a future maintenance issue. Worth deciding which is the canonical path:
- If skills are mostly preloaded via slash commands, runtime discovery is dead code most of the time and should probably be removed.
- If runtime discovery becomes the primary path (more agentic, scales better as skill count grows), preloading becomes the optimisation case for known-deterministic intents.
The forcing function: when there are 15 skills, preloading them into every system prompt becomes infeasible. Better to make the architectural call now than to discover the limit at runtime.
What's Missing or Under-Developed
These aren't items that need revising — they're items that need adding. Each one is a place where the system would be more durable, debuggable, or improvable if the gap were filled.
Observability Beyond the Default Spans
The OpenTelemetry + Langfuse integration that landed after the original architecture post — covered in the evals loop post — already gives CTI a strong observability baseline. The Vercel AI SDK emits OTEL spans for model calls, tool executions, and streaming chunks; the LangfuseSpanProcessor forwards them to Langfuse, where P50/P95 latency, token-per-response trends, cost breakdowns by model tier, and full multi-step span trees are queryable per prompt version.
That foundation covers a lot. The remaining gaps are specifically about the parts of the pipeline the AI SDK doesn't see — the bespoke steps that wrap and feed the model calls:
- Custom pipeline-step spans. PII redaction, intent classification, query analysis, hybrid search (the four parallel RPCs + RRF merge), and memory extraction all sit outside the AI SDK's automatic instrumentation. Adding manual spans (`tracer.startActiveSpan('pipeline.hybrid_search', …)`) inside those functions would slot them into the existing Langfuse trace tree alongside the AI SDK spans, giving you a single waterfall view of where a slow response actually spent its time (a sketch follows below).
- Token accounting per prompt layer. Langfuse tracks total prompt tokens per request, but not which of the ten layers contributed how many. Tagging each layer with a measured token count at assembly time (and emitting it as a span attribute) would tell you when, say, episodic context starts dominating the budget — a prerequisite for the context budget primitive in the previous section.
- Tool call success/failure semantics. The AI SDK spans capture that a tool was called and how long it took, but not the outcome category — empty result vs. error vs. timeout vs. happy path. Structured events on each tool span (`outcome=empty`, `outcome=error`, `outcome=timeout`) would let you filter Langfuse for "every time `searchRides` returned nothing" without grep'ing logs.
The cost is small because the wiring already exists — it's adding spans to existing functions, not standing up a new telemetry stack. The value is that the metrics needed for almost every other optimisation on this list (context budget, ablation studies, cost modelling, degraded-mode tuning) become queryable from the dashboard you already have, rather than requiring new instrumentation each time.
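The manual-span addition is mechanical because the tracer already exists. A sketch for the hybrid search step, with the tracer name, span name, and attribute keys as illustrative choices:

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('cti-pipeline');

// Assumed to exist: the current hybrid search implementation.
declare function hybridSearch(query: string, userId: string): Promise<unknown[]>;

// Wrapping an existing pipeline function in a span slots it into the same
// Langfuse trace tree as the AI SDK's automatic spans.
async function hybridSearchTraced(query: string, userId: string) {
  return tracer.startActiveSpan('pipeline.hybrid_search', async (span) => {
    try {
      span.setAttribute('cti.user_id', userId);
      const results = await hybridSearch(query, userId);
      span.setAttribute('cti.result_count', results.length);
      span.setAttribute('cti.outcome', results.length === 0 ? 'empty' : 'ok');
      return results;
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      span.setAttribute('cti.outcome', 'error');
      throw error;
    } finally {
      span.end();
    }
  });
}
```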
A Regression Safety Net for Prompt Changes
The evals loop post covers the eval machinery. The relationship between prompt changes and eval runs deserves to be made explicit:
- Are evals run on every PR that touches a prompt file? (CI hook on `prompts/**` and `skills/**/SKILL.md`.)
- Is there a baseline-comparison mechanism, or does each run stand alone?
- Are eval results published somewhere you can see drift over time?
This is the mechanism that turns the "swap the model string and re-run evals" litmus test from aspiration into reality. Without CI-integrated evals, every model upgrade is a gamble.
Ablation Testing
Related but distinct. The system has ten prompt layers, four search sources, three memory tiers — but no documented evidence that each one improves outcomes. Periodic ablation studies (turn off episodic memory for a sample of conversations, measure quality delta) would tell you what's actually load-bearing vs. cargo-culted. This is also the cleanest way to identify simplification opportunities.
Cost Modelling
Implicit in the slot system but not surfaced. A cost-per-conversation metric, broken down by slot, would let you reason about:
- Whether the off-topic short-circuit is paying for itself.
- Whether memory extraction's frequency is justified.
- Whether moving more traffic from Sonnet to Haiku for simple chats would change the user experience meaningfully.
Cost is the proxy for "how much of this system is necessary". A pipeline step that costs 20% of every request and improves quality by 1% on the eval set is a candidate for removal. You can't have that conversation without the numbers.
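The calculation itself is trivial once token usage is queryable per slot. A sketch with placeholder per-million-token prices; substitute current rates and real slot names:

```typescript
type SlotUsage = { slot: string; inputTokens: number; outputTokens: number };

// Placeholder prices in dollars per million tokens.
const pricePerMillion: Record<string, { input: number; output: number }> = {
  router: { input: 0.8, output: 4 },
  coaching_chat: { input: 3, output: 15 },
  analysis: { input: 3, output: 15 },
};

// Aggregate one conversation's usage rows into a per-slot dollar figure.
function costPerConversation(usage: SlotUsage[]): Record<string, number> {
  const bySlot: Record<string, number> = {};
  for (const { slot, inputTokens, outputTokens } of usage) {
    const price = pricePerMillion[slot];
    if (!price) continue;
    bySlot[slot] =
      (bySlot[slot] ?? 0) +
      (inputTokens / 1_000_000) * price.input +
      (outputTokens / 1_000_000) * price.output;
  }
  return bySlot;
}
```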
A Degraded-Mode Story
What happens when:
- The embedding API is down? (Hybrid search degrades to keyword-only.)
- The Haiku slot is rate-limited? (Intent router fails — does the system default to "treat as coaching_chat" or fail closed?)
- Supabase pgvector is slow? (Search times out — is the response generated without context, or held?)
Each of these is a real failure mode in production. The architecture currently treats these as exceptional rather than designed-for, which is fine until it isn't. Designing the degraded-mode behaviour explicitly — even just documenting it — converts opaque outages into known-bad-but-bounded experiences.
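Even just encoding the fallback in code makes the behaviour explicit. A sketch of the first case, with hypothetical function names standing in for the real search implementations:

```typescript
// Degraded mode for the embedding API: fall back to keyword-only search
// rather than failing the whole response, and log the degradation so it is
// visible in traces instead of silent.
async function searchWithFallback(query: string, userId: string) {
  try {
    return await hybridSearch(query, userId); // keyword + semantic, RRF-merged
  } catch (error) {
    console.warn('hybrid search degraded to keyword-only', { userId, error });
    return await keywordSearch(query, userId); // known-bad-but-bounded
  }
}

declare function hybridSearch(query: string, userId: string): Promise<unknown[]>;
declare function keywordSearch(query: string, userId: string): Promise<unknown[]>;
```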
Conversation-Level Safety Circuit Breakers
The safety block handles individual messages, but there's no mechanism for "this conversation has gone sideways" — a user repeatedly probing for medical advice, or a user accumulating profile memory entries that contradict each other. A lightweight post-conversation review (could share infrastructure with the episode summary generation) that flags anomalies for review would close that gap.
Multi-Device / Multi-Session Coherence
Profile memory is user-scoped, which is correct. But sessions are per-chat. If a user has a coaching conversation on mobile and then opens the web app, what's the continuity story? The episodic context layer partially solves this but is opaque to the user. Worth scoping whether a "what we discussed recently" surfaceable view should exist.
Suggested Review Cadence
The general future-proofing principle only delivers value if the reviews actually happen. Putting them on a calendar is the operational version of the principle:
| Cadence | What to review |
|---|---|
| Every model release | Slot assignments, intent router necessity, prompt layer effectiveness, RRF parameters, threshold values |
| Monthly | Eval drift, cost per conversation, latency breakdown, off-topic false-positive rate |
| Quarterly | Memory extraction quality, threshold tuning sweep, ablation on prompt layers, vendor diversity audit |
| Annually | Full architectural review — should hybrid search become a tool? Should routing collapse into the main model? Should the skills system commit to one of preload-vs-discovery? |
The release-triggered reviews are the most important. A new Sonnet shipping is the moment to ask "what scaffolding can I delete now?" — not six months later when someone files a ticket about latency.
Highest-Leverage Next Moves
Prioritising the work above into a roadmap:
- Durable job queue for fire-and-forget work — biggest reliability win, unlocks everything downstream.
- Observability/tracing layer — without this, every other improvement is flying blind.
- `profile_memory_history` table — already scoped, fixes the temporal recall gap.
- Context budget primitive — prevents future opaque failures, makes prompt changes safer.
- CI-integrated evals on prompt changes — converts the existing eval work into a regression net and a model-upgrade enabler.
The first two are infrastructure that pays compounding interest. The remainder are specific gaps that have already been identified or that will bite within the next 6–12 months of CTI's growth.
The Throughline
The CTI architecture is a useful case study because it contains the full spectrum: durable assets that will outlive several model generations (the slot system, the typed tool surface, the trace + eval pipeline, the PII redactor as single source of truth), pragmatic accommodations to current model weaknesses that will probably need to come out (Haiku-as-router, query analysis, the strict ten-layer prompt, the always-on hybrid search), and gaps that need filling regardless of what models do (job queue, observability, context budget).
The future-proofing principle isn't "predict which way the models will move" — it's "be ready to delete the scaffolding when they move that way." The cadence above is the mechanism that makes that possible. Without the reviews, the scaffolding stays in the codebase forever, and CTI ships its third anniversary still routing to Haiku for an off-topic detection that Sonnet 6 handles natively.
The single highest-leverage thing on this list isn't any of the specific items. It's running the next-model-release review the day Sonnet 4.7 ships, holding the litmus test against every component, and actually deleting the things that no longer earn their keep.
This is what an Orbital architecture review looks like in practice: surfacing the temporary accommodations, scoping the durable upgrades, and putting the whole thing on a cadence that turns model improvements into quality wins instead of risks. Read the CTI case study →
The CTI series:
- Building Cycling Training Intelligence in One Week with Claude Code
- CTI's AI Architecture: How the Coaching Layer Works
- CTI's Reinforcing Evals Loop: Prompts, Skills, Traces & Telemetry
- CTI's MCP Server: Making the Coach Composable
- CTI's Fitness Metrics: TSS, CTL, ATL, TSB — the Numbers Behind the Coach
- Future-Proofing CTI: An Architecture Review and Roadmap