The Document Substrate: Solving Garbage In, Garbage Out for AI on Business Documents

The failure mode is always the same. Someone connects an LLM to a folder of PDFs, asks it a question about a client's bank statement, and it either hallucinates the answer or refuses to engage because the document text came out as garbled OCR soup. The model isn't broken. The input is. Garbage in, garbage out — and most document pipelines are producing garbage.

Building useful AI on real business documents requires solving four problems that are mostly invisible until you try to skip them. This post is about those problems, the pipeline that solves them, and what becomes possible on the other side.


The Problem With "Just Give It the PDF"

The naive approach is to extract text from a PDF and drop it into a prompt. It works in demos, on clean documents, on documents where the answer is the first thing the model finds. In production, it fails in ways that erode trust fast:

OCR quality is wildly uneven. A scanned bank statement that looks pristine to a human eye might have columns misaligned, decimal points dropped, or handwritten annotations merged into the surrounding text. The model reads what the parser gives it — and if the parser gave it garbage, that's what the answer will be based on.

Tables don't survive naive extraction. A statement with five columns and two hundred rows becomes a stream of interleaved numbers with no structural cues. The model can't tell a withdrawal from a deposit. Context that was obvious on the page is lost entirely.

PII is everywhere, and you can't just send it to any LLM. Invoices, tax returns, bank statements, legal records — the documents most worth building AI on are also the documents most laden with personal information. Sending raw client documents to a third-party model is a privacy and compliance problem, not just a technical one. For most professional practices, it's a non-starter.

Unstructured text retrieval is unreliable for precise questions. Semantic search finds roughly relevant chunks. But "what was the total GST on supplier invoices in Q2?" isn't a semantic question — it's an aggregation over structured data that never got extracted. The model reasons over prose when it should be querying a table.

These four problems compound. A bad parse produces bad text. Bad text produces bad embeddings. Bad embeddings produce irrelevant retrieval. And the model, no matter how capable, can only work with what retrieval gives it.

The document substrate is the layer that solves all four before any of that happens.


The Pipeline: Four Steps, One Hard Rule

The substrate runs every uploaded document through four sequential stages:

Upload → Parse → Redact → Extract → Chunk + Embed → Ready

One hard rule governs the whole pipeline: only the original file and the encrypted entity vault ever hold raw PII. Everything downstream — page text after redaction, chunks, embeddings, structured extracts, prompts, and model responses — contains only opaque placeholders. Real values are decrypted server-side, briefly, only for an authorised user.

That rule is why the pipeline is worth having. It's the thing that makes the rest safe.

Parse

The first problem is getting honest text out of the document. Not flat text extraction — structured, page-aware OCR that understands layouts, tables, multi-column designs, and mixed text and figures.

LlamaParse handles this with three quality tiers: a fast mode for clean digital PDFs, a standard agentic mode for most business documents, and a premium mode for the highest-complexity layouts — tax returns, multi-page financial statements, forms with dense tabular data. The tier is configurable per document type, so you only pay for the precision a document actually needs.
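
As a rough illustration of per-document-type tiering, the lookup below maps a document type to one of the three tiers described above. The tier labels, the mapping, and the `tier_for` helper are all assumptions made for this sketch; they are not LlamaParse parameter names.

```python
# Hypothetical tier labels mirroring the three tiers described above.
DEFAULT_TIER = "standard"

# Assumed mapping: only document types that need something other than
# the default are listed.
TIER_BY_DOC_TYPE = {
    "tax_return": "premium",      # dense forms and tables
    "digital_receipt": "fast",    # clean digital PDF (hypothetical type)
}


def tier_for(doc_type: str) -> str:
    """Return the parse tier for a document type, falling back to the default."""
    return TIER_BY_DOC_TYPE.get(doc_type, DEFAULT_TIER)
```

Keeping the default at the middle tier means an unregistered document type is never silently parsed at the cheapest setting.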

The output is stored per page, with both the raw text and the layout JSON. The layout matters downstream: it's what lets the substrate record each chunk's bounding box (bbox), its physical location on the page, so citations can deep-link to the exact page a figure came from. Parse quality is the substrate's foundation. Everything downstream is only as good as what parse produces.

Redact

This is the step that changes everything about what you can build.

Redaction is a three-stage hybrid designed to fail safe:

  1. Regex rules catch structured, high-confidence values first — IRD numbers, bank account formats, credit cards (validated with a Luhn checksum), email addresses, phone numbers, postal codes. These are the patterns you most cannot afford to miss, and pattern-matching gets them reliably.

  2. LLM fallback (Claude Haiku) runs over the text after the regex pass and catches what patterns can't: people's names, organisations, physical addresses. Critically, the model sees text from which the structured PII has already been removed — so it's reasoning over safer input, and its output is a list of entities to redact, not a free-form response that could introduce new risk.

  3. A smoke check runs an independent PII detector over the final redacted text. Its result is treated as a warning, not a blocker, because its false-positive rate is too high to fail on, but it surfaces anything that slipped through both previous stages.

Each detected value gets a stable placeholder: [PERSON_1], [BANK_ACCOUNT_2], [IRD_1]. The same entity mentioned twice in one document gets the same placeholder — so the model can reason about relationships between occurrences. The real value is encrypted at rest with AES-256-GCM into the entity vault. Nothing downstream ever sees it.
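
A minimal sketch of the regex stage and the stable placeholders, assuming only two rules (email and credit card). The patterns, the `RULES` list, and the `redact` function are illustrative names, not the substrate's actual rule set, and the returned map stands in for what a real system would encrypt into the vault.

```python
import re


def luhn_valid(number: str) -> bool:
    """Return True if the digits in the string pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    # Double every second digit, counting from the right.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


# Hypothetical rules: (label, pattern). Real rules would be stricter.
RULES = [
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("CREDIT_CARD", re.compile(r"\b\d(?:[ -]?\d){12,18}\b")),
]


def redact(text: str):
    """Replace detected values with stable placeholders.

    Returns (redacted_text, placeholder -> raw value). The same raw value
    always maps to the same placeholder within one call.
    """
    by_value = {}   # (label, raw) -> placeholder
    counters = {}   # label -> last number issued

    for label, pattern in RULES:
        def substitute(match):
            raw = match.group(0)
            if label == "CREDIT_CARD" and not luhn_valid(raw):
                return raw  # fails the checksum, so leave it untouched
            key = (label, raw)
            if key not in by_value:
                counters[label] = counters.get(label, 0) + 1
                by_value[key] = f"[{label}_{counters[label]}]"
            return by_value[key]

        text = pattern.sub(substitute, text)

    vault = {placeholder: raw for (_, raw), placeholder in by_value.items()}
    return text, vault
```

This sketch keys placeholders on the exact matched string, so the same card number written with different spacing would get two placeholders; normalising values before lookup would be the obvious refinement.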

The query path has a matching pre-flight check: before any LLM call is made, the constructed prompt is scanned for PII patterns and the call fails closed if anything is found. Redaction is a security guarantee, not a best-effort feature.
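
The fail-closed behaviour can be sketched as a guard that every outbound prompt must pass. The patterns here are deliberately crude and assumed for illustration, and `send` is a hypothetical stand-in for whatever function actually calls the model.

```python
import re


class PiiLeakError(Exception):
    """Raised when a prompt appears to contain raw PII."""


# Hypothetical patterns; a real check would reuse the redaction rule set.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "long_digit_run": re.compile(r"\b\d(?:[ -]?\d){8,}\b"),  # 9+ digits
}


def preflight(prompt: str) -> str:
    """Return the prompt unchanged, or raise if any PII pattern matches."""
    hits = [name for name, pattern in PII_PATTERNS.items() if pattern.search(prompt)]
    if hits:
        # Fail closed: report which rules fired, never the matched value.
        raise PiiLeakError(f"prompt blocked, rules matched: {hits}")
    return prompt


def call_model(prompt: str, send):
    """Send the prompt only if it passed the pre-flight check."""
    return send(preflight(prompt))
```

Because `preflight` runs before `send` is invoked, a blocked prompt never leaves the process.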

Extract

Parse and redact give you clean, safe text. Extract turns that text into structured data.

When a document has a registered type — bank_statement, utility_bill, invoice, tax_return — the substrate runs the redacted text against a schema for that type. The schema defines what to pull out, what to call it, and how to handle fields that can't be found (return null, never hallucinate). The result is stored as typed JSON.

There are two extraction engines, and the choice between them matters:

AI SDK extraction (Claude Sonnet + Zod schema validation) works from the redacted text directly. Best for flat, scalar data — invoice totals, utility bill amounts, policy numbers. Fast, cheap, and accurate on clean structured fields that survive the text representation.

LlamaExtract is better for layout-heavy documents — bank statement transaction tables, multi-column forms, anything where the column boundaries and row relationships matter as much as the values. Where it needs the original document layout, the extracted output is post-redacted against the vault before it's stored, so the PII boundary holds.

The schema is what makes extraction reliable. Without it, you're asking a model to guess what matters. With it, you get the same fields, in the same format, every time, validated before storage. That is the difference between "the model probably pulled the right numbers" and "the numbers were validated against the schema and the types are guaranteed."
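
To make "validated before storage" concrete, here is a small Python stand-in for what a Zod schema would do for an invoice. The field names and the `validate_invoice` helper are assumptions for this sketch; the point is that missing fields become null and wrongly typed values are rejected rather than stored.

```python
# Hypothetical invoice fields and their expected types.
INVOICE_FIELDS = {"invoice_number": str, "total": float, "gst": float}


def validate_invoice(raw: dict) -> dict:
    """Validate a model's raw output against the invoice field table.

    Missing or null fields are stored as None. Unknown fields and wrongly
    typed values raise ValueError, so nothing unvalidated is stored.
    """
    unknown = set(raw) - set(INVOICE_FIELDS)
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")

    result = {}
    for name, expected in INVOICE_FIELDS.items():
        value = raw.get(name)  # absent field -> None, never a guess
        if value is not None:
            # Accept whole numbers for float fields, but not booleans.
            if expected is float and isinstance(value, int) and not isinstance(value, bool):
                value = float(value)
            if not isinstance(value, expected):
                raise ValueError(
                    f"{name}: expected {expected.__name__}, got {type(value).__name__}"
                )
        result[name] = value
    return result
```

A total that arrives as the string "115.00" is rejected here instead of being coerced, which keeps the stored types trustworthy for later aggregation.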

Chunk + Embed

The final stage turns the redacted text into a searchable index. Page-aware chunking splits the text into overlapping chunks (~800 tokens, ~100 token overlap), each recording its page number and bounding-box position on the original document. Each chunk is embedded using text-embedding-3-small and stored with its embedding vector.
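
The overlapping windows can be sketched as follows. This version approximates tokens with whitespace-separated words and omits bounding boxes and the embedding call, so it shows only the windowing; `Chunk` and `chunk_pages` are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    page: int      # 1-based page number the chunk came from
    start: int     # index of the chunk's first token within that page
    tokens: list   # the tokens in this window


def chunk_pages(pages, size=800, overlap=100):
    """Split each page's redacted text into overlapping token windows.

    Chunks never cross a page boundary, so every chunk has one page number.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for page_no, text in enumerate(pages, start=1):
        tokens = text.split()  # assumption: words stand in for model tokens
        for start in range(0, len(tokens), step):
            chunks.append(Chunk(page_no, start, tokens[start:start + size]))
            if start + size >= len(tokens):
                break  # this window already reaches the end of the page
    return chunks
```

With the defaults, consecutive chunks on a page share their last and first 100 tokens, so a sentence that straddles a boundary appears whole in at least one chunk.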

The key point: embeddings are computed from the redacted text, never the original. The retrieval layer only ever returns placeholder-form content. The security model doesn't have a RAG exception.


Why Redaction Is Liberating, Not Limiting

It sounds like a constraint. In practice, redaction is what unlocks everything.

Without it, you can't send sensitive client documents to a modern cloud LLM at all. The best you can do is either run a smaller local model on-premises and accept the capability trade-off that entails, or build elaborate legal frameworks around data processing agreements that most firms don't have the resources to maintain.

With it, the LLM only ever sees [PERSON_1] and [BANK_ACCOUNT_2]. It doesn't know what those values are. It can still reason about relationships between them — "the same person appears in both documents"; "this account number is referenced three times" — but it never handles the raw sensitive data. You can use any capable model, hosted anywhere, without the compliance exposure of sending real client data offsite.

Real values are re-introduced exactly once: in the hydration step, server-side, after the model has finished reasoning, just before the response reaches the authenticated user. The model does the hard reasoning; the hydration step makes the answer human-readable. The audit log captures every hydration — who, when, which documents, which placeholders were substituted.
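
A minimal sketch of the hydration step, assuming the vault has already been decrypted into a plain placeholder-to-value mapping for the authorised user. The `hydrate` signature and the audit record's field names are hypothetical.

```python
import re
from datetime import datetime, timezone

# Matches placeholders such as [PERSON_1] or [BANK_ACCOUNT_2].
PLACEHOLDER = re.compile(r"\[[A-Z_]+_\d+\]")


def hydrate(response, vault, user, document_ids, audit_log):
    """Swap placeholders for real values and record who saw what.

    `vault` maps placeholder -> decrypted value (assumed already decrypted
    server-side). Unknown placeholders are left as-is rather than guessed.
    """
    used = []

    def substitute(match):
        token = match.group(0)
        if token in vault:
            used.append(token)
            return vault[token]
        return token

    text = PLACEHOLDER.sub(substitute, response)
    audit_log.append({
        "user": user,
        "at": datetime.now(timezone.utc).isoformat(),
        "documents": list(document_ids),
        "placeholders": sorted(set(used)),
    })
    return text
```

The audit entry records placeholder names rather than the real values, so the log itself stays on the safe side of the PII boundary.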

The result is a security story you can explain to a client in one sentence: the AI never saw your name, your account number, or your tax ID.


Why Schemas Are Not Optional

There's a temptation to treat document AI as a pure retrieval problem: chunk the text, embed it, retrieve the most relevant chunks, let the model answer. This works for some questions. It fails systematically for others.

"What were the total withdrawals from this account in March?" is not a retrieval question. It's an aggregation over structured data. If the bank statement's transactions were extracted into typed JSON at ingest time, that question is a SQL query over document_extracts. It's fast, deterministic, and correct. If the transactions weren't extracted, the model has to infer totals from retrieved chunks, and it will do so convincingly and sometimes incorrectly.

The same applies to every document with repeating structured content: invoice line items, contract clauses, tax return schedules. Schemas defined up front determine whether the substrate can answer precise questions reliably or only approximately.
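
The March withdrawals question can be sketched as a query. The table name document_extracts comes from the text above, but the column layout here (one row per transaction, amounts in integer cents, withdrawals negative) is an assumption made so the example runs on an in-memory SQLite database.

```python
import sqlite3


def total_withdrawals_cents(conn, document_id, start, end):
    """Sum withdrawals (negative amounts) for one document in [start, end)."""
    row = conn.execute(
        """
        SELECT COALESCE(SUM(-amount_cents), 0)
        FROM document_extracts
        WHERE document_id = ?
          AND amount_cents < 0
          AND txn_date >= ? AND txn_date < ?
        """,
        (document_id, start, end),
    ).fetchone()
    return row[0]


# Hypothetical flattened layout: one row per extracted transaction.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE document_extracts "
    "(document_id TEXT, txn_date TEXT, amount_cents INTEGER)"
)
conn.executemany(
    "INSERT INTO document_extracts VALUES (?, ?, ?)",
    [
        ("doc-1", "2024-03-02", -12050),  # withdrawal
        ("doc-1", "2024-03-15", 90000),   # deposit, ignored by the query
        ("doc-1", "2024-03-28", -7950),   # withdrawal
        ("doc-1", "2024-04-01", -1000),   # outside March
    ],
)

march_total = total_withdrawals_cents(conn, "doc-1", "2024-03-01", "2024-04-01")
```

The same inputs always produce the same total, which is the property a retrieval-plus-inference answer cannot offer.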

Schemas are also the mechanism for adding document types to the substrate. An ExtractSchema bundles four things: the document type name that triggers it, a Zod schema defining the output shape, a system prompt instructing the extractor on what to pull and how to handle missing fields, and an explicit choice of extraction engine. Adding support for a new document type is adding one file and one registry entry. The pipeline handles the rest.
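
The four-part bundle and the registry might look something like this. It is a Python stand-in for illustration: a plain validation function takes the place of the Zod schema, and the engine labels, `register`, and `schema_for` are assumed names.

```python
from dataclasses import dataclass
from typing import Callable, Optional

ENGINES = {"ai_sdk", "llama_extract"}  # assumed labels for the two engines


@dataclass(frozen=True)
class ExtractSchema:
    doc_type: str                     # document type that triggers this schema
    validate: Callable[[dict], dict]  # stands in for the Zod schema
    system_prompt: str                # what to pull, how to handle gaps
    engine: str                       # explicit choice of extraction engine


REGISTRY = {}


def register(schema: ExtractSchema) -> ExtractSchema:
    """Add a schema to the registry, rejecting duplicates and unknown engines."""
    if schema.engine not in ENGINES:
        raise ValueError(f"unknown engine: {schema.engine}")
    if schema.doc_type in REGISTRY:
        raise ValueError(f"already registered: {schema.doc_type}")
    REGISTRY[schema.doc_type] = schema
    return schema


def schema_for(doc_type: str) -> Optional[ExtractSchema]:
    """Return the schema for a document type, or None if it has no schema."""
    return REGISTRY.get(doc_type)


def validate_utility_bill(raw: dict) -> dict:
    """Hypothetical validator: keep one field, null when absent."""
    return {"amount_due": raw.get("amount_due")}


register(ExtractSchema(
    doc_type="utility_bill",
    validate=validate_utility_bill,
    system_prompt="Extract the amount due. Return null for missing fields.",
    engine="ai_sdk",
))
```

A document type with no registry entry simply gets no structured extract; it is still parsed, redacted, chunked, and embedded.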


What You Can Build On Top

The substrate isn't an application. It's a foundation — the document-handling layer that project-specific AI code builds on. The substrate API exposes search, hydrate, extract retrieval, and audit; project-specific logic handles the chat UI, reporting, agents, and domain workflows.

A few concrete applications for professional practices:

Accounting and reconciliation. Upload a client's bank statements and supplier invoices. Extraction produces structured transaction data (dates, amounts, references, categorisations) ready to feed reconciliation without manual rekeying. Queries like "what was the total GST on supplier invoices in Q2?" are answered from the extracted data, not from a retrieval guess.

Financial advice and planning. A financial planner working across a client's tax returns, KiwiSaver statements, and investment reports can ask cross-document questions and get answers with citations, without any of the client's PII leaving the application boundary or reaching a model that hasn't been explicitly cleared to receive it.

Legal document review. Contracts, lease agreements, regulatory submissions — the substrate parses, redacts, extracts key clauses and dates into schema-defined fields, and makes the document corpus queryable. The lawyer asks; the model answers from redacted content; real values hydrate before the response reaches the screen.

Multi-client separation. The substrate's subjects table gives each client their own document group, with queries scoped to that group by default and cross-client queries gated explicitly. An accounting practice with fifty clients doesn't need fifty substrate installs — just one, with fifty subjects, each isolated by row-level security.

Onboarding and triage. Drop a batch of mixed documents from a new client. Parse and extraction classify and structure them automatically. A query over the collection surfaces what's present, what's missing, and what needs follow-up — faster than manual review, and with a searchable record of what was found.


The Substrate Is the Work That Makes AI Trustworthy

The gap between "impressive demo" and "system a professional would stake their reputation on" is almost always in the input layer. The model is capable. The documents are the problem.

Solving that in a principled way — with high-fidelity parsing, systematic PII redaction that lets you use any LLM safely, schema-driven extraction that turns documents into queryable structured data, and a searchable vector index built on the redacted layer — is the work that turns document AI from a liability into an asset.

The substrate is that layer. Once it exists, the project-specific work — the chat UI, the reporting dashboards, the domain-specific workflows — builds on a foundation that's already handling the hard parts correctly.


The document substrate is one of the reusable components we deploy in AI engagements at Orbital. If you're building on top of sensitive business documents and want to do it without the compliance exposure, get in touch at info@orbital.co.nz.
