Building Healthcare AI You Can Trust

Written by Ramani Narayan | Jun 24, 2026 11:00:00 AM

Why the Harness — Not the Model — Turns “it might work” into “you can count on it”

Ramani Narayan · May 2026 · 6 min read

There’s a fundamental tension in healthcare AI. On one side, you have technologists excited about what’s possible. On the other, you have healthcare professionals pointing out that these systems “hallucinate” and can’t be trusted with patient safety.

Both perspectives have merit. Large language models do make things up. And healthcare is indeed a high-stakes environment where errors have consequences.

But framing this as an either-or choice misses the point. The question isn’t whether AI is perfect — it’s whether we can build systems that are reliable enough for specific use cases, with appropriate safeguards in place.

And here’s the part the debate usually skips: most of what makes a healthcare AI system trustworthy doesn’t live in the model at all. It lives in the infrastructure wrapped around it — the layer that controls what the model sees, what it’s allowed to do, how its work gets checked, and what happens when something goes wrong. That layer now has a name: the agent harness.

A raw language model is a stateless inference engine — feed it a prompt, get a completion, with no memory, no ability to call a tool, and no feedback loop to catch its own mistakes. The harness is everything that isn’t the model: the context controller, the tool router, the memory system, the verification layer, and the constraint enforcer. In healthcare, the harness is where trust is engineered.

Here’s how that works.

The Determinism Problem

At a healthcare AI conference, a speaker objected to AI systems because they’re “non-deterministic” — ask the same question twice, get different answers. For healthcare, he argued, this is unacceptable.

It’s a fair concern. But it also reflects a misunderstanding of how modern AI systems can be architected.

The language model itself may be non-deterministic. But the system around it doesn’t have to be. When facts need to be precise, you don’t rely on the model to generate them — you use the model to identify what facts to retrieve, then pull those facts directly from verified sources.

This is the harness asserting control over the model. The model becomes a translator between human language on one side and information sources and actions on the other, not the source of truth itself. The determinism lives in the tool layer, where a query for a patient’s medications returns the same complete list every time, regardless of how the model phrases the request.

Grounding: Trace Every Fact

The first principle of trustworthy healthcare AI is grounding — ensuring that every factual claim can be traced back to a source document, and, most importantly, verifying that the document asserts what the model claims it asserts. In harness terms, this is the verification layer: a check that runs against the model’s output before that output is allowed to reach a clinician.

Here’s how it works in practice:

When you ask a question, the system retrieves relevant documents from the patient’s record.
These documents are passed to the language model as context.
The model generates a response based on this context.
A separate verification process checks: does every medical fact in the response correspond to something in the source documents?
If the model introduced anything new, such as a “hallucinated” fact not in the sources, it gets flagged. If the response fails this check, the harness regenerates it and re-verifies until the output passes.

The key insight is using healthcare data standards as the verification framework. Standards like FHIR (Fast Healthcare Interoperability Resources) provide structured representations of clinical concepts: medications, procedures, conditions, observations. When the model’s output is translated back into these structured forms, it becomes possible to verify that nothing was invented. When the source is unstructured, other techniques are needed to confirm that the facts in the model output match the facts in the source text.

This separation matters more than it first appears. A model asked to check its own work tends to approve its own work: it reasons itself into believing a mistake isn’t a mistake. The fix, borrowed from adversarial network design, is to keep generation and evaluation architecturally distinct. The grounding check above is a deterministic sensor: it doesn’t ask the model whether it was right. It maps every clinical claim to a FHIR resource and a source span, and the ones that don’t map get caught. The generator never gets a vote on its own verdict.

Exhaustiveness: Miss Nothing

Some healthcare queries demand completeness. “What medications is this patient taking?” isn’t a question where “most of them” is an acceptable answer.

Standard language models aren’t designed for exhaustiveness. They’re designed to generate coherent, plausible text, and they’ll naturally conclude a response when it seems complete, even if they haven’t covered everything.

For exhaustive queries, the approach changes entirely. Instead of asking the model to generate a list of medications, you:

Ask the model to identify that this is a medication query.
Execute a direct database query that returns all medications.
Format that complete list for the user.

The language model helps with intent understanding and presentation. The database provides completeness guarantees. This is the tool-orchestration layer doing its job: the model selects the right tool, the harness executes it deterministically, and the structured result flows back into the answer. Completeness becomes a property of the query, not a hope about the model’s behavior.
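
As a rough illustration of that routing, here is a minimal Python sketch. Everything named in it is an assumption made for the example: the medications table, the classify_intent keyword rule standing in for the model’s intent step, and the in-memory SQLite database standing in for the clinical data store. The point is only that the list comes from the query, so it is complete and identical on every call.

```python
import sqlite3


def classify_intent(question: str) -> str:
    """Stand-in for the model's intent step (hypothetical keyword rule)."""
    q = question.lower()
    return "medication_list" if "medication" in q or "taking" in q else "other"


def list_medications(conn: sqlite3.Connection, patient_id: str) -> list:
    """Deterministic tool: return every medication row for the patient."""
    rows = conn.execute(
        "SELECT name, dose FROM medications WHERE patient_id = ? ORDER BY name",
        (patient_id,),
    ).fetchall()
    return [f"{name} {dose}" for name, dose in rows]


def answer(conn: sqlite3.Connection, patient_id: str, question: str) -> str:
    """Route the question to a tool; the model never writes the list itself."""
    if classify_intent(question) == "medication_list":
        meds = list_medications(conn, patient_id)
        return "Active medications: " + "; ".join(meds)
    return "UNSUPPORTED"


def demo_connection() -> sqlite3.Connection:
    """Build a tiny in-memory database with made-up rows for the example."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE medications (patient_id TEXT, name TEXT, dose TEXT)")
    conn.executemany(
        "INSERT INTO medications VALUES (?, ?, ?)",
        [
            ("p1", "metformin", "500 mg"),
            ("p1", "atorvastatin", "20 mg"),
            ("p2", "lisinopril", "10 mg"),
        ],
    )
    return conn
```

However the question is worded, once the intent step routes it to the medication tool, the answer contains every row the database holds for that patient and nothing from any other patient.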

Relevance: Nothing Extraneous

The flip side of exhaustiveness is relevance. AI responses shouldn’t include tangential information that’s technically accurate but unhelpful.

Part of this is alignment — training models to match human preferences, so the model learns what constitutes a helpful answer for healthcare professionals, which is often more focused and actionable than general-purpose AI responses. Techniques like RLHF (Reinforcement Learning from Human Feedback) can tune model behavior, though gathering high-quality feedback from busy clinicians is challenging.

But relevance is also a context-engineering problem, and that’s where the harness earns its keep. The most common failure in clinical AI isn’t a bad model; it’s context contamination. Hand the model a patient’s entire record when the question only needs the last three encounters, and the signal drowns in noise. A model that receives 200,000 tokens of loosely relevant history tends to hallucinate at higher rates than one that receives 2,000 tokens scoped tightly to the question. The harness’s job is to pull only the relevant fragments and scope the model’s view to the task at hand, which is simultaneously a quality mechanism and a security one. A pre-visit summary agent shouldn’t see billing history unless the question requires it.
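
One way to picture that scoping step is the short Python sketch below. The Fragment type, the TASK_SCOPES allow-list, and the limit of three fragments are all hypothetical choices made for illustration, not a description of any particular product. The sketch filters by category first and recency second, so out-of-scope data such as billing never reaches the model’s context.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Fragment:
    category: str  # e.g. "encounter", "billing" (illustrative labels)
    when: date
    text: str


# Hypothetical allow-list: which record categories each task may see.
TASK_SCOPES = {
    "pre_visit_summary": {"encounter", "medication", "lab"},
}


def scope_context(fragments, task, max_fragments=3):
    """Return only in-scope fragments, newest first, capped at max_fragments."""
    allowed = TASK_SCOPES.get(task, set())  # unknown task: nothing is visible
    visible = [f for f in fragments if f.category in allowed]
    newest_first = sorted(visible, key=lambda f: f.when, reverse=True)
    return newest_first[:max_fragments]
```

Defaulting an unknown task to an empty scope is a deliberate choice in this sketch: a task nobody has configured sees nothing, rather than everything.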

Ontologies: Speaking the Same Language

Healthcare has a useful asset that many industries lack: well-developed ontologies. These are standardized vocabularies that define medical concepts and their relationships.

For medications, there’s RxNorm. For diagnoses, there are ICD-10 and SNOMED CT. For lab tests, there’s LOINC. These standards mean that “Lipitor” and “atorvastatin” can be recognized as the same drug, and a query about “statins” can find patients on any medication in that drug class.

In a harness, ontology normalization belongs at the context-ingestion boundary — every concept entering the system gets mapped to standard terminology on the way in, so the model reasons over a consistent vocabulary rather than raw, inconsistent source text. AI systems that leverage these ontologies can provide more intelligent answers. Ask “Is this patient on a statin?” and the system can check all statin medications, not just the specific one you named.

Ontologies also feed the verification layer. When the model mentions a medication, it can be mapped to the standard terminology and checked against patient records in the same standard format — closing the loop between what the model said and what the record contains.
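
To make the normalization idea concrete, here is a toy Python sketch. The two lookup tables are hand-written stand-ins for a real terminology service such as RxNorm, and they use plain ingredient names rather than real concept identifiers. The function names are likewise invented for the example.

```python
# Toy stand-ins for a terminology service; a real system would query RxNorm.
BRAND_TO_INGREDIENT = {
    "lipitor": "atorvastatin",
    "zocor": "simvastatin",
    "crestor": "rosuvastatin",
}

DRUG_CLASS = {
    "atorvastatin": "statin",
    "simvastatin": "statin",
    "rosuvastatin": "statin",
    "metformin": "biguanide",
}


def normalize(name: str) -> str:
    """Map a brand or generic name to one canonical ingredient name."""
    key = name.strip().lower()
    return BRAND_TO_INGREDIENT.get(key, key)


def is_on_class(medication_names, drug_class: str) -> bool:
    """True if any listed medication normalizes into the given drug class."""
    return any(DRUG_CLASS.get(normalize(m)) == drug_class for m in medication_names)
```

Because both the record and the model’s output pass through the same normalize step, a mention of “Lipitor” in a note and “atorvastatin” in an answer compare as equal.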

The FHIR Foundation

FHIR has become the dominant standard for healthcare data interoperability. Since 2018, when the standard reached backward-compatible stability, adoption has accelerated. In the United States, federal rules now require certified EHR systems to support FHIR APIs.

This creates an opportunity for AI systems. Rather than building custom integrations for every EHR, you can build on top of FHIR and access data from any compliant system.

It also provides the schema your verification layer needs. FHIR resources have defined structures — a medication resource has specific fields, a procedure resource has different fields. When AI output is validated against these schemas, errors become detectable. This is what lets grounding be a deterministic check rather than a judgment call: a claim either resolves to a well-formed FHIR resource backed by the record, or it doesn’t.
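
A stripped-down version of that check might look like the following Python sketch. The claims here are simplified, FHIR-shaped dictionaries rather than real FHIR JSON, and the REQUIRED_FIELDS table and function names are assumptions made for the example. A claim passes only if it is well formed and matches a resource in the patient’s record, and everything else is flagged.

```python
# Simplified required fields per resource type (not the real FHIR schema).
REQUIRED_FIELDS = {
    "MedicationStatement": ("medication", "status"),
    "Condition": ("code",),
}


def is_well_formed(claim: dict) -> bool:
    """A claim must name a known resource type and fill its required fields."""
    required = REQUIRED_FIELDS.get(claim.get("resourceType"))
    return required is not None and all(claim.get(f) for f in required)


def claim_key(claim: dict) -> tuple:
    """Comparable identity of a well-formed claim (case-insensitive values)."""
    fields = REQUIRED_FIELDS[claim["resourceType"]]
    return (claim["resourceType"],) + tuple(str(claim[f]).lower() for f in fields)


def check_grounding(claims, record):
    """Split model claims into (grounded, flagged) against the patient record."""
    record_keys = {claim_key(r) for r in record if is_well_formed(r)}
    grounded, flagged = [], []
    for claim in claims:
        if is_well_formed(claim) and claim_key(claim) in record_keys:
            grounded.append(claim)
        else:
            flagged.append(claim)  # malformed, or not backed by the record
    return grounded, flagged
```

Nothing in this check asks the model for its opinion. The verdict depends only on the claim’s structure and the contents of the record, so the same input always produces the same result.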

Handling Unstructured Data

Here’s a complication: not all healthcare data is neatly structured. Significant amounts arrive as:

Faxed documents (yes, still)
Scanned images
Handwritten notes
PDF reports from external systems

Modern OCR and document understanding models can extract information from these unstructured sources. But they add another layer where errors can enter.

The approach is defense in depth, and it’s a natural fit for a harness. Rather than a fixed “for each document, extract, save” script, the ingestion path becomes an agent loop with judgment at each step: classify the document before processing, skip what’s already been handled, retry transient failures, and, critically, surface ambiguous results to a human instead of guessing. Text gets extracted using best-available technology, structured into FHIR resources, and tagged with confidence levels so downstream systems know when to apply extra scrutiny. Low confidence triggers escalation rather than silent propagation.

Guides and Sensors

There’s a clean way to organize every safeguard described so far, drawn from Martin Fowler’s framing of agentic systems: a harness controls the model with guides and sensors.

Guides are feedforward controls: they shape the model’s behavior before it acts. Scoped context, ontology-normalized inputs, precise tool descriptions, and structured priming all increase the odds of a correct first attempt.

Sensors are feedback controls: they observe after the model acts and enable correction. The grounding check, FHIR schema validation, the OCR confidence gate, and human-in-the-loop review are all sensors.

A trustworthy clinical harness stacks both at every layer, and the mix shifts over time. As you learn where an agent fails, you move work from sensors (detect and correct) toward guides (prevent and avoid), because prevention is cheaper than correction. Grounding catches hallucinations today; better-scoped context tomorrow means there are fewer to catch.

Incremental Trust and the Human Gate

Trust shouldn’t be granted all at once. The right model for clinical AI is “verify, then calibrate trust”: new task types start with a human in the loop at every significant decision, and the gates relax only as the agent demonstrates reliability, moving from review-everything to sampling to exception-only.

Two things should always pull a human back into the loop regardless of track record: the stakes of the next action and the confidence of the current output. An irreversible step, such as submitting an order or writing to the record, warrants a checkpoint. So does a low-confidence answer, the kind that should never reach a physician unflagged. The harness is what enforces these checkpoints consistently, rather than leaving them to the model’s discretion.

The Integration Imperative

Trust isn’t just about technical accuracy; it’s also about how the system integrates into workflows.

Healthcare professionals won’t trust a system they have to leave their normal workflow to access. If using the AI requires logging into a separate application, finding the right patient, and navigating unfamiliar interfaces, adoption will suffer regardless of accuracy.

This is why successful healthcare AI companies focus intensely on integration with existing EHR systems. The AI should surface within the tools healthcare workers already use, providing insights and capabilities without requiring workflow disruption. It also means the harness must carry workflow state across turns: if a physician clarifies something in turn three, the system can’t have forgotten it by turn seven. Stateless sessions are a harness failure, not a model limitation, and they erode trust fast.

Security, Privacy, and Auditability by Design

Healthcare AI systems handle the most sensitive data imaginable. Beyond accuracy, trust requires:

HIPAA compliance: Meeting all regulatory requirements for protected health information
SOC 2 certification: Demonstrating security controls to enterprise customers
Data isolation: Ensuring each healthcare organization’s data is separate
Deployment flexibility: Offering on-premises options for organizations with strict data residency requirements
A complete audit trail: Every retrieval, every model output, every verification verdict, and every human escalation logged with enough context to reconstruct the system’s decision process after the fact

That last one is where the harness becomes a compliance asset rather than just an engineering convenience. When a downstream decision based on model output turns out to be wrong, you need to reconstruct exactly what the model was told, what it said, and whether the system processed it correctly. In regulated settings, that traceability isn’t a nice-to-have — it’s foundational. These aren’t afterthoughts; they shape system architecture from day one.
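
For a sense of what that traceability could look like in code, here is a minimal Python sketch of an append-only audit log. The AuditLog class, its field names, and the four event kinds are assumptions chosen for illustration. The log lives in memory only, whereas a production system would need durable, access-controlled storage.

```python
import time
from dataclasses import dataclass, field

# Illustrative event kinds: what was retrieved, what the model said,
# what the verifier decided, and when a human was pulled in.
EVENT_KINDS = ("retrieval", "model_output", "verification", "escalation")


@dataclass
class AuditLog:
    """Append-only, in-memory log; events are never edited or removed."""

    events: list = field(default_factory=list)

    def record(self, request_id: str, kind: str, detail: dict) -> None:
        """Append one event, rejecting kinds the log does not know about."""
        if kind not in EVENT_KINDS:
            raise ValueError(f"unknown event kind: {kind}")
        self.events.append(
            {
                "seq": len(self.events),  # position in the global log
                "ts": time.time(),
                "request_id": request_id,
                "kind": kind,
                "detail": detail,
            }
        )

    def reconstruct(self, request_id: str) -> list:
        """Return every event for one request, in the order it happened."""
        return [e for e in self.events if e["request_id"] == request_id]
```

Calling reconstruct with a request ID replays what the model was given, what it produced, and how the verifier ruled, which is the sequence an after-the-fact review needs.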

The Realistic Bar

Perfection isn’t the right standard. The right question is: is the AI system better than the alternative?

Today, physicians miss information because it’s buried in EHR systems. Patients get asked the same questions repeatedly because their information isn’t accessible. Care varies based on which doctor happens to see you because institutional knowledge isn’t shared.

An AI system that reduces these problems, while maintaining appropriate safeguards and human oversight, represents genuine progress. It doesn’t need to be perfect. It needs to be better than the status quo, with a clear understanding of its limitations.

And the lever that gets you there is rarely the model. The difference between a clinical AI demo that impresses and a deployment you can stand behind is almost entirely harness: scoped context instead of the whole chart, deterministic grounding instead of self-evaluation, completeness from the database instead of the model’s best guess, confidence gates instead of silent failure, and an audit trail instead of a black box. The model is a component. The harness is what makes it trustworthy.

That’s the bar for trustworthy healthcare AI: reliable enough for defined use cases, honest about its limitations, and integrated with human judgment rather than trying to replace it.

About ThetaRho

This is part of our ongoing series on AI in healthcare. The opportunity is real: not to replace healthcare professionals, but to give them back the time and information they need to provide excellent care. The work of getting there is infrastructure work: grounding, guardrails, and auditability built once, at the layer beneath every clinical application, so every AI application that runs on ThetaRho inherits them without starting over.
