The Applied AI Razor - Medium

5 Lessons From Building MCPs for Regulated Industries

deepsense.ai — Fri, 05 Jun 2026 12:41:37 GMT

Building MCPs for healthcare and life sciences requires more than API integration. Compliance, auditability, and security shape the architecture from day one.

Most discussions around Model Context Protocol focus on connectivity: giving AI systems access to tools, databases, and external services.

In regulated industries, however, the challenge goes much deeper.

Through our work building MCP infrastructure for healthcare and life sciences deployments, we’ve seen that the hardest problems are rarely related to the protocol itself. They emerge at the intersection of security, compliance, governance, and production operations.

This article summarizes several lessons from that experience.

1. MCPs Are Not Just API Integrations

Many teams approach MCP implementation as a connectivity problem.

In regulated environments, the majority of complexity sits outside the integration layer itself.

Teams must account for:

access control,
auditability,
regulatory requirements,
data governance,
security boundaries.

What appears to be a straightforward connector often becomes a critical compliance component once sensitive healthcare or research data enters the workflow.


2. Compliance Requirements Shape Architecture

One of the most common misconceptions is that compliance can be added later.

In practice, frameworks such as HIPAA, FDA regulations, and GxP standards influence architectural decisions from the beginning.

These requirements affect:

system boundaries,
logging strategies,
validation processes,
access management,
operational controls.

Architectures optimized only for flexibility or speed often require significant redesign once regulatory requirements enter the picture.

3. Security Boundaries Matter More Than Features

Healthcare AI systems operate around highly sensitive information and critical workflows.

For that reason, MCP servers become more than integration points. They become security boundaries.

Several architectural patterns proved especially important:

isolated services,
strict network controls,
role-based access,
encrypted communication,
controlled tool exposure.

The goal is not simply connecting models to data. The goal is creating controlled and auditable pathways between AI systems and regulated environments.
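As a rough illustration of role-based access combined with controlled tool exposure, the sketch below checks every tool call against an allowlist and records the decision. The role names, tool names, and the `AuditLog` class are hypothetical and are not taken from our deployments.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical role-to-tool allowlist; role and tool names are illustrative only.
ROLE_TOOLS = {
    "clinician": {"search_trials", "read_protocol"},
    "analyst": {"search_trials"},
}


@dataclass
class AuditLog:
    entries: list = field(default_factory=list)

    def record(self, user_id: str, role: str, tool: str, allowed: bool) -> None:
        # Store who asked for what and the decision, never the tool arguments.
        self.entries.append({
            "ts": datetime.now(timezone.utc).isoformat(),
            "user_id": user_id,
            "role": role,
            "tool": tool,
            "allowed": allowed,
        })


def authorize_tool_call(user_id: str, role: str, tool: str, audit: AuditLog) -> bool:
    # Unknown roles get an empty allowlist, so the default is deny.
    allowed = tool in ROLE_TOOLS.get(role, set())
    audit.record(user_id, role, tool, allowed)
    return allowed
```

Denied calls are logged as well as allowed ones, so the audit trail shows attempted access and not only successful access.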

4. Observability Cannot Come at the Cost of Privacy

Production systems require monitoring.

Regulated systems require privacy.

Balancing both is more difficult than it initially appears.

In our deployments, infrastructure metrics, availability indicators, traffic patterns, and operational health signals remain essential. At the same time, user requests and responses containing potentially sensitive information should not become part of monitoring pipelines.

This creates a different observability model than many AI teams are accustomed to.
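One way to picture that model is a wrapper that emits operational signals while never touching payload values. This is a minimal sketch under our own assumptions: `call_with_metrics`, the `sink` list, and the metric field names are illustrative, not part of any MCP library.

```python
import time
from typing import Any, Callable


def call_with_metrics(tool_name: str, handler: Callable[[dict], Any],
                      request: dict, sink: list) -> Any:
    """Run a tool handler and emit operational metrics only."""
    start = time.perf_counter()
    status = "ok"
    try:
        return handler(request)
    except Exception as exc:
        # Record the error class, not the message, which may echo user input.
        status = type(exc).__name__
        raise
    finally:
        sink.append({
            "tool": tool_name,
            "status": status,
            "duration_ms": round((time.perf_counter() - start) * 1000, 3),
            "request_fields": len(request),  # shape only, no values
        })
```

The metric record carries the tool name, outcome, and timing, which is enough for availability and latency dashboards, while request and response content stays out of the monitoring pipeline.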


5. Reliability Is a Compliance Requirement

Performance still matters.

A secure system that is too slow, too fragile, or too difficult to operate ultimately fails its users.

Several design decisions helped balance reliability with compliance requirements:

stateless services,
intelligent caching,
explainable error handling,
infrastructure isolation,
continuous monitoring.

The objective is not simply to make systems compliant. It is to make them usable, scalable, and maintainable in production environments.

What This Means for Enterprise AI Teams

As AI systems gain access to enterprise tools and data through protocols such as MCP, architecture decisions become increasingly important.

In regulated industries, compliance is not a layer added after deployment. It becomes part of the system design itself.

Teams that account for governance, security, auditability, and operational reliability early can move faster when transitioning from experimentation to production.

This article is based on our experience building MCP infrastructure for healthcare and life sciences deployments.

The full article explores the architecture, implementation decisions, and operational considerations in greater detail.


5 Lessons From Building MCPs for Regulated Industries was originally published in The Applied AI Razor on Medium, where people are continuing the conversation by highlighting and responding to this story.

4 Things Enterprise Teams Learn After Deploying AI Voice Agents

deepsense.ai — Fri, 22 May 2026 12:57:36 GMT

Low latency, deterministic behavior, and evaluation strategy matter more than most teams expect in enterprise AI assistants.

Most discussions around AI assistants still focus on model capabilities.

The webinar below focuses on something more practical: what starts breaking once these systems are connected to actual enterprise workflows, APIs, users, and operational constraints.

During the session, Mateusz Wosinski, Senior ML Tech Lead, walks through two production deployments:

a voice agent automating outbound calls,
and a conversational assistant integrated into a digital twin platform.

This article summarizes several implementation lessons from those projects.

The full webinar includes architecture details and live examples.

1. Voice latency changes how the whole system is designed

One of the projects discussed during the webinar involved a telephony voice agent handling outbound calls and appointment scheduling.

The system needed to:

verify identity,
answer FAQs,
handle objections,
schedule appointments,
escalate calls to human agents when necessary.

Conversational pacing turned out to matter just as much as response quality.

Even short silent gaps felt unnatural during calls, which forced several architectural decisions:

pre-generated utterances,
filler phrases masking backend operations,
parallel execution of selected actions,
rule-based handling for short responses instead of LLM calls.
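The filler-phrase and parallel-execution ideas can be sketched with `asyncio`: start the backend call first, speak the filler while it runs, then continue once the result arrives. The `speak` and `lookup_slots` functions below are hypothetical stand-ins for a text-to-speech call and a calendar lookup.

```python
import asyncio


async def speak(text: str, spoken: list) -> None:
    # Stand-in for a text-to-speech call; here it only records the utterance.
    spoken.append(text)


async def lookup_slots() -> list:
    # Stand-in for a slow backend call, e.g. a calendar lookup.
    await asyncio.sleep(0.05)
    return ["Tuesday 10:00", "Wednesday 14:30"]


async def offer_appointment(spoken: list) -> list:
    # Start the backend call first, then speak the filler while it runs.
    task = asyncio.create_task(lookup_slots())
    await speak("One moment, let me check the calendar.", spoken)
    slots = await task
    await speak(f"I can offer {slots[0]} or {slots[1]}.", spoken)
    return slots
```

Calling `asyncio.run(offer_appointment([]))` runs the flow end to end. The caller hears the filler immediately instead of silence while the lookup completes.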

In voice systems, latency quickly becomes part of the UX itself.

That changes how teams think about orchestration, inference strategy, and even dialogue design.

2. Fully generative systems are often harder to control than teams expect

Another practical observation from the webinar was how carefully LLM usage had to be constrained in customer-facing workflows.

In several places, the system relied on deterministic logic instead of free-form generation:

static responses were pre-recorded,
rule-based classifiers handled simple intents,
LLMs were primarily used for intent understanding and action selection.

This improved:

wording consistency,
latency predictability,
operational control,
escalation reliability.

The more sensitive the workflow, the more valuable deterministic behavior becomes.

That trade-off appears frequently in enterprise deployments, especially when systems interact directly with customers.
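A simplified picture of that split is a router that tries cheap, deterministic rules first and only falls back to a model when nothing matches. The patterns, intent names, and the `llm_classify` callback are assumptions made for illustration.

```python
import re
from typing import Callable, Optional

# Hypothetical patterns; a real deployment would tune these per language and domain.
RULES = [
    ("confirm", re.compile(r"^(yes|yeah|yep|sure|correct)\b", re.I)),
    ("decline", re.compile(r"^(no|nope|not now)\b", re.I)),
    ("human", re.compile(r"\b(agent|human|representative)\b", re.I)),
]


def classify_by_rules(utterance: str) -> Optional[str]:
    text = utterance.strip()
    for intent, pattern in RULES:
        if pattern.search(text):
            return intent
    return None


def route(utterance: str, llm_classify: Callable[[str], str]) -> tuple:
    # Deterministic path first: predictable wording and near-zero latency.
    intent = classify_by_rules(utterance)
    if intent is not None:
        return intent, "rules"
    # Anything the rules do not cover goes to the model.
    return llm_classify(utterance), "llm"
```

Returning the source alongside the intent makes it easy to measure how much traffic never reaches the model.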

3. Conversational layers can simplify complex enterprise platforms surprisingly fast

The second use case focused on a digital twin platform with a large and fragmented interface.

Instead of redesigning the UI from scratch, the solution introduced a conversational assistant integrated directly into the platform.

Users could retrieve information and trigger actions through natural language, while the underlying orchestration layer handled:

tool execution,
retrieval,
access control,
service integrations,
guardrails.

One interesting lesson from the project was organizational rather than purely technical.

A conversational layer often creates usability improvements much faster than a full UI redesign — especially in systems where users are already accustomed to existing workflows.

The webinar also goes deeper into:

Elasticsearch-based retrieval,
orchestration services,
hybrid search architecture,
and chunking strategies for large document collections.


4. Evaluation becomes much more than model evaluation

One of the more interesting parts of the webinar was the evaluation framework built around the assistant platform.

Mateusz introduced four separate evaluation layers:

unit and integration tests,
scheduled synthetic tests,
scenario-based evaluation,
security and prompt injection testing.

The scenario layer became especially important because many user requests required:

multi-step reasoning,
multiple tool invocations,
retrieval across several systems,
access-policy enforcement.
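To make the scenario layer concrete, here is a minimal sketch of a check that looks at the tool-call trace as well as the final answer. The `Scenario` and `Trace` structures and all tool names are hypothetical and do not come from the webinar.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    user_role: str
    request: str
    expected_tools: list   # tools that must be called, in this order
    forbidden_tools: set   # tools this role must never reach


@dataclass
class Trace:
    tool_calls: list       # tool names the assistant actually invoked
    answer: str


def evaluate(scenario: Scenario, trace: Trace) -> dict:
    calls = trace.tool_calls
    # Expected tools must appear as an ordered subsequence of the actual calls.
    remaining = iter(calls)
    sequence_ok = all(tool in remaining for tool in scenario.expected_tools)
    # Any call to a forbidden tool is an access-policy failure.
    policy_ok = not (set(calls) & scenario.forbidden_tools)
    return {
        "scenario": scenario.name,
        "sequence_ok": sequence_ok,
        "policy_ok": policy_ok,
        "passed": sequence_ok and policy_ok,
    }
```

A scenario can fail even when the answer looks fine, for example when the assistant reaches a tool that the user's role should never trigger.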

In systems like this, evaluation is no longer only a model-quality question. It becomes a retrieval problem, an orchestration problem, a permissions problem, and a workflow reliability problem.



How AI Helps Physicians Navigate 3,000 Medical Papers

deepsense.ai — Thu, 14 May 2026 13:55:23 GMT

Reviewing thousands of medical papers is slow and difficult to scale. We built an AI assistant to support physicians with evidence retrieval and synthesis.

Clinical research moves faster than physicians can realistically review. Even within a narrow medical domain, the number of papers and recommendations quickly becomes difficult to navigate manually. At deepsense.ai, we built an AI assistant designed to help physicians move from large-scale medical literature to structured clinical insight.

This is a condensed version of the full case study. The full version goes deeper into the system architecture, implementation decisions, and workflow design, and includes a short demo video.

The core challenge wasn’t only search

The project started with a practical constraint:

thousands of medical publications,
highly specialized terminology,
rapidly evolving evidence,
and limited physician time.

Traditional keyword search was insufficient for this workflow. Physicians needed a system capable of:

retrieving relevant publications,
understanding medical context,
synthesizing findings,
and presenting results in a usable format.

The problem becomes more difficult when the system needs to operate across long and highly technical documents rather than short factual queries.

Building an AI assistant around evidence synthesis

The assistant was designed to support literature review and evidence synthesis workflows.

Instead of treating papers as isolated documents, the system needed to connect:

clinical concepts,
treatment context,
research findings,
and supporting evidence.

Large language models played a central role, but the implementation required more than prompting alone. The workflow depended on combining retrieval mechanisms with structured processing of medical literature.

The system was intended to support physicians by accelerating access to relevant evidence and helping organize large volumes of information.

Why medical AI systems need tighter controls

Medical environments introduce constraints that differ significantly from general-purpose AI applications.

In practice, systems operating in healthcare workflows must account for:

reliability,
traceability,
factual consistency,
and transparency of outputs.

This shaped multiple implementation decisions throughout the project.

The challenge was not only generating useful responses. The system also needed to preserve links to source materials and maintain enough structure for physicians to verify conclusions against the underlying literature.

Similar constraints appear in other regulated healthcare workflows, including AI-assisted pharma content generation, where traceability and reviewability are critical parts of the system design.

Long documents create practical engineering problems

Medical publications are large, dense, and highly contextual.

That creates several engineering challenges:

chunking long documents without losing meaning,
retrieving the correct context,
handling domain-specific terminology,
and reducing irrelevant or misleading retrieval results.
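As an illustration of the first challenge, the sketch below groups paragraphs into size-limited chunks and repeats the last paragraph across each boundary so that context is not cut mid-argument. It is a generic technique, not the chunking strategy used in the project, and the `max_chars` and `overlap` defaults are arbitrary.

```python
def chunk_paragraphs(text: str, max_chars: int = 1200, overlap: int = 1) -> list:
    """Group paragraphs into chunks, repeating `overlap` paragraphs between neighbours."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        candidate = current + [para]
        if current and sum(len(p) for p in candidate) > max_chars:
            chunks.append("\n\n".join(current))
            # Carry the last paragraphs forward so context spans the boundary.
            current = current[-overlap:] if overlap > 0 else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting on paragraph boundaries keeps sentences intact. A single paragraph longer than `max_chars` is kept whole in this sketch and would need a further split in practice.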

Scaling retrieval across thousands of publications also affects latency, orchestration, and ranking quality.

These constraints often determine whether an AI assistant becomes practically useful in clinical workflows.

Similar document-processing challenges also appear in AI-assisted clinical protocol development, where large volumes of structured and unstructured information must be processed consistently.

The outcome

The resulting system helped physicians navigate large collections of medical literature more efficiently by combining retrieval, synthesis, and structured presentation of evidence.

The project illustrates a broader pattern emerging across healthcare AI:

useful systems depend as much on workflow design, retrieval quality, and reliability as on the underlying model itself.

Visit our website for the full analysis, including architecture details and implementation trade-offs.


When Code Gets Cheaper, Judgment Gets More Expensive

deepsense.ai — Wed, 06 May 2026 15:51:50 GMT

Faster Code, Slower Validation: Where Enterprise AI Systems Actually Break

AI-assisted development is increasing output faster than many teams can meaningfully evaluate it.

Code assistants are already effective at scaffolding, refactoring, boilerplate generation, and repetitive implementation work. That leverage is real. But in enterprise AI systems, the limiting factor increasingly shifts away from code production and toward system judgment.

The harder question becomes whether the system remains reviewable, testable, observable, and reliable under real operating conditions.

More capability also means more risk

One pattern is becoming increasingly visible across the AI tooling ecosystem.

Generation-first systems make it easy to assemble powerful workflows quickly. But speed also exposes gaps around:

access control,
auditability,
implicit data flows,
governance,
operational visibility.

The response is predictable: governance and policy layers start appearing around the generation stack itself.

The shift is subtle but important: the question increasingly becomes whether these systems can be trusted in production.

Velocity metrics become weak signals

Once code generation becomes cheap, many traditional engineering metrics lose meaning.

PR counts rise. Change volume increases. Spikes appear faster. Reviews get shorter. The surface area of the system expands rapidly.

But higher throughput does not automatically mean higher confidence.

In one sensitive project centered around authentication and authorization flows, development velocity looked healthy on paper. The problem emerged elsewhere: the review depth was no longer proportional to the risk surface.

That kind of gap rarely becomes visible during prototype reviews. It becomes visible later:

during validation,
security assessment,
compliance review,
or direct client scrutiny.

A recurring failure mode was especially revealing: AI systems treat all code as equally valid context.

That leads to implementations built on top of:

inactive user paths,
unused database tables,
deprecated logic,
ignored auth flows hidden behind feature flags.

The issue is not code generation quality alone. The issue is whether the team understands which parts of the system are actually trusted, alive, and safe to extend.

The real failures happen at the boundaries

The interesting problems in AI-generated systems are rarely in scaffolding.

They appear at integration boundaries:

retries,
partial failures,
pagination edge cases,
timeout behavior,
rate limiting,
unstable upstream responses,
inconsistent response contracts.

Modern frameworks make it relatively easy to spin up services quickly. The first implementation often looks clean. The difficult engineering work starts deeper in the stack.

A system that behaves differently under load, returns inconsistent response shapes, or fails unpredictably creates operational debt quickly.

“A tool that sometimes succeeds, sometimes times out, and sometimes invents a response shape is not a tool. It is operational debt with a nice interface.”

That is why deterministic contracts matter so much in production AI systems:

explicit retry policies,
stable error envelopes,
predictable fallback behavior,
controlled timeout handling,
observable upstream interactions.
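A minimal sketch of two of these constraints, an explicit retry policy and a stable error envelope, might look like the following. The envelope keys, the `UpstreamTimeout` exception, and the backoff defaults are illustrative assumptions.

```python
import time
from typing import Any, Callable


class UpstreamTimeout(Exception):
    """Raised by the caller-supplied function when the upstream times out."""


def call_tool(fn: Callable[[], Any], max_attempts: int = 3, base_delay_s: float = 0.1,
              sleep: Callable[[float], None] = time.sleep) -> dict:
    """Always return the same envelope shape, whether the call succeeds or fails."""
    attempts = 0
    last_error = "unknown"
    while attempts < max_attempts:
        attempts += 1
        try:
            return {"ok": True, "data": fn(), "error": None, "attempts": attempts}
        except UpstreamTimeout:
            last_error = "timeout"  # retryable
            if attempts < max_attempts:
                sleep(base_delay_s * 2 ** (attempts - 1))  # exponential backoff
        except Exception as exc:
            last_error = type(exc).__name__  # non-retryable: stop immediately
            break
    return {"ok": False, "data": None, "error": last_error, "attempts": attempts}
```

Because success and failure share one shape, callers never have to guess which fields exist, and the `attempts` field makes retries visible in logs.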

Without those constraints, debugging turns into archaeology.

We covered these integration and orchestration challenges in more detail in our breakdown of scalable multi-agent systems built around Anthropic’s MCP architecture.

Observability becomes part of the architecture

Logging and monitoring cannot be treated as post-production additions.

Middleware-level logging helps, but production systems also require domain-specific visibility:

identifiers,
decision points,
retries,
upstream dependencies,
fallback paths,
failure classifications.

This becomes especially important in regulated environments.

In the MCP Life Sciences work discussed in the full article, the system boundary also became a compliance boundary. That changed what quality meant in practice:

isolated containerized services,
stateless execution,
strict role-based access,
observable infrastructure without storing PII,
explainable error handling,
controlled caching strategies.

In those environments, weak traceability or unclear system boundaries create operational and compliance risk quickly.

A sequencing problem appears in AI-heavy delivery

Another recurring issue is implementation order.

Teams often start with generation:

generate the server,
scaffold the integration,
automate the PR,
defer monitoring,
postpone observability,
add governance later.

The article argues for the opposite sequencing.

Start with the layers that determine downstream quality:

proven infrastructure components,
Twelve-Factor foundations,
annotation and versioning discipline,
observability,
review standards,
staged rollout practices.

Only then accelerate implementation with AI assistance.

This matters because defects introduced upstream become expensive later. Poor annotation quality behaves similarly to shallow code review: the original problem appears early, but the operational consequences emerge much later.

Enterprise leverage comes from the right layer of customization

Another common pattern: organizations rebuilding infrastructure that already exists.

Custom orchestration layers, custom middleware, and custom framework abstractions often create long-term maintenance overhead instead of differentiation.

The stronger approach is usually:

start from proven components,
customize orchestration and constraints,
adapt execution behavior to the business problem,
avoid rebuilding commodity infrastructure.

The article describes one advisory engagement where an internally built multi-agent retrieval system struggled with unreliable human-in-the-loop behavior and limited orchestration capabilities. Rebuilding around the OpenAI Agents SDK produced a more maintainable and scalable architecture.

The broader lesson is practical rather than ideological: borrow complexity before building it yourself.

Enterprise systems optimize for accountability

Prototype systems and enterprise systems optimize for very different conditions.

Prototype systems optimize for speed, flexibility, and happy-path execution.

Enterprise systems prioritize:

auditability,
policy enforcement,
replayability,
failure recovery,
tenant isolation,
operational debugging.

That difference changes how AI assistance should be used.

The most productive pattern is not rejecting AI-generated code, nor blindly accelerating everything around it.

It is using AI aggressively for repetitive implementation while applying stricter discipline around:

design review,
system boundaries,
observability,
security,
governance,
operational reliability.

The bottleneck shifts from typing to judgment

The core argument is straightforward:

AI reduces the cost of producing code. It does not reduce the cost of validating systems.

Teams that adapt well still move quickly. But their speed compounds because the quality controls remain aligned with the complexity being generated.

The practical implications are clear:

use AI to compress mechanical work,
avoid treating output volume as proof of progress,
increase scrutiny around critical system paths,
make observability first-class early,
build on stable foundations instead of rebuilding infrastructure.

Less time is now spent writing code manually.

More time is spent deciding what deserves to stay.

Visit our blog for the complete analysis, including implementation details, MCP architecture patterns, and production lessons from regulated environments.


GxP-Compliant AI: 5 Things That Actually Matter in Production Life Sciences Systems

deepsense.ai — Thu, 30 Apr 2026 10:29:11 GMT

AI systems are already being deployed across clinical and R&D workflows, where regulatory constraints shape how they are designed and operated.

At the same time, clinical development remains slow and expensive, and even small improvements can translate into significant financial impact. This makes system reliability, not just model capability, central to real-world deployment.

This is a condensed version of our full breakdown — in the original article we walk through architecture and implementation details step by step.

5 things that actually matter in production life sciences AI

1. Architecture determines what can be deployed

In regulated environments, system design directly affects feasibility.

What matters in practice:

constrained orchestration,
validation pipelines,
audit logging,
governance frameworks,
region-aware deployment.

A system that enforces traceability and policy constraints will perform more reliably than one optimized only for model output.

2. Structured connectors make AI usable in workflows

The next step in AI system design is connecting models to authoritative, domain-specific sources.

Anthropic’s Life Sciences launch included structured connectors to authoritative, domain-specific sources.

The MCP connectors used in these integrations were built by deepsense.ai and are already supporting production workloads across healthcare and life sciences.

These connectors allow systems to:

compare clinical trial endpoints,
extract eligibility criteria,
support protocol design,
assist in trial emulation workflows.

We describe implementation patterns and constraints in more detail in our technical article on building MCP systems for regulated industries.

3. Traceability and auditability are required properties

Production systems must support:

reproducibility,
detailed logging,
full traceability of data and decisions.

These properties determine whether systems can pass regulatory review and remain usable at scale.
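One common way to make logs tamper-evident is to chain entries by hash, so that changing any earlier record invalidates everything after it. The sketch below shows the idea with the standard library. The event fields are hypothetical, and this is a generic technique, not a description of any specific validated system.

```python
import hashlib
import json


def append_entry(log: list, event: dict) -> dict:
    """Append an event whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    payload = json.dumps({"prev": prev_hash, "event": event}, sort_keys=True)
    entry = {
        "prev": prev_hash,
        "event": event,
        "hash": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
    }
    log.append(entry)
    return entry


def verify(log: list) -> bool:
    """Recompute every hash in order; any edited entry breaks the chain."""
    prev_hash = "0" * 64
    for entry in log:
        payload = json.dumps({"prev": prev_hash, "event": entry["event"]}, sort_keys=True)
        expected = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True
```

Recording fields such as a model version alongside each decision is what later makes a result reproducible and explainable to a reviewer.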

4. Impact appears at the workflow level

Across real deployments, AI systems are improving outcomes in specific, well-defined workflows, where they are embedded directly into operational processes.

Protocol generation aligned with regulatory frameworks
AI systems constrained by regulatory guidance accelerate compliant study design and reduce iteration cycles.


AI-driven site selection
In retrospective analysis, 90% of model-recommended sites outperformed legacy selections in the US market. Impact:

higher enrollment efficiency,
reduced trial delays,
improved probability of trial success.


Multimodal LLMs for in-silico drug discovery
A multimodal system enabled a 5× acceleration in molecular exploration workflows. Impact:

faster hypothesis generation,
improved candidate prioritization,
reduced experimental iteration cycles.


The consistent pattern: when AI is tied to a defined workflow and supported by the right system design, it improves both speed and decision quality without losing control over governance.

5. Scaling across regions adds another layer of complexity

Many organizations validate AI systems in a single market first.

Expanding across jurisdictions such as FDA, EMA, PMDA, and NMPA introduces:

different regulatory expectations,
varying data constraints,
increased system complexity.

Handling this requires architecture that accounts for regional differences from the start.

What this means in practice

AI systems are becoming part of operational infrastructure across:

clinical trial execution,
regulatory processes,
pharmacovigilance,
R&D decision support.

This requires:

architecture-first thinking,
validation aligned with regulators,
domain-aware system design,
production-grade reliability.

Teams that treat compliance as a design principle tend to avoid bottlenecks later in deployment.

Final takeaway

In life sciences, AI systems are evaluated by their ability to operate under regulatory scrutiny, maintain traceability, and scale across real workflows.


OpenAI Agents SDK: What Changes After the New Release (Early Access Insights)

deepsense.ai — Wed, 22 Apr 2026 10:16:01 GMT

Persisting and resuming agent workflows while keeping execution auditable remains a challenge. OpenAI’s new SDK addresses this at the system level.

5 Things That Actually Matter in OpenAI’s New Agents SDK

Ahead of OpenAI’s April 2026 release, we had early access to the new Agents SDK as an OpenAI Partner.

What emerges is a clear shift: instead of lightweight chat loops, agents start to resemble structured execution systems.

In practice, this looks closer to a workflow engine — stateful, resumable, and auditable.

Here’s what actually changes from an implementation perspective.

This is a condensed version of our full breakdown — in the original article, our engineers walk through the architecture and implementation details step by step.

1. Long-running workflows are now a first-class concept

The combination of Runner and RunState introduces a clear execution model:

workflows can run over hours or days
execution can be paused and resumed
full state is preserved between steps

RunState captures:

model responses
tool outputs
approval states
execution history

This enables patterns that were previously difficult to implement reliably:

multi-step pipelines
agent-driven ETL or RAG flows
copilots triggering actions across systems

Before, teams had to build orchestration layers themselves. Now, it’s part of the core abstraction.
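To show what pause-and-resume means in practice, here is a generic sketch that deliberately does not use the SDK. `WorkflowState`, `run`, `save`, and `load` are hypothetical names and are not the SDK's `Runner` or `RunState` API. The sketch only illustrates the pattern of persisting state at an approval gate and continuing later.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class WorkflowState:
    # Hypothetical state record; this is not the SDK's RunState type.
    step: int = 0
    history: list = field(default_factory=list)
    pending_approval: str = ""


STEPS = ["fetch_records", "transform", "publish"]  # illustrative pipeline
NEEDS_APPROVAL = {"publish"}


def run(state: WorkflowState, approved: frozenset = frozenset()) -> WorkflowState:
    while state.step < len(STEPS):
        name = STEPS[state.step]
        if name in NEEDS_APPROVAL and name not in approved:
            state.pending_approval = name  # pause here; the caller persists state
            return state
        state.history.append(name)
        state.pending_approval = ""
        state.step += 1
    return state


def save(state: WorkflowState) -> str:
    return json.dumps(asdict(state))


def load(blob: str) -> WorkflowState:
    return WorkflowState(**json.loads(blob))
```

The serialized state can sit in storage for hours or days. Resuming means loading it and calling `run` again with the approval granted, without replaying earlier steps.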

2. Sandboxed execution becomes practical

The sandbox/session model provides isolated environments with structured control:

execution in Unix, Docker, or microVMs
workspace-level file operations
script execution within controlled environments
pre-configured setups via manifests

For enterprise contexts, this directly supports:

security isolation
controlled tool usage
reproducibility of execution

This is one of the more concrete steps toward making agent systems usable in regulated environments.

3. Persistence goes beyond conversation history

Most agent systems persist prompts and responses. This approach goes further.

The SDK captures:

full workspace state
intermediate artifacts (files, outputs)
memory, context, and approval state
execution progress

This enables:

true resumability (not prompt replay)
debuggable execution paths
auditable workflows

From an implementation standpoint, this is a key difference between a demo system and something operational.

4. Memory becomes a system capability

Memory is introduced as a structured, persistent capability with a two-phase pipeline:

Phase 1: lightweight extraction within a run
Phase 2: consolidation across runs over time

Additional characteristics:

agents can update stale memory
read/write separation helps manage cost
configurable storage structure

This means knowledge is no longer static. It evolves across executions instead of being re-injected via prompts.
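The two-phase idea can be illustrated without the SDK. In the sketch below, phase 1 pulls simple facts out of a single run, and phase 2 merges them across runs so that newer values replace stale ones. The function names, the "remember key: value" convention, and the data layout are assumptions made for this example, not the SDK's memory interface.

```python
def extract_notes(run_id: int, messages: list) -> list:
    """Phase 1: cheap, in-run extraction of 'remember key: value' facts."""
    notes = []
    for msg in messages:
        if msg.lower().startswith("remember ") and ":" in msg:
            key, value = msg[len("remember "):].split(":", 1)
            notes.append({"key": key.strip().lower(), "value": value.strip(), "run_id": run_id})
    return notes


def consolidate(memory: dict, notes: list) -> dict:
    """Phase 2: merge notes across runs; a newer run overwrites a stale value."""
    merged = dict(memory)
    for note in sorted(notes, key=lambda n: n["run_id"]):
        current = merged.get(note["key"])
        if current is None or note["run_id"] >= current["run_id"]:
            merged[note["key"]] = {"value": note["value"], "run_id": note["run_id"]}
    return merged
```

Keeping extraction cheap and consolidation separate is one way to read the cost argument above: the expensive merge step can run less often than the per-run write.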

5. Clear separation between definition and execution

The distinction between:

Agent (definition)
Runner + RunState (execution instance)

introduces a more structured model.

This allows teams to:

version workflows
run multiple executions in parallel
reproduce execution deterministically

It aligns more closely with how enterprise systems are designed and operated.

Where it still needs refinement

The direction is strong, but there are areas that require more clarity:

Capabilities vs tools — boundaries can be ambiguous in larger systems
Concurrency model — multi-agent parallelism still requires custom orchestration
Documentation maturity — building a clear mental model takes effort

These are solvable, but relevant for near-term adoption.

What this signals about agent systems

This release reflects a broader shift:

from stateless prompt chains → to stateful execution systems
from chat interfaces → to infrastructure components

In practice, this matches what enterprise teams are already trying to build:

internal copilots embedded in workflows
automation layers driven by agents
evolving RAG systems
multi-agent coordination across tools and data

Bottom line

The new Agents SDK introduces a foundation that has been missing: a structured execution layer for agents.

It makes systems:

more reliable
more auditable
easier to resume and operate

But it does not remove the core challenges of enterprise AI:

system architecture
data integration
evaluation and monitoring
cost and scaling

For teams already building agent-based systems, this is less about experimentation — and more about how the next generation of workflows should be structured.

Visit our blog for detailed breakdowns of agent execution layers, architectures, MCP integrations, and real implementation trade-offs.


Why Voice AI Still Feels Slow (OpenAI GPT vs Google Gemini Explained)

deepsense.ai — Fri, 17 Apr 2026 15:10:59 GMT

Most enterprise voice AI feels unnatural for a simple reason: the architecture was never designed for human conversational timing.

What works in a chatbot often breaks in voice, where users notice every pause, interruption, and awkward handoff immediately.

You can build a technically correct system, and it can still feel off in real interactions.

The Problem: Latency Compounds Fast

Most voice systems still follow a simple pipeline:

Speech → Text → LLM → Text → Speech

On paper, it looks clean and works well in demos.

In practice, every step adds latency:

transcription delay
model reasoning time
response synthesis

And something else gets lost along the way: tone, pacing, hesitation — everything that makes conversation feel human.
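To make the compounding concrete, the small sketch below adds up per-stage delays for a sequential pipeline. The millisecond values are placeholders chosen for illustration, not measurements from any system.

```python
# Illustrative stage latencies in milliseconds; placeholders, not measurements.
STAGES_MS = {"speech_to_text": 300, "llm": 700, "text_to_speech": 250}


def sequential_latency(stages: dict) -> int:
    # Each stage waits for the previous one, so the delays add up.
    return sum(stages.values())


def share_of_total(stages: dict) -> dict:
    # Fraction of the end-to-end delay contributed by each stage.
    total = sequential_latency(stages)
    return {name: round(ms / total, 2) for name, ms in stages.items()}
```

With these placeholder numbers the user waits 1250 ms before hearing anything, and no single stage accounts for all of it, which is why optimizing one component rarely fixes the perceived delay.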

We’ve written about this in more detail in “Latency kills voicebots faster than bad models”.

1. Sequential Pipelines Don’t Match Human Timing

The traditional pipeline is not “bad” — it can handle complex logic, APIs, and workflows. But it introduces unavoidable delays.

Even when optimized:

responses feel slightly late
interruptions don’t work well
conversations lose flow

Teams try to mask this with tricks:

background sounds
parallel execution
longer prompts

It helps, but it doesn’t address the underlying latency problem.

Takeaway: the architecture itself creates latency.

2. Text-Based Processing Removes Voice Signals

In traditional pipelines, the model never hears the user. It only sees text.

That means it loses:

tone
emotion
hesitation
intent nuances

Everything gets flattened into text before reasoning happens.

This is a fundamental limitation, not an implementation detail.

Takeaway: voice → text → voice strips away context that matters in real conversations.

3. Realtime Voice Models Change the Architecture

New native audio models remove the text layer entirely.

Instead of:
→ speech → text → model → text → speech

You get:
→ speech → model → speech

The model processes audio directly and responds in audio. This reduces latency and allows the system to react more naturally.

It also changes what the model can “understand”:

pacing
tone
conversational rhythm

Takeaway: realtime models improve both latency and interaction quality.

4. You Gain Speed, But Lose Control

Realtime voice models come with a trade-off.

Because they generate audio directly:

you can’t easily post-process responses
you can’t enforce strict templates
control becomes prompt-driven

In contrast, traditional pipelines allow:

deterministic outputs
structured responses
precise API interactions

This matters in:

compliance-heavy workflows
structured data collection
high-risk interactions

Takeaway: realtime systems are faster, but harder to control.
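In a text-based pipeline the model output exists as text before it is spoken, which is what makes strict checks possible. The sketch below shows one way such a gate could look. The template, the `REQUIRED_PATTERN` regex, and the fallback wording are all hypothetical.

```python
import re

# Hypothetical template: a confirmation must quote a 6-digit reference number.
REQUIRED_PATTERN = re.compile(r"Your reference number is \d{6}\. Is that correct\?")
FALLBACK = "Let me repeat that. Could you confirm your reference number?"


def gate_before_tts(llm_text: str) -> str:
    """Return the LLM text only if it matches the required template exactly;
    otherwise return a fixed fallback so nothing off-template is spoken."""
    candidate = llm_text.strip()
    if REQUIRED_PATTERN.fullmatch(candidate):
        return candidate
    return FALLBACK
```

A native audio model offers no equivalent interception point, because the response is already audio by the time it leaves the model.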

OpenAI vs Gemini: What Actually Matters

We tested both approaches in a real-time voice agent with:

RAG
tool calling
observability
live data access

Both models handled conversations smoothly. But the differences become very noticeable in real use.

If you want to hear the difference, here are short demos of both models in action:
→ Gemini demo
→ GPT demo

OpenAI (gpt-1.5-realtime)

Strengths:

reliable tool calling
smooth conversational pacing
natural filler behavior (“let me check that…”)

Trade-offs:

higher cost (~$0.20/min)
voice quality is good, but not exceptional

https://www.youtube.com/watch?v=jWDkL5693Ik

Google (gemini-2.5-flash-native-audio)

Strengths:

~10x cheaper (~$0.02/min)
noticeably better voice quality

Trade-offs:

less consistent tool usage
harder to interrupt
occasional conversational “dead air”

https://www.youtube.com/watch?v=jwsmVC8hAK4
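The price gap is easier to judge at volume. This is a rough sketch using the approximate per-minute prices quoted above; the call volume and average call length are hypothetical inputs, and real bills depend on token usage and pricing changes.

```python
# Approximate per-minute prices quoted above (USD).
PRICE_PER_MIN = {"openai_realtime": 0.20, "gemini_native_audio": 0.02}


def monthly_cost(price_per_min: float, calls_per_day: int,
                 avg_minutes: float, days: int = 30) -> float:
    """Estimated monthly spend for a given (hypothetical) call volume."""
    return round(price_per_min * calls_per_day * avg_minutes * days, 2)


if __name__ == "__main__":
    for name, price in PRICE_PER_MIN.items():
        # Assumed workload: 1,000 calls per day, 3 minutes each.
        print(name, monthly_cost(price, calls_per_day=1000, avg_minutes=3))
```

Under those assumed inputs the estimate is 18,000 USD versus 1,800 USD per month, which is the same 10x ratio as the per-minute prices.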

Where Realtime Voice Works (and Where It Doesn’t)

Works well:

conversational assistants
user triage
FAQs
empathetic interactions

Still problematic:

strict workflows
precise data collection
compliance-heavy systems

These models are good at conversation.

They are not yet reliable system operators.

What This Means for Enterprise Voice AI

Voice AI isn’t just a chatbot with audio.

It requires different architectural decisions:

latency becomes a product issue
control vs naturalness is a trade-off
model choice affects system design

The shift to realtime models is real.

But it doesn’t replace traditional pipelines — it changes where they make sense.

Final Thought

From the outside, voice AI looks like a solved problem.

In practice, most systems struggle more with timing than with intelligence.

The difference between a working system and a usable one often comes down to milliseconds — and how you design around them.

If you want the full version with deeper examples, edge cases, and implementation details, you can read it HERE.

Why Voice AI Still Feels Slow (OpenAI GPT vs Google Gemini Explained) was originally published in The Applied AI Razor on Medium, where people are continuing the conversation by highlighting and responding to this story.

5 Things That Break Medical AI in Practice (Including 30% Wrong X-Rays)

deepsense.ai — Wed, 08 Apr 2026 14:56:25 GMT

You filter for chest X-rays, build a dataset, train a model — and only later realize that 20–30% of your data isn’t chest at all.

Elbows, knees, hands — sometimes not even X-rays.

This isn’t a rare edge case. It’s what clinical ML looks like once you move past clean benchmarks and into real systems.

If you want the full breakdown with examples and implementation details, you can read the original version HERE.

1. Metadata Is Not Ground Truth

Medical datasets look structured on paper.

DICOM defines fields for almost everything: body part, modality, acquisition details. In theory, this should make dataset curation straightforward.

In practice, fields are reused, left unchanged, or filled creatively.

A technologist sets “body part” to chest for one patient, then moves on to the next — and the value carries over. You filter for chest X-rays and end up with a dataset where a significant portion isn’t chest at all.

Not because of bad intent — just because real workflows don’t follow perfect schemas.

This is one of the most common issues we see when working with real-world medical datasets.

Takeaway: metadata is a hint, not a label.
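One way to act on this is to cross-check the tag against the pixels before accepting a study. The sketch below assumes studies are plain dicts that carry the DICOM `BodyPartExamined` value, and that `predict_body_part` is a hypothetical image-based classifier supplied by the caller. Neither is taken from a specific library.

```python
from typing import Callable


def split_by_agreement(
    studies: list[dict],
    predict_body_part: Callable[[dict], str],
    expected: str = "CHEST",
) -> tuple[list[dict], list[dict]]:
    """Keep studies whose metadata AND image-based prediction both say `expected`.
    Studies that match on metadata only go to manual review instead of being trusted."""
    accepted, needs_review = [], []
    for study in studies:
        if study.get("BodyPartExamined", "").upper() != expected:
            continue  # the naive metadata filter: drop everything else
        if predict_body_part(study).upper() == expected:
            accepted.append(study)
        else:
            needs_review.append(study)
    return accepted, needs_review
```

The size of the review pile is itself a useful number: it estimates how much of a metadata-filtered dataset is mislabeled.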

2. Every Hospital Is Its Own Distribution

Inside a single hospital, data tends to be consistent.

Across hospitals, everything changes.

Different scanners, reconstruction settings, reporting styles, and annotation habits create entirely different data distributions. A model trained in one environment often degrades quickly in another.

If you’re building systems that need to work across sites, this becomes a core design problem rather than an edge case.

Takeaway: plan for retraining and adaptation, not universal generalization.
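A pooled score can hide a site where the model has degraded, so it helps to report results per hospital. This is a minimal sketch; the record layout of `(site, true_label, predicted_label)` is an assumption made for illustration.

```python
from collections import defaultdict


def accuracy_by_site(records: list[tuple[str, int, int]]) -> dict[str, float]:
    """Compute accuracy separately for each site.
    `records` holds (site, true_label, predicted_label) triples."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for site, y_true, y_pred in records:
        total[site] += 1
        correct[site] += int(y_true == y_pred)
    return {site: correct[site] / total[site] for site in total}
```

With four correct predictions at site A and one of two correct at site B, the pooled accuracy is about 0.83 while site B alone sits at 0.5.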

3. Clinical Text Is Not Clean NLP Data

Clinical reports are often dictated, not written.

That brings:

transcription errors
ambiguous abbreviations
inconsistent phrasing

“PT” can mean physical therapy, patient, or prothrombin time. Acronyms shift meaning depending on context, and even clinicians sometimes disagree on interpretation.

On top of that, reports are intentionally cautious:

“suggestive of”
“cannot be excluded”
“may represent”

The result is text that is clinically valid but difficult to model.

Takeaway: treat clinical NLP as a sequence of smaller problems, not a single modeling task.
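Two of those smaller problems, hedge detection and abbreviation disambiguation, can be sketched with simple rules. The hedge phrases come from the examples above; the cue words in `PT_CUES` are hypothetical and far from complete, and a production system would need something more robust than substring matching.

```python
import re

# Hedge phrases taken from the examples above (illustrative, not exhaustive).
HEDGES = ("suggestive of", "cannot be excluded", "may represent")

# Hypothetical context cues for one ambiguous abbreviation.
PT_CUES = {
    "prothrombin time": ("inr", "seconds", "coagulation"),
    "physical therapy": ("referral", "rehab", "exercises"),
}


def is_hedged(sentence: str) -> bool:
    """True if the sentence contains a known hedging phrase."""
    lowered = sentence.lower()
    return any(hedge in lowered for hedge in HEDGES)


def expand_pt(sentence: str) -> str:
    """Pick a reading of 'PT' from nearby cue words.
    Returns '' if 'PT' is absent and defaults to 'patient' if no cue matches."""
    if not re.search(r"\bPT\b", sentence):
        return ""
    lowered = sentence.lower()
    for meaning, cues in PT_CUES.items():
        if any(cue in lowered for cue in cues):
            return meaning
    return "patient"
```

Keeping each step separate makes it possible to measure and fix them independently, which is harder when one model is asked to do everything at once.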

Chest X-Ray, clearly

4. Medical Images Are Not Just Images

In many domains, images are just pixels.

In medicine, they encode physics.

CT values correspond to Hounsfield Units
X-ray intensity reflects attenuation
MRI brightness depends on acquisition
Ultrasound depends on angle and speckle

Add artifacts:

motion
implants
noise
contrast phases

These differences aren’t cosmetic — they directly affect what the model can learn and how it generalizes.

Takeaway: preprocessing and data understanding are part of the model.
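For CT, the physics shows up directly in preprocessing. Stored pixel values are mapped to Hounsfield Units with the linear DICOM rescale (slope and intercept), and a window is then applied before the data reaches a model. The sketch below works on single values for clarity; the window settings in the usage note are an assumed soft-tissue example, not a recommendation.

```python
def to_hounsfield(stored_value: int, rescale_slope: float, rescale_intercept: float) -> float:
    """Linear DICOM rescale: HU = stored_value * slope + intercept."""
    return stored_value * rescale_slope + rescale_intercept


def apply_window(hu: float, center: float, width: float) -> float:
    """Clip HU to [center - width/2, center + width/2] and scale to [0, 1]."""
    low = center - width / 2
    high = center + width / 2
    clipped = min(max(hu, low), high)
    return (clipped - low) / (high - low)
```

For example, a stored value of 1024 with slope 1 and intercept -1024 is 0 HU (water). With an assumed window of center 40 and width 400, that maps to 0.4, while air and dense bone are clipped to 0 and 1.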

5. Metrics Don’t Reflect Clinical Reality

Benchmarks optimize for metrics like accuracy or F1, but these rarely reflect how models behave in clinical workflows.

Clinical systems operate under different constraints:

extreme class imbalance
asymmetric risk (false negatives vs false positives)
decisions that affect real workflows

In some cases, recall matters more than precision. In others, standard metrics don’t capture the actual cost of errors.

These trade-offs only really make sense once you look at how AI systems behave in real-world medical data environments.

Takeaway: metrics should follow the decision, not the textbook.
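A small worked example shows why accuracy misleads under class imbalance. The cost weights below are hypothetical; in practice they would come from the clinical decision the model supports.

```python
def confusion_counts(y_true: list[int], y_pred: list[int]) -> dict[str, int]:
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for t, p in zip(y_true, y_pred):
        key = ("t" if t == p else "f") + ("p" if p == 1 else "n")
        counts[key] += 1
    return counts


def accuracy(c: dict[str, int]) -> float:
    return (c["tp"] + c["tn"]) / sum(c.values())


def recall(c: dict[str, int]) -> float:
    positives = c["tp"] + c["fn"]
    return c["tp"] / positives if positives else 0.0


def expected_cost(c: dict[str, int], fn_cost: float = 50.0, fp_cost: float = 1.0) -> float:
    """Weight errors by their (assumed) cost: a miss is far worse than a false alarm."""
    return c["fn"] * fn_cost + c["fp"] * fp_cost
```

With 10 positives in 1,000 cases, a model that always predicts "negative" scores 99% accuracy, 0% recall, and the highest cost a miss-only model can incur under these weights.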

What This Means in Practice

Clinical ML isn’t about state-of-the-art models.

It’s about:

mislabeled data
inconsistent metadata
ambiguous text
artifact-heavy images
shifting data distributions

And systems that still need to work despite all of that, often under real clinical constraints.

The pattern that holds across projects is simple:

verify → simplify → re-check

Final Thought

From the outside, medical AI looks like breakthrough models and clean results.

Up close, it’s mostly engineering work.

If you want the full version with deeper examples, edge cases, and implementation details, you can read it HERE.

5 Things That Break Medical AI in Practice (Including 30% Wrong X-Rays) was originally published in The Applied AI Razor on Medium, where people are continuing the conversation by highlighting and responding to this story.

Why MCP deployments break in healthcare (and what actually works in production)

deepsense.ai — Wed, 01 Apr 2026 09:31:01 GMT

Most organizations treat MCP as a simple API integration. In regulated environments like healthcare, that assumption breaks quickly.

What looks like a connector becomes a compliance boundary — with requirements around auditability, access control, and provability.

Healthcare AI systems must handle the most sensitive data types while integrating with legacy EHRs, LIMS, and research platforms. They need explainable decisions for clinical workflows and comprehensive audit trails for regulatory inspections. Most importantly, they can never compromise patient safety or data privacy.

Beyond compliance, there’s the practical challenge of making AI systems work within existing healthcare infrastructure. Clinical workflows are complex, data formats are heterogeneous, and downtime isn’t an option. Traditional AI integrations often require extensive custom development, creating maintenance nightmares and vendor lock-in.

MCP acts as a secure, standardized gateway that allows AI to “see” regulated data without ever compromising the underlying security boundary.

Why healthcare compliance reshapes AI architecture

When we talk about “regulatory compliance” in healthcare AI, we’re not dealing with a single checklist. Instead, we’re navigating a complex web of overlapping frameworks, each with different focuses and requirements.

The Big Three: Navigating HIPAA, FDA, and GxP Compliance

Here’s what many organizations miss:

These aren’t just policy requirements you can bolt onto existing systems. They fundamentally shape how your AI architecture must be designed from the ground up.

The challenge isn’t meeting any single requirement in isolation. It’s building systems that satisfy all these overlapping demands while remaining performant, maintainable, and scalable. This is where traditional AI architectures break down, and why we need fundamentally different approaches.

How MCP becomes a security boundary, not just a connector

In regulated environments, adding security layers to existing architectures is rarely sufficient. The system design itself needs to be built around compliance constraints.

While MCP servers are fundamentally connectors between LLMs and APIs, in regulated environments, these connections become critical security and compliance boundaries. The challenge isn’t just about AI — it’s about creating secure, auditable pathways between powerful language models and highly sensitive healthcare data systems.

Essential Design Principles for Healthcare MCP Servers

The foundation of any regulatory-compliant MCP architecture rests on four key principles:

Building MCP servers for healthcare leverages the protocol’s client-server architecture to create natural security boundaries. Each MCP server runs as an isolated process, communicating with Claude through standardized JSON-RPC messages over stdio or HTTP. For healthcare deployments, we extend this isolation by containerizing each server with strict resource limits and network policies.

The MCP protocol’s structured message format becomes essential for compliance. Every tool invocation follows the “tools/call” request-response pattern, providing consistent audit points. Healthcare MCP servers must log complete message payloads — including tool names, parameters, and responses — with timestamps and user context for regulatory reporting.
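As a rough illustration of such an audit point, the sketch below turns one `tools/call` exchange into a single JSON log line. The request shape (`method`, `params.name`, `params.arguments`) follows MCP's JSON-RPC format; the record fields, the `user_id` and `role` inputs, and the `lookup_icd10` tool name in the usage note are assumptions for illustration.

```python
import json
from datetime import datetime, timezone


def audit_record(request: dict, response: dict, user_id: str, role: str) -> str:
    """Build one JSON audit line for a `tools/call` request/response pair."""
    if request.get("method") != "tools/call":
        raise ValueError("not a tools/call request")
    params = request.get("params", {})
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "role": role,
        "request_id": request.get("id"),
        "tool": params.get("name"),
        "arguments": params.get("arguments", {}),
        # JSON-RPC responses carry either `result` or `error`.
        "response": response.get("result", response.get("error")),
    }
    return json.dumps(record, sort_keys=True)
```

A call to a hypothetical `lookup_icd10` tool with `{"code": "J18.9"}` would produce one line containing the tool name, the arguments, the caller, and a UTC timestamp. Whether full payloads may be stored at all depends on the data involved and the applicable retention rules.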

MCP’s capability negotiation through the “initialize” and “tools/list” methods enables dynamic access control. Healthcare servers can advertise different tools based on authenticated user roles, ensuring clinicians see patient care tools while researchers access only anonymized datasets.
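A minimal sketch of role-aware tool advertisement follows. The tool names, the role names, and the `TOOL_CATALOG` structure are hypothetical, and a real `tools/list` result would also carry an input schema per tool, omitted here for brevity.

```python
# Hypothetical tool catalog: each tool declares which roles may see and call it.
TOOL_CATALOG = [
    {"name": "get_patient_summary", "description": "Summarize one patient's chart.",
     "roles": {"clinician"}},
    {"name": "query_anonymized_cohort", "description": "Query de-identified research data.",
     "roles": {"researcher"}},
    {"name": "lookup_icd10", "description": "Look up an ICD-10 code.",
     "roles": {"clinician", "researcher"}},
]


def list_tools_for(role: str) -> dict:
    """Return a `tools/list`-style result containing only tools allowed for `role`."""
    visible = [
        {"name": tool["name"], "description": tool["description"]}
        for tool in TOOL_CATALOG
        if role in tool["roles"]
    ]
    return {"tools": visible}


def authorize_call(role: str, tool_name: str) -> bool:
    """Re-check the role at call time; hiding a tool from the list is not enough."""
    return any(t["name"] == tool_name and role in t["roles"] for t in TOOL_CATALOG)
```

Filtering the advertised list shapes what the model attempts, while the check at call time is what actually enforces the boundary.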

This protocol-aware approach to healthcare security creates the foundation for production deployments that satisfy both technical requirements and regulatory compliance, as demonstrated in our next section’s case study.

Case Study: what production MCP looks like at scale

When Anthropic announced its expanded healthcare and life sciences capabilities in January 2025, our team from deepsense.ai played a key role as the specialized partner building the MCP infrastructure that powers these integrations. Our work developing MCP servers for CMS Coverage, bioRxiv, ChEMBL, ClinicalTrials.gov, ICD-10, and the NPI Registry provided real-world validation of the architectural principles we’ve outlined.

The Challenge: Public Health Data at Enterprise Scale

Unlike typical enterprise integrations, healthcare MCP servers must handle massive public datasets while maintaining enterprise-grade security and compliance. The ICD-10 database contains over 70,000 diagnosis codes, while ClinicalTrials.gov indexes hundreds of thousands of studies. These systems require different security models than patient data — the information is public, but the access patterns and integration points must still meet healthcare security standards.

Our AWS Implementation Architecture

To address these enterprise-scale challenges while ensuring HIPAA-level isolation, we designed a cloud-native foundation on AWS. Our architecture prioritizes “defense-in-depth”, ensuring that every data request is isolated, encrypted, and recorded.

Container Orchestration: Each MCP server runs as an isolated task in an Amazon ECS service on Fargate. We use separate task definitions for each data source (bioRxiv, ChEMBL, etc.), allowing independent scaling and updates without cross-contamination.

Network Security: All containers operate within private subnets with carefully configured security groups that restrict inbound traffic to specific ports and authorized sources, while all data transmission uses TLS encryption in transit by default.

Observability and Compliance: We use CloudWatch to store metrics on server performance, traffic, infrastructure uptime, and resource consumption. Any metric that exceeds its configured threshold immediately triggers a CloudWatch alarm. We do not store user requests or responses of any kind, so no PII is ever stored in our environment.

While this infrastructure provides a secure environment for the MCP servers, security alone isn’t enough for a production environment. We also had to deliver reasonable server performance while keeping complexity and cost relatively low.

How to keep MCP fast without breaking compliance

In life sciences, a secure system that is too slow to use is a failed system. To bridge the gap between rigorous security protocols and the need for fast responses, we implemented the following capabilities.

Performance Caching: ElastiCache Redis clusters cache frequently accessed public data like ICD-10 codes and clinical trial metadata. This reduces API call latency by up to 80% for common queries while ensuring data freshness through TTL-based invalidation.
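The TTL idea can be sketched without Redis. The class below is an in-memory stand-in, not the deployed implementation: `TTLCache`, `get_or_fetch`, and the injectable `clock` are names invented for this example.

```python
import time
from typing import Any, Callable


class TTLCache:
    """In-memory stand-in for a cache with per-entry expiry (TTL-based invalidation)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: dict[str, tuple[float, Any]] = {}

    def get_or_fetch(self, key: str, fetch: Callable[[], Any]) -> Any:
        """Return a fresh cached value, or call `fetch` and cache its result."""
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # fresh entry: skip the upstream API call
        value = fetch()
        self._store[key] = (now + self._ttl, value)
        return value
```

Because only public reference data is cached, an expired entry costs one extra upstream call rather than creating a consistency problem.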

Stateless Design: Our MCP servers maintain no local persistent state beyond caching. Each tool invocation processes requests independently, simplifying horizontal scaling and eliminating data consistency concerns across container instances.

Monitoring Critical Metrics: CloudWatch dashboards track API availability for each data source, cache hit ratios, and request volume patterns. Custom alarms notify on service degradation or unusual access patterns that might indicate issues.

Error Handling: Instead of exposing raw API error responses, our MCP tools return meaningful, explainable reasons when requests fail. This approach protects against information leakage while providing Claude with actionable feedback that can be communicated clearly to end users.
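One way to implement this is a fixed mapping from upstream failures to safe messages. The status codes and wording below are hypothetical; the returned shape uses the `isError` flag and `content` list that MCP tool results define.

```python
# Hypothetical mapping from upstream HTTP status to a safe, explainable message.
SAFE_MESSAGES = {
    404: "No matching record was found for this query.",
    429: "The data source is rate limiting requests. Try again shortly.",
    503: "The data source is temporarily unavailable.",
}
DEFAULT_MESSAGE = "The request could not be completed."


def to_tool_error(status_code: int) -> dict:
    """Build an MCP-style tool result flagged as an error, without passing on
    upstream response bodies, stack traces, or internal hostnames."""
    text = SAFE_MESSAGES.get(status_code, DEFAULT_MESSAGE)
    return {"isError": True, "content": [{"type": "text", "text": text}]}
```

Unknown failures fall back to a generic message, so a new upstream error type cannot leak details by default.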

By balancing these performance gains with a strictly stateless design, we achieved a system that is both highly responsive and inherently auditable. This dual success paved the way for the significant technical wins observed in our deployment on AWS. The combination of containerized isolation, comprehensive logging, and intelligent caching proved essential for handling the scale and reliability requirements of production healthcare AI systems.

Why compliance becomes a competitive advantage

Building MCP servers for regulated industries isn’t just about adapting existing patterns — it requires rethinking AI infrastructure from the ground up with compliance as a core architectural principle.

The healthcare AI transformation is happening now. Organizations that master compliant MCP deployment will gain significant competitive advantages in bringing AI-powered solutions to market faster and more securely than their competitors. The question isn’t whether to build these systems, but whether you have the expertise to build them right.

Production deployments show that MCP architectures designed with compliance in mind can meet both regulatory and operational requirements at scale.

Mateusz Kwaśniak, Senior Technical Lead, deepsense.ai

Why MCP deployments break in healthcare (and what actually works in production) was originally published in The Applied AI Razor on Medium, where people are continuing the conversation by highlighting and responding to this story.

After LLMs: why World Models are becoming the next AI architecture?

deepsense.ai — Wed, 25 Mar 2026 10:51:01 GMT

The AI landscape is entering a new architectural phase. After years of optimizing Large Language Models through scale and next-token prediction, a different direction is emerging: systems that learn internal models of the world.

Google’s Genie 3 and the recent wave of JEPA-based research point in the same direction — toward models that capture state, dynamics, and cause-and-effect, rather than surface-level text patterns.

Why World Models Matter Beyond LLMs

This converts theory into strategy.

The Rise of Non-Autoregressive Reasoning

For technical decision-makers, the signal here is the move away from pure autoregression. The recent “LeJEPA” publication (Balestriero and LeCun) demonstrated, with mathematical rigor, that pretraining Joint Embedding Predictive Architectures (JEPA) (LeCun) can be achieved without heuristic tricks, specifically by aligning embeddings with isotropic Gaussians. Meanwhile, the newly released VL-JEPA (Vision-Language JEPA) (Chen et al.) provides a concrete illustration of the benefits of applying JEPA to multimodal tasks.

VL-JEPA: Predicting Representations Instead of Tokens

VL-JEPA (Vision-Language Joint Embedding Predictive Architecture) (Chen et al.) is a novel approach to integrating visual perception and textual understanding. Unlike standard Multimodal LLMs (MLLMs) that process inputs to autoregressively generate discrete tokens, VL-JEPA operates entirely within a continuous latent space.

This distinction is crucial for engineering leaders to understand:

Higher-Level Abstraction: By predicting representations rather than pixels or tokens, the model captures semantic meaning (e.g., understanding that “the room is dark” and “the lamp is off” are state-equivalent) without being unduly influenced by surface-level variability.
One-Shot Generation: It is non-autoregressive and can predict the entire target embedding sequence in a single forward pass.

Figure 1. Comparison of the VL-JEPA architecture with the standard JEPA diagram (LeCun; Chen et al.)

Architectural Breakdown

The architecture of VL-JEPA is a masterclass in component reusability and efficiency. It adapts the standard JEPA design (see Figure 1) for multimodal data by integrating specialized encoders:

Vision Encoder Enc(Xv): It utilizes a pretrained V-JEPA 2 model. Crucially, this encoder is frozen during training, with only the projection layers being fine-tuned. This leverages V-JEPA 2’s existing understanding of video and physical dynamics.
Text Encoder/Latent Z(XQ): An embedding layer that converts text tokens to embeddings.
The Predictor: This is where the reasoning occurs. It comprises the top 8 transformer layers of a Llama 3 (1B) model, repurposed to predict the target-text embedding from the visual input. Causal attention is removed, so the predictor is not autoregressive.
Y-Encoder Enc(Y): Initialized with EmbeddingGemma-300M, it generates target embeddings from target text (Y) used during training.

The Mathematics of Alignment: InfoNCE Loss

To ensure that the predicted representations are both meaningful and distinct, VL-JEPA employs the InfoNCE loss function. This objective balances two competing forces:

Representation Alignment: Pulling embeddings of positive pairs (matching image-text) closer together.
Uniformity Regularization: Pushing embeddings of negative pairs (batch noise) apart to prevent representation collapse.

The loss can be formalized as:
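A common form of InfoNCE over a batch of N pairs, written with the symbols defined just below, is the following. It is a reconstruction under assumptions (a one-directional loss and a generic similarity function sim, such as cosine similarity); the paper's exact formulation may differ.

```latex
\mathcal{L}_{\text{InfoNCE}}
  = -\frac{1}{N} \sum_{i=1}^{N}
    \log \frac{\exp\!\left(\operatorname{sim}(\hat{S}_{Y,i},\, S_{Y,i}) / \tau\right)}
              {\sum_{j=1}^{N} \exp\!\left(\operatorname{sim}(\hat{S}_{Y,i},\, S_{Y,j}) / \tau\right)}
```

The numerator rewards alignment of each prediction with its own target, and the denominator penalizes similarity to the other targets in the batch, which corresponds to the two forces listed above.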

Where S_{Y,i} is the target representation, Ŝ_{Y,i} is the prediction, and τ is the temperature parameter. This regularization enables the model to learn a structured world model without requiring pixel-perfect reconstruction.
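To see the two forces numerically, here is a small pure-Python sketch of a standard one-directional InfoNCE loss with cosine similarity. It is an assumption-laden illustration, not VL-JEPA's training code: the function names, the choice of cosine similarity, and the default temperature are all made up for the example.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(predictions: list[list[float]], targets: list[list[float]],
             tau: float = 0.1) -> float:
    """Mean over i of -log softmax(sim(pred_i, target_j) / tau) evaluated at j == i."""
    losses = []
    for i, pred in enumerate(predictions):
        logits = [cosine(pred, tgt) / tau for tgt in targets]
        m = max(logits)  # subtract the max before exponentiating, for stability
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

Predictions that match their own targets give a loss near zero, swapped predictions give a large loss, and fully collapsed embeddings (every vector identical) give log N, which is why collapse is penalized rather than rewarded.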

What This Means for Enterprise AI Systems

From an engineering perspective, architectures like VL-JEPA are less about research novelty and more about how production systems can become faster, more stable, and less prone to hallucination.

In enterprise settings, this shift toward embedding-based prediction changes how systems are composed, scaled, and evaluated in practice.

Works Cited

Balestriero, Randal, and Yann LeCun. “LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics.” arXiv:2511.08544, 2025, https://arxiv.org/pdf/2511.08544.
Chen, Delong, et al. “VL-JEPA: Joint Embedding Predictive Architecture for Vision-Language.” arXiv:2512.10942, 2025, https://arxiv.org/pdf/2512.10942.
LeCun, Yann. “A Path Towards Autonomous Machine Intelligence.” Version 0.9.2, 2022-06-27. Open Review, vol. 62, no. 1, 2022, pp. 1-62, https://openreview.net/pdf?id=BZ5a1r-kVsf.

Michał Kulczykowski, Senior Machine Learning Engineer at deepsense.ai

After LLMs: why World Models are becoming the next AI architecture? was originally published in The Applied AI Razor on Medium, where people are continuing the conversation by highlighting and responding to this story.