A Financial Service Engineer's Search For Provability Amid The MCPification Of Everything
Software Engineer at one of the UK’s largest banks, Yamini Krishnaprakash, on why observability is solved, correctness isn't, and most agent pilots are stuck between the two.

"I have seen technical measures like latency and scalability considered vital. But the correctness of the outcome, bias, data security, are the metrics that enable a product to go live for customers."
Every enterprise building agentic AI eventually reaches the same wall: the system works, but nobody can prove it works correctly. In regulated industries, that distinction is existential. Deloitte's 2025 Emerging Technology Trends study found that while 38% of organizations are piloting agentic solutions, just over 10% are running them in production. Capability is table stakes; in banking, healthcare, and government, it's provability that's harder to come by.
MCP was supposed to help close that gap by standardizing how agents connect to enterprise systems. One year in, the protocol has seen real adoption: OpenAI, Google DeepMind, and Microsoft have all backed it, the Linux Foundation now governs the spec, and wiring an agent to a production database has gone from a dedicated integration sprint to an afternoon of configuration. But researchers have also surfaced significant lapses: one report found roughly 1,000 MCP servers sitting on the public internet with no authorization controls, some wired directly into Kubernetes clusters, CRM platforms, and production databases. The protocol makes connection easy. Whether connection translates to production-grade trust is another question entirely.
Yamini Krishnaprakash is one of the engineers operating in that gap. A Software Engineer at one of the UK’s largest banks, Krishnaprakash builds AI agents on the enterprise systems her team wraps with MCP: REST endpoints, gRPC services, even decade-old SOAP interfaces nobody wanted to touch.
They all get the same treatment from Krishnaprakash and her team. Building a catalog of MCP-wrapped services lets the enterprise experiment with different use cases without rebuilding from scratch. "People call it the USB hub for different backend systems," Krishnaprakash told The Read Replica in a recent interview. "You can MCP-ify anything using the same framework."
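For flavor, here is roughly what that wrapping looks like in practice. This is a minimal sketch using the MCP Python SDK's FastMCP helper; the server name, endpoint URL, and tool are hypothetical stand-ins for whichever legacy backend is being exposed, not details from her team's systems.

```python
# Minimal sketch: exposing one legacy REST endpoint as an MCP tool.
# Assumes the MCP Python SDK ("mcp" on PyPI) and requests; names and URLs are hypothetical.
import requests
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("legacy-accounts")

@mcp.tool()
def get_account_balance(account_id: str) -> dict:
    """Fetch an account balance from the existing REST backend, unchanged."""
    resp = requests.get(
        f"https://accounts.internal/v2/accounts/{account_id}/balance",
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    # Serve over stdio so any MCP-capable agent can discover and call the tool.
    mcp.run()
```

The same handful of lines works whether the thing behind the function is REST, gRPC, or a SOAP envelope, which is what makes the catalog approach cheap to grow.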
To get internal buy-in, the pitch centered on preservation. MCP wraps existing systems without requiring teams to rearchitect, which means existing security models stay intact and nobody has to get sign-off on a net-new infrastructure stack. "Organizations are not going to build systems from the ground up," she said. "People feel safe that their system stays in the same place, but it's just augmented with AI, rather than reinventing everything and getting approvals for every new design and architecture."
The model has to choose
Every agentic workflow involves tool selection. The model evaluates a request, scans the available tools, and picks one. When the toolset is small and descriptions are crisp, this works. When the toolset grows, the failure modes compound in ways that catch teams off guard. "Introducing hundreds and hundreds of tools in one workflow just because MCP can enable it just confuses the LLM and causes hallucinations," Krishnaprakash said. "It will not consistently make the right action."
Research has quantified what practitioners had been noticing: as tool counts rise, models drown in prompt bloat. Context windows fill with tool descriptions, leaving less room for actual reasoning. GPT-4 was observed hallucinating APIs that don't exist. Claude picked the wrong library for a user's request. The more options the model has, the worse it performs. Nearly all MCP tool descriptions contain at least one quality issue, and more than half have unclear purpose statements. "Ambiguous tool descriptions make some wrong tool selections," she said. But the deeper issue is that traditional evals weren't designed to catch this failure mode at all.
"We can verify whether a tool's output is consumed properly, but proving the right tool was selected is a different problem. It changes every time the prompt changes. That's where we spend a lot of time reinventing our evals." The output looks fine. The tool executes. The response is well-formed. Nothing in the logs tells you the agent chose wrong. You only find out when something downstream breaks.
Less is more
The workaround emerging across enterprise teams is counterintuitive: give the model less to work with. "One proven pattern is to aggregate multiple tool actions into one MCP server and let the backend redirect to the right tool thus making the tool calling more deterministic," she explained. "Delegating complex tool selections to the model while they can be programmatically managed will only complicate the traceability of the workflow, of course with increased cost for token and inference."
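A rough sketch of that pattern, under the same assumptions as the earlier example (FastMCP, hypothetical backends): the model sees a single tool, and the fan-out to concrete services is ordinary, testable code.

```python
# Sketch: one aggregated MCP tool; routing to the concrete backend is deterministic code.
# The handlers below are stubs standing in for the wrapped REST/gRPC/SOAP services.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("customer-ops")

def _balance(customer_id: str, payload: dict) -> dict:
    return {"customer": customer_id, "balance_pence": 0}  # stub: call the legacy REST service here

def _freeze_card(customer_id: str, payload: dict) -> dict:
    return {"customer": customer_id, "card": payload.get("card"), "status": "frozen"}  # stub

OPERATIONS = {"balance": _balance, "freeze_card": _freeze_card}

@mcp.tool()
def customer_operation(operation: str, customer_id: str, payload: dict | None = None) -> dict:
    """Single entry point: the model picks this tool; code picks the backend."""
    handler = OPERATIONS.get(operation)
    if handler is None:
        return {"error": f"unknown operation '{operation}'", "allowed": sorted(OPERATIONS)}
    return handler(customer_id, payload or {})
```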
Amazon's February 2026 writeup on building agentic systems describes the same pattern. They built automated systems to generate standardized tool descriptions and defined cross-organizational standards for schema formalization. Even then, they note that poorly defined schemas result in the invocation of irrelevant APIs at runtime.
The teams making progress are building more constrained agents within architectures where the model only has to choose between a handful of well-defined options. The enterprise play is less about giving the LLM access to everything and more about hiding most tools and routing intelligently before the model gets involved.
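In practice that often looks like a thin routing step ahead of the model: classify the request, by rule or with a cheap model call, then hand the LLM only the few tool definitions relevant to that category. The group names and registry below are hypothetical placeholders.

```python
# Sketch: hide most tools and route before the model gets involved.
# The intent labels and registry are hypothetical; swap in your own classifier and tool store.
TOOL_GROUPS = {
    "payments": ["get_account_balance", "list_transactions", "initiate_transfer"],
    "cards": ["freeze_card", "replace_card"],
    "support": ["open_case", "get_case_status"],
}

def tools_for_request(intent: str, registry: dict) -> list:
    """Return only the tool definitions the model should see for this intent."""
    return [registry[name] for name in TOOL_GROUPS.get(intent, []) if name in registry]

# Usage (names assumed): intent = classify(request)
# response = llm.complete(request, tools=tools_for_request(intent, registry))
```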
The million-dollar metric
Krishnaprakash says prototyping is the easy part. "I have only been involved in prototyping and developing applications that have not been exposed to customers yet," she said. Not because the technology doesn't work, but because identifying and implementing the right evaluation framework, the work that actually takes a product live, is a whole different project.
The irony is that observability itself is basically solved. LangChain's State of Agent Engineering report shows about 90% of organizations have implemented monitoring for their agents. Teams can trace execution, measure latency, track tokens. All of that is covered. None of it answers the question that matters in a regulated environment: did the agent make the right decision?
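What a correctness-oriented check even looks like is still being worked out team by team. One common starting point is to assert on the decision itself, which tool the agent invoked, rather than on the shape of the output. The sketch below is a generic illustration under assumed interfaces (an agent whose trace exposes its tool calls), not her team's pipeline.

```python
# Sketch of a tool-selection eval: assert which tool was called, not just that the
# output parsed cleanly. The agent interface (run() returning tool-call records) is assumed.
from dataclasses import dataclass

@dataclass
class ToolCall:
    tool_name: str
    arguments: dict

EVAL_CASES = [
    {"prompt": "What's the balance on account 4471?", "expected_tool": "get_account_balance"},
    {"prompt": "Block the card ending in 9902", "expected_tool": "freeze_card"},
]

def tool_selection_failures(agent, cases=EVAL_CASES):
    """Return every case where the agent never invoked the expected tool."""
    failures = []
    for case in cases:
        tool_calls: list[ToolCall] = agent.run(case["prompt"])  # assumed interface
        called = [c.tool_name for c in tool_calls]
        if case["expected_tool"] not in called:
            failures.append({"prompt": case["prompt"],
                             "expected": case["expected_tool"],
                             "called": called})
    return failures
```

Because the right selection shifts whenever the prompt or toolset changes, the case list has to be rebuilt alongside them, which is the "reinventing our evals" cost Krishnaprakash describes.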
"I have seen technical measures like latency and scalability considered vital. But the correctness of the outcome, bias, data security, are the metrics that enable a product to go live for customers," she said. That's the million-dollar metric. And until someone figures it out, the industry had better get comfortable in prototype purgatory.
Authorized, but wrong
One early concern was whether MCP would force teams to rearchitect access controls, building entirely new auth models for agent-to-system communication. In banking, that would have killed adoption before it started. "What surprised us is that without diluting the existing security features, we're able to connect systems using MCP," Krishnaprakash said. "The LLM cannot simply call a tool just because it has selected it. It needs to actually mimic an authorized consumer of that particular backend system. The same authentication and authorization goes into play."
That's what made enterprise adoption viable. Without it, her team would have been stuck building a parallel security model from scratch just for agentic AI. "Which is not convincing to business people," she said. MCP works at her institution because it acts as another protocol layer like REST or HTTP without diluting the security architecture underneath.
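In code, that means the wrapper authenticates to the backend exactly as any other consumer would; being selected by the model grants nothing on its own. A minimal sketch, again assuming FastMCP and an OAuth-style service token issued by the existing identity provider (both hypothetical here):

```python
# Sketch: the MCP wrapper calls the backend as an ordinary authorized consumer.
# Token source, URL, and tool name are hypothetical; the point is that the existing
# authentication and authorization path is reused, not replaced.
import os
import requests
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("accounts")

def _service_token() -> str:
    # Issued by the existing identity provider, same as for any non-AI consumer.
    return os.environ["ACCOUNTS_SERVICE_TOKEN"]

@mcp.tool()
def get_transactions(account_id: str, days: int = 30) -> dict:
    """List recent transactions via the existing, already-governed REST endpoint."""
    resp = requests.get(
        f"https://accounts.internal/v2/accounts/{account_id}/transactions",
        params={"days": days},
        headers={"Authorization": f"Bearer {_service_token()}"},
        timeout=10,
    )
    resp.raise_for_status()  # a 401 or 403 here means selection alone bought the model nothing
    return resp.json()
```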
But security and correctness solve different problems. You can verify an agent was authorized to call a tool. You can confirm the call succeeded. You still can't verify whether that tool was the right one to call, whether the agent's interpretation of the request matched intent, whether the action fit the context. An authorized wrong decision is still a wrong decision.
Now build something worth proving
MCP solved the plumbing. The teams making progress are focused on orchestration: workflow structure, model routing, eval pipeline design. "We have to look away from MCP to the orchestration layer to make a good architecture," Krishnaprakash said. "Because MCP is, at the end of the day, a protocol. Our focus has to be on the architectural system design view of the overall workflow."
The platform ecosystem is moving in the same direction. Supabase's Agent Skills, for example, embed Postgres best practices directly into coding agents' workflows, covering query performance, schema design, and RLS policy implementation. The knowledge activates before the agent writes the first line, not after a human catches the mistake.
"Coding agents give us time to do system design better," Krishnaprakash said. "We spend all our time now finalizing the design and taking it into different teams." The protocol is in place and the security model holds. What remains is the hardest part: proving that the system does what it's supposed to do, not just that it's authorized to try. The teams that solve correctness will ship. The ones still wrapping legacy APIs and calling it progress will wonder why their pilots never graduate.