A Financial Service Engineer's Search for Provability Amid the MCP-ification of Everything
NatWest Group Software Engineer Yamini Krishnaprakash on why observability is solved, correctness isn't, and most agent pilots are stuck between the two.

I have seen technical measures like latency and scalability. But the correctness of the outcome of a particular system, I don't know what kind of metrics are out there to monitor that.
Yamini Krishnaprakash's team wrapped a legacy SOAP API in an afternoon. REST endpoints, gRPC services, even decade-old protocols that nobody wanted to touch. They all got the same treatment. Her team at NatWest Group, one of the UK's largest banks, built an internal developer portal modeled on how organizations manage internal APIs: a catalog where teams publish MCP-wrapped services and other teams discover and reuse them instead of rebuilding from scratch. "People call it the USB hub for different backend systems," Krishnaprakash told The Read Replica. "You can MCP-ify anything using the same framework."
To get the internal buy-in, the pitch centered on preservation. MCP wraps existing systems without requiring teams to rearchitect, which means existing security models stay intact and nobody has to get sign-off on a net-new infrastructure stack. "Organizations are not going to build systems from the ground up," she said. "People feel safe that their system stays in the same place, but it's just augmented with AI, rather than reinventing everything and getting approvals for every new design and architecture."
For Krishnaprakash's team, integration turned out to be the easy part.
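The catalog pattern she describes can be sketched in a few lines. This is a minimal stdlib illustration, not NatWest's implementation: the `Tool`, `ToolCatalog`, and `soap_balance_handler` names are hypothetical, and the handler stubs out what would be a real SOAP client call. The point is structural: wrapping a legacy service means writing one handler, while the service itself stays untouched behind its existing auth path.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str
    handler: Callable[[dict], dict]  # calls the existing backend unchanged

class ToolCatalog:
    """Catalog where teams publish wrapped services and others discover them."""
    def __init__(self):
        self._tools: dict[str, Tool] = {}

    def publish(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def list_tools(self) -> list[dict]:
        # What an agent discovers: name and description only,
        # never the backend's protocol or wire format.
        return [{"name": t.name, "description": t.description}
                for t in self._tools.values()]

    def call(self, name: str, args: dict) -> dict:
        return self._tools[name].handler(args)

# Hypothetical wrapper for a legacy SOAP balance lookup; a real handler
# would invoke the existing SOAP client with existing credentials.
def soap_balance_handler(args: dict) -> dict:
    return {"account": args["account"], "balance": "stubbed"}

catalog = ToolCatalog()
catalog.publish(Tool(
    name="get_account_balance",
    description="Return the current balance for an account ID.",
    handler=soap_balance_handler,
))
```

Because the wrapper is just an adapter, "MCP-ifying" a decade-old protocol is one handler function per operation, which is why integration was the easy part.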
The model has to choose
Every agentic workflow involves tool selection. The model evaluates a request, scans the available tools, and picks one. When the toolset is small and descriptions are crisp, this works. When the toolset grows, the failure modes compound in ways that catch teams off guard. "Introducing hundreds and hundreds of tools in one workflow just because MCP can enable it just confuses the LLM and causes hallucinations," Krishnaprakash said. "It will not consistently make the right action."
Research has quantified what practitioners had been noticing: as tool counts rise, models drown in prompt bloat. Context windows fill with tool descriptions, leaving less room for actual reasoning. GPT-4 was observed hallucinating APIs that don't exist. Claude picked the wrong library for a user's request. The more options the model has, the worse it performs. Nearly all MCP tool descriptions contain at least one quality issue, and more than half have unclear purpose statements. "Ambiguous tool descriptions make some wrong tool selections," she said. But the deeper issue is that traditional evals weren't designed to catch this failure mode at all.
"We can verify whether a tool's output is consumed properly, but proving the right tool was selected is a different problem. It changes every time the prompt changes. That's where we spend a lot of time reinventing our evals." The output looks fine. The tool executes. The response is well-formed. Nothing in the logs tells you the agent chose wrong. You only find out when something downstream breaks.
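A selection eval scores the choice itself, separately from whether the tool's output was well-formed. The sketch below is a hypothetical harness, with `choose_tool` standing in for whatever model call produces the selection; here it is a trivial keyword stub. Note that the stub passes the obvious phrasings and misses the paraphrase, which is exactly the fragility Krishnaprakash describes: the eval has to be rebuilt every time the prompts change.

```python
# Stand-in for the model's tool choice; a real system would call an LLM here.
def choose_tool(prompt: str) -> str:
    return "get_balance" if "balance" in prompt.lower() else "transfer_funds"

# Hypothetical golden set: each case pins a prompt to the tool that
# should have been selected for it.
EVAL_CASES = [
    {"prompt": "What's my current balance?", "expected": "get_balance"},
    {"prompt": "Send £50 to savings.", "expected": "transfer_funds"},
    {"prompt": "How much is left after rent?", "expected": "get_balance"},
]

def selection_accuracy(cases) -> float:
    # Score the decision, not the downstream output.
    hits = sum(choose_tool(c["prompt"]) == c["expected"] for c in cases)
    return hits / len(cases)
```

Running this stub against the three cases scores 2/3: the paraphrased balance question routes to the wrong tool, yet nothing in the tool's own execution would flag it.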
Less is more
The workaround emerging across enterprise teams is counterintuitive: give the model less to work with. "We try to aggregate different tools into one MCP server and let the backend system redirect based on whatever use case we have," she explained. "Or even use a small language model in the flow rather than having hundreds of MCPs in a workflow. Then it's impossible to trace where things went wrong."
Amazon's February 2026 writeup on building agentic systems describes the same pattern. They built automated systems to generate standardized tool descriptions and defined cross-organizational standards for schema formalization. Even then, they note that poorly defined schemas result in the invocation of irrelevant APIs at runtime.
The teams making progress aren't building more capable agents. They're building more constrained ones—architectures where the model only has to choose between a handful of well-defined options. The enterprise play isn't "give the LLM access to everything." It's "hide most of the tools and route intelligently before the model gets involved."
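The "route intelligently before the model gets involved" pattern can be sketched as deterministic dispatch behind one aggregated tool. This is an illustrative sketch under assumed names (`BACKENDS`, `route`, the domain/intent pairs), not a specific enterprise's design: the model only chooses a coarse domain and intent, and plain code resolves which backend system actually serves it.

```python
# The LLM sees a handful of (domain, intent) options; the mapping from
# those to concrete backend systems lives in code, not in the prompt.
BACKENDS = {
    ("accounts", "read"):  "legacy-soap-accounts",
    ("accounts", "write"): "accounts-rest-v2",
    ("payments", "read"):  "payments-grpc",
    ("payments", "write"): "payments-grpc",
}

def route(domain: str, intent: str, payload: dict) -> dict:
    """Deterministic dispatch: the model picks (domain, intent); code picks the system."""
    backend = BACKENDS.get((domain, intent))
    if backend is None:
        # Unknown combinations fail loudly instead of letting the model
        # hallucinate a tool that does not exist.
        raise ValueError(f"no backend for {domain}/{intent}")
    # A real implementation would invoke the backend client here.
    return {"backend": backend, "payload": payload}
```

Keeping the fan-out in a lookup table means the model's decision space stays small and the backend mapping stays auditable, which is the constraint-over-capability trade the article describes.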
The million-dollar metric
Deloitte's 2025 Emerging Technology Trends study found that while 38% of organizations are piloting agentic solutions, only 14% have systems ready to deploy and just 11% are running them in production. Many point to infrastructure or cost as the blocker, but the real culprit is the system's inability to make the right call when it matters.
Krishnaprakash's team is on the prototype side of that line. "We have only been involved in prototyping and developing applications that have not been exposed to customers yet," she said. Not because the technology doesn't work, but because nobody has figured out how to prove it works correctly once it's live.
The irony is that observability itself is basically solved. LangChain's State of Agent Engineering report shows 89% of organizations have implemented monitoring for their agents. Teams can trace execution, measure latency, track tokens. All of that is covered. None of it answers the question that matters in a regulated environment: did the agent make the right decision?
"I have seen technical measures like latency and scalability. But the correctness of the outcome of a particular system, I don't know what kind of metrics are out there to monitor that." That's the million-dollar metric. And until someone figures it out, the industry had better get comfortable in prototype purgatory.
Authorized, but wrong
One early concern was whether MCP would force teams to rearchitect access controls, building entirely new auth models for agent-to-system communication. In banking, that would have killed adoption before it started. "What surprised us is that without diluting the existing security features, we're able to connect systems using MCP," explained Krishnaprakash. "The LLM cannot simply call a tool just because it has selected it. It needs to actually mimic an authorized consumer of that particular backend system. The same authentication and authorization goes into play."
That's what made enterprise adoption viable. Without it, her team would have been stuck building a parallel security model from scratch just for agentic AI. "Which is not convincing to business people," she said. "MCP just acts as another protocol, like REST and HTTP."
But security and correctness solve different problems. You can verify an agent was authorized to call a tool. You can confirm the call succeeded. You still can't verify whether that tool was the right one to call, whether the agent's interpretation of the request matched intent, whether the action fit the context. An authorized wrong decision is still a wrong decision.
Now build something worth proving
MCP solved the plumbing. The teams making progress are focused on orchestration: workflow structure, model routing, eval pipeline design. "We have to look away from MCP to the orchestration layer to make a good architecture. Because MCP is, at the end of the day, a protocol. Our focus has to be on the architectural system design view of the overall workflow."
The platform ecosystem is moving in the same direction. Supabase's Agent Skills, released in January 2026, embed Postgres best practices directly into coding agents' workflows, covering query performance, schema design, and RLS policy implementation. The knowledge activates before the agent writes the first line, not after a human catches the mistake.
"Coding agents give us time to do system design better," Krishnaprakash said. "We spend all our time now finalizing the design and taking it into different teams. The code itself is the easy part." The teams that solve correctness will ship. The ones still measuring latency and calling it progress won't know why they're stuck. For now, most of the industry is somewhere in between.