Is MCP creating the next generation of data silos?

This article was also published on LinkedIn

We’ve all heard the term. Along with “unclean” data, it’s basically the worst thing you can say to a CDO: “Your data is siloed”. It implies a messy, fragmented, poorly integrated data situation that could take years to straighten out and bring together to enable effective business outcomes.

And we’d all like to think that newer advances in the data and application landscape make things better — tie data together effectively, reduce barriers to collaboration, improve entity matching (or eliminate the need for it entirely with a distributed single source of truth) and otherwise ensure that these silos never occur again. Unfortunately, that’s not always true: Even the hottest, most hype-worthy tech can slide back into the data silo problems of the past. Today’s case in point? MCP.

MCP (Anthropic’s Model Context Protocol, for the 3 people on the planet not already aware of it) is essentially middleware for the AI infrastructure age. It’s designed to connect a range of enterprise data sources, APIs, SaaS apps, data lakes, and other business infrastructure to AI agents so they can read (answer questions) and write (perform actions) based on human language commands. Unlike the information baked into general-purpose LLMs, MCP is designed to deliver and integrate “live” business data in real time. Think about asking an automated airline AI agent to help you switch to an earlier or later flight on your day of travel: The AI needs to know (and be able to adjust) your specific travel plans, not just what an airplane is, and it needs to do it now, with up-to-date information about flight times, seat availability, your status with the airline, and other personalized, real-time data that lives inside airline reservation and other operational systems.
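
To make that concrete, here is a minimal sketch of what an MCP server for that airline scenario could look like, assuming the official MCP Python SDK’s FastMCP helper; the tool names and the in-memory reservation data are hypothetical placeholders, not a real airline system.

```python
# A minimal MCP server sketch for the airline scenario above.
# Assumes the official MCP Python SDK ("mcp" package); the tool names and
# the in-memory reservation data are hypothetical placeholders.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("airline-reservations")

# Stand-in for the real reservation system of record (illustrative only).
_RESERVATIONS = {"ABC123": {"flight": "UA100", "departs": "2025-06-01T08:00"}}

@mcp.tool()
def get_itinerary(confirmation_code: str) -> dict:
    """Read: return the traveler's current itinerary."""
    return _RESERVATIONS[confirmation_code]

@mcp.tool()
def rebook_flight(confirmation_code: str, new_flight: str) -> dict:
    """Write: move the traveler to a different flight."""
    _RESERVATIONS[confirmation_code]["flight"] = new_flight
    return _RESERVATIONS[confirmation_code]

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default, so an MCP-compatible agent can connect
```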

The goal of MCP is really two-fold: It forms an abstraction layer that allows those many different corporate systems (SaaS, apps, lakes, etc.) to look more-or-less the same to AI agents, which need a uniform way to understand

  • What data is available to them and what type of data it is (i.e., solve the data semantics or metadata problem)
  • How to actually read and write that data safely (i.e., solve the data connectivity and integration problem)

Closely related, MCP also has the goal of avoiding an “MxN” situation where every agent (the “M”) needs to build a custom connection to every data source (the “N”). If you imagine a large enterprise using thousands of agents with tens of thousands of data and API sources, you can easily see how that M×N explosion would quickly get out of control from a cost and maintenance perspective. Instead, MCP attempts to create one uniform abstraction wrapper around each data source that all AI agents (even across different vendors) can use.
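
As a rough sketch of what that uniform interface looks like from the agent side, here is how a client might discover and invoke tools using the MCP Python SDK; the server command, tool name, and argument are illustrative placeholders, and SDK details may differ by version.

```python
# Sketch of the agent (client) side, assuming the official MCP Python SDK.
# The server command, tool name, and argument below are illustrative placeholders.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

server = StdioServerParameters(command="python", args=["airline_server.py"])

async def main() -> None:
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Discovery/introspection: what can this source do, and with what data?
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

            # Data and command flow: invoke one of the advertised tools.
            result = await session.call_tool(
                "get_itinerary", arguments={"confirmation_code": "ABC123"}
            )
            print(result)

asyncio.run(main())
```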

So far, so good: We have a uniform abstraction layer for both metadata needs (discovery and introspection) and data needs (bidirectional data and command flow with reasonable throughput). So what’s the problem?

The problem is that MCP’s approach is to create a single wrapper (what MCP calls a “server”) around every individual data source, application, table, etc. In this design, each of these is exposed to the agent separately and individually. In some cases that might make sense — maybe your company has a single, large Amazon S3 bucket holding documentation and you want to create access to the terabytes of PDFs stored there. That would be a good use of MCP: A single MCP server wrapped around a large bucket of mostly homogeneous objects.
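
For that kind of homogeneous store, the wrapper can stay pleasantly small. A sketch, assuming the MCP Python SDK and boto3 with a hypothetical bucket name, might look like this:

```python
# Sketch: one MCP server wrapping one (hypothetical) S3 documentation bucket.
# Assumes the official MCP Python SDK and boto3; the bucket name is a placeholder.
import boto3
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("docs-bucket")
s3 = boto3.client("s3")
BUCKET = "acme-product-docs"  # hypothetical bucket name

@mcp.tool()
def list_documents(prefix: str = "") -> list[str]:
    """List document keys under an optional prefix."""
    response = s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix)
    return [obj["Key"] for obj in response.get("Contents", [])]

@mcp.tool()
def get_document(key: str) -> bytes:
    """Fetch a single document's raw bytes for the agent to process."""
    return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()

if __name__ == "__main__":
    mcp.run()
```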

The plot thickens...

But what if you have thousands of tables in a data lake, hundreds of SaaS apps, thousands — maybe tens of thousands — of S3 buckets, hundreds of operational databases spread around the globe…suddenly, wrapping each of these individually in an MCP server, even without the MxN problem, doesn’t sound so attractive. And not only is it a lot of work, it introduces a whole host of problems that have no great solution:

  • Data redaction: LLMs are marvelous things and they’re good at a lot of tasks…but 100% guaranteed compliant, privacy-preserving data transmission is not one of them. The only safe place to control (block, redact, obfuscate, or otherwise limit) 1st, 2nd, or 3rd party data transmission to the end user is before it reaches the LLM. So that means MCP isn’t just “N data source wrappers”, it also has to be “N data redaction layers”, each with its own compliance, privacy, security, and governance guarantees…not just once on initial hookup, but ongoing forever (a rough sketch of what this looks like follows this list).
  • AuthN and AuthZ proliferation: A similar problem is identifying and authorizing the human being using a (typically multi-tenanted) agent. You don’t want someone other than you rebooking your flights, or a customer performing actions only an employee or manager at the company should be allowed to do. But that means that each MCP server also needs to have a pass-through mechanism to identify (authenticate) and control (authorize) what any possible end user is performing. And since the “back door” of each MCP server is a different application, SaaS, data lake, bucket, etc., it means that literally every single mechanism used in a company for AuthN and AuthZ now requires connecting to and integrating with MCP servers, one by one.
  • Application (client)-layer data joins: Connecting to data and filtering it source-by-source and user-by-user is only the start of the siloing problem. LLMs, for all their capabilities, make lousy databases. Complex, high-volume joins and other sophisticated queries can only reasonably be carried out in a database (for operational data) or on a data lake (for analytical data). Trying to bring all the data back from many different databases individually to an AI agent through MCP server layers and then asking the AI agent to perform high-speed joins is a technical nonstarter — it would be way too slow (and likely way too expensive), if it even works at all.
  • MDM and data reconciliation: Another way in which LLMs and AI agents fail to simulate databases is in maintaining accurate, large-scale correlation tables such as those that live in an MDM system. Reconciling entities (also known as “matching”) is a critical and highly specialized enterprise data activity for a reason. Knowing that “Tim Wagner” on one system is actually “Timothy Wagner” on another, or that “Part #12345” and “Ball-peen hammer, 14 inches” are the same inventory item on others isn’t something an LLM is going to easily remember from query to query or keep up to date as customers and inventory change dynamically in the underlying systems of record.
  • Catalog and middleware layering, end-to-end forensics: There’s no denying that MCP is yet another abstraction layer, even if it’s there for a good reason. But adding another layer to existing database, API, and application catalogs and/or on top of existing middleware solutions such as Mulesoft, Boomi, or others inevitably makes for a technical stovepipe. Any problems requiring debugging, or end-to-end performance or security concerns will now require delving through one more layer of introspection, authN/Z, data access, and so forth to figure out what’s wrong or even just to monitor that nothing is.
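
To make the first two bullets above concrete, here is a rough, standard-library-only sketch of the redaction and per-user authorization logic that every one of those N MCP servers would need to repeat; the regexes, roles, and helper names are all hypothetical placeholders, not a compliance-grade implementation.

```python
# Rough sketch of the redaction + per-user authorization logic that each of the
# "N" MCP servers ends up repeating. The regexes, roles, and helpers are all
# hypothetical placeholders, not a compliance-grade implementation.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact(text: str) -> str:
    """Strip obvious PII *before* the data ever reaches the LLM or agent."""
    return SSN.sub("[REDACTED-SSN]", EMAIL.sub("[REDACTED-EMAIL]", text))

# Hypothetical mapping of agent end users to backend roles.
ROLES = {"alice@example.com": "employee", "bob@example.com": "customer"}

def authorize(user: str, action: str) -> None:
    """Pass-through authorization: map the agent's end user to backend rights."""
    if action == "issue_refund" and ROLES.get(user) != "employee":
        raise PermissionError(f"{user} may not perform {action}")

def handle_tool_call(user: str, action: str, raw_result: str) -> str:
    """What every server repeats, forever: authorize, then redact the backend result."""
    authorize(user, action)
    return redact(raw_result)

if __name__ == "__main__":
    print(handle_tool_call("bob@example.com", "get_invoice",
                           "Invoice for jane@corp.com, SSN 123-45-6789"))
```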

These problems aren’t insurmountable, and they don’t mean MCP is “bad”. But they do mean that thoughtless proliferation of thousands of individually built, separately owned and maintained MCP servers throughout a large enterprise will result in cost overruns, governance concerns, maintenance challenges, and — yes — a proliferation of nasty data silos. So, let’s take a look at some design patterns that make for healthy MCP adoption dynamics and avoid these issues from the start!

Healthy MCP design patterns

Left unchecked, MCP has the potential to create data silos and extra abstraction layers that add cost and complexity to an enterprise’s already overburdened IT team. But there are also good design approaches that minimize overhead while playing to what both LLMs and existing enterprise technologies excel at, without compromising either. Let’s take a look at some of these that are already emerging:

MCP as an alternative facade for SaaS and API gateways

One easy way to ensure that MCP doesn’t introduce additional complexity or data silos is to think of it as a “one-for-one” swap. This pattern is going to be prevalent with SaaS solutions, API gateways, and other approaches where an existing interface can simply be “swapped out” for an MCP alternative at essentially zero cost or complexity by the company using it. A good example is a SaaS service like Stripe: In addition to its existing APIs, Stripe is also making its services available through an MCP-compatible interface. A company already using Stripe will then simply decide which of these “facades” to use, hooking up its AI agents to Stripe’s MCP facade, its operational systems to Stripe’s APIs, its deployment mechanisms to Stripe’s command line wrappers, and so forth. This design pattern also makes sense for API and mobile gateways (though none of the major cloud or open source players offer an MCP alternative just yet) as well as for internal applications where the owning team has the time and resources available to create an MCP version of the app/API. Why it works: This pattern is one-to-many and owner driven (so, a highly amortized ROI) and avoids additional layers by having consumers plug directly into the form factor that makes sense from the producer, whether that’s APIs, command line scripts, SQL, MCP, or something else.
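
Here is a minimal sketch of the “one producer, many facades” idea, assuming a hypothetical payments service with an existing FastAPI-based REST facade and a new MCP facade built on the MCP Python SDK; nothing here reflects any real vendor’s API.

```python
# Sketch of the "one producer, many facades" idea: the same owner-maintained
# business logic exposed through an existing REST facade and a new MCP facade.
# The payments service, endpoints, and identifiers here are hypothetical.
from fastapi import FastAPI
from mcp.server.fastmcp import FastMCP

def create_payment_link(amount_cents: int, currency: str) -> dict:
    """Shared business logic, owned by the producing team."""
    return {"url": f"https://pay.example.com/{currency}/{amount_cents}"}

# Facade 1: the existing REST API, used by operational systems.
api = FastAPI()

@api.post("/payment-links")
def payment_link_endpoint(amount_cents: int, currency: str = "usd") -> dict:
    return create_payment_link(amount_cents, currency)

# Facade 2: the MCP server, used by AI agents. Each facade is run separately
# (e.g., uvicorn for the REST API, mcp.run() for the agent-facing facade).
mcp = FastMCP("payments")

@mcp.tool()
def payment_link_tool(amount_cents: int, currency: str = "usd") -> dict:
    return create_payment_link(amount_cents, currency)
```

Because the producer owns both facades, each one stays thin, which is exactly what keeps this pattern from adding a new layer or silo.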

MCP as a fully managed data integration service

Another approach to limiting complexity, risk, and silo bloat with MCP is to use an established data integration platform that “speaks” MCP. Vendia is an example of a company offering fully-managed MCP server approaches that can then integrate with a company’s existing operational, file system, and analytical data on the back end. By keeping the authN, authZ, data redaction, joins, and governance in a platform designed to perform those activities safely, auditably, and at scale, MCP’s limitations and risks don’t affect an enterprise’s data compliance, security, or performance through proliferation. Why it works: The data integration platform ensures that MCP clients only ever see data that’s already been cleansed, reconciled/joined, and audited for compliance, privacy, and safety. Because the platform is fully managed, the MCP server is low cost to adopt and maintain by the company, even though there may be many systems and a large volume of data connected through MCP and agents.

Potential Future Pattern: MCP as a data catalog alternative

As discussed earlier, one of MCP’s roles is to serve a discovery and introspection function — helping an AI agent figure out what data or commands are available, what type(s) of data they receive/deliver, etc. The closest equivalents we have to that functionality today in enterprise systems are data and API catalogs. They’re charged with similar responsibilities, but at the moment MCP is still very nascent compared to more established catalog solutions that also support data classification, custom metadata tagging, and more. But because MCP also handles the actual data flow, it’s a complicated “both more and less than a catalog” comparison. Data catalogs such as Polaris are part of a rapidly changing landscape, so it’s entirely possible that over time both conventional and AI needs will merge into an “MCP-flavored catalog” that can do everything catalogs offer today and everything MCP handles today.

Conclusion

MCP is a powerful — and critical — new addition to the enterprise middleware landscape. At the same time, it presents potential challenges that IT has seen many times before with middleware layers that attempt to simplify access to disparate backend systems but in doing so also create a new round of silos and overhead. Being thoughtful up front about scalable design patterns before MCP servers start to proliferate in an uncontrolled fashion across real and shadow IT landscapes can help companies of all sizes get the most out of this exciting new technology without incurring unnecessary costs or limitations down the road. Data integration platforms in particular offer an “easy button” to get started with a fully managed MCP solution, with proven, trustworthy approaches to hard problems like data access, partner data management, multiple data modalities, and end-to-end forensics, without the need to build these complex solutions from scratch.
