Iceberg’s “illegal” feature (and how to protect your company from the dangers of MOR files)

Iceberg and the danger of MOR files

This article was also published on LinkedIn

Apache Iceberg is a widely used open table format for analytic data. It’s so ubiquitous that every one of the Fortune 1000 has some use of Iceberg somewhere in their tech stack, and it’s a preferred format for data platforms like Snowflake and public clouds like AWS. Iceberg excels at storing large tables, from terabytes up to petabytes in size, and gracefully handles changes to those tables over time, such as adding, deleting, or altering the information they contain. Iceberg is arguably the most important data format on the planet, powering BI and AI and supporting nearly every major business you can think of.

But there’s a concern lurking in all those bits. Handling incremental updates to existing tables is where problems can arise, because there’s a natural tension between doing that quickly and “keeping things clean”. Iceberg essentially has two ways of representing changes:

  • Copy-on-write (COW): With COW, when the data changes, the file that contains that data gets rewritten so that it holds all (and only) the updated information. The metadata that points to the new file is similarly rewritten, so that all new or changed information is cleanly stored. COW is a safe way to represent both the data and the metadata, because the current files by definition contain all and only the current information. If older data needs to be deleted for any reason, the older files can be removed safely and in their entirety, ensuring that all old information goes away crisply.
  • Merge-on-read (MOR): With MOR, when the data changes, a “delta” file (Iceberg calls it a delete file) is created that says what changes should be applied to the older information. For example, the MOR file might say, “Delete row #95” or “Delete the row containing ‘John Smith’”. (For our purposes, the difference between identifying the row by its position and identifying it by a natural key doesn’t matter.) When the data is read with an Iceberg-compliant library, the library automatically consults the MOR file and “forgets” row #95. However, that row still exists in the original data file…and therein lies the problem.
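
To make the difference concrete, here is a minimal toy model in Python. It is not Iceberg code: `ToyTable`, `delete_copy_on_write`, and `delete_merge_on_read` are hypothetical names invented for illustration, and a single Python list stands in for a data file. The point is only that both strategies give readers the same answer while leaving very different bytes on disk.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy stand-in for a table: one 'data file' plus optional delete records."""
    data_file: list[str]
    deleted_positions: set[int] = field(default_factory=set)

    def read(self) -> list[str]:
        # A compliant reader applies the delete records at read time.
        return [row for pos, row in enumerate(self.data_file)
                if pos not in self.deleted_positions]


def delete_copy_on_write(table: ToyTable, name: str) -> ToyTable:
    # COW: write a brand-new data file that omits the deleted row.
    return ToyTable(data_file=[row for row in table.read() if row != name])


def delete_merge_on_read(table: ToyTable, name: str) -> ToyTable:
    # MOR: leave the data file untouched and only record which positions to skip.
    positions = {pos for pos, row in enumerate(table.data_file) if row == name}
    return ToyTable(data_file=table.data_file,
                    deleted_positions=table.deleted_positions | positions)


original = ToyTable(data_file=["Ada Lovelace", "John Smith", "Grace Hopper"])
cow = delete_copy_on_write(original, "John Smith")
mor = delete_merge_on_read(original, "John Smith")
```

Here `cow.read()` and `mor.read()` return the same two rows, but `mor.data_file` still contains "John Smith" while `cow.data_file` does not. Under COW the superseded file can be removed outright once the older version is no longer needed; under MOR the live data file itself still carries the "deleted" row.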

Why "old data" might be a legal risk

Why would keeping old data around be an issue? There are both legal and contractual reasons out-of-date information can be dangerous in an enterprise setting. The EU’s “right to be forgotten” rules, notably Article 17 of the GDPR, require anyone holding personal data about a person in the EU to erase that data in its entirety if that person validly invokes their right to be forgotten. Retaining a human-readable copy of the information after such a request, as happens in an Iceberg MOR file scenario, can put the holder in violation of EU law, because MOR files leave readable copies of so-called “deleted” information in place. (With COW, the older versions of the data can simply be deleted to comply with the request.)

GDPR is an important reason to be careful of MOR files, but it’s far from the only one. NDAs and other data handling contracts usually require the destruction of all human- and machine-readable copies of material subject to the agreement, which would place a MOR file-based Iceberg representation in violation of those agreements. Any company that does business with government or military agencies, or that stores or handles privacy-sensitive information (PII and/or PHI), is likely subject to these considerations. And California’s counterpart to the GDPR, the CCPA as amended by Prop 24 (the CPRA), grants similar deletion rights and carries its own penalties, making it dangerous to retain “erased” data that can identify the person in question even when that person is in the US.

Avoiding MOR Problems

Legal and contractual risks involving MOR files are especially problematic in two scenarios:

  1. When data is retained for long periods of time, such as in a data lake
  2. When data is shared (or exposed) across different functions, companies, or regions, for example through a data product or through the data (table) sharing mechanisms in popular platforms like Snowflake

Companies can eliminate their risk exposure entirely by ensuring that their Iceberg-related tools and services don’t use MOR files and rely only on COW. However, that could be easier said than done, since Iceberg’s ubiquity means that finding and checking all those locations could be difficult, and in some cases higher-level services or libraries might not even expose the ability to turn off MOR files.
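
As a starting point for that kind of audit, here is a small Python sketch. It assumes you have already collected each table's properties into a dictionary (how you gather them depends on your catalog and tooling, so the `inventory` below is hypothetical, as are the table names and the `audit_write_modes` helper). The three property names and the two mode values are the ones Iceberg documents for controlling delete, update, and merge behavior. Whether a particular engine honors them, and what it does when they are unset, is something to verify for your own versions, so the sketch flags unset properties rather than assuming a default.

```python
# Iceberg's documented write-mode table properties and their two values.
MODE_PROPERTIES = ("write.delete.mode", "write.update.mode", "write.merge.mode")
COW, MOR = "copy-on-write", "merge-on-read"


def audit_write_modes(tables: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Return, per table, the properties that are not explicitly copy-on-write."""
    findings: dict[str, list[str]] = {}
    for table_name, props in tables.items():
        issues = []
        for key in MODE_PROPERTIES:
            value = props.get(key)
            if value is None:
                # Unset: behavior depends on the engine/library default.
                issues.append(f"{key} is unset (relies on the default)")
            elif value != COW:
                issues.append(f"{key} = {value}")
        if issues:
            findings[table_name] = issues
    return findings


# Hypothetical inventory, e.g. gathered from each catalog's table metadata.
inventory = {
    "sales.orders": {key: COW for key in MODE_PROPERTIES},
    "crm.contacts": {"write.delete.mode": MOR, "write.update.mode": COW},
}
report = audit_write_modes(inventory)
```

In this made-up inventory, only `crm.contacts` is reported: once for an explicit merge-on-read delete mode and once for an unset merge mode. A clean report does not prove that no MOR files exist, since files written before a property change remain until they are rewritten.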

The other alternative is to periodically “checkpoint” Iceberg tables by writing a clean copy of the current data. That produces a representation of the current version of the data with no MOR files attached, after which the older incremental versions can be deleted (along with any MOR files or superseded COW files they were using internally). Checkpointing in this fashion is possible with nearly any tool or service, but it does have the downside that retaining and managing historical versions becomes a more manual process. Forcing unrelated copies also defeats Iceberg’s ability to share unchanged data files across different versions, so it can result in larger storage costs over time when historical copies need to be retained for other reasons.
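
For teams running Spark, one way such a checkpoint might look is a sequence of Iceberg's Spark maintenance procedures: `rewrite_data_files`, `expire_snapshots`, and `remove_orphan_files`. The Python sketch below only builds the SQL text and does not execute anything. The helper name `checkpoint_statements`, the catalog name `lake`, the table name, and the cutoff timestamp are all hypothetical, and the chosen options are assumptions to check against the documentation for your Iceberg version before relying on them for compliance.

```python
def checkpoint_statements(catalog: str, table: str, older_than: str) -> list[str]:
    """Build Spark SQL calls to Iceberg maintenance procedures (not executed here)."""
    return [
        # Rewrite any data file with at least one delete attached, so the
        # rewritten files no longer contain the deleted rows.
        f"CALL {catalog}.system.rewrite_data_files("
        f"table => '{table}', options => map('delete-file-threshold', '1'))",
        # Expire older snapshots that still reference the old data and delete files.
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{older_than}', retain_last => 1)",
        # Clean up files that no table metadata references any more.
        f"CALL {catalog}.system.remove_orphan_files(table => '{table}')",
    ]


# Hypothetical catalog, table, and cutoff, for illustration only.
statements = checkpoint_statements("lake", "crm.contacts", "2024-01-01 00:00:00")
```

Each string in `statements` could then be passed to a Spark session configured with the Iceberg extensions. Expiring snapshots is what gives up the historical versions, which is the trade-off described above.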

Over the long haul, the Iceberg community will hopefully make features like MOR easier to control, especially when data needs to be shared or distributed across parties who might have differing contractual and legal obligations when it comes to handling data that should no longer be seen. Until then, sticking with a “COW only” approach or periodically cleaning up the data through copies is the safest route.

Need more help with data sharing challenges?

For more help managing data integration and collaboration challenges without the manual overhead of worrying about MOR files, data sovereignty, and other low-level issues, check out Vendia. We think about the intricacies of data sharing all day, every day, and can help you deliver solutions with security, trust, and privacy in mind!
