The strengths and limitations of Data Mesh architectures.


Data Mesh, what is it?



Treat data as a product

Data Mesh refers to a large-scale data processing system with a decentralized architecture. This architecture is on the rise, especially in large companies operating in highly competitive fields.

The architecture is subdivided into "domains", each managed by a responsible team that enjoys true independence. These data domains are made interoperable through APIs. The idea is to have segmented, ready-to-use data! It is a bit of an update of "datamarts", but self-service, and with distributed management.
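The idea of a domain exposing an interoperable, ready-to-use data product through an API can be sketched as follows (the class, field names, and example endpoint are illustrative assumptions, not a standard Data Mesh interface):

```python
from dataclasses import dataclass


@dataclass
class DataProduct:
    """A domain-owned dataset published through a stable, documented contract."""
    domain: str    # owning Business Unit, e.g. "sales"
    name: str      # product identifier within the domain
    schema: dict   # column name -> type: the published contract
    endpoint: str  # API endpoint where consumers read the product

    def qualified_name(self) -> str:
        # Products are addressed by domain, keeping ownership explicit
        return f"{self.domain}/{self.name}"


orders = DataProduct(
    domain="sales",
    name="daily-orders",
    schema={"order_id": "string", "amount": "float", "day": "date"},
    endpoint="https://data.example.com/sales/daily-orders",
)
print(orders.qualified_name())  # -> sales/daily-orders
```

Other domains discover the product by its qualified name and consume it through the endpoint, without needing access to the owning team's internal pipelines.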

Data ownership is thus federated to enable faster value creation and learning cycles. "Federating" ownership often makes sense because business value can be better defined, prioritized and iterated by the Business Unit (the domain) that has the required expertise.

What is the nuance between a Data Lake, a Data Warehouse and a Data Mesh?

  • The Data Lake: as its name suggests, it provides a pool of data that is as recent and relevant as possible. Data is read-only, loaded raw, and interpreted only when read (schema-on-read).
  • The Data Warehouse: storage according to pre-established schemas designed around the uses that will be made of the data. The schema is applied as soon as the data is loaded (schema-on-write).
  • The Data Mesh: virtual gateways between databases that are specialized by domain. Data teams query, transform, and build bridges between domains to obtain the most relevant result possible.

The Data Mesh thus judiciously complements the Data Lake and the Data Warehouse. And it is today a groundswell, carried by the near-infinite resources of the Cloud.
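The schema-on-read / schema-on-write distinction above can be made concrete with a toy sketch (the helper functions are invented for illustration; real lakes and warehouses enforce this through their storage engines):

```python
import json

# Data Lake: store the record raw; interpret only when reading (schema-on-read)
lake = []
lake.append('{"amount": "12.5", "currency": "EUR"}')  # raw text, no validation


def read_from_lake(raw: str) -> float:
    # Interpretation happens at read time; a malformed record fails only now
    return float(json.loads(raw)["amount"])


# Data Warehouse: validate against a fixed schema at load time (schema-on-write)
warehouse = []


def load_into_warehouse(record: dict) -> None:
    if not isinstance(record.get("amount"), float):
        raise TypeError("schema violation: 'amount' must be a float")
    warehouse.append(record)


print(read_from_lake(lake[0]))                            # 12.5
load_into_warehouse({"amount": 12.5, "currency": "EUR"})  # accepted at load
```

The lake accepts anything and defers interpretation; the warehouse rejects non-conforming records before they are stored.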



What are the advantages of Data Mesh architectures?


Decentralization of data preparation.

The Data Mesh consumes raw data as input and returns it cleaned, with additional structure. These "products" can be consumed by entities other than the data owner, and the possible combinations are endless.
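As a minimal sketch, a domain's preparation step might take raw rows and publish a cleaned, structured product that other domains can consume (the function and field names are illustrative assumptions):

```python
def prepare_product(raw_rows: list) -> list:
    """Clean raw input and add structure, as a domain team would before
    publishing a data product for other domains to consume."""
    cleaned = []
    for row in raw_rows:
        amount = row.get("amount")
        if amount is None:  # drop unusable records
            continue
        cleaned.append({
            "customer": row["customer"].strip().lower(),  # normalize names
            "amount": round(float(amount), 2),            # enforce type/precision
        })
    return cleaned


raw = [
    {"customer": "  ACME ", "amount": "10.456"},
    {"customer": "Initech", "amount": None},  # rejected: missing amount
]
print(prepare_product(raw))  # [{'customer': 'acme', 'amount': 10.46}]
```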

Decentralized data ownership.

Each domain (BU) of the company owns its data, because it knows it best and uses it the most. It must therefore take care of collection, cleaning, transformation, and so on, and it has every interest in doing this well. The Data Mesh is (in theory) a guarantee of quality.

A real self-service.

The data is made accessible to everyone who needs it; there is no longer any compartmentalization or watertight silo. This is achieved by providing a self-service infrastructure with APIs, which is the only centralized point in the Data Mesh.
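That single centralized point, a self-service catalog that routes consumers to domain-owned APIs without holding any data itself, could look like this minimal sketch (the registry structure is an assumption, not a prescribed design):

```python
# The catalog is the only centralized component: it maps qualified product
# names to domain-owned API endpoints, without owning any data itself.
catalog: dict = {}


def register(domain: str, product: str, endpoint: str) -> None:
    """Called by a domain team when it publishes a data product."""
    catalog[f"{domain}/{product}"] = endpoint


def discover(qualified_name: str) -> str:
    """Consumers self-serve: look up the endpoint, then call the domain's API."""
    try:
        return catalog[qualified_name]
    except KeyError:
        raise LookupError(f"no data product registered as {qualified_name!r}")


register("sales", "daily-orders", "https://data.example.com/sales/daily-orders")
print(discover("sales/daily-orders"))
```

Because discovery is decoupled from storage, a domain can reorganize its internal pipelines freely as long as the registered endpoint keeps honoring its contract.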

Decentralized and therefore (theoretically) efficient governance.

Each domain (or BU) operates its own data governance: quality, sourcing, compliance, and so on. The "inter-domain" level then governs the harmonization of formats, system security, and data life cycles.
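The inter-domain layer can be sketched as a small set of global rules that every domain's product must satisfy, on top of the domain's own checks (the two rules below, harmonized column naming and declared retention, are illustrative assumptions):

```python
# Inter-domain governance: each domain runs its own quality checks, but a
# small set of global rules (format harmonization, life-cycle metadata)
# applies to every published product.
GLOBAL_RULES = {
    "snake_case_columns": lambda meta: all(c == c.lower() for c in meta["columns"]),
    "retention_declared": lambda meta: "retention_days" in meta,
}


def check_product(meta: dict) -> list:
    """Return the names of the global rules a product's metadata violates."""
    return [name for name, rule in GLOBAL_RULES.items() if not rule(meta)]


good = {"columns": ["order_id", "amount"], "retention_days": 365}
bad = {"columns": ["OrderId", "amount"]}
print(check_product(good))  # []
print(check_product(bad))   # ['snake_case_columns', 'retention_declared']
```

Keeping the global rule set small and automated is what lets governance stay federated without degenerating into a central bottleneck.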


What are the limits of Data Mesh architectures?











The risk of making the information system (IS) more complex.

Moving to a fully decentralized model can result in the creation of endless data silos and unnecessary duplication of effort.
To create new answers and new business insights, a new "mesh" is created by crossing different domains. This mesh can in turn be crossed with another domain, and so on. As the answers pile up, the complexity increases. This is the well-known mechanics of entropy, but at accelerated speed.

The Data Mesh is an operational model that continuously evolves over time.

There is no reference implementation of Data Mesh. Every business that adopts it must therefore plan for the evolution of its data platform and operating model over time. The risk is being continually overwhelmed by data owners, who are often not attached to IT.

Roles and responsibilities are not clearly established.

Everyone's roles and responsibilities must be standardized to guarantee that the deployment of a Data Mesh operates harmoniously. Without this pre-established organization with dedicated roles, data engineers can miss the requirements needed to build quality data.

The high risk in terms of data governance.

Federated IT governance should be a fundamental tenet of Data Mesh, emphasizing automation and standardization to enable more comprehensive, real-time monitoring, detection and remediation of the issues that arise here and there.
The intricacy of information flows can become such that it is impossible to carry out an impact analysis from one side of the system to the other. A data lineage tool that continuously analyzes all the flows and connects them to each other is a valuable aid in deploying a Data Mesh. This tool must be fully automated, and it must also cover the data-visualization layer, where management rules are increasingly being placed.
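The impact analysis such a lineage tool performs amounts to traversing the graph of data flows downstream from a changed dataset. A minimal sketch, assuming the flow map is already extracted (the example datasets and edges are invented):

```python
from collections import deque


def downstream_impact(flows: dict, changed: str) -> set:
    """Breadth-first traversal of lineage edges: everything fed, directly
    or transitively, by the changed dataset is potentially impacted."""
    impacted, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for target in flows.get(node, []):
            if target not in impacted:
                impacted.add(target)
                queue.append(target)
    return impacted


# Hypothetical lineage: source dataset -> list of datasets it feeds
flows = {
    "sales/raw-orders": ["sales/daily-orders"],
    "sales/daily-orders": ["finance/revenue", "marketing/cohorts"],
    "finance/revenue": ["exec/dashboard"],
}
print(downstream_impact(flows, "sales/raw-orders"))
```

In a real deployment the hard part is not the traversal but keeping the flow map complete and current across every domain, which is why the tooling must be fully automated.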

Conclusion 


The objective of the Data Mesh is thus to promote maximum flexibility and to create new "business insights", with ready-to-use data validated by its functional owners. And it works! Some companies dream of being data-driven, and the Data Mesh is a catalyst for this strategy.

However, mastering data processing across what are, in practice, hyper-intricate feed chains can quickly become a "pain point" and discredit the strategy. This can be remedied by automated impact analysis mechanisms (data lineage) spanning the complete system, giving an exhaustive view of where each piece of data is deployed.

Moreover, the costs associated with this inflation of information can grow exponentially. Cloud infrastructures lend themselves very well to this scalability, but Cloud providers' revenue models are largely based on data I/O. Cost considerations can quickly weigh in the balance and curb ambitions.

Feedback from ADEO/Leroy Merlin at Big Data Paris 2022
"The need for Impact Analysis / Data Lineage tools
to deploy the Data Mesh." 



#datalineage #bdaip2022 #datamesh




