The data lineage of {openAudit} to strengthen GCP's Dataplex

 

Google has taken the plunge into Data Management with Dataplex!

 

Dataplex is essentially a central data catalog for BigQuery, Google's managed data warehouse. 

 

The breadth of technologies analyzed keeps growing, with support for the other databases available in GCP, notably Cloud SQL, Bigtable and Spanner, as well as Looker, GCP's flagship data visualization solution. Dataplex is also beginning to automate metadata collection from third-party sources: MySQL, Snowflake, Databricks, etc.

 

With this, Dataplex customers get a complete view of their data in a single unified catalog, with descriptions and context.

 

Google is now looking to expand Dataplex to make it comprehensive and competitive in a Data Management market dominated by Informatica, Collibra and others. To that end, Google has integrated a "Data Lineage" feature into Dataplex.

 

Data Lineage makes it possible to track how data moves through an Information System: its origin, its successive transformations and its final impacts.

 

What is the purpose of "Data Lineage"? 

  • To be sure that data comes from an authoritative source.
  • To perform impact analysis in case of modification or deletion of a table.
  • To ensure that sensitive data is used correctly within the company and to ensure compliance with regulatory requirements.
  • To track errors in a data stream to their root causes.
  • To prepare for a migration by mapping a system in detail.

Clearly, "Data Lineage" is a key component of any Data Management solution's feature set.

 

However, Dataplex's Data Lineage functionality currently lacks features we consider essential to delivering on all of these promises.

We believe that Dataplex can be judiciously combined with {openAudit} to form a complete Data Management solution.

 

Current limitations of Dataplex's data lineage

Although the Data Lineage API automatically receives information from GCP sources, and can receive it through API calls for external sources, the graphical form of data lineage is only available for entries in the Dataplex Data Catalog. That Data Catalog only collects information automatically from BigQuery, Cloud Data Fusion (GCP's ETL) and Cloud Composer (GCP's orchestrator).
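For an external source, feeding the API means explicitly declaring a process, a run and lineage events. The sketch below shows roughly what that involves; it assumes the google-cloud-datacatalog-lineage Python client, the module path, class names and required fields are given from memory of the public API and may need adjustment, and the project, location and fully qualified names are placeholders.

```python
# Minimal sketch: manually pushing lineage for a non-GCP source into the
# Data Lineage API. Names follow the public client as we understand it
# (process -> run -> lineage event) and may need adjustment.
import datetime

from google.cloud import datacatalog_lineage_v1 as lineage

client = lineage.LineageClient()
parent = "projects/my-project/locations/europe-west1"  # placeholder

# 1. Declare the process (e.g. a Teradata load script).
process = client.create_process(
    parent=parent,
    process=lineage.Process(display_name="teradata_nightly_load"),
)

# 2. Declare one run of that process.
now = datetime.datetime.now(datetime.timezone.utc)
run = client.create_run(
    parent=process.name,
    run=lineage.Run(start_time=now, state=lineage.Run.State.COMPLETED),
)

# 3. Attach a lineage event linking a source entity to a target entity.
client.create_lineage_event(
    parent=run.name,
    lineage_event=lineage.LineageEvent(
        start_time=now,
        links=[
            lineage.EventLink(
                source=lineage.EntityReference(
                    fully_qualified_name="custom:teradata.sales.raw_orders"
                ),
                target=lineage.EntityReference(
                    fully_qualified_name="bigquery:my-project.analytics.orders"
                ),
            )
        ],
    ),
)
```

Repeating this for every statement of every third-party tool is exactly the kind of work that needs to be automated.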

Dataplex's data lineage does not offer field-level (column-level) analysis, which considerably narrows the range of possible analyses.

The Dataplex administrator has full access to the various GCP projects involved, which is not ideal from a security perspective.

All traceability information is only kept in the system for 30 days. 

A single view: Dataplex's graphical representation only lets you explore data lineage by expanding it iteratively, one hop at a time; each step shows a source, a transformation and a target. When there are hundreds of transformations, the analysis becomes tedious.

 
 

Combining {openAudit} with Dataplex

{openAudit}, our Data Lineage solution, can strengthen Dataplex in several ways:

  • To integrate information from third-party databases into Data Lineage in an automated manner: 

A database outside Google's scope (Teradata, Exadata, etc.) cannot currently "feed" Dataplex's Data Lineage API automatically, which means a large part of the system goes unanalyzed. {openAudit} provides this for a wide range of technologies, with the process 100% automated (a simplified sketch of the idea follows this list).

  • To integrate transformations processed by third-party ETLs/ELTs into the data lineage: 

ETL/ELT tools such as DataStage, BODS, Stambia and Talend are still very common in cloud architectures. If the transformations they manage are overlooked, a large portion of the system's logic remains hidden. Google's roadmap does not currently cover this.

  • To integrate the transformations in the dataviz layer into the data lineage:

Looker hasn't yet won over the entire GCP user base, and many companies keep using third-party solutions such as Qlik Sense or Power BI even after migrating to GCP. Since a large share of data preparation has moved into data visualization solutions, it's crucial to understand what's happening there.

This is what {openAudit} also offers through its granular Data Lineage in the data visualization layer.

  • To have an "end-to-end" view of the flows:

Our customers tell us that, ideally, a Data Lineage graph should of course allow detailed analysis through a granular view, but above all it should let them take in at a glance all the sources and all the impacts of a "data point" (field, table, database), to make analyses more efficient (see the graph-traversal sketch after this list).

  • To have a secure solution:

With {openAudit}, no one needs to access the projects themselves, whatever they are: only the metadata is extracted, autonomously, for processing.

  • To learn more about data usage and reduce costs:

Beyond processing, scheduling and the data visualization layer, {openAudit} analyzes logs from audit databases, typically certain Google Cloud Operations logs, in order to add usage and cost information to the Data Lineage (see the usage-analysis sketch after this list).

This helps strengthen security (in a DLP, Data Leak Prevention, context) by showing who is consulting which data, and whether that access is legitimate.

It also makes it possible to identify levers for reducing cloud costs by decommissioning unnecessary data pipelines (a FinOps concern).
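On the first point, integrating third-party databases automatically: {openAudit}'s own parsers are not described here, but the general idea of deriving lineage from code can be illustrated with a deliberately naive, hypothetical sketch that scans SQL scripts for INSERT ... SELECT statements and records source-to-target table edges. A production-grade implementation needs a real parser for each SQL dialect.

```python
import re

# Hypothetical, deliberately naive illustration of automated lineage
# extraction from SQL scripts: record (source_table, target_table) edges
# for every INSERT ... SELECT statement found.
TARGET_RE = re.compile(r"insert\s+into\s+(?P<target>[\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:from|join)\s+(?P<table>[\w.]+)", re.IGNORECASE)

def extract_edges(sql_script: str) -> list[tuple[str, str]]:
    """Return (source_table, target_table) pairs found in a SQL script."""
    edges = []
    for statement in sql_script.split(";"):
        target = TARGET_RE.search(statement)
        if not target:
            continue
        for source in SOURCE_RE.findall(statement):
            edges.append((source, target.group("target")))
    return edges

script = """
INSERT INTO dwh.f_sales
SELECT o.id, c.segment, o.amount
FROM staging.orders o
JOIN staging.customers c ON c.id = o.customer_id;
"""
print(extract_edges(script))
# [('staging.orders', 'dwh.f_sales'), ('staging.customers', 'dwh.f_sales')]
```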
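On the "end-to-end" view: in graph terms, it means that from any data point you can collect every upstream source and every downstream impact in one pass, rather than expanding the graph hop by hop. The field names and edges below are hypothetical and purely illustrate the traversal; this is not {openAudit}'s internal model.

```python
from collections import defaultdict, deque

# Hypothetical field-level lineage edges: (source_field, target_field).
EDGES = [
    ("teradata.sales.raw_orders.amount", "bq.staging.orders.amount"),
    ("bq.staging.orders.amount", "bq.dwh.f_sales.revenue"),
    ("bq.dwh.f_sales.revenue", "looker.sales_dashboard.total_revenue"),
    ("bq.dwh.f_sales.revenue", "powerbi.finance_report.revenue_kpi"),
]

downstream, upstream = defaultdict(set), defaultdict(set)
for src, dst in EDGES:
    downstream[src].add(dst)
    upstream[dst].add(src)

def reachable(start: str, graph: dict) -> set:
    """Breadth-first traversal: every node reachable from `start`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen

field = "bq.staging.orders.amount"
print("sources:", reachable(field, upstream))    # everything the field comes from
print("impacts:", reachable(field, downstream))  # everything it feeds, end to end
```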
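Finally, on usage and costs: the kind of signal described above can be approximated directly from BigQuery's job metadata. Below is a sketch, assuming the google-cloud-bigquery client and the regional INFORMATION_SCHEMA.JOBS view; the project, region and look-back window are placeholders, and this is not {openAudit}'s implementation.

```python
from google.cloud import bigquery

# Sketch: which tables were actually read over the last 90 days, and at what
# scanned-bytes cost? Tables that never show up here are candidates for
# decommissioning (FinOps). Adding user_email to the query would show who
# reads what (the DLP angle).
client = bigquery.Client(project="my-project")  # placeholder project

QUERY = """
SELECT
  ref.dataset_id AS dataset_id,
  ref.table_id   AS table_id,
  COUNT(*)                AS read_count,
  -- bytes are attributed to every table a job touches (approximation)
  SUM(total_bytes_billed) AS bytes_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS, UNNEST(referenced_tables) AS ref
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
  AND job_type = 'QUERY'
GROUP BY 1, 2
ORDER BY bytes_billed DESC
"""

for row in client.query(QUERY).result():
    print(row.dataset_id, row.table_id, row.read_count, row.bytes_billed)
```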

 

Learn more: 

Lowering Cloud Costs
 
 

 

Conclusion 

 

Dataplex is making a remarkable entry into the world of Data Management, particularly with its increasingly extensive data catalog coupled with security management.

Dataplex's Data Lineage is currently designed to match precisely the various components available in GCP. This limitation lets us position {openAudit} as a complement: multi-technology, fully automated data lineage that includes the dataviz layer, while also enabling the rationalization of systems (volumes / costs).
