The data lineage of {openAudit} to reinforce GCP Dataplex

 


 

Google has jumped into the deep end of Data Management with Dataplex!

 

Dataplex is essentially a central data catalog for BigQuery, Google's managed DWH. 

 

The range of technologies it covers continues to grow, with support for other databases available in GCP, in particular Cloud SQL, Bigtable and Spanner, as well as for Looker, GCP's flagship data visualization solution. Dataplex is also starting to automate the collection of metadata from third-party sources: MySQL, Snowflake, Databricks, etc.

 

With this, Dataplex customers have a complete view of data within a single unified catalog with its descriptions and contexts. 

 

Google is now looking to round out Dataplex to make it comprehensive and competitive in the world of Data Management, a field dominated by Informatica, Collibra and others. To that end, Google has integrated a “Data Lineage” feature into Dataplex.

 

Data Lineage makes it possible to follow the path of data through an Information System: its origin, its successive transformations and its final impacts.

 

What is “Data Lineage” used for? 

  • To be certain that data comes from an authoritative source.
  • To carry out impact analysis in the event of modification or deletion of a table.
  • To ensure sensitive data is used correctly across the business and ensure compliance with regulatory requirements.
  • To track errors in a data flow to their root causes.
  • To prepare for a migration by mapping a system in detail.
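Two of these uses, impact analysis and root-cause tracking, boil down to walking a lineage graph downstream or upstream. Here is a minimal sketch of the idea, assuming lineage is already available as a list of (source, target) table pairs. The table names and the `impact` and `origins` helpers are hypothetical and chosen for illustration; this is not the Dataplex or {openAudit} API.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (source table, target table).
EDGES = [
    ("crm.customers", "staging.customers"),
    ("erp.orders", "staging.orders"),
    ("staging.customers", "dwh.sales"),
    ("staging.orders", "dwh.sales"),
    ("dwh.sales", "report.revenue"),
]


def build_graph(edges, reverse=False):
    """Adjacency map; with reverse=True the edges point from target to source."""
    graph = defaultdict(set)
    for source, target in edges:
        if reverse:
            graph[target].add(source)
        else:
            graph[source].add(target)
    return graph


def reachable(graph, start):
    """Breadth-first traversal: every node reachable from start (start excluded)."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def impact(edges, table):
    """Impact analysis: everything downstream of a table."""
    return reachable(build_graph(edges), table)


def origins(edges, table):
    """Root-cause tracking: everything upstream of a table."""
    return reachable(build_graph(edges, reverse=True), table)


print(sorted(impact(EDGES, "staging.orders")))
print(sorted(origins(EDGES, "report.revenue")))
```

The first call lists what would be affected by modifying or deleting `staging.orders`; the second lists every table that could be the source of an error seen in `report.revenue`.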

Clearly, “Data Lineage” is a crucial component of any Data Management solution.

 

However, Dataplex's Data Lineage functionality does not currently have the characteristics that we believe are essential for Data Lineage to deliver on all of these promises.

We believe that Dataplex can be usefully combined with {openAudit} for a complete Data Management solution.

 


The current limits of Dataplex data lineage 

Although the Data Lineage API receives lineage information automatically from GCP sources, and through API calls for external sources, data lineage graphs are only available for entries coming from the Dataplex Data Catalog. However, this Data Catalog only collects information automatically from BigQuery, Cloud Data Fusion (the GCP ETL) and Cloud Composer (the GCP orchestrator).

Dataplex's data lineage does not offer field-level ("in-field") analysis, which considerably reduces the range of possible analyses.
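To illustrate why the level of granularity matters, here is a minimal sketch comparing a table-level and a field-level impact analysis over the same lineage. The field names and both helper functions are hypothetical assumptions made for this example; they do not reflect how Dataplex or {openAudit} store lineage.

```python
# Hypothetical field-level lineage: target field -> set of source fields.
FIELD_LINEAGE = {
    "dwh.sales.amount": {"staging.orders.quantity", "staging.orders.unit_price"},
    "dwh.sales.customer_name": {"staging.customers.name"},
    "report.revenue.total": {"dwh.sales.amount"},
}


def table_of(field):
    """'dwh.sales.amount' -> 'dwh.sales'."""
    return field.rsplit(".", 1)[0]


def impacted_fields(lineage, changed_field):
    """Fields that depend, directly or transitively, on changed_field."""
    impacted = set()
    frontier = {changed_field}
    while frontier:
        frontier = {
            target
            for target, sources in lineage.items()
            if sources & frontier and target not in impacted
        }
        impacted |= frontier
    return impacted


def impacted_tables(lineage, changed_table):
    """Table-level view: any table fed by changed_table is flagged."""
    table_edges = {
        (table_of(source), table_of(target))
        for target, sources in lineage.items()
        for source in sources
    }
    impacted = set()
    frontier = {changed_table}
    while frontier:
        frontier = {t for s, t in table_edges if s in frontier and t not in impacted}
        impacted |= frontier
    return impacted


print(sorted(impacted_tables(FIELD_LINEAGE, "staging.customers")))
print(sorted(impacted_fields(FIELD_LINEAGE, "staging.customers.name")))
```

In this toy example, the table-level view flags both `dwh.sales` and `report.revenue` when `staging.customers` changes, whereas the field-level view shows that only `dwh.sales.customer_name` actually depends on `staging.customers.name`.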

The Dataplex administrator has full access to the various GCP projects involved, which is not ideal from a security point of view.

Lineage information is kept in the system for only 30 days.

A single view: Dataplex's graphical representation only lets you visualize the data lineage by expanding it iteratively, with each step showing the source, the transformation and the target. When there are hundreds of transformations, analyses can be tedious.

 
 

Combining {openAudit} with Dataplex

{openAudit}, our Data Lineage solution, can usefully strengthen Dataplex in several directions:

  • To integrate information from third-party databases into the Data Lineage in an automated manner: 

When a database sits outside the Google ecosystem (Teradata, Exadata, etc.) and therefore cannot currently “feed” Dataplex's Data Lineage API in an automated way, a large part of the system goes unanalyzed. We offer to integrate a wide range of technologies into the Data Lineage, with a fully automated process!

  • To integrate transformations processed by third-party ETLs/ELTs into the data lineage:

ETL/ELT tools such as DataStage, BODS, Stambia and Talend are still very common in Cloud architectures. If the transformations they manage are ignored, a large part of the processing remains hidden. Google's roadmap does not currently provide for this.

  • To integrate the transformations in the dataviz layer into the data lineage:

Looker has not yet won over all GCP users, and many companies still use third-party solutions, typically QlikSense or Power BI, even though they have migrated to GCP. Since much of the data preparation has moved into dataviz solutions, it is essential to understand what is happening there.

This is what {openAudit} also offers through its granular Data Lineage in the data visualization layer (with highlighting of all the management rules). 

  • To have an “end-to-end” view of the flows:

Our customers tell us that ideally, one of Data Lineage's graphical components should allow complete analyses at a single glance, typically to trace the sources of decision-making data for business analysis, or to assess impact across the full scope. {openAudit} offers different views with different levels of granularity.

  • To have a secure solution:

With {openAudit}, no one needs to access the projects themselves, since only the metadata is extracted, autonomously, for processing.

  • To find out more about the use of data and reduce costs:

Beyond the processing, scheduling and dataviz layers, {openAudit} analyzes the logs of audit databases, typically certain Google Cloud Operations logs, in order to add the uses and costs of the information to the Data Lineage.

This makes it possible to tighten security requirements (DLP context = Data Leak Prevention): who is consulting a given piece of data without good reason.

It also makes it possible to identify levers for reducing Cloud costs by decommissioning unnecessary data pipelines (FinOps context).
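As an illustration of the FinOps idea, here is a minimal sketch that crosses lineage with usage logs to spot pipelines whose output is never consulted, directly or further downstream. The pipeline names, the simplified log records and the `decommission_candidates` helper are all hypothetical; real audit logs have a richer structure, and this is not a description of how {openAudit} works internally.

```python
# Hypothetical lineage: pipeline name -> (source table, target table).
PIPELINES = {
    "load_orders": ("erp.orders", "staging.orders"),
    "build_sales": ("staging.orders", "dwh.sales"),
    "build_legacy_kpi": ("staging.orders", "dwh.legacy_kpi"),
}

# Hypothetical, simplified read events extracted from audit logs.
READ_EVENTS = [
    {"table": "dwh.sales", "user": "analyst@example.com"},
    {"table": "dwh.sales", "user": "bi-service@example.com"},
]


def downstream_tables(pipelines, table):
    """Every table fed, directly or transitively, by the given table."""
    found = set()
    frontier = {table}
    while frontier:
        frontier = {
            target
            for source, target in pipelines.values()
            if source in frontier and target not in found
        }
        found |= frontier
    return found


def decommission_candidates(pipelines, read_events):
    """Pipelines whose target, and everything derived from it, is never read."""
    read_tables = {event["table"] for event in read_events}
    candidates = []
    for name, (_source, target) in pipelines.items():
        consumers = {target} | downstream_tables(pipelines, target)
        if not consumers & read_tables:
            candidates.append(name)
    return sorted(candidates)


print(decommission_candidates(PIPELINES, READ_EVENTS))
```

Here `load_orders` is kept even though nobody reads `staging.orders` directly, because it feeds `dwh.sales`, which is consulted. Only `build_legacy_kpi` is flagged. The same read events, grouped by user, could support the DLP use case of reviewing who consults which data.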

 

Learn more: 

Lower Cloud Costs
 
 

 

Conclusion 

 

Dataplex is making a notable entry into the world of Data Management, in particular with its increasingly broad data catalog, associated with security management.

Dataplex's Data Lineage is currently designed to precisely match the different components available in GCP. This limit allows us to position {openAudit} as a complement: for automatic, multi-technology data lineage that includes the dataviz layer, while also allowing systems to be rationalized (volume / costs).
