The new Cloud data pipelines: the principles of ELT and lots of SQL!

For years, ETLs such as BODS, DataStage, Informatica, Ab Initio or Talend were the kings of data pipeline construction. A window of opportunity had opened for them:
- The hardware constraints and limited performance of "on-premises" Data Warehouses argued for offloading transformations to third-party servers, hence the lesser success of ELTs (Oracle Data Integrator, ODI, comes to mind).
- Companies wanted to democratize the creation of data pipelines with business-oriented tools.
But Cloud architectures have changed the game, particularly because of the power of Cloud Data Warehouses.
Are we witnessing a massive, SQL-boosted renaissance of the principles that were at work in ELT? If so, how can we be part of this dynamic?

Why are companies abandoning ETL logic in the Cloud?

Historical ETLs revolutionized the world of data, but their design rests on paradigms that now show their limits:
- Optimizing ETL chains and parallelizing processing is not necessarily simple or intuitive, and depends heavily on the tool.
- The workflows we design work well for relatively "static" systems; in the era of the "data-driven" enterprise, however, systems keep expanding.
- Specialists in any given ETL are rare (and the same goes for ELTs), which is a problem for maintenance, especially since everyone has done their own "tuning" in the ETLs, with countless nestings that create opacity.

Moving towards new data pipeline tools: combining power and SQL!

Power at low cost
- The transformations are done within the Data Warehouse, as with the old ELTs, but this time in the Cloud: the native performance of Cloud DWHs (BigQuery, Snowflake, Redshift) is almost unlimited, which boosts the performance of these tools (Matillion, dbt, Airbyte, Fivetran, etc.); a minimal sketch of the idea follows below.
- These data pipeline tools are also more accessible in terms of cost: they often rely on open source modules (dbt comes to mind).

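To make the ELT principle concrete, here is a minimal sketch, with invented table and column names: the raw data is already loaded, and the transformation runs in place, letting the Cloud DWH engine do the heavy lifting (exact DDL syntax varies slightly between engines).

```sql
-- Minimal sketch with invented table/column names: load raw data first,
-- then transform it in place inside the Cloud DWH engine.
CREATE OR REPLACE TABLE analytics.daily_revenue AS
SELECT
    order_date,
    country,
    SUM(amount)              AS total_revenue,
    COUNT(DISTINCT order_id) AS order_count
FROM raw.orders
WHERE status = 'COMPLETED'
GROUP BY order_date, country;
```
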
SQL as a technical pivot
- The SQL transformations encapsulated in these data pipeline tools are open, modularized and designed for collaboration: versioning, reusable models, etc. (see the sketch below).
- SQL (or its dialects), which constitutes the technical base of these data pipeline tools, is mastered by most IT teams, which in times of "talent" shortage is not a neutral point!

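As an illustration of that modularity, here is what a small dbt model might look like (the file, model and column names are assumptions): a plain SQL file, versioned like any other code, whose ref() calls declare its dependencies so it can be reused by downstream models.

```sql
-- models/marts/fct_orders.sql (hypothetical file in a dbt project).
-- The model is just SQL under version control; ref() resolves dependencies
-- on other models, so dbt can build and document the whole DAG.
SELECT
    o.order_id,
    o.customer_id,
    o.order_date,
    p.amount
FROM {{ ref('stg_orders') }} AS o
LEFT JOIN {{ ref('stg_payments') }} AS p
    ON p.order_id = o.order_id
```
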
Migrating your ETL jobs to a Cloud data pipeline tool: "challenging"? Imagine, for example, a company (like the ones we know) that has accumulated more than 10,000 ETL jobs, built over several years, with complex transformations such as joins, aggregations or specific business rules. Migrating these ETL jobs to a Cloud data pipeline tool requires a complete overhaul of their execution, particularly when it comes to adopting a SQL approach.
Automation is essential.

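By way of illustration (the tables, columns and rule below are invented), a single graphical ETL job combining a join, an aggregation and a business rule typically collapses into one declarative SQL statement once it runs in the Cloud DWH:

```sql
-- Invented example: the join, aggregation and business rule that used to be
-- spread across graphical ETL components become one SQL statement executed
-- directly by the Cloud DWH.
CREATE OR REPLACE TABLE mart.customer_value AS
SELECT
    c.customer_id,
    c.segment,
    SUM(o.amount) AS lifetime_value,
    -- business rule formerly hidden in an ETL "expression" component
    CASE WHEN SUM(o.amount) >= 10000 THEN 'GOLD' ELSE 'STANDARD' END AS tier
FROM staging.customers AS c
JOIN staging.orders    AS o
    ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.segment;
```
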
80% of the effort = regaining control over your existing system. How?

Perform a comprehensive and automated inventory of the source system with {openAudit}
- By analyzing internal processes via physical data lineage, directly in the source database and in the ETL: analysis of views, nested views and other scripts associated with feeding the flows (a simplified illustration follows below).
- By analyzing how the information is used, via the logs of the audit databases, for data consumption and injection.
- By analyzing the impacts in the data visualization tools connected to the source database, to grasp the related complexity (calculation rules) and to achieve truly end-to-end data lineage.
- By analyzing task scheduling, in order to link it to data lineage and data usage.

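As a very simplified illustration of the raw material such an inventory works from (this is not {openAudit}'s internal mechanism), most databases and Cloud DWHs expose view definitions through INFORMATION_SCHEMA, one of the inputs needed to reconstruct physical data lineage:

```sql
-- Simplified illustration only: listing views and their SQL definitions from
-- INFORMATION_SCHEMA is one raw input for rebuilding physical lineage
-- (schema and column names vary slightly from one engine to another).
SELECT
    table_schema,
    table_name,
    view_definition
FROM information_schema.views
WHERE table_schema NOT IN ('information_schema', 'pg_catalog')
ORDER BY table_schema, table_name;
```
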
Optimize the scope of what needs to be migrated by cross-referencing uses with data lineage, with {openAudit}
- To reduce the migration effort by identifying live branches and dead branches to decommission upstream (see the sketch below).
- To define a roadmap by business line, by type of flow, etc.

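A rough sketch of the idea, with hypothetical tables standing in for lineage and usage extracts (lineage_targets and query_log are invented names): a table that is still fed by a flow but never read over the observation window is a candidate dead branch.

```sql
-- Hypothetical sketch: lineage_targets and query_log are invented stand-ins
-- for lineage and audit-log extracts. Targets that are fed but never queried
-- over the observation window are candidates for decommissioning.
SELECT l.target_table
FROM lineage_targets AS l
LEFT JOIN query_log AS q
    ON q.object_name = l.target_table
   AND q.query_date >= CURRENT_DATE - 180  -- assumed 6-month window
WHERE q.object_name IS NULL
GROUP BY l.target_table;
```
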
Migration: go through SQL, of course!

Converting ETL logic into SQL at the source, with {openAudit}
- Sequencing of elementary transformations: {openAudit} detects the sources, targets, filters and transformation logic of all ETL jobs and sub-jobs in the source code, and isolates the transformations that will be carried over to the target tool.
- Conversion to SQL: these ETL transformations are converted into SQL queries by {openAudit}, step by step.
- Checkpoints: each SQL step corresponds to a step of the source job, which allows {openAudit} to introduce real checkpoints (something that does not exist in ETLs); a sketch of the idea follows below.

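A sketch of the "one SQL step per ETL step" idea, with invented names: each elementary transformation is materialized, so row counts (and the data itself) can be checked between steps, which is exactly the checkpoint the original job lacked.

```sql
-- Invented names; each elementary transformation is materialized separately.

-- Step 1: filter, mirroring the job's filter component
CREATE OR REPLACE TABLE wrk.orders_filtered AS
SELECT *
FROM raw.orders
WHERE status = 'COMPLETED';

-- Step 2: aggregation, mirroring the job's aggregator component
CREATE OR REPLACE TABLE wrk.orders_by_day AS
SELECT order_date, SUM(amount) AS total_amount
FROM wrk.orders_filtered
GROUP BY order_date;

-- Checkpoint: row counts per step, to compare against the source job
SELECT 'orders_filtered' AS step, COUNT(*) AS row_count FROM wrk.orders_filtered
UNION ALL
SELECT 'orders_by_day', COUNT(*) FROM wrk.orders_by_day;
```
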
Converting ETL logic into data pipeline tool logic, as open models, with {openAudit}, plus tests:
- Then, the models are added to the project configuration, dependencies between models are defined, etc.
- The models are tested to verify that they work.
- The data can be tested to ensure its integrity (a sketch of such a test follows below).

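For instance, in a dbt-style project a data test can be a plain SQL query that returns the offending rows; the test passes only when the query returns nothing (the file, model and column names below are assumptions):

```sql
-- tests/assert_amount_is_not_negative.sql (hypothetical singular test).
-- The test fails if this query returns any rows.
SELECT order_id, amount
FROM {{ ref('fct_orders') }}
WHERE amount < 0
```
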
CONCLUSION

Moving away from legacy ETLs is not just a matter of technical modernization. It is a strategic opportunity for companies to rethink their data pipelines, making them more efficient, scalable and cost-effective (open source model). Modern, more open data pipeline tools, combined with powerful Cloud DWHs, embody this new era. By adopting these principles, companies can rely on lean yet robust architectures. The transition may seem complex, but with the right strategies and tools, it is possible to turn the page and build the data pipelines of the future.