Enterprises are heavily reliant on data to make intelligent and faster business decisions. To discover the insights, data from the various sources has to be brought to a single data warehouse for processing. The data might be from external sources and third-party apps which require integration with the data warehouse.
The integration process involves the ETL process, which acts as a catalyst in standardizing diverse and disparate data types. The process of ETL needs to be scalable and streamlined to derive the most benefit.
In the current scenario, data loading requires individual pipelines to load and manage large volumes of data from various sources in various formats. This is a tedious process that consumes a large amount of time, and money.
ETL automation approach
The SnapLogic ETL system allows users to load and manage bulk data in real-time, thereby saving the development time for data loading. The proposed pipeline automation system can help enterprises in loading the data faster and in a much-organized pattern.
We developed an Automation pipeline to load data from multiple sources (SQL Server, Oracle, IBM DB2, and other sources) to the destination (Snowflake) with automatic unit testing which zeroes human interaction.
With the help of this pipeline, users can create the table in the destination automatically and load bulk load data for the first time followed by an incremental load from the next time. Also, users can check all the test logs with schema testing like data types, constraints, and records comparison between source and destination, etc. by just passing few required parameters to the pipeline.
Reduction of around 1400 hrs. of project development time was achieved by using this ETL automation pipeline.
Required parameters to pass are:
- Source: Type of source (SQL Server, Oracle, etc.)
- Source Connection: Connection name of a source created in SnapLogic
- Destination Connection: Connection name of destination created in SnapLogic
- Source Schema: Schema of the source table
- Source Table: Name of the source table
- Key Columns: List primary key columns if any (for bulk upsert)
- Date Column: Date column in source table If any (for Incremental load)
ETL automation pipelines
The parent pipeline will load all the required input parameters from the input table and will pass each parameter for every iteration.
Child 1 Pipeline(C1):
C1 will check if the table exists in the destination. If there is no input table, it will generate the DDL script of the table and transforms the query which is suitable to the destination, and creates the table. If the table already exists then it will invoke the Child 2 pipeline.
Child 2 Pipeline(C2):
C2 will load the data into the Destination table (Snowflake). For the first iteration, the pipeline will do the bulk load and from the next iteration, it will do the incremental load based on the key columns and date provided in the input table. A unit testing framework is called after loading the data.
Use of Unit Testing Framework in ETL Automation Pipeline:
This unit testing framework will generate the test logs like checking data types, columns count, constraints, and records count between the source and destination. Test logs are stored in a table for historical analysis and users can build Power BI reports to check them with multiple filters applied.
Test Cases Covered by the Unit Testing Framework are:
Note: Data Transformations can be done based upon the business requirements
Talk to our experts:
Learn more on how a modern ETL solution with automated pipelines can increase the speed, efficiency, and accuracy of data analysis. Get in touch with our Data Integration experts at Technovert today.