Building a Robust Regression Test Framework for Source-to-Target Reconciliation with dbt and Snowflake

Building a Robust Regression Test Framework for Source-to-Target Reconciliation with dbt and Snowflake

·

6 min read

Data reconciliation projects are critical to ensuring data integrity and consistency across various systems and databases. As these projects involve significant data movement and transformation, it's essential to establish a robust regression test framework to validate the accuracy and reliability of the process.

Why a Regression Test Framework is Crucial

  • Early Detection of Issues: By automating tests, you can identify potential problems early in the development cycle.

  • Reduced Manual Effort: Automation significantly reduces the time and effort required for testing.

  • Improved Data Quality: Consistent testing ensures that data is accurate and complete.

  • Enhanced Confidence: A well-defined framework provides confidence in the reliability of the data movement process.

In this blog post, we are referring to a normal data migration Project. For demonstration, I will be using a CRM Table Structure. Ensure that data is successfully transferred into the targeted data warehouse.

Step 1: Identify Test Cases
Step 2: Automate Tests(Source to target reconciliation queries)
Step 3: Create a view for the automated test case
Step 4: Add view details in the regression base table
Step 5: Run the view using dbt and update the status in the result table

Test Data preparation:
Source Table (CRM System)
The source_table represents data from the CRM system.

CREATE TABLE source_table (
    customer_id INT PRIMARY KEY,
    customer_name VARCHAR(255),
    email VARCHAR(255),
    phone VARCHAR(20),
    address VARCHAR(255),
    created_at TIMESTAMP
);

The target_table represents the consolidated data in the data warehouse.

CREATE TABLE target_table (
    customer_id INT PRIMARY KEY,
    customer_name VARCHAR(255),
    email VARCHAR(255),
    phone VARCHAR(20),
    address VARCHAR(255),
    created_at TIMESTAMP,
    updated_at TIMESTAMP
);

Here's an example of how you might insert data into the source_table.

INSERT INTO source_table (customer_id, customer_name, email, phone, address, created_at) VALUES
(1, 'John Doe', 'john.doe@example.com', '123-456-7890', '123 Main St', '2024-01-01 10:00:00'),
(2, 'Jane Smith', 'jane.smith@example.com', '098-765-4321', '456 Elm St', '2024-01-02 11:00:00'),
(3, 'Alice Johnson', 'alice.johnson@example.com', '456-789-1234', '789 Pine St', '2024-01-03 12:00:00');

Populating the Target Table:
The transformation logic will be applied in a real-world scenario based on the data mapping rules. Here's an example target_table for demonstration purposes only.

INSERT INTO target_table (customer_id, customer_name, email, phone, address, created_at, updated_at) VALUES
(1, 'John Doe', 'john.doe@example.com', '123-456-7890', '123 Main St', '2024-01-01 10:00:00', '2024-02-01 10:00:00'),
(2, 'Jane Smith', 'jane.smith@example.com', '098-765-4321', '456 Elm St', '2024-01-02 11:00:00', '2024-02-02 11:00:00'),
(3, 'Alice Johnson', 'alice.johnson@example.com', '456-789-1234', '789 Pine St', '2024-01-03 12:00:00','2024-02-03 11:30:00');

Step 1: Identify Test Cases:
The first step in creating a regression test framework is to identify the critical test cases for your reconciliation process. Focus on data completeness, accuracy, and consistency across your source and target systems.

Step 2: Automate Tests(Source to target reconciliation queries):

Automation is key to efficient testing. Use SQL queries to automate your source-to-target reconciliation tests. For example, you can create queries to identify discrepancies between your source and target tables, such as mismatched or missing records.

Source-to-Target Reconciliation Query
This query identifies discrepancies where the source data doesn't match the target data or is missing in the target data:

SELECT
    source.customer_id,
    source.customer_name AS source_customer_name,
    target.customer_name AS target_customer_name,
    source.email AS source_email,
    target.email AS target_email
FROM
    source_table AS source
LEFT JOIN
    target_table AS target
ON
    source.customer_id = target.customer_id
WHERE
    source.customer_name != target.customer_name
    OR source.email != target.email
    OR target.customer_id IS NULL;

Target-to-Source Reconciliation Query
This query identifies discrepancies where the target data doesn't match the source data or is missing in the source data:

SELECT
    target.customer_id,
    target.customer_name AS target_customer_name,
    source.customer_name AS source_customer_name,
    target.email AS target_email,
    source.email AS source_email
FROM
    target_table AS target
LEFT JOIN
    source_table AS source
ON
    target.customer_id = source.customer_id
WHERE
    target.customer_name != source.customer_name
    OR target.email != source.email
    OR source.customer_id IS NULL;

Query Explanation:

  1. Source-to-Target Reconciliation Query:

    • This query uses a LEFT JOIN to include all records from the source_table and the matching records from the target_table.

    • The WHERE clause filters the results to show rows where:

      • The customer names are different between the source and target.

      • The email addresses are different between the source and target.

      • The record exists in the source but not in the target (target.customer_id IS NULL).

  2. Target-to-Source Reconciliation Query:

    • This query uses a LEFT JOIN to include all records from the target_table and the matching records from the source_table.

    • The WHERE clause filters the results to show rows where:

      • The customer names are different between the target and source.

      • The email addresses are different between the target and source.

      • The record exists in the target but not in the source (source.customer_id IS NULL).

Step 3: Create a view for the automated test case:

Once you have your queries, create a view in Snowflake to encapsulate these automated test cases. This allows you to easily reference and execute the tests as needed.

CREATE VIEW customer_tbl_migration_v1(
    customer_id,
    source_customer_name,
    target_customer_name,
    source_email,
    target_email
)
AS
SELECT
    source.customer_id,
    source.customer_name AS source_customer_name,
    target.customer_name AS target_customer_name,
    source.email AS source_email,
    target.email AS target_email
FROM
    source_table AS source
LEFT JOIN
    target_table AS target
ON
    source.customer_id = target.customer_id
WHERE
    source.customer_name != target.customer_name
    OR source.email != target.email
    OR target.customer_id IS NULL
UNION  ALL
SELECT
    target.customer_id,
    target.customer_name AS target_customer_name,
    source.customer_name AS source_customer_name,
    target.email AS target_email,
    source.email AS source_email
FROM
    target_table AS target
LEFT JOIN
    source_table AS source
ON
    target.customer_id = source.customer_id
WHERE
    target.customer_name != source.customer_name
    OR target.email != source.email
    OR source.customer_id IS NULL;

You can have similar views for all your different test scenarios.

Step 4: Add view details in the regression base table

Maintain a regression base table to keep track of your test views and their status. This table helps in organizing and managing your test cases systematically.

CREATE TABLE regression_base(
    reg_des VARCHAR(100),
    reg_script VARCHAR(250)
);
INSERT INTO
    regression_base (reg_des, reg_script)
VALUES
    ('view_tc_1', 'SELECT * FROM customer_tbl_migration_v1');

Create a execution_results table to hold the execution reults.

CREATE TABLE regression_results (
    reg_test_name VARCHAR(255),
    reg_test_verdict VARCHAR(50),
    execution_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Step 5: Run the view using dbt and update the status in the result table:

ntegrate your views with dbt to automate the execution of your tests. Use dbt to run the views and update the status in the result table based on the test outcomes.

SELECT * FROM regression_base
{% set query = "SELECT * FROM regression_base" %} 
{% set query_string = run_query(query) %}
{% for query in query_string %}
  {% set reg_script = query.values()[1] %}
  {% set reg_des = query.values()[0] %}
  {{ print("Running regression script: " ~ script) }}
  {% set results = run_query(reg_script) %}
  {{ print("Result status:" ~ results|length)}}
  {% if results|length > 0 %}
    {% set new_query_fail = "INSERT INTO regression_results (reg_test_name, reg_test_verdict) VALUES ('" ~ reg_des ~ "', 'FAIL')" %} 
    {% do run_query(new_query_fail) %}
    {{ print("Regression script execution failed. Check the view name:  " ~ reg_des) }}
  {% else %}
    {% set new_query_pass = "INSERT INTO regression_results (reg_test_name, reg_test_verdict) VALUES ('" ~ reg_des ~ "', 'PASS')" %} 
    {% do run_query(new_query_pass) %}
    {{ print("Regression script execution passed. View name: " ~ reg_des) }}
  {% endif %}    
{% endfor %}

Initiate the regression suite execution from dbt

Verify the final execution result in ression_results table.

Creating a regression test framework for source-to-target reconciliation is essential for maintaining data integrity and consistency. By following these steps and utilizing dbt and Snowflake, you can automate the reconciliation process, ensuring that any data discrepancies are promptly identified and addressed. This framework not only enhances the reliability of your data pipelines but also contributes to the overall data quality within your organization.

With this framework in place, you can confidently manage your data reconciliation processes, knowing that your data is accurate and consistent across all systems. Happy coding!