Testing Capabilities and Tools for Data Engineers – Part 2

ABSTRACT: This article outlines how and why data engineers test their work as they adopt automated testing and monitoring tools.

In part 1 of this series, “Testing Capabilities and Tools for Data Engineers,” we described the need for a well-planned data quality framework to ensure that data pipelines generate reliable data. The worst way to discover a data quality problem is for a customer to inform you about the issue. A fundamental goal of data engineers should be to build trust in the data that emanates from the pipelines they create.

This article dives into various types of tests required to build trust in data, including unit and end-to-end tests. It also examines various test and production monitoring tools that can facilitate the work.

Types of Data Tests

Data engineers are frequently responsible for one or more of the following types of data tests: unit tests, integration/component tests, and end-to-end tests.

Unit Tests: Verifications of small, standalone parts of code and data in isolation. Unit tests contrast with integration and system tests, which check how the essential parts of the ETL implementation work together.


A common shortcoming among data engineers is skipping unit tests or executing tests that are not fully adequate.


Performing unit tests in your development environment lets you check that your work is developed correctly. Testing may also be part of your CI/CD solution, helping ensure the quality of deployed data.

By following a checklist of best practices for developing and testing pipeline data, you can create reliable and robust pipelines (see unit testing best practices).

For example, data engineers should consider including (at a minimum) tests that answer the following questions; a brief sketch of such tests follows the list:

  • Are data record counts before and after loads as expected?
  • Are primary keys duplicated?
  • Are JOINed fields in the same format?
  • Do CASE WHEN statements work as intended?
  • Are data boundaries, aggregations, and formats transformed as expected?
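
As a rough illustration, the minimal pytest sketch below expresses a few of these checks against an in-memory SQLite database. The raw_orders table and the TRANSFORM_SQL statement are hypothetical stand-ins for a real pipeline's source table and transformation logic.

```python
import sqlite3

import pytest

# Hypothetical transformation SQL under test; in a real project this would be
# loaded from the pipeline's SQL files rather than defined inline.
TRANSFORM_SQL = """
    SELECT
        order_id,
        CASE WHEN amount IS NULL THEN 0.0 ELSE amount END AS amount
    FROM raw_orders
"""


@pytest.fixture
def conn():
    """In-memory SQLite database seeded with a tiny, hand-crafted fixture."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_orders (order_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO raw_orders VALUES (?, ?)",
        [(1, 10.0), (2, None), (3, 7.5)],
    )
    yield conn
    conn.close()


def test_record_counts_match(conn):
    # Record counts before and after the transformation should be equal.
    before = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
    after = conn.execute(f"SELECT COUNT(*) FROM ({TRANSFORM_SQL})").fetchone()[0]
    assert before == after


def test_primary_key_not_duplicated(conn):
    # The primary key should remain unique after the transformation.
    dupes = conn.execute(
        f"SELECT order_id, COUNT(*) FROM ({TRANSFORM_SQL}) "
        "GROUP BY order_id HAVING COUNT(*) > 1"
    ).fetchall()
    assert dupes == []


def test_case_when_replaces_nulls(conn):
    # The CASE WHEN expression should replace NULL amounts with 0.0.
    amount = conn.execute(
        f"SELECT amount FROM ({TRANSFORM_SQL}) WHERE order_id = 2"
    ).fetchone()[0]
    assert amount == 0.0
```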

Data transformations are necessary changes carried out between extracts and loads to targets. Transformations range from light (modification of field types, replacement of null values, application of standards) to heavy (hashing of PII fields, aggregation of data, changes to target data such as change data capture (CDC), filtering, and streaming joins). These processing tasks can be recurring batch tasks or streaming data tasks.

Data transformations are frequently designed and developed by data analysts who may individually test them. However, data engineers often have the responsibility to test transformations independently.
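
For instance, a heavy transformation such as PII hashing (mentioned above) can be exercised in isolation with a handful of hand-built values. The hash_pii function below is a hypothetical stand-in for whatever hashing routine a pipeline actually applies; the point is that the transformation itself is tested, independently of the analyst who designed it.

```python
import hashlib


def hash_pii(value: str, salt: str = "pipeline-salt") -> str:
    """Hypothetical PII-hashing transformation: salted SHA-256, hex-encoded."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def test_hashing_is_deterministic():
    # The same input must always hash to the same value so downstream joins still work.
    assert hash_pii("jane.doe@example.com") == hash_pii("jane.doe@example.com")


def test_hashing_masks_the_original_value():
    # The raw PII must not appear in the output, and the output has the expected shape.
    hashed = hash_pii("jane.doe@example.com")
    assert "jane.doe" not in hashed
    assert len(hashed) == 64  # length of a SHA-256 hex digest
```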

Integration tests: Data input, output, and business logic tests are applied at each stage of the data flow to verify accuracy and to detect deviations from requirements.

Data engineers should consider (at a minimum) the following categories of data integration tests; a brief sketch of the third category follows the list.

  • Tests to evaluate the functionality of all data transformations included in data flows along the pipeline. 
  • System integration tests to evaluate the integration of a data pipeline with external data sources. 
  • Tests to verify that data formats have not been erroneously changed.
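
As a minimal sketch of the third category, the test below loads a tiny source table into a target and checks that record counts and column formats survive the load. The run_load_job helper and the my_pipeline.jobs module are hypothetical; a real integration test would point at test instances of the actual source and target systems.

```python
import sqlite3

# Hypothetical pipeline entry point that copies raw_orders from a source
# database into an orders table in a target database; it stands in for the
# real load job being integrated.
from my_pipeline.jobs import run_load_job


def test_load_preserves_counts_and_column_formats():
    source = sqlite3.connect(":memory:")
    target = sqlite3.connect(":memory:")
    source.execute(
        "CREATE TABLE raw_orders (order_id INTEGER, order_date TEXT, amount REAL)"
    )
    source.executemany(
        "INSERT INTO raw_orders VALUES (?, ?, ?)",
        [(1, "2022-01-05", 10.0), (2, "2022-01-06", 7.5)],
    )

    run_load_job(source, target)  # hypothetical: populates the target's orders table

    # Record counts should match between source and target.
    src_count = source.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
    tgt_count = target.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    assert src_count == tgt_count

    # Column formats should not have been erroneously changed by the load.
    columns = {row[1]: row[2] for row in target.execute("PRAGMA table_info(orders)")}
    assert columns["order_date"] == "TEXT"  # dates still ISO-8601 strings
    assert columns["amount"] == "REAL"      # amounts still numeric
```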

After a series of sprints, integration testing may be needed to ensure that all the software components from build activities work well together.

End-to-End Pipeline Tests: An E2E test comprehensively exercises a complete workflow across an ETL pipeline, using a production or near-production environment.

Most modern data pipeline environments are incredibly complex, drawing data from dozens of sources into a data store or pipeline. Most data undergoes numerous changes from source to consumption, sometimes hundreds.

Data failures can occur at any point in the pipeline, whether it's a change or a problem among data sources, an adjustment to one of the steps of your pipeline, or a complex interaction between multiple pipelines. To ensure data pipeline reliability, plan end-to-end testing so that pipeline breaks become visible quickly.

Because E2E tests exercise the entire workflow, from source systems through consumption, in production or near-production environments, they catch failures that tests of individual components cannot.
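
A rough sketch of what such a test can look like appears below. The trigger_pipeline, get_run_status, and query_warehouse helpers are hypothetical wrappers around an orchestrator API and a warehouse client pointed at a staging environment; the structure of the test (trigger the full workflow, wait for it to finish, then validate the consumption-layer output) is the point.

```python
import time

# Hypothetical helpers wrapping the team's orchestrator and warehouse clients;
# they are not part of any particular product's API.
from e2e_helpers import trigger_pipeline, get_run_status, query_warehouse


def test_orders_pipeline_end_to_end():
    # Kick off the full workflow: extract -> transform -> load.
    run_id = trigger_pipeline("orders_daily", environment="staging")

    # Poll until the run finishes, or fail the test after roughly ten minutes.
    deadline = time.time() + 600
    while get_run_status(run_id) == "running":
        assert time.time() < deadline, "pipeline run did not finish in time"
        time.sleep(15)
    assert get_run_status(run_id) == "success"

    # Validate the consumption-layer output the way a downstream user would.
    rows = query_warehouse(
        "SELECT COUNT(*) AS n, COUNT(DISTINCT order_id) AS distinct_ids "
        "FROM analytics.orders WHERE load_date = CURRENT_DATE"
    )
    assert rows[0]["n"] > 0                         # data actually arrived
    assert rows[0]["n"] == rows[0]["distinct_ids"]  # no duplicated keys
```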

ETL and Data Pipeline Testing Tools

Individual data engineering teams must decide where ETL and pipeline tests can save them manual validation time and prevent broken data from reaching data targets.

Generally, if you can run a test manually, there is a means of automating it. Each automated test works as a checkpoint in any part of an ETL pipeline and investigates source characteristics, transformations, and outputs.

Numerous testing tools are available for unit testing; the following are some of the most commonly used tools to verify the quality of your data. Figure 1 highlights several "checkpoints" in data pipeline workflows where automated testing can be considered.

Figure 1: Primary checkpoints for ETL data pipeline testing.

Source: "State of Automated Testing in a Data Warehouse"

Planning and carrying out complex tests and validations can be daunting. Adding relevant and timely validation to your end-to-end ETL processing may save you from scrambling when someone e-mails to say the data looks incorrect. Your best bet is to find the data testing automation tools that fit your circumstances and try them.

Testing and Monitoring Tools 

The following is a list of open source and commercial testing and monitoring tools for data engineers. 

Open Source Tools

  • Deequ: A library of unit test features to detect errors. Deequ regularly calculates data quality measurements, checks user constraints, and publishes results.
  • dbt: A data transformation tool that includes a natural way to define tests within the transformation pipeline.
  • Great Expectations: A data assertion library that can be used with most pipeline tools. You can insert Great Expectations functions into your pipelines to make them more robust; a brief sketch follows this list.
  • JSON Schema: Using a schema for unit tests may seem strange, but many schema libraries allow you to enforce data requirements within the specification. 
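
As a brief illustration of the Great Expectations approach, the sketch below validates one small batch with a few declarative expectations. The exact API has shifted across releases; this assumes the classic pandas-based interface (great_expectations.from_pandas), where each expectation returns a result carrying a success flag, and the column names here are hypothetical.

```python
import pandas as pd
import great_expectations as ge

# A small frame standing in for one batch flowing through the pipeline.
batch = ge.from_pandas(
    pd.DataFrame(
        {
            "order_id": [1, 2, 3],
            "amount": [10.0, 7.5, 3.25],
            "status": ["shipped", "pending", "shipped"],
        }
    )
)

# Declarative assertions about the data itself rather than the code.
checks = [
    batch.expect_column_values_to_not_be_null("order_id"),
    batch.expect_column_values_to_be_unique("order_id"),
    batch.expect_column_values_to_be_between("amount", min_value=0),
    batch.expect_column_values_to_be_in_set(
        "status", ["shipped", "pending", "cancelled"]
    ),
]

# Fail the pipeline step (or the test wrapping it) if any expectation is not met.
assert all(check.success for check in checks)
```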

Commercial Tools

  • Acceldata: Monitors data in warehouses, ETL processes, migrations, and integrations. Acceldata's automated data validation platform, Torch, can integrate business rules as data validation tests.
  • Bigeye (Toro): A data monitoring framework with integrated automated testing, including accuracy, completeness, consistency, and freshness.
  • Databand: A metadata monitoring tool that provides measures of data quality (e.g., data schemas, data distributions, completeness, and custom metrics).
  • Monte Carlo: A customizable data monitoring and alert system in a web-based application designed to detect data warehouse anomalies before they become issues.
  • RightData: A data quality testing tool designed to automate data verifications. It identifies issues related to data consistency, quality, completeness, and gaps. 
  • Talend: A software solution for data quality, data management, and big data. Talend's Data Quality solution profiles, cleans, and masks data in any format or size. 
  • Xplenty: A cloud-based ETL platform to simplify data processing with an intuitive GUI to implement data transformations.

Conclusion

Data quality is a crucial driver of data warehouse, data analytics, and business intelligence solutions. 

Early and ongoing testing of a data pipeline development project provides an evaluation of data quality. Data engineers often direct a test effort that begins early in development to detect errors and achieve quick corrections.

Overall, ETL pipeline testing should focus on a high-quality end-to-end ETL process. This includes verifying that all required data is loaded, that all data transformations execute correctly, and that loads to the targets complete successfully. A data pipeline team should also carefully test SQL queries, stored procedures, and queries that generate aggregate and summary data.

Finally, with digital transformation initiatives focused on outstanding customer outcomes, it is increasingly critical for data engineers and the overall testing team to design and execute tests that accurately reflect customer requirements.

Wayne Yaddow is a freelance writer focusing on data quality and testing issues. Much of his work is available at DZone, Dataversity, TDWI, and Tricentis.

