Automating Data Preparation in a Digitalized World, Part 2

Automating Data Preparation in a Digitalized World, Part 2

The Australian data warehouse automation vendor, Ajilius, helpfully offers on their website an alphabetical list of two dozen DWA products, including their own, which luckily comes first, complete with a quick opinion on or comment about what each of their competitors has to offer. They also list three characteristics of each product: methodology, data warehouse DBMS, and price where known. Although this is one vendor’s view of its competitors, it is based on publicly available information. I assume it to be largely correct as a basis for some observations about the DWA market and products.

Fully 18 of the 24 products support the dimensional (Kimball) approach to data warehousing, seven support the Data Vault approach, while two support the “3NF” (I assume a largely normal structure in this case), and a further two use unknown methodologies. Some products support more than one approach.

The preponderance of the dimensional methodology offers some insight into the origins and directions of the DWA market. It emphasizes ease, agility and speed of delivery of immediately usable data by IT to business users. This observation also points to one of the most enduring definition problems in data warehousing: the difference between a warehouse and a mart.

For simplicity, I have long distinguished between these terms as follows. A data mart is a store of data optimized for the needs of a particular group of users by subsetting and/or structuring the data for ease of understanding and speed of use. A dimensional (star schema) model is one very popular way of supporting the type of drill-down and slide-and-dice analysis common among business users. An enterprise data warehouse (EDW) is a store designed to integrate, reconcile and cleanse data from multiple sources, for which a largely normalized structure is the most appropriate design choice. A data warehouse is some combination of (or sometimes choice between) these components—EDW and data marts.

Within this set of definitions, data warehouse automation tools actually focus primarily on the process of designing and building data marts rather than data warehouses. With overarching goals of ease, agility and speed of delivery, this makes perfect sense.

So, are they data mart or warehouse automation tools? Unfortunately, life is not that simple. Indeed, in small or mid-sized enterprises with a limited number of data sources and minimal IT staff, the distinction between mart and warehouse becomes moot—the mart is the warehouse and vice versa. Some experts (with whom I disagree) maintain that conformed dimensions enable the creation of an enterprise-wide warehouse out of a set of dimensional data marts. Furthermore, a number of the DWA vendors have set their sights on larger enterprises where an EDW is required.

This has led some DWA vendors to support population of largely normalized and/or Data Vault approaches to data warehouse design as seen in the numbers above. Indeed, some tools that focus on the dimensional approach are flexible enough to support these other methods. In addition, Data Vault uses a hybrid normalized/dimensional model that supports IT agility in the face of the normal ongoing changes in requirements characteristic of business intelligence (BI) tool users.

The growing popularity of data discovery or exploration tools, such as Tableau and QlikView, has also affected the DWA market. Such tools, based on in-memory and/or columnar data stores and high performance parallel processing engines, have reduced the need for dimensional models to support BI user queries. In many cases, a “wide-table” design approach is easier for users to understand and explore, gives more flexibility in access, performs well in most circumstances, and is simpler for IT to design and maintain. This offers that possibility of new architectural components, such as that described in my recent white paper, “Data Discovery Automation—Learning  from the Warehouse Experience”, where the EDW can be supplemented or, in some cases, replaced by a data discovery hub with a wide-table, loosely normalized design. DWA tools can, of course, be optimized to support such a design approach.

Given such a broad scope, long-time architects and builders of data warehouses may justifiably ask: what is the difference between the DWA tools and the ETL (extract, transform and load) tools we have long used? The short answer is that the boundary has become increasingly blurred, with the main distinctions claimed around ease of use, agility and process scope. The long answer requires a closer look at how the ETL market is evolving. And that is the topic of Part 3.

Barry Devlin

Dr. Barry Devlin is among the foremost authorities on business insight and one of the founders of data warehousing, having published the first architectural paper on the topic in 1988....

More About Barry Devlin