Using Data Knowledge to Conquer Data Sprawl

The Challenge of Data Sprawl


Data sprawl describes the inordinate volume, variety, and complexity of data that organizations manage today. Data comes from many different sources, from legacy transaction systems to sensor data, open data, and more. (See figure 1.)

 Figure 1. Many Sources of Data

The data arrives at varying speeds, from highly latent batch feeds to real-time streaming data. It is stored, managed, and processed in different environments, including on-premises data centers and multiple cloud platforms. Discussions of cloud, multi-cloud, or hybrid that seemed on point just a few short years ago are pointless today. Everyone is both multi-cloud and hybrid, with applications and data deployed across on-premises servers and multiple cloud platforms. I sometimes think we should stop using the phrase “in the cloud” and instead talk about “in the clouds.” With only a few SaaS applications, you become multi-cloud either by design or by default. Figure 2 illustrates a typical modern, sprawling enterprise data environment with six distinct cloud platforms as well as on-premises deployments.

 Figure 2. Data Sprawl with Multi-Cloud and Hybrid Deployments

Herein lies the challenge: Data, data everywhere … and we don’t know what we have. Collecting, maintaining, and sharing knowledge about data is the critical first step in managing data sprawl. You simply can’t manage data without data knowledge. You need to know what entities the data represents—which customers, products, suppliers, facilities, etc. are described by the data. You also need to understand the schema for all of the data—to know what facts about entities exist as data elements, and how those data elements are organized and structured.

Stepping Up to Data Knowledge

Yes, knowledge of the data is critical. Without data knowledge it is difficult and impractical to manage, integrate, govern, and analyze data. But collecting and managing data knowledge is also difficult. Entity matching is needed to discover relationships, redundancies, and overlaps in data across distributed, disparate datasets. Schema matching is needed to identify semantically related objects and to inform integration, cataloging, governance, quality management, and analysis processes.
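To make entity matching concrete, here is a minimal sketch of the idea using nothing but normalized string similarity. The field names, datasets, and the 0.85 threshold are hypothetical simplifications; production matchers compare many attributes with learned weights rather than a single name field.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized edit-based similarity between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def match_entities(records_a, records_b, threshold=0.85):
    """Pair records from two datasets that appear to describe the same entity.

    records_a / records_b: lists of dicts with a 'name' field.
    The single-field comparison and fixed threshold are illustrative only.
    """
    matches = []
    for ra in records_a:
        for rb in records_b:
            score = similarity(ra["name"], rb["name"])
            if score >= threshold:
                matches.append((ra, rb, round(score, 2)))
    return matches

# Hypothetical records from two silos describing overlapping customers
crm = [{"name": "Acme Corp."}, {"name": "Globex Inc"}]
billing = [{"name": "ACME Corp"}, {"name": "Initech"}]
print(match_entities(crm, billing))
```

Even this toy version surfaces the core difficulty: the threshold trades false positives against false negatives, which is exactly where human training of the matcher earns its keep.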

The uses and impacts of data knowledge are pervasive throughout data management. Entity matching as batch processing is central to list management processes such as de-duplicating mailing lists. Real-time entity matching is a core capability for MDM systems to prevent duplicate records. Schema matching informs data governance, data quality, and data protection activities, helps analysts to find and enrich data, and accelerates data integration and engineering efforts with automated source-to-target mapping. Schema matching is also used to detect schema drift and maintain healthy data pipelines.
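The two schema-matching uses named above—source-to-target mapping and drift detection—can be sketched in a few lines. This is a simplified illustration, not any vendor's implementation; the column names and the 0.6 similarity cutoff are assumptions for the example.

```python
from difflib import get_close_matches

def map_schema(source_cols, target_cols, cutoff=0.6):
    """Propose source-to-target column mappings by name similarity."""
    mapping = {}
    for col in source_cols:
        hits = get_close_matches(col.lower(), [t.lower() for t in target_cols],
                                 n=1, cutoff=cutoff)
        if hits:
            # Recover the original-cased target column name
            mapping[col] = next(t for t in target_cols if t.lower() == hits[0])
    return mapping

def detect_drift(expected_cols, observed_cols):
    """Flag columns that appeared or disappeared since the last pipeline run."""
    return {
        "added": sorted(set(observed_cols) - set(expected_cols)),
        "removed": sorted(set(expected_cols) - set(observed_cols)),
    }

src = ["cust_id", "cust_name", "zip"]
tgt = ["customer_id", "customer_name", "postal_code"]
print(map_schema(src, tgt))            # zip finds no confident match
print(detect_drift(src, src + ["email"]))
```

Note that `zip` stays unmapped: an automated matcher proposes what it can and leaves the rest for a human to resolve, which foreshadows the human-plus-machine point below.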

The power of data knowledge is impressive, but collecting data knowledge and managing it as metadata isn’t easy when managing large volumes of data from diverse sources. When done manually it is costly, labor intensive, time consuming, and just too slow. When done algorithmically it can be error prone with high frequency of misclassification, false positives, and false negatives. Getting it right requires a combination of human intelligence and machine learning—people to train the matching and algorithms to automate, accelerate, and scale the processing.

Collecting and managing data knowledge as metadata is a good beginning. But real value is achieved only when the metadata is integrated and connected, and intelligence is applied on top of the metadata to deliver insights and to automate. Knowledge graphs are a natural fit for metadata. A metadata knowledge graph connects the dots, finding relationships and breaking down departmental and organizational metadata silos across the enterprise. Enterprise-wide knowledge sharing enhances data sharing, self-service, and data management automation.
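The "connecting the dots" idea can be illustrated with a toy metadata knowledge graph. The asset names, relationship types, and the business term "Customer Email" are all hypothetical; real metadata graphs carry far richer typing and provenance.

```python
from collections import defaultdict

class MetadataGraph:
    """A toy metadata knowledge graph: nodes are assets (datasets, columns,
    business terms); edges are typed relationships stored in both directions."""

    def __init__(self):
        self.edges = defaultdict(list)

    def add(self, subject, relation, obj):
        self.edges[subject].append((relation, obj))
        self.edges[obj].append((f"inverse:{relation}", subject))

    def related(self, node):
        """All assets one hop away from a node."""
        return sorted({obj for _, obj in self.edges[node]})

g = MetadataGraph()
# Two datasets in different departmental silos share one business term
g.add("crm.customers", "has_column", "crm.customers.email")
g.add("billing.accounts", "has_column", "billing.accounts.email_addr")
g.add("crm.customers.email", "maps_to_term", "Customer Email")
g.add("billing.accounts.email_addr", "maps_to_term", "Customer Email")

# Traversing from the shared term reveals related columns across silos
print(g.related("Customer Email"))
```

The payoff is in the traversal: starting from a business term, the graph surfaces semantically related columns in systems that never reference each other directly, which is what breaks down the metadata silos described above.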

Informatica Steps Up

On August 18th, Informatica announced the acquisition of GreenBay Technologies and their CloudMatcher technology with AI/ML capabilities for entity matching, schema matching, and metadata knowledge graphing. Integrating these capabilities into Informatica’s powerful CLAIRE engine—the heart of their intelligent data management technology—adds a new dimension to Informatica’s Intelligent Data Platform. The combination of AI-based entity matching, schema matching, and enhanced knowledge graphing is a big step forward for automation of data management, moving us ever closer to the vision of self-driving data.

Dave Wells

Dave Wells is an advisory consultant, educator, and industry analyst dedicated to building meaningful connections throughout the path from data to business value. He works at the intersection of information...
