
Data Engineering for GenAI: How to Optimize Data, Pipelines, and Governance

ABSTRACT: Data engineering is now considered a crucial job in IT as Generative AI, the hottest technology of this decade, relies on data engineers to provide accurate inputs.

Sponsored by Datavolo 

Move over, data scientists. Data engineering might now be the sexiest job in IT—or at least the most necessary. That’s because Generative AI, the hottest technology of this decade, depends on data engineers to deliver trustworthy inputs. This blog, the second in a series, explores how data engineers and their colleagues can meet this requirement and mitigate the inherent risks of generative AI (GenAI). They must optimize their unstructured data, optimize their pipelines, and govern their environments. 

GenAI is a type of neural network that trains itself to interpret and create digital content such as text, images, or audio. A GenAI language model (LM) generates content, often strings of words, after studying the meaning and inter-relationships of words in similar content. Companies are building LMs into GenAI applications that automate customer service, accelerate research-intensive processes such as drug development, and address myriad other specialized use cases. 

Getting Domain-Specific 

Companies apply LMs to their own domain-specific data in two primary ways. First, they use retrieval-augmented generation (RAG) to find and inject relevant data into user prompts, thereby increasing the odds of an accurate response. Second, they might fine-tune a pre-trained LM such as ChatGPT from OpenAI or Gemini from Google on their own datasets. Both approaches, especially RAG, benefit from a data pipeline that converts unstructured data into vectors in a vector database. The sketch below illustrates the retrieval step.
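To make the RAG flow concrete, here is a minimal sketch of the retrieval step. It assumes an in-memory store mapping text chunks to embedding vectors and a query that has already been embedded; a real deployment would call an embedding model and a dedicated vector database instead.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vector, vector_store, top_k=3):
    """Return the top_k chunks whose vectors best match the query vector."""
    scored = [(cosine_similarity(query_vector, vec), chunk)
              for chunk, vec in vector_store.items()]
    return [chunk for _, chunk in sorted(scored, reverse=True)[:top_k]]

def build_prompt(question, retrieved_chunks):
    """Inject the retrieved context into the user prompt before calling the LM."""
    context = "\n".join(retrieved_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```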

Mitigating Risks 

LMs make mistakes that range from humorous to damaging or even illegal because they are probabilistic rather than deterministic. They can “hallucinate” and make up facts to fill gaps in training data. They can also generate outputs that expose personally identifiable information (PII), breach intellectual property, or propagate bias toward certain populations.  

As companies embark on domain-specific GenAI initiatives, their data teams must mitigate these risks by optimizing data, optimizing pipelines, and governing their environments. Let’s explore each requirement, working from the bottom up in our diagram. Data engineers play a leading role in this collaborative effort, using either homegrown or commercial tools from vendors such as Datavolo to automate the process. 

Mitigating the Risk of Generative AI

Optimize Data 

At the foundational layer, data teams optimize data. They curate, transform, profile, and validate the inputs for their GenAI applications. 

Curate. The first step is to curate source content. Guided by business domain experts, data scientists and engineers organize the most relevant, trustworthy source objects for their GenAI initiative. They annotate technical documents, label pictures of products, filter customer service emails by date and sender, and so on. They also remove duplicative, erroneous, or conflicting content, as the sketch below illustrates. 
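As one illustration of curation, here is a minimal sketch that filters customer-service emails by date and sender and drops near-exact duplicates. The field names and the assumption that each email carries a `date` object and `sender` string are hypothetical.

```python
from datetime import date

def curate(emails: list[dict], since: date, allowed_senders: set[str]) -> list[dict]:
    """Filter emails by date and sender, then remove duplicative content."""
    seen, curated = set(), []
    for email in emails:
        if email["date"] < since or email["sender"] not in allowed_senders:
            continue
        fingerprint = email["body"].strip().lower()
        if fingerprint in seen:  # duplicate of an email already kept
            continue
        seen.add(fingerprint)
        curated.append(email)
    return curated
```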

Transform. Data engineers work with data scientists or NLP engineers to transform the curated content into usable GenAI inputs. For example, they parse documents and divide them into semantic “chunks,” then use an embedding model to derive vectors that capture the chunks’ meaning and inter-relationships (see the sketch below). They periodically enrich these inputs by inserting additional content to chunk and vectorize. 
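Here is a minimal sketch of that transformation, using simple fixed-size word chunks with overlap. The `embedding_model` and its `encode()` method are assumptions standing in for whatever embedding model or API a team actually uses; production chunking is usually more semantically aware.

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split a parsed document into overlapping word-based chunks."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        start += max_words - overlap
    return chunks

def vectorize(chunks: list[str], embedding_model) -> dict:
    """Derive one embedding vector per chunk (embedding_model is hypothetical)."""
    return {chunk: embedding_model.encode(chunk) for chunk in chunks}
```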

Profile. Data engineers and stewards profile GenAI inputs at each stage, including the source objects, their chunks, and their vectors. This profiling enables them to analyze data patterns and flag anomalies, then intervene to fix quality issues that arise. It also helps them track lineage, including where source objects originated, who touched them, and how they changed before they landed in the vector DB. These lineage views enable data teams to assess the root cause of hallucinations or other issues. 
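As a simple illustration of profiling, the sketch below computes word-count statistics for a set of chunks and flags outliers. The z-score threshold is illustrative; real profiling would cover many more attributes (encoding, language, emptiness, embedding drift) across source objects, chunks, and vectors.

```python
import statistics

def profile_chunks(chunks: list[str], z_threshold: float = 3.0) -> list[str]:
    """Flag chunks whose word count deviates sharply from the corpus norm."""
    counts = [len(c.split()) for c in chunks]
    mean, stdev = statistics.mean(counts), statistics.pstdev(counts)
    anomalies = []
    for chunk, count in zip(chunks, counts):
        if stdev and abs(count - mean) / stdev > z_threshold:
            anomalies.append(chunk)
    return anomalies
```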

Validate. Data engineers, scientists, and stewards use schema registries and custom logic to validate that the source objects meet quality standards. When a document or other source object falls short, the data engineer cleanses, re-labels, enriches, or replaces that object. At the other end of the pipeline, data scientists validate whether GenAI outputs meet governance standards. When outputs fall short, data engineers trace the lineage to find and fix root causes with the inputs: unmasked PII, bias, etc. 
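A minimal sketch of the custom-logic side of validation appears below. The required fields are hypothetical; a production setup would typically pull the expected schema from a schema registry rather than hard-coding it.

```python
REQUIRED_FIELDS = {"doc_id", "source", "ingested_at", "body"}  # illustrative schema

def validate_source_object(obj: dict) -> list[str]:
    """Return a list of quality issues; an empty list means the object passes."""
    issues = []
    missing = REQUIRED_FIELDS - obj.keys()
    if missing:
        issues.append(f"missing fields: {sorted(missing)}")
    if not obj.get("body", "").strip():
        issues.append("empty body")
    return issues

sample = {"doc_id": "42", "source": "wiki", "ingested_at": "2024-05-01", "body": ""}
print(validate_source_object(sample))  # -> ['empty body']
```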

Optimize Pipelines 

Data teams optimize GenAI data pipelines by designing, testing, tuning, observing, and orchestrating them. These processes are similar to those for traditional pipelines that deliver structured data to BI. 

Design. Data engineers design the pipelines that stitch together source repositories, target vector DB(s), and transformation logic based on data scientists’ requirements. They work with data scientists or ML engineers to configure the semantic search logic that retrieves relevant vectors as part of RAG. They also might huddle with CloudOps engineers to estimate and configure cloud compute clusters and other underlying infrastructure within their allocated budget. 

Test. Before going live, data engineers test pipelines to ensure they meet service level agreements (SLAs) for latency, throughput, and uptime. They run pipeline code on development workloads, then adjust and reconfigure based on initial results. When pipelines meet SLAs but exceed budget, or vice versa, they huddle with business experts and finance managers to reset expectations. 
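To show what an SLA check on a development workload might look like, here is a minimal sketch. The latency and throughput targets, and the `run_pipeline` callable, are placeholders for a team's actual pipeline and agreed SLAs.

```python
import time

SLA_MAX_LATENCY_SECONDS = 2.0   # illustrative target, not a real benchmark
SLA_MIN_THROUGHPUT_DOCS = 100   # documents per run, also illustrative

def test_pipeline_sla(run_pipeline, dev_workload) -> None:
    """Time a development run and compare it against latency and throughput targets."""
    start = time.perf_counter()
    processed = run_pipeline(dev_workload)
    elapsed = time.perf_counter() - start
    assert elapsed <= SLA_MAX_LATENCY_SECONDS, f"latency SLA missed: {elapsed:.2f}s"
    assert processed >= SLA_MIN_THROUGHPUT_DOCS, f"throughput SLA missed: {processed}"
```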

Observe. Data engineers observe pipeline performance from numerous angles. They monitor event logs and metrics for both logical data processing and physical resource consumption. They determine operational patterns, then spot anomalies or trends that exceed thresholds. Observability is critical to the success of GenAI initiatives because so many of them depend on real-time interactions with chatbots, supported by heavy data processing. 
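A minimal sketch of threshold-based alerting over pipeline metrics follows. The metric names and limits are illustrative; in practice these values would come from event logs or a metrics store rather than literals.

```python
def check_metrics(metrics: dict, thresholds: dict) -> list[str]:
    """Compare observed pipeline metrics against alerting thresholds."""
    alerts = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"{name}={value} exceeds threshold {limit}")
    return alerts

# Illustrative values for a single observation window.
print(check_metrics({"p95_latency_ms": 850, "error_rate": 0.02},
                    {"p95_latency_ms": 500, "error_rate": 0.01}))
```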

Tune. Data engineers tune pipelines by employing continuous integration and continuous delivery (CI/CD) practices. To fix an issue or optimize performance, they branch a pipeline out of production, then revise it by, for example, changing the transformation logic used for chunking and vectorizing. They test this new pipeline in development by validating the speed and accuracy of LM responses, and once satisfied they merge that pipeline back into production workflows. 

Orchestrate. Data engineers and ML engineers orchestrate workflows across GenAI pipelines, other analytical functions, and operational applications. They synchronize tasks and handoffs between all these elements to ensure that the users of GenAI applications have a seamless and reliable experience. This workflow depends on careful scheduling and execution of myriad tasks: updates to the vector database, vector retrieval to support RAG, LM generation of responses, application tasks that act on those responses, and so on. 
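The sketch below shows the dependency-ordering idea behind that orchestration, using only the standard library and hypothetical task names; real teams would typically rely on a workflow orchestrator rather than hand-rolled scheduling.

```python
# Each task lists the upstream tasks it depends on.
TASKS = {
    "update_vector_db": [],
    "retrieve_vectors": ["update_vector_db"],
    "generate_response": ["retrieve_vectors"],
    "apply_response": ["generate_response"],
}

def run_workflow(tasks: dict[str, list[str]]) -> list[str]:
    """Execute tasks only after their upstream dependencies have completed."""
    completed, order = set(), []
    while len(completed) < len(tasks):
        for name, deps in tasks.items():
            if name not in completed and all(d in completed for d in deps):
                order.append(name)   # placeholder for invoking the real task
                completed.add(name)
    return order

print(run_workflow(TASKS))  # -> update_vector_db, retrieve_vectors, ...
```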

Govern 

As an overarching layer, data teams govern their GenAI data environments by gathering metadata, controlling access, masking data, and auditing activities. 

Gather metadata. GenAI pipelines have all kinds of metadata that describes datasets, pipeline elements, users, applications, and workflows. Data engineers gather this metadata, including content labels, metrics, logs, and lineage, for monitoring and analysis. They also consolidate selected metadata in a central catalog to assist overall governance efforts. 
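As an illustration of consolidating selected metadata into a central catalog, here is a minimal sketch. The dataset name, lineage steps, and metric fields are all hypothetical examples of what such a record might carry.

```python
import json
from datetime import datetime, timezone

def catalog_entry(dataset: str, labels: list[str], lineage: list[str],
                  metrics: dict) -> str:
    """Consolidate selected metadata into one record for a central catalog."""
    return json.dumps({
        "dataset": dataset,
        "labels": labels,
        "lineage": lineage,   # e.g. source -> chunker -> vector DB
        "metrics": metrics,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    })

print(catalog_entry("support_emails_v2", ["customer-service"],
                    ["s3://raw/emails", "chunker", "vector_db"], {"rows": 12000}))
```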

Control access. Data engineers and stewards employ role-based access controls to ensure that only authenticated users perform only authorized actions on permissible datasets. By controlling access for all internal and external stakeholders, they can reduce the risk of exposing PII or running afoul of intellectual property laws. 
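A minimal sketch of a role-based check follows. The roles and permissions are illustrative; real environments enforce this through an identity provider or policy engine rather than an in-code mapping.

```python
# Illustrative role-to-permission mapping.
ROLE_PERMISSIONS = {
    "data_engineer": {"read_chunks", "write_chunks", "read_metrics"},
    "data_scientist": {"read_chunks", "read_metrics"},
    "auditor": {"read_metrics"},
}

def is_authorized(role: str, action: str) -> bool:
    """Allow an action only if the authenticated user's role grants it."""
    return action in ROLE_PERMISSIONS.get(role, set())

print(is_authorized("data_scientist", "write_chunks"))  # -> False
```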

Mask data. To complement access controls, data engineers also implement data masking. They hide PII from unauthorized users, for example to ensure that data scientists cannot view columns with social security numbers or bank routing numbers.  
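Here is a minimal sketch of one masking rule, replacing US social security number patterns before text reaches unauthorized users. Production masking is policy-driven and covers many more PII types than this single regular expression.

```python
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # US social security format

def mask_pii(text: str) -> str:
    """Replace SSN-like values so unauthorized users never see the raw PII."""
    return SSN_PATTERN.sub("***-**-****", text)

print(mask_pii("Customer 123-45-6789 called about a refund."))
```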

Audit activities. Data engineers, data stewards, and compliance officers audit all these activities to satisfy both internal and external reporting requirements. Auditing assists compliance with legislation such as the General Data Protection Regulation (GDPR) in Europe and the California Consumer Privacy Act (CCPA) in the US, which require companies to obtain consent from customers for certain uses of their PII. 

Summary 

In the end, the tasks facing data engineers might be more necessary than sexy. But if they can optimize their unstructured data, optimize their pipelines, and extend data governance programs to mitigate new risks, they can make GenAI sexier than ever. Our next blog will recommend evaluation criteria to help data engineering leaders select the right pipeline tools to achieve this goal. 

Kevin Petrie

Kevin is the VP of Research at BARC US, where he writes and speaks about the intersection of AI, analytics, and data management. For nearly three decades Kevin has deciphered...