
What is Data Transformation?

July 20, 2022

Data transformation is the process of converting data from one format, structure, or set of values to another by way of joining, filtering, appending, or otherwise performing some sort of computation on the data.
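As a minimal illustration, here is what a filter, a join, and an aggregation might look like with pandas; the tables and column names are made up for the example:

```python
import pandas as pd

# Hypothetical raw data: order events and a small customer lookup table.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 11, 10, 12],
    "amount": [25.0, -5.0, 40.0, 15.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 11, 12],
    "region": ["us-east", "eu-west", "us-east"],
})

# Filter: drop records that fail a basic sanity check (negative amounts).
valid_orders = orders[orders["amount"] > 0]

# Join: enrich each order with customer attributes.
enriched = valid_orders.merge(customers, on="customer_id", how="left")

# Aggregate: compute a value per region for downstream analytics.
revenue_by_region = enriched.groupby("region", as_index=False)["amount"].sum()
print(revenue_by_region)
```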

While data transformation is a relatively simple concept, in practice it can be quite complex to move data from point A to B to C. Whether ETL, ELT, or whatever term you prefer, data transformation is the act of doing something with your data to make it more valuable, usable, and reusable, so you can meet the needs of the analytics, ML, and other business teams that rely on that data.

Why do we need data transformation?

When data is ingested from an API, blob storage, a data warehouse, or another source, you have no control over how the data is formatted. You might be getting JSON data, or something in a CSV format. Most often, the data will not be in the format you need for whatever data warehouse, BI, or visualization tool you’re using.
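For example, a nested JSON payload from an API can be flattened into a table and written out in whatever format the downstream tool expects. This is a rough sketch using pandas; the payload shape and file names are assumptions for illustration:

```python
import pandas as pd

# Hypothetical API response: nested JSON records.
api_payload = [
    {"id": 1, "user": {"name": "Ann", "country": "US"}, "amount": 12.5},
    {"id": 2, "user": {"name": "Bo", "country": "DE"}, "amount": 7.0},
]

# Flatten the nested structure into columns (id, amount, user.name, user.country).
df = pd.json_normalize(api_payload)

# Write it in the format the downstream tool expects, e.g. CSV for a BI import
# or Parquet for a warehouse load (to_parquet requires pyarrow or fastparquet).
df.to_csv("events.csv", index=False)
df.to_parquet("events.parquet", index=False)
```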

Beyond standardizing the format, you need to filter out bad data and perform data quality checks. Then, in many instances, you need to aggregate it downstream to join it to data from other systems or previously ingested data. There are many steps required to get data to a place where you can work with it, apply it to your use cases, and derive its full benefit.

Why is data transformation complex?

When building even some of the simplest data pipelines, data transformation can quickly get complicated. Imagine, for instance, ingesting data to create a pipeline around user activity. Let’s say the data comes from log streams from deployments across three different clouds dispersed globally, which means multiple regions and time zones.

It’s being ingested in a different format than you generally like to work with, let’s say a log format with some JSON-structured objects thrown in. In this case it’s mostly semi-structured text data, as is often the case when data comes from a back-end system that logs user activity. Once you start to do analytical-style operations on the data, you need to move it from compressed JSON files into columnar structures so those operations are efficient. That involves taking the JSON data, decompressing it, and putting it into a columnar format, and that’s just step one.
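In practice, that first step might look something like the sketch below: read gzipped, line-delimited JSON log events, load them into a tabular structure, and write them back out as Parquet, a columnar format. The file names and field names are illustrative, and to_parquet assumes pyarrow or fastparquet is installed:

```python
import gzip
import json

import pandas as pd

records = []
# Read compressed, line-delimited JSON log events (file name is illustrative).
with gzip.open("activity-2022-07-20.json.gz", "rt") as f:
    for line in f:
        records.append(json.loads(line))

# Load into a tabular structure and normalize timestamps to UTC,
# which matters when logs come from deployments in multiple regions.
df = pd.DataFrame(records)
df["ts"] = pd.to_datetime(df["ts"], utc=True)

# Write a columnar file suitable for analytical-style queries.
df.to_parquet("activity-2022-07-20.parquet", index=False)
```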

Another common step is filtering out the data you’re not interested in. This filter isn’t simply based on individual users, but also on the larger groups of people using the data. Maybe you are specifically looking for people who performed create, update, and delete operations, but you are less interested in other types of events. Filtering out the data for those groups is another common type of data transformation that hones and refines the data set, making it more useful and more accurate for the downstream workload.

What are the different types of data transformation?

Throughout various stages in the data pipeline, there are several different kinds of data transformation, from basic reformatting to enriching and correlating the data, including:

  • Data extraction and parsing: In early phases, you’re reformatting data, extracting certain fields, parsing and looking for specific values.
  • Filtering and mapping: You then get into refinements like filtering out certain data or mapping certain fields and values. Maybe you want to bucket your users into one of three categories—low, medium, and high activity, for example.  
  • Data enrichment: This type of transformation involves bringing in data from another source and adding it to your data set. For instance, you may want to add user metadata from a different data set to build a more detailed view of specific users. In this phase, enriching the data can often turn into its own form of ingestion, highlighting just how sophisticated data transformation can get.
  • Cross-record correlation: This type of data transformation involves analytical-style operations, such as “count how many users did x, y, or z during a particular time window.” There is also correlation of events. You may want to determine whether activities belong to distinct user sessions by correlating one user’s activity with the previous or following events and looking at the duration of the gap between them. The transformation that happens in this case is ordering and clustering events (a sessionization sketch follows this list).
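As a concrete example of cross-record correlation, here is a rough sessionization sketch in pandas: events are ordered per user, and a new session starts whenever the gap to the previous event exceeds a threshold. The 30-minute threshold and the field names are assumptions for illustration:

```python
import pandas as pd

# Hypothetical activity events; timestamps are assumed to already be in UTC.
events = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "b"],
    "ts": pd.to_datetime([
        "2022-07-20 10:00:00", "2022-07-20 10:05:00", "2022-07-20 12:00:00",
        "2022-07-20 09:00:00", "2022-07-20 09:45:00",
    ]),
})

SESSION_GAP = pd.Timedelta(minutes=30)  # assumed inactivity threshold

# Order events per user, then compute the gap to each user's previous event.
events = events.sort_values(["user_id", "ts"])
gap = events.groupby("user_id")["ts"].diff()

# A new session starts when there is no previous event or the gap exceeds the threshold.
new_session = gap.isna() | (gap > SESSION_GAP)
events["session_id"] = new_session.astype(int).groupby(events["user_id"]).cumsum()

print(events)
```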

The data transformation process

The process of data transformation varies, but a few best practices can streamline it. The first step, at the highest level, is to understand the goals. Once you know what the data needs to look like to achieve those goals, only then can you take stock of the data you have to work with. Take the time to understand the source data and what the end data needs to be, so you can more easily and accurately work out what the transformation needs to look like.

As you start querying the data, it’s not uncommon to simply transform it as you go without a specific plan. It usually makes more sense to break the process down into bite-sized transformations, such as filtering and enrichment, organized into logical components and steps, as sketched below. This makes it easier to maintain the data pipeline as user needs and business logic inevitably change. Make sure the pipeline is simple and understandable enough for someone else to come in and make changes if necessary.
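One way to keep those steps small and separable, sketched here with pandas, is to write each transformation as its own named function and chain them together; the specific steps, thresholds, and column names are illustrative:

```python
import pandas as pd

def parse_timestamps(df: pd.DataFrame) -> pd.DataFrame:
    """Parsing step: normalize event times to UTC datetimes."""
    return df.assign(ts=pd.to_datetime(df["ts"], utc=True))

def keep_write_events(df: pd.DataFrame) -> pd.DataFrame:
    """Filtering step: keep only create/update/delete events."""
    return df[df["event_type"].isin(["create", "update", "delete"])]

def bucket_users(df: pd.DataFrame) -> pd.DataFrame:
    """Mapping step: bucket users into low/medium/high activity."""
    counts = df.groupby("user_id", as_index=False).size()
    counts["activity"] = pd.cut(
        counts["size"],
        bins=[0, 10, 100, float("inf")],
        labels=["low", "medium", "high"],
    )
    return counts

def run_pipeline(raw: pd.DataFrame) -> pd.DataFrame:
    # Each step is small, named, and testable on its own, so a change in
    # business logic touches one function instead of one giant query.
    return raw.pipe(parse_timestamps).pipe(keep_write_events).pipe(bucket_users)
```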

Also, it is important to understand the physical limitations of data pipelines, and how the infrastructure that supports your pipelines needs to scale. As you build your transformations, you need to consider how efficient your transformation logic is, so you don’t run into unexpected “Out of Memory” errors and such. This becomes important when you go from processing 100k records in your staging pipelines to millions (or even billions) of records in production pipelines. It always helps to keep scalability in mind.
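A simple way to keep memory bounded, for example, is to process the data in chunks rather than loading it all at once. This sketch assumes a CSV source and pandas; the file name, columns, and chunk size are arbitrary illustrations:

```python
import pandas as pd

totals = {}

# Stream the file in fixed-size chunks instead of loading millions of rows at once.
for chunk in pd.read_csv("activity.csv", chunksize=100_000):
    valid = chunk[chunk["amount"] > 0]          # filter bad rows per chunk
    per_region = valid.groupby("region")["amount"].sum()
    for region, amount in per_region.items():   # fold partial results together
        totals[region] = totals.get(region, 0.0) + amount

print(totals)
```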

Benefits and challenges of data transformation

There are challenges to transforming data:

  • Data transformation can become expensive, depending on what software is involved and what resources are required.
  • Data transformation processes can eat up resources, whether on-premises or cloud-based.
  • Lack of expertise can introduce problems during transformation; data analysts, engineers, and anyone else dealing with data transformation need subject-matter expertise so they can accurately and properly curate data.
  • Enterprises sometimes perform unnecessary transformations, and once those changes are made, data teams might have to undo them to make the data usable again.

Transforming data yields several benefits:

  • Once data is transformed, it is organized and easier—sometimes only now possible—for both humans and computers to use.
  • Properly formatted and validated data improves data quality and ensures that applications run properly without encountering pitfalls such as incompatible formats, duplicates, or incomplete values (see the validation sketch after this list).
  • Data transformation streamlines interoperability among applications, systems, and types of data.
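For instance, a couple of lightweight checks along these lines can catch duplicates and incomplete values before the data reaches downstream applications; the required columns and the 5% threshold are assumptions for illustration:

```python
import pandas as pd

def validate(df: pd.DataFrame) -> pd.DataFrame:
    """Basic quality checks before handing data to downstream consumers."""
    required = ["event_id", "user_id", "ts"]  # assumed required fields

    # Drop exact duplicate events and rows missing required fields.
    cleaned = df.drop_duplicates(subset=["event_id"]).dropna(subset=required)

    # Fail loudly if too much data was discarded, rather than silently shipping gaps.
    if len(cleaned) < 0.95 * len(df):
        raise ValueError("More than 5% of rows failed validation")
    return cleaned
```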

Best practices for data transformation

Conceptually, think of data transformation like a bidirectional search: finding the shortest path between two points in a graph, where one point is your raw data and the other is your business needs. Then figure out how to traverse from both sides toward the middle as efficiently as possible.

Historically, teams have operated from one perspective or the other. Often, business teams toss a list of demands over to the data team, or data engineering teams look at their data and figure out what can be done with it, disconnected from business goals. The real value lies in skillfully blending the two and understanding the context in which the data will be used. Why are people looking for this data set? What are they trying to extract from it? What is the next natural follow-on question they might ask?

Understand both the business needs and the data: Planning transformations has traditionally taken a waterfall-style approach involving a lot of meetings, whiteboards, and diagrams, which can lead to expensive, complex work. Instead, teams need to make iteration cheap, easy, and streamlined. Pipelines should be built in minutes: mapping out the fields, prototyping a query, sending it off to the processing cluster, running the transformations, validating the data, and incrementally moving forward to meet new business use cases as quickly as possible. Data teams need to understand contextually why the data matters as much as how to transform it and work with it.

Avoid prematurely optimizing your transformation logic: Teams often optimize their transformation logic at the expense of maintainability. Avoid winding up with 1,000-line SQL queries full of complex, nested sub-queries, for instance. That may optimize processing, but not maintenance and engineering effort. Break queries down into small components with well-understood inputs and outputs so they are easier to debug and alter.

Take care not to over-optimize: This is especially true if you are working with a small data set. Once you have larger data sets and a better understanding of them, you can start to do more sophisticated things like incremental data propagation or compound nested transforms. Only add those performance optimizations once they become necessary.

Data transformation with Kaspian

With Kaspian, you can make data transformation fast and efficient. You can design your pipelines with declarative definitions that require 95% less code and far less maintenance, and specify inputs, outputs, and data logic in multiple languages.
