January 12, 2024
Tutorials

Improving Data Pipelines

Data pipelines are the backbone of modern analytics, but they’re often slowed down by hidden inefficiencies. In this article, Niraj walks through four common pitfalls in PySpark and AWS Glue (breaking lazy execution, unnecessary re-computation, serial tasks, and partition skew) and shares practical fixes to make pipelines faster, cheaper, and more reliable.

Data pipelines are the backbone of modern analytics, but they can also be long, complex, and computationally expensive. For businesses, speed and efficiency aren’t just nice-to-have—they directly impact how quickly leaders can make informed decisions. That’s why data engineers must constantly look for ways to reduce pipeline run times and cut unnecessary computation. Some common pitfalls that slow pipelines down include breaking lazy execution, excessive re-computation of DataFrames, serial (non-parallel) tasks, and partition skews.

Breaking Lazy Execution:

One of the simplest ways to improve pipeline performance is to review and remove code that breaks lazy execution. In PySpark, transformations on DataFrames are evaluated lazily: they don’t run immediately when defined. Instead, Spark builds a plan of operations and only executes it when an action such as show(), count(), collect(), or take() is called.

This design is powerful because it lets Spark optimize the entire chain of transformations before running. But if you insert an action before your final write, you force Spark to compute the plan prematurely. Later, when you write the result to storage (e.g., S3), Spark must recompute everything, including the work it already did for the earlier action. In effect, your pipeline runs twice.

The fix is simple: remove unnecessary actions from the middle of your pipeline.

If you need visibility into what’s happening, use lightweight methods instead: sample with df.limit(5).toPandas(), or log the DataFrame’s schema and counts after caching. These approaches avoid expensive re-computation while still giving you insight. Eliminating the extra runs doesn’t just save compute cycles; it directly reduces cloud costs and shortens delivery time for reports.
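As a rough sketch (the S3 paths and column names here are made up for illustration), the before-and-after looks like this:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Illustrative source path and columns -- substitute your own.
    orders = spark.read.parquet("s3://my-bucket/orders/")
    enriched = (
        orders.filter(F.col("status") == "shipped")
              .withColumn("revenue", F.col("price") * F.col("quantity"))
    )

    # Anti-pattern: count() is an action, so Spark executes the full plan here...
    # print(enriched.count())
    # ...and then executes it all over again for the write below.

    # If you need a peek, sample a handful of rows instead of materializing everything.
    print(enriched.limit(5).toPandas())

    # The one full execution happens at the final write.
    enriched.write.mode("overwrite").parquet("s3://my-bucket/orders_enriched/")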

Re-computing DataFrames:

Another performance pitfall in PySpark is unnecessary re-computation of DataFrames. By default, Spark does not store intermediate results in memory. If you reference the same DataFrame from multiple actions, Spark rebuilds it from the original source and re-applies all of its transformations each time.

The solution is to use caching (i.e., "persisting"). Caching tells Spark to materialize the DataFrame the first time it’s computed, and then keep it in memory for reuse in subsequent actions:
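    # Minimal sketch -- the S3 paths and column names are illustrative, not from a real pipeline.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    sales = spark.read.parquet("s3://my-bucket/sales/")
    daily = sales.groupBy("sale_date").agg(F.sum("amount").alias("total"))

    daily.cache()   # or daily.persist() for an explicit storage level

    daily.count()                                      # first action: computes and fills the cache
    daily.write.mode("overwrite").parquet(             # second action: reuses the cached result
        "s3://my-bucket/daily_totals/")

    daily.unpersist()                                  # free the memory once the DataFrame is no longer needed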

Cache only when the DataFrame is reused multiple times and is small enough to fit in memory. Over-caching can actually hurt performance by consuming cluster resources unnecessarily.

It’s important to distinguish between re-computation and breaking lazy execution. Re-computation occurs even when lazy execution is not broken. By design, Spark does not store intermediate results, so if you reference the same DataFrame in multiple actions, such as calling df.count() and then immediately writing it out, Spark will recompute the DataFrame from scratch for each action. The two issues overlap conceptually (both lead to Spark doing extra work), but the causes and fixes differ.

Serial (Non-Parallel) Tasks:

When designing a pipeline, it’s important to distinguish which tasks are dependent on one another and which are not. Some transformations must happen sequentially. For example, calculating warehouse profit may require sales data to be processed first. But in other cases, tasks can safely run independently.

Take the example of generating reports for ten distribution centers. An inefficient approach would be to build one large job that iterates through each center one at a time, writing results sequentially. A more efficient approach is to recognize that these centers don’t impact one another, and therefore their workloads can run in parallel. By executing them simultaneously, the overall runtime could drop dramatically, in this case by 80–90%.

In AWS, this can be implemented using Step Functions with parallel states that orchestrate multiple Glue jobs at once. Each job can be parameterized (e.g., with a distribution center identifier) and run independently, with all results computed in parallel instead of serially. For the business, this means faster availability of insights and more timely decisions, without waiting on slow sequential runs.
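For a rough sense of the fan-out, here is a boto3 sketch; the job name, parameter, and center IDs are hypothetical, and in practice the Step Functions parallel state launches these runs rather than a script:

    import boto3

    glue = boto3.client("glue")

    # Hypothetical distribution center IDs -- ten in the example above.
    DISTRIBUTION_CENTERS = ["DC01", "DC02", "DC03"]

    # Start one parameterized run per center. Provided the Glue job's maximum
    # concurrency allows it, the runs execute at the same time -- the same fan-out
    # a Step Functions Map/Parallel state expresses declaratively.
    run_ids = []
    for center in DISTRIBUTION_CENTERS:
        response = glue.start_job_run(
            JobName="generate-center-report",             # hypothetical Glue job name
            Arguments={"--distribution_center": center},  # read in the job via getResolvedOptions
        )
        run_ids.append(response["JobRunId"])

    print(run_ids)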

Ignoring Partition Skew:

The final common cause of slow pipelines is partition skew. In Spark, data is split into partitions that are distributed across worker nodes. Ideally, each partition should be roughly the same size so they can be processed in parallel. But when the data is unevenly distributed across worker nodes, with some partitions very large and others very small, you get skew.

For example, imagine partitioning sales data by each active warehouse. If 90% of sales belong to a single warehouse (say, “A”) and the remaining 10% are spread across nine smaller warehouses, Spark will assign the “A” data to one partition while the other nine warehouses get tiny partitions. The result is that one node is overloaded with work while the rest of the cluster sits idle. This slows down the entire pipeline because Spark waits for the slowest partition to finish. Engineers who ignore partition skew can end up with massive runtime inefficiencies and even job failures due to out-of-memory errors.
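Before fixing skew, it helps to see it. Here is a quick sketch (the path and the "warehouse" column are illustrative) for checking how rows are distributed:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.functions import spark_partition_id

    spark = SparkSession.builder.getOrCreate()

    # Illustrative path and column name.
    sales = spark.read.parquet("s3://my-bucket/sales/")

    # Rows per warehouse: one dominant key (like "A" above) signals skew.
    sales.groupBy("warehouse").count().orderBy(F.desc("count")).show()

    # Rows per Spark partition: uneven counts mean some tasks carry far more work than others.
    sales.groupBy(spark_partition_id().alias("partition_id")).count().orderBy("partition_id").show()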

To fix this, an engineer has several options. The simplest is to choose a more balanced partitioning column (e.g., "customers" instead of "warehouse"). For highly uneven partitions, salting can break a dominant partition into smaller subgroups so computation is more evenly distributed. Spark also provides repartition(), which redistributes rows across a specified number of partitions to balance workloads, and bucketBy(), which organizes data into fixed buckets based on a column to optimize joins and aggregations. Both techniques help reduce skew and ensure that all nodes in the cluster are working efficiently.
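As a sketch of salting plus repartitioning (the salt count of 10, the partition count of 200, and the column names are arbitrary choices for illustration):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    sales = spark.read.parquet("s3://my-bucket/sales/")   # illustrative path

    # Salting: split the dominant "warehouse" key into sub-keys so its rows
    # spread across many partitions instead of piling onto one node.
    NUM_SALTS = 10
    salted = sales.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

    # Aggregate in two steps: first by (warehouse, salt), then roll up per warehouse.
    partial = salted.groupBy("warehouse", "salt").agg(F.sum("amount").alias("partial_total"))
    totals = partial.groupBy("warehouse").agg(F.sum("partial_total").alias("total"))

    # repartition() redistributes rows across a chosen number of partitions,
    # here keyed on a more evenly distributed column.
    balanced = sales.repartition(200, "customer_id")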

Conclusion:

Building data pipelines is more than just extracting, transforming, and loading data. It also requires optimizing for speed and cost-efficiency. When engineers understand the business context and how pipeline components interact, they can write code and provision infrastructure in ways that make the pipeline faster, more efficient, more cost-effective, and sustainable over the long term.

Ready to optimize your data pipelines? Even small improvements can translate into faster insights and lower costs at scale. If you’d like to explore how to make your pipelines more efficient, contact us at NexusLeap — we’d love to help.

