Learn how NexusLeap cut AWS Glue job runtimes from hours to minutes for a major food distributor by applying Spark-based optimization techniques. This post dives into practical tips on partitioning, avoiding data shuffling, leveraging lazy evaluation, and using DynamicFrames effectively—all essential for scalable, cost-efficient data engineering in the cloud.
When working with large-scale data in AWS Glue, performance optimization isn’t just nice to have—it’s essential. Poorly optimized jobs can lead to ballooning runtimes, excessive costs, and inefficient use of cloud resources. At small volumes, a few seconds of inefficiency are easy to overlook, but at scale, those seconds become hours, and those hours become serious cost and performance risks.
At NexusLeap, we’ve helped a large food distribution company, whose workflows include hundreds of thousands of daily files, dramatically reduce processing times using targeted optimization techniques. This blog shares what we’ve learned along the way.
Why Optimization Matters in Glue
AWS Glue is a fully managed ETL service built on Apache Spark. It provides powerful scalability, but only if you know how to unlock it.
Take the food distributor's case: the Glue job originally ran for 2.5 hours. After the first round of optimizations it ran in 35 minutes, and with a final round of tuning it now runs in just 10 minutes. Same data, same logic, just better engineering.
Core Spark Concepts in Glue
In-Memory Processing
Unlike Hadoop MapReduce, which writes intermediate results to disk, Spark performs computation in memory. That can make Spark dramatically faster (up to 100x on some workloads), but only if RAM and CPU are aligned with how the data is partitioned.
Example: You can have 80 GB of RAM and a tiny dataset, but if partitioning isn’t optimized, Spark still runs slowly.
Lazy vs. Eager Evaluation
Spark uses lazy evaluation. It builds a DAG (Directed Acyclic Graph) of transformations and only processes them when an action (e.g. .show() or .count()) is called.
Problem: Adding .count() or .show() between steps forces Spark to execute the pipeline built so far, recomputing (and re-shuffling) upstream work each time and adding minutes of runtime per call.
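As a minimal sketch (the DataFrame df and its columns are illustrative, not taken from the client's pipeline), here is what that anti-pattern looks like; each action forces Spark to execute, and re-execute, the lineage built so far:
from pyspark.sql import functions as F

# Anti-pattern: interleaving actions with transformations
df_filtered = df.filter(F.col("order_total") > 0)       # lazy: nothing runs yet
print(df_filtered.count())                               # action #1: executes the plan
df_flagged = df_filtered.withColumn("is_large", F.col("order_total") > 100)  # lazy
df_flagged.show(5)                                       # action #2: re-executes the lineage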
Spark DAG Execution
By allowing Spark to build a complete DAG before executing, you reduce overhead. Group transformations and trigger them in a single action.
Best Practice: Avoid triggering actions during transformation steps. Group lazily and evaluate once.
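A sketch of the grouped version, using the same illustrative DataFrame and columns as above and a placeholder S3 output path; every step stays lazy until the single write at the end:
from pyspark.sql import functions as F

# Build the entire DAG lazily...
result = (
    df.filter(F.col("order_total") > 0)
      .withColumn("is_large", F.col("order_total") > 100)
      .groupBy("is_large")
      .count()
)
# ...and trigger it with one action: Spark executes the whole plan exactly once
result.write.mode("overwrite").parquet("s3://your-bucket/output/")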
Optimization Techniques
Partitioning: Align With Compute
Partitioning determines how Spark breaks up the dataset for parallel processing.
Example: 5 DPUs × 4 CPUs = 20 parallel tasks. With the default of 32 partitions, the first 20 partitions run immediately while 12 wait, leaving CPUs idle during the second wave of execution.
Code Example:
from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Initiate the Glue context and Spark session
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# Read JSON files in parallel using Glue's DynamicFrame API (placeholder S3 path)
dynamic_frame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3", connection_options={"paths": ["s3://your-bucket/input/"]}, format="json")

# Repartition to match compute capacity (5 DPUs x 4 CPUs = 20 parallel tasks)
df = dynamic_frame.toDF().repartition(20)
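To confirm the split actually matches your compute, you can inspect the partition count and the parallelism Spark reports (a quick check, not part of the original job):
# Should print 20 after the repartition above
print(df.rdd.getNumPartitions())

# Number of cores Spark sees across the job's executors
print(spark.sparkContext.defaultParallelism)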
Analogy: You bought 5 cars, each with 4 seats (CPUs), but split your team into 32 small groups. The first 20 groups ride while the rest wait in line, and on the second trip several seats sit empty.
Minimizing Data Shuffling
Some operations—like groupBy, distinct, orderBy, and joins—cause reshuffling of data across partitions.
Analogy: Imagine calling everyone from 32 rooms into one to do group work, then splitting them up again. That’s what reshuffling does.
Best Practices:
Avoid unnecessary shuffling actions
Perform them after all necessary filters or mappings
Bundle multiple operations before a single eager action
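Here is a sketch of those practices together (column names and the S3 path are illustrative): narrow transformations like filter run first so the one shuffle moves as little data as possible, and a single action triggers the plan:
from pyspark.sql import functions as F

# Filter before the shuffle so less data moves between partitions
recent = df.filter(F.col("order_date") >= "2024-01-01")

# One wide transformation (groupBy) instead of chaining distinct, orderBy, and groupBy
daily_totals = recent.groupBy("warehouse_id").agg(F.sum("order_total").alias("total"))

# A single action triggers the whole plan, including the one shuffle
daily_totals.write.mode("overwrite").parquet("s3://your-bucket/daily_totals/")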
Lazy Evaluation in Practice
Instead of logging data with .show(), use .printSchema(), which validates structure without triggering a full computation. Plan the execution beforehand: know where the shuffle steps occur and log after them.
Think of taking a car trip...
Ideally, you plan your travel so that when you stop for gas, you also take a bathroom break, grab a snack, and check the tires. That way you don't make a separate stop for every task, wasting time and causing unnecessary delays. In the same way, plan the shuffle steps and actions in your code so that multiple operations are stacked into a single pass over the data.
Code Example:
# Convert the DynamicFrame to a DataFrame for advanced Spark transformations
df = dynamic_frame.toDF()
df.printSchema()  # validates structure without triggering a full computation

# Write intermediate results to S3 (placeholder path) instead of using .show() or .count()
df.write.mode("overwrite").parquet("s3://your-bucket/intermediate/")
Conclusion
Optimization isn't just a final tuning step; it's a core discipline when engineering at scale. AWS Glue and Spark offer immense power, but only when paired with thoughtful, deliberate engineering. Choosing the right data structure, aligning partitions with compute, minimizing reshuffling, and grouping transformations through lazy evaluation can mean the difference between a job that runs in hours versus minutes, and costs thousands versus hundreds. Before calling any Glue job 'complete,' make optimization a formal part of your review process. A final sweep with a performance-first mindset can unlock hidden efficiencies, saving valuable time, compute, and cost. Engineering for scale isn't optional; it's the standard for modern data systems.
Remember these key takeaways when optimizing your workflows:
Use DynamicFrames for scale
Be intentional about partitions and reshuffling
Group transformations and avoid triggering actions or unnecessary shuffles between them
Log smartly: remove .count() and .show() calls from production jobs
Ready to optimize your AWS Glue workflows and scale smarter? Contact us to learn how NexusLeap can help.