January 12, 2024
Tutorials

AWS Glue Optimization Best Practices

Learn how NexusLeap cut AWS Glue job runtimes from hours to minutes for a major food distributor by applying Spark-based optimization techniques. This post dives into practical tips on partitioning, avoiding data shuffling, leveraging lazy evaluation, and using DynamicFrames effectively—all essential for scalable, cost-efficient data engineering in the cloud.

When working with large-scale data in AWS Glue, performance optimization isn’t just nice to have—it’s essential. Poorly optimized jobs can lead to ballooning runtimes, excessive costs, and inefficient use of cloud resources. At small volumes, a few seconds of inefficiency are easy to overlook, but at scale, those seconds become hours, and those hours become serious cost and performance risks.

At NexusLeap, we’ve helped a large food distribution company, whose workflows include hundreds of thousands of daily files, dramatically reduce processing times using targeted optimization techniques. This blog shares what we’ve learned along the way.

Why Optimization Matters in Glue

AWS Glue is a fully managed ETL service built on Apache Spark. It provides powerful scalability, but only if you know how to unlock it.

Take the large food distributor's example: The Glue job used to run for 2.5 hours, and after applying optimizations, it ran in 35 minutes. With a final round of tuning, it now runs in just 10 minutes. Same data, same logic, just better engineering.

Core Spark Concepts in Glue

In-Memory Processing

Unlike Hadoop MapReduce, which writes intermediate results to disk, Spark performs computation in memory. This can make Spark up to 100x faster for some workloads, but only if memory and CPU are aligned with how the data is partitioned.

Example: You can have 80 GB of RAM and a tiny dataset, but if partitioning isn’t optimized, Spark still runs slowly.
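As a quick sanity check, you can compare how the data is split against how much parallelism the cluster actually offers. This is a minimal sketch, assuming spark and an already-loaded DataFrame df from your Glue job:

# Number of partitions Spark is currently using for this dataset
print(df.rdd.getNumPartitions())

# Number of tasks the cluster can actually run in parallel
print(spark.sparkContext.defaultParallelism)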

Lazy vs. Eager Evaluation

Spark uses lazy evaluation. It builds a DAG (Directed Acyclic Graph) of transformations and only processes them when an action (e.g. .show() or .count()) is called.

Problem: Adding .count() or .show() between steps forces Spark to execute the DAG built up to that point, so earlier transformations (and their shuffles) can be recomputed several times, adding minutes of runtime per call.

Spark DAG Execution

By allowing Spark to build a complete DAG before executing, you reduce overhead. Group transformations and trigger them in a single action.

Best Practice: Avoid triggering actions during transformation steps. Group lazily and evaluate once.
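As a rough illustration of this practice (the column names and S3 path here are hypothetical, not taken from the distributor's job), the following builds the full plan lazily and triggers it with a single write:

from pyspark.sql import functions as F

# Every step below is lazy: Spark only extends the DAG, nothing runs yet
cleaned = (
    df.filter(F.col("status") == "active")
      .withColumn("amount", F.col("amount").cast("double"))
      .groupBy("region")
      .agg(F.sum("amount").alias("total_amount"))
)

# A single action at the end triggers the whole DAG exactly once
cleaned.write.mode("overwrite").parquet("s3://my-bucket/output/totals/")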

Optimization Techniques

Partitioning: Align With Compute

Partitioning determines how Spark breaks up the dataset for parallel processing.

Example: 5 DPUs × 4 CPUs = 20 parallel tasks
Default of 32 partitions → the first 20 run while 12 wait → CPUs sit idle during the second wave of execution

Code Example:

# Initiate a Spark session from the Glue context
spark = glueContext.spark_session

# Read JSON files in parallel using Glue's DynamicFrame API
dynamic_frame = glueContext.create_dynamic_frame.from_options(
    format_options={"multiline": True},
    connection_type="s3",
    format="json",
    connection_options={"paths": [raw_source_location], "recurse": True},
    transformation_ctx="read_jsons"
)

# After loading the data, convert to a DataFrame and repartition it
# to match your compute capacity (5 DPUs × 4 CPUs = 20 parallel tasks)
df = dynamic_frame.toDF().repartition(20)

Analogy: You bought 5 cars, each with 4 seats (CPUs), but split your team into 32 small groups. Some seats sit empty while the extra groups wait in line for the next trip.

Minimizing Data Shuffling

Some operations—like groupBy, distinct, orderBy, and joins—cause reshuffling of data across partitions.

Analogy: Imagine calling everyone from 32 rooms into one to do group work, then splitting them up again. That’s what reshuffling does.

Best Practices:

  • Avoid unnecessary shuffling actions
  • Perform them after all necessary filters or mappings
  • Bundle multiple operations before a single eager action (see the sketch below)
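The sketch below shows these practices together; the table and column names are hypothetical and only meant to illustrate the pattern:

from pyspark.sql import functions as F

# Filter and trim both sides *before* the shuffle-heavy join and groupBy
orders_slim = (
    orders_df.filter(F.col("order_date") >= "2024-01-01")
             .select("order_id", "customer_id", "order_total")
)
customers_slim = customers_df.select("customer_id", "region")

# The join and aggregation (both shuffles) now operate on reduced data
report = (
    orders_slim.join(customers_slim, on="customer_id", how="inner")
               .groupBy("region")
               .agg(F.sum("order_total").alias("total_sales"))
)

# One eager action at the end triggers the whole plan once
report.write.mode("overwrite").parquet("s3://my-bucket/reports/sales_by_region/")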

Lazy Evaluation in Practice

Instead of logging data with .show(), prefer .printSchema(), and place any logging immediately after a shuffling action. Plan the execution beforehand: work out where the shuffle steps occur and log right after them.

Think of taking a car trip...

Ideally, you plan your travel activities so that when you stop for gas, you also take a bathroom break, grab a snack, check the tires, etc. This way, you don’t have to stop separately for each action, wasting opportunities and causing unnecessary delays. In the same manner, one should plan shuffling actions in code to efficiently stack actions.

Code Example:

# Write intermediate results to S3 instead of using .show() or .count()
df.write.parquet("s3://my-bucket/logs/stage1_output/")

❌ A job with 5 .count() calls can cost extra minutes, if not hours.

✅ If you must inspect data, use .show() or .printSchema() immediately after a shuffling action.
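To make the contrast concrete, here is a minimal sketch (the column name and paths are hypothetical): the costly pattern sprinkles eager calls between steps, while the leaner pattern builds the plan, checks the schema cheaply, and persists with one action.

from pyspark.sql import functions as F

# Costly pattern (commented out): each eager call re-executes the plan so far
# df.count()
# df.show()

# Leaner pattern: build the aggregation lazily, then log and persist once
aggregated = df.groupBy("store_id").agg(F.count("*").alias("order_count"))
aggregated.printSchema()   # inspects the schema only; does not trigger a job
aggregated.write.parquet("s3://my-bucket/logs/stage1_output/")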

Methods that require data shuffling fall into two broader categories: lazy transformations and eager actions.

Lazy Evaluations (Transformations):

  • Operations are not executed immediately.
  • Spark builds a logical plan and optimizes execution.
  • Efficient use of resources.

Eager Evaluations (Actions):

  • Operations that trigger execution.
  • Should be used carefully to avoid driver memory overload.
  • Essential for retrieving results or saving data (see the sketch below).
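The distinction looks like this in practice; df and the column names are assumed purely for illustration:

from pyspark.sql import functions as F

# Transformations: lazy, each line only extends the logical plan
filtered = df.filter(F.col("qty") > 0)
projected = filtered.select("sku", "qty")
grouped = projected.groupBy("sku").agg(F.sum("qty").alias("total_qty"))

# Actions: eager, each one triggers execution of the plan built so far
grouped.write.parquet("s3://my-bucket/qty_by_sku/")  # runs the plan once and saves results
# grouped.collect()  # also eager, but pulls every row to the driver (memory risk)
# grouped.count()    # eager; fine for debugging, costly if left in production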

DynamicFrame vs. DataFrame

  • Use DynamicFrame when working with JSON data (like our food distributor's 130K+ files/day).
  • Use DataFrame for structured data like CSVs.
  • DynamicFrame supports schema evolution when dealing with semi-structured files such as JSON.

Code Example:

# JSON ingestion with DynamicFrame
dynamic_frame = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://bucket/jsons/"]},
    format="json"
)

# Convert to DataFrame for advanced Spark transformations
df = dynamic_frame.toDF()

Conclusion

Optimization isn't just a final tuning step; it's a core discipline when engineering at scale. AWS Glue and Spark offer immense power, but only when paired with thoughtful, deliberate engineering. Choosing the right data structure, aligning partitions with compute, minimizing reshuffling, and grouping transformations with lazy evaluation can mean the difference between a job that runs in hours versus minutes and costs thousands versus hundreds. Before calling any Glue job "complete," make optimization a formal part of your review process. A final sweep with a performance-first mindset can unlock hidden efficiencies, saving valuable time, compute, and cost. Engineering for scale isn't optional; it's the standard for modern data systems.

Remember these key takeaways when optimizing your workflows:

  • Use DynamicFrames for scale
  • Be intentional about partitions and reshuffling
  • Group multiple transformations before a single eager action
  • Log smartly: remove .count() and .show() calls in production

Ready to optimize your AWS Glue workflows and scale smarter? Contact us to learn how NexusLeap can help.
