AWS Glue Pricing Breakdown: The Comprehensive Guide for 2024
Introduction
When it comes to Amazon Web Services (AWS) managed data integration tools, AWS Glue often stands out for its fully managed, serverless approach to cleaning, enriching, and organizing your data. But as with many AWS services, the pricing model can feel a bit daunting at first. The good news is that AWS Glue’s pricing, while multifaceted, is actually quite logical once you break it down.
In this blog post, we’ll simplify the AWS Glue pricing structure into seven main components. By the end, you’ll have a clearer picture of how each piece fits together and how you might optimize your costs. Stay tuned as we walk through each of these key billing elements and show you how to make the most of your AWS Glue investment.
AWS Glue Pricing Breakdown
AWS Glue offers a pay-as-you-go pricing model that charges you only for the resources you actually use. You’ll find that the costs vary depending on how you leverage the service’s different components. In general, the pricing covers seven key areas:
- ETL Jobs & Interactive Sessions – Extract, transform, and load data operations, plus interactive code development
- Development Endpoints – Persistent environments to develop and test ETL code
- AWS Glue Data Catalog – Storage and access of metadata
- Crawlers – Automatically discover and catalog data
- AWS Glue DataBrew – Interactive data preparation sessions and job runs
- AWS Glue Data Quality – Measure and monitor the quality of your datasets
- AWS Glue Zero-ETL – Managed ingestion and replication from source systems
(The AWS Glue Schema Registry, which lets you store and manage schemas, is offered at no additional charge, so it won't appear on your bill.)
Let’s break down each of these seven areas, so you’ll know exactly how costs are calculated and how to plan your spending wisely.
Note: All pricing examples in this blog are for the US East (N. Virginia) Region.
AWS Glue ETL Pricing & AWS Glue Interactive Sessions Pricing
AWS Glue costs are calculated based on Data Processing Units (DPUs), which bundle compute and memory resources to run your ETL workloads.
Data Processing Units (DPUs) and Billing Increments
A single DPU provides 4 vCPUs and 16 GB of memory. AWS bills usage per second, rounded up to the nearest second, and applies a minimum billing duration to most job types: Spark jobs on AWS Glue 2.0 or later have a 1-minute minimum, while earlier versions have a 10-minute minimum. This per-second granularity means you pay only for the compute you actually use.
ETL Job Types and Rates
There are four types of AWS Glue jobs:
- Spark & Spark Streaming Jobs (Glue version 2.0 and later)
- Pricing: $0.44 per DPU-Hour
- Billing: Per second, with a 1-minute minimum
- Defaults: Spark uses 10 DPUs, Spark Streaming uses 2 DPUs
- Spark Jobs with Flexible Execution (Glue version 3.0 and later)
- Pricing: $0.29 per DPU-Hour
- Billing: Per second, with a 1-minute minimum
- Ray Jobs (Preview)
- Pricing: $0.44 per M-DPU-Hour (memory-optimized DPU)
- Billing: Per second, with a 1-minute minimum
- Python Shell Jobs
- Pricing: $0.44 per DPU-Hour
- Billing: Per second, with a 1-minute minimum
- Default: 0.0625 DPU (configurable up to 1 DPU)
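To make these rates concrete, here's a minimal Python sketch of the billing math above: per-second metering, rounded up, with a minimum billed duration per run. The rates and defaults come from the list above; treat it as an illustration, not an official calculator.

```python
import math

# Minimal sketch of AWS Glue's billing math: per-second metering,
# rounded up, with a minimum billed duration per run.
def estimate_glue_job_cost(dpus, runtime_seconds,
                           rate_per_dpu_hour=0.44,
                           minimum_seconds=60):
    billed_seconds = max(math.ceil(runtime_seconds), minimum_seconds)
    return dpus * (billed_seconds / 3600) * rate_per_dpu_hour

# A 15-minute Spark job on 6 DPUs (the same numbers as Example 1 below):
print(f"${estimate_glue_job_cost(6, 15 * 60):.2f}")        # $0.66

# A 30-second Python shell job on 0.0625 DPU still bills the 1-minute minimum:
print(f"${estimate_glue_job_cost(0.0625, 30):.5f}")        # ~$0.00046
```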
Interactive ETL Development
AWS Glue offers two options for interactive ETL code development: Interactive Sessions and Development Endpoints. Both are optional and billed based on the number of DPUs provisioned; Interactive Sessions charge only while active, whereas Development Endpoints keep billing until you shut them down.
1. Interactive Sessions
Interactive Sessions allow you to interactively develop and test ETL code, with precise billing for active usage.
- Pricing: $0.44 per DPU-Hour
- Billing: Charged per second, with a 1-minute minimum.
- Defaults: Starts with 5 DPUs (minimum 2 DPUs required).
- Idle Timeout: Configurable to stop billing when the session is inactive.
Key Benefit: You only pay for the actual active time of your interactive sessions.
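The idle timeout above is configured when the session is created. Here's a minimal boto3 sketch using the Glue `create_session` API; the session ID and IAM role are placeholders:

```python
import boto3

glue = boto3.client("glue")

# All identifiers below are placeholders.
response = glue.create_session(
    Id="my-dev-session",
    Role="arn:aws:iam::123456789012:role/MyGlueSessionRole",  # placeholder role
    Command={"Name": "glueetl", "PythonVersion": "3"},
    MaxCapacity=2.0,   # 2 DPUs (the minimum) instead of the 5-DPU default
    IdleTimeout=15,    # stop billing after 15 idle minutes
)
print(response["Session"]["Status"])
```

Setting a short `IdleTimeout` is the simplest cost lever here, since a forgotten session otherwise keeps accruing DPU-hours until it times out.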
2. Development Endpoints
Development Endpoints provide a persistent environment for interactive ETL code development.
- Pricing: $0.44 per DPU-Hour
- Billing: Charged per second, with a 10-minute minimum.
- Defaults: Starts with 5 DPUs (minimum 2 DPUs required).
- No Automatic Timeout: Endpoints remain active until you manually shut them down.
Data Previews
With AWS Glue Studio data previews, you can test your transformations during the job-authoring process. Each data preview session uses 2 DPUs, runs for 30 minutes, and then stops automatically.
- Charges follow the standard rate of $0.44 per DPU-Hour.
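At that rate, a preview session that runs its full 30 minutes costs 2 DPUs × 0.5 hours × $0.44 = $0.44.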
Additional Charges
If you pull data from other AWS services, like Amazon S3, Amazon RDS, or Amazon Redshift, you’ll incur standard data transfer and request rates. Likewise, use of Amazon CloudWatch logs and events will be charged at standard CloudWatch rates.
Pricing Examples
Example 1: ETL Job
Imagine an AWS Glue Apache Spark job that runs for 15 minutes with 6 DPUs. The rate for a Spark job is $0.44 per DPU-Hour.
- Time: 15 minutes = 0.25 hours
- DPUs: 6
- Cost: 6 DPUs * 0.25 hours * $0.44 = $0.66
Example 2: AWS Glue Studio Job Notebooks & Interactive Sessions
Suppose you use a notebook with an Interactive Session that runs at 5 DPUs for 24 minutes. That’s 24 minutes = 0.4 hours.
- DPUs: 5
- Time: 0.4 hours
- Rate: $0.44 per DPU-Hour
- Cost: 5 DPUs * 0.4 hours * $0.44 = $0.88
AWS Glue Data Catalog Pricing
The AWS Glue Data Catalog is a centralized repository for managing metadata across your data assets. It seamlessly integrates with services like Amazon S3, Amazon Redshift, and third-party sources, offering a unified way to organize and query data through catalogs, databases, and tables.
The Data Catalog can also be accessed from Amazon SageMaker Lakehouse to enable data, analytics, and AI workflows. Additionally, with AWS Lake Formation, you can extend its capabilities to include fine-grained governance, permissions management, and database-like controls for your data assets.
When using the Data Catalog, you are charged for storing and accessing table metadata, and for the data processing jobs that compute table statistics and run table optimizations.
Metadata Storage Pricing
- Free Allowance: The first 1 million metadata objects are free.
- Beyond 1 Million Objects: You pay $1.00 per 100,000 objects over 1 million each month.
Metadata Access Pricing
- Free Allowance: The first 1 million metadata access requests per month are free.
- Beyond 1 Million Requests: You are charged $1.00 per 1 million requests over the free limit.
Common metadata access requests include:
- CreateTable
- CreatePartition
- GetTable
- GetPartitions
- GetColumnStatisticsForTable
For a full list of requests supported by the Data Catalog, refer to the documentation.
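As a rough illustration of the tiered math above, here's a minimal Python sketch. It pro-rates partial tiers; AWS's actual rounding of billing increments may differ.

```python
# Minimal sketch of the Data Catalog's tiered pricing described above.
# Partial tiers are pro-rated here; AWS's increment rounding may differ.
def catalog_monthly_cost(metadata_objects, monthly_requests):
    storage = max(0, metadata_objects - 1_000_000) / 100_000 * 1.00
    access = max(0, monthly_requests - 1_000_000) / 1_000_000 * 1.00
    return storage + access

print(catalog_monthly_cost(1_000_000, 1_000_000))  # 0.0 -- fully in the free tier
print(catalog_monthly_cost(1_000_000, 2_000_000))  # 1.0 -- as in Example 2 below
```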
Table Maintenance and Statistics in AWS Glue Data Catalog
The AWS Glue Data Catalog provides features to optimize table performance and improve query efficiency for your data stored in Amazon S3 and queried by services like Amazon Redshift, Athena, Amazon EMR, and AWS Glue ETL jobs.
1. Table Maintenance with Apache Iceberg Compaction
For Apache Iceberg tables stored in Amazon S3, AWS Glue Data Catalog supports managed compaction. This process combines small objects into larger ones, leading to:
- Improved Read Performance: Queries run faster with fewer I/O operations.
- Cost Efficiency: Better performance for analytics services like Redshift, Athena, and EMR.
How It’s Billed:
- Pricing: $0.44 per Data Processing Unit (DPU)-Hour.
- Billing Increment: Per second, rounded up to the nearest second.
- Minimum Duration: 1 minute per run.
A single DPU includes 4 vCPUs and 16 GB of memory, ensuring robust performance during compaction.
2. Column-Level Table Statistics
AWS Glue Data Catalog generates column-level table statistics for AWS Glue tables. These statistics integrate with cost-based optimizers (CBO) in Amazon Athena and Amazon Redshift, resulting in:
- Improved Query Performance: Optimized query plans reduce execution times.
- Potential Cost Savings: Lower compute and resource usage for queries.
Pricing:
- $0.44 per Data Processing Unit (DPU)-Hour.
- Billing: Per second, rounded up to the nearest second.
- Minimum Duration: 1 minute per run.
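Statistics runs can also be started programmatically via the Glue `StartColumnStatisticsTaskRun` API. A minimal boto3 sketch, with placeholder database, table, and role names:

```python
import boto3

glue = boto3.client("glue")

# Database, table, and role below are placeholders.
run = glue.start_column_statistics_task_run(
    DatabaseName="sales_db",
    TableName="orders",
    Role="arn:aws:iam::123456789012:role/MyGlueStatsRole",
    SampleSize=25.0,  # sample 25% of rows to keep runtime (and DPU cost) down
)
print(run["ColumnStatisticsTaskRunId"])
```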
Additional Costs That Can Affect AWS Glue Data Catalog Bills
1. Storage Costs
The AWS Glue Data Catalog allows you to create and manage tables in Amazon S3 and Amazon Redshift. While the Data Catalog itself does not charge for storage, you will incur costs from these services:
- Amazon S3 Tables: Standard Amazon S3 rates apply for storage, requests (e.g., PUT and GET operations), and data transfer.
- Amazon Redshift Tables: Standard Amazon Redshift rates apply for the storage used by your tables.
2. Compute Costs
When you query tables stored in Amazon Redshift via AWS Glue, Amazon Athena, Amazon EMR, or third-party Apache Iceberg–compatible engines, AWS uses a service-managed Redshift Serverless workgroup to process queries and filter results.
- Amazon Redshift Serverless:
- You are billed for the compute resources used in the workgroup.
- Standard Amazon Redshift Serverless rates apply.
- There are no separate charges when querying Redshift tables directly from Amazon Redshift.
For more details, refer to the Amazon Redshift pricing page.
Pricing Examples for AWS Glue Data Catalog
Understanding AWS Glue Data Catalog costs becomes easier with real-world scenarios. Let’s look at some pricing examples that illustrate how storage, requests, and integrations with other services contribute to your bill.
1. Free Tier Example: $0 Cost
Imagine you store 1 million metadata objects and make 1 million metadata requests in a month.
- Metadata Storage Cost: $0 (covered under the free tier for the first million objects).
- Metadata Request Cost: $0 (first million requests are free).
Result: Your total cost is $0.
2. Standard Tier Example: Metadata Requests & Crawlers
Now, consider you:
- Store 1 million metadata objects.
- Make 2 million metadata requests.
- Run a crawler that consumes 2 DPUs for 30 minutes.
Breakdown:
- Metadata Storage Cost: $0 (first million metadata objects are free).
- Metadata Request Cost: The first million requests are free. You are billed for the additional 1 million requests at $1 per million = $1.
- Crawler Cost:
- Crawler runs for 30 minutes = 0.5 hours.
- 2 DPUs * 0.5 hours * $0.44 per DPU-Hour = $0.44.
Result: Your total cost is $1.44.
3. Querying Apache Iceberg Tables in Amazon S3 with Redshift Serverless
Here, you query Apache Iceberg tables stored in Amazon S3 using Amazon Redshift Serverless. Costs include:
- Amazon S3 Storage: Standard S3 rates for storing Apache Iceberg tables.
- Data Catalog Costs:
- Metadata storage for tables, databases, and catalogs.
- Metadata requests charged per request.
- Amazon Redshift Serverless: Compute costs for queries, billed per second (standard Redshift pricing).
AWS Glue Crawlers Pricing
AWS Glue Crawlers are essential tools for discovering and cataloging your data, automating the process of updating the AWS Glue Data Catalog. Crawlers scan data sources (like Amazon S3 or Redshift), infer schemas, and populate the catalog with metadata. While highly convenient, the costs for running crawlers are based on compute resources consumed.
How Crawler Pricing Works
- Hourly Rate: Crawlers are billed at $0.44 per DPU-Hour.
- Data Processing Units (DPUs):
- A single DPU includes 4 vCPUs and 16 GB of memory.
- The number of DPUs determines the compute power used by the crawler.
- Billing Granularity:
- Charged per second, rounded up to the nearest second.
- Minimum Duration: 10 minutes per crawler run.
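A minimal boto3 sketch of creating and starting a crawler follows; the names, role, and S3 path are placeholders. Note how the 10-minute minimum shapes the cost of short runs.

```python
import boto3

glue = boto3.client("glue")

# Names, role, and S3 path below are placeholders.
glue.create_crawler(
    Name="sales-data-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueCrawlerRole",
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://my-example-bucket/sales/"}]},
)
glue.start_crawler(Name="sales-data-crawler")

# With the 10-minute minimum, even a 2-minute crawl bills as
# 2 DPUs * (10/60) hours * $0.44 = ~$0.15, so fewer, larger runs
# are usually cheaper than many small ones.
```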
AWS Glue DataBrew Interactive Sessions Pricing
AWS Glue DataBrew provides a visual, no-code data preparation tool that enables you to clean, transform, and enrich your data interactively. The pricing model for DataBrew’s interactive sessions is straightforward and based on 30-minute increments.
How Pricing Works
- Session Cost: Each interactive session is billed at $1.00 per 30-minute period.
- Billing Increments: Sessions are calculated in 30-minute blocks.
- First 40 Sessions Free: If you are a first-time DataBrew user, the first 40 sessions are free.
- Session Trigger: Any interaction with the DataBrew project UI—like clicking, filtering, or editing—keeps the session active.
- API/CLI/SDK Usage: The same billing rates apply when interacting with DataBrew projects through the DataBrew API, CLI, or SDK.
- Auto-Close: If there’s no activity, the session will automatically close at the end of the current 30-minute period.
Note: Multiple users working on different projects will be billed separately for their respective sessions.
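Opening a session from code is billed at the same rate as the console. A minimal boto3 sketch using the DataBrew `StartProjectSession` API, with a placeholder project name:

```python
import boto3

databrew = boto3.client("databrew")

# The project name is a placeholder.
session = databrew.start_project_session(
    Name="sales-cleanup-project",
    AssumeControl=True,  # take over the project session if one is already open
)
print(session["ClientSessionId"])
```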
Pricing Examples
Example 1: Single Session
You start a session at 9:00 AM, interact for a few minutes, and leave the console. You return at 9:20 AM and work until 9:30 AM.
- The two interactions fall within a single 30-minute session.
- Cost: $1.00 for 1 session.
Example 2: Multiple Sessions
You start a session at 9:00 AM and work until 9:50 AM. You exit the project and return at 10:15 AM for another round of work.
- First Interaction: 9:00 AM - 9:50 AM → 2 sessions (30 min + 30 min).
- Second Interaction: Starts at 10:15 AM → 1 session.
- Total Cost: 3 sessions * $1.00 = $3.00.
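The session counting in these examples can be sketched in a few lines of Python. This is a simplified model of the billing behavior described above, not an official implementation: a 30-minute block opens at the first interaction after the previous block ends.

```python
# Simplified model of DataBrew session counting: a 30-minute billing
# block opens at the first interaction after the previous block ends.
# Timestamps are minutes since midnight (9:00 AM = 540).
def billed_sessions(interaction_times):
    sessions, block_end = 0, None
    for t in sorted(interaction_times):
        if block_end is None or t >= block_end:
            sessions += 1
            block_end = t + 30  # this block covers the next 30 minutes
    return sessions

# Example 1: activity at 9:00-9:05 and 9:20-9:30 -> 1 session ($1.00)
print(billed_sessions([540, 545, 560, 565, 569]))        # 1
# Example 2: continuous work 9:00-9:50, then a return at 10:15 -> 3 sessions
print(billed_sessions(list(range(540, 591)) + [615]))    # 3
```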
AWS Glue DataBrew Jobs Pricing
AWS Glue DataBrew jobs are used to clean, normalize, and transform large datasets without the need for writing code. Pricing is based on the number of DataBrew nodes consumed during the job’s execution.
How Pricing Works
- Hourly Rate: $0.48 per DataBrew node hour.
- DataBrew Node: Each node provides 4 vCPUs and 16 GB of memory.
- Default Allocation: By default, each DataBrew job uses 5 nodes.
- Billing Granularity: Jobs are billed per second, rounded up to the nearest second.
- Minimum Billing Duration: 1 minute per job run.
No Extra Costs: You are not charged for startup or shutdown time—only the active runtime of the job.
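When creating a job through the API, the node count can be capped below the 5-node default via `MaxCapacity`. A minimal boto3 sketch; all names, the role, and the output bucket are placeholders:

```python
import boto3

databrew = boto3.client("databrew")

# All names, the role, and the output bucket are placeholders.
databrew.create_recipe_job(
    Name="normalize-sales-job",
    RoleArn="arn:aws:iam::123456789012:role/MyDataBrewJobRole",
    DatasetName="sales-dataset",
    RecipeReference={"Name": "sales-cleanup-recipe"},
    Outputs=[{"Location": {"Bucket": "my-example-bucket", "Key": "clean/"}}],
    MaxCapacity=2,  # cap at 2 nodes instead of the 5-node default
)
```

For small datasets, capping nodes this way trades runtime for a lower hourly burn, since billing is nodes × hours × $0.48.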
Additional Charges
If your DataBrew jobs interact with other AWS services (e.g., reading/writing data to Amazon S3), you may incur additional charges:
- Amazon S3: Standard rates apply for read/write requests and data storage.
- Other AWS Services: Refer to their respective pricing documentation.
Pricing Example
Let’s say a DataBrew job runs for 10 minutes and uses the default allocation of 5 DataBrew nodes.
- Node-Hour Rate: $0.48 per node hour.
- Time in Hours: 10 minutes = 1/6th of an hour.
- Cost Calculation:
- 5 nodes * (1/6 hour) * $0.48 = $0.40.
Result: The total cost for this job is $0.40.
AWS Glue Data Quality Pricing
AWS Glue Data Quality enhances confidence in your data by automatically measuring, monitoring, and managing its quality across data lakes and ETL pipelines. With features like recommendations, data quality tasks, and anomaly detection, AWS Glue helps you identify missing, stale, or incorrect data to ensure reliable outcomes.
At a base rate of $0.44 per DPU-Hour, AWS Glue Data Quality applies to tasks in the Data Catalog, ETL jobs, and anomaly detection processes.
Key Pricing Components of AWS Glue Data Quality
1. Data Quality in the Data Catalog
You can manage data quality for datasets cataloged in the AWS Glue Data Catalog by:
- Generating Recommendations: This creates a Recommendation Task, which requires provisioning DPUs.
- Running Data Quality Tasks: Once rules are applied, you can schedule these checks as Data Quality Tasks.
Billing Details:
- Minimum of 2 DPUs is required.
- Billed per second, with a 1-minute minimum duration.
Pricing Example:
Suppose a recommendation task uses 5 DPUs and runs for 10 minutes. The cost is calculated as: 5 DPUs × (10 minutes ÷ 60 minutes) × $0.44 = $0.37.
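Recommendation tasks can also be started from code. A minimal boto3 sketch matching the example above; the database, table, and role are placeholders, and `NumberOfWorkers` corresponds to the DPUs the run consumes:

```python
import boto3

glue = boto3.client("glue")

# Database, table, and role are placeholders; NumberOfWorkers corresponds
# to the DPUs the run consumes (5 here, matching the example above).
run = glue.start_data_quality_rule_recommendation_run(
    DataSource={"GlueTable": {"DatabaseName": "sales_db",
                              "TableName": "orders"}},
    Role="arn:aws:iam::123456789012:role/MyGlueDQRole",
    NumberOfWorkers=5,
)
print(run["RunId"])
```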
2. Data Quality for AWS Glue ETL Jobs
You can add data quality checks to ETL jobs to validate data during processing:
- Impact on Cost: These checks increase ETL job runtime and DPU consumption.
- Flexible Execution: For workloads without strict SLAs, you can use Flexible Execution to balance performance and cost.
Pricing Example: Evaluating Data Quality in an AWS Glue ETL Job
You can add data quality checks to AWS Glue ETL jobs using the Data Quality Transform in Glue Studio or AWS Glue APIs.
Scenario:
A job runs for 20 minutes (1/3 hour) using 6 DPUs.
- Standard Pricing: 6 DPUs × (1/3 hour) × $0.44 = $0.88
- Flex Pricing: 6 DPUs × (1/3 hour) × $0.29 = $0.58
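Inside a Glue job script, these checks are expressed as a DQDL ruleset evaluated by the EvaluateDataQuality transform. A minimal sketch follows; it runs only within a Glue job environment (the awsglue and awsgluedq modules are provided by the Glue runtime), and the table name and rules are illustrative:

```python
# Runs only inside an AWS Glue job (the awsglue/awsgluedq modules are
# provided by the Glue runtime). Table names and rules are illustrative.
from awsglue.context import GlueContext
from awsgluedq.transforms import EvaluateDataQuality
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders")  # placeholder table

# Each DQDL rule adds runtime, and therefore DPU-seconds, to the job.
ruleset = """Rules = [
    IsComplete "order_id",
    ColumnValues "price" > 0
]"""

results = EvaluateDataQuality.apply(
    frame=orders,
    ruleset=ruleset,
    publishing_options={"dataQualityEvaluationContext": "orders_checks"},
)
```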
3. Anomaly Detection in AWS Glue ETL
AWS Glue Anomaly Detection helps monitor unexpected patterns in datasets by generating and evaluating statistics based on rules and analyzers.
- DPU Consumption: You incur 1 DPU per statistic in addition to the DPUs provisioned for the ETL job.
- Average Runtime: Detecting anomalies takes 10–20 seconds per statistic.
- Minimum Billing: 1-second minimum per statistic.
Pricing Example:
Consider an AWS Glue ETL job that:
- Reads data from Amazon S3, transforms it, runs data quality checks, and loads the results into Amazon Redshift.
- Includes 10 rules and 10 analyzers, resulting in 20 statistics.
Cost Breakdown:
- ETL Job (without Anomaly Detection):
- 6 DPUs × (20 minutes ÷ 60 minutes) × $0.44 = $0.88
- Anomaly Detection:
- Each statistic consumes 1 DPU for anomaly detection.
- Time per statistic = 15 seconds (average).
- Cost: 20 statistics × 1 DPU × (15 seconds ÷ 3600 seconds per hour) × $0.44 ≈ $0.037
Total Cost:
- $0.88 (ETL job) + $0.037 (Anomaly Detection) = $0.917
4. Retraining Anomaly Detection Models
If your Glue job detects an anomaly, you may choose to exclude the anomalous statistic from the model to improve the accuracy of future predictions. This process involves retraining the anomaly detection model.
- DPU Consumption: 1 DPU per statistic for the retraining duration.
- Average Runtime: Approximately 15 seconds per statistic.
Pricing Example:
Suppose you exclude 1 data point to retrain the model:
- Cost = 1 statistic × 1 DPU × (15 seconds ÷ 3600 seconds per hour) × $0.44 ≈ $0.00183.
Retraining ensures the anomaly detection algorithm remains accurate with minimal additional cost.
5. Statistics Storage
- Cost: Free.
- AWS Glue Data Quality statistics are stored for up to 2 years.
- Limit: 100,000 statistics per account.
Additional Charges
AWS Glue interacts with Amazon S3 for reading and writing data during quality checks:
- Amazon S3 Charges: Standard S3 rates apply for:
- Storage
- Requests
- Data transfer
- Temporary Files: Shuffle files, quality results, and intermediate data are stored in S3 and are billed at standard rates.
AWS Glue Zero-ETL Pricing
AWS Glue Zero-ETL eliminates the need to manually build extract, transform, and load (ETL) pipelines for data ingestion and replication. It provides fully managed integrations that automate the process of moving data from source systems to target destinations for analytics and AI initiatives.
While AWS does not charge extra for Zero-ETL integration itself, you pay for the resources used to process and store the ingested data.
Key Pricing Breakdown
1. Source Data Ingestion Costs
AWS Glue charges for ingesting data from application sources:
- Cost: $1.50 per GB of ingested data.
- Billing: Per MB, with a minimum ingestion size of 1 MB per request.
For Amazon DynamoDB, there is an additional charge to export data from continuous backups (point-in-time recovery). Refer to Amazon DynamoDB Pricing for details.
2. Target Data Processing Costs
The costs depend on where the data is written:
- Amazon S3
- Cost: $0.44 per AWS Glue DPU-Hour.
- Billing: Per second, with a 1-minute minimum.
- Amazon Redshift Managed Storage
- Billed based on Amazon Redshift Serverless compute.
- Refer to Amazon Redshift Pricing for details.
- Amazon Redshift Data Warehouse
- Billed based on Redshift resources used for compute.
- Amazon SageMaker Lakehouse
- Costs depend on the storage type chosen:
- For Amazon S3 Storage: $0.44 per DPU-Hour (AWS Glue compute).
- For Redshift Managed Storage: Billed at Redshift Serverless compute rates.
Amazon DynamoDB Zero-ETL Integration
With DynamoDB Zero-ETL, you can seamlessly export data from DynamoDB tables to SageMaker Lakehouse for analytics and AI use cases.
- Source Cost: DynamoDB charges for exporting data from continuous backups.
- Target Cost:
- Amazon S3: $0.44 per DPU-Hour for AWS Glue compute.
- Amazon Redshift: Redshift Serverless compute rates apply.
Example Cost Breakdown
- Source Cost: AWS Glue ingests 10 GB of data from an application source:
- Cost = 10 GB × $1.50/GB = $15.00.
- Target Cost: Data is processed and written to Amazon S3 using 5 DPUs for 10 minutes:
- Time in hours = 10 minutes = 1/6 hour.
- Cost = 5 DPUs × (1/6 hour) × $0.44 = $0.37.
Total Cost: $15.00 (Source) + $0.37 (Target) = $15.37.
Conclusion
AWS Glue’s pricing model may initially seem complex, but once you break it down, it’s quite structured and logical. By understanding how costs accumulate across its various components—ETL jobs (including Spark, Spark Streaming, Ray, and Python Shell), interactive development options (Interactive Sessions and Development Endpoints), and the Data Catalog with its metadata storage and request allowances—you can more precisely predict and manage your bills. Factors like DPU usage, data transfer, crawler operations, DataBrew sessions and jobs, Data Quality tasks, and Zero-ETL integrations each play a role in how costs add up. Additionally, your data’s storage location (Amazon S3, Amazon Redshift, SageMaker Lakehouse) and related services (like CloudWatch logs or DynamoDB exports) can influence the final price tag.
By carefully monitoring your DPU usage, optimizing job runtimes, taking advantage of the free metadata object and request tiers, and leveraging features like flexible execution or idle timeouts for interactive sessions, you can tailor AWS Glue's capabilities to meet your performance needs without overspending. To dive deeper into what AWS Glue is and how it works, check out our other blog post, "AWS Glue for Beginners: Key Components and How It All Works".