AWS Big Data Blog

Build a Data Lake Foundation with AWS Glue and Amazon S3

January 2025: This post was reviewed and updated for accuracy.

The key to becoming a data-driven business is first establishing a modern data foundation, which begins with a data lake.

A data lake has become the industry standard for storing structured and unstructured data at any scale. You can store your data as-is, without having to structure it first, and then run different types of analytics, from dashboards and visualizations to big data processing, real-time analytics, and machine learning, to guide better decisions.

A data lake built on AWS uses Amazon S3 as its primary storage platform. Amazon S3 provides an optimal foundation for a data lake because of its virtually unlimited scalability and high durability. You can seamlessly and non-disruptively increase storage from gigabytes to petabytes of content, paying only for what you use. Amazon S3 is designed to provide 99.999999999% durability. It has scalable performance, ease-of-use features, native encryption, and access control capabilities. Amazon S3 integrates with a broad portfolio of AWS and third-party ISV tools for data ingestion, data processing, and data security.

The earliest challenges that inhibited building a data lake were keeping track of all of the raw assets as they were loaded into the data lake, and then tracking all of the new data assets and versions that were created by data transformation, data processing, and analytics. Thus, an essential component of a data lake built on Amazon S3 is the Data Catalog. The Data Catalog provides an interface to query all assets stored in data lake S3 buckets. The Data Catalog is designed to provide a single source of truth about the contents of the data lake.

AWS Glue is Amazon’s serverless data integration service that makes it simple and cost effective to extract, clean, enrich, load, and organize data. Originally launched in August 2017, AWS Glue began as an extract-transform-load (ETL) service designed to relieve developers and data engineers of the undifferentiated heavy lifting needed to load databases, data warehouses, and build data lakes on Amazon S3. Since then, it has evolved to serve a larger audience including ETL specialists and data scientists, and includes a broader suite of data integration capabilities. Today, hundreds of thousands of customers use AWS Glue every month.

This post walks you through the process of using AWS Glue to crawl your data on Amazon S3 and build a metadata store to be the backbone of your modern data architecture.

AWS Glue features

The AWS Glue Data Catalog is a centralized metadata repository for all your data assets across various data sources. It provides a unified interface to store and query information about data formats, schemas, and sources. When an AWS Glue ETL job runs, it uses this catalog to understand information about the data and ensure that it is transformed correctly.

One method of populating the AWS Glue Data Catalog is by using a crawler.  An AWS Glue crawler automatically discovers and extracts metadata from a data store, then updates the AWS Glue Data Catalog accordingly. The crawler connects to the data store to infer the schema of the data. It then creates or updates tables within the Data Catalog with the schema information that it discovered. A crawler can crawl both file-based and table-based data stores. To learn more about supported data stores, see Which data stores can I crawl?
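
As a quick illustration, a crawler can also be defined and run programmatically with the AWS SDK. The following boto3 sketch uses placeholder names for the crawler, IAM role, database, and S3 path; it is separate from the console walkthrough later in this post.

```python
import boto3

glue = boto3.client("glue")

# Define a crawler that scans an S3 prefix and writes table definitions
# into an existing Data Catalog database (all names are placeholders).
glue.create_crawler(
    Name="example-crawler",
    Role="AWSGlueServiceRoleDefault",
    DatabaseName="example_db",
    Targets={"S3Targets": [{"Path": "s3://example-bucket/raw/"}]},
)

# Run the crawler on demand; it can also be run on a schedule.
glue.start_crawler(Name="example-crawler")
```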

You can provide a custom classifier to classify your data in AWS Glue. You can create a custom classifier using a grok pattern, an XML tag, JavaScript Object Notation (JSON), or comma-separated values (CSV). An AWS Glue crawler calls a custom classifier. If the classifier recognizes the data, it returns the classification and schema of the data to the crawler. You might need to define a custom classifier if your data doesn’t match any built-in classifiers, or if you want to customize the tables that are created by the crawler. For more information about creating a classifier using the AWS Glue console, see Creating classifiers using the AWS Glue console.
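
For example, a custom grok classifier for a hypothetical application log format could be created as follows and then referenced by name when you define the crawler. This is a minimal boto3 sketch; the classifier name, classification, and pattern are illustrative only.

```python
import boto3

glue = boto3.client("glue")

# Create a custom grok classifier for a hypothetical log format.
# A crawler tries custom classifiers before the built-in ones.
glue.create_classifier(
    GrokClassifier={
        "Name": "example-app-log",
        "Classification": "app_log",
        "GrokPattern": "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:message}",
    }
)
```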

Because AWS Glue can integrate with more than 80 data sources on AWS, on premises, and on other clouds, it works seamlessly to orchestrate movement and management of your data across purpose-built databases and data warehouses.

The AWS Glue Data Catalog is compatible with Apache Hive Metastore and supports popular tools such as Hive, Presto, Apache Spark, and Apache Pig. The Data Catalog also provides built-in integration with Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum. Once you add your table definitions to the Data Catalog, they are available for ETL and also readily available for querying in Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum so that you can have a common view of your data between these services.
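
For example, on an Amazon EMR cluster where the Glue Data Catalog client is available, a Spark application can use the catalog as its Hive metastore and query catalog tables with plain Spark SQL. The sketch below is illustrative only: the database and table names are placeholders, and on many EMR clusters this setting is applied at cluster creation instead of in application code.

```python
from pyspark.sql import SparkSession

# Point Spark's Hive support at the AWS Glue Data Catalog
# (placeholder database and table names).
spark = (
    SparkSession.builder
    .appName("glue-catalog-example")
    .config(
        "spark.hadoop.hive.metastore.client.factory.class",
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
    )
    .enableHiveSupport()
    .getOrCreate()
)

# Data Catalog databases and tables are now visible to Spark SQL.
spark.sql("SELECT * FROM example_db.example_table LIMIT 10").show()
```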

In addition, the AWS Glue Data Catalog features the following extensions for ease-of-use and data-management functionality:

  • Populate transactional tables in open-source table formats
  • Discover data with search
  • Identify and parse files with classification
  • Manage changing schemas with versioning
  • Create column-level statistics
  • Integrate with AWS Lake Formation

Once your data catalog is populated, you can use AWS Glue Studio to define ETL jobs. AWS Glue allows you to create a job through a visual interface, an interactive code notebook, or with a script editor.

To get started, consider using the visual interface to define the data flow and AWS Glue managed transforms to prepare the data. With the visual interface, you can focus on building your pipeline in a drag-and-drop UI, while AWS Glue builds a Spark script in the background that you can always copy and customize if needed. Follow the guidelines in Building visual ETL jobs with AWS Glue Studio to further understand how to use managed transformations, create your own transformations, integrate with open-source table formats, and curate reusable recipes.
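
If you prefer the script editor, a job body generally follows the same pattern that the visual editor generates. The following is a minimal sketch under assumed names (the database, table, column mappings, and output path are placeholders): it reads a catalog table, applies a mapping, and writes Parquet to Amazon S3.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard job scaffolding, as generated by AWS Glue Studio.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read a table that a crawler registered in the Data Catalog.
source = glueContext.create_dynamic_frame.from_catalog(
    database="example_db", table_name="example_table"
)

# Rename or retype columns (placeholder mappings).
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("id", "string", "id", "long"),
        ("value", "string", "value", "string"),
    ],
)

# Write the result to S3 in Parquet format (placeholder path).
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/"},
    format="parquet",
)

job.commit()
```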

For more information, see the AWS Glue product details.

Amazon S3 data lake

AWS Glue is an essential component of an Amazon S3 data lake, providing the data catalog and transformation services for modern data analytics.

The preceding figure represents AWS Glue's role in a modern data platform: connecting to disparate data sources, cataloging data, creating transformations, and loading data to its intended location, whether that is a data lake, data warehouse, or purpose-built database.

Let’s take a look at a simple workflow to get comfortable with the AWS Glue Data Catalog and AWS Glue transformations.

In this workflow, raw data lands in Amazon S3 and an AWS Glue job transforms it into an analytics-friendly format such as Parquet. The data can also be enriched by blending it with other datasets to provide additional insights. An AWS Glue crawler creates a table for each stage of the data based on a job trigger or a predefined schedule. The tables can be used by Amazon Athena, Amazon Redshift Spectrum, and Amazon EMR to query the data at any stage. This configuration is a popular design pattern that delivers agile business intelligence, deriving business value from a variety of data quickly and easily.

Jobs such as this one can be triggered manually, scheduled within AWS Glue or Amazon EventBridge Scheduler, started by an event detected through Amazon S3 Event Notifications or Amazon EventBridge, or wrapped in a larger orchestration coordinated by AWS Step Functions. The important thing with any new system is to start small and then keep iterating on design and integration.
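
For instance, a scheduled trigger for an existing job takes only a few lines of boto3. This is a sketch; the trigger name, job name, and cron expression are placeholders you would adjust.

```python
import boto3

glue = boto3.client("glue")

# Run a Glue job every day at 06:00 UTC via a scheduled trigger
# (trigger and job names are placeholders).
glue.create_trigger(
    Name="nightly-etl-trigger",
    Type="SCHEDULED",
    Schedule="cron(0 6 * * ? *)",
    Actions=[{"JobName": "example-etl-job"}],
    StartOnCreation=True,
)
```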

Proceed through the walkthrough below to build a foundational data pipeline and data catalog.

Walkthrough

In this walkthrough, you define a database, configure a crawler to explore data in an Amazon S3 bucket, create a table, transform the CSV file into Parquet, create a table for the Parquet data, and query the data with Amazon Athena.

Discover the data

Sign in to the AWS Management Console and open the AWS Glue console. You can find AWS Glue in the Analytics section. Before building this solution, please check the AWS Region Table for the regions where Glue is available.

The first step to discovering the data is to add a database. A database is a collection of tables.

  1. In the console, choose Add database. In Database name, type nycitytaxi, and choose Create.
  2. Choose Tables in the navigation pane. A table consists of the names of columns, data type definitions, and other metadata about a dataset.
  3. Add a table to the database nycitytaxi. You can add a table manually or by using a crawler. A crawler is a program that connects to a data store and progresses through a prioritized list of classifiers to determine the schema for your data. AWS Glue provides classifiers for common file types like CSV, JSON, Avro, and others. You can also write your own classifier using a grok pattern.
  4. To add a crawler, enter the data source: an Amazon S3 bucket named s3://aws-bigdata-blog/artifacts/glue-data-lake/data/. This S3 bucket contains the data file consisting of all the rides for the green taxis for the month of January 2017.
  5. Choose Next.
  6. For IAM role, choose the default role AWSGlueServiceRoleDefault in the drop-down list.
  7. For Database, choose nycitytaxi. It is important to understand how AWS Glue deals with schema changes so that you can select the appropriate method. In this example, the table is updated with any change. For more information about schema changes, see Cataloging Tables with a Crawler in the AWS Glue Developer Guide.
  8. For Frequency, choose Run on demand. The crawler can be run on demand or set to run on a schedule.
  9. Review the steps, and choose Finish. The crawler is ready to run. Choose Run it now.

    When the crawler has finished, one table has been added.
  10. Choose Tables in the left navigation pane, and then choose data. This screen describes the table, including schema, properties, and other valuable information.
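
If you want to inspect the discovered schema outside the console, the table definition can also be read back with the AWS SDK. The following boto3 sketch assumes the crawler created a table named data in the nycitytaxi database, as in this walkthrough.

```python
import boto3

glue = boto3.client("glue")

# Retrieve the table the crawler created and print its inferred schema.
table = glue.get_table(DatabaseName="nycitytaxi", Name="data")
for column in table["Table"]["StorageDescriptor"]["Columns"]:
    print(column["Name"], column["Type"])
```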

Transform the data from CSV to Parquet format

Now you can configure and run a job to transform the data from CSV to Parquet. Parquet is a columnar format that is well suited for AWS analytics services like Amazon Athena and Amazon Redshift Spectrum. We will use the Visual ETL editor for the AWS Glue job.

  1. Under ETL in the left navigation pane, choose Jobs, and then choose Visual ETL job.
  2. For the Name, type nytaxi-csv-parquet.
  3. For the IAM role, choose AWSGlueServiceRoleDefault.
  4. Choose AWS Glue Data Catalog as the data source.
  5. Choose the Change Schema transform. You can change some of the mappings to the long data type.
  6. Choose the Drop Null Fields transform.
  7. Choose Data preview to see how the transformations look.

  8. Choose Parquet as the format.
  9. Choose Amazon S3 as the target node, choose an S3 target location (an existing bucket or a new one that you create), and select Parquet as the format.
    • Make sure that the IAM role AWSGlueServiceRoleDefault has permission to put objects in the S3 target location.
  10. Choose Save, and then choose Run job.
  11. Choose Runs to check the run details.
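
Run details can also be retrieved programmatically; the following boto3 sketch lists recent runs of the job created above.

```python
import boto3

glue = boto3.client("glue")

# List recent runs of the job and print their status and duration.
runs = glue.get_job_runs(JobName="nytaxi-csv-parquet")
for run in runs["JobRuns"]:
    print(run["Id"], run["JobRunState"], run.get("ExecutionTime"))
```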

Add the Parquet table and crawler

When the job has finished, add a new table for the Parquet data using a crawler.

  1. For Crawler name, type nytaxiparquet.
  2. Choose S3 as the data store, and select the target bucket that you used in the AWS Glue job.
  3. For the IAM role, choose AWSGlueServiceRoleDefault.
  4. For Database, choose nycitytaxi.
  5. For Frequency, choose Run on demand.

After the crawler has finished, there are two tables in the nycitytaxi database: a table for the raw CSV data and a table for the transformed Parquet data.
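
You can confirm this from the Tables page in the console, or with a quick boto3 call such as the sketch below.

```python
import boto3

glue = boto3.client("glue")

# List the tables now registered in the nycitytaxi database.
response = glue.get_tables(DatabaseName="nycitytaxi")
for table in response["TableList"]:
    print(table["Name"])
```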

Analyze the data with Amazon Athena

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is capable of querying CSV data. However, the Parquet file format significantly reduces the time and cost of querying the data. For more information, see the blog post Analyzing Data in Amazon S3 using Amazon Athena.

To use AWS Glue with Amazon Athena, you must upgrade your Athena data catalog to the AWS Glue Data Catalog. For more information about upgrading your Athena data catalog, see this step-by-step guide.

  1. Open the Amazon Athena console. The Query Editor displays both tables in the nycitytaxi database.

You can query the data using standard SQL.

  1. Choose the nytaxigreenparquet table.
  2. Type Select * From "nycitytaxi"."data" limit 10;
  3. Choose Run Query.
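
The same query can also be submitted through the Athena API. The boto3 sketch below assumes a placeholder S3 location for query results, which you would replace with your own bucket.

```python
import boto3

athena = boto3.client("athena")

# Submit the query; Athena writes the results to the S3 location below.
response = athena.start_query_execution(
    QueryString='SELECT * FROM "nycitytaxi"."data" LIMIT 10;',
    QueryExecutionContext={"Database": "nycitytaxi"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-query-results/"},
)
print(response["QueryExecutionId"])
```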

Conclusion

This post demonstrates how easy it is to build the foundation of a data lake using AWS Glue and Amazon S3. By using AWS Glue to crawl your data on Amazon S3 and build an Apache Hive-compatible metadata store, you can use the metadata across the AWS analytic services and popular Hadoop ecosystem tools. This combination of AWS services is powerful and easy to use, allowing you to get to business insights faster.

If you have questions or suggestions, please comment below.

About the authors

Gordon Heinrich is a Solutions Architect working with global systems integrators. He works with our partners and customers to provide them architectural guidance for building data lakes and using AWS analytic services. In his spare time, he enjoys spending time with his family, skiing, hiking, and mountain biking in Colorado.

Michael Purpura is a Sr. Solutions Architect at AWS.

Aneri Modi is an AWS Solutions Architect based out of Pennsylvania. She currently works with higher education customers to architect scalable and robust cloud solutions. She specializes in Analytics and AI/ML technologies.


Audit History

Last reviewed and updated in January 2025 by Michael Purpura | Sr. Solutions Architect and Aneri Modi | Solutions Architect