AWS Big Data Blog
Unlock scalable analytics with AWS Glue and Google BigQuery
Data integration is the foundation of robust data analytics. It encompasses the discovery, preparation, and composition of data from diverse sources. In the modern data landscape, accessing, integrating, and transforming data from these sources is vital for data-driven decision-making. AWS Glue, a serverless data integration and extract, transform, and load (ETL) service, makes this process more accessible and efficient, removing much of the complexity and cost so organizations can perform data integration tasks in minutes.
This blog post explores the newly announced managed connector for Google BigQuery and demonstrates how to build a modern ETL pipeline with AWS Glue Studio without writing code.
Overview of AWS Glue
AWS Glue is a serverless data integration service that makes it easier to discover, prepare, and combine data for analytics, machine learning (ML), and application development. AWS Glue provides all the capabilities needed for data integration, so you can start analyzing your data and putting it to use in minutes instead of months. AWS Glue provides both visual and code-based interfaces to make data integration easier. Users can more easily find and access data using the AWS Glue Data Catalog. Data engineers and ETL developers can visually create, run, and monitor ETL workflows in a few steps in AWS Glue Studio. Data analysts and data scientists can use AWS Glue DataBrew to visually enrich, clean, and normalize data without writing code.
Introducing Google BigQuery Spark connector
To meet the demands of diverse data integration use cases, AWS Glue now offers a native Spark connector for Google BigQuery. Customers can now use AWS Glue 4.0 for Spark to read from and write to tables in Google BigQuery. You can read an entire table or run a custom query, and write your data using the direct or indirect writing method. You connect to BigQuery using service account credentials stored securely in AWS Secrets Manager.
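As a quick illustration, here is a minimal sketch of reading a BigQuery table inside an AWS Glue 4.0 Spark script. The connection, project, and table names are placeholders, and the option keys follow the documented BigQuery connection options; verify them against your AWS Glue version.

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read an entire table; to run a custom query instead, set "sourceType": "query"
# and pass the SQL in a "query" option. All names below are placeholders.
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="bigquery",
    connection_options={
        "connectionName": "bq-connection",  # Glue connection referencing the Secrets Manager secret
        "parentProject": "my-gcp-project",  # Google Cloud project the read is billed to
        "sourceType": "table",
        "table": "my_dataset.my_table",
    },
)
```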
Benefits of Google BigQuery Spark connector
- Seamless integration: The native connector offers an intuitive and streamlined interface for data integration, reducing the learning curve.
- Cost efficiency: Building and maintaining custom connectors can be expensive. The native connector provided by AWS Glue is a cost-effective alternative.
- Efficiency: Data transformation tasks that previously took weeks or months can now be accomplished within minutes.
Solution overview
In this example, you create two ETL jobs using AWS Glue with the native Google BigQuery connector.
- Query a BigQuery table and save the data into Amazon Simple Storage Service (Amazon S3) in Parquet format.
- Use the data extracted by the first job to create an aggregated result and store it in Google BigQuery.
Prerequisites
The dataset used in this solution is the NCEI/WDS Global Significant Earthquake Database, with a global listing of over 5,700 earthquakes from 2150 BC to the present. Copy this public data into your Google BigQuery project or use your existing dataset.
Configure BigQuery connections
To connect to Google BigQuery from AWS Glue, see Configuring BigQuery connections. You must create and store your Google Cloud Platform credentials in a Secrets Manager secret, then associate that secret with a Google BigQuery AWS Glue connection.
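For reference, here is a minimal boto3 sketch of creating such a secret. Per the linked documentation, the service account key file is stored base64-encoded under a credentials key; the file and secret names here are placeholders.

```python
import base64
import json

import boto3

# Base64-encode the service account key file downloaded from the Google Cloud console.
with open("service-account-key.json", "rb") as f:
    encoded_credentials = base64.b64encode(f.read()).decode("utf-8")

# Store it under the "credentials" key, as expected by the Glue BigQuery connection.
secretsmanager = boto3.client("secretsmanager")
secretsmanager.create_secret(
    Name="bq-connection-secret",  # placeholder secret name
    SecretString=json.dumps({"credentials": encoded_credentials}),
)
```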
Set up Amazon S3
Every object in Amazon S3 is stored in a bucket. Before you can store data in Amazon S3, you must create an S3 bucket to store the results.
To create an S3 bucket:
- On the AWS Management Console for Amazon S3, choose Create bucket.
- Enter a globally unique Name for your bucket, for example, awsglue-demo.
- Choose Create bucket.
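If you prefer the AWS SDK, a minimal boto3 sketch of the same step (the bucket name and Region are placeholders):

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Bucket names must be globally unique. Outside us-east-1, also pass
# CreateBucketConfiguration={"LocationConstraint": "<your-region>"}.
s3.create_bucket(Bucket="awsglue-demo")
```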
Create an IAM role for the AWS Glue ETL job
When you create the AWS Glue ETL job, you specify an AWS Identity and Access Management (IAM) role for the job to use. The role must grant access to all resources used by the job, including Amazon S3 (for any sources, targets, scripts, driver files, and temporary directories), and Secrets Manager.
For instructions, see Configure an IAM role for your ETL job.
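As a rough boto3 sketch of what that setup involves: create a role that AWS Glue can assume, then attach the permissions it needs. The managed policy shown covers baseline AWS Glue permissions; you would still add inline policies scoped to your bucket and secret.

```python
import json

import boto3

iam = boto3.client("iam")

# Trust policy that lets the AWS Glue service assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

iam.create_role(
    RoleName="AWSGlueRole",  # role name used in the walkthrough
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Baseline AWS Glue permissions; add inline policies for your S3 bucket
# and Secrets Manager secret separately.
iam.attach_role_policy(
    RoleName="AWSGlueRole",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)
```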
Solution walkthrough
Create a visual ETL job in AWS Glue Studio to transfer data from Google BigQuery to Amazon S3
- Open the AWS Glue console.
- In AWS Glue, navigate to Visual ETL under the ETL jobs section and create a new ETL job using Visual with a blank canvas.
- Enter a Name for your AWS Glue job, for example, bq-s3-dataflow.
- Select Google BigQuery as the data source.
- Enter a name for your Google BigQuery source node, for example, noaa_significant_earthquakes.
- Select a Google BigQuery connection, for example, bq-connection.
- Enter a Parent project, for example, bigquery-public-datasources.
- Select Choose a single table for the BigQuery Source.
- Enter the table you want to migrate in the form [dataset].[table], for example, noaa_significant_earthquakes.earthquakes.
- Next, choose the data target as Amazon S3.
- Enter a Name for the target Amazon S3 node, for example, earthquakes.
- Select the output data Format as Parquet.
- Select the Compression Type as Snappy.
- For the S3 Target Location, enter the bucket created in the prerequisites, for example, s3://<YourBucketName>/noaa_significant_earthquakes/earthquakes/. Replace <YourBucketName> with the name of your bucket.
- Next, go to the Job details tab. For IAM Role, select the IAM role from the prerequisites, for example, AWSGlueRole.
- Choose Save.
Run and monitor the job
- After your ETL job is configured, you can run the job. AWS Glue will run the ETL process, extracting data from Google BigQuery and loading it into your specified S3 location.
- Monitor the job’s progress in the AWS Glue console, or programmatically as sketched below. You can see logs and job run history to ensure everything is running smoothly.
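A minimal boto3 sketch of starting the job and polling its state (the job name comes from the walkthrough):

```python
import time

import boto3

glue = boto3.client("glue")

# Start the job and poll until it reaches a terminal state.
run = glue.start_job_run(JobName="bq-s3-dataflow")
while True:
    state = glue.get_job_run(JobName="bq-s3-dataflow", RunId=run["JobRunId"])["JobRun"]["JobRunState"]
    print(state)
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)
```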
Data validation
- After the job has run successfully, validate the data in your S3 bucket to ensure it matches your expectations. You can see the results using Amazon S3 Select.
Automate and schedule
- If needed, set up job scheduling to run the ETL process regularly. You can use AWS Glue job scheduling to automate your ETL jobs, ensuring your S3 bucket is always up to date with the latest data from Google BigQuery; a minimal trigger sketch follows.
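For example, a scheduled AWS Glue trigger created with boto3 (the trigger name and cron expression are placeholders):

```python
import boto3

glue = boto3.client("glue")

# Run the BigQuery-to-S3 job every day at 06:00 UTC.
glue.create_trigger(
    Name="bq-s3-dataflow-daily",  # placeholder trigger name
    Type="SCHEDULED",
    Schedule="cron(0 6 * * ? *)",
    Actions=[{"JobName": "bq-s3-dataflow"}],
    StartOnCreation=True,
)
```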
You’ve successfully configured an AWS Glue ETL job to transfer data from Google BigQuery to Amazon S3. Next, you create the ETL job to aggregate this data and transfer it to Google BigQuery.
Finding earthquake hotspots with AWS Glue Studio Visual ETL
- Open the AWS Glue console.
- In AWS Glue, navigate to Visual ETL under the ETL jobs section and create a new ETL job using Visual with a blank canvas.
- Provide a name for your AWS Glue job, for example, s3-bq-dataflow.
- Choose Amazon S3 as the data source.
- Enter a Name for the source Amazon S3 node, for example, earthquakes.
- Select S3 location as the S3 source type.
- Enter the S3 bucket created in the prerequisites as the S3 URL, for example, s3://<YourBucketName>/noaa_significant_earthquakes/earthquakes/. Replace <YourBucketName> with the name of your bucket.
- Select the Data format as Parquet.
- Select Infer schema.
- Next, choose the Select Fields transformation.
- Next, choose the Aggregate transformation.
- Next, choose a RenameField transformation.
- Next, choose a second RenameField transformation.
- Next, choose the data target as Google BigQuery.
- Provide a name for your Google BigQuery target node, for example, most_powerful_earthquakes.
- Select a Google BigQuery connection, for example, bq-connection.
- Select a Parent project, for example, bigquery-public-datasources.
- Enter the name of the Table you want to create in the form [dataset].[table], for example, noaa_significant_earthquakes.most_powerful_earthquakes.
- Choose Direct as the Write method.
- Next, go to the Job details tab, and for IAM Role, select the IAM role from the prerequisites, for example, AWSGlueRole.
- Choose Save.
Run and monitor the job
- After your ETL job is configured, you can run the job. AWS Glue runs the ETL process, reading the data from Amazon S3, aggregating it, and loading the results into Google BigQuery.
- Monitor the job’s progress in the AWS Glue console. You can see logs and job run history to ensure everything is running smoothly.
Data validation
- After the job has run successfully, validate the data in your Google BigQuery dataset. This ETL job returns a list of countries where the most powerful earthquakes have occurred, counting the number of earthquakes of each magnitude by country.
Automate and schedule
- You can set up job scheduling to run the ETL process regularly. AWS Glue allows you to automate your ETL jobs, ensuring your Google BigQuery table is always up to date with the latest data from Amazon S3.
That’s it! You’ve successfully set up an AWS Glue ETL job to transfer data from Amazon S3 to Google BigQuery. You can use this integration to automate the process of data extraction, transformation, and loading between these two platforms, making your data readily available for analysis and other applications.
Clean up
To avoid incurring charges, clean up the resources used in this blog post from your AWS account by completing the following steps:
- On the AWS Glue console, choose Visual ETL in the navigation pane.
- From the list of jobs, select the job bq-s3-dataflow and delete it.
- From the list of jobs, select the job s3-bq-dataflow and delete it.
- On the AWS Glue console, choose Connections in the navigation pane under Data Catalog.
- Choose the BigQuery connection you created and delete it.
- On the Secrets Manager console, choose the secret you created and delete it.
- On the IAM console, choose Roles in the navigation pane, then select the role you created for the AWS Glue ETL job and delete it.
- On the Amazon S3 console, search for the S3 bucket you created, choose Empty to delete the objects, then delete the bucket.
- Clean up resources in your Google account by deleting the project that contains the Google BigQuery resources. Follow the documentation to clean up the Google resources.
Conclusion
The integration of AWS Glue with Google BigQuery simplifies the analytics pipeline, reduces time-to-insight, and facilitates data-driven decision-making. It empowers organizations to streamline data integration and analytics. The serverless nature of AWS Glue means no infrastructure management, and you pay only for the resources consumed while your jobs are running. As organizations increasingly rely on data for decision-making, this native Spark connector provides an efficient, cost-effective, and agile solution to swiftly meet data analytics needs.
If you’re interested in seeing how to read from and write to tables in Google BigQuery in AWS Glue, take a look at the step-by-step video tutorial. In this tutorial, we walk through the entire process, from setting up the connection to running the data transfer flow. For more information on AWS Glue, visit AWS Glue.
Appendix
If you are looking to implement this example using code instead of the AWS Glue Studio visual editor, use the following code snippets.
Reading data from Google BigQuery and writing data into Amazon S3
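The following is a minimal sketch of the first job (bq-s3-dataflow), using the connection, project, table, and bucket names from the walkthrough. The BigQuery option keys follow the documented connection options; verify them against your AWS Glue version, and replace <YourBucketName> with your bucket.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the earthquakes table from Google BigQuery through the Glue connection.
source = glueContext.create_dynamic_frame.from_options(
    connection_type="bigquery",
    connection_options={
        "connectionName": "bq-connection",
        "parentProject": "bigquery-public-datasources",
        "sourceType": "table",
        "table": "noaa_significant_earthquakes.earthquakes",
    },
    transformation_ctx="source",
)

# Write the data to Amazon S3 as Snappy-compressed Parquet.
glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    format="glueparquet",
    format_options={"compression": "snappy"},
    connection_options={
        "path": "s3://<YourBucketName>/noaa_significant_earthquakes/earthquakes/"
    },
    transformation_ctx="target",
)

job.commit()
```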
Reading and aggregating data from Amazon S3 and writing into Google BigQuery
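And a minimal sketch of the second job (s3-bq-dataflow). The column names country and eq_primary are assumptions based on the NCEI/WDS dataset schema; the select, aggregate, and rename steps mirror the transformations configured in AWS Glue Studio.

```python
import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the Parquet files written by the first job.
earthquakes = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    format="parquet",
    connection_options={
        "paths": ["s3://<YourBucketName>/noaa_significant_earthquakes/earthquakes/"]
    },
    transformation_ctx="earthquakes",
)

# Select, aggregate, and rename: count earthquakes per country and magnitude.
# The column names country and eq_primary are assumed from the dataset schema.
df = earthquakes.toDF()
aggregated = (
    df.select("country", "eq_primary")
    .groupBy("country", "eq_primary")
    .agg(F.count("*").alias("number_of_earthquakes"))
    .withColumnRenamed("eq_primary", "magnitude")
)
result = DynamicFrame.fromDF(aggregated, glueContext, "result")

# Write the aggregated result to Google BigQuery with the direct write method.
glueContext.write_dynamic_frame.from_options(
    frame=result,
    connection_type="bigquery",
    connection_options={
        "connectionName": "bq-connection",
        "parentProject": "bigquery-public-datasources",
        "writeMethod": "direct",
        "table": "noaa_significant_earthquakes.most_powerful_earthquakes",
    },
    transformation_ctx="target",
)

job.commit()
```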
About the authors
Kartikay Khator is a Solutions Architect in Global Life Sciences at Amazon Web Services (AWS). He is passionate about building innovative and scalable solutions to meet the needs of customers, focusing on AWS Analytics services. Beyond the tech world, he is an avid runner and enjoys hiking.
Kamen Sharlandjiev is a Sr. Big Data and ETL Solutions Architect and Amazon AppFlow expert. He’s on a mission to make life easier for customers who are facing complex data integration challenges. His secret weapon? Fully managed, low-code AWS services that can get the job done with minimal effort and no coding.
Anshul Sharma is a Software Development Engineer on the AWS Glue team. He is driving the connectivity charter, which provides AWS Glue customers a native way of connecting any data source (data warehouses, data lakes, NoSQL stores, and so on) to AWS Glue ETL jobs. Beyond the tech world, he is a cricket and soccer lover.