AWS Big Data Blog

Unite Real-Time and Batch Analytics Using the Big Data Lambda Architecture, Without Servers!

The Big Data Lambda Architecture seeks to provide data engineers and architects with a scalable, fault-tolerant data processing architecture and framework using loosely coupled, distributed systems. At a high level, the Lambda Architecture is designed to handle both real-time and historically aggregated batched data in an integrated fashion. It separates the duties of real-time and batch processing so purpose-built engines, processes, and storage can be used for each, while serving and query layers present a unified view of all of the data.

Historically, the Lambda Architecture demanded the use of various complex systems to achieve the outcomes of uniting batch and real-time views. Data platform engineers and architects were required to implement services running on Amazon EC2 for data collection and ingestion, batch processing, stream processing, serving layers, and dashboards/reporting. As time has gone on, AWS customers have continued to ask for managed solutions that scale seamlessly and put less focus on infrastructure, allowing teams to focus on what really matters: the data and the resulting insights.

In this post, I show you how you can use AWS services like AWS Glue to build a Lambda Architecture completely without servers. Through a practical demonstration, I examine the tight integration between serverless services on AWS and build a robust Lambda Architecture for data processing.

New: AWS Glue

With the launch of AWS Glue, AWS provides a portfolio of services to architect a Big Data platform without managing any servers or clusters. AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. You can create and run an ETL job with a few clicks in the AWS Management Console. You simply point AWS Glue to your data stored on AWS, and AWS Glue discovers your data and stores the associated metadata (for example, the table definition and schema) in the AWS Glue Data Catalog. After it’s cataloged, your data is immediately searchable, queryable, and available for ETL.

AWS Glue generates the code to execute your data transformations and data loading processes. Furthermore, AWS Glue provides a managed Spark execution environment to run ETL jobs against a data lake in Amazon S3. In short, you can now run a Lambda Architecture in AWS in a completely serverless fashion!

“Serverless” applications allow you to build and run applications without thinking about servers. What this means is that you can now stream data in real-time, process huge volumes of data in S3, and run SQL queries and visualizations against that data without managing server provisioning, installation, patching, or capacity scaling. This frees up your users to spend more time interpreting the data and deriving business value for your organization.

Solution overview

The AWS services involved in this solution include:

  • Amazon Kinesis Firehose
  • Amazon Kinesis Analytics
  • AWS Glue
  • Amazon S3
  • Amazon Athena
  • Amazon QuickSight

The following diagram explains how the services work together:

None of the included services require the creation, configuration, or installation of servers, clusters, and databases.

In this example, you use these services to send and process simulated streaming data from sensor devices to Kinesis Firehose and store the raw data in S3. Using AWS Glue, you analyze the raw data from S3 in a batch-oriented fashion to assess thermostat efficiency over time against the historical data, and store the results back in S3.

Using Amazon Kinesis Analytics, you analyze and filter the data to detect inefficient sensors in real time. In this example, you detect inefficiencies in thermostat devices by comparing their temperature settings against the temperature they are reading (where thermostats are typically set between 70-72º F).

Finally, you use Athena and Amazon QuickSight to query and visualize the data and build a dashboard that can be shared with other users of Amazon QuickSight in your organization.

Walkthrough

You use the AWS CLI and AWS CloudFormation throughout this tutorial, so a basic understanding of these services is assumed.

At the time of writing this post, AWS Glue is only available in the US East (N. Virginia) AWS Region. Complete all the steps in that region.

Step 1: Set up baseline AWS resources

I’ve created a CloudFormation template that sets up the Kinesis streaming resources for you.

Launch Stack

Before you continue with the rest of the walkthrough, make sure that the Status value is CREATE_COMPLETE. This can take anywhere between 3 and 5 minutes, so grab coffee or take a moment to prepare and review the next steps.
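If you prefer to check the stack status from a script rather than the console, a minimal boto3 sketch (with a placeholder for whatever stack name you chose when launching the template) looks like this:

import boto3

cloudformation = boto3.client("cloudformation", region_name="us-east-1")

# Replace <stackName> with the name you gave the stack when you launched it
stack = cloudformation.describe_stacks(StackName="<stackName>")["Stacks"][0]
print(stack["StackStatus"])  # proceed once this prints CREATE_COMPLETE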

Step 2: Set up the Amazon Kinesis Data Generator

You use the Amazon Kinesis Data Generator (KDG) to simulate devices pushing data into Kinesis Firehose.

Kinesis Data Generator

Make sure that you can log in to the KDG producer page before you continue.

Step 3: Send data to Kinesis Firehose using the KDG

In this step, you configure the KDG to send records that simulate devices streaming data into Kinesis Firehose. In the data generator template below, notice the weighted temperature values; this weighting produces the occasional outlier in device temperature readings, which you detect later on. Kinesis Firehose sends all the raw data to S3, and also to Kinesis Analytics for real-time processing.

On the KDG producer page, for Region, select “us-east-1”. For Stream/delivery stream, select AWSBigDataBlog-RawDeliveryStream. For Records per second, type 100.

Paste the template shown below into the KDG template window:

{
  "uuid": "{{random.uuid}}",
  "devicets": "{{date.utc("YYYY-MM-DD HH:mm:ss.SSS")}}",
  "deviceid": {{random.number(1000)}},
  "temp": {{random.weightedArrayElement(
    {"weights":[0.33, 0.33, 0.33, 0.01],"data":[70, 71, 72, 78]}
  )}}
}
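To see what the weightedArrayElement call in the template produces, here is a small local Python sketch that draws from the same distribution. It is purely illustrative, and shows that roughly one reading in a hundred is the 78ºF outlier:

import random
from collections import Counter

# Same values and weights as the KDG template above
temps = random.choices([70, 71, 72, 78], weights=[0.33, 0.33, 0.33, 0.01], k=20000)
print(Counter(temps))  # expect roughly 6,600 each of 70/71/72 and about 200 readings of 78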

Choose Send data.

While the KDG continues to send data, open the Amazon Kinesis Analytics console in a separate tab to view the status of the application processing the data in real time. When you are happy with the amount of data sent, return to the KDG and choose Stop Sending Data to Kinesis. I recommend waiting until at least 20,000 records have been sent.
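Kinesis Firehose buffers incoming records before delivering them to S3, so objects appear in the bucket after a short delay. If you want to confirm that raw data is arriving, a quick boto3 sketch (using the bucket name you specified in step 1 as a placeholder) is:

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Replace <bucketName> with the bucket you specified in the CloudFormation template
response = s3.list_objects_v2(Bucket="<bucketName>")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])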

Step 4: Submit AWS Glue crawlers to interpret the table definition for Kinesis Firehose outputs in S3

In this step, you use AWS Glue crawlers to crawl and generate table definitions against the produced data in S3. This allows you to analyze data in aggregate over a historical context instead of just using the latest data.

Use the following AWS CLI command to define the AWS Glue crawler:

  • Replace the <bucketName> text in the command with the bucket name that you specified in the CloudFormation template in step 1.
  • Replace the <RoleArn> text in the command with an IAM role ARN with permissions to use AWS Glue. For more information, see Setting up IAM Permissions for AWS Glue.
    aws --region us-east-1 glue create-crawler \
    --name 'thermostat-data-crawler' \
    --database-name 'awsblogsgluedemo' \
    --role '<RoleArn>' \
    --targets '{"S3Targets":[{"Path": "s3://<bucketName>", "Exclusions": ["etl_code", "temporary_dir"]}]}'
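If you prefer to start the crawler from code instead of the console steps that follow, a minimal boto3 sketch (assuming the crawler created above and credentials configured for us-east-1) starts it and waits for it to finish:

import time
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Start the crawler created with the CLI command above
glue.start_crawler(Name="thermostat-data-crawler")

# Poll until the crawler finishes and returns to the READY state
while glue.get_crawler(Name="thermostat-data-crawler")["Crawler"]["State"] != "READY":
    time.sleep(15)
print("Crawler finished")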

To run the crawler, navigate to the AWS Glue console.

Select the crawler that you created earlier, “thermostat-data-crawler”, and choose Run crawler.

After just a few seconds, the crawler shows as “Ready”, and will have automatically inferred the schema of three tables:

  • “raw”—Data being sent from KDG to Kinesis Firehose, delivered to S3.
  • “results”—Kinesis Analytics results data on real-time processing for percent inefficiency of devices, stored in S3.
  • “sampledata”—Simulated user device settings table for 1,000 thermostats in S3. This data is joined to the raw data to calculate the “percent inefficiency” (absolute value of temperature reading minus temperature setting, divided by the temperature setting).
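For clarity, the percent inefficiency described in the last bullet can be written as a tiny function. This is only an illustration of the formula; the actual calculation is performed by the Kinesis Analytics application and the AWS Glue job:

def pct_inefficiency(temp_reading, temp_setting):
    # Absolute deviation from the set temperature, as a fraction of the setting
    return abs(temp_reading - temp_setting) / temp_setting

print(pct_inefficiency(78, 71))  # an outlier reading: roughly 0.099, or about 9.9% inefficient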

Zooming in on the architecture diagram, I’ve included the names of these tables. Take a moment to see how the tables fit in the overall Lambda Architecture:

Step 5: Submit an AWS Glue job to pre-process daily thermostat efficiency

In this step, you use an AWS Glue job to join data and pre-process calculated views to assess the efficiency of the thermostat devices. The results are stored in S3 to be queried using Athena.

  • Replace the <bucketName> text in the command with the bucket name that you specified in the CloudFormation template in step 1.
  • Replace the <RoleName> text in the command with an IAM role with permissions to use AWS Glue. For more information, see Setting up IAM Permissions for AWS Glue.
    aws --region us-east-1 glue create-job \
    --name 'daily_avg' \
    --role '<RoleName>' \
    --execution-property 'MaxConcurrentRuns=1' \
    --command '{"Name": "glueetl", "ScriptLocation": "s3://<bucketName>/etl_code/daily_avg.py"}' \
    --default-arguments '{"--TempDir":"s3://<bucketName>/temporary_dir","--job-bookmark-option":"job-bookmark-disable"}'
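The actual daily_avg.py script is already staged for you in the etl_code folder of your bucket. To give you a feel for what such a job looks like, here is a hypothetical, simplified PySpark sketch of the same idea: it joins the raw readings with the sampledata device settings, computes the daily average percent inefficiency per device, and writes the result back to S3. The column names (settemp, devicets, temp) and the output path are assumptions, not the script's exact contents:

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql import functions as F

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the tables that the crawler added to the Data Catalog
raw = glueContext.create_dynamic_frame.from_catalog(
    database="awsblogsgluedemo", table_name="raw").toDF()
settings = glueContext.create_dynamic_frame.from_catalog(
    database="awsblogsgluedemo", table_name="sampledata").toDF().select("deviceid", "settemp")

# Join each reading to its device setting and average the percent inefficiency per device per day
daily = (raw.join(settings, on="deviceid", how="inner")
            .withColumn("date", F.to_date("devicets"))
            .withColumn("pct_inefficiency",
                        F.abs(F.col("temp") - F.col("settemp")) / F.col("settemp"))
            .groupBy("deviceid", "date")
            .agg(F.avg("pct_inefficiency").alias("daily_avg_inefficiency")))

# Write the serving-layer view back to S3; the output path is a placeholder
glueContext.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(daily, glueContext, "daily_avg_inefficiency"),
    connection_type="s3",
    connection_options={"path": "s3://<bucketName>/daily_avg_inefficiency"},
    format="parquet")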

To view and run the job, navigate to the AWS Glue console.

Select the job named “daily_avg”, and choose Action, Run job.

In the Parameters (optional) dialog box, choose Run job.

This job can take between 10 and 14 minutes to run. Take a moment to view the script definition in the console, but take care not to save any changes to the script at this time:

https://console.aws.amazon.com/glue/home?region=us-east-1#editJob:jobName=daily_avg

You should see a page similar to the following:

Step 6: Query data in S3 directly with Athena

You can now navigate to the Athena console, where Athena already has the tables present and ready to query. No loading or clusters necessary!

This walkthrough assumes that you have already integrated AWS Glue with Athena. If not, see Upgrading to the AWS Glue Data Catalog Step-by-Step.

Try out each of the individual queries below to analyze the data in Athena.

Query and analyze the raw time-series data in the “master dataset” Lambda Architecture layer:
--"Events by Device ID"
SELECT uuid,
         devicets,
         deviceid,
         temp
FROM awsblogsgluedemo."raw"
WHERE deviceid = 1
ORDER BY  devicets DESC;

--"Top 20 most active thermostats"
SELECT deviceid,
         COUNT(*) AS num_events
FROM awsblogsgluedemo."raw"
GROUP BY  deviceid
ORDER BY  num_events DESC LIMIT 20;

Query and analyze the data processed by AWS Glue at the “Serving” layer:
--"Top 10 Most inefficient devices"
SELECT deviceid,
        daily_avg_inefficiency
FROM awsblogsgluedemo.daily_avg_inefficiency
ORDER BY  daily_avg_inefficiency DESC LIMIT 10;

--"KPI - Overall device daily inefficiency"
SELECT ( SUM(daily_avg_inefficiency)/COUNT(*) ) AS all_device_avg_inefficiency,
         date
FROM awsblogsgluedemo.daily_avg_inefficiency
GROUP BY  date;

Query and analyze the data filtered by Kinesis Analytics, the outputs of the “speed” layer:
--"Top 10 most inefficient devices - event-level granularity"
SELECT col0 AS "uuid",
   col1 AS "deviceid",
   col2 AS "devicets",
   col3 AS "temp",
   col4 AS "settemp",
   col5 AS "pct_inefficiency"
FROM awsblogsgluedemo.results
ORDER BY pct_inefficiency DESC LIMIT 10;
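You can also run these queries outside the console; Athena exposes a simple API for submitting queries and fetching results. The following is a minimal boto3 sketch that submits the device-count query and polls for the result (the output location is a placeholder prefix in your bucket):

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

query = 'SELECT deviceid, COUNT(*) AS num_events FROM awsblogsgluedemo."raw" GROUP BY deviceid ORDER BY num_events DESC LIMIT 20'
execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "awsblogsgluedemo"},
    ResultConfiguration={"OutputLocation": "s3://<bucketName>/athena_results/"})

# Wait for the query to finish, then print each row of the result set
query_id = execution["QueryExecutionId"]
while athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"] in ("QUEUED", "RUNNING"):
    time.sleep(2)
for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
    print([col.get("VarCharValue") for col in row["Data"]])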

Step 7: Visualize with Amazon QuickSight

Now you have the data in S3 and table definitions in the AWS Glue Data Catalog, and you can query the data directly in S3 with Athena. You are now ready to begin creating visualizations with Amazon QuickSight.

In these steps, I demonstrate how you can visualize your data in S3, again without provisioning any infrastructure for databases, clusters, or BI tools.

Tie in all the data sources into Amazon QuickSight

  1. Open Amazon QuickSight.
  2. Choose Manage data, New data set, Athena.
  3. Give the data source a name, like awsblogsgluedemo.
  4. Choose awsblogsgluedemo, daily_avg_inefficiency, Select.
  5. Choose Directly query your data, Edit/Preview data. On the next screen, leave the settings unchanged and choose Save.
  6. Choose New data set in the top left corner.
  7. Under From existing data sources, choose awsblogsgluedemo.
  8. Choose Create data set, raw, and Select.
  9. Choose Directly query your data, Edit/Preview data. On the next screen, leave the settings unchanged and choose Save.

Repeat steps 6–9 to add the “results” dataset. At step 9, rename the column “col1” to “deviceid” by choosing the pencil icon.

Create an analysis that includes all data sources (raw, processed, real-time results)

  1. Return to the Amazon QuickSight main page by choosing QuickSight.
  2. Choose New analysis, daily_avg_inefficiency, and Create analysis.
  3. Choose Fields, Edit analysis data sets.
  4. Choose Add data set, results, and Select.

Repeat steps 3 and 4 to add the “raw” dataset.

Create a KPI chart on AWS Glue processed data

  1. Choose Fields, daily_avg_inefficiency.
  2. In the Visual types pane, choose the Key Performance Indicator (KPI) icon, shown below.
  3. Drag daily_avg_inefficiency from Fields into the Value well.
  4. Drag date into the Trend group well.
  5. For daily_avg_inefficiency (Sum), choose the down arrow, hover over Aggregate: Sum, and choose Average. Choose the down arrow again, hover over Show as: Number, and choose Percent.
  6. Choose daily_avg_inefficiency (Average) in the Value well to show fewer decimal points.

You should have a visualization similar to the following:

To create a scatter plot of all devices:

  1. At the top right of the screen, choose Add, Add visual.
  2. For Visual type, choose Scatter plot.
  3. Drag deviceid into the X axis well and the Group/Color well.
  4. Drag daily_avg_inefficiency into the Y axis well and the Size well.

You should now have a scatter plot visualization similar to the following figure. See if you can find a poorly functioning device with a high percentage inefficiency!

To create a line chart against the raw time-series dataset:

  1. Choose Fields, raw.
  2. Choose Add, Add visual.
  3. For Visual type, choose Line chart.
  4. Drag devicets into the X axis well, and drag temp into the Value well.
  5. For temp (Sum), choose the down arrow, hover over Aggregate: Sum, and then choose Average.
  6. Choose the down arrow at the top right, and choose Format visual.
  7. Expand the Y-Axis menu on the left, and choose Range, Auto (based on data range).
  8. Drag the slider to expand the time-series line chart.

You should now have a line chart similar to the following image. The spikes represent an increase in average temperature readings across the devices.

To create a heat map of the Kinesis Analytics results dataset:

  1. Choose Fields, results.
  2. Choose Add, Add visual.
  3. For Visual type, choose Heat map.
  4. Drag deviceid into the Columns and Values wells. For Aggregate, choose Count.
  5. Choose deviceid (Count). For Sort, choose the descending icon.

You should have a visualization like the following graph, which shows deviceid values on the far top left that have the most occurrences of an inefficient temperature reading. Notice that the heat map also colors them darker shades for easy visualization. Look for the devices at the top left that have a high frequency of inefficiency readings.

Finally, you can resize the charts and save them as a dashboard to share with other Amazon QuickSight users in your organization:

  1. Resize the charts by choosing the bottom right corner of a visual.
  2. Choose Share, Create dashboard.
  3. For Create new dashboard as, type a dashboard name, such as “Thermostats Performance.” Choose Create dashboard.

You should now have a dashboard containing all your visuals, similar to the following:

Conclusion

In this post, you saw how to process and analyze data from both streaming and batch sources together in a completely serverless fashion. The Lambda Architecture principles guided a design that separates the real-time and batch processing mechanisms while relying on fully managed, serverless AWS services.

You went from collecting real-time data, to processing it and joining it with a user settings file, to querying and visualizing the data, all without managing any servers.

I hope you found this post useful. In the comments, please share your thoughts on how you might extend this architecture to suit your needs.


Additional Reading

Learn how SmartNews built a Lambda Architecture on AWS to analyze customer behavior and recommend content!


About the Author

Laith Al-Saadoon is a Solutions Architect with a focus on data analytics at Amazon Web Services. He spends his days obsessing over designing customer architectures to process enormous amounts of data at scale. In his free time, he follows the latest trends in artificial intelligence.