AWS Partner Network (APN) Blog
Fusing Business Data with an Enterprise Data Solution for a Unified Data Strategy
By Rama Lankalapalli, Sr. Partner Solutions Architect – AWS
By Adam Hood, Samir Batla, John Reynolds, Amlan Das, and Shriya Goel – PwC
As companies grow, enterprise data can become siloed within business units, making it cumbersome to join critical data points required for operational insights and decision making.
A key challenge faced by many companies is the lack of a centralized data repository that enables meaningful insights across different systems, such as an Order Management System (OMS), Warehouse Management System (WMS), Labor Management System (LMS), and Transportation Management System (TMS).
Stakeholders report that data challenges are a common denominator, from C-suite executives working on financial reporting, business health metrics, and regulatory compliance to management teams producing operational reports, key metrics, and trend insights.
A crucial step in enabling your stakeholders to join data across the organization is creating a centralized data location. To accomplish this, organizations are looking to leverage cloud resources due to their scalability, storage flexibility, and consumption options.
Since setting up an initial cloud footprint can be daunting, PwC has built an enterprise data solution on Amazon Web Services (AWS) to centralize your business data and make it more accessible to stakeholders. This drives business value by reducing time spent on bespoke data collection efforts, allowing more time for analysis and innovation using consistent, up-to-date datasets.
PwC is an AWS Premier Tier Services Partner with AWS Competencies in Data and Analytics, DevOps, Security, and many more key areas of cloud computing. PwC helps you drive innovation throughout IT and the business to compete in today’s service economy.
Solution Overview
PwC’s enterprise data solution in a box is a preset collection of AWS services and workflows that enable you to collect, ingest, and query data from your various source systems in a single location in your AWS cloud environment.
Your data flows through the enterprise data solution in a series of steps: landing your source data, preparing the data for ingestion, ingesting the data into a fit-for-purpose repository (like Amazon Redshift), and finally consuming that data through visualization tools (such as Amazon QuickSight).
Figure 1 – Enterprise data solution built by PwC.
Landing Your Source Data
To take full advantage of the enterprise data solution, your business data must be transferred to the landing layer’s Amazon Simple Storage Service (Amazon S3) bucket. The enterprise data solution supports many data formats, including the common CSV, TSV, and Parquet formats.
The two primary cloud-native AWS tools to land your data are AWS DataSync for flat files and AWS Database Migration Service (AWS DMS) for most relational database systems. Third-party tools are also available to move files from on-premises storage to Amazon S3, migrate relational data, or perform more advanced change data capture (CDC).
Figure 2 – Data transfer process/tools to move data into the enterprise data solution.
Regardless of tooling, network connectivity from your corporate network to the AWS account where your enterprise data solution resides is a prerequisite for data transfer. This strategy should be determined with your IT/networking group, based on the expected data transfer volume as well as other considerations, such as whether corporate policy permits transmitting data over the public internet, even when encrypted.
Let’s consider two common enterprise scenarios:
Scenario #1 – Flat File Ingestion
In the first scenario, a daily batch process extracts data from a business system and writes the output as a flat file to a network file share. AWS DataSync is the preferred transfer option in this scenario.
Create a DataSync task for your data source that's scheduled to run after your batch process completes. Please note the following is not actual source code, but rather an indicative sample.
# In your app.py for the data source, create a new DataSyncSrcLocProps with the network share information
data_sync_props = DataSyncSrcLocProps(
    srcLocationType='NFS',
    srcLocationName='MySAN',
    srcLocationHost='MySAN.MyCorp.COM',
    srcLocationPath='/MyAppFiles'
)

# Pass the data_sync_props to the deploy stage of your orchestration pipeline
stack.add_deploy_stage(MyDataSourceStage(
    stack,
    data_sync_props,
    target_vpc_id=DataPlatformResourceHelper.get_target_vpc_id(),
    landing_subnet_list=DataPlatformResourceHelper.get_landing_subnets()
))

# In your edw_orch_MyDataSource_datasync_stack.py, modify the desired cron schedule
datasync_cron_expression = "30 9 * * ? *"
The enterprise data solution's orchestration pipeline deploys the DataSync task, and an Amazon EventBridge rule is created to listen for successful DataSync task completions. After each task completion, the rule triggers an AWS Lambda function that reads all newly transferred files from the DataSync event log and passes this list of files to the orchestration step function for processing.
The orchestration makes the data available in the Amazon Redshift instance for querying or further processing.
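For illustration, here's a minimal sketch of that rule-and-handler pairing, assuming DataSync's standard task execution state-change events. The state machine ARN (read from an environment variable) is a hypothetical placeholder, and the handler simplifies by forwarding the task execution ARN rather than parsing the event log itself. As before, this is indicative rather than actual source code.

import json
import os

import boto3

sfn = boto3.client("stepfunctions")

# Illustrative EventBridge rule pattern: match successful executions
# of the DataSync task deployed by the orchestration pipeline.
EVENT_PATTERN = {
    "source": ["aws.datasync"],
    "detail-type": ["DataSync Task Execution State Change"],
    "detail": {"State": ["SUCCESS"]},
}

def handler(event, context):
    # The completed task execution's ARN arrives in the event's resources.
    task_execution_arn = event["resources"][0]

    # Hand off to the orchestration step function, which reads the list
    # of newly transferred files and processes each one.
    sfn.start_execution(
        stateMachineArn=os.environ["ORCHESTRATION_STATE_MACHINE_ARN"],  # hypothetical
        input=json.dumps({"taskExecutionArn": task_execution_arn}),
    )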
Scenario #2 – Direct Database Ingestion
The second scenario involves a line-of-business relational database containing order data that needs to be available in the enterprise data solution. AWS DMS is the preferred data transfer option in this scenario.
AWS DMS is configured in the enterprise data solution to execute as a daily job, filter the result set (for example, to the previous day's orders), and write the data in Parquet format. An EventBridge rule is created with a cron schedule to execute a step function that encapsulates the DMS process: it starts the AWS DMS task, monitors it for completion, and then triggers the data orchestration step function to move the landed data into the enterprise data solution's Amazon Redshift instance.
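As an indicative sketch (not actual source code) of what that encapsulating step function does, the boto3 calls below start the DMS task and poll for completion. The task ARN is a hypothetical placeholder, and the real solution would implement the polling with Step Functions wait and choice states rather than an in-process loop.

import time

import boto3

dms = boto3.client("dms")

# Hypothetical ARN of the DMS task that extracts the previous day's
# orders and writes them to the landing bucket in Parquet format.
TASK_ARN = "arn:aws:dms:us-east-1:123456789012:task:orders-daily"

def run_daily_dms_job():
    # Re-run the full-load task on its daily schedule.
    dms.start_replication_task(
        ReplicationTaskArn=TASK_ARN,
        StartReplicationTaskType="reload-target",
    )

    # Poll until the task stops, then let the data orchestration step
    # function pick up the landed Parquet files.
    while True:
        task = dms.describe_replication_tasks(
            Filters=[{"Name": "replication-task-arn", "Values": [TASK_ARN]}]
        )["ReplicationTasks"][0]
        if task["Status"] == "stopped":
            return task
        time.sleep(30)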
These are just two out-of-the-box examples; other tools can also be used to land data in Amazon S3. Create your own trigger mechanism to start the data orchestration once data lands in S3, as in the sketch below, and the data will move automatically into the enterprise data solution's Redshift instance based on the configuration files.
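Here's a minimal sketch of such a custom trigger, assuming a third-party tool drops files into the landing bucket and an S3 event notification invokes this Lambda function; the state machine ARN is again a hypothetical placeholder.

import json
import os

import boto3

sfn = boto3.client("stepfunctions")

def handler(event, context):
    # One S3 event notification can carry several records.
    for record in event["Records"]:
        landed_file = {
            "bucket": record["s3"]["bucket"]["name"],
            "key": record["s3"]["object"]["key"],
        }
        # Start the data orchestration step function for the landed file;
        # the configuration files determine how it's staged into Redshift.
        sfn.start_execution(
            stateMachineArn=os.environ["ORCHESTRATION_STATE_MACHINE_ARN"],  # hypothetical
            input=json.dumps(landed_file),
        )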
Preparing and Ingesting Source Data
After landing the source data, the enterprise data solution executes a series of steps to validate data integrity, prepare the data for ingestion, and perform the ingestion into the target repository. Data validation can be as simple as record count checks against a control file, or can include more complex business logic checks.
Next, data cleanup performs common tasks like converting empty strings to nulls, formatting dates, and ensuring data types match the schema specified in the configuration files. Finally, the data is moved into staging tables in Redshift using AWS Glue. All steps are orchestrated by AWS Step Functions.
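To make the validation step concrete, here's a minimal sketch of a record count check; the control file layout (a single line holding the expected row count) is an assumption for illustration, not the solution's actual format.

import csv

def validate_record_count(data_path, control_path, delimiter="\t", has_header=True):
    # The control file is assumed to hold the expected row count on one line.
    with open(control_path) as f:
        expected = int(f.read().strip())

    with open(data_path, newline="") as f:
        actual = sum(1 for _ in csv.reader(f, delimiter=delimiter))
        if has_header:
            actual -= 1  # Exclude the header row, per file_header_flag.

    return actual == expected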
Figure 3 – Configuration-driven data ingestion.
The AWS Step Functions state machine is automatically created and configured by the enterprise data solution using the data specified in the source configuration file; no additional coding is required. The example configuration file below ingests the IMDB names table.
datasource: imdb
dataset: names_basics
orchestration_type: DIM
src_file_validation: false
db_table: stg_imdb.stg_tbl_names_basics
landing_filename: NAMES_BASICS.
delimiter: "\t"
null_conversion_flag: true
null_conversion:
  replacer_value:
  null_values: ['', ' ', '\N']
file_header_flag: true
datetime_types: ['%Y-%m-%d %H:%M:%S', '%Y-%m-%d %H:%M:%S.%f000', '%Y-%m-%d %H:%M:%S.%f']
date_types: ['%Y/%m/%d', '%Y-%m-%d', '%m/%d/%Y', '%m-%d-%y', '%m/%d/%y', '%Y-%m-%d %H:%M:%S']
data_dictionary:
  nconst: varchar(20)
  primaryName: varchar(255)
  birthYear: int
  deathYear: int
  primaryProfession: varchar(255)
  knownForTitles: varchar(255)
*Not actual source code; an indicative sample.
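To illustrate how such a configuration drives the pipeline, here's an indicative sketch (not the solution's actual loader) that reads the file above and applies its delimiter, header, and null-conversion settings before the data is staged:

import pandas as pd
import yaml

def load_landed_file(config_path, landed_file):
    with open(config_path) as f:
        cfg = yaml.safe_load(f)

    # Read the landed file using the configured delimiter and header flag.
    df = pd.read_csv(
        landed_file,
        sep=cfg["delimiter"],
        header=0 if cfg["file_header_flag"] else None,
        dtype=str,
        keep_default_na=False,
    )

    # Convert the configured sentinel values (e.g. '' and '\N') to nulls
    # ahead of type checks and the Glue load into the staging table.
    if cfg["null_conversion_flag"]:
        df = df.replace(cfg["null_conversion"]["null_values"], pd.NA)

    return df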
Consuming Data from the Enterprise Data Solution
After the step function completes successfully, all data specified in the source data configuration files is staged in Redshift as fact and dimension tables.
Now that your data is in Redshift, use Amazon QuickSight (or your own third-party tools) to connect to it. Once the connection is established, you'll see the list of available tables and views you can use to create visualizations and derive valuable insights based on your business requirements.
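If you prefer a programmatic connection over a BI tool, here's a minimal sketch using the open-source redshift_connector driver; the connection details are placeholders.

import redshift_connector

# Placeholder connection details for the solution's Redshift instance.
conn = redshift_connector.connect(
    host="edw-cluster.example.us-east-1.redshift.amazonaws.com",
    database="edw",
    user="analyst",
    password="...",
)

cursor = conn.cursor()
# Example query against the staged IMDB names table.
cursor.execute(
    "SELECT primaryProfession, COUNT(*) AS people "
    "FROM stg_imdb.stg_tbl_names_basics "
    "GROUP BY primaryProfession ORDER BY people DESC LIMIT 10"
)
for row in cursor.fetchall():
    print(row)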
Figure 4 below shows an example dashboard based on the IMDB dataset ingested into the enterprise data solution and consumed with QuickSight. This dashboard helps talent managers (agents) derive insights about current or potential clients.
Figure 4 – Amazon QuickSight dashboard.
Capturing Event-Based Data in the Enterprise Data Solution
The current release of PwC's enterprise data platform is focused on batch-style data ingestion, where data is typically ingested daily in larger chunks. AWS Glue is preferred over AWS Lambda in this scenario, where processing time may exceed Lambda's maximum execution duration.
To enable use cases requiring near real-time data (for example, tracking total value of orders delivered from a warehouse throughout the day), the enterprise data solution will support event-based ingestion mechanisms like change data capture.
The existing orchestration step function will be adapted to process smaller, more frequent pieces of data, pushing data into Redshift more quickly. The dataset configuration file determines which version of the orchestration function is triggered to process incoming data.
Figure 5 – Configuration-driven batch and event data transformation.
Real-time Data Processing and Publishing
For organizations that require real-time data analysis, the platform will support streaming data technologies to process and analyze data as it arrives. For application integration scenarios, a parallel route is required for data publishing.
Using Amazon Managed Streaming for Apache Kafka (Amazon MSK), event-based data can be published to an EventBridge event bus for consumption and, in parallel, to the data orchestration step function. This enables lightweight transformation and augmentation of data events without the latency of loading and transforming in Redshift.
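Here's a minimal sketch of that publishing leg, assuming a hypothetical order-events topic, broker address, and event bus; the consumer and put_events calls are standard kafka-python and boto3 APIs.

import json

import boto3
from kafka import KafkaConsumer

events = boto3.client("events")

# Hypothetical MSK topic carrying warehouse order events.
consumer = KafkaConsumer(
    "warehouse-orders",
    bootstrap_servers=["b-1.example-msk-cluster:9092"],
    value_deserializer=lambda v: json.loads(v),
)

for message in consumer:
    # Lightweight augmentation before publishing; the same payload is
    # handed to the data orchestration step function in parallel.
    event = {**message.value, "sourceTopic": message.topic}
    events.put_events(
        Entries=[{
            "Source": "edw.realtime",        # hypothetical
            "DetailType": "OrderEvent",      # hypothetical
            "Detail": json.dumps(event),
            "EventBusName": "edw-event-bus", # hypothetical
        }]
    )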
Figure 6 – Real-time data transformation orchestration.
Data Governance and Security
Each organization's security needs differ based on factors such as industry, data sensitivity, and compliance. The enterprise data solution supports data governance by implementing role-based access controls so that only authorized users have access to specific data and resources, along with data quality checks and compliance measures to maintain data accuracy, privacy, and security.
Conclusion
PwC has introduced an enterprise data solution to assist organizations with enhancing their data strategy and consolidating data. The platform bridges enterprise data silos by streamlining data collection processes, minimizing time spent on these efforts.
Using AWS-native services combined with configuration and automation, the platform can be stood up quickly and expanded rapidly. This frees more time for data analysis and business innovation, supported by consistent, up-to-date datasets.
To learn how the enterprise data solution can help you, please contact PwC.
PwC – AWS Partner Spotlight
PwC is an AWS Premier Tier Services Partner that helps you drive innovation throughout IT and the business to compete in today’s service economy.
Contact PwC | Partner Overview | AWS Marketplace | Case Studies