AWS Big Data Blog

Scaling RISE with SAP data and AWS Glue

Customers often want to augment and enrich SAP source data with other non-SAP source data. Such analytic use cases can be enabled by building a data warehouse or data lake. Customers can now use the AWS Glue SAP OData connector to extract data from SAP. The SAP OData connector supports both on-premises and cloud-hosted (native and SAP RISE) deployments. By using the AWS Glue OData connector for SAP, you can work seamlessly with your data on AWS Glue and Apache Spark in a distributed fashion for efficient processing. AWS Glue is a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development.

The AWS Glue OData connector for SAP uses the SAP ODP framework and the OData protocol for data extraction. This framework operates on a provider-subscriber model to enable data transfers between SAP systems and non-SAP data targets. The ODP framework supports full data extraction and change data capture through the Operational Delta Queue (ODQ) mechanism. As a source for extraction from SAP, you can use SAP data extractors, ABAP CDS views, SAP BW or BW/4HANA sources, HANA information views in SAP ABAP sources, or any other ODP-enabled data source.

SAP source systems can hold historical data, and can receive constant updates. For this reason, it’s important to enable incremental processing of source changes. This blog post details how you can extract data from SAP and implement incremental data transfer from your SAP source using the SAP ODP OData framework with source delta tokens.

Solution overview

Example Corp wants to analyze the product data stored in their SAP source system. They want to understand their current product offering, in particular the number of products in each of their material groups. This involves joining data from the SAP material master and material group data sources in their SAP system. The material master data supports incremental extraction, while the material group data is only available as a full load. These data sources should be combined and made available to query for analysis.

Prerequisites

To complete the solution presented in the post, start by completing the following prerequisite steps:

  1. Configure operational data provisioning (ODP) data sources for extraction in the SAP Gateway of your SAP system.
  2. Create an Amazon Simple Storage Service (Amazon S3) bucket to store your SAP data.
  3. In an AWS Glue Data Catalog, create a database called sapgluedatabase.
  4. Create an AWS Identity and Access Management (IAM) role for the AWS Glue extract, transform, and load (ETL) job to use. The role must grant access to all resources used by the job, including Amazon S3 and AWS Secrets Manager. For the solution in this post, name the role GlueServiceRoleforSAP. Use the following policies (a scripted alternative follows the inline policy below):
    • AWS managed policies:
    • Inline policy:
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "VisualEditor0",
                  "Effect": "Allow",
                  "Action": [
                      "s3:PutObject",
                      "s3:GetObjectAcl",
                      "s3:GetObject",
                      "s3:GetObjectAttributes",
                      "s3:ListBucket",
                      "s3:DeleteObject",
                      "s3:PutObjectAcl"
                  ],
                  "Resource": [
                      "arn:aws:s3:::<S3-BUCKET-NAME>",
                      "arn:aws:s3:::<S3-BUCKET-NAME>/*"
                  ]
              }
          ]
      }
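
If you prefer to script this prerequisite, the following is a minimal boto3 sketch. The AWSGlueServiceRole managed policy and the inline policy name SAPODataS3Access are assumptions for illustration; adjust both to your requirements, and remember the job also needs access to your Secrets Manager secret.

import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets AWS Glue assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="GlueServiceRoleforSAP",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# AWS managed policy for the Glue service (assumed choice; attach others as needed)
iam.attach_role_policy(
    RoleName="GlueServiceRoleforSAP",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)

# Inline S3 policy from above; replace <S3-BUCKET-NAME> with your bucket name
s3_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "VisualEditor0",
        "Effect": "Allow",
        "Action": [
            "s3:PutObject", "s3:GetObjectAcl", "s3:GetObject",
            "s3:GetObjectAttributes", "s3:ListBucket",
            "s3:DeleteObject", "s3:PutObjectAcl",
        ],
        "Resource": [
            "arn:aws:s3:::<S3-BUCKET-NAME>",
            "arn:aws:s3:::<S3-BUCKET-NAME>/*",
        ],
    }],
}
iam.put_role_policy(
    RoleName="GlueServiceRoleforSAP",
    PolicyName="SAPODataS3Access",  # name assumed for illustration
    PolicyDocument=json.dumps(s3_policy),
)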
      

Create the AWS Glue connection for SAP

The SAP OData connector supports both CUSTOM (SAP BASIC authentication) and OAUTH authentication methods. For this example, you will connect with BASIC authentication.

  1. Use the AWS Secrets Manager console to create a secret called ODataGlueSecret for your SAP source. The secret should include the elements in the following code; enter your SAP system username in place of <your SAP username> and its password in place of <your SAP username password>. (For a scripted alternative, see the sketch after this list.)
    {
       "basicAuthUsername": "<your SAP username>",
       "basicAuthPassword": "<your SAP username password>",
       "basicAuthDisableSSO": "True",
       "customAuthenticationType": "CustomBasicAuth"
    }
    

  2. Create the AWS Glue connection GlueSAPOData for your SAP system by selecting the new SAP OData data source.
  3. Configure the connection with the appropriate values for your SAP source.
    1. Application host URL: The host must have the SSL certificates for the authentication and validation of your SAP host name.
    2. Application service path: /sap/opu/odata/iwfnd/catalogservice;v=2;
    3. Port number: Port number of your SAP source system.
    4. Client number: Client number of your SAP source system.
    5. Logon language: Logon language of your SAP source system.
  4. In the Authentication section, select CUSTOM as the Authentication Type.
  5. Select the AWS secret created in the preceding steps: ODataGlueSecret.
  6. In the Network Options section, enter the VPC, subnet, and security group used for the connection to your SAP system. For more information on connecting to your SAP system, see Configure a VPC for your ETL job.
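
If you prefer to create the secret from step 1 programmatically, the following is a minimal boto3 sketch using the same secret name and key/value structure shown above:

import json
import boto3

secrets = boto3.client("secretsmanager")

# Same key/value structure as the console example in step 1
secrets.create_secret(
    Name="ODataGlueSecret",
    SecretString=json.dumps({
        "basicAuthUsername": "<your SAP username>",
        "basicAuthPassword": "<your SAP username password>",
        "basicAuthDisableSSO": "True",
        "customAuthenticationType": "CustomBasicAuth",
    }),
)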

Create an ETL job to ingest data from SAP

In the AWS Glue console, create a new Visual Editor AWS Glue job.

  1. Go to the AWS Glue console.
  2. In the navigation pane under ETL Jobs choose Visual ETL.
  3. Choose Visual ETL to create a job in the Visual Editor.
  4. For this post, edit the default name to be Material Master Job and choose Save.

On your Visual Editor canvas, select your SAP sources.

  1. Choose the Visual tab, then choose the plus sign to open the Add nodes menu. Search for SAP and add the SAP OData Source.
  2. Choose the node you just added and name it Material Master Attributes.
    1. For SAP OData connection, select the GlueSAPOData connection.
    2. Select the material attributes service and entity set from your SAP source.
    3. For Entity Name and Sub Entity Name, select the SAP OData entity from your SAP source.
    4. From Fields, select Material, Created on, Material Group, Material Type, Old Matl number, GLUE_FETCH_SQ, DELTA_TOKEN, and DML_STATUS.
    5. Enter limit 100 in the Filter section to limit the data at design time.

Note that this service supports delta extraction, so Incremental transfer is the default selected option.

After the AWS Glue service role details have been chosen, the data preview is available. You can adjust the preview to include the three newly available fields:

  • glue_fetch_sq: A sequence field, generated from the epoch timestamp in the order the record was received, and unique for each record. Use it if you need to establish the order of changes in the source system.
  • delta_token: Blank for all records except the last record passed, which contains the value of the ODQ token used to capture changed records (CDC). This record is not a transactional record from the source; it exists only to pass the delta token value.
  • dml_status: Shows UPDATED for all newly inserted and updated records from the source and DELETED for records that have been deleted from the source.

For delta-enabled extraction, the last record passed will have its delta_token field populated, as described above.
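
Given these semantics, separating the control record from the transactional records in PySpark is straightforward. The following is a minimal sketch; material_master_dyf stands in for the DynamicFrame returned by the SAP OData source node (the variable name is illustrative):

from pyspark.sql import functions as F

# material_master_dyf is the DynamicFrame from the SAP OData source node
df = material_master_dyf.toDF()

# Transactional records: everything except the control record carrying the token
data_df = df.filter((F.col("delta_token").isNull()) | (F.col("delta_token") == ""))

# The single control record holds the ODQ delta token for the next incremental run
token_rows = df.filter(F.col("delta_token").isNotNull() & (F.col("delta_token") != "")) \
    .select("delta_token").collect()
next_delta_token = token_rows[0]["delta_token"] if token_rows else None

# glue_fetch_sq establishes the order of changes; dml_status flags UPDATED vs. DELETED
ordered_changes = data_df.orderBy("glue_fetch_sq")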

  1. Add another SAP OData source connection to your canvas, and name this node Material Group Text.
    1. Select the material group service and entity set from your SAP source.
    2. For Entity Name and Sub Entity Name, select the SAP OData entity from your SAP source.

Note that this service supports full extraction, so Full transfer is the default selected option. You can also preview this dataset.

  1. When previewing the data, notice the language key. SAP passes all languages, so add a filter of SPRAS = 'E' to extract only English. Note that this uses the SAP internal value of the field.
  2. Add a Change Schema transform node to the canvas, after Material Group Text.
    • Rename the material group field in the target key to matkl2, so it is different from your first source.
    • Under Drop, select spras, odq_changemode, odq_entitycntr, dml_status, delta_token, and glue_fetch_sq.

  3. Add a Join transform to your canvas, bringing together both source datasets.
    1. Ensure the node parents Material Master Attributes and Change Schema have both been chosen.
    2. Select the Join type of Left join.
    3. Select the join conditions as the key fields from each source:
      • Under Material Master Attributes, select matkl.
      • Under Change Schema, select matkl2.

You can preview the output to ensure the correct data is being returned. Now, you are ready to store the result.

  1. Add the Amazon S3 bucket target to your canvas.
    1. Ensure the node parent is Join.
    2. For Format, select Parquet.
    3. For S3 Target Location, browse to the S3 bucket you created in the prerequisites and add materialmaster/ to the S3 target location.
    4. For the Data Catalog update options, select Create a table in the Data Catalog and on subsequent runs, update the schema and add new partitions.
    5. For Database, select the name of the AWS Glue database created earlier, sapgluedatabase.
    6. For Table name, enter materialmaster.
  2. Choose Save to save your job. Your job should look like the following figure.

Clone your ETL job and make it incremental

After your ETL job has been created, it’s ready to clone and include incremental data handling using delta tokens.

To do this, you will need to modify the job script directly. You will modify the script to add a statement that retrieves the last delta token (stored in the job tags) and adds the delta token value to the request (or execution of the job), which enables the delta-enabled SAP OData service to return only changed data on the next job run.

The first execution of the job will not have a delta token value in the tags; therefore, the call will run as an initial load, and the delta token will subsequently be stored in the tags for future executions.
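
The sap_odata_state_management library you add in the following steps encapsulates this logic for you. Conceptually, the round trip resembles the following boto3 sketch; the delta_token tag key and the region and account placeholders are illustrative, not necessarily the library's actual implementation:

import boto3

glue = boto3.client("glue")
job_arn = "arn:aws:glue:<region>:<account-id>:job/Material Master Job Delta"

# Read the token stored by the previous run (absent on the very first run)
tags = glue.get_tags(ResourceArn=job_arn)["Tags"]
delta_token = tags.get("delta_token")  # tag key assumed for illustration

# ... run the extraction with the token, then persist the new token for the next run
glue.tag_resource(ResourceArn=job_arn, TagsToAdd={"delta_token": "<token from last record>"})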

  1. Go to the AWS Glue console.
  2. In the navigation pane under ETL Jobs choose Visual ETL.
  3. Select the Material Master Job, choose Actions and select Clone job.
  4. Change the name of the job to Material Master Job Delta, then choose the Script tab.
  5. You need to add an additional Python library that takes care of storing and retrieving the delta tokens for each job execution. To do this, navigate to the Job details tab, scroll down, and expand the Advanced properties section. In the Python library path field, add the following path:
    s3://aws-blogs-artifacts-public/artifacts/BDB-4789/sap_odata_state_management.zip

  6. Now choose the Script tab and choose Edit script in the top right corner. Choose Confirm to confirm that your job will be script-only.

Apply the following changes to the script to enable the delta token.

  7. Import the SAP OData state management library classes you added in step 5 above by adding the following code at line 8:
    from sap_odata_state_management.state_manager import StateManagerFactory, StateManagerType, StateType

  8. The next few steps retrieve and persist the delta token in the job tags so it can be accessed by the subsequent job execution. The delta token is added to the request to the SAP source so that incremental changes are extracted. If no token is passed, the load runs as an initial load, and the token is persisted for the next run, which will then be a delta load. To initialize the sap_odata_state_management library, extract the connection options into a variable and update them using the state manager. Do this by adding the following code at line 16 (after the job.init statement).

You can find the <key of MaterialMasterAttributes node> and the <entityName for Material Attribute> in the existing generated script under # Script generated for node Material Master Attributes. Be sure to replace them with the appropriate values.

key = "<key of MaterialMasterAttributes node>"
state_manager = StateManagerFactory.create_manager(
    manager_type=StateManagerType.JOB_TAG, state_type=StateType.DELTA_TOKEN, options={"job_name": args['JOB_NAME'], "logger": glueContext.get_logger()}
)
options = {
    "connectionName": "GlueSAPOData",
    "entityName": "<entityName for Material Attribute>",
    "ENABLE_CDC": "true"
}
connector_options = state_manager.get_connector_options(key)
options.update(connector_options)
  9. Comment out the existing script generated for node Material Master Attributes by adding a #, and add the following replacement snippet:
    <key of MaterialMasterAttributes node> = glueContext.create_dynamic_frame.from_options(connection_type="sapodata", connection_options=options, transformation_ctx="<key of MaterialMasterAttributes node>")
  10. To extract the delta token from the dynamic frame and persist it in the job tags, add the following code snippet just above the last line in your script (before job.commit()):
    state_manager.update_state(key, <key of MaterialMasterAttributes node>.toDF())

This is what your final script should look like:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from sap_odata_state_management.state_manager import StateManagerFactory, StateManagerType, StateType

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

key = "MaterialMasterAttributes_node1730873953236"
state_manager = StateManagerFactory.create_manager(
    manager_type=StateManagerType.JOB_TAG, state_type=StateType.DELTA_TOKEN, options={"job_name": args['JOB_NAME'], "logger": glueContext.get_logger()}
)
options = {
    "connectionName": "GlueSAPOData",
    "entityName": "/sap/opu/odata/sap/ZMATERIAL_ATTR_SRV/EntityOf0MATERIAL_ATTR",
    "ENABLE_CDC": "true"
}

# Script generated for node Material Group Text
MaterialGroupText_node1730874412841 = glueContext.create_dynamic_frame.from_options(connection_type="sapodata", connection_options={"ENABLE_CDC": "false", "connectionName": "GlueSAPOData", "FILTER_PREDICATE": "SPRAS = 'E'", "ENTITY_NAME": "/sap/opu/odata/sap/ZMATL_GROUP_SRV/EntityOf0MATL_GROUP_TEXT"}, transformation_ctx="MaterialGroupText_node1730874412841")

# Script generated for node Material Master Attributes
#MaterialMasterAttributes_node1730873953236 = glueContext.create_dynamic_frame.from_options(connection_type="sapodata", connection_options={"ENABLE_CDC": "true", "connectionName": "GlueSAPOdata", "FILTER_PREDICATE": "limit 100", "SELECTED_FIELDS": "MATNR,MTART,MATKL,BISMT,ERSDA,DML_STATUS,DELTA_TOKEN,GLUE_FETCH_SQ", "ENTITY_NAME": "/sap/opu/odata/sap/ZMATERIAL_ATTR_SRV/EntityOf0MATERIAL_ATTR"}, transformation_ctx="MaterialMasterAttributes_node1732755261264")
MaterialMasterAttributes_node1730873953236 = glueContext.create_dynamic_frame.from_options(connection_type="sapodata", connection_options=options, transformation_ctx="MaterialMasterAttributes_node1730873953236")

# Script generated for node Change Schema
ChangeSchema_node1730875214894 = ApplyMapping.apply(frame=MaterialGroupText_node1730874412841, mappings=[("matkl", "string", "matkl2", "string"), ("txtsh", "string", "txtsh", "string")], transformation_ctx="ChangeSchema_node1730875214894")

# Script generated for node Join
MaterialMasterAttributes_node1730873953236DF = MaterialMasterAttributes_node1730873953236.toDF()
ChangeSchema_node1730875214894DF = ChangeSchema_node1730875214894.toDF()
Join_node1730874996674 = DynamicFrame.fromDF(MaterialMasterAttributes_node1730873953236DF.join(ChangeSchema_node1730875214894DF, (MaterialMasterAttributes_node1730873953236DF['matkl'] == ChangeSchema_node1730875214894DF['matkl2']), "left"), glueContext, "Join_node1730874996674")

# Script generated for node Amazon S3
AmazonS3_node1730875848117 = glueContext.getSink(path="s3://sapglueodatabucket/materialmaster/", connection_type="s3", updateBehavior="UPDATE_IN_DATABASE", partitionKeys=[], compression="snappy", enableUpdateCatalog=True, transformation_ctx="AmazonS3_node1730875848117")
AmazonS3_node1730875848117.setCatalogInfo(catalogDatabase="sapgluedatabase", catalogTableName="materialmaster")
AmazonS3_node1730875848117.setFormat("glueparquet")
AmazonS3_node1730875848117.writeFrame(Join_node1730874996674)
state_manager.update_state(key, MaterialMasterAttributes_node1730873953236.toDF())
job.commit()
  1. Choose Save to save your changes.
  2. Choose Run to run your job. Note that there are currently no tags in your job details.
  3. Wait for your job run to be successfully completed. You can see the status on the Runs tab.
  4. After your job run is complete, you will notice on the Job Details tab that a tag has been added. The next job run will read this token and run a delta load.
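
If you prefer to verify the tag programmatically rather than in the console, the following is a minimal boto3 sketch (the region and account ID are placeholders):

import boto3

glue = boto3.client("glue")
response = glue.get_tags(
    ResourceArn="arn:aws:glue:<region>:<account-id>:job/Material Master Job Delta"
)
print(response["Tags"])  # should now include the stored delta token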

Query your SAP data source data

The AWS Glue job run has created an entry in the Data Catalog, enabling you to query the data immediately.

  1. Go to the Amazon Athena console.
  2. Choose Launch Query Editor.
  3. Make sure you have an appropriate workgroup assigned, or create a workgroup if required.
  4. Select the sapgluedatabase and run a query (such as the following) to start analyzing your data.
    select matkl, txtsh, count(*)
    from materialmaster
    group by 1, 2
    order by 1, 2;
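
You can also submit the same query programmatically. The following is a minimal boto3 sketch; the S3 output location is an assumption, and you can omit ResultConfiguration if your workgroup already defines one:

import boto3

athena = boto3.client("athena")

query = """
SELECT matkl, txtsh, COUNT(*)
FROM materialmaster
GROUP BY 1, 2
ORDER BY 1, 2
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "sapgluedatabase"},
    ResultConfiguration={"OutputLocation": "s3://<S3-BUCKET-NAME>/athena-results/"},
)
print(execution["QueryExecutionId"])  # poll get_query_execution for completion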

Clean up

To avoid incurring charges, clean up the resources used in this post from your AWS account, including the AWS Glue jobs, SAP OData connection, Glue Data Catalog entry, Secrets Manager secret, IAM role, the contents of the S3 bucket, and the S3 bucket.
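
If you scripted the setup, you can script the teardown as well. The following is a hedged boto3 sketch using the resource names from this post; the inline policy name matches the earlier IAM sketch, so adjust it to whatever you actually created:

import boto3

glue = boto3.client("glue")
s3 = boto3.resource("s3")
secrets = boto3.client("secretsmanager")
iam = boto3.client("iam")

# Delete the Glue jobs, connection, and Data Catalog database (drops its tables)
for job_name in ["Material Master Job", "Material Master Job Delta"]:
    glue.delete_job(JobName=job_name)
glue.delete_connection(ConnectionName="GlueSAPOData")
glue.delete_database(Name="sapgluedatabase")

# Delete the secret, then empty and remove the bucket (versioned buckets also
# need object versions deleted)
secrets.delete_secret(SecretId="ODataGlueSecret", ForceDeleteWithoutRecovery=True)
bucket = s3.Bucket("<S3-BUCKET-NAME>")
bucket.objects.all().delete()
bucket.delete()

# Detach policies before removing the IAM role
iam.delete_role_policy(RoleName="GlueServiceRoleforSAP", PolicyName="SAPODataS3Access")
iam.detach_role_policy(
    RoleName="GlueServiceRoleforSAP",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)
iam.delete_role(RoleName="GlueServiceRoleforSAP")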

Conclusion

In this post, we showed you how to create a serverless incremental data load process for multiple SAP data sources. The approach used AWS Glue to incrementally load the data from a SAP source using SAP ODP delta tokens and then load the data into Amazon S3.

The serverless nature of AWS Glue means that there is no infrastructure to manage, and you pay only for the resources consumed while your jobs are running (plus storage costs for outputs). As organizations increasingly become data driven, this SAP connector provides an efficient, cost-effective, performant, and secure way to include SAP source data in your big data and analytics outcomes. For more information, see AWS Glue.


About the authors

Allison Quinn is a Sr. ANZ Analytics Specialist Solutions Architect for Data and AI based in Melbourne, Australia, working closely with Financial Services customers in the region. Allison worked for over 15 years with SAP products before concentrating her analytics technical specialty on AWS native services. She's very passionate about all things data and about democratizing it so that customers of all types can drive business benefit.

Pavol is an Innovation Solution Architect at AWS, specializing in SAP cloud adoption across EMEA. With over 20 years of experience, he helps global customers migrate and optimize SAP systems on AWS. Pavol develops tailored strategies to transition SAP environments to the cloud, leveraging AWS’s agility, resiliency, and performance. He assists clients in modernizing their SAP landscapes using AWS’s AI/ML, data analytics, and application services to enhance intelligence, automation, and performance.

Partha Pratim Sanyal is a Software Development Engineer with AWS Glue in Vancouver, Canada, specializing in Data Integration, Analytics, and Connectivity. With extensive backend development expertise, he is dedicated to crafting impactful, customer-centric solutions. His work focuses on building features that empower users to effortlessly analyze and understand their data. Partha’s commitment to addressing complex user needs drives him to create intuitive and value-driven experiences that elevate data accessibility and insights for customers.

Diego is an experienced Enterprise Solutions Architect with over 20 years' experience across SAP technologies, specializing in SAP innovation and data and analytics. He has worked both as a partner and as a customer, giving him a complete perspective on what it takes to sell, implement, and run systems and organizations. He is passionate about technology and innovation, focusing on customer outcomes and delivering business value.

Luis Alberto Herrera Gomez is a Software Development Engineer with AWS Glue in Vancouver, specializing in backend engineering, microservices, and cloud computing. With 7-8 years of experience, including roles as a backend and full-stack developer for multiple startups before joining Amazon and AWS, Luis focuses on developing scalable and efficient cloud-based applications. His expertise in AWS technologies enables him to design high-performance systems that handle complex data processing tasks. Luis is passionate about leveraging cloud computing to solve challenging business problems.