AWS Big Data Blog

Control your AWS Glue Studio development interface with AWS Glue job mode API property

In recent years, as the importance of big data has grown, efficient data processing and analysis have become crucial factors in determining a company’s competitiveness. AWS Glue, a serverless data integration service for integrating data across multiple data sources at scale, addresses these data processing needs. Among its features, the AWS Glue Jobs API stands out as a particularly noteworthy tool.

The AWS Glue Jobs API is a robust interface that allows data engineers and developers to programmatically manage and run ETL jobs. By using this API, it becomes possible to automate, schedule, and monitor data pipelines, enabling efficient operation of large-scale data processing tasks.

To improve customer experience with the AWS Glue Jobs API, we added a new property describing the job mode corresponding to script, visual, or notebook. In this post, we explore how the updated AWS Glue Jobs API works in depth and demonstrate the new experience with the updated API.

JobMode property

A new property JobMode describes the mode of AWS Glue jobs (script, visual, or notebook) to improve your UI experience. AWS Glue users can use the mode that best fits your preference. Some extract, transform, and load (ETL) developers prefer to use visual mode and create visual jobs using AWS Glue Studio visual editor. Some data scientists prefer to use notebooks jobs and use AWS Glue Studio notebooks. Some data engineers and developers prefer to implement script through the AWS Glue Studio script editor or preferred integrated development environment (IDE). After the job is created with the preferred mode, you can search for it by filtering on the job mode within your saved AWS Glue jobs page and find it easily. Additionally, if you are migrating existing iPython notebook files to AWS Glue Studio notebook jobs, you can now choose and set the job mode and do so for multiple jobs using this new API property, as demonstrated in this post.

How CreateJob API works with the new JobMode property

You can use CreateJob API to create AWS Glue script or a visual or notebook job. The following is an example of how it works for a visual job using AWS SDK for Python (Boto3): (replace <your-bucket-name> with your S3 bucket)

CODE_GEN_JSON_STR = '''
{
  "node-1": {
    "S3ParquetSource": {
      "Name": "Amazon S3",
      "Paths": [
        "s3://aws-bigdata-blog/generated_synthetic_reviews/data/product_category=Books/"
      ],
      "Exclusions": [],
      "Recurse": true,
      "AdditionalOptions": {
        "EnableSamplePath": false,
        "SamplePath": "s3://aws-bigdata-blog/generated_synthetic_reviews/data/product_category=Books/73612da260b94159b705cf4df12364cb_0.snappy.parquet"
      },
      "OutputSchemas": [
        {
          "Columns": [
            {
              "Name": "marketplace",
              "Type": "string"
            },
            {
              "Name": "customer_id",
              "Type": "string"
            },
            {
              "Name": "review_id",
              "Type": "string"
            },
            {
              "Name": "product_id",
              "Type": "string"
            },
            {
              "Name": "product_title",
              "Type": "string"
            },
            {
              "Name": "star_rating",
              "Type": "bigint"
            },
            {
              "Name": "helpful_votes",
              "Type": "bigint"
            },
            {
              "Name": "total_votes",
              "Type": "bigint"
            },
            {
              "Name": "insight",
              "Type": "string"
            },
            {
              "Name": "review_headline",
              "Type": "string"
            },
            {
              "Name": "review_body",
              "Type": "string"
            },
            {
              "Name": "review_date",
              "Type": "timestamp"
            },
            {
              "Name": "review_year",
              "Type": "bigint"
            }
          ]
        }
      ]
    }
  },
  "node-2": {
    "DropFields": {
      "Name": "Drop Fields",
      "Inputs": [
        "node-1"
      ],
      "Paths": [
        [
          "review_headline"
        ],
        [
          "review_body"
        ],
        [
          "review_date"
        ]
      ]
    }
  },
  "node-3": {
    "S3DirectTarget": {
      "Name": "Amazon S3",
      "Inputs": [
        "node-2"
      ],
      "PartitionKeys": [],
      "Path": "s3://<your-bucket-name>/data/jobmode-blog/output/parquet/",
      "Compression": "snappy",
      "Format": "parquet",
      "SchemaChangePolicy": {
        "EnableUpdateCatalog": false
      }
    }
  }
}
'''

glue_client = boto3.client('glue')
codeGenJson = json.loads(constants.CODE_GEN_JSON_STR, strict=False)

# Call the create_job method
try:
    glue_client.create_job(
        Name="glue-visual-job",
        Description="Glue Visual ETL job",
        Command={'Name': 'glueetl', 'ScriptLocation': "s3://aws-glue-assets-<account-id>-<region>/scripts/glue-visual-job", 'PythonVersion': "3"},
        WorkerType=constants.WORKERTYPE,
        NumberOfWorkers="G.1X",
        Role=<role-arn>,  
        GlueVersion="4.0",        
        CodeGenConfigurationNodes=codeGenJson,
        JobMode="VISUAL"
    )
    print("Successfully created Glue job")
except Exception as e:
    print(f"Error creating Glue job: {str(e)}")

CODE_GEN_JSON_STR represents the visual nodes for the AWS Glue Job. There are three nodes: node-1 uses S3 source, node-2 does transformation, and node-3 uses S3 target. The script instantiates the AWS Glue Boto3 client, loads the JSON, and calls the create_job. JobMode is set to VISUAL.

After you run the Python script, a new job is created. The following screenshot shows how the created job looks in AWS Glue visual editor.

There are three nodes in the visual directed acyclic graph (DAG): node 1 sources product review data for the product_category book from the public S3 bucket, node-2 drops some of the fields that aren’t needed for downstream systems, and node-3 persists the transformed data in a local S3 bucket.

How CloudFormation works with the new JobMode property

You can use AWS CloudFormation to create different types of AWS Glue jobs by specifying the JobMode parameter with the AWS::Glue::Job resource. The supported job modes include:

  • SCRIPT
  • VISUAL
  • NOTEBOOK

In this example, you create a AWS Glue notebook job using AWS CloudFormation, which requires setting the JobMode parameter to NOTEBOOK.

  1. Create a Jupyter Notebook file containing your logic and code, and save the notebook file with a descriptive name, such as my-glue-notebook.ipynb. Alternatively you can download the notebook file, and rename it to my-glue-notebook.ipynb.
  2. Upload the Notebook file to the notebooks/ folder within the aws-glue-assets-<account-id>-<region> S3 bucket.
  3. Create a new CloudFormation template to create a new AWS Glue job, specifying the NotebookJobName parameter as the same name as the Notebook file. Here’s the sample snippet of CloudFormation template:
    AWSTemplateFormatVersion: '2010-09-09'
    Description: CloudFormation template for creating an AWS Glue ETL job using a Jupyter Notebook
    
    Parameters:
      NotebookJobName:
        Type: String
        Description: Name of the AWS Glue ETL Notebook job
    
    Resources:
      GlueJobRole:
        Type: AWS::IAM::Role
        Properties:
          RoleName: !Sub ${AWS::StackName}-GlueJobRole
          AssumeRolePolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Principal:
                  Service:
                    - glue.amazonaws.com
                Action:
                  - sts:AssumeRole
          ManagedPolicyArns:
            - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
          Policies:
            - PolicyName: GlueJobS3Access
              PolicyDocument:
                Version: '2012-10-17'
                Statement:
                  - Effect: Allow
                    Action:
                      - iam:PassRole
                    Resource:
                      - !Sub arn:aws:iam::${AWS::AccountId}:role/${AWS::StackName}-GlueJobRole
    
      ETLNotebookJob:
        Type: AWS::Glue::Job
        Properties:
          Name: !Ref NotebookJobName
          Description: ETL job using a Jupyter Notebook
          Role: !GetAtt GlueJobRole.Arn
          Command:
            Name: glueetl
            PythonVersion: '3'
            ScriptLocation: !Sub s3://aws-glue-assets-${AWS::AccountId}-${AWS::Region}/scripts/${NotebookJobName}.py
          DefaultArguments:
            '--job-bookmark-option': job-bookmark-enable
          JobMode: NOTEBOOK
    
    Outputs:
      ETLNotebookJobName:
        Value: !Ref ETLNotebookJob
        Description: Name of the ETL Notebook job
  4. Deploy the CloudFormation template. For NotebookJobName, enter same name as the notebook file.
  5. Verify that the AWS Glue job you created is listed and that it has the name you specified in the CloudFormation template.

AWS Glue notebook shows the Notebook job that contains the existing cells that you had in the ipynb file. You can review the job details to confirm it’s configured correctly.

Console experience

On the AWS Glue console, in the navigation pane, choose ETL Jobs to observe all your ETL jobs listed. Here you have different columns Job name, Type, Created by, Last modified, and AWS Glue version. You can sort and filter by these columns. The following screenshot shows how it looks.

We also enhanced the console experience with the JobMode introduction. The Created by column on the console gives you information about JobMode of the job. You can filter access jobs created by VISUAL, NOTEBOOK, or SCRIPT, as shown in the following screenshot.

This new console experience helps you search and discover your jobs based on JobMode.

Conclusion

This post demonstrated how AWS Glue Job API works with the newly introduced job mode property. With the new property, you can explicitly choose the mode of each job. The steps instructed detailed usage in API, AWS SDK, and CloudFormation. Additionally, the property makes it straightforward to search and discover your jobs quickly on the AWS Glue console.


About the Authors

Shovan Kanjilal is a Senior Analytics and Machine Learning Architect with Amazon Web Services. He is passionate about helping customers build scalable, secure, and high-performance data solutions in the cloud.

Manoj Shunmugam is a DevOps Consultant in Professional Services at Amazon Web Services. He works with customers to establish infrastructures using cloud-centered and/or container-based platforms in the AWS Cloud.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.

Gal HeyneGal Heyne is a Product Manager for AWS Glue with a strong focus on AI/ML, data engineering, and BI. She is passionate about developing a deep understanding of customers’ business needs and collaborating with engineers to design easy-to-use data products.