AWS Big Data Blog

Introducing self-managed data sources for Amazon OpenSearch Ingestion

Enterprise customers increasingly adopt Amazon OpenSearch Ingestion (OSI) to bring data into Amazon OpenSearch Service for various use cases. These include petabyte-scale log analytics, real-time streaming, security analytics, and searching semi-structured key-value or document data. OSI makes it simple, with straightforward integrations, to ingest data from many AWS services, including Amazon DynamoDB, Amazon Simple Storage Service (Amazon S3), Amazon Managed Streaming for Apache Kafka (Amazon MSK), and Amazon DocumentDB (with MongoDB compatibility).

Today we are announcing support for ingesting data from self-managed OpenSearch or self-managed Elasticsearch and Apache Kafka clusters. These sources can either be on Amazon Elastic Compute Cloud (Amazon EC2) or on-premises environments.

In this post, we outline the steps to get started with these sources.

Solution overview

OSI supports the AWS Cloud Development Kit (AWS CDK), AWS CloudFormation, the AWS Command Line Interface (AWS CLI), Terraform, AWS APIs, and the AWS Management Console to deploy pipelines. In this post, we use the console to demonstrate how to create a self-managed Kafka pipeline.

Prerequisites

To make sure OSI can connect and read data successfully, the following conditions should be met:

  • Network connectivity to data sources – OSI is generally deployed in a public network, such as the internet, or in a virtual private cloud (VPC). OSI deployed in a customer VPC is able to access data sources in the same or different VPC and on the internet with an attached internet gateway. If your data sources are in another VPC, common methods for network connectivity include direct VPC peering, using a transit gateway, or using customer managed VPC endpoints powered by AWS PrivateLink. If your data sources are on your corporate data center or other on-premises environment, common methods for network connectivity include AWS Direct Connect and using a network hub like a transit gateway. The following diagram shows a sample configuration of OSI running in a VPC and using Amazon OpenSearch Service as a sink. OSI runs in a service VPC and creates an Elastic Network interface (ENI) in the customer VPC. For self-managed data source these ENIs are used for reading data from on-premises environment. OSI creates an VPC endpoint in the service VPC to send data to the sink.
  • Name resolution for data sources – OSI uses an Amazon Route 53 resolver. This resolver automatically answers queries to names local to a VPC, public domain names on the internet, and records hosted in private hosted zones. If you’re are using a private hosted zone, make sure you have a DHCP option set enabled, attached to the VPC using AmazonProvidedDNS as domain name server. For more information, see Work with DHCP option sets. Additionally, you can use resolver inbound and outbound endpoints if you need a complex resolution schemes with conditions that are beyond a simple private hosted zone.
  • Certificate verification for data source names – OSI supports only SASL_SSL for transport for Apache Kafka source. Within SASL, Amazon OpenSearch Service supports most authentication mechanisms like PLAIN, SCRAM, IAM, GSAPI and others. When using SASL_SSL, make sure you have access to certificates needed for OSI to authenticate. For self-managed OpenSearch data sources, make sure verifiable certificates are installed on the clusters. Amazon OpenSearch Service doesn’t support insecure communication between OSI and OpenSearch. Certificate verification cannot be turned off. In particular, the “insecure” configuration option is not supported.
  • Access to AWS Secrets Manager – OSI uses AWS Secrets Manager to retrieve credentials and certificates needed to communicate with self-managed data sources. For more information, see Create and manage secrets with AWS Secrets Manager.
  • IAM role for pipelines – You need an AWS Identity and Access Management (IAM) pipeline role to write to data sinks. For more information, see Identity and Access Management for Amazon OpenSearch Ingestion.

Create a pipeline with self-managed Kafka as a source

After you complete the prerequisites, you’re ready to create a pipeline for your data source. Complete the following steps:

  1. On the OpenSearch Service console, choose Pipelines under Ingestion in the navigation pane.
  2. Choose Create pipeline.
  3. Choose Streaming under Use case in the navigation pane.
  4. Select Self managed Apache Kafka under Ingestion pipeline blueprints and choose Select blueprint.

This will populate a sample configuration for this pipeline.

  1. Provide a name for this pipeline and choose the appropriate pipeline capacity.
  2. Under Pipeline configuration, provide your pipeline configuration in YAML format. The following code snippet shows sample configuration in YAML for SASL_SSL authentication:
    version: '2'
    kafka-pipeline:
      source:
        kafka:
          acknowledgments: true
          bootstrap_servers:
            - 'node-0.example.com:9092'
          encryption:
            type: "ssl"
            certificate: '${{aws_secrets:kafka-cert}}'
            
          authentication:
            sasl:
              plain:
                username: '${{aws_secrets:secrets:username}}'
                password: '${{aws_secrets:secrets:password}}'
          topics:
            - name: on-prem-topic
              group_id: osi-group-1
      processor:
        - grok:
            match:
              message:
                - '%{COMMONAPACHELOG}'
        - date:
            destination: '@timestamp'
            from_time_received: true
      sink:
        - opensearch:
            hosts: ["https://search-domain-12345567890.us-east-1.es.amazonaws.com"]
            aws:
              region: us-east-1
              sts_role_arn: 'arn:aws:iam::123456789012:role/pipeline-role'
            index: "on-prem-kakfa-index"
    extension:
      aws:
        secrets:
          kafka-cert:
            secret_id: kafka-cert
            region: us-east-1
            sts_role_arn: 'arn:aws:iam::123456789012:role/pipeline-role'
          secrets:
            secret_id: secrets
            region: us-east-1
            sts_role_arn: 'arn:aws:iam::123456789012:role/pipeline-role'
  1. Choose Validate pipeline and confirm there are no errors.
  2. Under Network configuration, choose Public access or VPC access. (For this post, we choose VPC access).
  3. If you chose VPC access, specify your VPC, subnets, and an appropriate security group so OSI can reach the outgoing ports for the data source.
  4. Under VPC attachment options, select Attach to VPC and choose an appropriate CIDR range.

OSI resources are created in a service VPC managed by AWS that is separate from the VPC you chose in the last step. This selection allows you to configure what CIDR ranges OSI should use inside this service VPC. The choice exists so you can make sure there is no address collision between CIDR ranges in your VPC that is attached to your on-premises network and this service VPC. Many pipelines in your account can share same CIDR ranges for this service VPC.

  1. Specify any optional tags and log publishing options, then choose Next.
  2. Review the configuration and choose Create pipeline.

You can monitor the pipeline creation and any log messages in the Amazon CloudWatch Logs log group you specified. Your pipeline should now be successfully created. For more information about how to provision capacity for the performance of this pipeline, see the section Recommended Compute Units (OCUs) for the MSK pipeline in Introducing Amazon MSK as a source for Amazon OpenSearch Ingestion.

Create a pipeline with self-managed OpenSearch as a source

The steps for creating a pipeline for self-managed OpenSearch are similar to the steps for creating one for Kafka. During the blueprint selection, choose Data Migration under Use case and select Self managed OpenSearch/Elasticsearch. OpenSearch Ingestion can source data from all versions of OpenSearch and Elasticsearch from version 7.0  to  version 7.10.

The following blueprint shows a sample configuration YAML for this data source:

version: "2"
opensearch-migration-pipeline:
  source:
    opensearch:
      acknowledgments: true
      hosts: [ "https://node-0.example.com:9200" ]
      username: "${{aws_secrets:secret:username}}"
      password: "${{aws_secrets:secret:password}}"
      indices:
        include:
        - index_name_regex: "opensearch_dashboards_sample_data*"
        exclude:
          - index_name_regex: '\..*'
  sink:
    - opensearch:
        hosts: [ "https://search-domain-12345567890.us-east-1.es.amazonaws.com" ]
        aws:
          sts_role_arn: "arn:aws:iam::123456789012:role/pipeline-role"
          region: "us-east-1"
        index: "on-prem-os"
extension:
  aws:
    secrets:
      secret:
        secret_id: "self-managed-os-credentials"
        region: "us-east-1"
        sts_role_arn: "arn:aws:iam::123456789012:role/pipeline-role"
        refresh_interval: PT1H

Considerations for self-managed OpenSearch data source

Certificates installed on the OpenSearch cluster need to be verifiable for OSI to connect to this data source before reading data. Insecure connections are currently not supported.

After you’re connected, make sure the cluster has sufficient read bandwidth to allow for OSI to read data. Use the Min and Max OCU setting to limit OSI read bandwidth consumption. Your read bandwidth will vary depending upon data volume, number of indexes, and provisioned OCU capacity. Start small and increase the number of OCUs to balance between available bandwidth and acceptable migration time.

This source is typically meant for one-time migration of data and not as continuous ingestion to keep data in sync between data sources and sinks.

OpenSearch Service domains support remote reindexing, but that consumes resources in your domains. Using OSI will move this compute out of the domain, and OSI can achieve significantly higher bandwidth than remote reindexing, thereby resulting in faster migration times.

OSI doesn’t support deferred replay or traffic recording today; refer to Migration Assistant for Amazon OpenSearch Service if your migration needs those capabilities.

Conclusion

In this post, we introduced self-managed sources for OpenSearch Ingestion that enable you to ingest data from corporate data centers or other on-premises environments. OSI also supports various other data sources and integrations. Refer to Working with Amazon OpenSearch Ingestion pipeline integrations to learn about these other data sources.

OpenSearch includes certain Apache-licensed Elasticsearch code from Elasticsearch B.V. and other source code. Elasticsearch B.V. is not the source of that other source code. ELASTICSEARCH is a registered trademark of Elasticsearch


About the Authors

Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.

Arjun Nambiar is a Product Manager with Amazon OpenSearch Service. He focuses on ingestion technologies that enable ingesting data from a wide variety of sources into Amazon OpenSearch Service at scale. Arjun is interested in large-scale distributed systems and cloud-centered technologies, and is based out of Seattle, Washington.