A guide to data, analytics, and machine learning (ML) tools for AWS
From generative AI to data lakes, visualizations to transformation, this guide will help you choose the right data tools for the job
At AWS Marketplace, we’re working hard to help AWS builders (developers, data scientists, engineers, architects, and anybody building for the cloud) navigate the complex and highly diverse landscape of tools available from AWS and AWS Partners to get work done.
This guide breaks the broad arena of data, analytics, and ML tools into specific capability areas, discusses each area’s challenges, and highlights one or more tools that are optimal for achieving the right outcome. This guide is developed by builders, for builders. We won’t cover every AWS and AWS Partner tool that’s available, but we’ll highlight the key tools for each use case that are easy to get building with: they’re free to try with your AWS account and have pay-as-you-go pricing consolidated through AWS billing.
How to choose the right data-driven solution for AWS
The accelerated velocity of software production and the near-infinite scale provided by the cloud, among other factors, have resulted in massive amounts of data. That comes with many challenges, from storing and moving it around, to making sense of it and, more recently, being able to leverage it to build and train artificial intelligence (AI) algorithms.
Given this broad range of requirements and diverse lifecycle, data and analytics can mean different things to different people, depending on how they approach and interface with data.
The industry has taken those challenges head on, and a whole new breed of tools has been developed to make working with and making sense of data intuitive, efficient, and capable of handling unprecedented scale. With that variety comes the challenge of choice—finding the right tool for the job can sometimes feel like a daunting task. With this guide, we’re looking to share our views as builders to make your selection process simpler.
Storage
Challenges and Requirements: Storing data is a complex problem driven by multiple factors that define the capabilities and features a given tool must offer to be a solid candidate for a production-grade deployment. These factors include data volume, rate of data change, restrictions around location and use of the data, and the performance necessary to satisfy specific use cases, such as an interactive user experience.
With this in mind, let’s look at the tools and AWS services that the AWS Marketplace developer relations team finds optimal for the following categories:
Compliance and sovereignty
One of the biggest challenges to compliance is having clarity about which regulations you must abide by, and monitoring that your data, depending on its storage location and irrespective of its rate of change, continuously fulfills those requirements. Additionally, eliminating the risk of unauthorized access and enforcing strong data lifecycle policies are indispensable requirements for guaranteeing data compliance.
In terms of sovereignty, the key requirement is ensuring that the geographical residency of your data is enforced and automated.
Key tools for the job
How it works
Data residency in Amazon S3
- Amazon S3 allows for Region-bound buckets, which ensure all data stored in that bucket will reside in a specific geographical location. The cool thing about using Amazon S3 as a landing location for the bulk of your data is its flexibility towards transforming and pushing that data out to other systems. You can use Amazon S3 as the landing zone while leveraging the data residency capabilities of other services such as Amazon Relational Database Service (Amazon RDS) and Amazon Elastic Compute Cloud (Amazon EC2), which lie at the core of most applications today, including those running in container orchestrated environments. If you add AWS Local Zones to the mix, you have even more control over the location of your data and its various projections.
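As a minimal sketch of enforcing residency at bucket creation time (the Region and bucket name below are placeholders), a boto3 call can pin a bucket to a specific Region and confirm where it lives:

```python
import boto3

REGION = "eu-central-1"                 # placeholder: the Region your data must reside in
BUCKET = "my-residency-bound-bucket"    # placeholder bucket name

s3 = boto3.client("s3", region_name=REGION)

# Buckets created outside us-east-1 must state their Region explicitly.
s3.create_bucket(
    Bucket=BUCKET,
    CreateBucketConfiguration={"LocationConstraint": REGION},
)

# Verify the bucket's Region; all objects stored in it will reside there.
location = s3.get_bucket_location(Bucket=BUCKET)["LocationConstraint"]
print(f"{BUCKET} resides in {location}")
```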
Security and Compliance Enhanced Delta Lake with Databricks
- With Databricks you can build a Delta Lake on top of your S3-stored data, and using the Enhanced Security and Compliance add-on you get improved security and controls that help you guarantee your compliance needs are met. This add-on includes a hardened host OS, inter-node encryption, and enforced updates, together with compliance security profiles available for HIPAA, IRAP, PCI-DSS, and FedRAMP High and Moderate out of the box.
Amazon Macie for automated sensitive data discovery on S3
- Having your data on S3 also allows you to easily enable Amazon Macie, a service that will continually evaluate, discover, and scan your data on Amazon S3, and gives you a very quick path to automating remediation using Amazon EventBridge and AWS Security Hub.
Start by creating your S3 Buckets in the appropriate regions and setting up your Databricks Workspace to use the buckets with the desired data residency and enabling the Compliance Security Profile. Then configure Amazon Macie to handle automated discovery of sensitive data in the applicable S3 buckets.
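The Macie part of that setup can be scripted. Here is a minimal boto3 sketch, assuming Macie is available in your Region (the Region value is a placeholder), that enables the service for the account and turns on automated sensitive data discovery:

```python
import boto3

macie = boto3.client("macie2", region_name="eu-central-1")  # placeholder Region

# Enable Macie for this account (skip if it is already enabled).
try:
    macie.enable_macie(findingPublishingFrequency="FIFTEEN_MINUTES", status="ENABLED")
except macie.exceptions.ConflictException:
    pass  # Macie was already enabled for this account

# Turn on automated sensitive data discovery across your S3 buckets.
macie.update_automated_discovery_configuration(status="ENABLED")
```

From there, Macie findings can be routed to Amazon EventBridge rules and AWS Security Hub for the automated remediation described above.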
Durability and availability
Durability and availability are different but usually interrelated concerns. Durability focuses on ensuring data remains intact, complete, and uncorrupted over time, while availability refers to the ability to access data at any point in time. Achieving durability and availability is relatively straightforward when dealing with object storage and using services like Amazon S3, but it becomes particularly challenging when you are working with block data that requires direct and fast access from your workloads, and you need to keep this data replicated across regions for durability as well as available for direct access.
Key tools for the job
How it works
Amazon FSx for ONTAP
- Fast, Regional file and block storage for direct access by your compute nodes running Linux, Windows, and macOS can be achieved using Amazon FSx. With Amazon FSx you can share file storage across Amazon EC2 instances, Amazon EKS nodes, and other compute services, with very low latency and incredibly fast performance. Amazon FSx provides out-of-the-box support for multi-Availability Zone (AZ) deployment.
NetApp BlueXP
- With Amazon FSx for NetApp ONTAP and NetApp BlueXP, you get access to effortless cross-region replication and data transport capabilities. Backup and cross-region disaster recovery are also readily available, which allows you to centrally manage your entire data estate across AWS Regions, including automating, through rules and policies, the criteria you choose to apply to data being moved around or recovered in case of disaster.
You can start by creating your Amazon FSx for ONTAP file systems directly in the AWS Console, which will allow you to discover them using NetApp BlueXP, or you can create the environment from within BlueXP directly. In both scenarios you will need to grant the correct IAM permissions for the AWS resources to be accessible and managed through BlueXP.
From here, you can configure your high-availability architecture, automatic capacity and throughput handling, as well as all the other powerful capabilities available with NetApp BlueXP working with FSx volumes.
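If you prefer to script the file system creation rather than use the console, a hedged boto3 sketch looks like this (the subnet IDs, storage capacity, and throughput values below are placeholders you would replace with your own):

```python
import boto3

fsx = boto3.client("fsx", region_name="us-east-1")  # placeholder Region

response = fsx.create_file_system(
    FileSystemType="ONTAP",
    StorageCapacity=1024,                            # GiB, placeholder sizing
    SubnetIds=["subnet-aaa111", "subnet-bbb222"],    # placeholder subnets in two AZs
    OntapConfiguration={
        "DeploymentType": "MULTI_AZ_1",              # multi-AZ for availability
        "PreferredSubnetId": "subnet-aaa111",
        "ThroughputCapacity": 256,                   # MB/s, placeholder
    },
    Tags=[{"Key": "Name", "Value": "ontap-demo"}],
)
print(response["FileSystem"]["FileSystemId"])
```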
Scale
There are many challenges to storing and using data at scale, particularly when it comes to structured data:
- Infrastructure scale: Getting more storage and compute as your data volume and data access patterns become larger and more demanding without impacting users and ensuring continuous reliability and performance.
- Data import and integration: The data that lives in data warehouses comes from many sources, and it changes over time. Reducing the effort of importing data, as well as integrating different data sources directly for consolidated query access, is a common requirement for data at scale.
- Access, analysis, and query processing: Providing data engineers with the right tools and intuitive interfaces to work with the data and gain insights from it is a definite requirement to truly extract the value of data at scale.
Key tools for the job
How it works
Snowflake
- Providing instantaneous capacity to handle any scale and offering features such as data stages and external tables, Snowflake delivers out-of-the-box capabilities that solve all the challenges of working with data at scale we’ve discussed. Snowflake also provides an intuitive and familiar interface by supporting standard ANSI SQL for most operations. Additionally, Snowflake enables multi-region architectures off the shelf by supporting multi-region deployment, data sharing, and replication.
Amazon S3
- S3 buckets can be used directly within Snowflake for data staging, which allows you to load and unload data into your Snowflake deployment without having to build and maintain any complex data pipeline. You can also integrate buckets as external sources, which enables you to query data stored in S3 without having to import it into the Snowflake data catalog at all.
Start by creating and loading data into your S3 buckets. Once your buckets are created, you can use Snowflake storage integrations to allow Snowflake to read and write data from them. Storage integrations handle authentication and authorization using AWS-native IAM, without relying on less secure secret keys and access tokens.
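A hedged sketch of that wiring using the Snowflake Python connector (the account identifier, user, role ARN, bucket, and object names below are all placeholders):

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="myorg-myaccount",   # placeholder Snowflake account identifier
    user="DATA_ENGINEER",        # placeholder user; prefer key-pair auth or SSO in practice
    password="...",
    warehouse="ANALYTICS_WH",
)
cur = conn.cursor()

# Storage integration: Snowflake assumes an IAM role instead of using access keys.
cur.execute("""
    CREATE STORAGE INTEGRATION IF NOT EXISTS s3_int
      TYPE = EXTERNAL_STAGE
      STORAGE_PROVIDER = 'S3'
      ENABLED = TRUE
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-access'
      STORAGE_ALLOWED_LOCATIONS = ('s3://my-analytics-bucket/raw/')
""")

# External stage backed by the integration; COPY INTO and external tables can use it.
cur.execute("""
    CREATE STAGE IF NOT EXISTS raw_stage
      URL = 's3://my-analytics-bucket/raw/'
      STORAGE_INTEGRATION = s3_int
""")
```

After creating the integration, run DESC INTEGRATION to obtain the IAM user and external ID that Snowflake reports, and add them to the IAM role’s trust policy so Snowflake can actually assume the role.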
Transportation and transformation
Challenges and Requirements:
For data to be of value, it must usually be taken from its raw form and enriched or otherwise transformed. There are many sources of data that must be integrated and related, including streaming data sources (from the edge or IoT), data batches from DBMSs, user-generated data from applications, and large volumes of unstructured or semi-structured data from data lakes and data warehouses. Transportation and transformation are usually the domain of data engineers.
Data streams
Data can be generated from many different sources, including edge devices, IoT clients, change data capture, and event-driven system architectures. Working with these data streams requires high-performance processing capabilities as well as dynamic and flexible mechanisms for ordering, deduplicating, and modifying that data in flight.
Data stream throughput can change very rapidly and, due to network and other conditions, result in spiky behavior that requires the right mechanisms to buffer, throttle, and retry messages in those streams as necessary.
There is an impact on data storage as well, as data must be persisted somewhere both while in transit and when directed to its final location. Processing the data stream must account for the eventual impact of scaling activities in target data storage systems, as well as be capable of handling varying volumes of data persisted temporarily during processing and staging, prior to final storage.
Key tools for the job
How it works
Confluent Kafka
- Confluent offers incredibly powerful and unique capabilities built on top of the battle-proven Apache Kafka. Addressing the challenges outlined above, Confluent offers automated cluster scaling, dramatically reducing the complexity of handling data streams at scale, as well as features for automated data deduplication. With Confluent Kafka you get various message delivery, receipt, and ordering guarantees out of the box, which gives you a wide array of options when building your cloud-native architecture. To handle challenges around scale, Confluent Kafka offers infinite storage and retention capabilities, which makes storage concerns a thing of the past.
Amazon EventBridge
- Confluent Kafka has a native integration with Amazon EventBridge, which allows you to send messages from one or more Kafka topics to a specified event bus, including customized mapping and other handy capabilities. Amazon EventBridge scales serverlessly to handle virtually any load and offers native integrations with most AWS services, as well as custom capabilities to integrate with third-party services.
Start by creating a Confluent Kafka cluster and configuring all your data sources to stream data to it using the Kafka client libraries. Once your data is flowing into Confluent, use the EventBridge Sink connector to send your records to one or more EventBridge buses, which will allow you to configure any downstream processing or handling using the huge array of AWS services supported by EventBridge natively or through the SDK.
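As a hedged illustration of the producer side (the bootstrap server, API key, secret, and topic name are placeholders; the EventBridge Sink connector itself is configured in Confluent, not in this code):

```python
import json
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "pkc-xxxxx.us-east-1.aws.confluent.cloud:9092",  # placeholder
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<API_KEY>",     # placeholder Confluent API key
    "sasl.password": "<API_SECRET>",  # placeholder Confluent API secret
})

def delivery_report(err, msg):
    # Called once per message to confirm delivery or surface an error.
    print("delivery failed:", err) if err else print("delivered to", msg.topic())

event = {"device_id": "sensor-42", "temperature": 21.7}  # placeholder record
producer.produce("iot-readings", json.dumps(event).encode("utf-8"),
                 callback=delivery_report)
producer.flush()  # block until outstanding messages are delivered
```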
Pipelines
Lying at the center of data engineering processes, data processing pipelines are a critical component in the process of getting data transported and transformed across locations and repositories, as well as automating workflows necessary for cleaning that data or developing capabilities around it.
Maintaining pipelines is usually a hard problem for engineers: considering their place in most architectures, they require access and proper authorization to many data repositories managed by different parties, as well as continuous updates as data schemas evolve and new applications are deployed.
Performance and scalability are also interesting challenges, as the engine handling these workflows must scale to match the demands of source and target systems, as well as handle transformation tasks of growing complexity.
Key tools for the job
How it works
Matillion Data Productivity Cloud
- Building data pipelines with no code is one of the most powerful capabilities of Matillion, giving more diverse users access to move and work with data, as well as reducing the effort of maintaining those pipelines over time. Matillion is powered by PipelineOS, which provides a cloud-native approach to scaling the underlying infrastructure that runs the pipelines by using stateless agents capable of scaling to work in parallel and process massive amounts of data in a short period of time. Matillion also offers native and seamless integration with Amazon Redshift, dramatically simplifying storing processed and optimized data in a flexible and highly scalable repository.
Amazon Redshift
- Offering both configurable and machine learning-driven, fully automated performance tuning, Amazon Redshift serves as an ideal candidate to store the voluminous data generated by your data pipelines. The capacity of Amazon Redshift to scale up to meet demand can easily adjust to the requirements of a dynamic and elastic range of data pipelines, and it serves as a flexible landing location where you can analyze, visualize and, if necessary, project that data to other target systems.
Start by creating your Redshift cluster in AWS and create a user for Matillion with the necessary privileges to create databases and manage their data. Also, note the Redshift endpoint and port, which you will need when creating your Matillion pipelines.
Now you can start using Matillion’s wide range of capabilities to consume, extract, or transform data from many different sources, including RDS data querying, which automatically stages data to S3, or any of the other data transfer sources, including HDFS, SFTP, and many external third-party services.
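A hedged boto3 sketch of the Redshift preparation step using the Redshift Data API (the cluster identifier, database, admin user, and password are placeholders, and the exact privileges you grant should follow your own security model):

```python
import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")  # placeholder Region

def run(sql):
    # Execute a statement against the cluster using temporary database credentials.
    return rsd.execute_statement(
        ClusterIdentifier="analytics-cluster",  # placeholder cluster identifier
        Database="dev",                         # placeholder database
        DbUser="awsuser",                       # placeholder admin user
        Sql=sql,
    )

# Dedicated user for Matillion with rights to create databases and manage their data.
run("CREATE USER matillion PASSWORD 'Replace1WithSecret' CREATEDB")  # placeholder; Redshift enforces complexity rules
run("GRANT CREATE ON DATABASE dev TO matillion")
```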
Visualization and analysis
Challenges and Requirements:
Gaining an understanding of data requires the ability to query and explore it with flexibility and efficiency, and to produce the necessary representations of it once you’ve arrived at the desired insights. In terms of analysis, strong tools to explore the data and produce insightful results with simple syntax are key capabilities of an optimal tool. As for visualization, what matters most is the ability to integrate with many different sources of data and present results in a variety of visual representations. This is the realm of data scientists, business intelligence engineers, and database administrators (DBAs).
Dashboards and visualization
Displaying insights on data in a way that is intuitive, easy to understand, and clearly communicates the desired information to the viewer requires a combination of data, presentation, and user experience skills, all indispensable to allowing consumers of those insights to get true value from the data.
Allowing data scientists and business intelligence engineers to deliver insights from data to end users requires tooling that provides a wide and dynamic range of ways to present and visualize those insights, as well as seamless query and integration capabilities with the various systems where the data you want to present to users can live.
Consumers of these dashboards need to have the flexibility to customize and navigate them intuitively, while ensuring that access to data is only available to the right people, thus requiring strong role-based access control and granular authorization controls to data attributes.
Key tools for the job
How it works
Grafana
- With over 26 different ways to present data, Grafana provides a powerful framework for business intelligence engineers, data scientists, and other engineers working to extract and share insights from data to build intuitive, simple, yet comprehensive dashboards, while allowing for the integration of many data sources and providing a common interface to query and inspect the data as the dashboard is being built. Grafana Cloud also provides enterprise-level capabilities, such as SSO and IAM integrations, providing robust ways to build security and access controls between your dashboard consumers and data.
AWS Identity and Access Management
- With AWS IAM you get access to incredibly granular controls and multiple ways to group, assign, and manage permissions for individuals as well as groups of users. These access controls interface directly with all AWS services and can be used to control access to dashboards and specific components within them.
Once you have subscribed to your Grafana Cloud account, simply use Grafana Data Sources to connect and consume data from your AWS resources, including Amazon CloudWatch, Amazon RDS, Amazon OpenSearch Service, and many others. Then use any of the built-in Grafana dashboards to visualize the health and status of each service, or build your own using Grafana’s query, exploration, visualization, and alerting capabilities.
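As a hedged sketch, Grafana’s HTTP API lets you add a CloudWatch data source programmatically (the Grafana URL, service account token, and Region below are placeholders; the credentials Grafana uses for CloudWatch depend on how your stack is configured):

```python
import requests

GRAFANA_URL = "https://myorg.grafana.net"   # placeholder Grafana Cloud stack URL
API_TOKEN = "<service-account-token>"       # placeholder token

payload = {
    "name": "CloudWatch",
    "type": "cloudwatch",
    "access": "proxy",
    "jsonData": {
        "authType": "default",              # pick up AWS credentials from the environment/role
        "defaultRegion": "us-east-1",       # placeholder Region
    },
}

resp = requests.post(
    f"{GRAFANA_URL}/api/datasources",
    json=payload,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # created data source details
```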
Machine learning and generative AI
Challenges and Requirements:
Data lies at the core of machine learning as a crucial element in achieving the desired outcome from ML algorithms. Training, fine tuning, and Retrieval-Augmented Generation (RAG) are just three areas where data is indispensable in the context of machine learning. There are unique challenges that arise from integrating large data sets as part of development processes, which is a must considering the rapidly evolving domain of machine learning model development.
MLOps
A lot has been learned from DevOps about accelerating the delivery of software products. Given the rapid rise of new tools, frameworks, and methodologies to build and deploy machine-learning models, it becomes a requirement to apply many of those learnings to the workflows associated with developing AI capabilities.
Data lies at the center of machine-learning development. It is therefore indispensable to integrate data into the development workflow, something particularly challenging due to the nature of data: voluminous, constantly changing, and with a limited shelf life.
Data is used for training, fine tuning, and enriching the context that machine-learning models have access to. Many of these activities are time-consuming and require considerable compute capacity. Optimizing workflows and automation to reduce the friction and time required to perform these processes is critical to successful machine-learning development projects.
Versioning machine-learning model artifacts and providing a testing framework to validate their performance and precision is also critical.
Key tools for the job
How it works
Databricks Data Intelligence Platform
- Databricks offers a strong set of capabilities in its MLflow offering that manages the entire lifecycle of machine-learning development workflows, from data preparation to training and notebook development to model deployment across a varied range of integrated model runtime environments.
The Databricks Unity Catalog feature allows for storage of ML models, while enabling access controls, auditing, model discovery, and many other powerful capabilities when working with evolving versions of machine-learning projects.
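A minimal MLflow sketch of that workflow, assuming a Unity Catalog-enabled Databricks workspace and using a placeholder catalog.schema.model name and a toy scikit-learn model:

```python
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Register models in Unity Catalog rather than the workspace model registry.
mlflow.set_registry_uri("databricks-uc")

X, y = make_classification(n_samples=500, n_features=8, random_state=7)

with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100).fit(X, y)
    mlflow.log_param("n_estimators", 100)
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        input_example=X[:5],  # lets MLflow infer the signature Unity Catalog expects
        registered_model_name="main.analytics.churn_classifier",  # placeholder catalog.schema.model
    )
```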
Amazon SageMaker
- Databricks offers direct integration with Amazon SageMaker, allowing for direct and seamless deployment of models developed and managed using Databricks’ set of capabilities, while taking advantage of the powerful features built into Amazon SageMaker, including customizable scalability, API endpoint deployment and lifecycle management, and many other features required to put models into production environments (for example, tight IAM controls and direct access to data stored and managed within other AWS services).
You can use Databricks for the development and exploration of your model, and SageMaker to deploy it and make it available to users in a production-ready capacity. Start by creating a Databricks workspace to work on the development of your model. Once you have a model ready to deploy, use the MLflow API to deploy it to SageMaker; it is as simple as that.
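As a hedged sketch of the deployment step with MLflow’s SageMaker deployment client (the Region, endpoint name, model URI, and execution role ARN are placeholders you would replace with your own):

```python
from mlflow.deployments import get_deploy_client

# Target SageMaker in a specific Region (placeholder).
client = get_deploy_client("sagemaker:/us-east-1")

client.create_deployment(
    name="churn-classifier-prod",                            # placeholder endpoint name
    model_uri="models:/main.analytics.churn_classifier/1",   # placeholder registered model version
    config={
        "execution_role_arn": "arn:aws:iam::123456789012:role/sagemaker-exec",  # placeholder role
        "instance_type": "ml.m5.large",
        "instance_count": 1,
    },
)
```

In practice you also need an MLflow model-serving container image available in Amazon ECR (built with the mlflow sagemaker build-and-push-container CLI) and referenced via the deployment configuration.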
LLM RAG Development
Prototyping and developing RAG-based large language model (LLM) implementations can be a complex task that requires skills across a wide range of new tools and frameworks. Providing developers with tooling that simplifies their development processes and exploration without adding new layers of complexity is a key concern for teams starting their RAG development journey.
Familiarity with the environment, data structures, and interfaces is a relevant requirement for accelerating the upskilling of developers to explore machine-learning model development.
Integrating existing data sets and making them available as vector embeddings for LLMs through RAG in a seamless and intuitive way provides an accelerated foundation for developers to explore these new development paradigms.
Key tools for the job
How it works
MongoDB Atlas
- Developers are very familiar with MongoDB as a data repository for JSON documents, arguably one of the most common NoSQL data formats in use today, which also means a lot of data is already stored in these databases. MongoDB can be directly integrated as an Amazon Bedrock knowledge base, and it can serve both as document storage as well as storage for all the corresponding vector embeddings that will be used by the RAG implementation on top of your foundational LLM of choice. MongoDB provides an intuitive and familiar interface for developers to work with document and vector data, while enabling them to deploy RAG-based solutions with little effort by leveraging Amazon Bedrock’s fully managed and serverless access to foundational LLMs.
Amazon Bedrock
- Amazon Bedrock is a fully managed serverless offering that enables seamless access to a wide variety of high-performing foundational models. It also offers powerful and intuitive integration capabilities using Amazon Bedrock Agents and Amazon Bedrock Knowledge Bases to enrich and extend the context available to LLMs when generating responses to users.
Start by creating your MongoDB Atlas cluster and storing the credentials to access it in AWS Secrets Manager. Store the data you want to use for your knowledge base in an S3 bucket; this can be anything from PDF files to other documents. Next, create an Atlas Vector Search index in your Atlas cluster. Now you are ready to configure your knowledge base in Amazon Bedrock, adding the Atlas vector store you created and the S3 bucket holding your data to the knowledge base configuration. Finally, create an agent that uses this knowledge base, and you’ll be ready to use it as a RAG source for your LLM.
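Once the knowledge base exists, a hedged boto3 sketch of querying it looks like this (the Region, knowledge base ID, model ARN, and question text are placeholders):

```python
import boto3

bedrock_rt = boto3.client("bedrock-agent-runtime", region_name="us-east-1")  # placeholder Region

response = bedrock_rt.retrieve_and_generate(
    input={"text": "What does our returns policy say about damaged items?"},  # placeholder question
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB1234567890",  # placeholder knowledge base ID
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",  # placeholder model
        },
    },
)
print(response["output"]["text"])  # generated answer grounded in the retrieved documents
```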
More resources to help you build with AWS
About AWS Marketplace
AWS Marketplace makes it easy to find and add new tools from across the AWS partner community to your tech stack with the ability to try for free and pay-as-you-go using your AWS account.