How to build a streaming data transport and transformation layer with Apache Kafka on AWS
Accelerate application modernization and migration to AWS with a real-time data transformation and transport layer using Confluent Cloud
The landscape where most organizations operate production workloads today is quite diverse—a mix of legacy and modern applications. Some are running in the cloud following cloud-native architecture principles and using fit-for-purpose data repositories. Others are running on premises, using the staple technologies of the past world of monoliths for data storage: relational databases. And you can probably find everything in between, from on-premises data warehouses to applications being strangled into services not quite ready yet to hit the cloud.
In this article we will look into hybrid architectures and how data fits into that picture, the challenges data introduces on the road to application migration and modernization, and how Confluent Cloud can work hand in hand with Amazon Web Services (AWS) to build a layer that transports, transforms, and provides access to data as applications are modernized and moved around, all while meeting stringent availability and data-consistency requirements.
Let’s dig in!
The landscape of applications today
I’m going to sound like my grandpa, but the old days were a whole lot simpler … and, at least in terms of systems architecture, there’s no denying that.
Back in the day, running applications online meant a pizza-box server, an Ethernet cable, an HTTP service, and a relational database (long live LAMP). And, yeah, some of you might be thinking that I could go even further back in time, but I don’t really think that’s needed to make my point.
The scenario above represented how entire systems were built: a single codebase, all data living in the same place, and, at most, vertical scaling.
The picture above is far from today’s landscape. The technology industry has been disrupted over the past couple of decades by a rapidly accelerating set of technologies. At the core of it all lies the cloud, along with the paradigms, such as DevOps, that grew up around its groundbreaking capabilities.
Architectures started to change, cloud-native principles started to solidify, and applications, together with the teams building them, had to adapt rapidly. But change is always harder than it seems. Some applications were rebuilt from the ground up, others started to be split apart, while some simply resisted change.
The result: a hybrid landscape of many applications at different stages of their cloud evolution, spread across a wide variety of locations and running on top of many types of infrastructure. And data is no different. The days of a single relational database and a single table holding everything together are now a story of the past. In their place, we have data lakes, data warehouses, microservices, and all combinations thereof, each depending on a different type of fit-for-purpose data storage.
The challenges of data in application modernization
Data has some unique qualities that make it a particularly complex target for application modernization:
Data is constantly being accessed and produced
Most applications would be worthless without the data they serve to users, and most applications are also continuously producing massive amounts of data from user interaction.
The challenge is that, even if you have refactored your application, or identified that your legacy applications could be replatformed relatively easily, that cloud readiness is worth little unless you can take the data along with the application.
And considering that applications are producing data every second, getting the data to the cloud with little downtime is a complex and usually costly technical exercise.
Moving data around can be challenging
The more data you have, the more data you will produce over time. And that translates into data volume: gigabytes turning into terabytes turning into petabytes. But even as the volume of stored data keeps growing, it is recent data, even real-time data, that carries the most value in most applications.
As we discussed earlier, getting data moved over and made available for real-time use can be challenging, and ever-growing data volumes don’t make it any easier, because of three fundamental constraints:
- Bandwidth—A scarce and finite resource that imposes a hard limit on how much data you can move over time (see the back-of-the-envelope calculation after this list).
- Time—If bandwidth is not enough to move your data within the necessary time frame, there are alternatives that physically ship data. But then time becomes an issue, usually addressed by moving the bulk of the data first and then synchronizing only the deltas over time, and that’s not necessarily a simple task.
- Cost—Regardless of your migration path of choice, moving data around is usually expensive. And the more data you have, the more costly it will be.
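To make the bandwidth-versus-time trade-off tangible, here is a rough back-of-the-envelope calculation. The figures (1 PB of data, 1 Gbps of sustained usable bandwidth) are purely illustrative assumptions, not measurements from any real migration:

```python
# Rough estimate of how long a bulk transfer takes over a dedicated link.
# The inputs below are illustrative assumptions, not benchmarks.

data_petabytes = 1.0                      # total data to move
usable_bandwidth_gbps = 1.0               # sustained, usable bandwidth

bits_to_move = data_petabytes * 1e15 * 8  # decimal petabytes -> bits
seconds = bits_to_move / (usable_bandwidth_gbps * 1e9)
days = seconds / 86_400

print(f"~{days:.0f} days to move {data_petabytes} PB at {usable_bandwidth_gbps} Gbps")
# ~93 days -- before accounting for retries, throttling, or the new data
# the application keeps producing while the transfer is running.
```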
Data structures are changing
Working with data at modern-day scale requires not just different pieces of infrastructure to operate the services that store and provide access to it; it also requires changing the fundamental structure of how data is stored.
This is also driven by different needs around querying and ensuring data remains available even during disasters.
The outcome of this change is that relational data sources, which are representative of the majority of applications that were built in the past couple of decades, are being transformed into capability-specific repositories. This introduces the need to work with unstructured and semi-structured data, documents, full text, and other forms of data representation.
The result is that, many times, once your application is ready to start working in its new home (i.e., the cloud), getting production data ready for it from legacy systems requires some dramatic transformations, all the while ensuring consistency across systems.
Solving these challenges with a data-transport layer
So, let’s look at this from a different perspective:
What would happen if all the data in your organization could live where it does but also become available, in close to real time, in an additional layer: a fabric on top that would allow you to query, transform, and direct data continuously from one or more sources into one or more targets?
And what if this layer was something that was built as a core component of the enterprise migration and modernization strategy, looking at data as a comprehensive and global enterprise entity that feeds and is fed by a variety of applications, not through the lens of a single application and its very specific use cases?
This is the concept that I want us to look at in more detail, by using Confluent Cloud as the cornerstone of an enterprise application-modernization and cloud-adoption strategy.
What is a data transport and consumption layer?
Your organization likely has many data repositories already, from relational databases running on premises to unstructured and semi-structured data living in data lakes to other data-storage solutions such as document databases and key-value stores that are used by individual services to read and write application-specific data.
A data transport and consumption layer connects with all these sources and provides a unified fabric where all the data that is distributed across the organization can be queried, transformed, and eventually placed in a new target location.
This transport and consumption layer uses a combination of batch and stream processing to enable consolidated access to historical data as well as to real-time data as it changes.
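To make the consumption side of that fabric concrete, here is a minimal sketch using the confluent-kafka Python client to read change events from a topic in Confluent Cloud. The topic name, consumer group, credentials, and JSON event format are all assumptions for illustration, not part of any particular connector’s output contract:

```python
import json
from confluent_kafka import Consumer

# Connection details are placeholders -- substitute your Confluent Cloud
# bootstrap server and API key/secret.
consumer = Consumer({
    "bootstrap.servers": "<BOOTSTRAP_SERVER>",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<API_KEY>",
    "sasl.password": "<API_SECRET>",
    "group.id": "inventory-projection",     # hypothetical consumer group
    "auto.offset.reset": "earliest",
})

# "orders.changes" is a hypothetical topic fed by a source connector.
consumer.subscribe(["orders.changes"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())     # assumes JSON-encoded change events
        print(event)                        # transform/route/load downstream here
finally:
    consumer.close()
```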
Confluent Cloud at the center of this approach
Kafka has been a staple of data engineering for quite some time, uniquely capable of handling massive scale with extreme efficiency while ensuring data integrity. Confluent has taken those capabilities to a whole new level with Confluent Cloud, which adds to Kafka’s battle-proven core a wide range of enterprise features that make building a data transport and streaming layer a whole lot simpler.
The first step to establishing a data transport and access layer is identifying and connecting with all the different sources that hold your organization’s data, something that, as mentioned previously, likely spans a quite diverse range of technologies and data repositories.
Connecting with these many sources has traditionally required all sorts of “glue”: custom pipelines or other convoluted hacks to consume data reliably, that is, keeping events in order, guaranteeing non-duplication, and making sure that data keeps flowing as it changes.
Confluent has over 120 connectors for data sources and data sinks that eliminate much of the complexity of setting up producers and consumers, along with all the details you would otherwise have to handle yourself to maintain data integrity.
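As an illustration of what these connectors spare you from wiring by hand, the sketch below registers a JDBC source connector against a relational database through the Kafka Connect REST API. It assumes a self-managed Connect cluster at a placeholder URL; Confluent Cloud’s fully managed connectors take equivalent configuration through the Confluent Cloud console, CLI, or API instead. All names, hosts, and credentials are hypothetical:

```python
import requests

# Placeholder endpoint for a self-managed Kafka Connect cluster.
CONNECT_URL = "http://connect.internal.example.com:8083"

connector = {
    "name": "orders-postgres-source",  # hypothetical connector name
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://onprem-db.example.com:5432/orders",
        "connection.user": "replicator",
        "connection.password": "<SECRET>",
        "mode": "timestamp+incrementing",   # capture new and updated rows
        "timestamp.column.name": "updated_at",
        "incrementing.column.name": "id",
        "topic.prefix": "orders.",          # one topic per captured table
        "tasks.max": "1",
    },
}

resp = requests.post(f"{CONNECT_URL}/connectors", json=connector, timeout=30)
resp.raise_for_status()
print(resp.json())
```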
We will look at data sinks as a later step, once you approach the new destination of your applications.
Now, as I’ve mentioned, you will need to understand the structure and contents of data as it flows into and is stored in this new layer that you’ve built on top of your source data repositories. And what could be simpler than using traditional SQL to query the data that is now stored in Confluent Cloud and continuously streaming from all the repositories that your organization relies on?
That is the task of Flink SQL, an ANSI standard-compliant SQL engine that can process both real-time and historical data. It provides users with a declarative way to express data transformations and analytics on streams of data, allowing you to analyze and transform data streams without any need for complex code.
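As a small, hypothetical example of what such a query might look like (the table and column names are invented), here is a Flink SQL statement that continuously aggregates an orders stream by region. In Confluent Cloud you would typically run it from a Flink SQL workspace; it is wrapped in a Python string here only to keep the examples in one language:

```python
# A hypothetical Flink SQL statement over a stream of order events.
# Table and column names are invented for illustration; in Confluent Cloud
# you would typically paste this into a Flink SQL workspace rather than
# submit it from Python.
REVENUE_BY_REGION_SQL = """
SELECT
  region,
  COUNT(*)    AS order_count,
  SUM(amount) AS total_amount
FROM orders
WHERE order_status = 'COMPLETED'
GROUP BY region
"""

print(REVENUE_BY_REGION_SQL)
```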
Nevertheless, Kafka topics are usually not the most efficient place from which to serve data once applications are deployed. Nor is the unified semantic model that you can start building on top of that access layer necessarily compatible with applications coming from a world where data was stored differently. This is where transformation and sinks come into play. But let’s look at those in the context of application migration and modernization.
Using the layer for app modernization and migration
So, imagine if, as part of your application modernization and migration strategy, you had the brilliant foresight to build a unified data transport and access layer in parallel, as your development teams started to look into the challenges of modernizing their code bases to work efficiently in the cloud. Or as they started to build brand-new services with tightly defined scope that relied on data already handled by some monolithic application of yore.
Now, you have all your organization’s existing data available in a common fabric, flowing in real time into a central location, and some of your teams have made all the necessary changes to replatform existing applications to work in the cloud. Others are ready to start deploying new services to the cloud, using a new schema and data-storage technology, but require subsets of existing data for those applications.
Let's talk about data transformation and data sinks from the angle of the two scenarios I just mentioned: application replatforming and application modernization.
One-to-one transport of data repositories
In this scenario, the application has been modified only enough to run effectively in the cloud, with no architectural changes related to data: it is expected to use the same database technology and the same schema. Transporting data from source to target in real time makes it relatively easy to migrate the application over and orchestrate its move into production with no data loss, little downtime, and no data transformation required.
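For instance, if the replatformed application keeps using PostgreSQL, the topics populated by the source connector can simply be drained into the new database with a JDBC sink connector. The sketch below is illustrative only: it again assumes a self-managed Connect REST endpoint and placeholder names, while Confluent Cloud’s managed sink connectors accept equivalent configuration through the console or CLI:

```python
import requests

CONNECT_URL = "http://connect.internal.example.com:8083"  # placeholder

sink = {
    "name": "orders-postgres-sink",  # hypothetical connector name
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
        # Target database in AWS (e.g., Amazon RDS for PostgreSQL) -- placeholder URL.
        "connection.url": "jdbc:postgresql://orders.cluster-xyz.rds.amazonaws.com:5432/orders",
        "connection.user": "migrator",
        "connection.password": "<SECRET>",
        "topics": "orders.orders",   # topic written by the source connector
        "insert.mode": "upsert",     # idempotent writes keep source and target consistent
        "pk.mode": "record_key",
        "auto.create": "true",       # create the table if it does not exist
        "tasks.max": "1",
    },
}

resp = requests.post(f"{CONNECT_URL}/connectors", json=sink, timeout=30)
resp.raise_for_status()
```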
Data transformation and new data targets
The real beauty of the data transportation and access layer lies in the simplicity it brings to transforming data as it finds its new destination, regardless of the original source of the data.
A great example of this scenario is services being strangled off from monoliths, where only a subset of the data being produced by the monolith must now be accessed, in a new data repository, by the new service.
Using the data layer we’ve been discussing, you have consolidated access to data flowing in real time from any applicable source, and the capability to modify the structure of that data and have it continue to flow in real time to its new target, making data ready and available for the new applications and services once they’re deployed in the cloud.
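To make that concrete with a hypothetical strangler-pattern example (topic, table, and column names are invented), a single Flink SQL statement can continuously carve out just the fields the new service needs into a new topic, which a sink connector can then deliver to the service’s own data store:

```python
# Hypothetical Flink SQL that continuously projects a subset of a monolith's
# order stream into a new topic for a strangled-out "shipping" service.
# Table, topic, and column names are invented for illustration; in Confluent
# Cloud this would typically run in a Flink SQL workspace.
SHIPPING_EVENTS_SQL = """
CREATE TABLE shipping_events AS
SELECT
  order_id,
  customer_id,
  shipping_address,
  order_ts
FROM orders
WHERE order_status = 'PAID'
"""
```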
What about the volume of data and retention?
There are two aspects to a data access and transportation layer that you must be mindful of and coordinate as you architect this solution.
A data access and transportation layer is not supposed to be a long-term data repository. It is a layer that moves, transports, and provides visibility into data, but it should not be a persistent, long-term store of your data. There are data “sinks” (i.e., targets) for a reason, and those should be the ultimate destination of data. There are costs and complexities related to time and data: you should architect your solution so that data doesn’t have to live longer than 30 days in the data access and transportation layer. If this is not possible, there are other techniques that can be used to hold data for longer periods “in between” source and destination.
Data volumes are also a concern, again mostly because of cost and time. It is important to understand the lifecycle of data as it flows through your data access and transportation layer to ensure you’re not consuming unnecessary volumes of data storage.
Now, this is another area where things become a whole lot easier and more flexible with Confluent, considering Confluent Cloud’s infinite retention and storage capabilities available on AWS clusters, which give you access to petabytes of storage.
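One concrete lever here is topic retention. As a sketch (the topic name, partition count, and retention period are illustrative assumptions), the confluent-kafka AdminClient can create a topic with a 30-day retention.ms, so that data ages out of the transport layer once it has reached its sinks:

```python
from confluent_kafka.admin import AdminClient, NewTopic

# Placeholder Confluent Cloud credentials.
admin = AdminClient({
    "bootstrap.servers": "<BOOTSTRAP_SERVER>",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<API_KEY>",
    "sasl.password": "<API_SECRET>",
})

# Hypothetical topic with 30 days of retention (30 * 24 * 60 * 60 * 1000 ms),
# so data ages out of the transport layer once sinks have consumed it.
topic = NewTopic(
    "orders.changes",
    num_partitions=6,
    replication_factor=3,
    config={"retention.ms": str(30 * 24 * 60 * 60 * 1000)},
)

futures = admin.create_topics([topic])
futures["orders.changes"].result()   # raises if topic creation failed
```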
What do we get with this solution?
Confluent Cloud, which you can easily access in AWS Marketplace, gives you off-the-shelf capabilities that make it extremely easy to connect many sources of data to a common data-access and transportation layer.
It also provides you with data-querying capabilities using ksqlDB as well as options to send unmodified or transformed data to dozens of targets, including all the managed services AWS offers for storing all sorts of data types and satisfying many different scalability scenarios.
And with Confluent Cloud you will also be able to connect all your existing data repositories, no matter where they live.
With this, you have built a path to simplified application modernization and migration, and dramatically reduced the effort and complexity involved in moving and transforming data around as your teams modernize and develop applications.
You can try Confluent Cloud for free by using the offering in AWS Marketplace, which will expedite the process of connecting and provisioning your clusters in your AWS account.
Get hands on
Why AWS Marketplace?
Try SaaS products free with your AWS account to establish your proof of concept, then pay as you go in production with AWS Billing.