How unified data and machine learning workflows deliver value faster
Explore the tightly interrelated role of data and code in the process of developing, tuning, and working with machine-learning models and implementing RAG
Working on machine learning (ML) presents a whole new set of challenges and demands a shift in the mindset behind optimizing for developer experience, largely because ML development is a highly experimental, data-centric activity.
The recent explosive evolution and widespread availability of frameworks, foundational models, and commercial applications of ML have triggered a similarly explosive growth in the community of developers who are now, in one way or another, exposed to working with these leading-edge technologies. As a consequence, a whole new set of requirements, pain points, and optimization opportunities has been identified for improving those developers' environments and experience.
In this article we will look at the tightly interrelated roles of data and code in developing, tuning, and working with machine-learning models, even in relatively simple scenarios such as implementing Retrieval-Augmented Generation (RAG) on top of foundational large language models (LLMs). We will also see how developer experience and infrastructure simplicity are critical to the efficacy of the engineers evolving the subtle science and art of this whole new paradigm for how we make computers do things for us.
Let’s dig in!
Data at the center of workflows
There are many differences between traditional software engineering, rooted in the well-known, deterministic world of computer code that has evolved ever more elegant ways to write software for decades now, and this brand-new universe of writing models that fundamentally change how we instruct a system to do something for us.
But one key difference is how data and code are inextricably linked in the world of machine learning.
Regardless of the type of technology you are using to build your machine-learning algorithm, data will be required for training, for validation, or for context. Data will also be produced as an outcome of our exploration and development, and it will need to be analyzed and understood, usually at scales and degrees of complexity that challenge our human capabilities.
Without the right data and the right tools to work with that data, machine-learning engineers and data scientists will always struggle to collaborate effectively: siloed in separate systems and user interfaces, they need complicated, hard-to-maintain "bridges" in order to work together. These "bridges" usually take the form of data pipelines, data duplication, and a whole lot of Slacking or emailing.
The result is inefficiency and a poor experience for the engineers involved in the process.
Optimizing for experimentation
Before looking into some of the methods, patterns, and technologies that can be used to achieve optimized workflows, let's talk a little about what ML development actually looks like.
There are unique challenges to engineering something probabilistic. In traditional deterministic programming, it's relatively easy to validate that the logic of your code, given a fixed set of input parameters, produces a consistent and correct response. This is not the case with probabilistic development.
The output of your model will actually be derived from many factors, including model parameters and the quantity, quality, and composition of the data used to train and validate the model. These qualities make the process of working with data and machine-learning models one that is, in my humble opinion, much more experimental and less predictable than working in traditional deterministic code development.
This is an important factor to take into account when considering the experience of the engineers wrangling these types of problems, since it directly affects what an intuitive, consolidated, and optimized developer experience (and therefore the associated tooling) must look like.
And we have seen important evolution in the tools made available to engineers spelunking into the depths of machine learning, from notebooks that are easier to spin up or are offered as managed services, to integration with integrated development environments (IDEs) and other such niceties.
But still, there are important gaps in unifying the worldviews of data, machine-learning exploration and development, and the behavior of models deployed across different environments.
Let’s go back to data
So, now that we have a shared sense of the nuances of working in the realm of machine learning, let's go back to data.
There are at least two important “roles” related to data that I want to highlight in this article:
- The data scientist, whose job is to make sense of the data and help the ML engineer understand what the data means, so that they can effectively use it as the model is being developed, trained, and deployed.
- The data engineer, whose job is to move data around: building pipelines, transforming data, and making sure that the data the scientist is working with is clean and up to date.
Each of these roles has very distinct requirements, pain points, and worldviews, but their work is collectively relevant. Each has clear ownership of activities and systems, yet the operation and outcomes of those systems matter not just to them but to the whole team tasked with the ultimate goal of creating, customizing, or deploying a machine-learning model.
Each must be able to communicate and collaborate effectively with the others, presenting and sharing the right views into the tasks and systems they operate, so that everybody else is equipped to make the right decisions and work effectively. This requires something that provides a unified experience across these various domains of subject-matter expertise.
So, what could go wrong?
As we’ve seen up to now, getting these workflows and this collaboration right, with data and exploration at the center, is non-trivial. There are many moving pieces, specialized knowledge, and a less-than-predictable process that could, if not properly managed, dramatically hinder velocity, which is ultimately the objective of most engineering organizations: the faster we can build, the faster we can validate our value hypothesis, and the faster we can learn and adapt.
Velocity has been a key metric to track and a decisive factor in an organization’s ability to maintain a competitive advantage.
In this complex landscape, many things can go wrong.
Complex and expensive data pipelines
Moving data around, which has historically been the primary job of data engineers, is usually driven by data pipelines.
Data pipelines stand in the critical path to pretty much everything we’ve discussed so far—yet they are usually fragile and hard to maintain. How so, you might ask?
Well, data pipelines require connectivity to many systems and are very much tied to the schemas of the data stored in those systems, which means any change in access controls, data schemas, or even system capacity can block the flow of data. Or worse, the pipelines start producing dirty, inconsistent data, much to the dismay of the data scientists and machine-learning engineers who depend on that data and will likely discover the issue once it's a bit too late.
There is also the concern of cost: moving data around is usually expensive, both because of the data in transit and because of the necessary duplication of data storage across different systems.
Inability to collaborate efficiently
As I’m sure you can tell from what we’ve discussed so far, the various roles involved in building machine-learning models are joined at the hip and require very close collaboration. But in such a complex environment, collaboration is not the easiest thing to accomplish.
If the systems and views into those systems are fully isolated, collaboration and understanding must fall back to very inefficient methods of communication, which means a lot of back-and-forth emails and instant messaging to get all the ducks in a row.
Many of these challenges could be solved with better visibility, but system isolation usually requires more glue than necessary (glue that is itself hard to maintain) to make sure everyone understands what is happening beyond their immediate sphere of ownership.
Lack of visibility across systems and environments
Let’s talk some more about visibility, which, as we just mentioned, hinders collaboration when it's lacking. But visibility goes beyond data during development: it also impacts understanding how data and models behave across environments.
Data in multi-stage development (going from development to staging to production, for example) presents a uniquely complex challenge where you must juggle many concerns, from data volumes, data quality, and freshness to privacy, governance, and sovereignty of that data.
Understanding how systems operate across environments is crucial to eventually deploying something into production that behaves as you expect and that allows your users to extract the value you intend for them.
Approaching these challenges
No matter how ML evolves, our human ability to find ways of solving problems remains unmatched, and thanks to that we have started to come up with new methods and technologies to address these challenges.
Now I want us to look at how one particular and revolutionary product integrates all these solutions into a comprehensive, unified experience for everyone involved in the process of building machines that can learn: Databricks.
MLOps: What we learned from DevOps, applied to machine learning
We have learned a lot from DevOps:
- Continuous integration and deployment have dramatically reduced the effort of building releases and improved, by orders of magnitude, the quality of the software shipped to production.
- Continuous improvement as a mindset has transformed the way teams optimize for delivering value and look into their workflows as value streams.
- Everything as code enabled a world of ephemeral environments and consistent reproducibility between non-prod and prod, and gave us a platform to version and evolve our infrastructure just as we do our software.
MLOps looks to apply most of these concepts to the world of data and machine learning and serves as a foundational framework that optimizes the lifecycle of data in the context of ML engineering.
Databricks has native, powerful support for MLOps built right into the consolidated workspace that the various cross-functional teams building ML share, providing capabilities that optimize velocity and improve collaboration.
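To make this concrete, here is a minimal sketch of MLOps-style experiment tracking with MLflow, which Databricks hosts natively in the workspace. The dataset, model, parameters, and metric are illustrative placeholders, not a prescribed setup:

```python
# A minimal sketch of experiment tracking with MLflow, which Databricks
# manages natively. The dataset, model, parameters, and metric names
# are illustrative placeholders.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run(run_name="baseline-rf"):
    params = {"n_estimators": 100, "max_depth": 5}
    mlflow.log_params(params)  # record the experiment's inputs

    model = RandomForestClassifier(**params).fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))

    mlflow.log_metric("accuracy", accuracy)   # record the outcome
    mlflow.sklearn.log_model(model, "model")  # version the trained model itself
```

Because every run's parameters, metrics, and artifacts land in a shared tracking server, any teammate in the workspace can compare experiments without asking over Slack what you tried.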
Data federation: Maybe we don’t need all those pipelines
We talked about data pipelines being critical to move data around. This means pulling data from its authoritative source, maybe modifying it, and then putting that data somewhere else that can be seen and accessed by data scientists and machine-learning engineers.
But what if we didn’t have to move the data around to use it in our early development and exploratory phases?
This is exactly where data federation comes into play. Federation allows you to connect to existing data sources (say, your Amazon Redshift cluster) and explore and use that data as you design and perform your early experiments in your ML development process.
There are some caveats to using this feature effectively, since you’ll be connecting to live data sources that may be in use by other systems, but it is a phenomenal way to accelerate exploration and experimentation.
I don’t think data federation eliminates the need for pipelines and transformation as you approach a fully developed model, or even as you progress to later stages of the development process. But it does provide a fast way to validate, understand, and collaborate on data without the hassle of moving it around first, so that when you do get to moving it around, you do so on top of a much more stable, informed, and collectively agreed-upon set of decisions.
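As a sketch of what this looks like in practice in a Databricks notebook: an admin registers the Redshift cluster once as a Unity Catalog connection, and from then on its tables are queryable in place. All names, hosts, and secret scopes below are hypothetical and depend on your setup:

```python
# A hedged sketch of Lakehouse Federation from a Databricks notebook.
# `spark` is predefined in Databricks notebooks; the connection and
# catalog names, host, and secret scope are hypothetical.

# One-time setup, typically done by an admin:
spark.sql("""
    CREATE CONNECTION IF NOT EXISTS redshift_conn TYPE redshift
    OPTIONS (
      host 'my-cluster.abc123.us-east-1.redshift.amazonaws.com',
      port '5439',
      user secret('redshift_scope', 'user'),
      password secret('redshift_scope', 'password')
    )
""")
spark.sql("""
    CREATE FOREIGN CATALOG IF NOT EXISTS redshift_kb
    USING CONNECTION redshift_conn
    OPTIONS (database 'knowledge_base')
""")

# From here on, the Redshift table is explorable in place:
# no pipeline, no persisted copy.
docs = spark.sql("SELECT doc_id, title, body FROM redshift_kb.public.articles")
docs.show(5)
```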
Workspaces: Because collaboration and visibility are key
So, at this point we have MLOps fully automating our more mature data and ML workflows and we have data federation giving us visibility into data without having to move it around. But how do we make sure that all team members in the highly cross-functional and technically complex domain of machine-learning engineering can collaborate efficiently with one another?
This is the concept of a workspace!
Databricks workspaces provide a unified, access-controlled environment with consolidated tooling and a consistent developer experience for all peers working together on machine-learning models, whether building from scratch or simply enriching foundational LLMs with context using RAG.
Data federation put into context
Let’s put federation into the context of a simple, real-world use case: using RAG to provide context to a foundational LLM.
RAG relies on external data stored as vector embeddings: content similar to the user's query is retrieved and supplied as context, customizing the responses of foundational LLMs with information that was not originally in the model's training set.
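To make the retrieval step concrete, here is a minimal, framework-agnostic sketch of similarity search over embeddings. Producing the embeddings themselves is out of scope here; this only shows how "similar content" gets selected:

```python
# A minimal sketch of the retrieval step in RAG: given a query
# embedding, find the stored chunks whose embeddings are closest by
# cosine similarity, then pass those chunks to the LLM as context.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vec: np.ndarray, chunk_vecs: list[np.ndarray],
             chunks: list[str], top_k: int = 3) -> list[str]:
    scores = [cosine_similarity(query_vec, v) for v in chunk_vecs]
    best = np.argsort(scores)[::-1][:top_k]  # indices of the most similar chunks
    return [chunks[i] for i in best]

# The retrieved chunks are then prepended to the prompt, for example:
#   prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
```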
Now let's imagine that a lot of your existing knowledge, which would be relevant to make available to the LLM as context, lives in your Amazon Redshift cluster.
By simply connecting your Amazon Redshift instance as a federated data source in your Databricks workspace, and using Amazon Bedrock to run one of the many foundational LLMs in your AWS account, you can quickly chunk and vectorize your data and use it to enrich your Amazon Bedrock LLM, all without leaving the Databricks workspace and without having to move the data out of Amazon Redshift in the first place! A sketch of what those steps might look like follows.
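This hedged sketch runs inside a Databricks notebook, reusing the hypothetical federated catalog from earlier and calling a Bedrock embedding model through boto3. Table, column, region, and model names are assumptions, not prescriptions:

```python
# A sketch of chunking and vectorizing federated Redshift data with an
# Amazon Bedrock embedding model, run from a Databricks notebook.
# Catalog, table, column, region, and model names are hypothetical.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed(text: str) -> list[float]:
    # Invoke a Bedrock embedding model (here, Amazon Titan Text Embeddings).
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]

def chunk(text: str, size: int = 1000) -> list[str]:
    # Naive fixed-size chunking; real pipelines often split on structure.
    return [text[i:i + size] for i in range(0, len(text), size)]

# Read the knowledge base in place via the federated catalog;
# no separate pipeline or persisted copy is needed first.
rows = spark.sql("SELECT doc_id, body FROM redshift_kb.public.articles").collect()

embeddings = [
    {"doc_id": row.doc_id, "chunk": piece, "vector": embed(piece)}
    for row in rows
    for piece in chunk(row.body)
]
# `embeddings` can now be loaded into a vector index (for example,
# Databricks Vector Search) to serve retrieval for the Bedrock LLM.
```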
The cherry on top: Tight AWS integration
When you are on AWS, access to all these features comes through very streamlined integrations. Let’s look at a couple of examples:
Integrations with a broad set of AWS services
Databricks offers native, tight integrations with a broad set of AWS data and analytics services, which means effortless access to connect to, consume, and work with the services that will eventually be targets for your shiny new production ML deployments, across the board: from data storage to analytics to machine-learning runtimes and development environments like Amazon SageMaker.
AWS Marketplace deployment
Databricks is available in AWS Marketplace, including a free trial. There are numerous benefits to getting Databricks through AWS Marketplace, including integrated billing on your AWS bill.
But the real value of AWS Marketplace deployment is the fully automated configuration process for the Databricks Data Intelligence Platform, which handles all the actions required to connect your Databricks and AWS accounts, including all required permissions.
You also get to create your workspace automatically as part of your AWS Marketplace-initiated deployment.
So, what do you get with Databricks and AWS?
You get consolidated workflows and an elegant and comprehensive experience for all the different engineering skills required in the process of building and working with machine learning, without the need to move data around in the early stages, and with tight and efficient integrations across your entire set of AWS services!
Be on the lookout for a hands-on lab where we’ll apply all these concepts to the use case of RAG, LLMs, and Amazon Bedrock we briefly touched on above. We’ll guide you step by step through the process of building a fully functional solution!
Try Databricks Data Intelligence Platform in AWS Marketplace so you can start exploring right away!
Get hands on
Why AWS Marketplace?
Try SaaS products free with your AWS account to establish your proof of concept, then pay as you go in production with AWS Billing.