AWS Startups Blog
Modern Data Integration with Mortar and Redshift
Mortar is a robust platform seamlessly joining the best data technologies so that you can develop quickly on a rock-solid foundation. Users can connect any data source, apply any transformations or algorithms, and then, with one command, ship the entire scalable, robust workload to production. Mortar automates away boilerplate tasks around infrastructure, configuration, multitech integration, and monitoring, getting high-value projects into production in just days or weeks, rather than months or years.
Where We’re Coming From
I was an early employee at a New York City educational technology company called Wireless Generation. The company was extremely successful: We built some really interesting products there, and Wireless was eventually acquired for $400 million.
But my colleagues and I had this nagging problem. We found that it was frustratingly inefficient to use large data sets about student learning to do powerful modeling and analysis.
Around the same time, new technologies were arising in the nascent Hadoop ecosystem that harnessed distributed computing to massively parallelize complex data processing tasks like the ones we were running on student data. But without deep technical expertise in these new technologies, the barrier to entry was too high for most engineers.
We decided to change that. We started Mortar Data in 2011 to provide engineers and data scientists with a platform that would allow them to easily and instantly access the best data technologies — no setup and config hassles, no infrastructure headaches, and no crossing fingers and toes in hopes that everything goes off without a hitch instead of being swallowed up by an inscrutable error.
Big, Messy Data
Data, as we all know, is everywhere. And that’s a problem.
If you’re running an application in production, you’re capturing all kinds of data about the actions users take in your app. You’ve got data from your website. Data in your CRM. Data from the services you use to communicate with your users. Maybe even a little data under the couch cushions. You get my point. It’s all over the place.
The problem is even worse for Fortune 1000s and other large companies, which not only have tremendous volumes of data but usually have it spread across dozens of silos. Some of them are no longer active — they’re just sitting there, gathering dust.
Each of these data sources is useful to a point, but many big, important, direction-setting questions just can’t be answered without having all your data in an integrated, accessible data store. Unfortunately, stitching things together by hand is nigh on impossible, especially when you’re dealing with large volumes of data.
Cleaning Up the Mess
To keep costs low and operations simple, we leverage AWS to provide our customers with effectively limitless computational power on demand. Amazon Redshift, Amazon’s on-demand data warehouse, provides an ideal way to do large-scale, integrated reporting and analysis, either with ad hoc queries or through a BI tool that serves as a graphical interface. It’s fast, it’s available on demand with no upfront commitment, and it scales with ease. So it’s no wonder that Redshift, which was publicly released in 2013, is now AWS’s fastest-growing service ever.
The first time we used Redshift to analyze some data collected from our own web app, we realized two things: First, Redshift is just as awesome as we had heard; and second, Mortar is the perfect way to load a Redshift data warehouse.
As anyone with the word “data” in their job title knows, cleaning and processing data is a huge part of the job. Very rarely is data generated in a form that’s ready for use. It needs to be standardized, processed, cleaned up, and pared down to the fields that matter.
Among other technologies, Mortar’s platform runs Apache Pig, which executes simple, readable, stepwise data processing scripts as distributed MapReduce jobs (on Amazon’s EMR service). Pig’s data-flow language is extremely efficient for transforming data, which makes it ideal for taking in messy, raw data from any source and producing clean, preprocessed data that’s ready for integration. So we saw right away that our customers could integrate numerous data sources using Pig scripts that piped clean data into Redshift.
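To make the "standardize, clean, pare down" step concrete, here is a minimal sketch in plain Python rather than Pig. It is an illustration only: the field names (`user_id`, `userId`, `event`, `timestamp`) and the helper names are hypothetical, and a real Pig script would express the same flow as a series of load, filter, and transform statements.

```python
from datetime import datetime, timezone


def clean_record(raw):
    """Return a standardized record, or None if the record is unusable.

    The field names here are hypothetical examples, not a Mortar schema.
    """
    # Sources often disagree on key names, so accept either spelling.
    user_id = str(raw.get("user_id") or raw.get("userId") or "").strip()
    event = str(raw.get("event") or "").strip().lower()
    ts = raw.get("timestamp")
    if not user_id or not event or ts is None:
        return None  # drop rows that are missing a required field
    try:
        # Standardize epoch seconds to an ISO 8601 string in UTC.
        when = datetime.fromtimestamp(float(ts), tz=timezone.utc)
    except (TypeError, ValueError, OverflowError, OSError):
        return None
    # Keep only the fields that matter; everything else is discarded.
    return {"user_id": user_id, "event": event, "timestamp": when.isoformat()}


def clean_all(raw_records):
    """Clean every record and drop the ones that could not be repaired."""
    cleaned = (clean_record(r) for r in raw_records)
    return [r for r in cleaned if r is not None]
```

Each source would get its own small cleaning step like this one, so that everything arriving in the warehouse shares one shape.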
But that’s only half the battle. To be really useful, a data warehouse must stay up to date as new data rolls in, ideally with a minimum of upkeep and manual operation. That’s where another part of the Mortar platform comes in. Luigi, which was developed and open-sourced by Spotify and is now in use at many companies (including Stripe, Capital One, Asana, and Foursquare), is a framework for orchestrating multistep data processing jobs. With Luigi and Mortar, an engineer can automate a data pipeline that spans multiple technologies and has many interdependent steps.
For example, if task B depends on task A, you’ll want the pipeline to trigger task A and, once it’s complete, trigger task B. You can also schedule the entire pipeline to run periodically or continuously.
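The "A before B" rule can be sketched in a few lines of plain Python. This is not Luigi's API, just an illustration of dependency-ordered execution; `run_pipeline` and the task names are hypothetical.

```python
def run_pipeline(tasks, dependencies):
    """Run every task after the tasks it depends on.

    tasks: dict mapping a task name to a zero-argument callable.
    dependencies: dict mapping a task name to the names it depends on.
    Returns the order in which the tasks ran.
    """
    order = []
    in_progress = set()

    def run(name):
        if name in order:
            return  # already ran, nothing to do
        if name in in_progress:
            raise ValueError(f"circular dependency at {name!r}")
        in_progress.add(name)
        # Run everything this task depends on first.
        for upstream in dependencies.get(name, ()):
            run(upstream)
        tasks[name]()
        in_progress.discard(name)
        order.append(name)

    for name in tasks:
        run(name)
    return order
```

Calling `run_pipeline({"B": load, "A": extract}, {"B": ["A"]})` runs `extract` before `load` even though B is listed first, because the order comes from the declared dependencies rather than from the order in which the tasks were written down.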
That means that you can easily execute modular Pig scripts, each of them processing data from a different source, and feed all the data into Redshift automatically on a regular basis. Such pipelines sound complex but are actually extremely resilient: If one part of the pipeline fails for any reason, Mortar can automatically retry it. Luigi will resume work on the pipeline where it left off, saving both time and compute costs. Plus we’ve built in comprehensive monitoring and alerting to save you from sleepless nights. We’re a bunch of pager-carrying engineers, so we know how important that is!
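Retry and resume can be illustrated with another small sketch in plain Python. It assumes each finished step leaves behind a marker file, which is similar in spirit to the way Luigi treats a task as complete once its output exists. The function name, the `.done` markers, and the `max_attempts` parameter are all hypothetical and are not Mortar's or Luigi's actual implementation.

```python
import os


def run_with_resume(steps, marker_dir, max_attempts=3):
    """Run steps in order, skipping finished ones and retrying failures.

    steps: list of (name, zero-argument callable) pairs, in dependency order.
    marker_dir: directory where a marker file records each finished step.
    Returns the names of the steps that actually ran during this call.
    """
    os.makedirs(marker_dir, exist_ok=True)
    ran = []
    for name, action in steps:
        marker = os.path.join(marker_dir, name + ".done")
        if os.path.exists(marker):
            continue  # finished in an earlier run, so resume past it
        for attempt in range(1, max_attempts + 1):
            try:
                action()
                break
            except Exception:
                if attempt == max_attempts:
                    raise  # give up; markers for earlier steps stay in place
        # Record success only after the step has completed.
        with open(marker, "w") as handle:
            handle.write("ok")
        ran.append(name)
    return ran
```

If a step exhausts its attempts, the markers for the steps before it remain, so the next call picks up at the failed step instead of redoing the work that already succeeded.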
Success Story
When you work at a startup, you never know just where your work will lead. When we first started Mortar, for instance, Redshift didn’t even exist, so we had no idea that what we were building would mesh so well with it. And when we did build out our Redshift functionality, we didn’t know who would use it or what kinds of new discoveries it would open up for our customers.
So, about a month ago, we were thrilled to read a blog post written by Michael Erasmus, an engineer at Buffer, one of our customers. Buffer had been “drowning in data” before quickly building a new architecture using Mortar that feeds data into Redshift on an ongoing basis.
With Looker, a graphical BI tool, running on top of Redshift, all of a sudden Buffer’s data was instantly available to everyone at the company who needed it. Erasmus says that even team members who were less technical “were really quick to take up Looker and satisfy their own data needs, coming up with amazing insights really quickly.”
With Redshift, anyone at Buffer can now analyze 500 million records in seconds instead of waiting for someone on the data team to write them a custom query. That’s a huge bottleneck that they’ve removed from their metrics and analytics process, and it should help them serve their customers even better. We’re proud to have built something that has helped them to do so.
Going Forward
Today our customers use Mortar to generate recommendations, to run predictive analytics, to build machine learning models, and to integrate multiple data sources into a central, accessible, easy-to-query data warehouse using Amazon Redshift. With tools like Redshift, we are moving ahead with our mission to free our customers from spending 90% of their time on boilerplate tasks so they can spend 100% of their time solving interesting problems specific to their business.