What is Apache Flink?
Apache Flink is an open-source, distributed engine for stateful processing over unbounded (streams) and bounded (batches) data sets. Stream processing applications are designed to run continuously, with minimal downtime, and process data as it is ingested. Apache Flink is designed for low latency processing, performing computations in-memory, for high availability, removing single point of failures, and to scale horizontally.
Apache Flink’s features include advanced state management with exactly-once consistency guarantees, event-time processing semantics with sophisticated out-of-order and late data handling. Apache Flink has been developed for streaming-first, and offers a unified programming interface for both stream and batch processing.
Why would you use Apache Fink?
Apache Flink is used to build many different types of streaming and batch applications, due to the broad set of features.
Some of the common types of applications powered by Apache Flink are:
- Event-driven applications, ingesting events from one or more event streams and executing computations, state updates or external actions. Stateful processing allows implementing logic beyond the Single Message Transformation, where the results depend on the history of ingested events.
- Data Analytics applications, extracting information and insights from data. Traditionally executed by querying finite data sets, and re-running the queries or amending the results to incorporate new data. With Apache Flink, the analysis can be executed by continuously updating, streaming queries or processing ingested events in real-time, continuously emitting and updating the results.
- Data pipelines applications, transforming and enriching data to be moved from one data storage to another. Traditionally, extract-transform-load (ETL) is executed periodically, in batches. With Apache Flink, the process can operate continuously, moving the data with low latency to their destination.
How does Apache Flink work?
Flink is a high throughput, low latency stream processing engine. A Flink application consists of an arbitrary complex acyclic dataflow graph, composed of streams and transformations. Data is ingested from one or more data sources and sent to one or more destinations. Source and destination systems can be streams, message queues, or datastores, and include files, popular database and search engines. Transformations can be stateful, like aggregations over time windows or complex pattern detection.
Fault tolerance is achieved by two separate mechanisms: automatic and periodic checkpointing of the application state, copied to a persistent storage, to allow automatic recovery in case of failure; on-demand savepoints, saving a consistent image of the execution state, to allow stop-and-resume, update or fork your Flink job, retaining the application state across stops and restarts. Checkpoint and savepoint mechanisms are asynchronous, taking a consistent snapshot of the state without “stopping the world”, while the application keeps processing events.
What are the benefits of Apache Flink?
Process both unbounded (streams) and bounded (batches) data sets
Apache Flink can process both unbounded and bounded data sets, i.e., streams and batch data. Unbounded streams have a start but are virtually infinite and never end. Processing can theoretically never stop.
Bounded data, like tables, are finite and can be processed from the beginning to the end in a finite time.
Apache Flink provides algorithms and data structures to support both bounded and unbounded processing through the same programming interface. Applications processing unbounded data runs continuously. Applications processing bounded data will end their execution when reaching the end of the input data sets.
Run applications at scale
Apache Flink is designed to run stateful applications at virtually any scale. Processing is parallelized to thousands of tasks, distributed multiple machines, concurrently.
State is also partitioned and distributed horizontally, allowing to maintain several terabytes across multiple machines. State is checkpointed to a persistent storage incrementally.
In-memory performance
Data flowing through the application and state are partitioned across multiple machines. Hence, computation can be completed by accessing local data, often in-memory.
Exactly-once state consistency
Applications beyond single message transformations are stateful. The business logic needs to remember events or intermediate results. Apache Flink guarantees consistency of the internal state, even in case of failure and across application stop and restart. The effect of each message on the internal state is always applied exactly-once, regardless the application may receive duplicates from the data source on recovery or on restart.
Wide range of connectors
Apache Flink has a number of proven connectors to popular messaging and streaming systems, data stores, search engines, and file system. Some examples are Apache Kafka, Amazon Kinesis Data Streams, Amazon SQS, Active MQ, Rabbit MQ, NiFi, OpenSearch and ElasticSearch, DynamoDB, HBase, and any database providing JDBC client.
Multiple levels of abstractions
Apache Flink offers multiple level of abstraction for the programming interface. From higher level streaming SQL and Table API, using familiar abstractions like table, joins and group by. The DataStream API offers a lower level of abstraction but also more control, with the semantics of streams, windowing and mapping. And finally, the ProcessFunction API offers fine control on the processing of each message and direct control of the state. All programming interfaces work seamlessly with both unbounded (streams) and bounded (tables) date sets. Different levels of abstractions can be used in the same application, as the right tool to solve each problem.
Multiple programming languages
Apache Flink can be programmed with multiple languages, from the high level streaming SQL to Python, Scala, Java, but also other JVM languages like Kotlin.
What are Apache Flink use cases?
Apache Flink use cases include:
-
Fraud detection, anomaly detection, rule-based alerting, real-time UX personalization are examples of use cases for event-driven application. Flink is a perfect fit for all these use cases that require processing streams of events in a stateful manner, considering the evolution over time, detecting complex patterns, or calculating statistics over time windows to detect deviations from expected thresholds.
-
Quality monitoring, ad-hoc analysis of live data, clickstream analysis, product experiment evaluation are streaming analytics use cases that Flink can efficiently support. Leveraging the high level of abstraction of SQL or Table API programming interface, you can run the same analytics on both streaming live data and batches of historical data.
-
Monitoring file system and writing data into a log, materializing an event stream to a database, incrementally building and refining a search index, are use cases efficiently supported by continuous ETL. Leveraging the wide set of connectors, Flink can directly read from several types of data stores, ingest streams of change events, and even capture changed directly. With continuously ingesting and processing the changes, and updating the destination systems directly, Flink can reduce the delay of the data synchronisation to seconds or less.
Who uses Apache Flink?
NortonLifeLock
NortonLifeLock is a global cybersecurity and internet privacy company that offers services to millions of customers for device security, and identity and online privacy for home and family.
NortonLifeLock offers a VPN product as a freemium service to users. Thus they need to enforce usage limits in real time to stop freemium users from using the service when their usage is over the limit. The challenge for NortonLifeLock is to do this in a reliable and affordable fashion.
NortonLifeLock simplified the implementation of user and device-level aggregation adopting Apache Flink.
Samsung SmartThings
As an independent subsidiary of Samsung, SmartThings is one of the leading IoT ecosystems in the world, creating the most effortless way for anyone to create a smart home.
Samsung SmartThings were running into issues like having the resources reserved to individual applications. This caused a delay and performance degradation while processing data. It eventually led them to high costly overhead at maintaining workloads in operations. They had to re-architect the data platform.
They moved from Apache Spark to Apache Flink.
BT Group
BT Group is the UK’s leading telecommunications and network provider and a leading provider of global communications services and solutions, serving customers in 180 countries. Its principal activities in the UK include the provision of fixed voice, mobile, broadband, and TV (including Sport), and a range of products and services over converged fixed and mobile networks to consumer, business, and public sector customers.
BT needed a service-monitoring application to support the rollout of Digital Voice, its new consumer product enabling high-definition voice calling over its UK broadband network.
BT built an event-driven analytics service using Apache Flink, to ingest, process, and visualize service data.
Autodesk
Autodesk, a leading provider of 3D design and engineering software, wants to do more than create and deliver software. It also wants to ensure its millions of global users have the best experience running that software.
Autodesk makes software for people who make things. They serve 200+ million customers. They needed to eliminate silos to find and fix customer issues faster. They wanted a consistent way to collect and measure metrics with a small operations team without escalating costs or creating data lock-in.
NHL
The National Hockey League is the second-oldest of the four major professional team sports leagues in North America. Today, the NHL consists of 32 Member Clubs, each reflecting the League’s international makeup, with players from more than 20 countries represented on team rosters.
NHL was facing several technical challenges like determining the features required and modeling methods to predict an event that has a large amount of uncertainty, and determining how to use streaming PPT sensor data to identify where a face-off is occurring, the players involved, and the probability of each player winning the face-off, all within hundreds of milliseconds.
Leveraging Apache Flink, NHL was able not just to predict the winner of a face-off, but to build a foundation for solving a number of similar problems in a real-time and cost-efficient way.
Poshmark
Poshmark is a leading social marketplace for new and secondhand style for women, men, kids, pets, home, and more. Their community of more than 80 million people across the US, Canada, Australia, and India is shaping the future of shopping to be simple, social, and sustainable.
Poshmark has been focusing on achieving top-line growth through personalization and enhancing user experience. The initial approach of using batch processing for personalization and security did not meet expectations for customer experience improvement.
Poshmark designed real-time personalization using real-time data enrichment with Apache Flink.
How can AWS help run Apache Flink applications in the cloud?
Amazon Manages Service for Apache Flink is a fully managed solution to run Apache Flink applications. Amazon Managed Service for Apache Flink reduces the complexity of building, managing, and integrating Apache Flink applications with other AWS services. With Amazon Managed Service for Apache Flink, there are no servers to mange, no minimum fee or setup cost. The setup is highly available by default. The application state is fully managed and stored to a high durability backend for fault tolerance. The application is controlled with a simple API, to stop, start, configure and scale the application.
Amazon Managed Service for Apache Flink Studio offers an interactive, notebook interface to Apache Flink. Using an Apache Zeppelin notebook, you can run SQL, Python and Scala code on Apache Flink, for development and experimentation, data inspection or visualization.
Amazon EMR also supports Apache Flink as a YARN application so that you can manage resources along with running other applications within the cluster.
Apache Flink natively supports Kubernetes. You can self-host Apache Flink in a containerized environment like Amazon Elastic Kubernetes Service (Amazon EKS) or completely manage it yourself using Amazon Elastic Compute Cloud (Amazon EC2).
Get started with Apache Flink on AWS by creating an account today.