Create Fan360 data products on AWS

Introduction

The blog post Better understand sports fan data with Fan360 on AWS introduces how a Fan360 data mesh architecture can help sports entities enrich insights through data collaboration within the organization and with third parties, partners, and sponsors. These insights and collaborations can open new avenues for monetization. Creating an effective data product in your Fan360 data mesh involves three stages: data ingestion, storage and governance, and cleaning and processing.

Data ingest

The first step to solving any puzzle is to make sure you have all of the requisite pieces. For sports entities, this means ensuring all relevant inputs from first and third parties are collected in the same domain and are available to build data products as needed. Amazon Web Services (AWS) offers a multitude of natively integrated ingestion services. Whether using Amazon AppFlow to transfer merchandise sales data from a third-party SaaS application, Amazon Kinesis to gather content viewership data from streaming services, or Amazon API Gateway to request ticketing data from a third party, sports entities can connect to relevant data sources efficiently. AWS Data Exchange enhances collaboration between sports entities by allowing them to ingest additional third-party datasets, published by other leagues or teams, that may help them better understand a common audience.
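As an illustration of the streaming path, here is a minimal sketch of how a viewership event could be shaped for Amazon Kinesis Data Streams. The stream name `fan360-viewership` and the event fields are hypothetical. `send_viewership_event` assumes it is handed a client created with `boto3.client("kinesis")`, whose `put_record` call takes `StreamName`, `Data`, and `PartitionKey`.

```python
import json
from datetime import datetime, timezone


def build_viewership_record(fan_id: str, content_id: str, seconds_watched: int) -> dict:
    """Build the keyword arguments for a Kinesis PutRecord call."""
    event = {
        "fan_id": fan_id,
        "content_id": content_id,
        "seconds_watched": seconds_watched,
        "event_time": datetime.now(timezone.utc).isoformat(),
    }
    return {
        "StreamName": "fan360-viewership",  # hypothetical stream name
        "Data": json.dumps(event).encode("utf-8"),
        # Partitioning by fan keeps one fan's events on the same shard, in order.
        "PartitionKey": fan_id,
    }


def send_viewership_event(kinesis_client, fan_id: str, content_id: str, seconds_watched: int):
    """Send one event; kinesis_client is assumed to be boto3.client("kinesis")."""
    record = build_viewership_record(fan_id, content_id, seconds_watched)
    return kinesis_client.put_record(**record)
```

Keeping the payload builder separate from the network call makes the event shape easy to unit test without AWS credentials.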

Store and govern

Amazon Simple Storage Service (Amazon S3) provides a highly scalable, reliable, and secure object storage for your fan data. With cost-effective storage classes and easy-to-use management features, you can optimize costs, organize data, and configure fine-grained access controls to meet specific business, organizational, and compliance requirements. Whether you experience a steady influx or sudden spikes in data volume, Amazon S3 seamlessly scales to handle increased demand without compromising performance. This scalability ensures your Fan360 data mesh remains responsive and adaptable over time.
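To make the cost optimization point concrete, the following sketch builds an S3 lifecycle rule that moves older raw data to cheaper storage classes. The prefix, rule ID format, and day thresholds are assumptions chosen for illustration. `apply_lifecycle` expects a client created with `boto3.client("s3")`.

```python
def build_lifecycle_configuration(prefix: str,
                                  ia_after_days: int = 30,
                                  glacier_after_days: int = 365) -> dict:
    """Build a lifecycle configuration that tiers objects under a prefix."""
    return {
        "Rules": [
            {
                # Hypothetical naming scheme for the rule ID.
                "ID": "tier-" + prefix.strip("/").replace("/", "-"),
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


def apply_lifecycle(s3_client, bucket: str, prefix: str):
    """Apply the rule; s3_client is assumed to be boto3.client("s3")."""
    return s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=build_lifecycle_configuration(prefix),
    )
```

Note that this call replaces the bucket's existing lifecycle configuration, so in practice all rules for a bucket would be assembled into one request.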

The right level of governance is paramount for controlling data access at scale. This involves controlling not only how data products are exposed to consumers, but also how different data producer teams access the input data they need to evolve and refine those products.

AWS Lake Formation allows you to easily define fine-grained access control to all data products in your organization. You can register Amazon S3 data store locations in Lake Formation data catalogs, and centrally define security, governance, and auditing policies in one place, using Lake Formation tags to scale access policies. You can use Lake Formation to manage a central data catalog, and allow each data domain owner to manage their own catalogs. Lake Formation offers the ability to enforce data governance within each data domain and across domains, to ensure data is easily discoverable and secure, lineage is tracked and data access can be audited. Data products in each domain can evolve over time and be processed with different technical solutions, but access is governed by a federated security model that can be administered centrally, providing best practices for security and compliance, while allowing agility within the domain.
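A minimal sketch of the tag-based approach described above, assuming a client created with `boto3.client("lakeformation")`. The tag key `domain`, the value `fan360`, and the ARNs passed in are hypothetical. The sketch registers an S3 location, creates an LF-Tag, and grants a principal `SELECT` and `DESCRIBE` on every table carrying that tag.

```python
def build_tag_based_grant(principal_arn: str, tag_key: str, tag_values) -> dict:
    """Build a GrantPermissions request scoped by an LF-Tag expression."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "LFTagPolicy": {
                "ResourceType": "TABLE",
                "Expression": [{"TagKey": tag_key, "TagValues": list(tag_values)}],
            }
        },
        "Permissions": ["SELECT", "DESCRIBE"],
    }


def set_up_domain_governance(lf_client, s3_arn: str, principal_arn: str) -> None:
    """Register storage, define a tag, and grant access by tag.

    lf_client is assumed to be boto3.client("lakeformation").
    """
    # Let Lake Formation manage access to this S3 location.
    lf_client.register_resource(ResourceArn=s3_arn, UseServiceLinkedRole=True)
    # Hypothetical tag identifying the Fan360 data domain.
    lf_client.create_lf_tag(TagKey="domain", TagValues=["fan360"])
    lf_client.grant_permissions(
        **build_tag_based_grant(principal_arn, "domain", ["fan360"])
    )
```

The grant only takes effect for tables that have the tag assigned, which is a separate step (`add_lf_tags_to_resource`). New tables tagged later are covered without further grants, which is what lets tag-based policies scale.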

Lake Formation uses credential vending to provide short-term federated access to AWS analytics services, such as Amazon Athena, Amazon Redshift Spectrum, Amazon EMR, AWS Glue, or Amazon QuickSight. These services can query data on behalf of the calling user or principal. When granting permissions, data administrators don’t need to update their Amazon S3 bucket policies or IAM policies, and they don’t need to provide direct access to Amazon S3. They only define access policies on the tables and databases registered with Lake Formation. Figure 1 depicts how users of trusted analytics services can access underlying data.


Fig. 1 – Lake Formation Storage API functional flow

1. A principal (user) submits a query or data request for a table through a trusted integrated service like Amazon Athena, Amazon EMR, Amazon Redshift Spectrum, or AWS Glue.
2. The integrated service asks Lake Formation whether the principal is authorized for the table and requested columns. If the user is not authorized, Lake Formation denies access to the data and the query fails.
3. After authorization succeeds, and if storage authorization is turned on for the table and user, the integrated service retrieves temporary credentials from Lake Formation to access the data.
4. The integrated service uses the temporary credentials from Lake Formation to request objects from Amazon S3.
5. Amazon S3 returns the requested objects to the integrated service. These objects contain all the data from the Lake Formation table.
6. The integrated service enforces the Lake Formation policies, such as column-level, row-level, or cell-level filtering, then processes the query and returns the results to the user.
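The flow above can be sketched as a toy, in-memory model. This is not how Lake Formation is implemented. It only illustrates the order of operations: authorization is checked first, the full objects are then read from storage, and the integrated service applies column and row filtering before returning results. All names here (`run_query`, `policies`, `storage`) are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

# (principal, table) -> (allowed columns, row predicate).
# A toy stand-in for Lake Formation grants and data filters.
Policy = Tuple[Set[str], Callable[[dict], bool]]


def run_query(principal: str,
              table: str,
              requested_columns: List[str],
              policies: Dict[Tuple[str, str], Policy],
              storage: Dict[str, List[dict]]) -> List[dict]:
    """Simulate an integrated service answering a query for a principal."""
    # Authorization check: deny if there is no grant or a column is not allowed.
    policy = policies.get((principal, table))
    if policy is None:
        raise PermissionError(f"{principal} has no access to {table}")
    allowed_columns, row_filter = policy
    denied = set(requested_columns) - allowed_columns
    if denied:
        raise PermissionError(f"columns not permitted: {sorted(denied)}")

    # Credential vending and the S3 read are collapsed into a dictionary
    # lookup here; note that storage returns every row and every column.
    rows = storage[table]

    # Enforcement happens in the service: filter rows, then project columns.
    return [
        {column: row[column] for column in requested_columns}
        for row in rows
        if row_filter(row)
    ]
```

For example, an analyst granted only `fan_id` and `region` for European fans would see just those columns and rows, even though the underlying objects also hold email addresses and fans from other regions.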

This model allows data administrators to define access policies to securely share data across the organization in one place, reducing configuration complexity and allowing flexible changes over time. Please refer to official documentation to learn more about how Lake Formation works with other trusted services.

In the context of the Fan360 data mesh, data domain administrators can use Lake Formation to easily define which internal team can access which input data sources and also how to share data products from the domain with other parts of the organization.
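As a sketch of what such a definition might look like, the helper below builds a Lake Formation grant that limits a team to specific columns of one input table. The database, table, column names, and role ARN in the usage note are hypothetical. The resulting dictionary is intended to be passed to `grant_permissions` on a client created with `boto3.client("lakeformation")`.

```python
def build_column_grant(principal_arn: str, database: str, table: str, columns) -> dict:
    """Build a GrantPermissions request restricted to named columns."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        # SELECT only: the team can read these columns but not alter the table.
        "Permissions": ["SELECT"],
    }
```

A domain administrator might call `lf_client.grant_permissions(**build_column_grant(marketing_role_arn, "fan360_raw", "ticketing", ["fan_id", "seat_section"]))` to give a marketing team read access to two columns while keeping payment details out of reach.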

Clean and process

So far, we’ve set up ingestion pipelines using purpose-built AWS services, and all of the ingested data lands in Amazon S3 in its original format. Thinking of your data as a logical product is a key principle of the data mesh architecture. Data mesh uses a publisher and subscriber model, allowing data domains within an organization to share data products with each other.

Defining different data products from diverse data sources within the Fan360 domain may require different tools and technologies to process and refine data. Incoming data from different sources might have different sizes, formats, and preparation requirements. AWS provides a broad set of native services that let different personas define data products efficiently, according to their requirements.

AWS Glue is a serverless data integration service that makes it easy for analytics users to discover, prepare, move, and integrate data from multiple sources. AWS Glue allows you to clean, prepare and process data at scale, consolidating major data integration capabilities, including data discovery, modern Extract, Transform, Load (ETL) procedures, cleansing, transforming, and centralized cataloging.

AWS Glue capabilities allow you to discover and catalog incoming sports fan data stored in Amazon S3, using Glue Crawlers to automatically infer schema from Amazon S3 objects, and populate corresponding data catalog tables in Lake Formation. AWS Glue offers different tools for data preparation and ETL tasks, allowing users with different levels of data analytics and data engineering skills to efficiently work with data.
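A minimal sketch of setting up such a crawler, assuming a client created with `boto3.client("glue")`. The crawler name, IAM role ARN, database name, and S3 path are placeholders you would supply.

```python
def build_crawler_request(name: str, role_arn: str, database: str, s3_path: str) -> dict:
    """Build the arguments for a Glue CreateCrawler call over one S3 path."""
    return {
        "Name": name,
        "Role": role_arn,          # IAM role the crawler assumes to read S3
        "DatabaseName": database,  # catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(glue_client, request: dict) -> None:
    """Create the crawler and run it once.

    glue_client is assumed to be boto3.client("glue").
    """
    glue_client.create_crawler(**request)
    glue_client.start_crawler(Name=request["Name"])
```

Once the crawler finishes, the inferred tables appear in the catalog database and can be governed with the Lake Formation permissions described earlier.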

Data engineers can use AWS Glue to define jobs based on their own ETL scripts designed for Apache Spark, Ray runtime environments or general-purpose Python. Using AWS Glue Studio Notebooks to develop their jobs, engineers can also take advantage of the integration with Amazon CodeWhisperer AI companion to increase their efficiency.
ETL developers can use AWS Glue Studio to create ETL jobs from a visual interface, with more than 250 built-in transformations, without writing any code. The AWS Glue Studio job monitoring dashboard provides a global view of ETL execution and resource usage.
Business analysts and data scientists can take advantage of a visual data preparation tool like AWS Glue DataBrew, which allows you to efficiently prepare data for analytics and machine learning use cases without writing any code. Data preparation and transformation steps, called recipes, can also be published and made available for other users to be included in their visual ETL flows in AWS Glue Studio.
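To give a flavor of the cleaning logic such a job might contain, here is a small sketch in general-purpose Python. The record fields (`email`, `name`) and the rules (normalize case and whitespace, drop records without a plausible email, deduplicate by email) are assumptions for illustration, not a prescribed Fan360 schema. A Spark-based job would express the same steps with DataFrame operations.

```python
from typing import Iterable, List


def clean_fan_records(records: Iterable[dict]) -> List[dict]:
    """Normalize fan records and deduplicate them by email address."""
    cleaned = {}
    for record in records:
        email = (record.get("email") or "").strip().lower()
        if "@" not in email:
            # Skip records that cannot be matched to a fan.
            continue
        # Collapse repeated whitespace and normalize capitalization.
        name = " ".join((record.get("name") or "").split()).title()
        # Later records overwrite earlier ones for the same email.
        cleaned[email] = {"email": email, "name": name}
    return list(cleaned.values())
```

Running it on two differently formatted entries for the same fan plus one record with an invalid email yields a single normalized record.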

As the number of users in the data domain grows, managing the execution of different jobs at scale can become more complex. AWS Glue Blueprints allow different data domains to create role-specific, predefined, and reusable workflows, managed by Lake Formation to ensure each user can access only the data they are permitted to see. For example, a common workflow would be data format conversion, as described in the AWS documentation.

Conclusion

By combining data from different sources, sports entities can build a Fan360 data domain with refined data products that can be enriched and shared across the organization and with partners and sponsors. Think about combining ticketing data with merchandise data, or gathering insights on attendance at live games from your stadium. Having all of this data in an organized, secure location opens up many such possibilities. Check out AWS Sports case studies to learn how sports entities are revolutionizing fan engagement with data-driven insights on AWS.

AWS for M&E Blog