AWS for Industries
Powering AI/ML with MDIO seismic streaming from the OSDU® Data Platform
Introduction
Today, oil and gas operators are challenging their workforce to find more value in legacy data. Artificial Intelligence and Machine Learning (AI/ML) workflows powered by cloud-scale computing are unlocking new insights, but these technologies aren’t optimized for traditional SEG-Y data formats. The industry-standard OSDU Data Platform tackles initial data discovery and access, but users still struggle to unify their datasets efficiently. Using the OSDU Data Platform External Data Sources Data Management Service (EDS DMS) for continual, secure, and configurable data discovery, TGS and Amazon Web Services (AWS) have collaborated to demonstrate seismic metadata discovery and efficient seismic trace streaming, using the open-source Multi-Dimensional Input Output (MDIO) format, from the TGS Data Verse to the Energy Data Insights (EDI) on AWS for OSDU Data Platform environment. This collaborative effort, delivered with the support of 47Lining, Hitachi Digital Services, shows how users can quickly stream and integrate seismic libraries into an operator’s workflows without extensive code or large-scale data duplication. Geoscientists who want to apply AI/ML workflows to seismic volumes can stream the AI/ML-optimized MDIO seismic data with lossless compression, and use parallel processing to optimize the cost and performance of their workflows.
Seismic data management
TGS is a prominent global provider of geoscientific data and services for the energy industry, with a strong focus on seismic data management. Seismic data is crucial for exploration and production activities in the oil and gas sector, because it provides valuable insights into subsurface geological structures and potential hydrocarbon reservoirs. Effective seismic data management is essential for ensuring the integrity, accessibility, and efficient use of this valuable information.
Seismic data streaming refers to the real-time transmission of trace data from seismic libraries to processing and interpretation applications. This process continuously sends data over a network, enabling quick access to information and allowing for immediate insights into subsurface conditions. However, seismic data streaming needs substantial bandwidth and network infrastructure to avoid delays or data loss during transmission. Modern seismic data formats can help address seismic data streaming challenges.
Multi-Dimensional Input Output (MDIO)
MDIO is a fully open-source data storage format that enables computational workflows for various high-dimensional energy datasets such as seismic data and wind models. Designed to be efficient and flexible, MDIO provides cloud software infrastructure that is interoperable with existing energy data standards. MDIO’s Python Application Programming Interfaces (APIs) enable integration with common libraries such as NumPy, TensorFlow, and PyTorch, which are widely used for training and inference of machine learning and deep learning models. MDIO cuts storage costs by 41% using Zstd compression, and improves data access and management with features such as chunking for selective decompression, masking for targeted data retrieval, and on-demand streaming, powered by a high-performance Dask backend for better workload distribution. This allows MDIO-streamed seismic datasets to be used in AI/ML workflows with less preprocessing. A summary of MDIO, its technical characteristics, and its OSDU Forum support status is shown in the following figure.
Figure 1. MDIO format summary
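As a rough illustration of why chunking enables selective decompression, the sketch below splits a small synthetic volume into independently compressed chunks and then rebuilds one inline by decompressing only the chunks that contain it. It is not MDIO's implementation or API: the chunk shape, the function names `compress_chunks` and `read_inline`, and the use of Python's standard-library `zlib` as a stand-in for Zstd are all assumptions made to keep the example self-contained.

```python
import zlib

import numpy as np

# Hypothetical chunk shape: (inline, crossline, sample).
CHUNK = (16, 16, 32)


def compress_chunks(volume, chunk=CHUNK):
    """Split a 3D volume into chunks and compress each one independently."""
    store = {}
    for i in range(0, volume.shape[0], chunk[0]):
        for j in range(0, volume.shape[1], chunk[1]):
            for k in range(0, volume.shape[2], chunk[2]):
                block = np.ascontiguousarray(
                    volume[i:i + chunk[0], j:j + chunk[1], k:k + chunk[2]]
                )
                # zlib is lossless; it stands in for Zstd in this sketch.
                store[(i, j, k)] = (block.shape, zlib.compress(block.tobytes()))
    return store


def read_inline(store, inline, shape, dtype, chunk=CHUNK):
    """Rebuild one inline, decompressing only the chunks that contain it."""
    out = np.empty(shape[1:], dtype=dtype)
    i0 = (inline // chunk[0]) * chunk[0]  # first inline of the chunk row
    touched = 0
    for (i, j, k), (bshape, payload) in store.items():
        if i != i0:
            continue  # chunk does not intersect the requested inline
        block = np.frombuffer(zlib.decompress(payload), dtype=dtype)
        block = block.reshape(bshape)
        out[j:j + bshape[1], k:k + bshape[2]] = block[inline - i0]
        touched += 1
    return out, touched


rng = np.random.default_rng(0)
volume = rng.integers(-100, 100, size=(64, 48, 96)).astype(np.float32)

store = compress_chunks(volume)
section, touched = read_inline(store, 20, volume.shape, volume.dtype)

print(f"decompressed {touched} of {len(store)} chunks")
print("lossless round trip:", np.array_equal(section, volume[20]))
```

In this toy layout only a quarter of the chunks are decompressed to serve the inline, and the result is bit-identical to the source array. A real deployment would read the chunks from object storage rather than from an in-memory dictionary.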
In the case of seismic data visualization, the solution described previously enables web, desktop, and mobile applications to consume seismic data directly from the cloud. For this use case, users typically want to select and view 2D slices through large multi-dimensional seismic datasets. Strategies for making this process performant in all dimensions include creating transposed copies of the underlying data and employing concurrent multi-threaded or multi-process reads. Data compression is vital to minimizing storage costs and reducing bottlenecks in streaming data to a client application. The MDIO format supports these requirements, and the data quality can be tuned to the needs of the downstream application and the available bandwidth.
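The two strategies mentioned above, transposed copies and concurrent reads, can be sketched with NumPy and a thread pool. This is a minimal in-memory illustration under stated assumptions, not how MDIO or any viewer implements it: the array sizes, the `time_major` copy, and the `read_time_slice` helper are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

# Synthetic volume stored inline-major: (inline, crossline, sample).
volume = np.arange(40 * 30 * 50, dtype=np.float32).reshape(40, 30, 50)

# In this layout a time slice is a strided view that touches every trace.
strided_slice = volume[:, :, 10]
print("time slice contiguous in original:", strided_slice.flags["C_CONTIGUOUS"])

# A transposed copy stores samples first, so each time slice is contiguous.
time_major = np.ascontiguousarray(volume.transpose(2, 0, 1))
print("time slice contiguous in copy:", time_major[10].flags["C_CONTIGUOUS"])


def read_time_slice(k):
    """Return one time slice from the sample-major copy."""
    return time_major[k].copy()


# Fetch several slices concurrently. With remote storage each read would
# be I/O-bound, which is where a thread pool helps most.
with ThreadPoolExecutor(max_workers=4) as pool:
    slices = list(pool.map(read_time_slice, range(0, 50, 10)))

print("slices fetched:", len(slices))
```

The trade-off is storage: each transposed copy duplicates the volume, which is one reason compression matters for this use case.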
Seismic streaming with MDIO
Integrating MDIO format support with seismic libraries, together with data discovery and delivery through the EDS DMS of the OSDU Data Platform, enables you to harness your energy data to deliver faster insights, enhanced performance, and improved decision-making across your seismic data management environment. The EDS DMS allows for searching and fetching from external seismic data repositories. The EDI APIs enable data interchange through the industry-standard OSDU open-source specifications. This approach provides connectivity for data search and discovery and lets companies view all of their seismic data in one place. Data from corporate repositories and external libraries can be combined without the need for copying or re-indexing. Whether it is for ML, seismic imaging and processing, or real-time visualization, data in MDIO format is made available for on-demand access.
Overall solution
The overall solution of integrating EDI on AWS, EDS DMS, TGS Data Verse as a seismic data provider, and MDIO format support is shown in the following figure.
Figure 2. MDIO Seismic Streaming from the OSDU Data Platform
EDI on AWS is an AWS-supported, cloud-based offering of the OSDU Data Platform. It helps oil and gas customers manage the deployment, monitoring, management, scale, security, support, updates, and upgrades of the service so that they can focus on the value from the platform. EDI uses fully managed AWS services such as Amazon S3, Amazon DynamoDB, Amazon Elastic Kubernetes Service (Amazon EKS), and others to reduce costs, break down data silos, encourage innovation, and bring data together into one location. Having data in a trusted and centralized data repository such as EDI on AWS unlocks the potential for a variety of AI/ML workflows using cloud services such as Amazon SageMaker or Amazon Bedrock.
Powering AI/ML workflows
TGS has implemented automation in key areas of seismic data processing using MDIO and ML algorithms. The ML workflow focuses on denoising and deghosting. This process begins with Nav Merge data from the boat and produces cleaner, higher-quality seismic data for imaging faster than traditional methods. The TGS Imaging AnyWare processing software manages the ML models from testing to production using its HPC environment. Developing these robust ML models is enabled by the TGS global data library.
Another important area where TGS uses large datasets is SaltNet, an ML model designed for segmenting salt bodies in seismic images. It employs a U-Net architecture, a type of 3D convolutional neural network (CNN), for semantic segmentation. Key features of the TGS implementation include data augmentation, customized loss functions to address class imbalance, and post-processing to refine results. The model aims to accurately identify salt structures, which are crucial to velocity model building in complex areas such as the Gulf of Mexico and Brazil.
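The article does not say which loss functions SaltNet uses. As one common example of a loss designed for class imbalance in segmentation, the sketch below implements a soft Dice loss in NumPy and compares it with plain voxel accuracy on a synthetic mask where salt occupies 1% of the volume. The function name `soft_dice_loss`, the smoothing term `eps`, and the toy data are assumptions for illustration only.

```python
import numpy as np


def soft_dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - 2|P*T| / (|P| + |T|), smoothed by eps."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    intersection = np.sum(pred * target)
    denom = np.sum(pred) + np.sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)


# Synthetic label volume: 10 salt voxels out of 1000 (1% salt).
target = np.zeros((10, 10, 10))
target[0, 0, :] = 1.0

# A degenerate prediction that labels every voxel as background.
all_background = np.zeros_like(target)

accuracy = np.mean(all_background == target)
print(f"voxel accuracy of all-background prediction: {accuracy:.2f}")
print(f"Dice loss of all-background prediction: "
      f"{soft_dice_loss(all_background, target):.4f}")
print(f"Dice loss of perfect prediction: "
      f"{soft_dice_loss(target, target):.4f}")
```

Voxel accuracy rewards the all-background prediction because background dominates, while the Dice loss penalizes it heavily because it is driven by overlap with the minority salt class.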
TGS is also developing seismic foundation models using a self-supervised technique: a Masked Autoencoder (MAE) with a 3D Vision Transformer (ViT) backbone. This model, which processes 3D seismic mini cubes, can categorize seismic features without labeled data, similar to the way large language models (LLMs) process text. Furthermore, a 3D data pipeline using MDIO is employed for fast data loading and augmentation, optimizing data engineering and allowing for faster iteration. Fine-tuning the model for interpolating missing seismic data and for salt segmentation demonstrates generalization to practical applications, showcasing the transformative potential of large transformer-based architectures in geophysical exploration.
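The input side of a masked autoencoder on 3D mini cubes can be sketched as follows: the cube is cut into non-overlapping 3D patches, a random subset is hidden, and only the visible patches would be passed to the encoder. This covers the masking step only, not the ViT model or TGS's pipeline. The cube size, patch size, mask ratio, and the helper names `patchify` and `random_mask` are hypothetical choices for illustration.

```python
import numpy as np


def patchify(cube, p):
    """Cut a (D, H, W) cube into flattened, non-overlapping p*p*p patches."""
    d, h, w = cube.shape
    if d % p or h % p or w % p:
        raise ValueError("cube dimensions must be divisible by the patch size")
    x = cube.reshape(d // p, p, h // p, p, w // p, p)
    x = x.transpose(0, 2, 4, 1, 3, 5)  # group the three patch-grid axes first
    return x.reshape(-1, p * p * p)


def random_mask(num_patches, mask_ratio, rng):
    """Return a boolean array where True marks a masked (hidden) patch."""
    num_masked = int(round(num_patches * mask_ratio))
    perm = rng.permutation(num_patches)
    mask = np.zeros(num_patches, dtype=bool)
    mask[perm[:num_masked]] = True
    return mask


rng = np.random.default_rng(0)
cube = rng.standard_normal((32, 32, 32)).astype(np.float32)  # synthetic mini cube

patches = patchify(cube, 8)                  # 4 * 4 * 4 = 64 patches
mask = random_mask(len(patches), 0.75, rng)  # hide 75% of the patches
visible = patches[~mask]                     # what an encoder would receive

print("patches:", patches.shape)
print("masked:", int(mask.sum()), "visible:", visible.shape[0])
```

During pre-training the decoder would be asked to reconstruct the hidden patches, which is what lets the model learn from unlabeled seismic data.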
Conclusion and benefits
The need to use data efficiently is ever-growing in today’s fast-paced business environments. It is crucial to spend less time on discovery across enterprise datasets and external data subscriptions while maintaining a complete catalog of the enterprise. After finding the data of interest, there is also a need to make it available for analysis and interpretation faster than ever before. This workflow powers the AI/ML models that depend on these datasets. Modern data formats such as MDIO, combined with the industry-standard OSDU Data Platform, are accelerating decision-making while accommodating the growing volume of seismic datasets in an efficient and cost-effective manner.