AWS for Industries

TC Energy innovates using AWS to improve document consistency and asset management

TC Energy operates one of North America’s largest energy infrastructure portfolios, delivering the energy that millions of people rely on to power their lives in a sustainable way.

TC Energy’s portfolio includes three complementary energy infrastructure businesses:

  • A 93,300 km network of natural gas pipelines that supplies more than 25 percent of the daily clean-burning natural gas demand across North America
  • A 4,900 km liquids pipeline system that connects growing continental oil supplies to key markets and refineries
  • Power generation facilities with combined capacity of approximately 4,300 MW—enough to power more than four million homes

TC Energy’s core values of safety, innovation, responsibility, collaboration, and integrity demand a disciplined, consistent, proactive, and systematic approach to risk management. Protecting people and over $100 billion of assets is a part of risk management that involves tens of thousands of documents, including contracts, engineering standards, and operating procedures. The integrity and accuracy of these documents are managed by TC Energy’s Engineering Governance team. To avoid conflict and duplication across these documents, the team used a time-consuming manual process of comparing all the clauses within them each time a document was created or revised. This repetitive manual process was also prone to error.

To improve this manual, error-prone process and increase the accuracy of detecting duplicates and conflicts, TC Energy turned to Amazon Web Services (AWS) and decided to use natural-language processing (NLP). A custom tool was developed on AWS using an open-source, pretrained Bidirectional Encoder Representations from Transformers (BERT) language model. The tool, dubbed “Document Clause Analyzer,” detects references, parses clauses, computes sentence encodings, and calculates sentence-similarity scores. The results are compiled into a report that the Engineering Governance team uses in its review process.

Getting started by validating the concept

The first step Engineering Governance took was to collaborate with TC Energy’s Machine Learning Lab team to prove the value of using NLP to generate reports that assist with document review. The team did this by using a prebuilt machine learning (ML) model to create a clause-similarity report that helped identify duplicate or conflicting clauses. The team used Amazon SageMaker, a fully managed service for building, training, and deploying ML models and workflows. Using Jupyter notebooks in Amazon SageMaker, the Machine Learning Lab team completed a proof of concept (PoC) in just 3 months. After feasibility was demonstrated, Engineering Governance engaged the Information Management Product Delivery team to build a production-ready solution. To achieve this, the Product Delivery team added features that help authors compare the clauses of a draft document with the clauses in all referenced documents and other selected documents, along with a secure, usable user interface.
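
A PoC of this shape can be sketched in a few lines of notebook code. The snippet below is purely illustrative rather than TC Energy’s implementation: it assumes the open-source sentence-transformers package and the pretrained all-MiniLM-L6-v2 model as stand-ins for the prebuilt model the team used, and it scores a handful of hypothetical clauses.

```python
# Illustrative PoC sketch (not TC Energy's code): score clause pairs with a
# pretrained BERT-based sentence encoder and print the similarity of each pair.
from sentence_transformers import SentenceTransformer, util

# Hypothetical clauses parsed from a draft and a published engineering standard.
draft_clauses = [
    "All pressure vessels shall be inspected annually.",
    "Welding procedures must be qualified before use.",
]
published_clauses = [
    "Pressure vessels are to be inspected once per year.",
    "Contractors shall submit daily safety reports.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed stand-in model
draft_emb = model.encode(draft_clauses, convert_to_tensor=True)
published_emb = model.encode(published_clauses, convert_to_tensor=True)

# Cosine similarity for every draft/published clause combination.
scores = util.cos_sim(draft_emb, published_emb)
for i, draft in enumerate(draft_clauses):
    for j, pub in enumerate(published_clauses):
        print(f"{scores[i][j].item():.2f}  {draft[:40]!r} <-> {pub[:40]!r}")
```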

Overview of the Document Clause Analyzer

The following figure shows the architecture of the solution that the Product Delivery team designed, built, and released during an 8-week product-development initiative.

TC Energy Innovates Case Study diagram

The solution consists of the following key components:

  1. Integration to the corporate content-management system
  2. Reference extractor
  3. Clause parser
  4. User interface
  5. Clause similarity model
  6. Reporting solution

Integration into the corporate content-management system

To facilitate the comparison of the clauses of a draft document with any other published corporate engineering standard, the team implemented a process to synchronize the solution with the corporate content-management system. Published documents were uploaded into the solution using an AWS Lambda function; AWS Lambda is a serverless, event-driven compute service. Each newly published document was then processed by the reference extractor and clause parser.
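
As a rough illustration of that synchronization step, the sketch below shows an AWS Lambda handler reacting to a new-document event, mirroring the file into the solution’s own Amazon S3 bucket, and queuing it for parsing. The event shape, bucket name, and queue URL are assumptions for the example, not details of TC Energy’s system.

```python
# Hypothetical Lambda handler: copy a newly published document into the
# solution's bucket and queue it for reference extraction and clause parsing.
import json
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

SOLUTION_BUCKET = "dca-published-documents"  # assumed bucket name
PARSE_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/dca-parse"  # assumed

def handler(event, context):
    # Assumed event format: one S3 record per newly published document.
    for record in event.get("Records", []):
        source_bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Mirror the published document into the solution's bucket.
        s3.copy_object(
            Bucket=SOLUTION_BUCKET,
            Key=key,
            CopySource={"Bucket": source_bucket, "Key": key},
        )

        # Hand the document off to the reference extractor and clause parser.
        sqs.send_message(
            QueueUrl=PARSE_QUEUE_URL,
            MessageBody=json.dumps({"bucket": SOLUTION_BUCKET, "key": key}),
        )
    return {"status": "ok"}
```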

Reference extractor and clause parser

The solution used Amazon Textract, an ML service that automatically extracts text, handwriting, and data from scanned documents. The extracted text was then used to identify the references and clauses within an engineering standard.
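
For reference, pulling the raw lines of text out of a document stored in Amazon S3 with Amazon Textract looks roughly like the following sketch. The synchronous API shown here handles single-page documents, and the bucket and key names are placeholders.

```python
# Illustrative Textract call: collect the detected lines of text from a
# document in S3 so they can feed the reference extractor and clause parser.
import boto3

textract = boto3.client("textract")

def extract_text(bucket: str, key: str) -> list[str]:
    # detect_document_text is the synchronous API; multi-page PDFs would use
    # start_document_text_detection / get_document_text_detection instead.
    response = textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return [
        block["Text"]
        for block in response["Blocks"]
        if block["BlockType"] == "LINE"
    ]

lines = extract_text("dca-published-documents", "standards/EDMS-1234.pdf")  # placeholders
```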

TC Energy’s custom-built reference extractor, which had been built for a previous solution, was slightly modified to support draft documents for the Document Clause Analyzer. The module extracted references to any other published corporate document. The reference mappings were stored in Amazon Neptune, a fully managed graph database service, while clauses were sent for sentence embedding.
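
Reference mappings like these fit naturally into a graph of document vertices connected by “references” edges. The snippet below is an assumption-level illustration of writing such an edge to Amazon Neptune with the Gremlin Python client; the endpoint, labels, and property names are hypothetical, not taken from TC Energy’s implementation.

```python
# Hypothetical example: record that one document references another as a
# "references" edge in an Amazon Neptune graph.
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __

NEPTUNE_ENDPOINT = "wss://my-neptune-cluster.us-east-1.neptune.amazonaws.com:8182/gremlin"  # placeholder

conn = DriverRemoteConnection(NEPTUNE_ENDPOINT, "g")
g = traversal().withRemote(conn)

def add_reference(source_doc_id: str, referenced_doc_id: str) -> None:
    # Upsert both document vertices, then link them with a "references" edge.
    src = g.V().has("document", "doc_id", source_doc_id).fold().coalesce(
        __.unfold(), __.addV("document").property("doc_id", source_doc_id)
    ).next()
    dst = g.V().has("document", "doc_id", referenced_doc_id).fold().coalesce(
        __.unfold(), __.addV("document").property("doc_id", referenced_doc_id)
    ).next()
    g.V(src).addE("references").to(__.V(dst)).iterate()

add_reference("DRAFT-0042", "EDMS-1234")  # placeholder document IDs
conn.close()
```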

Sentence embedding is an NLP technique where sentences are mapped to vectors of real numbers. These vectors are called sentence encodings. The sentence encodings for all the clauses in a published document are stored as a file in Amazon Simple Storage Service (Amazon S3), a fully managed object storage service. After extraction, the list of references and files is saved to Amazon DynamoDB, a fully managed NoSQL database service.
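
A hedged sketch of that embedding-and-storage step might look like the following, with the encoder model, bucket, and table names as placeholders: clause encodings are serialized to a NumPy file in Amazon S3, and the document’s reference list and file location are recorded in Amazon DynamoDB.

```python
# Illustrative sketch: compute sentence encodings for a document's clauses,
# store them in S3, and record the metadata in DynamoDB.
import io
import boto3
import numpy as np
from sentence_transformers import SentenceTransformer

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("dca-documents")  # assumed table name
model = SentenceTransformer("all-MiniLM-L6-v2")            # assumed stand-in encoder

def store_encodings(doc_id: str, clauses: list[str], references: list[str]) -> None:
    encodings = model.encode(clauses)  # shape: (num_clauses, embedding_dim)

    # Serialize the encodings and upload them alongside the document in S3.
    buffer = io.BytesIO()
    np.save(buffer, encodings)
    key = f"encodings/{doc_id}.npy"
    s3.put_object(Bucket="dca-published-documents", Key=key, Body=buffer.getvalue())

    # Record the reference list and encoding file location for later lookups.
    table.put_item(Item={
        "doc_id": doc_id,
        "references": references,
        "encodings_key": key,
        "clause_count": len(clauses),
    })
```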

The custom-built clause parser extracted and processed clauses within an engineering standard. The algorithm used to parse clauses was based on the standardized numbering format of the engineering standards. This workflow was initiated every time documents were uploaded to TC Energy’s content-management system or a draft document was uploaded into the solution for review.
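
As a simplified stand-in for a numbering-based parser (not TC Energy’s actual algorithm), the sketch below groups extracted text lines into clauses whenever a line starts with a hierarchical clause number such as 4.2.1.

```python
# Simplified, hypothetical clause parser: group extracted text lines into
# clauses keyed by their hierarchical clause number (e.g. "4.2.1").
import re

CLAUSE_NUMBER = re.compile(r"^(\d+(?:\.\d+)*)\s+(.*)")  # e.g. "4.2.1 Valves shall ..."

def parse_clauses(lines: list[str]) -> dict[str, str]:
    clauses: dict[str, str] = {}
    current = None
    for line in lines:
        match = CLAUSE_NUMBER.match(line.strip())
        if match:
            current = match.group(1)
            clauses[current] = match.group(2)
        elif current:
            # Continuation line: append to the clause currently being built.
            clauses[current] += " " + line.strip()
    return clauses

sample = ["4.2 Inspection", "4.2.1 All pressure vessels shall be", "inspected annually."]
print(parse_clauses(sample))
```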

Clause similarity model

After users uploaded a draft document and defined a list of documents to compare to, they generated a clause similarity report. The sentence encodings for each document listed were retrieved from the files in Amazon S3 and passed to the BERT-based clause similarity model. The ML model was used to calculate a similarity score for all combinations of clauses within the draft and all other documents. The results were returned and forwarded to another AWS Lambda function to generate a report.
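
With sentence encodings in hand, a clause-similarity score is typically a cosine similarity, so scoring all combinations amounts to a pairwise similarity matrix between the draft’s encodings and each comparison document’s encodings. The minimal sketch below assumes that approach, with random placeholder encodings standing in for real ones.

```python
# Minimal sketch: cosine similarity between every draft clause encoding and
# every clause encoding of a comparison document.
import numpy as np

def similarity_matrix(draft_enc: np.ndarray, other_enc: np.ndarray) -> np.ndarray:
    # Normalize rows, then a dot product gives cosine similarities.
    draft_norm = draft_enc / np.linalg.norm(draft_enc, axis=1, keepdims=True)
    other_norm = other_enc / np.linalg.norm(other_enc, axis=1, keepdims=True)
    return draft_norm @ other_norm.T  # shape: (draft_clauses, other_clauses)

# Example with random stand-in encodings of dimension 384.
draft = np.random.rand(10, 384)
other = np.random.rand(25, 384)
scores = similarity_matrix(draft, other)
print(scores.shape)  # (10, 25)
```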

Interactive user interface

The user interface let users upload valid engineering standards (see figure 1), define a list of engineering standards to be compared, initiate the clause-similarity analysis, view a list of reports containing the results, and download a report (see figure 2).

The interface was developed in React and hosted in AWS Amplify, a full-stack service that can host front-end web apps and create the backend environment. To secure the application, the team used Amazon Cognito, a fully managed authentication service. It was integrated with the corporate directory store, so that corporate employees could securely and seamlessly access the app.

Figure 1: DCA UI – Document selection

Figure 2: DCA UI – Report generation

Reporting

The similarity scores returned from the clause-similarity calculator were filtered, formatted, and saved as an Excel file, which was then stored in Amazon S3. Users could then view and download the report from the web application. The results were used by the document contributor to review their draft for duplicated and conflicting requirements. To make the report more user friendly, Engineering Governance built a Power BI template that was able to ingest the downloaded Excel reports (see figure 3).

Figure 3: DCA – Power BI report
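
A hedged sketch of that reporting step is shown below; it assumes pandas (with openpyxl) for the Excel output and uses placeholder bucket, key, and column names rather than the report’s actual layout.

```python
# Illustrative reporting step: filter scored clause pairs, write an Excel
# report with pandas, and store it in S3 for download from the web app.
import io
import boto3
import pandas as pd

s3 = boto3.client("s3")

def write_report(rows: list[dict], threshold: float = 0.8) -> str:
    # rows: [{"draft_clause": ..., "other_doc": ..., "other_clause": ..., "score": ...}, ...]
    df = pd.DataFrame(rows)
    df = df[df["score"] >= threshold].sort_values("score", ascending=False)

    buffer = io.BytesIO()
    df.to_excel(buffer, index=False, sheet_name="Clause similarity")
    key = "reports/clause-similarity.xlsx"  # placeholder key
    s3.put_object(Bucket="dca-reports", Key=key, Body=buffer.getvalue())  # placeholder bucket
    return key
```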

Road to production

Engineering Governance users wanted the ability to compare hundreds of documents within a limited time window. The linear approach used in the PoC, running in Amazon SageMaker Jupyter notebooks, could take hours to generate a report. To make the product viable, the team needed reports to be produced in a matter of minutes. The approaches described below allowed the tool to move to production and produce reports in under 5 minutes.

Preprocessing documents and calculating encodings in advance

A significant amount of time was saved by having the encodings for all published documents calculated in advance and stored in Amazon S3 (alongside the document itself), so when a published document was selected for comparison, its encodings could be quickly retrieved. When a draft document was uploaded, its references were extracted and returned to the user. While a user was reviewing the references and completing the comparison list, the clauses of the draft document were extracted, and the encodings for each were calculated. By the time the user was ready to generate a report, much of the heavy processing had already been completed in the background.
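
Concretely, the report-generation path then only needs to load the precomputed encodings, as in this hypothetical helper (the bucket name and key layout are assumed).

```python
# Hypothetical helper: load a document's precomputed clause encodings from S3
# instead of recomputing them at report time.
import io
import boto3
import numpy as np

s3 = boto3.client("s3")

def load_encodings(doc_id: str) -> np.ndarray:
    obj = s3.get_object(Bucket="dca-published-documents",  # placeholder bucket
                        Key=f"encodings/{doc_id}.npy")     # assumed key layout
    return np.load(io.BytesIO(obj["Body"].read()))
```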

Parallelizing the processing of each pair of documents

Because documents could contain hundreds of references, processing them sequentially could take upward of 1 hour to generate a report. To reduce the time to minutes, each pair of documents was processed concurrently. This was facilitated by AWS Lambda running TC Energy’s custom Docker image containing the BERT model. Another optimization was using the same container image for both calculating sentence encodings and producing similarity scores, which kept the AWS Lambda function runtime almost always in a warm state and saved a few minutes on each run. After the last pair of documents was processed, a final AWS Lambda function compiled the data and generated an Excel file.
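
One way to fan the document pairs out, sketched here at the assumption level rather than as the team’s orchestration code, is to invoke the scoring Lambda function asynchronously once per pair; the function name is hypothetical.

```python
# Hypothetical fan-out: invoke the BERT-scoring Lambda function once per
# draft/reference document pair so the pairs are processed concurrently.
import json
import boto3

lambda_client = boto3.client("lambda")

def fan_out(draft_doc_id: str, reference_doc_ids: list[str]) -> None:
    for ref_id in reference_doc_ids:
        lambda_client.invoke(
            FunctionName="dca-clause-similarity",  # assumed function name
            InvocationType="Event",                # asynchronous invocation
            Payload=json.dumps({"draft": draft_doc_id, "reference": ref_id}),
        )
```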

Reducing the number of encodings being compared

In the initial approach, all clauses were being compared with each other. This resulted in many redundant and unnecessary comparisons, which increased the compute time. By focusing only on the comparisons required by the business, the team reduced the number of comparisons and thus the problem space. This change also simplified the report for the business, resulting in a better user experience.

Filtering comparison results

A separate process to calculate the similarity scores was spun up for each pair of documents, and the results from each were merged to produce a single report. A significant amount of the delay in creating the report was caused by the substantial number of results being merged. To speed up the merge, a similarity-score filter was added to remove irrelevant comparisons based on a threshold determined by the business. This reduced the size of the report by over 90 percent and significantly reduced the time to produce it.
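
The effect of that filter can be illustrated in a few lines; the threshold value below is a placeholder, not the business’s actual cut-off.

```python
# Illustrative filter: keep only clause pairs whose similarity score clears a
# business-defined threshold before the per-document results are merged.
SIMILARITY_THRESHOLD = 0.8  # placeholder; the real threshold is set by the business

def filter_results(pair_results: list[dict]) -> list[dict]:
    return [r for r in pair_results if r["score"] >= SIMILARITY_THRESHOLD]

def merge_reports(all_pair_results: list[list[dict]]) -> list[dict]:
    merged = []
    for results in all_pair_results:
        merged.extend(filter_results(results))
    return sorted(merged, key=lambda r: r["score"], reverse=True)
```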

Conclusion

To avoid conflict and duplication across thousands of engineering standards, TC Energy looked to NLP to assist in the review process. TC Energy’s Product Delivery team used multiple AWS services to build a solution that can generate a clause-similarity report covering over 200 documents in under 5 minutes. Sharing components between solutions, preprocessing, parallel computing, efficient logic, and data filtering were essential to the success of this initiative. The solution was also built with future use cases in mind, and the team is looking to expand the product to cover more areas of the business.

Duane Patton

Duane is a senior technology lead at TC Energy, where he has over two decades of experience in information technology and business acumen that spans numerous areas of the corporation. His passion for innovation, technology, and machine learning has helped him deliver numerous machine learning–based solutions that help TC Energy extract insights from its content.

Joseph Johansson

Joseph Johansson is a Sr. Solutions Architect at AWS specializing in containers, midstream, and digital transformation. He loves diving deep into customer business problems and finding disruptive solutions using AWS.

Zachariah Albers

Zach is a solutions architect at TC Energy with a passion for using the power of applied machine learning to drive innovation and solve complex business challenges. His expertise in serverless development, combined with a strong customer focus and a commitment to fast iteration, helps him to deliver cutting-edge solutions that meet the needs of a wide range of stakeholders.