AWS Database Blog

How Amazon stores deliver trustworthy shopping and seller experiences using Amazon Neptune

Nearly three decades ago, Amazon set out to be Earth’s most customer-centric company, where people can discover and purchase the widest possible selection of safe and authentic goods. When a customer makes a purchase in our store, they trust they will receive an authentic product, whether the item is sold by Amazon Retail or by one of millions of independent sellers. And when businesses choose to sell in our store, they trust we will provide a great selling experience free from competition with bad actors. We understand that customer trust is difficult to earn and easy to lose, which is why trust is at the foundation of the relationships we build and the innovations we make on behalf of our customers and selling partners. To provide customers and selling partners with a trustworthy experience, it is important to understand complex relationships and behavior patterns between various entities including the buyer, the seller, and the products. In this post, we demonstrate how just one of the many teams within Amazon stores identifies and investigates these relationships using Amazon Neptune.

Neptune includes a graph database engine, graph analytics database engine, graph machine learning (ML), and visualization tools, which you can use individually or together. With Amazon Neptune Database, you can scale your graphs with more than 100,000 queries per second for demanding applications using a serverless graph database designed for superior scalability and availability. With Amazon Neptune Analytics, you can get insights and find trends by quickly processing large amounts of graph data. Neptune Analytics supports popular graph analytics algorithms for use cases such as ranking social influencers, detecting groups for bad actors, or finding patterns in network activity.

The challenge

Information about entities (like buyers, sellers, products) within Amazon stores is typically stored in tabular format (rows and columns). Finding patterns and relationships within this format of data is time-consuming and heavily manual, often involving multiple databases. For example, suppose we want to identify relationships between entities, where direct interactions do not exist. This scenario is shown in the following figure with two types of entities: blue and green. The entities we are trying to find have a question mark next to them, and the starting entity has a red ‘X’.

Interaction graph example

The following code is our database schema. All the fields are strings.

green_entities(green_entity_id, interaction_id)
blue_entities(blue_entity_id, interaction_id)

The following code is a SQL query that finds the second-order entities and sorts them by the number of shared interactions. Because we must pass through the blue entities, we need three joins.

SELECT g2.green_entity_id, COUNT(g2.interaction_id)
FROM green_entities g1
    JOIN blue_entities b1 ON g1.interaction_id = b1.interaction_id
    JOIN blue_entities b2 ON b1.blue_entity_id = b2.blue_entity_id
    JOIN green_entities g2 ON b2.interaction_id = g2.interaction_id
WHERE g1.green_entity_id = 'Entity 1'
GROUP BY g2.green_entity_id
ORDER BY COUNT(g2.interaction_id) DESC

To detect these relationships, we previously relied on complex SQL queries, rule engines, and other data science techniques, but we didn't have a way to detect these patterns at scale using automation.

The new approach

Graphs contain the same data but in a representation that makes it straightforward to trace through many layers of interactions across multiple types of entities.

Graph queries allow us to traverse the graph multiple hops away from our starting node. An equivalent SQL query would require at least one join per hop. Returning to the blue/green entity example, the following code is the equivalent openCypher query. It is much simpler than the SQL query.

MATCH (g1:green_entity)-[]-(:blue_entity)-[]-(g2:green_entity)
WHERE g1.green_entity_id = 'Entity 1'
RETURN g2.green_entity_id, count(*) AS interactions
ORDER BY interactions DESC

Graph algorithms allow us to find clusters of closely related vertices, something that would be difficult with tabular data. For example, we could use the label propagation graph algorithm to propagate attribute labels from one entity to other entities. These algorithms take advantage of concepts such as edge weight and degree, which have no analog in tables. Unlike tabular ML techniques, graph ML techniques incorporate information from relationships when making predictions.
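
To make this concrete, the following is a minimal, self-contained sketch of label propagation over a toy edge list. It is illustrative only: the graph, labels, and iteration cap are made up, and in practice the algorithm runs inside the graph service rather than in application code like this.

# Minimal label propagation sketch on a toy undirected graph.
# All data here is made up; this only illustrates the idea.
from collections import Counter, defaultdict

edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "e")]

neighbors = defaultdict(set)
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)

# Start with every vertex in its own community.
labels = {v: v for v in neighbors}

for _ in range(10):  # fixed iteration cap instead of a convergence test
    changed = False
    for v in neighbors:
        # Adopt the most common label among this vertex's neighbors.
        top = Counter(labels[u] for u in neighbors[v]).most_common(1)[0][0]
        if labels[v] != top:
            labels[v], changed = top, True
    if not changed:
        break

print(labels)  # vertices in the same cluster converge on a shared label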

For these reasons, we hypothesized that a graph approach would improve the identification and investigation of relationships, and we wanted to test this hypothesis. To do so, we had to decide which graph service to use, and we chose Neptune.

Benefits of Amazon Neptune

Neptune is a fully managed graph database service. We chose it for the following reasons:

Features

  • We would not need to engineer our own graph traversal system.
  • Neptune supports multiple popular and open graph query languages such as Gremlin and openCypher. These languages are well documented and have many examples. They also have different styles: Gremlin is imperative, describing how a traversal operates step by step, whereas openCypher is declarative, describing what the query must return. Depending on the use case, we can use either query language on the same dataset (see the Gremlin sketch after this list).
  • Neptune integrates with Jupyter notebooks that allow multiple software engineers to collaborate and brainstorm on graph queries. There is graph visualization support in Neptune notebooks, using Graph Explorer, and through interoperable external tools.
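
To make the imperative versus declarative contrast concrete, the following sketch expresses the earlier two-hop openCypher query as a Gremlin traversal using the gremlin_python client. The endpoint is a placeholder, and the traversal is illustrative rather than our production query.

# Hypothetical Gremlin version of the two-hop openCypher query shown
# earlier, using the gremlin_python client. The endpoint is a placeholder.
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

conn = DriverRemoteConnection("wss://your-neptune-endpoint:8182/gremlin", "g")
g = traversal().withRemote(conn)

# Imperative style: each step says how to move through the graph.
results = (
    g.V().has("green_entity", "green_entity_id", "Entity 1")
    .both().hasLabel("blue_entity")     # hop 1: to adjacent blue entities
    .both().hasLabel("green_entity")    # hop 2: to their other green entities
    .values("green_entity_id")
    .dedup()
    .toList()
)
conn.close()

Each step spells out how to move through the graph, whereas the openCypher version only declares the pattern to match.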

Developer Experience

Integration with other AWS services

Building a prototype

To test our hypothesis that a graph approach and Neptune were the right path forward, we embarked on a small-scale experiment. We focused on a single relationship-based pattern that our product team identified. Our goal was to write graph queries whose results would be all the entities involved in this pattern.

The experiment was limited to 3 months of data—the smallest amount of data that we felt was sufficient to stress test Neptune. We used the following iterative process:

  1. Identify tables in Amazon’s data warehouses containing the required data.
  2. Use SQL queries to transform the data into CSVs in the Gremlin load data format (see the sketch after this list).
  3. Load the data into a newly created Neptune cluster using the %load magic in Neptune notebooks.
  4. Write Gremlin queries to find sets of entities matching a pattern using the %%gremlin magic in Neptune notebooks.
  5. Use postprocessing logic to extract desired entities and filter out other entities by joining the graph data with other sources.
  6. Validate our queries by sending resulting entities to our human specialists for manual verification. Based on feedback from our human specialists, we can refine our graph data model and return to Step 1.
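
As referenced in Step 2, the bulk loader expects CSVs in the Gremlin load data format, with reserved ~id, ~label, ~from, and ~to columns. The following sketch shows what minimal vertex and edge files might look like for the blue/green example; the IDs, labels, and rows are made up.

# Sketch: writing vertex and edge files in the Gremlin load data format.
# The IDs, labels, and rows are illustrative only.
import csv

with open("green_vertices.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["~id", "~label", "green_entity_id"])  # reserved headers
    w.writerow(["g1", "green_entity", "Entity 1"])

with open("blue_vertices.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["~id", "~label", "blue_entity_id"])
    w.writerow(["b1", "blue_entity", "Entity 2"])

with open("interaction_edges.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["~id", "~from", "~to", "~label"])  # one edge per interaction
    w.writerow(["i1", "g1", "b1", "interacted_with"])

Files like these are uploaded to Amazon S3 and loaded in Step 3 by pointing the %load magic at their S3 prefix.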

The ability to rapidly iterate was supported by three crucial factors.

Firstly, it was effortless to spin up prototypes. Neptune’s notebooks meant we could immediately begin loading data and querying it after a graph cluster was created.

Secondly, support was abundant. The public documentation was comprehensive, and the Neptune solution architects were knowledgeable and helpful. When we had issues, the Neptune team was quick to respond. They regularly seek feedback from their customers and are continuously innovating to support new technological developments such as ML, AI, and performance improvements.

Lastly, Neptune can scale. Neptune loaded our data at a rate of approximately 100,000 records per second. This allowed us to load approximately 400 million nodes and 700 million edges in less than 7 hours. We found that r6i.12xlarge instances were the ideal instance type for bulk data loads; increasing the instance size beyond r6i.12xlarge had diminishing returns.

Because of its ability to scale, Neptune completed our graph queries quickly. Our online transaction processing (OLTP) queries, which have a small number of starting nodes, finished in less than 1 second. Our largest online analytical processing (OLAP) queries, which start from approximately 400 million nodes and traverse out six hops, still finished in approximately 28 minutes on r6i.12xlarge instances.

Additionally, Neptune is pay-as-you-go; we did not have to pay any licensing fees. This helped us increase and decrease capacity as needed, minimizing costs.

These factors allowed us to complete the experiment in fewer than 5 weeks despite no prior experience with graph databases.

The experiment was a success. Of the entities found using Neptune, human reviews confirmed relationships in approximately 75% of cases. The efficiency of human reviews improved as well — they were 25% faster than without Neptune.

Automating the solution

After we validated our hypothesis via an experiment, we started building technical components to help automate the solution so it could scale to more patterns and even support internal client use cases related to these patterns. The following diagram illustrates our high-level architecture.

High level architecture diagram

There are four foundational components around Amazon Neptune:

  • Bulk data loader
  • Query repository
  • Insight generation
  • Insight service

Let’s look at each component in more detail.

Bulk data loader

The bulk data loader updates our graph periodically with new entities and interactions, so our Neptune database remains fresh. The first step is a data transform that creates many Gremlin load data format CSV files asynchronously. These CSVs are uploaded to Amazon Simple Storage Service (Amazon S3), which triggers the Data Completeness Checker AWS Lambda function. The Lambda function checks for the presence of all expected files. When all the files have finished uploading, it initiates an AWS Step Functions state machine. This state machine has two Lambda functions: a Submitter and a Poller. The Submitter makes an HTTPS POST request to Neptune's loader endpoint to start the bulk data load. Afterwards, the Poller repeatedly calls the Get Status API to determine whether the bulk data load is complete.
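
The following is a rough sketch of the Submitter and Poller handlers. The endpoint, S3 path, and IAM role ARN are placeholders and error handling is omitted; it shows the shape of the loader API calls, not our production code.

# Sketch of the Submitter and Poller Lambda handlers. The endpoint,
# S3 path, and role ARN are placeholders; error handling is omitted.
import json
import urllib.request

LOADER_ENDPOINT = "https://your-neptune-endpoint:8182/loader"

def submit_handler(event, context):
    """Submitter: start a bulk load from S3 and return its load ID."""
    body = json.dumps({
        "source": "s3://your-bucket/graph-csvs/",
        "format": "csv",  # Gremlin load data format
        "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadFromS3",
        "region": "us-east-1",
        "failOnError": "FALSE",
        "parallelism": "OVERSUBSCRIBE",
    }).encode()
    req = urllib.request.Request(
        LOADER_ENDPOINT, data=body,
        headers={"Content-Type": "application/json"}, method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        load_id = json.load(resp)["payload"]["loadId"]
    return {"loadId": load_id}

def poll_handler(event, context):
    """Poller: call the Get Status API; Step Functions loops until done."""
    with urllib.request.urlopen(f"{LOADER_ENDPOINT}/{event['loadId']}") as resp:
        status = json.load(resp)["payload"]["overallStatus"]["status"]
    return {"loadId": event["loadId"], "status": status,
            "done": status == "LOAD_COMPLETED"}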

The following figure illustrates the bulk data loader low-level architecture.

Bulk data loader low level architecture diagram

Query repository

The query repository stores all our graph queries and metadata: description, version, and substitutable parameters.
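
As an illustration, a repository entry might look something like the following. The metadata fields mirror the list above; the class shape and the naive substitution logic are hypothetical.

# Illustrative shape of a query repository entry. The class and its
# substitution logic are hypothetical, not our production code.
from dataclasses import dataclass

@dataclass
class GraphQuery:
    name: str
    description: str
    version: int
    query_template: str  # openCypher with $-prefixed placeholders
    parameters: tuple    # substitutable parameter names

    def render(self, **values: str) -> str:
        # Naive string substitution for illustration; a real implementation
        # would pass parameters to the query engine instead.
        query = self.query_template
        for p in self.parameters:
            query = query.replace(f"${p}", repr(values[p]))
        return query

second_order = GraphQuery(
    name="second_order_green_entities",
    description="Green entities two hops from a starting green entity",
    version=1,
    query_template=(
        "MATCH (g1:green_entity)-[]-(:blue_entity)-[]-(g2:green_entity) "
        "WHERE g1.green_entity_id = $start RETURN g2.green_entity_id"
    ),
    parameters=("start",),
)

print(second_order.render(start="Entity 1"))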

Insight generation

The first step in insight generation is to fetch the relevant graph query from the query repository. Taking inspiration from the neptune-export service, we send the graph query to Neptune from an AWS Batch job. Making the request to Neptune from AWS Batch is preferred over other AWS compute options due to the absence of a timeout and the ability to run concurrent but isolated jobs. When Neptune finishes running the query, the Insight Generator job writes the results as a JSON file to a Results S3 bucket. We then postprocess the JSON file, filtering and enriching the results. The final output is a CSV file containing rows of entities engaging in the same pattern.
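
A stripped-down version of the Insight Generator's core step might look like the following, which runs an openCypher query against Neptune's HTTPS endpoint and writes the raw JSON results to S3. The endpoint, bucket, and key are placeholders, and retries are omitted.

# Sketch of the Insight Generator core step: run an openCypher query over
# Neptune's HTTPS endpoint and persist the JSON results to S3.
# The endpoint, bucket, and key are placeholders.
import json
import urllib.parse
import urllib.request

import boto3

NEPTUNE_OPENCYPHER = "https://your-neptune-endpoint:8182/openCypher"

def generate_insight(query: str, bucket: str, key: str) -> None:
    data = urllib.parse.urlencode({"query": query}).encode()
    req = urllib.request.Request(NEPTUNE_OPENCYPHER, data=data, method="POST")
    with urllib.request.urlopen(req) as resp:
        results = json.load(resp)["results"]
    # Persist raw results; postprocessing filters and enriches them later.
    boto3.client("s3").put_object(
        Bucket=bucket, Key=key, Body=json.dumps(results).encode()
    )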

The following figure illustrates the low-level architecture.

Insight service

The insight service exposes APIs to query graph data (our data plane) and create, read, update, and delete queries (our control plane). As of this writing, it is under construction.

These components work together to serve vertically integrated solutions such as an investigation UI or an ML pipeline.

Generalizing the solution

The next step is to generalize what we built. We plan to support more patterns, and will also begin research into pattern-agnostic methods of detection. To accomplish this goal, we hope to use ML more. We are planning further experiments on the large graph datasets we have collected. We anticipate Amazon Neptune ML will help.

We are interested in using Neptune Analytics to speed up our graph queries. Neptune Analytics promises 20 times faster row scans and 200 times faster column scans, which can help when analyzing graphs with hundreds of millions to billions of edges. Neptune Analytics also has graph algorithms, which may be useful for finding complex relationships directly or for building models that can find them.

We also plan to upgrade the UI that our human specialists use to view insights. One challenge is how to best visualize graph data for users who are unfamiliar with graphs. Initial feedback from our human specialists indicated a split on whether they found graph visualizations useful. We plan to explore different graph visualization tools and techniques to make insights clearer.

Conclusion

In this post, we described how our team is able to ensure trustworthy experiences on Amazon by identifying complex relationships between multiple types of entities using Neptune. We found the whole experience exciting and enriching, and took away several lessons:

  • Graphs are a powerful tool for identifying complex relationships between multiple entity types – Graphs are a flexible data structure. They can model interactions between entities or relate disparate accounts by shared attributes. Both are useful when several entities are involved at scale.
  • Embrace prototyping – Prototyping gave us the space to explore Neptune without the pressure of project delivery. In an ever-evolving space, learning and failing fast is necessary. Neptune’s ease of use, abundant support, and scale allowed us to move fast.

To learn more about how Amazon uses AI with customer reviews, see How Amazon is using AI to ensure authentic customer reviews. To learn more about Neptune, see the Amazon Neptune User Guide.


About the Authors

George Pu is a Software Developer/Engineer at Amazon. Currently, he works on software that helps Amazon create and maintain trustworthy experiences for buyers and selling partners. Outside work, he enjoys trying new food and traveling to new places.


Lindsey Pogue is a Software Development Manager at Amazon. Before she was a manager, she was a Software Development Engineer. She currently leads a team of engineers working on ensuring trustworthy experiences for buyers and selling partners. Outside of work, Lindsey can be found outside, whether it’s backpacking in the summer or snowboarding in the winter.