AWS Database Blog

Find and link similar entities in a knowledge graph using Amazon Neptune, Part 1: Full-text search

A knowledge graph combines data from many sources and links related entities. Because a knowledge graph is a gathering place for connected data, we expect many of its entities to be similar. When we find that two entities are similar to each other, we can materialize that fact as a relationship between them.

In this two-part series, we demonstrate how to find and link similar entities in Amazon Neptune, a managed graph database service. In this post, we use lexical search to find entities with similar text. In part 2, we use semantic search to find entities with similar meaning. In both cases, we demonstrate a convention to link similar entities using an edge.

Why similarity search?

Neptune, like most databases, can find objects using basic string comparisons, such as exact or partial match. But we often require more advanced search capabilities such as fuzzy comparison (for example, finding persons despite typos in the name) or semantic comparison (for example, finding publications on a specific topic, even if different words are used). Sometimes we need more powerful ways to find things.

We might need this capability to search the graph on demand from an application. Or we might be tasked to augment the graph by linking related or duplicate entities. Granted, there are deduplication tools that detect and eliminate duplicates before the data is loaded into the database. But in a knowledge graph, it is inevitable that there will be similar entities. Similarity search is a useful tool to find those related entities. Additionally, when similar entities are found, we can traverse the graph to compare their relationships. Shared neighbors provides further evidence that the entities are linked.

Lexical search

A lexical search is the ability to find entities with similar text content. For example, patients whose names have similar spelling and also have the same birth date and ZIP code might be considered the same.

For lexical similarity, we match patients in a patient graph. We combine Neptune with Amazon OpenSearch Service. OpenSearch Service is a powerful search index service with support for the following:

  • Exact and partial text matching, fuzzy matching, synonyms, boosting, highlights, and stemming
  • Lexical and semantic search
  • Scoring matches and ranking results by relevance
  • Ingesting data via REST API or OpenSearch Ingestion from sources such as Amazon Simple Storage Service (Amazon S3), Kafka, and Amazon Kinesis

We use a built-in Neptune/OpenSearch integration called full text search (FTS), enabling advanced lexical search of graph data. Full text search uses a poller to synchronize data from the Neptune database to a separate OpenSearch Service cluster. For information about supported OpenSearch versions and Amazon OpenSearch Serverless compatibility, refer to Full text search in Amazon Neptune using Amazon OpenSearch Service. The poller is an AWS Lambda function that runs periodically and checks Amazon Neptune Streams, the Neptune change event log, for recent changes to Neptune data. When Neptune data changes, the Lambda function updates an index in the OpenSearch Service domain that maintains a JSON representation of Neptune nodes and edges, enabling you to search them on an attribute-by-attribute basis. The following diagram illustrates this workflow.

Full-text search sync Neptune to OpenSearch Service

Full text search also allows combined queries of Neptune and OpenSearch Service, as shown in the following diagram. In a query to the Neptune database, an application can implicitly bring back from OpenSearch Service nodes, edges, or resources that lexically match on specified criteria. The query may then do further traversal to find additional details.

Full-text search combined query

A combined query enables you to use one interaction to find patients with a name similar to a given pattern and also traverse in the graph to find related payer, encounter, and provider details that provide additional verification of a match. You could use this in two cases: to match a new patient to existing patients on file, or to match and link existing patients to each other.

To test-drive patient matching, we provide the sample notebook FindAndLink-FTS.ipynb. The notebook includes instructions to set up full text search in Neptune, populate the patient dataset, and query it. This setup incurs the cost of running a Neptune cluster and an OpenSearch Service domain.

Neptune supports two common graph representations: labeled property graph (LPG) and Resource Description Framework (RDF). We focus on LPG in this post. For patient matching, we use the Apache TinkerPop Gremlin query language to query patient data.

The following diagram illustrates the patient graph data model. Patient data is synthetic, drawn from https://synthea.mitre.org/downloads.

Patient data model

Let’s walk through the model:

  • A graph consists of nodes and edges. Nodes are drawn as boxes, and edges as arrows linking the nodes.
  • Nodes are Patient, Encounter, Payer, Organization, and Provider.
  • A patient can have encounters (hasENCOUNTER edge) and payers (hasPAYER edge).
  • A patient can change payers over time. Edge properties START_YEAR and END_YEAR designate the interval when the patient had a particular payer.
  • Additionally, an encounter can have a provider and an organization (hasPROVIDER and hasORGANIZATION edges, respectively).
  • Finally, a patient can match another patient; the matches edge connects patients, with edge properties matchSource and matchAlg indicating the source found the match and which algorithm it used, respectively.

Basic patient match

The following Gremlin query runs against Neptune to find patients whose last name resembles prosecco:

g.withSideEffect("Neptune#fts.endpoint",'${AOS_ENDPOINT}')
  .V().hasLabel("Patient").has("LAST","Neptune#fts prosecco~").elementMap('LAST')

The results are shown in the notebook as follows.

Basic patient match

Full text search supports Gremlin for property graph queries and SPARQL for RDF queries.

Let’s break down the query:

  • The side effect Neptune#fts.endpoint tells the Neptune engine which OpenSearch Service domain endpoint to use.
  • V().hasLabel(“Patient“) is run against the Neptune database to restrict the search to Patient nodes in the graph.
  • In has("LAST","Neptune#fts prosecco~"), Neptune queries on our behalf the OpenSearch Service domain for JSON documents in which the last name matches prosecco. The tilde (~) instructs OpenSearch Service to use a fuzzy match. Neptune stores a JSON document for each node in a dedicated index in the OpenSearch Service domain, maintaining each node property as an attribute of the document. This step returns nodes corresponding to matching documents.
  • elementMap(“LAST”) extracts the last name property from matching patient nodes, as well as node ID and label.

Multiple patient nodes match with the last name Prosacco, a fuzzy match for prosecco.

A more advanced patient match

A more advanced search finds a patient whose last name or maiden name resembles havey or town. For each matching patient, it gets that patient’s encounters and payers. The query string uses Apache Lucene syntax to check that either LAST or MAIDEN matches the fuzzy names:

query_string="Neptune#fts predicates.LAST.value:(havey~ OR town~) OR predicates.MAIDEN.value:(havey~ OR town~)"

The following is the Gremlin query to Neptune:

g
.withSideEffect("Neptune#fts.endpoint", '${AOS_ENDPOINT}')
.withSideEffect("Neptune#fts.queryType", "query_string")
.withSideEffect("Neptune#fts.sortOrder", "desc")
.V().hasLabel("Patient")
.has("*", '${query_string}')
.project('patient', 'encounters', 'payers')
.by(elementMap())
.by(out('hasENCOUNTER').elementMap().fold())
.by(outE('hasPAYER').as('pt').inV().as('payer').select('pt', 'payer').by(elementMap()).fold())

The query consists of the following:

  • .withSideEffect("Neptune#fts.queryType", "query_string") instructs Neptune to use OpenSearch query_string when submitting the query. The query uses Apache Lucene syntax.
  • .withSideEffect("Neptune#fts.sortOrder", "desc") instructs Neptune to sort the results from OpenSearch Service in descending order of score. By default, results are sorted by score.
  • .V().hasLabel("Patient") starts the query in Neptune by restricting to Patient nodes.
  • .has("*", '${query_string}') performs the OpenSearch Service query. It brings back Patient nodes that match on the given last or maiden names.
  • The project brings back the encounters and payers for each matching patient.

This query is ideal for matching an admitting patient against patients in the graph. The query returns many results, but we can ask the patient for further details, including additional personal details, payers, and encounter visit details to narrow them down. For example, if the patient indicates they had a previous encounter for hyperlipidemia, we can filter the results and check if they are Lori Towne-Harvey.

The notebook shows 113 matches on the name, but if we filter for hyperlipidemia, we get only 13 matches.

Advanced patient match

Let’s examine the first match more closely. We see that the patient’s name matches our name criteria:

'FIRST': 'Lori', 'LAST': 'Harvey', 'MAIDEN': 'Towne'

That patient has an encounter matching hyperlipidemia:

{
"<T.id":"1>":"http://example.org/patientgraph/encounter/4dfe5224-d166-4423-b84d-a61d8d6c72fe",
"<T.label":"4>":"Encounter",
"Id":"4dfe5224-d166-4423-b84d-a61d8d6c72fe",
"START":"1989-01-11T14:44:28Z",
"STOP":"1989-01-11T14:59:28Z",
"PATIENT":"37962cbc-310b-486b-85b1-febe07eda766",
"ORGANIZATION":"24cb4eab-6166-3530-bddc-a5a8a14a4fc1",
"PROVIDER":"e9f54a4e-c20e-3443-a6fc-2ab488a1345c",
"PAYER":"6e2f1a2d-27bd-3701-8d08-dae202c58632",
"ENCOUNTERCLASS":"ambulatory",
"CODE":390906007,
"DESCRIPTION":"Follow-up encounter",
"BASE_ENCOUNTER_COST":129.16,
"TOTAL_CLAIM_COST":129.16,
"PAYER_COVERAGE":54.16,
"REASONCODE":55822004.0,
"REASONDESCRIPTION":"Hyperlipidemia"
}

Materializing a match

When we are confident that two patients match, we can link them by establishing a match relationship between them. If we know there is a Raye Wyman, we can find patients with similar names. In the next query, we perform a fuzzy match on FIRST and LAST names.

We use the following Lucene query string:

fl_query_string = "Neptune#fts predicates.FIRST.value:raye~ AND predicates.LAST.value:wyman~"

We use that string in the following Gremlin query. It finds patients matching this criteria. It returns first name, last name, ZIP code, and birth date properties of matching patients.

g
.withSideEffect("Neptune#fts.endpoint", '${AOS_ENDPOINT}')
.withSideEffect("Neptune#fts.queryType", "query_string")
.withSideEffect("Neptune#fts.sortOrder", "desc")
.V().hasLabel("Patient")
.has("*", '${fl_query_string}')
.elementMap('FIRST', 'LAST', 'ZIP', 'BIRTHDATE')

The query finds two matches, which share the same LAST, ZIP, and BIRTHDATE.

Find first/last name match to materialize

The FIRST names are very close: Raye and Rxye. We might decide to materialize that these patients are the same by adding a match edge between them. The next Gremlin query uses addE(‘matches’) to add an edge labeled matches. The from() and to() parameters reference the source and target node IDs. We set edge properties for matchSource and matchAlg, indicating that the match is user-driven based on full text search:

g.addE('matches')
.from(V('http://example.org/patientgraph/patient/3e7c92d1-3c0d-485d-8289-154750bd5770'))
.to(V('http://example.org/patientgraph/patient/x3e7c92d1-3c0d-485d-8289-154750bd5770'))
.property("matchSource", "user").property("matchAlg", "fts")

At any time, we can find our matches using a query. The following query finds details of every match, up to a limit of 10, including the patients it matches:

g.V().outE('matches').inV().limit(10).path().by(elementMap())

In the notebook, we can visualize the results.

Show materialized match

Our convention of linking similar entities using a matches edge is generic enough to apply to entities of any type. We use it in the next section to link publications.

Clean up

If you set up a Neptune cluster, notebook instance, or OpenSearch Service domain to follow along, delete those resources to avoid further costs. The notebook has cleanup instructions.

Conclusion

A knowledge graph gathers entities from numerous sources. In this post, we demonstrated how to use lexical search to find similar entities in a knowledge graph running on Neptune. We combined Neptune with OpenSearch Service using a feature called full text search. We also showed a conventional way to link similar entities.

In part 2, we use vector similarity search in Neptune Analytics to find semantically similar entities.

Experiment with this method using the notebooks provided. Think about how to more effectively query and draw relationships in your knowledge graph using the approaches discussed, and share your thoughts in the comments section.


About the Author

Mike Havey is a Senior Solutions Architect for AWS with over 25 years of experience building enterprise applications. Mike is the author of two books and numerous articles. His Amazon author page is https://www.amazon.com/Michael-Havey/e/B001IO9JBI.