AWS Database Blog

Exploring new features of Apache TinkerPop 3.7.x in Amazon Neptune

Amazon Neptune 1.3.2.0 now supports the Apache TinkerPop 3.7.x release line, introducing many major new features and improvements. TinkerPop 3.7 offers a large expansion to the Gremlin graph query language, primarily providing new syntax for handling strings, collections, and dates, but also offering new functionality in other areas. In this post, we highlight the features that have the greatest impact on Gremlin developers using Neptune, to help you understand the implications of upgrading to these versions of Neptune and TinkerPop.

New Gremlin syntax

TinkerPop 3.7.x introduces 26 new steps to the Gremlin language. These additions fill common functional gaps around the manipulation of strings, collections, and dates. Before this release, simple operations like splitting a string, combining lists, or adding to a date were either difficult or impossible to do directly in Gremlin. This limitation forced you either to use lambda functions (which many graphs, including Neptune, don’t support) or to return query results prematurely and process them in application code. With these features now available directly in the Gremlin language, you can write queries using capabilities you’re accustomed to from most other query languages, helping Gremlin feel like a more familiar environment.

The following subsections provide an overview of these new steps and cover other new syntax that landed in this release. The examples shown in these sections use the small version of the air routes dataset. All examples are written in Groovy.

String-related steps

For most applications, the string is perhaps the most widely stored and accessed data type in a database. Whether you need to concatenate a first and last name, remove a trailing character from a sentence, or convert some input to upper case, most programming languages provide simple functions for these basic operations. Gremlin now has steps to cover these common use cases and many more: asString(), concat(), length(), toLower(), toUpper(), trim(), lTrim(), rTrim(), reverse(), replace(), split(), substring(), and format().

These new step names should look familiar to you because they’re found in many programming languages, but let’s look at a few examples to demonstrate their use and to call out some Gremlin-specific elements. In the following example, you first get five vertices, extract the city property value, and then use the new split() step to break the string into a list of strings wherever a space is found:

gremlin> g.V().limit(5).values('city').split(' ')
==>[Seattle]
==>[Santa,Fe]
==>[San,Francisco]
==>[Philadelphia]
==>[San,Jose]

Extending that example, let’s convert each of the strings in those lists to lower case:

gremlin> g.V().limit(5).values('city').split(' ').toLower(Scope.local)
==>[seattle]
==>[santa,fe]
==>[san,francisco]
==>[philadelphia]
==>[san,jose]

You can see how Scope.local, an argument common to certain Gremlin steps, tells toLower() to operate on the contents of the incoming List (as opposed to the List itself). You will find local forms for many of these new steps.

The next example uses substring() to group and count vertices by the first character of the airport code, and then keeps only the groups with a count greater than one:

gremlin> g.V().groupCount().by(values('code').substring(0,1)).
......1>   unfold().
......2>   filter(select(values).is(gt(1))).
......3>   fold()
==>[A=3,B=3,C=2,D=4,E=2,H=3,I=2,L=4,M=4,O=2,P=3,S=9,T=2]

While the above example is fairly straightforward, it demonstrates a case where these new steps alter the way you can develop applications in Gremlin. Before this release and the introduction of substring(), you would need to return all the vertices to your application, do the group count in your application code where you could pick out the first character, and then do the filter. It would force you to absorb the cost of sending data back to your application that you might not even need to achieve your result. Taking the example a step further, imagine for a moment this result wasn’t your objective and that you had additional processing to do with the vertices that met the filter criteria. If so, you would have to send a second request to the server to do that. By providing features like substring() directly in Gremlin itself, you remove these concerns and simplify application development.
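
Several of the other string steps listed earlier can be tried without any dataset at all. The following is a minimal sketch rather than output taken from Neptune: it assumes a Gremlin Console at version 3.7.x, uses an empty in-memory TinkerGraph as a stand-in, and feeds made-up literal strings through inject().

```groovy
// Assumption: run in Gremlin Console 3.7.x; TinkerGraph stands in for Neptune here
g = TinkerGraph.open().traversal()

// trim() strips surrounding whitespace, then toUpper() upper-cases the result
g.inject('  santa fe ').trim().toUpper()

// concat() appends each argument, in order, to the incoming string
g.inject('San').concat(' ', 'Jose')

// replace() swaps each occurrence of the first argument for the second
g.inject('San Jose').replace(' ', '_')

// length() returns the number of characters in the incoming string
g.inject('Seattle').length()
```

Each line is an independent traversal, so you can paste them one at a time and compare the results with what you would expect from the equivalent string functions in your application language.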

Collection-related steps

There are many ways to end up with a collection in your results with Gremlin. The fold() step is perhaps the most obvious way, but you also see them produced by way of steps like group() and valueMap(). Given how often they’re encountered in Gremlin, the following steps give you a high degree of flexibility for accomplishing more data transformations directly in your queries: any(), all(), product(), merge(), intersect(), combine(), conjoin(), difference(), disjunct(), and reverse().

As you evaluate your existing Gremlin, you will likely find that many of these new steps help simplify existing code. A good example of this is the all() step, a filter that ensures that all objects in a List match the supplied predicate. Before 3.7.x, you could implement this sort of functionality with existing Gremlin steps, as follows:

gremlin> g.V().group().by('city').by('runways').
......1>   unfold().
......2>   and(select(values).count(local).is(gt(1)), 
......3>       select(values).as('a','b').
......4>              where('a', eq('b')).
......5>                by(count(local)).
......6>                by(unfold().is(gt(2)).count()))
==>Washington D.C.=[4, 3]
==>Houston=[5, 4]

In the preceding example, we try to find cities that have more than one airport and where every airport has more than two runways. You can see the complexity involved in the second select(), which checks the list of runway counts in each city to ensure that all of them are greater than two. It accomplishes the “all have more than two” part with the where() clause, which compares the number of entries in the list with the number of entries that are greater than two. If the two numbers are equal, you know that all airports meet the criteria. A practiced eye would immediately recognize this Gremlin pattern and the filter it implements, but new Gremlin users won’t understand it as readily.

In 3.7.x, you can greatly improve the readability of this query by replacing much of that complex pattern with a direct use of the all() step:

gremlin> g.V().group().by('city').by('runways').
......1>   unfold().
......2>   and(select(values).count(local).is(gt(1)), 
......3>       select(values).all(gt(2)))
==>Washington D.C.=[4, 3]
==>Houston=[5, 4]
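
The other collection steps follow the same shape: a List arrives as the incoming traverser and the step filters or transforms it. The following is a minimal sketch, assuming a Gremlin Console at version 3.7.x with an empty in-memory TinkerGraph; the lists are made-up values supplied through inject() rather than data from the air routes graph.

```groovy
// Assumption: run in Gremlin Console 3.7.x; TinkerGraph stands in for Neptune here
g = TinkerGraph.open().traversal()

// any(): a list passes the filter if at least one item matches the predicate
g.inject([4, 3], [2, 1]).any(gt(3))

// intersect(): items found in both the incoming list and the argument
g.inject(['SEA', 'SFO', 'IAD']).intersect(['IAD', 'MIA', 'SEA'])

// difference(): items of the incoming list that are absent from the argument
g.inject(['SEA', 'SFO', 'IAD']).difference(['SFO'])

// combine(): appends the argument to the incoming list
g.inject(['SEA', 'SFO']).combine(['SFO', 'IAD'])

// conjoin(): joins the items into a single string with the given delimiter
g.inject(['SEA', 'SFO', 'IAD']).conjoin('-')
```

In a real query, the incoming list would more typically come from fold(), group(), or valueMap(), as described at the start of this section.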

Date-related steps

Query languages tend to have means for natively working with date data types. As of the 3.7.x release, Gremlin is on par with those languages with the addition of the following steps: asDate(), dateAdd(), and dateDiff(). Before this change, you would probably have to store your dates as a long and then use a math() step to handle any manipulation you intended. Alternatively, as with string manipulation, you might have to return the date to the application and modify it there. If you use dates in your applications in either of these ways, the new date steps give you better options for working with dates directly in Gremlin. Here’s an example where a string representation of a date is converted to a Date type, seven days are added, and then the difference from the original date is calculated in seconds:

gremlin> g.inject("2023-08-02T00:00:00Z").asDate().
......1>   dateAdd(DT.day, 7).
......2>   dateDiff(datetime("2023-08-02T00:00:00Z"))
==>604800
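
The same steps work with other units and with values that are already dates. The following is a minimal sketch, assuming a Gremlin Console at version 3.7.x with an empty in-memory TinkerGraph; the dates are arbitrary literals chosen for illustration.

```groovy
// Assumption: run in Gremlin Console 3.7.x; TinkerGraph stands in for Neptune here
g = TinkerGraph.open().traversal()

// dateAdd() with a different unit: move the incoming date forward by 90 minutes
g.inject(datetime('2023-08-02T00:00:00Z')).dateAdd(DT.minute, 90)

// dateDiff(): the incoming date minus the argument, in seconds.
// The two dates are exactly one day apart, so this yields 86400.
g.inject(datetime('2023-08-03T00:00:00Z')).
  dateDiff(datetime('2023-08-02T00:00:00Z'))
```

Because the result of dateDiff() is a plain number, you can pass it on to steps such as is() or math() without leaving the traversal.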

Cardinality syntax

The introduction of mergeV() and mergeE() in TinkerPop 3.6.x greatly changed the way that you write Gremlin that mutates the graph. These steps provided a mechanism to write upsert-style operations more directly. One limitation of the initial syntax was that it assumed that the default cardinality would be used. In Neptune, the default cardinality is set, whereas most graphs assume single. This difference meant that Neptune developers needed a workaround using sideEffect() when using mergeV(). Here’s an example of that workaround being run against an empty graph:

gremlin> g.addV().property(id, '1234').
......1>          property(single, 'age', 19).
......2>          property(set, 'city', 'orlando')
==>v[1234]
gremlin> g.mergeV([(T.id): '1234']).
......1>     option(onMatch, 
......2>            sideEffect(property(single,'age', 20).
......3>                       property(set,'city','miami')).constant([:]))
==>v[1234]
gremlin> g.V().valueMap()
==>[city:[orlando,miami],age:[20]]

The sideEffect() in the previous example modifies the matched vertex using property() steps with explicit specification of the Cardinality, then has to return an empty Map with constant([:]) to fulfill the expected argument to option(). This workaround in some ways defeats the purpose of mergeV() because it complicates the syntax considerably. Moreover, we’ve identified cases where the workaround doesn’t work consistently.

In 3.7.x, you can directly specify the cardinality within the Map argument. As a result, the preceding example with mergeV() can be written as follows:

gremlin> g.addV().property(id, '1234').
......1>          property(single, 'age', 19).
......2>          property(set, 'city', 'orlando')
==>v[1234]
gremlin> g.mergeV([(T.id): '1234']).
......1>     option(onMatch, [age: single(20), city: set('miami')])
==>v[1234]
gremlin> g.V().valueMap()
==>[city:[orlando,miami],age:[20]]

union() as a start step

The union() step is a branching step that merges the results of an arbitrary number of traversals. It’s one of the more commonly used steps, but using it as a start step has always required the following workaround:

gremlin> g.inject(0).union(V().has('code','IAD'), 
......1>                   V().has('code','MIA')).
......2>   valueMap('code','city')
==>[code:[IAD],city:[Washington D.C.]]
==>[code:[MIA],city:[Miami]]

The inject() step starts the traversal with a throwaway value of 0 which then allows union() to behave as it normally does. In 3.7.x, you can avoid the inject() and use union() directly from g.

gremlin> g.union(V().has('code','IAD'), 
......1>         V().has('code','MIA')).
......2>   valueMap('code','city')
==>[code:[IAD],city:[Washington D.C.]]
==>[code:[MIA],city:[Miami]]
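
Starting with union() is also a convenient way to gather several unrelated results in one request. The following is a minimal sketch, assuming a Gremlin Console at version 3.7.x; it builds a hypothetical two-airport graph in an in-memory TinkerGraph (the 'airport' and 'route' labels are illustrative) and then returns the vertex count and the edge count together.

```groovy
// Assumption: run in Gremlin Console 3.7.x; TinkerGraph stands in for Neptune here
g = TinkerGraph.open().traversal()

// Hypothetical sample data: two airports joined by a single route
g.addV('airport').property('code', 'IAD').as('iad').
  addV('airport').property('code', 'MIA').as('mia').
  addE('route').from('iad').to('mia').iterate()

// union() as a start step: the first result is the vertex count (2),
// the second is the edge count (1)
g.union(V().count(), V().outE().count())
```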

Testing with TinkerGraph

Developers using Neptune will often substitute TinkerGraph for certain types of testing. Assuming you allow for the differences between the two graphs, using TinkerGraph can help speed up testing and allow it to happen in a local environment. In 3.7.x, an important difference between Neptune and TinkerGraph was removed and TinkerGraph now supports transactions. As a result, the use of the g.tx() syntax for doing transactions with Gremlin will work for both Neptune and TinkerGraph. That said, while the Gremlin syntax for transactions is available, there might be subtle differences in the internal transaction semantics of the two graphs that could preclude certain types of tests. You can read about TinkerGraph’s transactional semantics in the TinkerPop Reference documentation and learn more about Neptune’s in its transaction documentation.

Irrespective of the differences, this change provides more testing opportunities when building applications with Neptune. For more information, see Unit Testing Apache TinkerPop Transactions: From TinkerGraph to Amazon Neptune.
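
To give a sense of what the g.tx() syntax looks like in a local test, the following is a minimal sketch for a Gremlin Console at version 3.7.x. It assumes that the transactional form of TinkerGraph is the TinkerTransactionGraph class described in the TinkerPop Reference documentation; the 'airport' label and 'code' value are illustrative only.

```groovy
// Assumption: TinkerTransactionGraph is the transaction-capable TinkerGraph in 3.7.x
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerTransactionGraph

graph = TinkerTransactionGraph.open()
g = graph.traversal()

tx = g.tx()          // transaction object for this traversal source
gtx = tx.begin()     // traversal source bound to the open transaction
try {
    gtx.addV('airport').property('code', 'SEA').iterate()
    tx.commit()      // make the new vertex visible outside the transaction
} catch (Exception e) {
    tx.rollback()    // discard partial changes on failure
    throw e
}

// The committed vertex can now be read back through g
g.V().has('code', 'SEA').count()
```

Keeping the begin, commit, and rollback calls in this shape means that the same test code can point at either graph, subject to the differences in transaction semantics noted earlier.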

Compilation and dependencies

The upgrade of applications to TinkerPop 3.7.x should be relatively seamless, but there are two points to consider that might affect the process.

  1. The gremlin-driver module is no longer a dependency of the gremlin-server module. If you used gremlin-server as the dependency in your Java application and accessed gremlin-driver transitively, you must now add gremlin-driver explicitly.
    <dependency>
       <groupId>org.apache.tinkerpop</groupId>
       <artifactId>gremlin-driver</artifactId>
       <version>3.7.1</version>
    </dependency>
  2. Serializers and message constructs for Java are now in a new module called gremlin-util. This module is referenced by gremlin-driver and will be consumed transitively by your package manager. If you happen to reference the serializers or message constructs in your code or configuration files, you must make adjustments to those references on upgrade as the package names have changed slightly. Moreover, serializer names have changed to become more consistent. You can find more details in the TinkerPop Upgrade documentation.

Conclusion

In this post, we showed you that Apache TinkerPop 3.7.x introduces an extensive number of new features to the Gremlin language, which will help improve your ability to write graph queries in various ways.

  1. Combine queries – These new features can spare you from processing query results in application code in cases where older versions of Gremlin couldn’t easily manipulate strings, dates, or collections. Search for application code that post-processes results in a way that might be better handled server-side, particularly where that processing exists mainly to prepare a second query to Neptune.
  2. Improve code readability – Many of the new features offer a way to do what was already possible with Gremlin, but in a single step rather than a pattern of steps. Identify places in your Gremlin where you can take advantage of replacing a series of steps with just one or two of the new ones. Not only will you reduce the amount of code you have, but you will also make the intent of your Gremlin clearer to those reading it.
  3. Work with data more naturally – In some ways, this point fits into the previous two, but note that these changes can affect how you store data as well. If you were storing dates as long values so that you could do math() step operations on them or denormalizing some string data to work around the inability to modify case or take part of a string, you might decide to take advantage of these new features to avoid the added complexity.

Given all these opportunities to improve your development and your application experience, the Amazon Neptune 1.3.2.0 release with Apache TinkerPop 3.7.x is a version that all Gremlin users should consider upgrading to. As always, see the TinkerPop upgrade documentation and its CHANGELOG for a full listing of changes. Upgrade to Neptune 1.3.2.0 to take advantage of these important new features.


About the author

Stephen Mallette is a member of the Amazon Neptune team at AWS. He has developed graph database and graph processing technology for many years. He is a decade-long contributor to the Apache TinkerPop project, the home of the Gremlin graph query language.