Five Tips for Fast Batched Updates with Neo4j and Cypher

Keywords: Attribute Database JSON JDBC

Link to the original text: http://jexp.de/blog/2017/03/5-tips-tricks-for-fast-batched-updates-of-graph-structures-with-neo4j-and-cypher
Note: After testing, I modified some of the Cypher statements from the original text so that they conform to Cypher's syntax rules.

When writing large amounts of data to a graph database from a program, you want the writes to be processed efficiently.

Inefficient ways

The following approaches are not very effective:
- Writing values directly into the statement instead of passing them as parameters
- Sending each update as its own request in its own transaction
- Sending a large number of individual requests in one transaction
- Generating one huge, complex statement (hundreds of lines) and submitting it in one transaction
- Sending one huge request in a transaction, which leads to an OOM (out-of-memory) error

The Right Way

You want to construct requests that are as small as possible and have a fixed statement structure (so that the query plan cache can be used), and then supply the data as parameters.

Each request can modify only a single property or an entire subgraph (hundreds of nodes), but its statement structure must stay the same; otherwise the cache cannot be used.

UNWIND - Savior

To achieve this, you just prepend a UNWIND clause to your single request. UNWIND spreads a large batch of data (up to 10k or 50k entries) into individual rows, each of which contains all the information needed for one update.

You add a {batch} parameter and set its value to a list of maps holding your data (10k or 50k entries). The data is packaged into one complete request that is syntactically valid and benefits from the query cache (because its structure stays the same).

Statement structure

Input:

{batch: [{row1},{row2},{row3},...10k]}

Statement:

UNWIND {batch} as row

// Write update statements based on Map data for each row

Examples

Here are some examples

Create nodes and write properties

Data:

{batch: [{name:"Alice",age:32},{name:"Bob",age:42}]}

Statement:

UNWIND {batch} as row
CREATE (n:Label)
SET n.name = row.name, n.age = row.age

Merge nodes and write properties

Data:

{batch: [{id:"alice@example.com",properties:{name:"Alice",age:32}},{id:"bob@example.com",properties:{name:"Bob",age:42}}]}

Statement:

UNWIND {batch} as row
MERGE (n:Label {id:row.id})
// use ON CREATE SET ... instead of plain SET to write these properties only when the node is first created
SET n.name = row.properties.name, n.age = row.properties.age

Find nodes, create/merge relationships, and write properties

Data:

{batch: [{from:"alice@example.com",to:"bob@example.com",properties:{since:2012}},{from:"alice@example.com",to:"charlie@example.com",properties:{since:2016}}]}

Statement:

UNWIND {batch} as row
MATCH (from:Label {id:row.from})
MATCH (to:Label {id:row.to})
// use CREATE instead of MERGE if duplicate KNOWS relationships are acceptable
MERGE (from)-[rel:KNOWS]->(to)
ON CREATE SET rel.since = row.properties.since

Find nodes by id or id list

Good for trees with many branches (high fan-out)

Here we only pass in a single property, created. Alternatively, you could pass in no properties at all, or a map of properties to be set or updated.

Data:

{batch: [{from:123,to:[44,12,128],created:"2016-01-13"}, {from:34,to:[23,35,2983],created:"2016-01-15"},...]}

Statement:

UNWIND {batch} as row
MATCH (from) WHERE id(from) = row.from
MATCH (to) WHERE id(to) IN row.to // list of ids
// use CREATE instead of MERGE if duplicate FOO relationships are acceptable
MERGE (from)-[rel:FOO]->(to)
SET rel.created = row.created

Faster and more efficient

Here are some more tips.

You can also pass in a map whose keys are node ids or relationship ids (as strings). Looking entities up by id this way is more compact and efficient.

Update existing nodes with id

Data:

{ batch : {"1":334, "2":222, "3":3840, ... 100k}}

Statement:

WITH {batch} as data, [k in keys({batch}) | toInt(k)] as ids
MATCH (n) WHERE id(n) IN ids

// Single attribute update
SET n.count = data[toString(id(n))]
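
If each id maps to a whole map of properties rather than a single value, a SET += variant applies them all at once. This is a minimal sketch, and it assumes the values in {batch} are property maps rather than numbers:

WITH {batch} as data, [k in keys({batch}) | toInt(k)] as ids
MATCH (n) WHERE id(n) IN ids
// assumes data[id] is a map of properties; += merges all of its entries into the node
SET n += data[toString(id(n))]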

Update existing relationships with id

Data:

{ batch : {"1":334, "2":222, "3":3840, ... 100k}}

Statement:

WITH {batch} as data, [k in keys({batch}) | toInt(k)] as ids
MATCH ()-[rel]->() WHERE id(rel) IN ids
SET rel.foo = data[toString(id(rel))]

Conditional creation of data

Sometimes you want to create data dynamically depending on the input. But Cypher currently has no conditional statement such as WHEN or IF, and CASE WHEN is only an expression, so you have to use a trick I came up with years ago.

Cypher provides the FOREACH clause, which iterates over the elements of a list and performs update operations for each of them. A list of 0 or 1 elements can therefore serve as a conditional: with 0 elements no iteration happens, and with exactly 1 element the update runs once.

The general idea is as follows:

...
FOREACH (_ IN CASE WHEN predicate THEN [true] ELSE [] END |
... update operations ....
)

The true value in that list could just as well be any other value (42, "", null, and so on); as long as there is a single element, we get a non-empty list.

Similarly, you can use RANGE(1, CASE WHEN predicate THEN 1 ELSE 0 END); when the predicate is false, the range is empty. Or, if you prefer filter, you can construct the list with filter(_ IN [1] WHERE predicate).
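
As a minimal sketch of the RANGE form (the Item label and the row.active flag are made up for illustration):

UNWIND {batch} as row
MATCH (n:Item {id:row.id})
// RANGE(1,1) yields [1] and RANGE(1,0) yields [], so the SET only runs when row.active is true
FOREACH (_ IN RANGE(1, CASE WHEN row.active THEN 1 ELSE 0 END) |
  SET n.active = true
)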

Here is a complete example:

LOAD CSV WITH HEADERS FROM {url} AS row
MATCH (o:Organization {name:row.org})
FOREACH (_ IN case when row.type = 'Person' then [1] else [] end|
   MERGE (p:Person {name:row.name})
   CREATE (p)-[:WORKS_FOR]->(o)
)
FOREACH (_ IN case when row.type = 'Agency' then [1] else [] end|
   MERGE (a:Agency {name:row.name})
   CREATE (a)-[:WORKS_FOR]->(o)
)

Note that variables created inside a FOREACH are not accessible outside of it. You either have to match them again afterwards, or perform all updates inside the FOREACH.
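
For example, a minimal sketch reusing the hypothetical Person rows from the CSV example above (the imported flag is made up): the node merged inside the FOREACH has to be matched again before it can be used.

LOAD CSV WITH HEADERS FROM {url} AS row
FOREACH (_ IN case when row.type = 'Person' then [1] else [] end |
   MERGE (p:Person {name:row.name})
)
// p is not visible here; re-match the node to keep working with it
WITH row
MATCH (p:Person {name:row.name})
SET p.imported = true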

Using the APOC Library

The APOC library provides many useful procedures and functions. Here, I recommend the following three:

  • Creating nodes and relationships with dynamically computed labels, relationship types and properties
  • Batched commits and updates
  • Creating and manipulating maps dynamically, e.g. to set them as properties

Dynamic creation of nodes and relationships

With apoc.create.node and apoc.create.relationship, you can dynamically compute node labels, relationship types and arbitrary properties.

  • Labels are passed as a string array
  • Properties are passed as a map

UNWIND {batch} as row
CALL apoc.create.node(row.labels, row.properties) yield node
RETURN count(*)

The apoc.create.* package also provides procedures for setting, updating and removing properties and labels.

UNWIND {batch} as row
MATCH (from) WHERE id(from) = row.from
MATCH (to:Label) WHERE to.key = row.to
CALL apoc.create.relationship(from, row.type, row.properties, to) yield rel
RETURN count(*)

Batch submit

As mentioned at the outset, committing a huge number of updates in one transaction is problematic. You can update millions of records with a 2G-4G heap, but it becomes difficult at larger magnitudes. With a 32G heap, my largest transaction reached 10M nodes.

At this point, apoc.periodic.iterate can provide great help.

The principle is simple: you provide two Cypher statements. The first statement supplies the data to operate on and can generate a huge stream of it. The second statement performs the actual update; it runs once for each data item, but a new transaction is only opened after a certain number of items has been processed.

For example, if your first statement returns five million nodes that need updating, the inner statement runs once for each of them, but with a batch size of 10k, a transaction is committed after every 10k nodes.

If your updates are independent of each other (creating nodes, updating properties, or updating independent subgraphs), you can add parallel:true to make full use of the CPU.

For example, if you want to compute a score for many items and update this property in batches, you would do the following:

call apoc.periodic.iterate('
MATCH (n:User)-[r1:LIKES]->(thing)<-[r2:RATED]-(m:User) WHERE id(n)<id(m) RETURN thing, avg( r1.rating + r2.rating ) as score
','
WITH {thing} as t SET t.score = {score}
', {batchSize:10000, parallel:true})

Create/update Map dynamically

Although Cypher provides fairly comprehensive operations for lists, such as range, collect, unwind, reduce, extract, filter, size, and so on, maps sometimes also need to be created or modified on the fly.

The apoc.map.* functions provide a series of helpers that simplify this.

Creating maps from other data:

RETURN apoc.map.fromPairs([["alice",38],["bob",42],...])
// {alice:38, bob: 42, ...}

RETURN apoc.map.fromLists(["alice","bob",...],[38,42])
// {alice:38, bob: 42, ...}

// groups nodes, relationships, maps by key, good for quick lookups by that key
RETURN apoc.map.groupBy([{name:"alice",gender:"female"},{name:"bob",gender:"male"}],"gender")
// {female:{name:"alice",gender:"female"}, male:{name:"bob",gender:"male"}}

RETURN apoc.map.groupByMulti([{name:"alice",gender:"female"},{name:"bob",gender:"male"},{name:"Jane",gender:"female"}],"gender")
// {female:[{name:"alice",gender:"female"},{name:"jane",gender:"female"}], male:[{name:"bob",gender:"male"}]}

Updating maps:

RETURN apoc.map.merge({alice: 38},{bob:42})
// {alice:38, bob: 42}

RETURN apoc.map.setKey({alice:38},"bob",42)
// {alice:38, bob: 42}

RETURN apoc.map.removeKey({alice:38, bob: 42},"alice")
// {bob: 42}

RETURN apoc.map.removeKeys({alice:38, bob: 42},["alice","bob","charlie"])
// {}

// remove the given keys and values, good for data from load-csv/json/jdbc/xml
RETURN apoc.map.clean({name: "Alice", ssn:2324434, age:"n/a", location:""},["ssn"],["n/a",""])
// {name:"Alice"}
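
A typical use is to clean an input row before assigning it as node properties. A minimal sketch (the Person label and the shape of the {batch} rows are assumptions, not part of the original examples):

UNWIND {batch} as row
MERGE (p:Person {name:row.name})
// strip the ssn key and placeholder values, then copy the remaining entries onto the node
SET p += apoc.map.clean(row, ["ssn"], ["n/a",""])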

Conclusion

With these techniques I can perform update operations quickly. Of course, you can also combine these methods for more complex operations.
