HashGNN: Deep Dive into Neo4j GDS’s New Node Embedding Algorithm

by Philipp Brunenberg

In this article, we'll explore with a small example how HashGNN hashes graph nodes into an embedding space.
If you prefer watching a video on this, you can do so here.
HashGNN (#GNN) is a node embedding technique which employs ideas from Message Passing Neural Networks (MPNN) to capture high-order proximity and node properties. It significantly speeds up the calculation compared to traditional Neural Networks by using an approximation technique called MinHashing. It is therefore a hash-based approach and introduces a trade-off between efficiency and accuracy. In this article, we will understand what all of this means and explore with a small example how the algorithm works.
Many graph machine learning use cases like link prediction and node classification require the calculation of node similarities. In a graph context, these similarities are most expressive when they capture (i) the neighborhood (i.e. the graph structure) and (ii) the properties of the node to be embedded. Node embedding algorithms project nodes into a low-dimensional embedding space, i.e. they assign each node a numerical vector. These vectors, the embeddings, can be used for further numerical predictive analysis (e.g. machine learning algorithms). Embedding algorithms optimize for one metric: nodes with a similar graph context (neighborhood) and/or similar properties should be mapped close together in the embedding space. Graph embedding algorithms usually employ two fundamental steps: (i) define a mechanism to sample the context of nodes (random walks in node2vec, the k-fold transition matrix in FastRP), and (ii) subsequently reduce the dimensionality while preserving pairwise distances (SGD in node2vec, random projections in FastRP).
The key point about HashGNN is that it doesn't require us to train a Neural Network based on a loss function, as we would have to in a traditional Message Passing Neural Network. Since node embedding algorithms optimize for "similar nodes should be close in the embedding space", evaluating the loss involves calculating the true similarity of a node pair. This is then used as feedback to the training, indicating how accurate the predictions are, and the weights are adjusted accordingly. Typically, the cosine similarity is used as the similarity measure.
HashGNN circumvents the model training and actually doesn't use a Neural Network at all. Instead of training weight matrices and defining a loss function, it uses a randomized hashing scheme which hashes node vectors to the same signature with the probability of their similarity. This means that we can embed nodes without the need to compare them directly against one another (i.e. no need to calculate a cosine similarity). This hashing technique is called MinHashing and was originally defined to approximate the similarity of two sets without comparing them. Since sets are encoded as binary vectors, HashGNN requires a binary node representation. To understand how this can be used to embed the nodes of a general graph, a few techniques are required. Let's take a look.
First of all, let's talk about MinHashing. MinHashing is a locality-sensitive hashing technique to approximate the Jaccard similarity of two sets. The Jaccard similarity measures the overlap (intersection) of two sets by dividing the size of the intersection by the number of unique elements present in the two sets (the union). It is defined on sets, which are encoded as binary vectors: each element in the universe (the set of all elements) is assigned a unique row index. If a particular set contains an element, this is represented by the value 1 in the respective row of the set's vector. The MinHashing algorithm hashes each set's binary vector independently and uses K hash functions to generate a K-dimensional signature for it. The intuitive explanation of MinHashing is that it randomly selects a non-zero element K times, by choosing the one with the smallest hash value. This yields the signature vector for the input set. The interesting part is that if we do this for two sets without comparing them, they will hash to the same signature with the probability of their Jaccard similarity (if K is large enough). In other words: the probability converges towards the Jaccard similarity.
In the illustration: the example sets s1 and s2 are represented as binary vectors. We can easily compute the Jaccard similarity by comparing the two and counting the rows where both vectors have a 1. These are fairly simple operations, but the complexity resides in the pairwise comparison of vectors once we have many of them.
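To make this concrete, here is a minimal Python sketch of computing the exact Jaccard similarity from two binary set vectors; the values of s1 and s2 are hypothetical and not necessarily the ones from the illustration:

```python
# Two sets over a universe of 6 elements, encoded as binary vectors
# (hypothetical example values).
s1 = [1, 0, 1, 0, 1, 0]
s2 = [0, 0, 1, 1, 1, 0]

# Rows where both vectors have a 1 (intersection) vs. rows where at least one has a 1 (union)
intersection = sum(a & b for a, b in zip(s1, s2))
union = sum(a | b for a, b in zip(s1, s2))

print(intersection / union)  # 2 shared elements / 4 unique elements = 0.5
```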
Our universe U has size 6 and we choose K (the number of hash functions) to be 3. We can easily generate new hash functions by using a simple formula and bounds for a, b and c. Now, what we actually do is hash the indices of our vectors (1-6), each referring to a single element in our universe, with each of our hash functions. This gives us 3 random permutations of the indices and therefore of the elements in our universe. Subsequently, we can take our set vectors s1 and s2 and use them as masks on our permuted elements. For each permutation and set vector, we select the index with the smallest hash value that is present in the set. This yields two three-dimensional vectors, one for each set, which are called the MinHash signatures of the sets.
MinHashing simply selects random elements from the input sets, and we need the hash functions only as a means to reproduce this randomness identically across all input sets. We have to use the same set of hash functions on all vectors, so that the signature values of two input sets collide whenever both have the selected element present. The signature values will collide with the probability of the sets' Jaccard similarity.
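Here is a minimal Python sketch of this procedure, assuming a universe of size 6 and K = 3. The hash function form h(x) = (a*x + b) mod c and the concrete parameter choices are illustrative, not the exact values behind the figures:

```python
import random

K = 3               # number of hash functions = signature length
UNIVERSE_SIZE = 6   # elements are the indices 0..5
C = 7               # a prime larger than the universe size

# K hash functions of the form h(x) = (a*x + b) % C, shared across all sets
random.seed(42)
hash_funcs = [(random.randint(1, C - 1), random.randint(0, C - 1)) for _ in range(K)]

def minhash_signature(set_vector):
    """For each hash function, pick the present element with the smallest hash value."""
    present = [i for i, bit in enumerate(set_vector) if bit]
    return [min(present, key=lambda i: (a * i + b) % C) for a, b in hash_funcs]

s1 = [1, 0, 1, 0, 1, 0]
s2 = [0, 0, 1, 1, 1, 0]

sig1, sig2 = minhash_signature(s1), minhash_signature(s2)
print(sig1, sig2)

# The fraction of colliding signature positions approximates the Jaccard similarity.
# K = 3 is far too small for a reliable estimate; in practice K is much larger.
print(sum(a == b for a, b in zip(sig1, sig2)) / K)
```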
HashGNN uses a message passing scheme as defined in Weisfeiler-Lehman Kernel Neural Networks (WLKNN) to capture high-order graph structure and node properties. It serves as the context sampling strategy mentioned earlier for HashGNN. The WLK runs in T iterations. In each iteration, it generates a new node vector for every node by combining the node's current vector with the vectors of all directly connected neighbors. Therefore, it can be thought of as passing messages (node vectors) along the edges to neighboring nodes.
After T iterations, each node contains information from nodes up to T hops away (high-order). The calculation of the new node vector in iteration t essentially aggregates all neighbor messages (from iteration t-1) into a single neighbor message and then combines it with the node vector from the previous iteration. Additionally, the WLKNN employs three neural networks (weight matrices and activation functions): (i) for the aggregated neighbor vector, (ii) for the node vector and (iii) for the combination of the two. A notable property of the WLK is that the calculation in iteration t only depends on the result of iteration t-1. It can therefore be considered a Markov chain.
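As a rough sketch (the weight matrices, the ReLU activation and the dictionary-based graph representation below are illustrative assumptions, not the exact formulation from the WLKNN paper), one such iteration could look like this:

```python
import numpy as np

def wlknn_iteration(node_vectors, neighbors, W_neigh, W_node, W_combine):
    """One simplified WLKNN message passing iteration.

    node_vectors: dict node -> vector from iteration t-1
    neighbors:    dict node -> list of directly connected nodes
    """
    new_vectors = {}
    for node, vec in node_vectors.items():
        # aggregate all neighbor messages from iteration t-1 into a single vector
        neighbor_msg = sum((node_vectors[n] for n in neighbors[node]), np.zeros_like(vec))
        # transform (i) the aggregated neighbor vector and (ii) the node's own vector,
        # then (iii) combine both, followed by a non-linearity
        combined = W_combine @ (W_neigh @ neighbor_msg + W_node @ vec)
        new_vectors[node] = np.maximum(combined, 0.0)  # ReLU as an example activation
    return new_vectors
```

Note how the update only reads the vectors from iteration t-1, which is what justifies the Markov chain view mentioned above.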
Let's explore how HashGNN combines the two approaches to efficiently embed graph nodes into embedding vectors. Just like the WLKNN, the HashGNN algorithm runs in T iterations, calculating a new node vector for every node by aggregating the neighbor vectors and its own node vector from the previous iteration. However, instead of training three weight matrices, it uses three hashing schemes for locality-sensitive hashing. Each hashing scheme consists of K randomly constructed hash functions used to draw K random features from a binary vector.
In each iteration, we perform the following steps:
Step 1: Calculate the node's signature vector: Min-hash (randomly select K features from) the node vector from the previous iteration using hashing scheme 3. In the first iteration, the nodes are initialized with their binary feature vectors (we'll talk about how to binarize nodes later). The resulting signature (or message) vector is the one that is passed along the edges to all neighbors. Therefore, we have to do this for all nodes first in every iteration.
Step 2: Construct the neighbor vector: In each node, we receive the signature vectors from all directly connected neighbors and aggregate them into a single binary vector. Subsequently, we use hashing scheme 2 to select K random features from the aggregated neighbor vector and call the result the neighbor vector.
Step 3: Combine node and neighbor vector into the new node vector: Finally, we use hashing scheme 1 to randomly select K features from the node vector of the previous iteration and combine the result with the neighbor vector. The resulting vector is the new node vector, which is the starting point for the next iteration. Note that this is not the same as in step 1: there, we apply hashing scheme 3 to the node vector to construct the message/signature vector.
As we can see, the resulting (new) node vector is influenced by its own node features (3 & 5), as well as by its neighbors' features (2 & 5). After iteration 1, the node vector captures information from neighbors at 1 hop distance. However, since we use it as the input for iteration 2, it will then already be influenced by features 2 hops away.
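Putting the three steps together, a single HashGNN iteration can be sketched as follows. The helper min_hash_select, the OR-based aggregation and combination, and the hash function construction are simplifying assumptions for illustration, not the actual GDS implementation:

```python
import random

def make_hashing_scheme(K, seed):
    """K random hash functions of the form h(x) = (a*x + b) % C, with C a large prime."""
    rng = random.Random(seed)
    C = 1_000_003
    return [(rng.randint(1, C - 1), rng.randint(0, C - 1), C) for _ in range(K)]

def min_hash_select(binary_vector, scheme):
    """For each hash function, select the active feature index with the smallest hash value."""
    active = [i for i, bit in enumerate(binary_vector) if bit]
    if not active:
        return set()
    return {min(active, key=lambda i: (a * i + b) % C) for a, b, C in scheme}

def hashgnn_iteration(node_vectors, neighbors, scheme1, scheme2, scheme3):
    """One simplified HashGNN iteration over binary node vectors."""
    dim = len(next(iter(node_vectors.values())))
    # Step 1: build every node's message/signature vector with hashing scheme 3
    messages = {v: min_hash_select(vec, scheme3) for v, vec in node_vectors.items()}
    new_vectors = {}
    for v, vec in node_vectors.items():
        # Step 2: aggregate the neighbors' messages (binary OR), then select K features (scheme 2)
        aggregated = [0] * dim
        for u in neighbors[v]:
            for i in messages[u]:
                aggregated[i] = 1
        neighbor_features = min_hash_select(aggregated, scheme2)
        # Step 3: select K features from the node's own vector (scheme 1) and combine both
        own_features = min_hash_select(vec, scheme1)
        new_vec = [0] * dim
        for i in own_features | neighbor_features:
            new_vec[i] = 1
        new_vectors[v] = new_vec
    return new_vectors
```

Running this for T iterations, starting from the binarized feature vectors, yields for every node a binary vector influenced by features up to T hops away.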
HashGNN is implemented in the Neo4j GDS (Graph Data Science Library), which adds some useful generalizations to the algorithm.
One important auxiliary step in the GDS is feature binarization. MinHashing, and therefore HashGNN, requires binary vectors as input. Neo4j offers an additional preparation step to transform real-valued node vectors into binary feature vectors. It uses a technique called hyperplane rounding. The algorithm works as follows:
Step 1: Define node features: Define the (real-valued) node properties to be used as node features. This is done using the parameter featureProperties. We will call this the node input vector f.
Step 2: Construct random binary classifiers: Define one hyperplane for each target dimension. The number of resulting dimensions is controlled by the parameter dimensions. A hyperplane is a high-dimensional plane and, as long as it passes through the origin, can be described solely by its normal vector n. The vector n is orthogonal to the plane's surface and therefore describes its orientation. In our case, n needs to have the same dimensionality as the node input vectors (dim(f) = dim(n)). To construct a hyperplane, we simply draw dim(f) times from a Gaussian distribution.
Step 3: Classify node vectors: Calculate the dot product of the node input vector and each hyperplane's normal vector; its sign tells us on which side of the hyperplane the input vector lies. Using a threshold parameter, we decide whether the input vector is above (1) or below (0) the hyperplane and assign the respective value to the resulting binary feature vector. This is exactly what happens in a binary classification as well, with the only difference that we don't iteratively optimize our hyperplanes, but rather use random Gaussian classifiers.
In essence, we draw one random Gaussian classifier for each target dimension and set a threshold parameter. Then, we classify our input vectors against each target dimension and obtain a d-dimensional binary vector, which we use as the input for HashGNN.
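A minimal NumPy sketch of hyperplane rounding under these assumptions (the variable names and the usage example are illustrative; this is not the GDS source code):

```python
import numpy as np

def binarize(features, dimensions, threshold=0.0, seed=42):
    """Hyperplane rounding: one random Gaussian hyperplane per target dimension.

    features: (n_nodes, dim_f) real-valued node input vectors f
    returns:  (n_nodes, dimensions) binary feature vectors
    """
    rng = np.random.default_rng(seed)
    # one normal vector per target dimension, each entry drawn from a Gaussian
    normals = rng.normal(size=(dimensions, features.shape[1]))
    # dot product of every input vector with every normal vector,
    # thresholded to decide on which side of the hyperplane the node lies
    return (features @ normals.T > threshold).astype(int)

# Hypothetical usage: 4 nodes with 3 real-valued properties, binarized to 8 dimensions
node_features = np.array([
    [0.2, 1.5, -0.3],
    [1.0, 0.1,  0.4],
    [-0.5, 0.9, 2.0],
    [0.0, -1.2, 0.7],
])
print(binarize(node_features, dimensions=8))
```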
HashGNN uses locality-sensitive hashing to embed node vectors into an embedding space. By using this technique, it circumvents the computationally intensive, iterative training of a Neural Network (or other optimization), as well as direct node comparisons. The authors of the paper report runtimes 2–4 orders of magnitude faster than learning-based algorithms like SEAL and P-GNN, while still producing highly comparable (in some cases even better) accuracy.
HashGNN is implemented in the Neo4j GDS (Graph Data Science Library) and can therefore be used out-of-the-box on your Neo4j graph. In the next post, I'll go into the practical details of how to use it and what to keep in mind.
Thanks for stopping by and see you next time. 🚀