Originally published at: Similarity in Graphs: Jaccard Versus the Overlap Coefficient | NVIDIA Technical Blog
This post was originally published on the RAPIDS AI Blog. There is a wide range of graph applications and algorithms that I hope to discuss through this series of blog posts, all with a bias toward what is in RAPIDS cuGraph. I am assuming that the reader has a basic understanding of graph theory and…
Regarding your commentary on the unintuitive results for similarity of vertices in a clique: why not just consider a node as a neighbor of itself? In a social networking application, where an edge means “knows” e.g., Mike “knows” Jesse, the it would also seem that Mike should “know” himself too. Then, in an application where you don’t have a logical connection between an entity and itself there wouldn’t be less similarity between members of a clique.
I stumbled on this article because I’m curious about graph similarity. For example, if I have a graph G and I want to obfuscate it into H, to simply make it difficult for someone to learn about G’s connectivity, could I use a metric like Jaccard or O.C. to test how close or dissimilar the transformed graph is to the original? Maybe if I take the summation of edge Jaccard or O.C. over every pair of nodes in graphs G and H.
I’m not so sure.
Great question. Sorry for the slow response.
Both Jaccard and Overlap come from a non-graph background and are really related to sets. For graphs we use the neighbors of a vertex as the formation of the set, but that doesn’t need to be the case. Within Cyber analysis, for example, it is beneficial to compare the two-hop neighbors. So does vertex A interact with the same people as vertex B.
Now your question about considering a node as a neighbor of itself. The notion of a self-link is handled by the algorithms and self-links can impact the score. Now if you assume every node has a self-link, then it all balances out.
If you are interested in obfuscate graph G into H to make understanding about G harder, then self-links would be a start but not enough to hide similarity. You would really need to add a large collection of random edges.
As an aside you can see the work I did on creating a community detection algorithm using neighbor knowledge: https://www.academia.edu/download/89102692/p1393.pdf