Community detection refers to recovering a (latent) label on which the distribution of the observed graph depends. Recent work has also investigated the impact of additionally knowing the value of another variable at each vertex that is correlated with the vertex label (side information), while assuming that the side information is independent of the graph edges conditioned on the label. This work extends the scope of community detection in two ways. First, we consider side information that does not form a Markov chain with the label and graph, that is, a non-label latent variable on which the graph edges also depend, and analyze the detection threshold of semidefinite programming subject to knowledge of this side information. In the second part of the work, we consider, aside from vertex labels, a second latent variable that is unknown both in realization and in distribution. We then investigate the performance of semidefinite programming community detection as a function of the (unknown) composition of the nuisance latent variable. In both cases, it is shown that semidefinite programming can achieve exact recovery down to the optimal (information-theoretic) threshold.

Triangle enumeration is a fundamental problem in large-scale graph analysis. For instance, triangles are used to solve practical problems like community detection and spam filtering. On the other hand, there is a large amount of data stored on database management systems (DBMSs), which can be modeled and analyzed as graphs. Alternatively, graph data can be quickly loaded into a DBMS. Our paper shows how to adapt and optimize a randomized distributed triangle enumeration algorithm with SQL queries, which is a significantly different approach from programming graph algorithms in traditional languages such as Python or C++. We choose a parallel columnar DBMS given its fast query processing, but our solution should work for a row DBMS as well. Our randomized solution provides a balanced workload for parallel query processing and is robust to the existence of skewed-degree vertices. We show experimentally that our solution ensures a balanced data distribution, and hence workload, among machines. The key idea behind the algorithm is to evenly partition all possible triplets of vertices among machines, sending edges that may form a triangle to a proxy machine. This edge redistribution eliminates shuffling edges during join computation, and therefore triangle enumeration becomes local and fully parallel. In summary, our algorithm exhibits linear speedup with large graphs, including graphs that have high skewness in vertex degree distributions.
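The triplet-partitioning idea in the triangle enumeration abstract can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions, not the authors' SQL implementation: it assumes each vertex receives a random color out of q, that each of the q^3 simulated "machines" is identified by an ordered color triplet, and that an edge is forwarded to every machine whose triplet could host it. The function name `enumerate_triangles` and the parameters `q` and `seed` are hypothetical.

```python
import random
from collections import defaultdict


def enumerate_triangles(edges, q=2, seed=0):
    """Simulate randomized triplet partitioning on q**3 hypothetical machines."""
    rng = random.Random(seed)
    # Normalize to undirected edges stored as (smaller, larger); drop self-loops.
    edges = {(min(u, v), max(u, v)) for u, v in edges if u != v}
    vertices = sorted({x for e in edges for x in e})
    # Assumption: one uniformly random color per vertex.
    color = {v: rng.randrange(q) for v in vertices}

    # Redistribution step: a triangle u < v < w is owned by machine
    # (color[u], color[v], color[w]), so an edge goes to every machine
    # where it could fill positions (1,2), (1,3) or (2,3) of a triplet.
    inbox = defaultdict(set)
    for u, v in edges:
        cu, cv = color[u], color[v]
        for x in range(q):
            inbox[(cu, cv, x)].add((u, v))
            inbox[(cu, x, cv)].add((u, v))
            inbox[(x, cu, cv)].add((u, v))

    # Local step: each machine joins only the edges it received.
    triangles = []
    for (a, b, c), local_edges in inbox.items():
        larger = defaultdict(set)  # vertex -> neighbours with a larger id
        for u, v in local_edges:
            larger[u].add(v)
        for u, v in local_edges:
            if color[u] != a or color[v] != b:
                continue
            for w in larger[u] & larger[v]:
                if color[w] == c:
                    triangles.append((u, v, w))
    return sorted(triangles)
```

Because each triangle is owned by exactly one color triplet, it is reported once, and no machine needs edges from another machine during the local join. For the complete graph on vertices 0 to 3, the function returns the four triangles of that graph regardless of the random coloring.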