From is simplified model to program and is

From “Think like a Vertex”
to “Think like a graph”

Yuanyuan
Tian, Andrey Balmin, Severin Andreas Corsten, Shirish Tatikonda, John
McPherson, IBM Almaden Research Center, USA

Abstract As with the growth of data in the form
of graphs and internet, it is important to process data with speed and
efficiency. There are number of systems available to process data such as
Graphlab and Pregel which divide input data into partitions and apply vertex
centric model. Basically, vertex-centric model is simplified model to program
and is useful for all graph algorithms.

This
model has some shortcoming i.e. this model hides the portioning information
from programmer. Due to which systems like Pregel takes longer processing time due
to greater number of network messages and Graphlab has heavy scheduling
overhead. To solve these shortcomings, a new model named graph-centric model is
proposed here. The graph centric model provide partition information to
programmers. This model is implemented in a system called Giraph++. The
Giraph++ is based on Apache Giraph which is open-source implementation of
Pregel. To check the applicability of proposed model, three categories of different
graph algorithms are used.

 

 

I.    
INTRODUCTION1

With the growing graph and network
data, scalable processing infrastructure is required. Existing systems
process graph data in following steps. 1) Divide input graphs into partitions
to perform parallelism 2) they use vertex-centric model, in which programmers
code algorithm by making vertex as a base. Every vertex has information
about  itself and directly connected neighbors.

 

In Pregel system, computations steps are as 1) receiving messages
from connected vertices 2) updating the state of edges and itself 3) Sending
messages to other connected vertices. In GraphLab, computation steps are as 1)
computation for a single vertex to read or/and update own data 2) or
read/update data of its neighbors.

 

The vertex-centric model proved easy for programmers to program
and also perform well for many graph algorithms. But it does not work efficiently.
  The
graph is partitioned into multiple partitions. These partitions are complete
sub-graph but vertex-centric model only consider one partition as unrelated
vertices. One vertex has only information about itself and directly connected
neighbors so information is forwarded through graphs in a very slow way. Due to
this, this model takes many computation steps to forward a message from source
vertex to destination vertex even both vertices locate in the same partition.

 

To resolve this issue, a new approach named “graph-centric model”
is proposed here in which programmer has information about the portioning
structure. The graph centric model is implemented in a distributed graph
processing system called Giraph++. To check the applicability and flexibility
of the proposed model, three algorithm categories are used i.e. graph
traversal, random walk and graph aggregation. There are number of algorithms
within these categories. We compare the vertex-centric model with graph-centric
model and hybrid model. Hybrid model works like the vertex-centric model but it
performs asynchronous computation also. During experiments, we get significant
performance gains from graph-centric model.

 

For example, if a graph has million number of vertices and
million number of edges than vertex-centric model run almost 63% slower than
the graph-centric model and use 204% greater number of messages.

 

This proposed graph-centric model is not a whole replacement of existing
vertex-centric model. Both have their own advantages and disadvantages. The
vertex-centric model is easy to implement. The graph-centric model is good for
algorithm-specific optimization problems. The graph-centric model can be
implemented in other data processing systems but here Giraph is chosen by us
due to its popularity. The graph mutation is a fundamental requirement for
graph aggregation e.g. graph sparsification 19,graph coarsening 12,11 and graph
summarization22. So Giraph++ is system which is able to support asynchronous
computation and mutation both at the same time for graph data.

 

The next sections of this paper are arranged as: Section II is based on
necessary background about Giraph. Section III provides the introduction of
graph-centric model. Section IV have applicability of graph-centric model for
multiple graph algorithms. In Section V, the hybrid model is proposed. In
Section VI, implementation detail is provided. 
Section VII provides related work. Then research paper concluded in
Section VIII.

II.    
Giraph and Pregel Overview

 

This section give a brief overview of Giraph which is an open
source implementation of Pregel.

 

The Giraph divides the graph into partitions as shown in Figure 1
and distribute them to set of nodes. One node becomes master and others become slave
nodes. The hash function is a default partitioner here but other partitioners
can be used. Mostly number of partitions is greater than the number of nodes. Every
partition has specific set of vertices and directly connected neighbors.

 

In Giraph vertex-centric model is used. Here each vertex is
independent unit inherit from Vertex class. Every vertex has following
attributes 1) vertex ID 2) vertex value 3) all directly connected outgoing
edges and each edge has value associated with it 4) messages buffer sent to it.
Here Figure 2 shows the major functions used here 5) vertex states either
active or inactive.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

(a)

 

 

Partition Number

Vertex ID

Edges

1

A
B

B


F

2

C
D


E

 

3

E
F


F


D

 

(b)

 

 

Sub-graph ID

Subgraph

 
G1

 

 
G2

 
G3

 

(c)

 

Figure 1: One example of a graph, its partitions and subgraphs

 

 

Figure 2: All functions in Giraph

 

 

Giraph program consists of input step and output step. In Input
step, graph is initialized using sequenced iterations called super-steps.
Super-steps are separated by global synchronization barriers. In output steps, results
get written. In the very starting, all vertices have active state. A vertex can
inactive itself by using function named votetoHalt(). A vertex get activated by
incoming messages from other connected neighbor vertices.  If all vertices are inactive, the program
will terminate. In super-step x, each active vertex can receive message send by
other directly connected neighbors in super-step x-1 and update the value of
current vertex and its outwards edges. Then send the message to other vertices
in next super-step i.e. x+1. All type of Messages computation is done in
function named compute(). A function named combiner can be used to reduce the
number of messages.

 

 

III.    
Graph-centric MODEL (Giraph++)

This section introduce graph-centric model. This model gives
programmers access to whole sub-graph.

A.    
Vertices types

Graph-centric model divides the input graph into partitions
as shown in Figure 1(b). Each partitions is a complete sub-graph as shown in
Figure 1(c). Here vertices are divided into two categories 1) internal vertices
2) boundary vertices. In figure 1(c), A, B are internal vertices of G1 while D,
F are boundary vertices of G1. One vertex can be internal in exactly one sub-graph
and it can be boundary vertex in other sub-graphs.  For example, A is internal vertex in G1 and is
a boundary vertex in other sub-graphs G2 and G3.

 

In graph-centric model, each internal vertex has following
attributes 1) vertex value 2) edge value 3) incoming messages. Each boundary
vertex of sub-graph has only vertex value. Vertex value of boundary vertex is
just a local copy. Its original copy is within corresponding internal vertex.

 

Difference between internal and boundary vertices is that
messages can be only send through boundary vertices to original copies.
Information exchange between internal vertices is efficient and also cheap. The
state of internal vertex can be changed at any time without any network message
or next super-step.

 

B.    
Programming API Giraph++

Giraph++ executes the program in sequence of super-steps. But
here in each super-step, computation is performed on entire sub-graph of one
partition.

 

Here Vertex class is used for both boundary vertices and
internal vertices. Some functions are not used here. Internal vertices used
functions that are highlighted in red and blue in Figure 2.Boundry vertices used
only red highlighted functions. Boundary vertices don’t have any state and
internal vertices have two states 1) active 2) inactive.

 

One new class is introduces here i.e. GraphPartition. Figure
3 shows some major functions of this class. GraphPartition allows 1) to use all
vertices i.e. boundary and internal 2) to check either a vertex is boundary and
internal 3) to send messages to internal vertices of other sub-graphs 4) to
inactive all internal vertices of one sub-graph at the same time. Here compute
() function is for the one whole sub-graph.

 

 

Figure 3: All functions of GraphPartition
class

 

IV.    
Manipulation of Giraph++

This section check the applicability of graph centric model to
three categories of algorithm 1) graph traversal 2)random walk 3) graph aggregation.
Here one representative from each of three categories is choosen.

 

A.    
Graph Traversal

Graph traversal category contain algorithms that have to visit
all vertices of one graph in specified way. I can check or update the values of
vertices with traversing. Its examples are connected components, shortest
distances etc.

 Here we examine the
implementation of connected components for both vertex-centric model and graph
centric model.

 

 

Figure
4: Example of connected component in both graph-centric model and
vertex-centric model

 

 

 

 

     Algorithm 1
shows connected component algorithm example in Giraph.Here vertex value of each
vertex is its label. In the very starting i.e. superstep 0 every vertex has its
own unique ID as a label. Then this label is communicated to all its neighbors.
In later supersteps, every vertex check the smalest label out of all received
messages. If new found label is smaller than previous label then vertex will
update its label with newest one and also communicate this label to directly
connected neighbors. At the end of algorithm. Every vertex has smallest label
out of all partitions. Here combiner() function is used to compute the minimum
label for each vertex. Figure 4(a), Figure 4(b) shows the labels and vertices
communication in each superstep.

 

     Algorithm 2 shows connected component
algorithm example in Giraph++. In Figure 4(a), 4(c) and 4(d) shows the sub-graphs
and communication between vertices. A set of vertices connected in one sub-graph
is also connected vertices in original graph.

 

In this program, superstep 0 runs
sequential algorithm of connected components. Here sequential algorithm breadth
first search is used on each subgraph. Then sends the local label of each
boundary vertex to its corresponding internal vertex. In figure 4(a), in
superstep 0 every vertex of one subgraph has smallest label i.e. A for subgraph
1 and C for subgraph 2. Here D is internal vertex of sungraph 1 and C is
internal vertex of subgraph 2. Each boundry vertex send label to corresponding
internal vertex. In next consective supersteps, internal vertices check all
incoming messages from boundry vertices of its subgraph and other subgraphs.
Then check the labels which represent equivalent components. Then these
components will be merged into larger component and stores into equiCC. So pair
A and C  is stored in equiCC. The
function consolidate() is used to assign the unique label for equivalent
component. In our example, A is new label for equivalent component. If boundry
vertex’s label is changed, then this label is to be communicated to
corresponding internal vertex.

 

If we compare both algorithms for
Figure 4(b) and Figure 4(d), then it has been observed that vertex-centric
model needs 6 supersteps while graph-centric model needs only 3 supersteps.

B.     Random Walk

Random walk algorithms include HITS13, PageRank5,
ObjectRank2 etc. Here we use only one algorithm i.e. PageRank as a representative
of Random Walk algorithm.

 

 

    Algorithm 3 shows
PageRank algorithm pseudo code implemented in vertex-centric model Giraph. Here
damping factor 0.85 is used. This algorithm is not original PageRank algorithm
but its extended version proposed in 25. PageRank is computed by following
equation

 

 

Where |Eu| is no of outgoing edges for vertex u. It is provided in 25 that cumulative
update give same value as original PageRank algorithm provides. The major
advantage of cumulative approach is the ability to update or increment values
in the next superstep.

 

 

Algorithm
4 is PageRank implementation in graph-centric model Giraph++. Like algorithm 3,
this algorithm also follows the cumulative approach for updating values. Major
differences between algorithm 3 and 4 are 1) every vertex has another value
named delta with vertex value, delta saves the intermediate values which are
received from other vertices of the same subgraph(line number 16 to 18)

 

2) PageRank
local computation is asynchronous i.e. it uses intermediate results of other
vertices in the same subgraph(line number 14).Our graph centric-centric model
allows local asynchrony. Graph-centric model and vertex-centric model both can
use combiner() function which computes sum of messages(line number 12 to 13 of
Algorithm 3).

1 This research paper is summary of research paper “From
think like a vertex to think like a graph”.