From “Think like a Vertex”

to “Think like a graph”

Yuanyuan

Tian, Andrey Balmin, Severin Andreas Corsten, Shirish Tatikonda, John

McPherson, IBM Almaden Research Center, USA

Abstract As with the growth of data in the form

of graphs and internet, it is important to process data with speed and

efficiency. There are number of systems available to process data such as

Graphlab and Pregel which divide input data into partitions and apply vertex

centric model. Basically, vertex-centric model is simplified model to program

and is useful for all graph algorithms.

This

model has some shortcoming i.e. this model hides the portioning information

from programmer. Due to which systems like Pregel takes longer processing time due

to greater number of network messages and Graphlab has heavy scheduling

overhead. To solve these shortcomings, a new model named graph-centric model is

proposed here. The graph centric model provide partition information to

programmers. This model is implemented in a system called Giraph++. The

Giraph++ is based on Apache Giraph which is open-source implementation of

Pregel. To check the applicability of proposed model, three categories of different

graph algorithms are used.

I.

INTRODUCTION1

With the growing graph and network

data, scalable processing infrastructure is required. Existing systems

process graph data in following steps. 1) Divide input graphs into partitions

to perform parallelism 2) they use vertex-centric model, in which programmers

code algorithm by making vertex as a base. Every vertex has information

about itself and directly connected neighbors.

In Pregel system, computations steps are as 1) receiving messages

from connected vertices 2) updating the state of edges and itself 3) Sending

messages to other connected vertices. In GraphLab, computation steps are as 1)

computation for a single vertex to read or/and update own data 2) or

read/update data of its neighbors.

The vertex-centric model proved easy for programmers to program

and also perform well for many graph algorithms. But it does not work efficiently.

The

graph is partitioned into multiple partitions. These partitions are complete

sub-graph but vertex-centric model only consider one partition as unrelated

vertices. One vertex has only information about itself and directly connected

neighbors so information is forwarded through graphs in a very slow way. Due to

this, this model takes many computation steps to forward a message from source

vertex to destination vertex even both vertices locate in the same partition.

To resolve this issue, a new approach named “graph-centric model”

is proposed here in which programmer has information about the portioning

structure. The graph centric model is implemented in a distributed graph

processing system called Giraph++. To check the applicability and flexibility

of the proposed model, three algorithm categories are used i.e. graph

traversal, random walk and graph aggregation. There are number of algorithms

within these categories. We compare the vertex-centric model with graph-centric

model and hybrid model. Hybrid model works like the vertex-centric model but it

performs asynchronous computation also. During experiments, we get significant

performance gains from graph-centric model.

For example, if a graph has million number of vertices and

million number of edges than vertex-centric model run almost 63% slower than

the graph-centric model and use 204% greater number of messages.

This proposed graph-centric model is not a whole replacement of existing

vertex-centric model. Both have their own advantages and disadvantages. The

vertex-centric model is easy to implement. The graph-centric model is good for

algorithm-specific optimization problems. The graph-centric model can be

implemented in other data processing systems but here Giraph is chosen by us

due to its popularity. The graph mutation is a fundamental requirement for

graph aggregation e.g. graph sparsification 19,graph coarsening 12,11 and graph

summarization22. So Giraph++ is system which is able to support asynchronous

computation and mutation both at the same time for graph data.

The next sections of this paper are arranged as: Section II is based on

necessary background about Giraph. Section III provides the introduction of

graph-centric model. Section IV have applicability of graph-centric model for

multiple graph algorithms. In Section V, the hybrid model is proposed. In

Section VI, implementation detail is provided.

Section VII provides related work. Then research paper concluded in

Section VIII.

II.

Giraph and Pregel Overview

This section give a brief overview of Giraph which is an open

source implementation of Pregel.

The Giraph divides the graph into partitions as shown in Figure 1

and distribute them to set of nodes. One node becomes master and others become slave

nodes. The hash function is a default partitioner here but other partitioners

can be used. Mostly number of partitions is greater than the number of nodes. Every

partition has specific set of vertices and directly connected neighbors.

In Giraph vertex-centric model is used. Here each vertex is

independent unit inherit from Vertex class. Every vertex has following

attributes 1) vertex ID 2) vertex value 3) all directly connected outgoing

edges and each edge has value associated with it 4) messages buffer sent to it.

Here Figure 2 shows the major functions used here 5) vertex states either

active or inactive.

(a)

Partition Number

Vertex ID

Edges

1

A

B

B

D

F

2

C

D

A

E

3

E

F

A

F

A

D

(b)

Sub-graph ID

Subgraph

G1

G2

G3

(c)

Figure 1: One example of a graph, its partitions and subgraphs

Figure 2: All functions in Giraph

Giraph program consists of input step and output step. In Input

step, graph is initialized using sequenced iterations called super-steps.

Super-steps are separated by global synchronization barriers. In output steps, results

get written. In the very starting, all vertices have active state. A vertex can

inactive itself by using function named votetoHalt(). A vertex get activated by

incoming messages from other connected neighbor vertices. If all vertices are inactive, the program

will terminate. In super-step x, each active vertex can receive message send by

other directly connected neighbors in super-step x-1 and update the value of

current vertex and its outwards edges. Then send the message to other vertices

in next super-step i.e. x+1. All type of Messages computation is done in

function named compute(). A function named combiner can be used to reduce the

number of messages.

III.

Graph-centric MODEL (Giraph++)

This section introduce graph-centric model. This model gives

programmers access to whole sub-graph.

A.

Vertices types

Graph-centric model divides the input graph into partitions

as shown in Figure 1(b). Each partitions is a complete sub-graph as shown in

Figure 1(c). Here vertices are divided into two categories 1) internal vertices

2) boundary vertices. In figure 1(c), A, B are internal vertices of G1 while D,

F are boundary vertices of G1. One vertex can be internal in exactly one sub-graph

and it can be boundary vertex in other sub-graphs. For example, A is internal vertex in G1 and is

a boundary vertex in other sub-graphs G2 and G3.

In graph-centric model, each internal vertex has following

attributes 1) vertex value 2) edge value 3) incoming messages. Each boundary

vertex of sub-graph has only vertex value. Vertex value of boundary vertex is

just a local copy. Its original copy is within corresponding internal vertex.

Difference between internal and boundary vertices is that

messages can be only send through boundary vertices to original copies.

Information exchange between internal vertices is efficient and also cheap. The

state of internal vertex can be changed at any time without any network message

or next super-step.

B.

Programming API Giraph++

Giraph++ executes the program in sequence of super-steps. But

here in each super-step, computation is performed on entire sub-graph of one

partition.

Here Vertex class is used for both boundary vertices and

internal vertices. Some functions are not used here. Internal vertices used

functions that are highlighted in red and blue in Figure 2.Boundry vertices used

only red highlighted functions. Boundary vertices don’t have any state and

internal vertices have two states 1) active 2) inactive.

One new class is introduces here i.e. GraphPartition. Figure

3 shows some major functions of this class. GraphPartition allows 1) to use all

vertices i.e. boundary and internal 2) to check either a vertex is boundary and

internal 3) to send messages to internal vertices of other sub-graphs 4) to

inactive all internal vertices of one sub-graph at the same time. Here compute

() function is for the one whole sub-graph.

Figure 3: All functions of GraphPartition

class

IV.

Manipulation of Giraph++

This section check the applicability of graph centric model to

three categories of algorithm 1) graph traversal 2)random walk 3) graph aggregation.

Here one representative from each of three categories is choosen.

A.

Graph Traversal

Graph traversal category contain algorithms that have to visit

all vertices of one graph in specified way. I can check or update the values of

vertices with traversing. Its examples are connected components, shortest

distances etc.

Here we examine the

implementation of connected components for both vertex-centric model and graph

centric model.

Figure

4: Example of connected component in both graph-centric model and

vertex-centric model

Algorithm 1

shows connected component algorithm example in Giraph.Here vertex value of each

vertex is its label. In the very starting i.e. superstep 0 every vertex has its

own unique ID as a label. Then this label is communicated to all its neighbors.

In later supersteps, every vertex check the smalest label out of all received

messages. If new found label is smaller than previous label then vertex will

update its label with newest one and also communicate this label to directly

connected neighbors. At the end of algorithm. Every vertex has smallest label

out of all partitions. Here combiner() function is used to compute the minimum

label for each vertex. Figure 4(a), Figure 4(b) shows the labels and vertices

communication in each superstep.

Algorithm 2 shows connected component

algorithm example in Giraph++. In Figure 4(a), 4(c) and 4(d) shows the sub-graphs

and communication between vertices. A set of vertices connected in one sub-graph

is also connected vertices in original graph.

In this program, superstep 0 runs

sequential algorithm of connected components. Here sequential algorithm breadth

first search is used on each subgraph. Then sends the local label of each

boundary vertex to its corresponding internal vertex. In figure 4(a), in

superstep 0 every vertex of one subgraph has smallest label i.e. A for subgraph

1 and C for subgraph 2. Here D is internal vertex of sungraph 1 and C is

internal vertex of subgraph 2. Each boundry vertex send label to corresponding

internal vertex. In next consective supersteps, internal vertices check all

incoming messages from boundry vertices of its subgraph and other subgraphs.

Then check the labels which represent equivalent components. Then these

components will be merged into larger component and stores into equiCC. So pair

A and C is stored in equiCC. The

function consolidate() is used to assign the unique label for equivalent

component. In our example, A is new label for equivalent component. If boundry

vertex’s label is changed, then this label is to be communicated to

corresponding internal vertex.

If we compare both algorithms for

Figure 4(b) and Figure 4(d), then it has been observed that vertex-centric

model needs 6 supersteps while graph-centric model needs only 3 supersteps.

B. Random Walk

Random walk algorithms include HITS13, PageRank5,

ObjectRank2 etc. Here we use only one algorithm i.e. PageRank as a representative

of Random Walk algorithm.

Algorithm 3 shows

PageRank algorithm pseudo code implemented in vertex-centric model Giraph. Here

damping factor 0.85 is used. This algorithm is not original PageRank algorithm

but its extended version proposed in 25. PageRank is computed by following

equation

Where |Eu| is no of outgoing edges for vertex u. It is provided in 25 that cumulative

update give same value as original PageRank algorithm provides. The major

advantage of cumulative approach is the ability to update or increment values

in the next superstep.

Algorithm

4 is PageRank implementation in graph-centric model Giraph++. Like algorithm 3,

this algorithm also follows the cumulative approach for updating values. Major

differences between algorithm 3 and 4 are 1) every vertex has another value

named delta with vertex value, delta saves the intermediate values which are

received from other vertices of the same subgraph(line number 16 to 18)

2) PageRank

local computation is asynchronous i.e. it uses intermediate results of other

vertices in the same subgraph(line number 14).Our graph centric-centric model

allows local asynchrony. Graph-centric model and vertex-centric model both can

use combiner() function which computes sum of messages(line number 12 to 13 of

Algorithm 3).

1 This research paper is summary of research paper “From

think like a vertex to think like a graph”.