Giraph Source Analysis - Counting the Number of Vertices Participating in Each Superstep

Keywords: Big Data Apache Hadoop Java Google

Author | Bai Song

Objective: In research on graph processing systems, it is often necessary to know how many vertices take part in each iteration (superstep) in order to optimize the system further. For example, the last line of the SSSP compute() method calls voteToHalt(), which puts the current vertex into the inactive state, so at the end of every superstep all vertices are inactive. After the global synchronization barrier, each vertex that has received a message is reactivated, and its compute() method is invoked in the next superstep. The purpose of this post is to count the number of vertices whose compute() method is invoked in each superstep. The compute() method of SSSP is attached below:

  @Override
  public void compute(Iterable<DoubleWritable> messages) {
    if (getSuperstep() == 0) {
      setValue(new DoubleWritable(Double.MAX_VALUE));
    }
    // Best distance seen in this superstep: 0 for the source, "infinity" otherwise.
    double minDist = isSource() ? 0d : Double.MAX_VALUE;
    for (DoubleWritable message : messages) {
      minDist = Math.min(minDist, message.get());
    }
    // Propagate to the neighbors only when a shorter distance was found.
    if (minDist < getValue().get()) {
      setValue(new DoubleWritable(minDist));
      for (Edge<LongWritable, FloatWritable> edge : getEdges()) {
        double distance = minDist + edge.getValue().get();
        sendMessage(edge.getTargetVertexId(), new DoubleWritable(distance));
      }
    }
    // Put the vertex into the inactive state.
    voteToHalt();
  }

Note: In Giraph, the algorithm terminates when there are no active vertices and no messages are being passed between workers.

In hama-0.6.0, the termination condition only checks whether there are active vertices. This does not fully follow the Pregel model, and in that respect the implementation is only half finished.
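The difference between the two termination rules can be made concrete with a small sketch. The class and method names below are hypothetical and are not taken from the Giraph or Hama code base; the second method encodes the Hama 0.6.0 behavior only as it is described above.

```java
/** Hypothetical helper contrasting the two termination rules described above. */
final class HaltCheckSketch {
  /** Pregel-style rule: stop only if no vertex is active AND no message is pending. */
  static boolean pregelStyleHalt(long activeVertices, long pendingMessages) {
    return activeVertices == 0 && pendingMessages == 0;
  }

  /** Rule that looks at active vertices only and ignores pending messages. */
  static boolean activeOnlyHalt(long activeVertices) {
    return activeVertices == 0;
  }
}
```

In SSSP every vertex votes to halt at the end of each superstep, so a check that looks only at active vertices right after a superstep would report "done" even while messages are still waiting to be delivered. The Pregel-style rule keeps the job running in that case.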

The modification process is as follows:

  1. org.apache.giraph.partition.PartitionStats class

Add a variable and methods that count the number of vertices each Partition computes in each superstep. The additions are as follows:

/** Number of vertices computed in this partition */
private long computedVertexCount = 0;
 
/**
 * Increment the computed vertex count by one.
 */
public void incrComputedVertexCount() {
    ++computedVertexCount;
}
 
/**
 * @return the computedVertexCount
 */
public long getComputedVertexCount() {
    return computedVertexCount;
}

Modify the readFields() and write() methods by adding one statement at the end of each. When a Partition finishes its computation, its computedVertexCount is sent to the Master, which reads the values and aggregates them.

@Override
public void readFields(DataInput input) throws IOException {
    partitionId = input.readInt();
    vertexCount = input.readLong();
    finishedVertexCount = input.readLong();
    edgeCount = input.readLong();
    messagesSentCount = input.readLong();
    //Add the following statement
    computedVertexCount = input.readLong();
}
 
@Override
public void write(DataOutput output) throws IOException {
    output.writeInt(partitionId);
    output.writeLong(vertexCount);
    output.writeLong(finishedVertexCount);
    output.writeLong(edgeCount);
    output.writeLong(messagesSentCount);
    //Add the following statement
    output.writeLong(computedVertexCount);
}
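Both methods have to change together because the fields are read back in exactly the order in which they were written. The following standalone sketch illustrates this with a round trip through a byte array. MiniPartitionStats is a hypothetical two-field stand-in and uses only java.io, not the Hadoop Writable interface.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for PartitionStats with only two fields. */
final class MiniPartitionStats {
  long vertexCount;
  long computedVertexCount;

  void write(DataOutput output) throws IOException {
    output.writeLong(vertexCount);
    output.writeLong(computedVertexCount); // new field is appended last
  }

  void readFields(DataInput input) throws IOException {
    vertexCount = input.readLong();
    computedVertexCount = input.readLong(); // read at the same position
  }

  /** Serializes the stats to bytes and reads them back into a fresh object. */
  static MiniPartitionStats roundTrip(MiniPartitionStats in) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      in.write(new DataOutputStream(bytes));
      MiniPartitionStats out = new MiniPartitionStats();
      out.readFields(
          new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
      return out;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

If write() emitted the new field but readFields() did not consume it (or the order differed), the receiver would either leave the value at 0 or assign it to the wrong field.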
  2. org.apache.giraph.graph.GlobalStats class

    Add a variable and a method that hold the total number of vertices computed in each superstep, summed over all Partitions on all Workers.

  /**
   * Total number of vertices computed in this superstep, over all partitions.
   * Added by BaiSong.
   */
  private long computedVertexCount = 0;

  /**
   * @return the computedVertexCount
   */
  public long getComputedVertexCount() {
    return computedVertexCount;
  }

Modify the addPartitionStats(PartitionStats partitionStats) method so that it also accumulates computedVertexCount.

/**
  * Add the stats of a partition to the global stats.
  *
  * @param partitionStats Partition stats to be added.
  */
  public void addPartitionStats(PartitionStats partitionStats) {
    this.vertexCount += partitionStats.getVertexCount();
    this.finishedVertexCount += partitionStats.getFinishedVertexCount();
    this.edgeCount += partitionStats.getEdgeCount();
    // Added by BaiSong: accumulate the computed vertex count as well.
    this.computedVertexCount += partitionStats.getComputedVertexCount();
  }
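The aggregation itself is a plain sum over the per-partition values. A minimal sketch, assuming a fresh accumulator is used for each superstep; MiniGlobalStats and its methods are hypothetical names and not part of Giraph:

```java
/** Hypothetical stand-in for GlobalStats that sums per-partition counts. */
final class MiniGlobalStats {
  private long computedVertexCount = 0;

  /** Adds one partition's computed-vertex count to the running total. */
  void addPartitionComputed(long partitionComputedVertexCount) {
    computedVertexCount += partitionComputedVertexCount;
  }

  long getComputedVertexCount() {
    return computedVertexCount;
  }

  /** Aggregates the counts reported by all partitions for one superstep. */
  static long total(long... perPartitionCounts) {
    MiniGlobalStats stats = new MiniGlobalStats();
    for (long count : perPartitionCounts) {
      stats.addPartitionComputed(count);
    }
    return stats.getComputedVertexCount();
  }
}
```

For example, three partitions that computed 2, 0 and 2 vertices in a superstep give a global value of 4 for that superstep.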

For easier debugging, you can also modify the toString() method of this class (optional). The modified method is as follows:

  @Override
  public String toString() {
    return "(vtx=" + vertexCount + ",computedVertexCount="
        + computedVertexCount + ",finVtx=" + finishedVertexCount
        + ",edges=" + edgeCount + ",msgCount=" + messageCount
        + ",haltComputation=" + haltComputation + ")";
  }
  3. org.apache.giraph.graph.ComputeCallable<I,V,E,M> class

Add the counting itself. In the computePartition() method, add one statement right after the call to vertex.compute(), as shown below.

if (!vertex.isHalted()) {
    context.progress();
    TimerContext computeOneTimerContext = computeOneTimer.time();
    try {
        vertex.compute(messages);
        // Added: once compute() has run for this vertex, add 1 to the
        // computedVertexCount of the Partition.
        partitionStats.incrComputedVertexCount();
    } finally {
        computeOneTimerContext.stop();
    }
......
  4. Add Hadoop Counters for the new statistic. The procedure is the same as in my earlier post, Giraph Source Analysis (7) - Adding Message Statistics, so it is not described in detail here. The added class is org.apache.giraph.counters.GiraphComputedVertex; its source code is attached below.

package org.apache.giraph.counters;
 
import java.util.Iterator;
import java.util.Map;
 
import org.apache.hadoop.mapreduce.Mapper.Context;
import com.google.common.collect.Maps;
 
/**
 * Hadoop Counters in group "Giraph Computed Vertex" for counting the number
 * of vertices computed in every superstep.
 */
public class GiraphComputedVertex extends HadoopCountersBase {
	/** Counter group name for the Giraph computed-vertex counters */
	public static final String GROUP_NAME = "Giraph Computed Vertex";
 
	/** Singleton instance for everyone to use */
	private static GiraphComputedVertex INSTANCE;
 
	/** Computed-vertex counter of each superstep, keyed by superstep number */
	private final Map<Long, GiraphHadoopCounter> superstepVertexCount;
 
	private GiraphComputedVertex(Context context) {
		super(context, GROUP_NAME);
		superstepVertexCount = Maps.newHashMap();
	}
 
	/**
	 * Instantiate with Hadoop Context.
	 * 
	 * @param context
	 *            Hadoop Context to use.
	 */
	public static void init(Context context) {
		INSTANCE = new GiraphComputedVertex(context);
	}
 
	/**
	 * Get singleton instance.
	 * 
	 * @return singleton GiraphComputedVertex instance.
	 */
	public static GiraphComputedVertex getInstance() {
		return INSTANCE;
	}
 
	/**
	 * Get the computed-vertex counter for a superstep, creating it on first use.
	 * 
	 * @param superstep
	 *            superstep number
	 * @return counter holding the number of vertices computed in that superstep
	 */
	public GiraphHadoopCounter getSuperstepVertexCount(long superstep) {
		GiraphHadoopCounter counter = superstepVertexCount.get(superstep);
		if (counter == null) {
			String counterPrefix = "Superstep: " + superstep + " ";
			counter = getCounter(counterPrefix);
			superstepVertexCount.put(superstep, counter);
		}
		return counter;
	}
 
	@Override
	public Iterator<GiraphHadoopCounter> iterator() {
		return superstepVertexCount.values().iterator();
	}
}
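The core of this class is a map of counters that are created lazily, one per superstep. The Hadoop-free sketch below shows that pattern in isolation. SuperstepCounterSketch is a hypothetical name, AtomicLong stands in for GiraphHadoopCounter, and the report format is an illustration rather than the actual Hadoop counter output. How the real counter is fed is covered in the earlier post and is not shown here; the sketch simply assumes the global computedVertexCount is added once per superstep.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical, Hadoop-free analogue of the per-superstep counter lookup. */
final class SuperstepCounterSketch {
  /** One counter per superstep, kept sorted by superstep number. */
  private final Map<Long, AtomicLong> counters = new TreeMap<>();

  /** Returns the counter for a superstep, creating it on first access. */
  AtomicLong getSuperstepVertexCount(long superstep) {
    return counters.computeIfAbsent(superstep, s -> new AtomicLong());
  }

  /** Renders one line per superstep, e.g. "Superstep: 0 = 9". */
  String report() {
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<Long, AtomicLong> entry : counters.entrySet()) {
      sb.append("Superstep: ").append(entry.getKey())
          .append(" = ").append(entry.getValue().get()).append('\n');
    }
    return sb.toString();
  }
}
```

Repeated lookups for the same superstep return the same counter object, so values added at different points of a superstep accumulate in one place.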
  5. Experimental results. After the program has run, the total number of vertices participating in each iteration is printed at the terminal. The test runs SSSP (the SimpleShortestPathsVertex class) on an input graph with 9 vertices and 12 edges. The output is as follows:

The test above runs for six iterations (supersteps 0 to 5). The red box in the output marks the number of vertices participating in each iteration, which is, in order, 9, 4, 4, 3, 4, 0.

Interpretation: In superstep 0 every vertex is active, so all 9 vertices take part in the computation. In superstep 5 no vertex takes part in the computation, so no messages are sent; every vertex is inactive, and the iteration terminates.
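How such a sequence of counts arises can be reproduced with a toy, single-process model of the superstep loop. This is only a sketch and not Giraph code: the class name, the adjacency-matrix input and the stopping rule are assumptions made for illustration, and the 9-vertex test graph above is not reproduced because its edges are not listed here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy single-process model of SSSP supersteps; not Giraph code. */
final class SuperstepCountSketch {
  /**
   * @param weights weights[u][v] > 0 means an edge u -> v with that weight
   * @param source  index of the source vertex
   * @return number of vertices whose compute step ran in each superstep
   */
  static List<Integer> run(double[][] weights, int source) {
    int n = weights.length;
    double[] value = new double[n];
    Arrays.fill(value, Double.MAX_VALUE);
    boolean[] halted = new boolean[n]; // all vertices start active
    Map<Integer, List<Double>> inbox = new HashMap<>();
    List<Integer> computedPerSuperstep = new ArrayList<>();

    while (true) {
      Map<Integer, List<Double>> outbox = new HashMap<>();
      int computed = 0;
      for (int v = 0; v < n; v++) {
        List<Double> messages = inbox.getOrDefault(v, new ArrayList<>());
        if (!messages.isEmpty()) {
          halted[v] = false; // an incoming message reactivates the vertex
        }
        if (halted[v]) {
          continue;
        }
        computed++; // corresponds to incrComputedVertexCount()
        double minDist = (v == source) ? 0d : Double.MAX_VALUE;
        for (double m : messages) {
          minDist = Math.min(minDist, m);
        }
        if (minDist < value[v]) {
          value[v] = minDist;
          for (int t = 0; t < n; t++) {
            if (weights[v][t] > 0) {
              outbox.computeIfAbsent(t, k -> new ArrayList<>())
                  .add(minDist + weights[v][t]);
            }
          }
        }
        halted[v] = true; // corresponds to voteToHalt()
      }
      computedPerSuperstep.add(computed);
      if (outbox.isEmpty()) {
        // Every vertex has voted to halt and no message is pending.
        return computedPerSuperstep;
      }
      inbox = outbox;
    }
  }
}
```

On a hypothetical 4-vertex chain 0 -> 1 -> 2 -> 3 with unit weights and source 0, this model computes 4, 1, 1, 1 vertices in supersteps 0 to 3. It stops as soon as no messages are pending, so unlike the Giraph run above it does not execute a trailing superstep with 0 computed vertices.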


Posted by sampledformat on Mon, 19 Aug 2019 20:53:20 -0700