NVIDIA RAPIDS cuGraph

Keywords: Docker, GitHub, Spark, pip

The RAPIDS cuGraph library is a collection of graph analytics algorithms that process data in GPU DataFrames (see cuDF). cuGraph is designed to provide NetworkX-like APIs that are familiar to data scientists, so they can build GPU-accelerated workflows more easily.

Official documentation:
rapidsai/cugraph
cuGraph API Reference


1 Installation and background

1.1 Installation

Conda installation (see https://github.com/rapidsai/cugraph):

# CUDA 10.0
conda install -c nvidia -c rapidsai -c numba -c conda-forge -c defaults cugraph cudatoolkit=10.0

# CUDA 10.1
conda install -c nvidia -c rapidsai -c numba -c conda-forge -c defaults cugraph cudatoolkit=10.1

# CUDA 10.2
conda install -c nvidia -c rapidsai -c numba -c conda-forge -c defaults cugraph cudatoolkit=10.2

For the Docker version, refer to the "prerequisites" section of https://rapids.ai/start.html:

docker pull rapidsai/rapidsai:cuda10.1-runtime-ubuntu16.04-py3.7
docker run --gpus all --rm -it -p 8888:8888 -p 8787:8787 -p 8786:8786 \
    rapidsai/rapidsai:cuda10.1-runtime-ubuntu16.04-py3.7

1.2 Background

cuGraph has taken another step toward integrating leading graph frameworks into an easy-to-use interface. A few months ago, RAPIDS received a copy of Hornet from the Georgia Institute of Technology and refactored and renamed it cuHornet. The name change indicates that the source code has diverged from the Georgia Tech baseline and reflects the alignment of the code API and data structures with RAPIDS cuGraph. The addition of cuHornet provides a boundary-based programming model, a dynamic data structure, and a list of existing analytics. Besides the core-number function, the first two cuHornet algorithms made available are Katz centrality and K-Cores.
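To make the K-Cores analytic mentioned above concrete, here is a minimal pure-Python peeling sketch of k-core decomposition on an undirected graph. This illustrates the algorithm itself, not cuHornet's dynamic-data-structure implementation; the graph and function name are illustrative.

```python
def k_core(adj, k):
    """Return the vertices of the k-core: the maximal subgraph in
    which every remaining vertex has degree >= k.
    adj maps vertex -> set of neighbors (undirected graph)."""
    alive = set(adj)
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            # Degree counted within the surviving subgraph only.
            if len(adj[v] & alive) < k:
                alive.remove(v)
                changed = True
    return alive

# Triangle 0-1-2 plus a pendant vertex 3 attached to 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(sorted(k_core(adj, 2)))  # [0, 1, 2] -- the pendant vertex drops out
```

Removing a vertex can push a neighbor below degree k, which is why the peeling loop repeats until no vertex changes.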

cuGraph is the graph analytics library of RAPIDS. For cuGraph, a multi-GPU PageRank algorithm was introduced, supported by two new primitives: a multi-GPU data converter from COO to CSR, and a function to compute vertex degrees. These primitives are used to convert the source and destination edge columns of a Dask DataFrame into graph format, enabling PageRank to scale across multiple GPUs.
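The COO-to-CSR conversion mentioned above can be sketched in pure Python. This is only an illustration of the idea, not cuGraph's multi-GPU implementation: COO stores one (src, dst) pair per edge, while CSR stores a row-offset array plus a column-index array.

```python
def coo_to_csr(src, dst, num_vertices):
    """Convert an edge list in COO format (parallel src/dst arrays)
    into CSR format (row offsets + column indices)."""
    # Count the out-degree of every vertex.
    degree = [0] * num_vertices
    for s in src:
        degree[s] += 1
    # An exclusive prefix sum of the degrees gives the row-offset array.
    offsets = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        offsets[v + 1] = offsets[v] + degree[v]
    # Scatter each destination vertex into its source row's next slot.
    indices = [0] * len(src)
    fill = offsets[:num_vertices]  # next free slot per row
    for s, d in zip(src, dst):
        indices[fill[s]] = d
        fill[s] += 1
    return offsets, indices

# Edges of a small directed graph: 0->1, 0->2, 1->2, 2->0
offsets, indices = coo_to_csr([0, 0, 1, 2], [1, 2, 2, 0], 3)
print(offsets)  # [0, 2, 3, 4]
print(indices)  # [1, 2, 2, 0]
```

The same count-then-prefix-sum-then-scatter pattern parallelizes well, which is what makes it a natural GPU primitive.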

The following figure shows the performance of the new multi-GPU PageRank algorithm. Unlike the previous PageRank benchmarks, these runtimes measure more than the PageRank solver alone: they include the Dask DataFrame-to-CSR conversion, PageRank execution, and conversion of the results from CSR back to a DataFrame. On average, the new multi-GPU PageRank analysis is more than 10 times faster than a 100-node Spark cluster.


Figure 1: time taken by cuGraph PageRank across different numbers of edges and NVIDIA Tesla V100 GPUs

The figure below looks only at the Bigdata dataset (50 million vertices and 1.98 billion edges) and runs the HiBench end-to-end test. The HiBench benchmark runtime includes reading the data, running PageRank, and then retrieving the scores of all vertices. HiBench was previously tested on Google GCP with 10, 20, 50, and 100 nodes respectively.


Figure 2: cuGraph PageRank vs Spark (lower is better) for the end-to-end PageRank run on the 50-million-vertex dataset

2 Simple demo

Reference: https://github.com/rapidsai/cugraph

import cudf
import cugraph

# assuming the edge data has been loaded into a cuDF DataFrame (using read_csv)
gdf = cudf.read_csv("graph_data.csv", names=["src", "dst"], dtype=["int32", "int32"])

# create a Graph from the source (src) and destination (dst) vertex pairs in the DataFrame
G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst')

# Call cugraph.pagerank to get the pagerank scores
gdf_page = cugraph.pagerank(G)

for i in range(len(gdf_page)):
	print("vertex " + str(gdf_page['vertex'][i]) + 
		" PageRank is " + str(gdf_page['pagerank'][i]))  

3 PageRank

cugraph.pagerank(G,alpha=0.85, max_iter=100, tol=1.0e-5)

  • G: cugraph.Graph object
  • alpha: float, the damping factor, i.e. the probability of following an outgoing edge. Default is 0.85
  • max_iter: int, the maximum number of iterations before an answer is returned. This can be used to limit the execution time and exit early before the solver reaches the convergence tolerance. If this value is less than or equal to 0, cuGraph uses the default value, which is 100
  • tol: float, the tolerance of the approximation; this parameter should be a small value. The lower the tolerance, the better the approximation. If this value is 0.0f, cuGraph uses the default value, which is 0.00001. Setting the tolerance too small can lead to non-convergence due to numerical round-off. Values between 0.01 and 0.00001 are usually acceptable.

Returns:

  • df: a cudf.DataFrame object with two columns:
    • df['vertex']: The vertex identifier for the vertex
    • df['pagerank']: The pagerank score for the vertex
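To make the alpha, max_iter, and tol parameters concrete, here is a minimal single-process power-iteration PageRank in plain Python. It is a sketch of the algorithm that cugraph.pagerank implements on the GPU, not cuGraph's actual solver; the toy graph is illustrative.

```python
def pagerank(adj, alpha=0.85, max_iter=100, tol=1.0e-5):
    """adj maps each vertex to its list of outgoing neighbors."""
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(max_iter):
        # Every vertex keeps a (1 - alpha) share of uniform rank.
        new = {v: (1.0 - alpha) / n for v in adj}
        for v, nbrs in adj.items():
            if nbrs:
                # Follow an outgoing edge with probability alpha.
                share = alpha * rank[v] / len(nbrs)
                for u in nbrs:
                    new[u] += share
            else:
                # Dangling vertex: spread its rank uniformly.
                for u in adj:
                    new[u] += alpha * rank[v] / n
        # Stop once the L1 change between iterations is below tol.
        if sum(abs(new[v] - rank[v]) for v in adj) < tol:
            return new
        rank = new
    return rank

scores = pagerank({0: [1, 2], 1: [2], 2: [0]})
print(max(scores, key=scores.get))  # vertex 2 collects the most rank
```

A looser tol (or a small max_iter) trades accuracy for runtime, which is exactly the early-exit behavior the parameter list above describes.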

Installation:

# The notebook compares cuGraph to NetworkX,
# so some additional non-RAPIDS Python libraries need to be installed.
# Run this cell if you need them.
!pip install networkx
!pip install scipy

Code module:

# Import needed libraries
import cugraph
import cudf
from collections import OrderedDict


# NetworkX libraries
import networkx as nx
from scipy.io import mmread

# Related parameters

# define the parameters 
max_iter = 100  # The maximum number of iterations
tol = 0.00001   # tolerance
alpha = 0.85    # alpha
# Define the path to the test data  
datafile='../data/karate-data.csv'

# NetworkX
# Read the data; this also creates a NetworkX Graph
file = open(datafile, 'rb')
Gnx = nx.read_edgelist(file)

pr_nx = nx.pagerank(Gnx, alpha=alpha, max_iter=max_iter, tol=tol)

cuGraph model:

# cuGraph

# Read the data  
gdf = cudf.read_csv(datafile, names=["src", "dst"], delimiter='\t', dtype=["int32", "int32"] )

# create a Graph using the source (src) and destination (dst) vertex pairs from the Dataframe 
G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst')


# Call cugraph.pagerank to get the pagerank scores
gdf_page = cugraph.pagerank(G)


# Find the most important vertex using the scores
# This method should only be used for small graphs
bestScore = gdf_page['pagerank'][0]
bestVert = gdf_page['vertex'][0]

for i in range(len(gdf_page)):
    if gdf_page['pagerank'][i] > bestScore:
        bestScore = gdf_page['pagerank'][i]
        bestVert = gdf_page['vertex'][i]
        
print("Best vertex is " + str(bestVert) + " with score of " + str(bestScore))

# A better way to do that would be to find the max and then use that value in a query
pr_max = gdf_page['pagerank'].max()


def print_pagerank_threshold(_df, t=0):
    filtered = _df.query('pagerank >= @t')
    
    for i in range(len(filtered)):
        print("Best vertex is " + str(filtered['vertex'][i]) + 
            " with score of " + str(filtered['pagerank'][i]))              


print_pagerank_threshold(gdf_page, pr_max)
sort_pr = gdf_page.sort_values('pagerank', ascending=False)
d = G.degrees()
d.sort_values('out_degree', ascending=False).head(4)
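Since the point of the notebook is to compare cuGraph to NetworkX, once both PageRank runs finish the two score sets can be checked for agreement. A minimal sketch of such a comparison, using made-up score dictionaries rather than the actual karate-club results (note that a real comparison would first need to normalize vertex keys, since NetworkX reads them as strings here while cuGraph uses int32):

```python
def scores_match(a, b, atol=1e-4):
    """True if both score maps cover the same vertices and agree
    to within an absolute tolerance."""
    if a.keys() != b.keys():
        return False
    return all(abs(a[v] - b[v]) <= atol for v in a)

# Illustrative values only, not the real karate-club PageRank scores.
nx_scores = {"1": 0.0970, "2": 0.0529, "3": 0.0571}
cu_scores = {"1": 0.0970, "2": 0.0529, "3": 0.0571}
print(scores_match(nx_scores, cu_scores))  # True
```

An absolute tolerance on the order of the solver's tol is a reasonable default, since both implementations stop iterating once their updates fall below it.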

Posted by florida_guy99 on Tue, 25 Feb 2020 06:53:28 -0800