Constructing a Protein Network in R and Implementing the GN Algorithm

Keywords: R Language network Excel xml

1. Construction of Protein Network

We used HIV-human PPI.csv, a data set of human HIV-related protein interactions, to construct this protein-protein interaction (PPI) network.

In R, we can read data from files stored outside the R environment. Data can also be written to files stored and accessed by the operating system. R can read and write various file formats, such as csv, excel, xml, etc.

Reading a CSV file takes two steps:

  • Set the working directory
  • Read the CSV file

The code is as follows:

setwd("/Users/.../Documents/...")  
data <- read.csv("HIV-human PPI.csv")  

This reads the protein interaction data and stores it in the data frame named data.
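
As a small self-contained check of the reading step, the sketch below writes a tiny two-column table to a temporary file and reads it back with read.csv. The column names and protein identifiers are invented for illustration; the columns of the real data set may differ.

```r
# Hypothetical two-column interaction table (names invented for illustration).
toy <- data.frame(
  hiv_protein   = c("tat", "tat", "env"),
  human_protein = c("CDK9", "CCNT1", "CD4")
)

# Write it to a temporary CSV file, then read it back as in the text.
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)
data_toy <- read.csv(path, stringsAsFactors = FALSE)

str(data_toy)    # structure of the data frame that was read
head(data_toy)   # first rows, to confirm the columns were parsed as expected
```

Calling str() and head() right after reading is a cheap way to confirm that the first two columns really hold the two interacting proteins before building the graph.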

Next, we use the igraph package to build the network. (Because the data contain only two columns, with each row giving the two vertices joined by an edge, I did not build a separate data frame of vertices.)

library(igraph)
edges <- data.frame(from = data[, 1], to = data[, 2])
g <- graph_from_data_frame(edges, directed = FALSE)

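The graph.data.frame function (named graph_from_data_frame in current igraph releases) takes three arguments: d, the data frame of edges; directed, whether the graph is directed; and vertices, an optional data frame of vertex attributes. Its signature is:

graph_from_data_frame(d, directed = TRUE, vertices = NULL)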

Now we have created the graph g. To see what it looks like, simply call plot(g).
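
Before plotting a large network, it can help to check that the graph was built as intended. The following is a minimal sketch on a made-up edge list (the protein names are placeholders, not taken from the data set), assuming the igraph package is installed.

```r
library(igraph)

# Toy edge list standing in for the first two columns of the real data.
edges_toy <- data.frame(
  from = c("tat", "tat", "env", "env"),
  to   = c("CDK9", "CCNT1", "CD4", "CCR5")
)

# Vertices are inferred from the edge list because `vertices` is not given.
g_toy <- graph_from_data_frame(edges_toy, directed = FALSE)

vcount(g_toy)        # number of vertices
ecount(g_toy)        # number of edges
V(g_toy)$name        # vertex names taken from the two columns
is_directed(g_toy)   # whether the graph is directed
```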

2. Module Discovery Method of Biological Network

In many complex networks, dividing the network into modules (or communities) is very informative. Module discovery, also called community detection, has five main models.

Community structure has a characteristic signature: connections inside a community are dense, while connections between communities are comparatively sparse.

Community model | Concept | Effect
--- | --- | ---
Point connection | A vertex belongs to the community it is connected to. | In the worst case it often produces one very large community.
Random walk | Communities are built by hierarchical clustering that merges vertices according to a random-walk distance similarity. | Runs quickly, but the results are not especially good; some communities come out very large.
Spin glass | The network is treated as a random field, and hierarchical clustering is driven by an energy function. | Slow; suited to more complicated situations.
Betweenness centrality | The weakest links, found through betweenness centrality, are deleted so that the network splits into separate communities. | Time-consuming; parameter settings matter a great deal.
Label propagation | Each vertex takes its label from its neighbours; vertices with the same label form a community. | Can be combined with eigenvectors and applied to topic classification.

The Girvan-Newman (GN) algorithm follows the betweenness centrality model. The remaining models are not described in detail here; for more on them, see the article "R Language SNA - Social Relations Network - igraph Package (Community Detection, Plotting)".

Next, we introduce the basic idea of the GN algorithm:

1. Calculate the edge betweenness centrality of every edge in the network;
2. Remove the edge with the highest betweenness centrality;
3. Recalculate the betweenness centrality of all remaining edges;
4. Go back to step 2 and repeat until no edges remain in the network.
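
To make a single pass through these steps concrete, here is a small sketch on a toy graph: two triangles joined by one bridge edge. The graph and variable names are illustrative assumptions, not part of the protein data.

```r
library(igraph)

# Toy graph: triangles 1-2-3 and 4-5-6, joined by the bridge 3-4.
g_toy <- make_graph(c(1,2, 1,3, 2,3, 3,4, 4,5, 4,6, 5,6), directed = FALSE)

eb <- edge_betweenness(g_toy)                  # step 1: betweenness of every edge
data.frame(ends(g_toy, E(g_toy)), betweenness = eb)

top <- which.max(eb)                           # step 2: edge with the highest value
g_cut <- delete_edges(g_toy, top)

components(g_cut)$no                           # connected components after removal
edge_betweenness(g_cut)                        # step 3: recomputed values
```

Every shortest path between the two triangles must cross the bridge, so it carries the highest betweenness and is removed first, which splits the graph into its two triangles.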

As you can see, the idea of the algorithm is very simple. But at which point should it stop so that the resulting community partition is the best one? In their 2004 paper, Newman and Girvan proposed the concept of modularity Q (global modularity), a score for the quality of a partition, which improved the algorithm: the partition with the highest Q is chosen. As a rule of thumb, networks with clear community structure have Q between about 0.3 and 0.7, but each case needs to be judged on its own.
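
For an unweighted, undirected graph, Q can be written as the sum over communities of (edges inside the community / m) minus (total degree of the community / 2m) squared, where m is the number of edges. The sketch below computes this by hand for a toy graph and an assumed partition, then compares it with igraph's modularity(); the graph and the membership vector are illustrative.

```r
library(igraph)

g_toy <- make_graph(c(1,2, 1,3, 2,3, 3,4, 4,5, 4,6, 5,6), directed = FALSE)
memb <- c(1, 1, 1, 2, 2, 2)          # assumed partition: one community per triangle

m   <- ecount(g_toy)                  # total number of edges
el  <- ends(g_toy, E(g_toy))          # two-column matrix of edge endpoints
deg <- degree(g_toy)

q_manual <- sum(sapply(unique(memb), function(comm) {
  inside <- sum(memb[el[, 1]] == comm & memb[el[, 2]] == comm)  # edges within comm
  d_comm <- sum(deg[memb == comm])                               # total degree of comm
  inside / m - (d_comm / (2 * m))^2
}))

q_manual
modularity(g_toy, memb)               # igraph's value, for comparison
```

For this partition each triangle contributes 3/7 - (7/14)^2, so Q = 5/14, which falls inside the 0.3 to 0.7 range mentioned above.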

3. Module Discovery Method and Graphic Display

There are many algorithms for module partitioning, and a number of them are already implemented in igraph. After library("igraph"), we can call these functions to partition the network g into modules.

Algorithm | Author | Year | Complexity
--- | --- | --- | ---
GN | Newman & Girvan | 2004 |
CFinder | | 2005 |
Random Walk Method | Pons & Latapy | 2005 |
Spin Glass Community Discovery | Reichardt & Bornholdt | 2006 |
LPA (Label Propagation Algorithm) | Raghavan et al. | 2007 | O(m)
Fast Unfolding | Vincent D. Blondel | 2008 |
LFM | | 2009 | O(n^2)
EAGLE | | 2009 | O(s*n^2)
GIS | | 2009 | O(n^2)
HANP (Hop Attenuation & Node Preferences) | Lan X.Y. & Leung | 2009 | O(m)
GCE | | 2010 | O(mh)
COPRA | | 2010 |
NMF | | 2010 |
Link | | 2010 |
SLPA/GANXiS (Speaker-listener Label Propagation) | Jierui Xie | 2011 |
BMLPA (Balanced Multi-label Propagation) | Wu Zhihao (Beijing Jiaotong University) | 2012 | O(n*logn)

1) Module discovery based on point connection: cluster_fast_greedy method discovers modules by directly optimizing modularity.

cluster_fast_greedy(graph, merges = TRUE, modularity = TRUE,
membership = TRUE, weights = E(graph)$weight)

graph: The graph to partition into modules.
merges: Whether to return the merge matrix.
modularity: Whether to return the modularity after each merge as a vector.
membership: Whether to compute the membership vector that corresponds to the maximum modularity over all merges.
weights: If not NULL, a vector of edge weights.
Return value: a communities object.

An example:

cfg <- cluster_fast_greedy(g)
plot(cfg, g)  
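
Besides plotting, the returned communities object can be inspected directly. A minimal sketch on a toy graph of two triangles joined by one edge (the graph is an illustrative assumption, not the protein network):

```r
library(igraph)

g_toy <- make_graph(c(1,2, 1,3, 2,3, 3,4, 4,5, 4,6, 5,6), directed = FALSE)
cfg_toy <- cluster_fast_greedy(g_toy)

length(cfg_toy)        # number of communities found
sizes(cfg_toy)         # number of vertices in each community
membership(cfg_toy)    # community id of each vertex
modularity(cfg_toy)    # modularity of the selected partition
```

The same accessor functions work on the result of the other community detection functions in this section, since they all return communities objects.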

2) GN algorithm: edge.betweenness.community. This method uses edge betweenness centrality to find the edges that most weakly tie groups of vertices together, deletes them, and splits the network level by level into smaller and smaller modules. Stopping the process at a suitable point gives a suitable module partition.

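member <- edge.betweenness.community(g.undir, weights = E(g.undir)$weight, directed = FALSE)

Here g.undir denotes an undirected graph. If weights is not given, the weight edge attribute is used when the graph has one. The directed argument matters only for directed graphs, where directed=TRUE (the default) computes directed edge betweenness; for undirected graphs it is ignored.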

The code that calls this method and graphically displays and saves it is as follows:

##
#• Community structure in social and biological networks
# M. Girvan and M. E. J. Newman
#• New to version 0.6: FALSE
#• Directed edges: TRUE
#• Weighted edges: TRUE
#• Handles multiple components: TRUE
#Runtime: |V||E|^2 ~Sparse: O(N^3)
##
ec <- edge.betweenness.community(g)
V(g)$size = 1  # Most vertices get size 1
V(g)[degree(g)>=300]$size = 5 # Vertices with a degree of at least 300 are drawn larger
png('/Users/.../Documents/.../protein.png',width=1800,height=1800) # Open a PNG device; width and height are in pixels
plot(ec,g)
dev.off() # Close the graphics device, which writes the file
print(modularity(ec))

In this way, the picture is saved as protein.png and the modularity is output.

3) Random walk model: walktrap.community

##
#• Computing communities in large networks using random walks
# Pascal Pons, Matthieu Latapy
#• New to version 0.6: FALSE
#• Directed edges: FALSE
#• Weighted edges: TRUE
#• Handles multiple components: FALSE
#• Runtime: |E||V|^2
##
system.time(wc <- walktrap.community(g))
print(modularity(wc))
#membership(wc)
plot(wc , g)  

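4) Newman's leading eigenvector method: leading.eigenvector.community

For background, Newman's fast algorithm starts with every node as its own community. Each iteration merges the two communities whose merger gives the largest Q, until the whole network has merged into a single community. The whole process can be drawn as a dendrogram, and the final community structure is the level of that hierarchy with the largest Q. The overall time complexity of the algorithm is O(m(m+n)). Note that this agglomerative procedure is the one behind fastgreedy.community (item 5); leading.eigenvector.community, called below, implements Newman's 2006 method, which instead splits the network using the leading eigenvector of the modularity matrix.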

##
#• Finding community structure in networks using the eigenvectors of matrices
# MEJ Newman
# Phys Rev E 74:036104 (2006)
#• New to version 0.6: FALSE
#• Directed edges: FALSE
#• Weighted edges: FALSE
#• Handles multiple components: TRUE
#• Runtime: c|V|^2 + |E| ~N(N^2)
##
system.time(lec <-leading.eigenvector.community(g))
print(modularity(lec))
plot(lec,g)  

5) Greedy modularity optimization: fastgreedy.community

##
#• Finding community structure in very large networks
# Aaron Clauset, M. E. J. Newman, Cristopher Moore
#• Finding Community Structure in Mega-scale Social Networks
# Ken Wakita, Toshiyuki Tsurumi
#• New to version 0.6: FALSE
#• Directed edges: FALSE
#• Weighted edges: TRUE
#• Handles multiple components: TRUE
#• Runtime: |V||E| log |V|
##
system.time(fc <- fastgreedy.community(g))
print(modularity(fc))
plot(fc, g)  

6) Fast unfolding algorithm: multilevel.community

##
#• Fast unfolding of communities in large networks
# Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, Etienne Lefebvre
#• New to version 0.6: TRUE
#• Directed edges: FALSE
#• Weighted edges: TRUE
#• Handles multiple components: TRUE
# Runtime: "linear" when |V| \approx |E| ~ sparse; (a quick glance at the algorithm \
# suggests this would be quadratic for fully-connected graphs)
system.time(mc <- multilevel.community(g, weights=NA))
print(modularity(mc))
plot(mc, g)  

7) Label propagation algorithm: label.propagation.community

##
#• Near linear time algorithm to detect community structures in large-scale networks.
# Raghavan, U.N. and Albert, R. and Kumara, S.
# Phys Rev E 76, 036106. (2007)
#• New to version 0.6: TRUE
#• Directed edges: FALSE
#• Weighted edges: TRUE
#• Handles multiple components: FALSE
# Runtime: |V| + |E|
system.time(lc <- label.propagation.community(g))
print(modularity(lc))
plot(lc , g)  

8) Spinglass community discovery: spinglass.community

member <- spinglass.community(g.undir, weights = E(g.undir)$weight, spins = 2)
# weights is set explicitly here (if omitted, the weight edge attribute is used when present); spins is the upper limit on the number of communities
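
The methods above can also be compared side by side. The sketch below does so on Zachary's karate club network, which ships with igraph, instead of the protein network; it uses the current function names (cluster_edge_betweenness, cluster_walktrap, cluster_leading_eigen, cluster_fast_greedy) for the older dotted names used in this article. Label propagation and spinglass are left out because they involve randomness, so their results can differ between runs.

```r
library(igraph)

g_karate <- make_graph("Zachary")   # Zachary's karate club network, bundled with igraph

methods <- list(
  edge_betweenness = cluster_edge_betweenness,
  walktrap         = cluster_walktrap,
  leading_eigen    = cluster_leading_eigen,
  fast_greedy      = cluster_fast_greedy
)

# Apply each method and collect its modularity and number of communities.
result <- t(sapply(methods, function(f) {
  comm <- f(g_karate)
  c(modularity = modularity(comm), communities = length(comm))
}))
result
```

On a real data set the same loop can be pointed at g, although edge betweenness may take a long time on large networks.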

9) To understand the GN algorithm better, it is of course worth trying to implement it ourselves.
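
One possible starting point is sketched below. It is a simplified version for unweighted, undirected graphs, not igraph's own implementation: it removes the highest-betweenness edge repeatedly, treats the connected components as communities, and keeps the partition with the highest modularity. The function name gn_sketch and the toy graph are assumptions made for illustration.

```r
library(igraph)

# Simplified GN sketch: unweighted, undirected graphs only.
gn_sketch <- function(graph) {
  work <- graph
  best_memb <- rep(1, vcount(graph))          # start: everything in one community
  best_q <- modularity(graph, best_memb)
  while (ecount(work) > 0) {
    # Remove the edge with the highest betweenness in the current graph.
    work <- delete_edges(work, which.max(edge_betweenness(work)))
    memb <- components(work)$membership       # components act as communities
    q <- modularity(graph, memb)              # Q is measured on the original graph
    if (q > best_q) {
      best_q <- q
      best_memb <- memb
    }
  }
  list(membership = best_memb, modularity = best_q)
}

g_toy <- make_graph(c(1,2, 1,3, 2,3, 3,4, 4,5, 4,6, 5,6), directed = FALSE)
res <- gn_sketch(g_toy)
res$membership
res$modularity

# Compare with igraph's built-in implementation.
ceb <- cluster_edge_betweenness(g_toy)
modularity(ceb)
```

Recomputing the betweenness of every edge after each removal is what makes the algorithm slow, so this sketch is only practical for small graphs.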

4. Appendix: Common functions in igraph

1) plot drawing function

plot(g, layout = layout.fruchterman.reingold, vertex.size = V(g)$size+2,vertex.color=V(g)$color,vertex.label=V(g)$label,vertex.label.cex=1,edge.color = grey(0.5), edge.arrow.mode = "-",edge.arrow.size=5)

layout: sets the layout of the graph. The available layout functions are:

layout,layout.auto,layout.bipartite,layout.circle,layout.drl,layout.fruchterman.reingold,layout.fruchterman.reingold.grid,layout.graphopt,layout.grid,layout.grid.3d,layout.kamada.kawai,layout.lgl,layout.mds,layout.merge,layout.norm,layout.random,layout.reingold.tilford,layout.sphere,layout.spring,layout.star,layout.sugiyama,layout.svd

vertex.size: sets the size of the nodes

de<-read.csv("c:/degree-info.csv",header=F)
V(g)$deg<-de[,2]
V(g)$size=2
V(g)[deg>=1]$size=4
V(g)[deg>=2]$size=6
V(g)[deg>=3]$size=8
V(g)[deg>=4]$size=10
V(g)[deg>=5]$size=12
V(g)[deg>=6]$size=14
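
The seven assignments above map degree 0 to size 2 and add 2 for each further degree step up to 6. Assuming the degrees are non-negative whole numbers, the same mapping can be written in one vectorized expression; the degree values below are made up for illustration.

```r
# Hypothetical degree values standing in for de[, 2] from the text.
deg <- c(0, 1, 2, 3, 4, 5, 6, 10)

# Size 2 for degree 0, plus 2 per degree step, capped at 14 for degree >= 6.
size <- 2 + 2 * pmin(deg, 6)
size
```

The result could then be assigned in one step with V(g)$size <- size.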

vertex.color: sets the color of the nodes

color<-read.csv("c:/color.csv",header=F)
col<-c("red","skyblue")
V(g)$color=col[color[,1]]

vertex.label: sets the labels of the nodes

V(g)$label=V(g)$name
vertex.label=V(g)$label

vertex.label.cex: sets the size of the node labels
edge.color: sets the color of the edges

E(g)$color="grey"
for(i in 1:length(pa3[,1])){
E(g,path=pa3[i,])$color="red"
}
edge.color=E(g)$color

edge.arrow.mode: sets how the arrows on edges are drawn
edge.arrow.size: sets the size of the arrows
E(g)$width=1: sets the width of the edges

2) Cluster analysis

Edge betweenness clustering

system.time(ec <- edge.betweenness.community(g))  
print(modularity(ec))  
plot(ec, g,vertex.size=5,vertex.label=NA)  

Random walk

system.time(wc <- walktrap.community(g))
print(modularity(wc))
#membership(wc)
plot(wc , g,vertex.size=5,vertex.label=NA)  

Leading eigenvector (in my understanding, similar to spectral clustering)

system.time(lec <-leading.eigenvector.community(g))
print(modularity(lec))
plot(lec,g,vertex.size=5,vertex.label=NA)  

Greedy strategy

system.time(fc <- fastgreedy.community(g))
print(modularity(fc))
plot(fc, g,vertex.size=5,vertex.label=NA)  

Multi-level Clustering

system.time(mc <- multilevel.community(g, weights=NA))
print(modularity(mc))
plot(mc, g,vertex.size=5,vertex.label=NA)  

Label propagation

system.time(lc <- label.propagation.community(g))
print(modularity(lc))
plot(lc , g,vertex.size=5,vertex.label=NA)  

File output

zz<-file("d:/test.txt","w")
cat(x,file=zz,sep="\n")
close(zz)  

View variable data types and lengths

mode(x)
length(x)

Posted by Stonewall on Sun, 22 Sep 2019 07:42:01 -0700