In graph/network theory, an adjacency matrix is a square matrix used to represent a finite graph: its elements indicate whether pairs of vertices/nodes are adjacent (connected). For a simple finite graph it is Boolean, with 1/0 marking vertex pairs that are/are not connected by an edge.
A distance (or weight) matrix has the same size, but its elements indicate the distance between vertices rather than bare connectivity (more relevant to our data? i.e. 0-1 correlations = distance/weight?). A hierarchical clustering tree decomposes into such a distance matrix.
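A minimal sketch of that last idea, using only base R: turn pairwise correlations into a distance matrix and build a hierarchical clustering tree from it (`X` is a toy data matrix invented here; its columns are the items to cluster).

```r
## Correlation -> distance -> hierarchical clustering tree (toy data).
set.seed(42)
X <- matrix(rnorm(100 * 6), ncol = 6)   # 100 observations of 6 items
D <- as.dist(1 - cor(X))                # correlation in [-1, 1] -> distance
tree <- hclust(D, method = "average")
cutree(tree, k = 2)                     # cutting the tree decomposes it into groups
```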
One network feature that has been emphasized in recent work is community structure, the gathering of vertices into groups such that there is a higher density of edges within groups than between them
First described in Clauset et al. 2004, where they present a new algorithm for inferring community structure from network topology that works by greedily optimizing the modularity.
Modularity [21] is a property of a network together with a specific proposed division of that network into communities. Girvan and Newman [20, 21] proposed a divisive algorithm that uses edge betweenness as a metric to identify the boundaries of such communities, but it is computationally expensive. Modularity measures whether the division is a good one, in the sense that there are many edges within communities and only a few between them.
The entire process can be represented as a tree whose leaves are the vertices of the original network and whose internal nodes correspond to the joins. This dendrogram represents a hierarchical decomposition of the network into communities at all levels.
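The join sequence can be inspected directly in igraph: a `communities` object produced by the greedy algorithm converts to a standard R dendrogram. A small sketch on a toy graph (two triangles joined by one edge, chosen here for illustration):

```r
library(igraph)

## The greedy joins form a dendrogram whose leaves are the vertices.
g  <- make_full_graph(3) %du% make_full_graph(3)
g  <- add_edges(g, c(1, 4))          # bridge between the two triangles
fc <- cluster_fast_greedy(g)
h  <- as.hclust(fc)                  # the merge tree as a standard hclust object
cutree(h, k = 2)                     # cutting at two groups recovers the triangles
```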
The existence of hierarchy can simultaneously explain and quantitatively reproduce many commonly observed topological properties of networks, such as right-skewed degree distributions, high clustering coefficients and short path lengths.
Hierarchical structure goes beyond simple clustering, however, by explicitly including organization at all scales in a network simultaneously.
Sales-Pardo et al. 2007 Extracting the hierarchical organization of complex systems
A central idea in biology is that life processes are hierarchically organized (2–4). Additionally, it seems plausible that this hierarchical structure plays an important role in the system’s dynamics.
Complex networks are convenient representations of the interactions within complex systems. Here, we focus on the identification of inclusion hierarchies in complex networks, that is, the unraveling of the nested organization of the nodes in a network into modules, which in turn are composed of submodules, and so on.
A method must identify the different levels in the hierarchy as well as the number of modules and their composition at each level. Methods whose output is subject to interpretation are excluded, e.g. a method that organizes nodes into a tree structure but leaves it to the researcher to find a "sensible" criterion for which levels of that tree are relevant. It is also unacceptable for a method to yield a tree even for networks with no internal structure.
The paper extends the definition of community structure discussed above to internal structure within each higher-level module: a module has submodules if there are subgroups of nodes within it that are more interconnected among themselves than with other nodes in the same module.
Our method comprises two major steps (Fig. 1):
Node affinity: "topological overlap" (11, 16, 17) is defined as the ratio between the number of common neighbors of the two nodes and the minimum degree of the two nodes. In defining affinity between a pair of nodes, one wants nodes that are close in the hierarchy to have a large affinity and to be classified in the same module in partitions with large modularity. We formalize this requirement by defining the affinity of a pair of nodes as the probability that the two nodes are classified in the same module for partitions that are local maxima of the modularity landscape.
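The topological-overlap ratio is easy to compute directly; a sketch (toy graph and the helper name `topological_overlap` are ours, not from the paper):

```r
library(igraph)

## Topological overlap of two nodes: common neighbours divided by the
## smaller of the two degrees.
topological_overlap <- function(g, i, j) {
  common <- length(intersect(as_ids(neighbors(g, i)), as_ids(neighbors(g, j))))
  common / min(degree(g, i), degree(g, j))
}

g <- make_graph(~ a-b, a-c, b-c, b-d, c-d)
topological_overlap(g, "b", "c")   # b and c share neighbours a and d: 2 / min(3, 3)
```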
For each possible partition of the network into modules \(P\), modularity \(M\) can be calculated.
\[M(P)=\sum_{i=1}^m\left[\frac{l_i}{L}-\left(\frac{d_i}{2L}\right)^2\right]\]
where \(L\) is the total number of links in the network, \(l_i\) is the number of links within module \(i\), \(d_i\) is the sum of degrees of all of the nodes inside module \(i\), and the sum is over all of the \(m\) modules in partition \(P\) (Fig. 1A). The modularity of a partition is high when the number of intramodule links is much larger than expected for a random partition.
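As a quick check of the formula, a hand computation on a toy path graph against igraph's built-in `modularity()` (the graph and partition are invented for illustration):

```r
library(igraph)

## M(P) by hand for a path a-b-c-d with the partition {a,b} | {c,d}.
g    <- make_graph(~ a-b, b-c, c-d)
memb <- c(1, 1, 2, 2)                 # module of each vertex, in vertex order
L    <- ecount(g)                     # total number of links (L = 3)

M <- sum(sapply(unique(memb), function(m) {
  v  <- which(memb == m)
  li <- ecount(induced_subgraph(g, v))   # links inside module m
  di <- sum(degree(g)[v])                # summed degree of module m
  li / L - (di / (2 * L))^2
}))

M                       # 1/6: the hand formula...
modularity(g, memb)     # ...matches igraph's modularity()
```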
We propose that the affinity \(A_{ij}\) of nodes \(i\) and \(j\) is determined by those partitions that are local maxima in the landscape and by their basins of attraction.
\(P_{max}\) = the set of partitions for which the modularity \(M\) is a local maximum, that is, partitions for which neither the change of a single node from one module to another nor the merging of two modules yields a higher modularity.
A first guess for \(A_{ij}\): consider all partitions \(P\in{P_{max}}\) and find the fraction in which \((i, j)\) are placed in the same module.
However, such a procedure would not take into consideration the size of the basins of attraction of the different maxima; see below.
We would start by grouping the nodes into a randomly chosen partition; let us say, partition 13.
There are two single-node changes that increase the modularity. Node \(b\) can be placed in the same group as node \(d\); this is partition 15, which is a local maximum. Instead, node \(b\) can be placed in the same group as nodes \(a\) and \(c\); this is partition 14. Partition 14 is not a modularity maximum, so one would continue the random ascent of the modularity landscape. From partition 14, one could move to partition 1 or to partition 15, both local maxima.
So from partition 13, one has a 25% chance of ending in partition 1 and a 75% chance of ending in partition 15. If one repeats this calculation for every possible starting partition, one obtains the size of the basin of attraction of the two local modularity maxima.
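The random-ascent procedure above can be sketched on a toy graph (two triangles joined by one edge, invented here). This simplified version considers only single-node moves, whereas the paper also allows module merges:

```r
library(igraph)
set.seed(1)

## Estimate basin sizes by repeated stochastic ascents of the modularity
## landscape (single-node moves only -- a simplification of the paper).
g <- make_full_graph(3) %du% make_full_graph(3)
g <- add_edges(g, c(1, 4))
n <- vcount(g)

ascend <- function(memb) {
  repeat {
    m0    <- modularity(g, memb)
    moves <- expand.grid(v = seq_len(n), grp = unique(memb))
    moves <- moves[sample(nrow(moves)), ]          # try moves in random order
    moved <- FALSE
    for (k in seq_len(nrow(moves))) {
      trial <- memb
      trial[moves$v[k]] <- moves$grp[k]
      if (modularity(g, trial) > m0 + 1e-12) {
        memb <- trial; moved <- TRUE; break
      }
    }
    if (!moved) return(match(memb, unique(memb)))  # local maximum, canonical labels
  }
}

## Ascend from many random partitions; each final partition's frequency
## estimates the size of its basin of attraction.
finals <- replicate(200, paste(ascend(sample(1:3, n, replace = TRUE)), collapse = ""))
table(finals) / length(finals)
```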
\[b(\tilde{P})=\sum_{P\in{\mathcal{P}}} \frac{b(P, \tilde{P})}{\Vert\mathcal{P}\Vert}\]
where \(b(P, \tilde{P})\) is the probability that an ascent of the modularity landscape started from partition \(P\) ends at the local maximum \(\tilde{P}\), and \(\Vert\mathcal{P}\Vert\) is the total number of partitions.
We propose that the affinity \(A_{ij}\) of a pair of nodes \((i, j)\) is the probability that, when local maxima partitions \(\tilde{P}\in{P_{max}}\) are sampled with probabilities \(b(\tilde{P})\), nodes \((i, j)\) are classified in the same module.
Note that, in contrast to other affinity measures proposed in refs. 9, 15, and 18, the measure we propose does not necessarily coincide with the ‘‘optimal’’ division of nodes into modules, that is, the partition that maximizes \(M\). In fact, the modules at the top level of the hierarchy do not necessarily correspond to the best partition found for the global network, even for relatively simple networks (Fig. 2C).
Hierarchical clustering methods have major drawbacks:
They are only accurate at a local level: at every step a pair of units merges, and some details of the affinity matrix are averaged with an inevitable loss of information.
There is no statistically sound general criterion to determine the relevant levels of the hierarchy.
The method proposed here, in contrast, iteratively identifies in an unsupervised manner the modules at each level of the hierarchy.
Compare against an ensemble of "equivalent" networks with no internal organization. These equivalent networks must have the same number of nodes and an identical degree sequence. A standard method for generating such networks is the Markov-chain switching algorithm (27, 28). Despite having no internal structure, these randomized networks have numerous local modularity maxima (19).
if the network has a nonrandom internal structure, then local maxima in the original landscape should have significantly larger modularities than local maxima in the landscapes of the randomized networks.
Specifically, for a given network:
compute the average modularity \(M_{av}\) of the sampled local maxima partitions;
compute the same quantity \(M_{av}^i\) for each network \(i\) in the equivalent random ensemble. By virtue of the central limit theorem, the set of average modularities for the whole ensemble \(\{M_{av}^i\}\) is normally distributed with mean \(M_{rand}\) and variance \(\sigma_{M_{rand}}^2\);
to quantify the level of organization of the network, compute the z-score of the average modularity:
\[z = \frac{M_{av} - M_{rand}}{\sigma_{M_{rand}}}\]
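A sketch of this comparison in igraph, using degree-preserving rewiring for the random ensemble. As a simplification, the greedy modularity from `cluster_fast_greedy()` stands in for the paper's average over sampled local maxima (toy graph invented here):

```r
library(igraph)
set.seed(2)

## z-score sketch: the network's modularity vs. a degree-preserving
## random ensemble.
g <- sample_gnp(15, p = 1/2) %du% sample_gnp(15, p = 1/2)
g <- add_edges(g, c(1, 16))                  # two dense blocks, one bridge

M_av  <- modularity(cluster_fast_greedy(g))  # stand-in for M_av
M_ens <- replicate(50, {
  r <- rewire(g, keeping_degseq(niter = 10 * ecount(g)))  # Markov-chain switching
  modularity(cluster_fast_greedy(r))
})

z <- (M_av - mean(M_ens)) / sd(M_ens)
z   # large positive z => more internal organization than chance
```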
In networks organized in a hierarchical fashion, nodes that belong to the same module at the bottom level of the hierarchy have greater affinity than nodes that are together at a higher level in the hierarchy.
Thus, if a network has a hierarchical organization, one will be able to order the nodes in such a way that groups of nodes with large affinity are close to each other. With such an ordering, the affinity matrix will have a ‘‘nested’’ block-diagonal structure.
To find such an ordering, we use simulated annealing (29) to minimize a cost function that weighs each matrix element with its distance to the diagonal (30)
\[C = \frac{1}{N} \sum_{i, j = 1}^N A_{ij}\vert i-j \vert\]
This problem belongs to the general class of quadratic assignment problems (31). Our algorithm is able to find the proper ordering for the affinity matrix and to accurately reveal the structure of hierarchically nested random graphs (Fig. 2).
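The cost function is simple to evaluate for a given ordering; a base-R sketch on a toy nested affinity matrix (the matrix and permutation are invented, and no annealing is performed here, just two orderings compared):

```r
## Cost of a node ordering: each affinity entry weighted by its distance
## to the diagonal.
ordering_cost <- function(A) {
  N <- nrow(A)
  sum(A * abs(outer(seq_len(N), seq_len(N), "-"))) / N
}

A <- matrix(0.1, 6, 6)              # weak background affinity
A[1:3, 1:3] <- 0.9                  # block 1
A[4:6, 4:6] <- 0.9                  # block 2

perm <- c(1, 4, 2, 5, 3, 6)         # interleave the two blocks
ordering_cost(A)                    # block-diagonal ordering: low cost
ordering_cost(A[perm, perm])        # interleaved ordering: higher cost
```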
If a module at level \(l\) (or the whole network at level 0) has internal modular structure, the corresponding affinity matrix is block-diagonal.
Clauset, Moore &amp; Newman 2008, Hierarchical structure and the prediction of missing links in networks
We expect the probability of a connection between two vertices to depend on their degree of relatedness. Structure of this type can be modelled mathematically by using a probabilistic approach in which we endow each internal node \(r\) of the dendrogram with a probability \(p_r\) and then connect each pair of vertices for which \(r\) is the lowest common ancestor independently with probability \(p_r\).
Given a dendrogram and a set of probabilities \(p_r\), the hierarchical random graph model allows us to generate artificial networks with a specified hierarchical structure.
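A base-R sketch of that generative step for a tiny 4-leaf dendrogram (this is an illustration of the model, not igraph's HRG object format; the node names r1/r2/r3 and the probabilities are invented):

```r
set.seed(3)

## Internal nodes r1 (root), r2, r3 carry probabilities p_r; each vertex
## pair is linked with the probability of its lowest common ancestor.
p <- c(r1 = 0.05, r2 = 0.9, r3 = 0.9)   # root joins the groups {1,2} and {3,4}

lca <- matrix("r1", 4, 4)               # lowest common ancestor of each pair
lca[1, 2] <- lca[2, 1] <- "r2"
lca[3, 4] <- lca[4, 3] <- "r3"

adj <- matrix(0, 4, 4)
for (i in 1:3) for (j in (i + 1):4) {
  adj[i, j] <- adj[j, i] <- rbinom(1, 1, p[lca[i, j]])
}
adj   # tends to be dense within {1,2} and {3,4}, sparse between
```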
Fitting the hierarchical model to observed network data uses the tools of statistical inference, combining a maximum likelihood approach [15] with a Monte Carlo sampling algorithm [16] on the space of all possible dendrograms. This technique allows us to sample hierarchical random graphs with probability proportional to the likelihood that they generate the observed network. To obtain the results described below we combine information from a large number of such samples, each of which is a reasonably likely model of the data.
We use the sampled dendrograms to generate new networks, different in detail from the originals but, by definition, having similar hierarchical structure (see Supplementary Information for more details). We find that these 'resampled' networks match the statistical properties of the originals closely, including their degree distributions, clustering coefficients, and distributions of shortest path lengths between pairs of vertices.
From this set we can, by using techniques from phylogeny reconstruction [21], create a single consensus dendrogram, which captures the topological features that appear consistently across all or a large fraction of the dendrograms and typically is a better summary of the network's structure than any individual dendrogram.
Another application of the hierarchical decomposition is the prediction of missing interactions in networks.
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
graph_from_data_frame()
: Creating igraph graphs from data frames or vice-versa. This function creates an igraph graph from one or two data frames containing the (symbolic) edge list and edge/vertex attributes.
Usage
graph_from_data_frame(d, directed = TRUE, vertices = NULL)
Details
graph_from_data_frame creates igraph graphs from one or two data frames. It has two modes of operation, depending on whether the vertices argument is NULL or not.
If vertices is NULL, then the first two columns of d
are used as a symbolic edge list and additional columns as edge attributes. The names of the attributes are taken from the names of the columns.
If vertices is not NULL, then it must be a data frame giving vertex metadata. The first column of vertices is assumed to contain symbolic vertex names, this will be added to the graphs as the ‘name’ vertex attribute. Other columns will be added as additional vertex attributes. If vertices is not NULL then the symbolic edge list given in d is checked to contain only vertex names listed in vertices.
Typically, the data frames are exported from some spreadsheet software like Excel and are imported into R via read.table, read.delim or read.csv.
Value
An igraph graph object
Examples
## A simple example with a couple of actors
## The typical case is that these tables are read in from files....
actors <- data.frame(name=c("Alice", "Bob", "Cecil", "David",
"Esmeralda"),
age=c(48,33,45,34,21),
gender=c("F","M","F","M","F"))
relations <- data.frame(from=c("Bob", "Cecil", "Cecil", "David",
"David", "Esmeralda"),
to=c("Alice", "Bob", "Alice", "Alice", "Bob", "Alice"),
same.dept=c(FALSE,FALSE,TRUE,FALSE,FALSE,TRUE),
friendship=c(4,5,5,2,1,1), advice=c(4,5,5,4,2,3))
actors
## name age gender
## 1 Alice 48 F
## 2 Bob 33 M
## 3 Cecil 45 F
## 4 David 34 M
## 5 Esmeralda 21 F
relations
## from to same.dept friendship advice
## 1 Bob Alice FALSE 4 4
## 2 Cecil Bob FALSE 5 5
## 3 Cecil Alice TRUE 5 5
## 4 David Alice FALSE 2 4
## 5 David Bob FALSE 1 2
## 6 Esmeralda Alice TRUE 1 3
g <- graph_from_data_frame(relations, directed=TRUE, vertices=actors)
print(g, e=TRUE, v=TRUE)
## IGRAPH DN-- 5 6 --
## + attr: name (v/c), age (v/n), gender (v/c), same.dept (e/l),
## | friendship (e/n), advice (e/n)
## + edges (vertex names):
## [1] Bob ->Alice Cecil ->Bob Cecil ->Alice David ->Alice
## [5] David ->Bob Esmeralda->Alice
## The opposite operation
igraph::as_data_frame(g, what="vertices")
## name age gender
## Alice Alice 48 F
## Bob Bob 33 M
## Cecil Cecil 45 F
## David David 34 M
## Esmeralda Esmeralda 21 F
igraph::as_data_frame(g, what="edges")
## from to same.dept friendship advice
## 1 Bob Alice FALSE 4 4
## 2 Cecil Bob FALSE 5 5
## 3 Cecil Alice TRUE 5 5
## 4 David Alice FALSE 2 4
## 5 David Bob FALSE 1 2
## 6 Esmeralda Alice TRUE 1 3
fit_hrg()
: Fit a hierarchical random graph model. fit_hrg fits a HRG to a given graph. It performs the specified number of MCMC steps (steps), or runs until a convergence criterion is met if steps is zero. fit_hrg can start from a given HRG if one is supplied in the hrg argument and the start argument is TRUE.
Usage
fit_hrg(graph, hrg = NULL, start = FALSE, steps = 0)
## A graph with two dense groups
g <- sample_gnp(10, p=1/2) + sample_gnp(10, p=1/2)
hrg <- fit_hrg(g)
hrg
## Hierarchical random graph, at level 3:
## g1 p= 0
## '- g13 p=0.22 9
## '- g12 p=0.79 6 7 4 3 1 8 5 10 2
## '- g9 p=0.62
## '- g4 p= 1 20 11 16
## '- g19 p=0.67 12 14 15 13 19 17 18
## The consensus tree for it
consensus_tree(g, hrg=hrg, start=TRUE)
## $parents
## [1] 20 20 20 20 20 20 20 20 20 20 21 21 21 21 21 21 21 21 21 21 22 22 -1
##
## $weights
## [1] 9716 9868 10000
##
## $hrg
## $hrg$left
## [1] -13 -5 -17 -10 -18 16 -16 -3 -4 10 4 -7 -12 1 11 -2 12
## [18] 0 -15
##
## $hrg$right
## [1] -9 -11 14 15 3 18 5 13 -19 19 7 -14 8 9 -8 6 -6
## [18] 2 17
##
## $hrg$prob
## [1] 0.0000000 0.1666667 1.0000000 1.0000000 1.0000000 0.0000000 0.3333333
## [8] 0.7500000 0.6190476 0.0000000 1.0000000 0.7857143 0.2222222 0.0000000
## [15] 0.0000000 0.4000000 0.0000000 1.0000000 0.6666667
##
## $hrg$edges
## [1] 0 1 3 2 2 0 2 3 13 0 1 11 2 0 0 2 0 1 4
##
## $hrg$vertices
## [1] 20 5 4 3 3 2 7 5 10 2 2 9 10 2 6 6 3 2 7
## Prediction of missing edges
g2 <- make_full_graph(4) + (make_full_graph(4) - path(1,2))
predict_edges(g2)
## $edges
## [,1] [,2]
## [1,] 5 6
## [2,] 1 5
## [3,] 4 6
## [4,] 4 5
## [5,] 1 6
## [6,] 2 6
## [7,] 2 5
## [8,] 3 5
## [9,] 3 6
## [10,] 1 7
## [11,] 4 7
## [12,] 3 7
## [13,] 1 8
## [14,] 2 7
## [15,] 4 8
## [16,] 2 8
## [17,] 3 8
##
## $prob
## [1] 0.291502105 0.004251045 0.004242867 0.004222654 0.004219240
## [6] 0.004031925 0.004021102 0.003754339 0.003697152 0.003534892
## [11] 0.003341470 0.003246077 0.003216714 0.003196160 0.003100233
## [16] 0.002692974 0.002624513
##
## $hrg
## Hierarchical random graph, at level 3:
## g1 p=0
## '- g5 p=1 3
## '- g3 p=1 2 1 4
## '- g7 p=1 8
## '- g6 p=1 6 5 7
sample_hrg()
: Sample from a hierarchical random graph model. sample_hrg samples a graph from a given hierarchical random graph model.
Usage
sample_hrg(hrg)
Value
An igraph graph.
predict_edges()
: Predict edges based on a hierarchical random graph model. predict_edges uses a hierarchical random graph model to predict missing edges from a network. This is done by sampling hierarchical models around the optimum model, proportionally to their likelihood. The MCMC sampling is started from hrg, if it is given and the start argument is set to TRUE. Otherwise a HRG is fitted to the graph first.
Usage
predict_edges(graph, hrg = NULL, start = FALSE, num.samples = 10000,
num.bins = 25)
Arguments
hrg: An igraphHRG object. predict_edges allows this to be NULL as well, in which case a HRG is fitted to the graph first, from a random starting point.
Value
A list with entries:
edges: The predicted edges, in a two-column matrix of vertex ids.
prob: Probabilities of the predicted edges, in the same order as the edges.
hrg: The (supplied or fitted) hierarchical random graph model.
Examples
## A graph with two dense groups
g <- sample_gnp(10, p=1/2) + sample_gnp(10, p=1/2)
g
## IGRAPH U--- 20 42 --
## + attr: name_1 (g/c), name_2 (g/c), type_1 (g/c), type_2 (g/c),
## | loops_1 (g/l), loops_2 (g/l), p_1 (g/n), p_2 (g/n)
## + edges:
## [1] 1-- 2 1-- 3 1-- 4 2-- 4 3-- 4 2-- 5 3-- 5 1-- 6 2-- 6 3-- 6
## [11] 4-- 6 1-- 7 2-- 7 3-- 7 3-- 8 4-- 8 6-- 8 7-- 8 2-- 9 3-- 9
## [21] 6-- 9 7-- 9 8-- 9 4--10 6--10 8--10 9--10 11--13 12--13 13--15
## [31] 11--16 12--16 13--16 11--17 12--17 13--17 14--18 17--18 11--20 12--20
## [41] 18--20 19--20
hrg <- fit_hrg(g)
hrg
## Hierarchical random graph, at level 3:
## g1 p= 0
## '- g15 p=0.22 5
## '- g9 p=0.75 7 6 1 9 2 8 10 3 4
## '- g11 p=0.29
## '- g14 p=0.67 19 17 11 20
## '- g12 p= 0.2 13 12 14 16 15 18
## The consensus tree for it
consensus_tree(g, hrg=hrg, start=TRUE)
## $parents
## [1] 20 20 20 20 20 20 20 20 20 20 21 21 21 21 21 21 21 21 21 21 22 22 -1
##
## $weights
## [1] 9345 7876 10000
##
## $hrg
## $hrg$left
## [1] -15 11 1 -2 13 0 -13 10 -18 -6 -14 -4 2 -16 -9 -8 14
## [18] -10 -3
##
## $hrg$right
## [1] -11 -5 -7 12 -17 -19 7 16 3 5 -12 17 9 19 4 18 15
## [18] 6 8
##
## $hrg$prob
## [1] 0.0000000 0.3333333 0.0000000 0.7500000 0.0000000 0.4000000 1.0000000
## [8] 1.0000000 0.7500000 1.0000000 0.2916667 0.2000000 0.0000000 0.6666667
## [15] 0.2222222 0.0000000 0.0000000 0.7142857 1.0000000
##
## $hrg$edges
## [1] 0 1 0 3 0 2 2 1 6 6 7 1 0 2 2 0 0 5 4
##
## $hrg$vertices
## [1] 20 4 4 5 3 6 3 2 9 7 10 6 2 4 10 3 2 8 5
## Prediction of missing edges
g2 <- make_full_graph(4) + (make_full_graph(4) - path(1,2))
predict_edges(g2)
## $edges
## [,1] [,2]
## [1,] 5 6
## [2,] 1 6
## [3,] 1 5
## [4,] 2 6
## [5,] 2 5
## [6,] 3 5
## [7,] 4 6
## [8,] 4 5
## [9,] 2 7
## [10,] 2 8
## [11,] 3 6
## [12,] 1 8
## [13,] 1 7
## [14,] 3 8
## [15,] 3 7
## [16,] 4 8
## [17,] 4 7
##
## $prob
## [1] 0.283530936 0.005529017 0.005363806 0.004969788 0.004969399
## [6] 0.004528083 0.004524869 0.004489098 0.004438744 0.004398698
## [11] 0.004372466 0.004306251 0.004248170 0.003896287 0.003832281
## [16] 0.003640145 0.003477809
##
## $hrg
## Hierarchical random graph, at level 3:
## g1 p=0
## '- g3 p=1 3
## '- g5 p=1 2 1 4
## '- g6 p=1 7
## '- g4 p=1 6 5 8
hrg_tree()
: Create an igraph graph from a hierarchical random graph model. hrg_tree creates the corresponding igraph tree of a hierarchical random graph model.
Usage
hrg_tree(hrg)
Value
An igraph graph.
cluster_fast_greedy()
: Community structure via greedy optimization of modularity. This function tries to find dense subgraphs, also called communities, in graphs via directly optimizing a modularity score.
Usage
cluster_fast_greedy(graph, merges = TRUE, modularity = TRUE,
membership = TRUE, weights = E(graph)$weight)
Value
cluster_fast_greedy returns a communities object; please see the communities manual page for details.
Examples
g <- make_full_graph(5) %du% make_full_graph(5) %du% make_full_graph(5)
g <- add_edges(g, c(1,6, 1,11, 6, 11))
fc <- cluster_fast_greedy(g)
membership(fc)
## [1] 3 3 3 3 3 1 1 1 1 1 2 2 2 2 2
sizes(fc)
## Community sizes
## 1 2 3
## 5 5 5