--
GuidoCaldarelli - 14 Sep 2007
\chapter{Basic Mathematical Tools}
\section{Introduction}
Graphs are a powerful represention of any kind of system with interacting subunits.
In the simplest case we can picture such a system as a set of vertices (sometimes referred to as nodes} connected
to each other by edges. The descriptions of vertices and edges can be modified so as to take into account different types of subunits and varying strengths of the interactions).
In some cases, the network structure of the system is clearly evident, such as in the World-Wide-Web of HTML pages, or the Internet network of computers.
These systems have a real structure that mirrors the abstract description closely. These ``naturally occurring'' graphs are often
called {\em networks} or {\em webs} rather than graphs, and we will maintain this distinction thoroughout the book: between networks as real-world-objects, and graphs as their mathematical counterpart.
The abstract representation of a system in the form of a graph can be applied to a wide variety of physical and biological situations. It is therefore all the more remarkable that the vast majority of these graphs exhibit striking similarities.
Whatever the system, from protein interaction networks to metabolic reaction chains, we find
that the graphs all display similar properties regarding the way in which the vertices of the networks interconnect.
The aim of this book is to provide the necessary mathematical tools in order to analyze, understand and describe their shape and evolution. A purely geometrical analysis of the properties of network vertices is feasible for small or medium graphs.
In Biology however we often deal with systems with an enormous number of components, which requires a
thorough statistical analysis of these local geometrical properties.
\section{Definitions}
A graph is assigned by giving a set of vertices and the set of edges linking them.
Vertices are assumed to be all similar, but in more complicated models they can differ
for one or more properties\cite{CCDM02}. Edges on their turn can differ to take into
account the different types of possible interaction.
For example, if the graphs models the WWW, one can say that the vertices are the
web pages and the edge the hyperlink between them. In this specific case the edges
are oriented, since you can go only one way through a link and seldom two pages
are joined by reciprocal connection. You may also attach a number to the edges stating the number of
visit (i.e. traffic) from one page to another.
Usually edges are then described by assigning both an orientation and a weight (i.e. the
amount of traffic).
A redundant, but useful way to represent a graph is by means of the adjacency matrix.
If the number of vertices is $n$ (measure of the graph) the square matrix $A$ has $n \times n$ entries
$a_{ij}$ that are different from $0$ only if an edge between $i$ and $j$ exists.
In the simplest case of no orientation and no weight we can say that the presence of an edge
between vertices $i$ and $j$ gives $a_{ij}=1$.
When the diagonal elements are nonzero we have an edge between a vertex and itself (technically a `loop').
Unless specified otherwise, we consider those entries equal to $0$ (no loops).
Note that this matrix is symmetric (meaning $a_{ij}=a_{ji}$) only in the case of undirected graphs.
For directed graphs the elements $a_{ij}$ are generally
different from the elements $a_{ji}$ (those symmetric with
respect to the diagonal).
\subsection{Degrees}
The number of edges per vertex is called degree. If the edges are oriented we can distinguish
between in-degree and out-degree. If edges are weighted we have a weighted degree, often termed the {\em strength}.
\subsection{Distance}
The {\em distance} between two vertices is the minimum number of edges separating them. The {\em diameter} of a network is the largest such distance between two vertices in the entire network.
\subsection{Clustering}
The clustering coefficient measures how densely the neighbourhood of a node is connected. The coefficient is the ratio of the number of bond triangles which this particular node is part of, and the number of pairs of neighbours of the node. In other words, the clustering coefficient of a node is the probability that two neighbours of that node are themselves neighbours. For unweighted, undirected networks it is written as:
\[
c_i = {\sum_{j=1}^N \sum_{k=j+1}^N a_{ij} a_{jk} a_{ik} \over k_i (k_i - 1)/2}
\]
The clustering coefficient was introduced by Watts and Strogatz \cite{wattsstrogatz98} to characterize the topology of small-world networks. On weighted networks and directed networks, clustering coefficients are more difficult to define. Several approaches have been proposed in the literature \cite{Vespignani04,Onnela06,Ahnert07 - see Kertesz06 arXiv review, Park06}.
\section{Betweenness}
The betweenness of an edge or a node in a network measures the number of shortest paths between two nodes which pass through this edge or node\cite{Freeman77}.
\section{Network Motifs}
Network motifs are small subgraphs of directed networks, typically with 3 or 4 nodes. Alon \cite{alon02,alon04} showed that some motifs occur much more often than others in real-world networks. Moreover he demonstrated that the relative frequencies of these motifs can be used to divide real-world networks into 'superfamilies' of networks. The feed-forward loop is of particular significance in biological networks, and has been studied in detail\cite{alonffw_pnas}.
\section{Probability and Statistics}
\chapter{Graph Manipulation and Visualization}
\section{Community Detection}
Newman and Girvan \cite{NewmanGirvan2002} introduced an approach for measuring community structure in unweighted, undirected graphs.
The algorithm employs {\em edge betweenness} (see section \ref{betweenness}) and works as follows:
\begin{enumerate}
\item{calculate betweenness for all edges in the network}
\item{remove edge with highest betweenness}
\item{recalculate betweennesses for all edges affected by the removal}
\item{repeat from step 2 until no edges remain}
\end{enumerate}
The prominence of a community structure can be measured by determining the {\em modularity} of the graph, defined as follows: Assign each node $i$ of the graph to a community $\{c_i\}$ by some external criterion or algorithm. Of all $m$ edges in the graph, determine the fraction that link nodes within the same community. Randomize the connections while preserving the degrees $\{k_i\}$ of the nodes, so that a connection between two nodes occurs with probability $p = k_i k_j / 2m$. Measure again how many edges connect nodes within the same community, and subtract this value from the first one. This gives us the {\em modularity} of the graph, $Q$:
\[
Q = {1 \over 2m} \sum_{ij} \left( a_{ij} - {k_i k_j \over 2m}\right) \delta(c_i,c_j)
\]
Which is a measure of the quality of the division given by the $\{c_i\}$.
Newman has also presented a generalization of this community detection algorithm to weighted networks\cite{Newman2004}.
\section{Models}
\subsection{Random Graph Model}
Erd\"os-Renyi graphs are the simplest kind of random graph. Each edge exists with probability $p$, i.e.
\begin{eqnarray}
a_{ij} &=& 1 {\rm \,\, with \,\, probability \,\,} p \cr \cr
a_{ij} &=& 0 {\rm \,\, with \,\, probability \,\,} 1-p
\end{eqnarray}
\begin{itemize}
\item{A random graph with $n$ nodes and probability $p$ is denoted $G(n,p)$.}
\item{Most properties of random graphs show a sharp phase transition at a threshold value of $p$.}
\item{An extensive discussion of the properties of random graphs can be found in \cite{Bollobas}.}
\end{itemize}
A way of generating random graphs with arbitrary degree distributions has been described by Newman, Watts and Strogatz (2001) \cite{Newman2001}.
\subsection{Barab\'asi-Albert Model}
This is the first model introduced to reproduce the scale-invariance observed for the degree distribution in real-world networks.
This result is achieved by allowing the system to increase with time and by giving a particular choice for the wiring of the edges.
Starting from an initial configuration this model of growth evolves according to the following rules
\begin{itemize}
\item [{\bf Growth}]
Evolution is divided in time step, at each time step a new vertex enters the system. This vertex will draw
$m$ new edges to the vertices already present.
\item [{\bf Preferential Attachment}]
The vertices destination of the $m$ incoming edges are chosen with a probability proportional to their degree, that is to say
the probability that vertex $j$ would receive one of the new edges is
\begin{equation}
\Pi_j(k_j)=\frac{k_j}{\sum_{i=1,N} k_i
\end{equation}
\end{itemize}
It can be shown analytically that the degree of a vertex increases with time as a power law
\begin{equation}
k_j(t)\propto mt^{1/2}
\end{equation}
and that the degree distribution is a power law with exponent $-\gamma = -3$
\begin{equation}
P(k)\propto 2m^{1/2}k^{-3}
\end{equation}
\subsection{Fitness Model}
Until now we have only considered networks with indistinguishable vertices. By assigning a numerical value or a set of values to each vertex, we can introduce a ``fitness'' for each vertex, which represents its physical characteristics in some way.
This idea has proven particularly useful in protein interaction networks where
such a set of values can describe the electric charge, the colloidal properties,
the shape and the size. All these quantities play a role in determining whether or not two proteins interact.
Models based on this mechanism are known under various different names; as static model \cite{GKK01},
fitness model \shortcite{CCDM02} or hidden variable model \cite{BP03}.
A specific application to the field of protein networks is presented in\ Here we follow the more general formulation\ (\shortciteNP{CCDM02};
\citeNP{SCB04}) where no particular assumption is made about the statistical distribution used.
The network-building algorithm is the following:
\begin{itemize}
\item Start with $n$ vertices.
For every vertex $i$ draw a real number $x_i$ representing the
fitness of the vertex. The fitness values are supposed to measure the importance
or rank of the vertex in the graph and
they are extracted from a given probability distribution $\rho(x)$.
\item For every couple of vertices, $i,j$, we can draw an edge with a
probability given by the {\em linking function} $f(x_i,x_j)$
depending on the fitness values of the vertices involved.