prev: Note of data science training EP 8: Ensemble – Avenger's ensemble

Let’s say we have to solve the problem that can be visualize as a map. A lot of dots and those connecting lines. And we now are data scientists.

We have enough time to know this well.

NetworkX

NetworkX is a python library to help us solve those map-like problems.

Start with, we have a sample file in this format.

initiation

We can generate a map by:

  1. networkx.read_edgelist() from reading a file or
  2. networkx.Graph() to initiate an empty map

then use networkx.draw() to visualize a simple map.

The map can be bigger by adding new nodes by .add_node() and new edges by .add_edge().

Get basic information

Try networkx.info() to browse its basic info as below.

networkx.connected_components(): List all nodes in the map

networkx.subgraph(): List all nodes in the subgraph inside the map. A map can contain many disconnected subgraphs.

Density

networkx.density()

calculates density of nodes in the map.

  • For undirected graph, \(d = 2 \cdot \frac{edge}{node \cdot (node - 1)}\)
    Like the map above, we got \(d = 2 \cdot \frac{19}{12 \cdot (12-1)} = 2 \cdot 0.14\overline{39} = 0.2\overline{87}\).
  • For directed graph, \(d = \frac{edge}{node \cdot (node - 1)}\).

Clustering

Clustering is like a density of edges surrounding a particular node. This can be calculated by a ratio between total degree of all neighbor nodes and a formula of numbers of neighbor nodes (\(n(n-1)\)). The result can be varied from 0 which means there are no edges between neighbor nodes to 1 which is there are fully linked for all neighbor nodes.

It’s clustering (\(C_i\)). In case of undirected graphs, we multiply it with 2.

For example, \(C_A = \frac{2 \cdot e_{BD}}{(node_B + node_C + node_D)(node_B + node_C + node_D - 1)} = \frac{2 \cdot 1}{3 \cdot 2} = \frac{1}{3}\)

And another example, \(C_L = \frac{2 \cdot 0}{node_L \cdot (node_L - 1)} = 0\)

Reference: https://en.wikipedia.org/wiki/Clustering_coefficient

networkx.clustering()
networkx.average_clustering()

Find average clustering from \(\displaystyle \overline{C} = \frac{1}{n} \sum_{i=1}^{n}C_{i}\).

Shortest paths

networkx.shortest_path_length()

Find the length of the shortest path between two nodes. For an instance, node A to node I will cost 3 edges.

networkx.shortest_path()

Find all nodes in the shortest path between two nodes.

networkx.average_shortest_path_length()

Find average length of shortest paths of all pair of nodes in the graph. The length of a pair can be calculated by networkx.shortest_path_length(). The equation is \(\displaystyle \alpha = \sum_{s,t \in V}\frac{d(s,t)}{n(n-1)}\)

Centrality

Centrality represents characteristic of “center” in various terms as below:

Degree Centrality

networkx.degree_centrality()

Degree Centrality is ratio of the degree of a particular node and the degree of neighbor nodes. The more value a node has, the more nodes it can travel to.

For example, node A has degree as 3 and its neighbor nodes equals summation of degree of node B, C, D and A itself that is \(3 + 2 + 3 + 3 = 8\). Therefore, Degree Centrality of node A equals \(\frac{3}{11}= 0.\overline{27}\).

Closeness Centrality

networkx.closeness_centrality()

Closeness Centrality represents how close a particular node is to the center of the graph. The more it is, the closer the node is and this means the node has more chances to be the “crossing point” of the paths in the graph. Calculated by a number of all nodes in the graph which the particular node is a member – 1 (excludes itself) then divided by total length of shortest paths from that node to all other nodes that can be found by networkx.shortest_path_length()

For example, node B lives in the graph that contains 12 nodes and node B has total shortest path to other nodes as \(0 + 1 + 1 + 2 + 2 + 3 + 3 + 3 + 3 + 4 + 4 + 5 = 31\). So we calculated Closeness Centrality of node B as \(\frac{12-1}{31} = 0.3548\). That's it.

Betweenness Centrality

networkx.betweenness_centrality()

Betweeness Centrality measures how important as a connector a particular node is. Higher is better. It can be calculated by the operations explained as the following:

Let’s say, we are going to find Between Centrality of node C. Start with finding all shortest paths of all node pair in the graph.

Then find the ratio of a number of shortest paths which contains node C comparing with all paths. For example, looking at route AF, its shortest paths are 2 and one of them contains node C (F-C-A). It is \(\frac{1}{2}\) for this route.

Next, sum it up and we got \(6.2\).

Finally, apply the formula \(\displaystyle g(v) = \sum_{s \ne v \ne t}\frac{\sigma_{st}(v)}{\sigma_{st}} = \frac{2(6.2)}{(12-1)(12-2)} = \frac{2(6.2)}{110} = 0.11\overline{27}\). We have to multiply 2 as this is undirected graph.

As a result, networkx.betweenness_centrality() helps us calculate the value for each node as below.

Reference:

Eigenvector Centrality

networkx.eigenvector_centrality()

Eigenvector Centrality is from Eigenvector calculated by matrix-based node relationships. This can represent a “hub” characteristics of each node. Higher one means the more “hub-like” it is.

As the figure here, found node F is highest and node E is the next below.

Reference:

Page rank

networkx.pagerank()

This uses webpage ranking algorithm and it can “prioritize” nodes. This function also apply Eigenvector above for the result.

Clique

Clique is the smallest groups of nodes or subgraph in the graph.

networkx.clique.find_cliques()

Find all cliques in the graph.

networkx.clique.cliques_containing_node()

Find all cliques that contain specific node.

Numpy and Pandas

networkx.to_numpy_matrix()

Return a matrix of node relationship in the graph. 0 and 1 in the figure are the weight of edges. The matrix of undirected graph contains only 0 and 1.

networkx.to_pandas_adjacency()

Like a matrix above but return DataFrame instead.

networkx.to_pandas_edgelist()

Return a DataFrame of pairs of node.

Put it on Matplotlib

Here we gonna find the distribution of degree numbers.

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# create dict of degree of each node
degree = dict(sample.degree)
# transform to Series 
degree_series = pd.Series(index=degree.keys(), data=list(degree.values()))
# create Series of unique values and counts
degree_series_count = degree_series.value_counts()
degree_series_count
# plotting bar graph
plt.bar(degree_series_count.index, degree_series_count)
plt.xlabel("Degree")
plt.ylabel("Node counts")

Ok. We’ve done everything and want to write back to file with networkx.write_edgelist().


Woulda say this is a convenient way to manage graph data with this library.

Between research the references, I found some cool use cases of it that is manage priority of football players by their positions and play stats. So we can apply the library in many fancy problems.

Now it’s the time.

See you next time, bye.

next: Note of data science training EP 10: Cluster – collecting and clustering