5

I am receiving data in the following format:

tail head
P01106  Q09472
P01106  Q13309
P62136  Q13616
P11831  P18146
P13569  P20823
P20823  P01100
...

Is there a good way to format this data as a graph with a numpy array? I am hoping to compute PageRank using this graph.

So far I have

import numpy as np
data = np.genfromtxt('wnt_edges.txt', skip_header=1, dtype=str)

I was thinking about using the graph data structure from Representing graphs (data structure) in Python but it didn't seem to make sense in this case since I'll be doing matrix multiplication.

1
  • 1
    Have you considered using networkx? You can easily convert your array into an edge list. Commented Apr 25, 2017 at 4:50

2 Answers 2

9

To avoid reinventing the wheel you should use networkx as suggested in comments and other answers.

If, for educational purposes, you want to reinvent the wheel you can create an adjacency matrix. The PageRank can be computed from that matrix:

The PageRank values are the entries of the dominant right eigenvector of the modified adjacency matrix.

Since each row/column of the adjacency matrix represents a node, you will need to enumerate the nodes so each node is represented by a unique number starting from 0.

import numpy as np

data = np.array([['P01106', 'Q09472'],
                 ['P01106', 'Q13309'],
                 ['P62136', 'Q13616'],
                 ['P11831', 'P18146'],
                 ['P13569', 'P20823'],
                 ['P20823', 'P01100']])


nodes = np.unique(data)  # mapping node name --> index
noidx = {n: i for i, n in enumerate(nodes)}  # mapping node index --> name

n = nodes.size  # number of nodes

numdata = np.vectorize(noidx.get)(data)  # replace node id by node index

A = np.zeros((n, n))
for tail, head in numdata:
    A[tail, head] = 1
    #A[head, tail] = 1  # add this line for undirected graph

This results in the following graph representation A:

array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])

A 1 in row 5, column 0 for example means that there is an edge from node 5 to node 0, which corresponds to 'P20823' --> 'P01100'. Use the nodes array to look up node names from indices:

print(nodes)
['P01100' 'P01106' 'P11831' 'P13569' 'P18146' 'P20823' 'P62136' 'Q09472'
 'Q13309' 'Q13616']

If there are many nodes and few connections it's better to use a sparse matrix for A. But first try to stay with a dense matrix and only bother to switch to sparse of you have memory or performance issues.

Sign up to request clarification or add additional context in comments.

1 Comment

NetworkX is quite slow. For alternative, some functionalities are implemented in igraph and it is faster (quite underdocumented however). The library graph-tool, like numpy, also compiles to C/C++ code, and is the viable option for graphs larger than 10000 nodes. (It is quite scientific in the sense that most of the methods implemented are what you would mainly encounter in academic papers, like Bayesian network inference methods.) Also has parallel processing. The downside, however, is it is not available on Windows (unless you delete methods using Stan/pystan) but works with Docker.
3

I strongly suggest networkx:

import networkx as nx
#make the graph 
G = nx.Graph([e for e in data])
#compute the pagerank 
nx.pagerank(G)
# output:
# {'P01100': 0.0770275315329843,  'P01106': 0.14594493693403143, 
# 'P11831': 0.1,  'P13569': 0.0770275315329843,  'P18146': 0.1, 
# 'P20823': 0.1459449369340315,  'P62136': 0.1,  'Q09472':
# 0.07702753153298428,  'Q13309': 0.07702753153298428,  'Q13616': 0.1}

That's all it takes. pagerank documentation is here.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.