it.unimi.dsi.webgraph.algo
Class HyperApproximateNeighbourhoodFunction

java.lang.Object
  extended by it.unimi.dsi.util.IntHyperLogLogCounterArray
      extended by it.unimi.dsi.webgraph.algo.HyperApproximateNeighbourhoodFunction
All Implemented Interfaces:
SafelyCloseable, Closeable, Serializable

public class HyperApproximateNeighbourhoodFunction
extends IntHyperLogLogCounterArray
implements SafelyCloseable

Computes the approximate neighbourhood function of a graph using HyperANF.

HyperANF is an algorithm computing an approximation of the neighbourhood function of a graph, that is, the function returning for each h the number of pairs of nodes at distance at most h. It has been described in “HyperANF: Approximating the Neighbourhood Function of Very Large Graphs on a Budget”, by Paolo Boldi, Marco Rosa and Sebastiano Vigna, Proceedings of the 20th international conference on World Wide Web, pages 625−634, ACM, (2011). It is a breakthrough improvement over the ANF technique described by Christopher R. Palmer, Phillip B. Gibbons, and Christos Faloutsos in “Fast Approximation of the ‘Neighbourhood’ Function for Massive Graphs”, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 81−90, ACM (2002).

Incidentally, HyperANF has been used to show that Facebook has just four degrees of separation.

At step h, for each node we (approximately) keep track using HyperLogLog counters of the set of nodes at distance at most h. At each iteration, the sets associated with the successors of each node are joined, thus obtaining the new sets. A crucial component in making this process efficient and scalable is the usage of broadword programming to implement the join phase, which requires maximising in parallel the list of register associated with each successor (the implementation is geared towards 64-bits processors).

Using the approximate sets, for each h we estimates the number of pairs of nodes (x,y) such that the distance from x to y is at most h.

To use this class, you must first create an instance. Then, you call init() and iterate() as much as needed (you can init/iterate several time, if you want so). Finally, you close() the instance. The method modified() will tell you whether the internal state of the algorithm has changed.

If you pass to the constructor (or on the command line) the transpose of your graph (you can compute it using Transform.transpose(ImmutableGraph) or Transform.transposeOffline(ImmutableGraph, int)), when three quarters of the nodes stop changing their value HyperANF will switch to a systolic computation: using the transpose, when a node changes it will signal back to its predecessors that at the next iteration they could change. At the next scan, only the successors of signalled nodes will be scanned. This strategy makes the last phases of the computation significantly faster. In particular, when a very small number of nodes is modified by an iteration HyperANF will switch to a systolic local mode, in which all information about modified nodes is kept in (traditional) dictionaries, rather than being represented as arrays of booleans.

Deciding when to stop iterating is a rather delicate issue. The only safe way is to iterate until modified() is zero, and systolic computation makes this goal easily attainable. However, in some cases one can assume that the graph is not pathological, and stop when the relative increment of the number of pairs goes below some threshold.

A commodity method will do everything for you.

Performance issues

This class can perform external computations: instead of keeping in core memory an old and a new copy of the counters, it can dump on disk an update list containing pairs <nodecounter>. At the end of an iteration, the update list is loaded and applied to the counters in memory. The process is of course slower, but the core memory used is halved.

If there are several available cores, the runs of iterate() will be decomposed into relatively small tasks (small blocks of nodes) and each task will be assigned to the first available core. Since all tasks are completely independent, this ensures a very high degree of parallelism. Be careful, however, as this feature requires a graph with a reasonably fast random access (e.g., in the case of a BVGraph, short reference chains), as many calls to ImmutableGraph.nodeIterator(int) will be made. The granularity of the decomposition is the number of nodes assigned to each task.

In any case, when attacking very large graphs (in particular in external mode) some system tuning (e.g., increasing the filesystem commit time) is a good idea. Also experimenting with granularity and buffer sizes can be useful. Smaller buffers reduce the waits on I/O calls, but increase the time spent in disk seeks. Large buffers improve I/O, but they use a lot of memory. The best possible setup is the one in which the cores are 100% busy during the graph scan, and the I/O time logged at the end of a scan is roughly equal to the time that is necessary to reload the counters from disk: essentially, you are computing as fast as possible.

HyperANF keeps carefully track of which counters have changed their value, and uses this information to speed up the computation. A consequence is that if you iterate up to stabilisation (i.e., until modified() is zero) the last iterations will be significantly faster.

Author:
Sebastiano Vigna, Paolo Boldi, Marco Rosa
See Also:
Serialized Form

Field Summary
protected  Condition allWaiting
          A condition that is notified when all iteration threads are waiting to be started.
static int DEFAULT_BUFFER_SIZE
          The default size of a buffer in bytes.
static int DEFAULT_GRANULARITY
          The default granularity of a task.
 float[] harmonicCentrality
          The sum of inverse distances from each given node, if requested.
protected  ReentrantLock lock
          The lock protecting all critical sections.
 int phase
          The current computation phase.
protected  Condition start
          The condition on which all iteration threads wait before starting a new phase.
 float[] sumOfDistances
          The sum of the distances from every given node, if requested.
 
Fields inherited from class it.unimi.dsi.util.IntHyperLogLogCounterArray
bitVector, CHUNK_MASK, CHUNK_SHIFT, CHUNK_SIZE, counterShift, counterSize, log2m, m, mMinus1, registers, registerSize, seed
 
Constructor Summary
HyperApproximateNeighbourhoodFunction(ImmutableGraph g, ImmutableGraph gt, int log2m)
          Creates a new approximator for the neighbourhood function using default values.
HyperApproximateNeighbourhoodFunction(ImmutableGraph g, ImmutableGraph gt, int log2m, ProgressLogger pl)
          Creates a new approximator for the neighbourhood function using default values.
HyperApproximateNeighbourhoodFunction(ImmutableGraph g, ImmutableGraph gt, int log2m, ProgressLogger pl, int numberOfThreads, int bufferSize, int granularity, boolean external)
          Creates a new approximator for the neighbourhood function.
HyperApproximateNeighbourhoodFunction(ImmutableGraph g, ImmutableGraph gt, int log2m, ProgressLogger pl, int numberOfThreads, int bufferSize, int granularity, boolean external, boolean doSumOfDistances, boolean doHarmonicCentrality, long seed)
          Creates a new approximator for the neighbourhood function.
HyperApproximateNeighbourhoodFunction(ImmutableGraph g, int log2m)
          Creates a new approximator for the neighbourhood function using default values and disabling systolic computation.
HyperApproximateNeighbourhoodFunction(ImmutableGraph g, int log2m, long seed)
          Creates a new approximator for the neighbourhood function using default values and disabling systolic computation.
HyperApproximateNeighbourhoodFunction(ImmutableGraph g, int log2m, ProgressLogger pl)
          Creates a new approximator for the neighbourhood function using default values and disabling systolic computation.
 
Method Summary
 double[] approximateNeighbourhoodFunction()
          Returns an approximation of the neighbourhood function.
 double[] approximateNeighbourhoodFunction(double threshold)
          Returns an approximation of the neighbourhood function.
 double[] approximateNeighbourhoodFunction(long upperBound, double threshold)
          Returns an approximation of the neighbourhood function.
 double[] approximateNeighbourhoodFunction(long upperBound, double threshold, long seed)
          Returns an approximation of the neighbourhood function.
 void close()
           
protected  void copyFromLocal(long[] t, long[] chunkBits, int node)
          Copies a counter from a local array.
protected  void copyToLocal(long[] chunkBits, long[] t, int node)
          Copies a counter to a local array.
protected  void finalize()
           
 void init()
          Initialises the approximator.
 void init(long seed)
          Initialises the approximator, providing a new seed to the underlying IntHyperLogLogCounterArray.
 double iterate()
          Performs a new iteration of HyperANF.
static void main(String[] arg)
           
protected  void max(long[] x, long[] y, int r, long[] accumulator, long[] mask)
          Computes the register-by-register maximum of two bit vectors.
 int modified()
          Returns the number of HyperLogLog counters that were modified by the last call to iterate().
protected  void transfer(long[] source, long[] dest, int node)
          Transfers the content of a counter between two parallel array of longwords.
 
Methods inherited from class it.unimi.dsi.util.IntHyperLogLogCounterArray
add, chunk, clear, clear, count, count, log2NumberOfRegisters, offset, registers, registerSize, relativeStandardDeviation
 
Methods inherited from class java.lang.Object
clone, equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_GRANULARITY

public static final int DEFAULT_GRANULARITY
The default granularity of a task.

See Also:
Constant Field Values

DEFAULT_BUFFER_SIZE

public static final int DEFAULT_BUFFER_SIZE
The default size of a buffer in bytes.

See Also:
Constant Field Values

sumOfDistances

public final float[] sumOfDistances
The sum of the distances from every given node, if requested.


harmonicCentrality

public final float[] harmonicCentrality
The sum of inverse distances from each given node, if requested.


lock

protected final ReentrantLock lock
The lock protecting all critical sections.


allWaiting

protected final Condition allWaiting
A condition that is notified when all iteration threads are waiting to be started.


start

protected final Condition start
The condition on which all iteration threads wait before starting a new phase.


phase

public int phase
The current computation phase.

Constructor Detail

HyperApproximateNeighbourhoodFunction

public HyperApproximateNeighbourhoodFunction(ImmutableGraph g,
                                             ImmutableGraph gt,
                                             int log2m,
                                             ProgressLogger pl,
                                             int numberOfThreads,
                                             int bufferSize,
                                             int granularity,
                                             boolean external)
                                      throws IOException
Creates a new approximator for the neighbourhood function.

Parameters:
g - the graph whose neighbourhood function you want to compute.
gt - the tranpose of g in case you want to perform systolic computations, or null.
log2m - the logarithm of the number of registers per counter.
pl - a progress logger, or null.
numberOfThreads - the number of threads to be used (0 for automatic sizing).
bufferSize - the size of an I/O buffer in bytes (0 for DEFAULT_BUFFER_SIZE).
granularity - the number of node per task in a multicore environment (it will be rounded to the next multiple of 64), or 0 for DEFAULT_GRANULARITY.
external - if true, results of an iteration will be stored on disk.
Throws:
IOException

HyperApproximateNeighbourhoodFunction

public HyperApproximateNeighbourhoodFunction(ImmutableGraph g,
                                             ImmutableGraph gt,
                                             int log2m)
                                      throws IOException
Creates a new approximator for the neighbourhood function using default values.

Parameters:
g - the graph whose neighbourhood function you want to compute.
gt - the tranpose of g in case you want to perform systolic computations, or null.
log2m - the logarithm of the number of registers per counter.
Throws:
IOException

HyperApproximateNeighbourhoodFunction

public HyperApproximateNeighbourhoodFunction(ImmutableGraph g,
                                             ImmutableGraph gt,
                                             int log2m,
                                             ProgressLogger pl)
                                      throws IOException
Creates a new approximator for the neighbourhood function using default values.

Parameters:
g - the graph whose neighbourhood function you want to compute.
gt - the tranpose of g in case you want to perform systolic computations, or null.
log2m - the logarithm of the number of registers per counter.
pl - a progress logger, or null.
Throws:
IOException

HyperApproximateNeighbourhoodFunction

public HyperApproximateNeighbourhoodFunction(ImmutableGraph g,
                                             int log2m)
                                      throws IOException
Creates a new approximator for the neighbourhood function using default values and disabling systolic computation.

Parameters:
g - the graph whose neighbourhood function you want to compute.
log2m - the logarithm of the number of registers per counter.
Throws:
IOException

HyperApproximateNeighbourhoodFunction

public HyperApproximateNeighbourhoodFunction(ImmutableGraph g,
                                             int log2m,
                                             long seed)
                                      throws IOException
Creates a new approximator for the neighbourhood function using default values and disabling systolic computation.

Parameters:
g - the graph whose neighbourhood function you want to compute.
log2m - the logarithm of the number of registers per counter.
seed - the random seed passed to IntHyperLogLogCounterArray.IntHyperLogLogCounterArray(int, long, int, long).
Throws:
IOException

HyperApproximateNeighbourhoodFunction

public HyperApproximateNeighbourhoodFunction(ImmutableGraph g,
                                             int log2m,
                                             ProgressLogger pl)
                                      throws IOException
Creates a new approximator for the neighbourhood function using default values and disabling systolic computation.

Parameters:
g - the graph whose neighbourhood function you want to compute.
log2m - the logarithm of the number of registers per counter.
pl - a progress logger, or null.
Throws:
IOException

HyperApproximateNeighbourhoodFunction

public HyperApproximateNeighbourhoodFunction(ImmutableGraph g,
                                             ImmutableGraph gt,
                                             int log2m,
                                             ProgressLogger pl,
                                             int numberOfThreads,
                                             int bufferSize,
                                             int granularity,
                                             boolean external,
                                             boolean doSumOfDistances,
                                             boolean doHarmonicCentrality,
                                             long seed)
                                      throws IOException
Creates a new approximator for the neighbourhood function.

Parameters:
g - the graph whose neighbourhood function you want to compute.
gt - the tranpose of g, or null.
log2m - the logarithm of the number of registers per counter.
pl - a progress logger, or null.
numberOfThreads - the number of threads to be used (0 for automatic sizing).
bufferSize - the size of an I/O buffer in bytes (0 for DEFAULT_BUFFER_SIZE).
granularity - the number of node per task in a multicore environment (it will be rounded to the next multiple of 64), or 0 for DEFAULT_GRANULARITY.
external - if true, results of an iteration will be stored on disk.
doSumOfDistances - whether the sum of distances from each node should be computed.
doHarmonicCentrality - whether harmonic centrality should be computed.
seed - the random seed passed to IntHyperLogLogCounterArray.IntHyperLogLogCounterArray(int, long, int, long).
Throws:
IOException
Method Detail

init

public void init()
Initialises the approximator.

This method must be call before a series of iterations. Note that it will not change the seed used by the underlying IntHyperLogLogCounterArray.

See Also:
init(long)

init

public void init(long seed)
Initialises the approximator, providing a new seed to the underlying IntHyperLogLogCounterArray.

This method must be call before a series of iterations.

Parameters:
seed - passed to IntHyperLogLogCounterArray.clear(long).

close

public void close()
           throws IOException
Specified by:
close in interface Closeable
Throws:
IOException

finalize

protected void finalize()
                 throws Throwable
Overrides:
finalize in class Object
Throws:
Throwable

max

protected final void max(long[] x,
                         long[] y,
                         int r,
                         long[] accumulator,
                         long[] mask)
Computes the register-by-register maximum of two bit vectors.

Parameters:
x - first vector of longs, representing a bit vector in LongArrayBitVector format, where the result will be stored.
y - a second vector of longs, representing a bit vector in LongArrayBitVector format, that will be maximised with x.
r - the register size.

copyToLocal

protected final void copyToLocal(long[] chunkBits,
                                 long[] t,
                                 int node)
Copies a counter to a local array.

Parameters:
chunkBits - the array storing the counter.
t - a local destination array.
node - the node number.

copyFromLocal

protected final void copyFromLocal(long[] t,
                                   long[] chunkBits,
                                   int node)
Copies a counter from a local array.

Parameters:
t - a local array.
chunkBits - the array where the counter will be stored.
node - the node number.

transfer

protected final void transfer(long[] source,
                              long[] dest,
                              int node)
Transfers the content of a counter between two parallel array of longwords.

Parameters:
source - the source array.
dest - the destination array.
node - the node number.

iterate

public double iterate()
               throws IOException
Performs a new iteration of HyperANF.

Returns:
an approximation of the following value of the neighbourhood function (the first returned value is for distance one).
Throws:
IOException

modified

public int modified()
Returns the number of HyperLogLog counters that were modified by the last call to iterate().

Returns:
the number of HyperLogLog counters that were modified by the last call to iterate().

approximateNeighbourhoodFunction

public double[] approximateNeighbourhoodFunction()
                                          throws IOException
Returns an approximation of the neighbourhood function. The computation will stop when modified() returns false.

Returns:
an approximation of the neighbourhood function computed up to stabilisation.
Throws:
IOException

approximateNeighbourhoodFunction

public double[] approximateNeighbourhoodFunction(double threshold)
                                          throws IOException
Returns an approximation of the neighbourhood function.

Parameters:
threshold - a value that will be used to stop the computation either by relative increment; if you specify -1, the computation will stop when modified() returns false.
Returns:
an approximation of the neighbourhood function.
Throws:
IOException

approximateNeighbourhoodFunction

public double[] approximateNeighbourhoodFunction(long upperBound,
                                                 double threshold)
                                          throws IOException
Returns an approximation of the neighbourhood function.

Parameters:
upperBound - an upper bound to the number of iterations.
threshold - a value that will be used to stop the computation either by relative increment; if you specify -1, the computation will stop when modified() returns false.
Returns:
an approximation of the neighbourhood function.
Throws:
IOException

approximateNeighbourhoodFunction

public double[] approximateNeighbourhoodFunction(long upperBound,
                                                 double threshold,
                                                 long seed)
                                          throws IOException
Returns an approximation of the neighbourhood function.

Parameters:
upperBound - an upper bound to the number of iterations.
threshold - a value that will be used to stop the computation either by relative increment; if you specify -1, the computation will stop when modified() returns false.
seed - the random seed passed to IntHyperLogLogCounterArray.IntHyperLogLogCounterArray(int, long, int, long).
Returns:
an approximation of the neighbourhood function.
Throws:
IOException

main

public static void main(String[] arg)
                 throws IOException,
                        JSAPException,
                        IllegalArgumentException,
                        ClassNotFoundException,
                        IllegalAccessException,
                        InvocationTargetException,
                        InstantiationException,
                        NoSuchMethodException
Throws:
IOException
JSAPException
IllegalArgumentException
ClassNotFoundException
IllegalAccessException
InvocationTargetException
InstantiationException
NoSuchMethodException