This is a simple implementation of parallel hash joins. I'm using it as a first step in studying the performance problems of multicore systems programming. The implementation is tailored to a particular dataset: the IMDB movies.list and actresses.list files, which may be found here.
To learn more about this algorithm, have a look at Multiprocessor Hash-Based Join Algorithms. In short: both relations are hash-partitioned on the join key so that matching tuples land in the same partition; each partition is then joined independently, by building an in-memory hash table over one input and probing it with the other, so the partitions can be processed in parallel with no cross-partition coordination.
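The partition-then-build-and-probe scheme above can be sketched as follows. This is a minimal illustration of the general technique, not the code in the hash-join tree; the `Row` type and function names are assumptions for the example.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

// A row is a (join key, payload) pair; illustrative only.
using Row = std::pair<std::string, std::string>;

std::vector<Row> parallel_hash_join(const std::vector<Row>& movies,
                                    const std::vector<Row>& actresses,
                                    unsigned nparts) {
    // Phase 1: hash-partition both inputs on the join key, so that
    // matching keys always land in the same partition index.
    std::vector<std::vector<Row>> mparts(nparts), aparts(nparts);
    std::hash<std::string> h;
    for (const Row& r : movies)    mparts[h(r.first) % nparts].push_back(r);
    for (const Row& r : actresses) aparts[h(r.first) % nparts].push_back(r);

    // Phase 2: one worker per partition builds a hash table over its
    // movie partition and probes it with the matching actress partition.
    // Partitions are disjoint, so workers need no synchronization.
    std::vector<std::vector<Row>> results(nparts);
    std::vector<std::thread> workers;
    for (unsigned p = 0; p < nparts; ++p) {
        workers.emplace_back([&, p] {
            std::unordered_multimap<std::string, std::string> table;
            for (const Row& r : mparts[p]) table.emplace(r.first, r.second);
            for (const Row& r : aparts[p]) {
                auto range = table.equal_range(r.first);
                for (auto it = range.first; it != range.second; ++it)
                    results[p].emplace_back(it->second, r.second);
            }
        });
    }
    for (std::thread& t : workers) t.join();

    // Concatenate the per-partition outputs.
    std::vector<Row> out;
    for (auto& part : results) out.insert(out.end(), part.begin(), part.end());
    return out;
}
```

Because all coordination happens in the partitioning phase, the build/probe phase scales with the number of partitions, which is what the thread-count argument to hashjoin-opt controls.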
Here are some results.
C++ Commons is a utility library I maintain; you’ll need to make it visible to hash-join:
$ svn --quiet co https://assorted.svn.sourceforge.net/svnroot/assorted/cpp-commons/trunk cpp-commons
$ svn --quiet co https://assorted.svn.sourceforge.net/svnroot/assorted/hash-join/trunk hash-join
$ ln -s "$PWD/cpp-commons/src/commons" hash-join/src/
$ cd hash-join/src/
$ CPATH="$PWD" make hashjoin-opt
$ out/hashjoin-opt 16 $MOVIEDATA/{movies,actresses}.dat
DbPrep filters the .list files into a form the hash join can parse. It can also inflate the dataset while maintaining roughly the same join selectivity as the original. I have also prepared some ready-made datasets for your use.
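One way to inflate a dataset without disturbing the fraction of probe rows that find a match is to replicate every row with distinct key suffixes; this is a hypothetical sketch of such a scheme, not a reading of DbPrep's actual source.

```cpp
#include <string>
#include <utility>
#include <vector>

// Illustrative row type: (join key, payload).
using Row = std::pair<std::string, std::string>;

// Replicate each row k times, tagging copy i's join key with "#i".
// A key that matched before still matches, copy-for-copy, so the
// fraction of probe rows with a match stays roughly constant while
// both inputs grow k-fold. (Hypothetical sketch; names are invented.)
std::vector<Row> inflate(const std::vector<Row>& rows, int k) {
    std::vector<Row> out;
    out.reserve(rows.size() * k);
    for (const Row& r : rows)
        for (int i = 0; i < k; ++i)
            out.emplace_back(r.first + "#" + std::to_string(i), r.second);
    return out;
}
```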
LogProc processes the stdout concatenated from multiple runs of the program, producing the time and speedup plots that illustrate the scalability of the system. It has since been turned into a generic tool and will be moved to a utility project later.
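The speedup numbers in such plots reduce to a simple computation: speedup on n cores is the single-core time divided by the n-core time. A minimal sketch (not LogProc's actual code):

```cpp
#include <vector>

// Given wall-clock times, where times[i] is the run on i+1 cores,
// return the speedup of each run relative to the single-core run.
std::vector<double> speedups(const std::vector<double>& times) {
    std::vector<double> s;
    s.reserve(times.size());
    for (double t : times) s.push_back(times.front() / t);
    return s;
}
```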
Titles extracts the titles from the output of DbPrep on movies.list.
These tools all depend on the Scala Commons.
I used HashDist to experiment with the chaining of various hash functions on this dataset and to observe the resulting distributions.
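The kind of experiment HashDist supports can be sketched as follows (hypothetical code, not HashDist itself): hash a set of keys into B buckets and inspect the resulting bucket loads, optionally chaining hashes by feeding one hash's output into another.

```cpp
#include <functional>
#include <string>
#include <vector>

// "Chaining": rehash the output of a first hash with a second one.
size_t chained(const std::string& k) {
    return std::hash<size_t>{}(std::hash<std::string>{}(k));
}

// Count how many keys land in each of nbuckets buckets; an even
// distribution here means an even distribution of hash-join work.
std::vector<size_t> bucket_counts(const std::vector<std::string>& keys,
                                  size_t nbuckets) {
    std::vector<size_t> counts(nbuckets, 0);
    for (const auto& k : keys) ++counts[chained(k) % nbuckets];
    return counts;
}
```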
Parallel Hash-Join is released under the GNU GPL3.
Copyright 2008 Yang Zhang.
All rights reserved.