This is a simple implementation of parallel hash joins. I'm using it as a first step in studying the performance problems of multicore systems programming. The implementation is tailored to a particular dataset: the IMDB movies.list and actresses.list files, which may be found here.
To learn more about this algorithm, have a look at Multiprocessor Hash-Based Join Algorithms. In short: both relations are hash-partitioned on the join key so that matching tuples land in the same partition; each partition is then joined independently, by building an in-memory hash table over one input and probing it with the other, so the partitions can be processed in parallel with no cross-partition coordination.
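The partition-then-build-and-probe scheme above can be sketched as follows. This is a minimal illustration of the general technique, not the code in the hash-join tree; the `Row` type and function names are assumptions for the example.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

// A row is a (join key, payload) pair; illustrative only.
using Row = std::pair<std::string, std::string>;

std::vector<Row> parallel_hash_join(const std::vector<Row>& movies,
                                    const std::vector<Row>& actresses,
                                    unsigned nparts) {
    // Phase 1: hash-partition both inputs on the join key, so that
    // matching keys always land in the same partition index.
    std::vector<std::vector<Row>> mparts(nparts), aparts(nparts);
    std::hash<std::string> h;
    for (const Row& r : movies)    mparts[h(r.first) % nparts].push_back(r);
    for (const Row& r : actresses) aparts[h(r.first) % nparts].push_back(r);

    // Phase 2: one worker per partition builds a hash table over its
    // movie partition and probes it with the matching actress partition.
    // Partitions are disjoint, so workers need no synchronization.
    std::vector<std::vector<Row>> results(nparts);
    std::vector<std::thread> workers;
    for (unsigned p = 0; p < nparts; ++p) {
        workers.emplace_back([&, p] {
            std::unordered_multimap<std::string, std::string> table;
            for (const Row& r : mparts[p]) table.emplace(r.first, r.second);
            for (const Row& r : aparts[p]) {
                auto range = table.equal_range(r.first);
                for (auto it = range.first; it != range.second; ++it)
                    results[p].emplace_back(it->second, r.second);
            }
        });
    }
    for (std::thread& t : workers) t.join();

    // Concatenate the per-partition outputs.
    std::vector<Row> out;
    for (auto& part : results) out.insert(out.end(), part.begin(), part.end());
    return out;
}
```

Because all coordination happens in the partitioning phase, the build/probe phase scales with the number of partitions, which is what the thread-count argument to hashjoin-opt controls.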
Here are some results.
C++ Commons is a utility library I maintain; you’ll need to make it visible to hash-join:
$ svn --quiet co https://assorted.svn.sourceforge.net/svnroot/assorted/cpp-commons/trunk cpp-commons
$ svn --quiet co https://assorted.svn.sourceforge.net/svnroot/assorted/hash-join/trunk hash-join
$ ln -s "$PWD/cpp-commons/src/commons" hash-join/src/
$ cd hash-join/src/
$ CPATH="$PWD" make hashjoin-opt
$ out/hashjoin-opt 16 $MOVIEDATA/{movies,actresses}.dat
DbPrep filters the .list files into a form the hash join can parse. It can also inflate the dataset while maintaining roughly the same join selectivity as the original. I have also prepared some ready-made datasets for your use.
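One way to inflate a dataset without disturbing the fraction of probe rows that find a match is to replicate every row with distinct key suffixes; this is a hypothetical sketch of such a scheme, not a reading of DbPrep's actual source.

```cpp
#include <string>
#include <utility>
#include <vector>

// Illustrative row type: (join key, payload).
using Row = std::pair<std::string, std::string>;

// Replicate each row k times, tagging copy i's join key with "#i".
// A key that matched before still matches, copy-for-copy, so the
// fraction of probe rows with a match stays roughly constant while
// both inputs grow k-fold. (Hypothetical sketch; names are invented.)
std::vector<Row> inflate(const std::vector<Row>& rows, int k) {
    std::vector<Row> out;
    out.reserve(rows.size() * k);
    for (const Row& r : rows)
        for (int i = 0; i < k; ++i)
            out.emplace_back(r.first + "#" + std::to_string(i), r.second);
    return out;
}
```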
LogProc processes the stdout concatenated from multiple runs of the program, producing the time and speedup plots that illustrate the scalability of the system. It has since been turned into a generic tool and will be moved to a utility project later.
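The speedup numbers in such plots reduce to a simple computation: speedup on n cores is the single-core time divided by the n-core time. A minimal sketch (not LogProc's actual code):

```cpp
#include <vector>

// Given wall-clock times, where times[i] is the run on i+1 cores,
// return the speedup of each run relative to the single-core run.
std::vector<double> speedups(const std::vector<double>& times) {
    std::vector<double> s;
    s.reserve(times.size());
    for (double t : times) s.push_back(times.front() / t);
    return s;
}
```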
Titles extracts the titles from the output of DbPrep on movies.list.
These tools all depend on the Scala Commons.
I used HashDist to experiment with the chaining of various hash functions on this dataset and to observe the resulting distributions.
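The kind of experiment HashDist supports can be sketched as follows (hypothetical code, not HashDist itself): hash a set of keys into B buckets and inspect the resulting bucket loads, optionally chaining hashes by feeding one hash's output into another.

```cpp
#include <functional>
#include <string>
#include <vector>

// "Chaining": rehash the output of a first hash with a second one.
size_t chained(const std::string& k) {
    return std::hash<size_t>{}(std::hash<std::string>{}(k));
}

// Count how many keys land in each of nbuckets buckets; an even
// distribution here means an even distribution of hash-join work.
std::vector<size_t> bucket_counts(const std::vector<std::string>& keys,
                                  size_t nbuckets) {
    std::vector<size_t> counts(nbuckets, 0);
    for (const auto& k : keys) ++counts[chained(k) % nbuckets];
    return counts;
}
```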
Parallel Hash-Join is released under the GNU GPL3.
Copyright 2008 Yang Zhang.
All rights reserved.