Here are the graphs from the latest experiments and implementation:
This implementation was originally not scalable in the hashtable-building stage, which performed frequent allocations. The hashtable is stock from the SGI/libstdc++ implementation. I removed this bottleneck by providing a custom allocator that allocates from a non-freeing local memory arena.
Profiling reveals that most of the time is spent in the hash functions and the function that performs the memcpy during hash-partitioning.
actdb::partition1 is the hash-partitioning function for actresses, and it calls
push_bucket to copy tuples into buckets.
scan is just a function to touch all the data from the file.
% cumulative self self total time seconds seconds calls s/call s/call name 16.40 0.82 0.82 4547797 0.00 0.00 commons::hash_djb2(char const*) 14.80 1.56 0.74 4547797 0.00 0.00 __gnu_cxx::__stl_hash_string(char const*) 13.20 2.22 0.66 4547797 0.00 0.00 db::push_bucket(char**, bucket*, char const*, char const*, unsigned long) 12.80 2.86 0.64 2 0.32 0.32 commons::scan(void const*, unsigned long) 10.80 3.40 0.54 1 0.54 1.78 actdb::partition1(unsigned int, bucket*) ...
Now the hashtable construction phase is the most scalable part of the algorithm (despite its random access nature). The remaining bottlenecks appear to be due to memory stalls, but these are mostly masked by hardware prefetching.
The program does not scale much beyond 16 threads, though performance does improve slightly. The inability to scale beyond 16 is most likely due to the contention for cache capacity among multiple hardware threads per core.
I’ve tried to keep the implementation simple, with no fanciness in terms of custom task scheduling or control over allocation, leaving many things up to the OS.
Update: all tests were performed on josmp.csail.mit.edu.