NUMA Benchmarks Analysis
Updates
- 3/5/08: updated scalability experiments; added lessons learned
- 3/4/08: updated scalability experiments
All tests were performed on josmp.csail.mit.edu. The graphs show the results of running several different experiments, averaged across three trials per experiment. The experiments varied the following parameters:
- number of threads (CPUs, 1–16, usually 16 if not testing scalability)
- size of the memory buffer to operate on (10MB, 100MB, or 1GB)
- number of operations, i.e. reads/writes (frequently 10 million)
- number of times to repeat the chewing (usually 1)
- whether to chew through the memory sequentially or using random access
- whether to run operations in parallel on all the CPUs
- whether to explicitly pin the threads to a CPU (usually we do)
- whether to operate on a global shared buffer, on our own buffer (one we allocate ourselves), or on the buffers that all other nodes allocated (for cross-node communication)
- whether operations are writes or reads
- in experiments varying the number of cores k working concurrently: whether the active cores are cores 1 through k, or cores spread across the nodes in round-robin fashion
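The core of each experiment is a "chew" loop over the buffer. A minimal sketch of what that loop looks like under the parameters above (the function name `chew` and its signature are our invention for illustration, not the actual benchmark source):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Walk a buffer of 64-bit words either sequentially or in
 * pseudo-random order, doing either reads or writes. */
static uint64_t chew(uint64_t *buf, size_t nwords, size_t nops,
                     int sequential, int do_write)
{
    uint64_t sum = 0, rng = 1;
    size_t idx = 0;

    for (size_t i = 0; i < nops; i++) {
        if (sequential) {
            idx = (idx + 1) % nwords;  /* unit stride: prefetch-friendly */
        } else {
            /* 64-bit LCG: cheap pseudo-random indices that defeat
             * the hardware prefetcher */
            rng = rng * 6364136223846793005ULL + 1442695040888963407ULL;
            idx = (size_t)(rng % nwords);
        }
        if (do_write)
            buf[idx] = (uint64_t)i;
        else
            sum += buf[idx];  /* accumulate so the loads stay live */
    }
    return sum;
}
```

Sequential mode lets the hardware prefetcher hide memory latency; random mode forces a cache/TLB miss pattern that exposes the raw latency of the memory being touched, which is why the NUMA questions below are answered differently for the two access patterns.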
Here are some questions these results help answer:
- How much does working from another node affect throughput?
- For sequential scans it makes only a small difference - evidence of hardware prefetching (and caching) hiding most of the remote-access latency - though a measurable gap remains.
- However, for random accesses, the difference is much more pronounced.
- How much difference is there between sequential scan and random access?
- Substantial difference, and random access also magnifies NUMA effects. Compare a and b
- Read vs. write
- Substantial difference. Random writes are over 2x slower than random reads.
- Compare a and b
- Does `malloc` tend to allocate locally?
- Yes, because working with memory allocated from the current thread shows improved times.
- Scalability of: cross-node memory writes vs. shared memory writes vs. local node memory writes
- Throughputs for sequential scans: a vs. b vs. c
- Speedup graphs: a vs. b vs. c
- Throughputs for random access: a vs. b vs. c
- Speedup graphs: a vs. b vs. c
- Scalability of: cross-node memory reads vs. shared memory reads vs. local node memory reads
- Throughputs for sequential scans: a vs. b vs. c
- Speedup graphs: a vs. b vs. c
- Throughputs for random access: a vs. b vs. c
- Speedup graphs: a vs. b vs. c
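On the `malloc` question above: a plausible mechanism is Linux's default first-touch page placement, which puts each page on the NUMA node of the CPU that first writes it. A hedged sketch of that pattern, assuming Linux with `sched_setaffinity` available (the helper name `alloc_local` and its pin-then-touch structure are ours, not the benchmark's):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdlib.h>
#include <string.h>

/* Pin the calling thread to `cpu`, then malloc and initialize a
 * buffer there.  Under first-touch placement, the memset below
 * faults the pages in on the pinned CPU's NUMA node, so later
 * accesses from that CPU are node-local. */
static void *alloc_local(size_t bytes, int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* tid 0 means "the calling thread" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        return NULL;

    void *buf = malloc(bytes);
    if (buf)
        memset(buf, 0, bytes);  /* first touch: pages placed locally */
    return buf;
}
```

This is why the experiments that pin threads and have each thread allocate its own buffer show faster times than the cross-node variants: `malloc` itself just returns virtual address space, and the initializing write is what determines where the physical pages land.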