NUMA Benchmarks Analysis
Updates
- 3/5/08: updated scalability experiments; added lessons learned
- 3/4/08: updated scalability experiments
All tests were performed on josmp.csail.mit.edu. The graphs show the results of running several different experiments, averaged across three trials per experiment. The experiments varied the following parameters:
- number of threads (CPUs, 1–16, usually 16 if not testing scalability)
- size of the memory buffer to operate on (10MB, 100MB, or 1GB)
- number of operations, i.e. reads/writes (frequently 10 million)
- number of times to repeat the chewing (usually 1)
- whether to chew through the memory sequentially or using random access
- whether to run operations in parallel on all the CPUs
- whether to explicitly pin the threads to a CPU (usually we do)
- whether to operate on a global shared buffer, on our own buffer (one we allocate ourselves), or on the buffers that all other nodes allocated (for cross-node communication)
- whether operations are writes or reads
- in experiments varying the number of cores k working concurrently: whether the active cores are cores 1 through k, or cores spread across the nodes in round-robin fashion
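The core of each experiment is a "chew" loop over the buffer. A minimal sketch of what that loop looks like under the parameters above (the function name `chew` and its signature are our invention for illustration, not the actual benchmark source):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Walk a buffer of 64-bit words either sequentially or in
 * pseudo-random order, doing either reads or writes. */
static uint64_t chew(uint64_t *buf, size_t nwords, size_t nops,
                     int sequential, int do_write)
{
    uint64_t sum = 0, rng = 1;
    size_t idx = 0;

    for (size_t i = 0; i < nops; i++) {
        if (sequential) {
            idx = (idx + 1) % nwords;  /* unit stride: prefetch-friendly */
        } else {
            /* 64-bit LCG: cheap pseudo-random indices that defeat
             * the hardware prefetcher */
            rng = rng * 6364136223846793005ULL + 1442695040888963407ULL;
            idx = (size_t)(rng % nwords);
        }
        if (do_write)
            buf[idx] = (uint64_t)i;
        else
            sum += buf[idx];  /* accumulate so the loads stay live */
    }
    return sum;
}
```

Sequential mode lets the hardware prefetcher hide memory latency; random mode forces a cache/TLB miss pattern that exposes the raw latency of the memory being touched, which is why the NUMA questions below are answered differently for the two access patterns.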
Here are some questions these results help answer:
- How much does working from another node affect throughput?
- For sequential scans it makes only a small difference - evidence of hardware prefetching (and caching) hiding most of the remote-access latency - though a measurable gap remains.
- However, for random accesses, the difference is much more pronounced.
- How much difference is there between sequential scan and random access?
- Substantial difference, and random access also magnifies NUMA effects. Compare a and b
- Read vs. write
- Substantial difference. Random writes are over 2x slower than random reads.
- Compare a and b
- Does `malloc` tend to allocate locally?
- Yes, because working with memory allocated from the current thread shows improved times.
- Scalability of: cross-node memory writes vs. shared memory writes vs. local node memory writes
- Throughputs for sequential scans: a vs. b vs. c
- Speedup graphs: a vs. b vs. c
- Throughputs for random access: a vs. b vs. c
- Speedup graphs: a vs. b vs. c
- Scalability of: cross-node memory reads vs. shared memory reads vs. local node memory reads
- Throughputs for sequential scans: a vs. b vs. c
- Speedup graphs: a vs. b vs. c
- Throughputs for random access: a vs. b vs. c
- Speedup graphs: a vs. b vs. c
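On the `malloc` question above: a plausible mechanism is Linux's default first-touch page placement, which puts each page on the NUMA node of the CPU that first writes it. A hedged sketch of that pattern, assuming Linux with `sched_setaffinity` available (the helper name `alloc_local` and its pin-then-touch structure are ours, not the benchmark's):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdlib.h>
#include <string.h>

/* Pin the calling thread to `cpu`, then malloc and initialize a
 * buffer there.  Under first-touch placement, the memset below
 * faults the pages in on the pinned CPU's NUMA node, so later
 * accesses from that CPU are node-local. */
static void *alloc_local(size_t bytes, int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* tid 0 means "the calling thread" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        return NULL;

    void *buf = malloc(bytes);
    if (buf)
        memset(buf, 0, bytes);  /* first touch: pages placed locally */
    return buf;
}
```

This is why the experiments that pin threads and have each thread allocate its own buffer show faster times than the cross-node variants: `malloc` itself just returns virtual address space, and the initializing write is what determines where the physical pages land.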