Row Major v.s Column Major: A look into Spatial Locality, TLB and Cache Misses
March 16, 2025
Memory is often the bottleneck of programs. When writing programs, it is important to consider how data is structured based on their data accesses patterns. A classic example is accessing a matrix via row or column major i.e. going through a matrix by row or by column order.
Theory: Accessing a matrix by row-major will benefit from spatial locality as the next few elements will be accessed in the subsequent iterations and thus benefit from cacheline populating the next few elements on each cache miss. As column-major accesses elements that are spread apart, it will not benefit from the cache as much as the stride is sufficiently large enough that the CPU is less likely to retrieve the next subsequent elements within the column. Therefore, column-major accesses will suffer from higher cache and TLB misses so we should expect more cache accesses and page walks.
Theoretical/Expected Results:
Events | Row-Major | Col-Major |
---|---|---|
Cache Accesses | Low | High |
Cache Misses | Low | High |
TLB Acceses | Low | High |
TLB Misses | Low | High |
Results: Running perf stat -e dTLB-load,dTLB-load-misses,cache-misses,cache-references ./row_<major|col>
:
Events | Row-Major | Col-Major |
---|---|---|
Cache References | 164,724,829 | 1 559,702,701 |
Cache Misses | 164,010 | 1,071,496,040 |
dTLB Loads | 2,271,442 | 274,596,480 |
TLB Misses | 2,232,796 | 273,704,958 |
As expected, we do observe column major suffering greatly from its inability to utilize spatial locality. We can observe large magnitudes of cache accesses and dTLB loads compared to row major whereby cache is missed 68.7% of the time. This there translates to more than double the runtime performance slowdown (almost 3 times on my machine):
hyperfine --warmup 1 -i --export-markdown runtime.md ./row_major ./column_major
Benchmark 1: ./row_major
Time (mean ± σ): 3.629 s ± 0.146 s [User: 2.597 s, System: 1.011 s]
Range (min … max): 3.519 s … 4.006 s 10 runs
Benchmark 2: ./column_major
Time (mean ± σ): 10.793 s ± 0.290 s [User: 9.716 s, System: 1.024 s]
Range (min … max): 10.412 s … 11.201 s 10 runs
Summary
./row_major ran
2.97 ± 0.14 times faster than ./column_major