
That Google paper has a nice observation on glibc's memcmp:

“memcmp clearly stands out of the correlation between call frequency and function size in Figure 8. It is both extremely frequent, and, at almost 6 KiB of code, 10× larger than memcpy which is conceptually of similar complexity. Examining its layout and execution patterns (Figure 13) suggests that it does suffer from the high amount of fragmentation we observed fleetwide in the previous section. While covering 90% of executed instructions in memcmp only requires two cache lines, getting up to 99% coverage requires 41 lines, or 2.6 KiB of cache capacity. Not only is more than 50% of code cold, but it is also interspersed between the relatively hot regions, and likely unnecessarily brought in by prefetchers. Such bloat is costly – performance counter data collected by GWP indicates that 8.2% of all i-cache misses among the 100 hottest functions are from memcmp alone.

A closer look at the actual code from glibc can explain the execution patterns in Figure 13. It is hand-written in assembly and precompiled, with extensive manual loop unrolling, many conditional cases for the various alignments of the two source arrays, and large jump tables.

In our experience, code usually evolves into similar state from over-reliance on micro-optimization and micro-benchmarking. While writing in assembly can in rare cases be a necessary evil, it prevents the compiler from doing even the most basic feedback-directed code layout optimizations.

[…]

We tested this hypothesis by macro-benchmarking a version of memcmp that is specifically optimized for code size (only 2 i-cache lines) and locality. In short, it only special-cases very small string sizes (to aid the compiler in inlining very fast cases) and falls back to rep cmps for larger compares. Even though it achieves slightly lower throughput numbers than the glibc version in micro-benchmarks, this simple proof-of-concept showed an overall 0.5%-1% end-to-end performance improvement on large-footprint workloads like web search.”

I would guess the assembly version could be improved by moving those cold blocks out of the cache lines occupied by the hot code.
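For concreteness, the approach the paper describes could be sketched roughly like this (a hypothetical illustration, not Google's actual code): special-case very small sizes with a byte loop the compiler can inline, and fall back to `rep cmpsb` (the byte form of `rep cmps`) for larger compares on x86-64, with a portable loop elsewhere:

```c
#include <stddef.h>

/* Hypothetical sketch of a size-optimized memcmp in the spirit of the
 * paper: special-case small sizes (cheap to inline), fall back to the
 * microcoded string instruction for everything else. */
static int small_memcmp(const void *a, const void *b, size_t n)
{
    const unsigned char *p = a, *q = b;

    /* Fast path: very small compares via a simple, inlinable byte loop. */
    if (n <= 16) {
        for (size_t i = 0; i < n; i++)
            if (p[i] != q[i])
                return p[i] < q[i] ? -1 : 1;
        return 0;
    }

#if defined(__x86_64__)
    /* repe cmpsb advances RSI/RDI until a mismatch or RCX hits zero. */
    size_t left = n;
    __asm__ volatile("repe cmpsb"
                     : "+S"(p), "+D"(q), "+c"(left)
                     :
                     : "cc", "memory");
    /* p and q now point one past the last bytes compared. */
    if (p[-1] == q[-1])
        return 0;
    return p[-1] < q[-1] ? -1 : 1;
#else
    /* Portable fallback for non-x86 targets. */
    for (size_t i = 0; i < n; i++)
        if (p[i] != q[i])
            return p[i] < q[i] ? -1 : 1;
    return 0;
#endif
}
```

The whole thing is a couple dozen instructions, easily fitting the two i-cache lines the paper mentions, and the compiler remains free to lay it out with feedback-directed optimization.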


