A100 Residency Control Atomics

Hi,

In the A100 GTC presentation there was an experiment showing performance uplifts on a histogram workload. The problem used 250 million integers and 5 million bins, and using L2 residency control provided a performance uplift. It's not clear how that algorithm was implemented, but I wrote a very basic histogram for a similar problem size where, inside the kernel, each thread does:

int val = d_values[tid];    // tid is the global thread index
atomicAdd(&d_out[val], 1);  // one atomic increment into the output bin

So it is a very vanilla histogram. I did get a speedup of about 1.04x, but not what was shown in the presentation, which is likely because of how the values are distributed. What I find confusing is that in both cases the L2 cache hit rate is about 90%, with a 100% hit rate for the reductions. However, looking at the memory traffic, the version with residency control writes only a few MB to main memory, whereas the version without residency control has main-memory stores on the order of GB. If the hit rate is the same in both cases, why is there such a large discrepancy in the write traffic to main memory between the two versions?
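For reference, I set up the residency control roughly along the lines below. This is just a sketch, not the presentation's code; numBins, numValues, histogramKernel, and the launch configuration are placeholders, and the hit ratio is only a starting point:

cudaStream_t stream;
cudaStreamCreate(&stream);

size_t binBytes = numBins * sizeof(int);                      // ~20 MB for 5 million int bins
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, binBytes); // reserve part of L2 for persisting lines

cudaStreamAttrValue attr = {};
attr.accessPolicyWindow.base_ptr  = d_out;                    // the histogram bins
attr.accessPolicyWindow.num_bytes = binBytes;                 // window covering the whole bin array
attr.accessPolicyWindow.hitRatio  = 1.0f;                     // fraction of the window marked persisting
attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);

histogramKernel<<<grid, block, 0, stream>>>(d_values, d_out, numValues);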

Any help would be appreciated. Thanks.

It seems that you are referring to this presentation. The Ampere architecture whitepaper may be of interest:

To fully exploit the L2 capacity, A100 includes improved cache management controls. Optimized for neural network training and inferencing as well as general compute workloads, the new controls ensure that data in the cache is used more efficiently by minimizing writebacks to memory and keeping reused data in L2 to reduce redundant DRAM traffic.
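As a rough illustration of those controls through the CUDA runtime (a sketch only; the device index and where the reset happens are assumptions, not code from the whitepaper):

int maxPersistL2 = 0;
cudaDeviceGetAttribute(&maxPersistL2, cudaDevAttrMaxPersistingL2CacheSize, 0); // max carve-out on device 0
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, (size_t)maxPersistL2);      // set aside that much L2

// ... launch work in streams that carry an access policy window ...

cudaCtxResetPersistingL2Cache(); // release persisting lines so the full L2 is available again

A dirty line that stays resident in L2 only needs to be written back to DRAM once, when it is eventually evicted or the persisting region is reset, which seems consistent with the "minimizing writebacks to memory" wording above.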

Beyond that, if you’d like to see a change in CUDA documentation, you may wish to file a bug.
