Hi,
One of the A100 GTC presentations showed an experiment in which L2 residency control gave a performance uplift on a histogram workload. The problem was stated as 250 million integers and 5 million bins. It's not clear how the algorithm was implemented, so I wrote a very basic histogram for a similar problem size, where the kernel body is essentially:
    int val = d_values[tid];   // tid is the global thread ID
    atomicAdd(&d_out[val], 1);
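To make the two lines above concrete, here is a minimal, self-contained sketch of such a kernel with a host driver. The function names (`histogram_kernel`, `run_histogram`), the bounds checks, the block size of 256, and the small uniformly random test input are assumptions for illustration only; they are not taken from the presentation, and the sizes are deliberately far smaller than 250 million values and 5 million bins.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// One thread per input element; each thread increments the bin for its value.
__global__ void histogram_kernel(const int* d_values, int* d_out,
                                 size_t n, int num_bins)
{
    size_t tid = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (tid >= n) return;                 // guard the tail of the last block
    int val = d_values[tid];
    if (val >= 0 && val < num_bins)       // assumed guard against bad input
        atomicAdd(&d_out[val], 1);
}

// Hypothetical host helper: copies the input to the GPU, runs the kernel,
// and returns the bin counts. CUDA error checking is omitted for brevity.
std::vector<int> run_histogram(const std::vector<int>& h_values, int num_bins)
{
    std::vector<int> h_out(num_bins, 0);
    const size_t n = h_values.size();
    if (n == 0 || num_bins <= 0) return h_out;

    int *d_values = nullptr, *d_out = nullptr;
    cudaMalloc(&d_values, n * sizeof(int));
    cudaMalloc(&d_out, num_bins * sizeof(int));
    cudaMemcpy(d_values, h_values.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, num_bins * sizeof(int));

    const unsigned threads = 256;         // assumed block size
    const unsigned blocks = (unsigned)((n + threads - 1) / threads);
    histogram_kernel<<<blocks, threads>>>(d_values, d_out, n, num_bins);
    cudaDeviceSynchronize();

    cudaMemcpy(h_out.data(), d_out, num_bins * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_values);
    cudaFree(d_out);
    return h_out;
}

int main()
{
    const size_t n = 1 << 20;             // assumed small test size
    const int num_bins = 1 << 12;
    std::vector<int> h_values(n);
    for (size_t i = 0; i < n; ++i) h_values[i] = rand() % num_bins;

    std::vector<int> hist = run_histogram(h_values, num_bins);

    // Every input value lands in exactly one bin, so the counts must sum to n.
    size_t total = 0;
    for (int c : hist) total += (size_t)c;
    assert(total == n);
    printf("counted %zu of %zu values\n", total, n);
    return 0;
}
```

Because the bins are updated with `atomicAdd` and the return value is not used, how the input values are distributed over the bins determines how much contention and L2 reuse the kernel sees.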
So, a very vanilla histogram. I did get a small speedup (about 1.04x), but not as large as the one in the presentation, probably because of how the input values are distributed. What confuses me is that in both cases the L2 cache hit rate is about 90%, with a reduction hit rate of 100%. However, when I look at the memory traffic, the version with residency control writes only a few MB to main memory, whereas the version without residency control stores on the order of GB. If the hit rate is the same in both cases, why is there such a large discrepancy in write traffic to main memory between the two?
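For anyone reproducing this, below is a hedged sketch of one common way to enable L2 residency control for the bins array: a stream access policy window from the CUDA runtime API. The helper names (`hit_ratio_for`, `enable_persisting_window`, `disable_persisting_window`) are hypothetical, as are the choice of `cudaAccessPropertyStreaming` for misses and the rule used to pick `hitRatio`. It shows the general setup only and is not necessarily the configuration used in the presentation.

```cuda
#include <algorithm>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: fraction of the window that is asked to persist,
// given how much L2 was actually set aside for persisting accesses.
float hit_ratio_for(size_t carve_bytes, size_t window_bytes)
{
    if (window_bytes == 0) return 0.0f;
    return std::min(1.0f, (float)carve_bytes / (float)window_bytes);
}

// Marks [ptr, ptr + bytes) as persisting in L2 for kernels launched in `stream`.
// Returns false if the device does not support it or a call fails.
bool enable_persisting_window(cudaStream_t stream, void* ptr, size_t bytes)
{
    int dev = 0;
    cudaDeviceProp prop = {};
    if (cudaGetDevice(&dev) != cudaSuccess) return false;
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) return false;
    if (prop.persistingL2CacheMaxSize <= 0 || prop.accessPolicyMaxWindowSize <= 0)
        return false;                     // no L2 residency control on this GPU

    // Set aside part of L2 for persisting accesses, clamped to the device maximum.
    size_t carve = std::min(bytes, (size_t)prop.persistingL2CacheMaxSize);
    if (cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, carve) != cudaSuccess)
        return false;

    // The access policy window is also clamped to a device maximum.
    size_t window = std::min(bytes, (size_t)prop.accessPolicyMaxWindowSize);

    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = ptr;
    attr.accessPolicyWindow.num_bytes = window;
    attr.accessPolicyWindow.hitRatio  = hit_ratio_for(carve, window);
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;  // assumed choice
    return cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow,
                                  &attr) == cudaSuccess;
}

// Removes the window and returns persisting lines to normal status.
void disable_persisting_window(cudaStream_t stream)
{
    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.num_bytes = 0;   // a zero-size window disables the policy
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
    cudaCtxResetPersistingL2Cache();
}

int main()
{
    const size_t num_bins = 5000000;          // bin count from the problem statement
    const size_t bytes = num_bins * sizeof(int);

    int* d_out = nullptr;
    cudaStream_t stream;
    cudaMalloc(&d_out, bytes);
    cudaMemset(d_out, 0, bytes);
    cudaStreamCreate(&stream);

    bool enabled = enable_persisting_window(stream, d_out, bytes);
    printf("persisting window enabled: %s\n", enabled ? "yes" : "no");

    // The histogram kernel would be launched in `stream` here, so that its
    // accesses to d_out fall under the access policy window.

    disable_persisting_window(stream);
    cudaStreamDestroy(stream);
    cudaFree(d_out);
    return 0;
}
```

The window only affects kernels launched in the stream it is attached to, so the baseline and residency-controlled runs can be compared by toggling the two helpers around the same launch.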
Any help would be appreciated. Thanks.