I’ve got a fairly simple kernel that reads in values and builds a histogram. I’m getting a total read throughput of 134 GB/s on a GTX 570 (theoretical = 152 GB/s). The writes are negligible. I’m not unhappy with the performance, but I think it can be better.

I’ve profiled the code: good IPC (1.35), no control-flow divergence, zero shared-memory bank conflicts, and 3.45 instructions/byte (the GTX 570 ratio is close to 4:1), yet a 25–29% instruction replay ratio. I don’t know whether working this down will bring the throughput up, but I’d like to know the likely cause.

Each thread reads dozens of values (all coalesced), does some arithmetic, and performs an LDS/STS pair. The histogram counters are packed and strided so that each thread accesses its own shared-memory bank. There are no atomics, and there is no inter-warp communication until the very end of the kernel (each thread does a ton of work), so it’s not as if warps are stalled on __syncthreads() after a load. Nothing looks bad here as far as I can tell. I have a much more complex kernel with lots of bank conflicts and other issues, yet its replay ratio is only 12%. Any ideas?
Also, I just noticed there is a global memory replay of 7.2%. No idea why; each warp’s reads are coalesced, so that should be fine.