How to reduce Replayed Instruction %

I’ve got a fairly simple kernel that reads in values and builds a histogram. I’m getting a total read throughput of 134 GB/s on a GTX 570 (theoretical = 152 GB/s). The writes are negligible. I’m not unhappy with the performance at all, but I think it can be better.

I’ve profiled the code, and have good IPC (1.35), no control flow divergence, zero shared mem bank conflicts, and 3.45 instructions/byte (the GTX 570 ratio is close to 4:1), yet I see a 25-29% replayed instruction ratio. I don’t know if working this down will bring the throughput up, but I’d like to know the likely cause for it.

Each thread reads dozens of values (all coalesced), does some arithmetic, and does an LDS/STS pair. The histogram counters are packed and strided so that each thread accesses its own bank. There are no atomics. There is also no inter-warp communication until the very end of the kernel (each thread does a ton of work), so it’s not like warps are stalled on syncthreads after LD. Nothing obviously bad, as far as I know. I have a much more complex kernel that has lots of bank conflicts and things, yet it has a replay ratio of only 12%. Any ideas?

Also, I just noticed there is a Global memory replay of 7.2%. No idea why. Every warp’s reads are coalesced, so it should be fine.
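For anyone trying to picture the "each thread accesses its own bank" layout from the first post, here is a rough host-side model in Python. It is only a sketch under assumptions: the kernel was not posted, so `counter_word_index` and the four-counters-per-word packing are hypothetical. The 32 banks of 4-byte words match Fermi-class parts such as the GTX 570.

```python
# Sketch only: models Fermi-style shared memory (32 banks, 4-byte words).
# The layout below is an assumption, not the poster's actual kernel.
NUM_BANKS = 32          # shared memory banks on compute capability 2.x
WARP_SIZE = 32
COUNTERS_PER_WORD = 4   # hypothetical: four 8-bit counters packed per 32-bit word


def counter_word_index(lane: int, bin_id: int) -> int:
    """32-bit word that holds `bin_id` for warp lane `lane` (hypothetical layout)."""
    row = bin_id // COUNTERS_PER_WORD
    return row * WARP_SIZE + lane   # stride of 32 words keeps lane == bank


def bank_of(word_index: int) -> int:
    """Bank that services a given 32-bit word."""
    return word_index % NUM_BANKS


def max_conflict_degree(word_indices) -> int:
    """Largest number of distinct words that fall into one bank for a warp access.

    Lanes touching the same word are counted once (treated as a broadcast).
    """
    per_bank = {}
    for w in set(word_indices):
        per_bank.setdefault(bank_of(w), set()).add(w)
    return max(len(words) for words in per_bank.values())


# Every lane updates an arbitrary bin, but in its own column of words.
bins = [(7 * lane + 3) % 64 for lane in range(WARP_SIZE)]
strided = [counter_word_index(lane, b) for lane, b in enumerate(bins)]
print(max_conflict_degree(strided))   # 1, i.e. conflict-free

# Contrast: 32 lanes hitting words 0, 32, 64, ... all land in bank 0.
same_bank = [i * NUM_BANKS for i in range(WARP_SIZE)]
print(max_conflict_degree(same_bank))  # 32-way conflict
```

With a layout like this the shared bank conflict counter should stay at zero, which matches what the profiler reported, so the 25-29% replay would have to come from somewhere else.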



You mind uploading the kernel?

Could you clarify what is meant by Global memory replay?

Replayed instruction ratio…? Wow… What’s that?

Replayed instruction ratio is something that comes out of an analysis of a kernel in the Compute Visual Profiler in CUDA 4.0. According to the F1 help in this tool this is what it means (and some of the other new replay metrics, since I’m copy-pasting already):

Replayed Instructions (%)

This gives the percentage of instructions replayed during kernel execution. Replayed instructions are the difference between the numbers of instructions that are actually issued by the hardware to the number of instructions that are to be executed by the kernel. Ideally this should be zero. This is calculated as 100 * (instructions issued - instruction executed) / instruction issued

Global memory replay (%)

Percentage of replayed instructions caused due to global memory accesses. This is calculated as 100 * (l1 global load miss) / instructions issued

Local memory replay (%)

Percentage of replayed instructions caused due to local memory accesses. This is calculated as 100 * (l1 local load miss + l1 local store miss) / instructions issued

Shared bank conflict replay (%)

Percentage of replayed instructions caused due to shared memory bank conflicts. This is calculated as 100 * (l1 shared conflict)/ instructions issued
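
The four formulas quoted above are easy to check by hand. Below is a minimal sketch in Python that transcribes them directly. The parameter names just mirror the wording of the help text, and the counter values in the example are made up for illustration, not measured on any kernel.

```python
# Direct transcription of the four help-text formulas.
# Parameter names follow the help text wording; example values are invented.

def replayed_instructions_pct(inst_issued, inst_executed):
    """100 * (instructions issued - instructions executed) / instructions issued"""
    return 100.0 * (inst_issued - inst_executed) / inst_issued


def global_memory_replay_pct(l1_global_load_miss, inst_issued):
    """100 * (l1 global load miss) / instructions issued"""
    return 100.0 * l1_global_load_miss / inst_issued


def local_memory_replay_pct(l1_local_load_miss, l1_local_store_miss, inst_issued):
    """100 * (l1 local load miss + l1 local store miss) / instructions issued"""
    return 100.0 * (l1_local_load_miss + l1_local_store_miss) / inst_issued


def shared_bank_conflict_replay_pct(l1_shared_conflict, inst_issued):
    """100 * (l1 shared conflict) / instructions issued"""
    return 100.0 * l1_shared_conflict / inst_issued


# Invented counters: 1,000,000 issued, 750,000 executed, 72,000 L1 global load misses.
print(replayed_instructions_pct(1_000_000, 750_000))   # 25.0
print(global_memory_replay_pct(72_000, 1_000_000))     # about 7.2
```

Note that only the first formula is bounded by construction: issued can never be smaller than executed, so it stays between 0 and 100. The other three divide an event count by instructions issued, and nothing in the formulas themselves caps them at 100.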

Oh! Thanks for the info…
Predication would affect this, I guess…

No thanks needed, it wasn’t my info in the first place ;)

I just noticed I forgot the Shared bank conflict replay ratio, now it’s there also.

Here are the numbers for a simple matrix multiplication kernel:
Replayed Instructions (%): 48.5%
Global memory replay (%): 4.3%
Local memory replay (%): 0%
Shared bank conflict replay (%): 152%

According to the help file, all numbers should be between 0 and 100, which is clearly not the case for my shared bank conflict replay (although, going by its definition, the Shared bank conflict replay ratio doesn’t need to stay below 100).

I know my matrix multiplication has a lot of shared memory bank conflicts, so I guess the Shared bank conflict replay number means that every instruction that accesses shared memory is executed 1.52 times on average to handle the bank conflicts. I’m not sure how that ends up as 48.5% replayed instructions, though… Does anyone else have an explanation for these numbers?
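
To make the puzzle concrete, here is a small arithmetic sketch in Python that puts both percentages on the same scale (events per 100 issued instructions). It uses only the two numbers reported above and the help-text formulas. The function name is made up, and the sketch assumes both counters are normalized against the same "instructions issued" value.

```python
# Puts the reported percentages on one scale: events per 100 issued instructions.
# Assumes both metrics use the same "instructions issued" denominator.

def per_100_issued(replayed_pct, shared_conflict_pct):
    executed = 100.0 - replayed_pct   # instructions executed per 100 issued
    replayed = replayed_pct           # (issued - executed) per 100 issued
    conflicts = shared_conflict_pct   # shared conflict events per 100 issued
    return executed, replayed, conflicts


executed, replayed, conflicts = per_100_issued(48.5, 152.0)
print(executed, replayed, conflicts)   # 51.5 48.5 152.0

# More conflict events than replayed issues: the two cannot map one-to-one.
print(conflicts > replayed)            # True
```

So out of every 100 issued instructions, 48.5 are replays, yet the conflict counter reports 152 events. Under the stated assumption, one conflict event cannot correspond to exactly one replayed issue, and since the denominator is all issued instructions rather than only shared memory instructions, the figure does not directly read as "1.52 executions per shared access" either.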

Could anyone clarify what affects the number of replays?
Like Sarnath, I guess predication would. But it doesn’t seem to be the only thing; there must be something else. What else? I guess it is something called instruction reissue, which happens when there is a cache miss. My guess is that when a cache miss occurs, the warp is switched out, and later on that warp reissues the memory instruction. This might be the global memory replay as well. So fully coalesced reads don’t guarantee zero global memory replay.
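
As a rough illustration of why coalescing alone would not zero out that metric, here is a short Python sketch based purely on the help-text formula (100 * l1 global load miss / instructions issued). The function names are invented for this example, and the "every load misses L1 once" case is a hypothesis for a streaming kernel, not something the profiler documentation states.

```python
# Illustration based on the help-text formula only; function names are invented.

def issued_per_l1_miss(global_replay_pct):
    """How many issued instructions there are per L1 global load miss."""
    return 100.0 / global_replay_pct


def global_replay_if_every_load_misses(load_insts, inst_issued):
    """Metric value under the hypothesis that each warp-level global load
    counts as exactly one L1 miss (data streamed once, never reused)."""
    return 100.0 * load_insts / inst_issued


# The 7.2% from the first post works out to one L1 miss every ~13.9 issued instructions.
print(round(issued_per_l1_miss(7.2), 1))               # 13.9

# Hypothetical mix: 72 loads among 1000 issued instructions, all missing L1.
print(global_replay_if_every_load_misses(72, 1000))    # about 7.2
```

Under that hypothesis the metric simply tracks the share of global loads in the instruction mix, however well the loads are coalesced.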

Additionally, if the shared bank conflict count is calculated as Gert-Jan described, it can easily exceed 100%, since the conflict count can be larger than the issued instruction count.

Can someone correct me if I am wrong?

My guess is that “Replayed Instructions” ratio indicates how often instructions are reissued due to bank conflicts when accessing registers.

I don’t think instructions are reissued on a cache miss… If they were, the replay ratio would be very high, given the tiny L1 cache that we have…