I have a couple of questions about the 1.1 profiler and optimizing…
I know that coalescing global reads is very important. How important is it to make sure that writes are coalesced? Does it help that they are “fire and forget”? I’m particularly considering the case where writes all occur only at the end of each thread block.
In the 1.1 profiler, I don’t fully understand what warp_serialize tells us. Is it the number of instructions that have to run in series on a single multiprocessor, rather than across several? How does this relate to optimizing?
Thank you for your input.
Also, I think I may have found a bug in the profiler. When all of
branch
divergent_branch
warp_serialize
are enabled in the config file, warp_serialize always shows [ 0 ], even though it has a nonzero value in every other config combination.
Coalescing stores is important. I’m not sure how fire-and-forget affects it (I will ask), but if you look at the “transpose” SDK sample, it gets most of its speedup from making the stores coalesced (the loads in the “naive” version are already coalesced; only the stores are not).
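To make the point concrete, here is a minimal sketch of the idea behind the transpose sample (this is my own simplified version, not the SDK code; the names and the 16x16 tile size are assumptions):

```cuda
#define TILE 16

// Naive transpose: loads are coalesced (consecutive threads read
// consecutive columns of a row), but the stores stride through memory.
__global__ void transposeNaive(float *out, const float *in, int w, int h)
{
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < w && y < h)
        out[x * h + y] = in[y * w + x];   // non-coalesced store
}

// Coalesced version: stage a tile in shared memory, then write it out
// transposed so that the stores are also contiguous per half-warp.
__global__ void transposeCoalesced(float *out, const float *in, int w, int h)
{
    __shared__ float tile[TILE][TILE + 1]; // +1 pad avoids smem bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < w && y < h)
        tile[threadIdx.y][threadIdx.x] = in[y * w + x]; // coalesced load

    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;   // origin of the transposed block
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < h && y < w)
        out[y * h + x] = tile[threadIdx.x][threadIdx.y]; // coalesced store
}
```

Both kernels do the same work; only the store pattern differs, which is why the sample is a good isolated measurement of store coalescing.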
warp_serialize tells you how many warps had to serialize based on the addresses they accessed. This can indicate shared memory bank conflicts or accesses to multiple constant memory banks, so it is useful for detecting performance problems related to shared and constant memory.
I can’t add anything to the discussion on warp_serialize, but I can tell you that you’ll never get close to theoretical bandwidth limits reading float4s coalesced. Use a 1D texture fetch bound to global memory instead. See this post: http://forums.nvidia.com/index.php?showtop…ndpost&p=290441
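For anyone unfamiliar with binding a texture to linear global memory, here is a rough sketch using the texture reference API (the API current in this CUDA 1.x timeframe; names are mine and error checking is omitted):

```cuda
// Texture reference bound to linear device memory, fetched by integer index.
texture<float4, 1, cudaReadModeElementType> texIn;

__global__ void readViaTex(float4 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(texIn, i);  // read goes through the texture path
}

// Host side (illustrative):
//   float4 *d_in;
//   cudaMalloc((void **)&d_in, n * sizeof(float4));
//   cudaBindTexture(0, texIn, d_in, n * sizeof(float4));
//   readViaTex<<<blocks, threads>>>(d_out, n);
//   cudaUnbindTexture(texIn);
```

Note that tex1Dfetch only works on linear memory bound this way, not on cudaArrays, which is what makes it a drop-in replacement for plain global loads.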
In regards to warp serialization, are you using shared memory? I can see how you would get bank conflicts (and therefore warp serialization) if you’re reading float4s into a smem array - four float4s need 16 32-bit words, so a fifth thread accessing a float4 in smem would cause a bank conflict.
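To illustrate the arithmetic above: on compute 1.x hardware shared memory has 16 banks of 32-bit words, so bank = (word index) mod 16, and a float4 spans 4 consecutive words. A rough sketch of the conflicting layout and one common workaround (a structure-of-arrays split; kernel names and the block size are my own):

```cuda
#define THREADS 64

// thread 0 -> words  0..3  -> banks 0..3
// thread 1 -> words  4..7  -> banks 4..7
// ...
// thread 4 -> words 16..19 -> banks 0..3  (4-way conflict with thread 0)
__global__ void conflicting(float4 *out, const float4 *in)
{
    __shared__ float4 smem[THREADS];       // float4 reads serialize 4-way
    int t = threadIdx.x;
    smem[t] = in[t];
    __syncthreads();
    out[t] = smem[(t + 1) % THREADS];
}

// Structure-of-arrays layout: each of the four component reads has
// consecutive threads hitting consecutive banks, so no conflicts.
__global__ void conflictFree(float4 *out, const float4 *in)
{
    __shared__ float sx[THREADS], sy[THREADS], sz[THREADS], sw[THREADS];
    int t = threadIdx.x;
    float4 v = in[t];
    sx[t] = v.x; sy[t] = v.y; sz[t] = v.z; sw[t] = v.w;
    __syncthreads();
    int s = (t + 1) % THREADS;
    out[t] = make_float4(sx[s], sy[s], sz[s], sw[s]);
}
```

The warp_serialize counter should drop correspondingly when the second layout is used, since each 32-bit access lands in a distinct bank.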
Mr Anderson: I have a 2D texture version of the code (in hopes that the texture cache would aid performance due to locality of reference) but abandoned it because of the extra memory copy it required afterward. Nice analysis in that thread - thank you for the red pill. I’ll swap to 1D tex fetches today.
Paulius: PDC (parallel data cache) == shared memory. I expected the block(5,1,1) case to cause a single warp stall, not 4. I expected block(9,1,1) to cause more warp stalls than block(5…8,1,1). I really didn’t expect a 3x jump in warp stalls going from block(16,1,1) to block(17,1,1). It’s not the fact that stalls occurred, but rather how many the profiler reports for different cases that I do not understand.