Optimizing and the 1.1 profiler

I have a couple of questions about the 1.1 profiler and optimizing…

  1. I know that coalescing global reads is very important. How important is it to make sure that writes are coalesced? Does it help that they are “fire and forget”? I’m particularly considering the case where writes all occur only at the end of each thread block.

  2. In the 1.1 profiler, I don’t fully understand what warp_serialize tells us. Is it the number of instructions that have to run in series on a single multiprocessor, rather than across several? How does this relate to optimizing?

Thank you for your input.

Also, I think I may have found a bug in the profiler. When all of


in the config file, all of the warp_serialize show [ 0 ], even when they have a value in every other config case.

Just warp_serialize:

method=[ _Z11unpackGreenPfPji ] gputime=[ 524.448 ] cputime=[ 572.903 ] occupancy=[ 0.667 ] warp_serialize=[ 0 ]
method=[ _Z18processTile_kernelPfS_S_S_ii ] gputime=[ 438.144 ] cputime=[ 493.007 ] occupancy=[ 0.667 ] warp_serialize=[ 16644 ]
method=[ Z23findCandidatesNO_kernelP6float3PfS1_S1_S1 ] gputime=[ 11.488 ] cputime=[ 57.702 ] occupancy=[ 0.333 ] warp_serialize=[ 0 ]
method=[ _Z21localSearchAll_kernelP6float3S0_ii ] gputime=[ 441.856 ] cputime=[ 486.793 ] occupancy=[ 0.667 ] warp_serialize=[ 14489 ]

With all three from above:

method=[ _Z11unpackGreenPfPji ] gputime=[ 524.640 ] cputime=[ 574.054 ] occupancy=[ 0.667 ] branch=[ 6212 ] divergent_branch=[ 0 ] warp_serialize=[ 0 ]
method=[ _Z18processTile_kernelPfS_S_S_ii ] gputime=[ 441.024 ] cputime=[ 496.740 ] occupancy=[ 0.667 ] branch=[ 22847 ] divergent_branch=[ 402 ] warp_serialize=[ 0 ]
method=[ Z23findCandidatesNO_kernelP6float3PfS1_S1_S1 ] gputime=[ 11.680 ] cputime=[ 57.431 ] occupancy=[ 0.333 ] branch=[ 104 ] divergent_branch=[ 20 ] warp_serialize=[ 0 ]
method=[ _Z21localSearchAll_kernelP6float3S0_ii ] gputime=[ 440.288 ] cputime=[ 485.147 ] occupancy=[ 0.667 ] branch=[ 24832 ] divergent_branch=[ 400 ] warp_serialize=[ 0 ]

Thanks for the bug report, we’ll check it out.

Coalescing stores is important. I’m not sure about how fire and forget affects it (I will ask), but if you look at the “transpose” SDK sample, it gets faster mostly from optimizing stores to be coalesced (the loads in the “naive” version are already coalesced, only the stores are non-coalesced).

warp_serialize tells you how many warps had to serialize based on addresses This can mean shared memory bank conflicts or accessing multiple constant memory banks. This is useful for detecting performance problems related to shared and constant memory.

I’ve reproduced the bug here and filed it with our software team.



I’m trying to maximize float4 reads into PDC and can’t get anywhere close to peak theoretical an 8800. While digging into this, I ran this test:

Each thread reads a float4 into a float4-aligned PDC address (float4*)Addr+threadIdx.x

All global reads are coherent

With these parameters:



→ warp_serialize = 0

This is as expected (4 float4 = 16 PDC banks)

However, with all other parameters the same,

Block(5,1,1) → warp_serialize = 4

Block(6,1,1) → warp_serialize = 4

Block(7,1,1) → warp_serialize = 4

Block(8,1,1) → warp_serialize = 4

Block(16,1,1) → warp_serialize = 4

At half-warp size, each PDC bank should be hit 4x

Block(17,1,1) → warp_serialize = 12

Why the 3x jump?

I can’t add anything to the discussion on warp_serialize, but I can tell you that you’ll never get close to theoretical limits reading float4’s coalesced. Use a 1D texture fetch bound to global memory instead. See this post: http://forums.nvidia.com/index.php?showtop…ndpost&p=290441

In regards to warp serialization, are you using shared memory? I can see how you would get bank conflicts (and therefore warp serialization) if you’re reading float4s into a smem array - four float4s need 16 32-bit words, so a fifth thread accessing a float4 in smem would cause a bank conflict.


Mr Anderson: I have a 2D texture version of the code (in hopes that the texture cache would aid performance due to locality of reference) but abandoned it because of the required post memory copy. Nice analysis in that thread - thank you for the red pill. I’ll swap to 1D tex fetches today.

Paulius: PDC (parallel data cache) == shared memory. I expected the block(5,1,1) case to cause a single warp stall, not 4. I expected block(9,1,1) to cause more warp stalls than block(5…8,1,1). I really didn’t expect a 3x jump in warp stalls going from block(16,1,1) to block(17,1,1). It’s not the fact that stalls occurred, but rather how many the profiler reports for different cases that I do not understand.