Unexplicable banks conflicts. Visual profiler and warp serialize.


I have a very simple kernel and when i run it, according to the Visual Profiler, there are some bank conflicts.

I don’t understand where these bank conflicts are coming from.

The Visual Profiler tells me that there is 412 warp serialize when i run these kernel with 32 threads per block and 1 block(so only 1 warp). How is is possible?

__global__ void Cuda_Test(biguint_t *Aarg, biguint_t *Barg)


        __shared__ unsigned int r;


        unsigned int a=r[0];

        unsigned int b=Ncst[threadIdx.x]; //In constant memory

        unsigned int l=a*b;//r[threadIdx.y][0]*Ncst[threadIdx.x];



According to what i understand they should be a broadcast in the third line, and no bank conflict in the next (each thread in a half warp wants a 32bit piece of memory that should be in different bank). So where the warp serialize come from?



Try commenting out lines until it goes away. I would suspect your read from constant memory. I don’t use the profiler much so I don’t know if it counts them, but constant memory can have bank conflicts.

Yes, nonuniform accesses to constant memory get serialized, at least on 1.x devices. So using [font=“Courier New”]threadIdx.x[/font] as index in constant memory is the worst possible case. You can use a texture instead if you want it cached on 1.x devices, otherwise just use global memory.

Constant memory should really be renamed ‘Constant and Uniform Memory’. The right way to think about it is that it only applies to values that you would normally stick into the immediate field of an instruction. It is really only there to avoid recompiling kernels that need to change the some immediate values between successive executions of the same kernel.

Yes. Although thankfully it is going away anyway.

Do you know whether 2.x devices still have constant memory? The Programming Guide claims it’s still there, but I’ve yet to see the compiler actually use it - whenever I disassemble a 2.x cubin, the compiler has actually just placed everything declared constant in global memory.

It is used mainly by graphics applications and I think that all GPU vendors include something similar to it, so I would assume that it is still in Fermi. Of course the compiler can choose not to use it, and without a way to program the driver/ISA explicitly, you may not be able to access it. You can declare it in PTX, although I don’t know if the lower layers of the compiler will convert it to global memory.

Oops, I lied. While I just tried to find an example I seem to remember where ptxas had moved data from constant to global memory I’ve come across an old sm_20 disassembly where constant memory is indeed used.

Thanks, Gregory!


Thanks the problem looks like it comes from the access to the constant memory.
What do you mean by non uniform access?

I have a kernel that do a lot of access to Ncst, it would slow down my kernel if i switch from constant to global memory, no?


With non-uniform access I mean that different threads of a warp (half-warp for 1.x devices) read different addresses. Your case is an extreme example for non-uniform access because using threadIdx.x as an index each thread will read different addresses, requiring 16 sequential reads instead of just one read whose result is broadcast to 16 threads.

What compute capability is your device? On 2.x devices you can just move the data to global memory, since that is cached as well and does not require uniform access. On 1.x you can use a texture in linear memory for the same effect.

Thanks for the explanation. I also find the section in the CUDA documentation that explain this issue (it is very well hidden in section
I try my kernel on device of compute capability 1.x, I will look into texture in linear memory.