Bank Conflicts and Serialized Warps

I’ve been running a few small kernels through the CUDA Visual Profiler to test my understanding of shared memory, and I’ve run into a bit of a problem.

I’m almost certain that my kernel should be causing 16-way bank conflicts, but according to the “warp_serialize” field in the profiler, this isn’t the case.

The kernel is listed below:

[codebox]__global__ void bank_conflicts(float* array) {
    extern __shared__ float cache[];

    // Every thread reads a word index that is a multiple of 16,
    // which I expected to map every access to the same bank.
    float tmp = cache[threadIdx.x * 16];

    array[threadIdx.x] = tmp;
}[/codebox]

For each thread in a half-warp, (threadIdx.x * 16) % 16 evaluates to 0, which I thought would force all 16 threads to access the same shared memory bank. If that's the case, the accesses should be fully serialized, and I'd expect that to show up in warp_serialize.
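Just to spell out the arithmetic, here is a tiny host-side sketch (nothing to do with the kernel itself; it just prints the mapping, assuming the 16 banks and 4-byte words of a compute 1.x device) of which bank each thread in a half-warp should hit:

[codebox]// Host-side illustration only: bank index for cache[threadIdx.x * 16],
// assuming 16 banks of 4-byte words (compute capability 1.x).
#include <stdio.h>

int main(void) {
    const int NUM_BANKS = 16;
    for (int tid = 0; tid < 16; ++tid) {      // one half-warp
        int wordIndex = tid * 16;             // index into cache[]
        int bank = wordIndex % NUM_BANKS;     // bank that 4-byte word lives in
        printf("thread %2d -> word %3d -> bank %d\n", tid, wordIndex, bank);
    }
    return 0;                                 // every line prints bank 0
}[/codebox]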

However, if I run this kernel with a single block of 32 threads, the Visual Profiler reports that “warp_serialize” is equal to 120. I could almost understand a value of 128 (each of the 32 threads reads 4 bytes), but 120 has me completely baffled!

If anybody could shed light on the situation, it would be much appreciated!

EDIT: Another curious point of note is that the Visual Profiler reports that this kernel contains 5 branches, and that 1 of them is divergent. Does anybody know what’s going on?

If all threads in a half-warp access a single address, shared mem goes into broadcast mode and delivers all data in a single cycle.
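For example, an access pattern along these lines (the kernel name is made up for illustration) reads one word and is serviced as a single broadcast rather than a conflict:

[codebox]// Illustration only: every thread reads the same shared memory word,
// so the hardware broadcasts it in one cycle instead of serializing.
__global__ void broadcast_read(float* array) {
    extern __shared__ float cache[];

    float tmp = cache[0];          // one address for the whole half-warp

    array[threadIdx.x] = tmp;
}[/codebox]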

Maybe I’m being stupid, but how would that happen in this case?

The value of threadIdx.x is different for each thread, so cache[threadIdx.x * 16] will evaluate to a different memory location on each thread. My understanding is that this would result in them accessing the same bank, but with different addresses.

Hm, that was a brainfart on my part. Disregard.

Perhaps they don’t count the first thread when reporting on warp_serialize? Otherwise it would be at least 1 for any program.

I thought the traditional trick was to allocate a shared array of n*17 to force offset access; see the sketch below.
[url="http://people.maths.ox.ac.uk/~gilesm/hpc/NVIDIA/CUDA_Optimization_Harris.pdf"]http://people.maths.ox.ac.uk/~gilesm/hpc/NVIDIA/CUDA_Optimization_Harris.pdf[/url]
See p. 41.
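Roughly, the padding idea looks like this (tile size and kernel name are just for illustration, and it assumes a 16x16 thread block and 16 banks of 4-byte words):

[codebox]#define TILE 16

__global__ void padded_column_read(float* out) {
    // TILE+1 columns instead of TILE: consecutive rows of the same column
    // now fall into different banks, so a column read is conflict-free.
    __shared__ float tile[TILE][TILE + 1];

    tile[threadIdx.y][threadIdx.x] = threadIdx.x;   // fill the tile
    __syncthreads();

    // Read a column: without the +1 padding this would be a 16-way conflict.
    out[threadIdx.y * TILE + threadIdx.x] = tile[threadIdx.x][threadIdx.y];
}[/codebox]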

EDIT: I read your post too fast. But if you allocated cache with a size n where n % 16 != 0, that would explain why you don't see bank conflicts.

I can see why that would happen, but the amount of shared memory I’m allocating is defined as threads * 16, so size % 16 will always be 0.

(Thanks, by the way, for the tip about padding shared memory; I’m sure that will come in useful later.)
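In case it matters, the launch looks roughly like this (the host-side names here are just placeholders), with the dynamic shared memory size passed as the third launch parameter:

[codebox]// Rough sketch of the launch: the third launch parameter supplies the
// dynamically allocated shared memory backing "extern __shared__ float cache[]".
const int threads = 32;
float* d_array;
cudaMalloc((void**)&d_array, threads * sizeof(float));

bank_conflicts<<<1, threads, threads * 16 * sizeof(float)>>>(d_array);
cudaThreadSynchronize();   // wait for the kernel before profiling / reading back[/codebox]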