I’ve been running a few small kernels through the CUDA Visual Profiler to test my understanding of shared memory, and I’ve run into a bit of a problem.
I’m almost certain that my kernel should be causing 16-way bank conflicts, but according to the “warp_serialize” field in the profiler, this isn’t the case.
For each thread in a warp, the bank index (threadIdx.x * 16) % 16 evaluates to 0, which I thought would force every thread to access the same shared memory bank (bank 0). If that's the case, the accesses within each half-warp should be fully serialized as a 16-way conflict.
However, if I run this kernel with a single block of 32 threads, the Visual Profiler reports a “warp_serialize” value of 120. I could almost understand a value of 128 (each of the 32 threads reads 4 bytes), but 120 has me completely baffled!
If anybody could shed light on the situation, it would be much appreciated!
EDIT: Another curious point of note is that the Visual Profiler reports that this kernel contains 5 branches, and that 1 of them is divergent. Does anybody know what’s going on?
Maybe I’m being stupid, but how would that happen in this case?
The value of threadIdx.x is different for each thread, so cache[threadIdx.x * 16] evaluates to a different shared memory location for each thread. My understanding is that this means they all access the same bank, but at different addresses within it.
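For concreteness, a minimal kernel that reproduces the access pattern I'm describing would look something like this (a sketch with hypothetical names, not my exact kernel, assuming a 16-bank compute 1.x device):

```cuda
// Sketch of the stride-16 shared memory access pattern described above.
// With 16 banks of 32-bit words, every thread of a half-warp indexes
// bank (threadIdx.x * 16) % 16 == 0, i.e. a 16-way conflict.
__global__ void strideSixteen(float *out)
{
    __shared__ float cache[32 * 16];  // room for 32 threads at stride 16

    cache[threadIdx.x * 16] = (float)threadIdx.x;  // all writes hit bank 0
    __syncthreads();

    out[threadIdx.x] = cache[threadIdx.x * 16];    // all reads hit bank 0
}
```

Launched as strideSixteen<<<1, 32>>>(d_out), this is the single-block, 32-thread configuration I profiled.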