The CUDA profiler reports a high warp serialize rate for my kernel, but I don’t know why… I only access shared memory with the following patterns, which should not cause any bank conflicts.
const unsigned int BLOCK_SIZE = 256;
const unsigned int TYPES = 2;
const unsigned int btid = threadIdx.y * blockDim.x + threadIdx.x;

__shared__ float s_vSum[BLOCK_SIZE];

// access pattern 1
s_vSum[btid] += someValue;

// access pattern 2 (create partial sums but avoid warp divergence)
for (unsigned int maxId = (BLOCK_SIZE >> 1); maxId >= 32; maxId >>= 1) {
    __syncthreads();
    if (btid < maxId) {
        s_vSum[btid] += s_vSum[btid + maxId];
    }
}

// and the access patterns to the second shared memory array
__shared__ float s_v[BLOCK_SIZE * TYPES];

// access pattern 3
for (unsigned int ntypeIdx = 0; ntypeIdx < TYPES; ++ntypeIdx) {
    s_v[ntypeIdx * BLOCK_SIZE + btid] = 0.0F;
}

// access pattern 4 (ntype is either 0 or 1)
s_v[ntype * BLOCK_SIZE + btid] += someValue;

// access pattern 5 (d_v is in device memory)
d_v[...] = someValue + s_v[btid] + s_v[BLOCK_SIZE + btid];
My GPU is a GTX295.
Any ideas why the profiler is reporting a high warp serialize rate?
Warp serialization might not come from shared memory access at all. If I am not mistaken, warp-divergent branches may increase that counter too.
I am also not sure how a block of dimension 16x16 is mapped onto warps. I remember that with some old CUDA compiler & driver, different values of threadIdx.y ended up in different warps, so in that case all your warps would be only half utilised. However, this may have been corrected/changed since then. Just check it out!
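One quick way to check the mapping on paper: assuming the row-major linearisation from the programming guide (linear id = threadIdx.y * blockDim.x + threadIdx.x, warps = consecutive groups of 32 of those ids), a small host-side sketch like this shows which warp each thread of a 16x16 block falls into. This is just my own toy code, not anything from your kernel:

// Host-side sketch: print the warp index of every thread in a 16x16 block,
// assuming the row-major linearisation from the CUDA programming guide
// (linear id = threadIdx.y * blockDim.x + threadIdx.x, warp = linear id / 32).
#include <cstdio>

int main()
{
    const unsigned int DIM_X = 16, DIM_Y = 16, WARP_SIZE = 32;
    for (unsigned int y = 0; y < DIM_Y; ++y)
        for (unsigned int x = 0; x < DIM_X; ++x) {
            const unsigned int linear = y * DIM_X + x;
            printf("(x%2u,y%2u) -> thread %3u, warp %u\n", x, y, linear, linear / WARP_SIZE);
        }
    return 0;
}

If that assumption holds, each warp is two complete rows of 16 threads, so none of them would be half empty.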
My code should diverge only in very rare cases, and the profiler confirms that: it reports divergent branches separately, and there are only very few of them (4 divergent branches versus over 100,000 warp serializes reported for a single kernel call).
When I run the code in emulation mode and print the thread IDs, they increase in row-major order as expected (i.e. (x0,y0), …, (x15,y0), (x0,y1), …, (x15,y1), …).
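The check was roughly this (a simplified sketch, not my exact kernel; in device emulation mode, i.e. nvcc -deviceemu, everything runs on the host, so the printf works even though the card is only compute 1.x):

#include <cstdio>

// Sketch of the emulation-mode check: print the linear thread ID for every
// thread of a 16x16 block so the ordering can be inspected.
__global__ void dumpThreadIds()
{
    const unsigned int btid = threadIdx.y * blockDim.x + threadIdx.x;
    printf("(x%u,y%u) -> btid %u\n", threadIdx.x, threadIdx.y, btid);
}

int main()
{
    dim3 block(16, 16);
    dumpThreadIds<<<1, block>>>();
    cudaThreadSynchronize(); // API of that CUDA era; cudaDeviceSynchronize in later toolkits
    return 0;
}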
I think there are 16 shared-memory bank conflicts per warp. Take a look: your thread blocks are 16x16, all devices with compute capability 1.x have 32 threads per warp and 16 shared-memory banks, your warps are formed by 2 rows of 16 threads, and your shared data is laid out in strides of 16, so data[0], data[16], data[32], … fall into the same bank. Your access pattern (btid = threadIdx.y * blockDim.x + threadIdx.x) forces threads (x0,y0) and (x0,y1), which belong to the same warp, to access data[0] and data[16]: same warp, same bank. So you get 16 conflicts per warp for each pattern.
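You can see the arithmetic with a small host-side sketch (my own toy code, assuming 16 banks and 4-byte floats as on compute 1.x): it counts how many of the 32 threads of warp 0 hit each bank when they read s_vSum[btid].

#include <cstdio>

int main()
{
    const unsigned int NUM_BANKS = 16, WARP_SIZE = 32, BLOCK_DIM_X = 16;
    unsigned int hits[NUM_BANKS] = {0};

    // Warp 0 of a 16x16 block = threads (x0..x15, y0) and (x0..x15, y1).
    for (unsigned int t = 0; t < WARP_SIZE; ++t) {
        const unsigned int x    = t % BLOCK_DIM_X;        // threadIdx.x
        const unsigned int y    = t / BLOCK_DIM_X;        // threadIdx.y (0 or 1)
        const unsigned int btid = y * BLOCK_DIM_X + x;    // index into s_vSum
        const unsigned int bank = btid % NUM_BANKS;       // 4-byte float: bank = word index mod 16
        ++hits[bank];
    }
    for (unsigned int b = 0; b < NUM_BANKS; ++b)
        printf("bank %2u is hit by %u threads of warp 0\n", b, hits[b]);
    return 0;
}

It prints 2 threads for every bank: the thread from row y0 and the thread from row y1 land on the same bank, which is exactly the pattern I described above.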