NVIDIA SDK reduction invalid shared memory read?

I have a problem with NVIDIA SDK reduction example. In the step from kernel reduce3 to reduce4, the last warp is unrolled.

I copy here the loop of kernel “reduce3” from the SDK, sdata is an array in the shared memory:

for(unsigned int s=blockDim.x/2; s>0; s>>=1) 


	if (tid < s) 


		sdata[tid] += sdata[tid + s];




This loop is unrolled with help of templates in kernel “reduce4” as

if (tid < 32)


	if (blockSize >=  64) { sdata[tid] += sdata[tid + 32]; EMUSYNC; }

	if (blockSize >=  32) { sdata[tid] += sdata[tid + 16]; EMUSYNC; }

	if (blockSize >=  16) { sdata[tid] += sdata[tid +  8]; EMUSYNC; }

	if (blockSize >=   8) { sdata[tid] += sdata[tid +  4]; EMUSYNC; }

	if (blockSize >=   4) { sdata[tid] += sdata[tid +  2]; EMUSYNC; }

	if (blockSize >=   2) { sdata[tid] += sdata[tid +  1]; EMUSYNC; }


Let us consider now, the calling of reduce4 kernel with blockSize=32, which is just the number of threads in a block. The kernel is called with allocating ‘blockSize’ size dynamical shared memory. Then there will be an invalid memory access if (tid>=16): the sdata[tid+16] is not allocated before.

I have tried to check for the invalid memory access with valgrind in deviceemu mode, but it did not complain.

Why is there no invalid read error? Can somebody clarify me this issue, please?

you are right, but such invalid access does not affect the result.

I think that check of invalid access may work for global memory.

however even so, the code would not be good on latest architecture, fermi

since fermi merge all memory types into a 64-bit address space.

on the other hand, the example is to demonstrate “synch inside a warp is not necessary”,

in practice you can choose blocksize > warp size (warp size is 32 also in Fermi).