Oh, you are right. I changed it earlier so that only the block containing the thread with blockIdx.x * blockDim.x + threadIdx.x == 0 does *outPoint = sdata[0];
but it's still not working, and I don't understand how block X can access threads that belong to block Y.
How do I fix that annoying problem and make this program work... :(
Edit: well, I reviewed that code several times. Each block "sorts" itself,
so eventually the smallest number of each block ends up in the first position of that block's shared memory, roughly as in the sketch below.
What remains is a "list" of shared-memory blocks, each of which holds in its first position the smallest number of its block.
How do I find the smallest number among them?
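For reference, the per-block stage I mean is roughly this (a simplified sketch, not my real code, which sorts float4 points; blockMinStage, d_in and n are placeholder names, and a power-of-two block size is assumed):

```cuda
#include <cfloat>   // FLT_MAX

// Per-block stage: after the loop, sdata[0] holds the smallest value
// seen by this block.
__global__ void blockMinStage(const float *d_in, int n)
{
    extern __shared__ float sdata[];            // one float per thread

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    // Load one element per thread; pad with +inf past the end of the input.
    sdata[tid] = (i < n) ? d_in[i] : FLT_MAX;
    __syncthreads();

    // Tree reduction in shared memory (blockDim.x assumed a power of two);
    // the real code sorts the block instead, but the end state is the same:
    // the block's minimum ends up in sdata[0].
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] = fminf(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }

    // sdata[0] now holds this block's minimum, but shared memory is private
    // to the block, so no other block can read it from here.
}
```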
It can’t. That’s one of the cornerstones of the CUDA programming model.
Run a second kernel with just one block that finds the minimum among those values. Or, if the number of blocks is fairly small, transfer the list to the CPU and find the minimum there.
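A minimal sketch of what that second kernel could look like, assuming each block's minimum has already been written to a global array (placeholder name d_blockMins, one element per block) and a power-of-two block size:

```cuda
#include <cfloat>   // FLT_MAX

// Second stage: launched as reduceBlockMins<<<1, threads, threads * sizeof(float)>>>;
// a single block reduces the per-block minima down to one value.
__global__ void reduceBlockMins(const float *d_blockMins, float *d_result, int numBlocks)
{
    extern __shared__ float sdata[];
    int tid = threadIdx.x;

    // Each thread takes the min over a strided slice of the per-block minima.
    float m = FLT_MAX;
    for (int i = tid; i < numBlocks; i += blockDim.x)
        m = fminf(m, d_blockMins[i]);
    sdata[tid] = m;
    __syncthreads();

    // Tree reduction in shared memory (blockDim.x assumed a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] = fminf(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }

    if (tid == 0)
        *d_result = sdata[0];   // global minimum of the original array
}
```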
I'm not quite sure I understood the second sentence. Let's say the original kernel does what it's supposed to do: if I have 512 blocks, then each block has
a shared-memory array whose first element holds that block's smallest number.
From this point, should I write those values back to a global-memory array of size 512, where each cell represents one block? Something like the sketch below, I suppose.
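That is, something along these lines (just a sketch, reusing the placeholder names from the posts above; 512 blocks of 512 threads assumed):

```cuda
// Inside the per-block kernel (end of the sketch in my earlier post), thread 0
// of each block writes the block's minimum into its own cell of a global array:
//
//     if (threadIdx.x == 0)
//         d_blockMins[blockIdx.x] = sdata[0];

// Host side:
const int numBlocks = 512, blockSize = 512;
float *d_blockMins;
cudaMalloc(&d_blockMins, numBlocks * sizeof(float));

// ... launch the per-block kernel (with d_blockMins added as an argument), e.g.
// blockMinStage<<<numBlocks, blockSize, blockSize * sizeof(float)>>>(d_in, d_blockMins, n); ...

// Option A: copy the 512 per-block minima to the CPU and take the minimum there.
float h_blockMins[numBlocks];
cudaMemcpy(h_blockMins, d_blockMins, numBlocks * sizeof(float), cudaMemcpyDeviceToHost);
float globalMin = h_blockMins[0];
for (int i = 1; i < numBlocks; ++i)
    if (h_blockMins[i] < globalMin)
        globalMin = h_blockMins[i];

// Option B: keep it on the GPU and launch the single-block kernel from the
// previous reply instead:
// reduceBlockMins<<<1, blockSize, blockSize * sizeof(float)>>>(d_blockMins, d_result, numBlocks);
```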
The amount of shared memory in your launch config is 512 bytes. Inside the kernel you mapped that as an array of float4, so those 512 bytes give an sdata array of only 32 elements. You are using your thread index to access that array, so it seems to me you are overrunning the array bounds, since your thread index should go from 0 to 511, right? Check for an error code; maybe it's reporting something.
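To make that concrete, here is a minimal sketch of the mismatch (myKernel, d_points, d_out and numBlocks are placeholder names, not your actual code):

```cuda
// Placeholder kernel: it maps the dynamic shared memory as float4 and
// indexes it with the thread index, so it needs one float4 per thread.
__global__ void myKernel(const float4 *d_points, float4 *d_out)
{
    extern __shared__ float4 sdata[];   // size comes from the launch config
    sdata[threadIdx.x] = d_points[blockIdx.x * blockDim.x + threadIdx.x];
    // ... per-block work ...
}

const int numBlocks = 512;

// 512 bytes of dynamic shared memory = 512 / sizeof(float4) = 32 elements,
// so threads 32..511 write out of bounds:
myKernel<<<numBlocks, 512, 512>>>(d_points, d_out);

// One float4 per thread: 512 * sizeof(float4) = 8192 bytes:
myKernel<<<numBlocks, 512, 512 * sizeof(float4)>>>(d_points, d_out);
```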
krinitsa - I wrote a sample kernel that finds the maximum value in a large array along with its index. It uses the scan primitives I discuss in great depth on the rest of my site: http://www.moderngpu.com/mailbag/maxindex.html