I have a little application where different threads will contribute to a final volume. Each thread adds its fraction to a voxel.
I am trying to build a 128x128x128 volume, and my data type is float (i.e. 4 bytes). The way I am doing the reduction right now is by folding the shared memory onto itself:
I divided the volume into 128x128 strips, where each strip is 128 floats.
// each thread computes its contribution to a strip of voxels
.
__syncthreads();
for (unsigned int s = blockDim.x/2; s > 0; s = s >> 1)
{
    if (threadIdx.x < s)
    {
        for (int index = 0; index < 128; index++)
        {
            theSharedMemory[(threadIdx.x*128) + index] += theSharedMemory[((threadIdx.x+s)*128) + index];
        }
    }
    __syncthreads();
}
// update the global memory
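For anyone who wants to try the fragment above, here is a minimal sketch of how it might be assembled into a complete kernel. It is not the original application: `contribution`, `foldStrip`, and `runFoldStrip` are hypothetical names, the real per-thread contribution is replaced by a constant 1.0f so the result is easy to check, and it assumes a GPU that allows a full 16KB static shared array per block.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int STRIP_LEN = 128;  // floats per strip, as in the post
constexpr int N_THREADS = 32;   // rows of 128 floats that fit in 16KB

// Hypothetical stand-in for the real contribution of one thread to one voxel.
__device__ float contribution(int thread, int index)
{
    return 1.0f;  // assumption: constant, so every folded sum equals N_THREADS
}

__global__ void foldStrip(float* volume)
{
    __shared__ float theSharedMemory[N_THREADS * STRIP_LEN];

    // each thread computes its contribution to a strip of voxels
    for (int index = 0; index < STRIP_LEN; index++)
        theSharedMemory[threadIdx.x * STRIP_LEN + index] = contribution(threadIdx.x, index);
    __syncthreads();

    // fold the upper half of the rows onto the lower half until one row remains
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            for (int index = 0; index < STRIP_LEN; index++)
                theSharedMemory[threadIdx.x * STRIP_LEN + index] +=
                    theSharedMemory[(threadIdx.x + s) * STRIP_LEN + index];
        }
        __syncthreads();
    }

    // update the global memory: row 0 now holds the summed strip
    for (int index = threadIdx.x; index < STRIP_LEN; index += blockDim.x)
        volume[blockIdx.x * STRIP_LEN + index] = theSharedMemory[index];
}

// Host helper: runs one block per strip and returns the resulting strips.
std::vector<float> runFoldStrip(int numStrips)
{
    std::vector<float> host(numStrips * STRIP_LEN, 0.0f);
    float* dev = nullptr;
    cudaMalloc(&dev, host.size() * sizeof(float));
    foldStrip<<<numStrips, N_THREADS>>>(dev);
    cudaMemcpy(host.data(), dev, host.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}

int main()
{
    std::vector<float> strips = runFoldStrip(4);
    assert(strips[0] == 32.0f);  // 32 threads each contributed 1.0f
    printf("first voxel = %.1f\n", strips[0]);
    return 0;
}
```

The block size must be a power of two for the halving loop to cover every row.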
Since the shared memory is 16KB and each thread needs its own 128-float strip in shared memory, the number of threads I can run in parallel is:
N = 16KB/(128*sizeof(float)) = 32
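The arithmetic behind N can be checked with a few lines of host code. This is only a sketch; `maxRowsInShared` and the constant names are made up for illustration.

```cuda
#include <cstddef>
#include <cstdio>

constexpr std::size_t SHARED_BYTES = 16 * 1024;  // 16KB of shared memory
constexpr std::size_t STRIP_LEN    = 128;        // floats per strip

// Hypothetical helper: how many strips (one per thread) fit in shared memory.
constexpr std::size_t maxRowsInShared(std::size_t sharedBytes, std::size_t stripLen)
{
    return sharedBytes / (stripLen * sizeof(float));
}

// One strip is 128 * 4 = 512 bytes, so 16384 / 512 = 32 strips fit.
static_assert(maxRowsInShared(SHARED_BYTES, STRIP_LEN) == 32, "expected 32 strips");

int main()
{
    printf("N = %zu\n", maxRowsInShared(SHARED_BYTES, STRIP_LEN));  // prints "N = 32"
    return 0;
}
```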
My question is:
Is this how reduction is done, or is there a better way I am not aware of?
What is the size of your shared memory array? Is it __shared__ float theSharedMemory[128][128]?
What is your execution configuration?
Are you trying to implement a 2-D reduction with this approach?
For a 2-D reduction, I think you need to consider coalescing of global memory accesses,
i.e. use a warp to deal with a row, which is a 1-D reduction.
You can read the 1-D reduction document in SDK/reduction/doc/reduction.pdf
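To make the "one warp per row" idea concrete, here is a rough sketch of a 1-D row reduction. It is not taken from the SDK document; `reduceRows` and `runReduceRows` are hypothetical names, and it assumes the rows are stored contiguously in global memory and that the block size is 32 (one warp, a power of two).

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int WARP = 32;  // threads per block in this sketch

// One block (a single warp) sums one row of rowLen floats.
__global__ void reduceRows(const float* in, float* out, int rowLen)
{
    __shared__ float partial[WARP];
    const float* row = in + blockIdx.x * rowLen;

    // Neighbouring threads read neighbouring addresses, so loads coalesce.
    float sum = 0.0f;
    for (int i = threadIdx.x; i < rowLen; i += blockDim.x)
        sum += row[i];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction with sequential addressing in shared memory.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = partial[0];
}

// Host helper: fills numRows rows with 1.0f and returns one sum per row.
std::vector<float> runReduceRows(int numRows, int rowLen)
{
    std::vector<float> hostIn(numRows * rowLen, 1.0f);
    std::vector<float> hostOut(numRows, 0.0f);
    float *devIn = nullptr, *devOut = nullptr;
    cudaMalloc(&devIn, hostIn.size() * sizeof(float));
    cudaMalloc(&devOut, hostOut.size() * sizeof(float));
    cudaMemcpy(devIn, hostIn.data(), hostIn.size() * sizeof(float), cudaMemcpyHostToDevice);
    reduceRows<<<numRows, WARP>>>(devIn, devOut, rowLen);
    cudaMemcpy(hostOut.data(), devOut, hostOut.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(devIn);
    cudaFree(devOut);
    return hostOut;
}

int main()
{
    std::vector<float> sums = runReduceRows(4, 128);
    assert(sums[0] == 128.0f);  // 128 ones per row
    printf("row 0 sum = %.1f\n", sums[0]);
    return 0;
}
```

Each thread only ever touches its own shared memory slot and one partner slot, so the shared array is 32 floats instead of 32 full strips.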
I can NOT declare __shared__ float theSharedMemory[128][128] because it needs 64KB and shared memory is only 16KB.
Therefore I construct the image strip by strip; each strip is: __shared__ float theSharedMemory[128]
This strip is made of contributions from 512 different projections, and I want each thread to handle one projection and add its portion to the final voxel.
I could not find the document you mentioned (i.e. SDK/reduction/doc/reduction.pdf) under the SDK installation:
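One way to read "each thread handles one projection and adds its portion to the final voxel" is with atomic adds into a single 128-float strip. The sketch below swaps in that technique instead of the folding loop; it is only an illustration, `projectionContribution`, `accumulateStrip`, and `runAccumulateStrip` are hypothetical names, and float `atomicAdd` needs compute capability 2.0 or higher, so it would not run on the 16KB-shared-memory hardware discussed here.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int STRIP_LEN = 128;  // voxels per strip
constexpr int N_PROJ    = 512;  // projections, one thread each

// Hypothetical stand-in for what one projection adds to one voxel.
__device__ float projectionContribution(int projection, int voxel)
{
    return 1.0f;  // assumption: constant, so every voxel ends up at 512
}

__global__ void accumulateStrip(float* volume)
{
    __shared__ float strip[STRIP_LEN];

    // Clear the strip cooperatively.
    for (int i = threadIdx.x; i < STRIP_LEN; i += blockDim.x)
        strip[i] = 0.0f;
    __syncthreads();

    // Each thread (projection) adds its portion to every voxel of the strip.
    // Starting at a per-thread offset keeps threads from all hitting the
    // same voxel at the same moment.
    for (int i = 0; i < STRIP_LEN; i++) {
        int voxel = (i + threadIdx.x) % STRIP_LEN;
        atomicAdd(&strip[voxel], projectionContribution(threadIdx.x, voxel));
    }
    __syncthreads();

    // update the global memory
    for (int i = threadIdx.x; i < STRIP_LEN; i += blockDim.x)
        volume[blockIdx.x * STRIP_LEN + i] = strip[i];
}

// Host helper: one block per strip, 512 threads per block.
std::vector<float> runAccumulateStrip(int numStrips)
{
    std::vector<float> host(numStrips * STRIP_LEN, 0.0f);
    float* dev = nullptr;
    cudaMalloc(&dev, host.size() * sizeof(float));
    accumulateStrip<<<numStrips, N_PROJ>>>(dev);
    cudaMemcpy(host.data(), dev, host.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}

int main()
{
    std::vector<float> strips = runAccumulateStrip(2);
    assert(strips[0] == 512.0f);  // 512 projections each added 1.0f
    printf("first voxel = %.1f\n", strips[0]);
    return 0;
}
```

With real contributions the order of the atomic adds is not fixed, so the floating-point sums can differ slightly from run to run.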