Reduction done in shared memory

Hello

I have a small application where different threads contribute to a final volume; each thread adds its fraction to a voxel.
I am trying to build a 128x128x128 volume, and my data type is float (i.e. 4 bytes). The way I am doing the reduction right now is by folding the shared memory onto itself:

I divided the volume into 128x128 strips, where each strip is 128 floats:

// each thread computes its contribution to a strip of voxels
.

__syncthreads();

for (unsigned int s = blockDim.x/2; s > 0; s = s >> 1)
{
	if (threadIdx.x < s)
	{
		for(int index = 0; index < 128; index++)
		{
			theSharedMemory[(threadIdx.x*128) + index] += theSharedMemory[((threadIdx.x+s)*128) + index];
		}
	}
	__syncthreads();
}

// update the global memory

Since the shared memory is 16KB and each thread needs its own 128-float strip, the number of threads I can run in parallel is:

N = 16KB/(128*sizeof(float)) = 32

My question is:
is this how reduction is done, or is there a better way I am not aware of?

Thanks

  1. what is the size of the shared memory?
    shared float theSharedMemory[128][128] ?

  2. what is your execution configuration?

  3. are you trying to implement a 2-D reduction this way?
    for a 2-D reduction, I think you need to consider coalesced access to global memory,
    i.e. use a warp to deal with each row; that is a 1-D reduction.

    you can read the 1-D reduction document in SDK/reduction/doc/reduction.pdf
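To make the warp-per-row suggestion concrete, here is a host-side C++ model of the access pattern (a sketch under the assumption of one 32-lane warp per row; on the device, the strided loads below are what make consecutive lanes touch consecutive addresses, i.e. coalesce):

```cpp
#include <vector>

// Host model of a warp reducing one row: lane l first accumulates elements
// l, l+32, l+64, ... (on a GPU, consecutive lanes would read consecutive
// addresses, so the global loads coalesce), then the 32 lane partials are
// tree-reduced with the same folding idea, but only 32 wide.
float warp_row_sum(const std::vector<float>& row) {
    const int kLanes = 32;
    float lane[kLanes] = {0.0f};
    for (int l = 0; l < kLanes; ++l)
        for (std::size_t i = l; i < row.size(); i += kLanes)
            lane[l] += row[i];
    for (int s = kLanes / 2; s > 0; s >>= 1)
        for (int l = 0; l < s; ++l)
            lane[l] += lane[l + s];
    return lane[0];
}
```

For a 128-float row, each lane accumulates 4 elements before the 5-step tree reduction across the warp.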

I can NOT declare shared float theSharedMemory[128][128] because it would need 64KB, and shared memory is only 16KB.

Therefore I construct the image strip by strip; each strip is: shared float theSharedMemory[128]

This strip is made of contributions from 512 different projections, and I want each thread to handle one projection and add its portion to the final voxel

I could not find the document you mentioned (i.e. SDK/reduction/doc/reduction.pdf) under the SDK installation.

Thank you for the reply

  1. the reason why I say “shared float theSharedMemory[128][128]” is the for-loop in your code:

[codebox]for(int index = 0; index < 128; index++)
{
	theSharedMemory[(threadIdx.x*128) + index] += theSharedMemory[((threadIdx.x+s)*128) + index];
}[/codebox]

it seems that the row index is “threadIdx.x” and the column index is “index”; index = 0:127 means theSharedMemory has 128 columns, so I want to know how many rows theSharedMemory has.

  2. I use CUDA 2.3 and SDK 2.3, and the reduction example has a document, SDK/reduction/doc/reduction.pdf

I cannot upload that file, so please see the link:

http://oz.nthu.edu.tw/~d947207/NVIDIA/redu…n/reduction.pdf

I am not using a 2-D array, just a linear shared array in which each thread gets its own run of 128 floats.

But every thread has its own area:

Thread 0: 0 - 127

Thread 1: 128 - 255

Thread 2: 256 - 383

and so on

Because I do not want threads to step over each other's partial results.

After that, I add the partial results like this:

[codebox]
for (unsigned int s = blockDim.x/2; s > 0; s = s >> 1)
{
	if (threadIdx.x < s)
	{
		for(int index = 0; index < 128; index++)
		{
			theSharedMemory[(threadIdx.x*128) + index] += theSharedMemory[((threadIdx.x+s)*128) + index];
		}
	}
	__syncthreads();
}
[/codebox]

My basic idea is:

The size of your data and the size of shared memory determine how many threads you can use to generate the final data.

Is this true? Or is there another technique to increase the number of threads?
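One alternative mapping (not from this thread, just a sketch): parallelize over voxels instead of projections. Each thread owns one voxel of the strip and loops over all 512 projections serially, accumulating in a register, so no per-thread 128-float area is needed in shared memory, and the thread count is bounded by the strip length rather than by the 16 KB. A host-side C++ model, with a hypothetical contrib(p, v) standing in for the real per-projection term:

```cpp
#include <vector>

// Host model of a per-voxel mapping: the outer loop over v plays the role
// of threadIdx.x (one thread per voxel), and each "thread" accumulates all
// projections for its voxel in a local (register) variable. No folding pass
// is needed, because no two threads ever touch the same voxel.
std::vector<float> per_voxel_accumulate(int projections, int strip,
                                        float (*contrib)(int, int)) {
    std::vector<float> out(strip, 0.0f);
    for (int v = 0; v < strip; ++v)          // v ~ threadIdx.x on the device
        for (int p = 0; p < projections; ++p)
            out[v] += contrib(p, v);         // register accumulation per voxel
    return out;
}
```

The trade-off is that each projection's data is now read by many threads instead of one, so whether this wins depends on how cheap (or how coalesced) those reads are.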

I found the document on reduction

Thanks