Reduction done in shared memory

benkasmi · September 22, 2009, 7:56pm

Hello

I have a little application where different threads will contribute to a final volume. Each thread adds its fraction to a voxel.
I am trying to build a 128x128x128 volume, and my data type is float (i.e. 4 bytes). The way I am doing the reduction right now is by folding the shared memory onto itself:

I divided the volume into 128x128 strips, where each strip is 128 floats
// each thread computes its contribution to a strip of voxels
.

__syncthreads();

for (unsigned int s = blockDim.x/2; s > 0; s = s >> 1)
{
	if (threadIdx.x < s)
	{
		for(int index = 0; index < 128; index++)
		{
			theSharedMemory[(threadIdx.x*128) + index] += theSharedMemory[((threadIdx.x+s)*128) + index];
		}
	}
	__syncthreads();
}

// update the global memory

Since the shared memory is 16KB and my volume is 128x128x128 floats the number of threads I can run in parallel is:

N = 16KB/(128*sizeof(float))
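
To make the arithmetic above concrete, here is a minimal host-side sketch. The helper name `stripsThatFit` is hypothetical, and the 16 KB figure is simply the number quoted in this post. Treat the result as an upper bound: as far as I know, devices of this generation also keep kernel arguments in shared memory, so the usable count may be slightly lower.

```cuda
#include <cstdio>
#include <cstddef>

// Hypothetical helper: how many per-thread strips of `floatsPerStrip` floats
// fit into `sharedBytes` bytes of shared memory (integer division, rounds down).
constexpr std::size_t stripsThatFit(std::size_t sharedBytes, std::size_t floatsPerStrip)
{
    return sharedBytes / (floatsPerStrip * sizeof(float));
}

int main()
{
    constexpr std::size_t sharedBytes = 16 * 1024;  // 16 KB, the figure quoted in the post
    constexpr std::size_t floatsPerStrip = 128;     // one strip = 128 floats = 512 bytes

    // 16384 / 512 = 32 strips, i.e. at most 32 threads with private strips.
    std::printf("N = %zu\n", stripsThatFit(sharedBytes, floatsPerStrip));
    return 0;
}
```

With one private strip per thread, N is therefore at most 32 threads per block.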

My question is:
Is this how reduction is normally done, or is there a better way that I am not aware of?

Thanks

LSChien · September 23, 2009, 1:13am

What is the size of the shared memory array? Is it
__shared__ float theSharedMemory[128][128] ?
What is your execution configuration?
Are you trying to implement a 2-D reduction with this idea?
For a 2-D reduction, I think you need to consider coalescing of global memory accesses,
i.e. use a warp to handle one row, which turns it into a 1-D reduction.

You can read the 1-D reduction document in SDK/reduction/doc/reduction.pdf
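
As a rough illustration of a 1-D reduction in shared memory, here is a minimal sketch in which each block sums a contiguous chunk of the input. It is not copied from the SDK sample; the names `blockSum`, `sumOnDevice`, and `kThreads` are made up for this example, and the block size is assumed to be a power of two.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 128;  // threads per block; assumed to be a power of two

// Each block sums up to kThreads consecutive elements of `in` and writes
// one partial sum to out[blockIdx.x].
__global__ void blockSum(const float* in, float* out, int n)
{
    __shared__ float partial[kThreads];

    const int tid = threadIdx.x;
    const int i = blockIdx.x * blockDim.x + tid;

    // Consecutive threads read consecutive addresses, so the load is coalesced.
    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Fold the upper half onto the lower half until one value remains.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = partial[0];
}

// Hypothetical host helper: sums `host` on the GPU. Returns false on a CUDA error.
bool sumOnDevice(const std::vector<float>& host, float* result)
{
    const int n = static_cast<int>(host.size());
    *result = 0.0f;
    if (n == 0) return true;

    const int blocks = (n + kThreads - 1) / kThreads;
    float* dIn = nullptr;
    float* dOut = nullptr;
    if (cudaMalloc(&dIn, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, blocks * sizeof(float)) != cudaSuccess) { cudaFree(dIn); return false; }

    cudaMemcpy(dIn, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    blockSum<<<blocks, kThreads>>>(dIn, dOut, n);

    std::vector<float> partials(blocks);
    const cudaError_t err = cudaMemcpy(partials.data(), dOut, blocks * sizeof(float),
                                       cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return false;

    // Add the per-block partial sums on the host.
    for (float p : partials) *result += p;
    return true;
}

int main()
{
    std::vector<float> data(1000, 1.0f);
    float total = 0.0f;
    if (!sumOnDevice(data, &total)) { std::printf("CUDA error\n"); return 1; }
    std::printf("sum = %f\n", total);
    return 0;
}
```

Here every thread holds a single float in shared memory, so the block needs only `kThreads * sizeof(float)` bytes rather than one 128-float strip per thread.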

benkasmi · September 23, 2009, 10:19pm

I can NOT declare __shared__ float theSharedMemory[128][128] because it needs 64KB and shared memory is only 16KB

Therefore I construct the image strip by strip, where each strip is: __shared__ float theSharedMemory[128]

This strip is made of contributions from 512 different projections, and I want each thread to handle one projection and add its portion to the final voxel

I could not find the document you mentioned (i.e SDK/reduction/doc/reduction.pdf ) under the SDK installation:

Thank you for the reply

LSChien · September 24, 2009, 1:23am

The reason I mentioned “__shared__ float theSharedMemory[128][128]” is the for-loop in your code:

[codebox]for(int index = 0; index < 128; index++)
{
	theSharedMemory[(threadIdx.x*128) + index] += theSharedMemory[((threadIdx.x+s)*128) + index];
}[/codebox]

It seems that the row index is “threadIdx.x” and the column index is “index”, so index = 0:127 means that

theSharedMemory has 128 columns. What I want to know is how many rows theSharedMemory has.

I use CUDA 2.3 and SDK 2.3, and the reduction example includes a document, SDK/reduction/doc/reduction.pdf

I cannot upload that file, so please see the link:

http://oz.nthu.edu.tw/~d947207/NVIDIA/redu…n/reduction.pdf

benkasmi · September 25, 2009, 5:32pm

I am not using a 2-D array, I am using just a linear array of 128 floats.

But every thread has its own area:

Thread 0: 0 - 127

Thread 1: 128 - 255

Thread 2: 256 - 383

and so on

This is because I do not want threads to step over each other's partial results.

After that I add the partial results like this:

[codebox]
for (unsigned int s = blockDim.x/2; s > 0; s = s >> 1)
{
	if (threadIdx.x < s)
	{
		for(int index = 0; index < 128; index++)
		{
			theSharedMemory[(threadIdx.x*128) + index] += theSharedMemory[((threadIdx.x+s)*128) + index];
		}
	}
	__syncthreads();
}
[/codebox]

My basic idea is:

The size of your data and the size of shared memory determine how many threads you can use to generate the final data

Is this true? Or is there another technique to increase the number of threads?

I found the document on reduction

Thanks
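
One possible alternative, sketched here only as an assumption about how the problem could be restructured: let each thread own one voxel of the strip and loop over the projections, so the running sum lives in a register and no per-thread strip in shared memory is needed. The names `accumulateStrip` and `runStrip` and the `[projection][voxel]` layout of `contrib` are hypothetical; in the real application each contribution would be computed rather than read from a precomputed array.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kStripLen = 128;  // voxels per strip, as in the thread

// Assumed layout: contrib[p * kStripLen + v] is the contribution of
// projection p to voxel v. Thread v sums over all projections in a register.
__global__ void accumulateStrip(const float* contrib, float* strip, int numProjections)
{
    const int v = threadIdx.x;
    float sum = 0.0f;
    for (int p = 0; p < numProjections; ++p) {
        // Neighbouring threads read neighbouring addresses (coalesced).
        sum += contrib[p * kStripLen + v];
    }
    strip[v] = sum;
}

// Hypothetical host helper: accumulates one strip. Returns false on a CUDA error.
bool runStrip(const std::vector<float>& contrib, int numProjections, std::vector<float>* strip)
{
    float* dContrib = nullptr;
    float* dStrip = nullptr;
    const size_t inBytes = contrib.size() * sizeof(float);
    const size_t outBytes = kStripLen * sizeof(float);
    if (cudaMalloc(&dContrib, inBytes) != cudaSuccess) return false;
    if (cudaMalloc(&dStrip, outBytes) != cudaSuccess) { cudaFree(dContrib); return false; }

    cudaMemcpy(dContrib, contrib.data(), inBytes, cudaMemcpyHostToDevice);
    accumulateStrip<<<1, kStripLen>>>(dContrib, dStrip, numProjections);

    strip->resize(kStripLen);
    const cudaError_t err = cudaMemcpy(strip->data(), dStrip, outBytes, cudaMemcpyDeviceToHost);
    cudaFree(dContrib);
    cudaFree(dStrip);
    return err == cudaSuccess;
}

int main()
{
    const int numProjections = 512;  // the number of projections mentioned in the thread
    std::vector<float> contrib(numProjections * kStripLen);
    for (int p = 0; p < numProjections; ++p)
        for (int v = 0; v < kStripLen; ++v)
            contrib[p * kStripLen + v] = static_cast<float>(v);  // dummy data

    std::vector<float> strip;
    if (!runStrip(contrib, numProjections, &strip)) { std::printf("CUDA error\n"); return 1; }
    std::printf("strip[1] = %f, strip[127] = %f\n", strip[1], strip[127]);
    return 0;
}
```

In this arrangement the thread count follows the strip length instead of the shared memory size, at the cost of parallelising over voxels rather than over projections. Whether that is faster depends on how expensive each projection's contribution is to compute.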
