hspnew
1
Each thread of the block computes the output and upload that output into corresponding locations in array in global memory.
Is there an easy, efficient way to compute the sum of the array ?
Either from calling within the kernel, or outside the kernel…
Outside the kernel I mean, using some runtime library… like CUBLAS, …
Thank you
DenisR
2
the reduction example in the SDK does exactly that.
Doesn’t this exaple only works with a compute capability 1.1 maybe you can also take a loot at CUDPP library (Parallel Prefix Sum)
DenisR
4
No, this example works for all versions. I think all examples that use atomic ops have atomic in their name.
Sarnath
5
Let me give my rough take on this problem. This may or may not be efficient as I have not even coded it. I am designing it inside this text box.
SUM_KERNEL(float *g_in, float *g_out)
{
-
Divide the input array into “N” distinct subsets. N is the number of blocks that
you are going to spawn.
-
Now, each block has to operate on roughly “TOTAL/N” amount of data-items.
Let us call this number as “M”.
-
g_in + (M*blockIdx.x) == SA (Start Address) for any given block
-
local_sum = 0;
for(i=threadIdx.x; i<M; i+=blockDim.x)
{
local_sum + = SA[i];
}
Note that all the global memory fetches are “coalesced”.
-
At this stage, each thread of a BLOCK holds a partial sum for THAT block.
-
g_out[blockIdx.x*blockDim.x + threadIdx.x] = local_sum;
Now, this demands that “g_out” requires as many entries as there are “threads”
for your kernel.
}
Now, it is just a question of calling “SUM_KERNEL” repeatedly with shrinking appropriate block numbers.
Also note that there is no synchronisation required between threads of a block.
Also, the shared memory usage is very very less.
So, depending on your hardware invoke the kernel with the correct amount of block and grid dimensions.
For example, On my 8800 GTX, I would call this with “32 threads per block” and 128 blocks. (16 multiprocessors * 8 max_active_blocks = 128 blocks).
That would mean – regardless of the input array size – I can reduce it to a partial sum of 4096 (128*32)elements.
Now, for input array sizes of the order 1 million, this 4096 is a very puny number.
Now, its upto the programmer to decide on optimal block and grid sizes and squeeze out performance.