Reduction

Hi,
I seem to remember that recent CUDA versions introduced built-in macros or functions for doing a reduction (e.g. min/max) over an array in memory. Am I wrong? Should I still implement this myself, or is there something CUDA supplies?
I’m mainly interested in a built-in function to compute the min/max value of an array in shared memory (smem)…

thanks
eyal

Wouldn’t that be Thrust?

No :) I want something in my kernel (can’t use Thrust for that).

I thought CUDA had recently introduced some intrinsic/function to do that.

eyal

Hmm, I am very interested in reduction.
I’m working on “many but small” reductions.

The SDK example sums 16M elements, but in my case I have to sum N arrays of M complex elements each, where M is small (max 8192, usually 512).
This is what you get when you have to find the mean value of each row of a matrix with few columns ;) It’s a common problem in signal filtering.

How about running N threads, with each thread adding M elements? If N is large and you use a coalescing-friendly memory layout (e.g. column-major), this should be about as fast as you can get.
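
A minimal sketch of that one-thread-per-array idea, assuming the complex values are stored as float2 and that element j of array i sits at data[j * N + i]. The names sumArrays and sumArraysDemo are made up for illustration, and error checking is left out for brevity:

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per array: thread i sums the M complex values of array i.
// Layout assumption: element j of array i is stored at data[j * N + i],
// so neighbouring threads read neighbouring addresses (coalesced loads).
__global__ void sumArrays(const float2 *data, float2 *sums, int N, int M)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;

    float2 acc = make_float2(0.0f, 0.0f);
    for (int j = 0; j < M; ++j) {
        float2 v = data[(size_t)j * N + i];
        acc.x += v.x;   // real part
        acc.y += v.y;   // imaginary part
    }
    sums[i] = acc;      // divide by M here to get the mean instead
}

// Hypothetical host helper: fills array i with the constant (i, 1),
// runs the kernel and returns the sum computed for array `which`.
float2 sumArraysDemo(int N, int M, int which)
{
    std::vector<float2> h((size_t)N * M);
    for (int j = 0; j < M; ++j)
        for (int i = 0; i < N; ++i)
            h[(size_t)j * N + i] = make_float2((float)i, 1.0f);

    float2 *dData, *dSums;
    cudaMalloc(&dData, h.size() * sizeof(float2));
    cudaMalloc(&dSums, N * sizeof(float2));
    cudaMemcpy(dData, h.data(), h.size() * sizeof(float2), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (N + threads - 1) / threads;
    sumArrays<<<blocks, threads>>>(dData, dSums, N, M);

    float2 result;
    cudaMemcpy(&result, dSums + which, sizeof(float2), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dSums);
    return result;
}
```

This only keeps the GPU busy when N is large. With few arrays, one block per array with a shared memory reduction is probably the better fit.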

You’re thinking of the simple but useful __syncthreads_count(), __syncthreads_and(), __syncthreads_or().

They’re not general reductions over an arbitrary type or operator, but they are easy to use for their simpler specific cases.
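
To illustrate the kind of case they do cover, here is a small sketch that uses __syncthreads_count() to count, per block, how many elements exceed a threshold. The kernel and helper names are hypothetical, and it assumes a device of compute capability 2.0 or later:

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each block counts how many of its elements exceed `threshold`.
// __syncthreads_count() is a barrier, so every thread of the block must
// call it, including threads whose index is past the end of the input.
__global__ void countAbove(const float *in, int *blockCounts, int n, float threshold)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int pred = (i < n) && (in[i] > threshold);   // 0 for out-of-range threads
    int count = __syncthreads_count(pred);       // same value in every thread
    if (threadIdx.x == 0)
        blockCounts[blockIdx.x] = count;
}

// Hypothetical host helper: input is 0, 1, ..., n-1; returns the total count.
int countAboveDemo(int n, float threshold)
{
    std::vector<float> h(n);
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    float *dIn;
    int *dCounts;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dCounts, blocks * sizeof(int));
    cudaMemcpy(dIn, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    countAbove<<<blocks, threads>>>(dIn, dCounts, n, threshold);

    std::vector<int> counts(blocks);
    cudaMemcpy(counts.data(), dCounts, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dCounts);

    int total = 0;
    for (int c : counts) total += c;   // finish the per-block counts on the host
    return total;
}
```

__syncthreads_and() and __syncthreads_or() are called the same way and return whether the predicate is non-zero for all threads, or for at least one thread, of the block.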

Thanks :)

Those, however, will not find the maximum number in a shared memory array, will they?

If they can be used to that end, can you please supply an example?

thanks

eyal

Why not just cut and paste from the reduction SDK example? You can try atomicMax, but my guess is that it will not be faster than the SDK example code.
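
For reference, a sketch of a max reduction over a shared memory array, written in the tree-reduction style of the SDK example rather than copied from it. The names blockMax, maxPerBlock and maxDemo are hypothetical, and the code assumes the block size is a power of two:

```cuda
#include <cfloat>
#include <cuda_runtime.h>
#include <vector>

// Tree reduction for the maximum of sdata[0 .. blockDim.x-1] in shared memory.
// Assumes blockDim.x is a power of two; must be called by all threads of the
// block. Overwrites sdata in the process.
__device__ float blockMax(float *sdata)
{
    __syncthreads();   // make sure all writes to sdata are visible
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] = fmaxf(sdata[threadIdx.x], sdata[threadIdx.x + s]);
        __syncthreads();
    }
    return sdata[0];
}

__global__ void maxPerBlock(const float *in, float *blockMaxima, int n)
{
    extern __shared__ float sdata[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[threadIdx.x] = (i < n) ? in[i] : -FLT_MAX;   // pad with the identity for max
    float m = blockMax(sdata);
    if (threadIdx.x == 0)
        blockMaxima[blockIdx.x] = m;
}

// Hypothetical host helper: small values everywhere, one peak at `peakIndex`.
float maxDemo(int n, int peakIndex)
{
    std::vector<float> h(n);
    for (int i = 0; i < n; ++i) h[i] = (float)(i % 100);
    h[peakIndex] = 12345.0f;

    int threads = 256;   // power of two, as blockMax assumes
    int blocks = (n + threads - 1) / threads;
    float *dIn, *dOut;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, blocks * sizeof(float));
    cudaMemcpy(dIn, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    maxPerBlock<<<blocks, threads, threads * sizeof(float)>>>(dIn, dOut, n);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dOut, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);

    float result = -FLT_MAX;
    for (float p : partial) result = fmaxf(result, p);   // combine block maxima
    return result;
}
```

Inside an existing kernel only blockMax() is needed. Since it reduces in place, copy the array first if the original values are still needed afterwards.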

That’s what I did :) I just thought there was something simpler… oh well, yet another “job security” piece of code ;)
