Suppose I have a kernel that, as part of its algorithm, must sum and store values at the same array index - i.e. it generates a race condition. Currently I employ atomicAdd and then continue with the rest of the kernel. Needless to say, the atomicAdd generates a huge performance hit I would like to avoid. Is there a way to employ a CUDA reduction as part of an existing kernel, without calling another kernel? That is, I can't count on the use of dynamic parallelism, as the hardware generations employed may vary.
Thank you to anyone willing to help.
A general outline of my kernel follows (please note this is just a brief, rough outline, not exact):
__global__ void myKernel(float *A, float *B, float r, int N) {
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < N) {
        // code to generate a value via a device call
        float tmp = myKernel_A(A, r);
        // accumulate into this element
        atomicAdd(&B[idx], tmp);
        // continue with rest of kernel operations
        ...
    }
}
Please excuse my naivety with regard to CUDA, I am learning. Is there a way to employ a reduction without calling a separate kernel from the host? If so, can you please point me to a link and/or pseudo-code that would accomplish this? I would like to avoid calling another kernel from the host just for the reduction part of the code.
Sorry, my mistake - the idx is actually derived from an array of indices passed to the kernel. A more detailed kernel follows and should clear this up. Once again, I apologize for the incorrect kernel I posted.
__global__ void myKernel(float *A, float *B, int *indices, float r, int N) {
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < N) {
        // get index associated with this thread -
        // NOTE: not unique, as many different threads can map to the same index
        int i = indices[idx];
        // code to generate a value via a device call
        float tmp = myKernel_A(A, r);
        // accumulate into this element
        atomicAdd(&B[i], tmp);
        // continue with rest of kernel operations
        ...
    }
}
This is actually Q reductions going on in parallel, then, where Q is the number of unique index values (of i) that are actually present/used.
A simple reduction will not help here.
It will be difficult to solve this on a device-wide basis without using device-wide sync mechanisms, or atomics.
An alternative approach to atomics is outlined here:
(Be advised there is no actual worked example there. It might be that the approach is not faster than atomics. Performance comparisons of these two algorithmic approaches, such as here:
will depend significantly on the actual data sets, including the “density” of atomics. Kepler in particular has significantly improved the performance of global atomics.)
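The linked write-up is not reproduced here. Purely as an illustration of what an atomics-free formulation of these Q keyed sums can look like (not necessarily the approach described at the link), the same result can be computed on the host with Thrust's sort_by_key/reduce_by_key - though note this launches additional kernels under the hood, which the original question wanted to avoid:

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

// Hypothetical host-side sketch: d_indices[k] holds the output slot for the
// per-thread contribution d_vals[k] (both already resident on the device).
// The function name and signature are invented for this illustration.
void sum_by_index(thrust::device_vector<int>   &d_indices,
                  thrust::device_vector<float> &d_vals,
                  thrust::device_vector<int>   &d_unique,  // sized >= number of contributions
                  thrust::device_vector<float> &d_sums)    // sized >= number of contributions
{
    // group equal indices together (reorders d_indices and d_vals in place) ...
    thrust::sort_by_key(d_indices.begin(), d_indices.end(), d_vals.begin());
    // ... then sum each group in one pass: the first Q entries of d_unique
    // receive the distinct indices, and the first Q entries of d_sums their totals
    thrust::reduce_by_key(d_indices.begin(), d_indices.end(), d_vals.begin(),
                          d_unique.begin(), d_sums.begin());
}

The sums would still need to be scattered back into B afterward, and the sort itself is not free, which is part of why the relative performance depends so much on the data set, as noted above.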
You can apply that alternative approach at the block level if you want (thereby avoiding the need for a device-wide sync), but each block will then need to accumulate its partial results into the global results contained in B (perhaps using atomics). This isn't a trivial problem, and not one I would suggest is appropriate for someone at the beginner level.
The block-level reduction sounds promising - applying the atomics at the block level rather than the thread level may be just the performance boost I am looking for. Can you provide a link and/or pseudo-code showing how to apply this block-level reduction?
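For what it's worth, here is a minimal sketch of one way the block-level idea could look (this is not the approach from the link above). It keeps a block-private copy of the output bins in shared memory, which assumes the distinct index values fall in a small range NUM_BINS - a name introduced just for this sketch - and that shared-memory float atomics are available (compute capability 2.0 or later):

#define NUM_BINS 1024   // assumed upper bound on the values in indices[]; must fit in shared memory

__global__ void myKernel(float *A, float *B, int *indices, float r, int N) {
    __shared__ float partial[NUM_BINS];   // block-private copy of the output bins

    // zero the block-private bins
    for (int j = threadIdx.x; j < NUM_BINS; j += blockDim.x)
        partial[j] = 0.0f;
    __syncthreads();

    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < N) {
        int i = indices[idx];
        // device call from the original outline, assumed declared elsewhere
        float tmp = myKernel_A(A, r);
        // shared-memory atomic: contention is confined to this block
        atomicAdd(&partial[i], tmp);
    }
    __syncthreads();

    // one global atomic per non-empty bin per block, instead of one per thread
    for (int j = threadIdx.x; j < NUM_BINS; j += blockDim.x)
        if (partial[j] != 0.0f)
            atomicAdd(&B[j], partial[j]);

    // continue with rest of kernel operations
}

Contention on the shared-memory atomics stays within a block, and each block issues at most NUM_BINS global atomicAdds instead of one per thread. Whether that is a win depends on how large NUM_BINS is relative to the block size and on how the indices are distributed.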