Scalar Product (dot product) with Atomic Operations How to implement a DotProduct all in the Kernel

The final sum of the dotproduct example is implemented on CPU. I know it’s possible to implement all the dotproduct in the GPU kernel using atomic operations. Does anyone know how to do that (an example of parallel reduction)?

thanks. :">

You almost certainly want to look at the reduciton example in the SDK.


// write result for this block to global mem 

if (tid == 0) g_odata[get_group_id(0)] = sdata[0];[/codebox]

Nothing about atomic operations.

I need something like: if (tid == 0) (protected area, serialized code) g_odata[0] = sdata[0]; (end protected area)

He is talking about reduction example in SDK, not dot product in SDK.
Reduction algorithm loads input values, sums them up and outputs a single value.
And you don’t need atomic operations for that ;)

I know the examples. But you should make the final sum on the host side. It’s because there’s no synchronization btw threads/work-item from different groups.

if (tid == 0) g_odata[get_group_id(0)] = sdata[0]; // from the reduction example.

if you have 4 groups, the host should accumulate a vector of 4 elements to get the result.

Am I wrong?

Ok, I appologise. I know how the algorithm works, but I am not entirely sure how it is done in SDK example. The dot product consists of two phases:

  • multiply components (fairly easy)
  • sum all elements (that’s where reduction algorithm comes in).

The fastest way to do it is to launch as many blocks as your GPU can handle, and let them do their work.
At the end we end up with each block returning a value. What to do next is a matter of taste.

a) Each block stores the result under output[blockIdx.x]=blockValue and we then launch the same algorithm with one block only on the output array
B) Each blocks increments the output value atomically as you suggest: atomicAdd(output,blockValue), but this will work only for integer arrays
c) Copy result back to host and sum it up using CPU.

Unlike in CPU side, there is no mechanism in CUDA to begin and end a protected region that would be executed by exactly one block or thread. There is only a handful of atomic operations to your disposal. You could craft a protected region yourself but it is usually an ugly hack and makes the code very slow

Thank you. I’ve found one solution. Sorry, it’s a OpenCL code. But it works.


  • sDOT OpenCL Kernel Function for Level 1 BLAS Dot Product dot<-xy

  • Author; Wendell Rodrigues

  • INRIA-Lille :: DaRT Team


__kernel void sDOT(

__global const unsigned int N,

__global const float* X,

__global const float* Y,

__global float* DOT,

__global int* FLAG,

__local float* sdata



// get index into global data array

unsigned int tid = get_local_id(0);

unsigned int i = get_global_id(0);

sdata[tid] = (i<N) ? X[i]*Y[i] : 0;

if (i==0) {





// do reduction in shared mem

for(unsigned int s=1; s < get_local_size(0); s *= 2)


    int index = 2 * s * tid;

if (index < get_local_size(0))


        sdata[index] += sdata[index + s];




// write result for this block to global mem

if (tid == 0) {

    while (atom_cmpxchg(FLAG,0,1)==1);

    DOT[0] += sdata[0];





The question remains, how fast is that code? This atomic operation in an actively waiting loop may introduce a lot of pressure on the memory controller.

You’re right. I’ve got nothing about performance. But, at least, I can code the Conjugate Gradient Algorithm(many dotproduct calls) all in the kernel side. I’ll post my results.

Wendell, it seems to me that you are on the correct track. Get the code working first and then optimize.

Let us know how it goes.


hey wendell, do you happen to know the speed difference between an iterative solver (such as conjugate gradient method) vs a direct sparse matrix solver for sparse matrices of order ~10,000 to 30,000 (i.e. is direct solver a few times faster, or are we talking about orders of magnitude)?

The SDK reduction example can do it all GPU side. It’s perfectly possible to accumulate 4 values on the GPU, just inefficient. I’d bet that it was faster than any atomic method though.

I’d like to see your example with two vectors (30k elements) without get out of the kernel. Could you show me how to? :sweat:


No. I hadn’t realised you had the restriction of doing it with a single kernel invocation. Must be an annoying restriction. I take it you’re inlining inside a more complicated kernel then?

That’s it. I’ve a CG(Conjugate Gradient Iterative Solver) Kernel. And it broke because of the dotproduct. But, unfortunately, the CG performance goes down(3x) with the atomic looping. :ermm:

Hi Nikolai,

I don’t know. Do u have a direct sparse matrix solver implementation or any benchmark? My matrices are sparse, symmetric and positive definite. I’m not math-person but this iterative method is suitable for these matrices. This is a classical solver on generic CPUs. I’m not sure for GPUs.

To Nikolai,

there is no definite answer, it depends out your matrix, is it symmetric ? is it positive definite ? and also on the pattern, i know that for us the CG or actually the modified preconditioned constraint CG works best.

To Wendell,

Without global sync i haven’t found a ways to do a complete CG solve in one kernel. There are actually 3 points of sync needed. after each dot product (there are 2 in each iteration) and another at the end when the new X is computed and is needed for the next matrix vector multiply. Even so running 300 very small kernels still got me a boost factor for around x10, again it depends on you sparse pattern…

Oh and atomics are horribly slow on the G200 … not somthing you would want to use in a CG

Do you have one (or few) dot products to compute in parallel or many of them?
If the latter is the case, maybe you could employ one-block-per-dot-product approach? That would be much easier to implement and probably even faster as blocks would not have to communicate through global memory at all!
Even faster would be one-warp-per-dot-product approach which would get rid of all __syncthreads() as well!

the dot products in a CG aren’t done in parralel. there are 2 big dot products each done in a different part of the algorithem. so you can’t do them in parralel. thow internally you can run each one in parallel.

just to explane here is a quick grab from wikipedia of octave (open source matlab) code: ( though in this one there are 4 dot products)

function = conjgrad(A,b,x0)

r = b - A*x0;

w = -r;

z = A*w;

a = (r’*w)/(w’*z);

x = x0 + a*w;

B = 0;

for i = 1:size(A)(1);

  r = r - a*z;

  if( norm® < 1e-10 )



  B = (r'*z)/(w'*z);

  w = -r + B*w;

  z = A*w;

  a = (r'*w)/(w'*z);

  x = x + a*w;



Hi Erdoom,

How do you do with only two dot?

I use this one of wikipedia.