Problems with atomicExch() on Tesla M2050

I am having problems getting atomicExch() to work on Tesla M2050. Perhaps I don’t understand this CUDA feature well enough.

Below is a CUDA program in its entirety:


#ifndef NUMTHREADS
#define NUMTHREADS 32
#endif

#ifndef NUMBLOCKS
#define NUMBLOCKS 14
#endif

#include <iostream>
#include <cstdio>
#include <cassert>
#include <cstdlib>
using namespace std;

__global__ void
__launch_bounds__(NUMTHREADS, NUMBLOCKS/14)
sumblock(double* result){
  int ans = threadIdx.x;
  __shared__ volatile int sum;
  __shared__ volatile int lock;

  if(threadIdx.x == 0){
    sum = 0;
    lock = 0;
  }
  __syncthreads();

  int lockcopy = 1;
  while(lockcopy == 1)
    lockcopy = atomicExch((int *)&lock, 1);
  sum = sum + ans;
  atomicExch((int *)&lock, 0);

  __syncthreads();
  if(threadIdx.x == 0)
    result[blockIdx.x] = sum;
}


int main(){
  double *dresult;
  double *result;
  cudaError_t errcode;

  errcode = cudaMalloc((void **)&dresult, NUMBLOCKS*sizeof(double));
  printf("CUDA: %s\n", cudaGetErrorString(errcode));
  result = new double[NUMBLOCKS];

  sumblock<<<NUMBLOCKS, NUMTHREADS>>>(dresult);
  printf("CUDA: %s\n", cudaGetErrorString(cudaGetLastError()));

  errcode = cudaMemcpy(result, dresult, NUMBLOCKS*sizeof(double),
                       cudaMemcpyDeviceToHost);
  printf("CUDA: %s\n", cudaGetErrorString(errcode));
  cout<<"result[0] = "<<result[0]<<endl;
}


Here is a brief description of the kernel sumblock().

  1. It sets ans = threadIdx.x.

  2. sum and lock are shared variables.

  3. The thread with threadIdx.x == 0 initializes sum and lock to zero.

  4. Each thread atomically exchanges 1 into lock. It may enter the critical region once lockcopy reads back as 0 after the exchange.

  5. In the critical region, the thread adds threadIdx.x to sum.

  6. On exit from the critical region, the thread atomically exchanges 0 into lock.

  7. Note the __syncthreads().

The program is saved in a file and compiled as follows:

nvcc --ptxas-options=-v -arch=sm_20 -DNUMTHREADS=10 -DNUMBLOCKS=14 -o Q2.exe

ptxas info : Compiling entry function ‘_Z15leibnizinblocksPd’ for ‘sm_20’

ptxas info : Used 8 registers, 8+0 bytes smem, 40 bytes cmem[0]

Note that the compiler uses 8 bytes of smem, 4 each for lock and sum. The cmem is presumably used to pass arguments and for constants such as 1.

The nvcc toolkit is as follows:

[root@ip-10-17-160-215 bq-gpu]# nvcc --version

nvcc: NVIDIA ® Cuda compiler driver

Copyright © 2005-2010 NVIDIA Corporation

Built on Mon_Jun__7_18:56:31_PDT_2010

Cuda compilation tools, release 3.1, V0.2.1221

I expect result[0] to print as 0+1+…+9=45. However, the program prints “CUDA: no error” twice as expected and then just hangs; it never prints result[0] at all.

Can you please explain what is going on? Any help will be greatly appreciated.

The __syncthreads() inside the while loop is probably the source of the error. The block will stall if not every thread reaches the barrier (and that code doesn’t look to guarantee that every thread reaches the barrier on every execution of the loop).

You might want to consider familiarising yourself with the use of code tags on this board. That code was incredibly hard to read without indentation or markup.

avidday: how does one familiarize oneself with code tags? Thanks for the suggestion. I will repost with code tags once I find out how to do that. The xemacs indentation remains intact when I cut and paste into the forum’s editor. However, the indentation goes away when I click post.

The __syncthreads() is outside the while loop and not inside as you say.

Classical deadlock. Execution will never get past this infinite loop:


    while(lockcopy == 1)
      lockcopy = atomicExch((int *)&lock, 1);

The first thread to succeed in the atomic operation has to wait for the other threads in its warp to proceed past the loop, which never happens, as the lock is already taken by that first thread.
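To add to that: for a per-block sum of integers there is no need for a lock at all; hardware atomics on shared memory avoid the intra-warp deadlock entirely. A minimal sketch of that alternative (this is my own rewrite of the kernel, not the original poster's code; it assumes sm_12 or later for shared-memory atomics):

```cuda
__global__ void sumblock(double* result){
  __shared__ int sum;
  if (threadIdx.x == 0) sum = 0;
  __syncthreads();                    // make the zeroed sum visible to all threads

  atomicAdd(&sum, (int)threadIdx.x);  // no critical section, no spin lock

  __syncthreads();                    // wait until every thread has contributed
  if (threadIdx.x == 0) result[blockIdx.x] = sum;
}
```

Every thread reaches both barriers unconditionally, so there is no divergence hazard, and the atomicAdd() serializes only the contended updates.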

If you need double precision atomicAdd(), look here.
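For reference, the usual trick (the linked code is in this spirit) is to build a double-precision atomicAdd() out of atomicCAS() on the 64-bit representation. A sketch, assuming sm_13 or later for 64-bit atomicCAS(); the function name atomicAddDouble is my own:

```cuda
__device__ double atomicAddDouble(double* address, double val){
  unsigned long long int* address_as_ull = (unsigned long long int*)address;
  unsigned long long int old = *address_as_ull, assumed;
  do {
    assumed = old;
    // Retry until no other thread has modified *address in between.
    old = atomicCAS(address_as_ull, assumed,
                    __double_as_longlong(val + __longlong_as_double(assumed)));
  } while (assumed != old);
  return __longlong_as_double(old);
}
```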

To make your code readable in the forum, include it inside [font=“Courier New”][code] [/code][/font] tags.

Tera: Many thanks! I should have thought of that. Your solution for an atomicAdd() for doubles is ingenious.