I am having problems getting atomicExch() to work on a Tesla M2050; perhaps I don't understand this CUDA feature well enough. Below is the CUDA program in its entirety:
#ifndef NUMTHREADS
#define NUMTHREADS 32
#endif
#ifndef NUMBLOCKS
#define NUMBLOCKS 14
#endif
#include <iostream>
#include <cstdio>   // printf
#include <cassert>
#include <cstdlib>
using namespace std;
__global__ void
__launch_bounds__(NUMTHREADS, NUMBLOCKS/14)
sumblock(double *result) {
    int ans = threadIdx.x;
    __shared__ volatile int sum;
    __shared__ volatile int lock;
    if (threadIdx.x == 0) {
        sum = 0;
        lock = 0;
    }
    __syncthreads();
    int lockcopy = 1;
    while (lockcopy == 1)
        lockcopy = atomicExch((int *)&lock, 1);
    sum = sum + ans;
    atomicExch((int *)&lock, 0);
    __syncthreads();
    if (threadIdx.x == 0)
        result[blockIdx.x] = sum;
}
int main() {
    double *dresult;
    double *result;
    cudaError_t errcode;

    errcode = cudaMalloc((void **)&dresult, NUMBLOCKS*sizeof(double));
    printf("CUDA: %s\n", cudaGetErrorString(errcode));

    result = new double[NUMBLOCKS];
    sumblock<<<NUMBLOCKS, NUMTHREADS>>>(dresult);
    printf("CUDA: %s\n", cudaGetErrorString(cudaGetLastError()));

    errcode = cudaMemcpy(result, dresult,
                         NUMBLOCKS*sizeof(double), cudaMemcpyDeviceToHost);
    printf("CUDA: %s\n", cudaGetErrorString(errcode));

    cout << "result[0] = " << result[0] << endl;

    delete[] result;
    cudaFree(dresult);
}
Here is a brief description of the kernel sumblock():
- It sets ans = threadIdx.x.
- sum and lock are shared variables.
- The thread with threadIdx.x == 0 initializes sum and lock to zero.
- Each thread atomically exchanges 1 into lock. It may enter the critical region only if lockcopy reads back as 0 after the exchange.
- In the critical region, the thread adds threadIdx.x to sum.
- On exit from the critical region, the thread atomically exchanges 0 into lock.
- Note the __syncthreads() calls before and after the critical region.
The program is saved in the file Q2.cu. It is compiled as follows:
nvcc --ptxas-options=-v -arch=sm_20 -DNUMTHREADS=10 -DNUMBLOCKS=14 Q2.cu -o Q2.exe
ptxas info : Compiling entry function ‘_Z15leibnizinblocksPd’ for ‘sm_20’
ptxas info : Used 8 registers, 8+0 bytes smem, 40 bytes cmem[0]
Note that the compiler reports 8 bytes of smem, 4 bytes each for lock and sum. The cmem is presumably used for kernel arguments and for constants such as 1.
The nvcc toolkit version is as follows:
[root@ip-10-17-160-215 bq-gpu]# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2010 NVIDIA Corporation
Built on Mon_Jun__7_18:56:31_PDT_2010
Cuda compilation tools, release 3.1, V0.2.1221
I expect result[0] to print as 0+1+…+9 = 45. However, the program prints "CUDA: no error" twice, as expected, and then just hangs; it never prints result[0].
Can you please explain what is going on? Any help will be greatly appreciated.