I am having problems getting atomicExch() to work on a Tesla M2050; perhaps I don't understand this CUDA feature well enough. Below is the CUDA program in its entirety:
#ifndef NUMTHREADS
#define NUMTHREADS 32
#endif
#ifndef NUMBLOCKS
#define NUMBLOCKS 14
#endif
#include <iostream>
#include <cstdio>   // printf
#include <cassert>
#include <cstdlib>
using namespace std;
__global__ void
__launch_bounds__(NUMTHREADS, NUMBLOCKS/14)
sumblock(double *result) {
    int ans = threadIdx.x;
    __shared__ volatile int sum;
    __shared__ volatile int lock;
    if (threadIdx.x == 0) {
        sum = 0;
        lock = 0;
    }
    __syncthreads();
    int lockcopy = 1;
    while (lockcopy == 1)
        lockcopy = atomicExch((int *)&lock, 1);
    sum = sum + ans;
    atomicExch((int *)&lock, 0);
    __syncthreads();
    if (threadIdx.x == 0)
        result[blockIdx.x] = sum;
}
int main() {
    double *dresult;
    double *result;
    cudaError_t errcode;

    errcode = cudaMalloc((void **)&dresult, NUMBLOCKS*sizeof(double));
    printf("CUDA: %s\n", cudaGetErrorString(errcode));

    result = new double[NUMBLOCKS];
    sumblock<<<NUMBLOCKS, NUMTHREADS>>>(dresult);
    printf("CUDA: %s\n", cudaGetErrorString(cudaGetLastError()));

    errcode = cudaMemcpy(result, dresult,
                         NUMBLOCKS*sizeof(double), cudaMemcpyDeviceToHost);
    printf("CUDA: %s\n", cudaGetErrorString(errcode));

    cout << "result[0] = " << result[0] << endl;

    delete[] result;
    cudaFree(dresult);
}
Here is a brief description of the kernel sumblock():
- It sets ans = threadIdx.x.
- sum and lock are shared variables.
- The thread with threadIdx.x == 0 initializes sum and lock to zero.
- Each thread atomically exchanges 1 into lock. It may enter the critical region only if lockcopy reads back as 0 after the exchange.
- In the critical region, the thread adds threadIdx.x to sum.
- On exit from the critical region, the thread atomically exchanges 0 into lock.
- Note the __syncthreads() calls before and after the critical region.
The program is saved in the file Q2.cu. It is compiled as follows:
nvcc --ptxas-options=-v -arch=sm_20 -DNUMTHREADS=10 -DNUMBLOCKS=14 Q2.cu -o Q2.exe
ptxas info : Compiling entry function ‘_Z15leibnizinblocksPd’ for ‘sm_20’
ptxas info : Used 8 registers, 8+0 bytes smem, 40 bytes cmem[0]
Note that the compiler reports 8 bytes of smem, 4 bytes each for lock and sum. The cmem is presumably used for kernel arguments and for constants such as 1.
The nvcc toolkit version is as follows:
[root@ip-10-17-160-215 bq-gpu]# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2010 NVIDIA Corporation
Built on Mon_Jun__7_18:56:31_PDT_2010
Cuda compilation tools, release 3.1, V0.2.1221
I expect result[0] to print as 0+1+…+9 = 45. However, the program prints "CUDA: no error" twice, as expected, and then just hangs; it never prints result[0].
Can you please explain what is going on? Any help will be greatly appreciated.