Bug in basic arithmetic on M2050

Here is a simple CUDA program in its entirety, saved in a file called bug1.cu. (NUMTHREADS and NUMBLOCKS must both be even; they can be set using the -D option to nvcc.)


#ifndef NUMTHREADS
#define NUMTHREADS 32
#endif
#ifndef NUMBLOCKS
#define NUMBLOCKS 14
#endif

#include <iostream>
#include <cassert>
#include <cstdlib>
using namespace std;

__global__ void leibniz(long int n, double *result){
  int tid = threadIdx.x+blockIdx.x*blockDim.x;
  double ans=0;
  int step = blockDim.x*gridDim.x;
  for(long int i=tid; i < n; i+=step)
    ans += 4.0/(2.0*i+1.0);
  if(tid%2==1)  // step is even, so all of a thread's terms share one sign
    ans = -ans;
  result[tid] = 1.0*ans;
}


int main(){
  long int n  = 1000*1000*1000;
  double *dresult, *result;
  cudaMalloc((void **)&dresult, NUMTHREADS*NUMBLOCKS*sizeof(double));
  result = new double[NUMTHREADS*NUMBLOCKS];
  leibniz<<<NUMBLOCKS, NUMTHREADS>>>(n, dresult);
  cudaMemcpy(result, dresult, NUMTHREADS*NUMBLOCKS*sizeof(double), cudaMemcpyDeviceToHost);
  double ans=0;
  for(int i=0; i < NUMBLOCKS*NUMTHREADS; i++)
    ans += result[i];
  cout<<"leibniz partial sum = "<<ans<<endl;
  cout<<"result[0] = "<<result[0]<<endl;
  cudaFree(dresult);
  delete[] result;
}



It is supposed to compute the n-th partial sum of the series 4 - 4/3 + 4/5 - …, which converges to pi. However, when I run the program I get the following output:


[root@ip-10-17-160-40 bq-gpu]# a.out

leibniz partial sum = -2.08849e+148

result[0] = -1.45682e+144

This output is dead wrong. However, if I modify the program by replacing ans += 4.0/(2.0*i+1.0) with ans += 1.0/(2.0*i+1.0) and result[tid] = 1.0*ans with result[tid] = 4.0*ans, I get the following output:

[root@ip-10-17-160-40 bq-gpu]# a.out

leibniz partial sum = 3.14159

result[0] = 4.00164

This output is correct.

What is going on? Here are some items that may help.

  1. The source file is compiled using the following command:

[root@ip-10-17-160-40 bq-gpu]# nvcc -arch=compute_20 -DNUMTHREADS=1024 bug1.cu

  2. The machine is a Tesla M2050 (verified using cudaGetDeviceProperties).

  3. The operating system is:

[root@ip-10-17-160-40 bq-gpu]# cat /etc/issue

CentOS release 5.5 (Final)

Kernel \r on an \m

  4. I have included part of the PTX for the buggy version (the one with 4.0/(2.0*i+1.0)):

 //<loop> Loop body line 16, nesting depth: 1, estimated iterations: unknown

	.loc	27	17	0

	cvt.rn.f64.s64 	%fd2, %rd2;

	mov.f64 	%fd3, 0d4010000000000000;	// 4

	add.f64 	%fd4, %fd2, %fd2;

	mov.f64 	%fd5, 0d3ff0000000000000;	// 1

	add.f64 	%fd6, %fd4, %fd5;

	div.rn.f64 	%fd7, %fd3, %fd6;

	add.f64 	%fd1, %fd1, %fd7;

	cvt.s64.s32 	%rd4, %r7;

	add.s64 	%rd2, %rd4, %rd2;

	setp.lt.s64 	%p2, %rd2, %rd3;

	@%p2 bra 	$Lt_0_2562;

It looks fine.

  5. If you have an M2050 machine, I would like to know if you can reproduce this behavior. If not, for $5 you can reproduce it using a GPU cluster on Amazon EC2. I have attached a zip folder that has the source, the ptx for the two cases, as well as the compilation command.

On EC2, there are two ways you can reproduce this behavior. First, you can put nvcc on your path, set LD_LIBRARY_PATH appropriately, run “yum install gcc-c++” to get the C++ compiler, and then use the compilation commands.

Second, you can follow the instructions in the README that ask you to update the kernel and install the device driver. The device driver installation gives an error about a missing glx component, but runs to completion. Then, if you like, you can run the CUDA toolkit installer (version 3.1).

It makes no difference which way you do it. The strange behavior will occur.

I would greatly appreciate help in resolving this issue.

There are speculations that CUDA 3.2 could have a buggy driver.
Check with CUDA 3.0 if possible.

The speculation about the broken driver is for CUDA 3.1 not CUDA 3.2.

Regardless of the speculation, I would really like it if people can reproduce the bug on M2050, which is NVIDIA’s latest and greatest offering. All the information for reproducing the bug is in my earlier post and as I said anyone can reproduce the bug for $5 ($2.10 actually if the EC2 setup is smooth). The bug is very embarrassing not to say shocking, and I still think perhaps something is off with my code.

Back to the speculation - CUDA toolkit does not include a driver. The driver is distinct from the toolkit. In addition, the earlier exchange about a broken driver for CUDA 3.1 is on 32 bit Linux, while the M2050 runs on 64 bit Linux.

I wonder if there is a way to contact NVIDIA and get the attention of their engineers. Any information will be appreciated.

I have a host machine with two S2050s connected to it (each of course contains four M2050 GPUs):
Device 7: “Tesla M2050”
CUDA Driver Version: 3.0
CUDA Runtime Version: 3.0
CUDA Capability Major revision number: 2
CUDA Capability Minor revision number: 0
Total amount of global memory: 3220897792 bytes
Number of multiprocessors: 14
Number of cores: 448
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.15 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4243455, CUDA Runtime Version = 3.0, NumDevs = 8, Device = Tesla M2050, Device = Tesla M2050

-bash-3.2$ cat /etc/issue
CentOS release 5.4 (Final)
Kernel \r on an \m

The output for your program is fine as far as I can see:
-bash-3.2$ ./a.out
leibniz partial sum = 3.14159
result[0] = 4.00164

When changing ans to be 1.0 instead of 4.0
I get this:
-bash-3.2$ nvcc -arch=compute_20 -DNUMTHREADS=1024 a.cu
-bash-3.2$ ./a.out
leibniz partial sum = 0.785398
result[0] = 1.00041

Does the problem still occur if you don’t try to allocate 7.5 GB of memory? Are you on a 64-bit platform?

How reproducible is this? Could it just be coincidence?

Thanks to Sarnath, eyalhir72 and tera for their replies. I have gone back and fixed the “result = new double[n]” line. That crept in when I was stripping down the code.

I went back to Amazon’s EC2 and tried to reproduce the error. The first two times, it actually worked. The next two times it did not. So I have updated the instructions for reproducing the bug. I think it may well have something to do with Amazon’s EC2 setup:

  1. Log in to a GPU machine (Tesla M2050 with Xeon 5570 host) on Amazon’s EC2.

  2. Update your path for nvcc and set LD_LIBRARY_PATH for the CUDA libraries.

  3. Run “yum install gcc-c++”.

  4. Save the program “bug1.cu” given above and compile it using “nvcc -arch=compute_20 -DNUMTHREADS=1024 bug1.cu”.

  5. Run a.out. If it fails as above, congratulations: you have reproduced the bug!

  6. If not, reinstall the CUDA toolkit. Amazon has included version 3.1 for OS version 5.4. I also tried version 3.0 for OS version 5.3 and version 3.2.16 for OS version 5.5. Compile and run again. The bug should occur now, if my experience is any guide.

What happens if you check return codes?
I’m surprised at myself I answered to your post without mentioning this, as it would be the most fundamental thing to do. Have you changed your initial post in this regard as well or have I really missed it?

The problem is that you are using too many registers.
You could easily see this adding a line to check the return code of the kernel:
printf("CUDA: %s\n", cudaGetErrorString(cudaGetLastError()));

With ans += 4.0/(2.0*i+1.0) the kernel requires 34 registers and it will fail to launch:

$nvcc --ptxas-options=-v -arch=sm_20 -DNUMTHREADS=1024 bug.cu
ptxas info : Compiling entry function '_Z7leibnizlPd' for 'sm_20'
ptxas info : Used 34 registers, 48 bytes cmem[0]

CUDA: too many resources requested for launch
leibniz partial sum = -2.08849e+148
result[0] = -1.45682e+144

With ans += 1.0/(2.0*i+1.0) the kernel requires only 20 registers and it will run just fine:

$nvcc --ptxas-options=-v -arch=sm_20 -DNUMTHREADS=1024 bug.cu
ptxas info : Compiling entry function '_Z7leibnizlPd' for 'sm_20'
ptxas info : Used 20 registers, 48 bytes cmem[0], 16 bytes cmem[16]

CUDA: no error
leibniz partial sum = 3.14159
result[0] = 4.00308

You should check the return code of the kernels and use the --ptxas-options=-v in your compilation to see how many registers are going to be needed.
If you want to run a block with 1024 threads, you need to be sure that the kernel uses at most 32 registers per thread (32768 registers per block / 1024 threads = 32).
BTW, the high register count for the first code snippet is a known bug that has been fixed internally and will be available in the next release.
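One way to act on this diagnosis (my own sketch, not something from the thread) is to cap the register count at compile time and let ptxas spill the excess to local memory:

```cuda
#ifndef NUMTHREADS
#define NUMTHREADS 1024
#endif

// Sketch: __launch_bounds__ tells ptxas the kernel will be launched with
// blocks of up to NUMTHREADS threads, so it must fit the per-block register
// budget (32768 / NUMTHREADS registers per thread on sm_20), spilling if
// necessary instead of failing to launch.
__global__ void __launch_bounds__(NUMTHREADS)
leibniz(long int n, double *result){
  int tid = threadIdx.x + blockIdx.x*blockDim.x;
  double ans = 0;
  int step = blockDim.x*gridDim.x;
  for(long int i = tid; i < n; i += step)
    ans += 4.0/(2.0*i + 1.0);
  result[tid] = ans;
}

// Alternatively, cap registers for the whole file from the command line:
//   nvcc -arch=sm_20 -maxrregcount=32 -DNUMTHREADS=1024 bug1.cu
```

Either form trades register pressure for possible local-memory spills, which is usually preferable to a kernel that silently never runs.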

Thanks to mfatica. I was trying the -Xptxas -v option, which is the documented way to find the number of registers used by the kernel, but that was not working.

I checked your explanation. However, on occasion the kernel seems to launch and execute correctly even with 34 registers. Why is that?

You may have some old data on the GPU.
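To rule out stale data masquerading as a correct result, the host code around the launch could be hardened along these lines (a sketch of my own, not code from the thread; it reuses the dresult/result variables from the posted program):

```cuda
// Clear the device buffer first, so a failed launch cannot leave old
// values that look like a plausible answer.
cudaMemset(dresult, 0, NUMTHREADS*NUMBLOCKS*sizeof(double));

leibniz<<<NUMBLOCKS, NUMTHREADS>>>(n, dresult);

// Launch-configuration failures (e.g. "too many resources requested for
// launch") are reported here; the <<<>>> expression itself returns nothing.
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));

// cudaMemcpy synchronizes with the kernel and also returns errors that
// occurred during its execution.
err = cudaMemcpy(result, dresult, NUMTHREADS*NUMBLOCKS*sizeof(double),
                 cudaMemcpyDeviceToHost);
if (err != cudaSuccess)
    fprintf(stderr, "copy failed: %s\n", cudaGetErrorString(err));
```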

To get the output from -Xptxas -v (the shorter but equivalent form of --ptxas-options=-v), you need to use -arch=sm_20.
If you use -arch=compute_20, you only generate the PTX, not the SASS, so ptxas is never invoked.