Bug in basic arithmetic on M2050

Here is a simple CUDA program in its entirety, saved in a file called bug1.cu. (NUMTHREADS and NUMBLOCKS must both be even; they can be set using the -D option to nvcc.)


#ifndef NUMTHREADS
#define NUMTHREADS 32
#endif
#ifndef NUMBLOCKS
#define NUMBLOCKS 14
#endif

#include <iostream>
#include <cassert>
#include <cstdlib>
using namespace std;

__global__ void leibniz(long int n, double *result){
  int tid = threadIdx.x+blockIdx.x*blockDim.x;
  double ans=0;
  int step = blockDim.x*gridDim.x;
  for(long int i=tid; i < n; i+=step)
    ans += 4.0/(2.0*i+1.0);
  if(tid%2==1)  // step is even, so all of a thread's terms share one sign
    ans = -ans;
  result[tid] = 1.0*ans;
}


int main(){
  long int n  = 1000*1000*1000;
  double *dresult, *result;
  cudaMalloc((void **)&dresult, NUMTHREADS*NUMBLOCKS*sizeof(double));
  result = new double[NUMTHREADS*NUMBLOCKS];
  leibniz<<<NUMBLOCKS, NUMTHREADS>>>(n, dresult);
  cudaMemcpy(result, dresult, NUMTHREADS*NUMBLOCKS*sizeof(double), cudaMemcpyDeviceToHost);
  double ans=0;
  for(int i=0; i < NUMBLOCKS*NUMTHREADS; i++)
    ans += result[i];
  cout<<"leibniz partial sum = "<<ans<<endl;
  cout<<"result[0] = "<<result[0]<<endl;
  cudaFree(dresult);
  delete[] result;
}



It is supposed to compute the n-th partial sum of the series 4 - 4/3 + 4/5 - …, which converges to pi. However, when I run the program I get the following output:


[root@ip-10-17-160-40 bq-gpu]# a.out

leibniz partial sum = -2.08849e+148

result[0] = -1.45682e+144

This output is dead wrong. However, if I modify the program by replacing ans += 4.0/(2.0*i+1.0) with ans += 1.0/(2.0*i+1.0) and result[tid] = 1.0*ans with result[tid] = 4.0*ans, I get the following output:

[root@ip-10-17-160-40 bq-gpu]# a.out

leibniz partial sum = 3.14159

result[0] = 4.00164

This output is correct.

What is going on? Here are some items that may help.

  1. The source file is compiled using the following command:

[root@ip-10-17-160-40 bq-gpu]# nvcc -arch=compute_20 -DNUMTHREADS=1024 bug1.cu

  2. The machine is a Tesla M2050 (verified using cudaGetDeviceProperties).

  3. The operating system is:

[root@ip-10-17-160-40 bq-gpu]# cat /etc/issue

CentOS release 5.5 (Final)

Kernel \r on an \m

  4. I have included part of the PTX for the buggy version (the one with 4.0/(2.0*i+1.0)):

 //<loop> Loop body line 16, nesting depth: 1, estimated iterations: unknown

	.loc	27	17	0

	cvt.rn.f64.s64 	%fd2, %rd2;

	mov.f64 	%fd3, 0d4010000000000000;	// 4

	add.f64 	%fd4, %fd2, %fd2;

	mov.f64 	%fd5, 0d3ff0000000000000;	// 1

	add.f64 	%fd6, %fd4, %fd5;

	div.rn.f64 	%fd7, %fd3, %fd6;

	add.f64 	%fd1, %fd1, %fd7;

	cvt.s64.s32 	%rd4, %r7;

	add.s64 	%rd2, %rd4, %rd2;

	setp.lt.s64 	%p2, %rd2, %rd3;

	@%p2 bra 	$Lt_0_2562;

It looks fine.

  5. If you have an M2050 machine, I would like to know if you can reproduce this behavior. If not, for $5 you can reproduce it using a GPU cluster on Amazon EC2. I have attached a zip folder that has the source, the ptx for the two cases, as well as the compilation command.

On EC2, there are two ways you can reproduce this behavior. First, you can put nvcc on your path, set LD_LIBRARY_PATH appropriately, run “yum install gcc-c++” to get the C++ compiler, and then use the compilation commands.

Second, you can follow the instructions in the README that ask you to update the kernel and install the device driver. The device driver installation gives an error about a missing glx component, but runs to completion. Then, if you like, you can run the CUDA toolkit installer (version 3.1).

It makes no difference which way you do it. The strange behavior will occur.

I would greatly appreciate help in resolving this issue.

There are speculations that CUDA 3.2 could have a buggy driver.
Check with CUDA 3.0 if possible.

The speculation about the broken driver is for CUDA 3.1 not CUDA 3.2.

Regardless of the speculation, I would really like it if people can reproduce the bug on M2050, which is NVIDIA’s latest and greatest offering. All the information for reproducing the bug is in my earlier post and as I said anyone can reproduce the bug for $5 ($2.10 actually if the EC2 setup is smooth). The bug is very embarrassing not to say shocking, and I still think perhaps something is off with my code.

Back to the speculation - CUDA toolkit does not include a driver. The driver is distinct from the toolkit. In addition, the earlier exchange about a broken driver for CUDA 3.1 is on 32 bit Linux, while the M2050 runs on 64 bit Linux.

I wonder if there is a way to contact NVIDIA and get the attention of their engineers. Any information will be appreciated.

I have a host machine with two S2050s connected to it (each of course contains four M2050 GPUs):
Device 7: “Tesla M2050”
CUDA Driver Version: 3.0
CUDA Runtime Version: 3.0
CUDA Capability Major revision number: 2
CUDA Capability Minor revision number: 0
Total amount of global memory: 3220897792 bytes
Number of multiprocessors: 14
Number of cores: 448
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.15 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4243455, CUDA Runtime Version = 3.0, NumDevs = 8, Device = Tesla M2050, Device = Tesla M2050

-bash-3.2$ cat /etc/issue
CentOS release 5.4 (Final)
Kernel \r on an \m

The output for your program is fine as far as I can see:
-bash-3.2$ ./a.out
leibniz partial sum = 3.14159
result[0] = 4.00164

When changing ans to be 1.0 instead of 4.0
I get this:
-bash-3.2$ nvcc -arch=compute_20 -DNUMTHREADS=1024 a.cu
-bash-3.2$ ./a.out
leibniz partial sum = 0.785398
result[0] = 1.00041

Does the problem still occur if you don’t try to allocate 7.5 GB of memory? Are you on a 64-bit platform?

How reproducible is this? Could it just be coincidence?

Thanks to Sarnath, eyalhir72 and tera for their replies. I have gone back and fixed the “result = new double[n]” line. That crept in when I was stripping down the code.

I went back to Amazon’s EC2 and tried to reproduce the error. The first two times, it actually worked. The next two times it did not. So I have updated the instructions for reproducing the bug. I think it may well have something to do with Amazon’s EC2 setup:

  1. Log in to a GPU machine (Tesla M2050 with Xeon 5570 host) on Amazon’s EC2.

  2. Update your path for nvcc and set LD_LIBRARY_PATH for the CUDA libraries.

  3. Run “yum install gcc-c++”.

  4. Save the program “bug1.cu” given above and compile it using “nvcc -arch=compute_20 -DNUMTHREADS=1024 bug1.cu”.

  5. Run a.out. If it fails as above, congratulations: you have reproduced the bug!

  6. If not, reinstall the CUDA toolkit. Amazon has included version 3.1 for OS version 5.4. I also tried version 3.0 for OS version 5.3 and version 3.2.16 for OS version 5.5. Compile and run again. The bug should occur now, if my experience is any guide.

What happens if you check return codes?
I’m surprised at myself I answered to your post without mentioning this, as it would be the most fundamental thing to do. Have you changed your initial post in this regard as well or have I really missed it?

The problem is that you are using too many registers.
You could easily see this adding a line to check the return code of the kernel:
printf("CUDA: %s\n", cudaGetErrorString(cudaGetLastError()));

With ans += 4.0/(2.0*i+1.0) the kernel requires 34 registers and it will fail to launch:

$nvcc --ptxas-options=-v -arch=sm_20 -DNUMTHREADS=1024 bug.cu
ptxas info : Compiling entry function '_Z7leibnizlPd' for 'sm_20'
ptxas info : Used 34 registers, 48 bytes cmem[0]

CUDA: too many resources requested for launch
leibniz partial sum = -2.08849e+148
result[0] = -1.45682e+144

With ans += 1.0/(2.0*i+1.0) the kernel requires only 20 registers and it will run just fine:

$nvcc --ptxas-options=-v -arch=sm_20 -DNUMTHREADS=1024 bug.cu
ptxas info : Compiling entry function '_Z7leibnizlPd' for 'sm_20'
ptxas info : Used 20 registers, 48 bytes cmem[0], 16 bytes cmem[16]

CUDA: no error
leibniz partial sum = 3.14159
result[0] = 4.00308

You should check the return code of the kernels and use the --ptxas-options=-v in your compilation to see how many registers are going to be needed.
If you want to run a block with 1024 threads, you need to be sure that the kernel uses at most 32 registers per thread (32768 registers per block / 1024 threads = 32).
BTW, the high register count for the first code snippet is a known bug that has been fixed internally and will be available in the next release.
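One way to act on this diagnosis (my own sketch, not something from the thread) is to cap the register count at compile time and let ptxas spill the excess to local memory:

```cuda
#ifndef NUMTHREADS
#define NUMTHREADS 1024
#endif

// Sketch: __launch_bounds__ tells ptxas the kernel will be launched with
// blocks of up to NUMTHREADS threads, so it must fit the per-block register
// budget (32768 / NUMTHREADS registers per thread on sm_20), spilling if
// necessary instead of failing to launch.
__global__ void __launch_bounds__(NUMTHREADS)
leibniz(long int n, double *result){
  int tid = threadIdx.x + blockIdx.x*blockDim.x;
  double ans = 0;
  int step = blockDim.x*gridDim.x;
  for(long int i = tid; i < n; i += step)
    ans += 4.0/(2.0*i + 1.0);
  result[tid] = ans;
}

// Alternatively, cap registers for the whole file from the command line:
//   nvcc -arch=sm_20 -maxrregcount=32 -DNUMTHREADS=1024 bug1.cu
```

Either form trades register pressure for possible local-memory spills, which is usually preferable to a kernel that silently never runs.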

Thanks to mfatica. I was trying the -Xptxas -v option, which is the documented way to find the number of registers used by the kernel, but that was not working.

I checked your explanation. However, on occasion the kernel seems to launch and execute correctly even with 34 registers. Why is that?

You may have some old data on the GPU.
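To rule out stale data masquerading as a correct result, the host code around the launch could be hardened along these lines (a sketch of my own, not code from the thread; it reuses the dresult/result variables from the posted program):

```cuda
// Clear the device buffer first, so a failed launch cannot leave old
// values that look like a plausible answer.
cudaMemset(dresult, 0, NUMTHREADS*NUMBLOCKS*sizeof(double));

leibniz<<<NUMBLOCKS, NUMTHREADS>>>(n, dresult);

// Launch-configuration failures (e.g. "too many resources requested for
// launch") are reported here; the <<<>>> expression itself returns nothing.
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));

// cudaMemcpy synchronizes with the kernel and also returns errors that
// occurred during its execution.
err = cudaMemcpy(result, dresult, NUMTHREADS*NUMBLOCKS*sizeof(double),
                 cudaMemcpyDeviceToHost);
if (err != cudaSuccess)
    fprintf(stderr, "copy failed: %s\n", cudaGetErrorString(err));
```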

To get the output from -Xptxas -v (the shorter but equivalent form of --ptxas-options=-v), you need to use -arch=sm_20.
If you use -arch=compute_20, you only generate the PTX, not the SASS, so ptxas is never invoked.