Is this a CUBLAS bug? Output is not consistent from run to run

Hi, all

I’m running an iterative linear solver on the GPU. It works fine on a Tesla C1060 but fails on a GTX 275 (I posted a topic about it earlier). Now it seems I have found where the problem is…

I wrote some sample code to test CUBLAS:

[codebox]
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>
#include <math.h>
#include <cublas.h>
#include <cutil.h>
#include <cuda.h>

#define REAL double

int main(int argc, char **argv)
{
    CUT_DEVICE_INIT(argc, argv);

    cublasStatus status;
    status = cublasInit();
    if (status != CUBLAS_STATUS_SUCCESS)
    {
        fprintf(stderr, "Fatal Error: CUBLAS init failed.\n");
        return -1;
    }

    int i, j, n;
    n = 100000;

    /* Host vectors filled with reproducible pseudo-random values. */
    REAL *x = (REAL *) malloc(n * sizeof(REAL));
    REAL *y = (REAL *) malloc(n * sizeof(REAL));
    srand(100);
    for (i = 0; i < n; i++)
    {
        x[i] = rand() / (RAND_MAX + 1.0);
        y[i] = rand() / (RAND_MAX + 1.0);
    }

    /* Device copies of x and y. */
    REAL *d_x, *d_y;
    CUDA_SAFE_CALL(cudaMalloc((void **)&d_x, (size_t)n * sizeof(REAL)));
    CUDA_SAFE_CALL(cudaMalloc((void **)&d_y, (size_t)n * sizeof(REAL)));
    CUDA_SAFE_CALL(cudaMemcpy(d_x, x, (size_t)n * sizeof(REAL), cudaMemcpyHostToDevice));
    CUDA_SAFE_CALL(cudaMemcpy(d_y, y, (size_t)n * sizeof(REAL), cudaMemcpyHostToDevice));

    /* GPU: 100 AXPY operations (y += x), then the dot product y.y. */
    for (i = 0; i < 100; i++)
        cublasDaxpy(n, 1.0, d_x, 1, d_y, 1);
    REAL t = cublasDdot(n, d_y, 1, d_y, 1);

    /* CPU reference: the same 100 AXPYs and dot product. */
    REAL t2 = 0.0;
    for (j = 0; j < 100; j++)
        for (i = 0; i < n; i++)
            y[i] += x[i];
    for (i = 0; i < n; i++)
        t2 += y[i] * y[i];

    printf("GPU = %lf, CPU = %lf\n", t, t2);

    free(x);
    free(y);
    CUDA_SAFE_CALL(cudaFree(d_x));
    CUDA_SAFE_CALL(cudaFree(d_y));

    status = cublasShutdown();
    if (status != CUBLAS_STATUS_SUCCESS)
    {
        fprintf(stderr, "Fatal Error: CUBLAS shutdown failed.\n");
        return -1;
    }

    return 0;
}
[/codebox]

As you can see, this code does nothing but 100 DAXPY operations and one dot product, on both the CPU and the GPU. Surprisingly, the GPU output is different on every run. Here is some output:

[codebox]
[~/GPU/testcublas] % ./test
Using device 0: GeForce GTX 275
GPU = 338244749.135282, CPU = 339742608.816349
[~/GPU/testcublas] % ./test
Using device 0: GeForce GTX 275
GPU = 338080902.402454, CPU = 339742608.816349
[~/GPU/testcublas] % ./test
Using device 0: GeForce GTX 275
GPU = 337694204.183996, CPU = 339742608.816349
[/codebox]

The GPU result varies from run to run.

I am working on a 64-bit Linux workstation located at MSI (Minnesota Supercomputing Center). The GPU card is a GTX 275, the installed CUDA version is 2.0, and the code is compiled with icc.

The code is attached.

Thanks a lot! :">
test.cu (1.51 KB)

Your code works on my machine: XP Pro 64-bit, VC2005, driver 190.38, CUDA 2.3.

Here is the result:

[codebox]
H:\project_2008\GPU\example\forum_BingBing\release>forum_BingBing.exe
Using device 0: GeForce GTX 295
GPU = 337850370.160344, CPU = 337850370.160398
H:\project_2008\GPU\example\forum_BingBing\release>forum_BingBing.exe
Using device 0: GeForce GTX 295
GPU = 337850370.160344, CPU = 337850370.160398
H:\project_2008\GPU\example\forum_BingBing\release>forum_BingBing.exe
Using device 0: GeForce GTX 295
GPU = 337850370.160344, CPU = 337850370.160398
[/codebox]

I think you should try CUDA 2.3.

Thanks!

I tried CUDA 2.3. Here is the makefile:

[codebox]
default: test.o
	g++ -o test test.o -L/usr/local/cuda-toolkit/2.3/lib64 -lcudart -lcublas

test.o: test.cu
	/usr/local/cuda-toolkit/2.3/bin/nvcc -o test.o -c -arch=sm_13 -O3 -I. test.cu
[/codebox]

But it also failed.

Can anyone help?

Thanks

It looks like a bug. I can reproduce the failure, and from a quick scan your code is correct.

What card did you test it on?

When I worked on a Tesla C1060, this never happened.

It is only when I run it on the GTX 275 that I see this strange behavior.

Has anything come of this? Is it listed as an official bug somewhere? I’m having the same problem and it’s throwing my calculations out of whack. It seems to be cublasDdot that is going awry here as well.

Any suggestions on how to get around it? Write my own dot product kernel, I guess (see the sketch below)? Or is there another double-precision dot product in the SDK somewhere?

BTW, my card is a GTX 285
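A minimal sketch of such a replacement kernel, assuming a plain shared-memory block reduction with the final sum done on the host (the names dotKernel and myDdot, the block size, and the fixed grid size are illustrative choices, not anything from CUBLAS or the SDK; it needs -arch=sm_13 or newer for double precision):

[codebox]
#include <cuda_runtime.h>

#define BLOCK 256

// Each block accumulates partial products into shared memory and reduces
// them; block 0..gridDim-1 each write one partial sum to blockSums.
__global__ void dotKernel(int n, const double *x, const double *y, double *blockSums)
{
    __shared__ double cache[BLOCK];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    double sum = 0.0;

    // Grid-stride loop so any n works with a fixed grid size.
    for (int i = tid; i < n; i += blockDim.x * gridDim.x)
        sum += x[i] * y[i];

    cache[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction within the block (blockDim.x is a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = cache[0];
}

// Host wrapper: launch the kernel, copy the per-block sums back, add them up.
double myDdot(int n, const double *d_x, const double *d_y)
{
    const int blocks = 64;
    double *d_partial, h_partial[64];
    cudaMalloc((void **)&d_partial, blocks * sizeof(double));
    dotKernel<<<blocks, BLOCK>>>(n, d_x, d_y, d_partial);
    cudaMemcpy(h_partial, d_partial, blocks * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_partial);

    double result = 0.0;
    for (int i = 0; i < blocks; i++)
        result += h_partial[i];
    return result;
}
[/codebox]

With a fixed launch configuration like this, the summation order never changes, so at the very least repeated runs on the same data should be bit-for-bit reproducible.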

It was a driver bug. Try updating to the latest 190.xx driver.

I still have a similar problem (inaccurate and changing results) right now (July) with the SDK example (simpleCublas): GTX 275, driver 258.96, Windows 7 64-bit. The differences are small, but they creep up to the point where the result is “FAILED”, with an occasional “PASSED”. This is with CUDA 3.0 and 3.1. The identical executable works fine on a Quadro FX 770M. The loaded CUBLAS DLL is cublas32_31_9.dll on the 275 and cublas32_31_4.dll on the Quadro (checked in the debug output).

The pass/fail test in the program is done on error_norm/ref_norm. I checked this value. On the 275, the best results are just under 1e-6, but it will reach about 4e-6 :wacko: . On the Quadro it is consistently 7e-8.
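For context, that pass/fail decision is essentially a relative L2-norm test of the GPU result against a host-computed reference. A sketch of that kind of check, written from memory rather than copied from the SDK (variable names and the 1e-6 threshold are illustrative):

[codebox]
#include <math.h>

/* Relative-error check: ||gpu - ref|| / ||ref|| against a tolerance.
   Returns 1 for PASSED, 0 for FAILED. */
int checkResult(const float *gpu, const float *ref, int n)
{
    double error_norm = 0.0, ref_norm = 0.0;
    for (int i = 0; i < n; ++i) {
        double diff = gpu[i] - ref[i];
        error_norm += diff * diff;
        ref_norm   += (double)ref[i] * ref[i];
    }
    error_norm = sqrt(error_norm);
    ref_norm   = sqrt(ref_norm);
    return (error_norm / ref_norm < 1e-6) ? 1 : 0;
}
[/codebox]

So a value that drifts between roughly 1e-6 and 4e-6 will flip between PASSED and FAILED from run to run, which matches what I see.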

Anyone?

thanks in advance,

Jan