Sometimes results are accurate, sometimes not?


I wrote a CUDA-accelerated Sherman-Morrison algorithm that computes a scalar result from an input matrix A (512x512), a vector F0 (1x512), and, in each iteration, one column Yi (512x1) of a matrix Y (e.g. 512x1000).

I ran the code on random input data and compared two results: one from the CUDA computation, the other from a Matlab computation.

The weird part is that when I compare the CUDA and Matlab results, in around 998 cases out of 1000 they agree within a tolerable difference (from +/-0.0001 to +/-0.01; the values lie in the range 0 to 10). But in about two cases out of 1000 the difference is huge (+/-100 to +/-600).
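To pin down which iterations deviate, a comparison along the following lines could be run over the two result sets. This is only a sketch: the function name `first_outlier`, the array names `gpu` and `ref`, and the tolerance `tol` are hypothetical and not part of the code in this post.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the index of the first iteration whose
   CUDA result differs from the reference by more than tol, or -1 if
   every iteration agrees. The test is written as !(diff <= tol) so
   that a NaN result is also reported as an outlier. */
long first_outlier(const float *gpu, const double *ref, size_t n, double tol)
{
    assert(gpu != NULL && ref != NULL);
    for (size_t i = 0; i < n; ++i) {
        double diff = (double)gpu[i] - ref[i];
        if (diff < 0.0)
            diff = -diff;
        if (!(diff <= tol))
            return (long)i;
    }
    return -1;
}
```

Calling it with the 1000 scalar results from each side and a tolerance such as 0.01 would give the iteration number to inspect more closely.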

Some facts:

  • If I use different input data, sometimes there is no deviation, and sometimes there are one or two, at different iterations

  • If I repeat the computation on the same input data, the deviation appears at the same place (e.g. iteration 367)

  • If I isolate the step that I suspect is the problem and run just that step in CUDA, the results are normal.

What could be causing such occasional deviations?

My theories:

  • Some hardware error on the GPU (RAM maybe?) - I'm using the 512 MB version of the Gainward 8800GT Golden Sample

  • Some random hardware computational error or CUBLAS computational error

  • Not fully supported IEEE 754

  • Some programming error (but I don't think that is likely, because 99% of the results are OK; if it were an allocation problem, the error would mostly appear in the same spot, and the result following a bad one is OK again)
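The precision-related theories can be probed on the CPU without any GPU involved. The sketch below, with hypothetical helper names `dot_float` and `dot_double`, accumulates the same dot product once in single precision and once in double precision, which is what Matlab uses by default. It assumes a plain sequential sum; a GPU reduction adds the terms in a different order, so this only illustrates the kind of difference to expect, not the exact CUBLAS result.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: dot product accumulated in single precision. */
float dot_float(const float *x, const float *y, size_t n)
{
    assert(x != NULL && y != NULL);
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += x[i] * y[i];
    return acc;
}

/* Hypothetical helper: same inputs, accumulated in double precision. */
double dot_double(const float *x, const float *y, size_t n)
{
    assert(x != NULL && y != NULL);
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += (double)x[i] * (double)y[i];
    return acc;
}
```

With inputs such as x = {1e8, 1, -1e8} and y = {1, 1, 1}, the single-precision sum on typical hardware absorbs the 1 while the double-precision sum keeps it, so the two functions disagree even though neither contains a bug.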

CUDA code:

    //compute ai = A*Yi  (512x512*512x1)
    (void) cublasSgemv ('n', 512, 512, alpha, kAgpu, 512, kYigpu, 1, beta, kA2igpu, 1);

    //compute Yi^T*ai  (1x512*512x1)
    dp = cublasSdot (512, kYigpu, 1, kA2igpu, 1);

    //compute alfa
    alfa = 1 + dp;

    //compute A = alfa*a*a^T + A (result 512x512 matrix); cublasSger adds alfa*a*a^T to A
    (void) cublasSger (512, 512, alfa, kA2igpu, 1, kA2igpu, 1, kAgpu, 512);

    //compute w = A*Yi (512x512*512x1)
    (void) cublasSgemv ('n', 512, 512, alpha, kAgpu, 512, kYigpu, 1, beta, kWgpu, 1);

    //compute p = F0*w (1x512*512x1)
    dp = cublasSdot (512, kF0gpu, 1, kWgpu, 1);

Matlab equivalent code:




     alfa= 1+dp;


     A=alfa*(mai*mai') + A;



Today I will try the code on a brand-new test card, a 9800GTX, to rule out a hardware error. Does anyone have another explanation for what could be wrong?

Thanks and best regards!

Most likely, you have a race condition in your program…

Races are usually exposed easily on high-end, very fast hardware…


I did not read your post completely. It could be a precision issue as well. If you suspect precision loss, do an L1 error estimation as in some of the SDK samples (for example, the binomialOptions sample) and see whether you are within acceptable error limits.
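A minimal sketch of such an L1 check, assuming the usual form (sum of absolute differences divided by the sum of absolute reference values). The function name `l1_rel_error` and the array names are hypothetical, and the acceptable threshold is left to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: relative L1 error between GPU results and a
   reference, sum|ref - gpu| / sum|ref|. Returns -1.0 when the
   reference sums to zero, since the ratio is undefined there. */
double l1_rel_error(const float *gpu, const double *ref, size_t n)
{
    assert(gpu != NULL && ref != NULL);
    double sum_delta = 0.0;
    double sum_ref = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double d = ref[i] - (double)gpu[i];
        sum_delta += (d < 0.0) ? -d : d;
        sum_ref += (ref[i] < 0.0) ? -ref[i] : ref[i];
    }
    return (sum_ref > 0.0) ? sum_delta / sum_ref : -1.0;
}
```

Because this is an aggregate over all iterations, one or two huge outliers among 1000 results will dominate the value, so it is worth looking at the individual differences as well.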