Hello,
I made a CUDA-accelerated Sherman-Morrison algorithm that computes a scalar result from the input matrix A (512x512), the vector F0 (1x512) and, at each iteration i, one column Yi (512x1) of the matrix Y (e.g. 512x1000).
I ran my code on random input data and got two sets of results: one computed with CUDA, the other with Matlab.
The weird part is that when I compare the CUDA and Matlab results, in around 998 cases out of 1000 they agree within a tolerable difference (from +/-0.0001 to +/-0.01; the values are in the range 0 to 10). But in two cases out of 1000 the difference is huge (+/-100 to +/-600).
Some facts:
- If I use different input data, sometimes there is no deviation, and sometimes there are one or two at different iterations.
- If I run the computation on the same input data, the deviation shows up at the same location (e.g. iteration 367).
- If I compute only the step that I tracked down as the possible problem, on its own in CUDA, the results are normal.
What could be causing such occasional deviations?
My theories:
- Some hardware error on the GPU (RAM maybe?). I'm using the 512 MB version of the Gainward 8800GT Golden Sample.
- Some random hardware computational error or a CUBLAS computational error.
- Incomplete IEEE 754 support.
- Some programming error (but I don't think that is likely, because 99% of the results are OK, and if it were an allocation problem it would mostly show up at the same spot; also, the result right after the erroneous one is OK again).
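One way to probe the precision theory would be to log the denominator 1 + dp at every iteration. Since alfa = -1/(1 + dp) in the code below, an error e in dp changes alfa by roughly e/(1 + dp)^2, so a tiny single-precision rounding error gets magnified a lot whenever 1 + dp is close to zero. Here is a minimal sketch in plain C. The helper names and the threshold are made up for illustration, and I have not established that this is what happens in the bad iterations.

```c
#include <assert.h>

/* Absolute value of a double (written out to avoid depending on libm). */
double abs_d(double x) { return x < 0.0 ? -x : x; }

/* alfa exactly as in the post, but in double precision for reference. */
double alfa_double(double dp) { return -1.0 / (1.0 + dp); }

/* First-order estimate of the error in alfa caused by an error dp_err in dp:
   d(alfa)/d(dp) = 1/(1 + dp)^2. */
double alfa_error_estimate(double dp, double dp_err) {
    double denom = 1.0 + dp;
    return abs_d(dp_err) / (denom * denom);
}

/* Hypothetical check: flag an iteration when |1 + dp| is small compared
   with the size of its operands. The threshold is an arbitrary choice. */
int denominator_is_risky(double dp, double threshold) {
    double scale = abs_d(dp) > 1.0 ? abs_d(dp) : 1.0;
    assert(threshold > 0.0);
    return abs_d(1.0 + dp) < threshold * scale;
}
```

For example, with dp = -0.9999 an error of 1e-6 in dp gives an estimated error of about 100 in alfa, while with dp = 2.5 the same error stays below 1e-7. If the flagged iterations coincide with the deviating ones, that would point at conditioning rather than at the hardware.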
CUDA code:
//compute ai = A*Yi (512x512*512x1)
(void) cublasSgemv ('n',512,512,alpha, kAgpu, 512, kYigpu, 1, beta,kA2igpu,1);
//compute Yi^T*ai (1x512*512x1)
dp = cublasSdot (512, kYigpu, 1, kA2igpu, 1);
//compute alfa
alfa=1 + dp;
alfa=-1/alfa;
//compute A=alfa*a*a^T + A (alfa already carries the minus sign; result 512x512 matrix)
(void) cublasSger (512, 512, alfa, kA2igpu,1, kA2igpu,1,kAgpu,512);
//compute w=A*Yi (512x512*512x1)
(void) cublasSgemv ('n',512,512,alpha, kAgpu, 512, kYigpu, 1, beta,kWgpu,1);
//compute p=F0*w (1x512*512x1)
dp = cublasSdot (512, kF0gpu, 1, kWgpu, 1);
Matlab equivalent code:
mai=A*Yi(:,i);
dp=Yi(:,i)'*mai;
alfa= 1+dp;
alfa=-1/alfa;
A=alfa*(mai*mai') + A;
mW=A*Yi(:,i);
[mppp(i)]=F0*mW;
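For a third opinion besides CUDA and Matlab, the same step could be run on the CPU in double precision and compared at the suspicious iteration. The following is only a sketch under a few assumptions: the function names are hypothetical, A is stored column-major as CUBLAS expects, and a and w are caller-provided scratch vectors of length n.

```c
#include <assert.h>
#include <stddef.h>

/* out = A * x for a column-major n x n matrix A. */
void ref_matvec(int n, const double *A, const double *x, double *out) {
    for (int i = 0; i < n; ++i) {
        double s = 0.0;
        for (int j = 0; j < n; ++j)
            s += A[i + (size_t)j * n] * x[j];
        out[i] = s;
    }
}

/* Dot product of two length-n vectors. */
double ref_dot(int n, const double *x, const double *y) {
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += x[i] * y[i];
    return s;
}

/* One update step mirroring the five operations of the post.
   A is updated in place; the return value is F0 * w. */
double sm_step_ref(int n, double *A, const double *y, const double *f0,
                   double *a, double *w) {
    assert(n > 0);
    ref_matvec(n, A, y, a);              /* a = A * Yi            */
    double dp = ref_dot(n, y, a);        /* dp = Yi^T * a         */
    double alfa = -1.0 / (1.0 + dp);     /* alfa = -1 / (1 + dp)  */
    for (int j = 0; j < n; ++j)          /* A = alfa * a * a^T + A */
        for (int i = 0; i < n; ++i)
            A[i + (size_t)j * n] += alfa * a[i] * a[j];
    ref_matvec(n, A, y, w);              /* w = A * Yi            */
    return ref_dot(n, f0, w);            /* p = F0 * w            */
}
```

As a quick sanity check, with n = 2, A the identity, Yi = (1, 0) and F0 = (1, 1), the step gives dp = 1, alfa = -0.5, the updated A = diag(0.5, 1) and the result 0.5.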
Today I will try the code on a brand new 9800GTX test card to rule out a hardware error. Does anyone have another explanation for what could be wrong?
Thanks and best regards!