Calculation differences between CUDA and MATLAB

I’ve implemented matrix inverse calculation in CUDA. I tried it with the CULA and with the CUBLAS libraries. I computed the same data in Matlab too. After the calculation I compared the results and realized that the difference between Matlab and CUDA is huge, and it doesn’t matter whether I use the CULA or the CUBLAS functions. The relative error can reach 1000%. I can’t understand why… The matrix isn’t too big, just 111×111. Does anybody have an idea why? I’d appreciate it.

Assuming this is not a mixup of column-major and row-major storage, this may also happen if your matrix is close to singular. Iterative refinement is a way to mitigate the problem. See also the Numerical Recipes section on iterative improvement.
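A minimal sketch of the iterative refinement idea, using NumPy as a stand-in for the GPU solver (the single-precision solve plays the role of the CUDA result; everything here is illustrative, not the poster's code):

```python
import numpy as np

def refine(A, b, iters=3):
    """Solve A x = b with a single-precision solver, then refine in double."""
    A32 = A.astype(np.float32)
    # Initial solve in single precision (the "GPU" step).
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        # Residual computed in double precision.
        r = b - A @ x
        # Correction solved with the same low-precision solver.
        d = np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)
        x += d
    return x
```

Each iteration shrinks the error by roughly a factor of cond(A)·eps_single, so for matrices that aren't too close to singular a few iterations recover near double-precision accuracy from a single-precision factorization.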

Thanks. My matrix is symmetric so I can’t mix up the storage, but the second piece of advice might work. I will try it and let you know.

Jacket uses both CUBLAS and CULA for various operations and the results identically match the CPU output. So, what you are trying to do should definitely work. You can try Jacket free for 15 days if that’ll help: http://accelereyes.com

Are you working with the same precision in both cases, or one in double and one in single? Matlab usually uses double precision, and even if you try to use single you need to be very careful that double doesn’t sneak up on you. The difference in accuracy is big: single gives about 7 significant digits while double gives around 15. Using different precisions is enough that a 1000% relative difference isn’t so surprising.

If both are the same precision then you probably have issues with singularity.
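To see how both effects combine, here is an illustrative NumPy experiment (assumed setup, not the original code): inverting a symmetric matrix with condition number around 1e8 in single vs. double precision. Since single precision carries only ~7 digits, a condition number near 1e8 wipes out essentially all accuracy in the inverse:

```python
import numpy as np

n = 100
rng = np.random.default_rng(1)
# Symmetric matrix with eigenvalues from 1 down to 1e-8 (cond ~ 1e8).
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
eigs = np.logspace(0, -8, n)
A = (Q * eigs) @ Q.T

inv64 = np.linalg.inv(A)
inv32 = np.linalg.inv(A.astype(np.float32)).astype(np.float64)

rel_err = np.linalg.norm(inv32 - inv64) / np.linalg.norm(inv64)
print(rel_err)  # large: single precision cannot resolve a cond ~1e8 matrix
```

If `cond(A)` (computable with `np.linalg.cond`, or `cond` in Matlab) times the machine epsilon of the working precision approaches 1, the computed inverse is meaningless, regardless of which library produced it.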

Or even better, try the GPU functionality already in Matlab (Parallel Computing Toolbox): http://www.mathworks.se/discovery/matlab-gpu.html

I heard that it is pretty good :)

Although I have very little experience with Jacket (played a bit with a trial license), I can tell you that the Parallel Computing Toolbox is much more limited than Jacket. Subscripting only arrived in Matlab 2011a, and it still doesn’t support bsxfun. Matrix multiply times are also a bit slower.