NVMEX + Single Precision FLOPS on GPU + CPU wrapper


I have been stuck on a problem for a long time. I have tried everything I could think of, but nothing has worked :(

Background: I have a GeForce 9800GTX+ (CUDA compute capability 1.1, which does not support double precision). I am working on a feature recognition algorithm that involves a lot of data dependency and many rounds of convolution (4th-order IIR filter). I am using nvmex because half of the project is in MATLAB and the other half is in C.
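
To show why the recursive filter makes me nervous about precision, here is a minimal sketch (not my actual code). The function names `iir_filter` and `precision_gap` and any coefficients passed in are made up for illustration; it assumes a direct-form recursion with `a[0] == 1`. It runs the same recursion once in float and once in double and reports the largest gap, since each output is fed back into later outputs and rounding differences are carried forward.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical direct-form IIR filter, evaluated entirely in type T:
//   y[n] = sum_k b[k] * x[n-k]  -  sum_{k>=1} a[k] * y[n-k]
// Assumes a[0] == 1 (normalised coefficients).
template <typename T>
std::vector<T> iir_filter(const std::vector<double>& b,
                          const std::vector<double>& a,
                          const std::vector<double>& x) {
    assert(!a.empty() && a[0] == 1.0);
    std::vector<T> y(x.size());
    for (std::size_t n = 0; n < x.size(); ++n) {
        T acc = T(0);
        for (std::size_t k = 0; k < b.size() && k <= n; ++k)
            acc += static_cast<T>(b[k]) * static_cast<T>(x[n - k]);
        for (std::size_t k = 1; k < a.size() && k <= n; ++k)
            acc -= static_cast<T>(a[k]) * y[n - k];  // feedback: old rounding errors re-enter here
        y[n] = acc;
    }
    return y;
}

// Largest absolute difference between the float run and the double run.
double precision_gap(const std::vector<double>& b,
                     const std::vector<double>& a,
                     const std::vector<double>& x) {
    const std::vector<float> yf = iir_filter<float>(b, a, x);
    const std::vector<double> yd = iir_filter<double>(b, a, x);
    double worst = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
        worst = std::max(worst, std::fabs(yd[i] - static_cast<double>(yf[i])));
    return worst;
}
```

Calling `precision_gap` with the real filter coefficients and a real input row would give a rough idea of how much of the mismatch comes from single precision alone, before the GPU is even involved.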

Problem: Everything seems to be working fine as far as mex/nvmex goes. Since MATLAB is essentially double precision, I convert the inputs from double to single in MATLAB before passing them to both the CPU and the GPU versions of the code (the .cpp and .cu files).

But the final outputs of the CPU and GPU versions (of the parts of the project being compared) do not match!! Earlier the difference was large, but after a lot of effort I have brought it down to less than 0.0001, i.e. it starts only at the 5th decimal place. The application is so sensitive, however, that even such a small discrepancy ruins the final output. It's not even close to what the CPU version produces!!
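
For reference, this is roughly how the mismatch can be measured. It is a minimal sketch, not my actual code: the names `ordered_bits`, `ulp_distance`, `Mismatch` and `compare_outputs` are made up, and it assumes both outputs are available as float arrays of equal length with no NaNs. Besides the absolute difference, it counts how many representable floats lie between the two values (ULP distance), which says more than a fixed 0.0001 threshold when the values vary a lot in magnitude.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <limits>
#include <vector>

// Map a float's bit pattern to an integer that grows monotonically with the
// float's value, so that subtracting two of them counts representable steps.
std::int64_t ordered_bits(float x) {
    std::int32_t i;
    std::memcpy(&i, &x, sizeof i);
    const std::int64_t lowest = std::numeric_limits<std::int32_t>::min();
    return (i < 0) ? (lowest - i) : static_cast<std::int64_t>(i);
}

// Number of representable floats between a and b (0 means bit-identical,
// except that +0 and -0 also count as 0). NaNs are not handled.
std::int64_t ulp_distance(float a, float b) {
    const std::int64_t d = ordered_bits(a) - ordered_bits(b);
    return d < 0 ? -d : d;
}

struct Mismatch {
    double max_abs;           // largest |cpu[i] - gpu[i]|
    std::int64_t max_ulp;     // largest ULP distance
    std::size_t worst_index;  // index of the largest absolute difference
};

Mismatch compare_outputs(const std::vector<float>& cpu,
                         const std::vector<float>& gpu) {
    assert(cpu.size() == gpu.size());
    Mismatch m{0.0, 0, 0};
    for (std::size_t i = 0; i < cpu.size(); ++i) {
        const double d = std::fabs(static_cast<double>(cpu[i]) -
                                   static_cast<double>(gpu[i]));
        if (d > m.max_abs) { m.max_abs = d; m.worst_index = i; }
        const std::int64_t u = ulp_distance(cpu[i], gpu[i]);
        if (u > m.max_ulp) m.max_ulp = u;
    }
    return m;
}
```

Running this on the intermediate arrays after each filter stage, rather than only on the final output, should show where the two versions first start to drift apart.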

Here is what I did to bring the error down to the 5th decimal place:

  1. Tried “scaling” the inputs inside the .cpp/.cu versions. It did not help.

  2. Used __fmul_rn and __fadd_rn for all multiplications and additions, respectively, so that the compiler cannot merge them into FMAD instructions. This improved accuracy to some extent.

  3. Converted the .cpp and .cu outputs to fixed point and then back to double in MATLAB. The output changes, but the project output is still far from what the CPU version gives.

  4. I realized that the biggest error creeps in wherever there is a float*float on the GPU, even if the result lies within the float range. The result of even a simple float*float differs between calculation by hand, the CPU and the GPU (even by hand and CPU do not match!!). So I cast every addition and multiplication to float, though I did not observe much visible improvement in accuracy.

  5. I am trying to “scale” the inputs at the MATLAB level too, but this is not trivial (because of the special toolboxes involved) and does not seem to make much difference in the end.

  6. I have noticed one more thing, which I am now working on and need immediate help with. So far, I was converting double to single in MATLAB, with the idea of giving identically converted inputs to both the CPU and the GPU. However, it seems that the prhs inputs received via mex/nvmex are slightly different in the CPU and GPU versions (i.e. the .cpp and .cu files). This was observed at only a few places, but it looks like a possible source of error. To address it, I am trying to create a .cpp wrapper around my .cu file. The wrapper will do the double-to-float conversion (just as the .cpp version does on the CPU) and then pass only floats to the .cu function called inside it. This will also avoid the double-to-single conversion in MATLAB, which seems to lose more accuracy than the conversion in C/C++. I wrote a .h file to expose the .cu functions to the wrapper and also included “cuda.h” in the wrapper code. I placed the .h file in the C:/CUDA/include folder to avoid long command-line options during compilation. It compiles properly, but only when I compile the .cpp and .cu together using nvmex with the -L switch (with the library names specified) and the -I switch; if I compile them separately with nvmex, I get compilation errors. I also get a linking error when I run it, which points to some mexFunction not being used properly. Can someone please help!! I don't know whether my approach here is correct!!
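
To make item 6 concrete, here is a minimal sketch of the conversion step I want the wrapper to perform. It is not my actual code: `to_single` and `to_double` are made-up helper names, and the MEX calls appear only in comments so that the snippet compiles without MATLAB. I am assuming that the wrapper .cpp is the only file that defines `mexFunction` and that the .cu file exposes a plain function (hypothetically `run_filter_gpu`) declared in the shared .h file.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: narrow MATLAB's double input to float once, on the
// host, so the CPU and GPU paths can be handed the very same float buffer.
std::vector<float> to_single(const double* src, std::size_t n) {
    assert(src != nullptr || n == 0);
    std::vector<float> dst(n);
    for (std::size_t i = 0; i < n; ++i) {
        // Under the default rounding mode this rounds to the nearest float.
        dst[i] = static_cast<float>(src[i]);
    }
    return dst;
}

// Hypothetical helper: widen float results back to double for MATLAB.
// Every float is exactly representable as a double, so this step is lossless.
void to_double(const std::vector<float>& src, double* dst) {
    for (std::size_t i = 0; i < src.size(); ++i) {
        dst[i] = static_cast<double>(src[i]);
    }
}

// Inside the wrapper's mexFunction (assumed to be the only mexFunction in
// the whole build), the flow would be roughly:
//   const double* in = mxGetPr(prhs[0]);
//   std::size_t n    = mxGetNumberOfElements(prhs[0]);
//   std::vector<float> fin = to_single(in, n);
//   std::vector<float> fout(n);
//   run_filter_gpu(fin.data(), fout.data(), n);   // hypothetical .cu entry point
//   plhs[0] = mxCreateDoubleMatrix(n, 1, mxREAL);
//   to_double(fout, mxGetPr(plhs[0]));
```

If the CPU version used the same two helpers, both paths would at least start from bit-identical float inputs, which would rule out the input conversion as the source of the mismatch.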

Please note that I have removed all memory optimizations from the .cu code for now. There is only one block of threads running, and there is no deliberate use of shared memory or texture memory. The primary goal right now is to get the GPU version of the application to follow the CPU version closely.

Please also note that I am not very experienced with all this and have never coded or explored programming tools and compilers to this extent before. My only source of understanding is “google”. Still, it doesn't seem like I am doing anything drastically wrong. So I would request that you reply in simple terms if you can.

I have been stuck on this problem for a long time and will be really thankful if someone can share their experience with me. :( :( :(

Thanks & regards,