Same kernel, different behavior on Linux and Windows

I have a problem with CUDA programming: running the same function in two different environments returns different results. In detail, I have developed an application in Visual Studio 2010 that produces a DLL (then called from MATLAB as a mex file), which internally calls a .cu file containing my CUDA functions. When I compile it on my laptop (CUDA 4.0, GeForce G102M, compute capability 1.1) it runs, and when I test it, it works well, giving the expected results.

Then, when I try to run the same code on a remote machine running Linux with CUDA 3.2, it gives me wrong results. What we have done is compile the .cu file on Linux with nvcc (command line added below), then compile the .cpp file with g++ 4.1. After that, I call the same MATLAB function; it runs, but it returns very different (and clearly incorrect) results. I do not know why this happens: whether it is because the graphics card is different (way better on the Linux machine), because the CUDA driver version is different, or because I should add synchronization calls somewhere (but in that case, why does it work on my laptop?).

If you have any idea how to address this problem, it would be much appreciated.


This is the compiler line launched by Visual Studio on my laptop:
-gencode=arch=compute_10,code="sm_10,compute_10" <-- is this very important?
--use-local-env --cl-version 2010
-ccbin "c:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin\x86_amd64"
-I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\include"
-G0 --keep-dir "x64\Debug" -maxrregcount=0 --machine 64 --compile -D_NEXUS_DEBUG -g
-Xcompiler "/EHsc /nologo /Od /Zi /MDd "
-o "x64\Debug\" "C:\Users\Mattia\documents\visual studio 2010\Projects\mex_cuda_wrapper_dll\mex_cuda_wrapper_dll\"

This is the command line used to compile on the Linux machine:
nvcc -o obj/cuda_func.os -c -shared
-Iinclude -I/usr/local/opt/Matlab2008a/extern/include -I/common/inc -I/usr/local/cuda/include -Imodules/segmentation/inc -shared
-Xcompiler -fPIC

What’s the card on the remote machine? I think there is something wrong with your kernel (maybe a race condition), which would explain why you get different results when running it on a GPU with a higher processor clock.
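Since the toolkit, driver and card all differ between the two machines, it may help to print what the runtime actually sees on each of them. The following is only a sketch: the function name `print_cuda_environment` is made up for illustration, and it should be run on the machine where MATLAB is launched, not on the one used for compiling.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): prints the CUDA runtime
// and driver versions plus the name and compute capability of every device.
// Returns the number of devices, or -1 if a query fails.
int print_cuda_environment(void)
{
    int runtime = 0, driver = 0, count = 0;

    cudaRuntimeGetVersion(&runtime);   // e.g. 3020 for CUDA 3.2, 4000 for 4.0
    cudaDriverGetVersion(&driver);
    printf("runtime version %d, driver version %d\n", runtime, driver);

    if (cudaGetDeviceCount(&count) != cudaSuccess) return -1;

    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, d) != cudaSuccess) return -1;
        printf("device %d: %s, compute capability %d.%d\n",
               d, prop.name, prop.major, prop.minor);
    }
    return count;
}
```

Comparing the reported compute capability with the architecture passed to nvcc (the `-gencode` option in the Windows build) shows whether the binary was built for the card it runs on.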


The card on the remote machine is a GeForce GTX 295, but I should also mention this:

me: connect to the remote server, compile there, ssh to the machine with the GPU card, and launch MATLAB there

Might this introduce any problems?

And, about analyzing my kernel, any advice on how to identify possible race conditions?

Thank you anyway


Are you checking return codes of all CUDA calls for errors? The likely reason for your problem is that the kernel does not launch at all on the Linux machine.
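A minimal sketch of what checking every call could look like. The macro `CUDA_CHECK`, the kernel `scale` and the wrapper `scale_on_gpu` are placeholders invented for this example, not the poster's code. It uses `cudaThreadSynchronize`, the name available in CUDA 3.2; on CUDA 4.0 and later `cudaDeviceSynchronize` is the replacement.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: stop with a readable message if a CUDA
// runtime call does not return cudaSuccess.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err_));    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Placeholder kernel standing in for the real one.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Multiplies n host floats by factor on the GPU, checking every step.
void scale_on_gpu(float *host, int n, float factor)
{
    float *dev = NULL;
    size_t bytes = n * sizeof(float);

    CUDA_CHECK(cudaMalloc((void **)&dev, bytes));
    CUDA_CHECK(cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice));

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(dev, n, factor);
    CUDA_CHECK(cudaGetLastError());        // reports a failed launch
    CUDA_CHECK(cudaThreadSynchronize());   // reports errors raised while the kernel ran

    CUDA_CHECK(cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(dev));
}
```

A kernel launch returns nothing itself, which is why the launch is followed by `cudaGetLastError`. Inside a mex file, calling `exit` would also close MATLAB, so reporting the error string back to the caller is probably the better choice there.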

This bit confuses me a little: are you saying that your notebook runs Windows, the remote machine runs Linux, and you are (cross-)compiling the program for the remote machine on your notebook?

- Yes, I check every kernel execution and all memcpy operations too; all succeed. Furthermore, if the kernel were not running, the MATLAB mex function would return an all-zero variable; instead it outputs nonsense values (an error rate that is negative or > 1). By contrast, the same script works perfectly on Windows, and the results coincide with the CPU-only code.

- This confuses me too ;). Anyway: I write the .cpp and .cu files on Windows, compile them there, and test them in my local MATLAB (all on Windows, no errors). Then I copy the .cpp and .cu files to the remote machine (Linux). There (in a Linux environment) I compile with nvcc and g++ 4.1, producing a .so file that I rename to .mexa64 so that I can call it as a MATLAB function. Then I go via ssh to another remote Linux machine (where I am not allowed to compile my files directly), open MATLAB, and test my script… there it does not run properly :(

If you have any ideas about where I might look, I will try them.



One thing I just remembered:

The two kernels aren’t exactly the same: I had to comment out a cudaDeviceSynchronize(); and a cudaDeviceReset(); because they did not compile on Linux (I suppose because CUDA 3.2 is installed there instead of 4.0 as on my laptop).

You can try out Jacket SDK and see if that helps.

But if you comment out “cudaDeviceSynchronize”, you may be trying to get results before the kernel has finished. Did you add “cudaThreadSynchronize” for the older toolkit instead of “… Device”?
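Rather than commenting the calls out, one option is to pick the right name at compile time. This is a sketch under the assumption that the same source must build with both CUDA 3.2 and CUDA 4.0; the wrapper names `sync_device` and `reset_device` are invented here.

```cuda
#include <cuda_runtime.h>

// CUDART_VERSION comes from the runtime headers: 3020 for CUDA 3.2,
// 4000 for CUDA 4.0. The cudaDevice* names were introduced in 4.0.

// Hypothetical wrapper: block until all previously issued GPU work is done.
static inline cudaError_t sync_device(void)
{
#if CUDART_VERSION >= 4000
    return cudaDeviceSynchronize();
#else
    return cudaThreadSynchronize();
#endif
}

// Hypothetical wrapper: release the resources held on the current device.
static inline cudaError_t reset_device(void)
{
#if CUDART_VERSION >= 4000
    return cudaDeviceReset();
#else
    return cudaThreadExit();
#endif
}
```

With wrappers like these the Windows and Linux builds keep the same call sites, so the two versions of the code stay identical.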

Yes, but nothing has changed. What I will do now is start from scratch, add one piece at a time, and see where the results start diverging.

Solved: it was a missing initialization of a vector to 0, which for some reason was performed automatically on my laptop.
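For anyone hitting the same symptom, here is a sketch of the kind of fix described, assuming the vector was a device buffer that the kernel adds into (the post does not say which vector it was). The names `accumulate` and `accumulate_on_gpu` are placeholders. `cudaMalloc` does not clear the memory it returns, so the buffer may hold zeros on one machine and leftover data on another.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel: each thread ADDS into out[], so the result is only
// correct if out[] starts at zero.
__global__ void accumulate(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] += in[i];
}

// Copies n floats to the GPU, accumulates them into a zeroed buffer and
// copies the result back. Returns the first CUDA error encountered.
cudaError_t accumulate_on_gpu(const float *host_in, float *host_out, int n)
{
    float *d_in = NULL, *d_out = NULL;
    size_t bytes = n * sizeof(float);
    cudaError_t err;

    if ((err = cudaMalloc((void **)&d_in, bytes)) != cudaSuccess) return err;
    if ((err = cudaMalloc((void **)&d_out, bytes)) != cudaSuccess) {
        cudaFree(d_in);
        return err;
    }

    // The explicit initialization: all-zero bytes are 0.0f for floats.
    err = cudaMemset(d_out, 0, bytes);
    if (err == cudaSuccess)
        err = cudaMemcpy(d_in, host_in, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int threads = 128;
        accumulate<<<(n + threads - 1) / threads, threads>>>(d_in, d_out, n);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(host_out, d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return err;
}
```

The same reasoning applies to host buffers obtained with `malloc`, which are not guaranteed to be zeroed either.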