Difference Between Simulation and Device Execution: probably a memory problem

I will describe my problem without posting any code, because the code is too long and the bug is difficult to locate.

I have a Tesla C1060 on a Fedora 10 machine with CUDA 2.1. I cannot use cuda-gdb because I get an error caused by a bug known to the CUDA developers, so to debug my code I store variables in global memory and read them back.
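To give an idea of this debugging approach without posting the real code, here is a minimal sketch. Every name in it (`inner`, `outer`, `kernel`, `run_kernel`, the two-slot debug layout) is hypothetical and made up for illustration; it is not my actual code, only the general pattern of writing intermediate values to a global-memory buffer and copying them back.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

static const int N = 4;  // number of threads in this toy example

// Hypothetical inner device function: computes an integer and, if a debug
// slot is supplied, records its intermediate value in global memory.
__device__ int inner(int x, int *dbg)
{
    int tmp = x * 2;           // intermediate value to inspect
    if (dbg) dbg[0] = tmp;     // debug store to global memory
    return tmp + 1;
}

// Hypothetical outer device function that calls inner().
__device__ int outer(int x, int *dbg)
{
    int r = inner(x, dbg);
    if (dbg) dbg[1] = r;       // record the value returned by inner()
    return r;
}

__global__ void kernel(int *out, int *dbg)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Each thread writes to its own two-slot region of the debug buffer.
    int *slot = dbg ? dbg + 2 * tid : NULL;
    out[tid] = outer(tid, slot);
}

// Launches the kernel on N threads and copies the results back.
// With use_debug == 0 the kernel runs without any debug stores.
// Returns 0 on success, 1 on any CUDA error.
int run_kernel(int *h_out, int *h_dbg, int use_debug)
{
    int *d_out = NULL, *d_dbg = NULL;
    if (cudaMalloc((void **)&d_out, N * sizeof(int)) != cudaSuccess)
        return 1;
    if (use_debug &&
        cudaMalloc((void **)&d_dbg, 2 * N * sizeof(int)) != cudaSuccess) {
        cudaFree(d_out);
        return 1;
    }

    kernel<<<1, N>>>(d_out, d_dbg);
    cudaError_t err = cudaGetLastError();   // catches launch errors
    if (err == cudaSuccess)                 // cudaMemcpy waits for the kernel
        err = cudaMemcpy(h_out, d_out, N * sizeof(int),
                         cudaMemcpyDeviceToHost);
    if (err == cudaSuccess && use_debug)
        err = cudaMemcpy(h_dbg, d_dbg, 2 * N * sizeof(int),
                         cudaMemcpyDeviceToHost);
    if (err != cudaSuccess)
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    if (d_dbg) cudaFree(d_dbg);
    return err != cudaSuccess;
}

int main()
{
    int out[N], dbg[2 * N];
    if (run_kernel(out, dbg, 1) != 0)
        return 1;
    for (int i = 0; i < N; ++i)
        printf("thread %d: tmp=%d returned=%d out=%d\n",
               i, dbg[2 * i], dbg[2 * i + 1], out[i]);
    return 0;
}
```

Passing `use_debug = 0` runs the same kernel without the debug stores, so the two sets of results can be compared directly.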

In my kernel code I call a device function that calls another device function, and that second function returns an integer which is not correct. HOWEVER, if I store the function's internal variables in global memory (for debugging purposes), it returns the correct integer value!

In emulation, the function returns the correct value regardless of whether I store the intermediate variables in global memory.

Has anybody experienced similar problems? What other information would you like me to give you to understand the problem better?

Thanks!