I have a strange problem with the app I’m developing: I can’t get it to work since upgrading to CUDA 3.2.
I’ve spent two days instrumenting the code to explicitly synchronize and call cudaGetLastError() after each CUDA call, disable streams, and so on, all for nothing. The behavior is that at the end of my computation I get a vector full of #QNAN, without CUDA having returned any error code.
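For reference, the per-call checking I added looks roughly like this (a minimal sketch; the macro name and error handling are my own, not from the app):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: force synchronization and check for a sticky
// error after every CUDA call, so nothing is hidden by asynchrony.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err == cudaSuccess) {                                     \
            cudaThreadSynchronize(); /* 3.x-era API */                \
            err = cudaGetLastError();                                 \
        }                                                             \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage: CUDA_CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToHost));
```

Even with every call wrapped like this, no error is ever reported; the result vector still comes back full of #QNAN.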
I’ve reverted to toolkit 3.1 (257.21 driver), recompiled the code as is, and voilà, everything works fine (the vector is filled with meaningful values).
My app uses several features: stream synchronization, context thread migration, and async pinned-memory transfers. I currently run 2 threads on each device, without any specific problem on the 257.21 driver / 3.1 toolkit.
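The async pinned transfers follow the usual pattern; roughly this (a sketch, with illustrative names and sizes, not the app’s actual code):

```cuda
#include <cuda_runtime.h>

// Sketch of the async pinned-memory transfer pattern the app relies on.
void copy_and_compute(float* d_buf, size_t n, cudaStream_t stream)
{
    float* h_pinned = 0;
    // Page-locked host allocation so cudaMemcpyAsync can truly overlap.
    cudaMallocHost((void**)&h_pinned, n * sizeof(float));

    // ... fill h_pinned on the host ...

    // Asynchronous host-to-device copy queued on the given stream.
    cudaMemcpyAsync(d_buf, h_pinned, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);

    // ... launch kernels / CUBLAS calls on the same stream ...

    // Wait for the stream before reusing or freeing the pinned buffer.
    cudaStreamSynchronize(stream);
    cudaFreeHost(h_pinned);
}
```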
If I upgrade the driver to the 260.61 release, I get the bad behavior described above. Recompiling with the 3.2 toolkit doesn’t help, even after disabling streams (all streams = NULL), not using pinned memory, and running only 1 thread on the Tesla C1060.
I would be thankful for any hint on known behavior changes between 3.1 and 3.2 that could explain this, or on how to try to spot the problem. The computation part of the app uses a mix of custom kernels and CUBLAS calls.
My development setup: Win7 x64, VS2005, Core i7, 1x GTX 260, 1x Tesla C1060, 6 GB RAM.