I am slowly gaining experience with CUDA but have come across a very puzzling, race-like condition.
The behavior is as follows:
Results are correct during debugging, when I use printf statements to read them manually and compare them to known values from a CPU simulation.
Results are incorrect (unstable) when the printf statements are removed.
Observations / Notes:
A. The unstable results felt like a race condition; however, when limiting kernels to <<<1,1>>> the unstable behavior persists.
B. cudaDeviceSynchronize() statements placed after kernels seem to have no effect.
C. Each kernel uses a lot of memory relative to previous kernels I have written; for example, the kernels use float arrays, and one kernel uses three of them. That said, each kernel appears to launch and run successfully.
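Regarding B and C, "appears to launch and run successfully" is only an impression so far, since a kernel launch does not report errors by itself. A minimal sketch of how every call could be checked is below. The macro CUDA_CHECK, the kernel scale, the function run_checked and the array size are all placeholders made up for illustration, not my actual code; only the cuda* runtime calls are real API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper macro: print any CUDA runtime error with its location
// and make the enclosing function return 1.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",             \
                         cudaGetErrorString(err_), __FILE__, __LINE__);   \
            return 1;                                                     \
        }                                                                 \
    } while (0)

// Placeholder kernel standing in for the real ones.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

// Returns 0 if every CUDA call succeeded, 1 otherwise.
// Note: an early return on error leaks d_data; acceptable for a diagnostic sketch.
int run_checked(int n)
{
    float *d_data = nullptr;
    CUDA_CHECK(cudaMalloc(&d_data, n * sizeof(float)));
    // Give the buffer a defined value so the kernel never reads uninitialized memory.
    CUDA_CHECK(cudaMemset(d_data, 0, n * sizeof(float)));

    scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
    CUDA_CHECK(cudaGetLastError());       // errors from the launch itself (e.g. bad configuration)
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel was running

    CUDA_CHECK(cudaFree(d_data));
    return 0;
}
```

Calling run_checked(1 << 20) from main and getting 0 back would mean that allocation, launch and execution all completed without a reported error. A nonzero return, together with the printed message, would show which call failed.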
What might the unstable behavior be caused by?