Hello,
I am slowly gaining experience with CUDA but have come across a very puzzling, race-like condition.
The behavior is as follows:
- Results are correct during debugging, when printf statements are used to read the results manually and compare them against known values from a CPU simulation.
- Results are incorrect (unstable) when the printf statements are removed.
Observations / Notes:
A. The unstable results felt like a race condition; however, when limiting kernels to <<<1,1>>>, the unstable behavior persists.
B. cudaDeviceSynchronize() statements placed after kernels seem to have no effect.
C. Each kernel uses a lot of memory relative to kernels I have written previously. For example, kernels use arrays (float array[105]), and one kernel uses three of those. That said, each kernel appears to launch and run successfully.
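To make observations B and C concrete, here is a minimal sketch of how a launch could be checked explicitly rather than assumed to succeed. The names exampleKernel, cudaOk, and runChecked are hypothetical stand-ins, not the actual code from this project. The array size of 105 and the <<<1,1>>> launch simply mirror the notes above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints the CUDA error (if any) and returns false on failure.
static bool cudaOk(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Hypothetical stand-in kernel using a per-thread array of the size described above.
__global__ void exampleKernel(float *out) {
    float scratch[105];
    // Fill the whole array so every element read below is defined.
    for (int i = 0; i < 105; ++i) scratch[i] = 0.0f;
    scratch[0] = 1.0f;
    float sum = 0.0f;
    for (int i = 0; i < 105; ++i) sum += scratch[i];
    *out = sum;
}

// Launches the kernel with <<<1,1>>> and checks every step:
// allocation, launch, execution, and the copy back to the host.
bool runChecked(float *hostResult) {
    float *dOut = nullptr;
    if (!cudaOk(cudaMalloc(&dOut, sizeof(float)), "cudaMalloc")) return false;

    exampleKernel<<<1, 1>>>(dOut);

    // cudaGetLastError reports launch problems (bad configuration, resources);
    // cudaDeviceSynchronize reports errors raised while the kernel was running.
    bool ok = cudaOk(cudaGetLastError(), "kernel launch")
           && cudaOk(cudaDeviceSynchronize(), "kernel execution")
           && cudaOk(cudaMemcpy(hostResult, dOut, sizeof(float),
                                cudaMemcpyDeviceToHost), "cudaMemcpy");
    cudaFree(dOut);
    return ok;
}
```

Calling runChecked(&value) from main and printing value only when it returns true would distinguish a kernel that really ran from one that failed silently. A cudaDeviceSynchronize() call whose return value is ignored cannot make that distinction.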
Question:
What might be causing the unstable behavior?