I have an application that in one instance runs for a very short time (<0.05 seconds). I am finding that on every alternate run it returns an incorrect result.
My intuition is that the GPU hasn’t had time to get going and some of the threads are terminating early - a sort of thread collapse. If I introduce a global memory access predicated on thread id, then it works fine.
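To show what I mean, here is a cut-down sketch of the workaround (the kernel name, parameters and the placeholder computation are made up for illustration, not my actual code):

```
__global__ void shortKernel(float *out, float *dbg, int n, int watchTid)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    // ... the actual computation goes here ...
    out[tid] = (float)tid;   // placeholder result

    // The workaround: a global memory store predicated on the thread id
    // (originally added only to dump diagnostic values). With this store
    // present the kernel always returns the right answer; without it,
    // every alternate run gives the wrong (but repeatable) result.
    if (tid == watchTid)
        dbg[tid] = out[tid];
}
```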
Is anyone aware of issues related to very short runs?
The alternation between correct and incorrect results occurs for low values of my input parameter n; for higher values of n the problem does not occur.
The incorrect result is the same each time it occurs.
I know exactly which threads seem to behave differently between the two runs. It was while getting these threads to store some diagnostic information in global memory, to help me track the problem down, that I found the problem disappears once that global memory access is included.
I have checked and rechecked the array accesses and that everything is being initialised. The fact that the problem comes and goes, and only occurs in short runs, suggests to me that this is more than a basic array overrun.
For the purposes of debugging this problem, shared memory is not used and global memory is only used at the start. “Local” memory is used.
Running what is basically the same code and data structures under C works correctly.
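Roughly the kind of memory layout I mean, as an assumed sketch rather than my real kernel (no __shared__ arrays, global reads only at the start, per-thread “local” arrays for the work; the result still has to be written back at the end):

```
__global__ void sketchKernel(const float *gIn, float *gOut, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    // Global memory read only at the start.
    float seed = gIn[tid];

    // Per-thread array: dynamically indexed arrays like this are normally
    // placed in (per-thread, off-chip) local memory rather than registers.
    float scratch[32];
    for (int i = 0; i < 32; ++i)
        scratch[i] = seed + (float)i;

    // ... the rest of the computation works on scratch[] only ...

    // Single result written back per thread.
    gOut[tid] = scratch[tid % 32];
}
```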
CUDA definitely works reliably for short-running as well as long-running kernels. I’m afraid we can’t help you much with this unless you post source code that reproduces the problem.