I have a problem in a program where a certain thread never returns from an atomicCAS call.
Unfortunately, the code is quite large and I have not, so far, managed to produce a reduced version that reproduces the problem. But it is essentially a 64-bit atomicCAS on a valid memory address. I ran cuda-gdb, and it shows the thread stuck inside the atomicCAS; no code after it executes.
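For reference, the 64-bit overload is `unsigned long long atomicCAS(unsigned long long* address, unsigned long long compare, unsigned long long val)`, and it is typically used in a retry loop to build an operation that has no native atomic. The sketch below is illustrative only, not the code from the question; `atomicMulULL`, `multiply_kernel` and `run_multiply` are hypothetical names. Each thread multiplies a shared 64-bit value by 2 and retries whenever another thread changed the value between its read and its CAS. A loop like this never waits on another thread's progress, because in every round some thread's CAS succeeds.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: atomic 64-bit multiply built on atomicCAS
// (CUDA has no native atomicMul).
__device__ unsigned long long atomicMulULL(unsigned long long* addr,
                                           unsigned long long factor) {
    unsigned long long old = *addr;  // first guess; the CAS corrects it if stale
    unsigned long long assumed;
    do {
        assumed = old;
        // Store assumed * factor only if *addr still equals assumed.
        old = atomicCAS(addr, assumed, assumed * factor);
    } while (assumed != old);        // another thread got in first: retry
    return old;                      // value before this thread's update
}

__global__ void multiply_kernel(unsigned long long* value) {
    atomicMulULL(value, 2ULL);       // every thread doubles the shared value
}

// Host helper (hypothetical): start from 1, launch `threads` threads, return the result.
unsigned long long run_multiply(int threads) {
    unsigned long long h_value = 1;
    unsigned long long* d_value = nullptr;
    cudaMalloc(&d_value, sizeof(h_value));
    cudaMemcpy(d_value, &h_value, sizeof(h_value), cudaMemcpyHostToDevice);
    multiply_kernel<<<1, threads>>>(d_value);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&h_value, d_value, sizeof(h_value), cudaMemcpyDeviceToHost);
    cudaFree(d_value);
    return h_value;
}
```

Because multiplication commutes, `run_multiply(10)` yields 2^10 = 1024 regardless of the order in which the threads win the CAS.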
My question is: is there any known case where this happens, where an atomic hangs indefinitely?
I did some tests, and it seems the thread waiting on the atomic was not being scheduled because another thread in its block was doing something else. When I forced the second thread to synchronize, the first one worked. However, I am running on a V100, and I assumed Independent Thread Scheduling prevented this kind of behaviour. Is there anything specific I need to enable on Volta to get proper yielding of threads?
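One way to express a "make the other thread synchronize" fix is to replace a spin-wait with a block barrier. A minimal sketch, assuming the producer and consumer are threads of the same block and exchange data through shared memory (`handoff_kernel` and `run_handoff` are hypothetical names): thread 0 writes a value, every thread in the block reaches `__syncthreads()`, and only then does thread 1 read it. Unlike a flag the consumer spins on, the barrier does not depend on the scheduler choosing to run the producer while the consumer waits. Note that `__syncthreads()` must be reached by all threads of the block, so it cannot sit inside a branch that only some of them take.

```cuda
#include <cuda_runtime.h>

// Hypothetical producer/consumer handoff inside one block.
__global__ void handoff_kernel(int* out) {
    __shared__ int slot;
    if (threadIdx.x == 0) {
        slot = 42;            // producer writes to shared memory
    }
    __syncthreads();          // all threads wait here; thread 0's write is now visible
    if (threadIdx.x == 1) {
        out[0] = slot;        // consumer reads after the barrier, no spinning
    }
}

// Host helper (hypothetical): run one block of 32 threads and return what thread 1 read.
int run_handoff() {
    int h_out = 0;
    int* d_out = nullptr;
    cudaMalloc(&d_out, sizeof(int));
    cudaMemset(d_out, 0, sizeof(int));
    handoff_kernel<<<1, 32>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```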
Independent thread scheduling doesn’t enforce any sort of fairness or heuristic on thread scheduling. It simply gives the execution engine more latitude in what it may do.
The general CUDA execution model is that all threads eventually finish.
My guess here is that you are actually depending on some kind of inter-thread communication or synchronization for your algorithm, or you would never have noticed this.
Such algorithms, if they don’t incorporate any sort of proper synchronization technique, are just as broken on Volta as they were on previous architectures.
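If the atomicCAS is being used as a lock, a pattern that is commonly recommended keeps the acquire, the critical section and the release all inside the branch that won the CAS, so that the winning thread finishes and releases before it loops again, rather than the waiting threads spinning on a holder that has diverged away from them. This avoids the classic intra-warp deadlock on pre-Volta GPUs. A minimal sketch, assuming a 32-bit `int` lock in global memory initialised to 0 (`locked_increment` and `run_locked_increment` are hypothetical names); the counter update is deliberately non-atomic to show that the lock is what protects it.

```cuda
#include <cuda_runtime.h>

// Hypothetical example: each thread increments a plain counter under a CAS-based lock.
__global__ void locked_increment(int* lock, volatile int* counter) {
    bool done = false;
    while (!done) {
        if (atomicCAS(lock, 0, 1) == 0) {  // this thread took the lock
            __threadfence();               // order the acquire before the reads below
            *counter = *counter + 1;       // critical section (non-atomic on purpose)
            __threadfence();               // make the write visible before releasing
            atomicExch(lock, 0);           // release inside the same branch
            done = true;
        }
    }
}

// Host helper (hypothetical): launch blocks x threads and return the final count.
int run_locked_increment(int blocks, int threads) {
    int* d_lock = nullptr;
    int* d_counter = nullptr;
    cudaMalloc(&d_lock, sizeof(int));
    cudaMalloc(&d_counter, sizeof(int));
    cudaMemset(d_lock, 0, sizeof(int));
    cudaMemset(d_counter, 0, sizeof(int));
    locked_increment<<<blocks, threads>>>(d_lock, d_counter);
    int h_counter = 0;
    cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_lock);
    cudaFree(d_counter);
    return h_counter;
}
```

With 4 blocks of 64 threads the counter should end at 256. A global lock like this serialises every thread, so it is a correctness sketch rather than a performance recommendation.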
I can change it to proper synchronization; that is no issue. The NVIDIA documentation is a bit confusing on this. I had understood that the V100 always gave other threads a chance, but that does not seem to be the case.