The code fragment below creates linked lists concurrently on multiple cores.
This has been working fine on GTX1080Tis, P5000s, and P6000s, and on an M920, all with CUDA 8.0.
With CUDA 9.1 or 9.2, the loop hangs.
If I build this module with debugging enabled (nvcc -G), the code works on a Windows system with GTX1080Tis. I still haven’t been able to get it to work on Linux, although the compiler should of course be the same.
Since it works with debug code, I suspected the optimizer.
However, perhaps someone has some other idea.
Regards
while (!done) {
    uint32_t hashValue = hashTable[hashCode]; // Current hash table value.
    if ((hashValue & 0x80000000) == 0 && atomicCAS(&hashTable[hashCode], hashValue, lockID) == hashValue) {
        //////////////////////////////////////////////////////////////////////////////////
        // We can proceed when the hash value didn't change since we last fetched it, and
        // the entry we got was not a locked indicator.
        //////////////////////////////////////////////////////////////////////////////////
        pointList[pointListOffset] = pointID; // New start of list.
        pointLinks[pointListOffset] = hashValue; // Next point is old start of list or zero.
        (void)atomicExch(&hashTable[hashCode], pointListOffset + 1); // Unlock the hash table entry, set offset to point data (one-based).
        // "Memory fence functions only affect the ordering of memory operations by a thread;
        // they do not ensure that these memory operations are visible to other threads (like
        // __syncthreads() does for threads within a block (see Synchronization Functions))."
        __threadfence(); // Flush caches: Other threads need to see these changes.
        done = true;
    }
}
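For anyone who wants to experiment, here is a minimal sketch of two changes that might be worth trying. It is not a confirmed diagnosis. First, the plain load of hashTable[hashCode] is not volatile, so an optimized build may be allowed to keep the value in a register instead of re-reading it on every iteration; the sketch reads it through a volatile pointer. Second, __threadfence() comes after the unlocking atomicExch in the loop above, so another thread could in principle see the unlocked entry before the pointList/pointLinks stores; the sketch issues the fence before the unlock. The names insertPoint, buildLists, countLinkedPoints and LOCK_BIT are hypothetical, and so are the choice lockID = LOCK_BIT | pointID and the use of i % numBuckets as the hash; the original post does not show how those values are derived.

```cuda
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Assumption: the top bit marks a locked entry, matching the 0x80000000 test in the loop above.
constexpr uint32_t LOCK_BIT = 0x80000000u;

// Hypothetical helper: prepend pointID to the list that starts at hashTable[hashCode].
__device__ void insertPoint(uint32_t *hashTable, uint32_t *pointList, uint32_t *pointLinks,
                            uint32_t hashCode, uint32_t pointListOffset, uint32_t pointID)
{
    // Volatile view of the entry, so the load is reissued on every iteration.
    volatile uint32_t *entry = &hashTable[hashCode];
    const uint32_t lockID = LOCK_BIT | pointID; // Assumption: any value with the lock bit set.
    bool done = false;
    while (!done) {
        uint32_t hashValue = *entry; // Current hash table value.
        if ((hashValue & LOCK_BIT) == 0 &&
            atomicCAS(&hashTable[hashCode], hashValue, lockID) == hashValue) {
            pointList[pointListOffset] = pointID;    // New start of list.
            pointLinks[pointListOffset] = hashValue; // Old start of list, or zero.
            __threadfence(); // Order the two stores above before the unlock below.
            (void)atomicExch(&hashTable[hashCode], pointListOffset + 1); // Unlock, one-based offset.
            done = true;
        }
    }
}

// Hypothetical test kernel: one point per thread, bucket chosen as i % numBuckets.
__global__ void buildLists(uint32_t *hashTable, uint32_t *pointList, uint32_t *pointLinks,
                           uint32_t numPoints, uint32_t numBuckets)
{
    uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numPoints) {
        insertPoint(hashTable, pointList, pointLinks, i % numBuckets, i, i);
    }
}

// Hypothetical host-side check: build the lists, walk every chain on the host,
// and return how many points are reachable from the hash table.
uint32_t countLinkedPoints(uint32_t numPoints, uint32_t numBuckets)
{
    uint32_t *hashTable, *pointList, *pointLinks;
    cudaMalloc(&hashTable, numBuckets * sizeof(uint32_t));
    cudaMalloc(&pointList, numPoints * sizeof(uint32_t));
    cudaMalloc(&pointLinks, numPoints * sizeof(uint32_t));
    cudaMemset(hashTable, 0, numBuckets * sizeof(uint32_t)); // Zero means "empty list".

    const uint32_t threads = 256;
    const uint32_t blocks = (numPoints + threads - 1) / threads;
    buildLists<<<blocks, threads>>>(hashTable, pointList, pointLinks, numPoints, numBuckets);
    cudaDeviceSynchronize();

    std::vector<uint32_t> table(numBuckets), links(numPoints);
    cudaMemcpy(table.data(), hashTable, numBuckets * sizeof(uint32_t), cudaMemcpyDeviceToHost);
    cudaMemcpy(links.data(), pointLinks, numPoints * sizeof(uint32_t), cudaMemcpyDeviceToHost);
    cudaFree(hashTable);
    cudaFree(pointList);
    cudaFree(pointLinks);

    uint32_t count = 0;
    for (uint32_t b = 0; b < numBuckets; ++b) {
        // The extra bounds stop the walk if a chain is corrupt or cyclic.
        for (uint32_t next = table[b];
             next != 0 && next <= numPoints && count <= numPoints;
             next = links[next - 1]) {
            ++count;
        }
    }
    return count;
}
```

Calling countLinkedPoints(numPoints, numBuckets) from a small main() and comparing the result with numPoints gives a quick check that no insertion was lost. With numBuckets at least 32 and this particular hash, no two threads of one warp contend for the same entry, so the sketch does not exercise the intra-warp spinning case.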