Coding or Optimization Problem?

The code fragment below is creating linked lists on multiple cores.

This has been working fine on GTX1080Tis, P5000s, and P6000s, and on an M920, all with CUDA 8.0.

With CUDA 9.1 or 9.2, the loop hangs.

If I build this module with debugging enabled (nvcc -G), the code works on a Windows system with GTX1080Tis. I still haven’t been able to get it to work on Linux, although the compiler should of course be the same.

Since it works with debug code, I suspected the optimizer.

However, perhaps someone has some other idea.

Regards

    while (!done) {
        uint32_t hashValue = hashTable[hashCode];   // Current hash table value.

        if ((hashValue & 0x80000000) == 0 && atomicCAS(&hashTable[hashCode], hashValue, lockID) == hashValue) {
            //////////////////////////////////////////////////////////////////////////////////
            // We can proceed when the hash value didn't change since we last fetched it, and
            // the entry we got was not a locked indicator.
            //////////////////////////////////////////////////////////////////////////////////

            pointList[pointListOffset] = pointID;           // New start of list.
            pointLinks[pointListOffset] = hashValue;        // Next point is old start of list or zero.
            (void)atomicExch(&hashTable[hashCode], pointListOffset + 1);  // Unlock the hash table entry, set offset to point data (one-based).

            // "Memory fence functions only affect the ordering of memory operations by a thread;
            // they do not ensure that these memory operations are visible to other threads (like
            // __syncthreads() does for threads within a block (see Synchronization Functions))."

            __threadfence();                                // Flush caches:  Other threads need to see these changes.
            done = true;
        }
    }

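It looks like you may be negotiating for a lock. If the threads competing for that lock can be in the same warp, then I would say you are asking for trouble on a pre-Volta architecture: the threads of a warp share one program counter there, so the thread that holds the lock can be left waiting while its warp-mates spin on it.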

If that is the situation, it’s possible that this is simply a defect in your code, no matter how many configurations it has appeared to work correctly on.
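If the lock only exists to protect the push onto the head of the list, one way to sidestep the intra-warp problem might be to drop the lock and publish the new head with a plain compare-and-swap retry loop. The sketch below is not your code and I have not tested it against your build: the kernel name `buildLists`, the host helper `runBuildListsDemo`, the modulo "hash" and the one-slot-per-thread allocation are all stand-ins I made up for illustration. It also assumes the lists are only walked after the kernel has finished, and that nothing is removed while they are being built.

```cuda
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per point, each pushing its point onto the
// list for its bucket. hashTable holds one-based offsets; 0 means "empty".
__global__ void buildLists(uint32_t *hashTable, uint32_t *pointList,
                           uint32_t *pointLinks, uint32_t numPoints,
                           uint32_t numBuckets)
{
    uint32_t tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= numPoints) return;

    uint32_t hashCode = tid % numBuckets;   // stand-in for the real hash
    uint32_t pointListOffset = tid;         // stand-in for the real slot allocation

    pointList[pointListOffset] = tid;       // point ID stored in this node

    uint32_t observed = hashTable[hashCode];
    uint32_t assumed;
    do {
        assumed = observed;
        pointLinks[pointListOffset] = assumed;  // next = current head, or 0
        __threadfence();                        // order node writes before publishing
        // Publish this node as the new head only if the head is unchanged.
        observed = atomicCAS(&hashTable[hashCode], assumed, pointListOffset + 1);
    } while (observed != assumed);              // another thread won: retry
}

// Hypothetical host driver: builds the lists, then walks every bucket on the
// host and returns how many points are reachable (0 on a CUDA error).
uint32_t runBuildListsDemo(uint32_t numPoints = 4096, uint32_t numBuckets = 7)
{
    uint32_t *dTable = nullptr, *dList = nullptr, *dLinks = nullptr;
    if (cudaMalloc(&dTable, numBuckets * sizeof(uint32_t)) != cudaSuccess) return 0;
    if (cudaMalloc(&dList, numPoints * sizeof(uint32_t)) != cudaSuccess) return 0;
    if (cudaMalloc(&dLinks, numPoints * sizeof(uint32_t)) != cudaSuccess) return 0;
    cudaMemset(dTable, 0, numBuckets * sizeof(uint32_t));   // all buckets empty

    uint32_t threads = 256;
    uint32_t blocks = (numPoints + threads - 1) / threads;
    buildLists<<<blocks, threads>>>(dTable, dList, dLinks, numPoints, numBuckets);
    if (cudaDeviceSynchronize() != cudaSuccess) return 0;

    std::vector<uint32_t> table(numBuckets), links(numPoints);
    cudaMemcpy(table.data(), dTable, numBuckets * sizeof(uint32_t), cudaMemcpyDeviceToHost);
    cudaMemcpy(links.data(), dLinks, numPoints * sizeof(uint32_t), cudaMemcpyDeviceToHost);
    cudaFree(dTable);
    cudaFree(dList);
    cudaFree(dLinks);

    uint32_t reachable = 0;
    for (uint32_t b = 0; b < numBuckets; ++b)
        for (uint32_t head = table[b]; head != 0; head = links[head - 1])
            ++reachable;                        // follow one-based links to the end
    return reachable;
}
```

The difference from a lock is that no thread ever waits for another thread to release anything. Whenever several threads in a warp hit the same bucket, one of the compare-and-swaps succeeds on each pass, so the loop shrinks until it is empty regardless of how the warp is scheduled. Note also that the fence sits before the node is published rather than after it.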