What are the details of the new PTX 6 memory consistency model on the Turing architecture, and how can I compensate for its absence on pre-Turing architectures?

I recently watched a CppCon presentation on the following topic: CUDA on Turing Opens New GPU Compute Possibilities.

Trie construction was shown as an example. However, the proposed code only works on Volta, Xavier, or Turing GPUs, due to Independent Thread Scheduling and the PTX 6 memory consistency model (GitHub: ogiroux/freestanding).

I have two questions:

  1. What exactly is the purpose of all the memory order annotations in the code?

         if (n->next[index].ptr.load(simt::std::memory_order_acquire) == nullptr) {
             n->next[index].ptr.wait(nullptr, simt::std::memory_order_acquire);
         } else {
             auto next = bump.fetch_add(1, simt::std::memory_order_relaxed);
             n->next[index].ptr.store(next, simt::std::memory_order_release);
         }
  2. I tried to implement my own trie construction on the GPU using CUDA atomic intrinsics, so that it works on pre-Turing architectures.
        int old = atomicCAS(&n->next[charIndex].set, 0, 1);

        if (old != 0) {
            // lost the race: another thread is creating this node;
            // spin until its pointer becomes visible
            while (n->next[charIndex].ptr == nullptr);
        } else {
            // won the race: claim the next slot from the bump allocator
            int oldBumpIndex = atomicAdd(bumpIndex, 1);
            n->next[charIndex].ptr = *bump + oldBumpIndex;
        }
        n = n->next[charIndex].ptr;

Instead of an atomic_flag, I introduced a new member (set) in the trie structure and use it to allow only one thread to create a new trie node.
The code works, but is rather slow (slower than the multithreaded CPU version, probably due to high thread divergence).
What are the possible problems with this implementation, especially in the warp context? For example, since all threads in a warp execute the same instruction, will the whole warp end up spinlocking if even one of its threads does? Also, regarding the first question, I don't really see why I would need memory orderings here.