Can you help-me about Cuda `AtomicCAS`?

I have a C++ / Cuda project that contains a Hash Table structure and each node in it contains its information, and an int variable called semaphore, see:typedef struct Node { char *key; double * word_vectors = NULL; double value; int semaphore = 1; struct Node *next; } Node;

All semaphores start with 1 which means that it is available, when 0 they will be unavailable, however they all start with 1. I guarantee this because I tested the output of the values

The function that updates the word_vectors information is called update_vectors and look how it is done:

device void update_vectors(char *input_word_index, char *context_word_index, int embedding_size, double gradient, double learning_rate, double momentum)
int tid = threadIdx.x + blockIdx.x * blockDim.x;
double input_word_delta, context_word_delta;
Node *no_input = procurar(input_word_index);
Node *no_context = procurar(context_word_index);
Node *firstNode, *secondNode;

// Determine the order of nodes
if (dstrcmp(no_input->key, no_context->key) < 0)
    firstNode = no_input;
    secondNode = no_context;
    firstNode = no_context;
    secondNode = no_input;

while (atomicCAS(&firstNode->semaphore, 1, 0) != 0)
    //printf("\nTrying to get resource 1\n");

printf("\nResource 1 Successfully obtained by tid: %d\n", tid);

while (atomicCAS(&secondNode->semaphore, 1, 0) != 0)
    //printf("\nTrying to get resource 2\n");
printf("\nResource 1 Successfully obtained by tid: %d\n", tid);

for (int i = 0; i < embedding_size; i++)

    //Critical section: update shared variables
    input_word_delta = (learning_rate * -1.0) * (gradient * no_context->word_vectors[i] + momentum * no_input->word_vectors[i]);
    context_word_delta = (learning_rate * -1.0) * (gradient * no_input->word_vectors[i] + momentum * no_context->word_vectors[i]);

    no_input->word_vectors[i] += input_word_delta;
    no_context->word_vectors[i] += context_word_delta;
// Release the semaphore
atomicExch(&firstNode->semaphore, 1);
atomicExch(&secondNode->semaphore, 1);

Note that I took care that all Threads always get the nodes in a single way using the dstrcmp function which is the same as strcmp, but for device.

Why when I set the while (atomicCAS(&firstNode->semaphore, 1, 0) != 0) to non-Zero, and the same for while (atomicCAS(&secondNode->semaphore, 1, 0) != 0), does the algorithm work? It doesn’t crash my GPU or black screen, however, when I change it to while (atomicCAS(&firstNode->semaphore, 1, 0) == 0) and while (atomicCAS(&secondNode->semaphore, 1, 0) != 0) it crashes the entire GPU and never gets any of the two resources?

Can anyone explain to me how this works even though it seems illogical? Maybe I’m not aware of something that the atomicCas function does, can someone please explain it to me the same way you would explain it to your dog? because I really want to understand this here.

As explained before I tried with the logic: while (atomicCAS(&firstNode->semaphore, 1, 0) == 0) and it got work, black screen and the resource was never picked up by the Threads, however, when I use: while (atomicCAS(&firstNode->semaphore, 1, 0) != 0) it works, however I can’t understand why, and I would like someone to explain it to me.

Thanks for your attention!