Why doesn't the lock-free approach work on CUDA? Lock vs. lock-free approach

The following code is based on an example from the book “CUDA by Example”.
The functions add_to_table_lock() and add_to_table_lockfree() do the same work, namely hashing.
They (1) compute the hash value of an element in keys,
(2) record the head of the corresponding hash bucket into the element's next field, and
(3) push the element onto the head of that bucket.

Why is the result of the second function incorrect? Once the number of elements is large enough, it can no longer guarantee atomic access to table[hashValue].
The first function, however, is always correct.

The whole source file is attached.

hashtable_gpu_lock.cu (5.27 KB)


Any ideas?

-----------------------------Function 1--------------------------------------------------
__global__ void add_to_table_lock( unsigned int *keys, void **values, unsigned int *next, unsigned int *table, int *lock ) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    size_t hashValue = hash( keys[tid], HASH_ENTRIES );
    for (int i=0; i<32; i++) {
        if ((tid % 32) == i) {  // serialize the lanes of a warp, one at a time
            while( atomicCAS( &lock[hashValue], 0, 1 ) != 0 );  //lock approach: spin until acquired
            next[tid] = table[hashValue];
            table[hashValue] = tid;
            __threadfence();
            atomicExch( &lock[hashValue], 0 );  // release the lock
        }
    }
}
-----------------------------Function 2--------------------------------------------------
__global__ void add_to_table_lockfree( unsigned int *keys, void **values, unsigned int *next, unsigned int *table, int *lock ) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int flag = 0;
    size_t hashValue = hash( keys[tid], HASH_ENTRIES );
    while ( flag == 0 ) {
        if ( atomicCAS( &lock[hashValue], 0, 1 ) == 0 ) {  //lock-free approach: try the lock, retry on failure
            next[tid] = table[hashValue];
            table[hashValue] = tid;
            __threadfence();
            atomicExch( &lock[hashValue], 0 );  // release the lock
            flag = 1;
        }
    }
}