You are correct, I am able to witness an inconsistency eventually, even with added __threadfence_system()
I think there are 2 things to pay attention to here:
-
__threadfence_system() is required for correctness. There is no ordering guarantee from device code writing to memory without it.
-
We must also consider the read operation on the host side. The visibility guarantee for item 1 requires that host reads of adjacent elements (or elements to be compared for consistency) must occur at exactly the same time. If there is even a small gap between host reads of adjacent elements, then we run into a possible race condition. Consider this line of code:
if (arr[i] < arr[i + 1]) { // arr[i] == 0 and arr[i + 1] == 1 ==> Inconsistency
That code enforces no defined ordering of the read of arr[i] and arr[i+1] that I am aware of. Maybe if you are a language expert, you can prove to me that the read of arr[i] must occur before the read of arr[i+1]. Even if that is the case, it does not affect my argument. If the read of arr[i] occurs before arr[i+1], then anything at all can happen in the time between these two reads. A way to think about this is imagine if the entire CUDA kernel executes in-between these two reads. If that were to happen, then the read of arr[i] would return 0, and the read of arr[i+1] would return 1, but that is not a violation of any guarantee that I am aware of, and does not even indicate an ordering problem. It indicates a race condition in your (host) code.
I believe a simple way to test this is to:
-
make the modification you already proposed: put __threadfence_system() in the body of your kernel loop. Apart from any other discussion, it belongs there to enforce ordering of writes.
-
modify your host code to remove the race condition.
Here’s my modified test case, and I’ve run it for a long time (500 iterations) with no inconsistencies:
$ cat t317.cu
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>
#include <assert.h>
#include <string.h>
#include <iostream>
#include <math.h>
#include <bitset>
#define CUDA_CHECK(f) do { \
cudaError_t e = f; \
if (e != cudaSuccess) { \
printf("Cuda failure %s:%d: '%s'\n", __FILE__, __LINE__, cudaGetErrorString(e)); \
exit(1); \
} \
} while (0)
namespace UVMConsistency {
#define UVMSPACE volatile
#define START 0
#define GPU_START 1
#define GPU_FINISH 2
#define FINISH 3
#define NUM_SHARED 100
typedef unsigned long long int ulli;
__global__ void GPU_UVM_Writer_Kernel(UVMSPACE int *arr, UVMSPACE int *finished) {
// Wait for CPU
while (*finished != GPU_START);
// Loop and execute writes on shared memory page - sequentially
for (int i = 0; i < NUM_SHARED; i++) {
arr[i] = 1;
__threadfence_system();
}
// GPU finished - CPU can finish
*finished = GPU_FINISH;
// Wait for CPU to finish
while (*finished != FINISH);
}
class Consistency {
private: // Constructor & Destructor
Consistency() {
CUDA_CHECK(cudaMallocManaged(&arr, sizeof(int) * NUM_SHARED));
memset((void *) arr, 0, sizeof(int) * NUM_SHARED);
CUDA_CHECK(cudaMallocManaged(&finished, sizeof(int)));
memset((void *) finished, START, sizeof(int));
// Writing all the changes of UM to GPU
__sync_synchronize();
}
~Consistency() {
CUDA_CHECK(cudaFree((int *) arr));
CUDA_CHECK(cudaFree((int *) finished));
}
private: // Logic
bool is_arr_full(UVMSPACE int *arr) {
int count = 0;
for (int i = 0; i < NUM_SHARED; i++) {
count += arr[i];
}
return count == NUM_SHARED;
}
bool check_consistency(UVMSPACE int *arr) {
// Read shared memory page - sequentially
for (int i = 0; i < NUM_SHARED - 1; i++) {
#ifndef FIX
int v1 = arr[i];
int v2 = arr[i+1];
#else
int v2 = arr[i+1];
int v1 = arr[i];
#endif
if (v1 < v2) { // arr[i] == 0 and arr[i + 1] == 1 ==> Inconsistency
std::cout << i << "," << v1 << "," << v2 << std::endl;
return true;
}
}
return false;
}
void launch_task() {
// Start GPU task
GPU_UVM_Writer_Kernel<<<1,1>>>(arr, finished);
// GPU can start
*finished = GPU_START;
}
void check_consistency() {
// While writes have not finished
while (!is_arr_full(arr)) {
// Check if an inconsistency exists in the array
if (check_consistency(arr)) {
::std::cout << "Found Inconsistency !" << ::std::endl;
return;
}
}
::std::cout << "No Consistency Found" << ::std::endl;
}
void finish_task() {
while (*finished != GPU_FINISH);
// Task is over
*finished = FINISH;
CUDA_CHECK(cudaDeviceSynchronize());
}
public:
static void start() {
Consistency consistency;
// Start kernel
consistency.launch_task();
// Check GPU consistency
consistency.check_consistency();
// Finish task for CPU and GPU
consistency.finish_task();
}
private:
UVMSPACE int *arr;
UVMSPACE int *finished;
};
} // UVMConsistency namespace
int main() {
UVMConsistency::Consistency::start();
return 0;
}
[user2@dc11 misc]$ nvcc -arch=sm_60 -std=c++11 -o t317 t317.cu -DFIX
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$
If you compile without the -DFIX switch, the code will revert to producing errors, occasionally.