Implementing strong memory consistency between GPU and CPU

The CUDA memory system has weak consistency by default. I'd like to use language features (including PTX assembly if necessary) to fix a program so that GPU writes and CPU reads behave as if there were strong consistency. Concretely:
For a GPU producer that updates integer values sequentially in unified memory, I would like a CPU process that reads these values sequentially to see the updates of these integers in sequential order.

Pseudocode:
CPU:

  1. Allocate array (size 100) on unified memory page and reset values to zero.
  2. Launch GPU kernel.
  3. Read the array repeatedly, as long as it is not full of 1's, looking for a 0,1 sequence (arr[i] == 0 followed by arr[i+1] == 1).
  4. If no such sequence is ever observed, strong consistency was achieved.

GPU Kernel:

  • Loop through the array and set each element to 1, sequentially.

Any help on this matter would be extremely appreciated!!

Ophir

I’m not sure the statement “CUDA memory system has weak consistency by default” is correct, especially when using unified memory.
When a piece of memory is allocated with cudaMallocManaged, it is accessible either by the host or by the device at a given time, never by both at the same time, because of page migration. When the host references this memory, it will be exactly as it was left by the last __global__ or __device__ function that accessed it.

This means that, in your example, assuming step #3 is done on the host, it will not happen until the kernel in step #2 has finished and a synchronization has been issued. If you try to access this memory from the host without a cudaDeviceSynchronize after the kernel, the application will crash. The reverse is also true: a kernel will never access the managed memory before the CPU is done working on it, though in that direction no explicit synchronization is needed.
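
A minimal sketch of that pattern (the kernel name fill_ones and the sizes are just illustrative; this reflects the behavior on devices without concurrent managed access):

#include <cstdio>
#include <cstring>

__global__ void fill_ones(int *data, int n) {
  // One thread writes the whole array, purely for illustration.
  for (int i = 0; i < n; i++) data[i] = 1;
}

int main() {
  int *data = nullptr;
  cudaMallocManaged(&data, 100 * sizeof(int));  // unified (managed) allocation
  memset(data, 0, 100 * sizeof(int));           // host touches it first: fine

  fill_ones<<<1, 1>>>(data, 100);

  // Touching data from the host here, while the kernel may still be running,
  // is not allowed on a device without concurrent managed access.
  cudaDeviceSynchronize();                      // required before host reads

  printf("data[0] = %d\n", data[0]);            // safe after the synchronize
  cudaFree(data);
  return 0;
}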

If you are allocating device and host memory separately, that is, without cudaMallocManaged, then it is not unified memory that is being used, just conventional allocation. And as such, with 2 independent resources, it is up to the programmer to keep the consistency.

Have you actually coded something and are getting a different behavior?

Actually, that’s not true.
The page-migration engine copies pages between the concurrently running GPU and CPU, on access (by either device or the host).
I have written the code, of course, and as expected, weak consistency was observed.
Again, what I would like is to add some actions to that algorithm so that strong consistency is achieved.
A clarification on the page-migration engine's job: if the host accesses a memory location on a shared UM page, a page fault is issued and that page is migrated to host memory. If the device then accesses that same location, a GPU-side page fault occurs as well, and the page is migrated back to device memory.
This can, of course, happen while both the device and the host are running.
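
As a side note, whether a given GPU supports this concurrent, page-fault-driven access can be queried at runtime. A minimal sketch using the standard runtime attribute (device 0 and the printout are just my choice):

#include <cstdio>

int main() {
  int dev = 0, concurrent = 0;
  // 1 if the device can access managed memory concurrently with the CPU
  // (page-fault-capable GPUs such as Pascal on supported platforms), else 0.
  cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, dev);
  printf("concurrentManagedAccess = %d\n", concurrent);
  return 0;
}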

Thanks

Show code that causes this inconsistency with unified memory.

This is my code

#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>
#include <assert.h>
#include <string.h>
#include <iostream>
#include <math.h>
#include <bitset>

#define CUDA_CHECK(f) do {                                                                \
  cudaError_t e = f;                                                                      \
  if (e != cudaSuccess) {                                                                 \
      printf("Cuda failure %s:%d: '%s'\n", __FILE__, __LINE__, cudaGetErrorString(e));    \
      exit(1);                                                                            \
  }                                                                                       \
} while (0)

namespace UVMConsistency {

#define UVMSPACE      volatile

#define START         0
#define GPU_START     1
#define GPU_FINISH    2
#define FINISH        3

#define NUM_SHARED 100

typedef unsigned long long int ulli;

__global__ void GPU_UVM_Writer_Kernel(UVMSPACE int *arr, UVMSPACE int *finished) {
  // Wait for CPU
  while (*finished != GPU_START);
  
  // Loop and execute writes on shared memory page - sequentially
  for (int i = 0; i < NUM_SHARED; i++) {
    arr[i] = 1;
  }
  
  // GPU finished - CPU can finish
  *finished = GPU_FINISH;

  // Wait for CPU to finish
  while (*finished != FINISH);
}

class Consistency {
private:	// Constructor & Destructor
  Consistency() {
    CUDA_CHECK(cudaMallocManaged(&arr, sizeof(int) * NUM_SHARED));
    memset((void *) arr, 0, sizeof(int) * NUM_SHARED);

    CUDA_CHECK(cudaMallocManaged(&finished, sizeof(int)));
    memset((void *) finished, START, sizeof(int));

    // Writing all the changes of UM to GPU
    __sync_synchronize();
  }

  ~Consistency() {
    CUDA_CHECK(cudaFree((int *) arr));
    CUDA_CHECK(cudaFree((int *) finished));
  }
  
private:	// Logic
  bool is_arr_full(UVMSPACE int *arr) {
    int count = 0;
    for (int i = 0; i < NUM_SHARED; i++) {
      count += arr[i];
    }
    return count == NUM_SHARED;
  }

  bool check_consistency(UVMSPACE int *arr) {
    // Read shared memory page - sequentially
    for (int i = 0; i < NUM_SHARED - 1; i++) {
      if (arr[i] < arr[i + 1]) {  // arr[i] == 0 and arr[i + 1] == 1  ==> Inconsistency
        return true;
      }
    }
    return false;
  }
  
  void launch_task() {
    // Start GPU task
    GPU_UVM_Writer_Kernel<<<1,1>>>(arr, finished);

    // GPU can start
    *finished = GPU_START;
  }

  void check_consistency() {
    // While writes have not finished
    while (!is_arr_full(arr)) {
      // Check if an inconsistency exists in the array
      if (check_consistency(arr)) {
        ::std::cout << "Found Inconsistency !" << ::std::endl;
        return;
      }
    }
    ::std::cout << "No Consistency Found" << ::std::endl;
  }

  void finish_task() {
    while (*finished != GPU_FINISH);
    // Task is over
    *finished = FINISH;

    CUDA_CHECK(cudaDeviceSynchronize());
  }
    
public:
  static void start() {
    Consistency consistency;
    // Start kernel
    consistency.launch_task();

    // Check GPU consistency
    consistency.check_consistency();

    // Finish task for CPU and GPU
    consistency.finish_task();
  }
private:
  UVMSPACE int *arr;
  UVMSPACE int *finished;
};

} // UVMConsistency namespace

int main() {
  UVMConsistency::Consistency::start();

  return 0;
}

When I run your code multiple times on a single Tesla P100 (no other GPUs in the system) on CUDA 10.0, CentOS 7, I get varying output, either:

Found Inconsistency !

or

No Consistency Found

(I assume that 2nd possibility above perhaps should be written “No Inconsistency Found”)

If I insert a __threadfence_system() call here:

__global__ void GPU_UVM_Writer_Kernel(UVMSPACE int *arr, UVMSPACE int *finished) {
  // Wait for CPU
  while (*finished != GPU_START);
  
  // Loop and execute writes on shared memory page - sequentially
  for (int i = 0; i < NUM_SHARED; i++) {
    arr[i] = 1;
  }
  __threadfence_system();             // added line here
  // GPU finished - CPU can finish
  *finished = GPU_FINISH;

  // Wait for CPU to finish
  while (*finished != FINISH);
}

I only get:

No Consistency Found

You may want to read the programming guide section on memory fence functions in CUDA:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-fence-functions

I tried that, of course.
I'm guessing that if you run it a few more times you will get the "Found Inconsistency !" output.
And I have read that section, obviously; since it did not work (__threadfence_system, which should actually work, at least as they describe it), I tried the forums.
Also, it would be more accurate to insert the fence call inside the loop, so the host thread will see each value in order. However, this also does not work…

Actually, I am running on CUDA 8.0. Do you think that could be an issue?
Also, I am working on a Pascal GPU, not a Tesla.

You are correct; I am able to witness an inconsistency eventually, even with the added __threadfence_system().

I think there are 2 things to pay attention to here:

  1. __threadfence_system() is required for correctness. There is no ordering guarantee from device code writing to memory without it.

  2. We must also consider the read operations on the host side. For the comparison to meaningfully test the visibility guarantee from item 1, the host reads of the two adjacent elements being compared would have to occur at exactly the same time. If there is even a small gap between the host reads of adjacent elements, we run into a possible race condition. Consider this line of code:

if (arr[i] < arr[i + 1]) {  // arr[i] == 0 and arr[i + 1] == 1  ==> Inconsistency

That code enforces no defined ordering of the reads of arr[i] and arr[i+1] that I am aware of. Maybe if you are a language expert you can prove to me that the read of arr[i] must occur before the read of arr[i+1], but even if that is the case, it does not affect my argument: if the read of arr[i] occurs before the read of arr[i+1], then anything at all can happen in the time between those two reads. One way to think about this is to imagine that the entire CUDA kernel executes in between the two reads. If that were to happen, the read of arr[i] would return 0 and the read of arr[i+1] would return 1, yet that is not a violation of any guarantee I am aware of, and it does not even indicate an ordering problem. It indicates a race condition in your (host) code.

I believe a simple way to test this is to:

  1. make the modification you already proposed: put __threadfence_system() in the body of your kernel loop. Apart from any other discussion, it belongs there to enforce ordering of writes.

  2. modify your host code to remove the race condition.

Here’s my modified test case, and I’ve run it for a long time (500 iterations) with no inconsistencies:

$ cat t317.cu
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>
#include <assert.h>
#include <string.h>
#include <iostream>
#include <math.h>
#include <bitset>

#define CUDA_CHECK(f) do {                                                                \
  cudaError_t e = f;                                                                      \
  if (e != cudaSuccess) {                                                                 \
      printf("Cuda failure %s:%d: '%s'\n", __FILE__, __LINE__, cudaGetErrorString(e));    \
      exit(1);                                                                            \
  }                                                                                       \
} while (0)

namespace UVMConsistency {

#define UVMSPACE      volatile

#define START         0
#define GPU_START     1
#define GPU_FINISH    2
#define FINISH        3

#define NUM_SHARED 100

typedef unsigned long long int ulli;

__global__ void GPU_UVM_Writer_Kernel(UVMSPACE int *arr, UVMSPACE int *finished) {
  // Wait for CPU
  while (*finished != GPU_START);

  // Loop and execute writes on shared memory page - sequentially
  for (int i = 0; i < NUM_SHARED; i++) {
    arr[i] = 1;
    __threadfence_system();
  }
  // GPU finished - CPU can finish
  *finished = GPU_FINISH;

  // Wait for CPU to finish
  while (*finished != FINISH);
}

class Consistency {
private:        // Constructor & Destructor
  Consistency() {
    CUDA_CHECK(cudaMallocManaged(&arr, sizeof(int) * NUM_SHARED));
    memset((void *) arr, 0, sizeof(int) * NUM_SHARED);

    CUDA_CHECK(cudaMallocManaged(&finished, sizeof(int)));
    memset((void *) finished, START, sizeof(int));

    // Writing all the changes of UM to GPU
    __sync_synchronize();
  }

  ~Consistency() {
    CUDA_CHECK(cudaFree((int *) arr));
    CUDA_CHECK(cudaFree((int *) finished));
  }

private:        // Logic
  bool is_arr_full(UVMSPACE int *arr) {
    int count = 0;
    for (int i = 0; i < NUM_SHARED; i++) {
      count += arr[i];
    }
    return count == NUM_SHARED;
  }

  bool check_consistency(UVMSPACE int *arr) {
    // Read shared memory page - sequentially
    for (int i = 0; i < NUM_SHARED - 1; i++) {
#ifndef FIX
      int v1 = arr[i];
      int v2 = arr[i+1];
#else
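      // FIX: read the higher-indexed element first. With __threadfence_system()
      // after each write in the kernel, arr[i] becomes visible to the host no
      // later than arr[i+1]. So if v2 is read here as 1, the subsequent read of
      // v1 cannot return 0, and the comparison below cannot give the false
      // positive caused by the kernel running in between the two reads.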
      int v2 = arr[i+1];
      int v1 = arr[i];
#endif
      if (v1 < v2) {  // arr[i] == 0 and arr[i + 1] == 1  ==> Inconsistency
        std::cout << i << "," << v1 <<  "," << v2 << std::endl;
        return true;

      }
    }
    return false;
  }

  void launch_task() {
    // Start GPU task
    GPU_UVM_Writer_Kernel<<<1,1>>>(arr, finished);

    // GPU can start
    *finished = GPU_START;
  }

  void check_consistency() {
    // While writes have not finished
    while (!is_arr_full(arr)) {
      // Check if an inconsistency exists in the array
      if (check_consistency(arr)) {
        ::std::cout << "Found Inconsistency !" << ::std::endl;
        return;
      }
    }
    ::std::cout << "No Consistency Found" << ::std::endl;
  }

  void finish_task() {
    while (*finished != GPU_FINISH);
    // Task is over
    *finished = FINISH;

    CUDA_CHECK(cudaDeviceSynchronize());
  }

public:
  static void start() {
    Consistency consistency;
    // Start kernel
    consistency.launch_task();

    // Check GPU consistency
    consistency.check_consistency();

    // Finish task for CPU and GPU
    consistency.finish_task();
  }
private:
  UVMSPACE int *arr;
  UVMSPACE int *finished;
};

} // UVMConsistency namespace

int main() {
  UVMConsistency::Consistency::start();

  return 0;
}
[user2@dc11 misc]$ nvcc -arch=sm_60 -std=c++11 -o t317 t317.cu -DFIX
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
[user2@dc11 misc]$ CUDA_VISIBLE_DEVICES="0" ./t317
No Consistency Found
(the same command was repeated many more times; every run printed "No Consistency Found")
[user2@dc11 misc]$

If you compile without the -DFIX switch, the code reverts to occasionally reporting the inconsistency.

Pascal and Tesla are not mutually exclusive. There are Pascal processors that are also Tesla-branded. I am using a Tesla P100, which is a Pascal (family) processor.

I suggest providing a complete definition of your setup, here is mine:

Tesla P100, CUDA 10.0, driver 410.48, CentOS 7

I do have other GPUs in my system, but CUDA_VISIBLE_DEVICES="0" means the CUDA runtime only sees the P100 GPU.

All of these things matter for understanding UM behavior completely:

  • the GPU
  • the CUDA version
  • the OS
  • whether or not there are other GPUs in the system
  • if there are other GPUs, can they topologically be placed into a Peer-to-Peer arrangement
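
For reference, most of that information (everything except the OS) can be gathered with a short query program. A sketch using standard CUDA runtime calls (the loop structure and printout format are just my choice):

#include <cstdio>

int main() {
  int runtime_ver = 0, driver_ver = 0, ndev = 0;
  cudaRuntimeGetVersion(&runtime_ver);
  cudaDriverGetVersion(&driver_ver);
  cudaGetDeviceCount(&ndev);
  printf("runtime %d, driver %d, %d visible device(s)\n", runtime_ver, driver_ver, ndev);

  for (int i = 0; i < ndev; i++) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);
    printf("device %d: %s, concurrentManagedAccess=%d\n",
           i, prop.name, prop.concurrentManagedAccess);
    // Check whether each pair of visible devices can be placed into a
    // peer-to-peer arrangement.
    for (int j = 0; j < ndev; j++) {
      if (i == j) continue;
      int can = 0;
      cudaDeviceCanAccessPeer(&can, i, j);
      printf("  peer access %d -> %d: %d\n", i, j, can);
    }
  }
  return 0;
}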

That’s it!
I haven’t thought about race conditions in my host code at all!
Thanks a lot! You've helped me so much! (I have been grinding away at this problem for weeks!)

Ophir

So it seems that the initial discovery of the inconsistency was invalid, because of that race condition.
Any idea how to capture a real one?