Device-to-host data copy may not be reflected on the host side when using graphs

I am building a framework that uses CUDA graphs to process data on the device for the host. To hide the copy latency, a memcpy node copies the processed data from the device back into the corresponding host buffer as soon as possible.

In testing I have noticed that data copied back to the host from inside a CUDA graph is not always correct. My theory is that the stale values are cached on the host side because the host does not know its data was changed. Here is a simple example that reproduces the observed behavior:

#include <cuda_runtime.h>
#include <sstream>

#define CUDA_CHECK(err)                                                                      \
   do {                                                                                      \
      cudaError_t err_ = (err);                                                              \
      if(err_ != cudaSuccess) {                                                              \
         std::stringstream msg;                                                              \
         msg << "CUDA error at " << __FILE__ << ":" << __LINE__                              \
             << " code=" << static_cast<unsigned int>(err_) << "(" << cudaGetErrorName(err_) \
             << ") " << #err << std::endl;                                                   \
         throw std::runtime_error(msg.str());                                                \
      }                                                                                      \
   } while(0)

static inline void check(int *data, int value, size_t size, const char *message)
{
   size_t failed = 0;
   for(size_t idx = 0; idx < size; idx++) {
      if(data[idx] != value) {
         failed++;
      }
   }
   if(failed > 0)
      printf("%s: %zu out of %zu elements\n", message, failed, size);
}

int main()
{
   int *cached_h, *cached_d;
   size_t size = 1024, sizeInBytes = size * sizeof(*cached_h);
   CUDA_CHECK(cudaMallocHost(&cached_h, sizeInBytes));
   CUDA_CHECK(cudaMalloc(&cached_d, sizeInBytes)); // Device memory will stay 0

   // Prepare copying device memory back to host memory in a way that the host does not know
   cudaGraph_t covertCopyGraph;
   CUDA_CHECK(cudaGraphCreate(&covertCopyGraph, 0));
   cudaGraphNode_t covertCopy;
   CUDA_CHECK(cudaGraphAddMemcpyNode1D(
      &covertCopy,
      covertCopyGraph,
      nullptr,
      0,
      cached_h, // dst
      cached_d, // src
      sizeInBytes,
      cudaMemcpyDeviceToHost));
   cudaGraphExec_t covertCopyExec;
   CUDA_CHECK(cudaGraphInstantiate(&covertCopyExec, covertCopyGraph));

   // Change host memory, then pull device memory covertly for host
   for(size_t idx = 0; idx < size; idx++)
      cached_h[idx] = 2;
   CUDA_CHECK(cudaGraphLaunch(covertCopyExec, 0));
   // Verify; sometimes it may not match, probably due to the host cache
   check(cached_h, 0, size, "Data invalid after graph copy");

   // Change host memory, then pull device memory triggered from host directly
   for(size_t idx = 0; idx < size; idx++)
      cached_h[idx] = 2;
   CUDA_CHECK(cudaMemcpy(cached_h, cached_d, sizeInBytes, cudaMemcpyDeviceToHost));
   // Verify that it matches every time
   check(cached_h, 0, size, "Data invalid after manual copy");

   // Free resources
   CUDA_CHECK(cudaGraphDestroy(covertCopyGraph));
   CUDA_CHECK(cudaGraphExecDestroy(covertCopyExec));
   CUDA_CHECK(cudaFreeHost(cached_h));
   CUDA_CHECK(cudaFree(cached_d));
}

When executing the program, most of the time I get the following output:

$ ./a.out
Data invalid after graph copy: 1024 out of 1024 elements

When executing the program while running a stress test on the host processor, I get a slightly different output, which indicates to me that the host cache is causing the irregular behavior (due to the cache being contested by the stress test):

$ ./a.out
Data invalid after graph copy: 791 out of 1024 elements
$ ./a.out
Data invalid after graph copy: 916 out of 1024 elements
$ ./a.out
Data invalid after graph copy: 957 out of 1024 elements
$ ./a.out
Data invalid after graph copy: 872 out of 1024 elements
$ ./a.out
Data invalid after graph copy: 1022 out of 1024 elements
$ ./a.out
Data invalid after graph copy: 1000 out of 1024 elements

Is there any way to avoid cached data on the host processor while still being able to copy data from the device back to the host inside a CUDA graph?

The nvcc compiler version is 12.0.76; the system I’m currently testing on is Debian Bullseye with an Intel Skylake CPU and a Turing GPU.

  1. There is nothing in your code that sets the device memory associated with cached_d to zero. cudaMalloc() does not do that.
  2. A graph launch is an asynchronous exercise, much like a kernel launch. In order for you to reliably see the results of the graph copy operation, you will need a device synchronization step of some sort before you start testing host memory.

The following code has 2 lines added to address the above issues, and should not give the invalid data report:

# cat t26.cu
#include <cuda_runtime.h>
#include <sstream>

#define CUDA_CHECK(err)                                                                      \
   do {                                                                                      \
      cudaError_t err_ = (err);                                                              \
      if(err_ != cudaSuccess) {                                                              \
         std::stringstream msg;                                                              \
         msg << "CUDA error at " << __FILE__ << ":" << __LINE__                              \
             << " code=" << static_cast<unsigned int>(err_) << "(" << cudaGetErrorName(err_) \
             << ") " << #err << std::endl;                                                   \
         throw std::runtime_error(msg.str());                                                \
      }                                                                                      \
   } while(0)

static inline void check(int *data, int value, size_t size, const char *message)
{
   size_t failed = 0;
   for(size_t idx = 0; idx < size; idx++) {
      if(data[idx] != value) {
         failed++;
      }
   }
   if(failed > 0)
      printf("%s: %zu out of %zu elements\n", message, failed, size);
}

int main()
{
   int *cached_h, *cached_d;
   size_t size = 1024, sizeInBytes = size * sizeof(*cached_h);
   CUDA_CHECK(cudaMallocHost(&cached_h, sizeInBytes));
   CUDA_CHECK(cudaMalloc(&cached_d, sizeInBytes)); // Device memory will stay 0
   CUDA_CHECK(cudaMemset(cached_d, 0, sizeInBytes)); // ADDED

   // Prepare copying device memory back to host memory in a way that the host does not know
   cudaGraph_t covertCopyGraph;
   CUDA_CHECK(cudaGraphCreate(&covertCopyGraph, 0));
   cudaGraphNode_t covertCopy;
   CUDA_CHECK(cudaGraphAddMemcpyNode1D(
      &covertCopy,
      covertCopyGraph,
      nullptr,
      0,
      cached_h, // dst
      cached_d, // src
      sizeInBytes,
      cudaMemcpyDeviceToHost));
   cudaGraphExec_t covertCopyExec;
   CUDA_CHECK(cudaGraphInstantiate(&covertCopyExec, covertCopyGraph));

   // Change host memory, then pull device memory covertly for host
   for(size_t idx = 0; idx < size; idx++)
      cached_h[idx] = 2;
   CUDA_CHECK(cudaGraphLaunch(covertCopyExec, 0));
   // Verify; sometimes it may not match, probably due to the host cache
   CUDA_CHECK(cudaDeviceSynchronize()); // ADDED
   check(cached_h, 0, size, "Data invalid after graph copy");

   // Change host memory, then pull device memory triggered from host directly
   for(size_t idx = 0; idx < size; idx++)
      cached_h[idx] = 2;
   CUDA_CHECK(cudaMemcpy(cached_h, cached_d, sizeInBytes, cudaMemcpyDeviceToHost));
   // Verify that it matches every time
   check(cached_h, 0, size, "Data invalid after manual copy");

   // Free resources
   CUDA_CHECK(cudaGraphDestroy(covertCopyGraph));
   CUDA_CHECK(cudaGraphExecDestroy(covertCopyExec));
   CUDA_CHECK(cudaFreeHost(cached_h));
   CUDA_CHECK(cudaFree(cached_d));
}

# nvcc -o t26 t26.cu
# ./t26
#
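
As a side note on item 2: if a full cudaDeviceSynchronize() is heavier than you want for your latency-hiding use case (for example because other streams have unrelated work in flight), you can wait on just the graph's own work instead, e.g. by launching into a dedicated stream and recording an event right after the launch. A minimal sketch based on the code above (the stream and event names are just illustrative):

   // Wait only for the graph's work instead of synchronizing the whole device
   cudaStream_t stream;
   cudaEvent_t copyDone;
   CUDA_CHECK(cudaStreamCreate(&stream));
   CUDA_CHECK(cudaEventCreate(&copyDone));

   CUDA_CHECK(cudaGraphLaunch(covertCopyExec, stream)); // launch into the dedicated stream
   CUDA_CHECK(cudaEventRecord(copyDone, stream));       // marks the point after the D2H copy

   // other host work could overlap with the copy here

   CUDA_CHECK(cudaEventSynchronize(copyDone));          // blocks until the copy has finished
   check(cached_h, 0, size, "Data invalid after graph copy");

   CUDA_CHECK(cudaEventDestroy(copyDone));
   CUDA_CHECK(cudaStreamDestroy(stream));

With only the graph in the stream this is equivalent to cudaStreamSynchronize(stream); the event becomes more useful once additional work follows the copy in the same stream.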

Thank you for your reply! I indeed forgot to synchronize in my example; I’m sorry for that. I was pretty sure that cudaMalloc initializes memory with zeroes, but the documentation states it is “not cleared”, which I had overlooked.

I’ve now also changed the example to use streams, which is closer to my real implementation, and it behaves as expected rather than as I originally observed. I consider this solved; thank you again for your help!

Changed example:

#include <cuda_runtime.h>
#include <sstream>

#define CUDA_CHECK(err)                                                                      \
   do {                                                                                      \
      cudaError_t err_ = (err);                                                              \
      if(err_ != cudaSuccess) {                                                              \
         std::stringstream msg;                                                              \
         msg << "CUDA error at " << __FILE__ << ":" << __LINE__                              \
             << " code=" << static_cast<unsigned int>(err_) << "(" << cudaGetErrorName(err_) \
             << ") " << #err << std::endl;                                                   \
         throw std::runtime_error(msg.str());                                                \
      }                                                                                      \
   } while(0)

static inline void check(int *data, int value, size_t size, const char *message)
{
   size_t failed = 0;
   for(size_t idx = 0; idx < size; idx++) {
      if(data[idx] != value) {
         failed++;
      }
   }
   if(failed > 0)
      printf("%s: %zu out of %zu elements\n", message, failed, size);
}

int main()
{
   int *cached_h, *cached_d;
   size_t size = 1024, sizeInBytes = size * sizeof(*cached_h);
   CUDA_CHECK(cudaMallocHost(&cached_h, sizeInBytes));
   CUDA_CHECK(cudaMalloc(&cached_d, sizeInBytes)); // Device memory will stay 0
   CUDA_CHECK(cudaMemset(cached_d, 0, sizeInBytes));

   // Prepare copying device memory back to host memory in a way that the host does not know
   cudaGraph_t covertCopyGraph;
   CUDA_CHECK(cudaGraphCreate(&covertCopyGraph, 0));
   cudaGraphNode_t covertCopy;
   CUDA_CHECK(cudaGraphAddMemcpyNode1D(
      &covertCopy,
      covertCopyGraph,
      nullptr,
      0,
      cached_h, // dst
      cached_d, // src
      sizeInBytes,
      cudaMemcpyDeviceToHost));
   cudaGraphExec_t covertCopyExec;
   CUDA_CHECK(cudaGraphInstantiate(&covertCopyExec, covertCopyGraph));
   cudaStream_t stream;
   CUDA_CHECK(cudaStreamCreate(&stream));

   // Change host memory, then pull device memory covertly for host
   for(size_t idx = 0; idx < size; idx++)
      cached_h[idx] = 2;
   CUDA_CHECK(cudaGraphLaunch(covertCopyExec, stream));
   CUDA_CHECK(cudaStreamSynchronize(stream));
   // Verify; sometimes it may not match, probably due to the host cache
   check(cached_h, 0, size, "Data invalid after graph copy");

   // Change host memory, then pull device memory triggered from host directly
   for(size_t idx = 0; idx < size; idx++)
      cached_h[idx] = 2;
   CUDA_CHECK(cudaMemcpy(cached_h, cached_d, sizeInBytes, cudaMemcpyDeviceToHost));
   // Verify that it matches every time
   check(cached_h, 0, size, "Data invalid after manual copy");

   // Free resources
   CUDA_CHECK(cudaGraphDestroy(covertCopyGraph));
   CUDA_CHECK(cudaGraphExecDestroy(covertCopyExec));
   CUDA_CHECK(cudaStreamDestroy(stream));
   CUDA_CHECK(cudaFreeHost(cached_h));
   CUDA_CHECK(cudaFree(cached_d));
}
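
In my real framework I’ll probably avoid blocking at a fixed point altogether and instead have the host notified as soon as the copy is done, for example by queueing a host callback behind the graph launch with cudaLaunchHostFunc (or by adding a host node to the graph itself). A rough sketch of what I have in mind, reusing the objects from the example above (the flag and callback names are made up):

#include <atomic>

// Set by the callback once the graph's D2H copy has completed
static std::atomic<bool> copyReady{false};

static void markCopyReady(void *)
{
   copyReady.store(true, std::memory_order_release);
}

   // Inside main(), replacing the launch/synchronize/check sequence:
   CUDA_CHECK(cudaGraphLaunch(covertCopyExec, stream));
   // The host function runs only after all preceding work in the stream, i.e. after the copy
   CUDA_CHECK(cudaLaunchHostFunc(stream, markCopyReady, nullptr));

   // Other host work can run here; consume cached_h once copyReady is set,
   // or fall back to cudaStreamSynchronize(stream) right before the data is needed
   while(!copyReady.load(std::memory_order_acquire)) { /* spin or do useful host work */ }
   check(cached_h, 0, size, "Data invalid after graph copy");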

Note: The documentation link you provided applies to CUDA 4.0, which was current somewhere in the ballpark of 2010/2011. Nowadays we are on CUDA 12.x and the current documentation for this item is here. (It also says the memory is not cleared.)

Old documentation like that is oftentimes “not removed from the internet” for a few reasons:

  1. It would break links to it, for example in discussions like this one.
  2. Some folks might legitimately be trying to use a very old version of CUDA, and it’s then useful/helpful to have the “old documentation” available for reference.

However, barring those situations, I would encourage you to refer to more recent documentation. You can generally find most NVIDIA documentation at https://docs.nvidia.com, and that is true for the link I provided.


From the documentation link I provided and the content it contained, it was not clear to me that it was that old. I’ve edited my post to use the newer documentation link you provided, which also states its CUDA version at the top of the page.
