Accessing Managed Memory During Asynchronous Copies

Hi,

I’m not sure if what I came across is a feature, a bug, a known behaviour, or possibly all of the above. So I thought I’d ask if others knew about this.

While setting up some thorough unit tests for a specific project, I found that some of the tests would almost always fail on Windows, but would run happily on Linux. After some painful debugging (the problem happened quite deep inside our code), I came to realize that managed memory simply behaves differently on Windows than on Linux.

I boiled it down to the following demonstrator:

// CUDA include(s).
#include <cuda_runtime_api.h>

// System include(s).
#include <iostream>
#include <cstdlib>

/// Simple macro for checking CUDA return values.
#define CUDA_CHECK(EXPR)                                                 \
   do                                                                    \
   {                                                                     \
      const cudaError_t err = EXPR;                                      \
      if (err != cudaSuccess)                                            \
      {                                                                  \
         std::cerr << "CUDA error in file '" << __FILE__ << "' in line " \
                   << __LINE__ << ": " << cudaGetErrorString(err)        \
                   << std::endl;                                         \
         std::exit(1);                                                   \
      }                                                                  \
   } while (false)

int main()
{

   // Allocate a blob of "managed" and "device" memory.
   static const int N = 100;
   int *managed_mem = nullptr;
   CUDA_CHECK(cudaMallocManaged((void **)&managed_mem, N * sizeof(int)));
   int *device_mem = nullptr;
   CUDA_CHECK(cudaMalloc((void **)&device_mem, N * sizeof(int)));

   // Create a CUDA stream.
   cudaStream_t stream;
   CUDA_CHECK(cudaStreamCreate(&stream));

   // Fill the managed memory with some data. Demonstrating that this works just
   // fine at this point, while there's no copy going on.
   for (int i = 0; i < N; i++)
   {
      managed_mem[i] = i;
   }

   // Start an asynchronous copy into the device memory.
   CUDA_CHECK(cudaMemcpyAsync(device_mem, managed_mem, N * sizeof(int),
                              cudaMemcpyHostToDevice, stream));

   // Try to print the values in managed memory. This should work, but on
   // Windows it doesn't. :-(
   for (int i = 0; i < N; i++)
   {
      std::cout << "managed_mem[" << i << "] = " << managed_mem[i] << std::endl;
   }

   // Destroy the CUDA stream.
   CUDA_CHECK(cudaStreamSynchronize(stream));
   CUDA_CHECK(cudaStreamDestroy(stream));

   // De-allocate the memory.
   CUDA_CHECK(cudaFree(managed_mem));
   CUDA_CHECK(cudaFree(device_mem));

   // Return gracefully.
   return 0;
}

This dummy code runs perfectly happily on Linux for me. But when I try to execute it on Windows (either natively or inside of WSL2), it crashes.

Apparently on Windows I’m not allowed to read from CUDA managed memory while a copy is happening “from” it. Notice that the managed memory block is not being modified in this setup, either by the asynchronous copy or by the host code accessing it.
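For what it’s worth, the difference seems easy to detect at runtime. Here is a minimal sketch, reusing the CUDA_CHECK macro from above, under my assumption that the concurrentManagedAccess device attribute is what separates the two platforms:

   int device = 0;
   CUDA_CHECK(cudaGetDevice(&device));
   // Query whether the device allows the CPU and GPU to access managed
   // memory concurrently. (I suspect this reports 0 on the Windows
   // setups in question.)
   int concurrentManagedAccess = 0;
   CUDA_CHECK(cudaDeviceGetAttribute(&concurrentManagedAccess,
                                     cudaDevAttrConcurrentManagedAccess,
                                     device));
   std::cout << "concurrentManagedAccess = " << concurrentManagedAccess
             << std::endl;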

Is this intentional/known?

As I said at the start, this issue surfaced while setting up some thorough unit tests, which exercise all combinations of memory types and copy behaviours with our code. So I don’t actually want to use asynchronous copies with managed memory like this in any real code. But exactly because of this, I wonder whether others have ever come across this issue.

Cheers,
Attila

The documentation states that managed memory on Windows has the same limitations as on pre-Pascal GPUs:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements

Thank you for your message. But I don’t understand how the page that you linked relates to what I described. :-/

What I took away from that page is that on Windows the block sizes used for migrating managed data between the host and the device are larger than on Linux. But I’m not trying to trigger any data migration. Or at least I don’t think I am. :-/

In this example I’m copying data from a managed memory blob into a “pure” device memory blob. Is the CUDA runtime really achieving this by first migrating the managed memory blob to the device and then doing a D->D transfer?

In the end, as I wrote, I took note of this behaviour and will write our code accordingly. I just wanted to discuss it a bit, to understand whether this behaviour is intentional/known. But so far I don’t know the answer to that.
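For reference, the sort of adjustment I have in mind is simply to make sure that the copy has finished before the host touches the managed buffer again. A minimal sketch, based on the demonstrator above:

   // Start the asynchronous copy into the device memory, as before.
   CUDA_CHECK(cudaMemcpyAsync(device_mem, managed_mem, N * sizeof(int),
                              cudaMemcpyHostToDevice, stream));

   // Wait for the copy to finish before the host reads the managed
   // memory again.
   CUDA_CHECK(cudaStreamSynchronize(stream));

   // This loop is now safe on both platforms.
   for (int i = 0; i < N; i++)
   {
      std::cout << "managed_mem[" << i << "] = " << managed_mem[i] << std::endl;
   }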

Cheers,
Attila

The information is scattered across multiple subsections of the linked document. I think the following is the relevant part for your case:

Applications running on Windows (whether in TCC or WDDM mode) will use the basic Unified Memory model as on pre-6.x architectures even when they are running on hardware with compute capability 6.x or higher.

19.2.2.1. GPU Exclusive Access To Managed Memory

To ensure coherency on pre-6.x GPU architectures, the Unified Memory programming model puts constraints on data accesses while both the CPU and GPU are executing concurrently. In effect, the GPU has exclusive access to all managed data while any kernel operation is executing, regardless of whether the specific kernel is actively using the data. When managed data is used with cudaMemcpy*() or cudaMemset*(), the system may choose to access the source or destination from the host or the device, which will put constraints on concurrent CPU access to that data while the cudaMemcpy*() or cudaMemset*() is executing. See Memcpy()/Memset() Behavior With Managed Memory for further details.

It is not permitted for the CPU to access any managed allocations or variables while the GPU is active for devices with concurrentManagedAccess property set to 0. On these systems concurrent CPU/GPU accesses, even to different managed memory allocations, will cause a segmentation fault because the page is considered inaccessible to the CPU.

Since managed memory can be accessed from either the host or the device, cudaMemcpy*() relies on the type of transfer, specified using cudaMemcpyKind, to determine whether the data should be accessed as a host pointer or a device pointer.

If cudaMemcpyHostTo* is specified and the source data is managed, then it will be accessed from the host if it is coherently accessible from the host in the copy stream; otherwise it will be accessed from the device.

When data is accessed from the device either by cudaMemcpy*() or cudaMemset*(), the stream of operation is considered to be active on the GPU. During this time, any CPU access of data that is associated with that stream, or of data that has global visibility, will result in a segmentation fault if the GPU has a zero value for the device attribute concurrentManagedAccess. The program must synchronize appropriately to ensure the operation has completed before accessing any associated data from the CPU.
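If you do need the host to keep accessing the buffer while the copy is in flight, the pre-6.x model also offers per-stream attachment, which should make the source “coherently accessible from the host in the copy stream” in the sense of the text above. An untested sketch, based on your demonstrator; I have not verified on Windows that this actually lifts the restriction:

   // Attach the managed allocation to the host before the copy. Per the
   // rules quoted above, cudaMemcpyAsync should then access it as a host
   // pointer.
   CUDA_CHECK(cudaStreamAttachMemAsync(stream, managed_mem, 0,
                                       cudaMemAttachHost));
   // The attachment only takes effect once the stream has been
   // synchronized.
   CUDA_CHECK(cudaStreamSynchronize(stream));

   CUDA_CHECK(cudaMemcpyAsync(device_mem, managed_mem, N * sizeof(int),
                              cudaMemcpyHostToDevice, stream));

   // In principle the host may now read managed_mem while the copy is in
   // flight.
   for (int i = 0; i < N; i++)
   {
      std::cout << managed_mem[i] << std::endl;
   }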

Thanks for pointing to the specific parts!

Okay, you were indeed right. ;-) This does describe my setup. So yeah, we clearly need to account for this behaviour in our code.

Thanks again!