Hi,
I’m not sure if what I came across is a feature, a bug, a known behaviour, or possibly all of the above. So I thought I’d ask if others knew about this.
While setting up some thorough unit testing code for a specific project, I found that some of the tests would pretty much always fail on Windows, but would run happily on Linux. After some painful debugging, since the problem happened quite deep inside of our code, I had to realize that managed memory just behaves differently on Windows and on Linux.
I boiled it down to the following demonstrator:
// CUDA include(s).
#include <cuda_runtime_api.h>
// System include(s).
#include <iostream>
#include <cstdlib>
/// Simple macro for checking CUDA return values.
#define CUDA_CHECK(EXPR) \
do \
{ \
const cudaError_t err = EXPR; \
if (err != cudaSuccess) \
{ \
std::cerr << "CUDA error in file '" << __FILE__ << "' in line " \
<< __LINE__ << ": " << cudaGetErrorString(err) \
<< std::endl; \
std::exit(1); \
} \
} while (false)
int main()
{
// Allocate a blob of "managed" and "device" memory.
static const int N = 100;
int *managed_mem = nullptr;
CUDA_CHECK(cudaMallocManaged((void **)&managed_mem, N * sizeof(int)));
int *device_mem = nullptr;
CUDA_CHECK(cudaMalloc((void **)&device_mem, N * sizeof(int)));
// Create a CUDA stream.
cudaStream_t stream;
CUDA_CHECK(cudaStreamCreate(&stream));
// Fill the managed memory with some data. Demonstrating that this works just
// fine at this point, while there's no copy going on.
for (int i = 0; i < N; i++)
{
managed_mem[i] = i;
}
// Start an asynchronous copy into the device memory.
CUDA_CHECK(cudaMemcpyAsync(device_mem, managed_mem, N * sizeof(int),
cudaMemcpyHostToDevice, stream));
// Try to print the values in managed memory. This should work, but on
// Windows it doesn't. :-(
for (int i = 0; i < N; i++)
{
std::cout << "managed_mem[" << i << "] = " << managed_mem[i] << std::endl;
}
// Destroy the CUDA stream.
CUDA_CHECK(cudaStreamSynchronize(stream));
CUDA_CHECK(cudaStreamDestroy(stream));
// De-allocate the memory.
CUDA_CHECK(cudaFree(managed_mem));
CUDA_CHECK(cudaFree(device_mem));
// Return gracefully.
return 0;
}
This dummy code works perfectly happily on Linux for me. But when trying to execute it on Windows (either natively or inside of WSL2), it crashes.
Apparently on Windows I’m not allowed to read from CUDA managed memory while a copy is happening “from it”. Notice that the managed memory block is not being modified in this setup. Either by the asynchronous copy, or by the host code accessing it.
Is this intentional/known?
As I began, this issue surfaced while setting up some thorough unit tests. Which would exercise all combinations of memory types and copy behaviours with our code. So I don’t really want to use asynchronous copies with managed memory like this in any actual code. But exactly because of this, I wonder if others have ever come across this issue.
Cheers,
Attila