I’m experimenting with cudaMallocManaged on Windows 10 and am not getting the results I expect. My understanding has been that memory allocated with this call is always available for reading and writing on both the host and the device, but I get segfaults any time I access a managed address from the host after a kernel launch but before a cudaDeviceSynchronize(). This happens even for addresses that were resident on the host before the kernel launch and are not touched by the kernel. For example:
#include <iostream>
#include <math.h>

// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1 << 20;
  float *x, *y, *z;

  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N * sizeof(float));
  cudaMallocManaged(&y, N * sizeof(float));
  cudaMallocManaged(&z, N * sizeof(float));

  // Initialize x, y, and z arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
    z[i] = 3.0f;
  }

  // Run kernel on 1M elements on the GPU
  add<<<1, 1>>>(N, x, y);

  // Touch a managed address that has been resident on the host
  // and is never accessed by the kernel -- this write segfaults
  z[0] = 5;

  // Wait for GPU to finish before accessing y on the host
  cudaDeviceSynchronize();

  // Check for errors (all values of y should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i] - 3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);
  cudaFree(z);

  return 0;
}
Note that I write to z, then launch the kernel (which neither reads nor writes z), then write to z again. This second write segfaults. If I move the cudaDeviceSynchronize() to before the write to z, it works. All of the tutorials I’ve read say that this synchronize exists to avoid race conditions, i.e. to ensure the CPU reads the post-kernel data. If that were the whole story, I would expect to be able to read and write from both processors and merely get an indeterminate value; I wouldn’t expect a segfault.
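For reference, the only change that makes it work for me is moving the cudaDeviceSynchronize() up so it sits between the launch and the host write to z (the rest of main is unchanged):

  // Run kernel on 1M elements on the GPU
  add<<<1, 1>>>(N, x, y);

  // Wait for the GPU to finish before touching any managed memory on the host
  cudaDeviceSynchronize();

  // With the synchronize moved up, this host write to z no longer faults
  z[0] = 5;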
The following paragraph is from this tutorial: https://devblogs.nvidia.com/unified-memory-cuda-beginners/.
“In our simple example, we have a call to cudaDeviceSynchronize() after the kernel launch. This ensures that the kernel runs to completion before the CPU tries to read the results from the managed memory pointer. Otherwise, the CPU may read invalid data (on Pascal and later), or get a segmentation fault (on pre-Pascal GPUs).”
So my GPU seems to be behaving like a pre-Pascal GPU, since I’m getting segfaults. But it’s a GeForce RTX 2080, and I’m running Windows 10, CUDA 10, and the R311 driver.
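In case it helps with diagnosis, here is a small check I can run to see what the runtime reports for my device. This is just a sketch, based on my assumption that cudaDevAttrConcurrentManagedAccess is the attribute that governs whether the host may touch managed memory while the GPU is active:

#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
  int device = 0;
  cudaGetDevice(&device);

  int managed = 0, concurrent = 0;
  // Does the device support managed (unified) memory at all?
  cudaDeviceGetAttribute(&managed, cudaDevAttrManagedMemory, device);
  // Can the CPU and GPU access managed memory concurrently ("Pascal-style" behavior)?
  // I'm assuming a value of 0 here corresponds to the pre-Pascal-like behavior I'm seeing.
  cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, device);

  printf("managedMemory: %d, concurrentManagedAccess: %d\n", managed, concurrent);
  return 0;
}

If concurrentManagedAccess comes back 0, that would at least be consistent with the behavior I’m seeing, but I don’t know what actually controls it.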
And to complete my gripe, let me mention that this happens even when I’m using the driver API and launching with cuLaunchGrid() (NOT the Async variant). Even though it’s a synchronous launch, I still have to call cu*Synchronize() before I’m allowed to touch my managed memory again.
Does this match other people’s results? Is it expected? Are there any context attributes, compiler settings, or other knobs that control this behavior?
Thanks,
Dave