cudaMallocManaged() clarification needed

I’m experimenting with cudaMallocManaged on Windows 10 and am not getting the results I expect. My understanding has been that memory allocated with this call is always available for reading and writing from both the host and the device, but I get seg faults any time I do a host access to a managed address after a kernel launch but before a cudaDeviceSynchronize(). This happens even for addresses that were resident on the host before the kernel launch and were not touched by the kernel. For example:

#include <iostream>
#include <math.h>
// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];
}

int main(void)
{
    int N = 1 << 20;
    float *x, *y, *z;

    // Allocate Unified Memory – accessible from CPU or GPU
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));
    cudaMallocManaged(&z, N * sizeof(float));

    // initialize x and y arrays on the host
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
        z[i] = 3.0f;
    }

    // Run kernel on 1M elements on the GPU
    add<<<1, 1>>>(N, x, y);

    // Touch a managed address that has been resident on the host
    z[0] = 5;

    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();

    // Check for errors (all values should be 3.0f)
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i] - 3.0f));
    std::cout << "Max error: " << maxError << std::endl;

    // Free memory
    cudaFree(x);
    cudaFree(y);
    cudaFree(z);

    return 0;
}

Note that I write to z, then launch the kernel (which doesn’t read or write z), then write to z again. This second write fails. If I move the cudaDeviceSynchronize() to before the write to z, then it works. All of the tutorials I’ve read say that this synchronize exists to avoid race conditions by ensuring that the CPU reads the post-kernel data. If that were the only issue, I would expect to be able to read and write from both processors and simply get an indeterminate value; I wouldn’t expect a seg fault.

The following paragraph is from this tutorial: https://devblogs.nvidia.com/unified-memory-cuda-beginners/.

“In our simple example, we have a call to cudaDeviceSynchronize() after the kernel launch. This ensures that the kernel runs to completion before the CPU tries to read the results from the managed memory pointer. Otherwise, the CPU may read invalid data (on Pascal and later), or get a segmentation fault (on pre-Pascal GPUs).”

So my GPU seems to be behaving like a pre-Pascal GPU, since it’s getting seg faults. But it’s a GeForce RTX 2080 and I’m running Windows 10, CUDA 10, and R311 driver.

And to complete my gripe, let me mention that this happens even when I’m using the driver API and launch with cuLaunchGrid() (NOT the Async variant). Even though it’s a synchronous launch, I still have to call cu*Synchronize() before I’m allowed to touch my managed memory again.

Does this match other peoples’ results? Is this expected? Are there any context attributes, compiler settings or anything else that controls this behavior?

Thanks,
Dave

I should add that the above restriction applies even to memory allocated after the last kernel launch. That is, if I add another cudaMallocManaged(&w, …) after the kernel launch, I’m not allowed to write to that new memory until after the cudaDeviceSynchronize(). What’s up with that?

Are these restrictions correctly documented anywhere?

On Windows, with CUDA 9 and later, managed memory follows the pre-6.x architecture model, which requires a cudaDeviceSynchronize() after a kernel launch before you can touch managed allocations in host code.

So this is not correct (on Windows, on CUDA 9.x or CUDA 10.0):

My understanding has been that memory allocated with this call is always available for reading and writing on both devices

and this is exactly expected behavior:

but I get seg faults any time I do a host access to a managed address after a kernel launch but before a cudaDeviceSynchronize()

and it is documented in the current programming guide in the managed memory section:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements

“Applications running on Windows (whether in TCC or WDDM mode) or macOS will use the basic Unified Memory model as on pre-6.x architectures even when they are running on hardware with compute capability 6.x or higher.”

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-coherency-hd

“Simultaneous access to managed memory on devices of compute capability lower than 6.x is not possible,”

The fact that you may or may not have “touched” the data in device code, and/or the fact that a specific managed allocation does or does not show up in the list of arguments passed to a particular kernel, does not affect these statements.
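In other words, once any kernel has been launched, the host must synchronize before touching any managed allocation. Here is a minimal sketch of the required ordering (my own illustration, reusing the variable names from the example above; not tested code):

add<<<1, 1>>>(N, x, y);   // launch: all managed allocations become inaccessible to the host

// z[0] = 5;              // NOT allowed here on Windows -- a host touch of any managed
                          // allocation, even one the kernel never uses, will fault

cudaDeviceSynchronize();  // after this the host may access managed memory again
z[0] = 5;                 // OK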

Applications desiring to test for the availability of concurrent managed access can use the named property concurrentManagedAccess when calling cudaGetDeviceProperties:

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g1bf9d625a931d657e08db2b4391170f0
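A minimal check might look like this (my own sketch; device 0 is assumed and error checking is omitted):

#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0 assumed

    if (prop.concurrentManagedAccess)
        printf("Concurrent CPU/GPU access to managed memory is supported.\n");
    else
        printf("Pre-6.x model: synchronize before touching managed memory on the host.\n");

    return 0;
}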

The seg fault when trying to access managed memory from the host, after the device has worked on it and before a synchronize, is expected and is explained in the link you provided. The difference in behavior between pre-Pascal and Pascal-and-later GPUs when using cudaMallocManaged is not whether it works or seg faults, but whether the hardware supports page faulting. It is explained in the article.

This question has been asked at least 4 or 5 times in the past 2 weeks. What is common among them is that code which was fixed up and worked with CUDA releases before 10 no longer works on CUDA 10. The moderators suggested filing a bug with a proper description and working sample code.

My personal experience is that cudaMallocManaged works perfectly on Windows 8.1/Ubuntu 16.04, both with CUDA 9.1 and a 1080 Ti. Right now I am on a MacBook Pro 2012 with a GT 650M (Kepler), with Xcode 9.4 and CUDA 10 (yes, 10), and cudaMallocManaged works without any problem.

I suggest you read those threads and see what they have in common with your case. You can also try thrust::host_vector and thrust::device_vector, which make manual allocation/deallocation unnecessary and host-to-device copies much easier, and see if that works. The code below works for me with CUDA 10 and a GT 650M on macOS High Sierra:

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/copy.h>
#include <thrust/fill.h>
#include <thrust/execution_policy.h>
#include <cstdio>    // for device-side printf
#include <iostream>

using namespace std;

__global__ void our_kernel(float *input)
    {
    printf("From GPU: %f\n", input[threadIdx.x]);
    }

int main(void)
    {
    const unsigned int LENGTH = 10;
    thrust::host_vector <float> h_array(LENGTH);
    thrust::device_vector <float> d_array(LENGTH);
    float *rp_array = thrust::raw_pointer_cast(d_array.data());

    for(int i = 0; i < LENGTH; i++)
        h_array[i] = i * 0.5f;

    thrust::copy(h_array.begin(), h_array.begin() + LENGTH, d_array.begin());

    our_kernel <<<1, LENGTH>>> (rp_array);
    // No cudaDeviceSynchronize() needed here: the device-to-host copy below synchronizes implicitly

    thrust::fill(thrust::host, h_array.begin(), h_array.end(), 0);
    thrust::copy(d_array.begin(), d_array.begin() + LENGTH, h_array.begin());

    for(int i = 0; i < LENGTH; i++)
        cout << "From CPU: " << h_array[i] << endl;

    return 0;
    }

You can see in the code that I don’t do any explicit memory handling; Thrust does it. So we have a host and a device array: the host array is filled with some numbers, copied to the device array, and the GPU prints it. Then the host array is filled with zeroes, the contents of the device array are copied back, and the CPU prints them. See if this runs without crashing, and if it crashes, file a bug.

Thank you, Robert. That is exactly what I needed to hear, though definitely not what I wanted to hear. :)

I spent quite a bit of time architecting my code around managed memory and then root-causing the seg faults, and this was after having done quite a lot of reading about managed memory. Now I feel it was too difficult to discover that support for most unified memory features has been withdrawn on Windows. Could I suggest a couple of edits that would help you communicate the true level of support to your developers so they avoid my pain?

First, people depend on the tutorials on the dev blog much more than on the appendices of the programming guide because they appear higher in search results and they are easier to read. So even though a dev blog article is conceptually tied to a point in time, such as the Pascal release, could I urge you to go back and edit the relevant dev blog articles (Tag: Unified Memory | NVIDIA Technical Blog) to update what is and is not supported? This is important because we naturally expect support to expand over time, not diminish, so it’s really counterintuitive for something that worked in a 2017 article to not work in 2018.

Second, edit Appendix K to use the correct distinction of what works where. The inaccurate distinction “compute capability 6.x” is mentioned 29 times, while “running on Windows or macOS…” is mentioned only once, and it isn’t tied very precisely to the set of functionality it applies to. Because developers are unlikely to spend a morning reading all of Appendix K, as I felt I had to do after you kindly pointed me to it, they are unlikely to discover the one sentence that contains the true distinction, since it’s not in the subsection they care about. And even after reading the whole thing in detail, I was still left with questions about what precisely would and would not work on Windows.

What I suggest instead is replacing all of the “on devices with compute capability 6.x” phrases with phrases like “on devices with the concurrentManagedAccess property”, and making that a hot link to a subsection or table that explains both the 6.x and the Windows / macOS restrictions as they apply to each property. This would get all developers to think about whether the specific property exists on their device, and it would also tell them the name of the property being described, rather than lumping everything together under the overly broad concept of UVM or managed memory.

I think the device attribute names to distinguish are (see the query sketch after this list):
CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING
CU_DEVICE_ATTRIBUTE_CAN_MAP_HOST_MEMORY
CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS
CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS
CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES
CU_DEVICE_ATTRIBUTE_DIRECT_MANAGED_MEM_ACCESS_FROM_HOST
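For reference, here is a rough driver-API sketch (my own, untested; device 0 assumed, compiled against cuda.h and linked with the driver library) that dumps the attributes listed above via cuDeviceGetAttribute:

#include <cstdio>
#include <cuda.h>

int main(void)
{
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);   // device 0 assumed

    // Attributes relevant to unified/managed memory support
    struct { const char *name; CUdevice_attribute attr; } attrs[] = {
        { "UNIFIED_ADDRESSING",        CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING },
        { "CAN_MAP_HOST_MEMORY",       CU_DEVICE_ATTRIBUTE_CAN_MAP_HOST_MEMORY },
        { "PAGEABLE_MEMORY_ACCESS",    CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS },
        { "CONCURRENT_MANAGED_ACCESS", CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS },
        { "PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES",
          CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES },
        { "DIRECT_MANAGED_MEM_ACCESS_FROM_HOST",
          CU_DEVICE_ATTRIBUTE_DIRECT_MANAGED_MEM_ACCESS_FROM_HOST },
    };

    for (const auto &a : attrs) {
        int value = 0;
        cuDeviceGetAttribute(&value, a.attr, dev);
        printf("%-48s %d\n", a.name, value);
    }
    return 0;
}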

I should also mention that Appendix K says “See Data Migration and Coherency for details”, but that section doesn’t give any details about the OS constraints.

And a quick question: on Windows, on compute capability 7.5 GPUs, does page faulting happen at all, or is all page migration done explicitly before or after a kernel launch?

Thanks again for the help.

-Dave

I suggest filing recommendations for documentation changes as a bug at developer.nvidia.com.

Log in as a registered developer, then click on your name in the upper right-hand corner to get to your account area. From there you can report a bug.

Regarding the question: compute capability 7.5 behavior on Windows shouldn’t be any different from compute capability 7.0 or 6.x behavior on Windows. All page migration is done before (at) kernel launch or after a subsequent cudaDeviceSynchronize(); the kernel launch itself triggers the migration of data to the GPU. Regarding migration of data from GPU to CPU, and/or the behavior of data after the initial managed allocation upon first touch, those are implementation details that are not published AFAIK. There may be some CPU page-faulting activity observable in some cases.