Kernel invocation invalidates unified memory blocks

I have a scenario where invoking a kernel invalidates memory blocks until a sync (cudaStreamSynchronize) is called.

The memory blocks are allocated up front, using cudaMallocManaged, during a data population phase. After that they are never written to.

During a later calculation phase, an array is populated with pointers into these blocks. The array of pointers is copied to GPU memory (allocated with cudaMalloc) and passed to a kernel.
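For concreteness, this is roughly what the code does (a simplified sketch; the kernel and variable names here are made up and the sizes are illustrative):

#include <cuda_runtime.h>

__global__ void calcKernel(short **blocks, int numBlocks) { /* reads blocks[i][j] ... */ }

void runCalculation()
{
    const int numBlocks = 4;
    const size_t blockElems = 1024;

    // Data population phase: blocks allocated with unified memory, written once, never again.
    short *blocks[numBlocks];
    for (int i = 0; i < numBlocks; ++i) {
        cudaMallocManaged(&blocks[i], blockElems * sizeof(short));
        blocks[i][0] = 94;
    }

    // Calculation phase: copy the array of pointers to device memory and launch.
    short **devBlockPtrs = nullptr;
    cudaMalloc(&devBlockPtrs, numBlocks * sizeof(short *));
    cudaMemcpy(devBlockPtrs, blocks, numBlocks * sizeof(short *), cudaMemcpyHostToDevice);

    calcKernel<<<1, 128>>>(devBlockPtrs, numBlocks);

    // Host-side dereferencing of blocks[i] now faults until cudaStreamSynchronize is called.
}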

Watching one of these pointers through the debugger, it looks like
(short*) 0x0000302202 {94}
(i.e., valid pointer to shorts, with first value of 94.)

As soon as the kernel is invoked, the debugger display changes to
(short*) 0x0000302202 {???}
(i.e., an invalid pointer.)

De-referencing the pointer now results in an access violation.

Calling cudaStreamSynchronize restores the memory, and the pointer becomes dereferenceable again.

This happens even with an empty kernel.

(It also happens in both debug and optimized builds, with the VS debugger attached, with the CUDA debugger attached, or with no debugger attached.)

Why would invoking a kernel invalidate a unified memory block? Is Unified Memory actually intended to behave this way?

(Environment: Windows 10, CUDA 9.1, Visual Studio 2015, GeForce 1080 Ti)

Yes, once a kernel is invoked, all GPU arrays are owned by the kernel, and you can't access them from the CPU until you have synchronized to the end of kernel execution with cudaStreamSynchronize or similar. I think this is described in the CUDA manual.

This sounds like expected behavior. UM under CUDA 9.1 on Windows behaves in the "legacy" UM fashion.

A kernel launch will trigger a transfer of data from host to device, which will invalidate any usage of that pointer in host code until a cudaDeviceSynchronize is called. This is all spelled out in the UM section of the programming guide.

Any attempt to use the UM-allocated pointer after a kernel launch, but before a synchronize is done, will result in a seg fault.
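A minimal sketch of that legacy behavior (the names here are made up for illustration, not taken from your code):

#include <cuda_runtime.h>
#include <cstdio>

__global__ void emptyKernel() {}

int main()
{
    short *data = nullptr;
    cudaMallocManaged(&data, 1024 * sizeof(short));
    data[0] = 94;                  // host access is fine before any kernel launch

    emptyKernel<<<1, 1>>>();       // even an empty launch hands the managed allocation to the GPU

    // printf("%d\n", data[0]);    // with legacy UM (e.g. Windows / CUDA 9.1) this seg faults here

    cudaDeviceSynchronize();       // migrates the data back and re-enables host access
    printf("%d\n", (int)data[0]);  // safe again: prints 94
    return 0;
}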

Ok. (Any chance you can point me at the relevant section of the manual?)

Will queuing another kernel before calling cudaStreamSynchronize also result in a seg fault? Or is it only host-side access that results in a fault? (The application I'm working on needs to use the 'read-only' memory from multiple threads, and each thread needs to launch multiple kernels. It sounds like it's difficult/impossible to use unified memory in this scenario.)

  1. You should read the entire CUDA manual section about Unified Memory, in particular K1.3 and K2.2:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-unified-memory-programming-hd

You may also find https://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/ helpful, in particular the "Unified Memory or Unified Virtual Addressing?" part.

  2. Only host-side access is prohibited (see the sketch below).
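So a sequence like the following should be fine (a sketch; the kernel and variable names are hypothetical), as long as only device code touches the managed blocks between the launches:

#include <cuda_runtime.h>

__global__ void kernelA(short **blocks) { /* reads the managed blocks */ }
__global__ void kernelB(short **blocks) { /* reads the managed blocks */ }

void launchPhase(short **devBlockPtrs, short **hostBlocks, cudaStream_t stream)
{
    // Queuing several kernels before synchronizing is fine; the GPU may keep
    // "ownership" of the managed blocks across the whole sequence.
    kernelA<<<32, 128, 0, stream>>>(devBlockPtrs);
    kernelB<<<32, 128, 0, stream>>>(devBlockPtrs);

    // Only host-side access has to wait for the synchronize.
    cudaStreamSynchronize(stream);
    short firstValue = hostBlocks[0][0];   // safe again here
    (void)firstValue;
}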

After further investigation …

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-gpu-exclusive says 6.x GPUs support concurrent access, and that pre-6.x devices do not. The examples in that documentation section show this explicitly. The hardware I'm using (GeForce 1080 Ti) is a 6.x GPU, so I'd expect it to work.

However, http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements indicates that CUDA on Windows doesn’t expose that functionality. The concurrentManagedAccess property evaluates to 0, which seems to confirm this.
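For reference, this is roughly how the attribute can be queried (a minimal sketch):

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int concurrent = 0;
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, 0);
    printf("concurrentManagedAccess = %d\n", concurrent);
    // 1 on Linux with a Pascal-or-newer GPU; 0 on Windows with CUDA 9.1,
    // meaning the legacy ("pre-6.x") unified memory rules apply.
    return 0;
}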

So I think the question boils down to: when will the driver/CUDA on Windows catch up to the hardware and to the driver/CUDA on Linux? I'll probably post this as a new topic.

This has already been asked many times; NVIDIA's answer is that Microsoft doesn't cooperate with them to make the appropriate changes in the driver.

Questions about NVIDIA’s future plans are unlikely to be answered in this forum. You’re welcome to pose whatever questions you wish, of course; I just want to set expectations.

Thanks for the responses. We are shifting development for this to Linux for the time being, and will add Windows support when these features become available. Hopefully that is sooner rather than later.