Kernel invocation invalidates unified memory blocks

I have a scenario where invoking a kernel invalidates memory blocks until a sync (cudaStreamSynchronize) is called.

The memory blocks are allocated up front, using cudaMallocManaged, during a data population phase. After that they are never written to.

During a later calculation phase, an array is populated with pointers into these blocks. The array of pointers is copied to GPU memory (allocated with cudaMalloc) and passed to a kernel.
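For concreteness, this is roughly what the code does (a simplified sketch; the kernel and variable names here are made up and the sizes are illustrative):

#include <cuda_runtime.h>

__global__ void calcKernel(short **blocks, int numBlocks) { /* reads blocks[i][j] ... */ }

void runCalculation()
{
    const int numBlocks = 4;
    const size_t blockElems = 1024;

    // Data population phase: blocks allocated with unified memory, written once, never again.
    short *blocks[numBlocks];
    for (int i = 0; i < numBlocks; ++i) {
        cudaMallocManaged(&blocks[i], blockElems * sizeof(short));
        blocks[i][0] = 94;
    }

    // Calculation phase: copy the array of pointers to device memory and launch.
    short **devBlockPtrs = nullptr;
    cudaMalloc(&devBlockPtrs, numBlocks * sizeof(short *));
    cudaMemcpy(devBlockPtrs, blocks, numBlocks * sizeof(short *), cudaMemcpyHostToDevice);

    calcKernel<<<1, 128>>>(devBlockPtrs, numBlocks);

    // Host-side dereferencing of blocks[i] now faults until cudaStreamSynchronize is called.
}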

Watching one of these pointers through the debugger, it looks like
(short*) 0x0000302202 {94}
(i.e., valid pointer to shorts, with first value of 94.)

As soon as the kernel is invoked, the debugger display changes to
(short*) 0x0000302202 {???}
(i.e., an invalid pointer.)

De-referencing the pointer now results in an access violation.

Calling cudaStreamSynchronize restores the memory, and the pointer becomes dereferenceable again.

This happens even with an empty kernel.

(It also happens in both debug and optimized builds, with the VS debugger attached, with the CUDA debugger attached, or with no debugger attached.)

Why would invoking a kernel invalidate a unified memory block? Is Unified Memory actually intended to behave this way?

(Environment: Windows 10, CUDA 9.1, Visual Studio 2015, GeForce 1080 Ti)

Yes, once a kernel is invoked, all GPU arrays are owned by the kernel, and you can't access them from the CPU until you have synchronized to the end of kernel execution with cudaStreamSynchronize or similar. I think this is described in the CUDA manual.

This sounds like expected behavior. UM under CUDA 9.1 on Windows behaves in the "legacy" UM fashion.

A kernel launch will trigger a transfer of data from host to device, which will invalidate any usage of that pointer in host code until a cudaDeviceSynchronize is called. This is all spelled out in the UM section of the programming guide.

Any attempt to use the UM-allocated pointer after a kernel launch, but before a synchronize is done, will result in a seg fault.
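A minimal sketch of that legacy behavior (the names here are made up for illustration, not taken from your code):

#include <cuda_runtime.h>
#include <cstdio>

__global__ void emptyKernel() {}

int main()
{
    short *data = nullptr;
    cudaMallocManaged(&data, 1024 * sizeof(short));
    data[0] = 94;                  // host access is fine before any kernel launch

    emptyKernel<<<1, 1>>>();       // even an empty launch hands the managed allocation to the GPU

    // printf("%d\n", data[0]);    // with legacy UM (e.g. Windows / CUDA 9.1) this seg faults here

    cudaDeviceSynchronize();       // migrates the data back and re-enables host access
    printf("%d\n", (int)data[0]);  // safe again: prints 94
    return 0;
}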

Ok. (Any chance you can point me at the relevant section of the manual?)

Will queuing another kernel before calling cudaStreamSynchronize also result in a seg fault? Or is it only host-side access that results in a fault? (The application I'm working on needs to use the 'read-only' memory from multiple threads, and each thread needs to launch multiple kernels. It sounds like it's difficult/impossible to use unified memory in this scenario.)

  1. You should read the entire CUDA manual section about Unified Memory, in particular K1.3 and K2.2:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-unified-memory-programming-hd

You may also find https://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/ helpful, in particular the "Unified Memory or Unified Virtual Addressing?" part.

  2. Only host-side access is prohibited (see the sketch below).
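So a sequence like the following should be fine (a sketch; the kernel and variable names are hypothetical), as long as only device code touches the managed blocks between the launches:

#include <cuda_runtime.h>

__global__ void kernelA(short **blocks) { /* reads the managed blocks */ }
__global__ void kernelB(short **blocks) { /* reads the managed blocks */ }

void launchPhase(short **devBlockPtrs, short **hostBlocks, cudaStream_t stream)
{
    // Queuing several kernels before synchronizing is fine; the GPU may keep
    // "ownership" of the managed blocks across the whole sequence.
    kernelA<<<32, 128, 0, stream>>>(devBlockPtrs);
    kernelB<<<32, 128, 0, stream>>>(devBlockPtrs);

    // Only host-side access has to wait for the synchronize.
    cudaStreamSynchronize(stream);
    short firstValue = hostBlocks[0][0];   // safe again here
    (void)firstValue;
}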

After further investigation …

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-gpu-exclusive says 6.x GPUs support concurrent access, and that pre-6.x devices do not. The examples in that documentation section show this explicitly. The hardware I'm using (GeForce 1080 Ti) is a 6.x GPU, so I'd expect it to work.

However, http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements indicates that CUDA on Windows doesn’t expose that functionality. The concurrentManagedAccess property evaluates to 0, which seems to confirm this.
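For reference, this is roughly how the attribute can be queried (a minimal sketch):

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int concurrent = 0;
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, 0);
    printf("concurrentManagedAccess = %d\n", concurrent);
    // 1 on Linux with a Pascal-or-newer GPU; 0 on Windows with CUDA 9.1,
    // meaning the legacy ("pre-6.x") unified memory rules apply.
    return 0;
}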

So I think the question boils down to: when will the driver/CUDA on Windows catch up to the hardware and to the driver/CUDA on Linux? I'll probably post this as a new topic.

This has already been asked many times; NVIDIA's answer is that Microsoft doesn't cooperate with them to make the appropriate changes in the driver.

Questions about NVIDIA’s future plans are unlikely to be answered in this forum. You’re welcome to pose whatever questions you wish, of course; I just want to set expectations.

Thanks for the responses. We are shifting development for this to Linux for the time being, and will add Windows support when these features become available. Hopefully that is sooner rather than later.