Pascal & capabilities 6.0 show cudaDevAttrConcurrentManagedAccess is 0

robosmith · December 17, 2018, 8:48pm

P100 in a Windows 2012 server and cudaMemPrefetchAsync fails with InvalidDevice error.

Checking properties and cudaDevAttrConcurrentManagedAccess returns 0.

Unified Memory seems to auto copy to the GPU for blocks of 39MB, but fails for blocks of 198KB.

I was hoping to force the copy by doing the prefetch, but that completely fails.

There are 2 P100s in the system, but cudaMemPrefetchAsync still fails and shows what appears to be incorrect attributes for a P100 with the system env variable CUDA_VISIBLE_DEVICES set to 1.

It seems this has been a problem at least since CUDA SDK 8.0 and I’m using 9.1, but it’s still a problem.

What am I missing?

tera · December 17, 2018, 8:56pm

TCC or WDDM mode?

Robert_Crovella · December 17, 2018, 9:03pm

For CUDA 9.x or later, it doesn’t matter.

With recent CUDA versions (CUDA 9.x, CUDA 10.0), the behavior on the Windows operating system is as if it were a pre-Pascal regime. In this regime, concurrent managed access is indeed not possible.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements

“Applications running on Windows (whether in TCC or WDDM mode) or macOS will use the basic Unified Memory model as on pre-6.x architectures even when they are running on hardware with compute capability 6.x or higher.”

The behavior is expected. cudaMemPrefetchAsync also has no meaning in such a scenario and will return an error code.

I don’t know what this statement is referring to, so my comments don’t apply to that:

The general idea expressed here was already indicated to OP here:

https://devtalk.nvidia.com/default/topic/1029706/cuda-programming-and-performance/partial-fail-of-peer-access-in-8-volta-gpu-instance-p3-16xlarge-on-aws-gt-huge-slowdown-/post/5238143/#5238143

“This is particularly true in a windows regime under CUDA 9.0/9.1, where demand-paged managed memory is not available.”

That statement is still true, and will likely never change for CUDA 9.0, 9.1, 9.2, and 10.0, if history is any guide.

Memory hints, memory prefetching, demand-paging, and concurrent access are all examples of features related to demand-paged UM which are not available in the “pre-Pascal” regime, i.e. when the documentation specifically calls out “the basic Unified Memory model as on pre-6.x architectures”.
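A minimal sketch of the check described above: query `cudaDevAttrConcurrentManagedAccess` and only call `cudaMemPrefetchAsync` when it reports 1. The 198KB size comes from the original post; the buffer name and the messages are illustrative assumptions, and the four-argument prefetch call is the form used by CUDA 9.x (newer toolkits may use a different signature).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int device = 0;
    cudaGetDevice(&device);

    // Reports 1 only when the device and the OS/driver combination support
    // demand-paged Unified Memory; the thread reports 0 on Windows.
    int concurrent = 0;
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, device);
    printf("concurrentManagedAccess = %d\n", concurrent);

    const size_t bytes = 198 * 1024;  // block size taken from the thread
    char *buf = nullptr;              // illustrative buffer name
    cudaMallocManaged(&buf, bytes);

    if (concurrent) {
        // Four-argument form as in CUDA 9.x; assumed here, may differ in newer toolkits.
        cudaError_t err = cudaMemPrefetchAsync(buf, bytes, device, 0);
        printf("prefetch: %s\n", cudaGetErrorString(err));
        cudaDeviceSynchronize();
    } else {
        // Basic Unified Memory model: prefetching is not available.
        printf("basic Unified Memory model: skipping prefetch\n");
    }

    cudaFree(buf);
    return 0;
}
```

Guarding the call this way avoids the error reported in the opening post, but it does not make prefetching available on a platform where the attribute is 0.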

robosmith · December 17, 2018, 10:35pm

This application differs greatly from the zero-copy memory problem referenced, as it uses only one GPU and thus no peer to peer mem copies are used.

Perhaps I was mistaken to rely on what is clearly working for the large memory blocks referenced.

Memory is allocated on the CPU with cudaMallocManaged, and what is working for the large block referenced is auto migration of CPU written data to the GPU when the kernel accessing that memory ptr is called.

It just doesn’t work for the more fine grained accesses of the smaller memory blocks.

What is being attempted is writing a 198KB block from the CPU and reading that block on the GPU while the CPU writes the next block to buffer space after the first two.

This works when the blocks are 39MB halves of a single UM buffer.

Not sure if it’s the CPU write that fails, or the GPU read, but no exceptions are thrown; the data is just static.

robosmith · December 17, 2018, 10:40pm

I’m not sure which mode is running; I just assumed Tesla drivers always run TCC.

Robert_Crovella · December 17, 2018, 11:01pm

Under this regime, concurrent GPU and CPU access to a UM buffer is not supported and invokes UB, regardless of your observations.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-coherency-hd
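As a rough illustration of the access rule in the basic Unified Memory model, the sketch below has the CPU write a managed buffer, launch a kernel, and touch the buffer again only after `cudaDeviceSynchronize()`. The kernel `scale` and the sizes are hypothetical and not taken from the original poster's program.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: doubles every element in place.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 16;
    float *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));

    // CPU writes are allowed here: no kernel is in flight.
    for (int i = 0; i < n; ++i) data[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(data, n);

    // In the basic model the CPU must not read or write `data`
    // (or any other globally attached managed memory) until the GPU is idle.
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);
    cudaFree(data);
    return 0;
}
```

Touching `data` from the CPU between the launch and the synchronize call is the concurrent access that the linked section of the Programming Guide rules out on such systems.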

robosmith · December 18, 2018, 4:26am

Concurrent access is not the issue. I have no problem synchronizing the kernel before CPU accesses.

The problem is that page faulting does not block kernel access until data is auto migrated to the GPU when accessing small portions of UM.

What is UB?

njuffa · December 18, 2018, 4:35am

UB = undefined behavior, meaning pretty much anything can happen, including the behavior you observed.

tera · December 18, 2018, 4:43am

As Robert has pointed out above, the features you are taking for granted are not available under any existing CUDA release for Windows.

If your intention is to implement double-buffering using managed memory, look into using cudaStreamAttachMemAsync() and streams.

robosmith · December 18, 2018, 7:14pm

tera;

Per your cite: “The code runs successfully on devices of compute capability 6.x due to the GPU page faulting capability which lifts all restrictions on simultaneous access.”

I am not “taking for granted”; these features are specified for P100 GPUs in your cite.

I have also read passages in the docs which say that “coherency is GUARANTEED.”

Maybe inclusion of the caveat that NVidia considers Windows unworthy of this support was neglected.

Thanks for your suggestion, but the mere inclusion of cudaStreamAttachMemAsync() to associate the UM with the stream causes the large buffer coherency, which was working without it, to fail.

Just to be clear, are you saying that kernel blocking and auto-migration of data with UM is NOT supported under 64 bit Windows? Need Linux for that?

tera · December 18, 2018, 7:43pm

I am not trying to make any claims beyond what Robert has written, or what is stated in the Programming Guide. I just wanted to point out a possible way forward for you without dropping use of managed memory completely.

Once you use cudaStreamAttachMemAsync() you need to be careful about which streams use the attached memory, rather than relying on the safe but slow default of “copy all memory for any kernel”. One may think of the operation more as “detach from all other streams” than as attaching to the specific stream.
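A hedged sketch of the double-buffering idea with `cudaStreamAttachMemAsync()`: two separate managed allocations, each attached to its own stream, so the CPU can fill one buffer while a kernel in the other stream works on the other buffer. The kernel `process`, the chunk count, and the use of two allocations instead of two halves of one buffer are assumptions made for illustration, not the original poster's code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical consumer kernel: adds 1 to every element of one block.
__global__ void process(float *block, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) block[i] += 1.0f;
}

int main() {
    const int n = 198 * 1024 / sizeof(float);  // 198KB block, as in the thread
    const int chunks = 8;                      // hypothetical number of blocks
    float *buf[2];
    cudaStream_t stream[2];

    for (int b = 0; b < 2; ++b) {
        cudaMallocManaged(&buf[b], n * sizeof(float));
        cudaStreamCreate(&stream[b]);
        // Length 0 attaches the whole allocation to this one stream.
        cudaStreamAttachMemAsync(stream[b], buf[b], 0, cudaMemAttachSingle);
    }
    cudaDeviceSynchronize();  // wait until both attachments have taken effect

    for (int c = 0; c < chunks; ++c) {
        int b = c % 2;
        // Wait only for the kernel that last used this buffer.
        cudaStreamSynchronize(stream[b]);
        // CPU fills buf[b] while the other stream may still be running.
        for (int i = 0; i < n; ++i) buf[b][i] = (float)c;
        process<<<(n + 255) / 256, 256, 0, stream[b]>>>(buf[b], n);
    }
    cudaDeviceSynchronize();

    printf("last block, first element: %f\n", buf[(chunks - 1) % 2][0]);

    for (int b = 0; b < 2; ++b) {
        cudaStreamDestroy(stream[b]);
        cudaFree(buf[b]);
    }
    return 0;
}
```

Each buffer must only be used by kernels launched into the stream it is attached to; launching a kernel that touches `buf[0]` into `stream[1]` would break the association the sketch relies on.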

I apologize for the misleading “TCC or WDDM mode” question - I had misread your opening post.

njuffa · December 18, 2018, 7:49pm

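In #3 Robert Crovella already pointed to this statement in the Programming Guide:

“Applications running on Windows (whether in TCC or WDDM mode) or macOS will use the basic Unified Memory model as on pre-6.x architectures even when they are running on hardware with compute capability 6.x or higher.”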

I read this as a clear caveat “this is a Linux-only feature at this time”.

robosmith · December 18, 2018, 9:52pm

njuffa;

That caveat is not referenced in subsequent declarations wrt Unified Memory capabilities.

Nor does it (at least your quote) specify how “the basic UM model” is hobbled.

I did see a reference which specified “on supporting OS,” but even that is somewhat cryptic.

NVidia seems to be relying on a user’s ability to have read the complete specification and intuit exactly how every part is related. IMO, that is unrealistic.

Also, it would be far better if each section of the UM documentation had a Linux section and a Windows, MacOS, etc section that clearly delineates what is supported for each OS, since there is such a great schism between OS for UM.

njuffa · December 18, 2018, 9:58pm

If you find NVIDIA’s documentation unclear or incomplete, you could always file an enhancement request via the bug reporting form. Prefix the synopsis with “RFE:” to mark it as an enhancement request.

robosmith · December 19, 2018, 4:36pm

I only have one stream in addition to default, as this is a demo program.

I have tried to use cudaMemcpyAsync with the stream and ::cudaMemcpyHostToDevice to force the UM host copy, but that fails with InvalidValue error.

robosmith · December 27, 2018, 7:46pm

Turns out that GPU registering the CPU memory does exactly what I was trying to do with UM in Windows. It works better and is much faster than the (apparently lazy) UM auto-copy.

The kernel which took ~30ms (incl copy) with UM, takes < 5ms with the GPU registered buffer.

And unlike with UM, coherency is maintained for small partial buffer writes. :)
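The post does not say which call was used. One plausible reading, assumed here, is `cudaHostRegister()` with the `cudaHostRegisterMapped` flag, which pins an existing host allocation and lets a kernel access it through a device pointer. The sketch below shows that pattern; the kernel `scale` and the buffer names are hypothetical.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: doubles every element, reading and writing host memory directly.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const size_t bytes = 198 * 1024;  // block size taken from the thread
    const int n = (int)(bytes / sizeof(float));

    // Ordinary host allocation, then pinned and mapped for device access.
    float *host = (float *)malloc(bytes);
    if (host == nullptr) return 1;
    cudaError_t err = cudaHostRegister(host, bytes, cudaHostRegisterMapped);
    if (err != cudaSuccess) {
        printf("cudaHostRegister: %s\n", cudaGetErrorString(err));
        free(host);
        return 1;
    }

    // Device-side alias of the same memory.
    float *dev = nullptr;
    cudaHostGetDevicePointer((void **)&dev, host, 0);

    for (int i = 0; i < n; ++i) host[i] = 1.0f;  // CPU writes go straight to the pinned buffer

    scale<<<(n + 255) / 256, 256>>>(dev, n);
    cudaDeviceSynchronize();  // make the kernel's writes visible before the CPU reads

    printf("host[0] = %f\n", host[0]);

    cudaHostUnregister(host);
    free(host);
    return 0;
}
```

With this approach there is no migration step: the kernel reads the host buffer over the bus, which would be consistent with the small partial writes staying coherent, though the timings quoted above are the poster's own measurements.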
