Pascal & capabilities 6.0 show cudaDevAttrConcurrentManagedAccess is 0

P100 in a Windows 2012 server and cudaMemPrefetchAsync fails with InvalidDevice error.

Checking properties and cudaDevAttrConcurrentManagedAccess returns 0.

Unified Memory seems to auto copy to the GPU for blocks of 39MB, but fails for blocks of 198KB.

I was hoping to force the copy by doing the prefetch, but that completely fails.

There are 2 P100s in the system, but cudaMemPrefetchAsync still fails and shows what appears to be incorrect attributes for a P100 with the system env variable CUDA_VISIBLE_DEVICES set to 1.

It seems this has been a problem at least since CUDA SDK 8.0 and I’m using 9.1, but it’s still a problem.

What am I missing?

TCC or WDDM mode?

For CUDA 9.x or later, it doesn’t matter.

With recent (CUDA 9.x, CUDA 10.0) CUDA version, the behavior on the windows operating system is as if it were a pre-pascal regime. In this regime, conccurrent managed access is indeed not possible.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements

“Applications running on Windows (whether in TCC or WDDM mode) or macOS will use the basic Unified Memory model as on pre-6.x architectures even when they are running on hardware with compute capability 6.x or higher.”

The behavior is expected. cudaMemPrefetchAsync also has no meaning in such a scenario and will return an error code.

I don’t know what this statement is referring to, so my comments don’t apply to that:

The general idea expressed here was already indicated to OP here:

https://devtalk.nvidia.com/default/topic/1029706/cuda-programming-and-performance/partial-fail-of-peer-access-in-8-volta-gpu-instance-p3-16xlarge-on-aws-gt-huge-slowdown-/post/5238143/#5238143

“This is particularly true in a windows regime under CUDA 9.0/9.1, where demand-paged managed memory is not available.”

That statement is still true, and will likely never change for CUDA 9.0, 9.1, 9.2, and 10.0, if history is any guide.

memory hints, memory prefetching, demand-paging, concurrent access are all examples of features related to demand-paging UM which are not available in the “pre-pascal” regime, i.e. when the documentation specifically calls out “the basic Unified Memory model as on pre-6.x architectures”

This application differs greatly from the zero-copy memory problem referenced, as it uses only one GPU and thus no peer to peer mem copies are used.

Perhaps I was mistaken to rely on what is clearly working for the large memory blocks referenced.

Memory is allocated on the CPU with cudaMallocManaged, and what is working for the large block referenced is auto migration of CPU written data to the GPU when the kernel accessing that memory ptr is called.

It just doesn’t work for the more fine grained accesses of the smaller memory blocks.

What is being attempted is the writing a 198KB block from the CPU and reading that block on the GPU while the CPU writes the next block to a buffer space after the first 2.

This works when the blocks are 39MB halves of a single UM buffer.

Not sure if it’s the CPU write that fails, or the GPU read, but no exceptions are thrown; the data is just static.

I’m not sure which mode is running; I just assumed Tesla drivers always run TCC.

Under this regime, GPU and CPU concurrent access to a UM buffer is not supported and explores UB, regardless of your observations.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-coherency-hd

Concurrent access is not the issue. I have no problem synchronizing the kernel before CPU accesses.

The problem is that page faulting does not block kernel access until data is auto migrated to the GPU when accessing small portions of UM.

What is UB?

UB = undefined behavior, meaning pretty much anything can happen, including the behavior you observed.

As Robert has pointed out above, the features you are taking for granted are not available under any existing CUDA release for Windows.

If your intention is to implement double-buffering using managed memory, look into using cudaStreamAttachMemAsync() and streams.

tera;

Per your cite: “The code runs successfully on devices of compute capability 6.x due to the GPU page faulting capability which lifts all restrictions on simultaneous access.”

I am not “taking for granted,” these features are specified for P100 GPUs in your cite.

I have also read passages in the docs which say that “coherency is GUARANTEED.”

Maybe inclusion of the caveat that NVidia considers Windows unworthy of this support was neglected.

Thanks for your suggestion, but the mere inclusion of cudaStreamAttachMemAsync() to associate the UM with the stream causes the large buffer coherency, which was working without it, to fail.

Just to be clear, are you saying that kernel blocking and auto-migration of data with UM is NOT supported under 64 bit Windows? Need Linux for that?

I am not trying to make any claims beyond what Robert has written, or what is stated in the Programming Guide. I just wanted to point out a possible way forward for you without dropping use of managed memory completely.

Once you use cudaStreamAttachMemAsync() you need to be careful about which stream use the attached memory rather than relying of the safe, but slow default of “copy all memory for any kernel”. One may think of the operation more as “detach from all other streams” than attaching to the specific stream.

I apologize for the misleading “TCC or WDDM mode” question - I had misread your opening post.

In #3 Robert Crovella already pointed to this statement in the Programming Guide:

I read this as a clear caveat “this is a Linux-only feature at this time”.

njuffa;

That caveat is not referenced in subsequent declarations wrt Unified Memory capabilities.

Nor does it (at least your quote) specify how “the basic UM model” is hobbled.

I did see a reference which specified “on supporting OS,” but even that is somewhat cryptic.

NVidia seems to be relying on a users ability to have read the complete specification and intuit exactly how every part is related. IMO, that is unrealistic.

Also, it would be far better if each section of the UM documentation had a Linux section and a Windows, MacOS, etc section that clearly delineates what is supported for each OS, since there is such a great schism between OS for UM.

If you find NVIDIA’s documentation unclear or incomplete, you could always file an enhancement request via the bug reporting form. Prefix the synopsis with “RFE:” to mark it as an enhancement request.

I only have one stream in addition to default, as this is a demo program.

I have tried to use cudaMemcpyAsync with the stream and ::cudaMemcpyHostToDevice to force the UM host copy, but that fails with InvalidValue error.

Turns out that GPU registering the CPU memory does exactly what I was trying to do with UM in Windows. It works better and is much faster than the (apparently lazy) UM auto-copy.

The kernel which took ~30ms (incl copy) with UM, takes < 5ms with the GPU registered buffer.

And unlike with UM, coherency is maintained for small partial buffer writes. :)