Unified Memory for CUDA Beginners

Originally published at: https://developer.nvidia.com/blog/unified-memory-cuda-beginners/

My previous introductory post, “An Even Easier Introduction to CUDA C++“, introduced the basics of CUDA programming by showing how to write a simple program that allocated two arrays of numbers in memory accessible to the GPU and then added them together on the GPU. To do this, I introduced you to Unified Memory, which…

What happens if I call cudaMemPrefetchAsync() on an array that is too big to fit in one piece on the GPU? This might occur when oversubscribing.

Hallo Mark Harris, please excuse my english i'm not a native speaker.

When i understand your footnote 3) correctly. Then the unifiy memory feature with paging is not available under Windows. It is the first time i read this restriction. We checked every documentation online there whre nothing to find about a limitation to Linux 64 bit.

I'm now really diasapointed becaus since Pascal was available we buyed for our work 4 Pascal cards especially beacuse of this feature and have registrated a fallback to zero copy memory using Pascal Cards with unified memory. We thought it would be a bug, like reported in the NVIDIA Developer Forum for example the last report: https://devtalk.nvidia.com/...

We can't switch to Linux because we use special depth-image camera with only Windows driver available.

So should we switch back to Cuda 7.5 where Pascal works like Maxwell missing the other Cuda 8.0 features or should we sell the Pascal cards and buy old maxwell cards? Will this be fixed with cuda 9 or will be there possiblie a fix in the next months for cuda 8.0? Or is there at least any possibility to fall back to normal unified memory behaviour without paging (would at least help for low resolutions to develop our framework in the hope that paging will work some day)?

Would be nice if you could help use we really like and would need the unified programming and the paging feature.

In this case only the last part of the array (as much as will fit) ) will end up in the GPU memory and the rest will be resident in the system memory. Currently the driver does not know exactly how much memory is available for eviction on the destination processor and therefore some pages could be fetched and then evicted during the prefetching operation. If this is a major performance issue for your application, please let us know more about your use case and we can follow up to improve the driver’s migration policy for oversubscription cases.

I don't have a problem *yet*. As long as the behaviour is predictable, we should be able to write our kernels such that performance degrades gracefully in the presence of oversubscribing. However, I'm worried about what happens if oversubscription always leaves the *last* part of the array on device memory, if thread block scheduling tends to execute the *first* thread blocks first (assuming not all thread blocks fit simultaneously). In the common case, where the first thread blocks read the first parts of the array, this would result in many page faults.

It would probably help if there were some common behaviour between oversubscription and thread block scheduling.

Fallback to system memory on Pascal under windows is due to a bug in CUDA 8 that should be fixed in the CUDA 9 release. So no, you should not downgrade to Maxwell. Windows support is tricky -- it requires support from Microsoft, so communicating your needs to them will help the cause.

Wait seriously? How long has this bug been known? We bought a ton of servers with Pascal cards about a year ago. It took many months before we finally realized that the problem was most likely with the cards themselves, and the only solution was to use unmanaged memory. To make this work, I had to abandon a huge amount of progress that would have allowed us to iterate and experiment *very* easily, and spend a huge amount of time just making single-use code for each different scenario we needed to run as a workaround, meanwhile our tiny cash-strapped company had to pay tens of thousands of dollars in penalties while we struggled to get something usable.

Maybe this issue should have been communicated somewhere among all of the hype about how much Pascal was supposed to make things *better*. And by "this issue" I mean "Pascal can't use managed memory on Windows". Seems to be worth noting.

Sorry, I'm very frustrated with Pascal, and it's probably coming out pretty strongly.

nvidia have a habit of releasing features half baked. it is very frustrating and causes distrust towards investment. Nvidia engineers should stand up to their marketing department for the sake of the long term success of their company.

To answer your question, David, about how long this bug has been known, i can tell you that I reported it to NVIDIA at 25.01.2017. With updates to it 3.2.2017 and 15.03.2017. In my last updated I was also very harsh and threatened to change to AMD and give the 1080 to an other group in our departmend which only works with LINUX. Now the windows peding frameworks of our group a redesigned to AMD. And for furtther PASCAL cards there is an order restriction till this bug is solved. And we will never again let our frameworks be depended to NVIDIA or AMD only.

At the end for use it would be enough if a pascal card would till end of this year behave under CUDA 8.0 like under 7.5 without the paging feature. Or if it would be commuicated and we wouldn't order PASCAL cards and give our MAXWELL cards to an other group. At least we could get some back.

I'm sorry that i don't reported the bug directly as Cuda 8 was available and the bug hited me at the begin of October 2016. And that i didn't posted it every where in the world and sended the PASCAL cards back. We had a lot of work and concentrated on somthing else because the performance in this moment was not so important and we still had enough MAXWELL cards. And I really belived that me and my colleagues made somthing wrong after we couldn't find any answer in any forum.

What are the plans for supporting Concurrent Managed Access for Pascal GPUs (I recently invested in a Titan Xp) on Windows 10?

Hi there, I tried Cuda 9 RC on Windows, but still not able to use cudaMemPrefetchAsync since Concurrent Managed Access is reporting as false (1080Ti)...
so, is this bug fixed?

I noticed the same behaviour on my GTX 1080 Ti also using the saxpy example and kept tinkering before finding this post. I noticed that, with 1M elements as in the original saxpy post, my bandwidth never went beyond 60GB/s, quite a bit shy of the 484GB/s that the 1080 Ti is nominally capable of. Once I increased the size of the arrays to 1G elements, I achieved a much more respectable 390GB/s without making any changes to the code. So, two questions: (1) Given that the saxpy code does not use cudaMallocManaged, it seems that the unified memory model also affects traditional code that uses cudaMemcpy. Can you elaborate on why that is? (2) Why does increasing the input size lead to reasonable bandwidth even though the page misses should still be there? Does some hardware prefetching come into play? If so, why doesn't it kick in for smaller but still substantial sizes such as 1M elements? Thanks.

Thanks for the tutorial. I am curious about one thing: pointer arithematic of unified memory. After "cudaMallocManaged(&ptr, 1024x1024x1024)", if I call "ptr += 1024" to increase the pointer, will the ptr still be recognized as a unified memory?

Dear sr. would you mind to tell me your OS and if your GTX 1080ti shows cudaDevAttrConcurrentManagedAccess = 1?


Mark Harris, thank you for this tutorial series. I've tested your code in my laptop with nVidia MX150 videocard (under Linux), which has a Pascal architecture, and here are the results:
add_cuda.cu - 183.55ms
add_block.cu - 7.9120ms
add_grid.cu - 6.0973ms
add_grid_pascal.cu - 289.74us
Could you please comment on the differences compared to your laptop? Maybe there are some other tricks to improve the end-result? Honestly, I was hoping for a higher speed-up. Thank you in advance!

I'm not sure what the difference between add_grid and add_grid_pascal is. However, your add_grid_pascal is achieving about 43 GB/s. I looked up the MX150 and peak bandwidth is about 48 GB/s so you are getting 90% of peak. My Pascal tests were not on a laptop, they used a server with a Tesla P100 which has MUCH higher bandwidth.

Thanks for your prompt and detailed reply Mark. The file add_grid_pascal.cu is the same add_grid.cu with just your 3rd solution of Prefetching applied. Regarding to getting the 90% of peak of my video-card, is it somewhat close to optimal or it's supposed to be much higher/lower? I've looked for the answer in your page about "How to Implement Performance Metrics in CUDA C/C++" (https://devblogs.nvidia.com... but there's no estimate about how close those numbers should or may be. There, in your example you have a ratio of 110.37/148 for the effective and theoretical bandwidths respectively, which is 74.57% only. I suppose this efficiency may also be affected by some physical factors varying from hardware to hardware of the same model (kind of a luck?). Any comment about this aspect would be highly appreciated.

The post you refer to hasn't been updated for more recent hardware. NVIDIA GPUs have improved the achievable utilization of bandwidth over time. I would say 90% of peak is pretty good! And yes, it's also affected by other factors, most importantly software / implementation factors such as memory access pattern and relative overhead compared to the runtime of the kernel. E.g. you will measure lower bandwidth for this kernel if you reduce N by a lot (due to kernel launch and other overheads) and you may see even higher bandwidth if you increase N.

what is the difference between cudaMemcpy() & cudaPrefetchasync() ?
I dont quite understand

cudaMemcpy() copies memory from one allocated memory region to another, just like regular C memcpy(). cudaPrefetchAsync causes pages of an allocated memory region to be prefetched (in other words, populated if necessary and migrated) to the specified destination device. So for example if you allocate managed memory and initialize it on one device, but you know it will be accessed on another device later, you can prefetch it there using this method (optionally overlapping the prefetch with other CUDA work on a different stream, hence "async"). See https://docs.nvidia.com/cud...