Methods to allocate 3D Unified Memory


Being fairly new to CUDA programming, I could not find any specific function to allocate 3D memory that is Unified Memory. For example, the runtime API has functions such as cudaMalloc() and cudaMalloc3D(), but for Unified Memory there is only cudaMallocManaged() and no cudaMalloc3DManaged().

I want to use Unified Memory in a 3D manner. So, what would be some ways to allocate 3D Unified Memory?



The 2D and 3D APIs involve pitched allocations. For ordinary CUDA programming, these aren’t very useful (anymore), and the texture use cases can’t use a managed underlying allocation anyway.

Just use ordinary cudaMallocManaged() and handle the 3D access aspects yourself, without using pitched allocation/access.
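A minimal sketch of that approach might look like the following (the `idx3d` helper and `fill` kernel are illustrative names, not part of any CUDA API): one contiguous managed allocation, with the 3D-to-linear index computed manually.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Flatten an (x, y, z) coordinate into a linear offset for a contiguous
// allocation of dimensions nx * ny * nz, with x the fastest-varying axis.
__host__ __device__ inline size_t idx3d(size_t x, size_t y, size_t z,
                                        size_t nx, size_t ny) {
  return (z * ny + y) * nx + x;
}

__global__ void fill(float *a, size_t nx, size_t ny, size_t nz) {
  size_t x = blockIdx.x * blockDim.x + threadIdx.x;
  size_t y = blockIdx.y * blockDim.y + threadIdx.y;
  size_t z = blockIdx.z * blockDim.z + threadIdx.z;
  if (x < nx && y < ny && z < nz)
    a[idx3d(x, y, z, nx, ny)] = (float)(x + y + z);
}

int main() {
  const size_t nx = 64, ny = 32, nz = 16;
  float *a = nullptr;
  // One contiguous managed allocation; no pitch, no cudaPitchedPtr.
  cudaMallocManaged(&a, nx * ny * nz * sizeof(float));
  dim3 block(8, 8, 4);
  dim3 grid((nx + block.x - 1) / block.x,
            (ny + block.y - 1) / block.y,
            (nz + block.z - 1) / block.z);
  fill<<<grid, block>>>(a, nx, ny, nz);
  cudaDeviceSynchronize();
  // Host code can read the managed allocation directly.
  printf("a(3,2,1) = %f\n", a[idx3d(3, 2, 1, nx, ny)]);
  cudaFree(a);
  return 0;
}
```

Because the allocation is contiguous, there is no pitch bookkeeping anywhere, and the same `idx3d` arithmetic works identically on host and device.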

Alright, thanks!

But just to make sure:

  1. There is no existing CUDA function that will allow me to allocate a 3D structure with Unified Memory? If not, why is there none?

  2. Also, why are the pitched-allocation 2D and 3D APIs for unmanaged memory not useful anymore?


In other words, texturing requires that the data be resident in the GPU’s physical memory, now.

To be clear, I did not say they are “not useful anymore”. I said:

these aren’t very useful (anymore)

This is my opinion:

cudaMallocPitch, cudaMalloc3D, cudaMemcpy2D and cudaMemcpy3D are designed to handle pitched data.

A principal use case for pitched data is to ensure that each row begins on an aligned boundary. In particular, ideally we would want each row of the allocation to begin on a DRAM segment boundary.
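To illustrate what pitched handling looks like in practice, here is a sketch using cudaMalloc3D (the `fill` kernel name and dimensions are illustrative). Note the byte-level pointer arithmetic every access requires, compared with a simple linear index into a contiguous allocation:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// With a pitched allocation, each row starts at a pitch-aligned byte
// offset, so addressing element (x, y, z) requires computing byte
// offsets from the pitch rather than a plain linear index.
__global__ void fill(cudaPitchedPtr vol, size_t nx, size_t ny, size_t nz) {
  size_t x = blockIdx.x * blockDim.x + threadIdx.x;
  size_t y = blockIdx.y * blockDim.y + threadIdx.y;
  size_t z = blockIdx.z * blockDim.z + threadIdx.z;
  if (x < nx && y < ny && z < nz) {
    char *slice = (char *)vol.ptr + z * vol.pitch * ny;  // start of slice z
    float *row  = (float *)(slice + y * vol.pitch);      // start of row y
    row[x] = (float)(x + y + z);
  }
}

int main() {
  const size_t nx = 64, ny = 32, nz = 16;
  // Extent width is given in BYTES, height and depth in elements.
  cudaExtent extent = make_cudaExtent(nx * sizeof(float), ny, nz);
  cudaPitchedPtr vol;
  cudaMalloc3D(&vol, extent);  // vol.pitch may exceed nx * sizeof(float)
  dim3 block(8, 8, 4);
  dim3 grid((nx + block.x - 1) / block.x,
            (ny + block.y - 1) / block.y,
            (nz + block.z - 1) / block.z);
  fill<<<grid, block>>>(vol, nx, ny, nz);
  cudaDeviceSynchronize();
  printf("pitch = %zu bytes (logical row width = %zu bytes)\n",
         vol.pitch, nx * sizeof(float));
  cudaFree(vol.ptr);
  return 0;
}
```

The gap between `vol.pitch` and the logical row width is the "wasted allocated space" mentioned above; it is the price paid so that every row starts on an aligned boundary.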

This could be a significant optimization for the very first GPU architecture (named Tesla - not to be confused with the Tesla brand), consisting of GPUs like G80, GTX280, Tesla M1060, and others. This optimization typically allowed row-oriented data processing to make efficient use of memory (at the expense of some wasted allocated space).

GPUs of the Fermi architecture (cc2.0) and newer all implement various caches, including the L2 cache, which is active on all GPUs of cc2.0 and greater. On a non-cached GPU, a warp-level access that doesn’t begin on a DRAM segment boundary could significantly reduce processing efficiency, due to relatively poorer use of the data retrieved from DRAM. With the caches, however, and with contiguous (i.e. non-pitched) allocations/access, the caches tend to “fix” the efficiency loss associated with this sort of access, especially in a bulk-access case where many threads/warps are accessing adjacent (but possibly unpitched) data.

As a result, in my experience, pitched allocations and access on these newer GPUs typically provide only a small benefit, no benefit, or a small reduction in performance.

Given that pitched allocations require special handling, extra code, and extra calculations on the GPU, I generally would not use them. You’re welcome to use them if you wish. You may find situations where they are still quite beneficial. I have not run across those (excepting their use in texturing of course).

Regardless of the validity of any of the above statements, there are no managed APIs for pitched allocations, or anything like cudaMalloc3D.