What is the stream-ordered equivalent of cudaMallocPitch?

It seems rather odd that the Runtime API doesn’t provide stream-ordered variants of any memory allocation functions besides cudaMalloc. Perhaps I am missing something?

In order to better understand the behavior of cudaMallocPitch, I wrote a program that makes thousands of calls to the function with randomly generated widths and heights, up to a maximum allocation size of ~8 GB. I quickly discovered that the pitch was always equal to the width rounded up to the nearest multiple of 512. I wrote the following function to mimic this behavior:

template<class T>
static __inline__ __host__ cudaError_t cudaMallocPitchAsync(T **ptr, size_t *pitch, size_t width, size_t height,
                                                            cudaStream_t stream) {
    // Round width (in bytes) up to the nearest multiple of 512,
    // equivalent to (((width - 1) >> 9) + 1) << 9
    *pitch = ((width - 1) / 512 + 1) * 512;
    return cudaMallocAsync(ptr, *pitch * height, stream);
}
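For completeness, here is a sketch of how I would use the helper; the stream handling, image dimensions, and row-indexing pattern are only illustrative assumptions:

// Hypothetical usage of the helper above: a pitched float image,
// allocated and freed in stream order.
float *img = nullptr;
size_t pitch = 0;
cudaStream_t stream;
cudaStreamCreate(&stream);

cudaMallocPitchAsync(&img, &pitch, 1024 * sizeof(float), 768, stream);
// ... launch kernels on `stream` that index row y as
// (float *)((char *)img + y * pitch) ...
cudaFreeAsync(img, stream);
cudaStreamSynchronize(stream);
cudaStreamDestroy(stream);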

However, this implementation is obviously flawed: I have no way to know that the “nearest multiple of 512” rule holds true for GPUs other than my own, and there is certainly no guarantee that the Driver API won’t completely change the behavior of the underlying cuMemAllocPitch function in the future.

I am guessing that the cudaDevAttrTextureAlignment attribute dictates the alignment requirement of cudaMallocPitch (not cudaDevAttrTexturePitchAlignment, surprisingly, although I suspect I misunderstand the intended meaning of “pitch alignment” in this context). But reading this attribute at runtime still does not solve the second problem: if the logic behind cuMemAllocPitch is ever improved, user-defined reimplementations such as mine will be left behind.
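For reference, querying the attribute at runtime would look roughly like this; the helper name is hypothetical, and whether cudaDevAttrTextureAlignment is actually the attribute that governs cudaMallocPitch is, again, only my guess:

// Variant of the helper that queries the alignment at run time instead of
// hard-coding 512. Whether cudaDevAttrTextureAlignment is the attribute
// that actually governs cudaMallocPitch is an assumption on my part.
template<class T>
static __inline__ __host__ cudaError_t myMallocPitchAsync(T **ptr, size_t *pitch, size_t width, size_t height,
                                                          cudaStream_t stream) {
    int device = 0;
    cudaError_t err = cudaGetDevice(&device);
    if (err != cudaSuccess) return err;

    int alignment = 512; // fallback, overwritten by the attribute query below
    err = cudaDeviceGetAttribute(&alignment, cudaDevAttrTextureAlignment, device);
    if (err != cudaSuccess) return err;

    // Round width (in bytes) up to the nearest multiple of the queried alignment.
    *pitch = ((width + alignment - 1) / alignment) * alignment;
    return cudaMallocAsync((void **)ptr, *pitch * height, stream);
}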

The reasoning behind 512 is simple.

  • For best performance, warps have to perform coalesced memory accesses.
  • Threads can read 16-byte words in a single instruction if the address is 16-byte aligned (e.g. loading an int4).

Each warp can therefore theoretically access 32 * 16 = 512 bytes in one instruction. The pitch is chosen as a multiple of 512 so that every row of pitched memory can be accessed in this manner.
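As a hypothetical illustration (the kernel and its parameter names are mine, not from any CUDA sample), reading a pitched row with int4 loads looks like this; because the pitch is a multiple of 512 = 32 * 16 bytes, every row starts at a suitably aligned address and each warp's loads coalesce:

// Illustrative kernel: each thread loads one 16-byte int4 from its row.
// Since pitch is a multiple of 512 (32 threads * 16 bytes), the addresses
// touched by a warp within a row are aligned and fully coalesced.
__global__ void readRows(const char *base, size_t pitch, size_t widthInt4, size_t height, int4 *out) {
    size_t x = blockIdx.x * blockDim.x + threadIdx.x; // int4 index within the row
    size_t y = blockIdx.y;                            // row index
    if (x < widthInt4 && y < height) {
        const int4 *row = reinterpret_cast<const int4 *>(base + y * pitch);
        out[y * widthInt4 + x] = row[x];
    }
}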

I sometimes see questions like “why is there no managed pitched allocator?” or “how do I handle a pitched allocation in thrust?” If you lump all this together, my own personal conjecture is that the CUDA API developers no longer consider pitched allocations as useful/valuable/important as they once were.

From a technical perspective, I can certainly see that pitched allocations would have been noticeably important on cc1.x GPUs, which lacked the cache structure of later GPUs. Those GPUs died out circa 2016. From my own personal perspective, in the cases I have come across, the effort associated with pitched allocations is no longer worth the presumed benefits. You may have a different view, and you can express it if you wish by filing a bug suggesting the improvement to the CUDA runtime API that you would like to see.

There’s probably some connection with textures as well. In most cases I would prefer to use the mechanisms that don’t involve textures first, turning to textures as a last resort to try to improve performance. Again, just my own opinions and conjecture, but I sometimes wonder whether these ideas factor into where effort is invested in API development.
