What is the stream-ordered equivalent of cudaMallocPitch?

It seems rather odd that the Runtime API doesn’t provide stream-ordered variants of any memory allocation functions besides cudaMalloc. Perhaps I am missing something?

In order to better understand the behavior of cudaMallocPitch, I wrote a program that makes thousands of calls to the function with randomly generated widths and heights, up to a maximum allocation size of ~8 GB. I quickly discovered that the pitch was always equal to the width rounded up to the nearest multiple of 512. I wrote the following function to mimic this behavior:

template<class T>
static __inline__ __host__ cudaError_t cudaMallocPitchAsync(T **ptr, size_t *pitch, size_t width, size_t height,
                                                            cudaStream_t stream) {
    // Round width (in bytes) up to the nearest multiple of 512,
    // equivalent to (((width - 1) >> 9) + 1) << 9
    *pitch = ((width - 1) / 512 + 1) * 512;
    return cudaMallocAsync(ptr, *pitch * height, stream);
}
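For completeness, here is a sketch of how I would use the helper; the stream handling, image dimensions, and row-indexing pattern are only illustrative assumptions:

// Hypothetical usage of the helper above: a pitched float image,
// allocated and freed in stream order.
float *img = nullptr;
size_t pitch = 0;
cudaStream_t stream;
cudaStreamCreate(&stream);

cudaMallocPitchAsync(&img, &pitch, 1024 * sizeof(float), 768, stream);
// ... launch kernels on `stream` that index row y as
// (float *)((char *)img + y * pitch) ...
cudaFreeAsync(img, stream);
cudaStreamSynchronize(stream);
cudaStreamDestroy(stream);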

However, this implementation is obviously flawed: I have no way to know that the “nearest multiple of 512” rule holds true for GPUs other than my own, and there is certainly no guarantee that the Driver API won’t completely change the behavior of the underlying cuMemAllocPitch function in the future.

I am guessing that the cudaDevAttrTextureAlignment attribute dictates the alignment requirement of cudaMallocPitch (not cudaDevAttrTexturePitchAlignment, surprisingly, although I suspect I misunderstand the intended meaning of “pitch alignment” in this context). But reading this attribute at runtime still does not solve the second problem: if the logic behind cuMemAllocPitch is ever improved, user-defined reimplementations such as mine will be left behind.
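For reference, querying the attribute at runtime would look roughly like this; the helper name is hypothetical, and whether cudaDevAttrTextureAlignment is actually the attribute that governs cudaMallocPitch is, again, only my guess:

// Variant of the helper that queries the alignment at run time instead of
// hard-coding 512. Whether cudaDevAttrTextureAlignment is the attribute
// that actually governs cudaMallocPitch is an assumption on my part.
template<class T>
static __inline__ __host__ cudaError_t myMallocPitchAsync(T **ptr, size_t *pitch, size_t width, size_t height,
                                                          cudaStream_t stream) {
    int device = 0;
    cudaError_t err = cudaGetDevice(&device);
    if (err != cudaSuccess) return err;

    int alignment = 512; // fallback, overwritten by the attribute query below
    err = cudaDeviceGetAttribute(&alignment, cudaDevAttrTextureAlignment, device);
    if (err != cudaSuccess) return err;

    // Round width (in bytes) up to the nearest multiple of the queried alignment.
    *pitch = ((width + alignment - 1) / alignment) * alignment;
    return cudaMallocAsync((void **)ptr, *pitch * height, stream);
}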

The reasoning behind 512 is simple.

  • For best performance, warps have to perform coalesced memory accesses.
  • Threads can read 16-byte words in a single instruction if the address is 16-byte aligned (e.g. loading an int4).

Each warp can therefore theoretically access 32 * 16 = 512 bytes in one instruction. The pitch is chosen as a multiple of 512 so that every row of pitched memory can be accessed in this manner.
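As a hypothetical illustration (the kernel and its parameter names are mine, not from any CUDA sample), reading a pitched row with int4 loads looks like this; because the pitch is a multiple of 512 = 32 * 16 bytes, every row starts at a suitably aligned address and each warp's loads coalesce:

// Illustrative kernel: each thread loads one 16-byte int4 from its row.
// Since pitch is a multiple of 512 (32 threads * 16 bytes), the addresses
// touched by a warp within a row are aligned and fully coalesced.
__global__ void readRows(const char *base, size_t pitch, size_t widthInt4, size_t height, int4 *out) {
    size_t x = blockIdx.x * blockDim.x + threadIdx.x; // int4 index within the row
    size_t y = blockIdx.y;                            // row index
    if (x < widthInt4 && y < height) {
        const int4 *row = reinterpret_cast<const int4 *>(base + y * pitch);
        out[y * widthInt4 + x] = row[x];
    }
}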

I sometimes see questions like “why is there no managed pitched allocator?” or “how do I handle a pitched allocation in thrust?” If you lump all this together, my own personal conjecture is that the CUDA API developers no longer consider pitched allocations as useful/valuable/important as they once were.

From a technical perspective, I can certainly see that pitched allocations would have been noticeably important on cc1.x GPUs, which lacked the cache structure of later GPUs. Those GPUs died out circa 2016. From my own personal perspective, in the cases I have come across, the effort associated with pitched allocations is no longer worth the presumed benefits. You may have a different view, and you can express it if you wish by filing a bug suggesting the improvement to the CUDA runtime API that you would like to see.

There’s probably some connection with textures as well. In most cases I would prefer to use the mechanisms that don’t involve textures first, turning to textures as a last resort to try to improve performance. Again, just my own opinions and conjecture, but I sometimes wonder whether these ideas factor into where effort is invested in API development.
