It seems rather odd that the Runtime API doesn’t provide stream-ordered variants of any memory allocation functions besides cudaMalloc. Perhaps I am missing something?
In order to better understand the behavior of cudaMallocPitch, I wrote a program that makes thousands of calls to the function with randomly generated widths and heights, up to a maximum allocation size of ~8 GB. I quickly discovered that the pitch was always equal to the width rounded up to the nearest multiple of 512. I wrote the following function to mimic this behavior:
template<class T>
static __inline__ __host__ cudaError_t cudaMallocPitchAsync(T **ptr, size_t *pitch, size_t width,
                                                            size_t height, cudaStream_t stream) {
    // Round width up to the next multiple of 512 (assumes width > 0).
    // Equivalent to (((width - 1) >> 9) + 1) << 9.
    *pitch = ((width - 1) / 512 + 1) * 512;
    return cudaMallocAsync(ptr, *pitch * height, stream);
}
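For what it’s worth, this is roughly how I call it (the names and dimensions below are just placeholders for illustration):

// Illustrative usage of the helper above.
float *d_img = nullptr;
size_t pitch = 0;
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaError_t err = cudaMallocPitchAsync(&d_img, &pitch, 1000 * sizeof(float), 768, stream);
if (err == cudaSuccess) {
    // Row i of the 2D allocation starts at (char *)d_img + i * pitch.
    cudaFreeAsync(d_img, stream);
}
cudaStreamDestroy(stream);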
However, this implementation is obviously flawed: I have no way to know that the “nearest multiple of 512” rule holds on GPUs other than my own, and there is certainly no guarantee that the Driver API won’t completely change the behavior of the underlying cuMemAllocPitch function in the future.
I am guessing that the cudaDevAttrTextureAlignment attribute dictates the alignment requirement of cudaMallocPitch, not cudaDevAttrTexturePitchAlignment, surprisingly (although I suspect I misunderstand the intended meaning of “pitch alignment” in this context). But reading this attribute at runtime still does not solve the second problem: if the logic behind cuMemAllocPitch is ever improved, user-defined reimplementations such as mine will be left behind.
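For reference, a version that queries the alignment at runtime instead of hardcoding 512 might look like the sketch below. To be clear, whether cudaDevAttrTextureAlignment is really the attribute that cudaMallocPitch honors is my assumption, not something the documentation guarantees:

// Sketch: derive the pitch from cudaDevAttrTextureAlignment instead of a
// hardcoded 512. The choice of attribute is an assumption on my part.
template<class T>
static __inline__ __host__ cudaError_t cudaMallocPitchAsync2(T **ptr, size_t *pitch, size_t width,
                                                             size_t height, cudaStream_t stream) {
    int device = 0, align = 0;
    cudaError_t err = cudaGetDevice(&device);
    if (err != cudaSuccess) return err;
    err = cudaDeviceGetAttribute(&align, cudaDevAttrTextureAlignment, device);
    if (err != cudaSuccess) return err;
    // Round width up to the next multiple of the queried alignment.
    *pitch = ((width + (size_t)align - 1) / (size_t)align) * (size_t)align;
    return cudaMallocAsync(ptr, *pitch * height, stream);
}

This at least removes the per-GPU magic number, but it still bakes in today’s rounding rule, so it doesn’t address the second concern.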