NPP image pitch padding


Although I haven’t fixed the problem from my previous post, I’m making progress, and this question is certainly related. Hopefully someone can help…

Basically when I allocate an image on the GPU, using this call:


/**
 * 8bit unsigned, single-channel 2D (image) memory allocator.
 * \param nWidthPixels The width of the 2D array (image) to be allocated.
 * \param nHeightPixels The height of the 2D array (image) to be allocated.
 * \param pStepBytes The number of bytes between successive rows of pixels is returned via this pointer to int.
 * \return A pointer to the new 2D array (image). 0 (null-pointer) indicates that an error occurred
 *         during allocation.
 */
Npp8u * nppiMalloc_8u_C1(int nWidthPixels, int nHeightPixels, int * pStepBytes);

With a width of 1024 and a height of 768, pStepBytes comes back as 1088. Why is each line of my image being padded with 64 bytes (which breaks the way my algorithms address pixels)? Under what circumstances is the pitch not (width * channels)?

Thanks in advance,


Padding with 64 bytes from 1024 to 1088 is somewhat strange, since your rows are already a multiple of 64 bytes.

In general your data should always be padded to a multiple of 64 bytes to achieve efficient memory transfers. This means every new pixel line should start at a multiple of 64 bytes. In that case each half-warp that accesses contiguous memory locations needs only one memory transfer.

If your pixel lines are not 64-byte aligned, your memory accesses would need at least 2 memory transfers from the second line of pixels onward.
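To make the arithmetic concrete, here is a minimal host-side sketch of rounding a row width up to a 64-byte multiple. The helper names `alignUp` and `rowPadding` are hypothetical (not part of NPP or CUDA); it only illustrates why a 1024-byte row needs no padding to reach a 64-byte multiple, and that the reported step of 1088 corresponds to one extra 64-byte block.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: round `bytes` up to the next multiple of `alignment`.
// `alignment` must be greater than zero.
constexpr std::size_t alignUp(std::size_t bytes, std::size_t alignment) {
    return (bytes + alignment - 1) / alignment * alignment;
}

// Hypothetical helper: number of padding bytes at the end of each row,
// given the used row width and the row stride (both in bytes).
constexpr std::size_t rowPadding(std::size_t widthBytes, std::size_t stepBytes) {
    return stepBytes - widthBytes;
}
```

For example, `alignUp(1000, 64)` gives 1024 and `alignUp(1024, 64)` stays at 1024, while `rowPadding(1024, 1088)` is the 64 bytes reported in the question.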

I don’t really know why it is padded in the case of 1024 pixels.

Your algorithms should be designed to work with padded data. It’s somewhat ugly, but you have no other option.
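As a rough sketch of what "working with padded data" means, the loop below addresses each row through the byte stride instead of the width. It is plain host C++ for illustration, and `sumPixels` is a hypothetical function; inside a kernel the same `base + y * stepBytes` addressing would apply, using the step value returned through pStepBytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sum all pixels of an 8-bit single-channel image stored with a row stride.
// Only the first `width` bytes of each row are pixels; the rest is padding.
std::uint64_t sumPixels(const std::uint8_t* base,
                        std::size_t width,
                        std::size_t height,
                        std::size_t stepBytes) {
    std::uint64_t sum = 0;
    for (std::size_t y = 0; y < height; ++y) {
        // Row y starts y * stepBytes bytes after the base pointer,
        // not y * width bytes.
        const std::uint8_t* row = base + y * stepBytes;
        for (std::size_t x = 0; x < width; ++x) {
            sum += row[x];
        }
    }
    return sum;
}
```

The only change from an unpadded loop is the row offset, so the padding bytes are never read.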

I express the padded row length in elements of my data type instead of in bytes. This makes the indexing less ugly, but it is less portable than byte-based pointer arithmetic. (Look into the NVIDIA programming reference).
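A small sketch of that element-based approach, assuming the byte stride is an exact multiple of the element size (which is the portability catch). `pitchInElements` and `at` are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Convert a row stride in bytes to a stride in elements of type T.
// Assumption: stepBytes is an exact multiple of sizeof(T).
template <typename T>
std::size_t pitchInElements(std::size_t stepBytes) {
    assert(stepBytes % sizeof(T) == 0);
    return stepBytes / sizeof(T);
}

// Access element (x, y) using a stride measured in elements.
template <typename T>
T& at(T* base, std::size_t pitchElems, std::size_t x, std::size_t y) {
    return base[y * pitchElems + x];
}
```

With this, `at(img, pitch, x, y)` reads like ordinary 2D indexing, at the cost of assuming the stride divides evenly by the element size.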

Yeah, re-writing my algorithms to cope with the padding is one way around it, though that’s not really what I was going for. Thanks for the help; I did not realise that about 64-byte alignment and I will consider it in future.


CapJo is right. It is generally a good idea to pad your lines to multiples of 64 bytes so that your algorithms can achieve coalescing. Without that padding it is almost impossibly hard to write code that perfectly coalesces its memory accesses.

NPP’s primitives all work with arbitrary line strides (as long as the strides are multiples of the size of a single pixel). That means you’re not restricted to using NPP’s 2D memory allocators to allocate your image data. If you have kernels that cannot deal with arbitrary line strides, you could use a normal cudaMalloc to allocate your image data with a padding of your choosing. NPP will be able to handle this, as long as your image data pointers are aligned to multiples of the pixel size and the line strides are also multiples of the pixel size.
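If you go the custom-allocation route, a quick sanity check along these lines may help before handing the buffer to NPP. This is only a sketch: `strideAndPointerOk` is a hypothetical helper, not an NPP function, and the extra requirement that the stride is at least one full row of pixels is my own assumption rather than something stated above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical check of the two conditions described above:
//  - the data pointer is aligned to a multiple of the pixel size
//  - the line stride is a multiple of the pixel size
// Added assumption: the stride must also cover one full row of pixels.
bool strideAndPointerOk(const void* data,
                        std::size_t stepBytes,
                        std::size_t widthPixels,
                        std::size_t pixelBytes) {
    if (pixelBytes == 0) {
        return false;
    }
    const auto addr = reinterpret_cast<std::uintptr_t>(data);
    return addr % pixelBytes == 0 &&
           stepBytes % pixelBytes == 0 &&
           stepBytes >= widthPixels * pixelBytes;
}
```

For the 8-bit single-channel case in the question the pixel size is 1 byte, so a tightly packed stride of 1024 passes the check.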

The additional 64-byte padding is a somewhat obscure optimization that lets some GPUs achieve even better memory performance than they would with lines padded only to a 64-byte multiple.


Thanks Frank, very helpful. I’m still slightly confused because the line width should already be padded to 64 bytes (1024%64=0). Although I’ve come to believe there’s a good case to just trust NPP & re-write my kernels to cope with this extra padding, which is what I’ve done! Still, very interesting read thanks.