Understanding Memory Pitch Alignment


I’m using cuda 6.5 for image processing with GTX 780 and GTX 750. I noticed some problems with my indexes due to cudaMallocPitch. It seems like the rows are filled to a multiple of 512 Bytes.
I understand the advantage of row alignment but I do not understand why 512 Bytes are used. This is too much and several 2D-Arrays (with different datatyps) are filled with a different number of padding elements. My code would work with 128 Byte.


You should be able to write code that works with whatever pitch value is returned by cudaMallocPitch.

The typical method of index calculation is given in the documentation:


Sure, I can calculate the index, but I’m concerned about the performance because my program works exactly at the border of the target frame rate.
More in detail I’m using a remapping map to correct lens distortion. At every pixel position just a one-dimensional index is stored. With depth images from Kinect 2 (image width is 512) it works like a charm. But depth images from the old Kinect (image width is 640) is does not work.
I see two solution:

  1. Reducing the Pitch Alignment from 512 Bytes to 128 Byte. Does it make sense?
  2. Storing two unsigned shorts as two-dimensional index.

If your index calculation correctly incorporates the row alignment it should work functionally correct regardless of the specific amount of the row alignment used. If there is a performance issue, rather than a functional one, it may simply have to do with the fact that an image of width 640 pixels has more data to process than one of width 512.

Nobody forces you to use cudaMallocPitch(). You can use cudaMalloc() if you do not want padding, or want to customize your padding. However, if you want to bind a 2D texture to the allocated memory, you wan to use cudaMallocPitch() because there are alignment requirements imposed by the texturing hardware. There may be little need to use textures, you may want to look into using __ldg() which also reads data through the texture path.

If you are not already doing so, I would suggest using the profiler to guide optimization efforts.

Yes, but I have three 2d-arrays with different alignment due to different datatyps.

Thanks a lot for the hint to __ldg(). It should be very usefull.

I had the idea to use cudaMalloc() with a customized padding, too. With 128 Bytes everything would work fine. But what are the drawbacks? There must be a reason why the driver uses 512 Bytes, even when I don’t bind textures.

CUDA allocation routines provide memory that is suitably aligned for any and all possible subsequent uses and optimization purposes.

I do not see a problem with having multiple 2D arrays allocated with cudaMallocPitch() even if they should not all use the same pitch value. The function returns the pitch to the caller, and you can in turn pass that to other functions using those allocations. So a 2D allocation is described by pointer plus dimensions, plus pitch, instead of just pointer and dimensions. Indexing then uses the pitch as passed together with the other allocation parameters.

Based on your vague descriptions, it is still a complete mystery to me what specific problems you encounter when using cudaMallocPitch().

Sorry, for completeness:
I have a depth image with 640x480 Pixels (old Kinect). Several other 2d-arrays with same size but other datatyps(float,float3,uchar3).

I’m using a 2d-array (640x480 Pixels) of type int for remapping ( (x,y)->index ). The index is used to access specific elements from the other arrays (index->(x’,y’)). Due to different alignments (datatyps, element size) I get different (x’,y’) positions.

I will change my code and I will use ushort2 instead of int ((x,y)->(x’,y’)).


An index is a mathematical concept. How that is transformed into an address is a programming question, but any abstract index transformation one desires is achievable independent of the underlying physical representation of each 2-D matrix. One just has to get the math correct that transforms the indices into addresses.

This is a very common scenario. For example, look at something like BLAS 3: For many API calls the 2-D matrices are specified not only by a starting address and two dimension specifications, but also by a fourth quantity called a “leading dimension”, which is another way of saying “pitch” of a 2-D matrix. The reason it is done that way is because this allows BLAS 3 calls to operate on arbitrary sub-matrices of a much larger containing matrix.

I did some experiments with customized alignment with cudaMalloc().
The Profiler is satisfied if the alignment is a multiple of 128 Bits (like expected with a GTX 750). No need for a 512 Byte alignment.

@MiKo3001: If you want to create your images with your defined pitch (e.g. 128 byte) and still want to bind textures references / texture objects on it (e.g., in order to take advantage of the bilinear interpolation functionality), that should work.
See the last answer in: http://stackoverflow.com/questions/12550927/pitch-alignment-for-2d-textures
Also this post may be interesting: http://stackoverflow.com/questions/25789916/cuda-any-problems-when-working-with-un-aligned-sub-images-for-texture-bindi