Understanding Memory Pitch Alignment

MiKo3001 · October 5, 2015, 9:16pm

Hello,

I’m using CUDA 6.5 for image processing on a GTX 780 and a GTX 750. I noticed some problems with my indices due to cudaMallocPitch. It seems the rows are padded to a multiple of 512 bytes.
I understand the advantage of row alignment, but I do not understand why 512 bytes are used. This is too much, and several 2D arrays (with different data types) are padded with different numbers of elements. My code would work with 128 bytes.

Thanks

Robert_Crovella · October 5, 2015, 9:39pm

You should be able to write code that works with whatever pitch value is returned by cudaMallocPitch.

The typical method of index calculation is given in the documentation:

CUDA Runtime API :: CUDA Toolkit Documentation
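
For reference, the cudaMallocPitch documentation gives the element address as `T* pElement = (T*)((char*)BaseAddress + Row * pitch) + Column;`, where `pitch` is the row stride in bytes. A minimal host-side sketch of that calculation follows; the helper names `pitched_element` and `pitched_offset` are made up for illustration, and ordinary host memory stands in for a device allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: address of element (row, col) in a pitched 2D buffer.
// `pitch` is the row stride in BYTES, as returned by cudaMallocPitch.
template <typename T>
T* pitched_element(void* base, std::size_t pitch, std::size_t row, std::size_t col) {
    char* row_start = static_cast<char*>(base) + row * pitch;  // step whole rows in bytes
    return reinterpret_cast<T*>(row_start) + col;              // then step columns in elements
}

// Hypothetical helper: byte offset of element (row, col) from the base address.
template <typename T>
std::size_t pitched_offset(std::size_t pitch, std::size_t row, std::size_t col) {
    return row * pitch + col * sizeof(T);
}
```

The same expression works inside a kernel as long as the pitch is passed in alongside the pointer. Only the row step is in bytes; the column step is in elements of `T`.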

MiKo3001 · October 6, 2015, 9:31am

Sure, I can calculate the index, but I’m concerned about the performance because my program works exactly at the border of the target frame rate.
In more detail: I’m using a remapping map to correct lens distortion. At every pixel position just a one-dimensional index is stored. With depth images from the Kinect 2 (image width 512) it works like a charm, but with depth images from the old Kinect (image width 640) it does not work.
I see two solutions:

1. Reducing the pitch alignment from 512 bytes to 128 bytes. Does that make sense?
2. Storing two unsigned shorts as a two-dimensional index.

njuffa · October 6, 2015, 2:47pm

If your index calculation correctly incorporates the row alignment, the code should be functionally correct regardless of the specific row alignment used. If there is a performance issue, rather than a functional one, it may simply have to do with the fact that an image of width 640 pixels has more data to process than one of width 512.

Nobody forces you to use cudaMallocPitch(). You can use cudaMalloc() if you do not want padding, or want to customize your padding. However, if you want to bind a 2D texture to the allocated memory, you want to use cudaMallocPitch() because there are alignment requirements imposed by the texturing hardware. There may be little need to use textures; you may want to look into using __ldg(), which also reads data through the texture path.
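
As a rough sketch of the "customize your padding" route: the pitch can be computed by rounding the row width in bytes up to a chosen alignment, and the product of pitch and height is then what would be passed to cudaMalloc(). The helper names below are hypothetical, and the 128-byte and 512-byte values are simply the ones discussed in this thread, not documented guarantees.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: round a row width in bytes up to a multiple of `alignment`.
constexpr std::size_t padded_pitch(std::size_t width_bytes, std::size_t alignment) {
    return (width_bytes + alignment - 1) / alignment * alignment;
}

// Hypothetical helper: total bytes to request from cudaMalloc() for a
// width x height array of elements of size `elem_size`, padded per row.
constexpr std::size_t padded_alloc_size(std::size_t width_elems, std::size_t elem_size,
                                        std::size_t height, std::size_t alignment) {
    return padded_pitch(width_elems * elem_size, alignment) * height;
}
```

For a width of 640, a uchar3 row is 1920 bytes: that is already a multiple of 128, but it would be padded to 2048 under a 512-byte alignment, while a float row (2560 bytes) needs no padding in either case. A width of 512 gives a multiple of 512 bytes for every element size, which may be why only the 640-wide images exposed the padding. Memory laid out this way still has to meet the device's texture pitch alignment (reported in `cudaDeviceProp::texturePitchAlignment`) if a 2D texture is bound to it.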

If you are not already doing so, I would suggest using the profiler to guide optimization efforts.

MiKo3001 · October 6, 2015, 7:09pm

Yes, but I have three 2D arrays with different alignments due to different data types.

Thanks a lot for the hint about __ldg(). It should be very useful.

I had the idea to use cudaMalloc() with a customized padding, too. With 128 Bytes everything would work fine. But what are the drawbacks? There must be a reason why the driver uses 512 Bytes, even when I don’t bind textures.

njuffa · October 6, 2015, 7:26pm

CUDA allocation routines provide memory that is suitably aligned for any and all possible subsequent uses and optimization purposes.

I do not see a problem with having multiple 2D arrays allocated with cudaMallocPitch() even if they should not all use the same pitch value. The function returns the pitch to the caller, and you can in turn pass that to other functions using those allocations. So a 2D allocation is described by pointer plus dimensions, plus pitch, instead of just pointer and dimensions. Indexing then uses the pitch as passed together with the other allocation parameters.

Based on your vague descriptions, it is still a complete mystery to me what specific problems you encounter when using cudaMallocPitch().

MiKo3001 · October 6, 2015, 7:55pm

Sorry, for completeness:
I have a depth image with 640x480 pixels (old Kinect), and several other 2D arrays of the same size but with other data types (float, float3, uchar3).

I’m using a 2D array (640x480 pixels) of type int for remapping ( (x,y)->index ). The index is used to access specific elements of the other arrays (index->(x’,y’)). Due to the different alignments (data types, element sizes) I get different (x’,y’) positions.

I will change my code and I will use ushort2 instead of int ((x,y)->(x’,y’)).

THX

njuffa · October 6, 2015, 8:12pm

An index is a mathematical concept. How that is transformed into an address is a programming question, but any abstract index transformation one desires is achievable independent of the underlying physical representation of each 2-D matrix. One just has to get the math correct that transforms the indices into addresses.
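
One way to get that math right, sketched under the assumption that the remap stores a logical index `y' * width + x'` (in elements, ignoring padding): decode it back to (x', y') using the image width, then compute the byte offset separately for each array from its own pitch and element size. The names `Coord`, `decode_index` and `byte_offset` are made up for this example.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical type: a logical pixel position, independent of any memory layout.
struct Coord {
    std::size_t x;
    std::size_t y;
};

// Decode a logical index (y * width + x, counted in elements, no padding).
constexpr Coord decode_index(std::size_t logical_index, std::size_t width) {
    return Coord{logical_index % width, logical_index / width};
}

// Byte offset of that pixel inside one particular array, using that
// array's own pitch (in bytes) and element size (in bytes).
constexpr std::size_t byte_offset(Coord c, std::size_t pitch, std::size_t elem_size) {
    return c.y * pitch + c.x * elem_size;
}
```

With this split, a single int remap entry addresses the same pixel in a float array and a uchar3 array even though their pitches differ. The cost is one division and one modulo per lookup, which storing a ushort2 pair would avoid.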

This is a very common scenario. For example, look at something like BLAS 3: For many API calls the 2-D matrices are specified not only by a starting address and two dimension specifications, but also by a fourth quantity called a “leading dimension”, which is another way of saying “pitch” of a 2-D matrix. The reason it is done that way is because this allows BLAS 3 calls to operate on arbitrary sub-matrices of a much larger containing matrix.
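
A small sketch of the leading-dimension idea, using the column-major convention of BLAS, where element (i, j) lives at `a[i + j * ld]`. The helper names `at` and `submatrix` are invented for this example and are not part of any BLAS API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Column-major element access: `ld` (leading dimension) is the stride,
// in elements, between the starts of consecutive columns.
inline double& at(double* a, std::size_t ld, std::size_t i, std::size_t j) {
    return a[i + j * ld];
}

// Pointer to the sub-matrix whose top-left corner is (i0, j0) of the parent.
// The sub-matrix keeps the PARENT's leading dimension, so no data is copied.
inline double* submatrix(double* a, std::size_t ld, std::size_t i0, std::size_t j0) {
    return a + i0 + j0 * ld;
}
```

The leading dimension here counts elements, whereas the pitch from cudaMallocPitch() counts bytes; otherwise the two play the same role of separating the logical width from the storage stride.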

MiKo3001 · October 12, 2015, 10:28pm

I did some experiments with customized alignment with cudaMalloc().
The profiler is satisfied if the alignment is a multiple of 128 Bits (as expected with a GTX 750). No need for a 512-byte alignment.

HannesF99 · October 13, 2015, 4:09pm

@MiKo3001: If you want to create your images with your own pitch (e.g. 128 bytes) and still want to bind texture references / texture objects to them (e.g., in order to take advantage of the bilinear interpolation functionality), that should work.
See the last answer in: cuda - Pitch alignment for 2D textures - Stack Overflow
Also this post may be interesting: texture2d - CUDA - Any problems when working with (un-aligned) sub-images (for texture-binding, NPP, ...)? - Stack Overflow
