When I copy an int array[6][30] into device memory using cudaMallocPitch() and cudaMemcpy2D(), I don't understand how the rows are padded to best suit GPU memory transfers, i.e. how many int elements get added at the end of my 30 int elements?
I thought 30 ints take 120 bytes, so 2 more ints of padding would round the row up to 128 bytes, which is one memory transaction size. But that cannot be right, because I cannot reach array[1][1] by accessing the element at index 32 + 1 = 33.
The pitch returned in *pitch by cudaMallocPitch() is the width in bytes of the allocation.
The intended usage of pitch is as a separate parameter of the allocation, used to compute
addresses within the 2D array. Given the row and column of an array element of type T, the
address is computed as:
T* pElement = (T*)((char*)BaseAddress + Row * pitch) + Column;
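Putting that formula to work, here is a minimal, untested sketch using the 6x30 dimensions from your question (variable and function names are mine): it copies the host array with cudaMemcpy2D() and then fetches a[1][1] through the pitch rather than through a hand-computed element index.

#include <cstdio>
#include <cuda_runtime.h>

// Read one element of a pitched allocation using the addressing formula above.
__global__ void readElement(const int *base, size_t pitch, int row, int col, int *out)
{
    const int *rowPtr = (const int *)((const char *)base + row * pitch);
    *out = rowPtr[col];
}

int main(void)
{
    int h_a[6][30];
    for (int r = 0; r < 6; ++r)
        for (int c = 0; c < 30; ++c)
            h_a[r][c] = r * 100 + c;

    int *d_a = 0;
    size_t pitch = 0;                                  // returned in bytes, not elements
    cudaMallocPitch((void **)&d_a, &pitch, 30 * sizeof(int), 6);

    // source pitch = natural host row width, destination pitch = whatever the driver chose
    cudaMemcpy2D(d_a, pitch, h_a, 30 * sizeof(int),
                 30 * sizeof(int), 6, cudaMemcpyHostToDevice);

    int *d_out = 0, h_out = 0;
    cudaMalloc((void **)&d_out, sizeof(int));
    readElement<<<1, 1>>>(d_a, pitch, 1, 1, d_out);    // fetch a[1][1]
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);

    printf("pitch = %zu bytes, a[1][1] = %d\n", pitch, h_out);   // expect 101

    cudaFree(d_out);
    cudaFree(d_a);
    return 0;
}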
Yeah, thanks for pointing me there, but my concern is how the pitch size is determined.
cudaMallocPitch() returned pitch = 512 for me, meaning there are 128 int elements per row after padding. But my row has only 30 ints; why not pad with just 8 bytes (two int words) to reach 32 elements per row? Why not 64 or 96 elements, but 128?
The pitch is picked by the driver to provide optimal performance for a given GPU. At minimum, the pitch must satisfy the row alignment requirements of 2D textures that could be bound to the pitch-linear memory allocated, but the driver may pick something wider based on performance considerations.
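If you want to see where that lower bound sits on your particular GPU, something along these lines (an untested sketch) prints the device's texture pitch alignment next to the pitch the driver actually hands back for a 30-int row:

#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("texturePitchAlignment = %zu bytes\n", prop.texturePitchAlignment);

    int *d_a = 0;
    size_t pitch = 0;
    cudaMallocPitch((void **)&d_a, &pitch, 30 * sizeof(int), 6);
    printf("requested row width = %zu bytes, pitch chosen by driver = %zu bytes\n",
           30 * sizeof(int), pitch);

    cudaFree(d_a);
    return 0;
}

The returned pitch should come back as a multiple of that alignment (and at least the requested row width), but as noted the driver may pick something larger still.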
OK, I see. So you mean it's not something we can decide ourselves, right? It's a black box, and the only thing we can make use of is the returned pitch width, so that we can index our 2D array in the kernel.
What if I want to copy the 2D array into shared memory, with each thread copying one data element? There would have to be some divergence from an if-condition that checks whether a thread's index falls on a real element or in the padding. Will that kind of divergence, possibly even within a warp, slow down the overall performance a lot?
For example:
Each row: 30 real elements + 98 padding elements (a pitch of 128 ints).
We need to check something like (threadID % 128 < 30), which causes divergence.
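Something like this rough kernel sketch is what I have in mind (6x30 tile, one warp per row, launched with a 32x6 block; names are made up):

// Hypothetical kernel: one block stages the whole 6x30 array in shared memory.
// Threads whose column index lands in the padding simply do nothing, so only
// the threads past column 29 in each warp take the other side of the branch.
__global__ void loadTile(const int *base, size_t pitch, int *out)
{
    __shared__ int tile[6][30];

    int row = threadIdx.y;      // 0..5
    int col = threadIdx.x;      // 0..31, one warp per row

    if (col < 30) {             // skip the padded part of the pitched row
        const int *rowPtr = (const int *)((const char *)base + row * pitch);
        tile[row][col] = rowPtr[col];
    }
    __syncthreads();

    // ... work on tile[][] here; copy it back out so the load is not optimized away
    if (col < 30)
        out[row * 30 + col] = tile[row][col];
}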
You could always use regular cudaMalloc() and interpret the allocated memory as having as many dimensions as desired, with as much padding (or no padding) as you see fit. The use of cudaMemcpy2D() does not require the memory to be allocated with cudaMallocPitch(). For example, already in the very first release of CUBLAS I used cudaMemcpy2D() for the non-unit-stride copies inside cublas{Get|Set}Vector(), on memory allocated via plain cudaMalloc().
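A rough sketch of what that could look like with no padding at all, i.e. a pitch you pick yourself equal to the logical row width (again the 6x30 shape from your example, names are mine):

#include <cuda_runtime.h>

int main(void)
{
    int h_a[6][30] = { 0 };

    // Plain allocation: we pick the pitch ourselves; here there is no padding at all.
    size_t myPitch = 30 * sizeof(int);
    int *d_a = 0;
    cudaMalloc((void **)&d_a, 6 * myPitch);

    // cudaMemcpy2D() is happy with memory from cudaMalloc(); the pitch arguments
    // merely describe how far apart consecutive rows are in each buffer.
    cudaMemcpy2D(d_a, myPitch, h_a, 30 * sizeof(int),
                 30 * sizeof(int), 6, cudaMemcpyHostToDevice);

    // Inside a kernel, element (row, col) is then simply d_a[row * 30 + col],
    // with no padded columns to skip over.

    cudaFree(d_a);
    return 0;
}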
[Later:]
Changed inadvertent use of cudaMalloc2D() to cudaMallocPitch().