When I copy an int array[6][30] into device memory using cudaMallocPitch() and cudaMemcpy2D(), I don't understand how the rows are padded to best suit GPU memory transfers, i.e. how many int elements get added at the end of my 30 int elements?
I thought 30 ints take 120 bytes, so 2 more ints of padding would round the row up to 128 bytes, which is one memory transaction size. But that cannot be right, because I cannot reach array[1][1] by accessing the element at index 32 + 1 = 33.
The pitch returned in *pitch by cudaMallocPitch() is the width in bytes of the allocation.
The intended usage of pitch is as a separate parameter of the allocation, used to compute
addresses within the 2D array. Given the row and column of an array element of type T, the
address is computed as:
T* pElement = (T*)((char*)BaseAddress + Row * pitch) + Column;
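Putting that formula to work, here is a minimal, untested sketch using the 6x30 dimensions from your question (variable and function names are mine): it copies the host array with cudaMemcpy2D() and then fetches a[1][1] through the pitch rather than through a hand-computed element index.

#include <cstdio>
#include <cuda_runtime.h>

// Read one element of a pitched allocation using the addressing formula above.
__global__ void readElement(const int *base, size_t pitch, int row, int col, int *out)
{
    const int *rowPtr = (const int *)((const char *)base + row * pitch);
    *out = rowPtr[col];
}

int main(void)
{
    int h_a[6][30];
    for (int r = 0; r < 6; ++r)
        for (int c = 0; c < 30; ++c)
            h_a[r][c] = r * 100 + c;

    int *d_a = 0;
    size_t pitch = 0;                                  // returned in bytes, not elements
    cudaMallocPitch((void **)&d_a, &pitch, 30 * sizeof(int), 6);

    // source pitch = natural host row width, destination pitch = whatever the driver chose
    cudaMemcpy2D(d_a, pitch, h_a, 30 * sizeof(int),
                 30 * sizeof(int), 6, cudaMemcpyHostToDevice);

    int *d_out = 0, h_out = 0;
    cudaMalloc((void **)&d_out, sizeof(int));
    readElement<<<1, 1>>>(d_a, pitch, 1, 1, d_out);    // fetch a[1][1]
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);

    printf("pitch = %zu bytes, a[1][1] = %d\n", pitch, h_out);   // expect 101

    cudaFree(d_out);
    cudaFree(d_a);
    return 0;
}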
Yeah, thanks for pointing me there, but my concern is how the pitch size is determined.
cudaMallocPitch() returned pitch = 512 for me, meaning there are 128 int elements per row after padding. But my row has only 30 ints; why not pad with just 8 bytes (two int words) to reach 32 elements per row? Why not 64 or 96 elements, but 128?
The pitch is picked by the driver to provide optimal performance for a given GPU. At minimum, the pitch must satisfy the row alignment requirements of 2D textures that could be bound to the pitch-linear memory allocated, but the driver may pick something wider based on performance considerations.
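If you want to see where that lower bound sits on your particular GPU, something along these lines (an untested sketch) prints the device's texture pitch alignment next to the pitch the driver actually hands back for a 30-int row:

#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("texturePitchAlignment = %zu bytes\n", prop.texturePitchAlignment);

    int *d_a = 0;
    size_t pitch = 0;
    cudaMallocPitch((void **)&d_a, &pitch, 30 * sizeof(int), 6);
    printf("requested row width = %zu bytes, pitch chosen by driver = %zu bytes\n",
           30 * sizeof(int), pitch);

    cudaFree(d_a);
    return 0;
}

The returned pitch should come back as a multiple of that alignment (and at least the requested row width), but as noted the driver may pick something larger still.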
OK, I see. So you mean it's not something we can decide ourselves, right? It's a black box, and the only thing we can make use of is the returned pitch width, so that we can index our 2D array in the kernel.
What if I want to copy the 2D array into shared memory, with each thread copying one data element? There would have to be some divergence from an if-condition that checks whether a thread's index falls on a real element or in the padding. Will that kind of divergence, possibly even within a warp, slow down the overall performance a lot?
For example:
Each row: 30 real elements + 98 padding elements (a pitch of 128 ints).
We need to check something like (threadID % 128 < 30), which causes divergence.
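Something like this rough kernel sketch is what I have in mind (6x30 tile, one warp per row, launched with a 32x6 block; names are made up):

// Hypothetical kernel: one block stages the whole 6x30 array in shared memory.
// Threads whose column index lands in the padding simply do nothing, so only
// the threads past column 29 in each warp take the other side of the branch.
__global__ void loadTile(const int *base, size_t pitch, int *out)
{
    __shared__ int tile[6][30];

    int row = threadIdx.y;      // 0..5
    int col = threadIdx.x;      // 0..31, one warp per row

    if (col < 30) {             // skip the padded part of the pitched row
        const int *rowPtr = (const int *)((const char *)base + row * pitch);
        tile[row][col] = rowPtr[col];
    }
    __syncthreads();

    // ... work on tile[][] here; copy it back out so the load is not optimized away
    if (col < 30)
        out[row * 30 + col] = tile[row][col];
}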
You could always use regular cudaMalloc() and interpret the allocated memory as having as many dimensions as desired, with as much padding (or no padding) as you see fit. The use of cudaMemcpy2D() does not require the memory to be allocated with cudaMallocPitch(). For example, already in the very first release of CUBLAS I used cudaMemcpy2D() for the non-unit-stride copies inside cublas{Get|Set}Vector(), on memory allocated via plain cudaMalloc().
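A rough sketch of what that could look like with no padding at all, i.e. a pitch you pick yourself equal to the logical row width (again the 6x30 shape from your example, names are mine):

#include <cuda_runtime.h>

int main(void)
{
    int h_a[6][30] = { 0 };

    // Plain allocation: we pick the pitch ourselves; here there is no padding at all.
    size_t myPitch = 30 * sizeof(int);
    int *d_a = 0;
    cudaMalloc((void **)&d_a, 6 * myPitch);

    // cudaMemcpy2D() is happy with memory from cudaMalloc(); the pitch arguments
    // merely describe how far apart consecutive rows are in each buffer.
    cudaMemcpy2D(d_a, myPitch, h_a, 30 * sizeof(int),
                 30 * sizeof(int), 6, cudaMemcpyHostToDevice);

    // Inside a kernel, element (row, col) is then simply d_a[row * 30 + col],
    // with no padded columns to skip over.

    cudaFree(d_a);
    return 0;
}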
[Later:]
Changed inadvertent use of cudaMalloc2D() to cudaMallocPitch().