Padding input 2D array too slow need faster method

Hi,
I’ve been using nested loops to pad input data and prepare output data. After some profiling, I found that it’s too time-consuming.
Is there a better way of doing this?
Thanks.

why nested loops? don’t you use cudaMemset2D?

is this your situation?

d = data, P = padding

ddddddddddPPP
ddddddddddPPP
ddddddddddPPP
ddddddddddPPP
ddddddddddPPP

or is it this?

d = data, P = padding

ddddddddddPPP
ddddddddPPPPP
dddddPPPPPPPP
ddddddddddddP
ddddddddddPPP

in the first case you can allocate memory for the whole array (including padding), use cudaMemcpy2D to copy the data, and call cudaMemset2D twice (once for the vertical pad block, once for the horizontal one).
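for example, a rough untested sketch of that recipe (the names pad_case1, dataW, dataH, padW, padH are just placeholders, and the pad value is 0 because cudaMemset2D can only write a repeating byte pattern):

#include <cuda_runtime.h>

// case 1: fixed data width, copy a dataW x dataH block into a padW x padH padded array
float* pad_case1(const float* h_src, int dataW, int dataH, int padW, int padH)
{
	float* d_dst;
	size_t pitch;   // row stride in bytes, chosen by cudaMallocPitch
	cudaMallocPitch((void**)&d_dst, &pitch, padW * sizeof(float), padH);

	// copy the data block into the top-left corner
	cudaMemcpy2D(d_dst, pitch,
	             h_src, dataW * sizeof(float),   // source rows are tightly packed
	             dataW * sizeof(float), dataH,
	             cudaMemcpyHostToDevice);

	// vertical pad block: the columns to the right of the data
	cudaMemset2D((char*)d_dst + dataW * sizeof(float), pitch,
	             0, (padW - dataW) * sizeof(float), dataH);

	// horizontal pad block: the full-width rows below the data (skip if there are none)
	if (padH > dataH)
		cudaMemset2D((char*)d_dst + dataH * pitch, pitch,
		             0, padW * sizeof(float), padH - dataH);

	return d_dst;
}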

in the second case you can allocate the array, then first write the pad of the shortest data row (i.e. the widest pad) across every row:

x = undefined data

xxxxxPPPPPPPP
xxxxxPPPPPPPP
xxxxxPPPPPPPP
xxxxxPPPPPPPP
xxxxxPPPPPPPP

and then copy the data:

ddddddddddPPP
ddddddddPPPPP
dddddPPPPPPPP
ddddddddddddP
ddddddddddPPP
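again just an untested sketch of that order of operations (I assume the source rows live in separate host buffers h_rows[y] with lengths rowLen[y], that d_dst was already allocated with cudaMallocPitch, and that 0 is an acceptable pad value):

#include <cuda_runtime.h>

// case 2: rows with different data lengths; one memset for the widest pad, then per-row copies
void pad_case2(float* const* h_rows, const int* rowLen,
               int padW, int padH, float* d_dst, size_t pitch)
{
	// the widest pad starts at the end of the shortest data row
	int minLen = rowLen[0];
	for (int y = 1; y < padH; ++y)
		if (rowLen[y] < minLen) minLen = rowLen[y];

	// one memset over the whole right-hand block (columns minLen..padW-1, all rows)
	cudaMemset2D((char*)d_dst + minLen * sizeof(float), pitch,
	             0, (padW - minLen) * sizeof(float), padH);

	// then copy each row's data; longer rows simply overwrite part of the padded block
	for (int y = 0; y < padH; ++y)
		cudaMemcpy((char*)d_dst + y * pitch, h_rows[y],
		           rowLen[y] * sizeof(float), cudaMemcpyHostToDevice);
}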

if your padding is 4-byte data (int or float), you can use this (my code is not the best solution):

#include "cuda_runtime.h"

//Routines per aggiungere padding

__global__ void padset2D_kernel(int* dstPtr, const size_t pitch, const int value, const size_t width, const size_t height)

{

	const int ibx = blockIdx.x * 32;

	const int iby = blockIdx.y * 32;

	

	int posX = threadIdx.x + ibx;

	dstPtr += posX;

	int* curr = dstPtr + (iby + 8*threadIdx.y)*pitch;

	int* end = dstPtr + height*pitch;

	int* alt_end = curr + 8*pitch;

	if (alt_end < end) end = alt_end;

	if (posX < width)

	{

  while (curr < end)

  {

  	curr[0] = value;

  	curr += pitch;

  }

	}

}

void padset2D(int* dstPtr, size_t pitch, int value, int width, int height)

{

	int gridx = width/32;

	if (width % 32) gridx++;

	int gridy = height/32;

	if (height % 32) gridy++;

	dim3 grid( gridx, gridy ), threads( 32, 4 );

	

	//invocazione kernel

	padset2D_kernel<<<grid, threads>>>(dstPtr,pitch,value,width,height);

	cudaThreadSynchronize();

}

where pitch, width and height are in number of integers, not in bytes as in cudaMemset2D…
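a hypothetical call, with sizes matching the first diagram above (10 data columns, 3 pad columns, 5 rows), just to make the units clear:

// fill the right-hand pad columns of a pitched int array with an arbitrary int value
int dataW = 10, paddedW = 13, paddedH = 5;   // made-up sizes matching the diagram
int* d_arr;
size_t pitchBytes;
cudaMallocPitch((void**)&d_arr, &pitchBytes, paddedW * sizeof(int), paddedH);

padset2D(d_arr + dataW,              // first pad column
         pitchBytes / sizeof(int),   // pitch in ints, as the kernel expects
         -1,                         // pad value (cudaMemset2D could not write an arbitrary int)
         paddedW - dataW,            // pad width in ints
         paddedH);                   // number of rows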

Thanks for the reply.

Yes, my case is the first one you mentioned.

I am currently only padding horizontally.

I just used cudaMemset, which worked fine. Is there a benefit to using cudaMemset2D?

Do you have to use it when padding in both directions?

I am also confused about the pitch parameter in cudaMemcpy2D: I specified the padded width, but if I pad in both directions, why is there only one pitch parameter?

Does it only allow you to pad in the horizontal direction?

cudaError_t cudaMemset2D(void* dstPtr, size_t pitch, int value, size_t width, size_t height)

width is the width in BYTES of the padding region,
height is the height of the area to pad in ROWS,
pitch is the size in BYTES of one ROW of your 2D array (including padding)

ex:
array:
ddddddddddddPPPP
ddddddddddddPPPP
ddddddddddddPPPP
ddddddddddddPPPP
ddddddddddddPPPP
ddddddddddddPPPP

dstPtr = array_pointer + 12
pitch = 16 * sizeof(data)
width = 4 * sizeof(data)
height = 6
value = P
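as a concrete (hypothetical) version of those numbers, assuming 1-byte elements so the pad value ‘P’ can be written literally (cudaMemset2D sets individual bytes), and assuming the 12 data columns are already filled:

char* d_array;   // 16 columns x 6 rows
size_t pitch = 16 * sizeof(char);
cudaMalloc((void**)&d_array, pitch * 6);
// ... copy the 12 data columns of each row here ...

cudaMemset2D(d_array + 12,        // dstPtr: skip the 12 data columns
             pitch,               // full row size in bytes, padding included
             'P',                 // value written to every padded byte
             4 * sizeof(char),    // width of the pad region in bytes
             6);                  // height: number of rows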

If the cost of memset depends on the size of the memory to be set, you could pad just one row/column with the desired value. Out-of-bounds texture fetches should return the value of the last element in the row/column, depending on how the texture addressing mode is set…
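If you go the texture route, here is a sketch of that clamp idea with the texture object API (assuming float data and that d_data comes from cudaMallocPitch, so the pitch meets the texture alignment requirement); out-of-range reads then return the nearest edge element, so no padding needs to be stored at all:

#include <cuda_runtime.h>

// build a 2D texture object whose out-of-bounds reads clamp to the nearest edge element
cudaTextureObject_t make_clamped_tex(float* d_data, size_t pitchBytes, int width, int height)
{
	cudaResourceDesc resDesc = {};
	resDesc.resType                  = cudaResourceTypePitch2D;
	resDesc.res.pitch2D.devPtr       = d_data;
	resDesc.res.pitch2D.pitchInBytes = pitchBytes;
	resDesc.res.pitch2D.width        = width;
	resDesc.res.pitch2D.height       = height;
	resDesc.res.pitch2D.desc         = cudaCreateChannelDesc<float>();

	cudaTextureDesc texDesc = {};
	texDesc.addressMode[0]   = cudaAddressModeClamp;   // clamp in x
	texDesc.addressMode[1]   = cudaAddressModeClamp;   // clamp in y
	texDesc.filterMode       = cudaFilterModePoint;
	texDesc.readMode         = cudaReadModeElementType;
	texDesc.normalizedCoords = 0;

	cudaTextureObject_t tex = 0;
	cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
	return tex;   // sample in a kernel with tex2D<float>(tex, x, y)
}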