Padding input 2D array too slow need faster method

Hi,
I’ve been using nested loops to pad input data and prepare output data. After some profiling, I found that it’s too time-consuming.
Is there a better way of doing this?
Thanks.

why nested loops? don’t you use cudaMemset2D?

is this your situation?

d = data, P = padding

ddddddddddPPP
ddddddddddPPP
ddddddddddPPP
ddddddddddPPP
ddddddddddPPP

or is it this?

d = data, P = padding

ddddddddddPPP
ddddddddPPPPP
dddddPPPPPPPP
ddddddddddddP
ddddddddddPPP

in the first case you can allocate memory for the whole array (including padding), use cudaMemcpy2D to copy the data, and call cudaMemset2D twice (once for the vertical pad block, once for the horizontal one).
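for example, a rough untested sketch of that recipe (the names pad_case1, dataW, dataH, padW, padH are just placeholders, and the pad value is 0 because cudaMemset2D can only write a repeating byte pattern):

#include <cuda_runtime.h>

// case 1: fixed data width, copy a dataW x dataH block into a padW x padH padded array
float* pad_case1(const float* h_src, int dataW, int dataH, int padW, int padH)
{
	float* d_dst;
	size_t pitch;   // row stride in bytes, chosen by cudaMallocPitch
	cudaMallocPitch((void**)&d_dst, &pitch, padW * sizeof(float), padH);

	// copy the data block into the top-left corner
	cudaMemcpy2D(d_dst, pitch,
	             h_src, dataW * sizeof(float),   // source rows are tightly packed
	             dataW * sizeof(float), dataH,
	             cudaMemcpyHostToDevice);

	// vertical pad block: the columns to the right of the data
	cudaMemset2D((char*)d_dst + dataW * sizeof(float), pitch,
	             0, (padW - dataW) * sizeof(float), dataH);

	// horizontal pad block: the full-width rows below the data (skip if there are none)
	if (padH > dataH)
		cudaMemset2D((char*)d_dst + dataH * pitch, pitch,
		             0, padW * sizeof(float), padH - dataH);

	return d_dst;
}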

in the second case you can allocate the array, then first write the pad of the shortest data row (i.e. the widest pad) across every row:

x = undefined data

xxxxxPPPPPPPP
xxxxxPPPPPPPP
xxxxxPPPPPPPP
xxxxxPPPPPPPP
xxxxxPPPPPPPP

and then copy the data:

ddddddddddPPP
ddddddddPPPPP
dddddPPPPPPPP
ddddddddddddP
ddddddddddPPP
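again just an untested sketch of that order of operations (I assume the source rows live in separate host buffers h_rows[y] with lengths rowLen[y], that d_dst was already allocated with cudaMallocPitch, and that 0 is an acceptable pad value):

#include <cuda_runtime.h>

// case 2: rows with different data lengths; one memset for the widest pad, then per-row copies
void pad_case2(float* const* h_rows, const int* rowLen,
               int padW, int padH, float* d_dst, size_t pitch)
{
	// the widest pad starts at the end of the shortest data row
	int minLen = rowLen[0];
	for (int y = 1; y < padH; ++y)
		if (rowLen[y] < minLen) minLen = rowLen[y];

	// one memset over the whole right-hand block (columns minLen..padW-1, all rows)
	cudaMemset2D((char*)d_dst + minLen * sizeof(float), pitch,
	             0, (padW - minLen) * sizeof(float), padH);

	// then copy each row's data; longer rows simply overwrite part of the padded block
	for (int y = 0; y < padH; ++y)
		cudaMemcpy((char*)d_dst + y * pitch, h_rows[y],
		           rowLen[y] * sizeof(float), cudaMemcpyHostToDevice);
}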

if your padding is 4-byte data (int or float), you can use this (my code is not the best solution):

#include "cuda_runtime.h"

//Routines per aggiungere padding

__global__ void padset2D_kernel(int* dstPtr, const size_t pitch, const int value, const size_t width, const size_t height)

{

	const int ibx = blockIdx.x * 32;

	const int iby = blockIdx.y * 32;

	

	int posX = threadIdx.x + ibx;

	dstPtr += posX;

	int* curr = dstPtr + (iby + 8*threadIdx.y)*pitch;

	int* end = dstPtr + height*pitch;

	int* alt_end = curr + 8*pitch;

	if (alt_end < end) end = alt_end;

	if (posX < width)

	{

  while (curr < end)

  {

  	curr[0] = value;

  	curr += pitch;

  }

	}

}

void padset2D(int* dstPtr, size_t pitch, int value, int width, int height)

{

	int gridx = width/32;

	if (width % 32) gridx++;

	int gridy = height/32;

	if (height % 32) gridy++;

	dim3 grid( gridx, gridy ), threads( 32, 4 );

	

	//invocazione kernel

	padset2D_kernel<<<grid, threads>>>(dstPtr,pitch,value,width,height);

	cudaThreadSynchronize();

}

where pitch, width and height are in number of integers, not in bytes as in cudaMemset2D…
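a hypothetical call, with sizes matching the first diagram above (10 data columns, 3 pad columns, 5 rows), just to make the units clear:

// fill the right-hand pad columns of a pitched int array with an arbitrary int value
int dataW = 10, paddedW = 13, paddedH = 5;   // made-up sizes matching the diagram
int* d_arr;
size_t pitchBytes;
cudaMallocPitch((void**)&d_arr, &pitchBytes, paddedW * sizeof(int), paddedH);

padset2D(d_arr + dataW,              // first pad column
         pitchBytes / sizeof(int),   // pitch in ints, as the kernel expects
         -1,                         // pad value (cudaMemset2D could not write an arbitrary int)
         paddedW - dataW,            // pad width in ints
         paddedH);                   // number of rows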

Thanks for the reply.

Yes, my case is the first one you mentioned.

I am currently only padding horizontally.

I just used cudaMemset, which worked fine. Is there a benefit to using cudaMemset2D?

Do you have to use it when padding in both directions?

I am also confused about the pitch parameter in cudaMemcpy2D: I specified the padded width, but if I pad in both directions, why is there only one pitch parameter?

Does it only allow you to pad in the horizontal direction?

cudaError_t cudaMemset2D(void* dstPtr, size_t pitch, int value, size_t width, size_t height)

width is the width in BYTES of the padding region,
height is the height of the area to pad in ROWS,
pitch is the size in BYTES of one ROW of your 2D array (including padding)

ex:
array:
ddddddddddddPPPP
ddddddddddddPPPP
ddddddddddddPPPP
ddddddddddddPPPP
ddddddddddddPPPP
ddddddddddddPPPP

dstPtr = array_pointer + 12
pitch = 16 * sizeof(data)
width = 4 * sizeof(data)
height = 6
value = P
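as a concrete (hypothetical) version of those numbers, assuming 1-byte elements so the pad value ‘P’ can be written literally (cudaMemset2D sets individual bytes), and assuming the 12 data columns are already filled:

char* d_array;   // 16 columns x 6 rows
size_t pitch = 16 * sizeof(char);
cudaMalloc((void**)&d_array, pitch * 6);
// ... copy the 12 data columns of each row here ...

cudaMemset2D(d_array + 12,        // dstPtr: skip the 12 data columns
             pitch,               // full row size in bytes, padding included
             'P',                 // value written to every padded byte
             4 * sizeof(char),    // width of the pad region in bytes
             6);                  // height: number of rows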

If the cost of memset depends on the size of the memory to be set, you could pad just one row/column with the desired value. Out-of-bounds texture fetches should return the value of the last element in the row/column, depending on how the texture addressing mode is set…
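If you go the texture route, here is a sketch of that clamp idea with the texture object API (assuming float data and that d_data comes from cudaMallocPitch, so the pitch meets the texture alignment requirement); out-of-range reads then return the nearest edge element, so no padding needs to be stored at all:

#include <cuda_runtime.h>

// build a 2D texture object whose out-of-bounds reads clamp to the nearest edge element
cudaTextureObject_t make_clamped_tex(float* d_data, size_t pitchBytes, int width, int height)
{
	cudaResourceDesc resDesc = {};
	resDesc.resType                  = cudaResourceTypePitch2D;
	resDesc.res.pitch2D.devPtr       = d_data;
	resDesc.res.pitch2D.pitchInBytes = pitchBytes;
	resDesc.res.pitch2D.width        = width;
	resDesc.res.pitch2D.height       = height;
	resDesc.res.pitch2D.desc         = cudaCreateChannelDesc<float>();

	cudaTextureDesc texDesc = {};
	texDesc.addressMode[0]   = cudaAddressModeClamp;   // clamp in x
	texDesc.addressMode[1]   = cudaAddressModeClamp;   // clamp in y
	texDesc.filterMode       = cudaFilterModePoint;
	texDesc.readMode         = cudaReadModeElementType;
	texDesc.normalizedCoords = 0;

	cudaTextureObject_t tex = 0;
	cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
	return tex;   // sample in a kernel with tex2D<float>(tex, x, y)
}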