cuMemcpy2D() Memory pitch limit, and best way to arrange large data

tasora · November 28, 2008, 2:45pm

Hallo,

I need some hints from CUDA experts about how to organize my memory buffers.

In my application, that is a GPU rigid body simulation software, I have a large number of bodies

arranged like in the attached JPG picture:

Note, however, that the JPG picture shows bodies arranged in a 2D array buffer, but currently I am

still using a 1D array arrangement, because parts of my kernels must access body data almost randomly, so

there are some pointer arithmetic that fits better in a 1D arrangement.

I will try to explain my problem with ‘Ascii art’:

suppose, by simplifying, that I have a ‘rigid body’ structure that cointains 1)speed, 2)ang.velocity, 3)position,

quaternion rotation, so it is made in this way:

[font=“Courier New”]

float4 that I draw with symbol … =

float4 #

float4 %

float4 *[/font]

With the ascii-art simplification above, the buffer of my bodies ‘body_buffer’ can be represented with this Figure 1:

[font=“Courier New”]<--------------body_stride------------------->

==============================================

##############################################

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

**********************************************[/font]

Note that I allocate the body_buffer as a simple memory chunk -not as a 2D cuda array-

using cudaMalloc(), ex:

[font=“Courier New”] cudaMalloc( (void**) &body_buffer, body_stride* 4 * 4 );[/font]

so you could also imagine as a long 1D array…

[font=“Courier New”]=================#################%%%%%%%%%% etc. etc.[/font]

Given the body_stride (also the max.number of bodies) I can access, for instance, the 3rd field (the position)

of the i-th body by doing

[font=“Courier New”] a_position = body_buffer[i_body+ 2*body_stride][/font]

Good. But now comes the problem.

Suppose I want to do 2D memory copies from host to device, and viceversa, using cuMemcpy2D(): as here…

[font=“Courier New”]

==============================================

############################################## <— buffer on dev.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

      ^

      |  ..copy operation..

   =======

   #######    <--- buffer on host

   %%%%%%%

   *******[/font]

Note, these 'partial’2D copies would be really welcome in my application because, for many

performance reason, I would avoid creating and copying a large body buffer on the host: ex.

it would be better to create a small host buffer for say 1000 bodies and fill/copy it

into device with multiple calls to cuMemcpy2D(), up to a million of bodies, for instance.

But the problem is that cuMemcpy2D() uses ‘memory pitch’, and in current NVIDIA

hardware there is a limitation: max memory pitch= 262144 bytes!!

This would allow for maximum 10k bodies in a row, and I must work with larger number of bodies.

I know, someone might suggest of arranging bodies in multiple shorter rows, as

in followinf Figure 2, so that 2D copy could work for gazilions of bodies even with the

memory pitch limitation:

[font=“Courier New”]

<-------memory pitch limit–>|

<------row_stride----->

=======================

#######################

%%%%%%%%%%%%%%%%%%%%%%%

=======================

#######################

%%%%%%%%%%%%%%%%%%%%%%%

=======================

#######################

%%%%%%%%%%%%%%%%%%%%%%%

…[/font]

Well, the approach above would allow me to use cuMemcpy2D() in very efficient way,

but there is a drawback: the pointer arithmetic in my kernels would become much

more intricated. (ex. suppose I want to fetch the 3rd field of the i-th body,

it would turn into something horrible, with modulus arithmetic like

[font=“Courier New”]

int mrow = floor(i_body/row_stride);

a_position = body_buffer[mrow4row_stride + (i_body % row_stride) + 2*row_stride][/font]

Otherwise I could do something completely different, like this ‘folded’ data buffer of Figure 3:

[font=“Courier New”]

<-------memory pitch limit–>|

<–body_stride/folds–>

=======================

#######################

…[/font]

In this latter approach, I would still get a simple pointer arithmetic in my fetching

instrutions, that is the same initial example

[font=“Courier New”] a_position = body_buffer[i_body+ 2*body_stride][/font]

…but I do not see here a meaningful way of using the cuMemcpy2D() instruction

to copy selected sub-blocks of bodies!!!

Summing up:

either I arrange bodies as in Figure 1, but I cannot

go beyond 10k bodies if I want to use 2D copies,
or I go with approach of Figure 2, but pointer arithmetic

will be complex and time-consuming.

(Note that I always use CUdeviceptr for buffers, not the CUarray stuff

because I don’t think that CUarray would change things

that much respect to CUdeviceptr… - or not?)

Has anyone some hint or suggestion about this?

Alessandro Tasora

UniversitÃ degli Studi di Parma

Topic		Replies	Views
cudaMemcpy2D(): reason for pitch limits? CUDA Programming and Performance	3	7189	August 20, 2007
Problem with 2D memory copy using pitch CUDA Programming and Performance	6	6581	November 20, 2011
Can't get cuMemcpy2D to Work CUDA Programming and Performance	1	2064	October 7, 2010
Avoiding cudaMemcpy2D() because of 65536 pitch limit CUDA Programming and Performance	1	2421	April 14, 2009
Problem with copying a 2D array CUDA Programming and Performance	2	597	March 27, 2014
Why cudaMemcpy2D cause "invalid pitch argument"? CUDA Programming and Performance	2	6692	June 10, 2008
CudaMallocPitch and CudaMemcpy2D CUDA Programming and Performance	7	5743	August 3, 2015
trouble with cudaMemcpy2D I cant get a matrix to copy into 2D pitched memory CUDA Programming and Performance	1	956	July 13, 2009
cudaMemcpy2D invalid pitch argument CUDA Programming and Performance opencv , cuda	3	884	June 2, 2022
Memcpy2D error? CUDA Programming and Performance	2	2283	July 23, 2007

cuMemcpy2D() Memory pitch limit, and best way to arrange large data

Related topics