cuMemcpy2D() Memory pitch limit, and best way to arrange large data

Hallo,

I need some hints from CUDA experts about how to organize my memory buffers.

In my application, that is a GPU rigid body simulation software, I have a large number of bodies

arranged like in the attached JPG picture:

bodybuffer.jpg

Note, however, that the JPG picture shows bodies arranged in a 2D array buffer, but currently I am

still using a 1D array arrangement, because parts of my kernels must access body data almost randomly, so

there are some pointer arithmetic that fits better in a 1D arrangement.

I will try to explain my problem with ‘Ascii art’:

suppose, by simplifying, that I have a ‘rigid body’ structure that cointains 1)speed, 2)ang.velocity, 3)position,

  1. quaternion rotation, so it is made in this way:

[font=“Courier New”]

float4 that I draw with symbol … =

float4 #

float4 %

float4 *[/font]

With the ascii-art simplification above, the buffer of my bodies ‘body_buffer’ can be represented with this Figure 1:

[font=“Courier New”]<--------------body_stride------------------->

==============================================

##############################################

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

**********************************************[/font]

Note that I allocate the body_buffer as a simple memory chunk -not as a 2D cuda array-

using cudaMalloc(), ex:

[font=“Courier New”] cudaMalloc( (void**) &body_buffer, body_stride* 4 * 4 );[/font]

so you could also imagine as a long 1D array…

[font=“Courier New”]=================#################%%%%%%%%%% etc. etc.[/font]

Given the body_stride (also the max.number of bodies) I can access, for instance, the 3rd field (the position)

of the i-th body by doing

[font=“Courier New”] a_position = body_buffer[i_body+ 2*body_stride][/font]

Good. But now comes the problem.

Suppose I want to do 2D memory copies from host to device, and viceversa, using cuMemcpy2D(): as here…

[font=“Courier New”]

==============================================

############################################## <— buffer on dev.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


      ^

      |  ..copy operation..

   =======

   #######    <--- buffer on host

   %%%%%%%

   *******[/font]

Note, these 'partial’2D copies would be really welcome in my application because, for many

performance reason, I would avoid creating and copying a large body buffer on the host: ex.

it would be better to create a small host buffer for say 1000 bodies and fill/copy it

into device with multiple calls to cuMemcpy2D(), up to a million of bodies, for instance.

But the problem is that cuMemcpy2D() uses ‘memory pitch’, and in current NVIDIA

hardware there is a limitation: max memory pitch= 262144 bytes!!

This would allow for maximum 10k bodies in a row, and I must work with larger number of bodies.

I know, someone might suggest of arranging bodies in multiple shorter rows, as

in followinf Figure 2, so that 2D copy could work for gazilions of bodies even with the

memory pitch limitation:

[font=“Courier New”]

<-------memory pitch limit–>|

<------row_stride----->

=======================

#######################

%%%%%%%%%%%%%%%%%%%%%%%


=======================

#######################

%%%%%%%%%%%%%%%%%%%%%%%


=======================

#######################

%%%%%%%%%%%%%%%%%%%%%%%


…[/font]

Well, the approach above would allow me to use cuMemcpy2D() in very efficient way,

but there is a drawback: the pointer arithmetic in my kernels would become much

more intricated. (ex. suppose I want to fetch the 3rd field of the i-th body,

it would turn into something horrible, with modulus arithmetic like

[font=“Courier New”]

int mrow = floor(i_body/row_stride);

a_position = body_buffer[mrow4row_stride + (i_body % row_stride) + 2*row_stride][/font]

Otherwise I could do something completely different, like this ‘folded’ data buffer of Figure 3:

[font=“Courier New”]

<-------memory pitch limit–>|

<–body_stride/folds–>

=======================

=======================

=======================

#######################

#######################

#######################




…[/font]

In this latter approach, I would still get a simple pointer arithmetic in my fetching

instrutions, that is the same initial example

[font=“Courier New”] a_position = body_buffer[i_body+ 2*body_stride][/font]

…but I do not see here a meaningful way of using the cuMemcpy2D() instruction

to copy selected sub-blocks of bodies!!!

Summing up:

  • either I arrange bodies as in Figure 1, but I cannot

    go beyond 10k bodies if I want to use 2D copies,

  • or I go with approach of Figure 2, but pointer arithmetic

    will be complex and time-consuming.

(Note that I always use CUdeviceptr for buffers, not the CUarray stuff

because I don’t think that CUarray would change things

that much respect to CUdeviceptr… - or not?)

Has anyone some hint or suggestion about this?

Alessandro Tasora

Università degli Studi di Parma