How to allocate a 3d array such that you can use the indecies to access its elements

Hi there, I’ve been going through the NVIDIA CUDA programming guide, best practices guide, and the forum trying to find out how to allocate a 3d array and get to access it using indecies just like in c++ (array[i][j][k]), that’s because I have a 6000 lines of code that I’ve converted from matlab to c++, and now I have to get it working on the GPU (EVGA GeForce GTX 285), 99% of the code is loops around 3d arrays, here is an example:

for( k=P1;k<=KP;k++)
for ( j=P1;j<=JP;j++)
for ( i=3;i<=I_1;i++){

Hx[i][j][k]=dax[i][j][k]Hx[i][j][k]+
dx1[i][j][k]
(Ey[i][j][k+1]-Ey[i][j][k]-
Ez[i][j+1][k]+Ez[i][j][k])+
db2x[i][j][k](Ey[i][j][k+2]-Ey[i][j][k-1]-
Ez[i][j+2][k]+Ez[i][j-1][k])+
dx3[i][j][k]
(Ey[i][j+1][k+2]+Ey[i][j-1][k+2]+
Ey[i+1][j][k+2]+Ey[i-1][j][k+2]-
Ey[i][j+1][k-1]-Ey[i][j-1][k-1]-
Ey[i+1][j][k-1]-Ey[i-1][j][k-1]-
Ez[i+1][j+2][k]-Ez[i-1][j+2][k]-
Ez[i][j+2][k+1]-Ez[i][j+2][k-1]+
Ez[i+1][j-1][k]+Ez[i-1][j-1][k]+
Ez[i][j-1][k+1]+Ez[i][j-1][k-1])+
dx4[i][j][k]*(Ey[i+1][j+1][k+2]+Ey[i-1][j+1][k+2]+
Ez[i+1][j-1][k-1]+Ez[i-1][j-1][k-1]);

} ///////////////////////////////////////

I have like 500 loops in this fashion, I am very confused, I don’t know If I should use cuda arrays (which are fetched only by using textures as I read in the guide, while my arrays are read/write arrays), or use cudaMalloc (which I think will fail when copying from the host since I’ll be copying pointers!)

This is the function I used to allocate and initialize the arrays in c++, which I think can’t be done on CUDA unless I do it directly on the GPU (i.e. don’t allocate on the host then copy to the GPU), but I dont know if this can be done.

void Malloc3dVal (float ***arr3d,int x,int y,int z, float val){

int i,j,k;

if(arr3d==NULL){
   cout<<"Memory allocation failed. Exiting...."<<endl;
   return;
}   

for( i=0; i<x; i++ )
{
    arr3d[ i ] = (float **)malloc(y* sizeof (**arr3d));/* Allocate 'y' number of pointers to int */
	if(arr3d[i]==NULL){/*Validate malloc's success/failure using the return value*/
		   cout<<"Memory allocation failed. Exiting...."<<endl;
		   return;
	}
    for( j=0; j<y; j++)
    {
        arr3d[ i ][ j ]= (float *)malloc(z* sizeof (***arr3d));/* Allocate 'z' number of ints */
		if(arr3d[ i ][ j ]==NULL){ /*Validate malloc's success/failure using the return value*/
		   cout<<"Memory allocation failed. Exiting...."<<endl;
		   return;
		} 
		for( k=0; k<z;k++){

			arr3d[i][j][k]=val;


		}  

    }  
}  

}

////////////////////////////////////////////////////////////////

Or maybe I should convert all 3d to big 1d arrays and do some index calculation (which means I have to edit the whole 6000 lines again).

Another issue is that as mentioned above I have almost 500 loops, they depend on each other, but each iteration in a loop is independent from the previous one, should I make 500 kernels!!! each loop is different than the other, I mean I can’t make a kernel that can replace many loops!!

Help is highly appreciated, Thank you all in advance

note: I am still waiting for the GPU to arrive by the DHL, that’s why i couldn’t try the options myself, plus it may arrive by next monday, and my boss wants me to get working on the code till then. I tried to install the CUDA driver to use the emulator but it gave me an error.

Bingo! Unless, that is, someone’s come up with some cunning set of C++ templates to get around this issue. For performance, you always want 1D arrays with computed indices (even on the CPU). This is because C (and hence C++) only really has 1D arrays - the C[i][j][k] arrays you’re describing are really arrays of pointers to other arrays of pointers, which in turn point to the actual data. All that indirection is bad for performance. This isn’t an issue in Fortran (and I’m guessing Matlab) because the Fortran language ‘knows’ what a multi-dimensional array is, and so can allocate a contiguous block of memory and perform the index calculations itself.

It’s hard working out exactly what you’re describing, but an autocoder might be of use. Or you might be able to bend the thrust library to your purposes.

“For performance, you always want 1D arrays with computed indices (even on the CPU).”

Besides a google search, any preferred reference docs with the ‘computed indices’ topic for those fairly new to C/C++? R-project makes matrix and dataframe manipulation too easy. TIA. V.

looks like a structured (ie finite difference) Maxwell solver? Maybe you want to take a look at OpenCurrent http://code.google.com/p/opencurrent/ a library for writing structured PDE solvers with CUDA…

I’ve been programming since age 11, so I feel entitled to spread a little wisdom, lol. ;)

How to interpret a flat 1D array in memory as a 3D array using the magic of pointer casts. The only catch is that you cannot specify the highest dimension (the [3]) in the type cast. This is how you could completely avoid the error-prone index calculation (it will be handled internally by the compiler then).

testfunction will become your kernel call, and you should be ready to go.

#include <stdio.h>

float array[3*5*2] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23

,24,25,26,27,28,29,30};

void testfunction(float (*foobar)[5][2])

{

	int i,j,k;

	for (i=0; i<3; i++)

	{

		for (j=0; j<5;j++)

		{

			for (k=0; k<2; k++)

			{

				printf("array[%d][%d][%d]=%f\n", i,j,k, foobar[i][j][k]);

			}

		}

	}

}

int main(int argc, char **argv)

{

	testfunction((float (*)[5][2]) array);

}

I take it that testfunction should be written as a template, with 5 and 2 as template parameters? How do you cope with the leading dimension (and bounds-check it)?

Depends if your array sizes are variable or fixed. In the fixed case, a define may be sufficient. I am not sure if a kernel call can be templatized. But as member function of a template class, sure you can put integer template arguments into the array definitions.

Automatic bounds checking on arrays is generally not done in C - if you want to do this, you’d have to manually verify the bounds of your indices.

int *a;

Declaring “a” as a pointer, you can still access a[0], a[1], a[2]… The highest array dimension exists implicitly.

Same when declaring a pointer to a two dimensional array. You can access it in three dimensions then.

Christian

Thanks much. So if I understand some of the gleanings derived from the various forum posts-- assume data is imported from a nxm matrix held in a csv file. Better to load it into a 1D array, leave it that way when transferring to device and do all one’s GPU calcs with indexing…Is that correct?

V.

How to interpret a flat 1D array in memory as a 3D array using the magic of pointer casts. The only catch is that you cannot specify the highest dimension (the [3]) in the type cast. This is how you could completely avoid the error-prone index calculation (it will be handled internally by the compiler then).

testfunction will become your kernel call, and you should be ready to go.

#include <stdio.h>

float array[3*5*2] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23

,24,25,26,27,28,29,30};

void testfunction(float (*foobar)[5][2])

{

	int i,j,k;

	for (i=0; i<3; i++)

	{

		for (j=0; j<5;j++)

		{

			for (k=0; k<2; k++)

			{

				printf("array[%d][%d][%d]=%f\n", i,j,k, foobar[i][j][k]);

			}

		}

	}

}

int main(int argc, char **argv)

{

	testfunction((float (*)[5][2]) array);

}

[/quote]

That’s a reasonable first approach.

For squeezing the last little bit of performance out, consider using templated array classes for computing the offsets in the code, as someone suggested further up. When you are sure the entire array contains less than 2^^24 array elements. you can use 24 bit multiplications for the offsets (__umul24) and save some clock cycles (32 bit integer multiplications take twice the time because nVidia saved some logic gates).

But the first priority would be to get it running, and the pointer cast method seems the safest way to proceed.

Hi,

sorry for replying late, I didn’t get an email notification so I thought no one replied to my thread lol

I understand that they are arrays of pointers as I said here:

"

I don’t know If I should use cuda arrays (which are fetched only by using textures as I read in the guide, while my arrays are read/write arrays), or use cudaMalloc (which I think will fail when copying from the host since I’ll be copying pointers!)

"

but the problem now is that the max size of threads in a block is 512 (as far as I understand it means if using a 1D grid, right?) , and I have arrays of size 717171, which is far greater than 512. according to the CUDA programming guide:

"

􀂉The maximum number of threads per block is 512;

􀂉 The maximum sizes of the x-, y-, and z-dimension of a thread block are 512, 512,

and 64, respectively;

􀂉 The maximum size of each dimension of a grid of thread blocks is 65535;

"

as far as I understand the difference between the 1st, 2nd and 3rd points:

1st: 512 threads/block if 1D grid

2nd: 51251264 threads/block if 3D grid, or 512*512 if 2D

3rd: 65535 blocks/grid

please correct me

I am going to convert the matlab code to fortran as well, do you think if I do that first and convert from fortran to cuda it would be better?, and about the autocoder,could you suggest one for me, I’ve never used an autocoder, I’ve converted the matlab code into c++ code using a series of finding & replacing strings (after declaring and initializing the arrays of course for c++ which is nothing in matlab)

I have almost 500 successive loops, for each loop to do correct calculations, the previous loop has to finish first (I think I should use _syncThreads after each one),

but, the iterations of a loop are independent (that’s why I want to put each loop in a kernel to run each iteration using a thread). now, the loops are different which means I have to put each one in a separate kernel==> 500 kernels! I don’t know if this would introduce a problem, I mean what is the maximum number of successive kernels in an application?

Thanks a lot, I appreciate your help :)

Thank you, I will google the R-project.

I don’t really get the theory behind the program since it has to do with electromagnetics and I am working as a programmer to convert the code form matlab to c++,fortran and make it run on the GPU for max performance. I will check the website.

Thanks a lot :)

is “foobar” a keyword? and does it work in c++?

shouldn’t the above line be (array instead of foobar):

Thanks a lot, seems like a day saver :)

Lol, foobar is no keyword. I am not using any secret sauce in the code that no one would know about.

foobar is simply the name of the argument passed into the testfunction. In C you can name variables as you like.

I could have named this argument array, in which case it would have a local scope to this function. Then the array declared at file scope would no longer be visible. But this would be confusing, so I avoided it.

Christian

You really don’t want this project to be your first CUDA assignment. Spend more time on the SDK samples and simple things to get really used to CUDA.

So in your code I see above you have k as the last index in all of the array indices. Parallelize over the k loop. i.e. create thread blocks with 71 threads. Then the grid could take care of the other array dimensions.

By doing so, your memory access will be contiguous for all threads. With the relaxed memory coalescing rules for Compute 1.3 devices, such as yours, you will even have coalesced memory access patterns. That fact alone can produce a speed difference of factor 5-10.

Christian

lol Sorry, stupid me, this is abc programming, i was confused by the ( ), I thought it’s some kind of notation that I didn’t know, and didn’t pay attention to name of the passed argument.

yes of course, I was just trying to have an overall idea, but definitely won’t start with this project. I installed the GPU today and started going through the SDK examples.

Thanks for the explanation above, I will try it. :)

These classes (in fact structs) that I contrived today might prove useful for array index calculations. They support implicit conversions from and to unsigned int / int respectively and can be used as a drop-in replacement for unsigned int / int in array index calculations.

Be aware that on any operation other than multiplications (for example additions, subtractions) the result gets implicitly converted to unsigned int / int. Therefore before further multiplications, you might want to cast intermediate results back to uint24 / int24.

If you were to manually compute your indices in a large 1D array for ranges 0…71 per each dimension, this class could be useful.

WARNING: not tested yet. ;)

/**

 * Unsigned 24 bit integer type for use in array index calculations.

 * No clamping to 24 bit results occurs except for operands of a multiplication.

 * It's the responsibility of the user of the class to verify that operands

 * for multiplications are always within the 24 bit range.

 */

typedef struct _uint24

{

	int value;

	_uint24(unsigned int a) {

		value = a;

	}

	operator unsigned int () const {

		return value;

	}

	__device__ _uint24 operator*(const _uint24 &b) const {

		return (_uint24)__umul24(value, b.value);

	};

	__device__ _uint24 operator*=(const _uint24 &b) {

		value = __umul24(value, b.value);

		return *this;

	};

} uint24;

/**

 * Signed 24 bit integer type for use in array index calculations.

 * No clamping to 24 bit results occurs except for operands of a multiplication.

 * It's the responsibility of the user of the class to verify that operands

 * for multiplications are always within the 24 bit range.

 */

typedef struct _int24

{

	int value;

	_int24(int a) {

		value = a;

	}

	operator int () const {

		return value;

	}

	__device__ _int24 operator*(const _int24 &b) const {

		return (_int24)__mul24(value, b.value);

	};

	__device__ _int24 operator*=(const _int24 &b) {

		value = __mul24(value, b.value);

		return *this;

	};

} int24;
// example use

	uint24 x = threadIdx.x;

	uint24 w = gridDim.x;

	uint24 y = blockIdx.x;

	g_out[w*y + x]  = 1.0f;	// index calculation will use the efficient __umul24

I have a question,

cudaMalloc3d() is recommended for allocating 3d arrays, but it allocates in global memory right? it doesn’t make sense unless there is some automatic caching mechanism used behind the scenes that automatically caches to shared memory. is that what happens?, because all that I found in the programming guide includes:

can you help me please understand the above code. as far as I understand, the loop divides the 3d array into slices then each slice into rows and then read row, I don’t get the following statement: float element = row;

what does that mean? in each iteration of the loop around a row a new variable called “element” is created? and where is it allocated? in a register, or shared memory?

Thank you very much

The () are overriding the operator precedence. I wanted to make sure that I declare a pointer to an array. In some cases its better to be safe than sorry. Only ultra über geeks can recite the C/C++ operator precedences by heart.

A pointer to an array of 3 ints each:

int (*ptr)[3]

An array (of size 3) of pointers to int:

int *(ptr[3])

you always have to decipher such an expression from the inside out.

float (*(ptr[3][3]))[2][3][4];

This one declares ptr to be a two dimensional array of pointers (3x3) to a three dimensional array of floats of

dimension 2x3x4. Believe me, I have such beasts in production code. ;)