Indexing in 3D data

If I have 3D data, how do I get the x, y and z coordinates in the kernel?

Let's say I set up the 3D data like this:

cudaArray *darray = NULL;

const cudaExtent volumeSize = make_cudaExtent(sizeof(float2)*NX, NY, NZ);

cudaChannelFormatDesc desc = cudaCreateChannelDesc<float2>();

cudaMalloc3DArray(&darray, &desc, volumeSize);

cudaPitchedPtr dPtr;

cudaMalloc3D(&dPtr, volumeSize);

cudaMemset3D(dPtr, 0, volumeSize);

// Make a device to device copy

cudaMemcpy3DParms copyParams = {0};

copyParams.srcPtr   = dPtr;

copyParams.dstArray = darray;

copyParams.extent   = volumeSize;

copyParams.kind	 = cudaMemcpyDeviceToDevice;

cudaMemcpy3D(&copyParams);

And given a kernel something like this, how do I get the x, y and z coordinates for the current index in the volume?

__global__ void mykernel(float *A, float *B)

{		

	int idx = threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * blockDim.x * blockDim.y;

	int x = ??

	int y = ??

	int z = ??

	

	A[idx] = B[idx] * x + 3 * y + 5 * z;

}

I want to do different things with the data depending on the x, y and z coordinates.

Do I have to use a 3D block grid if my data is 3D?

This is a good question!

Unfortunately on current hardware the grid is only 2D, which makes it tricky to calculate 3D indices. To make things worse, integer divide and modulo (which you would normally use to calculate your own 3D indices) are very expensive on current GPUs.
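(Just to spell out what that divide/modulo version would look like for the kernel in the first post, it is something like the sketch below, assuming the volume is stored x-fastest and the launch covers it with enough 1D blocks. It works, it's just slow.)

[codebox]// naive version (sketch): recover x, y, z from a flat index with / and %
__global__ void mykernel_divmod(float *A, float *B, int NX, int NY, int NZ)
{
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	if (idx >= NX * NY * NZ) return;

	// volume stored x-fastest: idx = x + y*NX + z*NX*NY
	int x = idx % NX;
	int y = (idx / NX) % NY;
	int z = idx / (NX * NY);

	A[idx] = B[idx] * x + 3 * y + 5 * z;
}[/codebox]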

The best solution I’ve seen is this code (credit to Jonathan Cohen, hopefully he doesn’t mind me posting it here).

The setup code is:

[codebox]__host__ void launchThreads(int nx, int ny, int nz) {

int threadsInX = 16;

int threadsInY = 4;

int threadsInZ = 4;

int blocksInX = (nx+threadsInX-1)/threadsInX;

int blocksInY = (ny+threadsInY-1)/threadsInY;

int blocksInZ = (nz+threadsInZ-1)/threadsInZ;

dim3 Dg = dim3(blocksInX, blocksInY*blocksInZ);

dim3 Db = dim3(threadsInX, threadsInY, threadsInZ);

callKernel<<<Dg, Db>>>(…, blocksInY, 1.0f/(float)blocksInY);

}[/codebox]

And the kernel code looks like this:

[codebox]__global__ void callKernel(…, unsigned int blocksInY, float invBlocksInY)

{

unsigned int blockIdxz = __float2uint_rd(blockIdx.y * invBlocksInY);

unsigned int blockIdxy = blockIdx.y - __umul24(blockIdxz,blocksInY);

unsigned int i     = __umul24(blockIdx.x,blockDim.x) + threadIdx.x;

unsigned int j     = __umul24(blockIdx.y ,blockDim.y) + threadIdx.y;

unsigned int k     = __umul24(blockIdx.z ,blockDim.z) + threadIdx.z;

// use i,j,k ...

}[/codebox]

We should really have a sample of 3D array processing in the SDK. Anybody want to contribute one? :)

I thought of another solution but I don’t know if it is better than the one you showed.

Before the launch of the kernel, I can set up a coordinate system:

uint3 mycoordinatesystem[NX*NY*NZ];

int idx = 0;
for (int x = 0; x < NX; x++)
{
	for (int y = 0; y < NY; y++)
	{
		for (int z = 0; z < NZ; z++)
		{
			mycoordinatesystem[idx].x = x;
			mycoordinatesystem[idx].y = y;
			mycoordinatesystem[idx].z = z;
			idx++;
		}
	}
}

Then in the kernel I know what each index means in terms of x, y and z, or am I getting something wrong? The disadvantage is that it uses some extra memory, but no calculations beyond the normal idx calculation are needed in the kernel.
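For example (untested sketch; d_coords would be the device copy of mycoordinatesystem, copied over with cudaMemcpy), the kernel from the first post would just read the table:

[codebox]__global__ void mykernel(float *A, float *B, const uint3 *d_coords)
{
	int idx = threadIdx.x + threadIdx.y * blockDim.x
	        + threadIdx.z * blockDim.x * blockDim.y;

	// look up the precomputed coordinates for this flat index
	int x = d_coords[idx].x;
	int y = d_coords[idx].y;
	int z = d_coords[idx].z;

	A[idx] = B[idx] * x + 3 * y + 5 * z;
}[/codebox]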

Hello,

I am currently having a very similar problem. However, I found another way to calculate the coordinates, which is only possible because I know that my x-dimension never exceeds 512 (which is the maximum number of threads in a block). Simply create a 1D block of threads, and use the grid for the other 2 dimensions:

dim3 Db(DIMX, 1, 1);

dim3 Dg(DIMZ, DIMY);

Then in the kernel, the x, y, and z coordinates are just:

int x = threadIdx.x;

int y = blockIdx.y;

int z = blockIdx.x;

Which is great, because it requires no calculations. But then I get into trouble as you mentioned, because how do I correctly calculate the array index? If the dimensions are a power of 2, it works when I do this:

int idx = x + y*blockDim.x + z*blockDim.x*gridDim.y;

When I use any dimensions other than powers of 2, the indexing becomes wrong. I expect that it is because I need to use the pitch, or the xsize and ysize from the cudaPitchedPtr.

But how do I use them? Please help!
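Something like this is what I imagine it should look like with the pitch (just my guess from the cudaMalloc3D documentation, not tested):

[codebox]// Db = (DIMX, 1, 1), Dg = (DIMZ, DIMY), data allocated with cudaMalloc3D as float
__global__ void pitchedKernel(cudaPitchedPtr dPtr)
{
	int x = threadIdx.x;
	int y = blockIdx.y;
	int z = blockIdx.x;

	char *devPtr = (char*)dPtr.ptr;
	size_t pitch = dPtr.pitch;               // padded row width in bytes
	size_t slicePitch = pitch * dPtr.ysize;  // one x-y slice in bytes

	float *row = (float*)(devPtr + z * slicePitch + y * pitch);
	row[x] = 2.0f * row[x];                  // access element (x, y, z) here
}[/codebox]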

  • Glimberg

Thanks, and how do I calculate the total index?

Hi all,

I relied on the sample given in the CUDA 2.1 manual to access elements in a cudaMalloc3D-allocated structure:

// host code
cudaPitchedPtr devPitchedPtr;
cudaExtent extent = make_cudaExtent(64, 64, 64);
cudaMalloc3D(&devPitchedPtr, extent);
myKernel<<<100, 512>>>(devPitchedPtr, extent);

// device code
__global__ void myKernel(cudaPitchedPtr devPitchedPtr, cudaExtent extent)
{
	char* devPtr = (char*)devPitchedPtr.ptr;
	size_t pitch = devPitchedPtr.pitch;
	size_t slicePitch = pitch * extent.height;
	for (int z = 0; z < extent.depth; ++z) {
		char* slice = devPtr + z * slicePitch;
		for (int y = 0; y < extent.height; ++y) {
			float* row = (float*)(slice + y * pitch);
			for (int x = 0; x < extent.width; ++x) {
				float element = row[x];
			}
		}
	}
}

But I have trouble when the data dimensions are not a power of 2, as mentioned in previous posts (my approach is similar to Glimberg's). Does this mean this sample code is wrong? I am actually using CUDA 2.0; can this make a difference?

Has nobody else encountered trouble with 3D indexing? :/

I'm facing the very same trouble as you guys.

Using Cohen's approach for workload assignment, what would be the best way to reserve and copy memory to the device? I'm using a 3D-indexed (data[z][y]) data volume. I guess cudaMemcpy3D could be the best, but… should I use a cudaArray (copyParams.dstArray) or a regular array pointer (copyParams.dstPtr)?
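What I have in mind so far (untested sketch, using the NX, NY, NZ from the first post and copying a flat host volume into a cudaMalloc3D allocation via copyParams.dstPtr) is:

[codebox]float *h_data;       // host volume, NX*NY*NZ floats, x-fastest (assumed allocated and filled)
cudaPitchedPtr d_data;
cudaExtent extent = make_cudaExtent(NX * sizeof(float), NY, NZ); // width in bytes

cudaMalloc3D(&d_data, extent);

cudaMemcpy3DParms copyParams = {0};
copyParams.srcPtr = make_cudaPitchedPtr(h_data, NX * sizeof(float), NX, NY);
copyParams.dstPtr = d_data;          // regular pitched pointer, not a cudaArray
copyParams.extent = extent;
copyParams.kind   = cudaMemcpyHostToDevice;
cudaMemcpy3D(&copyParams);[/codebox]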

Thanks for the trick. I think there is a mistake in the code above, though: there shouldn't be a period before y and z, i.e. j and k should use the computed blockIdxy and blockIdxz rather than blockIdx.y and blockIdx.z. The corrected lines are

unsigned int j = __umul24(blockIdxy, blockDim.y) + threadIdx.y;

unsigned int k = __umul24(blockIdxz, blockDim.z) + threadIdx.z;

instead of the original

unsigned int j = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;

unsigned int k = __umul24(blockIdx.z, blockDim.z) + threadIdx.z;
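For reference, the whole kernel with that fix applied would read:

[codebox]__global__ void callKernel(…, unsigned int blocksInY, float invBlocksInY)
{
	unsigned int blockIdxz = __float2uint_rd(blockIdx.y * invBlocksInY);
	unsigned int blockIdxy = blockIdx.y - __umul24(blockIdxz, blocksInY);

	unsigned int i = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
	unsigned int j = __umul24(blockIdxy,  blockDim.y) + threadIdx.y;
	unsigned int k = __umul24(blockIdxz,  blockDim.z) + threadIdx.z;

	// use i, j, k ...
}[/codebox]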

Dear all: I'd like to share two ideas on 3-D data index mapping.

Method 1: naive map. Create a 3-D thread box and map each thread to a data element; I use figure 1 and figure 2 to illustrate the idea.

figure 1

figure 2

The above idea can be implemented via one kernel, "naive_3D", and one C wrapper, "naive_3D_device"; put both into one .cu file.

The type "doublereal" can be float or double, depending on your application.

[codebox]static __global__ void naive_3D( doublereal *dst, doublereal *src,

		unsigned int n1, unsigned int n2, unsigned int n3,

		double one_over_n2, double one_over_n3,  unsigned int sizeOfData  ) 

{

double tmp1 ; 

unsigned int t1, xIndex, yIndex, zIndex, index_in, index_out, gridIndex ;	

// step 1: compute gridIndex in 1-D and 1-D data index “index_in”

gridIndex = blockIdx.y * gridDim.x + blockIdx.x ;

index_in = ( gridIndex * BLOCK_DIM_Y + threadIdx.y )*BLOCK_DIM_X + threadIdx.x ;

// step 2: extract 3-D data index via

// index_in = row-map(i,j,k) = (i-1)*n2*n3 + (j-1)*n3 + (k-1)

// where xIndex = i-1, yIndex = j-1, zIndex = k-1

if ( index_in < sizeOfData ){

	tmp1 = __uint2double_rn( index_in ) ;

	tmp1 = tmp1 * one_over_n3 ;

	t1 = __double2uint_rz( tmp1 ) ; 

	zIndex = index_in - n3*t1 ;	

	tmp1 = __uint2double_rn( t1 ) ;

	tmp1 = tmp1 * one_over_n2 ;

	xIndex = __double2uint_rz( tmp1 ) ; 

	yIndex = t1 - n2 * xIndex ;

	... your code

}// for each valid address

} [/codebox]

and its C-wrapper

[codebox]#define BLOCK_DIM_X 16

#define BLOCK_DIM_Y 16

void naive_3D_device( doublereal *Y, doublereal *X,

	unsigned int n1, unsigned int n2, unsigned int n3 ) 		

{

unsigned int k1, k2 ;

// step 1: compute # of threads per block

unsigned int  nthreads = BLOCK_DIM_X * BLOCK_DIM_Y ;

// step 2: compute # of blocks needed

unsigned int  nblocks =  ( n1*n2*n3 + nthreads -1 )/nthreads ;

// step 3: find grid’s configuration

double db_nblocks = (double)nblocks ;

k1 = (unsigned int) floor( sqrt(db_nblocks) ) ;

k2 = (unsigned int) ceil( db_nblocks/((double)k1)) ;

dim3 threads(BLOCK_DIM_X, BLOCK_DIM_Y, 1);

dim3 grid( k2, k1, 1 );

double eps = 1.E-12 ;

double one_over_n2 = (1.0 + eps)/((double)n2) ;

double one_over_n3 = (1.0 + eps)/((double)n3) ;	 

unsigned int sizeOfData = n1*n2*n3 ;  

naive_3D<<< grid, threads >>>( Y, X, n1, n2, n3, one_over_n2, one_over_n3, sizeOfData ) ;	 

} [/codebox]
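As a usage sketch (untested; the sizes are just an example, and doublereal is whatever you defined it as):

[codebox]unsigned int n1 = 100, n2 = 60, n3 = 36;   // deliberately not powers of 2
doublereal *d_src, *d_dst;
cudaMalloc((void**)&d_src, n1 * n2 * n3 * sizeof(doublereal));
cudaMalloc((void**)&d_dst, n1 * n2 * n3 * sizeof(doublereal));
// ... fill d_src ...
naive_3D_device(d_dst, d_src, n1, n2, n3);[/codebox]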

Observation: the naive map works for extracting the index (xIndex, yIndex, zIndex) of any 3-D data. However, it is not good for some applications, for example 3-D data transpose, say transposing A(1:n1, 1:n2, 1:n3) to B(1:n2, 1:n3, 1:n1) by B(j,k,i) <-- A(i,j,k).

Hence I provide another kind of map.

Method 2: typical map. It deals with the y-z slices one by one along the x-direction; I use figures 3, 4 and 5 to illustrate the idea.

figure 3

figure 4

figure 5

The above idea can be implemented via one kernel, "typical_3D", and one C wrapper, "typical_3D_device"; put both into one .cu file.

[codebox]static __global__ void typical_3D( doublereal *dst, doublereal *src,

		unsigned int n1, unsigned int n2, unsigned int n3,

		unsigned int Gy, unsigned int Gz, unsigned int k2, 

		float one_over_n2, float one_over_n3 ) 

{

float tmp1 ;

unsigned int t1, t2, xIndex, yIndex, zIndex, index_in ;

unsigned int x = blockIdx.x * BLOCK_DIM_X + threadIdx.x;

unsigned int y = blockIdx.y * BLOCK_DIM_Y + threadIdx.y;

if( (x < Gz) && (y < Gy) ) {

// step 1: transform grid index to 3D coarse grid index

// x = n3 * t1 + zIndex, y = n2 * t2 + yIndex

// where (zIndex, yIndex): index to y-z slice, (t1, t2): index to x direction

	tmp1 = __uint2float_rz( x ) ;

	tmp1 = tmp1 * one_over_n3 ;

	t1 = __float2uint_rz( tmp1 ) ; 

	zIndex = x - n3*t1 ;

	tmp1 = __uint2float_rz( y ) ;

	tmp1 = tmp1 * one_over_n2 ;

	t2 = __float2uint_rz( tmp1 ) ; 

	yIndex = y - n2*t2 ;

	xIndex = t2 * k2 + t1 ;

// (xIndex, yIndex, zIndex) = ( i-1, j-1, k-1), row-major(i,j,k) = (i-1)*n2*n3 + (j-1)*n3 + (k-1)

	if ( xIndex < n1 ){

		index_in = ( xIndex * n2 + yIndex )*n3 + zIndex ;	

		// your code ....

	}

}// (x < Gz) && (y < Gy) 	

}

[/codebox]

and its C-wrapper

[codebox]#define BLOCK_DIM_X 16

#define BLOCK_DIM_Y 16

void typical_3D_device( doublereal *Y, doublereal *X,

	unsigned int n1, unsigned int n2, unsigned int n3 ) 

{

unsigned int Gy, Gz, k1, k2 ;

double db_n1 = (double) n1 ;

/* in order to save resources, we want to find two integers k1, k2 such that

   k1 * k2 - n1 <= 1

*/

int max_k1 = (int) floor( sqrt(db_n1) ) ;

for ( k1 = max_k1 ; 1 <= k1 ; k1-- ){

	k2 = (unsigned int) ceil( db_n1/((double)k1)) ;

	if ( 1 >= (k1*k2 - n1) ){

		break ;

	}

}

Gy = n2*k1 ;  Gz = n3*k2 ;

dim3 threads(BLOCK_DIM_X, BLOCK_DIM_Y, 1);

dim3 grid( (Gz+BLOCK_DIM_X-1)/BLOCK_DIM_X, (Gy+BLOCK_DIM_Y-1)/BLOCK_DIM_Y, 1 );

float one_over_n2 = 1.0/((double)n2) ;

float one_over_n3 = 1.0/((double)n3) ;     

typical_3D<<< grid, threads >>>( Y, X, n1, n2, n3, Gy, Gz, k2, one_over_n2, one_over_n3 ) ;			 

}[/codebox]

Remark 1: in the typical method, I use "float" to do the modulo operation; this is good for small Gy, Gz. One can use "double" to do the modulo operation instead (see the naive method).

Remark 2: in the C wrapper of the typical method, I provide another approach to find the pair (k1, k2). This approach saves resources, but sometimes k2 may exceed 65535; one can instead use

k1 = (unsigned int) floor( sqrt(db_n1) ) ;

k2 = (unsigned int) ceil( db_n1/((double)k1)) ;

In fact, experimental results show no difference between the two approaches for the (k1, k2) computation.

Nice diagrams, thanks for posting!

@LSChien

Nice work! Which tool did you use to create these diagrams?

Bye,
amadea

Microsoft PowerPoint.

static __global__ void naive_3D( doublereal *dst, doublereal *src,

			unsigned int n1, unsigned int n2, unsigned int n3,

			double one_over_n2, double one_over_n3,  unsigned int sizeOfData  ) 

{

	double tmp1; 

	unsigned int t1, xIndex, yIndex, zIndex, index_in, index_out, gridIndex;	

// step 1: compute gridIndex in 1-D	and 1-D data index "index_in"

	gridIndex = blockIdx.y * gridDim.x + blockIdx.x;

	index_in = ( gridIndex * BLOCK_DIM_Y + threadIdx.y )*BLOCK_DIM_X + threadIdx.x;

// step 2: extract 3-D data index via 

//	 index_in = row-map(i,j,k) = (i-1)*n2*n3 + (j-1)*n3 + (k-1)

// where  xIndex = i-1, yIndex = j-1, zIndex = k-1	

	if ( index_in < sizeOfData ){

		tmp1 = __uint2double_rn( index_in );

		tmp1 = tmp1 * one_over_n3;

 		t1 = __double2uint_rz( tmp1 ); 

 		zIndex = index_in - n3*t1;	

		tmp1 = __uint2double_rn( t1 );

		tmp1 = tmp1 * one_over_n2;

 		xIndex = __double2uint_rz( tmp1 ); 

 		yIndex = t1 - n2 * xIndex;

 		... your code

 	}// for each valid address

}

I don't understand why you are extracting the index. If you have index_in, you can access the data, right?

And in the code, if I have an (i,j,k) index whose corresponding data element I wish to access, then i is blockIdx.x*blockDim.x + threadIdx.x, j is blockIdx.y*blockDim.y + threadIdx.y, and k is gridIndex, right? And you use index_in to access the element in linear memory,

or

i is threadIdx.x, j is threadIdx.y, and k is gridIndex?

Please correct me.

Thank you

Lung Sheng,

A couple of queries and a comment:

I’m hoping to use this indexing code to transpose 3D arrays, just as you say: A(i,j,k) -> B(j,k,i).

If I'm right, all I need to add to the kernel typical_3D is:

[codebox]index_in = ( xIndex * n2 + yIndex )*n3 + zIndex ;

index_out = (yIndex * n3 + zIndex )*n1 + xIndex ;

dst[index_out] = src[index_in];[/codebox]

… do you think this is right, or have I misunderstood the indexing structure?

In addition (forgive me, I’m still very new to this game!), I’ll be using input data of type cufftComplex - which is an interleaved float type (two floats for one element). I don’t think I actually need to change the indexing to compensate for this, but I’m concerned about block size - should I change the block dimensions to 8 instead of 16, to ensure that a block occupies the same amount of memory?

Lastly, thank you so much for the effort you've put into the diagrams above - they've really helped me learn a lot about how all of this works!

Kind regards

Tom Clark

** Edited - grammatical changes **

Sorry, I posted "method 2" to demonstrate a copy routine.

If you want to do A(i,j,k) --> B(j,k,i), then you must use x-z slices along the y-direction, just a slight modification of method 2.

Besides, I wrote a simple description of method 3 (this is my original idea, a variant of method 2) at

http://oz.nthu.edu.tw/~d947207/NVIDIA/copy…nspose_post.pdf

and source code in http://oz.nthu.edu.tw/~d947207/NVIDIA/copy…se3D_revised.cu

Second, when you deal with the "complex" type, the optimal bandwidth you can expect is 1/2 if you only access the real part or the imaginary part.

However, if you just want to do the transpose operation, then you can regard the complex array as a float array or a double array, and then you can achieve maximum bandwidth.
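In other words, each cufftComplex element (8 bytes) can be moved with a single 8-byte load/store by viewing the buffers as double. A plain copy kernel just to show the access pattern (untested sketch; the same per-element move goes inside the transpose kernel):

[codebox]#include <cufft.h>

__global__ void copyComplexAsDouble(cufftComplex *dst, const cufftComplex *src,
                                    unsigned int nElements)
{
	unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
	if (idx < nElements) {
		// one move carries both the real and the imaginary part
		((double*)dst)[idx] = ((const double*)src)[idx];
	}
}[/codebox]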

Lung Sheng,

Thank you so much - this is incredibly helpful!

Kind regards

Tom Clark

What is the function __float2uint_rd() used in the code above?

Many thanks

Never mind, I found the answer in the programming guide:

It rounds a float number down to an unsigned integer (section C.2 of the programming guide), e.g. __float2uint_rd(3.9f) returns 3.