# Indexing in 3D data

If I have 3D data, how do I get the x, y and z coordinates in the kernel?

Let's say I set up the 3D data like this:

```
cudaArray *darray = NULL;
const cudaExtent volumeSize = make_cudaExtent(sizeof(float2)*NX, NY, NZ);
cudaChannelFormatDesc desc = cudaCreateChannelDesc<float2>();
cudaMalloc3DArray(&darray, &desc, volumeSize);

cudaPitchedPtr dPtr;
cudaMalloc3D(&dPtr, volumeSize);
cudaMemset3D(dPtr, 0, volumeSize);

// Make a device-to-device copy
cudaMemcpy3DParms copyParams = {0};
copyParams.srcPtr   = dPtr;
copyParams.dstArray = darray;
copyParams.extent   = volumeSize;
copyParams.kind     = cudaMemcpyDeviceToDevice;
cudaMemcpy3D(&copyParams);
```

And given a kernel something like this, how do I get the x, y and z coordinates for the current index in the volume?

```
__global__ void mykernel(float *A, float *B)
{
    int x = ??
    int y = ??
    int z = ??

    A[idx] = B[idx] * x + 3 * y + 5 * z;
}
```

I want to do different things in the data depending on the x, y and z coordinates.

Do I have to use a 3D block grid if my data is 3D?

This is a good question!

Unfortunately on current hardware the grid is only 2D, which makes it tricky to calculate 3D indices. To make things worse, integer divide and modulo (which you would normally use to calculate your own 3D indices) are very expensive on current GPUs.

The best solution I’ve seen is this code (credit to Jonathan Cohen, hopefully he doesn’t mind me posting it here).

The setup code is:

```
__host__ void launchThreads(int nx, int ny, int nz) {
    int threadsInX = 16;
    int threadsInY = 4;
    int threadsInZ = 4;

    int blocksInX = (nx+threadsInX-1)/threadsInX;
    int blocksInY = (ny+threadsInY-1)/threadsInY;
    int blocksInZ = (nz+threadsInZ-1)/threadsInZ;

    dim3 Dg = dim3(blocksInX, blocksInY*blocksInZ);
    dim3 Db = dim3(threadsInX, threadsInY, threadsInZ);

    callKernel<<<Dg, Db>>>(…, blocksInY, 1.0f/(float)blocksInY);
}
```

And the kernel code looks like this:

```
__global__ void callKernel(…, unsigned int blocksInY, float invBlocksInY)
{
    unsigned int blockIdxz = __float2uint_rd(blockIdx.y * invBlocksInY);
    unsigned int blockIdxy = blockIdx.y - __umul24(blockIdxz, blocksInY);

    unsigned int i = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
    unsigned int j = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;
    unsigned int k = __umul24(blockIdx.z, blockDim.z) + threadIdx.z;

    // use i, j, k ...
}
```

We should really have a sample of 3D array processing in the SDK. Anybody want to contribute one? :)

I thought of another solution, but I don't know if it is better than the one you showed.

Before launching the kernel, I can set up a coordinate system:

```
uint3 mycoordinatesystem[NX*NY*NZ];

int idx = 0;
for (int x = 0; x < NX; x++)
{
    for (int y = 0; y < NY; y++)
    {
        for (int z = 0; z < NZ; z++)
        {
            mycoordinatesystem[idx].x = x;
            mycoordinatesystem[idx].y = y;
            mycoordinatesystem[idx].z = z;
            idx++;
        }
    }
}
```

Then in the kernel I know what each index means in terms of x, y and z. Or am I getting something wrong? The disadvantage is that it uses some memory, but no calculation beyond the normal idx calculation is needed in the kernel.

Hello,

I am currently having a very similar problem. However, I found another way to calculate the coordinates, which is only possible because I know that my x-dimension never exceeds 512 (the maximum number of threads in a block). Simply create a 1D block of threads, and use the grid for the other two dimensions:

```
dim3 Db(DIMX, 1, 1);
dim3 Dg(DIMZ, DIMY);
```

Then in the kernel, the x, y, and z coordinates are just:

```
int x = threadIdx.x;
int y = blockIdx.y;
int z = blockIdx.x;
```

Which is great, because it requires no calculations. But then I get into trouble as you mentioned: how do I correctly calculate the array index? If the dimensions are a power of 2, it works when I do this:

```
int idx = x + y*blockDim.x + z*blockDim.x*gridDim.y;
```

When I use any dimensions other than powers of 2, the indexing becomes wrong. I expect that this is because I need to use the pitch, or the xsize and ysize from the cudaPitchedPtr.

But how do I use them? Please help!

• Glimberg

Thanks, and how do I calculate the total index?

Hi all,

I relied on the sample given in the CUDA 2.1 manual to access elements in a cudaMalloc3D-allocated structure:

```
// host code
cudaPitchedPtr devPitchedPtr;
cudaExtent extent = make_cudaExtent(64, 64, 64);
cudaMalloc3D(&devPitchedPtr, extent);
myKernel<<<100, 512>>>(devPitchedPtr, extent);

// device code
__global__ void myKernel(cudaPitchedPtr devPitchedPtr,
                         cudaExtent extent)
{
    char* devPtr = (char*)devPitchedPtr.ptr;
    size_t pitch = devPitchedPtr.pitch;
    size_t slicePitch = pitch * extent.height;

    for (int z = 0; z < extent.depth; ++z) {
        char* slice = devPtr + z * slicePitch;
        for (int y = 0; y < extent.height; ++y) {
            float* row = (float*)(slice + y * pitch);
            for (int x = 0; x < extent.width; ++x) {
                float element = row[x];
            }
        }
    }
}
```

but I have trouble when the data dimensions differ from a power of 2, as mentioned in the previous posts (my approach is similar to Glimberg's). Does this mean the sample code is wrong? I am actually using CUDA 2.0; can that make a difference?

Nobody else encountered trouble with 3D indexing?

I'm facing the very same trouble as you guys.

Using Cohen's approach for workload assignment, what would be the best way to reserve and copy memory to the device? I'm using a 3D indexed (data[z][y]) data volume. I guess cudaMemcpy3D could be the best, but should I use a cudaArray (copyParams.dstArray) or a regular array pointer (copyParams.dstPtr)?

Thanks for the trick. I think there is a mistake in the code above, though: there shouldn't be a member-access period before y and z, since j and k must be computed from the derived blockIdxy and blockIdxz. It should read:

```
unsigned int j = __umul24(blockIdxy, blockDim.y) + threadIdx.y;
unsigned int k = __umul24(blockIdxz, blockDim.z) + threadIdx.z;
```

rather than:

```
unsigned int j = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;
unsigned int k = __umul24(blockIdx.z, blockDim.z) + threadIdx.z;
```

Dear all, I'd like to share my two ideas on 3-D data index maps.

Method 1: naive map. Create a 3-D thread box and map each thread to a data element; I use figures 1 and 2 to illustrate the idea.

figure 1

figure 2

The idea can be implemented via one kernel, “naive_3D”, and one C wrapper, “naive_3D_device”; put the two into one .cu file.

The type “doublereal” can be float or double, depending on your application.

```
static __global__ void naive_3D( doublereal *dst, doublereal *src,
        unsigned int n1, unsigned int n2, unsigned int n3,
        double one_over_n2, double one_over_n3, unsigned int sizeOfData )
{
    double tmp1;
    unsigned int t1, xIndex, yIndex, zIndex, index_in, index_out, gridIndex;

    // step 1: compute gridIndex in 1-D and 1-D data index "index_in"
    gridIndex = blockIdx.y * gridDim.x + blockIdx.x;
    index_in = ( gridIndex * BLOCK_DIM_Y + threadIdx.y )*BLOCK_DIM_X + threadIdx.x;

    // step 2: extract 3-D data index via
    //   index_in = row-map(i,j,k) = (i-1)*n2*n3 + (j-1)*n3 + (k-1)
    //   where xIndex = i-1, yIndex = j-1, zIndex = k-1
    if ( index_in < sizeOfData ){
        tmp1 = __uint2double_rn( index_in );
        tmp1 = tmp1 * one_over_n3;
        t1 = __double2uint_rz( tmp1 );
        zIndex = index_in - n3*t1;
        tmp1 = __uint2double_rn( t1 );
        tmp1 = tmp1 * one_over_n2;
        xIndex = __double2uint_rz( tmp1 );
        yIndex = t1 - n2 * xIndex;
    }
}
```

and its C-wrapper

```
#define BLOCK_DIM_X 16
#define BLOCK_DIM_Y 16

void naive_3D_device( doublereal *Y, doublereal *X,
        unsigned int n1, unsigned int n2, unsigned int n3 )
{
    unsigned int k1, k2;

    // step 1: compute # of threads per block
    unsigned int nthreads = BLOCK_DIM_X * BLOCK_DIM_Y;

    // step 2: compute # of blocks needed
    unsigned int nblocks = ( n1*n2*n3 + nthreads - 1 )/nthreads;

    // step 3: find grid's configuration
    double db_nblocks = (double)nblocks;
    k1 = (unsigned int) floor( sqrt(db_nblocks) );
    k2 = (unsigned int) ceil( db_nblocks/((double)k1) );

    dim3 threads(BLOCK_DIM_X, BLOCK_DIM_Y, 1);
    dim3 grid( k2, k1, 1 );

    double eps = 1.E-12;
    double one_over_n2 = (1.0 + eps)/((double)n2);
    double one_over_n3 = (1.0 + eps)/((double)n3);
    unsigned int sizeOfData = n1*n2*n3;

    naive_3D<<< grid, threads >>>( Y, X, n1, n2, n3, one_over_n2, one_over_n3, sizeOfData );
}
```

Observation: the naive map works for extracting the index (xIndex, yIndex, zIndex) of any 3-D data; however, it is not good for some applications, for example 3-D data transpose, say A(1:n1, 1:n2, 1:n3) transposed to B(1:n2, 1:n3, 1:n1) by B(j,k,i) <— A(i,j,k). Hence I provide another kind of map.

Method 2: typical map. It deals with y-z slices, slice by slice along the x-direction; I use figures 3, 4 and 5 to illustrate the idea.

figure 3

figure 4

figure 5

The idea can be implemented via one kernel, “typical_3D”, and one C wrapper, “typical_3D_device”; put the two into one .cu file.

```
static __global__ void typical_3D( doublereal *dst, doublereal *src,
        unsigned int n1, unsigned int n2, unsigned int n3,
        unsigned int Gy, unsigned int Gz, unsigned int k2,
        float one_over_n2, float one_over_n3 )
{
    float tmp1;
    unsigned int t1, t2, xIndex, yIndex, zIndex, index_in;
    unsigned int x = blockIdx.x * BLOCK_DIM_X + threadIdx.x;
    unsigned int y = blockIdx.y * BLOCK_DIM_Y + threadIdx.y;

    if( (x < Gz) && (y < Gy) ) {
        // step 1: transform grid index to 3D coarse grid index
        //   x = n3 * t1 + zIndex, y = n2 * t2 + yIndex
        //   where (zIndex, yIndex): index to y-z slice, (t1, t2): index to x direction
        tmp1 = __uint2float_rz( x );
        tmp1 = tmp1 * one_over_n3;
        t1 = __float2uint_rz( tmp1 );
        zIndex = x - n3*t1;

        tmp1 = __uint2float_rz( y );
        tmp1 = tmp1 * one_over_n2;
        t2 = __float2uint_rz( tmp1 );
        yIndex = y - n2*t2;

        xIndex = t2 * k2 + t1;

        // (xIndex, yIndex, zIndex) = (i-1, j-1, k-1), row-major(i,j,k) = (i-1)*n2*n3 + (j-1)*n3 + (k-1)
        if ( xIndex < n1 ){
            index_in = ( xIndex * n2 + yIndex )*n3 + zIndex;
        }
    } // (x < Gz) && (y < Gy)
}
```

and its C-wrapper

```
#define BLOCK_DIM_X 16
#define BLOCK_DIM_Y 16

void typical_3D_device( doublereal *Y, doublereal *X,
        unsigned int n1, unsigned int n2, unsigned int n3 )
{
    unsigned int Gy, Gz, k1, k2;
    double db_n1 = (double) n1;

    /* in order to save resources, we want to find two integers k1, k2
       such that k1 * k2 - n1 <= 1 */
    int max_k1 = (int) floor( sqrt(db_n1) );
    for ( k1 = max_k1; 1 <= k1; k1-- ){
        k2 = (unsigned int) ceil( db_n1/((double)k1) );
        if ( 1 >= (k1*k2 - n1) ){
            break;
        }
    }

    Gy = n2*k1;  Gz = n3*k2;

    dim3 threads(BLOCK_DIM_X, BLOCK_DIM_Y, 1);
    dim3 grid( (Gz+BLOCK_DIM_X-1)/BLOCK_DIM_X, (Gy+BLOCK_DIM_Y-1)/BLOCK_DIM_Y, 1 );

    float one_over_n2 = 1.0f/((float)n2);
    float one_over_n3 = 1.0f/((float)n3);

    typical_3D<<< grid, threads >>>( Y, X, n1, n2, n3, Gy, Gz, k2, one_over_n2, one_over_n3 );
}
```

Remark 1: in the typical method I use “float” for the modulo operation, which is good for small Gy, Gz; one can use “double” instead (see the naive method).

Remark 2: in the C wrapper of the typical method I provide another approach to find the pair (k1, k2). This approach saves resources, but sometimes k2 may exceed 65535; in that case one can use

```
k1 = (unsigned int) floor( sqrt(db_n1) );
k2 = (unsigned int) ceil( db_n1/((double)k1) );
```

In fact, experimental results show no difference between the two approaches for computing (k1, k2).

Nice diagrams, thanks for posting!

@LSChien

Nice work! Which tool did you use to create these diagrams?

Bye,

Microsoft PowerPoint.

```
static __global__ void naive_3D( doublereal *dst, doublereal *src,
        unsigned int n1, unsigned int n2, unsigned int n3,
        double one_over_n2, double one_over_n3, unsigned int sizeOfData )
{
    double tmp1;
    unsigned int t1, xIndex, yIndex, zIndex, index_in, index_out, gridIndex;

    // step 1: compute gridIndex in 1-D and 1-D data index "index_in"
    gridIndex = blockIdx.y * gridDim.x + blockIdx.x;
    index_in = ( gridIndex * BLOCK_DIM_Y + threadIdx.y )*BLOCK_DIM_X + threadIdx.x;

    // step 2: extract 3-D data index via
    //   index_in = row-map(i,j,k) = (i-1)*n2*n3 + (j-1)*n3 + (k-1)
    //   where xIndex = i-1, yIndex = j-1, zIndex = k-1
    if ( index_in < sizeOfData ){
        tmp1 = __uint2double_rn( index_in );
        tmp1 = tmp1 * one_over_n3;
        t1 = __double2uint_rz( tmp1 );
        zIndex = index_in - n3*t1;
        tmp1 = __uint2double_rn( t1 );
        tmp1 = tmp1 * one_over_n2;
        xIndex = __double2uint_rz( tmp1 );
        yIndex = t1 - n2 * xIndex;
    }
}
```

I don't understand why you are extracting the index. If you have index_in, you can access the data directly, right?

And in the code, if I have an (i,j,k) index whose data element I wish to access, then i is blockIdx.x*blockDim.x + threadIdx.x, j is blockIdx.y*blockDim.y + threadIdx.y, and k is gridIndex, right? And you use index_in to access the element in linear memory.

Or is i threadIdx.x, j threadIdx.y, and k gridIndex?

Thank you

Lung Sheng,

A couple of queries and a comment:

I’m hoping to use this indexing code to transpose 3D arrays, just as you say: A(i,j,k) → B(j,k,i).

If I'm right, all I need to add to the kernel typical_3D is:

```
index_in  = ( xIndex * n2 + yIndex )*n3 + zIndex;
index_out = ( yIndex * n3 + zIndex )*n1 + xIndex;

dst[index_out] = src[index_in];
```

… do you think this is right, or have I misunderstood the indexing structure?

In addition (forgive me, I'm still very new to this game!), I'll be using input data of type cufftComplex, which is an interleaved float type (two floats per element). I don't think I actually need to change the indexing to compensate for this, but I'm concerned about block size: should I change the block dimensions to 8 instead of 16, to ensure that a block occupies the same amount of memory?

Lastly, thank you so much for the effort you've put into the diagrams above; they've really helped me learn a lot about how all of this works!

Kind regards

Tom Clark


Sorry, I posted “method 2” to demonstrate a copy routine.

If you want to do A(i,j,k) → B(j,k,i), then you must use x-z slices along the y-direction, just a slight modification of method 2.

Besides, I wrote a simple description of method 3 (my original idea, a variant of method 2) at

http://oz.nthu.edu.tw/~d947207/NVIDIA/copy…nspose_post.pdf

and the source code at http://oz.nthu.edu.tw/~d947207/NVIDIA/copy…se3D_revised.cu

Second, when you deal with a “complex” type, the optimal bandwidth you can expect is 1/2 if you only access the real part or the imaginary part.

However, if you just want to do a transpose, you can regard the complex array as a float array or a double array, and then you can achieve maximum bandwidth.

Lung Sheng,

Thank you so much - this is incredibly helpful!

Kind regards

Tom Clark

What is the function __float2uint_rd() used in the code above?

Many thanks

Never mind, I found the answer in the programming guide:

It rounds a float down to an unsigned integer (section C.2 of the programming guide).