Dear all: I share my two ideas on 3-D data index map

method 1: naive map, create a 3-D thread box and map each thread to a data element, I use

figure 1 and figure 2 to illustrate the idea

figure 1

figure 2

above idea can be implemented via one kernel “naive_3D” and one C-wrapper “naive_3D_device” and

put these two codes into one .cu file.

type “doublereal” can be float or double, depends on your application

[codebox]static **global** void naive_3D( doublereal *dst, doublereal *src,

```
unsigned int n1, unsigned int n2, unsigned int n3,
double one_over_n2, double one_over_n3, unsigned int sizeOfData )
```

{

```
double tmp1 ;
unsigned int t1, xIndex, yIndex, zIndex, index_in, index_out, gridIndex ;
```

// step 1: compute gridIndex in 1-D and 1-D data index “index_in”

```
gridIndex = blockIdx.y * gridDim.x + blockIdx.x ;
index_in = ( gridIndex * BLOCK_DIM_Y + threadIdx.y )*BLOCK_DIM_X + threadIdx.x ;
```

// step 2: extract 3-D data index via

// index_in = row-map(i,j,k) = (i-1)*n2*n3 + (j-1)*n3 + (k-1)

// where xIndex = i-1, yIndex = j-1, zIndex = k-1

```
if ( index_in < sizeOfData ){
tmp1 = __uint2double_rn( index_in ) ;
tmp1 = tmp1 * one_over_n3 ;
t1 = __double2uint_rz( tmp1 ) ;
zIndex = index_in - n3*t1 ;
tmp1 = __uint2double_rn( t1 ) ;
tmp1 = tmp1 * one_over_n2 ;
xIndex = __double2uint_rz( tmp1 ) ;
yIndex = t1 - n2 * xIndex ;
... your code
}// for each valid address
```

} [/codebox]

and its C-wrapper

[codebox]#define BLOCK_DIM_X 16

#define BLOCK_DIM_Y 16

void naive_3D_device( doublereal *Y, doublereal *X,

```
unsigned int n1, unsigned int n2, unsigned int n3 )
```

{

```
unsigned int k1, k2 ;
```

// step 1: compute # of threads per block

```
unsigned int nthreads = BLOCK_DIM_X * BLOCK_DIM_Y ;
```

// step 2: compute # of blocks needed

```
unsigned int nblocks = ( n1*n2*n3 + nthreads -1 )/nthreads ;
```

// step 3: find grid’s configuration

double db_nblocks = (double)nblocks ;

k1 = (unsigned int) floor( sqrt(db_nblocks) ) ;

k2 = (unsigned int) ceil( db_nblocks/((double)k1)) ;

```
dim3 threads(BLOCK_DIM_X, BLOCK_DIM_Y, 1);
dim3 grid( k2, k1, 1 );
double eps = 1.E-12 ;
double one_over_n2 = (1.0 + eps)/((double)n2) ;
double one_over_n3 = (1.0 + eps)/((double)n3) ;
unsigned int sizeOfData = n1*n2*n3 ;
naive_3D<<< grid, threads >>>( Y, X, n1, n2, n3, one_over_n2, one_over_n3, sizeOfData ) ;
```

} [/codebox]

observation: naive map works for extraction of index (xIndex, yIndex, zIndex) of any 3-D data,

however it is not good for some application, for example, 3-D data transpose,

say A(1:n1, 1:n2, 1:n3) transpose to B(1:n2, 1:n3, 1:n1) by B(j,k,i) <— A(i,j,k)

hence I provide another kind of map

method 2: typical map, it deal with y-z slice by slice along x-direction, I use figure 3, 4, 5 to illustrate the idea

figure 3

figure 4

figure 5

above idea can be implemented via one kernel “typical_3D” and one C-wrapper “typical_3D_device” and

put these two codes into one .cu file.

[codebox]static **global** void typical_3D( doublereal *dst, doublereal *src,

```
unsigned int n1, unsigned int n2, unsigned int n3,
unsigned int Gy, unsigned int Gz, unsigned int k2,
float one_over_n2, float one_over_n3 )
```

{

```
float tmp1 ;
unsigned int t1, t2, xIndex, yIndex, zIndex, index_in ;
unsigned int x = blockIdx.x * BLOCK_DIM_X + threadIdx.x;
unsigned int y = blockIdx.y * BLOCK_DIM_Y + threadIdx.y;
if( (x < Gz) && (y < Gy) ) {
```

// step 1: transform grid index to 3D corase grid index

// x = n3 * t1 + zIndex, y = n2 * t2 + yIndex

// where (zIndex, yIndex): index to y-z slice, (t1, t2): index to x direction

```
tmp1 = __uint2float_rz( x ) ;
tmp1 = tmp1 * one_over_n3 ;
t1 = __float2uint_rz( tmp1 ) ;
zIndex = x - n3*t1 ;
tmp1 = __uint2float_rz( y ) ;
tmp1 = tmp1 * one_over_n2 ;
t2 = __float2uint_rz( tmp1 ) ;
yIndex = y - n2*t2 ;
xIndex = t2 * k2 + t1 ;
```

// (xIndex, yIndex, zIndex) = ( i-1, j-1, k-1), row-major(i,j,k) = (i-1)*n2*n3 + (j-1)*n3 + (k-1)

```
if ( xIndex < n1 ){
index_in = ( xIndex * n2 + yIndex )*n3 + zIndex ;
// your code ....
}
}// (x < Gz) && (y < Gy)
```

}

[/codebox]

and its C-wrapper

[codebox]#define BLOCK_DIM_X 16

#define BLOCK_DIM_Y 16

void typical_3D_device( doublereal *Y, doublereal *X,

```
unsigned int n1, unsigned int n2, unsigned int n3 )
```

{

```
unsigned int Gy, Gz, k1, k2 ;
double db_n1 = (double) n1 ;
```

/* in order to save resource, we want to find two integers k1, k2 such that

*/

```
int max_k1 = (int) floor( sqrt(db_n1) ) ;
for ( k1 = max_k1 ; 1 <= k1 ; k1-- ){
k2 = (unsigned int) ceil( db_n1/((double)k1)) ;
if ( 1 >= (k1*k2 - n1) ){
break ;
}
}
Gy = n2*k1 ; Gz = n3*k2 ;
dim3 threads(BLOCK_DIM_X, BLOCK_DIM_Y, 1);
dim3 grid( (Gz+BLOCK_DIM_X-1)/BLOCK_DIM_X, (Gy+BLOCK_DIM_Y-1)/BLOCK_DIM_Y, 1 );
float one_over_n2 = 1.0/((double)n2) ;
float one_over_n3 = 1.0/((double)n3) ;
typical_3D<<< grid, threads >>>( Y, X, n1, n2, n3, Gy, Gz, k2, one_over_n2, one_over_n3 ) ;
```

}[/codebox]

remark 1: in typical method, I use “float” to do modulo operation, this is good for small Gy, Gz,

one can use “double” to do modulo operation (see naive method)

remark 2: in C-wrapper of typical method, I provide another approach to find pair (k1, k2),

this approach save resource but sometimes k2 may exceed 65535, one can use

k1 = (unsignedint) floor( sqrt(db_n1) ) ;

k2 = (unsigned int) ceil( db_n1/((double)k1)) ;

in fact, experimental result shows no difference between two approaches for (k1,k2) computation.