Dear all: I share my two ideas on 3-D data index map
method 1: naive map, create a 3-D thread box and map each thread to a data element, I use
figure 1 and figure 2 to illustrate the idea
figure 1
figure 2
above idea can be implemented via one kernel “naive_3D” and one C-wrapper “naive_3D_device” and
put these two codes into one .cu file.
type “doublereal” can be float or double, depends on your application
[codebox]static global void naive_3D( doublereal *dst, doublereal *src,
unsigned int n1, unsigned int n2, unsigned int n3,
double one_over_n2, double one_over_n3, unsigned int sizeOfData )
{
double tmp1 ;
unsigned int t1, xIndex, yIndex, zIndex, index_in, index_out, gridIndex ;
// step 1: compute gridIndex in 1-D and 1-D data index “index_in”
gridIndex = blockIdx.y * gridDim.x + blockIdx.x ;
index_in = ( gridIndex * BLOCK_DIM_Y + threadIdx.y )*BLOCK_DIM_X + threadIdx.x ;
// step 2: extract 3-D data index via
// index_in = row-map(i,j,k) = (i-1)n2n3 + (j-1)*n3 + (k-1)
// where xIndex = i-1, yIndex = j-1, zIndex = k-1
if ( index_in < sizeOfData ){
tmp1 = __uint2double_rn( index_in ) ;
tmp1 = tmp1 * one_over_n3 ;
t1 = __double2uint_rz( tmp1 ) ;
zIndex = index_in - n3*t1 ;
tmp1 = __uint2double_rn( t1 ) ;
tmp1 = tmp1 * one_over_n2 ;
xIndex = __double2uint_rz( tmp1 ) ;
yIndex = t1 - n2 * xIndex ;
... your code
}// for each valid address
} [/codebox]
and its C-wrapper
[codebox]#define BLOCK_DIM_X 16
#define BLOCK_DIM_Y 16
void naive_3D_device( doublereal *Y, doublereal *X,
unsigned int n1, unsigned int n2, unsigned int n3 )
{
unsigned int k1, k2 ;
// step 1: compute # of threads per block
unsigned int nthreads = BLOCK_DIM_X * BLOCK_DIM_Y ;
// step 2: compute # of blocks needed
unsigned int nblocks = ( n1*n2*n3 + nthreads -1 )/nthreads ;
// step 3: find grid’s configuration
double db_nblocks = (double)nblocks ;
k1 = (unsigned int) floor( sqrt(db_nblocks) ) ;
k2 = (unsigned int) ceil( db_nblocks/((double)k1)) ;
dim3 threads(BLOCK_DIM_X, BLOCK_DIM_Y, 1);
dim3 grid( k2, k1, 1 );
double eps = 1.E-12 ;
double one_over_n2 = (1.0 + eps)/((double)n2) ;
double one_over_n3 = (1.0 + eps)/((double)n3) ;
unsigned int sizeOfData = n1*n2*n3 ;
naive_3D<<< grid, threads >>>( Y, X, n1, n2, n3, one_over_n2, one_over_n3, sizeOfData ) ;
} [/codebox]
observation: naive map works for extraction of index (xIndex, yIndex, zIndex) of any 3-D data,
however it is not good for some application, for example, 3-D data transpose,
say A(1:n1, 1:n2, 1:n3) transpose to B(1:n2, 1:n3, 1:n1) by B(j,k,i) <— A(i,j,k)
hence I provide another kind of map
method 2: typical map, it deal with y-z slice by slice along x-direction, I use figure 3, 4, 5 to illustrate the idea
figure 3
figure 4
figure 5
above idea can be implemented via one kernel “typical_3D” and one C-wrapper “typical_3D_device” and
put these two codes into one .cu file.
[codebox]static global void typical_3D( doublereal *dst, doublereal *src,
unsigned int n1, unsigned int n2, unsigned int n3,
unsigned int Gy, unsigned int Gz, unsigned int k2,
float one_over_n2, float one_over_n3 )
{
float tmp1 ;
unsigned int t1, t2, xIndex, yIndex, zIndex, index_in ;
unsigned int x = blockIdx.x * BLOCK_DIM_X + threadIdx.x;
unsigned int y = blockIdx.y * BLOCK_DIM_Y + threadIdx.y;
if( (x < Gz) && (y < Gy) ) {
// step 1: transform grid index to 3D corase grid index
// x = n3 * t1 + zIndex, y = n2 * t2 + yIndex
// where (zIndex, yIndex): index to y-z slice, (t1, t2): index to x direction
tmp1 = __uint2float_rz( x ) ;
tmp1 = tmp1 * one_over_n3 ;
t1 = __float2uint_rz( tmp1 ) ;
zIndex = x - n3*t1 ;
tmp1 = __uint2float_rz( y ) ;
tmp1 = tmp1 * one_over_n2 ;
t2 = __float2uint_rz( tmp1 ) ;
yIndex = y - n2*t2 ;
xIndex = t2 * k2 + t1 ;
// (xIndex, yIndex, zIndex) = ( i-1, j-1, k-1), row-major(i,j,k) = (i-1)n2n3 + (j-1)*n3 + (k-1)
if ( xIndex < n1 ){
index_in = ( xIndex * n2 + yIndex )*n3 + zIndex ;
// your code ....
}
}// (x < Gz) && (y < Gy)
}
[/codebox]
and its C-wrapper
[codebox]#define BLOCK_DIM_X 16
#define BLOCK_DIM_Y 16
void typical_3D_device( doublereal *Y, doublereal *X,
unsigned int n1, unsigned int n2, unsigned int n3 )
{
unsigned int Gy, Gz, k1, k2 ;
double db_n1 = (double) n1 ;
/* in order to save resource, we want to find two integers k1, k2 such that
*/
int max_k1 = (int) floor( sqrt(db_n1) ) ;
for ( k1 = max_k1 ; 1 <= k1 ; k1-- ){
k2 = (unsigned int) ceil( db_n1/((double)k1)) ;
if ( 1 >= (k1*k2 - n1) ){
break ;
}
}
Gy = n2*k1 ; Gz = n3*k2 ;
dim3 threads(BLOCK_DIM_X, BLOCK_DIM_Y, 1);
dim3 grid( (Gz+BLOCK_DIM_X-1)/BLOCK_DIM_X, (Gy+BLOCK_DIM_Y-1)/BLOCK_DIM_Y, 1 );
float one_over_n2 = 1.0/((double)n2) ;
float one_over_n3 = 1.0/((double)n3) ;
typical_3D<<< grid, threads >>>( Y, X, n1, n2, n3, Gy, Gz, k2, one_over_n2, one_over_n3 ) ;
}[/codebox]
remark 1: in typical method, I use “float” to do modulo operation, this is good for small Gy, Gz,
one can use “double” to do modulo operation (see naive method)
remark 2: in C-wrapper of typical method, I provide another approach to find pair (k1, k2),
this approach save resource but sometimes k2 may exceed 65535, one can use
k1 = (unsignedint) floor( sqrt(db_n1) ) ;
k2 = (unsigned int) ceil( db_n1/((double)k1)) ;
in fact, experimental result shows no difference between two approaches for (k1,k2) computation.