Dear all,
I wrote a 3D transpose code with the following functionality:
-
XYZ → YXZ
-
XYZ → ZYX
The cuboid is of size (nx,ny,nz), and is stored in memory such that the cell with an index (i,j,k) has linear
address = (k*ny + j)*nx + i
After the transpose 1), the transposed cuboid has the dimensions (ny, nx, nz) and the transposed element corresponding to (i,j,k) has
address = (k*nx + i)*ny + j in the transposed cuboid
After the transpose 2), the transposed cuboid has the dimensions (nz, ny, nx) and the transposed element corresponding to (i,j,k) has
address = (i*ny + j)*nz + k in the transposed cuboid
The piece of code which carries out these transpose operations is provided below. What bothers me is the resulted bandwidth. I used CUDA_PROFILE=1 to profile the code.
For transpose a) I have the following output for nx=ny=nz=256
method=[ _globfunc__Z13dev_transposeIL5TMODE0EEvP6float4PfS2_S3 ] gputime=[ 3760.928 ] cputime=[ 3.000 ] occupancy=[ 1.000 ]
Which results to 256^342/3761e-6/1e9 ~ 35 GB/s
However, for the transpose 2) the output is
method=[ _globfunc__Z13dev_transposeIL5TMODE1EEvP6float4PfS2_S3 ] gputime=[ 25821.152 ] cputime=[ 2.000 ] occupancy=[ 1.000 ]
Which results to 5 GB/s !!!
I experimented, and changed the following lines in
case XZ:
…
n1 = ny*nx
n2 = ny*nz
to
case XZ:
…
n1 = nx
n2 = nz
The output is clearly garbage, but at least I measured the bandwidth. The result is 45 GB/s. Here is the output of the profiler
method=[ _globfunc__Z13dev_transposeIL5TMODE1EEvP6float4PfS2_S3 ] gputime=[ 2939.040 ] cputime=[ 2.000 ] occupancy=[ 1.000 ]
???
It appears that I am either missing something important or it is new undocumented feature/bug in CUDA. If read/writes are coalesced with n1 = n2 = value, I assume they are coalseced for n1=n2=value*value (my tests are done for nx=ny=nz=256).
Have anybody of you experienced such a strange behaviour or perhaps one may comment on the code if something has to be changed.
Thanks!
Evghenii
Here is the code:
[codebox]
#ifndef DEV_TRANSPOSE_CU
#define DEV_TRANSPOSE_CU
template
device float transpose(float x) {
shared float shdata[DIM][DIM + 1];
shdata[threadIdx.x][threadIdx.y] = x;
__syncthreads();
x = shdata[threadIdx.y][threadIdx.x];
__syncthreads();
return x;
}
// MODE
// 0 - transpose XY: threads(DIM, DIM, 1), grid(ny/DIM, nz, 1)
// 1 - transpose XZ: threads(DIM, DIM, 1), grid(nz/DIM, ny, 1)
template
global void dev_transpose(float *data_in,
float *data_out,
int nx, int ny, int nz) {
const int DIM2 = 4;
const int DIM = 1 << DIM2;
int src, dst;
int i1, i2;
int n1, n2;
switch(MODE) {
case 0:
i1 = i2 = blockIdx.y*nx*ny;
n1 = nx;
n2 = ny;
break;
case 1:
i1 = blockIdx.y*nx;
i2 = blockIdx.y*nz;
n1 = ny*nx;
n2 = ny*nz;
break;
}
for (int bx = 0; bx < (nx >> DIM2); bx++) {
src = i1 +
((blockIdx.x << DIM2) + threadIdx.y) * n1 +
((bx << DIM2) + threadIdx.x);
dst = i2 +
((bx << DIM2) + threadIdx.y) * n2 +
((blockIdx.x << DIM2) + threadIdx.x);
float data = data_in[src];
data = transpose<DIM>(data);
data_out[dst] = data;
}
}
[/codebox]