3D transpose, weird behaviour

Dear all,

I wrote a 3D transpose code with the following functionality:

  1. XYZ → YXZ

  2. XYZ → ZYX

The cuboid is of size (nx,ny,nz), and is stored in memory such that the cell with an index (i,j,k) has linear

address = (k*ny + j)*nx + i

After the transpose 1), the transposed cuboid has the dimensions (ny, nx, nz) and the transposed element corresponding to (i,j,k) has

address = (k*nx + i)*ny + j in the transposed cuboid

After the transpose 2), the transposed cuboid has the dimensions (nz, ny, nx) and the transposed element corresponding to (i,j,k) has

address = (i*ny + j)*nz + k in the transposed cuboid
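For reference, the two mappings written as plain CPU loops (just a sketch for clarity; the function names are made up and this is not the GPU code below):

[codebox]
// CPU reference for the two index mappings (sketch; names are mine)
void transpose_xy_ref(const float *in, float *out, int nx, int ny, int nz) {
    for (int k = 0; k < nz; k++)
        for (int j = 0; j < ny; j++)
            for (int i = 0; i < nx; i++)
                out[(k*nx + i)*ny + j] = in[(k*ny + j)*nx + i];   // XYZ -> YXZ
}

void transpose_zyx_ref(const float *in, float *out, int nx, int ny, int nz) {
    for (int k = 0; k < nz; k++)
        for (int j = 0; j < ny; j++)
            for (int i = 0; i < nx; i++)
                out[(i*ny + j)*nz + k] = in[(k*ny + j)*nx + i];   // XYZ -> ZYX
}
[/codebox]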

The piece of code which carries out these transpose operations is provided below. What bothers me is the resulting bandwidth. I used CUDA_PROFILE=1 to profile the code.

For transpose 1) I have the following output for nx=ny=nz=256:

method=[ _globfunc__Z13dev_transposeIL5TMODE0EEvP6float4PfS2_S3 ] gputime=[ 3760.928 ] cputime=[ 3.000 ] occupancy=[ 1.000 ]

This works out to 256^3 * 4 * 2 / 3761e-6 / 1e9 ≈ 35 GB/s (4 bytes per float, one read plus one write).

However, for the transpose 2) the output is

method=[ _globfunc__Z13dev_transposeIL5TMODE1EEvP6float4PfS2_S3 ] gputime=[ 25821.152 ] cputime=[ 2.000 ] occupancy=[ 1.000 ]

This works out to 256^3 * 4 * 2 / 25821e-6 / 1e9 ≈ 5 GB/s !!!

As an experiment, I changed the following lines

case XZ:

n1 = ny*nx

n2 = ny*nz

to

case XZ:

n1 = nx

n2 = nz

The output is clearly garbage, but at least I could measure the bandwidth. The result is 45 GB/s. Here is the output of the profiler:

method=[ _globfunc__Z13dev_transposeIL5TMODE1EEvP6float4PfS2_S3 ] gputime=[ 2939.040 ] cputime=[ 2.000 ] occupancy=[ 1.000 ]

???

It appears that I am either missing something important or this is a new undocumented feature/bug in CUDA. If reads/writes are coalesced with n1 = n2 = value, I assume they are also coalesced for n1 = n2 = value*value (my tests are done for nx=ny=nz=256).

Has anybody experienced such strange behaviour, or could someone comment on the code if something needs to be changed?

Thanks!

Evghenii

Here is the code:

[codebox]

#ifndef DEV_TRANSPOSE_CU

#define DEV_TRANSPOSE_CU

template<int DIM>

__device__ float transpose(float x) {

__shared__ float shdata[DIM][DIM + 1];   // +1 padding to avoid shared-memory bank conflicts

shdata[threadIdx.x][threadIdx.y] = x;

__syncthreads();

x = shdata[threadIdx.y][threadIdx.x];

__syncthreads();

return x;

}

// MODE

// 0 - transpose XY: threads(DIM, DIM, 1), grid(ny/DIM, nz, 1)

// 1 - transpose XZ: threads(DIM, DIM, 1), grid(nz/DIM, ny, 1)

template<int MODE>

__global__ void dev_transpose(float *data_in,

		      float *data_out,

		      int nx, int ny, int nz) {

const int DIM2 = 4;          // log2 of the tile edge

const int DIM = 1 << DIM2;   // tile edge = 16

int src, dst;

int i1, i2;

int n1, n2;

switch(MODE) {

case 0:

i1 = i2 = blockIdx.y*nx*ny;   // offset of the z-slice handled by this block (same in source and destination)

n1 = nx;   // source stride along y

n2 = ny;   // destination stride along x

break;

case 1:

i1 = blockIdx.y*nx;   // offset of this y-row in the source

i2 = blockIdx.y*nz;   // offset of this y-row in the destination

n1 = ny*nx;   // source stride along z

n2 = ny*nz;   // destination stride along x

break;

}

for (int bx = 0; bx < (nx >> DIM2); bx++) {   // march along x, one DIM-wide tile per iteration

src = i1 +

  ((blockIdx.x << DIM2) + threadIdx.y) * n1 + 

  ((bx         << DIM2) + threadIdx.x);

dst = i2 +

  ((bx         << DIM2) + threadIdx.y) * n2 +

  ((blockIdx.x << DIM2) + threadIdx.x);

float data = data_in[src];      // read, contiguous in threadIdx.x

data = transpose<DIM>(data);    // swap the roles of threadIdx.x and threadIdx.y via shared memory

data_out[dst] = data;           // write, contiguous in threadIdx.x

}

}

#endif

[/codebox]
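For completeness, the launch configuration implied by the comments in the kernel would look roughly like this (a sketch; d_in and d_out are placeholder device pointers, not names from the code above):

[codebox]
const int DIM = 16;                          // must match the tile size hard-coded in the kernel
dim3 threads(DIM, DIM, 1);

// 1) XY transpose: blockIdx.y selects the z-slice
dim3 grid_xy(ny / DIM, nz, 1);
dev_transpose<0><<<grid_xy, threads>>>(d_in, d_out, nx, ny, nz);

// 2) XZ transpose: blockIdx.y selects the y-row
dim3 grid_xz(nz / DIM, ny, 1);
dev_transpose<1><<<grid_xz, threads>>>(d_in, d_out, nx, ny, nz);
[/codebox]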

What GPU are you testing this on?

The GPU I use is a GTX 280, with CUDA v2.0.

Hi, did you find a workaround to your problem?

I believe that the reason why your bandwidth is so low for the slowest-fastest memory direction transpose (versus the slowest-second-slowest one) is that your GPU uses virtual memory, and this order of traversal causes TLB cache thrashing.
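For concreteness, with nx = ny = nz = 256 and 4-byte floats: in the XZ kernel the row-to-row stride is n1 = ny*nx = 65536 elements (256 KB) on the read side and n2 = ny*nz = 65536 elements (256 KB) on the write side, so every row of a 16x16 tile lands 256 KB away from the previous one; in the XY kernel the corresponding strides are only nx = ny = 256 elements (1 KB).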

Kind regards,

Danny

Hi Danny,

I completely rearranged my data structure. I used z-ordering of my 3D matrix by splitting it into blocks of 16x16x16 elements, to have some data locality. The blocks are linear in memory, such that block (bx, by, bz) has the linear block index bz*nbx*nby + by*nbx + bx, where nbx = nx/16, nby = ny/16. So element (i,j,k) is given by:

address = (bz*nbx*nby + by*nbx + bx)*4096 + (k%16)*256 + (j%16)*16 + (i%16).
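In code, that addressing looks roughly like this (a sketch of the layout described above; the helper name is made up):

[codebox]
// Address of element (i,j,k) in the 16x16x16-blocked layout (sketch)
__host__ __device__ int blocked_address(int i, int j, int k, int nx, int ny)
{
    const int nbx = nx / 16, nby = ny / 16;
    const int block  = (k/16)*nbx*nby + (j/16)*nbx + (i/16);   // linear block index
    const int offset = (k%16)*256 + (j%16)*16 + (i%16);        // offset inside the block
    return block*4096 + offset;                                // 4096 = 16^3 elements per block
}
[/codebox]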

Then I use a standard transpose kernel where a single kernel block transposes a 16x16 tile of the matrix in a given slice. The threads are (16, 16, 1) and the grid is (nz, (nx/16)*(ny/16), 1) in the case of the XY transpose, such that blockIdx.x tells which z-slice to transpose, and (ny, (nx/16)*(nz/16), 1) for the XZ transpose.

The resulting bandwidth is 35 GB/s.

BUT,

If I modify the kernel so that the grid is ((nx/16)*(ny/16), nz, 1) for the XY transpose, or ((nx/16)*(nz/16), ny, 1) for the XZ transpose, the resulting bandwidth decreases by a factor of 4-5.

This part is beyond my understanding. Any ideas?

I am not sure what you mean here. Can you elaborate a bit? What do you mean by the slowest-fastest memory direction, etc.? I thought that there is no caching in GPU memory.

Cheers,

Evghenii