3D transpose, weird behaviour

Dear all,

I wrote a 3D transpose code with the following functionality:

  1. XYZ → YXZ

  2. XYZ → ZYX

The cuboid is of size (nx,ny,nz), and is stored in memory such that the cell with an index (i,j,k) has linear

address = (k*ny + j)*nx + i

After the transpose 1), the transposed cuboid has the dimensions (ny, nx, nz) and the transposed element corresponding to (i,j,k) has

address = (k*nx + i)*ny + j in the transposed cuboid

After the transpose 2), the transposed cuboid has the dimensions (nz, ny, nx) and the transposed element corresponding to (i,j,k) has

address = (i*ny + j)*nz + k in the transposed cuboid
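For reference, the two mappings written as plain CPU loops (just a sketch for clarity; the function names are made up and this is not the GPU code below):

[codebox]
// CPU reference for the two index mappings (sketch; names are mine)
void transpose_xy_ref(const float *in, float *out, int nx, int ny, int nz) {
    for (int k = 0; k < nz; k++)
        for (int j = 0; j < ny; j++)
            for (int i = 0; i < nx; i++)
                out[(k*nx + i)*ny + j] = in[(k*ny + j)*nx + i];   // XYZ -> YXZ
}

void transpose_zyx_ref(const float *in, float *out, int nx, int ny, int nz) {
    for (int k = 0; k < nz; k++)
        for (int j = 0; j < ny; j++)
            for (int i = 0; i < nx; i++)
                out[(i*ny + j)*nz + k] = in[(k*ny + j)*nx + i];   // XYZ -> ZYX
}
[/codebox]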

The piece of code which carries out these transpose operations is provided below. What bothers me is the resulting bandwidth. I used CUDA_PROFILE=1 to profile the code.

For transpose 1) I have the following output for nx=ny=nz=256:

method=[ _globfunc__Z13dev_transposeIL5TMODE0EEvP6float4PfS2_S3 ] gputime=[ 3760.928 ] cputime=[ 3.000 ] occupancy=[ 1.000 ]

This works out to 256^3 * 4 * 2 / 3761e-6 / 1e9 ≈ 35 GB/s (4 bytes per float, one read plus one write).

However, for the transpose 2) the output is

method=[ _globfunc__Z13dev_transposeIL5TMODE1EEvP6float4PfS2_S3 ] gputime=[ 25821.152 ] cputime=[ 2.000 ] occupancy=[ 1.000 ]

This works out to 256^3 * 4 * 2 / 25821e-6 / 1e9 ≈ 5 GB/s !!!

As an experiment, I changed the following lines

case XZ:

n1 = ny*nx

n2 = ny*nz

to

case XZ:

n1 = nx

n2 = nz

The output is clearly garbage, but at least I could measure the bandwidth. The result is 45 GB/s. Here is the output of the profiler:

method=[ _globfunc__Z13dev_transposeIL5TMODE1EEvP6float4PfS2_S3 ] gputime=[ 2939.040 ] cputime=[ 2.000 ] occupancy=[ 1.000 ]

???

It appears that I am either missing something important or this is a new undocumented feature/bug in CUDA. If reads/writes are coalesced with n1 = n2 = value, I assume they are also coalesced for n1 = n2 = value*value (my tests are done for nx=ny=nz=256).

Has anybody experienced such strange behaviour, or could someone comment on the code if something needs to be changed?

Thanks!

Evghenii

Here is the code:

[codebox]

#ifndef DEV_TRANSPOSE_CU

#define DEV_TRANSPOSE_CU

template<int DIM>

__device__ float transpose(float x) {

__shared__ float shdata[DIM][DIM + 1];   // +1 padding to avoid shared-memory bank conflicts

shdata[threadIdx.x][threadIdx.y] = x;

__syncthreads();

x = shdata[threadIdx.y][threadIdx.x];

__syncthreads();

return x;

}

// MODE

// 0 - transpose XY: threads(DIM, DIM, 1), grid(ny/DIM, nz, 1)

// 1 - transpose XZ: threads(DIM, DIM, 1), grid(nz/DIM, ny, 1)

template<int MODE>

__global__ void dev_transpose(float *data_in,

		      float *data_out,

		      int nx, int ny, int nz) {

const int DIM2 = 4;          // log2 of the tile edge

const int DIM = 1 << DIM2;   // tile edge = 16

int src, dst;

int i1, i2;

int n1, n2;

switch(MODE) {

case 0:

i1 = i2 = blockIdx.y*nx*ny;   // offset of the z-slice handled by this block (same in source and destination)

n1 = nx;   // source stride along y

n2 = ny;   // destination stride along x

break;

case 1:

i1 = blockIdx.y*nx;   // offset of this y-row in the source

i2 = blockIdx.y*nz;   // offset of this y-row in the destination

n1 = ny*nx;   // source stride along z

n2 = ny*nz;   // destination stride along x

break;

}

for (int bx = 0; bx < (nx >> DIM2); bx++) {   // march along x, one DIM-wide tile per iteration

src = i1 +

  ((blockIdx.x << DIM2) + threadIdx.y) * n1 + 

  ((bx         << DIM2) + threadIdx.x);

dst = i2 +

  ((bx         << DIM2) + threadIdx.y) * n2 +

  ((blockIdx.x << DIM2) + threadIdx.x);

float data = data_in[src];      // read, contiguous in threadIdx.x

data = transpose<DIM>(data);    // swap the roles of threadIdx.x and threadIdx.y via shared memory

data_out[dst] = data;           // write, contiguous in threadIdx.x

}

}

#endif

[/codebox]
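For completeness, the launch configuration implied by the comments in the kernel would look roughly like this (a sketch; d_in and d_out are placeholder device pointers, not names from the code above):

[codebox]
const int DIM = 16;                          // must match the tile size hard-coded in the kernel
dim3 threads(DIM, DIM, 1);

// 1) XY transpose: blockIdx.y selects the z-slice
dim3 grid_xy(ny / DIM, nz, 1);
dev_transpose<0><<<grid_xy, threads>>>(d_in, d_out, nx, ny, nz);

// 2) XZ transpose: blockIdx.y selects the y-row
dim3 grid_xz(nz / DIM, ny, 1);
dev_transpose<1><<<grid_xz, threads>>>(d_in, d_out, nx, ny, nz);
[/codebox]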

What GPU are you testing this on?

The GPU I use is a GTX 280, with CUDA v2.0.

Hi, did you find a workaround to your problem?

I believe that the reason why your bandwidth is so low for the slowest-fastest memory direction transpose (versus the slowest-second-slowest one) is that your GPU uses virtual memory, and this order of traversal causes TLB cache thrashing.
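For concreteness, with nx = ny = nz = 256 and 4-byte floats: in the XZ kernel the row-to-row stride is n1 = ny*nx = 65536 elements (256 KB) on the read side and n2 = ny*nz = 65536 elements (256 KB) on the write side, so every row of a 16x16 tile lands 256 KB away from the previous one; in the XY kernel the corresponding strides are only nx = ny = 256 elements (1 KB).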

Kind regards,

Danny

Hi Danny,

I completely rearranged my data structure. I used z-ordering of my 3D matrix by splitting it into blocks of 16x16x16 elements, to have some data locality. The blocks are linear in memory, such that block (bx, by, bz) has the linear block index bz*nbx*nby + by*nbx + bx, where nbx = nx/16, nby = ny/16. So element (i,j,k) is given by:

address = (bz*nbx*nby + by*nbx + bx)*4096 + (k%16)*256 + (j%16)*16 + (i%16).
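In code, that addressing looks roughly like this (a sketch of the layout described above; the helper name is made up):

[codebox]
// Address of element (i,j,k) in the 16x16x16-blocked layout (sketch)
__host__ __device__ int blocked_address(int i, int j, int k, int nx, int ny)
{
    const int nbx = nx / 16, nby = ny / 16;
    const int block  = (k/16)*nbx*nby + (j/16)*nbx + (i/16);   // linear block index
    const int offset = (k%16)*256 + (j%16)*16 + (i%16);        // offset inside the block
    return block*4096 + offset;                                // 4096 = 16^3 elements per block
}
[/codebox]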

Then I use a standard transpose kernel where a single kernel block transposes a 16x16 tile of the matrix in a given slice. The threads are (16, 16, 1) and the grid is (nz, (nx/16)*(ny/16), 1) in the case of the XY transpose, such that blockIdx.x tells which z-slice to transpose, and (ny, (nx/16)*(nz/16), 1) for the XZ transpose.

The resulting bandwidth is 35 GB/s.

BUT,

If I modify the kernel so that the grid is ((nx/16)*(ny/16), nz, 1) for the XY transpose, or ((nx/16)*(nz/16), ny, 1) for the XZ transpose, the resulting bandwidth decreases by a factor of 4-5.

This part is beyond my understanding. Any ideas?

I am not sure what you mean here. Can you elaborate a bit? What do you mean by the slowest-fastest memory direction, etc.? I thought that there is no caching in GPU memory.

Cheers,

Evghenii