Doubling the speed of the SDK transpose

For those of you who use the SDK transpose:
So I got around to benchmarking the transpose in the SDK after using it blindly since last summer.
What I found was that I was only getting about 15 GB/s on a 2048x2048 matrix, which is less than half the bandwidth of the card (peak would be 76.8/2 GB/s on a C870). I sat down and wrote another transpose kernel which uses 32x16 threads to transpose 32x32 blocks in shared memory (instead of 16x16) and got a 2.3x speedup, about 34.9 GB/s on the C870.

I think the right way to go is to use 16x16 threads with float2/int2 loads (64-bit), because then I can get 2 or 3 blocks per multiprocessor. The pointer arithmetic gets kinda mind-bending though.

If anybody wants it I’ll post the code.

Anybody have any thoughts on this?
Thanks,
Mike.

Please do! Maybe NVIDIA will update the SDK to include it.

Wouldn’t 32x16 yield a thread block bigger than the 256 max? Or am I wrong?

Anyway, post the code, with a little more detail on why the optimization works.

Always fun to learn new stuff ;)

Maximum number of threads per block is 512.

Yes, please post the code!

This is the transpose code which gets close to 35GB/s.

This may be a little rough because I have to retype the code, as I can’t get it off the machine where it resides for various reasons :-(

If someone could test it out to make sure I typed it correctly…

Here is the code:

// transpose using a 32x32 block in shared memory

// 2D thread block: 32 in x by 8 in y == 256 threads

// it uses four 32x8 sub-blocks to make the 32x32 block in shared memory

// sample call for 2048x2048 matrix

// ...

// dim3 grid(2048/32,  2048/32, 1);

// dim3 threads(32, 8, 1);

// transpose_x32y8_dev<<<grid, threads>>>(d_odata, d_idata, 11, 11);

// ...

__global__ void transpose_x32y8_dev(float *odata, float *idata, int lg_dimx, int lg_dimy)

{

    __shared__ float block[32][32+1];

    unsigned int xBlock = blockIdx.x << 5; // blockIdx.x * 32

    unsigned int yBlock = blockIdx.y << 5; // blockIdx.y * 32

    unsigned int index_out, index_in;

   // load block and transpose into shared memory

    // turn 2d index into 1d index

    index_in = ((yBlock + threadIdx.y + 0) << lg_dimx) + xBlock + threadIdx.x;

    // load data into shared memory

    block[threadIdx.x][threadIdx.y + 0] = idata[index_in];

    // turn 2d index into 1d index

    index_in = ((yBlock + threadIdx.y + 8) << lg_dimx) + xBlock + threadIdx.x;

    // load data into shared memory

    block[threadIdx.x][threadIdx.y + 8] = idata[index_in];

    // turn 2d index into 1d index

    index_in = ((yBlock + threadIdx.y + 16) << lg_dimx) + xBlock + threadIdx.x;

    // load data into shared memory

    block[threadIdx.x][threadIdx.y + 16] = idata[index_in];

    // turn 2d index into 1d index

    index_in = ((yBlock + threadIdx.y + 24) << lg_dimx) + xBlock + threadIdx.x;

    // load data into shared memory

    block[threadIdx.x][threadIdx.y + 24] = idata[index_in];

   __syncthreads();

   // store transposed block back to device memory

    // turn 2d index into 1d index

    index_out = ((xBlock + threadIdx.y + 0) << lg_dimy) + yBlock + threadIdx.x;

    // store transposed block back to device memory

    odata[index_out] = block[threadIdx.y + 0][threadIdx.x];

    // turn 2d index into 1d index

    index_out = ((xBlock + threadIdx.y + 8) << lg_dimy) + yBlock + threadIdx.x;

    // store transposed block back to device memory

    odata[index_out] = block[threadIdx.y + 8][threadIdx.x];

    // turn 2d index into 1d index

    index_out = ((xBlock + threadIdx.y + 16) << lg_dimy) + yBlock + threadIdx.x;

    // store transposed block back to device memory

    odata[index_out] = block[threadIdx.y + 16][threadIdx.x];

    // turn 2d index into 1d index

    index_out = ((xBlock + threadIdx.y + 24) << lg_dimy) + yBlock + threadIdx.x;

    // store transposed block back to device memory

    odata[index_out] = block[threadIdx.y + 24][threadIdx.x];

}

I should have mentioned that the transpose_x32y8_dev() kernel uses 10 registers and 4264 bytes of shared memory. So I think I’ll get 2 blocks running on each multiprocessor concurrently…

Anybody have any thoughts,
Mike.

Should run with 3 threadblocks per multiprocessor if you use 10 registers and 4224 bytes of smem per block.

Paulius

You are right: I checked the occupancy and it is 1.0, which means there are 768 threads active per multiprocessor.

So now the question is: I thought the number of registers per thread could only come in multiples of 4, which would yield 8 or 12 but not 10. Maybe that limitation only applies to the -maxrregcount compiler option?

Mike.

This is a trivial issue, but you can rely on the loop unroller in nvcc and write:

__global__ void transpose_x32y8_dev(float *odata, float *idata, int lg_dimx, int lg_dimy)

{

   __shared__ float block[32][32+1];

   unsigned int xBlock = blockIdx.x << 5; // blockIdx.x * 32

   unsigned int yBlock = blockIdx.y << 5; // blockIdx.y * 32

   unsigned int index_out, index_in;

  // load block and transpose into shared memory

   for (int i=0; i < 32; i += 8) {

     // turn 2d index into 1d index

     index_in = ((yBlock + threadIdx.y + i) << lg_dimx) + xBlock + threadIdx.x;

     // load data into shared memory

     block[threadIdx.x][threadIdx.y + i] = idata[index_in];

   }

   __syncthreads();

  // store transposed block back to device memory

   for (int i=0; i < 32; i += 8) {

     // turn 2d index into 1d index

     index_out = ((xBlock + threadIdx.y + i) << lg_dimy) + yBlock + threadIdx.x;

     // store transposed block back to device memory

     odata[index_out] = block[threadIdx.y + i][threadIdx.x];

   }

}

I checked with nvcc --ptx, and the output ptx is the same.

Good to know.

Did you measure the performance?

Nope, didn’t have a chance to write a wrapper to call the kernel. Just did a quick copy-paste-compile check. :)

Just to be complete, I measured the performance of the SDK transpose against my new transpose code.

numbers are GB/s

size(XxX)        new       SDK

      32         0.588     0.564

      64         2.336     2.355

     128         9.442     9.376

     256        22.674    21.630

     512        39.842    28.397

    1024        40.182    30.557

    2048        36.278    15.208

    4096        34.509    10.655

    8192        29.942    11.392

For reference: http://forums.nvidia.com/index.php?showtopic=78794

With the above code on my GTX280 I get only 11.8 GB/s (EDIT: wrong, 23.6 GB/s) for a 1024x1024 matrix (runtime 330 us).
The runtime of the SDK example is 400 us. Can anybody confirm that with his/her GTX280?

Thanks

I’ll try it out on my C1060.

I imagine that there are several factors at work here:

  1. Longer latency: ~550 cycles on the 8800GTX vs. more on the C1060 (I’ll measure it).

  2. Wider memory interface, 384 bits vs. 512 bits; with burst lengths (GDDR3 vs. GDDR4) being what they are, you probably need 64 consecutive words and this kernel gives 32.

  3. 30 SMs on GT200 vs. 16 SMs on G80; with only 32 blocks in each dimension, maybe this has an effect since things don’t divide evenly.

  4. Are you remembering to multiply bandwidth by 2 because you are reading and writing data?

Thanks,

Freak dammit, I always forget the factor of 2 (read + write)… So my bandwidth is 23.6 GB/s, but still a lot less than the 40 GB/s. With the 8800GTX I plugged in a few days ago I got 180 us (46 GB/s).

I will play around a little more and see…

Be careful when you use the code if you aren’t sure what it does.
The above code will break for sizes that are not a power of two; an easy test shows that it gives wrong results for a 2047x2047 input.

Anyway, thanks for posting the code; it is faster than the SDK version (by increasing the workload of each thread) for large input sizes.