Doubling the speed of the SDK transpose

For those of you who use the SDK transpose:
So I got around to benchmarking the transpose in the SDK after using it blindly since last summer.
What I found was that I was only getting about 15 GB/s on a 2048x2048 matrix, which is less than half the bandwidth of the card (peak would be 76.8/2 GB/s on a C870). I sat down and wrote another transpose kernel which uses 32x16 threads to transpose 32x32 blocks in shared memory (instead of 16x16) and got a 2.3x speedup, about 34.9 GB/s on the C870.

I think the right way to go is to use 16x16 threads with float2/int2 loads (64-bit), because then I can get 2 or 3 blocks per multiprocessor. The pointer arithmetic gets kinda mind-bending though.

If anybody wants it I’ll post the code.

Anybody have any thoughts on this?
Thanks,
Mike.

Please do! Maybe NVIDIA will update the SDK to include it.

Wouldn’t 32x16 yield a thread block bigger than the 256 max? Or am I wrong?

Anyway, post the code, with a little more detail on why the optimization works.

Always fun to learn new stuff ;)

Maximum number of threads per block is 512.

Yes, please post the code!

This is the transpose code which gets close to 35GB/s.

This may be a little rough because I have to retype the code, as I can’t get it off the machine where it resides for various reasons :-(

If someone could test it out to make sure I typed it correctly…

Here is the code:

// transpose using a 32x32 block in shared memory

// 2D thread block: 32 in x by 8 in y == 256 threads

// it uses four 32x8 sub-blocks to make the 32x32 block in shared memory

// sample call for 2048x2048 matrix

// ...

// dim3 grid(2048/32,  2048/32, 1);

// dim3 threads(32, 8, 1);

// transpose_x32y8_dev<<<grid, threads>>>(d_odata, d_idata, 11, 11);

// ...

__global__ void transpose_x32y8_dev(float *odata, float *idata, int lg_dimx, int lg_dimy)

{

    __shared__ float block[32][32+1];

    unsigned int xBlock = blockIdx.x << 5; // blockIdx.x * 32

    unsigned int yBlock = blockIdx.y << 5; // blockIdx.y * 32

    unsigned int index_out, index_in;

   // load block and transpose into shared memory

    // turn 2d index into 1d index

    index_in = ((yBlock + threadIdx.y + 0) << lg_dimx) + xBlock + threadIdx.x;

    // load data into shared memory

    block[threadIdx.x][threadIdx.y + 0] = idata[index_in];

    // turn 2d index into 1d index

    index_in = ((yBlock + threadIdx.y + 8) << lg_dimx) + xBlock + threadIdx.x;

    // load data into shared memory

    block[threadIdx.x][threadIdx.y + 8] = idata[index_in];

    // turn 2d index into 1d index

    index_in = ((yBlock + threadIdx.y + 16) << lg_dimx) + xBlock + threadIdx.x;

    // load data into shared memory

    block[threadIdx.x][threadIdx.y + 16] = idata[index_in];

    // turn 2d index into 1d index

    index_in = ((yBlock + threadIdx.y + 24) << lg_dimx) + xBlock + threadIdx.x;

    // load data into shared memory

    block[threadIdx.x][threadIdx.y + 24] = idata[index_in];

   __syncthreads();

   // store transposed block back to device memory

    // turn 2d index into 1d index

    index_out = ((xBlock + threadIdx.y + 0) << lg_dimy) + yBlock + threadIdx.x;

    // store transposed block back to device memory

    odata[index_out] = block[threadIdx.y + 0][threadIdx.x];

    // turn 2d index into 1d index

    index_out = ((xBlock + threadIdx.y + 8) << lg_dimy) + yBlock + threadIdx.x;

    // store transposed block back to device memory

    odata[index_out] = block[threadIdx.y + 8][threadIdx.x];

    // turn 2d index into 1d index

    index_out = ((xBlock + threadIdx.y + 16) << lg_dimy) + yBlock + threadIdx.x;

    // store transposed block back to device memory

    odata[index_out] = block[threadIdx.y + 16][threadIdx.x];

    // turn 2d index into 1d index

    index_out = ((xBlock + threadIdx.y + 24) << lg_dimy) + yBlock + threadIdx.x;

    // store transposed block back to device memory

    odata[index_out] = block[threadIdx.y + 24][threadIdx.x];

}

I should have mentioned that the transpose_x32y8_dev() kernel uses 10 registers and 4264 bytes of shared memory. So I think I’ll get 2 blocks running on each multiprocessor concurrently…

Anybody have any thoughts,
Mike.

Should run with 3 threadblocks per multiprocessor if you use 10 registers and 4224 bytes of smem per block.

Paulius

You are right: I checked the occupancy and it is 1.0, which means there are 768 threads active per multiprocessor.

So now the question is: I thought the number of registers per thread could only come in multiples of 4, which would yield 8 or 12 but not 10. Maybe that limitation only applies to the -maxrregcount compiler option?

Mike.

This is a trivial issue, but you can rely on the loop unroller in nvcc and write:

__global__ void transpose_x32y8_dev(float *odata, float *idata, int lg_dimx, int lg_dimy)

{

   __shared__ float block[32][32+1];

   unsigned int xBlock = blockIdx.x << 5; // blockIdx.x * 32

   unsigned int yBlock = blockIdx.y << 5; // blockIdx.y * 32

   unsigned int index_out, index_in;

  // load block and transpose into shared memory

   for (int i=0; i < 32; i += 8) {

     // turn 2d index into 1d index

     index_in = ((yBlock + threadIdx.y + i) << lg_dimx) + xBlock + threadIdx.x;

     // load data into shared memory

     block[threadIdx.x][threadIdx.y + i] = idata[index_in];

   }

   __syncthreads();

  // store transposed block back to device memory

   for (int i=0; i < 32; i += 8) {

     // turn 2d index into 1d index

     index_out = ((xBlock + threadIdx.y + i) << lg_dimy) + yBlock + threadIdx.x;

     // store transposed block back to device memory

     odata[index_out] = block[threadIdx.y + i][threadIdx.x];

   }

}

I checked with nvcc --ptx, and the output ptx is the same.

Good to know.

Did you measure the performance?

Nope, didn’t have a chance to write a wrapper to call the kernel. Just did a quick copy-paste-compile check. :)

Just to be complete, I measured the performance of the SDK transpose against my new transpose code.

numbers are GB/s

size(XxX)        new       SDK

      32         0.588     0.564

      64         2.336     2.355

     128         9.442     9.376

     256        22.674    21.630

     512        39.842    28.397

    1024        40.182    30.557

    2048        36.278    15.208

    4096        34.509    10.655

    8192        29.942    11.392

For reference: http://forums.nvidia.com/index.php?showtopic=78794

With the above code on my GTX280 I get only 11.8 GB/s (EDIT: wrong, 23.6 GB/s) for a 1024x1024 matrix (runtime 330 us).
The runtime of the SDK example is 400 us. Can anybody confirm that with his/her GTX280?

Thanks

I’ll try it out on my C1060.

I imagine that there are several factors at work here:

  1. Longer latency: ~550 cycles on the 8800GTX vs. more on the C1060 (I’ll measure it).

  2. Wider memory interface, 384 bits vs. 512 bits; with burst lengths (GDDR3 vs. GDDR4) being what they are, you probably need 64 consecutive words and this kernel gives 32.

  3. 30 SMs on GT200 vs. 16 SMs on G80; with only 32 blocks in each dimension, maybe this has an effect since things don’t divide evenly.

  4. Are you remembering to multiply bandwidth by 2 because you are reading and writing data?

Thanks,

Freak dammit, I always forget the factor of 2 (read + write)… So my bandwidth is 23.6 GB/s, but still a lot less than the 40 GB/s. With the 8800GTX I plugged in a few days ago I got 180 us (46 GB/s).

I will play around a little more and see…

Be careful when you use the code if you aren’t sure what it does.
The above code will break for sizes that are not a power of two; an easy test shows that it gives wrong results for a 2047x2047 input.

Anyway, thanks for posting the code; it is faster than the SDK version (by increasing the workload of each thread) for large input sizes.