For those of you who use the SDK transpose:
So I got around to benchmarking the transpose in the SDK after using it blindly since last summer.
What I found was that I was only getting about 15 GB/s on a 2048x2048 matrix, which is less than half the bandwidth of the card (peak would be 76.8/2 GB/s on a C870). I sat down and wrote another transpose kernel which uses 32x8 threads to transpose 32x32 blocks in shared memory (instead of 16x16) and got a 2.3x speedup, about 34.9 GB/s on the C870.
I think the right way to go is to use 16x16 threads with float2/int2 loads (64-bit), because then I can get 2 or 3 blocks per multiprocessor. The pointer arithmetic gets kind of mind-bending, though.
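Just to sketch the idea (this part is untested, so treat the index math as a guess rather than working code), a 16x16-thread version with float2 loads might look roughly like this, with lg_dimx/lg_dimy being log2 of the matrix dimensions in floats as in the kernel below:
// rough sketch only: 32x32 tile, 16x16 threads, 64-bit (float2) loads/stores
// launch: dim3 grid(dimx/32, dimy/32, 1); dim3 threads(16, 16, 1);
__global__ void transpose_x16y16_f2_dev(float2 *odata, float2 *idata, int lg_dimx, int lg_dimy)
{
    __shared__ float block[32][32+1];
    unsigned int xBlock = blockIdx.x << 5;   // tile origin, in floats
    unsigned int yBlock = blockIdx.y << 5;
    unsigned int index_in, index_out;
    float2 v;
    // each thread loads two float2s (rows +0 and +16 of the tile);
    // the row pitch in float2 units is dimx/2 == 1 << (lg_dimx - 1)
    index_in = ((yBlock + threadIdx.y) << (lg_dimx - 1)) + (xBlock >> 1) + threadIdx.x;
    v = idata[index_in];
    block[2*threadIdx.x    ][threadIdx.y] = v.x;
    block[2*threadIdx.x + 1][threadIdx.y] = v.y;
    index_in = ((yBlock + threadIdx.y + 16) << (lg_dimx - 1)) + (xBlock >> 1) + threadIdx.x;
    v = idata[index_in];
    block[2*threadIdx.x    ][threadIdx.y + 16] = v.x;
    block[2*threadIdx.x + 1][threadIdx.y + 16] = v.y;
    __syncthreads();
    // each thread writes two float2s of the transposed tile
    index_out = ((xBlock + threadIdx.y) << (lg_dimy - 1)) + (yBlock >> 1) + threadIdx.x;
    v.x = block[threadIdx.y][2*threadIdx.x    ];
    v.y = block[threadIdx.y][2*threadIdx.x + 1];
    odata[index_out] = v;
    index_out = ((xBlock + threadIdx.y + 16) << (lg_dimy - 1)) + (yBlock >> 1) + threadIdx.x;
    v.x = block[threadIdx.y + 16][2*threadIdx.x    ];
    v.y = block[threadIdx.y + 16][2*threadIdx.x + 1];
    odata[index_out] = v;
}
The kernel I actually benchmarked is the 32x8 one below.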
This is the transpose code which gets close to 35GB/s.
This may be a little rough because I had to retype the code; I can't get it off the machine where it resides, for various reasons.
If someone could test it out to make sure I typed it correctly…
Here is the code:
// transpose using a 32x32 block in shared memory
// 2D XxY threads == 32x8 threads == 256 threads
// it uses four 32x8 sub-blocks to make the 32x32 block in shared memory
// sample call for 2048x2048 matrix
// ...
// dim3 grid(2048/32, 2048/32, 1);
// dim3 threads(32, 8, 1);
// transpose_x32y8_dev<<<grid, threads>>>(d_odata, d_idata, 11, 11);
// ...
__global__ void transpose_x32y8_dev(float *odata, float *idata, int lg_dimx, int lg_dimy)
{
    // tile padded to 33 columns to avoid shared-memory bank conflicts
    __shared__ float block[32][32+1];

    unsigned int xBlock = blockIdx.x << 5; // blockIdx.x * 32
    unsigned int yBlock = blockIdx.y << 5; // blockIdx.y * 32
    unsigned int index_out, index_in;

    // load the block into shared memory, transposed; the 32x8 threads cover
    // the 32x32 tile in four sub-blocks (rows +0, +8, +16, +24), and each
    // index_in turns the 2D position into a 1D index
    index_in = ((yBlock + threadIdx.y + 0) << lg_dimx) + xBlock + threadIdx.x;
    block[threadIdx.x][threadIdx.y + 0] = idata[index_in];

    index_in = ((yBlock + threadIdx.y + 8) << lg_dimx) + xBlock + threadIdx.x;
    block[threadIdx.x][threadIdx.y + 8] = idata[index_in];

    index_in = ((yBlock + threadIdx.y + 16) << lg_dimx) + xBlock + threadIdx.x;
    block[threadIdx.x][threadIdx.y + 16] = idata[index_in];

    index_in = ((yBlock + threadIdx.y + 24) << lg_dimx) + xBlock + threadIdx.x;
    block[threadIdx.x][threadIdx.y + 24] = idata[index_in];

    __syncthreads();

    // store the transposed block back to device memory, again in four
    // 32x8 sub-blocks so the writes stay coalesced
    index_out = ((xBlock + threadIdx.y + 0) << lg_dimy) + yBlock + threadIdx.x;
    odata[index_out] = block[threadIdx.y + 0][threadIdx.x];

    index_out = ((xBlock + threadIdx.y + 8) << lg_dimy) + yBlock + threadIdx.x;
    odata[index_out] = block[threadIdx.y + 8][threadIdx.x];

    index_out = ((xBlock + threadIdx.y + 16) << lg_dimy) + yBlock + threadIdx.x;
    odata[index_out] = block[threadIdx.y + 16][threadIdx.x];

    index_out = ((xBlock + threadIdx.y + 24) << lg_dimy) + yBlock + threadIdx.x;
    odata[index_out] = block[threadIdx.y + 24][threadIdx.x];
}
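If anyone wants to sanity-check it, a minimal host-side test along these lines should do it (this part is just the usual runtime-API boilerplate sketched from memory, so adjust to taste):
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    const int lg_dim = 11;                       // 2^11 = 2048
    const int dim = 1 << lg_dim;
    const size_t bytes = (size_t)dim * dim * sizeof(float);

    float *h_idata = (float*)malloc(bytes);
    float *h_odata = (float*)malloc(bytes);
    for (int i = 0; i < dim * dim; i++)
        h_idata[i] = (float)i;                   // exact in float up to 2^24

    float *d_idata, *d_odata;
    cudaMalloc((void**)&d_idata, bytes);
    cudaMalloc((void**)&d_odata, bytes);
    cudaMemcpy(d_idata, h_idata, bytes, cudaMemcpyHostToDevice);

    dim3 grid(dim / 32, dim / 32, 1);
    dim3 threads(32, 8, 1);
    transpose_x32y8_dev<<<grid, threads>>>(d_odata, d_idata, lg_dim, lg_dim);
    cudaThreadSynchronize();

    cudaMemcpy(h_odata, d_odata, bytes, cudaMemcpyDeviceToHost);

    // odata[x][y] should equal idata[y][x]
    int errors = 0;
    for (int y = 0; y < dim; y++)
        for (int x = 0; x < dim; x++)
            if (h_odata[x * dim + y] != h_idata[y * dim + x])
                errors++;
    if (errors) printf("FAILED: %d mismatches\n", errors);
    else        printf("PASSED\n");

    cudaFree(d_idata); cudaFree(d_odata);
    free(h_idata); free(h_odata);
    return 0;
}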
I should have mentioned that the transpose_x32y8_dev() kernel uses 10 registers and 4264 bytes of shared memory. So I think I'll get 2 blocks running concurrently per multiprocessor…
You are right, I checked the occupancy and it is 1.0 which means there are 768 threads active per multiprocessor.
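For reference, the arithmetic behind that (assuming the G80/C870 per-SM limits of 768 threads, 8192 registers and 16 KB of shared memory):
tile: 32 * 33 * 4 bytes = 4224 bytes, which presumably becomes the reported 4264 bytes per block once the kernel arguments are counted
shared memory: 3 blocks * 4264 bytes = 12792 bytes <= 16384 bytes
registers: 3 blocks * 256 threads * 10 registers = 7680 <= 8192
threads: 3 blocks * 256 threads = 768
So three 256-thread blocks fit per SM, not two, which is why the occupancy comes out at 1.0.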
So the question now is: I thought the number of registers per thread could only be allocated in multiples of 4, which would give 8 or 12 but not 10. Maybe it is just the -maxrregcount compiler option that has this limitation?
With the above code on my GTX280 I get only 11.8 GB/s (EDIT: wrong, 23.6 GB/s) for a 1024x1024 matrix (runtime 330us).
The runtime of the SDK example is 400us. Can anybody confirm that on their GTX280?
I imagine that there are several factors at work here:
longer latency: ~550 cycles on the 8800GTX vs. more on the C1060 (I'll measure it)
wider memory interface: 384 bits vs. 512 bits; with the burst lengths being what they are (GDDR3 vs. GDDR4), you probably need 64 consecutive words and this kernel only gives 32
30 SMs on GT200 vs. 16 SMs on G80: with only 32 blocks in each grid dimension, maybe this has an effect since things don't divide evenly
Are you remembering to multiply bandwidth by 2 because you are reading and writing data?
Freak, dammit, I always forget the factor of 2 (read + write)… So my bandwidth is 23.6 GB/s, but that is still a lot less than the 40 GB/s. With the 8800GTX I plugged in a few days ago I got 180us (46 GB/s).
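For anyone else tripping over the same thing, the arithmetic: a 1024x1024 float matrix is 4 MiB, and a transpose reads it once and writes it once, so 8 MiB cross the bus. 8 MiB / 330us ≈ 25 GB/s (≈ 23.6 GiB/s), and counting only one direction gives the 11.8 figure; the 8800GTX run works out to 8 MiB / 180us ≈ 46 GB/s.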
Be careful when you use the code if you are not sure what it does.
The above code will break for sizes that are not a power of two; an easy test shows it gives the wrong result for 2047x2047 inputs. One way to handle that is sketched below.
Anyway, thanks for posting the code; it is faster than the SDK version (by increasing the work done per thread) for large input sizes.
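A rough, untested sketch of the bounds-checked variant I mean (my own guess, not the original kernel): pass the real dimensions instead of their log2, index with multiplies instead of shifts, and guard every load and store:
// launch with grid((dimx+31)/32, (dimy+31)/32, 1) and threads(32, 8, 1)
__global__ void transpose_any_dev(float *odata, float *idata, int dimx, int dimy)
{
    __shared__ float block[32][32+1];

    int xBlock = blockIdx.x * 32;
    int yBlock = blockIdx.y * 32;

    // load the tile into shared memory, transposed, skipping out-of-range elements
    for (int i = 0; i < 32; i += 8) {
        int x = xBlock + threadIdx.x;
        int y = yBlock + threadIdx.y + i;
        if (x < dimx && y < dimy)
            block[threadIdx.x][threadIdx.y + i] = idata[y * dimx + x];
    }

    __syncthreads();

    // write the transposed tile; the output matrix is dimy wide and dimx tall
    for (int i = 0; i < 32; i += 8) {
        int x = yBlock + threadIdx.x;        // column in the output
        int y = xBlock + threadIdx.y + i;    // row in the output
        if (x < dimy && y < dimx)
            odata[y * dimy + x] = block[threadIdx.y + i][threadIdx.x];
    }
}
It should be a bit slower than the shift version because of the integer multiplies and the divergence in the edge tiles, but it should at least give the right answer for sizes like 2047x2047.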