Matrix transpose problem (SDK example) for matrices whose size is not a multiple of 16

pfccpp · December 2, 2007, 11:25pm

First of all, this example calculates the grid size incorrectly.

The original code is:

dim3 grid(size_x / BLOCK_DIM, size_y / BLOCK_DIM, 1);

Since size_x and BLOCK_DIM are integers, the code performs an integer division.

If size_x is 32, grid.x is 2 (BLOCK_DIM is always 16), but if size_x is 33, grid.x is still 2, and that's not correct: three blocks are needed to cover 33 columns.

So I replaced the original code with this one:

int gx = ceilf(size_x / (float)BLOCK_DIM);

int gy = ceilf(size_y / (float)BLOCK_DIM);

dim3 grid(gx, gy, 1);

which calculates the grid size correctly.

The problem is that when I try to transpose an arbitrary matrix with the optimized transpose kernel and its dimensions are not multiples of 16, I get strange errors.

For example (I'm using emudebug mode):

If size_x = 32 and size_y = 128 (the default), everything works fine.

If size_x = 33 (or anything else that is not a multiple of 16) and size_y = 128, I get several of these errors at random:

HEAP[transpose.exe]: Heap block at 003ECF28 modified at 003ECFC0 past requested size of 8c

First-chance exception at 0x7c97df51 in transpose.exe: 0xC0000005: Access violation reading location 0x45684ff8.

First-chance exception at 0x10010d2c in transpose.exe: 0xC0000005: Access violation writing location 0x4486c000.

Unhandled exception at 0x10010d2c in transpose.exe: 0xC0000005: Access violation writing location 0x4486c000.

The optimized kernel doesn't process data beyond the bounds of the matrix:

if (xIndex < width && yIndex < height){

   ... transpose code here...

}

On the other hand, the naive kernel correctly handles every matrix I have tested.

What is the problem with the optimized kernel?

Thanks a lot.
