First of all, this example calculates the grid size incorrectly.
The original code is:
dim3 grid(size_x / BLOCK_DIM, size_y / BLOCK_DIM, 1);
Since size_x and BLOCK_DIM are integers, the code performs an integer division.
With BLOCK_DIM fixed at 16, size_x = 32 gives grid.x = 2, which is fine, but size_x = 33 also gives grid.x = 2 instead of the required 3, and that's not correct.
So I replaced the original code with this one:
int gx = ceilf(size_x / (float)BLOCK_DIM);
int gy = ceilf(size_y / (float)BLOCK_DIM);
dim3 grid(gx, gy, 1);
which calculates the grid size correctly.
The problem is that when I try to transpose a matrix whose dimensions are not multiples of 16 with the optimized transpose kernel, I get strange errors.
For example (I'm using emudebug mode):
If size_x == 32 and size_y == 128 (the defaults), everything works fine.
If size_x == 33 (anything that is not a multiple of 16) and size_y == 128, I randomly get several of these errors:
HEAP[transpose.exe]: Heap block at 003ECF28 modified at 003ECFC0 past requested size of 8c
First-chance exception at 0x7c97df51 in transpose.exe: 0xC0000005: Access violation reading location 0x45684ff8.
First-chance exception at 0x10010d2c in transpose.exe: 0xC0000005: Access violation writing location 0x4486c000.
Unhandled exception at 0x10010d2c in transpose.exe: 0xC0000005: Access violation writing location 0x4486c000.
The optimized kernel doesn't touch data beyond the bounds of the matrix:
if (xIndex < width && yIndex < height){
... transpose code here...
}
On the other hand, the naive kernel correctly handles every matrix I test.
What is the problem with the optimized kernel?
Thanks a lot.