First of all, this example calculates the grid size incorrectly.

The original code is:

```
dim3 grid(size_x / BLOCK_DIM, size_y / BLOCK_DIM, 1);
```

Since size_x and BLOCK_DIM are integers, the code performs an integer division.

BLOCK_DIM is always 16, so if size_x is 32, grid.x is 2, which is correct. But if size_x is 33, grid.x is still 2, and that's not correct: it should be 3 so that the last column is covered.

So I replaced the original code with this one:

```
int gx = ceilf(size_x / (float)BLOCK_DIM);
int gy = ceilf(size_y / (float)BLOCK_DIM);
dim3 grid(gx, gy, 1);
```

which calculates the grid size correctly.
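The same rounding up can also be done with integers only, which avoids the float conversion. This is just a minimal sketch: `div_up` and `blocks_for` are names I made up for illustration, they are not part of the SDK sample, and the code assumes non-negative sizes.

```cpp
// BLOCK_DIM matches the value used in the example (16).
const int BLOCK_DIM = 16;

// Hypothetical helper (not in the SDK sample): integer ceiling division.
// Assumes n >= 0 and d > 0.
inline int div_up(int n, int d) {
    return (n + d - 1) / d;
}

// Hypothetical helper: number of blocks needed to cover `size` elements.
inline int blocks_for(int size) {
    return div_up(size, BLOCK_DIM);
}
```

With these, the grid would be declared as `dim3 grid(blocks_for(size_x), blocks_for(size_y), 1);`, which gives grid.x = 3 for size_x = 33 and grid.x = 2 for size_x = 32.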

The problem is that when I try to transpose an arbitrary matrix with the optimized transpose kernel, and the matrix dimensions are not multiples of 16, I get strange errors.

For example (I'm using emudebug mode):

If size_x = 32 and size_y = 128 (the default), everything works fine.

If size_x = 33 (or any other value that is not a multiple of 16) and size_y = 128, I get several of these errors at random:

HEAP[transpose.exe]: Heap block at 003ECF28 modified at 003ECFC0 past requested size of 8c

First-chance exception at 0x7c97df51 in transpose.exe: 0xC0000005: Access violation reading location 0x45684ff8.

First-chance exception at 0x10010d2c in transpose.exe: 0xC0000005: Access violation writing location 0x4486c000.

Unhandled exception at 0x10010d2c in transpose.exe: 0xC0000005: Access violation writing location 0x4486c000.

The optimized kernel doesn't process data beyond the bounds of the matrix:

```
if (xIndex < width && yIndex < height){
... transpose code here...
}
```

On the other hand, the naive kernel correctly handles every matrix I test.

What is the problem with the optimized kernel?

Thanks a lot.