Hi,

I have recently written my code for rectangular matrix transposition (of any dimensions),

but I can transpose matrices of only up to about 100 000 000 float elements (i.e. matrix sizes of e.g. 10 000 x 10 000, 2 000 x 50 000, … ),

because kernel execution fails for higher matrix dimensions.

How it is possible? Are there any constraints I can’t see? Any dependences on compute capability/MP count? Or max number of instructions per kernel?

I want to write this constraint-check function (in order to safely prevent kernel execution if constraints aren’t met), but I don’t know how deduce/calculate constraints…

Here is my kernel (it’s just slightly modified version from SDK):

```
const int TILE_SIZE = 16; // dimension of tile, it must be an integral multiple of BLOCK_ROWS
const int BLOCK_ROWS = 16; // number of rows of threads in each thread block
__global__ void transposeRectangularMatrixKernel(float * inMatrix, float * outMatrix, int sizeX, int sizeY) {
// tile of elements loaded from input matrix that will be stored into output matrix (padding of size of 1 column used to guarantee no bank conflicts)
__shared__ float tile[TILE_SIZE][TILE_SIZE+1];
// load the tile of elements of input matrix allocated in global memory into the shared memory
for (int i=0; i<TILE_SIZE; i+=BLOCK_ROWS) {
// calculate x and y coordinates of element from input matrix that will be loaded by a single thread into the tile allocated in shared memory
int x = blockIdx.x * TILE_SIZE + threadIdx.x;
int y = blockIdx.y * TILE_SIZE + threadIdx.y + i;
// to avoid reading elements out of range of input matrix
if (x<sizeX && y<sizeY) {
tile[threadIdx.y + i][threadIdx.x] = inMatrix[y*sizeX + x];
}
}
// wait untill all global and shared memory accesses made by threads of thread block are visible to all threads in the block
__syncthreads();
// store the tile of elements in a transposed way into the output matrix
for (int i=0; i<TILE_SIZE; i+=BLOCK_ROWS) {
// calculate transposed x and y coordinates of element in output matrix
int x = blockIdx.y * TILE_SIZE + threadIdx.x;
int y = blockIdx.x * TILE_SIZE + threadIdx.y + i;
// to avoid storing elements out of range of output matrix
if (x<sizeY && y<sizeX) {
outMatrix[y * sizeY + x] = tile[threadIdx.x][threadIdx.y + i];
}
}
}
```

Execution configuration is

```
dim3 gridDim((sizeX + TILE_SIZE - 1)/TILE_SIZE, (sizeY + TILE_SIZE - 1)/TILE_SIZE);
dim3 blockDim(TILE_SIZE,BLOCK_ROWS);
```

where sizeX and sizeY are arbitrary dimensions of matrix to transpose