Max matrix size for matrix transposition


I have recently written my code for rectangular matrix transposition (of any dimensions),

but I can transpose matrices of only up to about 100 000 000 float elements (i.e. matrix sizes of e.g. 10 000 x 10 000, 2 000 x 50 000, … ),

because kernel execution fails for higher matrix dimensions.

How it is possible? Are there any constraints I can’t see? Any dependences on compute capability/MP count? Or max number of instructions per kernel?

I want to write this constraint-check function (in order to safely prevent kernel execution if constraints aren’t met), but I don’t know how deduce/calculate constraints…

Here is my kernel (it’s just slightly modified version from SDK):

const int TILE_SIZE  = 16; // dimension of tile, it must be an integral multiple of BLOCK_ROWS

const int BLOCK_ROWS = 16; // number of rows of threads in each thread block

__global__ void transposeRectangularMatrixKernel(float * inMatrix, float * outMatrix, int sizeX, int sizeY) {

// tile of elements loaded from input matrix that will be stored into output matrix (padding of size of 1 column used to guarantee no bank conflicts)

    __shared__ float tile[TILE_SIZE][TILE_SIZE+1];

// load the tile of elements of input matrix allocated in global memory into the shared memory

    for (int i=0; i<TILE_SIZE; i+=BLOCK_ROWS) {

// calculate x and y coordinates of element from input matrix that will be loaded by a single thread into the tile allocated in shared memory

        int x = blockIdx.x * TILE_SIZE + threadIdx.x;

        int y = blockIdx.y * TILE_SIZE + threadIdx.y + i;

// to avoid reading elements out of range of input matrix

        if (x<sizeX && y<sizeY) {

            tile[threadIdx.y + i][threadIdx.x] = inMatrix[y*sizeX + x];



// wait untill all global and shared memory accesses made by threads of thread block are visible to all threads in the block


// store the tile of elements in a transposed way into the output matrix

    for (int i=0; i<TILE_SIZE; i+=BLOCK_ROWS) {

// calculate transposed x and y coordinates of element in output matrix

        int x = blockIdx.y * TILE_SIZE + threadIdx.x;

        int y = blockIdx.x * TILE_SIZE + threadIdx.y + i;

// to avoid storing elements out of range of output matrix

        if (x<sizeY && y<sizeX) {

            outMatrix[y * sizeY + x] = tile[threadIdx.x][threadIdx.y + i];




Execution configuration is

dim3 gridDim((sizeX + TILE_SIZE - 1)/TILE_SIZE, (sizeY + TILE_SIZE - 1)/TILE_SIZE);

dim3 blockDim(TILE_SIZE,BLOCK_ROWS);

where sizeX and sizeY are arbitrary dimensions of matrix to transpose

Are you sure you aren’t running out of memory or hitting the display driver watchdog timer limit, if this is running on a card with an attached display?

You are right, yes, thanks!! My stupid mistake! I forgot to calculate with space needed for storing output matrix! :wallbash: :thanks:

But anyway my results are slightly lower than theoretical maximum of input matrix dimension can be,

calculate with me (I’m running on GeForce GTX 295 with 939327488 Bytes of global memory):

we need to store 2 matrices (input and transposed one) of float elements in global memory,

size of float is 4B,

so max. theoretical dimension of square matrix that can be flawlessly processed by my kernel is

sqrt((939327488 / 2) / 4) = 10835

but I can run my kernel with matrices of max size of 10165 x 10165 elements. Maybe some memory is reserved by OS (Linux)?

The CUDA context does take some memory, and there are page size considerations too. You should find the available memory is something like 50Mb less than the capacity of the card on a dedicated compute 1.3 device (more on Fermi because of printf and C++ support). You might also have to round the amount of memory requested in a malloc call by a small amount because of page sizes. IIRC “big” pages on the GT200 were 1Mb each, so the minimum granularity of allocations is of that order.

you could have maximum size as big as sqrt(939327488 / 4) = 15324 if you do an in-place transpose.