Max matrix size for matrix transposition

pandomovnik · March 30, 2011, 11:02am

Hi,

I have recently written my code for rectangular matrix transposition (of any dimensions),

but I can transpose matrices of only up to about 100 000 000 float elements (i.e. matrix sizes of e.g. 10 000 x 10 000, 2 000 x 50 000, … ),

because kernel execution fails for higher matrix dimensions.

How it is possible? Are there any constraints I can’t see? Any dependences on compute capability/MP count? Or max number of instructions per kernel?

I want to write this constraint-check function (in order to safely prevent kernel execution if constraints aren’t met), but I don’t know how deduce/calculate constraints…

Here is my kernel (it’s just slightly modified version from SDK):

const int TILE_SIZE  = 16; // dimension of tile, it must be an integral multiple of BLOCK_ROWS

const int BLOCK_ROWS = 16; // number of rows of threads in each thread block

__global__ void transposeRectangularMatrixKernel(float * inMatrix, float * outMatrix, int sizeX, int sizeY) {

// tile of elements loaded from input matrix that will be stored into output matrix (padding of size of 1 column used to guarantee no bank conflicts)

    __shared__ float tile[TILE_SIZE][TILE_SIZE+1];

// load the tile of elements of input matrix allocated in global memory into the shared memory

    for (int i=0; i<TILE_SIZE; i+=BLOCK_ROWS) {

// calculate x and y coordinates of element from input matrix that will be loaded by a single thread into the tile allocated in shared memory

        int x = blockIdx.x * TILE_SIZE + threadIdx.x;

        int y = blockIdx.y * TILE_SIZE + threadIdx.y + i;

// to avoid reading elements out of range of input matrix

        if (x<sizeX && y<sizeY) {

            tile[threadIdx.y + i][threadIdx.x] = inMatrix[y*sizeX + x];

        }

    }

// wait untill all global and shared memory accesses made by threads of thread block are visible to all threads in the block

    __syncthreads();

// store the tile of elements in a transposed way into the output matrix

    for (int i=0; i<TILE_SIZE; i+=BLOCK_ROWS) {

// calculate transposed x and y coordinates of element in output matrix

        int x = blockIdx.y * TILE_SIZE + threadIdx.x;

        int y = blockIdx.x * TILE_SIZE + threadIdx.y + i;

// to avoid storing elements out of range of output matrix

        if (x<sizeY && y<sizeX) {

            outMatrix[y * sizeY + x] = tile[threadIdx.x][threadIdx.y + i];

        }

    }

}

Execution configuration is

dim3 gridDim((sizeX + TILE_SIZE - 1)/TILE_SIZE, (sizeY + TILE_SIZE - 1)/TILE_SIZE);

dim3 blockDim(TILE_SIZE,BLOCK_ROWS);

where sizeX and sizeY are arbitrary dimensions of matrix to transpose

avidday · March 30, 2011, 11:56am

Are you sure you aren’t running out of memory or hitting the display driver watchdog timer limit, if this is running on a card with an attached display?

pandomovnik · March 30, 2011, 12:30pm

You are right, yes, thanks!! My stupid mistake! I forgot to calculate with space needed for storing output matrix! External Image External Image

But anyway my results are slightly lower than theoretical maximum of input matrix dimension can be,

calculate with me (I’m running on GeForce GTX 295 with 939327488 Bytes of global memory):

we need to store 2 matrices (input and transposed one) of float elements in global memory,

size of float is 4B,

so max. theoretical dimension of square matrix that can be flawlessly processed by my kernel is

sqrt((939327488 / 2) / 4) = 10835

but I can run my kernel with matrices of max size of 10165 x 10165 elements. Maybe some memory is reserved by OS (Linux)?

avidday · March 30, 2011, 12:37pm

The CUDA context does take some memory, and there are page size considerations too. You should find the available memory is something like 50Mb less than the capacity of the card on a dedicated compute 1.3 device (more on Fermi because of printf and C++ support). You might also have to round the amount of memory requested in a malloc call by a small amount because of page sizes. IIRC “big” pages on the GT200 were 1Mb each, so the minimum granularity of allocations is of that order.

sergeyn · April 3, 2011, 10:00am

you could have maximum size as big as sqrt(939327488 / 4) = 15324 if you do an in-place transpose.

Topic		Replies	Views
CUDA Memory Transpose CUDA Programming and Performance	10	2088	March 23, 2015
Matrix Size Limitation CUDA Programming and Performance	2	1283	September 1, 2009
CUDA matrix transpose using shared memory CUDA Programming and Performance cuda	0	523	July 6, 2020
matrix_mul with max_size CUDA Programming and Performance	1	1120	May 21, 2010
problem with big matrix CUDA Programming and Performance	3	2019	August 29, 2008
Matrix transpose problem (SDK example) for matrices that are not multiple of 16 CUDA Programming and Performance	0	5508	December 2, 2007
rectangular matrix transpose CUDA Programming and Performance	3	7748	April 30, 2008
Batch matrix transposed CUDA Programming and Performance	0	2401	August 21, 2009
How to improve this matrix multiplication code in CUDA? CUDA Programming and Performance	6	1577	July 2, 2015
Question about tranpose CUDA Programming and Performance	19	7485	June 11, 2008

Max matrix size for matrix transposition

Related topics