CUBLAS alloc limit

sptx · June 1, 2009, 3:42pm

Just wondering what kind of memory limits you guys are hitting with CUBLAS. We’re using cublasSgemm on a Tesla C1060 (4GB GDDR3). We are trying to allocate a ~1.97GB matrix, but are getting a CUBLAS error on the cublasAlloc() calls. In the code below, the variables to look at are cubeArray and xptr.

[codebox]void cublasTestData(string headerFile, string dataFile, double *runStats)
{
    //Read data cube from disk
    std::clock_t start;
    double diff;
    start = std::clock();
    float *cubeArray = readData(headerFile, dataFile);  //~2GB
    runStats[1] = ( std::clock() - start ) / (double)CLOCKS_PER_SEC;

    //Initialize variables for Covariance
    float scalar = (1.0 / (float)numPixels);

    //Solution Matrix (numBands x numBands)
    float *secondTerm = (float*)malloc(sizeof(float) * numBands * numBands);

    //Cube Array Device Memory
    float* xptr;
    //Solution Matrix Device Memory
    float* yptr;
    //Unit vector Device Memory
    float* zptr;
    //Signature Sums Device Memory
    float* sigptr;

    //Unit Vector (numPixels x 1)
    float* unitVector = (float*)malloc(sizeof(float) * numPixels);
    float* signatureSums = (float*)malloc(sizeof(float) * numBands);
    for(int i=0; i<numPixels; i++){
        unitVector[i] = 1.0;
    }
    memset(secondTerm, 0, sizeof(float) * numBands * numBands);
    memset(signatureSums, 0, sizeof(float) * numBands);

    //CUBLAS State (error handling)
    cublasStatus state;
    //Any status other than success means initialization failed
    if(cublasInit() != CUBLAS_STATUS_SUCCESS) {
        printf("CUBLAS init error.\n");
    }

    //Allocate device memory for data cube
    state = cublasAlloc(numBands*numPixels, sizeof(*cubeArray), (void**)&xptr);
    if(state != CUBLAS_STATUS_SUCCESS) {
        printf("Error allocating video memory.\n");  //Error being thrown here
    }

    //Allocate device memory for solution
    state = cublasAlloc(numBands*numBands, sizeof(*secondTerm), (void**)&yptr);
    if(state != CUBLAS_STATUS_SUCCESS) {
        printf("Error allocating video memory.\n");
    }

    //Allocate device memory for unit vector
    state = cublasAlloc(numPixels, sizeof(*unitVector), (void**)&zptr);
    if(state != CUBLAS_STATUS_SUCCESS) {
        printf("Error allocating video memory.\n");
    }

    //Allocate device memory for signature sums
    state = cublasAlloc(numBands, sizeof(*signatureSums), (void**)&sigptr);
    if(state != CUBLAS_STATUS_SUCCESS) {
        printf("Error allocating video memory.\n");
    }

    //Copy data cube from Host to Device
    state = cublasSetMatrix(numPixels, numBands, sizeof(*cubeArray), cubeArray, numPixels, xptr, numPixels);
    if(state != CUBLAS_STATUS_SUCCESS) {
        printf("Error copying matrix to device.\n");
    }

    //Copy solution matrix from Host to Device
    state = cublasSetMatrix(numBands, numBands, sizeof(*secondTerm), secondTerm, numBands, yptr, numBands);
    if(state != CUBLAS_STATUS_SUCCESS) {
        printf("Error copying matrix to device.\n");
    }

    //Copy unit vector from Host to Device
    state = cublasSetMatrix(numPixels, 1, sizeof(*unitVector), unitVector, numPixels, zptr, numPixels);
    if(state != CUBLAS_STATUS_SUCCESS) {
        printf("Error copying matrix to device.\n");
    }

    //Copy signature vector from Host to Device
    state = cublasSetMatrix(numBands, 1, sizeof(*signatureSums), signatureSums, numBands, sigptr, numBands);
    if(state != CUBLAS_STATUS_SUCCESS) {
        printf("Error copying matrix to device.\n");
    }

    //Per-band sums: sigptr += X * unitVector
    cublasSgemm('n', 'n', numBands, 1, numPixels, 1.0, xptr, numBands, zptr, numPixels, 1.0, sigptr, numBands);
    //Outer product of the sums, scaled by 1/numPixels^2
    cublasSgemm('n','t', numBands, numBands, 1, scalar*scalar, sigptr, numBands, sigptr, numBands, 1.0, yptr, numBands);
    //yptr = (1/numPixels) * X * X^T - yptr
    cublasSgemm('n', 't', numBands, numBands, numPixels, scalar, xptr, numBands, xptr, numBands, -1.0, yptr, numBands);

    //cublasSgemm returns void, so fetch the status of the calls above
    state = cublasGetError();
    if (state != CUBLAS_STATUS_SUCCESS) {
        printf("CUBLAS execution error.\n");
    }

    //Copy solution matrix from Device to Host
    state = cublasGetMatrix(numBands,numBands, sizeof(*yptr), yptr, numBands, secondTerm, numBands);
    if(state != CUBLAS_STATUS_SUCCESS) {
        printf("Error copying matrix from device.\n");
    }

    free(signatureSums);
    free(unitVector);
    free(secondTerm);

    runStats[0] = numBands * numRows * numCols / 1000000;
    if(dataType == 2){
        runStats[0] *= 2.0;
    }else if(dataType == 4){
        runStats[0] *= 4.0;
    }
    runStats[2] =  ( std::clock() - start ) / (double)CLOCKS_PER_SEC;

    cublasFree(xptr);
    cublasFree(yptr);
    cublasFree(zptr);
    cublasFree(sigptr);
}
[/codebox]

mfatica · June 1, 2009, 4:59pm

I am looking into this.
You can use the regular cudaMalloc (cublasAlloc is just a wrapper); I know that works fine for large allocations.
I am allocating a single matrix of 3.9 GB with cudaMalloc.

mfatica · June 1, 2009, 5:33pm

Which OS are you running?

sptx · June 1, 2009, 5:36pm

WinXP 64. We actually have a Boxx PSC (4x Tesla C1060) and are using it with the default installation (OS, etc).

sptx · June 1, 2009, 5:39pm

I just reran it using cudaMalloc rather than cublasAlloc. The alloc worked this time; however, it failed on the 2GB cublasSetMatrix().

mfatica · June 1, 2009, 5:40pm

Which CUDA version?

sptx · June 1, 2009, 5:44pm

2.1

sptx · June 1, 2009, 5:57pm

Also note, we’ve tried with smaller sizes (1.24GB and 1.66GB) with success. It seems to break somewhere in the 1.7-2.1GB range.

mfatica · June 1, 2009, 8:12pm

Under WinXP64, the largest object you can allocate with cublasAlloc is approximately 4,232,800,000 bytes:

status = cublasAlloc(1058200000,sizeof(float),(void**)&devPtr);

returns CUBLAS_STATUS_SUCCESS.

cublasSetMatrix has some limitations on the sizes if you are transferring a sub-matrix (it uses cudaMemcpy2D, which has a limit on the maximum pitch). What are the actual numbers in the call?
I posted a slow version somewhere on the forum that has no limit; try searching my posts.
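
The slower, limit-free routine mentioned above is not reproduced in this thread. As a rough illustration of one way around a per-call size limit, the C++ sketch below only plans how a column-major matrix could be split into blocks of whole columns, each small enough for a separate transfer. The names `ColumnBlock` and `planColumnBlocks` and the `maxBytesPerCopy` parameter are hypothetical; the actual pitch limit of cudaMemcpy2D is not stated in the thread, so that value is left to the caller.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One block of consecutive columns of a column-major matrix.
struct ColumnBlock {
    std::uint64_t firstCol;  // index of the first column in the block
    std::uint64_t numCols;   // number of columns in the block
};

// Split a rows x cols column-major matrix into blocks of whole columns so
// that each block occupies at most maxBytesPerCopy bytes. When the leading
// dimension equals rows, every block is one contiguous range of memory and
// could be sent with a single plain copy.
// maxBytesPerCopy is a hypothetical, caller-chosen limit.
std::vector<ColumnBlock> planColumnBlocks(std::uint64_t rows,
                                          std::uint64_t cols,
                                          std::uint64_t elemSize,
                                          std::uint64_t maxBytesPerCopy)
{
    const std::uint64_t bytesPerCol = rows * elemSize;
    // At least one full column must fit in a single copy.
    assert(bytesPerCol > 0 && bytesPerCol <= maxBytesPerCopy);

    const std::uint64_t colsPerBlock = maxBytesPerCopy / bytesPerCol;
    std::vector<ColumnBlock> blocks;
    for (std::uint64_t col = 0; col < cols; col += colsPerBlock) {
        const std::uint64_t remaining = cols - col;
        ColumnBlock block;
        block.firstCol = col;
        block.numCols = remaining < colsPerBlock ? remaining : colsPerBlock;
        blocks.push_back(block);
    }
    return blocks;
}
```

Each planned block would start at byte offset `firstCol * rows * elemSize` in both the host and the device buffer, so the copies can be issued one after another without any pitch argument.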

sptx · June 1, 2009, 8:32pm

I think we found the problem. The file was 2GB full of integers, but to use CUBLAS we had to cast them to float, which resulted in a total size of 1062400000 * sizeof(float) (4,249,600,000 bytes), which is clearly too big to fit in 4GB.

We are calling cublasSetMatrix(1600000, 664, sizeof(*cubeArray), cubeArray, numPixels, xptr, numPixels), which again is too big.

Thanks for all your help!! =)
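
The arithmetic behind that conclusion can be checked on the host before any allocation is attempted. The following is a minimal C++ sketch, not code from the thread: the helper names `matrixBytes` and `fitsInAllocation` are hypothetical, the dimensions are the ones quoted in the cublasSetMatrix call, and the limit is the approximate WinXP64 maximum reported earlier in the thread. The 2-byte on-disk element size is an assumption; it is not stated in the thread, but it is consistent with the ~1.97GB file size.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed to hold rows x cols elements of elemSize bytes each.
// The product is computed in 64 bits so it cannot wrap around the way
// a 32-bit int product would for multi-gigabyte matrices.
std::uint64_t matrixBytes(std::uint64_t rows, std::uint64_t cols, std::uint64_t elemSize)
{
    assert(elemSize > 0);
    return rows * cols * elemSize;
}

// True if a request of `needed` bytes is within `limit` bytes.
bool fitsInAllocation(std::uint64_t needed, std::uint64_t limit)
{
    return needed <= limit;
}

// Values quoted in the thread.
const std::uint64_t kNumPixels     = 1600000;       // rows passed to cublasSetMatrix
const std::uint64_t kNumBands      = 664;           // columns passed to cublasSetMatrix
const std::uint64_t kMaxAllocBytes = 4232800000ULL; // approximate limit reported above

// Assumption: the file stores 2-byte integers (not stated in the thread).
const std::uint64_t kBytesOnDisk  = matrixBytes(kNumPixels, kNumBands, 2);
// Size of the same data once every element has been converted to float.
const std::uint64_t kBytesAsFloat = matrixBytes(kNumPixels, kNumBands, sizeof(float));
```

With these numbers the float copy needs 4,249,600,000 bytes, about 16.8 MB more than the reported limit, so `fitsInAllocation(kBytesAsFloat, kMaxAllocBytes)` is false. Under the 2-byte assumption the file itself is 2,124,800,000 bytes.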
