CUBLAS alloc limit

Just wondering what kind of memory limits you guys are seeing with CUBLAS. We’re using cublasSgemm on a Tesla C1060 (4GB GDDR3). We are trying to allocate a ~1.97GB matrix, but are getting a CUBLAS error on the cublasAlloc() calls. In the code below, the variables to look at are cubeArray and xptr.

[codebox]#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <ctime>
#include <string>
#include "cublas.h"

// Excerpt: numPixels, numBands, numRows, numCols, dataType and readData()
// are defined elsewhere in the program.
void cublasTestData(string headerFile, string dataFile, double *runStats)
{
	// Read data cube from disk
	std::clock_t start = std::clock();
	float *cubeArray = readData(headerFile, dataFile);  // ~2GB
	runStats[1] = ( std::clock() - start ) / (double)CLOCKS_PER_SEC;

	// Initialize variables for covariance
	float scalar = 1.0f / (float)numPixels;

	// Solution matrix (numBands x numBands)
	float *secondTerm = (float*)malloc(sizeof(float) * numBands * numBands);

	float* xptr;    // Cube array device memory
	float* yptr;    // Solution matrix device memory
	float* zptr;    // Unit vector device memory
	float* sigptr;  // Signature sums device memory

	// Unit vector (numPixels x 1)
	float* unitVector = (float*)malloc(sizeof(float) * numPixels);
	float* signatureSums = (float*)malloc(sizeof(float) * numBands);
	for(int i=0; i<numPixels; i++){
		unitVector[i] = 1.0f;
	}
	memset(secondTerm, 0, sizeof(float) * numBands * numBands);
	memset(signatureSums, 0, sizeof(float) * numBands);

	// CUBLAS state (error handling)
	cublasStatus state;
	if(cublasInit() != CUBLAS_STATUS_SUCCESS) {
		printf("CUBLAS init error.\n");
	}

	// Allocate device memory for data cube
	state = cublasAlloc(numBands*numPixels, sizeof(*cubeArray), (void**)&xptr);
	if(state != CUBLAS_STATUS_SUCCESS) {
		printf("Error allocating device memory.\n");  // Error being thrown here
	}

	// Allocate device memory for solution
	state = cublasAlloc(numBands*numBands, sizeof(*secondTerm), (void**)&yptr);
	if(state != CUBLAS_STATUS_SUCCESS) {
		printf("Error allocating device memory.\n");
	}

	// Allocate device memory for unit vector
	state = cublasAlloc(numPixels, sizeof(*unitVector), (void**)&zptr);
	if(state != CUBLAS_STATUS_SUCCESS) {
		printf("Error allocating device memory.\n");
	}

	// Allocate device memory for signature sums
	state = cublasAlloc(numBands, sizeof(*signatureSums), (void**)&sigptr);
	if(state != CUBLAS_STATUS_SUCCESS) {
		printf("Error allocating device memory.\n");
	}

	// Copy data cube from host to device
	state = cublasSetMatrix(numPixels, numBands, sizeof(*cubeArray), cubeArray, numPixels, xptr, numPixels);
	if(state != CUBLAS_STATUS_SUCCESS) {
		printf("Error copying matrix to device.\n");
	}

	// Copy solution matrix from host to device
	state = cublasSetMatrix(numBands, numBands, sizeof(*secondTerm), secondTerm, numBands, yptr, numBands);
	if(state != CUBLAS_STATUS_SUCCESS) {
		printf("Error copying matrix to device.\n");
	}

	// Copy unit vector from host to device
	state = cublasSetMatrix(numPixels, 1, sizeof(*unitVector), unitVector, numPixels, zptr, numPixels);
	if(state != CUBLAS_STATUS_SUCCESS) {
		printf("Error copying matrix to device.\n");
	}

	// Copy signature vector from host to device
	state = cublasSetMatrix(numBands, 1, sizeof(*signatureSums), signatureSums, numBands, sigptr, numBands);
	if(state != CUBLAS_STATUS_SUCCESS) {
		printf("Error copying matrix to device.\n");
	}

	// Band sums, then the two covariance terms
	cublasSgemm('n', 'n', numBands, 1, numPixels, 1.0f, xptr, numBands, zptr, numPixels, 1.0f, sigptr, numBands);
	cublasSgemm('n', 't', numBands, numBands, 1, scalar*scalar, sigptr, numBands, sigptr, numBands, 1.0f, yptr, numBands);
	cublasSgemm('n', 't', numBands, numBands, numPixels, scalar, xptr, numBands, xptr, numBands, -1.0f, yptr, numBands);

	// cublasSgemm returns nothing; fetch the status of the calls above
	state = cublasGetError();
	if (state != CUBLAS_STATUS_SUCCESS) {
		printf("CUBLAS execution error.\n");
	}

	// Copy solution matrix from device to host
	state = cublasGetMatrix(numBands, numBands, sizeof(*yptr), yptr, numBands, secondTerm, numBands);
	if (state != CUBLAS_STATUS_SUCCESS) {
		printf("Error copying matrix from device.\n");
	}

	free(signatureSums);
	free(unitVector);
	free(secondTerm);

	// Use floating-point arithmetic so the division does not truncate
	runStats[0] = (double)numBands * numRows * numCols / 1000000.0;
	if(dataType == 2){
		runStats[0] *= 2.0;
	}else if(dataType == 4){
		runStats[0] *= 4.0;
	}
	runStats[2] = ( std::clock() - start ) / (double)CLOCKS_PER_SEC;

	cublasFree(xptr);
	cublasFree(yptr);
	cublasFree(zptr);
	cublasFree(sigptr);
}
[/codebox]
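
For readers following the three cublasSgemm calls: read together, they appear to compute a (biased) band covariance matrix. This is an interpretation of the posted code rather than something stated in the post. The symbols are assumptions: X is the numBands x numPixels data cube, N is numPixels, and 1 is the all-ones vector held in unitVector.

```latex
\begin{aligned}
s &= X\,\mathbf{1}
  && \text{(first call: per-band sums, stored in sigptr)} \\
Y &= \frac{1}{N^{2}}\, s\, s^{\mathsf T}
  && \text{(second call: outer product of the sums, stored in yptr)} \\
C &= \frac{1}{N}\, X X^{\mathsf T} - Y
  && \text{(third call: result overwrites yptr)}
\end{aligned}
```

With the mean vector written as s / N, this is the usual form: the mean of the outer products minus the outer product of the means.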

I am looking into this.
You can use the regular cudaMalloc (cublasAlloc is just a wrapper around it); I know that works fine for large allocations.
I am allocating a single matrix of 3.9 GB with cudaMalloc.

Which OS are you running?

WinXP 64. We actually have a Boxx PSC (4x Tesla C1060) and are using it with the default installation (OS, etc).

I just reran it using cudaMalloc rather than cublasAlloc. The alloc worked this time, but it failed on the 2GB cublasSetMatrix().

Which CUDA version?

2.1

Also note, we’ve tried with smaller sizes (1.24GB and 1.66GB) with success. It seems to break somewhere in the 1.7-2.1GB range.

Under WinXP64, the largest object you can allocate with cublasAlloc is approximately 4,232,800,000 bytes:

status = cublasAlloc(1058200000,sizeof(float),(void**)&devPtr);

returns CUBLAS_STATUS_SUCCESS.

cublasSetMatrix has some limitations on the sizes if you are transferring a sub-matrix (it uses cudaMemcpy2D, which has a limit on the maximum pitch). What are the actual numbers in the call?
I posted a slow version somewhere on the forum that has no limit; try searching my posts.
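
One general way to stay under a per-call transfer limit is to upload the matrix in column panels. The sketch below only plans the panels on the host. It is not the forum version mentioned above, and `Panel` and `planColumnPanels` are hypothetical names introduced for illustration. It assumes a column-major matrix whose leading dimension equals its row count, so each panel is one contiguous run of elements that could be passed to its own cublasSetMatrix or cudaMemcpy call at the returned offset.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical description of one contiguous block of columns.
struct Panel {
    std::uint64_t firstCol;  // index of the first column in the panel
    std::uint64_t numCols;   // number of columns in the panel
    std::uint64_t offset;    // element offset of the panel from the matrix start
};

// Split a rows x cols column-major matrix (leading dimension == rows) into
// panels of whole columns, each at most maxBytes in size.
// Returns an empty vector if even a single column exceeds maxBytes.
std::vector<Panel> planColumnPanels(std::uint64_t rows, std::uint64_t cols,
                                    std::uint64_t elemSize, std::uint64_t maxBytes) {
    std::vector<Panel> panels;
    const std::uint64_t colBytes = rows * elemSize;
    if (cols == 0 || colBytes == 0 || colBytes > maxBytes) {
        return panels;
    }
    const std::uint64_t colsPerPanel = maxBytes / colBytes;
    for (std::uint64_t c = 0; c < cols; c += colsPerPanel) {
        const std::uint64_t n = std::min(colsPerPanel, cols - c);
        panels.push_back(Panel{c, n, c * rows});
    }
    return panels;
}
```

A caller would loop over the returned panels and copy `numCols * rows` elements starting at `hostPtr + offset` to `devicePtr + offset`. Chunking only helps with per-call limits; the whole matrix still has to fit in device memory.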

I think we found the problem. The file was 2GB of integers, but to use CUBLAS we had to convert them to float, which resulted in a total size of 1062400000 * sizeof(float) = 4,249,600,000 bytes. That is above the ~4,232,800,000-byte maximum quoted above, so it does not fit on the 4GB card.

We are calling cublasSetMatrix(1600000, 664, sizeof(*cubeArray), cubeArray, numPixels, xptr, numPixels), which again is too big.
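
This kind of size check can be done on the host before any allocation is attempted. The following is a minimal sketch, not the code from the post: `bytesNeeded` and `fitsOnDevice` are hypothetical helper names, and the budget passed in is whatever limit applies to the card (for example the approximate 4,232,800,000-byte figure quoted earlier in the thread). The product is computed in 64 bits so it cannot wrap a 32-bit int.

```cpp
#include <cstdint>

// Hypothetical helper: bytes required for a rows x cols matrix.
// 64-bit arithmetic keeps the product from wrapping for sizes like these.
std::uint64_t bytesNeeded(std::uint64_t rows, std::uint64_t cols,
                          std::uint64_t elemSize) {
    return rows * cols * elemSize;
}

// Hypothetical helper: true if the request is within the given byte budget.
bool fitsOnDevice(std::uint64_t bytes, std::uint64_t budgetBytes) {
    return bytes <= budgetBytes;
}
```

For the numbers in this thread, `bytesNeeded(1600000, 664, sizeof(float))` is 4,249,600,000 bytes, and `fitsOnDevice` returns false for that against a 4,232,800,000-byte budget.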

Thanks for all your help!! =)