bug in memory allocation?

timtim1 · April 13, 2012, 11:29pm

Hi,

maybe I’m missing something but I have some trouble with the following code. I’m trying to allocate an as large as possible chunk of memory in the beginning of my program using cublas to avoid memory fragmentation. I try to determine the biggest available chunk of continuous memory by decreasing the block size beginning at 2 GB down to 0 bytes. The desired result is the allocation of something between 80% and 90% of the total available memory. However, on my Windows 7 64 bit, this fails. I think that all subsequent calls to cublasAlloc fail once a call has failed because I tried to request too much memory.

I have tried driver versions 296.10 and the current beta version 301.24, both with the same results.

I should mention that the code runs perfectly fine on the same machine under Gentoo Linux x86_64, using nvidia-drivers-295.40.

Can anyone reproduce these results? Or maybe tell me a setup where this code works as desired?

I’m using a GTX 550 Ti.

Cheers

#include <iostream>

#include <cublas.h>

using namespace std;

inline void checkStatus(cublasStatus s) {

	if (s != CUBLAS_STATUS_SUCCESS) {

		switch (s) {

		case CUBLAS_STATUS_ALLOC_FAILED:

			cerr << "CUBLAS_STATUS_ALLOC_FAILED" << endl;

			break;

		case CUBLAS_STATUS_ARCH_MISMATCH:

			cerr << "CUBLAS_STATUS_ARCH_MISMATCH" << endl;

			break;

		case CUBLAS_STATUS_EXECUTION_FAILED:

			cerr << "CUBLAS_STATUS_EXECUTION_FAILED" << endl;

			break;

		case CUBLAS_STATUS_INTERNAL_ERROR:

			cerr << "CUBLAS_STATUS_INTERNAL_ERROR" << endl;

			break;

		case CUBLAS_STATUS_INVALID_VALUE:

			cerr << "CUBLAS_STATUS_INVALID_VALUE" << endl;

			break;

		case CUBLAS_STATUS_MAPPING_ERROR:

			cerr << "CUBLAS_STATUS_MAPPING_ERROR" << endl;

			break;

		case CUBLAS_STATUS_NOT_INITIALIZED:

			cerr << "CUBLAS_STATUS_NOT_INITIALIZED" << endl;

			break;

		default:

			cerr << "CUBLAS_UNKNOWN_ERROR" << endl;

		}

	};

}

int main(int argc, char* argv[]) {

	cublasStatus result;

	result = cublasInit();

	checkStatus(result);

	void* ptr = NULL;

	int n = 1024 * 1024 * 100;

	result = cublasAlloc(n, 1, &ptr);

	

	if (result == CUBLAS_STATUS_SUCCESS && ptr)

		cout << "memory allocation of " << n / 1024 / 1024 << " MB successful." << endl;

	else

		checkStatus(result);

	if (ptr)

		cublasFree(ptr);

	n = numeric_limits<int>::max();

	while (n > 0) {

		result = cublasAlloc(n, 1, &ptr);

		if (result != CUBLAS_STATUS_ALLOC_FAILED)

			break;

		cerr << "mermory allocation of " << n / 1024 / 1024 << " MB failed." << endl;

		n -= 10 * 1024*1024;

	}

	checkStatus(result);

	if (n <= 0 || ptr == NULL) {

		cerr << "Memory allocation failed!" << endl;

	}

	if (ptr)

		cublasFree(ptr);

	result = cublasShutdown();

	checkStatus(result);

	return 0;

}

timtim1 · April 13, 2012, 11:34pm

Added the whole MSVC10 solution and the Linux makefile as an attachment.
memAllocFail.rar (3.17 KB)

timtim1 · April 14, 2012, 10:19am

p.s.: I get the exact same behavior when I use cudaMalloc instead of cublasAlloc.

cudaError_t cures;

n = numeric_limits<int>::max();

while (n > 0) {

	cures = cudaMalloc(&ptr, n);

	if (cures != CUDA_ERROR_OUT_OF_MEMORY)

		break;

	cerr << "mermory allocation of " << n / 1024 / 1024 << " MB failed." << endl;

	n -= 10 * 1024*1024;

}

xmuben · April 19, 2012, 2:02am

i get the same problem when call cudaMalloc on gtx 680, win7.

tera · April 19, 2012, 2:25am

Try calling [font=“Courier New”]cudaGetLastError()[/font] after the failing allocation to reset the error. Otherwise all following CUDA calls will keep returning the error code of the failed call.

timtim1 · April 19, 2012, 8:17am

tera: thanks, but that didn’t change anything.

However, I managed to further pin down the issue. If I replace the

n -= 10 * 1024*1024;

by something like

n *= 0.9;

the iteration succeeds as soon as n is below the largest available block size.

Again, a fully working (or not, depending on which line you use to decrease n External Image ) example:

#include <iostream>

#include <limits>

#include <cuda_runtime.h>

using namespace std;

inline void checkStatus(cudaError_t e) {

	if (e != cudaSuccess) {

		cerr << cudaGetErrorString(e) << endl;

	};

}

void DisplayProperties( cudaDeviceProp* pDeviceProp ) {

	if( !pDeviceProp )

		return;

	cout <<  endl << "Device Name " << pDeviceProp->name << endl;

	cout <<  "**************************************" << endl;

	cout <<  "Total Global Memory\t\t\t" << pDeviceProp->totalGlobalMem/1024 << " KB" << endl;

	cout <<  "Shared memory available per block\t" << pDeviceProp->sharedMemPerBlock/1024 << " KB" << endl;

	cout <<  "Number of registers per thread block\t" << pDeviceProp->regsPerBlock << endl;

	cout <<  "Warp size in threads\t\t\t" << pDeviceProp->warpSize << endl;

	cout <<  "Memory Pitch\t\t\t\t" << pDeviceProp->memPitch << " bytes" << endl;

	cout <<  "Maximum threads per block\t\t" << pDeviceProp->maxThreadsPerBlock << endl;

	cout <<  "Maximum Thread Dimension (block)\t" << pDeviceProp->maxThreadsDim[0] << " " << pDeviceProp->maxThreadsDim[1] << " " << pDeviceProp->maxThreadsDim[2] << endl;

	cout <<  "Maximum Thread Dimension (grid)\t\t" << pDeviceProp->maxGridSize[0] << " " << pDeviceProp->maxGridSize[1] << " " << pDeviceProp->maxGridSize[2] << endl;

	cout <<  "Total constant memory\t\t\t" << pDeviceProp->totalConstMem << " bytes" << endl;

	cout <<  "CUDA ver\t\t\t\t" << pDeviceProp->major << "." << pDeviceProp->minor << endl;

	cout <<  "Clock rate\t\t\t\t" << pDeviceProp->clockRate << " KHz" << endl;

	cout <<  "Texture Alignment\t\t\t" << pDeviceProp->textureAlignment << " bytes" << endl;

	cout <<  "Device Overlap\t\t\t\t" << (pDeviceProp->deviceOverlap?"Allowed":"Not Allowed") << endl;

	cout <<  "Number of Multi processors\t\t" << pDeviceProp->multiProcessorCount << endl << endl;

}

int main(int argc, char* argv[]) {

	cudaError_t cudaStat;

	

	cudaStat = cudaSetDevice(0);

	checkStatus(cudaStat);

	cudaDeviceProp devProps;

	cudaStat = cudaGetDeviceProperties(&devProps,0);

	checkStatus(cudaStat);

	DisplayProperties(&devProps);

	void* ptr = NULL;

	int n = 1024 * 1024 * 100;

	cudaStat = cudaMalloc(&ptr, n);

	

	if (cudaStat == cudaSuccess && ptr)

		cout << "memory allocation of " << n << " MB successful." << endl;

	else

		checkStatus(cudaStat);

	

	if (ptr)

		cudaFree(ptr);

	n = numeric_limits<int>::max();

	while (n > 0) {

		cudaStat = cudaMalloc(&ptr,n);

		if (cudaStat != cudaErrorMemoryAllocation) {

			cout << "successfully allocated " << n << " bytes of memory!" << endl;

			break;

		}

		cudaStat = cudaMalloc(&ptr,490733567);

		if (cudaStat != cudaErrorMemoryAllocation && ptr) {

			cudaFree(ptr);

		}

		cerr << "mermory allocation of " << n << " bytes failed: " << endl;

		checkStatus(cudaGetLastError());

		//n -= 10 * 1024*1024; // this fails?! wth?

		n *= 0.9; // this works!

	}

	checkStatus(cudaStat);

	if (n <= 0 || ptr == NULL) {

		cerr << "Memory allocation failed!" << endl;

	} else {

		cout << "Allocation of " << n / 1024 / 1024 << " MB memory successful." << endl;

	}

	if (ptr)

		cudaFree(ptr);

	return 0;

}

And here is the output using a constant decrement:

Device Name GeForce GTX 550 Ti

Total Global Memory 1048256 KB

Shared memory available per block 48 KB

Number of registers per thread block 32768

Warp size in threads 32

Memory Pitch 2147483647 bytes

Maximum threads per block 1024

Maximum Thread Dimension (block) 1024 1024 64

Maximum Thread Dimension (grid) 65535 65535 65535

Total constant memory 65536 bytes

CUDA ver 2.1

Clock rate 1800000 KHz

Texture Alignment 512 bytes

Device Overlap Allowed

Number of Multi processors 4

memory allocation of 104857600 MB successful.

mermory allocation of 2147483647 bytes failed: out of memory

mermory allocation of 2136997887 bytes failed: out of memory

… a lot of failures …

mermory allocation of 18874367 bytes failed: out of memory

mermory allocation of 8388607 bytes failed: out of memory

out of memory

Memory allocation failed!

And here when using a multiplier < 1:

Device Name GeForce GTX 550 Ti

Total Global Memory 1048256 KB

Shared memory available per block 48 KB

Number of registers per thread block 32768

Warp size in threads 32

Memory Pitch 2147483647 bytes

Maximum threads per block 1024

Maximum Thread Dimension (block) 1024 1024 64

Maximum Thread Dimension (grid) 65535 65535 65535

Total constant memory 65536 bytes

CUDA ver 2.1

Clock rate 1800000 KHz

Texture Alignment 512 bytes

Device Overlap Allowed

Number of Multi processors 4

memory allocation of 104857600 MB successful.

mermory allocation of 2147483647 bytes failed: out of memory

mermory allocation of 1952257860 bytes failed: out of memory

mermory allocation of 1774779872 bytes failed: out of memory

mermory allocation of 1613436247 bytes failed: out of memory

mermory allocation of 1466760224 bytes failed: out of memory

mermory allocation of 1333418385 bytes failed: out of memory

mermory allocation of 1212198531 bytes failed: out of memory

mermory allocation of 1101998664 bytes failed: out of memory

mermory allocation of 1001816967 bytes failed: out of memory

mermory allocation of 910742697 bytes failed: out of memory

mermory allocation of 827947906 bytes failed: out of memory

mermory allocation of 752679914 bytes failed: out of memory

successfully allocated 684254467 bytes of memory!

Allocation of 652 MB memory successful.

insmvb00 · May 24, 2012, 11:47am

Hi!

Are you sure that you are freeing the memory pointed by ptr in each iteration?

i.e

if (cudaStat != cudaErrorMemoryAllocation && ptr) {

                        cudaFree(ptr);

                }

This performs what you want?

Topic		Replies	Views
Limit on cublasAlloc? CUDA Programming and Performance	16	10773	October 2, 2010
Memory allocation reliablity CUDA Programming and Performance	8	3240	August 18, 2008
Allocation error causes all other allocation attempts to fail CUDA Programming and Performance	1	8295	March 21, 2011
cublasAlloc fails even though there is enough memory CUDA Programming and Performance	4	10924	December 15, 2009
memory allocation problem CUDA Programming and Performance	2	4831	September 8, 2009
Tesla C1060 Allocation Problem CUDA Programming and Performance	1	2594	July 8, 2009
cudaMalloc fails on huge allocation CUDA Programming and Performance	4	818	March 28, 2011
CUBLAS alloc limit CUDA Programming and Performance	9	10844	June 1, 2009
Cublas functions, matrix size limit..? Able to allocate too much memory through cublasAlloc CUDA Programming and Performance	0	2359	March 18, 2009
Memory Allocation Behavior CUDA Programming and Performance	1	5464	November 10, 2010

bug in memory allocation?

Related topics