bug in memory allocation?

Hi,

maybe I’m missing something but I have some trouble with the following code. I’m trying to allocate an as large as possible chunk of memory in the beginning of my program using cublas to avoid memory fragmentation. I try to determine the biggest available chunk of continuous memory by decreasing the block size beginning at 2 GB down to 0 bytes. The desired result is the allocation of something between 80% and 90% of the total available memory. However, on my Windows 7 64 bit, this fails. I think that all subsequent calls to cublasAlloc fail once a call has failed because I tried to request too much memory.

I have tried driver versions 296.10 and the current beta version 301.24, both with the same results.

I should mention that the code runs perfectly fine on the same machine under Gentoo Linux x86_64, using nvidia-drivers-295.40.

Can anyone reproduce these results? Or maybe tell me a setup where this code works as desired?

I’m using a GTX 550 Ti.

Cheers

#include <iostream>

#include <cublas.h>

using namespace std;

inline void checkStatus(cublasStatus s) {

	if (s != CUBLAS_STATUS_SUCCESS) {

		switch (s) {

		case CUBLAS_STATUS_ALLOC_FAILED:

			cerr << "CUBLAS_STATUS_ALLOC_FAILED" << endl;

			break;

		case CUBLAS_STATUS_ARCH_MISMATCH:

			cerr << "CUBLAS_STATUS_ARCH_MISMATCH" << endl;

			break;

		case CUBLAS_STATUS_EXECUTION_FAILED:

			cerr << "CUBLAS_STATUS_EXECUTION_FAILED" << endl;

			break;

		case CUBLAS_STATUS_INTERNAL_ERROR:

			cerr << "CUBLAS_STATUS_INTERNAL_ERROR" << endl;

			break;

		case CUBLAS_STATUS_INVALID_VALUE:

			cerr << "CUBLAS_STATUS_INVALID_VALUE" << endl;

			break;

		case CUBLAS_STATUS_MAPPING_ERROR:

			cerr << "CUBLAS_STATUS_MAPPING_ERROR" << endl;

			break;

		case CUBLAS_STATUS_NOT_INITIALIZED:

			cerr << "CUBLAS_STATUS_NOT_INITIALIZED" << endl;

			break;

		default:

			cerr << "CUBLAS_UNKNOWN_ERROR" << endl;

		}

	};

}

int main(int argc, char* argv[]) {

	cublasStatus result;

	result = cublasInit();

	checkStatus(result);

	void* ptr = NULL;

	int n = 1024 * 1024 * 100;

	result = cublasAlloc(n, 1, &ptr);

	

	if (result == CUBLAS_STATUS_SUCCESS && ptr)

		cout << "memory allocation of " << n / 1024 / 1024 << " MB successful." << endl;

	else

		checkStatus(result);

	if (ptr)

		cublasFree(ptr);

	n = numeric_limits<int>::max();

	while (n > 0) {

		result = cublasAlloc(n, 1, &ptr);

		if (result != CUBLAS_STATUS_ALLOC_FAILED)

			break;

		cerr << "mermory allocation of " << n / 1024 / 1024 << " MB failed." << endl;

		n -= 10 * 1024*1024;

	}

	checkStatus(result);

	if (n <= 0 || ptr == NULL) {

		cerr << "Memory allocation failed!" << endl;

	}

	if (ptr)

		cublasFree(ptr);

	result = cublasShutdown();

	checkStatus(result);

	return 0;

}

Added the whole MSVC10 solution and the Linux makefile as an attachment.
memAllocFail.rar (3.17 KB)

p.s.: I get the exact same behavior when I use cudaMalloc instead of cublasAlloc.

cudaError_t cures;

n = numeric_limits<int>::max();

while (n > 0) {

	cures = cudaMalloc(&ptr, n);

	if (cures != CUDA_ERROR_OUT_OF_MEMORY)

		break;

	cerr << "mermory allocation of " << n / 1024 / 1024 << " MB failed." << endl;

	n -= 10 * 1024*1024;

}

i get the same problem when call cudaMalloc on gtx 680, win7.

Try calling [font=“Courier New”]cudaGetLastError()[/font] after the failing allocation to reset the error. Otherwise all following CUDA calls will keep returning the error code of the failed call.

tera: thanks, but that didn’t change anything.

However, I managed to further pin down the issue. If I replace the

n -= 10 * 1024*1024;

by something like

n *= 0.9;

the iteration succeeds as soon as n is below the largest available block size.

Again, a fully working (or not, depending on which line you use to decrease n ;) ) example:

#include <iostream>

#include <limits>

#include <cuda_runtime.h>

using namespace std;

inline void checkStatus(cudaError_t e) {

	if (e != cudaSuccess) {

		cerr << cudaGetErrorString(e) << endl;

	};

}

void DisplayProperties( cudaDeviceProp* pDeviceProp ) {

	if( !pDeviceProp )

		return;

	cout <<  endl << "Device Name " << pDeviceProp->name << endl;

	cout <<  "**************************************" << endl;

	cout <<  "Total Global Memory\t\t\t" << pDeviceProp->totalGlobalMem/1024 << " KB" << endl;

	cout <<  "Shared memory available per block\t" << pDeviceProp->sharedMemPerBlock/1024 << " KB" << endl;

	cout <<  "Number of registers per thread block\t" << pDeviceProp->regsPerBlock << endl;

	cout <<  "Warp size in threads\t\t\t" << pDeviceProp->warpSize << endl;

	cout <<  "Memory Pitch\t\t\t\t" << pDeviceProp->memPitch << " bytes" << endl;

	cout <<  "Maximum threads per block\t\t" << pDeviceProp->maxThreadsPerBlock << endl;

	cout <<  "Maximum Thread Dimension (block)\t" << pDeviceProp->maxThreadsDim[0] << " " << pDeviceProp->maxThreadsDim[1] << " " << pDeviceProp->maxThreadsDim[2] << endl;

	cout <<  "Maximum Thread Dimension (grid)\t\t" << pDeviceProp->maxGridSize[0] << " " << pDeviceProp->maxGridSize[1] << " " << pDeviceProp->maxGridSize[2] << endl;

	cout <<  "Total constant memory\t\t\t" << pDeviceProp->totalConstMem << " bytes" << endl;

	cout <<  "CUDA ver\t\t\t\t" << pDeviceProp->major << "." << pDeviceProp->minor << endl;

	cout <<  "Clock rate\t\t\t\t" << pDeviceProp->clockRate << " KHz" << endl;

	cout <<  "Texture Alignment\t\t\t" << pDeviceProp->textureAlignment << " bytes" << endl;

	cout <<  "Device Overlap\t\t\t\t" << (pDeviceProp->deviceOverlap?"Allowed":"Not Allowed") << endl;

	cout <<  "Number of Multi processors\t\t" << pDeviceProp->multiProcessorCount << endl << endl;

}

int main(int argc, char* argv[]) {

	cudaError_t cudaStat;

	

	cudaStat = cudaSetDevice(0);

	checkStatus(cudaStat);

	cudaDeviceProp devProps;

	cudaStat = cudaGetDeviceProperties(&devProps,0);

	checkStatus(cudaStat);

	DisplayProperties(&devProps);

	void* ptr = NULL;

	int n = 1024 * 1024 * 100;

	cudaStat = cudaMalloc(&ptr, n);

	

	if (cudaStat == cudaSuccess && ptr)

		cout << "memory allocation of " << n << " MB successful." << endl;

	else

		checkStatus(cudaStat);

	

	if (ptr)

		cudaFree(ptr);

	n = numeric_limits<int>::max();

	while (n > 0) {

		cudaStat = cudaMalloc(&ptr,n);

		if (cudaStat != cudaErrorMemoryAllocation) {

			cout << "successfully allocated " << n << " bytes of memory!" << endl;

			break;

		}

		cudaStat = cudaMalloc(&ptr,490733567);

		if (cudaStat != cudaErrorMemoryAllocation && ptr) {

			cudaFree(ptr);

		}

		cerr << "mermory allocation of " << n << " bytes failed: " << endl;

		checkStatus(cudaGetLastError());

		//n -= 10 * 1024*1024; // this fails?! wth?

		n *= 0.9; // this works!

	}

	checkStatus(cudaStat);

	if (n <= 0 || ptr == NULL) {

		cerr << "Memory allocation failed!" << endl;

	} else {

		cout << "Allocation of " << n / 1024 / 1024 << " MB memory successful." << endl;

	}

	if (ptr)

		cudaFree(ptr);

	return 0;

}

And here is the output using a constant decrement:

Device Name GeForce GTX 550 Ti


Total Global Memory 1048256 KB

Shared memory available per block 48 KB

Number of registers per thread block 32768

Warp size in threads 32

Memory Pitch 2147483647 bytes

Maximum threads per block 1024

Maximum Thread Dimension (block) 1024 1024 64

Maximum Thread Dimension (grid) 65535 65535 65535

Total constant memory 65536 bytes

CUDA ver 2.1

Clock rate 1800000 KHz

Texture Alignment 512 bytes

Device Overlap Allowed

Number of Multi processors 4

memory allocation of 104857600 MB successful.

mermory allocation of 2147483647 bytes failed: out of memory

mermory allocation of 2136997887 bytes failed: out of memory

… a lot of failures …

mermory allocation of 18874367 bytes failed: out of memory

mermory allocation of 8388607 bytes failed: out of memory

out of memory

Memory allocation failed!

And here when using a multiplier < 1:

Device Name GeForce GTX 550 Ti


Total Global Memory 1048256 KB

Shared memory available per block 48 KB

Number of registers per thread block 32768

Warp size in threads 32

Memory Pitch 2147483647 bytes

Maximum threads per block 1024

Maximum Thread Dimension (block) 1024 1024 64

Maximum Thread Dimension (grid) 65535 65535 65535

Total constant memory 65536 bytes

CUDA ver 2.1

Clock rate 1800000 KHz

Texture Alignment 512 bytes

Device Overlap Allowed

Number of Multi processors 4

memory allocation of 104857600 MB successful.

mermory allocation of 2147483647 bytes failed: out of memory

mermory allocation of 1952257860 bytes failed: out of memory

mermory allocation of 1774779872 bytes failed: out of memory

mermory allocation of 1613436247 bytes failed: out of memory

mermory allocation of 1466760224 bytes failed: out of memory

mermory allocation of 1333418385 bytes failed: out of memory

mermory allocation of 1212198531 bytes failed: out of memory

mermory allocation of 1101998664 bytes failed: out of memory

mermory allocation of 1001816967 bytes failed: out of memory

mermory allocation of 910742697 bytes failed: out of memory

mermory allocation of 827947906 bytes failed: out of memory

mermory allocation of 752679914 bytes failed: out of memory

successfully allocated 684254467 bytes of memory!

Allocation of 652 MB memory successful.

Hi!

Are you sure that you are freeing the memory pointed by ptr in each iteration?

i.e

if (cudaStat != cudaErrorMemoryAllocation && ptr) {

                        cudaFree(ptr);

                }

This performs what you want?