CUDA SDK Samples and 8800 GTS cudaMalloc failure

This is really just informational and I hope I am posting into the correct area.

cudaMalloc fails in a number of the samples if the hardware is an 8800 GTS with 320 MB of memory. BlackScholes is one example. Perhaps this is stipulated in the programming guide, but I was unable to find it anywhere in the docs or on the forums. Anyway, it is pretty straightforward: OPT_SZ is specified as 20000000 and each malloc is for sizeof(float) * OPT_SZ. I believe BlackScholes has at least 5 buffers of this size, and it does fail.

This was driving me crazy. I am new to CUDA and GPGPU, and my Dell XPS 600 has both a 7800 GTX and an 8800 GTS, so I assumed I was making a mistake with the drivers or that the mixed cards were in some way not supported by CUDA or PTX. What made it worse was that some samples ran fine! The good news is that the configuration works great if OPT_SZ is set small enough in the affected samples.

This is really just picky, as the 8800 320 MB is certainly not the target for CUDA, but I hope this saves some other newbie from too much hair-pulling (mine is thin enough).


If there are 20M elements per array, each element is 4 bytes (a ‘float’), and there are 5 of those arrays, then the total amount of memory needed is 20M × 4 × 5 = 400 MB. That is more than the 320 MB on the card, so cudaMalloc() will start to fail on the later calls.
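For anyone who wants to check the numbers for a different card, here is a minimal sketch of that arithmetic in plain C. The function names bytes_needed and max_elements are made up for illustration and are not part of the SDK. It also assumes the whole card is available to the application, which in practice it is not, so treat the result as an upper bound.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes required for `buffers` arrays of `count` floats each. */
static unsigned long long bytes_needed(unsigned long long count, unsigned buffers)
{
    return count * sizeof(float) * buffers;
}

/* Largest per-buffer element count such that `buffers` float arrays
   fit into `avail` bytes (hypothetical helper, ignores any overhead). */
static unsigned long long max_elements(unsigned long long avail, unsigned buffers)
{
    assert(buffers > 0);
    return avail / (sizeof(float) * buffers);
}
```

With OPT_SZ = 20000000 and 5 buffers, bytes_needed gives 400000000. For a nominal 320 MB card (335544320 bytes), max_elements gives 16777216 elements per buffer, and the usable value will be lower once the driver and display have taken their share.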

We will add a check to see if there is enough memory on the card installed on the system.


Since the amount of DRAM is fixed and there is no memory management, a call like cudaFreeSpace() would be useful to report the largest cudaMalloc() that will succeed in the current context. Then you could divvy up the available space to suit your app. Currently it takes 20+ calls to cudaMalloc() and cudaFree() to binary chop your way to that number.
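The binary chop described above might look roughly like this. It is only a sketch: probe_fn, max_alloc and fake_probe are hypothetical names, and the probe is a stand-in so the search can be tried without a GPU. In a real program the probe would call cudaMalloc() and, on success, cudaFree() straight away. The search also assumes that if an allocation of n bytes succeeds, any smaller one does too.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical probe: returns nonzero if allocating `bytes` succeeds.
   A CUDA version would wrap cudaMalloc() followed by cudaFree(). */
typedef int (*probe_fn)(size_t bytes, void *ctx);

/* Largest size in [0, upper] for which the probe succeeds,
   assuming success is monotonic in the requested size. */
static size_t max_alloc(probe_fn probe, void *ctx, size_t upper)
{
    size_t lo = 0;      /* known to succeed (a zero-byte request) */
    size_t hi = upper;  /* nothing above this is ever tried */

    assert(probe != NULL);
    while (lo < hi) {
        /* Upper midpoint, so mid > lo and the loop always shrinks. */
        size_t mid = hi - (hi - lo) / 2;
        if (probe(mid, ctx))
            lo = mid;       /* mid fits, search above it */
        else
            hi = mid - 1;   /* mid fails, search below it */
    }
    return lo;
}

/* Stand-in probe for testing: succeeds up to the limit stored in ctx. */
static int fake_probe(size_t bytes, void *ctx)
{
    return bytes <= *(const size_t *)ctx;
}
```

Searching at byte granularity over a 320 MB range takes just under 30 probes, in line with the 20+ calls mentioned above. Rounding the search to a coarser step, say 1 MB, would cut that down.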


The driver API call

CUresult cuMemGetInfo(unsigned int *free, unsigned int *total)

can answer this. You can intermix it with runtime API calls. It was undocumented until 1.0 :)


Thanks Peter - it seems one should always try out what the manual says you can’t do. I did test it: if I call cuMemGetInfo() BEFORE any cudaMalloc(), I get zeros for both values, and I get the correct values after at least one cudaMalloc()/cudaFree() call pair.


FYI: Also I notice that cuMemGetInfo() returns the total free space, not the max cudaMalloc() that will succeed.

I can confirm this: with 1.0, it returns zeros until the first cudaMalloc() call.