Inconsistency in allocating maximum memory using cudaMalloc()


I am trying to allocate the maximum amount of GPU memory. My approach is to query the free memory and try to allocate that size. If that fails, I decrease the request size and try again, repeating until the allocation succeeds. This usually gets me pretty close to the maximum amount of memory. However, it no longer works that well with CUDA 4.0 and Fermi cards. The attached code demonstrates that.
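For readers without the attachment, the step-down search described above can be sketched roughly as follows. This is not the attached cmt.cpp. The function name find_max_alloc and the try_alloc callback are hypothetical, chosen so the loop can be run without a GPU. On a real device, free_bytes would presumably come from cudaMemGetInfo() and try_alloc would wrap a cudaMalloc() call, as noted in the comments.

```cpp
#include <cstddef>
#include <functional>

// Hypothetical helper for illustration only (not from the attached cmt.cpp).
//
// free_bytes: starting request size, e.g. the "free" value reported by
//             cudaMemGetInfo(&free, &total).
// step_bytes: how much to shrink the request after each failed attempt.
// try_alloc:  returns true if an allocation of the given size succeeds.
//             With the CUDA runtime this could be something like
//               cudaMalloc(&ptr, bytes) == cudaSuccess
//             (calling cudaGetLastError() after a failure to clear the error).
//
// Returns the first request size that succeeds, or 0 if none does.
std::size_t find_max_alloc(std::size_t free_bytes, std::size_t step_bytes,
                           const std::function<bool(std::size_t)>& try_alloc) {
    if (step_bytes == 0) {
        return 0;  // a zero step would never make progress
    }
    std::size_t request = free_bytes;
    while (request > 0) {
        if (try_alloc(request)) {
            return request;  // largest size found at this step granularity
        }
        // Shrink the request, clamping at zero to avoid unsigned wrap-around.
        request = (request > step_bytes) ? request - step_bytes : 0;
    }
    return 0;
}
```

Note that with this kind of loop the result can fall short of the true maximum by up to one step, so a larger step alone should cost at most that much.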

There are two arguments:
--step # the number of MB by which the request size is decreased at each step, default 1
--small-mem # make a cudaMalloc() call for a small size before requesting the large memory, default no

Here are the test results from a C2050:

255 MB

./cmt --step 16
250 MB

./cmt --step 1 --small-mem
254 MB

./cmt --step 16 --small-mem
2986 MB

On S1070s, tests show that close to the peak memory can be allocated regardless of these two parameters, so this appears to be specific to Fermi.


Forgot to attach the code. Here it is.
cmt.cpp (2.19 KB)