I am trying to allocate the maximum amount of GPU memory. My approach is to query the free memory and attempt to allocate that size. If the allocation fails, I decrease the request size and try again, repeating until the allocation succeeds. With this process I can usually get quite close to the maximum amount of memory. However, it no longer works well with CUDA 4.0 and Fermi cards. The attached code demonstrates the problem.
There are two arguments:
--step: the number of MB by which the request size is decreased at each step (default 1)
--small-mem: make a small cudaMalloc() call before requesting the large block (default: off)
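For reference, here is a minimal sketch of the back-off strategy described above (this is my own illustration, not the attached test program; the 1 MB step corresponds to the --step knob):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t freeMem = 0, totalMem = 0;
    cudaMemGetInfo(&freeMem, &totalMem);   // query currently free device memory

    const size_t step = 1 << 20;           // back off 1 MB per retry (--step 1)
    size_t request = freeMem;              // start by asking for everything free
    void *ptr = NULL;

    // Shrink the request until cudaMalloc() succeeds or nothing is left.
    while (request > 0 && cudaMalloc(&ptr, request) != cudaSuccess) {
        cudaGetLastError();                // clear the error left by the failed call
        request -= (request > step) ? step : request;
    }

    if (ptr) {
        printf("Allocated %zu MB of %zu MB reported free\n",
               request >> 20, freeMem >> 20);
        cudaFree(ptr);
    }
    return 0;
}
```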
Here are the test results from a C2050:
./cmt --step 16
./cmt --step 1 --small-mem
./cmt --step 16 --small-mem
Tests on S1070s show that close to the peak amount of memory can be allocated regardless of these two parameters, so the problem is specific to Fermi.