The first CUDA call also initializes the context, which uses quite a bit of memory for the heap, stack, and printf buffer. Subsequent calls to cudaMalloc() will only take up the memory that is allocated (plus some small wasted space to ensure proper alignment).
The heap is there to serve future allocations from the device, not from the host. So a second call to cudaMalloc() will not be served from the heap set up while initializing the context during the first cudaMalloc(). The point I wanted to make is that the second call does not reserve another 230 MB.
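To illustrate the distinction, here is a minimal sketch (the sizes and names are just placeholders): device-side malloc() inside a kernel is served from the per-context heap governed by CU_LIMIT_MALLOC_HEAP_SIZE, while host-side cudaMalloc() allocates directly from global memory and does not touch that heap.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Device-side allocation: served from the context's device heap,
// whose size is controlled by CU_LIMIT_MALLOC_HEAP_SIZE.
__global__ void deviceAlloc()
{
    int *p = (int *)malloc(64 * sizeof(int));
    if (p) {
        p[0] = 42;
        free(p);
    }
}

int main()
{
    // Host-side allocation: comes straight from global memory,
    // not from the device heap reserved at context creation.
    int *d_buf = nullptr;
    cudaMalloc(&d_buf, 1 << 20);

    deviceAlloc<<<1, 1>>>();
    cudaDeviceSynchronize();

    cudaFree(d_buf);
    return 0;
}
```

If the kernel's malloc() requests exceed the heap limit, they return NULL; the heap cannot grow after the first kernel launch, which is why the limit must be set before the context is used.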
If you are not using printf() and malloc() on the device, and have no recursion or deeply nested function calls, you can limit the size of the context by calling cuCtxSetLimit() before the first cudaMalloc().
I have done as you said:
cuInit(0);                                        // required before any driver API call
CUcontext cuContext;
cuCtxCreate(&cuContext, 0, 0);                    // last argument is the device ordinal
size_t size = 0;
cuCtxSetLimit(CU_LIMIT_STACK_SIZE, 0);
cuCtxSetLimit(CU_LIMIT_PRINTF_FIFO_SIZE, 0);
cuCtxSetLimit(CU_LIMIT_MALLOC_HEAP_SIZE, 0);
cuCtxGetLimit(&size, CU_LIMIT_STACK_SIZE);        // size = 1K
cuCtxGetLimit(&size, CU_LIMIT_PRINTF_FIFO_SIZE);  // size = 800K
cuCtxGetLimit(&size, CU_LIMIT_MALLOC_HEAP_SIZE);  // size = 4M
void *devPtr;
cudaMalloc(&devPtr, 1024);                        // first runtime allocation
When I called the three limit functions, I found the memory usage decreased from 230 MB to 210 MB.
Then I ran the examples from the CUDA SDK, matrixMulDrv and matrixMulDynlinkJIT, and found something strange: the cuCtxCreate call in matrixMulDrv occupies 200 MB of memory, but the cuCtxCreate call in matrixMulDynlinkJIT occupies only 100 MB.
I am very confused.