Show sizes of GPU memory usage (e.g. log cudaMalloc) when CUDA reports "out of memory" at runtime

On a new device (a GeForce GTX 1080) I get ‘out of memory’.
(It is reported by cudaGetLastError after cudaDeviceSynchronize).
I am assuming this is a programming error.

Are there any tools which can list all the GPU buffers
that the host code has allocated, along with their sizes?
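(Absent such a tool, I suppose I could wrap cudaMalloc myself and log each
allocation; a rough sketch of what I mean, with made-up names:)

#include <cstdio>

// Hypothetical logging wrapper, illustrative only:
static cudaError_t loggedCudaMalloc (void **devPtr, size_t bytes, const char *tag)
{
    cudaError_t err = cudaMalloc (devPtr, bytes);
    fprintf (stderr, "cudaMalloc %s: %zu bytes (%s)\n", tag, bytes,
             cudaGetErrorString (err));
    return err;
}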

Is it possible to check for running out of GPU memory
before launching kernels (which otherwise fail at run time)?

Many thanks
Bill

More information: cudaMemGetInfo says that after all the cudaMalloc() calls
the GTX 1080 still has 216 million bytes free, yet I still get the
CUDA error “out of memory” after the first kernel is launched.
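(For reference, this is roughly how I check; a minimal sketch rather than the
actual code, and my_kernel, grid and block are placeholders:)

size_t freeBytes = 0, totalBytes = 0;
cudaMemGetInfo (&freeBytes, &totalBytes);
printf ("GPU memory: %zu bytes free of %zu total\n", freeBytes, totalBytes);

my_kernel<<<grid, block>>>();            // placeholder launch
cudaDeviceSynchronize ();                // wait for the kernel to finish
cudaError_t err = cudaGetLastError ();   // "out of memory" is reported here
if (err != cudaSuccess)
    printf ("CUDA error: %s\n", cudaGetErrorString (err));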

Any ideas or suggestions would be most welcome
Bill

More information (2):
It appears that with CUDA 8.0 cudaLimitStackSize is 1024 bytes per thread.
The kernel is launched with 1760 blocks of 64 threads, i.e. 112,640 threads.
So I think about 115 million bytes (112,640 × 1024) of the GTX 1080’s memory
will be taken by the stack.
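(The arithmetic I am doing, sketched in code; the variable names are just
illustrative:)

size_t stackPerThread = 0;
cudaDeviceGetLimit (&stackPerThread, cudaLimitStackSize);  // 1024 here
size_t numThreads = 1760 * 64;                             // = 112,640 threads
size_t stackBytes = numThreads * stackPerThread;           // ~115 million bytes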
HOWEVER, if I compile with nvcc --ptxas-options=-v it says:

ptxas info : 1470 bytes gmem, 2336 bytes cmem[3]

ptxas info : Compiling entry function '_Z25cuda_inexact_match_callerPjiP16alignment_meta_tP16barracuda_aln1_tP11init_info_tP13widths_bids_ticc' for 'sm_35'
ptxas info : Function properties for _Z25cuda_inexact_match_callerPjiP16alignment_meta_tP16barracuda_aln1_tP11init_info_tP13widths_bids_ticc
11712 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 94 registers, 384 bytes cmem[0], 148 bytes cmem[2], 2 textures

ptxas info : Compiling entry function '_Z23cuda_find_exact_matchesPjiiP7bwtkl_t' for 'sm_35'
ptxas info : Function properties for _Z23cuda_find_exact_matchesPjiiP7bwtkl_t
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 77 registers, 348 bytes cmem[0], 1 textures

ptxas info : Compiling entry function '_Z19cuda_prepare_widthsPjiP13widths_bids_tPc' for 'sm_35'
ptxas info : Function properties for _Z19cuda_prepare_widthsPjiP13widths_bids_tPc
1120 bytes stack frame, 208 bytes spill stores, 208 bytes spill loads
ptxas info : Used 255 registers, 360 bytes cmem[0], 148 bytes cmem[2], 2 textures

Would I be right in thinking that the compiler’s “11712 bytes stack frame” means my kernel
will try to allocate 11712 bytes for each thread, i.e. ignore the limit given by
the runtime value cudaLimitStackSize? If so, this would be 112,640 × 11712 bytes,
about 1259 MB (much more than is left).

As always any help or suggestions would be most welcome.
Bill

The compiler resource statistics are always per-thread, as there is no compile-time notion of grid / block configuration. Whatever is reported may be rounded up to the next unit of granularity (which may differ by architecture) for actual allocation purposes.

I do not recall what default per-thread stack allocation the driver provides; I think it is 2 KB (check the documentation; I am reasonably sure the default setting is documented). In order to launch a kernel with a stack frame of 11712 bytes, you will need to adjust the stack size in your app first, say to 12*1024 bytes:

size_t myStackSize = 12 * 1024;
cudaError_t stat = cudaDeviceSetLimit (cudaLimitStackSize, myStackSize);
if (stat != cudaSuccess) { /* handle the error */ }
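Since the granted size may be rounded up, you can read the limit back to see
what you actually got:

size_t grantedStackSize = 0;
cudaDeviceGetLimit (&grantedStackSize, cudaLimitStackSize);
printf ("per-thread stack size is now %zu bytes\n", grantedStackSize);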

I think there is also an upper limit on the stack size; I don't think all of the available memory can be dedicated to the stack. In general, a large stack size may be an indicator that the software architecture is sub-optimal; this applies to both GPU and CPU work.

Dear njuffa,
Many thanks. I have a workaround, which is to reduce the size of
some global memory arrays. This both leaves more memory free on the GPU
and reduces the grid size (and hence the number of threads) used to
launch the kernel.

Thanks again
Bill