Show sizes of GPU memory usage (e.g. log cudaMalloc) when CUDA reports "out of memory" at runtime

On a new device (a GeForce GTX 1080) I get ‘out of memory’.
(It is reported by cudaGetLastError after cudaDeviceSynchronize).
I am assuming this is a programming error.

Are there any tools which can list all the GPU buffers
that the host code has allocated, along with their sizes?
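(Absent such a tool, I suppose I could wrap cudaMalloc myself and log each
allocation; a rough sketch of what I mean, with made-up names:)

#include <cstdio>

// Hypothetical logging wrapper, illustrative only:
static cudaError_t loggedCudaMalloc (void **devPtr, size_t bytes, const char *tag)
{
    cudaError_t err = cudaMalloc (devPtr, bytes);
    fprintf (stderr, "cudaMalloc %s: %zu bytes (%s)\n", tag, bytes,
             cudaGetErrorString (err));
    return err;
}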

Is it possible to check for running out of GPU memory
before launching kernels (which otherwise fail at run time)?

Many thanks
Bill

More information: cudaMemGetInfo says that after all the cudaMalloc() calls
the GTX 1080 still has 216 million bytes free, yet I still get the
CUDA error “out of memory” after the first kernel is launched.
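(For reference, this is roughly how I check; a minimal sketch rather than the
actual code, and my_kernel, grid and block are placeholders:)

size_t freeBytes = 0, totalBytes = 0;
cudaMemGetInfo (&freeBytes, &totalBytes);
printf ("GPU memory: %zu bytes free of %zu total\n", freeBytes, totalBytes);

my_kernel<<<grid, block>>>();            // placeholder launch
cudaDeviceSynchronize ();                // wait for the kernel to finish
cudaError_t err = cudaGetLastError ();   // "out of memory" is reported here
if (err != cudaSuccess)
    printf ("CUDA error: %s\n", cudaGetErrorString (err));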

Any ideas or suggestions would be most welcome
Bill

More information (2):
It appears that with CUDA 8.0 cudaLimitStackSize is 1024 bytes per thread.
The kernel is launched with 1760 blocks of 64 threads, i.e. 112,640 threads.
So I think about 115 million bytes (112,640 × 1024) of the GTX 1080’s memory
will be taken by the stack.
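(The arithmetic I am doing, sketched in code; the variable names are just
illustrative:)

size_t stackPerThread = 0;
cudaDeviceGetLimit (&stackPerThread, cudaLimitStackSize);  // 1024 here
size_t numThreads = 1760 * 64;                             // = 112,640 threads
size_t stackBytes = numThreads * stackPerThread;           // ~115 million bytes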
HOWEVER, if I compile with nvcc --ptxas-options=-v it says:

ptxas info : 1470 bytes gmem, 2336 bytes cmem[3]

ptxas info : Compiling entry function '_Z25cuda_inexact_match_callerPjiP16alignment_meta_tP16barracuda_aln1_tP11init_info_tP13widths_bids_ticc' for 'sm_35'
ptxas info : Function properties for _Z25cuda_inexact_match_callerPjiP16alignment_meta_tP16barracuda_aln1_tP11init_info_tP13widths_bids_ticc
11712 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 94 registers, 384 bytes cmem[0], 148 bytes cmem[2], 2 textures

ptxas info : Compiling entry function '_Z23cuda_find_exact_matchesPjiiP7bwtkl_t' for 'sm_35'
ptxas info : Function properties for _Z23cuda_find_exact_matchesPjiiP7bwtkl_t
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 77 registers, 348 bytes cmem[0], 1 textures

ptxas info : Compiling entry function '_Z19cuda_prepare_widthsPjiP13widths_bids_tPc' for 'sm_35'
ptxas info : Function properties for _Z19cuda_prepare_widthsPjiP13widths_bids_tPc
1120 bytes stack frame, 208 bytes spill stores, 208 bytes spill loads
ptxas info : Used 255 registers, 360 bytes cmem[0], 148 bytes cmem[2], 2 textures

Would I be right in thinking that the compiler’s “11712 bytes stack frame” means my kernel
will try to allocate 11712 bytes for each thread, i.e. ignore the limit given by
the runtime value cudaLimitStackSize? If so, this would be 112,640 × 11712 bytes,
about 1259 MB (much more than is left).

As always any help or suggestions would be most welcome.
Bill

The compiler resource statistics are always per-thread, as there is no compile-time notion of grid / block configuration. Whatever is reported may be rounded up to the next unit of granularity (which may differ by architecture) for actual allocation purposes.

I do not recall what default per-thread stack allocation the driver provides; I think it is 2 KB (check the documentation; I am reasonably sure the default setting is documented). In order to launch a kernel with a stack frame of 11712 bytes, you will need to adjust the stack size in your app first, say to 12*1024 bytes:

size_t myStackSize = 12 * 1024;
cudaError_t stat = cudaDeviceSetLimit (cudaLimitStackSize, myStackSize);
if (stat != cudaSuccess) { /* handle the error */ }
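Since the granted size may be rounded up, you can read the limit back to see
what you actually got:

size_t grantedStackSize = 0;
cudaDeviceGetLimit (&grantedStackSize, cudaLimitStackSize);
printf ("per-thread stack size is now %zu bytes\n", grantedStackSize);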

I think there is also an upper limit on the stack size; I don't think all of the available memory can be dedicated to the stack. In general, a large stack size may be an indicator that the software architecture is sub-optimal; this applies to both GPU and CPU work.

Dear njuffa,
Many thanks. I have a workaround, which is to reduce the size of
some global memory arrays. This both leaves more memory free on the GPU
and reduces the grid size (and hence the number of threads) used to
launch the kernel.

Thanks again
Bill