A meaning of nvlink warning: Stack size for entry function cannot be statically determined

First, I want to apologize for the inappropriate title; it should be

The meaning of the nvcc output with the option --ptxas-options=-v

I do not know how to change the title, so it remains as it was created (by mistake).

Hello all,
I am compiling a program that consists of many separate source files (mostly definitions of objects with member device functions). Because my program was too slow (perhaps due to too frequent accesses to global memory), I decided to use the nvcc compile option --ptxas-options=-v to find out how memory is distributed among all the memory types. But I am unable to interpret the output. The output of the compilation is as follows:

nvcc -Xcompiler -rdynamic -lineinfo -dc -g -G --ptxas-options=-v -arch sm_21 file1.cu
ptxas info    : 304 bytes gmem, 40 bytes cmem[14]
ptxas info    : Function properties for function1_id
    88 bytes stack frame, 68 bytes spill stores, 68 bytes spill loads
ptxas info    : Compiling entry function 'Kernel_id' for 'sm_21'
ptxas info    : Function properties for Kernel_id
    120 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 55 registers, 48 bytes smem, 120 bytes cmem[0]
… followed by a lot of other output similar to the above from the different files, but the information for the kernel occurs only once.
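
For context, my files contain device code roughly like the following sketch (simplified, with hypothetical names; it is not my actual code): objects whose member functions are device functions, compiled separately with -dc.

    // Simplified sketch (hypothetical names, not the actual program):
    // an object with a member device function, similar in spirit to
    // function1_id in the output above.
    struct Particle {
        float3 pos;
        float3 vel;

        __device__ void advance(float dt) {   // member device function
            pos.x += vel.x * dt;
            pos.y += vel.y * dt;
            pos.z += vel.z * dt;
        }
    };

    __global__ void Kernel(Particle *p, float dt, int n) {  // entry function,
        int i = blockIdx.x * blockDim.x + threadIdx.x;      // like Kernel_id above
        if (i < n) p[i].advance(dt);
    }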

Can someone please clarify the meaning of these ptxas messages? I want to find out how the data are distributed among the different types of memory and further optimize their distribution.

Thank you for any hints,

Dalibor

registers are registers
gmem is global memory (I think this may be a particular use of it; don’t know the details)
smem is shared memory
cmem is constant memory (there are multiple banks, the bank number is given as an index)
stack frame is part of local memory
spill loads and stores use part of the stack frame
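
To make these categories concrete, here is a minimal sketch (a hypothetical kernel, for illustration only) of which source constructs feed each line of the ptxas report; the exact cmem bank assignments are toolchain-specific (on sm_2x, for example, kernel arguments are reported in cmem[0]):

    __constant__ float coeff[32];      // reported under cmem (constant memory)
    __device__   float bias[64];       // module-scope __device__ data counts toward gmem

    __global__ void Example(const float *in, float *out)
    {
        __shared__ float tile[64];     // reported as smem (shared memory)
        float scratch[16];             // if not kept in registers, placed in the
                                       // stack frame (local memory)
        int i = threadIdx.x;           // plain scalars like i live in registers
        scratch[i % 16] = in[i] * coeff[i % 32];   // dynamic indexing usually forces
                                                   // scratch into local memory
        tile[i % 64] = scratch[i % 16] + bias[i % 64];
        __syncthreads();
        out[i] = tile[i % 64];
    }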

This output can give programmers a rough idea about occupancy or spilling, and can therefore be somewhat useful for performance work. The tool specifically designed for performance work is the Visual Profiler. The Best Practices Guide describes many strategies for writing high-performance CUDA code.

Thanks for the reply and for the recommended information sources,
I am going to read the Best Practices Guide and find out how to use the Visual Profiler. But the details of how the compiler spills loads and stores to local memory (a part of global memory), and of how the compiler makes use of global memory, are crucial for me. To understand that and fix the code, I need a detailed explanation. Therefore, any detailed comments are welcome.
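
As a starting point, here is a hedged sketch (hypothetical kernel and limits, illustrative only) of the two standard knobs that influence spilling: __launch_bounds__ in the source, and -maxrregcount on the nvcc command line. Note also that the -g -G debug flags in the command above disable most device-code optimizations and typically inflate the stack frame and spill counts, so release builds are worth measuring separately.

    // Hedged sketch (hypothetical kernel, illustrative only). Spill stores and
    // spill loads go through the stack frame, which resides in local memory
    // (physically in global memory, though cached on sm_2x).
    __global__ void __launch_bounds__(256)     // assumes blocks of at most 256 threads;
    Scale(const float *in, float *out, int n)  // a tighter bound lowers the register
    {                                          // budget per thread and can increase spills
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * 2.0f;             // trivial body; spills appear in real
                                               // kernels with many simultaneously
                                               // live values
    }

    // The same experiment from the command line (hypothetical file name):
    //   nvcc --ptxas-options=-v -maxrregcount=32 -arch sm_21 -dc file1.cu
    // Recompile with different limits and compare the "spill stores" /
    // "spill loads" lines in the ptxas report.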