Questions about -maxrregcount and -Xptxas

Dear forum,

I have two quick questions:

1.

I split my source code into multiple .cu files (even FileIO.cu) and noticed the info below:

1>  FileIO.cu

1>  ptxas info    : Compiling entry function '__cuda_dummy_entry__' for 'sm_20'

1>  ptxas info    : Used 2 registers, 32 bytes cmem[0], 51200 bytes cmem[2]

...

1>  Main.cu

1>  ptxas info    : Compiling entry function '__cuda_dummy_entry__' for 'sm_20'

1>  ptxas info    : Used 2 registers, 32 bytes cmem[0], 51200 bytes cmem[2]

I assume dummy entry here means the memory usage is simply a theoretical value and no device memory is actually allocated for this function. Correct?

What about the situation in Main.cu, where the kernels are called? Is no device memory used for kernel calls either?

Though the CUDA front end can separate host and device code, is there any disadvantage to organizing source code this way (using only .cu and .cuh files)?

2.

Does -maxrregcount=0 place all automatic variables in local memory instead of registers?

Thanks for any clarifications!

(1) Empty dummy kernels are created for .cu files that do not contain any device code at all. I do not know why they are being created, but such kernels never get called, and you can safely ignore them. Two registers are used for empty kernels because of ABI requirements: one register is used as a stack pointer; I do not recall what the second register is used for. It is not clear to me where the high cmem usage is coming from. Is there a constant declaration by any chance (possibly imported from a header file), but without any accompanying device code?
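For reference, the verbose ptxas report quoted above is the kind of output produced by nvcc's -Xptxas -v switch. A minimal sketch of such an invocation, using the FileIO.cu and sm_20 target from the original post, would be:

    # FileIO.cu is assumed to contain host code only (no __global__ functions).
    # -Xptxas -v asks ptxas to print per-kernel register and constant-memory usage.
    nvcc -c -arch=sm_20 -Xptxas -v FileIO.cu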

(2) -maxrregcount does not allow arbitrarily low register targets, at least not for sm_2x and sm_3x targets. There is no benefit to lowering the register usage for a kernel below what is needed to achieve full occupancy. Consequently, PTXAS enforces a lower limit of 16 registers for sm_2x and 32 registers for sm_3x. If a lower limit is specified via -maxrregcount, PTXAS warns that it has adjusted the specified limit.
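To make the clamping concrete, here is a hedged sketch (the kernel and file name are hypothetical): compiling with the limit from the original question, ptxas raises it to the architecture's minimum and warns about the adjustment.

    // scale.cu -- hypothetical kernel, used only to illustrate the flag
    __global__ void scale(float *d, float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= a;
    }

    # For sm_20 the requested limit of 0 is raised to the 16-register minimum,
    # and ptxas prints a warning that the specified limit was adjusted.
    nvcc -c -arch=sm_20 -maxrregcount=0 -Xptxas -v scale.cu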

(3) Because of the way the CUDA compiler splits the host and device code and invokes the host compiler, host code performance when compiled via nvcc may differ somewhat from the performance of the same code compiled directly by the host compiler. While I have seen this happen, I don't have a good estimate of how frequently it happens. If this is a concern, I would recommend compiling with nvcc only the portion of the host code that calls device code (i.e. global functions), while compiling the rest of the host code with the regular host compiler; a sketch of this split follows below. This also allows the use of a host compiler, or host compiler features, not supported by CUDA for that non-device-calling portion of the code, as long as the object files are linkable.
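A minimal sketch of that split (all file and function names here are hypothetical) keeps the kernel and its launcher in a .cu file behind a C-linkage wrapper, so the rest of the program can be built by cl or gcc directly:

    // kernel.cu -- compiled with nvcc; holds the device code and its launcher
    #include <cuda_runtime.h>

    __global__ void scale(float *d, float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= a;
    }

    // C-linkage wrapper so host code built by the regular host compiler can call it
    extern "C" void launch_scale(float *d, float a, int n)
    {
        scale<<<(n + 255) / 256, 256>>>(d, a, n);
    }

    // main.cpp -- compiled with the regular host compiler; only needs the wrapper's declaration
    extern "C" void launch_scale(float *d, float a, int n);

    int main()
    {
        // ... allocate device memory and copy data via further wrappers, then:
        // launch_scale(d_data, 2.0f, n);
        return 0;
    }

On Linux the build could then look like this (library paths may differ on your system):

    nvcc -c -arch=sm_20 kernel.cu
    g++ -c -O3 main.cpp
    g++ kernel.o main.o -o app -L/usr/local/cuda/lib64 -lcudart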

Thanks a lot, njuffa.

Each file does indeed include a header file, but I didn't declare anything as "constant" in it; the header contains only around 30 "const" variables, a few short host template functions, and declarations of host and device functions.

The NVCC documentation says: "All non-CUDA compilation steps are forwarded to a general purpose C compiler that is supported by nvcc, and on Windows platforms, where this compiler is an instance of the Microsoft Visual Studio compiler, nvcc will translate its options into appropriate 'cl' command syntax." Does this mean that nvcc safely hands the host code off to cl, so we may still benefit from the host compiler?

NVCC is only a compiler driver that calls various “real” compilers. Device code is compiled by NVIDIA’s own toolchain. All host code is compiled by the host compiler (various gcc versions on Linux, various MSVC versions on Windows).
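If you want to see exactly which tools nvcc drives on your system, its --dryrun switch prints the sequence of commands (host compiler, device toolchain, etc.) without executing them. A sketch, using the Main.cu from the original post:

    # Lists the host-compiler and device-toolchain commands nvcc would run, without executing them
    nvcc --dryrun -c -arch=sm_20 Main.cu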

You can pass compiler flags to the host compiler as desired with the -Xcompiler command-line switch of NVCC. Here is an example from one of my Linux projects, where I pass options to gcc:

-Xcompiler -O3 -Xcompiler -march=core2 -Xcompiler -mtune=core2 -Xcompiler -msse2
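For reference, a complete invocation along those lines might look like the sketch below (the file name is hypothetical); nvcc also accepts a single -Xcompiler switch carrying a comma-separated list of options:

    # Options after -Xcompiler are forwarded verbatim to the host compiler (gcc here)
    nvcc -c -arch=sm_20 -Xptxas -v -Xcompiler -O3,-march=core2,-mtune=core2,-msse2 kernel.cu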