Isolating host-side include files

The attached code consists of a main .cu file calling 20 functions in 20 separate source files. Each source file includes *.hxx header files generated automatically from XSD schemas by the CodeSynthesis data binding tool. The *.hxx files are host-side C++ headers with dependencies on Boost and Xerces headers – there is no CUDA content. The remaining source code is just a wrapper.

When each source file is compiled with nvcc, the ptxas -v option reports:

ptxas info : 88 bytes gmem, 3728 bytes cmem[3]

In other words, nvcc is placing certain data described in the *.hxx headers into constant memory. The cumulative effect of compiling 21 files that each contribute 3728 bytes to constant memory exceeds the hard 64K limit for global constant memory, so the linker fails with the message:

Invoking: NVCC Linker
/usr/local/cuda-7.0/bin/nvcc --cudart static -L/usr/local/cuda-7.0/lib64 --relocatable-device-code=true -gencode arch=compute_50,code=compute_50 -gencode arch=compute_50,code=sm_50 -link -o "launch-test" ./junk1.o ./junk10.o ./junk11.o ./junk12.o ./junk13.o ./junk14.o ./junk15.o ./junk16.o ./junk17.o ./junk18.o ./junk19.o ./junk2.o ./junk20.o ./junk3.o ./junk4.o ./junk5.o ./junk6.o ./junk7.o ./junk8.o ./junk9.o ./launch-test.o -lcuda -lcudadevrt
nvlink error : File uses too much global constant data (0x131d0 bytes, 0x10000 max)
make: *** [launch-test] Error 255
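
As a quick sanity check, the size in the nvlink error is consistent with every one of the 21 object files contributing its own 3728 bytes. The sketch below only restates the figures from the ptxas and nvlink output; the constant names are made up for illustration.

```cpp
#include <cstddef>

// Figures taken from the ptxas and nvlink output quoted above.
constexpr std::size_t kBytesPerObject = 3728;     // cmem[3] reported for each object file
constexpr std::size_t kObjectCount    = 21;       // junk1.o .. junk20.o plus launch-test.o
constexpr std::size_t kConstantLimit  = 0x10000;  // 64 KiB maximum reported by nvlink

// Total constant data the linker must place if every object brings its own copy.
constexpr std::size_t kTotalBytes = kBytesPerObject * kObjectCount;

static_assert(kTotalBytes == 0x131d0, "matches the size printed by nvlink");
static_assert(kTotalBytes > kConstantLimit, "exceeds the 64 KiB limit");

// Largest number of such object files that would still fit under the limit.
constexpr std::size_t kMaxObjects = kConstantLimit / kBytesPerObject;
```

By this arithmetic the link would start failing at the 18th object file, assuming nothing else uses that constant bank.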

Is there any way to work around this limitation? I have not yet investigated which header elements are contributing to global constant memory but I would be reluctant to edit Boost, Xerces, or automatically generated CodeSynthesis code in any event.

The question is whether there is a way to prevent nvcc from (mis)interpreting host-side headers. I don’t see any relevant CUDA preprocessor macros or compiler switches, and I doubt that I can precompile these headers using GCC and then include the result in *.cu files to be compiled under NVCC. Any advice or comments would be greatly appreciated.

// Test accumulation of constant memory

#include <cuda_runtime.h>
#include <cuda_runtime_api.h>

#include <xsd-core/all.hxx>
#include <xsd-math/all.hxx>
#include <xsd-quant/all.hxx>

extern void junk1() ;
extern void junk2() ;
extern void junk3() ;
extern void junk4() ;
extern void junk5() ;
extern void junk6() ;
extern void junk7() ;
extern void junk8() ;
extern void junk9() ;
extern void junk10() ;
extern void junk11() ;
extern void junk12() ;
extern void junk13() ;
extern void junk14() ;
extern void junk15() ;
extern void junk16() ;
extern void junk17() ;
extern void junk18() ;
extern void junk19() ;
extern void junk20() ;

int main  ( void )

The usual suggestion in these cases is to move the code that does not depend on CUDA into ordinary .cpp files compiled by your regular host compiler, and to create wrapper functions as needed that call into the CUDA functions contained in .cu files, which are compiled by nvcc.

Objects produced by g++ and nvcc, for example, can be linked together at project link time.
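
A minimal sketch of that split follows, shown as one listing with comments marking where each file would begin. The file names (gpu_api.h, host_logic.cpp, gpu_impl.cu) and the functions scale_on_device and process are hypothetical, not taken from the attached project. So that the listing compiles as plain C++, a CPU loop stands in for the body that a real .cu file would implement with a kernel launch.

```cpp
// ---- gpu_api.h (hypothetical): plain C++ declarations, no CUDA or XSD headers ----
#include <cstddef>
#include <vector>

// Wrapper whose definition lives in a .cu file compiled by nvcc.
void scale_on_device(float* data, std::size_t count, float factor);

// ---- host_logic.cpp (hypothetical): compiled by g++, never seen by nvcc ----
// The Boost/Xerces/CodeSynthesis headers would be included only here, e.g.
//   #include <xsd-core/all.hxx>
//   #include "gpu_api.h"

// Host-side code reaches the GPU only through the wrapper declared above.
std::vector<float> process(std::vector<float> values, float factor) {
    scale_on_device(values.data(), values.size(), factor);
    return values;
}

// ---- gpu_impl.cu (hypothetical): compiled by nvcc, includes only gpu_api.h ----
// ASSUMPTION: in a real build this body would copy the data to the device,
// launch a __global__ kernel, and copy the result back. A CPU loop stands in
// here so that the sketch builds without the CUDA toolkit.
void scale_on_device(float* data, std::size_t count, float factor) {
    for (std::size_t i = 0; i < count; ++i) {
        data[i] *= factor;
    }
}
```

With this layout nvcc never parses the *.hxx headers, so whatever they contribute to constant memory cannot end up in the device objects. host_logic.cpp would be compiled with g++ -c and gpu_impl.cu with nvcc -c, and the resulting objects linked together as described above.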

I would think the first order of business is to find out where the use of cmem[3] is coming from. Constant bank assignments are implementation details that differ from architecture to architecture. In any event, constant memory bank usage should have nothing to do with host code.

From reverse engineering constant bank usage by disassembling sm_50 code, it seems that cmem[0] is used to pass kernel arguments, cmem[2] is used to store literal constants extracted from the source code by the compiler, and cmem[3] contains user-declared constant data. This list may not be complete; there may be other uses of constant banks that I have not discovered yet.

You can inspect the constant bank data by running cuobjdump --dump-elf on the object files. You may be able to recognize the data stored in section .nv.constant3. In any event, this constant memory should come from device code of some sort: the contents of your .cu file, a private header file of yours, or some CUDA header file. The fact that each object file contributes exactly 3728 bytes seems to hint at a declaration in a header file.