Separate Compilation and Linking of CUDA C++ Device Code

Tesla K20m is a passively cooled board designed for rack mount servers with sufficient airflow to cool it. Do you have it in such a server?

Hi Mark,
No, this Tesla was in a cluster at the BSC. They have now bought new Teslas and gave us this one for testing our software. The Tesla is installed in a workstation (a simple PC) without a cooler, so I was wondering if I could buy one.

I don't think you can buy one from NVIDIA. There may be a way to modify an aftermarket cooler, but that of course would not be supported. There is a discussion of the same thing with older Tesla cards here: https://devtalk.nvidia.com/...

Well, it's OK; in the end we'll buy a new GPU card, which will be easier than modifying a cooler. Thank you very much Mark, best regards.

cat Makefile

# Build tools
CXX = g++-5
NVCC = /usr/bin/nvcc

# here are all the objects
GPUOBJS = cuexample.o
OBJS = cppexample.o

# make and compile
cudaexample.out: $(OBJS) $(GPUOBJS)
	$(NVCC) -o cudaexample.out $(OBJS) $(GPUOBJS)

cuexample.o: cuexample.cu
	$(NVCC) -c cuexample.cu

cppexample.o: cppexample.cpp
	$(CXX) -c cppexample.cpp

clean:
	rm cppexample.o cuexample.o

Has anything changed with respect to relocatable device code since CUDA 6.0?

Where can I read more about the caveats of RDC and separate compiling?
I am getting a noticeable slowdown in my CUDA application just by compiling each .cu with "-dc" instead of "-c". I do not think I have fully understood the caveats, as I would not expect any effect at all, given that currently all device functions, constant memory, etc. are within file scope (I can compile each .cu separately using "-c" without issues).
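
For context, here is roughly what I mean by the two modes, as a minimal sketch with placeholder file names a.cu and b.cu (not my real sources):

# what I was doing before: whole-program compilation of each file
nvcc -c a.cu -o a.o
nvcc -c b.cu -o b.o

# what I changed to: relocatable device code
nvcc -dc a.cu -o a.o
nvcc -dc b.cu -o b.o

# final link with nvcc; as far as I understand, nvcc runs the device
# link step itself when the objects were built with -dc
nvcc -o app a.o b.o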

The caveats section makes me think that the only negative effects would come from function call optimizations between compilation units. But inlining and any other function call optimization would still take place at file scope, right?

Thanks!

Is there a way to compile all the files with g++ except for the .cu files and then link later?

If you are using `make`, you can just set up different pattern rules for .cpp and .cu files. https://www.gnu.org/softwar...

Same idea (different mechanics) applies to other build tools (CMake etc.).
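
For example, a minimal sketch of such pattern rules, mirroring the Makefile above (compiler names and flags are just placeholders):

NVCC = nvcc
CXX = g++

# any .o whose source is a .cu file is compiled by nvcc
%.o: %.cu
	$(NVCC) -c $< -o $@

# any .o whose source is a .cpp file is compiled by g++
%.o: %.cpp
	$(CXX) -c $< -o $@

cudaexample.out: cppexample.o cuexample.o
	$(NVCC) -o $@ $^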

Has anyone tried linking multiple .cu files with Qt Creator on Windows?

I have been stuck for many weeks. The only Qt Creator projects that work with multiple .cu files are made on Linux or Mac, not on Windows. I have talked with people who have the same problem. It seems that qmake only takes the last compiled .cu file; the others are not compiled correctly.

Can someone help me with this or give me a link to a working Qt Creator project on Windows with multiple .cu files?

This might help
http://docs.nvidia.com/cuda...

To future generations: nvcc is a C++ compiler, not a pure C compiler, so please try not to link nvcc-generated objects with gcc. If you insist on gcc, try using extern "C" in your device code so that nvcc compiles those symbols with C linkage. I found this out the hard way.
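
A minimal sketch of what I mean, with made-up file and function names (library paths may differ on your system):

// kernels.cu -- compiled with nvcc
#include <cstdio>

__global__ void hello_kernel()
{
    printf("hello from the GPU\n");
}

// C linkage, so the symbol is not C++-mangled and can be called
// from an object file compiled with gcc
extern "C" void launch_hello(void)
{
    hello_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();
}

// In main.c you would declare:  void launch_hello(void);
// and build with something like:
//   nvcc -c kernels.cu
//   gcc  -c main.c
//   g++  -o app main.o kernels.o -L/usr/local/cuda/lib64 -lcudart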

You can read more about it here:

http://docs.nvidia.com/cuda...

and Here:

http://stackoverflow.com/qu...

Nice article; it gives a good explanation of a few of the finer points of binary building with nvcc.

Great writeup; indeed it makes things simpler, especially when compiling a large number of source files.

If the separate compilation units that are fed as input to nvlink contain CUDA kernels and device functions that invoke device functions marked __forceinline__, will these functions be inlined? Assume they would be inlined if all the source code were put into a single file. In which CUDA version would nvlink inline them?

nvlink does not do any inlining itself. __forceinline__ is for static functions that are inlined at compile time within their compilation unit.
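
A rough sketch of the difference, with made-up names that loosely follow the a.cu/b.cu example in the nvcc docs:

// a.cu (sketch)
__device__ __forceinline__ int twice(int x)   // defined in this translation unit:
{                                             // inlined into its callers at compile time
    return 2 * x;
}

extern __device__ int bar(void);              // defined in b.cu: resolved by nvlink
                                              // and called, not inlined

__global__ void kernel(int *out)
{
    out[0] = twice(out[0]);   // inlined by the compiler
    out[1] = bar();           // call across compilation units
}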

OK. For the "ab" example in the nvcc document [1], ptx clearly does not inline bar() -- and the graph from nvdisasm shows "JCAL `(_Z3barv);" -- which I guess is a "jump call to bar()". Earlier, I had mis-read the graph and assumed that bar() had been inlined -- where can I find the documentation for the instructions shown (including JCAL)? Why doesn't the picture show an arrow for the jump to bar() and for the return from bar()?

https://uploads.disquscdn.c...

Commands used:

nvcc --generate-code arch=compute_50,code=sm_52 --generate-code arch=compute_61,code=sm_61 --verbose --resource-usage --generate-line-info --source-in-ptx --keep --keep-dir generated/tmp -Xptxas --warn-on-double-precision-use,--warn-on-local-memory-usage,--warn-on-spills,--preserve-relocs --output-directory generated/obj --device-c a.cu b.cu
nvcc --generate-code arch=compute_50,code=sm_52 --generate-code arch=compute_61,code=sm_61 --verbose --resource-usage --generate-line-info --source-in-ptx --keep --keep-dir generated/tmp -Xptxas --warn-on-double-precision-use,--warn-on-local-memory-usage,--warn-on-spills,--preserve-relocs --device-link generated/obj/a.o generated/obj/b.o --output-file generated/obj/link.o
nvcc --verbose --resource-usage --generate-line-info --source-in-ptx --keep --keep-dir generated/tmp -Xptxas --warn-on-double-precision-use,--warn-on-local-memory-usage,--warn-on-spills,--preserve-relocs --lib --output-file generated/lib/libgpu.a generated/obj/a.o generated/obj/b.o generated/obj/link.o
g++ -o generated/bin/a.exe -Lgenerated/lib -lgpu -lcudadevrt -lcudart -L/usr/local/cuda/lib64

nvdisasm -cfg generated/tmp/link.compute_61.cubin | dot -ogenerated/out/cfg_61.png -Tpng

[1] https://docs.nvidia.com/cud...

Correct, the JCAL is the call to bar(). Your questions have veered off topic from separate compilation; someone else would need to answer why the graph doesn't show the actual edges.

The Makefile format seems to be that of a Unix machine. Does this syntax work on Windows 10?

This is why I love CUDA: the ability to write a single code base for both CPU and GPU.