There is something about separate compilation I don’t understand; it is related to shared memory usage.

Let’s take the example code from the official CUDA 5.5 documentation.

Consider the files from that example (b.h among them) and add the verbose flag to the compile line:

nvcc --ptxas-options -v -arch=sm_20 -dc

The verbose output reports that kernel foo requires 64 bytes of shared memory, whereas I would expect 8 x 4 = 32 bytes.

If I build the exact same application without separate compilation, the verbose output reports that kernel foo requires 32 bytes of shared memory, as expected.
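
One way to cross-check the ptxas figure might be to query the kernel's static shared memory at run time with cudaFuncGetAttributes. The following is only a minimal sketch: the kernel is a hypothetical stand-in that assumes a static shared array of 8 ints (hence the 8 x 4 = 32 bytes), it is not the documentation code verbatim, and the `out` parameter and the value of N are assumptions made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 8  // assumed array length, matching the 8 x 4 = 32 bytes expectation

// Hypothetical stand-in for kernel foo (assumption, not the documentation code).
__global__ void foo(int *out)
{
    __shared__ int a[N];  // N * sizeof(int) bytes of static shared memory
    a[threadIdx.x] = threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = a[blockDim.x - threadIdx.x - 1];
}

int main(void)
{
    // Ask the runtime how much static shared memory the compiled kernel uses.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, foo);
    if (err != cudaSuccess) {
        printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("static shared memory for foo: %zu bytes\n", attr.sharedSizeBytes);
    return 0;
}
```

Building this once with -dc (plus a device link step) and once as a normal whole-program compile, and comparing the printed values, could show whether the difference is visible at run time or only in the ptxas report. The sketch itself does not establish which of the two it is.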

I don’t understand what’s going on here. Is the 64-byte figure real? Is this normal behaviour, or is it a bug?

Thanks for your help.