shared mem size doubles with "-rdc=true"?

Hi All,

Here is a very simple snippet that allocates 1000 4-byte integers (4000 bytes) in shared memory:

#include <cstdint>      // uint32_t
#include <iostream>     // std::cout
#include <sys/types.h>  // uint (on Linux)

__global__ void sharedTest(const unsigned int dataStreamID)
{
    __shared__ uint32_t testShort[1000];
    // Touch the array so it is not optimized out
    testShort[0] = 0;
    testShort[1] = 0;
    testShort[2] = testShort[0] + testShort[1];
}

void gpuSharedTest(cudaStream_t streamID)
{
std::cout << "Uint size " << sizeof(uint) << std::endl;
std::cout << "Uint32 size " << sizeof(uint32_t) << std::endl;

dim3 block(192);
dim3 grid(86, 1);   // Many x 1    

sharedTest<<<grid, block, 0, streamID>>>(0);

}
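
For completeness, the snippet above does not check the launch result. Something like the following after the kernel call (not in my original code; streamID is the stream the kernel was launched on) would confirm that the kernel actually executed:

cudaError_t err = cudaGetLastError();        // launch/configuration errors
if (err == cudaSuccess)
    err = cudaStreamSynchronize(streamID);   // errors during execution
if (err != cudaSuccess)
    std::cout << "sharedTest failed: " << cudaGetErrorString(err) << std::endl;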

The output of the test is no surprise:
Uint size 4
Uint32 size 4
so I would expect the kernel to allocate 4000 bytes of shared memory.

However, when I compile this with “-rdc=true” (our project has many .cu source files), it takes 8000 bytes of smem instead of 4000:

nvcc -arch sm_20 -O3 --ptxas-options -v -lineinfo -g -rdc=true -shared -Xcompiler -fpic sharedTest.cu
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z10sharedTestj' for 'sm_20'
ptxas info : Function properties for _Z10sharedTestj
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 2 registers, 8000 bytes smem, 36 bytes cmem[0]

Without “-rdc=true” it only takes 4000:

nvcc -arch sm_20 -O3 --ptxas-options -v -lineinfo -g -shared -Xcompiler -fpic sharedTest.cu
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z10sharedTestj' for 'sm_20'
ptxas info : Function properties for _Z10sharedTestj
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 2 registers, 4000 bytes smem, 36 bytes cmem[0]

The same size doubling occurs for other data types (uint, short, uint16_t) as well.
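
For what it's worth, the runtime's view of the static allocation could be cross-checked with cudaFuncGetAttributes. A minimal sketch, placed in the same .cu file as the kernel (I have not verified whether it reports 4000 or 8000 in the -rdc=true build):

void printStaticSmem()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, sharedTest);
    if (err != cudaSuccess)
        std::cout << "cudaFuncGetAttributes failed: " << cudaGetErrorString(err) << std::endl;
    else
        std::cout << "Runtime-reported static smem: " << attr.sharedSizeBytes << " bytes" << std::endl;
}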

Any clue why?

P.S. Environment:

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Fri_Sep_21_17:28:58_PDT_2012
Cuda compilation tools, release 5.0, V0.2.1221

gcc --version
gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-54)
Copyright (C) 2006 Free Software Foundation, Inc.

OS is CentOS 5 64-bit, cuda_5.0.35_linux_64_rhel5.x-1.run installed with all the default options.

From the private message:

I also saw this. I never tracked it down.
But I think 8000 is a false number – 4000 is true.

Here was my experience:

In the CUDA occupancy calculator, it would indicate that with 8000 (or whatever) the maximum block size was 256
(or whatever). But I simply launched with a block size of 512, and it executed.

That matches my observations too, since it is possible to launch a kernel whose reported shared-memory usage is "80000" bytes. That would obviously be impossible on the GTX 470 I use (48 KB per block), so it looks like the ptxas report is simply wrong.
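
To make that check concrete, here is roughly the kind of kernel that exercises it (the array size is illustrative, not my exact code): the real footprint is 10000 x 4 = 40000 bytes, which fits in the 48 KB per-block limit of sm_20, while a doubled report would claim 80000 bytes. A successful launch therefore means the doubled figure is not what actually gets allocated:

__global__ void bigSharedTest()
{
    // 10000 x 4 bytes = 40000 bytes of static shared memory (real usage);
    // if ptxas doubles it under -rdc=true, it would print 80000 bytes.
    __shared__ uint32_t big[10000];
    big[0] = 0;
    big[1] = big[0] + 1;
}

void launchBigSharedTest()
{
    bigSharedTest<<<1, 192>>>();
    cudaError_t err = cudaGetLastError();
    std::cout << "Launch: " << cudaGetErrorString(err) << std::endl;
    err = cudaDeviceSynchronize();
    std::cout << "Sync:   " << cudaGetErrorString(err) << std::endl;
}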

However, I'm a bit concerned that there may be other hidden pitfalls lurking here.
I'm concerned because I have no idea why relocatable device code (separate compilation and device linking) would influence the printed shared memory size. %-))