Hi All,
Here is a very simple snippet, which allocates 1k of 4-byte integers in shared memory:
global void sharedTest(const unsigned int dataStreamID)
{
shared uint32_t testShort[1000];
// Some “using” to have the array not optimized out
testShort[0] = 0;
testShort[1] = 0;
testShort[2] = testShort[0] + testShort[1];
}
void gpuSharedTest(cudaStream_t streamID)
{
std::cout << "Uint size " << sizeof(uint) << std::endl;
std::cout << "Uint32 size " << sizeof(uint32_t) << std::endl;
dim3 block(192);
dim3 grid(86, 1); // Many x 1
sharedTest<<<grid, block, 0, streamID>>>(0);
}
Output of the test is of no surprise:
Uint size 4
Uint32 size 4
so I would expect the kernel to allocate 4000 bytes of shared memory.
However, when I compile this with “-rdc=true” (our project has many .cu source files), it takes 8000 bytes of smem instead of 4000:
nvcc -arch sm_20 -O3 --ptxas-options -v -lineinfo -g -rdc=true -shared -Xcompiler -fpic sharedTest.cu
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function ‘_Z10sharedTestj’ for ‘sm_20’
ptxas info : Function properties for _Z10sharedTestj
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 2 registers, 8000 bytes smem, 36 bytes cmem[0]
Without “-rdc=true” it only takes 4000:
nvcc -arch sm_20 -O3 --ptxas-options -v -lineinfo -g -shared -Xcompiler -fpic sharedTest.cu
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function ‘_Z10sharedTestj’ for ‘sm_20’
ptxas info : Function properties for _Z10sharedTestj
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 2 registers, 4000 bytes smem, 36 bytes cmem[0]
The same size doubling occurs for other data types (uint, short, uint16_t) as well.
Any clue why?
P.S. Environment:
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Fri_Sep_21_17:28:58_PDT_2012
Cuda compilation tools, release 5.0, V0.2.1221
gcc --version
gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-54)
Copyright (C) 2006 Free Software Foundation, Inc.
OS is CentOS 5 64-bit, cuda_5.0.35_linux_64_rhel5.x-1.run installed with all the default options.