Shared memory usage

I have a program that calls a series of kernels. When compiled separately, the cubin file for each kernel shows the expected amount of shared memory used. When the entire program is compiled, the cubin file shows much more shared memory used for each kernel.

Here is the code for one of the kernels:

__global__ void
prodKernel( float* g_Gx, float* g_Gy, float* g_Gz, float* GxGy, float* GyGz, float* GzGx, float* GxGx, float* GyGy, float* GzGz)
{
	float tmpX = 0, tmpY = 0, tmpZ = 0;

	int I = threadIdx.x;
	int B = blockIdx.x;

	tmpX = g_Gx[B*IMAGE_WIDTH + I] / 32;
	tmpY = g_Gy[B*IMAGE_WIDTH + I] / 32;
	tmpZ = g_Gz[B*IMAGE_WIDTH + I] / 32;

	GxGy[B*IMAGE_WIDTH + I] = tmpX * tmpY;
	GyGz[B*IMAGE_WIDTH + I] = tmpY * tmpZ;
	GzGx[B*IMAGE_WIDTH + I] = tmpZ * tmpX;
	GxGx[B*IMAGE_WIDTH + I] = tmpX * tmpX;
	GyGy[B*IMAGE_WIDTH + I] = tmpY * tmpY;
	GzGz[B*IMAGE_WIDTH + I] = tmpZ * tmpZ;
}


Compiled separately, its cubin file shows 52 bytes of shared memory.

Here is the code in the function that calls it:

	//set up the kernel params for the outer product kernel
	num_threads = 1;

	dim3 prod_grid( IMAGE_HEIGHT, 1, 1);
	dim3 prod_threads( IMAGE_WIDTH, 1, 1);

	prodKernel<<<prod_grid, prod_threads>>>( d_Gx2, d_Gy2, d_Gz2, d_GxGy, d_GyGz, d_GzGx, d_GxGx, d_GyGy, d_GzGz);
	CUT_CHECK_ERROR("Outer product kernel execution failed");

When the above function is compiled, the cubin file shows 8260 bytes of shared memory for prodKernel.

Any ideas why this is so?

There is a confirmed bug in CUDA: ptxas miscalculates the shared memory requirements. If multiple kernels exist in the same file, it writes the sum of all shared declarations into the cubin for every kernel. This might make sense if it lets the runtime avoid reallocating shared memory between launches; if the kernels don't run in succession, however, it is simply waste. NVIDIA has said it will be fixed in the next update.
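Until the fix ships, one way to get truthful per-kernel numbers is to keep each kernel in its own .cu file and read ptxas's verbose report when building that file alone. A sketch of the build step (the file name is hypothetical; -Xptxas -v, equivalently --ptxas-options=-v, asks ptxas to print its resource usage):

```shell
# Hypothetical layout: prodKernel.cu contains only prodKernel.
# Compiling it in isolation sidesteps the summing bug, and -Xptxas -v
# prints the register and shared memory (smem) usage ptxas records.
nvcc -cubin -Xptxas -v prodKernel.cu
```

With everything in one file, the same command shows the inflated smem figure for every kernel, which matches what the cubin inspection above revealed.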