I have a program that calls a series of kernels. When each kernel is compiled separately, its cubin file reports the expected amount of shared memory. When the entire program is compiled together, the cubin file reports much more shared memory for every kernel.
Here is the code for one of the kernels:
__global__ void
prodKernel( float* g_Gx, float* g_Gy, float* g_Gz, float* GxGy, float* GyGz, float* GzGx, float* GxGx, float* GyGy, float* GzGz)
{
    float tmpX = 0, tmpY = 0, tmpZ = 0;
    int I = threadIdx.x;  // column: one thread per pixel in a row
    int B = blockIdx.x;   // row: one block per image row

    // Load the three inputs for this pixel, each scaled by 1/32
    tmpX = g_Gx[B*IMAGE_WIDTH +I]/32;
    tmpY = g_Gy[B*IMAGE_WIDTH +I]/32;
    tmpZ = g_Gz[B*IMAGE_WIDTH +I]/32;

    // Store the six unique entries of the outer product
    GxGy[B*IMAGE_WIDTH +I] = tmpX * tmpY;
    GyGz[B*IMAGE_WIDTH +I] = tmpY * tmpZ;
    GzGx[B*IMAGE_WIDTH +I] = tmpZ * tmpX;
    GxGx[B*IMAGE_WIDTH +I] = tmpX * tmpX;
    GyGy[B*IMAGE_WIDTH +I] = tmpY * tmpY;
    GzGz[B*IMAGE_WIDTH +I] = tmpZ * tmpZ;
}
When this kernel is compiled separately, its cubin file shows 52 bytes of shared memory.
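One way to account for the 52 bytes is sketched below. It rests on assumptions that are not stated in the post: the target is a compute capability 1.x device, where kernel arguments are passed through shared memory; device pointers are 4 bytes (a 32-bit build); and 16 bytes per block are reserved for built-in variables. The names kBuiltinBytes, kPointerBytes, and expectedStaticSmem are hypothetical and used only for illustration.

```cpp
#include <cstddef>

// Assumption: shared memory reserved per block for built-in variables
// on compute capability 1.x targets.
const std::size_t kBuiltinBytes = 16;

// Assumption: 32-bit build, so each device pointer argument takes 4 bytes.
const std::size_t kPointerBytes = 4;

// Hypothetical helper: expected static shared memory, in bytes, for a kernel
// that takes only pointer arguments and declares no __shared__ variables.
std::size_t expectedStaticSmem(std::size_t numPointerArgs)
{
    return kBuiltinBytes + numPointerArgs * kPointerBytes;
}
```

prodKernel takes 9 float* arguments, so under these assumptions expectedStaticSmem(9) gives 16 + 9*4 = 52, which matches the figure for the separately compiled kernel. The sketch does not account for the larger figure from the full build.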
Here is the code in the function that calls it:
//set up the kernel params for the outer product kernel
num_threads = 1;
dim3 prod_grid( IMAGE_HEIGHT, 1, 1);
dim3 prod_threads( IMAGE_WIDTH, 1, 1);
prodKernel<<<prod_grid, prod_threads>>>( d_Gx2, d_Gy2, d_Gz2, d_GxGy, d_GyGz, d_GzGx, d_GxGx, d_GyGy, d_GzGz);
CUT_CHECK_ERROR("Outer product kernel execution failed");
When the above function is compiled, the cubin file shows 8260 bytes of shared memory for prodKernel.
Any ideas why this is so?