Weird lmem issue

Hi,

Below are three simple kernels:

struct fInput
{
	float  *pFloatIn;
	float4 *pFloat4In;
};

__global__ void Test1( fInput *pInput, float *fOutput )
{
	// float4 read through a pointer stored inside the struct.
	float4 fVal = pInput->pFloat4In[ threadIdx.x ];
	fOutput[ threadIdx.x ] = fVal.y * fVal.x;
}

__global__ void Test2( float4 *pIn, float *fOutput )
{
	// The same float4 read, but through a directly passed pointer.
	float4 fVal = pIn[ threadIdx.x ];
	fOutput[ threadIdx.x ] = fVal.y * fVal.x;
}

__global__ void Test3( fInput *pInput, float *fOutput )
{
	// Scalar float read through a pointer stored inside the struct.
	float fVal = pInput->pFloatIn[ threadIdx.x ];
	fOutput[ threadIdx.x ] = fVal * fVal;
}

Here is the ptxas information for the three functions:

1>ptxas info	: Compiling entry function '_Z5Test3P6fInputPf'
1>ptxas info	: Used 4 registers, 16+16 bytes smem
1>ptxas info	: Compiling entry function '_Z5Test2P6float4Pf'
1>ptxas info	: Used 4 registers, 16+16 bytes smem
1>ptxas info	: Compiling entry function '_Z5Test1P6fInputPf'
1>ptxas info	: Used 6 registers, 16+0 bytes lmem, 16+16 bytes smem

I was wondering why the Test1 kernel requires 16 bytes of lmem. If I put another float4 array in the structure and read it, the lmem usage goes up to 32 bytes.
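Concretely, the two-array case I mean looks like this (fInput2 and Test1b are just names for illustration):

struct fInput2
{
	float  *pFloatIn;
	float4 *pFloat4In;
	float4 *pFloat4In2;	// second embedded float4 array
};

__global__ void Test1b( fInput2 *pInput, float *fOutput )
{
	// Each float4 read through the struct appears to cost 16 bytes
	// of lmem, so ptxas reports 32 bytes here.
	float4 fValA = pInput->pFloat4In [ threadIdx.x ];
	float4 fValB = pInput->pFloat4In2[ threadIdx.x ];
	fOutput[ threadIdx.x ] = fValA.y * fValA.x + fValB.y * fValB.x;
}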

thanks

eyal

I was pondering that for a while, and I can’t offer an answer. My presumption is that the compiler can’t resolve the scope of the pointer fetch in the float4 case, so it declines to use registers, although there doesn’t seem to be a good reason why. Whether this behaviour is a feature or a bug is anyone’s guess.
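One speculative experiment (my own suggestion, not something verified in this thread): hoist the embedded pointer into a local variable, and mark both parameters __restrict__ so the compiler knows the output can’t alias the struct. Test1c is just an illustrative name:

__global__ void Test1c( const fInput * __restrict__ pInput,
                        float * __restrict__ fOutput )
{
	// Copy the embedded pointer out of the struct first, so the
	// float4 load no longer goes through the extra indirection.
	float4 *pIn = pInput->pFloat4In;
	float4 fVal = pIn[ threadIdx.x ];
	fOutput[ threadIdx.x ] = fVal.y * fVal.x;
}

Whether either change actually removes the lmem allocation would have to be checked against the ptxas output.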

Just got confirmation from NVIDIA that this issue has been fixed in a development build and that the fix will ship in a future release:

“This issue has been fixed in a development version of the CUDA toolkit. The fix will be available in a future release of CUDA. I will update this bug when that release becomes available.”

NVIDIA Bug ID: 641112

Just an idea: Have you tried the “register” or “volatile” keywords to give the compiler a hint?
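For reference, the hinted version would look something like this ("register" is only a hint, and it was removed from C++17, so newer compilers may warn or reject it; whether it changes the lmem allocation here is untested):

__global__ void Test1d( fInput *pInput, float *fOutput )
{
	// Ask the compiler to keep fVal in registers. This is purely a
	// hint and the compiler is free to ignore it.
	register float4 fVal = pInput->pFloat4In[ threadIdx.x ];
	fOutput[ threadIdx.x ] = fVal.y * fVal.x;
}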