Hi,
I know this sounds a bit stupid, but I don't get any performance boost from my Linux GTX 480 over my Windows GTX 280. Both environments are 64-bit.
The kernel is ~ 50% memory bound and 50% compute bound.
I tried the following things:
- Replacing texture accesses with global accesses.
- Configuring the main kernel to 16KB shared mem and 48KB L1 (or vice versa).
- Increasing the number of threads per block from 64 to 128, 256, 512, and 1024.
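For reference, this is how I set the shared-memory/L1 split from the host (a minimal sketch; `MainKernel` is a placeholder for the actual kernel name):

```
// Fermi (sm_20): choose the split between L1 cache and shared memory
// per kernel, before launching it.
cudaFuncSetCacheConfig( MainKernel, cudaFuncCachePreferL1 );     // 48KB L1 / 16KB smem
// ...or the other way around:
cudaFuncSetCacheConfig( MainKernel, cudaFuncCachePreferShared ); // 16KB L1 / 48KB smem
```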
The kernel looks roughly like this:
__inline__ __device__ void CalcTraceData( unsigned int iTraceChunk, unsigned int iTraceIndex, unsigned short iTimeIndex,
                                          short Nb, short numMoveSamples, float w2, int iInputPos,
                                          float2 *pDataInput,
                                          float &f1, float &f2, float &f3, float &f4 )
{
    if ( ( iTimeIndex >= Nb ) && ( iTimeIndex < numMoveSamples ) )
    {
        float w1 = 1.f - w2;
        float2 fCurrentTimeValues = tex1Dfetch( texInput, iInputPos );
        float2 fNextTimeValues    = tex1Dfetch( texInput, iInputPos + 1 );
        float fPhaseValW1        = fCurrentTimeValues.x * w1;
        float fPhaseValPlusOneW2 = fNextTimeValues.x * w2;
        f1 += fPhaseValW1 + fPhaseValPlusOneW2;
        f2 += fCurrentTimeValues.x * fPhaseValW1 + fNextTimeValues.x * fPhaseValPlusOneW2;
        f3 += fCurrentTimeValues.y * w1 + fNextTimeValues.y * w2;
        f4++;
    }
}
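For the "global access" experiment, the texture fetches were replaced by plain loads through the `pDataInput` pointer that is already passed in (a sketch of that variant; on sm_20 these loads go through L1):

```
// Same two fetches, but reading global memory directly instead of texInput.
float2 fCurrentTimeValues = pDataInput[ iInputPos ];
float2 fNextTimeValues    = pDataInput[ iInputPos + 1 ];
```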
// Main kernel function
__shared__ short2 smPositions[ ENGINE_BLOCK_THREAD_SIZE ];
__shared__ float  smw2[ ENGINE_BLOCK_THREAD_SIZE ];
__shared__ int    smInputTracePos[ ENGINE_BLOCK_THREAD_SIZE ];

// Loop is manually unrolled 16 times...
for ( int iTraceIndex = 0; iTraceIndex < ENGINE_BLOCK_THREAD_SIZE; iTraceIndex++ )
{
    CalcTraceData( iTraceChunk, iTraceIndex, iTimeIndex,
                   smPositions[ iTraceIndex ].x, smPositions[ iTraceIndex ].y,
                   smw2[ iTraceIndex ], smInputTracePos[ iTraceIndex ], pDataInput,
                   fTraceOutPhase, fTraceOutPhase2, fTraceOutStack, fFold );
}
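FWIW, instead of unrolling by hand I also considered letting the compiler do it (a sketch of the same loop; `#pragma unroll` with a count is supported by nvcc):

```
// Ask ptxas to unroll the loop 16x instead of doing it manually.
#pragma unroll 16
for ( int iTraceIndex = 0; iTraceIndex < ENGINE_BLOCK_THREAD_SIZE; iTraceIndex++ )
{
    CalcTraceData( iTraceChunk, iTraceIndex, iTimeIndex,
                   smPositions[ iTraceIndex ].x, smPositions[ iTraceIndex ].y,
                   smw2[ iTraceIndex ], smInputTracePos[ iTraceIndex ], pDataInput,
                   fTraceOutPhase, fTraceOutPhase2, fTraceOutStack, fFold );
}
```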
Some of the changes yielded even worse performance than the GTX 280, and the others gave the same performance.
Here is the command line I've used (for a block size of 128):

/usr/local/cuda/bin/nvcc --ptxas-options="-v" -gencode=arch=compute_20,code="sm_20,compute_20" -maxrregcount=30 --compiler-options="-fno-strict-aliasing -fPIC" -I ~build/NVIDIA_CUDA_SDK/common/inc/ -I /home/build/build -I /home/build/build -I. -I /usr/local/include -I /usr/include -I /usr/local/cuda/include -O3 -DUNIX -o Engine/GPU/Engine.o -c Engine/GPU/Engine.cu
ptxas info : Compiling entry function '_Z15CalculateEngineILb0EEvjjjjjjP23ramsPjS2_PiPfS3_P6float2S4_S4_S4_S4_' for 'sm_20'
ptxas info : Used 30 registers, 1612+0 bytes smem, 144 bytes cmem[0], 27368 bytes cmem[2], 8 bytes cmem[16]
g++ -L/usr/local/lib64 -DTIXML_USE_STL -pthread -L. -L /home/build/build/lib64 -L /home/build/build/lib64 -L /usr/local/cuda/lib64 -L ~build/NVIDIA_CUDA_SDK/lib -L ~build/NVIDIA_CUDA_SDK/common/lib/linux/ -lcudart -lcutil -shared -o libEngineGNU64.so -DTIXML_USE_STL -I /home/build/build -I /home/build/build -I. -I /usr/local/include -I /usr/include -I /usr/local/cuda/include -I ~build/NVIDIA_CUDA_SDK/common/inc/ -O3 -fomit-frame-pointer -falign-functions=16 -fprefetch-loop-arrays -fpeel-loops -funswitch-loops -ftree-vectorize -fPIC -m64 -fargument-noalias -funroll-loops Engine/Engine.o
EDIT: It seems that with sm_20 I get twice the register usage I got with CUDA 2.3. Why is that? The dual dispatcher? Is that OK?
If I remove the -maxrregcount flag (which was there in the first place for another kernel), the kernel uses 32 registers, whereas under 2.3 it used only 16.
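As a side note, since -maxrregcount caps every kernel in the file, I'm considering capping only the one kernel that needs it with `__launch_bounds__` (a sketch; `MyKernel` is a placeholder, 128 is the block size from the command line above):

```
// Per-kernel register cap: the compiler limits register usage so that
// blocks of 128 threads can launch, instead of a file-wide -maxrregcount.
__global__ void __launch_bounds__( 128 ) MyKernel( /* params */ )
{
    // ...
}
```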
Any ideas/assistance would be highly appreciated.
Thanks
Eyal