Fermi question

Hi,

I know this sounds a bit stupid, but I don't get any performance boost from my Linux GTX 480 over my Windows GTX 280. Both environments are 64-bit.

The kernel is ~ 50% memory bound and 50% compute bound.

I tried the following things:

  1. Replaced texture accesses with global memory accesses.

  2. Configured the main kernel for 16KB shared memory and 48KB L1, or vice versa (see the sketch after this list).

  3. Increased the number of threads per block from 64 to 128, 256, 512 and 1024.
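For item 2, something along these lines (just a sketch; MyEngineKernel is a placeholder for the real kernel symbol):

// Prefer 48KB shared memory / 16KB L1:
cudaFuncSetCacheConfig( MyEngineKernel, cudaFuncCachePreferShared );

// ...or prefer 48KB L1 / 16KB shared memory:
cudaFuncSetCacheConfig( MyEngineKernel, cudaFuncCachePreferL1 );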

The kernel looks roughly like this:

// Linearly interpolates between two consecutive input samples (weights w1/w2)
// and accumulates the results into the running sums f1..f4.
__inline__ __device__ void CalcTraceData( unsigned int iTraceChunk, unsigned int iTraceIndex, unsigned short iTimeIndex,
                                          short Nb, short numMoveSamples, float w2, int InputTracePos,
                                          float2 *pDataInput,   // not used in this texture version
                                          float &f1, float &f2, float &f3, float &f4 )
{
	if ( ( iTimeIndex >= Nb ) && ( iTimeIndex < numMoveSamples ) )
	{
		float w1 = 1.f - w2;

		float2 fCurrentTimeValues = tex1Dfetch( texInput, InputTracePos );
		float2 fNextTimeValues    = tex1Dfetch( texInput, InputTracePos + 1 );

		float fPhaseValW1        = fCurrentTimeValues.x * w1;
		float fPhaseValPlusOneW2 = fNextTimeValues.x * w2;

		f1 += fPhaseValW1 + fPhaseValPlusOneW2;
		f2 += fCurrentTimeValues.x * fPhaseValW1 + fNextTimeValues.x * fPhaseValPlusOneW2;
		f3 += fCurrentTimeValues.y * w1 + fNextTimeValues.y * w2;
		f4++;
	}
}

// Main kernel function (fragment)
__shared__ short2 smPositions[ ENGINE_BLOCK_THREAD_SIZE ];
__shared__ float  smw2[ ENGINE_BLOCK_THREAD_SIZE ];
__shared__ int    smInputTracePos[ ENGINE_BLOCK_THREAD_SIZE ];

// Loop is manually unrolled 16 times...
for( int iTraceIndex = 0; iTraceIndex < ENGINE_BLOCK_THREAD_SIZE; iTraceIndex++ )
{
	CalcTraceData( iTraceChunk, iTraceIndex, iTimeIndex,
	               smPositions[ iTraceIndex ].x, smPositions[ iTraceIndex ].y,
	               smw2[ iTraceIndex ], smInputTracePos[ iTraceIndex ], pDataInput,
	               fTraceOutPhase, fTraceOutPhase2, fTraceOutStack, fFold );
}

Some of the changes yielded even worse performance than the GTX 280, and the others gave the same performance.

Here is the command line I’ve used (for a block size of 128):

/usr/local/cuda/bin/nvcc --ptxas-options="-v" -gencode=arch=compute_20,code="sm_20,compute_20" -maxrregcount=30 --compiler-options="-fno-strict-aliasing -fPIC" -I ~build/NVIDIA_CUDA_SDK/common/inc/ -I /home/build/build -I /home/build/build -I. -I /usr/local/include -I /usr/include -I /usr/local/cuda/include -O3 -DUNIX -o Engine/GPU/Engine.o -c Engine/GPU/Engine.cu

ptxas info : Compiling entry function '_Z15CalculateEngineILb0EEvjjjjjjP23ramsPjS2_PiPfS3_P6float2S4_S4_S4_S4_' for 'sm_20'

ptxas info : Used 30 registers, 1612+0 bytes smem, 144 bytes cmem[0], 27368 bytes cmem[2], 8 bytes cmem[16]

g++ -L/usr/local/lib64 -DTIXML_USE_STL -pthread -L. -L /home/build/build/lib64 -L /home/build/build/lib64 -L /usr/local/cuda/lib64 -L ~build/NVIDIA_CUDA_SDK/lib -L ~build/NVIDIA_CUDA_SDK/common/lib/linux/ -lcudart -lcutil -shared -o libEngineGNU64.so -DTIXML_USE_STL -I /home/build/build -I /home/build/build -I. -I /usr/local/include -I /usr/include -I /usr/local/cuda/include -I ~build/NVIDIA_CUDA_SDK/common/inc/ -O3 -fomit-frame-pointer -falign-functions=16 -fprefetch-loop-arrays -fpeel-loops -funswitch-loops -ftree-vectorize -fPIC -m64 -fargument-noalias -funroll-loops Engine/Engine.o

EDIT: It seems that with SM 2.0 I get twice the register usage compared to CUDA 2.3. Why is that? The dual dispatcher? Is that OK?

If I remove the maxrregcount flag (which was there in the first place for another kernel), the kernel uses 32 registers, whereas under CUDA 2.3 it only used 16.

Any ideas/assistance would be highly appreciated.

Thanks

Eyal

Try to stop unrolling the loop.
And what is your grid size?
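I mean something like this (just a sketch based on your loop above): drop the manual unrolling and keep the plain loop, with #pragma unroll 1 so nvcc does not unroll it either.

// Keep the plain loop; #pragma unroll 1 tells the compiler not to unroll it,
// which can reduce register pressure.
#pragma unroll 1
for( int iTraceIndex = 0; iTraceIndex < ENGINE_BLOCK_THREAD_SIZE; iTraceIndex++ )
{
	CalcTraceData( iTraceChunk, iTraceIndex, iTimeIndex,
	               smPositions[ iTraceIndex ].x, smPositions[ iTraceIndex ].y,
	               smw2[ iTraceIndex ], smInputTracePos[ iTraceIndex ], pDataInput,
	               fTraceOutPhase, fTraceOutPhase2, fTraceOutStack, fFold );
}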

Why??? In any case, it made the code run slower by ~10-15%.

For the test case, I use a 50x50 grid with 128 threads per block.

eyal

After some tests, and with assistance from an NVIDIA dev-tech employee, I found the following:

  1. The main reason for not seeing any performance boost was low occupancy. On the GTX 280 I used
    64 threads per block, mainly to achieve high occupancy and because of smem considerations.
    On the 480 this configuration yielded ~33% occupancy, which seems to be too low for Fermi.
    Increasing the threads per block to 256 gave an instant boost of ~30-40% (see the back-of-the-envelope numbers after this list).
  2. Roughly twice as many registers are now used, probably because of the move to 64-bit in CUDA 3.0, so
    the doubling of the hardware register file is probably "wasted" on this.
  3. It seems that for my specific algorithm the L1 cache didn't show any performance benefit whatsoever.
  4. I'll try 3.1, as it might have lower register consumption and might improve occupancy.
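For reference, the back-of-the-envelope occupancy numbers behind point 1, assuming the usual compute 2.0 per-SM limits (1536 resident threads, 8 resident blocks, 32K registers) and ignoring register allocation granularity:

// 64 threads/block:  the 8-blocks-per-SM limit caps residency at 8 * 64 = 512 threads
//                    -> 512 / 1536 = ~33% occupancy
// 256 threads/block: 30 regs/thread * 256 = 7680 regs per block -> 32768 / 7680 = ~4 blocks
//                    -> 4 * 256 = 1024 threads -> 1024 / 1536 = ~67% occupancy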

The worst part, at least as I see it, is this:
I knew my code was roughly 50% compute bound and 50% memory bound. Now, although the compute
throughput of Fermi supposedly increased by a factor of 2 (MAD and twice the core count), the memory bandwidth
only increased from ~140GB/s to ~180GB/s, which is roughly the 30-40% boost I've seen.
As far as I know most kernels are memory bound, which means that the move to Fermi will give
a ~30% boost unless the L1 cache really shines…

I’d be very interested to hear what you guys think.

eyal

Did you try to stop unrolling the loop on Fermi?

Why??? In any case, it made the code run slower by ~10-15% (on Fermi and on non-Fermi cards).

Hi,
A quick update on the previous post.

Changing to 3.1 didn't change the register usage or the performance.
The 3.1 documentation confirms the 64-bit issue which causes the kernel to use more registers:

“If you build your application in 64-bit mode (either by passing -m64 to nvcc or by specifying neither -m64 nor -m32 when compiling on a 64-bit machine), e.g., to gain access to more than 4GB of system memory, be aware that nvcc will compile both the host code and the device code in 64-bit mode for devices of compute capability 2.0. While this works, the larger pointers in the device code sometimes incur a performance penalty for the device (because of the extra space those pointers occupy in the register file, among other reasons). If benchmarking your application shows a measurable performance difference between otherwise-identical 64-bit mode and 32-bit mode code, and if you are not targeting GPUs with large amounts of device memory that can take advantage of a 64-bit address space, and if you use the driver API to load your device code, then this performance penalty can be avoided by running 32-bit device code alongside the 64-bit host code. Note that this hybrid approach is not supported by the runtime API, which requires the bitness of the device code and host code to match. Furthermore, special care must be taken when using this mode with the driver API to ensure that any pointers passed from the host code to the device code are of the appropriate length.”
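For reference, the hybrid route described there would look roughly like this (just a sketch, not something I've tried; the cubin file name is a placeholder):

# Build only the device code as 32-bit, separately from the 64-bit host build:
/usr/local/cuda/bin/nvcc -m32 -arch=sm_20 -cubin -o Engine.cubin Engine/GPU/Engine.cu

// Load it from the 64-bit host code through the driver API instead of the runtime:
CUmodule hModule;
CUfunction hKernel;
cuModuleLoad( &hModule, "Engine.cubin" );
cuModuleGetFunction( &hKernel, hModule, "<mangled kernel name>" );
// ...then set up arguments and launch via the cuParamSet* / cuLaunchGrid driver calls.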

Anyway I'd still be happy to hear what you think about what I wrote before:
"As far as I know most kernels are memory bound, which means that the move to Fermi will give
a ~30% boost unless the L1 cache really shines…"

thanks
eyal

hoomd kernels are memory bound and get a 60% performance increase on the GTX 480 vs the GTX 285, with a semi-random memory access pattern.

I cannot explain that result. As you point out, memory bandwidth only increases by 30% on the hardware. And hoomd uses the texture cache, not L1. All attempts I have made to utilize the L1 cache for semi-random reads have resulted in lower performance than the texture cache. Maybe the texture cache benefits from L2?? And maybe it has a different cache line size than L1??? I wish NVIDIA would give us more info about the way the texture cache works, even if it does change from generation to generation. In some applications, knowing how the cache works can lead to drastic performance increases. When I find time, I'm going to write a mini cache simulator for the L1 cache for my data access pattern and see if I can find out what is wrong there… but I digress.

On your point 1 from a previous post: I have also found that tuning block sizes is extremely important on Fermi. I upgraded to CUDA 3.1beta and that 60% performance boost went away! Then I re-benchmarked all kernels at all block sizes. With re-tuned values, a majority of the performance came back, though not all of it. In one specific kernel, the optimum block size vs. the least optimum block size was a 100% difference in performance. This is a big contrast to G200 where I typically only observe 20% variation across block sizes. And remember, Fermi can go to block sizes of 1024! I have yet to find a kernel that runs well over 512, but to be thorough in the tuning, one must benchmark all block sizes.
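The benchmarking itself is nothing fancy, roughly this kind of sweep (a sketch; MyKernel, d_data and N are placeholders, and the usual includes are assumed):

// Time the kernel at every block size from 32 to 1024 and print the results.
for( int block = 32; block <= 1024; block += 32 )
{
	int grid = ( N + block - 1 ) / block;

	cudaEvent_t start, stop;
	cudaEventCreate( &start );
	cudaEventCreate( &stop );

	cudaEventRecord( start, 0 );
	MyKernel<<< grid, block >>>( d_data, N );
	cudaEventRecord( stop, 0 );
	cudaEventSynchronize( stop );

	float ms = 0.0f;
	cudaEventElapsedTime( &ms, start, stop );
	printf( "block size %4d : %f ms\n", block, ms );

	cudaEventDestroy( start );
	cudaEventDestroy( stop );
}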

This is especially weird as nVidia states in the docs that Fermi should favour direct gmem (via L1) over textures. For my code it was the opposite.

That might be related to the increased register usage (which I'm still looking at, but it looks to me like 3.0/3.1 uses far too many registers, especially when I use constants to initialize some parameters in the code). Increasing the threads per block should not change the register count, so you can give Fermi more threads to work with. Furthermore, many of my kernels require some logic/loops/preparation above the inner loop, which also takes a bit of time, so using twice the threads per block should cut this preparation time in half.

Increasing past 256 didn't show any improvement for me, though.

Anyway, so far the results for Fermi, at least for me, are a bit disappointing… :(

eyal

Wow, I just checked something regarding the register pressure on Fermi and stumbled onto something… :)

My original code used this:

__shared__ int smDataX[ BLOCK_THREAD_SIZE ];
__shared__ int smDataY[ BLOCK_THREAD_SIZE ];
__shared__ int smnb[ BLOCK_THREAD_SIZE ];

I guess each shared memory array gets translated into a pointer, so even on 32-bit (pre CUDA 3.0) this would get translated into 3 pointers.

However the following:

__shared__ int3 smData[ BLOCK_THREAD_SIZE ];

Gets translated into one pointer!!! And on Fermi (with pointers going from 4 bytes to 8) this is a big difference in register usage, and hence occupancy…

I got down from 39 registers to 32 in this kernel and the occupancy jumped from 50% to 67%.

The performance increased by ~10% due to this change, even though there might be smem bank conflicts now that I'm using int3 instead of int…
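Applied to the shared arrays in my main kernel above, the same merge would look something like this (a sketch; the struct name is made up):

// One shared array instead of three (smPositions / smw2 / smInputTracePos):
struct TraceParams
{
	short2 pos;            // was smPositions[]
	float  w2;             // was smw2[]
	int    inputTracePos;  // was smInputTracePos[]
};

__shared__ TraceParams smTrace[ ENGINE_BLOCK_THREAD_SIZE ];

// ...and inside the loop:
CalcTraceData( iTraceChunk, iTraceIndex, iTimeIndex,
               smTrace[ iTraceIndex ].pos.x, smTrace[ iTraceIndex ].pos.y,
               smTrace[ iTraceIndex ].w2, smTrace[ iTraceIndex ].inputTracePos, pDataInput,
               fTraceOutPhase, fTraceOutPhase2, fTraceOutStack, fFold );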

eyal

Yep, they are disappointing for me too. My test kernel, which does a lot of very scattered memory reads, always runs slower than on a GTX 260 (the old version, with 192 cores).

So what’s your performance when you use 32 bit pointers instead?

To use 32 bit pointers you either need to compile on a 32 bit machine, or use the driver interface (not the runtime) on a 64 bit machine.


No, I'm still using 64-bit for Fermi (I pointed out that the trick works on pre-3.0 64-bit as well).

By merging three different shared memory arrays in the kernel into one, the kernel needs 2 fewer pointers (4 bytes each on 32-bit, 8 on 64-bit). On Fermi this reduced my register usage from 39 to 32, increasing occupancy from 50% to 67%. In one of the kernels it gave me a ~20% performance boost :)

eyal


I wonder how other people are doing with the new Fermi cards? Has anyone with intensive memory usage been able to achieve a significant boost over the previous generation?? (>40%)

Something else that I've found (well, I knew it, but it didn't sink in till now): the amount of data I read from memory is huge (as for many other HPC users, I guess), so neither the L1 nor even the L2 makes any difference to my code, as the caches can't cope with that amount of data. One of the main features of Fermi and I can't make use of it… :(

eyal


If moving to FERMI is going to give me 10 to 30% performance gains, I would rather use 2 TESLAs… no?
(Most kernels are memory intensive… at least up to 30 to 40%.)


I would guess that with so many cores such a small cache will not speed up memory transfers, but it can improve memory coherency: successive writes are now seen by other cores in the proper order, and in addition atomic ops are faster on Fermi thanks to the cache. It would be cool to have the ability to configure the L2 cache as 'global' shared memory; if we could decide what should be stored in it, maybe it would give more benefit.