understanding the trade-off between block size and occupancy

I wrote a simple finite-difference code with CUDA and profiled it with the Visual Profiler. I found some strange behavior regarding block size and occupancy.

The calculation is rather simple: each thread loads a few numbers from global memory, does some simple algebra, and writes the results back. The memory reads/writes are all coalesced.

My domain size is 512x512. If I set griddim(128/N,128/N) and blockdim(4N,4N), with N=0.25, 0.5, 1, 2, 3, 4, I see the occupancy range from 0.25 up to 1 (at N=4). However, the run times for the various values of N are quite different: the full-occupancy case (N=4) is in fact twice as slow as the quarter-occupancy case (N=1).

Can anyone explain why I see this behavior?

My card is a GTX 295, and I compiled the code with CUDA 2.3.

Thanks in advance.

Qianqian

When N=1:

Occupancy analysis for kernel 'UpdateE' for context 'Session2 : Device_0 : Context_0' : 

Kernel details : Grid size: 128 x 128, Block size: 4 x 4 x 1

Register Ratio		= 0.5  ( 8192 / 16384 ) [14 registers per thread] 

Shared Memory Ratio	= 0.25 ( 4096 / 16384 ) [88 bytes per Block] 

Active Blocks per SM	= 8 : 8

Active threads per SM	= 128 : 1024

Occupancy		= 0.25  ( 8 / 32 )

Occupancy limiting factor	= Block-Size 

Warning: Grid Size (16384) is not a multiple of available SMs (30).

When N=4:

Occupancy analysis for kernel 'UpdateE' for context 'Session5 : Device_0 : Context_0' : 

Kernel details : Grid size: 32 x 32, Block size: 16 x 16 x 1

Register Ratio		= 0.875  ( 14336 / 16384 ) [14 registers per thread] 

Shared Memory Ratio	= 0.125 ( 2048 / 16384 ) [88 bytes per Block] 

Active Blocks per SM	= 4 : 8

Active threads per SM	= 1024 : 1024

Occupancy		= 1  ( 32 / 32 )

Occupancy limiting factor	= None

Warning: Grid Size (1024) is not a multiple of available SMs (30).

I think it is because of memory-access latency hiding.

The bottleneck of your program is likely global memory access, so even at occupancy 1 the SM is still waiting on operands before it can continue computing. At occupancy 1 there are only 4 active blocks per SM, but at occupancy 0.25 there are 8 blocks available to hide the latency.