understanding the trade-off between block size and occupancy

I wrote a simple finite-difference code with CUDA and profiled it with the Visual Profiler. I found some strange behavior regarding block size and occupancy.

The calculation is rather simple: each thread loads a few numbers from global memory, does some simple algebra, and writes the results back. The memory reads/writes are all coalesced.

My domain size is 512x512. If I set griddim(128/N,128/N) and blockdim(4N,4N), with N=0.25, 0.5, 1, 2, 3, 4, I see the occupancy range from 0.25 up to 1 (at N=4). However, the run times for the various values of N are quite different: the full-occupancy case (N=4) is in fact twice as slow as the quarter-occupancy case (N=1).

Can anyone explain why I see this behavior?

My card is a GTX 295, and I compiled the code with CUDA 2.3.

Thanks in advance.

Qianqian

When N=1:

Occupancy analysis for kernel 'UpdateE' for context 'Session2 : Device_0 : Context_0' : 

Kernel details : Grid size: 128 x 128, Block size: 4 x 4 x 1

Register Ratio		= 0.5  ( 8192 / 16384 ) [14 registers per thread] 

Shared Memory Ratio	= 0.25 ( 4096 / 16384 ) [88 bytes per Block] 

Active Blocks per SM	= 8 : 8

Active threads per SM	= 128 : 1024

Occupancy		= 0.25  ( 8 / 32 )

Occupancy limiting factor	= Block-Size 

Warning: Grid Size (16384) is not a multiple of available SMs (30).

When N=4:

Occupancy analysis for kernel 'UpdateE' for context 'Session5 : Device_0 : Context_0' : 

Kernel details : Grid size: 32 x 32, Block size: 16 x 16 x 1

Register Ratio		= 0.875  ( 14336 / 16384 ) [14 registers per thread] 

Shared Memory Ratio	= 0.125 ( 2048 / 16384 ) [88 bytes per Block] 

Active Blocks per SM	= 4 : 8

Active threads per SM	= 1024 : 1024

Occupancy		= 1  ( 32 / 32 )

Occupancy limiting factor	= None

Warning: Grid Size (1024) is not a multiple of available SMs (30).

I think it is because of memory-access latency hiding.

The bottleneck of your program is likely global memory access, so even at occupancy 1 the SM is still waiting on operands before it can continue computing. At occupancy 1 there are only 4 active blocks per SM, but at occupancy 0.25 there are 8 blocks available to hide the latency.