Occupancy limiting factor = Block-Size Occupancy limit

jameschen · June 1, 2011, 6:56pm

I wrote a program and ran it under the profiler, and got the following result. I don't understand the output. Can somebody explain what each number means? Thanks.
My GPU is a Tesla C2050.

Kernel details : Grid size: 61 x 70, Block size: 16 x 8 x 1
Register Ratio = 0.5 ( 16384 / 32768 ) [16 registers per thread]
Shared Memory Ratio = 0.0833333 ( 4096 / 49152 ) [512 bytes per Block]
Active Blocks per SM = 8 : 8
Active threads per SM = 1024 : 1536
Occupancy = 0.666667 ( 32 / 48 )
Achieved occupancy = 0.666667 (on 14 SMs)
Occupancy limiting factor = Block-Size

kbam · June 1, 2011, 11:48pm

All looks pretty good to me:

plenty of registers to spare,
occupancy is above 0.5, so that's fine,
loads of shared memory left.

If you get a bottleneck on transfers to/from global arrays you might be able to improve that using the shared memory.

If your design allows it, you can increase occupancy by changing your blocks to, say, 16 x 12. That is 192 threads (6 warps) per block, so 8 resident blocks would fill all 48 warp slots on an SM. Occupancy is only one of the things that affect performance, though, so it might not run any faster with 192 threads per block, and might even run slower.
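As a rough sketch of the block-size arithmetic above (not profiler code), the following Python estimates theoretical occupancy from the block shape alone. It assumes the compute capability 2.x limits of 8 resident blocks and 48 resident warps per SM, and it deliberately ignores register and shared memory limits, which could lower the result for other kernels. The function name `theoretical_occupancy` is made up for illustration.

```python
# Per-SM limits for compute capability 2.x (Fermi), as quoted in this thread.
MAX_BLOCKS_PER_SM = 8
MAX_WARPS_PER_SM = 48
WARP_SIZE = 32


def theoretical_occupancy(block_x, block_y, block_z=1):
    """Return (resident blocks, active warps, occupancy) for one SM.

    Simplification: only the block-count and warp-count limits are
    considered; registers and shared memory are assumed not to bind.
    """
    threads_per_block = block_x * block_y * block_z
    # Round up: a partially filled warp still occupies a full warp slot.
    warps_per_block = -(-threads_per_block // WARP_SIZE)
    blocks = min(MAX_BLOCKS_PER_SM, MAX_WARPS_PER_SM // warps_per_block)
    active_warps = blocks * warps_per_block
    return blocks, active_warps, active_warps / MAX_WARPS_PER_SM


if __name__ == "__main__":
    for shape in [(16, 8), (16, 12), (16, 16), (32, 32)]:
        blocks, warps, occ = theoretical_occupancy(*shape)
        print(f"{shape[0]} x {shape[1]}: {blocks} blocks, "
              f"{warps} warps, occupancy {occ:.3f}")
```

Under these assumptions 16 x 8 gives 32 of 48 warps, while 16 x 12 reaches 48 of 48. Very large blocks such as 32 x 32 fall back to 32 warps, because only one 32-warp block fits on an SM.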

Cheers,
kbam

jameschen · June 2, 2011, 1:46am

Thanks a lot for your attention.

But I still want to know the meaning of the numbers,

for example:

Active Blocks per SM = 8 : 8 [what does 8 : 8 mean?]

Occupancy = 0.666667 ( 32 / 48 ) [what does 32 / 48 mean?]

kbam · June 2, 2011, 5:34am

The CUDA article on Wikipedia has a table of technical specifications about halfway down.
For a compute capability 2.x device it shows:

Maximum number of resident blocks per multiprocessor 8
Maximum number of resident warps per multiprocessor 48
Maximum number of resident threads per multiprocessor 1536

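So "8 : 8" means 8 active blocks out of a maximum of 8 per SM, and "32 / 48" means 32 active warps out of a maximum of 48. What it is saying is that you could run larger blocks: with your current code each SM will have 8 active blocks of 4 warps each (32 warps in total), but the hardware can hold 48 warps. It is worth trying a larger block size if that is easy to do, to see if your run time drops.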
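To make the profiler numbers concrete, here is a minimal Python sketch that recomputes them from the kernel's resource usage. It is an approximation, not the profiler's actual algorithm: it assumes the compute capability 2.x per-SM limits (8 blocks, 48 warps, 1536 threads, 32768 registers, 49152 bytes of shared memory) and ignores allocation granularity. The function name and the limit labels other than "Block-Size" are made up for illustration.

```python
# Per-SM limits for compute capability 2.x (Fermi).
WARP_SIZE = 32
MAX_BLOCKS_PER_SM = 8
MAX_WARPS_PER_SM = 48
MAX_THREADS_PER_SM = 1536
REGISTERS_PER_SM = 32768
SHARED_MEM_PER_SM = 49152  # bytes


def profiler_summary(threads_per_block, regs_per_thread, smem_per_block):
    """Approximate the occupancy lines printed by the profiler."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # round up
    regs_per_block = regs_per_thread * threads_per_block

    # How many blocks each resource would allow on one SM.
    # Simplification: no register/shared memory allocation granularity.
    limits = {
        "Block-Size": MAX_BLOCKS_PER_SM,  # hardware cap on resident blocks
        "Warps": MAX_WARPS_PER_SM // warps_per_block,
        "Registers": (REGISTERS_PER_SM // regs_per_block
                      if regs_per_block else MAX_BLOCKS_PER_SM),
        "Shared-Memory": (SHARED_MEM_PER_SM // smem_per_block
                          if smem_per_block else MAX_BLOCKS_PER_SM),
    }
    limiting = min(limits, key=limits.get)  # first smallest entry wins
    blocks = limits[limiting]
    active_warps = blocks * warps_per_block

    return {
        "register_ratio": blocks * regs_per_block / REGISTERS_PER_SM,
        "shared_memory_ratio": blocks * smem_per_block / SHARED_MEM_PER_SM,
        "active_blocks": blocks,
        "active_threads": blocks * threads_per_block,
        "active_warps": active_warps,
        "occupancy": active_warps / MAX_WARPS_PER_SM,
        "limiting_factor": limiting,
    }


if __name__ == "__main__":
    # Kernel from the first post: 16 x 8 block, 16 registers per thread,
    # 512 bytes of shared memory per block.
    for name, value in profiler_summary(16 * 8, 16, 512).items():
        print(name, "=", value)
```

For the kernel in the first post this gives a register ratio of 0.5, 8 active blocks, 1024 active threads, and 32 of 48 warps, with the block cap as the first limit reached, which matches the profiler output.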

jameschen · June 2, 2011, 5:29pm

But deviceQuery shows this:

Tesla C2050
Total amount of global memory: 2817720320 bytes
Number of multiprocessors: 14
Number of cores: 448
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum size of each dimension of a block: 1024 x 1024 x 64
Maximum size of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes

What's the difference between per block and per multiprocessor? Why are the numbers different?

kbam · June 2, 2011, 11:53pm

Each of the 14 multiprocessors can be assigned more than one block at a time.
“Maximum number of resident blocks per multiprocessor 8”

Each multiprocessor can only be issuing instructions for a few warps at any instant, but if a warp has to wait for data the multiprocessor very quickly switches to another resident warp, from the same block or from a different one. In this way the delay in waiting for data ('latency') is hidden (provided there is a warp that is able to run).
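The latency-hiding idea can be illustrated with a toy model in Python. This is not how a Fermi SM is actually scheduled: it assumes a single issue slot per cycle, and that every warp issues one instruction and then waits a fixed number of cycles for data. The function name `issue_utilization` and the latency value are made up for illustration.

```python
def issue_utilization(num_warps, latency, cycles=10000):
    """Fraction of cycles in which some warp could issue an instruction.

    Toy model: one issue slot per cycle; after issuing, a warp is
    stalled for `latency` further cycles waiting on memory.
    """
    ready_at = [0] * num_warps  # first cycle at which each warp may issue
    issued = 0
    for t in range(cycles):
        for w in range(num_warps):
            if ready_at[w] <= t:
                # Issue from the first ready warp, then stall it.
                ready_at[w] = t + 1 + latency
                issued += 1
                break
        # If no warp was ready, this cycle is wasted (latency not hidden).
    return issued / cycles


if __name__ == "__main__":
    for warps in (1, 5, 10, 20):
        print(warps, "warps:", issue_utilization(warps, latency=9))
```

In this model, with a 9-cycle wait, one warp keeps the issue slot busy only a tenth of the time, while ten or more resident warps keep it busy every cycle. Beyond that point extra warps add nothing, which is one reason higher occupancy does not always mean a faster kernel.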

jameschen · June 3, 2011, 12:58am

I see! Thank you very much.

But do you have any documents that focus on these execution details?

seibert · June 3, 2011, 1:18pm

The CUDA C Programming Guide Chapters 1 and 2 are a good reference.

jameschen · June 5, 2011, 2:15pm

Thanks, but I want more details about how the GPU works. The Programming Guide focuses on programming; I want more about the structure, the execution process, and optimization. Do you have anything like that? Thanks a lot.

tera · June 5, 2011, 2:36pm

Although a bit dated (only addresses compute capability 1.x), this is a great reference: Demystifying GPU Microarchitecture through Microbenchmarking
