I wrote a program and ran it under the profiler, and I got the result below. I just don't understand what each number means. Can somebody explain? Thanks.
My GPU is a Tesla C2050.

Kernel details : Grid size: 61 x 70, Block size: 16 x 8 x 1
Register Ratio = 0.5 ( 16384 / 32768 ) [16 registers per thread]
Shared Memory Ratio = 0.0833333 ( 4096 / 49152 ) [512 bytes per Block]
Active Blocks per SM = 8 : 8
Active threads per SM = 1024 : 1536
Occupancy = 0.666667 ( 32 / 48 )
Achieved occupancy = 0.666667 (on 14 SMs)
Occupancy limiting factor = Block-Size
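As a cross-check, every figure in that profiler output can be reproduced from the compute capability 2.0 hardware limits. A minimal sketch (the 16 registers/thread and 512 bytes/block come from the profiler lines above; real hardware allocates registers with some granularity, so the register arithmetic is approximate):

```python
# Compute capability 2.0 (Tesla C2050) per-SM limits.
MAX_BLOCKS_PER_SM  = 8
MAX_WARPS_PER_SM   = 48
WARP_SIZE          = 32
REGS_PER_SM        = 32768
SMEM_PER_SM        = 49152          # bytes

threads_per_block = 16 * 8                             # 128
warps_per_block   = threads_per_block // WARP_SIZE     # 4

# How many blocks fit on one SM, for each resource:
by_blocks = MAX_BLOCKS_PER_SM                          # 8
by_warps  = MAX_WARPS_PER_SM // warps_per_block        # 12
by_regs   = REGS_PER_SM // (threads_per_block * 16)    # 16 regs/thread -> 16
by_smem   = SMEM_PER_SM // 512                         # 512 B/block -> 96

active_blocks = min(by_blocks, by_warps, by_regs, by_smem)   # 8 (block limit)
active_warps  = active_blocks * warps_per_block              # 32
occupancy     = active_warps / MAX_WARPS_PER_SM              # 32/48
reg_ratio     = active_blocks * threads_per_block * 16 / REGS_PER_SM   # 0.5
smem_ratio    = active_blocks * 512 / SMEM_PER_SM            # 0.0833...

print(active_blocks, active_warps, round(occupancy, 6))      # 8 32 0.666667
```

The limiting resource is the block count (8), which is why the profiler reports "Block-Size" as the occupancy limiting factor: with only 4 warps per block, 8 blocks give 32 of a possible 48 warps.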

All looks pretty good to me:

plenty of registers spare,
occupancy is above 0.5, so that's fine,
loads of shared memory left.

If you get a bottleneck on transfers to/from global arrays you might be able to improve that using the shared memory.

If your design allows it, you can increase occupancy by changing your blocks to, say, 16 x 12. Occupancy is only one thing that affects performance, though, so it might not run any faster with 192 threads per block; it might even run slower.
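To illustrate that suggestion with rough numbers (assuming the kernel keeps its 16 registers per thread and 512 bytes of shared memory per block, as in the profiler output), here is a sketch of how the active-warp count changes with block size on a compute capability 2.0 device:

```python
def active_warps(threads_per_block, regs_per_thread=16, smem_per_block=512):
    """Approximate active warps per SM for a compute capability 2.0 device."""
    warps = -(-threads_per_block // 32)        # round up to whole warps
    blocks = min(8,                            # max resident blocks per SM
                 48 // warps,                  # max resident warps per SM
                 32768 // (threads_per_block * regs_per_thread),  # register file
                 49152 // smem_per_block)      # shared memory
    return blocks * warps

print(active_warps(16 * 8))    # 16 x 8 block: 32 warps -> 32/48 occupancy
print(active_warps(16 * 12))   # 16 x 12 block: 48 warps -> full occupancy
```

With 192-thread blocks, each block holds 6 warps, so the 8-block limit now yields 8 x 6 = 48 warps, i.e. 100% occupancy, while registers (10 blocks' worth) and shared memory are still not the bottleneck.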


Thanks a lot for your attention.

But I still want to know the meaning of the numbers,

for example:

Active Blocks per SM = 8 : 8 [what does 8 : 8 mean?]

Occupancy = 0.666667 ( 32 / 48 ) [what does 32 / 48 mean?]


The Wikipedia CUDA page (http://en.wikipedia.org/wiki/CUDA), about halfway down, has a table of technical specifications. For a compute capability 2.x device it shows:

Maximum number of resident blocks per multiprocessor 8
Maximum number of resident warps per multiprocessor 48
Maximum number of resident threads per multiprocessor 1536

So what it is saying is that you could run larger blocks. With your current code each SM will have 8 active blocks of 4 warps each (32 warps total) running, but the hardware can have 48 warps resident. It's worth trying a larger block size if that's easy to do, to see if your run time drops.

But deviceQuery shows:

Tesla C2050

Total amount of global memory: 2817720320 bytes
Number of multiprocessors: 14
Number of cores: 448
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum size of each dimension of a block: 1024 x 1024 x 64
Maximum size of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes

What's the difference between "per block" and "per multiprocessor"? Why are the numbers different?

Thanks!

Each of the 14 multiprocessors can be assigned more than one block at a time.
“Maximum number of resident blocks per multiprocessor 8”

Each multiprocessor can only actually be executing instructions from one block at a time, but if that block has to wait for data, the multiprocessor very quickly switches context to another block. In this way the delay in waiting for data ('latency') is hidden (provided there is a block that is able to run).

I see!!! Thank you very much.

But do you have some documents that focus on these execution details?


The CUDA C Programming Guide Chapters 1 and 2 are a good reference.

Thanks, but I want more details about how the GPU works. The Programming Guide just focuses on programming; I want more about the architecture, the execution process, and optimization. Do you have any? Thanks a lot.

Although a bit dated (it only addresses compute capability 1.x), this is a great reference: Demystifying GPU Microarchitecture through Microbenchmarking.