I wrote a program and ran it under the profiler, and got the following result. I just don't understand the meaning; can somebody explain what each number means? Thanks.
My GPU is a Tesla C2050.
Kernel details : Grid size: 61 x 70, Block size: 16 x 8 x 1
Register Ratio = 0.5 ( 16384 / 32768 ) [16 registers per thread]
Shared Memory Ratio = 0.0833333 ( 4096 / 49152 ) [512 bytes per Block]
Active Blocks per SM = 8 : 8
Active threads per SM = 1024 : 1536
Occupancy = 0.666667 ( 32 / 48 )
Achieved occupancy = 0.666667 (on 14 SMs)
Occupancy limiting factor = Block-Size
Plenty of registers spare.
Occupancy is above 0.5, so that's fine.
Loads of shared memory left.
If you get a bottleneck on transfers to/from global arrays, you might be able to improve that by using shared memory.
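For illustration, here is a minimal sketch (assuming a made-up 1D 3-point stencil kernel, not your actual code) of staging global data in shared memory so that threads in a block reuse the same values instead of each re-reading them from global memory:

// Hypothetical kernel, just to show the shared-memory staging pattern.
// Assumes the block is launched with BLOCK threads and n is a multiple of blockDim.x.
#define BLOCK 128

__global__ void stencil_shared(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK + 2];              // one halo element on each side

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;                     // shifted by 1 for the left halo

    tile[lid] = in[gid];                           // each thread loads one element
    if (threadIdx.x == 0)                          // first thread loads the left halo
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)             // last thread loads the right halo
        tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;

    __syncthreads();                               // wait until the tile is complete

    out[gid] = tile[lid - 1] + tile[lid] + tile[lid + 1];   // neighbours come from shared memory
}

Each input element is now fetched from global memory once per block instead of up to three times, which is the kind of saving shared memory is there for.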
If your design allows it, you can increase occupancy by changing your block size to, say, 16 x 12. Occupancy is only one of the things that affect performance, though, so it might not run any faster with 192 threads per block; it might even run slower.
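For example (hypothetical host code; your kernel name and domain size will differ, the 976 x 560 figures are just read off the 61 x 70 grid above), switching block size only means changing the block dimensions and recomputing the grid:

const int width = 976, height = 560;     // rough domain implied by a 61 x 70 grid of 16 x 8 blocks

dim3 block(16, 8);                       // current: 128 threads = 4 warps per block
// dim3 block(16, 12);                   // candidate: 192 threads = 6 warps per block

dim3 grid((width  + block.x - 1) / block.x,
          (height + block.y - 1) / block.y);

myKernel<<<grid, block>>>(d_in, d_out, width, height);   // myKernel, d_in, d_out are placeholders

Just make sure the kernel's index calculations and any __shared__ array sizes still work with the new blockDim.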
The CUDA article on Wikipedia has a table of technical specifications about halfway down.
For a compute capability 2.x device it shows:
Maximum number of resident blocks per multiprocessor 8
Maximum number of resident warps per multiprocessor 48
Maximum number of resident threads per multiprocessor 1536
So what it is saying is that you could run larger blocks. With your current code each SM will have 8 active blocks of 4 warps each (32 warps total) running, but the hardware can hold 48 resident warps. It's worth trying a larger block size if that's easy to do, to see if your run time drops.
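To spell out that arithmetic (just back-of-the-envelope figures, not profiler output):

// Current launch on a compute capability 2.0 SM:
int threads_per_block = 16 * 8;                             // 128
int warps_per_block   = threads_per_block / 32;             // 4
int resident_blocks   = 8;                                  // hardware cap per SM
int resident_warps    = resident_blocks * warps_per_block;  // 8 * 4 = 32 of a possible 48
// occupancy = 32 / 48 = 0.67, matching the profiler.
// With 16 x 12 blocks: 192 threads = 6 warps, 8 blocks * 6 warps = 48 -> occupancy 1.0.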
Each of the 14 multiprocessors can be assigned more than one block at a time.
“Maximum number of resident blocks per multiprocessor 8”
Each multiprocessor can only actually be executing instructions from one block at a time, but if that block has to wait for data, the multiprocessor very quickly switches context to another block. In this way the delay in waiting for data ('latency') is hidden (provided there is a block that is able to run).
Thanks, but I want more details about how the GPU works. The Programming Guide just focuses on programming; I want more about the architecture, the execution process, and optimization. Do you have any recommendations? Thanks a lot.