I finished my CUDA kernel for an unstructured Jacobi solver; this is my debug output:
ptxas info : Compiling entry function '_Z25Kernel_Rocket_halo_updatePfPiiS0_S0_' for 'sm_10'
3> ptxas info : Used 14 registers, 40+16 bytes smem, 8 bytes cmem[1]
3> ptxas info : Compiling entry function '_Z13Kernel_RocketiPiPfS_' for 'sm_10'
3> ptxas info : Used 15 registers, 16032+16 bytes smem, 24 bytes cmem[1]
and
ptxas info : Compiling entry function '_Z25Kernel_Rocket_halo_updatePfPiiS0_S0_' for 'sm_20'
3> ptxas info : Function properties for _Z25Kernel_Rocket_halo_updatePfPiiS0_S0_
3> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
3> ptxas info : Used 14 registers, 72 bytes cmem[0]
3> ptxas info : Compiling entry function '_Z13Kernel_RocketiPiPfS_' for 'sm_20'
3> ptxas info : Function properties for _Z13Kernel_RocketiPiPfS_
3> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
3> ptxas info : Used 16 registers, 16000+0 bytes smem, 64 bytes cmem[0]
I am now a little confused about how to proceed with the occupancy calculator. I have two cards to test my code on, a Tesla C1060 and a GTX 470. I tried both cards with block_size 512, and the Fermi card with 1024 as well. The GTX is clearly the card that gives much better results. My question now is: how do I proceed to improve my kernel's performance?
It would be great if someone could tell me what the optimal configuration would be based on these numbers.
If no further improvements are possible at the moment, do you think performance will get better when I am solving 3D problems (i.e. more values in shared memory compared to my current 2D version)?
Does anyone have experience with CFD code and what kind of speedup to expect? This is the pure solving part at the moment, no coefficient building so far.
Don’t focus too much on occupancy. Calculate how close you get to theoretical peak memory bandwidth and theoretical peak instruction throughput, and then decide which of the two is worth optimizing.
I haven’t used it yet; I’ve just started to look at the occupancy calculator.
@tera
Cheers for your answers. Do I calculate the peak memory bandwidth and peak instruction throughput from the data provided above, or do I need something else, or a specific tool?
I just noticed that you compiled your code for the Tesla C1060 as sm_10. The Tesla 10xx series is of compute capability 1.3, so the first optimization should be to compile for sm_13.
For the values actually achieved, you need to count how many bytes the kernel reads and writes, and how many instructions/operations it performs, and divide those by its runtime.
ptxas info : Compiling entry function '_Z25Kernel_Rocket_halo_updatePfPiiS0_S0_' for 'sm_13'
3> ptxas info : Used 14 registers, 40+16 bytes smem, 8 bytes cmem[1]
3> ptxas info : Compiling entry function '_Z13Kernel_RocketiPiPfS_' for 'sm_13'
3> ptxas info : Used 15 registers, 16032+16 bytes smem, 24 bytes cmem[1]
OK, done that; it didn’t change anything. My question is more: given that I can’t change my register and shared memory usage, what should my next steps be to optimize the kernel? Do I have to tune the number of blocks and the block size to achieve the best possible results?
Sorry for the silly questions, but I am fairly new to CUDA and there are still a lot of things that confuse me.
I think I will definitely be using the GTX 470 card, as it seems to produce much better results.
Yes, benchmark different block sizes (which ideally are multiples of 64) and see which performs best. Since the second kernel can run just one block per SM on the Tesla (and 3 per SM on the Fermi GTX), it will obviously profit from large block sizes.
Sorry, I can’t really give better general advice than the Best Practices Guide has, as Nvidia has done a good job on the documentation. We might be able to give some specific advice here if you post some concrete code.
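A benchmarking harness for different block sizes might look like the sketch below. The kernel body and its argument list are stand-ins (the real Kernel_Rocket takes different arguments), and the dynamic shared-memory launch parameter is only illustrative, since the real kernel appears to use statically sized shared memory:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for the real kernel; the argument list here is hypothetical.
__global__ void Kernel_Rocket(int n, float *u)
{
    extern __shared__ float u_sh[];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) u_sh[threadIdx.x] = u[gid];   // dummy work
}

int main()
{
    const int n = 2000;                        // cells, as in the example
    float *u;
    cudaMalloc(&u, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int sizes[] = {64, 128, 192, 256, 384, 512};
    for (int i = 0; i < 6; ++i) {
        int    bs     = sizes[i];
        int    blocks = (n + bs - 1) / bs;     // round up
        size_t smem   = bs * sizeof(float);    // u_sh bytes per block

        cudaEventRecord(start);
        Kernel_Rocket<<<blocks, bs, smem>>>(n, u);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block size %3d: %.3f ms\n", bs, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(u);
    return 0;
}
```

For stable numbers you would typically launch each configuration many times in a loop and average, since a single launch is dominated by launch overhead.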
I tried various combinations of block sizes and shared memory sizes, but it looks like 512 threads per block is pretty much the optimum that can be achieved.
I am still a bit confused about the occupancy calculator (version 2.4). There are two entries, one for thread blocks per SM (8) and one for active thread blocks per SM (3). Do I have to care about the 8, or is it enough to keep the 3 in mind?
I will see if I can post some specific pieces of code; that would make giving advice easier.
8 thread blocks per SM is just the maximum possible; you don’t need to care about that. Active thread blocks per SM is the more relevant parameter (having more than one block per SM allows useful work to be done during a __syncthreads()), although the most interesting value is active warps (or active threads) per SM.
Would this then mean that a good layout would be 192 threads per block, so 192 * 8 blocks per SM = 1536 active threads per SM, and 1536/32 = 48 active warps per SM? (For a Fermi 2.0 GTX 470.)
Note you only have 3 blocks per SM = 576 threads per SM because of the shared memory usage, giving you 38% occupancy, which is on the very low end of the useful range. As I said earlier, 8 blocks per SM is only the theoretical maximum.
OK, I think I am slowly getting it: the 16032+16 bytes from the output already occupy approximately a third of the 48 KB available per SM, right? That is why only 3 blocks can be resident per SM.
So this actually means I have to get my shared memory usage down. Just to give a rough idea of what the code does:
a domain decomposition tool splits my problem into chunks of almost equal size (e.g. 2000 cells / 512 gives me 4 domains = 4 blocks)
each chunk gets halo cells allocated to establish block-to-block communication
the actual solving just calculates the sum of all cells that surround the current cell and divides it by their number
u_sh is used to store the actual cell values (tx=0…tx=n) and the halo values (tx=n+1…tx=n+m)
the kernel itself uses allocation tables to pick the right values out of shared memory
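Based on that description, a minimal sketch of what such a step might look like; everything except u_sh is an invented name, MAX_NEIGH is an assumed padding width for the allocation table, and the indexing is only illustrative:

```cuda
#define MAX_NEIGH 4   // assumed maximum neighbours per cell (hypothetical)

__global__ void jacobi_step_sketch(float *u, const int *neigh_table,
                                   const int *neigh_count, int n_cells)
{
    extern __shared__ float u_sh[];          // cell values + halo values
    int tx  = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tx;

    if (gid < n_cells)
        u_sh[tx] = u[gid];                   // load own cell (halo loads omitted)
    __syncthreads();

    if (gid < n_cells) {
        int   m   = neigh_count[gid];        // number of surrounding cells
        float sum = 0.0f;
        for (int k = 0; k < m; ++k)          // gather neighbours through the
            sum += u_sh[neigh_table[gid * MAX_NEIGH + k]];  // allocation table
        u[gid] = sum / m;                    // average of the surrounding cells
    }
}
```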
To get my block size down I basically just divide my problem into different-sized chunks. So if I got my shared memory down to around 6 KB, this would give me the theoretical maximum of 8 blocks per SM (8 * 6 KB <= 48 KB) while using 192 threads per block? This would then also mean that in my case more smaller blocks are better than fewer bigger blocks.
The second kernel is just responsible for updating the halo values after each iteration.
Sorry for the more or less stupid questions, but I am trying to finally understand the whole concept.