I finished my CUDA kernel for an unstructured Jacobi solver; this is my debug output:
ptxas info : Compiling entry function '_Z25Kernel_Rocket_halo_updatePfPiiS0_S0_' for 'sm_10'
3> ptxas info : Used 14 registers, 40+16 bytes smem, 8 bytes cmem[1]
3> ptxas info : Compiling entry function '_Z13Kernel_RocketiPiPfS_' for 'sm_10'
3> ptxas info : Used 15 registers, 16032+16 bytes smem, 24 bytes cmem[1]
and
ptxas info : Compiling entry function '_Z25Kernel_Rocket_halo_updatePfPiiS0_S0_' for 'sm_20'
3> ptxas info : Function properties for _Z25Kernel_Rocket_halo_updatePfPiiS0_S0_
3> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
3> ptxas info : Used 14 registers, 72 bytes cmem[0]
3> ptxas info : Compiling entry function '_Z13Kernel_RocketiPiPfS_' for 'sm_20'
3> ptxas info : Function properties for _Z13Kernel_RocketiPiPfS_
3> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
3> ptxas info : Used 16 registers, 16000+0 bytes smem, 64 bytes cmem[0]
I am now a little confused about how to proceed with the occupancy calculator. I have two cards to test my code on, a Tesla C1060 and a GTX 470. I tried both cards with block_size 512, and the Fermi card with 1024 as well. The GTX is clearly the card that gives much better results. My question now is: how do I proceed to improve my kernel's performance?
It would be great if someone could tell me what the optimal configuration would be based on these numbers.
If no further improvements are possible at the moment, do you think performance will get better when I am solving 3D problems (i.e. more values in shared memory compared to my current 2D version)?
Does anyone have experience with CFD code and what kind of speedup to expect? This is the pure solving part at the moment, no coefficient building so far.
Don’t focus too much on occupancy. Calculate how close you get to theoretical peak memory bandwidth and theoretical peak instruction throughput, and then decide which of the two is worth optimizing.
I haven’t used it yet; I’ve just started to look at the occupancy calculator.
@tera
Cheers for your answers. Do I calculate the peak memory bandwidth and peak instruction throughput from the data provided above, or do I need something else, or a specific tool?
I just noticed that you compiled your code for the Tesla C1060 as sm_10. The Tesla 10xx series is of compute capability 1.3, so the first optimization should be to compile for sm_13.
For the values actually achieved, you need to count how many bytes the kernel reads and writes, and how many instructions/operations it performs, and divide those by its runtime.
ptxas info : Compiling entry function '_Z25Kernel_Rocket_halo_updatePfPiiS0_S0_' for 'sm_13'
3> ptxas info : Used 14 registers, 40+16 bytes smem, 8 bytes cmem[1]
3> ptxas info : Compiling entry function '_Z13Kernel_RocketiPiPfS_' for 'sm_13'
3> ptxas info : Used 15 registers, 16032+16 bytes smem, 24 bytes cmem[1]
OK, done that; it didn’t change anything. My question is more: given that I can’t change my register and shared memory usage, what should my next steps be to optimize the kernel? Do I have to tune the number of blocks and the block size to achieve the best possible results?
Sorry for the silly questions, but I am fairly new to CUDA and there are still a lot of things that confuse me.
I think I will definitely be using the GTX 470 card, as it seems to produce much better results.
Yes, benchmark different block sizes (which ideally are multiples of 64) and see which performs best. Since the second kernel can run just one block per SM on the Tesla (and 3 per SM on the Fermi GTX), it will obviously profit from large block sizes.
Sorry, I can’t really give better general advice than the Best Practices Guide has, as Nvidia has done a good job on the documentation. We might be able to give some specific advice here if you post some concrete code.
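A benchmarking harness for different block sizes might look like the sketch below. The kernel body and its argument list are stand-ins (the real Kernel_Rocket takes different arguments), and the dynamic shared-memory launch parameter is only illustrative, since the real kernel appears to use statically sized shared memory:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for the real kernel; the argument list here is hypothetical.
__global__ void Kernel_Rocket(int n, float *u)
{
    extern __shared__ float u_sh[];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) u_sh[threadIdx.x] = u[gid];   // dummy work
}

int main()
{
    const int n = 2000;                        // cells, as in the example
    float *u;
    cudaMalloc(&u, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int sizes[] = {64, 128, 192, 256, 384, 512};
    for (int i = 0; i < 6; ++i) {
        int    bs     = sizes[i];
        int    blocks = (n + bs - 1) / bs;     // round up
        size_t smem   = bs * sizeof(float);    // u_sh bytes per block

        cudaEventRecord(start);
        Kernel_Rocket<<<blocks, bs, smem>>>(n, u);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block size %3d: %.3f ms\n", bs, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(u);
    return 0;
}
```

For stable numbers you would typically launch each configuration many times in a loop and average, since a single launch is dominated by launch overhead.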
I tried various combinations of block sizes and shared memory sizes, but it looks like 512 threads per block is pretty much the optimum that can be achieved.
I am still a bit confused about the occupancy calculator (version 2.4). There are two entries, one for thread blocks per SM (8) and one for active thread blocks per SM (3). Do I have to care about the 8, or is it enough to keep the 3 in mind?
I will see if I can post some specific pieces of code; that would make giving advice easier.
8 thread blocks per SM is just the maximum possible; you don’t need to care about that. Active thread blocks per SM is the more relevant parameter (having more than one block per SM allows useful work to be done during a __syncthreads()), although the most interesting value is active warps (or active threads) per SM.
Would this then mean that a good layout would be 192 threads per block, so 192 * 8 blocks per SM = 1536 active threads per SM, and 1536/32 = 48 active warps per SM? (For a Fermi 2.0 GTX 470.)
Note you only have 3 blocks per SM = 576 threads per SM because of the shared memory usage, giving you 38% occupancy, which is on the very low end of the useful range. As I said earlier, 8 blocks per SM is only the theoretical maximum.
OK, I think I am slowly getting it: the 16032+16 bytes from the output already occupy approximately a third of the 48 KB available per SM, right? That is why only 3 blocks can be resident per SM.
So this actually means I have to get my shared memory usage down. Just to give a rough idea of what the code does:
a domain decomposition tool splits my problem into chunks of almost equal size (e.g. 2000 cells / 512 gives me 4 domains = 4 blocks)
each chunk gets halo cells allocated to establish block-to-block communication
the actual solving just calculates the sum of all cells that surround the current cell and divides it by their number
u_sh is used to store the actual cell values (tx=0…tx=n) and the halo values (tx=n+1…tx=n+m)
the kernel itself uses allocation tables to pick the right values out of shared memory
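Based on that description, a minimal sketch of what such a step might look like; everything except u_sh is an invented name, MAX_NEIGH is an assumed padding width for the allocation table, and the indexing is only illustrative:

```cuda
#define MAX_NEIGH 4   // assumed maximum neighbours per cell (hypothetical)

__global__ void jacobi_step_sketch(float *u, const int *neigh_table,
                                   const int *neigh_count, int n_cells)
{
    extern __shared__ float u_sh[];          // cell values + halo values
    int tx  = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tx;

    if (gid < n_cells)
        u_sh[tx] = u[gid];                   // load own cell (halo loads omitted)
    __syncthreads();

    if (gid < n_cells) {
        int   m   = neigh_count[gid];        // number of surrounding cells
        float sum = 0.0f;
        for (int k = 0; k < m; ++k)          // gather neighbours through the
            sum += u_sh[neigh_table[gid * MAX_NEIGH + k]];  // allocation table
        u[gid] = sum / m;                    // average of the surrounding cells
    }
}
```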
To get my block size down I basically just divide my problem into different-sized chunks. So if I got my shared memory down to around 6 KB, this would give me the theoretical maximum of 8 blocks per SM (8 * 6 KB <= 48 KB) while using 192 threads per block? This would then also mean that in my case more smaller blocks are better than fewer bigger blocks.
The second kernel is just responsible for updating the halo values after each iteration.
Sorry for the more or less stupid questions, but I am trying to finally understand the whole concept.