How to optimize a kernel based on an ncu report?

I watched this… But I only know how to use ncu, and I still do not know what I can do for … I do not even know what the problem I found using ncu is… for the supposed problem… Maybe change the algorithm? Add some hardware properties?

Hi, @202476410arsmart may help you. There is also a folder extras/samples under the Nsight Compute installation path; you can refer to the samples there as well.

Emmm… Good one. Thank you!!! But these samples seem quite simple… Like shared memory bank conflicts, uncoalesced reads, directly changing FP64 to FP32…

Do you have other, more advanced optimization examples? Maybe some books? If they also use ncu to profile, that would be best.

By the way, generally speaking, if we are memory-bound, maybe we can try caching, for example in shared memory. But what if we are compute-bound? Maybe … the most common approach is tuning the tile size? Or… do we have to change the algorithm itself?
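To make the tile-size question a bit more concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a square n x n FP32 matrix multiply where each thread block stages tile x tile sub-matrices in shared memory, and it only counts global-memory traffic (caches are ignored). The function name `tiled_matmul_intensity` and the traffic model are illustrative assumptions, not something ncu reports directly.

```python
def tiled_matmul_intensity(n: int, tile: int, bytes_per_elem: int = 4) -> float:
    """Estimate FLOPs per byte of global-memory traffic for an n x n matmul.

    Simplified model (an assumption, not a measurement): each tile x tile
    output block reads n/tile tiles of A and n/tile tiles of B from global
    memory, and every output element is written once.
    """
    if tile <= 0 or n % tile != 0:
        raise ValueError("n must be a positive multiple of tile")
    flops = 2 * n**3                  # one multiply and one add per inner step
    loaded_elems = 2 * n**3 // tile   # A and B elements read from global memory
    stored_elems = n * n              # C elements written back
    return flops / ((loaded_elems + stored_elems) * bytes_per_elem)


if __name__ == "__main__":
    # tile=1 models the naive kernel with no shared-memory reuse.
    for tile in (1, 8, 16, 32):
        print(f"tile={tile:2d}: {tiled_matmul_intensity(1024, tile):.2f} FLOP/byte")
```

Under this model a larger tile means more reuse per byte loaded, so arithmetic intensity grows roughly in proportion to the tile size. That is one way tuning the tile size can move a kernel from memory-bound toward compute-bound; once it is already compute-bound, a bigger tile no longer helps in this simple picture.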

This is another video about roofline analysis that you can refer to.
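As a minimal sketch of the roofline idea: attainable throughput is the smaller of the compute peak and (arithmetic intensity x memory bandwidth), and the crossover point between the two regimes is peak / bandwidth. The helper `classify_roofline` and the hardware numbers below are hypothetical placeholders, not figures for any real GPU; substitute the peaks for your own device.

```python
def classify_roofline(intensity: float, peak_flops: float, bandwidth: float):
    """Classify a kernel with the basic roofline model.

    intensity  : FLOPs per byte of memory traffic
    peak_flops : peak compute throughput in FLOP/s
    bandwidth  : peak memory bandwidth in bytes/s
    Returns (regime, attainable FLOP/s).
    """
    ridge = peak_flops / bandwidth  # intensity at which the two roofs meet
    attainable = min(peak_flops, intensity * bandwidth)
    regime = "memory-bound" if intensity < ridge else "compute-bound"
    return regime, attainable


if __name__ == "__main__":
    # Hypothetical device: 10 TFLOP/s peak, 500 GB/s bandwidth (placeholder values).
    PEAK, BW = 10e12, 500e9
    for ai in (0.25, 4.0, 40.0):
        regime, flops = classify_roofline(ai, PEAK, BW)
        print(f"intensity={ai:5.2f} FLOP/byte -> {regime}, roof={flops:.3g} FLOP/s")
```

A kernel to the left of the ridge point is limited by memory traffic, so reducing bytes moved (coalescing, shared-memory reuse, lower precision) is the usual lever. A kernel to the right is limited by compute, where the options are more about instruction mix, occupancy, or a different algorithm.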

Sorry, we don’t have any books to recommend right now. Please refer to the Nsight Compute documentation or search for videos online.