I’m working on the occupancy of my kernel. I got a theoretical occupancy of 12.5%, but I think I’m missing one parameter. Indeed, I don’t use shared memory, and the register count is not that high.
What kind of factor could hurt occupancy this badly?
NEW CUDA CODE

```
ptxas info : Compiling entry function '_Z17mygetRSS_ITM_SRTMPdP7s_blockS_PvmP11s_host_infoP10s_devparamP13s_device_data' for 'sm_52'
ptxas info : Function properties for _Z17mygetRSS_ITM_SRTMPdP7s_blockS_PvmP11s_host_infoP10s_devparamP13s_device_data
    8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 64 registers, 384 bytes cmem, 892 bytes cmem
=> launch advice <<< 5860, 256 >>>
Number of sm = 24. Max active blocks = 1. Theoretical occupancy: 0.125000
```
With my Titan X I should get a theoretical occupancy of about 50%. With the old version of my kernel, at 125 registers per thread, I was at 25%. I rewrote all the code, and the occupancy dropped even though the register count is far lower.
The big difference is that I now compile each .cu into a .o and then link them into a static library. So I use a .hcu header with prototypes of the __device__ functions. Before the new code, the source files included each other to produce one huge .cu.
OLD CUDA CODE

```
ptxas info : Compiling entry function '_Z15getRSS_ITM_SRTMPdS_S_dd8AreaTypeddiddiiS_P9prop_typeP10propv_typeP10propa_typeS_S_ddddiiiddiiiiiiS_ddiiS_Piddd' for 'sm_52'
ptxas info : Function properties for _Z15getRSS_ITM_SRTMPdS_S_dd8AreaTypeddiddiiS_P9prop_typeP10propv_typeP10propa_typeS_S_ddddiiiddiiiiiiS_ddiiS_Piddd
    48 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 124 registers, 624 bytes cmem, 1632 bytes cmem
=> launch advice <<< 2930, 512 >>>
Number of sm = 24. Max active blocks = 1. Theoretical occupancy: 0.250000
```
The old code does not perform better than the improved code, but the new code is really different. So I really care about this occupancy: I’d like to hit around 40% to benchmark whether there is a performance gain.