Occupancy factor

Hello.

I’m working on the occupancy of my kernel. I get a theoretical occupancy of 12.5%, but I think I’m missing a parameter: I don’t use shared memory, and the number of registers is not that high.

What kind of factor could hit the occupancy this hard?

NEW CUDA CODE
ptxas info    : Compiling entry function '_Z17mygetRSS_ITM_SRTMPdP7s_blockS_PvmP11s_host_infoP10s_devparamP13s_device_data' for 'sm_52'
ptxas info    : Function properties for _Z17mygetRSS_ITM_SRTMPdP7s_blockS_PvmP11s_host_infoP10s_devparamP13s_device_data
    8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 64 registers, 384 bytes cmem[0], 892 bytes cmem[2]

=> launch advice <<< 5860, 256 >>>
Number of sm = 24. Max active blocks = 1. Theoretical occupancy: 0.125000

With my Titan X I should get a theoretical occupancy of about 50%. With the old version of my kernel, at 124 registers per thread, I was at 25%. I rewrote all the code and the occupancy dropped, even though the number of registers is far lower.
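The 50% expectation can be checked with a little arithmetic. The sketch below is a simplified estimate only: it assumes the per-SM limits for compute capability 5.2 (65536 registers, 2048 resident threads, 32 resident blocks) and ignores shared memory and register allocation granularity. `registerLimitedOccupancy` is a hypothetical helper name, not a CUDA API.

```cuda
#include <algorithm>

// Per-SM limits assumed for compute capability 5.2 (sm_52).
constexpr int kRegsPerSM       = 65536;  // 32-bit registers per SM
constexpr int kMaxThreadsPerSM = 2048;   // resident threads per SM
constexpr int kMaxBlocksPerSM  = 32;     // resident blocks per SM

// Hypothetical helper: occupancy bounded only by registers, thread
// slots and block slots. Shared memory and allocation granularity
// are deliberately ignored, so this is an upper-bound estimate.
inline double registerLimitedOccupancy(int threadsPerBlock, int regsPerThread) {
    int byRegs    = kRegsPerSM / (threadsPerBlock * regsPerThread);
    int byThreads = kMaxThreadsPerSM / threadsPerBlock;
    int blocks    = std::min({byRegs, byThreads, kMaxBlocksPerSM});
    return static_cast<double>(blocks * threadsPerBlock) / kMaxThreadsPerSM;
}
```

Under these assumptions, `registerLimitedOccupancy(256, 64)` gives 0.5 and `registerLimitedOccupancy(512, 124)` gives 0.25. The reported 0.125 for the new code is exactly one resident block of 256 threads (256 / 2048), so something other than the register count appears to be capping the kernel at one block per SM.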

The big difference is that I now compile each .cu into a .o and then link them into a static library. So I use a .hcu header with the prototypes of some __device__ functions. Before the new code, the source files included each other to produce one huge .cu.

OLD CUDA CODE
ptxas info    : Compiling entry function '_Z15getRSS_ITM_SRTMPdS_S_dd8AreaTypeddiddiiS_P9prop_typeP10propv_typeP10propa_typeS_S_ddddiiiddiiiiiiS_ddiiS_Piddd' for 'sm_52'
ptxas info    : Function properties for _Z15getRSS_ITM_SRTMPdS_S_dd8AreaTypeddiddiiS_P9prop_typeP10propv_typeP10propa_typeS_S_ddddiiiddiiiiiiS_ddiiS_Piddd
    48 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 124 registers, 624 bytes cmem[0], 1632 bytes cmem[2]

=> launch advice <<< 2930, 512 >>>
Number of sm = 24. Max active blocks = 1. Theoretical occupancy: 0.250000

The old code does not perform better than the improved code, but the new code is really different. So I really care about this occupancy: I’d like to reach about 40% to benchmark whether there is a performance gain.

I take it you are not using textures?

Have you tried reducing the block size, just to note the impact?
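One way to try this without rerunning the whole application is to ask the runtime for the number of resident blocks at several block sizes. The following is a minimal sketch, assuming device 0 and no dynamic shared memory; `placeholderKernel` and `occupancyFraction` are hypothetical names, and the real kernel entry point would be substituted for the placeholder.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel (hypothetical); substitute the real entry point.
__global__ void placeholderKernel(double *out) {
    if (out != nullptr) out[blockIdx.x * blockDim.x + threadIdx.x] = 0.0;
}

// Fraction of an SM's thread slots filled by `blocks` resident blocks.
double occupancyFraction(int blocks, int blockSize, int maxThreadsPerSM) {
    return static_cast<double>(blocks) * blockSize / maxThreadsPerSM;
}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::fprintf(stderr, "could not query device 0\n");
        return 1;
    }
    // Sweep a few block sizes and report what the runtime predicts.
    for (int blockSize = 64; blockSize <= 512; blockSize *= 2) {
        int blocks = 0;
        cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks, placeholderKernel, blockSize, 0 /* dynamic smem bytes */);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "%s\n", cudaGetErrorString(err));
            return 1;
        }
        std::printf("block size %3d: %2d blocks/SM, occupancy %.3f\n",
                    blockSize, blocks,
                    occupancyFraction(blocks, blockSize,
                                      prop.maxThreadsPerMultiProcessor));
    }
    return 0;
}
```

If the number of blocks per SM stays at 1 whatever the block size, the limit is probably not the register count, and the build setup (for example the switch to separately compiled objects) would be worth a closer look.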