NVCC performance bug

I’m developing a GRU kernel that performs 6 GEMV computations. With all 6 GEMV calls in place, the kernel uses 76 registers (0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads) and takes 8.5 ms per launch.

Then I commented out 3 of the GEMV function calls. Now the kernel uses 168 registers (5112 bytes stack frame, 5092 bytes spill stores, 5086 bytes spill loads) and takes 11.6 ms per launch.

This is unacceptable to me. I reduced the amount of computation and the number of loads, yet performance got worse. All I did was delete 3 function calls.

Update: the root cause is unbounded implicit loop unrolling by the compiler. After explicitly writing #pragma unroll 16, register usage and run time are back to normal.
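For anyone hitting the same thing, here is a minimal sketch of the kind of fix described above. It is not the original GRU kernel: the names `gemv_row`, `gemv_kernel`, `max_abs_error_vs_cpu` and the sizes `ROWS` and `COLS` are made up for illustration, and it assumes the GEMV inner loop has a compile-time trip count, which is the situation where the compiler may choose to unroll very aggressively on its own.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical sizes; the original post does not give the GRU dimensions.
constexpr int ROWS = 256;
constexpr int COLS = 512;

// One thread computes one output element of y = W * x.
// COLS is a compile-time constant, so the compiler may unroll this loop
// heavily by itself. The pragma caps the unroll factor at 16.
__device__ float gemv_row(const float* __restrict__ W,
                          const float* __restrict__ x, int row)
{
    float acc = 0.0f;
#pragma unroll 16
    for (int k = 0; k < COLS; ++k) {
        acc += W[row * COLS + k] * x[k];
    }
    return acc;
}

__global__ void gemv_kernel(const float* W, const float* x, float* y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < ROWS) {
        y[row] = gemv_row(W, x, row);
    }
}

// Runs the kernel once and returns the largest absolute difference
// between the GPU result and a plain CPU reference.
float max_abs_error_vs_cpu()
{
    std::vector<float> hW(ROWS * COLS), hx(COLS), hy(ROWS);
    for (int i = 0; i < ROWS * COLS; ++i) hW[i] = 0.001f * (i % 7);
    for (int k = 0; k < COLS; ++k) hx[k] = 0.01f * (k % 5);

    float *dW = nullptr, *dx = nullptr, *dy = nullptr;
    cudaMalloc(&dW, hW.size() * sizeof(float));
    cudaMalloc(&dx, hx.size() * sizeof(float));
    cudaMalloc(&dy, hy.size() * sizeof(float));
    cudaMemcpy(dW, hW.data(), hW.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, hx.data(), hx.size() * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 128;
    gemv_kernel<<<(ROWS + threads - 1) / threads, threads>>>(dW, dx, dy);
    cudaMemcpy(hy.data(), dy, hy.size() * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dW);
    cudaFree(dx);
    cudaFree(dy);

    float max_err = 0.0f;
    for (int r = 0; r < ROWS; ++r) {
        float ref = 0.0f;
        for (int k = 0; k < COLS; ++k) ref += hW[r * COLS + k] * hx[k];
        max_err = std::fmax(max_err, std::fabs(hy[r] - ref));
    }
    return max_err;
}

int main()
{
    std::printf("max abs error vs CPU: %g\n", max_abs_error_vs_cpu());
    return 0;
}
```

To compare register usage with and without the pragma, compile with `nvcc -Xptxas -v`, which prints the per-kernel register count, stack frame size, and spill stores/loads in the same form as the numbers quoted above. Whether 16 is the best factor probably depends on the kernel, so it is worth trying a few values.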
