I’m developing a GRU kernel that performs 6 GEMV computations. With all 6 GEMVs enabled, the kernel uses 76 registers (0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads) and takes 8.5 ms per launch.
Then I commented out 3 of the GEMV function calls. The kernel now uses 168 registers (5112 bytes stack frame, 5092 bytes spill stores, 5086 bytes spill loads) and takes 11.6 ms per launch.
This is unacceptable to me. I was trying to reduce the amount of computation and the number of loads, but performance got worse instead. All I did was delete 3 function calls.
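For anyone who wants to compare the two builds, here is a minimal sketch of reading the same kind of numbers at runtime with `cudaFuncGetAttributes`. This is not my real kernel: `gru_like_kernel`, `gemv_row`, `query_kernel_resources` and the block size of 256 are placeholders I made up for illustration. As far as I understand, the local memory figure it reports should include any stack frame and spill space, and `__launch_bounds__` is only a hint that may or may not change what the compiler does in a case like mine.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for one GEMV: dot product of one matrix row with x.
__device__ float gemv_row(const float* A, const float* x, int cols, int row)
{
    float acc = 0.0f;
    for (int c = 0; c < cols; ++c)
        acc += A[row * cols + c] * x[c];
    return acc;
}

// Hypothetical kernel, one thread per output row.
// __launch_bounds__(256) promises the compiler that a block never has
// more than 256 threads (an assumed value, not taken from the real kernel).
__global__ void __launch_bounds__(256)
gru_like_kernel(const float* A, const float* x, float* y, int rows, int cols)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rows)
        y[row] = gemv_row(A, x, cols, row);
}

// Fills *attr with the compiled kernel's resource usage.
// Returns the CUDA error code so the caller can handle a missing device.
cudaError_t query_kernel_resources(cudaFuncAttributes* attr)
{
    return cudaFuncGetAttributes(attr, gru_like_kernel);
}

// Prints registers per thread and local memory per thread for the kernel.
void print_kernel_resources()
{
    cudaFuncAttributes attr;
    cudaError_t err = query_kernel_resources(&attr);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return;
    }
    printf("registers per thread:    %d\n", attr.numRegs);
    printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
}
```

Calling `print_kernel_resources()` from `main` in each build (6 GEMVs vs. 3 GEMVs) gives a quick side-by-side check of register count and local memory without rereading the compiler log.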