I’m currently managing to store a 187 by 66 matrix (single precision) on-chip. The goal is actually to do some heavy-duty work on a 240 by 66 matrix, everything on-chip. There is enough space in the register file combined with some shared memory.
I’m using a lot of unrolling to make sure nothing spills over into local memory. When I try to unroll further (187+), the compiler starts getting unhappy:
“Advisory: Loop was not unrolled, too much code expansion”
Has anyone experienced similar issues? Any workarounds?
What I’ve seen others do (especially before CUDA even had loop unrolling) was to write their code in a template engine and tweak the parameters controlling the unrolling of different loops. They then ran automatic tests over all the various sets of parameters to find the fastest kernel. You could use a similar technique to “manually” unroll what you need so you don’t have to trust the compiler.
OK, I figured out a workaround. If I double the number of threads being used, the unrolling depth can be halved, and (at least as far as I can tell now…) this doesn’t add any new inter-block communication problems. This allows me to keep a 240 by 66 matrix on-chip, and I just increased the occupancy.
An easy fix for my problem, but I guess in some apps this would lead to extra reduction steps.