NVRM: Xid (0084:00) kernel does not terminate

I prefer to take a working, CUDA-unaware program and convert it to CUDA code with a Perl/bash script. This ensures the absence of bugs. Very often, though, a compiler-ran-out-of-registers bug stops me (setting Olimit appears to have no effect). I failed several times before producing a variant that compiles and fits entirely in registers. Even so, the code is not optimal: if the compiler worked properly, I could make it better.