I know that the j < 27 check is dumb, because the loop already guarantees this condition. But when I get rid of it, my kernel crashes (too many resources used), whereas it works perfectly when I add j < 27.
If anyone has any idea what causes this, I would be happy to know :)
What’s the size of your blocks? Removing that code probably makes the compiler use an extra register or two per thread. Since your block is so big that it barely fits, that extra register pushes it over the limit.
Use --ptxas-options=-v to monitor register usage and -maxrregcount=N to cap it.
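
To make the "barely fits" point concrete, here is a rough host-side sketch of the register budget arithmetic. Everything in it is an assumption for illustration: the register file size of 8192 per multiprocessor (the figure for compute capability 1.0/1.1 hardware), the block size of 512 threads, and the helper names. It also ignores the hardware's allocation granularity, so treat the result as an estimate rather than the exact launch rule.

```cpp
// Assumed register file size: 8192 32-bit registers per multiprocessor
// (compute capability 1.0/1.1). Check your own device's value.
constexpr int kRegsPerSM = 8192;

// Registers one block needs, ignoring allocation granularity (simplification).
constexpr int regsPerBlock(int threadsPerBlock, int regsPerThread) {
    return threadsPerBlock * regsPerThread;
}

// True if a single block fits in the register file of one multiprocessor.
constexpr bool blockFits(int threadsPerBlock, int regsPerThread,
                         int regsPerSM = kRegsPerSM) {
    return regsPerBlock(threadsPerBlock, regsPerThread) <= regsPerSM;
}

// Largest per-thread register count that still lets one block launch.
constexpr int maxRegsPerThread(int threadsPerBlock, int regsPerSM = kRegsPerSM) {
    return regsPerSM / threadsPerBlock;
}

// Hypothetical 512-thread block: 16 registers per thread fills the file exactly...
static_assert(blockFits(512, 16), "512 * 16 = 8192 registers fits");
// ...and one more register per thread no longer fits.
static_assert(!blockFits(512, 17), "512 * 17 = 8704 registers does not fit");
```

Under these assumptions, maxRegsPerThread(512) gives 16, which is the kind of value you could try as N in -maxrregcount=N, or you can shrink the block instead.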