Yeah, try to use just a bitwise AND. Pad your grid's y dimension up to a power of two. Launching a padding block and immediately returning from each of its threads should add little overhead.
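Here's a rough sketch of what I mean. It assumes you're packing a logical (y, z) block coordinate into `blockIdx.y`, which is my guess at your setup. All the names (`nextPow2`, `log2OfPow2`, `unpackYZ`, `countVisits`, `launchCountVisits`) are made up for illustration.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Smallest power of two >= v (v must be >= 1).
__host__ __device__ inline unsigned nextPow2(unsigned v)
{
    unsigned p = 1;
    while (p < v) p <<= 1;
    return p;
}

// log2 of a value that is already a power of two.
__host__ __device__ inline unsigned log2OfPow2(unsigned p)
{
    unsigned s = 0;
    while ((1u << s) < p) ++s;
    return s;
}

struct BlockYZ { unsigned y, z; };

// Split a packed blockIdx.y into (y, z) using only a mask and a shift,
// which works because the padded y extent is 1 << yShift.
__host__ __device__ inline BlockYZ unpackYZ(unsigned packed, unsigned yShift)
{
    BlockYZ b;
    b.y = packed & ((1u << yShift) - 1u);
    b.z = packed >> yShift;
    return b;
}

// Hypothetical kernel: marks each logical block (x, y, z) as visited.
__global__ void countVisits(int* visits, unsigned ny, unsigned yShift)
{
    BlockYZ b = unpackYZ(blockIdx.y, yShift);
    if (b.y >= ny) return;  // padding block: every thread exits right away

    if (threadIdx.x == 0) {
        unsigned idx = (b.z * ny + b.y) * gridDim.x + blockIdx.x;
        visits[idx] = 1;
    }
}

// Host-side launch: pad ny up to a power of two, then launch nyPad * nz rows.
// dVisits must hold nx * ny * nz ints; nyPad * nz must fit in gridDim.y.
cudaError_t launchCountVisits(int* dVisits, unsigned nx, unsigned ny, unsigned nz)
{
    assert(nx >= 1 && ny >= 1 && nz >= 1);
    unsigned nyPad  = nextPow2(ny);
    unsigned yShift = log2OfPow2(nyPad);
    dim3 grid(nx, nyPad * nz);
    countVisits<<<grid, 64>>>(dVisits, ny, yShift);
    return cudaGetLastError();
}
```

The whole padding block exits together, so the early return is safe even if the kernel later calls `__syncthreads()`. With `ny = 5` you launch 8 rows per z slice and 3 of them return immediately.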
EDIT: Obviously, this makes sense if you're going to be recalculating your dimensions frequently. You might do that to save a few registers, or because you have many device functions and don't want to pass parameters around needlessly. Then again, you can still just store the block's coordinates in shared memory (smem) once and then add the threadIdx offset on the spot whenever you need it. Yeah, that'd be best.
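Something like this is what I have in mind for the smem version. It's only a sketch: it assumes the same packed (y, z) layout in `blockIdx.y` with a power-of-two padded y extent, and `sBlockY`, `sBlockZ`, `globalCoord`, `globalY` and `fillRows` are all hypothetical names.

```cuda
#include <cuda_runtime.h>

// Per-block coordinates, written once per block and readable from any
// device function without passing them as parameters.
__shared__ unsigned sBlockY;
__shared__ unsigned sBlockZ;

// Global coordinate = block coordinate * block size + thread offset.
__host__ __device__ inline unsigned globalCoord(unsigned blockCoord,
                                                unsigned blockSize,
                                                unsigned threadCoord)
{
    return blockCoord * blockSize + threadCoord;
}

// Device helper that offsets the stored block coordinate by threadIdx on the spot.
__device__ unsigned globalY()
{
    return globalCoord(sBlockY, blockDim.y, threadIdx.y);
}

// Hypothetical kernel: writes each element's global y into rowOf.
// rowOf must hold (z slices) * (ny * blockDim.y) * width entries.
__global__ void fillRows(unsigned* rowOf, unsigned ny, unsigned yShift,
                         unsigned width)
{
    unsigned by = blockIdx.y & ((1u << yShift) - 1u);
    if (by >= ny) return;  // padding block: uniform exit, before the barrier

    if (threadIdx.x == 0 && threadIdx.y == 0) {
        sBlockY = by;                     // decode once per block
        sBlockZ = blockIdx.y >> yShift;
    }
    __syncthreads();                      // make the stored values visible

    unsigned x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x >= width) return;               // no barrier after this point

    unsigned y      = globalY();
    unsigned height = ny * blockDim.y;
    rowOf[(sBlockZ * height + y) * width + x] = y;
}
```

The decode happens once per block, and each thread pays one shared-memory read plus a multiply-add when it needs a coordinate, instead of holding it in a register for the whole kernel.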