cuModuleLoadData returns CUDA_ERROR_INVALID_IMAGE if I use rem instruction?

With the code below, when I try to load the resulting module (compiled with ptxas -v -arch sm_30), cuModuleLoadData returns CUDA_ERROR_INVALID_IMAGE. If I comment out the rem instruction, it loads just fine. I'm running CUDA 7.5 on a GTX 690 under Windows 10.

.version 3.2
.target sm_30
.address_size 64

.entry renderKernel(
    .param.u64 _bufferAddr,
    .param.s32 _bufferWidth
)
{
    .reg.s64 bufferAddr;
    .reg.u32 bufferWidth, t0;

    ld.param.u64         bufferAddr, [_bufferAddr];
    cvta.to.global.u64   bufferAddr, bufferAddr;
    ld.param.u32         bufferWidth, [_bufferWidth];

    mov.u32              t0, %ntid.x;
    mad.wide.u32         bufferAddr, t0, 4, bufferAddr;

    rem.u32              t0, t0, bufferWidth; // comment this line out to make this work
    st.global.u32        [bufferAddr], t0;
}



I think it may be an architecture mismatch. Try compiling it in the most generic way (i.e. for all architectures plus PTX); if that works, try reducing the arch list.
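One quick way to rule an architecture mismatch in or out is to compare the `.target` directive in the PTX with the compute capability of the device. The sketch below is only an illustration: the helper names `ptx_target` and `ptx_can_jit_on` are made up for this example, and the GTX 690 value of (3, 0) is hard-coded as an assumption rather than queried from the driver (in a real program it could come from `cuDeviceGetAttribute`).

```python
import re


def ptx_target(ptx_text):
    """Return the (major, minor) compute capability named by the .target directive."""
    match = re.search(r"^\s*\.target\s+sm_(\d+)", ptx_text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no '.target sm_XY' directive found")
    number = int(match.group(1))
    # sm_30 -> (3, 0), sm_52 -> (5, 2)
    return number // 10, number % 10


def ptx_can_jit_on(ptx_text, device_cc):
    """PTX can be JIT-compiled for a device whose compute capability is >= the target."""
    return device_cc >= ptx_target(ptx_text)


# Header copied from the question.
HEADER = """.version 3.2
.target sm_30
.address_size 64
"""

# Assumption: the GTX 690 (GK104) reports compute capability 3.0.
GTX_690_CC = (3, 0)

print(ptx_target(HEADER))
print(ptx_can_jit_on(HEADER, GTX_690_CC))
```

This check only covers PTX text handed to the driver for JIT compilation. A cubin produced by ptxas is stricter: it generally loads only on devices with the same major compute capability and an equal or higher minor revision.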