A strange bug with threadIdx.x

You’ve been tripped up by unsigned arithmetic.

threadIdx.x is an unsigned quantity (threadIdx is of type uint3, so each component is an unsigned int).

You may wish to study what happens when you do something like:

threadIdx.x - 90

or

unsigned val = 0;
long long res = val - 90;

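Both subtractions are carried out in unsigned arithmetic, so the quantity res will not contain -90! The result wraps around before it is ever assigned to the wider signed type; with a 32-bit unsigned, res ends up as 4294967206 (that is, 2^32 - 90). (Try it in plain host code.)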
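Here is a minimal host-side sketch of that experiment. The function name unsigned_minus_90 is made up for illustration; the body is the two-line snippet above. The expected value is written in terms of the largest unsigned value, so it does not assume a particular width.

```cpp
#include <limits>

// Hypothetical helper wrapping the snippet from the post.
// `val - 90` is evaluated as unsigned; the wrapped result is then
// converted to long long, which preserves the large positive value.
constexpr long long unsigned_minus_90(unsigned val) {
    long long res = val - 90;
    return res;
}

// The wrapped value for val == 0: (max unsigned + 1) - 90.
constexpr long long wrapped =
    static_cast<long long>(std::numeric_limits<unsigned>::max()) - 89;

static_assert(unsigned_minus_90(0u) == wrapped, "wraps instead of going negative");
static_assert(unsigned_minus_90(0u) != -90, "res is not -90");
```

If this compiles, both static_assert checks have passed: widening to long long happens too late to rescue the sign.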

If you want signed arithmetic, you could first convert threadIdx.x to a signed quantity, e.g.

((int)threadIdx.x) - 90
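The same cast can be tried in host code. In this sketch the parameter idx stands in for threadIdx.x, and the function name offset_from_index is hypothetical; it assumes the index fits in an int, which holds for any valid threadIdx.x.

```cpp
// Hypothetical stand-in for ((int)threadIdx.x) - 90.
// Converting to int first makes the subtraction signed,
// so small indices produce negative results as intended.
constexpr int offset_from_index(unsigned idx) {
    return static_cast<int>(idx) - 90;
}

static_assert(offset_from_index(0u) == -90, "signed result for index 0");
static_assert(offset_from_index(100u) == 10, "positive result once idx exceeds 90");
```

The important detail is that the conversion is applied to the index before the subtraction, not to the result afterwards.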

Of course, threadIdx.x is not available in host code, but you can reproduce your observation there by using an unsigned variable, setting it to zero, and then trying your code. So this is a consequence of C++ behavior, not anything unique or specific to CUDA, and it is not a compiler bug.

Aside: When using this kind of kernel-launch coding pattern, I usually encourage people to put a cudaDeviceSynchronize() after the kernel call. It’s not related to your question, and I realize it may appear to be unnecessary on some platforms.