Odd code exits kernel

So I have some code I’m writing for a project at work. A kernel I have writes to a piece of memory based on the current thread index (sort of). Written one way, the kernel exits immediately. Also note that in each of the two examples, only a single block is executing these instructions.

Here is a basic example of code that will abort:

int count = threadIdx.x;

while (currentpos < end)
{
    *currentpos = (((count % 3) == 2) * 15) | (((count % 97) == 96) * 27);
    currentpos += blockDim.x;
    count += blockDim.x;
}


Here is the same code, but modified so that it actually completes:

int count = threadIdx.x;

while (currentpos < end)
{
    *currentpos = (((threadIdx.x % 3) == 2) * 15) | (((threadIdx.x % 97) == 96) * 27);
    currentpos += blockDim.x;
    count += blockDim.x;
}


I realize that the reason for doing this sort of thing may not be obvious, but trust that it is exactly what I want to do. I’m honestly not sure where to go from here in terms of debugging…it seems like code that should function properly. It does, in fact, function properly in the emulation builds. Anyone have any ideas?

You didn’t specify exactly what kind of erroneous behavior you’re seeing. Also, if something works in emulation but not on the GPU, 9 times out of 10 it’s a race condition somewhere that’s the cause. Is there any more code you can publish? If you don’t feel comfortable doing that, you’re welcome to PM me with more details.

The usual way, especially with new compilers and when like here the code should be equivalent: compare the generated code (at least ptx, possibly the native code with decuda).

Also try making “count” unsigned, unsigned short, or unsigned char (personally I prefer the C99 [u]int8_t, [u]int16_t etc. types defined in stdint.h).

Due to the issues with rounding toward zero, using unsigned is probably a good idea anyway for performance reasons: it makes it easier for the compiler to replace the division by a constant with some shifts/adds and similar tricks.

^^^ Analyzing the assembly language shouldn’t be your first choice. Also, I think the bicycle shed should be blue since that’s a calming color.

Your assignment to currentpos in the first example makes use of count, which is changed within the loop. In the second example, the assignment makes use of threadIdx.x, which does not change.

Well, it sure isn’t. But when you feel stuck and the problematic code is small, even when the compiler is not messing up, it lets you examine the code without all the wrong assumptions that come from knowing what it is supposed to be doing or what a variable is supposed to contain.

Not to mention that insight into the low-level code might help explain the difference between emulation and native execution, which supposedly exists and which the bug you found cannot explain :tongue:

Ok, so an interesting find! Following Reimar’s suggestion, I made count unsigned…and everything worked! How odd. I then made count signed again, but added the lines below right after the count increment:

if(count < 0)

    count *= -1;

This was to see if CUDA has a problem taking the modulus of variables with negative values, but this code also failed.

If count is a char, it will fail. If it is an unsigned char, it passes. So it definitely seems to be a problem with signed variables.

A couple of other interesting observations. If I assign either half of the statement (either side of the ‘or’) to *currentpos on its own, it passes. Only the two of them together cause a failure. Also, if I assign one side of the ‘or’ to a temporary variable and then assign the other half or’d with that variable to *currentpos, it fails. I can see the possibility that compiler optimizations are just converting this back into the original form, though.

Either way, it’s a weird bug…something is not working as it should. The solution is to use an unsigned variable in the statement. Thank you all for your help.

Errr, that seems odd as

int idyy = (blockIdx.x*blockDim.x+threadIdx.x)%nColumns;

has always worked for me, where nColumns is just an int.

I wonder what exactly you were trying to test there :biggrin:

Just in case you are not aware, even in mathematics a % b != (-a) % b.

For C, it is (-a) % b == -(a % b), which is not in line with the mathematical convention.

Is nColumns a function argument or something or a constant? % with a constant might result in completely different code that does not use any division instruction…

Most useful would probably be a complete but minimal testcase, i.e. the minimum amount of code that still compiles and where the results are wrong, because the examples so far are incomplete and even wrong…

nColumns is passed into the kernel, which might be why it’s different. But idyy is an int, not unsigned or const, which is the problem we were having.

No, it is not. The problem is probably calculating modulus a constant.

Seriously, just try it and compile these with nvcc --ptx

__global__ void a(int *x) {
    *x = threadIdx.x % 3;
}

__global__ void b(int *x) {
    int id = threadIdx.x;
    *x = id % 3;
}

__global__ void c(int *x, int mod) {
    *x = threadIdx.x % mod;
}
Each will create different code, the last one completely different and the others slightly different. E.g. the code for (b) requires an additional condition: that is the usual problem you get when using signed numbers with division/modulus, so you should never use them when performance matters.

The code for (b) does look suspicious due to using .s32 instead of .u32 despite using exactly the same algorithm as (a) (it just stores the sign in a predicate register and then takes the absolute value as its first steps), but I did not test it (and I am using CUDA 2.0 beta 2).

It seems relatively obvious to me that there is some problem with either the compilation or the execution of the code that I had. I was just trying to narrow the problem down, to see if it was an issue with taking the modulus of negative numbers. I realize that the result is different from the original code, but in my test case the actual resulting values are irrelevant; I only care whether the kernel actually completes execution.

I have actually run into several problems since this issue was “resolved” (bypassed), all related to adding an additional function parameter. If the parameter is used as the maximum value of a for loop (for (int i = 0; i < variable; i++)), the kernel will again fail to execute properly. If I instead use a constant, it runs as expected. There are definitely some odd things happening, but I have yet to narrow them down enough to actually be helpful to anyone.