strange loop counter causing incorrect behaviour

I think I’ve narrowed down a problem and isolated what is probably a bug. I could file a bug report, but I thought I would ask here first in case it’s my mistake. It seems that the compiler optimizes my for-loop condition in a way that makes the same condition evaluate incorrectly later in the code. The code is below, but here’s a simpler explanation.

The kernel parses an octree, using an array of loop counters to keep track of the progress at each depth. If at any point a node is -1, it breaks out of the loop early and jumps to the next depth. The current counter value will be saved so when you return to this depth later, it can continue from the same place.

This is the loop:

for ( ; m2[d] < 8; m2[d]++)

After the loop, if the counter is 8, I know I have finished parsing this depth. If the counter is less than 8, I know I must have broken out early, so I jump down a level, and restart from the top.
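The scheme described above can be sketched on the host in plain C. This is a minimal sketch under assumptions, not the kernel from this post: the `Tree` struct, the `count_leaves` function, and the fixed sizes are hypothetical, and the step back up to the parent depth is an assumption about the intent. It shows how a per-depth counter lets the loop resume where it left off after descending.

```c
#define MAX_DEPTH 4
#define CELLS 8

/* Hypothetical tree layout (an assumption, not the pt2 structure):
   count[d][pos] == -1 marks a cell with children at depth d+1,
   start[d][pos] is the position of its first child at depth d+1. */
typedef struct {
    int count[MAX_DEPTH][64];
    int start[MAX_DEPTH][64];
} Tree;

/* Walks the tree and returns the number of leaf cells visited.
   A -1 cell at the deepest level is counted as a leaf. */
int count_leaves(const Tree *t)
{
    int d = 0;            /* current depth */
    int m[MAX_DEPTH];     /* saved loop counter at each depth */
    int base[MAX_DEPTH];  /* position of the first cell at each depth */
    int leaves = 0;

    m[0] = 0;
    base[0] = 0;

    for (;;) {
        for ( ; m[d] < CELLS; m[d]++) {
            int pos = base[d] + m[d];
            if (t->count[d][pos] == -1 && d + 1 < MAX_DEPTH) {
                base[d + 1] = t->start[d][pos];
                break; /* counter keeps the index of this cell */
            }
            leaves++;
        }

        if (m[d] < CELLS) {   /* broke out early: descend */
            m[d]++;           /* this cell is handled by descending */
            d++;
            m[d] = 0;         /* start fresh at the next depth */
        } else {              /* counter reached CELLS: depth finished */
            if (d == 0) break;
            d--;              /* resume the parent at its saved counter */
        }
    }
    return leaves;
}
```

With a tree where only the last top-level cell has children, the walk visits 7 top-level leaves and 8 child leaves. That case depends on the check after the inner loop seeing the counter at 7 as less than 8.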

The problem is this. In the case where the final node (counter = 7) of any depth is -1, once I break out of the loop, the value is EQUAL to 7, but NOT LESS than 8.

The check:

if (m2[d] == 7) printf("equals 7!\n");
if (m2[d] < 8) printf("less than 8!\n");

results in “equals 7!” but does not print “less than 8!”. Because of this, it does not enter either if block, stays at the current depth, and gets stuck in an infinite loop on this node. You can imagine my frustration trying to track this one down.
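For reference, the behaviour the language rules require can be checked on the host. The following is a minimal sketch, not the original kernel: `scan_depth`, `report`, and the `counts` array are hypothetical stand-ins for the loop over `pt2.c_count`. When a for-loop is left via `break`, the increment expression does not run, so the counter keeps the index of the cell that triggered the break and both checks must succeed.

```c
#include <stdio.h>

/* Hypothetical stand-in for one depth of the traversal: scan up to 8 cells
   starting from *counter and stop at the first cell whose count is -1.
   Returns 1 if the loop was left via break, 0 if it ran to completion. */
int scan_depth(const int counts[8], int *counter)
{
    for ( ; *counter < 8; (*counter)++) {
        if (counts[*counter] == -1)
            break; /* increment is skipped, *counter keeps this index */
    }
    return *counter < 8; /* must be re-evaluated here; true after a break */
}

/* The two checks from the post, applied to a given counter value. */
void report(int counter)
{
    if (counter == 7) printf("equals 7!\n");
    if (counter < 8) printf("less than 8!\n");
}
```

With only the last cell set to -1 and the counter starting at 0, `scan_depth` leaves the counter at 7 and returns 1, so `report` prints both messages when compiled correctly.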

As far as I can tell, the optimizer is not accounting for ‘break’. It assumes the loop condition is false immediately after the loop and does not re-evaluate it. Other conditions with other values, like (m2[d] == 7) and (m2[d] < 9), work fine.

Adding printfs in specific places, or changing the order of things, seems to prevent the problem. Defining m2 as volatile also fixes it. I don’t see why I would have to declare it as volatile, since there are no other threads accessing this value. The kernel reproduces this error every time when launched as a single block with a single thread.

	int d = 0; // current depth
	int m2[10]; // current target cell at each depth
	int m2_pos[10]; // target cell position in start/count array

	m2[0] = 0; // start at first cell
	m2_pos[0] = 0;

	for (int j = 0; j < 8; j++) { // verify c_count is working
		printf("%d = %d\n", j, pt2.c_count[0][j]);
	}

	int i = 0;
	for (i = 0; i < 20; i++) { // until search is complete

		printf("depth %d, starting at %d+%d\n", d, m2_pos[d], m2[d]);

		for ( ; m2[d] < 8; m2[d]++) { // for each cell at this depth

			int pos = m2_pos[d] + m2[d];
			int m2_count = pt2.c_count[d][pos];

			// target cell points to next depth
			if (m2_count == -1 && d == 0) {

				m2_pos[d+1] = pt2.c_start[d][m2_pos[d] + m2[d]];

				break; // jump to next depth in this cell
			}
		}

		printf("m2[%d] = %d\n", d, m2[d]);
		if (m2[d] == 7) printf("equals 7!\n");
		if (m2[d] < 8) printf("less than 8!\n");
		//else printf("not less than 8!\n");
		printf("m2[%d] = %d\n", d, m2[d]);

		if (m2[d] < 8) { // work to do at next depth

			printf("jumping down to depth %d\n", d+1);

			m2[d]++; // this cell done
			d++; // jump to next depth
			m2[d] = 0; // start fresh at next depth
		}

		if (m2[d] == 8) { // finished this depth

			if (d == 0) break; // at top level: all done!

			// jump back to previous depth and continue where we left off
		}
	}


From what I can gather from the above snippet, m2 is a thread-local array and not data accessed by multiple threads, which rules out race conditions. I agree this looks like a bug caused by optimization, as I don’t spot anything in the code itself that jumps out as invoking undefined behavior.

One additional quick experiment you could do is turn off PTXAS optimizations by passing -Xptxas -O0. If this fixes the problem, try raising the level one step at a time (the highest and default level is -O3) to see at what optimization level the problem appears.

The observations that use of volatile or adding printf() make the problem go away are consistent with the hypothesis of an optimization bug. Use of volatile prevents the caching of data in a register, and use of printf effectively creates code motion barriers, both reducing the amount of optimization that can be applied.
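The volatile workaround can be sketched as follows. This is a minimal host-side sketch under assumptions: `broke_out_early` and `counts` are hypothetical names, and host code cannot reproduce the PTXAS issue. It only shows where the qualifier goes: with the counter array declared volatile, every read of `m2[d]`, including the one after the loop, has to be an actual load rather than a value the optimizer assumes.

```c
/* Same loop shape as in the post, with the counter array declared volatile.
   Returns 1 if the loop was left via break, 0 if it ran to completion. */
int broke_out_early(const int counts[8])
{
    volatile int m2[10]; /* volatile: reads may not be cached or assumed */
    int d = 0;

    m2[d] = 0;
    for ( ; m2[d] < 8; m2[d]++) {
        if (counts[m2[d]] == -1)
            break;
    }
    return m2[d] < 8; /* re-read from memory after the loop */
}
```

The cost is that the counters can no longer live in registers, so this is a stopgap rather than something to keep once a fixed compiler is available.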

If this happens with CUDA 5.0, please file a bug with self-contained (compilable and runnable) repro code via the bug-reporting form linked from the registered developer website. If this happens with an earlier CUDA version, I would suggest upgrading to CUDA 5.0.

Thanks. It is reproducible in its own small .cu file, with this compile command:
nvcc -o compilerbug -arch=sm_30

This happens on both 32-bit Windows and 64-bit Linux, so it must be an NVCC thing.

I’ll submit a bug.

The problem occurs with -O2 but not with -O1.

Thanks for filing the bug. Based on your experiments, it seems like a bug in the PTXAS optimizer. As a workaround that lets you make forward progress, I would suggest continued use of -Xptxas -O1 for now. The compiler team may have a suggestion for a better workaround once they have taken a look at this.

FYI, I heard back from them. The bug is reproducible in CUDA 5 but has been fixed in their current development version of the compiler, so it will be fixed in the next release. I was able to recreate other types of incorrect behavior caused by this bug, so be aware that there is probably some sort of issue with local memory code optimizations that might affect anything that uses local memory. Since any variable can spill over into local memory, be careful with any conditionals.