Incomprehensible register usage, once again

I’ve got a kernel here that by itself uses 14 registers, 48+48 bytes smem and 8 bytes cmem.

However, this is only the case if I let it do all the calculations it should do without writing the result back to device memory. If I uncomment this single line

output[pixelIdx] = color;

I get a usage of 40 registers, 64+0 bytes lmem, 48+48 bytes smem and 52 bytes cmem. The variable pixelIdx doesn’t seem to have any influence on this, since the compilation result is the same when I use 0 as the index.

output is of type uchar3 btw, as is color.

To give a little more of a picture, I’m roughly doing this:

for (dynamic loopcount)
{
	if (...)
	{
		write something to shared mem
	}
	__syncthreads();

	for (static loopcount)
	{
		for (dynamic loopcount)
		{
			read something from shared memory
			if (...)
			{
				if (...)
				{
					if (...)
					{
						uchar3 color = ...
						output[pixelIdx] = color;
					}
				}
			}
		}
	}
}

Does anyone have a hint here how this happens, and how to fix this? Maybe somebody encountered a similar problem or whatever. I’m really lost here.

I already played around with forbidding unrolling of the loops etc., as well as with making the variables I use constant, volatile etc. … with no results =/

This is a sign that the compiler is doing its job and optimizing the generated code.

If you remove that one source code line, many of the code’s previous computations are unused. The compiler strips them out, leaving you with a quite simplified kernel that’s nothing at all like your original one.

The quoted line is the only reference to the output in the whole kernel, and as I stated the index I write to does not make any difference.

Ah, I see now that I did not post this, but I’m doing other things in the innermost part of the code, which are not written to shared or global memory though. Would the compiler go so far as to throw all this out if it’s, so to say, only self-… well, if it’s only used for more calculations on the previous results, which are not needed either? Sorry, I hope you’re getting what I mean :)

All in all, is what you’re saying that the code does actually need all those registers plus the local memory? I’ll go over that with my maths again, but I thought it should fit into the registers easily …

EDIT: All right, I’ll have to slim down that kernel … After removing some of the stuff, it gets down to 38 registers and no local memory, so it seems the code I’ve created is too heavy in register usage. Thanks for pointing me at it, I had completely lost sight of that.

Yes indeed, the compiler is smart enough to do this in many cases. If you comment out your output line, it doesn’t need the computed color variable, so that goes away. But the compiler also optimizes recursively. If color isn’t used, maybe the variables that color depended on are now also unused. If so, the compiler can erase those too. Repeat this with various other rules like “this for loop now has an empty body, so you can delete it entirely”, and basically all the unused stuff gets removed from the program to simplify it as much as possible without changing its output or behavior.
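If you want to see this for yourself, here is a minimal sketch, not your kernel. The names shade, withStore and withoutStore and the maths inside shade are made up purely for illustration. The two kernels differ only in the final store, and the host code asks the runtime how many registers and how much local memory each one was compiled to use. Compiling with nvcc -Xptxas -v should report the same kind of numbers at build time. The exact figures depend on your GPU architecture and compiler version, so I won’t guess at them here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the "real" per-pixel work (illustration only).
__device__ uchar3 shade(int i, int iterations)
{
    float r = 0.0f, g = 0.0f, b = 0.0f;
    for (int k = 0; k < iterations; ++k) {
        r += __sinf(i * 0.01f + k);
        g += __cosf(i * 0.02f + k);
        b += r * g;
    }
    // Clamp to [0, 1] before converting so the cast is always in range.
    return make_uchar3((unsigned char)(fminf(fabsf(r), 1.0f) * 255.0f),
                       (unsigned char)(fminf(fabsf(g), 1.0f) * 255.0f),
                       (unsigned char)(fminf(fabsf(b), 1.0f) * 255.0f));
}

// The store is the only observable effect, so everything feeding it must stay.
__global__ void withStore(uchar3 *output, int n, int iterations)
{
    int pixelIdx = blockIdx.x * blockDim.x + threadIdx.x;
    if (pixelIdx < n) {
        uchar3 color = shade(pixelIdx, iterations);
        output[pixelIdx] = color;
    }
}

// Same computation, but the result is never written anywhere. The compiler
// is free to drop color, then whatever color depended on, and so on.
__global__ void withoutStore(uchar3 *output, int n, int iterations)
{
    int pixelIdx = blockIdx.x * blockDim.x + threadIdx.x;
    if (pixelIdx < n) {
        uchar3 color = shade(pixelIdx, iterations);
        (void)color;  // output[pixelIdx] = color;  <- commented out
    }
}

int main()
{
    cudaFuncAttributes a, b;
    if (cudaFuncGetAttributes(&a, withStore) != cudaSuccess ||
        cudaFuncGetAttributes(&b, withoutStore) != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed\n");
        return 1;
    }
    printf("withStore:    %d registers, %zu bytes lmem\n",
           a.numRegs, a.localSizeBytes);
    printf("withoutStore: %d registers, %zu bytes lmem\n",
           b.numRegs, b.localSizeBytes);
    return 0;
}
```

The useful comparison is the gap between the two lines of output: the kernel with the store is the one whose resource usage you actually have to budget for.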

This is true for CPU compilers as well, though you often don’t get such clear feedback about it.

You’re noticing the optimization so clearly here because you’re looking at register counts, which change, rather than at the program’s behavior, which the optimizer leaves unchanged.