Strange performance behavior

I have an application that I’m working on that does calculations.

It can do something like 3 billion calculations per second, but when I make simple code changes that rate goes down to 5 million per second.

For example, I have a function that by default at the end returns a true value. For testing I have that set to return false and get the 3 billion rate. If I change that return to a true, I lose all the performance and the rate goes down to 5 million a second. Strangely with the test data I’m using, the code will never reach this default return. Seems as if the compiler does something weird here.

Has anyone experienced this?

Example of code…

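__device__ bool SomeFunction(int* parm1, char* parm2, …)
{
    if (condition_a)
        return false;
    if (condition_b)
        return false;
    //return true;   // original default; restoring it drops the rate to 5 million/s
    return false;    // test setting; gives the 3 billion/s rate
}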

Thanks in advance.

If you return false in any case, the compiler does not even have to evaluate the conditions.
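A minimal sketch of the point above, not the poster's actual code. All names here (expensiveCheck, alwaysFalse, mayReturnTrue, fastKernel, slowKernel, runBoth) are hypothetical, and the condition is assumed to have no side effects. When every path returns false, the result does not depend on the condition, so an optimizing compiler is free to drop it; once one path returns true, the condition has to be evaluated.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for an expensive, side-effect-free condition.
__device__ bool expensiveCheck(const unsigned int* data, int n)
{
    unsigned int sum = 0;
    for (int i = 0; i < n; ++i)
        sum += data[i] * data[i];
    return sum > 1000u;
}

// Every path returns false, so the result does not depend on expensiveCheck().
// An optimizing compiler may remove the call (and its loop) entirely.
__device__ bool alwaysFalse(const unsigned int* data, int n)
{
    if (expensiveCheck(data, n))
        return false;
    return false;
}

// The result now depends on the condition, so the loop has to stay.
__device__ bool mayReturnTrue(const unsigned int* data, int n)
{
    if (expensiveCheck(data, n))
        return false;
    return true;
}

__global__ void fastKernel(const unsigned int* data, int n, bool* out)
{
    out[threadIdx.x] = alwaysFalse(data, n);
}

__global__ void slowKernel(const unsigned int* data, int n, bool* out)
{
    out[threadIdx.x] = mayReturnTrue(data, n);
}

// Host helper: runs both kernels on one block and checks that the results
// match what the source code says they should be.
bool runBoth()
{
    const int n = 3, threads = 32;
    const unsigned int h_data[n] = {1, 2, 3};  // sum of squares is 14, so the check is false
    unsigned int* d_data = nullptr;
    bool* d_out = nullptr;
    bool h_fast[threads], h_slow[threads];

    if (cudaMalloc(&d_data, sizeof(h_data)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, threads * sizeof(bool)) != cudaSuccess) { cudaFree(d_data); return false; }
    cudaMemcpy(d_data, h_data, sizeof(h_data), cudaMemcpyHostToDevice);

    fastKernel<<<1, threads>>>(d_data, n, d_out);
    cudaMemcpy(h_fast, d_out, sizeof(h_fast), cudaMemcpyDeviceToHost);
    slowKernel<<<1, threads>>>(d_data, n, d_out);
    cudaError_t err = cudaMemcpy(h_slow, d_out, sizeof(h_slow), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int t = 0; t < threads; ++t)
        if (h_fast[t] != false || h_slow[t] != true) return false;
    return true;
}
```

Both kernels give the results the source asks for, but the generated code can differ a lot. Compiling this file with nvcc --ptx and comparing the two kernel bodies should show whether the loop survives in fastKernel.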

The above is just a quick sample I came up with. There are conditions that can return true.

I have another example where I store data in a char variable (char chrByteTest). If I use it in an if statement following the assignment, for example if (var1 == 1 && chrByteTest == 0x00), it kills performance. If the statement is just if (var1 == 1), the performance is there. Various simple coding changes that you would not expect to cause huge issues do. It has been a very frustrating experience.

There has got to be a simple explanation or some setting that I just can’t figure out.


Use the --ptx option with nvcc and take a look at the PTX being generated. If something obviously weird is going on, like big chunks of your kernel being optimized away, it should show up there.

I compared the PTX file from a return true build with the one from a return false build. There are tons of differences between the two files. Seems really strange.

Is there a big length difference in the PTX files? I’m curious if the dead-code optimizer is doing something here…

Yes, there is a big difference.

The good code has 4550 lines in the ptx file and is 138KB in size.

The bad code has 10357 lines in the ptx file and is 296KB in size.

So changing a “return false” to a “return true” causes the generated code to more than double in size. Obviously, a lot more code to execute means reduced performance.

I’m up for any suggestions. I’ve been trying weird stuff in the code to get around this, but I shouldn’t have to change simple code to confusing code to fix something like this.


I’ll stand by my previous suggestion. Even if some condition can lead to a return true, any conditions after that don’t need to be evaluated.

At this point, you might have a genuine compiler bug on your hands (the CUDA compiler is fairly aggressive, and might have gone too far), so reducing the problem to a test case for the compiler team would be the next step.

I did change the logic around to do what you stated. The end result is that specifying a return true, whether or not the logic ever reaches it, causes the huge performance loss.

I started to break down my code in order to submit a test case for the failing code. Strangely, when I got to a certain point of eliminating code, the performance returned, so the problem is not tied to any particular statement. In this case, removing a function that returns void gave the performance back. I’m now wondering whether it has something to do with memory usage. I do have a good number of local variables, but they are not memory intensive: counters, integers for loops, and small char arrays. Should they be moved to shared memory? What are the limitations?


What does nvcc print when compiling with --ptxas-options=-v for the slow and the fast versions?

Here is what I’m seeing…

fast version

1>ptxas info : Compiling entry function '_Z10GoCalculateP16stcInterfaceDataPbPh' for 'sm_20'

1>ptxas info : Used 24 registers, 2808+0 bytes smem, 56 bytes cmem[0]

slow version

1>ptxas info : Compiling entry function '_Z10GoCalculateP16stcInterfaceDataPbPh' for 'sm_20'

1>ptxas info : Used 29 registers, 4+0 bytes lmem, 2808+0 bytes smem, 56 bytes cmem[0]

The four bytes of local memory use are somewhat suspicious. A variable whose address is taken which the compiler can’t optimize away? Surely use of local memory might slow down things a lot.

You might look for “.local” in the PTX file and see what that single word of local memory is used for.

Can’t really tell. It looks like it is the same 2 variables declared in local memory from the good compile and the bad compile. Just that the bad compile references them many more times.

I’m going to try and move them to shared memory. Since I’m pretty new to CUDA programming, is there a simple way to have a variable be only referenced by the thread it is created in or do you need to create it as an array and reference it based on the threads index?


But the good kernel does not use any local memory at all…

You need to create an array and use the thread index.
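A minimal sketch of that approach, with assumed sizes and hypothetical names (THREADS_PER_BLOCK, ELEMS_PER_THREAD, perThreadScratch, runPerThreadScratch are illustrative only). The block declares one shared array, and each thread touches only the row selected by its own thread index, so the row behaves like a private variable.

```cuda
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 64   // assumed block size
#define ELEMS_PER_THREAD  8    // assumed per-thread scratch size

__global__ void perThreadScratch(int* out)
{
    // One shared array for the whole block; each thread owns one row.
    __shared__ int scratch[THREADS_PER_BLOCK][ELEMS_PER_THREAD];

    int* mine = scratch[threadIdx.x];   // this thread's private slice
    for (int i = 0; i < ELEMS_PER_THREAD; ++i)
        mine[i] = threadIdx.x + i;

    int sum = 0;
    for (int i = 0; i < ELEMS_PER_THREAD; ++i)
        sum += mine[i];

    // No __syncthreads() is needed: no thread reads another thread's row.
    out[blockIdx.x * blockDim.x + threadIdx.x] = sum;
}

// Host helper: launches one block and checks each thread's sum,
// which should be ELEMS_PER_THREAD * t + (0 + 1 + ... + 7) = 8 * t + 28.
bool runPerThreadScratch()
{
    int h_out[THREADS_PER_BLOCK];
    int* d_out = nullptr;
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) return false;

    perThreadScratch<<<1, THREADS_PER_BLOCK>>>(d_out);
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int t = 0; t < THREADS_PER_BLOCK; ++t)
        if (h_out[t] != ELEMS_PER_THREAD * t + 28) return false;
    return true;
}
```

The shared memory cost is the per-thread size times the block size (here 64 x 8 x 4 = 2048 bytes), so it grows quickly with larger arrays. Transposing the layout to [ELEMS_PER_THREAD][THREADS_PER_BLOCK] would likely reduce bank conflicts, at the cost of slightly less readable indexing.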

Here is what the two variables that have .local look like (both are int arrays, var1 and var2):

.local .align 4 .b8 __cuda___cuda_local_var_90537_9_non_const_var1_03828[288];

.local .align 4 .b8 __cuda___cuda_local_var_90536_9_non_const_var2_2884116[256];

I’m guessing these are used as local because of the size. I tried to move them to shared, but then get a too much shared memory message during the compile.

Since the declaration is the same in both the good code and the bad code, it has to be the way the compiler optimizes the code, making many more references to the arrays in the bad code. Optimizations are turned off in the compiler settings, so it is something the compiler does on its own.

Even if they were small, it is quite likely that they would have to be stored in local memory anyway. Arrays can’t be put into registers unless you always index the array with a constant.
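A minimal sketch of the difference, using hypothetical kernels (constantIndex, dynamicIndex, runIndexDemo are illustrative names, not from the poster's code). Whether the second kernel actually ends up with local memory depends on the compiler version, so treat this as something to check with --ptxas-options=-v rather than a guarantee.

```cuda
#include <cuda_runtime.h>

// Every index is known at compile time once the loop is unrolled,
// so the compiler is able to keep v[] in registers.
__global__ void constantIndex(const int* in, int* out)
{
    int v[4];
    #pragma unroll
    for (int i = 0; i < 4; ++i)
        v[i] = in[threadIdx.x * 4 + i];
    out[threadIdx.x] = v[0] + v[1] + v[2] + v[3];
}

// The index k is only known at run time, so v[] generally has to live in
// addressable (local) memory.
__global__ void dynamicIndex(const int* in, int* out, int k)
{
    int v[4];
    for (int i = 0; i < 4; ++i)
        v[i] = in[threadIdx.x * 4 + i];
    out[threadIdx.x] = v[k & 3];   // mask keeps the index in range
}

// Host helper: runs both kernels on in[i] = i and checks the results.
bool runIndexDemo()
{
    const int threads = 32, n = threads * 4;
    int h_in[n], h_a[threads], h_b[threads];
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int* d_in = nullptr;
    int* d_out = nullptr;
    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, sizeof(h_a)) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    constantIndex<<<1, threads>>>(d_in, d_out);
    cudaMemcpy(h_a, d_out, sizeof(h_a), cudaMemcpyDeviceToHost);
    dynamicIndex<<<1, threads>>>(d_in, d_out, 2);
    cudaError_t err = cudaMemcpy(h_b, d_out, sizeof(h_b), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    // Thread t holds 4t, 4t+1, 4t+2, 4t+3: the sum is 16t + 6, element 2 is 4t + 2.
    for (int t = 0; t < threads; ++t)
        if (h_a[t] != 16 * t + 6 || h_b[t] != 4 * t + 2) return false;
    return true;
}
```

Compiling with --ptxas-options=-v and comparing the lmem figures reported for the two kernels is the quickest way to see which one spills the array out of registers.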

I’m confused. Didn’t the good kernel have no lmem usage at all and the bad one had only 4 bytes? So where do these 544 bytes of local memory come from now?

That’s what’s confusing.

I just verified again. The good compile doesn’t say anything about lmem, but the ptx file has the two fields.

The bad compile has the 4 bytes and the ptx file has the two fields.

But in either case, it is more than 4 bytes each. They are both integers, so it really should be 4 x the number of array elements.
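The arithmetic can be checked with a few lines of host code. This is only a sketch: the byte counts come from the .local declarations and the ptxas output quoted above, while the 48 KB per-block shared memory limit for sm_20 and all the names (intElements, maxThreadsIfShared, and the constants) are assumptions introduced here. The poster's actual block size is not given in the thread.

```cuda
#include <cstdio>

// Sizes taken from the two .local declarations in the PTX.
const int VAR1_BYTES = 288;
const int VAR2_BYTES = 256;

// Assumption: 48 KB is the largest per-block shared memory size on sm_20.
const int SMEM_LIMIT_BYTES = 48 * 1024;

// Shared memory the kernel already uses, according to the ptxas output.
const int SMEM_IN_USE_BYTES = 2808;

// Number of 4-byte ints that fit in a given byte count.
inline int intElements(int bytes)
{
    return bytes / static_cast<int>(sizeof(int));
}

// Largest block size for which a per-thread copy of both arrays
// would still fit in the remaining shared memory.
inline int maxThreadsIfShared()
{
    return (SMEM_LIMIT_BYTES - SMEM_IN_USE_BYTES) / (VAR1_BYTES + VAR2_BYTES);
}

// Prints the per-thread footprint and the resulting block size limit.
void printFootprint()
{
    printf("var1: %d ints, var2: %d ints, %d bytes per thread\n",
           intElements(VAR1_BYTES), intElements(VAR2_BYTES),
           VAR1_BYTES + VAR2_BYTES);
    printf("max threads per block if moved to shared memory: %d\n",
           maxThreadsIfShared());
}
```

Under these assumptions the arrays hold 72 and 64 ints, 544 bytes per thread in total, and only about 85 threads per block would fit if both arrays were moved to shared memory. That would be consistent with the "too much shared memory" message for any larger block size.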

It was compiled using toolkit version 3.2. I’m wondering if it would help to compile with the 4.0 release candidate.