Strange performance behavior

jmnet · May 9, 2011, 12:26am

I have an application that I’m working on that does calculations.

It can do something like 3 billion calculations per second, but when I make simple code changes that rate goes down to 5 million per second.

For example, I have a function that by default at the end returns a true value. For testing I have that set to return false and get the 3 billion rate. If I change that return to a true, I lose all the performance and the rate goes down to 5 million a second. Strangely with the test data I’m using, the code will never reach this default return. Seems as if the compiler does something weird here.

Has anyone else experienced this?

Example of code…

__device__ bool SomeFunction(int* parm1, char* parm2, …)
{
    if (condition_a)
    {
        return false;
    }
    if (condition_b)
    {
        return false;
    }
    .
    .
    //return true;
    return false;
}

Thanks in advance.

tera · May 9, 2011, 1:03am

If you return false in any case, the compiler does not even have to evaluate the conditions.
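To make that concrete, here is a minimal sketch of the effect. The names always_false, sometimes_true, count_hits and run_count_hits are made up for illustration and are not from the original code. Both device functions run the same tests, but only the second one has a result that depends on the data, so only that one needs its loads and comparisons to survive optimization.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the function in the first post:
// every path returns false, so the result never depends on the data.
__device__ bool always_false(const int* data, int i)
{
    if (data[i] > 100) return false;
    if (data[i] < 0)   return false;
    return false;
}

// Same tests, but the fall-through value differs, so the loads and
// comparisons have to stay in the generated code.
__device__ bool sometimes_true(const int* data, int i)
{
    if (data[i] > 100) return false;
    if (data[i] < 0)   return false;
    return true;
}

__global__ void count_hits(const int* data, int n, int* hits)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (always_false(data, i))   atomicAdd(&hits[0], 1);  // removable as dead code
    if (sometimes_true(data, i)) atomicAdd(&hits[1], 1);  // real work
}

// Host helper: fills data with 0..n-1, runs the kernel, copies both counters back.
void run_count_hits(int n, int hits_out[2])
{
    std::vector<int> host(n);
    for (int i = 0; i < n; ++i) host[i] = i;

    int* d_data = nullptr;
    int* d_hits = nullptr;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_hits, 2 * sizeof(int));
    cudaMemcpy(d_data, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_hits, 0, 2 * sizeof(int));

    count_hits<<<(n + 127) / 128, 128>>>(d_data, n, d_hits);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(hits_out, d_hits, 2 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_hits);
}
```

Calling run_count_hits(256, hits) from main() should leave hits[0] at zero. Whether the compiler really drops the first branch depends on the toolkit version and settings, so treat this as an illustration of the idea rather than a guarantee.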

jmnet · May 9, 2011, 1:49am

The above is just a quick sample I came up with. There are conditions that can return true.

I have another example where I store data in a char variable (char chrByteTest). If I use this in an if statement following the assignment, for example if (var1 == 1 && chrByteTest == 0x00), it kills performance. If the statement is just if (var1 == 1), the performance is there. There are various simple coding things that you would not expect to cause huge issues, but they do. It has been a very frustrating experience.

There has got to be a simple explanation or some setting that I just can’t figure out.

Thanks.

seibert · May 9, 2011, 12:37pm

Use the --ptx option with nvcc and take a look at the PTX being generated. If something obviously weird is going on, like big chunks of your kernel being optimized away, it should show up there.

jmnet · May 9, 2011, 1:08pm

I compared the PTX file from a return true build against the one from a return false build. There are tons of differences between the two. Seems really strange.

seibert · May 9, 2011, 2:31pm

Is there a big length difference in the PTX files? I’m curious if the dead-code optimizer is doing something here…

jmnet · May 9, 2011, 3:27pm

Yes, there is a big difference.

The good code has 4550 lines in the ptx file and is 138KB in size.

The bad code has 10357 lines in the ptx file and is 296KB in size.

So changing a “return false” to a “return true” is causing the generated code to more than double in size. Obviously, a lot more code to execute means reduced performance.

I’m up for any suggestions. I’ve been trying weird stuff in the code to get around this, but I shouldn’t have to change simple code to confusing code to fix something like this.

Thanks.

tera · May 9, 2011, 3:43pm

I’ll stand by my previous suggestion. Even if some condition can lead to a return true;, any conditions after that don’t need to be evaluated.

seibert · May 9, 2011, 5:15pm

At this point, you might have a genuine compiler bug on your hands (the CUDA compiler is fairly aggressive, and might have gone too far), so reducing the problem to a test case for the compiler team would be the next step.

jmnet · May 9, 2011, 7:22pm

I did change the logic around to do what you suggested. The end result is that if I specify a return true, whether the logic ever reaches it or not, I get the huge performance loss.

jmnet · May 10, 2011, 12:18am

I started to break down my code in order to submit a test case for the failing code. Strangely, when I got to a certain point of eliminating code, the performance returned. So it is not tied to any particular statement. In this case, removing a function that returns void gave the performance back. I’m wondering now if it has something to do with memory usage. I do have a good number of local variables, but they are not memory intensive: things like counters, integers for loops and small char arrays. Should they be moved to shared memory? What are the limitations?

Thanks.

tera · May 10, 2011, 12:38am

What does nvcc print when compiling with --ptxas-options=-v for the slow and the fast versions?
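For reference, the build-time report comes from something like nvcc -c --ptxas-options=-v file.cu (the file name is a placeholder). The same kind of numbers can also be read back at run time with cudaFuncGetAttributes. A minimal sketch, where sample_kernel and query_sample_kernel are made-up names used only so there is something to inspect:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel, intended to be launched with 64 threads per block.
__global__ void sample_kernel(const int* in, int* out)
{
    __shared__ int tile[64];      // 64 * 4 = 256 bytes of static shared memory
    int t = threadIdx.x;
    tile[t] = in[t];
    __syncthreads();
    out[t] = tile[63 - t];        // reversed read keeps the shared array in use
}

// Returns the per-kernel resource numbers the runtime knows about.
cudaFuncAttributes query_sample_kernel()
{
    cudaFuncAttributes attr = {};
    cudaFuncGetAttributes(&attr, sample_kernel);
    return attr;
}
```

In the returned struct, numRegs, localSizeBytes, sharedSizeBytes and constSizeBytes roughly correspond to the registers, lmem, smem and cmem figures in the ptxas output, so a nonzero localSizeBytes is the thing to watch for.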

jmnet · May 10, 2011, 2:06am

Here is what I’m seeing…

fast version

1>ptxas info : Compiling entry function ‘_Z10GoCalculateP16stcInterfaceDataPbPh’ for ‘sm_20’

1>ptxas info : Used 24 registers, 2808+0 bytes smem, 56 bytes cmem[0]

slow version

1>ptxas info : Compiling entry function ‘_Z10GoCalculateP16stcInterfaceDataPbPh’ for ‘sm_20’

1>ptxas info : Used 29 registers, 4+0 bytes lmem, 2808+0 bytes smem, 56 bytes cmem[0]

tera · May 10, 2011, 2:23am

The four bytes of local memory use are somewhat suspicious. Perhaps a variable whose address is taken and that the compiler can’t optimize away? Use of local memory can certainly slow things down a lot.

You might look for “.local” in the PTX file and see what that single word of local memory is used for.

jmnet · May 10, 2011, 12:38pm

Can’t really tell. It looks like it is the same 2 variables declared in local memory from the good compile and the bad compile. Just that the bad compile references them many more times.

I’m going to try to move them to shared memory. Since I’m pretty new to CUDA programming: is there a simple way to have a variable be referenced only by the thread it is created in, or do you need to create it as an array and reference it based on the thread’s index?

Thanks.

tera · May 10, 2011, 3:20pm

But the good kernel does not use any local memory at all…

You need to create an array and use the thread index.
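A minimal sketch of what that can look like, assuming a fixed block size of 128 threads and 4 private ints per thread (both numbers, and all names here, are made up for the example):

```cuda
#include <vector>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 128   // assumed block size for this sketch
#define SLOTS_PER_THREAD  4     // assumed number of private ints per thread

__global__ void per_thread_scratch(int* out)
{
    // [slot][thread] layout: neighbouring threads touch neighbouring words.
    __shared__ int scratch[SLOTS_PER_THREAD][THREADS_PER_BLOCK];

    int t = threadIdx.x;
    int g = blockIdx.x * blockDim.x + t;

    for (int k = 0; k < SLOTS_PER_THREAD; ++k)
        scratch[k][t] = g * 10 + k;    // each thread writes only column t

    int sum = 0;
    for (int k = 0; k < SLOTS_PER_THREAD; ++k)
        sum += scratch[k][t];          // and reads only column t, so no __syncthreads()

    out[g] = sum;
}

// Host helper: launches the kernel and returns one result per thread.
std::vector<int> run_per_thread_scratch(int blocks)
{
    int n = blocks * THREADS_PER_BLOCK;
    int* d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(int));

    per_thread_scratch<<<blocks, THREADS_PER_BLOCK>>>(d_out);

    std::vector<int> host(n);
    cudaMemcpy(host.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return host;
}
```

Keep in mind that the shared array is sized per block, so the cost is slots times threads per block times 4 bytes. That adds up quickly for larger per-thread arrays.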

jmnet · May 10, 2011, 3:51pm

Here is what the two variables that have .local look like (both are int arrays, var1 and var2):

.local .align 4 .b8 __cuda___cuda_local_var_90537_9_non_const_var1_03828[288];

.local .align 4 .b8 __cuda___cuda_local_var_90536_9_non_const_var2_2884116[256];

I’m guessing these are placed in local memory because of their size. I tried to move them to shared memory, but then I get a “too much shared memory” message during the compile.

Since the declaration is the same in both the good code and the bad code, it has to be the way the compiler is optimizing the code, making many references to it in the bad code. Optimizations are turned off in the compiler settings, so it is something happening natively.

seibert · May 10, 2011, 5:44pm

Even if they were small, it is quite likely that they would have to be stored in local memory anyway. Arrays can’t be put into registers unless you always index the array with a constant.
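A minimal sketch of the difference, with made-up kernel and helper names. In the first kernel every index is a compile-time constant once the loop is unrolled. In the second, the index is read from memory at run time, so the array may end up in local memory. Comparing the two with --ptxas-options=-v is the way to check what a given toolkit actually does.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Indices are compile-time constants once the loop is unrolled,
// so v[] can live in registers.
__global__ void static_index(const float* in, float* out)
{
    float v[4];
    #pragma unroll
    for (int k = 0; k < 4; ++k) v[k] = in[threadIdx.x * 4 + k];
    out[threadIdx.x] = v[0] + v[1] + v[2] + v[3];
}

// The index comes from memory at run time, so the compiler cannot map
// v[] onto fixed registers in the same straightforward way.
__global__ void dynamic_index(const float* in, const int* pick, float* out)
{
    float v[4];
    #pragma unroll
    for (int k = 0; k < 4; ++k) v[k] = in[threadIdx.x * 4 + k];
    out[threadIdx.x] = v[pick[threadIdx.x] & 3];
}

// Host helper: runs both kernels on one block of 32 threads.
void run_index_demo(std::vector<float>& static_out, std::vector<float>& dynamic_out)
{
    const int threads = 32;
    std::vector<float> in(threads * 4);
    std::vector<int> pick(threads);
    for (int i = 0; i < threads * 4; ++i) in[i] = float(i);
    for (int t = 0; t < threads; ++t) pick[t] = t % 4;

    float* d_in = nullptr;
    float* d_a = nullptr;
    float* d_b = nullptr;
    int* d_pick = nullptr;
    cudaMalloc(&d_in, in.size() * sizeof(float));
    cudaMalloc(&d_a, threads * sizeof(float));
    cudaMalloc(&d_b, threads * sizeof(float));
    cudaMalloc(&d_pick, threads * sizeof(int));
    cudaMemcpy(d_in, in.data(), in.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_pick, pick.data(), threads * sizeof(int), cudaMemcpyHostToDevice);

    static_index<<<1, threads>>>(d_in, d_a);
    dynamic_index<<<1, threads>>>(d_in, d_pick, d_b);

    static_out.resize(threads);
    dynamic_out.resize(threads);
    cudaMemcpy(static_out.data(), d_a, threads * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(dynamic_out.data(), d_b, threads * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_pick);
}
```

For an array this small the compiler may still avoid local memory by turning the dynamic lookup into a chain of selects, so the ptxas report is the final word, not the source code.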

tera · May 10, 2011, 6:25pm

I’m confused. Didn’t the good kernel have no lmem usage at all and the bad one had only 4 bytes? So where do these 544 bytes of local memory come from now?

jmnet · May 10, 2011, 6:57pm

That’s what’s confusing.

I just verified again. The good compile doesn’t say anything about lmem, but the ptx file has the two fields.

The bad compile has the 4 bytes and the ptx file has the two fields.

But in either case, each one is more than 4 bytes. They are both int arrays, so the size really should be 4 x the number of array elements.

It was compiled using toolkit version 3.2. I’m wondering if it would help to compile with the 4.0 release candidate.
