one more assignment statement increases the runtime by more than 100%

Hi, I am running into a weird situation. I have a CUDA kernel that uses many arrays in shared memory and many automatic variables. Say A is a float array in shared memory and x is an automatic float variable; both A and x are used many times during execution of the kernel. My application takes 3 seconds to complete. If I add one assignment statement A[???] = x at the end of the kernel, the runtime increases to 8 seconds. However, if I instead add A[???] = 1.234, or float y = x * 2.345, the runtime stays at 3 seconds.
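Roughly, the structure is like this (a simplified sketch only; the names, sizes, and index are placeholders, not my real code):

    __global__ void kernel(const float *in)
    {
        __shared__ float A[256];            // shared array, used many times
        A[threadIdx.x] = in[threadIdx.x];   // initialize shared memory
        __syncthreads();

        float x = 0.0f;                     // automatic variable, used many times
        for (int i = 0; i < 10000; ++i)     // lengthy computation involving A and x
            x += A[(threadIdx.x + i) % 256] * 0.5f;

        // The statement in question -- with it, the runtime goes from 3 s to 8 s:
        A[threadIdx.x] = x;

        // With A[threadIdx.x] = 1.234f; or float y = x * 2.345f; instead,
        // the runtime stays at 3 seconds.
    }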

Could anyone please help me? Any comments are welcome. Thanks a lot in advance.

It's a smart compiler optimization. The compiler can tell when a computed expression is never used or stored, and eliminates all the computation needed to produce it.
This can dramatically improve the speed of your program (while keeping correctness).

The same optimizations happen on CPU compilers too.

So in your example of A[???] = 1.234, all the computation of x can be deleted, since x is never used.
Same with y = x * 2.345: in this case y is never used, and therefore neither is x, so again all the computation needed to produce x can be deleted.
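To illustrate with a hypothetical kernel (not your code, just a sketch of the dead-code elimination):

    __global__ void deadCodeDemo(float *out)
    {
        float x = 0.0f;
        for (int i = 0; i < 10000; ++i)   // expensive computation producing x
            x += sinf(i * 0.001f);

        // Case 1: x is stored, so the loop above must actually run.
        out[threadIdx.x] = x;

        // Case 2: if the store were instead
        //     out[threadIdx.x] = 1.234f;
        // or
        //     float y = x * 2.345f;      // y never used afterwards
        // then x would never be needed, and the compiler could delete the whole loop.
    }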

Thank you so much for your inputs, SPWorley.

I had been trying hard to output the value of x while keeping the runtime at 3 seconds. Your explanation of the smart compiler tells me that if I actually need to output the value of x, the real runtime is 8 seconds. I will have to find other ways, such as optimizing the algorithm, to cut the runtime.

SPWorley has correctly explained that one question, but your real problem remains.
It is curious that although A and x are used many times in your kernel, it is only the final assignment that slows down your code. Have you looked at how many registers you are using? It could be that the final assignment pushes up your register count.

PTXAS info shows 63 registers used whether or not I have the final assignment. Even if I add some dummy automatic variables and dummy code using those new variables, it still shows 63 registers used. I saw in another discussion that somebody reported 90 registers used; I don't know how they could see such a large number, since a reply to another question of mine in a separate topic on this forum says 63 is the maximum for a compute capability 2.0 device.

Maximum register count for 1.x devices is 124, so 90 registers are possible there. You may compile your code with -arch=sm_13 to get a rough idea of how many registers the compiler would use if they were available. Be aware though that the architecture of 1.x and 2.x devices is quite different, so the register counts are not directly comparable.
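For example, something like this (the file name is just a placeholder) prints the register usage per kernel:

    nvcc -arch=sm_13 --ptxas-options=-v -c mykernel.cu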

Thank you so much, Tera. I compiled the code with sm_13, and PTXAS info showed 118 registers used. It seems to me that I have used too many automatic variables, and I will have to rethink the algorithm to eliminate some of them in order to reduce the register count.

I just used the Visual Profiler and found that if I comment out the final assignment, local_load and local_store are both reduced by more than 80%.

The logic of my code is: after lengthy calculations, the result is x. I want to assign the result from each thread into a shared array so that I can aggregate the results at block level and then copy them back to CPU memory. I guess that if I comment out the final assignment, the smart compiler decides the calculations for getting x are unnecessary and eliminates all of them. But this is just my guess.
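Something along these lines is what I am after (just a sketch of my intent, assuming a power-of-two block size and a sum as the block-level aggregation):

    __global__ void kernel(float *blockResults)
    {
        __shared__ float A[256];              // one slot per thread

        float x = 0.0f;
        // ... lengthy calculations producing x ...

        A[threadIdx.x] = x;                   // the final assignment in question
        __syncthreads();

        // block-level aggregation (tree reduction in shared memory)
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                A[threadIdx.x] += A[threadIdx.x + s];
            __syncthreads();
        }

        if (threadIdx.x == 0)
            blockResults[blockIdx.x] = A[0];  // copied back to the CPU afterwards
    }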