I have a “big” device function that uses 60 floats and contains about 3000 lines of calculation.
It is all simple arithmetic between the floats, and because I do not index into arrays (so that the values stay in registers), I end up with a lot of lines.
The compile time is 2 seconds.
If I add one line: v1+=v2*(v3+v4);
then the compilation time is more than 2 minutes.
If I look in the Task Manager, I see ptxas.exe working and using 600 MB of memory; before adding this line it was less than 50 MB.
I think it is because of the “virtual” registers that I have read about on this forum.
If I declare my floats as volatile, performance drops completely.
So, is there any way to avoid this problem?
If it is a problem of register file size, does that mean that with a GTX 295, which has twice the registers of my 9800 GX2 (16,384 against 8,192), I could solve my problem?
And why does the memory used by ptxas.exe jump so abruptly?
It’s impossible to say, though seeing the code could help.
The most likely explanation is that the compiler’s optimizer is properly being smart. If, without that line, your code never uses the values v2, v3, v4 in a way that affects v1, and your final output only uses v1, then v2, v3 and v4 are redundant and don’t need to be computed at all. The compiler’s optimizer properly eliminates them AND everything they depended on that nothing else needs.
It’s a great optimization.
But this means that with that one line present, the compiler has to do a lot more work: generate a lot more code, unroll more loops, etc. So compilation can be a lot slower.
I have no idea if this is truly the issue, but it’s very common behavior, in both run time and compile time, for the global optimizer to surprise you by following the dependencies (properly!), so that a single line of code makes a major difference to your program.
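To make the idea concrete, here is a minimal sketch, purely illustrative and nothing like your real kernel: if the marked line stays commented out, v2, v3 and v4 feed nothing that ever reaches global memory, so the optimizer can throw them away along with everything behind them.

__global__ void demo(const float *in, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float v1 = in[tid] * 2.0f;   // feeds the output directly
    float v2 = in[tid] + 1.0f;   // v2, v3, v4 (and anything computing them)
    float v3 = v2 * v2;          // are live only if the line below is present
    float v4 = v3 - v2;
    // ... imagine thousands more lines of register-only arithmetic here ...
    // v1 += v2 * (v3 + v4);     // uncomment: now all of the above must be compiled
    out[tid] = v1;               // only values that reach this store are kept
}

With the line commented out, the whole middle of the function is dead code and ptxas has almost nothing to do; with it, every dependency has to be compiled and register-allocated.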
Here is the code.
The project is GLMM1, and I use VS2005.
The interesting file is glmm1Kernel.cu.
1) If you compile the file as it is, ptxas.exe uses more than 500 MB.
If you comment out line 1557 (you will see the comment), then there is no problem: compilation takes 5 seconds and less than 50 MB of memory. Note that this line uses an array previously filled by the host with cudaMemcpy.
2) If I reduce the number of lines before it, it is OK (see where I suggest commenting: lines 695 to 1537).
I have to read/modify/write a vector of 45 floats per thread, so if I use shared memory I can only launch fewer than 100 threads per multiprocessor, i.e. fewer than 4 blocks of 32 threads, which is really not optimal (there are 8 processors).
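(For reference, the quick calculation behind that limit, assuming the 16 KB of shared memory per multiprocessor of a compute 1.x card:

45 floats/thread * 4 bytes = 180 bytes of shared memory per thread
16384 bytes / 180 bytes    = about 91 threads per multiprocessor

which is indeed fewer than 100 threads per multiprocessor.)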
Can you tell me if you get the same result? Any advice on reducing the code is welcome!
From a quick look at the code, I think my theory explains it.
The kernel is long and complex and does take a while to compile.
But if you delete that last line, the optimizer realizes a lot of the work done is pointless since it’s never used, and can eliminate most if not all of your code, so it’s easy and fast to compile a do-nothing function.
I could still be wrong, but the fact that this is the final line of your large function, and that all of the function’s computation is ultimately saved through this line, makes an optimization issue the likely explanation. If the function doesn’t save anything, it doesn’t need to compute anything. If it doesn’t need to compute anything, compiling is easy and fast.
So I bet what you are seeing is a great compiler optimization, a feature that speeds up compilation enormously when results go unused. It’s not a bug where one line takes a huge time to compile.
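One cheap way to test this theory, if you want: compile both versions with ptxas in verbose mode and compare what is left of the kernel. The flag is standard nvcc; the file name is just yours from above.

nvcc -c glmm1Kernel.cu --ptxas-options=-v

ptxas then reports the registers, local memory and shared memory used per kernel, so the version without the final line should show dramatically lower numbers than the full one if most of its body really is being eliminated.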
A comment from someone at NVIDIA would be great too!
I understand this code elimination, but then how do you explain the fact that if I comment out something like a quarter of the lines (with the last line still active), it goes very fast?
Why does a quarter of the lines (maybe less, because there are inline macros) make such a huge difference for ptxas.exe:
2 seconds to compile and less than 50 MB of RAM used,
against (with the 500 extra lines):
10 minutes to compile and 800 MB of RAM used by ptxas.exe.
And all of that for an .exe of less than 1 MB.
I also observed that the generated code is faster by a big factor when the compilation time is not too long.
It is as if there is an optimization mechanism that cannot cope with long code using something like 70 floats per thread plus a pointer to device memory?
Is there a point where optimization degrades and the compiler starts having problems?
I only know about the limit of 2 million PTX instructions, and I think I am far from it.
I would call this rather extreme. Understanding why this happens requires you to know the internals of the nvcc compiler and its code optimizer.
There are some options you could try to tweak, one being the --maxrregcount option.
You could also try to disable any code optimization (or parts thereof) to see if it is the optimization phase that makes the memory requirements explode.
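For example, on the nvcc command line (the value 32 is just an arbitrary starting point to experiment with):

nvcc -c glmm1Kernel.cu --maxrregcount=32
nvcc -c glmm1Kernel.cu -Xptxas -O0

The first caps the registers per thread at 32, with anything above that spilled to local memory; the second switches off the ptxas optimizer entirely. If the -O0 build is suddenly fast and uses little memory, that would confirm it is the optimization phase that explodes.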
Well, the original post implied that the kernel is 3000 lines long, which is a bit extreme in itself… I start to get scared if my kernels are topping a couple of hundred lines. A source code file of 3000 lines is enough to give me the shivers, let alone an individual routine!
The nvcc options are stored in the form of custom build rules for the .cu files. I can’t tell you anything more specific, though: just dig around and see if you can find the build rule under the “Properties” dialogue for your .cu files. nvcc is called from there, which lets you specify nvcc-specific command line options.
You can also open the visual c++ project file (.vcproj) using Notepad (or any other plain text editor) to find those build rules.
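For what it’s worth, the CommandLine entry of that custom build rule usually looks roughly like the following (the exact macros and flags depend on which SDK template your project started from, so treat this only as an example of where extra options such as --maxrregcount would be appended):

"$(CUDA_BIN_PATH)\nvcc.exe" -ccbin "$(VCInstallDir)bin" -c -I"$(CUDA_INC_PATH)" --maxrregcount=32 -o $(ConfigurationName)\$(InputName).obj $(InputFileName)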
And finally, there’s the option of using the CUDA project wizard, as posted in the Windows-specific CUDA forum. It lets you change some of the nvcc options directly from the wizard’s dialogue box.