I have a “big” device function that uses 60 floats and contains about 3000 lines of calculation.
It is all simple arithmetic between the floats, and because I do not index into arrays (so that the values stay in registers), I end up with a lot of lines.
The compile time is 2 seconds.
If I add one line: v1+=v2*(v3+v4);
then the compilation time is more than 2 minutes.
If I look in the Task Manager, I see ptxas.exe working and using 600 MB of memory; before adding this line it was less than 50 MB.
I think it is because of the “virtual” registers that I have read about on this forum.
If I declare my floats as volatile, performance drops completely.
So, is there any way to avoid this problem?
If it is a problem of register file size, does that mean that with a GTX 295, which has twice the registers of my 9800 GX2 (16,384 against 8,192), I could solve my problem?
And why does the memory used by ptxas.exe jump so abruptly?
It’s impossible to say, though seeing the code could help.
The most likely explanation is that the compiler’s optimizer is properly being smart. If, without that line, your code never uses the values v2, v3, v4 in a way that affects v1, and your final output only uses v1, then v2, v3 and v4 are redundant and don’t need to be computed at all. The compiler’s optimizer properly eliminates them AND everything they depended on that nothing else needs.
It’s a great optimization.
But this means that with that one line present, the compiler has to do a lot more work: generate a lot more code, unroll more loops, etc. So compilation can be a lot slower.
I have no idea if this is truly the issue, but it’s very common behavior, in both run time and compile time, for the global optimizer to surprise you by following the dependencies (properly!), so that a single line of code makes a major difference to your program.
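To make the idea concrete, here is a minimal sketch, purely illustrative and nothing like your real kernel: if the marked line stays commented out, v2, v3 and v4 feed nothing that ever reaches global memory, so the optimizer can throw them away along with everything behind them.

__global__ void demo(const float *in, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float v1 = in[tid] * 2.0f;   // feeds the output directly
    float v2 = in[tid] + 1.0f;   // v2, v3, v4 (and anything computing them)
    float v3 = v2 * v2;          // are live only if the line below is present
    float v4 = v3 - v2;
    // ... imagine thousands more lines of register-only arithmetic here ...
    // v1 += v2 * (v3 + v4);     // uncomment: now all of the above must be compiled
    out[tid] = v1;               // only values that reach this store are kept
}

With the line commented out, the whole middle of the function is dead code and ptxas has almost nothing to do; with it, every dependency has to be compiled and register-allocated.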
Here is the code.
The project is GLMM1, and I use VS2005.
The interesting file is glmm1Kernel.cu.
1) If you compile the file as it is, ptxas.exe uses more than 500 MB.
If you comment out line 1557 (you will see the comment), then there is no problem: compilation takes 5 seconds and less than 50 MB of memory. Note that this line uses an array previously filled by the host with cudaMemcpy.
2) If I reduce the number of lines before it, it is OK (see where I suggest commenting: lines 695 to 1537).
I have to read/modify/write a vector of 45 floats per thread, so if I use shared memory I can only launch fewer than 100 threads per multiprocessor, i.e. fewer than 4 blocks of 32 threads, which is really not optimal (there are 8 processors).
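(For reference, the quick calculation behind that limit, assuming the 16 KB of shared memory per multiprocessor of a compute 1.x card:

45 floats/thread * 4 bytes = 180 bytes of shared memory per thread
16384 bytes / 180 bytes    = about 91 threads per multiprocessor

which is indeed fewer than 100 threads per multiprocessor.)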
Can you tell me if you get the same result? Any advice on reducing the code is welcome!
From a quick look at the code, I think my theory explains it.
The kernel is long and complex and does take a while to compile.
But if you delete that last line, the optimizer realizes a lot of the work done is pointless since it’s never used, and can eliminate most if not all of your code, so it’s easy and fast to compile a do-nothing function.
I could still be wrong, but the fact that this is the final line of your large function, and that all of the function’s computation is ultimately saved through this line, makes an optimization issue the likely explanation. If the function doesn’t save anything, it doesn’t need to compute anything. If it doesn’t need to compute anything, compiling is easy and fast.
So I bet what you are seeing is a great compiler optimization, a feature that speeds up compilation enormously when results go unused. It’s not a bug where one line takes a huge time to compile.
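One cheap way to test this theory, if you want: compile both versions with ptxas in verbose mode and compare what is left of the kernel. The flag is standard nvcc; the file name is just yours from above.

nvcc -c glmm1Kernel.cu --ptxas-options=-v

ptxas then reports the registers, local memory and shared memory used per kernel, so the version without the final line should show dramatically lower numbers than the full one if most of its body really is being eliminated.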
A comment from someone at NVIDIA would be great too!
I understand this code elimination, but then how do you explain the fact that if I comment out something like a quarter of the lines (with the last line still active), it goes very fast?
Why does a quarter of the lines (maybe less, because there are inline macros) make such a huge difference for ptxas.exe:
2 seconds to compile and less than 50 MB of RAM used,
against (with the 500 extra lines):
10 minutes to compile and 800 MB of RAM used by ptxas.exe.
And all of that for an .exe of less than 1 MB.
I also observed that the generated code is faster by a big factor when the compilation time is not too long.
It is as if there is an optimization mechanism that cannot cope with long code using something like 70 floats per thread plus a pointer to device memory?
Is there a point where optimization degrades and the compiler starts having problems?
I only know about the limit of 2 million PTX instructions, and I think I am far from it.
I would call this rather extreme. Understanding why this happens requires you to know the internals of the nvcc compiler and its code optimizer.
There are some options you could try to tweak, one being the --maxrregcount option.
You could also try to disable any code optimization (or parts thereof) to see if it is the optimization phase that makes the memory requirements explode.
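For example, on the nvcc command line (the value 32 is just an arbitrary starting point to experiment with):

nvcc -c glmm1Kernel.cu --maxrregcount=32
nvcc -c glmm1Kernel.cu -Xptxas -O0

The first caps the registers per thread at 32, with anything above that spilled to local memory; the second switches off the ptxas optimizer entirely. If the -O0 build is suddenly fast and uses little memory, that would confirm it is the optimization phase that explodes.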
Well, the original post implied that the kernel is 3000 lines long, which is a bit extreme in itself… I start to get scared if my kernels are topping a couple of hundred lines. A source code file of 3000 lines is enough to give me the shivers, let alone an individual routine!
The nvcc options are stored in the form of custom build rules for the .cu files. I can’t tell you anything more specific, though: just dig around and see if you can find the build rule under the “Properties” dialogue for your .cu files. nvcc is called from there, which lets you specify nvcc-specific command line options.
You can also open the visual c++ project file (.vcproj) using Notepad (or any other plain text editor) to find those build rules.
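For what it’s worth, the CommandLine entry of that custom build rule usually looks roughly like the following (the exact macros and flags depend on which SDK template your project started from, so treat this only as an example of where extra options such as --maxrregcount would be appended):

"$(CUDA_BIN_PATH)\nvcc.exe" -ccbin "$(VCInstallDir)bin" -c -I"$(CUDA_INC_PATH)" --maxrregcount=32 -o $(ConfigurationName)\$(InputName).obj $(InputFileName)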
And finally, there’s the option of using the CUDA project wizard, as posted in the Windows-specific CUDA forum. It lets you change some of the nvcc options directly from the wizard’s dialogue box.