ptxas resource requirements!

ptxas seems to be getting more and more demanding on my resources. Recently it took over half an hour to compile my code and consumed almost 1.5 GB of memory!
Is that normal behaviour, or is it a memory leak or something?
What is your experience when compiling your code?

Wow, how big is the kernel you are compiling?!

The thing is that it is not that big. A few hundred lines…

That sounds like a compiler bug, then. If it still happens with the latest toolkit, you should file a bug with NVIDIA along with the kernel source.

This isn’t uncommon if you’re using floating-point intrinsics. I encountered, and posted about, very similar problems when using the __f*-related functions about 3-4 months ago.

We’re currently having to run our kernel with math functions we don’t want to use because of this issue, just so it compiles properly for us (and even now, with our ‘tweaked’ code, it takes ~5 minutes and about 800 MB of memory to compile a 2000-line source file).
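For reference, the kind of intrinsic-heavy code that seems to trigger this looks roughly like the sketch below. The kernel name, loop bound, and data layout are all invented for illustration; only the use of the __f* intrinsics reflects the actual problem.

```cuda
// Hypothetical sketch of an intrinsic-heavy kernel of the kind that
// makes ptxas compile times blow up; names and bounds are invented.
__global__ void intrinsicHeavy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;
    for (int k = 0; k < 64; ++k) {
        // IEEE round-to-nearest intrinsics instead of plain operators
        float t = __fmul_rn(in[i], __fadd_rn(acc, 1.0f));
        acc = __fadd_rn(acc, __fdividef(t, 2.0f));
    }
    out[i] = acc;
}
```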

I can better that ;) I get up to 2.5 GB of memory consumed by ptxas…
30 minutes per kernel, for a file that contains 3 kernels, each with 2 different values for a template parameter…
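To make that concrete, here is a made-up skeleton (not my actual source): each distinct template value forces a separate instantiation, so ptxas ends up processing 6 full kernel bodies for one file like this.

```cuda
// Hypothetical skeleton with invented names: 3 kernels, each used with
// 2 template values, so ptxas compiles 6 full kernel bodies.
template <int BLOCK>
__global__ void kernelA(float *d) { d[threadIdx.x] *= BLOCK; /* large body */ }

template <int BLOCK>
__global__ void kernelB(float *d) { d[threadIdx.x] += BLOCK; /* large body */ }

template <int BLOCK>
__global__ void kernelC(float *d) { d[threadIdx.x] -= BLOCK; /* large body */ }

// Each distinct launch below triggers a separate instantiation:
void launchAll(float *d)
{
    kernelA<128><<<1, 128>>>(d);
    kernelA<256><<<1, 256>>>(d);
    kernelB<128><<<1, 128>>>(d);
    kernelB<256><<<1, 256>>>(d);
    kernelC<128><<<1, 128>>>(d);
    kernelC<256><<<1, 256>>>(d);
}
```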

Hm… I do have templates with parameters… maybe that is the source of the problem. Thank you for the hint!

I prefer not to show my source just yet, and I have been unsuccessful so far in producing a smaller test example.

Add me to the list: ptxas has now been running for 8 min (CPU time) on a kernel that previously compiled within seconds, which is really weird. Especially since I made no change to the kernel code at all, only to the calling code, and still see this wildly different behavior.

I use CUDA 3.0 on Ubuntu 8.04 64-bit with gcc 4.2.4.

Has a bug been filed for this?


Are you using #pragma unroll?

No, I don’t use #pragma unroll. The kernel is actually generated code that uses lots of intermediate values (so the available registers will certainly not be sufficient and it spills to local memory).
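For concreteness, the generated code follows roughly this pattern (a simplified sketch with invented names; the real kernel has far more temporaries, enough to spill to local memory):

```cuda
// Simplified sketch of the generated pattern; the real kernel has
// hundreds of intermediate values, all live at once, which forces
// register spilling to local memory.
__global__ void generatedKernel(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Long straight-line chains of temporaries:
    float t0 = in[i] * 1.0f;
    float t1 = t0 + in[i + 1];
    float t2 = t1 * t0;
    /* ... hundreds more temporaries ... */
    out[i] = t2;
}
```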

Update: I left ptxas running for a couple of hours and I assume it deadlocks somewhere. I can confirm the same behavior with CUDA 3.1 on Ubuntu 9.10 64-bit with gcc 4.2.4.

I could share the kernel if that helps (it is 260 lines of code though).