ptxas segfaults when compiling a large kernel

Hi,

I’ve recently revisited some old code that used to compile and run fine with version 1.0 of the toolkit, but crashes with nvcc version 1.1. Specifically, I get a segmentation fault from the compiler when I attempt to build.

Running nvcc -ptx successfully produces a PTX file, and ptxas then segfaults when attempting to assemble it, so I suspect the problem is in ptxas. The particular kernel I’m trying to compile is quite large, both in memory usage and in number of instructions. Perhaps ptxas is not able to handle it? Of course, I would expect an error message, not a segfault. It shouldn’t be possible to crash the compiler!
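
For reference, the two steps look like this (the file names are just placeholders, since I can’t share the real source):

nvcc -ptx kernel.cu -o kernel.ptx     <- succeeds
ptxas kernel.ptx -o kernel.cubin      <- segfaults here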

Unfortunately I cannot post source code. Has anybody else seen this problem before? I would certainly appreciate a fix!

Thanks,

Brian

On which OS?

Hi,
I’ve only seen this in combination with extra tuning compiler flags like -maxrregcount XX, which limits the number of registers used.

Does the cubin file give any hints?
Compile with nvcc -cubin to get the cubin file and look at the register and shared memory usage.
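
If I remember correctly, the cubin in the 1.x toolkits is a plain text file, so you can open it in an editor and look for entries along these lines (the name and values here are made up for illustration):

code {
	name = _Z8mykernelPf
	lmem = 0
	smem = 44
	reg = 17
}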

I noticed that CUDA 1.0 sometimes used fewer registers than CUDA 1.1 does.

Perhaps your kernel really is too large :fear:

Regards,

Johannes

The error occurs on both Linux (Ubuntu) and Windows XP systems, which are the only ones I’ve tried so far.

I am not using any extra compiler flags, so that’s not it. I tried nvcc -cubin, but that segfaulted as well.

It’s not straightforward at all to split this kernel into smaller pieces, so if there’s another mechanism to resolve this, that would be helpful. And it does still work with version 1.0, which makes me wonder if kernel size is really the issue, although I agree it’s possible.

Yeah, unfortunately, ptxas is the tool that figures out how many registers the kernel requires (and generates the cubin), so the -cubin flag won’t help you track this down.

Can you scan through the PTX output and see what the highest register used is? The nvcc compiler generates PTX code in static single assignment form, so it will look like your PTX code is using hundreds of registers. One of the jobs of ptxas is to figure out how to map these PTX registers to real registers, reusing the real registers as much as possible. I’m wondering if your kernel has tripped over some bug that hits when the number of allocated PTX registers is huge.
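
For example, straight-line code in the PTX typically looks something like this (an illustrative fragment, not from your kernel), with every new value getting a fresh virtual register:

ld.global.f32  $f1, [$r1+0];
ld.global.f32  $f2, [$r1+4];
add.f32        $f3, $f1, $f2;   // written once, never reassigned
mul.f32        $f4, $f3, $f3;   // result goes to yet another register
st.global.f32  [$r1+8], $f4;

ptxas would collapse all of those into just a couple of hardware registers.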

Sure:

16-bit unsigned goes up to $rh3
32-bit unsigned goes up to $r6433
32-bit float goes up to $f4172
64-bit float goes up to $fd156
predicate goes up to $p1257

I’ve done some more poking around in the PTX file and found some odd-looking variable declarations. For example:

.local .align 4 .b8 __cuda___cuda___cuda___cuda___cuda___cuda___cuda___cuda___cuda___cuda___cuda___cuda___cuda___cuda___cuda___cuda___cuda___cuda___cuda___cuda___cuda___cuda___cuda___cuda_result16136136136136136136136136136136136136136136136136136136136136136136136[28];

I’m not sure where they’re coming from, but it seems like the kind of thing that could crash a compiler.

I’ve also had this problem (code works in 1.0 and segfaults ptxas with 1.1). I submitted a bug report and was told the problem is fixed in the internal version of the toolchain, so in theory it should be fixed when 1.2 is released.

Well, gee…I guess that’s good to hear. Do we know when version 1.2 will be out?

The beta was expected at the end of March, so it should be out soon.

I have seen the ___cuda___cuda___cuda___cuda_result names in relation to trig functions. You can see how sinf, etc., is implemented in the toolkit in one of its header files.
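
As a sketch of the pattern (not the actual header code), those are header-implemented __device__ functions, some of which use a small local scratch array, e.g.:

__device__ float poly_eval(float x)
{
    float result[4];    // scratch array; ends up in .local space
    result[0] = 1.0f;
    result[1] = x;
    result[2] = x * x;
    result[3] = x * x * x;
    return result[0] + result[1] + result[2] + result[3];
}

When nvcc inlines such a function, the local array presumably gets a mangled __cuda..._result name in the PTX, and inlining one of these inside another would stack the prefixes, which might explain the declaration you found.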