nvcc/ptxas bug 2.2+

Platform: Windows XP 32bit

CUDA 2.3 Error:

1>### Assertion failure at line 123 of ../../be/cg/NVISA/expand.cxx:

1>### Compiler Error in file blah.cpp3.i during Code_Expansion phase:

1>### unexpected mtype

1>nvopencc ERROR: <snip>\thirdparty/cudatoolkit2.2/bin/win_ia32/open64/lib//be.exe returned non-zero status 1

CUDA 2.2 Error: ptxas simply runs in an infinite loop (I waited 20+ minutes; it got up to about 3.5 GB of memory usage, so I had to kill it).

I managed to trigger this error when I started adding IEEE-compliant paths to all of my mathematical functions (so I could switch between fast math and correct math with a simple macro switch).

The only difference between the working code and the code that breaks nvcc/ptxas is the use of the IEEE floating-point intrinsics (__fmaf_rn, __fsqrt_rn, __fmul_ru, etc.).

Unfortunately I can’t post blah.cpp2.i here (a: I can’t find it, even after --keep’ing the files, and b: if it’s anything like the other .i/.ii/.gpu/etc. files, it contains about 3k lines of proprietary, copyrighted, and patented source code that I’m not authorized to reveal).

Instead I’ve attached the mathematical functions that use the IEEE intrinsics which appear to cause the error. I should note, though, that I haven’t yet been able to reproduce the same error in a smaller/simpler kernel…

Any help would be appreciated…

[s]Edit: After a day’s work on this, I’ve somehow managed to get the same error without using any intrinsics - so it seems it’s not related to the intrinsics at all (I just got unlucky, seeing this error for the first time when I started using the IEEE intrinsics).

After some googling I came across a similar reference to this same error (I probably would’ve found it via the forum search… if it worked), in which the respondents concluded that the error relates to writing an uninitialized register into memory (global/shared). However, after going over my code rather thoroughly - initializing everything at declaration, in addition to the initializations I do later in the code - it hasn’t solved my problem…[/s]

[b]Final Edit: It seems it was indeed an uninitialized variable (one I missed in my initial sweep) - the same problem as this post: http://forums.nvidia.com/index.php?showtop…rt=#entry575590

Again, as suggested in that post, a more relevant error message would have saved a lot of lost time in this case…[/b]
bug.txt (2.21 KB)
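For anyone else hitting this assertion: here’s a minimal sketch of the general pattern that seems to trigger it - a hypothetical reconstruction, not my actual kernel - where a register is only conditionally assigned before being stored to memory:

```cuda
__global__ void bad_kernel(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float tmp;                    // not initialized at declaration
    if (i < n && in[i] > 0.0f)
        tmp = in[i] * 2.0f;       // only assigned on some paths
    out[i] = tmp;                 // store of a possibly-uninitialized register
    // fix: initialize at declaration, e.g. float tmp = 0.0f;
}
```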

Resolved: see original post.

[s]Updated mathematical CUDA code, and attached a probable use case…

If I comment out the use-case function in my kernel (e.g. just return 0), it avoids the assertion/infinite loop even when using the IEEE path - yet with the function present, nvcc asserts/loops on the IEEE path… the function doesn’t really do anything out of the ordinary, so I’m not sure why it’d be causing issues.

It’s possible the compiler is simply optimizing half the kernel away when I comment out that function, thus avoiding the error completely… it’s hard to tell.[/s]
bug_usecase.txt (2.34 KB)
bug_math.txt (2.87 KB)

It seems this infinite-loop bug isn’t specific to 2.2 after all - it’s actually a completely separate issue common to both 2.2 and 2.3. I’ll make a new thread for that issue once I have more details on the specific cause.

My initial thought is that it’s not actually an infinite loop - rather, adding these IEEE intrinsics to my low-level math functions (which are used in hundreds of places throughout my kernel) drastically increases compile time. Either due to a memory leak, or just extreme optimization measures being taken inside ptxas, memory usage keeps climbing (at 3-5 MB/second) with 100% CPU usage…

I’ve confirmed this by adding the IEEE intrinsics to one of my mathematical functions at a time: the first compile took 2 minutes (1 function using the IEEE intrinsics), the second 10 minutes (2 functions), and I’m waiting on the third now (3 functions)…

Ultimately I have at least 6 low-level functions that need to use these intrinsics, plus another 2 ‘large’ functions (100+ lines of pure FP operations) which need them too - but I imagine ptxas will run out of memory before it can even compile the full case…

Update: It seems with only 3 of my math functions using the IEEE intrinsics, the PTX generated from my CUDA code is 1867 KB in size - almost 60,000 lines of PTX… which would explain why ptxas is taking so long… I’d hate to think what kind of analysis/optimization it’s doing on those 60,000 lines of PTX to generate my cubins…

Update: 3 math functions using intrinsics took 21 minutes to compile…

Wow.

Are you using loop unrolling?

__fdiv_rn and __fsqrt_rn are roughly 100 lines of PTX each, which means your 3 functions make 600 calls to such functions? (__fadd_rn and __fmul_rn don’t count, they map to single instructions).

Besides these intrinsics, are there other ones that you use? (__fmaf*?)

I have “#pragma unroll 1” before most of my constant loops, however there is one loop which may be getting unrolled (iterates 3 times, on a constant).

I don’t use anything besides *_rn (and *_ru where there’s no *_rn variant), and the only intrinsics I’m using are the ones I mentioned above.

Overall I should only be making 50 or so calls to my low-level math functions, each of which makes 1-6 intrinsic calls… so assuming an average of 3 per function, that’s roughly 150 intrinsic calls total (excluding any loop unrolling done by the compiler)…

So if my 3-iteration loop is being unrolled, that could potentially be 450 intrinsic calls after loop unrolling - roughly speaking.

I’m curious why __fdiv_rn and __fsqrt_rn aren’t kept as non-inlined function calls… they’re already quite slow and don’t take pointer arguments; I can’t imagine it’d be too detrimental to performance…

Edit: I attempted to force the 3-iteration loop not to unroll (#pragma unroll 1) - no dice… I’m not sure what’s causing nvcc to generate such an insane amount of PTX from these intrinsics…

I should note that the PTX generated without these intrinsics is only 5,800 lines…