OptiX 6.0.0 is broken on driver 591.44

Hi @dhart, can you send me your GitHub account name? I’ll share the repository with you (or I can send it in some other way).

Update: today I also received a report of a consistent “an illegal instruction was encountered” on a 5080 (also 5070Ti now) while running the same cursed raygen, even the CUDA 12 version (driver 591.59; same thing also on 581.80). I got the exact input data and couldn’t reproduce it on my 4070Ti and 3070. Other raygens run fine on all cards. compute-sanitizer doesn’t catch anything on mine.

Update 2: unrolling the loop avoids the issue on all cards and CUDA compilers (but I can’t afford to ship a gigabyte of fully unrolled PTX variants).

Account name shared via DM. Hopefully we can take a look, but please be aware that most of our team is now out of the office for the holidays. I apologize in advance if getting this resolved is slow.

Yeah I understand not being able to unroll everything. And the trick of changing local variables into direct references also isn’t working everywhere? There might be other simple but unintuitive code munging band-aids you can use temporarily.

Have you tried switching between OptiX-IR and PTX? It’s possible that might make a difference.

I’m trying to think of other things you might be able to do that would help, but I don’t want to send you on a wild goose chase of course…

Make sure to turn on OptiX validation mode occasionally. (And make sure to ship with it off.)

Try using the env var CUDA_LAUNCH_BLOCKING to rule out synchronization issues.

There’s a small chance that CUDA 11 or 13 could help, but I hesitate to suggest that since the OptiX side of the compiler is in the driver and it seems like our fears that this isn’t necessarily a CUDA problem might be coming true.


David.

Thanks! I’ve just sent you the project link.

It looks like whatever I do, even if it fixes any visual errors on 10/20/30/40-series, still causes the “illegal instruction” crash on 50-series, no matter the input data or anything. Except for the loop unrolling. Kinda out of ideas. Tried replacing optixTrace with optixTraverse and OPTIX_RAY_FLAG_DISABLE_ANYHIT | OPTIX_RAY_FLAG_DISABLE_CLOSESTHIT | OPTIX_RAY_FLAG_TERMINATE_ON_FIRST_HIT but 5080 still hits the illegal instruction. Will try OptiX-IR now.

Validation mode on a 5080 throws this (and nothing on older GPUs):

[2][ERROR]: Error syncing stream (CUDA error string: an illegal instruction was encountered, CUDA error code: 715)
Error recording resource event on user stream (CUDA error string: an illegal instruction was encountered, CUDA error code: 715)
Error recording resource event on user stream (CUDA error string: an illegal instruction was encountered, CUDA error code: 715)
Error launching work to RTX

UPDATE: Compiled to OptiX-IR, and it works fine on my 4070Ti. However, on a 5080 it now fails when trying to create the module:


[2][COMPILER]: COMPILE ERROR: Module compilation failed
Info: Module Statistics
payload values        :          1
attribute values      :          0
Info: Properties for entry function "__raygen__oxMain"
semantic type                :                 RAYGEN
trace call(s)                :                      3
continuation callable call(s):                      0
basic block(s)               :                    151
instruction(s)               :                   1766
Info: Compiled Module Summary
non-entry function(s):     0
basic block(s)       :     0
instruction(s)       :     0

I guess it’s the same “illegal instruction”, just caught earlier.

Compiling OptiX-IR with --use_fast_math -Wno-deprecated-gpu-targets again causes “an illegal instruction was encountered”.


Update 2: OK, I think I found the culprit. So, in OptiX6 there was rtTextureSamplerGetId which returned an int. This int, when printed out, also looked rather small, like a context-specific texture number (0, 1, 2, 3…). Because the numbers are usually in a small range I did an evil thing of casting it to a float in my light structure just so I can reuse the same value for some light types where it actually needs to be a float and cast it back to int when it should be a texture ID… I should’ve at least reinterpreted it, but I didn’t because I was sure it’s not a problem in such value range.

After moving to OptiX9 I replaced most of these ints with cudaTextureObject_t, which is 64-bit. Except for this structure because it was so nice and tight, and funnily enough texture values returned from cudaCreateTextureObject were also the same tiny numbers! I figured GetId was returning the same thing and there is maybe, probably, possibly no good reason for it to really be 64-bit…

And apparently my “maybe” somehow held up on everything until 50-series. Is the driver silently replacing CPU cudaTextureObject_t values with real pointers now? Or is the float-to-u64 conversion different now? I don’t know, and in fact, just replacing the float with an int stopped 5080 crashes. But I’m adding a proper cudaTextureObject_t to the structure now, not taking chances anymore.
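
For anyone landing here with the same pattern, here is a minimal host-side sketch of why a value cast through float is only safe for small handles. It does not explain what changed on 50-series; it only shows the value-range hazard. `TextureHandle` is a stand-in for `cudaTextureObject_t` (a 64-bit unsigned integer in CUDA) so the snippet compiles without the CUDA toolkit, and `LightParam` is a hypothetical struct, not the one from the project.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: stand-in for cudaTextureObject_t, which CUDA defines as a
// 64-bit unsigned integer. Used here so the sketch builds without CUDA.
using TextureHandle = unsigned long long;

// The risky pattern: store the handle as a float *value* and cast it back.
// A float has a 24-bit significand, so any handle above 2^24 may be rounded.
inline TextureHandle roundTripThroughFloat(TextureHandle h) {
    float f = static_cast<float>(h);       // lossy for large values
    return static_cast<TextureHandle>(f);  // may differ from h
}

// Hypothetical light record that keeps the handle in its own 64-bit field
// instead of overloading the float member.
struct LightParam {
    float         value;    // used by light types that need a real float
    TextureHandle texture;  // used by light types that sample a texture
};

// Small self-check: tiny IDs survive the round trip, larger ones do not.
inline void selfCheck() {
    assert(roundTripThroughFloat(3ULL) == 3ULL);
    const TextureHandle big = (1ULL << 24) + 1ULL;
    assert(roundTripThroughFloat(big) != big);
}
```

The dedicated field costs a few bytes per light, but the handle is never converted, so it stays valid whatever values the runtime hands back.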

It still puzzles me that it worked on a 5080 if I either removed all optixTrace calls or unrolled the loop.

Oh wow, that’s a bit wild. I don’t know what changed about cudaTextureObject_t, or float to u64, but I’ll ask around and see if anyone knows. I’m super glad you’re unblocked though, thank you for the update!

Puzzling indeed. It’s possible that there was a register stomp that just got moved somewhere innocuous, but I’m speculating wildly.


David.
