Illegal address error when using both GeometryTriangles and Geometry nodes

bdr · June 27, 2022, 6:53pm

I’m running into a “cudaDriver().CuEventSynchronize( m_event ) returned (700): Illegal address, file: , line: 0)” error when trying to trace a scene which contains both GeometryTriangles and Geometry nodes in the scene graph. There’s no problem with only GeometryTriangles or only Geometry nodes. The application works fine with older drivers (recently tested with 472.84), but started failing around version 490 (currently testing with 512.15).
My scene graph has a collection of GeometryInstances which are divided into separate GeometryGroups based on geometry type (so each GeometryGroup contains only GeometryTriangles or Geometry). Each GeometryGroup is the child of a Transform, and the Transforms are all children of a single Group. Above that Group there’s one more Transform and then a Group which is the root of the scene graph.
I’ve tried adding an extra Group parent for each GeometryGroup, and also tried adding a Group to parent each Transform, with no effect.
The optixGeometryTriangles SDK example combines GeometryTriangles and Geometry, and works, but has no Transforms. What are the rules for structuring a scene graph which uses GeometryTriangles, Geometry, and Transforms?
Thanks

Current test configuration:
Windows 10 21H2
dual Quadro RTX 4000 (tested with single GPU also)
512.15 driver
CUDA 10.0
OptiX 6.5*

*I know, I know. I’m working on moving to OptiX 7, but it’s going to take a while and I’m sidetracked fixing this.

droettger · June 28, 2022, 6:49am

My scene graph has a collection of GeometryInstances which are divided into separate GeometryGroups based on geometry type (so each GeometryGroup contains only GeometryTriangles or Geometry). Each GeometryGroup is the child of a Transform, and the Transforms are all children of a single Group.

Your scene setup sounds correct.
The older OptiX API has acceleration structures (AS) at graph nodes with “Group” in the name, meaning that Group and GeometryGroup hold the AS. You cannot mix different geometric primitive types in a single GeometryGroup, so splitting them by type is fine.

Above that Group there’s one more Transform and then a Group which is the root of the scene graph.

What is that used for?
That would only slow down the BVH traversal due to the third AS level and transform.
Whatever that transform does, it could be folded into the Transforms one level deeper or, when it does not scale non-uniformly, applied as its inverse to the camera orientation.
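
As a rough host-side sketch of what folding the top-level transform into the child Transforms means, assuming row-major 4x4 matrices: the folded matrix is the product of the top-level matrix and the child matrix, in that order. `Mat4`, `multiply`, `translation`, `uniformScale`, and `foldTopIntoChild` are illustrative names, not OptiX API, and the code needs C++14 or later for the `constexpr` loops.

```cpp
// Row-major 4x4 matrix. Mat4 and the helpers below are illustrative names,
// not OptiX types.
struct Mat4 { float m[16]; };

// Standard matrix product a * b.
constexpr Mat4 multiply(const Mat4& a, const Mat4& b)
{
    Mat4 r{};
    for (int row = 0; row < 4; ++row) {
        for (int col = 0; col < 4; ++col) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k) {
                sum += a.m[row * 4 + k] * b.m[k * 4 + col];
            }
            r.m[row * 4 + col] = sum;
        }
    }
    return r;
}

constexpr Mat4 translation(float x, float y, float z)
{
    return Mat4{{1.0f, 0.0f, 0.0f, x,
                 0.0f, 1.0f, 0.0f, y,
                 0.0f, 0.0f, 1.0f, z,
                 0.0f, 0.0f, 0.0f, 1.0f}};
}

constexpr Mat4 uniformScale(float s)
{
    return Mat4{{s,    0.0f, 0.0f, 0.0f,
                 0.0f, s,    0.0f, 0.0f,
                 0.0f, 0.0f, s,    0.0f,
                 0.0f, 0.0f, 0.0f, 1.0f}};
}

// The child transform acts on the geometry first and the top-level transform
// second, so the folded object-to-world matrix is top * child.
constexpr Mat4 foldTopIntoChild(const Mat4& top, const Mat4& child)
{
    return multiply(top, child);
}
```

With a top-level uniform scale of 2 and a child translation of (1, 2, 3), the folded matrix carries a translation of (2, 4, 6), which is what the two stacked Transform nodes would have produced.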

Could you try newer display drivers? There are R515 releases out.

If that doesn’t help, we’d need a minimal complete reproducer in failing state to analyze. The smaller the better.

OptiX 6 versions allow an OptiX API Capture (OAC) into a number of files in a folder which we can replay in most cases.
Instructions here: https://forums.developer.nvidia.com/t/optix-prime-performance-issue-using-windows/44606/2

bdr · June 28, 2022, 10:53pm

The issue persists in the 516.25 drivers.
As you point out, the top level transform could be optimized away. But I think I was looking in the wrong place with the scene graph structure. After more troubleshooting, and failed attempts at a minimal reproducer, it appears the issue is related to material usage. Certain combinations of material programs applied to GeometryTriangles/Geometry do work with the mixed scene graph, while others fail. All combinations seem to work when the scene graph is exclusively GeometryTriangles or Geometry. I’m guessing maybe having both types of nodes is throwing off stack sizing or alignment and leading to the invalid address, but I haven’t figured out the pattern.
I wouldn’t be able to send you the output from OptiX API Capture.

bdr · July 8, 2022, 8:13pm

After more experimentation, the issue seems tied to the use of bitfields in an attribute. I have a struct with bitfields as shown below. Everything works fine with older drivers, or if the scene graph is exclusively GeometryTriangles or Geometry. Using new drivers, mixed GeometryTriangles/Geometry, and with PACK_ALL_UINT undefined, I get the illegal address error, apparently related to accessing other attributes. With PACK_ALL_UINT defined, I get this:

Unsupported:   %spec.select53 = select i1 %pred.i.i, i32 %160, i32 16777215
LLVM ERROR: Instruction not yet supported for integer types larger than 64 bits

There’s no aliased access, bit twiddling, or anything much happening in the usage of these bitfields. Just simple assignments and reads/comparisons. I was really hoping I could duplicate it by hacking on an SDK example, but no luck so far.
I think between this and “Error with attribute access using OptiX 6.5 and Linux 510 driver”, I’m just going to stop using attributes with anything other than 32-bit built-ins or CUDA vector types. Once I port to OptiX 7, however, I will need some bit-packing method (struct bitfields, bitwise masks on an unsigned int, something) to fit within the 8x32-bit attribute limit. What approach seems safest going forward?
I have user reports of the LLVM error on Linux with 510.47.03 drivers, possibly without mixed GeometryTriangles/Geometry, but have not been able to duplicate it myself.
Thanks


struct BitfieldTest
{
	unsigned int number24 : 24;
	enum AnEnum
	{
		val0 = 0,
		val1 = 1,
		val2 = 2
	};
	AnEnum enumVal : 2;
#ifdef PACK_ALL_UINT
	unsigned int flag0 : 1;
	unsigned int flag1 : 1;
	unsigned int flag2 : 1;
	unsigned int flag3 : 1;
#else
	bool flag0 : 1;
	bool flag1 : 1;
	bool flag2 : 1;
	bool flag3 : 1;
#endif
};
rtDeclareVariable(BitfieldTest, bitfields, attribute bitfields, );

Test configuration:
Windows 10 21H2
dual Quadro RTX 4000
516.25 driver
CUDA 10.0
OptiX 6.5

dhart · July 8, 2022, 9:25pm

For bit fields, around here we most often use normal C-style masks & bit position defines, sometimes along with macros to combine and extract bits if necessary.

I can imagine reasons bit fields might be problematic in the sense that an OptiX attribute is designed to fit in a single register and avoid memory transactions, while operating on struct members and/or bit fields seems likely to generate code that uses multiple registers (maybe sometimes one for each separate field or member). Manually decoding an attribute via masks is likely to do the same thing unless you’re careful to keep it packed, but at least you’ll have fewer frustrating compiler errors.
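
A minimal sketch of the mask-and-shift style described above, packing the same fields as the BitfieldTest struct posted earlier into one 32-bit word. The bit positions and all names here are assumptions made for illustration, and `constexpr` helper functions stand in for the macros mentioned above. In device code the helpers would additionally need `__device__` qualifiers (or nvcc's relaxed-constexpr option).

```cpp
#include <cstdint>

// Assumed layout (illustrative only):
//   bits  0..23  number24
//   bits 24..25  enumVal
//   bits 26..29  flag0..flag3
constexpr std::uint32_t NUMBER24_SHIFT = 0u;
constexpr std::uint32_t NUMBER24_MASK  = 0x00FFFFFFu;
constexpr std::uint32_t ENUMVAL_SHIFT  = 24u;
constexpr std::uint32_t ENUMVAL_MASK   = 0x3u << ENUMVAL_SHIFT;
constexpr std::uint32_t FLAG0_BIT      = 1u << 26;
constexpr std::uint32_t FLAG1_BIT      = 1u << 27;
constexpr std::uint32_t FLAG2_BIT      = 1u << 28;
constexpr std::uint32_t FLAG3_BIT      = 1u << 29;

// Replace the bits selected by mask with value, leaving all other bits alone.
constexpr std::uint32_t setField(std::uint32_t word, std::uint32_t mask,
                                 std::uint32_t shift, std::uint32_t value)
{
    return (word & ~mask) | ((value << shift) & mask);
}

// Extract the bits selected by mask and move them down to bit 0.
constexpr std::uint32_t getField(std::uint32_t word, std::uint32_t mask,
                                 std::uint32_t shift)
{
    return (word & mask) >> shift;
}

// Single-bit helpers for the boolean flags.
constexpr std::uint32_t setFlag(std::uint32_t word, std::uint32_t bit)   { return word | bit; }
constexpr std::uint32_t clearFlag(std::uint32_t word, std::uint32_t bit) { return word & ~bit; }
constexpr bool          testFlag(std::uint32_t word, std::uint32_t bit)  { return (word & bit) != 0u; }
```

Usage would look like `attr = setField(attr, NUMBER24_MASK, NUMBER24_SHIFT, id);` followed by `if (testFlag(attr, FLAG2_BIT)) { ... }`, keeping the whole attribute in a single `unsigned int`.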

We could perhaps discuss alternatives to using 8 attribute values or more. Do you have options for moving some of that data into your hit shaders, even if it requires recomputing some things, or have you already reduced the attributes to the bare minimum?

–
David.

bdr · July 8, 2022, 9:51pm

With some bit packing (which I’ll change to bitmask macros), and deferring everything I can to the hit programs (including some recomputation), I’m under the 8 register limit with a couple bits left for a rainy day.
I do think it would be good for OptiX 7 to have the option for some extra attribute storage (could just be a single block of data that gets copied appropriately) like old OptiX has. I understand the registers would be strongly preferred, but a custom intersection program may be doing arbitrarily large calculations, and recomputing some of the results can get messy.
Thanks

dhart · July 8, 2022, 10:38pm

I think the team is open to discussing the possibility of more attributes in the future. The questions I get when representing the user perspective are “what’s the use case?”, “can this be done any other way?”, and “is the proposed solution the fastest and/or best practice?” A couple of things to keep in mind are that attributes can take away registers from other places and/or cause spilling. We don’t want to add more attributes and give the impression that they’re freely available only to cause performance degradation.

In a pinch, you can always resort to using the payload or local/global memory, but that comes with the complexity of needing to watch and manage your t values yourself, since intersection tests, even when you hit things, are often executed out of order and may be discarded in favor of closer hits. You’ll also end up paying for extra memory traffic to write and re-write any self-managed attributes during traversal. So this method is slow and not advised, but it is possible if you have no alternatives.
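
A rough host-side sketch of the t-value bookkeeping this implies, under the assumptions that candidates can arrive in any order and that only one thread writes a given ray's slot at a time. `HitRecord`, `recordIfCloser`, and `closestOf` are hypothetical names; in a real program the slot would live in the payload or in memory referenced from it. Requires C++14 or later.

```cpp
// Self-managed "attributes" stored next to the t value they belong to.
// All names here are hypothetical, not OptiX API.
struct HitRecord {
    float        t;            // hit distance the side data belongs to
    unsigned int primitiveId;  // example of data normally passed as an attribute
    float        extra;        // example of additional side data
};

// Slot contents before any hit: a very large t so every real hit is closer.
constexpr HitRecord emptyRecord()
{
    return HitRecord{1.0e30f, 0xFFFFFFFFu, 0.0f};
}

// Overwrite the slot only when the candidate is strictly closer than what is
// already stored. Returns true if the slot was updated.
constexpr bool recordIfCloser(HitRecord& slot, const HitRecord& candidate)
{
    if (candidate.t < slot.t) {
        slot = candidate;
        return true;
    }
    return false;
}

// Process candidates in whatever order they arrive; the slot ends up holding
// the side data of the closest one.
constexpr HitRecord closestOf(const HitRecord* candidates, int count)
{
    HitRecord slot = emptyRecord();
    for (int i = 0; i < count; ++i) {
        recordIfCloser(slot, candidates[i]);
    }
    return slot;
}

// Example: hits reported out of order along one ray.
constexpr HitRecord exampleCandidates[3] = {
    {5.0f, 11u, 0.5f},
    {3.0f, 7u, 0.25f},
    {7.0f, 42u, 0.75f},
};
```

The extra writes in `recordIfCloser` are the memory traffic mentioned above: every accepted candidate rewrites the slot, including ones that are later superseded.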

The reason I asked about the number of values is that it’s tempting and common for people to compute values that seem very accessible to the intersection program. A really simple example would be a surface normal. All the geometry information is right there, and a normal is just 3 attributes, so it seems like it will be easy and fast to just do the cross product (for example) and store the normal, and it can seem like it will probably be faster in intersect than in the hit shader. Except it’s easy to forget (at least for me) that 90% (hand-wavy made-up number) of intersection tests are thrown away, so computing a normal and using attributes to pass it to the hit shader may be wasting a lot of work and eating into your register budget at the same time. It’s common for a decent amount of recomputation in the hit shader to be faster than something innocuous and small in the intersection program, even when it’s messier and more code. So our general recommendation is to use attributes only to save information about the location of the hit (if needed), and the bare minimum of ID info in order to get at the data you need in your hit shader, and then in your hit shader reconstruct everything else: normals and texture coordinates and flags and interpolated geometry values, etc. Maybe you’re already doing that to the maximum extent possible; I just wanted to clarify the context of my question, on the off chance that it might be helpful to think about moving one or more of the attributes over to the shading side.
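
To make the recommendation concrete, here is a small host-side sketch of reconstructing values on the shading side from nothing more than a triangle's vertices and two barycentric coordinates. `Vec3`, `Triangle`, `geometricNormal`, and `interpolate` are illustrative names, and the barycentric weighting (1 - u - v, u, v) is an assumption about the convention in use.

```cpp
// Minimal vector type for the sketch; not a CUDA or OptiX type.
struct Vec3 { float x, y, z; };

constexpr Vec3 sub(Vec3 a, Vec3 b)
{
    return Vec3{a.x - b.x, a.y - b.y, a.z - b.z};
}

constexpr Vec3 cross(Vec3 a, Vec3 b)
{
    return Vec3{a.y * b.z - a.z * b.y,
                a.z * b.x - a.x * b.z,
                a.x * b.y - a.y * b.x};
}

// Vertices fetched in the hit shader using the primitive ID.
struct Triangle { Vec3 v0, v1, v2; };

// Unnormalized geometric normal, recomputed only for the hit that survives.
constexpr Vec3 geometricNormal(const Triangle& tri)
{
    return cross(sub(tri.v1, tri.v0), sub(tri.v2, tri.v0));
}

// Interpolate any per-vertex value (shading normal, texture coordinate, ...)
// from barycentrics (u, v), assuming weights (1 - u - v, u, v).
constexpr Vec3 interpolate(Vec3 a0, Vec3 a1, Vec3 a2, float u, float v)
{
    return Vec3{(1.0f - u - v) * a0.x + u * a1.x + v * a2.x,
                (1.0f - u - v) * a0.y + u * a1.y + v * a2.y,
                (1.0f - u - v) * a0.z + u * a1.z + v * a2.z};
}
```

In this arrangement the intersection side only has to pass the primitive ID and (u, v); the cross product runs once per shaded hit instead of once per candidate intersection.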

–
David.

bdr · July 9, 2022, 7:18pm

I’m certainly in favor of preferring hit shader calculations almost all the time. And for triangles, having an ID and barycentric coordinates goes a long way. But in the grand scheme of weird things some people use OptiX for, there will be cases where some attributes are byproducts of an extensive intersection calculation, and where the chances of running that extensive calculation and then discarding it are low (i.e., low probability of multiple intersections on a single ray which get past bounding box and other early discard tests). Also, sometimes when debugging a bit more storage can be really handy, and performance isn’t much of a concern then.
Using payload (or other memory) seems like a reasonable alternative to me. The use cases I can think of that might really need more attributes tend to already be very heavy on memory access, and a bit more likely won’t make a difference.
I somehow got the impression that the OptiX framework (current or future) had the freedom to parallelize intersection programs along a single ray, thus making payload attributes a potential race condition even with t value checking. I’d be happy to be wrong about that. Can we rely on intersection invocations for a single ray being unordered but non-parallel?

dhart · July 11, 2022, 4:49pm

This is an excellent question. OptiX, and more specifically RTX hardware, does already parallelize intersection tests along a ray. But I don’t think it can currently lead to race conditions; I don’t believe that there will be multiple SM threads for a given ray writing attributes at the same time. It could change in the future, but I think you’re safe to assume for now that t-value checking is enough to make writing to payload or memory safe. You could always use an atomic as well, but I don’t believe it’s currently necessary. I will double-check this assumption with the team and post to correct myself if I’m wrong.

–
David.

bdr · July 14, 2022, 5:20pm

While trying to decide just how paranoid to be in my bit masking operations (the PTX code for simple bitmasking seems dangerously close to the code for using the bitfields), I learned a bit more about how to generate the invalid LLVM select instruction mentioned above.
It appears that conditionally assigning 2^24-1 to the 24-bit unsigned int provides the driver with an invalid optimization opportunity. So something like the following can generate a conditional bitwise OR to simply overwrite the lower 24 bits of the register, while a different integer would typically require an AND to clear the bits and then an OR to insert the value, which prevents the invalid instruction generation.


// Meaningless values, just assign something for test		
bitfields.number24 = 0;
if (uintBuffer != RT_BUFFER_ID_NULL) {
	bitfields.number24 = rtBufferId<unsigned int>(uintBuffer)[primitiveIndex];
}
if (bitfields.flag2) {
	bitfields.number24 = (unsigned int)16777215;
}

The code above generates “or.b32 %r58, %r61, 16777215;” in PTX, and then the invalid LLVM instruction at execution, while a magic number of 16777214 generates “and.b32 %r58, %r62, -16777216; or.b32 %r59, %r58, 16777214;”, uses another register, and works.
All that said, I still can’t duplicate the error with an SDK example, even with a significant chunk of nearly identical PTX around the problem code. So something else is required for the driver to generate that invalid instruction.
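
The OR versus AND-then-OR difference described above can be checked on the host with a short sketch. `insertLow24` and `orOnlyLow24` are made-up names for the two instruction patterns; the point is only that they agree when the inserted value is all ones in the field and generally disagree otherwise.

```cpp
#include <cstdint>

constexpr std::uint32_t LOW24_MASK = 0x00FFFFFFu;  // 16777215

// General field insert: clear the low 24 bits, then OR in the new value.
// ~LOW24_MASK is 0xFF000000, which is -16777216 when read as a signed int.
constexpr std::uint32_t insertLow24(std::uint32_t word, std::uint32_t value)
{
    return (word & ~LOW24_MASK) | (value & LOW24_MASK);
}

// Shortcut that skips the clear. It matches insertLow24 when value sets every
// bit of the field, because OR-ing with all ones overwrites whatever was there.
constexpr std::uint32_t orOnlyLow24(std::uint32_t word, std::uint32_t value)
{
    return word | (value & LOW24_MASK);
}
```

For 16777215 the two functions return the same word, which is why a compiler may drop the AND; for 16777214 they differ whenever the old field had bit 0 set, so the AND has to stay.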

dhart · July 18, 2022, 10:45pm

Is the PTX that’s getting miscompiled something you can share here, or privately via DM?

–
David.

bdr · July 19, 2022, 3:36pm

Unfortunately no, I can’t.

dhart · July 19, 2022, 4:19pm

Okay, no worries. We’ll see if we can find a reproducer based on your findings so far. If you do happen to repro in an SDK sample, let us know.

–
David.

dhart · July 19, 2022, 8:54pm

BTW, since you mentioned potential concern about doing manual bit masking - have you tried that already and does it work around the issue? I’d be very surprised if packing a bitfield manually caused any of the errors reported so far. We do a lot of manual bit masking & packing, so it is well tested compared to using bit field structs. If you convert the code into manually masked uints and still have any compiler trouble, please let me know and we’ll try to reproduce that way as well.

–
David.

bdr · July 20, 2022, 4:24pm

Manual bit packing seems safe so far. I’ll let you know if that changes or I get anything reproducible in SDK code.

bdr · July 21, 2022, 2:48am

It seems I spoke too soon. While changing some single bit fields to bit masking, I ran into a new type of error.
Changing the attribute declaration to unsigned int and bit masking, one of my any hit programs becomes the following:

rtDeclareVariable( unsigned int, bitfields, attribute bitfields, );
RT_PROGRAM void anyHitConditionalSkip()
{
  if ( bitfields & 0x8000000 ) {
    rtIgnoreIntersection();
  }
}

During program initialization, createProgramFromPTXFile produces this error:
Compile Error: Unexpected constant in analysis of variable usage while translating (_attribute_bitfields) at: [ i32 add (i32 ptrtoint (i32 addrspace(1)* @_attribute_bitfields to i32), i32 3) ]

The PTX for this program is below. Comparing it with the PTX for the previous bitfield struct version, they’re identical except that the struct version uses “[bitfields+8]” (My workaround for the invalid LLVM instruction issue increased the struct size).

	// .globl	_Z21anyHitConditionalSkipv
.visible .entry _Z21anyHitConditionalSkipv(

)
{
	.reg .pred 	%p<2>;
	.reg .b16 	%rs<3>;


	ld.global.u8 	%rs1, [bitfields+3];
	and.b16  	%rs2, %rs1, 8;
	setp.eq.s16	%p1, %rs2, 0;
	@%p1 bra 	BB2_2;

	// inline asm
	call _rt_ignore_intersection, ();
	// inline asm

BB2_2:
	ret;
}
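
For readers comparing the source with the PTX above: on a little-endian target, bit 27 of a 32-bit word is bit 3 of the byte at offset 3, so testing `bitfields & 0x8000000` and testing `byte[3] & 8` are the same condition. A host-side sketch of that equivalence, with `byteOf`, `flagViaWord`, and `flagViaByte` as made-up names and the byte extraction modelled arithmetically rather than through memory:

```cpp
#include <cstdint>

// Byte at the given offset of a 32-bit word stored little-endian
// (offset 0 is the least significant byte).
constexpr std::uint8_t byteOf(std::uint32_t word, unsigned int byteIndex)
{
    return static_cast<std::uint8_t>(word >> (8u * byteIndex));
}

// The test as written in the source: full 32-bit word, mask for bit 27.
constexpr bool flagViaWord(std::uint32_t word)
{
    return (word & 0x8000000u) != 0u;
}

// The test as it appears in the PTX: one byte at offset 3, mask 8.
constexpr bool flagViaByte(std::uint32_t word)
{
    return (byteOf(word, 3u) & 8u) != 0u;
}
```

This only illustrates why the narrowed single-byte load is arithmetically equivalent; it says nothing about why the OptiX compiler rejects the resulting `bitfields+3` address.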

Test configuration:
Windows 10 21H2
dual Quadro RTX 4000
516.59 driver
CUDA 10.0
OptiX 6.5

bdr · July 21, 2022, 3:33am

Changing the conditional to
if ( ( bitfields & 0x8000001 ) > 2 ) {
produces a ld.global.u32 and runs.
Should I use that kind of bitmasking (for booleans), or is there some preferred method?
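
A quick host-side check that the modified conditional is still the same test: after masking with 0x8000001 only the values 0, 1, 0x8000000, and 0x8000001 are possible, and only the last two (bit 27 set) are greater than 2. Presumably the compiler keeps the 32-bit load because the mask now touches both byte 0 and byte 3, though that is an inference, not something the thread confirms. `plainTest` and `widenedTest` are made-up names.

```cpp
#include <cstdint>

// Straightforward test of bit 27.
constexpr bool plainTest(std::uint32_t word)
{
    return (word & 0x8000000u) != 0u;
}

// The modified form: the mask also includes bit 0, and the comparison with 2
// discards the case where only bit 0 is set.
constexpr bool widenedTest(std::uint32_t word)
{
    return (word & 0x8000001u) > 2u;
}
```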

dhart · July 21, 2022, 3:41pm

No, you really shouldn’t have to do anything tricky like that. We never do anything but straightforward bit masking to extract only the specific bits that should be involved. I don’t think I understand the error message exactly, so I’ve asked the compiler team for help. It’s suspicious that it’s treating the address of bitfields as a signed int, but I’m guessing the actual error message is referring to the alignment, that it’s the 3 it doesn’t like. I don’t understand why it’s trying to load a single byte. Will respond again with comments or questions from the team.

–
David.

bdr · August 15, 2022, 2:14pm

Any news on the bitmasking and alignment checking?
Thanks

dhart · August 15, 2022, 2:23pm

Hi @bdr,

Yes. :) There is a legitimate compiler bug here that has been confirmed by our team and the proper fix is being worked on. I don’t have an ETA yet, but this potential workaround was suggested by the team:

struct BF { __align__(4) unsigned char d[4]; };
rtDeclareVariable( BF, bitfields2, attribute bitfields2, );
RT_PROGRAM void any_hit_shadow()
{
    unsigned int bitfields = {bitfields2.d[0] | (bitfields2.d[1]<<8) | (bitfields2.d[2]<<16) | (bitfields2.d[3]<<24)};
    if( bitfields & 0x8000000 )
    {
        rtIgnoreIntersection();
    }
}
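
A host-side sketch of the byte reassembly used in the workaround above, handy for checking that the rebuilt word matches the original packing. This is not the team's code: `alignas(4)` stands in for `__align__(4)`, the explicit casts to an unsigned 32-bit type are an addition to keep the shifts unsigned, and `assemble` and `shouldIgnore` are made-up names.

```cpp
#include <cstdint>

// Four individually addressable bytes with 4-byte alignment, mirroring the
// struct in the workaround.
struct BF { alignas(4) unsigned char d[4]; };

// Rebuild the 32-bit word from its bytes, least significant byte first.
constexpr std::uint32_t assemble(const BF& b)
{
    return  static_cast<std::uint32_t>(b.d[0])
         | (static_cast<std::uint32_t>(b.d[1]) << 8)
         | (static_cast<std::uint32_t>(b.d[2]) << 16)
         | (static_cast<std::uint32_t>(b.d[3]) << 24);
}

// Same condition as the any-hit program: bit 27 of the reassembled word.
constexpr bool shouldIgnore(const BF& b)
{
    return (assemble(b) & 0x8000000u) != 0u;
}
```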

–
David.