OptiX 7.5 memory access problem

Hello all,
I am trying to develop an OptiX-based ray tracing electromagnetic solver capable of tracing both reflected and transmitted rays through a mesh of triangular primitives.
As a secondary diagnostic output, I would like to obtain the binary tree of each ray’s interactions with the primitives, but I am facing a memory problem that I cannot solve and for which I hope you can help me.
Currently my pipeline contains only three simple shader programs (raygen, closest-hit and miss), and I have implemented an iterative algorithm for traversing the ray binary tree, so optixTrace is called only from the raygen program, three times.
Below are the launch parameters into which I write the ray tree and which I want to bring back to the host. I am debugging by launching a single direction (a launch grid with a single thread) and with a very small tree dimension (dimTree = 16):
struct RayTree
{
    int primIdx[16];                    // dimTree = 16 for debug
};

struct Params
{
    unsigned int           width;
    unsigned int           height;
    int                    max_depth;
    int                    reflection_maxdepth;
    int                    transmission_maxdepth;

    float                  scene_epsilon;
    RayTree*               rayTree;
    OptixTraversableHandle handle;

    float3                 cam_eye;
    float3                 cam_u, cam_v, cam_w;
};
I’m observing the following behaviour:
• If I print rayTree[0].primIdx[i] before the optixTrace call, the values are correct (initialized to -1);
• Prints during the ray tree traversal look correct and show the expected primitive IDs;
• Prints just before the __raygen__ program returns show corrupted values;
• With a single call to optixTrace the array is not corrupted, even though the algorithm is then incorrect (as soon as I add a second call to optixTrace, the array is corrupted);
• If I change the rayTree member’s location within the Params structure, I often get illegal memory access crashes when calling optixLaunch.

This is my development environment: Windows 10, Visual Studio 2019, CUDA Toolkit 11.7, OptiX 7.5, GPU Quadro P1000 (compute capability 6.1), driver 517.37.
I suspect there is some problem with memory alignment, but I can’t fix it.
Thank you for any suggestions.

Hi @l.pandolfo,

It’s a bit hard to tell what’s going on here. Is rayTree pointing to a buffer containing an array of RayTree instances? How are you indexing into this buffer? Is the buffer size width * height * sizeof(RayTree)? And what exactly do you mean by changing the rayTree member’s location within the Params structure?

All signs currently point toward your indexing into this buffer not working the way you expect: threads are not writing into their own slots but into memory reserved for other threads, and it sounds like some threads are writing out of bounds.

Here are a couple of thoughts that might help you isolate and/or solve the issue:

Instead of an array of structs, you could make a flat int array and do your indexing completely manually. This will allow you to print, debug, and/or bounds-check your index. For example, you can include the buffer size in your launch params and use it to print an error message when the index is too large for the buffer.
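Here’s a rough sketch of what I mean. The dimTree and rayTreeSize members of the launch params are hypothetical additions, and I’m assuming the usual extern "C" __constant__ Params params; declaration in the device code:

extern "C" __global__ void __raygen__rg()
{
    const uint3 idx = optixGetLaunchIndex();
    const uint3 dim = optixGetLaunchDimensions();
    const unsigned int linear_idx = idx.y * dim.x + idx.x;

    // Each thread owns a contiguous block of dimTree ints in the flat buffer.
    const unsigned int base = linear_idx * params.dimTree;

    for( unsigned int i = 0; i < params.dimTree; ++i )
    {
        const unsigned int slot = base + i;
        if( slot >= params.rayTreeSize )   // buffer size carried in the launch params
        {
            printf( "Out-of-bounds write: thread %u, slot %u >= %u\n", linear_idx, slot, params.rayTreeSize );
            return;
        }
        params.rayTree[slot] = -1;         // rayTree is a plain int* in this variant, not RayTree*
    }
    // ... all writes during the tree traversal go through the same bounds-checked pattern ...
}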

It may be worth dropping your resolution to 1x1, and then increasing it one pixel at a time until it’s clearer exactly which code and/or threads are misbehaving.

You could allocate extra ints in your RayTree struct to contain the writer’s thread ID, and interleave them between each primIdx. This way you can see who is overwriting your data.
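For example, something like this (just a sketch, the exact layout is up to you):

// Debug-only node layout: each primIdx slot is paired with the launch index of
// the thread that last wrote it, so a corrupted entry reveals its writer.
struct RayTreeDebug
{
    struct Node
    {
        int primIdx;   // initialized to -1
        int writer;    // initialized to -1, set to the writer's linear launch index
    } nodes[16];
};

// When writing a node in device code:
//   rayTree[linear_idx].nodes[n].primIdx = hitPrimIdx;
//   rayTree[linear_idx].nodes[n].writer  = (int)linear_idx;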

It might be worth bisecting further down to which line of code exactly appears to cause the corruption. There is presumably some distance in between the traversal and the end of raygen?

You might try rearranging the order of your memory allocations just in case the corruption in one buffer is coming from the writer of a different buffer.

–
David.

Hi David,

Thanks for your prompt response and the many suggestions you provided. I had already reduced my resolution to 1x1, i.e. launching a single thread, and I tried to follow some of your suggestions, but I couldn’t solve the issue. I agree that it is not easy to understand what is going wrong in my code from my description alone. I will try to build a simplified code in which the problem persists and submit it for your review. Do you think you can give me that kind of support? If so, let me know how I can send you my project.

Thank you again,

Luca

Yes, if you prepare a reproducer we can take a look. Having a working code sample is the ideal scenario in case this is a legitimate bug somewhere. You can post a link to the hosted project here if you’re comfortable with that, or DM me, or send a link or the zipped project (under 10 MB) to optix-help, but rename the .zip to something else, otherwise our spam filter will eat it.

–
David.

Hi David,
I have prepared a reproducer that I am trying to send you. Let me know if you can download it and open the project; I renamed the .zip file to .prj. Let me know if you need any clarification on the code.
SBRPO_GPU_simp.prj (6.0 MB)
Thank you,
Luca

I got the project and had to make a few tweaks, but it’s running for me. It was crashing with an illegal access until I set the env var OPTIX_FORCE_DEPRECATED_LAUNCHER=1, and after that I see the following output. Let me know if this looks correct to you, i.e. whether I’m reproducing the problem or failing to reproduce it. I’m not sure what I should expect to see when it’s working without issues. If this is reproducing the problem, maybe give me a very brief introduction to which code exactly you suspect is going bad.

RayGen before optixTrace launch: rayTree 0 -1
RayGen before optixTrace launch: rayTree 1 -1
RayGen before optixTrace launch: rayTree 2 -1
RayGen before optixTrace launch: rayTree 3 -1
RayGen before optixTrace launch: rayTree 4 -1
RayGen before optixTrace launch: rayTree 5 -1
RayGen before optixTrace launch: rayTree 6 -1
RayGen before optixTrace launch: rayTree 7 -1
RayGen before optixTrace launch: rayTree 8 -1
RayGen before optixTrace launch: rayTree 9 -1
RayGen before optixTrace launch: rayTree 10 -1
RayGen before optixTrace launch: rayTree 11 -1
RayGen before optixTrace launch: rayTree 12 -1
RayGen before optixTrace launch: rayTree 13 -1
RayGen before optixTrace launch: rayTree 14 -1
RayGen before optixTrace launch: rayTree 15 -1
RayGen: first launch optixTrace, linear_idx 0
RayGen  after optixTrace launch: rootNode, primIdx = 0 1372
RayGen reflection: currNode, depth = 1 1
RayGen reflection: prevNode = 0
RayGen Reflection: currNode, primIdx = 1 1587
depth, done = 1 0
RayGen Transmission: currNode, depth = 2 1
RayGen Transmission: prevNode = 0
Miss: currNode = 2
RayGen Transmission: currNode, primIdx = 2 1587
depth, done = 1 1
depth, done = -1 0
RayGen before return:  rayTree 0  1372
RayGen before return:  rayTree 1  1587
RayGen before return:  rayTree 2  1587
RayGen before return:  rayTree 3  -1
RayGen before return:  rayTree 4  -1
RayGen before return:  rayTree 5  -1
RayGen before return:  rayTree 6  -1
RayGen before return:  rayTree 7  -1
RayGen before return:  rayTree 8  -1
RayGen before return:  rayTree 9  -1
RayGen before return:  rayTree 10  -1
RayGen before return:  rayTree 11  -1
RayGen before return:  rayTree 12  -1
RayGen before return:  rayTree 13  -1
RayGen before return:  rayTree 14  -1
RayGen before return:  rayTree 15  -1
RayGen: RETURN

Hi David,

I confirm that you got the correct output, but I still can’t get it even after setting the environment variable OPTIX_FORCE_DEPRECATED_LAUNCHER=1 for my local user (I’m working without administrator privileges). You surely observed behaviour a little different from mine, because you got a crash while my code, in the version I sent you, ran but produced a wrong result. However, I have often observed crashes with illegal accesses when changing the order of members in the Params structure, so I think the problems are related and you are reproducing my issue.

Did you change anything else in the code? If you have no other suggestions, could you send me back the project you modified so that I can look for the differences? Finally, could you please explain a bit more about this environment variable OPTIX_FORCE_DEPRECATED_LAUNCHER? From the name, it seems to activate a working mode that you consider deprecated. Do I need to set it also for the release version, or just for debugging?

Thank you again for your support.

Greetings,

Luca

Sure, here’s the exact state of the project I ran. I didn’t change anything in the code at all. SBRPO_GPU_simp.zip (4.7 MB)

OPTIX_FORCE_DEPRECATED_LAUNCHER=1 is currently generally required to get limited debugging and printf() calls to work with OptiX. This will only be necessary temporarily, and we hope the need for it will go away by the next OptiX release. Our internal launch infrastructure is undergoing some changes that the CUDA printf system and our Nsight tools are still being adapted for.

There is a possibility that the crashes I’m seeing are due to printf() itself being called when not using OPTIX_FORCE_DEPRECATED_LAUNCHER=1. I can’t think of any better suggestions at the moment, but I will think about it and ask around and maybe early next week we can try again. One way to test or get around the printf issue is to make your own version of device printf. You could allocate a scratch buffer for things you’d like to print, and copy values into that buffer in your device code, then after the launch (assuming it doesn’t crash) transfer and print the contents of the buffer using host-side CPU code.
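Something along these lines, for example. Here debugBuf is a hypothetical extra launch parameter pointing to a cudaMalloc’ed int buffer of debugBufSize entries:

#include <cuda_runtime.h>
#include <iostream>
#include <vector>

// Device side (sketch): instead of printf, record the value you want to inspect, e.g.
//   params.debugBuf[linear_idx] = currNodePrimIdx;

// Host side, after the launch and a stream synchronize:
void dumpDebugBuffer( const int* d_debugBuf, size_t count )
{
    std::vector<int> host( count );
    cudaMemcpy( host.data(), d_debugBuf, count * sizeof( int ), cudaMemcpyDeviceToHost );
    for( size_t i = 0; i < count; ++i )
        std::cout << "debugBuf[" << i << "] = " << host[i] << "\n";
}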

–
David.

Hi David, the version you sent me continues to behave in the same wrong way. However, I can confirm that the problem is the printf() calls, even when using OPTIX_FORCE_DEPRECATED_LAUNCHER=1. If I remove them all, I get the correct output by printing the contents on the host side as you suggested. Now I’m worried about debugging a more complex code, because the debugger is not working properly (especially where optixTrace is called) and printf() is the only way. Do you have any suggestions on how to debug a complex code?

Thank you for your support.
Luca

This is a good question; it would be a bummer to not be able to trust printf(). Doing the host-side printing I mentioned above is a good way to circumvent printf() if you suspect it’s causing a problem. The printf() function uses some device stack space, so it might not be introducing bugs in your code; it might simply be changing the way a latent bug manifests. One thing you might want to try is upgrading to OptiX 7.7, because the stack space calculation has introduced the option to pass in your pipeline when calculating stack sizes, and it may be able to do a better job than what you have right now. To be clear, I’m suggesting that the problem here could be a lack of sufficient stack space, which could cause the corruption, and the use of printf might be pushing the actual stack usage over the edge into corruption/crash territory. One thing you could easily experiment with right now, without upgrading your OptiX SDK, is increasing your stack allocation manually: just turn it up and see if the problems suddenly disappear.
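For instance, something like this (just a sketch: the numbers are deliberately generous rather than tuned, and OPTIX_CHECK stands in for your usual error handling):

OPTIX_CHECK( optixPipelineSetStackSize(
    pipeline,
    4 * 1024,    // directCallableStackSizeFromTraversal
    4 * 1024,    // directCallableStackSizeFromState
    16 * 1024,   // continuationStackSize
    2 ) );       // maxTraversableGraphDepth (1 for a single GAS, 2 for an IAS over a GAS)

If the corruption goes away with oversized values, you can then compute proper sizes with the helpers in optix_stack_size.h.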

Be aware that the issues with the debugger, the force-deprecated-launcher flag, and printf are temporary; we are working to get the debugging experience improved ASAP.

–
David.


Hi David,

In the next few days I will try both increasing the stack and upgrading to OptiX 7.7, but I think this simplified version of the code, which implements an iterative (non-recursive) approach to ray tracing, should not be critical in terms of stack memory consumption (maxTraceDepth = 1). What’s your opinion?

Anyway, I’ll let you know.

Thank you so much for your support.

Luca

If I understand the question: it’s important to remember that stack size issues can strike regardless of the size and complexity of an application, so it’s always critical to make sure you have allocated enough stack space. That said, it should be easy to allocate the right amount and hard to get it wrong, and your code looks correct to me in terms of stack allocation. The problem here could, of course, be unrelated to stack size; I just thought it might fit your symptoms. You’re only using 1 call frame, so if you turn up the max trace depth to 2 or 10 and still see the bug, then my stack suggestion is likely a red herring. (Or you can hijack the CC or DC stack values to have more control over how many extra bytes you allocate.) I’d suggest trying this quickly before bothering with the upgrade to OptiX 7.7: if increasing the stack size manually doesn’t fix it, the upgrade also might not fix it. Have you tried the latest driver version?

–
David.

Hi David,

Here I am again with an update and some further questions.

I verified that the problem was not related to the stack size; increasing the stack size manually did not solve it. However, for a number of reasons, I changed my development environment, which is now as follows: Windows Server 2022 Standard, Visual Studio 2019, CUDA Toolkit 12.1, driver 531.14, OptiX 7.7, GPU NVIDIA A40 (compute capability 8.6). In this new configuration the code runs as it does for you, and, after setting the environment variable OPTIX_FORCE_DEPRECATED_LAUNCHER=1, printf() seems to work correctly.

At this point I can go ahead with the design of my algorithm, and I would like to ask you for some further advice.

Unfortunately, the algorithm I have to implement is not completely parallelizable over rays, because the core of the electromagnetic computation (by far the most expensive part) is the calculation of the contribution of the field scattered by all hit points onto all observation points. This phase of the calculation would therefore be parallelized over observation points, which makes me fear I will have to keep all hit points, and all other inputs needed for the field computation, in device memory. To avoid storing the hit points, I wonder if it would be possible to define an output buffer sized as the number of observation points multiplied by the maximum number of active threads on the GPU (Nobs x NmaxActiveThread). I emphasize active threads because they are far fewer than the rays launched (several orders of magnitude fewer), and this way I can avoid data races when summing the contributions of the different rays.

Do you think this (or something similar that doesn’t require storing all rays) is feasible, or do I have to break the calculation into several steps? The main steps could be:

  1. Ray tracing part parallelized using OptiX with storage of rays on the device memory;

  2. Device to host rays copy;

  3. Host to device rays copy;

  4. CUDA calculation of the electromagnetic field by parallelization on observation points.

Thank you for your support.

Luca

Why would you need steps 2 and 3?
If that is running on the same GPU device, the data is already on the device, and you have complete control over OptiX’s resource management with the CUDA host API calls inside your application. You could just use the same CUDA device pointers in native CUDA kernels. You would only need to store the final results somehow.

optixLaunch and native CUDA kernels are asynchronous and the launch overhead is comparatively small.
If your working set is too big to fit into VRAM at once, you can break it into digestible chunks of work, sized so that they still saturate the underlying GPU.
You can check with Nsight Systems and Nsight Compute how your application and kernels perform.

Hi Roettger,

thank you for your reply.

I can definitely follow your suggestion, because both phases of the calculation run on the same GPU, so I can eliminate steps 2 and 3, which would greatly degrade performance.

My working set is certainly too big to fit into VRAM at once, but I can break it into digestible chunks and hope to still saturate the underlying GPU.

So the high-level scheme of my algorithm could be as follows:

allocate output buffer EMfields[Nobs] in device memory

loop over ray chunks that fit into VRAM
{
    optixLaunch(chunk)           // saving hit points in device memory
    synchronize
    Kernel_ComputeEMfields()     // native CUDA kernel, parallel over observation points
    synchronize
}

copy output buffer EMfields from device to host

Did I understand what you meant?

Can you point me to some sample code that shows the interoperability of a native CUDA kernel with OptiX?

Thank you so much for your support.

Luca

If the optixLaunch and the native CUDA kernels run on the same CUDA stream, there is no need to synchronize between them. CUDA will automatically execute the kernels in the order they have been submitted to the stream.
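Sketched with the names from your pseudocode above (d_params, d_hitPoints, grid/block sizes and the chunk bookkeeping are placeholders):

for( int chunk = 0; chunk < numChunks; ++chunk )
{
    // Trace this chunk's rays; the hit points stay in device memory.
    optixLaunch( pipeline, stream, d_params, sizeof( Params ), &sbt,
                 chunkWidth, chunkHeight, 1 );

    // Queued on the same stream, so it only starts after the launch has finished.
    Kernel_ComputeEMfields<<<grid, block, 0, stream>>>( d_hitPoints, d_EMfields, Nobs );
}

// One synchronization at the end, before copying EMfields back to the host.
cudaStreamSynchronize( stream );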

The crucial part missing from your loop is how the actual data transport between host and device happens for the rays and the results. That’s where potential synchronizations are either necessary or can be avoided.

Note that there are synchronous and asynchronous CUDA memcpy functions, but a lot of care has to be taken when using asynchronous memcpy operations to ensure the correct data is inside the host memory locations at the time the asynchronous memcpy actually happens!

If all host memory pointers (source data and result data) are in disjoint memory locations, the whole loop could run using asynchronous calls, and you would only need to synchronize once at the end of the algorithm to make sure the result data has finished copying.

I would recommend implementing the data transport with synchronous CUDA memcpy calls first to get the algorithm working.
Afterwards, analyze the performance with Nsight Systems and determine whether you can switch to asynchronous memcpy operations correctly, then analyze the performance again.
Once that is all working, use Nsight Compute to analyze your OptiX and native CUDA kernels and see whether they can also be optimized.

Can you point me to some sample code that shows the interoperability of a native CUDA kernel with OptiX?

The OptiX SDK example optixRaycasting uses OptiX only to do ray-triangle intersections.
Ray generation and shading happen in native CUDA kernels.

Hi Roettger,
Thank you very much for your valuable guidelines.

Unfortunately, I have to come back and ask for clarification about the use of the environment variable OPTIX_FORCE_DEPRECATED_LAUNCHER. You previously explained that OPTIX_FORCE_DEPRECATED_LAUNCHER=1 is currently generally required to get limited debugging and printf() calls to work with OptiX. Now I have removed all printf() calls from my shader programs and tried to run both the debug and the release version with OPTIX_FORCE_DEPRECATED_LAUNCHER=0, but I get an illegal memory access even in the release version (if I set OPTIX_FORCE_DEPRECATED_LAUNCHER=1, the release version also works properly). This can also be reproduced with the simplified code I sent you, if compiled in Release. The errors I get are the following in the two cases:

Debug version:
[ 2][ ERROR]: Error recording event to prevent concurrent launches on the same OptixPipeline (CUDA error string: an illegal memory access was encountered, CUDA error code: 700)
Error recording resource event on user stream (CUDA error string: an illegal memory access was encountered, CUDA error code: 700)
CUDAOutputBuffer destructor caught exception: CUDA call (cudaFree( reinterpret_cast<void*>( m_device_pixels ) ) ) failed with error: ‘an illegal memory access was encountered’ (C:\ProgramData\NVIDIA Corporation\OptiX SDK 7.7.0\SDK\sutil\CUDAOutputBuffer.h:139)

Release version:
CUDAOutputBuffer destructor caught exception: CUDA call (cudaFree( reinterpret_cast<void*>( m_device_pixels ) ) ) failed with error: ‘an illegal memory access was encountered’ (C:\ProgramData\NVIDIA Corporation\OptiX SDK 7.7.0\SDK\sutil\CUDAOutputBuffer.h:139)
Caught exception: CUDA error on synchronize with error ‘an illegal memory access was encountered’ (C:\Users\l.pandolfo\MyProjects\SBRPO_GPU\Source\OptixRayTracer.cpp:575)

For the above reasons, I have the following questions:

  1. Is OPTIX_FORCE_DEPRECATED_LAUNCHER=1 also required for the release version?
  2. Does my code hide a latent problem that I should try to fix, or can I go ahead by setting OPTIX_FORCE_DEPRECATED_LAUNCHER=1?
  3. Is the performance of code compiled in release with OPTIX_FORCE_DEPRECATED_LAUNCHER=1 affected in any way?

Thank you so much for your support.

Correct OptiX 7 applications compiled for release mode targets should never need to set the OPTIX_FORCE_DEPRECATED_LAUNCHER environment variable. All of the OptiX SDK examples and my own examples (found in the sticky posts of this sub-forum) should just work fine in release mode.

Please verify first that this is the case to make sure your system configuration is functioning correctly.

The upcoming R535 driver release will fix that missing printf output from OptiX device code.
The environment variable might still be required for debugging and profiling. I haven’t tried that recently.

That means if you’re experiencing illegal memory accesses inside your own application, then it’s most likely a problem inside the application which needs to be found and fixed.

I haven’t tried building your project. Here are some comments from glossing over the provided project code:

  • I would immediately rewrite the launchSubFrame function, which is going to break when you intend to call it more than once. You should only ever need to create (and destroy) a CUDA stream once for the whole application, and definitely not inside the inner loop of a renderer. (The only case where multiple streams become useful is for very small kernels which wouldn’t saturate the GPU with their workload and can run in parallel. Your code is not at a state to even consider that.)

  • Similar with the CUDA mallocs. Are these ever freed again, or are there memory leaks?

  • Also note that you’re using the RayTracerState as an argument to the other functions without actually having initialized all its fields.
    The AS build functions are not using the same stream as the OptiX launch: they use the default stream 0, while the launch uses the stream you create later. I would never do that. Set up all your state first and use the same stream in all your OptiX host function calls.

  • Never do any dynamic memory allocations inside OptiX device code! Code like this shouldn’t even be allowed:
    RayField* treeStack = new RayField[params.max_depth];
    To see whether that has anything to do with the problem, make max_depth a compile-time constant, define the RayField array as a local variable at the top of the ray generation program, and take its pointer (see the sketch after this list).

  • You’re mixing double and float calculations. piGreco, DEG_TO_RAD, RAD_TO_DEG are doubles. Use float only in OptiX device code if you can.

  • The linear_idx calculation assumes that the image size matches the launch dimensions.
    It will break if that is not the case, which also means the image size is currently redundant.

  • I’ve not tried to follow your code further, but all the assignments of the treeStack are full structure copies which look expensive. Can’t you just work with the pointers to these per-ray-data structures instead?

  • Try building your application by only having the OptiX SDK 7.5.0\include folder in your project’s Additional Include Directories. Avoid the OptiX SDK sutil library for a minimal standalone application.
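Here is a sketch of the change suggested in the dynamic-allocation point above (MAX_DEPTH is a placeholder name for your compile-time constant):

#define MAX_DEPTH 16   // replaces params.max_depth for this experiment

extern "C" __global__ void __raygen__rg()
{
    // Local fixed-size array instead of: RayField* treeStack = new RayField[params.max_depth];
    RayField treeStack[MAX_DEPTH];
    RayField* stack = treeStack;   // use the pointer exactly as before; nothing to delete[]

    // ... iterative traversal unchanged ...
}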

I would recommend starting from scratch with a clean minimal OptiX application which works.
Only then add the desired code in tiny steps, verifying after each step that the device code still works.
Reconsider your per-ray-data handling while doing that. Try not to copy whole structs around when not absolutely required, like for any input and output buffer data.

If you have a cleaned up project which still fails, provide a minimal and complete reproducer in failing state again.

Hi Roettger,

Thank you very much for your guidelines.

I followed your suggestions point by point, and now I have a first cleaned-up and working version of the ray tracer (with only the OptiX SDK 7.5.0\include folder in my Additional Include Directories) that solves only the geometric part of my problem.

The critical point that caused the illegal memory accesses in my application was point 4 of your suggestion list. Together with point 7, these were the two steps I used to implement an iterative version of the ray tracer that emulates the stack of a recursive algorithm.

Taking a cue from the optixWhitted sample code, I also wrote a recursive version that turns out to be much faster than the iterative one. Can I ask you for a general consideration about comparing iterative vs. recursive algorithms in terms of memory management and performance?

Finally, I still have a doubt about your point 3: the AS build functions I used do not take the stream among their input parameters, only the context, so I do not quite understand the meaning of your phrase: “The AS build functions are not using the same stream as the OptiX launch”.

Thank you so much for your support.

Finally, I still have a doubt about your point 3: the AS build functions I used do not take the stream among their input parameters, only the context, so I do not quite understand the meaning of your phrase: “The AS build functions are not using the same stream as the OptiX launch”.

Because you didn’t initialize your RayTracerState struct before calling optixAccelBuild, these calls were using the default stream zero, while your launchSubFrame function created a CUDA stream and called optixLaunch with that.
You should have seen that when single-stepping through your application with a debugger.

Since all OptiX API functions taking a CUDA stream argument are asynchronous, using different stream arguments means the work can potentially run in parallel on the device. That would not work (i.e. crash with illegal access errors) if optixAccelBuild were still building the acceleration structure while optixTrace already wants to use it.
Depending on the CUDA context setup, the synchronization behaviour of the default CUDA stream zero can differ.
None of these synchronization issues would be possible if you used the same CUDA stream for the optixAccelBuild and optixLaunch calls, and that should be the one you created yourself.
You need to synchronize the respective stream after the last optixAccelBuild anyway, to make sure the traversable handles are available.
So simply initialize your RayTracerState completely before calling any OptiX API functions and always use the stream stored in there.
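In code this means something like the following (a sketch; the field names inside the state struct are assumptions):

// Initialize the state, including the stream, before any asynchronous OptiX call.
cudaStreamCreate( &state.stream );

optixAccelBuild( state.context, state.stream, &accelOptions, &buildInput, 1,
                 d_tempBuffer, tempSizeInBytes,
                 d_outputBuffer, outputSizeInBytes,
                 &state.gas_handle, nullptr, 0 );

// Make sure the traversable handle is ready before anything uses it.
cudaStreamSynchronize( state.stream );

// Later, the launch uses the same stream.
optixLaunch( state.pipeline, state.stream, state.d_params, sizeof( Params ),
             &state.sbt, width, height, 1 );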

Taking a cue from the optixWhitted sample code, I also wrote a recursive version that turns out to be much faster than the iterative one. Can I ask you for a general consideration about comparing iterative vs. recursive algorithms in terms of memory management and performance?

Recursive algorithms usually need more OptiX stack space.
It’s recommended that you always calculate your OptiX pipeline’s stack space yourself.
Please have a look into the helper functions inside optix_stack_size.h and how they are used inside the OptiX SDK examples.
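The usual pattern from the SDK examples looks roughly like this (programGroups and maxTraceDepth are your own values; note that the exact helper signatures differ slightly between OptiX 7.5 and 7.7, where the pipeline can additionally be passed in):

#include <optix_stack_size.h>

OptixStackSizes stackSizes = {};
for( OptixProgramGroup group : programGroups )      // programGroups: your std::vector<OptixProgramGroup>
    optixUtilAccumulateStackSizes( group, &stackSizes );

unsigned int dcStackFromTraversal, dcStackFromState, continuationStack;
optixUtilComputeStackSizes( &stackSizes, maxTraceDepth,
                            0 /*maxCCDepth*/, 0 /*maxDCDepth*/,
                            &dcStackFromTraversal, &dcStackFromState, &continuationStack );

optixPipelineSetStackSize( pipeline, dcStackFromTraversal, dcStackFromState,
                           continuationStack, 1 /*maxTraversableGraphDepth, 1 for a single GAS*/ );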

It would make sense to compare the VRAM usage when running iterative vs. recursive versions of your application.

I would generally not expect recursive implementations to be faster than iterative versions of the same ray tracing algorithm. Without knowing what exactly you programmed, it’s not possible to say why your recursive version should be faster.