Ray-local memory

I want to have a piece of memory that is shared across different invocations of the IS program of the same ray. I could do that by using global memory and passing the pointer through the ray payload in optixTrace. Or I could call AH in each IS and then use the 8 attributes for the shared data. The former is perhaps slow, and the latter limits the shared data to 8 32-bit registers. Is there a way to allocate ray-local memory and provide hints to the compiler to store it in thread-local registers?

Hmm, the second approach wouldn’t work, since one invocation of AH can’t set attributes that can later be read by another invocation of AH. I would have to use the 8 32-bit registers in the ray payload in optixTrace.

Yeah, you can pass a pointer in your payload. If you need more than 8 registers, you can pass a pointer to global memory, or you can pass a pointer to a struct that’s on the stack, which is in thread-local memory and/or registers. For example, you could have a struct declared locally in your raygen program so that its storage is allocated on the stack rather than on the heap, and you can put a pointer to that struct into your payload. This doesn’t guarantee that the memory will be in registers later, but it at least gives you a fighting chance, depending on how much data you access, where you access the data, and whether you make other function calls besides optixTrace(). The compiler will use registers when it can.

You have some control over the total registers per thread, so if you’d like to play with trying to balance using more registers against lower occupancy, you might be able to find an optimum that is different from the default settings. Note it’s relatively rare for the default behavior to produce slower code, but on occasion you can successfully fine-tune it. The way you do this is by using module_compile_options.maxRegisterCount. Also note that fine-tuning the max register count can make the code more difficult to maintain because nearly all code changes you make affect register usage. You might not want to have to repeatedly fine-tune code that is under ongoing development. Make sure you read the programming guide and header file notes in optix_7_host.h. https://raytracing-docs.nvidia.com/optix7/guide/index.html#program_pipeline_creation#pipeline-stack-size
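As a sketch, the register cap goes into the module compile options before module creation. This is a configuration fragment, not a recommendation; the field and enum names are from the OptiX 7 API, and the value 64 is just an example to experiment with:

```cpp
OptixModuleCompileOptions module_compile_options = {};
// 0 (OPTIX_COMPILE_DEFAULT_MAX_REGISTER_COUNT) lets the compiler decide.
// A lower cap can raise occupancy but may cause spills to local memory.
module_compile_options.maxRegisterCount = 64; // example value, tune by profiling
module_compile_options.optLevel         = OPTIX_COMPILE_OPTIMIZATION_LEVEL_3;
module_compile_options.debugLevel       = OPTIX_COMPILE_DEBUG_LEVEL_NONE;
```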

Is this for the 200KB per-ray data block you mentioned in the other thread? It might be worth elaborating, if you can, on what you’re trying to achieve with the shared memory - perhaps there are known strategies for doing what you want, or alternatives that achieve the same result.


David.

Oh, thanks for the clarification. I didn’t know that the “stack” in OptiX refers to storage local to a thread. I thought the stack had something to do with recursion… So even if I pass a pointer to a piece of data on the stack, the compiler might still allocate the data in registers? I don’t think a typical C compiler would do this optimization.

I am basically trying to implement a max heap/priority queue that is updated by each IS program. Is there anything related/similar that I could reference to?

The stack is used for recursion (or just local space per-thread and a mechanism for handling function calls). A typical C or C++ compiler on the CPU will use registers instead of memory for local variables if it can.

What are you trying to achieve or compute with your heap/priority queue?


David.

Also how big is the per-thread memory you need to access in your IS program?


David.

Yeah, I get that the stack is used to handle function calls. What I meant is that I thought the stack was used for storing data related to BVH traversal, which is recursive.

So if I define a struct in an IS/AH program, for instance, will the data be on the stack? If so, the only way for other invocations of the same IS/AH program to access the data would be to pass a pointer?

Well, it depends on the priority queue size, which is a user-defined variable.

If you have local variables in CUDA or OptiX code (or any C/C++ code), and the compiler can’t optimize them into registers, then the data for those local variables is always on the stack. Stack memory in CUDA/OptiX goes into Local Memory, which is handled differently than Global Memory. Local Memory access can be faster than Global, but it’s not guaranteed.

Since there’s no dynamic memory allocation during a kernel launch, you will have to decide on a maximum size for your priority queue no matter where you put it. If you are still referring to a 200KB block of memory potentially, I think there is no point in looking for tricks to store the memory anywhere other than Global Memory.

Is this shared memory only going to be accessed by programs in a single thread, or do you need to share this memory across different threads simultaneously?


David.

It’s only going to be accessed by a single thread/ray, but different IS invocations will access it. For right now, here is what I have done:

In RG:

    float min_dists[K];
    unsigned int u0, u1;
    packPointer( min_dists, u0, u1 );

    unsigned int min_idxs[K];
    unsigned int u2, u3;
    packPointer( min_idxs, u2, u3 );

    float max_key;
    unsigned int max_idx;
    unsigned int size = 0;

Then in IS:

      const unsigned int u0 = optixGetPayload_1();
      const unsigned int u1 = optixGetPayload_2();
      float* keys = reinterpret_cast<float*>( unpackPointer( u0, u1 ) );

      const unsigned int u2 = optixGetPayload_3();
      const unsigned int u3 = optixGetPayload_4();
      unsigned int* vals = reinterpret_cast<unsigned int*>( unpackPointer( u2, u3 ) );

      float max_key = uint_as_float(optixGetPayload_5());
      unsigned int max_idx = optixGetPayload_6();
      unsigned int _size = optixGetPayload_7();

and then manipulating the priority queue is something like:

      if (_size < K) {
        keys[_size] = key;
        vals[_size] = val;

        if (_size == 0 || key > max_key) {
          optixSetPayload_5( float_as_uint(key) ); // max_key = key;
          optixSetPayload_6( _size );              // max_idx = _size;
        }
        optixSetPayload_7( _size + 1 );            // _size++;
      }
      else if (key < max_key) {
        keys[max_idx] = key;
        vals[max_idx] = val;

        // Track the running maximum locally so the comparisons below use the
        // updated value rather than the stale max_key read from the payload.
        float new_max = key;
        optixSetPayload_5( float_as_uint(key) ); // max_key = key;
        for (unsigned int k = 0; k < K; ++k) {
          const float cur_key = keys[k];
          if (cur_key > new_max) {
            new_max = cur_key;
            optixSetPayload_5( float_as_uint(cur_key) ); // max_key = cur_key;
            optixSetPayload_6( k );                      // max_idx = k;
          }
        }
      }

I can’t tell whether the two arrays min_dists and min_idxs are allocated in registers or not. K is set to 16, so it’s not huge. Is there a way to inspect the PTX code generated by NVRTC?

Is there a way to inspect the PTX code generated by NVRTC?

You mean the PTX source code you get when calling nvrtcGetPTX()?
Please have a look at the sutil.cpp function getPtxFromCuString(), which does that.

Thanks. I will take a look! Before looking at the PTX code, one thing I found interesting in my code above is the following:

If instead of doing optixSetPayload_5( float_as_uint(cur_key) ); I directly update max_key in each iteration and then call optixSetPayload_5( float_as_uint(max_key) ) at the end of the loop, the performance is much slower. That suggests the updates to max_key go to local memory, which is sort of weird, since max_key is defined and initialized by float max_key = uint_as_float(optixGetPayload_5());, so I’d think max_key would basically be that payload register.

Looking at the PTX code won’t help you with answering what really changed inside the final kernel when you edit something small.

You will see big mistakes like not compiling with use_fast_math for example, which will dramatically increase the number of instructions for all trigonometric, reciprocal and square root functions.

The PTX code is only a readable intermediate representation which goes through multiple translation steps inside OptiX, the CUDA PTX assembler (which is more of a compiler) and the code generator inside the driver.

You would need to look at the final microcode (SASS) to be able to see what actually runs on the GPU.

That is possible with Nsight Compute https://docs.nvidia.com/nsight-compute/index.html

Make sure your PTX code has been translated with line-info and full optimizations. Debug code will not generate the same microcode.

Thanks for the pointer. One more question: how would I pass a macro definition flag (-D) to NVRTC? I can usually do that through flags to nvcc, but in this case the OptiX code is compiled through NVRTC; how should I do that?

To be more specific, the OptiX program needs to reference a macro, which I hope to set through the -D flag. Setting it through CMAKE_CUDA_FLAGS in CMake doesn’t help, because I guess that actually sets the nvcc flags, which are not used by NVRTC?

That suggests the updates to max_key go to local memory, which is sort of weird, since max_key is defined and initialized by float max_key = uint_as_float(optixGetPayload_5());, so I’d think max_key would basically be that payload register.

By declaring a new local variable, you’ve allocated separate additional storage for the value in payload slot 5. If the value is not constant at compile time, this will always result in either another separate register being used, or local memory. If you wanted it to be an alias for the register in payload slot 5, it might work to declare max_key as a reference or const reference, but in general, in any C/C++ code, you should expect a local variable declaration to occupy its own space.

I’m not sure it will be easy to use registers for your priority queue, and even if you can use registers, that doesn’t mean it will be faster, because you will reduce occupancy. My recommendation would be to get it working with local memory, and profile to find the hotspots.

In order to see how many registers you’re using, you can profile with Nsight Compute. Note that the total registers per thread that Nsight Compute reports is not accurate for OptiX programs, because OptiX function calls have a different calling convention than CUDA. The relative registers per thread, however, will be accurate. So what you need to do is use Nsight Compute to measure regs/thread without your priority queue compiled, and then measure again with it enabled, and look at the difference between them.


David.

Thanks. I found that the priority queue wasn’t the main bottleneck, as the performance is roughly the same when K is 1 vs. 50. The bottleneck is simply that the IS program is triggered a lot of times. Even if the IS program virtually does nothing, it takes a lot of time.

The NVRTC options are just a local array of const char* that you need to set up before the call to nvrtcCompileProgram().

Have a look at this old thread. This is how I translated runtime generated CUDA C++ code for an advanced material system to PTX:
https://forums.developer.nvidia.com/t/optix-compile-error-in-ptxstring/70636/2

There have been a number of NVRTC-related discussions on this OptiX sub-forum before.

I recommend using the search feature in the top right, while the page is on the OptiX sub-forum topics, before asking a question.

Not surprising at all, since the intersection program is the most frequently called program of all program domains and should be implemented to do the least amount of work in the shortest amount of time.

I would really like to hear a description of the algorithm you want to solve with ray tracing, instead of these questions about OptiX internal details you shouldn’t really need to be concerned about.

Maybe there is a much more elegant and faster way using OptiX or a completely different way.

Thanks for the pointer.

Oh, I am just trying to implement KNN search using OptiX, a bit like this: http://www.sci.utah.edu/~wald/Publications/2020/springs/springs.pdf, but actually getting the top-K nearest neighbors rather than everything within a fixed radius. I need this search for a large point cloud rendering project.