Casting rays from geometry during OptiX ray generation program

In the ray generation program, I’d like to loop through geometries, check the material tag and, if it is an emitting material, launch rays by looping through each mesh element on the geometry.

Don’t do that! That approach is not thinking “parallel compute”.
OptiX uses a single-ray programming model. If you launch millions of threads (the launch dimension), each individual thread (launch index) needs to know, or be able to calculate, exactly which ray it needs to shoot from where, without sequentially searching through some data structure. Each thread must be completely independent of the others, with no duplicated work.

That means you should know before your launch which meshes need to be sampled.
To accomplish that you would only store the data for the meshes which need to be sampled inside buffers accessible to the ray generation program.

I’m assuming OptiX 7 in the description below. (Older versions have no launch parameter struct; they use variables and buffers at the context scope instead.)
This is probably over-engineering your problem, but this description is effectively what is required for arbitrary mesh light sampling which I have implemented before.

Making the geometry topology and transformation available to the ray generation program is pretty straightforward, but needs some amount of host code to prepare the necessary data.

You say the “material” decides which mesh should be sampled. That is an additional indirection you need to resolve.
How involved that is depends on the granularity at which the material is assigned to the geometric primitives (e.g. per triangle, per instance).

Let’s assume you have a two-level scene hierarchy with instance and geometry acceleration structure (IAS->GAS) and the material is assigned per instance.
That would allow using the same geometry with different materials.

To make things flexible, you store the vertex attributes and indices of each GAS in the scene in individual device buffers. That means you have at least two CUdeviceptr per GAS (if your attributes are interleaved).
You also need to store the number of primitives inside the geometry to be able to select one primitive from the indices.

If you need to sample each whole mesh surface uniformly, there also needs to be a one-dimensional cumulative distribution function (CDF) over the primitive areas, whose number of entries is also the number of primitives.
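A minimal host-side sketch of building such a CDF, assuming an indexed triangle mesh with positions in object space. The names `Float3` and `buildAreaCdf` are made up for illustration and are not part of OptiX. Note that areas are measured in object space here, so an instance transform with non-uniform scaling would make the resulting distribution only approximately uniform in world space.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Float3 { float x, y, z; };  // hypothetical stand-in for CUDA's float3

static Float3 sub(Float3 a, Float3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }

static Float3 cross(Float3 a, Float3 b)
{
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

static float length(Float3 v) { return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z); }

// Returns a normalized CDF with one entry per triangle:
// cdf[i] is the fraction of the total area covered by triangles 0..i.
std::vector<float> buildAreaCdf(const std::vector<Float3>& positions,
                                const std::vector<uint32_t>& indices)
{
    assert(indices.size() % 3 == 0);
    const std::size_t numTriangles = indices.size() / 3;

    std::vector<float> cdf(numTriangles);
    double sum = 0.0;  // accumulate in double to limit rounding drift
    for (std::size_t i = 0; i < numTriangles; ++i)
    {
        const Float3 v0 = positions[indices[3 * i + 0]];
        const Float3 v1 = positions[indices[3 * i + 1]];
        const Float3 v2 = positions[indices[3 * i + 2]];
        sum += 0.5 * length(cross(sub(v1, v0), sub(v2, v0)));  // triangle area
        cdf[i] = static_cast<float>(sum);
    }
    if (sum > 0.0)
    {
        for (float& c : cdf) c = static_cast<float>(c / sum);
    }
    if (!cdf.empty()) cdf.back() = 1.0f;  // guard against rounding in the last entry
    return cdf;
}
```

The resulting array would be uploaded once per geometry and referenced by a CUdeviceptr next to the attribute and index buffers.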

=> That would be the geometry part in object coordinates: one structure with all this information per geometry, put into a buffer, referenced via a CUdeviceptr.

Now, since the instances hold the transformation and the material is assigned per instance, you would need another buffer which stores only the instances whose material needs to be sampled, to avoid any searches.
That means there needs to be another CUdeviceptr in your launch parameters, pointing to a buffer which holds the instance transform and an index into the geometry buffer in the launch parameters, with the same instance->geometry assignment as used in the IAS->GAS.
=> That would be the instance part.
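Put together, the geometry and instance parts could be laid out roughly like this. This is only a sketch: every struct and field name is an assumption chosen for illustration, and `DevicePtr` is a stand-in for CUdeviceptr so that the snippet compiles without cuda.h.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Stand-in for CUdeviceptr (a 64-bit integer type in cuda.h on 64-bit platforms).
using DevicePtr = unsigned long long;

// Geometry part: one entry per GAS, everything in object space.
struct MeshData
{
    DevicePtr attributes;     // interleaved vertex attributes
    DevicePtr indices;        // three vertex indices per triangle
    DevicePtr areaCdf;        // one float per triangle, only needed for uniform area sampling
    uint32_t  numPrimitives;  // number of triangles (and CDF entries)
};

// Instance part: one entry per instance whose material needs to be sampled.
struct EmitterInstance
{
    float    objectToWorld[12];  // row-major 3x4 matrix, same layout as OptixInstance::transform
    uint32_t meshIndex;          // index into the MeshData buffer, matching the IAS->GAS assignment
};

// Hypothetical launch parameter struct (OptiX 7 style).
struct LaunchParams
{
    DevicePtr meshes;        // MeshData[]
    DevicePtr emitters;      // EmitterInstance[]
    uint32_t  numEmitters;
    uint32_t  emitterIndex;  // which emitter the current launch samples
    uint32_t  subFrame;      // varies the random sequence between launches
    // The top-level traversable handle and output buffers would also live here.
};

// Launch parameters are copied to the device byte by byte, so keep them trivially copyable.
static_assert(std::is_trivially_copyable<LaunchParams>::value, "LaunchParams must be trivially copyable");
```

Because only emitting instances are stored in the `emitters` buffer, a launch index never has to search the scene for them.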

Now you need to define how the sampling happens. That would define the amount of rays spent on each geometry.
I don’t know how your expected output data should be stored.

Let’s make it simple again and assume each mesh should get the same number of rays, independent of its world space surface area. Otherwise you would need to pick the proper launch dimension per surface area.
Then you could for example handle each mesh in an independent launch. (That would be your loop over the meshes done on the host, not inside the ray generation program!)
That means each launch would select one of the instances inside the launch parameters by setting an index variable.
That would also simplify handling the resulting output.
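The host-side loop could look roughly like the sketch below. The actual launch is hidden behind a hypothetical callback `launchFn`, which in a real application would copy the launch parameters to the device and call optixLaunch with a width of `raysPerMesh`; `LaunchParams` is reduced to the one field that matters here.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Reduced, hypothetical launch parameter struct; a real one holds much more.
struct LaunchParams
{
    uint32_t emitterIndex;  // selects the instance sampled by this launch
};

// Hypothetical wrapper around "upload params, then optixLaunch(..., width, 1, 1)".
using LaunchFn = std::function<void(const LaunchParams&, uint32_t width)>;

// One launch per emitting instance: the loop over meshes lives on the host,
// not inside the ray generation program.
void sampleAllEmitters(uint32_t numEmitters, uint32_t raysPerMesh, const LaunchFn& launchFn)
{
    assert(static_cast<bool>(launchFn));

    LaunchParams params{};
    for (uint32_t i = 0; i < numEmitters; ++i)
    {
        params.emitterIndex = i;         // the only value that changes between launches
        launchFn(params, raysPerMesh);   // launch dimension = rays spent on this mesh
    }
}
```

With one launch per mesh, each launch can also write to its own output buffer, or to its own region of one buffer.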

The launch dimension would define how many rays are spent sampling that mesh.
Now each launch index would read the instance matrix, read the index into the geometry structures, read the number of primitives, pick one primitive index inside the geometry (via the CDF when the sampling should be uniform over the surface area), then read the triangle indices, read the three vertex positions from the geometry attributes in object space, sample a point on the triangle uniformly, transform that point into world space and shoot the ray.
Then somehow store the result in a form you need. That would define how the results are written.
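The per-thread sampling steps can be sketched as plain C++ functions; in a real ray generation program they would be device functions reading from the buffers in the launch parameters. All function names are made up for illustration, the random numbers `u0`, `u1`, `u2` are assumed to be uniform in [0, 1), and the square-root barycentric mapping is one common way to sample a triangle uniformly, not necessarily the one used in any particular renderer. The choice of ray direction is left out because it depends on the application.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct Float3 { float x, y, z; };  // hypothetical stand-in for CUDA's float3

// Binary search: smallest primitive index i with u < cdf[i], clamped to the last entry.
uint32_t pickPrimitive(const float* cdf, uint32_t numPrimitives, float u)
{
    assert(numPrimitives > 0);
    uint32_t lo = 0;
    uint32_t hi = numPrimitives - 1;
    while (lo < hi)
    {
        const uint32_t mid = lo + (hi - lo) / 2;
        if (u < cdf[mid]) hi = mid; else lo = mid + 1;
    }
    return lo;
}

// Uniform point on a triangle via square-root barycentric coordinates.
Float3 sampleTriangle(Float3 v0, Float3 v1, Float3 v2, float u1, float u2)
{
    const float s  = std::sqrt(u1);
    const float b0 = 1.0f - s;
    const float b1 = s * (1.0f - u2);
    const float b2 = s * u2;
    return {b0 * v0.x + b1 * v1.x + b2 * v2.x,
            b0 * v0.y + b1 * v1.y + b2 * v2.y,
            b0 * v0.z + b1 * v1.z + b2 * v2.z};
}

// Transforms a point by a row-major 3x4 matrix (layout of OptixInstance::transform).
Float3 transformPoint(const float m[12], Float3 p)
{
    return {m[0] * p.x + m[1] * p.y + m[2]  * p.z + m[3],
            m[4] * p.x + m[5] * p.y + m[6]  * p.z + m[7],
            m[8] * p.x + m[9] * p.y + m[10] * p.z + m[11]};
}

// Full chain for one launch index: pick primitive, sample it, move to world space.
Float3 samplePointOnMesh(const Float3* positions, const uint32_t* indices,
                         const float* cdf, uint32_t numPrimitives,
                         const float objectToWorld[12],
                         float u0, float u1, float u2)
{
    const uint32_t prim = pickPrimitive(cdf, numPrimitives, u0);
    const Float3 v0 = positions[indices[3 * prim + 0]];
    const Float3 v1 = positions[indices[3 * prim + 1]];
    const Float3 v2 = positions[indices[3 * prim + 2]];
    return transformPoint(objectToWorld, sampleTriangle(v0, v1, v2, u1, u2));
}
```

The returned point would be the ray origin. Nothing here depends on what any other launch index does, which is the property the ray generation program needs.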

If you need to be able to assign the results to primitives again, this could either be done with atomicAdd() in a scattering algorithm, because the random sampling of primitives means multiple threads can work on one primitive at the same time.
Or your results could have the same number of elements as the launch dimension, and you store the primitive ID along with the data and gather the results per primitive ID in a second CUDA kernel.
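As a reference for the gather variant, here is a sequential host-side sketch of what the second pass computes. The `SampleResult` layout and the choice to average the values per primitive are assumptions; the real data and reduction depend on what is being simulated, and on the GPU this would be a CUDA kernel rather than a loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-launch-index output record: one entry per ray.
struct SampleResult
{
    uint32_t primitiveId;  // which primitive the sample point was on
    float    value;        // whatever the ray measured
};

// Gathers the per-ray results into one averaged value per primitive.
// Primitives which received no samples keep the value 0.
std::vector<float> gatherPerPrimitive(const std::vector<SampleResult>& results,
                                      uint32_t numPrimitives)
{
    std::vector<float>    sums(numPrimitives, 0.0f);
    std::vector<uint32_t> counts(numPrimitives, 0u);

    for (const SampleResult& r : results)
    {
        assert(r.primitiveId < numPrimitives);
        sums[r.primitiveId] += r.value;
        ++counts[r.primitiveId];
    }
    for (std::size_t i = 0; i < sums.size(); ++i)
    {
        if (counts[i] > 0) sums[i] /= static_cast<float>(counts[i]);
    }
    return sums;
}
```

In the scatter variant, the `sums[r.primitiveId] += r.value` line is the part that would become an atomicAdd() inside the ray generation program, since several launch indices may hit the same primitive concurrently.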

This could also be done with progressive algorithms where each launch improves the results. That would require an additional sub-frame value which changes the random number generator picking the sample points per launch.
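One possible way to make the sub-frame value change the sample points is to hash it together with the launch index into a per-thread seed. The sketch below uses a PCG-style integer hash and a simple linear congruential generator; both are choices made for this example and not something OptiX prescribes. The running average in `accumulate` is likewise just one assumed way of combining launches.

```cpp
#include <cassert>
#include <cstdint>

// PCG-style 32-bit integer hash (an illustrative choice, any good hash would do).
uint32_t pcgHash(uint32_t v)
{
    const uint32_t state = v * 747796405u + 2891336453u;
    const uint32_t word  = ((state >> ((state >> 28u) + 4u)) ^ state) * 277803737u;
    return (word >> 22u) ^ word;
}

// Per-thread seed: differs per launch index and per sub-frame.
uint32_t makeSeed(uint32_t launchIndex, uint32_t subFrame)
{
    return pcgHash(launchIndex + pcgHash(subFrame));
}

// Advances the state with an LCG and returns a float in [0, 1).
float nextFloat(uint32_t& state)
{
    state = state * 1664525u + 1013904223u;
    return static_cast<float>(state >> 8) * (1.0f / 16777216.0f);  // top 24 bits / 2^24
}

// Running average over sub-frames 0, 1, 2, ...: each launch refines the previous mean.
float accumulate(float previousMean, float newSample, uint32_t subFrame)
{
    return previousMean + (newSample - previousMean) / static_cast<float>(subFrame + 1u);
}
```

The host would increment `subFrame` in the launch parameters before each launch, so the same launch index picks different sample points every time.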

A completely different approach would be to calculate the sample points inside a native CUDA kernel and provide a buffer with ray origins and directions to the ray generation program.

Other posts on this forum touched the same topic, mostly related to ambient occlusion and radiation heat transfer.