What's your solution to get all hit primitives of multiple rays?

Hello all, now I have M rays shooting in to a scene with N primitives, and I want to get all hit primitives of M rays, now I have several solution but I think there are some performance problem. So I’m wondering is there some better solution to do this?

Example: 4 ray and 8 triangles, ray_0 hit No.0, No.3 triangle, ray_1 hit No.1, No.2, No.3 triangle, ray_2 hit No.4, No.6, No.7 triangle, ray_3 hit No.0, No.1, No.4, No.5 triangle, 4 rays are launched at the same time. Finally I want the following list:

primitive_hit[0] = {0, 3};

primitive_hit[1] = {1, 2, 3};

primitive_hit[2] = {4, 6, 7};

primitive_hit[3] = {0, 1, 4, 5};

Solution 1:

Assign N primitives each with an bool array with length equal to M in HitGroupData, and set this array in __any_hit shader by:

const uint3 ray_index = optixGetLaunchIndex();

const unsigned int primitive_index = optixGetPrimitiveIndex();

htData->primitive_hit[primitive_index][ray_index.x] = true;


While my concern is that primitive_hit[primitive_index] array is a critical region, and M rays are launched simultaneously, and one primitive may be hit by multiple rays in a very short time period, which leads to multiple thread (ray) write the critical region almost at the same time. If I want to guarantee the correctness, I must implement some kind of lock/samephore/…, I guess this will severely hurt the performance.

Solution 2:

Assign M rays each with an integer array in RayGenData.Record the ID of hit primitive by setPayload(), which can be implemented by:

// in __raygen shader

const uint3 ray_index = optixGetLaunchIndex();

optixTrace(..., [PayLoad:] primitive_index);

rgData->primitive_hit_by_ray[ray_index.x][rgData->hit_counter] = primitive_index;

rgData->hit_counter += 1;

// in __any_hit shader

const uint3 ray_index = optixGetLaunchIndex();

const unsigned int primitive_index = optixGetPrimitiveIndex();



I’m still trying to implement and debug solution 2 and have not test the performance yet.

Since I think this is a quite common scenario, I’m curious is there an “optimal” solution to finish my job?

The first problem with this is the size of the result vector.

Method A

If a single ray can hit multiple primitives then the worst case would be all rays (M) can hit all primitives (N).
A matrix of M * N bits would be the minimum amount of memory you would require to store that information.
If that result vector size is feasible, means the amount of memory required is small enough to fit into your VRAM, then that would be a rather simple implementation.

Because that is a scattered write algorithm (data is not set per launch index), setting a bit in the matrix, which needs to be initialized to zero before the optixLaunch, should be done with an atomicOr operation .
That atomicOr works on 32 (or 64 bit) values, so that bit matrix would need to be allocated as integers.
matrix_size_in_bytes = ((M + 31) / 32) * 4 * N; or matrix_size_in_bytes = ((N + 31) / 32) * 4 * M; depending on how you want to address the bits (ray-major or primitive-major).

So for each hit (M, N) you would need to calculate the 32-bit integer inside the matrix which contains that bit, calculate the integer bit mask, and set it with the atomicOr operation because up to 32 rays could access that same integer in parallel.

The atomics only get expensive if there is congestion during the access from multiple threads.
Neighboring rays are likely to hit the same primitive, so you could experiment if storing the bit matrix ray-major or primitive-major is faster.

Evaluating that result on the host would mean to copy the matrix buffer from device to host and check the bits with the same (M, N) addressing.

Now while that is the smallest approach which works with a fixed allocation, this might get expensive during evaluation, though there are bitscan instructions which can quickly find set bits inside some data array.

Though if the result vector is much smaller, meaning not all rays hit any primitives and the maximum possible number of primitives hit per ray is rather small, it could be beneficial to actually build a list with hit primitives per ray statically or dynamically.

Method B

Depending on the amount of possible hits, esp. if that number is small, you could also simply allocate a big buffer of M * maximum_possible_hits_per_ray integers, initialize them all to -1 to indicate no hit, and then track a number_of_hits value per ray which is the index into each row of per ray hit primitives.
Then each row (== ray) holds the primitive IDs (positive entry) hit by that ray.
-1 indicates the maximum possible amount of hits was not reached.

Method C

If that is too big, It’s also be possible to build a linked list into a pre-allocated data buffer which holds pairs of primitiveId and nextLinkedListElement values, where -1 would indicate no primitive hit and end of linked list.

The following pseudo algorithm assumes the number of hit results fits into an int (less than 2 Gig hits) per launch.

To build a linked list of primitives hit by each ray, you would need to store pairs (int2) of primitive_id and next_list_index. Let primitiveId == -1 indicate no primitive hit, and next_list_index == -1 is the end of the linked list.

For that you would need two buffers:
1.) A buffer containing a single integer. Let’s call this next_free_list_element_index. Initialize to the number of rays, which is the first free element after the start elements for each ray in the following linked_list buffer
2.) A buffer with int2 elements which is bigger than the maximum expected number of hits.
And this is where things could break if this is bigger than you can store in VRAM.
On the other hand, if the allocated list buffer is too small for the number of hits, it’s possible to detect that when fetching the next free list index. The value of the next_free_list_element_index would indicate how many elements would have been required after the optixLaunch.

Now, each primary ray calculates its linear index from the launch dimensions and launch index inside the ray generation. Store that in a per ray payload field.
That primary ray linear index is used to access the M first elements in the linked_list buffer 2 with the invalid -1 start element index of the linked lists. Let’s call this current_index.

When the ray hits a primitive, store the primitive ID and get a next list element.

// When a ray hits a primitive:
int2 current_element;
current_element.x = primitiveId;
current_element.y = atomicAdd(&next_free_list_element_index, 1);
if (current_index < linked_list_num_elements) // Don't crash when the allocated linked_list buffer was too small.
    linked_list[current_index] = current_element;
current_index = current_element.y;

This builds a dynamic linked list.
The first M entries in buffer 2 are the start elements of the per ray linked list.
If the list element entry is (-1, -1) that is the end of the list.

After that launch the next_free_list_element_index contains the number of elements required.
If that is bigger than the originally allocated number of elements inside buffer 2, the algorithm ran out of memory and the result data is invalid!
That would be the amount of elements which would need to be copied to the host from buffer 2.
Then starting with the first M elements which indicate the per ray linked lists starts, you can iterate through these linked lists and gather all entries.

That algorithm could be optimized a little by handling the first hit differently (one list element less allocated per ray).
It’s also possible to use unsigned int and use 0xFFFFFFFF as invalid entry if 2Gig elements is not enough.

Thanks for the patient reply and I will try these method these days!