Hello, I’m trying to use the OptixCoop APIs to implement a radius search that finds the closest distance between points. Here’s the logic: given two sets of points, A and B, I construct an AABB centered at each point in B with a width of r/2 (the search radius), and these AABBs serve as the input to the BVH. Then I cast a short ray from each point in A. In the IS shader, I accumulate a batch of intersections so that I can use the OptixCoop APIs to calculate Euclidean distances with optixCoopVecSub and optixCoopVecMul. Here’s the code:
#define TENSOR_2D_BATCH_SIZE (16)

extern "C" __global__ void __intersection__nn_tensor_2d() {
  // Interleaved x/y layout: two floats per point in the batch.
  using coopvec_t = OptixCoopVec<float, TENSOR_2D_BATCH_SIZE * 2>;
  using point_t = typename decltype(params)::point_t;

  auto ray_idx = optixGetPayload_0();
  auto point_a_id = params.in_queue[ray_idx];
  auto n_hits = optixGetPayload_1();
  auto point_b_id = optixGetPrimitiveIndex();
  const auto& point_a = params.points_a[point_a_id];
  const auto& point_b = params.points_b[point_b_id];
  auto radius = params.radius;

  optixSetPayload_1(n_hits + 1);
  auto offset = n_hits % TENSOR_2D_BATCH_SIZE;

  // params.points_a_batched is pre-computed by duplicating point A 'TENSOR_2D_BATCH_SIZE' times
  params.points_b_batched[ray_idx * TENSOR_2D_BATCH_SIZE + offset] = point_b;

  // Only reduce once the batch for this ray is full.
  if (offset == TENSOR_2D_BATCH_SIZE - 1) {
    auto points_a_batched = optixCoopVecLoad<coopvec_t>(
        &params.points_a_batched[ray_idx * TENSOR_2D_BATCH_SIZE]);
    auto points_b_batched = optixCoopVecLoad<coopvec_t>(
        &params.points_b_batched[ray_idx * TENSOR_2D_BATCH_SIZE]);

    // x_a1 - x_b1, y_a1 - y_b1, x_a2 - x_b2, y_a2 - y_b2...
    auto points_a_b = optixCoopVecSub(points_a_batched, points_b_batched);
    // (x_a1 - x_b1)^2, (y_a1 - y_b1)^2, (x_a2 - x_b2)^2, (y_a2 - y_b2)^2...
    points_a_b = optixCoopVecMul(points_a_b, points_a_b);

    // Payload 2 carries the current minimum squared distance as raw bits.
    auto cmin2_storage = optixGetPayload_2();
    FLOAT_TYPE cmin2 = *reinterpret_cast<FLOAT_TYPE*>(&cmin2_storage);

    for (auto i = 0; i < TENSOR_2D_BATCH_SIZE; i++) {
      // (x_ai - x_bi)^2 + (y_ai - y_bi)^2
      auto dist2 = points_a_b[i * 2] + points_a_b[i * 2 + 1];
      if (dist2 <= radius * radius) {
        if (dist2 < cmin2) {
          cmin2 = dist2;
          cmin2_storage = *reinterpret_cast<unsigned int*>(&cmin2);
          optixSetPayload_2(cmin2_storage);
        }
      }
    }
  }
}
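For reference, the result each batch is meant to produce can be written as a small host-side C++ sketch, which may be useful for validating the shader output. It is not the OptiX code path: `Point2`, `kBatchSize`, and `batch_min_dist2` are hypothetical names introduced only for this illustration, and it assumes `point_t` is a pair of floats laid out as x, y.

```cpp
#include <array>
#include <cstddef>
#include <limits>

// Mirrors TENSOR_2D_BATCH_SIZE from the shader.
constexpr std::size_t kBatchSize = 16;

// Hypothetical stand-in for point_t (assumed to be two floats: x, y).
struct Point2 {
  float x;
  float y;
};

// Returns the smaller of `cmin2` and the smallest squared distance from `a`
// to any of the first `count` points in `batch` that lies within `radius`.
float batch_min_dist2(const Point2& a,
                      const std::array<Point2, kBatchSize>& batch,
                      std::size_t count, float radius, float cmin2) {
  // Interleaved layout, as in the shader: dx0^2, dy0^2, dx1^2, dy1^2, ...
  std::array<float, kBatchSize * 2> diff2{};
  for (std::size_t i = 0; i < count; ++i) {
    const float dx = a.x - batch[i].x;  // element-wise subtract
    const float dy = a.y - batch[i].y;
    diff2[2 * i] = dx * dx;             // element-wise multiply
    diff2[2 * i + 1] = dy * dy;
  }

  // Pairwise reduction and radius test, as in the loop at the end of the shader.
  const float r2 = radius * radius;
  for (std::size_t i = 0; i < count; ++i) {
    const float dist2 = diff2[2 * i] + diff2[2 * i + 1];
    if (dist2 <= r2 && dist2 < cmin2) {
      cmin2 = dist2;
    }
  }
  return cmin2;
}
```

The `count` argument lets the same function check a partially filled batch; the shader above only performs the reduction once a batch is full.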
Here are my questions:
- I found that batching points in order to use the OptixCoop APIs actually degrades performance by 20%. I tried tuning the max register count, but there was no improvement at all.
- I then used Nsight Compute to profile the program, but the Roofline section shows that the Tensor cores are not used. Does that mean the Tensor cores are not used in my program?
Thank you.