Tensor Core does not seem to be involved in OptiX 9

Hello, I’m trying to use the OptixCoop APIs to implement a radius search for finding the closest distances between points. Here’s the logic: given two sets of points, A and B, I construct an AABB centered at each point in B with a width of r/2 (where r is the search radius), which serves as input to the BVH build. Then I cast short rays from each point in A. In the IS program, I accumulate a batch of intersections so I can use the OptixCoop APIs (optixCoopVecSub and optixCoopVecMul) to compute Euclidean distances. Here’s the code:

#define TENSOR_2D_BATCH_SIZE (16)

extern "C" __global__ void __intersection__nn_tensor_2d() {
  using coopvec_t = OptixCoopVec<float, TENSOR_2D_BATCH_SIZE * 2>;
  using point_t = typename decltype(params)::point_t;
  auto ray_idx = optixGetPayload_0();
  auto point_a_id = params.in_queue[ray_idx];
  auto n_hits = optixGetPayload_1();
  auto point_b_id = optixGetPrimitiveIndex();
  const auto& point_a = params.points_a[point_a_id];
  const auto& point_b = params.points_b[point_b_id];
  auto radius = params.radius;

  optixSetPayload_1(n_hits + 1);
  auto offset = n_hits % TENSOR_2D_BATCH_SIZE;
  // params.points_a_batched is pre-computed by duplicating each point in A 'TENSOR_2D_BATCH_SIZE' times
  params.points_b_batched[ray_idx * TENSOR_2D_BATCH_SIZE + offset] = point_b;

  if (offset == TENSOR_2D_BATCH_SIZE - 1) {
    auto points_a_batched = optixCoopVecLoad<coopvec_t>(
        &params.points_a_batched[ray_idx * TENSOR_2D_BATCH_SIZE]);
    auto points_b_batched = optixCoopVecLoad<coopvec_t>(
        &params.points_b_batched[ray_idx * TENSOR_2D_BATCH_SIZE]);

    // x_a1 - x_b1, y_a1 - y_b1, x_a2 - x_b2, y_a2 - y_b2...
    auto points_a_b = optixCoopVecSub(points_a_batched, points_b_batched);
    // (x_a1 - x_b1)^2, (y_a1 - y_b1)^2, (x_a2 - x_b2)^2, (y_a2 - y_b2)^2...
    points_a_b = optixCoopVecMul(points_a_b, points_a_b);

    // The vector element type is float, so reinterpret the 32-bit payload
    // register with CUDA's __uint_as_float/__float_as_uint intrinsics.
    float cmin2 = __uint_as_float(optixGetPayload_2());
    for (auto i = 0; i < TENSOR_2D_BATCH_SIZE; i++) {
      // (x_ai - x_bi)^2 + (y_ai - y_bi)^2
      auto dist2 = points_a_b[i * 2] + points_a_b[i * 2 + 1];
      if (dist2 <= radius * radius && dist2 < cmin2) {
        cmin2 = dist2;
        optixSetPayload_2(__float_as_uint(cmin2));
      }
    }
  }
}

Here are my questions:

  1. I found that batching points to use the OptixCoop APIs actually made performance about 20% worse. I tried tuning the maximum register count, but there was no improvement at all.
  2. I then used Nsight Compute to profile the program, and the Roofline section shows no Tensor Core activity. Does that mean the Tensor Cores are not used in my program?

Thank you.

Hi @pwrliang,

Tensor Cores are only utilized when you call optixCoopVecMatMul(). Since you’re not using any matrix operations, Nsight Compute reports no Tensor Core activity. The vector functions provided by the API are needed for things like activation functions in a neural network, but they run on the SM. It’s normally possible to achieve parity between vector operations using coopVec and code written without the coopVec API, though such code might be harder to tune.
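For illustration, elementwise coopVec operations like optixCoopVecSub and optixCoopVecMul are semantically just per-component arithmetic, so a plain (non-coopVec) equivalent of the batched distance step could look like this host-testable sketch (BATCH and batch_min_dist2 are hypothetical names, not part of the OptiX API):

```cpp
#include <cstddef>

constexpr int BATCH = 16;  // mirrors TENSOR_2D_BATCH_SIZE in the post

// Plain-scalar equivalent of optixCoopVecSub + optixCoopVecMul followed by
// the pairwise x^2 + y^2 reduction: returns the smallest squared distance
// in the batch that is within radius^2, or `cmin2` unchanged otherwise.
inline float batch_min_dist2(const float* a_xy, const float* b_xy,
                             float radius, float cmin2) {
  const float r2 = radius * radius;
  for (int i = 0; i < BATCH; ++i) {
    const float dx = a_xy[i * 2]     - b_xy[i * 2];
    const float dy = a_xy[i * 2 + 1] - b_xy[i * 2 + 1];
    const float dist2 = dx * dx + dy * dy;
    if (dist2 <= r2 && dist2 < cmin2) cmin2 = dist2;
  }
  return cmin2;
}
```

This is the kind of code the compiler can usually optimize to parity with the coopVec version, since both run on the SM’s regular FP32 units.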

Do you feel like the OptiX Programming Guide needs some clarification on Tensor Core Utilization when using the cooperative vectors API? We may need to improve it, and we are open to suggestions, corrections, and any other feedback, especially if there are confusing sections.

If you don’t need any matrix operations in this workflow, you can stick with your normal CUDA code, since it sounds like at least the initial implementation was faster. Or, to increase throughput when using fp16, you could consider half2 math or maybe CUDA SIMD intrinsics with fixed-point math.

I don’t fully understand your problem and solution, but since you’re using global memory traffic to collect points to test, that might be the bottleneck here rather than the Euclidean distance math. Does the Nsight profile agree with that, or disprove my assumption? I’m wondering whether there are different ways to frame your problem that might help accelerate it.


David.

My workload is effectively a KNN search with K = 1: given two sets of points, A and B, for each point in A, I want to find its closest point in B. I initially thought the vector APIs were also accelerated by Tensor Cores, so I accumulated batches of points from A and B to compute their Euclidean distances and kept track of the smallest distance. Now I understand why that didn’t boost performance: only matrix multiplication (MatMul) operations are accelerated by Tensor Cores. Thank you.
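For anyone validating such a K = 1 search, a tiny brute-force CPU reference can serve as ground truth against the BVH-based result (all names here are hypothetical, not from the original code):

```cpp
#include <vector>
#include <limits>

struct Point2 { float x, y; };

// Brute-force 1-nearest-neighbor: for each query point in A, return the
// index of its closest point in B (by squared Euclidean distance).
inline std::vector<int> nn1_bruteforce(const std::vector<Point2>& A,
                                       const std::vector<Point2>& B) {
  std::vector<int> nearest(A.size(), -1);
  for (std::size_t i = 0; i < A.size(); ++i) {
    float best = std::numeric_limits<float>::max();
    for (std::size_t j = 0; j < B.size(); ++j) {
      const float dx = A[i].x - B[j].x;
      const float dy = A[i].y - B[j].y;
      const float d2 = dx * dx + dy * dy;
      if (d2 < best) { best = d2; nearest[i] = static_cast<int>(j); }
    }
  }
  return nearest;
}
```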

