Profilometry, performance, and number of threads on the Jetson

Hello folks. I would like assistance with code that analyzes frames showing the profile of an object with a reflected laser line (3D profilometry) at time 1 and compares them with frames of the same object and reflected laser at time 2. At time 1, all reference frames of the object with the laser are produced; at time 2, all current frames are generated. My intention is to detect changes in the reflected-laser image and then judge whether the object's shape was modified.

I have two questions:

1. I am unable to define more than 32 threads per block for the GPU, but the command cuda.get_current_device().MAX_THREADS_PER_BLOCK returns 1024 threads per block. Why is this? How can I access all 1024 threads to improve the performance of the code?
2. Would the same code in C++/CUDA have significantly better performance? In other words, would it be faster?

Jetson Xavier NX
Ubuntu 20.04.6 LTS, JetPack 5.1, OpenCV 4.8.0 with CUDA 11.4.315
6-core NVIDIA Carmel ARM® v8.2 64-bit CPU
384-core NVIDIA Volta GPU with 48 Tensor Cores

SIZE: Number of frames (=500) stacked into 1 image
matrix: Image with SIZE frames (=500) stacked for the object (current instant)
matrix2: Image with SIZE frames (=500) stacked for the object (reference instant)
matrix3: Difference between the two images


1. All the frames are 1280x376 px
2. Assume the reflected laser in the current frames is pixel-aligned with the reflected laser in the reference frames
3. The first iteration of the third 'for' loop serves only to initialize the GPU; its measured time is disregarded


import numpy as np
import time
import cv2
import cupy as cp
from numba import cuda

@cuda.jit
def Threshold(matrix, matrix2, matrix3):
  i, j = cuda.grid(2)

  #Comparing matrix2 with matrix and producing matrix3
  if j < 1280 and i < SIZE*376:
    if (matrix2[i, j] == 255) and (matrix[i, j] >= 70):
      matrix3[i, j] = 255

start= time.time()

SIZE = 500
matrix3 = cp.zeros((SIZE*376, 1280), dtype=np.float32)

#Creating matrix 
list_image = []
for j in range(0, SIZE):
  image = cv2.imread(f'Path_to_directory/image{j}.bmp')
  list_image.append(image)
image = np.vstack(list_image)
matrix = cp.asarray(image[:, :, 1])

#Creating matrix2
list_image = []
for j in range(0, SIZE):
  image2 = cv2.imread(f'Path_to_directory/imagebase{j}.bmp')
  list_image.append(image2)
image2 = np.vstack(list_image)
matrix2 = cp.asarray(image2[:, :, 1])

threads_per_block = (32,32)
blocks_per_grid = (SIZE*12,40)

for i in range(0, 2):
  start_time = time.time()
  Threshold[blocks_per_grid, threads_per_block](matrix, matrix2, matrix3)
  cuda.synchronize()  # kernel launches are asynchronous; wait before timing

  print(f"GPU, iteration {i} --- {time.time() - start_time} seconds ---")
  print(f"Total time, iteration {i} --- {time.time() - start} seconds ---", '\n')

#cv2.namedWindow('Reference Image', cv2.WINDOW_NORMAL)
#cv2.imshow('Reference Image', image2)

#cv2.namedWindow('Raw Image', cv2.WINDOW_NORMAL)
#cv2.imshow('Raw Image', image)

#cv2.namedWindow('Filtered Image', cv2.WINDOW_NORMAL)
#cv2.imshow('Filtered Image', matrix3.get())


GPU, iteration 0 --- 2.500 seconds ---
Total time, iteration 0 --- 22.807 seconds ---

GPU, iteration 1 --- 0.119 seconds ---
Total time, iteration 1 --- 22.927 seconds ---

You are defining 1024 threads per block.
This statement:

threads_per_block = (32,32)

defines a 2D threadblock with side dimensions of 32 in x and 32 in y. The total number of threads per block is the product of those two numbers: 32 x 32 = 1024.

No, numba CUDA and equivalent code in CUDA C++ should have little or no performance difference.

From a performance perspective, this is a canonically bad pattern:

  i,j = cuda.grid(2)
  ...
  matrix3[i, j] = 255

The cuda.grid() statement in numba CUDA returns the dimensionally-unique thread indices in order x, y, z (so cuda.grid(2) returns x, y). In CUDA, for performance reasons (efficient, coalesced memory access), we'd like indices based on that x to be used in the last position of a multiply-subscripted access. But you are using i (the x index) in the first position to index into your matrices. Instead, do something like this:

@cuda.jit
def Threshold(matrix, matrix2, matrix3):
  i, j = cuda.grid(2)

  #Comparing matrix2 with matrix and producing matrix3
  if i < 1280 and j < SIZE*376:
    if (matrix2[j, i] == 255) and (matrix[j, i] >= 70):
      matrix3[j, i] = 255


with the grid dimensions swapped to match:

blocks_per_grid = (40, SIZE*12)

We can also probably do a slightly better job of grid sizing (launch fewer unnecessary blocks) by using a calculation like this:

blocks_per_grid = (40, ((SIZE*376)//32)+1)

For a SIZE of 500, your calculation launches 6000 blocks of 32 rows each, to cover the 188,000 rows. My calculation launches 5876 blocks to cover the same 188,000 rows. So we might estimate something like a 5876/6000 or ~2 percent performance improvement from that small optimization - not something you’re actually likely to observe.
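The same idea can be written as a general-purpose ceiling division (a sketch, reusing the frame dimensions from the post). Note that math.ceil launches one block fewer than the `//32 + 1` form whenever the row count divides evenly by 32, as 188,000 does here:

```python
import math

SIZE = 500
ROWS, COLS = SIZE * 376, 1280   # stacked-image dimensions from the post
threads_per_block = (32, 32)

# Ceiling division: launch just enough blocks to cover every column (x)
# and every row (y), and no more.
blocks_per_grid = (
    math.ceil(COLS / threads_per_block[0]),  # 1280 / 32  -> 40
    math.ceil(ROWS / threads_per_block[1]),  # 188000 / 32 -> 5875
)
```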


I guess I should also mention that if this is the only work you intend to do on the GPU (check each of two matrices against a threshold value and then set a value in a third matrix), this may not be a good fit for a GPU solution.

Your time measurement doesn't include the cost of transferring data to and from the device. Since the operation is so simple, it might be simpler, and possibly even faster, to do a good parallel job on the CPU side. You can still use numba for this (e.g. @jit), or there might even be a clever way to do it with numpy primitives.
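As a sketch of the numpy route (hypothetical small random arrays stand in for the stacked images; the 70/255 thresholds are the ones from the kernel):

```python
import numpy as np

# Hypothetical small stand-ins for the stacked current/reference images.
rng = np.random.default_rng(0)
matrix  = rng.integers(0, 256, size=(376, 1280), dtype=np.uint8)  # current
matrix2 = rng.integers(0, 256, size=(376, 1280), dtype=np.uint8)  # reference

# Vectorized equivalent of the kernel: one boolean mask, one assignment.
mask = (matrix2 == 255) & (matrix >= 70)
matrix3 = np.where(mask, 255, 0).astype(np.uint8)
```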


Sorry for the delay, and thank you very much for the prompt response! Based on the recommendations you provided, I have been working on the implementation and refining the code. In fact, I observed a ~63% improvement in processing time when I performed that inversion: i, j = cuda.grid(2) with matrix2[j, i] and matrix[j, i]. I understand that this is just a convention defined for memory access, right?

I’m not sure what that means. I would never use the words “just a convention” in the context of coalesced memory access on a CUDA GPU. You may wish to study section 4 of this online training series. The ideas are also covered in many forum posts such as this one.

Got it. However, in our case, the application needs to process 12000 frames in 1 second and will require more operations within the GPU. This was just an initial test to assess the performance of the Jetson. Currently, the code is processing 6000 frames (buffered into a single image transmitted to the GPU) in approximately 0.7 seconds. Furthermore, we are using the Jetson Xavier NX, but the project in question envisions the Jetson Orin AGX.

Another thing, does the direct GPU memory access feature exist in the Jetson Xavier NX? And in the Jetson Orin AGX and IGX models, is it already integrated into the system?

I’m not sure what that is. If you mean something like GPUDirect, I would suggest asking those questions on a relevant Jetson forum.

I am not sure what that means, but will point out – just in case – that a soft real-time processing pipeline usually benefits from the use of double buffering (with the associated use of CUDA streams, etc).

As @Robert_Crovella points out, performance questions related to NVIDIA’s embedded platforms usually get faster / better answers in the sub-forums dedicated to them, in this case:

Double buffering is probably sensible in the context of having an incoming buffer that is “filling” and a filled buffer for compute tasks. Streams may or may not matter here; there does need to be a ping-pong mechanism of some sort.
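A minimal ping-pong sketch of that idea: while one buffer is being filled, the buffer filled on the previous iteration is processed. Here fill() and process() are hypothetical stand-ins for the acquisition and GPU-compute steps, and everything runs sequentially just to show the buffer indexing; in a real pipeline the two would overlap (threads, streams, etc.):

```python
import numpy as np

# Two buffers that alternate roles each iteration.
buffers = [np.empty((376, 1280), dtype=np.uint8) for _ in range(2)]

def fill(buf, frame_no):
    buf[:] = frame_no % 256          # stand-in for copying a camera frame in

def process(buf):
    return int(buf[0, 0])            # stand-in for the GPU threshold step

results = []
for frame_no in range(4):
    filling = buffers[frame_no % 2]          # buffer being filled now
    ready   = buffers[(frame_no + 1) % 2]    # buffer filled last iteration
    fill(filling, frame_no)
    if frame_no > 0:                         # nothing ready on the first pass
        results.append(process(ready))
```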

Jetson enjoys unified physical memory, so the usual mechanisms for copy-compute overlap should be eschewed in favor of techniques such as pinned memory and managed memory, as covered here, dispensing with the usual H->D and D->H copying steps altogether (where one would otherwise typically employ streams).

Thank you, @njuffa, for the comment and the suggestion

I understand, @Robert_Crovella . Thank you for the help!