Hello folks. I would like assistance with code that analyzes frames showing the profile of an object with a reflected laser line (3D profilometry) at time 1 and compares them with frames of the same object and laser at time 2. At time 1 all reference frames of the object with the laser are captured, and at time 2 all current frames are captured. My goal is to detect changes in the reflected laser image and then judge whether the object's shape has changed.
I have two questions:
1. I am unable to define more than 32 threads per block dimension for the GPU (threads_per_block = (32, 32) in the code below), but cuda.get_current_device().MAX_THREADS_PER_BLOCK returns 1024 threads per block. Why is this? How can I use all 1024 threads to improve the performance of the code? (See the sketch after question 2 for what I query and launch.)
2. Would the same code written in C++/CUDA have significantly better performance, i.e., would it run faster?
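Regarding question 1, here is a minimal sketch of what I query and what I launch. The MAX_BLOCK_DIM_X / MAX_BLOCK_DIM_Y attribute names are an assumption on my part, expected to be exposed by Numba in the same way as MAX_THREADS_PER_BLOCK:

from numba import cuda

dev = cuda.get_current_device()
print(dev.MAX_THREADS_PER_BLOCK)                   # returns 1024 on the Xavier NX
print(dev.MAX_BLOCK_DIM_X, dev.MAX_BLOCK_DIM_Y)    # per-dimension block limits (assumed attributes)

# Block shape that launches without errors in the code below:
threads_per_block = (32, 32)                       # 32 x 32 = 1024 threads in total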
SYSTEM USED:
Jetson Xavier NX
Ubuntu 20.04.6 LTS, JetPack 5.1, OpenCV 4.8.0 with CUDA 11.4.315
6-core NVIDIA Carmel ARM® v8.2 64-bit CPU
384-core NVIDIA Volta GPU with 48 Tensor Cores
VARIABLES USED:
SIZE: Number of frames (=500) stacked into 1 image
matrix: Image with SIZE frames (=500) stacked for the object (current instant)
matrix2: Image with SIZE frames (=500) stacked for the object (reference instant)
matrix3: Comparison result between the two images (output of the Threshold kernel)
ASSUMPTIONS:
1. All the frames are 1280x376 px (the launch grid below follows from this; see the sketch after these assumptions)
2. The reflected laser in the current frames is assumed to be pixel-aligned with the reflected laser in the reference frames
3. The first iteration of the third 'for' loop serves only to initialize the GPU; its measured time is disregarded
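Following assumption 1, a small sketch of how the launch grid used below can be derived from the stacked image shape by ceiling division (math.ceil is the only addition; the constants match the code):

import math

SIZE = 500                      # number of stacked frames
rows, cols = SIZE*376, 1280     # stacked image shape (assumption 1)
threads_per_block = (32, 32)

# Ceiling division so every pixel is covered by at least one thread
blocks_per_grid = (math.ceil(rows / threads_per_block[0]),
                   math.ceil(cols / threads_per_block[1]))
# For SIZE = 500 this gives (5875, 40); the code below uses (SIZE*12, 40) = (6000, 40),
# which also covers the whole 188000 x 1280 image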
GENERATED CODE:
import numpy as np
import time
import cv2
import cupy as cp
from numba import cuda
@cuda.jit(cache=True)
def Threshold(matrix, matrix2, matrix3):
    i, j = cuda.grid(2)
    #Comparing matrix2 with matrix and producing matrix3
    #SIZE is read as a compile-time constant when the kernel is first launched
    if j < 1280 and i < SIZE*376:
        if (matrix2[i, j] == 255) and (matrix[i, j] >= 70):
            matrix3[i, j] = 255
start= time.time()
SIZE = 500
matrix3 = cp.zeros((SIZE*376, 1280), dtype=np.float32)
#Creating matrix
list_image = []
for j in range(0, SIZE):
    image = cv2.imread(f'Path_to_directory/image{j}.bmp')
    list_image.append(image)
image = np.vstack(list_image)
matrix = cp.asarray(image[:, :, 1])
#Creating matrix2
list_image = []
for j in range(0, SIZE):
    image2 = cv2.imread(f'Path_to_directory/imagebase{j}.bmp')
    list_image.append(image2)
image2 = np.vstack(list_image)
matrix2 = cp.asarray(image2[:, :, 1])
threads_per_block = (32, 32)
blocks_per_grid = (SIZE*12, 40)
for i in range(0, 2):
    start_time = time.time()
    Threshold[blocks_per_grid, threads_per_block](matrix, matrix2, matrix3)
    cuda.synchronize()
    print(f"GPU, iteration {i} --- {time.time() - start_time} seconds ---")
    print(f"Total time, iteration {i} --- {time.time() - start} seconds ---", '\n')
#cv2.namedWindow('Reference Image', cv2.WINDOW_NORMAL)
#cv2.imshow('Reference Image', image2)
#cv2.namedWindow('Raw Image', cv2.WINDOW_NORMAL)
#cv2.imshow('Raw Image', image)
#cv2.namedWindow('Filtered Image', cv2.WINDOW_NORMAL)
#cv2.imshow('Filtered Image', matrix3.get().astype(np.uint8))
cv2.waitKey(0)
cv2.destroyAllWindows()
OUTPUTS:
GPU, iteration 0 --- 2.500 seconds
Total time, iteration 0 --- 22.807 seconds
GPU, iteration 1 --- 0.119 seconds
Total time, iteration 1 --- 22.927 seconds
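For completeness, here is a sketch of how the kernel alone could be timed with CUDA events instead of time.time(), assuming Numba's event API (cuda.event / cuda.event_elapsed_time) and reusing the arrays and launch configuration from the code above:

from numba import cuda

start_ev = cuda.event(timing=True)
stop_ev = cuda.event(timing=True)

start_ev.record()
Threshold[blocks_per_grid, threads_per_block](matrix, matrix2, matrix3)
stop_ev.record()
stop_ev.synchronize()

elapsed_ms = cuda.event_elapsed_time(start_ev, stop_ev)   # milliseconds
print(f"Kernel time: {elapsed_ms:.3f} ms")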