CUDARuntimeError: cudaErrorIllegalAddress: an illegal memory access was encountered

I recently updated the Holoscan SDK from the older v1.0.3 to v2.1.0. My application was running fine on the older SDK, but after the update the app stopped working. It still runs with a video file through this path (Path 1: replayer, ImageProcessing, preprocessor, inference, postprocessor, PostImageProcessing, viz), but when I try to run it with a live video feed from the AJA source operator the app throws the error below. The error appears exactly when I try to preprocess the frame, and it only happens with the AJA source, not with the replayer.
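For context, this is roughly how the source is selected in compose() (a simplified sketch; the real constructor arguments come from the YAML config, and the downstream wiring follows the path listed above):

# Simplified sketch of the source selection in compose(); the actual operator
# parameters come from the application's YAML config. The AJASourceOp import
# path may differ depending on whether it ships with the SDK or HoloHub.
from holoscan.operators import AJASourceOp, VideoStreamReplayerOp

def make_source(app, source_name):
    """Return the capture source: 'aja' is the failing path, the replayer works."""
    if source_name.lower() == "aja":
        return AJASourceOp(app, name="aja", **app.kwargs("aja"))
    return VideoStreamReplayerOp(app, name="replayer", **app.kwargs("replayer"))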

sudo python3 AJA_arthrosegmentation_debugging.py -s aja
[info] [gxf_executor.cpp:248] [AJA_arthrosegmentation] Creating context
[info] [gxf_executor.cpp:1691] Loading extensions from configs...
[warning] [gxf_resource.cpp:175] Existing entity already has a GPUDevice resource
[info] [gxf_executor.cpp:1897] Activating Graph...
[info] [gxf_executor.cpp:1929] [AJA_arthrosegmentation] Running Graph...
[info] [gxf_executor.cpp:1931] [AJA_arthrosegmentation] Waiting for completion...
2024-06-19 15:46:08.587 INFO  gxf/std/greedy_scheduler.cpp@191: Scheduling 8 entities
[info] [aja_source.cpp:386] AJA Source: Capturing from NTV2_CHANNEL1
[info] [aja_source.cpp:387] AJA Source: RDMA is disabled
[info] [aja_source.cpp:393] AJA Source: Overlay output is disabled
[info] [infer_utils.cpp:222] Input tensor names empty from Config. Creating from pre_processor map.
[info] [infer_utils.cpp:224] Input Tensor names: [source_video]
[info] [infer_utils.cpp:258] Output tensor names empty from Config. Creating from inference map.
[info] [infer_utils.cpp:260] Output Tensor names: [output]
[info] [inference.cpp:208] Inference Specifications created
[info] [infer_manager.cpp:825] Inference context ID: AJA_arthrosegmentation_[]_
[info] [core.cpp:46] TRT Inference: converting ONNX model at ../data/arthroscopic_segmentation/model/model_full_image_for_clahe_converted.onnx
[info] [utils.cpp:76] Cached engine found: ../data/arthroscopic_segmentation/model/model_full_image_for_clahe_converted.Orin.8.7.16.trt.8.6.1.6.engine.fp16
[info] [core.cpp:79] Loading Engine: ../data/arthroscopic_segmentation/model/model_full_image_for_clahe_converted.Orin.8.7.16.trt.8.6.1.6.engine.fp16
[info] [utils.hpp:46] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[info] [core.cpp:122] Engine loaded: ../data/arthroscopic_segmentation/model/model_full_image_for_clahe_converted.Orin.8.7.16.trt.8.6.1.6.engine.fp16
[info] [infer_manager.cpp:386] HoloInfer buffer created for output
[info] [inference.cpp:219] Inference context setup complete
error: XDG_RUNTIME_DIR not set in the environment.
[info] [context.cpp:50] _______________
[info] [context.cpp:50] Vulkan Version:
[info] [context.cpp:50]  - available:  1.3.204
[info] [context.cpp:50]  - requesting: 1.2.0
[info] [context.cpp:50] ______________________
[info] [context.cpp:50] Used Instance Layers :
[info] [context.cpp:50] 
[info] [context.cpp:50] Used Instance Extensions :
[info] [context.cpp:50] VK_KHR_surface
[info] [context.cpp:50] VK_KHR_xcb_surface
[info] [context.cpp:50] VK_EXT_debug_utils
[info] [context.cpp:50] VK_KHR_external_memory_capabilities
[info] [context.cpp:50] ____________________
[info] [context.cpp:50] Compatible Devices :
[info] [context.cpp:50] 0: NVIDIA Tegra Orin (nvgpu)
[info] [context.cpp:50] Physical devices found : 
[info] [context.cpp:50] 1
[info] [context.cpp:50] ________________________
[info] [context.cpp:50] Used Device Extensions :
[info] [context.cpp:50] VK_KHR_swapchain
[info] [context.cpp:50] VK_KHR_external_memory
[info] [context.cpp:50] VK_KHR_external_memory_fd
[info] [context.cpp:50] VK_KHR_external_semaphore
[info] [context.cpp:50] VK_KHR_external_semaphore_fd
[info] [context.cpp:50] VK_KHR_push_descriptor
[info] [context.cpp:50] VK_EXT_line_rasterization
[info] [context.cpp:50] 
[info] [vulkan_app.cpp:845] Using device 0: NVIDIA Tegra Orin (nvgpu) (UUID 40d49d1be05a5cd98e6a4eb6cbd06e34)
frame count: 0
[error] [gxf_wrapper.cpp:84] Exception occurred for operator: 'ImageProcessing' - CUDARuntimeError: cudaErrorIllegalAddress: an illegal memory access was encountered

At:
  cupy_backends/cuda/api/runtime.pyx(144): cupy_backends.cuda.api.runtime.check_status
  /usr/local/lib/python3.10/dist-packages/cupy/_creation/from_data.py(75): asarray
  /opt/nvidia/holoscan/examples/MyModel_laser_segmentation/python/AJA_arthrosegmentation_debugging.py(158): compute

2024-06-19 15:46:09.300 ERROR gxf/std/entity_executor.cpp@552: Failed to tick codelet ImageProcessing in entity: ImageProcessing code: GXF_FAILURE
2024-06-19 15:46:09.300 WARN  gxf/std/greedy_scheduler.cpp@243: Error while executing entity 26 named 'ImageProcessing': GXF_FAILURE
[info] [utils.hpp:46] 1: [defaultAllocator.cpp::deallocate::61] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[info] [utils.hpp:46] 1: [defaultAllocator.cpp::deallocate::61] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[info] [utils.hpp:46] 1: [defaultAllocator.cpp::deallocate::61] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[info] [utils.hpp:46] 1: [cudaResources.cpp::~ScopedCudaStream::47] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[info] [utils.hpp:46] 1: [cudaResources.cpp::~ScopedCudaEvent::24] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[info] [utils.hpp:46] 1: [cudaResources.cpp::~ScopedCudaEvent::24] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[info] [utils.hpp:46] 1: [cudaResources.cpp::~ScopedCudaEvent::24] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[info] [utils.hpp:46] 1: [cudaResources.cpp::~ScopedCudaEvent::24] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[info] [utils.hpp:46] 1: [cudaResources.cpp::~ScopedCudaEvent::24] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[info] [utils.hpp:46] 1: [cudaResources.cpp::~ScopedCudaEvent::24] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[info] [utils.hpp:46] 1: [cudaResources.cpp::~ScopedCudaEvent::24] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[info] [utils.hpp:46] 1: [cudaResources.cpp::~ScopedCudaEvent::24] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[info] [utils.hpp:46] 1: [cudaResources.cpp::~ScopedCudaEvent::24] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[info] [utils.hpp:46] 1: [cudaResources.cpp::~ScopedCudaEvent::24] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[info] [utils.hpp:46] 1: [cudaResources.cpp::~ScopedCudaEvent::24] Error Code 1: Cuda Runtime (an illegal memory access was encountered)

Compute method of the ImageProcessing operator:

    def compute(self, op_input, op_output, context):
        
        global global_range_start

        with nvtx.annotate(message="Image Processing", color="blue"):
                
                
            # Record the start time
            start_time = time.time()
            ## Preprocess file

            image_size = 1024
            resize_size = (1920, 1080)

            self.final_size = (image_size, image_size)
            
            #load the input tensor/original image 
            message_frame = op_input.receive("input_tensor")   # Receive the input tensor  

            print("frame count:", self.framecount)

            input_tensor = message_frame.get("")

            frame = cp.asnumpy(input_tensor)
            
            #frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) 

            #save input frame
            i = self.framecount
            # dir = "DebuggingImageProcessing"
            # os.makedirs(dir, exist_ok=True)
            # filename_in = os.path.join(dir, f"frame_in{i}.png")

            # cv2.imwrite(filename_in, cv2.cvtColor(frame, cv2.COLOR_RGB2BGR))

            #print("'np_frame_array':"np_frame_array)
            #print("Type of 'np_frame_array':"np_frame_array)
            #print("PREPROCESSING: Shape of 'frame in':", frame.shape)
            #print("PREPROCESSING: dtype of 'frame in': ", frame.dtype)
            #assert isinstance(frame, np.ndarray)
            
            self.original_size = tuple(reversed(frame.shape[:-1]))
            self.resized_size = resize_size
            start_time_preprocessing = time.time()
            processed_frame = holoscan_preprocessing.run(frame)
            end_time_preprocessing = time.time()
            #print(f"time pre-processing: {end_time_preprocessing - start_time_preprocessing}", flush=True) # Print the time taken for pre-processing in seconds 
            

            ''' 
            # Python Preprocessing Code
            # load        
            self.original_size = tuple(reversed(frame.shape[:-1]))

            # resize
            resized_image = ImageProcessingOp.resize(frame, size = resize_size, is_label = False)

            self.resized_size = tuple(reversed(resized_image.shape[:-1]))

            # crop image
            size_middle = (resize_size[0] - resize_size[1]) // 2                              # 420
            self.crop_slice = slice(size_middle, resize_size[0] - size_middle)                # (0, 1080)
            cropped_image = resized_image[:, self.crop_slice]                                 # Crop the image

            self.cropped_size = tuple(reversed(cropped_image.shape[:-1]))                     # (1080, 1080)

            processed_image, crop_mask = ImageProcessingOp.crop_outside_circle(cropped_image) # Crop the outside of the circle

            roi_mask = ~crop_mask[...,np.newaxis]

            clahe = cv2.createCLAHE()
            
            for channel_idx, channel in enumerate(np.moveaxis(processed_image, -1, 0).copy()):
                processed_image[roi_mask[...,0], channel_idx] = np.squeeze(clahe.apply(channel[roi_mask[...,0]]), axis=-1)

            processed_frame = ImageProcessingOp.resize(processed_image,
                                    self.final_size,
                                    is_label=False)

            '''
            self.image = processed_frame

            #print("PREPROCESSING: Shape of 'frame out resized':", processed_frame.shape)
            #print("PREPROCESSING: dtype of 'frame out resized': ", processed_frame.dtype)
            
            
            
            # # Record the end time
            # end_time = time.time()
            # # Calculate and print the FPS
            # fps = 1.0 / (end_time - start_time)
            # print(f"FPS pre-processing: {fps}", flush=True)
            

            #processed_frame = cv2.cvtColor(processed_frame, cv2.COLOR_RGB2BGR)

            #save the preprocessed frame    
            # filename_out = os.path.join(dir, f"frame_processed_out{i}.png")
            # cv2.imwrite(filename_out, cv2.cvtColor(processed_frame, cv2.COLOR_RGB2BGR))
            
            processed_frame = cp.asarray(processed_frame)

            self.framecount += 1

            out_message = Entity(context)
            out_message.add(hs.as_tensor(processed_frame),"") 

            # Send the processed frame to the output tensor
            op_output.emit(out_message, "output_tensor")

            # Start a new NVTX range and store it in the global variable
            global_range_start = nvtx.start_range(message="Inference", color="red")


    def stop(self):
        pass
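Since the traceback points at the CuPy call on the received tensor, a quick check at the top of compute() should show where the frame's memory actually lives when it comes from the AJA source (a rough sketch; it assumes the Holoscan Tensor exposes the standard __array_interface__ / __cuda_array_interface__ attributes depending on where the data resides, which is worth verifying for this SDK version):

# Debug sketch for the start of compute(): report whether the received tensor
# exposes a device or host array interface before handing it to CuPy.
# Assumes the existing module-level "import cupy as cp".
message_frame = op_input.receive("input_tensor")
input_tensor = message_frame.get("")

print("input tensor -> has __cuda_array_interface__:",
      hasattr(input_tensor, "__cuda_array_interface__"),
      "| has __array_interface__:",
      hasattr(input_tensor, "__array_interface__"),
      flush=True)

frame = cp.asnumpy(input_tensor)  # the existing call that currently raises the error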

I have an update on this issue. I checked the new release notes and noticed that the update changed the way FormatConverterOp handles host/device copies. I was using CuPy to access the frame that arrives as a Holoscan Tensor, and I think the error I am facing is related to CuPy expecting the data to already be on the GPU; after the update, the change in how FormatConverterOp automatically performs the host->device copy could be the problem. What can I do to fix this?

BayerDemosaicOp and FormatConverterOp will now automatically perform host->device copies if needed for either nvidia::gxf::VideoBuffer or Tensor inputs. Previously these operators only did the transfer automatically for nvidia::gxf::VideoBuffer, but not for Tensor and in the case of FormatConverterOp that transfer was only automatically done for pinned host memory. As of this release both operators will only copy unpinned system memory, leaving pinned host memory as-is.
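If that is what is happening here, one possible workaround (my own sketch, not something taken from the SDK documentation) would be to stop assuming the frame is already device memory and handle both cases explicitly before any CuPy call:

# Workaround sketch: convert the received tensor to a NumPy array without
# assuming it is already device memory. The interface check is an assumption
# about how the Holoscan Tensor reports its memory location.
import cupy as cp
import numpy as np

def tensor_to_numpy(tensor):
    if hasattr(tensor, "__cuda_array_interface__"):
        return cp.asnumpy(cp.asarray(tensor))   # device memory: explicit copy to host
    return np.asarray(tensor)                   # host (possibly pinned) memory

# in compute():
#     frame = tensor_to_numpy(message_frame.get(""))
#     ...
#     processed_frame = cp.asarray(processed_frame)  # explicit copy back to the GPU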

Is there anything I can do to solve this issue? After a month I am still not able to run my app after updating to the new SDK.

Apologies for the late reply.
In v2.2 the explicit copy of pinned host memory was restored. Can you try updating from 2.1 to 2.2 to see if that resolves the issue?
I am trying to get an AJA capture card myself to help resolve this better.
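Also, a quick way to confirm which SDK version the application is actually importing after the upgrade (assuming the Python package exposes __version__):

# Check the Holoscan SDK version the script picks up at runtime.
import holoscan
print(holoscan.__version__)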