cuMemcpyHtoD failed: context is destroyed

Hi,

System:
TensorRT Version: 8.4.1.5
GPU Type: Jetson Xavier NX 16GB
CUDA Version: 11.4
PyCUDA Versions Tested: 2020.1, 2021.1, 2022.1
Operating System + Version: L4T 35.1 (JetPack 5.x.x)

When running a Python inference script with a TensorRT engine using PyCUDA, I get the following error. The same script used to work on a Jetson Xavier NX 8GB with JetPack 4.6 and TensorRT 8.0.1.
We tried the several PyCUDA versions listed above, all with the same result.

cuda.memcpy_htod(self.inputs[0]['allocation'], np.ascontiguousarray(batch))

pycuda._driver.LogicError: cuMemcpyHtoD failed: context is destroyed

Do you have any idea how to fix this, or where to start debugging?
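For context, a common cause of "cuMemcpyHtoD failed: context is destroyed" is that the CUDA context is no longer current (or has already been torn down) in the thread that performs the copy, e.g. when relying on pycuda.autoinit while allocation and inference happen in different threads or after interpreter teardown has begun. Below is a minimal sketch of explicit context management that avoids this; the class and variable names are illustrative and not from the original script:

```python
import numpy as np
import pycuda.driver as cuda
import tensorrt as trt


class TrtRunner:
    """Keeps an explicit CUDA context alive for the lifetime of the runner."""

    def __init__(self, engine_path):
        cuda.init()
        # Retain the context explicitly instead of relying on
        # pycuda.autoinit, whose context may be destroyed early.
        self.ctx = cuda.Device(0).make_context()  # context is now current
        logger = trt.Logger(trt.Logger.WARNING)
        with open(engine_path, "rb") as f:
            self.engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()
        self.ctx.pop()  # leave a clean context stack after setup

    def infer(self, batch):
        # Make our context current in the calling thread before any
        # PyCUDA call; pop it afterwards so other code is unaffected.
        self.ctx.push()
        try:
            d_input = cuda.mem_alloc(batch.nbytes)
            cuda.memcpy_htod(d_input, np.ascontiguousarray(batch))
            # ... execute_v2 / memcpy_dtoh as in the original script ...
        finally:
            self.ctx.pop()

    def close(self):
        self.ctx.detach()  # release the retained context on shutdown
```

If the original script calls cuda.memcpy_htod from a different thread than the one that created the context (or from a callback), the push/pop pair in infer() is the part that matters.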

When testing the engine with:
/usr/src/tensorrt/bin/trtexec --loadEngine=output/engine.trt --useCudaGraph --noDataTransfers --iterations=100 --avgRuns=100

we get the following output:

[10/14/2022-14:30:58] [I] === Model Options ===
[10/14/2022-14:30:58] [I] Format: *
[10/14/2022-14:30:58] [I] Model: 
[10/14/2022-14:30:58] [I] Output:
[10/14/2022-14:30:58] [I] === Build Options ===
[10/14/2022-14:30:58] [I] Max batch: 1
[10/14/2022-14:30:58] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[10/14/2022-14:30:58] [I] minTiming: 1
[10/14/2022-14:30:58] [I] avgTiming: 8
[10/14/2022-14:30:58] [I] Precision: FP32
[10/14/2022-14:30:58] [I] LayerPrecisions: 
[10/14/2022-14:30:58] [I] Calibration: 
[10/14/2022-14:30:58] [I] Refit: Disabled
[10/14/2022-14:30:58] [I] Sparsity: Disabled
[10/14/2022-14:30:58] [I] Safe mode: Disabled
[10/14/2022-14:30:58] [I] DirectIO mode: Disabled
[10/14/2022-14:30:58] [I] Restricted mode: Disabled
[10/14/2022-14:30:58] [I] Build only: Disabled
[10/14/2022-14:30:58] [I] Save engine: 
[10/14/2022-14:30:58] [I] Load engine: output/engine.trt
[10/14/2022-14:30:58] [I] Profiling verbosity: 0
[10/14/2022-14:30:58] [I] Tactic sources: Using default tactic sources
[10/14/2022-14:30:58] [I] timingCacheMode: local
[10/14/2022-14:30:58] [I] timingCacheFile: 
[10/14/2022-14:30:58] [I] Input(s)s format: fp32:CHW
[10/14/2022-14:30:58] [I] Output(s)s format: fp32:CHW
[10/14/2022-14:30:58] [I] Input build shapes: model
[10/14/2022-14:30:58] [I] Input calibration shapes: model
[10/14/2022-14:30:58] [I] === System Options ===
[10/14/2022-14:30:58] [I] Device: 0
[10/14/2022-14:30:58] [I] DLACore: 
[10/14/2022-14:30:58] [I] Plugins:
[10/14/2022-14:30:58] [I] === Inference Options ===
[10/14/2022-14:30:58] [I] Batch: 1
[10/14/2022-14:30:58] [I] Input inference shapes: model
[10/14/2022-14:30:58] [I] Iterations: 100
[10/14/2022-14:30:58] [I] Duration: 3s (+ 200ms warm up)
[10/14/2022-14:30:58] [I] Sleep time: 0ms
[10/14/2022-14:30:58] [I] Idle time: 0ms
[10/14/2022-14:30:58] [I] Streams: 1
[10/14/2022-14:30:58] [I] ExposeDMA: Disabled
[10/14/2022-14:30:58] [I] Data transfers: Disabled
[10/14/2022-14:30:58] [I] Spin-wait: Disabled
[10/14/2022-14:30:58] [I] Multithreading: Disabled
[10/14/2022-14:30:58] [I] CUDA Graph: Enabled
[10/14/2022-14:30:58] [I] Separate profiling: Disabled
[10/14/2022-14:30:58] [I] Time Deserialize: Disabled
[10/14/2022-14:30:58] [I] Time Refit: Disabled
[10/14/2022-14:30:58] [I] Inputs:
[10/14/2022-14:30:58] [I] === Reporting Options ===
[10/14/2022-14:30:58] [I] Verbose: Disabled
[10/14/2022-14:30:58] [I] Averages: 100 inferences
[10/14/2022-14:30:58] [I] Percentile: 99
[10/14/2022-14:30:58] [I] Dump refittable layers:Disabled
[10/14/2022-14:30:58] [I] Dump output: Disabled
[10/14/2022-14:30:58] [I] Profile: Disabled
[10/14/2022-14:30:58] [I] Export timing to JSON file: 
[10/14/2022-14:30:58] [I] Export output to JSON file: 
[10/14/2022-14:30:58] [I] Export profile to JSON file: 
[10/14/2022-14:30:58] [I] 
[10/14/2022-14:30:58] [I] === Device Information ===
[10/14/2022-14:30:58] [I] Selected Device: Xavier
[10/14/2022-14:30:58] [I] Compute Capability: 7.2
[10/14/2022-14:30:58] [I] SMs: 6
[10/14/2022-14:30:58] [I] Compute Clock Rate: 1.109 GHz
[10/14/2022-14:30:58] [I] Device Global Memory: 14906 MiB
[10/14/2022-14:30:58] [I] Shared Memory per SM: 96 KiB
[10/14/2022-14:30:58] [I] Memory Bus Width: 256 bits (ECC disabled)
[10/14/2022-14:30:58] [I] Memory Clock Rate: 1.109 GHz
[10/14/2022-14:30:58] [I] 
[10/14/2022-14:30:58] [I] TensorRT version: 8.4.1
[10/14/2022-14:30:58] [I] Engine loaded in 0.083291 sec.
[10/14/2022-14:30:59] [I] [TRT] [MemUsageChange] Init CUDA: CPU +186, GPU +0, now: CPU 256, GPU 6032 (MiB)
[10/14/2022-14:31:00] [I] [TRT] Loaded engine size: 46 MiB
[10/14/2022-14:31:02] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +261, GPU +265, now: CPU 537, GPU 6317 (MiB)
[10/14/2022-14:31:02] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +85, GPU +81, now: CPU 622, GPU 6398 (MiB)
[10/14/2022-14:31:02] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +46, now: CPU 0, GPU 46 (MiB)
[10/14/2022-14:31:02] [I] Engine deserialized in 4.30743 sec.
[10/14/2022-14:31:02] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 622, GPU 6398 (MiB)
[10/14/2022-14:31:02] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 622, GPU 6398 (MiB)
[10/14/2022-14:31:02] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +21, now: CPU 0, GPU 67 (MiB)
[10/14/2022-14:31:02] [I] Using random values for input input_tensor:0
[10/14/2022-14:31:02] [I] Created input binding for input_tensor:0 with dimensions 1x640x640x3
[10/14/2022-14:31:02] [I] Using random values for output num_detections
[10/14/2022-14:31:02] [I] Created output binding for num_detections with dimensions 1x1
[10/14/2022-14:31:02] [I] Using random values for output detection_boxes
[10/14/2022-14:31:02] [I] Created output binding for detection_boxes with dimensions 1x100x4
[10/14/2022-14:31:02] [I] Using random values for output detection_scores
[10/14/2022-14:31:02] [I] Created output binding for detection_scores with dimensions 1x100
[10/14/2022-14:31:02] [I] Using random values for output detection_classes
[10/14/2022-14:31:02] [I] Created output binding for detection_classes with dimensions 1x100
[10/14/2022-14:31:02] [I] Starting inference
[10/14/2022-14:31:06] [I] Warmup completed 0 queries over 200 ms
[10/14/2022-14:31:06] [I] Timing trace has 125 queries over 3.01144 s
[10/14/2022-14:31:06] [I] 
[10/14/2022-14:31:06] [I] === Trace details ===
[10/14/2022-14:31:06] [I] Trace averages of 100 runs:
[10/14/2022-14:31:06] [I] Average on 100 runs - GPU latency: 24.3308 ms - Host latency: 24.3308 ms (enqueue 0.335376 ms)
[10/14/2022-14:31:06] [I] 
[10/14/2022-14:31:06] [I] === Performance summary ===
[10/14/2022-14:31:06] [I] Throughput: 41.5084 qps
[10/14/2022-14:31:06] [I] Latency: min = 22.9097 ms, max = 83.074 ms, mean = 24.0906 ms, median = 23.1106 ms, percentile(99%) = 38.5914 ms
[10/14/2022-14:31:06] [I] Enqueue Time: min = 0.2146 ms, max = 1.84058 ms, mean = 0.340332 ms, median = 0.303223 ms, percentile(99%) = 1.08011 ms
[10/14/2022-14:31:06] [I] H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[10/14/2022-14:31:06] [I] GPU Compute Time: min = 22.9097 ms, max = 83.074 ms, mean = 24.0906 ms, median = 23.1106 ms, percentile(99%) = 38.5914 ms
[10/14/2022-14:31:06] [I] D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[10/14/2022-14:31:06] [I] Total Host Walltime: 3.01144 s
[10/14/2022-14:31:06] [I] Total GPU Compute Time: 3.01132 s
[10/14/2022-14:31:06] [W] * GPU compute time is unstable, with coefficient of variance = 23.6525%.
[10/14/2022-14:31:06] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[10/14/2022-14:31:06] [I] Explanations of the performance metrics are printed in the verbose logs.
[10/14/2022-14:31:06] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --loadEngine=output/engine.trt --useCudaGraph --noDataTransfers --iterations=100 --avgRuns=100

Hi,

We are moving this post to the Jetson Xavier NX forum to get better help.

Thank you.