Good day! I am trying to run object detection inference on multiple camera sources using Pebble's ProcessPool. The pipeline/concept has been tested and performs as expected on x86 servers, but I run into issues on the Jetson TX2. I have also tested Python's multiprocessing and pathos.multiprocessing, which produce the same issues.
Hardware Information:
Jetson TX2: Jetpack 4.6
Python 3.6.9
torch wheel: v1.9.0 from PyTorch for Jetson - #3 by dusty_nv
TRTorch compiled from GitHub - pytorch/TensorRT (PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT) with Jetpack 4.6 support.
Errors/issues faced:
THCudaCheck FAIL file=/media/nvidia/NVME/pytorch/pytorch-v1.9.0/aten/src/THC/THCGeneral.cpp line=54 error=3 : initialization error
Failed to run StreamPipeline: RuntimeError('cuda runtime error (3) : initialization error at /media/nvidia/NVME/pytorch/pytorch-v1.9.0/aten/src/THC/THCGeneral.cpp:54',)
The above error occasionally arises as:
Failed to run StreamPipeline: RuntimeError("Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method",)
Comments: Setting the 'spawn' context results in:
File "/usr/lib/python3.6/multiprocessing/context.py", line 242, in set_start_method
    raise RuntimeError('context has already been set')
RuntimeError: context has already been set
And the following error:
Failed to run StreamPipeline: RuntimeError('[Error thrown at core/runtime/runtime.cpp:12] Expected (cudaSetDevice(cuda_device.id) == cudaSuccess) to be true but got false\nUnable to set device: Device(ID: 0, Name: NVIDIA Tegra X2, SM Capability: 6.2, Type: GPU)as active device\n',)
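For reference, this is roughly how I set the spawn context (a sketch; whether Pebble's ProcessPool honors the globally set start method may depend on the Pebble version). Passing force=True is what avoids the 'context has already been set' error when something has already fixed the default context:

import multiprocessing as mp

if __name__ == "__main__":
    # force=True overrides a context that was already set implicitly,
    # e.g. by an earlier import; without it this call raises
    # "RuntimeError: context has already been set".
    mp.set_start_method("spawn", force=True)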
How to reproduce:
Test model compiled using TRTorch for the Jetson TX2: nvidia_ssd_300.ts - Google Drive
The following script will help reproduce the errors:
import torch
import trtorch
import time
import numpy as np
from pebble import ProcessPool
import multiprocessing as mp

# Pool is created at import time, i.e. before the __main__ guard runs.
pool = ProcessPool(max_workers=3, max_tasks=2)
futures = dict()


def test_case(test_argument):
    # First check: create a small CUDA tensor in the worker.
    try:
        print(test_argument)
        test_tensor = torch.tensor(
            np.array([[1, 2, 3], [4, 5, 6]]),
            dtype=torch.float16,
            device=torch.device("cuda:0"),
        ).clamp(0, 1)
        print(test_tensor)
    except Exception as e:
        print("Failed to run StreamPipeline: " + repr(e))
    # Second check: load the TRTorch-compiled model in the worker.
    try:
        t_0 = time.time()
        print('Loading Model...')
        torch.jit.load("./nvidia_ssd_300.ts")
        print('Model loaded in:', time.time() - t_0)
    except Exception as e:
        print("Failed to run StreamPipeline: " + repr(e))


def main():
    global pool
    # Baseline: run once in the parent process.
    test_argument = 'Without multiprocessing'
    test_case(test_argument)
    # Then run the same checks in worker processes, one per camera.
    cam = ['cam1', 'cam2']
    for key in cam:
        test_argument = 'With multiprocessing'
        future = pool.schedule(test_case, [test_argument])
        futures[key] = future
        print(future.result())


if __name__ == "__main__":
    main()
Sample output:
nvidia@tx2:~/development$ python3 reproduce.py
Without multiprocessing
tensor([[1., 1., 1.],
[1., 1., 1.]], device='cuda:0', dtype=torch.float16)
Loading Model...
Model loaded in: 6.3753883838653564
With multiprocessing
Failed to run StreamPipeline: RuntimeError("Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method",)
Loading Model...
Failed to run StreamPipeline: RuntimeError('[Error thrown at core/runtime/runtime.cpp:12] Expected (cudaSetDevice(cuda_device.id) == cudaSuccess) to be true but got false\nUnable to set device: Device(ID: 0, Name: NVIDIA Tegra X2, SM Capability: 6.2, Type: GPU)as active device\n',)
None
With multiprocessing
Failed to run StreamPipeline: RuntimeError("Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method",)
Loading Model...
Failed to run StreamPipeline: RuntimeError('[Error thrown at core/runtime/runtime.cpp:12] Expected (cudaSetDevice(cuda_device.id) == cudaSuccess) to be true but got false\nUnable to set device: Device(ID: 0, Name: NVIDIA Tegra X2, SM Capability: 6.2, Type: GPU)as active device\n',)
None
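Based on the PyTorch multiprocessing notes, my understanding is that a spawn-safe version of the reproduction script should look roughly like the sketch below: the pool is created only inside the __main__ guard, after the start method is set, and torch/trtorch are imported only inside the worker so that the parent process never initializes CUDA.

import time
import numpy as np
import multiprocessing as mp
from pebble import ProcessPool


def test_case(test_argument):
    # Import torch/trtorch only in the worker so CUDA is first
    # initialized in the child process, never in the parent.
    import torch
    import trtorch  # registers the TensorRT ops needed by torch.jit.load
    try:
        print(test_argument)
        test_tensor = torch.tensor(
            np.array([[1, 2, 3], [4, 5, 6]]),
            dtype=torch.float16,
            device="cuda:0",
        ).clamp(0, 1)
        print(test_tensor)
        torch.jit.load("./nvidia_ssd_300.ts")
    except Exception as e:
        print("Failed to run StreamPipeline: " + repr(e))


def main():
    with ProcessPool(max_workers=3, max_tasks=2) as pool:
        for cam in ('cam1', 'cam2'):
            future = pool.schedule(test_case, ['With multiprocessing ' + cam])
            print(future.result())


if __name__ == "__main__":
    mp.set_start_method('spawn', force=True)
    main()

I am not certain whether Pebble spawns its workers using the globally set start method, so this restructuring may behave differently across Pebble versions.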
Additional Information:
Sample Debug from TRTorch model compilation:
This shows that the relevant devices/resources are detected and that the model compiles successfully.
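For reference, the compile step was along these lines (a sketch from memory; the traced-model filename is a placeholder and the compile-spec keys may differ between TRTorch versions):

import torch
import trtorch

# Load the traced SSD300 (placeholder filename) and build an FP16 TensorRT engine.
model = torch.jit.load("nvidia_ssd_300_traced.ts").eval().cuda()
compile_spec = {
    "inputs": [trtorch.Input((1, 3, 300, 300))],  # matches the input shape in the log
    "enabled_precisions": {torch.half},           # FP16, as in the debug output
}
trt_model = trtorch.compile(model, compile_spec)
trt_model.save("nvidia_ssd_300.ts")

The debug output from compiling and testing the model: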
model compile time 60.87920784950256
compiled
test results for FP16 TensorRT model
DEBUG: [TRTorch] - Attempting to run engine (ID: __torch___PyTorch_Detection_SSD_src_model_SSD300_trt_engine_)
DEBUG: [TRTorch] - Current Device: Device(ID: 0, Name: NVIDIA Tegra X2, SM Capability: 6.2, Type: GPU)
DEBUG: [TRTorch] - Requested padding of dimensions to 1 but found 4 dimensions, not going to pad
DEBUG: [TRTorch] - Input shape: [1, 3, 300, 300]
DEBUG: [TRTorch] - Output shape: [1, 4, 8732]
DEBUG: [TRTorch] - Output shape: [1, 81, 8732]
Prediction time 0.010217428207397461
(tensor([[[ 0.9756, 0.1747, -1.3818, ..., -0.2827, -0.1379, -0.1949],
[ 1.0098, 1.3086, 0.9346, ..., 0.4170, 0.5889, 0.3745],
[-1.7295, -0.6611, -0.9873, ..., -0.9341, -1.3926, 0.4646],
[-1.3105, -1.6895, -1.5869, ..., -1.6016, -0.0120, -1.7568]]],
device='cuda:0'), tensor([[[ 7.5703, 7.7461, 7.7930, ..., 8.8203, 7.4492, 8.6328],
[ 1.9795, 2.0801, 2.9316, ..., 2.0371, 2.4141, 2.1484],
[-0.3008, -0.2059, 0.0671, ..., -0.2512, -0.2142, -0.1271],
...,
[-0.9277, -1.1660, -1.0635, ..., -0.8608, -0.8203, -0.7422],
[-0.1588, -0.1909, -0.3755, ..., -0.8267, -0.8628, -0.8159],
[-0.1450, -0.5020, -0.4402, ..., -0.7510, -0.8198, -0.7646]]],
device='cuda:0'))
DEBUG: [TRTorch] - Serialized Device Info: 0%6%2%0%NVIDIA Tegra X2
model saved