Multiprocessing on Jetson

Good day! I am trying to run object detection inference on multiple camera sources using Pebble's ProcessPool. The pipeline/concept has been tested and performs as expected on x86 servers; however, I run into issues on the Jetson TX2. I have also tested Python's multiprocessing and pathos.multiprocessing, which produce the same issues.

Hardware Information:

Jetson TX2: JetPack 4.6
Python 3.6.9
torch wheel: v1.9.0 from PyTorch for Jetson - #3 by dusty_nv
TRTorch: compiled from GitHub - pytorch/TensorRT (PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT) with JetPack 4.6 support.

Errors/issues faced:

THCudaCheck FAIL file=/media/nvidia/NVME/pytorch/pytorch-v1.9.0/aten/src/THC/THCGeneral.cpp line=54 error=3 : initialization error
Failed to run StreamPipeline: RuntimeError('cuda runtime error (3) : initialization error at /media/nvidia/NVME/pytorch/pytorch-v1.9.0/aten/src/THC/THCGeneral.cpp:54',)

The above error occasionally arises as:
Failed to run StreamPipeline: RuntimeError("Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method",)

Comments: Setting the 'spawn' context results in:
  File "/usr/lib/python3.6/multiprocessing/context.py", line 242, in set_start_method
    raise RuntimeError('context has already been set')
RuntimeError: context has already been set
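
For completeness, my understanding is that the start method has to be set before anything initializes the default context (the pool in my script is created at import time). Below is a minimal sketch of what I believe the intended usage looks like, with force=True to override an already-set context (names here are illustrative, and newer Pebble releases reportedly also accept a context argument on ProcessPool):

import multiprocessing as mp
import numpy as np
from pebble import ProcessPool

def cuda_probe(msg):
    # torch is imported inside the worker so that CUDA is first
    # touched in the spawned child, not inherited via fork.
    import torch
    t = torch.tensor(np.array([1.0]), device="cuda:0")
    return msg, str(t.device)

if __name__ == "__main__":
    # Must run before a pool (or a CUDA call) initializes the
    # default context; force=True overrides an already-set context.
    mp.set_start_method("spawn", force=True)
    with ProcessPool(max_workers=2) as pool:
        print(pool.schedule(cuda_probe, ["spawn worker"]).result())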

The second error we face is:
Failed to run StreamPipeline: RuntimeError('[Error thrown at core/runtime/runtime.cpp:12] Expected (cudaSetDevice(cuda_device.id) == cudaSuccess) to be true but got false\nUnable to set device: Device(ID: 0, Name: NVIDIA Tegra X2, SM Capability: 6.2, Type: GPU)as active device\n',)

How to reproduce:
Test model compiled using TRTorch for the Jetson TX2: nvidia_ssd_300.ts - Google Drive
The following script reproduces the errors:

import torch
import trtorch  # registers the TRTorch/TensorRT runtime ops required to deserialize the compiled model
import time
import numpy as np
from pebble import ProcessPool
import multiprocessing as mp

pool = ProcessPool(max_workers=3, max_tasks=2)
futures = dict()

def test_case(test_argument):
    # First check: can this process create a CUDA tensor at all?
    try:
        print(test_argument)
        test_tensor = torch.tensor(
            np.array([[1, 2, 3], [4, 5, 6]]),
            dtype=torch.float16,
            device=torch.device("cuda:0"),
        ).clamp(0, 1)
        print(test_tensor)
    except Exception as e:
        print("Failed to run StreamPipeline: " + repr(e))

    # Second check: can this process deserialize the TRTorch-compiled model?
    try:
        t_0 = time.time()
        print('Loading Model...')
        torch.jit.load("./nvidia_ssd_300.ts")
        print('Model loaded in:', time.time() - t_0)
    except Exception as e:
        print("Failed to run StreamPipeline: " + repr(e))

def main():
    # Baseline: run the checks once in the parent process.
    test_argument = 'Without multiprocessing'
    test_case(test_argument)

    # Repeat the same checks inside pooled worker processes.
    cam = ['cam1', 'cam2']
    for key in cam:
        test_argument = 'With multiprocessing '
        future = pool.schedule(test_case, [test_argument])
        futures[key] = future
        print(future.result())

if __name__ == "__main__":
    main()

Sample output:

nvidia@tx2:~/development$ python3 reproduce.py 
Without multiprocessing
tensor([[1., 1., 1.],
        [1., 1., 1.]], device='cuda:0', dtype=torch.float16)
Loading Model...
Model loaded in: 6.3753883838653564
With multiprocessing 
Failed to run StreamPipeline: RuntimeError("Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method",)
Loading Model...
Failed to run StreamPipeline: RuntimeError('[Error thrown at core/runtime/runtime.cpp:12] Expected (cudaSetDevice(cuda_device.id) == cudaSuccess) to be true but got false\nUnable to set device: Device(ID: 0, Name: NVIDIA Tegra X2, SM Capability: 6.2, Type: GPU)as active device\n',)
None
With multiprocessing 
Failed to run StreamPipeline: RuntimeError("Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method",)
Loading Model...
Failed to run StreamPipeline: RuntimeError('[Error thrown at core/runtime/runtime.cpp:12] Expected (cudaSetDevice(cuda_device.id) == cudaSuccess) to be true but got false\nUnable to set device: Device(ID: 0, Name: NVIDIA Tegra X2, SM Capability: 6.2, Type: GPU)as active device\n',)
None

Additional Information:
Sample debug output from TRTorch model compilation, showing that the relevant devices/resources are detected and the model compiles successfully:

model compile time 60.87920784950256
compiled
test results for FP16 TensorRT model
DEBUG: [TRTorch] - Attempting to run engine (ID: __torch___PyTorch_Detection_SSD_src_model_SSD300_trt_engine_)
DEBUG: [TRTorch] - Current Device: Device(ID: 0, Name: NVIDIA Tegra X2, SM Capability: 6.2, Type: GPU)
DEBUG: [TRTorch] - Requested padding of dimensions to 1 but found 4 dimensions, not going to pad
DEBUG: [TRTorch] - Input shape: [1, 3, 300, 300]
DEBUG: [TRTorch] - Output shape: [1, 4, 8732]
DEBUG: [TRTorch] - Output shape: [1, 81, 8732]
Prediction time 0.010217428207397461
(tensor([[[ 0.9756,  0.1747, -1.3818,  ..., -0.2827, -0.1379, -0.1949],
         [ 1.0098,  1.3086,  0.9346,  ...,  0.4170,  0.5889,  0.3745],
         [-1.7295, -0.6611, -0.9873,  ..., -0.9341, -1.3926,  0.4646],
         [-1.3105, -1.6895, -1.5869,  ..., -1.6016, -0.0120, -1.7568]]],
       device='cuda:0'), tensor([[[ 7.5703,  7.7461,  7.7930,  ...,  8.8203,  7.4492,  8.6328],
         [ 1.9795,  2.0801,  2.9316,  ...,  2.0371,  2.4141,  2.1484],
         [-0.3008, -0.2059,  0.0671,  ..., -0.2512, -0.2142, -0.1271],
         ...,
         [-0.9277, -1.1660, -1.0635,  ..., -0.8608, -0.8203, -0.7422],
         [-0.1588, -0.1909, -0.3755,  ..., -0.8267, -0.8628, -0.8159],
         [-0.1450, -0.5020, -0.4402,  ..., -0.7510, -0.8198, -0.7646]]],
       device='cuda:0'))
DEBUG: [TRTorch] - Serialized Device Info: 0%6%2%0%NVIDIA Tegra X2
model saved

Hi,

Based on the output message:

Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

This error is related to API usage, since it complains about a duplicate initialization call.

Have you tried the sample on another device (e.g. an x86 system) before?
If not, could you give it a try to see if it works well?

Thanks.

Hello @AastaLLL, thank you for responding.

  1. Have you tried the sample on another device (e.g. an x86 system) before?
    Answer: Yes, I have tried it on an x86 system and it functions as expected. I am actually adapting this concept from our x86 implementation to the Jetson platform.

Hello @AastaLLL. Just some additional context, as this issue covers two errors. The first error is:

Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

The second error is the primary roadblock: loading the model without multiprocessing works as expected, but doing so with multiprocessing throws the following error. I would really appreciate your input here.

RuntimeError('[Error thrown at core/runtime/runtime.cpp:12] Expected (cudaSetDevice(cuda_device.id) == cudaSuccess) to be true but got false\nUnable to set device: Device(ID: 0, Name: NVIDIA Tegra X2, SM Capability: 6.2, Type: GPU)as active device\n',)
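
For anyone trying to narrow this down: a minimal probe along the lines below (a sketch with illustrative names, not something we have verified on the TX2) can show whether plain CUDA initialization succeeds inside a pooled worker before any TRTorch deserialization is attempted, which would isolate the failure to the TRTorch runtime's cudaSetDevice call:

import os
import torch
from pebble import ProcessPool

def cuda_state_probe():
    # Runs inside the pooled child: report whether CUDA comes up
    # before torch.jit.load() ever touches the TRTorch runtime.
    ok = torch.cuda.is_available()
    name = torch.cuda.get_device_name(0) if ok else None
    return os.getpid(), ok, name

if __name__ == "__main__":
    with ProcessPool(max_workers=1) as pool:
        print(pool.schedule(cuda_state_probe).result())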

Hi,

Thanks for the feedback.

Have you tried the single-process use case?
Does it work?

Yes, @AastaLLL. We are able to run the single-process use case without ProcessPool. cc @venketramana1


@AastaLLL As remarked by @rehan2, we have tested the single-process use case without the use of ProcessPool. The sample script reproduces both cases, whereby the function performs as expected outside of the ProcessPool but fails when used with it.


Thanks for sharing.

We are going to reproduce this internally.
Will share more information with you later.

Hi,

We can reproduce this issue in our environment and are working on it.
Will share more information with you later.

Thanks.


Hi,

Based on the doc below, please use torch.multiprocessing for multiprocess usage:

https://pytorch.org/docs/stable/notes/multiprocessing.html#cuda-in-multiprocessing
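
For example, a minimal sketch based on that doc might look like below (the model path is taken from your reproduction script; importing trtorch inside the worker to register the engine ops is our assumption):

import torch
import torch.multiprocessing as tmp

def worker(rank, model_path):
    # Each spawned process performs its own CUDA initialization.
    import trtorch  # assumption: needed to register the TRTorch runtime ops
    model = torch.jit.load(model_path)
    print("worker", rank, "loaded model on", torch.cuda.get_device_name(0))

if __name__ == "__main__":
    # spawn() starts fresh interpreter processes, so CUDA is not
    # re-initialized inside a forked child.
    tmp.spawn(worker, args=("./nvidia_ssd_300.ts",), nprocs=2)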

Thanks.

Good day @AastaLLL, thank you for the feedback and suggestion. We managed to solve the problem by reverting TRTorch from v0.4.0 to v0.2.0, as we need to use Pebble's ProcessPool for our implementation.

Thanks for the feedback.
Good to know you found a way to solve this.