Multiprocessing on Jetson

Good day! I am trying to run object detection inference on multiple camera sources using Pebble's ProcessPool. The pipeline/concept has been tested and performs as expected on x86 servers; however, I run into issues on the Jetson TX2. I have also tested Python's multiprocessing and pathos.multiprocessing, which produce the same issues.

Hardware Information:

Jetson TX2: JetPack 4.6
Python 3.6.9
torch wheel: v1.9.0 from PyTorch for Jetson - #3 by dusty_nv
TRTorch: compiled from GitHub - pytorch/TensorRT (PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT) with JetPack 4.6 support.

Errors/issues faced:

THCudaCheck FAIL file=/media/nvidia/NVME/pytorch/pytorch-v1.9.0/aten/src/THC/THCGeneral.cpp line=54 error=3 : initialization error
Failed to run StreamPipeline: RuntimeError('cuda runtime error (3) : initialization error at /media/nvidia/NVME/pytorch/pytorch-v1.9.0/aten/src/THC/THCGeneral.cpp:54',)

The above error occasionally arises as:
Failed to run StreamPipeline: RuntimeError("Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method",)

Comments: Setting the 'spawn' context results in:
  File "/usr/lib/python3.6/multiprocessing/context.py", line 242, in set_start_method
    raise RuntimeError('context has already been set')
RuntimeError: context has already been set
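
For completeness, my understanding is that the start method has to be set before anything initializes the default context (the pool in my script is created at import time). Below is a minimal sketch of what I believe the intended usage looks like, with force=True to override an already-set context (names here are illustrative, and newer Pebble releases reportedly also accept a context argument on ProcessPool):

import multiprocessing as mp
import numpy as np
from pebble import ProcessPool

def cuda_probe(msg):
    # torch is imported inside the worker so that CUDA is first
    # touched in the spawned child, not inherited via fork.
    import torch
    t = torch.tensor(np.array([1.0]), device="cuda:0")
    return msg, str(t.device)

if __name__ == "__main__":
    # Must run before a pool (or a CUDA call) initializes the
    # default context; force=True overrides an already-set context.
    mp.set_start_method("spawn", force=True)
    with ProcessPool(max_workers=2) as pool:
        print(pool.schedule(cuda_probe, ["spawn worker"]).result())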

The second error we face is:
Failed to run StreamPipeline: RuntimeError('[Error thrown at core/runtime/runtime.cpp:12] Expected (cudaSetDevice(cuda_device.id) == cudaSuccess) to be true but got false\nUnable to set device: Device(ID: 0, Name: NVIDIA Tegra X2, SM Capability: 6.2, Type: GPU)as active device\n',)

How to reproduce:
Test model compiled using TRTorch for the Jetson TX2: nvidia_ssd_300.ts - Google Drive
The following script reproduces the errors:

import torch
import trtorch  # registers the TRTorch/TensorRT runtime ops required to deserialize the compiled model
import time
import numpy as np
from pebble import ProcessPool
import multiprocessing as mp

pool = ProcessPool(max_workers=3, max_tasks=2)
futures = dict()

def test_case(test_argument):
    # First check: can this process create a CUDA tensor at all?
    try:
        print(test_argument)
        test_tensor = torch.tensor(
            np.array([[1, 2, 3], [4, 5, 6]]),
            dtype=torch.float16,
            device=torch.device("cuda:0"),
        ).clamp(0, 1)
        print(test_tensor)
    except Exception as e:
        print("Failed to run StreamPipeline: " + repr(e))

    # Second check: can this process deserialize the TRTorch-compiled model?
    try:
        t_0 = time.time()
        print('Loading Model...')
        torch.jit.load("./nvidia_ssd_300.ts")
        print('Model loaded in:', time.time() - t_0)
    except Exception as e:
        print("Failed to run StreamPipeline: " + repr(e))

def main():
    # Baseline: run the checks once in the parent process.
    test_argument = 'Without multiprocessing'
    test_case(test_argument)

    # Repeat the same checks inside pooled worker processes.
    cam = ['cam1', 'cam2']
    for key in cam:
        test_argument = 'With multiprocessing '
        future = pool.schedule(test_case, [test_argument])
        futures[key] = future
        print(future.result())

if __name__ == "__main__":
    main()

Sample output:

nvidia@tx2:~/development$ python3 reproduce.py 
Without multiprocessing
tensor([[1., 1., 1.],
        [1., 1., 1.]], device='cuda:0', dtype=torch.float16)
Loading Model...
Model loaded in: 6.3753883838653564
With multiprocessing 
Failed to run StreamPipeline: RuntimeError("Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method",)
Loading Model...
Failed to run StreamPipeline: RuntimeError('[Error thrown at core/runtime/runtime.cpp:12] Expected (cudaSetDevice(cuda_device.id) == cudaSuccess) to be true but got false\nUnable to set device: Device(ID: 0, Name: NVIDIA Tegra X2, SM Capability: 6.2, Type: GPU)as active device\n',)
None
With multiprocessing 
Failed to run StreamPipeline: RuntimeError("Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method",)
Loading Model...
Failed to run StreamPipeline: RuntimeError('[Error thrown at core/runtime/runtime.cpp:12] Expected (cudaSetDevice(cuda_device.id) == cudaSuccess) to be true but got false\nUnable to set device: Device(ID: 0, Name: NVIDIA Tegra X2, SM Capability: 6.2, Type: GPU)as active device\n',)
None

Additional Information:
Sample debug output from TRTorch model compilation, showing that the relevant devices/resources are detected and the model compiles successfully:

model compile time 60.87920784950256
compiled
test results for FP16 TensorRT model
DEBUG: [TRTorch] - Attempting to run engine (ID: __torch___PyTorch_Detection_SSD_src_model_SSD300_trt_engine_)
DEBUG: [TRTorch] - Current Device: Device(ID: 0, Name: NVIDIA Tegra X2, SM Capability: 6.2, Type: GPU)
DEBUG: [TRTorch] - Requested padding of dimensions to 1 but found 4 dimensions, not going to pad
DEBUG: [TRTorch] - Input shape: [1, 3, 300, 300]
DEBUG: [TRTorch] - Output shape: [1, 4, 8732]
DEBUG: [TRTorch] - Output shape: [1, 81, 8732]
Prediction time 0.010217428207397461
(tensor([[[ 0.9756,  0.1747, -1.3818,  ..., -0.2827, -0.1379, -0.1949],
         [ 1.0098,  1.3086,  0.9346,  ...,  0.4170,  0.5889,  0.3745],
         [-1.7295, -0.6611, -0.9873,  ..., -0.9341, -1.3926,  0.4646],
         [-1.3105, -1.6895, -1.5869,  ..., -1.6016, -0.0120, -1.7568]]],
       device='cuda:0'), tensor([[[ 7.5703,  7.7461,  7.7930,  ...,  8.8203,  7.4492,  8.6328],
         [ 1.9795,  2.0801,  2.9316,  ...,  2.0371,  2.4141,  2.1484],
         [-0.3008, -0.2059,  0.0671,  ..., -0.2512, -0.2142, -0.1271],
         ...,
         [-0.9277, -1.1660, -1.0635,  ..., -0.8608, -0.8203, -0.7422],
         [-0.1588, -0.1909, -0.3755,  ..., -0.8267, -0.8628, -0.8159],
         [-0.1450, -0.5020, -0.4402,  ..., -0.7510, -0.8198, -0.7646]]],
       device='cuda:0'))
DEBUG: [TRTorch] - Serialized Device Info: 0%6%2%0%NVIDIA Tegra X2
model saved

Hi,

Based on the output message:

Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

This error is related to API usage, since it complains about a duplicate initialization call.

Have you tried the sample on another device (e.g. an x86 system) before?
If not, could you give it a try to see if it works well?

Thanks.

Hello @AastaLLL, thank you for responding.

  1. Have you tried the sample on another device (e.g. an x86 system) before?
    Answer: Yes, I have tried it on an x86 system and it functions as expected. I am actually adapting this concept from our x86 implementation to the Jetson platform.

Hello @AastaLLL. Just some additional context, as this issue covers two errors. The first error is:

Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

The second error is the primary roadblock: loading the model without multiprocessing works as expected, but doing so with multiprocessing throws the following error. I would really appreciate your input here.

RuntimeError('[Error thrown at core/runtime/runtime.cpp:12] Expected (cudaSetDevice(cuda_device.id) == cudaSuccess) to be true but got false\nUnable to set device: Device(ID: 0, Name: NVIDIA Tegra X2, SM Capability: 6.2, Type: GPU)as active device\n',)
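
For anyone trying to narrow this down: a minimal probe along the lines below (a sketch with illustrative names, not something we have verified on the TX2) can show whether plain CUDA initialization succeeds inside a pooled worker before any TRTorch deserialization is attempted, which would isolate the failure to the TRTorch runtime's cudaSetDevice call:

import os
import torch
from pebble import ProcessPool

def cuda_state_probe():
    # Runs inside the pooled child: report whether CUDA comes up
    # before torch.jit.load() ever touches the TRTorch runtime.
    ok = torch.cuda.is_available()
    name = torch.cuda.get_device_name(0) if ok else None
    return os.getpid(), ok, name

if __name__ == "__main__":
    with ProcessPool(max_workers=1) as pool:
        print(pool.schedule(cuda_state_probe).result())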

Hi,

Thanks for the feedback.

Have you tried the single-process use case?
Does it work?

Yes, @AastaLLL. We are able to run the single-process use case without ProcessPool. cc @venketramana1


@AastaLLL As remarked by @rehan2, we have tested the single-process use case without the use of ProcessPool. The sample script reproduces both cases, whereby the function performs as expected outside of the ProcessPool but fails when used with it.


Thanks for sharing.

We are going to reproduce this internally.
Will share more information with you later.

Hi,

We can reproduce this issue in our environment and are working on it.
Will share more information with you later.

Thanks.


Hi,

Based on the doc below, please use torch.multiprocessing for multiprocess usage:

https://pytorch.org/docs/stable/notes/multiprocessing.html#cuda-in-multiprocessing
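
For example, a minimal sketch based on that doc might look like below (the model path is taken from your reproduction script; importing trtorch inside the worker to register the engine ops is our assumption):

import torch
import torch.multiprocessing as tmp

def worker(rank, model_path):
    # Each spawned process performs its own CUDA initialization.
    import trtorch  # assumption: needed to register the TRTorch runtime ops
    model = torch.jit.load(model_path)
    print("worker", rank, "loaded model on", torch.cuda.get_device_name(0))

if __name__ == "__main__":
    # spawn() starts fresh interpreter processes, so CUDA is not
    # re-initialized inside a forked child.
    tmp.spawn(worker, args=("./nvidia_ssd_300.ts",), nprocs=2)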

Thanks.

Good day @AastaLLL, thank you for the feedback and suggestion. We managed to solve the problem by reverting TRTorch from v0.4.0 to v0.2.0, as we need to use Pebble's ProcessPool for our implementation.

Thanks for the feedback.
Good to know you found a way to solve this.