I have a TX2 with JetPack 4.2.2 and am trying to use torch.multiprocessing to load my models for inference once, and then have several sub-processes running that use the model to run inference.
I have used the steps here to build my own PyTorch 1.7.0 successfully, but when I try to use multiprocessing, I get the following error:
RuntimeError: cuda runtime error (71) : operation not supported
I have included a simplified version of my code at the end of this post.
Running the same code on one of my development VMs (without CUDA) works fine; it seems that PyTorch is unable to share the GPU-based tensors across processes.
Any idea what’s causing this? The only reference to this error I can find relates to Windows (and the TX2 is on Ubuntu, obviously).
Hope someone can point me in the right direction…
//Ton
Sample code:
from torch.multiprocessing import Process
import torch.multiprocessing as mp
import torch

myModel = None

def test(myModel):
    print("RUNNING THIS ONE")
    print(torch.cuda.is_available())

def _process():
    p = Process(target=test, args=(myModel,))
    p.start()
    p.join(300)

if __name__ == '__main__':
    import model
    myModel = model.model()
    myModel.share_memory()
    mp.set_start_method('spawn')
    _process()
For reference, the underlying issue is that, on the Tegra architecture, CUDA does NOT support IPC. It’s only supported on Linux desktop GPUs, not on Jetsons.
(I traced the failing call back to cudaIpcGetMemHandle, which led me to the answer.)
So, on Tegra, sharing GPU-resident models (or tensors) across multiple processes is not possible (that’s what error 71, "operation not supported", means).
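Given that constraint, the usual workaround is to keep the shared model in CPU memory and move it to the GPU inside each worker after it starts, since only CPU memory can be shared on Tegra. A minimal sketch of that pattern, using only the standard library so it runs anywhere; run_inference here is an illustrative stand-in for the real GPU work (with PyTorch you would call model.share_memory() on a CPU model, then model.to('cuda') inside the worker):

```python
# Sketch only: stdlib stand-in for the CPU-share / per-worker-GPU-copy pattern.
import multiprocessing as mp

def run_inference(weights):
    # Placeholder for the per-process GPU work, i.e. where the real code
    # would do: model.to('cuda'); model(input)
    return sum(weights)

def worker(weights, out_q):
    # Each worker receives CPU-resident state and makes its own GPU copy.
    out_q.put(run_inference(weights))

if __name__ == '__main__':
    ctx = mp.get_context('spawn')     # 'spawn' is required when CUDA is involved
    shared = [1.0, 2.0, 3.0]          # stands in for shared CPU model weights
    q = ctx.Queue()
    p = ctx.Process(target=worker, args=(shared, q))
    p.start()
    print(q.get())                    # 6.0
    p.join()
```

The cost is one GPU copy of the model per process, so this only pays off when the workers are long-lived.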
In short, you do have torchvision v0.8.1; it is just reporting as v0.8.0a0+45f960c. If you look at the torchvision release page for v0.8.1, it has the same commit (45f960c) as reported in the version. I can’t seem to get it to print out v0.8.1 even though that is what is installed, sorry about that.
OK, I think I got to the bottom of it - instead of running sudo python3 setup.py install to build torchvision, run python3 setup.py install --user
This allows setup.py to pick up the BUILD_VERSION environment variable; before, it was not finding it because the command was run with sudo. I have updated the instructions to reflect this.
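The mechanism behind this is ordinary environment inheritance: setup.py reads the variable from os.environ, and sudo resets the environment by default, so an exported BUILD_VERSION never reaches the child process. A small sketch of how such a version lookup behaves (resolve_version and the fallback string are illustrative, not the actual torchvision code):

```python
# Sketch: how a build script typically falls back when BUILD_VERSION is unset,
# which is what happens under plain sudo.
import os

def resolve_version(default='0.8.0a0'):
    # Returns the exported version if present, else a dev version string.
    return os.environ.get('BUILD_VERSION', default)

os.environ.pop('BUILD_VERSION', None)
print(resolve_version())              # 0.8.0a0  (the sudo case)
os.environ['BUILD_VERSION'] = '0.8.1'
print(resolve_version())              # 0.8.1    (the --user case)
```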
Hi @kevintgbd, I don’t see an actual error in your log - did it simply quit building without an additional message, or was it just taking a long time? If it abruptly quit, you may want to mount SWAP memory. If it was taking a long time, torchvision can take a while to compile some files.
Also you may want to see the updated install instructions for torchvision:
$ sudo apt-get install libjpeg-dev zlib1g-dev libpython3-dev libavcodec-dev libavformat-dev libswscale-dev
$ git clone --branch <version> https://github.com/pytorch/vision torchvision # see below for version of torchvision to download
$ cd torchvision
$ export BUILD_VERSION=0.x.0 # where 0.x.0 is the torchvision version
$ python3 setup.py install --user
$ cd ../ # attempting to load torchvision from build dir will result in import error
If you continue to have problems, you can try using the l4t-pytorch container which comes with PyTorch/torchvision pre-installed.
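After the build finishes, a quick way to confirm what actually got installed (run from outside the torchvision source tree, per the note above) is to query the package metadata rather than importing torchvision itself; installed_version is just an illustrative helper:

```python
# Sketch: check the installed torchvision version via package metadata,
# which works even when importing torchvision would fail.
from importlib import metadata

def installed_version(dist='torchvision'):
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None

# Prints the installed version string, or None if torchvision is not installed.
print(installed_version())
```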
Hi @berkcanerbol98 , you would need to build PyTorch from source against Python 3.8. I believe there are some others on this thread who have done it, and the procedure was mostly the same as the build instructions in the first post from this topic.
Nano is the same GPU architecture as TX1, so yes. If you export this environment variable before building PyTorch, it will work on all Jetsons (TX1/TX2/Xavier/Nano):
I’ve already installed torchvision (by following your instructions).
But when I try to run the script “train_ssd.py” (to run the Re-training SSD-Mobilenet tutorial), I only get:
Traceback (most recent call last):
File “train_ssd.py”, line 14, in
from vision.utils.misc import str2bool, Timer, freeze_net_layers, store_labels
ModuleNotFoundError: No module named ‘vision’
If I import the module torchvision in the python shell, it is imported with no problem, I can see its version too (0.7.0)
I don’t know if the lines:
from vision.utils.misc import str2bool, Timer, freeze_net_layers, store_labels
from vision.ssd.ssd import MatchPrior
from vision.ssd.vgg_ssd import create_vgg_ssd
are trying to import modules that are NOT part of torchvision.
Do you have some advice?
Thank you in advance!
Update:
My bad, it was my mistake. I did not download the vision module into the same directory where I have the scripts, models, data, etc.
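For anyone hitting the same error: `vision` is a local package inside the pytorch-ssd checkout, not part of torchvision, so Python only finds it when that directory is on sys.path (e.g. when running train_ssd.py from inside the repo). A tiny stdlib diagnostic sketch to check this, with can_import as an illustrative helper:

```python
# Sketch: check whether a top-level package is resolvable from the current
# sys.path without actually importing it.
import importlib.util

def can_import(name):
    return importlib.util.find_spec(name) is not None

# False unless you run this from inside the pytorch-ssd checkout,
# which is why train_ssd.py failed while `import torchvision` worked.
print(can_import('vision'))
```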
I made a fresh install of l4t-pytorch. It launched the first time, but after a reboot it won’t launch anymore and I can no longer SSH in. I consider this another unusable container. I would ask questions, but if the containers can’t even launch and keep breaking SSH, I doubt you can help. If you want to help, make containers that launch and allow SSH without issues like this - so far all the containers have been a waste of my energy.