CUDA runtime error while re-training SSD

Hi, I am following this documentation on re-training the SSD model: jetson-inference/pytorch-ssd.md at master · dusty-nv/jetson-inference · GitHub. I have downloaded the images I want and am running train_ssd.py, but it fails with the CUDA error below.

THCudaCheck FAIL file=/media/nvidia/WD_BLUE_2.5_1TB/pytorch/aten/src/THCUNN/generic/SpatialDepthwiseConvolution.cu line=195 error=7 : too many resources requested for launch
Traceback (most recent call last):
  File "train_ssd.py", line 343, in <module>
    device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
  File "train_ssd.py", line 123, in train
    loss.backward()
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 106, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True) # allow_unreachable flag
RuntimeError: cuda runtime error (7) : too many resources requested for launch at /media/nvidia/WD_BLUE_2.5_1TB/pytorch/aten/src/THCUNN/generic/SpatialDepthwiseConvolution.cu:195

Please let me know how to resolve this.

Hi,

May I know how you installed the PyTorch package?
It's recommended to install the version shared in the comment below:

Thanks.

Hi,

Thanks for your reply, and sorry for the late response. I flashed the DLI AI SDK on the Jetson, as recommended by the NVIDIA tutorial. It comes with PyTorch pre-built. Do you recommend that I uninstall it and rebuild using the links?

Is there any other option that I can try?

Thanks

Hi,

This issue is similar to the one in this topic.
It can be fixed with this change:

I'm not sure which package is included in the SDK.
Would you mind giving the above package a try?

Thanks.

Why don't I have a file named CUDAContext.cpp on my Jetson Nano?

That file is part of the PyTorch source code. The patch has already been applied to the pre-built PyTorch wheels (PyTorch >= 1.1.0) from this topic:

https://forums.developer.nvidia.com/t/pytorch-for-jetson-version-1-8-0-now-available/72048
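
To see which PyTorch version is actually installed, something like the sketch below may help. The helper names version_tuple and meets_minimum_version are made up for illustration and are not part of PyTorch. Also note that a version of 1.1.0 or newer only suggests the patch is present if the wheel came from the topic above; a build from another source may not include it.

```python
import re


def version_tuple(version):
    """Return the first three numeric fields of a version string as ints."""
    parts = []
    for field in version.split(".")[:3]:
        # Builds may carry suffixes such as "0a0+b457266"; keep leading digits only.
        match = re.match(r"\d+", field)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum_version(version, minimum=(1, 1, 0)):
    """Hypothetical helper: True if version is at least the given minimum."""
    return version_tuple(version) >= minimum


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this Python environment")
    else:
        print("PyTorch version:", torch.__version__)
        print("CUDA available:", torch.cuda.is_available())
        print("At least 1.1.0:", meets_minimum_version(torch.__version__))
```

If the reported version is older than 1.1.0, installing one of the wheels from that topic is probably the simplest route.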

If you are still getting the error, try reducing --batch-size when running train_ssd.py.
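
As a rough sketch of how to step the batch size down, the snippet below prints train_ssd.py commands with successively halved --batch-size values. The helper names and the starting value of 8 are assumptions for illustration only; the only train_ssd.py option used is --batch-size, and any other arguments you normally pass would go in extra_args.

```python
def candidate_batch_sizes(start, minimum=1):
    """Return batch sizes to try, halving from start down to minimum."""
    minimum = max(1, minimum)  # a batch size below 1 makes no sense
    sizes = []
    size = start
    while size >= minimum:
        sizes.append(size)
        size //= 2
    return sizes


def build_command(batch_size, extra_args=()):
    """Build the argument list for one train_ssd.py run (hypothetical helper)."""
    return ["python3", "train_ssd.py", f"--batch-size={batch_size}", *extra_args]


if __name__ == "__main__":
    # Starting value of 8 is an assumption; use whatever you ran with before.
    for size in candidate_batch_sizes(8):
        print(" ".join(build_command(size)))
```

Run the printed commands one at a time, largest first, and stop at the first batch size that trains without the error.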