This is a copy of the original question on Stack Overflow.
I would like to optimize ML code (SSD in PyTorch) on an NVIDIA Jetson Xavier NX (developer kit). One of the bottlenecks appears to be boolean mask indexing of PyTorch (1.6.0) tensors on the GPU device.
The same problem occurred on an NVIDIA GeForce GTX 1050 Ti (GP107), where the CPU was ~2 times faster.
Let me create the variables first:
import torch
from time import time
cuda0 = torch.device('cuda:0')
probs = torch.ones([3000], dtype=torch.float64, device=cuda0)
mask = torch.ones([3000], dtype=torch.bool, device=cuda0)
probs_cpu = probs.cpu()
mask_cpu = mask.cpu()
Then run the logic (approximately the same results occurred on every run):
before = time()
probs[mask]
print(f'GPU {time() - before:.5f}') # output: GPU 0.00263
before = time()
probs_cpu[mask_cpu]
print(f'CPU {time() - before:.5f}') # output: CPU 0.00066
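For completeness, here is a sketch of the same measurement with explicit synchronization, since CUDA kernel launches are asynchronous and torch.cuda.synchronize() ensures the wall-clock time covers the finished operation (it reuses the variables defined above):
torch.cuda.synchronize()  # make sure no earlier GPU work is still running
before = time()
probs[mask]
torch.cuda.synchronize()  # wait for the indexing kernel to complete
print(f'GPU (synchronized) {time() - before:.5f}')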
Why is the boolean mask indexing ~4 times slower on the GPU than on the CPU with PyTorch 1.6.0 on the NVIDIA Jetson Xavier NX Developer Kit, according to the code above? How can it be sped up?
Code details: see line 51 in predictor.py, which is part of an SSD implementation in PyTorch.
Run it on the CPU? The whole algorithm would not be faster if I ran it on the CPU, since downloading the data from the GPU takes too long (~0.00805 s).
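For reference, a minimal sketch of how that transfer cost could be measured (assuming the same probs tensor as above; .cpu() blocks until the device-to-host copy is done):
torch.cuda.synchronize()   # ensure pending GPU work is finished before timing
before = time()
probs_host = probs.cpu()   # device-to-host copy of the 3000-element tensor
print(f'GPU->CPU copy {time() - before:.5f}')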