PyTorch list slicing on GPU slower than on CPU

This is a copy of the original question on Stack Overflow.

I would like to optimize ML code (SSD in PyTorch) on an NVIDIA Jetson Xavier NX (development kit). One of the bottlenecks seems to be list slicing on PyTorch (1.6.0) tensors on the GPU device.

The same problem occurred on an NVIDIA GeForce GTX 1050 Ti (GP107), where the CPU was ~2 times faster.

Let me create the variables first:

import torch
from time import time

cuda0 = torch.device('cuda:0')

probs = torch.ones([3000], dtype=torch.float64, device=cuda0)
mask = torch.ones([3000], dtype=torch.bool, device=cuda0)

probs_cpu = probs.cpu()
mask_cpu = mask.cpu()

Then run the logic (approximately the same results occurred on every run):

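before = time()
probs[mask]  # boolean-mask indexing on the GPU tensors
print(f'GPU {time() - before:.5f}') # output: GPU 0.00263

before = time()
probs_cpu[mask_cpu]  # the same indexing on the CPU copies
print(f'CPU {time() - before:.5f}') # output: CPU 0.00066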
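A single wall-clock measurement like the one above can be skewed by one-time CUDA start-up cost and by asynchronous kernel launches. The following is only a sketch of a more careful measurement, not a definitive benchmark: the helper name `bench`, the repeat and warm-up counts, and the fallback to the CPU when no GPU is present are my own assumptions, and it assumes the operation being timed is the boolean-mask indexing `probs[mask]`.

```python
import torch
from time import perf_counter

def bench(fn, device, repeats=100, warmup=10):
    """Return average seconds per call of fn (hypothetical helper)."""
    # Warm-up calls absorb one-time costs such as CUDA context creation.
    for _ in range(warmup):
        fn()
    if device.type == 'cuda':
        torch.cuda.synchronize(device)  # wait for queued GPU work to finish
    start = perf_counter()
    for _ in range(repeats):
        fn()
    if device.type == 'cuda':
        torch.cuda.synchronize(device)  # include all launched kernels in the timing
    return (perf_counter() - start) / repeats

# Assumption: fall back to the CPU so the sketch also runs without a GPU.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

probs = torch.ones([3000], dtype=torch.float64, device=device)
mask = torch.ones([3000], dtype=torch.bool, device=device)
probs_cpu, mask_cpu = probs.cpu(), mask.cpu()

gpu_time = bench(lambda: probs[mask], device)
cpu_time = bench(lambda: probs_cpu[mask_cpu], torch.device('cpu'))

print(f'{device.type} {gpu_time:.5f}')
print(f'cpu {cpu_time:.5f}')
```

If the gap between GPU and CPU persists with warm-up and averaging, it is more likely a property of the indexing operation itself than a measurement artifact.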

Why is the list slicing ~4 times slower on the GPU than on the CPU with PyTorch version 1.6.0 on the NVIDIA Jetson Xavier NX Developer Kit, according to the code above? How can I speed it up?

Code details: see line 51 in which is part of SSD Implementation in PyTorch

Run it on CPU?: The whole algorithm would not be faster on the CPU, since downloading the data from the GPU takes too long (~0.00805 s).

Hi @pedro.dvoracek, since this issue is not specific to Jetson and internal to PyTorch, I recommend you post an issue to the PyTorch GitHub repo or the PyTorch forums: