Why is torch.tensor.to('cuda') so slow?

Here is my code for running ViT inference.

import time

import torch
from PIL import Image

for i in range(100):
    s_time0 = time.time()
    image = Image.open(image_file).convert('RGB')
    print('open img time:', int((time.time() - s_time0) * 1000))

    s_time = time.time()
    image_tensor = process_anyres_image(
        image, model.image_processor, grid_points, False, False
    )
    print('process time:', int((time.time() - s_time) * 1000))

    s_time = time.time()
    image_tensor = torch.from_numpy(image_tensor)
    print('array to tensor time:', int((time.time() - s_time) * 1000))

    s_time = time.time()
    image_tensor = image_tensor.to('cuda', dtype=torch.float16)
    print('to gpu time:', int((time.time() - s_time) * 1000))

    s_time = time.time()
    tokens = model(image_tensor)  # torch.Size([1, 3, 224, 224])
    endtime = time.time()
    print('forward time:', int((endtime - s_time) * 1000))
    print('total:', int((endtime - s_time0) * 1000))
    print('-----------------------')
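Note that CUDA operations are asynchronous, so wall-clock timers can attribute time to the wrong step: GPU work queued by a previous call can make the next blocking call (such as .to('cuda')) appear slow. A minimal sketch of per-step timing that synchronizes before and after reading the clock (the zero tensor here is just an illustrative stand-in for the image tensor):

```python
import time

import torch

def timed(label, fn):
    # Flush any pending GPU work so it is not billed to this step's timer.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.time()
    out = fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # also wait for this step's own GPU work
    print(label, int((time.time() - start) * 1000), 'ms')
    return out

# Illustrative stand-in for the loop body above
x = timed('array to tensor:', lambda: torch.zeros(1, 3, 224, 224))
if torch.cuda.is_available():
    x = timed('to gpu:', lambda: x.to('cuda', dtype=torch.float16))
```

With this pattern, time that really belongs to the forward pass no longer shows up in the next iteration's .to('cuda') measurement.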

And the output is like this.

It seems like tensor.to('cuda') takes a lot of time, except for the first two iterations.

Here is the machine information.

And the PyTorch version.

Is this normal?

Hi,
Here are some suggestions for common issues:

1. Performance

Please run the commands below before benchmarking a deep learning use case:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks
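To confirm the settings took effect, the current power mode and clock state can be queried (these flags exist on recent JetPack releases; output varies by board):

```shell
# Query the active power mode (should report the MAXN profile after `nvpmodel -m 0`)
sudo nvpmodel -q

# Show the current clock configuration applied by jetson_clocks
sudo jetson_clocks --show
```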

2. Installation

Installation guide of deep learning frameworks on Jetson:

3. Tutorial

Startup deep learning tutorial:

4. Report issue

If these suggestions don't help and you want to report an issue to us, please attach the model, the commands/steps, and any customized app so we can reproduce it locally.

Thanks!

Hi,

The function should be related to the memory copy.
Have you tried boosting the device performance with the commands shared above?

Thanks.

Yeah, jetson_clocks helps: the time cost of tensor.to('cuda') dropped from about 7000 ms to about 3000 ms, but that is still much larger than the ~70 ms model inference. Is that normal?

I also read that the first tensor.to('cuda') call is much slower because of warm-up. But in my case the warm-up only seems to last a short time, so subsequent to('cuda') calls are still slow (as shown in the pics, the forward pass in the second iteration is quick).
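If the copy itself turns out to be the bottleneck, a common PyTorch technique (not mentioned in this thread) is to stage the host tensor in pinned (page-locked) memory, which speeds up the host-to-device transfer and lets it run asynchronously. A sketch, with a zero tensor standing in for the image and a CPU fallback so it runs anywhere:

```python
import torch

def to_gpu_fast(t):
    """Copy a CPU tensor to the GPU via pinned (page-locked) memory."""
    if not torch.cuda.is_available():
        return t  # CPU fallback so the sketch runs without a GPU
    # pin_memory() stages the tensor in page-locked RAM, which makes the
    # host-to-device copy faster and lets non_blocking=True overlap it
    # with other work on the GPU.
    return t.pin_memory().to('cuda', dtype=torch.float16, non_blocking=True)

x = to_gpu_fast(torch.zeros(1, 3, 224, 224))
```

On a memory-constrained device like a Jetson, where CPU and GPU share physical RAM, the gain from pinning may be smaller than on a discrete GPU, so it is worth measuring both ways.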

Hi,

Please also try changing the power mode to MAXN.
For example:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.