Here is my code for running ViT inference.
import time

import torch
from PIL import Image

for i in range(100):
    s_time0 = time.time()

    # load image from disk
    image = Image.open(image_file).convert('RGB')
    print('open img time:', int((time.time() - s_time0) * 1000))

    # preprocess into a numpy array
    s_time = time.time()
    image_tensor = process_anyres_image(
        image, model.image_processor, grid_points, False, False
    )
    print('process time:', int((time.time() - s_time) * 1000))

    # wrap the numpy array as a torch tensor
    s_time = time.time()
    image_tensor = torch.from_numpy(image_tensor)
    print('array to tensor time:', int((time.time() - s_time) * 1000))

    # move to GPU and cast to fp16
    s_time = time.time()
    image_tensor = image_tensor.to('cuda', dtype=torch.float16)
    print('to gpu time:', int((time.time() - s_time) * 1000))

    # ViT forward pass
    s_time = time.time()
    tokens = model(image_tensor)  # torch.Size([1, 3, 224, 224])
    endtime = time.time()
    print('forward time:', int((endtime - s_time) * 1000))
    print('total:', int((endtime - s_time0) * 1000))
    print('-----------------------')
The output looks like this.
It seems like tensor.to('cuda') takes a long time on every iteration except the first two.
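Since CUDA work runs asynchronously, I am not sure my host-side timers attribute the time to the right step, so here is a rough sketch (based on the loop above) of how I would re-time just the transfer with explicit synchronization; `image_tensor` is assumed to be the CPU tensor produced in the loop:

```python
import time

import torch

# Sketch: time only the host-to-device copy, with explicit synchronization,
# so any GPU work still queued from earlier calls is not billed to .to('cuda').
# Assumes `image_tensor` is the CPU tensor from the preprocessing step above.
torch.cuda.synchronize()  # drain any previously queued GPU work
t0 = time.time()
gpu_tensor = image_tensor.to('cuda', dtype=torch.float16)
torch.cuda.synchronize()  # wait until the copy itself has finished
print('to gpu (synced) ms:', int((time.time() - t0) * 1000))
```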
Here is the information about my machine and the PyTorch version.
![image](https://global.discourse-cdn.com/nvidia/original/4X/2/7/6/276daa6e55277d2018df6b2141bd7a6291d9485b.png)
Is this normal?
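If the copy itself turns out to be the bottleneck, a variant I could try (just a sketch, not yet measured on this machine) is pinning the host tensor and using a non-blocking transfer:

```python
import torch

# Sketch: pin the host tensor so the host-to-device copy can overlap with other work.
# Assumes `image_tensor` is the CPU tensor produced in the loop above.
pinned = image_tensor.pin_memory()
gpu_tensor = pinned.to('cuda', dtype=torch.float16, non_blocking=True)
torch.cuda.synchronize()  # ensure the asynchronous copy has completed before using the tensor
```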