Why does torch.Tensor.cuda() utilize the GPU?

I’m trying to understand what torch.Tensor.cuda() does in detail.
I thought it simply moves the tensor from CPU memory to GPU memory, with the CPU performing the store operations into GPU memory without involving the GPU itself (please correct me if that’s wrong).

import torch
import torch.distributed as dist
import torchvision.models as models
from torch.profiler import profile, record_function, ProfilerActivity

inputs = torch.randn(1000)  # create a tensor of 1000 floats on the CPU
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
    inputs = inputs.cuda()  # move the tensor to the GPU; this is the operation being profiled
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

This is the code that I’m profiling. It simply creates a tensor of 1000 floats and moves it to the GPU.


And here is the profile result.

I actually observe some CUDA kernels running, and the GPU utilization is not 0% (different from my expectation). Could anyone let me know what I’m misunderstanding here?

In modern CUDA, operations such as creating an allocation that has to be zero-initialized, or copying data from one device allocation to another, are often implemented using GPU kernels.
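
For example, a zero-initialized allocation on the GPU and a device-to-device copy both show up as GPU activity in the same PyTorch profiler output. A minimal sketch (not from the original post; the exact kernel names vary by version):

import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    z = torch.zeros(1000, device="cuda")  # zero-initialization is typically done by a fill kernel
    y = z.clone()                         # device-to-device copy
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))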

If you run one of the NVIDIA-provided profilers that ship with the CUDA toolkit, you can see the specifics of the kernels being launched. With a bit of work, you should be able to identify which torch operations are causing that.
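
A typical invocation, assuming Nsight Systems (nsys) is installed and using a placeholder script name, might look like:

nsys profile -o cuda_copy_report python your_script.py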

Beyond that, I’m not an expert on torch internals, and pytorch isn’t a product that is developed, maintained, or supported by NVIDIA. You may get better responses on a forum that caters to pytorch.

Thanks for your reply, Robert!

I understood that memory allocations and data copies are actually implemented using GPU kernels, and that’s why the GPU is utilized during them.

Following your suggestion, I tried to visualize the profile using Nsight Systems, but it’s weird that I see 0% GPU utilization in Nsight Systems.

I checked with the nvidia-smi command that GPU utilization is not 0% during the data movement, but the Nsight Systems profile does not report any GPU utilization. Do you have any idea what might explain this? Thanks!

nvidia-smi can report non-zero GPU utilization simply due to the act of running nvidia-smi.

Do you mean the GPU utilization during the data movement is actually 0%?

According to your previous reply and the PyTorch profile above, the GPU utilization is indeed not 0%. Also, I checked that nvidia-smi shows non-zero GPU utilization only when the data movement is happening. So I think it’s weird that Nsight Systems does not show any GPU utilization during the data movement.

Yes, nvidia-smi shows GPU utilization when doing a cudaMemcpyHostToDevice.

Nsight Systems shows the actual operation in the timeline. I’m not sure what you mean by saying “the Nsight Systems profile does not report the GPU utilization”. It reports the operation in the timeline, showing its duration. Note that your timeline shows mostly cudaMalloc, not data movement. The data movement is probably occurring in the skinny bar right after the long red cudaMalloc bar.
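
One way to make the copy itself stand out in the timeline is to do a throwaway transfer first, so that CUDA context creation and the first cudaMalloc happen before the part you care about. A minimal sketch (not from the original thread):

import torch

_ = torch.randn(10).cuda()   # warm-up: triggers CUDA context creation and the first allocation
torch.cuda.synchronize()

inputs = torch.randn(1000)
inputs = inputs.cuda()       # this section of the timeline is now dominated by the host-to-device copy
torch.cuda.synchronize()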

What I mean is that Nsight Systems shows 0% GPU utilization during the data movement, unlike nvidia-smi. So Nsight Systems says the GPU is not utilized, but nvidia-smi says it is. This is the mismatch that I don’t understand.

FYI, Nsight Systems shows sky-blue bars in the CUDA HW row when the GPU is utilized, like this:


But there are no sky-blue bars at all during the data movement.

Could you please take a look at this once again?

It looks to me like the CUDA HW bar has some activity in it, at precisely the point I indicated in my previous message (the skinny bar right after the red cudaMalloc bar).

Sure, it’s green (I guess), not blue. Nsight Systems may color different types of activity differently. If you’re interested in the section where the data transfer activity is actually happening, why not expand that section?

Oh, I see… I thought that only the sky-blue bars show active CUDA kernels, but you mean the green bars show CUDA kernels as well, just for different kinds of activity like memory operations.

It’s an interesting fact that I just realized thanks to you!
Thank you, Robert! :)

The green items are not CUDA kernels. Blue = CUDA kernel. Green = data copy from host to device. They are two different kinds of activity/utilization. nvidia-smi lumps them together for percentage utilization of the GPU. Nsight breaks them out in different colors, on the activity bar/timeline.
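
In PyTorch profiler terms, the same distinction shows up as separate entries, e.g. a “Memcpy HtoD” row for the copy versus a named kernel for actual compute. A minimal sketch (not from the original thread; entry names vary by version):

import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(1000)
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    y = x.cuda()   # host-to-device copy (the green activity in Nsight Systems)
    z = y * 2.0    # elementwise multiply kernel (the blue activity in Nsight Systems)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))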
