Why does torch.Tensor.cuda() utilize the GPU?

I’m trying to understand what torch.Tensor.cuda() does in detail.
I thought it simply moves the tensor from CPU memory to GPU memory, with the CPU performing the store operations into GPU memory without involving the GPU itself (please correct me if that’s wrong).

import torch
import torch.distributed as dist
import torchvision.models as models
from torch.profiler import profile, record_function, ProfilerActivity

inputs = torch.randn(1000)  # create a tensor of 1000 floats on the CPU
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
    inputs = inputs.cuda()  # move the tensor to the GPU; this is the operation being profiled
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

This is the code that I’m profiling. It simply creates a tensor of 1000 floats and moves it to the GPU.


And here is the profile result.

I actually observe some CUDA kernels running, and the GPU utilization is not 0% (different from my expectation). Could anyone let me know what I’m misunderstanding here?

In modern CUDA, operations such as creating an allocation that has to be zero-initialized, or copying data from one device allocation to another, are often implemented using GPU kernels.
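
For example, a zero-initialized allocation on the GPU and a device-to-device copy both show up as GPU activity in the same PyTorch profiler output. A minimal sketch (not from the original post; the exact kernel names vary by version):

import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    z = torch.zeros(1000, device="cuda")  # zero-initialization is typically done by a fill kernel
    y = z.clone()                         # device-to-device copy
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))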

If you run one of the NVIDIA-provided profilers that ship with the CUDA toolkit, you can see the specifics of the kernels being launched. With a bit of work, you should be able to identify which torch operations are causing that.
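
A typical invocation, assuming Nsight Systems (nsys) is installed and using a placeholder script name, might look like:

nsys profile -o cuda_copy_report python your_script.py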

Beyond that, I’m not an expert on torch internals, and pytorch isn’t a product that is developed, maintained, or supported by NVIDIA. You may get better responses on a forum that caters to pytorch.

Thanks for your reply, Robert!

I understood that memory allocations and data copies are actually implemented using GPU kernels, and that’s why the GPU is utilized during them.

Following your suggestion, I tried to visualize the profile using Nsight Systems, but it’s weird that I see 0% GPU utilization in Nsight Systems.

I checked with the nvidia-smi command that GPU utilization is not 0% during the data movement, but the Nsight Systems profile does not report any GPU utilization. Do you have any idea what might explain this? Thanks!

nvidia-smi can report non-zero GPU utilization simply due to the act of running nvidia-smi.

Do you mean the GPU utilization during the data movement is actually 0%?

According to your previous reply and the PyTorch profile above, the GPU utilization is indeed not 0%. Also, I checked that nvidia-smi shows non-zero GPU utilization only when the data movement is happening. So I think it’s weird that Nsight Systems does not show any GPU utilization during the data movement.

Yes, nvidia-smi shows GPU utilization when doing a cudaMemcpyHostToDevice.

Nsight Systems shows the actual operation in the timeline. I’m not sure what you mean by saying “the Nsight Systems profile does not report the GPU utilization”. It reports the operation in the timeline, showing its duration. Note that your timeline shows mostly cudaMalloc, not data movement. The data movement is probably occurring in the skinny bar right after the long red cudaMalloc bar.
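
One way to make the copy itself stand out in the timeline is to do a throwaway transfer first, so that CUDA context creation and the first cudaMalloc happen before the part you care about. A minimal sketch (not from the original thread):

import torch

_ = torch.randn(10).cuda()   # warm-up: triggers CUDA context creation and the first allocation
torch.cuda.synchronize()

inputs = torch.randn(1000)
inputs = inputs.cuda()       # this section of the timeline is now dominated by the host-to-device copy
torch.cuda.synchronize()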

What I mean is that Nsight Systems shows 0% GPU utilization during the data movement, unlike nvidia-smi. So Nsight Systems says the GPU is not utilized, but nvidia-smi says it is. This is the mismatch that I don’t understand.

FYI, Nsight Systems shows sky-blue bars in the CUDA HW row when the GPU is utilized, like this:


But there are no sky-blue bars at all during the data movement.

Could you please take a look at this once again?

It looks to me like the CUDA HW bar has some activity in it, at precisely the point I indicated in my previous message (the skinny bar right after the red cudaMalloc bar).

Sure, it’s green (I guess), not blue. Nsight Systems may color different types of activity differently. If you’re interested in the section where the data transfer activity is actually happening, why not expand that section?

Oh, I see… I thought that only the sky-blue bars show active CUDA kernels, but you mean the green bars show CUDA kernels as well, just for different kinds of activity like memory operations.

It’s an interesting fact that I just realized thanks to you!
Thank you, Robert! :)

The green items are not CUDA kernels. Blue = CUDA kernel. Green = data copy from host to device. They are two different kinds of activity/utilization. nvidia-smi lumps them together for percentage utilization of the GPU. Nsight breaks them out in different colors, on the activity bar/timeline.
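
In PyTorch profiler terms, the same distinction shows up as separate entries, e.g. a “Memcpy HtoD” row for the copy versus a named kernel for actual compute. A minimal sketch (not from the original thread; entry names vary by version):

import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(1000)
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    y = x.cuda()   # host-to-device copy (the green activity in Nsight Systems)
    z = y * 2.0    # elementwise multiply kernel (the blue activity in Nsight Systems)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))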
