No GPU associated to the given UUID

Hello!
I am trying to run this very simple example to get familiar with DLProf.

I am running it on this CUDA cluster with the command dlprof --mode=pytorch --force true python3 dummy_network.py.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.223.02   Driver Version: 470.223.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:06:00.0 Off |                    0 |
| N/A   28C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:07:00.0 Off |                    0 |
| N/A   23C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
.
.

I am getting the error below, which I cannot explain. Can someone help me out here?

(peft) gslama12@cuda04:~/DLProf$ dlprof --mode=pytorch --force true python3 dummy_network.py
[DLProf-05:58:44] Creating Nsys Scheduler
[DLProf-05:58:44] RUNNING: nsys profile -t cuda,nvtx -s none --show-output=true --force-overwrite=true --export=sqlite -o ./nsys_profile python3 dummy_network.py
WARNING: CPU context switch tracing not supported, disabling.
Try the 'nsys status --environment' command to learn more.

/home/g/gslama12/anaconda3/envs/peft/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py:293: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn(
Initializing NVTX monkey patches
Done with NVTX monkey patching
Generating '/tmp/nsys-report-52a4.qdstrm'
FATAL ERROR: /build/agent/work/323cb361ab84164c/QuadD/Common/GpuTraits/Src/GpuTicksConverter.cpp(376): Throw in function QuadDCommon::TimestampType GpuTraits::GpuTicksConverter::ConvertToCpuTime(const QuadDCommon::Uuid&, uint64_t&) const
Dynamic exception type: boost::wrapexcept<QuadDCommon::NotFoundException>
std::exception::what: NotFoundException
[QuadDCommon::tag_message*] = No GPU associated to the given UUID

[DLProf-05:58:56] DLprof completed system call successfully
[DLProf-05:58:57] Error Occurred:
[DLProf-05:58:57] table SystemInfo already exists

DLProf has been end-of-lifed. As you can see, it used Nsys under the covers. Can you tell me what you are trying to do?

@hwilper Thanks for your reply!

I am trying to profile different fine-tuning methods on a simple model (MobileNetV2) and compare them to regular training with respect to FLOPs and memory consumption.
For this I need to profile the forward and backward passes separately and ideally get the FLOPs and memory for every layer of the model. The main concern is profiling the backward pass, since the PyTorch autograd profiler does not seem to be able to do so.

Do you have any suggestions for tools that I could use instead of DLProf?

Georg,

Thanks for reaching out. While dlprof is discontinued, as mentioned, I think the root cause of the error here is that the K80 is no longer a supported hardware platform. DLProf was using Nsight Systems version 2021.3, which requires a Pascal (e.g., P100 or GeForce GTX 1080) or newer GPU.

For what it's worth, if you have another GPU to try, you can get a lot of detailed information from nsys directly. The nvidia_dlprof_pytorch_nvtx package still seems to work for injecting the NVTX ranges into the PyTorch operations. You can install a modern version of Nsight Systems and run the nsys command that was printed in the dlprof output directly to generate the profiling report, and then use the dlprof analysis commands on it. They aren't actually supported, but still appear to work OK with newer nsys versions.

@mhallock

Thanks for the advice!

I managed to run nsys on the T4 GPU from Google Colab like this:

nsys profile -t cuda,nvtx -s none --show-output=true --force-overwrite=true --export=sqlite -o ./nsys_profile python3 net.py

I then inspected the .nsys-rep file using the GUI.

I used this simple script (net.py) to simulate a single training pass:

import torch
from torch import nn
from torch.autograd import profiler as torch_profiler

device = torch.device("cuda")

model = torch.hub.load('pytorch/vision:v0.10.0', 'mobilenet_v2', pretrained=True)

# replace the classifier head with a single-output layer
num_features = model.classifier[1].in_features
model.classifier = nn.Sequential(
    nn.Dropout(0.2, inplace=False),
    nn.Linear(num_features, 1))

# dummy inputs and loss
model.to(device)
img = torch.rand(1, 3, 244, 244).to(device)
output = model(img)  # unprofiled pass, used only to get the output shape
g0 = torch.rand_like(output)  # dummy upstream gradient, created on the same device as output

# clear any existing gradients
for param in model.parameters():
    param.grad = None

model.train()

# emit NVTX ranges for the autograd operations so they show up on the nsys timeline
with torch_profiler.emit_nvtx(record_shapes=True):
    output = model(img)
    output.backward(g0)

I can see the different sections for the forward and backward pass on the timeline, but I can't figure out how to get the memory used and the FLOPs for each section. Does nsys support profiling of FLOPs and memory?

Greetings @georgslamanig,

In terms of memory usage, you can capture the total GPU memory usage by adding the flag --cuda-memory-usage=true to the nsys profile command. This adds output for the memory usage over the lifetime of the program.

Collecting the FLOPs is a more challenging endeavor. Nsys itself does not capture that information. You can use our companion application, Nsight Compute, to perform detailed profiling of single kernels, and you can get FLOP rates there. However, it is a kernel-by-kernel profile, not a profile over a larger operation like the one you are trying to capture. What nsys does give you is the ability to measure the time taken, so if you are able to get a count of the FLOPs/MACs that make up your model, you can then compute your own computational throughput.
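As a rough illustration of that last point, here is a minimal sketch of counting MACs analytically for two common layer types and turning a measured duration into a throughput figure. The helper names are hypothetical (they are not part of nsys or PyTorch), the layer shapes are only illustrative, and the 1 ms duration is a made-up placeholder for a time you would read off the profile. It counts one MAC as two FLOPs and ignores bias terms.

```python
# Sketch: analytic MAC counts plus a throughput estimate.
# All helper names are hypothetical; durations must come from your own profile.

def conv2d_macs(c_in, c_out, kernel_h, kernel_w, out_h, out_w, groups=1, batch=1):
    """Multiply-accumulates for one Conv2d forward pass (bias ignored)."""
    macs_per_output = (c_in // groups) * kernel_h * kernel_w
    return batch * c_out * out_h * out_w * macs_per_output


def linear_macs(in_features, out_features, batch=1):
    """Multiply-accumulates for one Linear forward pass (bias ignored)."""
    return batch * in_features * out_features


def gflops_per_second(macs, seconds):
    """Convert a MAC count and an elapsed time to GFLOP/s (1 MAC = 2 FLOPs)."""
    return 2 * macs / seconds / 1e9


# Illustrative shapes: a 3x3 conv from 3 to 32 channels producing a 112x112 map,
# a depthwise 3x3 conv on 32 channels, and a 1280 -> 1 linear head.
first_conv = conv2d_macs(3, 32, 3, 3, 112, 112)
depthwise = conv2d_macs(32, 32, 3, 3, 112, 112, groups=32)
head = linear_macs(1280, 1)
print(first_conv, depthwise, head)

# Made-up duration of 1 ms, standing in for a time measured with nsys.
print(gflops_per_second(first_conv, 1e-3))
```

The same formulas could be applied per layer from inside forward hooks, using the actual output shapes of your model, if you want the counts collected automatically.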

To make it a little easier to quickly gather the forward/backward timings, you can enclose the two passes in their own NVTX ranges, for example:

import nvtx  # NVTX Python package (pip install nvtx)

with torch_profiler.emit_nvtx(record_shapes=True):
    with nvtx.annotate("forward", color="green"):
        output = model(img)
    with nvtx.annotate("backward", color="red"):
        output.backward(g0)

Visually, you would now see a green and red bar on the NVTX timeline for forward/backprop, and you can use the nsys stats command to get the aggregate statistics over those ranges.
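If you would rather pull those numbers out programmatically, the SQLite file produced by --export=sqlite can also be queried directly. The following is only a sketch: it assumes the export contains an NVTX_EVENTS table with start, end, text and textId columns and a StringIds(id, value) table, which may differ between nsys versions, so check the schema of your own export first. The function name is hypothetical, and the demo runs on an in-memory database with made-up rows rather than a real profile.

```python
import sqlite3


def nvtx_range_totals(conn, names):
    """Return {range name: (count, total duration in ns)} for the given NVTX ranges.

    Assumption: NVTX_EVENTS has "start", "end", text and textId columns, and
    range names are stored either inline (text) or via StringIds (textId).
    """
    placeholders = ",".join("?" for _ in names)
    query = f"""
        SELECT COALESCE(e.text, s.value) AS name,
               COUNT(*),
               SUM(e."end" - e."start")
        FROM NVTX_EVENTS AS e
        LEFT JOIN StringIds AS s ON s.id = e.textId
        WHERE e."end" IS NOT NULL
          AND COALESCE(e.text, s.value) IN ({placeholders})
        GROUP BY name
    """
    rows = conn.execute(query, list(names))
    return {name: (count, total) for name, count, total in rows}


# Demo on an in-memory database with made-up rows standing in for a real export.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE StringIds (id INTEGER, value TEXT)")
conn.execute(
    'CREATE TABLE NVTX_EVENTS ("start" INTEGER, "end" INTEGER, text TEXT, textId INTEGER)'
)
conn.execute("INSERT INTO StringIds VALUES (1, 'backward')")
conn.executemany(
    "INSERT INTO NVTX_EVENTS VALUES (?, ?, ?, ?)",
    [
        (0, 500, "forward", None),        # name stored inline
        (500, 2000, None, 1),             # name stored via StringIds
        (10, 20, "aten::conv2d", None),   # unrelated range, filtered out
    ],
)
totals = nvtx_range_totals(conn, ["forward", "backward"])
print(totals)
```

For a real profile you would open the exported file instead, for example sqlite3.connect("nsys_profile.sqlite") if the file name follows the -o ./nsys_profile option used above. Note that these durations are those of the ranges as recorded on the CPU side; since CUDA kernels launch asynchronously, they do not necessarily match the time the GPU spent on the same work.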