Bad symbolName when handling cudaLaunchKernel via Callback API

Hi there,

When intercepting some cudaLaunchKernel calls via the callback API (cbid: CUPTI_RUNTIME_TRACE_CBID_cudaLaunchKernel_v7000), the symbolName field of the CUpti_CallbackData struct holds an invalid pointer (0x8000000001). Why is that?

Sincerely,
Marcos

Hi, @slomp

We tried to repro this locally but couldn’t.
Can you provide more details?

Sure, here’s a PyTorch sample:

import torch
import torch.nn as nn
import torch.optim as optim

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(10, 50)
        self.fc2 = nn.Linear(50, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x


net = SimpleNet().to(device)
criterion = nn.MSELoss()
optimizer = optim.SGD(net.parameters(), lr=0.01)

# Create a dummy input and target
inputs = torch.randn(5, 10).to(device)
targets = torch.randn(5, 1).to(device)

# One training step: forward pass, loss, backward pass, parameter update
outputs = net(inputs)
loss = criterion(outputs, targets)

optimizer.zero_grad()
loss.backward()
optimizer.step()

The system in question has an A10G GPU running Ubuntu 20.04.
Driver version is 525.147.05, with CUDA toolkit 12.2.

Can you also share your code for the CUPTI API usage?

Sure (pinging @frankchen8508, who put the code together):

cupti_profiler.cu:

#include <cuda_runtime.h>
#include <cupti.h>
#include <cstdio>
#include <cstdlib>

#define CUPTI_API_CALL(call)                                                     \
    do                                                                           \
    {                                                                            \
        CUptiResult _status = call;                                              \
        if (_status != CUPTI_SUCCESS)                                            \
        {                                                                        \
            const char *errstr;                                                  \
            cuptiGetResultString(_status, &errstr);                              \
            fprintf(stderr, "%s:%d: error: function %s failed with error %s.\n", \
                    __FILE__, __LINE__, #call, errstr);                          \
            exit(EXIT_FAILURE);                                                  \
        }                                                                        \
    } while (0)

void CUPTIAPI runtimeAPICallback(
    void *userdata,
    CUpti_CallbackDomain domain,
    CUpti_CallbackId callbackId,
    const CUpti_CallbackData *cbdata)
{
    if (callbackId == CUPTI_RUNTIME_TRACE_CBID_cudaLaunchKernel_v7000)
    {
        if (cbdata->callbackSite == CUPTI_API_ENTER)
        {
            // Not sure why cbdata->symbolName sometimes equals 0x8000000001;
            // dereferencing it then causes a segfault.
            if (cbdata->symbolName != nullptr &&
                (unsigned long)cbdata->symbolName != 0x8000000001)
            {
                printf("Kernel name: %s\n", cbdata->symbolName);
            }
            else
            {
                printf("symbolName address: %p\n", (const void *)cbdata->symbolName);
            }
        }
    }
}

extern "C" {
    void start_profiling() {
        CUpti_SubscriberHandle subscriber;
        CUPTI_API_CALL(cuptiSubscribe(&subscriber, (CUpti_CallbackFunc)runtimeAPICallback , NULL));
        CUPTI_API_CALL(cuptiEnableDomain(1, subscriber, CUPTI_CB_DOMAIN_RUNTIME_API));
    }
}

Compiled as:

nvcc -O3 --shared -Xcompiler -fPIC -o libcupti_profiler.so cupti_profiler.cu -I/usr/local/cuda/extras/CUPTI/include -L/usr/local/cuda/extras/CUPTI/lib64 -lcuda -lcudart -lcupti
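For reference on how we attach it: our Python script loads libcupti_profiler.so with ctypes and calls start_profiling() before the first CUDA call (that loading code is omitted from the snippets here). If it simplifies your repro, an equivalent injection path, a hypothetical sketch we haven't used ourselves, is a constructor hook appended to cupti_profiler.cu so the library can simply be LD_PRELOADed:

// Hypothetical LD_PRELOAD hook (not part of our actual workflow): run
// start_profiling() automatically when the shared library is loaded, so
// no explicit ctypes call is needed on the Python side.
__attribute__((constructor))
static void cuptiProfilerOnLoad()
{
    start_profiling();
}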

The attached screenshot shows the output, with one of the printed symbolName addresses being 0x8000000001.

Hi, @slomp

We can repro the “symbolName address: 0x8000000001” issue internally. We will check with the dev team and let you know if there is any update.

We just stumbled on another invalid string pointer; this time it was 0x20000000001.

Thanks for the update. Can you also provide the repro steps?
Our dev can check both issues together.

This one may be tricky, as it’s part of an actual ML workflow we have. We’ll see what we can do.
Any updates on the 0x8000000001 address? In the meantime, here’s a trimmed-down repro for the 0x20000000001 case:

import torch
import torch.nn as nn
import torch.cuda.amp as amp
import ctypes  # used to load libcupti_profiler.so (profiler loading code omitted from this snippet)

class DoubleConv(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.double_conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        return self.double_conv(x)

class UNet(nn.Module):
    def __init__(self, in_channels=3, out_channels=1):
        super().__init__()
        
        # Encoder
        self.enc1 = DoubleConv(in_channels, 64)
        self.enc2 = DoubleConv(64, 128)
        self.enc3 = DoubleConv(128, 256)
        self.enc4 = DoubleConv(256, 512)
        self.enc5 = DoubleConv(512, 1024)
        
        # Decoder
        self.dec4 = DoubleConv(1024 + 512, 512)
        self.dec3 = DoubleConv(512 + 256, 256)
        self.dec2 = DoubleConv(256 + 128, 128)
        self.dec1 = DoubleConv(128 + 64, 64)
        
        # Final convolution
        self.final_conv = nn.Conv2d(64, out_channels, kernel_size=1)
        
        # Pooling and upsampling
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)

    def forward(self, x):
        # Encoder
        enc1 = self.enc1(x)
        x = self.pool(enc1)
        
        enc2 = self.enc2(x)
        x = self.pool(enc2)
        
        enc3 = self.enc3(x)
        x = self.pool(enc3)
        
        enc4 = self.enc4(x)
        x = self.pool(enc4)
        
        # Bridge
        x = self.enc5(x)
        
        # Decoder
        x = self.up(x)
        x = torch.cat([x, enc4], dim=1)
        x = self.dec4(x)
        
        x = self.up(x)
        x = torch.cat([x, enc3], dim=1)
        x = self.dec3(x)
        
        x = self.up(x)
        x = torch.cat([x, enc2], dim=1)
        x = self.dec2(x)
        
        x = self.up(x)
        x = torch.cat([x, enc1], dim=1)
        x = self.dec1(x)
        
        return self.final_conv(x)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
model = UNet().to(device)


# Create dummy input
batch_size = 1
channels = 3
height = 512
width = 512
x = torch.randn(batch_size, channels, height, width).to(device)

# Gradient scaler for mixed-precision training (set up but not exercised
# in this forward-only snippet)
scaler = amp.GradScaler()

# Forward pass with mixed precision
with amp.autocast():
    output = model(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Model using FP16: {next(model.parameters()).dtype == torch.float16}")
print(f"Output dtype: {output.dtype}")

If we use cupti_profiler.cu on this example, the symbolName address is 0x20000000001.

Thanks! We can also reproduce the 0x20000000001 issue.
Our dev is checking.

I’m also seeing this issue. @veraj was there a resolution? Thanks!

Hi, @mmcloughlin

Sorry for the issue; we already have an internal bug tracking it, but unfortunately it hasn’t been resolved yet. I’ve pinged the developer to see if we can make progress on this. Will let you know if there is any update.

Hello,

The invalid symbol name can occur because the CUDA context can be NULL in the API callbacks if the kernel launch is the first CUDA API call in the application. This is related to the new CUDA Runtime dynamic loading feature. To prevent segmentation faults, please check whether the CUDA context is NULL before accessing the symbolName field:

if (cbdata->context != NULL)
{
    // Safe to access the symbolName field
}

Thank you for this workaround. Note that it is only valid at callback entry, when cbdata->callbackSite == CUPTI_API_ENTER. The CUDA Runtime call may initialize the context, so cbdata->context is non-NULL on exit, but symbolName is still an invalid non-NULL pointer there.
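For anyone else hitting this before a proper fix lands, the interim guard we're now using looks like this (a sketch combining both observations: only trust symbolName at API entry with an initialized context):

// Interim workaround sketch: dereference symbolName only for kernel-launch
// callbacks at CUPTI_API_ENTER with an initialized CUDA context. In every
// other case we've seen, it can be an invalid non-NULL pointer such as
// 0x8000000001 or 0x20000000001.
if (callbackId == CUPTI_RUNTIME_TRACE_CBID_cudaLaunchKernel_v7000 &&
    cbdata->callbackSite == CUPTI_API_ENTER &&
    cbdata->context != NULL &&
    cbdata->symbolName != NULL)
{
    printf("Kernel name: %s\n", cbdata->symbolName);
}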

Thanks for reporting back. We are working on a cleaner solution that sets the symbol name to NULL in these cases.
We will post an update when the fix is available.