Compute-sanitizer --kernel-name-exclude option not filtering kernels in CUDA 12.8

The --kernel-name-exclude flag in compute-sanitizer does not appear to work in CUDA 12.8. Even when the kernel is specified explicitly, using either kns= (substring match) or kne= (full kernel name), it still experiences a significant slowdown, which suggests that compute-sanitizer is still instrumenting it.

System Information

System OS: Linux
OS build: #53-Ubuntu SMP PREEMPT_DYNAMIC Sat Jan 11 00:06:25 UTC 2025
System CPU: AMD EPYC 7413 24-Core Processor
CPU architecture: x86_64
CUDA version: 12.8
Display Driver version: 570.86.15
GPU Model: A100 (SM 80)
compute sanitizer version: 2025.1.0.0 (build 35351055) (public-release)

Issue Details

We developed a Minimum Working Example (MWE) to test the functionality of compute-sanitizer --kernel-name-exclude and observed that excluded kernels are still being instrumented.

Expected Behavior

  • When running compute-sanitizer --kernel-name-exclude kns=ExcludedKernel ./test_compute_sanitizer, ExcludedKernel should run at the same speed as when compute-sanitizer is not used.
  • When running compute-sanitizer ./test_compute_sanitizer, both kernels should be instrumented and experience a slowdown.

Observed Behavior

  • The ExcludedKernel experiences the same slowdown, whether excluded or not.
  • Using both kns= (substring) and kne= (full kernel name) does not prevent instrumentation.
  • compute-sanitizer seems to ignore all kernel exclusion rules in the latest CUDA version.

Reproducible Test (Minimum Working Example)

Here’s a CUDA C++ program to reproduce the issue:

#include <cuda_runtime.h>
#include <iostream>
#include <chrono>

// Evaluate the call exactly once, then report and abort on failure
#define CUDA_CALL(x) do { cudaError_t err_ = (x); if (err_ != cudaSuccess) { \
    std::cerr << "Error at " << __FILE__ << ":" << __LINE__ << " - " << cudaGetErrorString(err_) << std::endl; \
    exit(EXIT_FAILURE);}} while(0)

// Computationally heavy kernel to test exclusion
__global__ void ExcludedKernel(float *d_data, int size) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) {
        float value = d_data[idx];
        for (int i = 0; i < 100000000; i++) { // Large iteration count
            value = value * 1.00001f + 1.0f; 
            if (i % 1000000 == 0) { // Periodic writes to global memory
                d_data[idx] = value; 
            }
        }
        d_data[idx] = value; // Final write
    }
}

// Lightweight kernel
__global__ void IncludedKernel(float *d_data, int size) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) {
        d_data[idx] *= 2;
    }
}

void runKernel(const char* name, void (*kernel)(float*, int), float *d_data, int blocks, int threads, int size) {
    auto start = std::chrono::high_resolution_clock::now();
    kernel<<<blocks, threads>>>(d_data, size);
    CUDA_CALL(cudaDeviceSynchronize());
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << name << " execution time: "
              << std::chrono::duration<double>(end - start).count() << " seconds\n";
}

int main() {
    int arraySize = 16384; // Element count; threads with idx >= arraySize exit via the bounds check
    float *d_data;
    CUDA_CALL(cudaMalloc(&d_data, arraySize * sizeof(float)));
    CUDA_CALL(cudaMemset(d_data, 0, arraySize * sizeof(float)));

    int blocks = 128, threads = 256;

    runKernel("ExcludedKernel", ExcludedKernel, d_data, blocks, threads, arraySize);
    runKernel("IncludedKernel", IncludedKernel, d_data, blocks, threads, arraySize);

    CUDA_CALL(cudaFree(d_data));
    return 0;
}

Steps to Reproduce

1. Compile

nvcc -o test_compute_sanitizer test_compute_sanitizer.cu

2. Run Baseline Execution (No Compute-Sanitizer)

./test_compute_sanitizer

Times observed:

  • ExcludedKernel runs in ~0.8 seconds
  • IncludedKernel runs in milliseconds

3. Run with Compute-Sanitizer and Exclude ExcludedKernel

compute-sanitizer --kernel-name-exclude kns=ExcludedKernel ./test_compute_sanitizer

Times observed (unexpectedly slow):

  • ExcludedKernel runs in ~58 seconds, indicating it is not excluded.
  • IncludedKernel runs as expected.

4. Run with Compute-Sanitizer (No Exclusions)

compute-sanitizer ./test_compute_sanitizer

Expected: Both kernels should slow down due to instrumentation.

  • ExcludedKernel runs in ~58 seconds, as expected under instrumentation.
  • IncludedKernel runs as expected.

I also tried passing the full mangled name with kne=, but the timings still indicate that compute-sanitizer is ignoring the exclusion.

To summarize:

Command | Expected Behavior | Observed Behavior
--- | --- | ---
./test_compute_sanitizer | ExcludedKernel: 0.8s | ✅ 0.8s
compute-sanitizer ./test_compute_sanitizer | ExcludedKernel should be slow | ✅ 58s
compute-sanitizer --kernel-name-exclude kns=ExcludedKernel ./test_compute_sanitizer | ExcludedKernel should run in 0.8s | 58s (not excluded)
compute-sanitizer --kernel-name-exclude kne=_Z14ExcludedKernelPii ./test_compute_sanitizer | ExcludedKernel should run in 0.8s | 58s (not excluded)

Hi, @samkawtikwar

We verified the exclude option works.

We modified your code to introduce a memory error.

When run with compute-sanitizer alone, the memory error is reported.
When run with compute-sanitizer and the exclude option, no error is reported, which indicates that our tool skipped checking for that kernel.
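
The modified code is not shown in the reply. A minimal sketch of such a check, assuming the injected error is an out-of-bounds global write (the kernel body and launch sizes below are hypothetical, chosen to mirror the original example):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical variant of ExcludedKernel with the bounds check removed, so
// threads with idx >= size write past the end of the allocation.
__global__ void ExcludedKernel(float *d_data, int size) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    d_data[idx] = static_cast<float>(size);  // deliberately out of bounds when idx >= size
}

int main() {
    const int size = 16384;                  // elements allocated
    const int blocks = 128, threads = 256;   // 32768 threads launched
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, size * sizeof(float)) != cudaSuccess) return 1;

    ExcludedKernel<<<blocks, threads>>>(d_data, size);
    cudaError_t err = cudaDeviceSynchronize();
    std::printf("sync status: %s\n", cudaGetErrorString(err));

    cudaFree(d_data);
    return 0;
}
```

Under plain compute-sanitizer, the default memcheck tool would be expected to flag the invalid writes. With --kernel-name-exclude kns=ExcludedKernel, the absence of a report would indicate that the kernel was skipped, independent of any timing measurement.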

As for the performance issue, we have an internal issue tracking it. Thanks for reporting this to us!