The `--kernel-name-exclude` flag in `compute-sanitizer` does not appear to work in CUDA 12.8 (compute-sanitizer 2025.1.0). Even after explicitly specifying kernel names using both `kns=` (substring match) and `kne=` (full kernel name), the kernel still experiences a significant slowdown, indicating that `compute-sanitizer` is still instrumenting it.
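To make the two filter keys concrete, here is a minimal C++ sketch of the matching semantics described above: `kns=` as a substring test and `kne=` as an exact comparison. It assumes both keys are applied to the mangled kernel name, which is my reading of the Compute Sanitizer documentation. `matches_kns` and `matches_kne` are hypothetical helpers for illustration only and are not part of compute-sanitizer.

```cpp
#include <string>

// Hypothetical model of the kns= key: true when `pattern` occurs anywhere
// in the kernel's (assumed mangled) name.
bool matches_kns(const std::string& kernel_name, const std::string& pattern) {
    return kernel_name.find(pattern) != std::string::npos;
}

// Hypothetical model of the kne= key: true only when the complete name
// is identical to `pattern`.
bool matches_kne(const std::string& kernel_name, const std::string& pattern) {
    return kernel_name == pattern;
}
```

Under this model, `matches_kns("_Z14ExcludedKernelPfi", "ExcludedKernel")` is true, while `matches_kne("_Z14ExcludedKernelPfi", "ExcludedKernel")` is false because the unmangled name is not the complete symbol. Here `_Z14ExcludedKernelPfi` is the Itanium-ABI mangling of a function `ExcludedKernel(float*, int)`.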
System Information
System OS: Linux
OS build: #53-Ubuntu SMP PREEMPT_DYNAMIC Sat Jan 11 00:06:25 UTC 2025
System CPU: AMD EPYC 7413 24-Core Processor
CPU architecture: x86_64
CUDA version: 12.8
Display Driver version: 570.86.15
GPU Model: A100 (SM 80)
compute sanitizer version: 2025.1.0.0 (build 35351055) (public-release)
Issue Details
We developed a Minimum Working Example (MWE) to test the functionality of `compute-sanitizer --kernel-name-exclude` and observed that excluded kernels are still being instrumented.
Expected Behavior
- When running `compute-sanitizer --kernel-name-exclude kns=ExcludedKernel ./main.exe`, `ExcludedKernel` should run at the same speed as when compute-sanitizer is not used.
- When running `compute-sanitizer ./main.exe`, both kernels should be instrumented and experience a slowdown.
Observed Behavior
- `ExcludedKernel` experiences the same slowdown whether it is excluded or not.
- Using either `kns=` (substring) or `kne=` (full kernel name) does not prevent instrumentation. `compute-sanitizer` seems to ignore all kernel exclusion rules in CUDA 12.8.
Reproducible Test (Minimum Working Example)
Here’s a CUDA C++ program to reproduce the issue:
#include <cuda_runtime.h>
#include <chrono>
#include <cstdlib>
#include <iostream>

// Evaluate the call once so a failing call is not re-executed when reporting.
#define CUDA_CALL(x) do { cudaError_t err_ = (x); if (err_ != cudaSuccess) { \
    std::cerr << "Error at " << __FILE__ << ":" << __LINE__ << " - " \
              << cudaGetErrorString(err_) << std::endl; \
    exit(EXIT_FAILURE); } } while (0)

// Computationally heavy kernel to test exclusion
__global__ void ExcludedKernel(float *d_data, int size) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) {
        float value = d_data[idx];
        for (int i = 0; i < 100000000; i++) {  // Large iteration count
            value = value * 1.00001f + 1.0f;
            if (i % 1000000 == 0) {  // Periodic writes to global memory
                d_data[idx] = value;
            }
        }
        d_data[idx] = value;  // Final write
    }
}

// Lightweight kernel
__global__ void IncludedKernel(float *d_data, int size) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) {
        d_data[idx] *= 2.0f;
    }
}

// Launch a kernel and report host-side wall-clock time, including the sync.
void runKernel(const char* name, void (*kernel)(float*, int), float *d_data,
               int blocks, int threads, int size) {
    auto start = std::chrono::high_resolution_clock::now();
    kernel<<<blocks, threads>>>(d_data, size);
    CUDA_CALL(cudaGetLastError());  // Catch launch-time errors
    CUDA_CALL(cudaDeviceSynchronize());
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << name << " execution time: "
              << std::chrono::duration<double>(end - start).count() << " seconds\n";
}

int main() {
    // 128 * 256 = 32768 threads are launched; those with idx >= arraySize
    // exit at the bounds check in each kernel.
    int arraySize = 16384;
    float *d_data;
    CUDA_CALL(cudaMalloc(&d_data, arraySize * sizeof(float)));
    CUDA_CALL(cudaMemset(d_data, 0, arraySize * sizeof(float)));
    int blocks = 128, threads = 256;
    runKernel("ExcludedKernel", ExcludedKernel, d_data, blocks, threads, arraySize);
    runKernel("IncludedKernel", IncludedKernel, d_data, blocks, threads, arraySize);
    CUDA_CALL(cudaFree(d_data));
    return 0;
}
Steps to Reproduce
1. Compile: `nvcc -o test_compute_sanitizer test_compute_sanitizer.cu`
2. Run the baseline (no compute-sanitizer): `./test_compute_sanitizer`
Times observed:
- `ExcludedKernel` runs in ~0.8 seconds.
- `IncludedKernel` runs in milliseconds.
3. Run with compute-sanitizer and exclude `ExcludedKernel`: `compute-sanitizer --kernel-name-exclude kns=ExcludedKernel ./test_compute_sanitizer`
Times observed (unexpectedly slow):
- `ExcludedKernel` runs in ~58 seconds, indicating it is not excluded.
- `IncludedKernel` runs as expected.
4. Run with compute-sanitizer (no exclusions): `compute-sanitizer ./test_compute_sanitizer`
Expected: both kernels slow down due to instrumentation.
Times observed:
- `ExcludedKernel` runs in ~58 seconds, the same as in step 3.
- `IncludedKernel` runs as expected.
I have also tried passing the full mangled name with `kne=`, but the timings still indicate that `compute-sanitizer` is ignoring the exclusion.
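One thing worth ruling out for the `kne=` case is the mangled name itself. The sketch below uses `abi::__cxa_demangle` from `<cxxabi.h>` (available with GCC and Clang, so this assumes an Itanium-ABI host toolchain) to turn a mangled symbol back into a readable signature. `demangle_symbol` is a hypothetical helper name introduced here for illustration.

```cpp
#include <cxxabi.h>   // abi::__cxa_demangle (GCC/Clang, Itanium C++ ABI)
#include <cstdlib>
#include <string>

// Hypothetical helper: returns the demangled form of an Itanium-ABI symbol,
// or an empty string if demangling fails.
std::string demangle_symbol(const char* mangled) {
    int status = 0;
    char* out = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    std::string result = (status == 0 && out != nullptr) ? out : "";
    std::free(out);  // The buffer is malloc-allocated; free(nullptr) is a no-op
    return result;
}
```

Applied to the name used in this report, `demangle_symbol("_Z14ExcludedKernelPii")` gives `ExcludedKernel(int*, int)`, whereas the kernel in the MWE takes `(float*, int)`, which mangles to `_Z14ExcludedKernelPfi`. That mismatch could explain why an exact-name filter would not match, but it would not explain the `kns=ExcludedKernel` result, since the substring is present in either symbol.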
To summarize:
Command | Expected Behavior | Observed Behavior |
---|---|---|
`./test_compute_sanitizer` | `ExcludedKernel`: 0.8s | ✅ 0.8s |
`compute-sanitizer ./test_compute_sanitizer` | `ExcludedKernel` should be slow | ✅ 58s |
`compute-sanitizer --kernel-name-exclude kns=ExcludedKernel ./test_compute_sanitizer` | `ExcludedKernel` should run in 0.8s | ❌ 58s (Not excluded) |
`compute-sanitizer --kernel-name-exclude kne=_Z14ExcludedKernelPii ./test_compute_sanitizer` | `ExcludedKernel` should run in 0.8s | ❌ 58s (Not excluded) |