I am trying to use compute-sanitizer to instrument the following TensorFlow program:
import tensorflow as tf
from keras import layers
import os
os.environ["TF_DISABLE_RZ_CHECK"] = "1"
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"
tf.keras.backend.set_image_data_format('channels_first')
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
tf.config.run_functions_eagerly(True)
tensor = tf.zeros([1, 2, 859043])
model = layers.Conv1D(filters=2, kernel_size=524287, strides=1, groups=2)
model(tensor)
print("DONE")
Under compute-sanitizer it hangs indefinitely (I gave up after an hour); without compute-sanitizer it finishes in seconds.
I also tried instrumenting the program with NVBit, and it hangs there as well; I presume this is because both tools rely on dynamic binary instrumentation.
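For reference, the instrumented runs look roughly like this (memcheck is compute-sanitizer's default tool; the NVBit line assumes the stock mem_trace example tool from the NVBit release, and the path is a placeholder for my setup):
compute-sanitizer --tool memcheck python3 test.py
LD_PRELOAD=/path/to/nvbit_release/tools/mem_trace/mem_trace.so python3 test.py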
The last kernel launch recorded before the hang is the following:
MEMTRACE: CTX 0x00000000050f8db0 - LAUNCH - Kernel pc 0x00007ff9a038f900 - Kernel name sm80_xmma_fprop_implicit_gemm_indexed_tf32f32_tf32f32_f32_nhwckrsc_nhwc_tilesize64x64x64_stage4_warpsize1x4x1_g16_tensor16x8x8_execute_kernel__5x_cudnn - grid launch id 12 - grid size 1,5231,1 - block size 128,1,1 - nregs 166 - shmem 132096 - cuda stream id 1276264096
which appears to be a cuDNN kernel.
Since both compute-sanitizer and cuDNN are closed source, I am not sure how to debug this further.
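To help isolate the cuDNN kernel from the Keras stack, here is a minimal sketch that should issue the same grouped convolution directly through tf.nn.conv1d (shapes copied from the script above; the NCW layout and the [kernel_size, channels/groups, filters] filter shape are my assumptions about how the Keras layer lowers under channels_first):
import tensorflow as tf

# batch=1, channels=2, width=859043, as in the Keras reproducer
x = tf.zeros([1, 2, 859043])
# grouped conv with groups=2: filter shape is [kernel_size, channels/groups, filters]
w = tf.zeros([524287, 1, 2])
# 'NCW' is the channels_first layout for 1-D convolution
y = tf.nn.conv1d(x, w, stride=1, padding='VALID', data_format='NCW')
print(y.shape)  # expected (1, 2, 334757)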
root@f6260fcd90dd:/workspace/forum_329156# python test.py
2025-04-22 07:48:19.100157: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-04-22 07:48:19.114471: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8473] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-04-22 07:48:19.119110: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1471] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-04-22 07:48:19.130281: I tensorflow/core/platform/cpu_feature_guard.cc:211] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1745308101.275822 1609 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci
I0000 00:00:1745308101.335417 1609 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci
I0000 00:00:1745308101.337588 1609 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci
I0000 00:00:1745308101.344261 1609 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci
I0000 00:00:1745308101.346498 1609 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci
I0000 00:00:1745308101.348584 1609 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci
I0000 00:00:1745308101.473900 1609 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci
I0000 00:00:1745308101.475133 1609 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci
2025-04-22 07:48:21.476225: I tensorflow/core/common_runtime/gpu/gpu_process_state.cc:198] Using CUDA malloc Async allocator for GPU: 0
I0000 00:00:1745308101.476369 1609 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci
2025-04-22 07:48:21.477523: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2021] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 76512 MB memory: -> device: 0, name: NVIDIA H100 CNX, pci bus id: 0000:c6:00.0, compute capability: 9.0
Traceback (most recent call last):
  File "/workspace/forum_329156/test.py", line 16, in <module>
    model(tensor)
  File "/usr/local/lib/python3.12/dist-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.12/dist-packages/keras/src/layers/convolutional/base_conv.py", line 180, in build
    raise ValueError(
ValueError: The number of input channels must be evenly divisible by the number of groups. Received groups=2, but the input has 859043 channels (full input shape is (1, 2, 859043)).