I am trying to use compute-sanitizer to instrument the following TensorFlow program:
import tensorflow as tf
from keras import layers
import os
os.environ["TF_DISABLE_RZ_CHECK"] = "1"
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"
tf.keras.backend.set_image_data_format('channels_first')
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
tf.config.run_functions_eagerly(True)
tensor = tf.zeros([1, 2, 859043])
model = layers.Conv1D(filters=2, kernel_size=524287, strides=1, groups=2)
model(tensor)
print("DONE")
Under compute-sanitizer it hangs indefinitely (I gave up after an hour); without compute-sanitizer it finishes in seconds.
I also tried instrumenting the program with NVBit, and it hangs there as well; I presume this is because both tools rely on dynamic binary instrumentation.
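For reference, the instrumented runs look roughly like this (memcheck is compute-sanitizer's default tool; the NVBit line assumes the stock mem_trace example tool from the NVBit release, and the path is a placeholder for my setup):
compute-sanitizer --tool memcheck python3 test.py
LD_PRELOAD=/path/to/nvbit_release/tools/mem_trace/mem_trace.so python3 test.py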
The last kernel launch recorded before the hang is the following:
MEMTRACE: CTX 0x00000000050f8db0 - LAUNCH - Kernel pc 0x00007ff9a038f900 - Kernel name sm80_xmma_fprop_implicit_gemm_indexed_tf32f32_tf32f32_f32_nhwckrsc_nhwc_tilesize64x64x64_stage4_warpsize1x4x1_g16_tensor16x8x8_execute_kernel__5x_cudnn - grid launch id 12 - grid size 1,5231,1 - block size 128,1,1 - nregs 166 - shmem 132096 - cuda stream id 1276264096
which appears to be a cuDNN kernel.
Since both compute-sanitizer and cuDNN are closed source, I am not sure how to debug this further.
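To help isolate the cuDNN kernel from the Keras stack, here is a minimal sketch that should issue the same grouped convolution directly through tf.nn.conv1d (shapes copied from the script above; the NCW layout and the [kernel_size, channels/groups, filters] filter shape are my assumptions about how the Keras layer lowers under channels_first):
import tensorflow as tf

# batch=1, channels=2, width=859043, as in the Keras reproducer
x = tf.zeros([1, 2, 859043])
# grouped conv with groups=2: filter shape is [kernel_size, channels/groups, filters]
w = tf.zeros([524287, 1, 2])
# 'NCW' is the channels_first layout for 1-D convolution
y = tf.nn.conv1d(x, w, stride=1, padding='VALID', data_format='NCW')
print(y.shape)  # expected (1, 2, 334757)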
root@f6260fcd90dd:/workspace/forum_329156# python test.py
2025-04-22 07:48:19.100157: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-04-22 07:48:19.114471: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8473] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-04-22 07:48:19.119110: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1471] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-04-22 07:48:19.130281: I tensorflow/core/platform/cpu_feature_guard.cc:211] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1745308101.275822 1609 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci
I0000 00:00:1745308101.335417 1609 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci
I0000 00:00:1745308101.337588 1609 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci
I0000 00:00:1745308101.344261 1609 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci
I0000 00:00:1745308101.346498 1609 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci
I0000 00:00:1745308101.348584 1609 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci
I0000 00:00:1745308101.473900 1609 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci
I0000 00:00:1745308101.475133 1609 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci
2025-04-22 07:48:21.476225: I tensorflow/core/common_runtime/gpu/gpu_process_state.cc:198] Using CUDA malloc Async allocator for GPU: 0
I0000 00:00:1745308101.476369 1609 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci
2025-04-22 07:48:21.477523: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2021] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 76512 MB memory: -> device: 0, name: NVIDIA H100 CNX, pci bus id: 0000:c6:00.0, compute capability: 9.0
Traceback (most recent call last):
  File "/workspace/forum_329156/test.py", line 16, in <module>
    model(tensor)
  File "/usr/local/lib/python3.12/dist-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.12/dist-packages/keras/src/layers/convolutional/base_conv.py", line 180, in build
    raise ValueError(
ValueError: The number of input channels must be evenly divisible by the number of groups. Received groups=2, but the input has 859043 channels (full input shape is (1, 2, 859043)).