How to debug CUDA?

[18/49] /usr/local/cuda/bin/nvcc  -I/home/zyhuang/flash-CUDA/flash-attention/csrc/flash_attn -I/home/zyhuang/flash-CUDA/flash-attention/csrc/flash_attn/src -I/home/zyhuang/flash-CUDA/flash-attention/csrc/cutlass/include -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.10 -c -c /home/zyhuang/flash-CUDA/flash-attention/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.cu -o /home/zyhuang/flash-CUDA/flash-attention/build/temp.linux-x86_64-3.10/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1917, in _run_ninja_build
    subprocess.run(
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v', '-j', '16']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/zyhuang/flash-CUDA/flash-attention/setup.py", line 288, in <module>
    setup(
  File "/usr/local/lib/python3.10/dist-packages/setuptools/__init__.py", line 103, in setup
    return distutils.core.setup(**attrs)
  File "/usr/lib/python3.10/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/usr/lib/python3.10/distutils/dist.py", line 966, in run_commands
    self.run_command(cmd)
  File "/usr/local/lib/python3.10/dist-packages/setuptools/dist.py", line 989, in run_command
    super().run_command(command)
  File "/usr/lib/python3.10/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/usr/local/lib/python3.10/dist-packages/setuptools/command/install.py", line 84, in run
    self.do_egg_install()
  File "/usr/local/lib/python3.10/dist-packages/setuptools/command/install.py", line 132, in do_egg_install
    self.run_command('bdist_egg')
  File "/usr/lib/python3.10/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/usr/local/lib/python3.10/dist-packages/setuptools/dist.py", line 989, in run_command
    super().run_command(command)
  File "/usr/lib/python3.10/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/usr/local/lib/python3.10/dist-packages/setuptools/command/bdist_egg.py", line 167, in run
    cmd = self.call_command('install_lib', warn_dir=0)
  File "/usr/local/lib/python3.10/dist-packages/setuptools/command/bdist_egg.py", line 153, in call_command
    self.run_command(cmdname)
  File "/usr/lib/python3.10/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/usr/local/lib/python3.10/dist-packages/setuptools/dist.py", line 989, in run_command
    super().run_command(command)
  File "/usr/lib/python3.10/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/usr/local/lib/python3.10/dist-packages/setuptools/command/install_lib.py", line 11, in run
    self.build()
  File "/usr/lib/python3.10/distutils/command/install_lib.py", line 109, in build
    self.run_command('build_ext')
  File "/usr/lib/python3.10/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/usr/local/lib/python3.10/dist-packages/setuptools/dist.py", line 989, in run_command
    super().run_command(command)
  File "/usr/lib/python3.10/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/usr/local/lib/python3.10/dist-packages/setuptools/command/build_ext.py", line 88, in run
    _build_ext.run(self)
  File "/usr/lib/python3.10/distutils/command/build_ext.py", line 340, in run
    self.build_extensions()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 865, in build_extensions
    build_ext.build_extensions(self)
  File "/usr/lib/python3.10/distutils/command/build_ext.py", line 449, in build_extensions
    self._build_extensions_serial()
  File "/usr/lib/python3.10/distutils/command/build_ext.py", line 474, in _build_extensions_serial
    self.build_extension(ext)
  File "/usr/local/lib/python3.10/dist-packages/setuptools/command/build_ext.py", line 249, in build_extension
    _build_ext.build_extension(self, ext)
  File "/usr/local/lib/python3.10/dist-packages/Cython/Distutils/build_ext.py", line 127, in build_extension
    super(build_ext, self).build_extension(ext)
  File "/usr/lib/python3.10/distutils/command/build_ext.py", line 529, in build_extension
    objects = self.compiler.compile(sources,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 678, in unix_wrap_ninja_compile
    _write_ninja_file_and_compile_objects(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1590, in _write_ninja_file_and_compile_objects
    _run_ninja_build(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1933, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension

I get this error… The real cause turned out to be that I wrote a BlockIdx.x in my code; after I changed it to blockIdx.x, the build succeeded. But how can I find this from the error output above? I could not spot the bug immediately… Could someone kindly help me? Thank you!!!
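
For context, here is a minimal sketch of the kind of mistake I mean (a made-up kernel for illustration, not my actual code):

    // minimal illustration of the typo (hypothetical kernel, not the
    // actual flash-attention source)
    __global__ void add_one(float *x, int n)
    {
        // WRONG: "BlockIdx" is not a CUDA built-in variable, so nvcc's
        // front end rejects it with something like:
        //   error: identifier "BlockIdx" is undefined
        // int i = BlockIdx.x * blockDim.x + threadIdx.x;

        // RIGHT: the built-in is spelled blockIdx, with a lowercase b
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] += 1.0f;
    }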

The world of computer science is characterized by the fact that there are often many ways to do something. So a single description of a topic, like debugging or compiling, is unlikely to cover every case.

Coupled with that, I would say it helps when you understand the tools you are using. Of course it’s very difficult to understand every tool used in the assembly of a complex software stack like a modern deep-learning (AI) stack. But sometimes more familiarity helps.

Coupled with that, some tools are designed to hide the details from you, which might be a good thing in the general case, but is often an impediment when things go wrong.

Let’s take a look. Often the beginning and the end of an error sequence are most instructive.

It helps if you happen to know (or did some searching) that ninja is a widely used build (i.e. compiler) management/accelerator tool. But even if you don’t, if you are working with CUDA, hopefully you know that a command beginning with:

/usr/local/cuda/bin/nvcc ...

is invoking the CUDA compiler. Even if you had missed all those clues that this is a compilation failure you are looking at, the very end of your output says the same thing:

RuntimeError: Error compiling objects for extension

Now let’s go a bit farther:

subprocess.CalledProcessError: Command '['ninja', '-v', '-j', '16']' returned non-zero exit status 1.

Again, this may just be a matter of experience, but that line is a tip-off that this tool is hiding the output of the subcommand (ninja, which is in turn calling nvcc), and just giving you the summary statement that “it returned a non-zero exit status”, which is a side effect of the fact that the compilation failed.
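
You can see how that summary-only message arises in Python itself. Here is a simplified sketch of roughly what the build helper does; it is an illustration, not the actual torch cpp_extension source:

    # simplified sketch, not the actual torch cpp_extension code
    import subprocess

    try:
        # check=True raises CalledProcessError on a non-zero exit status.
        # The exception records only the command and the status, not
        # whatever the compiler printed along the way.
        subprocess.run(['ninja', '-v', '-j', '16'], check=True)
    except subprocess.CalledProcessError as e:
        raise RuntimeError('Error compiling objects for extension') from e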

So we now know that in the process of trying to run or build whatever you were trying to build, ninja/nvcc compiled something, and that compilation failed. That is almost never a good thing. But the problem is that the tool (some pythonic build tool) is hiding the actual output from nvcc, which would have been very useful at this point.

These build tools are great when they work. But when you are working with people on forums who are having trouble, and who don’t know enough (or how) to ask CMake or Visual Studio or the build tool here for verbose output, this hiding gets in the way of speedy problem resolution. For someone like me, who spends some time trying to help other people on these forums, it means I have learned to really “love” CMake and Visual Studio in particular; I mostly just skip those questions. (In my view, if you are a beginner in CUDA and also not that familiar with CMake, then you almost certainly don’t know all the complexities of CMake/CUDA integration, and you are not really helping yourself by trying to use CMake. But I digress.)

So once we have realized that a compilation failed, but we don’t know why, we usually need to get more information. At least in this particular case, we are told the exact command that failed:

[18/49] /usr/local/cuda/bin/nvcc ... -c /home/zyhuang/flash-CUDA/flash-attention/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.cu ... (full command in the log above)

So now it will be helpful if you understand nvcc syntax and command-line options, because when a compilation command fails, it is often the case that the file being compiled has an error in it (there can be other problems as well, such as misconfigured switches passed to nvcc by the tool itself). So as a first order of business, we want to identify the source files being compiled in that failing command. But more generally, there may be several things to do at this point:

  • Identify the source file(s) being compiled (in this case I can see fairly quickly, by scanning that nvcc command, that the file is /home/zyhuang/flash-CUDA/flash-attention/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.cu), and study those for errors. This is not the optimal way to do it, though; the optimal way is to actually compile the file and look at the errors that are actually generated.
  • Look for obvious misconfiguration of switches in that nvcc command line.
  • See if you can get the failing command reissued: for example, manually copy that command out of the build log and issue it yourself from the command line (rather than letting the build tool spit it out), or compile the file yourself some other way with nvcc. A sketch of this appears after this list.
  • See if you can learn how to tell the build tool that you want “verbose” output, and ask for that. When you rerun your build command, and it gets to the point of the failure, it will then likely show you the actual output from nvcc, which will almost certainly be very useful.
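
As a sketch of the reissue option mentioned above: everything after the “[18/49]” progress marker in the log is the literal nvcc command line, so (abbreviating the long lists of -I/-D/-gencode options, which you would fill in from the log) you can paste it into a shell yourself:

    # rerun the failing compile by hand; nvcc's diagnostics now land in
    # your terminal instead of being swallowed by the build tool
    /usr/local/cuda/bin/nvcc \
        -I/home/zyhuang/flash-CUDA/flash-attention/csrc/flash_attn \
        ...(remaining options from the log)... \
        -c /home/zyhuang/flash-CUDA/flash-attention/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.cu \
        -o /tmp/flash_bwd_test.o

With the BlockIdx.x typo still in place, this prints the compiler’s actual error messages, including the file and line they come from.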

That last option is often the most effective, but requires a little effort to learn about the build tool you are using.
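
For example, and treating these as starting points rather than guarantees (the exact spelling depends on which tool and version you have):

    # pip-driven builds: -v (repeatable) passes through more build output
    pip install -v .

    # CMake-driven builds:
    cmake --build . --verbose        # or: make VERBOSE=1

    # PyTorch C++/CUDA extension builds: serialize the build so the
    # failing command and its output are easy to spot
    MAX_JOBS=1 python setup.py install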

When people use those build tools that hide important info, and they don’t seem to want to learn how to ask for verbose output, I generally just skip those questions.


Hi @Robert_Crovella, sorry for reviving an old thread. I ran into the same issue recently, and I just wanted to say thank you for taking the time to explain the whole strategy of debugging an issue like this. As a beginner in CUDA (and programming in general), I found your post very helpful.

Thanks again!