How to debug CUDA?

[18/49] /usr/local/cuda/bin/nvcc  -I/home/zyhuang/flash-CUDA/flash-attention/csrc/flash_attn -I/home/zyhuang/flash-CUDA/flash-attention/csrc/flash_attn/src -I/home/zyhuang/flash-CUDA/flash-attention/csrc/cutlass/include -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.10 -c -c /home/zyhuang/flash-CUDA/flash-attention/csrc/flash_attn/src/ -o /home/zyhuang/flash-CUDA/flash-attention/build/temp.linux-x86_64-3.10/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/", line 1917, in _run_ninja_build
  File "/usr/lib/python3.10/", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v', '-j', '16']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/zyhuang/flash-CUDA/flash-attention/", line 288, in <module>
  File "/usr/local/lib/python3.10/dist-packages/setuptools/", line 103, in setup
    return distutils.core.setup(**attrs)
  File "/usr/lib/python3.10/distutils/", line 148, in setup
  File "/usr/lib/python3.10/distutils/", line 966, in run_commands
  File "/usr/local/lib/python3.10/dist-packages/setuptools/", line 989, in run_command
  File "/usr/lib/python3.10/distutils/", line 985, in run_command
  File "/usr/local/lib/python3.10/dist-packages/setuptools/command/", line 84, in run
  File "/usr/local/lib/python3.10/dist-packages/setuptools/command/", line 132, in do_egg_install
  File "/usr/lib/python3.10/distutils/", line 313, in run_command
  File "/usr/local/lib/python3.10/dist-packages/setuptools/", line 989, in run_command
  File "/usr/lib/python3.10/distutils/", line 985, in run_command
  File "/usr/local/lib/python3.10/dist-packages/setuptools/command/", line 167, in run
    cmd = self.call_command('install_lib', warn_dir=0)
  File "/usr/local/lib/python3.10/dist-packages/setuptools/command/", line 153, in call_command
  File "/usr/lib/python3.10/distutils/", line 313, in run_command
  File "/usr/local/lib/python3.10/dist-packages/setuptools/", line 989, in run_command
  File "/usr/lib/python3.10/distutils/", line 985, in run_command
  File "/usr/local/lib/python3.10/dist-packages/setuptools/command/", line 11, in run
  File "/usr/lib/python3.10/distutils/command/", line 109, in build
  File "/usr/lib/python3.10/distutils/", line 313, in run_command
  File "/usr/local/lib/python3.10/dist-packages/setuptools/", line 989, in run_command
  File "/usr/lib/python3.10/distutils/", line 985, in run_command
  File "/usr/local/lib/python3.10/dist-packages/setuptools/command/", line 88, in run
  File "/usr/lib/python3.10/distutils/command/", line 340, in run
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/", line 865, in build_extensions
  File "/usr/lib/python3.10/distutils/command/", line 449, in build_extensions
  File "/usr/lib/python3.10/distutils/command/", line 474, in _build_extensions_serial
  File "/usr/local/lib/python3.10/dist-packages/setuptools/command/", line 249, in build_extension
    _build_ext.build_extension(self, ext)
  File "/usr/local/lib/python3.10/dist-packages/Cython/Distutils/", line 127, in build_extension
    super(build_ext, self).build_extension(ext)
  File "/usr/lib/python3.10/distutils/command/", line 529, in build_extension
    objects = self.compiler.compile(sources,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/", line 678, in unix_wrap_ninja_compile
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/", line 1590, in _write_ninja_file_and_compile_objects
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/", line 1933, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension

I get this error… The real cause turned out to be that I had written `BlockIdx.x` in my code; after I changed it to `blockIdx.x`, the build succeeded. But how could I have found this from the error output? I certainly could not spot the bug at once… Could someone kindly help me? Thank you!!!

The world of computer science is characterized by the fact that there are often many ways to do something, so a single description of a topic like debugging or compiling is unlikely to cover every case.

Coupled with that, I would say it helps when you understand the tools you are using. Of course it’s very difficult to understand every tool used in the assembly of a complex software stack like a modern deep-learning (AI) software stack. But sometimes more familiarity helps.

Coupled with that, some tools are designed to hide the details from you, which might be a good thing in the general case, but is often an impediment when things go wrong.

Let’s take a look. Often the beginning and the end of an error sequence are most instructive.

It helps if you happen to know (or did some searching) that ninja is a widely used build (i.e. compiler) management/accelerator tool. But even if you don’t, if you are working with CUDA, hopefully you recognize that the /usr/local/cuda/bin/nvcc at the start of that very long command line is invoking the CUDA compiler. Even if you had missed all those clues that this is a compilation failure you are looking at, the very end of your output says the same thing:

RuntimeError: Error compiling objects for extension

Now let’s go a bit farther:

subprocess.CalledProcessError: Command '['ninja', '-v', '-j', '16']' returned non-zero exit status 1.

Again, this may just be a matter of experience, but that line is a tip-off that this tool is hiding the output of the subcommand (ninja, which is calling nvcc) and just giving you the summary statement that “it returned a non-zero exit status”, which is a side effect of the fact that the compilation failed.
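That summary-only behavior is not specific to any particular build tool; it is simply how Python’s standard subprocess module reports a failed child process. A minimal sketch, using only the standard library (nothing from the actual build stack):

```python
import subprocess
import sys

# Mimic what the build tool does: run a child process that fails.
# check=True makes run() raise CalledProcessError on a non-zero exit.
try:
    subprocess.run([sys.executable, "-c", "raise SystemExit(1)"], check=True)
except subprocess.CalledProcessError as e:
    # The exception carries only the command and its exit status --
    # nothing about *why* the child failed ends up in the traceback.
    print(e)
```

Whatever the child actually printed about its failure went wherever its output was directed earlier in the build; the exception at the bottom of the traceback only knows that the exit status was non-zero, which is exactly the pattern you see in the output above.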

So we now know that in the process of trying to run or build whatever you were trying to build, ninja/nvcc compiled something, and that compilation failed. That is almost never a good thing. But the problem is that the tool (some pythonic build tool) is hiding the actual output from nvcc, which would have been very useful at this point.

These build tools are great when they work, but when you are working with people on forums who are having trouble, and who don’t know enough (or don’t know how) to ask CMake or Visual Studio or the build tool here for verbose output, this hiding gets in the way of speedy problem resolution. For someone like me, who spends some time trying to help other people on these forums, it means I learn to really love CMake and Visual Studio, especially; I mostly just skip those questions. (In my view, if you are a beginner in CUDA, and you are also not that familiar with CMake – you almost certainly don’t know all the complexities of CMake/CUDA integration – then you are not really helping yourself by trying to use CMake, but I digress.)

So once we have realized that a compilation failed, but we don’t know why, we usually need to get more info. At least in this particular case we are told the command that failed: it is the long nvcc invocation at the top of the output.

So now it will be helpful if you understand nvcc syntax and command options. Because if the compilation command failed, it is often the case that the file being compiled has an error in it (there can be other problems as well, such as misconfigured switches passed to nvcc by the tool itself). So as a first order of business we want to try to identify the source files being compiled in that compilation command that failed. But more generally there may be several things to do at this point:

  • Identify the source file(s) being compiled (in this case I can see fairly quickly, by scanning that nvcc command, that the file lives in /home/zyhuang/flash-CUDA/flash-attention/csrc/flash_attn/src/), and study those for errors. This is not the optimal way to do it; the optimal way is to actually compile the file and look at the errors that are actually generated.
  • Look for obvious misconfiguration of switches in that nvcc command line.
  • See if you can get the command reissued: manually copy that command and issue it yourself from the command line (rather than letting it spit out from the build tool), or try compiling the file yourself somehow, with nvcc.
  • See if you can learn how to tell the build tool that you want “verbose” output, and ask for that. When you rerun your build command, and it gets to the point of the failure, it will then likely show you the actual output from nvcc, which will almost certainly be very useful.

That last option is often the most effective, but requires a little effort to learn about the build tool you are using.
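If you go the manual route instead (copying the failing command and reissuing it yourself), a throwaway script can help with the bookkeeping on a command line that long. This is just a sketch of my own (the nvcc_files helper is hypothetical, not part of any tool), using a crude heuristic rather than a real nvcc argument parser:

```python
import shlex

def nvcc_files(cmd):
    """Fish the source files and -o outputs out of an nvcc command line.

    Crude heuristic: anything ending in .cu/.cpp is a source file, and
    the argument following each -o is an output file.
    """
    sources, outputs = [], []
    args = shlex.split(cmd)
    i = 0
    while i < len(args):
        if args[i] == "-o" and i + 1 < len(args):
            outputs.append(args[i + 1])
            i += 2
            continue
        if args[i].endswith((".cu", ".cpp")):
            sources.append(args[i])
        i += 1
    return sources, outputs

# A shortened stand-in for the long command line in the build log:
print(nvcc_files("nvcc -I/usr/local/cuda/include -O3 -c kernel.cu -o kernel.o"))
# (['kernel.cu'], ['kernel.o'])
```

Once you know which source file is being compiled, you can recompile just that file by hand with the same nvcc flags and read the compiler’s own error messages directly.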

When people use those build tools that hide important info, and they don’t seem to want to learn how to ask for verbose output, I generally just skip those questions.


Hi @Robert_Crovella, sorry for reviving an old thread. I ran into the same issue recently, and I just wanted to say thank you for spending the time to explain the whole strategy of debugging an issue. As a beginner in CUDA (and programming in general), I found your post very helpful.

Thanks again!