Compiling with debug flag gives errors while normal compilation work well

Hi, I have some code that compiles and runs just fine when the GPU debug flag is off, however if I want to debug and turn the flag on, the code compiles fine but when I run (not gpu debug) I get error messages like:

Microsoft C++ exception: cudaError_enum at memory location 0x0020f6ac…

Is this serious? The non debug compiled code works fine, as far as I can tell, as there are no messages and the visual results all make sense. When I gpu debug I get to points where I get the message

Nsight Debug
Detected HW exception on 1 warps. First warp:
blockIdx = {1,1,0}
threadIdx = {0,1,0}
Exception = Misaligned Address
PC = 0x0002a7a0
FunctionRelativePC = 0x000034a0

If I continue debug after the program stops at such a point, and keep continuing, I get the same visuals.

The misaligned part cannot be fixed, and in fact should not be a critical error. So what is happening?

Not aligning addresses could cause your kernel to access memory much slower than expected. For instance, if you have 16 threads in a half warp accessing aligned memory, that would typically cost 1 memory transaction. However, if they are misaligned, it could cost 1 memory transaction for each thread. The program may work fine for now, but you may run into serious performance issues later.

Hi, thanks for the reply. The misalignment should not give any correctness issues, like accessing illegal memory locations, and therefore the debugger should not give error messages for it. I understand your concern about the performance, but the way the implementd algorithm works, I believe this is the best way to do it, in fact, some experiments (though not thorough) supported my belief.

In fact, I think it is a problem with the Nsight debugger, here is a full description:
I have a kernel which has a few template arguments that determine a few function types that the kernel takes as arguments. One of these function is called inside the kernel in the fallowing way: a thread calls a device function where it prepares the argument for the function passed as an argument and then calls the function with the argument. So, two layers of function calls.
Let me simplify, kernel K has this signature K (T functionA), in K I call the device function DF with signature DF(T funcA), where I do a few things and use funcA(args).

With gpu debug flag on, the program gives error messages, and when I debug, the debugger stops at the DF function call, with the misaligned message. I decided to write the DF function inside the kernel instead of calling it. and suddenly no more error messages when I run, and debugger doesnt stop at any random point. Note that I tried both inlined and not inlined versions of all the functions and I keep getting errors when I use DF to call T functionA instead of calling it directly (which makes all error and warning messages disappear).

So what do you make of this? Is it a problem with the nsight debugger? or a compiler issue when compiling with gpu debug info? If so how does one report this?


I see. Your case is more complex, but generally this is a feature to ensure performance isn’t adversely affected. Have you tried disabling the Memory Checker?


In your initial post you have both a CPU C++ exception and a GPU misaligned exception.

My response is only with respect to the second GPU exception.

NVIDIA GPUs do not support misaligned memory accesses. All memory accesses must be to the operations natural alignment. A misaligned address is a correctness issue not a performance issue. The kernel will be terminated when this exception occurs and the error cudaError_t = cudaErrorUnknown = 30 will be returned by the next API call. Please make sure you check errors on all CUDA API calls.

I would debug the problem as follows:

  1. The Nsight VSE CUDA Debugger should set focus to the first warp with the misaligned exception. You should be able to find the misaligned address using source tips (mouse over), the variable view, or the register view. If this does not work then you can use View Disassembly and look at the register values for the specific instruction. In Nsight 3.0 RC the output message should have the address of the misaligned access as well as the size specifier. The memory address must be the natural alignment of the operation access type. The address must meet the natural alignment of the size specifier.

  2. The Nsight VSE CUDA Debugger can be used on release code as well as debug code. You can determine if the debugger or cuda-memcheck.exe can catch the same error on the release code.

If using the above steps you cannot attribute the error to a bug in your code then I recommend filing a CUDA Bug Report through the Registered Developer Program at Alternatively, you can post a minimal reproducible to the forum and someone on the development team will look at it tomorrow.


Thank you for the thorough reply.

I have a few comments on the CPU C++ error below.

First, I am not entirely sure what is meant by alignment, or rather I am now confused. I thought that threads can read any address, however to satisfy each thread in the warp, many reads would have to be done, and that’s the performance concern, this is according to what I make out from section F.5.2 in the CUDA C Programming guide. After reading your reply, I looked in the same guide and found section 5.3.2 talking about this naturally aligned transaction. Now, this is what I gathered from both sections: the GPU reads the memory at addresses of multiple of 32, 64 or 128, and it can coalesce the reads of a single warp’s threads into one if the first threads access the first thread access data at such an aligned address (multiple of 32, 64 or 128), otherwise more transactions have to happen. What confuses me is why this is a correctness issue and not just simply a performance issue, as the GPU will read the necessary data, possibly more, and throw away data the threads don’t need. I would be delighted if you would explain the details I am missing.

In any case, the data I am accessing is not global memory, but shared memory.

I checked the error message on the next CUDA API calls, and indeed got an unknown error, 30. In fact, it turns out these CUDA API calls are generating the CPU C++ error messages, so the only worry is what is happening on the GPU.

Now to address your 2 points:

  1. All warps that enter the section to execute the problematic function generate the misaligned error message. I do not know what you mean by register view, or variable view, but on the warp watch, I checked all input to the function, everything is normal, I checked the disassembly and the instruction(s) throwing this exception is the following: (the first line is the actual signature)
inline __device__ real limiting (Limiter phi, real* main_wave, real* aux_wave)

0x0002e2b0  [2104] func_begin24:  // < problematic line?

0x0002e2f0  [2109] mov.b32	%r1, _Z8limitingILi3E10limiter_MCEfT0_PfS2__param_0;

Now, I do not know what the instruction means, I assume it signals the start of the function or something, it could be this line or the second but the debugger gives the message trying to execute the first line.2) I didn’t know I could run the debugger on release code (which I assume is the code without gpu debug info on)! So I started CUDA debugging without GPU debug info, and I came across something that I couldn’t explain… I will try to describe things here:

When I tried gpu debugging without debug info, I received the same message, however the program exited with an access violation at some other kernel. The other kernel is a reduction, and has nothing to do with the previous problem (according to me).
There are certain peculiar things about this problem with the reduction kernel, first, when I gpu debug with debug info, the problem does not happen, and I see that there are no access violations, second, I need to give some details about the way I do the reduction:
For my test case, I fix the block size to 512, and launch only a single block as the data to reduce on isn’t extremely large, in fact it can be smaller than the block size, therefore I split the part of the kernel where threads bring data to shared into 2, using an if statement [if(size > blockSize)], one for data sizes larger than the block size, and one for smaller. My test data gives a size of only 16, (as computed by hand and as seen through the gpu debugger), so threads do not enter the first section in my test, this is verified first by checking if any threads go in the first section with the gpu debugger, and secondly, by putting bogus data that would certainly disrupt the result which did not happen. Well, if I comment out the statements inside the if statement (so the kernel becomes basically a reduction function for sizes less than the block size), the messages about misaligned access and access violation disappear. So, removing a section of code that does not execute, somehow fixed the issue…

The memory allocated for the data to be reduce on is exactly of the size of the data (16 floats in this case), if I allocated memory for 512 or more floats, then the messages also disappear. I made sure to check that no threads access data outside the active size, I confirmed this with the debugger.

I turned off the memory checker, and I still received messages when gpu debugging with gpu debug info, but no messages when debugging without debug info (whereas when it is on it does give messages).

I do not know what to make of this. My code seems to be behaving well, according to the debug info, however, somehow, something somewhere is failing.

Sorry for the long post, I really thank you for taking the time to read it, and for helping me understand some of these bizarre happenings. Cheers!

Okay, I have isolated one problem. I have the reduction function as the sole function in a simple program.

I did many tests, and here are the details of my tests and findings:

  1. I make a CPU array of 16 elements
  2. I allocate memory for an array on the GPU
    a) matching the CPU size
    b) more than the PCU array size (512 to be exact)
  3. I copy the CPU array into both GPU arrays
    a) for the larger array I make sure the 17th element is a number larger than all elements of the CPU array
  4. I run the reduction function on the GPU on both GPU arrays
  5. I print the result

[ Tried putting the code up here but the message wouldn’t post… ]

There are 4 ways in which I run the code:
(1) With GPU debug information, run CPU debugger
(2) With GPU debug information, run CUDA debugger
(3) Without GPU debug information, run CPU debugger
(4) Without GPU debug information, run CUDA debugger

With case (1), (2) and (3) I get no error messages whatsoever, and the results are as expected. With case (4) the memchecker gives access violation messages when executing the reduction on the exact sized GPU array but not for the larger one. If the code had something wrong it should have shown signs in the other cases specially (2).

So what is happening here?

Hi Gorune, I’ve sent you a private message.

Guys I know this is an ancient thread, but what is the point of the forum if the conclusions move off line?