Thank you for the thorough reply.
I have a few comments on the CPU C++ error below.
First, I am not entirely sure what is meant by alignment, or rather I am now confused. I thought threads could read any address, and that satisfying every thread in a warp might simply require multiple reads, making this purely a performance concern; that is what I make of section F.5.2 in the CUDA C Programming Guide. After reading your reply, I looked in the same guide and found section 5.3.2, which talks about naturally aligned transactions. Here is what I gathered from both sections: the GPU reads memory at addresses that are multiples of 32, 64, or 128 bytes, and it can coalesce the reads of a single warp's threads into one transaction if the first thread accesses data at such an aligned address; otherwise more transactions have to happen. What confuses me is why this is a correctness issue and not simply a performance issue, since the GPU will read the necessary data (possibly more) and throw away the data the threads don't need. I would be delighted if you could explain the details I am missing.
In any case, the data I am accessing is not global memory, but shared memory.
I checked the error code on the next CUDA API calls, and indeed got an unknown error (30). In fact, it turns out these CUDA API calls are what generate the CPU C++ error messages, so the only worry is what is happening on the GPU.
Now, to address your two points:
1) All warps that enter the section and execute the problematic function generate the misaligned-access error message. I do not know what you mean by the register view or variable view, but in the warp watch I checked all inputs to the function and everything is normal. I checked the disassembly, and the instruction(s) throwing this exception are the following (the first line is the actual signature):
inline __device__ real limiting (Limiter phi, real* main_wave, real* aux_wave)
0x0002e2b0  func_begin24: // < problematic line?
0x0002e2f0  mov.b32 %r1, _Z8limitingILi3E10limiter_MCEfT0_PfS2__param_0;
Now, I do not know what that instruction means; I assume it signals the start of the function or something. It could be this line or the second, but the debugger gives the message while trying to execute the first line.

2) I didn't know I could run the debugger on release code (which I assume means code built without GPU debug info)! So I started CUDA debugging without GPU debug info, and I came across something that I couldn't explain. I will try to describe things here:
When I tried GPU debugging without debug info, I received the same message; however, the program then exited with an access violation in some other kernel. That other kernel is a reduction and, as far as I can tell, has nothing to do with the previous problem.
There are certain peculiar things about this problem with the reduction kernel. First, when I GPU debug with debug info, the problem does not happen, and I see that there are no access violations. Second, I need to give some details about the way I do the reduction:
For my test case, I fix the block size to 512 and launch only a single block, since the data to reduce is not extremely large; in fact it can be smaller than the block size. I therefore split the part of the kernel where threads bring data into shared memory into two sections with an if statement [if (size > blockSize)]: one for data sizes larger than the block size, and one for smaller sizes. My test data gives a size of only 16 (as computed by hand and as seen through the GPU debugger), so threads do not enter the first section in my test. I verified this first by checking with the GPU debugger that no threads enter the first section, and second by putting bogus data in it that would certainly disrupt the result, which did not happen. Yet if I comment out the statements inside the if statement (so the kernel becomes essentially a reduction for sizes smaller than the block size), the messages about misaligned access and the access violation disappear. So removing a section of code that does not execute somehow fixed the issue…
The memory allocated for the data to be reduced is exactly the size of the data (16 floats in this case); if I allocate memory for 512 or more floats, the messages also disappear. I made sure to check that no threads access data outside the active size; I confirmed this with the debugger.
I turned off the memory checker, and I still received the messages when GPU debugging with GPU debug info, but no messages when debugging without debug info (whereas with the checker on, it does give messages in that configuration).
I do not know what to make of this. My code seems to be behaving well according to the debug info, and yet somehow, something somewhere is failing.
Sorry for the long post, I really thank you for taking the time to read it, and for helping me understand some of these bizarre happenings. Cheers!