So I tried my best to apply __restrict__ to every base pointer of my input and output structs, yet nvcc still generates 2 loads and 2 stores (example here: Compiler Explorer).
If I simply change my kernel to not use structs of arrays (SoAs) and instead pass __restrict__-decorated base pointers directly as kernel arguments, everything works as expected (1 load, 1 store) (example here: Compiler Explorer).
So what do I need to change so that __restrict__ is inherited by the base pointers of my structs? :)
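To make the setup concrete, here is a minimal sketch of the two variants (this is not the exact code behind the Compiler Explorer links; all names are made up):

```cpp
// Hypothetical reconstruction, not the original Compiler Explorer code.

struct Buffers {
    float* __restrict__ in;   // __restrict__ on struct members ...
    float* __restrict__ out;  // ... does not seem to reach the optimizer
};

// Variant 1: pointers wrapped in a struct. nvcc acts as if b.in and b.out
// may alias, so b.in[i] is reloaded after the first store and the first
// store cannot be eliminated: 2 loads, 2 stores.
__global__ void scale2_soa(Buffers b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        b.out[i] = b.in[i];
        b.out[i] += b.in[i];
    }
}

// Variant 2: plain __restrict__ kernel arguments. The no-alias guarantee
// lets the compiler fold this into out[i] = 2 * in[i]: 1 load, 1 store.
__global__ void scale2_plain(float* __restrict__ out,
                             const float* __restrict__ in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i];
        out[i] += in[i];
    }
}
```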
I am not a language lawyer, and restricted pointers are not part of standard C++. They are, however, part of the ISO-C standard. My assumption is that the language extension __restrict__ in CUDA is modeled on ISO-C restrict specifications. It would probably be a good idea to search the CUDA documentation with a fine-tooth comb to check what, if anything, it has to say on the subject.
By my understanding of the semantics defined in the C standard, what you are observing is expected behavior. I would welcome a clarifying in-depth assessment from a C or compiler specialist, regardless of whether it confirms or refutes what I stated above. If you think you are able to make such a determination yourself based on a close reading of the ISO-C standard, you might want to consider filing an enhancement request with NVIDIA.
So I tried locally defined restrict pointers. Interestingly, defining them at the outermost scope of the kernel results in the “correct” optimization, while defining and using them inside a nested scope breaks the optimization, for no reason that is obvious to me. Example here: (I cannot post links directly, so it’s “godbolt .org /z/WdjxWd”; define/undefine SCOPE for the two combinations). Restrict pointers defined at the outermost scope can also be used inside nested scopes, and the optimizer then still works as expected.
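A rough paraphrase of the two combinations (again, not the exact godbolt code, just the structure):

```cpp
// Sketch of the SCOPE experiment; names are made up.

struct Buffers { float* in; float* out; };

__global__ void scale2(Buffers b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
#ifdef SCOPE
    {   // restrict pointers defined and used inside a nested scope:
        // the expected optimization is lost
        float* __restrict__ out = b.out;
        const float* __restrict__ in = b.in;
        if (i < n) { out[i] = in[i]; out[i] += in[i]; }
    }
#else
    // restrict pointers defined at the outermost scope of the kernel:
    // nvcc optimizes as expected (1 load, 1 store)
    float* __restrict__ out = b.out;
    const float* __restrict__ in = b.in;
    if (i < n) { out[i] = in[i]; out[i] += in[i]; }
#endif
}
```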
Why is that? It works for me. Either select the link text and click the “chain link” button at the top of the input window (fourth from the left), or use Stack Overflow-style markup: [description](url).
As I recall, the semantics of type modifiers/qualifiers (const, volatile, restrict) applied to struct members are a bit non-intuitive. I learned that the hard way and have since avoided using them in this capacity.
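To give one example of what I mean (plain standard C++ behavior, nothing CUDA-specific; names are made up):

```cpp
// A const member silently makes whole-struct assignment ill-formed,
// because the copy-assignment operator is implicitly deleted.
struct Record {
    const int id;
    float*    data;
};

void update(Record& dst, const Record& src)
{
    // dst = src;  // would not compile: Record's copy assignment is deleted
    (void)dst; (void)src;
}
```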
Since __restrict__ is a vendor extension to standard C++, you could always file an enhancement request with NVIDIA to have the semantics extended the way you think they should work.
I see. That would seem to be a security feature of these new NVIDIA forums to protect against spammers who create throw-away accounts, similar to the mechanisms used by Stack Overflow, which restricts the embedding of links and images for new users. After you use the site for a while, such restrictions should presumably be lifted, although the rules controlling this are probably not publicly documented.
So clang trunk (on godbolt) does NOT perform the “correct” optimization in either case (outer-scope and nested-scope pointers), whereas nvcc at least does the right thing with outer-scope restrict pointers. Yes, it’s not well defined, etc., but merely putting a scope around a basic block and getting a totally different result is surprising to me… :) I would just like to know where to report this issue/inconsistency.
As I said, restrict is well defined in ISO-C, and ISO-C only. Since it is not part of C++, the semantics of any existing vendor extension for restricted pointers in C++ compilers are likely similar but not necessarily identical to the standard C specification of restrict, and probably also subtly different from each other. I take what I consider the safe route (the smallest common subset), which in my experience means using it only for non-aggregate function arguments, i.e. function arguments that are simple pointers.
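A sketch of what I mean by that safe subset (names are illustrative):

```cpp
// __restrict__ only on plain pointer parameters; no restrict-qualified
// struct members and no restrict-qualified locals.
__global__ void axpy(int n, float a,
                     const float* __restrict__ x,
                     float* __restrict__ y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];  // compiler may assume x and y do not overlap
    }
}
```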
If you want, you can file an enhancement request with NVIDIA to clarify the __restrict__ semantics in the CUDA documentation.
Sorry, that is not how it works. This forum is primarily a platform NVIDIA set up so users can help users. It is not really designed as a venue for sending bug reports and enhancement requests to NVIDIA. Every now and then someone from NVIDIA might poke their head in here, and they might take up an issue and file a bug report or enhancement request internally.
But: (1) one cannot count on that happening, and (2) NVIDIA’s bug database treats all issues as confidential, so only the filer and the relevant NVIDIA personnel can see them.
You would want to file your own request so you can track its status. Use the regular bug reporting form (there should be a sticky/pinned note at the top of this forum about filing bugs) and prefix the synopsis with “RFE:”, which stands for “request for enhancement”, so it can easily be distinguished from reports of functional bugs. I have not reported an issue in a while; maybe the bug reporting form now also provides a mechanism that lets users directly mark an issue as an RFE.
That likely means your report triggered a malfeasance filter in the reporting form. For years now, NVIDIA appears to have set the trigger level on that filter to insanely strict settings. So as not to give any information away, it presents only a generic error message when the filter is triggered. Frustrating, I know.
A common strategy to deal with that is to file a very rudimentary report first. Like a synopsis and a two-sentence description, with a comment that you’ll add more info later. In particular, avoid adding links or anything that looks like HTML to the initial report. Then come back to add additional information bit by bit. Yes, it is a pain in the neck.
Sometimes there are also technical issues of an unspecified nature affecting the bug reporting form.
FWIW, I noticed that your first post in this thread currently shows up for me as “flagged by community”. I doubt that anybody from the community has flagged it, as I see nothing objectionable in it. I suspect that it was auto-flagged by the AI (e.g. “a link in an initial post means suspicious”). I hope that this has not put your account as a whole on the bad side of NVIDIA’s machinery.