nvcc: C99 standard in CUDA frontend?

It appears that at least some C99 features are not supported in a .cu file. For example, passing a 2d array to a function. According to http://stackoverflow.com/questions/6862813/c-passing-a-2d-array-as-a-function-argument this is just fine in C99, but when I try it in my .cu file, I get “error: a parameter is not allowed”, even if I pass “-std=c99” to the compiler. So, is there some list of what level of C is supported somewhere?

I can pass a 2D array to a function. A fully worked 3D example is here:

http://stackoverflow.com/questions/14920931/3d-cuda-kernel-indexing-for-image-filtering/14926201#14926201

CUDA does not officially conform to C but claims to be compliant to C++:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#c-cplusplus-language-support

As txbob states, CUDA is a dialect of C++, not C. There are numerous small differences between the ISO language standards for C++ and C. When in doubt one would want to compare to the C++ standard.

Historically, CUDA initially started out as a subset of C, with C-style interfaces, a math library based on C99, and so on. However, due to popular demand for C++ features, the tool chain switched to a C++ frontend fairly early on, although I don’t recall exactly at what version; I think it may have occurred after CUDA 2.0, around 2008?

txbob, thanks for your reply, but your example is passing a 2d array with global constants for the second dimensions. I’m trying to pass a 2d array with variable dimensions. I’m not as familiar with newer C++ standards, but I would imagine that if it’s possible in C, then it’s possible in C++, so I would still have the same issue as to why it’s not possible with CUDA, and if there’s some documentation as to what part of the C++ standard isn’t implemented with CUDA.

Variable-length arrays (VLAs) are indeed a C99 feature; they were not part of earlier C standards (e.g. C89):

https://gcc.gnu.org/onlinedocs/gcc/Variable-Length.html

AFAIK C99 VLAs are not officially part of any C++ standard (at least up through C++14):

http://stackoverflow.com/questions/1887097/variable-length-arrays-in-c
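Since VLA parameters are not available, one common way to pass a 2D array with runtime dimensions in C++ is to pass a pointer to a flat row-major buffer together with the dimensions, and do the index arithmetic by hand. This is only a minimal sketch: the names sum_matrix and demo_sum are hypothetical, and it is plain host C++ rather than device code (the same indexing idea should carry over to a kernel).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: sums a rows x cols matrix stored row-major in one
// contiguous buffer. The dimensions travel as ordinary arguments, so no
// variable-length array type is needed in the signature.
int sum_matrix(const int *data, std::size_t rows, std::size_t cols) {
    int total = 0;
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            total += data[r * cols + c];  // element (r, c)
        }
    }
    return total;
}

// Hypothetical demo: builds a rows x cols matrix whose entry (r, c) is r + c
// and returns the sum of all entries. Both dimensions are runtime values.
int demo_sum(std::size_t rows, std::size_t cols) {
    std::vector<int> m(rows * cols);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            m[r * cols + c] = static_cast<int>(r + c);
        }
    }
    return sum_matrix(m.data(), rows, cols);
}
```

The price is that a[r][c] syntax is lost; a small inline accessor function can hide the r * cols + c arithmetic if that matters.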

Actually this information is contained in the link I already provided:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#c-cplusplus-language-support

At that point in the CUDA programming guide, the following information is presented:

  1. The official named C++ standards that the CUDA device compiler claims compliance to. (section E)
  2. Restrictions, i.e. deviations from the named standard (section E.2)

All of section E is probably relevant reading, especially if you have an interest in new features in C++11.

Note that for questions pertaining to purely host code constructs, the CUDA compilation tool chain usually provides whatever support the host compiler provides. The named standard adherence and restrictions apply to the CUDA device code compiler, and operations that impact generation of device code.

Also note that (read the gcc VLA doc link) gcc/g++ or other host compilers may provide support for features that are not officially part of a particular standard. This behavior can sometimes be modified with particular compiler switches (e.g. -pedantic).

As a simple demonstrator, consider the following, compiled with g++ 4.8.2:

$ cat vla.cpp
#include <stdio.h>
#include <stdlib.h>

void foo(int n) {
    int values[n];
}

int main(int argc, char *argv[]){

  int n = atoi(argv[1]);
  foo(n);
  return 0;
}
$ g++ -std=c++11 vla.cpp
$ g++ -pedantic -std=c++11 vla.cpp
vla.cpp: In function ‘void foo(int)’:
vla.cpp:5:17: warning: ISO C++ forbids variable length array ‘values’ [-Wvla]
     int values[n];
                 ^
$

ISO C++ is definitely not a pure superset of ISO C. The two languages evolved and continue to evolve separately, but the relevant standards committees are trying to avoid gratuitous divergence. Besides VLA, examples of features found in C but not C++ are the ‘restrict’ modifier, and hexadecimal floating-point literals (e.g. 0x1.0p2f).

As txbob points out, C++ compilers sometimes support C features as a proprietary extension: For example, CUDA and other C++ toolchains support ‘restrict’ with semantics equivalent to C’s ‘restrict’ modifier.
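A minimal sketch of that extension, assuming the __restrict__ spelling, which GCC, Clang, and nvcc accept (MSVC spells it __restrict). The function name add_into is hypothetical. The qualifier is a promise from the programmer, not something the compiler checks.

```cpp
#include <cstddef>

// Adds b into a element-wise. The __restrict__ qualifiers promise the
// compiler that the two ranges never overlap, which can allow it to keep
// values in registers or vectorize the loop. Calling this with overlapping
// ranges is undefined behavior.
void add_into(float *__restrict__ a, const float *__restrict__ b,
              std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        a[i] += b[i];
    }
}
```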

When will CUDA have proper c99 support?
This is a major flaw.

My guess would be: never. For the past decade or so, CUDA has used a C++ toolchain. To my thinking, that was one of the design decisions that made CUDA vastly more popular than other parallel programming environments. When not programming GPUs, I still write C99 code almost exclusively, but the reality is that most of the software world has moved on from C, i.e. there is no market there.

Meanwhile, the C++ standard (in C++11 and C++17) has absorbed many (but not all) features of C99 including the hexadecimal floating-point literals I had personally been missing.
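For readers who have not met them, here is a small sketch of hexadecimal floating-point literals. It assumes a compiler running in C++17 mode (e.g. -std=c++17 with g++); the constant names are made up for illustration.

```cpp
// Hexadecimal floating-point literals: a hex mantissa, then 'p' and a
// power-of-two exponent written in decimal. Standard in C99 and in C++17.
constexpr float  four  = 0x1.0p2f;  // 1.0 * 2^2
constexpr double three = 0x1.8p1;   // 1.5 * 2^1
constexpr double half  = 0x1p-1;    // 1.0 * 2^-1

// All three values are exactly representable, so exact comparison is safe.
static_assert(four == 4.0f, "0x1.0p2f should equal 4.0f");
static_assert(three == 3.0, "0x1.8p1 should equal 3.0");
static_assert(half == 0.5, "0x1p-1 should equal 0.5");
```

The attraction is that the literal spells out the exact bit pattern, so there is no decimal-to-binary rounding to reason about.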

This is causing us such a headache that, honestly, if there was another GPU programming option out there that was purely C—we’d be using it.

Sadly, this was just a (poor) tool chain decision and not something intrinsic to CUDA and the GPU.

I am curious: What major adjustments would you have to make to transform your code base into valid C++? For a C-based GPU programming environment, you could look at OpenCL (not that I would recommend it).

As for causing you major headaches, this is how someone explained software market forces to me years ago: “Norbert, you are not a market!”.

The reality is that C++ is what a large percentage of CUDA users wanted, so NVIDIA was smart to deliver that. Clearly the right decision in terms of securing market share and establishing CUDA as the premier parallel programming environment.

Personally, I find the availability of templates most refreshing after kludging around in C to achieve similar functionality with much poorer maintainability.

It doesn’t address the CUDA topic here, but PGI OpenACC can be used for GPU acceleration and supports C VLA’s. It can be used with “pure C” coding styles.

I’m not sure if the C99 disconnect here is VLA (as in the earlier part of this thread) or some other thing.

Implementing support for VLA in a CUDA per-thread stack frame would be a substantial effort. The VLA itself is cumbersome to handle in a frame that otherwise has a known length at compile time, and in addition there would have to be some engineering to arrange for the beneficial effect of access ordering that happens in concurrent per-thread access to the “stack” (local memory space accesses, for efficiency). Having VLA across threads would add a considerable layer of complexity. When we also consider the infinite variety of ways that VLAs could be encountered across threads, sorting this out across threads may be intractable.

A commonly used alternative in CUDA C++ is to move such arrays into a global memory allocation made from the host (e.g. with cudaMalloc), and have the device code access global memory instead of local memory for these arrays. This kind of refactoring can still allow for:

  • variable length
  • per-thread spaces
  • efficient (i.e. warp-coalesced) access

Since large scale usage of local memory can also be a capability concern, this sort of refactoring may have other benefits as well, depending on the code.
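The indexing behind such a refactoring can be sketched as follows. This is a host-only C++ stand-in under stated assumptions, not device code: the names slot and run_threads are hypothetical, the std::vector plays the role of a buffer that would come from cudaMalloc, and the outer loop over tid stands in for the threads of a kernel (where tid would typically be derived from blockIdx, blockDim, and threadIdx).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Index of element i in the private array of thread tid, when all per-thread
// arrays share one buffer in interleaved order. Threads with consecutive tid
// values touch consecutive addresses for the same i, which is the access
// pattern that coalesces across a warp on a GPU.
std::size_t slot(std::size_t i, std::size_t tid, std::size_t num_threads) {
    assert(tid < num_threads);
    return i * num_threads + tid;
}

// Host-side stand-in for a kernel: every "thread" fills its own array of
// runtime length len with tid * 100 + i, then sums it back.
std::vector<long> run_threads(std::size_t num_threads, std::size_t len) {
    std::vector<long> buffer(num_threads * len);  // one allocation for all
    std::vector<long> sums(num_threads, 0);
    for (std::size_t tid = 0; tid < num_threads; ++tid) {
        for (std::size_t i = 0; i < len; ++i) {
            buffer[slot(i, tid, num_threads)] =
                static_cast<long>(tid * 100 + i);
        }
        for (std::size_t i = 0; i < len; ++i) {
            sums[tid] += buffer[slot(i, tid, num_threads)];
        }
    }
    return sums;
}
```

Here len is an ordinary runtime value, each thread only ever touches its own slots, and the stride between a thread's successive elements is num_threads rather than 1.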

Honestly, the (lack of) VLA is not the issue, and I can see why this might be difficult to implement on the GPU and it is easy enough to work around.

That said, we have decades of code and developer applications leveraging our C-based libraries. Converting those (e.g., removing all the fields that have used ‘class’ as an element) is a daunting task and extremely disappointing because it is not something I have the heart to do.

Almost all of the APIs that have withstood the test of time have avoided C++ because C (due to the limits/simplicity of the language) tends to remain transparent (and easily documented), whereas C++ tends to be a convenience for the individual programmer but hellacious for anyone other than the author to follow, requiring complex hypertext-enabled tools like doxygen to unravel. Overloading alone (of anything) is perhaps the most potent weapon in your arsenal if your goal is to confuse your peers, or your future self.

For the most part, programmers in the field of science and technology just want each other’s algorithms with as little as possible of the baggage of other authors’ preferred structure (i.e., …please, please, do NOT tell me about your classes, I will invariably have my own ideas about program structure that suits my application much better). That is why the tent-pole APIs like the UNIX/Linux kernel, GIMP/GTK+, GNU Compiler Collection, OpenGL (prior to its final nail-in-the-coffin revision), and so many others embraced the simplicity and multi-user maintainability of C and ignored the “squeaky-wheel” protests of ardent C++ feature-seekers. If popularity were the only measure of “good”, then I and many others would have abandoned all other operating systems in the late 1990s/early 2000s and succumbed to Microsoft’s hegemony by switching to Windows XP and its “popular” language of choice (C++). I strongly suspect CUDA’s popularity has more to do with its ability to unleash the underlying GPU capabilities than the veneer of C++ slapped on top of it a decade after its advent. More likely it was the pet peeve of the next-gen developers who inherited the responsibility to maintain the API after the inventors had left the scene (e.g., OpenGL).

I do not suggest that NVIDIA, with its vast resources, could not have developed a parallel tool-chain for those end-users that absolutely could not manage without the personal convenience of C++ features, but I only wish they had been wise enough to preserve the venerable (and historically unequaled) C-only API, for those of us developers who strive for simplicity and are not interested in climbing aboard the slippery-slope towards unreadable code that is C++. There are many measures of quality, and not all of them begin and end with “more features” or “most popular”.

Instead, I have started down the CUDA development path, armed with NVIDIA’s book by Cheng (which is beautifully written, BTW) entitled “Professional CUDA C Programming”, only to find out that the post-2012 dev team has turned its title into a lie. Sadly, over the past four decades, I have witnessed that the innovators invariably start out writing in C, only to have the next wave of developers’ first and only contribution be to rewrite it in the popular language du jour. (This actually happened three (!) times with MathWorks’ MATLAB.) Bucking the trend, I was actually very encouraged to see the new NVIDIA OptiX 7.0 API, which seemed to have been rebuilt with a sensible stripped-down, back-to-basics mindset, only to find that CUDA is now the show-stopper.

Thanks to the posters (Robert and Njuffa) who have pointed me towards OpenCL and OpenACC. Perhaps I can find some solace there.

If you think I am nuts, just consider your own discomfort when the next version of the CUDA tool-chain requires you all to port your GPU code to Python. Perhaps then you too would start to expound the wisdom of multi-language CUDA support.

I understand where you are coming from, given that I still program a lot in C99 myself and have in the past worked quite a bit with low-level APIs, especially in the embedded space. For the record: I do not think you are nuts :-)

But I also know that many whose primary task is not programming (various scientists, for example) highly value the higher level of abstraction of something like the Thrust library, which in turn fundamentally requires C++. There are many domain experts with no prior experience in parallel programming who have been able to put together working GPU-accelerated prototypes within a few weeks by using, say, Thrust and Python, with the resulting code running at 10x the speed of their previous solution. This makes for happy and excited customers. Happy and excited customers become repeat customers, which translates into increased sales.

The vast resources of NVIDIA do not look nearly as vast when one considers the extensive and still growing software ecosystem NVIDIA has created for GPUs. I used to work on CUDA for NVIDIA (2005-2014) and always felt we were a bit short on resources, which frankly is par for the course in industry settings in my experience: the goal is to run “lean and mean”. As a consequence, all feature requests (including platform or language support) have to be prioritized, and when operating within a for-profit business, the question “How would adding this feature help us sustain or increase sales?” always looms in the background as part of the process. Some undeniably useful features will never make the cut.

Whether C99 support falls into that category, I do not know (although I strongly suspect that is the case), but you may want to consider filing a feature request with NVIDIA. The bug reporting form is the venue for doing so; simply prefix the synopsis with “RFE:” to mark it as an enhancement request.

Thank you njuffa. I will make the request.

Is there a special category for “restoring features lost”? :)

This would be considered an “enhancement” request. Let’s not argue about whether or not it is an enhancement. It’s just a category to collect certain types of requests. As long as you use the word “enhancement” in the description, that should be sufficient.

Can you enumerate a list of problems/concerns?

  • VLA
  • usage of the C++ keyword “class”
  • ?

Re #15: The C support CUDA had in the first couple of years of its life wasn’t proper C99 support, so this is not really a request to restore a lost feature.

One last question please, for njuffa: Other than RTX/OptiX 7 development (which is CUDA-only, and in which we have already invested several weeks unwrapping the C++ classes from the SDK and getting a helloOptix.c (C99) demo to compile, with the exception of some C++ incompatibility type-casting bugs in the optix stub-headers), OpenCL looks to be a good path forward for GPU integration for us since it is C-based and not C++, i.e. we wouldn’t continually be swimming against the tide. Nothing is scarier than putting thousands of hours into a project only to find you are a corner-case user that is not supported by an organization/code-base over which you have no control.

Why (as you said) wouldn’t you recommend OpenCL? Is it solely because OpenCL lacks, by design, some features that CUDA has? Or is the NVIDIA GPU OpenCL support lacking in some other crucial manner? I’d appreciate any insight on this you can share.

A small aside (which speaks a bit to your “Norbert” comment). When the MathWorks company explained to us around 2000 that “you are not a market” about addressing issues in their Real-Time Workshop product near the end of a decade-long aerospace project, we resolved never again to put our neck in their metaphorical noose (“fool me once…”). Since then, we have written a full (in-house) replacement for MATLAB/Simulink (in C99) that we use for flight-system development. Similarly, we’ve been using SGI/NVIDIA GPUs for even longer, explicitly because of their unwavering OpenGL support on UNIX/Linux. Both the mercurial Microsoft and Jobs’s Apple eventually found their way back to UNIX because it is multi-generational and does not chase market share, and the language of UNIX (since its inception) is C, not C++. I stand by my earlier comment: neglecting C support in the CUDA tool-chain was/is a novice mistake. Hopefully, the CUDA C99 SDK feature request will be honored (or OpenCL achieves across-the-board support), because “foundational stability” is a thing too, and for critical systems with a long-term view it undoubtedly trumps “more features” and “most popular”.

First off, when it comes to OpenCL, I am biased, with a capital ‘b’. I was part of the original engineering team that created CUDA. To me, OpenCL is a knock-off.

My very personal theory (call it a conspiracy theory if you have to) as to why OpenCL exists is that Apple created it because Apple wants full control of their platform (hardware and software), and they couldn’t control CUDA and didn’t want to tie themselves to NVIDIA hardware, but wanted a parallel programming environment for the GPUs in their systems. So an open standard created by Apple that initially closely matched CUDA as it was around 2007-2008 was the next best thing.

My observations: Apple has since completely lost interest in OpenCL. NVIDIA’s preferred platform is CUDA for obvious reasons, and OpenCL gets minimal support. Intel’s OpenCL support has been more complete but lacks quality. The only company consistently championing OpenCL since its creation has been AMD, but there are recent signs they may be more interested in pushing other technologies now.

To sum up, the picture that I see is that OpenCL is a dead end in the still dynamically evolving field of parallel programming platforms. That is not something I can recommend as a reasonably future-proof platform.

For a completely different perspective, you may want to look at the website of the Dutch company StreamHPC (formerly StreamComputing), who are big believers in OpenCL. They own https://opencl.org/.

Observation indicates that NVIDIA is generally responsive to customer desires; that is why the company has become increasingly successful. Filing well-motivated feature requests is the best way for the average user to influence feature development. Filing is needed so all feature requests can be collated and prioritized. While the filing of a feature request is typically a prerequisite to that feature materializing, it does not provide any guarantees that it will materialize.

For someone who is biased, you provided a very even-handed response. Thank you very much for your time and insight. I have come to pretty much the same conclusion with regard to OpenCL: it lacks a champion since Apple has begun its slow retreat from the desktop. It also seems to have many implementation issues similar to CUDA (with regard to C code).

As I get further into the OptiX 7 library I see a lot of ‘extern “C”’ protected code which shows a divided mindset at the core developer level as to what language is being used. I think that a good case can be made for a C-only (-std C99) flag for the (now misnamed?) nvcc compiler, and I will try to construct a cogent case for one in a feature request.
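For anyone following along, this is roughly what such guarded code tends to look like. It is a generic sketch and not taken from the OptiX headers; the vec3 type and vec3_dot function are invented for illustration. The guard lets one header serve both C and C++ translation units, with the declared functions given C linkage (no name mangling, no overloading) when a C++ compiler sees them.

```cpp
/* __cplusplus is defined only by C++ compilers, so a C compiler skips the
   extern "C" wrapper entirely and sees plain C declarations. */
#ifdef __cplusplus
extern "C" {
#endif

/* Hypothetical C-style API: a plain struct and a plain function. */
typedef struct vec3 { float x, y, z; } vec3;
float vec3_dot(const vec3 *a, const vec3 *b);

#ifdef __cplusplus
}  /* extern "C" */
#endif

/* Definition. In C++ it keeps the C linkage given by the declaration above,
   so C object files can link against it by its unmangled name. */
float vec3_dot(const vec3 *a, const vec3 *b) {
    return a->x * b->x + a->y * b->y + a->z * b->z;
}
```

In a real project the declarations would live in a header and the definition in a separate source file; they are shown together here only to keep the sketch in one piece.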

That seems to be at the heart of the matter: basically a CUDA C compiler is needed, whereas nvcc has become a C++-only compiler (nv++?). If it’s not going to be nvcc/NVIDIA, then perhaps the open-source community (self included) could step up and provide one under the umbrella of GCC, assuming it’s practical to reverse-engineer enough knowledge of CUDA’s proprietary binary (or PTX) GPU internals. Projects like that just take an idea and an itch to scratch, and I know for a fact that I am not the only one that likens C++ to an abrasive wool sweater to wear over a clean and simple C code-base.

Thanks again.