Is CUDA C or C++?


The programming guide says that CUDA programming is based on a subset
of C. Meanwhile, I found some examples using C++ templates in CUDA code.

Where can I find a reference on the supported C++ concepts?

Greetings, Uwe

From what I understand, templates are the only C++ feature available in CUDA (and even that may be limited to some degree). I believe that NVIDIA has plans to support some more C++-type code in the future, but it may be some time before that is made available.


What’s new in CUDA 2.1

• C++ templates are now supported in CUDA kernels

operator overloading also works:

inline __device__ __host__ float3 operator *(float3 a, float3 b)
{
	// component-wise product
	return make_float3(a.x*b.x, a.y*b.y, a.z*b.z);
}


(from CudaMath.h in the SDK)

Sorry, I’m behind the times (still using 2.0).

Actually, a lot more works; it’s just not official.

The three big features that don’t work (and will be hard to support because of the GPU memory and execution model):


Dynamic memory allocation (new/delete)

Virtual functions (anything that starts getting deep into pointers!)


These are big holes, and they prevent most CPU C++ libraries from being usable on the GPU.

But much of the rest works!

Inheritance, complex templates, operator overloading, functors, etc.

In fact I often customize my kernels with a template, passing in a functor to change the entire behavior of part of the kernel’s processing loop.

This toy code shows working inheritance, functors, and templates.

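class Functor {
public:
	// callable from device code; must be public so the kernel can invoke it
	__device__ int operator()(int a, int b) {return a+b;}
};

template <typename F>
__global__ void
testKernel( float* g_idata, float* g_odata) 
{
  // local classes with inheritance
  class AAA {public: int a;};
  class BBB: public AAA {public: int b;};

  BBB bb;
  bb.b = 6;
  bb.a = 7;

  // construct the functor type F and apply it
  bb.a=F()(bb.a, bb.b);
}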
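
To make the "pass a functor into a templated kernel to change the processing loop" pattern a bit more concrete, here is a minimal sketch. All the names in it (HD, AddOp, MaxOp, foldRange, foldKernel) are made up for illustration and are not from the SDK. The HD macro is only there so the same file also builds with a plain C++ compiler, where the CUDA qualifiers don't exist; that part is my assumption about how you might want to test it, not something CUDA requires.

```cpp
// Hypothetical names throughout; a sketch, not SDK code.
#ifdef __CUDACC__
#define HD __host__ __device__   // nvcc: usable from host and device code
#else
#define HD                       // plain C++ compiler: qualifiers drop out
#endif

// Two interchangeable functors with the same call signature.
struct AddOp {
    HD float operator()(float a, float b) const { return a + b; }
};

struct MaxOp {
    HD float operator()(float a, float b) const { return a > b ? a : b; }
};

// The processing loop. Op is fixed at compile time, so the compiler is
// free to inline the call to op() as if it had been written by hand.
template <typename Op>
HD float foldRange(const float* data, int n, float init, Op op)
{
    float acc = init;
    for (int i = 0; i < n; ++i)
        acc = op(acc, data[i]);
    return acc;
}

#ifdef __CUDACC__
// Each thread folds its own contiguous chunk of `chunk` elements.
template <typename Op>
__global__ void foldKernel(const float* in, float* out, int chunk, float init)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = foldRange(in + tid * chunk, chunk, init, Op());
}
#endif
```

A launch would then pick the behavior through the template argument, something like foldKernel<MaxOp><<<grid, block>>>(d_in, d_out, chunk, init), where d_in and d_out are device buffers sized by the caller.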


SPWorley, you bring up a couple of questions:

  1. Why would this be so hard to implement? I wouldn’t think that most kernels divide up shared memory “asymmetrically”, that is, with certain threads accessing a larger part of the shared memory than others. Why not just take the shared memory, divide it by the number of threads in the block, and “allocate” a share to each thread? When new() or whatever is called, each thread constructs an object in its share of the memory. This approach should also automatically handle coalesced reads/writes from shared memory in most cases. If there is not enough memory available, throw an exception (see below).

  2. Exceptions. Why not just allocate a couple extra registers (or a tiny piece of shared/global/etc. memory) so that each thread is able to set an ‘exception bit’; using the warp vote function any(), the kernel could check to see if any threads have thrown an exception and stop execution. Across the multiple processors, each register/chunk of memory could be written to a global storage area, then have a few operations performed (retrieve the memory locations, then OR them all together, then check for nonzero) in a reduction-style bit of code; if nonzero, an exception was thrown, and execution should be halted. I don’t know about actually passing back data about the exception (for debugging purposes), but at least it would give some kind of error detection capability on the device.

  3. Inline Functions. I know that by using some #define macros I could “inline” a function into another function, but why not support the inline flag for a small function? This would make the code a bit cleaner and more ‘compliant’ with standard C/C++ practices. It also allows you to support function calls without actually supporting recursion (which I imagine would also be possible with the right combination of PTX instructions).
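
For what it's worth, the "give every thread an equal share of the pool" idea from point 1 can be sketched by hand today as a simple bump allocator. This is only an illustration under my own assumptions: ThreadArena, makeArena and HD are hypothetical names, the pool is assumed to be 8-byte aligned, and since there are no exceptions on the device the sketch returns NULL when a thread's share runs out.

```cpp
#include <cstddef>

#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD   // lets the sketch build with a plain C++ compiler too
#endif

// Hypothetical per-thread bump allocator over an equal slice of a fixed pool.
struct ThreadArena {
    unsigned char* base;    // start of this thread's slice
    unsigned int capacity;  // slice size in bytes (a multiple of 8)
    unsigned int used;      // bytes handed out so far (a multiple of 8)

    // Returns NULL instead of throwing when the slice is exhausted.
    HD void* alloc(unsigned int bytes)
    {
        if (bytes > capacity - used)
            return NULL;
        unsigned int rounded = (bytes + 7u) & ~7u;  // keep 8-byte granularity
        void* p = base + used;
        used += rounded;
        return p;
    }
};

// Split poolBytes evenly across numThreads and return the slice of thread tid.
// Assumes pool is 8-byte aligned and numThreads > 0.
HD inline ThreadArena makeArena(unsigned char* pool, unsigned int poolBytes,
                                unsigned int numThreads, unsigned int tid)
{
    unsigned int share = (poolBytes / numThreads) & ~7u;  // round down to 8
    ThreadArena a;
    a.base = pool + tid * share;
    a.capacity = share;
    a.used = 0;
    return a;
}
```

Inside a kernel the pool would typically be the dynamically sized shared array (extern __shared__), with makeArena(pool, poolBytes, blockDim.x, threadIdx.x) called once per thread. The pool size is still passed at launch, so the per-block shared memory requirement stays known up front.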

A big one that works (and that I’m quite happy about) is delayed definition of variables:

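__global__ void foo() {
	for (int i = 0; i < N; i++) {
		…
	}

	// declared here, after executable statements, rather than at the top of the function
	float a,b;
	a = shmem[tid]; b = shmem[tid+blockSize];
}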


It really makes sense to me not to support too much of C++. After all, it’s really array processing that is at the heart of the kernel code.
I just thank my lucky stars every day that there’s no Fortran involved!! I hate Fortran…

This will impact the block scheduling, which currently requires knowledge of the shared memory required per block when the kernel starts. With dynamic memory allocation (in shared memory, anyway), you run the risk of starting too many blocks per multiprocessor, preventing the new from succeeding when it could have with fewer concurrent blocks.

I think dynamic allocation of local memory could be implemented, though.

This is not a bad idea, though the semantics would be totally different than a standard C++ exception, which would normally trigger a divergent branch.

Doesn’t nvcc already automatically inline all device functions anyway? I know that is not strictly required of the compiler and could change in the future for big functions, but at the moment I think everyone is already getting function inlining.

That is to a limited degree.

According to my experience with CUDA, the following code will not compile:

template type1…

template type2…

__global__ foo (type1, type2, T)


I found it is usually a good idea to explicitly specialize template structures.

For example, the following code will not compile:


struct type1struct




__device__ type1struct foo()



while this will compile:


struct type1struct



template <>

struct type1struct




__device__ type1struct foo()



When calling foo.

I just tried that, and it worked OK in SDK 2.1.

I did have to add a semicolon at the end of the struct definition, and fill in some of the … with dummy bodies.

But likely my test wasn’t using the structs the way you meant them to be instantiated.

Well, according to the release notes, now you can file a bug report :) That is what I was posting it for: now that they say it is supported, you can complain about things that don’t work.

The code I previously posted will compile, unless you instantiate the templates. I can have:

__global__ foo (type1, type2, T){…}

in my code, and it will compile without a problem. However, when I make a kernel call


I will get the following error:

error C2912: explicit specialization; ‘foo(…)’ is not a specialization of a function template.

Sometimes templates work beautifully. The more complex a template gets, the higher the chance CUDA will not compile it. It now works perfectly with relatively simple templates; with complicated and nested ones, it’s more about luck. Take the following (successfully compiled) code, for example:


__global__ void CalcField_SPkernel(float2* xyInterleaved, float* z, pointCharge *Charges,
							unsigned int n, unsigned int p, unsigned int fieldIndex, float resolution){..}

{// Some host function
CalcField_SPkernel<KERNEL_STEPS><<<grid, block>>>(fieldLines.xyInterleaved, fieldLines.z,
			pointCharges, lines, points, i, resolution);
}



It’s very simple. There’s no typename that needs to instantiate another template, so CUDA won’t complain.
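
For comparison, this is roughly the shape of the "simple" case: a single non-type template parameter that is a plain integer constant and never has to instantiate another template. The sketch is my own illustration, not the CalcField code; Steps, growSteps, growKernel and HD are hypothetical names, and the loop body is a stand-in.

```cpp
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD   // lets the sketch build with a plain C++ compiler too
#endif

// Steps is a compile-time constant, so the loop bound is known to the compiler.
template <unsigned int Steps>
HD float growSteps(float x, float rate)
{
    for (unsigned int i = 0; i < Steps; ++i)
        x += rate * x;
    return x;
}

#ifdef __CUDACC__
// One element per thread; the template argument is forwarded unchanged.
template <unsigned int Steps>
__global__ void growKernel(float* data, float rate, unsigned int n)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        data[tid] = growSteps<Steps>(data[tid], rate);
}
#endif
```

The launch supplies the constant in the first pair of angle brackets, e.g. growKernel<4><<<grid, block>>>(d_data, rate, n), which is the same syntax as the CalcField_SPkernel<KERNEL_STEPS> call.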

I’m not as worried about filing a bug report. Everything I need to work works fine, even if I sometimes use small workarounds.