Is CUDA C or C++ ?

rocksportrocker · January 26, 2009, 12:57pm

Hi,

the programming guide says that CUDA programming is based on a subset
of C. Meanwhile I found some examples using C++ templates in CUDA code.

Where can I find a reference about supported C++ concepts ???

Greetings, Uwe

jack · January 26, 2009, 1:40pm

From what I understand, templates are the only C++ feature available in CUDA (and even that may be limited to some degree). I believe that nVidia has plans to support some more C++-type code in the future, but it may be some time before that is made available.

E.D_Riedijk · January 26, 2009, 2:00pm

From http://forums.nvidia.com/index.php?showtopic=85832

What’s new in CUDA 2.1

• C++ templates are now supported in CUDA kernels

seibert · January 26, 2009, 2:04pm

operator overloading also works:

inline __device__ __host__ float3 operator *(float3 a, float3 b)
{
	return make_float3(a.x*b.x, a.y*b.y, a.z*b.z);
}

(from CudaMath.h in the SDK)

jack · January 26, 2009, 2:22pm

Sorry, I’m behind the times :"> (still using 2.0).

SPWorley · January 26, 2009, 4:45pm

Actually a lot more even works, it’s just not official.

The three big features that don’t work (and will be hard to support because of the GPU memory and execution model) are:

Dynamic memory allocation (new/delete)

Virtual functions (anything that starts getting deep into pointers!)

Exceptions

These are big holes, and prevent most CPU C++ libraries from being applicable to GPU.

But much of the rest works!

Inheritance, complex templates, operator overloading, functors, etc.

In fact I often customize my kernels with a template, passing in a functor to change the entire behavior of part of the kernel’s processing loop.

This toy code shows working inheritance, functors, and templates.

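class Functor {
public:
	__device__ int operator()(int a, int b) { return a + b; }
};

// F cannot be deduced from the arguments, so it is named at the launch,
// e.g. testKernel<Functor><<<grid, block>>>(d_in, d_out);
// (grid, block, d_in and d_out are placeholders).
template <typename F>
__global__ void
testKernel(float* g_idata, float* g_odata)
{
	// Local classes with inheritance inside a kernel.
	class AAA { public: int a; };
	class BBB : public AAA { public: int b; };

	BBB bb;
	bb.b = 6;
	bb.a = 7;

	// Default-construct the functor and call it like a function.
	bb.a = F()(bb.a, bb.b);
}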
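To make the "pass in a functor to change the loop's behavior" idea concrete, here is a minimal sketch in plain C++ that can be tried without a GPU. It is not SPWorley's code: the names AddFunctor, MulFunctor and applyPairwise are made up for illustration, and the for loop only stands in for the per-thread body of a kernel. The macro guard is an assumption that lets the same functors compile both with nvcc and with an ordinary host compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Outside nvcc the CUDA qualifiers do not exist, so define them away.
// (Assumption: under nvcc the real qualifiers are used unchanged.)
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical functors; each one changes what the processing loop computes.
struct AddFunctor {
    __host__ __device__ int operator()(int a, int b) const { return a + b; }
};

struct MulFunctor {
    __host__ __device__ int operator()(int a, int b) const { return a * b; }
};

// Host stand-in for a kernel body: index i plays the role of the thread index.
// F is a compile-time parameter, so the call F()(x, y) can be inlined.
template <typename F>
std::vector<int> applyPairwise(const std::vector<int>& a,
                               const std::vector<int>& b) {
    assert(a.size() == b.size());
    std::vector<int> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = F()(a[i], b[i]);
    }
    return out;
}
```

Calling applyPairwise<AddFunctor>(a, b) and applyPairwise<MulFunctor>(a, b) runs the same loop with different behavior, chosen at compile time. In a real kernel the loop would disappear and each thread would handle one index.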

jack · January 26, 2009, 5:28pm

SPWorley, you bring up a couple of questions:

Dynamic memory allocation. Why would this be so hard to implement? I wouldn’t think that most kernels divide up shared memory “asymmetrically”, that is, with certain threads accessing a larger part of the shared memory than others. Why not just take the shared memory, divide by the number of threads in the block, then “allocate” a share to each thread; when new() or whatever is called, each thread will construct an object in its share of the memory. This approach should also automatically handle coalesced reads/writes from shared memory in most cases. If there is not enough memory available, throw an exception (see below).

Exceptions. Why not just allocate a couple extra registers (or a tiny piece of shared/global/etc. memory) so that each thread is able to set an ‘exception bit’; using the warp vote function any(), the kernel could check to see if any threads have thrown an exception and stop execution. Across the multiple processors, each register/chunk of memory could be written to a global storage area, then have a few operations performed (retrieve the memory locations, then OR them all together, then check for nonzero) in a reduction-style bit of code; if nonzero, an exception was thrown, and execution should be halted. I don’t know about actually passing back data about the exception (for debugging purposes), but at least it would give some kind of error detection capability on the device.

Inline Functions. I know that by using some #define macros I could “inline” a function into another function, but why not support the inline flag for a small function? This would make the code a bit cleaner and more ‘compliant’ with standard C/C++ practices. It also allows you to support function calls without actually supporting recursion (which I imagine would also be possible with the right combination of PTX instructions).
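The ‘exception bit’ proposal above can be sketched roughly as follows. This is a host-only emulation in plain C++, not device code: the loop index plays the role of the thread id, the flags vector stands in for the per-thread register or memory slot, and the final loop is the OR-reduction. The names processElement and runWithErrorFlags, and the "negative input is an error" rule, are invented for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-thread work: returns false when the "thread" hits an
// error (here: a negative input, chosen only for illustration).
inline bool processElement(float in, float& out) {
    if (in < 0.0f) return false;
    out = in * 2.0f;
    return true;
}

// Host emulation of the scheme: one flag per thread, then an OR-reduction.
// On a GPU the flags would live in registers or shared/global memory, and
// the reduction would run in the kernel or on the host after the launch.
inline bool runWithErrorFlags(const std::vector<float>& in,
                              std::vector<float>& out) {
    assert(in.size() == out.size());
    std::vector<unsigned char> flags(in.size(), 0);
    for (std::size_t tid = 0; tid < in.size(); ++tid) {
        if (!processElement(in[tid], out[tid])) {
            flags[tid] = 1;  // the "throw": just record that something failed
        }
    }
    unsigned char any = 0;
    for (unsigned char f : flags) any |= f;  // OR all flags together
    return any != 0;  // true means at least one thread raised its flag
}
```

Note that this only reports that some thread failed; nothing is unwound and no information about the failure is passed back, which matches the limitation mentioned in the proposal.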

dneckels · January 26, 2009, 5:29pm

A big one that works (and that I’m quite happy about) is delayed definition of variables:

__global__ void foo() {
	for (int i = 0; i < N; i++) {
		…
	}

	__syncthreads();

	// Declared here, after executable statements, rather than at the top.
	float a, b;
	a = shmem[tid]; b = shmem[tid+blockSize];
	…
}

It really makes sense to me not to support too much of C++. After all, it’s really array processing that is at the heart of the kernel code.
I just thank my lucky stars every day that there’s no fortran involved!! I hate fortran…

seibert · January 26, 2009, 5:39pm

Allocating from shared memory at run time would affect block scheduling, which currently needs to know the shared memory required per block when the kernel starts. With dynamic allocation (in shared memory, anyway), you run the risk of starting too many blocks per multiprocessor, so a new could fail when it would have succeeded with fewer concurrent blocks.

I think dynamic allocation of local memory could be implemented, though.

jack wrote: “Exceptions. Why not just allocate a couple extra registers (or a tiny piece of shared/global/etc. memory) so that each thread is able to set an ‘exception bit’ …”

This is not a bad idea, though the semantics would be totally different than a standard C++ exception, which would normally trigger a divergent branch.

Doesn’t nvcc already automatically inline all device functions anyway? I know that is not strictly required of the compiler and could change in the future for big functions, but at the moment I think everyone is already getting function inlining.

Mr_Nuke · January 29, 2009, 5:29am

That is to a limited degree.

According to my experience with CUDA, the following code will not compile:

[codebox]template type1…
template type2…
template
__global__ foo (type1, type2, T)
{…}[/codebox]

I found it is usually a good idea to explicitly specialize template structures.

For example, the following code will not compile:

[codebox]template
struct type1struct
{
}

template
__device__ type1struct foo()
{…}
[/codebox]

while this will compile:

[codebox]template
struct type1struct
{…
}

template <>
struct type1struct
{…
}

template
__device__ type1struct foo()
{…}
[/codebox]

When calling foo.

SPWorley · January 29, 2009, 7:06am

I just tried that, and it worked OK in SDK 2.1.

I did have to add a semicolon at the end of the struct definition, and fill in some of the … with dummy bodies.

But likely my test wasn’t using the structs the way you meant them to be instantiated.

E.D_Riedijk · January 29, 2009, 9:12am

Well, according to the release notes, now you can file a bug report :) That is why I posted it: now that they say it is supported, you can complain about things that don’t work.

Mr_Nuke · January 30, 2009, 1:01am

The code I previously posted will compile, unless you instantiate the templates. I can have:

__global__ foo (type1, type2, T){…}

in my code, and it will compile without a problem. However, when I make a kernel call

foo<<<…>>>(…);

I will get the following error:

error C2912: explicit specialization; ‘foo(…)’ is not a specialization of a function template.

Sometimes templates work beautifully. The more complex a template gets, the higher the chance CUDA will not compile it. It now works perfectly with relatively simple templates; with complicated and nested ones… it’s more about luck. Take the following (successfully compiled) code for example:

[codebox]template
__global__ void CalcField_SPkernel(float2* xyInterleaved, float* z, pointCharge *Charges,
							unsigned int n, unsigned int p, unsigned int fieldIndex, float resolution){..}

{// Some host function
CalcField_SPkernel<KERNEL_STEPS><<<grid, block>>>(fieldLines.xyInterleaved, fieldLines.z,
			pointCharges, lines, points, i, resolution);
}
[/codebox]

It’s very simple. There’s no typename that needs to instantiate another template, so CUDA won’t complain.
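For readers unfamiliar with passing a constant such as KERNEL_STEPS as a template argument, here is a small sketch of the general pattern in plain C++ that can be compiled without a GPU. It is not the CalcField_SPkernel code: the functions advance and advanceRuntime, the parameter name Steps, and the supported values 4 and 8 are all hypothetical. The macro guard is an assumption so the same function also compiles under nvcc.

```cpp
#include <cassert>

// Outside nvcc the CUDA qualifiers do not exist, so define them away.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical helper: advance x by Steps increments of size dx.
// Steps is a non-type template parameter, so the loop bound is a
// compile-time constant and the compiler is free to unroll the loop.
template <unsigned int Steps>
__host__ __device__ float advance(float x, float dx) {
    static_assert(Steps > 0, "Steps must be positive");
    for (unsigned int s = 0; s < Steps; ++s) {
        x += dx;
    }
    return x;
}

// A template argument must be known at compile time, so a run-time value
// has to be mapped onto one of a fixed set of instantiations on the host.
inline float advanceRuntime(unsigned int steps, float x, float dx) {
    switch (steps) {
        case 4: return advance<4>(x, dx);
        case 8: return advance<8>(x, dx);
        default:
            assert(false && "unsupported step count");
            return x;
    }
}
```

The design point is that each distinct value produces its own instantiation, which is why a single integer parameter like this tends to be unproblematic: there is no nested template to resolve.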

I’m not as worried about filing a bug report. Everything I need to work works fine, even if I sometimes use small workarounds.
