__device__ function clarifications

Hi I have some questions regarding device functions, mostly about the style of argument/result passing between the caller and callee.

  1. Is it safe to pass a pointer of local variable as a function argument? Local variables in the kernel are mapped to registers, so the idea of taking a pointer

of register bothers me.

  1. The device functions are always inlined. So if I am trying to pass a struct as an input argument, I don’t have to pass the pointer of it to reduce the overhead, right?

  2. Likewise, due to inlining, Passing a result pointer to return a complex return data doesn’t really improve anything, right? For example


struct bigstruct {


device bigstruct foo1 () {

bigstruct ret;

… populate ret …

return ret;


device void foo2(bigstruct* ret_p) {

… populate *ret_p …



foo2 and foo1 are not really different in the overhead-wise, right?

  1. Is reference(&) allowed for the kernel code? like this…


device void foo3(bigstruct &ret) {

… populate ret …



It’s confusing what is supported and what isn’t supported in the device code. Kind of hard to draw a hard line between C and C++…

  1. I consulted the “programming guide” and “reference manual”. Are there any other documentation that I can read? Very little information is in the programming guide.

Taking the address of a register will lead to erroneous behavior. At least the last forum post on the topic had a program that compiled fine but ran incorrectly, so presumably the compiler doesn’t catch this problem…




Well, there is a wealth of information in the programming guide. 90% of all questions on this form could be solved simply if people read it before posting. Your question isn’t one of those, though: for whatever reason, NVIDIA has chosen not to document what C/C++ features are allowed in kernels and what are not.

FYI, here are a few more undocumented features:

templated kernel code works very well, though it is technically unsupported as far as I know.

simple classes with device member functions also work if you are very careful in how you write them (i.e. only simple data members, all members inlined, no requirement for dynamic memory, no polymorphism and a few other gotchas I can’t think of at the moment)

Thanks for the answer.

Could you provide the link to the post? It’s hard to find anything in this forum. Turn off that buggy flood-control NVIDIA!

Taking a pointer and dereferencing it is a fundamental C feature that I should not worry about… it’s disappointing. what should I expect to work. :">

I don’t know how a compiler inlines a function, but as I don’t do any pointer arithmetic, it should be able to figure out and eliminates & and * s and substitute them with regular variable…

Again, I don’t know how ppl debug a kernel when it runs in emulation mode and doesn’t on the device. I cannot gdb (well there’s gdb 2.1 now) or printf to see what’s going on it, and … i can use any feature with a peace of mind.

That’s interesting. I’d love to try them but kinda hesitating because it would be just one more suspect when things do not work as i want…

At NVISION slides were shown that said NVIDIA is working to C++ and Fortran support fro CUDA. Given the fact that templates have been working for a long time (SDK examples use them), and the fact that more C++ features are apparently working in 2.1 beta, I think NVIDIA is taking the gradual approach to the C++ support. So I would expect that you will see more and more C++ functionality working in CUDA with each release.

I guess I was remembering incorrectly. This is the post I was thinking about, which results in a compiler crash:


The issue is that CUDA has many memory models (registers, global, local, shared) and if you get too creative with pointers the compiler cannot determine which one the pointer points to.

I didn’t think of that. It probably will work, as long as you don’t do pointer arithmetic or array indexing on it. Why don’t you try it out?

Code I’ve debugged that ran in emulation and not device is that either:

  1. Had out of bounds or uninitialized memory accesses (debugged with valgrind on emulation mode)

1a) Forgot to allocate dynamic shared memory in the kernel launch

or 2) Reading a host pointer on the device (only way to find these is to carefully comb through your code and use good naming to differentiate between host and device pointers).

By commenting chunks out.

C++ support would really save some porting effort, but would they be able to run it fast with all the abstractions? :shifty:

The debugging device code part, well, i finally found out the problem. I posted it to the relavent thread. Array indexing in the kernel code just didn’t work.


I would add one more debugging methodology…

explicitly cudamemcopy the value you’re interested in to the host and examine it.

well… i put this off till I exhausted all the other options because it’s sooooo TEDIOUS… but it did catch my problem, i i guess it’s worth trying