kernel function cuda, kernel

I have a question about kernel functions. I know that a function declared with the __global__ qualifier is a kernel: when we call it, it is executed N times in parallel by N different CUDA threads. My question is whether a function declared __device__ can also do parallel computation, or whether that only works in a kernel, i.e. a __global__ function?

Thanks a lot…

A __device__ function is just a group of statements that executes only on the device.

You can regard it as a simple operation, like addition or multiplication.

The nvcc compiler inlines __device__ functions into the code that calls them and optimizes the result.

See page 31 of the CUDA Programming Guide 2.3.
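
To make that concrete, here is a minimal sketch of a kernel calling a __device__ function. The names scale_and_add, saxpy and run_saxpy_check are made up for illustration, and error checking is left out for brevity. Each thread that runs the kernel also runs the helper on its own element, so the helper executes in parallel even though it is not a kernel itself.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: callable only from device code.
// It runs once in every thread that calls it.
__device__ float scale_and_add(float a, float x, float y)
{
    return a * x + y;
}

// Kernel: launched from the host. Every thread handles one element
// and calls the __device__ helper for it.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = scale_and_add(a, x[i], y[i]);
}

// Host-side driver (illustrative): returns true if every y[i] == 3*1 + 2.
bool run_saxpy_check()
{
    const int n = 1024;
    std::vector<float> hx(n, 1.0f), hy(n, 2.0f);

    float *dx = 0, *dy = 0;
    cudaMalloc((void **)&dx, n * sizeof(float));
    cudaMalloc((void **)&dy, n * sizeof(float));
    cudaMemcpy(dx, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256;
    saxpy<<<(n + threads - 1) / threads, threads>>>(n, 3.0f, dx, dy);

    // This copy waits for the kernel to finish before reading the results.
    cudaMemcpy(hy.data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);

    for (int i = 0; i < n; ++i)
        if (hy[i] != 5.0f)
            return false;
    return true;
}

int main()
{
    printf("saxpy check %s\n", run_saxpy_check() ? "passed" : "failed");
    return 0;
}
```

The <<<...>>> launch syntax only applies to the __global__ function; the __device__ function is an ordinary call inside it.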

You mean that a __device__ function cannot do parallel calculation? Apart from the fact that a __device__ function is callable only from the device while a __global__ function is callable from the host, what other differences are there between them, such as execution speed? More specifically, can I do a sum reduction in a __device__ function? Would it be much slower than in a __global__ function?

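Read what the previous poster said… __device__ functions are inlined into the __global__ functions that call them, so they run in the same threads as the calling kernel and there is no separate, slower execution path.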
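
As for the sum reduction, a rough sketch of how it could be placed in a __device__ function is below. This is only one possible layout, not the only way to do it: block_sum, sum_kernel, gpu_sum and BLOCK_SIZE are hypothetical names, the block size is assumed to be a power of two, the per-block partial sums are added up on the host, and error checking is omitted.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // assumption: power of two, equal to the launch block size

// __device__ function: tree reduction over shared memory.
// All threads of the block must call it together, because it
// uses __syncthreads() between reduction steps.
__device__ float block_sum(float *sdata)
{
    unsigned int tid = threadIdx.x;
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    return sdata[0];
}

// Kernel: loads one element per thread into shared memory, then
// lets the __device__ function reduce the block.
__global__ void sum_kernel(const float *in, float *block_sums, int n)
{
    __shared__ float sdata[BLOCK_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    float total = block_sum(sdata);
    if (threadIdx.x == 0)
        block_sums[blockIdx.x] = total;
}

// Host-side driver (illustrative): sums n floats using the kernel above.
float gpu_sum(const float *h_in, int n)
{
    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    float *d_in = 0, *d_partial = 0;
    cudaMalloc((void **)&d_in, n * sizeof(float));
    cudaMalloc((void **)&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    sum_kernel<<<blocks, BLOCK_SIZE>>>(d_in, d_partial, n);

    std::vector<float> h_partial(blocks);
    cudaMemcpy(h_partial.data(), d_partial, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);

    // Final pass over the per-block results is done on the host here.
    float total = 0.0f;
    for (int b = 0; b < blocks; ++b)
        total += h_partial[b];
    return total;
}

int main()
{
    std::vector<float> data(1000, 1.0f);
    printf("sum = %f\n", gpu_sum(data.data(), (int)data.size()));
    return 0;
}
```

The reduction loop would do the same work if it were written directly in the kernel body; moving it into a __device__ function only changes how the source is organized.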