I have a question about kernel functions. I know that a function declared with the `__global__` qualifier is a kernel: when we launch it, it is executed N times in parallel by N different CUDA threads. My question is whether a function declared `__device__` can also do parallel computation, or whether that only works in a kernel, i.e. a `__global__` function?
Do you mean that a `__device__` function cannot do the parallel calculation? Apart from the fact that a `__device__` function is only callable from the device while a `__global__` function is callable from the host, what other differences are there between the two, for example in execution speed? More specifically, can I do a sum reduction in a `__device__` function? Would it be much slower than doing it in a `__global__` function?
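To make the sum reduction question concrete, here is a minimal sketch of the kind of code I have in mind. All names (`blockSum`, `sumKernel`, `sumOnGpu`, `BLOCK`) are made up for illustration, the block size is assumed to be a power of two, and error checking is left out. As far as I understand, every thread of the block calls the `__device__` helper, so the loop inside it would still be executed by all threads together.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK = 256;  // threads per block (assumed to be a power of two)

// Hypothetical __device__ helper: called by every thread of the block,
// so the tree reduction below is carried out by the whole block.
__device__ float blockSum(float *sdata, float value)
{
    unsigned int tid = threadIdx.x;
    sdata[tid] = value;
    __syncthreads();

    // Tree reduction in shared memory: halve the active range each step.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();  // reached by all threads, not only the active ones
    }
    return sdata[0];
}

// Kernel: loads one element per thread and delegates the reduction
// to the __device__ helper. Writes one partial sum per block.
__global__ void sumKernel(const float *in, float *out, int n)
{
    __shared__ float sdata[BLOCK];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;  // pad the last block with zeros
    float total = blockSum(sdata, v);
    if (threadIdx.x == 0) {
        out[blockIdx.x] = total;
    }
}

// Host wrapper (error checking omitted to keep the sketch short).
float sumOnGpu(const float *h_in, int n)
{
    int blocks = (n + BLOCK - 1) / BLOCK;
    float *d_in = nullptr;
    float *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    sumKernel<<<blocks, BLOCK>>>(d_in, d_out, n);

    // Copy back the per-block partial sums and add them up on the host.
    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), d_out, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    float total = 0.0f;
    for (float p : partial) {
        total += p;
    }
    return total;
}

int main()
{
    const int n = 1000;
    std::vector<float> h(n, 1.0f);  // n ones
    printf("sum = %f\n", sumOnGpu(h.data(), n));
    return 0;
}
```

Is this a reasonable way to structure it, or does putting the reduction in the `__device__` function cost something compared with writing the same loop directly in the kernel?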