All function calls from CUDA device functions are inlined, so recursion is not possible. You also cannot launch a kernel from inside another kernel, because each thread executes its code serially.
Suppose I run a kernel that executes fine and produces some result data. Without copying that result back to the host, can a subsequent kernel launch access it?
Or do I have to memcpy it to the host and then back to the device?
If I understand your question correctly, the answer is yes. If both kernels share the same context, you can write data to memory from the first kernel and then read it from the second. This holds for global (device) memory, but not for shared or local memory.
You will probably need to cudaMalloc() storage on the GPU from host code and pass the pointer to the allocated device memory to both kernels.
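A minimal sketch of that pattern (the kernel and variable names are made up for illustration): the first kernel writes its results to global memory, and the second kernel reads them back without any round trip through the host.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// First kernel: writes results into global memory.
__global__ void produce(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i * i;          // result stays in device memory
}

// Second kernel: reads the first kernel's output directly.
__global__ void consume(const int *data, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = data[i] + 1;
}

int main() {
    const int n = 256;
    int *d_data, *d_out;
    cudaMalloc(&d_data, n * sizeof(int));   // allocated once from host code
    cudaMalloc(&d_out,  n * sizeof(int));

    // Both launches share the same context (and the default stream,
    // so they run in order); no cudaMemcpy between them.
    produce<<<(n + 127) / 128, 128>>>(d_data, n);
    consume<<<(n + 127) / 128, 128>>>(d_data, d_out, n);

    int h_out[n];
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d %d\n", h_out[2], h_out[10]);  // prints "5 101"

    cudaFree(d_data);
    cudaFree(d_out);
    return 0;
}
```

Only the final copy of the *second* kernel's output goes back to the host; the intermediate result in `d_data` never leaves the device.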
Uh, why do you think this is recursion? As long as the call can be completely inlined (i.e. your functions are just “syntactic sugar”), you can call another function; if you couldn’t, device functions would be completely useless.
Maybe I’m missing something here, but what is the point of nesting the functions? Anything that can be written as nested functions should be expressible serially as one device function as well… or no?