Nesting kernels: can I do this in CUDA?

Hi,

I want to write CUDA code along the following lines:

#include <xyz.h>

Device_function1(j, o, p)
{
    // ...some work...
}

Device_function2(v, j, k)
{
    // ...some work...
}

Device_function3(a, d, g)
{
    // ...some work...
}

Device_function4()
{
    // Calling Device_function1, Device_function2, Device_function3
    Device_function1<<<w, r, j>>>(a, b, g);
    Device_function2<<<n, g, j>>>(a, d, e);
    Device_function3<<<n, m, j>>>(a, d, a);
}

int main()
{
    // Calling the kernel Device_function4
    Device_function4<<<n, m, j>>>(a, d, g);
}

Note that Device_function4 calls Device_function1, Device_function2, Device_function3.

Can we do this in CUDA? If not, is there any other way to nest functions?

Thanks for your time guys!!

No, you can’t.

Functions 1, 2, and 3 are never called from the host, so they are not really kernels. If you just drop the kernel launch syntax, they will work just fine as normal function calls from kernel/function 4.
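
For example, a minimal sketch of that approach (the names scale_elem and apply_scale, and the element-wise body, are placeholders for illustration, not code from this thread):

__device__ float scale_elem(float x, float s)
{
    // Ordinary device function: callable only from device code,
    // invoked without any execution configuration.
    return x * s;
}

__global__ void apply_scale(float *data, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = scale_elem(data[i], s);   // plain call, no <<< >>>
}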

Alternatively, make function4 a normal function on the host, calling three kernels in a row.
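
A rough sketch of that host-side alternative; step1/step2/step3 and run_pipeline are made-up names standing in for the poster's Device_function1..4:

__global__ void step1(float *a, int n) { /* ...some work... */ }
__global__ void step2(float *a, int n) { /* ...some work... */ }
__global__ void step3(float *a, int n) { /* ...some work... */ }

void run_pipeline(float *d_a, int n)
{
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);

    // Three ordinary kernel launches in a row; launches issued to the
    // same (default) stream execute in order on the device.
    step1<<<grid, block>>>(d_a, n);
    step2<<<grid, block>>>(d_a, n);
    step3<<<grid, block>>>(d_a, n);
}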

Thanks for your answers guys!

Well, I have seen code like this:

__device__ void Add()
{
    // do some calculation
}

__global__ void scanModified()
{
    // do some calculations and call the device function Add()
    Add();   // NOTICE THAT WE DO NOT MENTION THE EXECUTION CONFIGURATION <<< >>> etc.
}

Notice that we do not mention the execution configuration. Can we do this in CUDA?

Yes. But notice that there is a big difference between nesting kernel launches (your original question, which cannot currently be done) and nesting device functions (your last post). You should also be aware that all device functions are presently expanded inline by the compiler, so deeply nested device functions can have a pretty profound negative effect on register usage and compilation speed. Also, because there is no stack, recursive device function calls probably won’t work (although that is a guess; I don’t recall having tried it).
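
To illustrate the nesting point, here is a small sketch (all names are placeholders). On the hardware being discussed in this thread, both levels of device calls below end up expanded inline into the calling kernel:

__device__ float square(float x)
{
    return x * x;
}

__device__ float sum_of_squares(float a, float b)
{
    // A device function calling another device function:
    // both calls get flattened into the kernel by the compiler.
    return square(a) + square(b);
}

__global__ void norm_kernel(const float *v, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = sum_of_squares(v[i], v[i]);
}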

Do I need to explicitly mark the function as inline, or is it inlined by default by the compiler? Please also explain “Also, because there is no stack…”: which stack are you talking about, and where is it absent?

I want to ask one more thing: why don’t we need an execution configuration (number of threads, number of blocks, etc.) for the function Add()?

Read this.

Did you actually read any of the replies in this thread? Device functions aren’t kernels. They aren’t even real subroutines. They are expanded inline (like macros) by the compiler inside kernel code.

Thanks a lot once again, Avidday. I read the replies above. Could you please point me to a document or previous forum post where I can read more about this (“Device functions aren’t kernels”, etc.)?

You could start with the programming guide. Appendix B describes the properties of device functions pretty thoroughly.

Thanks Avidday, I am going to read Appendix B.

No, you only mention the configuration when launching the kernel function. Any subsequent function calls from within the kernel run with that configuration.
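
A short sketch of that point (the names my_global_index and add_offset are illustrative, not from this thread):

__device__ int my_global_index()
{
    // blockDim, blockIdx and threadIdx inside a device function reflect
    // the configuration given at the <<< >>> launch of the enclosing kernel.
    return blockIdx.x * blockDim.x + threadIdx.x;
}

__global__ void add_offset(float *data, float offset, int n)
{
    int i = my_global_index();   // no execution configuration here
    if (i < n)
        data[i] += offset;
}

int main()
{
    int n = 1024;
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));

    // The execution configuration is given once, at the kernel launch;
    // everything called from inside the kernel runs under it.
    add_offset<<<(n + 255) / 256, 256>>>(d_data, 1.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}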