Passing parameters to a CUDA function - newbie question

I apologize for the simple nature of this question. Appreciate any help.

While going through some CUDA examples, I came across code that I thought should not work, but for some reason it runs correctly. I've looked for an explanation but couldn't find one.

My (lack of) understanding was that if I have a variable in host memory, then to use it inside a CUDA kernel I have to create a copy of that variable in device memory, using cudaMalloc and cudaMemcpy.

However, the following code works properly (i.e. the addition works correctly). I’m curious how?
__global__ void add(int a, int b, int *c)
{
    *c = a + b;
}

int main()
{
    int a = 4, b = 5, c = 0; // values on the host

    int *d_c; // device copy of c

    cudaMalloc((void **)&d_c, sizeof(int));

    add<<<1, 1>>>(a, b, d_c);
}


My question is, how do a and b get from host to device?

Thank you.


If you’re using the Runtime API, parameters for __global__ functions are implicitly marshalled and copied from the host to the device.

NVCC generates code that hides the marshalling from you.
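To make that concrete, here is a complete, runnable version of the snippet from the original post, with the copy-back and cleanup steps (which the post elided) filled in. The scalar arguments a and b cross to the device automatically as part of the launch; only the result pointed to by d_c needs an explicit cudaMemcpy back:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add(int a, int b, int *c)
{
    // a and b arrive here by value; only *c lives in device memory.
    *c = a + b;
}

int main()
{
    int a = 4, b = 5, c = 0;
    int *d_c;

    cudaMalloc((void **)&d_c, sizeof(int));

    // a and b are implicitly marshalled from host to device by the launch.
    add<<<1, 1>>>(a, b, d_c);

    // The result still has to be copied back explicitly.
    cudaMemcpy(&c, d_c, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_c);

    printf("%d + %d = %d\n", a, b, c); // prints "4 + 5 = 9"
    return 0;
}
```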

It’s definitely a subtle concept. Some details on function parameter size limitations are found here in the CUDA C Programming Guide.

The CUDA Runtime API docs reveal an API that will marshal args and launch your kernel.

But the kitchen-sink CUDA Driver API cuLaunchKernel() function reveals much more.
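For the curious, here is a sketch of what an explicit Driver API launch of the same add kernel might look like. It assumes the module was already loaded with cuModuleLoad and the CUfunction handle obtained with cuModuleGetFunction; error checking is omitted:

```cuda
#include <cuda.h>

// Hypothetical explicit launch of add(a, b, d_c) via the Driver API.
int a = 4, b = 5;
CUdeviceptr d_c;   // assumed allocated earlier with cuMemAlloc
CUfunction addFn;  // assumed obtained earlier with cuModuleGetFunction

void *params[] = { &a, &b, &d_c };

cuLaunchKernel(addFn,
               1, 1, 1,  // grid dimensions
               1, 1, 1,  // block dimensions
               0,        // dynamic shared memory bytes
               NULL,     // stream (default)
               params,   // kernel parameters, marshalled for you
               NULL);    // "extra" options, unused here
```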

Finally, if you want to see what’s actually being generated by NVCC then compile with “-keep” and then “grep __device_stub *.ii” to see how cudaSetupArgument() and cudaLaunch() are invoked automagically.

Thank you!

Sorry, a follow-up question would be, is there an advantage to this implicit marshalling, as opposed to doing it explicitly by creating the variable on the device with cudaMalloc and copying the values with cudaMemcpy?

All intro examples I’ve seen seem to do this explicitly. Since this happens implicitly, why bother to write extra code?

Thanks again.


(Oops, I originally missed the point of your second question.)

Any device or shared resource needs to be explicitly allocated by you. Only the value of a pointer or reference to that resource will be implicitly marshalled and copied.
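Concretely, that is why passing the address of a host variable straight to the kernel would be a bug: only the pointer's value crosses to the device, not the memory it points to. A small illustration (the commented-out line is the mistake to avoid):

```cuda
int c = 0;
int *d_c;
cudaMalloc((void **)&d_c, sizeof(int));

// WRONG: &c is a host address. Its value would be marshalled just fine,
// but the kernel would then dereference it on the device, where that
// address is meaningless.
// add<<<1, 1>>>(4, 5, &c);

// RIGHT: d_c holds a device address. Its value is marshalled by value,
// and the allocation it points to was set up explicitly with cudaMalloc.
add<<<1, 1>>>(4, 5, d_c);
```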

All the intro examples you see are correct.

I was just pointing out that the actual triple-angle-bracket syntax hides some complexity from you and marshals the actual argument values from host to device.

… and back to the actual marshalling trivia:

If you’re using the Runtime API then I’m pretty sure explicit calls to setup and launch a kernel would be identical to the stub generated by NVCC.

So you would probably have to have a unique use case to explicitly call these runtime functions.

But if you’re using the Driver API you can optionally carefully pack your parameters once and reuse them across a large number of kernel launches (assuming the params aren’t changing between invocations).
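A sketch of that pack-once-and-reuse pattern, using the "extra" argument of cuLaunchKernel with a hand-packed parameter buffer. The struct layout assumes the usual 4-byte ints and an 8-byte-aligned device pointer, and addFn/d_c are assumed to have been set up earlier via the Driver API:

```cuda
#include <cuda.h>

// Pack the arguments for add(int, int, int*) once into a single buffer.
struct { int a; int b; CUdeviceptr c; } args = { 4, 5, d_c };
size_t argsSize = sizeof(args);

void *extra[] = {
    CU_LAUNCH_PARAM_BUFFER_POINTER, &args,
    CU_LAUNCH_PARAM_BUFFER_SIZE,    &argsSize,
    CU_LAUNCH_PARAM_END
};

// kernelParams is NULL; the same packed buffer is reused across launches.
for (int i = 0; i < 1000; ++i)
    cuLaunchKernel(addFn, 1, 1, 1, 1, 1, 1, 0, NULL, NULL, extra);
```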

I’m not going to guess if that will result in any performance gains. It might make certain kernel launch loops cleaner though.

Thank you for the explanation.

This is kind of unrelated to my previous question: you make a distinction between the Runtime API and the Driver API, what are those? Again, sorry if this is way too basic.

Thank you.


The Runtime API could also be called “the normal way of programming CUDA”. Anytime you call a function prefixed by “cudaXXX()” you are using the Runtime API.

Most developers use this.

The Driver API is a lower level API that we don’t talk about. The first rule of the Driver API is that you don’t talk about the Driver API.

It’s all documented in the reference manuals.
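As a rough illustration of the difference in verbosity, here is the same setup step in both APIs. The Driver API calls shown are real, but the sequence is a sketch with error checking omitted, and the module filename is hypothetical:

```cuda
#include <cuda.h>
#include <cuda_runtime.h>

// Runtime API: one call; context and module handling are implicit.
int *d_c;
cudaMalloc((void **)&d_c, sizeof(int));

// Driver API: everything is explicit.
CUdevice   dev;
CUcontext  ctx;
CUmodule   mod;
CUfunction fn;
CUdeviceptr d_c2;

cuInit(0);
cuDeviceGet(&dev, 0);
cuCtxCreate(&ctx, 0, dev);
cuModuleLoad(&mod, "add.cubin");      // module filename is hypothetical
cuModuleGetFunction(&fn, mod, "add");
cuMemAlloc(&d_c2, sizeof(int));
```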

Thank you!