Running several iterations of a kernel on multiple GPUs with cudaConfigureCall / cudaSetupArgument / cudaLaunch

Hello. I’m implementing a kernel that needs to be run several times, so the call is placed inside a loop for a given number of iterations. The arguments passed to the kernel amount to 84 bytes, and from the tests I’ve already run, the smaller the arguments, the shorter the time from kernel call to kernel completion.

Then, looking at the reference manual, I stumbled across cudaConfigureCall, cudaSetupArgument and cudaLaunch, which together do the same thing as a call to:

my_kernel<<<blocks, threads, shared_mem_size, stream>>>(sth1, sth2, ..., sthn);

in the following way:

offset = 0;
cudaConfigureCall(blocks, threads, shared_mem_size, stream);

cudaSetupArgument(&sth1, sizeof(sth1), offset);
offset += sizeof(sth1);

cudaSetupArgument(&sth2, sizeof(sth2), offset);
offset += sizeof(sth2);

...

cudaSetupArgument(&sthn, sizeof(sthn), offset);
offset += sizeof(sthn);

cudaLaunch("my_kernel");

(Do correct me if I’m wrong on the calls above, please)

As I understand it, cudaConfigureCall and cudaSetupArgument push data onto the execution stack and cudaLaunch pops it. If my_kernel has to run for a given number of iterations, can I use the procedure from the second listing like this?

for (/* given # iterations */)
{
    offset = 0;
    cudaConfigureCall(blocks, threads, shared_mem_size, stream);

    cudaSetupArgument(&sth1, sizeof(sth1), offset);
    offset += sizeof(sth1);

    cudaSetupArgument(&sth2, sizeof(sth2), offset);
    offset += sizeof(sth2);

    ...

    cudaSetupArgument(&sthn, sizeof(sthn), offset);
    offset += sizeof(sthn);
}

for (/* given # iterations */)
    cudaLaunch("my_kernel");

It may seem unclear why one would benefit from this, but when running on multiple devices, the more data passed to the kernel as arguments, the greater the overhead of calling it. This way I would keep the arguments on the execution stack for that given number of iterations, and only have to launch the kernel simultaneously on all devices. When the kernel is launched on four devices at the same time, I get about 100 Mbps less throughput than I would with just a single device; I suspect this is caused by OS scheduling of bus access while the arguments are passed to each device.

Please don’t point me at GPUWork; this needs to be done with pthreads. Any help is really appreciated.