Will this improve performance?

Dear All

Currently the code is:

for(z5=0; z5 < 16;z5++)
kernel1<<<0,X1/32,32,stream(z5)>>>();

for(z5=0; z5 < 16;z5++)
kernel2<<<0,X2/32,32,stream(z5)>>>();

cudaDeviceSynchronize();

for(z5=0; z5 < 16;z5++)
kernel3<<<0,X3/32,32,stream(z5)>>>();

If instead I write this:

kernel1<<<16,X1/32,32,defaultstream>>>();

kernel2<<<16,X2/32,32,defaultstream>>>();

kernel3<<<16,X3/32,32,defaultstream>>>();

Which one gives better performance?
(The number of kernels is about 15.)

Thanks

Luis Gonçalves

This cannot be a valid kernel launch:

kernel1<<<0,X1/32,32,stream(z5)>>>();

0 is not a valid value in either the first (grid) or second (block) position of the launch configuration (i.e. <<<…>>>).
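For reference, a minimal sketch of what the first loop was presumably meant to be (assumptions: `stream` is an array of `cudaStream_t` rather than a function, `X1` is a multiple of 32, and a grid dimension of at least 1 is intended):

```cuda
// Hypothetical corrected version of the loop quoted above.
cudaStream_t stream[16];
for (int i = 0; i < 16; i++)
    cudaStreamCreate(&stream[i]);

for (int z5 = 0; z5 < 16; z5++)
    // <<<grid, block, sharedBytes, stream>>> -- grid must be >= 1
    kernel1<<<1, X1/32, 32, stream[z5]>>>();
```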

If you can efficiently do the work of multiple kernel launches in a single kernel launch (refactored), then I would expect in most cases that the single kernel launch will perform better than a sequence of launches.

The same is true for an implied attempt to run kernels concurrently. I would normally choose to run a single kernel over breaking up the work into pieces and attempting to schedule multiple concurrent kernels.

The exception to this is when you are specifically trying to overlap copy and compute operations. Then it is sensible to break the work into pieces, and get the compute operation going on a piece of the overall data, while other pieces are being sent to the GPU (and results copied back) concurrently. This viewpoint does not necessarily depend on concurrent kernel execution, however.
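As a sketch of that copy/compute overlap pattern (hypothetical names throughout: `h_in`/`h_out` are pinned host buffers from `cudaMallocHost`, `d_buf` is a device buffer, `stream[]` holds 16 created streams, and `process` stands in for the real kernel):

```cuda
// Chunked pipeline: while chunk i computes, other chunks' copies in and
// out can proceed concurrently, because each chunk uses its own stream.
const int NCHUNKS = 16;
size_t chunkBytes = totalBytes / NCHUNKS;          // assume it divides evenly
for (int i = 0; i < NCHUNKS; i++) {
    size_t off = (size_t)i * chunkBytes;
    cudaMemcpyAsync(d_buf + off, h_in + off, chunkBytes,
                    cudaMemcpyHostToDevice, stream[i]);
    process<<<grid, block, 0, stream[i]>>>(d_buf + off, chunkBytes);
    cudaMemcpyAsync(h_out + off, d_buf + off, chunkBytes,
                    cudaMemcpyDeviceToHost, stream[i]);
}
cudaDeviceSynchronize();                           // wait for all chunks
```

Note that pinned (page-locked) host memory is required for the async copies to actually overlap with kernel execution.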

But the data processed in each stream in the first code is chained, and perhaps each stream runs in only one SMX. That would guarantee reuse of the data in cache.

In the second code, can I guarantee that kernel1, kernel2 and kernel3 launched with (1,X1/32,32), (1,X2/32,32) and (1,X3/32,32) will run in the same SMX? And (2,X1/32,32), (2,X2/32,32) and (2,X3/32,32) in another SMX, and so on?

Compute Capability of 3.5.

You can never guarantee which SM code will run on, under any circumstances, except on GPUs that have only one SM.

And on most cc 3.5 devices that I am aware of, the L1 cache is disabled by default for global loads, so only the L2 cache is in play, and the L2 cache is device-wide, not specific to an SM.
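If L1 caching of global loads matters here, on the cc 3.5 parts that support opting in (e.g. GK110B, which includes the K40) it can be requested at compile time. This is just a build-line sketch (`app.cu`/`app` are placeholder names):

```shell
# Ask ptxas to cache global loads in L1 as well as L2 (cc 3.5, GK110B)
nvcc -arch=sm_35 -Xptxas -dlcm=ca app.cu -o app
```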

The time of my application was reduced from 4.8 ms to 3.8 ms on a K40.

Thanks