Will this improve performance?

Dear All

Currently the code is:

for(z5=0; z5 < 16;z5++)
kernel1<<<0,X1/32,32,stream(z5)>>>();

for(z5=0; z5 < 16;z5++)
kernel2<<<0,X2/32,32,stream(z5)>>>();

cudaDeviceSynchronize();

for(z5=0; z5 < 16;z5++)
kernel3<<<0,X3/32,32,stream(z5)>>>();

If instead I write this:

kernel1<<<16,X1/32,32,defaultstream>>>();

kernel2<<<16,X2/32,32,defaultstream>>>();

kernel3<<<16,X3/32,32,defaultstream>>>();

Which one gives better performance?
(The number of kernels is about 15.)

Thanks

Luis Gonçalves

This cannot be a valid kernel launch:

kernel1<<<0,X1/32,32,stream(z5)>>>();

0 is not a valid value in either the first (grid) or second (block) position of the launch configuration (i.e. <<<…>>>).
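For reference, a minimal sketch of what the first loop was presumably meant to be (assumptions: `stream` is an array of `cudaStream_t` rather than a function, `X1` is a multiple of 32, and a grid dimension of at least 1 is intended):

```cuda
// Hypothetical corrected version of the loop quoted above.
cudaStream_t stream[16];
for (int i = 0; i < 16; i++)
    cudaStreamCreate(&stream[i]);

for (int z5 = 0; z5 < 16; z5++)
    // <<<grid, block, sharedBytes, stream>>> -- grid must be >= 1
    kernel1<<<1, X1/32, 32, stream[z5]>>>();
```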

If you can efficiently do the work of multiple kernel launches in a single kernel launch (refactored), then I would expect in most cases that the single kernel launch will perform better than a sequence of launches.

The same is true for an implied attempt to run kernels concurrently. I would normally choose to run a single kernel over breaking up the work into pieces and attempting to schedule multiple concurrent kernels.

The exception to this is when you are specifically trying to overlap copy and compute operations. Then it is sensible to break the work into pieces, and get the compute operation going on a piece of the overall data, while other pieces are being sent to the GPU (and results copied back) concurrently. This viewpoint does not necessarily depend on concurrent kernel execution, however.
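As a sketch of that copy/compute overlap pattern (hypothetical names throughout: `h_in`/`h_out` are pinned host buffers from `cudaMallocHost`, `d_buf` is a device buffer, `stream[]` holds 16 created streams, and `process` stands in for the real kernel):

```cuda
// Chunked pipeline: while chunk i computes, other chunks' copies in and
// out can proceed concurrently, because each chunk uses its own stream.
const int NCHUNKS = 16;
size_t chunkBytes = totalBytes / NCHUNKS;          // assume it divides evenly
for (int i = 0; i < NCHUNKS; i++) {
    size_t off = (size_t)i * chunkBytes;
    cudaMemcpyAsync(d_buf + off, h_in + off, chunkBytes,
                    cudaMemcpyHostToDevice, stream[i]);
    process<<<grid, block, 0, stream[i]>>>(d_buf + off, chunkBytes);
    cudaMemcpyAsync(h_out + off, d_buf + off, chunkBytes,
                    cudaMemcpyDeviceToHost, stream[i]);
}
cudaDeviceSynchronize();                           // wait for all chunks
```

Note that pinned (page-locked) host memory is required for the async copies to actually overlap with kernel execution.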

But the data processed in each stream in the first code is chained, and perhaps each stream runs in only one SMX. That would guarantee reuse of the data in cache.

In the second code, can I guarantee that kernel1, kernel2 and kernel3 launched with (1,X1/32,32), (1,X2/32,32) and (1,X3/32,32) will run in the same SMX? And (2,X1/32,32), (2,X2/32,32) and (2,X3/32,32) in another SMX, and so on?

Compute Capability of 3.5.

You can never guarantee which SM code will run on, under any circumstances, except on GPUs that have only one SM.

And on most cc 3.5 devices that I am aware of, the L1 cache is disabled by default for global loads, so only the L2 cache is in play, and the L2 cache is device-wide, not specific to an SM.
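If L1 caching of global loads matters here, on the cc 3.5 parts that support opting in (e.g. GK110B, which includes the K40) it can be requested at compile time. This is just a build-line sketch (`app.cu`/`app` are placeholder names):

```shell
# Ask ptxas to cache global loads in L1 as well as L2 (cc 3.5, GK110B)
nvcc -arch=sm_35 -Xptxas -dlcm=ca app.cu -o app
```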

The time of my application was reduced from 4.8 ms to 3.8 ms on a K40.

Thanks