I have a kernel that I run in two different streams so that one kernel is executing while the host sets up the next launch (copying memory to and from the GPU). How can I measure the time a single kernel needs to execute in a system with more than one stream? If I use CUDA events to take the time, the results do not seem to be correct. I think they measure the time from launch until the stream's execution finishes. But that is not necessarily the time the kernel needs to execute, because it might first have to wait for the other stream to finish before it can start executing.
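For reference, this is roughly the event-based timing pattern I mean; the kernel name, launch configuration, buffers and the stream array are just placeholders:

```cpp
// Minimal sketch of the event timing I use (kernel/sizes are placeholders).
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Both events are recorded into the same stream as the kernel launch.
cudaEventRecord(start, stream[i]);
myKernel<<<grid, block, 0, stream[i]>>>(d_in[i], d_out[i], n);
cudaEventRecord(stop, stream[i]);

cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
// With a second stream active, ms appears to include time the kernel spends
// waiting for the other stream's work, not only its own execution time.
```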
If I use Nsight Compute/trace, I get perfect timings for each stream. Are these performance figures also available to my application?
What I need to do is this - maybe you have a much better idea of how to achieve it.
I have a pool of different kernels, and one of them will run in 2 streams on a GPU. In parallel (alternating), I need to run another kernel in a single stream that uses about 10% of the compute time the first/main kernel needs. For example:
Kernel A from the pool runs in 2 streams, and I need to run a special kernel Z that uses 10% of the performance of kernel A.
Since the kernels in the pool are all different, I cannot simply run a pool kernel 10 times and kernel Z once. I somehow need to run kernel Z just enough that the performance of kernel A drops by 10% compared to its performance without kernel Z running. The kernels do different things, so I cannot say "if kernel A gives me 1000 results, run kernel Z until it gives me 100 results". And since kernel A is running in 2 streams, I cannot just measure the time it takes, because the results are misleading. A sketch of the intended launch pattern follows below.
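To make the setup concrete, here is a simplified sketch of the launch pattern I have in mind; all kernel names, buffers, sizes and the shouldLaunchZ helper are placeholders, not my actual code:

```cpp
// Simplified sketch of the intended launch pattern (names/sizes are placeholders).
cudaStream_t streamA[2], streamZ;
cudaStreamCreate(&streamA[0]);
cudaStreamCreate(&streamA[1]);
cudaStreamCreate(&streamZ);

for (int i = 0; i < numBatches; ++i) {
    int s = i % 2;  // kernel A alternates between the two streams

    // Host sets up this batch (copy input) while the other stream's
    // kernel A is ideally still executing.
    cudaMemcpyAsync(d_inA[s], h_inA[i], bytesA, cudaMemcpyHostToDevice, streamA[s]);
    kernelA<<<gridA, blockA, 0, streamA[s]>>>(d_inA[s], d_outA[s]);
    cudaMemcpyAsync(h_outA[i], d_outA[s], bytesA, cudaMemcpyDeviceToHost, streamA[s]);

    // Kernel Z runs concurrently in its own stream from time to time.
    // The open question is how often to launch it so that kernel A only
    // loses about 10% of its throughput.
    if (shouldLaunchZ(i)) {
        kernelZ<<<gridZ, blockZ, 0, streamZ>>>(d_inZ, d_outZ);
    }
}
```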
I hope this sheds some light on why I am trying to measure the kernel times on a per-stream basis.