Separate kernel grids do not execute concurrently

Dietepiet · December 18, 2009, 2:00pm

Hi all,

I noticed some unexpected behavior in my CUDA application. I tried to execute multiple different kernel grids (each with 1 block) concurrently using streams. Since I have a GeForce 8800 GT with 14 multiprocessors, I expected that at least 14 kernel grids would be able to execute concurrently, but that does not seem to be the case.

So I ran a small experiment. I wrote a math-heavy kernel with almost no memory access, then launched several kernel grids, each in its own stream, to see whether execution would scale. Apparently it does not: increasing the number of streams in steps from 1 to 16 increases the total execution time roughly linearly, so there seems to be no stream concurrency at all. (I did not call any CUDA memcpy functions, or any functions on stream 0, during the timing experiment.) On the other hand, increasing the grid size from 1 to 14 blocks did not increase the execution time; only at a grid size of 15 did the execution time double. This indicates the expected concurrency across the 14 multiprocessors.
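For readers without the attachment, here is a minimal sketch of the kind of experiment described above. It is not the code from template.cu: the kernel `heavyMath`, the helper `timeLaunch`, the block size of 64 threads, and the host-clock timing are all assumptions made for illustration. Timing uses a host clock and `cudaDeviceSynchronize` so that nothing is issued to stream 0 inside the timed region.

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <vector>

// Hypothetical compute-bound kernel: a long arithmetic loop and one store.
__global__ void heavyMath(float *out, int iters)
{
    float x = 1.0f + threadIdx.x;
    for (int i = 0; i < iters; ++i)
        x = x * 0.999f + 0.5f;          // value stays bounded
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

// Launch one grid of `blocksPerGrid` blocks in each of `numStreams` streams
// and return the wall-clock time in milliseconds for all of them to finish.
double timeLaunch(int numStreams, int blocksPerGrid, int iters)
{
    const int threads = 64;             // assumed block size
    std::vector<cudaStream_t> streams(numStreams);
    std::vector<float *> bufs(numStreams);
    for (int s = 0; s < numStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&bufs[s], blocksPerGrid * threads * sizeof(float));
    }
    cudaDeviceSynchronize();            // make sure setup is finished

    auto t0 = std::chrono::steady_clock::now();
    for (int s = 0; s < numStreams; ++s)
        heavyMath<<<blocksPerGrid, threads, 0, streams[s]>>>(bufs[s], iters);
    cudaDeviceSynchronize();            // wait for every stream
    auto t1 = std::chrono::steady_clock::now();

    for (int s = 0; s < numStreams; ++s) {
        cudaStreamDestroy(streams[s]);
        cudaFree(bufs[s]);
    }
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Comparing `timeLaunch(1, 1, n)`, `timeLaunch(16, 1, n)`, and `timeLaunch(1, 14, n)` for some large `n` reproduces the two scaling tests: many single-block grids in separate streams versus one grid with many blocks.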

Further experiments with recorded events on separate streams did not clear things up. I recorded a start and an end event for each stream and used cudaEventQuery to repeatedly check the current state of each stream after all kernels were launched. The results were, however, very confusing. No two kernels were ever between their start and end events at the same time. Furthermore, the start event of the first stream was triggered immediately, then for some time no events were recorded, until the start and end events of all streams seemed to be recorded almost instantly. I of course double-checked that I did not use stream 0, and as far as I can see, I did not.
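The event-polling test can be sketched roughly as follows. Again this is an illustration and not the attached source: the kernel `spin` and the helper `countOverlapSnapshots` are hypothetical names, and the result only reflects what the host observes each time it polls.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical busy kernel, used only to keep a stream occupied for a while.
__global__ void spin(float *out, int iters)
{
    float x = 1.0f + threadIdx.x;
    for (int i = 0; i < iters; ++i)
        x = x * 0.999f + 0.5f;
    out[threadIdx.x] = x;
}

// Bracket one kernel per stream with a start and an end event, then poll.
// Returns how many polling passes saw two or more streams whose start event
// had completed while their end event had not.
int countOverlapSnapshots(int numStreams, int iters)
{
    std::vector<cudaStream_t> streams(numStreams);
    std::vector<cudaEvent_t> startEv(numStreams), endEv(numStreams);
    std::vector<float *> bufs(numStreams);
    for (int s = 0; s < numStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaEventCreate(&startEv[s]);
        cudaEventCreate(&endEv[s]);
        cudaMalloc(&bufs[s], 64 * sizeof(float));
    }
    for (int s = 0; s < numStreams; ++s) {
        cudaEventRecord(startEv[s], streams[s]);
        spin<<<1, 64, 0, streams[s]>>>(bufs[s], iters);
        cudaEventRecord(endEv[s], streams[s]);
    }

    int overlaps = 0;
    bool allDone = false;
    while (!allDone) {
        allDone = true;
        int running = 0;
        for (int s = 0; s < numStreams; ++s) {
            // cudaEventQuery returns cudaSuccess once the event has completed
            // and cudaErrorNotReady while it is still pending.
            bool started = cudaEventQuery(startEv[s]) == cudaSuccess;
            bool ended   = cudaEventQuery(endEv[s]) == cudaSuccess;
            if (started && !ended) ++running;
            if (!ended) allDone = false;
        }
        if (running >= 2) ++overlaps;
    }

    for (int s = 0; s < numStreams; ++s) {
        cudaEventDestroy(startEv[s]);
        cudaEventDestroy(endEv[s]);
        cudaStreamDestroy(streams[s]);
        cudaFree(bufs[s]);
    }
    return overlaps;
}
```

A return value of 0 matches the observation that no two kernels were ever seen between their start and end events at the same time.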

My questions are:

Why do my separate kernel grids not run concurrently when launched in separate streams? Is this normal?
Is cudaEventQuery accurate when used on different streams?

For those interested, I included my experiment source code. It is not very long, about 130 lines.

Thanks for your time,
Dietger

template.cu (3.63 KB)

YDD · December 18, 2009, 3:12pm

In current-generation cards, only one grid/kernel can be active at any one time, so the behaviour you’re seeing is expected. With streams, you can overlap the execution of one kernel with a DMA copy to or from other memory; that’s all. This is supposed to be changing in Fermi.
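As a rough sketch of the one overlap that streams do allow here, the code below issues a kernel in one stream and an asynchronous host-to-device copy in another, and prints what the device reports about itself. The names `scale`, `printOverlapCaps`, and `kernelPlusCopy` are made up for this example; the `concurrentKernels` and `asyncEngineCount` fields of `cudaDeviceProp` come from CUDA toolkits newer than this thread. Whether the copy and the kernel actually overlap depends on the hardware, and the sketch only issues the work without measuring the overlap.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel that just gives the GPU something to do.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Print whether the device reports concurrent-kernel and copy-engine support.
void printOverlapCaps(int device)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    printf("%s: concurrentKernels=%d asyncEngineCount=%d\n",
           prop.name, prop.concurrentKernels, prop.asyncEngineCount);
}

// Issue a kernel in one stream and a host-to-device copy in another.
// Returns true if all the work completed without error.
bool kernelPlusCopy(int n)
{
    size_t bytes = n * sizeof(float);
    float *hostBuf, *devA, *devB;
    cudaMallocHost(&hostBuf, bytes);    // pinned memory, needed for async copies
    cudaMalloc(&devA, bytes);
    cudaMalloc(&devB, bytes);
    cudaMemset(devA, 0, bytes);         // setup only, before the overlapped part
    for (int i = 0; i < n; ++i) hostBuf[i] = 1.0f;

    cudaStream_t computeStream, copyStream;
    cudaStreamCreate(&computeStream);
    cudaStreamCreate(&copyStream);

    scale<<<(n + 255) / 256, 256, 0, computeStream>>>(devA, n);
    cudaMemcpyAsync(devB, hostBuf, bytes, cudaMemcpyHostToDevice, copyStream);
    cudaError_t err = cudaDeviceSynchronize();

    cudaStreamDestroy(computeStream);
    cudaStreamDestroy(copyStream);
    cudaFree(devA);
    cudaFree(devB);
    cudaFreeHost(hostBuf);
    return err == cudaSuccess;
}
```

Calling `printOverlapCaps(0)` before `kernelPlusCopy(1 << 20)` shows at a glance whether the device claims to support anything beyond copy/kernel overlap.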
