Separate kernel grids do not execute concurrently

Hi all,

I noticed some unexpected behavior in my CUDA application. I tried to execute multiple different kernel grids (each with 1 block) concurrently using streams. As I have a GeForce 8800 GT with 14 multiprocessors, I expected that at least 14 kernel grids would be able to execute concurrently. That does not seem to be the case.

So I ran a small experiment. I wrote a math-heavy kernel with almost no memory access, then launched several kernel grids, each in its own stream, to see whether the work would run in parallel. It does not: increasing the number of streams in steps from 1 to 16 increases the total execution time roughly linearly. There seems to be no stream concurrency at all. (I did not call any CUDA memcpy functions or issue any work on stream 0 during the timing experiment.) On the other hand, increasing the grid size of a single kernel from 1 to 14 blocks did not increase the execution time; only at a grid size of 15 did the execution time double. This indicates the expected concurrency across the 14 multiprocessors.
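The stream-scaling experiment described above can be sketched roughly as follows. This is a minimal illustration, not the attached template.cu: the kernel name `heavyMath`, the iteration count, and the block size are assumptions chosen for the example, and timing is done on the host around `cudaDeviceSynchronize` so that nothing is issued on stream 0.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical compute-bound kernel: a long arithmetic loop and a single store.
__global__ void heavyMath(float *out, int iters)
{
    float x = 1.0f + threadIdx.x;
    for (int i = 0; i < iters; ++i)
        x = x * 1.000001f + 0.5f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

int main()
{
    const int maxStreams = 16, threads = 64, iters = 1 << 20;  // assumed sizes

    float *d_out = nullptr;
    cudaMalloc(&d_out, maxStreams * threads * sizeof(float));

    cudaStream_t streams[maxStreams];
    for (int s = 0; s < maxStreams; ++s)
        cudaStreamCreate(&streams[s]);

    // Warm-up launch so one-time initialization is not included in the timings.
    heavyMath<<<1, threads, 0, streams[0]>>>(d_out, iters);
    cudaDeviceSynchronize();

    for (int n = 1; n <= maxStreams; ++n) {
        auto t0 = std::chrono::steady_clock::now();

        // One single-block grid per stream.
        for (int s = 0; s < n; ++s)
            heavyMath<<<1, threads, 0, streams[s]>>>(d_out + s * threads, iters);

        cudaDeviceSynchronize();  // wait for all streams to drain
        auto t1 = std::chrono::steady_clock::now();

        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        printf("%2d stream(s): %.3f ms\n", n, ms);
    }

    for (int s = 0; s < maxStreams; ++s)
        cudaStreamDestroy(streams[s]);
    cudaFree(d_out);
    return 0;
}
```

If the grids ran concurrently, the reported time would stay roughly flat as the number of streams grows; a roughly linear increase means they are being serialized.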

Further experiments with events recorded on separate streams did not clear things up. I recorded a start and an end event for each stream and used cudaEventQuery to repeatedly check the state of each stream after all kernels were launched. The results were, however, very confusing. No two kernels were ever between their start and end events at the same time. Furthermore, the start event of the first stream was triggered immediately; then for some time no events were recorded, until the start and end events of all streams seemed to be recorded almost at once. I of course double-checked that I did not use stream 0, and as far as I can see, I did not.
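The event-based check described above might look something like the sketch below. Again, this is an illustration under assumptions rather than the attached code: `heavyMath`, the stream count, and the `maxRunning` counter are hypothetical names. It records a begin and an end event around each kernel and polls them with `cudaEventQuery`, which returns `cudaSuccess` once an event has completed and `cudaErrorNotReady` otherwise.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical compute-bound kernel, used only to keep the GPU busy.
__global__ void heavyMath(float *out, int iters)
{
    float x = 1.0f + threadIdx.x;
    for (int i = 0; i < iters; ++i)
        x = x * 1.000001f + 0.5f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

int main()
{
    const int n = 4, threads = 64, iters = 1 << 20;  // assumed sizes

    float *d_out = nullptr;
    cudaMalloc(&d_out, n * threads * sizeof(float));

    cudaStream_t stream[n];
    cudaEvent_t begin[n], end[n];
    for (int s = 0; s < n; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaEventCreate(&begin[s]);
        cudaEventCreate(&end[s]);
    }

    // Bracket each kernel with events recorded in that kernel's own stream.
    for (int s = 0; s < n; ++s) {
        cudaEventRecord(begin[s], stream[s]);
        heavyMath<<<1, threads, 0, stream[s]>>>(d_out + s * threads, iters);
        cudaEventRecord(end[s], stream[s]);
    }

    // Poll until every end event has completed, tracking how many streams
    // were observed between their begin and end events at the same time.
    int maxRunning = 0;
    bool allDone = false;
    while (!allDone) {
        allDone = true;
        int running = 0;
        for (int s = 0; s < n; ++s) {
            bool started  = (cudaEventQuery(begin[s]) == cudaSuccess);
            bool finished = (cudaEventQuery(end[s]) == cudaSuccess);
            if (started && !finished) ++running;
            if (!finished) allDone = false;
        }
        if (running > maxRunning) maxRunning = running;
    }
    printf("max streams observed in flight at once: %d\n", maxRunning);

    for (int s = 0; s < n; ++s) {
        cudaEventDestroy(begin[s]);
        cudaEventDestroy(end[s]);
        cudaStreamDestroy(stream[s]);
    }
    cudaFree(d_out);
    return 0;
}
```

Note that the two queries for a stream are not atomic, and the host only samples the event state, so the counter is an observation rather than a precise measurement of overlap.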

My questions are:

  1. Why do my separate kernel grids not run concurrently when launched in separate streams? Is this normal?
  2. Is cudaEventQuery accurate when used on different streams?

For those interested, I included my experiment source code. It is not very long, about 130 lines.

Thanks for your time,
Dietger

template.cu (3.63 KB)

On current-generation cards, only one grid/kernel can be active at any one time, so the behaviour you’re seeing is expected. With streams, you can overlap the execution of one kernel with an asynchronous (DMA) memory copy; that’s all. This is supposed to be changing in Fermi.
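As a rough sketch of the kind of overlap that streams do allow, the example below issues a kernel in one stream and a host-to-device `cudaMemcpyAsync` in another. The kernel `heavyMath`, the stream names, and the buffer sizes are assumptions made for illustration. The host buffer is allocated with `cudaMallocHost` because asynchronous copies need page-locked memory to overlap with execution, and whether any overlap actually occurs depends on the device (reported here through `cudaDevAttrAsyncEngineCount`).

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical compute-bound kernel that does not touch the copied buffer.
__global__ void heavyMath(float *out, int iters)
{
    float x = 1.0f + threadIdx.x;
    for (int i = 0; i < iters; ++i)
        x = x * 1.000001f + 0.5f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

int main()
{
    // A value of 0 means the device cannot overlap copies with kernels.
    int engines = 0;
    cudaDeviceGetAttribute(&engines, cudaDevAttrAsyncEngineCount, 0);
    printf("async copy engines: %d\n", engines);

    const size_t count = 1 << 22;          // assumed transfer size (floats)
    const int threads = 64, iters = 1 << 20;

    float *h_buf = nullptr, *d_buf = nullptr, *d_out = nullptr;
    cudaMallocHost(&h_buf, count * sizeof(float));  // page-locked host memory
    memset(h_buf, 0, count * sizeof(float));
    cudaMalloc(&d_buf, count * sizeof(float));
    cudaMalloc(&d_out, threads * sizeof(float));

    cudaStream_t computeStream, copyStream;
    cudaStreamCreate(&computeStream);
    cudaStreamCreate(&copyStream);

    // The kernel and the copy are in different streams and touch different
    // buffers, so the hardware is free to run them at the same time.
    heavyMath<<<1, threads, 0, computeStream>>>(d_out, iters);
    cudaMemcpyAsync(d_buf, h_buf, count * sizeof(float),
                    cudaMemcpyHostToDevice, copyStream);

    cudaDeviceSynchronize();  // wait for both the kernel and the copy

    cudaStreamDestroy(computeStream);
    cudaStreamDestroy(copyStream);
    cudaFree(d_out);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```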