The CUDA profiler tells me that the average kernel launch latency on an 8800 GTX is around 15 µs.
I wonder whether the streaming feature on devices with compute capability 1.1 would help reduce the launch latency.
My intuition is that while the GPU is busy computing something, the DMA controller can transfer information from CPU memory to GPU memory (or vice versa). That information could include not only the data used for the calculation, but also the binary program code needed to execute the next kernel.
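To make the idea concrete, here is a rough sketch of the kind of overlap I have in mind, written against the CUDA runtime stream API. The kernel `scale`, the buffer names, and the sizes are made up for illustration, and error checking is left out. It only shows a data copy in one stream being queued while a kernel in another stream may still be running. Whether the runtime also uploads the kernel's binary code in the background like this is exactly the part I am unsure about.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, used only to give the GPU some work to do.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Asynchronous copies need page-locked (pinned) host memory.
    float *h_a, *h_b;
    cudaMallocHost(&h_a, bytes);
    cudaMallocHost(&h_b, bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    const int block = 256;
    const int grid = (n + block - 1) / block;

    // Stream 0: upload the first buffer, then run the kernel on it.
    cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s0);
    scale<<<grid, block, 0, s0>>>(d_a, n, 3.0f);

    // Stream 1: this upload is queued right away and, on hardware that
    // supports copy/compute overlap, may proceed while the kernel in
    // stream 0 is still executing.
    cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, s1);
    scale<<<grid, block, 0, s1>>>(d_b, n, 3.0f);

    // Download the results in the same streams that produced them.
    cudaMemcpyAsync(h_a, d_a, bytes, cudaMemcpyDeviceToHost, s0);
    cudaMemcpyAsync(h_b, d_b, bytes, cudaMemcpyDeviceToHost, s1);

    // Wait for both streams before touching the host buffers.
    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);

    printf("h_a[0] = %f, h_b[0] = %f\n", h_a[0], h_b[0]);

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFreeHost(h_a);
    cudaFreeHost(h_b);
    return 0;
}
```

The second kernel launch is issued by the host without waiting for the first stream to finish, so in principle its launch overhead could be hidden behind the work already running.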
Is my guess right?
Or does anyone have experience with this?
Thank you very much!