In a talk by an NVIDIA employee, I heard that the overhead of launching a kernel from the device is roughly the same as the overhead of launching it from the host. The speaker went on to say that if kernels are launched from the device in a batch of, say, N, the overhead per kernel would be 1/N of the overhead incurred had that kernel been launched from the host. What is this 'batch' mode? I have reason to believe it does not refer to the case where the parent kernel launches one child kernel per thread.
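To make the arithmetic in that claim concrete, here is how I read it. The symbols $T_{\text{host}}$ and $T_{\text{dev}}$ and the 10 µs figure are my own placeholders for illustration, not numbers from the talk, and the formula assumes the whole batch pays roughly one launch overhead in total:

$$
T_{\text{dev}} \approx T_{\text{host}}, \qquad
\text{overhead per kernel in a batch of } N \;\approx\; \frac{T_{\text{host}}}{N}
$$

For example, if $T_{\text{host}}$ were 10 µs and $N = 100$, the claimed per-kernel overhead would be about $10\ \mu\text{s} / 100 = 0.1\ \mu\text{s}$.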