I’ve tried to optimize my code without any great success, so I thought I would try to use concurrent kernels since I have a Fermi card. I’ve read about it in the programming guide but I still have some questions.
Do I have to use more than one stream to get concurrent kernel execution? How do I know how many streams to use? By trial and error?
If I use two streams, does that require twice the memory?
In my case I have a few kernels that are very lightweight; they take 0.5 - 10 ms each, but I want to run them about 100 000 times. Each kernel works on 176 thread blocks. My guess is that some of the multiprocessors go idle very quickly and then wait for the next kernel to launch. With concurrent kernel execution, might they stay occupied all the time? The kernels have to be launched in a specific order.
for (int i = 0; i < 100000; i++)
{
    // Copy constants from host memory to constant memory (depends on i, 320 bytes are copied)
    // Launch kernel A
    // Launch kernel B
    // Launch kernel C
    // Launch kernel D
    // Copy result to host (4 bytes are copied)
}
Not necessarily. Can you get terrible race conditions between kernels if they all read and write the same memory? Sure, so don't do that. But if you have 16 kernels reading from one 100 MB buffer, each doing something with a different set of parameters and writing 1 MB of output, that's 116 MB, not 1600 MB.
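To make the arithmetic concrete, here is a rough sketch of that layout, not a definitive implementation. The kernel `scaleKernel`, the helper `sharedFootprint`, and the launch configuration are all made up for illustration; only the sizes (one 100 MB input, 16 outputs of 1 MB) come from the example above. Each stream reads the same input buffer and writes to its own slice of the output, so nothing is duplicated. Whether the kernels actually overlap depends on the hardware having free resources; streams only make it possible.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: reads the shared input, writes only to its own output slice.
__global__ void scaleKernel(const float* in, float* out, size_t nOut, float param)
{
    size_t idx = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (idx < nOut)
        out[idx] = in[idx] * param;
}

// Hypothetical helper: device bytes needed when all kernels share one input buffer.
size_t sharedFootprint(size_t inBytes, size_t outBytes, int nKernels)
{
    return inBytes + outBytes * (size_t)nKernels;
}

int main()
{
    const int    nStreams = 16;
    const size_t inBytes  = 100u << 20;               // one shared 100 MB input
    const size_t outBytes = 1u << 20;                 // 1 MB of output per kernel
    const size_t nOut     = outBytes / sizeof(float);

    float* dIn  = nullptr;
    float* dOut = nullptr;
    if (cudaMalloc(&dIn, inBytes) != cudaSuccess ||
        cudaMalloc(&dOut, outBytes * nStreams) != cudaSuccess) {
        printf("allocation failed\n");
        return 1;
    }
    cudaMemset(dIn, 0, inBytes);                      // placeholder input data

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s)
        cudaStreamCreate(&streams[s]);

    const int threads = 256;
    const int blocks  = (int)((nOut + threads - 1) / threads);

    // Same input pointer for every launch; a different parameter and output slice each time.
    for (int s = 0; s < nStreams; ++s)
        scaleKernel<<<blocks, threads, 0, streams[s]>>>(
            dIn, dOut + s * nOut, nOut, (float)(s + 1));

    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s, device memory used: %zu MB\n",
           cudaGetErrorString(err),
           sharedFootprint(inBytes, outBytes, nStreams) >> 20);

    for (int s = 0; s < nStreams; ++s)
        cudaStreamDestroy(streams[s]);
    cudaFree(dOut);
    cudaFree(dIn);
    return 0;
}
```

The input is read-only in every kernel and the output slices don't overlap, which is what keeps this free of races between streams.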