If I dispatch several tasks to the GPU together, there seems to be context-switch overhead among them, because, as I understand it, the GPU uses a round-robin scheduler by default. Would it be better to dispatch the tasks one at a time instead? I suppose this depends on many factors, such as each task's memory requirements. However, in my simple experiments, dispatching tasks individually did not improve the overall finishing time. Does anyone have an idea of what is happening here and how to avoid this overhead?
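To make the comparison concrete, here is a minimal sketch of the kind of timing harness I mean. It is not my exact experiment: `placeholder_task`, `run_together`, and `run_individually` are hypothetical names, and the task body is a CPU stand-in that would need to be replaced by a real GPU launch followed by a device synchronization before the numbers mean anything.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def placeholder_task(n):
    # Hypothetical stand-in for one GPU task (assumption, not real GPU work).
    # Replace with a kernel launch plus a device synchronization.
    return sum(i * i for i in range(n))


def run_together(sizes):
    # Dispatch every task at once, then wait for all of them to finish.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(sizes)) as pool:
        results = list(pool.map(placeholder_task, sizes))
    return results, time.perf_counter() - start


def run_individually(sizes):
    # Dispatch one task, wait for it to finish, then dispatch the next.
    start = time.perf_counter()
    results = [placeholder_task(n) for n in sizes]
    return results, time.perf_counter() - start


if __name__ == "__main__":
    sizes = [200_000] * 4  # four equally sized tasks (arbitrary choice)
    _, t_together = run_together(sizes)
    _, t_individual = run_individually(sizes)
    print(f"together:     {t_together:.4f} s")
    print(f"individually: {t_individual:.4f} s")
```

Both functions measure the wall-clock time from the first dispatch to the last completion, which is the "overall finishing time" I am comparing. With the CPU placeholder the two timings say nothing about GPU scheduling; the sketch only shows the structure of the measurement.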
Any thoughts or ideas are appreciated!