Suppose that I have to execute the module 100 times with different data.
In this situation, which of the following three cases is fastest, and why?
First, processing with a single CUDA kernel with 100 threads.
Second, processing with n CUDA kernels of m threads each in n CUDA streams, where n * m = 100.
Third, processing with n CUDA kernels of m threads each, launched from n CPU cores, where n * m = 100.
Thanks in advance.
Only 100 threads? That is a very small count even for a single CUDA kernel. A GPU can execute hundreds of thousands of threads in parallel. Maybe you need to revise your algorithm…
I made up an example to explain the situation.
OK, suppose that the number of threads is 1000000.
I want to know the principle.
Without any details on what you are trying to do, multiple streams of kernels, each working on an independent set of data, will give the best performance.
Would you tell me what the reason is?
For efficiently written code, it’s generally better, from the kernel perspective, to launch a single kernel rather than n kernels. There are overheads associated with each kernel launch, and there may also be inefficiencies in separating the data.
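To see the launch overhead for yourself, here is a minimal sketch that times the same work done as one launch or as many smaller launches. The kernel `scale` and the helper `time_launches` are made-up names for illustration, and the per-item work is just a placeholder for whatever your module does.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical per-item work standing in for "the module".
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Processes n items as `launches` equal-sized kernel launches in the default
// stream and returns the elapsed time in milliseconds, or -1 on a CUDA error.
// launches == 1 corresponds to the single-kernel case.
float time_launches(int n, int launches) {
    assert(launches > 0 && n % launches == 0);
    float* d = nullptr;
    if (cudaMalloc((void**)&d, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d, 0, n * sizeof(float));

    const int chunk = n / launches;
    const int threads = 256;
    const int blocks = (chunk + threads - 1) / threads;

    // Warm-up launch so one-time initialization is not counted in the timing.
    scale<<<blocks, threads>>>(d, chunk, 1.0f);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int k = 0; k < launches; ++k) {
        scale<<<blocks, threads>>>(d + k * chunk, chunk, 2.0f);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    const bool ok = (cudaGetLastError() == cudaSuccess);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return ok ? ms : -1.0f;
}
```

Calling `time_launches(1 << 20, 1)` and then `time_launches(1 << 20, 1024)` from your own `main` compares the two cases. I have not quoted any timings because they depend on the GPU and driver.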
However, the picture changes if we consider data transfer as well. If the problem operates on separable data, and the data transfer time is significant and can be hidden by overlapping copy and compute, then it is better to break the work into several kernel launches, each operating on a portion of the data. “It is better” means it may have a faster execution time if there is significant overlap of copy and compute.
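A rough sketch of that chunked pattern is below, assuming the data splits evenly into independent chunks. The names `add_one` and `run_chunked` are invented for this example. Each chunk gets its own stream, so the copy for one chunk can overlap with the kernel for another where the hardware allows it.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical per-item work standing in for "the module".
__global__ void add_one(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Splits n floats into `chunks` equal pieces. Each piece is copied to the
// device, processed, and copied back in its own stream.
// Returns true if every element came back as its input value plus one.
bool run_chunked(int n, int chunks) {
    assert(chunks > 0 && n % chunks == 0);
    const int chunk = n / chunks;
    const size_t chunk_bytes = chunk * sizeof(float);

    float* h = nullptr;
    float* d = nullptr;
    // Page-locked (pinned) host memory is needed for the async copies
    // to overlap with kernel execution.
    if (cudaMallocHost((void**)&h, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc((void**)&d, n * sizeof(float)) != cudaSuccess) {
        cudaFreeHost(h);
        return false;
    }
    for (int i = 0; i < n; ++i) h[i] = static_cast<float>(i);

    std::vector<cudaStream_t> streams(chunks);
    for (auto& s : streams) cudaStreamCreate(&s);

    const int threads = 256;
    const int blocks = (chunk + threads - 1) / threads;
    for (int k = 0; k < chunks; ++k) {
        float* hp = h + k * chunk;
        float* dp = d + k * chunk;
        // Operations within one stream run in order; different streams may overlap.
        cudaMemcpyAsync(dp, hp, chunk_bytes, cudaMemcpyHostToDevice, streams[k]);
        add_one<<<blocks, threads, 0, streams[k]>>>(dp, chunk);
        cudaMemcpyAsync(hp, dp, chunk_bytes, cudaMemcpyDeviceToHost, streams[k]);
    }

    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) {
        ok = (h[i] == static_cast<float>(i) + 1.0f);
    }

    for (auto& s : streams) cudaStreamDestroy(s);
    cudaFree(d);
    cudaFreeHost(h);
    return ok;
}
```

For example, `run_chunked(1 << 20, 4)` processes about a million items in four streams. Whether this beats a single copy, kernel, copy sequence depends on how large the transfer time is relative to the compute time, so it is worth checking with a profiler on your own data.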
All of the above assumes a large problem size. 100 threads is not useful or sensible from a CUDA perspective.