I have the following questions which struck me while trying to optimize my CUDA code:
If the number of threads in a block is an exact multiple of the warp size, but the number of threads performing the task (set of operations) is less than a warp size (and the rest are idle, with no work in the "else" branch), I believe execution within that warp is essentially serialized. But what is the impact if the number of threads in a block is not an exact multiple of the warp size (32)? Does that also lead to serial execution of all the threads?
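To make the scenario concrete, this is the kind of branch I mean; a minimal sketch where the kernel name, the scaling operation, and the bound n are made up for illustration:

```cuda
__global__ void partial_work(float *data, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Only threads with tid < n do work. Within a single warp, the two
    // sides of a divergent branch are executed one after the other,
    // with the inactive lanes masked off for each side.
    if (tid < n) {
        data[tid] *= 2.0f;   // active lanes
    }
    // else: the idle lanes have nothing to execute and simply wait
    // until the branch re-converges
}
```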
Streams of GPU kernels overlap execution in the sense that while one stream is executing a kernel, the next can be copying data to global memory. What happens if there is not enough memory left for the second stream to transfer its data (since the first one is also occupying some memory)?
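As context for this question, the kind of free-memory check I would do before issuing the second stream's copy looks like this; a sketch using the CUDA runtime's cudaMemGetInfo, with a made-up buffer size:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t free_bytes = 0, total_bytes = 0;

    // Query how much device memory is currently unallocated.
    cudaMemGetInfo(&free_bytes, &total_bytes);

    // Hypothetical buffer the second stream would need.
    size_t needed = 64u * 1024u * 1024u;

    if (free_bytes < needed) {
        // Not enough room: possible fallbacks are waiting until the
        // first stream's buffers are freed, or copying in smaller chunks.
        printf("only %zu bytes free, need %zu\n", free_bytes, needed);
    }
    return 0;
}
```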
I believe that global memory resides in the SDRAM on the GPU card, and the SDRAM allows read/write of only one location at a time (based on the address-bus signal) by the GPU and CPU. So, if overlap by multiple streams called for simultaneous access to the on-card SDRAM by the CPU and the GPU, would one of them actually stall? If one stalls, how does this work in a situation where one stream is memcpy-ing data onto the global memory while another is performing computations that access global memory? If there is frequent access to global memory, will the CPU slow down or the GPU due to interruptions (if simultaneous access is not allowed), i.e., which one gets preference?
Is it possible to have different kernels executed by different streams? If not, should I use asynchronous memcpys to overlap the memcpys and the execution of different kernels? Or is there another way?
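To show the overlap pattern I have in mind: two streams, each doing its own cudaMemcpyAsync from pinned host memory followed by a different kernel launch. kernelA, kernelB, and the buffer size n are placeholders, not my real code:

```cuda
#include <cuda_runtime.h>

__global__ void kernelA(float *d, int n) { /* ... */ }
__global__ void kernelB(float *d, int n) { /* ... */ }

int main()
{
    const int n = 1 << 20;
    float *h0, *h1, *d0, *d1;

    // Async copies require page-locked (pinned) host memory.
    cudaMallocHost(&h0, n * sizeof(float));
    cudaMallocHost(&h1, n * sizeof(float));
    cudaMalloc(&d0, n * sizeof(float));
    cudaMalloc(&d1, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Each stream does copy -> kernel. The copy in s1 can overlap
    // with the kernel running in s0 (hardware permitting).
    cudaMemcpyAsync(d0, h0, n * sizeof(float), cudaMemcpyHostToDevice, s0);
    kernelA<<<n / 256, 256, 0, s0>>>(d0, n);

    cudaMemcpyAsync(d1, h1, n * sizeof(float), cudaMemcpyHostToDevice, s1);
    kernelB<<<n / 256, 256, 0, s1>>>(d1, n);

    cudaDeviceSynchronize();

    cudaStreamDestroy(s0); cudaStreamDestroy(s1);
    cudaFree(d0); cudaFree(d1);
    cudaFreeHost(h0); cudaFreeHost(h1);
    return 0;
}
```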
I would be thankful if someone could reply.
Thanks & regards,