How does the GPU distribute machine code instructions to all of its threads? Every thread has to know what to execute next, so what feeds it these instructions? How and when do instructions get sent from RAM to the GPU, and where are they stored once they arrive?
In general, I would like to understand how a machine instruction goes from a single chunk of memory to being executed by multiple warps, each containing multiple threads. I want this understanding in order to answer questions like:
- Does every thread get its own copy of the constant values embedded in instructions?
- Or does every warp have a single program counter that drives all of its threads?
- Is there significant per-instruction overhead for distributing the machine code?
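For concreteness, here is a minimal (hypothetical) kernel that illustrates what I'm asking. The immediate constant `3.0f` presumably ends up encoded somewhere in the compiled machine code:

```cuda
// Toy kernel for illustration only. The question is about the
// immediate constant 3.0f below: once compiled to machine code,
// does each of a warp's threads fetch its own copy of that
// instruction, or does one per-warp program counter drive all
// the lanes through the same instruction stream?
__global__ void scale(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 3.0f;  // constant embedded in the instruction
}
```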
I have tried looking up resources, but most fall short of actually explaining the instruction-distribution process. If anyone has a good link or resource, please share! Note that, while interesting, my main focus is not the compilation of CUDA code but rather what the GPU does with the final result of compilation (the machine code).
Thanks for your time!