My thesis is about parallelizing matching algorithms on GPUs, and I'm using CUDA for that. I have a question: when I launch a kernel with 1 block and 2 threads, and then launch the same kernel again with 2 blocks of 1 thread each, why is the execution time better in the second case? Is it because CUDA launches the two blocks simultaneously on two cores of the SM (Streaming Multiprocessor), or is there a different reason? With other configurations, e.g. b:1 t:4 versus b:4 t:1, it's the same thing: it's better with more blocks!
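For context, the comparison I'm timing looks roughly like this (a minimal sketch, not my actual matching code; the kernel name `match_kernel` and its dummy per-thread workload are just placeholders):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: each thread does a fixed amount of dummy work.
__global__ void match_kernel(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = 0.0f;
        for (int k = 0; k < 1000; ++k)
            v += sinf(v + i);
        out[i] = v;
    }
}

// Time one launch configuration with CUDA events.
static float time_launch(int blocks, int threads, float *d_out, int n) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    match_kernel<<<blocks, threads>>>(d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int n = 4;
    float *d_out;
    cudaMalloc(&d_out, n * sizeof(float));
    printf("b:1 t:2 -> %f ms\n", time_launch(1, 2, d_out, n));
    printf("b:2 t:1 -> %f ms\n", time_launch(2, 1, d_out, n));
    cudaFree(d_out);
    return 0;
}
```

In both configurations the same total number of threads runs; only the block/thread split differs.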
Other questions: 1. When I launch a kernel with only 16 threads, will SIMT (Single Instruction, Multiple Threads) create a warp of 16, or a warp of 32 in which only 16 threads execute? 2. When I launch more than 512 threads, e.g. 2 blocks with 512 threads each, on a device with only one SM, and that SM can only execute 768 threads, what happens to the others? Are they put into a waiting queue, and do they run once the resources needed for their execution become available? If so, why do I get better results with 1024 threads than with 768? Is it because there is no context-switching overhead?
Thanks a lot, guys! I'd really appreciate it if anyone could help me clarify this! Sorry for my English.