Consider a single parallel loop that contains no nested loops, where n is large enough to fill the entire GPU with work:
    !$acc kernels loop
    do i = 1, n
       ! calculations
    end do
and suppose the compiler feedback reports something like:
!$acc loop gang, vector(128)
So what does this mean exactly? Will this ensure that each core on the GPU has a loop iteration to work on? Is it possible that only one multiprocessor is being used, or that each multiprocessor isn’t being entirely filled with work? Why doesn’t it tell me how many gangs are being used?
My guess is that the n iterations are divided among the gangs, each of which corresponds to a thread block. The number of thread blocks (gangs) we get depends on the number of multiprocessors the GPU has. The vector(128) specifies that each thread block contains 128 threads.
So, upon execution, each multiprocessor executes 128 threads in parallel, where each thread corresponds to an iteration of the loop. How accurate is this?
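To make my guess concrete, here is a minimal sketch of the mapping I have in mind. It assumes one loop iteration per thread, so the gang count is simply n divided by the vector length, rounded up; I don't know whether that is what the compiler actually does. The module name, `gangs_needed`, `double_values`, and the loop body are all made up for illustration, and the explicit `gang vector vector_length(128)` clauses just spell out the schedule the compiler reported instead of leaving it implicit.

```fortran
module gang_vector_sketch
  implicit none
  ! Vector length taken from the compiler message "vector(128)".
  integer, parameter :: vlen = 128
contains

  ! Hypothetical helper: number of gangs needed IF every thread handles
  ! exactly one iteration. This is my assumption, not documented behavior.
  pure integer function gangs_needed(n) result(g)
    integer, intent(in) :: n
    g = (n + vlen - 1) / vlen          ! integer ceiling of n / vlen
  end function gangs_needed

  ! Same loop shape as in the question, with the schedule written out
  ! explicitly. Without OpenACC enabled the directive is just a comment
  ! and the loop runs serially.
  subroutine double_values(n, x, y)
    integer, intent(in) :: n
    real, intent(in)    :: x(n)
    real, intent(out)   :: y(n)
    integer :: i
    !$acc parallel loop gang vector vector_length(128) copyin(x) copyout(y)
    do i = 1, n
      y(i) = 2.0 * x(i)                ! stand-in for "calculations"
    end do
  end subroutine double_values

end module gang_vector_sketch
```

Under that assumption, n = 1000 would need 8 gangs of 128 threads, leaving 24 threads in the last gang with nothing to do. Whether the gangs then get spread evenly over the multiprocessors is exactly the part I am unsure about.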