# How to choose the correct Loop Scheduling

Hi,
I have trouble understanding the loop scheduling policy adopted by the compiler.
I ran my tests on a Tesla M2050 GPU.
First example: a single loop that sums two vectors of length 512:

```fortran
do i = 1, 512
  c(i) = a(i) + b(i)
enddo
```

I think it is possible to use the directive
`!$acc do vector(512)`
to instruct the compiler on how to parallelize the loop. Is that correct?
And what happens if I have vectors of length 10,000, for example?

Second example, nested loops:
Suppose I have to sum two 2-D arrays of 512x512:
```fortran
do i = 1, 512
  do j = 1, 512
    c(i,j) = a(i,j) + b(i,j)
  enddo
enddo
```
The compiler chooses these two directives to parallelize:
`!$acc do parallel, vector(16)` (for the i-loop)
`!$acc do parallel, vector(16)` (for the j-loop)

Why does it choose the value of 16?
I suppose the GPU is not fully used this way; is that correct?
But I noticed that if I force the compiler to use different values by inserting explicit directives in the code, for example
`!$acc do parallel, vector(64)` (for the i-loop)
`!$acc do parallel, vector(64)` (for the j-loop)
I don't obtain better performance.

Hi Fedele.Stabile,

The compiler typically chooses a good schedule, but this is not guaranteed. Unfortunately, there is no better way to find the best schedule than trying the candidates. I typically spend an hour or two varying the schedule to see how it affects performance, though most of the time I can't beat the default.

> I think it is possible to use the directive
> `!$acc do vector(512)`
> to instruct the compiler on how to parallelize the loop. Is that correct?

You are just setting the block size (i.e., the number of CUDA threads per block). If you are not familiar with the CUDA threading model, Michael Wolfe has a great introductory article, *Understanding the CUDA Data Parallel Threading Model: A Primer*.
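As a rough illustration of the threading model (a plain-Python sketch, not the compiler's actual code generation; all names here are illustrative): `vector(512)` makes each thread block cover 512 consecutive iterations, and the global loop index is recovered exactly the way a CUDA kernel computes it.

```python
# Hypothetical sketch of how a 1-D loop maps onto CUDA blocks/threads
# when "vector(N)" sets the block size. Names are illustrative only.
def cuda_style_indices(n, vector):
    """Yield (block, thread, i) triples covering loop iterations 0..n-1."""
    num_blocks = (n + vector - 1) // vector      # ceiling division, like a grid launch
    for block in range(num_blocks):
        for thread in range(vector):
            i = block * vector + thread          # i = blockIdx.x*blockDim.x + threadIdx.x
            if i < n:                            # guard for a partially filled last block
                yield block, thread, i

# For a length-512 loop with vector(512), one block covers everything:
triples = list(cuda_style_indices(512, 512))
assert len(triples) == 512
assert triples[0] == (0, 0, 0) and triples[-1] == (0, 511, 511)
```

So with `vector(512)` on a 512-iteration loop, each loop iteration becomes one CUDA thread in a single block.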

> And what happens if I have vectors of length 10,000, for example?

There has to be at least one "parallel" clause; since you did not specify one, the compiler adds it. And since you are not limiting the number of blocks (i.e., the parallel dimension), the compiler simply creates more blocks for the longer loop.
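The arithmetic for your 10,000-element case looks like this (a sketch of the block count, not compiler output):

```python
import math

# 10,000 iterations with vector(512): the compiler just launches more blocks.
n, vector = 10_000, 512
num_blocks = math.ceil(n / vector)       # the implicit "parallel" dimension
assert num_blocks == 20                  # 20 blocks of 512 threads each
idle = num_blocks * vector - n
assert idle == 240                       # the last block is only partially used
```

The loop doesn't need to be a multiple of the vector length; the final block simply has some idle threads.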

> Why does it choose the value of 16?

It's the largest square dimension that fits on a Tesla card with compute capability 1.3. Newer cards could use 32x32, but other factors such as shared memory and register usage may still warrant a 16x16 block.
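A quick check of the arithmetic behind the default (assuming the per-block thread limits of 512 for compute capability 1.3 and 1024 for 2.0):

```python
def fits(dim, max_threads_per_block):
    """Does a dim x dim thread block fit under a per-block thread limit?"""
    return dim * dim <= max_threads_per_block

assert fits(16, 512)        # 16x16 = 256 threads: fine on CC 1.3 (Tesla)
assert not fits(32, 512)    # 32x32 = 1024 threads: too big for CC 1.3
assert fits(32, 1024)       # ...but fits on CC 2.0 (Fermi)
```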

> `!$acc do parallel, vector(64)` (for the i-loop)
> `!$acc do parallel, vector(64)` (for the j-loop)
> I don't obtain better performance.

Check the -Minfo=accel output. A 64x64 thread block is too large for your device, so the compiler is most likely ignoring your values and using the default. To see the maximum number of threads per block for your device, run the utility `pgaccelinfo`. On Fermi this maximum is 1024; on earlier Tesla (CC 1.x) cards it is 512.
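You can see why 64x64 has to be rejected (sketch of the arithmetic only; the real limit for your device comes from `pgaccelinfo`):

```python
# A 64x64 block implies 4096 threads per block, well beyond either limit.
dim = 64
threads = dim * dim
assert threads == 4096
assert threads > 1024    # exceeds even Fermi's 1024-thread per-block maximum
```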

- Mat