# How to choose the correct Loop Scheduling

Hi,
I have trouble understanding the loop scheduling policy adopted by the compiler.
I ran my tests on a Tesla M2050 GPU.
First example: a single loop that sums two vectors of length 512:

```fortran
do i = 1, 512
  c(i) = a(i) + b(i)
enddo
```

I think it is possible to use the directive
`!$acc do vector(512)`
to instruct the compiler on how to parallelize the loop. Is that correct?
And what happens if I have vectors of length 10,000, for example?

Second example, nested loops:
Suppose I have to sum two 2-D arrays of 512x512:
```fortran
do i = 1, 512
  do j = 1, 512
    c(i,j) = a(i,j) + b(i,j)
  enddo
enddo
```
The compiler chooses these two directives to parallelize:
`!$acc do parallel, vector(16)` (for the i-loop)
`!$acc do parallel, vector(16)` (for the j-loop)

Why does it choose the value of 16?
I suppose the GPU is not fully used this way; is that correct?
But I noticed that if I force the compiler to use different values by inserting explicit directives in the code, for example
`!$acc do parallel, vector(64)` (for the i-loop)
`!$acc do parallel, vector(64)` (for the j-loop)
I don't obtain better performance.

Hi Fedele.Stabile,

The compiler typically chooses a good schedule, but this is not guaranteed. Unfortunately, there is no better way to find the best schedule than trying the candidates. I typically spend an hour or two varying the schedule to see how it affects performance, though most of the time I can't beat the default.

> I think it is possible to use the directive
> `!$acc do vector(512)`
> to instruct the compiler on how to parallelize the loop. Is that correct?

You are just setting the block size (i.e., the number of CUDA threads per block). If you are not familiar with the CUDA threading model, Michael Wolfe has a great introductory article, *Understanding the CUDA Data Parallel Threading Model: A Primer*.
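As a rough illustration of the threading model (a plain-Python sketch, not the compiler's actual code generation; all names here are illustrative): `vector(512)` makes each thread block cover 512 consecutive iterations, and the global loop index is recovered exactly the way a CUDA kernel computes it.

```python
# Hypothetical sketch of how a 1-D loop maps onto CUDA blocks/threads
# when "vector(N)" sets the block size. Names are illustrative only.
def cuda_style_indices(n, vector):
    """Yield (block, thread, i) triples covering loop iterations 0..n-1."""
    num_blocks = (n + vector - 1) // vector      # ceiling division, like a grid launch
    for block in range(num_blocks):
        for thread in range(vector):
            i = block * vector + thread          # i = blockIdx.x*blockDim.x + threadIdx.x
            if i < n:                            # guard for a partially filled last block
                yield block, thread, i

# For a length-512 loop with vector(512), one block covers everything:
triples = list(cuda_style_indices(512, 512))
assert len(triples) == 512
assert triples[0] == (0, 0, 0) and triples[-1] == (0, 511, 511)
```

So with `vector(512)` on a 512-iteration loop, each loop iteration becomes one CUDA thread in a single block.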

> And what happens if I have vectors of length 10,000, for example?

There has to be at least one "parallel" clause; since you did not specify one, the compiler adds it. And since you are not limiting the number of blocks (i.e., the parallel dimension), the compiler simply creates more blocks for the longer loop.
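The arithmetic for your 10,000-element case looks like this (a sketch of the block count, not compiler output):

```python
import math

# 10,000 iterations with vector(512): the compiler just launches more blocks.
n, vector = 10_000, 512
num_blocks = math.ceil(n / vector)       # the implicit "parallel" dimension
assert num_blocks == 20                  # 20 blocks of 512 threads each
idle = num_blocks * vector - n
assert idle == 240                       # the last block is only partially used
```

The loop doesn't need to be a multiple of the vector length; the final block simply has some idle threads.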

> Why does it choose the value of 16?

It's the largest square dimension that fits on a Tesla card with compute capability 1.3. Newer cards could use 32x32, but other factors such as shared memory and register usage may still warrant a 16x16 block.
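A quick check of the arithmetic behind the default (assuming the per-block thread limits of 512 for compute capability 1.3 and 1024 for 2.0):

```python
def fits(dim, max_threads_per_block):
    """Does a dim x dim thread block fit under a per-block thread limit?"""
    return dim * dim <= max_threads_per_block

assert fits(16, 512)        # 16x16 = 256 threads: fine on CC 1.3 (Tesla)
assert not fits(32, 512)    # 32x32 = 1024 threads: too big for CC 1.3
assert fits(32, 1024)       # ...but fits on CC 2.0 (Fermi)
```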

> `!$acc do parallel, vector(64)` (for the i-loop)
> `!$acc do parallel, vector(64)` (for the j-loop)
> I don't obtain better performance.

Check the -Minfo=accel output. A 64x64 thread block is too large for your device, so the compiler is most likely ignoring your values and using the default. To see the maximum number of threads per block for your device, run the utility `pgaccelinfo`. On Fermi this maximum is 1024; on earlier Tesla (CC 1.x) cards it is 512.
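You can see why 64x64 has to be rejected (sketch of the arithmetic only; the real limit for your device comes from `pgaccelinfo`):

```python
# A 64x64 block implies 4096 threads per block, well beyond either limit.
dim = 64
threads = dim * dim
assert threads == 4096
assert threads > 1024    # exceeds even Fermi's 1024-thread per-block maximum
```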

- Mat