PGI Acc: Matrix-matrix-multiplication

Hello,
I want to accelerate a simple matrix-matrix multiplication.
First, I just added the accelerator region directives:

call acc_init(acc_device_nvidia)
!$acc region
  do j = 1, matrixDim
     do i=1,matrixDim
        A(i,j) = 0.0
      end do
     do k = 1, matrixDim
        do i = 1, matrixDim
           A(i,j) = A(i,j) + B(i,k) * C(k,j)
        end do
     end do
  end do
!$acc end region

I get 92150 MFLOPS on one (!) GPU of a Tesla S1070 (matrix size: 3000). The following is the compiler feedback:

     34, Generating copyout(a(1:matrixdim,1:matrixdim))
         Generating copyin(c(1:matrixdim,1:matrixdim))
         Generating copyin(b(1:matrixdim,1:matrixdim))
         Generating compute capability 1.3 binary
     35, Loop is parallelizable
     36, Loop is parallelizable
         Accelerator kernel generated
         35, !$acc do parallel, vector(16)
         36, !$acc do parallel, vector(16)
             CC 1.3 : 6 registers; 24 shared, 80 constant, 0 local memory bytes; 100 occupancy
     39, Loop carried reuse of 'a' prevents parallelization
     40, Loop is parallelizable
         Accelerator kernel generated
         35, !$acc do parallel, vector(16)
         39, !$acc do seq
             Cached references to size [16x16] block of 'b'
             Cached references to size [16x16] block of 'c'
         40, !$acc do parallel, vector(16)
             Using register for 'a'
             CC 1.3 : 20 registers; 2072 shared, 92 constant, 0 local memory bytes; 75 occupancy

First question: Could someone explain the compiler feedback to me? I mean, which loops are scheduled on the streaming processors, and so on? I'm especially confused by the "seq" clause.

Second: I tried a lot of things, but I couldn't get it any faster. Is it possible? How?

Third: I did measurements for comparison with CUDA (and OpenCL), and both are faster. E.g. with CUDA I get 121622 MFLOPS. Why can't PGI Accelerator be as fast as CUDA here?

Thanks. Sandra

Hi Sandra,

First question: Could someone explain the compiler feedback to me? I mean, which loops are scheduled on the streaming processors, and so on? I'm especially confused by the "seq" clause.

Sure, let's break it down.

34, Generating copyout(a(1:matrixdim,1:matrixdim))
Generating copyin(c(1:matrixdim,1:matrixdim))
Generating copyin(b(1:matrixdim,1:matrixdim))
Generating compute capability 1.3 binary

The first section tells you how the compiler is copying your data to and from the GPU. The B and C arrays will be copied to the GPU but not back to the host, while A's values will only be copied back from the GPU.
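
If you ever want to control the data movement yourself, you can state it explicitly on the region directive. A minimal sketch using your array names (the compiler is already doing the equivalent for you, so this is just illustrative; you could also give explicit sub-array bounds such as B(1:matrixDim,1:matrixDim), as shown in the Minfo output):

!$acc region copyin(B, C) copyout(A)
   ! ... same loop nest as above ...
!$acc end region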

35, Loop is parallelizable
36, Loop is parallelizable

39, Loop carried reuse of ‘a’ prevents parallelization
40, Loop is parallelizable

Here the compiler is telling you the results of its dependency analysis. It has determined that the loops at lines 35, 36, and 40 contain no dependencies and are parallelizable. However, if the loop at line 39 were parallelized, multiple threads would be working on the same elements of the A array. Hence, this 'reuse' of A prevents the k loop from being parallelized.

Accelerator kernel generated
35, !$acc do parallel, vector(16)
36, !$acc do parallel, vector(16)
CC 1.3 : 6 registers; 24 shared, 80 constant, 0 local memory bytes; 100 occupancy

Here the compiler is telling you that it has created a kernel from the j loop and the first i loop. In other words, it has split the body of the j loop into two kernels: one to handle the initialization of A and one to perform the computation.

For the kernel schedule, 'parallel' indicates the grid dimension and 'vector' describes the block dimension. In this case, you have a variable number of 2-dimensional blocks in your grid; the actual number of blocks will be determined at run time based on the size of the arrays. Each block will be a 16x16 thread block, and each thread will perform the initialization of a single element of A.
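
Written out explicitly, the schedule the compiler chose for the initialization kernel corresponds roughly to the following (just a sketch; the compiler inserts the equivalent directives for you, as shown in the Minfo messages):

!$acc region
!$acc do parallel, vector(16)
   do j = 1, matrixDim
!$acc do parallel, vector(16)
      do i = 1, matrixDim
         A(i,j) = 0.0
      end do
   end do
!$acc end region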

Finally, the compiler gives you the statistics for the kernel. It uses 6 registers per thread; at 4 bytes per register, that accounts for the 24 bytes of shared memory shown. The constant memory holds A's array descriptor as well as your loop bounds and other constant data; an S1070 has 64 KB of constant memory, so the 80 bytes you're using is fine. Last, the compiler shows that the occupancy of this kernel is 100%. Occupancy basically indicates the utilization of the GPU's multiprocessors: low occupancy generally means lower performance, though high occupancy does not guarantee high performance.

Accelerator kernel generated
35, !$acc do parallel, vector(16)
39, !$acc do seq
Cached references to size [16x16] block of ‘b’
Cached references to size [16x16] block of ‘c’
40, !$acc do parallel, vector(16)
Using register for ‘a’
CC 1.3 : 20 registers; 2072 shared, 92 constant, 0 local memory bytes; 75 occupancy

For the second kernel, the compiler is effectively interchanging the k and i loops and using the same schedule as the first kernel. The k loop is scheduled sequentially (i.e. 'seq') because of the reuse of A. In other words, the k loop is included in the kernel code, with each thread executing the full k loop for its own element of A.
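
Conceptually, each GPU thread ends up running something like this for its own (i,j) pair (an illustrative sketch, not the actual generated code; tmp stands for the register the compiler uses in place of A):

      tmp = 0.0
      do k = 1, matrixDim              ! the 'seq' loop runs inside the kernel
         tmp = tmp + B(i,k) * C(k,j)
      end do
      A(i,j) = tmp                     ! "Using register for 'a'"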

The cached references to the B and C arrays indicate that the compiler has created a cached 16x16 copy of a portion of each array in the multiprocessor's shared memory. Since B's and C's values are reused by each thread and across multiple threads, having a cached copy helps performance: it's much faster to access shared memory than the device's global memory.
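
If it helps, the effect is roughly the same as hand-blocking the loops into 16x16 tiles so that each tile of B and C is loaded once and then reused. A host-side sketch of that idea, using the same names as your code and assuming A has already been zeroed:

   do jj = 1, matrixDim, 16
      do ii = 1, matrixDim, 16
         do kk = 1, matrixDim, 16
            do j = jj, min(jj+15, matrixDim)
               do k = kk, min(kk+15, matrixDim)
                  do i = ii, min(ii+15, matrixDim)
                     A(i,j) = A(i,j) + B(i,k) * C(k,j)
                  end do
               end do
            end do
         end do
      end do
   end do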

Second: I tried a lot of things, but I couldn't get it any faster. Is it possible? How?

On my C1060 I get between 93000 and 97000 MFLOPS, depending on the array size. The one thing I would check is whether you're including the device initialization time. Do you have a call to acc_init before your timers, or are you running the PGI pgcudainit utility in the background on your system?

Third: I did measurements for comparison with CUDA (and OpenCL), and both are faster. E.g. with CUDA I get 121622 MFLOPS. Why can't PGI Accelerator be as fast as CUDA here?

Is this CUDA code you wrote, or are you using CULA or some other CUDA-enabled SGEMM?

If it's your own code, I would make sure that you're comparing apples to apples. If I exclude the data transfer time, my performance goes up to ~101000 MFLOPS. If you are including data transfer time, then I'm not sure. It would be interesting if you created a version using CUDA Fortran.

If it's CULA, then I would not be surprised. It's similar to comparing a compiled version of SGEMM against a hand-optimized assembly version such as those found in ACML, MKL, or GotoBLAS. Highly tuned, hand-optimized code written by skilled engineers over many months will (at least most of the time) beat what a compiler can do. A compiler must be more general in its solutions and does not always have as much information as an engineer has.

  • Mat

Hi Mat,
that's a really thorough and helpful explanation. Thanks!

For the kernel schedule, 'parallel' indicates the grid dimension and 'vector' describes the block dimension. In this case, you have a variable number of 2-dimensional blocks in your grid.

I am not sure whether I understand this: How does "parallel" indicate the grid dimension? Is it a 2D grid because we have a parallel clause on lines 35 and 36? And the same with vector? If so, what would a 2D grid + 3D block look like?

Each block will be a 16x16 thread block.

Is that because of the width given in the vector clause?
What does the width in a parallel clause mean?

In other words, the k loop is included in the kernel code, with each thread executing the full k loop for its own element of A.

Now it's clear!
Is the interchange of the k and i loops indicated anywhere? That would be a nice hint from the compiler for the programmer (and then you would also understand the seq clause).

The cached references to the B and C arrays indicate that the compiler has created a cached 16x16 copy of a portion of each array in the multiprocessor's shared memory.

Is the cache clause the corresponding manual directive? As far as I know, I can only specify in the cache clause which arrays the compiler should try to cache. Can I also help the compiler with the size of the array section, or will it always take as much as fits into shared memory?

Do you have a call to acc_init before your timers or are you running the PGI pgcudainit utility in the background on your system?

No, my time measurements exclude the initialization costs:

call acc_init(acc_device_nvidia)
t_start = getRealTime()
!$acc region
 ! matrix-matrix multiplication
!$acc end region
t_end = getRealTime()
write (*,*) 'Time for computation: ', t_end - t_start
write (*,'(A, F10.2)') 'MFlops: ', ((dble(matrixDim) ** 3) * 2 / (t_end - t_start)) / 10**6

With which matrix size did you get the best performance?
Do you have a corresponding value for matrix size 3000?
So, in this case, the compiler recognizes all the possible optimizations and you can't do any better with manual tuning?

If it’s your own code, I would make sure that you’re comparing apples to apples.

Yes, I wrote my own matrix-matrix multiplication in CUDA C (no CULA or the like), and in both cases I am measuring the performance including the data transfer. But maybe there is another mistake, if you say there shouldn't be such a huge throughput difference. I will check carefully. And I will also have a closer look at CUDA Fortran soon.

Hi Sandra,

I am not sure whether I understand this: How does "parallel" indicate the grid dimension? Is it a 2D grid because we have a parallel clause on lines 35 and 36? And the same with vector? If so, what would a 2D grid + 3D block look like?

If you haven't read it already, Dr. Wolfe has a good primer on the threading model used by NVIDIA GPUs, found HERE. Hopefully this will help in understanding how 'parallel' and 'vector' map onto the GPU.

What does the width in a parallel clause mean?

It allows you to create a fixed number of blocks.
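
For example, something like the following (the width of 200 is purely illustrative, not a tuning recommendation) would fix the grid at 200 blocks in that dimension while keeping 16-wide thread blocks:

!$acc region
!$acc do parallel(200), vector(16)   ! 200 blocks of 16 threads for the j dimension
   do j = 1, matrixDim
      ! ... loop body as before ...
   end do
!$acc end region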

Is the interchange of the k and i loops indicated anywhere?

I should backtrack, since it isn't really doing a loop interchange, even though conceptually it's the same idea. In reality, the i and j loops are removed altogether and replaced by the way the kernel is scheduled. The 'seq' schedule indicates that the k loop is inserted into the kernel.

I should note that a ‘kernel’ is the sequential code run by each thread.

Is the cache clause the corresponding manual directive? As far as I know, I can only specify in the cache clause which arrays the compiler should try to cache.

Although the cache clause has not been implemented yet, it does correspond to that Minfo message. Note that the cache clause will only be a suggestion, which the compiler may ignore.

Can I also help the compiler with the size of the array section, or will it always take as much as fits into shared memory?

It's based on the block size. However, the amount of available shared memory will limit the number of cached arrays used.

With which matrix size did you get the best performance?

I'm getting slightly better performance today using the soon-to-be-released 10.9 compiler. The best performance comes from an array size of 7680 (i.e. 30 blocks of 256 threads).

% matmul
 enter array size and number of iterations to run
7680
5
 Starting host calculation.
 Initialize the Device
     MFLOP   = 905969664000.0
HOST MFLOP/S =         3762.0
 GPU MFLOP/S =       100022.5
    SPEED-UP =           25.6
7680 tests completed. 7680 tests PASSED. 0 tests failed.

% matmul
 enter array size and number of iterations to run
4096
5
 Starting host calculation.
 Initialize the Device
     MFLOP   = 137438953472.0
HOST MFLOP/S =         3886.8
 GPU MFLOP/S =        98632.8
    SPEED-UP =           24.4
4096 tests completed. 4096 tests PASSED. 0 tests failed.

% matmul
 enter array size and number of iterations to run
3000
5
 Starting host calculation.
 Initialize the Device
     MFLOP   =  54000001024.0
HOST MFLOP/S =         3914.9
 GPU MFLOP/S =        96503.9
    SPEED-UP =           23.7
3000 tests completed. 3000 tests PASSED. 0 tests failed.



So, in this case, the compiler recognizes all the possible optimizations and you can't do any better with manual tuning?

I believe so.

Hope this helps,
Mat