Hi Sandra,
First question: Could someone explain the compiler feedback to me? I mean, which loops are scheduled on streaming processors and so on? I get especially confused by the “seq” clause.
Sure, let’s break it down.
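Since you didn’t post the source, I’ll assume a matmul loop nest roughly like the following sketch. The array and variable names (a, b, c, matrixdim) come straight from the compiler messages; the loop structure is my guess, and the comments map to the line numbers in the messages (the region directive itself sits at line 34):

   !$acc region
   do j = 1, matrixdim                   ! line 35
      do i = 1, matrixdim                ! line 36
         a(i,j) = 0.0
      enddo
      do k = 1, matrixdim                ! line 39
         do i = 1, matrixdim             ! line 40
            a(i,j) = a(i,j) + b(i,k) * c(k,j)
         enddo
      enddo
   enddo
   !$acc end region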
34, Generating copyout(a(1:matrixdim,1:matrixdim))
Generating copyin(c(1:matrixdim,1:matrixdim))
Generating copyin(b(1:matrixdim,1:matrixdim))
Generating compute capability 1.3 binary
The first section tells you how the compiler is moving your data to and from the GPU. The B and C arrays will be copied to the GPU but not back to the host (copyin), while A’s values will only be copied back from the GPU (copyout).
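The compiler inferred this on its own here, but if you ever need to control the data movement yourself, you can state it on the region directive. A minimal sketch with the same arrays:

   !$acc region copyin(b, c) copyout(a)
      ! ... the loop nest from above ...
   !$acc end region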
35, Loop is parallelizable
36, Loop is parallelizable
…
39, Loop carried reuse of ‘a’ prevents parallelization
40, Loop is parallelizable
Here the compiler is telling you the results of its dependency analysis. It has determined that the loops at lines 35, 36, and 40 contain no dependencies and are parallelizable. However, if the loop at line 39 were parallelized, multiple threads would be working on the same elements of the A array. Hence, this ‘reuse’ of A prevents the k loop from being parallelized.
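Concretely, every iteration of the k loop reads and then writes the same element of A, so two iterations cannot safely run in different threads:

   do k = 1, matrixdim
      a(i,j) = a(i,j) + b(i,k) * c(k,j)   ! each k iteration read-modify-writes a(i,j)
   enddo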
Accelerator kernel generated
35, !$acc do parallel, vector(16)
36, !$acc do parallel, vector(16)
CC 1.3 : 6 registers; 24 shared, 80 constant, 0 local memory bytes; 100 occupancy
Here the compiler is telling you that it has created a kernel using the j loop and the first i loop. In other words, it has split the body of the j loop into two kernels: one to handle the initialization of A and one to perform the computation.
For the kernel schedule, ‘parallel’ indicates the grid dimension and ‘vector’ describes the block dimension. In this case, you have a variable number of two-dimensional blocks in your grid. The actual number of blocks will be determined at run time based on the size of the arrays. Each block will be a 16x16 thread block, and each thread will perform the initialization of a single element of A.
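You can also write the schedule explicitly if you want to experiment with it. A sketch for the first kernel, mirroring what the compiler chose (try other vector widths to tune):

   !$acc region
   ! 'parallel' maps a loop across the grid; 'vector(16)' gives 16 threads per block
   !$acc do parallel, vector(16)
   do j = 1, matrixdim
   !$acc do parallel, vector(16)
      do i = 1, matrixdim
         a(i,j) = 0.0
      enddo
   enddo
   !$acc end region

With matrixdim = 1024, for example, that gives a 64x64 grid of 16x16 thread blocks, since 1024/16 = 64.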
The last line gives you the statistics for the kernel. Each thread uses 6 registers. The kernel also uses 24 bytes of the multiprocessor’s shared memory (on CC 1.x devices the kernel arguments are passed through shared memory) and 80 bytes of constant memory, which holds A’s array descriptor as well as your loop bounds and constant data. An S1070 has 64KB of constant memory, so the 80 bytes you’re using is fine, and nothing spills to local memory (0 local memory bytes). Finally, the compiler is showing that the occupancy of the kernel is 100%. Occupancy indicates how fully a multiprocessor’s thread slots are utilized. Low occupancy generally means lower performance, though high occupancy does not guarantee high performance.
Accelerator kernel generated
35, !$acc do parallel, vector(16)
39, !$acc do seq
Cached references to size [16x16] block of ‘b’
Cached references to size [16x16] block of ‘c’
40, !$acc do parallel, vector(16)
Using register for ‘a’
CC 1.3 : 20 registers; 2072 shared, 92 constant, 0 local memory bytes; 75 occupancy
For the second kernel, the compiler is actually interchanging the k and i loops and using the same schedule as the first kernel. The k loop is being scheduled sequentially (i.e. seq) due to the reuse of A. In other words, the k loop will be included in the kernel code, with each thread executing the full k loop on its own element of A.
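In effect, each thread ends up running something like this for its (i,j) element. This is just a sketch of what the generated kernel body amounts to; the “Using register for ‘a’” message means the running sum stays in a register, which tmp stands in for here:

   tmp = a(i,j)              ! loaded once; the first kernel zeroed it
   do k = 1, matrixdim       ! the seq loop: every thread runs it in full
      tmp = tmp + b(i,k) * c(k,j)
   enddo
   a(i,j) = tmp              ! stored once, from the register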
The cached references to the B and C arrays indicate that the compiler has created cached 16x16 copies of portions of the B and C arrays in the multiprocessor’s shared memory. Since B and C’s values are reused by each thread and across multiple threads, having a cached copy helps performance, since it’s much faster to access shared memory than the device’s global memory.
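As a side note, the drop to 75% occupancy for this kernel is consistent with its register usage, assuming the usual CC 1.3 limits of 16384 registers and 1024 threads per multiprocessor: a 16x16 block has 256 threads, and 256 threads x 20 registers = 5120 registers per block, so only three blocks fit on a multiprocessor at once, and 3 x 256 / 1024 = 75%.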
Second: I tried a lot of things, but I couldn’t get it faster. Is it possible? How?
On my C1060 I get between 93000 and 97000 MFLOPS depending on the array size. The one thing I would check is whether you’re including the device initialization time. Do you have a call to acc_init before your timers, or are you running the PGI pgcudainit utility in the background on your system?
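If not, here is a minimal sketch of what I mean (accel_lib is PGI’s accelerator runtime module; the program and variable names are mine):

   program mmul_timed
      use accel_lib                       ! PGI Accelerator runtime API
      integer :: t1, t2, rate
      call acc_init(acc_device_nvidia)    ! pay the device start-up cost before timing
      call system_clock(t1, rate)
      ! ... run the accelerator region here ...
      call system_clock(t2)
      print *, 'seconds:', real(t2 - t1) / real(rate)
   end program mmul_timed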
Third: I did measurements to compare with CUDA (and OpenCL), and both are faster. E.g. with CUDA I get 121622 MFLOPS. Why can PGI Accelerator not be as fast as CUDA here?
Is this CUDA code you wrote, or are you using CULA or another CUDA-enabled SGEMM?
If it’s your own code, I would make sure that you’re comparing apples to apples. If I exclude the data transfer time, my performance goes to ~101000 MFLOPS. If you are including data transfer time, then I’m not sure. It would be interesting if you created a version using CUDA Fortran.
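For reference, a bare-bones CUDA Fortran kernel would look something like this sketch. There’s no shared-memory tiling, so don’t expect it to match a tuned SGEMM, and the module and routine names are mine:

   module mmul_kernels
      use cudafor
   contains
      attributes(global) subroutine mmul_kernel(a, b, c, n)
         real, device :: a(n,n), b(n,n), c(n,n)
         integer, value :: n
         integer :: i, j, k
         real :: tmp
         i = (blockidx%x - 1) * blockdim%x + threadidx%x
         j = (blockidx%y - 1) * blockdim%y + threadidx%y
         if (i <= n .and. j <= n) then
            tmp = 0.0
            do k = 1, n                   ! same sequential k loop as the seq schedule
               tmp = tmp + b(i,k) * c(k,j)
            enddo
            a(i,j) = tmp
         endif
      end subroutine mmul_kernel
   end module mmul_kernels

You would launch it with something like call mmul_kernel<<<dim3((n+15)/16,(n+15)/16,1), dim3(16,16,1)>>>(dA, dB, dC, n), where dA, dB, and dC are device arrays.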
If it’s CULA, then I would not be surprised. It’s similar to comparing a compiled version of SGEMM against a hand-optimized assembly version such as those found in ACML, MKL, or GotoBLAS. Highly tuned code, hand-optimized by many highly skilled engineers over many months, will (at least most of the time) beat what a compiler can do. A compiler must be more general in its solutions and does not always have as much information as an engineer has.