Using OpenMP and PGPROF

Hello,

I would like to profile a region of my application, written in Fortran, using the calls cudaProfilerStart() and cudaProfilerStop().

But these require the header cuda_profiler_api.h: how do I tell the pgfortran compiler to use this file? I'm using a Makefile.

Another question: if I want to use OpenMP with pgfortran, I just need to add the -mp flag, right? It seems that I also need to add a LIB or INCLUDE entry to my Makefile, because even with OMP_NUM_THREADS set to 2 or more it does not appear to run multi-threaded.

Thank you for your attention.

Usually you either include the OpenMP header file,

include 'omp_lib.h'

or you use the OpenMP module,

use omp_lib

in your source. Be sure to set the -mp switch during both the
compile AND link steps, or you may get a stub OpenMP library
that does not run multi-threaded.
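A minimal Makefile sketch illustrating this (file and target names are placeholders, not from your project):

```make
# Note: -mp appears on BOTH the compile and the link line
FC     = pgfortran
FFLAGS = -mp

app: main.o
	$(FC) $(FFLAGS) -o app main.o   # link step: -mp again, or you get the stub lib

main.o: main.f90
	$(FC) $(FFLAGS) -c main.f90     # compile step
```

Then run with, e.g., OMP_NUM_THREADS=4 ./app.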

See the entry
http://www.pgroup.com/userforum/viewtopic.php?p=22311#22311
for an example that shows multiple OpenMP threads speeding up calculations.

dave

Thank you Dave it works now for OpenMP !

For the profiler, when I use #include "cuda_profiler_api.h" I get an error; it seems the compiler doesn't understand this file.

I also have another question about cudaMallocPitch: does it work for 3D arrays? If not, do I have to pad the arrays myself?

Thanks again.

Hi Mr. Dark,

The CUDA Fortran module has interfaces for cudaProfilerStart/Stop, so there is no need to include the C header cuda_profiler_api.h. Just add "use cudafor" and compile with -Mcuda. You may need "-Mcuda=8.0" if you're profiling with CUDA 8 and/or "-Mcuda=cc60" if you're using a Pascal device.
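For instance, a minimal sketch (program name and the region contents are placeholders):

```fortran
program profile_region
   use cudafor               ! provides the cudaProfilerStart/Stop interfaces
   implicit none
   integer :: istat

   istat = cudaProfilerStart()   ! begin capture
   ! ... region of interest: kernel launches, memory copies ...
   istat = cudaProfilerStop()    ! end capture
end program profile_region
```

Compile with something like "pgfortran -Mcuda=8.0,cc60 profile_region.f90", and when profiling pass "--profile-from-start off" to pgprof/nvprof so that only the bracketed region is captured.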

I also have another question about cudaMallocPitch: does it work for 3D arrays? If not, do I have to pad the arrays myself?

No, cudaMallocPitch is only for 2D arrays. There is a cudaMalloc3D for 3D arrays. Would that work for you?

For example:

      use cudafor
      implicit none
      integer, device, pointer :: idev(:,:,:)
      integer :: ishap(3)
      integer :: m, n, p, error
      type(cudaPitchedPtr) :: devPtr
      type(cudaExtent) :: extent

      m = 100
      n = 60
      p = 20

      extent%width = m
      extent%height = n
      extent%depth = p

!     cudaMalloc3D
      error = cudaMalloc3D(devPtr, extent)
      if (error.ne.0) then
         print *, "Error in cudaMalloc3D ...", cudaGetErrorString(error)
         stop
      endif

!     pitch is in bytes; divide by 4 (bytes per integer) to get elements
      ishap(1) = devPtr%pitch / 4
      ishap(2) = n
      ishap(3) = p
!
      call c_f_pointer(devPtr%ptr, idev, ishap)

      idev = 0

!     cudaMemset3D
      error = cudaMemset3D(devPtr, 99, extent)

-Mat

Oh OK, I missed the "use cudafor" statement in my code, but now it works, thanks!

I'll give cudaMalloc3D a shot with your example. I need to add extra padding to my arrays to get coalesced access. By the way, how much improvement does achieving coalesced access generally give?

I'm asking because I'm using Pascal cards and I would like to know if it is worth spending some time on that subject!

By the way, how much improvement does achieving coalesced access generally give?

Unless you're passing this array to CUDA C, and possibly in a few other cases, there's no real advantage to using cudaMalloc3D over a 3D "device" array. Fortran arrays are stored contiguously and can therefore be transferred to/from the device in one large memory block.

On the device, coalesced memory accesses are very important, but as long as the threads in a warp access the Fortran array's first dimension contiguously, your accesses are coalesced. When coalesced, the memory for all the threads in a warp can be brought into cache in a single memory fetch. Non-coalesced (random) access, by contrast, splits the warp's request into many separate memory transactions, and every thread in the warp has to wait for all of them to complete.
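As an illustration, here is a sketch of a kernel (names are hypothetical) in which consecutive threads read consecutive first-dimension elements, so each warp's accesses coalesce:

```fortran
module kernels
contains
   attributes(global) subroutine scale2d(a, m, n)
      ! threadIdx%x varies fastest within a warp, and i follows it,
      ! so neighboring threads touch neighboring elements of a column
      ! (Fortran's contiguous first dimension) -> coalesced access.
      integer, value :: m, n
      real :: a(m, n)
      integer :: i, j
      i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
      j = (blockIdx%y - 1) * blockDim%y + threadIdx%y
      if (i <= m .and. j <= n) a(i, j) = 2.0 * a(i, j)
   end subroutine scale2d
end module kernels
```

Swapping the roles of i and j (so threadIdx%x strides across rows instead) would turn each warp's single fetch into many separate transactions.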

-Mat