Memcpy and seg fault problems when combining openMP and CUDA

Hi.

Can anyone explain this? I have some simple code below. I am attempting to transfer some largish data arrays to 3 GPUs, having attached a single OpenMP thread to each. The code produces the expected result if I don’t compile with OpenMP, and it also works fine with three OpenMP threads if I decrease the size of the arrays being transferred. My problem is that the arrays being transferred are not actually that big, and should surely fit into the device memory of a C1060 card. The host machine has 24GB of RAM, so that too should not be a problem. You can see in the code that I first try transferring the data in a loop outside the OpenMP parallel region, which works fine (Transfer loop 1). Transferring the data inside the OpenMP loop (Transfer loop 2), however, causes a segmentation fault. This is what I get:

export OMP_NUM_THREADS=3
./a.out

Start....
Transfer loop 1 has completed.
Segmentation fault

Here is the code. Can anyone please help?


Thanks,

Rob.

subroutine memtransfer( Fh )

   use cudafor

   implicit none

   real         :: Fh(20000000)
   real, device :: Fd(20000000)
   integer      :: iflag,idev

   Fd = Fh

!  -> Call device kernel, change Fd in some way....
!     Fill with device id for test....

   iflag = cudaGetDevice(idev)

   Fd    = (idev+1)**2

!  -> Transfer data back to OpenMP host thread:

   Fh = Fd

end

program memexample

   use cudafor

   implicit none

   real         :: Fh(20000000), Fsum(20000000)
   real, device :: Fd(20000000)

   integer      :: i,iflag

   Fh   = 0.0
   Fsum = 0.0

   print*, 'Start....'

   do i = 0,2
      iflag = cudaSetDevice(i)
      Fd = Fh
   enddo

   print*, 'Transfer loop 1 has completed.'

!$OMP PARALLEL PRIVATE(i,iflag,Fh) SHARED(Fsum)
!$OMP DO

      do i=0,2

         iflag = cudaSetDevice(i)
         call memtransfer(Fh)

         iflag = cudaThreadSynchronize()

!        Sum up results (serialise the update, since Fsum is shared):

!$OMP CRITICAL
         Fsum = Fsum + Fh
!$OMP END CRITICAL

      enddo

!$OMP END DO
!$OMP END PARALLEL

   print*, 'Transfer loop 2 has completed.'

   print*, 'Result (should be = 14.0 for 3 OpenMP threads, 3.0 if no openMP):', Fsum(1)

end

Some extra info. The largest arrays I can use that don’t cause a crash are 2618746 elements long. At 4 bytes per single-precision floating-point number, that means the arrays are just under 10MB in size. It’s my understanding that a C1060 has ~4GB of device memory, so is there a 10MB data transfer limit I should know about?

Rob.

I think I might have solved my own problem. Through sheer luck I have found that if I increase the shell’s stack limit above the default of 10240 KB, I can transfer larger arrays. Does anyone have any idea why the stack limit imposes a 10MB ceiling inside the OpenMP loop, but not outside it?
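
(For reference, raising the limit means a bash command along the lines of the one below, run before ./a.out; the value is in KB and is just an example:)

ulimit -s 102400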

Rob.

Hi Rob,

You’re hitting the OpenMP per-thread stack size limit when entering the OpenMP region. Because “Fh” is PRIVATE, the compiler automatically allocates a copy of it on each thread’s stack, and that allocation overflows the stack. The default OpenMP stack size is 8MB but can be increased using the environment variable OMP_STACKSIZE. Also, on Linux and OS X, if you have your shell’s stack size limit set higher than 8MB, then that value is used instead; the exception is when the limit is set to unlimited, in which case the default 8MB is used. In your case the threads are getting the shell’s 10240 KB (10,485,760 byte) limit, which is why arrays of up to 2618746 elements (10,474,984 bytes) just fit and anything larger overflows.
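
For example, from a bash shell, something along these lines should work before running (the 128M value here is just an illustration; pick whatever comfortably holds each thread’s private copy of Fh):

export OMP_STACKSIZE=128M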

Also, the code:

   do i = 0,2
      iflag = cudaSetDevice(i)
      Fd = Fh
   enddo

isn’t doing what you want. When a static device variable is first declared (in this case Fd is declared at the start of your program), its space is allocated on the current device. Although you’re changing the device number, Fd is only allocated on the default device; there aren’t three copies.

However, in “memtransfer” the local “Fd” array (which is distinct from the main program’s Fd array) will be allocated 3 times, once on each device, since the subroutine is called from within the OpenMP region.

Memory cannot be shared between, nor split across, multiple devices. Hence, when working with OpenMP and CUDA, it’s best to encapsulate your CUDA code within subroutines called from an OpenMP region.

The pseudo-code for a mixed OpenMP/CUDA program would be something like the outline below (a short CUDA Fortran sketch follows it):

- Start an OpenMP Region
-- Set the CUDA Device Number for each thread
-- Divide the work amongst the threads.
-- Call a routine containing the CUDA code.
- End the OpenMP Region
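
A rough CUDA Fortran sketch of that outline might look something like the code below. The routine name “docompute”, the even chunking of the array, and the assumption that the number of OpenMP threads does not exceed the number of GPUs are illustrative choices only, not a definitive recipe:

subroutine docompute( chunk, n )

   use cudafor

   implicit none

   integer :: n
   real    :: chunk(n)
   real, device, allocatable :: chunk_d(:)

   allocate(chunk_d(n))        ! allocated on this thread's current device
   chunk_d = chunk             ! host -> device copy

!  -> Launch kernels that work on chunk_d here....

   chunk = chunk_d             ! device -> host copy
   deallocate(chunk_d)

end

program ompcudasketch

   use cudafor
   use omp_lib

   implicit none

   integer, parameter :: ntot = 20000000
   real, allocatable  :: Fh(:)
   integer :: tid, nthr, csize, i1, i2, iflag

   allocate(Fh(ntot))          ! heap allocation, so no per-thread stack copies
   Fh = 0.0

!$OMP PARALLEL PRIVATE(tid,nthr,csize,i1,i2,iflag)

   tid   = omp_get_thread_num()
   nthr  = omp_get_num_threads()          ! assumed <= number of GPUs
   iflag = cudaSetDevice(tid)             ! attach one device to each thread
   csize = (ntot + nthr - 1) / nthr       ! divide the work amongst the threads
   i1    = tid*csize + 1
   i2    = min((tid+1)*csize, ntot)

   call docompute( Fh(i1:i2), i2-i1+1 )   ! all CUDA code lives in the subroutine

!$OMP END PARALLEL

end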

Hope this helps,
Mat

Thank you so much for pointing that out, Mat. Clearly I’ve got a lot to learn!

Thanks again,

Rob.