Memcpy and seg fault problems when combining OpenMP and CUDA


Can anyone explain this? I have a simple code below. I am attempting to transfer some largish data arrays to three GPUs, having attached a single OpenMP thread to each. The code produces the expected result if I don’t compile with OpenMP. It also works fine with three OpenMP threads if I decrease the size of the arrays being transferred. My problem is that the arrays being transferred are not actually that big, and should surely fit into the device memory of a C1060 card. The host machine has 24GB of RAM, so that too should not be a problem. You can see in the code that I can try transferring the data in a loop outside the OpenMP parallel loop - that works fine (Transfer loop 1). Transferring the data inside the OpenMP loop, though (Transfer loop 2), causes a segmentation fault. This is what I get:


Transfer loop 1 has completed.
Segmentation fault

Here is the code. Can anyone please help?



subroutine memtransfer( Fh )

   use cudafor

   implicit none

   real         :: Fh(20000000)
   real, device :: Fd(20000000)
   integer      :: iflag,idev

   Fd = Fh

!  -> Call device kernel, change Fd in some way....
!     Fill with device id for test....

   iflag = cudaGetDevice(idev)

   Fd    = (idev+1)**2

!  -> Transfer data back to OpenMP host thread:

   Fh = Fd

end subroutine memtransfer


program memexample

   use cudafor

   implicit none

   real         :: Fh(20000000), Fsum(20000000)
   real, device :: Fd(20000000)

   integer      :: i,iflag

   Fh   = 0.0
   Fsum = 0.0

   print*, 'Start....'

!  -> Transfer loop 1 (outside the OpenMP region):

   do i = 0,2
      iflag = cudaSetDevice(i)
      Fd = Fh
   end do

   print*, 'Transfer loop 1 has completed.'

!  -> Transfer loop 2 (one OpenMP thread per device):

!$omp parallel do private(iflag)
   do i = 0,2

      iflag = cudaSetDevice(i)
      call memtransfer(Fh)

      iflag = cudaThreadSynchronize()

!     Sum up results:

      Fsum = Fsum + Fh

   end do
!$omp end parallel do

   print*, 'Transfer loop 2 has completed.'

   print*, 'Result (should be = 14.0 for 3 OpenMP threads, 3.0 if no OpenMP):', Fsum(1)

end program memexample


Some extra info. The largest arrays I can use without causing a crash are 2618746 elements long. At 4 bytes per single-precision floating-point number, that means the arrays are just under 10MB in size. My understanding is that a C1060 has ~4GB of device memory. Is there a 10MB data transfer limit I should know about?


I think I might have solved my own problem. Through sheer luck I have found that if I increase the stack limit above the default of 10240, I can transfer larger arrays. Does anyone have any idea why the stack limit causes a 10MB limit inside the OpenMP loop, but not outside it?


Hi Rob,

You’re hitting the OpenMP per-thread stack size limit when entering the OpenMP region. The compiler is trying to automatically allocate an “Fh” on the stack for each thread but gets a stack overflow. The default OpenMP stack size is 8MB but can be increased using the environment variable OMP_STACKSIZE. Also, on Linux and OS X, if you have your shell’s stack size limit set higher than 8MB, then this value will be used. The exception is if the limit is set to unlimited, in which case the default 8MB is used.
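For example (a hypothetical shell session; 64M is just an illustrative value, pick whatever your arrays need):

```shell
# Raise the OpenMP per-thread stack to 64MB before launching the program
export OMP_STACKSIZE=64M
echo "OMP_STACKSIZE set to $OMP_STACKSIZE"
# Alternatively, raise the shell stack limit (in KB): ulimit -s 65536
# (note: setting it to 'unlimited' reverts to the 8MB OpenMP default)
```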

Also, the code:

   do i = 0,2
      iflag = cudaSetDevice(i)
      Fd = Fh
   end do

isn’t doing what you want. When a static device variable is first declared (in this case Fd is declared at the start of your program), its space is allocated on the current device. While you’re changing the device number, Fd is only allocated on the default device. There aren’t three copies.

However, in the “memtransfer” the local “Fd” array (which is distinct from the main program’s Fd array) will be allocated 3 times, once for each device, since the function is called from within the OpenMP region.

Memory cannot be shared between or split across multiple devices. Hence, when working with OpenMP and CUDA, it’s best to encapsulate your CUDA code within subroutines called from an OpenMP region.

The Pseudo-code for a mixed OpenMP/CUDA program would be something like:

- Start an OpenMP Region
-- Set the CUDA Device Number for each thread
-- Divide the work amongst the threads.
-- Call a routine containing the CUDA code.
- End the OpenMP Region
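Following that outline, a minimal CUDA Fortran sketch might look like the code below (a hedged illustration, not from the original post: the names `do_device_work` and `omp_cuda_pattern` and the array size are made up):

```fortran
! Sketch of the pattern above: all CUDA code is encapsulated in a
! subroutine called from inside the OpenMP region, so each thread's
! device allocations land on the device it selected.
subroutine do_device_work( idev, res )
   use cudafor
   implicit none
   integer, intent(in)  :: idev
   real,    intent(out) :: res
   real, device, allocatable :: d(:)   ! one allocation per calling thread/device
   integer :: iflag

   allocate( d(1000) )        ! allocated on the device set by the caller
   d   = real(idev+1)         ! stand-in for a kernel launch
   res = d(1)                 ! device-to-host copy of one element
   iflag = cudaThreadSynchronize()
   deallocate( d )
end subroutine do_device_work

program omp_cuda_pattern
   use cudafor
   implicit none
   integer :: i, iflag
   real    :: partial(0:2)

!$omp parallel do private(iflag)
   do i = 0, 2
      iflag = cudaSetDevice(i)             ! one OpenMP thread per device
      call do_device_work( i, partial(i) ) ! divide the work amongst threads
   end do
!$omp end parallel do

   print*, 'total =', sum(partial)
end program omp_cuda_pattern
```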

Hope this helps,

Thank you so much for pointing that out, Mat. Clearly I’ve got a lot to learn!

Thanks again,