Memcpy and seg fault problems when combining openMP and CUDA

Hi.

Can anyone explain this? I have some simple code below. I am attempting to transfer some largish data arrays to 3 GPUs, having attached a single OpenMP thread to each. The code produces the expected result if I don’t compile with OpenMP, and it also works fine with three OpenMP threads if I decrease the size of the arrays being transferred. My problem is that the arrays being transferred are not actually that big, and should surely fit into the device memory of a C1060 card. The host machine has 24GB of RAM, so that too should not be a problem. You can see in the code that I first try transferring the data in a loop outside the OpenMP parallel region, which works fine (Transfer loop 1). Transferring the data inside the OpenMP loop (Transfer loop 2), however, causes a segmentation fault. This is what I get:

export OMP_NUM_THREADS=3
./a.out

Start....
Transfer loop 1 has completed.
Segmentation fault

Here is the code. Can anyone please help?


Thanks,

Rob.

subroutine memtransfer( Fh )

   use cudafor

   implicit none

   real         :: Fh(20000000)
   real, device :: Fd(20000000)
   integer      :: iflag,idev

   Fd = Fh

!  -> Call device kernel, change Fd in some way....
!     Fill with device id for test....

   iflag = cudaGetDevice(idev)

   Fd    = (idev+1)**2

!  -> Transfer data back to OpenMP host thread:

   Fh = Fd

end

program memexample

   use cudafor

   implicit none

   real         :: Fh(20000000), Fsum(20000000)
   real, device :: Fd(20000000)

   integer      :: i,iflag

   Fh   = 0.0
   Fsum = 0.0

   print*, 'Start....'

   do i = 0,2
      iflag = cudaSetDevice(i)
      Fd = Fh
   enddo

   print*, 'Transfer loop 1 has completed.'

!$OMP PARALLEL PRIVATE(i,iflag,Fh) SHARED(Fsum)
!$OMP DO

      do i=0,2

         iflag = cudaSetDevice(i)
         call memtransfer(Fh)

         iflag = cudaThreadSynchronize()

!        Sum up results (serialise the update, since Fsum is shared):

!$OMP CRITICAL
         Fsum = Fsum + Fh
!$OMP END CRITICAL

      enddo

!$OMP END DO
!$OMP END PARALLEL

   print*, 'Transfer loop 2 has completed.'

   print*, 'Result (should be = 14.0 for 3 OpenMP threads, 3.0 if no openMP):', Fsum(1)

end

Some extra info. The largest arrays I can use that don’t cause a crash are 2618746 elements long. At 4 bytes per single-precision floating-point number, that means the arrays are just under 10MB in size. It’s my understanding that a C1060 has ~4GB of device memory, so is there a 10MB data transfer limit I should know about?

Rob.

I think I might have solved my own problem. Through sheer luck I have found that if I increase the shell’s stack limit above the default of 10240 KB, I can transfer larger arrays. Does anyone have any idea why the stack limit imposes a 10MB ceiling inside the OpenMP loop, but not outside it?
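
(For reference, raising the limit means a bash command along the lines of the one below, run before ./a.out; the value is in KB and is just an example:)

ulimit -s 102400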

Rob.

Hi Rob,

You’re hitting the OpenMP per-thread stack size limit when entering the OpenMP region. Because “Fh” is PRIVATE, the compiler automatically allocates a copy of it on each thread’s stack, and that allocation overflows the stack. The default OpenMP stack size is 8MB but can be increased using the environment variable OMP_STACKSIZE. Also, on Linux and OS X, if you have your shell’s stack size limit set higher than 8MB, then that value is used instead; the exception is when the limit is set to unlimited, in which case the default 8MB is used. In your case the threads are getting the shell’s 10240 KB (10,485,760 byte) limit, which is why arrays of up to 2618746 elements (10,474,984 bytes) just fit and anything larger overflows.
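
For example, from a bash shell, something along these lines should work before running (the 128M value here is just an illustration; pick whatever comfortably holds each thread’s private copy of Fh):

export OMP_STACKSIZE=128M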

Also, the code:

   do i = 0,2
      iflag = cudaSetDevice(i)
      Fd = Fh
   enddo

isn’t doing what you want. When a static device variable is first declared (in this case Fd is declared at the start of your program), its space is allocated on the current device. Although you’re changing the device number, Fd is only allocated on the default device; there aren’t three copies.

However, in “memtransfer” the local “Fd” array (which is distinct from the main program’s Fd array) will be allocated 3 times, once on each device, since the subroutine is called from within the OpenMP region.

Memory cannot be shared between, nor split across, multiple devices. Hence, when working with OpenMP and CUDA, it’s best to encapsulate your CUDA code within subroutines called from an OpenMP region.

The pseudo-code for a mixed OpenMP/CUDA program would be something like the outline below (a short CUDA Fortran sketch follows it):

- Start an OpenMP Region
-- Set the CUDA Device Number for each thread
-- Divide the work amongst the threads.
-- Call a routine containing the CUDA code.
- End the OpenMP Region
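
A rough CUDA Fortran sketch of that outline might look something like the code below. The routine name “docompute”, the even chunking of the array, and the assumption that the number of OpenMP threads does not exceed the number of GPUs are illustrative choices only, not a definitive recipe:

subroutine docompute( chunk, n )

   use cudafor

   implicit none

   integer :: n
   real    :: chunk(n)
   real, device, allocatable :: chunk_d(:)

   allocate(chunk_d(n))        ! allocated on this thread's current device
   chunk_d = chunk             ! host -> device copy

!  -> Launch kernels that work on chunk_d here....

   chunk = chunk_d             ! device -> host copy
   deallocate(chunk_d)

end

program ompcudasketch

   use cudafor
   use omp_lib

   implicit none

   integer, parameter :: ntot = 20000000
   real, allocatable  :: Fh(:)
   integer :: tid, nthr, csize, i1, i2, iflag

   allocate(Fh(ntot))          ! heap allocation, so no per-thread stack copies
   Fh = 0.0

!$OMP PARALLEL PRIVATE(tid,nthr,csize,i1,i2,iflag)

   tid   = omp_get_thread_num()
   nthr  = omp_get_num_threads()          ! assumed <= number of GPUs
   iflag = cudaSetDevice(tid)             ! attach one device to each thread
   csize = (ntot + nthr - 1) / nthr       ! divide the work amongst the threads
   i1    = tid*csize + 1
   i2    = min((tid+1)*csize, ntot)

   call docompute( Fh(i1:i2), i2-i1+1 )   ! all CUDA code lives in the subroutine

!$OMP END PARALLEL

end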

Hope this helps,
Mat

Thank you so much for pointing that out, Mat. Clearly I’ve got a lot to learn!

Thanks again,

Rob.