CUDA + OpenMP oddity - looks like a compiler bug.

I have a very strange problem. I am running the code below (omptst.cuf) with:

export OMP_NUM_THREADS=3

I have 3 NVIDIA graphics cards, so I am attaching one card to each OpenMP thread.

I am compiling the code with:

pgfortran omptst.cuf -mp

Now, if I run the code as it stands, it hangs - it tells me that all three cards have been set, but then just sits there.

However, if I comment out the call to curk4 with argument Fdev (see my comments in the code), the code finishes almost instantaneously, as it should, since it doesn’t actually do anything. Notice, though, that there is a return statement before this call - commenting out the call should make no difference at all!

Anyone got any idea what’s going on? Is this me, or a compiler bug?

Rob.

module curk4_mod

use cudafor

implicit none

contains

!  Kernel subroutines:

   subroutine curk4( Fdev )

      use prec_mod

      implicit none

      real( gpu ), device, intent(in)    :: Fdev(2)

      print*,'Don''t even bother to call a kernel function....'

   end subroutine curk4

!  OMP wrapper:

   subroutine omptst( F )

      use prec_mod

      use cudafor

      implicit none

      real (gpu)         :: F   (2)
      real (gpu), device :: Fdev(2)
      integer            :: iflag,idev

      return

!-----------------------
! If I comment out this next line, the code finishes.
! If I leave it in, the code hangs - even though there is a return 
! statement above!
!-----------------------

      call curk4( Fdev )

   end subroutine omptst

end module curk4_mod
  
program wrapper

use cudafor

use prec_mod
use curk4_mod

implicit none

integer    :: i,j
integer    :: numDev, iflag
real (gpu) :: F(2),F2(2)

F = 0.0   ! initialize the shared input so F2 = F is well defined

!$OMP PARALLEL PRIVATE(i,F2,iflag) SHARED(F)
!$OMP DO

do i=0,2

   iflag = cudaSetDevice(i)

   print*,'Device ',i,' set'

   F2 = F

   call omptst( F2 )

enddo

!$OMP END DO
!$OMP END PARALLEL

iflag = cudaThreadSynchronize()

print*,'Finished.'

end
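One side note on the listing above: every CUDA API return code is captured in iflag but never inspected, so a failed cudaSetDevice would go unnoticed. A minimal sketch of checking it, using the cudaSuccess constant and cudaGetErrorString from the cudafor module, would be:

```fortran
! Sketch only: check the return code of each CUDA API call
! (cudaSuccess and cudaGetErrorString come from the cudafor module).
iflag = cudaSetDevice(i)
if ( iflag /= cudaSuccess ) then
   print *, 'cudaSetDevice(', i, ') failed: ', cudaGetErrorString(iflag)
   stop
end if
```

The same check applies to cudaThreadSynchronize and any kernel launch that follows.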

Hi Rob,

I tried it on my system and it worked fine.

% pgf90 curk4.cuf -mp
% setenv OMP_NUM_THREADS 3
% a.out
 Device             0  set
 Device             2  set
 Device             1  set
 Finished.

Granted, my system has 4 cards, but that shouldn’t matter.

Does the code work without “-mp”? What about when “OMP_NUM_THREADS” is set to 1? What devices do you have (see the output from pgaccelinfo)?

  • Mat

Hi Mat.

In answer to your question about my hardware…

I have 3 C1060s installed plus one Quadro display card. They appear in the following order as assigned by pgfortran:

Device 0 is a Tesla C1060 card
Device 1 is a Quadro FX 580 card
Device 2 is a Tesla C1060 card
Device 3 is a Tesla C1060 card

The machine is part of a cluster (w/o display) so all 4 cards should be available for number crunching.

What is a little odd is that the Quadro card is coming up as the second device. When we first installed our machine, we made the mistake of using the latest CUDA v3.0 beta - pgfortran refused to play nicely with that version of the CUDA SDK. The difference, though, was that back then the Quadro card came up as device number 3. After much messing around, we reinstalled everything from scratch with CUDA v2.3, and from that point on the Quadro has appeared as device #1. For single-card algorithms, all 4 cards seem to work fine. My problem, though, as you know, is in trying to use more than one card at once.

I have noticed something even more odd now. If I set OMP_NUM_THREADS=1 and run the code, it runs fine if the call to curk4 is left in, but hangs if I comment it out - i.e. the opposite way around.

If I compile without the -mp option, I get the same behaviour as I do if I compile with -mp but set OMP_NUM_THREADS=1.

Now, I have just noticed something that might be highlighting the problem. Everything works fine if I set OMP_NUM_THREADS=2 and loop from i=0,1 in the wrapper program. This presumably deploys only devices 0 and 1 (i.e. one C1060 and the Quadro). It’s almost as if the Quadro card is blocking communication to devices 2 and 3.
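If the Quadro does turn out to be the troublemaker, one run-time workaround (a sketch only, not tested on this setup) is to skip it by querying each device’s name through cudaGetDeviceProperties, rather than assuming a fixed device ordering:

```fortran
! Sketch: enumerate devices and keep only the Tesla cards.
! cudaDeviceProp, cudaGetDeviceCount and cudaGetDeviceProperties
! are all provided by the cudafor module.
subroutine pick_teslas( devlist, ndev )
   use cudafor
   implicit none
   integer, intent(out) :: devlist(*), ndev
   type(cudaDeviceProp) :: prop
   integer              :: i, n, ierr

   ierr = cudaGetDeviceCount(n)
   ndev = 0
   do i = 0, n-1
      ierr = cudaGetDeviceProperties(prop, i)
      if ( index(prop%name, 'Tesla') > 0 ) then
         ndev          = ndev + 1
         devlist(ndev) = i      ! remember this device number
      end if
   end do
end subroutine pick_teslas
```

The OpenMP loop in the wrapper could then call cudaSetDevice(devlist(thread+1)) instead of using the loop index directly.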

I’m completely puzzled though by how changing the code “after” the return statement is making a difference.

So what do you reckon - is our OpenMP setup to blame? CUDA, or PGI? Or maybe our hardware? Do you think we should pull the Quadro card out to see if that’s the culprit?

Any help greatly appreciated (I’ll acknowledge you in any papers that come out of this work if you can help me find the answer)…

Have a great weekend,

Rob.

Have just done a dump using pgaccelinfo. Something odd, though - the utility hangs halfway through the output for device #2 (at the point where it’s supposed to print “Current free memory”). This is leading me more and more to believe it’s a hardware problem. Here is the dump as far as it gets…

Rob.

CUDA Driver Version            2030

Device Number:                 0
Device Name:                   Tesla C1060
Device Revision Number:        1.3
Global Memory Size:            4294705152
Number of Multiprocessors:     30
Number of Cores:               240
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 16384
Registers per Block:           16384
Warp Size:                     32
Maximum Threads per Block:     512
Maximum Block Dimensions:      512 x 512 x 64
Maximum Grid Dimensions:       65535 x 65535 x 1
Maximum Memory Pitch:          262144B
Texture Alignment              256B
Clock Rate:                    1296 MHz
Initialization time:           10974 microseconds
Current free memory            4246142976
Upload time (4MB)              1163 microseconds ( 715 ms pinned)
Download time                  1493 microseconds (1252 ms pinned)
Upload bandwidth               3606 MB/sec (5866 MB/sec pinned)
Download bandwidth             2809 MB/sec (3350 MB/sec pinned)

Device Number:                 1
Device Name:                   Quadro FX 580
Device Revision Number:        1.1
Global Memory Size:            536150016
Number of Multiprocessors:     4
Number of Cores:               32
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 16384
Registers per Block:           8192
Warp Size:                     32
Maximum Threads per Block:     512
Maximum Block Dimensions:      512 x 512 x 64
Maximum Grid Dimensions:       65535 x 65535 x 1
Maximum Memory Pitch:          262144B
Texture Alignment              256B
Clock Rate:                    1125 MHz
Initialization time:           10974 microseconds
Current free memory            491466752
Upload time (4MB)              1194 microseconds ( 854 ms pinned)
Download time                  2404 microseconds (2164 ms pinned)
Upload bandwidth               3512 MB/sec (4911 MB/sec pinned)
Download bandwidth             1744 MB/sec (1938 MB/sec pinned)

Device Number:                 2
Device Name:                   Tesla C1060
Device Revision Number:        1.3
Global Memory Size:            4294705152
Number of Multiprocessors:     30
Number of Cores:               240
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 16384
Registers per Block:           16384
Warp Size:                     32
Maximum Threads per Block:     512
Maximum Block Dimensions:      512 x 512 x 64
Maximum Grid Dimensions:       65535 x 65535 x 1
Maximum Memory Pitch:          262144B
Texture Alignment              256B
Clock Rate:                    1296 MHz
Initialization time:           10974 microseconds

Hi Rob,

I don’t have any great insight here but it does seem to be a hardware issue.

I talked with one of our IT people. The only times he’s seen this type of behavior are when there’s not enough power to the device, when another process is already running on the device, or when the device is hung. I would try the following:

  1. Reboot the system (to see if a device is hung).
  2. Set your program to use just device 2 or 3 (can you even access those cards?).
  3. Swap devices 0 and 2 (is device 2 just bad?).
  4. Move the Quadro to the last slot (is the ordering the problem?).
  5. Remove all but one card, then add them back one at a time (is power the problem?).
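For step 2, a minimal single-device smoke test might look like this (a sketch only: it sets one device and round-trips a small array through device memory, with no dependence on prec_mod):

```fortran
program devtest
   use cudafor
   implicit none
   integer      :: ierr, idev
   real         :: a(4), b(4)
   real, device :: a_d(4)

   idev = 2                      ! try device 2 (or 3) on its own
   ierr = cudaSetDevice(idev)
   a    = (/ 1.0, 2.0, 3.0, 4.0 /)
   a_d  = a                      ! host -> device copy
   b    = a_d                    ! device -> host copy
   print *, 'Device', idev, 'round trip ok:', all(b == a)
end program devtest
```

If this hangs on device 2 or 3 but not on device 0, that points at the card or slot rather than the OpenMP code.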
  • Mat

Hi Mat.

Oddly, I can access devices 2 and 3, and run single-card algorithms on them. I suppose it could be a power issue when I try to run on more than two devices at once, although that would not explain why pgaccelinfo hangs. Our IT guys have all gone home for the weekend - I’ll get them to try everything you suggest next week.

Have a nice weekend yourself,

Rob.

Hi Mat. Rebooting the machine fixed my problem - thanks for the pointer. However, I still have an issue - see my next thread.