no devices detected

In my MPI code I assign GPUs to MPI processes with:

        #ifdef _OPENACC
          call acc_init(acc_device_default)
          dtype=acc_get_device_type()
          numdevices = acc_get_num_devices(acc_device_nvidia)
          print *, "device type=", dtype
          print *, "mpi rank = ", MyId
          print *, "# devices on my node = ",numdevices
          mydevice = mod(MyId,numdevices)
          call acc_set_device_num(mydevice,acc_device_nvidia)
        #endif

At run time, my print messages show that I am not detecting any GPUs (I run on 8 nodes, 1 MPI process per node):

 device type=            0
 mpi rank =             0
 # devices on my node =             0

 device type=            0
 mpi rank =             4
 # devices on my node =             0

 device type=            0
 mpi rank =             5
 # devices on my node =             0

etc.

However, pgaccelinfo has no trouble finding the GPU:

-bash-3.2$ pgaccelinfo
CUDA Driver Version:           5050
NVRM version: NVIDIA UNIX x86_64 Kernel Module  319.23  Thu May 16 19:36:02 PDT 2013

Device Number:                 0
Device Name:                   Tesla C1060
Device Revision Number:        1.3
Global Memory Size:            4294770688
Number of Multiprocessors:     30
Number of Cores:               240
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 16384
Registers per Block:           16384
Warp Size:                     32
Maximum Threads per Block:     512
Maximum Block Dimensions:      512, 512, 64
Maximum Grid Dimensions:       65535 x 65535 x 1
Maximum Memory Pitch:          2147483647B
Texture Alignment:             256B
Clock Rate:                    1296 MHz
Execution Timeout:             No
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            No
ECC Enabled:                   No
Memory Clock Rate:             800 MHz
Memory Bus Width:              512 bits
Max Threads Per SMP:           1024
Async Engines:                 1
Unified Addressing:            No
Initialization time:           657481 microseconds
Current free memory:           4237299456
Upload time (4MB):             1153 microseconds ( 726 ms pinned)
Download time:                 1053 microseconds ( 772 ms pinned)
Upload bandwidth:              3637 MB/sec (5777 MB/sec pinned)
Download bandwidth:            3983 MB/sec (5433 MB/sec pinned)

Removing the assignment code altogether (since there's only 1 GPU per node anyway) still shows that I am having trouble detecting the GPU:

call to cuInit returned error 100: No device
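
For context, even without any explicit device-selection calls, the OpenACC runtime still initializes a device the first time an accelerator construct executes, which is where this cuInit call comes from. A minimal kernel along the following lines (a hypothetical sketch, not my actual code) would hit the same error when no device is visible:

! Hypothetical minimal OpenACC kernel; the first accelerator construct
! triggers device initialization (the cuInit call) at run time.
      program acc_smoke
      implicit none
      integer :: i
      real :: a(1000)
!$acc kernels
      do i = 1, 1000
        a(i) = real(i)
      end do
!$acc end kernels
      print *, a(1000)
      end program acc_smoke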

Are there any common causes of this sort of behavior? I remembered to include "use openacc" in my code this time.

Hi Ben,

I have no idea. "acc_get_num_devices" would only return 0 if the system doesn't have a GPU or the GPU isn't enabled. Are you submitting your job via qsub? Could it be giving you some non-GPU-enabled nodes? Maybe the environment isn't being set up (i.e., the CUDA driver isn't loaded)?
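
One quick way to check whether some of your nodes are non-GPU-enabled (a hypothetical diagnostic sketch, not something I've run on Dirac) is to have every rank print its host name next to the device count it sees, so any GPU-less hosts show up directly in the job output:

! Hypothetical per-rank probe: each rank reports its host name and how many
! NVIDIA devices the OpenACC runtime can see on that host.
      program gpu_probe
      use openacc
      implicit none
      include 'mpif.h'
      integer :: ierr, myid, namelen, ndev
      character(len=MPI_MAX_PROCESSOR_NAME) :: hostname

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_GET_PROCESSOR_NAME(hostname, namelen, ierr)

      ndev = acc_get_num_devices(acc_device_nvidia)
      print *, 'rank ', myid, ' on ', hostname(1:namelen), &
               ' sees ', ndev, ' NVIDIA device(s)'

      call MPI_FINALIZE(ierr)
      end program gpu_probe

Launched with one rank per node (for example, mpirun -np <nodes> ./gpu_probe), this should show immediately whether particular hosts are the ones reporting zero devices.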

While I don't think it will help, here's the code I use to set devices in MPI.

#ifdef _OPENACC

function setDevice(nprocs,myrank)

  use iso_c_binding
  use openacc
  implicit none
  include 'mpif.h'

  interface
    function gethostid() BIND(C)
      use iso_c_binding
      integer (C_INT) :: gethostid
    end function gethostid
  end interface

  integer :: nprocs, myrank
  integer, dimension(nprocs) :: hostids, localprocs
  integer :: hostid, ierr, numdev, mydev, i, numlocal
  integer :: setDevice

! get the hostids so we can determine what other processes are on this node
  hostid = gethostid()
  CALL mpi_allgather(hostid,1,MPI_INTEGER,hostids,1,MPI_INTEGER, &
                     MPI_COMM_WORLD,ierr)

! determine which processes are on this node
  numlocal=0
  localprocs=0
  do i=1,nprocs
    if (hostid .eq. hostids(i)) then
      localprocs(i)=numlocal
      numlocal = numlocal+1
    endif
  enddo

! get the number of devices on this node
  numdev = acc_get_num_devices(ACC_DEVICE_NVIDIA)

  if (numdev .lt. 1) then
    print *, 'ERROR: There are no devices available on this host.  &
              ABORTING.', myrank
    stop
  endif

! print a warning if the number of devices is less than the number
! of processes on this node.  Having multiple processes share devices is not   
! recommended. 
  if (numdev .lt. numlocal) then
   if (localprocs(myrank+1).eq.1) then
     ! print the message only once per node
   print *, 'WARNING: The number of process is greater then the number  &
             of GPUs.', myrank
   endif
   mydev = mod(localprocs(myrank+1),numdev)
  else
   mydev = localprocs(myrank+1)
  endif

 call acc_set_device_num(mydev,ACC_DEVICE_NVIDIA)
 call acc_init(ACC_DEVICE_NVIDIA)
 setDevice = mydev

end function setDevice
#endif
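
For completeness, here is a minimal caller sketch (hypothetical program name and layout): the function result has to be declared INTEGER in the calling scope, since otherwise implicit typing would treat setDevice as REAL, and it should be called after MPI_Init so the rank and size are valid.

! Hypothetical caller sketch for the setDevice function above.
      program use_setdevice
      implicit none
      include 'mpif.h'
      integer :: ierr, myid, numprocs, devnum
#ifdef _OPENACC
      integer :: setDevice
#endif

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)

#ifdef _OPENACC
      devnum = setDevice(numprocs, myid)   ! bind this rank to a local GPU
#endif

      call MPI_FINALIZE(ierr)
      end program use_setdevice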

Hi Mat, thanks for the feedback.

I made a small reproducing example. All it does is assign GPUs in an MPI environment with your setDevice routine. As expected, the code aborts because it detects 0 devices. But interestingly, your CUDA version of setDevice (found on the PGI site) works. I am wondering if this problem is specific to Dirac, as that is the only system I've tested on. The two codes are pasted below.

Compiled with: mpif90 -Mcuda -o testmpiv4 testmpiv4.F90
Job submitted with: qsub -I -V -q dirac_reg -l walltime=10:00 -l nodes=2:ppn=1
Run with: mpirun -np 2 ./testmpiv4
And modules: pgi/12.3, pgi-gpu/12.3

! GPU assignment done by setDevice function utilizing CUDA
! correctly assigns GPUs across multiple nodes
      program testmpiv4
      use cudafor
      include "mpif.h"

      integer ierr, myid,numprocs
      integer devnum, setDevice

      call MPI_INIT(ierr)

      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)

! assign gpu
      numdev=1
      ierr = cudaGetDeviceCount(numdev)
      if (ierr.ne.0) then
        print*,cudaGetErrorString(ierr)
        stop
      endif
      if(numdev.lt.1) then
        print *, 'ERROR:NO DEVICES FOUND.'
        stop
      endif
      devnum=setDevice(numprocs, myid)
      ierr = cudaSetDevice(devnum)

      call MPI_FINALIZE(ierr)
      end

!cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
! Mat's setDevice function
!cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
        function setDevice(nprocs,myrank)
          use iso_c_binding
          use cudafor
          implicit none
          include "mpif.h"

          interface
            function gethostid() BIND(C)
              use iso_c_binding
              integer (C_INT) :: gethostid
            end function gethostid
          end interface

          integer :: nprocs, myrank
          integer, dimension(nprocs) :: hostids, localprocs
          integer :: hostid, ierr, numdev, mydev, i, numlocal
          integer :: setDevice

        ! get the hostids so we can determine what other processes are on this node
        hostid = gethostid()
        CALL mpi_allgather(hostid,1,MPI_INTEGER,hostids,1,MPI_INTEGER, &
                             MPI_COMM_WORLD,ierr)
        ! determine which processes are on this node
          numlocal=0
          localprocs=0
          do i=1,nprocs
            if (hostid .eq. hostids(i)) then
              localprocs(i)=numlocal
              numlocal = numlocal+1
            endif
          enddo

        ! get the number of devices on this node
          ierr = cudaGetDeviceCount(numdev)
          print*,"the number of devices on my node is ", numdev

          if (numdev .lt. 1) then
            print *, 'ERROR:no devices available on this host.  &
                      ABORTING.', myrank
            stop
          endif

        ! print a warning if the number of devices is less than the number
        ! of processes on this node.  Having multiple processes share devices is not   
        ! recommended. 
          if (numdev .lt. numlocal) then
           if (localprocs(myrank+1).eq.1) then
             ! print the message only once per node
           print *, 'WARNING:# of process is greater then the number  &
                     of GPUs.', myrank
           endif
           mydev = mod(localprocs(myrank+1),numdev)
          else
           mydev = localprocs(myrank+1)
          endif

         ierr = cudaSetDevice(mydev)
         setDevice = mydev

        end function setDevice

Compiled with: mpif90 -acc -o testmpiv5 testmpiv5.F90
Job submitted with: qsub -I -V -q dirac_reg -l walltime=10:00 -l nodes=2:ppn=1
Run with: mpirun -np 2 ./testmpiv5
And modules: pgi/12.3, pgi-gpu/12.3

! GPU assignment done by setDevice function utilizing OACC
! fails to detect GPUs across multiple nodes
      program testmpiv5
      include "mpif.h"

      integer ierr, myid,numprocs
      #ifdef _OPENACC
        integer devnum, setDevice
      #endif

      call MPI_INIT(ierr)

      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)

      #ifdef _OPENACC
        devnum=setDevice(numprocs, myid)
      #endif
      print *, "my dev is ",devnum

      call MPI_FINALIZE(ierr)
      end

!cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
! Mat's setDevice function
!cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
        #ifdef _OPENACC

        function setDevice(nprocs,myrank)

          use iso_c_binding
          use openacc
          implicit none
          include "mpif.h"
          interface
            function gethostid() BIND(C)
              use iso_c_binding
              integer (C_INT) :: gethostid
            end function gethostid
          end interface

          integer :: nprocs, myrank
          integer, dimension(nprocs) :: hostids, localprocs
          integer :: hostid, ierr, numdev, mydev, i, numlocal
          integer :: setDevice

        ! get the hostids so we can determine what other processes are on this node
        hostid = gethostid()
        CALL mpi_allgather(hostid,1,MPI_INTEGER,hostids,1,MPI_INTEGER, &
                             MPI_COMM_WORLD,ierr)
        ! determine which processes are on this node
          numlocal=0
          localprocs=0
          do i=1,nprocs
            if (hostid .eq. hostids(i)) then
              localprocs(i)=numlocal
              numlocal = numlocal+1
            endif
          enddo

        ! get the number of devices on this node
          numdev = acc_get_num_devices(ACC_DEVICE_NVIDIA)
          print*,"the number of devices on my node is ", numdev

          if (numdev .lt. 1) then
            print *, 'ERROR:no devices available on this host.  &
                      ABORTING.', myrank
            stop
          endif

        ! print a warning if the number of devices is less than the number
        ! of processes on this node.  Having multiple processes share devices is not   
        ! recommended. 
          if (numdev .lt. numlocal) then
           if (localprocs(myrank+1).eq.1) then
             ! print the message only once per node
           print *, 'WARNING:# of process is greater then the number  &
                     of GPUs.', myrank
           endif
           mydev = mod(localprocs(myrank+1),numdev)
          else
           mydev = localprocs(myrank+1)
          endif

         call acc_set_device_num(mydev,ACC_DEVICE_NVIDIA)
         call acc_init(ACC_DEVICE_NVIDIA)
         setDevice = mydev

        end function setDevice
        #endif

Hi Ben,

I’ll get on Dirac later today, but it might just be because you’re using 12.3. Can you try again with 12.9? OpenACC wasn’t fully supported until 12.6.
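
If it helps narrow things down, one way to confirm which OpenACC level the compiler enables is to print the _OPENACC macro, which expands to a yyyymm date (this is a hypothetical check, assuming the source file is preprocessed):

! Hypothetical version check: _OPENACC is replaced by a yyyymm date
! (e.g. 201111 for OpenACC 1.0) when the compiler enables OpenACC.
      program acc_version
      implicit none
#ifdef _OPENACC
      print *, 'OpenACC version macro: ', _OPENACC
#else
      print *, 'compiled without OpenACC support'
#endif
      end program acc_version

Compile it as a .F90 file (or with -Mpreprocess) so the macro gets expanded, e.g. pgf90 -acc accversion.F90.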

  • Mat

Hi Ben,

Don’t you need to add “:fermi” or “:tesla” to the end of your qsub command to get a GPU? (See: http://www.nersc.gov/users/computational-systems/dirac/running-jobs/batch/)

While I did this in interactive mode, “qsub -I -V -q dirac_int -l nodes=1:ppn=8:fermi”, the command worked fine with 12.3.

  • Mat

[@dirac41 ~/tests]$ mpif90 -V

pgf90 12.3-0 64-bit target on x86-64 Linux -tp nehalem
Copyright 1989-2000, The Portland Group, Inc. All Rights Reserved.
Copyright 2000-2012, STMicroelectronics, Inc. All Rights Reserved.
[@dirac41 ~/tests]$ mpif90 -acc -Minfo=accel setdevice.F90
[@dirac41 ~/tests]$ mpirun -np 1 ./a.out
the number of devices on my node is 1
my dev is 0
[@dirac41 ~/tests]$ module list
Currently Loaded Modulefiles:

  1) modules               3) moab/7.2.3-r11-b103    5) pgi-gpu/12.3     7) altd/1.0
  2) nsg/1.2.0             4) torque/4.2.3.1         6) openmpi/1.4.5    8) usg-default-modules/1.0

Hi Mat,

My understanding is that you only need to add :fermi or :tesla if you want to specify one of those types of GPUs. Otherwise, it will automatically assign you nodes with either Fermi or Tesla GPUs.

You didn’t get the error because you only ran with one MPI process. Running with -np 2, in which case you use multiple nodes, should reveal the error. For some reason the GPU assignment works correctly for a single node but fails for multiple nodes. To use 2 nodes and 1 MPI process per node, you’ll have to submit a job with
qsub -I -V -q dirac_reg -l walltime=10:00 -l nodes=2:ppn=1
and then run with
mpirun -np 2 ./a.out

Also, I tried it with 12.9 and unfortunately that didn’t fix it.

Sorry, I should have been more clear.
Ben

Hi Ben,

I think you’ll need to contact the Dirac admins. It seems like a configuration issue with the system rather than a problem with the code: it succeeds on a single node but fails across multiple nodes.

For example, using the 4-GPU “mfermi” system seems to work:

$ cat run_device.pbs 
#PBS -l nodes=1:ppn=8:mfermi
#PBS -l walltime=00:02:00
#PBS -N test_setdevice
#PBS -q dirac_int
#PBS -V
cd $PBS_O_WORKDIR
mpirun -np 4 ./a.out
$ cat test_setdevice.o5859309 
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
 the number of devices on my node is             4
 my dev is             0
 the number of devices on my node is             4
 the number of devices on my node is             4
 my dev is             3
 the number of devices on my node is             4
 my dev is             2
 my dev is             1

----------------------------------------------------------------
Jobs exit status code is 0
Job test_setdevice/5859309.cvrsvc09-ib completed Tue Jul 16 12:57:12 PDT 2013

Using a single node also works:

$ cat run_device.pbs
#PBS -l nodes=1:ppn=4:fermi
#PBS -l walltime=00:02:00
#PBS -N test_setdevice
#PBS -q dirac_int
#PBS -V
cd $PBS_O_WORKDIR
mpirun -np 2 ./a.out
$ cat test_setdevice.o5859343
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
 the number of devices on my node is             1
 the number of devices on my node is             1
 WARNING:# of process is greater then the number  of GPUs.            1
 my dev is             0
 my dev is             0

----------------------------------------------------------------
Jobs exit status code is 0

Interestingly, if I request two nodes but only use one of them, then it’s fine:

$ cat run_device.pbs 
#PBS -l nodes=2:ppn=8:fermi
#PBS -l walltime=00:02:00
#PBS -N test_setdevice
#PBS -q dirac_reg
#PBS -V
cd $PBS_O_WORKDIR
mpirun -np 8 ./a.out
$ cat test_setdevice.o5859702
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
 the number of devices on my node is             1
 the number of devices on my node is             1
 the number of devices on my node is             1
 the number of devices on my node is             1
 WARNING:# of process is greater then the number  of GPUs.            1
 the number of devices on my node is             1
 the number of devices on my node is             1
 the number of devices on my node is             1
 the number of devices on my node is             1
 my dev is             0
 my dev is             0
 my dev is             0
 my dev is             0
 my dev is             0
 my dev is             0
 my dev is             0
 my dev is             0

Use both nodes and it fails:

$ cat run_device.pbs
#PBS -l nodes=2:ppn=8:fermi
#PBS -l walltime=00:02:00
#PBS -N test_setdevice
#PBS -q dirac_reg
#PBS -V
cd $PBS_O_WORKDIR
mpirun -np 12 ./a.out
$ cat test_setdevice.o5859682
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
 the number of devices on my node is             0
 ERROR:no devices available on this host.  ABORTING.            3
 my dev is         11110
 the number of devices on my node is             0
 ERROR:no devices available on this host.  ABORTING.            7
 my dev is         10936
 the number of devices on my node is             0
 ERROR:no devices available on this host.  ABORTING.           11
 my dev is         11083
 the number of devices on my node is             0
 ERROR:no devices available on this host.  ABORTING.            9
 my dev is         11153
 the number of devices on my node is             0
 ERROR:no devices available on this host.  ABORTING.           10
 my dev is         11026
 the number of devices on my node is             0
 ERROR:no devices available on this host.  ABORTING.            6
 my dev is         11094
 the number of devices on my node is             0
 ERROR:no devices available on this host.  ABORTING.            1
 my dev is         10924
 the number of devices on my node is             0
 ERROR:no devices available on this host.  ABORTING.            2
 my dev is         10947
 the number of devices on my node is             0
 ERROR:no devices available on this host.  ABORTING.            5
 my dev is         11084
 the number of devices on my node is             0
 ERROR:no devices available on this host.  ABORTING.            0
 my dev is         11058
 the number of devices on my node is             0
 ERROR:no devices available on this host.  ABORTING.            8
 my dev is         11055
 the number of devices on my node is             0
 ERROR:no devices available on this host.  ABORTING.            4
 my dev is         10928

The same nodes were used in both runs.

  • Mat