saxpy NVIDIA website example fails on GPU cluster

Hi,

I am trying to get the small example from https://devblogs.nvidia.com/easy-introduction-cuda-fortran/ to run on our GPU cluster with P100 GPUs.

I took the example as-is from the website, added the MPI initialization code, and compiled it using mpif90.
We are using MVAPICH2-GDR 2.3a.

mpif90 --version
pgf90 17.10-0 64-bit target on x86-64 Linux -tp px
PGI Compilers and Tools
Copyright (c) 2017, NVIDIA CORPORATION.  All rights reserved.

The program is compiled with the command:

mpif90 -o saxp saxpy.cuf



module mathOps
contains
  attributes(global) subroutine saxpy(x, y, a)
    implicit none
    real :: x(:), y(:)
    real, value :: a
    integer :: i, n
    n = size(x)
    i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
    if (i <= n) y(i) = y(i) + a*x(i)
  end subroutine saxpy
end module mathOps

program testSaxpy
  use mpi
  use mathOps
  use cudafor
  implicit none
  integer, parameter :: N = 4000
  integer ierr, npe0, iam0
  real :: x(N), y(N), a
  real, device :: x_d(N), y_d(N)
  type(dim3) :: grid, tBlock

  call MPI_Init(ierr)
  write(*,*) "Programstart"
  call MPI_Comm_size(mpi_comm_world,npe0,ierr)
  call MPI_Comm_rank(mpi_comm_world,iam0,ierr)

  ierr = cudaSetDevice(0)

  tBlock = dim3(256,1,1)
  grid = dim3(ceiling(real(N)/tBlock%x),1,1)

  x = 1.0; y = 2.0; a = 2.0
  x_d = x
  y_d = y
  call saxpy<<<grid, tBlock>>>(x_d, y_d, a)
  y = y_d
  write(*,*) 'Max error: ', maxval(abs(y-4.0))

  call MPI_Finalize(ierr)
end program testSaxpy
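As an aside on the launch configuration: `ceiling(real(N)/tBlock%x)` is just round-up integer division. The same arithmetic, as a quick shell sketch:

```shell
# Round-up division: how many 256-thread blocks are needed to cover N elements.
N=4000
BLOCK=256
GRID=$(( (N + BLOCK - 1) / BLOCK ))
echo "grid=$GRID"   # 16 blocks; 16*256 = 4096 threads cover 4000 elements
```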

However, when I run the program it fails with a segmentation fault.
If I comment out the kernel launch and the copy operations, it runs fine:

  !x_d = x
  !y_d = y
  !call saxpy<<<grid, tBlock>>>(x_d, y_d, a)
  !y = y_d

cudaSetDevice() returns 0 (cudaSuccess).

What could cause the error?

Thank you for your help!

Your code works for me, but I am not running MVAPICH. Does the CUDA Fortran code work without the MPI calls in it?

Do you have more than one GPU? Run “pgaccelinfo” and post the output here.

Thank you for your reply!

Without the MPI code it seems to work correctly.

I’m running this on a GPU cluster with 64 GPU nodes, managed by a PBS batch system. In the job script I selected just a single GPU node (4 GPUs):

CUDA Driver Version:           9010
NVRM version:                  NVIDIA UNIX x86_64 Kernel Module  390.12  Wed Dec 20 07:19:16 PST 2017

Device Number:                 0
Device Name:                   Tesla P100-SXM2-16GB
Device Revision Number:        6.0
Global Memory Size:            17071734784
Number of Multiprocessors:     56
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1480 MHz
Execution Timeout:             No
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   Yes
Memory Clock Rate:             715 MHz
Memory Bus Width:              4096 bits
L2 Cache Size:                 4194304 bytes
Max Threads Per SMP:           2048
Async Engines:                 5
Unified Addressing:            Yes
Managed Memory:                Yes
PGI Compiler Option:           -ta=tesla:cc60

Device Number:                 1
Device Name:                   Tesla P100-SXM2-16GB
Device Revision Number:        6.0
Global Memory Size:            17071734784
Number of Multiprocessors:     56
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1480 MHz
Execution Timeout:             No
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   Yes
Memory Clock Rate:             715 MHz
Memory Bus Width:              4096 bits
L2 Cache Size:                 4194304 bytes
Max Threads Per SMP:           2048
Async Engines:                 5
Unified Addressing:            Yes
Managed Memory:                Yes
PGI Compiler Option:           -ta=tesla:cc60

Device Number:                 2
Device Name:                   Tesla P100-SXM2-16GB
Device Revision Number:        6.0
Global Memory Size:            17071734784
Number of Multiprocessors:     56
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1480 MHz
Execution Timeout:             No
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   Yes
Memory Clock Rate:             715 MHz
Memory Bus Width:              4096 bits
L2 Cache Size:                 4194304 bytes
Max Threads Per SMP:           2048
Async Engines:                 5
Unified Addressing:            Yes
Managed Memory:                Yes
PGI Compiler Option:           -ta=tesla:cc60

Device Number:                 3
Device Name:                   Tesla P100-SXM2-16GB
Device Revision Number:        6.0
Global Memory Size:            17071734784
Number of Multiprocessors:     56
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1480 MHz
Execution Timeout:             No
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   Yes
Memory Clock Rate:             715 MHz
Memory Bus Width:              4096 bits
L2 Cache Size:                 4194304 bytes
Max Threads Per SMP:           2048
Async Engines:                 5
Unified Addressing:            Yes
Managed Memory:                Yes
PGI Compiler Option:           -ta=tesla:cc60


Hi Peter85,

Does the code also work if using a single rank?

I’m wondering if the GPUs are set to exclusive mode. If so, then I’d expect 1 rank to work, but the code to fail with multiple ranks. Granted, I would expect a different failure, “all CUDA-capable devices are busy or unavailable”, instead of a segv, so this might not be the issue.

To check if you’re running in exclusive mode or not, from the same script that runs the saxp binary, run “nvidia-smi” and look for the “Compute M.” field.
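For a terser check, nvidia-smi can also query just the compute mode field directly (a sketch using the standard nvidia-smi query flags; the mode change requires admin rights):

```shell
# Query only the compute mode of each GPU
# (values: Default, Exclusive_Process, or Prohibited).
nvidia-smi --query-gpu=index,compute_mode --format=csv
# An administrator can switch a GPU back to the default (shared) mode with:
#   sudo nvidia-smi -i 0 -c DEFAULT
```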

% nvidia-smi
Mon Aug  6 13:45:55 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 0000:02:00.0     Off |                    0 |
| N/A   36C    P0    25W / 250W |      0MiB / 16276MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  On   | 0000:82:00.0     Off |                    0 |
| N/A   44C    P0    26W / 250W |      0MiB / 16276MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

I think we have solved the problem.

It seems there was a version mix-up.

We have eight different PGI compiler versions installed on the cluster, ranging from 16.10 up to 18.1.

The MVAPICH2-GDR installation was built with pgi/18.1.

With pgi/18.1 and MVAPICH2-GDR the program runs fine.
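For anyone hitting the same thing: MPICH-derived stacks such as MVAPICH2 let the compiler wrapper print the underlying compiler command it invokes, which makes this kind of mix-up easy to spot (a sketch; module and path details will differ per cluster):

```shell
# Show the underlying compiler command the MPI wrapper calls
# (supported by MPICH/MVAPICH2 wrappers).
mpif90 -show
# Confirm the pgf90 found first in PATH matches the version
# MVAPICH2-GDR was built against.
which pgf90
pgf90 --version
```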

Thank you for all your help!