Problem with Using Allocatable Device Variable inside Module

Hi, all.

I have a problem with using an allocatable device variable declared inside a module. A minimal example is here:

module cudamod 
   implicit none 
   integer, device, allocatable, dimension(:) :: int_d 
end module cudamod 

program fcuda 
   use cudafor 
   use cudamod 
   implicit none 
   allocate(int_d(16)) ! error on this line
end program fcuda

and the result is:

$ pgf90 test.cuf -Mcuda
$ ./a.out 
0: ALLOCATE: copyin Symbol Memcpy FAILED:13(invalid device symbol)

There is an old topic regarding this problem: https://forums.developer.nvidia.com/t/mapping-between-openacc-and-cuda-parallelism-levels/134450/1
It seems like this bug has appeared in random versions.
If I recall correctly, this code worked on 17.10.
However, I can't confirm that since I currently only have the 18.10 community edition.

In short, my questions are:

  1. Is this a bug or intended limitation?
  2. Is there any workaround?
  3. Is there a way to download a previous community edition? The archive seems to only have the professional edition.
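Regarding question 2, the only workaround I can think of is to avoid the module-level device symbol entirely. A minimal sketch (untested on the failing machine, and assuming the failure is tied to registering the module symbol) would be:

```fortran
! Hypothetical workaround sketch: declare the device array in the
! program unit instead of in a module, so no module-level device
! symbol needs to be registered at program startup.
program fcuda
   use cudafor
   implicit none
   integer, device, allocatable, dimension(:) :: int_d
   integer :: int_h(16)

   allocate(int_d(16))   ! plain device allocation, no symbol copyin
   int_h = 1
   int_d = int_h         ! host-to-device copy via assignment
   deallocate(int_d)
end program fcuda
```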

Thanks in advance.

Hi HeeHoon Kim,

What kind of NVIDIA device are you using? (as seen from running the PGI utility “pgaccelinfo”)

I can only recreate this error when compiling for one device but running on another. For example, if I compile for Pascal (cc60) but try running it on a Volta (cc70).

Another possibility is that there is a problem with your device. (pgaccelinfo will show this as well)
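To double-check the device from CUDA Fortran itself, a minimal sanity check might look like the following sketch (using the cudaGetDeviceCount and cudaGetErrorString routines from the cudafor module):

```fortran
! Sketch: query the CUDA runtime directly to confirm the device
! is visible and the driver/runtime pairing is healthy.
program devcheck
   use cudafor
   implicit none
   integer :: istat, ndev

   istat = cudaGetDeviceCount(ndev)
   print *, 'status : ', trim(cudaGetErrorString(istat))
   print *, 'devices: ', ndev
end program devcheck
```

If this reports a non-success status or zero devices, the problem is with the device or driver rather than the compiler.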

-Mat

Hi, Mat

I’m using a Tesla V100, and here is the pgaccelinfo output:

$ pgaccelinfo

CUDA Driver Version:           10000
NVRM version:                  NVIDIA UNIX x86_64 Kernel Module  410.48  Thu Sep  6 06:36:33 CDT 2018

Device Number:                 0
Device Name:                   Tesla V100-PCIE-16GB
Device Revision Number:        7.0
Global Memory Size:            16914055168
Number of Multiprocessors:     80
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1380 MHz
Execution Timeout:             No
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   Yes
Memory Clock Rate:             877 MHz
Memory Bus Width:              4096 bits
L2 Cache Size:                 6291456 bytes
Max Threads Per SMP:           2048
Async Engines:                 7
Unified Addressing:            Yes
Managed Memory:                Yes
Concurrent Managed Memory:     Yes
Preemption Supported:          Yes
Cooperative Launch:            Yes
  Multi-Device:                Yes
PGI Default Target:            -ta=tesla:cc70

Actually, we have several GPUs and all of them show the same behaviour, so I doubt that it is a device problem.

I tried several target flag combinations with no success.

$ pgf90 -Mcuda=cuda10.0,cc70 test.cuf 
$ ./a.out 
0: ALLOCATE: copyin Symbol Memcpy FAILED:13(invalid device symbol)
$ pgf90 -ta=tesla:cc70 test.cuf 
$ ./a.out 
0: ALLOCATE: copyin Symbol Memcpy FAILED:13(invalid device symbol)
$ pgf90 -Mcuda=cuda10.0,cc70 -ta=tesla:cc70 test.cuf 
$ ./a.out 
0: ALLOCATE: copyin Symbol Memcpy FAILED:13(invalid device symbol)

Hi HeeHoon Kim,

Then, I’m not sure what’s wrong since it works fine for me on a very similar system.

Can you try compiling some basic CUDA C examples with nvcc to see if they run? How about a simple OpenACC program? Other CUDA Fortran programs?

(We ship example OpenACC and CUDA Fortran programs under the $PGI/2018/examples directory. For CUDA C, look in /opt/cuda-10.0/samples.)


Note that the similar error from the link you posted was from 2011 and only occurred on Mac OS X. That error was fixed in the 11.4 release and is part of our nightly QA testing, so it should not reoccur. (Note that Macs no longer use NVIDIA GPUs, so this testing is on Linux and Windows only.)

Here’s the output from my compilation of your code on a similar system. The only way I can match your error is if I compile targeting a P100 but try running it on a V100. Is there any additional information that might help me recreate this error?


% pgaccelinfo

CUDA Driver Version:           10000
NVRM version:                  NVIDIA UNIX x86_64 Kernel Module  410.48  Thu Sep  6 06:36:33 CDT 2018

Device Number:                 0
Device Name:                   Tesla V100-PCIE-16GB
Device Revision Number:        7.0
Global Memory Size:            16914055168
Number of Multiprocessors:     80
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1380 MHz
Execution Timeout:             No
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   Yes
Memory Clock Rate:             877 MHz
Memory Bus Width:              4096 bits
L2 Cache Size:                 6291456 bytes
Max Threads Per SMP:           2048
Async Engines:                 7
Unified Addressing:            Yes
Managed Memory:                Yes
Concurrent Managed Memory:     Yes
Preemption Supported:          Yes
Cooperative Launch:            Yes
  Multi-Device:                Yes
PGI Default Target:            -ta=tesla:cc70

% cat test.cuf
module cudamod
   implicit none
   integer, device, allocatable, dimension(:) :: int_d
end module cudamod

program fcuda
   use cudafor
   use cudamod
   implicit none
   allocate(int_d(16)) ! error on this line
end program fcuda
% pgf90 -V

pgf90 18.10-1 64-bit target on x86-64 Linux -tp haswell
PGI Compilers and Tools
Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
% pgf90 -Mcuda=cuda10.0,cc70 test.cuf ; ./a.out
% pgf90 -Mcuda=cuda10.0,cc60 test.cuf ; ./a.out
0: ALLOCATE: copyin Symbol Memcpy FAILED:13(invalid device symbol)

-Mat

Hi Mat.

After reading your reply, I did more tests.
I found out that:

  1. CUDA C programs (built with nvcc) have no problem.
  2. The OpenACC and CUDA Fortran programs in the PGI examples do not run (kernel launch failure, invalid device symbol, etc.).

So I set up another machine with a V100, and surprisingly it works fine!

I dug in to find the difference using the -### flag, and found that the only difference was the -tp flag.

The problematic machine is equipped with an Intel® Xeon Phi™ CPU 7290, so -tp knl was automatically applied.

I overrode the flag with -tp px, and it works now.

$ pgf90 test.cuf -tp px ; ./a.out 
# no error here
$ pgf90 test.cuf -tp knl ; ./a.out 
0: ALLOCATE: copyin Symbol Memcpy FAILED:13(invalid device symbol)

I wonder if it is some kind of conflict between CUDA and KNL…

Anyway, thanks for helping!

Interesting.

I was able to reproduce the error on a Skylake system when using “-tp knl” (using -tp skylake is fine). It looks like we’re missing the needed device data constructors when targeting KNL systems. We don’t officially support KNL since it was discontinued while we were still working on adding support for it.

I went ahead and added a problem report (TPR#26784). Not sure we’ll fix it given KNL’s demise, but I would like our engineers to understand why the needed constructor code was not added.

Note that you might try using “-tp skylake” or “-tp haswell” instead of “px”, since “px” won’t take advantage of any architectural features such as SSE or AVX.

Thanks,
Mat

TPR #26784 should be fixed as of PGI release 19.3