We tried to run an OpenACC code on an IBM Power9+A100 system but got this error:
Line 144: cudaLaunchKernel returned status 98: invalid device function
We used both hpc-sdk 21.1 and 20.11.
The code runs fine on other Intel-based CPU + P100/V100 systems, and it also worked on an IBM Power9+A100 system a while ago with the PGI 19.x compiler.
The error cannot be reproduced with the mini-app when using hpc-sdk 21.1 (and its built-in CUDA 10.2). However, a different error message appears when the hpc-sdk 21.1 and cuda 11.1 modules are loaded:
line 144: cudaLaunchKernel returned status 3: initialization error
This usually occurs when there’s a mismatch between the CUDA version or target device the binary was compiled for and the CUDA driver or device on which you’re attempting to run.
What CUDA driver is installed on the system where you see the error? Since you say it works when you build with CUDA 10.2, in the failing case are you compiling with CUDA 11 while the CUDA driver is only 10.2?
If you don’t know the driver version, please run the “nvaccelinfo” utility on the failing system. This will show which driver version is installed.
Sorry, that wasn’t clear. For the full production code, the error always occurs with hpc-sdk/20.11 and 21.1, regardless of whether we use the built-in CUDA or the cuda/11.1 module.
please run the “nvaccelinfo” utility on the failing system
Here is the output of nvaccelinfo:
$ nvaccelinfo
CUDA Driver Version: 10020
NVRM version: NVIDIA UNIX ppc64le Kernel Module 440.64.00 Wed Feb 26 16:01:28 UTC 2020
Device Number: 0
Device Name: Tesla V100-SXM2-16GB
Device Revision Number: 7.0
Global Memory Size: 16911433728
Number of Multiprocessors: 80
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1530 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 877 MHz
Memory Bus Width: 4096 bits
L2 Cache Size: 6291456 bytes
Max Threads Per SMP: 2048
Async Engines: 4
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: Yes
Preemption Supported: Yes
Cooperative Launch: Yes
Multi-Device: Yes
Default Target: cc70
Since your driver is CUDA 10.2, the binary needs to be built with CUDA 10.2 as well, or the driver needs to be updated to CUDA 11.1. CUDA drivers are backward compatible but not forward compatible.
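(Side note on reading that output: the raw 10020 on the “CUDA Driver Version” line uses the standard CUDA integer encoding, major*1000 + minor*10, so it decodes to 10.2. A quick sketch; the helper name is just for illustration:)

```python
def decode_cuda_version(raw: int) -> str:
    """Decode the integer version reported by nvaccelinfo (same
    encoding as cudaDriverGetVersion): major*1000 + minor*10."""
    major = raw // 1000
    minor = (raw % 1000) // 10
    return f"{major}.{minor}"

print(decode_cuda_version(10020))  # -> 10.2
print(decode_cuda_version(11010))  # -> 11.1
```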
Did you build the program so that the OpenACC portions are using CUDA 10.2 as well, i.e., did you add the flag “-gpu=cuda10.2”?
Note that I’m assuming you downloaded the 21.1 package that includes both the current and previous versions of CUDA. If you installed the version with CUDA 11.1 only, then you should either re-install using the multi-CUDA package or update your CUDA driver to CUDA 11.1.
You can use the following command to download the 21.2 Multi-CUDA package:
Did you build the program so that the OpenACC portions are using CUDA 10.2 as well, i.e., did you add the flag “-gpu=cuda10.2”?
No, I used the traditional flags "-acc -Mcuda=cc70" for the V100 system. Now I get another error when using "-acc -gpu=cuda10.2":
Failing in Thread:1
call to cuModuleGetFunction returned error 500: Not found
In this case, I used the system-installed packages hpc-sdk/20.11 and cuda/10.2.
The “invalid device function” error is typically due to a mismatch in the CUDA version or target device, so given that your CUDA driver is 10.2, it makes sense to use the “-gpu=cuda10.2” flag.
The “500” error is typically a linking issue, for example in a mixed OpenACC and CUDA C program where there’s an RDC mismatch (CUDA C doesn’t use RDC by default, but OpenACC does). Though I’m wondering if there’s something off with the CUDA math libraries. If you run “ldd” on the executable, which CUDA math libraries are getting dynamically linked?
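For example (assuming the binary is named “./app”; the helper name is just for illustration):

```shell
# Print the CUDA math libraries an executable is dynamically linked
# against; "./app" is a placeholder name for the OpenACC binary.
check_cuda_libs() {
    ldd "$1" 2>/dev/null | grep -i -E 'cublas|curand|cufft|cudart' \
        || echo "no CUDA libraries dynamically linked"
}

check_cuda_libs ./app
```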
Those are just the CPU math libraries. I was referring to the CUDA math libraries like cuBLAS and cuRAND which you referenced earlier. Are you saying that you removed these?
In that case, I’m not sure what’s going on. Again this type of error typically occurs due to a mismatch in the CUDA version or device, but this is just a guess.
Are you able to provide a reproducing example that I can use to investigate?
Those are just the CPU math libraries. I was referring to the CUDA math libraries like cuBLAS and cuRAND which you referenced earlier. Are you saying that you removed these?
No, I just changed the flags from “-acc -Mcuda” to “-acc -gpu=cuda10.2”. With “-acc -gpu=cuda10.2”, the ldd output only shows CPU math libraries. (The removed CUDA Fortran kernels are standalone and don’t call cuBLAS or cuRAND.)
Are you able to provide a reproducing example that I can use to investigate?
Unfortunately, the error cannot be reproduced even with the mini-app, and it only occurs on the Power9+A100 system.
Would it be possible for you to take a look at a case with the full version of the code? If so, how can I provide it?
The CUDA Fortran and CUDA C interoperability flag “-Mcuda” was renamed to “-cuda”. “-gpu” replaces the old “-ta” flag, but was extended to control device code generation for all offload models (OpenACC, OpenMP, CUDA Fortran, and C++ and Fortran standard language parallelism), not just OpenACC. Note that “-gpu” does not enable “-cuda”.
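To illustrate, here’s a sketch of roughly equivalent build lines under the old and new flag spellings (the source file names are hypothetical placeholders):

```shell
# Hypothetical build lines; main.f90 / kernels.cuf are placeholder names.

# Old-style (PGI / pre-21.x) flags:
pgfortran -acc -ta=tesla:cc70 -Mcuda main.f90 kernels.cuf -o app

# New-style (NVIDIA HPC SDK 21.x) equivalents:
#   -ta    -> -gpu   (device code generation for all offload models)
#   -Mcuda -> -cuda  (CUDA Fortran / interoperability; NOT implied by -gpu)
nvfortran -acc -cuda -gpu=cc70,cuda10.2 main.f90 kernels.cuf -o app
```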
Maybe the problem is just the removal of “-cuda”? Can you try adding it back?
Would it be possible for you to take a look at a case with the full version of the code? If so, how can I provide it?
Sure, I’d be happy to if adding “-cuda” doesn’t fix the issue. You can either direct message me and upload the source package, or I can send you a direct mail (I can access your email address, so no need to post it) and we can arrange a way for me to access the code.
I went through the email chain I had with Jing, and the last note I had from him indicated that he was able to get things working, though I’m not sure how.
Where I’ve seen this error before is when object files were built with mismatched CUDA versions; however, he was sure that everything in the project, including libraries, was built with the same CUDA version.
Since he was running on a Power9 system, our IT folks suggested making sure the device’s persistence mode was enabled and the “udev” rule set up as described in the CUDA Installation Guide for Linux.
They also suggested clearing the CUDA cache in case something in there was causing issues:
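For reference, the commands usually suggested for those two steps look roughly like this (the cache path below is the default JIT-cache location, which the CUDA_CACHE_PATH environment variable can override; adjust for your system):

```shell
# Enable GPU persistence mode (needs root; skipped here if nvidia-smi
# is not installed, e.g. on a non-GPU machine).
if command -v nvidia-smi >/dev/null 2>&1; then
    sudo nvidia-smi -pm 1
fi

# Clear the CUDA JIT compute cache. The default location is
# ~/.nv/ComputeCache unless CUDA_CACHE_PATH points elsewhere.
rm -rf "${CUDA_CACHE_PATH:-$HOME/.nv/ComputeCache}"
```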