We tried to run an OpenACC code on an IBM Power9+A100 system but got this error:
Line 144: cudaLaunchKernel returned status 98: invalid device function
We used both hpc-sdk 21.1 and 20.11.
The code runs fine on other Intel-based CPU + P100/V100 systems, and it also worked on an IBM Power9+A100 system a while ago with the PGI 19.x compiler.
The error cannot be reproduced with the mini-app when using hpc-sdk 21.1 (and its built-in CUDA 10.2). However, a different error message appears when the hpc-sdk 21.1 and cuda 11.1 modules are loaded:
line 144: cudaLaunchKernel returned status 3: initialization error
This usually occurs when there’s a mismatch between the CUDA version or target device the binary was compiled for and the CUDA driver or device on which you’re attempting to run.
What CUDA driver is installed on the system where you see the error? Since you say it works when you build with CUDA 10.2, in the failing case are you compiling with CUDA 11 while the CUDA driver is only 10.2?
If you don’t know the driver version, please run the “nvaccelinfo” utility on the failing system. This will show which driver version is installed.
Sorry, that wasn’t clear. For the full production code, the error always occurs with hpc-sdk/20.11 and 21.1, regardless of whether we use the built-in CUDA or the cuda/11.1 module.
please run the “nvaccelinfo” utility on the failing system
Here is the output of nvaccelinfo:
$ nvaccelinfo
CUDA Driver Version: 10020
NVRM version: NVIDIA UNIX ppc64le Kernel Module 440.64.00 Wed Feb 26 16:01:28 UTC 2020
Device Number: 0
Device Name: Tesla V100-SXM2-16GB
Device Revision Number: 7.0
Global Memory Size: 16911433728
Number of Multiprocessors: 80
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1530 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 877 MHz
Memory Bus Width: 4096 bits
L2 Cache Size: 6291456 bytes
Max Threads Per SMP: 2048
Async Engines: 4
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: Yes
Preemption Supported: Yes
Cooperative Launch: Yes
Multi-Device: Yes
Default Target: cc70
Since your driver is CUDA 10.2, the binary needs to be built with CUDA 10.2 as well, or the driver needs to be updated to CUDA 11.1. CUDA drivers are backward compatible but not forward compatible.
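(Side note on reading that output: the raw 10020 on the “CUDA Driver Version” line uses the standard CUDA integer encoding, major*1000 + minor*10, so it decodes to 10.2. A quick sketch; the helper name is just for illustration:)

```python
def decode_cuda_version(raw: int) -> str:
    """Decode the integer version reported by nvaccelinfo (same
    encoding as cudaDriverGetVersion): major*1000 + minor*10."""
    major = raw // 1000
    minor = (raw % 1000) // 10
    return f"{major}.{minor}"

print(decode_cuda_version(10020))  # -> 10.2
print(decode_cuda_version(11010))  # -> 11.1
```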
Did you build the program so that the OpenACC portions are using CUDA 10.2 as well, i.e., did you add the flag “-gpu=cuda10.2”?
Note that I’m assuming you downloaded the 21.1 package that includes both the current and previous versions of CUDA. If you installed the version with CUDA 11.1 only, then you should either re-install using the multi-CUDA package or update your CUDA driver to CUDA 11.1.
You can use the following command to download the 21.2 Multi-CUDA package:
Did you build the program so that the OpenACC portions are using CUDA 10.2 as well, i.e., did you add the flag “-gpu=cuda10.2”?
No, I used the traditional flags "-acc -Mcuda=cc70" for the V100 system. Now I get another error when using "-acc -gpu=cuda10.2":
Failing in Thread:1
call to cuModuleGetFunction returned error 500: Not found
In this case, I used the system-installed packages hpc-sdk/20.11 and cuda/10.2.
The “invalid device function” error is typically due to a mismatch in the CUDA version or target device, so given that your CUDA driver is 10.2, it makes sense to use the “-gpu=cuda10.2” flag.
The “500” error is typically a linking issue, for example in a mixed OpenACC and CUDA C program where there’s an RDC mismatch (CUDA C doesn’t use RDC by default, but OpenACC does). Though I’m wondering if there’s something off with the CUDA math libraries. If you run “ldd” on the executable, which CUDA math libraries are getting dynamically linked?
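For example (assuming the binary is named “./app”; the helper name is just for illustration):

```shell
# Print the CUDA math libraries an executable is dynamically linked
# against; "./app" is a placeholder name for the OpenACC binary.
check_cuda_libs() {
    ldd "$1" 2>/dev/null | grep -i -E 'cublas|curand|cufft|cudart' \
        || echo "no CUDA libraries dynamically linked"
}

check_cuda_libs ./app
```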
Those are just the CPU math libraries. I was referring to the CUDA math libraries like cuBLAS and cuRAND which you referenced earlier. Are you saying that you removed these?
In that case, I’m not sure what’s going on. Again this type of error typically occurs due to a mismatch in the CUDA version or device, but this is just a guess.
Are you able to provide a reproducing example that I can use to investigate?
Those are just the CPU math libraries. I was referring to the CUDA math libraries like cuBLAS and cuRAND which you referenced earlier. Are you saying that you removed these?
No, I just changed the flags from “-acc -Mcuda” to “-acc -gpu=cuda10.2”. With “-acc -gpu=cuda10.2”, the ldd output only shows CPU math libraries. (The removed CUDA Fortran kernels are standalone and don’t call cuBLAS or cuRAND.)
Are you able to provide a reproducing example that I can use to investigate?
Unfortunately, the error cannot be reproduced even with the mini-app, and it only occurs on the Power9+A100 system.
Would it be possible for you to take a look at a case with the full version of the code? If so, how can I provide it?
The CUDA Fortran and CUDA C interoperability flag “-Mcuda” was renamed to “-cuda”. “-gpu” replaces the old “-ta” flag, but was extended to control device code generation for all offload models (OpenACC, OpenMP, CUDA Fortran, and C++ and Fortran standard language parallelism), not just OpenACC. Note that “-gpu” does not enable “-cuda”.
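To illustrate, here’s a sketch of roughly equivalent build lines under the old and new flag spellings (the source file names are hypothetical placeholders):

```shell
# Hypothetical build lines; main.f90 / kernels.cuf are placeholder names.

# Old-style (PGI / pre-21.x) flags:
pgfortran -acc -ta=tesla:cc70 -Mcuda main.f90 kernels.cuf -o app

# New-style (NVIDIA HPC SDK 21.x) equivalents:
#   -ta    -> -gpu   (device code generation for all offload models)
#   -Mcuda -> -cuda  (CUDA Fortran / interoperability; NOT implied by -gpu)
nvfortran -acc -cuda -gpu=cc70,cuda10.2 main.f90 kernels.cuf -o app
```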
Maybe the problem is just the removal of “-cuda”? Can you try adding it back?
Would it be possible for you to take a look at a case with the full version of the code? If so, how can I provide it?
Sure, I’d be happy to if adding “-cuda” doesn’t fix the issue. You can either direct message me and upload the source package, or I can send you a direct mail (I can access your email address, so no need to post it) and we can arrange a way for me to access the code.
I went through the email chain I had with Jing, and the last note I had from him indicated that he was able to get things working, though I’m not sure how.
Where I’ve seen this error before is when object files were built with mismatched CUDA versions; however, he was sure that everything in the project, including libraries, was built with the same CUDA version.
Since he was running on a Power9 system, our IT folks suggested making sure the device’s persistence mode was enabled and the “udev” rule set up as described in the CUDA Installation Guide for Linux.
They also suggested clearing the CUDA cache in case something in there was causing issues:
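For reference, the commands usually suggested for those two steps look roughly like this (the cache path below is the default JIT-cache location, which the CUDA_CACHE_PATH environment variable can override; adjust for your system):

```shell
# Enable GPU persistence mode (needs root; skipped here if nvidia-smi
# is not installed, e.g. on a non-GPU machine).
if command -v nvidia-smi >/dev/null 2>&1; then
    sudo nvidia-smi -pm 1
fi

# Clear the CUDA JIT compute cache. The default location is
# ~/.nv/ComputeCache unless CUDA_CACHE_PATH points elsewhere.
rm -rf "${CUDA_CACHE_PATH:-$HOME/.nv/ComputeCache}"
```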