Error when program reaches GPU code

Hello everyone,

Having overcome the issue I had earlier (see my other post, OpenACC "declare link" with routine called in target region), I have backported the solution to the larger Fortran program.

The code compiles fine with version 19.10 Community Edition. Execution of the program starts, but when it reaches the code that has to be executed on the GPU, I now get the following error:

  • When compiling with CUDA 9.2 (pgfortran isola15.f90 common_vars.f90 parameters.f90 -O4 -acc -ta=tesla,cc35 -Minfo=accel -Mcuda=cuda9.2 -o isola15c):
    line 325: cudaLaunchKernel returned status 11: invalid argument
  • When compiling with CUDA 10.0 (pgfortran isola15.f90 common_vars.f90 parameters.f90 -O4 -acc -ta=tesla,cc35 -Minfo=accel -Mcuda=cuda10.0 -o isola15c):
    line 325: cudaLaunchKernel returned status 11: invalid argument
  • When compiling with CUDA 10.1 (pgfortran isola15.f90 common_vars.f90 parameters.f90 -O4 -acc -ta=tesla,cc35 -Minfo=accel -Mcuda=cuda10.1 -o isola15c):
    line 325: cudaLaunchKernel returned status 1: invalid argument

I am not certain how to debug this further, as the kernel and the arguments passed to it are generated by the compiler.

It is also strange that the test program from my other post now works without an issue, while applying the same solution to the larger program does not.

Please help!

Hi Ioannis,

I tried replicating the error using the code you previously posted, but it compiled and ran correctly for me (though my K80 system uses CUDA 10.2). Have you made any additional changes?

My best guess is that you’ve set “num_gangs”, “num_workers”, and/or “vector_length” to values that are too large or otherwise incompatible with your device. I’ve seen a similar “cudaLaunchKernel returned status 1: invalid argument” error before when this occurred.
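
For example, something along these lines (a contrived sketch, not your code) would fail at launch on a cc35 device: with our OpenACC implementation, workers typically map to threadIdx%y and vector lanes to threadIdx%x, so the thread block here would be 64 x 256 = 16384 threads, well above the 1024-thread block limit:

    program launch_too_big
       implicit none
       integer, parameter :: n = 100000
       integer :: i
       real :: a(n), b(n)
       b = 1.0
    ! 64 workers * 256 vector lanes = 16384 threads per block, which
    ! exceeds the 1024-thread block limit on cc35, so the launch fails
    ! with "cudaLaunchKernel returned status ...: invalid argument".
    !$acc parallel loop num_workers(64) vector_length(256)
       do i = 1, n
          a(i) = 2.0 * b(i)
       end do
       print *, a(n)
    end program launch_too_big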

-Mat

Hello Mat,

As I mentioned, the test code works; it was the original (large) Fortran code that had the problem after backporting the solution. I figured out just a couple of hours ago what the problem was: one of the subroutines called within the target region was declaring a very large local array (200000 reals) per thread, which is not supported on the GPU. Thankfully, for the problem under consideration the array can be much smaller (~10000 reals are enough). The error message was quite misleading, though.
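
To illustrate the pattern (with invented names; the real subroutine is of course much bigger):

    ! Each GPU thread gets its own copy of the local array "work", and
    ! 200000 reals (800,000 bytes) per thread exceeds the per-thread
    ! local memory the device can provide (512 KB on this device class),
    ! so the kernel launch fails.
    subroutine inner_solver(x)
    !$acc routine seq
       implicit none
       real, intent(inout) :: x
       real :: work(200000)   ! too large; ~10000 reals suffice for my case
       work(1) = x
       x = work(1) + 1.0
    end subroutine inner_solver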

After correcting the above, the last error I was getting in the parallel region was a misaligned memory address access. That was easier to find: in one of the subroutines a CHARACTER*3 array is declared and passed on to other subroutines. The compiler didn’t like the fact that there were 3 characters (bytes) per element in the array. I changed all of these declarations to CHARACTER*4 and that solved it. Maybe the compiler should apply some automatic padding in such cases?
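
The change was essentially the following (array name invented for illustration):

    ! Before: 3-byte elements, so elements sit at offsets 0, 3, 6, ...
    ! and are only byte-aligned; accessing them on the GPU raised the
    ! "misaligned address" error.
    ! character*3 :: labels(100)
    ! After: padding each element to 4 bytes restores alignment.
    character*4 :: labels(100)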

I am finally getting the correct results from the GPU! But the program crashes a bit later on the CPU (although the serial version of the code, ignoring the OpenACC directives, works correctly). Hopefully I will figure this out too.

Ioannis

PS: In the meantime I have also switched to the NVIDIA HPC SDK 20.7 and nvfortran. My understanding is that this compiler is based on pgfortran, right?

In the meantime I have also switched to the NVIDIA HPC SDK 20.7 and nvfortran. My understanding is that this compiler is based on pgfortran, right?

Correct, nvfortran is just the rebranded and updated pgfortran.
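
Your existing command line should work as-is with the new driver, for example (assuming the CUDA version bundled with 20.7 on your system still targets cc35):

    nvfortran isola15.f90 common_vars.f90 parameters.f90 -O4 -acc -ta=tesla,cc35 -Minfo=accel -o isola15c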