PGI generates a kernel that cannot be launched

Hi guys,

I have some code that I annotated with ACC directives and compiled.
PGI reported that the kernel was generated, but I get an error at runtime. The debug output shows the following:

pgi_uacc_launch funcnum=0 argptr=0xbfe8c370 sizeargs=0xbfe8c368 async=-1 devid=1
Arguments to function 0 morr_two_moment_micro_1694_gpu:
                    27         26         27         58         71         27          1          2
                     2   89128960   97471648   98050560  114294784   98520224   99099136  113691136
              99568800  100147712  113246208  100617376  101196288  112642560  101665952  102244864
             112197632  102714528  103293440  111593984  103763104  104342016  111149056  104857600
             105302528  110545408  105860256  106439168  110100480  106908832  107487744  109496832
             108402336  108003328  109051904   97001984   93328384   93328896   93329408   93329920
              93330432   93330944   93331456   93331968   93332480   93332992   96423072   93327872
              93327360   93326848   93326336   93325824   93325312   93324800   93324288   93323776
              93323264   92274688   91671040   91226112   90622464   90177536   89145856   89162752
              95865344   95420416   94816768   94371840   92719616        118         58      -6809
                    82       2296         27          1 1127481344       3187         27       1566
                     1          1  674505948 1124897428 1077936128 1160942153 1232348160 1148796928
            1135776330 1195593728 1141014987 1105259141

            0x0000001b 0x0000001a 0x0000001b 0x0000003a 0x00000047 0x0000001b 0x00000001 0x00000002
            0x00000002 0x05500000 0x05cf4ca0 0x05d82200 0x06d00000 0x05df4ca0 0x05e82200 0x06c6ca00
            0x05ef4ca0 0x05f82200 0x06c00000 0x05ff4ca0 0x06082200 0x06b6ca00 0x060f4ca0 0x06182200
            0x06b00000 0x061f4ca0 0x06282200 0x06a6ca00 0x062f4ca0 0x06382200 0x06a00000 0x06400000
            0x0646ca00 0x0696ca00 0x064f4ca0 0x06582200 0x06900000 0x065f4ca0 0x06682200 0x0686ca00
            0x067616a0 0x06700000 0x06800000 0x05c82200 0x05901400 0x05901600 0x05901800 0x05901a00
            0x05901c00 0x05901e00 0x05902000 0x05902200 0x05902400 0x05902600 0x05bf4ca0 0x05901200
            0x05901000 0x05900e00 0x05900c00 0x05900a00 0x05900800 0x05900600 0x05900400 0x05900200
            0x05900000 0x05800000 0x0576ca00 0x05700000 0x0566ca00 0x05600000 0x05504200 0x05508400
            0x05b6ca00 0x05b00000 0x05a6ca00 0x05a00000 0x0586ca00 0x00000076 0x0000003a 0xffffe567
            0x00000052 0x000008f8 0x0000001b 0x00000001 0x43340000 0x00000c73 0x0000001b 0x0000061e
            0x00000001 0x00000001 0x283424dc 0x430c9294 0x40400000 0x45329249 0x49742400 0x44794000
            0x43b2924a 0x47435000 0x440281cb 0x41e0ea85
cuda_launch argument bytes=516, max=240 move 276 bytes at offset 240 to devaddr 0x5400000
call to cuEventSynchronize returned error 700: Launch failed

I remember there was a limit on the size of kernel arguments (256 bytes). Could that be the reason for the launch failure in this case? What else should I check to find the cause? No shared memory is used.


Some additional information: the program works when run under cuda-memcheck.


Hi Alexey,

I remember there was a limit for size of kernel arguments (256 byte). Could it be the reason of launch failure in this case?

No. For kernels with a larger number of arguments, we work around this limitation by packing the arguments into a struct, copying the struct to the device, and then passing only a pointer to the struct as the kernel argument.

What else should I check to find the reason of failure?

Error 700 is very generic and just means the kernel failed for some reason. My best guess, given that the program works when run under “cuda-memcheck”, is that you have some uninitialized memory that gets set to zero under “cuda-memcheck” and contains garbage without it. Though, this is just a guess. I’d need a reproducing example to determine the true cause.

What I’d do next is compile without OpenACC and run the program under Valgrind to see if it finds any UMRs (uninitialized memory reads).

  • Mat

Hi Mat,

I updated the NVIDIA driver. It helped, but didn’t solve my problem.
Now I get a segmentation fault far away from the routine that launches the kernel.
I can’t use the “-C” flag, but Valgrind reports some errors that were not present in the program without ACC.


OK, you can try compiling with “-g” and using the PGI debugger (pgdbg) as well. Note that if the Valgrind errors are coming from the PGI runtime libraries, they are safe to ignore.

  • Mat

Hey Mat, may I ask whether you’ve got any open bug reports on this feature? I may be able to file one for you: I have a program that fails when relying on this feature, but runs fine when I pack certain arguments manually. I haven’t constructed a minimal example yet, but I could do that if it would be helpful to you, hence my question.

Hi Mat!

I split the kernel into two parts. The first part works OK, while the second one results in “Launch failure”. Cuda-memcheck reports a lot of errors for this part.

My code structure for the second part is the following:

!$acc kernels
!$acc loop independent collapse(2)
do i=its,ite
  do j=jts,jte
!$acc loop seq
    do k=kts,kte
      ! do something
    end do
    k = kte
    arr1(k) = ...(k)
!$acc loop seq
    do k=kte-1,kts,-1
    end do
  end do
end do
!$acc end kernels

My workaround was to replace ‘k’ with ‘kte’ in the middle part of the kernel, that is, in the statements between the two inner loops.
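
For anyone following along, here is a minimal sketch of what that workaround might look like. The bounds, the 3-D shape of arr1, and the loop bodies are all hypothetical, since the post elides them; only the use of kte in place of k between the two inner loops reflects the change described above. It builds as ordinary serial Fortran when OpenACC is not enabled.

```fortran
program k_workaround
  implicit none
  ! Hypothetical bounds and array shape; the original post does not show them.
  integer, parameter :: its = 1, ite = 4, jts = 1, jte = 3, kts = 1, kte = 5
  real :: arr1(kts:kte, jts:jte, its:ite)
  integer :: i, j, k

  arr1 = 0.0

  !$acc kernels
  !$acc loop independent collapse(2)
  do i = its, ite
    do j = jts, jte
      !$acc loop seq
      do k = kts, kte
        arr1(k, j, i) = real(k)      ! placeholder for "do something"
      end do
      ! Workaround: index with kte directly instead of assigning k = kte
      ! and then using k between the two sequential loops.
      arr1(kte, j, i) = arr1(kte, j, i) + 1.0
      !$acc loop seq
      do k = kte - 1, kts, -1
        arr1(k, j, i) = arr1(k, j, i) + arr1(k + 1, j, i)   ! placeholder body
      end do
    end do
  end do
  !$acc end kernels

  print *, arr1(kts, jts, its)
end program k_workaround
```

The point of the change is that k is then written only as a loop index inside the kernel, never by a plain assignment in the body of the collapsed loops.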

Hey Arom, your case looks very similar to the problem I had. In my case, passing too many arguments into a CUDA Fortran kernel led to an integer argument being corrupted, which in turn led to invalid reads/writes when using arrays that were declared with that integer as a dimension. Reducing the number of arguments by packing them into an array manually solved my problem.
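
A rough sketch of the manual packing idea, in case it helps. This is plain Fortran with an ordinary subroutine standing in for the kernel, so it is not the actual CUDA Fortran code; the module, the subroutine fill, and the IDX_* slot names are made up for illustration. Several integer dimension arguments are replaced by a single integer array, and the array dimensions are taken from its elements.

```fortran
module packed_args
  implicit none
  ! Hypothetical slot names for the packed dimension array.
  integer, parameter :: IDX_NX = 1, IDX_NY = 2, IDX_NZ = 3, N_DIMS = 3
contains
  ! Stand-in for a kernel: receives one array instead of three scalars.
  subroutine fill(dims, a)
    integer, intent(in) :: dims(N_DIMS)
    real, intent(out) :: a(dims(IDX_NX), dims(IDX_NY), dims(IDX_NZ))
    a = 1.0
  end subroutine fill
end module packed_args

program pack_demo
  use packed_args
  implicit none
  integer :: dims(N_DIMS)
  real, allocatable :: a(:, :, :)

  ! Pack the former scalar arguments into one array.
  dims(IDX_NX) = 4
  dims(IDX_NY) = 3
  dims(IDX_NZ) = 2

  allocate(a(dims(IDX_NX), dims(IDX_NY), dims(IDX_NZ)))
  call fill(dims, a)
  print *, size(a), sum(a)
end program pack_demo
```

Named slot indices keep the call site readable once the scalars are gone, at the cost of one extra indirection when reading each dimension.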

Hi AROM, MuellerM,

Would it be possible for you both to send a reproducing example to PGI Customer Service, or to post one?