Launch failed error

My code with OpenACC directives crashes with the following error:

call to cuMemcpyDtoHAsync returned error 700: Launch failed

or sometimes the foll error:

call to cuEventSynchronize returned error 700: Launch failed

or sometimes the foll error (happened once, number is 999 instead of 700):

call to cuMemcpyDtoHAsync returned error 999: Unknown

I am compiling with pgf90 ( v 13.3). The GPU device I am running on is Tesla K20Xm.

It is hard to create a small code snippet that reproduces this problem, but I can upload the entire code ( MFIX, consists of many files) using any means sugested.

Thanks very much in anticipation of the help. I apologize if I have not provided enough information, but I did it to keep this first post minimal, and also because I am unsure what additional information will be helpful. Please let me know what other information is needed, and I will provide it ASAP.

Best regards

Hi Anirban,

How I would go about this would be to set the environment variable “PGI_ACC_DEBUG=1” and review the output to see where the error is occuring. Most likely it’s in the kernel just before the copy back from the device.

Next begin commenting out the various OpenACC directives. You can comment out the offending the kernel to see if it clears the error. If it does, then start commenting out lines in the loop until you can narrow down the line that’s causing the problem.

If you need me to take a look, let me know. I don’t mind looking at larger codes provided that there aren’t too many library dependicies and the build process is straight forward.

  • Mat

Hi Mat,
Thanks very much for your prompt reply. I got a lot of output after setting the debugging flag you suggested, which is progress :-). I have pasted below a small excerpt just before and including the crash point. Perhaps you can spot something revealing. My eye is untrained in these things as of yet, but it seems to me that the initial few lines of the excerpt show that some variables (pft_tmp, neighbours) were successfully copied to device from host. Then the arguments list for the offending kernel calc_force_des_404_gpu is set up. Then a launch is attempted which fails. If you need to look at more of the debugging output, let me know.

I know exactly which acc loop directive is causing the failure. In fact, I have reduced the number of loop directives now to just two. With just the first one, the code runs fine. But when I include the second loop directive it fails. This second loop indeed starts at line 404 as the debugging output indicates. It is a somewhat long and complicated for loop.

Next I can try your other suggestion of commenting out lines inside the for loop and narrow down which particular line, if any, of the for loop is causing trouble. It will of course be great if you can look at the code too. I think there aren’t any weird library dependencies in the code. There is a script that comes with the code that compiles it (not the standard GNU make or cmake). But still compiling should be painless, by running just that one script.

Let me know what is the best way to share the code with you. For example, I can look at whether you can be given access to the Pittsburgh Supercomputing Center machine I am using. That way, nothing needs to be transferred, and any needed dependencies are already in place. Once you have access to the code, I can tell you how to compile it.

 pgi_uacc_dataon( devid=1, threadid=1 )
pgi_uacc_dataon( devid=1, threadid=1 ) dindex=1
NO map for host:0x533e7f00
pgi_uacc_alloc(size=24,devid=1,threadid=1) returns 0xb01f03c00
map    dev:0xb01f03c00 host:0x533e7f00 size:24 offset:0  data[dev:0xb01f03c00 host:0x533e7f00 size:24] (line:404 name:pft_tmp)
alloc done with devptr at 0xb01f03c00
MemHostRegister( 0x533e7f00, 24, 0 )
     upload 0x533e7f00->0xb01f03c00 for 24 bytes stream (nil)
pgi_uacc_dataon( devid=1, threadid=1 )
pgi_uacc_dataon( devid=1, threadid=1 ) dindex=1
NO map for host:0x7f9ca8ee3860
pgi_uacc_alloc(size=1996488,devid=1,threadid=1) returns 0xb02500000
map    dev:0xb02500000 host:0x7f9ca8ee3860 size:1996488 offset:0  data[dev:0xb02500000 host:0x7f9ca8ee3860 size:1996488] (line:404 name:neighbours) dims=18486x27
alloc done with devptr at 0xb02500000
MemHostRegister( 0x7f9ca8ee3860, 1996488, 0 )
pgi_uacc_datadone( async=-1, devid=1 )
pgi_uacc_cuda_wait(sync on stream=(nil))
pgi_uacc_cuda_wait done
pgi_uacc_cuda_uploads(hostsrc=0x7fff50fdac60,size=24,offset=0,lineno=404) returns 0xb01f00000
pgi_uacc_launch funcnum=0 argptr=0x7fff50fdacf0 sizeargs=0x7fff50fdace0 async=-1 devid=1
reduction array of 370444 bytes at 0xb02700000
Arguments to function 0 calc_force_des_404_gpu: 
                 18471          0   32505856         11   40894464         11          3          3
                     3          3          3          3          3          3          3          3
                     3          3          3          3          3          3          3          3
                     3          3          3          3          3          3          3          3
                     3          3      18471          1          1          1          1          1
                     1     499122    3589632         11   38797312         11   32521216         11
               4194304         11   33629696         11   32506880         11   32513536         11
              32514048         11   32514560         11   32513024         11   32515072         11
              32515584         11   35651584         11   32511488         11   33481888         11
              32510464         11   32509952         11   32509440         11   32508928         11
              32508416         11   32506368         11   32507904         11   32518144         11
              32519680         11   32507392         11   32520704         11    3145728         11
              34603008         11   32520192         11
            0x00004827 0x00000000 0x01f00000 0x0000000b 0x02700000 0x0000000b 0x00000003 0x00000003
            0x00000003 0x00000003 0x00000003 0x00000003 0x00000003 0x00000003 0x00000003 0x00000003
            0x00000003 0x00000003 0x00000003 0x00000003 0x00000003 0x00000003 0x00000003 0x00000003
            0x00000003 0x00000003 0x00000003 0x00000003 0x00000003 0x00000003 0x00000003 0x00000003
            0x00000003 0x00000003 0x00004827 0x00000001 0x00000001 0x00000001 0x00000001 0x00000001
            0x00000001 0x00079db2 0x0036c600 0x0000000b 0x02500000 0x0000000b 0x01f03c00 0x0000000b
            0x00400000 0x0000000b 0x02012600 0x0000000b 0x01f00400 0x0000000b 0x01f01e00 0x0000000b
            0x01f02000 0x0000000b 0x01f02200 0x0000000b 0x01f01c00 0x0000000b 0x01f02400 0x0000000b
            0x01f02600 0x0000000b 0x02200000 0x0000000b 0x01f01600 0x0000000b 0x01fee4a0 0x0000000b
            0x01f01200 0x0000000b 0x01f01000 0x0000000b 0x01f00e00 0x0000000b 0x01f00c00 0x0000000b
            0x01f00a00 0x0000000b 0x01f00200 0x0000000b 0x01f00800 0x0000000b 0x01f03000 0x0000000b
            0x01f03600 0x0000000b 0x01f00600 0x0000000b 0x01f03a00 0x0000000b 0x00300000 0x0000000b
            0x02100000 0x0000000b 0x01f03800 0x0000000b
cuda_launch argument bytes=812, max=240 move 572 bytes at offset 240 to devaddr 0xb00200000
call to cuMemcpyDtoHAsync returned error 700: Launch failed


Hi Anirban,

I agree with Mat. Setting PGI_ACC_DEBUG is the first step for debugging application.

It seams you have problems in your kernel.
Try to launch your application under cuda-memcheck.

P.S. Mat, it would be great to have some document with description of approaches to debug ACC’ed code. I believe you have already have some algorithm in your mind ;)


Hi Anirban,

Why don’t you see if you can narrow down the line that’s causing the error. Look for out-of-bounds or off by one error in arrays since these errors may be forgiven on the host but cause the kernel to crash. Granted, I could be bad code generated by the compiler, in which case I’ll want to create a reproducing example that I can sent to engineering.

I do have a copy of MFIX that I downloaded last December. If you want to send me just your changed files, I can try recreate the error from those. (Please send to PGI Customer Service and ask them to forward them to me). I may need your data set as well (or at least instructions on how to recreate the error). Larger files can be ftp’d to us See:

Mat, it would be great to have some document with description of approaches to debug ACC’ed code. I believe you have already have some algorithm in your mind ;)

Yes, I probably should put together a “best practices” guide. Time is an issue though. We are working toward getting debugging support which will help a lot. We’ll swtich to using the LLVM back-end and skip the intermediate CUDA generation. This is the main issue with debugging since currently the user’s would be debugging the intermediary code instead of their own. With LLVM we’ll be able to start adding DWARF information.

  • Mat

Hi Mat and Alexey,
Thanks very much for all the inputs. I just flew back to Pittsburgh from Boston. I will follow up on each of your suggestions and post the results soon.

Best regards

I tried the simplest suggestion first, which is to run the executable with cuda-memcheck, which returned 321 errors of the foll type at the crash point:

========= Invalid __global__ write of size 8
=========     at 0x00005a28 in calc_force_des_404_gpu
=========     by thread (191,0,0) in block (7,0,0)
=========     Address 0xb0273620c is misaligned
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib64/ (cuLaunchKernel + 0x3dc) [0xc9d5c]
=========     Host Frame:mfix.pgi13p3.exe [0x3a33e4]

Let me know if this provide any clue. I didn’t have an AHA! moment seeing this. Could this possibly be caused by insufficient memory?

I am following up on the other suggestions.

Best regards

Hi Anirban,

This error means that the program is writting to protected memory. You’ll need to track down which line in this loop at line 404 of “calc_force_des” is causing the bad write.

  • Mat

I would just like to comment that by looking at the loop code line by line, I found that the issue was caused by a reduction variable that was not declared as such in the loop directive. After I declared

!$acc loop reduction(+: myvar)

the issue was resolved.

Thanks much for guiding me step by step through the debugging process.