Runtime trouble moving legacy code from CUDA 6.5 to 8.0

Hi all,
For my sins, I am trying to get a massive chunk of old CUDA Fortran code, originally compiled with CUDA 6.5 and PGI 14, to work with CUDA 8.0 and PGI 16.7 so that it runs on more modern hardware (more recent versions of CUDA/PGI cause quite a few problems). It compiles fine and parts of it work perfectly, but some pieces produce very wrong output despite running without complaint. The code is basically just doing a lot of arithmetic crunching.
Compiling for compute capability 3.5 (cc35), using MKL and LLVM, as part of a larger code package. Running on a K40m.

Are there any immediate issues that come to mind that could be caused by the CUDA 6.5->8.0 transition? I know without actual code it’s hard for anyone to suggest anything specific, but I’d appreciate any general hints or ideas of what to try or what to think about.

Since you are inviting speculation: Possibly a latent bug in your code that got exposed through the change in toolchain and hardware. Maybe a race condition, an uninitialized variable, or access out of bounds.

Running the code through cuda-memcheck might give some clues, specifically the racecheck tool (e.g. cuda-memcheck --tool racecheck ./your_app).

https://docs.nvidia.com/cuda/cuda-memcheck/index.html

@njuffa Thanks for the speculations. I fear that it is some latent bug, but I am still holding out for some kind of simple memory access that has changed behaviour somehow.

@cbuchner1 Yes, a good reminder to go back to that, thanks. I had trouble with it earlier which I blamed on dealing with input scripts, but now I realise that actually I get a surprising error with cuda-memcheck [which maybe ought to be a different post, but oh well]:

========= Error: process didn’t terminate successfully
========= The application may have hit an error when dereferencing Unified Memory from the host. Please rerun the application under cuda-gdb or Nsight Eclipse Edition to catch host side errors.
========= Internal error (20)

I have seen suggestions elsewhere that this kind of error corresponds to segfaults or memory problems on the host. However, I find it strange to have no errors when running normally, but to have one crop up when I use cuda-memcheck. Does cuda-memcheck demand greater rigour from host memory management in a way that could cause a crash like this? It is probably worth checking the allocation and deallocation of memory on the host regardless.

The first thing to check is the application return code: what return code is your application providing?

For example, if you have a main routine, the return code is whatever is being returned from main:

int main(void){

  /* ... application work ... */

  return 0;  // return code
}

You can also query the return code from bash, via the special variable $?.

If your app returns anything other than zero, you should make it return 0 to use these tools. And if your app is returning a specific non-zero value for a reason, you should probably investigate that reason.
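For instance, from bash the exit status of the most recent command is available in $?. A minimal sketch (using true and sh -c 'exit 3' as stand-ins for the application, since the actual binary name isn't given in the thread):

```shell
# Stand-in for a successful run of the application (replace with ./your_app)
true
echo "exit status: $?"        # prints: exit status: 0

# Stand-in for a run that fails with a specific non-zero code
sh -c 'exit 3'
echo "exit status: $?"        # prints: exit status: 3
```

Note that $? is overwritten by every command, so capture it immediately (e.g. status=$?) if you need it later.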

Thanks for pointing out how to query the return code, it’s handy and I hadn’t seen it before.

If I run the program without cuda-memcheck, we get a 0 return as expected (everything seems to run fine, the numbers are just wrong).

When I run the program with cuda-memcheck, it just returns 1, consistent with the non-committal error message. Although it seems to me that this might be the return code from cuda-memcheck itself?

What puzzles me further is that some debugging reveals that the code runs up until the first device memory allocation in a given subroutine (not the first device allocation overall), then freezes for a bit before cuda-memcheck throws its error. Without cuda-memcheck we go straight through this allocation with no problems. So it appears that cuda-memcheck is somehow affecting this allocation. Is that a plausible interpretation?

Are you doing rigorous, proper CUDA error checking?
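For reference, a minimal sketch of what such checking can look like in CUDA Fortran. The status-checking pattern (check API return values, query cudaGetLastError after a launch, then check cudaDeviceSynchronize) is the standard one; all the names here (mykernel, d_a, h_a, n, grid, block) are hypothetical placeholders, not taken from the thread:

```fortran
! Sketch only: standard status-checking pattern for the CUDA Fortran runtime API.
! All identifiers (mykernel, d_a, h_a, n, grid, block) are hypothetical.
istat = cudaMemcpy(d_a, h_a, n)             ! runtime API calls return an integer status
if (istat /= cudaSuccess) then
   write(*,*) 'cudaMemcpy failed: ', cudaGetErrorString(istat)
   stop 1
end if

call mykernel<<<grid, block>>>(d_a, n)      ! kernel launches return no status...
istat = cudaGetLastError()                  ! ...so query launch errors explicitly
if (istat /= cudaSuccess) then
   write(*,*) 'kernel launch failed: ', cudaGetErrorString(istat)
   stop 1
end if

istat = cudaDeviceSynchronize()             ! catches errors during kernel execution
if (istat /= cudaSuccess) then
   write(*,*) 'kernel execution failed: ', cudaGetErrorString(istat)
   stop 1
end if
```

Checking after cudaDeviceSynchronize matters because kernel execution is asynchronous: an error raised mid-kernel will not be visible at the launch site.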

Normally cuda-memcheck has its primary impact in terms of device code execution, but it evidently also hooks calls into the CUDA libraries as well. Beyond that I couldn’t tell you precisely what it is doing.

Just to close this off: in the end I did identify a race condition that, somehow, presumably never caused trouble on the older hardware/software. It did require compiling with CUDA 10 and running cuda-memcheck there (although the code has more problems under CUDA 10, it runs far enough for racecheck to reach the relevant errors). I do not know why cuda-memcheck broke with CUDA 8 but not CUDA 10, but there we go. Thanks for the help!

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.