In our app, we are currently chasing a very elusive illegal memory access error. The problem is that it doesn't happen under cuda-gdb or compute-sanitizer.
Our best hypothesis so far is that it's a race condition, and that the speed improvements we recently made expose the problem. Our app uses FFmpeg to decode an input video and then CUDA to process the frames. The problem happens more frequently when TRT engines are used, but it has also been observed without them.
We recently upgraded to CUDA 12.6 hoping it would resolve the problem. Most of the memory allocation is done before the computation starts. The only allocations we can't control come from FFmpeg, but calling av_frame_free returns the data to FFmpeg's memory pool, and after a few frames you can see the same buffers being reused.
The illegal memory access is reported by one of the CUDA function calls that return asynchronous errors.
Is there any trick we could use to find the operation causing this problem?
If you haven’t already done so, implement proper CUDA error checking.
You could also try setting the CUDA_LAUNCH_BLOCKING=1 environment variable to make all kernel launches host-synchronous. Combined with proper CUDA error checking after kernel launches, this will at least narrow the search down to an offending kernel, unless the race condition is associated with concurrent kernels.
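For example, a minimal error-checking macro along these lines (the CHECK name and the abort-on-error policy are just one possible choice, not an existing API):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call; abort with file/line information as soon as an
// error (including a sticky asynchronous error from an earlier kernel) shows up.
#define CHECK(call)                                                         \
    do {                                                                    \
        cudaError_t err_ = (call);                                          \
        if (err_ != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",                   \
                    cudaGetErrorString(err_), __FILE__, __LINE__);          \
            exit(EXIT_FAILURE);                                             \
        }                                                                   \
    } while (0)

// Typical use around a kernel launch:
//   myKernel<<<grid, block, 0, stream>>>(/* args */);
//   CHECK(cudaGetLastError());            // catches launch/configuration errors
//   CHECK(cudaStreamSynchronize(stream)); // optional: surfaces async errors here
```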
We added a CHECK macro around every CUDA call and added a CHECK(cudaGetLastError()); after each kernel launch.
As for CUDA_LAUNCH_BLOCKING, the error doesn't happen in that situation. I'm guessing that host synchronization after each kernel call slows down the execution too much for the bug to show up.
Have you tried cuda-gdb while compiling with -lineinfo instead of -G? See if it traps on the error. Likewise, try with and without the memory-checking feature enabled, and see if the trap/backtrace gives any clues.
cudaLaunchHostFunc didn't work as expected. The problem is that it stops being called when there's an error in the context, and it's "local" to the stream it was attached to.
So now we're back to the beginning… looking for ideas on how to find where that illegal memory access is happening. Ideally something that doesn't slow down the execution, because the bug only shows up at full speed.
Also, is there a list somewhere of what qualifies as an "illegal memory access"? Can it be caused by a race condition?
If you are completely certain of this characterization, it is a strong clue that the bad address is likely a consequence of a race condition, i.e. a classic example of a Heisenbug.
When you were running compute-sanitizer, did you turn on the race checker (the racecheck tool)? If that did not yield any complaints, have you peer-reviewed your code to make sure there are no race conditions?
Have you tried adding assertions to the code to ensure that key pieces of data are within their expected bounds? A failed assertion should terminate the program immediately.
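For example, a device-side assert in a suspicious kernel might look like this (the kernel and its parameters are made up for illustration):

```cpp
#include <cassert>

// Hypothetical kernel: every index read from 'offsets' is expected to stay
// inside the frame buffer. The device-side assert stops execution at the
// first thread that violates that expectation, instead of corrupting memory
// and failing much later.
__global__ void processFrame(const int *offsets, float *frame, int frameSize, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    int idx = offsets[tid];               // value suspected of going out of range
    assert(idx >= 0 && idx < frameSize);  // fail fast at the offending thread
    frame[idx] *= 2.0f;
}
```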
Due to my time spent in embedded programming, where access to the hardware was often only via a serial link and a console, I am a fan of debugging by logging data. The idea is that you log e.g. key pointers, dimensions, and indexes, and once the program terminates abnormally due to an illegal memory access, you look backwards through the log for things that look out of place, either by themselves or by comparison with a log from a non-faulting run. Don't dump too much information at once. You may have to try different sets of data before an interesting lead appears.
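As a rough sketch of that kind of lightweight host-side logging (the helper name and the logged fields are hypothetical):

```cpp
#include <cstddef>
#include <cstdio>

// Cheap per-frame trace: log what each kernel is about to touch, flush so the
// last lines survive an abort, and diff a faulting log against a clean run.
static void logFrame(FILE *log, int frameIdx, const void *devPtr,
                     int width, int height, size_t pitch)
{
    fprintf(log, "frame=%d ptr=%p w=%d h=%d pitch=%zu\n",
            frameIdx, devPtr, width, height, pitch);
    fflush(log);
}
```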
Including a black box as part of your processing pipeline reduces the likelihood of identifying a root cause. You may need to build FFmpeg from source so you can include it in your logging by adding debug output.
If the problem has not existed since time immemorial, I would also try a binary search on changelists to find the software version that first exhibited the problem. If you follow the useful philosophy that changelists should be small, this might provide some very specific pointers as to where the issue originates. Of course, it is also possible that the "bad" change only served to expose an existing software defect.
Root-causing the kind of problem described here is difficult. You would want to set aside sufficient time to track it down; one work week might be a good initial estimate. Pursue each part of the search strategy systematically and persistently. Good luck.
An illegal memory access would be an access that doesn't fall within a valid location/range in local, global, or shared memory. I don't think there is a list anywhere.
A race condition by itself does not imply an illegal access, but a by-product of a race condition certainly could lead to one. For example, a race condition results in a thread picking up a value that is not valid (e.g. reading a location before it has been written by some other thread; the value there is undefined).
The invalid value is then used to calculate an index into an array. The calculated index is beyond the end of the array (or before the beginning), and attempting to access that location then results in an illegal access.
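As a contrived sketch of that chain of events (the kernel and names are invented for illustration):

```cpp
// Thread 0 of each block is supposed to write 'sharedIdx' before the other
// threads read it, but the missing __syncthreads() makes this a race: a reader
// may see an undefined value, compute an out-of-bounds index from it, and
// trigger the illegal memory access.
__global__ void racyGather(const float *src, float *dst)
{
    __shared__ int sharedIdx;

    if (threadIdx.x == 0)
        sharedIdx = blockIdx.x;                      // writer

    // __syncthreads();                              // <-- missing synchronization

    int idx = sharedIdx * blockDim.x + threadIdx.x;  // may be computed from garbage
    dst[idx] = src[idx];                             // garbage index -> illegal access
}
```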
That is exactly what you would need: context and stream locality, and only being called when there was no error.
You put the host callbacks at 'milestone points' of your algorithm, or before and after suspicious kernels, and within the called host function you just write the current stream position into a variable.
The error itself you catch as you do now: a function called after the asynchronous kernel launch will return the asynchronous error. At that point you read out the variable and know exactly up to which position there was no error.
If your host callbacks are dense enough, you can deduce which kernel exhibited the error.
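A minimal sketch of that milestone technique, assuming the stream position is just an integer stamped from the callback (the names are placeholders):

```cpp
#include <atomic>
#include <cstdint>
#include <cuda_runtime.h>

// Last milestone reached in the stream; written only by the host callbacks.
static std::atomic<int> lastMilestone{-1};

// Host function enqueued into the stream; userData carries the milestone id.
static void CUDART_CB milestoneCallback(void *userData)
{
    lastMilestone.store(static_cast<int>(reinterpret_cast<intptr_t>(userData)));
}

// Hypothetical usage around suspicious kernels:
//   cudaLaunchHostFunc(stream, milestoneCallback, (void *)(intptr_t)42);
//   suspiciousKernel<<<grid, block, 0, stream>>>(/* args */);
//   cudaLaunchHostFunc(stream, milestoneCallback, (void *)(intptr_t)43);
//
// When a later call reports cudaErrorIllegalAddress, lastMilestone tells you
// the last point in the stream that was reached without error.
```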
A lot of CUDA functions return errors from async calls. The documentation is a bit unclear: is the error limited to the calling CPU thread? cudaGetLastError says this:
Returns the last error that has been produced by any of the runtime calls in the same instance of the CUDA Runtime library in the host thread and resets it to cudaSuccess.
If we see the illegal memory access in more than one thread, does that mean it could be part of the data exchange between these two threads?
So we located the problem, and it was with cuvidDecodePicture inside FFmpeg.
The problem is that cuvidDecodePicture doesn't use the cudaStream_t assigned to the decoder to synchronize with other operations, so once in a while there were memory access problems. The solution was to force a context synchronization around it.
But it would be nice if cuvidDecodePicture could use a CUDA stream for its execution.
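For anyone hitting the same issue, here is roughly what the workaround looks like (simplified; the wrapper name is made up and error checking on cuCtxSynchronize is omitted):

```cpp
#include <cuda.h>
#include <nvcuvid.h>

// Serialize the NVDEC decode call with the rest of the context, since
// cuvidDecodePicture does not take a stream to order itself against our
// processing kernels.
static CUresult decodePictureSynchronized(CUvideodecoder decoder,
                                          CUVIDPICPARAMS *picParams)
{
    cuCtxSynchronize();                     // context sync before the decode call
    CUresult res = cuvidDecodePicture(decoder, picParams);
    cuCtxSynchronize();                     // ...and again right after it
    return res;
}
```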